If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cells to mount your Drive and install the dependencies.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
#os.chdir('/content/drive/MyDrive/HES/NLP Final Project')
os.chdir('/content/drive/MyDrive/115_experimentation/NLP Final Project')

In [None]:
!pip install datasets[s3] transformers



In [None]:
!pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu111.html
!pip install torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu111.html
!pip install torch-geometric

Looking in links: https://data.pyg.org/whl/torch-1.10.0+cu111.html
Looking in links: https://data.pyg.org/whl/torch-1.10.0+cu111.html


If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!) then execute the following cell and input your username and password:

Then you need to install Git-LFS. Uncomment the following instructions:

Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [None]:
import pickle
import random
import re
from collections import deque

import dask.dataframe as dd
import networkx as nx
import numpy as np
import pandas as pd
import spacy
import torch
import transformers
from datasets import ClassLabel, Sequence
from torch import nn, optim, Tensor
from torch.nn import CrossEntropyLoss
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from transformers import AutoModel, AutoModelForQuestionAnswering, Trainer, TrainingArguments
from transformers.modeling_outputs import QuestionAnsweringModelOutput

print(transformers.__version__)

4.13.0


In [None]:
# https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string
# Remove accents (combining marks) from English words
import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

In [None]:
# Load the English ConceptNet data
cnet_df = pd.read_csv('data/cnet_en_encoded.csv')
cnet_word_set = set(cnet_df[['source', 'target']].values.flatten())
nlp = spacy.load("en_core_web_sm")



In [None]:
# Load pre-processed data
file_name = ['english_words', 'english_word_indices', 'english_embeddings', 'normalized_embeddings', 'word_index']

for name in file_name:
    with open(f'data/{name}.pickle', 'rb') as pickle_file:
        temp = pickle.load(pickle_file)
        globals()[name] = temp

with open('cnet_embedding_data/conceptnet_ih.graph', 'rb') as pickle_file:
    conceptnet = pickle.load(pickle_file)

word_embeddings_dict = {key:normalized_embeddings[value] for key,value in word_index.items()}


# Fine-tuning a model on a question-answering task

In [None]:
# This flag selects between SQuAD v1 and v2 (if you're using another dataset, it indicates whether impossible
# answers are allowed).
squad_v2 = True
model_checkpoint = 'albert-base-v2'
batch_size = 16

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:

from datasets import load_dataset, load_metric

For our example here, we'll use the [SQUAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). The notebook should work with any question answering dataset provided by the 🤗 Datasets library. If you're using your own dataset defined from a JSON or csv file (see the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load them), it might need some adjustments in the names of the columns used.

In [None]:
datasets = load_dataset("squad_v2" if squad_v2 else "squad")

Reusing dataset squad_v2 (/root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d)


  0%|          | 0/2 [00:00<?, ?it/s]

The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for each of the training, validation and test sets.

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The following assertion ensures that our tokenizer is a fast tokenizer (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, and we will need some of the special features they have for our preprocessing.

In [None]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

You can check which type of models have a fast tokenizer available and which don't on the [big table of models](https://huggingface.co/transformers/index.html#bigtable).

You can directly call this tokenizer on two sentences (one for the question, one for the context):

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later), you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Now one specific thing for the preprocessing in question answering is how to deal with very long documents. We usually truncate them in other tasks, when they are longer than the model maximum sentence length, but here, removing part of the context might result in losing the answer we are looking for. To deal with this, we will allow one (long) example in our dataset to give several input features, each of length shorter than the maximum length of the model (or the one we set as a hyper-parameter). Also, just in case the answer lies at the point we split a long context, we allow some overlap between the features we generate, controlled by the hyper-parameter `doc_stride`:

In [None]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
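
To make the role of the overlap concrete, here is a minimal sketch of splitting a token sequence into overlapping windows. It is a simplification, not the tokenizer's implementation: the helper name `split_with_stride` and its `window` parameter are made up for illustration, and it ignores the question and special tokens that the real tokenizer also fits inside `max_length`.

```python
def split_with_stride(tokens, window, stride):
    """Split `tokens` into windows of at most `window` tokens.

    Consecutive windows share `stride` tokens, so an answer that sits on a
    split point is still fully contained in at least one window (as long as
    the answer is no longer than `stride` tokens).
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride  # how far the window start advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks


context = list(range(10))  # stand-in for ten context tokens
chunks = split_with_stride(context, window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

With `window=4` and `stride=2`, each window repeats the last two tokens of the previous one.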

In [None]:
pad_on_right = tokenizer.padding_side == "right"

Now let's put everything together in one function we will apply to our training set. In the case of impossible answers (the answer is in another feature given by an example with a long context), we set the CLS index for both the start and end position. We could also simply discard those examples from the training set if the flag `allow_impossible_answers` is `False`. Since the preprocessing is already complex enough as it is, we've kept it simple for this part.
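
The labelling rule can be illustrated without a tokenizer. The sketch below is a simplified variant, not the function used in this notebook: `label_feature` is a hypothetical helper, the offsets are written by hand, and special tokens are represented by `None` instead of being identified through `sequence_ids`.

```python
def label_feature(offsets, answer_start, answer_text, cls_index=0):
    """Return (start_token, end_token) for an answer, or the CLS index twice.

    `offsets` holds one (char_start, char_end) pair per context token and
    None for special tokens (an assumption of this sketch).
    """
    answer_end = answer_start + len(answer_text)
    context_tokens = [i for i, span in enumerate(offsets) if span is not None]
    if not context_tokens:
        return cls_index, cls_index
    first, last = context_tokens[0], context_tokens[-1]
    # The answer is not fully inside this feature: label it as impossible.
    if offsets[first][0] > answer_start or offsets[last][1] < answer_end:
        return cls_index, cls_index
    # First token whose span ends after the answer starts.
    start_token = next(i for i in context_tokens if offsets[i][1] > answer_start)
    # Last token whose span starts before the answer ends.
    end_token = next(i for i in reversed(context_tokens) if offsets[i][0] < answer_end)
    return start_token, end_token


# Context: "Paris is the capital of France", framed by two special tokens.
offsets = [None, (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), None]
full = label_feature(offsets, 13, "capital of France")
# A truncated feature that stops after "the" no longer contains the answer.
truncated = label_feature(offsets[:4] + [None], 13, "capital of France")
print(full, truncated)
```

The first call finds the answer span in the feature; the second falls back to the CLS index because the answer lies beyond the truncated context.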

In [None]:
def extract_graph(initial_set, max_hop = 2):
    working_set = initial_set
    counter = 0
    while counter < max_hop:

        node_set = deque(working_set)
        m = len(node_set)
        working_set = set()
        for i in range(m):
            node = node_set.pop()
            working_set = working_set.union(set(conceptnet.neighbors(node)))
        counter += 1
        initial_set = initial_set.union(working_set)
    return initial_set

cashed_stop = nlp.Defaults.stop_words

def prepare_train_features(examples):
    # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # if len(examples["question"]) > 1:
    #     raise ValueError(" number of rows can be at most one")

    reduced_list = [examples["context"][0]]
    mapping_dict = {}
    j=0
    for i, sent_i in enumerate(examples["context"]):
        if not sent_i == reduced_list[-1]:
            reduced_list.append(sent_i)
            j += 1
        mapping_dict[i] = j

    # working_sentence = [examples["question"][i]+" "+examples["context"][i] for i in range(len(examples["question"]))]
    working_sentence = [strip_accents(working_sent) for working_sent in reduced_list]
    # working_sentence = re.sub('[^A-Za-z0-9 ]+', '', working_sentence)
    encoded_docs = [nlp(working_sent) for working_sent in working_sentence]
    flatten_set = [{x.text.lower() for x in encoded_doc.ents} for encoded_doc in encoded_docs]
    final_list = [{'_'.join([word for word in text.split() if word not in cashed_stop]) for text in word_set} for word_set in flatten_set ]

    cnet_set = [{x for x in final_set if x in cnet_word_set and word_index.get(x) is not None} for final_set in final_list]


    extracted_list = [extract_graph(initial_set) for initial_set in cnet_set]

    #prune the node set
    filtered_list = [[ent for ent in initial_set if word_embeddings_dict.get(ent) is not None] for initial_set in extracted_list]

    sub_list = [conceptnet.subgraph(filtered_set) for filtered_set in filtered_list]
    sub_list = [nx.relabel_nodes(G=sub_cnet, mapping=word_index) for sub_cnet in  sub_list]
    adj_df_list = [nx.convert_matrix.to_pandas_edgelist(sub_cnet).values.astype(float) for sub_cnet in  sub_list]



    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    # todo: extract qa graph:


    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []


    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    # data for columns: ['source', 'target', 'code', 'weight']
    tokenized_examples['adj_values'] = [adj_df_list[mapping_dict[i]] for i in sample_mapping]


    return tokenized_examples

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key.
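
For a batch, every key therefore holds one entry per feature rather than per example, and the graph data has to be aligned the same way. The following sketch mirrors the `reduced_list` / `mapping_dict` bookkeeping in `prepare_train_features` on toy data. The helper name `dedupe_consecutive`, the placeholder graph strings and the `sample_mapping` values are invented for illustration.

```python
def dedupe_consecutive(contexts):
    """Collapse runs of identical consecutive contexts.

    Returns the reduced list and a dict that maps each original index to
    its position in the reduced list.
    """
    reduced, index_map = [], {}
    for i, ctx in enumerate(contexts):
        if not reduced or ctx != reduced[-1]:
            reduced.append(ctx)
        index_map[i] = len(reduced) - 1
    return reduced, index_map


contexts = ["ctx A", "ctx A", "ctx B", "ctx B", "ctx B"]
reduced, index_map = dedupe_consecutive(contexts)

# One (placeholder) graph per distinct context, built only once.
graphs = [f"graph for {ctx}" for ctx in reduced]

# Hypothetical overflow mapping: example 0 produced two features, the others one each.
sample_mapping = [0, 0, 1, 2, 3, 4]
graph_per_feature = [graphs[index_map[i]] for i in sample_mapping]
print(graph_per_feature)
```

Building the graph once per distinct context avoids repeating the spaCy and ConceptNet work for every question that shares a paragraph.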

# Preparing dataset 

In [None]:
from datasets import load_from_disk
loaded_dataset = load_from_disk('tokenized_datasets_batched_small')

In [None]:
loaded_dataset

DatasetDict({
    train: Dataset({
        features: ['adj_values', 'attention_mask', 'end_positions', 'input_ids', 'start_positions', 'token_type_ids'],
        num_rows: 1319
    })
    validation: Dataset({
        features: ['adj_values', 'attention_mask', 'end_positions', 'input_ids', 'start_positions', 'token_type_ids'],
        num_rows: 120
    })
})

In [None]:
class SquadDataset(Dataset):
    def __init__(self, data):
        self.length = len(data)
        self.input_ids = data['input_ids']
        self.token_type_ids = data['token_type_ids']
        self.start_positions = data['start_positions']
        self.end_positions = data['end_positions']
        self.attention_mask = data['attention_mask']
        self.adj_list = data['adj_values']


    def __len__(self):
        return self.length

    def __getitem__(self, idx: int):

        input_ids = self.input_ids[idx]
        token_type_ids = self.token_type_ids[idx]
        start_positions = self.start_positions[idx]
        end_positions = self.end_positions[idx]
        attention_mask = self.attention_mask[idx]
        adj_list = self.adj_list[idx]

        return (torch.tensor(input_ids),
                torch.tensor(token_type_ids),
                torch.tensor(attention_mask),
                torch.from_numpy(np.array(adj_list)),
                torch.tensor(start_positions),
                torch.tensor(end_positions),
                )

In [None]:
train_ds = SquadDataset(loaded_dataset['train'])

def collate_fn(batch):
    return tuple(zip(*batch))
train_dataloader = DataLoader(train_ds,batch_size=1, shuffle=True, drop_last=False, collate_fn=collate_fn)

In [None]:
train_dataloader.batch_size

1

In [None]:
val_ds = SquadDataset(loaded_dataset['validation'])
val_dataloader = DataLoader(val_ds,batch_size=1, shuffle=True, drop_last=False, collate_fn=collate_fn)

In [None]:
test_batch = next(iter(train_dataloader))

In [None]:
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree, softmax
from torch_geometric.nn import global_add_pool, global_mean_pool, global_max_pool, GlobalAttention, Set2Set, GATConv
import torch.nn.functional as F
from torch_scatter import scatter_add, scatter
from torch_geometric.nn.inits import glorot, zeros


class GAT(torch.nn.Module):
    def __init__(self, num_features=300, out_dim=768):
        super(GAT, self).__init__()
        self.hid = 8
        self.in_head = 8
        self.out_head = 1
        self.num_features = num_features
        self.out_dim = out_dim
        
        
        self.conv1 = GATConv(self.num_features, self.hid, heads=self.in_head, dropout=0.6)
        self.conv2 = GATConv(self.hid*self.in_head, self.out_dim, concat=False,
                             heads=self.out_head, dropout=0.6)

    def forward(self, x, edge_index):
        #x, edge_index = data.x, data.edge_index
        
        x = F.dropout(x, p=0.6, training=self.training)
        #import pdb
        #pdb.set_trace()
        x = self.conv1(x, edge_index)
        x = F.elu(x)
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        
        return F.log_softmax(x, dim=1)

In [None]:
def batch_graph(edge_index_init, edge_type_init, n_nodes):
    # edge_index_init: list of length n_examples; each entry is a torch.tensor of shape (2, E)
    # edge_type_init:  list of length n_examples; each entry is a torch.tensor of shape (E,)
    n_examples = len(edge_index_init)
    edge_index = [edge_index_init[_i_] + _i_ * n_nodes for _i_ in range(n_examples)]
    edge_index = torch.cat(edge_index, dim=1) #[2, total_E]
    edge_type = torch.cat(edge_type_init, dim=0) #[total_E, ]
    return edge_index, edge_type
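
The index shift used by `batch_graph` can be shown with plain Python tuples. This is only a sketch of the idea under the assumption that every example numbers its nodes from `0` to `n_nodes - 1`; `batch_edges` is a hypothetical helper and works on lists of `(source, target)` pairs instead of tensors.

```python
def batch_edges(edge_lists, n_nodes):
    """Merge per-example edge lists into one graph with disjoint node ids.

    The node ids of example i are shifted by i * n_nodes, so nodes that
    belong to different examples never collide in the merged graph.
    """
    merged = []
    for i, edges in enumerate(edge_lists):
        offset = i * n_nodes
        merged.extend((src + offset, dst + offset) for src, dst in edges)
    return merged


# Two small graphs with three nodes each.
edges = batch_edges([[(0, 1), (1, 2)], [(0, 2)]], n_nodes=3)
print(edges)
```

The second graph's edge `(0, 2)` becomes `(3, 5)` after the shift, so the merged edge list describes two disconnected components.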

In [None]:
class AlbertQA(torch.nn.Module):

    def __init__(self, gnn_present=False):
        super().__init__()

        self.lm = AutoModel.from_pretrained(model_checkpoint)
        self.lm.resize_token_embeddings(len(tokenizer))
        self.gnn_present = gnn_present
        self.gnn = GAT() #nn.Linear(self.lm.encoder.config.hidden_size, self.lm.encoder.config.hidden_size)
        self.hidden_size = 2*self.lm.encoder.config.hidden_size if gnn_present else self.lm.encoder.config.hidden_size
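        # The GNN output is concatenated along the sequence axis, so the feature size stays at the
        # language model's hidden size whether or not the GNN is used.
        self.qa_outputs = nn.Linear(self.lm.config.hidden_size, self.lm.config.num_labels)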

    def forward(
            self,
            input_ids=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            output_attentions=None,
            output_hidden_states=None,
            return_dict=None,
            adj_matrix=None,
    ):
        outputs = self.lm(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        lm_output = outputs[0]
        
        #import pdb
        #pdb.set_trace()

        #edge_index = [torch.transpose(adj_matrix[i][:,:2], 0,1) for i in range(len(adj_matrix)-1)]
        #edge_type = [torch.transpose(adj_matrix[i][:,2], 0,1) for i in range(len(adj_matrix)-1)]
        edge_index = torch.transpose(adj_matrix[0][:, :2], 0, 1)
        edge_type = torch.tensor(adj_matrix[0][:, 2])
        
        features_x = torch.tensor(normalized_embeddings) #torch.tensor([normalized_embeddings[i] for i in input_ids[0]])
        
        #edge_index, _ = batch_graph(edge_index, edge_type, 200)
        
        edge_index = edge_index.long()
        
        gnn_output = self.gnn(features_x, edge_index)#*0
        #gnn_output = gnn_output.unsqueeze(0)

        # Drop the batch dimension (the DataLoader uses batch_size=1).
        lm_output = lm_output.squeeze()

        sequence_output = torch.cat([lm_output, gnn_output]) if self.gnn_present else lm_output
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1).contiguous()
        end_logits = end_logits.squeeze(-1).contiguous()

        return QuestionAnsweringModelOutput(
            start_logits=start_logits,
            end_logits=end_logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )


In [None]:
def loss_compute(start_positions, end_positions, start_logits, end_logits):
    total_loss = None
    if start_positions is not None and end_positions is not None:
        # The model squeezes the batch dimension (batch_size=1); restore it so that
        # the logits have shape (batch, sequence_length).
        if start_logits.dim() == 1:
            start_logits = start_logits.unsqueeze(0)
            end_logits = end_logits.unsqueeze(0)
        # If we are on multi-GPU, the positions carry an extra dimension; drop it.
        if len(start_positions.size()) > 1:
            start_positions = start_positions.squeeze(-1)
        if len(end_positions.size()) > 1:
            end_positions = end_positions.squeeze(-1)
        # Sometimes the start/end positions are outside our model inputs; we ignore these terms.
        ignored_index = start_logits.size(1)
        start_positions = start_positions.clamp(0, ignored_index)
        end_positions = end_positions.clamp(0, ignored_index)

        loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
        start_loss = loss_fct(start_logits, start_positions)
        end_loss = loss_fct(end_logits, end_positions)
        total_loss = (start_loss + end_loss) / 2
    return total_loss
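
As a sanity check on what this loss measures, here is a small pure-Python sketch for a single feature. It is an illustration rather than a replacement for `CrossEntropyLoss`: `cross_entropy` and `span_loss` are hypothetical helpers, and where the PyTorch version ignores an out-of-range start or end term individually, this simplified version skips the whole feature.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of index `target` under `logits`."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def span_loss(start_logits, end_logits, start_pos, end_pos):
    """Mean of the start and end cross-entropy for one feature.

    Returns None when a label lies outside the sequence (simplified
    stand-in for the ignore_index mechanism).
    """
    n = len(start_logits)
    if not (0 <= start_pos < n and 0 <= end_pos < n):
        return None
    start_loss = cross_entropy(start_logits, start_pos)
    end_loss = cross_entropy(end_logits, end_pos)
    return (start_loss + end_loss) / 2


uniform = [0.0, 0.0, 0.0, 0.0]
loss = span_loss(uniform, uniform, start_pos=1, end_pos=2)
skipped = span_loss(uniform, uniform, start_pos=4, end_pos=2)
print(loss, skipped)
```

With uniform logits over four positions, each term equals `log(4)`, which is the loss of an untrained model that guesses at random.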

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device = "cpu"
qamodel = AlbertQA(gnn_present=True)
qamodel.to(device)
optimizer = optim.Adam(qamodel.parameters(), lr=5e-5, weight_decay=0.001)

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertModel: ['predictions.dense.weight', 'predictions.decoder.bias', 'predictions.dense.bias', 'predictions.bias', 'predictions.LayerNorm.weight', 'predictions.LayerNorm.bias', 'predictions.decoder.weight']
- This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
import matplotlib.pyplot as plt


In [None]:
train_losses = []
test_losses = []
n_epochs = 5
for epoch in range(n_epochs):
    running_loss = 0
    n_correct = 0
    qamodel.train()
    for batch in tqdm(train_dataloader, leave=False):
        torch.cuda.empty_cache()
        input_ids = torch.stack(list(batch[0]), dim=0).to(device)
        attention_mask = torch.stack(list(batch[2]), dim=0).to(device)
        token_type_ids = torch.stack(list(batch[1]), dim=0).to(device)

        start_positions = torch.stack(list(batch[4]), dim=0).to(device)
        end_positions = torch.stack(list(batch[5]), dim=0).to(device)
        out = qamodel(input_ids=input_ids,
                      attention_mask=attention_mask,
                      token_type_ids=token_type_ids,
                      adj_matrix=batch[3],
                      )

        loss = loss_compute(start_positions, end_positions, out['start_logits'], out['end_logits'])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        running_loss += loss.cpu().item()
    train_losses.append(running_loss / len(train_ds))

    running_loss = 0
    with torch.no_grad():
        qamodel.eval()
        for batch in tqdm(val_dataloader, leave=False):
            input_ids = torch.stack(list(batch[0]), dim=0).to(device)
            attention_mask = torch.stack(list(batch[2]), dim=0).to(device)
            token_type_ids = torch.stack(list(batch[1]), dim=0).to(device)

            start_positions = torch.stack(list(batch[4]), dim=0).to(device)
            end_positions = torch.stack(list(batch[5]), dim=0).to(device)
            out = qamodel(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids,
                        adj_matrix=batch[3],
                        )

            loss = loss_compute(start_positions, end_positions, out['start_logits'], out['end_logits'])
            running_loss += loss.cpu().item()

    test_losses.append(running_loss / len(val_ds))

    print("=" * 20)
    # Report the stored per-epoch means; running_loss now holds the validation total.
    print(f"Epoch {epoch + 1}/{n_epochs} Train Loss: {train_losses[-1]}")
    print(f"Epoch {epoch + 1}/{n_epochs} Test Loss: {test_losses[-1]}")

plt.plot(train_losses, label="train")
plt.plot(test_losses, label="test")
plt.legend()
plt.xlabel("epoch")
plt.ylabel("mean loss")
plt.show()

In [None]:
plt.plot(train_losses, label="train")
# plt.plot(test_losses, label="test")
plt.legend()
plt.xlabel("epoch")
plt.ylabel("mean loss")
plt.show()

In [None]:
torch.save(qamodel.state_dict(), 'albertqa_nognn_1.mdl')


In [None]:
model = AlbertQA()
model.load_state_dict(torch.load('albertqa_nognn_1.mdl'))
model.eval()

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertModel: ['predictions.decoder.bias', 'predictions.decoder.weight', 'predictions.LayerNorm.weight', 'predictions.LayerNorm.bias', 'predictions.bias', 'predictions.dense.bias', 'predictions.dense.weight']
- This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


AlbertQA(
  (lm): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
               