<a href="https://colab.research.google.com/github/miataigeli/capstone_FHIS/blob/darya/bert_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## BERT

In this notebook, we create a BERT pipeline to load the Multiligual BERT model and use it along with two linear layers to do a text classification task - to determine whether the text passed in is 'A' reading level or 'B'.

Based on tutorial here: https://www.youtube.com/watch?v=mw7ay38--ak as well as the BERT tutorial from COLX585: https://github.ubc.ca/MDS-CL-2020-21/COLX_585_trends_students/blob/master/tutorials/BPE-BERT/bert_pytorch.ipynb.

#### Imports and Installations

In [1]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
[K     |████████████████████████████████| 2.3MB 8.0MB/s 
Collecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
[K     |████████████████████████████████| 3.3MB 35.1MB/s 
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
[K     |███████

In [2]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModel, BertTokenizerFast, BertModel, AdamW, get_linear_schedule_with_warmup
from keras.preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from tqdm import tqdm

In [3]:
#specify GPU
device = torch.device("cuda")

In [4]:
#connect to my drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load and Prepare Dataset

We read in the corpus of json created previously. For the classification, we will only need the text and its label, which are contained in the `content` and `level` columns, so those are the only ones we keep.

In [5]:
# Read in all json files into one pandas dataframe
import os

corpus_dir = "/content/drive/MyDrive/capstone/corpus"
corpus_df = pd.DataFrame([], columns = ['content', 'level'])

for filename in os.listdir(corpus_dir):
    if filename.endswith(".json"): 
         file_path = os.path.join(corpus_dir, filename)
         df = pd.read_json(file_path)
         df = df.drop(columns=['source', 'author', 'title'])
         corpus_df = pd.concat([corpus_df, df])
    else:
        continue

print(corpus_df.describe())

                                                  content level
count                                                 308   308
unique                                                308     5
top     CApÍtULO 4\n\nEl local donde ensayan Los Ectop...    A1
freq                                                    1    94


In [6]:
corpus_df['level'].value_counts(normalize = True)

A1    0.305195
B     0.288961
A2    0.201299
B1    0.136364
B2    0.068182
Name: level, dtype: float64

As we can see, currently the text is classified into A1, A2, B1, B2 and B levels. This would make classification difficult, since the B level is very similar to the B1 and B2 levels. Therefore, for now we change the levels to A or B only, to simplify the classification.

In [6]:
# Change levels to A or B (for now)

level_starts_with_A = corpus_df['level'].map(lambda x: x.startswith('A'))
level_starts_with_B = corpus_df['level'].map(lambda x: x.startswith('B'))
corpus_df.loc[level_starts_with_A, 'level'] = 0 # where the level in the df starts with A, replace level with 'A'
corpus_df.loc[level_starts_with_B, 'level'] = 1 # where the level in the df starts with B, replace level with 'B'

# print out to test if it worked
corpus_df['level'].value_counts(normalize = True)

0    0.506494
1    0.493506
Name: level, dtype: float64

In [7]:
# Reduce size of corpus
corpus_df = corpus_df.iloc[:30]
corpus_df.describe()

Unnamed: 0,content,level
count,30,30
unique,30,2
top,"¡Hola! Mi nombre es Javier. Cuando era niño, m...",0
freq,1,19


In [8]:
corpus_df['level'].value_counts(normalize = True)

0    0.633333
1    0.366667
Name: level, dtype: float64

### Split into train, validation and test sets

We split the data into training, validation and test sets. TODO: make sure the split is the same as the one used for SVM for better comparison.

In [9]:
train_text, test_text, train_levels, test_levels = train_test_split(list(corpus_df['content']), list(corpus_df['level']),
                                                                    random_state = 2021,
                                                                    test_size = 0.3) #did not include stratify

# split test into validation and test
val_text, test_text, val_levels, test_levels = train_test_split(test_text, test_levels,
                                                                random_state = 2021,
                                                                test_size=0.5)

We load the pretrained BERT model. We will test the Spanish model and the multilingual BERT model.

In [10]:
# model_path = 'dccuchile/bert-base-spanish-wwm-uncased' # Spanish model TODO: cased or uncased?
# model_path = 'bert-base-multilingual-cased' # multilingual model
model_path = 'distilbert-base-multilingual-cased'
# tokenizer from pre-trained BERT model
tokenizer = BertTokenizerFast.from_pretrained(model_path, return_tensors='pt')
# Define label to number dictionary (make it 0 and 1 to be able to use cross-entropy loss)
lab2ind = {'A': 0,
           'B': 1
           }

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=995526.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1961828.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=29.0, style=ProgressStyle(description_w…




In [11]:
# Download the bert model
model_path = 'distilbert-base-multilingual-cased'
bert_model = BertModel.from_pretrained(model_path, output_attentions = True).to(device)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=466.0, style=ProgressStyle(description_…

You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.





HBox(children=(FloatProgress(value=0.0, description='Downloading', max=541808922.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing BertModel: ['distilbert.transformer.layer.4.attention.v_lin.bias', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.transformer.layer.5.ffn.lin1.bias', 'distilbert.transformer.layer.0.ffn.lin2.bias', 'distilbert.transformer.layer.3.ffn.lin1.weight', 'distilbert.transformer.layer.2.sa_layer_norm.weight', 'distilbert.transformer.layer.3.output_layer_norm.bias', 'distilbert.transformer.layer.3.attention.q_lin.bias', 'distilbert.transformer.layer.3.attention.v_lin.bias', 'distilbert.transformer.layer.2.ffn.lin1.weight', 'distilbert.transformer.layer.5.ffn.lin2.weight', 'distilbert.transformer.layer.5.attention.out_lin.bias', 'distilbert.transformer.layer.2.attention.v_lin.bias', 'vocab_transform.bias', 'distilbert.transformer.layer.1.attention.k_lin.weight', 'distilbert.transformer.layer.2.output_layer_norm.bias', 'distilbert.transformer.layer.3.ffn.lin2.weight', 'disti

In [12]:
# Prepare data
def prepare_data(text, levels):
  ''' Preprocesses the data for classification. '''

  # Tokenize text
  tokenized_texts = tokenizer.batch_encode_plus(text, padding=True, return_token_type_ids=False, return_tensors='pt')

  # Create label tensor
  #labels = [lab2ind[i] for i in levels]
  labels = torch.tensor(levels) #torch.tensor(labels)

  # Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
  # Truncate the texts at 512 since that is the maximum allowed input to BERT
  input_ids = tokenized_texts['input_ids'][:, :300]
  attention_masks = tokenized_texts['attention_mask'][:, :300]

  # Convert all of our data into torch tensors, the required datatype for our model
  inputs = torch.tensor(input_ids)
  masks = torch.tensor(attention_masks)

  return inputs, labels, masks

In [13]:
# Training data
train_inputs, train_labels, train_masks = prepare_data(train_text, train_levels)
print(train_inputs.shape)
print(train_labels.shape)
print(train_masks.shape)

torch.Size([21, 300])
torch.Size([21])
torch.Size([21, 300])




In [14]:
# Validation data
valid_inputs, valid_labels, valid_masks = prepare_data(val_text, val_levels)
print(valid_inputs.shape)
print(valid_labels.shape)
print(valid_masks.shape)

torch.Size([4, 281])
torch.Size([4])
torch.Size([4, 281])




In [14]:
# Test data
test_inputs, test_labels, test_masks = prepare_data(test_text, test_levels)
print(test_inputs.shape)
print(test_labels.shape)
print(test_masks.shape)

torch.Size([47, 512])
torch.Size([47])
torch.Size([47, 512])




In [15]:
# Create an iterator for our data
batch_size = 16
# We'll take training samples in random order in each epoch. 
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(train_data, 
                              sampler = RandomSampler(train_data), # Select batches randomly
                              batch_size=batch_size)

# We'll just read validation set sequentially.
validation_data = TensorDataset(valid_inputs, valid_masks, valid_labels)
validation_dataloader = DataLoader(validation_data, 
                                   sampler = SequentialSampler(validation_data), # Pull out batches sequentially.
                                   batch_size=batch_size)

#### Testing on one example

In [18]:
dataiter = iter(train_dataloader)
batch = dataiter.next()
# Add batch to GPU
batch = tuple(t.to(device) for t in batch)
# Unpack the inputs from our dataloader
input_ids, input_mask, labels = batch

In [19]:
print(input_ids.shape)
print(input_mask.shape)
print(labels.shape)

torch.Size([16, 512])
torch.Size([16, 512])
torch.Size([16])


In [20]:
outputs = bert_model(input_ids, attention_mask = input_mask)
print(outputs.keys())

odict_keys(['last_hidden_state', 'pooler_output', 'attentions'])


In [21]:
last_hidden_state = outputs["last_hidden_state"]
pooler_output = outputs["pooler_output"]
#hidden_states = outputs["hidden_states"]
#attentions = outputs["attentions"]
print(last_hidden_state.shape)

torch.Size([16, 512, 768])


The last dimension is 768 because that is the dimension of BERT encodings for the model we loaded. 

In [22]:
dense = nn.Linear(768, 768).to(device)
dropout = nn.Dropout(0.1).to(device)
fc = nn.Linear(768, 2).to(device)
softmax = nn.Softmax(dim=1)
dense_output = dense(pooler_output)
drop_output = dropout(dense_output)
fc_output = fc(drop_output)
fc_softmax_output = softmax(fc_output)

print(fc_softmax_output)

tensor([[0.4844, 0.5156],
        [0.4706, 0.5294],
        [0.4815, 0.5185],
        [0.4866, 0.5134],
        [0.4358, 0.5642],
        [0.5017, 0.4983],
        [0.4656, 0.5344],
        [0.4936, 0.5064],
        [0.4764, 0.5236],
        [0.4427, 0.5573],
        [0.4811, 0.5189],
        [0.4551, 0.5449],
        [0.4462, 0.5538],
        [0.4343, 0.5657],
        [0.4647, 0.5353],
        [0.4764, 0.5236]], device='cuda:0', grad_fn=<SoftmaxBackward>)


In [23]:
labels = labels.to(device)
print(labels)
criterion = nn.CrossEntropyLoss()
criterion(fc_softmax_output, labels)

tensor([1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1], device='cuda:0')


tensor(0.6849, device='cuda:0', grad_fn=<NllLossBackward>)

#### BERT Class

We create a BERT class so that we have a pipeline to train the model. The class is initialized with the pretrained BERT model (that's been loaded in earlier code), a hidden layer size, two linear layers and a dropout. The `forward` method generates a BERT representation of the input using the pretrained model, passes the representations into the first linear layer, then to a TanH activation function and dropout function, and then to the final linear layer.

We use `pooler_output` as context representation and pass it to a fully connected layer which outputs the prediction probabilities across all labels.

Two new feed-forward layers for classification are added on top of BERT. Each input is a context representation (`pooler_output`) that is a 768-dimensional vector, and the output is the probability distribution across all labels that is a 5-dimensional vector.

In [16]:
#model_path = "dccuchile/bert-base-spanish-wwm-uncased"
model_path = 'distilbert-base-multilingual-cased'
class Bert_cls(nn.Module):

    def __init__(self, lab2ind, model_path, hidden_size):
        ''' Initializes the class. 

        Arguments: label to index dictionary, path to pretrained model, hidden layer size.

        Returns: None

        '''
        super(Bert_cls, self).__init__()
        self.model_path = model_path
        self.hidden_size = hidden_size
        self.bert_model = BertModel.from_pretrained(model_path)
        
        self.label_num = len(lab2ind)
        
        self.dense = nn.Linear(self.hidden_size, self.hidden_size)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(self.hidden_size, self.label_num)

    def forward(self, input_ids, attention_mask):
        ''' Generates a BERT representation of the input using the pretrained model, 
        passes the representations into the first linear layer, then to a TanH activation function and dropout function,
        and then to the final linear layer.
        
        Arguments: input_ids, attention mask

        Returns: outputs of neural network and attention mask.
        '''
        outputs = self.bert_model(input_ids, attention_mask)
        pooler_output = outputs['pooler_output']
        #attentions = outputs['attentions']
        
        x = self.dense(pooler_output)
        x = torch.tanh(x)
        x = self.dropout(x)
        fc_output = self.fc(x)

        return fc_output#, attentions

In [17]:
# Instantiate model
model_path = 'distilbert-base-multilingual-cased'
bert_model = Bert_cls(lab2ind, model_path, 768).to(device)

You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing BertModel: ['distilbert.transformer.layer.4.attention.v_lin.bias', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.transformer.layer.5.ffn.lin1.bias', 'distilbert.transformer.layer.0.ffn.lin2.bias', 'distilbert.transformer.layer.3.ffn.lin1.weight', 'distilbert.transformer.layer.2.sa_layer_norm.weight', 'distilbert.transformer.layer.3.output_layer_norm.bias', 'distilbert.transformer.layer.3.attention.q_lin.bias', 'distilbert.transformer.layer.3.attention.v_lin.bias', 'distilbert.transformer.layer.2.ffn.lin1.weight', 'distilbert.transformer.layer.5.ffn.lin2.weight', 'distilbert.transformer.layer.5.attention.out_lin.bias', 'distilbert.transformer.layer.2.attention.v_lin.bias', 'vocab_transform.bias', 'distilbert.tra

In [18]:
# Count number of parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(bert_model):,} trainable parameters')

The model has 178,445,570 trainable parameters


#### Training the Model

In [19]:
# Parameters:
lr = 2e-5
max_grad_norm = 1.0
epochs = 10
warmup_proportion = 0.1
num_training_steps  = len(train_dataloader) * epochs
num_warmup_steps = num_training_steps * warmup_proportion

### In Transformers, optimizer and schedules are instantiated like this:
# Note: AdamW is a class from the huggingface library
# the 'W' stands for 'Weight Decay"
optimizer = AdamW(bert_model.parameters(), lr=lr, correct_bias=False)
# schedules
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler

# We use nn.CrossEntropyLoss() as our loss function. 
criterion = nn.CrossEntropyLoss()

In [20]:
# Training the model
def train(model, iterator, optimizer, scheduler, criterion):
    
    model.train()
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        

        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        input_ids, input_mask, labels = batch

        # outputs,_ = model(input_ids, input_mask)
        outputs = model(input_ids, input_mask)

        loss = criterion(outputs, labels)
        # delete used variables to free GPU memory
        del batch, input_ids, input_mask, labels
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore
        optimizer.step()
        scheduler.step()
        epoch_loss += loss.cpu().item()
        optimizer.zero_grad()
    
    # free GPU memory
    if device == 'cuda':
        torch.cuda.empty_cache()

    return epoch_loss / len(iterator)

In [21]:
# Evaluate function
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    all_pred=[]
    all_label = []
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            # Add batch to GPU
            batch = tuple(t.to(device) for t in batch)
            # Unpack the inputs from our dataloader
            input_ids, input_mask, labels = batch

            outputs = model(input_ids, input_mask)
            
            loss = criterion(outputs, labels)

            # delete used variables to free GPU memory
            del batch, input_ids, input_mask
            epoch_loss += loss.cpu().item()

            # identify the predicted class for each example in the batch
            probabilities, predicted = torch.max(outputs.cpu().data, 1)
            # put all the true labels and predictions to two lists
            all_pred.extend(predicted)
            all_label.extend(labels.cpu())
    
    accuracy = accuracy_score(all_label, all_pred)
    f1score = f1_score(all_label, all_pred, average='macro') 
    return epoch_loss / len(iterator), accuracy, f1score

In [22]:
# create checkpoint directory
import os
save_path = './drive/My Drive/Colab Notebooks/ckpt_BERT/'
if os.path.exists(save_path) == False:
    os.makedirs(save_path)

In [23]:
from tqdm import trange
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, classification_report, confusion_matrix
# Train the model
loss_list = []
acc_list = []

for epoch in trange(epochs, desc="Epoch"):
    train_loss = train(bert_model, train_dataloader, optimizer, scheduler, criterion)  
    val_loss, val_acc, val_f1 = evaluate(bert_model, validation_dataloader, criterion)

    # Create checkpoint at end of each epoch
    state = {
        'epoch': epoch,
        'state_dict': bert_model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict()
        }

    torch.save(state, "./drive/My Drive/Colab Notebooks/ckpt_BERT/BERT_"+str(epoch+1)+".pt")

    print('\n Epoch [{}/{}], Train Loss: {:.4f}, Validation Loss: {:.4f}, Validation Accuracy: {:.4f}, Validation F1: {:.4f}'.format(epoch+1, epochs, train_loss, val_loss, val_acc, val_f1))

Epoch:  10%|█         | 1/10 [00:53<08:01, 53.52s/it]


 Epoch [1/10], Train Loss: 0.6941, Validation Loss: 1.8138, Validation Accuracy: 0.2500, Validation F1: 0.2000


Epoch:  20%|██        | 2/10 [01:14<05:51, 43.89s/it]


 Epoch [2/10], Train Loss: 0.7253, Validation Loss: 0.6542, Validation Accuracy: 0.7500, Validation F1: 0.4286


Epoch:  30%|███       | 3/10 [01:37<04:21, 37.42s/it]


 Epoch [3/10], Train Loss: 0.6849, Validation Loss: 0.9764, Validation Accuracy: 0.2500, Validation F1: 0.2000


Epoch:  40%|████      | 4/10 [01:59<03:16, 32.81s/it]


 Epoch [4/10], Train Loss: 0.4959, Validation Loss: 1.2869, Validation Accuracy: 0.2500, Validation F1: 0.2000


Epoch:  50%|█████     | 5/10 [02:21<02:28, 29.69s/it]


 Epoch [5/10], Train Loss: 0.5746, Validation Loss: 1.2609, Validation Accuracy: 0.2500, Validation F1: 0.2000


Epoch:  60%|██████    | 6/10 [02:43<01:49, 27.43s/it]


 Epoch [6/10], Train Loss: 0.6447, Validation Loss: 0.9942, Validation Accuracy: 0.2500, Validation F1: 0.2000


Epoch:  70%|███████   | 7/10 [03:10<01:21, 27.12s/it]


 Epoch [7/10], Train Loss: 0.5912, Validation Loss: 0.8479, Validation Accuracy: 0.2500, Validation F1: 0.2000


Epoch:  80%|████████  | 8/10 [03:32<00:51, 25.75s/it]


 Epoch [8/10], Train Loss: 0.5842, Validation Loss: 0.8989, Validation Accuracy: 0.2500, Validation F1: 0.2000


Epoch:  90%|█████████ | 9/10 [03:55<00:24, 24.72s/it]


 Epoch [9/10], Train Loss: 0.5199, Validation Loss: 0.9439, Validation Accuracy: 0.2500, Validation F1: 0.2000


Epoch: 100%|██████████| 10/10 [04:28<00:00, 26.87s/it]


 Epoch [10/10], Train Loss: 0.5328, Validation Loss: 0.9625, Validation Accuracy: 0.2500, Validation F1: 0.2000





#### Testing Model

Below, I test the model with one example that I took from the corpus manually. It predicts 'A' level, which is correct.

In [24]:
# Test model
text = ['Un chico pelirrojo, un poco gordo, se les acerca sonriendo.\n\u2014Hola, M\u00f3nica. Hola, Laura \u2014dice.\nEs Guillermo.\n\u2014Hola, Guille \u2014contestan las chicas\u2014. Llegas tarde.\n\u2014Es que me he dormido.\n\u2014S\u00ed, ya lo veo.\nGuillermo se sienta al lado de las chicas.\n\u2014\u00bfC\u00f3mo van? \u2014pregunta.\n\u2014Perdemos por 3 a 1.\n\u2014\u00bfDe verdad?\n\u2014S\u00ed, es que...\nUn grito interrumpe la conversaci\u00f3n. \u00ab\u00a1Goool!\u00bb.\n\u2014\u00bfQui\u00e9n ha marcado? \u2014pregunta Laura.\n\u2014Nosotros.\n\u2014Ha marcado Ra\u00fal, despu\u00e9s de un pase de Sergio \u2014explica M\u00f3nica, contenta.\n\n\ufffd\n\n4  f\u00fatbol sala: modalidad del f\u00fatbol que se juega en un recinto m\u00e1s peque\u00f1o, con \ncinco jugadores por equipo.\n\n5  fase eliminatoria: fase de la competici\u00f3n entre 16 equipos, anterior a los cuartos de final, entre los ocho mejores.']
# Tokenize text
tokenized_text = tokenizer.batch_encode_plus(text, padding=True, return_token_type_ids=False, return_tensors='pt')

# Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
input_ids = tokenized_text['input_ids'].to(device)
print(input_ids.shape)
attention_masks = tokenized_text['attention_mask'].to(device)

# # Convert all of our data into torch tensors, the required datatype for our model
# input = torch.tensor(input_ids)
# input = input.to(device)
# mask = torch.tensor(attention_masks)
# mask = mask.to(device)
outputs = bert_model(input_ids, attention_masks)

torch.Size([1, 215])


In [25]:
print(outputs[0])

tensor([ 0.4961, -0.2841], device='cuda:0', grad_fn=<SelectBackward>)


In [26]:
ind2lab =  {0 :'A', 1: 'B'}

In [31]:
probabilities, predicted = torch.max(outputs[0].cpu().data,0)
print("the prediction is: ", ind2lab[predicted.item()])

the prediction is:  A


(TODO) Test on the test data.
Note: this section is incomplete.

In [32]:
# Test data
test_inputs, test_labels, test_masks = prepare_data(test_text, test_levels)
print(test_inputs.shape)
print(test_labels.shape)
print(test_masks.shape)

torch.Size([5, 300])
torch.Size([5])
torch.Size([5, 300])




In [36]:
bert_model = Bert_cls(lab2ind, model_path, 768).to(device)

You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing BertModel: ['distilbert.transformer.layer.4.attention.v_lin.bias', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.transformer.layer.5.ffn.lin1.bias', 'distilbert.transformer.layer.0.ffn.lin2.bias', 'distilbert.transformer.layer.3.ffn.lin1.weight', 'distilbert.transformer.layer.2.sa_layer_norm.weight', 'distilbert.transformer.layer.3.output_layer_norm.bias', 'distilbert.transformer.layer.3.attention.q_lin.bias', 'distilbert.transformer.layer.3.attention.v_lin.bias', 'distilbert.transformer.layer.2.ffn.lin1.weight', 'distilbert.transformer.layer.5.ffn.lin2.weight', 'distilbert.transformer.layer.5.attention.out_lin.bias', 'distilbert.transformer.layer.2.attention.v_lin.bias', 'vocab_transform.bias', 'distilbert.tra

In [39]:
for test_input, test_mask in zip(test_inputs, test_masks):
  outputs = bert_model(test_input, test_mask)
  print(outputs)

print(test_labels)

ValueError: ignored