<a href="https://colab.research.google.com/github/miataigeli/capstone_FHIS/blob/darya/src/bert_pipeline_darya.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## BERT

In this notebook, we create a BERT pipeline that loads the Multilingual BERT model and uses it, together with two linear layers, for a text classification task: determining whether the text passed in is 'A1', 'A2' or 'B' level according to the European CEFR framework.

Based on the tutorial at https://www.youtube.com/watch?v=mw7ay38--ak as well as the BERT tutorial from COLX585: https://github.ubc.ca/MDS-CL-2020-21/COLX_585_trends_students/blob/master/tutorials/BPE-BERT/bert_pytorch.ipynb.

#### Imports and Installations

In [None]:
!pip install transformers


In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, f1_score
import transformers
from transformers import BertTokenizerFast, BertModel, AdamW, get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from tqdm import tqdm
import math

In [None]:
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
#connect to my drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load and Prepare Dataset

We read in the corpus from JSON files created previously. For the classification, we will only need the text and its label, which are contained in the `content` and `level` columns, so those are the only ones we keep. The splits are already done, so the files we read in are `train.json`, `val.json` and `test.json`.

In [None]:
# Read the train, validation and test JSON files into three pandas dataframes
import os

corpus_dir = "/content/drive/MyDrive/capstone/corpus"

for filename in os.listdir(corpus_dir):
    file_path = os.path.join(corpus_dir, filename)
    # Keep only the 'content' and 'level' columns of each split
    if filename.endswith("train.json"):
        train_df = pd.read_json(file_path).drop(columns=['source', 'author', 'title'])
    elif filename.endswith("val.json"):
        val_df = pd.read_json(file_path).drop(columns=['source', 'author', 'title'])
    elif filename.endswith("test.json"):
        test_df = pd.read_json(file_path).drop(columns=['source', 'author', 'title'])

print("Train: \n", train_df.describe(), "\n")
print("Val: \n", val_df.describe(), "\n")
print("Test: \n", test_df.describe(), "\n")

Train: 
        level                                            content
count    257                                                257
unique     3                                                257
top        B  Cierto hombre rico tenía tres hijos. El hijo m...
freq     122                                                  1 

Val: 
        level                                            content
count     32                                                 32
unique     3                                                 32
top        B   \nLA CONSTANCIA\nMis arreos son las armas,\nm...
freq      15                                                  1 

Test: 
        level                                            content
count     32                                                 32
unique     3                                                 32
top        B  ¿En qué mes hablan menos las mujeres?—En el de...
freq      15                                                  1 



In [None]:
# View class splits
train_df['level'].value_counts(normalize = True)

B     0.474708
A1    0.330739
A2    0.194553
Name: level, dtype: float64

As we can see, the texts are currently classified into A1, A2 and B levels. Although the CEFR framework includes other levels, we will only use these for now.
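Because B is the largest class, a classifier that always predicts B gives a useful floor for the accuracies reported later. The following is a minimal sketch of that arithmetic, using the counts printed by `describe()` above; the helper name `majority_baseline` is only for illustration and is not part of the notebook.

```python
def majority_baseline(majority_count, total):
    """Accuracy obtained by always predicting the most frequent class."""
    return majority_count / total


# Counts copied from the describe() output above: 'B' is the top label in every split
train_floor = majority_baseline(122, 257)
val_floor = majority_baseline(15, 32)
test_floor = majority_baseline(15, 32)

print(f"train: {train_floor:.3f}, val: {val_floor:.3f}, test: {test_floor:.3f}")
```

Any accuracy close to these values would suggest that the model is doing little more than predicting the majority class.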

### Split into text and level lists

In [None]:
# Define label to number dictionary
lab2ind = {'A1': 0,
           'A2': 1,
           'B': 2
           }

In [None]:
train_text, train_levels = list(train_df['content']), list(train_df['level'])
val_text, val_levels = list(val_df['content']), list(val_df['level'])
test_text, test_levels = list(test_df['content']), list(test_df['level'])

We load the tokenizer of the pretrained multilingual BERT model, followed by the model itself.

In [None]:
# model_path = 'dccuchile/bert-base-spanish-wwm-uncased' # Spanish model TODO: cased or uncased?
model_path = 'bert-base-multilingual-cased' # multilingual model
# model_path = 'distilbert-base-multilingual-cased' #lighter, faster model
# tokenizer from pre-trained BERT model (return_tensors is passed when encoding, not here)
tokenizer = BertTokenizerFast.from_pretrained(model_path)


In [None]:
# Download the bert model
bert_model = BertModel.from_pretrained(model_path, output_attentions = False).to(device)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Prepare data for classification

In [None]:
# We made the chunk size 256 in case we wanted to append linguistic features after the text,
# and stay within the 512-token limit for BERT.
# Regardless, most of the text is used for training BERT, so largely this number shouldn't matter
CHUNK_SIZE = 256

# Prepare data
def prepare_data(texts, levels, return_texts=False, first_256_only=False):
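  ''' Preprocesses the data for classification. Tokenizes the texts, and splits them into chunks of 
      CHUNK_SIZE tokens in order to be below the limit for BERT.

      Arguments:
      ---------------------
      texts: texts to prepare
      levels: the labels for the texts
      return_texts: if True, also return the original text for each chunk
      first_256_only: if True, keep only the first CHUNK_SIZE tokens of each long text

      Returns:
      ---------------------
      inputs: prepared chunked and tokenized texts
      masks: attention masks that separate content from padding
      labels: the labels corresponding to the inputs
      orig_texts: a list of the original texts (only if return_texts is True)
    '''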

  # Tokenize each text separately, without truncation or padding
  tokenized_texts = []
  for text in texts:
      tok_text = tokenizer.batch_encode_plus([text], padding=False, return_token_type_ids=False, return_tensors='pt')
      tokenized_texts.append(tok_text)

  # Convert levels to their corresponding number
  levels = [lab2ind[level] for level in levels]
  
  # Split each tokenized text into chunks of CHUNK_SIZE tokens
  orig_texts = []
  input_ids_chunks = []
  labels_chunks = []
  mask_chunks = []
  for i, (tok_text, label) in enumerate(zip(tokenized_texts, levels)):
      input_id = tok_text['input_ids'][0].tolist()
      attention_mask = tok_text['attention_mask'][0].tolist()
      if len(input_id) > CHUNK_SIZE:
          if first_256_only:
              # Keep only the first CHUNK_SIZE tokens of the text
              input_ids_chunks += [np.array(input_id[:CHUNK_SIZE])]
              labels_chunks += [label]
              mask_chunks += [[1] * CHUNK_SIZE]
              # Note: the saved text is sliced by characters, not by tokens
              orig_texts.append(texts[i][:CHUNK_SIZE])
          else:
              # Keep every full chunk of CHUNK_SIZE tokens and discard
              # the tokens left over after the last full chunk
              num_chunks = len(input_id) // CHUNK_SIZE
              usable = num_chunks * CHUNK_SIZE  # avoids the empty slice [:-0] when nothing is left over
              input_id_lst = np.array_split(np.array(input_id[:usable]), num_chunks)
              mask_lst = np.array_split(np.array(attention_mask[:usable]), num_chunks)
              for chunk in range(num_chunks):
                  # Note: character-based slice, which only approximates the token chunk
                  orig_texts.append(texts[i][CHUNK_SIZE * chunk:CHUNK_SIZE * (chunk + 1)])
              input_ids_chunks += input_id_lst
              labels_chunks += [label] * num_chunks
              mask_chunks += mask_lst
      else:
          # The whole text fits in one chunk: pad it up to CHUNK_SIZE tokens
          # and build an attention mask to distinguish content from padding
          padding = [0] * (CHUNK_SIZE - len(input_id))
          input_ids_chunks += [np.array(input_id + padding)]
          labels_chunks += [label]
          mask_chunks += [[1] * len(input_id) + padding]
          orig_texts.append(texts[i])
  
  # Test that all input chunks and masks have length CHUNK_SIZE
  for i, (input_id_chunk, mask_chunk) in enumerate(zip(input_ids_chunks, mask_chunks)):
    assert len(input_id_chunk) == CHUNK_SIZE, f"Length of text not {CHUNK_SIZE} at index {i}!"
    assert len(mask_chunk) == CHUNK_SIZE, f"Length of mask not {CHUNK_SIZE} at index {i}!"

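  # Convert all of our data into torch tensors, the required datatype for our model
  # (stacking into one NumPy array first avoids a slow list-of-arrays conversion)
  inputs = torch.tensor(np.array(input_ids_chunks))
  masks = torch.tensor(np.array(mask_chunks))
  labels = torch.tensor(labels_chunks)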

  if return_texts:
    return inputs, masks, labels, orig_texts
  else:
    return inputs, masks, labels

In [None]:
# Training data
train_inputs, train_masks, train_labels = prepare_data(train_text, train_levels, first_256_only=True)
print(train_inputs.shape)
print(train_masks.shape)
print(train_labels.shape)

torch.Size([257, 256])
torch.Size([257, 256])
torch.Size([257])
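
To make the chunking rule in `prepare_data` easier to follow, here is a minimal sketch on plain Python lists. It is not the notebook's function: the name `chunk_token_ids` and its parameters are assumptions made for illustration, and it leaves out tokenization and labels entirely.

```python
def chunk_token_ids(token_ids, chunk_size=256, pad_id=0, first_chunk_only=False):
    """Return (chunks, masks): equal-length lists of ids and 0/1 attention masks."""
    if len(token_ids) <= chunk_size:
        # Short text: a single chunk, padded on the right
        n_pad = chunk_size - len(token_ids)
        return [token_ids + [pad_id] * n_pad], [[1] * len(token_ids) + [0] * n_pad]

    # Long text: keep only full chunks and drop the leftover tail
    n_chunks = 1 if first_chunk_only else len(token_ids) // chunk_size
    chunks = [token_ids[k * chunk_size:(k + 1) * chunk_size] for k in range(n_chunks)]
    masks = [[1] * chunk_size for _ in chunks]
    return chunks, masks


# Toy example with chunk_size=4 so the result is easy to inspect
chunks, masks = chunk_token_ids(list(range(1, 11)), chunk_size=4)
print(chunks)  # two full chunks; the last two ids are dropped

short_chunks, short_masks = chunk_token_ids([5, 6], chunk_size=4)
print(short_chunks, short_masks)
```

With `first_chunk_only=True` a long text contributes exactly one chunk, which is why the shapes above have one row per text.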


In [None]:
# Validation data
valid_inputs, valid_masks, valid_labels = prepare_data(val_text, val_levels, first_256_only=True)
print(valid_inputs.shape)
print(valid_masks.shape)
print(valid_labels.shape)



torch.Size([32, 256])
torch.Size([32, 256])
torch.Size([32])


In [None]:
# Test data
test_inputs, test_masks, test_labels, test_texts = prepare_data(test_text, test_levels, return_texts=True, first_256_only=True)
print(test_inputs.shape)
print(test_masks.shape)
print(test_labels.shape)
print(len(test_texts))

torch.Size([32, 256])
torch.Size([32, 256])
torch.Size([32])
32


In [None]:
print(test_texts[0])

CAPÍtULO 7

—¡Paren ya de pelearse! —el hombre alto llega delante de la 
choza. Está enfadado.

El hombre de la trenza y el guardián de la choza paran al oír la voz.
Junto al hombre alto hay otros dos huaqueros, que miran divertidos la escena.
—¿Qué pasó..


Although we started with an 80-10-10 split between training, validation and test data in the JSON files we imported, the texts vary in length, so splitting every text into 256-token chunks (`first_256_only=False`) shifts the proportions slightly. That setting gives 1401 chunks in total: 1133 from the training data, 129 from the validation data and 139 from the test data. This is still roughly 80-10-10, with slightly more test data than validation data. The cells above instead use `first_256_only=True`, which keeps one 256-token chunk per text, so the shapes printed there stay at 257, 32 and 32.
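
The proportions quoted in the paragraph above can be checked with a few lines of arithmetic. The counts are the ones stated there; the variable names are only for illustration.

```python
# Chunk counts per split, as stated in the text
chunk_counts = {"train": 1133, "val": 129, "test": 139}

total = sum(chunk_counts.values())
shares = {name: count / total for name, count in chunk_counts.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```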

In [None]:
# Create an iterator for our data
batch_size = 3
# We'll take training samples in random order in each epoch. 
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(train_data, 
                              sampler = RandomSampler(train_data), # Select batches randomly
                              batch_size=batch_size)

# We'll just read validation set sequentially.
validation_data = TensorDataset(valid_inputs, valid_masks, valid_labels)
validation_dataloader = DataLoader(validation_data, 
                                   sampler = SequentialSampler(validation_data), # Pull out batches sequentially.
                                   batch_size=batch_size)

In [None]:
# Make sure dataloaders are correct
print(len(validation_dataloader))

11
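
The value 11 follows from the batch size: a `DataLoader` keeps the last, smaller batch by default, so the number of batches is the row count divided by the batch size, rounded up. A small sketch of that calculation, which also derives the total number of optimizer steps for the training run (the row counts are the tensor shapes printed earlier, and `epochs = 30` is the value set in the training cell):

```python
import math

batch_size = 3
n_train, n_val = 257, 32   # rows in the training and validation tensors
epochs = 30                # value used in the training cell

train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)
total_training_steps = train_batches * epochs

print(train_batches, val_batches, total_training_steps)
```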


#### BERT Class

We create a BERT class so that we have a pipeline to train the model. The class is initialized with the pretrained BERT model, a hidden layer size, two linear layers and a dropout layer. The `forward` method generates a BERT representation of the input using the pretrained model (contained in `pooler_output`), passes it through the first linear layer, a tanh activation and dropout, and then through the final linear layer. A log-softmax over the three classes is applied last, so the method returns log-probabilities.

This way, two feed-forward layers are added on top of the BERT representation in order to provide a classification. 
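
One detail is worth checking: the class ends in `nn.LogSoftmax`, while the training cell uses `nn.CrossEntropyLoss`, which applies a log-softmax of its own before the negative log-likelihood. The sketch below, in plain Python with made-up logits, suggests that a second log-softmax leaves the values unchanged, so the combination should behave like `nn.NLLLoss` applied to the model's output. The helper names are assumptions for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


logits = [2.0, -1.0, 0.5]      # made-up scores for A1, A2, B
once = log_softmax(logits)     # what the model returns
twice = log_softmax(once)      # what the loss sees after its own log-softmax

print(once)
print(twice)
print(nll(once, 0), nll(twice, 0))
```

The reason is that the exponentials of log-probabilities already sum to 1, so the normalizing term subtracted the second time is zero.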

In [None]:
#model_path = "dccuchile/bert-base-spanish-wwm-uncased" # Spanish model
model_path = 'bert-base-multilingual-cased'
class Bert_cls(nn.Module):

    def __init__(self, lab2ind, model_path, hidden_size):
        ''' Initializes the class. 

        Arguments: label to index dictionary, path to pretrained model, hidden layer size.

        Returns: None

        '''
        super(Bert_cls, self).__init__()
        self.model_path = model_path
        self.hidden_size = hidden_size
        self.bert_model = BertModel.from_pretrained(model_path)
        
        self.label_num = len(lab2ind)
        
        self.dense = nn.Linear(self.hidden_size, self.hidden_size)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(self.hidden_size, self.label_num)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_ids, input_masks):
        ''' Generates a BERT representation of the input using the pretrained model,
        passes it through the first linear layer, a tanh activation and dropout,
        and then through the final linear layer, followed by a log-softmax over the classes.
        
        Arguments: input_ids, attention mask

        Returns: log-probabilities for each class, with shape (batch size, number of labels).
        '''
        outputs = self.bert_model(input_ids, attention_mask=input_masks)
        pooler_output = outputs['pooler_output']
        
        x = self.dense(pooler_output)
        x = torch.tanh(x)
        x = self.dropout(x)
        fc_output = self.fc(x)
        output = self.softmax(fc_output)

        return output

In [None]:
# Instantiate model
bert_model = Bert_cls(lab2ind, model_path, 768).to(device)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Count number of parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(bert_model):,} trainable parameters')

The model has 178,446,339 trainable parameters
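
To see how little of that total belongs to the classification head, the head's parameters can be counted by hand. This is a small arithmetic sketch based on the layer sizes in `Bert_cls` (hidden size 768, three labels); the dropout and log-softmax layers have no parameters.

```python
hidden_size, num_labels = 768, 3

dense_params = hidden_size * hidden_size + hidden_size   # weights + biases of self.dense
fc_params = hidden_size * num_labels + num_labels        # weights + biases of self.fc
head_params = dense_params + fc_params

total_params = 178_446_339                               # figure printed above
encoder_params = total_params - head_params              # what remains for BERT itself

print(f"head: {head_params:,}  encoder: {encoder_params:,}")
print(f"head share: {head_params / total_params:.2%}")
```

Almost all trainable parameters therefore sit in the pretrained encoder, which is fine-tuned along with the head because nothing is frozen.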


#### Training the Model

In [None]:
# Parameters:
lr = 5e-6 # 2e-5
max_grad_norm = 1.0
epochs = 30
warmup_proportion = 0.1
num_training_steps  = len(train_dataloader) * epochs
num_warmup_steps = int(num_training_steps * warmup_proportion)  # the scheduler expects a whole number of steps

### Instantiate optimizer and scheduler
optimizer = AdamW(bert_model.parameters(), lr=lr, correct_bias=False)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler

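# Use Cross-Entropy loss as our loss function.
# The model already returns log-probabilities; the log-softmax inside
# CrossEntropyLoss leaves them unchanged, so this is equivalent to nn.NLLLoss().
criterion = nn.CrossEntropyLoss()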
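
The scheduler above scales the base learning rate during training. The following plain-Python sketch shows the shape of a linear warmup followed by a linear decay, as I understand `get_linear_schedule_with_warmup` to behave; it is not the library's code, and the function name `linear_warmup_multiplier` is hypothetical. The step counts assume 257 training rows, a batch size of 3 and 30 epochs.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: rise linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-6
num_training_steps = 2580   # 86 batches per epoch * 30 epochs (assumed)
num_warmup_steps = 258      # 10% of the training steps

for step in (0, 129, 258, 1419, 2580):
    factor = linear_warmup_multiplier(step, num_warmup_steps, num_training_steps)
    print(step, factor, base_lr * factor)
```

The learning rate peaks at `base_lr` when warmup ends and reaches zero at the final step.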

In [None]:
# Training the model
def train(model, iterator, optimizer, scheduler, criterion, max_grad_norm=1.0):
    
    model.train()
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):

        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        input_ids, input_masks, labels = batch

        outputs = model(input_ids, input_masks)

        loss = criterion(outputs, labels)
        # delete used variables to free GPU memory
        del batch, input_ids, input_masks, labels
        loss.backward()
        # Clip gradients here, since AdamW no longer does it internally
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()
        epoch_loss += loss.item()
        optimizer.zero_grad()
    
    # free GPU memory (compare device.type, since a torch.device does not reliably equal the string 'cuda')
    if device.type == 'cuda':
        torch.cuda.empty_cache()

    return epoch_loss / len(iterator)

In [None]:
# Evaluate function
def evaluate(model, iterator, criterion, return_preds=False):
    
    model.eval()
    
    epoch_loss = 0
    all_pred=[]
    all_label = []
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            # Add batch to GPU
            batch = tuple(t.to(device) for t in batch)
            # Unpack the inputs from our dataloader
            input_ids, input_masks, labels = batch

            outputs = model(input_ids, input_masks)
            
            loss = criterion(outputs, labels)
            epoch_loss += loss.item()

            # identify the predicted class (highest log-probability) for each example in the batch
            predicted = outputs.argmax(dim=1)
            # put all the true labels and predictions into two lists of plain integers
            all_pred.extend(predicted.cpu().tolist())
            all_label.extend(labels.cpu().tolist())
    
    accuracy = accuracy_score(all_label, all_pred)
    f1score = f1_score(all_label, all_pred, average='macro') 
    if return_preds:
      return epoch_loss / len(iterator), accuracy, f1score, all_pred
    else:
      return epoch_loss / len(iterator), accuracy, f1score

In [None]:
# create checkpoint directory
import os
save_path = './drive/My Drive/Colab Notebooks/ckpt_BERT/'
os.makedirs(save_path, exist_ok=True)

In [None]:
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, classification_report, confusion_matrix
# Train the model
loss_list = []
acc_list = []

for epoch in range(epochs):
    epoch_loss = train(bert_model, train_dataloader, optimizer, scheduler, criterion)  
    train_loss, train_acc, train_f1 = evaluate(bert_model, train_dataloader, criterion)
    val_loss, val_acc, val_f1 = evaluate(bert_model, validation_dataloader, criterion)
    # Keep the training loss and validation accuracy of every epoch for later plots
    loss_list.append(epoch_loss)
    acc_list.append(val_acc)

    # # Create checkpoint at end of each epoch
    # state = {
    #     'epoch': epoch,
    #     'state_dict': bert_model.state_dict(),
    #     'optimizer': optimizer.state_dict(),
    #     'scheduler': scheduler.state_dict()
    #     }

    #torch.save(state, "./drive/My Drive/Colab Notebooks/ckpt_BERT/BERT_"+str(epoch+1)+".pt")
    print(f'epoch: {epoch}, Train Loss: {epoch_loss:.3f}, Train Acc: {train_acc:.3f}, Train f1: {train_f1:.3f}, Dev Acc: {val_acc:.3f}, Dev f1: {val_f1:.3f}')

epoch: 0, Train Loss: 1.051, Train Acc: 0.482, Train f1: 0.231, Dev Acc: 0.469, Dev f1: 0.213
epoch: 1, Train Loss: 0.807, Train Acc: 0.732, Train f1: 0.540, Dev Acc: 0.750, Dev f1: 0.554
epoch: 2, Train Loss: 0.678, Train Acc: 0.755, Train f1: 0.557, Dev Acc: 0.781, Dev f1: 0.653
epoch: 3, Train Loss: 0.536, Train Acc: 0.837, Train f1: 0.716, Dev Acc: 0.812, Dev f1: 0.679
epoch: 4, Train Loss: 0.480, Train Acc: 0.879, Train f1: 0.831, Dev Acc: 0.812, Dev f1: 0.729
epoch: 5, Train Loss: 0.295, Train Acc: 0.938, Train f1: 0.920, Dev Acc: 0.781, Dev f1: 0.746
epoch: 6, Train Loss: 0.201, Train Acc: 0.977, Train f1: 0.968, Dev Acc: 0.812, Dev f1: 0.762
epoch: 7, Train Loss: 0.070, Train Acc: 0.992, Train f1: 0.989, Dev Acc: 0.844, Dev f1: 0.816
epoch: 8, Train Loss: 0.019, Train Acc: 0.992, Train f1: 0.989, Dev Acc: 0.812, Dev f1: 0.785
epoch: 9, Train Loss: 0.013, Train Acc: 1.000, Train f1: 1.000, Dev Acc: 0.781, Dev f1: 0.753
epoch: 10, Train Loss: 0.003, Train Acc: 1.000, Train f1: 1.

#### Evaluate on Test Data

In [None]:
# Test data
test_inputs, test_masks, test_labels, test_texts = prepare_data(test_text, test_levels, return_texts=True, first_256_only=True)
print(test_inputs.shape)
print(test_masks.shape)
print(test_labels.shape)
print(len(test_texts))

torch.Size([32, 256])
torch.Size([32, 256])
torch.Size([32])
32


In [None]:
# We'll just read test set sequentially.
test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_dataloader = DataLoader(test_data, 
                             sampler = SequentialSampler(test_data), # Pull out batches sequentially.
                             batch_size=batch_size)

In [None]:
avg_epoch_loss_test, test_accuracy, test_fscore, test_preds = evaluate(bert_model, test_dataloader, criterion, return_preds=True)
print(test_accuracy)
print(test_fscore)

0.9375
0.9184574379476929
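
The two numbers measure different things: accuracy counts correct predictions over all 32 test texts, while macro F1 averages the per-class F1 scores so that the smaller A2 class weighs as much as B. A minimal pure-Python sketch of both follows; the `macro_f1` helper and the toy labels are assumptions for illustration, not the scikit-learn implementation and not the notebook's predictions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# An accuracy of 0.9375 on 32 texts corresponds to 30 correct predictions
correct = round(0.9375 * 32)
print(correct)

# Toy example: one text of class 1 is predicted as class 2
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(toy_accuracy, macro_f1(toy_true, toy_pred))
```

In the toy example a single error costs the small class half of its recall, which pulls macro F1 below accuracy, the same direction as in the results above.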


In [None]:
# Convert predictions to strings
ind2lab = {0: 'A1', 1: 'A2', 2: 'B'}
test_preds = [ind2lab[int(x)] for x in test_preds]
test_labels = [ind2lab[int(x)] for x in test_labels]

In [None]:
len(test_texts)

32

In [None]:
# Collect the texts, predictions and gold labels in one dataframe
test_df = pd.DataFrame({'test_text': test_texts, 'test_pred': test_preds, 'test_gold': test_labels}).astype("string")
print(test_df.shape)


(32, 3)


In [None]:
test_df.head()

| | test_text | test_pred | test_gold |
|---|---|---|---|
| 0 | CAPÍtULO 7 —¡Paren ya de pelearse! —el hombre... | A1 | A1 |
| 1 | ¡Es con voz de la Biblia, o verso de Walt Whit... | B | B |
| 2 | 39. LOS CUATRO HERMANOS Un zapatero tenía cuat... | A2 | A2 |
| 3 | Una mañana entró un caballero en la tienda de ... | A1 | A1 |
| 4 | Había un viejo que tenía una hija muy hermosa.... | A1 | A1 |


In [None]:
# Write to JSON file (to_json returns None when given a path, so there is nothing to assign)
test_df.to_json('/content/drive/MyDrive/capstone/BERT_test_pred_prelim.json')

In [None]:
# Test out reading the dataframe
test_df_read = pd.read_json('/content/drive/MyDrive/capstone/BERT_test_pred_prelim.json')
test_df_read.head()

| | test_text | test_pred | test_gold |
|---|---|---|---|
| 0 | CAPÍtULO 7\n\n—¡Paren ya de pelearse! —el homb... | A1 | A1 |
| 1 | ¡Es con voz de la Biblia, o verso de Walt Whit... | B | B |
| 2 | 39. LOS CUATRO HERMANOS\nUn zapatero tenía cua... | A2 | A2 |
| 3 | Una mañana entró un caballero en la tienda de ... | A1 | A1 |
| 4 | Había un viejo que tenía una hija muy hermosa.... | A1 | A1 |


#### Testing Model

Below, I test the model with a few examples.

In [None]:
# Index to label dictionary
ind2lab =  {0 :'A1', 1: 'A2', 2: 'B'}

I took the first example manually from the corpus; it is annotated as A2-level.

In [None]:
text = 'Un chico pelirrojo, un poco gordo, se les acerca sonriendo.\n\u2014Hola, M\u00f3nica. Hola, Laura \u2014dice.\nEs Guillermo.\n\u2014Hola, Guille \u2014contestan las chicas\u2014. Llegas tarde.\n\u2014Es que me he dormido.\n\u2014S\u00ed, ya lo veo.\nGuillermo se sienta al lado de las chicas.\n\u2014\u00bfC\u00f3mo van? \u2014pregunta.\n\u2014Perdemos por 3 a 1.\n\u2014\u00bfDe verdad?\n\u2014S\u00ed, es que...\nUn grito interrumpe la conversaci\u00f3n. \u00ab\u00a1Goool!\u00bb.\n\u2014\u00bfQui\u00e9n ha marcado? \u2014pregunta Laura.\n\u2014Nosotros.\n\u2014Ha marcado Ra\u00fal, despu\u00e9s de un pase de Sergio \u2014explica M\u00f3nica, contenta.\n\n\ufffd\n\n4  f\u00fatbol sala: modalidad del f\u00fatbol que se juega en un recinto m\u00e1s peque\u00f1o, con \ncinco jugadores por equipo.\n\n5  fase eliminatoria: fase de la competici\u00f3n entre 16 equipos, anterior a los cuartos de final, entre los ocho mejores.'

# Use the prepare_data function to prepare the text for classification
inputs, masks, label = prepare_data([text], ['A1']) # prepare_data requires a label, so we pass a placeholder; it does not affect the prediction
inputs = inputs.to(device)
masks = masks.to(device)
outputs = bert_model(inputs, masks)
print(outputs)

tensor([[-4.6848e-05, -1.0671e+01, -1.0655e+01]], device='cuda:0',
       grad_fn=<LogSoftmaxBackward>)


In [None]:
probabilities, predicted = torch.max(outputs[0].cpu().data,0)
print("the prediction is: ", ind2lab[predicted.item()])

the prediction is:  A1
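
The tensor printed above holds log-probabilities, which are hard to read at a glance. Exponentiating them gives the class probabilities. A small sketch with the three values copied from that output:

```python
import math

# Log-probabilities copied from the printed tensor, in the order A1, A2, B
log_probs = [-4.6848e-05, -1.0671e+01, -1.0655e+01]
labels = ['A1', 'A2', 'B']

probs = [math.exp(lp) for lp in log_probs]
best = max(range(len(probs)), key=probs.__getitem__)

for name, p in zip(labels, probs):
    print(f"{name}: {p:.6f}")
print("most likely:", labels[best])
```

The model puts nearly all of its probability mass on a single class here, so the prediction is made with very high confidence.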


The model predicts A1, which is close to, but not the same as, the A2 annotation mentioned above. Now let's test with a random example from the Internet. I took this one from an article in El País, and it should be classified as B level (or above).

In [None]:
text = "Un estudio alerta de que hasta el 91% de la sabiduría tribal sobre plantas con potencial farmacológico y terapéutico desaparecerá con la muerte de sus lenguas."

# Use the prepare_data function to prepare the text for classification
inputs, masks, label = prepare_data([text], ['B']) # prepare_data requires a label, so we pass a placeholder; it does not affect the prediction
inputs = inputs.to(device)
masks = masks.to(device)
outputs = bert_model(inputs, masks)
print(outputs)

tensor([[-1.0738e+01, -1.1960e+01, -2.8133e-05]], device='cuda:0',
       grad_fn=<LogSoftmaxBackward>)


In [None]:
probabilities, predicted = torch.max(outputs[0].cpu().data,0)
print("the prediction is: ", ind2lab[predicted.item()])

the prediction is:  B


### Cross-Validation and Hyperparameter Tuning

The code below can be used to tune hyperparameters with a grid search that is scored on the validation split; it does not perform k-fold cross-validation. We did not run it because it would take a very long time, and our results were already very good.

In [None]:
lr_list = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4]
max_grad_norms_list = [0.8, 0.9, 1.0, 1.1, 1.2]
#num_epochs = [10, 20, 30, 40, 50]
#chunk_sizes = #from 256 to 510

In [None]:
from itertools import product

num_epochs = 20
def grid_search():

  # itertools.product visits every (lr, max_grad_norm) pair, whatever the list lengths
  for trial, (lr, max_grad_norm) in enumerate(product(lr_list, max_grad_norms_list)):

    # Start every trial from a freshly initialized model
    model = Bert_cls(lab2ind, model_path, 768)
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)
    # Recompute the schedule lengths for the number of epochs used here
    total_steps = len(train_dataloader) * num_epochs
    warmup_steps = int(total_steps * warmup_proportion)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    criterion = nn.CrossEntropyLoss()

    print(f'\ntrial: {trial}, lr: {lr}, max_grad:{max_grad_norm}')
    for epoch in range(num_epochs):
      # Train and evaluate the trial's own model, not the global bert_model
      epoch_loss = train(model, train_dataloader, optimizer, scheduler, criterion, max_grad_norm)
      train_loss, train_acc, train_f1 = evaluate(model, train_dataloader, criterion)
      val_loss, val_acc, val_f1 = evaluate(model, validation_dataloader, criterion)

      print(f'epoch: {epoch}, Train Loss: {epoch_loss:.3f}, Train Acc: {train_acc:.3f}, Train f1: {train_f1:.3f}, Dev Acc: {val_acc:.3f}, Dev f1: {val_f1:.3f}')
  
    print('\n\n')

In [None]:
grid_search()
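
A note on enumerating the grid: indexing two lists with `i % len(...)` only reaches every combination when the list lengths are coprime, whereas `itertools.product` has no such requirement. The sketch below compares the two on the lists defined above; `modulo_pairs` is a hypothetical helper written for this check.

```python
from itertools import product


def modulo_pairs(first, second):
    """Pairs visited when trial i uses first[i % len(first)] and second[i % len(second)]."""
    num_trials = len(first) * len(second)
    return {(first[i % len(first)], second[i % len(second)]) for i in range(num_trials)}


lr_list = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4]
max_grad_norms_list = [0.8, 0.9, 1.0, 1.1, 1.2]

full_grid = set(product(lr_list, max_grad_norms_list))
coprime_case = modulo_pairs(lr_list, max_grad_norms_list)           # lengths 6 and 5
non_coprime_case = modulo_pairs(lr_list, max_grad_norms_list[:4])   # lengths 6 and 4

print(len(full_grid), len(coprime_case), len(non_coprime_case))
```

With lengths 6 and 4 the modulo scheme repeats after 12 trials and covers only half of the 24 combinations, so adding or removing a single value from either list could silently shrink the search.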

### Visualizations

Some placeholder cells in case we decide to add visualizations after cross-validation.

In [None]:
# Num epochs vs accuracy (train, validation)

In [None]:
# learning rate vs accuracy (train, validation)

In [None]:
# CHUNK_SIZE vs accuracy (train, validation)