# MMAI 894 - Exercise 3
## Transfer learning with DistilBert
The goal of this exercise is to build a text classifier using the pretrained DistilBert model published by HuggingFace. You will use the GLUE/CoLA dataset (https://nyu-mll.github.io/CoLA/).

Submission instructions:

- You cannot edit this notebook directly. Save a copy to your drive, and make sure to identify yourself in the title using name and student number
- Do not insert new cells before the final one (titled "Further exploration") 
- Verify that your notebook can _restart and run all_. 
- Unlike previous assignments, please **submit all three formats: .py, .ipynb, and html** (see https://torbjornzetterlund.com/how-to-save-a-google-colab-notebook-as-html/)
 - The notebook and html submissions should show the completion of your best performing run
 - Submission files should be named: `studentID_lastname_firstname_ex3.py (or .html, .ipynb)`
- The mark will be assessed on the implementation of the functions marked with #TODO
- **Do not change anything outside the functions** unless in the further exploration section
- As you are encouraged to explore the network configuration, 20% of the mark is based on final accuracy.
- Note: You do not have to answer the questions in this notebook as part of your submission. They are meant to guide you.

- You should not need to use any additional libraries other than the ones listed below. You may want to import additional modules from those libraries, however.

In [7]:
# This cell installs and imports DistilBert, as well as the dataset, which we will
# load with tensorflow_datasets (https://www.tensorflow.org/datasets/catalog/overview)

!pip install -q transformers tfds-nightly
!pip install spacy
!python -m spacy download en_core_web_sm 
!pip install scikit-learn  -U

import tensorflow as tf

import matplotlib.pyplot as plt
import tensorflow.keras as keras
import pandas as pd

try: # this is only working on the 2nd try in colab :)
    from transformers import DistilBertTokenizer, TFDistilBertModel
except Exception as err: # so we catch the error and import it again
    from transformers import DistilBertTokenizer, TFDistilBertModel

import numpy as np
from tensorflow.keras.layers import Dense, Input, Dropout
from tensorflow.keras import regularizers

import tensorflow_datasets as tfds

dbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

import spacy, re
from spacy.attrs import ORTH
from spacy.lang.en.examples import sentences

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


# Data Preparation

In [8]:
def load_data(save_dir="./"):
    dataset = tfds.load('glue/cola', shuffle_files=True)
    train = tfds.as_dataframe(dataset["train"])
    val = tfds.as_dataframe(dataset["validation"])
    test = tfds.as_dataframe(dataset["test"])
    return train, val, test

def prepare_raw_data(df):
    raw_data = df.loc[:, ["idx", "sentence", "label"]]
    raw_data["label"] = raw_data["label"].astype('category')
    return raw_data

train, val, test = load_data()
train = prepare_raw_data(train)
val = prepare_raw_data(val)
test = prepare_raw_data(test)

Before using this data, we need to clean and QA it. Unlike MNIST, this is a text dataset, so we should be more careful. For example:
- Are there any duplicate entries? 
- What is the range of lengths for the sentences? Should we impose a minimum sentence length?
- Are there "non-sentence" entries? For example, hashtags or other features we should remove? (luckily, this dataset is quite clean, but that might not always be the case!)

NOTE! The sentences are stored as byte strings (`bytes`). To do text manipulations, you might need to decode them using `s.decode("utf-8")`

You may notice that the test set has no labels (they all show as -1). This is because GLUE is a benchmark dataset, and test predictions only get scored on submission.
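
The questions above can be explored with a few lines of plain Python. The block below is a minimal sketch on a small hypothetical sample (`raw_sentences` is made up for illustration and only mimics the byte-string shape of the `sentence` column); it decodes the bytes, looks for exact duplicates, and reports the range of sentence lengths in words.

```python
from collections import Counter

# Hypothetical sample in the same shape as the raw `sentence` column (byte strings).
raw_sentences = [
    b"Both the workers will wear carnations.",
    b"Packages carry easily.",
    b"Both the workers will wear carnations.",  # exact duplicate
    b"I want to meet at 6.",
]

# Decode bytes to str before any text manipulation.
decoded = [s.decode("utf-8") for s in raw_sentences]

# Duplicate check: sentences that occur more than once.
counts = Counter(decoded)
duplicates = {s: n for s, n in counts.items() if n > 1}

# Length check: whitespace-separated word counts per sentence.
word_counts = [len(s.split()) for s in decoded]
shortest, longest = min(word_counts), max(word_counts)

print(duplicates)
print(shortest, longest)
```

On the real data the same checks could be run on `train.sentence`; whether a minimum length is worth imposing depends on what the length distribution actually looks like.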

In [9]:
def clean_data(df):

#   # TODO: What data cleaning/filtering should you consider?
#   # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
  spacy_lang = spacy.load("en_core_web_sm")
  clean_sentence = []
  #remove duplicate entries
  clean_df = df.drop_duplicates(subset=["sentence", "label"], keep='first')
  
  lemma_dict = { 
      "yrs":  "year","hrs":  "hour","nd":  "and","avg":  "average","bc":  "because","cc":  "credit card","cu":  "credit union","cuz":  "because","dept":  "department",
      "dunno":  "do not know","em":  "them","goo":  "good","idk":  "i do not know","info":  "information","lo":  "low","prob":  "problem","probs":  "problem",
      "refi":  "refinance","thay":  "they","thye":  "they","yr":  "year","yrs":  "year","cant":  "can not","doesnt":  "do not","dont":  "do not","havent":  "have not",
      "im":  "i be","didnt":  "do not","isnt":  "be not","ive":  "i have","thats":  "that be","theres":  "there be","wont":  "will not","cust":  "customer",
      "dk":  "do not know","haven":  "have not","ain't":  "am not","aren't":  "are not","can't":  "cannot","can't've":  "cannot have","'cause":  "because","'coz":  "because",
      "could've":  "could have","couldn't":  "could not", "couldn't've":  "could not have","didn't":  "did not","doesn't":  "does not","don't":  "do not","hadn't":  "had not",
      "hadn't've":  "had not have","hasn't":  "has not", "haven't":  "have not","havent":  "have not",  "he'd":  "he would","he'd've":  "he would have","he'll":  "he will",
      "he'll've":  "he shall have","he's":  "he is","how'd":  "how did","how'd'y":  "how do you","how'll":  "how will","how's":  "how is","i'd":  "I would","i'd've":  "I would have",
      "i'll":  "I will", "i'll've":  "I will have","i'm":  "I am","i've":  "I have","isn't":  "is not","it'd":  "it would","it'd've":  "it would have","it'll":  "it will",
      "it'll've":  "it will have","it's":  "it is","let's":  "let us","ma'am":  "madam","mayn't":  "may not","might've":  "might have","mightn't":  "might not",
      "mightn't've":  "might not have","must've":  "must have","mustn't":  "must not","mustn't've":  "must not have","needn't":  "need not","needn't've":  "need not have",
      "o'clock":  "of the clock","oughtn't":  "ought not","oughtn't've":  "ought not have","shan't":  "shall not","sha'n't":  "shall not","shan't've":  "shall not have",
      "she'd":  "she would","she'd've":  "she would have","she'll":  "she will","she'll've":  "she will have","she's":  "she is","should've":  "should have","shouldn't":  "should not",
      "shouldn't've":  "should not have","so've":  "so have","so's":  "so is","that'd":  "that would","that'd've":  "that would have","that's":  "that is","there'd":  "there would",
      "there'd've":  "there would have","there's":  "there is","they'd":  "they would","they'd've":  "they would have","they'll":  "they will","they'll've":  "they will have","they're":  "they are",
      "they've":  "they have","to've":  "to have","wasn't":  "was not","we'd":  "we would","we'd've":  "we would have","we'll":  "we will","we'll've":  "we will have","we're":  "we are",
      "we've":  "we have","weren't":  "were not","what'll":  "what will","what'll've":  "what will have","what're":  "what are","what's":  "what is","what've":  "what have",
      "when's":  "when is","when've":  "when have","where'd":  "where did","where's":  "where is","where've":  "where have","who'll":  "who will","who'll've":  "who will have","who's":  "who is",
      "who've":  "who have","why's":  "why is","why've":  "why have","will've":  "will have","won't":  "will not","won't've":  "will not have","would've":  "would have","wouldn't":  "would not",
      "wouldn't've":  "would not have","y'all":  "you all","y'all'd":  "you all would","y'all'd've":  "you all would have","y'all're":  "you all are","y'all've":  "you all have","you'd":  "you would",
      "you'd've":  "you would have","you'll":  "you will","you'll've":  "you will have","you're":  "you are","you've":  "you have","4g":  "Fourth Gen",  "4G":  "Fourth Gen","3g":  "Third Gen",
      "3G":  "Third Gen","isnt":  "is not","isn": "is not","didn": "did not","didnt":  "did not","nt": "not","co": "company","cos": "because","ve": "have","doesn": "does not","m": "am","re":  "are",
      "dint":  "did not","dont":  "do not","ok ":  "okay","cust ":  "customer","cuz ":  "because","app ":  "application","& ":  "and"
  }

  for key in lemma_dict:
    case = [{ORTH: key}]
    spacy_lang.tokenizer.add_special_case(key, case)
  
  for text in df.sentence:
    substituted_text = ''
    # decode the byte string first, then lowercase it for the spaCy substitutions
    clean_str = text.decode('utf-8').lower()
    #remove letter repetition (if more than 2)
    clean_str = re.sub(r'([a-z])\1{2,}', r'\1', clean_str)
    #remove phrase repetition
    clean_str = re.sub(r'\b(\w+)( \1\b)+', r'\1', clean_str)
    
    #replace contractions with custom substitution and lemmatization
    clean_str = spacy_lang(clean_str)
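    for word in clean_str:
      if not word.is_stop:
        word = word.text if word.lemma_ == '-PRON-' else word.lemma_
        # str.replace returns a new string, so the result must be assigned back
        word = word.replace("â€™", "'")
        word = word.replace("'s", "")
        word = word.replace("â", "'")
        word = word.replace("b'", "")
        word = lemma_dict.get(word, word)
        # lowercase and keep spaces so multi-word substitutions such as "do not" survive
        word = re.sub('[^a-z ]+', '', word.lower())
        substituted_text += word + ' '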
        
    
    clean_sentence.append(re.sub(r'\s\s+', ' ', substituted_text).strip())
    
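  # concat aligns on the row index (assuming df has the default RangeIndex): rows removed
  # as duplicates have no idx/label in clean_df, become NaN here, and are dropped by dropna()
  clean_data = pd.concat([clean_df.idx, pd.Series(clean_sentence).to_frame(name='sentence'), clean_df.label],axis = 1).dropna()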

  return clean_data

# commented out because the same cleaning is applied in the next code cell
# train = clean_data(train)
# val = clean_data(val)
# test = clean_data(test)

print(train.head())
print(test.head())

    idx                                           sentence label
0  1680  b'It is this hat that it is certain that he wa...     1
1  1456  b'Her efficient looking up of the answer pleas...     1
2  4223          b'Both the workers will wear carnations.'     1
3  4093  b'John enjoyed drawing trees for his syntax ho...     1
4  7111  b'We consider Leslie rather foolish, and Lou a...     1
    idx                                         sentence label
0   163            b'Brian was wiping behind the stove.'    -1
1   131       b'You could give a headache to a Tylenol.'    -1
2  1021                          b'I want to meet at 6.'    -1
3   166                        b'Packages carry easily.'    -1
4  1039  b"Many people said they were sick who weren't."    -1


Next, we need to prepare the text for DistilBert. Instead of ingesting raw text, the model takes token IDs, which it maps to internal embeddings. Additionally, since the input is fixed-size (due to our use of batches), shorter sentences are padded, and an attention mask tells the model which tokens are part of the sentence and which are padding.

Luckily, `dbert_tokenizer` takes care of all of that for us:
- Preprocessing: https://huggingface.co/transformers/preprocessing.html
- Summary of tokenizers (DistilBert uses WordPiece): https://huggingface.co/transformers/tokenizer_summary.html#wordpiece
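
To make the roles of token IDs, padding, and the attention mask concrete, here is a toy sketch. It is not WordPiece and does not use `dbert_tokenizer`: the vocabulary `toy_vocab` and the function `toy_encode` are hypothetical and split on whitespace only. It shows the shape of the output that a fixed-length encoding produces.

```python
# Toy vocabulary; the real DistilBert vocabulary and WordPiece splitting differ.
PAD, UNK, CLS, SEP = "[PAD]", "[UNK]", "[CLS]", "[SEP]"
toy_vocab = {PAD: 0, UNK: 1, CLS: 2, SEP: 3, "packages": 4, "carry": 5, "easily": 6}

def toy_encode(sentence, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    words = sentence.lower().split()
    # Reserve two positions for [CLS] and [SEP], truncating the words if needed.
    words = words[: max_length - 2]
    tokens = [CLS] + words + [SEP]
    input_ids = [toy_vocab.get(tok, toy_vocab[UNK]) for tok in tokens]
    attention_mask = [1] * len(input_ids)
    # Pad on the right; padded positions get mask 0 so the model ignores them.
    n_pad = max_length - len(input_ids)
    input_ids += [toy_vocab[PAD]] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask

ids, mask = toy_encode("Packages carry easily", max_length=8)
print(ids)   # [2, 4, 5, 6, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The real tokenizer returns the same two kinds of arrays, with IDs drawn from its own vocabulary and with words potentially split into several sub-word pieces.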

In [10]:
def extract_text_and_y(df):
  #text = [x.decode('utf-8') for x in  df.sentence.values] #already decoded to 'utf-8' in text cleaning function, for applying re substituion
  text = list(df.sentence.values)
  # for multiclass problems, you can use sklearn.preprocessing.OneHotEncoder, but we only have two classes, so we'll use a single sigmoid output
  y = np.array([x for x in df.label.values])
  return text, y

def encode_text(text):
    # TODO: encode text using dbert_tokenizer
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    input_ids = []
    attention_mask = []
    for sentence in text:  # avoid shadowing the `text` argument
      encoded_dict = dbert_tokenizer.encode_plus(
          sentence,                  # Sentence to encode.
          add_special_tokens = True, # Add '[CLS]' and '[SEP]'
          max_length = 25,           # Pad & truncate all sentences.
          truncation = True,
          padding = 'max_length',                      
          return_attention_mask = True   # Construct attn. masks.
          )
      # Add the encoded sentence to the list.    
      input_ids.append(encoded_dict['input_ids'])
      # And its attention mask (simply differentiates padding from non-padding).
      attention_mask.append(encoded_dict['attention_mask'])

    input_ids = np.asarray(input_ids)
    attention_mask = np.array(attention_mask)

    return input_ids, attention_mask

# the following prepares the input for running in DistilBert
train_text, train_y = extract_text_and_y(clean_data(train))
val_text, val_y = extract_text_and_y(clean_data(val))
test_text, test_y = extract_text_and_y(clean_data(test))

train_input, train_mask = encode_text(train_text)
val_input, val_mask = encode_text(val_text)
test_input, test_mask = encode_text(test_text)

train_model_inputs_and_masks = {
    'inputs' : train_input,
    'masks' : train_mask
}

val_model_inputs_and_masks = {
    'inputs' : val_input,
    'masks' : val_mask
}

test_model_inputs_and_masks = {
    'inputs' : test_input,
    'masks' : test_mask
}

# Modelling

## Build and Train Model

Resources:
- BERT paper https://arxiv.org/pdf/1810.04805.pdf
- DistilBert paper: https://arxiv.org/abs/1910.01108
- DistilBert Tensorflow Documentation: https://huggingface.co/transformers/model_doc/distilbert.html#tfdistilbertmodel

In [11]:
def build_model(base_model, trainable=False, params={}):
    # TODO: build the model, with the option to freeze the parameters in distilBERT
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    # Hint 1: the cls token (token for classification in bert / distilBert) corresponds to the first element in the sequence in DistilBert. Take a look at Figure 2 in BERT paper.
    # Hint 2: this guide may be helpful for parameter freezing: https://keras.io/guides/transfer_learning/
    # Hint 3: double check that your number of parameters make sense
    # Hint 4: carefully consider your final layer activation and loss function

    # Refer to https://keras.io/api/layers/core_layers/input/
    max_seq_len = 25
    inputs = Input(shape = (max_seq_len,), dtype='int64')
    masks  = Input(shape = (max_seq_len,), dtype='int64')

    base_model.trainable = trainable

    dbert_output = base_model(inputs, attention_mask=masks)
    # last_hidden_state holds the output encoding for each input token.
    # Each encoding is a vector with 768 values. The first token fed into the model is [CLS],
    # and its encoding can be used as the input to a sentence classification head.
    dbert_last_hidden_state = dbert_output.last_hidden_state


    # Any additional layers should go here
    # use the 'params' as a dictionary for hyper parameter to facilitate experimentation
    
    dense_layer = Dense(params['dense_l_1'],activation='relu')(dbert_last_hidden_state[:, 0, :])
    dropout_layer = Dropout(params['dropout_l_1'])(dense_layer)
    dense_layer = Dense(params['dense_l_2'],activation='relu')(dropout_layer)
    dropout_layer = Dropout(params['dropout_l_2'])(dense_layer)
    dense_layer = Dense(params['dense_l_3'],activation='relu')(dropout_layer)
    probs = Dense(1, activation='sigmoid',kernel_regularizer=regularizers.l2(params['regularizer_L2']), name="output")(dense_layer)

    model = keras.Model(inputs=[inputs, masks], outputs=probs)
    model.summary()
    return model

dbert_model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
params = {  # add parameters here
    "dense_l_1": 256,
    "dense_l_2": 128,
    "dense_l_3": 64,
    "dropout_l_1": 0.3,
    "dropout_l_2": 0.3,
    "regularizer_L2": 0.001,
}

model = build_model(dbert_model, params=params)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_transform', 'vocab_projector', 'activation_13', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_3 (InputLayer)           [(None, 25)]         0           []                               
                                                                                                  
 input_4 (InputLayer)           [(None, 25)]         0           []                               
                                                                                                  
 tf_distil_bert_model_1 (TFDist  TFBaseModelOutput(l  66362880   ['input_3[0][0]',                
 ilBertModel)                   ast_hidden_state=(N               'input_4[0][0]']                
                                one, 25, 768),                                                    
                                 hidden_states=None                                         
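
Hint 3 in `build_model` asks for a sanity check on the number of parameters. As a rough worked example, the sketch below counts the parameters of the dense head from the layer widths in `params`, assuming the head takes the 768-dimensional [CLS] encoding as input and that each `Dense` layer has one weight per input-output pair plus one bias per output (`Dropout` layers add none). The helper `dense_params` is introduced here for illustration only.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

hidden_size = 768                 # width of each DistilBert token encoding
layer_widths = [256, 128, 64, 1]  # dense_l_1, dense_l_2, dense_l_3, output

head_params = 0
n_in = hidden_size
for n_out in layer_widths:
    head_params += dense_params(n_in, n_out)
    n_in = n_out

base_params = 66_362_880  # DistilBert parameter count shown in the summary
print(head_params)                # 238081
print(base_params + head_params)  # 66600961
```

With the base model frozen (`trainable=False`), only the head's parameters should be counted as trainable, so a trainable count far from this figure would suggest that the freezing or the layer wiring is not what was intended.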

In [12]:
def compile_model(model):
    # TODO: compile the model, include relevant auc metrics when training
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
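    METRICS = [
          tf.keras.metrics.BinaryAccuracy(name='accuracy'),
          tf.keras.metrics.Precision(name='precision'),
          tf.keras.metrics.Recall(name='recall'),
          # the TODO above asks for AUC metrics: ROC AUC by default, PR AUC via curve='PR'
          tf.keras.metrics.AUC(name='auc'),
          tf.keras.metrics.AUC(name='pr_auc', curve='PR')
    ]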
    optimizer = keras.optimizers.Adagrad(learning_rate=2e-3)
    loss = keras.losses.binary_crossentropy
    model.compile(optimizer=optimizer,
              loss=loss,
              metrics=METRICS)
    return model


model = compile_model(model)
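
Hint 4 in `build_model` asks you to consider the final activation and the loss together. As an illustration of why a single sigmoid output pairs naturally with binary cross-entropy, the sketch below computes both by hand for one example. The functions `sigmoid` and `binary_cross_entropy` are written here for illustration and are not the Keras implementations, which also clip probabilities for numerical stability.

```python
import math

def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p):
    """Loss for one example with label y_true in {0, 1} and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(0.0)  # a logit of 0 maps to probability 0.5
loss_uncertain = binary_cross_entropy(1, p)
loss_confident_right = binary_cross_entropy(1, sigmoid(4.0))
loss_confident_wrong = binary_cross_entropy(0, sigmoid(4.0))

print(loss_uncertain, loss_confident_right, loss_confident_wrong)
```

The loss is small when the model is confident and correct, equals log 2 when it outputs 0.5, and grows quickly when it is confident and wrong, which is the behaviour a binary classifier needs from its training signal.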

In [13]:
def train_model(model, model_inputs_and_masks_train, model_inputs_and_masks_val,
    y_train, y_val, batch_size, num_epochs):
    # TODO: train the model
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION

    history=model.fit([model_inputs_and_masks_train['inputs'],model_inputs_and_masks_train['masks']],y_train,
                      batch_size=batch_size,
                      epochs=num_epochs,
                      validation_data=([model_inputs_and_masks_val['inputs'],model_inputs_and_masks_val['masks']],y_val))

    
    return model, history

model, history = train_model(model, train_model_inputs_and_masks, val_model_inputs_and_masks, train_y, val_y, batch_size=128, num_epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


# Further exploration (REMOVE ALL CODE AFTER THIS CELL BEFORE SUBMISSION)
Any code after this is not evaluated, and must be removed before submission.
Leaving code below will result in losing marks.