## Experiment with using Transformer LM to do sentence classification

1. Finetune a classifier head on top of pretrained BERT
2. Take embeddings from pretrained BERT and train a classifier on top of them. This does not finetune BERT, since BERT is used only to produce embeddings.
3. Finetune a GPT-based LM to classify sentences.

- Use pretrained DistilBERT model from HuggingFace

In [1]:
from IPython.display import display, HTML
display(HTML("<style>:root { --jp-notebook-max-width: 100% !important; }</style>"))

In [2]:
import numpy as np
import pandas as pd

In [20]:
from functools import partial

In [3]:
from transformers import AutoTokenizer

In [4]:
from transformers import DistilBertModel, DistilBertConfig

In [5]:
from transformers import DataCollatorWithPadding

In [6]:
import evaluate

In [7]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

In [8]:
from datasets import load_dataset

## 1. Finetune a classifier head on top of pretrained BERT

In [9]:
# train_df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None).rename(columns={0: "text", 1:"label"})
# dev_df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/dev.tsv', delimiter='\t', header=None).rename(columns={0: "text", 1:"label"})
# test_df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/test.tsv', delimiter='\t', header=None).rename(columns={0: "text", 1:"label"})

# dev_df.shape

# base_url = "https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/"
# data = load_dataset("csv", data_files={"train": f"{base_url}train.tsv", "validate": f"{base_url}dev.tsv", "test": f"{base_url}test.tsv"}, delimiter="\t")

# train_df["data_type"] = "train"
# dev_df["data_type"] = "dev"
# test_df["data_type"] = "test"

# df = pd.concat([train_df, dev_df, test_df], axis=0).reset_index(drop=True)

# print(f"Shape of df: {df.shape}")

# df.head()

# ## Quite balanced
# df[df.data_type =='train'].label.value_counts(normalize=True)

## Load dataset at https://huggingface.co/datasets/stanfordnlp/sst2

In [10]:
df = load_dataset('stanfordnlp/sst2')

DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. It was pretrained with three objectives:

1. Distillation loss: the model was trained to return the same probabilities as the BERT base model.
2. Masked language modeling (MLM): this is part of the original training loss of the BERT base model. Given a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the model, and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
3. Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model.
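
The three objectives above can be sketched for a single masked position with plain Python. This is only an illustration: the logits, hidden states, temperature and the equal loss weights are made-up assumptions, not the values used to train DistilBERT.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))

def mlm_loss(student_logits, target_id):
    """Cross-entropy of the student's prediction for one masked position."""
    return -math.log(softmax(student_logits)[target_id])

def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between student and teacher hidden states."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

# Toy values for one masked position over a 4-token vocabulary (made up).
teacher_logits = [2.0, 0.5, 0.1, -1.0]
student_logits = [1.5, 0.7, 0.0, -0.8]

# Equal weights are a hypothetical choice for this sketch.
total = (
    1.0 * distillation_loss(student_logits, teacher_logits)
    + 1.0 * mlm_loss(student_logits, target_id=0)
    + 1.0 * cosine_loss([0.2, -0.1, 0.4], [0.25, -0.05, 0.35])
)
print(f"combined loss: {total:.4f}")
```

Each term is zero (or minimal) when the student matches the teacher exactly, which is what pushes the smaller model toward the teacher's behaviour.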

https://huggingface.co/distilbert/distilbert-base-uncased

## Step1: Get tokenizer for specific model

In [11]:
## Based on the name of the model(distilbert), AutoTokenizer automatically instantiates one of the tokenizer classes of the library from a pretrained model vocabulary.
## https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer
## WordPiece-based tokenizer
## Returns DistilBertTokenizer or DistilBertTokenizerFast based on use_fast=True
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", use_fast=True)



In [12]:
print(f"tokenizer model_max_length: {tokenizer.model_max_length}") ## A very large value means no limit is set for this tokenizer, so it cannot be relied on for truncation
print(f"tokenizer truncation_side: {tokenizer.truncation_side}")
print(f"tokenizer padding_side: {tokenizer.padding_side}") 
print(f"tokenizer model_input_names: {tokenizer.model_input_names}") 
print(f"tokenizer bos_token: {tokenizer.bos_token}") 
print(f"tokenizer eos_token: {tokenizer.eos_token}") 
print(f"tokenizer unk_token: {tokenizer.unk_token}") 
print(f"tokenizer sep_token: {tokenizer.sep_token}") 
print(f"tokenizer pad_token: {tokenizer.pad_token}") 
print(f"tokenizer cls_token: {tokenizer.cls_token}") 
print(f"tokenizer mask_token: {tokenizer.mask_token}") 

tokenizer model_max_length: 1000000000000000019884624838656
tokenizer truncation_side: right
tokenizer padding_side: right
tokenizer model_input_names: ['input_ids', 'attention_mask']
tokenizer bos_token: None
tokenizer eos_token: None
tokenizer unk_token: [UNK]
tokenizer sep_token: [SEP]
tokenizer pad_token: [PAD]
tokenizer cls_token: [CLS]
tokenizer mask_token: [MASK]


In [13]:
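## DistilBertConfig() returns the library defaults, which describe the distilbert-base-uncased architecture (it does not read the checkpoint's own config file)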
configuration = DistilBertConfig()
print(f"DistilBERT config: {configuration}")

DistilBERT config: DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.40.1",
  "vocab_size": 30522
}
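
As a rough check on the "smaller than BERT" claim, the encoder's parameter count can be estimated from the config values above. The sketch assumes each transformer block has four attention projections with bias, a two-layer feed-forward network and two LayerNorms; the helper `linear` is defined here only for illustration.

```python
# Values copied from the DistilBertConfig printout above.
vocab_size, max_positions = 30522, 512
dim, hidden_dim, n_layers = 768, 3072, 6

def linear(n_in, n_out):
    """Parameters in a dense layer with bias."""
    return n_in * n_out + n_out

layer_norm = 2 * dim  # one scale and one shift vector

# Word embeddings + position embeddings + embedding LayerNorm.
embeddings = vocab_size * dim + max_positions * dim + layer_norm

# Query, key, value and output projections.
attention = 4 * linear(dim, dim)

# Assumed two-layer feed-forward network: dim -> hidden_dim -> dim.
feed_forward = linear(dim, hidden_dim) + linear(hidden_dim, dim)

block = attention + layer_norm + feed_forward + layer_norm
total = embeddings + n_layers * block

print(f"embeddings:    {embeddings:,}")
print(f"per block:     {block:,}")
print(f"encoder total: {total:,}")
```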



In [14]:
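def preprocess_function(df, text_column="text"):
    ## truncation=True with max_length=512 ensures that sequences are no longer than DistilBERT’s maximum input length (max_position_embeddings)
    ## max_length is passed explicitly because tokenizer.model_max_length is not set for this checkpoint, so truncation=True alone would not truncate
    ## https://huggingface.co/docs/transformers/v4.40.1/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__
    return tokenizer(df[text_column], truncation=True, max_length=512)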

## tokenizer returns input_ids (token ids) and attention_mask, which are the inputs to the model
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [15]:
tokenizer(['a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films', 'my name is hardik'], truncation=True)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': [[101, 1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603, 1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152, 102], [101, 2026, 2171, 2003, 2524, 5480, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}
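
The warning above says that `truncation=True` had no effect because no maximum length was available; passing `max_length` to the tokenizer call is one way to set it. The toy function below only illustrates what right-side truncation of a single `[CLS] ... [SEP]` sequence looks like. `truncate_ids` is a hypothetical helper, not the tokenizer's implementation.

```python
def truncate_ids(input_ids, max_length):
    """Cut a [CLS] ... [SEP] id sequence to max_length, keeping the final [SEP]."""
    if len(input_ids) <= max_length:
        return list(input_ids)
    return list(input_ids[: max_length - 1]) + [input_ids[-1]]

# First encoded sentence from the output above: 20 ids, 101 = [CLS], 102 = [SEP].
ids = [101, 1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603,
       1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152, 102]

print(truncate_ids(ids, max_length=8))         # shortened to 8 ids
print(len(truncate_ids(ids, max_length=512)))  # already short enough, unchanged
```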

## encode returns input_ids
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [16]:
sample_encoding = tokenizer.encode('A stirring , Funny and finally transporting re imagining of beauty and the beast and 1930s horror films amzertfys', truncation=True)

## decode converts input_ids back to tokens and joins them into a sentence
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [17]:
tokenizer.decode(sample_encoding)

'[CLS] a stirring, funny and finally transporting re imagining of beauty and the beast and 1930s horror films amzertfys [SEP]'

## See the tokenization (Wordpiece result) using convert_ids_to_tokens
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [18]:
tokenizer.convert_ids_to_tokens(sample_encoding)

['[CLS]',
 'a',
 'stirring',
 ',',
 'funny',
 'and',
 'finally',
 'transporting',
 're',
 'imagining',
 'of',
 'beauty',
 'and',
 'the',
 'beast',
 'and',
 '1930s',
 'horror',
 'films',
 'am',
 '##zer',
 '##tf',
 '##ys',
 '[SEP]']
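
The split of the unknown word `amzertfys` into `am`, `##zer`, `##tf`, `##ys` follows from WordPiece's greedy longest-match-first lookup. Here is a minimal sketch of that lookup for one word, using a hypothetical mini-vocabulary (the real vocabulary has 30522 entries) and skipping the lower-casing and punctuation handling that the real tokenizer performs first.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one lower-cased word into WordPiece tokens by greedy longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical mini-vocabulary chosen for this example only.
toy_vocab = {"am", "a", "##zer", "##tf", "##ys", "##z", "funny"}

print(wordpiece("amzertfys", toy_vocab))
print(wordpiece("funny", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```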

## Step2: Tokenize the entries in text column to get input_ids(token_ids) and attention masks

In [21]:
#tokenized_dict_list = preprocess_function(df, text_column="text")
tokenized_df = df.map(partial(preprocess_function, text_column="sentence"), batched=True)

In [24]:
tokenized_df

DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label', 'input_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label', 'input_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1821
    })
})

## Step3: Pad shorter sequences to the length of the longest sequence in each batch (in Step 2 we truncated long sequences)

https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/data_collator#transformers.DataCollatorWithPadding

In [25]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
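
With `padding=True` the collator pads each batch to its own longest sequence (dynamic padding). A rough pure-Python sketch of that behaviour follows, assuming right-side padding and pad id 0 as reported by the tokenizer and config above. `pad_batch` is a hypothetical helper, not the library code, and the second example sequence is made up.

```python
def pad_batch(features, pad_token_id=0):
    """Right-pad every sequence to the longest one in the batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        gap = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * gap)
        # Padded positions get attention_mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * gap)
    return batch

features = [
    {"input_ids": [101, 2026, 2171, 2003, 2524, 5480, 102], "attention_mask": [1] * 7},
    {"input_ids": [101, 6057, 102], "attention_mask": [1] * 3},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch instead of to a fixed length keeps batches of short sentences small, which matters for SST-2 where most sentences are far below 512 tokens.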

## Step 4: Get evaluation metric (scikit-learn or evaluate library)
https://huggingface.co/docs/evaluate/package_reference/loading_methods

In [26]:
##evaluate.list_evaluation_modules(module_type="metric", include_community=True, with_details=True)

In [27]:
[metric for metric in evaluate.list_evaluation_modules(module_type="metric", include_community=True) if 'f1' in metric]

['nbansal/semf1']

In [28]:
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

In [29]:
def compute_metrics(eval_pred):
    ## eval_pred holds the model logits and the true labels
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    ## each evaluate metric is computed with .compute(), which returns a dict such as {"accuracy": 0.9}
    metrics = {}
    for metric in (accuracy, precision, recall, f1):
        metrics.update(metric.compute(predictions=predictions, references=labels))
    return metrics
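
For reference, the four metrics can also be computed by hand for a binary task. The sketch below uses made-up predictions and labels and treats label 1 as the positive class; `binary_metrics` is an illustrative helper, not part of the `evaluate` library.

```python
def binary_metrics(predictions, labels, positive=1):
    """Accuracy, precision, recall and F1 for a binary task, computed by hand."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    correct = sum(p == y for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision, "recall": recall, "f1": f1}

# Made-up predictions and labels (1 = positive, 0 = negative).
preds = [1, 0, 1, 1, 0, 1]
labels = [1, 0, 0, 1, 1, 1]
result = binary_metrics(preds, labels)
print(result)
```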

## Step 5: Get id2label and label2id mapping

In [30]:
id2label = {0:"negative", 1:"positive"}
label2id = {"negative":0, "positive":1}
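
A small sketch of how this mapping is used at prediction time: the class with the highest logit is looked up in `id2label`. The logits below are made up, and `predict_label` is a hypothetical helper written for this example.

```python
import math

id2label = {0: "negative", 1: "positive"}

def predict_label(logits):
    """Return the label name and softmax probability of the highest-scoring class."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

label, prob = predict_label([-1.2, 2.3])  # made-up logits for one sentence
print(label, round(prob, 3))
```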

## Step 6: Train model
1. Use the Trainer API by Hugging Face, which abstracts the training loop
2. Manually write the training loop in native PyTorch/TensorFlow

### Step 6.1 Use Trainer API
https://huggingface.co/docs/transformers/en/training#train-with-pytorch-trainer

In [31]:
## https://huggingface.co/docs/transformers/v4.40.1/en/model_doc/auto#transformers.AutoModelForSequenceClassification
## model will be instantiated with a classification head (Linear -> ReLU -> Dropout -> Linear) that outputs logits; softmax is applied inside the loss, not in the model
## https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/configuration#transformers.PretrainedConfig
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [32]:
print(f"Pretrained model with classification head architecture: {model}")

Pretrained model with classification head architecture: DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN

In [None]:
##https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/trainer#transformers.TrainingArguments
training_arguments = TrainingArguments(output_dir="./results", learning_rate=2e-5, per_device_train_batch_size=16, per_device_eval_batch_size=16, num_train_epochs=5, weight_decay=0.01)
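
To get a feel for what these arguments imply, the number of optimizer steps can be worked out from the train split size. The sketch assumes a single device, no gradient accumulation, and linear learning-rate decay without warmup (the usual defaults, but worth checking against the installed version).

```python
import math

train_examples = 67349  # size of the SST-2 train split
batch_size = 16         # per_device_train_batch_size
epochs = 5              # num_train_epochs
peak_lr = 2e-5          # learning_rate

# The last, smaller batch is kept, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

def linear_lr(step):
    """Learning rate after `step` updates under linear decay with no warmup."""
    return peak_lr * max(0.0, 1 - step / total_steps)

print(steps_per_epoch, total_steps)
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```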

In [None]:
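## the validation split of stanfordnlp/sst2 is named "validation" (see the DatasetDict printed above)
trainer = Trainer(model=model, args=training_arguments, train_dataset=tokenized_df["train"], eval_dataset=tokenized_df["validation"], tokenizer=tokenizer, data_collator=data_collator)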