## Experiment with using a Transformer LM for sentence classification

<!-- 1. Finetune a classifier head on top of pretrained BERT -->
Take embeddings from a pretrained BERT model and train a logistic regression classifier on top of them. This is not finetuning of BERT, since BERT's weights stay fixed and the model is used only to produce embeddings.
<!-- 3. Finetune GPT based LM to classify sentence. -->

In [1]:
from IPython.display import display, HTML
display(HTML("<style>:root { --jp-notebook-max-width: 100% !important; }</style>"))

In [2]:
import numpy as np
import pandas as pd

In [3]:
from functools import partial

In [4]:
from transformers import AutoTokenizer

In [5]:
from transformers import DistilBertModel, DistilBertConfig

In [6]:
# from transformers import DataCollatorWithPadding

In [7]:
# import evaluate

In [8]:
# from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

In [9]:
from datasets import load_dataset

In [10]:
import torch

In [11]:
from torch.utils.data import DataLoader

In [12]:
from tqdm.auto import tqdm

In [13]:
from transformers import pipeline

In [14]:
torch.cuda.is_available()

False

# Train a logistic classifier on top of pretrained BERT

## Load dataset at https://huggingface.co/datasets/stanfordnlp/sst2

In [15]:
train_df = load_dataset('stanfordnlp/sst2', split="train").shuffle().select(range(5000))
validation_df = load_dataset('stanfordnlp/sst2', split="validation")
test_df = load_dataset('stanfordnlp/sst2', split="test")

In [16]:
test_df

Dataset({
    features: ['idx', 'sentence', 'label'],
    num_rows: 1821
})

DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. It was pretrained with three objectives:

1. Distillation loss: the model was trained to return the same probabilities as the BERT base model.
2. Masked language modeling (MLM): this is part of the original training loss of the BERT base model. Given a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the model, and has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
3. Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model.

https://huggingface.co/distilbert/distilbert-base-uncased
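
The masking step of objective 2 can be sketched in a few lines. This is only an illustration: the function name `mask_tokens`, the fixed seed, and the "always mask at least one token" rule are my own assumptions, and the sketch simply replaces every selected token with `[MASK]` rather than reproducing the exact pretraining recipe.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace roughly `mask_rate` of the tokens with `mask_token`.

    Returns the masked sequence and a dict {position: original token},
    i.e. the targets the model would be asked to predict.
    """
    rng = random.Random(seed)
    # Assumption: mask at least one token so short sentences still yield a target
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets

tokens = "a stirring , funny and finally transporting re imagining of beauty and the beast".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

Because the model sees the whole masked sentence at once, it can use context on both sides of each `[MASK]` when predicting the target.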

## Step1: Get tokenizer for specific model

In [17]:
## Based on the model name (distilbert), AutoTokenizer automatically instantiates the matching tokenizer class from the pretrained model's vocabulary.
## https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer
## WordPiece-based tokenizer
## Returns DistilBertTokenizerFast when use_fast=True, otherwise DistilBertTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", use_fast=True)
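
To make "WordPiece based" concrete, here is a minimal sketch of greedy longest-match-first sub-word splitting. The function `wordpiece_tokenize` and the toy vocabulary are hypothetical and chosen for illustration; the real tokenizer also normalizes and pre-splits the text, which is left out here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than the model's real one
toy_vocab = {"my", "name", "is", "hard", "##ik", "film", "##s"}
print(wordpiece_tokenize("hardik", toy_vocab))  # ['hard', '##ik']
print(wordpiece_tokenize("films", toy_vocab))   # ['film', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))     # ['[UNK]']
```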



In [18]:
print(f"tokenizer model_max_length: {tokenizer.model_max_length}") ## A very large value => no real limit is recorded, so it cannot be relied on for truncation
print(f"tokenizer truncation_side: {tokenizer.truncation_side}")
print(f"tokenizer padding_side: {tokenizer.padding_side}") 
print(f"tokenizer model_input_names: {tokenizer.model_input_names}") 
print(f"tokenizer bos_token: {tokenizer.bos_token}") 
print(f"tokenizer eos_token: {tokenizer.eos_token}") 
print(f"tokenizer unk_token: {tokenizer.unk_token}") 
print(f"tokenizer sep_token: {tokenizer.sep_token}") 
print(f"tokenizer pad_token: {tokenizer.pad_token}") 
print(f"tokenizer cls_token: {tokenizer.cls_token}") 
print(f"tokenizer mask_token: {tokenizer.mask_token}") 

tokenizer model_max_length: 1000000000000000019884624838656
tokenizer truncation_side: right
tokenizer padding_side: right
tokenizer model_input_names: ['input_ids', 'attention_mask']
tokenizer bos_token: None
tokenizer eos_token: None
tokenizer unk_token: [UNK]
tokenizer sep_token: [SEP]
tokenizer pad_token: [PAD]
tokenizer cls_token: [CLS]
tokenizer mask_token: [MASK]
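
The `model_max_length` printed above equals `int(1e30)`, which I understand to be the placeholder used when no limit is recorded for the tokenizer. A small sketch of how one might pick a usable truncation length instead; the helper `resolve_max_length` is hypothetical, and 512 is DistilBERT's `max_position_embeddings`.

```python
VERY_LARGE = int(1e30)  # prints as 1000000000000000019884624838656

def resolve_max_length(tokenizer_max_length, max_position_embeddings):
    """Return a usable truncation length.

    Falls back to the model's position-embedding count when the tokenizer
    reports a placeholder-sized limit.
    """
    if tokenizer_max_length >= VERY_LARGE:
        return max_position_embeddings
    return min(tokenizer_max_length, max_position_embeddings)

print(resolve_max_length(1000000000000000019884624838656, 512))  # 512
print(resolve_max_length(128, 512))                               # 128
```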


In [19]:
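## Inspect the DistilBERT configuration. DistilBertConfig() returns the library defaults (not values read from the checkpoint), which correspond to the distilbert-base-uncased architecture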
configuration = DistilBertConfig()
print(f"DistilBERT config: {configuration}")

DistilBERT config: DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.40.1",
  "vocab_size": 30522
}
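
The config values above are enough to estimate the size of the encoder. A rough sketch, assuming every linear layer has a bias and every transformer block has two LayerNorms (one after attention, one after the feed-forward network; the second is my assumption about the block layout). The helper name `distilbert_param_count` is made up for illustration.

```python
def distilbert_param_count(vocab_size=30522, max_positions=512, dim=768,
                           hidden_dim=3072, n_layers=6):
    """Count the parameters of the base encoder from the config values."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix + bias

    layer_norm = 2 * dim  # scale + shift vectors
    embeddings = vocab_size * dim + max_positions * dim + layer_norm

    attention = 4 * linear(dim, dim)  # query, key, value and output projections
    ffn = linear(dim, hidden_dim) + linear(hidden_dim, dim)
    # Assumption: two LayerNorms per block (after attention and after the FFN)
    block = attention + ffn + 2 * layer_norm

    return embeddings + n_layers * block

print(f"{distilbert_param_count():,}")  # 66,362,880
```

Roughly a third of the parameters sit in the word embedding table (30522 x 768), which is why the vocabulary size matters so much for model size.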



In [20]:
def preprocess_function(df, text_column="text"):
    ## truncation=True only truncates when a maximum length is known. This tokenizer reports no usable
    ## model_max_length, so max_length is passed explicitly to cap sequences at DistilBERT's 512 positions.
    ## https://huggingface.co/docs/transformers/v4.40.1/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__
    ## padding="longest" pads every sequence in the batch to the longest sequence in that batch,
    ## which may be shorter than the 512-token maximum.
    return tokenizer(df[text_column], truncation=True, max_length=512, padding="longest")

## The tokenizer returns input_ids (token ids) and attention_mask, which are the inputs to the model
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [21]:
tokenizer(['a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films', 'my name is hardik'], truncation=True, padding="longest")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': [[101, 1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603, 1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152, 102], [101, 2026, 2171, 2003, 2524, 5480, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
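
What `padding="longest"` does to the two sentences above can be reproduced by hand. A minimal sketch (the helper `pad_batch` is hypothetical; the ids are the unpadded ones from the output above):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded ids copied from the output above ([CLS] = 101, [SEP] = 102)
long_ids = [101, 1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603,
            1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152, 102]
short_ids = [101, 2026, 2171, 2003, 2524, 5480, 102]

batch = pad_batch([long_ids, short_ids])
print(len(batch["input_ids"][1]))  # 20
print(batch["attention_mask"][1])
```

The attention mask is what lets the model ignore the `[PAD]` positions, so the padded and unpadded versions of a sentence should produce the same embedding for the real tokens.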

In [22]:
tokenizer(['a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films', 'my name is hardik'], truncation=True, padding="max_length", max_length=512)

{'input_ids': [[101, 1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603, 1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [23]:
#tokenizer.convert_ids_to_tokens(sample_encoding)

## Step2: Tokenize the entries in text column to get input_ids(token_ids) and attention masks

In [24]:
#tokenized_dict_list = preprocess_function(df, text_column="text")
train_tokenized_df = train_df.map(partial(preprocess_function, text_column="sentence"), batched=True)
validation_tokenized_df = validation_df.map(partial(preprocess_function, text_column="sentence"), batched=True)
test_tokenized_df = test_df.map(partial(preprocess_function, text_column="sentence"), batched=True)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

## This shows that the padding is different for different batches

In [25]:
max_input_id_length = -1
min_input_id_length = np.inf

## Track the shortest and longest padded sequence across all three splits
for dataset in (train_tokenized_df, validation_tokenized_df, test_tokenized_df):
    for row in dataset:
        ll = len(row["input_ids"])
        max_input_id_length = max(max_input_id_length, ll)
        min_input_id_length = min(min_input_id_length, ll)

In [26]:
print(f"min_input_id_length: {min_input_id_length}, max_input_id_length: {max_input_id_length}")

min_input_id_length: 51, max_input_id_length: 64
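
The spread between 51 and 64 comes from `map(..., batched=True)` tokenizing the data in chunks (1000 rows at a time by default), with `padding="longest"` applied inside each chunk. A toy sketch of the effect, using made-up token counts and a hypothetical helper name:

```python
def padded_lengths_per_batch(token_counts, batch_size):
    """Return the padded length each batch would get with padding='longest'."""
    lengths = []
    for start in range(0, len(token_counts), batch_size):
        chunk = token_counts[start:start + batch_size]
        lengths.append(max(chunk))  # every row in the chunk is padded to this length
    return lengths

# Hypothetical token counts for 6 sentences, processed 2 at a time
token_counts = [12, 51, 30, 64, 8, 55]
print(padded_lengths_per_batch(token_counts, batch_size=2))  # [51, 64, 55]
```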


## Step 3: Now I have to iterate over the dataset and make sure everything is padded to the exact same length = 64 in this run (i.e. the longest overall, else the model will not be able to handle it)

## Code inspired by https://madewithml.com/courses/mlops/training/#model

In [27]:
def pad_to_max_length(batch, column_name, max_length=max_input_id_length, dtype=np.int32, pad_value=0):
    ## UDF to be used in the map function
    arr_to_pad = batch[column_name]
    row_count = len(arr_to_pad)
    padded_arr = np.full((row_count, max_length), fill_value=pad_value, dtype=dtype)
    for i, row in enumerate(arr_to_pad):
        padded_arr[i][:len(row)] = row
    return {column_name: padded_arr.tolist()}

In [28]:
## Use Hugging Face map (https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.Dataset.map)
train_tokenized_df = train_tokenized_df.map(partial(pad_to_max_length, column_name="input_ids"), batched=True)
train_tokenized_df = train_tokenized_df.map(partial(pad_to_max_length, column_name="attention_mask"), batched=True)

validation_tokenized_df = validation_tokenized_df.map(partial(pad_to_max_length, column_name="input_ids"), batched=True)
validation_tokenized_df = validation_tokenized_df.map(partial(pad_to_max_length, column_name="attention_mask"), batched=True)

test_tokenized_df = test_tokenized_df.map(partial(pad_to_max_length, column_name="input_ids"), batched=True)
test_tokenized_df = test_tokenized_df.map(partial(pad_to_max_length, column_name="attention_mask"), batched=True)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [29]:
len(train_tokenized_df[1]["input_ids"])

64

## Step4: Load pretrained DistilBert model

In [30]:
model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased")

In [31]:
print(model)

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Li

## Step 5: Remove unnecessary columns, e.g. tokenized_df = tokenized_df.remove_columns(["sentence", "idx"])

In [32]:
filtered_train_tokenized_df = train_tokenized_df.remove_columns(["sentence", "idx", "label"])

In [33]:
filtered_validation_tokenized_df = validation_tokenized_df.remove_columns(["sentence", "idx", "label"])

In [34]:
filtered_test_tokenized_df = test_tokenized_df.remove_columns(["sentence", "idx", "label"])

## Set format to PyTorch tensors

In [35]:
filtered_train_tokenized_df.set_format("torch")
filtered_validation_tokenized_df.set_format("torch")
filtered_test_tokenized_df.set_format("torch")

## Step 6: Prepare data using DataLoader
This is not needed, since we only need embeddings from the pretrained model; we do not have to finetune or train the model on our data.

In [36]:
## CORRECT BUT UNNECESSARY
# train_dataloader = DataLoader(filtered_train_tokenized_df, shuffle=True, batch_size=8)
# eval_dataloader = DataLoader(filtered_validation_tokenized_df, shuffle=True, batch_size=8)

## Step7: Set device to cuda if available

In [37]:
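device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = model.to(device)  ## the inputs are moved to `device` later, so the model must live there too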

## Step 8: Way 1: Get train data pretrained embeddings corresponding to [CLS] token

In [38]:
## CORRECT BUT UNNECESSARY
# progress_bar = tqdm(range(len(train_dataloader)))

# with torch.no_grad():
#     #train_last_hidden_states = model(input_ids, attention_mask=attention_mask)
#     for batch in train_dataloader:
#         ## Bring tensor to device
#         batch = {k: v.to(device) for k, v in batch.items()}
#         ## Forward pass only (no gradients are tracked inside torch.no_grad())
#         outputs = model(**batch)
#         progress_bar.update(1)

In [39]:
filtered_train_tokenized_df["input_ids"]

tensor([[  101,  2529,  3325,  ...,     0,     0,     0],
        [  101,  6649,  5493,  ...,     0,     0,     0],
        [  101,  2002,  1005,  ...,     0,     0,     0],
        ...,
        [  101,  1037,  3185,  ...,     0,     0,     0],
        [  101,  2053,  3815,  ...,     0,     0,     0],
        [  101, 17475,  1996,  ...,     0,     0,     0]])

In [40]:
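def extract_hidden_state(batch):
    ## The datasets are already in torch format, so as_tensor avoids the copy (and the warning) that torch.tensor(v) triggers
    inputs = {k: torch.as_tensor(v).to(device) for k, v in batch.items()}
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state

    ## Keep only the hidden state at position 0, i.e. the [CLS] token, as the sentence embedding
    return {"hidden_state": last_hidden_state[:, 0, :].cpu().numpy()}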
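
The slice `[:, 0, :]` is the key step, so here is a shape-only sketch of it. The array below is a random stand-in for `last_hidden_state`, not real model output, and the sizes are just example values.

```python
import numpy as np

batch_size, seq_len, dim = 4, 64, 768
# Stand-in for model(**inputs).last_hidden_state; random values, not real embeddings
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, dim))

# Position 0 of every sequence is the [CLS] token, so this keeps one vector per sentence
cls_embeddings = last_hidden_state[:, 0, :]
print(cls_embeddings.shape)  # (4, 768)
```

Each sentence is reduced from a (seq_len, 768) matrix to a single 768-dimensional vector, which is the feature row later given to the logistic classifier.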

In [41]:
filtered_train_tokenized_df

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 5000
})

In [42]:
train_last_hidden_states = filtered_train_tokenized_df.map(extract_hidden_state, batched=True)
validation_last_hidden_states = filtered_validation_tokenized_df.map(extract_hidden_state, batched=True)
test_last_hidden_states = filtered_test_tokenized_df.map(extract_hidden_state, batched=True)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]



In [47]:
train_last_hidden_states["hidden_state"]

tensor([[-0.1911, -0.0145, -0.3096,  ..., -0.1753,  0.5269,  0.3486],
        [-0.1433, -0.1866,  0.0773,  ..., -0.1666,  0.2274,  0.2909],
        [-0.1332, -0.0910, -0.2276,  ...,  0.0641,  0.5383,  0.2201],
        ...,
        [-0.1054, -0.1327,  0.0182,  ..., -0.1939,  0.2630,  0.1924],
        [-0.0829, -0.0046, -0.0758,  ..., -0.1752,  0.3061,  0.2136],
        [-0.2615,  0.1163, -0.0217,  ..., -0.1203,  0.1313,  0.3862]])

In [43]:
# # This takes a lot of memory, not possible to do on personal laptop
# with torch.no_grad():
#     train_last_hidden_states = model(input_ids=filtered_train_tokenized_df["input_ids"], attention_mask=filtered_train_tokenized_df["attention_mask"])
#     validation_last_hidden_states = model(input_ids=filtered_validation_tokenized_df["input_ids"], attention_mask=filtered_validation_tokenized_df["attention_mask"])
#     test_last_hidden_states = model(input_ids=filtered_test_tokenized_df["input_ids"], attention_mask=filtered_test_tokenized_df["attention_mask"])

In [44]:
# train_features = train_last_hidden_states.last_hidden_state[:,0,:].numpy()
# validation_features = validation_last_hidden_states.last_hidden_state[:,0,:].numpy()
# test_features = test_last_hidden_states.last_hidden_state[:,0,:].numpy()

In [48]:
train_features = train_last_hidden_states["hidden_state"]
validation_features = validation_last_hidden_states["hidden_state"]
test_features = test_last_hidden_states["hidden_state"]

In [49]:
print(f"Shape of train_features: {train_features.shape}")
print(f"Shape of validation_features: {validation_features.shape}")
print(f"Shape of test_features: {test_features.shape}")

Shape of train_features: torch.Size([5000, 768])
Shape of validation_features: torch.Size([872, 768])
Shape of test_features: torch.Size([1821, 768])


In [50]:
train_labels = train_tokenized_df["label"]
validation_labels = validation_tokenized_df["label"]

In [51]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [52]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
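
The warning above suggests two remedies: scale the features or raise `max_iter`. A minimal sketch of both on synthetic data (random features standing in for the [CLS] embeddings, so the numbers say nothing about SST-2). Applying this to `train_features` may change the scores reported in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 768-dimensional [CLS] features (smaller for speed)
rng = np.random.default_rng(0)
X = rng.normal(loc=0.2, scale=3.0, size=(200, 16))
true_w = rng.normal(size=16)
y = (X @ true_w > 0).astype(int)

# Standardize the features, then give the solver a larger iteration budget
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

n_iter = clf.named_steps["logisticregression"].n_iter_[0]
print(f"solver iterations used: {n_iter}")
print(f"training accuracy: {clf.score(X, y):.3f}")
```

Putting the scaler inside the pipeline means the same scaling learned on the training split is reused when calling `score` or `predict` on validation features.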


In [53]:
lr_clf.score(validation_features, validation_labels)

0.8268348623853211

In [54]:
validation_prediction = lr_clf.predict(validation_features)

In [55]:
print(f"precision: {precision_score(validation_labels, validation_prediction)}")
print(f"recall: {recall_score(validation_labels, validation_prediction)}")
print(f"f1: {f1_score(validation_labels, validation_prediction)}")

precision: 0.8248337028824834
recall: 0.8378378378378378
f1: 0.8312849162011173
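
For reference, the three scores follow directly from the confusion-matrix counts. The counts below are not printed anywhere in this notebook; I inferred them from the reported scores and the 872 validation rows, so treat them as an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts, inferred from the reported scores (tp + fp + fn + tn = 872)
tp, fp, fn, tn = 372, 79, 72, 349

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"precision: {precision}")
print(f"recall: {recall}")
print(f"f1: {f1}")
print(f"accuracy: {accuracy}")
```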


In [56]:
train_features.shape

torch.Size([5000, 768])

## Step 8: Way 2: Use HuggingFace pipeline to get embeddings from pretrained model
https://huggingface.co/tasks/feature-extraction

In [None]:
feature_extractor = pipeline("feature-extraction", framework="pt", model="distilbert/distilbert-base-uncased")

In [None]:
train_hidden_states = feature_extractor(train_df["sentence"], return_tensors="pt")  # [0].numpy().mean(axis=0)

In [None]:
train_hidden_states[0][0].numpy().mean(axis=0).shape

In [None]:
train_hidden_states[0].shape

In [None]:
train_hidden_states[0][0]
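
The commented-out `.mean(axis=0)` above hints at mean pooling as an alternative to taking the [CLS] vector. A sketch with random stand-in arrays, assuming each pipeline output has shape (1, seq_len, 768) with a different `seq_len` per sentence; `mean_pool` is a hypothetical helper.

```python
import numpy as np

def mean_pool(token_vectors):
    """Average the token vectors of one sentence: (seq_len, dim) -> (dim,)."""
    return np.asarray(token_vectors).mean(axis=0)

# Stand-ins for two per-sentence outputs of different lengths; random values, not real embeddings
rng = np.random.default_rng(0)
sentence_outputs = [rng.normal(size=(1, 20, 768)), rng.normal(size=(1, 7, 768))]

# Drop the leading batch axis of size 1, then pool over the tokens
features = np.stack([mean_pool(out[0]) for out in sentence_outputs])
print(features.shape)  # (2, 768)
```

Since every sentence ends up as one 768-dimensional row regardless of its length, the stacked result can be passed to `LogisticRegression` the same way as the [CLS] features from Way 1.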