<center><h2>ML Ecosystem</h2></center>


In this notebook, we use HuggingFace as an example to demonstrate how to load a pretrained model, load a dataset, and fine-tune the model for a downstream task.

To prepare, make sure PyTorch is installed in your environment. Also install the following packages:

`pip install transformers datasets accelerate peft`



### Loading the model

Our goal is to fine-tune a model for sentiment classification: given a text, classify whether its sentiment is positive or negative.

To start, let's load a pretrained model from the HuggingFace model hub. The model is named "bert-base-uncased", and its webpage can be found at [https://huggingface.co/google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased). Here, 'bert' refers to BERT, a language model Google developed in 2018; 'base' means it is the base (relatively small) version of the model; and 'uncased' means it does not distinguish between upper and lower case.

To load the model, we call `transformers.AutoModelForSequenceClassification.from_pretrained` with the model name, and it automatically loads the model from HuggingFace. The 'ForSequenceClassification' suffix indicates that the model is loaded for sequence classification: an additional linear layer is added on top of the BERT model so that the output size equals the number of classes in our problem.

We also need a tokenizer, which splits a text string into tokens and maps each token to an integer ID that the model can consume. We again load it from the pretrained checkpoint.
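
To make the idea concrete, here is a minimal sketch of what a tokenizer does. It uses a tiny toy vocabulary and whole-word matching, whereas the real bert-base-uncased tokenizer has about 30k entries and splits rare words into sub-word pieces. The IDs for `[CLS]`, `i`, `am`, `curious`, and `.` are the ones that appear in this notebook's tokenized output; `[PAD]`, `[UNK]`, and `[SEP]` follow the bert-base-uncased vocabulary. The names `TOY_VOCAB` and `toy_tokenize` are made up for this illustration.

```python
import re

# Toy vocabulary for illustration only (the real one has ~30k entries).
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "i": 1045, "am": 2572, "curious": 8025, ".": 1012}

def toy_tokenize(text):
    """Lowercase, split into words and punctuation, and map each piece to an ID."""
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    ids = [TOY_VOCAB.get(p, TOY_VOCAB["[UNK]"]) for p in pieces]
    # BERT-style inputs start with [CLS] and end with [SEP].
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]

print(toy_tokenize("I am curious."))  # [101, 1045, 2572, 8025, 1012, 102]
print(toy_tokenize("I am happy."))    # 'happy' is not in the toy vocab, so it maps to 100
```

Because the model is 'uncased', the text is lowercased before lookup, so "I" and "i" share one ID.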

In [14]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import torch

model_name = "bert-base-uncased"

id2label = {0: "Negative", 1: "Positive"}  # how to interpret the output
label2id = {"Negative": 0, "Positive": 1}

# load the pretrained model, set the number of classes to 2, and set the class label names
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, id2label=id2label, label2id=label2id)

# load the tokenizer; transformers cannot read raw strings, so text must be tokenized first
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
# the loaded model is a torch.nn.Module, i.e. it was built with PyTorch
print("isinstance(model,torch.nn.Module) = ",isinstance(model,torch.nn.Module))

# let's print out the model structure
model 

# The last layer (the classifier head) is not trained yet; it must be trained before the model is useful for classification

isinstance(model,torch.nn.Module) =  True


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

You will see that the model has many layers, in particular the 12 transformer blocks (the `BertLayer` modules). The original bert-base-uncased model ends at the "BertPooler" layer, which outputs a 768-dimensional feature vector. The final Linear layer is added when we load the model and maps that vector to 2 outputs, one per class.

Let's see how we can use the model to make a prediction.

In [18]:
from transformers import pipeline

# pipeline bundles a tokenizer and a model so that inference takes a single call
classifier = pipeline("sentiment-analysis",model=model,tokenizer=tokenizer)

# classify the sentiment of a sentence
classifier("I feel very sad today. ")

# The model is not fine-tuned yet, so this prediction is not reliable

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'Positive', 'score': 0.7406206727027893}]

The model is pretrained to "understand" texts, and the 768 dimensional output of BERT contains useful features of the texts. However, the pretrained model is not specifically designed to predict sentiment. Further, the final linear layer is added when we load the model and it is NOT trained yet.  So we wouldn't expect the model to work well now. To make it work well, we need to fine-tune it on some dataset. 
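
The following sketch illustrates why an untrained head still returns a confident-looking score. It is not the library's implementation: it uses a 4-dimensional feature vector in place of BERT's 768 dimensions, random numbers in place of real features and weights, and hypothetical helper names (`softmax`, `classify`). The point is only that a linear layer followed by softmax always yields a label and a score, whether or not the weights mean anything.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weight, bias, id2label):
    """Linear layer (one row of weights per class) followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weight, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

random.seed(0)
hidden_size = 4  # stands in for BERT's 768-dimensional output
features = [random.gauss(0, 1) for _ in range(hidden_size)]
# Randomly initialized head: nothing here has been learned from data.
weight = [[random.gauss(0, 1) for _ in range(hidden_size)] for _ in range(2)]
bias = [0.0, 0.0]

print(classify(features, weight, bias, {0: "Negative", 1: "Positive"}))
```

With two classes the reported score is always at least 0.5, so a value such as 0.74 from an untrained head says nothing about the actual sentiment.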


### Loading the dataset



We will use the `datasets` package to load datasets from HuggingFace. Today we will load the [IMDB](https://huggingface.co/datasets/stanfordnlp/imdb) dataset consisting of movie reviews.

In [19]:
from datasets import load_dataset

# let's load the dataset from HuggingFace
dataset = load_dataset("imdb")
train_dataset = dataset["train"]
test_dataset = dataset["test"]

# let's see what the dataset looks like
for i in range(10):
    print(train_dataset[i])


{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [20]:
# Tokenize the dataset: convert each 'text' string into a sequence of integer token IDs
def tokenize_function(examples):
    # the tokenizer returns, among other fields, 'input_ids' (one token ID per word piece)
    # and an 'attention_mask' (1 for real tokens, 0 for padding)
    return tokenizer(examples["text"], padding="max_length", truncation=True)
    

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

print(train_dataset[0])

# These integers are the IDs of the tokens in the tokenizer's vocabulary (its "dictionary")

Map: 100%|██████████| 25000/25000 [00:08<00:00, 3117.37 examples/s]
Map: 100%|██████████| 25000/25000 [00:06<00:00, 3783.35 examples/s]

{'label': tensor(0), 'input_ids': tensor([  101,  1045, 12524,  1045,  2572,  8025,  1011,  3756,  2013,  2026,
         2678,  3573,  2138,  1997,  2035,  1996,  6704,  2008,  5129,  2009,
         2043,  2009,  2001,  2034,  2207,  1999,  3476,  1012,  1045,  2036,
         2657,  2008,  2012,  2034,  2009,  2001,  8243,  2011,  1057,  1012,
         1055,  1012,  8205,  2065,  2009,  2412,  2699,  2000,  4607,  2023,
         2406,  1010,  3568,  2108,  1037,  5470,  1997,  3152,  2641,  1000,
         6801,  1000,  1045,  2428,  2018,  2000,  2156,  2023,  2005,  2870,
         1012,  1026,  7987,  1013,  1028,  1026,  7987,  1013,  1028,  1996,
         5436,  2003,  8857,  2105,  1037,  2402,  4467,  3689,  3076,  2315,
        14229,  2040,  4122,  2000,  4553,  2673,  2016,  2064,  2055,  2166,
         1012,  1999,  3327,  2016,  4122,  2000,  3579,  2014,  3086,  2015,
         2000,  2437,  2070,  4066,  1997,  4516,  2006,  2054,  1996,  2779,
        25430, 14728,  2245,  
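
As a rough illustration of what `padding="max_length"` and `truncation=True` do, the sketch below pads or truncates a list of token IDs to a fixed length and builds the matching attention mask. The function name `pad_and_mask` is hypothetical, and keeping the final token when truncating is a simplification of how the real tokenizer preserves `[SEP]`. For bert-base-uncased the fixed length is 512, matching the 512 position embeddings in the model printout; a short length is used here for readability.

```python
def pad_and_mask(input_ids, max_length, pad_id=0):
    """Truncate or right-pad input_ids to max_length and build the attention mask."""
    if len(input_ids) > max_length:
        # Keep the closing token (e.g. [SEP]) when cutting the sequence short.
        ids = input_ids[:max_length - 1] + [input_ids[-1]]
    else:
        ids = list(input_ids)
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

ids = [101, 1045, 2572, 8025, 1012, 102]  # "[CLS] i am curious . [SEP]"
print(pad_and_mask(ids, max_length=8))
# {'input_ids': [101, 1045, 2572, 8025, 1012, 102, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0]}
print(pad_and_mask(ids, max_length=4))
# {'input_ids': [101, 1045, 2572, 102], 'attention_mask': [1, 1, 1, 1]}
```

The attention mask tells the model to ignore the padded positions, so padding changes the tensor shape but not the content the model attends to.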




### Fine-tuning

Now let's conduct the actual fine-tuning. First, let's see the size of the model.



In [21]:
print("number of total parameters in model = ", model.num_parameters(),"number of TRAINABLE parameters in model = ", model.num_parameters(only_trainable = True))


number of total parameters in model =  109483778 number of TRAINABLE parameters in model =  109483778
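
The total can be reproduced by hand from the layer shapes in the model printout. The sketch below is a back-of-the-envelope count, not a call into the library. One value is an assumption not visible in the truncated printout above: the feed-forward width of 3072, which is the standard bert-base setting.

```python
hidden, vocab, max_pos, type_vocab = 768, 30522, 512, 2
num_layers, num_labels = 12, 2
intermediate = 3072  # bert-base feed-forward width (assumed; the printout above is cut off before it)

def linear(n_in, n_out):
    """Parameters of a Linear layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # one scale and one shift per feature

embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm  # query, key, value, output dense
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = num_layers * (attention + feed_forward)
pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)  # the newly added, untrained head

total = embeddings + encoder + pooler + classifier
print(classifier)  # 1538
print(total)       # 109483778
```

Only 1,538 of the roughly 109 million parameters belong to the new classification head; everything else comes from the pretrained checkpoint.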


To reduce the memory needed for training, a common approach is to use the `peft` library to cut the number of trainable parameters. Without it, all model parameters are trained by default, which takes a lot of memory. PEFT stands for "Parameter-Efficient Fine-Tuning"; here we use one of its methods, LoRA, which freezes the pretrained weights and trains only small low-rank adapter matrices. This significantly reduces the number of trainable parameters while typically retaining good performance.

In [24]:
from transformers import Trainer, TrainingArguments
from peft import LoraConfig,get_peft_model
import numpy as np

# use peft to reduce the number of trainable parameters
peft_config = LoraConfig(
    r=8,  # rank of the low-rank adapter matrices; a smaller r means fewer trainable parameters
)
model = get_peft_model(model, peft_config)

print("number of total parameters with LoRA ", model.num_parameters(),"number of TRAINABLE parameters with LoRA = ", model.num_parameters(only_trainable = True))

# Note the significant drop in trainable parameters, we will only finetune these parameters 

number of total parameters with LoRA  109778690 number of TRAINABLE parameters with LoRA =  294912


You can see that the number of trainable parameters goes down from 109,483,778 to 294,912, a reduction of roughly 370x (between two and three orders of magnitude)!

Let's now run the fine-tuning. We use the `Trainer` class, which handles the training loop for us. Alternatively, you can write your own training loop, since the model is just a PyTorch module. Given the large memory requirement, you will need a GPU to run the code below.
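
The trainable count can be checked with simple arithmetic. The sketch below assumes that, since no target modules are given in `LoraConfig`, PEFT falls back to its default for BERT and adapts the query and value projections in each of the 12 layers; that assumption is consistent with the number printed above. Each adapted 768x768 projection gets two small matrices, A (r x 768) and B (768 x r).

```python
hidden, r, num_layers = 768, 8, 12
adapted_per_layer = 2  # assumption: query and value projections are the default LoRA targets for BERT

# Each adapted Linear(768, 768) gains A (r x 768) and B (768 x r); only these are trained.
lora_params_per_module = r * hidden + hidden * r
trainable = num_layers * adapted_per_layer * lora_params_per_module

base_total = 109_483_778  # parameter count before adding LoRA, from the earlier cell
print(trainable)                      # 294912
print(base_total + trainable)         # 109778690
print(round(base_total / trainable))  # 371
```

Since 294,912 is exactly the size of the adapter matrices, the 1,538-parameter classifier head is not among the trainable parameters in this configuration. PEFT's `task_type="SEQ_CLS"` option in `LoraConfig` is the usual way to keep the head trainable; treat that as a pointer to verify against the PEFT documentation rather than something shown in this notebook.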

In [None]:
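training_args = TrainingArguments(
    output_dir='./results',        # where checkpoints are written
    evaluation_strategy="epoch",   # evaluate after every epoch (recent transformers releases name this eval_strategy)
    num_train_epochs=3, 
    per_device_train_batch_size=64, 
    per_device_eval_batch_size=64
) # you can also set the learning rate here (learning_rate=...)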

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=train_dataset, 
    eval_dataset=test_dataset
)

trainer.train()
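
The `Trainer` above is created without a `compute_metrics` function, so the per-epoch evaluation reports the loss but not accuracy. Below is a minimal sketch of an accuracy metric that could be passed as `Trainer(..., compute_metrics=compute_metrics)`. It is written in plain Python so that it works on nested lists as well as on the arrays the `Trainer` supplies; the fake logits and labels are made-up values for a quick check.

```python
def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(preds)}

# Tiny hand-made check: three examples, two classes.
fake_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2]]
fake_labels = [0, 1, 1]
print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 0.6666666666666666}
```

For large evaluation sets a vectorized version (for example with `numpy.argmax`) would be faster; the loop form is used here only to keep the logic explicit.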

### Checking out the fine-tuned model



In [25]:
# replace the path below with the location where you stored your fine-tuned model
model_finetuned = AutoModelForSequenceClassification.from_pretrained('/Users/coolq/Library/CloudStorage/Box-Box/Teaching/Tool Chain/Toolchain 2024 Fall/notebooks/fine_tune_models/fine_tuned_model_fullfinetuning', num_labels = 2, id2label=id2label, label2id=label2id)


In [26]:
classifier_finetuned = pipeline("sentiment-analysis", model=model_finetuned, tokenizer=tokenizer)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [27]:
classifier_finetuned("I am very happy today.")

[{'label': 'Positive', 'score': 0.9922531247138977}]

In [28]:
classifier_finetuned("This is my first semester at CMU. The courses are challenging, but I have learned a lot. ")

[{'label': 'Positive', 'score': 0.9922582507133484}]

In [29]:
classifier_finetuned("A pleasant surprise, the cinematography is impeccable, the characters quite well done, the plot looks like a link between the stories of the First Age, the Silmarillion and the stories of the Lord of the Rings of the Third Age, the rhythm of narration is pleasant albeit a bit slow. If the outcome of the series will be to narrate how Sauron forged the Rings of Power, it will definitely be something to watch. Until this moment, I think that in general terms, at least the first chapter delivers. I think enough to be cautiously optimistic about what the next 7 episodes might turn out to be. I must add, again that I am pleasantly surprised.")

[{'label': 'Positive', 'score': 0.9936692118644714}]