# Applying an LLM to our own tasks
We can take a previously trained LLM and adapt it to our own specific task, which saves significant time and compute compared with training a model from scratch. To do so, we still fine-tune the model's parameters on our own data. There are many techniques for this. In this exercise we will try two methods:
1. Updating all the layers
2. Updating only the final layers

This notebook shows an example of how to use an LLM for a specific task: classifying SMS messages as spam or not spam. We will apply the two fine-tuning methods to the pretrained DistilBERT model.

## 1. First, prepare the SMS Spam dataset using the Hugging Face Datasets library

In [1]:
from datasets import load_dataset
# The SMS Spam dataset has only a 'train' split, so we hold out 20% of it as a test set.
sms_dataset = load_dataset("sms_spam", split='train').train_test_split(test_size=0.2, shuffle=True, seed=7)

In [2]:
splits = ['train', 'test']

print(sms_dataset['train'])
print(sms_dataset['test'])
print(sms_dataset['train'][0])

Dataset({
    features: ['sms', 'label'],
    num_rows: 4459
})
Dataset({
    features: ['sms', 'label'],
    num_rows: 1115
})
{'sms': 'The monthly amount is not that terrible and you will not pay anything till 6months after finishing school.\n', 'label': 0}


In [3]:
from transformers import AutoTokenizer

# We use the DistilBERT tokenizer to tokenize the messages in each split
sms_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = sms_dataset[split].map(lambda x: sms_tokenizer(x['sms'], truncation=True), batched=True)

tokenized_dataset['test']


Dataset({
    features: ['sms', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1115
})
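
Because the tokenizer is called with `truncation=True` but without padding, each tokenized message keeps its own length. Padding is deferred to batch time, which is the job `DataCollatorWithPadding` is given in the training cell. The following is a minimal plain-Python sketch of that idea, not the library's implementation. The function name `pad_batch` and the toy token ids are made up for illustration; the pad id 0 matches the `[PAD]` token of the `distilbert-base-uncased` vocabulary.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token ids (hypothetical, not real tokenizer output)
batch = [[101, 2489, 102], [101, 7592, 2088, 999, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps short SMS messages from being inflated to the longest message in the whole dataset.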

## 2. Set up a model in which all parameters in all layers can be updated

In [4]:
from transformers import AutoModelForSequenceClassification

# Start from the pretrained parameters and attach fully connected layers that classify "It's a Spam" vs "Not a Spam".
# Expect a warning that some weights are newly initialized: these belong to the added classification layers.
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", 
                                                           num_labels=2, 
                                                           id2label={0:"Not a Spam", 1:"It's a Spam"},
                                                           label2id={"Not a Spam":0, "It's a Spam":1})

# We will enable updating of all parameters in this model
for param in model.parameters():
    param.requires_grad=True
                                                           

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 3. Train the model
The `Trainer` class handles the training loop, evaluation, and checkpointing for us.

In [5]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions==labels).mean()}

trainer = Trainer(model = model,
                  args=TrainingArguments(output_dir="./data/sms_spam",
                                         learning_rate=2e-5,
                                         per_device_train_batch_size=64,
                                         per_device_eval_batch_size=64,
                                         eval_strategy="epoch",
                                         save_strategy="epoch",
                                         num_train_epochs=10,
                                         weight_decay=0.01,
                                         load_best_model_at_end=True),
                  train_dataset=tokenized_dataset['train'],
                  eval_dataset=tokenized_dataset['test'],
                  tokenizer=sms_tokenizer,
                  data_collator=DataCollatorWithPadding(tokenizer=sms_tokenizer),
                  compute_metrics=compute_metrics)
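
`compute_metrics` receives the model's raw logits together with the true labels, takes the arg-max over the class axis, and reports the fraction of matches. The sketch below mirrors that computation in plain Python on made-up logits, so the values are illustrative only; `accuracy_from_logits` is a hypothetical helper name, not part of any library.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for four messages: column 0 = not spam, column 1 = spam
toy_logits = [[2.1, -1.3], [0.2, 1.7], [1.5, 0.4], [-0.6, 0.9]]
toy_labels = [0, 1, 1, 1]
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```

Only the third message is misclassified here, because its logit for class 0 is larger although its label is 1.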



In [6]:
import time

# We will calculate the time spent for training
start_time = time.perf_counter()
trainer.train()
end_time = time.perf_counter()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.4f} seconds")


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.034815 | 0.989238 |
| 2 | No log | 0.026115 | 0.991928 |
| 3 | No log | 0.024361 | 0.994619 |
| 4 | No log | 0.024565 | 0.994619 |
| 5 | No log | 0.02979 | 0.992825 |
| 6 | No log | 0.032537 | 0.994619 |
| 7 | No log | 0.0329 | 0.994619 |
| 8 | 0.030600 | 0.031381 | 0.993722 |
| 9 | 0.030600 | 0.03213 | 0.993722 |
| 10 | 0.030600 | 0.032946 | 0.994619 |


Elapsed time: 100.6676 seconds


## 4. Evaluate the model

In [7]:
trainer.evaluate()

{'eval_loss': 0.024361399933695793,
 'eval_accuracy': 0.9946188340807175,
 'eval_runtime': 1.0806,
 'eval_samples_per_second': 1031.878,
 'eval_steps_per_second': 16.658,
 'epoch': 10.0}
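
The `eval_loss` reported here (about 0.02436) matches the validation loss of epoch 3 rather than that of epoch 10. This is consistent with `load_best_model_at_end=True`, which, when no other metric is configured, reloads the checkpoint with the lowest validation loss after training. A small sketch of that selection, using the validation losses from the training log above; the variable names are illustrative.

```python
# Validation loss per epoch, copied from the training log of the full fine-tuning run
val_loss = {1: 0.034815, 2: 0.026115, 3: 0.024361, 4: 0.024565, 5: 0.02979,
            6: 0.032537, 7: 0.0329, 8: 0.031381, 9: 0.03213, 10: 0.032946}

# The "best" checkpoint is the epoch with the smallest validation loss
best_epoch = min(val_loss, key=val_loss.get)
print(best_epoch, val_loss[best_epoch])
```

The validation loss rises again after epoch 3 while accuracy stays flat, so the later epochs add training time without improving the model that is finally kept.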

In [8]:
import pandas as pd

items_for_check = tokenized_dataset['test'].select([32, 434, 51, 900, 234, 124, 2])

results = trainer.predict(items_for_check)
df = pd.DataFrame({"sms": [item["sms"] for item in items_for_check],
                   "Predictions": results.predictions.argmax(axis=1),
                   "labels": results.label_ids})
                   
df.head(10)


| | sms | Predictions | labels |
|---|---|---|---|
| 0 | And miss vday the parachute and double coins??... | 0 | 0 |
| 1 | Tmrw. Im finishing 9 doors\n | 0 | 0 |
| 2 | No it will reach by 9 only. She telling she wi... | 0 | 0 |
| 3 | Thanks love. But am i doing torch or bold.\n | 0 | 0 |
| 4 | Cool. Do you like swimming? I have a pool and ... | 0 | 0 |
| 5 | FREE for 1st week! No1 Nokia tone 4 ur mob eve... | 1 | 1 |
| 6 | 2 celebrate my bday, y else?\n | 0 | 0 |


With all parameters updated, validation accuracy is about 98.9% after the first epoch and peaks at about 99.5%, and fine-tuning took about 100 s. Next, let's update only the pre-classification and classification layers while freezing the parameters of all other layers. This should reduce the training time.

## 5. Now update only a smaller set of parameters (the final classification layers)

In [9]:
# Now create a fresh model that has the original pretrained parameters.
model_update_only_classifier = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", 
                                                           num_labels=2, 
                                                           id2label={0:"Not a Spam", 1:"It's a Spam"},
                                                           label2id={"Not a Spam":0, "It's a Spam":1})

# Freeze all parameters, then re-enable updates only for the pre-classification and classification layers added for this task.
for param in model_update_only_classifier.parameters():
    param.requires_grad=False
for param in model_update_only_classifier.pre_classifier.parameters():
    param.requires_grad=True
for param in model_update_only_classifier.classifier.parameters():
    param.requires_grad=True

trainer_update_only_classifier = Trainer(model = model_update_only_classifier,
                  args=TrainingArguments(output_dir="./data/sms_spam",
                                         learning_rate=2e-5,
                                         per_device_train_batch_size=64,
                                         per_device_eval_batch_size=64,
                                         eval_strategy="epoch",
                                         save_strategy="epoch",
                                         num_train_epochs=10,
                                         weight_decay=0.01,
                                         load_best_model_at_end=True),
                  train_dataset=tokenized_dataset['train'],
                  eval_dataset=tokenized_dataset['test'],
                  tokenizer=sms_tokenizer,
                  data_collator=DataCollatorWithPadding(tokenizer=sms_tokenizer),
                  compute_metrics=compute_metrics)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
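
To get a feel for how few parameters remain trainable after freezing, the count can be worked out by hand. The sketch below assumes the standard `distilbert-base-uncased` configuration (hidden size 768, 6 layers, feed-forward size 3072, vocabulary size 30522, 512 positions). This notebook does not print that configuration, so treat the numbers as an estimate rather than a measured result. On the real model, the same figures can be checked by summing `p.numel()` over `model.parameters()`.

```python
def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Assumed distilbert-base-uncased configuration (not read from the model here)
hidden, ffn, n_layers, vocab, positions = 768, 3072, 6, 30522, 512
layer_norm = 2 * hidden  # scale and shift vectors

embeddings = vocab * hidden + positions * hidden + layer_norm
per_layer = (4 * linear(hidden, hidden)                     # query, key, value, output projections
             + linear(hidden, ffn) + linear(ffn, hidden)    # feed-forward block
             + 2 * layer_norm)                              # two layer norms per block
backbone = embeddings + n_layers * per_layer

# Task head: pre_classifier (768 -> 768) and classifier (768 -> 2)
head = linear(hidden, hidden) + linear(hidden, 2)

total = backbone + head
print(f"trainable: {head:,} of {total:,} ({100 * head / total:.2f}%)")
```

Under these assumptions, less than 1% of the parameters receive gradient updates, which is where the training-time saving comes from; the forward pass through the frozen layers still has to be computed for every batch.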


In [10]:
# We will calculate the time spent for training
start_time = time.perf_counter()
trainer_update_only_classifier.train()
end_time = time.perf_counter()
elapsed_time = end_time - start_time
print(f"Elapsed time without updating all parameters : {elapsed_time:.4f} seconds")

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.356493 | 0.852018 |
| 2 | No log | 0.26685 | 0.852018 |
| 3 | No log | 0.197318 | 0.916592 |
| 4 | No log | 0.157021 | 0.943498 |
| 5 | No log | 0.130478 | 0.96861 |
| 6 | No log | 0.115748 | 0.973094 |
| 7 | No log | 0.105438 | 0.978475 |
| 8 | 0.215400 | 0.099316 | 0.979372 |
| 9 | 0.215400 | 0.095866 | 0.980269 |
| 10 | 0.215400 | 0.094809 | 0.980269 |


Elapsed time without updating all parameters : 42.9449 seconds
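
The accuracy of 0.852018 in the first two epochs is worth a second look: identical values across epochs are what a classifier scores when it predicts the same class for every message. Whether that is what happened here is an assumption, not something this notebook verifies. A majority-class baseline is a cheap way to check. The sketch uses made-up labels and a hypothetical helper named `majority_baseline`.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Made-up labels: 17 not-spam (0) and 3 spam (1)
toy_labels = [0] * 17 + [1] * 3
label, accuracy = majority_baseline(toy_labels)
print(label, accuracy)
```

Applied to `sms_dataset['test']['label']`, this gives the score that any useful model has to beat.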


Updating only the final two layers cuts the training time by about 57% (about 43 s vs. about 101 s) while still giving decent results: about 98.0% accuracy vs. about 99.5% with full fine-tuning. Full fine-tuning also converges faster: it reaches 99.5% by the 3rd epoch, whereas the partially frozen model needs about 9 epochs to reach 98%.
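
The 57% figure follows directly from the two measured wall-clock times. A quick check, using the elapsed times printed by the two training cells:

```python
full_time = 100.6676    # seconds, all parameters updated
frozen_time = 42.9449   # seconds, only the classification layers updated

saving = (full_time - frozen_time) / full_time
speedup = full_time / frozen_time
print(f"time saved: {saving:.1%}, speed-up: {speedup:.2f}x")
```

These are single-run timings on one machine, so the exact ratio will vary with hardware and batch size.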