## Assignment - Part 2
#### Practical Fine-tuning Session

### Hands-on Fine-tuning of a Small-Scale LLM

In [6]:
# Load Pre-trained model
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
# Prepare dataset
from datasets import load_dataset
dataset = load_dataset("imdb")

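# Select the first 500 training rows and the first 100 test rows.
# Caveat: rows are taken in stored order without shuffling; the classification
# report further down shows that all 100 evaluation rows carry label 0.
train_data = dataset['train'].select(range(500))  # mini training set
eval_data = dataset['test'].select(range(100))    # validation set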


# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train = train_data.map(tokenize_function, batched=True)  # tokenized training data
tokenized_eval = eval_data.map(tokenize_function, batched=True)    # tokenized validation data
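
The cell above selects rows in stored order, and the classification report later in this notebook lists only class 0. One way to guard against a single-class subset is to pick a fixed number of rows per label. The helper below is a minimal sketch, not part of the original notebook: `balanced_indices` and `toy_labels` are hypothetical names, and the toy label list only imitates a column that is sorted by class.

```python
import random
from collections import Counter


def balanced_indices(labels, n_per_class, seed=42):
    """Return shuffled row indices containing n_per_class rows of every label."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    chosen = []
    for label, idxs in sorted(by_label.items()):
        if len(idxs) < n_per_class:
            raise ValueError(f"label {label} has only {len(idxs)} rows")
        chosen.extend(rng.sample(idxs, n_per_class))

    rng.shuffle(chosen)  # mix the classes so batches are not single-class
    return chosen


# Toy stand-in for a label column that is sorted by class (assumption).
toy_labels = [0] * 1000 + [1] * 1000
idx = balanced_indices(toy_labels, n_per_class=250)
print(Counter(toy_labels[i] for i in idx))

# With the notebook's objects this could be used as, for example:
#   train_idx = balanced_indices(dataset["train"]["label"], n_per_class=250)
#   train_data = dataset["train"].select(train_idx)
# A simpler alternative is dataset["train"].shuffle(seed=42).select(range(500)),
# which randomizes the rows but does not force an exact 50/50 split.
```

Fixing the seed keeps the subset reproducible between runs, which makes results from different training settings easier to compare.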


In [19]:
# Set up trainer
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # evaluate after every epoch (named eval_strategy in newer transformers releases)
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
)

# Train model
trainer.train()

# Save the fine-tuned model and tokenizer
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")



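| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 6.4e-05         |
| 2     | No log        | 2.6e-05         |
| 3     | No log        | 1.6e-05         |
| 4     | No log        | 1.1e-05         |
| 5     | No log        | 9e-06           |
| 6     | No log        | 7e-06           |
| 7     | No log        | 7e-06           |
| 8     | No log        | 6e-06           |
| 9     | No log        | 6e-06           |
| 10    | No log        | 6e-06           |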
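
The "No log" entries in the Training Loss column are most likely a logging-interval effect rather than a training failure. The arithmetic below is a rough sketch under stated assumptions: a single device, no gradient accumulation, and `logging_steps` left at the `TrainingArguments` default of 500 (the notebook does not set it).

```python
import math

train_rows = 500       # size of the training subset in the notebook
batch_size = 16        # per_device_train_batch_size in the notebook
epochs = 10            # num_train_epochs in the notebook
logging_steps = 500    # assumed default; not set explicitly in the notebook

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("training loss logged at least once:", total_steps >= logging_steps)
```

Under these assumptions the run ends before the first logging step is reached, so passing a smaller value such as `logging_steps=10` to `TrainingArguments` should make the training loss appear next to the validation loss.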


('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

In [20]:
# Evaluate the fine-tuned model
results = trainer.evaluate()
print(results)

{'eval_loss': 5.519375463336473e-06, 'eval_runtime': 1.4501, 'eval_samples_per_second': 68.962, 'eval_steps_per_second': 8.965, 'epoch': 10.0}
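
The dictionary above reports the loss and timing but no accuracy, because the `Trainer` was created without a metrics function. `Trainer` accepts a `compute_metrics` callable for this purpose. The function below is a minimal sketch of what could be passed; the toy logits and labels are made up for illustration and do not come from the notebook's run.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Turn (logits, labels) from the Trainer into an accuracy score."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # predicted class = highest logit
    return {"accuracy": float((preds == labels).mean())}


# Toy check with made-up values: predictions are [0, 1, 0].
toy_logits = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 0.0]])
toy_labels = np.array([0, 1, 1])
print(compute_metrics((toy_logits, toy_labels)))

# In the notebook it would be wired in as:
#   trainer = Trainer(..., compute_metrics=compute_metrics)
# after which trainer.evaluate() also returns an "eval_accuracy" entry.
```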


In [21]:
# Detailed metrics using sklearn
from sklearn.metrics import classification_report
predictions = trainer.predict(tokenized_eval)
y_pred = predictions.predictions.argmax(axis=1)
y_true = tokenized_eval['label']
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       100

    accuracy                           1.00       100
   macro avg       1.00      1.00      1.00       100
weighted avg       1.00      1.00      1.00       100
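
The report above lists a single class (0) with a support of 100, so it is worth checking what a trivial model would score on the same labels. The sketch below counts the labels and computes the accuracy of always predicting the most common one. `majority_baseline` is a hypothetical helper, and `y_true_example` is a stand-in that mirrors the support column of the report rather than the notebook's actual `y_true` object.

```python
from collections import Counter


def majority_baseline(labels):
    """Return label counts, the most common label, and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return counts, majority_label, majority_count / len(labels)


# Stand-in labels mirroring the report: 100 rows, all class 0.
y_true_example = [0] * 100
counts, label, baseline_acc = majority_baseline(y_true_example)
print(dict(counts), label, baseline_acc)

# For comparison, a balanced label set gives a baseline of 0.5.
print(majority_baseline([0, 1] * 50)[2])
```

When the baseline accuracy equals the model's accuracy, as it does for a single-class evaluation set, the score says little about how well the model separates positive from negative reviews.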



# **Write a detailed report covering the following**


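1. **Dataset Insight** - The model was fine-tuned on the IMDb dataset from Hugging Face (`load_dataset("imdb")`). The dataset consists of movie reviews, each labeled as negative (0) or positive (1), which makes this a binary sentiment-classification task and matches `num_labels=2` in the model setup.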

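2. **Training Process** - Since the full dataset is large, I used a subset: the first 500 rows of the training split for fine-tuning and the first 100 rows of the test split for validation. This keeps training fast while still giving the model several hundred examples to learn from. In addition, I increased the number of epochs from 3 to 10 to give the model more passes over the data. The validation loss fell from 6.4e-05 after epoch 1 to 6e-06 by epoch 8 and stayed flat after that, as shown above.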

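3. **Evaluate Result** - The model reaches a precision, recall, and F1-score of 1.00 with an evaluation loss of about 5.5e-06. Since only a small subset of the dataset was used, there may be a problem of overfitting, and scores this perfect should not be taken at face value. The classification report also lists only class 0 with a support of 100, so every validation row carries the same label and a model that always predicts 0 would score just as well. Increasing the sample size and making sure both classes are represented, for example by shuffling before selecting rows, can help improve accuracy on unseen data and reduce overfitting.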

4. **Application and Impact** - Fine-tuning can be used in the real world to adapt a pre-trained model to a specific domain, for example to train chatbots for a particular field. Since the model here was trained on only a subset of the dataset, using more of the data for training should increase its accuracy.
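
The choice of 10 epochs can be checked against the logged validation loss with a simple patience rule. The sketch below is illustrative only: `early_stop_epoch` is a hypothetical helper, the loss values are the rounded figures from the training log above, and the patience settings are assumptions. The `transformers` library offers `EarlyStoppingCallback` for doing this during training; it is not used in the notebook.

```python
def early_stop_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if the patience limit is never reached.
    """
    best = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Rounded validation losses copied from the training log (epochs 1 to 10).
val_losses = [6.4e-05, 2.6e-05, 1.6e-05, 1.1e-05, 9e-06,
              7e-06, 7e-06, 6e-06, 6e-06, 6e-06]

print(early_stop_epoch(val_losses, patience=2))
print(early_stop_epoch(val_losses, patience=1))
```

Because the logged values are rounded, small improvements between epochs may be hidden, so the exact stopping epoch on the unrounded losses could differ.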





