# **Build our First Simple LLM**

## **Overall Steps to Build and Fine-tune your LLM**

* **`Environment Setup`: Get your Python environment up and running**
* **`Choose your base model`: Select a model that fits your available compute**
* **`Preparing custom dataset`: Format the data in a way the model can understand**
* **`Tokenization + Embedding`: Convert the data into a numerical representation**
* **`Load the model and its tokenizer`: Bring in your model and its tokenizer**
* **`Set up your configuration`: Define the training parameters**
* **`Fine-tuning the model`: Run the training process**
* **`Save the model and run inference`: Save the model, then test it**

## **Fine-tuning**

### **1. Environment Setup**

In [None]:
!pip install transformers torch datasets



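* **transformers: Pre-trained models, tokenizers, and the Trainer API**
* **torch: Training deep neural networks (DNNs)**
* **datasets: Download and manage datasets from the Hugging Face Hub (HF)**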

In [None]:
import torch
import numpy as np
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer



### **Choosing the base Model**

In [None]:
MODEL_NAME = "distilbert-base-uncased"

### **Preparing your own custom dataset**

**For this use case, we create a small dummy dataset. If you have your own labeled data, you can use it instead.**
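If you do have your own labeled data, it only needs to end up as two parallel lists of texts and labels. The following is a minimal sketch, assuming a hypothetical CSV with `text` and `label` columns; the column names, the sample rows, and the `own_texts` / `own_labels` names are illustrative and not part of this notebook's data.

```python
import csv
import io

# Hypothetical CSV content. In practice this would be a real file,
# e.g. open("your_file.csv", newline="") instead of io.StringIO(...).
csv_source = io.StringIO(
    "text,label\n"
    "\"Great experience, would buy again\",1\n"
    "The package arrived broken,0\n"
)

own_texts, own_labels = [], []
for row in csv.DictReader(csv_source):
    own_texts.append(row["text"])
    own_labels.append(int(row["label"]))  # 1 = positive, 0 = negative

print(len(own_texts), len(own_labels))
```

The resulting lists can then take the place of `texts` and `labels` in the rest of the notebook.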

In [None]:
# texts = [
#     "I love this product, it's amazing",
#     "This is the worst experience I've ever had",
#     "The movie was okay, not great but not bad",
#     "Absolutely fantastic service!",
#     "I'm very disappointed with the quality",
#     "What a wonderful day",
#     "This is quite frustrating",
#     "I'm feeling pretty good about this",
#     "Not happy with the outcome",
#     "This is excellent, highly recommend."
# ]

In [None]:
# # 1 - Positive, 0 - Negative
# labels = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]

In [None]:
texts = [
    "The day was average",
    "Really poor quality service",
    "Absolutely fantastic product!",
    "The trip was okay, not great but not bad",
    "I'm very disappointed with the device",
    "The product exceeded my expectations",
    "The meal was a waste of time",
    "Great experience with this trip",
    "Neither good nor bad experience with this movie",
    "Really satisfied with the quality",
    "This is the worst meal I've ever had",
    "The device did not meet expectations",
    "The product was average",
    "Very happy with the service",
    "The service was okay, not great but not bad",
    "Not happy with the product",
    "The trip exceeded my expectations",
    "Worst experience with this movie",
    "The quality was delightful",
    "The game was average",
    "I hate this service, awful",
    "The meal could be better but it's fine",
    "What a wonderful day",
    "The movie was okay, not great but not bad",
    "Absolutely fantastic service!",
    "This is quite frustrating",
    "The device was average",
    "The game exceeded my expectations",
    "The trip was terrible",
    "Great experience with this quality",
    "Really poor quality experience",
    "I'm very disappointed with the movie",
    "The service was delightful",
    "Not happy with the trip",
    "The product did not meet expectations",
    "Absolutely fantastic game!",
    "Neither good nor bad experience with this meal",
    "The quality was okay, not great but not bad",
    "Worst experience with this service",
    "The movie exceeded my expectations",
    "The device was average",
    "I'm feeling pretty good about this product",
    "The trip did not meet expectations",
    "Really satisfied with the meal",
    "This is the worst service I've ever had",
    "The day was okay, not great but not bad",
    "Absolutely fantastic quality!",
    "Not happy with the device",
    "The product was average",
    "Great experience with this meal",
    "The service did not meet expectations",
    "The trip was average",
    "Very happy with the product",
    "The movie was a waste of time",
    "I'm very disappointed with the quality",
    "The device was delightful",
    "The service was okay, not great but not bad",
    "Worst experience with this day",
    "The product exceeded my expectations",
    "Not happy with the meal",
    "The trip was average",
    "This is excellent, highly recommend.",
    "The quality could be better but it's fine",
    "Really poor quality device",
    "The service was average",
    "Absolutely fantastic experience!",
    "The game exceeded my expectations",
    "The product was okay, not great but not bad",
    "Worst experience with this trip",
    "Really satisfied with the service",
    "The day was a waste of time",
    "Great experience with this device",
    "The movie did not meet expectations",
    "The service was average",
    "The trip was delightful",
    "I'm very disappointed with the game",
    "Absolutely fantastic meal!",
    "The quality was okay, not great but not bad",
    "The product exceeded my expectations",
    "Not happy with the service",
    "The day was average",
    "Worst experience with this device",
    "The movie exceeded my expectations",
    "The trip was okay, not great but not bad",
    "Really poor quality product",
    "The game was average",
    "The device exceeded my expectations",
    "Very happy with the quality",
    "The service was average",
    "Absolutely fantastic trip!",
    "Not happy with the movie",
    "The product was okay, not great but not bad",
    "The day exceeded my expectations",
    "Really satisfied with the product",
    "The meal was terrible",
    "Neither good nor bad experience with this game",
    "Great experience with this service",
    "I'm very disappointed with the trip",
    "The device was average",
    "Absolutely fantastic movie!",
    "The product did not meet expectations",
    "The quality was delightful"
]

labels = [
    0,0,1,0,0,
    1,0,1,0,1,
    0,0,0,1,0,
    0,1,0,1,0,
    0,0,1,0,1,
    0,0,1,0,1,
    0,0,1,0,0,
    1,0,0,1,0,
    0,1,0,0,1,
    0,1,0,0,0,
    1,0,1,0,0,
    1,0,0,1,1,
    0,0,1,0,1,
    0,1,0,0,1,
    0,0,1,0,0,
    1,0,0,1,0,
    1,0,1,0,1,
    0,0,1,0,0,
    1,0,0,1,0,
    1,0,0,1,0,0,0
]


In [None]:
len(texts),len(labels)

(102, 102)
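Beyond matching lengths, it can help to check how the two classes are distributed, since the neutral sentences appear to share label 0 with the negative ones. A small sketch with `collections.Counter`, using only the first ten labels from the list above as a stand-in (`sample_labels` is a name introduced here for illustration):

```python
from collections import Counter

# First ten entries of the labels list above, used as a short stand-in.
sample_labels = [0, 0, 1, 0, 0, 1, 0, 1, 0, 1]

counts = Counter(sample_labels)
total = len(sample_labels)
for label, count in sorted(counts.items()):
    print(f"label {label}: {count} examples ({count / total:.0%})")

# Share of the most frequent class: the accuracy a model would get
# by always predicting that class.
majority_share = max(counts.values()) / total
print(f"majority baseline: {majority_share:.2f}")
```

Running the same check on the full `labels` list gives a baseline to compare the fine-tuned model's accuracy against.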

### **Data Splitting**

In [None]:
train_texts, evals_texts, train_label, eval_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)

In [None]:
len(train_texts)

81

In [None]:
len(evals_texts)

21
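With a dataset this small, a purely random split can leave the evaluation set with very few positive examples. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both splits. The sketch below uses hypothetical toy data (`toy_texts`, `toy_labels`) rather than the notebook's lists:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 15 negative and 5 positive examples.
toy_texts = [f"sample {i}" for i in range(20)]
toy_labels = [0] * 15 + [1] * 5

x_train, x_eval, y_train, y_eval = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,
    random_state=42,
    stratify=toy_labels,  # preserve the 75/25 class ratio in both splits
)

print(len(x_train), sum(y_train))  # training size, positives in training
print(len(x_eval), sum(y_eval))    # evaluation size, positives in evaluation
```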

**This data needs to be converted into the Hugging Face Dataset format. For that, we use the `Dataset` class from the `datasets` package.**

#### **Convert the data into the required format**

In [None]:
train_dataset = Dataset.from_dict({
    'text': train_texts,
    'label': train_label
})

eval_dataset = Dataset.from_dict({
    'text': evals_texts,
    'label': eval_labels
})

In [None]:
train_dataset, eval_dataset

(Dataset({
     features: ['text', 'label'],
     num_rows: 81
 }),
 Dataset({
     features: ['text', 'label'],
     num_rows: 21
 }))

In [None]:
raw_dataset = DatasetDict({
    "train": train_dataset,
    "eval": eval_dataset
})

In [None]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 81
    })
    eval: Dataset({
        features: ['text', 'label'],
        num_rows: 21
    })
})

### **Tokenization**

MODEL_NAME = "distilbert-base-uncased"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



#### **Create a tokenize function that will tokenize the data**

In [None]:
def tokenize_function(data):
    # Pad or truncate every example to exactly 128 tokens
    return tokenizer(data["text"], padding="max_length", truncation=True, max_length=128)

In [None]:
tokenized_dataset = raw_dataset.map(tokenize_function)


In [None]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 81
    })
    eval: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 21
    })
})
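The two new columns, `input_ids` and `attention_mask`, come from the padding and truncation settings in `tokenize_function`. The toy function below illustrates the idea only; it is not the tokenizer's implementation. The token IDs are made up, `pad_id=0` is an assumption for illustration, and a real tokenizer also takes care of special tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Toy version of padding="max_length" combined with truncation=True."""
    kept = token_ids[:max_length]          # truncation: drop anything past max_length
    n_pad = max_length - len(kept)         # how many padding positions are needed
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }

short = pad_and_truncate([101, 2054, 102], max_length=6)   # hypothetical IDs
long = pad_and_truncate(list(range(1, 10)), max_length=6)  # hypothetical IDs

print(short)
print(long)
```

Every example ends up with the same length, which is what lets the examples be stacked into one tensor per batch.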

**We remove the `text` column from tokenized_dataset, since the model only needs the tokenized inputs from here on**

In [None]:
tokenized_dataset = tokenized_dataset.remove_columns(["text"])

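**We need to rename the `label` column to `labels`, the argument name the model expects. Run this cell only once: the `ValueError` shown below is what a second run produces, because the column has already been renamed.**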

In [None]:
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")

ValueError: Original column name label not in the dataset. Current columns in the dataset: ['labels', 'input_ids', 'attention_mask']

In [None]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 81
    })
    eval: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 21
    })
})

In [None]:
tokenized_dataset.set_format("torch")
#set_format("torch") converts dataset columns like input_ids, attention_mask,
#labels into PyTorch tensors, making the dataset directly usable for training models in PyTorch.

**Tokenized Data**

In [None]:
tokenized_dataset['train']

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 81
})

In [None]:
train_dataset_for_trainer = tokenized_dataset["train"]
eval_dataset_for_trainer = tokenized_dataset["eval"]

### **Loading the pre-trained model**

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels = 2)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### **Training Arguments**

**TrainingArguments defines all the hyperparameters for fine-tuning**

In [None]:


from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./model_results",       # where to save model checkpoints
    num_train_epochs=3,                 # how many epochs to train
    eval_strategy="epoch",              # run eval at end of each epoch
    save_strategy="epoch",              # save model at end of each epoch
    load_best_model_at_end=True         # load best model after training
)


### **Metrics for evaluation**

In [None]:


def compute_metrics(pred):
    # Get the true labels from the predictions object
    labels = pred.label_ids

    # Get the predicted labels by taking the index of the highest score (argmax)
    preds = pred.predictions.argmax(-1)

    # Calculate accuracy
    acc = accuracy_score(labels, preds)

    # Return all metrics in a dictionary
    return {
        "accuracy": acc
    }
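Accuracy alone can be misleading when one class dominates. The imports at the top of the notebook already include `precision_recall_fscore_support`, so one possible extension is sketched below. `compute_metrics_extended` and the `fake_pred` object are hypothetical names introduced here; `fake_pred` only mimics the `predictions` and `label_ids` attributes used by `compute_metrics`.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics_extended(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # average="binary" reports the scores for the positive class (label 1);
    # zero_division=0 avoids warnings when no positives are predicted.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical logits for four examples, with their true labels.
fake_pred = SimpleNamespace(
    predictions=np.array([[0.1, 0.9], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]),
    label_ids=np.array([1, 0, 1, 0]),
)
metrics = compute_metrics_extended(fake_pred)
print(metrics)
```

In this toy case one of the two positive examples is missed, so recall drops to 0.5 even though accuracy is 0.75.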


### **Train the model**

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset_for_trainer,
    eval_dataset = eval_dataset_for_trainer,
    compute_metrics = compute_metrics,
    tokenizer = tokenizer,
)


In [None]:
print("Starting Training")
try:
  trainer.train()
  print("Training Complete")
except Exception as e:
  print(f"An error occurred during training: {e}")

Starting Training






| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | No log | 0.59661 | 0.857143 |
| 2 | No log | 0.587758 | 0.857143 |
| 3 | No log | 0.60057 | 0.857143 |




Training Complete
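The `No log` entries in the Training Loss column are most likely because the run is too short to reach the first logging step. A rough calculation, assuming the `TrainingArguments` defaults of a per-device batch size of 8, logging every 500 steps, no gradient accumulation, and a single device (none of these are set explicitly in this notebook):

```python
import math

num_train_examples = 81    # size of the training split
num_train_epochs = 3       # from training_args
per_device_batch_size = 8  # assumed default
logging_steps = 500        # assumed default
num_devices = 1            # assumed single CPU/GPU

steps_per_epoch = math.ceil(num_train_examples / (per_device_batch_size * num_devices))
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
print("training loss would be logged:", total_steps >= logging_steps)
```

Under these assumptions, passing a smaller `logging_steps` value to `TrainingArguments` would make the training loss appear in the results.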


#### **Save the model**

In [None]:
SAVE_PATH = "./fine_tuned_sentiment_model"

In [None]:
try:
  trainer.save_model(SAVE_PATH)
  tokenizer.save_pretrained(SAVE_PATH)
  print("Model saved successfully")
except Exception as e:
  print(f"An error occurred while saving the model: {e}")

Model saved successfully


### **Inference**

SAVE_PATH = "./fine_tuned_sentiment_model"

In [None]:
try:
  loaded_model = AutoModelForSequenceClassification.from_pretrained(SAVE_PATH)
  loaded_tokenizer = AutoTokenizer.from_pretrained(SAVE_PATH)
  print("Model loaded successfully")
except Exception as e:
  print(f"An error occurred while loading the model: {e}")

Model loaded successfully


In [None]:
test_texts = [
    "This is absolutely wonderful",
    "I'm no really happy about this situation.",
    "The product is mediocre at best",
]

In [None]:
from transformers import pipeline

In [None]:
sentiment_analysis_pipeline = pipeline("sentiment-analysis", model = loaded_model, tokenizer = loaded_tokenizer)

Device set to use cpu


In [None]:
results = sentiment_analysis_pipeline(test_texts)

In [None]:
# Use singular loop names so the `texts` and `results` lists are not overwritten
for text, result in zip(test_texts, results):
  print(f"Text: {text}")
  print(f"Sentiment: {result}")
  print("-" * 50)

Text: This is absolutely wonderful
Sentiment: {'label': 'LABEL_0', 'score': 0.6487889885902405}
--------------------------------------------------
Text: I'm no really happy about this situation.
Sentiment: {'label': 'LABEL_0', 'score': 0.6179913878440857}
--------------------------------------------------
Text: The product is mediocre at best
Sentiment: {'label': 'LABEL_0', 'score': 0.6114521622657776}
--------------------------------------------------
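The `score` in each result is a probability. For a single-label classifier with two outputs, the pipeline is understood to apply a softmax to the model's raw logits and report the most probable class. The sketch below reproduces that step by hand; the logits are hypothetical values chosen for illustration, not values taken from this model.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.4, -0.2]  # hypothetical raw outputs for LABEL_0 and LABEL_1
probs = softmax(logits)

# Pick the class with the highest probability, as the pipeline output does.
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": f"LABEL_{best}", "score": probs[best]}
print(result)
```

A score close to 0.5, as in the outputs above, means the two logits are nearly equal and the model is not confident in either class.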


In [None]:
sentiment_analysis_pipeline("i love this movie it very good")

[{'label': 'LABEL_0', 'score': 0.7373517751693726}]
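The pipeline reports generic names such as `LABEL_0` because the model was loaded without human-readable label names. A small post-processing sketch, assuming the convention used for this dataset (0 = negative, 1 = positive); `LABEL_NAMES` and `readable` are helper names introduced here for illustration:

```python
# Mapping from the pipeline's generic names to the dataset's convention.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(prediction):
    """Replace the generic label name and round the score for display."""
    return {
        "label": LABEL_NAMES.get(prediction["label"], prediction["label"]),
        "score": round(prediction["score"], 3),
    }


# The pipeline output shown above.
raw = [{"label": "LABEL_0", "score": 0.7373517751693726}]
print([readable(p) for p in raw])
```

Every input in the outputs above, including the clearly positive ones, is assigned `LABEL_0`. Together with the unchanged accuracy across epochs, this suggests the model has not yet learned to separate the classes on this small dataset, so more data or more training may be needed.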