## Fine-tune Llama 3 for Sentiment Analysis


## Installations and imports

In [1]:
# !pip install -q -U torch --index-url https://download.pytorch.org/whl/cu117
# !pip install -q -U -i https://pypi.org/simple/ bitsandbytes
# !pip install -q -U transformers=="4.40.0"
# !pip install -q -U accelerate
# !pip install -q -U datasets
# !pip install -q -U trl
# !pip install -q -U peft
# !pip install -q -U tensorboard



In [2]:
print(1)

1


In [3]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [4]:
import warnings
warnings.filterwarnings("ignore")

The following cell contains the remaining imports needed to run the notebook.

In [5]:
import time

# Start timing
start_time = time.time()

import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from trl import setup_chat_format
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split

# Stop timing
end_time = time.time()
print(f"Time to import libraries: {end_time - start_time:.2f} seconds")


Detected accelerate version: 0.34.2
Detected bitsandbytes version: 0.44.1
Detected datasets version: 3.0.1
Detected jinja2 version: 2.11.3
Detected nltk version: 3.7
Detected pandas version: 1.4.2
Detected peft version: 0.13.0
Detected psutil version: 5.8.0
Detected pytest version: 7.1.1
Detected safetensors version: 0.4.5
Detected scipy version: 1.7.3
Detected tokenizers version: 0.19.1
Detected torchvision version: 0.15.2+cu117
Detected torch version: 2.1.2+cu118
Detected PIL version 9.0.1


_default_log_level: 10


Detected rich version: 13.9.1


Time to import libraries: 3.05 seconds


In [6]:
print(f"pytorch version {torch.__version__}")
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"working on {device}")

pytorch version 2.1.2+cu118
working on cuda:0


The next cell disables two PyTorch GPU backends for the scaled dot product attention (SDPA) function: the memory-efficient kernel and the FlashAttention kernel. With both turned off, PyTorch falls back to its remaining SDPA implementation.

In [7]:
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

## Preparing the data and the core evaluation functions

In [8]:
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset

filename = "all-data.csv"

# Start timing the data load
start_time_read = time.time()

df = pd.read_csv(filename, 
                 names=["sentiment", "text"],
                 encoding="utf-8", encoding_errors="replace")

end_time_read = time.time()
print(f"Time to read data: {end_time_read - start_time_read:.2f} seconds")

# X_train = list()
# X_test = list()

# # Start timing the dataset split
# start_time_split = time.time()

# for sentiment in ["positive", "neutral", "negative"]:
#     train, test  = train_test_split(df[df.sentiment == sentiment], 
#                                     train_size=300,
#                                     test_size=300, 
#                                     random_state=42)
#     X_train.append(train)
#     X_test.append(test)

# X_train = pd.concat(X_train).sample(frac=1, random_state=10)
# X_test = pd.concat(X_test)

# # Stop timing the dataset split
# end_time_split = time.time()
# print(f"Time to split data: {end_time_split - start_time_split:.2f} seconds")

# eval_idx = [idx for idx in df.index if idx not in list(X_train.index) + list(X_test.index)]
# X_eval = df[df.index.isin(eval_idx)]
# X_eval = (X_eval
#           .groupby('sentiment', group_keys=False)
#           .apply(lambda x: x.sample(n=50, random_state=10, replace=True)))
# X_train = X_train.reset_index(drop=True)



X_train = list()
X_test = list()
X_eval = list()

# Start timing the dataset split
start_time_split = time.time()

for sentiment in ["positive", "neutral", "negative"]:
    # First split: 80% for train + eval, the remaining 20% for test
    train_val, test = train_test_split(df[df.sentiment == sentiment], 
                                       train_size=0.8,  # 80% for train and eval
                                       random_state=42)
    
    # Second split: carve the eval set out of the train + eval portion
    eval_size = int(len(train_val) * 0.25)  # 25% of the 80% is 20% of the total
    # Named eval_split to avoid shadowing the built-in eval()
    train, eval_split = train_test_split(train_val, 
                                         test_size=eval_size, 
                                         random_state=42)
    
    # Collect the per-class splits
    X_train.append(train)
    X_eval.append(eval_split)
    X_test.append(test)

# Concatenate the per-class splits (the training set is also shuffled)
X_train = pd.concat(X_train).sample(frac=1, random_state=10).reset_index(drop=True)
X_eval = pd.concat(X_eval).reset_index(drop=True)
X_test = pd.concat(X_test).reset_index(drop=True)


# Count the labels in each split
train_label_counts = X_train['sentiment'].value_counts()
eval_label_counts = X_eval['sentiment'].value_counts()
test_label_counts = X_test['sentiment'].value_counts()

# Print the label counts
print("Label counts in the training set:")
print(train_label_counts)
print("\nLabel counts in the evaluation set:")
print(eval_label_counts)
print("\nLabel counts in the test set:")
print(test_label_counts)

# Start timing the prompt creation
start_time_prompt = time.time()

def generate_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative".

            [{data_point["text"]}] = {data_point["sentiment"]}
            """.strip()

def generate_test_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative".

            [{data_point["text"]}] = """.strip()

X_train = pd.DataFrame(X_train.apply(generate_prompt, axis=1), 
                       columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_prompt, axis=1), 
                      columns=["text"])

y_true = X_test.sentiment
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

# Stop timing the prompt creation
end_time_prompt = time.time()
print(f"Time to create prompts: {end_time_prompt - start_time_prompt:.2f} seconds")

train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)


Time to read data: 0.01 seconds
Label counts in the training set:
neutral     1728
positive     818
negative     363
Name: sentiment, dtype: int64

Label counts in the evaluation set:
neutral     575
positive    272
negative    120
Name: sentiment, dtype: int64

Label counts in the test set:
neutral     576
positive    273
negative    121
Name: sentiment, dtype: int64
Time to create prompts: 0.03 seconds
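
The per-class counts printed above follow from the two-stage split: 80% of each class goes to train + eval, and a quarter of that share is then held out for evaluation, which gives roughly 60/20/20. Below is a minimal sketch of the arithmetic, assuming the first split rounds the 80% share down and gives the remainder to the test set (which matches the printed counts). The class totals are simply the sums of the three printed counts, and `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, train_val_frac=0.8, eval_frac=0.25):
    """Return (train, eval, test) sizes for one class."""
    n_train_val = math.floor(train_val_frac * n_samples)  # first split: train + eval
    n_test = n_samples - n_train_val                      # remainder is the test set
    n_eval = int(n_train_val * eval_frac)                 # same expression as the notebook
    n_train = n_train_val - n_eval
    return n_train, n_eval, n_test

# Class totals derived from the printed counts (train + eval + test).
class_totals = {"neutral": 2879, "positive": 1363, "negative": 604}

for label, total in class_totals.items():
    print(label, split_sizes(total))
```

Each tuple reproduces the train, evaluation, and test counts shown in the cell output for that label.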


In [9]:

# Print each variable and the total number of records
print("\n--- Final Values ---")
print(f"Total records in df: {df.shape[0]}")
print(f"Total records in X_train: {X_train.shape[0]}")
print(f"Total records in X_eval: {X_eval.shape[0]}")
print(f"Total records in X_test: {X_test.shape[0]}")
print(f"Total records in y_true: {len(y_true)}")
print("\ny_true distribution:")
print(y_true.value_counts())  # show the count of each label in y_true
print("\nTrain dataset:")
print(train_data)
print("\nEval dataset:")
print(eval_data)


--- Final Values ---
Total records in df: 4846
Total records in X_train: 2909
Total records in X_eval: 967
Total records in X_test: 970
Total records in y_true: 970

y_true distribution:
neutral     576
positive    273
negative    121
Name: sentiment, dtype: int64

Train dataset:
Dataset({
    features: ['text'],
    num_rows: 2909
})

Eval dataset:
Dataset({
    features: ['text'],
    num_rows: 967
})


In [10]:

# Print the contents of each variable
print("\n--- Final Values ---")
print("X_train:")
print(X_train.head())  # show the first 5 rows of X_train
print("\nX_eval:")
print(X_eval.head())   # show the first 5 rows of X_eval
print("\nX_test:")
print(X_test.head())   # show the first 5 rows of X_test
print("\ny_true:")
print(y_true.value_counts())  # show the count of each label in y_true
print("\nTrain dataset:")
print(train_data)
print("\nEval dataset:")
print(eval_data)


--- Final Values ---
X_train:
                                                text
0  Analyze the sentiment of the news headline enc...
1  Analyze the sentiment of the news headline enc...
2  Analyze the sentiment of the news headline enc...
3  Analyze the sentiment of the news headline enc...
4  Analyze the sentiment of the news headline enc...

X_eval:
                                                text
0  Analyze the sentiment of the news headline enc...
1  Analyze the sentiment of the news headline enc...
2  Analyze the sentiment of the news headline enc...
3  Analyze the sentiment of the news headline enc...
4  Analyze the sentiment of the news headline enc...

X_test:
                                                text
0  Analyze the sentiment of the news headline enc...
1  Analyze the sentiment of the news headline enc...
2  Analyze the sentiment of the news headline enc...
3  Analyze the sentiment of the news headline enc...
4  Analyze the sentiment of the news headline enc.

In [11]:
train_data[0]



{'text': 'Analyze the sentiment of the news headline enclosed in square brackets, \n            determine if it is positive, neutral, or negative, and return the answer as \n            the corresponding sentiment label "positive" or "neutral" or "negative".\n\n            [The contractor of the shopping center , China State Construction Engineering Corporation , has previously built e.g. airports , hotels and factories for large international customers in different parts of the world .] = neutral'}
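
The sample above shows that the 12-space indentation of the triple-quoted f-string ends up inside the prompt (`\n            determine ...`). Below is a minimal sketch of an alternative that builds the same instruction without the embedded indentation. `INSTRUCTION` and `build_prompt` are hypothetical names, and because this changes the exact prompt text, the results reported in this notebook would not necessarily carry over.

```python
INSTRUCTION = (
    "Analyze the sentiment of the news headline enclosed in square brackets, "
    "determine if it is positive, neutral, or negative, and return the answer as "
    'the corresponding sentiment label "positive" or "neutral" or "negative".'
)

def build_prompt(text, sentiment=None):
    """Build a training prompt (with label) or a test prompt (label left open)."""
    answer = "" if sentiment is None else sentiment
    # strip() drops the trailing space of a test prompt, as in the notebook
    return f"{INSTRUCTION}\n\n[{text}] = {answer}".strip()

print(build_prompt("Operating profit rose to EUR 13.1 mn", "positive"))
print(build_prompt("Operating profit rose to EUR 13.1 mn"))
```

Using one function for both cases also keeps the training and test prompts from drifting apart if the instruction is edited later.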

In [12]:
def evaluate(y_true, y_pred):
    labels = ['positive', 'neutral', 'negative']
    mapping = {'positive': 2, 'neutral': 1, 'none':1, 'negative': 0}
    def map_func(x):
        return mapping.get(x, 1)
    
    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)

## Testing the model without fine-tuning

In [13]:
import warnings
warnings.filterwarnings("ignore")

# Restore the warning filters to their defaults
warnings.resetwarnings()

# If you used filterwarnings earlier, you do not have to remove it,
# but you can delete or comment out that line if you prefer.


In [14]:

# Start timing
start_time = time.time()

# Local path to the model checkpoint (raw string, so the backslashes are not read as escapes)
model_name = r"E:\gemma\gemma-transformers-1.1-7b-it-v1"

compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',  # use 'auto' instead of an explicit device if none is defined
    torch_dtype=compute_dtype,
    quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1

max_seq_length = 512  # 2048
tokenizer = AutoTokenizer.from_pretrained(model_name, max_seq_length=max_seq_length)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Stop timing
end_time = time.time()
execution_time = end_time - start_time

print(f"Execution time: {execution_time:.2f} seconds")

loading configuration file E:\gemma\gemma-transformers-1.1-7b-it-v1\config.json
Model config GemmaConfig {
  "_name_or_path": "E:\\gemma\\gemma-transformers-1.1-7b-it-v1",
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu_pytorch_tanh",
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 24576,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 16,
  "num_hidden_layers": 28,
  "num_key_value_heads": 16,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "torch_dtype": "float16",
  "transformers_version": "4.40.0",
  "use_cache": true,
  "vocab_size": 256000
}

loading weights file E:\gemma\gemma-transformers-1.1-7b-it-v1\model.safetensors.index.json
Instantiating GemmaForCa

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

All model checkpoint weights were used when initializing GemmaForCausalLM.

All the weights of GemmaForCausalLM were initialized from the model checkpoint at E:\gemma\gemma-transformers-1.1-7b-it-v1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GemmaForCausalLM for predictions without further training.
loading configuration file E:\gemma\gemma-transformers-1.1-7b-it-v1\generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0
}

loading file tokenizer.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json


Execution time: 14.19 seconds


In [15]:

# Print the tokenizer details
print(f"Tokenizer : {tokenizer}")
print(f"Tokenizer pad_token_id: {tokenizer.pad_token_id}")
print(f"Tokenizer eos_token_id: {tokenizer.eos_token_id}")

Tokenizer : GemmaTokenizerFast(name_or_path='E:\gemma\gemma-transformers-1.1-7b-it-v1', vocab_size=256000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<bos>', 'eos_token': '<eos>', 'unk_token': '<unk>', 'pad_token': '<eos>', 'additional_special_tokens': ['<start_of_turn>', '<end_of_turn>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<eos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<bos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<mask>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	5: AddedToken("<2mass>", rs

In [16]:
def predict(X_test, model, tokenizer):
    # Build the generation pipeline once instead of once per prompt
    pipe = pipeline(task="text-generation", 
                    model=model, 
                    tokenizer=tokenizer, 
                    max_new_tokens = 1, 
                    temperature = 0.0,
                   )
    y_pred = []
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["text"]
        result = pipe(prompt)
        # Keep only the text that follows the last "=" (the generated answer)
        answer = result[0]['generated_text'].split("=")[-1]
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            # No recognizable label in the generated text
            y_pred.append("none")
    return y_pred
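
The label-extraction step inside `predict()` can be tried without loading a model. Below is a minimal sketch that mirrors its logic: take the text after the last `=`, then look for a label substring. `extract_label` is a hypothetical helper name, and the sample strings are made up for illustration.

```python
def extract_label(generated_text):
    """Map generated text to a label the same way predict() does."""
    answer = generated_text.split("=")[-1]  # keep only what follows the last "="
    # Same priority order as the if/elif chain in predict()
    for label in ("positive", "negative", "neutral"):
        if label in answer:
            return label
    return "none"  # nothing recognizable was generated

print(extract_label("[Profit rose 20%] = positive"))
print(extract_label("[Profit rose 20%] = "))
print(extract_label("[a = b] = neu"))
```

The last two calls return `none`: an empty or partial answer contains no full label. This matters because `predict()` generates a single new token, so if the tokenizer happens to split a label into several tokens, the answer would be cut short and fall through to `none`.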

In [17]:
print(1)


1


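At this point, we are ready to test the base model and see how it performs on our problem without any fine-tuning. (The notebook was written for Llama 3 8b-chat-hf; the checkpoint actually loaded above is a Gemma 1.1 7B instruction-tuned model, as the loading logs show.) This gives us insight into the model itself and establishes a baseline.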

In [18]:
y_pred = predict(X_test, model, tokenizer)

100%|██████████| 970/970 [01:56<00:00,  8.35it/s]


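In the following cell, we evaluate the results. The baseline performs poorly: every one of the 970 test prompts ends up in the neutral class, so accuracy (0.594) equals the share of neutral headlines, and recall for positive and negative is zero. Keep in mind that evaluate() maps any unrecognized answer ("none") to neutral, so this outcome may reflect answers that could not be parsed from the single generated token rather than genuine neutral predictions.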

In [19]:
evaluate(y_true, y_pred)

Accuracy: 0.594
Accuracy for label 0: 0.000
Accuracy for label 1: 1.000
Accuracy for label 2: 0.000

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       121
           1       0.59      1.00      0.75       576
           2       0.00      0.00      0.00       273

    accuracy                           0.59       970
   macro avg       0.20      0.33      0.25       970
weighted avg       0.35      0.59      0.44       970


Confusion Matrix:
[[  0 121   0]
 [  0 576   0]
 [  0 273   0]]
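
The 0.594 accuracy above is exactly the share of neutral headlines in the test set (576 of 970), and `evaluate()` maps every unrecognized answer, including `'none'`, to the neutral class. Below is a minimal sketch, using only the label counts printed earlier, showing that a predictor whose outputs are never parsed gets the same score. The all-`'none'` prediction list is an assumption for illustration, not a record of what the model actually produced.

```python
# Same mapping as in evaluate(): unknown answers fall back to 1 (neutral)
mapping = {"positive": 2, "neutral": 1, "none": 1, "negative": 0}

# Test-set label counts as printed in the notebook
y_true = ["neutral"] * 576 + ["positive"] * 273 + ["negative"] * 121
y_pred = ["none"] * len(y_true)  # hypothetical: no answer could be parsed

true_ids = [mapping.get(label, 1) for label in y_true]
pred_ids = [mapping.get(label, 1) for label in y_pred]

accuracy = sum(t == p for t, p in zip(true_ids, pred_ids)) / len(true_ids)
print(f"Accuracy: {accuracy:.3f}")
```

To tell the two cases apart, one option is to count how many entries of `y_pred` equal `'none'` before calling `evaluate()`.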




## Fine-tuning

In [20]:
from sklearn.metrics import (accuracy_score, 
                             recall_score, 
                             precision_score, 
                             f1_score)

from transformers import EarlyStoppingCallback, IntervalStrategy

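def compute_metrics(p):    
    # Note: not passed to the trainer below (compute_metrics is commented out there).
    pred, labels = p
    pred = np.argmax(pred, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    # average="macro": the default binary averaging raises an error with three classes
    recall = recall_score(y_true=labels, y_pred=pred, average="macro")
    precision = precision_score(y_true=labels, y_pred=pred, average="macro")
    f1 = f1_score(y_true=labels, y_pred=pred, average="macro")    
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}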

In [21]:
output_dir="trained_weigths"

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
)

training_arguments = TrainingArguments(
    output_dir=output_dir,                    # directory where checkpoints and the final model are saved
    num_train_epochs=5,                       # number of training epochs
    per_device_train_batch_size=1,            # batch size per device during training
    gradient_accumulation_steps=8,            # micro-batches accumulated before each optimizer update
    gradient_checkpointing=True,              # use gradient checkpointing to save memory
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,                         # log every 25 steps
    learning_rate=2e-4,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    # report_to="tensorboard",                  # report metrics to tensorboard
    #evaluation_strategy="steps",              # evaluate every eval_steps
    #load_best_model_at_end = True,
    #eval_steps = 25,
    #metric_for_best_model = 'accuracy',
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    },
    #compute_metrics=compute_metrics,
    #callbacks = [EarlyStoppingCallback(early_stopping_patience=3)],
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
PyTorch: setting up devices
PyTorch: setting up devices


Map:   0%|          | 0/2909 [00:00<?, ? examples/s]

Map:   0%|          | 0/967 [00:00<?, ? examples/s]

Using auto half precision backend
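
The step count and learning rates in the training log that follows can be reproduced from the arguments above. This is a sketch under stated assumptions: one optimizer update per 8 examples with the incomplete last group of each epoch dropped, a warmup length rounded up from `warmup_ratio`, and a half-cosine decay after warmup. It is meant to illustrate the schedule, not to replace the library's scheduler, and `lr_at` is a hypothetical helper.

```python
import math

num_examples = 2909   # training rows
grad_accum = 8        # gradient_accumulation_steps (batch size per device is 1)
epochs = 5
base_lr = 2e-4
warmup_ratio = 0.03

steps_per_epoch = num_examples // grad_accum          # optimizer updates per epoch
total_steps = steps_per_epoch * epochs                # total optimizer updates
warmup_steps = math.ceil(total_steps * warmup_ratio)  # linear warmup length

def lr_at(step):
    """Learning rate after `step` optimizer updates."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

print(total_steps, warmup_steps)
for step in (25, 50, 75):
    print(step, lr_at(step))
```

Under these assumptions the total comes to 1,815 steps, and the values at steps 25, 50, and 75 agree with the first three `learning_rate` entries in the log.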


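The following cell trains the model with trainer.train(); a later cell saves it to the trained_weigths directory (the output_dir set above). The notebook was written with Kaggle's standard P100 GPU in mind, where training was expected to be quite fast; the run logged below took 27,712 seconds (about 7.7 hours) for 1,815 optimization steps.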

In [22]:

# Train model
trainer.train()

Debug start _inner_training_loop :****
Currently training with a batch size of: 1
***** Running training *****
  Num examples = 2,909
  Num Epochs = 5
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 8
  Total optimization steps = 1,815
  Number of trainable parameters = 200,015,872


start def train(self, *args, **kwargs):
start super().train(*args, **kwargs)
Train! inner:


  0%|          | 0/1815 [00:00<?, ?it/s]

 Epoch : 0


{'loss': 4.4819, 'grad_norm': 1.4594395160675049, 'learning_rate': 9.090909090909092e-05, 'epoch': 0.07}
{'loss': 1.2559, 'grad_norm': 1.1500604152679443, 'learning_rate': 0.00018181818181818183, 'epoch': 0.14}
{'loss': 1.0858, 'grad_norm': 0.6350141167640686, 'learning_rate': 0.0001999362825656992, 'epoch': 0.21}
{'loss': 1.0584, 'grad_norm': 0.6502953171730042, 'learning_rate': 0.00019967756964555045, 'epoch': 0.28}
{'loss': 0.9899, 'grad_norm': 0.6841915845870972, 'learning_rate': 0.00019922039361395185, 'epoch': 0.34}
{'loss': 0.9924, 'grad_norm': 0.6516330242156982, 'learning_rate': 0.00019856566473163746, 'epoch': 0.41}
{'loss': 0.9792, 'grad_norm': 0.8412651419639587, 'learning_rate': 0.00019771468659711595, 'epoch': 0.48}
{'loss': 0.9978, 'grad_norm': 0.5898518562316895, 'learning_rate': 0.00019666915355113975, 'epoch': 0.55}
{'loss': 0.9538, 'grad_norm': 0.6696980595588684, 'learning_rate': 0.0001954311473031864, 'epoch': 0.62}
{'loss': 0.9592, 'grad_norm': 0.7401172518730164,

 Epoch : 0


{'loss': 0.8302, 'grad_norm': 0.6910934448242188, 'learning_rate': 0.00018412535328311814, 'epoch': 1.03}
{'loss': 0.7418, 'grad_norm': 0.6784517765045166, 'learning_rate': 0.0001816298010041806, 'epoch': 1.1}
{'loss': 0.7193, 'grad_norm': 0.5864966511726379, 'learning_rate': 0.0001789717196391916, 'epoch': 1.17}
{'loss': 0.7052, 'grad_norm': 0.6511955261230469, 'learning_rate': 0.00017615640156335712, 'epoch': 1.24}
{'loss': 0.692, 'grad_norm': 0.7553870677947998, 'learning_rate': 0.00017318945221817255, 'epoch': 1.31}
{'loss': 0.7156, 'grad_norm': 0.5481001734733582, 'learning_rate': 0.00017007677895070357, 'epoch': 1.38}
{'loss': 0.7152, 'grad_norm': 0.5327635407447815, 'learning_rate': 0.00016682457925175763, 'epoch': 1.44}
{'loss': 0.762, 'grad_norm': 0.545109212398529, 'learning_rate': 0.00016343932841636456, 'epoch': 1.51}
{'loss': 0.7089, 'grad_norm': 0.5970158576965332, 'learning_rate': 0.0001599277666511347, 'epoch': 1.58}
{'loss': 0.7543, 'grad_norm': 0.6380588412284851, 'le

 Epoch : 0


{'loss': 0.4552, 'grad_norm': 0.6802445650100708, 'learning_rate': 0.00013242551492328875, 'epoch': 2.06}
{'loss': 0.4299, 'grad_norm': 0.8375889658927917, 'learning_rate': 0.00012817325568414297, 'epoch': 2.13}
{'loss': 0.4199, 'grad_norm': 0.5536882877349854, 'learning_rate': 0.00012386490205984488, 'epoch': 2.2}
{'loss': 0.4298, 'grad_norm': 0.6665387153625488, 'learning_rate': 0.00011950903220161285, 'epoch': 2.27}
{'loss': 0.4437, 'grad_norm': 0.8988458514213562, 'learning_rate': 0.00011511431886790407, 'epoch': 2.34}
{'loss': 0.4399, 'grad_norm': 0.6777801513671875, 'learning_rate': 0.00011068951215651132, 'epoch': 2.41}
{'loss': 0.4447, 'grad_norm': 0.6529946327209473, 'learning_rate': 0.00010624342208267292, 'epoch': 2.48}
{'loss': 0.4497, 'grad_norm': 0.5233761668205261, 'learning_rate': 0.0001017849010378846, 'epoch': 2.54}
{'loss': 0.4283, 'grad_norm': 0.8006649613380432, 'learning_rate': 9.732282616433756e-05, 'epoch': 2.61}
{'loss': 0.4359, 'grad_norm': 0.6416663527488708,

 Epoch : 0


{'loss': 0.3575, 'grad_norm': 0.4743308424949646, 'learning_rate': 7.097153227455379e-05, 'epoch': 3.03}
{'loss': 0.2284, 'grad_norm': 0.5421464443206787, 'learning_rate': 6.673151176035762e-05, 'epoch': 3.09}
{'loss': 0.231, 'grad_norm': 0.5631994009017944, 'learning_rate': 6.25577304985103e-05, 'epoch': 3.16}
{'loss': 0.2399, 'grad_norm': 0.5290246605873108, 'learning_rate': 5.845849869981137e-05, 'epoch': 3.23}
{'loss': 0.2403, 'grad_norm': 0.4836066961288452, 'learning_rate': 5.4441978143287066e-05, 'epoch': 3.3}
{'loss': 0.2371, 'grad_norm': 0.5625641345977783, 'learning_rate': 5.051616592567323e-05, 'epoch': 3.37}
{'loss': 0.2474, 'grad_norm': 0.8049702048301697, 'learning_rate': 4.668887853878896e-05, 'epoch': 3.44}
{'loss': 0.237, 'grad_norm': 0.6854503750801086, 'learning_rate': 4.296773630650358e-05, 'epoch': 3.51}
{'loss': 0.2262, 'grad_norm': 0.5659397840499878, 'learning_rate': 3.9360148212284475e-05, 'epoch': 3.58}
{'loss': 0.2339, 'grad_norm': 0.551835298538208, 'learnin

 Epoch : 0


{'loss': 0.1706, 'grad_norm': 0.5493859648704529, 'learning_rate': 1.7857922487950874e-05, 'epoch': 4.06}
{'loss': 0.1554, 'grad_norm': 0.39107540249824524, 'learning_rate': 1.5395482812393514e-05, 'epoch': 4.13}
{'loss': 0.1534, 'grad_norm': 0.3682272434234619, 'learning_rate': 1.3101495034123313e-05, 'epoch': 4.19}
{'loss': 0.1641, 'grad_norm': 0.5411484837532043, 'learning_rate': 1.0980526599494733e-05, 'epoch': 4.26}
{'loss': 0.1575, 'grad_norm': 0.48541510105133057, 'learning_rate': 9.036800464548157e-06, 'epoch': 4.33}
{'loss': 0.1575, 'grad_norm': 0.5195474624633789, 'learning_rate': 7.2741866868895395e-06, 'epoch': 4.4}
{'loss': 0.1546, 'grad_norm': 0.4063849151134491, 'learning_rate': 5.696194720208792e-06, 'epoch': 4.47}
{'loss': 0.1498, 'grad_norm': 0.5368598699569702, 'learning_rate': 4.305966426779118e-06, 'epoch': 4.54}
{'loss': 0.164, 'grad_norm': 0.7164586186408997, 'learning_rate': 3.1062698218492724e-06, 'epoch': 4.61}
{'loss': 0.1616, 'grad_norm': 0.502795934677124, 



Training completed. Do not forget to share your model on huggingface.co/models =)




{'train_runtime': 27712.4243, 'train_samples_per_second': 0.525, 'train_steps_per_second': 0.065, 'train_loss': 0.5570154850804773, 'epoch': 4.99}


TrainOutput(global_step=1815, training_loss=0.5570154850804773, metrics={'train_runtime': 27712.4243, 'train_samples_per_second': 0.525, 'train_steps_per_second': 0.065, 'total_flos': 5.91609879430656e+16, 'train_loss': 0.5570154850804773, 'epoch': 4.99140598143692})
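
The trainable-parameter count reported in the training log (200,015,872) can be checked by hand from the LoRA settings and the model config printed earlier. Below is a minimal sketch, assuming each targeted linear layer has the usual attention and MLP shapes implied by that config, and that LoRA adds two matrices of rank `r` per layer. `lora_params` and `target_shapes` are hypothetical names.

```python
# Values taken from the printed GemmaConfig and the LoraConfig
hidden_size = 3072
intermediate_size = 24576
num_layers = 28
num_heads = 16          # num_key_value_heads is also 16
head_dim = 256
rank = 64               # LoRA r

attn_dim = num_heads * head_dim  # width of the attention projections

# Assumed (in_features, out_features) of each targeted layer in one decoder block
target_shapes = {
    "q_proj": (hidden_size, attn_dim),
    "k_proj": (hidden_size, attn_dim),
    "v_proj": (hidden_size, attn_dim),
    "o_proj": (attn_dim, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

def lora_params(in_features, out_features, r):
    # LoRA adds A (r x in_features) and B (out_features x r)
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, rank) for i, o in target_shapes.values())
total = per_layer * num_layers
print(f"{total:,}")
```

The total matches the figure in the log, which suggests that only the LoRA adapter weights are being trained.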

The model and the tokenizer are saved to disk for later use.

In [23]:
# Save trained model and tokenizer
trainer.save_model()
tokenizer.save_pretrained(output_dir)

Saving model checkpoint to trained_weigths
loading configuration file E:\gemma\gemma-transformers-1.1-7b-it-v1\config.json
Model config GemmaConfig {
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu_pytorch_tanh",
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 24576,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 16,
  "num_hidden_layers": 28,
  "num_key_value_heads": 16,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.0",
  "use_cache": true,
  "vocab_size": 256000
}

tokenizer config file saved in trained_weigths\tokenizer_config.json
Special tokens file saved in trained_weigths\special_tokens_map.json
tokenizer config file saved in trained_weigths\tokenizer

('trained_weigths\\tokenizer_config.json',
 'trained_weigths\\special_tokens_map.json',
 'trained_weigths\\tokenizer.json')

Afterwards, you can load the TensorBoard extension and start TensorBoard, pointing it at the logs/runs directory (assumed to contain the training logs and checkpoints for your model), to see how the model fits during training.

## Testing

The following code first predicts the sentiment labels for the test set with predict(), then evaluates them with evaluate(). The result is a clear improvement over the baseline: overall accuracy is 0.869, and per-label recall is 0.860 for negative (label 0), 0.911 for neutral (label 1), and 0.784 for positive (label 2). Most of the remaining errors are confusions between neutral and positive (41 neutral headlines predicted as positive, 59 positive headlines predicted as neutral), so there is still room for improvement there. Even so, it is impressive how much can be done with little data and some fine-tuning.

In [24]:
y_pred = predict(X_test, model, tokenizer)
evaluate(y_true, y_pred)

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
100%|██████████| 970/970 [26:27<00:00,  1.64s/it]

Accuracy: 0.869
Accuracy for label 0: 0.860
Accuracy for label 1: 0.911
Accuracy for label 2: 0.784

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.86      0.89       121
           1       0.88      0.91      0.89       576
           2       0.84      0.78      0.81       273

    accuracy                           0.87       970
   macro avg       0.87      0.85      0.86       970
weighted avg       0.87      0.87      0.87       970


Confusion Matrix:
[[104  16   1]
 [ 10 525  41]
 [  0  59 214]]
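
The per-label figures above can be recomputed from the confusion matrix alone, which also makes clear that the "Accuracy for label k" lines printed by `evaluate()` are per-class recall (correct predictions divided by the number of true examples of that class). Below is a minimal sketch in plain Python; the label names follow the mapping used in `evaluate()` (0 = negative, 1 = neutral, 2 = positive).

```python
# Rows are true labels, columns are predicted labels (order: 0, 1, 2)
conf_matrix = [
    [104, 16, 1],
    [10, 525, 41],
    [0, 59, 214],
]
names = ["negative", "neutral", "positive"]

total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(len(names)))
accuracy = correct / total
print(f"accuracy: {accuracy:.3f}")

recalls, precisions = {}, {}
for i, name in enumerate(names):
    row_sum = sum(conf_matrix[i])                 # true examples of this class
    col_sum = sum(row[i] for row in conf_matrix)  # predictions of this class
    recalls[name] = conf_matrix[i][i] / row_sum
    precisions[name] = conf_matrix[i][i] / col_sum
    print(f"{name}: precision={precisions[name]:.2f} recall={recalls[name]:.3f}")
```

The recall values agree with the per-label accuracies printed above, and the precision values agree with the classification report.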





The following code creates a pandas DataFrame called evaluation containing the text, true labels, and predicted labels from the test set. This is especially useful for understanding the errors that the fine-tuned model makes and for getting insights on how to improve the prompt.

In [25]:
evaluation = pd.DataFrame({'text': X_test["text"], 
                           'y_true':y_true, 
                           'y_pred': y_pred},
                         )
evaluation.to_csv("test_predictions.csv", index=False)
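
One way to use the saved predictions for error analysis is to count the (true, predicted) pairs that disagree. Below is a minimal standard-library sketch, assuming the `y_true` and `y_pred` column names written above. The three sample rows are invented for illustration, and `count_errors` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

def count_errors(csv_file):
    """Count misclassified rows, grouped by (true label, predicted label)."""
    errors = Counter()
    for row in csv.DictReader(csv_file):
        if row["y_true"] != row["y_pred"]:
            errors[(row["y_true"], row["y_pred"])] += 1
    return errors

# Invented sample standing in for test_predictions.csv
sample = io.StringIO(
    "text,y_true,y_pred\n"
    "headline A,positive,neutral\n"
    "headline B,neutral,neutral\n"
    "headline C,positive,neutral\n"
)

for (true_label, pred_label), n in count_errors(sample).most_common():
    print(f"{true_label} -> {pred_label}: {n}")
```

To run it on the real file, open `test_predictions.csv` with `open(..., newline="", encoding="utf-8")` and pass the file object to `count_errors`.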

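The evaluation results compare well with simpler baselines, such as this Conv1D + bidirectional LSTM model: https://www.kaggle.com/code/lucamassaron/lstm-baseline-for-sentiment-analysis. Note that the baseline figures below come from a balanced test set of 900 headlines (300 per class), so the comparison with the 970-headline test set used here is indicative rather than exact.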

Here are the results of the baseline model:

Accuracy: 0.623
Accuracy for label 0: 0.620
Accuracy for label 1: 0.590
Accuracy for label 2: 0.660

Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.62      0.69       300
           1       0.61      0.59      0.60       300
           2       0.53      0.66      0.59       300

    accuracy                           0.62       900
   macro avg       0.64      0.62      0.63       900
weighted avg       0.64      0.62      0.63       900


Confusion Matrix:

[[186  39  75]
 [ 23 177 100]
 [ 27  75 198]]
 

With this testing, the fine-tuning has reached its conclusion. Don't forget to upvote if you find the notebook useful for your projects or work!