## Fine-tune Phi-3 for Sentiment Analysis

In [19]:
# !pip install -q -U torch --index-url https://download.pytorch.org/whl/cu117
# !pip install -q -U -i https://pypi.org/simple/ bitsandbytes
# !pip install -q -U transformers=="4.40.0"
# !pip install -q -U accelerate
# !pip install -q -U datasets
# !pip install -q -U trl
# !pip install -q -U peft
# !pip install -q -U tensorboard
# !pip install -q -U einops

The code imports the os module and sets two environment variables:
* CUDA_VISIBLE_DEVICES: Controls which GPUs are visible to CUDA applications, PyTorch included. Setting it to 0 exposes only the first GPU.
* TOKENIZERS_PARALLELISM: Controls whether the Hugging Face tokenizers library parallelizes tokenization. Setting it to false disables parallel tokenization.

In [20]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

The next cell imports the warnings module and calls warnings.filterwarnings("ignore"), which suppresses all warnings. Fine-tuning emits many warnings that do not stop training but are distracting and can make you doubt whether the setup is correct.

In [21]:
import warnings
warnings.filterwarnings("ignore")
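
Silencing everything is convenient, but it also hides warnings that may matter. As an alternative, here is a minimal sketch (not part of the original notebook) that ignores a single warning category inside a limited scope. The `noisy_step` function is a hypothetical stand-in for a library call made during training.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a library call that emits warnings."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    warnings.warn("loss is NaN", RuntimeWarning)
    return "done"


# record=True collects the warnings that pass the filters instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from "show everything"
    warnings.filterwarnings("ignore", category=DeprecationWarning)  # drop one category
    result = noisy_step()

# Only the warning that was not filtered out is left.
remaining = [str(w.message) for w in caught]
print(remaining)
```

The filters are restored when the `with` block exits, so the rest of the notebook keeps its previous warning behavior.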

The following cell contains the remaining imports needed to run the notebook.

In [22]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split

In [23]:
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset

filename = "all-data.csv"

# Start timing the data read
start_time_read = time.time()

df = pd.read_csv(filename, 
                 names=["sentiment", "text"],
                 encoding="utf-8", encoding_errors="replace")

end_time_read = time.time()
print(f"Time to read data: {end_time_read - start_time_read:.2f} seconds")


X_train = list()
X_test = list()
X_eval = list()

# Start timing the dataset split
start_time_split = time.time()

for sentiment in ["positive", "neutral", "negative"]:
    # Split this class into train+eval (80%) and test (20%)
    train_val, test = train_test_split(df[df.sentiment == sentiment], 
                                       train_size=0.8,  # 80% for train and eval
                                       random_state=42)
    
    # Split train+eval into train and eval
    eval_size = int(len(train_val) * 0.25)  # 25% of 80% is 20% of the total
    train, eval_split = train_test_split(train_val, 
                                         test_size=eval_size, 
                                         random_state=42)
    
    # Add the splits to the lists
    X_train.append(train)
    X_eval.append(eval_split)
    X_test.append(test)

# Concatenate the per-class splits
X_train = pd.concat(X_train).sample(frac=1, random_state=10).reset_index(drop=True)
X_eval = pd.concat(X_eval).reset_index(drop=True)
X_test = pd.concat(X_test).reset_index(drop=True)


# Count the labels in each split
train_label_counts = X_train['sentiment'].value_counts()
eval_label_counts = X_eval['sentiment'].value_counts()
test_label_counts = X_test['sentiment'].value_counts()

# Print the label statistics
print("Label counts in the training set:")
print(train_label_counts)
print("\nLabel counts in the evaluation set:")
print(eval_label_counts)
print("\nLabel counts in the test set:")
print(test_label_counts)

# Start timing prompt creation
start_time_prompt = time.time()


Time to read data: 0.02 seconds
Label counts in the training set:
neutral     1728
positive     818
negative     363
Name: sentiment, dtype: int64

Label counts in the evaluation set:
neutral     575
positive    272
negative    120
Name: sentiment, dtype: int64

Label counts in the test set:
neutral     576
positive    273
negative    121
Name: sentiment, dtype: int64
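
The split sizes printed above can be checked by hand. The sketch below redoes the arithmetic in plain Python, assuming that a fractional `train_size` is rounded down and that the test split receives the remainder. That rounding rule is an assumption about `train_test_split`, checked here only against the printed counts. The per-class totals are the sums of the three printed counts for each label.

```python
import math

# Per-class totals, obtained by summing the three printed counts for each label.
class_totals = {
    "neutral": 1728 + 575 + 576,
    "positive": 818 + 272 + 273,
    "negative": 363 + 120 + 121,
}


def split_sizes(n, train_val_frac=0.8, eval_frac=0.25):
    """Return (train, eval, test) sizes for one class with n rows."""
    n_train_val = math.floor(train_val_frac * n)  # assumed: fractional train_size rounds down
    n_test = n - n_train_val                      # the test split gets the remainder
    n_eval = int(n_train_val * eval_frac)         # same expression as eval_size in the cell
    n_train = n_train_val - n_eval
    return n_train, n_eval, n_test


sizes = {label: split_sizes(n) for label, n in class_totals.items()}
for label, (n_train, n_eval, n_test) in sizes.items():
    print(f"{label:9s} train={n_train} eval={n_eval} test={n_test}")
```

Because the split is done per label, each set keeps roughly the same class proportions as the full dataset: about 60% of every class goes to training and 20% each to evaluation and test.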


In [24]:
# filename = "../input/sentiment-analysis-for-financial-news/all-data.csv"

# df = pd.read_csv(filename, 
#                  names=["sentiment", "text"],
#                  encoding="utf-8", encoding_errors="replace")

# X_train = list()
# X_test = list()
# for sentiment in ["positive", "neutral", "negative"]:
#     train, test  = train_test_split(df[df.sentiment==sentiment], 
#                                     train_size=300,
#                                     test_size=300, 
#                                     random_state=42)
#     X_train.append(train)
#     X_test.append(test)

# X_train = pd.concat(X_train).sample(frac=1, random_state=10)
# X_test = pd.concat(X_test)

# eval_idx = [idx for idx in df.index if idx not in list(train.index) + list(test.index)]
# X_eval = df[df.index.isin(eval_idx)]
# X_eval = (X_eval
#           .groupby('sentiment', group_keys=False)
#           .apply(lambda x: x.sample(n=50, random_state=10, replace=True)))
# X_train = X_train.reset_index(drop=True)

def generate_prompt(data_point

# def generate_prompt(data_point):
#     return f"""
#             Analyze the sentiment of the news headline enclosed in square brackets, 
#             determine if it is positive, neutral, or negative, and return the answer as 
#             the corresponding sentiment label "positive" or "neutral" or "negative".

#             [{data_point["text"]}] = {data_point["sentiment"]}
#             """.strip()

# def generate_test_prompt(data_point):
#     return f"""
#             Analyze the sentiment of the news headline enclosed in square brackets, 
#             determine if it is positive, neutral, or negative, and return the answer as 
#             the corresponding sentiment label "positive" or "neutral" or "negative".

#             [{data_point["text"]}] = """.strip()
            
X_train = pd.DataFrame(X_train.apply(generate_prompt, axis=1), 
                       columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_prompt, axis=1), 
                      columns=["text"])

y_true = X_test.sentiment
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)
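
The active definitions of the two prompt functions are cut off in this export, so their exact wording is unknown. The sketch below is purely hypothetical: it only illustrates the general pattern of a training prompt that ends with the label and a test prompt that stops right before it. The marker string is taken from the `split("The correct option is")` call in `predict()` later in the notebook; the instruction text and the function names `sketch_train_prompt` and `sketch_test_prompt` are invented for illustration.

```python
# Marker taken from the split() call in predict(); everything else here is hypothetical.
ANSWER_MARKER = "The correct option is"


def _sketch_instruction(text):
    """Hypothetical instruction shared by the training and the test prompt."""
    return (
        "Classify the sentiment of the news headline in square brackets "
        "as positive, neutral, or negative.\n"
        f"[{text}]\n"
        f"{ANSWER_MARKER}"
    )


def sketch_train_prompt(data_point):
    """Training prompt: instruction followed by the gold label."""
    return f"{_sketch_instruction(data_point['text'])} {data_point['sentiment']}"


def sketch_test_prompt(data_point):
    """Test prompt: same instruction, label left for the model to complete."""
    return _sketch_instruction(data_point["text"])


example = {"text": "Operating profit rose to EUR 13.1 mn", "sentiment": "positive"}
train_prompt = sketch_train_prompt(example)
test_prompt = sketch_test_prompt(example)
print(train_prompt)
```

Keeping the two prompts identical up to the marker means the model sees at inference time exactly the prefix it was trained to complete.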

Next we create a function to evaluate the results from our fine-tuned sentiment model. The function performs the following steps:

1. Maps the sentiment labels to a numerical representation, where 2 represents positive, 1 represents neutral, and 0 represents negative. Predictions without a recognizable label ("none") are counted as neutral.
2. Calculates the accuracy of the model on the test data.
3. Generates an accuracy report for each sentiment label.
4. Generates a classification report for the model.
5. Generates a confusion matrix for the model.

In [25]:
def evaluate(y_true, y_pred):
    labels = ['positive', 'neutral', 'negative']
    mapping = {'positive': 2, 'neutral': 1, 'none':1, 'negative': 0}
    def map_func(x):
        return mapping.get(x, 1)
    
    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)
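
To make the label mapping and the confusion matrix layout concrete, here is a small worked example in plain Python, without scikit-learn. The four labels are made up for illustration; the mapping is the one used in `evaluate()`.

```python
mapping = {"positive": 2, "neutral": 1, "none": 1, "negative": 0}

# Made-up labels; "none" stands for a prediction where no label was found.
y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "positive", "negative", "none"]

true_ids = [mapping.get(label, 1) for label in y_true]
pred_ids = [mapping.get(label, 1) for label in y_pred]

# Overall accuracy: share of positions where both ids agree.
accuracy = sum(t == p for t, p in zip(true_ids, pred_ids)) / len(true_ids)

# Rows are true labels, columns are predicted labels, both ordered 0, 1, 2.
conf_matrix = [[0] * 3 for _ in range(3)]
for t, p in zip(true_ids, pred_ids):
    conf_matrix[t][p] += 1

print(accuracy)
print(conf_matrix)
```

Note that the "none" prediction in the last position counts as correct, because it is folded into the neutral class before scoring.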

Next we load the model, microsoft/Phi-3-mini-4k-instruct (about 3.8 billion parameters, instruction-tuned, 4k-token context window), from the Hugging Face Hub and quantize it to 4 bits.

Model loading and quantization:

* First the code sets the name of the Phi-3 model to load from the Hugging Face Hub.
* Then the code gets the float16 data type from the torch library. This is the data type that will be used for the computations.
* Next, it creates a BitsAndBytesConfig object with the following settings:
    1. load_in_4bit: Load the model weights in 4-bit format.
    2. bnb_4bit_use_double_quant: Set to False, so double quantization is not used. (Double quantization also quantizes the quantization constants and saves roughly an additional 0.4 bits per parameter.)
    3. bnb_4bit_quant_type: Use the "nf4" quantization type. 4-bit NormalFloat (NF4) is a data type that is information-theoretically optimal for normally distributed weights.
    4. bnb_4bit_compute_dtype: Use the float16 data type for computations.
* Then the code creates an AutoModelForCausalLM object from the pre-trained Phi-3 language model, using the BitsAndBytesConfig object for quantization.
* After that, the code disables the key/value cache (use_cache = False), which is not needed during training.
* Finally, the code sets pretraining_tp to 1. This attribute is a tensor-parallelism setting that originates from Llama-style configurations, where 1 selects the standard computation path; it likely has no effect on Phi-3.

Tokenizer loading:

* First, the code loads the tokenizer for the Phi-3 language model.
* Then it sets the padding token to be the end-of-sequence (EOS) token.
* The cell does not set padding_side, so the tokenizer keeps its default padding side.
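
For a rough sense of what the quantization settings buy, the following back-of-the-envelope sketch compares weight storage at 16 and 4 bits per parameter. It assumes about 3.8 billion parameters for Phi-3-mini and ignores everything else: quantization constants, layers that stay unquantized, activations, and optimizer state. The numbers are therefore only order-of-magnitude estimates.

```python
N_PARAMS = 3.8e9  # assumption: approximate parameter count of Phi-3-mini


def weight_gigabytes(bits_per_param, n_params=N_PARAMS):
    """Storage for n_params weights at the given bit width, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_gigabytes(16)   # half-precision weights
nf4_gb = weight_gigabytes(4)     # 4-bit weights, constants not counted
# What the roughly 0.4 bits per parameter from double quantization would add up to.
double_quant_saving_gb = weight_gigabytes(0.4)

print(f"fp16: {fp16_gb:.2f} GB, nf4: {nf4_gb:.2f} GB, "
      f"double-quant saving: {double_quant_saving_gb:.2f} GB")
```

Under these assumptions, the saving from double quantization is small next to the step from 16 to 4 bits, which is consistent with leaving it disabled on a GPU that already fits the 4-bit model.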

In [26]:
model_name = "microsoft/Phi-3-mini-4k-instruct" # microsoft/Phi-3-mini-4k-instruct | microsoft/Phi-3-mini-128k-instruct

compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1

max_seq_length = 2048
tokenizer = AutoTokenizer.from_pretrained(model_name, 
                                          trust_remote_code=True,
                                          max_seq_length=max_seq_length,
                                         )
tokenizer.pad_token = tokenizer.eos_token

loading configuration file config.json from cache at C:\Users\huyinit\.cache\huggingface\hub\models--microsoft--Phi-3-mini-4k-instruct\snapshots\0a67737cc96d2554230f90338b163bc6380a2a85\config.json
Model config Phi3Config {
  "_name_or_path": "microsoft/Phi-3-mini-4k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-4k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-4k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

All model checkpoint weights were used when initializing Phi3ForCausalLM.

All the weights of Phi3ForCausalLM were initialized from the model checkpoint at microsoft/Phi-3-mini-4k-instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Phi3ForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at C:\Users\huyinit\.cache\huggingface\hub\models--microsoft--Phi-3-mini-4k-instruct\snapshots\0a67737cc96d2554230f90338b163bc6380a2a85\generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": [
    32000,
    32001,
    32007
  ],
  "pad_token_id": 32000
}

loading file tokenizer.model from cache at C:\Users\huyinit\.cache\huggingface\hub\models--microsoft--Phi-3-mini-4k-instruct\snapshots\0a67737cc96d2554230f90338b163bc6380a2a85\tokenizer.model
loading file tokenizer.json from cache at C:\Users\huyinit\.cache\huggingface\hub\models--microsoft-

In the next cell, we define a function for predicting the sentiment of a news headline using the Phi-3 language model. The function takes three arguments:

* X_test: A Pandas DataFrame whose text column holds one prompt per news headline.
* model: The Phi-3 language model.
* tokenizer: The tokenizer for the Phi-3 language model.

The function works as follows:

1. For each news headline in the test DataFrame:
    * Read the prompt, which asks the model to analyze the sentiment of the headline and return the corresponding sentiment label.
    * Use the pipeline() function from the Hugging Face Transformers library to generate text from the language model, using the prompt.
    * Extract the predicted sentiment label from the generated text.
    * Append the predicted sentiment label to the y_pred list.
2. Return the y_pred list.

The pipeline() function from the Hugging Face Transformers library wraps the model and tokenizer for generation; the task argument selects text generation. The max_new_tokens argument is set to 3, which limits the output to three new tokens, enough for a single label. The temperature argument controls the randomness of sampled text: lower values give more predictable output, higher values more varied output. It is set to 0.0 here with the aim of deterministic predictions.

The if/elif chain looks at the text after the marker "The correct option is" and checks for the word "positive", then "negative", then "neutral", using the first one it finds. If none of them appears, the prediction is "none", which evaluate() later counts as neutral.

In [27]:
def predict(X_test, model, tokenizer):
    y_pred = []
    # Build the pipeline once instead of once per headline; the outputs are the same
    pipe = pipeline(task="text-generation", 
                    model=model, 
                    tokenizer=tokenizer,
                    max_new_tokens=3, 
                    temperature=0.0,
                   )
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["text"]
        result = pipe(prompt, pad_token_id=pipe.tokenizer.eos_token_id)
        # Keep only the text generated after the answer marker
        answer = result[0]['generated_text'].split("The correct option is")[-1].lower()
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred
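
The label extraction can be tried in isolation, without loading a model. The sketch below copies the parsing logic of `predict()` into a hypothetical helper named `extract_label` and runs it on hand-written strings.

```python
MARKER = "The correct option is"


def extract_label(generated_text):
    """Same parsing steps as in predict(): split on the marker, then keyword checks."""
    answer = generated_text.split(MARKER)[-1].lower()
    for label in ("positive", "negative", "neutral"):  # same order as the if/elif chain
        if label in answer:
            return label
    return "none"


# Marker present: only the text after it is inspected.
with_marker = extract_label("[Sales fell 10%] The correct option is Negative.")
# Marker missing: the whole text is inspected, so words from the prompt itself can match.
without_marker = extract_label("Is it positive, neutral, or negative? neutral")
# No label word after the marker.
no_label = extract_label("[Sales fell 10%] The correct option is unclear")

print(with_marker, without_marker, no_label)
```

The second case shows why the marker matters: if it is absent from the generated text, the first keyword in the chain that occurs anywhere in the prompt decides the prediction.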

At this point, we are ready to test the Phi-3 model and see how it performs on our problem without any fine-tuning. This allows us to get insights on the model itself and establish a baseline.

In [28]:
y_pred = predict(X_test, model, tokenizer)

100%|██████████| 970/970 [02:26<00:00,  6.60it/s]


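In the following cell, we evaluate the results. Without fine-tuning, the model reaches an overall accuracy of 0.708. It handles negative (recall 0.97) and positive (recall 0.86) headlines well but recovers only 58% of the neutral ones: the confusion matrix shows 209 of the 576 neutral headlines labeled as positive, which also pulls precision for the positive class down to 0.53.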

In [29]:
evaluate(y_true, y_pred)

Accuracy: 0.708
Accuracy for label 0: 0.967
Accuracy for label 1: 0.582
Accuracy for label 2: 0.861

Classification Report:
              precision    recall  f1-score   support

           0       0.73      0.97      0.83       121
           1       0.92      0.58      0.71       576
           2       0.53      0.86      0.65       273

    accuracy                           0.71       970
   macro avg       0.73      0.80      0.73       970
weighted avg       0.79      0.71      0.71       970


Confusion Matrix:
[[117   3   1]
 [ 32 335 209]
 [ 12  26 235]]
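
The figures in the classification report can be recomputed from the confusion matrix alone. A minimal sketch in plain Python, using the matrix printed above (rows are true labels, columns are predictions, in the order negative, neutral, positive):

```python
labels = ["negative", "neutral", "positive"]  # ids 0, 1, 2
conf_matrix = [
    [117, 3, 1],
    [32, 335, 209],
    [12, 26, 235],
]


def recall(cm, i):
    """Correct predictions for class i divided by all true members of class i (row sum)."""
    return cm[i][i] / sum(cm[i])


def precision(cm, i):
    """Correct predictions for class i divided by all predictions of class i (column sum)."""
    return cm[i][i] / sum(row[i] for row in cm)


total = sum(sum(row) for row in conf_matrix)
accuracy = sum(conf_matrix[i][i] for i in range(3)) / total

for i, name in enumerate(labels):
    print(f"{name:8s} precision={precision(conf_matrix, i):.2f} "
          f"recall={recall(conf_matrix, i):.2f}")
print(f"accuracy={accuracy:.3f}")
```

The per-label "accuracy" lines printed by `evaluate()` are the same quantity as recall here, since they are computed only over the rows whose true label is the one in question.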


In [30]:
import re

def get_num_layers(model):
    numbers = set()
    for name, _ in model.named_parameters():
        for number in re.findall(r'\d+', name):
            numbers.add(int(number))
    return max(numbers)

def get_last_layer_linears(model):
    names = []
    
    num_layers = get_num_layers(model)
    for name, module in model.named_modules():
        if str(num_layers) in name and "encoder" not in name:
            if isinstance(module, torch.nn.Linear):
                names.append(name)
    return names

In [31]:
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.00,
    bias="none",
    task_type="CAUSAL_LM",
)

training_arguments = TrainingArguments(
    output_dir="logs",
    num_train_epochs=5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8, # 4
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    # report_to="tensorboard",
    evaluation_strategy="epoch"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    args=training_arguments,
    packing=False,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
PyTorch: setting up devices
PyTorch: setting up devices


Map:   0%|          | 0/2909 [00:00<?, ? examples/s]

Map:   0%|          | 0/967 [00:00<?, ? examples/s]

Using auto half precision backend
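
The training log further down reports 25,165,824 trainable parameters. That number can be reproduced by hand from the LoRA rank and the layer shapes. The sketch below assumes that each Phi-3-mini decoder block contains four linear layers with fused query/key/value and gate/up projections, and that `target_modules="all-linear"` leaves out the output head. The module names and shapes are assumptions based on the `hidden_size`, `intermediate_size`, and `num_hidden_layers` values in the config dump, not something read from the loaded model.

```python
RANK = 16                 # r in LoraConfig
HIDDEN = 3072             # hidden_size from the config dump
INTERMEDIATE = 8192       # intermediate_size from the config dump
NUM_LAYERS = 32           # num_hidden_layers from the config dump

# Assumed (in_features, out_features) of the linear layers in one decoder block.
linear_shapes = {
    "qkv_proj": (HIDDEN, 3 * HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_up_proj": (HIDDEN, 2 * INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds two matrices per layer: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in linear_shapes.values())
total_trainable = per_layer * NUM_LAYERS
print(f"{total_trainable:,}")
```

With `bias="none"` no bias terms are trained, so the adapter matrices account for all trainable parameters.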


The following code trains the model with the trainer.train() method and then saves the trained LoRA adapter to the trained-model directory. In the run logged below, training took about 37 minutes (a train_runtime of 2217.7 seconds) for 1,815 optimization steps.

In [32]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained("trained-model")

Debug start _inner_training_loop :****
Currently training with a batch size of: 1


start def train(self, *args, **kwargs):
start super().train(*args, **kwargs)


***** Running training *****
  Num examples = 2,909
  Num Epochs = 5
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 8
  Total optimization steps = 1,815
  Number of trainable parameters = 25,165,824


Train! inner:


  0%|          | 0/1815 [00:00<?, ?it/s]

 Epoch : 0


{'loss': 2.3711, 'grad_norm': 1.1371639966964722, 'learning_rate': 9.090909090909092e-05, 'epoch': 0.07}
{'loss': 1.0468, 'grad_norm': 0.3748987019062042, 'learning_rate': 0.00018181818181818183, 'epoch': 0.14}
{'loss': 1.255, 'grad_norm': 0.6462491750717163, 'learning_rate': 0.0001999362825656992, 'epoch': 0.21}
{'loss': 0.8784, 'grad_norm': 0.24578849971294403, 'learning_rate': 0.00019967756964555045, 'epoch': 0.28}
{'loss': 1.2216, 'grad_norm': 0.3148752450942993, 'learning_rate': 0.00019922039361395185, 'epoch': 0.34}
{'loss': 0.8682, 'grad_norm': 0.806699812412262, 'learning_rate': 0.00019856566473163746, 'epoch': 0.41}
{'loss': 1.2044, 'grad_norm': 0.2879883944988251, 'learning_rate': 0.00019771468659711595, 'epoch': 0.48}
{'loss': 0.85, 'grad_norm': 0.2478295862674713, 'learning_rate': 0.00019666915355113975, 'epoch': 0.55}
{'loss': 1.1717, 'grad_norm': 0.30512911081314087, 'learning_rate': 0.0001954311473031864, 'epoch': 0.62}
{'loss': 0.8679, 'grad_norm': 0.21259626746177673, 

***** Running Evaluation *****
  Num examples = 967
  Batch size = 8


  0%|          | 0/121 [00:00<?, ?it/s]

 Epoch : 0


{'eval_loss': 1.1154446601867676, 'eval_runtime': 25.7391, 'eval_samples_per_second': 37.569, 'eval_steps_per_second': 4.701, 'epoch': 1.0}
{'loss': 1.0307, 'grad_norm': 0.2757444679737091, 'learning_rate': 0.00018412535328311814, 'epoch': 1.03}
{'loss': 0.9318, 'grad_norm': 0.41296064853668213, 'learning_rate': 0.0001816298010041806, 'epoch': 1.1}
{'loss': 0.8573, 'grad_norm': 0.40550878643989563, 'learning_rate': 0.0001789717196391916, 'epoch': 1.17}
{'loss': 0.899, 'grad_norm': 0.37900838255882263, 'learning_rate': 0.00017615640156335712, 'epoch': 1.24}
{'loss': 0.8873, 'grad_norm': 0.49943000078201294, 'learning_rate': 0.00017318945221817255, 'epoch': 1.31}
{'loss': 0.9255, 'grad_norm': 0.4308318793773651, 'learning_rate': 0.00017007677895070357, 'epoch': 1.38}
{'loss': 0.8626, 'grad_norm': 0.4493044912815094, 'learning_rate': 0.00016682457925175763, 'epoch': 1.44}
{'loss': 0.8881, 'grad_norm': 0.38317495584487915, 'learning_rate': 0.00016343932841636456, 'epoch': 1.51}
{'loss': 0.

***** Running Evaluation *****
  Num examples = 967
  Batch size = 8


  0%|          | 0/121 [00:00<?, ?it/s]

 Epoch : 0


{'eval_loss': 1.112160325050354, 'eval_runtime': 25.6825, 'eval_samples_per_second': 37.652, 'eval_steps_per_second': 4.711, 'epoch': 2.0}
{'loss': 0.8308, 'grad_norm': 0.5146345496177673, 'learning_rate': 0.00013242551492328875, 'epoch': 2.06}
{'loss': 0.6085, 'grad_norm': 0.5415874719619751, 'learning_rate': 0.00012817325568414297, 'epoch': 2.13}
{'loss': 0.8133, 'grad_norm': 0.7221730351448059, 'learning_rate': 0.00012386490205984488, 'epoch': 2.2}
{'loss': 0.6338, 'grad_norm': 0.5762335062026978, 'learning_rate': 0.00011950903220161285, 'epoch': 2.27}
{'loss': 0.805, 'grad_norm': 0.5438617467880249, 'learning_rate': 0.00011511431886790407, 'epoch': 2.34}
{'loss': 0.6485, 'grad_norm': 0.5894894003868103, 'learning_rate': 0.00011068951215651132, 'epoch': 2.41}
{'loss': 0.8023, 'grad_norm': 0.6199808120727539, 'learning_rate': 0.00010624342208267292, 'epoch': 2.48}
{'loss': 0.6392, 'grad_norm': 0.7406930923461914, 'learning_rate': 0.0001017849010378846, 'epoch': 2.54}
{'loss': 0.8147,

***** Running Evaluation *****
  Num examples = 967
  Batch size = 8


  0%|          | 0/121 [00:00<?, ?it/s]

 Epoch : 0


{'eval_loss': 1.186699390411377, 'eval_runtime': 25.4879, 'eval_samples_per_second': 37.94, 'eval_steps_per_second': 4.747, 'epoch': 3.0}
{'loss': 0.7117, 'grad_norm': 0.7239420413970947, 'learning_rate': 7.097153227455379e-05, 'epoch': 3.03}
{'loss': 0.6017, 'grad_norm': 0.8368927836418152, 'learning_rate': 6.673151176035762e-05, 'epoch': 3.09}
{'loss': 0.5575, 'grad_norm': 0.9011900424957275, 'learning_rate': 6.25577304985103e-05, 'epoch': 3.16}
{'loss': 0.6033, 'grad_norm': 0.7783979773521423, 'learning_rate': 5.845849869981137e-05, 'epoch': 3.23}
{'loss': 0.5325, 'grad_norm': 1.1938267946243286, 'learning_rate': 5.4441978143287066e-05, 'epoch': 3.3}
{'loss': 0.5795, 'grad_norm': 0.8630821704864502, 'learning_rate': 5.051616592567323e-05, 'epoch': 3.37}
{'loss': 0.5493, 'grad_norm': 1.153014063835144, 'learning_rate': 4.668887853878896e-05, 'epoch': 3.44}
{'loss': 0.5999, 'grad_norm': 1.2040904760360718, 'learning_rate': 4.296773630650358e-05, 'epoch': 3.51}
{'loss': 0.5396, 'grad_n

***** Running Evaluation *****
  Num examples = 967
  Batch size = 8


  0%|          | 0/121 [00:00<?, ?it/s]

 Epoch : 0


{'eval_loss': 1.3089641332626343, 'eval_runtime': 25.4864, 'eval_samples_per_second': 37.942, 'eval_steps_per_second': 4.748, 'epoch': 4.0}
{'loss': 0.583, 'grad_norm': 0.9099628925323486, 'learning_rate': 1.7857922487950874e-05, 'epoch': 4.06}
{'loss': 0.402, 'grad_norm': 0.6713342666625977, 'learning_rate': 1.5395482812393514e-05, 'epoch': 4.13}
{'loss': 0.4869, 'grad_norm': 1.2813866138458252, 'learning_rate': 1.3101495034123313e-05, 'epoch': 4.19}
{'loss': 0.3981, 'grad_norm': 0.7933583855628967, 'learning_rate': 1.0980526599494733e-05, 'epoch': 4.26}
{'loss': 0.5256, 'grad_norm': 1.1893237829208374, 'learning_rate': 9.036800464548157e-06, 'epoch': 4.33}
{'loss': 0.4105, 'grad_norm': 0.863466739654541, 'learning_rate': 7.2741866868895395e-06, 'epoch': 4.4}
{'loss': 0.5251, 'grad_norm': 1.2117359638214111, 'learning_rate': 5.696194720208792e-06, 'epoch': 4.47}
{'loss': 0.3955, 'grad_norm': 0.8434198498725891, 'learning_rate': 4.305966426779118e-06, 'epoch': 4.54}
{'loss': 0.5242, 'g

***** Running Evaluation *****
  Num examples = 967
  Batch size = 8


  0%|          | 0/121 [00:00<?, ?it/s]



Training completed. Do not forget to share your model on huggingface.co/models =)




{'eval_loss': 1.4060009717941284, 'eval_runtime': 25.4832, 'eval_samples_per_second': 37.947, 'eval_steps_per_second': 4.748, 'epoch': 4.99}
{'train_runtime': 2217.7317, 'train_samples_per_second': 6.559, 'train_steps_per_second': 0.818, 'train_loss': 0.7530069692732546, 'epoch': 4.99}
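
The step count and the learning rates in the log above follow from the training arguments. The sketch below recomputes them in plain Python, assuming linear warmup followed by a cosine decay to zero, with the number of warmup steps rounded up. These formulas are an assumption about the scheduler, checked here only against the logged values at steps 25, 50, and 75.

```python
import math

NUM_EXAMPLES = 2909
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
EPOCHS = 5
PEAK_LR = 2e-4
WARMUP_RATIO = 0.03

# One optimizer step per GRAD_ACCUM batches; leftover batches in an epoch are dropped.
steps_per_epoch = (NUM_EXAMPLES // PER_DEVICE_BATCH) // GRAD_ACCUM
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step):
    """Learning rate after `step` optimizer steps (assumed warmup + cosine schedule)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_steps, warmup_steps)
for step in (25, 50, 75):  # logging_steps=25
    print(step, lr_at(step))
```

The effective batch size is 1 x 8 = 8 examples per optimizer step, which matches the "Total train batch size" line in the log.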


loading configuration file config.json from cache at C:\Users\huyinit\.cache\huggingface\hub\models--microsoft--Phi-3-mini-4k-instruct\snapshots\0a67737cc96d2554230f90338b163bc6380a2a85\config.json
Model config Phi3Config {
  "_name_or_path": "Phi-3-mini-4k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-4k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-4k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "phi3",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "original_max_position_embeddings": 4096,
  "pad_token_id": 32000,
  "resid_pdrop": 0.0,
  "rms_norm_eps": 1e-

Afterwards, loading the TensorBoard extension and starting TensorBoard, pointed at the logs/runs directory (assumed to contain the training logs for your model), lets you see how the model fits during training. Both commands are commented out in the next cell.

In [33]:
# %load_ext tensorboard
# %tensorboard --logdir logs/runs
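
Even without TensorBoard, the per-epoch `eval_loss` values printed in the training log can be inspected directly. The following sketch copies them from the log and finds the epoch with the lowest evaluation loss. The last evaluation is logged at epoch 4.99 and is listed here as epoch 5.

```python
# eval_loss values copied from the training log, keyed by epoch.
eval_loss = {
    1: 1.1154446601867676,
    2: 1.112160325050354,
    3: 1.186699390411377,
    4: 1.3089641332626343,
    5: 1.4060009717941284,
}

best_epoch = min(eval_loss, key=eval_loss.get)
# Epochs after the best one in which the evaluation loss was higher than the best value.
worse_after_best = [e for e in sorted(eval_loss)
                    if e > best_epoch and eval_loss[e] > eval_loss[best_epoch]]

print(best_epoch, worse_after_best)
```

The evaluation loss is lowest after the second epoch and rises afterwards while the training loss keeps falling, a pattern that usually points to overfitting. Whether fewer epochs would also give better test accuracy is not something these logs can answer, since the loss is computed over the whole prompt and not only the label.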

The following code first predicts the sentiment labels for the test set using the predict() function and then evaluates the model's performance with the evaluate() function. Overall accuracy rises from 0.708 to 0.869, with precision and recall between 0.79 and 0.91 for every label. Neutral headlines are now recognized best (recall 0.91), while positive ones are the weakest class (recall 0.79, mostly confused with neutral). It is impressive how much could be done with little data and some fine-tuning.

In [34]:
y_pred = predict(X_test, model, tokenizer)
evaluate(y_true, y_pred)

100%|██████████| 970/970 [02:27<00:00,  6.59it/s]

Accuracy: 0.869
Accuracy for label 0: 0.843
Accuracy for label 1: 0.910
Accuracy for label 2: 0.795

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.84      0.87       121
           1       0.88      0.91      0.89       576
           2       0.83      0.79      0.81       273

    accuracy                           0.87       970
   macro avg       0.87      0.85      0.86       970
weighted avg       0.87      0.87      0.87       970


Confusion Matrix:
[[102  18   1]
 [ 10 524  42]
 [  2  54 217]]





The following code creates a Pandas DataFrame called evaluation containing the text, true labels, and predicted labels from the test set, and saves it to test_predictions.csv. This is especially useful for understanding the errors that the fine-tuned model makes and for getting insights on how to improve the prompt.

In [35]:
evaluation = pd.DataFrame({'text': X_test["text"], 
                           'y_true':y_true, 
                           'y_pred': y_pred},
                         )
evaluation.to_csv("test_predictions.csv", index=False)
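
Once the predictions are saved, the misclassified rows can be grouped by the kind of mistake. A minimal sketch using only the standard library; the four rows are made up and merely follow the column layout of test_predictions.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the same column layout as test_predictions.csv.
sample_csv = """text,y_true,y_pred
headline A,neutral,positive
headline B,positive,positive
headline C,neutral,positive
headline D,negative,neutral
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Keep only the rows where the prediction differs from the true label.
errors = [row for row in rows if row["y_true"] != row["y_pred"]]

# Count each (true label, predicted label) pair to see which confusion dominates.
confusions = Counter((row["y_true"], row["y_pred"]) for row in errors)
most_common_pair, most_common_count = confusions.most_common(1)[0]

print(most_common_pair, most_common_count)
```

To run this on the real file, replace `io.StringIO(sample_csv)` with an open file handle for test_predictions.csv.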