# ANLP Oct 2024 Project - Jason Ng

I write food reviews in my spare time and have created a [food Telegram bot](https://t.me/jasonthefoodie_bot) that lets users retrieve them. Currently, I use GPT-4o-mini to automatically assign a score to each review, but the scores are not very accurate.

The scope of this project is to fine-tune a small language model (LM) to analyse the sentiment of a given review and assign it a score from 1 to 5. Ideally, the fine-tuned model will do better than both GPT-4o-mini and the same small model without fine-tuning.

### 1. Import libraries and create train/test set

I scraped my food reviews from Burpple (https://www.burpple.com/@jasoneatfoodd/timeline); 988 reviews remain after removing duplicates.

In [None]:
!pip install pandas scikit-learn torch transformers accelerate ipywidgets unsloth bitsandbytes langchain_openai trl datasets tqdm

In [1]:
# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_openai import AzureChatOpenAI
from tqdm import tqdm

In [2]:
# load the dataset
df = pd.read_csv('reviews_data.csv')
df.head()

| | review | num_stars |
|---|---|---|
| 0 | Recently tried @toriyamasg 's yakitori and I m... | 4 |
| 1 | Finally tried @doqoosg 's famous mochi waffles... | 4 |
| 2 | Got myself a bowl of dry Ban Mian with meatbal... | 5 |
| 3 | Korean BBQ restaurants are in abundance here i... | 5 |
| 4 | If you have a spicy appetite, @xiaolongkansg w... | 3 |


In [3]:
len(df)

988

In [4]:
# split the dataset
train, test = train_test_split(df, test_size=0.2, random_state=42)

### 2. Test on baseline model and GPT-4o-mini

We first evaluate how accurately GPT-4o-mini and the baseline Llama-3.2-1B-Instruct model rate the reviews, measured against the ground-truth ratings.
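
A useful reference point for the accuracies in this section is a trivial predictor that always returns the most common rating. The following is a minimal sketch, not part of the run in this notebook; `majority_baseline` is a hypothetical helper and the labels in the example are made up for illustration.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    # Most common rating in the training split
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    # Fraction of test labels that equal that rating
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Made-up labels for illustration only
example_train = [4, 4, 5, 3, 4, 5, 2, 4]
example_test = [4, 5, 4, 3]
label, acc = majority_baseline(example_train, example_test)
print(label, acc)
```

With the notebook's data this would be called as `majority_baseline(train['num_stars'].tolist(), test['num_stars'].tolist())`.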

In [5]:
# system prompt and instruction to instruct the LLM
system_prompt = "You are a food critic. You will be provided written food reviews to rate."
instruction = """
Rate the food review on a scale between 1 to 5.
Only return the number of the rating.

Rating:"""

In [10]:
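# initialize the Azure OpenAI LLM (replace the placeholder endpoint and key with your own)
llm = AzureChatOpenAI(
    azure_endpoint="OPENAI_ENDPOINT",
    openai_api_type="azure",
    openai_api_key="OPENAI_KEY",
    azure_deployment="gpt-4o-mini",
    openai_api_version="2024-08-01-preview",
)

# function to query the LLM and return the text of its reply
def openai_generate(prompt):
    # invoke() replaces the deprecated direct call llm(prompt)
    return llm.invoke(prompt).content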

In [6]:
# initialize the llama LLM
model_id = "./Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

In [6]:
# function to build the chat prompt and generate a reply with a Llama pipeline
def llama_generate(pipe, input_text, system_prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": input_text},
    ]
    # use the pipeline's own tokenizer so the prompt always matches the model being evaluated
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    output = pipe(
        prompt,
        max_length=2048,
        return_full_text=False,
        pad_token_id=pipe.tokenizer.eos_token_id,
    )
    text = output[0]["generated_text"]

    return text

In [7]:
# function to rate reviews using an LLM and calculate accuracy against ground truth ratings.
def rate_reviews(df, llm_model, system_prompt, instruction, llm_type, pipe=None):
    """
    Rate reviews using an LLM and calculate accuracy against ground truth ratings.

    Parameters:
    df (pd.DataFrame): The dataframe containing 'review' and 'num_stars' columns.
    llm_model: Callable that generates a rating (openai_generate or llama_generate).
    system_prompt (str): The system prompt that sets the LLM's role.
    instruction (str): The rating instruction appended to each review.
    llm_type (str): Either 'openai' or 'llama'.
    pipe: The text-generation pipeline, required when llm_type is 'llama'.

    Returns:
    df (pd.DataFrame): A copy of the input with an added 'llm_ratings' column.
    accuracy (float): The accuracy score comparing LLM ratings to ground truth ratings.
    """
    # Validate up front: a ValueError raised inside the retry loop below would be swallowed
    if llm_type not in ('openai', 'llama'):
        raise ValueError(f"Unknown llm_type: {llm_type}")

    # Work on a copy so each call returns its own dataframe instead of overwriting the input
    df = df.copy()

    # Create a list to hold LLM-generated ratings
    llm_ratings = []

    # Iterate over each review in the dataframe
    for _, row in tqdm(df.iterrows(), total=len(df)):
        # Retry until the reply can be converted to an integer
        while True:
            try:
                # Extract the review
                review = row['review']

                if llm_type == 'openai':
                    # Create a single prompt containing the system prompt, review and instruction
                    prompt = f"{system_prompt}\nReview: {review}\n{instruction}"

                    # Send the prompt to the LLM (the reply is expected to be a bare number)
                    llm_generated_rating = llm_model(prompt)

                else:
                    llm_generated_rating = llm_model(pipe, review + "\n" + instruction, system_prompt)

                llm_generated_rating = int(llm_generated_rating)
                break

            except ValueError:
                continue

        # Append the generated rating to the list
        llm_ratings.append(llm_generated_rating)

    # Add the LLM-generated ratings to the dataframe
    df['llm_ratings'] = llm_ratings

    # Calculate accuracy using the ground truth and the generated ratings
    accuracy = accuracy_score(df['num_stars'], df['llm_ratings'])

    return df, accuracy
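
The retry loop in `rate_reviews` keeps asking until the reply converts to an integer, which can loop indefinitely if a model never returns a bare number. A possible alternative, sketched here under the assumption that the first standalone digit from 1 to 5 in the reply is the intended rating, is to extract the digit with a regular expression and cap the number of attempts. `parse_rating` and `rate_with_retries` are hypothetical helpers, and the fake generator stands in for `openai_generate` or `llama_generate`.

```python
import re


def parse_rating(text):
    """Return the first standalone digit 1-5 found in text, or None."""
    match = re.search(r"(?<!\d)[1-5](?!\d)", text)
    return int(match.group()) if match else None


def rate_with_retries(generate_fn, prompt, max_retries=3):
    """Call generate_fn up to max_retries times until a rating can be parsed."""
    for _ in range(max_retries):
        rating = parse_rating(generate_fn(prompt))
        if rating is not None:
            return rating
    return None  # the caller decides how to handle an unparseable reply


# Fake generator for illustration: the first reply has no rating, the second does
replies = iter(["I cannot decide.", "Rating: 4"])
result = rate_with_retries(lambda prompt: next(replies), "some review")
print(result)
```

This parser is more lenient than `int()`, so accuracies measured with it would not be directly comparable to the ones reported in this notebook.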

In [15]:
# accuracy score for GPT-4o-mini
openai_df, openai_acc = rate_reviews(test, openai_generate, system_prompt, instruction, 'openai')
print(openai_acc)

100%|██████████| 198/198 [17:34<00:00,  5.33s/it]

0.7575757575757576





In [29]:
# accuracy score for llama
llama_df, llama_acc = rate_reviews(test, llama_generate, system_prompt, instruction, 'llama', pipe)
print(llama_acc)

100%|██████████| 198/198 [01:14<00:00,  2.64it/s]

0.31313131313131315





The results above show that the 1B model (31.3% accuracy) performs much worse than GPT-4o-mini (75.8%). Let's see if we can close the gap by fine-tuning it on some of the review data and the corresponding ground-truth ratings.
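
Accuracy treats a prediction of 4 for a 5-star review the same as a prediction of 1. Because the ratings are ordinal, it may also be worth looking at how far off the predictions are. The sketch below computes mean absolute error and within-one accuracy in plain Python; the function names and the example ratings are illustrative assumptions, not results from this notebook.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions at most one star away from the truth."""
    return sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= 1) / len(y_true)


# Made-up ratings for illustration only
truth = [5, 4, 3, 5, 2]
preds = [4, 4, 1, 5, 3]
print(mean_absolute_error(truth, preds))
print(within_one_accuracy(truth, preds))
```

On the dataframes returned by `rate_reviews`, the inputs would be the `num_stars` and `llm_ratings` columns.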

### 3. Fine-tune Llama 3.2 using QLoRA via Unsloth

In [8]:
# import libraries
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import FastLanguageModel, is_bfloat16_supported

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [41]:
# initialise the 4-bit quantised Llama 3.2 model
max_seq_length = 2048
ft_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

==((====))==  Unsloth 2024.9.post4: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA GeForce RTX 4090 Laptop GPU. Max memory: 15.6 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


In [42]:
# attach LoRA adapters to the quantised model
ft_model = FastLanguageModel.get_peft_model(
    ft_model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"], 
    use_rslora=True,
    use_gradient_checkpointing="unsloth"
)
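
The adapter size implied by this configuration can be checked by hand. A LoRA adapter on a linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` parameters. The sketch below assumes the published Llama 3.2 1B dimensions (hidden size 2048, MLP size 8192, 16 layers, 8 key/value heads of dimension 64); these values are not stated in this notebook, so treat them as assumptions. Under them, the total matches the trainable parameter count that Unsloth prints in the training log.

```python
import math

# Assumed Llama 3.2 1B dimensions (from the published model config, not from this notebook)
HIDDEN = 2048        # hidden size
INTERMEDIATE = 8192  # MLP size
KV_DIM = 8 * 64      # key/value heads * head dimension
NUM_LAYERS = 16
R = 16               # LoRA rank, as passed to get_peft_model
LORA_ALPHA = 16

# (input size, output size) of each targeted projection
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapter adds an (r x d_in) matrix and a (d_out x r) matrix
per_layer = sum(R * (d_in + d_out) for d_in, d_out in target_shapes.values())
trainable_params = per_layer * NUM_LAYERS
print(f"{trainable_params:,}")  # 11,272,192

# With use_rslora=True the adapter output is scaled by alpha / sqrt(r) instead of alpha / r
print(LORA_ALPHA / math.sqrt(R), LORA_ALPHA / R)  # 4.0 1.0
```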

In [43]:
# setup and format training datasets
from datasets import Dataset

TEMPLATE = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{answer}<|eot_id|>"""

def format_dataset(df, system_prompt):
    review_ls = df['review'].tolist()
    rating_ls = df['num_stars'].tolist()

    # Create a list to store the formatted text
    formatted_data = []

    # Iterate over the reviews and ratings
    for review, rating in zip(review_ls, rating_ls):
        # Format the template with the current review and rating
        formatted_text = TEMPLATE.format(
            context=system_prompt,
            question=review,
            answer=str(rating)  # Ensure rating is converted to string
        )
        # Append the formatted text to the list
        formatted_data.append({"text": formatted_text})

    # Convert the list of dictionaries into a Dataset object
    dataset = Dataset.from_list(formatted_data)
    
    return dataset

ft_train = format_dataset(train, system_prompt)
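
Note that `format_dataset` puts only the review in the user turn, whereas `rate_reviews` sends the review followed by the instruction at evaluation time. Whether aligning the two would change the results has not been tested here. A minimal sketch of a variant that builds the training text with the same user message as evaluation follows; `build_training_text` is a hypothetical helper, the example review is made up, and the template and prompts are repeated so the block runs on its own.

```python
# Same Llama 3 chat layout as TEMPLATE in the notebook
TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "{context}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "{answer}<|eot_id|>"
)

system_prompt = "You are a food critic. You will be provided written food reviews to rate."
instruction = """
Rate the food review on a scale between 1 to 5.
Only return the number of the rating.

Rating:"""


def build_training_text(review, rating, system_prompt, instruction):
    """Format one example so the user turn matches what rate_reviews sends."""
    user_message = review + "\n" + instruction  # same concatenation as in rate_reviews
    return TEMPLATE.format(context=system_prompt, question=user_message, answer=str(rating))


# Made-up review for illustration only
text = build_training_text("Great noodles, a bit pricey.", 4, system_prompt, instruction)
print(text)
```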

In [44]:
# fine-tune the model
trainer = SFTTrainer(
    model=ft_model,
    tokenizer=tokenizer,
    train_dataset=ft_train,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=9e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        num_train_epochs=15,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        seed=42,
    ),
)

trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 73 | Num Epochs = 15
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 2
\        /    Total batch size = 16 | Total steps = 75
 "-____-"     Number of trainable parameters = 11,272,192


  0%|          | 0/75 [00:00<?, ?it/s]

{'loss': 3.5708, 'grad_norm': 5.317651748657227, 'learning_rate': 9e-05, 'epoch': 0.2}
{'loss': 3.559, 'grad_norm': 5.036423206329346, 'learning_rate': 0.00018, 'epoch': 0.4}
{'loss': 3.3079, 'grad_norm': 3.9521358013153076, 'learning_rate': 0.00027, 'epoch': 0.6}
{'loss': 3.0138, 'grad_norm': 2.1832804679870605, 'learning_rate': 0.00036, 'epoch': 0.8}
{'loss': 2.8699, 'grad_norm': 2.3651890754699707, 'learning_rate': 0.00045, 'epoch': 1.0}
{'loss': 2.758, 'grad_norm': 1.5491721630096436, 'learning_rate': 0.00054, 'epoch': 1.2}
{'loss': 2.6457, 'grad_norm': 1.6231796741485596, 'learning_rate': 0.0006299999999999999, 'epoch': 1.4}
{'loss': 2.6013, 'grad_norm': 1.5870009660720825, 'learning_rate': 0.00072, 'epoch': 1.6}
{'loss': 2.6139, 'grad_norm': 1.6099863052368164, 'learning_rate': 0.00081, 'epoch': 1.8}
{'loss': 2.5652, 'grad_norm': 1.6665542125701904, 'learning_rate': 0.0009, 'epoch': 2.0}
{'loss': 2.4407, 'grad_norm': 1.3268182277679443, 'learning_rate': 0.0008861538461538462, 'ep

TrainOutput(global_step=75, training_loss=1.4254746309916178, metrics={'train_runtime': 345.068, 'train_samples_per_second': 3.173, 'train_steps_per_second': 0.217, 'total_flos': 1.677999904653312e+16, 'train_loss': 1.4254746309916178, 'epoch': 15.0})

The fine-tuning run above took just under 6 minutes (345 s) and used about 14.7 GB of GPU memory (VRAM). I trained the model on my laptop's RTX 4090 GPU, which has 16 GB of VRAM.
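
The step count and learning rates in the log above can be reproduced from the training arguments. Because `packing=True` concatenates the training reviews into sequences of up to `max_seq_length` tokens, the trainer sees 73 packed examples rather than one example per review. The sketch below assumes the usual Hugging Face Trainer conventions (optimizer updates per epoch equal the number of batches integer-divided by the accumulation steps, and a linear schedule with warmup); the helper name is hypothetical.

```python
import math

NUM_EXAMPLES = 73   # packed sequences reported by Unsloth
BATCH_SIZE = 8
GRAD_ACCUM = 2
EPOCHS = 15
BASE_LR = 9e-4
WARMUP_STEPS = 10

# Batches per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
updates_per_epoch = max(batches_per_epoch // GRAD_ACCUM, 1)
total_steps = updates_per_epoch * EPOCHS
print(total_steps)  # 75


def linear_lr(step):
    """Learning rate logged at a given optimizer step (linear warmup, then linear decay)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * max(0.0, (total_steps - step) / (total_steps - WARMUP_STEPS))


for step in (1, 10, 11):
    print(step, linear_lr(step))
```

These values line up with the logged `learning_rate` of 9e-05 at the first step, 0.0009 at step 10 and roughly 0.000886 at step 11.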

In [45]:
# save model weights
ft_model.save_pretrained_merged("./Llama-3.2-1B-Instruct-ft", tokenizer, save_method="merged_16bit")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 36.17 out of 62.53 RAM for saving.


100%|██████████| 16/16 [00:00<00:00, 135.10it/s]

Unsloth: Saving tokenizer...




 Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


### 4. Test the new fine-tuned model

In [46]:
max_seq_length = 2048
ft_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./Llama-3.2-1B-Instruct-ft",
    max_seq_length=max_seq_length,
    dtype=None,
)
FastLanguageModel.for_inference(ft_model)

ft_pipe = pipeline(
    "text-generation",
    model=ft_model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

ft_llama_df, ft_llama_acc = rate_reviews(test, llama_generate, system_prompt, instruction, 'llama', ft_pipe)
print(ft_llama_acc)

==((====))==  Unsloth 2024.9.post4: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA GeForce RTX 4090 Laptop GPU. Max memory: 15.6 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


100%|██████████| 198/198 [00:12<00:00, 15.39it/s]

0.6616161616161617





I tweaked several hyperparameters (learning rate and number of epochs) to reach the accuracy above (66.2%), which is about 10 percentage points below GPT-4o-mini (75.8%) and more than double the accuracy of the baseline 1B model (31.3%). I believe much more can be done in the future, once I have more training samples (which means more eating and reviewing on my side, hahaha) as well as more compute to train larger models. With more data points and a larger model, I believe the fine-tuned model can outperform GPT-4o-mini. I consider the current performance quite a feat, given that only about 790 reviews (the 80% training split) were used for training.
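
With 198 test reviews, each accuracy carries some sampling uncertainty, which matters when comparing models. The following sketch uses the normal approximation to a binomial proportion to put a rough 95% interval around each score; the correct counts (150, 62 and 131) are back-calculated from the reported accuracies, and `accuracy_interval` is a hypothetical helper.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for an accuracy."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Correct counts back-calculated from the reported accuracies on 198 test reviews
results = {"GPT-4o-mini": 150, "Llama 1B baseline": 62, "Llama 1B fine-tuned": 131}
for name, correct in results.items():
    p, low, high = accuracy_interval(correct, 198)
    print(f"{name}: {p:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Since all three models are scored on the same test set, a paired comparison would be more precise; this is only a rough check.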