<a href="https://colab.research.google.com/github/kkrusere/youTube-comments-Analyzer/blob/main/SAnalysis_on_YT_comments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>  **Sentiment Analysis and Explanation Generation with BART and LoRA**</center>

This notebook fine-tunes a BART model (`facebook/bart-large-cnn`) with LoRA for parameter-efficient adaptation. The model learns to generate a `sentiment label` and an `explanation` of that sentiment for each YouTube comment, based on the channel name, the video's title and description, and the comment text. It uses the Hugging Face Trainer API for training and evaluation, and pushes the resulting adapter to the Hugging Face Hub.


### **1. Setup and Dependencies Installation**
The first section installs necessary libraries:
- `bitsandbytes`: for efficient 8-bit optimization of models (especially useful for large models like BART).
- `accelerate`: helps optimize training for multiple devices (e.g., GPUs).
- `trl`, `peft`: for task-specific fine-tuning using techniques like LoRA.
- `datasets`, `evaluate`, `rouge-score`: for data management and evaluation metrics.
- `huggingface_hub`: for interacting with Hugging Face's model hub.

In [None]:
%%shell
pip install bitsandbytes
pip install accelerate
pip install trl peft
pip install datasets
pip install rouge-score
pip install evaluate
pip install huggingface_hub
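
After the install cell finishes, it can help to confirm which versions were actually installed, since APIs in these libraries change between releases. The snippet below is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib import metadata

# Distribution names as passed to pip in the install cell.
REQUIRED = [
    "bitsandbytes", "accelerate", "trl", "peft",
    "datasets", "rouge-score", "evaluate", "huggingface_hub",
]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Print one line per package so missing installs stand out.
for name, version in installed_versions(REQUIRED).items():
    print(f"{name:16} {version or 'NOT INSTALLED'}")
```

Recording these versions next to the results makes the run easier to reproduce later.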


---

### **2. Import Libraries**
We will import all the required libraries for data manipulation, model training, and evaluation.


In [None]:
import re
import json
import random
import time

import evaluate
import numpy as np
import pandas as pd

from sklearn.model_selection import KFold, train_test_split

import torch
import torch.nn as nn
from datasets import Dataset
from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model

from torch.nn import CrossEntropyLoss
from tqdm import tqdm
import transformers
from transformers import (
                            AutoModelForCausalLM,
                            AutoTokenizer,
                            BartForConditionalGeneration,
                            BartTokenizer,
                            BitsAndBytesConfig,
                            EarlyStoppingCallback,
                            logging,
                            pipeline,
                            Trainer,
                            TrainingArguments,
)


import warnings
warnings.filterwarnings("ignore")

---

### **3. Mount Google Drive and Hugging Face Login**

- `from google.colab import drive, userdata`: imports the Colab modules for mounting Google Drive and reading stored secrets.
- `from huggingface_hub import login`: imports the login function used to authenticate with the Hugging Face Hub.
- `drive.mount('/content/drive')`: mounts Google Drive on the Colab virtual machine, making its files accessible from the notebook.
- `os.chdir("/content/drive/MyDrive/NLP_Data")`: changes the working directory to the `NLP_Data` folder in Google Drive.
- `huggingface_token = userdata.get('Hugging_Face_Hub_API_TOKEN')`: retrieves the Hugging Face Hub API token from Colab's user data storage.
- `login(huggingface_token, add_to_git_credential=True)`: logs the notebook into the Hugging Face Hub with the retrieved token and adds it to the Git credentials for future use.



In [None]:
from google.colab import drive, userdata
from huggingface_hub import login

import os
import json
#mounting google drive
drive.mount('/content/drive')

########################################

#changing the working directory
os.chdir("/content/drive/MyDrive/NLP_Data")

!pwd

huggingface_token = userdata.get('Hugging_Face_Hub_API_TOKEN')

#logging into huggingface
login(huggingface_token, add_to_git_credential=True)

Mounted at /content/drive
/content/drive/MyDrive/NLP_Data
Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful
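
The cell above only works inside Colab. As a hedged sketch of how the same token lookup could fall back to an environment variable elsewhere, consider the helper below. The function name `get_hf_token` and the choice of the `HF_TOKEN` variable are assumptions made for illustration; only the secret name comes from the notebook.

```python
import os

def get_hf_token(secret_name="Hugging_Face_Hub_API_TOKEN", env_var="HF_TOKEN"):
    """Return a Hub token from Colab secrets if available, else from the environment."""
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
        token = userdata.get(secret_name)
        if token:
            return token
    except Exception:
        # Not running on Colab, or the secret is missing or not shared with the notebook.
        pass
    # Fallback (assumption): read the token from an environment variable.
    return os.environ.get(env_var)

token = get_hf_token()
print("token found" if token else "no token available")
```

If a token is found, it can be passed to `login(token, add_to_git_credential=True)` exactly as in the cell above.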


---

### **4. Load and Explore Data**
Next, we will load the training data and inspect the first few rows to understand its structure. The data contains the following columns of interest:
- `channel_name`: Name of the channel
- `video_title`: Title of the YouTube video
- `video_description`: Description of the YouTube video
- `comment_text`: Comment on the video
- `Sentiment`: Sentiment label for the comment (e.g., Positive, Negative)
- `Explanation`: Explanation of the sentiment


In [None]:
df = pd.read_csv('/content/drive/MyDrive/NLP_Data/train_valid_data.csv')
df.head()


| | channel_name | video_title | video_description | comment_text | like_count | reply_count | Sentiment | Explanation |
|---|---|---|---|---|---|---|---|---|
| 0 | BBC | Can Cuttlefish camouflage in a living room? \| ... | The final episode of Richard Hammond’s Miracle... | The big white square on his back was impressiv... | 482.0 | 3.0 | Positive | The comment expresses admiration for the cuttl... |
| 1 | BBC | "I Cut Off My Own Arm To Save My Life" \| Pleas... | Pleasure is vital for our survival - without i... | The truck may be strong, but family is stronger | 0.0 | 0.0 | Positive | The comment highlights the importance and stre... |
| 2 | The Late Show with Stephen Colbert | Elon Musk Might Be A Super Villain | The Late Show with Stephen Colbert is broadcas... | His mentality is genius. The way people look a... | 820.0 | 22.0 | Positive | This comment expresses admiration for Elon Mus... |
| 3 | hbomberguy | Bloodborne Is Genius, And Here's Why | Channel: hbomberguy, Title: Bloodborne Is Geni... | "To my recollection, Ludwig's Holy Blade is th... | 82.0 | 2.0 | Neutral | This comment is purely informative, providing ... |
| 4 | Mentour Pilot | What REALLY Caused the Tenerife Airport Disast... | Mentour Pilot: What REALLY Caused the Tenerife... | Insurance fraud? | 0.0 | 0.0 | Negative/Speculative | The comment suggests a possible motive of insu... |
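
The preview shows that the `Sentiment` column is not limited to a fixed label set: one row carries the compound label `Negative/Speculative`. The sketch below counts the labels from the five preview rows and shows one possible normalization. Whether to normalize at all is a design decision the notebook does not make, and `primary_label` is a hypothetical helper.

```python
from collections import Counter

# Labels copied from the five preview rows above.
preview_labels = ["Positive", "Positive", "Positive", "Neutral", "Negative/Speculative"]

def primary_label(label):
    """Keep only the part before the first '/', e.g. 'Negative/Speculative' -> 'Negative'."""
    return label.split("/")[0].strip()

raw_counts = Counter(preview_labels)
normalized_counts = Counter(primary_label(label) for label in preview_labels)

print(raw_counts)         # counts with the compound label kept as-is
print(normalized_counts)  # counts after collapsing compound labels
```

On the full DataFrame, `df["Sentiment"].value_counts()` gives the same kind of overview.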


In [None]:
print(
    f"""
        Channel Name: {df['channel_name'][0]}
        Video Title: {df['video_title'][0]}
        Description: {df['video_description'][0]}
        Comment Text: {df['comment_text'][0]}
        \n
        Sentiment: {df['Sentiment'][0]}
        Explanation: {df['Explanation'][0]}
    """
)


        Channel Name: BBC
        Video Title: Can Cuttlefish camouflage in a living room? | Richard Hammond's Miracles of Nature - BBC
        Description: The final episode of Richard Hammond’s Miracles Of Nature. Richard is once again investigating the extraordinary super-powers of the animal kingdom. Cuttlefish survive by being able to blend into their surroundings through camouflage. Richard Hammond puts this to the test and experiments if the fish are able to camouflage in a tank set up like a living room.
        Comment Text: The big white square on his back was impressive af even tho it wasn't fooling our human perception.


        Sentiment: Positive
        Explanation: The comment expresses admiration for the cuttlefish's camouflage abilities, despite it not being completely convincing to humans.



---

### **5. Train-Validation Split**
We will split the data into training and validation sets using an 80-20 split.


In [None]:
# Load Data
test_df = pd.read_csv('/content/drive/MyDrive/NLP_Data/test_data.csv')
train_valid_data = pd.read_csv('/content/drive/MyDrive/NLP_Data/train_valid_data.csv')

# Split the dataset into training and validation sets (80-20 split)
train_df, val_df = train_test_split(train_valid_data, test_size=0.2, random_state=42)
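
To make the 80-20 split concrete, here is a simplified pure-Python sketch of a seeded shuffle-and-split over row indices. It is not scikit-learn's implementation and will not select the same rows as `train_test_split`; the 500-row total is an assumption inferred from the 400 and 100 example counts in the tokenization output further down.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and reserve the first `test_size` share for validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(n_rows * test_size)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(500)
print(len(train_idx), len(val_idx))  # 400 100
```

Fixing the seed, as `random_state=42` does in the notebook, keeps the split identical across runs.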

---

### **6. Initialize Model and Tokenizer**
We will now initialize the BART model (`facebook/bart-large-cnn`) and its corresponding tokenizer for our fine-tuning task.


In [None]:
# Initialize tokenizer and model
model_name = "facebook/bart-large-cnn"  # BART model name
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)


---

### **7. Data Formatting**
To prepare the data for training, we will format each input row by combining the `channel_name`, `video_title`, `video_description`, and `comment_text` into a single input text. The output will be the `Sentiment` and `Explanation`.


In [None]:
# Data Preparation
def format_data(df, for_test=False):
    return [
        {
            "input": f"Channel: {row['channel_name']}, Title: {row['video_title']}, Description: {row['video_description']}, Comment Text: {row['comment_text']}",
            "output": f"Sentiment: {row['Sentiment']}, Explanation: {row['Explanation']}" if not for_test else "Sentiment: , Explanation: "
        }
        for _, row in df.iterrows()
    ]

# Format the data
formatted_train_data = format_data(train_df)
formatted_val_data = format_data(val_df)
formatted_test_data = format_data(test_df, for_test=True)

# Convert to Dataset objects
train_dataset = Dataset.from_list(formatted_train_data)
val_dataset = Dataset.from_list(formatted_val_data)
test_dataset = Dataset.from_list(formatted_test_data)
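
To see what a single formatted pair looks like, the sketch below applies the same string templates to one dictionary. The helper `format_row` and the row values are hypothetical placeholders, not taken from the dataset.

```python
def format_row(row, for_test=False):
    """Build the input/output strings for one row, using the same templates as format_data."""
    source = (
        f"Channel: {row['channel_name']}, Title: {row['video_title']}, "
        f"Description: {row['video_description']}, Comment Text: {row['comment_text']}"
    )
    if for_test:
        # Test rows have no labels, so the target is an empty template.
        target = "Sentiment: , Explanation: "
    else:
        target = f"Sentiment: {row['Sentiment']}, Explanation: {row['Explanation']}"
    return {"input": source, "output": target}

# Hypothetical row; the values are placeholders for illustration only.
example_row = {
    "channel_name": "Example Channel",
    "video_title": "Example Title",
    "video_description": "Example description.",
    "comment_text": "Great video!",
    "Sentiment": "Positive",
    "Explanation": "The comment praises the video.",
}

pair = format_row(example_row)
print(pair["input"])
print(pair["output"])
```

Keeping the output in a fixed `Sentiment: ..., Explanation: ...` pattern makes it possible to split generated text back into its two parts later.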

---

### **9. Tokenization**
We need to tokenize the input and output text so that it can be fed into the BART model. We'll create a helper function to handle this process.


In [None]:
# Tokenization
def tokenize_data(example):
    # Tokenize the input prompt
    model_inputs = tokenizer(
        example["input"],
        max_length=512,
        padding="max_length",
        truncation=True
    )
    # Tokenize the target text; text_target replaces the deprecated as_target_tokenizer()
    labels = tokenizer(
        text_target=example["output"],
        max_length=128,
        padding="max_length",
        truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenize datasets
tokenized_train_dataset = train_dataset.map(tokenize_data, batched=True)
tokenized_val_dataset = val_dataset.map(tokenize_data, batched=True)



Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]
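
The tokenization step pads every label sequence to 128 tokens and keeps the padding ids in `labels`, so the loss is also computed on padding positions. A common alternative, not used in this notebook, is to replace padding ids with `-100`, the default `ignore_index` of `torch.nn.CrossEntropyLoss`. The toy sketch below illustrates both padding and masking on plain lists; the token ids `713` and `16` are made up, and the truncation is simplified.

```python
PAD_ID = 1           # id of BART's <pad> token
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut to max_length, then right-pad with pad_id.

    Simplified: a real tokenizer also keeps the closing </s> token when truncating.
    """
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace padding ids so the loss skips those positions."""
    return [IGNORE_INDEX if token == pad_id else token for token in label_ids]

labels = pad_or_truncate([0, 713, 16, 2], max_length=8)
print(labels)                # [0, 713, 16, 2, 1, 1, 1, 1]
print(mask_padding(labels))  # [0, 713, 16, 2, -100, -100, -100, -100]
```

If labels are masked this way, any metric function that decodes them needs to map `-100` back to the padding id first.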

---

### **10. LoRA (Low-Rank Adaptation) Configuration**
LoRA fine-tunes the model efficiently by freezing the original weights and training small low-rank matrices that are added to selected layers. Here we attach adapters to the attention query and value projections (`q_proj` and `v_proj`).


In [None]:
# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
lora_model = get_peft_model(model, lora_config)
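
As a rough worked example of how small the adapter is, the arithmetic below counts the LoRA parameters added by this configuration. The architecture figures (hidden size 1024, 12 encoder and 12 decoder layers) are assumptions about BART-large rather than values read from the notebook.

```python
# Architecture figures assumed for facebook/bart-large-cnn (BART-large).
D_MODEL = 1024
ENCODER_LAYERS = 12
DECODER_LAYERS = 12

# Values from the LoraConfig above.
R = 16
LORA_ALPHA = 32
TARGETS_PER_ATTENTION = 2  # q_proj and v_proj

# One self-attention block per encoder layer; self- and cross-attention per decoder layer.
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_layers = attention_blocks * TARGETS_PER_ATTENTION

# Each adapted linear layer gains A (r x d_in) and B (d_out x r).
params_per_layer = R * (D_MODEL + D_MODEL)
trainable_params = adapted_layers * params_per_layer

scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(trainable_params)            # 2359296
print(scaling)                     # 2.0
print(trainable_params * 4 / 1e6)  # size in MB at 4 bytes per parameter
```

At 4 bytes per parameter this comes to roughly 9.4 MB, which is close to the 9.46M `adapter_model.safetensors` file reported when the adapter is pushed to the Hub. Calling `lora_model.print_trainable_parameters()` should report the actual count.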

---

### **11. Training Configuration**
Now we set up the training parameters: number of epochs, batch size, learning rate, and evaluation strategy. An early stopping callback is defined as well, but it is commented out, so training runs for the full schedule.


In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=24,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=1000,                 # Linear warmup steps (more than the 288 total steps of this run)
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=1000,
    evaluation_strategy="steps",
    eval_steps=500,
    gradient_accumulation_steps=8,     # Effective batch size of 4 * 8 = 32
    fp16=True,                         # Mixed precision training
    learning_rate=1e-5,                # Peak learning rate
    lr_scheduler_type="linear",        # Linear decay after warmup
    load_best_model_at_end=True,       # Reload the best checkpoint when training ends
    metric_for_best_model="eval_loss", # Track best model by validation loss
)

# Early stopping (currently disabled)
# early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
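
These settings interact in ways that are easy to miss. The arithmetic below is a rough sketch, assuming a single GPU and the 400 training examples shown in the `Map` output. The floor division used for update steps per epoch is an assumption, chosen because it reproduces the `global_step=288` that `trainer.train()` reports.

```python
import math

train_examples = 400   # from the Map progress output
per_device_batch = 4
grad_accum = 8
epochs = 24
warmup_steps = 1000
eval_steps = 500
peak_lr = 1e-5

batches_per_epoch = math.ceil(train_examples / per_device_batch)  # 100
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)        # 12
total_steps = updates_per_epoch * epochs                           # 288
effective_batch = per_device_batch * grad_accum                    # 32

# With linear warmup, the learning rate grows proportionally to the step count.
lr_at_last_step = peak_lr * min(total_steps / warmup_steps, 1.0)

# Number of evaluations triggered during training by eval_steps.
evals_during_training = total_steps // eval_steps

print(total_steps, effective_batch)
print(lr_at_last_step)
print(evals_during_training)
```

Under these assumptions the warmup never completes, so the learning rate stays below a third of its configured peak, and no evaluation runs during training. Lowering `warmup_steps` and `eval_steps` below the total step count would be one way to address both.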

In [None]:
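# Define Evaluation Metric and Compute Function
rouge_metric = evaluate.load('rouge')  # Load the metric with evaluate

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # The plain Trainer passes logits (possibly inside a tuple), not token ids
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    if predictions.ndim == 3:
        predictions = np.argmax(predictions, axis=-1)
    # Map any -100 label positions back to the pad id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = ["\n".join(pred.strip().split(". ")) for pred in decoded_preds]
    decoded_labels = ["\n".join(label.strip().split(". ")) for label in decoded_labels]

    result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # The evaluate ROUGE metric returns plain floats, so scale them directly
    result = {key: value * 100 for key, value in result.items()}
    return result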
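
For intuition about what the ROUGE scores measure, here is a simplified ROUGE-1 F-measure computed by hand. It is only a sketch: it splits on whitespace and skips the stemming and tokenization rules of the `rouge-score` package, so its numbers will not match the library exactly. The example sentences are made up.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure between a prediction and a reference (no stemming)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the comment is positive", "the comment is very positive")
print(round(score, 4))  # 0.8889
```

Here all four predicted words appear in the reference (precision 1.0), and four of the five reference words are covered (recall 0.8).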

---

### **12. Trainer Initialization**
We will initialize the `Trainer` class, which will handle the training and evaluation processes.


In [None]:
# Trainer (the early stopping callback is commented out)
trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    compute_metrics=compute_metrics,
    # callbacks=[early_stopping]
)

# Train the model
trainer.train()


TrainOutput(global_step=288, training_loss=9.750903209050497, metrics={'train_runtime': 494.8821, 'train_samples_per_second': 19.399, 'train_steps_per_second': 0.582, 'total_flos': 1.0052813337919488e+16, 'train_loss': 9.750903209050497, 'epoch': 23.04})

---

### **14. Evaluation**
Once the model is trained, we can evaluate its performance on the validation dataset.


In [None]:
# Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation Results: {eval_results}")

Evaluation Results: {'eval_runtime': 5.3651, 'eval_samples_per_second': 18.639, 'eval_steps_per_second': 4.66, 'epoch': 23.04}


---

### **15. Save the Model**
After training and evaluation, we will save the fine-tuned model and tokenizer locally for future use.


In [None]:
# Save the Fine-Tuned Model
lora_model.save_pretrained("./SA-bart-fine-tuned-lora-model")
tokenizer.save_pretrained("./SA-bart-fine-tuned-lora-model")

('./SA-bart-fine-tuned-lora-model/tokenizer_config.json',
 './SA-bart-fine-tuned-lora-model/special_tokens_map.json',
 './SA-bart-fine-tuned-lora-model/vocab.json',
 './SA-bart-fine-tuned-lora-model/merges.txt',
 './SA-bart-fine-tuned-lora-model/added_tokens.json')

---

### **16. Push to Hugging Face Hub**
Finally, we will push the model to Hugging Face Hub for sharing or further use in other projects.


In [None]:
# Push to Hugging Face Hub
lora_model.push_to_hub("kkrusere/SA-bart-fine-tuned-lora-model")
tokenizer.push_to_hub("kkrusere/SA-bart-fine-tuned-lora-model")

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/9.46M [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/kkrusere/SA-bart-fine-tuned-lora-model/commit/4d90a328267e5820eed41633f66b1ef324e47e3b', commit_message='Upload tokenizer', commit_description='', oid='4d90a328267e5820eed41633f66b1ef324e47e3b', pr_url=None, pr_revision=None, pr_num=None)

---

### **17. Inference on Test Data**
To perform inference on the test dataset, we need to format the test data similarly to the training data. We then tokenize the data and use the trained model for predictions.
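
The steps described above can be sketched as a small pipeline: build the prompt, generate text, and split the generated text back into a sentiment and an explanation. In the sketch below, `generate_fn` is a hypothetical stand-in for the model call. In the notebook it would tokenize the prompts, call `lora_model.generate`, and decode the result with `tokenizer.batch_decode`. The stub here returns a fixed string so the example runs without a model, and the row values are placeholders.

```python
import re

# Matches the "Sentiment: ..., Explanation: ..." pattern used for the training targets.
OUTPUT_PATTERN = re.compile(
    r"Sentiment:\s*(?P<sentiment>.*?),\s*Explanation:\s*(?P<explanation>.*)",
    re.DOTALL,
)

def build_prompt(row):
    """Format one row with the same input template used for training."""
    return (
        f"Channel: {row['channel_name']}, Title: {row['video_title']}, "
        f"Description: {row['video_description']}, Comment Text: {row['comment_text']}"
    )

def parse_output(text):
    """Split generated text into sentiment and explanation; fall back to raw text."""
    match = OUTPUT_PATTERN.search(text)
    if match is None:
        return {"sentiment": None, "explanation": text.strip()}
    return {
        "sentiment": match["sentiment"].strip(),
        "explanation": match["explanation"].strip(),
    }

def run_inference(rows, generate_fn):
    """Build prompts, call the (hypothetical) generation function, parse each output."""
    prompts = [build_prompt(row) for row in rows]
    return [parse_output(text) for text in generate_fn(prompts)]

def stub_generate(prompts):
    """Stand-in for the model: returns one fixed string per prompt."""
    return ["Sentiment: Neutral, Explanation: Placeholder output." for _ in prompts]

# Hypothetical test row; the values are placeholders for illustration only.
rows = [{
    "channel_name": "Example Channel",
    "video_title": "Example Title",
    "video_description": "Example description.",
    "comment_text": "Interesting.",
}]
print(run_inference(rows, stub_generate))
```

Parsing with a fallback matters because a generative model is not guaranteed to follow the target pattern for every comment.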

