<a href="https://colab.research.google.com/github/kkrusere/youTube-comments-Analyzer/blob/main/LLM_fine-tuned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **LLM-Powered Sentiment Analysis Pipeline**

1. Introduction and Overview

    **What is Sentiment Analysis?**

    >> Sentiment analysis, also known as opinion mining, is a field of natural language processing (NLP) that focuses on determining the emotional tone or attitude expressed within a piece of text. It aims to categorize text as positive, negative, or neutral, and sometimes to identify more nuanced emotions such as joy, anger, or sadness.

    **Why Does Sentiment Analysis Matter?**

    > Sentiment Analysis can play a pivotal role in numerous applications, including (but not limited to):

    > * **Brand Monitoring:** Track customer opinions about products and services across social media and review platforms.
    > * **Market Research:** Gain insights into consumer sentiment towards brands, products, or trends.
    > * **Customer Service:** Analyze customer feedback to identify areas for improvement.
    > * **Social Media Analysis:** Monitor public sentiment towards events, news, or policies.
    > * **Financial Analysis:** Assess market sentiment to make informed investment decisions.

    **Limitations in Developing, Training, and Testing Sentiment Analysis Models**

    > Developing, training, and testing Sentiment Analysis (SA) models involves several limitations and challenges. Here are some of the key ones:

    > * Lack of Labeled Training Data
    >>  * Limited Availability: Acquiring a large and diverse dataset with accurate sentiment labels can be difficult, especially for niche or specialized domains.
    >>  * Cost and Time: Annotating data manually is time-consuming and expensive. Crowd-sourcing can introduce noise and inconsistency.
    >>  * Quality of Annotations: The subjectivity of sentiment can lead to inconsistent labels even among human annotators.

    > * Complexity of Human Emotions
    >>  * Subtlety and Nuance: Human emotions are complex and nuanced. Simple positive, negative, and neutral labels often fail to capture the full spectrum of sentiments.
    >>  * Context-Dependence: The sentiment of a statement can depend heavily on the context, which might not be captured in the data.

    > * Ambiguity and Sarcasm
    >>  * Ambiguity: Words and phrases can carry different sentiments depending on the context. For example, "I saw this movie last night" is neutral on its own and becomes positive or negative only through the speaker's tone and the surrounding text.
    >>  * Sarcasm and Irony: Detecting sarcasm and irony is particularly challenging for SA models, as they often rely on cultural and contextual clues beyond the text itself.

    > * Language and Cultural Differences
    >> * Multilingual Challenges: Developing models that work across multiple languages requires extensive resources. Each language might require a separate model or significant adjustments to handle linguistic nuances.
    >> * Cultural Differences: Sentiments expressed in different cultures can vary widely, making it hard to generalize models across different demographics.

    > *  Domain-Specific Challenges
    >>  * Generalization: Models trained on generic datasets may not perform well in specialized domains such as medical or legal texts. Domain-specific models require specialized training data, which is often scarce.
    >>  * Jargon and Slang: Different domains use specific jargon and slang that might not be well-represented in general sentiment datasets.

    > * Evolving Language
    >>  * Language Change: Language and expressions evolve over time, and models need regular updates to stay relevant. New slang, trends, and shifts in meaning can quickly make a model outdated.

    > * Technical Challenges
    >>  * Feature Extraction: Identifying the right features that capture sentiment effectively is challenging. Simple keyword-based approaches may miss nuances, while more sophisticated methods like embeddings require substantial computational resources.
    >>  * Model Complexity: Building models that are both accurate and efficient can be difficult. More complex models like deep neural networks offer better performance but require more data and computational power.

    > * Evaluation and Metrics
    >>  * Evaluation Metrics: Standard metrics like accuracy, precision, recall, and F1-score may not fully capture the effectiveness of a sentiment analysis model. On imbalanced datasets, accuracy in particular can look high even when minority classes are rarely predicted correctly.
    >>  * Real-world Testing: Models may perform well on test datasets but struggle with real-world data due to noise, variations in text, and other unforeseen factors.

    > Addressing these challenges often requires a combination of advanced techniques, including the use of transfer learning, semi-supervised learning, and the integration of external knowledge sources to improve the robustness and accuracy of Sentiment Analysis models.

    **Enter Large Language Models (LLMs)**

    > Large Language Models (LLMs) are sophisticated machine learning models that have been trained on massive amounts of text data. They possess a remarkable ability to understand and generate human-like language.  LLMs can be fine-tuned for specific tasks, such as sentiment analysis, allowing them to leverage their vast knowledge and linguistic capabilities to make accurate predictions about the emotional tone of text.

    Pretrained Large Language Models (LLMs) like GPT-3, BERT, and their successors have shown considerable promise in addressing many of the challenges in creating, developing, training, and testing Sentiment Analysis (SA) models. Here’s how they can help:

    > * Lack of Labeled Training Data
    >>  * Transfer Learning: LLMs are pretrained on vast amounts of data across diverse domains. Fine-tuning these models on a smaller, domain-specific dataset can significantly enhance performance without requiring extensive labeled data.
    >>  * Zero-shot and Few-shot Learning: LLMs can perform tasks with little to no specific task-related training data, allowing for sentiment analysis with minimal labeled examples.
    
    > * Complexity of Human Emotions
    >>  * Rich Representations: LLMs capture nuanced language representations, which helps in understanding the subtleties and complexities of human emotions beyond simple positive, negative, and neutral sentiments.
    >>  * Context-Awareness: These models consider the context within and around the text, leading to better handling of context-dependent sentiment.

    > * Ambiguity and Sarcasm
    >>  * Contextual Understanding: LLMs use context to disambiguate meanings and can better detect sarcasm and irony, thanks to their sophisticated language understanding capabilities.
    >>  * Contextual Embeddings: By generating context-specific embeddings, LLMs can differentiate between sentiments in ambiguous phrases more effectively.

    > * Language and Cultural Differences
    >>  * Multilingual Capabilities: Models like mBERT and XLM-R are pretrained on multiple languages, enabling sentiment analysis across different languages without needing separate models for each.
    >>  * Cultural Sensitivity: Pretrained LLMs can be fine-tuned on culturally specific data to better capture the nuances and sentiments of different cultural contexts.

    > * Domain-Specific Challenges
    >>  * Fine-Tuning: LLMs can be fine-tuned on domain-specific datasets, leveraging their general language understanding to quickly adapt to specialized domains.
    >>  * Domain Adaptation: Techniques such as continual learning aim to let LLMs incorporate new domain-specific jargon and slang while limiting the forgetting of previously learned information.

    > * Evolving Language
    >>  * Adaptability: Pretrained models can be periodically fine-tuned on recent data to stay updated with evolving language and trends.
    >>  * Dynamic Updates: Ongoing training or incremental updates help keep LLMs relevant and effective as language evolves.

    > *  Technical Challenges
    >>  * Feature Extraction: LLMs inherently generate rich features and embeddings, reducing the need for manual feature engineering.
    >>  * Efficiency Improvements: Although LLMs can be computationally intensive, optimizations like distillation and pruning can make them more efficient for deployment.

    By leveraging these capabilities, pretrained LLMs can mitigate many of the traditional challenges in sentiment analysis, leading to more robust, accurate, and contextually aware models.

This Jupyter Notebook explores sentiment analysis of social media comments using Large Language Models (LLMs). We try several techniques for extracting sentiment from unlabeled text data, assess their strengths and weaknesses, and look at what the results reveal about social media conversations.

Social media platforms are teeming with user-generated content, offering a treasure trove of opinions, emotions, and reactions. Understanding the sentiment behind these comments is crucial for businesses, marketers, researchers, and anyone interested in gauging public opinion. However, the sheer volume and unstructured nature of social media data present a challenge for traditional sentiment analysis methods.

This notebook uses LLMs, which excel at understanding and generating human-like language, to address this challenge. We investigate several strategies, from zero-shot and few-shot learning to fine-tuning pre-trained models, and evaluate how well each extracts sentiment from unlabeled social media comments.
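
As a rough illustration of the few-shot strategy, the sketch below assembles a prompt from a handful of labeled comments followed by one unlabeled comment. The helper name `build_few_shot_prompt` and the sample comments are hypothetical and are not part of this notebook's pipeline.

```python
def build_few_shot_prompt(examples, new_comment):
    """Assemble a few-shot sentiment prompt.

    `examples` is a list of (comment, sentiment) pairs. Both this helper and
    the sample data below are illustrative assumptions.
    """
    lines = ["Classify the sentiment of each comment."]
    for comment, sentiment in examples:
        lines.append(f"Comment: [{comment}]")
        lines.append(f"Sentiment: {sentiment}")
    # The final comment is left unlabeled for the model to complete
    lines.append(f"Comment: [{new_comment}]")
    lines.append("Sentiment:")
    return "\n".join(lines)


examples = [
    ("Loved every minute of this video", "Positive"),
    ("The audio quality was terrible", "Negative"),
]
prompt = build_few_shot_prompt(examples, "Not sure how I feel about this one")
print(prompt)
```

With zero examples the same helper degenerates to a zero-shot prompt, which makes it easy to compare the two settings on identical comments.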



In [1]:
from google.colab import drive

import pandas as pd
import numpy as np

import re
import os
import json

In [None]:
#mounting google drive
drive.mount('/content/drive')

########################################

#changing the working directory
os.chdir("/content/drive/MyDrive/NLP_Data")

!pwd

In [None]:
# Function to save comments data to a JSON file
def save_comments_to_json(comments, filename = 'youtube_comments.json'):
    with open(filename, 'w') as json_file:
        json.dump(comments, json_file, indent=4)

def load_comments_from_json(filename = 'youtube_comments.json'):
    with open(filename, 'r') as json_file:
        comments = json.load(json_file)
    return comments

In [None]:
# read the sample_comments_and_sentiment.csv file from the current working directory

sample_comments_and_sentiment_df = pd.read_csv("sample_comments_and_sentiment.csv")
sample_comments_and_sentiment_df.head()

In [None]:
# Extract the comment_text, sentiment, and sentiment_brief_explanation columns
comment_text_list = sample_comments_and_sentiment_df.head(15)['comment_text'].tolist()
sentiment_list = sample_comments_and_sentiment_df.head(15)['sentiment'].tolist()
sentiment_brief_explanation_list = sample_comments_and_sentiment_df.head(15)['sentiment_brief_explanation'].tolist()

# Print the lists
print("Comment Text:")
print(comment_text_list)

print("\nSentiment:")
print(sentiment_list)

print("\nSentiment Brief Explanation:")
print(sentiment_brief_explanation_list)


In [None]:
!pip install torch tensorboard transformers datasets accelerate bitsandbytes trl peft

In [None]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from trl import setup_chat_format
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          pipeline,
                          logging)
from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

In [None]:
print(f"pytorch version {torch.__version__}")

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"working on {device}")

In [None]:
#Define features and labels
X = sample_comments_and_sentiment_df['comment_text']
y = sample_comments_and_sentiment_df[['sentiment', 'sentiment_brief_explanation']]

# First, split into train and temp sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)  # 30% temp

# Then, split the temp set into test and evaluation sets
X_test, X_eval, y_test, y_eval = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)  # 50% of temp

# Check the sizes of the resulting sets
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
print(f"Evaluation set size: {len(X_eval)}")
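
The two-stage split above yields roughly 70% training, 15% test, and 15% evaluation data. The sketch below checks that arithmetic on a made-up 100-row frame; the toy data and variable names are assumptions used only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 100 rows standing in for the labeled comments
toy_df = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(100)],
    "sentiment": ["Positive" if i % 2 == 0 else "Negative" for i in range(100)],
})

# 70% train, then split the remaining 30% evenly into test and evaluation
train_part, temp_part = train_test_split(toy_df, test_size=0.3, random_state=42)
test_part, eval_part = train_test_split(temp_part, test_size=0.5, random_state=42)

sizes = (len(train_part), len(test_part), len(eval_part))
print(sizes)
```

If the sentiment labels are imbalanced, passing the label column through the `stratify` argument of `train_test_split` is one way to keep class proportions similar across the three sets.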

In [None]:
# For Training Data (Generate Prompt):
# This function should generate a prompt for training the model to predict sentiment and provide an explanation for each comment.

def generate_prompt(data_point):
    return f"""
    Analyze the sentiment of the comment enclosed in square brackets and provide a detailed sentiment label ("positive", "neutral", "negative", "mixed", "humorous", etc.) and a brief explanation of the sentiment.

    Comment: [{data_point["comment_text"]}]
    Sentiment: {data_point["sentiment"]}
    Sentiment Brief Explanation: {data_point["sentiment_brief_explanation"]}
    """.strip()

# For Test Data (Generate Test Prompt):
# This function should generate a prompt for testing the model, where the sentiment and explanation are left blank for the model to predict.
def generate_test_prompt(data_point):
    return f"""
    Analyze the sentiment of the comment enclosed in square brackets and provide a detailed sentiment label ("positive", "neutral", "negative", "mixed", "humorous", etc.) and a brief explanation of the sentiment.

    Comment: [{data_point["comment_text"]}]
    Sentiment:
    Sentiment Brief Explanation:
    """.strip()



* Training Data (`X_train`):
  - We generate training prompts using the `generate_prompt` function, which includes both the sentiment and the brief explanation.
* Evaluation Data (`X_eval`):
  - The evaluation prompts are generated the same way with the `generate_prompt` function.
* Test Data (`X_test` and `y_true`):
  - For the test data, we use the `generate_test_prompt` function, which leaves the sentiment and explanation blank so that the model has to predict them at test time.
* The `y_true` variable is retained to compare the model's predictions with the actual sentiment labels during evaluation.
* Dataset Conversion:
  - Finally, we convert the resulting DataFrames into Hugging Face `Dataset` objects, the format typically used when fine-tuning models with the transformers library.
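
Because the test prompts end with empty `Sentiment:` and `Sentiment Brief Explanation:` fields, the generated text has to be parsed before it can be compared with `y_true`. The helper below is a hypothetical sketch: the name `parse_prediction` and the sample completion are assumptions, and it presumes the model echoes the two field names, which real generations may not always do.

```python
import re


def parse_prediction(generated_text):
    """Extract the sentiment label and explanation from generated text.

    Hypothetical helper: returns (label, explanation), with None for any
    field that cannot be found on its own line.
    """
    label_match = re.search(r"Sentiment:[ \t]*([^\n]+)", generated_text)
    expl_match = re.search(
        r"Sentiment Brief Explanation:[ \t]*([^\n]+)", generated_text
    )
    label = label_match.group(1).strip() if label_match else None
    explanation = expl_match.group(1).strip() if expl_match else None
    return label, explanation


# Made-up completion in the same layout as the prompts in this notebook
sample_completion = (
    "Comment: [great video, thanks!]\n"
    "Sentiment: Positive\n"
    "Sentiment Brief Explanation: The commenter thanks the creator."
)
label, explanation = parse_prediction(sample_completion)
print(label, "|", explanation)
```

Returning `None` for a missing field keeps malformed generations visible during evaluation instead of silently assigning them a label.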

In [None]:
# Combine features and labels so each row carries the comment, its sentiment, and the explanation
train_df = pd.concat([X_train, y_train], axis=1)
eval_df = pd.concat([X_eval, y_eval], axis=1)

# Build training and evaluation prompts (both include the label and the explanation)
X_train = train_df.apply(generate_prompt, axis=1).to_frame(name="text")
X_eval = eval_df.apply(generate_prompt, axis=1).to_frame(name="text")

# Extract true sentiment labels from the test set for later evaluation
y_true = y_test['sentiment'].reset_index(drop=True)

# Generate test prompts (where sentiment and explanation are left blank).
# X_test is a Series, so it is converted to a one-column DataFrame before the row-wise apply.
X_test = X_test.to_frame().apply(generate_test_prompt, axis=1).to_frame(name="text")

# Convert the prepared DataFrames into Hugging Face datasets,
# dropping the pandas index so that only the "text" column remains
train_data = Dataset.from_pandas(X_train, preserve_index=False)
eval_data = Dataset.from_pandas(X_eval, preserve_index=False)
test_data = Dataset.from_pandas(X_test, preserve_index=False)

In [37]:
train_df.apply(generate_prompt, axis=1)[435]

'Analyze the sentiment of the comment enclosed in square brackets and provide a detailed sentiment label ("positive", "neutral", "negative", "mixed", "humorous", etc.) and a brief explanation of the sentiment.\n\n    Comment: [Yes .. the most beautiful. Jordan is still also the most Fly to ever play on the court. His look, his combination of athleticism, skill, and style/aesthetics, he is without equal imo.]\n    Sentiment: Positive\n    Sentiment Brief Explanation: Acknowledges both Jordan and Lebron\'s greatness but favors Jordan\'s style.'

In [35]:
train_df.apply(generate_prompt, axis=1)[0]

'Analyze the sentiment of the comment enclosed in square brackets and provide a detailed sentiment label ("positive", "neutral", "negative", "mixed", "humorous", etc.) and a brief explanation of the sentiment.\n\n    Comment: [these thumbnails make it seem like a mass nuclear explosion happen]\n    Sentiment: Negative\n    Sentiment Brief Explanation: The comment suggests the thumbnails are alarming or disturbing.'

In [None]:
def evaluate(y_true, y_pred):
    """
    Evaluate the performance of a sentiment classification model.

    This function calculates the accuracy, classification report, and confusion
    matrix for a set of true and predicted sentiment labels. The sentiment labels
    are dynamically mapped to numeric values based on the unique labels present
    in the input data.

    Parameters:
    ----------
    y_true : list or array-like
        The true sentiment labels.
    y_pred : list or array-like
        The predicted sentiment labels.

    Returns:
    -------
    None
        The function prints the overall accuracy, accuracy per sentiment label,
        a detailed classification report, and the confusion matrix.

    """

    # Get unique sentiment labels from y_true
    unique_labels = sorted(set(y_true))

    # Dynamically create a mapping from labels to numeric values
    mapping = {label: idx for idx, label in enumerate(unique_labels)}
    label_ids = [mapping[label] for label in unique_labels]

    def map_func(x):
        # Predictions outside the known label set map to -1 so they always count as errors
        return mapping.get(x, -1)

    # Apply the mapping to y_true and y_pred
    y_true_mapped = np.vectorize(map_func)(y_true)
    y_pred_mapped = np.vectorize(map_func)(y_pred)

    # Calculate overall accuracy
    accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
    print(f'Overall Accuracy: {accuracy:.3f}')

    # Generate accuracy report for each label (the share of that label's comments predicted correctly)
    for label in unique_labels:
        label_mask = y_true_mapped == mapping[label]
        label_accuracy = accuracy_score(y_true_mapped[label_mask], y_pred_mapped[label_mask])
        print(f'Accuracy for label {label}: {label_accuracy:.3f}')

    # Generate classification report, restricted to the known labels
    class_report = classification_report(
        y_true=y_true_mapped,
        y_pred=y_pred_mapped,
        labels=label_ids,
        target_names=unique_labels,
        zero_division=0,
    )
    print('\nClassification Report:')
    print(class_report)

    # Generate confusion matrix (unrecognized predictions fall outside the listed labels)
    conf_matrix = confusion_matrix(y_true=y_true_mapped, y_pred=y_pred_mapped, labels=label_ids)
    print('\nConfusion Matrix:')
    print(conf_matrix)
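
To see why a per-label breakdown is useful alongside overall accuracy, here is a small sketch on invented, imbalanced labels. A predictor that always outputs the majority class scores high accuracy but a much lower macro-averaged F1. The toy data and variable names are assumptions for illustration only.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented, imbalanced ground truth: 9 positive comments, 1 negative
y_true_toy = ["Positive"] * 9 + ["Negative"]
# A degenerate predictor that always answers "Positive"
y_pred_toy = ["Positive"] * 10

acc = accuracy_score(y_true_toy, y_pred_toy)
# zero_division=0 scores the never-predicted "Negative" class as 0 instead of warning
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)

print(f"accuracy = {acc:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

Accuracy comes out at 0.9 while macro F1 is below 0.5, because the "Negative" class contributes an F1 of zero.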


In [None]:
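# Specify the base model to fine-tune.
# NOTE: "gpt-3" is a placeholder. GPT-3 weights are not published on the Hugging Face Hub,
# so from_pretrained cannot load it; substitute the ID of an open causal LM you have access to.
model_name = "gpt-3"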

# Set the compute data type
compute_dtype = torch.float16

# Configure the model to use 4-bit quantization for efficient training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

# Load the base model with the 4-bit quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=compute_dtype,
    quantization_config=bnb_config,
)

# Disable caching for training
model.config.use_cache = False

# Llama-style configs expose pretraining_tp; 1 selects the standard linear-layer computation
model.config.pretraining_tp = 1

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)

# Set the padding token and side
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Apply any necessary chat formatting
model, tokenizer = setup_chat_format(model, tokenizer)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

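# LoRA adapter configuration: the weights of a 4-bit quantized base model are not
# updated directly, so only small low-rank adapter matrices are trained.
# NOTE: these hyperparameter values are illustrative assumptions, not tuned settings.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Initialize the trainer. SFTTrainer tokenizes the raw "text" column itself,
# which a plain transformers Trainer would not do.
# NOTE: the argument names below follow the TRL releases that still export
# setup_chat_format; newer releases expect some of them in SFTConfig instead.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
)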

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./results")
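
A back-of-the-envelope calculation shows why 4-bit loading is attractive on a single Colab GPU. The parameter count below is a hypothetical figure chosen for illustration, and the estimate covers the weights only, ignoring quantization constants, activations, gradients, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 7_000_000_000  # hypothetical 7B-parameter model

fp16_gib = weight_memory_gib(num_params, 16)
nf4_gib = weight_memory_gib(num_params, 4)

print(f"float16 weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:   {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the float16 footprint, which is what leaves room for adapter training on a memory-constrained GPU.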