<a href="https://colab.research.google.com/github/kkrusere/youTube-comments-Analyzer/blob/main/fine-tuned_LLM_text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### <center>**Transformer-Based Summarization for Cleaning YouTube Video Descriptions**</center>

<center><em>
Leverage the power of transformer-based text summarization to automatically remove irrelevant information from YouTube video descriptions, ensuring they're concise and informative.
</em></center>

#### Intro:

YouTube video descriptions are vital for attracting viewers, but they often contain extraneous information that obscures what the video is about. This project uses transformer-based text summarization models (such as BART) to clean these descriptions automatically.

By training a summarization model on a dataset of YouTube descriptions paired with their human-refined counterparts, the model learns to identify and remove irrelevant content while preserving key points. This leads to concise, informative descriptions.

The project will explore the fine-tuning and evaluation of transformer models for this specific summarization task, focusing on their ability to remove extraneous information and produce distilled video descriptions.

**Key Points:**
- Problem: YouTube descriptions often contain excessive tags, promotions, and irrelevant details.
- Solution: Transformer-based text summarization models trained to clean descriptions.
- Approach: Fine-tune models on a dataset of original and human-cleaned descriptions.
- Goal: Produce concise, informative descriptions that enhance user experience.
- Evaluation: Focus on the models' ability to remove extraneous information effectively.

In [None]:
%%shell
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb

wget -N https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/118.0.5993.70/linux64/chromedriver-linux64.zip -P /tmp/
unzip -o /tmp/chromedriver-linux64.zip -d /tmp/
chmod +x /tmp/chromedriver-linux64/chromedriver
mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver
pip install selenium chromedriver_autoinstaller

In [None]:
!pip install peft
!pip install datasets
!pip install rouge-score
!pip install evaluate

In [7]:
import re
import json
import random
import time

import evaluate
import numpy as np
import pandas as pd

from sklearn.model_selection import KFold, train_test_split

import torch
import torch.nn as nn
from datasets import Dataset
from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model

from torch.nn import CrossEntropyLoss
from tqdm import tqdm
import transformers
from transformers import (
                            AutoModelForCausalLM,
                            AutoTokenizer,
                            BartForConditionalGeneration,
                            BartTokenizer,
                            BitsAndBytesConfig,
                            EarlyStoppingCallback,
                            logging,
                            pipeline,
                            Trainer,
                            TrainingArguments,
)


import warnings
warnings.filterwarnings("ignore")

**Data Collection - Overview**
- The data collection process gathers YouTube video descriptions along with additional metadata, such as the channel name and video title, using custom helper functions. This data is used to train and evaluate our transformer-based summarization model.

**Steps:**

- Fetch Video Data:
> - Iterate through a predefined list of YouTube video IDs.
> - For each video ID, use a custom function to retrieve the video data.
- The function fetches:
> - Channel Name: The name of the channel where the video was uploaded.
> - Video Title: The title of the video.
> - Video Description: The description text provided by the video uploader.
- Store Data:
> - Append the retrieved data, formatted as a dictionary, to the list.
> - Store the collected data in a file (e.g., JSON or CSV) to facilitate access and further processing.
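
The steps above can be sketched as a small loop. This is a minimal sketch, not the notebook's actual scraper: `fetch_video_data` is a hypothetical stand-in for the custom function (it returns canned values so the example runs offline), and the video IDs and output filename are placeholders.

```python
import json
import os
import tempfile


def fetch_video_data(video_id):
    """Hypothetical stand-in for the custom fetch function.

    A real implementation would scrape or query YouTube; this stub returns
    canned values so the sketch runs without network access.
    """
    return {
        "channel_name": f"channel-for-{video_id}",
        "video_title": f"title-for-{video_id}",
        "video_description": f"description-for-{video_id}",
    }


def collect_video_data(video_ids, fetch_fn=fetch_video_data):
    """Fetch one record per video ID and return them as a list of dicts."""
    records = []
    for video_id in video_ids:
        records.append(fetch_fn(video_id))
    return records


# Placeholder IDs, not real videos
video_ids = ["id_001", "id_002"]
records = collect_video_data(video_ids)

# Store the collected data as JSON (placeholder filename in a temp directory)
out_path = os.path.join(tempfile.gettempdir(), "video_data_example.json")
with open(out_path, "w") as f:
    json.dump(records, f, indent=4)
```

Passing the fetch function as an argument keeps the loop independent of how the data is actually retrieved.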

**Example Output**
> - The collected data will be a list of dictionaries, each containing the following keys:

```yaml
channel_name: The name of the YouTube channel.
video_title: The title of the video.
video_description: The description text of the video.
```


In [9]:
from google.colab import drive, userdata
from huggingface_hub import login

import os
import json
#mounting google drive
drive.mount('/content/drive')

########################################

#changing the working directory
os.chdir("/content/drive/MyDrive/NLP_Data")

!pwd

huggingface_token = userdata.get('Hugging_Face_Hub_API_TOKEN')

#logging into huggingface
login(huggingface_token, add_to_git_credential=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/NLP_Data
Token is valid (permission: read).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [10]:
# Helper functions for reading and writing JSON files in the current working directory

def save_to_json(data, filename):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)

def load_from_json(filename):
    with open(filename, 'r') as json_file:
        data = json.load(json_file)
    return data

###### **Data Preparation for Fine-Tuning**


1. **Data Formatting:**
- Each row is formatted into a single input string that combines `channel_name`, `video_title`, and `video_description`.
- The target output is `clean_video_description`.
2. **Converting to Dataset Object:**
- `Dataset.from_list(...)` converts each list of formatted input-output pairs into a Hugging Face `Dataset` object.
3. **Tokenization:**
- The `tokenize_data` function tokenizes both the input text and the target text.
- The tokenized target is added to the input dictionary under `"labels"`, as required for seq2seq training.
4. **Tokenized Dataset:**
- The tokenized datasets, `tokenized_train_dataset` and `tokenized_val_dataset`, are then ready for fine-tuning the BART model with LoRA.
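
One detail worth noting for step 3: Hugging Face seq2seq models skip label positions set to `-100` when computing the loss, so padded label positions are commonly masked that way. The following is a minimal sketch, assuming labels are plain lists of token IDs and that the pad ID is 1 (an assumption here; read it from `tokenizer.pad_token_id` in practice). `mask_label_padding` is an illustrative helper, not part of the notebook.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_label_padding(batch_label_ids, pad_token_id):
    """Replace pad token IDs in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in seq]
        for seq in batch_label_ids
    ]


# Two label sequences padded to the same length with the assumed pad ID 1
padded_labels = [[0, 713, 16, 2, 1, 1], [0, 713, 16, 10, 1296, 2]]
masked_labels = mask_label_padding(padded_labels, pad_token_id=1)
```

If labels are masked like this, `-100` has to be swapped back to the pad ID before the labels are decoded for metric computation.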


In [None]:
df = pd.read_csv('video_data.csv')
df.head()


| | channel_name | video_title | video_description | clean_video_description |
|---|---|---|---|---|
| 0 | LastWeekTonight | Miss America Pageant: Last Week Tonight with J... | The Miss America Pageant…how is this still a t... | John Oliver criticizes the Miss America Pagean... |
| 1 | ESPN | Smooth 🔥 (via @dariusgaddy2, @d.looo_/TT) #shorts | ✔️ Subscribe to ESPN+ http://espnplus.com/yout... | This is a short video showcasing smooth moves ... |
| 2 | PowerfulJRE | Joe Rogan Experience #1227 - Mike Tyson | Mike Tyson is the former undisputed heavyweigh... | Mike Tyson, the former undisputed heavyweight ... |
| 3 | PowerfulJRE | Joe Rogan Experience #872 - Graham Hancock & R... | Graham Hancock is an English author and journa... | Graham Hancock and Randall Carlson discuss cro... |
| 4 | Mentour Pilot | HOW was THIS Allowed to HAPPEN?! | Go to https://curiositystream.thld.co/mento...... | This video explores a close call between two A... |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233 entries, 0 to 232
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   channel_name             233 non-null    object
 1   video_title              233 non-null    object
 2   video_description        233 non-null    object
 3   clean_video_description  233 non-null    object
dtypes: object(4)
memory usage: 7.4+ KB


In [None]:
from sklearn.model_selection import train_test_split
# Split the dataset into training and validation sets (80-20 split)
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)


In [None]:
# Tokenizer and Model
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Tokenization function. With batched=True, padding="longest" pads each
# map batch (not each training batch) to its longest sequence.
def tokenize_data(example):
    model_inputs = tokenizer(
        example["input"],
        max_length=512,
        padding="longest",
        truncation=True
    )
    # text_target tokenizes the summaries as labels; it replaces the
    # deprecated as_target_tokenizer() context manager.
    labels = tokenizer(
        text_target=example["output"],
        max_length=128,
        padding="longest",
        truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Data preparation (assuming train_df and val_df exist)
formatted_train_data = [
    {
        "input": f"Channel: {row['channel_name']}, Title: {row['video_title']}, Description: {row['video_description']}",
        "output": row['clean_video_description']
    }
    for _, row in train_df.iterrows()
]

formatted_val_data = [
    {
        "input": f"Channel: {row['channel_name']}, Title: {row['video_title']}, Description: {row['video_description']}",
        "output": row['clean_video_description']
    }
    for _, row in val_df.iterrows()
]

# Convert data to Hugging Face Dataset
train_dataset = Dataset.from_list(formatted_train_data)
val_dataset = Dataset.from_list(formatted_val_data)

# Tokenize datasets
tokenized_train_dataset = train_dataset.map(tokenize_data, batched=True)
tokenized_val_dataset = val_dataset.map(tokenize_data, batched=True)



###### **Fine-Tuning the BART Model with LoRA**

**What is LoRA?**

LoRA, which stands for Low-Rank Adaptation, is a technique used in fine-tuning large language models (LLMs) to make them more efficient and less computationally expensive.

Key points about LoRA:

- Reduced parameter updates: Instead of updating all the parameters of a pre-trained LLM during fine-tuning, LoRA focuses on updating a smaller set of parameters, specifically low-rank matrices that are added to the existing model weights.
- Efficiency: This approach significantly reduces the number of trainable parameters, leading to faster training times and lower memory requirements compared to traditional fine-tuning methods.
- Preserved performance: Despite the reduction in updated parameters, LoRA has been shown to achieve comparable or even better performance than full fine-tuning in many cases.
- Adaptability: It can be easily integrated with various LLM architectures and fine-tuning tasks.

LoRA offers a practical and effective solution to fine-tune large language models for specific tasks without incurring the high computational costs associated with full fine-tuning.
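
To make the parameter savings concrete: for a frozen weight matrix `W` of shape `d_out x d_in`, LoRA learns two small matrices, `B` (`d_out x r`) and `A` (`r x d_in`), and uses `W + (lora_alpha / r) * B @ A` in the forward pass. The sketch below uses plain Python lists on a toy example and counts parameters for a single 1024 x 1024 projection, which is the hidden size assumed here for BART-large. All function names are illustrative.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (parameters in the full matrix, parameters in the LoRA pair)."""
    full = d_in * d_out
    lora = r * (d_in + d_out)
    return full, lora


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, lora_alpha, r):
    """Compute W + (lora_alpha / r) * (B @ A) without modifying W."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Parameter count for one adapted projection (assumed 1024 x 1024, rank 16)
full_params, lora_params = lora_param_counts(d_in=1024, d_out=1024, r=16)
reduction = full_params / lora_params

# Toy 2 x 2 example with rank 1
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]           # shape r x d_in
B_zero = [[0.0], [0.0]]     # shape d_out x r; B starts at zero in LoRA
B_trained = [[1.0], [2.0]]  # made-up values standing in for a trained B

W_init = lora_effective_weight(W, A, B_zero, lora_alpha=2, r=1)
W_adapted = lora_effective_weight(W, A, B_trained, lora_alpha=2, r=1)
```

Because `B` starts at zero, the adapted model initially behaves exactly like the pre-trained one; only `A` and `B` receive gradient updates.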



**To fine-tune the BART model with LoRA, we will follow these steps:**


1. **Set Up LoRA Configuration:** Define the LoRA parameters such as the rank (`r`), scaling factor (`lora_alpha`), target modules (`q_proj` and `v_proj`), and dropout rate (`lora_dropout`).
2. **Wrap the BART Model with LoRA:** Use the `peft` library to apply LoRA to the original BART model, which allows for efficient fine-tuning with fewer trainable parameters.
3. **Define Training Arguments:** Configure training parameters such as batch size, number of epochs, learning rate, logging steps, evaluation strategy, and saving intervals using the `TrainingArguments` class from Hugging Face.
4. **Define the Compute Metrics Function:** Set up a function to compute evaluation metrics such as ROUGE scores, which measure the quality of the generated summaries against the reference summaries.
5. **Train the Model:** Use the Hugging Face `Trainer` to fine-tune the LoRA-wrapped model on the training dataset, evaluating it on the validation dataset during training to monitor performance.
6. **Evaluate the Fine-Tuned Model:** After training, evaluate the model on the validation dataset using the ROUGE metric to understand how well it generates summaries.
7. **Save the Fine-Tuned Model:** Finally, save the fine-tuned model and tokenizer for future use in generating summaries or further fine-tuning.
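
For intuition about the ROUGE scores in step 4, ROUGE-1 measures unigram overlap between a generated summary and its reference. The sketch below is a simplified illustration that assumes lowercase whitespace tokenization and no stemming, so its numbers can differ from the `rouge` metric in `evaluate`. `rouge1_scores` is an illustrative name.

```python
from collections import Counter


def rouge1_scores(prediction, reference):
    """Return (precision, recall, F1) based on clipped unigram overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = rouge1_scores("the cat sat", "the cat sat on the mat")
```

Here every predicted word appears in the reference (high precision), but the prediction covers only half of the reference words (lower recall), and the F-measure balances the two.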



1. **LoRA Configuration:**
- The `LoraConfig` class is used to define the configuration for Low-Rank Adaptation.
- Key parameters include:
> - `r`: The rank of the LoRA matrix.
> - `lora_alpha`: Scaling factor for LoRA.
> - `target_modules`: Specifies which modules in the model should have LoRA applied (usually the attention layers).
> - `lora_dropout`: Dropout rate to be applied to LoRA.
> - `bias`: Specifies which bias parameters are trained; `"none"` means no bias terms are updated.
2. **Wrap the BART Model with LoRA:**
- The `get_peft_model` function from the peft library wraps the original BART model with LoRA, making it suitable for `parameter-efficient fine-tuning`.
3. **Defining Training Arguments:**
- TrainingArguments defines various parameters for the training process:
> - `num_train_epochs`: Number of epochs for training.
> - `per_device_train_batch_size` and `per_device_eval_batch_size`: Batch sizes for training and evaluation.
> - `logging_steps`: Frequency of logging training metrics.
> - `eval_steps`: Frequency of evaluation during training.
> - `save_steps`: Frequency of saving model checkpoints.
4. **Trainer Setup and Training:**
- The Trainer class handles the training loop, evaluation, and checkpointing. It takes the LoRA model and training arguments as input.
5. **Save the Fine-Tuned Model:**
- After training, the fine-tuned model and tokenizer are saved using the save_pretrained method.
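
Step-based settings such as `eval_steps`, `save_steps`, and `warmup_steps` are counted in optimizer steps, so it helps to estimate how many steps a run will take. The sketch below is a rough estimate for a single device, using this notebook's dataset size (233 rows, 80-20 split) and its batch size, accumulation, and epoch settings. The exact count depends on the `transformers` version, and `estimate_optimizer_steps` is an illustrative helper.

```python
import math


def estimate_optimizer_steps(num_train_rows, per_device_batch_size,
                             gradient_accumulation_steps, num_epochs):
    """Approximate the total number of optimizer updates on one device."""
    batches_per_epoch = math.ceil(num_train_rows / per_device_batch_size)
    # Several batches are accumulated into one optimizer update
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return updates_per_epoch * num_epochs


num_rows = 233
num_val_rows = math.ceil(0.2 * num_rows)  # validation share of the 80-20 split
num_train_rows = num_rows - num_val_rows

total_steps = estimate_optimizer_steps(
    num_train_rows,
    per_device_batch_size=4,
    gradient_accumulation_steps=8,
    num_epochs=12,
)
```

If the estimate is smaller than `eval_steps`, no step-based evaluation runs during training, and a `warmup_steps` value above the total means the learning rate never finishes warming up.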


In [None]:
# LoRA Configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)

# Apply LoRA
lora_model = get_peft_model(model, lora_config)

# Post-processing function to clean generated summaries
def post_process_summary(summary):
    summary = re.sub(r"http\S+", "", summary)  # Remove URLs
    return summary.strip()


# # Define Evaluation Metric and Compute Function
rouge_metric = evaluate.load('rouge')  # Load the metric with evaluate

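def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # The plain Trainer passes logits here (for BART, a tuple whose first
    # item is the logits), so take the arg-max token IDs before decoding.
    # These are teacher-forced predictions, not free-running generations.
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    if predictions.ndim == 3:
        predictions = np.argmax(predictions, axis=-1)
    # Replace any -100 (ignored positions) so the labels can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Post-process summaries to remove URLs and surrounding whitespace
    decoded_preds = [post_process_summary(pred) for pred in decoded_preds]
    decoded_labels = [post_process_summary(label) for label in decoded_labels]

    # ROUGE-Lsum expects a newline after each sentence
    decoded_preds = ["\n".join(pred.strip().split(". ")) for pred in decoded_preds]
    decoded_labels = ["\n".join(label.strip().split(". ")) for label in decoded_labels]

    # Compute ROUGE scores; evaluate returns plain float F-measures
    rouge_result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    rouge_result = {key: value * 100 for key, value in rouge_result.items()}

    return rouge_result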


In [None]:
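# Training Arguments (with enhancements)
# Note: with roughly 186 training rows, batch size 4, and 8 accumulation steps,
# a 12-epoch run takes far fewer than 500 optimizer steps, so eval_steps,
# save_steps, and warmup_steps may need to be lowered for evaluation and
# early stopping to take effect.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=12,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=1000,
    evaluation_strategy="steps",
    eval_steps=500,
    gradient_accumulation_steps=8,   # Simulate larger batch size (32)
    fp16=True,                       # Mixed precision training
    lr_scheduler_type="cosine_with_restarts",  # Learning rate scheduler
    load_best_model_at_end=True,        # Required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",  # Metric monitored for early stopping
    greater_is_better=False,            # Lower loss is better
)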

# Trainer with Early Stopping
trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]  # Early stopping after 3 non-improving evaluations
)

# Train the model
trainer.train()


In [None]:
# Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation Results: {eval_results}")

# Save the LoRA adapter and tokenizer locally
lora_model.save_pretrained("./Bart-Desc-Sum-fine-tuned-lora-model")
tokenizer.save_pretrained("./Bart-Desc-Sum-fine-tuned-lora-model")


In [None]:
# Push to Hugging Face Hub
from huggingface_hub import notebook_login

notebook_login()

lora_model.push_to_hub("kkrusere/Bart-Desc-Sum-fine-tuned-lora-model")
tokenizer.push_to_hub("kkrusere/Bart-Desc-Sum-fine-tuned-lora-model")