### **<center>Transformer-Based Summarization for Cleaning YouTube Video Descriptions</center>**

<center><em>
Leverage the power of transformer-based text summarization to automatically remove irrelevant information from YouTube video descriptions, ensuring they're concise and informative.
</em></center>

#### Intro:

YouTube video descriptions help attract viewers, but they often contain extraneous information (hashtags, subscribe links, promotions) that obscures what the video is about. This project uses transformer-based sequence-to-sequence summarization models, here BART, to clean these descriptions automatically.

By training a summarization model on a dataset of YouTube descriptions paired with their human-refined counterparts, the model learns to identify and remove irrelevant content while preserving key points. This leads to concise, informative descriptions.

The project will explore the fine-tuning and evaluation of transformer models for this specific summarization task, focusing on their ability to remove extraneous information and produce distilled video descriptions.

**Key Points:**
- Problem: YouTube descriptions often contain excessive tags, promotions, and irrelevant details.
- Solution: Transformer-based text summarization models trained to clean descriptions.
- Approach: Fine-tune models on a dataset of original and human-cleaned descriptions.
- Goal: Produce concise, informative descriptions that enhance user experience.
- Evaluation: Focus on the models' ability to remove extraneous information effectively.
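
One rough way to make the evaluation point concrete is to count the kinds of clutter named above (hashtags and links) before and after cleaning. The sketch below is a heuristic for illustration only and is not part of the original notebook; the names `count_clutter` and `clutter_removed` and both regular expressions are assumptions.

```python
import re

# Hypothetical patterns for two kinds of clutter: URLs and hashtags.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")


def count_clutter(text):
    """Return the number of URLs and hashtags found in `text`."""
    return {
        "urls": len(URL_PATTERN.findall(text)),
        "hashtags": len(HASHTAG_PATTERN.findall(text)),
    }


def clutter_removed(original, cleaned):
    """Return how many URLs and hashtags disappeared between two texts."""
    before = count_clutter(original)
    after = count_clutter(cleaned)
    return {key: before[key] - after[key] for key in before}


original = "Watch the show. #truTV #Comedy Subscribe: http://bit.ly/truTVSubscribe"
cleaned = "Watch the show."
print(clutter_removed(original, cleaned))
```

A count like this only measures what was removed; it says nothing about whether the key points were preserved, so it would complement rather than replace a summarization metric or human review.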

In [1]:
%%shell
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb

wget -N https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/118.0.5993.70/linux64/chromedriver-linux64.zip -P /tmp/
unzip -o /tmp/chromedriver-linux64.zip -d /tmp/
chmod +x /tmp/chromedriver-linux64/chromedriver
mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver
pip install selenium chromedriver_autoinstaller

[33m0% [Working][0m            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Ign:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Get:8 https://r2u.stat.illinois.edu/ubuntu jammy Release [5,713 B]
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy Release.gpg [793 B]
Hit:10 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:11 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,569 kB]
Hit:12 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:13 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:14 https://r2u.stat.illinois.e



In [2]:
!pip install datasets



In [3]:
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')


from selenium import webdriver
import chromedriver_autoinstaller
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import TimeoutException, ElementNotInteractableException

import pandas as pd
import numpy as np

import time
import random
import re
import json

In [4]:
def init_webdriver():
    """Initializes and returns a Chrome WebDriver instance with options.

    Returns:
        webdriver.Chrome: A configured Chrome WebDriver instance.

    Raises:
        Exception: If the WebDriver fails to initialize.
    """
    try:
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chromedriver_autoinstaller.install()
        driver = webdriver.Chrome(options=chrome_options)

        print("WebDriver initialized successfully")
        return driver
    except Exception as e:
        print(f"Failed to initialize WebDriver: {e}")
        raise


def close_webdriver(driver):
    """Closes the provided WebDriver instance.

    Args:
        driver (webdriver.Chrome): The WebDriver instance to close.

    Prints:
        str: Confirmation message once the WebDriver has been closed.
    """
    # Quit first so the confirmation is only printed after a successful close
    driver.quit()
    print("WebDriver successfully closed")


In [5]:
def get_video_data(video_id):
    """Fetches video data from YouTube given a video ID.

    Args:
        video_id (str): The ID of the YouTube video to fetch data for.

    Returns:
        dict: A dictionary containing the video data with the following keys:
            - 'channel_name': The name of the channel that uploaded the video.
            - 'video_title': The title of the video.
            - 'video_description': The description of the video.
            An empty dict is returned if the page cannot be loaded or parsed;
            those errors are caught and printed rather than propagated.

    Raises:
        Exception: If the WebDriver fails to initialize.
    """
    driver = init_webdriver()
    video_url = f"https://www.youtube.com/watch?v={video_id}"
    video_data = {}

    try:
        driver.get(video_url)

        try:
            # Wait for the bottom-row element to be present
            WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.XPATH, '//*[@id="bottom-row"]'))
            )

            # Locate and click the expand button if it exists.
            # Note: the absolute XPath below depends on YouTube's page layout
            # and may stop matching if that layout changes.
            try:
                expand_button = WebDriverWait(driver, 10).until(
                    EC.element_to_be_clickable((By.XPATH, '/html/body/ytd-app/div[1]/ytd-page-manager/ytd-watch-flexy/div[5]/div[1]/div/div[2]/ytd-watch-metadata/div/div[4]/div[1]/div/ytd-text-inline-expander/tp-yt-paper-button[1]'))
                )
                expand_button.click()
            except TimeoutException:
                pass  # Ignore if the expand button is not found

            # Wait for elements to be visible and extract data
            expanded_description = WebDriverWait(driver, 10).until(
                EC.visibility_of_element_located((By.ID, 'description-inline-expander'))
            )
            title_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//h1[@class="style-scope ytd-watch-metadata"]//yt-formatted-string'))
            )
            channel_name_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//ytd-channel-name[@id="channel-name"]//yt-formatted-string//a'))
            )

            video_data = {
                'channel_name': channel_name_element.text,
                'video_title': title_element.text,
                'video_description': expanded_description.text
            }

        except TimeoutException:
            print(f"Error processing {video_url}: Elements not found within timeout.")

    except Exception as e:
        print(f"Error processing {video_url}: {e}")

    finally:
        # Close the browser when done
        close_webdriver(driver)

    return video_data


In [6]:
#Test:
video_id = "CETSlLO_jio"
# Get video data
video_data = get_video_data(video_id)

# Print the data
video_data


WebDriver initialized successfully
WebDriver successfully closed


{'channel_name': 'truTV',
 'video_title': 'Funniest If You Laugh You Lose Moments (Mashup) | Impractical Jokers | truTV',
 'video_description': 'It\'s impossible not to laugh at the way Murr rides away on his luggage. We can\'t get enough of these "If You Laugh, You Lose" challenges. Watch Impractical Jokers on truTV.\n\n#ImpracticalJokers  #truTV #BrianQuinn #JamesMurray #SalVulcano\n\nSubscribe: http://bit.ly/truTVSubscribe\nWatch More Impractical Jokers: http://bit.ly/2p59m19\nWatch full episodes for Free: http://bit.ly/ImpracticalJokersTruTV\n\nAbout Impractical Jokers:\nThree comedians and lifelong friends compete to embarrass each other amongst the general public with a series of hilarious and outrageous dares. When Sal, Q, and Murr challenge each other to say or do something, they have to do it… if they refuse, they lose! At the end of every episode - with the help of a celebrity guest - the episode\'s loser must endure a punishment of epic proportions.\n\nDownload the Jokers Wh

**Data Collection - Overview**
- The data collection process gathers YouTube video descriptions together with metadata (channel name and video title) using the `get_video_data` function defined above. This data is used to train and evaluate the transformer-based summarization model.

**Steps:**

- Fetch Video Data:
  - Iterate through a predefined list of YouTube video IDs.
  - For each video ID, use a custom function to retrieve the video data.
- The function fetches:
  - Channel Name: The name of the channel where the video was uploaded.
  - Video Title: The title of the video.
  - Video Description: The description text provided by the video uploader.
- Store Data:
  - Append the retrieved data, formatted as a dictionary, to the list.
  - Store the collected data in a file (e.g., JSON or CSV) to facilitate access and further processing.
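
The steps above can be sketched as a small loop. This is a minimal sketch rather than the notebook's actual collection code: `collect_video_data`, its parameters, and the output file name are hypothetical, and the fetch function is passed in as an argument so that the `get_video_data` function defined earlier could be supplied in practice. Videos for which the fetcher returns an empty dictionary are skipped.

```python
import json
import os
import tempfile
import time


def collect_video_data(video_ids, fetch_fn, out_path, delay_seconds=0.0):
    """Fetch data for each video ID and save the results as a JSON list.

    fetch_fn is any callable that maps a video ID to a dict with the keys
    'channel_name', 'video_title' and 'video_description', or to an empty
    dict on failure (assumed interface, mirroring get_video_data).
    """
    collected = []
    for video_id in video_ids:
        video_data = fetch_fn(video_id)
        if video_data:  # skip videos that could not be scraped
            collected.append(video_data)
        if delay_seconds:
            time.sleep(delay_seconds)  # optional pause between requests

    with open(out_path, "w") as json_file:
        json.dump(collected, json_file, indent=4)
    return collected


# Stand-in fetcher so the sketch runs without a browser.
def fake_fetch(video_id):
    if video_id == "missing":
        return {}
    return {
        "channel_name": "Example Channel",
        "video_title": f"Video {video_id}",
        "video_description": "Example description.",
    }


# Hypothetical output location, used only for this example.
out_file = os.path.join(tempfile.gettempdir(), "video_data_example.json")
records = collect_video_data(["a1", "missing", "b2"], fake_fetch, out_file)
print(len(records))
```

Passing the fetcher as a parameter keeps the loop independent of Selenium, so it can be tried with a stand-in function before running the slower browser-based scrape.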

**Example Output**
The collected data will be a list of dictionaries, each containing the following keys:

```yaml
channel_name: The name of the YouTube channel.
video_title: The title of the video.
video_description: The description text of the video.
```


In [7]:
from google.colab import drive
import os
import json
#mounting google drive
drive.mount('/content/drive')
########################################
#changing the working directory
os.chdir("/content/drive/MyDrive/NLP_Data")

!pwd

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/NLP_Data


In [8]:
# Helper functions for reading and writing JSON files in the current working directory

def save_to_json(data, filename):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)

def load_from_json(filename):
    with open(filename, 'r') as json_file:
        data = json.load(json_file)
    return data

In [9]:
df = pd.read_csv('video_descriptions_train_data_df.csv')
df.head()


| | channel_name | video_title | video_description | clean_video_descriptions |
|---|---|---|---|---|
| 0 | 48 Hours | The Murdaugh Mysteries \| Full Episode | On March 2, 2023, a jury found Alex Murdaugh g... | This video is about Alex Murdaugh, who was fou... |
| 1 | ESPN | How LeBron changed his game to dominate into h... | ESPN analyst Kirk Goldsberry explains how LeBr... | ESPN analyst Kirk Goldsberry explains how LeBr... |
| 2 | Tech Quarks | Modern Warehouse Technology For A Next level A... | Warehouse automation \| Intelligent Warehouse \|... | This video showcases the most modern warehouse... |
| 3 | Discovery | Destroyed in Seconds - Bulldozer Rampage | An angry man bent on revenge builds a customiz... | An angry man seeking revenge uses a customized... |
| 4 | The Babylon Bee | Californians Move to Texas \| Episode 2: The Co... | Our Californian couple is back in Texas, and t... | A Californian couple experiences culture shock... |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   channel_name              150 non-null    object
 1   video_title               150 non-null    object
 2   video_description         150 non-null    object
 3   clean_video_descriptions  150 non-null    object
dtypes: object(4)
memory usage: 4.8+ KB


###### **Data Preparation for Fine-Tuning**


1. **Data Formatting:**
   - Each row is formatted into a single input string that combines `channel_name`, `video_title`, and `video_description`.
   - The target output is `clean_video_descriptions`.
2. **Converting to Dataset Object:**
   - `Dataset.from_list(formatted_data)` converts the list of formatted input-output pairs into a Hugging Face `Dataset` object.
3. **Tokenization:**
   - The `tokenize_data` function tokenizes both the input text and the target text.
   - The tokenized target is added to the input dictionary under `"labels"`, as required for seq2seq training.
4. **Tokenized Dataset:**
   - The tokenized dataset, `tokenized_datasets`, is now ready for fine-tuning the BART model using LoRA.


In [11]:
from datasets import Dataset
from transformers import BartTokenizer

In [12]:
# Loading the tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)

# Data Cleaning (If needed)
def clean_text(text):
    text = text.strip()  # Remove leading and trailing whitespaces
    text = " ".join(text.split())  # Remove extra spaces
    return text

# Apply cleaning to all relevant columns
df['channel_name'] = df['channel_name'].apply(clean_text)
df['video_title'] = df['video_title'].apply(clean_text)
df['video_description'] = df['video_description'].apply(clean_text)
df['clean_video_descriptions'] = df['clean_video_descriptions'].apply(clean_text)

# Data Formatting
formatted_data = [
    {
        "input": f"Channel: {row['channel_name']}, Title: {row['video_title']}, Description: {row['video_description']}",
        "output": row['clean_video_descriptions']
    }
    for _, row in df.iterrows()
]

# Converting list of dictionaries to Dataset object
train_dataset = Dataset.from_list(formatted_data)

# Tokenization function
def tokenize_data(data):
    # Tokenize the input text with the BART tokenizer
    model_inputs = tokenizer(
        data["input"],
        max_length=512,
        padding="max_length",
        truncation=True
    )

    # Tokenize the target text; `text_target` replaces the deprecated
    # `as_target_tokenizer()` context manager in recent transformers releases
    labels = tokenizer(
        text_target=data["output"],
        max_length=128,
        padding="max_length",
        truncation=True
    )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

# Apply tokenization to the dataset
tokenized_datasets = train_dataset.map(tokenize_data, batched=True)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
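
Because the targets are padded to a fixed length, the padded positions in `labels` would contribute to the training loss unless they are masked. A common convention with Hugging Face seq2seq models is to replace pad token IDs in the labels with -100, which PyTorch's cross-entropy loss ignores by default. The helper below is a sketch of that step and is not part of the original notebook: `mask_label_padding` is a hypothetical name, and the pad ID of 1 in the toy batch is an assumption (in practice it would be read from `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_label_padding(batch_label_ids, pad_token_id):
    """Replace pad token IDs with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in label_ids]
        for label_ids in batch_label_ids
    ]


# Toy batch: two label sequences padded with ID 1 (assumed pad ID).
example_labels = [[0, 713, 2, 1, 1], [0, 31414, 232, 2, 1]]
masked = mask_label_padding(example_labels, pad_token_id=1)
print(masked)
```

Inside `tokenize_data`, this could be applied as `model_inputs["labels"] = mask_label_padding(labels["input_ids"], tokenizer.pad_token_id)`.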



