<a href="https://colab.research.google.com/github/kkrusere/youTube-comments-Analyzer/blob/main/fine-tuned_LLM_text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **<center>Transformer-Based Summarization for Cleaning YouTube Video Descriptions</center>**

<center><em>
Leverage the power of transformer-based text summarization to automatically remove irrelevant information from YouTube video descriptions, ensuring they're concise and informative.
</em></center>

#### Intro:

YouTube video descriptions are vital for attracting viewers, but often contain extraneous information that hinders understanding. This project utilizes transformer-based text summarization models (like BERT and GPT) to automatically clean these descriptions.

By training a summarization model on a dataset of YouTube descriptions paired with their human-refined counterparts, the model learns to identify and remove irrelevant content while preserving key points. This leads to concise, informative descriptions.

The project will explore the fine-tuning and evaluation of transformer models for this specific summarization task, focusing on their ability to remove extraneous information and produce distilled video descriptions.

**Key Points:**
- Problem: YouTube descriptions often contain excessive tags, promotions, and irrelevant details.
- Solution: Transformer-based text summarization models trained to clean descriptions.
- Approach: Fine-tune models on a dataset of original and human-cleaned descriptions.
- Goal: Produce concise, informative descriptions that enhance user experience.
- Evaluation: Focus on the models' ability to remove extraneous information effectively.

In [1]:
%%shell
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb

wget -N https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/118.0.5993.70/linux64/chromedriver-linux64.zip -P /tmp/
unzip -o /tmp/chromedriver-linux64.zip -d /tmp/
chmod +x /tmp/chromedriver-linux64/chromedriver
mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver
pip install selenium chromedriver_autoinstaller

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:3 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Hit:8 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:10 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [2,499 kB]
Ign:11 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Get:12 https://r2u.stat.illinois.edu/ubuntu jammy Release [5,713 B]
Get:13 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packa



In [2]:
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')


from selenium import webdriver
import chromedriver_autoinstaller
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import TimeoutException, ElementNotInteractableException
from selenium.webdriver.common.action_chains import ActionChains


import time
import random
import re
import json
import random

In [3]:
def init_webdriver():
    """Initializes and returns a Chrome WebDriver instance with options.

    Returns:
        webdriver.Chrome: A configured Chrome WebDriver instance.

    Raises:
        Exception: If the WebDriver fails to initialize.
    """
    try:
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chromedriver_autoinstaller.install()
        driver = webdriver.Chrome(options=chrome_options)

        print("WebDriver initialized successfully")
        return driver
    except Exception as e:
        print(f"Failed to initialize WebDriver: {e}")
        raise


def close_webdriver(driver):
    """Closes the provided WebDriver instance.

    Args:
        driver (webdriver.Chrome): The WebDriver instance to close.

    Prints:
        str: Confirmation message that the WebDriver has been closed.
    """
    print("WebDriver successfully closed")
    driver.quit()


In [4]:
def get_video_data(video_id):
    """Fetches video data from YouTube given a video ID.

    Args:
        video_id (str): The ID of the YouTube video to fetch data for.

    Returns:
        dict: A dictionary containing the video data with the following keys:
            - 'channel_name': The name of the channel that uploaded the video.
            - 'video_title': The title of the video.
            - 'video_description': The description of the video.

    Raises:
        Exception: If there is an error accessing or processing the video data.
    """
    driver = init_webdriver()
    video_url = f"https://www.youtube.com/watch?v={video_id}"
    video_data = {}

    try:
        driver.get(video_url)

        try:
            # Wait for the bottom-row element to be present
            bottom_row = WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.XPATH, '//*[@id="bottom-row"]'))
            )

            # Locate and click the expand button if it exists
            try:
                expand_button = WebDriverWait(driver, 10).until(
                    EC.element_to_be_clickable((By.XPATH, '/html/body/ytd-app/div[1]/ytd-page-manager/ytd-watch-flexy/div[5]/div[1]/div/div[2]/ytd-watch-metadata/div/div[4]/div[1]/div/ytd-text-inline-expander/tp-yt-paper-button[1]'))
                )
                expand_button.click()
            except TimeoutException:
                pass  # Ignore if the expand button is not found

            # Wait for elements to be visible and extract data
            expanded_description = WebDriverWait(driver, 10).until(
                EC.visibility_of_element_located((By.ID, 'description-inline-expander'))
            )
            title_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//h1[@class="style-scope ytd-watch-metadata"]//yt-formatted-string'))
            )
            channel_name_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//ytd-channel-name[@id="channel-name"]//yt-formatted-string//a'))
            )

            video_data = {
                'channel_name': channel_name_element.text,
                'video_title': title_element.text,
                'video_description': expanded_description.text
            }

        except TimeoutException:
            print(f"Error processing {video_url}: Elements not found within timeout.")

    except Exception as e:
        print(f"Error processing {video_url}: {e}")

    finally:
        # Close the browser when done
        close_webdriver(driver)

    return video_data


In [5]:
#Test:
video_id = "CETSlLO_jio"
# Get video data
video_data = get_video_data(video_id)

# Print the data
video_data


WebDriver initialized successfully
Error processing https://www.youtube.com/watch?v=CETSlLO_jio: Message: element click intercepted: Element <tp-yt-paper-button id="expand" class="button style-scope ytd-text-inline-expander" style-target="host" role="button" tabindex="0" animated="" elevation="0" aria-disabled="false" style="left: 3px;">...</tp-yt-paper-button> is not clickable at point (61, 378). Other element would receive the click: <tp-yt-paper-dialog id="dialog" aria-labelledby="cb-header" modal="" class="eom-v1-dialog style-scope ytd-consent-bump-v2-lightbox style-scope ytd-consent-bump-v2-lightbox" style-target="host" role="dialog" tabindex="-1" style="outline: none; position: fixed; top: 0px; left: 40px; box-sizing: border-box; max-height: 388px; max-width: 780px; z-index: 2202;">...</tp-yt-paper-dialog>
  (Session info: chrome=128.0.6613.119)
Stacktrace:
#0 0x55fe3af6a86a <unknown>
#1 0x55fe3ac38e50 <unknown>
#2 0x55fe3ac8f346 <unknown>
#3 0x55fe3ac8d25d <unknown>
#4 0x55fe3ac

{}

**Data Collection - Overview**
- The data collection process involves gathering YouTube video descriptions along with additional metadata, such as the channel name and video title. We are going to use the above functions for this. This data will be used to train and evaluate our transformer-based summarization model.

**Steps:**

- Fetch Video Data:
Iterate through a predefined list of YouTube video IDs.
> For each video ID, use a custom function to retrieve the video data.
- The function fetches:
> - Channel Name: The name of the channel where the video was uploaded.
> - Video Title: The title of the video.
> - Video Description: The description text provided by the video uploader.
- Store Data:
> - Append the retrieved data, formatted as a dictionary, to the list.
> - Store the collected data in a file (e.g., JSON or CSV) to facilitate access and further processing.

**Example Output**
> - The collected data will be a list of dictionaries, each containing the following keys:

> - ```yaml
channel_name: The name of the YouTube channel.
video_title: The title of the video.
video_description: The description text of the video.


In [6]:
from google.colab import drive
import os
import json
#mounting google drive
drive.mount('/content/drive')
########################################
#changing the working directory
os.chdir("/content/drive/MyDrive/NLP_Data")

!pwd

Mounted at /content/drive
/content/drive/MyDrive/NLP_Data


In [7]:
# below are functions for reading a writting json file for the current working directory

def save_to_json(data, filename):
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)

def load_from_json(filename):
    with open(filename, 'r') as json_file:
        comments = json.load(json_file)
    return comments