<a href="https://colab.research.google.com/github/kkrusere/EV_Market-Analysis-and-Consumer-Behavior/blob/main/EV_YouTube_comments_Data_Collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%shell
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb

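# NOTE: this pins ChromeDriver 118.0.5993.70, while the step above installs the
# current stable Chrome. chromedriver_autoinstaller.install() in init_webdriver()
# downloads a driver that matches the installed Chrome if the versions differ.
wget -N https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/118.0.5993.70/linux64/chromedriver-linux64.zip -P /tmp/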
unzip -o /tmp/chromedriver-linux64.zip -d /tmp/
chmod +x /tmp/chromedriver-linux64/chromedriver
mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver
pip install selenium chromedriver_autoinstaller beautifulsoup4

# Web Scraping and Data Retrieval


## Dependencies:
 - Selenium WebDriver: Ensure you have selenium, chromedriver_autoinstaller, and a compatible version of ChromeDriver installed.
 - BeautifulSoup: Required for parsing HTML content (bs4 library).


In [None]:
import time
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import chromedriver_autoinstaller


1. WebDriver Management
- `init_webdriver()`
  * Purpose: Initializes a headless Chrome WebDriver instance with specified options for automated browser interactions.
  * Dependencies: Requires selenium, chromedriver_autoinstaller, and a compatible version of ChromeDriver.

In [1]:
def init_webdriver():
    """Initializes and returns a Chrome WebDriver instance with options."""
    try:
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument("--headless")  # Run in headless mode
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chromedriver_autoinstaller.install()  # Automatically install chromedriver
        driver = webdriver.Chrome(options=chrome_options)
        print("WebDriver initialized successfully")
        return driver
    except Exception as e:
        print(f"Failed to initialize WebDriver: {e}")
        raise


- `close_webdriver(driver)`
  * Purpose: Closes the WebDriver instance to free up system resources.

In [None]:
def close_webdriver(driver):
    """Closes the provided WebDriver instance."""
    driver.quit()
    print("WebDriver successfully closed")


2. YouTube URL and Video ID Handling
- `get_youtube_url(video_id)`
 - Purpose: Creates a standard YouTube video URL using the provided video ID.

In [None]:
def get_youtube_url(video_id):
    """Constructs a YouTube URL from a given video ID."""
    return f"https://www.youtube.com/watch?v={video_id}"


- `get_youtube_videoID(youtube_url)`
 - Purpose: Extracts the video ID from various formats of YouTube URLs, handling both standard and shortened links.

In [2]:
def get_youtube_videoID(youtube_url):
    """Extracts the YouTube video ID from a given YouTube URL."""
    if not youtube_url:
        return None
    try:
        if "watch?v=" in youtube_url:
            video_id = youtube_url.split("watch?v=")[1].split("&")[0]
            return video_id
        elif "youtu.be/" in youtube_url:
            video_id = youtube_url.split("youtu.be/")[1].split("?")[0]
            return video_id
        else:
            return None
    except Exception as e:
        print(f"Error extracting video ID: {e}")
        return None
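
The string splitting above covers the standard `watch?v=` and `youtu.be/` forms. As a hedged alternative, the sketch below parses the URL with the standard library's `urllib.parse`, so the `v` parameter is found even when it is not the first query parameter. The function name `extract_video_id` and the support for `/embed/` and `/shorts/` paths are assumptions added for illustration, not part of the notebook. URLs without a scheme (for example `youtu.be/abc123`) return `None` in this version.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(youtube_url):
    """Return the video ID from a YouTube URL, or None if none is found."""
    if not youtube_url:
        return None

    parsed = urlparse(youtube_url)
    host = (parsed.hostname or "").lower()
    path_parts = [part for part in parsed.path.split("/") if part]

    # Shortened links: https://youtu.be/<id>
    if host == "youtu.be":
        return path_parts[0] if path_parts else None

    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard links: the ID is the `v` query parameter, in any position
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        # /embed/<id> and /shorts/<id> links (an extension beyond the notebook)
        if len(path_parts) >= 2 and path_parts[0] in ("embed", "shorts"):
            return path_parts[1]

    return None


print(extract_video_id("https://www.youtube.com/watch?feature=share&v=abc123"))
```

Checking the hostname first also rejects look-alike URLs from other sites that happen to contain `watch?v=`.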


3. Comment Fetching and Parsing
- `get_comments_html(video_url, driver)`
 - Purpose: Scrolls the YouTube video page to trigger lazy loading of comments, up to a fixed scroll limit, and returns the page HTML.
 - Notes: Uses JavaScript execution for scrolling and waits to handle dynamic content loading.

In [None]:
def get_comments_html(video_url, driver):
    """
    Loads a YouTube video page, scrolls to load its comments, and returns the page HTML.

    Args:
        video_url (str): The URL of the YouTube video from which to fetch comments.
        driver: An initialized WebDriver instance (from Selenium).

    Returns:
        str: The full page source after scrolling, which includes the loaded comments.

    Raises:
        TimeoutException: If the comments section does not load within the specified time.
    """
    # Navigate to the video URL
    driver.get(video_url)

    # Wait until the comments section is loaded
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "ytd-comments"))
    )

    # Scroll to the comments section to load initial comments
    driver.execute_script(
        "window.scrollTo(0, document.documentElement.scrollHeight);"
    )

    # Initialize variables for dynamic scrolling
    last_height = driver.execute_script(
        "return document.documentElement.scrollHeight"
    )
    scroll_pause_time = 2  # Time to wait between scrolls
    max_scrolls = 100      # Upper bound on scrolls; very long threads may be cut off
    scroll_count = 0

    while scroll_count < max_scrolls:
        # Scroll down to the bottom
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);"
        )

        # Wait for new comments to load
        time.sleep(scroll_pause_time)

        # Calculate new scroll height and compare with last height
        new_height = driver.execute_script(
            "return document.documentElement.scrollHeight"
        )
        if new_height == last_height:
            # Break if no new content is loaded
            print("All comments have been loaded.")
            break
        last_height = new_height
        scroll_count += 1

    # Get the HTML content
    comments_html = driver.page_source
    return comments_html
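
The stopping rule in the loop above (scroll, wait, compare page heights, stop when the height no longer changes or the scroll limit is hit) can be isolated from Selenium and exercised without a browser. The sketch below is one way to do that. `scroll_until_stable` and `FakePage` are hypothetical names introduced here, and the `scroll`, `get_height`, and `sleep` callables are assumptions standing in for the `driver.execute_script(...)` calls and `time.sleep`.

```python
import time


def scroll_until_stable(scroll, get_height, pause=2.0, max_scrolls=100, sleep=time.sleep):
    """Scroll until the page height stops changing or max_scrolls is reached.

    Returns (scroll_count, reached_end), where reached_end is False when the
    loop stopped only because of the max_scrolls limit.
    """
    last_height = get_height()
    for scroll_count in range(1, max_scrolls + 1):
        scroll()
        sleep(pause)  # give newly requested content time to load
        new_height = get_height()
        if new_height == last_height:
            return scroll_count, True
        last_height = new_height
    return max_scrolls, False


class FakePage:
    """Stand-in for a browser page whose height grows with each scroll."""

    def __init__(self, heights):
        self.heights = heights
        self.index = 0

    def scroll(self):
        if self.index < len(self.heights) - 1:
            self.index += 1

    def height(self):
        return self.heights[self.index]


page = FakePage([1000, 2000, 3000])
count, reached_end = scroll_until_stable(
    page.scroll, page.height, pause=0, sleep=lambda seconds: None
)
print(count, reached_end)  # prints: 3 True
```

With a real driver, `scroll` and `get_height` would wrap the two `execute_script` calls used in `get_comments_html`. Returning `reached_end` lets the caller tell a fully loaded page apart from one truncated by the scroll limit.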


- `get_comment_thread_renderers(comments_html)`
 - Purpose: Parses the HTML content using BeautifulSoup to find all comment thread elements.


In [None]:
def get_comment_thread_renderers(comments_html):
    """
    Parses the provided HTML content to extract YouTube comment threads.

    Args:
        comments_html (str): The HTML content of the comments section of a YouTube video.

    Returns:
        list: A list of `ytd-comment-thread-renderer` elements found in the HTML.
    """
    soup = BeautifulSoup(comments_html, "html.parser")
    comment_thread_renderers = soup.find_all(
        "ytd-comment-thread-renderer",
        class_="style-scope ytd-item-section-renderer",
    )
    return comment_thread_renderers


- `get_comments(comment_thread_renderers)`
 - Purpose: Extracts individual comment texts, like counts, and reply counts from the parsed HTML elements.
 - Notes: A missing element yields `None` for that field instead of raising, so minor variations in YouTube's HTML structure do not abort the extraction.

In [None]:
def get_comments(comment_thread_renderers):
    """
    Extracts comments and associated data from the list of comment thread renderers.

    Args:
        comment_thread_renderers (list): List of 'ytd-comment-thread-renderer' elements.

    Returns:
        tuple: A tuple containing a list of comment texts and a list of dictionaries with comment data.
    """
    comments = []
    comments_data = []

    for comment_thread_renderer in comment_thread_renderers:
        # Extract the comment text
        comment_text_element = comment_thread_renderer.find(
            "yt-formatted-string", id="content-text"
        )
        comment_text = (
            comment_text_element.get_text(strip=True)
            if comment_text_element
            else None
        )

        # Extract the number of likes
        like_count_element = comment_thread_renderer.find(
            "span", id="vote-count-middle"
        )
        like_count = (
            like_count_element.get_text(strip=True)
            if like_count_element
            else None
        )

        # Extract the number of replies
        reply_count_element = comment_thread_renderer.find(
            "ytd-comment-replies-renderer"
        )
        reply_count = (
            reply_count_element.get("reply-count")
            if reply_count_element
            else None
        )

        comments.append(comment_text)
        comments_data.append(
            {
                "comment_text": comment_text,
                "like_count": like_count,
                "reply_count": reply_count,
            }
        )

    return comments, comments_data
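
Note that `like_count` is stored as the text shown on the page, not as a number. If numeric values are needed for analysis, a small converter along the following lines may help. This is a sketch under the assumption that counts appear as English-locale display strings such as `12`, `1,234`, or `1.2K`; `parse_count` is a hypothetical helper, not part of the notebook.

```python
def parse_count(text):
    """Convert a display count such as '1.2K' to an int.

    Returns None when the text is missing, empty, or not recognizable.
    """
    if text is None:
        return None
    cleaned = text.strip().replace(",", "").upper()
    if not cleaned:
        return None

    # Assumed abbreviations: K (thousand), M (million), B (billion)
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    suffix = cleaned[-1]
    try:
        if suffix in multipliers:
            return int(round(float(cleaned[:-1]) * multipliers[suffix]))
        return int(cleaned)
    except ValueError:
        return None


for raw in ["12", "1,234", "1.2K", "2M", ""]:
    print(repr(raw), "->", parse_count(raw))
```

Returning `None` for unrecognized input keeps the behavior consistent with `get_comments`, which also uses `None` for fields it cannot find.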


- `get_video_comments(video_url, driver)`
 - Purpose: Combines the above functions to retrieve and structure comment data for a given video.

In [None]:
def get_video_comments(video_url, driver):
    """
    Retrieves comments from the provided YouTube video URL.

    Args:
        video_url (str): The URL of the YouTube video.
        driver: An initialized WebDriver instance.

    Returns:
        list: A list of dictionaries containing comment data.
    """
    comments_html = get_comments_html(video_url, driver)
    comment_thread_renderers = get_comment_thread_renderers(comments_html)
    _, comments_data = get_comments(comment_thread_renderers)
    return comments_data


4. Video Data Retrieval
- `get_video_data(video_id)`

 - Purpose: Navigates to the YouTube video page, handles potential consent and modal dialogs, extracts video metadata (channel name, title, description), and retrieves comments.

 - Notes:
   - Error Handling: Uses try/except blocks to handle exceptions and a `finally` block to ensure the WebDriver is closed properly.
   - Selectors: Uses CSS selectors and XPaths to locate elements on the page.


In [None]:
def get_video_data(video_id):
    """
    Fetches video data from YouTube given a video ID.

    Args:
        video_id (str): The ID of the YouTube video to fetch data for.

    Returns:
        dict: A dictionary containing the video data with keys:
            - 'channel_name'
            - 'video_title'
            - 'video_description'
            - 'comments'
    """
    driver = init_webdriver()
    video_url = get_youtube_url(video_id)
    video_data = {}

    try:
        driver.get(video_url)

        # Handle YouTube consent dialog if it appears
        try:
            consent_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable(
                    (By.XPATH, '//button[contains(text(), "I agree")]')
                )
            )
            consent_button.click()
        except TimeoutException:
            print("No consent dialog found or already handled.")

        # Handle any other potential modal dialogs
        try:
            dialog_close_button = WebDriverWait(driver, 5).until(
                EC.element_to_be_clickable(
                    (By.XPATH, '//button[@aria-label="Close"]')
                )
            )
            dialog_close_button.click()
        except TimeoutException:
            print("No additional modal dialogs found.")

        # Extract video details
        try:
            # Wait for page elements to load
            WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.ID, "info-contents"))
            )

            # Extract channel name
            channel_name_element = driver.find_element(
                By.CSS_SELECTOR, 'ytd-channel-name#channel-name a'
            )
            channel_name = channel_name_element.text

            # Extract video title
            title_element = driver.find_element(
                By.CSS_SELECTOR, 'h1.title yt-formatted-string'
            )
            video_title = title_element.text

            # Expand the description if the expand button exists
            try:
                expand_button = driver.find_element(
                    By.CSS_SELECTOR, 'tp-yt-paper-button#expand'
                )
                driver.execute_script("arguments[0].click();", expand_button)
            except Exception:
                pass  # Description is already expanded or expand button not found

            # Extract video description
            description_element = driver.find_element(
                By.CSS_SELECTOR, 'yt-formatted-string.content'
            )
            video_description = description_element.text

            video_data = {
                "channel_name": channel_name,
                "video_title": video_title,
                "video_description": video_description,
            }

            # Fetch comments
            comments_data = get_video_comments(video_url, driver)
            video_data["comments"] = comments_data

        except TimeoutException as e:
            print(f"Error processing {video_url}: {e}")

    except Exception as e:
        print(f"Error processing {video_url}: {e}")

    finally:
        # Close the browser
        close_webdriver(driver)

    return video_data


In [None]:
def main(youtube_url):
    """
    Main function to execute the data retrieval steps.
    """
    video_id = get_youtube_videoID(youtube_url)
    if not video_id:
        print("Invalid YouTube URL.")
        return

    # Fetch video metadata and comments (the description is left uncleaned)
    video_data = get_video_data(video_id)
    if not video_data:
        print("Failed to retrieve video data.")
        return

    # video_data now contains:
    # - 'channel_name'
    # - 'video_title'
    # - 'video_description' (original, uncleaned)
    # - 'comments' (list of comment data)
    return video_data



Using the functions above, we now focus on creating a Lambda function that performs the web scraping and data retrieval. This will make it easy to integrate with the final Streamlit dashboard.

Creating a Lambda function that uses Selenium and headless Chrome to scrape YouTube data involves several steps due to the dependencies and environment setup required.

# Overview

- **Goal:** Create an AWS Lambda function that performs web scraping of YouTube video data, including comments and metadata.
- **Challenges:** AWS Lambda limits deployment package size (250 MB unzipped, including layers, for .zip deployments), and Selenium and headless Chrome require native binaries and specific configuration.
- **Solution:** Use AWS Lambda Layers to include the necessary binaries and libraries, and package the Python code with its dependencies.
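
As a rough illustration of the goal above, the sketch below wraps a scraping callable in a Lambda-style handler. Several details are assumptions rather than anything fixed by this notebook: the event shape `{"youtube_url": ...}`, the `statusCode`/`body` response convention, the chosen status codes, and the `make_handler` helper itself. The `scrape` argument is a placeholder for something like `lambda url: get_video_data(get_youtube_videoID(url))`, and is stubbed here so the example runs without Chrome.

```python
import json


def make_handler(scrape):
    """Build a Lambda-style handler around a scraping callable.

    `scrape` takes a YouTube URL and returns a dict of video data,
    or a falsy value (None or {}) on failure.
    """

    def lambda_handler(event, context):
        # Assumed event shape: {"youtube_url": "<url>"}
        youtube_url = (event or {}).get("youtube_url")
        if not youtube_url:
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "youtube_url is required"}),
            }

        video_data = scrape(youtube_url)
        if not video_data:
            return {
                "statusCode": 502,
                "body": json.dumps({"error": "Failed to retrieve video data"}),
            }

        return {"statusCode": 200, "body": json.dumps(video_data)}

    return lambda_handler


# Stubbed scraper for local testing; no browser is started.
handler = make_handler(lambda url: {"video_title": "demo", "comments": []})
response = handler({"youtube_url": "https://youtu.be/abc123"}, None)
print(response["statusCode"])  # prints: 200
```

Injecting the scraper keeps the handler logic testable locally, while the Selenium and Chrome binaries only need to be present in the deployed environment.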