# NLP Comment Parsing on the Fermilab YouTube Channel

This notebook is a simple analysis of comments on the Fermilab YouTube channel about popular topics in physics. The comments are parsed with the NLP library spaCy.

This does NOT use the YouTube API and is probably a violation of fair use policy. I do not endorse this approach, nor should this notebook be considered an endorsement.


In [98]:
import os
from collections import defaultdict, Counter
from pathlib import Path
import sys
import time
from typing import Optional, Literal, Final, List, Dict

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.common.by import By
import spacy
import tomli

Get the pre-trained pipeline 'en_core_web_lg', whose word vectors are used for token similarity in spaCy. BeautifulSoup is used for HTML parsing.

In [None]:
!python -m spacy download en_core_web_lg

Select a video from the Subatomic Stories series: "3 Subatomic Stories: Charged leptons"

In [99]:
vid_url: Final[str] = (
    # u"https://www.youtube.com/watch?v=ilwMM-CEO6w"
    u"https://www.youtube.com/watch?v=RN10TgkPCbQ"
    # u"https://www.youtube.com/watch?v=jtp3Jk-nKhQ"
)

Define some helper functions to parse YouTube comments without the API

In [100]:
def get_chrome_driver(
    driver_exe: Optional[os.PathLike] = None,
) -> Optional[WebDriver]:
    """Get the Chrome driver executable for Selenium. The driver is
    searched in the environment variable 'PATH'

    Args:
        driver_exe: Path of executable (default="chromedriver")

    Returns:
        Chrome webdriver if executable found, else None
    """
    driver: WebDriver

    if driver_exe is None:
        driver_exe = "chromedriver"

    if Path(driver_exe).exists():
        return webdriver.Chrome(driver_exe)

    # sys.platform is "win32" on Windows; a substring test for "win" would
    # also match "darwin" (macOS)
    if sys.platform.startswith("win"):
        driver_exe = f"{driver_exe}.exe"
    chrome_driver_path: Optional[Path] = None

    # os.pathsep is ";" on Windows and ":" elsewhere
    for file_path in os.environ.get("PATH", "").split(os.pathsep):
        temp_file = Path(file_path) / driver_exe
        if temp_file.exists():
            chrome_driver_path = temp_file
            break

    if chrome_driver_path is not None and chrome_driver_path.exists():
        driver = webdriver.Chrome(chrome_driver_path)
        return driver
    return None


def get_driver(
    driver_type: Literal["chrome"] = "chrome",
    driver_exe: Optional[os.PathLike] = None,
) -> Optional[WebDriver]:
    """Get the specified driver for Selenium

    Args:
        driver_type: Type of driver like 'chrome'
        driver_exe: PathLike executable

    Returns:
        Webdriver if executable found, else None
    """
    if driver_type == "chrome":
        return get_chrome_driver(driver_exe)
    return None


def get_yt_comments(
    vid_url: str,
    pages: int = 7,
    min_sleep_sec: float = 2.0
) -> List[str]:
    """Obtain YouTube comments as a list of unicode strings. Due to the nature of
    this algorithm, this violates YouTube fair use policy and will NOT be used in
    production code. This function is unable to extract more than 100 comments.

    Args:
        vid_url: URL of the video on YouTube
        pages: Number of comment pages to load
        min_sleep_sec: Wait time between comment page scrolling

    Returns:
         Comments list
    """
    comments: List[str]

    driver = get_driver()
    if driver is None:
        raise TypeError("Unable to get web driver!")
    driver.get(vid_url)
    driver.maximize_window()

    # Scroll to first comments page
    time.sleep(2 * min_sleep_sec)
    driver.execute_script("window.scrollTo(0, 1000);")

    for _ in range(pages - 1):
        time.sleep(min_sleep_sec)
        driver.execute_script("window.scrollTo(0, 10000);")

    comments_section: WebElement
    [comments_section] = driver.find_elements(
        by=By.XPATH,
        value='//*[@id="comments"]'
    )
    comments_html: Final[str] = comments_section.get_attribute("innerHTML")
    driver.quit()  # close the browser once the HTML has been captured

    # parse the HTML content with BeautifulSoup
    soup: BeautifulSoup = BeautifulSoup(markup=comments_html, features="html.parser")
    comments = [
        comment.text
        for comment in soup.find_all(
            name="yt-formatted-string",
            attrs={"class": "style-scope ytd-comment-renderer"}
        )
    ]
    return comments


video_comments = get_yt_comments(vid_url, pages=8, min_sleep_sec=0.5)



In [101]:
len(video_comments)

100

We got 100 comments, which is a small sample but a reasonable start.
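The scraper above scrolls to fixed pixel offsets. One possible variation, sketched here, scrolls to the current bottom of the page and stops once the page height no longer grows. The helper name `scroll_until_stable` and its parameters are hypothetical, it assumes `document.documentElement.scrollHeight` increases as more comments load, and nothing here is confirmed to lift the 100-comment limit.

```python
import time
from typing import Any


def scroll_until_stable(driver: Any, max_pages: int = 20, pause_sec: float = 1.0) -> int:
    """Scroll to the page bottom until its height stops growing.

    Returns the number of scrolls after which the page height increased.
    """
    height_script = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(height_script)
    loaded_pages = 0
    for _ in range(max_pages):
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);"
        )
        time.sleep(pause_sec)  # give the page time to load more comments
        new_height = driver.execute_script(height_script)
        if new_height == last_height:
            break  # height unchanged, so assume nothing new was loaded
        last_height = new_height
        loaded_pages += 1
    return loaded_pages
```

Any object with an `execute_script` method works here, so a Selenium `WebDriver` could be passed in directly after `driver.get(vid_url)`.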

Let's load a hand-picked glossary of terms that map pop-culture terminology to over-arching topics.

Here is what one entry looks like:

```toml
neutrino = ["neutrino", "oscillation", "ghost", "mass"]
```

The topic 'neutrino' is often associated with itself and other words like 'oscillation', 'mass', and 'ghost' as in ghost particle.
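To illustrate the glossary layout, the sketch below parses a small TOML string and inverts it into a word-to-topics lookup. Only the `neutrino` entry comes from the glossary; the `quark` entry and the names `sample_toml`, `sample_glossary`, and `word_to_topics` are assumptions made for illustration. It falls back to the standard-library `tomllib` (Python 3.11+) if `tomli` is not installed.

```python
from collections import defaultdict
from typing import Dict, List

try:
    import tomli as toml_reader  # third-party parser used in this notebook
except ModuleNotFoundError:
    import tomllib as toml_reader  # standard library in Python 3.11+

sample_toml = """
neutrino = ["neutrino", "oscillation", "ghost", "mass"]
quark = ["quark", "up", "down"]  # hypothetical entry for illustration
"""

sample_glossary: Dict[str, List[str]] = toml_reader.loads(sample_toml)

# Invert the mapping: each glossary word points to the topics that list it
word_to_topics: Dict[str, List[str]] = defaultdict(list)
for topic, words in sample_glossary.items():
    for word in words:
        word_to_topics[word].append(topic)

print(word_to_topics["ghost"])
```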

In [102]:
with Path("data/dict.toml").open(mode="rb") as fp:
    physics_glossary = tomli.load(fp)


In [103]:
nlp_web = spacy.load("en_core_web_lg")

For spaCy to make similarity calculations using its inner-product space embedding, each glossary word must first be converted into a spaCy Doc.
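By default, spaCy's `similarity` method returns the cosine similarity of the two word vectors. A minimal sketch of that computation in plain Python follows; the function name `cosine_similarity` is hypothetical and the 3-dimensional vectors are made up (the `en_core_web_lg` vectors have 300 dimensions).

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for word embeddings (values are made up)
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0.0: orthogonal
```

Because the measure depends only on direction, a word and its plural can score close to 1 even when their vectors differ in length.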

In [104]:
physics_glossary_nlp: Dict[str, List[spacy.tokens.doc.Doc]] = {
    key: [nlp_web(val) for val in values]
    for key, values in physics_glossary.items()
}
type(physics_glossary_nlp["quark"][0])

spacy.tokens.doc.Doc

This algorithm attempts to match comment tokens against the loaded pop-culture glossary. The similarity threshold was picked so that the words 'quark' and 'quarks' count as similar. Another metric worth including might be the edit distance (the number of insertions, deletions, and substitutions) between words.
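A minimal sketch of the edit-distance idea, using the standard Levenshtein dynamic program. The function name `edit_distance` is hypothetical, and this is not part of the notebook's pipeline.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous_row[j] is the distance between the processed prefix of
    # source and target[:j]; it starts as the distance from "" to target[:j]
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current_row = [i]  # distance from source[:i] to the empty string
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current_row.append(
                min(
                    previous_row[j] + 1,  # delete source_char
                    current_row[j - 1] + 1,  # insert target_char
                    previous_row[j - 1] + substitution_cost,
                )
            )
        previous_row = current_row
    return previous_row[-1]


print(edit_distance("quark", "quarks"))  # 1
print(edit_distance("neutrino", "neutrinos"))  # 1
print(edit_distance("kitten", "sitting"))  # 3
```

A distance of 1 for singular and plural forms suggests a small cutoff could complement the vector similarity threshold, though the right cutoff would need tuning on real comments.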

The algorithm assumes that a single comment asks at most one question, or holds one discussion, with a single over-arching topic. When a comment matches several topics, the most frequent one is chosen with Counter.most_common().
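A short sketch of how `Counter.most_common()` picks a topic. The key lists are hypothetical examples, not taken from the scraped comments.

```python
from collections import Counter

# Hypothetical topic matches collected from one comment
comment_keys = ["quark", "neutrino", "quark", "force"]
top_key, top_count = Counter(comment_keys).most_common(1)[0]
print(top_key, top_count)  # quark 2

# On a tie, most_common keeps first-encountered order
tied_keys = ["neutrino", "quark"]
print(Counter(tied_keys).most_common(1)[0][0])  # neutrino
```

One consequence is that a tie goes to whichever topic was matched first in the comment.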

In [105]:
min_required_similarity = 0.80
key_counts: Dict[str, int] = defaultdict(int)
"""Number of comments whose most common matched topic is each key"""
# Get comment tokens that are alphabetic
for comment in video_comments:
    comment_keys: List[str] = []
    for token in nlp_web(comment.lower()):
        if len(token.text) < 2:
            continue
        if not token.is_alpha:
            continue
        # Find the most discussed topic in the comment using the glossary
        for key, glossary_tokens in physics_glossary_nlp.items():
            for glossary_token in glossary_tokens:
                store_key = False
                temp_similarity = token.similarity(glossary_token)
                if temp_similarity > min_required_similarity:
                    store_key = True
                # Other logic here if necessary
                if store_key is True:
                    print(token.text, glossary_token.text, temp_similarity)
                    comment_keys.append(key)
    if len(comment_keys) == 0:
        continue
    # Use most_common() method on Counter to get descending sorted key list and then key
    most_common_key = Counter(comment_keys).most_common()[0][0]
    key_counts[most_common_key] += 1



up up 1.0
up up 1.0
quantum quantum 1.0
up up 1.0
neutrinos neutrino 0.9374495466930591
neutrinos neutrino 0.9374495466930591
dark dark 1.0
top top 1.0
quark quark 1.0
bottom bottom 1.0
force force 1.0
force force 1.0
quark quark 1.0
annihilate annihilate 1.0
quark quark 1.0
up up 1.0
quark quark 1.0
down down 1.0
quark quark 1.0
quantum quantum 1.0
quarks quark 0.8979578916015317
annihilate annihilate 1.0
pairs pairs 1.0
annihilate annihilate 1.0
pairs pairs 1.0
gravitational gravitational 1.0
gravitational oscillation 0.8086236853918781
fragmentation oscillation 0.8116312590820531
fragmentation dilation 0.806441297694348
quarks quark 0.8979578916015317
big big 1.0
quark quark 1.0
quark quark 1.0
dimension dimension 1.0
dimension dimension 1.0
quarks quark 0.8979578916015317
neutrinos neutrino 0.9374495466930591
universe universe 1.0
neutrino neutrino 1.0
quarks quark 0.8979578916015317
quark quark 1.0
antimatter antimatter 1.0
quark quark 1.0
annihilate annihilate 1.0
quarks quark 0.

In [106]:
key_counts

defaultdict(int,
            {'quark': 20,
             'quantum': 5,
             'neutrino': 11,
             'antimatter': 4,
             'cosmos': 3,
             'exotic': 2,
             'force': 7,
             'higgs': 1})

Quarks are the most discussed topic, accounting for 20 of the 53 comments that matched a glossary topic. Note that quarks are not leptons; the charged leptons are the electron, muon, and tau. Other comments cover neutral leptons like the neutrino and other physics like the Higgs boson.
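To read the counts as shares of matched comments, here is a small sketch. The counts are copied from the `key_counts` output above; the names `topic_counts` and `total_matched` are hypothetical.

```python
from typing import Dict

# Counts copied from the key_counts output above
topic_counts: Dict[str, int] = {
    "quark": 20, "quantum": 5, "neutrino": 11, "antimatter": 4,
    "cosmos": 3, "exotic": 2, "force": 7, "higgs": 1,
}

total_matched = sum(topic_counts.values())

# Sort topics from most to least discussed
ranked = sorted(topic_counts.items(), key=lambda item: item[1], reverse=True)
for topic, count in ranked:
    print(f"{topic:<12}{count:>3}  {count / total_matched:.1%}")
```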