## Topic modelling and sentiment analysis of comments on lo-fi hip hop videos on YouTube (Part 1)

Part 1 of this project scrapes the comments from some popular lo-fi videos and builds a dataset to be analyzed in later parts. The scraping code was adapted from this article: [How to Scrape Youtube Comments with Python](https://towardsdatascience.com/how-to-scrape-youtube-comments-with-python-61ff197115d4).

### required packages

In [None]:
!pip install selenium
!pip install langdetect
!python -m spacy download en_core_web_md | grep -v 'already satisfied'

In [1]:
import re
import glob
import time
import pandas as pd
from tqdm import tqdm
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from langdetect import detect

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
import spacy
nlp = spacy.load("en_core_web_md")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.




In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


I also downloaded [ChromeDriver](https://chromedriver.chromium.org/downloads) and put the executable in the same directory as this notebook.

### scraping

I scraped 12,391 comments from about 15 videos, mostly by [Lofi Girl](https://www.youtube.com/channel/UCSJ4gkVC6NrvII8umztf0Ow) and [Feardog Music](https://www.youtube.com/c/FeardogMusic).

The function below scrapes the comments for one video and saves them as a CSV file named after the video ID.

In [None]:
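def scrape_comments(video_id):
    # Note: `executable_path` is deprecated in Selenium 4; newer releases expect a Service object instead.
    with Chrome(executable_path=r'/Users/mahotaka/youtube_scraping/chromedriver') as driver:
        data = []
        wait = WebDriverWait(driver, 15)
        driver.get("https://youtu.be/{}".format(video_id))

        # Scroll to the bottom repeatedly so that YouTube lazy-loads more comments.
        for _ in range(200):
            wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.END)
            time.sleep(15)

        # Collect the text of every element with id "content"; this also picks up
        # some non-comment page elements, which are filtered out later.
        for comment in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#content"))):
            data.append(comment.text)

        # Drop empty entries and write one CSV per video.
        df = pd.DataFrame(data, columns=['comment'])
        df = df.loc[df['comment'] != '']
        df.to_csv('./{}.csv'.format(video_id), index=False)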

In [None]:
scrape_comments('_tV5LEBDs7w')

  with Chrome(executable_path=r'/Users/mahotaka/youtube_scraping/chromedriver') as driver:
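The fixed loop above scrolls 200 times with a 15-second pause, which is roughly 50 minutes per video even when the comments run out early. A possible alternative, sketched here under assumptions, is to stop once the page height no longer grows. The helper name `scroll_until_stable` and its parameters are hypothetical and not part of the original notebook; it relies only on Selenium's `execute_script` method.

```python
import time

def scroll_until_stable(driver, pause=2.0, max_rounds=200, sleep=time.sleep):
    """Scroll down until the page height stops changing or max_rounds is hit.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scrolls performed.
    """
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
        rounds += 1
        sleep(pause)  # give the page time to load the next batch of comments
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, so stop early
        last_height = new_height
    return rounds
```

An unchanged height can also mean the page was simply slow to respond, so a short `pause` may end the loop too soon; the value would need tuning against real pages.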


Combine the per-video comment files, drop duplicate comments, and save the combined dataset.

In [None]:
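# Read every per-video CSV and stack them into one DataFrame.
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.)
frames = [pd.read_csv(f) for f in glob.glob("./*.csv")]
df = pd.concat(frames, ignore_index=True)

# Keep the first copy of each comment and drop scraped page text that is not a comment.
df = df.drop_duplicates(subset='comment', keep="first")
df = df.loc[df['comment'].str.contains('SKIP NAVIGATION') == False]
df.reset_index(drop=True, inplace=True)
print(len(df))

df.to_csv('./comments.csv', index=False)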
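One thing to watch with the cell above: the pattern `./*.csv` also matches `comments.csv` once it has been written, so rerunning the cell reads the combined file back in alongside the per-video files. `drop_duplicates` hides the effect, but it is cleaner to skip the output file explicitly. A minimal sketch, assuming a hypothetical helper named `list_input_csvs`:

```python
from pathlib import Path

def list_input_csvs(directory=".", exclude=("comments.csv",)):
    """Return the CSV files in `directory`, skipping file names listed in `exclude`."""
    return sorted(
        path for path in Path(directory).glob("*.csv")
        if path.name not in exclude
    )
```

The returned paths can be passed straight to `pd.read_csv`. Other derived files saved in the same directory can be added to `exclude` in the same way.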

### pre-processing
I will tidy up the dataset to make it suitable for further processing.

#### remove non-English comments

In [5]:
df = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/comments.csv')
df = df.iloc[:, 1:]
len(df)

12391

In [5]:
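tqdm.pandas()

def detect_en(x):
  # Return the language code guessed by langdetect, or 'n/a' when detection
  # fails (for example on comments with no detectable text).
  try:
    return detect(x)
  except Exception:
    return 'n/a'

# Despite its name, the 'English' column holds a language code such as 'en'.
df['English'] = df['comment'].progress_apply(detect_en)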

100%|██████████| 12391/12391 [01:35<00:00, 129.51it/s]


In [6]:
df_en = df.loc[df['English'] == 'en']
len(df_en)

10521

Save the English-only dataset.

In [6]:
df = df_en.drop(['English'], axis=1)
df.to_csv('./comments_en.csv')

#### make letters lower case, and remove stopwords & punctuation

In [9]:
df = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/comments_en.csv')
df = df.iloc[:, 1:]
df.head()

Unnamed: 0,comment
0,✔️ | This music is free to use in your livestr...
1,January is half-way done and it is time for an...
2,study girl has such a chill life these days
3,We're actually planning a lofi sound bath! Tha...
4,"If you’re trying to rest, put your device away..."


In [12]:
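tqdm.pandas()

# Build the stopword set once; calling stopwords.words() for every word is slow.
stop_words = set(stopwords.words('english'))

def process(text):
    # Strip ASCII punctuation, lower-case each word, and drop English stopwords.
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    clean = ' '.join(word.lower() for word in nopunc.split() if word.lower() not in stop_words)
    return clean

df['comment'] = df['comment'].progress_apply(process)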

100%|██████████| 10521/10521 [00:26<00:00, 394.81it/s]
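To make the effect of this step concrete, here is a small self-contained sketch of the same idea. It uses `str.translate` instead of a character loop and a tiny hypothetical stopword set in place of NLTK's English list, so the names `clean_text` and `demo_stop_words` are illustrative only.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text, stop_words):
    """Strip ASCII punctuation, lower-case, and drop words found in `stop_words`."""
    words = text.translate(_PUNCT_TABLE).split()
    return ' '.join(w.lower() for w in words if w.lower() not in stop_words)

# A tiny hypothetical stopword set, standing in for NLTK's English list.
demo_stop_words = {"if", "to", "has", "such", "a", "these"}

print(clean_text("Study girl has such a chill life these days!", demo_stop_words))
# → study girl chill life days
print(clean_text("If you’re trying to rest", demo_stop_words))
# → you’re trying rest
```

The second call shows a limitation shared with the cell above: `string.punctuation` covers ASCII characters only, so the curly apostrophe in "you’re" is left in place.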


Save the cleaned dataset.

In [14]:
df.to_csv('./comments_en_clean.csv')

That's it for Part 1!