### Sentiment Analysis on YouTube Comments

- This notebook develops a sentiment analysis model for YouTube comments.

In [1]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

########################################

# Change the working directory
os.chdir("/content/drive/MyDrive/NLP_Data")

!pwd


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/NLP_Data


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

import warnings
warnings.filterwarnings("ignore")

import requests
import json
import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# The functions below handle reading from and writing to a JSON file in the current working directory.

def save_to_json(data, filename):
    """
    Save data to a JSON file.

    Parameters:
    data (dict or list): The data to be saved.
    filename (str): The name of the file to save the data in.
    """
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)

def load_from_json(filename):
    """
    Load data from a JSON file.

    Parameters:
    filename (str): The name of the file to load the data from.

    Returns:
    dict or list: The data loaded from the JSON file.
    """
    with open(filename, 'r') as json_file:
        data = json.load(json_file)
    return data
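
Because the comments contain non-ASCII text (other languages, curly quotes), it may help to pin the file encoding explicitly. The sketch below is a variant of the two helpers above. The names `save_json_utf8` and `load_json_utf8` are hypothetical, and the `encoding='utf-8'` and `ensure_ascii=False` settings are assumptions about what this dataset needs rather than part of the original notebook.

```python
import json
import os
import tempfile

def save_json_utf8(data, filename):
    """Hypothetical variant of save_to_json that writes UTF-8 explicitly."""
    with open(filename, 'w', encoding='utf-8') as json_file:
        # ensure_ascii=False keeps characters such as ’ readable in the file
        json.dump(data, json_file, indent=4, ensure_ascii=False)

def load_json_utf8(filename):
    """Hypothetical variant of load_from_json that reads UTF-8 explicitly."""
    with open(filename, 'r', encoding='utf-8') as json_file:
        return json.load(json_file)

# Round-trip check on a temporary file (assumed sample record)
sample = [{"comment_text": "The government isn’t fast enough", "like_count": "89"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.json")
    save_json_utf8(sample, path)
    restored = load_json_utf8(path)

print(restored == sample)
```

Without an explicit encoding, `open` falls back to the platform default, which can differ between a local machine and Colab.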


In [4]:
# Load the YouTube comments from 'youtube_comments.json' in the current working directory
filename = 'youtube_comments.json'
youtube_comments = load_from_json(filename)

# Convert the loaded comments to a pandas DataFrame
youtube_comments_df = pd.DataFrame(youtube_comments)

youtube_comments_df.head()

Unnamed: 0,comment_text,like_count,reply_count
0,A major obstacle to EV adoption that is always...,6K,507 replies
1,A major obstacle to EV adoption that is always...,6K,507 replies
2,"Prices are too high, and dealerships keep addi...",3.9K,216 replies
3,The government isn’t fast enough to patch poth...,89,6 replies
4,We have the coldest winters in many years here...,34,1 reply


In [5]:
# Convert count strings such as '6K', '3.9K', or '507 replies' to integers.
def convert_to_int(value):
    """
    - If the value is NaN or an empty string, return 0.
    - If the value is a string:
      - Extract the first number, including any decimal part (e.g. '3.9' from '3.9K').
      - If the number is followed by 'K', multiply it by 1,000.
      - If the number is followed by 'M', multiply it by 1,000,000.
      - If the number is followed by 'B', multiply it by 1,000,000,000.
      - Return 0 if the string contains no number.
    - Return the converted integer value.
    """
    if pd.isna(value) or value == '':
        return 0
    if isinstance(value, str):
        # Capture the number (with optional decimal part) and an optional suffix.
        # Extracting digits only would turn '3.9K' into 39 * 1000 instead of 3900.
        match = re.search(r'(\d+(?:\.\d+)?)([KMB])?', value.replace(',', ''))
        if not match:
            return 0
        multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}
        number, suffix = match.groups()
        return int(round(float(number) * multipliers.get(suffix, 1)))
    return int(value)
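
As a quick sanity check on suffix handling, the standalone sketch below parses the kinds of strings seen in the `head()` output (`'6K'`, `'3.9K'`, `'507 replies'`, `'1 reply'`). The helper name `parse_count` and the `MULTIPLIERS` table are hypothetical and exist only for this illustration. The point it demonstrates is that the decimal point has to be kept: a digits-only pattern would read `'3.9K'` as 39 before multiplying.

```python
import re

# Assumed suffix table for abbreviated counts
MULTIPLIERS = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}

def parse_count(text):
    """Hypothetical helper: parse strings such as '3.9K' or '507 replies'."""
    # Number with optional decimal part, then an optional K/M/B suffix
    match = re.search(r'(\d+(?:\.\d+)?)([KMB])?', text.replace(',', ''))
    if match is None:
        return 0
    number, suffix = match.groups()
    return int(round(float(number) * MULTIPLIERS.get(suffix, 1)))

examples = ['6K', '3.9K', '507 replies', '1 reply', '89', '']
parsed = {text: parse_count(text) for text in examples}
print(parsed)
```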

In [6]:
# Light cleaning of the dataset

# Replace empty strings and missing values with 0.
# Assigning the result back avoids calling inplace methods on a column selection.
for column in ['like_count', 'reply_count']:
    youtube_comments_df[column] = youtube_comments_df[column].replace('', 0).fillna(0)

# Convert columns to integers
youtube_comments_df['like_count'] = youtube_comments_df['like_count'].apply(convert_to_int)
youtube_comments_df['reply_count'] = youtube_comments_df['reply_count'].apply(convert_to_int)

In [7]:
# Shuffle the rows of `youtube_comments_df` with a fixed seed for reproducibility.
# Reset the index so it is sequential again, discarding the old index.
# Display the first few rows of the shuffled DataFrame.
youtube_comments_df = youtube_comments_df.sample(frac=1, random_state=42).reset_index(drop=True)
youtube_comments_df.head()


Unnamed: 0,comment_text,like_count,reply_count
0,English please,0,0
1,I still fucking love this so much. Eat worm sh...,3,0
2,ngl i do enjoy the action stuff w smaug even i...,0,0
3,LeLowGear,0,0
4,This guys dish has more forks than his family ...,0,0


Note: We will revisit the text cleaning and preparation of the `YouTube_comments` dataset.
Because the dataset may contain comments in different languages, special characters, and non-alphanumeric symbols,
preprocessing must handle each of these cases explicitly.

#### Strategies we will use to improve the cleaning process:

1. **Language Detection**:
    - Identify and handle comments in various languages separately.
    - Utilize language identification tools like `langdetect` or `TextBlob` to determine the language of each comment.

2. **Character Removal**:
    - Remove or normalize special characters and non-alphanumeric symbols to ensure consistency.
    - Use regular expressions to filter out unwanted characters and retain only relevant text.

3. **Unicode Normalization**:
    - Normalize Unicode characters to handle different encodings and symbols.
    - Employ libraries like `unicodedata` to standardize text.

4. **Text Standardization**:
    - Convert text to a consistent case (e.g., lowercase) to ensure uniformity.
    - Remove extra whitespace and redundant characters.

5. **Language-Specific Processing**:
    - Apply language-specific preprocessing techniques for better accuracy, such as stemming or lemmatization.
    - Consider translation or transliteration if necessary for multilingual comments.

6. **Tokenization and Lemmatization**:
    - Tokenize text into words or phrases and apply lemmatization to reduce words to their base forms.

Implementing these methods will improve the quality of our text data, making it more suitable for analysis and model training.
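
Strategies 2 to 4 from the list above can be sketched with the standard library alone. The function name `clean_text`, the choice of NFKC normalization, and the decision to keep apostrophes are all assumptions made for illustration, not settings taken from this notebook.

```python
import re
import unicodedata

def clean_text(text):
    """Hypothetical helper sketching normalization, character removal, and standardization."""
    # Unicode normalization: NFKC folds compatibility forms (e.g. full-width letters, '…')
    text = unicodedata.normalize('NFKC', text)
    # Map curly quotes to straight ones so contractions such as "isn't" stay intact
    text = text.replace('’', "'").replace('‘', "'").replace('“', '"').replace('”', '"')
    # Text standardization: lowercase
    text = text.lower()
    # Character removal: keep word characters, whitespace, and apostrophes
    text = re.sub(r"[^\w\s']", ' ', text)
    # Collapse runs of whitespace left behind by the removal step
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text("Giant   mochi!!!!!"))
print(clean_text("The government isn’t FAST enough…"))
```

In Python 3, `\w` matches letters from any script, so this sketch leaves non-English words in place; filtering by language remains a separate step.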


In [8]:
# Now we get a random sample of 200 comments and do some tests
sample_comments = youtube_comments_df.sample(n=200, random_state=42).reset_index(drop=True)
sample_comments.head()

Unnamed: 0,comment_text,like_count,reply_count
0,"It might be possible to power EVs using solar,...",0,0
1,Imagine the smell of this factory,29,3
2,"i went for a history degree, and I still had t...",0,0
3,Wow They really made a 2 minute vid into a mess.,0,0
4,What if: instead of upgrading the ships protec...,0,0


In [9]:
sample_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   comment_text  200 non-null    object
 1   like_count    200 non-null    int64 
 2   reply_count   200 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ KB


In [10]:
%pip install langdetect



In [11]:
# Print every comment in 'comment_text' that langdetect does not classify as English

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

print("[")
for comment in sample_comments['comment_text']:
    try:
        language = detect(comment)
        if language != 'en':
            print(f"'{comment}',")
    except LangDetectException:
        # Raised when the text has no detectable features (e.g. only digits or symbols)
        pass
print("]")

[
'Jajaja I Go There Para Ponerme Mas JovenJajajajajaja',
'Giant mochi!!!!!',
'HELTER SKELTER',
'[Arabic-script comment, truncated]',

In [13]:
# Print each non-English comment in 'comment_text' together with its row position

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

for i, comment in enumerate(sample_comments['comment_text']):
    try:
        language = detect(comment)
        if language != 'en':
            print(f"{i} --- {comment}")
    except LangDetectException:
        # Skip comments with no detectable language features
        pass


11 --- Jajaja I Go There Para Ponerme Mas JovenJajajajajaja
13 --- Giant mochi!!!!!
14 --- HELTER SKELTER
16 --- [Arabic-script comment, truncated]

In [14]:
# Print each non-English comment in 'comment_text' and drop its row from the DataFrame
from langdetect.lang_detect_exception import LangDetectException

indices_to_drop = []
# .items() yields index labels, which is what DataFrame.drop expects
for index, comment in sample_comments['comment_text'].items():
    try:
        language = detect(comment)
        if language != 'en':
            print(f"Dropping comment: '{comment}'")
            indices_to_drop.append(index)
    except LangDetectException:
        # Keep comments whose language cannot be determined
        pass

# Drop the rows with non-English comments
sample_comments = sample_comments.drop(indices_to_drop).reset_index(drop=True)


Dropping comment: 'Jajaja I Go There Para Ponerme Mas JovenJajajajajaja'
Dropping comment: 'Giant mochi!!!!!'
Dropping comment: 'HELTER SKELTER'
Dropping comment: '[Arabic-script comment, truncated]'
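
The output above shows short English comments such as 'Giant mochi!!!!!' and 'HELTER SKELTER' being dropped along with the genuinely non-English ones; statistical language detectors tend to be unreliable on very short text. One possible guard, sketched below with the standard library only, is to check the script of the letters before trusting the detector. The function name `is_mostly_non_latin` and the 0.5 threshold are hypothetical, and this is a different heuristic swapped in for illustration, not a replacement for `langdetect`: it cannot tell English from Spanish, for example.

```python
import unicodedata

def is_mostly_non_latin(text, threshold=0.5):
    """Hypothetical helper: True when fewer than `threshold` of the letters are Latin-script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # No letters at all (digits, emoji, punctuation): nothing to judge
        return False
    # Unicode character names start with the script, e.g. 'LATIN CAPITAL LETTER G'
    latin = sum(unicodedata.name(ch, '').startswith('LATIN') for ch in letters)
    return latin / len(letters) < threshold

checks = {
    'Giant mochi!!!!!': is_mostly_non_latin('Giant mochi!!!!!'),
    'HELTER SKELTER': is_mostly_non_latin('HELTER SKELTER'),
    'مرحبا بالعالم': is_mostly_non_latin('مرحبا بالعالم'),
    '12345 !!!': is_mostly_non_latin('12345 !!!'),
}
print(checks)
```

A comment could then be dropped when the script check fires, while short Latin-script comments are kept for manual review instead of being removed on the detector's guess alone.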

In [15]:
# Iterate over the 'comment_text' column and print each comment
for i, comment in enumerate(sample_comments['comment_text']):
  print(f"{i}: ->  {comment}, \n")

0: ->  It might be possible to power EVs using solar, wind,  and natural power., 

1: ->  Imagine the smell of this factory, 

2: ->  i went for a history degree, and I still had to take psych 101, 

3: ->  Wow They really made a 2 minute vid into a mess., 

4: ->  What if: instead of upgrading the ships protection torwards radiation, we increase human resistence torwards radiation, we probably need that anyway if we want to life on other planets. I don't know if It's possible but I think there are some living creatures that are resistent torwards it if we figure out why we might be able to apply it to humans., 

5: ->  Those mirrors look obnoxiously bad. It's like the bronco is trying to be a moose like ram trucks but a Dumbo the elephant version, 

6: ->  Great video. Btw It would be cool if you added motion effects to the “skull playing chess/pong, drinking tea” picture. Like the clouds in the sky moving, steam from the tea moving, the pong game etc., 

7: ->  Damn Colby why you