<a href="https://colab.research.google.com/github/kkrusere/youTube-comments-Analyzer/blob/main/SA_on_YT_comments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Sentiment Analysis on YouTube Comments

- This notebook is for developing a sentiment analysis model for YouTube Comments

In [1]:
from google.colab import drive
import os

#mounting google drive
drive.mount('/content/drive')

########################################

#changing the working directory
os.chdir("/content/drive/MyDrive/NLP_Data")

!pwd


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/NLP_Data


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

import warnings
warnings.filterwarnings("ignore")

import requests
import json
import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# The functions below handle reading from and writing to a JSON file in the current working directory.

def save_to_json(data, filename):
    """
    Save data to a JSON file.

    Parameters:
    data (dict or list): The data to be saved.
    filename (str): The name of the file to save the data in.
    """
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)

def load_from_json(filename):
    """
    Load data from a JSON file.

    Parameters:
    filename (str): The name of the file to load the data from.

    Returns:
    dict or list: The data loaded from the JSON file.
    """
    with open(filename, 'r') as json_file:
        data = json.load(json_file)
    return data
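
Note: the comments in this dataset include non-ASCII text (for example Japanese and Cyrillic). The helpers above round-trip such text correctly, but `json.dump` escapes it as `\uXXXX` by default, which makes the file harder to inspect by hand. The sketch below is one possible variant, not part of the original notebook: the names `save_to_json_utf8`, `load_from_json_utf8`, and `comments_demo.json` are hypothetical, and the sample record is made up for illustration.

```python
import json
import os
import tempfile

def save_to_json_utf8(data, filename):
    """Hypothetical variant of save_to_json that writes human-readable UTF-8."""
    with open(filename, 'w', encoding='utf-8') as json_file:
        # ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes
        json.dump(data, json_file, indent=4, ensure_ascii=False)

def load_from_json_utf8(filename):
    """Hypothetical variant of load_from_json that reads the file as UTF-8."""
    with open(filename, 'r', encoding='utf-8') as json_file:
        return json.load(json_file)

# Illustrative record only; it is not taken from youtube_comments.json
sample = [{'comment_text': '皆笑って', 'like_count': '3.9K', 'reply_count': '1 reply'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'comments_demo.json')
    save_to_json_utf8(sample, path)
    round_tripped = load_from_json_utf8(path)
    # Read the raw file text to confirm the characters were stored unescaped
    with open(path, encoding='utf-8') as fh:
        raw_text = fh.read()
```

Passing an explicit `encoding` also avoids depending on the platform's default encoding when the notebook is run outside Colab.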


In [4]:
# Load the YouTube comments JSON file ('youtube_comments.json') from the current working directory
filename = 'youtube_comments.json'
youtube_comments = load_from_json(filename)
# convert the youtube_comments to a pandas dataframe

youtube_comments_df = pd.DataFrame(youtube_comments)

youtube_comments_df.head()

Unnamed: 0,comment_text,like_count,reply_count
0,A major obstacle to EV adoption that is always...,6K,507 replies
1,A major obstacle to EV adoption that is always...,6K,507 replies
2,"Prices are too high, and dealerships keep addi...",3.9K,216 replies
3,The government isn’t fast enough to patch poth...,89,6 replies
4,We have the coldest winters in many years here...,34,1 reply


In [5]:
# Function to convert count strings such as '6K', '3.9K', or '507 replies' to integers.
def convert_to_int(value):
    """
    - If the value is NaN or an empty string, return 0.
    - If the value is a string:
      - Extract the first number, including any decimal part (e.g. '3.9' in '3.9K').
      - If the number is followed by 'K', multiply it by 1,000.
      - If the number is followed by 'M', multiply it by 1,000,000.
      - If the number is followed by 'B', multiply it by 1,000,000,000.
      - If the string contains no digits, return 0.
    - Return the converted integer value.
    """
    if pd.isna(value) or value == '':
        return 0
    if isinstance(value, str):
        # Capture the number with its decimal part; joining digit runs would turn '3.9K' into 39,000
        match = re.search(r'(\d+(?:\.\d+)?)([KMB])?', value.replace(',', ''))
        if not match:
            return 0
        multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}
        return int(round(float(match.group(1)) * multipliers.get(match.group(2), 1)))
    return int(value)

In [6]:
# We are going to do a little bit of cleaning on the dataset

# Fill missing and empty values with 0
# (assign the result back instead of using inplace=True on a column selection,
# which may not modify the DataFrame under pandas copy-on-write)
for column in ['like_count', 'reply_count']:
    youtube_comments_df[column] = youtube_comments_df[column].replace('', 0).fillna(0)

# Convert columns to integers
youtube_comments_df['like_count'] = youtube_comments_df['like_count'].apply(convert_to_int)
youtube_comments_df['reply_count'] = youtube_comments_df['reply_count'].apply(convert_to_int)

In [7]:
# Shuffle the rows of the DataFrame `youtube_comments_df` randomly with a fixed seed for reproducibility.
# Reset the index of the DataFrame to be sequential and drop the old index column.
# Display the first few rows of the shuffled DataFrame.
youtube_comments_df = youtube_comments_df.sample(frac=1, random_state=50).reset_index(drop=True)
youtube_comments_df.head()


Unnamed: 0,comment_text,like_count,reply_count
0,Don't work with Prius Prime 2022...,0,0
1,"He's right, people get tired and get used to b...",0,0
2,,0,0
3,I need that shirt. Hand it over.,0,0
4,Raise you show all USA military equipment,1,0


Note: We will be revisiting the text cleaning and preparation of the `youtube_comments` dataset.
Because the dataset may contain comments in different languages, special characters, and non-alphanumeric symbols,
these need to be handled explicitly during preprocessing.

#### Below are some strategies we will use to improve the cleaning process:

1. **Language Detection**:
    - Identify and handle comments in various languages separately.
    - Utilize a language identification tool such as `langdetect` to determine the language of each comment (TextBlob's `detect_language` is deprecated).

2. **Character Removal**:
    - Remove or normalize special characters and non-alphanumeric symbols to ensure consistency.
    - Use regular expressions to filter out unwanted characters and retain only relevant text.

3. **Unicode Normalization**:
    - Normalize Unicode characters to handle different encodings and symbols.
    - Employ libraries like `unicodedata` to standardize text.

4. **Text Standardization**:
    - Convert text to a consistent case (e.g., lowercase) to ensure uniformity.
    - Remove extra whitespace and redundant characters.

5. **Language-Specific Processing**:
    - Apply language-specific preprocessing techniques for better accuracy, such as stemming or lemmatization.
    - Consider translation or transliteration if necessary for multilingual comments.

6. **Tokenization and Lemmatization**:
    - Tokenize text into words or phrases and apply lemmatization to reduce words to their base forms.

 Implementing these methods will enhance the quality of our text data, making it more suitable for analysis or model training.
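
Steps 2 to 4 of the list above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the cleaning function the notebook ends up using: the name `normalize_comment`, the choice of NFKC, the timestamp pattern, and the set of punctuation that is kept are all assumptions made for the example.

```python
import re
import unicodedata

# Video timestamps such as 2:20 or 1:02:33 that often get glued onto comment text
TIMESTAMP_RE = re.compile(r'\d{1,2}(?::\d{2}){1,2}')

def normalize_comment(text):
    """Hypothetical helper: Unicode normalization, character removal, and text standardization."""
    if not isinstance(text, str):
        return ''  # treat missing comments (NaN) as empty text
    # Unicode normalization: fold compatibility forms (e.g. full-width letters) into canonical ones
    text = unicodedata.normalize('NFKC', text)
    # Map curly quotes to their plain ASCII equivalents
    text = (text.replace('\u2018', "'").replace('\u2019', "'")
                .replace('\u201c', '"').replace('\u201d', '"'))
    # Character removal: drop timestamps, then anything that is not a word character,
    # whitespace, an apostrophe, or basic sentence punctuation
    text = TIMESTAMP_RE.sub(' ', text)
    text = text.lower()
    text = re.sub(r"[^\w\s'.,!?]", ' ', text)
    # Text standardization: collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r'\s+', ' ', text).strip()

cleaned_kick = normalize_comment("1:30THAT KICK THOUGH!")
cleaned_guns = normalize_comment("Don\u2019t you see it\u2019s getting worse \nTake our guns")
```

Because `\w` is Unicode-aware in Python 3, letters from other scripts survive this step, so language filtering can still be applied afterwards. Exclamation marks and question marks are kept on purpose, since they may carry sentiment.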


In [8]:
# Now we get a random sample of 250 comments and do some tests
sample_comments = youtube_comments_df.sample(n=250, random_state=48).reset_index(drop=True)
sample_comments.head()

Unnamed: 0,comment_text,like_count,reply_count
0,these thumbnails make it seem like a mass nucl...,0,0
1,I’d of like to have seen the snake get it’s mo...,0,0
2,I agree in the processed chesse. It's a chees...,0,0
3,the little girl is precious,0,0
4,I think we are all mixed with some race. We sh...,78,1


In [9]:
sample_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   comment_text  249 non-null    object
 1   like_count    250 non-null    int64 
 2   reply_count   250 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 6.0+ KB


In [10]:
%pip install langdetect



In [11]:
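# Iterate over the 'comment_text' column and print every comment that langdetect does not classify as English

from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; fix the seed so repeated runs flag the same comments
DetectorFactory.seed = 0

print("[")
for comment in sample_comments['comment_text']:
  if not isinstance(comment, str):
    continue  # skip missing comments (NaN)
  try:
    language = detect(comment)
    if language != 'en':
      print(f"'{comment}',")
  except LangDetectException:
    pass  # no detectable features (e.g. emoji-only or numeric comments)
print("]")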

[
'I believe in:
                     
                     
                     








THESE NUTS!!!',
'Mom it’s not what it looks like!',
'皆笑って終われるドッキリは見てる私も笑顔になれるよ',
'Stinger',
'Very nice.',
'Fantastic!',
'Sheeeesh!!!!!',
'EWING',
'DITTO ,!',
'понятно, тачка для темнокожих',
'Favorite try video to date...  Laughed until I almost cried lol',
'Don’t you see it’s getting worse 
Take our guns',
'2:20Donald trump vomits during helicopter ride',
'Beautiful',
'Absolutely!!',
'Nissan Maxima?',
'True master',
'haa hey saan karabaa adiga lee waaye',
'1:30THAT KICK THOUGH!',
'Bravo majstor diza to',
'Minute13:33guy',
'amazing episode',
'Video starts at2:48',
'Pregnancy',
'16:50it looks like a poorly rendered video game model',
'WOW',
'Man, Will Ferrell is
 so damn funny',
'️I love you all',
'póngase guantes amigo q asco .. eso se lo come solo el q lo hace',
'He looks like gregg the hammer valentine',
'tuff old birds',
'I'm listening bro!',
'i8:has a 1.5l engineMe: EW GROSS',
'Schön   verars

In [13]:
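# Iterate over the 'comment_text' column and print every comment that is not detected as English, with its index

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

for i, comment in enumerate(sample_comments['comment_text']):
  if not isinstance(comment, str):
    continue  # skip missing comments (NaN)
  try:
    language = detect(comment)
    if language != 'en':
      print(f"{i} --- {comment}")
  except LangDetectException:
    pass  # langdetect found no usable features in this comment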


15 --- I believe in:
                     
                     
                     








THESE NUTS!!!
16 --- Mom it’s not what it looks like!
18 --- 皆笑って終われるドッキリは見てる私も笑顔になれるよ
21 --- Stinger
29 --- Very nice.
34 --- Fantastic!
39 --- Sheeeesh!!!!!
46 --- EWING
53 --- DITTO ,!
64 --- понятно, тачка для темнокожих
76 --- Favorite try video to date...  Laughed until I almost cried lol
85 --- Don’t you see it’s getting worse 
Take our guns
87 --- Beautiful
105 --- Absolutely!!
107 --- Nissan Maxima?
117 --- True master
118 --- haa hey saan karabaa adiga lee waaye
125 --- 1:30THAT KICK THOUGH!
144 --- Bravo majstor diza to
148 --- Minute13:33guy
149 --- amazing episode
155 --- Video starts at2:48
158 --- Pregnancy
159 --- 16:50it looks like a poorly rendered video game model
168 --- WOW
187 --- Man, Will Ferrell is
 so damn funny
191 --- ️I love you all
193 --- póngase guantes amigo q asco .. eso se lo come solo el q lo hace
195 --- He looks like gregg the hammer valentine
220 --- I'm

In [14]:
# Iterate over the 'comment_text' column, print every comment that is not detected as English,
# and record its index so the row can be dropped from the DataFrame
indices_to_drop = []
for index, comment in enumerate(sample_comments['comment_text']):
  try:
    language = detect(comment)
    if language != 'en':
      print(f"Dropping comment: '{comment}'")
      indices_to_drop.append(index)
  except Exception:
    pass  # skip comments langdetect cannot process (e.g. NaN or emoji-only text)

# Drop the rows with non-English comments
sample_comments = sample_comments.drop(indices_to_drop).reset_index(drop=True)


Dropping comment: 'I believe in:
                     
                     
                     








THESE NUTS!!!'
Dropping comment: 'Mom it’s not what it looks like!'
Dropping comment: '皆笑って終われるドッキリは見てる私も笑顔になれるよ'
Dropping comment: 'Stinger'
Dropping comment: 'Very nice.'
Dropping comment: 'Fantastic!'
Dropping comment: 'Sheeeesh!!!!!'
Dropping comment: 'EWING'
Dropping comment: 'DITTO ,!'
Dropping comment: 'понятно, тачка для темнокожих'
Dropping comment: 'Don’t you see it’s getting worse 
Take our guns'
Dropping comment: 'Beautiful'
Dropping comment: 'Que the Banjo man'
Dropping comment: 'Absolutely!!'
Dropping comment: 'Nissan Maxima?'
Dropping comment: 'True master'
Dropping comment: 'haa hey saan karabaa adiga lee waaye'
Dropping comment: '1:30THAT KICK THOUGH!'
Dropping comment: 'Bravo majstor diza to'
Dropping comment: 'Minute13:33guy'
Dropping comment: 'amazing episode'
Dropping comment: 'Video starts at2:48'
Dropping comment: 'Pregnancy'
Dropping comment: '16:50it looks
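
Note: the output above shows that short English comments such as 'Very nice.', 'Fantastic!', and 'Beautiful' were dropped as non-English, and the set of flagged comments differs between runs. Statistical language detectors tend to be unreliable on very short text, and these short comments are often the ones with the clearest sentiment. The sketch below is one possible guard, not the notebook's method: `should_drop_as_non_english`, `latin_letter_ratio`, the 20-letter threshold, and the 0.5 script ratio are all assumptions chosen for illustration, and `detect_fn` stands in for any detector such as `langdetect.detect`.

```python
import re
import unicodedata

# Video timestamps such as 2:20 or 1:02:33, removed before judging the language
TIMESTAMP_RE = re.compile(r'\d{1,2}(?::\d{2}){1,2}')

def latin_letter_ratio(text):
    """Return the share of alphabetic characters that are Latin-script, or None if there are no letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None
    latin = sum(1 for ch in letters if 'LATIN' in unicodedata.name(ch, ''))
    return latin / len(letters)

def should_drop_as_non_english(text, detect_fn, min_letters=20):
    """Hypothetical guard: only trust detect_fn when the comment is long enough to judge."""
    if not isinstance(text, str):
        return False  # missing values are handled elsewhere
    cleaned = TIMESTAMP_RE.sub(' ', text)
    ratio = latin_letter_ratio(cleaned)
    if ratio is None:
        return False  # no letters at all (emoji, numbers): nothing to judge
    if ratio < 0.5:
        return True   # mostly non-Latin script, so not English regardless of length
    if sum(ch.isalpha() for ch in cleaned) < min_letters:
        return False  # too short for a statistical detector; keep the comment
    try:
        return detect_fn(cleaned) != 'en'
    except Exception:
        return False  # detector could not process the text; keep the comment

# Stand-in detector for demonstration only; in the notebook this would be langdetect.detect
def fake_detect(text):
    return 'es' if 'guantes' in text else 'en'

keep_short = should_drop_as_non_english('Very nice.', fake_detect)
drop_spanish = should_drop_as_non_english(
    'póngase guantes amigo q asco .. eso se lo come solo el q lo hace', fake_detect)
drop_japanese = should_drop_as_non_english('皆笑って終われるドッキリは見てる私も笑顔になれるよ', fake_detect)
```

The trade-off is that short non-English comments written in Latin script (for example 'Bravo majstor diza to') are kept; whether that is acceptable depends on how much noise the downstream sentiment model tolerates.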

In [15]:
# Iterate over the 'comment_text' column and print each comment
for i, comment in enumerate(sample_comments['comment_text']):
  print(f"{comment}, \n")

these thumbnails make it seem like a mass nuclear explosion happen, 

I’d of like to have seen the snake get it’s mouth  over his nose, 

I agree in the processed chesse.  It's a cheeseburger, 

the little girl is precious, 

I think we are all mixed with some race. We should love everyone., 

one can never be sure. yet.  i'm sure that this car is TOP QUALITY.. but i'm not crazy about the styling., 

4% of people given the death penalty are completely innocent. That should be enough to completely abolish it, considering all the support for it is motivated by anger over people unjustly killed. But in case it’s not enough for you for some reason, it’s also on average twice as expensive to enforce it on a convict than life in prison without parole is. This is due to the appeals court process, which typically takes around a decade or more. So enforcing the death penalty on people means way more of our tax dollars are being spent on horrible people than necessary. The literal only reason to

In [16]:
# Iterate over the 'comment_text' column, collect each comment with its positional index, and print it
index_list = []
comment_list = []
for i, comment in enumerate(sample_comments['comment_text']):
  index_list.append(i)
  comment_list.append(comment)
  print(f"{i} : {comment}, \n")

0 : these thumbnails make it seem like a mass nuclear explosion happen, 

1 : I’d of like to have seen the snake get it’s mouth  over his nose, 

2 : I agree in the processed chesse.  It's a cheeseburger, 

3 : the little girl is precious, 

4 : I think we are all mixed with some race. We should love everyone., 

5 : one can never be sure. yet.  i'm sure that this car is TOP QUALITY.. but i'm not crazy about the styling., 

6 : 4% of people given the death penalty are completely innocent. That should be enough to completely abolish it, considering all the support for it is motivated by anger over people unjustly killed. But in case it’s not enough for you for some reason, it’s also on average twice as expensive to enforce it on a convict than life in prison without parole is. This is due to the appeals court process, which typically takes around a decade or more. So enforcing the death penalty on people means way more of our tax dollars are being spent on horrible people than necessary

In [None]:
# Create a dictionary with the keys 'comment_text', 'sentiment', and 'sentiment_comment'

sentiment_dict = {'comment_text': [], 'sentiment': [], 'sentiment_comment': []}
