# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 3: Web APIs & NLP

--- 

*Group 3* | *Team Members: Constance, Wenzhe, Matthew, Joel*

### <b> Notebook 3: Sentiment Analysis using VADER </b>

<b> (a) Overview of Notebook 3 </b>

In this notebook, we use NLTK's `SentimentIntensityAnalyzer` (VADER) to analyze the sentiment of the comments from our two selected subreddits.

We score the comments without further preprocessing because we want to preserve their original sentiment: preprocessing steps such as stemming would alter the words, and the original sentiment would be lost. Based on our background research, the VADER lexicon is robust enough to interpret text "as-is", so we pass in the comments directly.
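
To illustrate the concern about stemming, here is a minimal sketch using a toy lexicon and a naive suffix stripper. Both are hypothetical stand-ins written for this example (the valence numbers are made up, and this is neither the VADER lexicon nor a real stemmer); the point is only that a lexicon keyed on full words can no longer match their stemmed forms.

```python
# Toy illustration only: hypothetical lexicon values, not taken from VADER
TOY_LEXICON = {"amazing": 3.0, "lovely": 2.5, "horrible": -2.5}

def naive_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ly", "ible"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lexicon_hits(words):
    """Return the words that the toy lexicon is able to score."""
    return [w for w in words if w in TOY_LEXICON]

comment = "this tea is amazing and lovely".split()
stemmed = [naive_stem(w) for w in comment]

print(lexicon_hits(comment))  # ['amazing', 'lovely']
print(lexicon_hits(stemmed))  # [] because 'amaz' and 'love' are not lexicon keys
```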

<br>

<b> (b) Structure of Notebook 3 </b>
* Use SentimentIntensityAnalyzer for Sentiment Analysis



---

### Import Libraries & Read Cleaned Data

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sentiment analysis import
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [3]:
# Read the filtered, cleaned coffee and tea datasets
df_coffee = pd.read_csv("data/coffee_comments_clean_merged_filtered.csv")
df_tea    = pd.read_csv("data/tea_comments_clean_merged_filtered.csv")

In [4]:
df_coffee.head()

Unnamed: 0,thread_id,comment_id,comment_text,comment_score,author_name,id,title,score,num_comments,post_hint,self_text,author_name_thread,url,url_is_media,comment_score_top_or_bottom_4
0,19agk2c,kikp8lv,Hi! I’m a morning coffee drinker but I never m...,2,testingpage2025,19agk2c,[MOD] The Daily Question Thread,2,58,,\n\nWelcome to the daily [/r/Coffee](https://...,menschmaschine5,https://www.reddit.com/r/Coffee/comments/19agk...,0,1
1,19agk2c,kikxhmz,Any Scooters employees around? I need to know ...,2,Ok_Bet_2634,19agk2c,[MOD] The Daily Question Thread,2,58,,\n\nWelcome to the daily [/r/Coffee](https://...,menschmaschine5,https://www.reddit.com/r/Coffee/comments/19agk...,0,1
2,19agk2c,kil6vl0,I can't find a pumpkin spice latte recipe that...,1,automirage04,19agk2c,[MOD] The Daily Question Thread,2,58,,\n\nWelcome to the daily [/r/Coffee](https://...,menschmaschine5,https://www.reddit.com/r/Coffee/comments/19agk...,0,1
3,19agk2c,kilaorz,Hey all! I live in Costa Rica and regularly bu...,1,chuvakinfinity,19agk2c,[MOD] The Daily Question Thread,2,58,,\n\nWelcome to the daily [/r/Coffee](https://...,menschmaschine5,https://www.reddit.com/r/Coffee/comments/19agk...,0,1
4,19agk2c,kilrgdc,Hi everyone! Any recommendations for pour over...,1,exposinglikeshane,19agk2c,[MOD] The Daily Question Thread,2,58,,\n\nWelcome to the daily [/r/Coffee](https://...,menschmaschine5,https://www.reddit.com/r/Coffee/comments/19agk...,0,1



---

### Using VADER to derive sentiment analysis for coffee and tea

In [8]:
# instantiate Sentiment Intensity Analyzer
sentiment_analyzer = SentimentIntensityAnalyzer()

<b> (a) Defining a sentiment score calculation function </b> 

In [9]:
# create a function to analyze the sentiment score of each row in a df column and return the overall average sentiment scores for the entire df
from statistics import mean

def sentiment_score(df, column):
    neg_score = []
    neu_score = []
    pos_score = []
    comp_score = []

    for row in df[column]:
        # polarity_scores returns a dictionary: {'neg': value, 'neu': value, 'pos': value, 'compound': value}
        score = sentiment_analyzer.polarity_scores(str(row))

        # append each row's score into the respective component
        neg_score.append(score["neg"])
        neu_score.append(score["neu"])
        pos_score.append(score["pos"])
        comp_score.append(score["compound"])

    # calculate the overall average score for each component once, after all rows have been scored, and return it
    average_score = [f"neg score: {mean(neg_score)}, neu score: {mean(neu_score)}, pos score: {mean(pos_score)}, compound score: {mean(comp_score)}"]
    return average_score
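
The averaging step can also be sketched as a small helper that returns numbers instead of a formatted string, which makes the results easier to compare or tabulate. This is a sketch under assumptions, not the method used in this notebook: `average_sentiment` and its `scorer` parameter are hypothetical names, and `scorer` is assumed to be any callable that returns a dictionary with `neg`, `neu`, `pos` and `compound` keys, such as `sentiment_analyzer.polarity_scores`.

```python
from statistics import mean

COMPONENTS = ("neg", "neu", "pos", "compound")

def average_sentiment(texts, scorer):
    """Average each sentiment component over an iterable of texts.

    `scorer` is assumed to map a string to a dict with the keys in COMPONENTS.
    Entries that are not strings (e.g. missing values) are skipped, not scored.
    """
    scores = [scorer(text) for text in texts if isinstance(text, str)]
    if not scores:
        raise ValueError("no text to score")
    # One rounded average per component, e.g. {'neg': ..., 'neu': ..., ...}
    return {key: round(mean(s[key] for s in scores), 3) for key in COMPONENTS}
```

In this notebook it could be called as `average_sentiment(df_tea["comment_text"], sentiment_analyzer.polarity_scores)`. Skipping non-string entries is a design choice: `str(row)` turns a missing comment into the literal text "nan", which would then be scored like any other comment.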

<b> (b) Invoke the function on the subreddits </b> 

In [10]:
# invoke function on df_tea
sentiment_score(df_tea, "comment_text")

['neg score: 0.03949098532494759, neu score: 0.7836641509433963, pos score: 0.1753754716981132, compound score: 0.3475005031446541']

In [11]:
# invoke function on df_coffee
sentiment_score(df_coffee, "comment_text")

['neg score: 0.0417536170212766, neu score: 0.8315363829787235, pos score: 0.1265008510638298, compound score: 0.3626987234042553']
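
Averages can hide how the individual comments are distributed. As a possible follow-up, each comment's compound score could be bucketed into a label. The sketch below assumes the commonly cited VADER thresholds of 0.05 and -0.05; `label_compound` is a hypothetical helper, and the sample scores are made up rather than taken from our datasets.

```python
from collections import Counter

def label_compound(compound, threshold=0.05):
    """Bucket a compound score into positive, negative or neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up compound scores for illustration only
sample_scores = [0.62, 0.0, -0.31, 0.04, 0.88]
label_counts = Counter(label_compound(score) for score in sample_scores)
print(label_counts)
```

Comparing the share of each label between the two subreddits would show whether a similar average comes from many mildly positive comments or from a few strongly positive ones.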

Summary:

The average sentiment scores for each subreddit are as follows:

|Type|Negative score|Neutral score|Positive score|Compound score|
|---|---|---|---|---|
|Tea|0.040|0.784|0.176|0.348|
|Coffee|0.042|0.832|0.127|0.363|

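Based on the results, we can make the following observations:
* comments in both subreddits are predominantly neutral, and coffee comments are the more neutral of the two (neutral score of 0.832 vs. 0.784 for tea)
* tea drinkers tend to post more positive comments than coffee drinkers (positive score of 0.176 vs. 0.127)
* coffee comments nevertheless have a slightly higher average compound score (0.363 vs. 0.348). VADER computes the compound score by summing the valence of the words in a comment and normalizing that sum to a value between -1 and 1, not from the negative, neutral and positive proportions, so the higher neutral score of coffee drinkers does not by itself explain this. The gap is small, and it could possibly stem from differences in comment length or word intensity between the two subreddits, which we have not tested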
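
For reference, the VADER reference implementation derives the compound score by summing the valence of the words in a text and normalizing the sum with x / sqrt(x² + α), where α = 15 by default. The sketch below reimplements only that normalization step. The valence lists are hypothetical numbers chosen for illustration, and the real analyzer also applies rules for negation, punctuation and capitalization before this step.

```python
import math

def normalize(valence_sum, alpha=15):
    """Map an unbounded valence sum into the range [-1, 1]."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

# Hypothetical word valences; neutral words contribute 0 to the sum
short_comment = [1.9]
with_neutral_words = [1.9, 0.0, 0.0, 0.0]
longer_positive_comment = [1.9, 1.9, 1.9]

for name, valences in [
    ("short", short_comment),
    ("with neutral words", with_neutral_words),
    ("longer positive", longer_positive_comment),
]:
    print(name, round(normalize(sum(valences)), 4))
```

Neutral words add nothing to the sum, so they leave the compound score unchanged, while additional sentiment-bearing words push it further from zero.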