# Sentiment Analysis on Comments for Reddit Posts
Using TextBlob (which builds on nltk), I will attempt to quantify the sentiment of all comments on a post. The approach can then be extended to multiple posts or entire subreddits, which is useful for tracking how a community's mood changes over time.

For example, gauging sentiment in political subreddits over time, or in the lead-up to an election, could hint at whether one candidate has an edge over another.

In [1]:
import nltk
import praw
import pandas as pd
import datetime
import json
import numpy as np
from textblob import TextBlob
import readability

In [2]:
# Load the credfile and log which file is in use
credfile = 'credfile.json'
credfile_prefix = ''

# Read credentials into a dictionary
with open(credfile) as fh:
    creds = json.load(fh)

print(f"[{datetime.datetime.now()}]" + f"{credfile} {'.' * 10} is being used as credfile")

[2020-07-23 11:00:05.568652]credfile.json .......... is being used as credfile
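
The notebook does not show the contents of `credfile.json`. Judging from the keys read in the `praw.Reddit(...)` call, it presumably holds `client_id`, `client_secret`, and `user_agent`. A small check like the sketch below can catch an incomplete file before PRAW is called; the helper name `missing_cred_keys` is hypothetical and the values are placeholders.

```python
# Keys that the praw.Reddit(...) call in this notebook reads from the credfile.
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent")


def missing_cred_keys(creds):
    """Return the required keys that are absent or empty in the creds dict."""
    return [key for key in REQUIRED_KEYS if not creds.get(key)]


# Placeholder values for illustration only (not real credentials).
example = {"client_id": "abc", "client_secret": "", "user_agent": "my-script/0.1"}
print(missing_cred_keys(example))  # ['client_secret']
```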


In [3]:
reddit = praw.Reddit(client_id=creds['client_id'],
                     client_secret=creds['client_secret'],
                     user_agent=creds['user_agent']
                    )

In [4]:
print(reddit.read_only)  # Output: True

True


## Start with one post and analyze all comments

#### Get Comments

In [5]:
submission = reddit.submission(id='ba7uqx')

In [6]:
# save comments as lists: top-level only, and the flattened reply tree
top_level_comments = list(submission.comments)
all_comments = submission.comments.list()

In [7]:
print("Number of top level comments: ", len(top_level_comments))
print("Total number of comments:     ", len(all_comments))

Number of top level comments:  131
Total number of comments:      602
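
The gap between the two counts comes from replies: iterating `submission.comments` yields only top-level comments, while `.list()` flattens the whole reply tree. Until `replace_more()` is called, both lists may also contain `MoreComments` placeholders rather than real comments. The sketch below mimics the flattening on a toy tree; the `Node` class is a hypothetical stand-in for a PRAW comment, and the breadth-first order is an assumption about how the flattening proceeds.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a comment with nested replies."""
    body: str
    replies: list = field(default_factory=list)


def flatten(top_level):
    """Return every node in the tree, visiting level by level."""
    queue = deque(top_level)
    flat = []
    while queue:
        node = queue.popleft()
        flat.append(node)
        queue.extend(node.replies)  # replies are visited after the current level
    return flat


tree = [Node("a", [Node("a1", [Node("a1x")]), Node("a2")]), Node("b")]
print(len(tree), len(flatten(tree)))      # 2 5
print([n.body for n in flatten(tree)])    # ['a', 'b', 'a1', 'a2', 'a1x']
```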


#### For each comment, explore the attributes

In [8]:
for comment in top_level_comments[:5]: # view the top 5 comments
    print("Votes:  ", comment.score)
    print("Author: ", comment.author)
    print("Body:   ",  comment.body)
    print("===================")

Votes:   1
Author:  AutoModerator
Body:    **Mirrors / Alternate angles**

*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/soccer) if you have any questions or concerns.*
Votes:   1508
Author:  FlyingArab
Body:    This was the most Diego Costa sequence ever
Votes:   2242
Author:  Sinnedd
Body:    Damn, Costa must have insulted this guy’s entire family 
Votes:   768
Author:  yammington
Body:    Simeone is gonna shank Costa at half time.
Votes:   1354
Author:  Juggernautspammer
Body:    What the fuck could he have said to get a straight red holy shit 


#### Clean up comments

In [9]:
# iterate over top-level comments in the submission and create a list of sentences
submission.comments.replace_more(limit=None)
top_level_comment_list = []
top_level_comment_string = ''
for top_level_comment in submission.comments[1:]: # Skip AutoMod comment
    top_level_comment_list.append(top_level_comment.body)
    top_level_comment_string += (str(top_level_comment.body)+'. ')

In [10]:
top_level_comment_list[0:5]

['This was the most Diego Costa sequence ever',
 'Damn, Costa must have insulted this guy’s entire family ',
 'Simeone is gonna shank Costa at half time.',
 'What the fuck could he have said to get a straight red holy shit ',
 'I am so confused']

In [11]:
top_level_comment_string[0:500]

'This was the most Diego Costa sequence ever. Damn, Costa must have insulted this guy’s entire family . Simeone is gonna shank Costa at half time.. What the fuck could he have said to get a straight red holy shit . I am so confused. Thats our boy. Damn, the way atletico players surrounded the ref was inviting another red. The way the referee gets crowded in la Liga disgusts me every time. . classic Diego Costa. Gently whispered "Ur mom gay lol" to the ref.\n\nFair red imo.. [deleted]. Imagine being'
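
The joined string above contains doubled periods (`half time..`) and a `[deleted]` placeholder, both of which add noise to any text-level metric. A minimal sketch of a stricter join follows, assuming `[deleted]` and `[removed]` are the only placeholders worth dropping; the helper name `join_comments` is hypothetical.

```python
PLACEHOLDERS = {"[deleted]", "[removed]"}  # assumed placeholder bodies
TERMINATORS = (".", "!", "?")


def join_comments(bodies):
    """Join comment bodies into one string, one sentence-like unit per comment."""
    sentences = []
    for body in bodies:
        text = " ".join(body.split())  # collapse newlines and repeated spaces
        if not text or text in PLACEHOLDERS:
            continue
        if not text.endswith(TERMINATORS):
            text += "."  # only add a period when the comment lacks one
        sentences.append(text)
    return " ".join(sentences)


sample = [
    "I am so confused",
    "Simeone is gonna shank Costa at half time.",
    "[deleted]",
    "Fair red\n\nimo ",
]
print(join_comments(sample))
# I am so confused. Simeone is gonna shank Costa at half time. Fair red imo.
```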

#### Polarity & Subjectivity using TextBlob

In [12]:
analysis = TextBlob(top_level_comment_string)
print('Polarity score:     ', analysis.sentiment.polarity)
print('Subjectivity score: ', analysis.sentiment.subjectivity)

Polarity score:      -0.006176127142461299
Subjectivity score:  0.4890894786842422
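
TextBlob reports polarity on a scale from -1 (negative) to 1 (positive) and subjectivity from 0 (objective) to 1 (subjective), so a polarity of about -0.006 is effectively neutral. The sketch below turns a score into a coarse label; the 0.05 cutoff is an arbitrary assumption for illustration, not a TextBlob convention.

```python
def label_polarity(polarity, cutoff=0.05):
    """Map a polarity in [-1, 1] to a coarse label using an assumed cutoff."""
    if polarity > cutoff:
        return "positive"
    if polarity < -cutoff:
        return "negative"
    return "neutral"


print(label_polarity(-0.006176127142461299))  # neutral
```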


#### Readability score

In [14]:

r = readability.getmeasures(top_level_comment_string, lang='en')
fk = r['readability grades']['Kincaid']

print("Flesch-Kincaid score:       ", fk)

Flesch-Kincaid score:        41.25158264403879
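
A grade of 41 is far beyond any real school grade, which suggests the input is being read as very few, very long sentences. The Flesch-Kincaid grade is `0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59`, so sentence length dominates when sentence boundaries are missed. The sketch below applies the formula to made-up counts to show the effect. It is not how the `readability` package tokenizes text, and whether missed sentence boundaries are the actual cause here is an assumption.

```python
def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


# Same 100 words and 150 syllables, counted as 5 sentences vs. 1 sentence.
well_split = flesch_kincaid_grade(100, 5, 150)
unsplit = flesch_kincaid_grade(100, 1, 150)
print(round(well_split, 2), round(unsplit, 2))  # 9.91 41.11
```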


## Now, let's expand this to the top daily submissions across subreddits

We will identify the top 100 subreddits by number of subscribers. Then, for each subreddit, I will calculate comment sentiment, subjectivity, reading level, and engagement metrics (upvote ratio, number of comments) for the 10 top-scoring posts of the past day. For this run, the list is then narrowed to six finance and politics subreddits.

In [15]:
# params
n_posts = 10

# Get list of subs
top_subs = pd.read_html('https://redditmetrics.com/top')[0]
top_subs = top_subs[top_subs['Reddit']!='/r/announcements'] # announcements subreddit doesn't count
top_subs = top_subs[top_subs['Rank']<=100]
list_of_subs = [x.split('/')[-1] for x in top_subs['Reddit']]

In [16]:
list_of_subs = ['wallstreetbets', 'politics', 'economics', 'stockmarket', 'options', 'investing']

In [17]:
start_time = datetime.datetime.now() # Start timer
metrics_df = pd.DataFrame()

for sub in list_of_subs:
    subreddit = reddit.subreddit(sub)
    sub_n_subscribers = subreddit.subscribers
    sub_name = subreddit.display_name

    for submission in subreddit.top("day", limit=n_posts):
        # Expand "load more" stubs, then flatten every comment (not just top-level)
        submission.comments.replace_more(limit=None)
        all_comments = submission.comments.list()

        # Analyze individual comments
        submission_sentiment_total = 0
        submission_subjectivity_total = 0
        reading_level_total = 0
        for comment in all_comments:
            # Sentiment Index
            analysis = TextBlob(comment.body)
            submission_sentiment_total = submission_sentiment_total + analysis.sentiment[0]
            submission_subjectivity_total = submission_subjectivity_total + analysis.subjectivity
            
            # Readability Metrics (score this comment, not the string built earlier)
            readability_results = readability.getmeasures(comment.body, lang='en')
            reading_level = readability_results['readability grades']['Kincaid']
            reading_level_total = reading_level_total + reading_level
            
        sentiment_avg = submission_sentiment_total / len(all_comments)
        subjectivity_avg = submission_subjectivity_total / len(all_comments)
        reading_level_avg = reading_level_total / len(all_comments)
        # Append to DF (DataFrame.append was removed in pandas 2.0, so use concat)
        row = pd.DataFrame([{'subreddit': sub_name,
                             'submission_id': submission.id,
                             'submission_score': submission.score,
                             'submission_upvote_ratio': submission.upvote_ratio,
                             'n_comments': len(all_comments),
                             'sentiment': sentiment_avg,
                             'subjectivity': subjectivity_avg,
                             'reading_level': reading_level_avg}])
        metrics_df = pd.concat([metrics_df, row], ignore_index=True)
        
    print(f"Finished running r/{sub}")
    
end_time = datetime.datetime.now() # Finish timer

print(f"Runtime: {(end_time - start_time).total_seconds() / 60:.1f} minutes")

Finished running r/wallstreetbets
Finished running r/politics
Finished running r/economics
Finished running r/stockmarket
Finished running r/options
Finished running r/investing
Runtime: 34.6 minutes
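
The loop divides by `len(all_comments)`, which raises `ZeroDivisionError` for a post that has no comments. A minimal sketch of a guarded average follows; the helper name `mean_or_none` is hypothetical, and a post that yields `None` could simply be skipped before its row is added.

```python
def mean_or_none(values):
    """Arithmetic mean of values, or None when there is nothing to average."""
    values = list(values)
    if not values:
        return None
    return sum(values) / len(values)


print(mean_or_none([0.5, -0.25, 0.5]))  # 0.25
print(mean_or_none([]))                 # None
```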


In [26]:
comment.downs

0

In [30]:
comment.body

"> Are there that many people making 6 figures though? The US isn't loaded down with high income employees. \n\nYes, we are. Top 5% of families pull $250k a year. That’s not all that uncommon, that’s 1 in 20."

### Weighted average for all metrics using n_comments

In [19]:
avg_metrics = metrics_df.groupby('subreddit').apply(lambda x: pd.Series([np.average(x['sentiment'], weights=x['n_comments']), 
                                                                         np.average(x['subjectivity'],weights=x['n_comments']),
                                                                         np.average(x['reading_level'],weights=x['n_comments']),
                                                                         np.average(x['submission_upvote_ratio'],weights=x['n_comments'])
                                                                        ], 
                                                                        index=['sentiment','subjectivity', 'reading_level', 'submission_upvote_ratio'])).unstack()
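
`np.average(x, weights=w)` computes `sum(w * x) / sum(w)`, so posts with more comments pull the subreddit figure toward their own value. The worked example below uses made-up numbers and plain Python to show how the weighted and unweighted means can even differ in sign; `weighted_mean` is an illustrative helper, not part of the notebook.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * v) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Two hypothetical posts: a small positive thread and a large negative one.
sentiments = [0.5, -0.25]
n_comments = [10, 30]

print(sum(sentiments) / len(sentiments))      # 0.125
print(weighted_mean(sentiments, n_comments))  # -0.0625
```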

In [20]:
avg_metrics = pd.DataFrame(avg_metrics)
avg_metrics

| | subreddit | 0 |
|---|---|---|
| sentiment | Economics | 0.06743 |
| sentiment | StockMarket | 0.080376 |
| sentiment | investing | 0.08639 |
| sentiment | options | 0.115824 |
| sentiment | politics | 0.039978 |
| sentiment | wallstreetbets | 0.062997 |
| subjectivity | Economics | 0.391559 |
| subjectivity | StockMarket | 0.409652 |
| subjectivity | investing | 0.382966 |
| subjectivity | options | 0.372836 |


In [37]:
sentiment_df = avg_metrics[avg_metrics.index.get_level_values(0)=='sentiment'].reset_index()

In [41]:
subjectivity_df = avg_metrics[avg_metrics.index.get_level_values(0)=='subjectivity'].reset_index()

## Data Visualization

We will use Plotly, whose figures can be exported to HTML and embedded in websites while remaining interactive.

In [40]:
sentiment_df

| | level_0 | subreddit | 0 |
|---|---|---|---|
| 0 | sentiment | Economics | 0.06743 |
| 1 | sentiment | StockMarket | 0.080376 |
| 2 | sentiment | investing | 0.08639 |
| 3 | sentiment | options | 0.115824 |
| 4 | sentiment | politics | 0.039978 |
| 5 | sentiment | wallstreetbets | 0.062997 |


In [44]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly import offline
import pandas as pd

# read in volcano database data (placeholder data, unrelated to Reddit, used to fill the second bar panel)
df = pd.read_csv(
    "https://raw.githubusercontent.com/plotly/datasets/master/volcano_db.csv",
    encoding="iso-8859-1",
)

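# frequency of Country, with explicit column names so the result does not
# depend on how the installed pandas version labels value_counts() output
freq = df["Country"].value_counts().rename_axis("x").reset_index(name="Country")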

# read in 3d volcano surface data
df_v = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/volcano.csv")

# Initialize figure with subplots
fig = make_subplots(
    rows=2, cols=2,
    column_widths=[0.6, 0.4],
    row_heights=[0.4, 0.6],
    specs=[[{"type": "bar", "rowspan": 2}, {"type": "bar"}],
           [            None                    , {"type": "surface"}]])

# Add first barplot

fig.add_trace(
    go.Bar(name='Sentiment', x=sentiment_df['subreddit'], y=sentiment_df[0]),
    row=1, col=1
)

fig.add_trace(
    go.Bar(name='Subjectivity', x=subjectivity_df['subreddit'], y=subjectivity_df[0]),
    row=1, col=1
)

# Add second barplot
fig.add_trace(
    go.Bar(x=freq["x"][0:10],y=freq["Country"][0:10], marker=dict(color="crimson"), showlegend=False),
    row=1, col=2
)

# Add surface plot
fig.add_trace(
    go.Surface(z=df_v.values.tolist(), showscale=False),
    row=2, col=2
)

# # Update geo subplot properties
# fig.update_geos(
#     projection_type="orthographic",
#     landcolor="white",
#     oceancolor="MidnightBlue",
#     showocean=True,
#     lakecolor="LightBlue"
# )

# Rotate x-axis labels
fig.update_xaxes(tickangle=45)

# Set theme, margin, and annotation in layout
fig.update_layout(
    template="plotly_dark",
    margin=dict(r=10, t=25, b=40, l=60),
    annotations=[
        dict(
            text="Source: Reddit API",
            showarrow=False,
            xref="paper",
            yref="paper",
            x=0,
            y=0)
    ]
)

fig.show()

In [47]:
offline.plot(fig, filename='../../hm9464.github.io/site/plots/reddit_metrics.html')

'../../hm9464.github.io/site/plots/reddit_metrics.html'

## Future Ideas
* Live analysis of comments, scores etc.
* E.g. live sentiment analysis of comments in economic/stock subreddits, overlaid with stock market data
* Analysis over time
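
For the "analysis over time" idea, one possible starting point is to bucket comments by day using their creation time. PRAW comments expose a `created_utc` Unix timestamp; the sketch below works on plain `(timestamp, polarity)` pairs with made-up values, and the helper name `daily_mean_polarity` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_mean_polarity(records):
    """Group (unix_timestamp, polarity) pairs by UTC day and average each day."""
    buckets = defaultdict(list)
    for created_utc, polarity in records:
        day = datetime.fromtimestamp(created_utc, tz=timezone.utc).date()
        buckets[day].append(polarity)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


# Made-up records: two comments on 2020-07-23 (UTC) and one on 2020-07-24.
records = [(1595462400, 0.5), (1595466000, -0.5), (1595548800, 0.25)]
for day, mean in daily_mean_polarity(records).items():
    print(day, mean)
# 2020-07-23 0.0
# 2020-07-24 0.25
```

The resulting per-day series could then be plotted or joined against market data keyed by the same dates.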