# Sentiment Analysis on Comments for Reddit Posts
Using nltk, I will attempt to quantify the sentiment of all comments on a post. This can then be expanded to multiple posts or entire subreddits. This analysis can be useful for certain subreddits to see how emotion changes over time. 

For example, being able to gauge the sentiment for various political subreddits over time, or in the lead up to an election can help determine if one candidate has the edge over another.

In [1]:
import nltk
import praw
import pandas as pd
import datetime
import json
import numpy as np
from textblob import TextBlob
import readability

In [2]:
# Load credfile and display when last updated
credfile = 'credfile.json'
credfile_prefix = ''

# Read credentials to a dictionary
with open(credfile) as fh:
    creds = json.loads(fh.read())

print(f"[{datetime.datetime.now()}]" + f"{credfile} {'.' * 10} is being used as credfile")

[2020-07-29 12:22:01.005447]credfile.json .......... is being used as credfile


In [3]:
reddit = praw.Reddit(client_id=creds['client_id'],
                     client_secret=creds['client_secret'],
                     user_agent=creds['user_agent']
                    )

In [4]:
print(reddit.read_only)  # Output: True

True


## Start with one post and analyze all comments

#### Get Comments

In [5]:
submission = reddit.submission(id='ba7uqx')

In [6]:
# save comments as a list
top_level_comments = list(submission.comments)
all_comments = submission.comments.list()

In [7]:
print("Number of top level comments: ", len(top_level_comments))
print("Total number of comments:     ", len(all_comments))

Number of top level comments:  131
Total number of comments:      602


#### For each comment, expolore the attributes

In [8]:
for comment in top_level_comments[:5]: # view the top 5 comments
    print("Votes:  ", comment.score)
    print("Author: ", comment.author)
    print("Body:   ",  comment.body)
    print("===================")

Votes:   1
Author:  AutoModerator
Body:    **Mirrors / Alternate angles**

*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/soccer) if you have any questions or concerns.*
Votes:   1507
Author:  FlyingArab
Body:    This was the most Diego Costa sequence ever
Votes:   2236
Author:  Sinnedd
Body:    Damn, Costa must have insulted this guy’s entire family 
Votes:   768
Author:  yammington
Body:    Simeone is gonna shank Costa at half time.
Votes:   1353
Author:  Juggernautspammer
Body:    What the fuck could he have said to get a straight red holy shit 


#### Clean up comments

In [9]:
# iterate over top comments in the submission and\= create list of sentences
submission.comments.replace_more(limit=None)
top_level_comment_list = []
top_level_comment_string = ''
for top_level_comment in submission.comments[1:]: # Skip AutoMod comment
    top_level_comment_list.append(top_level_comment.body)
    top_level_comment_string += (str(top_level_comment.body)+'. ')

In [10]:
top_level_comment_list[0:5]

['This was the most Diego Costa sequence ever',
 'Damn, Costa must have insulted this guy’s entire family ',
 'Simeone is gonna shank Costa at half time.',
 'What the fuck could he have said to get a straight red holy shit ',
 'I am so confused']

In [11]:
top_level_comment_string[0:500]

'This was the most Diego Costa sequence ever. Damn, Costa must have insulted this guy’s entire family . Simeone is gonna shank Costa at half time.. What the fuck could he have said to get a straight red holy shit . I am so confused. Thats our boy. Damn, the way atletico players surrounded the ref was inviting another red. The way the referee gets crowded in la Liga disgusts me every time. . classic Diego Costa. Gently whispered "Ur mom gay lol" to the ref.\n\nFair red imo.. [deleted]. Imagine being'

#### Polarity & Subjectivity using TextBlob

In [12]:
analysis = TextBlob(top_level_comment_string)
print('Polarity score:     ', analysis.sentiment[0])
print('Subjectivity score: ', analysis.subjectivity)

Polarity score:      -0.006176127142461299
Subjectivity score:  0.4890894786842422


#### Readability score

In [13]:

r = readability.getmeasures(top_level_comment_string, lang='en')
fk = r['readability grades']['Kincaid']

print("Flesch-kincaid score:       ", fk)

Flesch-kincaid score:        41.25158264403879


## Now, lets expand this to the hot submissions for the top 100 subreddits

We will identify the top subreddits by number of subscribers. Then, for each subreddit I will calculate various metrics including comment sentiment, subjectivity and engagement metrics (upvote ratio, number of comments) for the top 10 hottest posts at the moment. 

### Get List of Top Subs

In [14]:
# params
n_posts = 20

# Get list of subs
top_subs = pd.read_html('https://redditmetrics.com/top')[0]
top_subs = top_subs[top_subs['Reddit']!='/r/announcements'] # announcements subreddit doesn't count
top_subs = top_subs[top_subs['Rank']<=100]
list_of_subs = [x.split('/')[-1] for x in top_subs['Reddit']]

### Record Metrics on Subreddit Activity

In [44]:
subreddit_activity = pd.DataFrame()

for sub in list_of_subs:
    subreddit = reddit.subreddit(sub)
    count = 0
    for submission in subreddit.rising():
        count +=1
    
    subreddit_activity = subreddit_activity.append(
        {'name': subreddit.display_name,
         'n_subscribers': int(subreddit.subscribers),
         'active_users': int(subreddit.accounts_active),
         'rising_posts': count
        }, ignore_index=True)
    
subreddit_activity['proportion_active'] = (subreddit_activity['active_users'] / subreddit_activity['n_subscribers'])*100

In [45]:
subreddit_activity

Unnamed: 0,active_users,n_subscribers,name,rising_posts,proportion_active
0,59246.0,32001451.0,funny,54.0,0.185135
1,149585.0,29052857.0,AskReddit,100.0,0.514872
2,39031.0,27249534.0,gaming,32.0,0.143235
3,37374.0,25876587.0,aww,100.0,0.144432
4,39326.0,25379770.0,pics,83.0,0.154950
5,5735.0,24673945.0,Music,25.0,0.023243
6,15258.0,24663008.0,science,25.0,0.061866
7,73084.0,24657219.0,worldnews,37.0,0.296400
8,34631.0,23480096.0,videos,24.0,0.147491
9,36040.0,23402128.0,todayilearned,23.0,0.154003


### Record Metrics on Comments

In [46]:
start_time = datetime.datetime.now() # Start timer
metrics_df = pd.DataFrame()

for sub in list_of_subs:
    subreddit = reddit.subreddit(sub)
    sub_n_subscribers = subreddit.subscribers
    sub_name = subreddit.display_name

    for submission in subreddit.top("hour", limit=n_posts):
        # Get all top-level comments
        submission.comments.replace_more(limit=None)
        all_comments = submission.comments.list()
        if len(all_comments)==0:
            # Catch 0 comments
            continue

        # Analyze individual comments
        submission_sentiment_total = 0
        submission_subjectivity_total = 0
        reading_level_total = 0
        comment_word_count_total = 0
        for comment in all_comments:
            # Sentiment Index
            analysis = TextBlob(comment.body)
            submission_sentiment_total = submission_sentiment_total + analysis.sentiment[0]
            submission_subjectivity_total = submission_subjectivity_total + analysis.subjectivity
            
            # Readability Metrics
            readability_results = readability.getmeasures(top_level_comment_string, lang='en')
            reading_level = readability_results['readability grades']['Kincaid']
            reading_level_total = reading_level_total + reading_level
            
            # General metrics
            comment_word_count = comment.body.split()
            comment_word_count_total = comment_word_count_total + len(comment_word_count)
            
        # Append to DF
        metrics_df = metrics_df.append({'subreddit': sub_name,
                                        'submission_id': submission.id,
                                        'submission_score': submission.score,
                                        'submission_upvote_ratio': submission.upvote_ratio,
                                        'n_comments': len(all_comments),
                                        'sentiment': submission_sentiment_total / len(all_comments),
                                        'subjectivity': submission_subjectivity_total / len(all_comments),
                                        'reading_level': reading_level_total / len(all_comments),
                                        'words_per_comment': comment_word_count_total / len(all_comments)
                                       },
                                       ignore_index=True
                                      )
        
    print(f"Finished running r/{sub}")
    
end_time = datetime.datetime.now() # Finish timer

print(f"Runtime: {((end_time - start_time).seconds) / 60} minutes")

Finished running r/funny
Finished running r/AskReddit
Finished running r/gaming
Finished running r/aww
Finished running r/pics
Finished running r/Music
Finished running r/science
Finished running r/worldnews
Finished running r/videos
Finished running r/todayilearned
Finished running r/movies
Finished running r/news
Finished running r/Showerthoughts
Finished running r/IAmA
Finished running r/gifs
Finished running r/EarthPorn
Finished running r/askscience
Finished running r/food
Finished running r/Jokes
Finished running r/explainlikeimfive
Finished running r/books
Finished running r/LifeProTips
Finished running r/Art
Finished running r/mildlyinteresting
Finished running r/blog
Finished running r/DIY
Finished running r/sports
Finished running r/nottheonion
Finished running r/space
Finished running r/gadgets
Finished running r/television
Finished running r/Documentaries
Finished running r/GetMotivated
Finished running r/photoshopbattles
Finished running r/listentothis
Finished running r/Up

#### Weighted averatge for all metrics using n_comments

In [47]:
avg_metrics = metrics_df.groupby('subreddit').apply(lambda x: pd.Series([np.average(x['sentiment'], weights=x['n_comments']), 
                                                                         np.average(x['subjectivity'],weights=x['n_comments']),
                                                                         np.average(x['reading_level'],weights=x['n_comments']),
                                                                         np.average(x['submission_upvote_ratio'],weights=x['n_comments']),
                                                                         np.average(x['words_per_comment'],weights=x['n_comments'])
                                                                        ], 
                                                                        index=['sentiment',
                                                                               'subjectivity', 
                                                                               'reading_level', 
                                                                               'submission_upvote_ratio', 
                                                                               'words_per_comment'])).unstack()

In [48]:
avg_metrics = pd.DataFrame(avg_metrics)
avg_metrics

Unnamed: 0_level_0,Unnamed: 1_level_0,0
Unnamed: 0_level_1,subreddit,Unnamed: 2_level_1
sentiment,AdviceAnimals,0.000000
sentiment,AnimalsBeingBros,0.200000
sentiment,AnimalsBeingDerps,0.168333
sentiment,Art,0.417914
sentiment,AskReddit,0.031106
sentiment,BikiniBottomTwitter,0.000000
sentiment,BlackPeopleTwitter,-0.251157
sentiment,DIY,0.159722
sentiment,Documentaries,-0.750000
sentiment,EarthPorn,0.161715


In [49]:
# Save a copy
avg_metrics.to_csv("subreddit_nlp_metrics.csv", index=False)

#### Top 5 For Each Category

In [51]:
top5 = avg_metrics.sort_values(0, ascending=False).reset_index().groupby('level_0').head(5)
bottom5 = avg_metrics.sort_values(0, ascending=False).reset_index().groupby('level_0').tail(5)

In [77]:
top_bottom = pd.concat([top5, bottom5])

In [78]:
top_bottom

Unnamed: 0,level_0,subreddit,0
0,words_per_comment,history,191.0
1,words_per_comment,dataisbeautiful,186.0
2,words_per_comment,woahdude,104.0
3,words_per_comment,WritingPrompts,101.0
4,words_per_comment,relationships,93.90625
17,reading_level,EarthPorn,41.251583
18,reading_level,Tinder,41.251583
19,reading_level,FoodPorn,41.251583
20,reading_level,Overwatch,41.251583
21,reading_level,nba,41.251583


In [79]:
sentiment_df = top_bottom[top_bottom['level_0']=='sentiment']

In [80]:
subjectivity_df = top_bottom[top_bottom['level_0']=='subjectivity']

## Data Visualization

We will use plotly which can be exported to HTML and embedded into websites. It also allows for interaction with the visual.

#### Parameters

In [112]:
# TO keep fonts and layouts consistent
font_dict = dict(
        family="Helvectica",
        size=14,
        color="RebeccaPurple"
    )

### Subreddit Activity

#### 20 Most Active Subs

In [114]:
subreddit_activity_active = subreddit_activity.sort_values('proportion_active', ascending=False).head(20)
fig = go.Figure([go.Bar(x=subreddit_activity_active['name'], y=subreddit_activity_active['proportion_active'])])
fig.update_layout(
    title="Most Active Subreddits (Online Users as % of Total Subscribers)",
    xaxis_title="Subreddit Name",
    yaxis_title="Proportion of Subcribers Currentlty Active (%)",
    legend_title="Legend Title",
    font=font_dict
)
fig.show()

#### Active Users vs Rising Posts

In [106]:
fig = go.Figure(data=[go.Scatter(
    x=subreddit_activity['proportion_active'], y=subreddit_activity['rising_posts'],
    text=subreddit_activity['name'],
    mode='markers',
    marker=dict(
#         color=['rgb(93, 164, 214)', 'rgb(255, 144, 14)',  'rgb(44, 160, 101)', 'rgb(255, 65, 54)'],
        size=subreddit_activity['n_subscribers']/1000000,
    )
)])

fig.update_layout(
    title="Active Users vs # of Rising Posts (Past Hour)",
    xaxis_title="Proportion of Subcribers Currentlty Active (%)",
    yaxis_title="Number of Rising Posts (Past Hour)",
    legend_title="Legend Title",
    font=font_dict
)

fig.show()

### Comments

#### Highest and lowest sentiment indexes for comments

In [81]:
fig = go.Figure(go.Bar(
            x=list(sentiment_df[0]),
            y=list(sentiment_df['subreddit']),
            orientation='h', name='SF Zoo'))

fig.show()

In [99]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly import offline
import pandas as pd

fig = make_subplots(rows=1, cols=2, column_widths=[0.4, 0.4], row_heights=[0.8])

# Sentiment index
fig.add_trace(
    go.Bar(
            x=list(sentiment_df[0]),
            y=list(sentiment_df['subreddit']),
            orientation='h', name='Sentiment Index'),
    row=1, col=1
)

# Subjectivity index
fig.add_trace(
    go.Bar(
            x=list(subjectivity_df[0]),
            y=list(subjectivity_df['subreddit']),
            orientation='h', name='Subjectivity Index'),
    row=1, col=2
).update_layout(title_text='Subjectivity')

# Rotate x-axis labels
# fig.update_xaxes(tickangle=45)

# Set theme, margin, and annotation in layout
fig.update_layout(
    title_text="Highest and Lowest Sentiment & Subjectivity Indexes",
    template="plotly_dark",
    margin=dict(r=10, t=25, b=40, l=60),
    annotations=[
        dict(
            text="Source: Reddit API",
            showarrow=False,
            xref="paper",
            yref="paper",
            x=0,
            y=0)
    ]
)


fig.show()

In [None]:
fig = go.Figure(data=[go.Table(header=dict(values=['Subreddit', 'Sentiment Index']),
                 cells=dict(values=[sentiment_df['subreddit'], sentiment_df[0]]))
                     ])
fig.show()

In [47]:
offline.plot(fig, filename='../../hm9464.github.io/site/plots/reddit_metrics.html')

'../../hm9464.github.io/site/plots/reddit_metrics.html'

## Future Ideas
* Live analysis of comments, scores etc.
* E.g. live sentiment analysis of comments of economic/stock subreddits, and overlayed with stock market data
* Note: Cannot do analysis over time because reddit API does not support historical
* Most active subreddits at the moment