# Sentiment Analysis on Comments for Reddit Posts
Using nltk, I will attempt to quantify the sentiment of all comments on a post. This can then be expanded to multiple posts or entire subreddits. This analysis can be useful for certain subreddits to see how emotion changes over time. 

For example, being able to gauge the sentiment for various political subreddits over time, or in the lead up to an election can help determine if one candidate has the edge over another.

In [1]:
import nltk
import praw
import pandas as pd
import datetime
import json

In [2]:
# Load credfile and display when last updated
credfile = 'credfile.json'
credfile_prefix = ''

# Read credentials to a dictionary
with open(credfile) as fh:
    creds = json.loads(fh.read())

print(f"[{datetime.datetime.now()}]" + f"{credfile} {'.' * 10} is being used as credfile")

[2020-07-22 14:39:50.445092]credfile.json .......... is being used as credfile


In [3]:
reddit = praw.Reddit(client_id=creds['client_id'],
                     client_secret=creds['client_secret'],
                     user_agent=creds['user_agent']
                    )

In [4]:
print(reddit.read_only)  # Output: True

True


## Start with one post and analyze all comments

#### Get Comments

In [5]:
submission = reddit.submission(id='ba7uqx')

In [6]:
# save comments as a list
top_level_comments = list(submission.comments)
all_comments = submission.comments.list()

In [7]:
print("Number of top level comments: ", len(top_level_comments))
print("Total number of comments:     ", len(all_comments))

Number of top level comments:  131
Total number of comments:      602


In [8]:
for comment in top_level_comments[:5]: # view the top 5 comments
    print("Votes:  ", comment.score)
    print("Author: ", comment.author)
    print("Body:   ",  comment.body)
    print("===================")

Votes:   1
Author:  AutoModerator
Body:    **Mirrors / Alternate angles**

*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/soccer) if you have any questions or concerns.*
Votes:   1501
Author:  FlyingArab
Body:    This was the most Diego Costa sequence ever
Votes:   2238
Author:  Sinnedd
Body:    Damn, Costa must have insulted this guy’s entire family 
Votes:   774
Author:  yammington
Body:    Simeone is gonna shank Costa at half time.
Votes:   1359
Author:  Juggernautspammer
Body:    What the fuck could he have said to get a straight red holy shit 


#### Clean up comments

In [9]:
# iterate over top comments in the submission and\= create list of sentences
submission.comments.replace_more(limit=None)
top_level_comment_list = []
top_level_comment_string = ''
for top_level_comment in submission.comments[1:]: # Skip AutoMod comment
    top_level_comment_list.append(top_level_comment.body)
    top_level_comment_string += (str(top_level_comment.body)+'. ')

In [10]:
top_level_comment_list[0:5]

['This was the most Diego Costa sequence ever',
 'Damn, Costa must have insulted this guy’s entire family ',
 'Simeone is gonna shank Costa at half time.',
 'What the fuck could he have said to get a straight red holy shit ',
 'I am so confused']

In [11]:
top_level_comment_string[0:500]

'This was the most Diego Costa sequence ever. Damn, Costa must have insulted this guy’s entire family . Simeone is gonna shank Costa at half time.. What the fuck could he have said to get a straight red holy shit . I am so confused. Thats our boy. Damn, the way atletico players surrounded the ref was inviting another red. The way the referee gets crowded in la Liga disgusts me every time. . classic Diego Costa. Gently whispered "Ur mom gay lol" to the ref.\n\nFair red imo.. [deleted]. Imagine being'

#### Polarity & Subjectivity using TextBlob

In [13]:
from textblob import TextBlob

analysis = TextBlob(top_level_comment_string)
print('Polarity score:     ', analysis.sentiment[0])
print('Subjectivity score: ', analysis.subjectivity)

Polarity score:      -0.006176127142461299
Subjectivity score:  0.4890894786842422


#### Readability score

In [88]:
import readability

In [90]:
results

OrderedDict([('readability grades',
              OrderedDict([('Kincaid', 24.36708005465768),
                           ('ARI', 30.60815671676955),
                           ('Coleman-Liau', 11.963422991942707),
                           ('FleschReadingEase', 25.073084389577375),
                           ('GunningFogIndex', 29.137991801347596),
                           ('LIX', 82.24515855439853),
                           ('SMOGIndex', 18.744673284705062),
                           ('RIX', 13.789473684210526),
                           ('DaleChallIndex', 12.645107849974085)])),
             ('sentence info',
              OrderedDict([('characters_per_word', 4.807520143240824),
                           ('syll_per_word', 1.4431512981199641),
                           ('words_per_sentence', 58.78947368421053),
                           ('sentences_per_paragraph', 1.2666666666666666),
                           ('type_token_ratio', 0.5147717099373321),
                     

In [87]:
from readability import Readability

r = Readability(top_level_comment_string)
fk = r.flesch_kincaid()

print("Flesch-kincaid score:       ", fk.score)
print("Flesch-kincaid grade level: ", fk.grade_level)

ImportError: cannot import name 'Readability'

## Now, lets expand this to the hot submissions for the top 100 subreddits

We will identify the top subreddits by number of subscribers. Then, for each subreddit I will calculate various metrics including comment sentiment, subjectivity and engagement metrics (upvote ratio, number of comments) for the top 10 hottest posts at the moment. 

In [107]:
# params
n_posts = 10

# Get list of subs
top_subs = pd.read_html('https://redditmetrics.com/top')[0]
top_subs = top_subs[top_subs['Reddit']!='/r/announcements'] # announcements subreddit doesn't count
top_subs = top_subs[top_subs['Rank']<=100]
list_of_subs = [x.split('/')[-1] for x in top_subs['Reddit']]

In [108]:
list_of_subs = ['politics', 'wallstreetbets', 'economics', 'stockmarket', 'options']

In [109]:
start_time = datetime.datetime.now() # Start timer
metrics_df = pd.DataFrame()

for sub in list_of_subs:
    subreddit = reddit.subreddit(sub)
    sub_n_subscribers = subreddit.subscribers
    sub_name = subreddit.display_name

    for submission in subreddit.top("day", limit=n_posts):
        # Get all top-level comments
        submission.comments.replace_more(limit=None)
        all_comments = submission.comments.list()

        # Analyze individual comments
        submission_sentiment_total = 0
        submission_subjectivity_total = 0
        reading_ease_total = 0
        for comment in all_comments:
            # Sentiment Index
            analysis = TextBlob(comment.body)
            submission_sentiment_total = submission_sentiment_total + analysis.sentiment[0]
            submission_subjectivity_total = submission_subjectivity_total + analysis.subjectivity
            
            # Readability Metrics
            readability_results = readability.getmeasures(top_level_comment_string, lang='en')
            reading_ease = readability_results['readability grades']['FleschReadingEase']
            reading_ease_total = reading_ease_total + reading_ease
            
        sentiment_avg = submission_sentiment_total / len(all_comments)
        subjectivity_avg = submission_subjectivity_total / len(all_comments)
        reading_ease_avg = reading_ease_total / len(all_comments)
        # Append to DF
        metrics_df = metrics_df.append({'subreddit': sub_name,
                                        'submission_id': submission.id,
                                        'submission_score': submission.score,
                                        'submission_upvote_ratio': submission.upvote_ratio,
                                        'n_comments': len(all_comments),
                                        'sentiment': sentiment_avg,
                                        'subjectivity': subjectivity_avg,
                                        'reading_ease': reading_ease_avg},
                                       ignore_index=True
                                      )
        
    print(f"Finished running r/{sub}")
    
end_time = datetime.datetime.now() # Finish timer

print(f"Runtime: {((end_time - start_time).seconds) / 60} minutes")

Finished running r/politics
Finished running r/wallstreetbets
Finished running r/economics
Finished running r/stockmarket
Finished running r/options
Runtime: 11.75 minutes


In [110]:
metrics_df

Unnamed: 0,n_comments,reading_ease,sentiment,subjectivity,submission_id,submission_score,submission_upvote_ratio,subreddit
0,2225.0,25.073084,0.020377,0.398236,hvt578,68893.0,0.92,politics
1,3840.0,25.073084,0.028507,0.414638,hvrl1c,51342.0,0.93,politics
2,2629.0,25.073084,0.038471,0.38848,hvx1w3,56366.0,0.89,politics
3,1731.0,25.073084,0.048976,0.380211,hvictm,41351.0,0.86,politics
4,1180.0,25.073084,0.070977,0.416868,hvq6fb,32536.0,0.95,politics
5,2289.0,25.073084,0.060109,0.376056,hvm0c5,31485.0,0.93,politics
6,686.0,25.073084,0.036735,0.391497,hvkxqv,21659.0,0.91,politics
7,1000.0,25.073084,0.013885,0.378877,hvjjeu,18915.0,0.96,politics
8,582.0,25.073084,0.047798,0.351626,hvk5n2,15741.0,0.93,politics
9,587.0,25.073084,0.067036,0.403257,hvikpp,14785.0,0.95,politics


## Future Ideas
* Live analysis of comments, scores etc.
* E.g. live sentiment analysis of comments of economic/stock subreddits, and overlayed with stock market data