# Sentiment Analysis of Reddit's COVID-19 Comments
***This is a draft of an ongoing project*** -- 
*Last update made the 10th of April 2020* 

### Table of contents

1. [Acknowledgements](#Acknowledgements)
2. [Introduction](#Introduction)
3. [Data](#Data)
4. [Analysis](#Analysis)

## Acknowledgements

As a newcomer to the field of data science, I humbly acknowledge all the work that has been done before me and thank those who created it, as it has helped me learn so much and start working on my own projects.

Special thanks to [Duarte O. Carmo](https://pbpython.com/interactive-dashboards.html), whose analysis using the Pushshift API inspired this one, and to [Lorraine Li](https://towardsdatascience.com/k-means-clustering-with-scikit-learn-6b47a369a83c) for her excellent article on K-means clustering.



## Introduction

This is a simple analysis of the sentiment of Reddit users' comments related to COVID-19, starting from the 31st of December 2019, when the first cases were reported in China.

As the coronavirus has spread all over the world, citizens of every country have given their opinion on the current situation and events. This notebook is a first approach to the question of how people's opinions have changed over time.

Reddit, a social sharing platform whose content and comments are voted on by its community, might shed some light on this question.

## Data

In [None]:
# Let's import all the necessary libraries for this project

import requests
import numpy as np
import pandas as pd
import json
from datetime import datetime
import plotly.express as px
import textblob
from textblob.classifiers import NaiveBayesClassifier
import sklearn
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

We are going to retrieve the Reddit comments containing the word 'coronavirus' for each day from the start of the outbreak until today.

In [None]:
# Epoch time corresponding to Tuesday, 31 December 2019 0:00:00 GMT
start_date = 1577750400
end_date = int(datetime.now().timestamp())
seconds_per_day = 86400  # 86400 seconds is the equivalent of 24 hours

def get_pushshift_data(base_url, **kwargs):
    # Send the keyword arguments as query parameters and decode the JSON response
    request = requests.get(base_url, params=kwargs)
    return request.json()

# Collect one dataframe per day and concatenate them once at the end
# (DataFrame.append was removed in pandas 2.0)
frames = []
while start_date <= end_date:
    data = get_pushshift_data(base_url='https://api.pushshift.io/reddit/search/comment/',
                              q='coronavirus',
                              after=str(start_date),
                              before=str(start_date + seconds_per_day),
                              size=500,
                              sort_type='score',
                              sort='desc').get('data')
    # Skip the days for which the API returned no comments
    if data:
        frames.append(pd.DataFrame.from_records(data))
    start_date = start_date + seconds_per_day

df_final = pd.concat(frames, ignore_index=True)

# Let's keep only some columns for our final database
df_final = df_final[['created_utc', 'author', 'subreddit', 'body', 'score', 'permalink']].copy()

# Let's create a new column with all the epoch dates converted
df_final['date'] = pd.to_datetime(df_final['created_utc'], unit='s')
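
The day-by-day windows used in the loop above can be checked in isolation, without calling the API. The following is a minimal sketch using only the standard library; `daily_windows` is a hypothetical helper introduced for illustration and is not part of the notebook. It confirms that the hard-coded start value corresponds to midnight UTC on 31 December 2019 and shows the `after`/`before` pairs that would be sent for the first few days.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400  # 24 hours

def daily_windows(start_epoch, end_epoch):
    """Yield (after, before) epoch pairs, one per 24-hour window (hypothetical helper)."""
    current = start_epoch
    while current <= end_epoch:
        yield current, current + SECONDS_PER_DAY
        current += SECONDS_PER_DAY

# Midnight UTC on 31 December 2019, computed instead of hard-coded
start_epoch = int(datetime(2019, 12, 31, tzinfo=timezone.utc).timestamp())
# A short end date, chosen only to keep the example small
end_epoch = int(datetime(2020, 1, 2, tzinfo=timezone.utc).timestamp())

windows = list(daily_windows(start_epoch, end_epoch))
for after, before in windows:
    print(datetime.fromtimestamp(after, tz=timezone.utc).date(), after, before)
```

Building the timestamps with an explicit `timezone.utc` avoids depending on the local time zone of the machine running the notebook.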

We are going to use the NLP library TextBlob for a first approach to sentiment analysis of our data.

In [None]:
import textblob

# create a column with sentiment polarity
df_final["sentiment_polarity"] = df_final.apply(lambda row: textblob.TextBlob(row["body"]).sentiment.polarity, axis=1)

# create a column with sentiment subjectivity
df_final["sentiment_subjectivity"] = df_final.apply(lambda row: textblob.TextBlob(row["body"]).sentiment.subjectivity, axis=1)

# create a column with 'positive' or 'negative' depending on sentiment_polarity
df_final["sentiment"] = df_final.apply(lambda row: "positive" if row["sentiment_polarity"] >= 0 else "negative", axis=1)

# create a column with a text preview that shows the first 50 characters
df_final["preview"] = df_final["body"].str[0:50]
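
TextBlob reports polarity in the range [-1.0, 1.0] and subjectivity in the range [0.0, 1.0]. The labelling rule used in the cell above can be sketched on its own, under the assumption of a few made-up polarity values; `label_sentiment` is a hypothetical name used only for this illustration.

```python
def label_sentiment(polarity):
    """Apply the notebook's rule: polarity >= 0 is 'positive', otherwise 'negative'."""
    return "positive" if polarity >= 0 else "negative"

# Hypothetical polarity values inside TextBlob's [-1.0, 1.0] range
examples = [-0.6, 0.0, 0.35]
labels = [label_sentiment(p) for p in examples]

for polarity, label in zip(examples, labels):
    print(polarity, label)
```

Note that with this rule a perfectly neutral comment (polarity 0.0) is counted as positive, which is worth keeping in mind when reading the plots.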

Here is a preview of the final dataset, containing new columns created after the sentiment analysis of Reddit's comments has been done.

In [None]:
df_final.to_csv(path_or_buf='covid19_comments.csv')
df_final.head()

### Data cleaning

Let's check whether there are any null values in our dataset.

In [None]:
pd.read_csv('covid19_comments.csv').isnull().values.any()

Some comments may have a negative score, which would cause problems when the score is used as the marker size in our visualizations, so we are going to check whether any values in our dataframe are negative.

In [None]:
df_final.groupby('score')['score'].count().sort_values(ascending=False)

Let's replace negative values with a positive value of 1.

In [None]:
# Replace every negative score (not only -1) with 1
df_final.loc[df_final['score'] < 0, 'score'] = 1
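
As a quick sanity check, the number of negative scores can be counted directly and the replacement verified on toy data. This is a minimal sketch assuming a hypothetical `scores` series; the values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical scores, used only for illustration
scores = pd.Series([350, -1, 12, -7, 0, 3])

# Count how many scores are negative
n_negative = int((scores < 0).sum())

# Replace every negative score with 1 using a boolean mask
cleaned = scores.mask(scores < 0, 1)

print(n_negative)
print(cleaned.tolist())
```

A boolean mask covers any negative value, whereas matching one specific number such as -1 would leave a score like -7 untouched.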

## Analysis

### Exploratory Data Analysis

This is a first visualization of all the data in our dataframe using Plotly Express.

In [None]:
fig = px.scatter(df_final, x="date", 
           y="sentiment_polarity", 
           hover_data=["author", "permalink", "preview"], 
           color_discrete_sequence=["green", "red"], 
           color="sentiment", 
           size="score", 
           size_max=150,
           labels={"sentiment_polarity": "Comment positivity", "date": "Date comment was posted"}, 
           title=f"Reddit's COVID-19 related comments' sentiment over time", 
          )
fig.update_layout(
    autosize=False,
    width=800,
    height = 800,
    plot_bgcolor = 'white'
)
fig.show()

As this visualization may seem too complex to understand, we are going to split the comments we retrieved from Reddit by the month in which they were created.

First, here is the distribution of comments by month, from January 2020 until now.

In [None]:
df_final['month'] = pd.DatetimeIndex(df_final['date']).month
df_final.groupby('month')['month'].count().sort_values(ascending=False)

Let's visualize the distribution of positive and negative comments in each month of 2020.

In [None]:
january_comments = df_final.loc[df_final['month'] == 1]
january_comments.to_csv(path_or_buf='january_comments.csv')

february_comments = df_final.loc[df_final['month'] == 2]
february_comments.to_csv(path_or_buf='february_comments.csv')

march_comments = df_final.loc[df_final['month'] == 3]
march_comments.to_csv(path_or_buf='march_comments.csv')

april_comments = df_final.loc[df_final['month'] == 4]
april_comments.to_csv(path_or_buf='april_comments.csv')

In [None]:
fig = px.scatter(january_comments, x="date", 
           y="sentiment_polarity",
           hover_data=["author", "permalink", "preview"], 
           color_discrete_sequence=["green", "red"],
           color="sentiment", 
           size="score", 
           size_max=50,
           labels={"sentiment_polarity": "Comment positivity", "date": "Date comment was posted"}, 
           title=f"Reddit's COVID-19 related comments' sentiment during January 2020")
fig.update_layout(
    autosize=False,
    width=800,
    height = 800,
    plot_bgcolor = 'white'
)
fig.show()

In [None]:
fig = px.scatter(february_comments, x="date", 
           y="sentiment_polarity",
           hover_data=["author", "permalink", "preview"],
           color_discrete_sequence=["green", "red"], 
           color="sentiment", 
           size="score", 
           size_max=50,
           labels={"sentiment_polarity": "Comment positivity", "date": "Date comment was posted"}, 
           title=f"Reddit's COVID-19 related comments' sentiment in February 2020",
          )
fig.update_layout(
    autosize=False,
    width=800,
    height = 800,
    plot_bgcolor = 'white'
)
fig.show()

In [None]:
fig = px.scatter(march_comments, x="date",
           y="sentiment_polarity",
           hover_data=["author", "permalink", "preview"], 
           color_discrete_sequence=["green", "red"], 
           color="sentiment", 
           size="score", 
           size_max=50,
           labels={"sentiment_polarity": "Comment positivity", "date": "Date comment was posted"}, 
           title=f"Reddit's COVID-19 related comments' sentiment in March 2020",
          )
fig.update_layout(
    autosize=False,
    width=800,
    height = 800,
    plot_bgcolor = 'white')
fig.show()

In [None]:
fig = px.scatter(april_comments, x="date",
           y="sentiment_polarity",
           hover_data=["author", "permalink", "preview"], 
           color_discrete_sequence=["green", "red"], 
           color="sentiment", 
           size="score", 
           size_max=50,
           labels={"sentiment_polarity": "Comment positivity", "date": "Date comment was posted"}, 
           title=f"Reddit's COVID-19 related comments' sentiment in April 2020",
          )
fig.update_layout(
    autosize=False,
    width=800,
    height = 800,
    plot_bgcolor = 'white'
)
fig.show()

### Clustering


Let's use K-means clustering to classify the comments into different clusters. First we are going to use the Elbow Method to determine the best value for *k*.
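
For reference, the quantity plotted by the Elbow Method, which scikit-learn exposes as `inertia_` and the code below calls distortion, is the sum of squared distances from each sample to its closest cluster centre. Writing $x_i$ for the $n$ comments (each represented by its score and sentiment polarity) and $\mu_j$ for the $k$ centroids, it can be written as:

$$
J(k) = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2
$$

$J(k)$ tends to decrease as $k$ grows, so the value of interest is the "elbow", the point after which adding more clusters yields only a small further reduction.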

In [None]:
df_km = pd.DataFrame(df_final['score'])
df_km['sentiment_polarity'] = df_final['sentiment_polarity']

# ELBOW METHOD

# calculate distortion for a range of number of cluster
distortions = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(df_km)
    distortions.append(km.inertia_)

# plot
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()

In [None]:
kclusters = 4

km = KMeans(n_clusters=kclusters, init='random',n_init=10, max_iter=300, 
    tol=1e-04, random_state=0).fit(df_km)

km_labels = km.labels_

# Let's change label numbers so they go from highest scores to lowest

replace_labels = {0:2, 1:0, 2:3, 3:1}

for i in range(len(km_labels)):
    km_labels[i] = replace_labels[km_labels[i]]
    
df_km['Cluster'] = km_labels
df_km.head()
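
The hard-coded mapping above depends on the labels K-means happened to assign in one particular run. A minimal sketch of how such a mapping could instead be derived from the score of each cluster centre is shown below; `relabel_by_score` and the centre values are hypothetical and chosen only for illustration. In the notebook, the scores would presumably come from `km.cluster_centers_[:, 0]`, assuming `score` is the first column of `df_km`.

```python
def relabel_by_score(labels, center_scores):
    """Renumber clusters so that label 0 has the highest centre score (hypothetical helper)."""
    # Original cluster ids, ordered from highest to lowest centre score
    order = sorted(range(len(center_scores)), key=lambda c: center_scores[c], reverse=True)
    # Map each original id to its rank in that ordering
    mapping = {old: new for new, old in enumerate(order)}
    return [mapping[label] for label in labels], mapping

# Hypothetical centre scores for clusters 0..3 and a few hypothetical labels
center_scores = [120.0, 950.0, 15.0, 400.0]
labels = [0, 1, 2, 3, 1]

new_labels, mapping = relabel_by_score(labels, center_scores)
print(mapping)
print(new_labels)
```

Deriving the mapping this way keeps the ordering valid if the data or the random seed change and K-means assigns different label numbers.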

Let's see which cluster contains the comments with the highest scores, regardless of their sentiment polarity.

In [None]:
import matplotlib.ticker as ticker

fig, axes = plt.subplots(1, kclusters, figsize=(20, 10), sharey=True)

axes[0].set_ylabel('Score & Sentiment Polarity', fontsize=25)

for k in range(kclusters):
    # We are going to set same y axis limits
    axes[k].set_ylim(-1,500)
    axes[k].xaxis.set_label_position('top')
    axes[k].set_xlabel('Cluster ' + str(k), fontsize=25)
    axes[k].tick_params(labelsize=20)
    plt.sca(axes[k])
    plt.xticks(rotation='vertical')
    sns.boxplot(data = df_km[df_km['Cluster'] == k].drop(columns='Cluster'), ax=axes[k])

plt.show()

Now that we have identified the highest-scored comments, we can plot them to see whether there have been more positive or negative comments regarding the coronavirus.

In [None]:
df_final['Cluster'] = df_km['Cluster']

df_cluster2 = df_final.loc[df_final['Cluster'] == 2]

fig = px.scatter(df_cluster2, x="date",
           y="sentiment_polarity", 
           trendline = 'lowess',
           hover_data=["author", "permalink", "preview"], 
           color_discrete_sequence=["green", "red"], 
           color="sentiment", 
           size="score", 
           size_max=50,
           labels={"sentiment_polarity": "Comment positivity", "date": "Date comment was posted"}, 
           title=f"Sentiment of Reddit's most-voted comments on COVID-19",
          )
fig.update_layout(
    autosize=False,
    width=800,
    height = 800,
    plot_bgcolor = 'white'
)
fig.show()

### Emotion detection
We are going to analyse more precisely the emotions that can be detected in the comments about the coronavirus by training a machine learning model.

For this purpose, we are going to use the Python library [NRCLex](https://pypi.org/project/NRCLex/) to measure the emotional affect of the comments in our dataset.