# Reddit Comments Analysis

### Download and clean data. Estimate polarity scores.

Monthly Reddit comment dumps are available as compressed archives and can be used for modeling.

They are hosted at https://files.pushshift.io/reddit/comments/.

Since the Databricks environment is limited to 10GB, only the smaller files are downloaded.

First, I download the comments from September 2011.

In [2]:
## nltk is required for sentiment analysis
!pip install nltk
!pip install wordcloud

In [3]:
## import the required libs

from pyspark import SparkContext
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud 
import json
import bz2
import string


In [4]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')
nltk.download('wordnet')

In [5]:
## this saves the compressed file in the driver's working directory
!wget 'https://files.pushshift.io/reddit/comments/RC_2011-09.bz2'

In [6]:
## decompress the bz2 file and read its content into memory
with bz2.open("RC_2011-09.bz2", "rb") as f:
    content = f.read()

In [7]:
## write the decompressed content to a txt file
with open("RC_comment_09.txt", "wb") as f:
    f.write(content)
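
Reading the archive with f.read() keeps the whole decompressed month in driver memory at once. A minimal sketch of a streaming alternative follows. The helper name `decompress_bz2` and the chunk size are illustrative assumptions, not part of the original notebook.

```python
import bz2
import shutil

def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress src_path into dst_path, one chunk at a time."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj reads and writes chunk_size bytes per step,
        # so memory use stays roughly constant regardless of file size
        shutil.copyfileobj(src, dst, chunk_size)

# hypothetical usage with the file names from this notebook:
# decompress_bz2("RC_2011-09.bz2", "RC_comment_09.txt")
```

With this approach the decompression and the write happen in one pass, which may matter for the larger monthly files.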

In [None]:
## move the file from the driver's local disk to DBFS file storage
## so that Spark can read it for further processing

dbutils.fs.mv("file:/databricks/driver/RC_comment_09.txt", 
              "dbfs:/tmp/RC_2011-09.txt")  

In [9]:
## create rdd to further work on it
rdd = sc.textFile("dbfs:/tmp/RC_2011-09.txt")

In [10]:
## each row is one JSON string (one comment per line)
print(rdd.take(1))

Here I start taking a subset of the fields.

In [12]:
## each line of the txt file is one JSON object
## keep the comment id, author, time, vote, subreddit and comment fields
## comments whose body contains '[deleted]' are removed

FIELDS = ('name', 'author', 'author_flair_text', 'created_utc', 'parent_id',
          'ups', 'downs', 'retrieved_on', 'subreddit', 'body')

def parse_comment(line):
    record = json.loads(line)  ## parse each line only once
    return tuple(record[key] for key in FIELDS)

## the comment id ("name") is the first element, the comment body is the last (index 9)
rdd_subset = (rdd.map(parse_comment)
                 .filter(lambda line: '[deleted]' not in line[9]))

In [13]:
## subset looks fine
## it has unix time, comments and all other info that is needed
print(rdd_subset.take(1))
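
The field extraction and the '[deleted]' filter can be sanity-checked on a few hand-written lines without a cluster. The sketch below is an assumption-laden variant, not the notebook's exact code: `parse_comment`, `is_kept` and the sample records are made up for illustration, and `.get()` is used so that a missing key becomes None instead of raising an error.

```python
import json

# field order used for each row; the comment body is last (index 9)
FIELDS = ("name", "author", "author_flair_text", "created_utc", "parent_id",
          "ups", "downs", "retrieved_on", "subreddit", "body")

def parse_comment(line):
    """Parse one JSON line once and return the selected fields as a tuple."""
    record = json.loads(line)
    # assumption: tolerate missing keys; .get() returns None for them
    return tuple(record.get(key) for key in FIELDS)

def is_kept(row):
    """Keep rows whose body is present and not marked as deleted."""
    body = row[-1]
    return body is not None and "[deleted]" not in body

# two made-up records, for illustration only
sample_lines = [
    json.dumps({"name": "t1_a", "author": "user1", "body": "nice gif", "ups": 3}),
    json.dumps({"name": "t1_b", "author": "[deleted]", "body": "[deleted]"}),
]
rows = [row for row in map(parse_comment, sample_lines) if is_kept(row)]
print(rows)
```

On the real data the same two functions could be passed to rdd.map and rdd.filter.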

In [14]:
## here I create several functions for processing

lemmatizer = WordNetLemmatizer()
analyzer = SentimentIntensityAnalyzer()

def tokenize_sent(x):
    return nltk.sent_tokenize(x)
  
def tokenize_word(x):
    sent_split = [word for line in x for word in line.split()]
    return sent_split
  
## drop tokens that consist only of digits
def remove_int(x):
    not_int = [a for a in x if not a.isdigit()]
    return not_int
  
## drop tokens longer than 20 characters (urls, long strings)
def short_words(x):
    short = [a for a in x if len(a) <= 20]
    return short
  
## not needed for vader
stopwords_en = set(stopwords.words('english'))
def remove_stopwords(x):
    cleaned_sent = [w for w in x if w not in stopwords_en]
    return cleaned_sent
  
## not needed for vader
punct_words = set(string.punctuation)
def remove_punct(x):
    cleaned_sent = [''.join(c for c in s if c not in punct_words) for s in x]
    cleaned_sent = [s for s in cleaned_sent if s]  # drop tokens that became empty
    return cleaned_sent

## not needed for vader
def lemmatize_sent(x):
    lemma = [lemmatizer.lemmatize(s) for s in x]
    return lemma
  
def join_tokens(x):
    x = " ".join(x)
    return [x]
  
def sentiment_score(x):
    vs = analyzer.polarity_scores(x[0])
    return vs['neg'], vs['neu'], vs['pos'], vs['compound']


VADER relies on several cues in the sentence, and removing them would alter the polarity scores:
- conjunctions (no stopword removal)
- degree modifiers (no lemmatizing)
- capitalization (no lowercasing)
- punctuation (no punctuation removal)

I still keep a function for each case so that they can be used later on.
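
To make the point concrete, the sketch below shows what a typical bag-of-words cleaning pass would erase from a sentence. It does not call VADER. `aggressive_clean`, the example sentence and the tiny stopword set are illustrative assumptions (the real NLTK stopword list is much longer).

```python
import string

# tiny illustrative subset of English stopwords (not the full NLTK list)
SAMPLE_STOPWORDS = {"the", "is", "but", "not", "very"}

def aggressive_clean(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in no_punct.split() if w not in SAMPLE_STOPWORDS)

original = "The plot is NOT very good, but the ending is GREAT!!!"
print(aggressive_clean(original))
# prints: plot good ending great
```

The negation, the intensifier, the contrast word, the capitalization and the exclamation marks are all gone after cleaning, and these are the kinds of cues VADER uses when it scores a sentence.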

Here I do a simple check to see how URLs in the string affect the scores.

URLs and long strings do not affect the scores. They will be removed to keep the data smaller.

In [17]:
print(analyzer.polarity_scores("at least wait a bit before bad [reposting]"))

In [18]:
print(analyzer.polarity_scores("at least wait a bit before bad [reposting](http://www.reddit.com/r/woahdude/comments/jyxly/mighty_morphing_power_art_gif/)."))

Now I check how numbers alter the polarity scores.

Numbers do affect the scores, but they are not useful since they have no semantic value. They will be removed.

In [20]:
print(analyzer.polarity_scores("at least wait a bit before bad [reposting]"))

In [21]:
print(analyzer.polarity_scores("at least wait a bit before bad [reposting] 123"))

So only the following are removed:
- very long words
- numbers
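
A minimal sketch of these two removals on the example comment used in the checks, written in plain Python. The `max_len` parameter is an addition for illustration, with the 20-character limit as its default.

```python
def remove_int(tokens):
    """Drop tokens made up only of digits."""
    return [t for t in tokens if not t.isdigit()]

def short_words(tokens, max_len=20):
    """Drop tokens longer than max_len characters (URLs, long strings)."""
    return [t for t in tokens if len(t) <= max_len]

comment = ("at least wait a bit before bad "
           "[reposting](http://www.reddit.com/r/woahdude/comments/jyxly/"
           "mighty_morphing_power_art_gif/) 123")
tokens = comment.split()
cleaned = " ".join(short_words(remove_int(tokens)))
print(cleaned)
# prints: at least wait a bit before bad
```

Because tokens are split on whitespace, the link text "[reposting]" is attached to its URL and is dropped together with it.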

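Data in an RDD are just rows.

Each function that is applied to the RDD goes through every row.

In rdd_subset, line[:9] holds the metadata (including the unix time) and line[9] is the comment string. After the first step each row becomes a pair, where line[0] is the metadata tuple and line[1] is the comment.

Functions are applied with a lambda and only change the comment element.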
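
The pattern of transforming only the comment while carrying the other fields along can be sketched with plain Python tuples. `on_body` is a hypothetical helper, the rows are made up, and the built-in map stands in for rdd.map.

```python
def on_body(fn):
    """Wrap fn so it transforms only the second element of a (meta, body) row."""
    def apply(row):
        meta, body = row
        return (meta, fn(body))
    return apply

# made-up rows: (metadata tuple, comment text)
rows = [(("t1_a", "user1"), "Nice gif. Thanks 42 times"),
        (("t1_b", "user2"), "ok")]

split_words = on_body(str.split)
drop_digits = on_body(lambda tokens: [t for t in tokens if not t.isdigit()])

# the built-in map stands in for rdd.map here
result = list(map(drop_digits, map(split_words, rows)))
print(result)
```

Each step receives the whole row, returns the metadata untouched, and replaces only the comment.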

In [24]:
## tokenize into sentences
## each rdd row is a tuple
## rdd_subset holds the comment in the last element of the tuple
## so I keep all elements before it as they are and modify only the comment
sent_rdd = rdd_subset.map(lambda line: (line[:9], 
                                        tokenize_sent(line[9])))

## tokenize into words
## each row of sent_rdd is now a pair: (tuple, list)
## the first element is the tuple line[:9] from the first step
## the second element is the list holding the modified comment
word_rdd = sent_rdd.map(lambda line: (line[0], 
                                      tokenize_word(line[1])))

## remove int
removed_int_rdd = word_rdd.map(lambda line: (line[0], 
                                             remove_int(line[1])))

## remove long tokens
shortened_rdd = removed_int_rdd.map(lambda line: (line[0], 
                                                  short_words(line[1])))

## join cleaned tokens for sentiment analysis
joined_rdd = shortened_rdd.map(lambda line: (line[0], 
                                             join_tokens(line[1])))

## sentiment scores are added as well
sentiment_rdd = joined_rdd.map(lambda line: (line[0], line[1], 
                                             sentiment_score(line[1])))

## sentiment scores come back as a tuple
## so now each row is tuple + list + tuple
## here I unpack the comment list and the score tuple
rdd_processed = sentiment_rdd.map(lambda line: (line[0], line[1][0], 
                                                line[2][0], line[2][1], 
                                                line[2][2], line[2][3]))

In [25]:
## here I flatten the nested metadata tuple so that each row is one flat tuple
rdd_processed_ = rdd_processed.map(lambda line: (line[0][0], line[0][1], line[0][2],
                                                 line[0][3], line[0][4], line[0][5],
                                                 line[0][6], line[0][7], line[0][8], 
                                                 line[1], line[2], line[3], line[4], line[5]))
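
The two unpacking steps can also be written as a single flattening function based on tuple concatenation. This is only a sketch: `flatten_row` is a hypothetical helper, and the row below, including its scores, is made up rather than real VADER output.

```python
def flatten_row(row):
    """Turn (meta_tuple, [comment], (neg, neu, pos, compound)) into one flat tuple."""
    meta, comment, scores = row
    return meta + (comment[0],) + tuple(scores)

# made-up row in the shape produced by the sentiment step
nested = (("t1_a", "user1", None, "1314835200", "t3_x", 3, 0, 1427000000, "pics"),
          ["nice gif"],
          (0.0, 0.5, 0.5, 0.4))
flat = flatten_row(nested)
print(len(flat))
# prints: 14
```

Applied with sentiment_rdd.map(flatten_row), this would yield the same 14 fields per row without indexing each position by hand.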

In [26]:
## each row is now one flat tuple
print(rdd_processed_.take(1))

In [27]:
## here I create a dataframe from rdd and give column names
df_subset = spark.createDataFrame(rdd_processed_).toDF("name","author",
                                                       "author_flair_text",
                                                       "unix_time","parent_id",
                                                       "ups","downs","retrieved_on",
                                                       "subreddit", "comment", "neg", 
                                                       "neu", "pos", "com")
# print(df_subset.show(1, truncate=False))


In [28]:
## show() prints the row itself and returns None, so display() is not needed
df_subset.show(1, truncate=False)

In [29]:
## save the df as csv; Spark writes a directory of part files at this path
df_subset.write.option("header", "true").csv("dbfs:/FileStore/tmp/df_09_subset.csv")

In [30]:
## check if the data is saved
## it will be used by another notebook 

In [31]:
%fs ls dbfs:/FileStore/tmp

path,name,size
dbfs:/FileStore/tmp/df_09_subset.csv/,df_09_subset.csv/,0
