# Reddit Comments Analysis (First Part)

### Download, preprocess and apply sentiment analyser

Reddit's monthly comment dumps are available as compressed archives that can be used for modeling.

They can be downloaded from https://files.pushshift.io/reddit/comments/.

Since the Databricks environment is limited to 10GB, only the smaller files are downloaded.

First, I install the required libraries, and then I download the comments for August 2011.

In [None]:
## nltk is required for sentiment analysis
!pip install nltk
!pip install wordcloud

In [None]:
## import important libs
from pyspark import SparkContext
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud 
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import json
import bz2
import string  # provides string.punctuation, used by remove_punct below

In [None]:
## download the punkt tokenizer, stopwords, VADER lexicon and WordNet data for nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')
nltk.download('wordnet')

In [None]:
## this saves the compressed file in the driver's local working directory
!wget 'https://files.pushshift.io/reddit/comments/RC_2011-08.bz2'

In [None]:
## this listing shows that the file is saved on the driver's local disk
%fs ls file:/databricks/driver/

In [None]:
## move the decompressed file from the driver's local disk to DBFS for further processing
## (RC_comment.txt is created by the decompression cells below, so run those first)
# dbutils.fs.mv("file:/databricks/driver/RC_2011-08.bz2", "dbfs:/tmp/RC_2011-08.bz2")
dbutils.fs.mv("file:/databricks/driver/RC_comment.txt", "dbfs:/tmp/RC_2011-08.txt")

In [None]:
## decompress the bz2 archive and read its content
with bz2.open("RC_2011-08.bz2", "rb") as f:
    content = f.read()

In [None]:
## write the decompressed content to a txt file
with open("RC_comment.txt", "wb") as f:
    f.write(content)
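
Reading the archive with `f.read()` holds the entire decompressed file in memory at once. A minimal sketch of a streaming alternative, using only the standard library, is shown here. The function name `decompress_bz2` and the chunk size are illustrative choices and not part of the original notebook.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress src_path into dst_path in fixed-size chunks."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj reads and writes chunk_size bytes at a time,
        # so the full decompressed file is never held in memory
        shutil.copyfileobj(src, dst, chunk_size)
```

With the file names used in this notebook, the call would be `decompress_bz2("RC_2011-08.bz2", "RC_comment.txt")`.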

In [None]:
## create an RDD from the text file for further processing
rdd = sc.textFile("dbfs:/tmp/RC_2011-08.txt")

In [None]:
## each line of the txt file is a JSON object
## keep only the unix time and the comment body
## also drop rows whose body contains '[deleted]'

## parse each line once, then pick the two fields and filter with a boolean predicate
rdd = rdd.map(json.loads).map(lambda rec: (rec['created_utc'], rec['body'])).filter(lambda line: '[deleted]' not in line[1])
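
The parsing and filtering step can be tried without Spark on a couple of lines. The sketch below is plain Python; the helper names `parse_comment` and `is_kept` and the two sample lines are made up for illustration, and they assume each line carries at least the `created_utc` and `body` fields used above.

```python
import json


def parse_comment(line):
    """Return (created_utc, body) for one JSON line of the dump."""
    record = json.loads(line)  # parse the line only once
    return record["created_utc"], record["body"]


def is_kept(row):
    """True when the comment body does not contain the '[deleted]' marker."""
    return "[deleted]" not in row[1]


# Hypothetical sample lines, not taken from the real dump
sample_lines = [
    '{"created_utc": "1312156800", "body": "Great post!", "author": "a"}',
    '{"created_utc": "1312156801", "body": "[deleted]", "author": "b"}',
]

rows = [row for row in map(parse_comment, sample_lines) if is_kept(row)]
print(rows)  # [('1312156800', 'Great post!')]
```

The same two helpers could be passed to Spark as `rdd.map(parse_comment).filter(is_kept)`.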

In [None]:
## lowercase the comments
rdd = rdd.map(lambda line: (line[0], line[1].lower()))

In [None]:
## here I create several functions for processing

lemmatizer = WordNetLemmatizer()
analyzer = SentimentIntensityAnalyzer()
stopwords_english=set(stopwords.words('english'))

def tokenize_sent(x):
    return nltk.sent_tokenize(x)
  

def tokenize_word(x):
    sent_split = [word for line in x for word in line.split()]
    return sent_split
  
def remove_stopwords(x):
    cleaned_sent = [w for w in x if w not in stopwords_english]
    return cleaned_sent
  

punct_words = set(string.punctuation)  # set gives fast membership tests

def remove_punct(x):
    # strip punctuation characters from every token
    cleaned_sent = [''.join(c for c in s if c not in punct_words) for s in x]
    cleaned_sent = [s for s in cleaned_sent if s]  # drop tokens that became empty strings
    return cleaned_sent


def lemmatize_sent(x):
    lemma = [lemmatizer.lemmatize(s) for s in x]
    return lemma
  
def join_tokens(x):
    # join the tokens back into one string, wrapped in a single-element list
    x = " ".join(x)
    return [x]
  
def sentiment_score(x):
    vs = analyzer.polarity_scores(x[0])
    return vs['neg'], vs['neu'], vs['pos'], vs['compound']


Each element of the RDD is a row.

Every function applied to the RDD is run on each row in turn.

In each row, line[0] is the unix time and line[1] is the comment as a string.

The functions are applied through lambdas and only transform line[1], since it contains the comment.
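
The row-wise pattern can be illustrated without Spark. The sketch below uses Python's built-in `map` on a small list of tuples; the sample rows and the helper `strip_punct` are hypothetical and only stand in for the processing functions defined above.

```python
import string

# Hypothetical rows in the same (unix_time, comment) shape as the RDD
rows = [
    ("1312156800", "hello, world!"),
    ("1312156801", "spark is fun."),
]


def strip_punct(text):
    """Remove ASCII punctuation characters from a string."""
    return "".join(c for c in text if c not in string.punctuation)


# Same shape as rdd.map(lambda line: (line[0], f(line[1]))):
# the timestamp passes through unchanged, only the comment is transformed
cleaned = list(map(lambda line: (line[0], strip_punct(line[1])), rows))
print(cleaned)  # [('1312156800', 'hello world'), ('1312156801', 'spark is fun')]
```

On a pair RDD, `mapValues` expresses the same idea, for example `rdd.mapValues(strip_punct)`, and leaves line[0] untouched.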

In [None]:
## tokenize comment sentences
sent_tokenize_rdd = rdd.map(lambda line: (line[0], tokenize_sent(line[1])))

## tokenize words
word_tokenize_rdd = sent_tokenize_rdd.map(lambda line: (line[0], tokenize_word(line[1])))

## remove stopwords from the tokens in line[1]
remove_stopword_rdd = word_tokenize_rdd.map(lambda line: (line[0], remove_stopwords(line[1])))

## remove punctuation from the tokens as well
remove_punct_rdd = remove_stopword_rdd.map(lambda line: (line[0], remove_punct(line[1])))

## find lemma for each token
lemmatize_rdd = remove_punct_rdd.map(lambda line: (line[0], lemmatize_sent(line[1])))

## join cleaned tokens for sentiment analysis
joined_tokens_rdd = lemmatize_rdd.map(lambda line: (line[0], join_tokens(line[1])))

## add the sentiment scores, then flatten each row to (time, comment, neg, neu, pos, compound)
sentiment_rdd = joined_tokens_rdd.map(lambda line: (line[0], line[1], sentiment_score(line[1])))
rdd = sentiment_rdd.map(lambda line: (line[0], line[1][0], line[2][0], line[2][1], line[2][2], line[2][3]))

In [None]:
## here I create a dataframe from rdd and give column names
df = spark.createDataFrame(rdd).toDF("date", "comment", "neg", "neu", "pos", "com")
df.show(5, truncate=False)
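
The "date" column holds the unix time taken from `created_utc`, not a calendar date. A minimal sketch of converting such a value to a readable UTC timestamp with the standard library follows. The helper name `to_utc_datetime` is illustrative, and the sample value is assumed to be either an integer or a numeric string.

```python
from datetime import datetime, timezone


def to_utc_datetime(created_utc):
    """Convert a Unix timestamp (int or numeric string) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(created_utc), tz=timezone.utc)


print(to_utc_datetime("1312156800").isoformat())  # 2011-08-01T00:00:00+00:00
```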

In [None]:
## store the data as CSV for later use (Spark's built-in csv writer)
df.write.option("header", "true").csv("dbfs:/FileStore/tmp/df.csv")