In [2]:
import pandas as pd
import numpy as np
import praw
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import re, string
import gc
from sklearn import preprocessing

#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')

This segment analyzes the frequency of words that appear in the most positively and most negatively scored comments from Reddit. The two nltk package downloads (commented out above) are required to tag individual words with their part of speech. This allows us to filter out the words we do not believe are relevant to a comment's popularity.

The word maps will contain every instance of a valid word within the comment texts. The pos_map is a list of the parts of speech (here "pos" stands for part of speech, not positive) that will be kept when filtering our word table.

Parts of speech in pos_map, with examples:
<ul>
  <li>JJ adjective 'big'
</li>
  <li>JJR adjective, comparative 'bigger'
</li>
  <li>JJS adjective, superlative 'biggest'
</li>
  <li>NN noun, singular 'desk'
</li>
  <li>NNS noun plural 'desks'
</li>
  <li>NNP proper noun, singular 'Harrison'
</li>
   <li>NNPS proper noun, plural 'Americans'
</li>
  <li>RB adverb 'very', 'silently'
</li>
  <li>RBR adverb, comparative 'better'
</li>
   <li>RBS adverb, superlative 'best'
</li>
  <li>RP particle 'up' (as in 'give up')
</li>
  <li>UH interjection 'errrrrrrrm'
</li>
   <li>VB verb, base form 'take'
</li>
  <li>VBD verb, past tense 'took'
</li>
  <li>VBG verb, gerund/present participle 'taking'
</li>
   <li>VBN verb, past participle 'taken'
</li>
  <li>VBP verb, non-3rd person singular present 'take'
</li>
  <li>VBZ verb, 3rd person singular present 'takes'
</li>
</ul>

The regular expression strips every nonalphanumeric character (including underscores) from each word. We use it instead of the natural language toolkit's tokenizer, which splits text according to its own rules and separates words that have nonalphanumeric characters between them, such as a hyphen. That behavior is not favorable here, and it also leaves duplicate words that differ only in capitalization. Splitting on whitespace, stripping with the regular expression, and lowercasing ensures that each word is a single token that contains no nonsense characters. The part-of-speech filter is then used to select real English words more accurately.
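As a minimal sketch of the cleaning described above, the snippet below applies the same pattern to a made-up sentence. The helper name `clean_tokens` and the sample text are assumptions introduced only for illustration; they do not appear in the notebook.

```python
import re

# Same pattern as the notebook: runs of non-word characters or underscores.
pattern = re.compile(r'[\W_]+', re.UNICODE)

def clean_tokens(text):
    """Split on whitespace, strip non-alphanumerics, lowercase, drop empties."""
    cleaned = (pattern.sub('', word).lower() for word in text.split())
    return [word for word in cleaned if word]

# Hypothetical sample comment.
sample = "Well-known FACT: it's *really* good -- trust_me!"
print(clean_tokens(sample))
# ['wellknown', 'fact', 'its', 'really', 'good', 'trustme']
```

Note that hyphenated words and contractions are fused rather than split ("Well-known" becomes "wellknown"), and tokens made up entirely of punctuation disappear.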

In [None]:
pos_word_map = []
neg_word_map = []
pos_map = ["JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "RB", "RBR", "RBS", "RP","UH", "VB", "VBG", "VBD", "VBN", "VBP", "VBZ"]
#Raw string so that \W is passed to the regex engine and not treated as an invalid string escape
pattern = re.compile(r'[\W_]+', re.UNICODE)

This section of code iterates through the positive comment table and copies each valid word to the pos_word_map as a tuple of the word and the corresponding comment score. This allows us to measure the score of a word relative to the scores of the comments it appears in. Every word within the text is assigned the score of the comment it came from, and a word that appears multiple times within the text receives one such assignment per appearance. Words therefore gain positive or negative predictive weight according to how often they appear in high or low scoring comments.

In [None]:
for i, row in posjson.iterrows():
    text = row.text
    new_text = text.split()
    results = []
    score = row.score
    for word in new_text:
        temp = pattern.sub('', word).lower()
        if temp != '':
            results.append((temp,score))
    pos_word_map.extend(results)
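To make the weighting concrete, here is a small plain-Python sketch of the same pairing on two invented comments. The `comments` list and its scores are hypothetical stand-ins for rows of posjson, not data from the notebook.

```python
from collections import Counter

# Hypothetical (text, score) pairs standing in for posjson rows.
comments = [("great great point", 120), ("great idea", 30)]

word_map = []
for text, score in comments:
    for word in text.split():
        # Each occurrence carries the score of the comment it came from.
        word_map.append((word.lower(), score))

frequency = Counter(word for word, _ in word_map)
raw_score = Counter()
for word, score in word_map:
    raw_score[word] += score

print(frequency["great"], raw_score["great"])  # 3 270
```

"great" appears twice in the first comment, so it collects that comment's score twice, plus once more from the second comment.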

The remaining code sections are intermediate steps in creating a cleaned-up DataFrame that we can work with. The list of (word, score) tuples is first converted into a DataFrame. Two further DataFrames are then built, one holding the frequency of each word within the entire corpus of Reddit comments in the json records and one holding the summed score associated with each word. The following cells show the process as these mappings are produced.

In [None]:
#Creates a DataFrame for the word,score tuples
pos_tokens_df = pd.DataFrame(pos_word_map, columns = ['word', 'score'])

In [None]:
#Creates a DataFrame with the frequency of each word in the total pos_word_map
#rename_axis + reset_index(name=...) gives the same column names across pandas versions
pos_count_df = pos_tokens_df.word.value_counts().rename_axis('word').reset_index(name='frequency')
pos_count_df.head(10)

In [None]:
#Creates a DataFrame for the positively scored words where the raw score is the sum of the
#comment scores over all instances of a word
pos_score_df = pos_tokens_df.groupby('word', as_index=False)['score'].sum()
pos_score_df.rename(columns={"score":"raw_score"}, inplace=True)
pos_score_df.head(10)

As seen above, the data contains a significant number of words that have no meaning either in English or in the context of comment score predictions. These words will be filtered out in a couple of steps. The frequency of each word was calculated first in order to help remove words that appear to have no significance, as well as infrequently used words that may bias the data too much because they occur in outlier comments.

In [None]:
#Creates a DataFrame with the average score of each word as the raw_score / word frequency
pos_df = pos_count_df.set_index('word').join(pos_score_df.set_index('word'), rsuffix='_r')
pos_df.reset_index(level=0, inplace=True)
pos_df.rename(columns={"score_r":"score"}, inplace=True)
pos_df['average_score'] = pos_df.raw_score / pos_df.frequency
pos_df.head(10)
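The frequency, raw score, and average score can also be derived in a single grouped aggregation. The sketch below is an alternative formulation on a toy table, not the notebook's method; the `tokens` data is hypothetical, and it assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

# Hypothetical (word, score) rows in the same shape as pos_tokens_df.
tokens = pd.DataFrame(
    [("great", 120), ("great", 120), ("point", 120), ("great", 30), ("idea", 30)],
    columns=["word", "score"],
)

# One pass: count the rows per word and sum their comment scores.
summary = (
    tokens.groupby("word")["score"]
    .agg(frequency="count", raw_score="sum")
    .reset_index()
)
summary["average_score"] = summary.raw_score / summary.frequency
print(summary)
```

Here "great" has a frequency of 3 and a raw score of 270, giving an average score of 90.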

In this step the natural language toolkit tags each word parsed in the previous segments with the part of speech it belongs to. This allows us to keep only words whose tag is in the list of parts of speech as valid entries in the DataFrame. Filtering out words that might not be English, and removing statistical outliers that appear only a handful of times, should make the data more accurate in predicting positively and negatively biased words. The frequency threshold was set to 50 as a safeguard buffer against words that did not appear to make any sense.

In [None]:
#Creates a DataFrame with the part of speech for each word added
#.copy() avoids pandas' SettingWithCopyWarning when the new column is assigned
pos_df = pos_df[pos_df.frequency > 50].copy()
pos_tags = nltk.pos_tag(pos_df.word.tolist())
#Assign the tags by position so the result does not depend on index alignment
pos_df['part_of_speech'] = [tag for _, tag in pos_tags]
pos_df = pos_df[pos_df.part_of_speech.isin(pos_map)].copy()
pos_df.reset_index(level=0,drop=True,inplace=True)
pos_df.head()
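The tag filter itself reduces to a membership test. The sketch below uses hand-written (word, tag) pairs in the shape that nltk.pos_tag returns, so it runs without the tagger model; both the pairs and the shortened `allowed_tags` set are assumptions made for illustration.

```python
# A subset of the notebook's pos_map, shortened for brevity.
allowed_tags = {"JJ", "NN", "NNS", "RB", "VB"}

# Hypothetical (word, tag) pairs, hand-written rather than produced by nltk.
tagged = [("good", "JJ"), ("the", "DT"), ("desk", "NN"), ("of", "IN"), ("quickly", "RB")]

# Keep only the words whose tag is in the allowed set.
kept = [word for word, tag in tagged if tag in allowed_tags]
print(kept)  # ['good', 'desk', 'quickly']
```

One caveat worth keeping in mind: the tagger is designed to use the surrounding words of a sentence, so tags assigned to a bare vocabulary list are likely to be less reliable than tags assigned in context.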

It is important to clean up the DataFrames that are no longer in use, because the overhead of keeping these DataFrames and arrays in memory is too high. Leaving them in place may cause the kernel to terminate.

In [None]:
#Cleanup for memory
del pos_tokens_df, pos_score_df, pos_word_map, pos_count_df, pos_tags
gc.collect()

This section of code repeats the steps above for the most negatively scored comments.

In [None]:
for i, row in negjson.iterrows():
    text = row.text
    new_text = text.split()
    results = []
    score = row.score
    for word in new_text:
        temp = pattern.sub('', word).lower()
        if temp != '':
            results.append((temp,score))
    neg_word_map.extend(results)

In [None]:
#Creates a DataFrame for the word,score tuples
neg_tokens_df = pd.DataFrame(neg_word_map, columns = ['word', 'score'])

In [None]:
#Creates a DataFrame with the frequency of each word in the total neg_word_map
#rename_axis + reset_index(name=...) gives the same column names across pandas versions
neg_count_df = neg_tokens_df.word.value_counts().rename_axis('word').reset_index(name='frequency')
neg_count_df.head(10)

In [None]:
#Creates a DataFrame for the negatively scored words where the raw score is the sum of the
#comment scores over all instances of a word
neg_score_df = neg_tokens_df.groupby('word', as_index=False)['score'].sum()
neg_score_df.rename(columns={"score":"raw_score"}, inplace=True)
neg_score_df.head(10)

In [None]:
#Creates a DataFrame with the average score of each word as the raw_score / word frequency
neg_df = neg_count_df.set_index('word').join(neg_score_df.set_index('word'), rsuffix='_r')
neg_df.reset_index(level=0, inplace=True)
neg_df.rename(columns={"score_r":"score"}, inplace=True)
neg_df['average_score'] = neg_df.raw_score / neg_df.frequency
neg_df.head(10)

In [None]:
#Creates a DataFrame with the part of speech for each word added
#.copy() avoids pandas' SettingWithCopyWarning when the new column is assigned
neg_df = neg_df[neg_df.frequency > 50].copy()
neg_tags = nltk.pos_tag(neg_df.word.tolist())
#Assign the tags by position so the result does not depend on index alignment
neg_df['part_of_speech'] = [tag for _, tag in neg_tags]
neg_df = neg_df[neg_df.part_of_speech.isin(pos_map)].copy()
neg_df.reset_index(level=0,drop=True,inplace=True)
neg_df.head()

In [None]:
#Cleanup for memory
del neg_tokens_df, neg_score_df, neg_word_map, neg_count_df, neg_tags
gc.collect()

Now that the data has been organized for preprocessing, we use sklearn's preprocessing module to normalize and standardize each of the numeric columns of both the pos_df and neg_df DataFrames. The most important columns to look at are the normalized and standardized average scores, since the average score represents the relationship between the frequency of a word and the word's raw score. The normalization and standardization allow us to make statistical observations about certain words relative to other words in the data set.

In [None]:
df_num_pos = pos_df.select_dtypes(include=[np.number])
df_num_neg = neg_df.select_dtypes(include=[np.number])
df_num_neg = df_num_neg.abs()

min_max_scaler = preprocessing.MinMaxScaler()
standard_scaler = preprocessing.StandardScaler()
pos_df = pos_df.join(pd.DataFrame(min_max_scaler.fit_transform(df_num_pos), columns=df_num_pos.columns, 
                                  index=df_num_pos.index), rsuffix='_normalized')
neg_df = neg_df.join(pd.DataFrame(min_max_scaler.fit_transform(df_num_neg), columns=df_num_neg.columns, 
                                  index=df_num_neg.index), rsuffix='_normalized')
pos_df = pos_df.join(pd.DataFrame(standard_scaler.fit_transform(df_num_pos), columns=df_num_pos.columns, 
                                  index=df_num_pos.index), rsuffix='_standardized')
neg_df = neg_df.join(pd.DataFrame(standard_scaler.fit_transform(df_num_neg), columns=df_num_neg.columns, 
                                  index=df_num_neg.index), rsuffix='_standardized')
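For reference, the two transforms can be written out by hand. This is a plain-Python sketch on invented values, assuming the scalers' default settings: min-max scaling maps each value to (x - min) / (max - min), and standardization maps it to (x - mean) / std using the population standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical average scores for four words.
values = [2.0, 4.0, 4.0, 10.0]

# Min-max normalization: rescale to the range [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization: zero mean, unit (population) standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

print(normalized)  # [0.0, 0.25, 0.25, 1.0]
print(mu, sigma)   # 5.0 3.0
```

Because the standardization divides by the population standard deviation while pandas' `.std()` defaults to the sample standard deviation, the standard deviation printed for a standardized column can come out slightly above 1.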

This graph shows the normalized average score for both the words in the top 180,000 positive comments in green and the words in the top 180,000 negative comments in orange. The negatively scored words were normalized with their absolute value to present data that can be directly comparable to the positive words.

In [None]:
sns.kdeplot(pos_df.average_score_normalized, shade=True, color='Green')
sns.kdeplot(neg_df.average_score_normalized, shade=True, color='Orange')
print("Positive Comments (Mean Normalized) : " + str(pos_df.average_score_normalized.mean()))
print("Positive Comments (Std Normalized)  : " + str(pos_df.average_score_normalized.std()))
print("Negative Comments (Mean Normalized) : " + str(neg_df.average_score_normalized.mean()))
print("Negative Comments (Std Normalized)  : " + str(neg_df.average_score_normalized.std()))
print("Difference (Mean Normalized)        : " + str(pos_df.average_score_normalized.mean() - neg_df.average_score_normalized.mean()))
print("Difference (Std Normalized)         : " + str(pos_df.average_score_normalized.std() - neg_df.average_score_normalized.std()))

The second graph plots the standardized average scores for the positive and negative words. With the data standardized, we can now compare the words across positive and negative DataFrames to determine which words are the most positive, neutral, or negative by frequency and comment score.

In [None]:
sns.kdeplot(pos_df.average_score_standardized, shade=True, color='Green')
sns.kdeplot(neg_df.average_score_standardized, shade=True, color='Orange')
print("Positive Comments (Mean Standardized) : " + str(pos_df.average_score_standardized.mean()))
print("Positive Comments (Std Standardized)  : " + str(pos_df.average_score_standardized.std()))
print("Negative Comments (Mean Standardized) : " + str(neg_df.average_score_standardized.mean()))
print("Negative Comments (Std Standardized)  : " + str(neg_df.average_score_standardized.std()))
print("Difference (Mean Standardized)        : " + str(pos_df.average_score_standardized.mean() - neg_df.average_score_standardized.mean()))
print("Difference (Std Standardized)         : " + str(pos_df.average_score_standardized.std() - neg_df.average_score_standardized.std()))

This chart shows the top 10 words by average standardized score for positive comments. The distribution has a long right tail, with scores many standard deviations above the mean. This can be attributed to the fact that these data sets look only at the 180,000 highest and 180,000 lowest scored comments out of a dataset that contains 380 million comments. Because the majority of neutrally voted comments are eliminated, some words are heavily biased: they have a low frequency count but appear in a few very highly scored comments, whether positive or negative. Given an analysis of enough data points, the top 9 words are therefore highly likely to fall in the rankings, since all 9 rank below 3700 by frequency within the corpus.

In [None]:
pos_df.sort_values(by="average_score_standardized", ascending=False).head(10)

In [None]:
neg_df.sort_values(by="average_score_standardized", ascending=False).head(10)
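The low-frequency bias can be illustrated with simple arithmetic. All numbers below are hypothetical and chosen only to show the effect: the same outlier comment barely moves the average of a frequent word but dominates the average of a rare one.

```python
# Hypothetical comment scores attached to each occurrence of a word.
outlier = 40000
common_scores = [100] * 5000            # frequent word
rare_scores = [100] * 50                # rare word, just at the frequency cutoff

def average(scores):
    return sum(scores) / len(scores)

common_avg = average(common_scores + [outlier])
rare_avg = average(rare_scores + [outlier])

print(round(common_avg, 2))  # 107.98
print(round(rare_avg, 2))    # 882.35
```

A single extra occurrence in a 40,000-point comment raises the frequent word's average by about 8%, while the rare word's average grows almost ninefold.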

The words from the positive and negative lists are combined to get a standardized score that accounts for the difference between the positive comments and the negative comments.

In [None]:
inner_join = pd.merge(left=pos_df, right=neg_df, left_on='word',right_on='word')

In [None]:
#.copy() so the in-place edits below act on independent frames rather than slices of pos_df/neg_df
left = pos_df.loc[~pos_df.word.isin(neg_df.word)].copy()
right = neg_df.loc[~neg_df.word.isin(pos_df.word)].copy()
right[right.select_dtypes(include=[np.number]).columns] *= -1
diff_join = pd.DataFrame(inner_join.word)
diff_join['diff_standardized_score'] = inner_join.average_score_standardized_x - inner_join.average_score_standardized_y
left.rename(columns={"average_score_standardized":"diff_standardized_score"}, inplace=True)
right.rename(columns={"average_score_standardized":"diff_standardized_score"}, inplace=True)
diff_df = pd.concat([diff_join, left[["word","diff_standardized_score"]], right[["word","diff_standardized_score"]]])
diff_df.sort_values(by="diff_standardized_score", ascending=False,inplace=True)
diff_df.reset_index(level=0, drop=True, inplace=True)
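The combination logic can be summarized in a few lines of plain Python. This is a sketch on hypothetical scores, not the notebook's data: words found in both lists get the positive score minus the negative score, positive-only words keep their score, and negative-only words have theirs negated.

```python
# Hypothetical standardized average scores per word.
pos_scores = {"thanks": 1.5, "edit": 0.4, "gold": 2.0}
neg_scores = {"edit": 1.1, "downvote": 2.5}

diff = {}
for word in pos_scores.keys() | neg_scores.keys():
    # A missing score counts as 0, which reproduces all three cases above.
    diff[word] = pos_scores.get(word, 0.0) - neg_scores.get(word, 0.0)

ranked = sorted(diff.items(), key=lambda item: item[1], reverse=True)
print([word for word, _ in ranked])
# ['gold', 'thanks', 'edit', 'downvote']
```

Treating a missing score as 0 is a simplification; it assumes that a word absent from one list carries no signal on that side.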

Because we have combined both standardized lists and accounted for the negative words as negative values, the distribution looks similar to the one above. Negating the scores of the negative words causes the graph to be mirrored at approximately x = 0.

In [None]:
sns.kdeplot(diff_df.diff_standardized_score, shade=True, color='crimson')

This is a list of the top 15 words that predict a positively scored comment.

In [None]:
diff_df.head(15)

This is a list of the top 15 words that predict a negatively scored comment.

In [None]:
diff_df.tail(15)

This is a list of the 10 words on each side of a standardized score of 0, which should indicate words that are neutral in predicting the positive or negative score of a comment.

In [None]:
diff_df.loc[3559:3579]
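The index range above is specific to this data set. As a hedged sketch, the position of the score closest to zero can instead be located programmatically; the `scores` list and the `window` size here are hypothetical stand-ins for the sorted diff_standardized_score column.

```python
# Hypothetical difference scores, sorted in descending order like diff_df.
scores = [2.1, 0.9, 0.3, 0.05, -0.02, -0.4, -1.7]

# Position of the score closest to zero.
center = min(range(len(scores)), key=lambda i: abs(scores[i]))

# Take `window` entries on each side of that position.
window = 2
neutral = scores[max(center - window, 0): center + window + 1]

print(center)   # 4
print(neutral)  # [0.3, 0.05, -0.02, -0.4, -1.7]
```

With a window of 10, the same idea would select roughly the rows shown in the cell above without hard-coding their labels.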