Python notebook to analyse the crawled subreddit data and prepare a CSV for further analysis.
It takes the raw CSV data crawled from a subreddit, calculates the required features, and exports a filtered, clean CSV with the feature vectors.

In [None]:
#Download and import all dependencies
!pip install twython #for Colab, to download the twython lib used by SIA
from google.colab import drive #to mount G drive
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
import pandas as pd
import warnings



In [None]:
#Mount Google Drive; provide the auth code to read from/write to the drive
drive.mount("/drive")
warnings.filterwarnings('ignore')

#VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. 
nltk.download('vader_lexicon')

Mounted at /drive
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [None]:
raw_data = pd.read_csv("/drive/My Drive/Colab Notebooks/emacs_raw.csv") 
raw_data.loc[raw_data['selftext'].isnull(), 'selftext'] = " "
print("raw_data : ",raw_data.shape)

#filtered_data = raw_data[raw_data['selftext'].notnull()] #Keeping the posts with null 'content', will use title of the post for analysis!
filtered_data = raw_data.loc[raw_data['selftext'] != "[removed]"] #drop posts whose body was replaced by "[removed]"

print("filtered_data : ",filtered_data.shape)
filtered_data.head()

raw_data :  (1353, 15)
filtered_data :  (1255, 15)


Unnamed: 0,author,author_fullname,created_utc,domain,full_link,is_crosspostable,link_flair_text,num_comments,num_crossposts,over_18,permalink,score,selftext,title,total_awards_received
0,w3_ar3_l3g10n,t2_42lfkqcf,01-01-2020,self.emacs,https://www.reddit.com/r/emacs/comments/eid5nw...,True,,7,0,False,/r/emacs/comments/eid5nw/thoughts_on_a_clients...,1,I’ve spent the day trying to configure emacs-s...,Thoughts on a client-server editor model like ...,0
1,github-alphapapa,t2_17be1v66,01-01-2020,github.com,https://www.reddit.com/r/emacs/comments/eidj07...,True,,1,0,False,/r/emacs/comments/eidj07/makemsh_makefilelike_...,1,,makem.sh: Makefile-like script for easily test...,0
4,rock-emacs,t2_3bc1xzt6,01-01-2020,self.emacs,https://www.reddit.com/r/emacs/comments/eifdhf...,True,,7,0,False,/r/emacs/comments/eifdhf/changing_the_behavior...,1,Perhaps the only thing I miss about Vim is how...,Changing the behavior of RET in Emacs to be li...,0
5,_priyadarshan,t2_1cqjsfrj,01-01-2020,self.emacs,https://www.reddit.com/r/emacs/comments/eigot5...,True,,6,0,False,/r/emacs/comments/eigot5/running_notmuch_on_wi...,1,Currently `notmuch` is not available on Window...,Running notmuch on Windows Emacs,0
6,karlicoss,t2_8h0l2,01-01-2020,self.orgmode,https://www.reddit.com/r/emacs/comments/eihvrq...,True,,2,0,False,/r/emacs/comments/eihvrq/orgsync_synchronize_y...,1,,org-sync: synchronize your github/gitlab issue...,0
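
The filtering step above can be sketched on a toy DataFrame. This is a minimal illustration, not the notebook's own code: the toy rows are made up, and treating "[deleted]" as a second placeholder alongside "[removed]" is an assumption (the notebook filters only "[removed]").

```python
import pandas as pd

# Toy stand-in for the raw crawl; the column names match the notebook, the rows are invented.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "selftext": ["some text", None, "[removed]", "[deleted]"],
})

# Link posts have no body; use a blank placeholder so the scorer still receives a string.
toy["selftext"] = toy["selftext"].fillna(" ")

# Assumption: "[deleted]" is dropped as well; the notebook itself drops only "[removed]".
placeholders = ["[removed]", "[deleted]"]
kept = toy.loc[~toy["selftext"].isin(placeholders)].reset_index(drop=True)

print(kept.shape)
```

Resetting the index after filtering keeps the row labels contiguous, which makes it safer to align this frame with others by index later on.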


In [None]:
#Calculate sentiment of the title and post.
sia = SIA()
post_sentiment_results = []
posts = filtered_data['selftext'].tolist()

for post in posts:
    pol_score = sia.polarity_scores(post)
    pol_score['post'] = post
    post_sentiment_results.append(pol_score)

df = pd.DataFrame.from_records(post_sentiment_results)

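#Label by VADER compound score: > 0.2 is Positive, < -0.2 is Negative, anything in between stays Neutral
df['post_sentiment'] = "Neutral"
df.loc[df['compound'] > 0.2, 'post_sentiment'] = "Positive"
df.loc[df['compound'] < -0.2, 'post_sentiment'] = "Negative"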
#print(df.head())

title_sentiment_results =[]
titles = filtered_data['title'].tolist()
for title in titles:
    pol_score = sia.polarity_scores(title)
    pol_score['title'] = title
    title_sentiment_results.append(pol_score)

df2 = pd.DataFrame.from_records(title_sentiment_results)
df2['title_sentiment'] = "Neutral"
df2.loc[df2['compound'] > 0.2, 'title_sentiment'] = "Positive"
df2.loc[df2['compound'] < -0.2, 'title_sentiment'] = "Negative"
#print(df2.head())


#Define the final feature dataframe with the required features for further analysis
feature_df = df[['post', 'post_sentiment']].copy() #copy, so the column assignments below do not write to a slice of df
feature_df[['title', 'title_sentiment']] = df2[['title', 'title_sentiment']].to_numpy()
feature_df['reddit_score'] = filtered_data['score'].values
feature_df['num_comments'] = filtered_data['num_comments'].values
feature_df['date_created'] = filtered_data['created_utc'].values
feature_df['author'] = filtered_data['author'].values

feature_df = feature_df.reindex(columns=['post','title','date_created','author','post_sentiment','title_sentiment','reddit_score','num_comments'])
print(feature_df.shape)
feature_df.head()

(1132, 8)


Unnamed: 0,post,title,date_created,author,post_sentiment,title_sentiment,reddit_score,num_comments
0,"Previously, vim-clap's main feature is the ful...",vim-clap: to be a performant fuzzy finder and ...,01-01-2020,liuchengxu,Positive,Neutral,1,41
1,"Hello all, recently I wrote a new vim plugin '...",New plugin: vim-text-lists,01-01-2020,arumoy_shome,Positive,Neutral,1,0
2,,How many of yiuhrol keep accidentally doing th...,01-01-2020,nebulaeandstars,Neutral,Negative,1,0
3,The task i'm trying to accomplish is to remove...,What would be the best way to format this code...,01-01-2020,eliseu_videira,Positive,Positive,1,24
4,"I've been using vim for a couple of years, and...",TIL - Run lines of code and get the output ins...,01-01-2020,Hollow_5oul,Positive,Neutral,1,25
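
The post and title loops in the cell above apply the same rule twice: score the text, then map the compound score to a label with a cutoff of 0.2. A small helper could express that once. The sketch below is hypothetical: `label_sentiment`, `score_texts`, and `fake_polarity` are illustrative names, and the stand-in scorer exists only so the example runs without NLTK. In the notebook the scorer would be `sia.polarity_scores`.

```python
def label_sentiment(compound, threshold=0.2):
    """Map a compound score to a coarse label using a symmetric cutoff."""
    if compound > threshold:
        return "Positive"
    if compound < -threshold:
        return "Negative"
    return "Neutral"


def score_texts(texts, polarity_fn, threshold=0.2):
    """Return one (text, label) pair per input, using polarity_fn(text)['compound']."""
    return [(t, label_sentiment(polarity_fn(t)["compound"], threshold)) for t in texts]


# Stand-in scorer (an assumption for this sketch); it is not VADER and its scores are arbitrary.
def fake_polarity(text):
    if "great" in text:
        return {"compound": 0.6}
    if "broken" in text:
        return {"compound": -0.5}
    return {"compound": 0.0}


labels = score_texts(["great plugin", "config is broken", " "], fake_polarity)
print(labels)
```

Scores exactly at the cutoff (0.2 or -0.2) stay Neutral, matching the strict comparisons used in the notebook.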


In [None]:
#write the feature dataframe to a CSV file
feature_df.to_csv('/drive/My Drive/Colab Notebooks/vim_feature.csv', mode='w', encoding='utf-8', index=False)