# Static analysis

## Intro - importing

In [1]:
import pandas as pd
import pickle
import numpy as np

In [2]:
with open("data/comments_cleaned", 'rb') as file:
    comments = pickle.load(file)
    
with open("data/submissions_cleaned", 'rb') as file:
    submissions = pickle.load(file)

In [3]:
submissions

Unnamed: 0,submissions_id,url,permalink,author,created_utc,subreddit,subreddit_id,num_comments,score,over_18,distinguished,domain,stickied,locked,hide_score,id
0,648oo,http://www.ignorancedenied.com/viewthread.php?...,/r/reddit.com/comments/648oo/brain_disease_is_...,DITUS,1199145615,reddit.com,t5_6,1,0,False,,ignorancedenied.com,False,False,False,0
1,648op,http://www.flascience.org/wp/?p=363,/r/science/comments/648op/three_more_florida_c...,rmuser,1199145634,science,t5_mouw,5,20,False,,flascience.org,False,False,False,1
2,648or,http://hosted.ap.org/dynamic/stories/O/ODD_SHO...,/r/reddit.com/comments/648or/nude_couple_grapp...,zorno,1199145709,reddit.com,t5_6,1,3,False,,hosted.ap.org,False,False,False,2
3,648os,http://www.sltrib.com/opinion/ci_7846101?sourc...,/r/politics/comments/648os/apparently_bushs_pr...,rmuser,1199145735,politics,t5_2cneq,2,0,False,,sltrib.com,False,False,False,3
4,648ot,http://hosted.ap.org/dynamic/stories/O/ODD_RAR...,/r/reddit.com/comments/648ot/diners_find_rare_...,zorno,1199145735,reddit.com,t5_6,0,0,False,,hosted.ap.org,False,False,False,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2044805,7mq3n,http://ventaboutsports.blogspot.com/2008/12/so...,/r/funny/comments/7mq3n/some_extremely_corny_j...,themightymidget,1230767909,funny,t5_2qh33,0,1,False,,ventaboutsports.blogspot.com,False,False,False,2044805
2044806,7mq3o,http://www.pbs.org/mormons/etc/genealogy.html,/r/news/comments/7mq3o/pbs_looks_at_the_massiv...,Tom22,1230767926,news,t5_2qh3l,0,0,False,,pbs.org,False,False,False,2044806
2044807,7mq3q,http://www.narutogames.biz,/r/reddit.com/comments/7mq3q/naruto_games/,bixiebix,1230767937,reddit.com,t5_6,7,1,False,,narutogames.biz,False,False,False,2044807
2044808,7mq3r,http://www.youtube.com/watch?v=gdQH1CI4LHY&amp...,/r/politics/comments/7mq3r/ron_paul_on_recent_...,middkidd,1230767963,politics,t5_2cneq,3,1,False,,youtube.com,False,False,False,2044808


In [4]:
comments

Unnamed: 0,comments_id,author,link_id,parent_id,created_utc,subreddit,subreddit_id,score,distinguished,gilded,controversiality,id
0,c02s9s6,Haven,648oh,t1_c02s9rv,1199145604,reddit.com,t5_6,4,,0,0,0
1,c02s9s8,lilmiss2,648oh,t1_c02s9rv,1199145620,reddit.com,t5_6,2,,0,0,1
2,c02s9sc,EverybodysAnAsshole,648et,t1_c02s976,1199145644,reddit.com,t5_6,2,,0,0,2
3,c02s9sd,generalk,647yd,t1_c02s8md,1199145647,programming,t5_2fwo,13,,0,0,3
4,c02s9se,seeker135,6483n,t3_6483n,1199145650,politics,t5_2cneq,4,,0,0,4
...,...,...,...,...,...,...,...,...,...,...,...,...
4873684,c06vwud,CommodoreGuff,7k1l5,t1_c06vpzj,1229579674,programming,t5_2fwo,1,,0,0,4873684
4873685,c06vwue,wolfzero,7k4if,t1_c06vs7l,1229579675,technology,t5_2qh16,4,,0,0,4873685
4873686,c06vwug,Morgin_Black,7k3w5,t3_7k3w5,1229579679,comics,t5_2qh0s,0,,0,0,4873686
4873687,c06vwui,onezerozeroone,7k2bc,t1_c06vrvz,1229579685,atheism,t5_2qh2p,1,,0,0,4873687


## Actual start of the analysis

### Q1: How many unique subreddits occur? Which has the most comments, and which has the most active users?

I will split this into three steps:

1. Count the unique subreddits across submissions and comments.
2. Group the comments by subreddit and count them.
3. Group the authors by subreddit and count the unique ones.

> I could have used *subreddit_id*, but the subreddit name is also unique!
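
The note above assumes that subreddit names and ids map one-to-one. A minimal sketch of how that assumption could be checked, using made-up toy rows in place of the real frames (`names_match_ids` is a hypothetical helper, not part of the notebook):

```python
import pandas as pd

# Made-up toy rows standing in for the real `submissions` / `comments` frames.
toy = pd.DataFrame({
    "subreddit":    ["science", "science", "politics", "reddit.com"],
    "subreddit_id": ["t5_mouw", "t5_mouw", "t5_2cneq", "t5_6"],
})


def names_match_ids(df):
    """Return True if each subreddit name has exactly one id and vice versa."""
    ids_per_name = df.groupby("subreddit")["subreddit_id"].nunique()
    names_per_id = df.groupby("subreddit_id")["subreddit"].nunique()
    return bool((ids_per_name == 1).all() and (names_per_id == 1).all())


print(names_match_ids(toy))
```

On the real data the same function could be called on the concatenation of the `subreddit` and `subreddit_id` columns of both frames.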


In [6]:
subreddits_authors = pd.concat([submissions[['subreddit', 'author']], comments[['subreddit', 'author']]], ignore_index=True)
subreddits_authors = subreddits_authors.groupby('subreddit').agg({'author': "nunique"})
subreddits_authors = subreddits_authors.sort_values(by="author", ascending=False)
# subreddits_authors

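Two birds, one stone: the grouped frame has one row per subreddit, so its row count is also the number of unique subreddits.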

In [7]:
print("Number of Subreddits:\n\n", subreddits_authors.shape[0])

Number of Subreddits:

 4359


In [8]:
print("Top 10 with most users:\n\n", subreddits_authors.iloc[:10])

Top 10 with most users:

                author
subreddit            
reddit.com     163779
politics        38374
pics            29753
technology      28337
funny           28186
entertainment   26360
science         25854
programming     25819
business        25253
worldnews       24937


In [9]:
comments_size = comments[['subreddit']].groupby('subreddit').size().reset_index(name='counts')
comments_size = comments_size.sort_values(by="counts", ascending=False)
# comments_size

In [10]:
print("Subreddits with the most comments:\n\n", comments_size.iloc[:10])

Subreddits with the most comments:

         subreddit   counts
1809   reddit.com  1143183
1755     politics   801396
1777  programming   345997
1741         pics   286192
1867      science   238291
2144    worldnews   228793
836           WTF   187876
1305        funny   175547
1993   technology   149803
73      AskReddit   139760


### Q2: Average number of users per subreddit?

In [11]:
print("Average users per subreddit:\n\n", round(subreddits_authors['author'].mean(), 5))

Average users per subreddit:

 148.66506


### Q3: Users with the most submissions, users with the most comments

In [12]:
authors_of_submissions = submissions[['author']].groupby('author').size().reset_index(name='counts')
authors_of_submissions = authors_of_submissions.sort_values(by="counts", ascending=False)
# authors_of_submissions

In [13]:
print("Users with the most submissions:\n\n", authors_of_submissions.iloc[:10])

Users with the most submissions:

                   author  counts
84823                gst   18870
141813             qgyh2   12238
147359            rmuser    9822
173691            twolf1    8597
13172   IAmperfectlyCalm    8308
141766         qazamisan    6927
54960          charlatan    5998
90683           igeldard    5373
130852          noname99    5334
64933       democracy101    5332


In [14]:
authors_of_comments = comments[['author']].groupby('author').size().reset_index(name='counts')
authors_of_comments = authors_of_comments.sort_values(by="counts", ascending=False)
# authors_of_comments

In [15]:
print("Users with the most comments:\n\n", authors_of_comments.iloc[:10])

Users with the most comments:

                  author  counts
12598   NoMoreNicksLeft   13480
56871        malcontent   12159
57872            matts2   11672
58883        mexicodoug    9169
650                7oby    9161
21027         aletoledo    8085
61554          mutatron    7771
65056         otakucode    7759
69965  redditcensoredme    7468
43578            h0dg3s    7439
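
The group, size, sort pattern used in the cells above can also be written with `Series.value_counts`, which counts and sorts in descending order in one call. A minimal sketch on made-up data (`top_authors` is a hypothetical helper, not part of the notebook); rows with equal counts may come out in a different order than with `sort_values`:

```python
import pandas as pd


def top_authors(df, n=10):
    """Return the n most frequent values of the `author` column with their counts."""
    return df["author"].value_counts().head(n)


# Made-up toy rows standing in for `submissions` or `comments`.
toy = pd.DataFrame({"author": ["a", "b", "a", "c", "a", "b"]})
top = top_authors(toy, n=2)
print(top)
```

Calling `top_authors(submissions)` or `top_authors(comments)` should give the same top-10 counts as the tables above, indexed by author name instead of by row number.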


### Q4: Which users are active on the largest number of subreddits? How many subreddits are they active on?

In [16]:
authors_on_subreddits = pd.concat([submissions[['subreddit', 'author']], comments[['subreddit', 'author']]], ignore_index=True)
authors_on_subreddits = authors_on_subreddits.groupby('author').agg({'subreddit': "nunique"})
authors_on_subreddits = authors_on_subreddits.sort_values(by="subreddit", ascending=False)
# authors_on_subreddits

In [17]:
print("Most active users and the number of subreddits they were active on:\n\n", authors_on_subreddits.iloc[:10])

Most active users and the number of subreddits they were active on:

                 subreddit
author                   
MrKlaatu              181
Escafane              154
omfgninja             122
scientologist2        111
codepoet              111
turkourjurbs          110
b34nz                 107
Sylveran-01           107
krugerlive            106
tuoder                103


### Q5: Determine the correlation between the number of submissions and the number of comments per user. Compute Pearson's coefficient and visualize the results

> https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

In [18]:
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
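
The cell above only imports the tools. The following is a minimal sketch of how the correlation could be computed and plotted, not the notebook's own solution. It assumes both frames have an `author` column (as shown in the intro) and that users who appear in only one table should count as 0 in the other. The helpers `activity_per_user` and `plot_activity` and the toy rows are made up for illustration.

```python
import pandas as pd
from scipy.stats import pearsonr
import matplotlib.pyplot as plt


def activity_per_user(submissions, comments):
    """Return one row per author with submission and comment counts."""
    n_sub = submissions.groupby("author").size().rename("n_submissions")
    n_com = comments.groupby("author").size().rename("n_comments")
    # concat aligns on the author index (outer join); an author missing
    # from one table gets NaN there, which is replaced by 0.
    return pd.concat([n_sub, n_com], axis=1).fillna(0).astype(int)


def plot_activity(activity, r):
    """Scatter plot of comments against submissions per user."""
    fig, ax = plt.subplots()
    ax.scatter(activity["n_submissions"], activity["n_comments"], s=5, alpha=0.3)
    # symlog keeps users with a count of 0 visible on a log-like axis.
    ax.set_xscale("symlog")
    ax.set_yscale("symlog")
    ax.set_xlabel("submissions per user")
    ax.set_ylabel("comments per user")
    ax.set_title(f"Pearson r = {r:.3f}")
    return ax


# Made-up toy rows standing in for the real frames.
toy_submissions = pd.DataFrame({"author": ["a", "a", "a", "b", "c"]})
toy_comments = pd.DataFrame({"author": ["a"] * 6 + ["b"] * 2 + ["d"]})

activity = activity_per_user(toy_submissions, toy_comments)
r, p_value = pearsonr(activity["n_submissions"], activity["n_comments"])
print(activity)
print(f"Pearson r = {r:.4f}, p-value = {p_value:.4f}")
```

On the real data this would be `activity = activity_per_user(submissions, comments)`, followed by `plot_activity(activity, r)` and `plt.show()`. Pearson's coefficient is sensitive to a few very active users, so the scatter plot is worth reading alongside the number.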

### Q6: Submissions with the most comments and their subreddits. Show their info (subreddit and context), skipping over_18 submissions

In [53]:
comments_submissions = comments[["link_id"]].groupby("link_id").size().reset_index(name="counts")
comments_submissions = comments_submissions.sort_values(by="counts", ascending=False)
comments_submissions

Unnamed: 0,link_id,counts
560615,7kpe5,2007
207630,6nz1k,1866
37700,675oj,1500
160264,6jbc0,1413
464558,7beo2,1375
...,...,...
267148,6to4r,1
267147,6to4o,1
267137,6to3q,1
267136,6to3l,1


> These counts are unfiltered: they still include comments on NSFW submissions.


In [69]:
sfw_comments = pd.merge(submissions[submissions["over_18"] == False], comments_submissions, how="inner", left_on="submissions_id", right_on="link_id")
sfw_comments = sfw_comments.sort_values(by="counts", ascending=False)
sfw_comments.iloc[:10]

Unnamed: 0,submissions_id,url,permalink,author,created_utc,subreddit,subreddit_id,num_comments,score,over_18,distinguished,domain,stickied,locked,hide_score,id,link_id,counts
156477,6nz1k,http://hundredpushups.com,/r/science/comments/6nz1k/got_six_weeks_try_th...,zekel,1213826517,science,t5_mouw,33329,1621,False,,hundredpushups.com,False,False,False,747502,6nz1k,1866
27473,675oj,https://www.reddit.com/r/reddit.com/comments/6...,/r/reddit.com/comments/675oj/post_the_funniest...,matiasklein,1201730171,reddit.com,t5_6,2039,1098,False,,self.reddit.com,False,False,False,110326,675oj,1500
352438,7beo2,https://www.reddit.com/r/politics/comments/7be...,/r/politics/comments/7beo2/obama_wins_the_pres...,willjohnston,1225857637,politics,t5_2cneq,1934,8538,False,,self.politics,False,False,False,1626962,7beo2,1375
27703,676ja,http://www.washingtonpost.com/wp-dyn/content/c...,/r/reddit.com/comments/676ja/new_study_confirm...,rpi22,1201748070,reddit.com,t5_6,1377,669,False,,washingtonpost.com,False,False,False,111156,676ja,1297
204084,6tvaz,https://www.reddit.com/r/politics/comments/6tv...,/r/politics/comments/6tvaz/im_a_bleedingheart_...,TheRealStick,1217288193,politics,t5_2cneq,1425,788,False,,self.politics,False,False,False,971996,6tvaz,1076
435527,7m6m4,http://www.dailykos.com/story/2008/12/28/11443...,/r/worldnews/comments/7m6m4/today_i_end_my_sup...,Schlichten,1230542729,worldnews,t5_2qh13,1335,1589,False,,dailykos.com,False,False,False,2025149,7m6m4,1063
248884,6z9op,http://www.google.com/chrome,/r/programming/comments/6z9op/chrome_is_here/,georgeb,1220381356,programming,t5_2fwo,1269,1904,False,,google.com,False,False,False,1176159,6z9op,1057
247210,6z2e2,http://www.nytimes.com/reuters/us/internationa...,/r/reddit.com/comments/6z2e2/palin_says_her_da...,nucleophile,1220285161,reddit.com,t5_6,1425,1517,False,,nytimes.com,False,False,False,1168699,6z2e2,1036
289532,7488a,http://www.msnbc.msn.com/id/26884523/?,/r/politics/comments/7488a/bailout_does_not_pa...,IM_A_REPTILIAN,1222711101,politics,t5_2cneq,1346,3361,False,,msnbc.msn.com,False,False,False,1360545,7488a,1028
93504,6fo4i,https://www.reddit.com/r/reddit.com/comments/6...,/r/reddit.com/comments/6fo4i/ask_a_muslim/,cup,1208149360,reddit.com,t5_6,1342,269,False,,self.reddit.com,False,False,False,431771,6fo4i,1004


In [68]:
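# Each row of comments_submissions is one submission (link_id), so the difference counts submissions, not comments.
# The inner merge also drops link_ids that have no matching row in `submissions`.
print(f"Filtered out: {comments_submissions.shape[0] - sfw_comments.shape[0]} submissions (NSFW, or with no match in the submissions table)")
print("Check if any nsfw are left: ", sfw_comments[sfw_comments["over_18"]==True].shape[0])

Filtered out: 142200 submissions (NSFW, or with no match in the submissions table)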
Check if any nsfw are left:  0
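
The inner merge above drops a row for two different reasons: the submission is flagged `over_18`, or its `link_id` has no matching row in `submissions`. A minimal sketch of how the two cases could be told apart with a left merge and pandas' `indicator` option, run here on made-up toy rows (`split_filtered` is a hypothetical helper and assumes `submissions_id` is unique):

```python
import pandas as pd


def split_filtered(submissions, comments_submissions):
    """Count commented submissions that are missing, NSFW, or kept."""
    merged = comments_submissions.merge(
        submissions[["submissions_id", "over_18"]],
        how="left",
        left_on="link_id",
        right_on="submissions_id",
        indicator=True,  # adds a `_merge` column: "both" or "left_only"
    )
    missing = merged["_merge"] == "left_only"
    # Unmatched rows have NaN in over_18, and NaN == True is False,
    # so the two masks never overlap.
    nsfw = merged["over_18"].eq(True)
    return {
        "missing": int(missing.sum()),
        "nsfw": int(nsfw.sum()),
        "kept": int((~missing & ~nsfw).sum()),
    }


# Made-up toy rows standing in for the real frames.
toy_submissions = pd.DataFrame({
    "submissions_id": ["a", "b", "c"],
    "over_18": [False, True, False],
})
toy_counts = pd.DataFrame({"link_id": ["a", "b", "z"], "counts": [3, 2, 5]})

result = split_filtered(toy_submissions, toy_counts)
print(result)
```

With the real frames, `split_filtered(submissions, comments_submissions)` would show how much of the filtered total is actually NSFW and how much comes from submissions missing from the table.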
