FindTop150.py

(C) 2018 by Jay Kaiser <jayckaiser.github.io>
Created Jan 8, 2018
Updated Jan 8, 2018

Given the Stat files from the previous step, find the top 150 most "influential" subreddits.
Output this information into a file that readily shows the ordering and the means for their ratings.

An "influential subreddit" has many comments, many unique users, and a high average comment length.
Its comments should also be scored highly by other users in the subreddit, which indicates active involvement.

In [1]:
import pickle
import os
import pandas as pd

In [2]:
def stats2dataframe(directory, file):
    # Load the per-subreddit statistics produced by the previous step.
    with open(os.path.join(directory, file), 'rb') as stats_file:
        subreddits = pickle.load(stats_file)
    
    simpler_figures = []
    
    total_comments = 0
    for subreddit in subreddits:
        subreddit_content = subreddits.get(subreddit)
        
        num_unique_users            = len(subreddit_content.get('unique_users'))
        total_number_of_comments    = subreddit_content.get('total_number_of_comments')
        average_comments_per_person = total_number_of_comments / num_unique_users
        average_score               = subreddit_content.get('total_score') / total_number_of_comments
        average_comment_length      = subreddit_content.get('total_comments_length') / total_number_of_comments
        
        total_comments += total_number_of_comments
        
        simpler_figures.append([subreddit,
                                total_number_of_comments,
                                num_unique_users,
                                average_comments_per_person,
                                average_comment_length,
                                average_score])
           
    subreddits = pd.DataFrame(simpler_figures,
                              columns=['subreddit',
                                       'num_comments',
                                       'num_unique_users',
                                       'avg_comments_per_user',
                                       'avg_comment_length',
                                       'avg_score'])
    
    print("\rIn {}, there were {:,} new comments on Reddit.".format(file, total_comments), end='')
    return subreddits
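
To make the expected input concrete, here is a minimal sketch of the same per-subreddit arithmetic on a hand-made stats dictionary. The key names are taken from stats2dataframe; the subreddit names and numbers are invented, 'unique_users' is assumed to be a set of user names, and summarize is a hypothetical helper that is not part of this notebook.

```python
import pandas as pd

# Toy stand-in for one unpickled stats file (all values are invented).
toy_stats = {
    'example_a': {'unique_users': {'u1', 'u2'},
                  'total_number_of_comments': 4,
                  'total_score': 10,
                  'total_comments_length': 400},
    'example_b': {'unique_users': {'u3'},
                  'total_number_of_comments': 1,
                  'total_score': 1,
                  'total_comments_length': 50},
}


def summarize(stats):
    """Hypothetical helper: the same per-subreddit ratios as stats2dataframe."""
    rows = []
    for name, content in stats.items():
        n_comments = content['total_number_of_comments']
        n_users = len(content['unique_users'])
        rows.append({
            'subreddit': name,
            'num_comments': n_comments,
            'num_unique_users': n_users,
            'avg_comments_per_user': n_comments / n_users,
            'avg_comment_length': content['total_comments_length'] / n_comments,
            'avg_score': content['total_score'] / n_comments,
        })
    return pd.DataFrame(rows)


toy_frame = summarize(toy_stats)
print(toy_frame)
```

Every average is a total divided by a count, so a subreddit with a single long comment can top the avg_comment_length ranking.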

In [3]:
rc2013_01 = stats2dataframe('/home/jayckaiser/Dropbox/DataIncubator/Capstone/', 'RC_2013-01.pkl')

In RC_2013-01.pkl, there were 27,919,427 new comments on Reddit.

In [4]:
rc2013_01.shape

(23916, 6)

In [5]:
rc2013_01.head()

Unnamed: 0,subreddit,num_comments,num_unique_users,avg_comments_per_user,avg_comment_length,avg_score
0,Android,89060,16758,5.314477,197.189412,3.816292
1,todayilearned,353108,106060,3.329323,169.29907,7.080225
2,politics,470008,62932,7.468506,282.881194,4.176435
3,funny,1302731,295464,4.409102,99.726815,8.416836
4,gonewildcurvy,28160,3726,7.557703,50.2712,1.34858


I predict that the number of comments is a strong predictor of the value of a subreddit. Let's investigate whether the other values are valuable as well.

In [6]:
rc2013_01.sort_values(['num_comments'], ascending=False).head()

Unnamed: 0,subreddit,num_comments,num_unique_users,avg_comments_per_user,avg_comment_length,avg_score
6,AskReddit,3998326,475861,8.402298,151.145225,10.030516
3,funny,1302731,295464,4.409102,99.726815,8.416836
17,pics,922374,257166,3.586687,110.634732,8.15535
53,WTF,703455,186658,3.768684,107.895024,7.947862
44,gaming,654238,167831,3.898195,131.242108,6.381407


In January 2013, there were 23,916 subreddits, and the five in the table above are the most commented-on. I'm betting most of the other subreddits are barely used, however. Let's check.

In [7]:
rc2013_01.sort_values(['avg_comment_length'], ascending=False).head(10)

Unnamed: 0,subreddit,num_comments,num_unique_users,avg_comments_per_user,avg_comment_length,avg_score
23519,MyEyesBurn,3,2,1.5,26715.666667,1.0
15589,PN_Official,3,1,3.0,16051.333333,1.0
23114,everythingimagined,1,1,1.0,15340.0,1.0
22033,GlobalIssues,2,2,1.0,11306.5,1.0
18024,AvenueBrawler,1,1,1.0,9192.0,1.0
5700,1000thworldproblems,214,19,11.263158,8777.794393,2.172897
23431,Al_Stewart,1,1,1.0,7668.0,2.0
21775,testincode,1,1,1.0,6957.0,1.0
19786,Kazantip,1,1,1.0,6877.0,1.0
21559,warpoetry,2,1,2.0,6416.0,1.0


It turns out that the subreddits with the longest average comments are either nearly empty (one or two posters) or just full of "shitposts" (as in the case of r/1000thworldproblems). Let's filter subreddits by the number of unique users who've posted and see what changes.

In [8]:
rc2013_01_greaterThan10 = rc2013_01[ rc2013_01['num_unique_users'] > 10 ]
rc2013_01_greaterThan10.shape

(8868, 6)

By keeping only subreddits with more than 10 unique posters, I've whittled the list down to a bit over a third of its original size (8,868 of 23,916). Let's scale this up by another order of magnitude.

In [9]:
rc2013_01_greaterThan100 = rc2013_01[ rc2013_01['num_unique_users'] > 100 ]
rc2013_01_greaterThan100.shape

(3186, 6)

With more than 100 posters, we're left with about one eighth of the original list (3,186 of 23,916).

In [10]:
rc2013_01_greaterThan1000 = rc2013_01[ rc2013_01['num_unique_users'] > 1000 ]
rc2013_01_greaterThan1000.shape

(686, 6)

With more than 1,000, we're down to about one thirty-fifth of the original list (686 of 23,916). We're getting close.
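
The repeated threshold filters above can also be run in one pass. This is only a sketch on invented data: count_above is a hypothetical helper, and the toy frame has just the one column it needs.

```python
import pandas as pd

# Invented data: five subreddits with very different numbers of posters.
toy = pd.DataFrame({
    'subreddit': ['a', 'b', 'c', 'd', 'e'],
    'num_unique_users': [3, 40, 250, 5000, 20000],
})


def count_above(frame, thresholds):
    """Hypothetical helper: rows with more than t unique users, per threshold t."""
    return {t: int((frame['num_unique_users'] > t).sum()) for t in thresholds}


counts = count_above(toy, [10, 100, 1000, 10000])
for threshold, remaining in counts.items():
    print("> {:>6,} users: {} of {} subreddits ({:.0%})".format(
        threshold, remaining, len(toy), remaining / len(toy)))
```

Printing the fraction next to each count makes statements such as "one eighth of the original list" easy to check.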

In [11]:
rc2013_01_greaterThan1000.sort_values(['avg_comment_length']).head(15)

Unnamed: 0,subreddit,num_comments,num_unique_users,avg_comments_per_user,avg_comment_length,avg_score
235,milf,3460,1511,2.289874,46.697399,2.198266
759,GoneWildPlus,19375,1981,9.780414,49.353961,1.457703
4,gonewildcurvy,28160,3726,7.557703,50.2712,1.34858
391,ladybonersgw,22239,3084,7.211089,51.392464,2.133954
2470,ass,2811,1440,1.952083,51.481323,2.253646
240,gonewild,212226,28211,7.52281,51.646353,1.57606
300,Boobies,2642,1427,1.851437,52.641181,2.700606
1708,curvy,2181,1078,2.023191,53.876662,2.259972
1476,RealGirls,9001,4366,2.061612,54.248972,3.193978
510,treesgonewild,2770,1069,2.591207,55.49639,1.983394


It turns out that the most popular subreddits with the least to contribute (as determined by avg comment length) are all porn subreddits.

Let's dare to try 10,000.

In [12]:
rc2013_01_greaterThan10000 = rc2013_01[ rc2013_01['num_unique_users'] > 10000 ]
rc2013_01_greaterThan10000.shape

(55, 6)

In [13]:
rc2013_01_greaterThan10000.sort_values('avg_comment_length', ascending=False)

Unnamed: 0,subreddit,num_comments,num_unique_users,avg_comments_per_user,avg_comment_length,avg_score
24,askscience,39947,11671,3.422757,413.287731,5.978446
190,TwoXChromosomes,60077,11161,5.382761,363.329294,7.666944
98,explainlikeimfive,56709,17439,3.251849,358.851734,6.192509
523,bestof,25909,10466,2.47554,288.473388,5.465128
2,politics,470008,62932,7.468506,282.881194,4.176435
156,sex,90738,19427,4.670716,268.064019,6.074412
25,buildapc,104973,12139,8.647582,267.403751,2.036438
70,Games,170897,29116,5.869522,261.974365,6.827294
311,science,69621,26626,2.614775,253.182703,5.013128
467,Frugal,58569,17249,3.395501,251.644505,6.176937


With more than 10,000 unique posters each, 55 subreddits still remain. This is promising, especially because I only need ~150 subreddits in total to study.

Before moving forward, for fun I want to see what the most circle-jerking subreddit is.

In [14]:
rc2013_01.sort_values(['avg_score'], ascending=False).head(20)

Unnamed: 0,subreddit,num_comments,num_unique_users,avg_comments_per_user,avg_comment_length,avg_score
12133,thatHappened,3,3,1.0,132.333333,90.333333
9003,gonewidl,18,14,1.285714,43.555556,77.277778
22024,mildlyterrifying,1,1,1.0,26.0,53.0
20580,LatvianJokes,47,24,1.958333,105.425532,32.617021
23603,WhyIsntSRSBanned,2,2,1.0,40.0,30.5
21444,frugaljerk,1,1,1.0,69.0,30.0
370,photoshopbattles,17432,5819,2.995704,69.797269,28.203132
259,TheJerkies,930,355,2.619718,170.437634,27.110753
16321,iFunny,1,1,1.0,70.0,26.0
19502,depressing,1,1,1.0,18.0,25.0


And let's see which subreddit has the highest number of comments per user.

In [15]:
rc2013_01.sort_values(['avg_comments_per_user'], ascending=False).head(20)

Unnamed: 0,subreddit,num_comments,num_unique_users,avg_comments_per_user,avg_comment_length,avg_score
43,videoinfo,1172,1,1172.0,1115.086177,1.0
13184,comment,1060,3,353.333333,7.931132,1.003774
638,ModerationLog,9456,39,242.461538,1592.360618,1.000529
7377,WorldLeadersBattle,138,1,138.0,162.23913,1.028986
5832,PhantomandFriends,402,3,134.0,247.60199,1.965174
76,illjustleavethishere,248,2,124.0,125.370968,1.0
3059,TheHoneyFuckles,733,6,122.166667,95.451569,1.008186
124,Random_Acts_Of_Amazon,152374,1625,93.768615,139.939799,1.437712
16775,FantasyRealignment,967,11,87.909091,35.596691,1.020683
980,ConnectedCareers,6146,73,84.191781,99.863488,2.030101


Now, how can we determine the most influential, most informative subreddits? Here is what I posit: starting from the subreddits with over 1,000 unique users, rank them separately by number of comments, number of unique users, and average comment length, and keep the top 500 of each ranking. Then take the intersection of the three lists (the inner merges in the code below do this), which whittles the list of 686 down further. This final subset will be those subreddits that have the most comments, the most unique users, and the longest comments on average.
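
As a minimal sketch of this selection on invented data: keep only the subreddits that rank in the top n for every metric. Note that this is a set intersection, which is also what an inner merge computes. The helper top_n_in_all is hypothetical and uses plain sets rather than the notebook's pd.merge calls.

```python
import pandas as pd

# Invented data; the column names match the notebook's dataframe.
toy = pd.DataFrame({
    'subreddit':          ['a', 'b', 'c', 'd'],
    'num_comments':       [900, 800, 100, 700],
    'num_unique_users':   [90, 10, 80, 70],
    'avg_comment_length': [50, 300, 200, 250],
})


def top_n_in_all(frame, metrics, n):
    """Hypothetical helper: rows that rank in the top n for every metric."""
    keep = set(frame['subreddit'])
    for metric in metrics:
        top = frame.sort_values(metric, ascending=False).head(n)
        keep &= set(top['subreddit'])  # intersection, not union
    return frame[frame['subreddit'].isin(keep)]


metrics = ['num_comments', 'num_unique_users', 'avg_comment_length']
selected = top_n_in_all(toy, metrics, n=3)
print(sorted(selected['subreddit']))
```

Here only 'd' is in the top three of all three rankings, so it is the only row kept.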

In [16]:
rc2013_01_500NC = rc2013_01_greaterThan1000.sort_values(['num_comments'], ascending=False)[:500]
rc2013_01_500NC.head(10)

Unnamed: 0,subreddit,num_comments,num_unique_users,avg_comments_per_user,avg_comment_length,avg_score
6,AskReddit,3998326,475861,8.402298,151.145225,10.030516
3,funny,1302731,295464,4.409102,99.726815,8.416836
17,pics,922374,257166,3.586687,110.634732,8.15535
53,WTF,703455,186658,3.768684,107.895024,7.947862
44,gaming,654238,167831,3.898195,131.242108,6.381407
7,AdviceAnimals,557399,144860,3.847846,154.645721,6.780685
38,leagueoflegends,484696,60586,8.000132,155.66767,5.390558
2,politics,470008,62932,7.468506,282.881194,4.176435
32,nfl,423906,28597,14.823443,117.091702,6.858525
1,todayilearned,353108,106060,3.329323,169.29907,7.080225


In [17]:
rc2013_01_500NUU = rc2013_01_greaterThan1000.sort_values(['num_unique_users'], ascending=False)[:500]
rc2013_01_500NUU.head(10)

Unnamed: 0,subreddit,num_comments,num_unique_users,avg_comments_per_user,avg_comment_length,avg_score
6,AskReddit,3998326,475861,8.402298,151.145225,10.030516
3,funny,1302731,295464,4.409102,99.726815,8.416836
17,pics,922374,257166,3.586687,110.634732,8.15535
53,WTF,703455,186658,3.768684,107.895024,7.947862
44,gaming,654238,167831,3.898195,131.242108,6.381407
7,AdviceAnimals,557399,144860,3.847846,154.645721,6.780685
31,IAmA,350507,120297,2.91368,177.438528,11.067933
15,videos,341257,109884,3.105611,136.563103,9.807901
1,todayilearned,353108,106060,3.329323,169.29907,7.080225
75,aww,184882,80851,2.2867,97.600799,5.843186


In [18]:
rc2013_01_500ACL = rc2013_01_greaterThan1000.sort_values(['avg_comment_length'], ascending=False)[:500]
rc2013_01_500ACL.head(10)

Unnamed: 0,subreddit,num_comments,num_unique_users,avg_comments_per_user,avg_comment_length,avg_score
645,POLITIC,10794,1094,9.866545,1405.43274,1.040578
1215,listentothis,8121,3645,2.227984,675.043837,2.819973
1285,buildapcforme,6494,1147,5.661726,549.28534,1.361411
201,AskHistorians,28316,5931,4.774237,533.590479,6.846836
2168,NeutralPolitics,4238,1035,4.094686,521.894998,4.093204
565,AskSocialScience,3999,1190,3.360504,511.876219,3.428107
2095,TrueFilm,6197,1614,3.839529,485.583024,4.553816
295,TrueAtheism,13913,2987,4.657851,481.054769,4.10501
396,DebateReligion,42276,2005,21.085287,480.546835,2.515186
164,philosophy,22410,4251,5.271701,450.064614,2.656716


In [19]:
rc2013_01_mostRelevant = pd.merge(pd.merge(rc2013_01_500ACL, rc2013_01_500NC, how='inner'), rc2013_01_500NUU, how='inner')

In [20]:
rc2013_01_mostRelevant150 = rc2013_01_mostRelevant.sort_values(['num_unique_users'], ascending=False)[:150]

In [21]:
rc2013_01_mostRelevant150

Unnamed: 0,subreddit,num_comments,num_unique_users,avg_comments_per_user,avg_comment_length,avg_score
304,AskReddit,3998326,475861,8.402298,151.145225,10.030516
295,AdviceAnimals,557399,144860,3.847846,154.645721,6.780685
241,IAmA,350507,120297,2.913680,177.438528,11.067933
266,todayilearned,353108,106060,3.329323,169.299070,7.080225
60,politics,470008,62932,7.468506,282.881194,4.176435
108,atheism,281837,60723,4.641355,245.170581,4.257617
290,leagueoflegends,484696,60586,8.000132,155.667670,5.390558
303,movies,195035,59791,3.261946,151.343805,7.199728
118,worldnews,252793,57961,4.361433,233.457540,6.065552
123,technology,184465,52022,3.545904,229.703288,6.629279


Now, let's write some code to automatically extract these top subreddits from a stats file and save them as a pickled dataframe just like the one above.

In [22]:
def extractTop150(directory, file):
    raw_dataframe = stats2dataframe(directory, file)
    raw_dataframe = raw_dataframe[ raw_dataframe['num_unique_users'] > 1000 ]
    
    top500NC = raw_dataframe.sort_values(['num_comments'], ascending=False)[:500]
    top500NUU = raw_dataframe.sort_values(['num_unique_users'], ascending=False)[:500]
    top500ACL = raw_dataframe.sort_values(['avg_comment_length'], ascending=False)[:500]
    
    # Inner merges keep only the subreddits present in all three top-500 lists (an intersection).
    common_subset = pd.merge(pd.merge(top500NC, top500NUU, how='inner'), top500ACL, how='inner')
    
    return common_subset.sort_values(['num_unique_users'], ascending=False)[:150]


def saveToPickle(directory, file, dataframe):
    dataframe.to_pickle(os.path.join(directory, file))
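
As a quick sanity check of the save step, the sketch below writes an invented two-row frame to a temporary directory and reads it back. The file name RC_toy.pkl and the frame contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Invented two-row frame standing in for one month's top-150 result.
toy_top = pd.DataFrame({'subreddit': ['a', 'b'],
                        'num_unique_users': [20, 10]})

with tempfile.TemporaryDirectory() as scratch_dir:
    path = os.path.join(scratch_dir, 'RC_toy.pkl')  # hypothetical file name
    toy_top.to_pickle(path)           # same call that saveToPickle makes
    restored = pd.read_pickle(path)   # how a later step can load it again

print(restored.equals(toy_top))
```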

In [23]:
stats_directory = "/media/jayckaiser/My Passport/reddit/stats/"
top150_directory = "/media/jayckaiser/My Passport/reddit/top150/"

if True:  # Set to False once the top-150 files have been written, so the work isn't redone.
    # files_to_do = sorted(os.listdir(stats_directory))
    files_to_do = ['RC_2013-03.pkl']
    for file in files_to_do:
        saveToPickle(top150_directory, file, extractTop150(stats_directory, file))

In RC_2005-12.pkl, there were 970 new comments on Reddit.
In RC_2006-01.pkl, there were 3,195 new comments on Reddit.
In RC_2006-02.pkl, there were 7,758 new comments on Reddit.
In RC_2006-03.pkl, there were 11,729 new comments on Reddit.
In RC_2006-04.pkl, there were 16,704 new comments on Reddit.
In RC_2006-05.pkl, there were 23,880 new comments on Reddit.
In RC_2006-06.pkl, there were 26,050 new comments on Reddit.
In RC_2006-07.pkl, there were 33,113 new comments on Reddit.
In RC_2006-08.pkl, there were 45,277 new comments on Reddit.
In RC_2006-09.pkl, there were 46,144 new comments on Reddit.
In RC_2006-10.pkl, there were 49,505 new comments on Reddit.
In RC_2006-11.pkl, there were 57,575 new comments on Reddit.
In RC_2006-12.pkl, there were 56,249 new comments on Reddit.


In RC_2017-11.pkl, there were 81,433,342 new comments on Reddit.

And for the sake of easily extracting the full set for future parsing, here is one final script for this notebook.

In [31]:
def extractUniqueSubreddits(directory):
    master_set = set()
    
    for file in os.listdir(directory):
        if file == 'ALL.pkl':
            continue
        
        print("\rReading {}.".format(file), end='')
        dataframe = pd.read_pickle(os.path.join(directory, file))
        
        subreddits = set(dataframe['subreddit'])
        master_set = master_set | subreddits
    
    print("\rFinished all files.")
    return master_set


output_path = os.path.join(top150_directory, 'ALL.pkl')

if False:  # Flip to True to rebuild ALL.pkl from the monthly top-150 files.
    master_set = extractUniqueSubreddits(top150_directory)
    with open(output_path, 'wb') as output_file:
        pickle.dump(master_set, output_file)
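
The core of extractUniqueSubreddits is a running set union over the monthly frames. Here is a minimal in-memory sketch of that idea; the frames are invented and unique_subreddits is a hypothetical helper that skips the file reading.

```python
import pandas as pd

# Invented monthly frames; only the 'subreddit' column matters here.
monthly_frames = [
    pd.DataFrame({'subreddit': ['a', 'b']}),
    pd.DataFrame({'subreddit': ['b', 'c']}),
]


def unique_subreddits(frames):
    """Hypothetical helper: union of the subreddit names across all frames."""
    return set().union(*(frame['subreddit'] for frame in frames))


all_names = unique_subreddits(monthly_frames)
print(sorted(all_names))
```

A subreddit that appears in several months is counted once, which is why the final set can be much smaller than 150 times the number of files.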

In [35]:
import pprint

pprint.pprint(pickle.load(open(output_path, 'rb')))

{'13ReasonsWhy',
 '3DS',
 '3Dprinting',
 '4chan',
 'ADHD',
 'AMA',
 'AbandonedPorn',
 'Advice',
 'AdviceAnimals',
 'Anarchism',
 'Android',
 'Animesuggest',
 'Aquariums',
 'Art',
 'AsianBeauty',
 'AskAnAmerican',
 'AskHistorians',
 'AskMen',
 'AskReddit',
 'AskScienceFiction',
 'AskThe_Donald',
 'AskTrumpSupporters',
 'AskWomen',
 'Atlanta',
 'Austin',
 'Autos',
 'BabyBumps',
 'Bad_Cop_No_Donut',
 'Banished',
 'Battleborn',
 'Battlefield',
 'Bestof2011',
 'Bioshock',
 'Bitcoin',
 'Borderlands',
 'Borderlands2',
 'BuyItForLife',
 'CFB',
 'CODGhosts',
 'Calgary',
 'CanadaPolitics',
 'CasualConversation',
 'Christianity',
 'ClickerHeroes',
 'Coffee',
 'ColbertRally',
 'CompetitiveForHonor',
 'CompetitiveHS',
 'Competitiveoverwatch',
 'Conservative',
 'Cooking',
 'CruciblePlaybook',
 'CrusaderKings',
 'Cynicalbrit',
 'DAE',
 'DIY',
 'DNCleaks',
 'DarkNetMarkets',
 'DarkSouls2',
 'DeadBedrooms',
 'Denmark',
 'Denver',
 'Design',
 'Dexter',
 'Diablo',
 'Diablo3Strategy',
 'DivinityOriginalSi