From the last step, I have created a single user-graph per month of Reddit comments. In each graph, each subreddit is a node, and each node is connected by an edge indicating the total number of users that month who posted at least one comment in both subreddits (with no edges connecting subreddits without shared users). These indicate the most frequented subreddits by user-base, and as such I can determine which subreddits are most politically motivated or frequented by the most politically-minded individuals.

Some notes about the graphs:
- Any subreddits with less than 1,000 unique posters are not included in the graph.
- If two subreddits have a shared userbase of 10 or less users, their edge is not included.
These were done in order to limit the size of the graph and to ignore subreddits of little consequence.

In [1]:
import os
import pickle
import networkx as nx

In [2]:
def retrieveGraph(directory, file):
    graph, names2values, values2names = pickle.load(open( os.path.join(directory, file), 'rb' ))
    return graph, names2values, values2names

In [3]:
test_directory = "/home/jayckaiser/Dropbox/DataIncubator/Capstone/"
test_file = 'RC_2014-01.pkl'

graph, names2values, values2names = retrieveGraph(test_directory, test_file)

In [4]:
# for the ease of searching the graph, I'm making baby functions to tie to the dictionaries
def r(subreddit):
    if subreddit in names2values:
        return names2values[subreddit]
    else:
        return 'NONE'

def v(index):
    if index in values2names:
        return values2names[index]
    else:
        return 'NONE'

In [5]:
graph.number_of_nodes(), graph.number_of_edges()

(4581, 588266)

In January, 2014, there are 4,581 subreddits with 588,266 connections between them that fit our criteria above.

In [6]:
graph.degree( r('politics') )

3533

Of these 4,581, 'r/politics' connects to 3,533 others.

In [7]:
politics_neighbors = [( graph[ r('politics') ][neighbor]['weight'] , v(neighbor) )
                      for neighbor in graph.neighbors( r('politics') )]

In [8]:
politics_neighbors = sorted(politics_neighbors, reverse=True)
politics_neighbors[:10]

[(23301, 'AskReddit'),
 (18404, 'funny'),
 (17859, 'pics'),
 (15124, 'todayilearned'),
 (14378, 'AdviceAnimals'),
 (14218, 'worldnews'),
 (13672, 'WTF'),
 (12878, 'videos'),
 (10633, 'technology'),
 (10628, 'gaming')]

Above are posted the top 10 jointly-posted subreddits to that of 'r/politics'. This is expected, as these are all default subreddits frequented by everyone. We can extract these from the list of default subreddits to find some more interesting results. The list below is from 2017, but the difference will not be too great.

In [9]:
default_subs = [
    'announcements', 'Art', 'AskReddit', 'askscience', 'aww', 'blog', 'books', 'creepy', 'dataisbeautiful', 'DIY', 
    'Documentaries', 'EarthPorn', 'explainlikeimfive', 'food', 'funny', 'Futurology', 'gadgets', 'gaming', 'GetMotivated', 'gifs',
    'history', 'IAmA', 'InternetIsBeautiful', 'Jokes', 'LifeProTips', 'listentothis', 'mildlyinteresting', 'movies', 'Music', 'news', 
    'nosleep', 'nottheonion', 'OldSchoolCool', 'personalfinance', 'philosophy', 'photoshopbattles', 'pics', 'politics',
    'science', 'Showerthoughts', 'space', 'sports', 'television', 'tifu', 'todayilearned', 'UpliftingNews' 'videos', 'worldnews'
]

In [10]:
politics_neighbors_minus_defaults = [(sub[0], sub[1]) for sub in politics_neighbors
                                    if sub[1] not in default_subs]
politics_neighbors_minus_defaults[:20]

[(14378, 'AdviceAnimals'),
 (13672, 'WTF'),
 (12878, 'videos'),
 (10633, 'technology'),
 (5579, 'atheism'),
 (3436, 'nfl'),
 (3253, 'trees'),
 (2869, 'cringepics'),
 (2666, 'Games'),
 (2414, 'bestof'),
 (1912, 'conspiracy'),
 (1746, 'woahdude'),
 (1746, 'Android'),
 (1610, 'reactiongifs'),
 (1553, 'sex'),
 (1548, 'TrueReddit'),
 (1476, 'leagueoflegends'),
 (1476, 'Frugal'),
 (1374, 'cringe'),
 (1359, 'JusticePorn')]

Of these top results, many had once been default subreddits. I'll add another list iteratively until we run out of past defaults.

In [11]:
former_default_subs = [
    'AdviceAnimals', 'WTF', 'videos', 'technology', 'atheism', 'cringepics', 'Games', 'bestof', 'TwoXChromosomes', 'writingprompts',
    'nfl', 'conspiracy', 'woahdude', 'reactiongifs', 'Frugal', 'sex'
]

politics_neighbors_minus_defaults = [(sub[0], sub[1]) for sub in politics_neighbors
                                    if sub[1] not in default_subs and 
                                       sub[1] not in former_default_subs]
politics_neighbors_minus_defaults[:10]

[(3253, 'trees'),
 (1746, 'Android'),
 (1548, 'TrueReddit'),
 (1476, 'leagueoflegends'),
 (1374, 'cringe'),
 (1359, 'JusticePorn'),
 (1354, 'changemyview'),
 (1313, 'Fitness'),
 (1284, 'Libertarian'),
 (1284, 'Bitcoin')]

These are less-default subreddits. The first one actually relating to political views in fact is the last on the list above, 'r/Libertarian'.

In [12]:
politics_neighbors_minus_defaults[10:30]

[(1265, 'CFB'),
 (1246, '4chan'),
 (1219, 'nba'),
 (1199, 'pcmasterrace'),
 (1189, 'facepalm'),
 (1178, 'dogecoin'),
 (1136, 'gonewild'),
 (1113, 'Economics'),
 (1092, 'AskMen'),
 (1073, 'MapPorn'),
 (1038, 'programming'),
 (1025, 'offbeat'),
 (1018, 'PoliticalDiscussion'),
 (1016, 'SubredditDrama'),
 (987, 'guns'),
 (978, 'TumblrInAction'),
 (971, 'HistoryPorn'),
 (939, 'hockey'),
 (924, 'buildapc'),
 (915, 'ImGoingToHellForThis')]

Interestingly enough, 'r/4chan', 'r/guns', and 'r/TumblrInAction' are all more traditionally conservative-aligned subreddits. None of these listed are traditionally liberal subreddits (minus the default 'r/TwoXChromosome,' which was removed above).

In [13]:
politics_neighbors_minus_defaults[30:100]

[(894, 'malefashionadvice'),
 (887, 'MorbidReality'),
 (866, 'circlejerk'),
 (864, 'canada'),
 (861, 'rage'),
 (836, 'relationships'),
 (829, 'hiphopheads'),
 (818, 'MensRights'),
 (797, 'business'),
 (789, 'scifi'),
 (783, 'apple'),
 (780, 'worldpolitics'),
 (779, 'fffffffuuuuuuuuuuuu'),
 (772, 'soccer'),
 (770, 'mildlyinfuriating'),
 (769, 'Minecraft'),
 (750, 'dayz'),
 (740, 'cars'),
 (734, 'gameofthrones'),
 (733, 'skyrim'),
 (717, 'DotA2'),
 (714, 'offmychest'),
 (688, 'truegaming'),
 (688, 'standupshots'),
 (684, 'skeptic'),
 (661, 'battlefield_4'),
 (659, 'NetflixBestOf'),
 (656, 'Unexpected'),
 (650, 'promos'),
 (647, 'pokemon'),
 (646, 'Bad_Cop_No_Donut'),
 (638, 'CrazyIdeas'),
 (637, 'fatpeoplestories'),
 (636, 'baseball'),
 (634, 'asoiaf'),
 (634, 'Drugs'),
 (631, 'AskWomen'),
 (630, 'TrueAtheism'),
 (627, 'geek'),
 (622, 'Steam'),
 (620, 'talesfromtechsupport'),
 (619, 'AskHistorians'),
 (616, 'wow'),
 (609, 'MURICA'),
 (605, 'community'),
 (598, 'motorcycles'),
 (594, 'nsf

Looking at politically-relevant subreddits in the top 100 list that are not already defaults, we can find the following:
- Libertarian
- Economics
- PoliticalDiscussion
- worldpolitics
- Conservative

with maybe the following under consideration for their more serious content:
- business
- AskHistorians
- Christianity

For the sake of exploratory data analysis, let's create a function that'll find the top n joint subreddits for any particular sub.

In [14]:
def exploreNeighbors(subreddit, n=50):
    if r(subreddit) not in graph.nodes():
        print("This sub is not present in the graph.")
    else:
        degree = graph.degree( r(subreddit) )
        print("r/{} has {} neighbors.".format(subreddit, degree))
        
        neighbors = sorted( [( graph[ r(subreddit) ][neighbor]['weight'] , v(neighbor) )
                             for neighbor in graph.neighbors( r(subreddit) )] , reverse=True)
        
        neighbors_minus_defaults = [(sub[0], sub[1]) for sub in neighbors
                                    if sub[1] not in default_subs and 
                                    sub[1] not in former_default_subs]
        
        for neighbor in neighbors_minus_defaults[:n]:
            print(neighbor)

In [15]:
exploreNeighbors('TwoXChromosomes')

r/TwoXChromosomes has 1613 neighbors.
(1518, 'AskWomen')
(1320, 'MakeupAddiction')
(988, 'relationships')
(978, 'TrollXChromosomes')
(703, 'AskMen')
(562, 'offmychest')
(533, 'femalefashionadvice')
(452, 'SkincareAddiction')
(443, 'loseit')
(434, 'TheGirlSurvivalGuide')
(418, 'relationship_advice')
(416, 'trees')
(416, 'childfree')
(415, 'xxfitness')
(404, 'FancyFollicles')
(390, 'Parenting')
(353, 'changemyview')
(352, 'RedditLaqueristas')
(349, 'Fitness')
(342, 'LadyBoners')
(314, 'cats')
(311, 'creepyPMs')
(304, 'TumblrInAction')
(291, 'OkCupid')
(284, 'TrueReddit')
(283, 'facepalm')
(282, 'raisedbynarcissists')
(280, 'TalesFromRetail')
(275, 'fatpeoplestories')
(270, 'GirlGamers')
(266, 'MorbidReality')
(262, 'self')
(255, 'confession')
(248, 'harrypotter')
(245, 'cringe')
(243, 'SubredditDrama')
(233, 'DoesAnybodyElse')
(227, 'BabyBumps')
(217, 'Cooking')
(211, 'Feminism')
(207, 'casualiama')
(204, 'keto')
(204, 'dogs')
(198, 'ABraThatFits')
(196, 'canada')
(195, 'progresspics')
(

In [16]:
exploreNeighbors('WTF')

r/WTF has 4533 neighbors.
(12829, 'trees')
(7183, 'leagueoflegends')
(6234, 'cringe')
(5866, '4chan')
(5145, 'gonewild')
(5129, 'pcmasterrace')
(4310, 'ImGoingToHellForThis')
(4296, 'facepalm')
(4167, 'dogecoin')
(4110, 'Android')
(4052, 'JusticePorn')
(3972, 'Fitness')
(3899, 'circlejerk')
(3869, 'pokemon')
(3747, 'AskMen')
(3643, 'Minecraft')
(3608, 'MorbidReality')
(3579, 'nba')
(3577, 'dayz')
(3443, 'mildlyinfuriating')
(3340, 'buildapc')
(3286, 'Unexpected')
(3269, 'hockey')
(3154, 'rage')
(3130, 'skyrim')
(2996, 'hiphopheads')
(2987, 'guns')
(2949, 'MakeupAddiction')
(2945, 'soccer')
(2926, 'CFB')
(2894, 'offmychest')
(2870, 'battlefield_4')
(2827, 'thatHappened')
(2827, 'relationships')
(2740, 'GrandTheftAutoV')
(2725, 'fffffffuuuuuuuuuuuu')
(2718, 'TumblrInAction')
(2709, 'malefashionadvice')
(2688, 'AskWomen')
(2642, 'FiftyFifty')
(2581, 'MapPorn')
(2564, 'cars')
(2488, 'DotA2')
(2459, 'wow')
(2414, 'HistoryPorn')
(2391, 'Bitcoin')
(2340, 'nsfw')
(2315, 'Whatcouldgowrong')
(22

In [17]:
#exploreNeighbors('ass')  # it's what you'd expect

In [18]:
exploreNeighbors('leagueoflegends')

r/leagueoflegends has 2556 neighbors.
(4272, 'hearthstone')
(4090, 'summonerschool')
(1966, 'trees')
(1958, 'DotA2')
(1876, 'pokemon')
(1743, 'wow')
(1667, 'anime')
(1621, 'dayz')
(1605, 'pcmasterrace')
(1358, 'buildapc')
(1337, 'starcraft')
(1261, 'dogecoin')
(1191, 'LeagueofLegendsMeta')
(1189, '4chan')
(1141, 'magicTCG')
(1047, 'Minecraft')
(1042, 'cringe')
(1014, 'nba')
(986, 'soccer')
(951, 'GlobalOffensive')
(872, 'circlejerk')
(815, 'hiphopheads')
(815, 'friendsafari')
(798, 'Steam')
(792, 'pathofexile')
(784, 'starbound')
(781, 'skyrim')
(742, 'Fitness')
(647, 'darksouls')
(616, 'hockey')
(613, 'TeamRedditTeams')
(607, 'Android')
(601, 'teenagers')
(597, 'Guildwars2')
(586, 'gonewild')
(578, 'gameofthrones')
(567, 'AskMen')
(566, 'TumblrInAction')
(558, 'CFB')
(556, 'malefashionadvice')
(556, 'battlefield_4')
(499, 'GameDeals')
(498, 'truegaming')
(494, 'techsupport')
(473, 'ffxiv')
(470, 'battlestations')
(468, 'JusticePorn')
(466, 'civ')
(466, 'SubredditDrama')
(464, 'LeagueO

In [19]:
exploreNeighbors('linguistics')

r/linguistics has 455 neighbors.
(205, 'MapPorn')
(145, 'badlinguistics')
(129, 'AskHistorians')
(108, 'languagelearning')
(97, 'europe')
(95, 'changemyview')
(81, 'TrueReddit')
(74, 'TumblrInAction')
(72, 'polandball')
(72, 'badhistory')
(67, 'programming')
(54, 'conlangs')
(53, 'vexillology')
(50, 'Fitness')
(49, 'SubredditDrama')
(49, 'AskMen')
(47, 'Minecraft')
(46, 'offbeat')
(46, 'cringe')
(45, 'math')
(45, 'britishproblems')
(45, 'Foodforthought')
(44, 'tipofmytongue')
(44, 'skeptic')
(42, 'writing')
(42, 'trees')
(42, 'travel')
(42, 'grammar')
(42, 'Christianity')
(42, 'Android')
(41, 'scifi')
(41, 'linux')
(39, 'mildlyinfuriating')
(38, 'soccer')
(37, 'translator')
(37, 'gaybros')
(37, 'facepalm')
(37, 'YouShouldKnow')
(37, 'French')
(37, '4chan')
(36, 'thatHappened')
(36, 'malefashionadvice')
(36, 'dogecoin')
(36, 'canada')
(35, 'LearnJapanese')
(35, 'Economics')
(35, 'AskWomen')
(34, 'truegaming')
(33, 'unitedkingdom')
(33, 'etymology')


I'm not going to lie. This is so much cooler than I thought it was going to be! Super fun!

However, I need a means to extract the most political subs from a given month.

In [20]:
exploreNeighbors('conlangs')

r/conlangs has 52 neighbors.
(59, 'worldbuilding')
(54, 'linguistics')
(27, 'MapPorn')
(24, 'languagelearning')
(22, 'Minecraft')
(21, 'writing')
(21, 'vexillology')
(21, 'polandball')
(20, 'badlinguistics')
(20, 'WritingPrompts')
(16, 'programming')
(16, 'changemyview')
(15, 'pokemon')
(15, 'TumblrInAction')
(12, 'europe')
(12, 'civ')
(11, 'duolingo')
(11, 'LearnJapanese')
(10, 'trees')
(10, 'teenagers')
(10, 'rpg')
(10, 'fantasywriters')
(10, 'Christianity')
(10, 'AskHistorians')


In [21]:
exploreNeighbors('TheRedPill')

r/TheRedPill has 1026 neighbors.
(495, 'AskMen')
(476, 'asktrp')
(373, 'MensRights')
(339, 'relationships')
(293, 'seduction')
(265, 'Fitness')
(224, 'trees')
(215, 'changemyview')
(208, 'RedPillWomen')
(201, '4chan')
(200, 'AskWomen')
(189, 'TumblrInAction')
(188, 'cringe')
(170, 'offmychest')
(159, 'leagueoflegends')
(154, 'malefashionadvice')
(154, 'NoFap')
(146, 'Bitcoin')
(139, 'TheBluePill')
(137, 'JusticePorn')
(132, 'relationship_advice')
(132, 'gonewild')
(128, 'dogecoin')
(128, 'OkCupid')
(127, 'hiphopheads')
(127, 'ImGoingToHellForThis')
(126, 'confession')
(126, 'SubredditDrama')
(125, 'TrueReddit')
(124, 'pcmasterrace')
(124, 'fatpeoplestories')
(117, 'nba')
(116, 'PurplePillDebate')
(116, 'Libertarian')
(110, 'rage')
(108, 'MorbidReality')
(105, 'Android')
(104, 'circlejerk')
(103, 'facepalm')
(103, 'Drugs')
(95, 'bodybuilding')
(91, 'thatHappened')
(90, 'RedKings')
(89, 'guns')
(86, 'buildapc')
(83, 'self')
(82, 'motorcycles')
(82, 'keto')
(82, 'casualiama')
(81, 'fatlog

Enough exploring! We need a metric to figure out what the best subs should be for comparison. These subs should be ones that either are related to politics, or ones that are not a default for all other subreddits. Let's do this:
compare each subreddit's neighbors and add up how often each one takes up the top spot. Those that are in the top spots the most should be ignored, as they most likely have too much data to make any strong inferences.

In [22]:
from collections import defaultdict

def findMostCommon(graph, n=50):    
    popularity_dicts = [defaultdict(int) for i in range(n)]
    
    
    for node in graph.nodes():
        neighbors = sorted( [( graph[ node ][neighbor]['weight'] , neighbor )
                             for neighbor in graph.neighbors( node )] , reverse=True)
        
        if len(neighbors) > n:
            neighbors = neighbors[:n]
        
        for index, neighbor in enumerate(neighbors):
            popularity_dicts[index][neighbor[1]] += 1
            
    return popularity_dicts


# this will get rid of any outlier subreddits
def simplify_pop_dicts(popularity_dicts,  defaultlist = None):
    def v(index):
        if index in values2names:
            return values2names[index]
        else:
            return 'NONE'
    
    if defaultlist is None:
        defaultlist = []
    
    
    for index, ranked_dict in enumerate(popularity_dicts):
        
        unsorted_indexed_list = [(value, v(key)) for (key, value) in popularity_dicts[index].items()
                                  if value > 10 and v(key) not in defaultlist]
        popularity_dicts[index] = sorted(unsorted_indexed_list, reverse=True)
    
    return popularity_dicts

In [23]:
test_dict = simplify_pop_dicts( findMostCommon(graph) )

In [24]:
sorted(test_dict[0], reverse=True)

[(3531, 'AskReddit'),
 (48, 'gonewild'),
 (41, 'nfl'),
 (37, 'hockey'),
 (33, 'soccer'),
 (32, 'dogecoin'),
 (29, 'nba'),
 (26, 'leagueoflegends'),
 (25, 'funny'),
 (22, 'trees'),
 (21, 'Bitcoin'),
 (14, 'pokemon'),
 (14, 'guns'),
 (13, 'magicTCG'),
 (13, 'electronic_cigarette'),
 (12, 'Minecraft'),
 (12, 'GrandTheftAutoV'),
 (11, 'buildapc'),
 (11, 'anime')]

In [25]:
sorted(test_dict[1], reverse=True)

[(1508, 'funny'),
 (976, 'pics'),
 (836, 'AskReddit'),
 (236, 'gaming'),
 (42, 'WTF'),
 (33, 'worldnews'),
 (28, 'todayilearned'),
 (26, 'videos'),
 (25, 'IAmA'),
 (20, 'technology'),
 (20, 'Android'),
 (19, 'dogemarket'),
 (19, 'AdviceAnimals'),
 (17, 'gonewild'),
 (15, 'trees'),
 (15, 'gonewildcurvy'),
 (15, 'baseball'),
 (13, 'MakeupAddiction'),
 (12, 'sex'),
 (12, 'nfl'),
 (12, 'movies'),
 (12, 'dogecoin')]

In [26]:
sorted(test_dict[2], reverse=True)

[(1700, 'funny'),
 (1575, 'pics'),
 (224, 'gaming'),
 (125, 'AskReddit'),
 (106, 'AdviceAnimals'),
 (80, 'WTF'),
 (62, 'todayilearned'),
 (47, 'IAmA'),
 (33, 'worldnews'),
 (32, 'technology'),
 (31, 'videos'),
 (26, 'nfl'),
 (12, 'movies')]

In [27]:
def pastThePostSubs(popularity_lists):
    topN = set()
    
    for pop_list in popularity_lists:
        for (value, subreddit) in pop_list:
            if subreddit not in topN:
                topN.add(subreddit)
                break
    
    return topN

In [28]:
new_defaults = pastThePostSubs(test_dict)

In [29]:
new_defaults, len(new_defaults)

({'4chan',
  'AdviceAnimals',
  'Android',
  'AskMen',
  'AskReddit',
  'AskWomen',
  'Fitness',
  'Frugal',
  'Games',
  'IAmA',
  'LifeProTips',
  'Minecraft',
  'Music',
  'TrueReddit',
  'WTF',
  'atheism',
  'aww',
  'bestof',
  'books',
  'buildapc',
  'circlejerk',
  'cringe',
  'cringepics',
  'dogecoin',
  'explainlikeimfive',
  'facepalm',
  'funny',
  'gaming',
  'gifs',
  'leagueoflegends',
  'malefashionadvice',
  'mildlyinfuriating',
  'mildlyinteresting',
  'movies',
  'news',
  'nfl',
  'nottheonion',
  'pcmasterrace',
  'pics',
  'politics',
  'reactiongifs',
  'science',
  'sex',
  'technology',
  'television',
  'todayilearned',
  'trees',
  'videos',
  'woahdude',
  'worldnews'},
 50)

This is now the top 50 most popular subreddits, as determined by a past-the-post counting system. Luckily, only 'r/politics' is politically important (with maybe the exception of 'r/worldnews' and 'r/news'. We can edit the function above and repeat the process to determine the top 150 subreddits for actual parsing.

In [30]:
final150 = pastThePostSubs( simplify_pop_dicts( findMostCommon(graph, 150), defaultlist=new_defaults) )

In [31]:
final150, len(final150)

({'AbandonedPorn',
  'AnimalsBeingJerks',
  'ArcherFX',
  'Bitcoin',
  'CFB',
  'Cooking',
  'CrazyIdeas',
  'DIY',
  'DoesAnybodyElse',
  'DotA2',
  'Drugs',
  'EarthPorn',
  'Fallout',
  'FiftyFifty',
  'FoodPorn',
  'Futurology',
  'GameDeals',
  'GetMotivated',
  'GrandTheftAutoV',
  'Guitar',
  'HistoryPorn',
  'ImGoingToHellForThis',
  'InternetIsBeautiful',
  'Jokes',
  'JusticePorn',
  'Justrolledintotheshop',
  'KerbalSpaceProgram',
  'Libertarian',
  'MURICA',
  'MakeupAddiction',
  'MapPorn',
  'MensRights',
  'MorbidReality',
  'NetflixBestOf',
  'OkCupid',
  'OutOfTheLoop',
  'PS4',
  'RealGirls',
  'Sherlock',
  'Showerthoughts',
  'SquaredCircle',
  'StarWars',
  'Steam',
  'SubredditDrama',
  'TalesFromRetail',
  'TumblrInAction',
  'TwoXChromosomes',
  'Unexpected',
  'Whatcouldgowrong',
  'YouShouldKnow',
  'anime',
  'apple',
  'askscience',
  'asoiaf',
  'battlefield_4',
  'bicycling',
  'breakingbad',
  'britishproblems',
  'canada',
  'cars',
  'casualiama',
  'ca

I'm not sure if I like this method. Let's try a different one.

In [32]:
def total_sums(popularity_lists, n=50):
    final = len(popularity_lists)
    
    topN = defaultdict(int)
    for index, pop_list in enumerate(popularity_lists):
        for (value, subreddit) in pop_list:
            topN[subreddit] += (final - index) * value
    
    sorted_subs = sorted( [(value, key) for (key, value) in topN.items()], reverse=True )[:n]
    return set(key for (value, key) in sorted_subs)

In [33]:
different_defaults = total_sums(test_dict)

In [34]:
different_defaults

{'4chan',
 'AdviceAnimals',
 'Android',
 'AskMen',
 'AskReddit',
 'AskWomen',
 'CFB',
 'Fitness',
 'Frugal',
 'Games',
 'IAmA',
 'LifeProTips',
 'Minecraft',
 'Music',
 'WTF',
 'atheism',
 'aww',
 'bestof',
 'books',
 'buildapc',
 'cringe',
 'cringepics',
 'dogecoin',
 'explainlikeimfive',
 'funny',
 'gaming',
 'gifs',
 'gonewild',
 'hiphopheads',
 'leagueoflegends',
 'mildlyinteresting',
 'movies',
 'nba',
 'news',
 'nfl',
 'nottheonion',
 'pcmasterrace',
 'pics',
 'pokemon',
 'politics',
 'reactiongifs',
 'science',
 'sex',
 'technology',
 'television',
 'todayilearned',
 'trees',
 'videos',
 'woahdude',
 'worldnews'}

In [35]:
intersected = new_defaults.intersection(different_defaults)
len(intersected), intersected

(45,
 {'4chan',
  'AdviceAnimals',
  'Android',
  'AskMen',
  'AskReddit',
  'AskWomen',
  'Fitness',
  'Frugal',
  'Games',
  'IAmA',
  'LifeProTips',
  'Minecraft',
  'Music',
  'WTF',
  'atheism',
  'aww',
  'bestof',
  'books',
  'buildapc',
  'cringe',
  'cringepics',
  'dogecoin',
  'explainlikeimfive',
  'funny',
  'gaming',
  'gifs',
  'leagueoflegends',
  'mildlyinteresting',
  'movies',
  'news',
  'nfl',
  'nottheonion',
  'pcmasterrace',
  'pics',
  'politics',
  'reactiongifs',
  'science',
  'sex',
  'technology',
  'television',
  'todayilearned',
  'trees',
  'videos',
  'woahdude',
  'worldnews'})

These two methods yield almost identical results, but I prefer the methodology of the second one. Let's verify this is the case for the important 150 subreddits as well.

In [36]:
different_final150 = total_sums( simplify_pop_dicts( findMostCommon(graph, 150), defaultlist=different_defaults), 150 )

In [37]:
len(different_final150), different_final150

(146,
 {'AbandonedPorn',
  'AnimalsBeingJerks',
  'ArcherFX',
  'Bitcoin',
  'Cooking',
  'CrappyDesign',
  'CrazyIdeas',
  'DIY',
  'DoesAnybodyElse',
  'DotA2',
  'Drugs',
  'EarthPorn',
  'Fallout',
  'FiftyFifty',
  'FoodPorn',
  'Futurology',
  'GameDeals',
  'GetMotivated',
  'GrandTheftAutoV',
  'Guitar',
  'HistoryPorn',
  'ImGoingToHellForThis',
  'InternetIsBeautiful',
  'Jokes',
  'JusticePorn',
  'Justrolledintotheshop',
  'KerbalSpaceProgram',
  'Libertarian',
  'MURICA',
  'MakeupAddiction',
  'MapPorn',
  'MensRights',
  'MorbidReality',
  'NSFW_GIF',
  'NetflixBestOf',
  'NoStupidQuestions',
  'OkCupid',
  'OldSchoolCool',
  'OutOfTheLoop',
  'PS4',
  'RealGirls',
  'Sherlock',
  'Showerthoughts',
  'SquaredCircle',
  'StarWars',
  'Steam',
  'SubredditDrama',
  'TalesFromRetail',
  'TrueReddit',
  'TumblrInAction',
  'TwoXChromosomes',
  'Unexpected',
  'Whatcouldgowrong',
  'YouShouldKnow',
  'anime',
  'apple',
  'askscience',
  'asoiaf',
  'baseball',
  'battlefield

In [38]:
final150_intersect = final150.intersection(different_final150)
len(final150_intersect), final150_intersect

(122,
 {'AbandonedPorn',
  'AnimalsBeingJerks',
  'ArcherFX',
  'Bitcoin',
  'Cooking',
  'CrazyIdeas',
  'DIY',
  'DoesAnybodyElse',
  'DotA2',
  'Drugs',
  'EarthPorn',
  'Fallout',
  'FiftyFifty',
  'FoodPorn',
  'Futurology',
  'GameDeals',
  'GetMotivated',
  'GrandTheftAutoV',
  'Guitar',
  'HistoryPorn',
  'ImGoingToHellForThis',
  'InternetIsBeautiful',
  'Jokes',
  'JusticePorn',
  'Justrolledintotheshop',
  'KerbalSpaceProgram',
  'Libertarian',
  'MURICA',
  'MakeupAddiction',
  'MapPorn',
  'MensRights',
  'MorbidReality',
  'NetflixBestOf',
  'OkCupid',
  'OutOfTheLoop',
  'PS4',
  'RealGirls',
  'Sherlock',
  'Showerthoughts',
  'SquaredCircle',
  'StarWars',
  'Steam',
  'SubredditDrama',
  'TalesFromRetail',
  'TumblrInAction',
  'TwoXChromosomes',
  'Unexpected',
  'Whatcouldgowrong',
  'YouShouldKnow',
  'anime',
  'apple',
  'askscience',
  'asoiaf',
  'battlefield_4',
  'bicycling',
  'breakingbad',
  'britishproblems',
  'canada',
  'cars',
  'casualiama',
  'cats'

Yep, almost identical. Close enough where I don't think it'll matter too much. In the final searching, we'll add a custom-made list of political and state subreddits as well for the sake of completeness. For the moment, these will work great! Also, because we'll be adding those other ones later, let's limit these lists to the top 100.

In [39]:
def saveDefaultsAndN(graph, values2names, final_count=100):

    # limit the graph down to the most important
    def findMostCommon(graph, n=50):    
        popularity_dicts = [defaultdict(int) for i in range(n)]


        for node in graph.nodes():
            neighbors = sorted( [( graph[ node ][neighbor]['weight'] , neighbor )
                                 for neighbor in graph.neighbors( node )] , reverse=True)

            if len(neighbors) > n:
                neighbors = neighbors[:n]

            for index, neighbor in enumerate(neighbors):
                popularity_dicts[index][neighbor[1]] += 1

        return popularity_dicts

    # this will get rid of any outlier subreddits
    def simplify_pop_dicts(popularity_dicts,  defaultlist = None):
        def v(index):
            if index in values2names:
                return values2names[index]
            else:
                return 'NONE'

        if defaultlist is None:
            defaultlist = []


        for index, ranked_dict in enumerate(popularity_dicts):

            unsorted_indexed_list = [(value, v(key)) for (key, value) in popularity_dicts[index].items()
                                      if value > 10 and v(key) not in defaultlist]
            popularity_dicts[index] = sorted(unsorted_indexed_list, reverse=True)

        return popularity_dicts
    
    # return a final list of subs
    def total_sums(popularity_lists, n=50):
        final = len(popularity_lists)

        topN = defaultdict(int)
        for index, pop_list in enumerate(popularity_lists):
            for (value, subreddit) in pop_list:
                topN[subreddit] += (final - index) * value

        sorted_subs = sorted( [(value, key) for (key, value) in topN.items()], reverse=True )[:n]
        return set(key for (value, key) in sorted_subs)
    
    
    
    # actual_code    
    defaults = total_sums(simplify_pop_dicts( findMostCommon(graph) ))
    finalN = total_sums( simplify_pop_dicts( findMostCommon(graph, final_count),
                                               defaultlist=defaults),
                           final_count )
    
    return defaults, finalN


def pickleDefaultsAndN(directory, file, defaults_N):
    pickle.dump( defaults_N, open(os.path.join(directory, file), 'wb') )
    print('Finished {}.'.format(file), end='\r')

In [40]:
sample_defaults, sampleN = saveDefaultsAndN(graph, values2names)
len(sample_defaults), len(sampleN)

(50, 100)

In [41]:
%%time
graphs_directory = '/media/jayckaiser/My Passport/reddit/user_graphs'
topsubreddits_directory = '/media/jayckaiser/My Passport/reddit/top_subs'

# files_to_do = sorted(os.listdir(graphs_directory))
files_to_do = ['RC_2012-03.pkl']

for file in files_to_do:
    graph, names2values, values2names = retrieveGraph(graphs_directory, file)
    defaults_topN = saveDefaultsAndN(graph, values2names)
    
    pickleDefaultsAndN(topsubreddits_directory, file, defaults_topN)

FileNotFoundError: [Errno 2] No such file or directory: '/media/jayckaiser/My Passport/reddit/user_graphs/RC_2012-03.pkl'

And to finally test whether this worked:

In [42]:
latest_file = os.listdir(topsubreddits_directory)[-1]

latest_defaults, latest_topN = pickle.load(open(os.path.join(topsubreddits_directory, latest_file),
                                               'rb')
                                          )
len(latest_defaults), latest_defaults

(50,
 {'AdviceAnimals',
  'Android',
  'AskReddit',
  'Bitcoin',
  'BlackPeopleTwitter',
  'CrappyDesign',
  'CringeAnarchy',
  'Games',
  'IAmA',
  'Jokes',
  'LifeProTips',
  'Music',
  'NintendoSwitch',
  'OldSchoolCool',
  'Overwatch',
  'Showerthoughts',
  'StarWarsBattlefront',
  'WTF',
  'aww',
  'dankmemes',
  'funny',
  'gaming',
  'gifs',
  'gonewild',
  'iamverysmart',
  'insanepeoplefacebook',
  'interestingasfuck',
  'leagueoflegends',
  'mildlyinfuriating',
  'mildlyinteresting',
  'movies',
  'nba',
  'news',
  'nfl',
  'nottheonion',
  'oddlysatisfying',
  'pcmasterrace',
  'personalfinance',
  'pics',
  'politics',
  'relationships',
  'science',
  'sports',
  'starterpacks',
  'technology',
  'television',
  'todayilearned',
  'trashy',
  'videos',
  'worldnews'})

In [43]:
len(latest_topN), latest_topN

(100,
 {'ATBGE',
  'AsiansGoneWild',
  'AskMen',
  'AskOuija',
  'AskWomen',
  'BetterEveryLoop',
  'BustyPetite',
  'CFB',
  'ComedyCemetery',
  'CryptoCurrency',
  'DIY',
  'Damnthatsinteresting',
  'DestinyTheGame',
  'DnD',
  'Documentaries',
  'EarthPorn',
  'FellowKids',
  'Frugal',
  'Futurology',
  'HumansBeingBros',
  'JusticeServed',
  'Justrolledintotheshop',
  'LateStageCapitalism',
  'Libertarian',
  'MadeMeSmile',
  'MapPorn',
  'MurderedByWords',
  'NSFW_GIF',
  'NatureIsFuckingLit',
  'NoStupidQuestions',
  'PS4',
  'PUBATTLEGROUNDS',
  'PetiteGoneWild',
  'PoliticalHumor',
  'PrequelMemes',
  'ProgrammerHumor',
  'PublicFreakout',
  'RealGirls',
  'RoastMe',
  'SquaredCircle',
  'StarWars',
  'StrangerThings',
  'SubredditDrama',
  'The_Donald',
  'Tinder',
  'TumblrInAction',
  'TwoXChromosomes',
  'UnethicalLifeProTips',
  'Unexpected',
  'UpliftingNews',
  'Wellthatsucks',
  'Whatcouldgowrong',
  'anime',
  'apple',
  'baseball',
  'bestof',
  'blackmagicfuckery',
 