# Subreddit Mapping with Positive Pointwise Mutual Information

In the FiveThirtyEight analysis, the vectors fed into the vector algebra operations were built from the positive pointwise mutual information (PPMI) of the commenter overlap counts. While I felt that conditional probabilities would be better, I also wanted to see what the results look like under PPMI and related measures.

In [1]:
import pandas as pd
import scipy.sparse as ss
import numpy as np
from sklearn.decomposition import TruncatedSVD
import sklearn.manifold
import sklearn.preprocessing
import tsne
import re

In [2]:
raw_data = pd.read_csv('subreddit-overlap')

In [3]:
raw_data.head()

Unnamed: 0,t1_subreddit,t2_subreddit,NumOverlaps
0,roblox,spaceengineers,20
1,madlads,Guitar,29
2,Chargers,BigBrother,29
3,NetflixBestOf,celebnsfw,35
4,JoeRogan,Glitch_in_the_Matrix,28


In [4]:
subreddit_popularity = raw_data.groupby('t2_subreddit')['NumOverlaps'].sum()
subreddits = np.array(subreddit_popularity.sort_values(ascending=False).index)

In [5]:
# map each subreddit name to its row/column index (most popular first)
index_map = {name: i for i, name in enumerate(subreddits)}

In [6]:
count_matrix = ss.coo_matrix((raw_data.NumOverlaps, 
                              (raw_data.t2_subreddit.map(index_map),
                               raw_data.t1_subreddit.map(index_map))),
                             shape=(subreddits.shape[0], subreddits.shape[0]),
                             dtype=np.float64)

In [7]:
count_matrix

<56187x56187 sparse matrix of type '<type 'numpy.float64'>'
	with 15381950 stored elements in COOrdinate format>
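
The construction above can be sketched on a toy table. This is only an illustration under stated assumptions: the `records` list and its three subreddit names are made up, and plain Python stands in for the pandas `groupby` and `map` calls.

```python
import numpy as np
import scipy.sparse as ss

# Hypothetical overlap records: (t1_subreddit, t2_subreddit, NumOverlaps)
records = [('a', 'b', 3), ('b', 'a', 3), ('a', 'c', 1), ('c', 'a', 1)]

# Total overlap per t2 name, then rank names from most to least popular
totals = {}
for _, t2, n in records:
    totals[t2] = totals.get(t2, 0) + n
names = sorted(totals, key=lambda s: (-totals[s], s))
index_map = {name: i for i, name in enumerate(names)}

# Rows are indexed by t2 and columns by t1, as in the cell above
rows = [index_map[t2] for _, t2, _ in records]
cols = [index_map[t1] for t1, _, _ in records]
vals = [n for _, _, n in records]
toy_counts = ss.coo_matrix((vals, (rows, cols)),
                           shape=(len(names), len(names)),
                           dtype=np.float64)
```

Sorting by popularity first means that the most active subreddits sit in the leading rows, which is what makes slicing with `[:10000]` later on meaningful.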

Things proceed the same as in the other analyses up to the construction of the count matrix. Now, instead of row normalizing to get conditional probabilities, we are going to compute the pointwise mutual information. Given events $A$ and $B$, we define the pointwise mutual information of $A$ and $B$ to be
$$
\text{PMI}(A, B) = \log\left(\frac{P(A, B)}{P(A)P(B)}\right);
$$
that is, the PMI is the log of the ratio of the joint probability of $A$ *and* $B$ occurring to the product of the probabilities of $A$ and $B$ taken separately (which is what the joint probability would be if the two events were independent). To start we'll just compute $P(A)$ and $P(B)$ by row and column normalizing.

In [8]:
row_normalized = sklearn.preprocessing.normalize(count_matrix.tocsr(), norm='l1')
col_normalized = sklearn.preprocessing.normalize(count_matrix.tocsr(), norm='l1', axis=0)
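
To see what these two normalizations produce, here is a small sketch with plain NumPy on a made-up dense matrix. It assumes the counts are non-negative, in which case `norm='l1'` amounts to dividing each row (or column) by its sum; the `toy` values are hypothetical.

```python
import numpy as np

# Hypothetical 3x3 count matrix
toy = np.array([[2., 2., 0.],
                [1., 0., 3.],
                [0., 4., 4.]])

# Row normalization: entry (i, j) becomes the share of row i's total in column j
row_l1 = toy / toy.sum(axis=1, keepdims=True)

# Column normalization: entry (i, j) becomes the share of column j's total in row i
col_l1 = toy / toy.sum(axis=0, keepdims=True)
```

After this, every row of `row_l1` and every column of `col_l1` sums to 1, so each entry is a proportion within its own row or column rather than a proportion of the grand total.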

Next we compute the denominator by multiplying $P(A)$ by $P(B)$ pointwise across the whole matrix.

In [9]:
pmi_denominator = row_normalized.copy()
pmi_denominator.data = row_normalized.data * col_normalized.data
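
Multiplying the two `.data` arrays directly lines entries up correctly only if both matrices store their non-zeros in the same order. Depending on the scikit-learn version, normalizing with `axis=0` may hand back a CSC matrix, whose `.data` is ordered column by column, so it is worth checking `col_normalized.format`. The following is a minimal sketch, on a made-up matrix and using only SciPy, of an elementwise product that does not depend on storage order.

```python
import numpy as np
import scipy.sparse as ss

# Hypothetical count matrix
counts = ss.csr_matrix(np.array([[1., 2., 0.],
                                 [0., 3., 4.],
                                 [5., 0., 6.]]))

# The same entries are listed in a different order in CSR and CSC storage
same_order = np.array_equal(counts.tocsr().data, counts.tocsc().data)

# L1-normalize rows and columns by scaling with diagonal matrices
row_sums = np.asarray(counts.sum(axis=1)).ravel()
col_sums = np.asarray(counts.sum(axis=0)).ravel()
row_norm = ss.diags(1.0 / row_sums).dot(counts).tocsr()
col_norm = counts.dot(ss.diags(1.0 / col_sums)).tocsc()  # deliberately CSC

# multiply() pairs entries by (row, column) position, whatever the format
product = row_norm.multiply(col_norm).tocsr()
expected = row_norm.toarray() * col_norm.toarray()
```

Here `same_order` is `False`, while `product` agrees with the dense elementwise product.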

The numerator is the joint probability, which is just a matter of normalizing by the sum over all entries.

In [10]:
pmi_numerator = count_matrix.tocsr()
pmi_numerator /= pmi_numerator.sum()

We can then compute the PMI by taking the log of the numerator over the denominator. One alternative I explored was simply forcing all values to be positive by shifting them by the minimum value (of the non-zero entries). It didn't help.

In [11]:
pmi_matrix = pmi_numerator.copy()
pmi_matrix.data = np.log(pmi_numerator.data / pmi_denominator.data)
# pmi_matrix.data += pmi_matrix.min()
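
For comparison, here is a sketch of PMI computed straight from the definition, with $P(A)$ and $P(B)$ taken as marginals (row and column totals divided by the grand total). This is an assumption about the textbook form, not a reproduction of the cells above; `pmi_from_counts` is a hypothetical helper name and both toy matrices are made up.

```python
import numpy as np
import scipy.sparse as ss

def pmi_from_counts(counts):
    """Hypothetical helper: PMI of every stored entry of a sparse count matrix."""
    coo = ss.coo_matrix(counts, dtype=np.float64)
    total = coo.data.sum()
    # Marginal probabilities from row and column totals
    p_row = np.asarray(coo.sum(axis=1)).ravel() / total
    p_col = np.asarray(coo.sum(axis=0)).ravel() / total
    # Joint probability of each stored (row, column) pair
    joint = coo.data / total
    pmi = np.log(joint / (p_row[coo.row] * p_col[coo.col]))
    return ss.coo_matrix((pmi, (coo.row, coo.col)), shape=coo.shape).tocsr()

# All mass on the diagonal: rows and columns are perfectly associated
diag_counts = ss.csr_matrix(np.array([[4., 0.], [0., 4.]]))
pmi_diag = pmi_from_counts(diag_counts)

# Uniform counts: rows and columns are independent
flat_counts = ss.csr_matrix(np.array([[2., 2.], [2., 2.]]))
pmi_flat = pmi_from_counts(flat_counts)
```

On the diagonal matrix every stored entry has PMI $\log 2$, and on the uniform matrix every entry has PMI 0, matching the reading of PMI as a measure of departure from independence.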

The positive pointwise mutual information is found by simply eliminating any negative values: we truncate them all to 0. This is easily done via numpy.

In [12]:
ppmi_matrix = pmi_matrix.copy()
ppmi_matrix.data = np.where(ppmi_matrix.data > 0, ppmi_matrix.data, 0)
ppmi_matrix.eliminate_zeros()
ppmi_matrix

<56187x56187 sparse matrix of type '<type 'numpy.float64'>'
	with 6427505 stored elements in Compressed Sparse Row format>

We now return to our regularly scheduled programming ...

In [13]:
reduced_vectors = TruncatedSVD(n_components=500,
                               random_state=0).fit_transform(ppmi_matrix)

In [14]:
reduced_vectors = sklearn.preprocessing.normalize(reduced_vectors[:10000], norm='l2')

In [15]:
seed_state = np.random.RandomState(0)
subreddit_map = tsne.bh_sne(reduced_vectors[:10000], perplexity=50.0, random_state=seed_state)

In [16]:
subreddit_map_df = pd.DataFrame(subreddit_map, columns=('x', 'y'))
subreddit_map_df['subreddit'] = subreddits[:10000]
subreddit_map_df.head()

Unnamed: 0,x,y,subreddit
0,17.666521,-8.146278,AskReddit
1,16.331204,-3.281477,pics
2,16.108943,-2.749507,funny
3,15.659676,-2.525058,todayilearned
4,15.624509,-2.579759,worldnews


In [17]:
import hdbscan

In [18]:
clusterer = hdbscan.HDBSCAN(min_samples=5, 
                            min_cluster_size=20).fit(subreddit_map)
cluster_ids = clusterer.labels_

In [19]:
subreddit_map_df['cluster'] = cluster_ids

In [20]:
from bokeh.plotting import figure, show, output_notebook, output_file
from bokeh.models import HoverTool, ColumnDataSource, value
from bokeh.models.mappers import LinearColorMapper
from bokeh.palettes import viridis
from collections import OrderedDict

output_notebook()

In [21]:
color_mapper = LinearColorMapper(palette=viridis(256), low=0, high=cluster_ids.max())
color_dict = {'field': 'cluster', 'transform': color_mapper}

plot_data_clusters = ColumnDataSource(subreddit_map_df[subreddit_map_df.cluster >= 0])
plot_data_noise = ColumnDataSource(subreddit_map_df[subreddit_map_df.cluster < 0])

tsne_plot = figure(title=u'A Map of Subreddits',
                   plot_width = 700,
                   plot_height = 700,
                   tools= (u'pan, wheel_zoom, box_zoom,'
                           u'box_select, resize, reset'),
                   active_scroll=u'wheel_zoom')
tsne_plot.add_tools( HoverTool(tooltips = OrderedDict([('subreddit', '@subreddit'),
                                                       ('cluster', '@cluster')])))


# draw clusters
tsne_plot.circle(u'x', u'y', source=plot_data_clusters,
                 fill_color=color_dict, line_alpha=0.002, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')
# draw noise
tsne_plot.circle(u'x', u'y', source=plot_data_noise,
                 fill_color=u'gray', line_alpha=0.002, fill_alpha=0.05,
                 size=10, hover_line_color=u'black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

show(tsne_plot);

Not exactly the two-dimensional embedding we were looking for. We can look at the individual clusters as well ...

In [22]:
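def is_nsfw(subreddit):
    # True when the name contains 'nsfw' or 'gonewild' (case-sensitive)
    return re.search(r'(nsfw|gonewild)', subreddit) is not None

for cid in range(cluster_ids.max() + 1):
    # use a separate name so the global `subreddits` array is not overwritten
    members = subreddit_map_df.subreddit[cluster_ids == cid]
    if members.map(is_nsfw).any():
        members = ' ... Censored ...'
    else:
        members = members.values

    # print() with a single argument behaves the same in Python 2 and 3
    print('\nCluster {}:\n{}\n'.format(cid, members))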


Cluster 0:
 ... Censored ...


Cluster 1:
 ... Censored ...


Cluster 2:
['friendsafari' 'GlobalOffensiveTrade' 'Pokemongiveaway' 'AskUK'
 'CasualPokemonTrades' 'Liberal' 'SVExchange' 'falloutlore' 'PerfectTiming'
 'worstof' 'ACTrade' 'AskTrollX' 'BDSMcommunity' 'spicy' 'Dogtraining'
 'IndieGaming' 'electricians' 'curlyhair' 'ak47' 'lockpicking' 'beerporn'
 'bourbon' 'linguistics' 'montreal' 'Welding' 'buffy' 'CatsStandingUp'
 'redditgetsdrawn' 'stephenking' 'BeforeNAfterAdoption' 'AdoptMyVillager'
 'makeupexchange' 'liberalgunowners' 'misc' 'TryingForABaby' 'TheDarkTower'
 'longisland' 'DragonsDogma' 'coins' 'discworld' 'UKPersonalFinance'
 'DepthHub' 'tasker' 'kindle' 'PAX' 'piercing' 'metalworking' 'dataisugly'
 'galaxys5' 'Roku' 'warcraftlore' 'NewsOfTheWeird' 'htpc' 'dvdcollection'
 'TheCreatures' 'HeistTeams' 'AdvancedMicroDevices' 'logodesign' 'NFA'
 'Tools' 'Cascadia' 'UBC' 'SRSDiscussion' 'tales' 'RateMyMayor'
 'VictoriaBC' 'acturnips' 'beertrade' 'Disney_Infinity'
 'randomac



Cluster 15:
['AdviceAnimals' 'SquaredCircle' 'UpliftingNews' 'Fantasy' 'lgbt'
 'Portland' 'engineering' 'spaceengineers' 'FFXV' 'arduino' 'CHIBears'
 'pittsburgh' 'dishonored' 'bisexual' 'ChildrenFallingOver' 'ufc' 'mexico'
 'Madden' 'AskDocs' 'Trucks' 'trailerparkboys' 'GalaxyS6' 'Incels' 'infp'
 'POLITIC' 'canucks' 'TalesFromTheFrontDesk' 'Sacramento'
 'ledootgeneration' 'flashlight' 'NorthCarolina' 'NYKnicks' 'boxoffice'
 'RedditLaqueristas' 'FixedGearBicycle' 'insurgency'
 'EmpireDidNothingWrong' 'C_S_T' 'grilledcheese' 'VideoEditing'
 'rpg_gamers' 'Wordpress' 'mycology' 'gamingpc' 'RoastMyCar'
 'cinematography' 'BackYardChickens' 'AdvancedRunning' 'Rochester'
 'AppalachianTrail' 'madisonwi' 'MouseReview' 'hockeyjerseys' 'Finland'
 'Harley' 'spaceflight' 'csgo' 'Forex' 'RandomKindness' 'trance'
 'kickstarter' 'sixwordstories' 'Bandnames' 'perktv' 'uktrees' 'dirtyr4r'
 'AskVet' 'tesdcares' 'TechoBlanco' 'timelapse' 'CivEx']


Cluster 16:
['3DS' 'natureismetal' 'zelda' 'socialskill



Cluster 41:
['ImageComics' 'AmateurArchives' 'bostontrees' 'btcfork' 'hotsauce'
 'HaloCirclejerk' 'LavignyInquisition' 'lisp_ja' 'KTymee'
 'tallfashionadvice' 'selfservice' 'TronMTG' 'amishadowbanned'
 'Hitomi_Tanaka' 'IslamUnveiled' 'USE2016' 'zizek' 'toukenranbu'
 'girlsdoporn' 'wtfamazon']


Cluster 42:
['CompTIA' 'juggling' 'OurFlatWorld' 'randomsexygifs' 'thelastpsychiatrist'
 'akalimains' 'LLLikeAGlove' 'processing' 'shorthairchicks' 'Iraq' 'banjo'
 'PlantsVSZombies' 'JMT' 'bladerunner' 'Team_Robin' 'NuclearPower' 'ARAM'
 'jacksonandterrellgifs' 'github' 'JustBlowjobGifs' 'JuneBumpers2017'
 'teamimpulse' 'oxford']


Cluster 43:
 ... Censored ...


Cluster 44:
['overlord' 'Osaka' 'papertowns' 'UCSantaBarbara' 'CodersForSanders' 'Lisk'
 'onebros' 'NSALeaks' 'keriberry_420' 'araragigirls' 'Avengers' 'gulag'
 'gilf' 'Temple' 'NBAForums' 'ImaginaryElves' 'continuityporn' 'MikePatton'
 'spacesimgames' 'JonKortajarenaDL' 'mumbai' 'AnimalHeists'
 'FirstWorldConformists' 'GMAT' 'genderf



Cluster 67:
['castles' 'Hedgehog' 'crystalpalace' 'Kikpals' 'ReAlSaltLake'
 'modelSupCourt' 'Chromalore' 'whynotasource' 'logic' 'JulyBumpers2017'
 'HistoricalWorldPowers' 'firefall' 'omeuprimeirofilme3d16' 'gout'
 'rocket_league_trading' 'pokemongola' 'programmerreactions' 'KarmaForFree'
 'GIMP' 'starwarsmemes' 'RepTime' 'Medievalart' 'BubbleButts' 'KanMusu'
 'ELATeachers' 'QuadCities' 'squirting' 'EscapeTheBucket' 'greygoo'
 'CumKiss' 'YeezyOrDie' 'shareItWithMe' 'LucieWildeIsRetarded' 'pdxgunnuts'
 'billythefridge' 'stickyhentai' 'highfivefails' 'Petioles' 'nightlyshow']


Cluster 68:
['VintageApple' 'mindcrackcirclejerk' 'BreastEnvy' 'Bacon'
 'ImaginaryHorrors' 'genetics' 'hdtgm' 'OneyPlays' 'Saggy' 'alltheleft'
 'hamburg' 'Digibro' 'ctbeer' 'WastelandPowers' 'HairyPussy' 'GasTheKikes'
 'storage' 'Sep2015Event' 'learnspanish' 'TrendingReddits' 'adelaidefc'
 'ModelForeignAffairs' 'rHermitcraft_UHC' 'pentax' 'getnarwhal' 'rwbyRP'
 'momsbox' 'horrorlit' 'BustyNaturals' 'Unashamed' '



Cluster 94:
['blackops2' 'GetStudying' 'uncensorship' 'britpics' 'ironmaiden'
 'OverwatchLFT' 'TomorrowWorld' 'Socialistart' 'brownbeauty'
 'mapporncirclejerk' 'TheCube' 'waiting_to_try' 'GoTlinks' 'videos_Youtube'
 'PlazaAragon' 'cazzeggio' 'motogif' 'movie_ja' 'youtubefun' 'agnostic'
 'PokemonTCG' 'sharepoint' 'lightingdesign' 'badukpolitics'
 'TechnologyProTips' 'transporn' 'AngieVaronaLegal' 'Opeth'
 'PlantBasedDiet' 'marvelmemes' 'stupidslutsclub' 'humblebundles'
 'feminineboys' 'gridcoin' 'MaymayZone' 'shorthairedwaifus' 'oddlyweird'
 'AdiposeAmigos' 'ArtJunkie' 'ultrarunning' 'maui' 'tequila'
 'ImaginaryHellscapes' 'CabinPorn' 'CyberPsychology' 'astateoftrance'
 'SmiteOceanic' 'TrollRPG' 'Awwww' 'jquery' 'DanceDanceRevolution'
 'PixelCarRacer' 'cbradio' 'Dermatology']


Cluster 95:
['PennStateUniversity' 'animeplot' 'TelevisionRatings' 'amifat' 'inbou_ja'
 'ShittyTechSupport' 'AnimalsBeingGeniuses' 'openbroke' 'comeonandslam'
 'WorldofTanksXbox' 'XXXTENTACION' 'tycoon' 'musici



Cluster 121:
['AskBattlestations' 'CBRBattleRoyale' 'fallenlondon' 'nexusplayer' 'Filme'
 'StormfrontorSJW' 'TheHearth' 'july12' 'ArenaHS' 'bash' 'Mabinogi'
 'TuberSim' 'SmashingPumpkins' 'pkmntcgreferences' 'AmateurCollegeGirls'
 'tiltshift' 'LeagueOfVideos' 'UKGreens' 'BlackwellAcademy'
 'ImaginaryGotham' 'kpopgfys' 'BloodGulchRP' 'rubberducks' 'FenerbahceSK'
 'suboxone' 'Prague' 'redheadxxx' 'AvaAddams' 'WhiteCheeks']


Cluster 122:
['TheXanaxCartel' 'unitedstatesofamerica' 'battlefield_live' 'VXJunkies'
 'scienceofdeduction' 'StonerProTips' 'meat' 'AustinBeer'
 'economicCollapse' 'PokemonGOValor' 'rule34PS2' 'NakedProgress' 'vulkan'
 'TorontoRevolt' 'indepthstories' 'Tsunderes' 'NCIS' 'AdrenalineFans'
 'dolphinconspiracy' 'Nekomimi' 'stellarisgame' 'Lyme' 'ZHU' 'b00b3d']


Cluster 123:
['katyperry' 'zyzz' 'gpdwin' 'PokemonGOToronto' 'girls' 'Subredditads'
 'TheOriginals' 'jilling' 'Clemson' 'ParisComments' 'hackernews' 'RIPDotA2'
 'playstationvr' 'freedonuts' 'bigdickjoy' 'ToMetr