# Subreddit Mapping using t-SNE

This was my first effort at subreddit mapping, done to test whether the idea was viable. It turned out to be quite similar to the final analysis, although I spent a while exploring some other options as well.

In [1]:
import pandas as pd
import scipy.sparse as ss
import numpy as np
from sklearn.decomposition import TruncatedSVD
import sklearn.manifold
import tsne
import re

In [2]:
raw_data = pd.read_csv('subreddit-overlap')

In [3]:
raw_data.head()

    t1_subreddit          t2_subreddit  NumOverlaps
0         roblox        spaceengineers           20
1        madlads                Guitar           29
2       Chargers            BigBrother           29
3  NetflixBestOf             celebnsfw           35
4       JoeRogan  Glitch_in_the_Matrix           28


In [4]:
subreddit_popularity = raw_data.groupby('t2_subreddit')['NumOverlaps'].sum()
subreddits = np.array(subreddit_popularity.sort_values(ascending=False).index)

In [5]:
index_map = dict(np.vstack([subreddits, np.arange(subreddits.shape[0])]).T)

In [6]:
count_matrix = ss.coo_matrix((raw_data.NumOverlaps, 
                              (raw_data.t2_subreddit.map(index_map),
                               raw_data.t1_subreddit.map(index_map))),
                             shape=(subreddits.shape[0], subreddits.shape[0]),
                             dtype=np.float64)

In [7]:
count_matrix

<56187x56187 sparse matrix of type '<type 'numpy.float64'>'
	with 15381950 stored elements in COOrdinate format>

While I was just playing around I hadn't checked whether the relevant scikit-learn functions accepted sparse matrices, so I did the row normalization by hand: each row is divided by its sum, so that every row sums to one.

In [8]:
conditional_prob_matrix = count_matrix.tocsr()
row_sums = np.array(conditional_prob_matrix.sum(axis=1))[:,0]
row_indices, col_indices = conditional_prob_matrix.nonzero()
conditional_prob_matrix.data /= row_sums[row_indices]
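For what it's worth, scikit-learn's `sklearn.preprocessing.normalize` does accept SciPy sparse matrices, so the same row normalization could probably be written as a single call. The sketch below uses a small made-up matrix (`toy_counts` is a hypothetical stand-in for `count_matrix`, with no empty rows) and checks that the two approaches agree.

```python
import numpy as np
import scipy.sparse as ss
from sklearn.preprocessing import normalize

# Hypothetical toy stand-in for count_matrix (3 subreddits, no empty rows)
toy_counts = ss.coo_matrix(
    (np.array([20.0, 29.0, 35.0, 28.0, 7.0]),
     (np.array([0, 0, 1, 2, 2]), np.array([1, 2, 0, 0, 1]))),
    shape=(3, 3))

# Hand-rolled version, mirroring the cell above
by_hand = toy_counts.tocsr()
row_sums = np.array(by_hand.sum(axis=1))[:, 0]
row_indices, _ = by_hand.nonzero()
by_hand.data /= row_sums[row_indices]

# scikit-learn version: L1-normalize each row so that it sums to 1
by_sklearn = normalize(toy_counts.tocsr(), norm='l1', axis=1)

print(np.allclose(by_hand.toarray(), by_sklearn.toarray()))
```

The L1 norm matches a plain row sum here only because all counts are non-negative.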

In [9]:
reduced_vectors = TruncatedSVD(n_components=500,
                               random_state=0).fit_transform(conditional_prob_matrix)

Again the normalisation is hand-rolled, this time scaling each row of the reduced vectors to unit length. It was not hard in this case.

In [10]:
reduced_vectors /= np.sqrt((reduced_vectors**2).sum(axis=1))[:, np.newaxis]
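The same unit-length scaling is available as `sklearn.preprocessing.normalize` with `norm='l2'`. A minimal sketch on random data (`toy_vectors` is a hypothetical stand-in for `reduced_vectors`) that compares it with the hand-rolled expression:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical stand-in for reduced_vectors: 5 points in 4 dimensions
rng = np.random.RandomState(0)
toy_vectors = rng.normal(size=(5, 4))

# Hand-rolled version, mirroring the cell above
by_hand = toy_vectors / np.sqrt((toy_vectors ** 2).sum(axis=1))[:, np.newaxis]

# scikit-learn version: scale each row to unit Euclidean length
by_sklearn = normalize(toy_vectors, norm='l2', axis=1)

row_lengths = np.sqrt((by_sklearn ** 2).sum(axis=1))
print(np.allclose(by_hand, by_sklearn), row_lengths)
```

With rows of unit length, the squared Euclidean distance between two rows equals 2 minus twice their cosine similarity, so a Euclidean method such as t-SNE is then effectively comparing rows by cosine similarity.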

Instead of LargeVis we can just use t-SNE. Some caveats: the tsne package is still quite a bit faster than the t-SNE in scikit-learn, but it only works with Python 2.

In [11]:
seed_state = np.random.RandomState(0)
subreddit_map = tsne.bh_sne(reduced_vectors[:10000], perplexity=50.0, random_state=seed_state)
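For readers on Python 3, a rough equivalent using scikit-learn's `sklearn.manifold.TSNE` (which also defaults to the Barnes-Hut approximation) might look like the following. This is a sketch on random stand-in data (`toy_vectors` and `toy_map` are hypothetical names), not the code used for the map in this notebook, and its coordinates will not match those produced by `tsne.bh_sne`.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for reduced_vectors[:10000]: 300 unit-length rows
rng = np.random.RandomState(0)
toy_vectors = rng.normal(size=(300, 20))
toy_vectors /= np.sqrt((toy_vectors ** 2).sum(axis=1))[:, np.newaxis]

# Same perplexity as above; it must be smaller than the number of samples
toy_map = TSNE(n_components=2, perplexity=50.0,
               random_state=0).fit_transform(toy_vectors)

print(toy_map.shape)
```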

In [12]:
subreddit_map_df = pd.DataFrame(subreddit_map, columns=('x', 'y'))
subreddit_map_df['subreddit'] = subreddits[:10000]
subreddit_map_df.head()

          x           y      subreddit
0  4.811215  -12.553612      AskReddit
1  4.913335  -12.591433           pics
2  4.710313  -12.528339          funny
3  5.251131   -12.89494  todayilearned
4   6.23771  -13.607648      worldnews


Clustering looks pretty much the same as it did in the final version. I played with parameters a little here, and also looked at leaf clustering as the cluster extraction method. In practice, however, the standard Excess of Mass approach was more than adequate.

In [13]:
import hdbscan

In [14]:
clusterer = hdbscan.HDBSCAN(min_samples=5, 
                            min_cluster_size=20).fit(subreddit_map)
cluster_ids = clusterer.labels_

In [15]:
subreddit_map_df['cluster'] = cluster_ids

On to the Bokeh plotting. This was still just experimenting with mapping and clustering, so I hadn't refined the plot code much. I don't do nice colormapping, for instance, but instead plot the noise and cluster points separately. There is also no adjustment of alpha channels based on zoom level. It was good enough to view the map and mouse over regions to see how well things worked.

In [16]:
from bokeh.plotting import figure, show, output_notebook, output_file
from bokeh.models import HoverTool, ColumnDataSource, value
from bokeh.models.mappers import LinearColorMapper
from bokeh.palettes import viridis
from collections import OrderedDict

output_notebook()

In [17]:
color_mapper = LinearColorMapper(palette=viridis(256), low=0, high=cluster_ids.max())
color_dict = {'field': 'cluster', 'transform': color_mapper}

plot_data_clusters = ColumnDataSource(subreddit_map_df[subreddit_map_df.cluster >= 0])
plot_data_noise = ColumnDataSource(subreddit_map_df[subreddit_map_df.cluster < 0])

tsne_plot = figure(title=u'A Map of Subreddits',
                   plot_width = 700,
                   plot_height = 700,
                   tools= (u'pan, wheel_zoom, box_zoom,'
                           u'box_select, resize, reset'),
                   active_scroll=u'wheel_zoom')
tsne_plot.add_tools( HoverTool(tooltips = OrderedDict([('subreddit', '@subreddit'),
                                                       ('cluster', '@cluster')])))


# draw clusters
tsne_plot.circle(u'x', u'y', source=plot_data_clusters,
                 fill_color=color_dict, line_alpha=0.002, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')
# draw noise
tsne_plot.circle(u'x', u'y', source=plot_data_noise,
                 fill_color=u'gray', line_alpha=0.002, fill_alpha=0.05,
                 size=10, hover_line_color=u'black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

show(tsne_plot);

The final real test was simply to print out the contents of the clusters and see if they made sense. For the most part they are pretty good, but they are less good than what LargeVis provided, with more clusters that have no clear topic. Feel free to do exactly this for the LargeVis version and you'll see what I mean.

In [18]:
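def is_nsfw(subreddit):
    # Case-sensitive match, so names such as 'GoneWildCD' are not flagged
    return re.search(r'(nsfw|gonewild)', subreddit) is not None

for cid in range(cluster_ids.max() + 1):
    # Use a separate name so the global `subreddits` array is not overwritten
    members = subreddit_map_df.subreddit[cluster_ids == cid]
    if members.map(is_nsfw).any():
        members = ' ... Censored ...'
    else:
        members = members.values

    # Parenthesised form works as written in both Python 2 and Python 3
    print('\nCluster {}:\n{}\n'.format(cid, members))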


Cluster 0:
['itookapicture' 'analog' 'photocritique' 'Nikon' 'SonyAlpha' 'Cameras'
 'postprocessing' 'AskPhotography' 'ExposurePorn' 'canon' 'Instagram' 'M43'
 'WeddingPhotography' 'photomarket' 'Lightroom' 'fujix' 'pentax'
 'shootingcars' 'AmateurPhotography' 'streetphotography' 'Polaroid' 'Leica'
 'beforeandafteredit' 'photoclass_2016' 'photoclass2017' 'LandscapeAstro']


Cluster 1:
['germany' 'German' 'Austria' 'rocketbeans' 'the_schulz' 'berlin' 'de_IAmA'
 'kreiswichs' 'edefreiheit' 'FragReddit' 'deOhneRegeln' 'Bundesliga'
 'hamburg' 'MBundestag' 'GermanRap' 'Filme' 'wien' 'Munich' '600euro'
 'ich_iel' 'einfach_posten' 'NichtDerPostillon']


Cluster 2:
['DebateReligion' 'Catholicism' 'DebateAnAtheist' 'DebateAChristian'
 'TrueChristian' 'religion' 'Reformed' 'prolife' 'OrthodoxChristianity'
 'bad_religion' 'NoFapChristians' 'AcademicBiblical' 'RadicalChristianity'
 'brokehugs' 'DebateEvolution' 'Sidehugs' 'CatholicPolitics' 'Christians'
 'ReasonableFaith' 'OpenChristian' 'Anglican



Cluster 22:
['traps' 'Sissies' 'Tgirls' 'sissyhypno' 'FemBoys' 'chastity' 'GoneWildCD'
 'crossdressing' 'GoneWildTrans' 'Tgifs' 'transporn' 'men_in_panties'
 'tbulges' 'Sissyperfection' 'BaileyJay' 'tflop' 'bigdickgirl' 'Ladyboys'
 'Rearcock' 'Sissy_humiliation' 'Shemales']


Cluster 23:
['newsokunomoral' 'BakaNewsJP' 'lowlevelaware' 'newsokuvip'
 'highlevelkarma' 'japan_anime' 'rakugakicho' 'soccer_jp' 'nintendo_jp'
 'steamr' 'newsokurMod' 'newsg' 'NewsSokuhou_R' 'jisakupc' 'ryuuseigai'
 'Oekaki_ja' 'dokusyo_syoseki_r' 'Giron' 'TableTopGames_ja' 'TV_ja'
 'apple_jp' 'modclubhouse_ja' 'eiganews' 'otoge' 'nanJ' 'kancolle_ja'
 'PuzzleAndDragons_ja' 'yuri_jp' 'lovelive_ja' 'imas_ja' 'VOCALOID_UTAU_jp'
 'sky_ja' 'Fuee' 'Radio_ja' 'Motorsports_ja' 'kokkai' 'quake_jp' 'gazou'
 'YJSNPI' 'GeinouNews' 'touhou_jp' 'MonsterHunter_jp']


Cluster 24:
 ... Censored ...


Cluster 25:
['Bitcoin' 'btc' 'Buttcoin' 'ethereum' 'ethtrader' 'BitcoinMarkets'
 'Monero' 'CryptoCurrency' 'bitcoinxt' 'BitTipper



Cluster 46:
['trees' 'vaporents' 'eldertrees' 'microgrowery' 'saplings'
 'StonerEngineering' 'Marijuana' 'CannabisExtracts' 'see' 'highdeas' 'mflb'
 'SpaceBuckets' 'canadients' 'treedibles' 'EntExchange' 'CBD' 'COents'
 'weed' 'Waxpen' 'glassheads' 'cannabis' 'ploompax' 'Autoflowers' 'rosin'
 'chinaglass' 'StonerProTips' 'abv' 'portabledabs' 'GrassHopperVape' 'MMJ'
 'HerbGrow' 'macrogrowery' 'cannabiscultivation' 'Pieces' 'arizer' 'lbregs'
 'Petioles' 'lampwork' 'timetolegalize']


Cluster 47:
['gadgets' 'sysadmin' 'talesfromtechsupport' 'techsupport'
 'techsupportgore' 'cordcutters' 'geek' 'raspberry_pi' 'tech' 'homelab'
 'google' 'Piracy' 'DataHoarder' 'networking' 'PleX' 'techsupportmacgyver'
 'hometheater' 'homeautomation' 'HomeNetworking' 'iiiiiiitttttttttttt'
 'soylent' 'Ingress' 'amazonecho' 'trackers' 'computers' 'cableporn'
 'torrents' 'retrobattlestations' 'software' 'linuxadmin' 'amazon'
 'AskNetsec' 'chrome' 'Addons4Kodi' 'vmware' 'kodi' 'PowerShell'
 'computertechs' 'meg



Cluster 70:
['pitbulls' 'Pets' 'dogpictures' 'Dogtraining' 'corgi' 'germanshepherds'
 'goldenretrievers' 'pugs' 'Dachshund' 'AskVet' 'puppy101' 'WiggleButts'
 'husky' 'shiba' 'Horses' 'labrador' 'Greyhounds' 'Bulldogs' 'lookatmydog'
 'AustralianCattleDog' 'Animals' 'Chihuahua' 'Boxer' 'puppies'
 'BostonTerrier' 'BorderCollie' 'greatdanes' 'beagle' 'AustralianShepherd'
 'IDmydog' 'Rottweiler' 'VetTech' 'siberianhusky' 'Petloss' 'basset'
 'DobermanPinscher' 'samoyeds']


Cluster 71:
['doge' 'chicksfalling' 'creepy_gif' 'catvideos' 'SuperSaiyanGifs'
 'TheCatTrapIsWorking' 'sfw_wtf' 'HQRG' 'CatTaps' 'thatshitsfunny'
 'oddlyweird' 'nflgifs' 'upvote' 'longstabbything' 'kittengifs'
 'ReversedGIFS' 'bears' 'TrollDevelopers' 'K_gifs' 'starwarsgifs'
 'BadRocketLeagueGoals' 'reverseanimalrescue' 'SpaceGifs' 'drugmemes'
 'PolyBridge' 'RetroGamePorn' 'PlayingWithFire' 'GirlsDancingAwkwardly'
 'HumansBeingJerks' 'PeopleAlmostDying' 'awwgifs' 'Snek' 'todayiwaslucky'
 'MemesIRL' 'FirePorn' 'ZoomingG



Cluster 90:
['streetwear' 'Sneakers' 'Kanye' 'Repsneakers' 'FashionReps'
 'supremeclothing' 'malefashion' 'sneakermarket' 'FreeKarma' 'DesignerReps'
 'streetwearstartup' 'TeenMFA' 'Aliexpress' 'FreeKarmas' 'TeenFA'
 'FreeKarma4You' 'chanzhfsneakers' 'PalaceClothing' 'kicksmarket' 'RepTime'
 'adidasatc' 'YeezyOrDie' 'bapeheads' 'FashionRepsBST']


Cluster 91:
['hiphopheads' 'FrankOcean' 'donaldglover' 'HipHopImages' 'ChanceTheRapper'
 'TheWeeknd' 'hiphop101' 'OFWGKTA' 'sadboys' 'KidCudi' 'KendrickLamar'
 'Hiphopcirclejerk' 'TeamSESH' 'trapmuzik' 'ChiefKeef' 'liluglymane' 'G59'
 'travisscott' 'XXXTENTACION' 'KanyeLeaks' 'freshalbumart' 'mfdoom'
 'Drizzy']


Cluster 92:
['pokemongo' 'TheSilphRoad' 'ClashOfClans' 'summonerswar'
 'PuzzleAndDragons' 'bravefrontier' 'FFBraveExvius' 'FFRecordKeeper'
 'Maplestory' 'FireEmblemHeroes' 'SWGalaxyOfHeroes' 'vainglorygame'
 'pokemongodev' 'DuelLinks' 'MobiusFF' 'crusadersquest' 'battlecats'
 'CastleClash' 'pokemonduel' '7kglobal' 'Granblue_en' 'Mem



Cluster 101:
['teenagers' 'dogecoin' 'Cubers' 'Nerf' 'test' 'PictureGame' 'Throwers'
 'Diepio' 'FRC' 'pokemongoyellow' 'PokemonGoMystic' 'joinrobin' 'trollabot'
 '2b2t' 'robintracking' 'ButtonAftermath' 'marchingband' 'FTC'
 'PokemonGOValor' 'Slitherio']


Cluster 102:
['sweden' 'de' 'france' 'thenetherlands' 'singapore' 'brasil' 'kpop'
 'Philippines' 'mexico' 'italy' 'Denmark' 'portugal' 'argentina' 'Suomi'
 'belgium' 'southafrica' 'indonesia' 'malaysia' 'croatia' 'aoe2' 'norge'
 'Polska' 'drunkenpeasants' 'vzla' 'greece' 'serbia' 'Norway' 'ukraina'
 'poland' 'chile' 'Switzerland' 'hungary' 'lebanon' 'Iceland' 'podemos'
 'Romania' 'IBO' 'spain' 'bih' 'Slovenia' 'Eesti' 'czech' 'Barcelona'
 'albania' 'azerbaijan' 'bulgaria' 'latvia' 'politota' 'lithuania']


Cluster 103:
['ffxiv' 'Guildwars2' 'elderscrollsonline' 'MMORPG' 'blackdesertonline'
 'swtor' 'bladeandsoul' 'wowservers' 'WildStar' 'archeage' 'marvelheroes'
 'treeofsavior' 'DFO' 'TeraOnline' 'ffxi' 'Skyforge' 'everquest' 'PSO2



Cluster 112:
['funkopop' 'ActionFigures' 'powerrangers' 'dvdcollection' 'Legodimensions'
 'bioniclelego' 'TMNT' 'lootcrate' 'comiccon' 'funkoswap' 'Legomarket'
 'toyexchange' 'legostarwars' 'legotrade' 'NYCC' 'lootcratespoilers'
 'Voltron' 'familyguythegame' 'uvtrade' 'legodeal' 'supersentai'
 'Steelbooks' 'CollectorCorps']


Cluster 113:
['kulchasimulator' 'boburnham' 'UnknownTradeCo' 'PlanetDolan' 'Brunei'
 'Parkour' 'DieAntwoord' 'TheReportOfTheWeek' '90sHipHop' 'origami'
 'Movie_Club' 'inbou_ja' 'that_Poppy' 'hampan' 'juggling' 'Cello'
 'cardistry' 'terracehouse' 'Korn' 'TastyFood' 'videogames' 'KatarinaMains'
 'AFireInside' 'badselfeater' 'DiamondClub' 'MumkeysAnimeReviews'
 'rangerland' 'PinkOmega' 'Redboid' 'hiphop' 'KISS' 'rule34PS2'
 'Crewniverse' 'bboy' 'quizzyBot' 'BulletBarry' 'Draven' 'scaredshitless'
 'kingcobrajfs' 'WoahTunes' 'cpop' 'makemeaplaylist' 'CartoonNetwork'
 'thelastpsychiatrist' 'akalimains' 'chillstep' 'electroswing' 'youboobers'
 'ThatSnobEmpire' 'CosmicD