# Mapping Reddit

[Reddit](http://reddit.com) is a discussion board that bills itself as the "Front Page of the Internet". It is divided into a large number of topic-specific "subreddits". In this demo, we'll take data about which subreddits active Reddit users post to and use it to make a visual map of subreddits. The data comes from http://figshare.com/articles/reddit_user_posting_behavior/874101, and you can find a pure-Python version of this demo using `scikit-learn` at http://opensource.datacratic.com/mtlpy50/.

## Initializing `pymldb`

In this demo, we will use `pymldb` to interact with the [REST API](/doc/#builtin/WorkingWithRest.md.html): see the [Using `pymldb` Tutorial](/doc/nblink.html#_tutorials/Using pymldb Tutorial) for more details.

In [1]:
from pymldb import Connection
mldb = Connection("http://localhost")

## Loading up the raw data

In [2]:
mldb.put('/v1/procedures/import_reddit', { 
    "type": "import.text",  
    "params": { 
        "dataFileUrl": "http://files.figshare.com/1310438/reddit_user_posting_behavior.csv.gz",
        'delimiter':'', 
        'quotechar':'',
        'outputDataset': 'reddit_raw',
        'runOnCreation': True
    } 
})


And here is what our raw dataset looks like. The `lineText` column will need to be parsed: it's comma-delimited, with the first token being a user ID and the remaining tokens being the set of subreddits that user contributed to.

In [3]:
mldb.query("select * from reddit_raw limit 5")

| _rowName | lineText |
|---|---|
| 471242 | 1094849,politics,fffffffuuuuuuuuuuuu,askscienc... |
| 770157 | 2112642,AdviceAnimals,technology |
| 592067 | 1459838,AdviceAnimals,kindlefire,electronics,t... |
| 172925 | 319466,fffffffuuuuuuuuuuuu,AskReddit,pics,funny |
| 495371 | 1160486,mylittlepony,tf2trade,Dota2Trade,tf2 |


## Transforming the raw data into a sparse matrix


We will create and run a [Procedure](/doc/#builtin/procedures/Procedures.md.html) of type [`transform`](/doc/#builtin/procedures/TransformDataset.md.html). The `tokenize` function will project out the subreddit names into columns.

In [4]:
mldb.put('/v1/procedures/reddit_import', {
    "type": "transform",
    "params": {
        "inputData": "select tokenize(lineText, {offset: 1, value: 1}) as * from reddit_raw",
        "outputDataset": "reddit_dataset",
        "runOnCreation": True
    }
})

Here is the resulting dataset: it's a sparse matrix with a row per user and a column per subreddit, where the cells are `1` if the row's user was a contributor to the column's subreddit, and `null` otherwise.

In [5]:
mldb.query("select * from reddit_dataset limit 5")

| _rowName | FoodPorn | japan | food | JusticePorn | Music | headphones | progmetal | Drugs | AskReddit | todayilearned | ... | tasker | Autos | Coffee | steamdeals | electronics | androidapps | tf2 | Dota2Trade | tf2trade | mylittlepony |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 471242 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 770157 |  |  |  |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 592067 |  |  |  |  |  |  |  |  | 1.0 |  | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |  |  |  |  |
| 172925 |  |  |  |  |  |  |  |  | 1.0 |  | ... |  |  |  |  |  |  |  |  |  |  |
| 495371 |  |  |  |  |  |  |  |  |  |  | ... |  |  |  |  |  |  | 1.0 | 1.0 | 1.0 | 1.0 |
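
To make the shape of this output concrete, here is a rough plain-Python emulation of what the `tokenize(lineText, {offset: 1, value: 1})` call appears to do, going by the description above: skip the first token (the user ID) and emit one column per remaining token with the value `1`. The helper name `tokenize_line` is hypothetical and this is not MLDB's implementation. The two sample lines are copied from the raw dataset preview.

```python
def tokenize_line(line_text, offset=1, value=1):
    """Hypothetical stand-in for MLDB's tokenize(): one {token: value} entry
    per comma-separated token, skipping the first `offset` tokens."""
    tokens = line_text.split(",")
    return {token: value for token in tokens[offset:]}

# Two rows copied from the reddit_raw preview, keyed by row name.
raw = {
    "770157": "2112642,AdviceAnimals,technology",
    "495371": "1160486,mylittlepony,tf2trade,Dota2Trade,tf2",
}

# Sparse representation: each row only stores the columns it has a value for.
sparse_rows = {name: tokenize_line(text) for name, text in raw.items()}

# Dense view for display: the union of all column names, with None standing
# in for the `null` cells of the MLDB dataset.
columns = sorted({col for row in sparse_rows.values() for col in row})
dense_rows = {
    name: [row.get(col) for col in columns] for name, row in sparse_rows.items()
}

print(columns)
for name, cells in dense_rows.items():
    print(name, cells)
```

Storing only the non-null cells matters here because most users post to a tiny fraction of all subreddits.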


## Dimensionality Reduction with Singular Value Decomposition (SVD)


We will create and run a [Procedure](/doc/#builtin/procedures/Procedures.md.html) of type [`svd.train`](/doc/#builtin/procedures/Svd.md.html).

In [6]:
mldb.put('/v1/procedures/reddit_svd', {
    "type" : "svd.train",
    "params" : {
        "trainingData" : """
            SELECT 
                COLUMN EXPR (AS columnName() ORDER BY rowCount() DESC, columnName() LIMIT 4000) 
            FROM reddit_dataset
        """,
        "columnOutputDataset" : "reddit_svd_embedding",
        "runOnCreation": True
    }
})


The result of this operation is a new dataset with a row per subreddit for the 4000 most-active subreddits and columns representing coordinates for that subreddit in a 100-dimensional space. 

**Note:** the row names are the subreddit names followed by "|1" because the SVD training procedure interpreted the input matrix as categorical rather than numerical.

In [7]:
mldb.query("select * from reddit_svd_embedding limit 5")

| _rowName | svd0000 | svd0001 | svd0002 | svd0003 | svd0004 | svd0005 | svd0006 | svd0007 | svd0008 | svd0009 | ... | svd0090 | svd0091 | svd0092 | svd0093 | svd0094 | svd0095 | svd0096 | svd0097 | svd0098 | svd0099 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| supremeclothing\|1 | -0.00027 | -0.00058 | 8.6e-05 | -0.00018 | 1e-06 | 0.000441 | -0.000414197 | -7.2e-05 | -0.000346 | 0.000471 | ... | -0.007272 | -0.001051 | 0.00052 | -0.001242 | 0.000601 | -0.000144 | -0.000347 | 0.001039 | -0.000815 | 0.001911 |
| Dublin\|1 | -0.000261 | 6.8e-05 | 0.000329 | 0.000202 | 0.000228 | 0.000115 | -0.0001676869 | 1.2e-05 | 0.000125 | 0.00015 | ... | 0.000497 | 9.5e-05 | 0.000299 | -0.000575 | -0.000615 | -0.000199 | -0.000467 | -0.000356 | -0.000539 | 0.000318 |
| AsianBeauty\|1 | -3.5e-05 | -8.8e-05 | 5.8e-05 | 1.6e-05 | -3.1e-05 | 4.9e-05 | -3.195397e-05 | 5e-06 | 2.5e-05 | -8.2e-05 | ... | -0.000192 | -0.000434 | -0.000344 | 0.000568 | 6e-05 | 0.001428 | -0.000145 | -0.000353 | -0.000258 | 0.00031 |
| YGOBinders\|1 | -2.4e-05 | -7.5e-05 | -2.8e-05 | -7.7e-05 | -2.9e-05 | -4.8e-05 | -9.98669e-07 | -4.9e-05 | -2e-05 | -4e-06 | ... | -0.000113 | 9e-06 | 5.9e-05 | 0.000197 | 0.000253 | -1.7e-05 | 3e-06 | -8.2e-05 | -8.4e-05 | 0.00011 |
| thick\|1 | -0.000901 | 0.000782 | -0.000218 | 0.00046 | 0.000541 | 0.001293 | 3.014595e-05 | -0.000296 | -0.001294 | 0.000387 | ... | -0.000966 | -0.000637 | -0.000264 | -0.000638 | -0.001522 | -0.000498 | 0.001079 | 0.00176 | -0.000999 | 0.000857 |
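
For readers who want to see the idea behind a column embedding outside of MLDB, here is a minimal NumPy sketch on a toy matrix. It is an illustration of truncated SVD in general, not a description of how `svd.train` is implemented: the matrix is dense, the data is made up, and whether or how MLDB scales the resulting vectors is not covered here.

```python
import numpy as np

# Toy user-by-subreddit matrix: 1 where the user posted, 0 elsewhere.
# (A dense array is used only to keep the illustration short.)
subreddits = ["tf2", "Dota2Trade", "tf2trade", "food", "FoodPorn"]
users_by_subreddit = np.array([
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

num_dims = 2  # the demo's output has 100 columns, svd0000 to svd0099

# users_by_subreddit == U @ diag(s) @ Vt, with singular values in
# descending order.
U, s, Vt = np.linalg.svd(users_by_subreddit, full_matrices=False)

# Keep the top `num_dims` right singular vectors. Each row of the result
# holds one subreddit's coordinates in the reduced space.
column_embedding = Vt[:num_dims].T

for name, coords in zip(subreddits, column_embedding):
    print(name, np.round(coords, 3))
```

In this toy example `food` and `FoodPorn` have identical posting patterns, so they land on the same point. Subreddits with similar audiences ending up close together is the property the later clustering and mapping steps rely on.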


## Clustering with K-Means


We will create and run a [Procedure](/doc/#builtin/procedures/Procedures.md.html) of type [`kmeans.train`](/doc/#builtin/procedures/KmeansProcedure.md.html).

In [8]:
mldb.put('/v1/procedures/reddit_kmeans', {
    "type" : "kmeans.train",
    "params" : {
        "trainingData" : "select * from reddit_svd_embedding",
        "outputDataset" : "reddit_kmeans_clusters",
        "numClusters" : 20,
        "runOnCreation": True
    }
})


The result of this operation is a simple dataset which assigns each row in the input (i.e. each subreddit) to one of 20 clusters.

In [9]:
mldb.query("select * from reddit_kmeans_clusters limit 5")

| _rowName | cluster |
|---|---|
| supremeclothing\|1 | 2 |
| Dublin\|1 | 3 |
| AsianBeauty\|1 | 10 |
| YGOBinders\|1 | 7 |
| thick\|1 | 0 |
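
As a point of reference, the assignment of rows to clusters can be sketched with a plain implementation of Lloyd's algorithm in NumPy. This is a generic k-means illustration on made-up 2-d points, not MLDB's `kmeans.train` code. The function name `kmeans`, its parameters, and the random initialization are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, num_clusters, num_iterations=20, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from `num_clusters` distinct input points chosen at random.
    # Fancy indexing copies, so `points` itself is never modified.
    centroids = points[rng.choice(len(points), size=num_clusters, replace=False)]
    for _ in range(num_iterations):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its members,
        # leaving it in place if it currently has no members.
        for k in range(num_clusters):
            members = points[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Two well-separated toy groups standing in for embedding rows.
embedding = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
])

labels, centroids = kmeans(embedding, num_clusters=2)
print(labels)
```

The cluster numbers themselves carry no meaning beyond group membership, which is why they are only used to pick a colour later on.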


## 2-d Dimensionality Reduction with t-SNE


We will create and run a [Procedure](/doc/#builtin/procedures/Procedures.md.html) of type [`tsne.train`](/doc/#builtin/procedures/TsneProcedure.md.html).

In [10]:
mldb.put('/v1/procedures/reddit_tsne', {
    "type" : "tsne.train",
    "params" : {
        "trainingData" : "select * from reddit_svd_embedding",
        "rowOutputDataset" : "reddit_tsne_embedding",
        "runOnCreation": True
    }
})


The result is similar to the SVD step above: we get a row per subreddit and the columns are coordinates, but this time in a 2-dimensional space appropriate for visualization.

In [11]:
mldb.query("select * from reddit_tsne_embedding limit 5")

| _rowName | x | y |
|---|---|---|
| supremeclothing\|1 | 62.150974 | -4.51986 |
| Dublin\|1 | 0.6145 | 40.628929 |
| AsianBeauty\|1 | -11.832833 | -17.97777 |
| YGOBinders\|1 | -3.408515 | 14.584464 |
| thick\|1 | -66.764389 | -28.684132 |


## Counting the number of users per subreddit


We will create and run a [Procedure](/doc/#builtin/procedures/Procedures.md.html) of type [`transform`](/doc/#builtin/procedures/TransformDataset.md.html) on the transpose of the original input dataset.

In [12]:
mldb.put('/v1/procedures/reddit_count_users', {
    "type": "transform",
    "params": {
        "inputData": "select columnCount() as numUsers named rowName() + '|1' from transpose(reddit_dataset)",
        "outputDataset": "reddit_user_counts",
        "runOnCreation": True
    }
})

We appended "|1" to the row names in this dataset so that they match the row names produced by the SVD step, which lets the `merge` operation below line up rows from the different datasets.

In [13]:
mldb.query("select * from reddit_user_counts limit 5")

| _rowName | numUsers |
|---|---|
| GoodMorningPeriwinkle\|1 | 6 |
| supremeclothing\|1 | 602 |
| CoD4\|1 | 29 |
| Skweee\|1 | 5 |
| Cimmeria\|1 | 1 |
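
The counting step amounts to asking, for each subreddit, how many user rows have a non-null cell in that column. A small plain-Python sketch of the same idea follows. The user names and rows are invented for illustration, and the nested-dict layout is just a stand-in for an MLDB dataset.

```python
from collections import Counter

# Toy sparse user rows: each (made-up) user maps to the subreddits they
# posted in, mirroring the layout of reddit_dataset.
user_rows = {
    "user_a": {"AskReddit": 1, "pics": 1},
    "user_b": {"AskReddit": 1, "funny": 1},
    "user_c": {"AskReddit": 1, "pics": 1, "tf2": 1},
}

# Counting how often each column name appears across rows is equivalent to
# counting the non-null cells per row of the transposed matrix.
counts = Counter()
for subreddit_cells in user_rows.values():
    counts.update(subreddit_cells.keys())

# Append "|1" to each name, as the transform above does with
# `named rowName() + '|1'`.
user_counts = {name + "|1": {"numUsers": n} for name, n in counts.items()}

for row_name, row in sorted(user_counts.items()):
    print(row_name, row)
```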


## Querying and Visualizing the output

We'll use the [Query API](/doc/#builtin/sql/QueryAPI.md.html) to get the data into a Pandas DataFrame and then use Bokeh to visualize it.

In the query below, we rename the rows to strip the "|1" suffix that the SVD appended to each subreddit name, and we filter out rows where `cluster` is `null`, because we only clustered the 4000 most-active subreddits.

In [14]:
df = mldb.query("""
    select *, quantize(x, 7) as grid_x, quantize(y, 7) as grid_y 
    named regex_replace(rowName(), '\\|1', '') 
    from merge(reddit_user_counts, reddit_tsne_embedding, reddit_kmeans_clusters)  
    where cluster is not null 
    order by numUsers desc
""")
df.head()

| _rowName | numUsers | x | y | cluster | grid_x | grid_y |
|---|---|---|---|---|---|---|
| AskReddit | 523005 | -3.627485 | -36.323093 | 3 | -7 | -35 |
| funny | 396478 | -2.620554 | 47.816345 | 13 | 0 | 49 |
| pics | 362588 | 34.699753 | 12.125911 | 9 | 35 | 14 |
| WTF | 262293 | -42.486813 | -25.38582 | 0 | -42 | -28 |
| gaming | 255763 | -46.379269 | 45.731155 | 7 | -49 | 49 |
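
For comparison, the same query can be approximated in pandas. This sketch assumes that `merge` behaves like an outer join on row name, and that `quantize(x, 7)` rounds to the nearest multiple of 7. Both are inferences from the need to filter `null` clusters and from the `grid_x` and `grid_y` values shown above, not statements about MLDB's documented behaviour. The `supremeclothing|1` and `Cimmeria|1` values come from the earlier outputs. Treating `Cimmeria|1` as absent from the embedding is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

user_counts = pd.DataFrame(
    {"numUsers": [602, 1]}, index=["supremeclothing|1", "Cimmeria|1"]
)
tsne = pd.DataFrame(
    {"x": [62.150974], "y": [-4.51986]}, index=["supremeclothing|1"]
)
clusters = pd.DataFrame({"cluster": [2]}, index=["supremeclothing|1"])

def quantize(values, step):
    """Round each value to the nearest multiple of `step` (assumed semantics)."""
    return (step * np.round(values / step)).astype(int)

# Outer join on the index: rows missing from a dataset get NaN there.
merged = pd.concat([user_counts, tsne, clusters], axis=1)

# Equivalent of `where cluster is not null`.
merged = merged[merged["cluster"].notna()].copy()

# Equivalent of `named regex_replace(rowName(), '\|1', '')`.
merged.index = merged.index.str.replace(r"\|1", "", regex=True)

merged["grid_x"] = quantize(merged["x"], 7)
merged["grid_y"] = quantize(merged["y"], 7)
merged = merged.sort_values("numUsers", ascending=False)
print(merged)
```

The grid columns are not plotted directly. They bucket nearby subreddits together so that at most one label per bucket needs to be drawn.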


In [15]:
import numpy as np
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c", 
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5", 
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f", 
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

import bokeh.plotting as bp
from bokeh.models import HoverTool

In [16]:
# This line must be in its own cell
bp.output_notebook()

In [17]:
x = bp.figure(plot_width=900, plot_height=700, title="Subreddit Map by t-SNE",
       tools=[HoverTool( tooltips=[ ("/r/", "@subreddit") ] )], toolbar_location=None,
       x_axis_type=None, y_axis_type=None, min_border=1)
x.scatter(
    x = df.x.values, 
    y=df.y.values, 
    color=colormap[df.cluster.astype(int).values],
    alpha=0.6,
    radius=(df.numUsers.values ** .3)/15,
    source=bp.ColumnDataSource({"subreddit": df.index.values})
)

labels = df.reset_index().groupby(['grid_x', 'grid_y'], as_index=False).first()
labels = labels[labels["numUsers"] > 10000]
x.text(
    x = labels.x.values, 
    y = labels.y.values,
    text = labels._rowName.values,
    text_align="center", text_baseline="middle",
    text_font_size="8pt", text_font_style="bold",
    text_color="#333333"
)

bp.show(x)
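
The label-selection step in the cell above is easy to miss, so here is a small standalone sketch of it. Because `df` is ordered by `numUsers` descending, taking the first row of each `(grid_x, grid_y)` group keeps the largest subreddit in each grid cell, and the threshold then drops labels for small subreddits. The `AskReddit` and `funny` rows use values from the query output. The `toy_neighbor` and `toy_small` rows are hypothetical.

```python
import pandas as pd

# Rows are already sorted by numUsers descending, as in the query result.
df = pd.DataFrame(
    {
        "numUsers": [523005, 396478, 20000, 5000],
        "grid_x": [-7, 0, -7, 35],
        "grid_y": [-35, 49, -35, 14],
    },
    index=pd.Index(
        ["AskReddit", "funny", "toy_neighbor", "toy_small"], name="_rowName"
    ),
)

# One candidate label per grid cell: the first (largest) subreddit in it.
labels = df.reset_index().groupby(["grid_x", "grid_y"], as_index=False).first()

# Only label subreddits with more than 10000 users.
labels = labels[labels["numUsers"] > 10000]

print(labels["_rowName"].tolist())
```

Here `toy_neighbor` shares a grid cell with `AskReddit` and loses to it, while `toy_small` has a cell to itself but falls below the threshold, so neither gets a label.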

## Where to next?

Check out the other [Tutorials and Demos](/doc/#builtin/Demos.md.html).