# Dimension Reduction with TSNE and Bokeh Scatterplot

This is a workflow I use often in data exploration. TSNE gives a good two-dimensional representation of high-dimensional data, and Bokeh makes it easy to create simple interactive plots, with context supplied by colors and tooltips.

This workflow has been extremely helpful for:

- text analytics/NLP tasks if text data is passed through a `TfidfVectorizer` or similar from `scikit-learn`
- understanding `word2vec` or `doc2vec` vectors by passing them to TSNE
- getting a sense of *separability* for prediction/classification tasks by coloring the points by the outcome variable in Bokeh
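
As a rough illustration of the first use case, here is a minimal sketch of sending text through `TfidfVectorizer` and then TSNE. The toy corpus is made up, and the intermediate `TruncatedSVD` step is my own assumption (a common way to turn the sparse TF-IDF matrix into a small dense one before TSNE), not part of the workflow described here.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Toy corpus (hypothetical); replace with your own documents.
docs = [
    "the cat sat on the mat",
    "a cat chased the mouse",
    "dogs and cats are popular pets",
    "the dog barked at the mailman",
    "stock prices fell sharply today",
    "the market rallied after the earnings report",
    "investors worry about rising interest rates",
    "the central bank raised interest rates again",
]

# Sparse TF-IDF matrix: one row per document, one column per term.
tfidf = TfidfVectorizer().fit_transform(docs)

# Assumption: compress the sparse matrix to a few dense components first.
svd = TruncatedSVD(n_components=5, random_state=0)
dense = svd.fit_transform(tfidf)

# Perplexity has to be smaller than the number of documents.
tsne = TSNE(n_components=2, perplexity=3, init="random", random_state=0)
embedding = tsne.fit_transform(dense)

print(embedding.shape)  # (8, 2)
```

The resulting two columns can be treated exactly like the `Component 1` / `Component 2` columns used later in this notebook.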

This example uses the [Australian athletes data set](http://math.furman.edu/~dcs/courses/math47/R/library/DAAG/html/ais.html), which contains 11 numeric variables. The workflow is even more helpful on larger datasets with higher dimensionality.

### References

> t-Distributed Stochastic Neighbor Embedding (t-SNE) is a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. 

[t-SNE - Laurens van der Maaten](https://lvdmaaten.github.io/tsne/)

> Bokeh is a Python interactive visualization library that targets modern web browsers for presentation.

[Welcome to Bokeh](http://bokeh.pydata.org/en/latest/)

---

In [1]:
from statsmodels.api import datasets
from sklearn.manifold import TSNE
import pandas as pd

from bokeh.plotting import figure, ColumnDataSource, output_notebook, output_file, show, save 
from bokeh.models import HoverTool, WheelZoomTool, PanTool, BoxZoomTool, ResetTool, TapTool, SaveTool
from bokeh.palettes import brewer
output_notebook()


In [2]:
ais = datasets.get_rdataset("ais", "DAAG")
data = ais['data']

In [3]:
data.head()

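| | rcc | wcc | hc | hg | ferr | bmi | ssf | pcBfat | lbm | ht | wt | sex | sport |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.96 | 7.5 | 37.5 | 12.3 | 60 | 20.56 | 109.1 | 19.75 | 63.32 | 195.9 | 78.9 | f | B_Ball |
| 1 | 4.41 | 8.3 | 38.2 | 12.7 | 68 | 20.67 | 102.8 | 21.3 | 58.55 | 189.7 | 74.4 | f | B_Ball |
| 2 | 4.14 | 5.0 | 36.4 | 11.6 | 21 | 21.86 | 104.6 | 19.88 | 55.36 | 177.8 | 69.1 | f | B_Ball |
| 3 | 4.11 | 5.3 | 37.3 | 12.6 | 69 | 21.88 | 126.4 | 23.66 | 57.18 | 185.0 | 74.9 | f | B_Ball |
| 4 | 4.45 | 6.8 | 41.5 | 14.0 | 29 | 18.96 | 80.3 | 17.64 | 53.2 | 184.6 | 64.6 | f | B_Ball |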


In [4]:
data_numeric = data.select_dtypes(exclude=['object'])

In [5]:
# these parameters are tweaked for this dataset and are *not* good defaults
perplexity = 15
learning_rate = 400

tsne = TSNE(n_components=2, perplexity=perplexity, learning_rate=learning_rate, random_state=666)

tsne_data = tsne.fit_transform(data_numeric)
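
Since the values above were tuned by hand, a quick sweep is one way to look for settings that suit a different dataset. The sketch below is only an illustration: it uses synthetic data as a stand-in for `data_numeric`, and the perplexity values tried are arbitrary choices of mine.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for `data_numeric`: two Gaussian blobs in 11 dimensions.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (40, 11)), rng.normal(5, 1, (40, 11))])

embeddings = {}
for perplexity in (5, 15, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    # kl_divergence_ holds the final value of the t-SNE objective for this fit.
    print(perplexity, round(float(tsne.kl_divergence_), 3))
```

The KL divergence is not directly comparable across perplexities, so treat the printed numbers as a sanity check and judge the candidate embeddings by plotting them.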

### Formatting data for Bokeh
The easiest and cleanest way to get data into Bokeh is to put everything you'll need (original data, TSNE values, point colors, and other metadata) into a single DataFrame. You can then pass that DataFrame to `ColumnDataSource` and reference its column names when creating the plot.
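
One detail worth knowing when referencing columns by name in tooltips: Bokeh looks up a field with `@name`, and a name containing spaces (such as `Component 1`) has to be wrapped in braces, as in `@{Component 1}`. The helper below is a hypothetical convenience function, not part of Bokeh; it always uses the brace form so that any column name is safe.

```python
def make_tooltips(columns):
    """Build (label, field) pairs in the form HoverTool's `tooltips` expects."""
    # The braces are required for names with spaces and harmless otherwise.
    return [(col, "@{" + col + "}") for col in columns]


tooltips = make_tooltips(["sport", "sex", "Component 1"])
print(tooltips)
# [('sport', '@{sport}'), ('sex', '@{sex}'), ('Component 1', '@{Component 1}')]
```

The returned list can be passed as `HoverTool(tooltips=...)`.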

In [6]:
tsne_df = pd.DataFrame(tsne_data, columns=['Component 1', 'Component 2'], index=data.index)

In [7]:
data_all = pd.concat([data, tsne_df], axis=1)

In [8]:
category = 'sex'

category_items = data_all[category].unique()
# Set3 is only defined for 3 to 12 colors, so request at least 3
palette = brewer['Set3'][max(3, len(category_items))]
colormap = dict(zip(category_items, palette))
data_all['color'] = data_all[category].map(colormap)

In [9]:
data_all.head()

| | rcc | wcc | hc | hg | ferr | bmi | ssf | pcBfat | lbm | ht | wt | sex | sport | Component 1 | Component 2 | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.96 | 7.5 | 37.5 | 12.3 | 60 | 20.56 | 109.1 | 19.75 | 63.32 | 195.9 | 78.9 | f | B_Ball | -19.713213 | -17.05101 | #8dd3c7 |
| 1 | 4.41 | 8.3 | 38.2 | 12.7 | 68 | 20.67 | 102.8 | 21.3 | 58.55 | 189.7 | 74.4 | f | B_Ball | -18.542049 | -15.324479 | #8dd3c7 |
| 2 | 4.14 | 5.0 | 36.4 | 11.6 | 21 | 21.86 | 104.6 | 19.88 | 55.36 | 177.8 | 69.1 | f | B_Ball | -10.269636 | -27.617649 | #8dd3c7 |
| 3 | 4.11 | 5.3 | 37.3 | 12.6 | 69 | 21.88 | 126.4 | 23.66 | 57.18 | 185.0 | 74.9 | f | B_Ball | -22.154285 | -12.438956 | #8dd3c7 |
| 4 | 4.45 | 6.8 | 41.5 | 14.0 | 29 | 18.96 | 80.3 | 17.64 | 53.2 | 184.6 | 64.6 | f | B_Ball | -2.223337 | -26.898342 | #8dd3c7 |


### Creating the Plot

Note: as of 2018-04-08, the plot does not render in Firefox.

In [10]:
title = "Australian Athletes - t-SNE"

source = ColumnDataSource(data_all)

hover = HoverTool(tooltips=[(column, '@' + column) for column in reversed(data.columns)])

tools = [hover, WheelZoomTool(), PanTool(), BoxZoomTool(), ResetTool(), TapTool(), SaveTool()]

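# This cell targets the Bokeh release that was current in 2018.
# Later releases renamed plot_width/plot_height to width/height and
# replaced legend= with legend_field=/legend_group=; adjust if needed.
p = figure(
    tools=tools,
    title=title,
    plot_width=800,
    plot_height=800,
    toolbar_location='below',
    toolbar_sticky=False, )

# One circle per athlete, positioned by the two TSNE components
# and filled with the per-category color computed earlier.
p.circle(
    x='Component 1',
    y='Component 2',
    source=source,
    size=10,
    line_color='#333333',
    line_width=0.5,
    fill_alpha=0.8,
    color='color',
    legend=category)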

show(p)
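
The first cell imports `output_file` and `save` but never uses them. If you want a standalone HTML file rather than inline notebook output, something along these lines should work. It is a minimal sketch on toy data: the file name, the temporary directory, and the use of `bokeh.io.save` with CDN resources are my own choices, not something this notebook does.

```python
import os
import tempfile

from bokeh.io import save
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.resources import CDN

# Toy data standing in for `data_all` (hypothetical values).
source = ColumnDataSource({
    "x": [1, 2, 3],
    "y": [3, 1, 2],
    "color": ["#8dd3c7", "#ffffb3", "#bebada"],
})

p = figure(title="Standalone export sketch")
p.scatter(x="x", y="y", source=source, size=10,
          fill_color="color", line_color="#333333")

# Write a self-contained page that loads BokehJS from the CDN.
out_path = os.path.join(tempfile.mkdtemp(), "tsne_plot.html")
save(p, filename=out_path, resources=CDN, title="t-SNE plot")
print(out_path)
```

The saved file can be opened in any browser or shared without a running notebook.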