This example shows how kernel density estimation (KDE), a non-parametric density estimation technique, can be used to learn a generative model for a dataset. Once the model is fitted, new samples can be drawn from it, and these samples reflect the estimated distribution of the data.
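
As a minimal sketch of the idea, assuming a synthetic one-dimensional dataset with two clusters (the data, the hand-picked bandwidth, and the `toy_*` names are illustrative and not part of the digits example), a `KernelDensity` estimator can be fitted, queried, and sampled like this:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Illustrative data: two Gaussian clusters (an assumption for this sketch).
rng = np.random.RandomState(0)
toy_data = np.concatenate([rng.normal(-2.0, 0.5, 200),
                           rng.normal(3.0, 1.0, 200)]).reshape(-1, 1)

# Fit a Gaussian KDE; the bandwidth is hand-picked here, not tuned.
toy_kde = KernelDensity(kernel='gaussian', bandwidth=0.4).fit(toy_data)

# Draw new points from the estimated density.
toy_samples = toy_kde.sample(5, random_state=0)

# score_samples returns the log-density at each query point.
log_density = toy_kde.score_samples(np.array([[-2.0], [0.5], [3.0]]))
print(toy_samples.ravel())
print(np.exp(log_density))
```

The density between the two clusters should come out much lower than at the cluster centres, and the sampled points should fall near one cluster or the other.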

#### New to Plotly?
Plotly's Python library is free and open source! [Get started](https://plot.ly/python/getting-started/) by downloading the client and [reading the primer](https://plot.ly/python/getting-started/).
<br>You can set up Plotly to work in [online](https://plot.ly/python/getting-started/#initialization-for-online-plotting) or [offline](https://plot.ly/python/getting-started/#initialization-for-offline-plotting) mode, or in [jupyter notebooks](https://plot.ly/python/getting-started/#start-plotting-online).
<br>We also have a quick-reference [cheatsheet](https://images.plot.ly/plotly-documentation/images/python_cheat_sheet.pdf) (new!) to help you get started!

### Version

In [1]:
import sklearn
sklearn.__version__

'0.18.1'

### Imports

In [2]:
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

Automatically created module for IPython interactive environment


### Calculations

In [3]:
# load the 8x8 digit images as 64-dimensional vectors
digits = load_digits()

# project the 64-dimensional data onto 15 principal components
pca = PCA(n_components=15, whiten=False)
data = pca.fit_transform(digits.data)

# use grid search cross-validation to optimize the bandwidth
params = {'bandwidth': np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data)

print("best bandwidth: {0}".format(grid.best_estimator_.bandwidth))

# use the best estimator to compute the kernel density estimate
kde = grid.best_estimator_

# sample 44 new points from the fitted density model,
# then map them from PCA space back to the 64-dimensional pixel space
new_data = kde.sample(44, random_state=0)
new_data = pca.inverse_transform(new_data)

# turn data into a 4x11 grid
new_data = new_data.reshape((4, 11, -1))
real_data = digits.data[:44].reshape((4, 11, -1))

best bandwidth: 3.79269019073
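
To make the sampling step less opaque: for a Gaussian kernel, drawing from a kernel density estimate amounts to picking a training point uniformly at random and perturbing it with Gaussian noise whose standard deviation is the bandwidth. The sketch below illustrates that idea in plain NumPy. `sample_gaussian_kde` is a hypothetical helper, not part of scikit-learn, the input is random stand-in data rather than the real PCA projection, and the function is not expected to reproduce the exact random draws of `KernelDensity.sample`.

```python
import numpy as np

def sample_gaussian_kde(train, bandwidth, n_samples, seed=0):
    """Draw points from a Gaussian KDE centred on the rows of `train`.

    Hypothetical helper for illustration only.
    """
    rng = np.random.RandomState(seed)
    # Each sample starts from a uniformly chosen training point ...
    centers = train[rng.randint(0, train.shape[0], size=n_samples)]
    # ... and is shifted by isotropic Gaussian noise of scale `bandwidth`.
    return centers + rng.normal(0.0, bandwidth, size=centers.shape)

# Stand-in for the 15-dimensional PCA projection (random values, an assumption).
fake_projection = np.random.RandomState(1).normal(size=(100, 15))
draws = sample_gaussian_kde(fake_projection, bandwidth=3.8, n_samples=44)
print(draws.shape)  # (44, 15)
```

Seen this way, the bandwidth controls how far a "new" digit can drift from an existing one: a very small bandwidth essentially replays training points, while a large one blurs them, which is why the bandwidth is chosen by cross-validation above.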


### Plot Results

In [4]:
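def matplotlib_to_plotly(cmap, pl_entries):
    """Convert a matplotlib colormap into a Plotly colorscale with pl_entries stops."""
    h = 1.0/(pl_entries-1)
    pl_colorscale = []

    for k in range(pl_entries):
        # take the RGB channels (dropping alpha) and scale them to 0-255 integers;
        # a list is built explicitly because map() is not indexable on Python 3
        C = [int(c) for c in np.array(cmap(k*h)[:3])*255]
        pl_colorscale.append([k*h, 'rgb'+str((C[0], C[1], C[2]))])

    return pl_colorscale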

cmap = matplotlib_to_plotly(plt.cm.binary, 4)

In [5]:
# plot real digits and resampled digits
fig1 = tools.make_subplots(rows=4, cols=11, 
                          print_grid=False)

fig2 = tools.make_subplots(rows=4, cols=11, 
                          print_grid=False)

for j in range(11):
    for i in range(4):
        p1 = go.Heatmap(z=real_data[i, j].reshape((8, 8)),
                        colorscale=cmap, showscale=False)
        fig1.append_trace(p1, i+1, j+1)
        
        p2 = go.Heatmap(z=new_data[i, j].reshape((8, 8)),
                       colorscale=cmap, showscale=False)
        fig2.append_trace(p2, i+1, j+1)

### Selection from the input data

In [6]:

fig1['layout'].update(title='Selection from the input data',
                      height=600, hovermode='closest')

for i in map(str,range(1, 45)):
    y = 'yaxis'+i
    x = 'xaxis'+i
    fig1['layout'][y].update(autorange='reversed',
                            showticklabels=False, ticks='')
    fig1['layout'][x].update(showticklabels=False, ticks='')


In [7]:
py.iplot(fig1)

The draw time for this plot will be slow for all clients.




### "New" digits drawn from the kernel density model

In [8]:

fig2['layout'].update(title='"New" digits drawn from the kernel density model',
                      height=600, hovermode='closest')

for i in map(str,range(1, 45)):
    y = 'yaxis'+i
    x = 'xaxis'+i
    fig2['layout'][y].update(autorange='reversed',
                            showticklabels=False, ticks='')
    fig2['layout'][x].update(showticklabels=False, ticks='')


In [9]:
py.iplot(fig2)

The draw time for this plot will be slow for all clients.


