Kernel density estimation is a popular tool for visualising the distribution of data.

# Kernel density estimation
[wiki](https://en.wikipedia.org/wiki/Kernel_density_estimation)  
Let $(x_1, x_2, \dots, x_n)$ be a univariate independent and identically distributed sample drawn from a distribution with an unknown density $f$. We want to estimate the shape of $f$. Its kernel density estimator is

$${\hat {f}}_{h}(x)={\frac {1}{n}}\sum _{i=1}^{n}K_{h}(x-x_{i})={\frac {1}{nh}}\sum _{i=1}^{n}K{\Big (}{\frac {x-x_{i}}{h}}{\Big )}$$
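where $K$ is the kernel (a non-negative function that integrates to one), $h>0$ is the bandwidth that controls the amount of smoothing, and $K_h(u)=\frac{1}{h}K(u/h)$ is the scaled kernel.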
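
As a minimal sketch of the estimator above, the following pure-Python code evaluates $\hat f_h(x)$ with a Gaussian kernel. The kernel choice, the sample values, and the bandwidth are illustrative assumptions, not part of the definition. The Riemann sum at the end is a sanity check that the estimate integrates to roughly one.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h):
    """Evaluate f_hat_h(x) = 1/(n*h) * sum_i K((x - x_i) / h)."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)


sample = [-1.2, -0.4, 0.1, 0.3, 1.5]  # made-up data for illustration
h = 0.5                               # arbitrary bandwidth

# Riemann sum of the estimate over [-10, 10]; it should be close to 1.
step = 0.01
grid = [-10.0 + step * k for k in range(2001)]
total_mass = sum(kde(x, sample, h) for x in grid) * step

print(kde(0.0, sample, h), total_mass)
```

A larger `h` spreads each kernel more widely and gives a smoother estimate; a smaller `h` makes the estimate follow the individual sample points more closely.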

#  Kernel density estimation for bivariate data
[R package: ks](https://cran.r-project.org/web/packages/ks/vignettes/kde.pdf)

A brief introduction to KDE for two-dimensional data.  
For a bivariate random sample $(X_1, X_2, \dots , X_n)$ drawn from an unknown density $f$, the kernel density estimate is defined by 
$$\hat f(x;H) = \frac{1}{n}\sum_{i=1}^{n} K_{H}(x-X_i)$$
where $x=(x_1,x_2)^T$ and $X_i=(X_{i1},X_{i2})^T$, $i=1,2,\dots,n$,  
$K_H(x) = {|H|}^{-\frac{1}{2}}K(H^{-\frac{1}{2}}x)$ and $K(x)=(2\pi)^{-1}\exp(-\frac{1}{2}x^Tx)$ is the standard bivariate normal kernel.  
In general, $K(x)$ is a symmetric probability density function called the kernel, and $H$ is a symmetric positive-definite bandwidth matrix.

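Let $H=\left[\begin{array}{cc} r & 0 \\ 0 & r \end{array}\right]$ with $r>0$, so that ${|H|}^{-\frac{1}{2}}=\frac{1}{r}$ and $H^{-\frac{1}{2}}=r^{-\frac{1}{2}}I$. The 2-D KDE with the normal kernel can then be written as
$$\begin{array}{ll} \hat f(x;H) &= \frac{1}{n}\sum_{i=1}^{n} K_{H}(x-X_i)\\
&= \frac{1}{n}\sum_{i=1}^{n}{|H|}^{-\frac{1}{2}}K(H^{-\frac{1}{2}}(x-X_i))\\
&= \frac{1}{rn}\sum_{i=1}^{n}(2\pi)^{-1}\exp(-\frac{1}{2}(x-X_i)^T H^{-\frac{1}{2}} H^{-\frac{1}{2}}(x-X_i))\\
&= {\frac {1}{nr}}\sum _{i=1}^{n}K{\Big (}{\frac{x-X_i}{\sqrt{r}}}{\Big )}
\end{array}$$
so $\sqrt{r}$ plays the role of the univariate bandwidth $h$. The bivariate normal kernel also factorises into a product of univariate normal kernels. Writing $\phi(u)=(2\pi)^{-\frac{1}{2}}\exp(-\frac{1}{2}u^2)$ for the univariate normal kernel,
$$\begin{array}{ll}
K_{(2-dim)}([x,y]) &= \frac{1}{2\pi}\exp(-\frac{1}{2}(x^2+y^2))\\ 
&= \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}x^2) \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}y^2)\\
&= \phi(x)\phi(y)
\end{array}$$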
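
The isotropic case can be sketched in a few lines of pure Python. With $H=rI$ we have ${|H|}^{-\frac{1}{2}}=1/r$ and $H^{-\frac{1}{2}}=r^{-\frac{1}{2}}I$, so each term is the bivariate normal kernel evaluated at $(x-X_i)/\sqrt{r}$ and scaled by $1/r$. The function names, the sample points, and the value of `r` below are hypothetical and chosen only for illustration.

```python
import math


def phi(u):
    """Univariate standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_2d(x, y):
    """Bivariate standard normal kernel K([x, y]) = (2*pi)^-1 * exp(-(x^2 + y^2) / 2)."""
    return math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi)


def kde_2d(point, sample, r):
    """f_hat(x; H) for H = r * I: 1/(n*r) * sum_i K((x - X_i) / sqrt(r))."""
    if r <= 0:
        raise ValueError("r must be positive so that H is positive definite")
    s = math.sqrt(r)
    px, py = point
    total = sum(kernel_2d((px - xi) / s, (py - yi) / s) for xi, yi in sample)
    return total / (len(sample) * r)


sample_2d = [(0.0, 0.0), (1.0, 0.5), (-0.5, 2.0)]  # made-up points
print(kde_2d((0.2, 0.3), sample_2d, r=0.25))

# The bivariate kernel is the product of two univariate normal kernels.
print(kernel_2d(0.3, -1.1), phi(0.3) * phi(-1.1))
```

Because the kernel factorises, a 2-D estimate with a diagonal $H$ can be assembled from 1-D kernel evaluations along each coordinate.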

In [2]:
import numpy as np
import pandas as pd
import json

In [1]:
from pyspark import SparkContext, SparkConf
from pyspark.mllib.stat import KernelDensity

In [27]:
sc = SparkContext()

In [None]:
#sc.stop()

In [53]:
# an RDD of sample data
data = sc.parallelize([1.0, 10.0, 100.0])
# Construct the density estimator with the sample data and a standard deviation for the Gaussian
# kernels
kd = KernelDensity()
kd.setSample(data)
kd.setBandwidth(3.0)

In [55]:
# Find density estimates for the given values
densities = kd.estimate([0.1,0.2])

In [56]:
densities

array([ 0.04256782,  0.04299207])
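
As a cross-check that needs no Spark session, the same two values can be recomputed directly from the univariate formula. This sketch assumes, as the comment in the cell above states, that the bandwidth is the standard deviation of Gaussian kernels; the helper name `gaussian_kde` is hypothetical and unrelated to any library function.

```python
import math

sample = [1.0, 10.0, 100.0]
bandwidth = 3.0


def gaussian_kde(x, sample, h):
    """1/(n*h) * sum_i phi((x - x_i) / h), with phi the standard normal density."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)


# Evaluate at the same points passed to kd.estimate above.
check = [gaussian_kde(x, sample, bandwidth) for x in (0.1, 0.2)]
print(check)
```

The results should agree with the `densities` array printed above to the displayed precision. Nearly all of the mass at these points comes from the sample value 1.0, since 10.0 and 100.0 lie several bandwidths away.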