# Density Estimation

Density estimation walks the line between unsupervised learning, feature engineering, and data modeling. Some of the most popular and useful density estimation techniques are mixture models such as Gaussian Mixtures (sklearn.mixture.GaussianMixture), and neighbor-based approaches such as the kernel density estimate (sklearn.neighbors.KernelDensity). Gaussian Mixtures are discussed more fully in the context of clustering, because the technique is also useful as an unsupervised clustering scheme.

Density estimation is a very simple concept, and most people are already familiar with one common density estimation technique: the histogram.
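To make the histogram idea concrete, here is a minimal sketch using NumPy. The synthetic normal sample, the seed, and the choice of 20 bins are assumptions made only for illustration; the point is that a normalized histogram is a piecewise-constant density estimate whose area sums to one.

```python
import numpy as np

# Synthetic 1D sample (illustrative assumption, not from the original text)
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# density=True rescales the bin counts so the histogram integrates to 1
density, edges = np.histogram(samples, bins=20, density=True)

# Area under the histogram: sum of (bin height * bin width)
widths = np.diff(edges)
total_area = float(np.sum(density * widths))
print(total_area)
```

The estimate depends on the number of bins and where their edges fall, which is one motivation for smoother estimators such as kernel density estimation.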

## Kernel Density Estimation

Kernel density estimation in scikit-learn is implemented in the sklearn.neighbors.KernelDensity estimator, which uses the Ball Tree or KD Tree for efficient queries (see Nearest Neighbors for a discussion of these). Kernel density estimation is often illustrated on a 1D data set for simplicity, but it can be performed in any number of dimensions; in practice, however, the curse of dimensionality causes its performance to degrade in high dimensions.
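As a rough sketch of the tree-backed estimator on multi-dimensional data, the example below fits the same 3D sample with both tree types via the `algorithm` parameter. The random data, the seed, and the bandwidth of 0.5 are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic 3D sample (illustrative assumption)
rng = np.random.default_rng(0)
X3 = rng.normal(size=(200, 3))
queries = rng.normal(size=(5, 3))

# Same model, two different spatial indexes for the neighbor queries
kd = KernelDensity(kernel="gaussian", bandwidth=0.5, algorithm="kd_tree").fit(X3)
ball = KernelDensity(kernel="gaussian", bandwidth=0.5, algorithm="ball_tree").fit(X3)

# Log-density at each query point under each model
log_kd = kd.score_samples(queries)
log_ball = ball.score_samples(queries)
print(np.allclose(log_kd, log_ball))
```

The tree choice affects query speed rather than the estimate itself, so with the default tolerances both models should agree up to floating-point error.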

The kernel density estimator can be used with any of the valid distance metrics (see sklearn.neighbors.DistanceMetric for a list of available metrics), though the results are properly normalized only for the Euclidean metric. One particularly useful metric is the Haversine distance, which measures the angular distance between points on a sphere.
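A minimal sketch of the Haversine metric in use follows. The coordinates are hypothetical points made up for illustration, and the bandwidth of 0.05 radians is an arbitrary assumption. The Haversine metric expects (latitude, longitude) pairs in radians and is used here with the Ball Tree.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical (latitude, longitude) pairs in degrees, for illustration only
latlon_deg = np.array([
    [48.85, 2.35],
    [51.51, -0.13],
    [52.52, 13.40],
    [40.42, -3.70],
])
latlon_rad = np.radians(latlon_deg)  # haversine expects radians

# Bandwidth is in radians of arc; 0.05 is an arbitrary choice
kde_sphere = KernelDensity(
    kernel="gaussian",
    bandwidth=0.05,
    metric="haversine",
    algorithm="ball_tree",
).fit(latlon_rad)

# One query near the training points, one on the other side of the globe
query = np.radians([[50.0, 5.0], [-33.87, 151.21]])
log_density = kde_sphere.score_samples(query)
print(log_density)
```

Because the results are properly normalized only for the Euclidean metric, these values are best read as relative scores rather than true log-densities on the sphere.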

In [1]:
from sklearn.neighbors import KernelDensity

KernelDensity?

In [3]:
from sklearn.neighbors import KernelDensity
import numpy as np

# Fit a Gaussian KDE to six 2D points
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)

# score_samples returns the log-density at each query point
kde.score_samples([[32, 4]])

array([-10562.91076071])

In [4]:
kde.sample(1)

array([[ 2.21052437,  1.09216422]])
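The very negative score above is a log-density: the query point lies far from all six training points. The sketch below, which rebuilds the same toy model under the assumed names `X_toy` and `kde_toy`, compares a nearby and a distant query, converts a log-density back to a density with `np.exp`, and draws reproducible samples by passing `random_state`.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Same six 2D points as in the cell above, under a separate name
X_toy = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
kde_toy = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X_toy)

# Log-density at a training point versus a point far from the data
log_near = kde_toy.score_samples([[1, 1]])
log_far = kde_toy.score_samples([[32, 4]])

# Exponentiate to recover the density itself
density_near = np.exp(log_near)

# Draw three new points from the fitted density; the seed makes this repeatable
draws = kde_toy.sample(3, random_state=0)
print(log_near, log_far, density_near)
print(draws)
```

Sampling is available only for some kernels (the Gaussian kernel used here is one of them), so other kernel choices may raise an error on `sample`.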

In [5]:
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

In [6]:
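# Fit one kernel density model per iris class
estimators = []
for c in [0, 1, 2]:
    m = KernelDensity().fit(X[y == c])
    estimators.append(m)

# Log-density of the first sample under each class model
for estimator in estimators:
    print(estimator.score_samples([X[0]]))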

[-3.8262878]
[-8.13952384]
[-12.91720053]
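The first iris sample scores highest under the class 0 model, which suggests one way to use these per-class densities: pick the class whose model gives the largest log-density plus log prior. The following is a sketch of that idea under stated assumptions (default bandwidth, priors taken from class frequencies, evaluation on the training data only), not a tuned classifier.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KernelDensity

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One density model per class, with the default kernel and bandwidth
models = [KernelDensity().fit(X[y == c]) for c in classes]

# Class priors estimated from class frequencies (an assumption of this sketch)
log_priors = np.log([np.mean(y == c) for c in classes])

# Column j holds the log-density of every sample under the model for class j
log_lik = np.column_stack([m.score_samples(X) for m in models])

# Predict the class with the largest log-density plus log prior
predicted = classes[np.argmax(log_lik + log_priors, axis=1)]
train_accuracy = np.mean(predicted == y)
print(predicted[0], train_accuracy)
```

Accuracy measured on the training data is optimistic; in practice the bandwidth would be chosen by cross-validation on held-out data.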
