# Searching for Structure: Nonparametric Methods

K. Leighly 2017

This lecture was drawn from the following sources:
 - Ivezic section 4.8, 6.1, 6.2, 6.4
 - Bishop 2.5, 9.1
 - Richards HistogramExample.ipynb

## Motivation

Often, when you obtain a data set, the way forward for the analysis and the resulting paper is not clear.  This may be especially true in astronomy, where the role of serendipity cannot be overemphasized.  

So your first steps when faced with a new multivariate data set will be exploratory.  The Ivezic book starts out Chapter 6 by commenting that this kind of activity can be termed _exploratory data analysis_ (EDA) and it is the kind of thing that you might apply to large data sets and data mining problems.

Often, what you want to do with these large data sets is infer the probability distribution from the data directly, without any preconception about what its shape or complexity (e.g., there may be clusters) may be. This activity is termed density estimation. Then, that PDF can be used in various ways. For example, identification of areas of low density in the PDF may point toward unusual or rare objects, or it may be used to compare a new, not-well-understood sample with the well-studied sample used to create the PDF, with the latter procedure being related to regression.

Today we will focus predominantly on nonparametric methods (histograms, kernel density estimation, and K-means).  Next week we will look at parametric methods (Gaussian mixture models and others).
Ivezic notes that nonparametric methods can be more predictive than model-fitting methods, but they can be computationally expensive and harder to interpret.  


## Histograms

Everyone has computed a histogram. Histograms are easy to compute and easy to understand, and they provide an important tool in nonparametric data analysis. 

But they are far from trivial, as their shape and the information that they relay depend on the bin width and the starting point. Moreover, one intuits that the number of bins should increase with the number of points. 

And the result can be important; for example, a scientist may be looking to see whether a distribution is bimodal, implying two distinguishable distributions. If the bins are too large, the interesting structure may go undetected.



So, how to choose the bin width? We present some rules of thumb here, and consider some more formal rules below.

Ivezic points out that the methods for choosing the bin size generally advise using a bin size proportional to the scale of the distribution, and decreasing with the sample size. 

**Scott's rule** suggests a bin width of

$$\Delta_b=\frac{3.5 \sigma}{N^{1/3}},$$

where $\sigma$ is the sample standard deviation and $N$ is the sample size. Note that this rule inherently assumes that the distribution is Gaussian, and so might be problematic for the bimodality question outlined above.

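**Silverman** also has a rule, in this case for the bandwidth $h$ of a kernel density estimate (discussed below) rather than for a histogram bin width:

$$h=\left(\frac{4\sigma^5}{3N}\right)^{1/5} \approx 1.06 \sigma N^{-1/5}.$$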

**Freedman-Diaconis** rule attempts to generalize to non-Gaussian distributions:

$$\Delta_b = \frac{2(q_{75}-q_{25})}{N^{1/3}} = \frac{2.7 \sigma_G}{N^{1/3}}, $$

where $q_{25}$ and $q_{75}$ are the first and third quartiles of the data. That is, $q_{25}$ is the location at which 25% of the cumulative area of the distribution has been covered, and $q_{75}$ is the location at which 75% of the cumulative area has been covered; the difference $q_{75}-q_{25}$ is termed the interquartile range.  $\sigma_G = 0.7413\,(q_{75}-q_{25})$ is a rank-based estimate of the standard deviation, which equals $\sigma$ for a Gaussian but is much less sensitive to outliers.
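As a minimal sketch, the three rules of thumb above can be evaluated directly with NumPy. The function name `rule_of_thumb_widths` and the Gaussian test sample are illustrative assumptions, not something taken from Ivezic or astroML.

```python
import numpy as np

def rule_of_thumb_widths(x):
    """Return the Scott, Freedman-Diaconis, and Silverman widths for a 1-D sample."""
    x = np.asarray(x, dtype=float)
    N = x.size
    sigma = np.std(x, ddof=1)                  # sample standard deviation
    q25, q75 = np.percentile(x, [25, 75])      # first and third quartiles
    sigma_G = 0.7413 * (q75 - q25)             # rank-based width estimate
    return {
        "scott": 3.5 * sigma / N**(1.0 / 3.0),
        "freedman": 2.0 * (q75 - q25) / N**(1.0 / 3.0),
        "silverman": 1.06 * sigma * N**(-1.0 / 5.0),
        "sigma_G": sigma_G,
    }

rng = np.random.RandomState(0)
sample = rng.normal(0.0, 1.0, 1000)            # illustrative Gaussian sample
widths = rule_of_thumb_widths(sample)
for name in ("scott", "freedman", "silverman"):
    print("%-10s %.3f" % (name, widths[name]))
```

For a Gaussian sample the Scott and Freedman-Diaconis widths should come out similar; they diverge when the distribution has heavy tails, because the quartiles barely respond to outliers while $\sigma$ does.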



Clearly, all of these methods can only be considered to be rules of thumb when the distribution is complicated.
There are two other methods where the binsize is more theoretically determined:

**Knuth's** rule is a Bayesian method (see Ivezic 5.7.2); the paper describing the method can be found [here](https://arxiv.org/abs/physics/0605197).  Knuth used Bayesian model selection, treating the histogram as a piecewise constant model of the underlying density function, using a constant bin width, with the number of bins being the result of model selection.  

**Bayesian blocks**  This method is also a Bayesian method, but it relaxes the requirement that the bin widths be constant.  The papers discussing this method are [here](http://adsabs.harvard.edu/abs/1998ApJ...504..405S) and [here](http://adsabs.harvard.edu/abs/2013ApJ...764..167S).  (It is worth noting that it was originally used for time series analysis.)

The idea of the Bayesian blocks formalism is that the data are segmented into blocks, with the borders being set by changepoints.  Then, a log likelihood fitness function is defined, and the borders determined to maximize the likelihood.  

Ivezic mentions that there are several programs to compute and plot histograms, e.g., `plt.hist`, `np.histogram`, and `astroML.plotting.hist`. In some of these, the specific rule for determining the bin size can be specified.  That is, you do not have to specify the bin size; rather, you can just specify the rule.
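For instance, `np.histogram` accepts the name of a rule as its `bins` argument. The short sketch below uses an illustrative Gaussian sample (an assumption, not the lecture data); note that NumPy calls the Freedman-Diaconis rule `"fd"` and does not implement Knuth's rule or Bayesian blocks.

```python
import numpy as np

rng = np.random.RandomState(1)
data = rng.normal(0.0, 1.0, 500)           # illustrative sample

for rule in ("scott", "fd"):               # "fd" is NumPy's name for Freedman-Diaconis
    counts, edges = np.histogram(data, bins=rule)
    print("%-5s %2d bins of width %.3f" % (rule, counts.size, edges[1] - edges[0]))
```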

Let's try an example.   This generates the data displayed in Figure 6.5 of Ivezic.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pylab, mlab, pyplot
from scipy import stats
%matplotlib inline

# Execute this cell to generate a univariate data array, x
# this is the same data used in Ivezic, Figure 6.5
np.random.seed(0)
N = 1000
mu_gamma_f = [(5, 1.0, 0.1),
              (7, 0.5, 0.5),
              (9, 0.1, 0.1),
              (12, 0.5, 0.2),
              (14, 1.0, 0.1)]
true_pdf = lambda x: sum([f * stats.cauchy(mu, gamma).pdf(x)
                          for (mu, gamma, f) in mu_gamma_f])
x = np.concatenate([stats.cauchy(mu, gamma).rvs(int(f * N))
                    for (mu, gamma, f) in mu_gamma_f])
np.random.shuffle(x)
x = x[x > -5]
x = x[x < 25]

print(x.shape)

xtemp=np.linspace(0,20,1000)
ytemp=true_pdf(xtemp)
plt.plot(xtemp,ytemp)


Now let's take a first look at the data.  Use 
```python
plt.hist()
```
where that function plots a histogram of any univariate data array given as an argument.  It takes lots of other arguments too.  See (http://matplotlib.org/api/pyplot_api.html?highlight=hist#matplotlib.pyplot.hist).  You might start with `histtype = "step"` so that we can later add information to the plot and still see the original histogram.  See what happens if you don't do this.

In [None]:
plt.hist(x,histtype='step')

How would you characterize this distribution?  Could we reasonably think of it as a normal distribution that we could characterize by some mean and standard deviation?  Maybe, but even just by looking at this plot we see that it wouldn't be a particularly good description of the data.

Now do what we would typically do in astronomy: try re-plotting the histogram a couple of different ways to see if it reveals anything more about the data.  Try using only 5 bins (`bins=5`), then 20 bins (`bins=20`).

Also, 

```python
density=True
```
(called `normed=True` in matplotlib versions before 2.1) causes these histograms to be computed as a probability density, so that the integral over the histogram, i.e., the sum of the bin values times the bin widths, equals 1.
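A quick numerical check of what the density normalization means, using `np.histogram` (which `plt.hist` calls internally) on an illustrative sample that is not the lecture data: it is the sum of the bin values multiplied by the bin widths that equals 1, not the sum of the bin values themselves.

```python
import numpy as np

rng = np.random.RandomState(2)
data = rng.normal(0.0, 2.0, 1000)                 # illustrative sample

values, edges = np.histogram(data, bins=20, density=True)
widths = np.diff(edges)

print("sum of bin values:          %.3f" % values.sum())
print("sum of bin values * widths: %.3f" % (values * widths).sum())
```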


In [None]:
pylab.rcParams['figure.figsize'] = (15, 6)

# density=True replaces the older normed=True keyword (matplotlib >= 2.1)
plt.subplot(1,3,1)
plt.hist(x,histtype='step',density=True)
plt.text(15, .114, 'default binsize')

plt.subplot(1,3,2)
plt.hist(x,bins=5,histtype='step',density=True)
plt.text(15, .08, '5 bins')

plt.subplot(1,3,3)
plt.hist(x,bins=20,histtype='step',density=True)
plt.text(15, .2, '20 bins')


We find that small changes in parameters to the histogram function *significantly* change the PDF.  That's bad, because the underlying data clearly have **not** changed.

One of the problems with histograms is that some bins end up with little (or no) data.  We can fix this by making variable-width bin sizes that have the same number of objects in each bin.  How can we do this?  We can do it using "slicing".

In [None]:
#Execute this cell
a = np.linspace(1,42,num=42)
print(a)
print(a[::2])
print(a[::3])

If you are familiar with slicing, then you know that [::2] and [::3] say to count by 2 and count by 3. 

But that isn't what they really do. They say to take every other index of the array or every 3rd index of the array. 

So, if your array is sorted (like $a$ is), then you could use this to instead define the number of values in a bin. 

That is, for any given value of $M$,

```python
bins = np.append(np.sort(x)[::M], np.max(x))
```
would give bins with $M$ objects in each bin (apart from the last bin, which holds whatever is left over). Note that you need to add the maximum value to set the right edge of the last bin. Try it for $M=100$, 50, and 25.
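To confirm that slicing the sorted array really yields equal-count bins, the sketch below counts the objects per bin for an illustrative sample of 1000 points (an assumption, not the lecture data), using the sample maximum as the final right edge.

```python
import numpy as np

rng = np.random.RandomState(3)
data = rng.normal(0.0, 1.0, 1000)                 # illustrative sample
M = 100

xsorted = np.sort(data)
bins = np.append(xsorted[::M], xsorted[-1])       # every M-th value, plus the maximum
counts, _ = np.histogram(data, bins=bins)
print(counts)
```

The bins are narrow where the data are dense and wide where they are sparse, so the histogram height, rather than the count, carries the density information.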


In [None]:
# Equal-count bins: every M-th sorted value is a bin edge
pyplot.figure(1)
xsorted = np.sort(x)

for panel, (M, ytext) in enumerate([(100, .25), (50, .30), (25, .40)]):
    pyplot.subplot(1, 3, panel + 1)
    # append the maximum to set the right edge of the last bin
    bins = np.append(xsorted[::M], xsorted[-1])
    pyplot.hist(x, bins=bins, histtype='step', density=True)
    plt.text(15, ytext, 'M=%d' % M)
    if M == 100:
        print(bins)


Again, the result can look pretty different depending on the number of objects you choose for each bin, and it looks a lot different from the fixed-width plots above.

Finally, let's try the rules of thumb from above.  _These are implemented in the astroML package_.  The available parameters for string values of the bins keyword are:
- 'blocks' : use Bayesian blocks for dynamic bin widths, as described above.
- 'knuth' : use Knuth's rule to determine bins
- 'scott' : use Scott's rule to determine bins
- 'freedman' : use the Freedman-Diaconis rule to determine bins



In [None]:
from astroML.plotting import hist as fancyhist

pylab.rcParams['figure.figsize'] = (10, 10)


pyplot.subplot(221)
fancyhist(x, bins="scott", histtype="step")
plt.plot(xtemp,ytemp*975)
plt.text(15, 200, 'scott')

pyplot.subplot(222)
fancyhist(x, bins="freedman", histtype="step")
plt.plot(xtemp,ytemp*975)
plt.text(15, 200, 'freedman')

pyplot.subplot(223)
fancyhist(x, bins="knuth", histtype="step")
plt.plot(xtemp,ytemp*975)
plt.text(15, 200, 'knuth')

pyplot.subplot(224)
fancyhist(x, bins="blocks", histtype="step")
plt.plot(xtemp,ytemp*975)
plt.text(15, 200, 'Bayesian Blocks')


These look more representative of the data (except for the skinny peak around 9), but are still different.  Probably they would look better (closer to the actual) if there were more data.

Another option is to plot all the data, as follows. 

This is called a **rug plot**, and now we have a better idea of where most of the data are and where the gaps really are (as opposed to where the binning makes them appear to be). However, the markers are all piled up, so we have lost all sense of the relative numbers of objects. Are there ~10 at x=2.5 or could there be 100?

In [None]:
# execute this cell
pylab.rcParams['figure.figsize'] = (6,6)
plt.hist(x,histtype="step")
plt.plot(x, 0*x, '|', color='k', markersize=25) #Note markersize is (annoyingly) in *points*

### Histogram Errors

Before we move on to alternative methods to express histogram information, let's discuss histogram errors.

What are the errors in the histogram values? Suppose that the bin size $\Delta_b$ has been chosen, and the $N$ data values have been partitioned into $M$ bins, with the count in bin $k$ being $n_k$, where $k=1,\ldots,M$. For a normalized distribution,

$$f_k=\frac{n_k}{\Delta_b N}.$$

Then, an uncertainty can be assigned, using Poisson statistics, as

$$\sigma_k=\frac{\sqrt{n_k}}{\Delta_b N}.$$

This assumes that there are enough points per bin (i.e., at least 15 or 20) that the uncertainties are symmetric and approximately Gaussian.  (Recall your HW problem: how large should $N$ be before the distribution is normal?)
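The two formulas above translate directly into NumPy. This is a minimal sketch on an illustrative Gaussian sample with 15 equal-width bins (both choices are assumptions made for the example).

```python
import numpy as np

rng = np.random.RandomState(4)
data = rng.normal(0.0, 1.0, 2000)                 # illustrative sample
N = data.size

n_k, edges = np.histogram(data, bins=15)          # raw counts in M = 15 equal bins
delta_b = edges[1] - edges[0]                     # constant bin width

f_k = n_k / (delta_b * N)                         # normalized histogram
sigma_k = np.sqrt(n_k) / (delta_b * N)            # Poisson uncertainty on f_k

for lo, f, s, n in zip(edges[:-1], f_k, sigma_k, n_k):
    print("bin starting at %6.2f: f = %.3f +/- %.3f  (n = %d)" % (lo, f, s, n))
```

The bins in the tails hold only a handful of points, which is exactly where the symmetric Gaussian error bar is least trustworthy.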




## Kernel Density Estimation

The next logical step from histograms is to express the histogram information as a smooth curve, rather than steps, i.e., estimate the probability density of the data.

We will first consider non-parametric methods for probability density estimation. The advantage of these is that they, like histograms, make few assumptions about the shape of the distribution, and thus they can better handle problems where the assumed distribution shape is a poor fit to the data.

Computing a histogram naturally leads a researcher to think about two ideas natural to the concept of nonparametric modeling. The first is that the act of binning data naturally implies that there is some relationship between data that are near one another. Associated with that is an idea of a distance measure, i.e., the distance between two points. A second related idea is that of smoothing; that is, a bin implies that the points are related, but the bin size shouldn't be too large to destroy information.

Let's look at a specific example. Imagine that the observations are being drawn from some unknown probability density $p(\mathbf{x})$, and we will estimate this density. Consider some small region $R$ that contains $\mathbf{x}$. The probability mass associated with this region is

$$P=\int_{R} p(\mathbf{x}) d\mathbf{x}.$$

Imagine that we have collected a data set comprised of $N$ observations drawn from this probability distribution. Each data point has probability $P$ of lying within $R$ and probability $1-P$ of lying outside it. The binomial distribution applies, and the total number of points $K$ that lie inside $R$ is distributed as

$$Bin(K|N,P) = \frac{N!}{K!(N-K)!} P^K (1-P)^{N-K}.$$

Recall that for a binomial distribution, the expected fraction of points inside the region is $E[K/N]=P$, and the variance of that fraction is $var[K/N]=P(1-P)/N$.

 - For large $N$, the distribution will be strongly peaked around the mean, so $K\approx NP$, i.e., the number in the region equals the total number times the probability associated with the region $R$.
 - In addition, if the region $R$ is small enough that the probability is uniform over that region, then $P \approx p(\mathbf{x})V$, where $V$ is the volume of $R$. Combining these reveals

$$p(\mathbf{x}) = \frac{K}{NV}.$$

Note that the validity of this equation is based on two competing assumptions:
 - The first is that the region be small enough that the probability is constant; 
 - the other being that the region is large enough that the number of points inside the region is sufficient for the probability to become sharply peaked. 
 
This conflict is exploited in two different approaches: 
 - One can fix $K$ and determine the value $V$ from the data; _this idea gives rise to the K-nearest-neighbor approaches_. 
 - Or one can fix $V$ and determine $K$ from the data, giving rise to the _kernel approach_. 
 
It can be shown, apparently, that both approaches will converge to the true probability density in the limit $N \rightarrow \infty$ provided $V$ shrinks with $N$ and $K$ grows with $N$. (See Bishop for references)
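The estimate $p(\mathbf{x}) \approx K/(NV)$ and the two competing assumptions behind it can be checked numerically. The sketch below assumes, for illustration only, a unit Gaussian as the true density and estimates it at $x_0=0$ with three region sizes: a very small region gives a noisy estimate, a very large one a biased estimate.

```python
import numpy as np

rng = np.random.RandomState(5)
N = 100000
data = rng.normal(0.0, 1.0, N)                    # illustrative sample from N(0, 1)
x0 = 0.0
true_density = 1.0 / np.sqrt(2.0 * np.pi)         # p(x0) for a unit Gaussian

estimates = {}
for V in (0.002, 0.2, 4.0):                       # in 1-D the "volume" is an interval length
    K = np.sum(np.abs(data - x0) < V / 2.0)       # number of points inside the region
    estimates[V] = K / (N * V)
    print("V = %5.3f  K = %6d  K/(NV) = %.3f  (true %.3f)"
          % (V, K, estimates[V], true_density))
```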

So we will next consider these two methods (the kernel approach and the nearest neighbor approach), and then will consider (next week) something somewhat intermediate (the Gaussian mixture model).

### Kernel Density Estimator (KDE)

As noted above, a difficulty in using histograms is that the result depends critically on bin size and even starting value. That is illustrated in the Ivezic Figure 6.1 below.

![Ivezic Figure 6.1 top](http://www.astroml.org/_images/fig_hist_to_kernel_1.png)

In a histogram, each point contributes one unit to the bin, without regard to where the point is in the bin. 

One way to glean more information is to let each point be substituted by its own bin, and let those bins overlap and sum together. 

An example is given below (Ivezic Figure 6.1). The top left is an example of the application of a Kernel Density Estimator (KDE) with a top-hat kernel. The top hat can be replaced by other kernels, such as a Gaussian kernel with different widths (remaining three graphs).

![Ivezic 6.1 bottom](http://www.astroml.org/_images/fig_hist_to_kernel_2.png)

Given a set of measurements $\{x_i\}$, the kernel density estimator at an arbitrary position $x$ is given by

$$f_N(x) = \frac{1}{Nh^D} \sum_{i=1}^{N} K\left(\frac{d(x,x_i)}{h}\right),$$

where $K(u)$ is the kernel function, and $h$ is the bandwidth, i.e., the size of the kernel. So the local density is estimated as the weighted mean of all the points, where the weights result from the distribution $K(u)$ and typically decrease with distance $d(x,x_i)$ from a point.

The kernel function can be any smooth function that is

1. positive at all points 
2. is normalized to unity
3. has a mean of zero
4. has a variance greater than zero.

It is not surprising that a commonly used kernel is the Gaussian kernel,

$$K(u)=\frac{1}{(2\pi)^{D/2}} \exp(-u^2/2),$$

where $D$ is the number of dimensions of the parameter space, and $u=d(x,x_i)/h$.
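To make the estimator concrete, here is a minimal sketch that evaluates the sum above directly for $D=1$ with the Gaussian kernel. The function name `gaussian_kde_1d`, the bimodal test sample, and the bandwidth $h=0.3$ are illustrative assumptions; in practice one would normally use a library implementation.

```python
import numpy as np

def gaussian_kde_1d(x_eval, data, h):
    """Evaluate the D = 1 Gaussian KDE at the points x_eval."""
    u = (x_eval[:, np.newaxis] - data[np.newaxis, :]) / h    # scaled distances, shape (n_eval, N)
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)           # Gaussian kernel values
    return K.sum(axis=1) / (data.size * h)                   # (1 / N h) * sum_i K(u_i)

rng = np.random.RandomState(6)
data = np.concatenate([rng.normal(-2.0, 0.5, 300),           # illustrative bimodal sample
                       rng.normal(1.0, 1.0, 700)])
grid = np.linspace(-8.0, 8.0, 1601)
f_N = gaussian_kde_1d(grid, data, h=0.3)

dx = grid[1] - grid[0]
print("integral of the estimate: %.4f" % (f_N.sum() * dx))
```

Because each kernel is normalized to unity and the sum is divided by $N$, the estimate integrates to 1 regardless of the bandwidth.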

Also useful are the tophat kernel,

$$K(u)=\begin{cases} \frac{1}{V_D(1)} & \mbox{ if } u\leq 1,\\ 0 & \mbox{ if } u > 1,\end{cases}$$

and the exponential kernel,

$$K(u) = \frac{1}{D! V_D(1)} e^{-|u|},$$

where $V_D(r)$ is the volume of a $D$-dimensional hypersphere of radius $r$.

Another commonly used one is the Epanechnikov kernel,

$$K(u)= \frac{3}{4} (1-u^2) \mathbf{1}_{|u|\leq 1},$$

where $\mathbf{1}$ is the indicator function, which just ensures that the kernel function is nonzero only within the bounds indicated by $h$. This function has the computational advantage that it looks roughly Gaussian but doesn't have wings that extend to infinity, and it has other useful attributes, as described below.
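As a check on the normalizations, the sketch below writes out the tophat, exponential, and Epanechnikov kernels for the one-dimensional case, where $V_1(1)=2$ and $D!=1$, and integrates each numerically. The function names are illustrative and are not taken from any library.

```python
import numpy as np

# One-dimensional (D = 1) versions of the kernels, where V_1(1) = 2.
def tophat(u):
    return np.where(np.abs(u) <= 1.0, 0.5, 0.0)

def exponential(u):
    return 0.5 * np.exp(-np.abs(u))

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

u = np.linspace(-40.0, 40.0, 800001)
du = u[1] - u[0]
integrals = {}
for kernel in (tophat, exponential, epanechnikov):
    integrals[kernel.__name__] = kernel(u).sum() * du        # simple Riemann sum
    print("%-13s integrates to %.4f" % (kernel.__name__, integrals[kernel.__name__]))
```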

These kernels and others are compared in the figure below.  These are the kernels available in [sklearn.neighbors.KernelDensity](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity)

![Various kernels](http://scikit-learn.org/stable/_images/sphx_glr_plot_kde_1d_002.png)

So to summarize the KDE process:

 - A kernel sits at each point
 - The kernel has some shape that can be specified
 - The kernel has a width that must be specified by the user.

While a KDE may make more sense than a histogram when trying to determine the probability density of your data, there is still one remaining factor that must be specified - the width of the kernel.

Let's do an example to see the influence of this, as well as the influence of the type of kernel. 

In [None]:
# execute this cell to load the KDE module
from sklearn.neighbors import KernelDensity
xplot = np.linspace(x.min(),x.max(),1000)  # Use this instead of 'x' for plotting - basically resampling the x linearly.
def kde_sklearn(data, bandwidth = 1.0, kernel="linear"):
    kde_skl = KernelDensity(bandwidth = bandwidth, kernel=kernel)
    kde_skl.fit(data[:, np.newaxis])
    log_pdf = kde_skl.score_samples(xplot[:, np.newaxis]) # note that sklearn returns log(density).

    return np.exp(log_pdf)

Before we try the Gaussian and Epanechnikov kernels, let's first start with a tophat using kernel = "tophat", which will produce a plot much like the rug plot.


Start with bandwidth=0.01. See what happens when you adjust this.

In [None]:
pylab.rcParams['figure.figsize'] = (15, 6)


PDFtophat = kde_sklearn(x,bandwidth=0.01,kernel="tophat")
print(PDFtophat.shape)
print(xplot.shape)


pyplot.subplot(131)
plt.plot(xplot,PDFtophat,'b',linewidth=2)
plt.text(15, 0.5, 'bandwidth=0.01')

PDFtophat = kde_sklearn(x,bandwidth=0.1,kernel="tophat")

pyplot.subplot(132)
plt.plot(xplot,PDFtophat,'g',linewidth=2)
plt.text(15, 0.3, 'bandwidth=0.1')

PDFtophat = kde_sklearn(x,bandwidth=1.0,kernel="tophat")

pyplot.subplot(133)
plt.plot(xplot,PDFtophat,'r',linewidth=2)
plt.text(15, 0.15, 'bandwidth=1.0')


In [None]:
# try a linear function


PDFtophat = kde_sklearn(x,bandwidth=0.01,kernel="linear")
print(PDFtophat.shape)
print(xplot.shape)


pyplot.subplot(131)
plt.plot(xplot,PDFtophat,'b',linewidth=2)
plt.text(15, 0.5, 'bandwidth=0.01')

PDFtophat = kde_sklearn(x,bandwidth=0.1,kernel="linear")

pyplot.subplot(132)
plt.plot(xplot,PDFtophat,'g',linewidth=2)
plt.text(15, 0.3, 'bandwidth=0.1')

PDFtophat = kde_sklearn(x,bandwidth=1.0,kernel="linear")

pyplot.subplot(133)
plt.plot(xplot,PDFtophat,'r',linewidth=2)
plt.text(15, 0.15, 'bandwidth=1.0')



In [None]:
# try a gaussian function

PDFtophat = kde_sklearn(x,bandwidth=0.01,kernel="gaussian")
print(PDFtophat.shape)
print(xplot.shape)


pyplot.subplot(131)
plt.plot(xplot,PDFtophat,'b',linewidth=2)
plt.text(15, 0.5, 'bandwidth=0.01')

PDFtophat = kde_sklearn(x,bandwidth=0.1,kernel="gaussian")

pyplot.subplot(132)
plt.plot(xplot,PDFtophat,'g',linewidth=2)
plt.text(15, 0.3, 'bandwidth=0.1')

PDFtophat = kde_sklearn(x,bandwidth=1.0,kernel="gaussian")

pyplot.subplot(133)
plt.plot(xplot,PDFtophat,'r',linewidth=2)
plt.text(15, 0.15, 'bandwidth=1.0')



In [None]:
# try an Epanechnikov function


PDFtophat = kde_sklearn(x,bandwidth=0.01,kernel="epanechnikov")
print(PDFtophat.shape)
print(xplot.shape)


pyplot.subplot(131)
plt.plot(xplot,PDFtophat,'b',linewidth=2)
plt.text(15, 0.5, 'bandwidth=0.01')

PDFtophat = kde_sklearn(x,bandwidth=0.1,kernel="epanechnikov")

pyplot.subplot(132)
plt.plot(xplot,PDFtophat,'g',linewidth=2)
plt.text(15, 0.3, 'bandwidth=0.1')

PDFtophat = kde_sklearn(x,bandwidth=1.0,kernel="epanechnikov")

pyplot.subplot(133)
plt.plot(xplot,PDFtophat,'r',linewidth=2)
plt.text(15, 0.15, 'bandwidth=1.0')



Although there is still quite a bit of flexibility, you can see that with the kernel density estimator, you can get a better sense of the structure of the data than you can with a histogram.

### Cross-validation to Determine the KDE Bandwidth

There exists a rigorous method for determining the best bandwidth; in addition, it motivates the use of the Epanechnikov kernel. The idea is simple. Define a cost function that will evaluate the impact of leaving a point out:

$$ CV_l(h)=\frac{1}{N} \sum_{i=1}^N \log f_{h,-i}(x_i).$$

Recall that $f$ is the estimated density, here the KDE computed with bandwidth $h$. So this represents, as a function of $h$, the mean of the logs of the density, where for each $i$ the density is computed using the data set with point $i$ left out and then evaluated at the left-out point $x_i$. 

The value of $CV_l(h)$ is then maximized as a function of $h$ (it is a log likelihood, so larger is better), and the location of the maximum yields the optimal choice of $h$. The concept of cross validation will appear again when we start looking at supervised learning methods, as well as linear regression.

(Here, you are going to need to evaluate the density at the values of $x_i$ in the sample, rather than on the defined x axis as above.  You do that using the method kde_skl.score_samples, which above is evaluated on defined vector $xplot$, but can be evaluated on any $\{x\}$.)
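For a Gaussian kernel in one dimension, the leave-one-out densities can be computed in a few lines of NumPy by zeroing the diagonal of the kernel matrix, which removes each point's own contribution. This is a minimal sketch; the function name `cv_loglike`, the Gaussian test sample, and the grid of candidate bandwidths are illustrative assumptions.

```python
import numpy as np

def cv_loglike(data, h):
    """Mean leave-one-out log density for a 1-D Gaussian KDE of bandwidth h."""
    N = data.size
    u = (data[:, np.newaxis] - data[np.newaxis, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(K, 0.0)                       # drop each point's own kernel
    f_loo = K.sum(axis=1) / ((N - 1) * h)          # f_{h,-i}(x_i)
    return np.mean(np.log(f_loo))

rng = np.random.RandomState(7)
data = rng.normal(0.0, 1.0, 300)                   # illustrative sample
h_grid = np.logspace(-1.0, 0.5, 16)                # candidate bandwidths, 0.1 to about 3.2
scores = np.array([cv_loglike(data, h) for h in h_grid])
h_best = h_grid[np.argmax(scores)]
print("bandwidth with the largest CV_l: %.3f" % h_best)
```

The kernel matrix has $N^2$ entries, so this direct approach is only practical for samples of modest size.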



An alternative method is to use the mean integrated square error (MISE), defined in general as:

$$MISE=\int_{-\infty}^{+\infty} [f_h(x)-f(x)]^2 dx,$$

where $f_h(x)$ is the density estimated with bandwidth $h$, and $f(x)$ is the true one.  The idea is to minimize this function, and in a way, this is similar to the least-squares estimator that we discussed earlier, as it looks at the squared deviation between the model and the data (the difference is that the least-squares one included the data uncertainty and led to $\chi^2$). This squared-deviation measure corresponds to the $L_2$ norm.  One can also use a cost function that minimizes the absolute deviation between the model and the data, which corresponds to the $L_1$ norm.

Aside: a parameter related to the $L_1$ norm was used by e.g., Leighly 2004  and Casebeer et al. 2006 (there referred to as the FOM or figure of merit) to determine a best fit between data and model, and was chosen to be rather more robust against outliers than $\chi^2$ (although it might not be the best choice to take care of outliers).

Expanding the MISE:

$$\int(f_h-f)^2 = \int f_h^2 - 2 \int f_h f + \int f^2.$$

The last term is dropped because it doesn't depend on the bandwidth $h$. The first term is obtained from the density estimate that has been computed. The expectation of the middle term is:

$$E\left[\int f_h(x) f(x)dx\right] = E\left[\frac{1}{N} \sum_{i=1}^{N} f_{h,-i}(x_i)\right],$$

that is, it involves computing the mean of $f_{h,-i}(x_i)$, the density estimated with point $i$ left out and evaluated at that point. So define the $L_2$ cross-validation score:

$$CV_{L_2}(h) = \int f_h^2 - 2 \frac{1}{N} \sum_{i=1}^{N} f_{h,-i}(x_i).$$

This works because $E[CV_{L_2}(h)+\int f^2] = E[MISE(f_h)].$ Moreover, it can be shown that the Epanechnikov kernel, as described above, has the optimal behavior for this minimization.
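The $L_2$ score can be evaluated numerically in the same spirit: the integral of $f_h^2$ on a grid, minus twice the mean leave-one-out density, which is the sign that follows from the expansion of the MISE. This is a sketch under stated assumptions (Gaussian kernel, one dimension, an illustrative Gaussian sample, hypothetical function names), not the implementation used in Ivezic.

```python
import numpy as np

def gauss_kernel_matrix(points, data, h):
    """Gaussian kernel values K((points_j - data_i) / h) as a 2-D array."""
    u = (points[:, np.newaxis] - data[np.newaxis, :]) / h
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def cv_l2(data, h, grid):
    """L2 cross-validation score for a 1-D Gaussian KDE of bandwidth h."""
    N = data.size
    dx = grid[1] - grid[0]
    f_h = gauss_kernel_matrix(grid, data, h).sum(axis=1) / (N * h)
    int_f2 = np.sum(f_h**2) * dx                   # integral of f_h^2 on the grid
    K = gauss_kernel_matrix(data, data, h)
    np.fill_diagonal(K, 0.0)                       # leave each point out of its own estimate
    f_loo = K.sum(axis=1) / ((N - 1) * h)          # f_{h,-i}(x_i)
    return int_f2 - 2.0 * np.mean(f_loo)

rng = np.random.RandomState(10)
data = rng.normal(0.0, 1.0, 300)                   # illustrative sample
grid = np.linspace(-15.0, 15.0, 3001)
h_grid = np.logspace(-1.0, 0.5, 16)
scores = np.array([cv_l2(data, h, grid) for h in h_grid])
print("bandwidth with the smallest CV_L2: %.3f" % h_grid[np.argmin(scores)])
```

Unlike $CV_l(h)$, this score is minimized, since it estimates the MISE up to a constant.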

Let's try to evaluate the optimal $h$ for the fake data above.  We'll set up a vector of values of $h$, and measure $$ CV_l(h)=\frac{1}{N} \sum_{i=1}^N \log f_{h,-i}(x_i).$$

Let's make the $h$ vector have 21 values, logarithmically spaced between 0.01, which looks too small, and 1.0, which looks too big.

It turns out that this procedure is somewhat fiddly.  If the PDF is zero at the point at which the left-out one is being evaluated, the log_pdf returned will be -inf.  This can be combated by not computing at the very ends (where the data is sparse) and using a Gaussian kernel, which, with its long wings, will not return a pdf equal to zero.


Because it takes a while to run, we will not run it here, but instead extract the results from a pickle file.  The code used to generate it is below.


```python
# Leave-one-out cross-validation of the KDE bandwidth h, evaluated on the data x.
import pickle

xs = np.sort(x)                           # sorted sample, so the sparse tails sit at the ends
h_candidates = 10.0**np.linspace(-2, 0, 21)
numh = h_candidates.shape[0]
numpnts = xs.shape[0]
cv = np.zeros([numh, numpnts])

for i in range(numh):
    kde_skl = KernelDensity(bandwidth=h_candidates[i], kernel='gaussian')
    # skip the 20 smallest and 20 largest values, where the data are sparse
    for j in range(20, numpnts - 20):
        data_one_out = np.concatenate((xs[0:j], xs[j+1:numpnts]))
        kde_skl.fit(data_one_out[:, np.newaxis])
        # log density of the left-out point only
        cv[i, j] = kde_skl.score_samples(xs[j:j+1, np.newaxis])[0]

# save the result so that it does not have to be recomputed
with open('kde_cross_validation_example', 'wb') as outfile:
    pickle.dump(cv, outfile)
```

In [None]:
import pickle
with open('kde_cross_validation_example', 'rb') as infile:
    cv = pickle.load(infile)

h_candidates = 10.0**np.linspace(-2, 0, 21)

# sum the leave-one-out log densities over the data points for each h
cv_out = cv.sum(axis=1)
    
plt.semilogx(h_candidates, cv_out)

print('The value of h when CV is maximized: %s' % h_candidates[np.argmax(cv_out)])

In [None]:

PDFtophat = kde_sklearn(x,bandwidth=0.1258,kernel="gaussian")
print(PDFtophat.shape)
print(xplot.shape)

plt.plot(xplot,PDFtophat,'g',linewidth=2)
xtemp=np.linspace(0,20,1000)
ytemp=true_pdf(xtemp)
plt.plot(xtemp,ytemp)
plt.text(15, 0.25, 'bandwidth=optimal')


Figure 6.3 from Ivezic shows the results of application of several different kernels to data from the SDSS "Great Wall".

![Ivezic Fig 6.3](http://www.astroml.org/_images/fig_great_wall_KDE_1.png)

It turns out that computing the KDE can be quite computationally expensive; in principle, one has to compute the contribution of all $N$ data points at each point where the density is evaluated. See Ivezic for further discussion about this point and algorithms to get around this limitation.

### Examples of KDE Implementation and More Information

[This webpage](http://www.mglerner.com/blog/?p=28) shows some examples of simulations involving histograms and KDE. And [here's](https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/) a comparison of kernel density estimators. And [here](http://scikit-learn.org/stable/modules/density.html) is more discussion plus examples.



### KDE with Measurement Errors

Consider now data with some errors, starting with homoscedastic errors $\sigma$. These errors will correspond to uncertainty in the position of the point, for example. Assume that the data is drawn from the true PDF $h(x)$ and the error is described by $g(x|\sigma)$. Then the observed distribution is given by the convolution of the true PDF and the error:

$$f(x)=(h \star g)(x) = \int_{-\infty}^{\infty} h(x^\prime) g(x-x^\prime) dx^\prime.$$

The method of deconvolution KDE estimates $f(x)$ from the noisy data first, then deconvolves the noise PDF. To take advantage of the convolution theorem, the deconvolution is done in Fourier space. The procedure is:

1. Find the kernel density estimate of the observed data, $f(x)$, and compute its Fourier transform $F(k)$.
2. Compute the Fourier transform $G(k)$ of the noise distribution $g(x)$.
3. Take the inverse Fourier transform of $H(k)=F(k)/G(k)$ in order to extract the true PDF $h(x)$.

Clearly, this procedure is easy or difficult depending on the data and the form of the error. See Ivezic for more details and references.
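The convolution relation that this procedure inverts can be checked numerically. The sketch below covers only the forward step, $f = h \star g$, and assumes for illustration that both the true PDF and the error distribution are Gaussian, in which case the observed distribution is again Gaussian with the variances added. It does not attempt the deconvolution itself, which is much more sensitive to noise.

```python
import numpy as np

def gaussian(x, sigma):
    return np.exp(-0.5 * (x / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-10.0, 10.0, 2001)                 # symmetric grid with an odd number of points
dx = x[1] - x[0]

h_true = gaussian(x, 1.0)                          # assumed true PDF h(x)
g_err = gaussian(x, 0.5)                           # assumed error distribution g(x)

f_obs = np.convolve(h_true, g_err, mode="same") * dx       # numerical (h * g)(x)
f_expected = gaussian(x, np.sqrt(1.0**2 + 0.5**2))         # Gaussian variances add

max_diff = np.max(np.abs(f_obs - f_expected))
print("largest difference: %.2e" % max_diff)
```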

### Extensions and Related Methods

In Ivezic 6.1.3, there are brief comments about extensions and related methods.

- KDE can be extended to other tasks, including classification (kernel discriminant analysis), regression (kernel regression), and conditional density estimation.  We will discuss some of these later.
- These methods can also be extended, e.g., so that each kernel can have its own width.

### K-Nearest Neighbors

Let's return to our simple analysis of the density:

$$p(\mathbf{x})= \frac{K}{NV}, $$

recalling that $K$ is the number of objects in the region, $N$ is the total number of objects, and $V$ is the volume of the region. We motivated kernel density estimation by basically fixing $V$ and determining $K$, basically by setting up bins, rather like a histogram, and determining the membership of the bins $K$.

Now, let's consider the alternative approach: fix $K$ and determine $V$ from the data.

One of the difficulties of the kernel approach is that the kernel width is the same for all kernels. So, in regions with high density, the data may be oversmoothed and detail washed out, while in regions of low density, noisy estimates are obtained. That is, the optimal choice of the bandwidth $h$ depends on the location in data space.

So let's fix the number of points $K$ and use the data to find an appropriate value of $V$. Consider a small sphere centered on the point $x$ at which the density $p(x)$ is to be estimated. Let the radius grow until the sphere contains precisely $K$ points. Then the estimate of the density is given by the equation above with $V$ set to the volume of the resulting sphere. This technique is called K nearest neighbors, and is illustrated in Figure 2.26 of Bishop.

To implement this technique, for each point $x$ find the distance to the Kth nearest neighbor and call that $d_K$. Then, the implied density at that point is

$$f_K(x)=\frac{K}{V_D(d_K)},$$

where the volume $V_D$ is evaluated according to the problem dimensionality. That is, even though the clusters are drawn in one or two dimensions, there could be more dimensions, in which case the volume needs to be computed suitably. So, for example, for $D=1$ it is just a length, $V_1=2d$; for $D=2$, $V_2=\pi d^2$; for $D=3$, $V_3=\frac{4}{3} \pi d^3$; and a suitable hypervolume needs to be used for $D > 3$ (see Equation 7.3 in Ivezic).

Thus this estimator assumes that the density is locally constant. Ivezic notes that in practice, one can compute:

$$f_K(x)=\frac{C}{d_K^D},$$

and evaluate the scaling factor $C$ at the end by requiring that the sum of the product of $f_K$ and the pixel volume equal the number of points. The error in $f_K(x)$ is simply $\sigma_f=K^{1/2}/V_D(d_K)$. The fractional error is $\sigma_f/f_K = 1/K^{1/2}$, so the fractional accuracy of the density improves with $K$ at the expense of the spatial resolution (which scales as $K^{1/D}$). In practice, $K$ should be at least $5$; see Ivezic for references.
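A minimal one-dimensional sketch of the Kth-nearest-neighbor estimator follows. Here the result is divided by $N$ so that it estimates the probability density $K/(NV)$ rather than the number density; the function name `knn_density_1d`, the Gaussian sample, and $K=400$ are illustrative assumptions.

```python
import numpy as np

def knn_density_1d(x_eval, data, K):
    """K-th-nearest-neighbor density estimate in D = 1, normalized by N."""
    dist = np.abs(x_eval[:, np.newaxis] - data[np.newaxis, :])
    d_K = np.sort(dist, axis=1)[:, K - 1]          # distance to the K-th nearest neighbor
    V = 2.0 * d_K                                  # V_1(d) = 2d, the interval length
    return K / (data.size * V)

rng = np.random.RandomState(8)
data = rng.normal(0.0, 1.0, 20000)                 # illustrative sample
x_eval = np.array([-1.0, 0.0, 1.0])
f_K = knn_density_1d(x_eval, data, K=400)
true = np.exp(-0.5 * x_eval**2) / np.sqrt(2.0 * np.pi)
for xe, fe, ft in zip(x_eval, f_K, true):
    print("x = %4.1f  f_K = %.3f  true = %.3f" % (xe, fe, ft))
```

With $K=400$ the expected fractional error is $1/\sqrt{K}$, about 5%.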

Instead of looking at just the Kth nearest neighbor, one can consider the distances to all $K$ nearest neighbors. We can perhaps call this method k neighbors, since it uses all of them. Then

$$f_K(x) = \frac{C}{\sum_{i=1}^{K} d_i^D}.$$

This equation is derived in Ivezic et al. 2005. The normalization is

$$C=\frac{K(K+1)}{2V_D(1)}.$$
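One way to see where this normalization comes from: for uniformly distributed points with number density $n$, the volume enclosing the $i$th neighbor is on average $i/n$, so $V_D(1)\sum_i d_i^D \approx K(K+1)/(2n)$ and $C/\sum_i d_i^D$ returns $n$. The sketch below checks this numerically. The function name `k_neighbors_density` and the uniform test data are assumptions for illustration only.

```python
import numpy as np
from math import gamma, pi
from scipy.spatial import cKDTree

def k_neighbors_density(data, query, K=5):
    """Density estimate C / sum_i d_i^D using all K nearest-neighbor distances."""
    data = np.asarray(data, dtype=float)
    query = np.asarray(query, dtype=float)
    D = data.shape[1]
    V1 = pi ** (D / 2.0) / gamma(D / 2.0 + 1.0)   # V_D(1), the unit-ball volume
    C = K * (K + 1) / (2.0 * V1)
    dist, _ = cKDTree(data).query(query, k=K)
    dist = dist.reshape(len(query), -1)            # shape (M, K)
    return C / (dist ** D).sum(axis=1)

# Hypothetical test case: uniform points in the unit square (true density N).
rng = np.random.default_rng(1)
N = 20000
data = rng.uniform(0.0, 1.0, size=(N, 2))
gx, gy = np.meshgrid(np.linspace(0.2, 0.8, 10), np.linspace(0.2, 0.8, 10))
query = np.column_stack([gx.ravel(), gy.ravel()])

f_all = k_neighbors_density(data, query, K=50)
print(f_all.mean() / N)   # ratio of the average estimate to the true density
```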

This estimator behaves better when the data are sparse. For uniformly distributed data, the two methods are the same. Figure 6.4 from Ivezic, below, shows the application of KDE and K-nearest-neighbors to the SDSS "Great Wall" data.

![Ivezic Figure 6.4](http://www.astroml.org/_images/fig_great_wall_1.png)

A comparison of Bayesian Blocks, KDE, and nearest neighbor densities:

![Ivezic Figure 6.5](http://www.astroml.org/_images/fig_GMM_density_estimation_1.png)

## Cluster Analysis

So far we have been looking at determining the probability density function of data.  But there are other, related questions we can ask.  

The idea of cluster analysis is to look for concentrations of points in a multivariate data set. Alternatively, if the probability density has been estimated, we can look for "overdensities". These are just two of the many possible applications of these ideas.

As above, we will continue to assume that we are working in the "unsupervised" regime, and there is no prior information about the possible clusters.



## K-Means

Consider a data set $\{x_1,\ldots,x_N\}$ consisting of $N$ observations of a $D$-dimensional variable $x$. We will partition the data into some number $K$ of clusters, and we will specify $K$.

The clusters will be formed of points close to one another, and each will be identified by a vector $\mathbf{\mu}_k$; that is, there is a set of $K$ vectors that can be thought of as specifying the centers of the clusters. So each $\mathbf{\mu}_k$ is also **D-dimensional**. The goal is to assign data points to clusters so that the sum of the squares of the distances of each data point to its closest vector $\mathbf{\mu}_k$ is a minimum.

For each data point $x_n$ let us introduce a set of binary indicator variables $r_{nk}$, where $k=1,\ldots,K$, so that $r_{nk}=1$ if a data point $x_n$ is assigned to cluster $k$, and zero otherwise. Then we can define a "distortion measure" $J$ as

$$ J=\sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n-\mathbf{\mu}_k\|^2.$$

$J$ will represent the sum of the squares of the distances of each data point to its assigned vector $\mathbf{\mu}_k$. 

The goal is to determine the values of the $r_{nk}$ and the $\mathbf{\mu}_k$ that minimize $J$. This can be done through an iterative procedure in which each iteration consists of two steps. To start, initial values of the $\mathbf{\mu}_k$ are chosen. In the first step, $J$ is minimized with respect to the $r_{nk}$, keeping the $\mathbf{\mu}_k$ fixed. In the second step, $J$ is minimized with respect to the $\mathbf{\mu}_k$, keeping the $r_{nk}$ fixed. The two steps are repeated until convergence. They correspond to the E (expectation) and M (maximization) steps of the Expectation-Maximization (EM) algorithm, which we will discuss in some detail when we talk about Gaussian Mixture Models (GMM). You will implement this in your HW.

First consider the determination of the $r_{nk}$. $J$ is linear in $r_{nk}$, so this optimization can be performed rather easily, just by determining which $\mathbf{\mu}_k$ is closest to a data point $x_n$ and assigning $r_{nk}=1$ to that combination.

Next, consider the optimization of the $\mathbf{\mu}_k$ with the $r_{nk}$ fixed. $J$ is a quadratic function of $\mathbf{\mu}_k$, so it can be minimized by setting its derivative with respect to $\mathbf{\mu}_k$ equal to zero, giving

$$\sum_{n=1}^{N}r_{nk}(x_n-\mathbf{\mu}_k)=0$$

which can be solved for $\mathbf{\mu}_k$ to yield

$$\boldsymbol{\mu}_k = \frac{\sum_n r_{nk}x_n}{\sum_n r_{nk}}.$$

The denominator of this expression is simply the number of points in the $k$th cluster, so this result is equivalent to setting $\mathbf{\mu}_k$ to the mean of the data points assigned to cluster $k$. This part of the procedure gives the algorithm its name: K-means.

Each of the two steps either reduces $J$ or leaves it unchanged, so convergence is guaranteed.  However, the algorithm can converge to a local minimum rather than a global one.
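The two steps can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `kmeans_iteration`, the synthetic two-blob data, the choice of random data points as initial centers, and the fixed number of iterations are all made up for this example.

```python
import numpy as np

def kmeans_iteration(X, mu):
    """One assignment step followed by one update step."""
    # Step 1: with mu fixed, assign each point to its nearest center (the r_nk).
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # shape (N, K)
    labels = d2.argmin(axis=1)
    # Step 2: with the assignments fixed, move each center to the mean of its points.
    new_mu = mu.copy()
    for k in range(len(mu)):
        members = X[labels == k]
        if len(members) > 0:          # leave an empty cluster's center where it is
            new_mu[k] = members.mean(axis=0)
    # Distortion measure J for the current assignments and updated centers.
    J = ((X - new_mu[labels]) ** 2).sum()
    return new_mu, labels, J

# Hypothetical data: two Gaussian blobs in two dimensions.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([-2.0, 0.0], 0.5, size=(200, 2)),
               rng.normal([2.0, 0.0], 0.5, size=(200, 2))])

K = 2
mu = X[rng.choice(len(X), size=K, replace=False)]   # initial centers: random data points
J_history = []
for _ in range(20):
    mu, labels, J = kmeans_iteration(X, mu)
    J_history.append(J)

print(mu)
print(J_history[:5])
```

Printing `J_history` shows the distortion measure at each iteration; it should never increase from one iteration to the next. Because the result can depend on the starting centers, one would normally rerun from several initializations and keep the solution with the smallest $J$.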

The algorithm is illustrated in Figure 9.1 of Bishop, which shows the sequence of steps to convergence for "the Old Faithful data set", a set of measurements describing the eruptions of this famous geyser.  The two parameters are the duration of the eruption and the length of time between eruptions.  In your HW, you will implement this.

Figure 9.2 shows the cost function $J$ as a function of step, illustrating that convergence is quite fast.

It is worth noting that if there are many points and many clusters, convergence can be slow. So it may be worth looking for algorithms to speed it up. See Bishop for details.

Alternative algorithms exist. Instead of updating all the prototype vectors at once, one can sequentially update them, as:

$$\boldsymbol{\mu}_k^{new} = \boldsymbol{\mu}_k^{old} +
\eta_n(\mathbf{x}_n-\boldsymbol{\mu}_k^{old}),$$

where $\eta_n$ is the learning rate parameter, which is made to decrease as more data points are considered.
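A minimal sketch of the sequential update follows. The learning-rate schedule $\eta_n = 1/n_k$, where $n_k$ is the number of points assigned to prototype $k$ so far, is an assumption made here (it is one common choice, and it makes each prototype the running mean of the points assigned to it); the function name `sequential_kmeans`, the starting prototypes, and the synthetic data are likewise hypothetical.

```python
import numpy as np

def sequential_kmeans(X, mu_init):
    """Update the prototypes one data point at a time."""
    mu = np.array(mu_init, dtype=float)
    counts = np.zeros(len(mu), dtype=int)
    for x in X:
        k = ((mu - x) ** 2).sum(axis=1).argmin()   # nearest prototype
        counts[k] += 1
        eta = 1.0 / counts[k]                       # assumed schedule: eta_n = 1/n_k
        mu[k] += eta * (x - mu[k])
    return mu, counts

# Hypothetical data: two Gaussian blobs, presented in random order.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([-2.0, 0.0], 0.5, size=(200, 2)),
               rng.normal([2.0, 0.0], 0.5, size=(200, 2))])
rng.shuffle(X)

mu, counts = sequential_kmeans(X, mu_init=[[-1.0, 0.0], [1.0, 0.0]])
print(mu)
print(counts)
```

With this schedule the first point assigned to a prototype replaces its initial position entirely ($\eta=1$), so the starting values only influence which prototype claims which points.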

The K-medoids algorithm is another extension. The idea behind this algorithm is that in some cases the appropriate distance between the prototype and the data may not be Euclidean, and a more general dissimilarity measure $\mathcal{V}(x,x^\prime)$ between two vectors $x$ and $x^\prime$ can be used, so that

$$\tilde{J}=\sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}
\mathcal{V}(x_n,\boldsymbol{\mu}_k).$$

This method leaves the precise form of the dissimilarity measure up in the air.
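To make this concrete, here is a small sketch that uses the Manhattan (city-block) distance as the dissimilarity. Two assumptions are made that go beyond the text above: the choice of $\mathcal{V}$ is arbitrary, and each prototype is restricted to be one of the data points, which is a common simplification (discussed in Bishop) that lets the update step be done by a search over cluster members. The function names `manhattan` and `kmedoids` are hypothetical.

```python
import numpy as np

def manhattan(a, b):
    """Pairwise city-block dissimilarity between rows of a and rows of b."""
    return np.abs(a[:, None, :] - b[None, :, :]).sum(axis=2)

def kmedoids(X, K, n_iter=20, seed=0):
    """K-medoids with prototypes restricted to be data points (an assumption)."""
    rng = np.random.default_rng(seed)
    V = manhattan(X, X)                                   # (N, N) dissimilarities
    medoids = rng.choice(len(X), size=K, replace=False)   # indices into X
    history = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its least-dissimilar medoid.
        labels = V[:, medoids].argmin(axis=1)
        # Update step: within each cluster, pick the member that minimizes
        # the summed dissimilarity to all other members.
        for k in range(K):
            members = np.flatnonzero(labels == k)
            if members.size:
                within = V[np.ix_(members, members)].sum(axis=0)
                medoids[k] = members[within.argmin()]
        # Generalized distortion J-tilde for the current assignments and medoids.
        history.append(V[np.arange(len(X)), medoids[labels]].sum())
    return medoids, labels, history

# Hypothetical data: two Gaussian blobs in two dimensions.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([-2.0, 0.0], 0.5, size=(100, 2)),
               rng.normal([2.0, 0.0], 0.5, size=(100, 2))])

medoids, labels, history = kmedoids(X, K=2)
print(X[medoids])     # the prototypes are actual data points
print(history[:5])    # J-tilde at each iteration
```

Storing the full $N \times N$ dissimilarity matrix is only practical for modest $N$; it is done here to keep the example short.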

[Here's](https://www.youtube.com/watch?v=BVFG7fd1H30) a rather fun example of the application of the k-means clustering algorithm.

### Example 

k-means is implemented in sklearn, but actually I'm going to use the one from scipy.  We will employ it on the famous "iris" data set.  It consists of 150 sets of measurements of four different characteristics of iris flowers:
   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm




In [None]:
r=np.loadtxt('iris_data.dat')
print(r.shape)

It may help to rescale the data by subtracting the mean and dividing by the standard deviation.  This ensures that no single measurement dominates the distances.

In [None]:
SLength=(r[:,0]-r[:,0].mean())/r[:,0].std()
SWidth=(r[:,1]-r[:,1].mean())/r[:,1].std()
PLength=(r[:,2]-r[:,2].mean())/r[:,2].std()
PWidth=(r[:,3]-r[:,3].mean())/r[:,3].std()

In [None]:
pylab.rcParams['figure.figsize'] = (15,10)

pyplot.subplot(231)
plt.plot(SLength,SWidth,'o')

pyplot.subplot(232)
plt.plot(SLength,PLength,'o')

pyplot.subplot(233)
plt.plot(SLength,PWidth,'o')

pyplot.subplot(234)
plt.plot(SWidth,PLength,'o')

pyplot.subplot(235)
plt.plot(SWidth,PWidth,'o')

pyplot.subplot(236)
plt.plot(PLength,PWidth,'o')



In [None]:
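from scipy.cluster.vq import kmeans,vq,whiten

# Stack the four standardized measurements into an (N, 4) array of observations.
temp=np.vstack((SLength,SWidth,PLength,PWidth))
properties=temp.T

print(properties.shape)

# kmeans returns the centroids (and the distortion, ignored here);
# vq assigns each observation to its nearest centroid.
centroids,_ = kmeans(properties,2)
idx,_ = vq(properties,centroids)

print(idx.shape)
print(idx.min(),idx.max())
print(centroids)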

In [None]:
pyplot.subplot(231)
plt.plot(SLength[idx==0],SWidth[idx==0],'o')
plt.plot(SLength[idx==1],SWidth[idx==1],'o')

pyplot.subplot(232)
plt.plot(SLength[idx==0],PLength[idx==0],'o')
plt.plot(SLength[idx==1],PLength[idx==1],'o')

pyplot.subplot(233)
plt.plot(SLength[idx==0],PWidth[idx==0],'o')
plt.plot(SLength[idx==1],PWidth[idx==1],'o')

pyplot.subplot(234)
plt.plot(SWidth[idx==0],PLength[idx==0],'o')
plt.plot(SWidth[idx==1],PLength[idx==1],'o')

pyplot.subplot(235)
plt.plot(SWidth[idx==0],PWidth[idx==0],'o')
plt.plot(SWidth[idx==1],PWidth[idx==1],'o')

pyplot.subplot(236)
plt.plot(PLength[idx==0],PWidth[idx==0],'o')
plt.plot(PLength[idx==1],PWidth[idx==1],'o')


It seems very clear that there are at least two clusters in these data.  What about three?  

In [None]:
centroids,_ = kmeans(properties,3)
idx,_ = vq(properties,centroids)

print(idx.shape)
print(idx.min(),idx.max())
print(centroids)

In [None]:
pyplot.subplot(231)
plt.plot(SLength[idx==0],SWidth[idx==0],'ko')
plt.plot(SLength[idx==1],SWidth[idx==1],'ro')
plt.plot(SLength[idx==2],SWidth[idx==2],'go')

pyplot.subplot(232)
plt.plot(SLength[idx==0],PLength[idx==0],'ko')
plt.plot(SLength[idx==1],PLength[idx==1],'ro')
plt.plot(SLength[idx==2],PLength[idx==2],'go')

pyplot.subplot(233)
plt.plot(SLength[idx==0],PWidth[idx==0],'ko')
plt.plot(SLength[idx==1],PWidth[idx==1],'ro')
plt.plot(SLength[idx==2],PWidth[idx==2],'go')

pyplot.subplot(234)
plt.plot(SWidth[idx==0],PLength[idx==0],'ko')
plt.plot(SWidth[idx==1],PLength[idx==1],'ro')
plt.plot(SWidth[idx==2],PLength[idx==2],'go')

pyplot.subplot(235)
plt.plot(SWidth[idx==0],PWidth[idx==0],'ko')
plt.plot(SWidth[idx==1],PWidth[idx==1],'ro')
plt.plot(SWidth[idx==2],PWidth[idx==2],'go')

pyplot.subplot(236)
plt.plot(PLength[idx==0],PWidth[idx==0],'ko')
plt.plot(PLength[idx==1],PWidth[idx==1],'ro')
plt.plot(PLength[idx==2],PWidth[idx==2],'go')


In [None]:
from mpl_toolkits.mplot3d import Axes3D

pylab.rcParams['figure.figsize'] = (10, 10)

fig=plt.figure()
ax=fig.add_subplot(111,projection='3d')

# Set the x and y limits directly on the 3D axes.
ax.set_xlim(-3,3)
ax.set_ylim(-3,3)

ax.scatter3D(SLength[idx==0],PLength[idx==0],SWidth[idx==0],c='k',linewidth=2)
ax.scatter3D(SLength[idx==1],PLength[idx==1],SWidth[idx==1],c='r',linewidth=2)
ax.scatter3D(SLength[idx==2],PLength[idx==2],SWidth[idx==2],c='g',linewidth=2)

ax.set_xlabel('SLength')
ax.set_ylabel('PLength')
ax.set_zlabel('SWidth')


plt.show()
