# Kernel Density Estimation for Flow Cytometry Data  

**Using kernel density estimation (KDE) to analyze multivariate flow cytometry data.**

By Jihyun Park (`jihyunp@ics.uci.edu`) and Padhraic Smyth (`smyth@ics.uci.edu`)<br/>
Department of Computer Science, University of California, Irvine

Presented as part of the UCI BigDIPA Lab

September 2017

## Outline
-------------------------
Kernel density estimation (KDE) is a non-parametric way to estimate a probability density function.
It uses a finite sample of data to make inferences about the underlying probability density function that generated the data.

In this portion of the lab we will use KDE techniques to analyze multidimensional data obtained via flow cytometry from human subjects.

## Requirements
--------------------------------
- Familiarity with the main KDE python notebook for the BigDIPA course.
- Basic knowledge of probability.
- Familiarity with programming and nD array computing (e.g. working with matrices in Matlab or numpy in python).
- Python 3.5 or Python 2.7 with the following libraries: Jupyter (for the IPython notebook), numpy, pandas, scipy, scikit-learn, matplotlib.
- We recommend installing the newest versions of these libraries.

## References
----------------------------------
- [Scikit-learn Density Estimation](http://scikit-learn.org/stable/modules/density.html) : Description and examples on density estimation including KDE.
- [Scikit-learn KDE Package Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html) : `KernelDensity` class documentation.

## Before We Start : Import Packages
-----------------------------

Import the necessary packages (you may have already imported them).

In [None]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('classic')
%matplotlib inline

from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV
# from sklearn.grid_search import GridSearchCV # if you have older version of sklearn
from sklearn.cluster import KMeans

If you get an **`ImportError`** for **`model_selection`**, your scikit-learn is older than version 0.18. 
In that case you can import **`GridSearchCV`** from the older **`sklearn.grid_search`** module (which was removed in scikit-learn 0.20):

- `from sklearn.grid_search import GridSearchCV`

or update **`scikit-learn`** by running one of the commands below in a **terminal window**.

- `conda install scikit-learn` : If you installed python through anaconda
- `pip install -U scikit-learn`



## Flow Cytometry Data

In [None]:
# define the directory where the data is
datapath = './spatial_data/'

We will import a standard flow-cytometry data set. 

Each row in the file corresponds to a single cell from a blood sample taken from a specific individual.

There are 5 columns, each of which corresponds to a different *marker*. A marker is essentially a way to measure some physical or chemical property of a cell, e.g., its size. So we can think of the 5 markers as 5 different measurements for each cell.

Below we will select dimensions 2 and 3 as the markers we will analyze. In a typical flow cytometry data analysis we would look at all dimensions (all columns), but since it's easier to visualize data in 2 dimensions we will just focus on dimensions 2 and 3 here.

As a side note, modern flow cytometry data sets are often much more complex than the single file we are analyzing here, e.g.,
- there can be 10, 20, or more dimensions (markers)
- there can be many different files corresponding to different time-points or tissues for the same individual, or multiple individuals
 

In [None]:
# Read in the flow-cytometry data matrix
filename = datapath + "flow_cytometry.csv" 
FCdata = pd.read_csv(filename).values  # .values replaces as_matrix(), which was removed in pandas 1.0
print(FCdata.shape)

# Extract columns 2 and 3 (zero-based indices) and store in Xdata
Xdata = FCdata[:, 2:4]
print(Xdata.shape)
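
Because Python indexes from zero and a slice excludes its stop index, `2:4` keeps the columns with indices 2 and 3, i.e. the third and fourth of the five markers. A minimal illustration on a made-up array (the name `toy` and its values are placeholders, not real cytometry measurements):

```python
import numpy as np

# Hypothetical stand-in for FCdata: 4 cells (rows) x 5 markers (columns)
toy = np.arange(20).reshape(4, 5)

# The slice 2:4 keeps column indices 2 and 3 (the stop index 4 is excluded)
toy_xy = toy[:, 2:4]

print(toy.shape)     # (4, 5)
print(toy_xy.shape)  # (4, 2)
print(toy_xy[0])     # [2 3]
```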

# PART 1
-------------------------------


## 1. Plot the data

Below we plot the data in the two-dimensional space (marker 2 and marker 3) that we have selected to analyze. 

We see that there appears to be cluster structure in the data. We would expect that the KDEs (with an appropriate bandwidth selected) will show that there are at least 2 large modes in the data.

Note that these modes or clusters correspond to different types of blood cells that are biologically meaningful. There is significant interest in being able to automatically detect such clusters of cells (e.g., for biological discovery and for clinical cancer diagnosis).

In [None]:
fig, ax = plt.subplots(figsize=(5,5))
# Plot the data points
ax.scatter(Xdata[:,0], Xdata[:,1], s=20, alpha=0.5, linewidth=0)
ax.set_xlabel('Marker 2 (SS)')
ax.set_ylabel('Marker 3 (FL1.LOG)')
plt.show()

### Grid arrays
Before we plot the density, we will generate the grid arrays, because we want a density value at each grid location. We're going to generate an `ngrid` x `ngrid` mesh grid (200 x 200 with the setting below). If you want to make the grid denser, increase **`ngrid`** in the code below.

After generating the mesh grid, we're going to flatten it into an (N x 2) array, where N = `ngrid` x `ngrid`, so that it can be used as an input for other functions. <br/> 
The variables **`X`**, **`Y`**, **`xy`**, and **`ngrid`** will be used throughout the lab. 

In [None]:
# Generate Mesh Grid for Plotting (ngrid x ngrid matrix)

# limits and grids for FC data
lower_lim = -100
upper_lim = 800

ngrid = 200  # grid points per axis; a larger value gives a finer grid but takes longer to evaluate

xgrid = np.linspace(lower_lim, upper_lim, ngrid)
ygrid = np.linspace(lower_lim, upper_lim, ngrid)

X, Y = np.meshgrid(xgrid, ygrid) # Now we have (ngrid x ngrid) matrix

# ravel() flattens an (ngrid x ngrid) matrix into a 1-D array of length ngrid**2
xy = np.vstack([X.ravel(), Y.ravel()]).T

# Print the shapes!
print('Shape of X and Y: ' + str(X.shape))
print('Shape of xy: ' + str(xy.shape))
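
The flatten-and-reshape round trip is easier to follow on a tiny grid. The sketch below uses a hypothetical 3 x 3 grid (the names `nx`, `ticks`, `gx`, `gy` and `pts` are illustrative and separate from the lab's variables) to show that anything evaluated on the flattened points can be reshaped back onto the mesh grid.

```python
import numpy as np

nx = 3
ticks = np.linspace(0.0, 1.0, nx)        # [0.0, 0.5, 1.0]
gx, gy = np.meshgrid(ticks, ticks)       # both have shape (nx, nx)

# Flatten to one (x, y) pair per row: shape (nx*nx, 2)
pts = np.vstack([gx.ravel(), gy.ravel()]).T

# Any function evaluated on pts (here x + 10*y) can be reshaped back to the grid
vals = pts[:, 0] + 10.0 * pts[:, 1]
vals_grid = vals.reshape((nx, nx))

print(pts.shape)        # (9, 2)
print(vals_grid.shape)  # (3, 3)
```

Row `k` of `pts` corresponds to grid entry `(k // nx, k % nx)`, which is why `run_kde()` later in the lab can reshape its flat output to `(ngrid, ngrid)`.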

## 2. Define a wrapper function that runs KDE

Now define a wrapper function that runs KDE and returns the evaluated density $\hat{p}(x)=  \frac{1}{N} \sum_{i=1}^{N} K\left(x - x_i; h \right)$ as an `(ngrid x ngrid)` matrix, where $x_1, \dots, x_N$ are the data points, $K$ is the kernel function, and $h$ is the bandwidth. <br/>
It is better to define a wrapper function since this will be used a lot!

In [None]:
def run_kde(Xdata, bandwidth, metric, kernel):
    """ Construct a KernelDensity object, fit with the data points we generated, 
        and then return the evaluated density for the (ngrid X ngrid) mesh grid """
    # Construct a kernel density object
    kde = KernelDensity(bandwidth=bandwidth, metric=metric, kernel=kernel)
    kde.fit(Xdata)
    # kde.score_samples() returns values in log scale
    # xy is the flattened mesh grid that we defined earlier (used as global var.)
    log_p_hat = kde.score_samples(xy)
    phat = np.exp(log_p_hat)
    phat = phat.reshape((ngrid, ngrid))
    return phat

print('Function run_kde() defined.')
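
To connect the formula for $\hat{p}(x)$ with code, here is a brute-force numpy sketch of the same sum. It assumes an isotropic Gaussian kernel, $K(u; h) = (2\pi h^2)^{-d/2} \exp(-\|u\|^2 / (2h^2))$, which is intended to mirror `kernel='gaussian'` with `metric='euclidean'`. The function name `gaussian_kde_manual` and the toy data are made up for illustration, and this is not how scikit-learn implements KDE internally.

```python
import numpy as np

def gaussian_kde_manual(data, query, h):
    """Evaluate p_hat(x) = (1/N) * sum_i K(x - x_i; h) with an isotropic Gaussian K.

    data:  (N, d) array of sample points x_i
    query: (M, d) array of locations x
    h:     bandwidth (standard deviation of the kernel)
    """
    d = data.shape[1]
    # Squared Euclidean distance between every query point and every data point: (M, N)
    diff = query[:, None, :] - data[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Gaussian kernel, normalized so that it integrates to 1 in d dimensions
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)
    kernel_vals = np.exp(-sq_dist / (2.0 * h ** 2)) / norm
    # Average over the N data points
    return kernel_vals.mean(axis=1)

# Toy 2-D data (made up): two small groups of points
toy = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
                [5.0, 5.0], [5.5, 4.5], [4.5, 5.5]])

# Evaluate on a grid wide enough to capture almost all of the probability mass
ticks = np.linspace(-5.0, 10.0, 151)
gx, gy = np.meshgrid(ticks, ticks)
pts = np.vstack([gx.ravel(), gy.ravel()]).T
p = gaussian_kde_manual(toy, pts, h=1.0)

# A density should be non-negative and integrate to about 1
cell_area = (ticks[1] - ticks[0]) ** 2
print('min density: %.3g' % p.min())
print('approx. integral: %.3f' % (p.sum() * cell_area))
```

The pairwise-distance array has one entry per (grid point, data point) pair, so this approach is only practical for small inputs.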

## 3. Plot the estimated density
### 3.1 Manually selected bandwidth

In [None]:
# Generate the estimated density using kernel density estimation
selected_bw = 60
phat = run_kde(Xdata, bandwidth=selected_bw, metric='euclidean', kernel='gaussian') 

fig, axs = plt.subplots(1,2,figsize=(11,5), sharex=True, sharey=True) 
ax = axs[0]
levels = np.linspace(phat.min(), phat.max(), 20)
im = ax.contourf(X, Y, phat, levels=levels, cmap='Blues')
ax.set_xlabel('Marker 2 (SS)')
ax.set_ylabel('Marker 3 (FL1.LOG)')
ax.set_title('Density from KDE (BW=%.2f)' % selected_bw)

ax = axs[1]
ax.scatter(Xdata[:,0], Xdata[:,1], s=30, c='blue', alpha=0.5, linewidth=0)
ax.set_xlim(lower_lim, upper_lim)
ax.set_ylim(lower_lim, upper_lim)
ax.set_xlabel('Marker 2 (SS)')
ax.set_ylabel('Marker 3 (FL1.LOG)')
plt.show()
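
Why the bandwidth matters for seeing modes can be checked on a small 1-D example. The sketch below is an illustration on made-up, evenly spaced points (the helper names `kde_1d` and `count_modes` are hypothetical, not part of the lab code): it counts the local maxima of a Gaussian KDE for a narrow and a wide bandwidth.

```python
import numpy as np

def kde_1d(data, grid, h):
    """Gaussian KDE on a 1-D grid: average of Gaussian bumps of width h."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def count_modes(p):
    """Count strict local maxima of a curve sampled on a grid."""
    interior = p[1:-1]
    return int(np.sum((interior > p[:-2]) & (interior > p[2:])))

# Made-up 1-D data: two well-separated groups of evenly spaced points
data = np.concatenate([np.linspace(-2.0, 2.0, 50), np.linspace(4.0, 8.0, 50)])
grid = np.linspace(-6.0, 12.0, 721)

for h in [0.8, 5.0]:
    n_modes = count_modes(kde_1d(data, grid, h))
    print('bandwidth %.1f -> modes: %d' % (h, n_modes))
```

On this toy data the narrow bandwidth keeps the two groups as separate modes, while the wide bandwidth smooths them into a single bump. The same trade-off applies to the 2-D density above when `selected_bw` is changed.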

## 4. Automated Bandwidth Selection

Cross-validation can be used to select the bandwidth automatically. Cross-validation is a model validation technique for assessing how well the results will generalize to an independent data set. In K-fold cross-validation, the randomized data are split into K sets; K-1 sets are used for estimating the density (the training set) and the remaining set is used for evaluation (the validation set). We repeat this K times, so that each set serves as the validation set once, and a score is calculated on the validation set in each run. The overall cross-validation score is the mean of the K scores. (More info: [Wikipedia: Cross-validation](https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29))

Scikit-learn has a nice class called **`GridSearchCV`** that does all of this work for us! It calls the estimator's **`score()`** method to calculate the score. The **`KernelDensity`** class has a method **`score(valX)`** that returns the total log probability of the validation data **`valX`** under the model. **`GridSearchCV`** will calculate the cross-validation score for each bandwidth value, and then report the bandwidth that gave the highest score. <br/>(Package info: [sklearn.model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html))

We will use 5-fold cross-validation.

In [None]:
min_bw = 30
max_bw = 80
grid = GridSearchCV(KernelDensity(metric='euclidean', kernel='gaussian'),
                    {'bandwidth': np.linspace(min_bw, max_bw, 20)}, cv=5) # 5-fold cross-validation
grid.fit(Xdata)
print(grid.best_params_)
bw_cv = grid.best_params_['bandwidth'] # Bandwidth value saved in 'bw_cv'
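
To make the cross-validation procedure concrete, here is a numpy-only sketch of K-fold bandwidth selection by held-out log-likelihood. It is a simplified illustration, not scikit-learn's implementation: it assumes a Gaussian kernel and contiguous folds, runs on made-up data, and all names (`log_density`, `cv_score`, `toy`, `candidates`) are hypothetical.

```python
import numpy as np

def log_density(train, test, h):
    """Log of the Gaussian KDE built on `train`, evaluated at the rows of `test`."""
    d = train.shape[1]
    sq_dist = np.sum((test[:, None, :] - train[None, :, :]) ** 2, axis=2)
    log_kernel = -sq_dist / (2.0 * h ** 2) - 0.5 * d * np.log(2.0 * np.pi * h ** 2)
    # log of the mean of exp(log_kernel) over training points, computed stably
    m = log_kernel.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_kernel - m).mean(axis=1, keepdims=True))).ravel()

def cv_score(data, h, n_folds=5):
    """Mean over folds of the total held-out log-likelihood for bandwidth h."""
    folds = np.array_split(np.arange(len(data)), n_folds)
    scores = []
    for held_out in folds:
        train_mask = np.ones(len(data), dtype=bool)
        train_mask[held_out] = False
        scores.append(log_density(data[train_mask], data[held_out], h).sum())
    return float(np.mean(scores))

# Made-up 2-D data: two Gaussian blobs, shuffled so every fold sees both
rng = np.random.RandomState(0)
toy = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                 rng.normal(6.0, 1.0, size=(100, 2))])
rng.shuffle(toy)

candidates = np.linspace(0.1, 3.0, 15)
scores = [cv_score(toy, h) for h in candidates]
best_h = candidates[int(np.argmax(scores))]
print('best bandwidth: %.3f' % best_h)
```

Very small bandwidths are penalized because held-out points far from any training point receive almost zero density, and very large bandwidths are penalized because the density is spread too thinly, so the score typically peaks somewhere in between.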

In [None]:
# Generate the estimated density using kernel density estimation
phat = run_kde(Xdata, bandwidth=bw_cv, metric='euclidean', kernel='gaussian') 

fig, axs = plt.subplots(1, 2, figsize=(11,5), sharex=True, sharey=True) 
ax = axs[0]
levels = np.linspace(phat.min(), phat.max(), 20)
im = ax.contourf(X, Y, phat, levels=levels, cmap='Blues')
ax.set_xlim(lower_lim, upper_lim)
ax.set_ylim(lower_lim, upper_lim)
ax.set_xlabel('Marker 2 (SS)')
ax.set_ylabel('Marker 3 (FL1.LOG)')
ax.set_title('Density from KDE (BW=%.2f)' % bw_cv)

ax = axs[1]
ax.scatter(Xdata[:,0], Xdata[:,1], s=25, c='blue', linewidth=0, alpha=0.6)
ax.set_xlabel('Marker 2 (SS)')
ax.set_ylabel('Marker 3 (FL1.LOG)')
ax.set_title('Data')

plt.show()

# PART 2 
--------------------


## K-Means Clustering

Since we could clearly see two clusters in the data, we will try one of the best-known clustering algorithms, **K-means**. The K-means algorithm clusters data by trying to separate the samples into n groups of equal variance, minimizing the within-cluster sum of squares. This algorithm requires the number of clusters to be specified in advance; here we use 2.

We are going to use scikit-learn's K-means package (`sklearn.cluster.KMeans`).

K-means description on Scikit-learn : [http://scikit-learn.org/stable/modules/clustering.html#k-means](http://scikit-learn.org/stable/modules/clustering.html#k-means)<br>
K-means package documentation : [http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
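
The idea of minimizing the within-cluster sum of squares can be sketched in a few lines of numpy. The function below runs plain Lloyd iterations from initial centers passed in by the caller; it is a teaching sketch on made-up data with hypothetical names (`kmeans_sketch`, `toy`), whereas `sklearn.cluster.KMeans` adds a smarter initialization, multiple restarts and convergence checks.

```python
import numpy as np

def kmeans_sketch(data, init_centers, n_iter=20):
    """Plain Lloyd iterations: assign points to the nearest center, then recompute centers."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (N, k)
        sq_dist = np.sum((data[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(sq_dist, axis=1)
        # Move each center to the mean of its assigned points (keep it if it has none)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    # Final assignment and within-cluster sum of squares for the returned centers
    sq_dist = np.sum((data[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    labels = np.argmin(sq_dist, axis=1)
    inertia = float(sq_dist[np.arange(len(data)), labels].sum())
    return labels, centers, inertia

# Made-up data: two tight groups of three points each
toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]])
labels, centers, inertia = kmeans_sketch(toy, init_centers=[toy[0], toy[3]])
print(labels)
print(centers)
print('within-cluster sum of squares: %.3f' % inertia)
```

Each center ends up at the mean of its group, which is exactly the point that minimizes the sum of squared distances to that group's members.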

In [None]:
kmeans = KMeans(n_clusters=2, max_iter=300, tol=0.0001)
kmeans = kmeans.fit(Xdata)

In [None]:
kmeans.labels_

In [None]:
group0_idx = np.where(kmeans.labels_==0)[0]
group1_idx = np.where(kmeans.labels_==1)[0]

Xdata0 = Xdata[group0_idx]
Xdata1 = Xdata[group1_idx]

In [None]:
print(Xdata.shape)
print(Xdata0.shape, Xdata1.shape)

In [None]:
centers = kmeans.cluster_centers_
print(centers)

In [None]:
fig, ax = plt.subplots(figsize=(5,5), sharex=True, sharey=True) 
ax.scatter(Xdata0[:,0], Xdata0[:,1], s=30, c='blue', alpha=0.5, linewidth=0, label='Group0')
ax.scatter(Xdata1[:,0], Xdata1[:,1], s=30, c='red', alpha=0.5, linewidth=0, label='Group1')
ax.scatter(centers[:,0], centers[:,1], s=30, c='black', label='Center')
ax.set_xlabel('Marker 2 (SS)')
ax.set_ylabel('Marker 3 (FL1.LOG)')
ax.legend(loc='upper right')
plt.show()