# Clustering 01

In this notebook, we want to find which clusters correspond to which cortical layer in the primary data.

In [1]:
import pandas as pd 
import numpy as np
import sklearn as sk
import hdbscan
import dask.dataframe as da
import dask

Let's read in the data, perform a sample run of density-based clustering, and visualize the results with a UMAP projection onto $\mathbb{R}^2$.

In [2]:
organoid = da.read_csv('../data/processed/organoid.csv')
primary = da.read_csv('../data/processed/primary.csv')

In [3]:
organoid.head()

Unnamed: 0,SLC16A8,IMP4,ST8SIA2,ATP2B3,RP3-394A18.1,NTAN1,PDP1,RABGGTB,ENPP5,AIP,...,NIPA1,RP11-728G15.1,RAD51B,NPIPB11,CTR9,CBR1,RPUSD3,PRDM5,YBX1,NUP50
0,0.0,0.0,0.0,0.0,0.376018,0.0,0.0,0.0,0.0,1.086759,...,0.0,0.0,0.0,0.0,0.087199,0.0,0.669813,0.223595,3.286254,0.241645
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.047786,0.0,0.0,0.0,3.609096,0.965683
2,0.0,0.0,0.0,0.0,0.0,0.747101,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.017795,0.0,0.0,3.5776,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.99626,0.0
4,0.0,0.888053,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.3441,0.0


As usual, let's register the dask progress bar.

In [4]:
from dask.diagnostics import ProgressBar
pbar = ProgressBar()
pbar.register()  # register() returns None, so don't rebind pbar to its result
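
`register()` turns the progress bar on for every later `compute()` call. When only one computation needs it, `ProgressBar` can also be used as a context manager. A minimal sketch, using a toy delayed function (`add` is a hypothetical stand-in, not part of this analysis):

```python
import dask
from dask.diagnostics import ProgressBar


@dask.delayed
def add(a, b):
    # Toy task: nothing runs until .compute() is called
    return a + b


lazy_sum = add(1, 2)  # builds a task graph, does no work yet

# The progress bar is active only inside this block
with ProgressBar():
    result = lazy_sum.compute()

print(result)
```

A globally registered bar can be switched off again with `pbar.unregister()`.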

## Clustering

We begin with HDBSCAN, a density-based clustering method that suits two properties of our problem: we do not want to assume the clusters are Gaussian balls, and we do not know the number of true clusters a priori. We also cannot reasonably inspect the data pairwise: with roughly 16k features, that would mean on the order of $10^8$ scatter plots.
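
As a quick sanity check on why pairwise inspection is out of reach, the number of distinct feature pairs for $n$ features is $\binom{n}{2} = n(n-1)/2$. The sketch below assumes a round figure of 16,000 features, taken from the "16k" above rather than from the actual column count:

```python
import math

# Assumed round number based on the "16k" in the text, not the exact gene count
n_features = 16_000

# One scatter plot per unordered pair of features
n_pairwise_plots = math.comb(n_features, 2)

print(f"{n_pairwise_plots:,}")  # 127,992,000
```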

In [5]:
@dask.delayed
def dask_cluster(data):
    # Fit on the argument that was passed in, not on the global `primary`
    clusterer = hdbscan.HDBSCAN(min_cluster_size=3)
    return clusterer.fit(data)

In [9]:
prim_cluster = dask_cluster(primary)

In [None]:
prim_cluster = prim_cluster.compute()

[##################                      ] | 46% Completed | 35min 53.5s

Let's look at which clusters HDBSCAN found.

In [None]:
# Cluster assignments live in `labels_`; HDBSCAN marks noise points with -1
set(prim_cluster.labels_)
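
Beyond the set of distinct labels, it is often useful to know how many points fell into each cluster and how many were left as noise (HDBSCAN assigns noise the label `-1`). A minimal sketch, using a made-up label vector as a stand-in for the fitted model's `labels_` attribute:

```python
from collections import Counter

# Hypothetical labels for illustration only; in the notebook this would be
# the `labels_` array of the fitted HDBSCAN object
labels = [0, 0, 1, -1, 1, 1, 2, -1, 0]

counts = Counter(labels)
n_noise = counts.pop(-1, 0)           # points not assigned to any cluster
n_clusters = len(counts)              # distinct cluster ids, noise excluded
sizes = dict(sorted(counts.items()))  # cluster id -> number of points

print(f"{n_clusters} clusters, {n_noise} noise points, sizes: {sizes}")
```

With `min_cluster_size=3`, many very small clusters or a large noise fraction would be a hint to revisit that parameter.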

## Cluster visualization

Now that we've clustered the data with HDBSCAN, let's visualize a 2D projection of it using UMAP.

In [None]:
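import umap

N_NEIGB = 15
# Materialize the lazy dask frame so UMAP and pandas see concrete data
# (assumes the primary data fits in memory)
primary_pd = primary.compute()
# `verbose` is a constructor argument, not a fit_transform argument
proj = umap.UMAP(n_neighbors=N_NEIGB, verbose=True)
# Bind the result to a new name so the `umap` module is not shadowed
embedding = proj.fit_transform(primary_pd)

Now that we've run UMAP, let's attach the HDBSCAN labels to the projected data and visualize it.

In [19]:
umap_df = pd.DataFrame(embedding, index=primary_pd.index)
umap_df['Labels'] = prim_cluster.labels_
# UMAP defaults to two components, so only two columns need renaming
umap_df = umap_df.rename({0: 'UMAP_1', 1: 'UMAP_2'}, axis=1)

umap_df.head()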

Using `seaborn`, we can generate a scatter plot colored by cluster.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(10, 10))

sns.scatterplot(
    x="UMAP_1",
    y="UMAP_2",
    data=umap_df,    # the projected data built above
    hue='Labels',    # HDBSCAN cluster assignments
    legend='full',
    ax=ax,
)

plt.title(f'UMAP Projection of Primary data, colored by clusters found by HDBSCAN, n_neighbors={N_NEIGB}')
plt.show()