# Clustering 01

In this notebook, we want to determine which clusters correspond to which cortical layer in the primary data.

In [15]:
import pandas as pd 
import numpy as np
import sklearn as sk
import hdbscan

Let's read in the data, perform a sample run of density-based clustering, and visualize the results with a UMAP projection onto $\mathbb{R}^2$.

In [5]:
organoid = pd.read_csv('../data/processed/organoid.tsv', sep='\t').set_index('cell', drop=True)
primary = pd.read_csv('../data/processed/primary.tsv', sep='\t').set_index('cell', drop=True)

In [6]:
organoid.head()

cell,DPM1,RAB18,LRP12,ZNF286B,CDK9,TMEM132D,MRPL41,CCDC7.1,SPN,TYK2,...,ZNF441,MSH2,PAFAH1B2,LSR,NDUFC1,DHCR7,ZNF532,GATAD1,LDHA,Type
H1SWeek3_AAACCTGAGACAAAGG,0.799305,0.16077,0.456607,0.0,0.0,0.455842,1.628209,0.0,0.0,0.0,...,0.0,0.621518,0.63328,0.0,1.646975,0.0,0.270115,0.298587,2.491196,1
H1SWeek3_AAACCTGAGCACACAG,0.969522,0.0,0.0,0.0,1.244261,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.505902,0.0,0.0,1.338477,1.418463,1
H1SWeek3_AAACCTGAGGATGGAA,0.0,0.0,0.0,0.0,0.0,0.0,1.754852,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.134613,0.0,0.0,3.015667,1
H1SWeek3_AAACCTGCAATTGCTG,0.0,0.0,1.545464,0.0,0.0,0.0,1.569294,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.286781,0.0,0.0,0.0,0.0,1
H1SWeek3_AAACCTGCAGCGTAAG,0.0,0.647715,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.205176,0.0,0.0,1.903437,0.0,0.0,0.0,2.736439,1


In [7]:
primary.head()

cell,DPM1,RAB18,LRP12,ZNF286B,CDK9,TMEM132D,MRPL41,CCDC7.1,SPN,TYK2,...,ZNF441,MSH2,PAFAH1B2,LSR,NDUFC1,DHCR7,ZNF532,GATAD1,LDHA,Type
AAACCTGAGCTGCCCA_50646,0.0,2.34528,0.0,0.0,0,0,0.0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
AAACCTGAGCTTATCG_50647,0.0,0.0,0.0,0.0,0,0,0.0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.559999,0.0,0
AAACCTGAGTATGACA_50652,0.0,1.713228,0.0,0.0,0,0,0.0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.98538,0
AAACCTGAGTCGCCGT_50654,0.0,0.0,0.0,0.0,0,0,1.90848,0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.033877,0.0,0.0,0.0,0.0,0
AAACCTGCACCAGCAC_50657,0.0,0.0,0.0,2.474566,0,0,0.0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [11]:
primary = primary.drop('Type', axis=1)
organoid = organoid.drop('Type', axis=1)

In [26]:
desc = primary.describe()
maxs = desc.loc['max', :]

In [39]:
# Number of genes whose maximum expression across all primary cells is 0
maxs.value_counts()[0]

7240

## Clustering

We begin with HDBSCAN, a density-based clustering method that avoids two assumptions we cannot justify here: that our clusters are Gaussian balls, and that we know the true number of clusters a priori. Both matter because we cannot reasonably inspect the data pairwise: with roughly 16k genes, that would take about $\binom{16\text{k}}{2} \approx 1.3 \times 10^8$ scatter plots.
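To put a number on why pairwise plotting is impractical: $p$ features give $p(p-1)/2$ distinct scatter plots. A quick check, assuming roughly 16,000 genes (the "16k" figure above is approximate; the exact count would be `primary.shape[1]`):

```python
import math

# Assumed feature count, taken from the approximate "16k" figure in the text;
# in the notebook the real value would be primary.shape[1].
n_genes = 16_000

# Each unordered pair of genes yields one scatter plot.
n_pair_plots = math.comb(n_genes, 2)
print(f"{n_pair_plots:,} pairwise plots")
```

Even at one plot per second, that is several years of viewing, which is why we rely on clustering and a low-dimensional projection instead.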

In [None]:
clusterer = hdbscan.HDBSCAN(min_cluster_size=3)
clusterer = clusterer.fit(primary)

In [None]:
labels = clusterer.labels_
set(labels)
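HDBSCAN assigns the label `-1` to points it treats as noise, so the raw set of labels mixes real clusters with that noise marker. Below is a minimal sketch of a summary that separates the two. The helper `summarize_labels` and the toy label list are hypothetical and stand in for `clusterer.labels_`:

```python
from collections import Counter

def summarize_labels(labels):
    """Return (number of clusters, number of noise points, cluster sizes)."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # HDBSCAN marks noise points with -1
    return len(counts), n_noise, dict(counts)

# Toy labels for illustration only; in the notebook, pass clusterer.labels_.
toy_labels = [0, 0, 0, 1, 1, 1, 1, -1, -1, 2, 2, 2]
n_clusters, n_noise, sizes = summarize_labels(toy_labels)
print(n_clusters, n_noise, sizes)
```

With `min_cluster_size=3`, many small clusters and a large noise count are both plausible, so checking these numbers before plotting may help decide whether the parameter needs tuning.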

## Cluster visualization

Now that we've clustered the data with HDBSCAN, let's visualize a 2D projection of it using UMAP.

In [None]:
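from umap import UMAP

N_NEIGB = 15
# verbose is a constructor argument, not a fit_transform argument
proj = UMAP(n_neighbors=N_NEIGB, verbose=True)
# Fit on the primary data and return the 2D embedding (n_components defaults to 2)
umap = proj.fit_transform(primary)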

Now that we've run UMAP, let's combine the projected data with the cluster labels and visualize it.

In [19]:
# Wrap the embedding in a DataFrame indexed by cell, then attach the HDBSCAN labels
umap_df = pd.DataFrame(umap, index=primary.index)
umap_df = umap_df.rename({0: 'UMAP_1', 1: 'UMAP_2'}, axis=1)
umap_df['Labels'] = labels

umap_df.head()

Using `seaborn`, we can generate a scatter plot colored by cluster.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(10, 10))

# Color each cell by its HDBSCAN label (-1 marks noise points)
sns.scatterplot(
    x="UMAP_1",
    y="UMAP_2",
    data=umap_df,
    hue='Labels',
    legend='full',
    ax=ax,
)

plt.title(f'UMAP Projection of Primary data, colored by clusters found by HDBSCAN, n_neighbors={N_NEIGB}')
plt.show()