## Tutorial: Single-Cell Transcriptomics using UMAP


Single-cell gene expression can be analyzed faster and explored more easily with GPU-accelerated UMAP analysis and visualization. Used this way, UMAP lets **the user cluster cell types by patterns of gene expression**.

* Task: Cluster single cells by their gene expression
* Data: 5 independent datasets of roughly 30K rows by 200 columns of single-cell data
* [data](https://cytotrace.stanford.edu/#shiny-tab-dataset_download)
* [paper](https://arxiv.org/pdf/2208.05229.pdf)

**Insight/Result:**

 1.   Speed: Go from minutes to seconds for entire samples of ~10,000 cells (102s vs 18s on a small T4 GPU)
 2.   Visualization: Add interactivity, similarity edges, and GPU scale to otherwise hard-to-read static scatter plots

## Setup

### For the GPU-cloud-accelerated visualization step, get a free API key at https://hub.graphistry.com


In [None]:
from google.colab import userdata
g_user=userdata.get('g_user')
g_pass=userdata.get('g_pass')

In [None]:
import os, time
from collections import Counter
import cProfile
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pstats import Stats
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', 200)

In [None]:
!pip install --extra-index-url=https://pypi.nvidia.com cuml-cu12

!pip install graphistry[ai]

!pip install -q Biopython
!pip install -q scanpy

In [None]:
import scanpy as sc
import anndata

import graphistry
graphistry.register(api=3,protocol="https", server="hub.graphistry.com", username=g_user, password=g_pass) ## Graphistry Hub username and password

graphistry.__version__


'0.33.0+97.ga86be5c'

In [None]:
!nvidia-smi

Mon Jul  8 12:42:32 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
import cuml, cudf
cuml.__version__

'24.06.01'

## Data Download & Description

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install kaggle -q

For the raw data, create a free Kaggle account and API token (see https://www.kaggle.com/docs/api).

In [None]:
from google.colab import userdata
kaggle_user=userdata.get('kaggle_user')
kaggle_pass=userdata.get('kaggle_pass')

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 62 bytes
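
The upload step above expects a `kaggle.json` token file. As a minimal sketch of what that file contains and what the `chmod 600` step does, the snippet below writes one into a temporary directory rather than `~/.kaggle`. The credential values are placeholders, and `write_kaggle_token` is a hypothetical helper, not part of the Kaggle package.

```python
import json
import os
import stat
import tempfile


def write_kaggle_token(directory, username, key):
    """Write a kaggle.json-style token file readable only by its owner."""
    path = os.path.join(directory, "kaggle.json")
    with open(path, "w") as fh:
        # The token file is a small JSON object with these two keys.
        json.dump({"username": username, "key": key}, fh)
    # Equivalent of `chmod 600`: owner read/write only.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path


# Placeholder credentials, written to a throwaway directory for illustration.
demo_dir = tempfile.mkdtemp()
token_path = write_kaggle_token(demo_dir, "example_user", "example_key")

with open(token_path) as fh:
    token = json.load(fh)
print(sorted(token))
```

Note that the `key` field (and the `KAGGLE_KEY` environment variable) holds the API key from the token file, not the account password.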


In [None]:
#download 2 single cell datasets
import kaggle as kg
import pandas as pd
import os

os.environ['KAGGLE_USERNAME'] = kaggle_user
os.environ['KAGGLE_KEY'] = kaggle_pass

kg.api.authenticate()
kg.api.dataset_download_file(dataset = "alexandervc/scrnaseq-collection-of-datasets", file_name='Cytotrace/GSE67123_6.h5ad')
kg.api.dataset_download_file(dataset = "alexandervc/scrnaseq-collection-of-datasets", file_name='Cytotrace/GSE107910_40.h5ad')

!unzip -o GSE107910_40.h5ad.zip
!unzip -o GSE67123_6.h5ad.zip

!mkdir -p single_cell
!mv *.h5ad single_cell

Dataset URL: https://www.kaggle.com/datasets/alexandervc/scrnaseq-collection-of-datasets
Dataset URL: https://www.kaggle.com/datasets/alexandervc/scrnaseq-collection-of-datasets
Archive:  GSE107910_40.h5ad.zip
  inflating: GSE107910_40.h5ad       
Archive:  GSE67123_6.h5ad.zip
  inflating: GSE67123_6.h5ad         


In [None]:
list_files = []
for dirname, _, filenames in os.walk('single_cell'):
    for filename in filenames:
        list_files.append(os.path.join(dirname, filename))

for fn in list_files:
    adata = sc.read(fn)
    print( adata.uns['info'] )
    print()

['Thymus (Drop-seq)' 'Validation' '15429' '9307.0' 'nan' '9307' '19530'
 '8' '8' 'UMI' 'Mouse' '1' 'Thymus' 'Drop-seq' 'Timepoints' 'in vivo'
 '29884461' '20180619' 'GSE107910' 'Immunity'
 'Only hematopoietic cells, selected based on detectable Ptprc expression, were considered in this dataset. ']

['Embryonic HSCs (Tang et al.)' 'Validation' '143' 'nan' 'nan' '143'
 '24028' '5' '5' 'TPM/FPKM' 'Mouse' '1' 'Embryo' 'Tang et al.'
 'Timepoints' 'in vivo' '27225119' '20160526' 'GSE67123' 'Nature' 'nan']



## Compute UMAP on GPU for GSE107910_40 murine thymus cells

In [None]:
fn='single_cell/GSE107910_40.h5ad'
adata = sc.read(fn)
str_data_inf = fn.split('/')[1].split('.')[0] + ' ' + str(adata.X.shape)+'\n' + adata.uns['info'][0]

EE=pd.DataFrame(adata.X,columns=adata.uns['gcsGenesNames'],index=adata.uns['allcellnames'])
g1=graphistry.nodes(cudf.from_pandas(EE.T))
t0 = time.time()

g22 = g1.umap(
            use_scaler='robust', ## zscale, minmax, standard, normal,
            n_components=2,
            n_neighbors=12,
            engine='cuml' ## the CPU engine cannot even run in available RAM at this size; to try it, switch to engine='umap_learn'
    )

print('\n Total ', np.round(time.time() - t0,1), 'seconds passed')





 Total  38.1 seconds passed
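
The `use_scaler='robust'` option above rescales the features before UMAP. As a rough illustration, assuming it behaves like median and interquartile-range scaling (as in scikit-learn's `RobustScaler`), the NumPy sketch below shows how the bulk of the values stay in a small range while a single extreme value stands apart. `robust_scale` and the toy column are illustrative and are not Graphistry's implementation.

```python
import numpy as np


def robust_scale(X):
    """Center each column on its median and divide by its interquartile range."""
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    iqr = q75 - q25
    # Leave constant columns unscaled to avoid division by zero.
    iqr[iqr == 0] = 1.0
    return (X - median) / iqr


# Toy expression column with one extreme value (100).
toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
scaled = robust_scale(toy)
print(scaled.ravel())
```

Because the median and interquartile range ignore the extreme value, the first four entries are scaled exactly as they would be without it.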


In [None]:
emb2=g22._node_embedding

# Node names, in embedding order
A=emb2.reset_index()['index'].to_pandas()

# Rewrite the implicit edge endpoints from integer ids to node names
B=g22._edges
B['_src_implicit'] = B['_src_implicit'].replace(A, regex=True)
B['_dst_implicit'] = B['_dst_implicit'].replace(A, regex=True)

# Place nodes at their UMAP coordinates and attach the remapped similarity edges
g33=graphistry.nodes(emb2.reset_index(),'index').edges(B.dropna(),'_src_implicit','_dst_implicit').bind(point_x="x",point_y="y").settings(url_params={"play":0})

g33.plot()
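
The cell above replaces integer edge endpoints with node names before plotting. A minimal pandas sketch of that idea on toy data follows. It uses `Series.map` with a lookup table, which is a swapped-in technique rather than the `replace` call used in the notebook, and the node names and edges are made up.

```python
import pandas as pd

# Toy node table: row position doubles as the implicit integer node id.
names = pd.Series(["cell_a", "cell_b", "cell_c"])

# Toy edge list whose endpoints are integer positions into `names`.
edges = pd.DataFrame({
    "_src_implicit": [0, 0, 1],
    "_dst_implicit": [1, 2, 2],
    "_weight": [0.9, 0.4, 0.7],
})

# Map each integer id to its name; ids missing from `names` become NaN
# and are dropped, mirroring the dropna() step in the notebook.
named = edges.assign(
    _src_implicit=edges["_src_implicit"].map(names),
    _dst_implicit=edges["_dst_implicit"].map(names),
).dropna()

print(named)
```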

## This paper was specifically interested in peak mitosis genes, i.e., ["Tirosh" genes](https://genome.cshlp.org/content/25/12/1860.short), so let's zoom in on those

In [None]:
fn='single_cell/GSE107910_40.h5ad'
import scanpy as sc
import anndata
adata = sc.read(fn)
str_data_inf = fn.split('/')[1].split('.')[0] + ' ' + str(adata.X.shape)+'\n' + adata.uns['info'][0]


In [None]:
import numpy as np
import pandas as pd

In [None]:
S_phase_genes_Tirosh = ['MCM5', 'PCNA', 'TYMS', 'FEN1', 'MCM2', 'MCM4', 'RRM1', 'UNG', 'GINS2', 'MCM6', 'CDCA7', 'DTL', 'PRIM1', 'UHRF1', 'MLF1IP', 'HELLS', 'RFC2', 'RPA2', 'NASP', 'RAD51AP1', 'GMNN', 'WDR76', 'SLBP', 'CCNE2', 'UBR7', 'POLD3', 'MSH2', 'ATAD2', 'RAD51', 'RRM2', 'CDC45', 'CDC6', 'EXO1', 'TIPIN', 'DSCC1', 'BLM', 'CASP8AP2', 'USP1', 'CLSPN', 'POLA1', 'CHAF1B', 'BRIP1', 'E2F8']
G2_M_genes_Tirosh = ['HMGB2', 'CDK1', 'NUSAP1', 'UBE2C', 'BIRC5', 'TPX2', 'TOP2A', 'NDC80', 'CKS2', 'NUF2', 'CKS1B', 'MKI67', 'TMPO', 'CENPF', 'TACC3', 'FAM64A', 'SMC4', 'CCNB2', 'CKAP2L', 'CKAP2', 'AURKB', 'BUB1', 'KIF11', 'ANP32E', 'TUBB4B', 'GTSE1', 'KIF20B', 'HJURP', 'CDCA3', 'HN1', 'CDC20', 'TTK', 'CDC25C', 'KIF2C', 'RANGAP1', 'NCAPD2', 'DLGAP5', 'CDCA2', 'CDCA8', 'ECT2', 'KIF23', 'HMMR', 'AURKA', 'PSRC1', 'ANLN', 'LBR', 'CKAP5', 'CENPE', 'CTCF', 'NEK2', 'G2E3', 'GAS2L3', 'CBX5', 'CENPA']
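# Upper-case the dataset's gene names so they can be compared with the upper-case Tirosh lists
u = 'allgenenames'
list_genes_upper = [t.upper() for t in adata.uns[u] ]
# Integer positions of the genes that appear in either cell-cycle list
I = np.where( pd.Series(list_genes_upper).isin( S_phase_genes_Tirosh + G2_M_genes_Tirosh ) )[0]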
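
The selection above upper-cases the dataset's gene names before matching, presumably because the Tirosh lists use upper-case symbols while a mouse dataset may capitalize them differently. A small sketch with made-up dataset gene names and short excerpts of the two lists shows the same pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical gene names as they might appear in a mouse dataset.
dataset_genes = ["Mcm5", "Actb", "Pcna", "Gapdh", "Top2a"]

# Short excerpts of the S-phase and G2/M lists, for illustration only.
s_phase = ["MCM5", "PCNA"]
g2_m = ["TOP2A", "CDK1"]

upper = pd.Series([g.upper() for g in dataset_genes])
# Integer positions of the genes found in either list.
idx = np.where(upper.isin(s_phase + g2_m))[0]

print(idx, [dataset_genes[i] for i in idx])
```

The resulting positions can then be used to slice the expression matrix by column, as the following cells do with `adata.X[:,I]`.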


### CPU UMAP

In [None]:
EE=pd.DataFrame(adata.X[:,I],columns=adata.uns['gcsGenesNames'][I],index=adata.uns['allcellnames'])
g1=graphistry.nodes(cudf.from_pandas(EE.T))
t0 = time.time()

g11 = g1.umap(
            use_scaler='robust', ## zscale, minmax, standard, normal,
            n_components=2,
            n_neighbors=12,
            engine='umap_learn'
    )


print('\n Total ', np.round(time.time() - t0,1), 'seconds passed')





 Total  26.6 seconds passed
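
The timing pattern used in these cells (`t0 = time.time()` followed by a print) can be wrapped in a small context manager so that the CPU and GPU runs are measured the same way. This is only a sketch: `timed` is a hypothetical helper, and the `sum` call stands in for the `umap` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock seconds spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"{label}: {results[label]:.1f} seconds passed")


timings = {}
with timed("demo", timings):
    # Stand-in workload; in the notebook this would be g1.umap(...).
    total = sum(range(100_000))
```

Collecting the results in a dictionary keeps the CPU and GPU timings available for later comparison instead of only printing them.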


### GPU UMAP

In [None]:
EE=pd.DataFrame(adata.X[:,I],columns=adata.uns['gcsGenesNames'][I],index=adata.uns['allcellnames'])
g1=graphistry.nodes(cudf.from_pandas(EE.T))

t0 = time.time()

g11 = g1.umap(
            use_scaler='robust', ## zscale, minmax, standard, normal,
            n_components=2,
            n_neighbors=12,
            engine='cuml'
    )


print('\n Total ', np.round(time.time() - t0,1), 'seconds passed')





 Total  15.5 seconds passed
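
To put the timings in perspective, the speedup is the CPU time divided by the GPU time. The short calculation below uses the figures quoted in this tutorial: 102 s vs 18 s from the introduction, and 26.6 s vs 15.5 s from the CPU and GPU runs on the cell-cycle gene subset above.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Return how many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


intro = speedup(102, 18)       # figures quoted in the introduction
subset = speedup(26.6, 15.5)   # CPU vs GPU runs on the gene subset

print(f"intro: {intro:.1f}x, subset: {subset:.1f}x")
# prints "intro: 5.7x, subset: 1.7x"
```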


### Visualize

* Nodes are cells
* Edges are similarity relationships
* Initial layout is from the UMAP dimensionality reduction to 2D
* Interactive layout is an aesthetically-optimized force-directed graph over the similarity graph, which is more interpretable for dense clusters
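
UMAP derives its similarity relationships from a nearest-neighbor graph (the `n_neighbors=12` setting in the cells above). As a rough illustration of that idea, the NumPy sketch below builds brute-force k-nearest-neighbor edges over toy 2D points. It is not how UMAP or Graphistry compute their graph, and `knn_edges` is a hypothetical helper.

```python
import numpy as np


def knn_edges(points, k):
    """Return (src, dst) index pairs linking each point to its k nearest neighbors."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exclude self-matches by making the diagonal infinitely far.
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    src = np.repeat(np.arange(len(points)), k)
    return np.column_stack([src, neighbors.ravel()])


# Two tight toy clusters of three points each.
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
edges = knn_edges(toy_points, k=2)
print(edges)
```

In this toy case every edge stays inside its own cluster, which is the structure the interactive force-directed layout makes visible.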


In [None]:
g11.plot()