In [1]:

import scCloud as sc
import scplot as sp
import holoviews as hv
import numpy as np
import pandas as pd

hv.extension('bokeh')

When LightGBM is installed from PyPI with `pip install lightgbm` on macOS, you no longer need to install the gcc compiler.
Instead, install the OpenMP library, which LightGBM requires on systems that use the Apple Clang compiler: `brew install libomp`.


This tutorial illustrates basic scCloud functionality using 3k PBMCs from a healthy donor, a dataset provided by 10x Genomics.
The dataset is available [here](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k).

Read in Cell Ranger output

In [2]:
# Data available from http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz 
adata = sc.io.read_input('filtered_gene_bc_matrices/hg19/')

Read input is finished. Time spent = 2.64s.


Generate QC metrics

In [3]:
sc.tools.qc_metrics(adata)

In [4]:
adata.var_keys()

['gene_ids', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features']
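
As a rough illustration of what QC metrics of this kind measure, the sketch below computes the number of detected genes, total counts, and mitochondrial percentage per cell, plus the number of cells per gene, from a tiny made-up count matrix using NumPy. The matrix, the gene list, and the `MT-` prefix convention are assumptions for illustration only; this is not scCloud's implementation.

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (made-up values).
counts = np.array([
    [5, 0, 3, 2],
    [0, 0, 1, 9],
    [4, 4, 0, 0],
])
genes = np.array(["CD3D", "LYZ", "MS4A1", "MT-CO1"])

n_genes = (counts > 0).sum(axis=1)          # genes detected per cell
n_counts = counts.sum(axis=1)               # total counts per cell
is_mito = np.char.startswith(genes, "MT-")  # assumption: mitochondrial genes start with "MT-"
percent_mito = 100.0 * counts[:, is_mito].sum(axis=1) / n_counts

# Per-gene metric: number of cells in which each gene is detected.
n_cells = (counts > 0).sum(axis=0)
```

The per-cell quantities correspond in spirit to the `n_genes`, `n_counts`, and `percent_mito` columns plotted in the following cells, and the per-gene quantity to `n_cells`.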

Plot QC stats

In [5]:
sp.violin(adata, ['n_genes', 'n_counts', 'percent_mito'], by='passed_qc', width=300)

In [6]:
sp.scatter_matrix(adata, ['n_genes', 'n_counts', 'percent_mito'], color='passed_qc')

In [7]:
sp.violin(adata, ['n_cells'])

Filter cells and genes based on computed QC metrics

In [8]:
adata = adata[adata.obs['passed_qc']]
adata = adata[:, adata.var['robust']]
adata

View of AnnData object with n_obs × n_vars = 2481 × 14616 
    obs: 'Channel', 'passed_qc', 'n_genes', 'n_counts', 'percent_mito'
    var: 'gene_ids', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features'
    uns: 'genome'
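
The two indexing steps above are boolean-mask selections along each axis. A minimal NumPy sketch of the same idea, with made-up flags standing in for `passed_qc` and `robust`:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)                # 3 cells x 4 genes (toy data)
passed_qc = np.array([True, False, True])      # per-cell flag (assumed values)
robust = np.array([True, True, False, True])   # per-gene flag (assumed values)

X_cells = X[passed_qc]            # keep rows (cells) that passed QC
X_filtered = X_cells[:, robust]   # then keep columns (genes) flagged as robust
```

As with the AnnData object, the result is a smaller matrix: here 2 cells by 3 genes.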

Normalize counts and then transform to log space

In [9]:
sc.tools.log_norm(adata, 1e4)

Trying to set attribute `.X` of view, making a copy.


Normalization is finished. Time spent = 24.58s.
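
A minimal sketch of this kind of normalization, assuming that the `1e4` argument is a per-cell target count and that the log transform is `log(1 + x)`; both are assumptions about the behavior, not a copy of scCloud's code. The helper name `log_normalize` is hypothetical.

```python
import numpy as np

def log_normalize(counts, target=1e4):
    """Scale each cell (row) to `target` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target)

# Toy data: the first two cells have the same composition at different depths.
toy = np.array([[1, 3], [10, 30], [0, 0]])
normed = log_normalize(toy)
```

After normalization the first two cells are identical, which is the point of scaling by total counts: differences in sequencing depth are removed before the log transform.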


Select highly variable genes

In [10]:
sc.tools.highly_variable_features(adata, consider_batch=False)

Trying to set attribute `.uns` of view, making a copy.


2000 highly variable features have been selected. Time spent = 0.06s.


In [11]:
(sp.scatter(adata, x='mean', y='var', color='highly_variable_features',
            xlabel='Mean log expression', ylabel='Variance of log expression')
 * sp.line(adata, x='mean', y='hvf_loess'))
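
The plot above relates each gene's variance to its mean and overlays a fitted trend (`hvf_loess`). The sketch below shows the general idea of selecting genes whose variance most exceeds a mean-variance trend. It is a simplification under stated assumptions: a degree-2 polynomial stands in for a loess fit, the data are simulated, and the selection rule is not necessarily the one scCloud uses.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(200, 50)).astype(float)  # simulated stand-in for log expression
X[:, :5] += rng.normal(0.0, 3.0, size=(200, 5))     # add extra variance to the first 5 genes

mean = X.mean(axis=0)
var = X.var(axis=0, ddof=1)

# Assumption: a degree-2 polynomial is used here in place of a loess curve.
trend = np.polyval(np.polyfit(mean, var, deg=2), mean)
excess = var - trend  # variance above what the trend predicts

n_top = 5
top = np.argsort(excess)[::-1][:n_top]
highly_variable = np.zeros(X.shape[1], dtype=bool)
highly_variable[top] = True
```

The boolean vector plays the role of the `highly_variable_features` column: one flag per gene, with `n_top` genes selected.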

Compute PCA in variable gene space

In [12]:
sc.tools.pca(adata)

PCA is done. Time spent = 0.20s.
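
For readers unfamiliar with the computation, PCA can be sketched with a singular value decomposition of the centered matrix. The helper `pca` below is hypothetical and runs on random data; it does not reproduce scCloud's defaults (number of components, scaling, or the restriction to highly variable genes).

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                            # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]    # coordinates of each row
    explained_var = S[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, Vt[:n_components], explained_var

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
scores, components, explained_var = pca(X, n_components=3)
```

The rows of `components` are orthonormal directions in gene space, and `scores` holds the low-dimensional coordinates that downstream steps such as neighbor search would operate on.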


Generate nearest neighbor graph

In [13]:
sc.tools.neighbors(adata)

Nearest neighbor search is finished in 0.28s.
Constructing affinity matrix is done. Time spent = 0.08s.
Affinity matrix calculation is finished in 0.08s
Calculating connected components is done.
Calculating normalized affinity matrix is done.
diffmap finished. Time spent = 0.22s.
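
The first step logged above is a nearest neighbor search. A brute-force version on a handful of points is sketched below to show what is being computed; the hypothetical helper `knn` uses exact Euclidean distances, whereas a library working at scale may well use an approximate index, and the affinity matrix construction is not covered here.

```python
import numpy as np

def knn(X, k):
    """Return indices and distances of the k nearest neighbors of each row (brute force)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                      # exclude each point from its own list
    idx = np.argsort(d2, axis=1)[:, :k]
    dist = np.sqrt(np.maximum(np.take_along_axis(d2, idx, axis=1), 0.0))
    return idx, dist

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [10.0, 10.0]])
idx, dist = knn(points, k=2)
```

In the tutorial the input would be the PCA coordinates of each cell rather than raw 2D points.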


Run diffusion map

In [None]:
sc.tools.diffmap(adata)

Cluster cells

In [14]:
sc.tools.leiden(adata)

Graph is constructed. Time spent = 0.12s.
Leiden clustering is done. Time spent = 0.75s.


In [15]:
sc.tools.louvain(adata)

Graph is constructed. Time spent = 0.08s.
Louvain clustering is done. Time spent = 0.30s.


In [16]:
sp.count_plot(adata, 'leiden_labels', 'louvain_labels')
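
Comparing two clusterings comes down to counting how many cells share each pair of labels. The sketch below builds such a contingency table with the standard library on made-up labels; it illustrates the counts behind a plot like the one above rather than how `sp.count_plot` is implemented.

```python
from collections import Counter

# Made-up label assignments for six cells.
leiden = ["1", "1", "2", "2", "3", "3"]
louvain = ["1", "1", "2", "3", "3", "3"]

# Count how many cells fall into each (leiden, louvain) label pair.
table = Counter(zip(leiden, louvain))

for (a, b), n in sorted(table.items()):
    print(f"leiden={a} louvain={b}: {n}")
```

Pairs with large counts indicate clusters on which the two methods agree; a Leiden cluster split across several Louvain labels shows where they differ.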

Generate embeddings

In [17]:
sc.tools.fitsne(adata) 

FIt-SNE is calculated. Time spent = 43.61s.


In [18]:
sc.tools.umap(adata) 

UMAP(a=None, angular_rp_forest=False, b=None, init='spectral',
     learning_rate=1.0, local_connectivity=1.0, metric='euclidean',
     metric_kwds=None, min_dist=0.5, n_components=2, n_epochs=None,
     n_neighbors=15, negative_sample_rate=5, random_state=0,
     repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
     target_metric='categorical', target_metric_kwds=None,
     target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
     transform_seed=42, verbose=True)
Construct fuzzy simplicial set
Mon Aug 12 10:06:25 2019 Finding Nearest Neighbors
Mon Aug 12 10:06:25 2019 Finished Nearest Neighbor Search
Mon Aug 12 10:06:27 2019 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Mon Aug 12 10:0

Plot the cluster assignments

In [19]:
sp.embedding(adata, 'fitsne', ['leiden_labels'])

In [20]:
sp.embedding(adata, 'umap', ['leiden_labels'])

In [21]:
sc.tools.de_analysis(adata, labels='leiden_labels')

Begin t_test.
Contingency table is collected.
Cluster 1 is processed.
Cluster 2 is processed.
Cluster 3 is processed.
Cluster 4 is processed.
Cluster 5 is processed.
Cluster 6 is processed.
Welch's t-test is done. Time spent = 0.19s.
Begin Fisher's exact test.
Cluster 1 is processed.
Cluster 2 is processed.
Cluster 3 is processed.
Cluster 4 is processed.
Cluster 5 is processed.
Cluster 6 is processed.
Fisher's exact test is done. Time spent = 1.09s.
Begin Mann-Whitney U test.
Cluster 1 is processed.
Cluster 2 is processed.
Cluster 3 is processed.
Cluster 4 is processed.
Cluster 5 is processed.
Cluster 6 is processed.
Mann-Whitney U test is done. Time spent = 44.22s.
Begin calculating ROC statistics.
Cluster 1 is processed.
Cluster 2 is processed.
Cluster 3 is processed.
Cluster 4 is processed.
Cluster 5 is processed.
Cluster 6 is processed.
ROC statistics are calculated. Time spent = 61.48s.
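
The first test in the log is Welch's t-test, which compares a gene's expression inside a cluster against all other cells without assuming equal variances. A minimal standard-library sketch of the statistic follows; the helper `welch_t` and the expression values are made up, and the p-value step is omitted.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

in_cluster = [2.0, 4.0, 6.0]       # expression of one gene inside the cluster (made up)
rest = [1.0, 1.0, 2.0, 2.0]        # expression of the same gene in all other cells (made up)
t, df = welch_t(in_cluster, rest)
```

A positive `t` means the gene is more highly expressed in the cluster than elsewhere; in a full analysis this is repeated for every gene and every cluster.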


In [22]:
markers = sc.tools.find_markers(adata, label_attr='leiden_labels') # FIXME

[1]	valid_0's multi_error: 0.0568996	valid_1's multi_error: 0.0923695
Training until validation scores don't improve for 1 rounds.
[2]	valid_0's multi_error: 0.0515233	valid_1's multi_error: 0.0763052
[3]	valid_0's multi_error: 0.047043	valid_1's multi_error: 0.0803213
Early stopping, best iteration is:
[2]	valid_0's multi_error: 0.0515233	valid_1's multi_error: 0.0763052
LightGBM used 6.78s to train.
find_markers took 7.86s to finish.


In [23]:
for key in markers:
    markers_for_cluster= markers[key]
    print(key)
    print(markers_for_cluster['strong'][0:5])

5
['S100A4', 'LST1', 'FCGR3A', 'S100A11', 'SERPINA1']
2
['NKG7', 'IL32', 'CCL5', 'MALAT1', 'CD8B']
3
['LYZ', 'S100A4', 'S100A11', 'SH3BGRL3', 'S100A10']
4
['CD79A', 'HLA-DPB1', 'CD74', 'LINC00926', 'RPL37']
1
['IL32', 'IL7R', 'ITGB7', 'MAL', 'CD3D']
0
['MALAT1', 'CD8B', 'RPL37', 'CCR7', 'SCGB3A1']


In [24]:
markers = {
	"title" : "Cell markers",
	"cell_types" : [
		{
			"name" : "CD4 T cells",
			"markers" : [
				{
					"genes" : ["IL7R+"],
					"weight" : 1.0
				}
			]
		},

		{
			"name" : "CD14+ Monocytes",
			"markers" : [
				{
					"genes" : ["CD14+", "LYZ+"],
					"weight" : 1.0
				}
			]
		},
		{
			"name" : "B cells",
			"markers" : [
				{
					"genes" : ["MS4A1+"],
					"weight" : 1.0
				}
			]
		},
		{
			"name" : "CD8 T cells",
			"markers" : [
				{
					"genes" : ["CD8A+"],
					"weight" : 1.0
				}
			]
		},
		{
			"name" : "NK cells",
			"markers" : [
				{
					"genes" : ["GNLY+", "NKG7+"],
					"weight" : 1.0
				}
			]
		},
		{
			"name" : "FCGR3A+ Monocytes",
			"markers" : [
				{
					"genes" : ["FCGR3A+", "MS4A7+"],
					"weight" : 1.0
				}
			]
		},
		{
			"name" : "Dendritic Cells",
			"markers" : [
				{
					"genes" : ["FCER1A+", "CST3+"],
					"weight" : 1.0
				}
			]
		},
		{
			"name" : "Megakaryocytes",
			"markers" : [
				{
					"genes" : ["PPBP+"],
					"weight" : 1.0
				}
			]
		}
	]
}
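
To show how a dictionary of this shape can be traversed, the sketch below collects the gene symbols for each cell type and drops the trailing sign. The helper `marker_genes` is hypothetical, and reading the trailing `+` as "positive marker" is an assumption about the format. Two entries are copied from the dictionary above to keep the example self-contained.

```python
def marker_genes(marker_spec):
    """Map each cell type name to its gene symbols, dropping a trailing '+' or '-' sign."""
    result = {}
    for cell_type in marker_spec["cell_types"]:
        genes = []
        for group in cell_type["markers"]:
            # Assumption: the trailing sign encodes expected expression direction.
            genes.extend(g.rstrip("+-") for g in group["genes"])
        result[cell_type["name"]] = genes
    return result

spec = {
    "title": "Cell markers",
    "cell_types": [
        {"name": "CD14+ Monocytes",
         "markers": [{"genes": ["CD14+", "LYZ+"], "weight": 1.0}]},
        {"name": "B cells",
         "markers": [{"genes": ["MS4A1+"], "weight": 1.0}]},
    ],
}

genes_by_type = marker_genes(spec)
```

The stripped symbols are plain gene names of the kind passed as `keys` to a plotting call.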



In [25]:
#import scCloud.annotate_cluster
#import sys
#scCloud.annotate_cluster.annotate_clusters(adata, markers, 0.5, sys.stdout, False)

Plot marker genes

In [29]:
sp.dotplot(adata, by='leiden_labels', 
           keys=['IL7R', 'CCR7', 'S100A4', 'CD14', 'LYZ', 'MS4A1', 'CD8A', 'FCGR3A', 'MS4A7', 'GNLY', 'NKG7', 'FCER1A', 'CST3', 'PPBP'])

In [31]:
sc.io.write_output(adata, 'pbmc_scCloud.h5ad')

... storing 'Channel' as categorical


Write output is finished. Time spent = 1.20s.
