In [1]:

import scCloud as scc
import scplot as sp
import holoviews as hv
hv.extension('bokeh')
import numpy as np
import pandas as pd

Note: if you install LightGBM from PyPI with the ``pip install lightgbm`` command, you no longer need to install the gcc compiler.
Instead, you need the OpenMP library, which LightGBM requires on systems that use the Apple Clang compiler.
You can install it with ``brew install libomp``.


This tutorial illustrates basic scCloud functionality using the 3k PBMCs from a Healthy Donor dataset from 10x Genomics.
The dataset is available [here](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k).

Read in Cell Ranger output

In [2]:
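# Uncomment the next two lines to download and unpack the dataset.
#!wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
#!tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz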
adata = scc.read_input('filtered_gene_bc_matrices/hg19/')
output_file = 'scc_tutorial_output'
adata

AnnData object with n_obs × n_vars = 2700 × 32738 
    obs: 'Channel'
    var: 'gene_ids'
    uns: 'genome'

Generate QC metrics

In [3]:
scc.qc_metrics(adata)

In [4]:
adata.var_keys()

['gene_ids', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features']

In [5]:
adata.obs_keys()

['Channel', 'passed_qc', 'n_genes', 'n_counts', 'percent_mito']
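To make the per-cell columns above concrete, here is a minimal sketch of how `n_genes`, `n_counts`, `percent_mito` and `passed_qc` could be derived from a raw count matrix. It is written in plain Python on a toy matrix and is not scCloud's implementation; the `MT-` prefix and the thresholds `min_genes` and `max_percent_mito` are assumptions chosen for illustration.

```python
# Toy count matrix: rows are cells, columns are genes (hypothetical values).
genes = ["MT-CO1", "IL7R", "MS4A1", "LYZ"]
counts = [
    [5, 3, 0, 2],
    [0, 0, 1, 0],
    [10, 0, 0, 10],
]

def qc_per_cell(row, gene_names, mito_prefix="MT-"):
    """Return (n_genes, n_counts, percent_mito) for one cell."""
    n_genes = sum(1 for c in row if c > 0)           # genes with at least one count
    n_counts = sum(row)                              # total counts in the cell
    mito = sum(c for c, g in zip(row, gene_names) if g.startswith(mito_prefix))
    percent_mito = 100.0 * mito / n_counts if n_counts else 0.0
    return n_genes, n_counts, percent_mito

qc = [qc_per_cell(row, genes) for row in counts]

# Hypothetical thresholds, only to show how a boolean QC flag can be built.
min_genes, max_percent_mito = 2, 60.0
passed_qc = [n >= min_genes and pm < max_percent_mito for n, _, pm in qc]
```

The resulting `passed_qc` list plays the same role as the boolean `adata.obs['passed_qc']` column used for filtering later in this tutorial.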

Plot QC stats

In [6]:
sp.violin(adata, ['n_genes', 'n_counts', 'percent_mito'], by='passed_qc')

In [7]:
sp.scatter(adata, 'n_genes', 'n_counts', color='passed_qc')

In [8]:
sp.violin(adata, ['n_cells'])

Filter cells and genes based on computed QC metrics

In [9]:
adata = adata[adata.obs['passed_qc']]
adata = adata[:, adata.var['robust']]
adata

View of AnnData object with n_obs × n_vars = 2481 × 14616 
    obs: 'Channel', 'passed_qc', 'n_genes', 'n_counts', 'percent_mito'
    var: 'gene_ids', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features'
    uns: 'genome'

Normalize counts and then transform to log space

In [10]:
scc.log_norm(adata, 1e5)

Trying to set attribute `.X` of view, making a copy.
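As a rough illustration of what this step does to a single cell, the sketch below scales the counts so that they sum to the target (here `1e5`, matching the call above) and then applies `log(1 + x)`. The use of the natural logarithm with a pseudocount of one is an assumption, and `log_norm_cell` is a hypothetical helper, not an scCloud function.

```python
import math

def log_norm_cell(row, norm_count=1e5):
    """Scale one cell's counts to sum to norm_count, then take log(1 + x)."""
    total = sum(row)
    if total == 0:
        return [0.0 for _ in row]        # an empty cell stays all zeros
    scale = norm_count / total
    return [math.log1p(c * scale) for c in row]

cell = [1, 0, 3]                         # toy raw counts for three genes
normalized = log_norm_cell(cell)
```

Undoing the log with `math.expm1` recovers values that sum to `1e5`, which is a quick way to check the scaling.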


Select highly variable genes

In [11]:
scc.highly_variable_features(adata, consider_batch=False)

Trying to set attribute `.uns` of view, making a copy.


Plot variable genes

In [12]:
sp.variable_feature_plot(adata)

Compute PCA in variable gene space

In [13]:
scc.pca(adata)

Generate nearest neighbor graph

In [14]:
scc.neighbors(adata)
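For intuition, a nearest neighbor graph links each cell to the cells closest to it in PCA space. The sketch below does an exact brute-force search with Euclidean distance on toy 2-D coordinates; real tools usually rely on faster approximate search, and the function `knn` and the parameter `k` are illustrative names rather than scCloud's API.

```python
import math

def knn(points, k):
    """For each point, return the indices of its k nearest other points."""
    result = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with that point's index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        result.append([j for _, j in dists[:k]])
    return result

# Toy "PCA coordinates" forming two well-separated pairs of cells.
pcs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
neighbors = knn(pcs, k=1)
```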

Run diffusion map

In [15]:
scc.diffmap(adata)

Cluster cells using the Leiden and Louvain methods

In [16]:
scc.leiden(adata)

In [17]:
scc.louvain(adata)

See the composition of each Leiden cluster

In [18]:
sp.composition_plot(adata, 'leiden_labels', 'louvain_labels', invert=True)
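The numbers behind a composition plot are a cross-tabulation of two label vectors. The sketch below computes, for each label in one clustering, the fraction of its cells that fall into each label of the other clustering. The label lists are toy data and `composition` is a hypothetical helper; how `sp.composition_plot` arranges or inverts the axes is not covered here.

```python
from collections import Counter, defaultdict

def composition(group_labels, member_labels):
    """For each group label, return the fraction of its cells in each member label."""
    counts = defaultdict(Counter)
    for g, m in zip(group_labels, member_labels):
        counts[g][m] += 1
    return {
        g: {m: n / sum(c.values()) for m, n in c.items()}
        for g, c in counts.items()
    }

# Toy cluster assignments for six cells.
leiden = ["1", "1", "2", "2", "2", "3"]
louvain = ["1", "2", "2", "2", "3", "3"]
comp = composition(leiden, louvain)
```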

Generate embeddings using FIt-SNE and UMAP

In [19]:
scc.fitsne(adata) 

In [20]:
scc.umap(adata) 

UMAP(a=None, angular_rp_forest=False, b=None, init='spectral',
     learning_rate=1.0, local_connectivity=1.0, metric='euclidean',
     metric_kwds=None, min_dist=0.5, n_components=2, n_epochs=None,
     n_neighbors=15, negative_sample_rate=5, random_state=0,
     repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
     target_metric='categorical', target_metric_kwds=None,
     target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
     transform_seed=42, verbose=True)
Construct fuzzy simplicial set
Tue Aug 20 14:19:04 2019 Finding Nearest Neighbors
Tue Aug 20 14:19:04 2019 Finished Nearest Neighbor Search
Tue Aug 20 14:19:06 2019 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Tue Aug 20 14:1

Plot the cluster assignments

In [21]:
sp.embedding(adata, 'fitsne', ['leiden_labels'])

In [22]:
sp.embedding(adata, 'umap', ['leiden_labels'])

Find differentially expressed genes

In [23]:
scc.de_analysis(adata, cluster='leiden_labels')
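Among the statistics reported by the DE analysis is an `auroc` value per gene and cluster. As a reminder of what AUROC measures, the sketch below computes it directly from its definition: the probability that a randomly chosen in-cluster cell has higher expression than a randomly chosen cell outside the cluster, with ties counted as one half. This is a definition-level illustration on toy values, not the routine scCloud uses.

```python
def auroc(in_cluster, out_cluster):
    """P(in-cluster value > out-of-cluster value), counting ties as 0.5."""
    wins = 0.0
    for a in in_cluster:
        for b in out_cluster:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(in_cluster) * len(out_cluster))

# Toy log-expression values for one gene.
inside = [2.0, 3.0, 1.0]
outside = [0.0, 1.0]
score = auroc(inside, outside)
```

A value near 1 means the gene is consistently higher inside the cluster, a value near 0.5 means it does not separate the two groups, and a value near 0 means it is consistently lower.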

In [24]:
de_results = adata.varm['de_res']
sorted(de_results.dtype.names)

['WAD_score:1',
 'WAD_score:2',
 'WAD_score:3',
 'WAD_score:4',
 'WAD_score:5',
 'WAD_score:6',
 'auroc:1',
 'auroc:2',
 'auroc:3',
 'auroc:4',
 'auroc:5',
 'auroc:6',
 'log_fold_change:1',
 'log_fold_change:2',
 'log_fold_change:3',
 'log_fold_change:4',
 'log_fold_change:5',
 'log_fold_change:6',
 'mean_logExpr:1',
 'mean_logExpr:2',
 'mean_logExpr:3',
 'mean_logExpr:4',
 'mean_logExpr:5',
 'mean_logExpr:6',
 'mean_logExpr_other:1',
 'mean_logExpr_other:2',
 'mean_logExpr_other:3',
 'mean_logExpr_other:4',
 'mean_logExpr_other:5',
 'mean_logExpr_other:6',
 'percentage:1',
 'percentage:2',
 'percentage:3',
 'percentage:4',
 'percentage:5',
 'percentage:6',
 'percentage_fold_change:1',
 'percentage_fold_change:2',
 'percentage_fold_change:3',
 'percentage_fold_change:4',
 'percentage_fold_change:5',
 'percentage_fold_change:6',
 'percentage_other:1',
 'percentage_other:2',
 'percentage_other:3',
 'percentage_other:4',
 'percentage_other:5',
 'percentage_other:6',
 't_pval:1',
 't_pval:
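The field names follow a `statistic:cluster` pattern, so it can be handy to group them by statistic. The sketch below does this for a short, hand-copied subset of the names listed above; `group_by_statistic` is a hypothetical helper.

```python
from collections import defaultdict

def group_by_statistic(names):
    """Map each statistic name to the list of cluster labels it is available for."""
    grouped = defaultdict(list)
    for name in names:
        stat, cluster = name.rsplit(":", 1)   # split on the last colon only
        grouped[stat].append(cluster)
    return dict(grouped)

# A subset of the field names shown in the output above.
field_names = ["WAD_score:1", "WAD_score:2", "auroc:1", "auroc:2", "log_fold_change:1"]
by_stat = group_by_statistic(field_names)
```

Because `de_results` exposes `dtype.names`, a single column can be selected by its field name, for example `de_results['auroc:1']`.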

In [25]:
markers = scc.find_markers(adata, label_attr='leiden_labels') # TODO, store result in adata

[1]	valid_0's multi_error: 0.0568996	valid_1's multi_error: 0.140562
Training until validation scores don't improve for 1 rounds.
[2]	valid_0's multi_error: 0.0501792	valid_1's multi_error: 0.108434
[3]	valid_0's multi_error: 0.0425627	valid_1's multi_error: 0.108434
Early stopping, best iteration is:
[2]	valid_0's multi_error: 0.0501792	valid_1's multi_error: 0.108434
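The log above shows early stopping with a patience of one round. The sketch below replays that rule on the `valid_1` errors copied from the log, assuming that `valid_1` is the set driving the stopping decision; `best_iteration` is a hypothetical helper, not a LightGBM function.

```python
def best_iteration(errors, patience=1):
    """Return (best_iter, best_err), stopping after `patience` rounds without improvement."""
    best_err = float("inf")
    best_iter = 0
    for i, err in enumerate(errors, start=1):
        if err < best_err:
            best_err, best_iter = err, i      # strict improvement resets the counter
        elif i - best_iter >= patience:
            break                             # no improvement for `patience` rounds
    return best_iter, best_err

# valid_1 multi_error for iterations 1 to 3, copied from the log above.
valid_1 = [0.140562, 0.108434, 0.108434]
result = best_iteration(valid_1, patience=1)
```

Iteration 3 does not improve on iteration 2, so training stops and iteration 2 is reported as the best, which agrees with the log.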


In [26]:
# Use a separate name so the result of scc.find_markers above is not overwritten.
cell_type_markers = {
    "title" : "Cell markers",
    "cell_types" : [
        {
            "name" : "CD4 T cells",
            "markers" : [
                {
                    "genes" : ["IL7R+"],
                    "weight" : 1.0
                }
            ]
        },
        {
            "name" : "B cells",
            "markers" : [
                {
                    "genes" : ["MS4A1+"],
                    "weight" : 1.0
                }
            ]
        }
    ]
}

scc.infer_cell_types(adata, cell_type_markers, de_test='t')

Cluster 1:
    name: CD4 T cells; score: 1.00; average marker percentage: 58.47%; strong support: (IL7R+,58.47%)
Cluster 2:
    name: CD4 T cells; score: 1.00; average marker percentage: 77.66%; strong support: (IL7R+,77.66%)
Cluster 3:
Cluster 4:
Cluster 5:
    name: B cells; score: 1.00; average marker percentage: 85.71%; strong support: (MS4A1+,85.71%)
Cluster 6:
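The marker dictionary is nested, so flattening it into rows makes it easier to check before running the inference. The sketch below walks the same structure and splits each gene entry into a name and a sign, assuming a trailing `+` marks a gene expected to be expressed and a trailing `-` one expected to be absent. `parse_marker` and `flatten_markers` are hypothetical helpers, not part of scCloud.

```python
def parse_marker(entry):
    """Split 'IL7R+' into ('IL7R', True); a trailing '-' gives False."""
    if entry.endswith("+"):
        return entry[:-1], True
    if entry.endswith("-"):
        return entry[:-1], False
    raise ValueError(f"marker {entry!r} must end with '+' or '-'")

def flatten_markers(spec):
    """Return one (cell_type, gene, is_positive, weight) row per marker gene."""
    rows = []
    for cell_type in spec["cell_types"]:
        for group in cell_type["markers"]:
            for entry in group["genes"]:
                gene, positive = parse_marker(entry)
                rows.append((cell_type["name"], gene, positive, group["weight"]))
    return rows

# Same structure as the dictionary passed to scc.infer_cell_types above.
spec = {
    "title": "Cell markers",
    "cell_types": [
        {"name": "CD4 T cells", "markers": [{"genes": ["IL7R+"], "weight": 1.0}]},
        {"name": "B cells", "markers": [{"genes": ["MS4A1+"], "weight": 1.0}]},
    ],
}
rows = flatten_markers(spec)
```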


Plot marker genes

In [27]:
sp.dotplot(adata, by='leiden_labels', 
           keys=['IL7R', 'CCR7', 'S100A4', 'CD14', 'LYZ', 'MS4A1', 'CD8A', 'FCGR3A', 'MS4A7', 'GNLY', 'NKG7', 'FCER1A', 'CST3', 'PPBP'])
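A dot plot typically summarizes each gene per cluster with two numbers: the fraction of cells expressing the gene and the mean expression. The sketch below computes both for one gene on toy data. Which of these `sp.dotplot` maps to dot size and which to color is not stated in this tutorial, so treat the sketch as a general illustration; `dot_stats` is a hypothetical helper.

```python
from collections import defaultdict

def dot_stats(expression, labels):
    """Per cluster: fraction of cells with expression > 0, and mean expression."""
    by_cluster = defaultdict(list)
    for value, label in zip(expression, labels):
        by_cluster[label].append(value)
    return {
        label: {
            "fraction_expressing": sum(v > 0 for v in values) / len(values),
            "mean": sum(values) / len(values),
        }
        for label, values in by_cluster.items()
    }

# Toy log-expression of one gene in four cells from two clusters.
expr = [0.0, 2.0, 4.0, 0.0]
clusters = ["1", "1", "2", "2"]
stats = dot_stats(expr, clusters)
```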

In [28]:
scc.write_output(adata, output_file)

... storing 'Channel' as categorical
