# DCA Liver Runtime

In this notebook, we analyze how well the DCA method scales to larger datasets. We fit DCA on random subsets of the Liver dataset (which contains over 100,000 cells), ranging from 10% to 100% of the cells, and record the runtime and peak memory usage of each fit.
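
As a rough illustration of the subsampling scheme, the sketch below draws reproducible subsets without replacement for each fraction. It uses only the standard library (`random.Random.sample`) as a stand-in for the `np.random.choice` call used later in the notebook. The helper name `subset_indices` and the cell count of 1,000 are hypothetical placeholders, not values from the Liver data.

```python
import random

def subset_indices(n_cells, frac, seed=11111):
    """Return a reproducible random subset of row indices, drawn without replacement."""
    rng = random.Random(seed)      # a fresh generator per call keeps each draw reproducible
    size = round(frac * n_cells)   # number of cells to keep for this fraction
    return rng.sample(range(n_cells), size)

n_cells = 1000  # hypothetical cell count, for illustration only
for frac in [0.1, 0.2, 0.4, 0.6, 0.8, 1]:
    idx = subset_indices(n_cells, frac)
    print(f"{int(100 * frac)}% -> {len(idx)} cells")
```

Reseeding before every draw means the same fraction always yields the same cells, which makes a run for a single fraction repeatable on its own.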

In [1]:
"""Broadly useful python packages"""
import pandas as pd
import os
import numpy as np
import pickle
from copy import deepcopy
from shutil import move, rmtree
import warnings
from memory_profiler import memory_usage
from time import time

"""Machine learning and single cell packages"""
import sklearn.metrics as metrics
from sklearn.metrics import adjusted_rand_score as ari, normalized_mutual_info_score as nmi
import scanpy as sc
from anndata import AnnData
import seaborn as sns



In [2]:
"""Miscellaneous useful functions"""

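def read_liver_data(path, cache=True):
    """Load the liver count matrix and apply basic cell and gene filters.

    Note: `cache` is accepted but currently unused.
    """
    # Transpose on load so that rows are cells and columns are genes.
    adata = sc.read_mtx(os.path.join(path, 'matrix.mtx')).T
    genes_file = pd.read_csv(os.path.join(path, 'genes.tsv'), sep='\t')
    barcodes_file = pd.read_csv(os.path.join(path, 'barcodes.tsv'), sep='\t')

    adata.var.index = genes_file["genename"]
    adata.obs.index = barcodes_file["cellname"]
    # Assigning .obs replaces the index set above with the default integer index of barcodes_file.
    adata.obs = barcodes_file

    # Drop cells expressing fewer than 200 genes.
    sc.pp.filter_cells(adata, min_genes = 200)
    # Fraction of each cell's counts that come from mitochondrial ('mt-' prefixed) genes.
    mito_genes = adata.var_names.str.startswith('mt-')
    adata.obs['percent_mito'] = np.sum(
        adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
    adata.obs['n_counts'] = adata.X.sum(axis=1).A1
    # Keep cells with less than 20% mitochondrial counts.
    adata = adata[adata.obs['percent_mito'] < 0.2, :]
    # Keep genes detected in at least 30 cells.
    sc.pp.filter_genes(adata, min_cells = 30)

    return adata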

def build_dir(dir_path):
    """Create dir_path and any missing parent directories."""
    # os.makedirs also handles absolute paths, on which the manual split loop never terminated.
    os.makedirs(dir_path, exist_ok=True)
            
def run_dca(adata):
    sc.external.pp.dca(adata, mode = 'denoise', ae_type='nb-conddisp', verbose = True)
        
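def profile(frac):
    """Fit DCA on a random fraction of the cells; return (seconds, MiB, method, percent).

    Relies on the global `adata` loaded below.
    """
    np.random.seed(11111)
    indices = np.random.choice(range(adata.shape[0]), size = round(frac * adata.shape[0]), replace = False)
    tmp = adata.copy()[indices]
    
    # Rebuild a dense AnnData from the counts only, then drop genes with no counts in this subset.
    tmp = AnnData(tmp.X.toarray())
    sc.pp.filter_genes(tmp, min_cells = 1)
    start = time()
    # memory_usage runs run_dca(adata=tmp) and returns sampled memory readings in MiB.
    run = memory_usage((run_dca, (), {'adata': tmp}))
    final = time() - start
    # Memory increase over the run: highest reading minus lowest reading.
    peak_memory = max(run) - min(run)
    stats = final, peak_memory, "DCA", int(100*frac)
    
    return stats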
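
The measurement pattern used in `profile` (time one call and record how much memory it needed) can be sketched with the standard library alone. This is a minimal stand-in, not the notebook's method: it swaps `memory_profiler.memory_usage` for `tracemalloc`, which only tracks allocations made through Python's allocator rather than sampling the whole process, so its numbers are not comparable to the ones reported here. The names `profile_call` and `allocate` are hypothetical.

```python
import time
import tracemalloc

def profile_call(func, *args, **kwargs):
    """Run func once; return (elapsed_seconds, peak_mib, result)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Read the timer and the peak before stopping the tracer, even if func raised.
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return elapsed, peak_bytes / 2**20, result

def allocate(n):
    """Toy workload: build a list of n integers and return its length."""
    data = list(range(n))
    return len(data)

elapsed, peak_mib, result = profile_call(allocate, 200_000)
print(f"time: {elapsed:.4f} s, peak: {peak_mib:.2f} MiB, result: {result}")
```

As in `profile`, the timed interval includes the overhead of the memory tracking itself, so the reported time is an upper bound on the workload's own runtime.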

## Figure Data

In [3]:
build_dir("../Figures/liver")
profile_stats = {"Time (Seconds)": [] , "Memory (MiB)": [], "Method": [], 'Percent': []}
profile_stats = pd.DataFrame(profile_stats)

## Read in Data

In [4]:
adata = read_liver_data("../Data/liver", cache = True)

Transforming to str index.
Trying to set attribute `.var` of view, copying.


## Profile Memory and Speed

In [5]:
fracs = [0.1, 0.2, 0.4, 0.6, 0.8, 1]

for frac in fracs:
    np.random.seed(11111)
    indices = np.random.choice(range(adata.shape[0]), size = round(frac * adata.shape[0]), replace = False)
    pd.DataFrame(indices).to_csv("indices" + str(frac) + ".csv")

for n, frac in enumerate(fracs):
    profile_stats.loc[n] = profile(frac)


In a future version of Scanpy, `scanpy.api` will be removed.
Simply use `import scanpy as sc` and `import scanpy.external as sce` instead.

Using TensorFlow backend.

dca: Successfully preprocessed 21496 genes and 10469 cells.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
count (InputLayer)              (None, 21496)        0                                            
__________________________________________________________________________________________________
enc0 (Dense)                    (None, 64)           1375808     count[0][0]                      
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 64)           192         enc0[0][0]                       
__________________________________________________________________________________________________
enc0_act (Activation)           (None, 64)           0           batch_normalization_1[0][0]      
____________________________________________________________________________________________




Train on 9422 samples, validate on 1047 samples
Epoch 1/300
Epoch 2/300
Epoch 3/300
Epoch 4/300
Epoch 5/300
Epoch 6/300
Epoch 7/300
Epoch 8/300
Epoch 9/300
Epoch 10/300
Epoch 11/300
Epoch 12/300
Epoch 13/300
Epoch 14/300
Epoch 15/300
Epoch 16/300
Epoch 17/300
Epoch 18/300
Epoch 19/300
Epoch 20/300
Epoch 21/300
Epoch 22/300
Epoch 23/300
Epoch 24/300
Epoch 25/300
Epoch 26/300
Epoch 27/300
Epoch 28/300
Epoch 29/300
Epoch 30/300
Epoch 31/300

Epoch 00031: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 32/300
Epoch 33/300
Epoch 34/300
Epoch 35/300
Epoch 36/300
Epoch 37/300
Epoch 38/300
Epoch 39/300
Epoch 40/300
Epoch 41/300
Epoch 42/300
Epoch 43/300
Epoch 44/300
Epoch 45/300
Epoch 46/300
Epoch 47/300
Epoch 48/300
Epoch 49/300
Epoch 50/300
Epoch 51/300

Epoch 00051: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 52/300
Epoch 53/300
Epoch 54/300
Epoch 55/300
Epoch 56/300
Epoch 57/300
Epoch 58/300
Epoch 59/300
Epoch 60/300
Epoch 61/300

...

In [6]:
profile_stats.to_csv("../Figures/liver/DCA_profile.csv")
profile_stats

|   | Time (Seconds) | Memory (MiB) | Method | Percent |
|---|---|---|---|---|
| 0 | 1506.398011 | 10378.695312 | DCA | 10.0 |
| 1 | 1926.55742 | 13263.257812 | DCA | 20.0 |
| 2 | 3517.835267 | 20978.339844 | DCA | 40.0 |
| 3 | 4168.908276 | 23393.4375 | DCA | 60.0 |
| 4 | 6336.861966 | 25850.035156 | DCA | 80.0 |
| 5 | 5383.067867 | 24373.273438 | DCA | 100.0 |
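
To read the scaling behavior off these numbers, one option is to express each measurement as a multiple of the 10% run. The sketch below does this with the values copied from the table above; the helper name `relative_growth` is hypothetical, and the calculation is a simple ratio, not a fitted scaling model.

```python
# Values copied from the profile_stats table above.
percent = [10, 20, 40, 60, 80, 100]
seconds = [1506.398011, 1926.55742, 3517.835267, 4168.908276, 6336.861966, 5383.067867]
memory_mib = [10378.695312, 13263.257812, 20978.339844, 23393.4375, 25850.035156, 24373.273438]

def relative_growth(values):
    """Express each value as a multiple of the first (10%) measurement."""
    return [v / values[0] for v in values]

time_growth = relative_growth(seconds)
memory_growth = relative_growth(memory_mib)

for p, t, m in zip(percent, time_growth, memory_growth):
    print(f"{p:>3}% of cells: {t:.2f}x time, {m:.2f}x memory")

# Fractions where runtime dropped relative to the previous, smaller subset.
non_monotonic = [percent[i] for i in range(1, len(seconds)) if seconds[i] < seconds[i - 1]]
print("runtime decreased at:", non_monotonic)
```

In this run, ten times as many cells took roughly 3.6 times the runtime and 2.3 times the memory of the 10% subset. The full dataset finished faster than the 80% subset; with a single run per fraction, the table alone cannot say whether that reflects differences in training length or run-to-run variation.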
