# Scanpy COVID-19 preprocessing

The counts matrix of the original dataset is too large to read into R because of the 32-bit limit on sparse matrices imposed by Rcpp. We therefore preprocess the dataset in Python, normalizing the counts and subsetting to variable genes, before reading it into R. This reduces the number of values stored in the sparse matrix enough to allow the downstream steps to be performed in R.

In [64]:
import numpy as np
import pandas as pd
import scanpy as sc
import scipy
from numpy import savetxt
from scipy import io

sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')

scanpy==1.7.2 anndata==0.7.6 umap==0.5.1 numpy==1.19.2 scipy==1.6.0 pandas==1.2.1 scikit-learn==0.24.1 statsmodels==0.12.2 python-igraph==0.9.1 louvain==0.7.0


The AnnData file `GSE158055_covid19.h5ad` was obtained from https://drive.google.com/file/d/1TXDJqOvFkJxbcm2u2-_bM5RBdTOqv56w/view, as suggested in this Seurat issue about the large file size: https://github.com/satijalab/seurat/issues/4030

In [4]:
adata = sc.read('/data/srlab2/jkang/datasets/ren_covid_cell_2021/GSE158055_covid19.h5ad')
adata

AnnData object with n_obs × n_vars = 1462702 × 27943

In [15]:
## Log(CP10k+1) normalize
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

normalizing counts per cell
    finished (0:00:42)
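
As a rough illustration of what the two calls above compute, the sketch below reproduces log(CP10k+1) with NumPy on a made-up 2 × 3 count matrix. The toy values and variable names are assumptions for illustration only; scanpy operates on the sparse matrix directly and its implementation may differ in details.

```python
import numpy as np

# Toy raw counts: 2 cells (rows) x 3 genes (columns); values are made up.
counts = np.array([[1.0, 3.0, 0.0],
                   [0.0, 10.0, 10.0]])

target_sum = 1e4  # same target as in the cell above

# Scale each cell so that its counts sum to target_sum (counts per 10k).
cell_totals = counts.sum(axis=1, keepdims=True)
cp10k = counts / cell_totals * target_sum

# Natural log of (1 + x); zeros stay zero, so sparsity is preserved.
log_norm = np.log1p(cp10k)

print(log_norm.round(3))
```

Because `log1p(0)` is 0, zero counts remain zero and the matrix stays sparse. With a target sum of 1e4 computed over all genes, no entry can exceed `log1p(1e4)` (about 9.21), which is consistent with the maximum of 8.98 reported further down.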


In [40]:
var_genes = pd.read_table('All_sample_hvg.csv', header=0) # Variable gene list obtained from the original authors (Ren et al.)
var_genes

Unnamed: 0,Genes
0,A2M
1,A2ML1
2,AATK
3,ABCA1
4,ABCA13
...,...
1296,ZBBX
1297,ZG16B
1298,ZNF185
1299,ZNF683


In [41]:
var_genes.loc[:, 'Genes']

0           A2M
1         A2ML1
2          AATK
3         ABCA1
4        ABCA13
         ...   
1296       ZBBX
1297      ZG16B
1298     ZNF185
1299     ZNF683
1300    ZNF705A
Name: Genes, Length: 1301, dtype: object

In [49]:
## Subset by variable genes
adata_vargenes = adata[:, var_genes.loc[:, 'Genes']]
adata_vargenes

View of AnnData object with n_obs × n_vars = 1462702 × 1301
    uns: 'log1p'
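
Subsetting by gene name only works if every requested gene is present in the dataset, which was the case here (all 1301 genes were found). A minimal sketch of a membership check that could be run before subsetting is shown below; the two lists are hypothetical stand-ins for `adata.var_names` and `var_genes['Genes']`, and `NOT_A_GENE` is an invented name.

```python
# Hypothetical stand-ins for adata.var_names and var_genes['Genes'].
dataset_genes = ["A2M", "A2ML1", "AATK", "ABCA1", "ZBBX"]
requested_genes = ["A2M", "ABCA1", "NOT_A_GENE"]

# Set lookup keeps the check fast even for tens of thousands of genes.
available = set(dataset_genes)

# Preserve the order of the requested list in both results.
missing = [g for g in requested_genes if g not in available]
present = [g for g in requested_genes if g in available]

print(f"{len(present)} genes found, {len(missing)} missing: {missing}")
```

If `missing` is non-empty, subsetting with `present` instead of the full list is one option, depending on whether dropping genes is acceptable for the analysis.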

In [52]:
adata_vargenes.obs

d01_sample_A_AACAGGGGTCGGATTT
d01_sample_A_AACCAACGTCCGAAAG
d01_sample_A_AACCTTTGTAGCACGA
d01_sample_A_AAGCATCTCTATCGCC
d01_sample_A_AATCACGGTCATAAAG
...
d17_9_TTTGTCATCCACGCAG
d17_9_TTTGTCATCCGCTGTT
d17_9_TTTGTCATCGTCGTTC
d17_9_TTTGTCATCTGTACGA
d17_9_TTTGTCATCTTGCCGT


In [53]:
adata_vargenes.var

A2M
A2ML1
AATK
ABCA1
ABCA13
...
ZBBX
ZG16B
ZNF185
ZNF683
ZNF705A


In [54]:
np.max(adata_vargenes.X)

8.979815

In [56]:
np.shape(adata_vargenes.X)

(1462702, 1301)

In [57]:
adata_vargenes.X

<1462702x1301 sparse matrix of type '<class 'numpy.float32'>'
	with 48864973 stored elements in Compressed Sparse Row format>
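
The stored-element count above can be compared against the 32-bit limit mentioned at the start. The sketch below assumes the limit is the largest signed 32-bit integer (2^31 - 1 stored elements); the function name `fits_int32` is hypothetical, and the numbers are copied from the outputs in this notebook.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, assumed cap on stored elements


def fits_int32(nnz: int) -> bool:
    """Return True if a count of stored elements fits in a signed 32-bit int."""
    return nnz <= INT32_MAX


# Values taken from the notebook outputs above.
n_cells, n_genes = 1_462_702, 1_301
nnz_subset = 48_864_973

# Fraction of matrix entries that are actually stored (nonzero).
density = nnz_subset / (n_cells * n_genes)

print(fits_int32(nnz_subset))
print(f"density: {density:.1%}")
```

The subset matrix is well under the assumed limit, with only a few percent of its entries stored.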

Save the sparse matrix, cell barcodes, and gene names

In [66]:
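# Write the normalized matrix in Matrix Market format (scipy appends '.mtx' when the target lacks it)
scipy.io.mmwrite('/data/srlab2/jkang/datasets/ren_covid_cell_2021/exp_norm_vargenes', adata_vargenes.X)

# Write the cell barcodes (row names of the matrix), one per line
pd.Series(adata_vargenes.obs_names, index=adata_vargenes.obs_names).to_csv('/data/srlab2/jkang/datasets/ren_covid_cell_2021/exp_norm_vargenes_barcodes.csv', header=False)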

adata_vargenes.var.to_csv('/data/srlab2/jkang/datasets/ren_covid_cell_2021/exp_norm_vargenes_genes.csv', header=True)
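
Before relying on the exported file, a Matrix Market round trip can be checked on the Python side. The sketch below uses a made-up 2 × 3 matrix and a temporary directory rather than the real data; the file name `toy_matrix` and the variable names are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from scipy import io, sparse

# Toy stand-in for adata_vargenes.X (values are made up).
x = sparse.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                [2.25, 0.0, 0.0]], dtype=np.float32))

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "toy_matrix")  # no extension given
    io.mmwrite(target, x)

    # Look up the file that was written rather than assuming its name.
    written = sorted(os.listdir(tmp))
    x_back = io.mmread(os.path.join(tmp, written[0]))

# The matrix read back should match the original in shape and values.
same_shape = x_back.shape == x.shape
same_values = np.allclose(x_back.toarray(), x.toarray())

print(written, same_shape, same_values)
```

The same comparison on the full matrix would be expensive, so checking a small slice, as done in the next cell, is a reasonable compromise.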

In [81]:
x = adata_vargenes.X[0:5, 0:5]
print(x) # sanity checked these values upon reading the matrix into R

  (0, 3)	1.0726684
  (3, 3)	0.40238896
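
When comparing these entries in R, note that the indices printed above are 0-based while R indexes from 1. The sketch below converts the printed coordinates, assuming the matrix is read into R in the same orientation in which it was written (cells as rows, genes as columns); the variable names are hypothetical.

```python
# (row, col, value) triplets copied from the printed output above (0-based).
entries_py = [(0, 3, 1.0726684), (3, 3, 0.40238896)]

# Shift both indices by one to obtain the positions R would report.
entries_r = [(r + 1, c + 1, v) for r, c, v in entries_py]

for r, c, v in entries_r:
    print(f"x[{r}, {c}] = {v}")
# prints:
# x[1, 4] = 1.0726684
# x[4, 4] = 0.40238896
```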
