## Cortex preprocessing

We follow the preprocessing code from scVI (https://github.com/romain-lopez/scVI-reproducibility/blob/master/CORTEX-prepro.ipynb). 

Before running this notebook, download the raw data file (expression_mRNA_17-Aug-2014.txt) from https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mRNA_17-Aug-2014.txt and store it locally at "datasets/cortex/expression_mRNA_17-Aug-2014.txt".

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
# Directory that holds the downloaded raw file; adjust to your local setup.
data_path = "datasets/cortex/"

### Load data

In [3]:
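# Transpose the raw table so that each row corresponds to a cell.
X = pd.read_csv(data_path + "expression_mRNA_17-Aug-2014.txt", sep="\t", low_memory=False).T
# Column 7 of the transposed table is used as the cell-type annotation; its first two entries are not cells.
clusters = np.array(X[7], dtype=str)[2:]
# Encode each cell-type name as an integer label.
celltypes, labels = np.unique(clusters, return_inverse=True)
# Row 0 holds the row names of the raw file; the first 10 are non-gene fields and are skipped.
gene_names = np.array(X.iloc[0], dtype=str)[10:]
X = X.loc[:, 10:]
X = X.drop(X.index[0])
# Convert the remaining entries to integer counts.
expression = np.array(X, dtype=int)[1:]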

In [4]:
print(expression.shape[0], "cells with", expression.shape[1], "genes.")

3005 cells with 19972 genes.
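
The label encoding in the loading cell relies on `np.unique(..., return_inverse=True)`. A minimal sketch of what that call returns, using a made-up list of cell-type names (the names and their order here are illustrative, not taken from the dataset):

```python
import numpy as np

# Hypothetical cluster annotations for four cells.
toy_clusters = np.array(["pyramidal", "interneurons", "pyramidal", "microglia"])

# toy_celltypes: sorted unique names; toy_labels: index of each cell's name in toy_celltypes.
toy_celltypes, toy_labels = np.unique(toy_clusters, return_inverse=True)

print(toy_celltypes)  # sorted unique names
print(toy_labels)     # integer label per cell

# Indexing the unique names with the labels recovers the original annotations.
recovered = toy_celltypes[toy_labels]
```

Because the unique names come back sorted, the integer labels depend only on the set of names, not on the order in which cells appear.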


In [5]:
# Keep the 558 genes with the largest standard deviation across cells, most variable first.
selected = np.std(expression, axis=0).argsort()[-558:][::-1]
expression = expression[:, selected]
gene_names = gene_names[selected].astype(str)
print(expression.shape)

(3005, 558)
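
The gene selection above chains `argsort`, a tail slice, and a reversal. A small sketch on a toy matrix (hypothetical values, with `k = 2` in place of 558) shows how the chain picks the most variable columns:

```python
import numpy as np

# Toy count matrix (hypothetical): 4 cells x 5 genes.
toy = np.array([
    [0, 5, 1, 9, 2],
    [0, 5, 3, 0, 2],
    [0, 5, 1, 9, 2],
    [0, 5, 3, 0, 3],
])

k = 2  # number of genes to keep (558 in the notebook)

# Per-gene standard deviation across cells.
gene_std = np.std(toy, axis=0)

# argsort is ascending, so the last k indices are the most variable genes;
# reversing puts the most variable gene first.
top = gene_std.argsort()[-k:][::-1]
reduced = toy[:, top]

print(top)
print(reduced.shape)
```

Genes 0 and 1 are constant across cells, so they have zero standard deviation and are never selected.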


### Split training and testing datasets

In [6]:
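# No test_size is given, so scikit-learn holds out its default of 25% of the cells; random_state=0 makes the split reproducible.
X_train, X_test, c_train, c_test = train_test_split(expression, labels, random_state=0)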

In [7]:
print(X_train.shape, X_test.shape, c_train.shape, c_test.shape)

(2253, 558) (752, 558) (2253,) (752,)
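
The split sizes printed above can be checked by hand. The sketch below assumes scikit-learn's documented default of a 25% test fraction and that the test count is rounded up, which is consistent with the shapes shown:

```python
import math

n_cells = 3005
test_fraction = 0.25  # default used when neither test_size nor train_size is passed

# Assumption: the test set size is rounded up, and the training set gets the rest.
n_test = math.ceil(test_fraction * n_cells)
n_train = n_cells - n_test

print(n_train, n_test)
```

Under these assumptions the arithmetic gives 2253 training cells and 752 test cells, matching the notebook output.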


In [8]:
np.savetxt(data_path + "data_train", X_train)
np.savetxt(data_path + "data_test", X_test)
np.savetxt(data_path + "label_train", c_train)
np.savetxt(data_path + "label_test", c_test)

### Mask counts for imputation

In [9]:
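expression_train = np.loadtxt(data_path + "data_train", dtype='float32')
X_zero = np.copy(expression_train)
# (i, j) are the coordinates of all nonzero entries of the training matrix.
i, j = np.nonzero(X_zero)
# Draw 10% of the nonzero entries without replacement (unseeded, so the mask differs between runs) and set them to zero.
ix = np.random.choice(range(len(i)), int(np.floor(0.1 * len(i))), replace=False)
X_zero[i[ix], j[ix]] = 0.0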

In [10]:
np.savetxt(data_path + "X_zero.txt", X_zero)
np.savetxt(data_path + "i.txt", i)
np.savetxt(data_path + "j.txt", j)
np.savetxt(data_path + "ix.txt", ix)

In [11]:
# Also save the indices in binary .npy format, which preserves their integer dtype.
np.save(data_path + "i.npy", i)
np.save(data_path + "j.npy", j)
np.save(data_path + "ix.npy", ix)
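
The saved indices identify the masked entries, so an imputed matrix can later be compared with the original counts at exactly those positions. The sketch below illustrates this on synthetic data. The seeded generator, the placeholder "imputation", and the median absolute error are all assumptions made for illustration; the notebook itself does not specify an evaluation metric.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (the notebook's masking is unseeded)

# Synthetic count matrix standing in for the training data: 20 cells x 10 genes.
X = rng.poisson(2.0, size=(20, 10)).astype("float32")

# Mask 10% of the nonzero entries, following the same steps as the notebook.
X_zero = X.copy()
i, j = np.nonzero(X_zero)
n_masked = int(np.floor(0.1 * len(i)))
ix = rng.choice(len(i), n_masked, replace=False)
X_zero[i[ix], j[ix]] = 0.0

# Placeholder for a model's output (hypothetical): add 1 to every entry.
X_imputed = X_zero + 1.0

# Compare imputed and true values only at the masked positions.
true_vals = X[i[ix], j[ix]]
pred_vals = X_imputed[i[ix], j[ix]]
median_abs_err = np.median(np.abs(true_vals - pred_vals))
print(n_masked, median_abs_err)
```

Only `i[ix]` and `j[ix]` are needed to locate the held-out values, which is why the notebook stores `i`, `j`, and `ix` alongside `X_zero`.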