### Preparing the data
In this notebook, we demonstrate how to prepare the Mouse Smart-seq dataset, a single-cell dataset released as part of the transcriptomic cell types study by [Tasic et al., 2018](https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-v1-and-alm-smart-seq). The dataset includes RNA sequencing of neurons from the anterolateral motor cortex (ALM) and primary visual cortex (VISp) of adult mice, collected using the Smart-seq (SSv4) platform.

In [1]:
import os, sys
import scipy.io as sio
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Import MMIDAS modules (set module_path to your local clone of the repository)
module_path = '/Users/yeganeh.marghi/github/MMIDAS/'
sys.path.insert(0, module_path)
from utils.tree_based_analysis import get_merged_types
from utils.analysis_cells_tree import HTree
from utils.config import load_config
from utils.data_tools import normalize_cellxgene


Download the ```zip``` files and place them in the data folder. There should also be a ```config.toml```, a global configuration file containing the following paths:


* ```package_dir='xxx'```
* ```data_path='xxx'```
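
For illustration, here is a minimal sketch of what such a file might contain and how its two entries could combine into the data directory. The path values are placeholders, and parsing with `tomllib` is an assumption made for this example; it is not necessarily how `load_config` is implemented.

```python
from pathlib import Path

try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport for older Python versions

# Hypothetical contents of config.toml; replace both values with your own paths.
example_config = """
package_dir = '/path/to/MMIDAS'
data_path = 'data'
"""

config = tomllib.loads(example_config)

# Assumption: the two entries are joined to form the data directory.
example_data_path = Path(config['package_dir']) / config['data_path']
print(example_data_path)
```

In this sketch, `data_path` is relative to `package_dir`, which matches how the two entries are joined in the next cell.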

In [3]:
paths = load_config(config_file='config.toml')
data_path = paths['package_dir'] / paths['data_path']

In [4]:
# Load the mouse Smart-seq VISp data
data_VISp_exon = data_path / 'mouse_VISp_2018-06-14_exon-matrix.csv'
anno_VISp = data_path / 'mouse_VISp_2018-06-14_samples-columns.csv'
df_vis_exon = pd.read_csv(data_VISp_exon)
df_vis_anno = pd.read_csv(anno_VISp, encoding='unicode_escape')

# Load the mouse Smart-seq ALM data
data_ALM_exon = data_path / 'mouse_ALM_2018-06-14_exon-matrix.csv'
anno_ALM = data_path / 'mouse_ALM_2018-06-14_samples-columns.csv'
df_alm_exon = pd.read_csv(data_ALM_exon)
df_alm_anno = pd.read_csv(anno_ALM, encoding='unicode_escape')

print(f'Total number of cells in VISp and ALM: {len(df_vis_anno)}, {len(df_alm_anno)}')

Total number of cells in VISp and ALM: 15413, 10068


In [5]:
# Get the neuronal cells across brain regions
vis_neuron = df_vis_anno['class'].isin(['GABAergic', 'Glutamatergic'])
alm_neuron = df_alm_anno['class'].isin(['GABAergic', 'Glutamatergic'])
vis_counts = df_vis_exon.values[:, 1:][:, vis_neuron].T
alm_counts = df_alm_exon.values[:, 1:][:, alm_neuron].T

df_anno = pd.concat([df_vis_anno[vis_neuron], df_alm_anno[alm_neuron]], ignore_index=True)
total_count = np.concatenate((vis_counts, alm_counts), axis=0)

# Normalize count values using log CPM (log of counts per million, plus one)
logCPM = np.log1p(normalize_cellxgene(total_count) * 1e6)

print(np.sum(logCPM, axis=1))

[30890.15859407 34090.13980254 35085.63428565 ... 34077.15380524
 31090.81791427 35629.482184  ]
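
The normalization above can be sketched in plain NumPy. This sketch assumes that `normalize_cellxgene` divides each cell's counts by that cell's total count; that behavior has not been checked against the helper's source. The function name `log_cpm` and the toy matrix are hypothetical.

```python
import numpy as np

def log_cpm(counts):
    """Return log(1 + counts per million) for a cells x genes count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Total counts per cell (row); keepdims allows broadcasting over genes.
    totals = counts.sum(axis=1, keepdims=True)
    # Guard against division by zero for cells with no counts.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * 1e6)

# Toy example: 2 cells x 2 genes.
toy_counts = np.array([[1, 3], [0, 5]])
toy_logcpm = log_cpm(toy_counts)
```

Under this assumption, undoing the log with `np.expm1` gives rows that each sum to one million, which is a quick sanity check on the scaling.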


In [8]:
# list of all genes in the dataset
ref_gene_file = data_path / 'mouse_ALM_2018-06-14_genes-rows.csv'

# selected genes for mouse Smart-seq data analysis
slc_gene_file = data_path / 'genes_SS_VISp_ALM.csv'

ref_genes_df = pd.read_csv(ref_gene_file)
slc_gene_df = pd.read_csv(slc_gene_file)

print(ref_genes_df[41530:41550])
print('-'*100)
print(f'Total number of genes: {len(ref_genes_df)}, Number of selected genes: {len(slc_gene_df)}')

      gene_symbol    gene_id chromosome  gene_entrez_id  \
41530      Sssca1  500741647         19           56390   
41531         Sst  500737291         16           20604   
41532       Sstr1  500729687         12           20605   
41533       Sstr2  500728684         11           20606   
41534       Sstr3  500736064         15           20607   
41535       Sstr4  500704969          2           20608   
41536       Sstr5  500738797         17           20609   
41537       Ssty1  500745186          Y           20611   
41538       Ssty2  500745340          Y           70009   
41539        Ssu2  500714992          6          243612   
41540       Ssu72  500710656          4           68991   
41541      Ssx2ip  500707937          3           99167   
41542        Ssx9  500742933          X          382206   
41543       Ssxa1  500743112          X          385338   
41544       Ssxb1  500742924          X           67985   
41545      Ssxb10  500742922          X          385312 

Filter out the genes that were not selected, as well as two categories of cells: low-quality cells and cells belonging to the ```CR``` and ```Meis2``` subclasses.

In [11]:
# select genes
genes = slc_gene_df.genes.values
gene_indx = [np.where(ref_genes_df.gene_symbol.values == gg)[0][0] for gg in genes]
log1p = logCPM[:, gene_indx]

# remove low quality cells and those belonging to the subclasses 'CR' and 'Meis2'.
mask = (df_anno['cluster']!='Low Quality') & (~df_anno['subclass'].isin(['CR', 'Meis2']))
df_anno = df_anno[mask].reset_index() 
log1p = log1p[mask, :]

print(f'final shape of normalized gene expression matrix: {log1p.shape}')

final shape of normalized gene expression matrix: (22365, 5032)
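
The gene lookup in this cell scans the full reference list once per selected gene. As an alternative, here is a minimal sketch using `pandas.Index.get_indexer`, which resolves all positions in one call and flags missing genes with `-1`. It assumes the gene symbols in the reference list are unique (`get_indexer` raises an error otherwise), and all `toy_*` names are hypothetical stand-ins for the notebook's variables.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for ref_genes_df.gene_symbol and slc_gene_df.genes.
toy_ref_genes = pd.Index(['Sst', 'Sstr1', 'Sstr2', 'Ssu72'])
toy_selected = np.array(['Ssu72', 'Sst'])

# get_indexer returns the position of each selected gene, or -1 if absent.
toy_idx = toy_ref_genes.get_indexer(toy_selected)
missing = toy_selected[toy_idx == -1]
if len(missing) > 0:
    raise KeyError(f'genes not found in reference list: {list(missing)}')

toy_expr = np.arange(12).reshape(3, 4)  # 3 cells x 4 reference genes
toy_subset = toy_expr[:, toy_idx]       # columns ordered like toy_selected
```

The explicit check for `-1` makes a misspelled or missing gene symbol fail loudly, rather than surfacing as an `IndexError` from inside a list comprehension.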


Build a data dictionary for the Smart-seq dataloader.

In [17]:
# load the tree.csv to obtain colors for t-types on the taxonomies
htree_file = data_path / 'tree.csv'
treeObj = HTree(htree_file=htree_file)
ttypes = treeObj.child[treeObj.isleaf]
colors = treeObj.col[treeObj.isleaf]

# build a data dictionary for the dataloader
data = df_anno[['subclass', 'cluster']].to_dict('list')
data['gene_id'] = genes
data['log1p'] = log1p
data['sample_id'] = df_anno.seq_name.values
data['class_label'] = df_anno['class'].values
data['cluster_color'] = np.array([colors[0]]*len(data['cluster']))

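for cluster in df_anno.cluster.unique():
    # data['cluster'] is a Python list, so compare against the array to get an element-wise match
    c_idx = np.where(df_anno.cluster.values == cluster)[0]
    data['cluster_color'][c_idx] = colors[ttypes == cluster]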
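
The per-cluster color assignment can also be written as a dictionary lookup. The sketch below uses toy data; the type names and hex colors are hypothetical and are not taken from `tree.csv`.

```python
import numpy as np

# Toy stand-ins for ttypes, colors and df_anno.cluster.
toy_ttypes = np.array(['type_A', 'type_B', 'type_C'])
toy_colors = np.array(['#ff0000', '#00ff00', '#0000ff'])
toy_clusters = np.array(['type_B', 'type_A', 'type_B', 'type_C'])

# Map each t-type to its color once, then look up every cell.
color_of = dict(zip(toy_ttypes, toy_colors))
toy_cluster_color = np.array([color_of[c] for c in toy_clusters])
```

A cluster label that is absent from the tree raises a `KeyError` here, which makes a mismatch between the annotation and the taxonomy easy to spot.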
