# Integration of Forelimb Data

For our integrated dataset, we seek to obtain cell type-averaged counts from the mouse forelimb 10X scRNA-seq dataset published by [He et al.](https://www.nature.com/articles/s41586-020-2536-x).

For the dataset integration, we assume the authors have performed satisfactory quality control on cells. We also keep all genes from the raw data to maximize the number of genes in the integrated dataset, since integration takes the intersection of genes across datasets. Finally, for a reasonable comparison, we normalize all cells to have the same total counts, `1e4`.
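As a rough illustration of what normalizing every cell to the same total means, here is a minimal NumPy sketch on a made-up matrix. The helper name `scale_to_total` and the toy numbers are assumptions for illustration only; in Scanpy the equivalent step is usually done with `sc.pp.normalize_total(adata, target_sum=1e4)`.

```python
import numpy as np

def scale_to_total(counts, target_sum=1e4):
    """Scale each row (cell) so that its counts sum to target_sum.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Guard against empty cells to avoid division by zero.
    totals[totals == 0] = 1.0
    return counts / totals * target_sum

# Toy matrix: 3 cells x 4 genes (made-up numbers).
toy_counts = np.array([[1, 0, 3, 6],
                       [5, 5, 0, 0],
                       [2, 2, 2, 2]])
toy_norm = scale_to_total(toy_counts)

# Every cell should now sum to 1e4.
print(toy_norm.sum(axis=1))
```

After scaling, expression values are comparable across cells regardless of sequencing depth, which is what makes per-cell-type averages comparable across datasets.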

In [1]:
import numpy as np
import scanpy as sc
import scipy as sp
import pandas as pd
import seaborn as sb

import module as md

datadir = '../../data/raw_data/sc_data/forelimb/'
resdir = '../../data/processed_data/sc_data/forelimb/'

He et al. upload their gene expression matrix as a tsv file, available at https://cells.ucsc.edu/?bp=limb&ds=mouse-limb. 

We can read this tsv file into Scanpy to create an AnnData object. 

In [2]:
%%time
# We create the AnnData object by reading in the gene expression matrix.
file = datadir + 'exprMatrix.tsv' # file that stores the data, a tsv for this paper
adata = sc.read_csv(file, delimiter = "\t").T

adata

CPU times: user 6min 48s, sys: 42.8 s, total: 7min 31s
Wall time: 7min 47s


AnnData object with n_obs × n_vars = 90637 × 43346

The authors perform a log1p transform on the count data, so we invert it with an expm1 transform before visualizing the data.

In [3]:
adata.X = np.expm1(adata.X)
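The inversion above relies on `np.expm1` being the exact inverse of `np.log1p`. The short check below, on made-up values, sketches that round trip. It assumes the published matrix was transformed with the natural logarithm; if a different base had been used, `np.expm1` alone would not recover the original values.

```python
import numpy as np

# Made-up normalized counts for 2 cells x 3 genes.
toy = np.array([[0.0, 1.0, 9.0],
                [3.0, 0.0, 99.0]])

logged = np.log1p(toy)        # log(1 + x), as applied by the authors
recovered = np.expm1(logged)  # exp(x) - 1, the inverse transform

# The round trip should reproduce the original values up to float error.
print(np.allclose(recovered, toy))
```

Zeros stay zeros under both transforms, so the sparsity pattern of the matrix is unchanged.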

They also upload a tsv file with metadata of the gene expression matrix, available at https://cells.ucsc.edu/?bp=limb&ds=mouse-limb. We can add these annotations as observations onto our AnnData gene expression matrix.

In [4]:
# The dataset also has a metadata file with annotations of each cell's
# batch number, developmental stage, and cell type. We add these 
# annotations to the AnnData object.

metadata = datadir+'meta.tsv'
df_meta = pd.read_csv(metadata, delimiter = "\t")
df_meta = df_meta.set_index(df_meta['index'])

adata.obs['batch'] = df_meta['batch']
adata.obs['batch'] = [str(i) for i in adata.obs['batch']]
adata.obs['stage'] = df_meta['stage']
adata.obs['cell_type'] = df_meta['cell_type']

df_meta.head()

index,index,cell_type,batch,doublet_scores,nCount_RNA,nFeatures_RNA,Percent Mitochond.,stage,UMI Count,doublet_corrected_p,doublet_corrected_p_less_than_0_1
limb12_13_0AAACCTGAGATCGATA_1,limb12_13_0AAACCTGAGATCGATA_1,Chondrocyte,7,0.216129,11426.0,3404,0.016366,13.0,11428.0,2e-06,1
limb12_13_0AAACCTGAGATGAGAG_1,limb12_13_0AAACCTGAGATGAGAG_1,Mesenchymal 2,7,0.223881,6474.0,2322,0.013284,13.0,6474.0,0.815298,0
limb12_13_0AAACCTGAGCAGATCG_1,limb12_13_0AAACCTGAGCAGATCG_1,Epithelial 1,7,0.014121,8269.0,2162,0.01161,13.0,8273.0,0.896632,0
limb12_13_0AAACCTGAGCGATCCC_1,limb12_13_0AAACCTGAGCGATCCC_1,Fibroblast,7,0.065789,14966.0,3771,0.011092,13.0,14972.0,0.815298,0
limb12_13_0AAACCTGAGTGTACCT_1,limb12_13_0AAACCTGAGTGTACCT_1,Perichondrial,7,0.12527,7649.0,2517,0.018042,13.0,7651.0,0.815298,0


We seek to subset the cells into groups for the merged dataset. The object contains samples from `e10.5, e11, e12, e13, e13.5, e14, e15` mice. We will subset the cell types according to these timepoints.

Moreover, to avoid giving weight to very small cell type populations, we keep only cell types with at least 30 cells at a given stage.

In [5]:
# Group by the two subsetting categories (stage and cell type)
df_filt = pd.DataFrame(adata.obs.groupby(["stage", "cell_type"]).size())
df_filt = df_filt.rename(columns = {0: "ncells"})
df_filt = df_filt[df_filt["ncells"] > 0].reset_index()

# Drop the cell types with fewer than 30 cells.
df_filt = df_filt[df_filt["ncells"] >= 30]
df_filt = df_filt.drop(columns="ncells")

# Display counts of cell types in each timepoint.
df_filt.value_counts("stage")

stage
15.0    20
14.0    17
13.0    16
13.5    14
12.0    11
11.0     7
10.5     6
dtype: int64

In [6]:
# Subset the AnnData object to only include cells that are in the subsetted cell types.
keys = list(df_filt.columns.values)
i1 = adata.obs.set_index(keys).index
i2 = df_filt.set_index(keys).index

adata = adata[adata.obs[i1.isin(i2)].index,:]

adata

View of AnnData object with n_obs × n_vars = 90137 × 43346
    obs: 'batch', 'stage', 'cell_type'
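The subsetting step above matches cells against the retained (stage, cell type) pairs by turning both tables into a `MultiIndex` and calling `isin`. A minimal sketch of that idea on made-up data follows; the names `obs`, `keep`, and the cell labels are hypothetical stand-ins for `adata.obs` and `df_filt`.

```python
import pandas as pd

# Toy per-cell annotations (stand-in for adata.obs).
obs = pd.DataFrame(
    {"stage": [10.5, 10.5, 11.0, 11.0, 12.0],
     "cell_type": ["Muscle", "Fibroblast", "Muscle", "Fibroblast", "Muscle"]},
    index=["c1", "c2", "c3", "c4", "c5"],
)

# Toy table of (stage, cell type) pairs to keep (stand-in for df_filt).
keep = pd.DataFrame({"stage": [10.5, 11.0],
                     "cell_type": ["Muscle", "Fibroblast"]})

keys = list(keep.columns)
obs_pairs = obs.set_index(keys).index    # one (stage, cell_type) tuple per cell
keep_pairs = keep.set_index(keys).index  # the allowed tuples

# A cell is kept only if its full (stage, cell_type) pair is allowed.
mask = obs_pairs.isin(keep_pairs)
kept_cells = obs.index[mask].tolist()
print(kept_cells)
```

Matching on the pair rather than on each column separately matters: here `c2` and `c3` each have an allowed stage and an allowed cell type, but not in an allowed combination, so they are dropped.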

### Data Before Processing <a class="anchor" id="2-bullet"></a>

Though we do not perform quality control on cells or genes, we will visualize basic statistics on these parameters to get a rough idea of the data quality.

In [None]:
%%time

# Below, we compute genes/cell and counts/cell. With thresholds of 0, these
# calls filter nothing; they only annotate the cells and genes.
sc.pp.filter_cells(adata, min_genes=0)
sc.pp.filter_genes(adata, min_cells=0)

# Set annotations on the AnnData object for total counts per cell, cells per gene, and genes per cell
#X = np.matrix(adata.X)
adata.obs['n_total_counts_per_cell'] = adata.X.sum(axis=1)
adata.var['n_cells_per_gene'] = adata.X.astype(bool).astype(int).sum(axis=0).T
adata.obs['n_genes_per_cell'] = adata.X.astype(bool).astype(int).sum(axis=1)

# Make plots to visualize data quality.

pre_processed = md.vis_pre_processing(
    adata, genes_range=(0, 20000), counts_range=(0, 1e5),
    title='Figure 2: Data Before Pre-Processing',
    genes_threshold=1000, counts_threshold=2000)

In [None]:
avg_genes = int(np.average(adata.obs['n_genes_per_cell']))
avg_counts = int(np.average(adata.obs['n_total_counts_per_cell']))

print('The average number of genes per cell is ' + str(avg_genes) 
      + ' and the average number of counts per cell is ' 
      + str(avg_counts))

### Data Normalization <a class="anchor" id="3-bullet"></a>

The data is already normalized so each cell has `1e4` counts.

In [None]:
adata.X.sum(axis=1)

In [None]:
%%time

# Log-transform the normalized counts (log1p)
sc.pp.log1p(adata)

# We set the raw attribute of our AnnData object to the log-transformed, normalized matrix.
adata.raw=adata

For the integrated atlas, we now average expression values over cell types at each timepoint, add metadata, and save the result as a single CSV.

In [None]:
# Average expression over the cells of each (stage, cell type) group.

df = pd.DataFrame(columns = adata.var.index)

for j in df_filt.index:
    # Boolean mask selecting the cells in this stage and cell type.
    mask = ((adata.obs["stage"] == df_filt.loc[j, "stage"]) &
            (adata.obs["cell_type"] == df_filt.loc[j, "cell_type"]))
    df.loc[j] = np.asarray(adata[mask.values, :].X.mean(0)).ravel()

# This dataset has no 'tissue' annotation; 'cell_ontology_class' holds the
# authors' cell type labels.
df["cell_ontology_class"] = df_filt["cell_type"]
df["tissue"] = "NaN"
df["dataset"] = "Forelimb_E10.5_15.0"

df.to_csv(resdir + "forelimb_all_genes.csv")
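The per-group averaging can also be expressed as a single pandas `groupby`, which may be easier to check by eye. The sketch below uses made-up data; `expr` and `obs` are hypothetical stand-ins for a dense expression table (for example `pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)`) and for `adata.obs`. It is an alternative formulation under those assumptions, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy expression table: 4 cells x 2 genes (made-up values).
expr = pd.DataFrame(
    np.array([[1.0, 0.0],
              [3.0, 2.0],
              [5.0, 4.0],
              [7.0, 8.0]]),
    index=["c1", "c2", "c3", "c4"],
    columns=["GeneA", "GeneB"],
)

# Toy per-cell annotations, aligned with expr by index.
obs = pd.DataFrame(
    {"stage": [10.5, 10.5, 11.0, 11.0],
     "cell_type": ["Muscle", "Muscle", "Fibroblast", "Fibroblast"]},
    index=expr.index,
)

# Mean expression per (stage, cell type) group; rows are indexed by the pair.
means = expr.groupby([obs["stage"], obs["cell_type"]]).mean()
print(means)
```

Calling `means.reset_index()` turns the group keys back into ordinary columns, which keeps the stage of each averaged row explicit when the table is written to disk.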

In [None]:
%load_ext watermark
%watermark -v -p numpy,pandas,scanpy,jupyterlab