In [1]:
%load_ext lab_black
%load_ext autotime
import pandas as pd
import numpy as np

time: 248 ms (started: 2022-09-17 17:25:34 -07:00)


This notebook relies on a [berenslab tutorial](https://github.com/berenslab/rna-seq-tsne/blob/398261383041f84a5b818ff243a412748fbc2f2a/demo.ipynb) for most of its code.

## Warning

This is another notebook that involves downloading large files, which tests both your patience and potentially your RAM, although processing the data should not need more than ~6 GB of kernel RAM.

## Download the RNAseq data

The data is stored in a CSV inside a zipped folder containing multiple files. The [pandas read_csv doc](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for the `compression` parameter currently states:

> If using ‘zip’, the ZIP file must contain only one data file to be read in.

so we use the [Python zipfile module](https://docs.python.org/3/library/zipfile.html) to open the files we need by name.
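As a small illustration of that approach, the sketch below builds a two-file zip archive in memory and reads one member by name with `zipfile` and `pandas`. The member names and values are made up for the example and are not the real Allen Institute files.

```python
import zipfile
from io import BytesIO

import pandas as pd

# Build a tiny in-memory archive with two CSV members (hypothetical names/values).
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("genes-rows.csv", "gene_symbol,gene_entrez_id\nA,1\nB,2\n")
    zf.writestr("exon-matrix.csv", ",cell1,cell2\n1,0,5\n2,3,0\n")

# Re-open the archive for reading and pick out a single member by name.
buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    member_names = zf.namelist()
    with zf.open("exon-matrix.csv") as member:
        toy_matrix = pd.read_csv(member, index_col=0)

print(member_names)
print(toy_matrix)
```

`ZipFile.open` returns a file-like object, so `pd.read_csv` can consume it directly without extracting anything to disk.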

In [2]:
import zipfile
from io import BytesIO

import requests
from scipy import sparse


# NOTE: despite the name, this fetches the Tasic et al. 2018 VISp and ALM data
def read_macosko_data():
    visp_zip_url = (
        "http://celltypes.brain-map.org/api/v2/well_known_file_download/694413985"
    )
    vreq = requests.get(visp_zip_url, timeout=10)

    alm_zip_url = (
        "http://celltypes.brain-map.org/api/v2/well_known_file_download/694413179"
    )
    areq = requests.get(alm_zip_url, timeout=10)

    with zipfile.ZipFile(BytesIO(vreq.content)) as visp_zip:
        # we need the gene symbols from this file
        with visp_zip.open("mouse_VISp_2018-06-14_genes-rows.csv") as visp_genes_file:
            genesDF = pd.read_csv(visp_genes_file)
            # now onto the main event
            with zipfile.ZipFile(BytesIO(areq.content)) as alm_zip:
                with visp_zip.open(
                    "mouse_VISp_2018-06-14_exon-matrix.csv"
                ) as visp_csv, alm_zip.open(
                    "mouse_ALM_2018-06-14_exon-matrix.csv"
                ) as alm_csv:
                    csv_kwds = dict(
                        chunksize=1000, index_col=0, na_filter=False, dtype=np.uint32
                    )

                    genes = []
                    sparseblocks = []
                    areas = []
                    cells = []
                    for chunk1, chunk2 in zip(
                        pd.read_csv(visp_csv, **csv_kwds),
                        pd.read_csv(alm_csv, **csv_kwds),
                    ):
                        if len(cells) == 0:
                            cells = np.concatenate((chunk1.columns, chunk2.columns))
                            areas = [0] * chunk1.columns.size + [
                                1
                            ] * chunk2.columns.size

                        genes.extend(list(chunk1.index))
                        sparseblock1 = sparse.csr_matrix(chunk1.values)
                        sparseblock2 = sparse.csr_matrix(chunk2.values)
                        sparseblock = sparse.hstack(
                            (sparseblock1, sparseblock2), format="csr"
                        )
                        sparseblocks.append([sparseblock])
                        print(".", end="", flush=True)
                    print(" done")
                    counts = sparse.bmat(sparseblocks).T
                    genes = np.array(genes)
                    areas = np.array(areas)
                    return counts, genes, cells, areas, genesDF

time: 117 ms (started: 2022-09-17 17:25:36 -07:00)
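The chunked loop in `read_macosko_data` reads a genes-by-cells CSV a block of rows at a time, converts each block to a sparse matrix, and stacks the blocks before transposing to cells-by-genes. A minimal sketch of the same pattern on toy data follows; the CSV contents are invented, it uses `sparse.vstack` rather than `sparse.bmat`, and it omits the `dtype` and `na_filter` options for brevity.

```python
from io import StringIO

import pandas as pd
from scipy import sparse

# Toy genes-by-cells table: rows are gene ids, columns are cells (made-up values).
toy_csv = StringIO(
    "gene,cellA,cellB,cellC\n"
    "10,0,4,0\n"
    "11,1,0,0\n"
    "12,0,0,7\n"
    "13,0,2,0\n"
    "14,5,0,0\n"
)

gene_ids = []
blocks = []
# Read two rows at a time so only one small dense chunk is in memory at once.
for chunk in pd.read_csv(toy_csv, chunksize=2, index_col=0):
    gene_ids.extend(chunk.index)
    blocks.append(sparse.csr_matrix(chunk.values))

# Stack the row blocks, then transpose so rows are cells and columns are genes.
toy_counts = sparse.vstack(blocks, format="csr").T

print(toy_counts.shape, toy_counts.nnz)
```

Only the non-zero entries are stored after each chunk is converted, which is what keeps the full matrix within a few GB of RAM.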


Fetching these files over the internet will take a few minutes and should take up around 3 GB. Note that the downloaded archives are held in memory rather than written to disk.

In [3]:
counts, genes, cells, areas, genesDF = read_macosko_data()

.............................................. done
time: 7min 41s (started: 2022-09-17 17:25:36 -07:00)


In [4]:
counts, genes, cells, areas

(<25481x45768 sparse matrix of type '<class 'numpy.uint32'>'
 	with 227422472 stored elements in Compressed Sparse Column format>,
 array([    71661,     76253,     58520, ..., 100861498, 100861500,
        100861503]),
 array(['F1S4_160108_001_A01', 'F1S4_160108_001_B01',
        'F1S4_160108_001_C01', ..., 'FJS4_170511_012_F01',
        'FJS4_170511_012_G01', 'FJS4_170511_012_H01'], dtype=object),
 array([0, 0, 0, ..., 1, 1, 1]))

time: 8.5 ms (started: 2022-09-17 17:33:17 -07:00)


### Replace the entrez ids with gene symbols

This step uses the `genesDF` table of gene annotations that we also read from the VISp zip file:

In [5]:
genesDF

Unnamed: 0,gene_symbol,gene_id,chromosome,gene_entrez_id,gene_name
0,0610005C13Rik,500717483,7,71661,RIKEN cDNA 0610005C13 gene
1,0610006L08Rik,500717917,7,76253,RIKEN cDNA 0610006L08 gene
2,0610007P14Rik,500730104,12,58520,RIKEN cDNA 0610007P14 gene
3,0610009B22Rik,500726890,11,66050,RIKEN cDNA 0610009B22 gene
4,0610009E02Rik,500702775,2,100125929,RIKEN cDNA 0610009E02 gene
...,...,...,...,...,...
45763,n-R5s142,500721654,8,100861496,nuclear encoded rRNA 5S 142
45764,n-R5s143,500721655,8,100861497,nuclear encoded rRNA 5S 143
45765,n-R5s144,500721656,8,100861498,nuclear encoded rRNA 5S 144
45766,n-R5s146,500721658,8,100861500,nuclear encoded rRNA 5S 146


time: 12.3 ms (started: 2022-09-17 17:33:17 -07:00)


In [6]:
gene_entrez_ids = genesDF["gene_entrez_id"].tolist()
symbols = genesDF["gene_symbol"].tolist()
id2symbol = dict(zip(gene_entrez_ids, symbols))
genes = np.array([id2symbol[g] for g in genes])

time: 26.7 ms (started: 2022-09-17 17:33:18 -07:00)
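The dictionary lookup above raises a `KeyError` if an id is missing from `genesDF`. As a hedged alternative, the sketch below does the same mapping with a pandas `Series` and counts unmapped ids instead of failing. It uses the first three rows of the table shown above as toy input; the `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy annotation table, copied from the first three rows of genesDF shown above.
toy_genes_df = pd.DataFrame(
    {
        "gene_entrez_id": [71661, 76253, 58520],
        "gene_symbol": ["0610005C13Rik", "0610006L08Rik", "0610007P14Rik"],
    }
)
toy_ids = np.array([58520, 71661])

# Index the symbols by Entrez id, then look the ids up in order.
lookup = toy_genes_df.set_index("gene_entrez_id")["gene_symbol"]
mapped = lookup.reindex(toy_ids)

# Ids without a symbol come back as NaN rather than raising an error.
n_missing = int(mapped.isna().sum())
toy_symbols = mapped.to_numpy()

print(toy_symbols, n_missing)
```

This assumes the ids in the annotation table are unique, since `reindex` does not accept a duplicated index.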


## Read cluster information

In [7]:
clusterInfo = pd.read_csv(
    "https://raw.githubusercontent.com/berenslab/rna-seq-tsne/398261383041f84a5b818ff243a412748fbc2f2a/data/tasic-sample_heatmap_plot_data.csv",
)
clusterInfo

Unnamed: 0,sample_name,cluster_id,cluster_color,cluster_label,class_id,class_color,class_label,Gad2_log10_cpm,Slc17a7_log10_cpm,Lamp5_log10_cpm,Sncg_log10_cpm,Vip_log10_cpm,Sst_log10_cpm,Pvalb_log10_cpm
0,F1S4_161216_001_A01,94,#53D385,L5 PT ALM Slco2a1,2,#27AAE1,Glutamatergic,0.000000,2.703004,2.644231,0.000000,0.194593,0.000000,0.000000
1,F1S4_180124_314_A01,73,#33A9CE,L5 IT ALM Npw,2,#27AAE1,Glutamatergic,0.000000,2.655333,3.254294,0.000000,0.000000,0.000000,0.000000
2,F1S4_180124_315_A01,2,#FF88AD,Lamp5 Fam19a1 Pax6,1,#EF4136,GABAergic,2.981714,0.000000,0.000000,0.968798,0.000000,0.000000,0.000000
3,F1S4_180124_315_B01,8,#9440F3,Sncg Slc17a8,1,#EF4136,GABAergic,2.479560,0.000000,0.000000,2.388210,0.000000,0.000000,1.685995
4,F1S4_180124_315_C01,8,#9440F3,Sncg Slc17a8,1,#EF4136,GABAergic,2.881715,0.000000,0.000000,3.005049,0.000000,0.000000,0.952222
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23817,FYS4_171004_104_C01,92,#00A863,L5 PT VISp C1ql2 Cdh13,2,#27AAE1,Glutamatergic,0.000000,2.720627,3.122579,0.000000,0.000000,0.000000,0.000000
23818,FYS4_171004_104_D01,42,#BF9F00,Sst Hpse Sema3c,1,#EF4136,GABAergic,2.246059,0.000000,0.000000,0.000000,0.000000,3.863217,1.146638
23819,FYS4_171004_104_F01,89,#0000FF,L5 PT VISp Chrna6,2,#27AAE1,Glutamatergic,0.000000,2.157171,2.817028,0.000000,0.000000,0.476322,0.000000
23820,FYS4_171004_104_G01,35,#CC6D3D,Sst Calb2 Pdlim5,1,#EF4136,GABAergic,2.792404,0.000000,0.000000,0.295530,0.000000,4.184470,0.000000


time: 1.08 s (started: 2022-09-17 17:33:18 -07:00)


In [8]:
goodCells = clusterInfo["sample_name"].values
clusterIds = clusterInfo["cluster_id"].values
labels = clusterInfo["cluster_label"].values
colors = clusterInfo["cluster_color"].values

clusterNames = np.array(
    [labels[clusterIds == i + 1][0] for i in range(np.max(clusterIds))]
)
clusterColors = np.array(
    [colors[clusterIds == i + 1][0] for i in range(np.max(clusterIds))]
)
clusters = np.copy(clusterIds) - 1

clusterNames[:5], clusterColors[:5], clusters[:5]

(array(['Lamp5 Krt73', 'Lamp5 Fam19a1 Pax6', 'Lamp5 Fam19a1 Tmem182',
        'Lamp5 Ntn1 Npy2r', 'Lamp5 Plch2 Dock5'], dtype='<U26'),
 array(['#DDACC9', '#FF88AD', '#FFB8CE', '#DD6091', '#FF7290'], dtype='<U7'),
 array([93, 72,  1,  7,  7]))

time: 15.9 ms (started: 2022-09-17 17:33:19 -07:00)


## Keep the good cells

In [9]:
ind = np.array([np.where(cells == c)[0][0] for c in goodCells])
counts = counts[ind, :]

time: 10.8 s (started: 2022-09-17 17:33:19 -07:00)
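The list comprehension above scans the whole `cells` array once per good cell, which is why it takes several seconds. A sketch of a faster lookup using `pd.Index.get_indexer` is shown below on toy data; the `toy_*` names are hypothetical, and it assumes the cell names are unique.

```python
import numpy as np
import pandas as pd

toy_cells = np.array(["c1", "c2", "c3", "c4"], dtype=object)
toy_good = np.array(["c3", "c1"], dtype=object)

# get_indexer returns the position of each requested name, or -1 if absent.
toy_ind = pd.Index(toy_cells).get_indexer(toy_good)

# Fail early if any good cell is missing from the downloaded data.
if (toy_ind < 0).any():
    raise ValueError("some good cells were not found")

print(toy_ind)
```

The same index array could also be used to subset any other per-cell array, such as `areas`, so that it stays aligned with the filtered `counts`.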


In [10]:
print("Number of cells:", counts.shape[0])
# areas uses 0 for VISp and 1 for ALM (see read_macosko_data); it was not
# subset to the good cells, so these two counts cover all downloaded cells
print("Number of cells from VISp:", np.sum(areas == 0))
print("Number of cells from ALM:", np.sum(areas == 1))
print("Number of clusters:", np.unique(clusters).size)
print("Number of genes:", counts.shape[1])
# counts.size is the number of stored (non-zero) entries
print(
    "Fraction of non-zeros in the data matrix: {:.2f}".format(
        counts.size / np.prod(counts.shape)
    )
)

Number of cells: 23822
Number of cells from VISp: 15413
Number of cells from ALM: 10068
Number of clusters: 133
Number of genes: 45768
Fraction of non-zeros in the data matrix: 0.20
time: 40.8 ms (started: 2022-09-17 17:33:29 -07:00)


### Save just in case?

Saving is optional, and writing the pickle will probably take longer than re-reading and processing the data over the network:

In [1]:
# import gzip
# import pickle
# from pathlib import Path

# tasic2018 = {
#     "counts": counts,
#     "genes": genes,
#     "clusters": clusters,
#     "areas": areas,
#     "clusterColors": clusterColors,
#     "clusterNames": clusterNames,
# }

# path = "path-to-wherever-you-save-this-stuff"

# uncomment this if you want the pickle gzipped too
# with gzip.open(
#     Path(path) / "tasic2018-raw.pkl.gz", "wb"
# ) as f:
#     pickle.dump(tasic2018, f, pickle.HIGHEST_PROTOCOL)
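
For reference, a gzipped pickle written this way can be read back with `gzip.open` and `pickle.load`. The round trip below is a minimal sketch on a toy dictionary written to a temporary directory; the file name and contents are placeholders, not the real dataset.

```python
import gzip
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-in for the dataset dictionary (hypothetical contents).
toy = {"genes": np.array(["A", "B"]), "clusters": np.array([0, 1])}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy-raw.pkl.gz"

    # Write the dictionary as a gzip-compressed pickle.
    with gzip.open(out_path, "wb") as f:
        pickle.dump(toy, f, pickle.HIGHEST_PROTOCOL)

    # Read it back while the temporary directory still exists.
    with gzip.open(out_path, "rb") as f:
        restored = pickle.load(f)

print(restored["genes"], restored["clusters"])
```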

## Feature Selection

In [11]:
def calcNearZeroRate(data, threshold=0):
    zeroRate = 1 - np.squeeze(np.array((data > threshold).mean(axis=0)))
    return zeroRate


def calcMeanLogExpression(data, threshold=0, atleast=10):
    nonZeros = np.squeeze(np.array((data > threshold).sum(axis=0)))
    N = data.shape[0]
    A = data.multiply(data > threshold)
    A.data = np.log2(A.data)
    meanExpr = np.zeros(data.shape[1]) * np.nan
    detected = nonZeros >= atleast
    meanExpr[detected] = np.squeeze(np.array(A[:, detected].mean(axis=0))) / (
        nonZeros[detected] / N
    )
    return meanExpr


def featureSelection(meanLogExpression, nearZeroRate, yoffset=0.02, decay=1.5, n=3000):
    low = 0
    up = 10
    nonan = ~np.isnan(meanLogExpression)
    xoffset = 5
    for _ in range(100):
        selected = np.zeros_like(nearZeroRate).astype(bool)
        selected[nonan] = (
            nearZeroRate[nonan]
            > np.exp(-decay * meanLogExpression[nonan] + xoffset) + yoffset
        )

        if np.sum(selected) == n:
            break

        if np.sum(selected) < n:
            up = xoffset
            xoffset = (xoffset + low) / 2
        else:
            low = xoffset
            xoffset = (xoffset + up) / 2

    return selected

time: 11 ms (started: 2022-09-17 17:33:30 -07:00)


### Select 3000 genes

* Get mean log non-zero expression of each gene
* Get near-zero frequency of each gene
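* Select the 3000 genes whose near-zero frequency is high relative to their mean log expression (`featureSelection` bisects on `xoffset` until `n` genes pass the cutoff)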
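
The rule implemented by the three functions above can be written compactly. The symbols $x_{ig}$ (count for cell $i$ and gene $g$), $t$ (the `threshold`), $N$ (number of cells), $n_g$, $z_g$ and $m_g$ are notation introduced here for illustration and do not appear in the code.

$$
n_g = \#\{\, i : x_{ig} > t \,\}, \qquad
z_g = 1 - \frac{n_g}{N}, \qquad
m_g = \frac{1}{n_g} \sum_{i \,:\, x_{ig} > t} \log_2 x_{ig} \quad (n_g \ge 10)
$$

$$
\text{gene } g \text{ is selected} \iff z_g > \exp\left(-\texttt{decay} \cdot m_g + \texttt{xoffset}\right) + \texttt{yoffset}
$$

Here $z_g$ is the output of `calcNearZeroRate` and $m_g$ is the output of `calcMeanLogExpression`. Genes with $n_g < 10$ get $m_g = \text{NaN}$ and are never selected. With `decay=1.5` and `yoffset=0.02` fixed, `xoffset` starts at 5 and is bisected within $[0, 10]$ for at most 100 iterations, stopping early if exactly `n` genes satisfy the inequality.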

In [12]:
mle = calcMeanLogExpression(counts, threshold=32)
nzr = calcNearZeroRate(counts, threshold=32)
selectedGenes = featureSelection(mle, nzr, n=3000)

time: 13.6 s (started: 2022-09-17 17:33:30 -07:00)


## Convert to log CPM

In [13]:
def create_logCPM(counts_full, selected):
    # Compute library sizes
    librarySizes = counts_full.sum(axis=1)

    # Library size normalisation
    counts3k = counts_full[:, selected]
    data = counts3k / librarySizes * 1e6

    # Log-transformation
    return np.log2(data + 1)


data = create_logCPM(counts, selectedGenes)

time: 2.24 s (started: 2022-09-17 17:33:43 -07:00)
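To make the normalisation concrete: each count is divided by the cell's library size (its total count over all genes, not just the selected ones), scaled to counts per million, and then transformed as log2(CPM + 1). The sketch below reproduces this on a toy matrix; the `toy_*` names and values are invented, and it densifies explicitly so that the result is a plain `ndarray` rather than a `np.matrix`.

```python
import numpy as np
from scipy import sparse

# Toy cells-by-genes counts (2 cells, 3 genes) and a gene mask (made-up values).
toy_counts = sparse.csr_matrix(np.array([[1, 0, 3], [0, 2, 2]], dtype=np.uint32))
toy_selected = np.array([True, False, True])

# Library size per cell, computed over ALL genes before subsetting.
library_sizes = np.asarray(toy_counts.sum(axis=1)).ravel()

# Keep only the selected genes and convert to a dense 2D array.
selected_dense = toy_counts[:, toy_selected].toarray()

# Counts per million, then log2 with a pseudocount of 1.
cpm = selected_dense / library_sizes[:, None] * 1e6
toy_log_cpm = np.log2(cpm + 1)

print(toy_log_cpm)
```

Zero counts stay at exactly zero after the transform, because log2(0 + 1) = 0.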


Also convert `data` from a NumPy matrix (`np.matrix`, a data type which is likely to go away) to a 2D array:

In [14]:
data = data.A1.reshape(data.shape)

time: 1.11 ms (started: 2022-09-17 17:33:45 -07:00)


### Prepare the target

Create a palette for plotting:

In [15]:
tasic2018_palette = dict(ClusterName=dict(zip(clusterNames, clusterColors)))

time: 2.52 ms (started: 2022-09-17 17:33:45 -07:00)


We can use the good cell names as the index for the `target`. We'll also store the cluster id, name and the per-cell color just in case that's easier to use at some point.

In [16]:
target = pd.DataFrame(
    dict(
        ClusterId=clusters,
        ClusterColor=clusterColors[clusters],
        ClusterName=clusterNames[clusters],
    ),
    index=goodCells,
)

time: 8.7 ms (started: 2022-09-17 17:33:45 -07:00)


In [17]:
target.head()

Unnamed: 0,ClusterId,ClusterColor,ClusterName
F1S4_161216_001_A01,93,#53D385,L5 PT ALM Slco2a1
F1S4_180124_314_A01,72,#33A9CE,L5 IT ALM Npw
F1S4_180124_315_A01,1,#FF88AD,Lamp5 Fam19a1 Pax6
F1S4_180124_315_B01,7,#9440F3,Sncg Slc17a8
F1S4_180124_315_C01,7,#9440F3,Sncg Slc17a8


time: 6.05 ms (started: 2022-09-17 17:33:45 -07:00)


## Data Pipeline

In [18]:
from drnb.io.pipeline import create_default_pipeline

data_result = create_default_pipeline(check_for_duplicates=True).run(
    "tasic2018",
    data=data,
    target=target,
    target_palette=tasic2018_palette,
    tags=["highdim", "scRNAseq"],
    url="https://doi.org/10.1038/s41586-018-0654-5",
    verbose=True,
)

time: 1min 1s (started: 2022-09-17 17:33:45 -07:00)


### PCA 50 pipeline

This reduces the data to 50 dimensions with PCA, the same preprocessing as in the berenslab notebook:

In [19]:
data_pca50_result = create_default_pipeline(reduce=50).run(
    "tasic2018-pca50",
    data=data,
    target=target,
    target_palette=tasic2018_palette,
    tags=["scRNAseq"],
    url="https://doi.org/10.1038/s41586-018-0654-5",
    verbose=True,
)

time: 22.8 s (started: 2022-09-17 17:36:04 -07:00)
