_Neural Data Science_

Lecturer: Dr. Jan Lause, Prof. Dr. Philipp Berens

Tutors: Jonas Beck, Fabio Seel, Julius Würzler

Summer term 2025

Student names: Nina Lutz, Mathis Nommensen

LLM Disclaimer: <span style='background: yellow'>*Did you use an LLM to solve this exercise? If yes, which one and where did you use it? [Copilot, Claude, ChatGPT, etc.]* </span>

# Project 3: Single-cell data analysis.

In [None]:
#!pip install memory-profiler  # N: I needed to install this; delete this cell if it is no longer needed

In [None]:
# %matplotlib notebook #N: had to change this
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import string

import scipy as sp
from scipy import sparse
import sklearn

## add your packages ##

import time
import pickle
import memory_profiler
import seaborn as sns
import scipy.stats as stats

%load_ext memory_profiler

from pathlib import Path

In [None]:
import black
import jupyter_black

jupyter_black.load(line_length=79)

In [None]:
variables_path = Path("../results/variables")
figures_path = Path("../results/figures")
data_path = Path("../data")

In [None]:
# plt.style.use("matplotlib_style.txt")
plt.style.use("../matplotlib_style.txt")  # N: had to change this as well

In [None]:
np.random.seed(42)

## Project and data description

In this project, we are going to work with the typical methods and pipelines used in single-cell data analysis and get some hands-on experience with the techniques used in the field. For that, we will be using Patch-seq multimodal data from cortical neurons in mice, from Scala et al. 2021 (https://www.nature.com/articles/s41586-020-2907-3#Sec7). From the different data modalities they used, we will focus on transcriptomics and electrophysiological data. 

In a real-world scenario, single-cell data rarely comes with any "ground truth" labels. Often, the goal of researchers after measuring cells is to classify them precisely, grouping them into families or assigning them cell types based on the recorded features. This is normally done with unsupervised methods, such as clustering.

However, the single-cell data that we are using in this project has some cell types assigned to each cell. These are not "ground truth" type annotations, but were one of the results from the original Scala et al. work. Still, we are going to use those annotations for validation (despite them not really being ground truth) to sanity-check some of our analyses, such as visualizations, clustering, etc. We will mainly work with cell types (`rna_types`, 77 unique types) and cell families (`rna_families`, 9 unique families).

From the transcriptomics mRNA counts, we will only work with the exon counts for simplicity. Some of the electrophysiological features are not high-quality recordings, therefore we will also filter them out.

## Import data

### Meta data

In [None]:
# META DATA

meta = pd.read_csv(data_path / "m1_patchseq_meta_data.csv", sep="\t")

cells = meta["Cell"].values

layers = meta["Targeted layer"].values.astype("str")
cre = meta["Cre"].values
yields = meta["Yield (pg/µl)"].values
yields[yields == "?"] = np.nan
yields = yields.astype("float")
depth = meta["Soma depth (µm)"].values
depth[depth == "Slice Lost"] = np.nan
depth = depth.astype(float)
thickness = meta["Cortical thickness (µm)"].values
thickness[thickness == 0] = np.nan
thickness = thickness.astype(float)
traced = meta["Traced"].values == "y"
exclude = meta["Exclusion reasons"].values.astype(str)
exclude[exclude == "nan"] = ""

mice_names = meta["Mouse"].values
mice_ages = meta["Mouse age"].values
mice_cres = np.array(
    [
        c if c[-1] != "+" and c[-1] != "-" else c[:-1]
        for c in meta["Cre"].values
    ]
)
mice_ages = dict(zip(mice_names, mice_ages))
mice_cres = dict(zip(mice_names, mice_cres))

print("Number of cells with measured depth:    ", np.sum(~np.isnan(depth)))
print("Number of cells with measured thickness:", np.sum(~np.isnan(thickness)))
print("Number of reconstructed cells:          ", np.sum(traced))

sliceids = meta["Slice"].values
a, b = np.unique(sliceids, return_counts=True)
assert np.all(b <= 2)
print("Number of slices with two cells:        ", np.sum(b == 2))

# Some consistency checks
assert np.all(
    [
        np.unique(meta["Date"].values[mice_names == m]).size == 1
        for m in mice_names
    ]
)
assert np.all(
    [
        np.unique(meta["Mouse age"].values[mice_names == m]).size == 1
        for m in mice_names
    ]
)
assert np.all(
    [
        np.unique(meta["Mouse gender"].values[mice_names == m]).size == 1
        for m in mice_names
    ]
)
assert np.all(
    [
        np.unique(meta["Mouse genotype"].values[mice_names == m]).size == 1
        for m in mice_names
    ]
)
assert np.all(
    [
        np.unique(meta["Mouse"].values[sliceids == s]).size == 1
        for s in sliceids
    ]
)

In [None]:
meta.columns

### "Ground truth labels"

In [None]:
# filter out low quality cells in term of RNA
print(
    "There are",
    np.sum(meta["RNA family"] == "low quality"),
    "cells with low quality RNA recordings.",
)
exclude_low_quality = meta["RNA family"] != "low quality"

In [None]:
rna_family = meta["RNA family"][exclude_low_quality]
rna_type = meta["RNA type"][exclude_low_quality]

In [None]:
print(len(np.unique(rna_family)))
print(len(np.unique(rna_type)))

In [None]:
# use a context manager so the file handle is closed after loading
with open(data_path / "dict_rna_type_colors.pkl", "rb") as pickle_in:
    dict_rna_type_colors = pickle.load(pickle_in)

In [None]:
rna_type_colors = np.vectorize(dict_rna_type_colors.get)(rna_type)

### Transcriptomic data

In [None]:
# READ COUNTS
data_exons = pd.read_csv(
    data_path / "m1_patchseq_exon_counts.csv.gz", na_filter=False, index_col=0
)

assert all(cells == data_exons.columns)
genes = np.array(data_exons.index)

# filter out low quality cells in term of rna family
exonCounts = data_exons.values.transpose()[exclude_low_quality]
print("Count matrix shape (exon):  ", exonCounts.shape)

In [None]:
# GENE LENGTH

data = pd.read_csv(data_path / "gene_lengths.txt")
assert all(data["GeneID"] == genes)
exonLengths = data["exon_bp"].values

### Electrophysiological features

In [None]:
# EPHYS DATA

ephysData = pd.read_csv(data_path / "m1_patchseq_ephys_features.csv")
ephysNames = np.array(ephysData.columns[1:]).astype(str)
ephysCells = ephysData["cell id"].values
ephysData = ephysData.values[:, 1:].astype("float")
names2ephys = dict(zip(ephysCells, ephysData))
ephysData = np.array(
    [
        names2ephys[c] if c in names2ephys else ephysData[0] * np.nan
        for c in cells
    ]
)

print("Number of cells with ephys data:", np.sum(np.isin(cells, ephysCells)))

assert np.sum(~np.isin(ephysCells, cells)) == 0

In [None]:
# Filtering ephys data

features_exclude = [
    "Afterdepolarization (mV)",
    "AP Fano factor",
    "ISI Fano factor",
    "Latency @ +20pA current (ms)",
    "Wildness",
    "Spike frequency adaptation",
    "Sag area (mV*s)",
    "Sag time (s)",
    "Burstiness",
    "AP amplitude average adaptation index",
    "ISI average adaptation index",
    "Rebound number of APs",
]
features_log = [
    "AP coefficient of variation",
    "ISI coefficient of variation",
    "ISI adaptation index",
    "Latency (ms)",
]

X = ephysData[exclude_low_quality]
print(X.shape)
for e in features_log:
    X[:, ephysNames == e] = np.log(X[:, ephysNames == e])
X = X[:, ~np.isin(ephysNames, features_exclude)]

keepcells = ~np.isnan(np.sum(X, axis=1))
X = X[keepcells, :]
print(X.shape)

X = X - X.mean(axis=0)
ephysData_filtered = X / X.std(axis=0)

In [None]:
np.sum(np.isnan(ephysData_filtered))

# Research questions to investigate

**1) Inspect the data by computing key statistics.** For RNA counts, you can compute and plot statistics, e.g. total counts per cell, number of expressed genes per cell, mean count per gene, variance per gene, mean-variance relationship... See https://www.embopress.org/doi/full/10.15252/msb.20188746 for common quality control statistics. Keep in mind that the RNA data in this project is read counts, not UMI counts, so it is not supposed to follow a Poisson distribution. To get an idea of the technical noise in the data, you can plot count distributions of single genes within cell types (like in the lecture). 

Similarly, you can compute and plot statistics over the electrophysiological data. Also, investigate the distribution of "ground truth" labels. Comment on other relevant metadata, and consider whether you can use it as external validation for other analyses. If you do use other metadata throughout the project, explain why and what you get out of it. Take into account that certain features may not be very informative for our purposes (e.g. mouse age), so only choose features that provide you with useful information in this context. If you want to get additional information about the metadata, have a look at the extended data section in the original publication (e.g., cre-lines in Figure 1c in the extended data).

**2) Normalize & transform the data; select genes & apply PCA.** There are several ways of normalizing the RNA count data (Raw, CPM, CPMedian, RPKM, see https://www.reneshbedre.com/blog/expression_units.html, https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-021-02936-w). Take into account that there are certain normalizations that only make sense for UMI data, but not for this read count data. You also explored different transformations in the assignment (none, log, sqrt). Compare how the different transformations change the two-dimensional visualization. After normalization and transformation, choose a set of highly variable genes (as demonstrated in the lecture) and apply PCA. Play with the number of selected genes and the number of PCA components, and again compare their effects on the two-dimensional visualization.

**3) Two-dimensional visualization.** To visualize the RNA count data after normalization, transformation, gene selection and PCA, try different methods (just PCA, t-SNE, UMAP, ..) and vary their parameters (exaggeration, perplexity, ..). Compare them using quantitative metrics (e.g., kNN accuracy in high-dim vs. two-dim, kNN recall). Please refer to Lause et al., 2024 (https://doi.org/10.1371/journal.pcbi.1012403) where many of these metrics are discussed and explained to make an informed choice on which metrics to use. Think about also using the electrophysiological features and other metadata to enhance different visualizations.

**4) Clustering.** To find cell types in the RNA count data, you will need to look for clusters. Try different clustering methods (Leiden, GMM). Implement a negative binomial mixture model. For that, you can follow a method similar to the one described in Harris et al. 2018 (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2006387#abstract0), with fixed r (r=2). Feel free to simplify the setup from the paper: you can fix the set of important genes S instead of optimizing over it, or skip the split and merge part of their clustering algorithm. A vanilla NBMM implementation should suffice. Take into account that the NBMM tries to cluster data that follows a negative binomial distribution. Therefore, it does not make sense to apply this clustering method to all kinds of normalized and transformed data. Please refer to the Harris et al. 2018 publication for the appropriate choice of normalization, and reflect on why this normalization makes sense. Evaluate your clustering results (metrics, compare number of clusters to original labels,...).

**5) Correlation between electrophysiological features and genes/PCs.** Finally, connect RNA counts and functional data: Most likely, there will be interesting relationships between the transcriptomic and electrophysiological features in this data. Find these correlations and a way of visualizing them. When studying correlations using the PCA-reduced version of the transcriptomics data, it could be interesting to study PC loadings to see which genes dominate which PCs. For other advanced analyses, you can get inspiration from Kobak et al., 2021 (https://doi.org/10.1111/rssc.12494).
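
The vanilla NBMM mentioned in question 4 could be sketched roughly as follows. This is a minimal EM sketch under stated assumptions, not the Harris et al. algorithm: `r` is fixed, all genes are used, there is no split and merge step, and the names `nb_cluster_loglik`, `fit_nbmm`, `X_demo` and the random-cell initialisation are hypothetical choices made for illustration. The log-gamma terms of the negative binomial likelihood depend only on the counts and on `r`, so they cancel in the E-step and are omitted.

```python
import numpy as np


def nb_cluster_loglik(X, mu, r=2.0):
    """Cluster-dependent part of the NB log-likelihood, summed over genes.

    X  : (n_cells, n_genes) non-negative counts
    mu : (n_clusters, n_genes) positive cluster means
    Returns an (n_cells, n_clusters) array.
    """
    log_p = np.log(mu) - np.log(mu + r)  # log(mu / (mu + r))
    log_q = np.log(r) - np.log(mu + r)  # log(r / (mu + r))
    return X @ log_p.T + r * log_q.sum(axis=1)


def fit_nbmm(X, n_clusters, r=2.0, n_iter=100, eps=1e-3, seed=0):
    """EM for a negative binomial mixture with fixed dispersion r."""
    rng = np.random.default_rng(seed)
    n_cells = X.shape[0]
    # assumption: initialise cluster means from randomly chosen cells
    init = rng.choice(n_cells, size=n_clusters, replace=False)
    mu = X[init].astype(float) + eps
    log_pi = np.full(n_clusters, -np.log(n_clusters))
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities (responsibilities)
        log_post = nb_cluster_loglik(X, mu, r) + log_pi
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: with r fixed, the ML estimate of mu is the weighted mean
        weights = resp.sum(axis=0) + 1e-12
        mu = (resp.T @ X) / weights[:, None] + eps
        log_pi = np.log(weights / weights.sum())
    return resp.argmax(axis=1), mu, np.exp(log_pi)


# synthetic demo: two clusters, NB counts drawn as a gamma-Poisson mixture
rng = np.random.default_rng(0)
true_mu = np.array([[2.0] * 10 + [40.0] * 10, [40.0] * 10 + [2.0] * 10])
true_labels = np.repeat([0, 1], 100)
rates = rng.gamma(shape=2.0, scale=true_mu[true_labels] / 2.0)
X_demo = rng.poisson(rates)

labels, mu_hat, pi_hat = fit_nbmm(X_demo, n_clusters=2)
print("Cluster sizes:", np.bincount(labels, minlength=2))
```

EM only finds a local optimum, so in practice it would be sensible to run several seeds and keep the fit with the highest likelihood.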
    

# Task 1

**1.1 QC Statistics per cell.**

Total RNA counts, number of expressed genes, and mitochondrial fraction per cell.

In [None]:
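# only the last expression of a cell is displayed, so print both dimensions
n_cells, n_genes = exonCounts.shape
print("Number of cells:", n_cells)
print("Number of genes:", n_genes)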

In [None]:
# total counts per cell (count depth)

# exonCounts  # shape = 1232 cells x 42466 genes
total_counts_per_cell = exonCounts.sum(axis=1)

# plot
fig, axes = plt.subplots(1, 3, figsize=(10, 4))

axes[0].hist(
    total_counts_per_cell, bins=50, color="skyblue", edgecolor="black"
)
axes[0].set_xlabel("Total RNA counts per cell")
axes[0].set_ylabel("Number of cells")
axes[0].set_title("Distribution of Total Counts Per Cell")

axes[1].scatter(
    range(len(total_counts_per_cell)), total_counts_per_cell, s=5, alpha=0.5
)
axes[1].set_yscale("log")
axes[1].set_xlabel("Cell index")
axes[1].set_ylabel("Total RNA counts")
axes[1].set_title("Total Counts Per Cell")

# Rank-ordered plot of total counts per cell - Figure 2C in the paper
sorted_counts = np.sort(total_counts_per_cell)[::-1]

axes[2].plot(sorted_counts)
axes[2].set_yscale("log")
axes[2].set_xlabel("Ranked Cell Index")
axes[2].set_ylabel("Total Counts (log scale)")
axes[2].set_title("Rank-ordered Total Counts per Cell")

--> most cells have comparable total counts

--> the first ~20 cells have a lot more counts than the rest

--> the last ~100 cells have very few counts
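
To make these observations less dependent on reading the plot by eye, one option is to flag cells whose log total counts lie more than a few median absolute deviations (MADs) from the median. The sketch below uses synthetic depths; the helper name `mad_outliers`, the array `depth_demo` and the threshold of 3 MADs are assumptions for illustration, not values taken from the project.

```python
import numpy as np


def mad_outliers(values, n_mads=3.0):
    """Flag values whose log10 lies more than n_mads MADs from the median.

    Returns two boolean masks: (too_low, too_high).
    """
    log_values = np.log10(np.asarray(values, dtype=float) + 1)
    median = np.median(log_values)
    mad = np.median(np.abs(log_values - median))
    lower = median - n_mads * mad
    upper = median + n_mads * mad
    return log_values < lower, log_values > upper


# synthetic sequencing depths plus one very shallow and one very deep cell
rng = np.random.default_rng(0)
depth_demo = np.concatenate([rng.lognormal(13, 0.3, 500), [1e3, 1e9]])
low, high = mad_outliers(depth_demo)
print("Cells flagged as low depth: ", low.sum())
print("Cells flagged as high depth:", high.sum())
```

In the notebook, the same helper could be applied to `total_counts_per_cell`; whether flagged cells should actually be removed is a separate decision.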

In [None]:
# number of expressed genes per cell
expressed_genes_per_cell = (exonCounts > 0).sum(axis=1)

# plot
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].scatter(range(len(expressed_genes_per_cell)), expressed_genes_per_cell)
axes[0].set_xlabel("Cell Index")
axes[0].set_ylabel("Expressed Genes")
axes[0].set_title("Total Number of Expressed Genes per Cell")

axes[1].hist(expressed_genes_per_cell, bins=50)
axes[1].set_xlabel("Total number of expressed genes per cell")
axes[1].set_ylabel("Number of cells")
axes[1].set_title("Distribution of Expressed Genes Per Cell")

In [None]:
# fraction of mitochondrial genes

# mt_like_genes = [g for g in genes if "mt" in g.lower()]
# print(mt_like_genes)
# after checking the gene names, we assume that mitochondrial genes
# start with "mt-" or "MT-", so the prefix is matched case-insensitively
mt_gene_mask = np.char.startswith(np.char.lower(genes.astype(str)), "mt-")

print("Number of mitochondrial genes found:", np.sum(mt_gene_mask))

# Sum counts over mitochondrial genes per cell
mt_counts_per_cell = exonCounts[:, mt_gene_mask].sum(axis=1)

# Fraction mitochondrial
fraction_mito = mt_counts_per_cell / total_counts_per_cell

# Plot
plt.figure(figsize=(8, 4))
plt.hist(fraction_mito, bins=50, color="salmon", edgecolor="black")
plt.xlabel("Fraction of Mitochondrial Counts per Cell")
plt.ylabel("Number of Cells")
plt.title("Distribution of Mitochondrial RNA Content")
plt.show()

In [None]:
# Look for any gene names containing 'mt' or 'MT'
mt_like_genes = [g for g in genes if "mt" in g.lower()]
print(mt_like_genes)

**1.2 QC Statistics per gene.**


TODO: Question: do we need this plot? The log-scale version is much more informative.

In [None]:
# mean expression across all cells
mean_expression_across_cells = exonCounts.mean(axis=0)
print("Mean expression across all cells:", mean_expression_across_cells.shape)
# variance across all cells
variance_expression_across_cells = exonCounts.var(axis=0)
print("Variance across all cells:", variance_expression_across_cells.shape)

# plot
fig, ax = plt.subplots(figsize=(10, 4))
ax.scatter(
    mean_expression_across_cells,
    variance_expression_across_cells,
    s=5,
    alpha=0.5,
)
ax.set_xlabel("Mean Expression Across Cells")
ax.set_ylabel("Variance Across Cells")
ax.set_title("Mean vs Variance of Gene Expression Across Cells")
plt.show()

In [None]:
# on linear axes most genes are squeezed near the origin, so use log-scaled axes

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(
    mean_expression_across_cells + 1,  # avoid log(0)
    variance_expression_across_cells + 1,
    s=5,
    alpha=0.5,
)
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Mean Expression Across Cells (log scale)")
ax.set_ylabel("Variance Across Cells (log scale)")
ax.set_title("Mean vs Variance of Gene Expression (log scale)")
plt.show()
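
Because the data are read counts rather than UMI counts, the variance is expected to lie well above the Poisson reference, where the variance equals the mean. A simple way to quantify this is the per-gene Fano factor (variance divided by mean). The sketch below illustrates the idea on synthetic counts; `fano_factor`, `poisson_counts` and `overdispersed_counts` are hypothetical names, and the gamma-Poisson model with shape 2 is only a stand-in for real read counts.

```python
import numpy as np


def fano_factor(counts):
    """Per-gene variance-to-mean ratio; NaN for genes with zero mean."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    out = np.full(mean.shape, np.nan)
    return np.divide(var, mean, out=out, where=mean > 0)


rng = np.random.default_rng(0)
n_cells_demo, n_genes_demo = 500, 200
gene_means = rng.uniform(1, 100, size=n_genes_demo)

# Poisson reference: Fano factor close to 1
poisson_counts = rng.poisson(gene_means, size=(n_cells_demo, n_genes_demo))

# overdispersed counts: gamma-Poisson (negative binomial) with shape r = 2
rates = rng.gamma(
    shape=2.0, scale=gene_means / 2.0, size=(n_cells_demo, n_genes_demo)
)
overdispersed_counts = rng.poisson(rates)

print("Median Fano (Poisson):       ", np.nanmedian(fano_factor(poisson_counts)))
print("Median Fano (overdispersed): ", np.nanmedian(fano_factor(overdispersed_counts)))
```

Applied to `exonCounts`, genes far above a Fano factor of 1 would indicate overdispersion relative to Poisson; adding the line variance = mean to the log-log plot gives the same comparison visually.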

In [None]:
# dropout rate / fraction of cells where a gene has zero counts

dropout_rate_per_gene = (exonCounts == 0).sum(axis=0) / exonCounts.shape[0]

# Quick summary
print("Mean dropout rate across all genes:", np.mean(dropout_rate_per_gene))

# Plot
plt.figure(figsize=(8, 4))
plt.hist(dropout_rate_per_gene, bins=50, color="green", edgecolor="black")
plt.xlabel("Dropout Rate (Fraction of Cells with Zero Counts)")
plt.ylabel("Number of Genes")
plt.title("Dropout Rate per Gene")
plt.show()

count distributions of single genes within cell types

In [None]:
print(np.unique(rna_type))
# `in` on a pandas Series checks the index, not the values, so test the values
print("L2/3 IT_2" in rna_type.values)

In [None]:
# Count distribution of a single gene within each cell type
gene_name = "Sst"  # example marker gene; assumed to be present in `genes`
gene_index = np.flatnonzero(genes == gene_name)[0]
gene_counts = exonCounts[:, gene_index]

# group the counts of this gene by the annotated cell type
cell_types = rna_type.values
unique_cell_types = np.unique(cell_types)
data = [gene_counts[cell_types == ct] for ct in unique_cell_types]

plt.figure(figsize=(10, 5))
plt.boxplot(data)
plt.xticks(
    range(1, len(unique_cell_types) + 1), unique_cell_types, rotation=45
)
plt.ylabel(f"Counts of {gene_name}")
plt.xlabel("Cell type")
plt.title(f"Expression of {gene_name} across cell types")
plt.show()

**1.3 Statistics for electrophysiological features.**

In [None]:
# features overview
ephysNames_filtered = ephysNames[
    ~np.isin(ephysNames, features_exclude)
]  # see above

print(
    "Remaining electrophysiological features (n = {})".format(
        len(ephysNames_filtered)
    )
)
for i, name in enumerate(ephysNames_filtered, 1):
    print(f"{i:2d}. {name}")

In [None]:
# descriptive statistics of electrophysiological features (keep in mind that the data is standardized)

# print(
#    "Number of cells with ephys data: ",
#    np.sum(np.isin(cells, ephysCells)),
# )

# dictionary to collect stats
stats_dict = {
    "Mean": [],
    "Std": [],
    "Min": [],
    "Max": [],
    "Median": [],
    "Skewness": [],
}

# Compute stats per feature
# use ephysData_filtered: X is only mean-centered, not scaled to unit variance
for i in range(ephysData_filtered.shape[1]):
    data = ephysData_filtered[:, i]
    stats_dict["Mean"].append(np.mean(data))
    stats_dict["Std"].append(np.std(data))
    stats_dict["Min"].append(np.min(data))
    stats_dict["Max"].append(np.max(data))
    stats_dict["Median"].append(np.median(data))
    stats_dict["Skewness"].append(stats.skew(data))

# Convert to DataFrame
feature_stats_df = pd.DataFrame(stats_dict, index=ephysNames_filtered)

print("Basic statistics of electrophysiological features (standardized):")
display(feature_stats_df)

In [None]:
# Plotting the distribution of each electrophysiological feature
# for i, feature in enumerate(ephysNames_filtered):
#    plt.figure(figsize=(8, 4))
#    sns.histplot(X[:, i], bins=30, kde=True, color="skyblue")
#    plt.title(f"Distribution of {feature}")
#    plt.xlabel(feature)
#    plt.ylabel("Density")
#    plt.grid()
#    plt.show()

n_features = len(ephysNames_filtered)
n_cols = 4  # Number of columns in the grid
n_rows = int(
    np.ceil(n_features / n_cols)
)  # Number of rows in the grid, rounded up

fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 3))

axes = axes.flatten()

for i, feature in enumerate(ephysNames_filtered):
    ax = axes[i]
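    # plot the standardized values (X itself is only mean-centered)
    sns.histplot(
        ephysData_filtered[:, i],
        bins=30,
        kde=False,  # the KDE is drawn separately below
        color="skyblue",
        stat="density",
        ax=ax,
    )
    sns.kdeplot(
        ephysData_filtered[:, i], color="darkblue", linewidth=1, ax=ax
    )  # smoothed density estimate (dark blue line)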

    ax.axvline(0, color="red", linestyle="--", linewidth=1)
    ax.set_title(feature, fontsize=9, fontweight="bold")
    ax.set_xlabel("Standardized value")
    ax.set_ylabel("Density")

# Remove empty subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

# plt.tight_layout()
plt.show()

In [None]:
# DataFrame with standardized ephys data
ephys_df = pd.DataFrame(ephysData_filtered, columns=ephysNames_filtered)

# Compute Pearson correlation matrix
corr_matrix = ephys_df.corr()

# Test
print("Correlation matrix shape:", corr_matrix.shape)
# print(corr_matrix.head())

# "high correlation" threshold
threshold = 0.6  # heuristic cutoff; the best value is still open
high_corr_pairs = []

# Iterate over the upper triangle of the correlation matrix (excluding the diagonal) because the matrix is symmetric
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)):
        corr_value = corr_matrix.iloc[i, j]
        if abs(corr_value) > threshold:
            feature_i = corr_matrix.columns[i]
            feature_j = corr_matrix.columns[j]
            high_corr_pairs.append((feature_i, feature_j, corr_value))

# Print results
print(
    "Feature pairings with high correlations (|corr| > {:.2f}):".format(
        threshold
    )
)
for feature1, feature2, corr in sorted(
    high_corr_pairs, key=lambda x: -abs(x[2])
):
    print(f"{feature1:30s} x {feature2:30s}: {corr:+.2f}")

In [None]:
# Plotting the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(
    # sns.clustermap(  # clustermap clusters features based on correlation; can be uncommented
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap="vlag",
    center=0,
    square=True,
    linewidths=0.5,
    cbar_kws={"shrink": 0.8},
)
plt.title("Pairwise Pearson Correlation of Electrophysiological Features")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.show()

# Task 2

**2.1 Normalization and transformation.**

In [None]:
# exonCounts.shape = (n_cells, n_genes)
# genes = array of gene names
# cells = array of cell names (before the RNA quality filter)

# Compute total counts per cell
total_counts_per_cell = exonCounts.sum(axis=1)

# Avoid dividing by zero
total_counts_per_cell[total_counts_per_cell == 0] = 1

# Calculate CPM
cpm = (exonCounts.T / total_counts_per_cell).T * 1e6

# Log-transform
log_cpm = np.log1p(cpm)

# Create dataframe (rows: cells passing the RNA quality filter, columns: genes)
log_cpm_df = pd.DataFrame(
    log_cpm, index=cells[exclude_low_quality.values], columns=genes
)

# Save to CSV
log_cpm_df.to_csv("log_cpm_normalized.csv")

print("Normalization complete. Log-CPM shape:", log_cpm.shape)
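
CPM corrects only for sequencing depth. For read counts, longer genes also collect more reads, which RPKM additionally corrects by dividing by gene length. The sketch below shows the arithmetic on a toy matrix: RPKM = counts * 1e9 / (total counts per cell * gene length in bp). The function name `rpkm` and the toy arrays are hypothetical, and the sketch assumes all gene lengths are positive.

```python
import numpy as np


def rpkm(counts, gene_lengths_bp):
    """Reads per kilobase of gene length per million reads in the cell.

    counts          : (n_cells, n_genes) read counts
    gene_lengths_bp : (n_genes,) gene (exon) lengths in base pairs, all > 0
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths_bp, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)
    depth[depth == 0] = 1.0  # avoid division by zero for empty cells
    return counts / depth / lengths * 1e9


# toy example: in the first cell, counts are proportional to gene length
counts_demo = np.array([[10, 20, 70], [0, 50, 50]])
lengths_demo = np.array([1000, 2000, 7000])
rpkm_demo = rpkm(counts_demo, lengths_demo)
print(rpkm_demo)
```

In the first toy cell all three genes end up with the same RPKM, because their counts differ only through gene length. In the notebook the call would presumably be `rpkm(exonCounts, exonLengths)`, followed by the same log or sqrt transformation as for CPM.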

**2.2 Select genes & apply PCA.**
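
As a starting point, gene selection and PCA could be sketched as below. This is a simplified stand-in, not necessarily the selection rule from the lecture: it ranks genes by their variance in the transformed data and computes PCA through an SVD of the centered matrix. The names `select_top_variance_genes`, `pca_svd` and `X_demo` are hypothetical, and the demo data is synthetic.

```python
import numpy as np


def select_top_variance_genes(X, n_genes):
    """Indices of the n_genes columns with the highest variance."""
    variances = X.var(axis=0)
    return np.argsort(variances)[::-1][:n_genes]


def pca_svd(X, n_components):
    """PCA via SVD of the column-centered data matrix.

    Returns the scores, the principal axes (one per row) and the
    fraction of variance explained by each returned component.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, Vt[:n_components], explained[:n_components]


# synthetic demo: 50 "genes", of which the first five have high variance
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 50))
X_demo[:, :5] *= 10

idx = select_top_variance_genes(X_demo, 5)
scores, components, explained = pca_svd(X_demo[:, idx], 2)
print("Selected columns:", np.sort(idx))
print("Explained variance ratio:", explained)
```

With the real data, `X_demo` would be replaced by the normalized and transformed count matrix (for example `log_cpm`), and both `n_genes` and `n_components` are the parameters to vary when comparing visualizations.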