# Alzheimer’s Disease snRNA-seq Analysis (Entorhinal Cortex)

**Author:** Olivia Mohning  
**Repository:** `scrnaseq`  
**Notebook:** `notebooks/01_data_ingestion_and_eda.ipynb`  
**Created:** 2025-08-07  

This notebook uses single-nucleus RNA-sequencing (snRNA-seq) data from the entorhinal cortex of Alzheimer’s disease and control brains (dataset **GSE138852**) to perform exploratory data analysis in Python. The workflow focuses on reproducible data ingestion, cleaning, integration of metadata, and visualization using the pandas data stack.  

### Objectives
- Load and align single-nucleus RNA-seq count matrices with associated metadata  
- Inspect dataset structure and cell-type composition  
- Perform exploratory data analysis (EDA) of gene expression across cell types and disease conditions (AD vs Control)  
- Visualize expression of key genes (e.g., APOE, CLU, GFAP) to identify early transcriptional patterns related to neurodegeneration  

**Data Source:** GSE138852 – Human entorhinal cortex nuclei from aged Alzheimer’s disease and control individuals  
**Platform:** Illumina NextSeq 500  
**Notes:** Processed count and metadata files are stored in `data/raw/GSE138852/` and loaded directly for analysis.  

Step 1: Import modules, suppress warnings

In [2]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module=r"louvain")
from pathlib import Path
import numpy as np
import pandas as pd
import scanpy as sc

Step 2: Auto-detect repo root by walking up from current dir

In [3]:
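def find_repo_root(start=None):
    # Resolve the starting directory at call time; a default of Path.cwd()
    # in the signature would be frozen when the function is defined.
    start = Path.cwd() if start is None else Path(start)
    # Check the starting directory first, then each parent in turn
    for p in [start, *start.parents]:
        if (p / "notebooks").is_dir() and (p / "data").is_dir() and (p / "README.md").exists():
            return p
    raise FileNotFoundError("Could not locate repo root. Are you running inside the 'scrnaseq' repo?")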

repo_root = find_repo_root()

Step 3: Define paths (updated to use data/raw/GSE138852)

In [4]:
data_dir = repo_root / "data"
src_dir  = data_dir / "raw" / "GSE138852"
proc_dir = data_dir / "processed"
fig_dir  = repo_root / "figures"
res_dir  = repo_root / "results"

Step 4: Make sure output directories exist

In [5]:
for d in (proc_dir, fig_dir, res_dir):
    d.mkdir(parents=True, exist_ok=True)

Step 5: Assign variables to file paths

In [6]:
counts_path = src_dir / "GSE138852_counts.csv"
meta_path   = src_dir / "GSE138852_covariates.csv"

Step 6: Sanity checks for repo root and file paths

In [7]:
print("Repo root:", repo_root)
print("Counts path:", counts_path.exists(), counts_path)
print("Meta path:",   meta_path.exists(),   meta_path)

Repo root: /Users/oliviamohning/Documents/ds-portfolio/scrnaseq
Counts path: True /Users/oliviamohning/Documents/ds-portfolio/scrnaseq/data/raw/GSE138852/GSE138852_counts.csv
Meta path: True /Users/oliviamohning/Documents/ds-portfolio/scrnaseq/data/raw/GSE138852/GSE138852_covariates.csv


Step 7: Load raw data and quick sanity checks

In [8]:
# Load data
counts_df = pd.read_csv(counts_path, index_col=0)
meta_df   = pd.read_csv(meta_path, index_col=0)

# Quick checks
print("Counts shape:", counts_df.shape)
print("Metadata shape:", meta_df.shape)

Counts shape: (10850, 13214)
Metadata shape: (13214, 5)


Step 8: Inspect first few rows of counts_df and meta_df

In [9]:
display(counts_df.head())
display(meta_df.head())

Unnamed: 0,AAACCTGGTAGAAAGG_AD5_AD6,AAACCTGGTAGCGATG_AD5_AD6,AAACCTGTCAGTCAGT_AD5_AD6,AAACCTGTCCAAACAC_AD5_AD6,AAACCTGTCCAGTATG_AD5_AD6,AAAGCAACATGGGAAC_AD5_AD6,AAAGCAAGTCGAATCT_AD5_AD6,AAAGCAAGTTTGTTGG_AD5_AD6,AAAGTAGGTAATCACC_AD5_AD6,AAAGTAGGTTCCACGG_AD5_AD6,...,TTTGGTTAGCCACGCT_AD1_AD2,TTTGGTTCAACTTGAC_AD1_AD2,TTTGGTTCAGCCTTTC_AD1_AD2,TTTGGTTCATCGGACC_AD1_AD2,TTTGGTTTCCCAGGTG_AD1_AD2,TTTGGTTTCCGTACAA_AD1_AD2,TTTGTCACAAGCCATT_AD1_AD2,TTTGTCAGTATAGGTA_AD1_AD2,TTTGTCATCCACTGGG_AD1_AD2,TTTGTCATCCGGGTGT_AD1_AD2
FO538757.2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
AP006222.2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
RP5-857K21.4,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
RP11-206L10.9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
NOC2L,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


Unnamed: 0,oupSample.batchCond,oupSample.cellType,oupSample.cellType_batchCond,oupSample.subclustID,oupSample.subclustCond
AAACCTGGTAGAAAGG_AD5_AD6,AD,oligo,oligo_AD,o3,AD
AAACCTGGTAGCGATG_AD5_AD6,AD,oligo,oligo_AD,o3,AD
AAACCTGTCAGTCAGT_AD5_AD6,AD,oligo,oligo_AD,o3,AD
AAACCTGTCCAAACAC_AD5_AD6,AD,oligo,oligo_AD,o3,AD
AAACCTGTCCAGTATG_AD5_AD6,AD,oligo,oligo_AD,o3,AD


Step 9: Transpose the counts matrix so that nuclei are rows and genes are columns

In [10]:
counts_df = counts_df.T
display(counts_df.head())

Unnamed: 0,FO538757.2,AP006222.2,RP5-857K21.4,RP11-206L10.9,NOC2L,HES4,ISG15,AGRN,C1orf159,SDF4,...,MT-ATP6,MT-CO3,MT-ND3,MT-ND4L,MT-ND4,MT-ND5,MT-CYB,AL592183.1,AC007325.4,AC007325.2
AAACCTGGTAGAAAGG_AD5_AD6,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
AAACCTGGTAGCGATG_AD5_AD6,0,0,0,0,0,0,0,0,0,0,...,3,2,4,0,6,0,0,0,0,0
AAACCTGTCAGTCAGT_AD5_AD6,0,0,0,0,0,0,0,0,0,0,...,0,3,3,0,2,1,1,1,0,0
AAACCTGTCCAAACAC_AD5_AD6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
AAACCTGTCCAGTATG_AD5_AD6,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0


Step 10: Keep only nuclei that exist in both counts and metadata

In [11]:
common_nuclei = counts_df.index.intersection(meta_df.index)
counts_df = counts_df.loc[common_nuclei]
meta_df = meta_df.loc[common_nuclei]
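
The intersection in Step 10 silently drops any nuclei that are missing from either table. A minimal sketch of how the overlap could be reported and the row order verified before merging is shown below. The frames `toy_counts` and `toy_meta`, the barcodes, the gene names, and the "Control" label are all invented for illustration and are not taken from GSE138852.

```python
import pandas as pd

# Toy stand-ins for counts_df (nuclei x genes) and meta_df; all values are invented.
toy_counts = pd.DataFrame(
    {"GENE_A": [0, 2, 1], "GENE_B": [1, 0, 0]},
    index=["cell1", "cell2", "cell3"],
)
toy_meta = pd.DataFrame(
    {"condition": ["AD", "Control", "AD"]},
    index=["cell2", "cell3", "cell4"],
)

# Nuclei present in both tables
common = toy_counts.index.intersection(toy_meta.index)

# Nuclei that the intersection discards on each side
only_in_counts = toy_counts.index.difference(toy_meta.index)
only_in_meta = toy_meta.index.difference(toy_counts.index)
print(
    f"shared: {len(common)}, "
    f"counts only: {len(only_in_counts)}, "
    f"metadata only: {len(only_in_meta)}"
)

aligned_counts = toy_counts.loc[common]
aligned_meta = toy_meta.loc[common]

# Both tables should list the same nuclei, in the same order, with no duplicates,
# before they are concatenated side by side.
assert aligned_counts.index.equals(aligned_meta.index)
assert aligned_counts.index.is_unique
```

In the real notebook the same checks would apply to `counts_df` and `meta_df`; with 13214 nuclei in both inputs, the expectation is that nothing is dropped, but the printout makes that explicit.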

Step 11: Merge counts and metadata side by side, plus sanity checks

In [15]:
# Merge meta_df and counts_df
merged_df = pd.concat([meta_df, counts_df], axis=1)

# Basic sanity checks
print("Merged shape (cells x [metadata + genes]):", merged_df.shape)
print("Metadata columns:", list(meta_df.columns))
print("Gene columns (first 5):", list(counts_df.columns[:5]))
print("First 3 cell IDs:", merged_df.index[:3].tolist())

# Viewing data
merged_df.head()

Merged shape (cells x [metadata + genes]): (13214, 10855)
Metadata columns: ['oupSample.batchCond', 'oupSample.cellType', 'oupSample.cellType_batchCond', 'oupSample.subclustID', 'oupSample.subclustCond']
Gene columns (first 5): ['FO538757.2', 'AP006222.2', 'RP5-857K21.4', 'RP11-206L10.9', 'NOC2L']
First 3 cell IDs: ['AAACCTGGTAGAAAGG_AD5_AD6', 'AAACCTGGTAGCGATG_AD5_AD6', 'AAACCTGTCAGTCAGT_AD5_AD6']


Unnamed: 0,oupSample.batchCond,oupSample.cellType,oupSample.cellType_batchCond,oupSample.subclustID,oupSample.subclustCond,FO538757.2,AP006222.2,RP5-857K21.4,RP11-206L10.9,NOC2L,...,MT-ATP6,MT-CO3,MT-ND3,MT-ND4L,MT-ND4,MT-ND5,MT-CYB,AL592183.1,AC007325.4,AC007325.2
AAACCTGGTAGAAAGG_AD5_AD6,AD,oligo,oligo_AD,o3,AD,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
AAACCTGGTAGCGATG_AD5_AD6,AD,oligo,oligo_AD,o3,AD,0,0,0,0,0,...,3,2,4,0,6,0,0,0,0,0
AAACCTGTCAGTCAGT_AD5_AD6,AD,oligo,oligo_AD,o3,AD,0,0,0,0,0,...,0,3,3,0,2,1,1,1,0,0
AAACCTGTCCAAACAC_AD5_AD6,AD,oligo,oligo_AD,o3,AD,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
AAACCTGTCCAGTATG_AD5_AD6,AD,oligo,oligo_AD,o3,AD,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0


Step 12: Save cleaned dataset

In [None]:
merged_path = proc_dir / "00_merged.csv"
merged_df.to_csv(merged_path)
print("Saved:", merged_path)

Saved: /Users/oliviamohning/Documents/ds-portfolio/scrnaseq/data/processed/00_merged.csv
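
When the saved file is read back in a later notebook, the nucleus barcodes only return as the index if `index_col=0` is passed. The following is a small round-trip sketch under that assumption; the frame `toy_merged`, the file name `toy_merged.csv`, and the values are hypothetical, and a temporary directory is used so nothing in the repo is touched.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for merged_df: one metadata column plus one gene column (invented values).
toy_merged = pd.DataFrame(
    {"oupSample.cellType": ["oligo", "astro"], "GENE_A": [0, 3]},
    index=["n1", "n2"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_merged.csv"
    toy_merged.to_csv(path)
    # index_col=0 restores the barcodes as the row index instead of a data column
    reloaded = pd.read_csv(path, index_col=0)

# Raises AssertionError if values, dtypes, or labels changed in the round trip
pd.testing.assert_frame_equal(toy_merged, reloaded)
print("Round trip OK:", reloaded.shape)
```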


Step 13: Check how many nuclei of each cell type are present

In [20]:
merged_df['oupSample.cellType'].value_counts()

oupSample.cellType
oligo      7432
astro      2171
OPC        1078
unID        925
neuron      656
mg          449
doublet     405
endo         98
Name: count, dtype: int64
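
The counts above pool both conditions. One way to compare composition between AD and control nuclei is a normalized cross-tabulation, sketched below on invented metadata. The column names mirror the notebook, but the rows and the "Control" label are assumptions, and excluding `doublet` and `unID` is an optional analysis choice rather than something this notebook does.

```python
import pandas as pd

# Toy metadata; column names follow the notebook, rows are invented.
toy_meta = pd.DataFrame({
    "oupSample.cellType": [
        "oligo", "oligo", "astro", "doublet",
        "astro", "oligo", "unID", "neuron",
    ],
    "oupSample.batchCond": [
        "AD", "AD", "AD", "AD",
        "Control", "Control", "Control", "Control",
    ],
})

# Optionally drop labels that do not correspond to a biological cell type.
keep = ~toy_meta["oupSample.cellType"].isin(["doublet", "unID"])
filtered = toy_meta.loc[keep]

# Fraction of each cell type within each condition (each column sums to 1).
composition = pd.crosstab(
    filtered["oupSample.cellType"],
    filtered["oupSample.batchCond"],
    normalize="columns",
)
print(composition.round(2))
```

Normalizing by column keeps the comparison fair when the two conditions contribute different numbers of nuclei.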

Step 14: Compute average gene expression per cell type

In [None]:
celltype_means = merged_df.groupby('oupSample.cellType').mean(numeric_only=True)
celltype_means.head()

oupSample.cellType,FO538757.2,AP006222.2,RP5-857K21.4,RP11-206L10.9,NOC2L,HES4,ISG15,AGRN,C1orf159,SDF4,...,MT-ATP6,MT-CO3,MT-ND3,MT-ND4L,MT-ND4,MT-ND5,MT-CYB,AL592183.1,AC007325.4,AC007325.2
OPC,0.12987,0.025974,0.19295,0.032468,0.021336,0.033395,0.056586,0.056586,0.022263,0.058442,...,0.194805,0.414657,0.185529,0.017625,0.397032,0.038961,0.218924,0.160482,0.046382,0.003711
astro,0.183326,0.047904,0.221557,0.032243,0.041916,0.094887,0.033625,0.053892,0.054353,0.071396,...,0.323814,0.614463,0.320129,0.049747,0.720405,0.07462,0.376785,0.122064,0.025795,0.085214
doublet,0.150617,0.037037,0.120988,0.019753,0.049383,0.024691,0.05679,0.076543,0.034568,0.091358,...,0.933333,1.733333,0.617284,0.083951,1.646914,0.14321,0.82963,0.133333,0.032099,0.012346
endo,0.132653,0.040816,0.132653,0.05102,0.040816,0.295918,0.081633,0.122449,0.020408,0.05102,...,0.714286,1.285714,0.816327,0.091837,1.132653,0.112245,0.734694,0.122449,0.030612,0.010204
mg,0.089087,0.026726,0.10245,0.022272,0.024499,0.002227,0.013363,0.013363,0.020045,0.040089,...,0.207127,0.398664,0.178174,0.028953,0.447661,0.028953,0.256125,0.095768,0.006682,0.0
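
The means above are taken over raw counts, so they partly reflect how deeply each nucleus was sequenced. A common convention, used here as an assumption rather than as this notebook's method, is to scale each nucleus to 10,000 total counts and apply `log1p` before averaging. The sketch below does that on invented counts and then summarizes the genes named in the objectives (APOE, CLU, GFAP) by cell type and condition. The values and the "Control" label are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy counts (nuclei x genes); gene names are real, the values are invented.
toy_counts = pd.DataFrame(
    {"APOE": [4, 0, 1, 0], "CLU": [2, 1, 0, 0], "GFAP": [4, 1, 1, 2]},
    index=["n1", "n2", "n3", "n4"],
)
toy_meta = pd.DataFrame(
    {
        "oupSample.cellType": ["astro", "astro", "oligo", "oligo"],
        "oupSample.batchCond": ["AD", "Control", "AD", "Control"],
    },
    index=toy_counts.index,
)

# Scale each nucleus to 10,000 total counts, then log-transform.
# (With real data, nuclei with zero total counts would need to be removed first.)
depth = toy_counts.sum(axis=1)
log_norm = np.log1p(toy_counts.div(depth, axis=0) * 1e4)

# Mean normalized expression per (cell type, condition) for the genes of interest.
key_genes = ["APOE", "CLU", "GFAP"]
summary = (
    log_norm[key_genes]
    .join(toy_meta)
    .groupby(["oupSample.cellType", "oupSample.batchCond"])[key_genes]
    .mean()
)
print(summary.round(2))
```

In the real data the depth would be computed over all 10850 genes before selecting the key genes, which is why normalization comes before the column subset here.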

