# Alzheimer's Disease scRNA-seq Analysis (Entorhinal Cortex)

**Author:** Olivia Mohning  
**Repository:** `scrnaseq`  
**Notebook:** `notebooks/01_build_anndata.ipynb`  
**Created:** 2025-08-07

---

## Description
This notebook loads, processes, and organizes single-cell RNA sequencing (scRNA-seq) data from the entorhinal cortex of Alzheimer's disease patients (dataset: **GSE138852**) into an **AnnData** object for downstream analysis using Scanpy and related tools.

---

## Objectives
1. Load processed counts and metadata  
2. Perform basic quality control (QC) and filtering  
3. Normalize and log-transform the data  
4. Conduct dimensionality reduction (PCA, UMAP)  
5. Identify cell clusters and marker genes  
6. Save processed AnnData object for later steps  

---

## Requirements
- Python >= 3.10  
- `scanpy`, `anndata`, `pandas`, `numpy`, `matplotlib`, `seaborn`  
- `scvi-tools`, `celltypist`  
- Bioinformatics tools: `samtools`, `bcftools`, `bedtools`, `bwa`, `blast` (optional)  

---

## Data Source
**GSE138852** – Human entorhinal cortex from aged individuals with Alzheimer's disease  
**Platform:** Illumina NextSeq 500  

---

## Notes
Raw FASTQ files are excluded from version control due to size limits.  
Processed counts and covariates are included in `data/raw/GSE138852`.  


In [2]:
# Step 1: Load Data
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module=r"louvain")
from pathlib import Path
import numpy as np
import pandas as pd
import scanpy as sc

# Auto-detect the repo root by walking up from the starting directory
def find_repo_root(start: Path | None = None) -> Path:
    # Resolve the default at call time; a Path.cwd() default would be frozen at definition time
    here = (start or Path.cwd()).resolve()
    for p in [here] + list(here.parents):
        if (p / "notebooks").is_dir() and (p / "data").is_dir() and (p / "README.md").exists():
            return p
    raise FileNotFoundError("Could not locate repo root. Are you running inside the 'scrnaseq' repo?")

repo_root = find_repo_root()

# Define paths (input files live under data/raw/GSE138852)
data_dir = repo_root / "data"
src_dir  = data_dir / "raw" / "GSE138852"
proc_dir = data_dir / "processed"
fig_dir  = repo_root / "figures"
res_dir  = repo_root / "results"

# Make sure output directories exist
for d in (proc_dir, fig_dir, res_dir):
    d.mkdir(parents=True, exist_ok=True)

# File paths
counts_path = src_dir / "GSE138852_counts.csv"
meta_path   = src_dir / "GSE138852_covariates.csv"

# Sanity checks
print("Repo root:", repo_root)
print("Counts path:", counts_path.exists(), counts_path)
print("Meta path:",   meta_path.exists(),   meta_path)


Repo root: /Users/oliviamohning/Documents/ds-portfolio/scrnaseq
Counts path: True /Users/oliviamohning/Documents/ds-portfolio/scrnaseq/data/raw/GSE138852/GSE138852_counts.csv
Meta path: True /Users/oliviamohning/Documents/ds-portfolio/scrnaseq/data/raw/GSE138852/GSE138852_covariates.csv
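
The root detection in Step 1 depends on three markers (`notebooks/`, `data/`, and `README.md`) sitting in the same directory. As a minimal sketch of how that lookup behaves, the block below rebuilds the same logic and runs it against a throwaway directory tree. The temporary layout and the names `fake_repo`, `found_from_notebooks`, and `found_from_nested` are assumptions made for illustration only; they are not part of the repository.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_repo_root(start: Optional[Path] = None) -> Path:
    """Walk upward from `start` until a directory holding all three markers is found."""
    here = (start or Path.cwd()).resolve()
    for candidate in (here, *here.parents):
        has_markers = (
            (candidate / "notebooks").is_dir()
            and (candidate / "data").is_dir()
            and (candidate / "README.md").exists()
        )
        if has_markers:
            return candidate
    raise FileNotFoundError("Could not locate repo root.")


# Build a temporary tree that mimics the expected layout (hypothetical paths).
with tempfile.TemporaryDirectory() as tmp:
    fake_repo = Path(tmp) / "scrnaseq"
    (fake_repo / "notebooks").mkdir(parents=True)
    (fake_repo / "data" / "raw").mkdir(parents=True)
    (fake_repo / "README.md").write_text("placeholder")

    # The search should land on the same root from any subdirectory.
    found_from_notebooks = find_repo_root(fake_repo / "notebooks")
    found_from_nested = find_repo_root(fake_repo / "data" / "raw")
    expected_root = fake_repo.resolve()
```

Because the search stops at the nearest ancestor that has all three markers, the notebook can be launched from `notebooks/` or from any deeper folder and still resolve the same paths.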


In [None]:
# Load data
counts_df = pd.read_csv(counts_path, index_col=0)
meta_df   = pd.read_csv(meta_path, index_col=0)

# Quick checks
print("Counts shape:", counts_df.shape)
print("Metadata shape:", meta_df.shape)

In [None]:
# Optional: inspect first few rows
display(counts_df.head(), meta_df.head())

In [2]:
# Step 2: Build AnnData from counts + metadata

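# Orient the matrix as cells x genes.
# Heuristic: this dataset has more cells than genes, so fewer rows than columns
# means the CSV is genes x cells and needs to be transposed.
if counts_df.shape[0] < counts_df.shape[1]:
    counts_df = counts_df.T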

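# Keep only cells present in both tables (order follows the counts table)
common_cells = counts_df.index.intersection(meta_df.index)
if common_cells.empty:
    raise ValueError("No cell IDs shared between counts and metadata; check the matrix orientation.")
counts_df = counts_df.loc[common_cells]
meta_df   = meta_df.loc[common_cells]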

# Build AnnData (cells x genes); obs/var names come from the DataFrame indexes
adata = sc.AnnData(
    X=counts_df.to_numpy(),
    obs=meta_df.copy(),
    var=pd.DataFrame(index=counts_df.columns),
)

# Basic sanity checks
print("AnnData shape (cells x genes):", adata.n_obs, "x", adata.n_vars)
print("Obs columns:", list(adata.obs.columns)[:8], "...")
print("First 3 cell IDs:", adata.obs_names[:3].tolist())

# Save raw AnnData
raw_path = proc_dir / "00_raw.h5ad"
adata.write_h5ad(raw_path, compression="gzip")
print("Saved:", raw_path)


AnnData shape (cells x genes): 13214 x 10850
Obs columns: ['oupSample.batchCond', 'oupSample.cellType', 'oupSample.cellType_batchCond', 'oupSample.subclustID', 'oupSample.subclustCond'] ...
First 3 cell IDs: ['AAACCTGGTAGAAAGG_AD5_AD6', 'AAACCTGGTAGCGATG_AD5_AD6', 'AAACCTGTCAGTCAGT_AD5_AD6']
Saved: /Users/oliviamohning/Documents/ds-portfolio/scrnaseq/data/processed/00_raw.h5ad
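
The shape comparison used in Step 2 to decide whether to transpose works here because this dataset has more cells than genes. One possible alternative, sketched below under that caveat, is to compare axis labels against the metadata index instead of comparing axis lengths. The helper `cells_on_columns` and the toy gene names and barcodes are hypothetical and exist only for this illustration.

```python
from typing import Iterable


def cells_on_columns(
    row_labels: Iterable[str],
    column_labels: Iterable[str],
    cell_ids: Iterable[str],
) -> bool:
    """Return True when the column labels match the known cell IDs better than the row labels."""
    known = set(cell_ids)
    row_hits = sum(label in known for label in row_labels)
    col_hits = sum(label in known for label in column_labels)
    if row_hits == 0 and col_hits == 0:
        raise ValueError("Neither axis shares labels with the metadata index.")
    return col_hits > row_hits


# Toy labels (made up for the example, not taken from GSE138852).
gene_names = ["APOE", "GFAP", "MBP"]
cell_barcodes = ["AAAC_AD1", "AAAG_AD1", "AAAT_Ct1", "AACC_Ct1"]

# Genes on rows, cells on columns: a transpose would be needed.
genes_by_cells = cells_on_columns(gene_names, cell_barcodes, cell_barcodes)
# Cells already on rows: no transpose needed.
cells_by_genes = cells_on_columns(cell_barcodes, gene_names, cell_barcodes)
```

In the notebook, this could be called as `cells_on_columns(counts_df.index, counts_df.columns, meta_df.index)`, transposing `counts_df` only when it returns `True`. The check stays valid for subsets with fewer cells than genes, where the shape heuristic would leave a genes x cells matrix untransposed.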
