# Xenium Human Lung Preview: Data Import & Matrix Construction

This notebook processes the raw 10x Genomics *Xenium Human Lung Preview* (FFPE, non-diseased) dataset.  
It ingests the original **cells** and **transcripts** tables, detects spatial coordinates, and builds a sparse **cell × gene matrix** in the form of an `AnnData` object.  

**Workflow steps:**
1. **Environment setup** – imports, paths, plotting defaults, random seed.  
2. **Cell metadata** – load `cells.parquet`, decode strings, detect XY, prepare `obs`.  
3. **Gene features** – parse `features.tsv.gz` to exclude non-gene controls.  
4. **Matrix construction** – two-pass build of sparse counts from `transcripts.parquet` (QV ≥ 20, in-cell, genes only).  
5. **AnnData creation** – assemble `X`, `obs`, `var`, QC metrics, and spatial coordinates.  
6. **Save outputs** – write `adata_raw.h5ad` and a QC metrics CSV for downstream analysis.  

> **Note:** This notebook ends by saving `adata_raw.h5ad`.  
> All normalization, clustering, visualization, and marker analysis are performed in **Notebook 2 (02_xenium_downstream.ipynb)**.


## Environment setup and imports

This section loads all required Python packages, sets a fixed random seed for reproducibility, configures paths for data, results, and figures, and applies plotting defaults. It also ensures the expected Xenium dataset folder exists before continuing.

In [2]:
# =========================
# Standard library imports
# =========================
from __future__ import annotations  
import gc
import gzip
import os
import sys
from datetime import datetime
import pathlib 

# ======================
# Third-party libraries
# ======================
import anndata as ad
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import scanpy as sc
import seaborn as sns
import squidpy as sq
from matplotlib.colors import LogNorm, PowerNorm
from scipy.sparse import coo_matrix, csr_matrix, issparse



# =====================================
# Set global seed for reproducibility
# =====================================
import random
SEED = 123
random.seed(SEED)
np.random.seed(SEED)



In [3]:
# Optional: quiet a dask warning some packages emit
try:
    import dask
    dask.config.set({"dataframe.query-planning": True})
except Exception:
    pass

# Paths (assumes this notebook lives in repo/notebooks or repo/)
ROOT = pathlib.Path.cwd()
if ROOT.name == "notebooks":
    ROOT = ROOT.parents[0]
DATA = ROOT / "data"
UNPACKED = DATA / "unpacked"
RESULTS = ROOT / "results"
RESULTS.mkdir(exist_ok=True)
FIGS = RESULTS / "figures"
FIGS.mkdir(parents=True, exist_ok=True)

# Plot look
sc.set_figure_params(dpi=120, frameon=False)
sns.set_context("talk", font_scale=0.9)

# Find the unpacked dataset folder (created by your unpack script)
FOLDER = UNPACKED / "xenium_preview_human_non_diseased_lung_with_add_on_ffpe"
assert FOLDER.exists(), f"Expected folder not found: {FOLDER}"
print(f"[INFO] Using dataset folder: {FOLDER}")

[INFO] Using dataset folder: /home/juliors/Documents/SPATIAL-OMICS/xenium-raw-data-analysis-workflow/data/unpacked/xenium_preview_human_non_diseased_lung_with_add_on_ffpe
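The root-resolution rule used in the cell above (step up one level when the notebook runs from `notebooks/`) can be isolated into a small helper, which makes it easy to check without touching the file system. This is a minimal sketch; `resolve_root` and `project_paths` are hypothetical names that do not appear in the notebook.

```python
from pathlib import Path


def resolve_root(cwd: Path) -> Path:
    """Return the repo root: the parent of `cwd` if `cwd` is named 'notebooks', else `cwd`."""
    return cwd.parent if cwd.name == "notebooks" else cwd


def project_paths(cwd: Path) -> dict:
    """Derive the directory layout the notebook assumes (no directories are created here)."""
    root = resolve_root(cwd)
    results = root / "results"
    return {
        "root": root,
        "unpacked": root / "data" / "unpacked",
        "results": results,
        "figures": results / "figures",
    }


# Hypothetical working directory, used only for illustration
paths = project_paths(Path("/tmp/repo/notebooks"))
print(paths["figures"])
```

Keeping the derivation free of side effects means `mkdir` calls can stay in the notebook cell, where a failure is visible immediately.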


## Load cell coordinates and build cell × gene matrix

This section:

1. **Loads cell metadata** from `cells.parquet`, decoding byte strings, detecting XY coordinate columns, and keeping key spatial/shape attributes.
2. **Generates a quick spatial density plot** (black theme, light grid) to verify tissue coverage.
3. **Streams the transcript data** from `transcripts.parquet` in two passes:
   - **Pass 1**: inventory of genes meeting quality and control filters.
   - **Pass 2**: aggregate counts per `(cell_id, gene)` pair into a sparse matrix.
4. **Constructs an AnnData object** with:
   - Raw counts in `.layers["raw_counts"]`
   - QC metrics (`total_counts`, `n_genes_by_counts`)
   - Spatial coordinates in `.obsm["spatial"]`.

## Load and visualize cell metadata

This step reads `cells.parquet`, decodes byte-string columns, detects and standardizes the XY coordinate columns, and keeps the essential metadata for downstream analysis.  
It also generates a quick 2D density plot (dark theme), saved under `results/figures/`, to check the spatial coverage of all cells.
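The decoding and coordinate-detection logic can be tried on a tiny synthetic table before running it on the full file. The sketch below assumes the same candidate column names as the notebook; `decode_bytes` and `first_match` are hypothetical helper names, and the data values are made up.

```python
import pandas as pd

X_CANDIDATES = ["x_location", "x_centroid", "center_x", "x"]
Y_CANDIDATES = ["y_location", "y_centroid", "center_y", "y"]


def decode_bytes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where bytes values in object columns are decoded to str."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode("utf-8", "ignore") if isinstance(v, (bytes, bytearray)) else v
            )
    return out


def first_match(columns, candidates):
    """Case-insensitive lookup: return the original name of the first candidate present."""
    lower_to_orig = {c.lower(): c for c in columns}
    return next((lower_to_orig[c] for c in candidates if c in lower_to_orig), None)


# Synthetic stand-in for cells.parquet (values are made up)
toy = pd.DataFrame(
    {
        "cell_id": [b"aaaa-1", b"bbbb-1"],
        "X_Centroid": [10.5, 20.0],
        "y_centroid": [1.0, 2.5],
    }
)
toy = decode_bytes(toy)
x_col = first_match(toy.columns, X_CANDIDATES)
y_col = first_match(toy.columns, Y_CANDIDATES)
toy = toy.rename(columns={x_col: "x_location", y_col: "y_location"})
print(x_col, y_col, toy["cell_id"].tolist())
```

Because candidates are tried in list order, a table that contains both `x_location` and `x_centroid` resolves to `x_location`.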

In [4]:

# Read the cells table (arrow→pandas)
cells_path = FOLDER / "cells.parquet"
assert cells_path.exists(), f"Missing {cells_path}"
cells_df = pq.read_table(cells_path).to_pandas()

# Decode byte columns to strings (10x often stores 'binary' columns)
for col in cells_df.columns:
    if cells_df[col].dtype == object:
        cells_df[col] = cells_df[col].apply(
            lambda x: x.decode("utf-8", "ignore") if isinstance(x, (bytes, bytearray)) else x
        )

# Ensure we have cell IDs
assert "cell_id" in cells_df.columns, f"'cell_id' missing; saw: {cells_df.columns.tolist()}"

# Robust XY detection (handle multiple historical names)
lower_to_orig = {c.lower(): c for c in cells_df.columns}
x_candidates = ["x_location","x_centroid","center_x","x"]
y_candidates = ["y_location","y_centroid","center_y","y"]

x_col = next((lower_to_orig[c] for c in x_candidates if c in lower_to_orig), None)
y_col = next((lower_to_orig[c] for c in y_candidates if c in lower_to_orig), None)
assert x_col and y_col, f"Could not find XY in cells.parquet. Columns: {cells_df.columns.tolist()}"

# Standardize names for downstream tools
cells_df = cells_df.rename(columns={x_col: "x_location", y_col: "y_location"})
cells_df["x_location"] = pd.to_numeric(cells_df["x_location"], errors="coerce")
cells_df["y_location"] = pd.to_numeric(cells_df["y_location"], errors="coerce")

# Keep a clean obs table (you can add more columns if you like)
keep_cols = ["cell_id","x_location","y_location"]
for extra in ["fov_name","area","nucleus_area"]:
    if extra in cells_df.columns:
        keep_cols.append(extra)

cells_df = cells_df[keep_cols].set_index("cell_id", drop=False).sort_index()
print(f"[LOAD] cells.parquet → {len(cells_df):,} cells; coords = ({x_col}, {y_col})")

# Quick sanity figure — black theme
plt.style.use("dark_background")
fig, ax = plt.subplots(figsize=(5.6, 5.6))

# Plot density
h = ax.hist2d(
    cells_df["x_location"], cells_df["y_location"], bins=300, cmap="viridis"
)

ax.invert_yaxis()
ax.set_aspect("equal")
ax.set_title("Spatial: cell density (all cells)", color="white")

# Light gray, thin grid every ~3000 units
ax.set_xticks(range(0, int(cells_df["x_location"].max()), 3000))
ax.set_yticks(range(0, int(cells_df["y_location"].max()), 3000))
ax.grid(color="gray", alpha=0.3, linewidth=0.8)

plt.tight_layout()
plt.savefig(FIGS / "00_spatial_density_all_cells_black.png", dpi=200)
plt.close()


[LOAD] cells.parquet → 295,883 cells; coords = (x_centroid, y_centroid)
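The density figure is checked by eye. A numeric companion, sketched below on synthetic coordinates, reports how many coordinates are non-finite (for example after `errors="coerce"`) and what fraction of 2D bins contains at least one cell. The helper name `coverage_summary` and the bin count are assumptions, not part of the notebook.

```python
import numpy as np


def coverage_summary(x, y, bins=50):
    """Summarize spatial coverage: invalid-coordinate count and fraction of non-empty 2D bins."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = np.isfinite(x) & np.isfinite(y)
    counts, _, _ = np.histogram2d(x[valid], y[valid], bins=bins)
    return {
        "n_cells": int(x.size),
        "n_invalid": int((~valid).sum()),
        "occupied_fraction": float((counts > 0).mean()),
    }


# Synthetic coordinates: uniform over a square, plus one missing value
rng = np.random.default_rng(123)
xs = rng.uniform(0, 1000, size=5000)
ys = rng.uniform(0, 1000, size=5000)
xs[0] = np.nan
summary = coverage_summary(xs, ys, bins=10)
print(summary)
```

On real tissue the occupied fraction is expected to be well below 1, since the bounding box includes empty slide area; the number is mainly useful for comparing runs.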


## Build filtered cell × gene count matrix

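Generate a sparse cell × gene matrix from `transcripts.parquet`, keeping only transcripts with QV ≥ 20 that are assigned to a known cell and whose feature is listed as `Gene Expression` in `features.tsv.gz` (controls excluded). The filters are intended to mirror 10x's `cell_feature_matrix` output.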
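The three filters can be illustrated on a handful of synthetic rows. In this sketch the feature table is an in-memory gzip stream standing in for `features.tsv.gz`, and all identifiers, gene names, and QV values are made up; `UNASSIGNED` is used here only as a placeholder for a transcript that falls outside every cell.

```python
import gzip
import io

import pandas as pd

# Synthetic features.tsv.gz content (three columns, no header); names are made up
raw_tsv = (
    "ENSG0001\tGENE_A\tGene Expression\n"
    "ENSG0002\tGENE_B\tGene Expression\n"
    "NegControlProbe_0001\tNegControlProbe_0001\tNegative Control Probe\n"
)
buf = io.BytesIO(gzip.compress(raw_tsv.encode("utf-8")))
with gzip.open(buf, "rt") as fh:
    feats = pd.read_csv(
        fh, sep="\t", header=None, names=["ensembl_id", "feature_name", "feature_type"]
    )
gene_name_set = set(feats.loc[feats["feature_type"] == "Gene Expression", "feature_name"])

# Synthetic transcripts: one row per detected molecule
tx = pd.DataFrame(
    {
        "cell_id": ["c1", "c1", "c2", "UNASSIGNED", "c2"],
        "feature_name": ["GENE_A", "GENE_B", "GENE_A", "GENE_A", "NegControlProbe_0001"],
        "qv": [35.0, 12.0, 20.0, 40.0, 40.0],
    }
)
known_cells = {"c1", "c2"}

mask = (
    (tx["qv"] >= 20)                            # quality threshold (inclusive)
    & tx["cell_id"].isin(known_cells)           # in-cell only
    & tx["feature_name"].isin(gene_name_set)    # genes only, controls dropped
)
kept = tx[mask]
print(kept)
```

Only the first and third rows survive: the second fails the QV threshold, the fourth is not in a known cell, and the fifth is a control probe.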


In [5]:
# === Build a cell × gene matrix from transcripts.parquet (QV≥20, genes-only, in-cell) ===
# Keeps behavior consistent with 10x cell_feature_matrix.  (QV≥20; exclude controls)

# ---- paths ----
tx_path = FOLDER / "transcripts.parquet"
cfm_dir = FOLDER / "cell_feature_matrix"           # contains features.tsv.gz
feat_tsv = cfm_dir / "features.tsv.gz"
assert tx_path.exists(), f"Missing {tx_path}"
assert feat_tsv.exists(), f"Missing {feat_tsv} (needed to drop controls)."

# ---- load 'Gene Expression' feature names from features.tsv.gz ----
# columns: [ensembl_id, feature_name, feature_type]
with gzip.open(feat_tsv, "rt") as fh:
    feats = pd.read_csv(fh, sep="\t", header=None, names=["ensembl_id","feature_name","feature_type"])
gene_names = pd.Index(feats.loc[feats["feature_type"]=="Gene Expression","feature_name"].astype(str).unique())
gene_name_set = set(gene_names)

# ---- dataset & schema ----
dataset = ds.dataset(tx_path)
cols = {c.lower(): c for c in dataset.schema.names}
GENE_COL = next((cols[k] for k in ["feature_name","gene","gene_name","target"] if k in cols), None)
CELL_COL = cols.get("cell_id")
QV_COL   = next((cols[k] for k in ["qv","quality","q"] if k in cols), None)
assert GENE_COL and CELL_COL and QV_COL, f"Missing required columns; saw: {list(cols.values())}"

def bytes_to_str(s):
    if s.dtype == object:
        return s.apply(lambda x: x.decode("utf-8","ignore") if isinstance(x,(bytes,bytearray)) else x)
    return s

# ---- Pass 1: gene inventory under filters ----
PASS1_BATCH = 5_000_000
gene_totals = {}
# Build the cell-ID lookup once instead of re-casting the index for every batch
known_cell_ids = set(cells_df.index.astype(str))

need_p1 = [GENE_COL, CELL_COL, QV_COL]
for i, batch in enumerate(dataset.to_batches(columns=need_p1, batch_size=PASS1_BATCH), start=1):
    df = batch.to_pandas()

    # normalize dtypes
    df[GENE_COL] = bytes_to_str(df[GENE_COL]).astype(str)
    # CRITICAL: make cell_id dtype match cells_df.index (decode bytes first, then cast to string;
    # casting raw bytes with astype(str) would yield "b'...'" and match no cell)
    df[CELL_COL] = bytes_to_str(df[CELL_COL]).astype(str)

    # filters: QV≥20, in known cells, genes only (exclude controls)
    df = df[(df[QV_COL] >= 20) & (df[CELL_COL].isin(known_cell_ids)) & (df[GENE_COL].isin(gene_name_set))]
    if df.empty:
        del df; gc.collect(); continue

    vc = df[GENE_COL].value_counts()
    for g, n in vc.items():
        gene_totals[g] = gene_totals.get(g, 0) + int(n)

    if i % 4 == 0:
        print(f"[pass1] batches={i} (genes so far: {len(gene_totals):,})")
    del df, vc
    gc.collect()

genes = pd.Index(sorted(gene_totals, key=gene_totals.get, reverse=True), name="gene")
gene_to_col = {g:i for i, g in enumerate(genes)}
print(f"[pass1] unique genes: {len(genes):,}")

# ---- allocate target matrix ----
ordered_cell_ids = cells_df.index.astype(str).tolist()  # ensure string
cell_to_row = {cid:i for i, cid in enumerate(ordered_cell_ids)}
n_cells, n_genes = len(ordered_cell_ids), len(genes)
X = csr_matrix((n_cells, n_genes), dtype=np.int32)
print(f"[SHAPE] target matrix: {n_cells:,} cells × {n_genes:,} genes")

# ---- Pass 2: aggregate (cell_id, gene) counts with same filters ----
PASS2_BATCH = 2_000_000
need_p2 = [CELL_COL, GENE_COL, QV_COL]

for i, batch in enumerate(dataset.to_batches(columns=need_p2, batch_size=PASS2_BATCH), start=1):
    df = batch.to_pandas()

    df[GENE_COL] = bytes_to_str(df[GENE_COL]).astype(str)
    # Decode bytes before casting so cell IDs compare equal to the keys of cell_to_row
    df[CELL_COL] = bytes_to_str(df[CELL_COL]).astype(str)

    df = df[(df[QV_COL] >= 20) &
            (df[CELL_COL].isin(cell_to_row)) &               # faster than isin(cells_df.index) now
            (df[GENE_COL].isin(gene_name_set))]
    if df.empty:
        del df; gc.collect(); continue

    grp = df.groupby([CELL_COL, GENE_COL]).size().astype(np.int32)
    del df

    rows, cols_, data = [], [], []
    ar, ac, av = rows.append, cols_.append, data.append
    for (cid, g), n in grp.items():
        r = cell_to_row.get(cid)
        c = gene_to_col.get(g)
        if r is not None and c is not None:
            ar(r); ac(c); av(int(n))
    del grp

    if rows:
        coo = coo_matrix((data, (rows, cols_)), shape=(n_cells, n_genes), dtype=np.int32).tocsr()
        X += coo
        del coo, rows, cols_, data

    if i % 4 == 0:
        print(f"[pass2] batches={i:>3}  nnz={X.nnz:,}")
    gc.collect()

print(f"[BUILD] sparse matrix complete: shape={X.shape}, nnz={X.nnz:,}")


[pass1] batches=4 (genes so far: 392)
[pass1] batches=8 (genes so far: 392)
[pass1] batches=12 (genes so far: 392)
[pass1] batches=16 (genes so far: 392)
[pass1] batches=20 (genes so far: 392)
[pass1] batches=24 (genes so far: 392)
[pass1] batches=28 (genes so far: 392)
[pass1] unique genes: 392
[SHAPE] target matrix: 295,883 cells × 392 genes
[pass2] batches=  4  nnz=2,054,020
[pass2] batches=  8  nnz=4,096,753
[pass2] batches= 12  nnz=6,115,699
[pass2] batches= 16  nnz=8,086,845
[pass2] batches= 20  nnz=10,110,691
[pass2] batches= 24  nnz=12,180,785
[pass2] batches= 28  nnz=14,240,040
[BUILD] sparse matrix complete: shape=(295883, 392), nnz=14,544,817
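The two-pass structure can be reproduced on a few in-memory chunks. The sketch below also swaps the per-pair Python loop for a vectorized lookup via `pandas.Index.get_indexer` and builds the matrix once from concatenated COO triplets, relying on duplicate entries being summed on conversion to CSR. This is an alternative formulation under those assumptions, not the notebook's own code; the chunks and the `iter_chunks` helper are made up, and filtering is assumed to have been applied already.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix


def iter_chunks():
    """Yield synthetic, already-filtered transcript chunks (stand-in for dataset.to_batches)."""
    yield pd.DataFrame({"cell_id": ["c1", "c1", "c2"], "gene": ["A", "A", "B"]})
    yield pd.DataFrame({"cell_id": ["c2", "c1", "c3"], "gene": ["B", "A", "A"]})


# Pass 1: gene inventory (total counts per gene across all chunks)
gene_totals = {}
for chunk in iter_chunks():
    for gene, n in chunk["gene"].value_counts().items():
        gene_totals[gene] = gene_totals.get(gene, 0) + int(n)
genes = pd.Index(sorted(gene_totals, key=gene_totals.get, reverse=True), name="gene")

cells = pd.Index(["c1", "c2", "c3"], name="cell_id")

# Pass 2: per-chunk (cell, gene) counts mapped to integer positions without a Python loop
rows, cols, vals = [], [], []
for chunk in iter_chunks():
    grp = chunk.groupby(["cell_id", "gene"]).size()
    r = cells.get_indexer(grp.index.get_level_values("cell_id"))
    c = genes.get_indexer(grp.index.get_level_values("gene"))
    ok = (r >= 0) & (c >= 0)  # get_indexer returns -1 for unknown labels
    rows.append(r[ok])
    cols.append(c[ok])
    vals.append(grp.to_numpy()[ok])

# Duplicate (row, col) entries from different chunks are summed by tocsr()
X = coo_matrix(
    (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
    shape=(len(cells), len(genes)),
    dtype=np.int32,
).tocsr()

print(genes.tolist())
print(X.toarray())
```

A cheap consistency check after the real build is that `X.sum()` equals `sum(gene_totals.values())`, since both passes apply the same filters.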


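## Create and save AnnData object with QC metrics

Initialize an `AnnData` object from the cell × gene matrix, store the raw counts in a layer, compute basic per-cell QC metrics, and attach spatial coordinates. The object is then written to `adata_raw.h5ad`, alongside a QC metrics CSV.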
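The two QC metrics are simple row-wise reductions of the sparse count matrix. A minimal sketch on a synthetic 3 × 4 matrix (values made up) shows what each one measures:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Synthetic 3 cells × 4 genes count matrix
X = csr_matrix(np.array([[2, 0, 1, 0],
                         [0, 0, 0, 0],
                         [5, 3, 0, 1]], dtype=np.int32))

# Sum of counts per cell; .sum(axis=1) returns an (n, 1) matrix, so flatten it
total_counts = np.asarray(X.sum(axis=1)).ravel()
# Number of genes with at least one count per cell
n_genes_by_counts = np.asarray((X > 0).sum(axis=1)).ravel()

print(total_counts, n_genes_by_counts)
```

Cells with `total_counts == 0`, like the second row here, remain in the matrix at this stage, so any filtering on these metrics is left to the downstream notebook.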


In [6]:
adata = ad.AnnData(
    X=X,                                       # sparse counts
    obs=cells_df.loc[ordered_cell_ids],        # row metadata (cells)
    var=pd.DataFrame(index=genes),             # col metadata (genes)
)

# Keep raw counts before normalization
adata.layers["raw_counts"] = adata.X.copy()

# QC metrics and spatial coordinates
adata.obs["total_counts"] = np.asarray(adata.X.sum(axis=1)).ravel()
adata.obs["n_genes_by_counts"] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obsm["spatial"] = adata.obs[["x_location","y_location"]].to_numpy()

print(f"[ANNData] {adata.n_obs:,} cells × {adata.n_vars:,} genes (nnz={adata.X.nnz:,})")

# --- Save handoff objects ---

# Save AnnData with raw counts + QC metrics
adata.write_h5ad(RESULTS / "adata_raw.h5ad", compression="lzf")

# Save a quick QC table (optional, lightweight summary)
adata.obs[["total_counts","n_genes_by_counts"]].to_csv(
    RESULTS / "qc_metrics.csv", index=True
)

print(f"[SAVE] wrote AnnData with {adata.n_obs:,} cells × {adata.n_vars:,} genes → {RESULTS/'adata_raw.h5ad'}")


[ANNData] 295,883 cells × 392 genes (nnz=14,544,817)
[SAVE] wrote AnnData with 295,883 cells × 392 genes → /home/juliors/Documents/SPATIAL-OMICS/xenium-raw-data-analysis-workflow/results/adata_raw.h5ad
