# Example notebook

In [None]:
%load_ext autoreload
%autoreload 2

import os

while not os.path.exists("pyproject.toml"):
    os.chdir("..")

In [None]:
import scanpy as sc
import nichepca as npc

## Load data

Your AnnData object is expected to contain raw counts in `adata.X`.

In [None]:
adata = sc.read_h5ad("path/to/your/data.h5ad")
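
Because the workflow expects raw counts, a quick sanity check before running it can catch data that is already normalized or log-transformed. The helper below is a minimal sketch, not part of `nichepca`; the name `looks_like_raw_counts` and the `max_values` cutoff are illustrative assumptions, and inspecting only a prefix of the values is a heuristic rather than a guarantee.

```python
from itertools import islice


def looks_like_raw_counts(values, max_values=10_000):
    """Return True if the first `max_values` entries are non-negative integers.

    `values` can be any iterable of numbers, e.g. a flattened matrix.
    """
    # Only inspect a prefix so the check stays cheap on large matrices.
    for v in islice(values, max_values):
        v = float(v)
        if v < 0 or not v.is_integer():
            return False
    return True


print(looks_like_raw_counts([0, 3, 1, 0, 12]))   # integer counts
print(looks_like_raw_counts([0.0, 1.38, 0.69]))  # log-transformed values
```

With real data you could pass `adata.X.data` if `adata.X` is a SciPy sparse matrix, or `adata.X.ravel()` if it is a dense array.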

## Standard pipeline

We found that a higher number of neighbors (e.g., `knn=25`) leads to better results in brain tissue, while `knn=10` works well for kidney data. We recommend qualitatively optimizing these parameters on a small subset of your data. The number of PCs (`n_comps=30` by default) seems to have a negligible effect on the results.

In [None]:
npc.wf.nichepca(adata, knn=25)
sc.pp.neighbors(adata, use_rep="X_npca")
sc.tl.leiden(adata, resolution=0.5, flavor="igraph", n_iterations=2)
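
One way to organize the recommended parameter search is a small sweep over candidate `knn` values. The sketch below is only a scaffold: `sweep_knn`, `count_clusters`, and `fake_run` are hypothetical names, and the stand-in run function returns made-up labels so the example works without any data. The number of clusters is a crude summary; inspecting the resulting domains visually remains the actual qualitative check.

```python
def sweep_knn(run_fn, knn_values):
    """Call `run_fn(knn)` for each candidate and collect the results by knn."""
    return {knn: run_fn(knn) for knn in knn_values}


def count_clusters(labels):
    """Number of distinct cluster labels, a crude first summary of a run."""
    return len(set(labels))


# Stand-in for a real run; replace with a function that calls nichepca.
def fake_run(knn):
    return ["0", "1", "2"] if knn < 20 else ["0", "1"]


# With real data, run_fn could look like this
# (adata_sub is a hypothetical small subset of your data):
# def run_fn(knn):
#     ad = adata_sub.copy()
#     npc.wf.nichepca(ad, knn=knn)
#     sc.pp.neighbors(ad, use_rep="X_npca")
#     sc.tl.leiden(ad, resolution=0.5, flavor="igraph", n_iterations=2)
#     return list(ad.obs["leiden"])

results = sweep_knn(fake_run, [10, 25])
summary = {knn: count_clusters(labels) for knn, labels in results.items()}
print(summary)
```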

## Multi-sample domain identification

If you have multiple samples in `adata.obs["sample"]`, you can pass the key `sample` to `npc.wf.nichepca` via `sample_key`. This uses Harmony by default:

In [None]:
npc.wf.nichepca(adata, knn=25, sample_key="sample")

If you have cell type labels in `adata.obs["cell_type"]`, you can pass them directly to `nichepca` via `obs_key`, as follows (we found this sometimes works better for multi-sample domain identification). In this case, however, we need to run `npc.cl.leiden_unique` to handle potential duplicate embeddings:

In [None]:
npc.wf.nichepca(adata, knn=25, obs_key="cell_type", sample_key="sample")
npc.cl.leiden_unique(adata, use_rep="X_npca", resolution=0.5, n_neighbors=15)
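
To see how many duplicate embeddings your own data actually contains, you can count identical rows. The helper below is an illustrative sketch and not part of `nichepca`; the name `count_duplicate_rows` and the rounding to `decimals` places are assumptions made for this example.

```python
from collections import Counter


def count_duplicate_rows(rows, decimals=6):
    """Count rows that share their (rounded) values with an earlier row."""
    seen = Counter(
        tuple(round(float(x), decimals) for x in row) for row in rows
    )
    # Every occurrence beyond the first one counts as a duplicate.
    return sum(n - 1 for n in seen.values())


embedding = [
    [0.5, 1.0],
    [0.5, 1.0],  # exact duplicate of the first row
    [0.2, 0.3],
    [0.5, 1.0],  # another duplicate
]
print(count_duplicate_rows(embedding))
```

On real data, the rows would come from `adata.obsm["X_npca"]`. The pure-Python loop is slow for large datasets, so checking a slice of the rows first may be more practical.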

## Run custom pipelines

The `nichepca` function also lets you customize the original `("norm", "log1p", "agg", "pca")` pipeline, e.g., without median normalization:

In [None]:
npc.wf.nichepca(adata, knn=25, pipeline=["log1p", "agg", "pca"])

or with `"pca"` before `"agg"`:

In [None]:
npc.wf.nichepca(adata, knn=25, pipeline=["norm", "log1p", "pca", "agg"])
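
When comparing several pipeline variants, it can help to keep them in one place and catch misspelled step names before starting a long run. The following is a minimal sketch under the assumption that only the four step names mentioned above are in use; `nichepca` may accept additional steps, and `check_pipeline` and `variants` are hypothetical names, not part of the package.

```python
# Step names taken from the text above; nichepca may accept others.
KNOWN_STEPS = ("norm", "log1p", "agg", "pca")


def check_pipeline(pipeline, known_steps=KNOWN_STEPS):
    """Return the pipeline as a list, or raise ValueError on unknown names."""
    unknown = [step for step in pipeline if step not in known_steps]
    if unknown:
        raise ValueError(f"Unknown pipeline steps: {unknown}")
    return list(pipeline)


variants = {
    "original": ["norm", "log1p", "agg", "pca"],
    "no_norm": ["log1p", "agg", "pca"],
    "pca_before_agg": ["norm", "log1p", "pca", "agg"],
}

for name, steps in variants.items():
    print(name, check_pipeline(steps))
```

Each validated list can then be passed as `pipeline=` to `npc.wf.nichepca`, ideally on a copy of the data so that the variants do not overwrite each other's results.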