# Mapping data to a subset of the Whole Mouse Brain taxonomy

Navigate to this link: https://alleninstitute.github.io/abc_atlas_access/intro.html and decide on the reference you wish to use. 

For this tutorial, we will be using the "Mouse whole-brain transcriptomic cell type atlas (Hongkui Zeng)" and specifically the 10xv3 dataset: https://alleninstitute.github.io/abc_atlas_access/descriptions/WMB-10Xv3.html

## A. Downloading the data
### You need internet access to run the chunks below
You need to download: 
- Expression matrices (.h5ad files containing raw data) - in our example there will be one object per brain region
- Cell metadata files 
- Taxonomy files

The simplest way to do this is to navigate to the corresponding page for each of those file categories in the AWS S3 explorer: https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html 

For this tutorial, the corresponding links are: 
- Expression matrices: https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/WMB-10Xv3/20230630/
- Cell metadata (I'm using the most recent file): https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/metadata/WMB-10X/20241115/cell_metadata.csv
- Taxonomy files: https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#metadata/WMB-taxonomy/20231215/

We will also need some additional files: 
- precomputed stats 
- marker lookup table

You can get these by running the code below (Whole Mouse Brain taxonomy):

In [7]:
!wget https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/mapmycells/WMB-10X/20240831/mouse_markers_230821.json 
!wget https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/mapmycells/WMB-10X/20240831/precomputed_stats_ABC_revision_230821.h5 
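
If `wget` is not available, the same two files can be fetched from Python. The sketch below is one possible way to do it. It assumes the files should end up in a `data/additional_files` folder (the location the later cells read them from); the helper name `download_if_missing` and its `fetch` parameter are illustrative and not part of any Allen Institute package.

```python
import pathlib
import urllib.request

# Same S3 location as the wget commands above
BASE_URL = "https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/mapmycells/WMB-10X/20240831"
FILE_NAMES = [
    "mouse_markers_230821.json",
    "precomputed_stats_ABC_revision_230821.h5",
]


def download_if_missing(file_name, dest_dir, fetch=urllib.request.urlretrieve):
    """Download BASE_URL/file_name into dest_dir unless the file is already there.

    `fetch` is any callable taking (url, destination_path); it is a parameter
    only so that the function can be exercised without internet access.
    """
    dest_dir = pathlib.Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest_path = dest_dir / file_name
    if not dest_path.exists():
        fetch(f"{BASE_URL}/{file_name}", str(dest_path))
    return dest_path


# Usage (needs internet access; adjust the destination to your own layout):
# for name in FILE_NAMES:
#     download_if_missing(name, "data/additional_files")
```

Skipping files that already exist means the cell can be re-run safely after an interrupted session. A partially downloaded file would still need to be deleted by hand.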



Here's the structure of my data folder:

ABC_atlas_celltyping/
├── data
│   ├── expression_matrices
│   │   └── WMB-10Xv3
│   │       └── 20230630
│   │           ├── WMB-10Xv3-CTXsp-raw.h5ad
│   │           ├── WMB-10Xv3-HPF-raw.h5ad
│   │           ├── WMB-10Xv3-HY-raw.h5ad
│   │           ├── WMB-10Xv3-Isocortex-1-raw.h5ad
│   │           ├── WMB-10Xv3-Isocortex-2-raw.h5ad
│   │           ├── WMB-10Xv3-MB-raw.h5ad
│   │           ├── WMB-10Xv3-OLF-raw.h5ad
│   │           ├── WMB-10Xv3-PAL-raw.h5ad
│   │           ├── WMB-10Xv3-STR-raw.h5ad
│   │           └── WMB-10Xv3-TH-raw.h5ad
│   └── metadata
│       ├── WMB-10X
│       │   ├── 20230630
│       │   │   └── cell_metadata.csv
│       │   └── 20241115
│       │       └── cell_metadata.csv
│       └── WMB-taxonomy
│           └── 20231215
│               ├── cluster_annotation_term.csv
│               ├── cluster_annotation_term_set.csv
│               ├── cluster.csv
│               ├── cluster_to_cluster_annotation_membership.csv
│               └── views
│                   ├── cluster_annotation_term_with_counts.csv
│                   ├── cluster_to_cluster_annotation_membership_color.csv
│                   └── cluster_to_cluster_annotation_membership_pivoted.csv
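
Before moving on, it can help to confirm that the downloads landed where the tree above says they should. This is a minimal sketch: the helper `missing_files` is a name made up for this example, and the `EXPECTED` list only checks a few of the files, so extend it as needed.

```python
import pathlib


def missing_files(root, relative_paths):
    """Return the entries of relative_paths that are not regular files under root."""
    root = pathlib.Path(root)
    return [rel for rel in relative_paths if not (root / rel).is_file()]


# A few of the files shown in the tree above (not an exhaustive list)
EXPECTED = [
    "data/metadata/WMB-10X/20241115/cell_metadata.csv",
    "data/metadata/WMB-taxonomy/20231215/cluster_to_cluster_annotation_membership.csv",
    "data/expression_matrices/WMB-10Xv3/20230630/WMB-10Xv3-HPF-raw.h5ad",
]

# Usage: replace the root with your own project folder
# print(missing_files("/path/to/ABC_atlas_celltyping", EXPECTED))
```

An empty list means every file in `EXPECTED` was found.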


## B. Subsetting the data to the desired regions
### You can connect the kernel to an (offline) compute node for a faster analysis 

Imports and folder organization:

In [1]:
import pandas as pd 
import anndata
import pathlib
import os
import numpy as np
import re
import json

In [2]:
abc_atlas_dir = pathlib.Path('/scratch/mfafouti/ABC_atlas_celltyping')

In [21]:
project_path = str(abc_atlas_dir)
print(f'This is the path where the repository was cloned: {project_path} It will be used as a root for all subsequent paths')

data_dir = abc_atlas_dir / 'data'
expression_dir = data_dir / 'expression_matrices' / 'WMB-10Xv3' / '20230630'
metadata_dir = data_dir / 'metadata' / 'WMB-10X' / '20241115'
taxonomy_dir = data_dir / 'metadata' / 'WMB-taxonomy' / '20231215'
scratch_dir = abc_atlas_dir / 'scratch'
precompute_dir = scratch_dir / 'precompute'
reference_dir = scratch_dir / 'reference'
query_dir = data_dir / 'query_data'

# create the working directories if they do not exist yet
for dir_name in (scratch_dir, precompute_dir, reference_dir, query_dir):
    dir_name.mkdir(parents=True, exist_ok=True)

This is the path where the repository was cloned: /scratch/mfafouti/ABC_atlas_celltyping It will be used as a root for all subsequent paths


In [26]:
# query_file = data_dir / 'query_data'/ 'BC3_collapsed_mouse_genes.h5ad'
query_file_list = list(query_dir.glob("*.h5ad"))
print(query_file_list)

[PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/query_data/BC13_collapsed_mouse_genes.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/query_data/BC14_collapsed_mouse_genes.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/query_data/BC15_collapsed_mouse_genes.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/query_data/BC28_collapsed_mouse_genes.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/query_data/BC3_collapsed_mouse_genes.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/query_data/BC9_collapsed_mouse_genes.h5ad')]


Below I define the regions I wish to include. Three regions are excluded: CB (cerebellum), MY (medulla) and P (pons).

In [23]:
regions_to_include = ["CTXsp","HPF","HY","Isocortex-1","Isocortex-2","MB","OLF","PAL","STR","TH"]

First we will preview the metadata CSV:

In [24]:
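# NOTE: this reads 'cell_metadata_new.csv'; the file downloaded in section A is named
# 'cell_metadata.csv', so adjust the filename to match your own copy
cell_metadata = pd.read_csv(metadata_dir / 'cell_metadata_new.csv')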
cell_metadata.head()

Unnamed: 0,cell_label,cell_barcode,barcoded_cell_sample_label,library_label,feature_matrix_label,entity,brain_section_label,library_method,region_of_interest_acronym,donor_label,donor_genotype,donor_sex,dataset_label,x,y,cluster_alias,abc_sample_id
0,GCGAGAAGTTAAGGGC-410_B05,GCGAGAAGTTAAGGGC,410_B05,L8TX_201030_01_C12,WMB-10Xv3-HPF,cell,,10Xv3,RHP,Snap25-IRES2-Cre;Ai14-550850,Ai14(RCL-tdT)/wt,F,WMB-10Xv3,23.146826,-3.086639,1,484be5df-5d44-4bfe-9652-7b5bc739c211
1,AATGGCTCAGCTCCTT-411_B06,AATGGCTCAGCTCCTT,411_B06,L8TX_201029_01_E10,WMB-10Xv3-HPF,cell,,10Xv3,RHP,Snap25-IRES2-Cre;Ai14-550851,Ai14(RCL-tdT)/wt,F,WMB-10Xv3,23.138481,-3.022,1,5638505d-e1e8-457f-9e5b-59e3e2302417
2,AACACACGTTGCTTGA-410_B05,AACACACGTTGCTTGA,410_B05,L8TX_201030_01_C12,WMB-10Xv3-HPF,cell,,10Xv3,RHP,Snap25-IRES2-Cre;Ai14-550850,Ai14(RCL-tdT)/wt,F,WMB-10Xv3,23.472557,-2.992709,1,a0544e29-194f-4d34-9af4-13e7377b648f
3,CACAGATAGAGGCGGA-410_A05,CACAGATAGAGGCGGA,410_A05,L8TX_201029_01_A10,WMB-10Xv3-HPF,cell,,10Xv3,RHP,Snap25-IRES2-Cre;Ai14-550850,Ai14(RCL-tdT)/wt,F,WMB-10Xv3,23.379622,-3.043442,1,c777ac0b-77e1-4d76-bf8e-2b3d9e08b253
4,AAAGTGAAGCATTTCG-410_B05,AAAGTGAAGCATTTCG,410_B05,L8TX_201030_01_C12,WMB-10Xv3-HPF,cell,,10Xv3,RHP,Snap25-IRES2-Cre;Ai14-550850,Ai14(RCL-tdT)/wt,F,WMB-10Xv3,23.90948,-2.601536,1,49860925-e82b-46df-a228-fd2f97e75d39


In [25]:
cell_metadata.shape

(4042976, 17)
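
With about four million rows, the metadata table takes a noticeable amount of memory. If that becomes a problem, one option is to read only the columns the rest of this tutorial uses. The sketch below assumes those are `cell_label`, `feature_matrix_label` and `cluster_alias`; the function name `read_cell_metadata` is illustrative, and the tiny in-memory CSV stands in for the real file.

```python
import io

import pandas as pd


def read_cell_metadata(csv_source,
                       columns=("cell_label", "feature_matrix_label", "cluster_alias")):
    """Read only the requested columns of the cell metadata CSV."""
    return pd.read_csv(csv_source, usecols=list(columns), dtype={"cell_label": str})


# Toy stand-in for cell_metadata.csv (made-up rows, same column names)
demo_csv = io.StringIO(
    "cell_label,cell_barcode,feature_matrix_label,cluster_alias\n"
    "c1,AAA,WMB-10Xv3-HPF,1\n"
    "c2,CCC,WMB-10Xv3-TH,2\n"
)
demo = read_cell_metadata(demo_csv)
print(demo.shape)
```

For the real file, pass the path instead of the `StringIO` object, for example `read_cell_metadata(metadata_dir / 'cell_metadata.csv')`.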

In [8]:
# Non-capturing group (?:...): str.contains warns when the pattern has capture groups
pattern = "(?:" + "|".join(re.escape(r) for r in regions_to_include) + ")"

# Match only WMB-10Xv3- datasets that contain those region names
filtered_cells = cell_metadata[
    cell_metadata.feature_matrix_label.str.contains(
        f"WMB-10Xv3-.*{pattern}", case=False, na=False
    )
]
print(f"Filtered cells shape: {filtered_cells.shape}")
print("Unique feature_matrix_labels after filtering:")
print(filtered_cells.feature_matrix_label.unique()) # make sure this matches the regions of interest


Filtered cells shape: (1824724, 17)
Unique feature_matrix_labels after filtering:
['WMB-10Xv3-HPF' 'WMB-10Xv3-Isocortex-1' 'WMB-10Xv3-PAL' 'WMB-10Xv3-STR'
 'WMB-10Xv3-CTXsp' 'WMB-10Xv3-HY' 'WMB-10Xv3-OLF' 'WMB-10Xv3-TH'
 'WMB-10Xv3-MB' 'WMB-10Xv3-Isocortex-2']
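
Since the region names are known exactly, an exact-match filter is an alternative to the regular expression and avoids accidental substring matches (for example a short acronym such as `P` matching inside `PAL`). This is a sketch on toy data; `filter_by_region` is a name made up for this example, and it assumes every label has the form `WMB-10Xv3-<region>`.

```python
import pandas as pd


def filter_by_region(cell_metadata, regions, prefix="WMB-10Xv3-"):
    """Keep rows whose feature_matrix_label is exactly prefix + one of the regions."""
    wanted = {prefix + region for region in regions}
    return cell_metadata[cell_metadata["feature_matrix_label"].isin(wanted)]


# Toy metadata with made-up cell labels
toy = pd.DataFrame(
    {
        "cell_label": ["c1", "c2", "c3", "c4"],
        "feature_matrix_label": [
            "WMB-10Xv3-HPF",
            "WMB-10Xv3-P",
            "WMB-10Xv3-PAL",
            "WMB-10Xv2-HPF",
        ],
    }
)
kept = filter_by_region(toy, ["HPF", "PAL"])
print(kept)
```

Only the `WMB-10Xv3-HPF` and `WMB-10Xv3-PAL` rows survive: the pons row and the 10Xv2 row are dropped.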


List the expression matrix files (one `.h5ad` file per region) found in the expression directory:

In [9]:
h5ad_path_list = list(expression_dir.glob("*.h5ad"))
print(h5ad_path_list)


[PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/expression_matrices/WMB-10Xv3/20230630/WMB-10Xv3-CTXsp-raw.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/expression_matrices/WMB-10Xv3/20230630/WMB-10Xv3-HPF-raw.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/expression_matrices/WMB-10Xv3/20230630/WMB-10Xv3-HY-raw.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/expression_matrices/WMB-10Xv3/20230630/WMB-10Xv3-Isocortex-1-raw.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/expression_matrices/WMB-10Xv3/20230630/WMB-10Xv3-Isocortex-2-raw.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/expression_matrices/WMB-10Xv3/20230630/WMB-10Xv3-MB-raw.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/expression_matrices/WMB-10Xv3/20230630/WMB-10Xv3-OLF-raw.h5ad'), PosixPath('/scratch/mfafouti/ABC_atlas_celltyping/data/expression_matrices/WMB-10Xv3/20230630/WMB-10Xv3-PAL-raw.h5ad'), PosixPath('/scratch/mfa
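
The glob above returns every `.h5ad` file in the folder. If files for other regions have also been downloaded, the list can be narrowed to the regions of interest. The sketch below assumes the file names follow the `WMB-10Xv3-<region>-raw.h5ad` pattern shown in the folder tree; `select_region_files` is an illustrative helper, not part of any package.

```python
import pathlib


def select_region_files(h5ad_paths, regions,
                        prefix="WMB-10Xv3-", suffix="-raw.h5ad"):
    """Return, sorted, the paths whose file name is prefix + region + suffix."""
    wanted = {f"{prefix}{region}{suffix}" for region in regions}
    return sorted(p for p in map(pathlib.Path, h5ad_paths) if p.name in wanted)


# Toy paths; nothing is read from disk here
toy_paths = [
    "/x/WMB-10Xv3-HPF-raw.h5ad",
    "/x/WMB-10Xv3-CB-raw.h5ad",
    "/x/WMB-10Xv3-TH-raw.h5ad",
]
selected = select_region_files(toy_paths, ["TH", "HPF"])
print([p.name for p in selected])
```

Each selected file can then be opened with `anndata.read_h5ad(path, backed='r')`, which keeps the expression matrix on disk instead of loading it into memory.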

## D. Create the files necessary to run cell_type_mapper
The cell_type_mapper package: https://github.com/AllenInstitute/cell_type_mapper/tree/main requires the following input: 
- precomputed stats file 
- marker lookup files 

The files already exist for the whole taxonomy. There are three options:
1. Map to the full taxonomy 
(fastest, see https://github.com/AllenInstitute/cell_type_mapper/blob/67b740991ad30e2369f14aeef6f711e195a0ae76/examples/map_to_subset_of_taxonomy_fast.ipynb#L8)
2. Create a custom version of those files 
(more complicated, not covered in this tutorial, see: https://github.com/AllenInstitute/cell_type_mapper/blob/67b740991ad30e2369f14aeef6f711e195a0ae76/examples/mapping_to_subset_of_abc_atlas_data.ipynb) 
3. Instruct the cell type mapper, at mapping time, to ignore certain branches of the cell type taxonomy. 

We will use approach 3: the existing whole-brain files are reused, and the taxonomy branches (here, subclasses) that do not occur in the selected regions are passed to the mapper as `nodes_to_drop`.

In [27]:
# load the dataframe that associates clusters with cell type taxons
taxonomy_path = taxonomy_dir / 'cluster_to_cluster_annotation_membership.csv'
taxonomy_df = pd.read_csv(taxonomy_path)

# create a dict mapping each cluster_alias to its taxon at every
# level of the cell type taxonomy
alias_to_truth = dict()
for row in taxonomy_df.to_dict(orient='records'):
    alias = row['cluster_alias']
    level = row['cluster_annotation_term_set_label']
    node = row['cluster_annotation_term_label']
    if alias not in alias_to_truth:
        alias_to_truth[alias] = dict()
    alias_to_truth[alias][level] = node

# use the association between cells and cluster_alias to create a dict
# mapping cell_label to cell type taxons
# ground_truth = {
#     cell_label: alias_to_truth[cluster_alias]
#     for cell_label, cluster_alias in
#     zip(test_set.cell_label.values, test_set.cluster_alias.values)}
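
The same lookup can also be built with a pandas pivot instead of an explicit loop. This is a sketch on a toy table that uses the three column names from `cluster_to_cluster_annotation_membership.csv`; the term labels are made up, and `build_alias_to_truth` is an illustrative name. It assumes each (`cluster_alias`, level) pair occurs exactly once: `pivot` raises a `ValueError` on duplicates, and a missing pair would show up as `NaN`.

```python
import pandas as pd


def build_alias_to_truth(taxonomy_df):
    """Return {cluster_alias: {taxonomy level: term label}} via a pivot."""
    wide = taxonomy_df.pivot(
        index="cluster_alias",
        columns="cluster_annotation_term_set_label",
        values="cluster_annotation_term_label",
    )
    return wide.to_dict(orient="index")


# Toy membership table with made-up term labels
toy_taxonomy = pd.DataFrame(
    {
        "cluster_alias": [1, 1, 2, 2],
        "cluster_annotation_term_set_label": [
            "CCN20230722_CLAS", "CCN20230722_SUBC",
            "CCN20230722_CLAS", "CCN20230722_SUBC",
        ],
        "cluster_annotation_term_label": [
            "CLAS_A", "SUBC_A1",
            "CLAS_B", "SUBC_B1",
        ],
    }
)
toy_lookup = build_alias_to_truth(toy_taxonomy)
print(toy_lookup[1]["CCN20230722_SUBC"])
```

The result has the same shape as `alias_to_truth`, so it can be indexed the same way.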

In [28]:
# subclasses that occur in the selected regions
valid_subclasses = {
    alias_to_truth[cl]['CCN20230722_SUBC']
    for cl in filtered_cells.cluster_alias.values
}
# every other subclass in the taxonomy will be ignored by the mapper
subclasses_to_drop = list(
    {
        alias_to_truth[cl]['CCN20230722_SUBC']
        for cl in alias_to_truth
    } - valid_subclasses
)

nodes_to_drop = [('subclass', cl) for cl in subclasses_to_drop]

print('=======example nodes being dropped=======')
for pair in nodes_to_drop:
    print(pair)

('subclass', 'CS20230722_SUBC_299')
('subclass', 'CS20230722_SUBC_310')
('subclass', 'CS20230722_SUBC_244')
('subclass', 'CS20230722_SUBC_296')
('subclass', 'CS20230722_SUBC_305')
('subclass', 'CS20230722_SUBC_294')
('subclass', 'CS20230722_SUBC_236')
('subclass', 'CS20230722_SUBC_239')
('subclass', 'CS20230722_SUBC_283')
('subclass', 'CS20230722_SUBC_257')
('subclass', 'CS20230722_SUBC_266')
('subclass', 'CS20230722_SUBC_278')
('subclass', 'CS20230722_SUBC_258')
('subclass', 'CS20230722_SUBC_288')
('subclass', 'CS20230722_SUBC_284')
('subclass', 'CS20230722_SUBC_233')
('subclass', 'CS20230722_SUBC_247')
('subclass', 'CS20230722_SUBC_242')
('subclass', 'CS20230722_SUBC_248')
('subclass', 'CS20230722_SUBC_253')
('subclass', 'CS20230722_SUBC_260')
('subclass', 'CS20230722_SUBC_292')
('subclass', 'CS20230722_SUBC_291')
('subclass', 'CS20230722_SUBC_297')
('subclass', 'CS20230722_SUBC_300')
('subclass', 'CS20230722_SUBC_302')
('subclass', 'CS20230722_SUBC_218')
('subclass', 'CS20230722_SUB

Now we can run the mapping using the Whole Mouse Brain precomputed stats and marker lookup files, while telling the mapper to ignore the subclasses that are not present in the selected regions.

In [31]:
additional_files_dir = data_dir / 'additional_files'

In [32]:
# assumes the marker lookup and precomputed stats files downloaded in section A
# have been placed in data/additional_files
baseline_marker_path = additional_files_dir / 'mouse_markers_230821.json'
baseline_precompute_path = additional_files_dir / 'precomputed_stats_ABC_revision_230821.h5'


In [33]:
# run the mapping once per query file
for query_file in query_file_list:
    !apptainer exec --bind {project_path}:{project_path},/tmp:/tmp \
        {project_path}/docker/celltypemapper.sif \
        python {project_path}/scripts/create_files.py \
        --input_dir {additional_files_dir} \
        --scratch_dir {scratch_dir} \
        --nodes_to_drop '{json.dumps(nodes_to_drop)}' \
        --query {query_file}

Using path: /scratch/mfafouti/ABC_atlas_celltyping/data/additional_files for input
Using path: /scratch/mfafouti/ABC_atlas_celltyping/scratch to find the test_set.h5ad and save the output
=== Running Hierarchical Mapping 1.6.1 

No gene_mapper_db provided. Assuming that query genes have already been mapped to the same species/authority as reference genes.
MAPPING FROM SPECIFIED MARKERS RAN SUCCESSFULLY
CLEANING UP
Using path: /scratch/mfafouti/ABC_atlas_celltyping/data/additional_files for input
Using path: /scratch/mfafouti/ABC_atlas_celltyping/scratch to find the test_set.h5ad and save the output
=== Running Hierarchical Mapping 1.6.1 

No gene_mapper_db provided. Assuming that query genes have already been mapped to the same species/authority as reference genes.
MAPPING FROM SPECIFIED MARKERS RAN SUCCESSFULLY
CLEANING UP
Using path: /scratch/mfafouti/ABC_atlas_celltyping/data/additional_files for input
Using path: /scratch/mfafouti/ABC_atlas_celltyping/scratch to find the test_set.h
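
The single quotes around `{json.dumps(nodes_to_drop)}` in the command matter: the JSON string contains spaces and double quotes, so without quoting the shell splits it into several arguments. The sketch below illustrates this with the standard library only. The `argparse` parser is a hypothetical stand-in for the receiving script, not the actual `create_files.py`, and the two subclass labels are taken from the list printed earlier.

```python
import argparse
import json
import shlex

nodes_to_drop = [
    ("subclass", "CS20230722_SUBC_299"),
    ("subclass", "CS20230722_SUBC_310"),
]

# Serialise to JSON (tuples become lists), then quote for the shell
json_arg = json.dumps(nodes_to_drop)
quoted = shlex.quote(json_arg)

# What the shell would hand to the script with and without quoting
argv_quoted = shlex.split(f"--nodes_to_drop {quoted}")
argv_unquoted = shlex.split(f"--nodes_to_drop {json_arg}")
print(len(argv_quoted), len(argv_unquoted))

# Hypothetical receiving side: parse the single JSON argument back into tuples
parser = argparse.ArgumentParser()
parser.add_argument("--nodes_to_drop", type=str, default=None)
args = parser.parse_args(argv_quoted)
recovered = [tuple(pair) for pair in json.loads(args.nodes_to_drop)]
print(recovered == nodes_to_drop)
```

With quoting, the JSON arrives as one argument and round-trips cleanly. Without it, the parser would see extra tokens and reject them as unrecognized arguments.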

In [None]:
# single-file variant of the loop above; the JSON string is quoted so that
# the shell passes it to the script as one argument
query_file = query_file_list[0]
!apptainer exec --bind {project_path}:{project_path},/tmp:/tmp \
    {project_path}/docker/celltypemapper.sif \
    python {project_path}/scripts/create_files.py --input_dir {additional_files_dir} --scratch_dir {scratch_dir} --nodes_to_drop '{json.dumps(nodes_to_drop)}' --query {query_file}
