## Step 3: Create an AnnData object from the Allen Brain Cell Atlas

Now we'll create an anndata object from the files we downloaded in the previous step. Note that this step will require a significant amount of memory and disk space (300+ GB). 

If you are unable to run this script due to your system's hardware specifications and would like the output .h5ad file instead (which is approximately 300 GB in size), please email `michael.odea@nyulangone.org` and `shane.liddelow@nyulangone.org` and we will coordinate to provide this file to you.

In [1]:
import os
os.chdir('..') # changing working directory to parent 'EpiMemAstros' directory, adjust as needed
import pandas as pd
from pathlib import Path
import numpy as np
import anndata as ad
from scipy import sparse
import scanpy as sc
import glob

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache
from abc_atlas_access.abc_atlas_cache.anndata_utils import get_gene_data

First, load the project cache.

In [2]:
download_base = Path('inputs/ABC_atlas/abc_atlas_downloads/')
abc_cache = AbcProjectCache.from_cache_dir(download_base)
abc_cache.current_manifest

'releases/20241130/manifest.json'

Next, read in the cell-level metadata as a data frame.

In [3]:
cell = abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='cell_metadata').set_index('cell_label')
cell

| cell_label | cell_barcode | barcoded_cell_sample_label | library_label | feature_matrix_label | entity | brain_section_label | library_method | region_of_interest_acronym | donor_label | donor_genotype | donor_sex | dataset_label | x | y | cluster_alias | abc_sample_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCGAGAAGTTAAGGGC-410_B05 | GCGAGAAGTTAAGGGC | 410_B05 | L8TX_201030_01_C12 | WMB-10Xv3-HPF | cell | | 10Xv3 | RHP | Snap25-IRES2-Cre;Ai14-550850 | Ai14(RCL-tdT)/wt | F | WMB-10Xv3 | 23.146826 | -3.086639 | 1 | 484be5df-5d44-4bfe-9652-7b5bc739c211 |
| AATGGCTCAGCTCCTT-411_B06 | AATGGCTCAGCTCCTT | 411_B06 | L8TX_201029_01_E10 | WMB-10Xv3-HPF | cell | | 10Xv3 | RHP | Snap25-IRES2-Cre;Ai14-550851 | Ai14(RCL-tdT)/wt | F | WMB-10Xv3 | 23.138481 | -3.022000 | 1 | 5638505d-e1e8-457f-9e5b-59e3e2302417 |
| AACACACGTTGCTTGA-410_B05 | AACACACGTTGCTTGA | 410_B05 | L8TX_201030_01_C12 | WMB-10Xv3-HPF | cell | | 10Xv3 | RHP | Snap25-IRES2-Cre;Ai14-550850 | Ai14(RCL-tdT)/wt | F | WMB-10Xv3 | 23.472557 | -2.992709 | 1 | a0544e29-194f-4d34-9af4-13e7377b648f |
| CACAGATAGAGGCGGA-410_A05 | CACAGATAGAGGCGGA | 410_A05 | L8TX_201029_01_A10 | WMB-10Xv3-HPF | cell | | 10Xv3 | RHP | Snap25-IRES2-Cre;Ai14-550850 | Ai14(RCL-tdT)/wt | F | WMB-10Xv3 | 23.379622 | -3.043442 | 1 | c777ac0b-77e1-4d76-bf8e-2b3d9e08b253 |
| AAAGTGAAGCATTTCG-410_B05 | AAAGTGAAGCATTTCG | 410_B05 | L8TX_201030_01_C12 | WMB-10Xv3-HPF | cell | | 10Xv3 | RHP | Snap25-IRES2-Cre;Ai14-550850 | Ai14(RCL-tdT)/wt | F | WMB-10Xv3 | 23.909480 | -2.601536 | 1 | 49860925-e82b-46df-a228-fd2f97e75d39 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| GTGTGAGCAAACGCGA-1350_C05 | GTGTGAGCAAACGCGA | 1350_C05 | L8XR_220728_01_A05 | WMB-10XMulti | cell | | 10xRSeq_Mult | MB | C57BL6J-641405 | wt/wt | M | WMB-10XMulti | -7.716915 | 0.223654 | 8861 | ba1d0e38-bea7-4d4f-bfcd-49121938e743 |
| TTAGCAATCCCTGTTA-1350_C05 | TTAGCAATCCCTGTTA | 1350_C05 | L8XR_220728_01_A05 | WMB-10XMulti | cell | | 10xRSeq_Mult | MB | C57BL6J-641405 | wt/wt | M | WMB-10XMulti | -3.115098 | -3.024478 | 8215 | 342bd0bb-cbe5-479b-9c70-fef59a730255 |
| TTTGGCTGTCGCGCAA-1350_C05 | TTTGGCTGTCGCGCAA | 1350_C05 | L8XR_220728_01_A05 | WMB-10XMulti | cell | | 10xRSeq_Mult | MB | C57BL6J-641405 | wt/wt | M | WMB-10XMulti | -7.950964 | 0.409335 | 8798 | 4634de09-d8e0-4e40-a49b-eba311de08b5 |
| ATCCACCTCACAGACT-1320_B04 | ATCCACCTCACAGACT | 1320_B04 | L8XR_220630_02_B10 | WMB-10XMulti | cell | | 10xRSeq_Mult | OLF | C57BL6J-625156 | wt/wt | F | WMB-10XMulti | 4.579441 | 12.135833 | 8798 | 5b3061de-1cb8-47b6-9368-52824e1031ce |


Next, read in the gene-level metadata as a data frame.

In [4]:
gene = abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='gene').set_index('gene_identifier')
gene

| gene_identifier | gene_symbol | name | mapped_ncbi_identifier | comment |
|---|---|---|---|---|
| ENSMUSG00000051951 | Xkr4 | X-linked Kx blood group related 4 | NCBIGene:497097 | |
| ENSMUSG00000089699 | Gm1992 | predicted gene 1992 | | |
| ENSMUSG00000102331 | Gm19938 | predicted gene, 19938 | | |
| ENSMUSG00000102343 | Gm37381 | predicted gene, 37381 | | |
| ENSMUSG00000025900 | Rp1 | retinitis pigmentosa 1 (human) | NCBIGene:19888 | |
| ... | ... | ... | ... | ... |
| ENSMUSG00000095523 | AC124606.1 | PRAME family member 8-like | NCBIGene:100038995 | no expression |
| ENSMUSG00000095475 | AC133095.2 | uncharacterized LOC545763 | NCBIGene:545763 | no expression |
| ENSMUSG00000094855 | AC133095.1 | uncharacterized LOC620639 | NCBIGene:620639 | no expression |
| ENSMUSG00000095019 | AC234645.1 | | | no expression |


Next, read all the H5AD files into a list of AnnData objects. Depending on your system's hardware, you may wish to avoid loading every AnnData object into memory. Alternatively, you can concatenate the H5AD files on disk with the `anndata.experimental.concat_on_disk` function. [See the anndata package documentation for details.](https://anndata.readthedocs.io/en/latest/generated/anndata.experimental.concat_on_disk.html#anndata.experimental.concat_on_disk)

In [5]:
raw_h5ads = glob.glob('inputs/ABC_atlas/abc_atlas_downloads/expression_matrices/*/*/WMB-10X*-raw.h5ad')

adatas = []
for x in raw_h5ads:
    tmp_ad = sc.read_h5ad(x)
    adatas.append(tmp_ad)

adatas

[AnnData object with n_obs × n_vars = 1687 × 32285
     obs: 'cell_barcode', 'library_label', 'anatomical_division_label'
     var: 'gene_symbol'
     uns: 'normalization', 'parent', 'parent_layer', 'parent_rows',
 AnnData object with n_obs × n_vars = 250040 × 32285
     obs: 'cell_barcode', 'library_label', 'anatomical_division_label'
     var: 'gene_symbol',
 AnnData object with n_obs × n_vars = 193723 × 32285
     obs: 'cell_barcode', 'library_label', 'anatomical_division_label'
     var: 'gene_symbol'
     uns: 'normalization', 'parent', 'parent_layer', 'parent_rows',
 AnnData object with n_obs × n_vars = 44310 × 32285
     obs: 'cell_barcode', 'library_label', 'anatomical_division_label'
     var: 'gene_symbol'
     uns: 'normalization', 'parent', 'parent_layer', 'parent_rows',
 AnnData object with n_obs × n_vars = 131212 × 32285
     obs: 'cell_barcode', 'library_label', 'anatomical_division_label'
     var: 'gene_symbol'
     uns: 'normalization', 'parent', 'parent_layer', 'pare

Now we merge the anndata object list into a single anndata object.

In [6]:
mouse_atlas = ad.concat(adatas)
mouse_atlas

AnnData object with n_obs × n_vars = 4059388 × 32285
    obs: 'cell_barcode', 'library_label', 'anatomical_division_label'

Some cell barcodes in the AnnData object have no entry in the cell metadata table (16,412 of 4,059,388, going by the object sizes reported in this notebook). We'll subset to the cells listed in the metadata, which drops the rest.

In [7]:
mouse_atlas = mouse_atlas[cell.index,:].copy()
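
Label-based indexing as above assumes that every label in `cell.index` is present in the object; as far as we understand, anndata raises an error otherwise. A more defensive variant is sketched below on toy labels. The helper `split_by_metadata` is hypothetical and only illustrates the idea with pandas.

```python
import pandas as pd


def split_by_metadata(obs_names, metadata_index):
    """Return (keep_mask, missing_from_data) for two collections of cell labels.

    keep_mask marks the entries of obs_names that have a metadata row;
    missing_from_data lists metadata labels with no matching cell.
    """
    obs_names = pd.Index(obs_names)
    metadata_index = pd.Index(metadata_index)
    keep_mask = obs_names.isin(metadata_index)
    missing_from_data = metadata_index.difference(obs_names)
    return keep_mask, missing_from_data


# Toy labels: 'c3' has no metadata, and 'c9' has metadata but no expression data.
keep_mask, missing_from_data = split_by_metadata(['c1', 'c2', 'c3'], ['c2', 'c1', 'c9'])
print(keep_mask.tolist())          # [True, True, False]
print(missing_from_data.tolist())  # ['c9']
```

In this notebook the call would be `split_by_metadata(mouse_atlas.obs_names, cell.index)`, followed by `mouse_atlas[keep_mask, :].copy()`. Note that a boolean mask keeps the object's existing cell order, whereas indexing with `cell.index` reorders cells to match the metadata table.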

Now, merge the cell metadata data frame into the `.obs` field and the gene metadata into the `.var` field of the AnnData object.

In [8]:
mouse_atlas.obs = mouse_atlas.obs.merge(cell, left_index=True, right_index=True, how="inner")
mouse_atlas.var = mouse_atlas.var.merge(gene, left_index=True, right_index=True, how="inner")
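
Because `.obs` and the cell metadata table share some column names (`cell_barcode` and `library_label`), pandas keeps both copies and appends `_x`/`_y` suffixes, which is why the final object lists `cell_barcode_x` and `cell_barcode_y`. The toy example below reproduces that behavior and shows one optional way to avoid it, assuming the duplicated columns hold the same values on both sides. The toy frames are illustrative, not real atlas data.

```python
import pandas as pd

# Toy stand-ins for mouse_atlas.obs and the cell metadata table.
obs = pd.DataFrame(
    {'cell_barcode': ['AAA', 'CCC'], 'anatomical_division_label': ['HPF', 'MB']},
    index=pd.Index(['AAA-1', 'CCC-2'], name='cell_label'),
)
cell_meta = pd.DataFrame(
    {'cell_barcode': ['AAA', 'CCC'], 'donor_sex': ['F', 'M']},
    index=pd.Index(['AAA-1', 'CCC-2'], name='cell_label'),
)

# Plain merge: the shared column is kept twice, with _x/_y suffixes.
merged = obs.merge(cell_meta, left_index=True, right_index=True, how='inner')
print(list(merged.columns))
# ['cell_barcode_x', 'anatomical_division_label', 'cell_barcode_y', 'donor_sex']

# Alternative: drop the overlapping columns from one side before merging.
overlap = obs.columns.intersection(cell_meta.columns)
deduped = obs.drop(columns=overlap).merge(
    cell_meta, left_index=True, right_index=True, how='inner'
)
print(list(deduped.columns))
# ['anatomical_division_label', 'cell_barcode', 'donor_sex']
```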

Next, we'll create a dataframe which contains all the cluster-level annotation information, and we'll add this information to each cell's metadata.

In [9]:
term_set = abc_cache.get_metadata_dataframe(directory='WMB-taxonomy', file_name='cluster_annotation_term_set')
membership = abc_cache.get_metadata_dataframe(directory='WMB-taxonomy', file_name='cluster_to_cluster_annotation_membership')
pivot = membership.groupby(['cluster_alias', 'cluster_annotation_term_set_name'])['cluster_annotation_term_name'].first().unstack()
pivot = pivot[term_set['name']]
pivot

| cluster_alias | neurotransmitter | class | subclass | supertype | cluster |
|---|---|---|---|---|---|
| 1 | Glut | 01 IT-ET Glut | 018 L2 IT PPP-APr Glut | 0082 L2 IT PPP-APr Glut_3 | 0326 L2 IT PPP-APr Glut_3 |
| 2 | Glut | 01 IT-ET Glut | 018 L2 IT PPP-APr Glut | 0082 L2 IT PPP-APr Glut_3 | 0327 L2 IT PPP-APr Glut_3 |
| 3 | Glut | 01 IT-ET Glut | 018 L2 IT PPP-APr Glut | 0081 L2 IT PPP-APr Glut_2 | 0322 L2 IT PPP-APr Glut_2 |
| 4 | Glut | 01 IT-ET Glut | 018 L2 IT PPP-APr Glut | 0081 L2 IT PPP-APr Glut_2 | 0323 L2 IT PPP-APr Glut_2 |
| 5 | Glut | 01 IT-ET Glut | 018 L2 IT PPP-APr Glut | 0081 L2 IT PPP-APr Glut_2 | 0325 L2 IT PPP-APr Glut_2 |
| ... | ... | ... | ... | ... | ... |
| 34368 | GABA-Glyc | 27 MY GABA | 288 MDRN Hoxb5 Ebf2 Gly-Gaba | 1102 MDRN Hoxb5 Ebf2 Gly-Gaba_1 | 4955 MDRN Hoxb5 Ebf2 Gly-Gaba_1 |
| 34372 | GABA-Glyc | 27 MY GABA | 285 MY Lhx1 Gly-Gaba | 1091 MY Lhx1 Gly-Gaba_3 | 4901 MY Lhx1 Gly-Gaba_3 |
| 34374 | GABA-Glyc | 27 MY GABA | 285 MY Lhx1 Gly-Gaba | 1091 MY Lhx1 Gly-Gaba_3 | 4902 MY Lhx1 Gly-Gaba_3 |
| 34376 | GABA-Glyc | 27 MY GABA | 285 MY Lhx1 Gly-Gaba | 1091 MY Lhx1 Gly-Gaba_3 | 4903 MY Lhx1 Gly-Gaba_3 |
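
To make the reshaping step easier to follow, the sketch below applies the same `groupby` / `first` / `unstack` chain to a tiny long-format table. It uses a few values copied from the output above. The toy tables are stand-ins and cover only three of the five term sets.

```python
import pandas as pd

# Toy membership table in long format: one row per (cluster, term set) pair.
membership = pd.DataFrame({
    'cluster_alias': [1, 1, 1, 34372, 34372, 34372],
    'cluster_annotation_term_set_name': ['neurotransmitter', 'class', 'subclass'] * 2,
    'cluster_annotation_term_name': [
        'Glut', '01 IT-ET Glut', '018 L2 IT PPP-APr Glut',
        'GABA-Glyc', '27 MY GABA', '285 MY Lhx1 Gly-Gaba',
    ],
})
# Toy term-set table; its 'name' column defines the desired column order.
term_set = pd.DataFrame({'name': ['neurotransmitter', 'class', 'subclass']})

pivot = (
    membership
    .groupby(['cluster_alias', 'cluster_annotation_term_set_name'])['cluster_annotation_term_name']
    .first()    # one term per (cluster, term set) pair
    .unstack()  # term set names become columns (alphabetical at this point)
)
pivot = pivot[term_set['name']]  # reorder columns to follow the term-set table

print(list(pivot.columns))            # ['neurotransmitter', 'class', 'subclass']
print(pivot.loc[34372, 'subclass'])   # 285 MY Lhx1 Gly-Gaba
```

The result has one row per `cluster_alias`, so it can be joined onto per-cell metadata through each cell's `cluster_alias` value.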


In [10]:
mouse_atlas.obs = mouse_atlas.obs.merge(pivot, left_on = 'cluster_alias', right_index = True, how="left")

In [11]:
mouse_atlas

AnnData object with n_obs × n_vars = 4042976 × 32285
    obs: 'cell_barcode_x', 'library_label_x', 'anatomical_division_label', 'cell_barcode_y', 'barcoded_cell_sample_label', 'library_label_y', 'feature_matrix_label', 'entity', 'brain_section_label', 'library_method', 'region_of_interest_acronym', 'donor_label', 'donor_genotype', 'donor_sex', 'dataset_label', 'x', 'y', 'cluster_alias', 'abc_sample_id', 'neurotransmitter', 'class', 'subclass', 'supertype', 'cluster'
    var: 'gene_symbol', 'name', 'mapped_ncbi_identifier', 'comment'

Lastly, save the AnnData object. Note that this H5AD file requires approximately 300 GB of disk space.

In [12]:
mouse_atlas.write_h5ad('outputs/allen_brain_cell_atlas-RAW.h5ad')
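
Given the size of the output, it may be worth checking free disk space before starting the write. The helper below is a hypothetical convenience, not part of the original workflow, and uses only the Python standard library. The 300 GB threshold in the usage comment is taken from the estimate above.

```python
import shutil
from pathlib import Path


def has_free_space(directory, required_gb):
    """Return True if the filesystem holding `directory` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(Path(directory)).free
    return free_bytes >= required_gb * 1e9


# Possible usage before calling write_h5ad (the 'outputs' directory must already exist):
# if not has_free_space('outputs', required_gb=300):
#     raise OSError('less than ~300 GB free on the filesystem holding outputs/')
```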