## Notebook to combine replication and reference cohort data with our Brain Aging Phase 1 data and harmonize the obs attribute features. Additional public data is also included for cluster and cell-type labeling purposes

Replication data: 
NABEC snRNA from Xylena Reed

Public data:
1. [Leng K, Li E, Eser R et al. Molecular characterization of selectively vulnerable neurons in Alzheimer’s disease. Nat Neurosci 2021;24:276–87.](https://pubmed.ncbi.nlm.nih.gov/33432193/)
2. [Morabito S, Miyoshi E, Michael N et al. Single-nucleus chromatin accessibility and transcriptomic characterization of Alzheimer’s disease. Nat Genet 2021, DOI: 10.1038/s41588-021-00894-z.](https://pubmed.ncbi.nlm.nih.gov/34239132/)



In [1]:
!date

Wed Nov  8 16:38:50 EST 2023


#### import libraries

In [44]:
from pandas import read_csv, concat
from scanpy import read_h5ad, read_10x_h5
from os.path import exists
from anndata import concat as ad_concat
import numpy as np
from matplotlib.pyplot import rc_context
import matplotlib.pyplot as plt

# for white background of figures (only for docs rendering)
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'

#### set notebook variables

In [9]:
# naming
project = 'aging_phase1'
set_name = f'{project}_replication'

# directories
wrk_dir = '/home/jupyter/brain_aging_phase1'
demux_dir = f'{wrk_dir}/demux'
replication_dir = f'{wrk_dir}/replication'
figures_dir = f'{wrk_dir}/figures'
public_dir = f'{wrk_dir}/public'
# sc.settings.figdir = f'{figures_dir}/'

# in files
phase1_raw_h5ad = f'{demux_dir}/aging.h5ad'
phase1_final_h5ad = f'{demux_dir}/aging.pegasus.leiden_085.subclustered.h5ad'
replication_h5ad_file = f'{replication_dir}/{project}_nabec.raw.h5ad'
replication_doublets_file = f'{replication_dir}/{project}_nabec.scrublet_scores.csv'

# out files
raw_anndata_file = f'{replication_dir}/{set_name}.raw.h5ad'

# variables
DEBUG = True
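
Before loading anything, it can help to confirm that the input files defined above are actually present. The following is a minimal sketch using `os.path.exists` (already imported in this notebook); the helper name `missing_inputs` is hypothetical and not part of the original analysis.

```python
from os.path import exists

def missing_inputs(paths):
    """Return the subset of paths that do not exist on disk."""
    return [path for path in paths if not exists(path)]

# hypothetical usage with the variables defined in the cell above:
# missing = missing_inputs([phase1_raw_h5ad, phase1_final_h5ad,
#                           replication_h5ad_file, replication_doublets_file])
# if missing:
#     print('missing input files:', missing)
```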

### load data

#### load the replication data

In [8]:
%%time
adata_rep = read_h5ad(replication_h5ad_file)
print(adata_rep)
if DEBUG:
    display(adata_rep.obs.sample(5))
    display(adata_rep.var.sample(5))    

AnnData object with n_obs × n_vars = 79600 × 36601
    obs: 'sample_id', 'pmi', 'sex', 'age'


Unnamed: 0,sample_id,pmi,sex,age
CAGGTCCAGCTCCCTG-1,UMARY-4789,19.0,female,72
CCTACTTCATGAGCAG-1-1,UMARY-4726,6.0,male,28
GTCATTAAGGCATTAC-1,UMARY-5171,5.0,male,79
CCACACAAGCCTGTGA-1,UMARY-1818,3.0,male,76
CCTTTAGTCAGGATGA-1-1,UMARY-5123,24.0,male,61


FAM91A1
DPY19L1
ADAM2
AL022068.1
PAX1


#### load the replication data doublet predictions

In [11]:
rep_dblt_df = read_csv(replication_doublets_file)
print(rep_dblt_df.shape)
if DEBUG:
    display(rep_dblt_df.head())

(79600, 7)


Unnamed: 0.1,Unnamed: 0,sample_id,pmi,sex,age,doublet_score,predicted_doublet
0,AAACAGCCACAAAGAC-1,UMARY-1540,7.0,male,28,0.004621,False
1,AAACAGCCACATGCTA-1,UMARY-1540,7.0,male,28,0.0188,False
2,AAACAGCCACCTGCTC-1,UMARY-1540,7.0,male,28,0.372531,True
3,AAACAGCCAGGACACA-1,UMARY-1540,7.0,male,28,0.008357,False
4,AAACAGCCATGGAGGC-1,UMARY-1540,7.0,male,28,0.031537,False


#### load the reference data

##### load the Leng et al data
- for the entorhinal cortex samples, keep only the Braak stage 0 samples (n=3)
- for the superior frontal gyrus samples, keep only the Braak stage 0 or 2 samples (n=7)
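
The filtering rule in the list above can be sketched on a toy `obs` table, together with a per-stage donor count that could be compared against the expected sample numbers. The table values and the helper name `braak_mask` are made up for illustration; only the `BraakStage` and `donor_id` column names come from the Leng objects.

```python
import pandas as pd

# toy stand-in for adata.obs; the values are invented for illustration
obs = pd.DataFrame({
    'donor_id': ['1', '1', '2', '3', '4', '5'],
    'BraakStage': ['0', '0', '0', '2', '2', '6'],
})

def braak_mask(obs_df, stages):
    """Boolean mask for cells whose Braak stage is in the allowed set."""
    return obs_df['BraakStage'].astype(str).isin(stages)

kept = obs[braak_mask(obs, ['0', '2'])]
# count distinct donors per retained stage to check against the expected n
donors_per_stage = kept.groupby('BraakStage')['donor_id'].nunique()
print(donors_per_stage.to_dict())
```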

In [17]:
%%time
ec_file = f'{public_dir}/cellxgene_collections/Leng_entorhinal_cortex.h5ad'
adata_ref_ec = read_h5ad(ec_file)
# filter by Braak Stage
adata_ref_ec = adata_ref_ec[adata_ref_ec.obs.BraakStage == '0']
print(adata_ref_ec)
if DEBUG:
    display(adata_ref_ec.obs.sample(5))
    display(adata_ref_ec.var.sample(5))   

View of AnnData object with n_obs × n_vars = 9730 × 32826
    obs: 'SampleID', 'donor_id', 'BraakStage', 'SampleBatch', 'nUMI', 'nGene', 'initialClusterAssignments', 'seurat.clusters', 'clusterAssignment', 'clusterCellType', 'tissue_ontology_term_id', 'cell_type_ontology_term_id', 'assay_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'suspension_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'schema_version', 'title'
    obsm: 'X_cca', 'X_cca.aligned', 'X_tsne'


Unnamed: 0,SampleID,donor_id,BraakStage,SampleBatch,nUMI,nGene,initialClusterAssignments,seurat.clusters,clusterAssignment,clusterCellType,...,organism_ontology_term_id,suspension_type,cell_type,assay,disease,organism,sex,tissue,self_reported_ethnicity,development_stage
EC1_TGGTTAGGTGTGTGCC,EC1,1,0,C,861.0,638,EC:c4,2,EC:Astro,Astro,...,NCBITaxon:9606,nucleus,mature astrocyte,10x 3' v2,Alzheimer disease,Homo sapiens,male,entorhinal cortex,unknown,50-year-old human stage
EC1_CGCCAAGGTCTCGTTC,EC1,1,0,C,416.0,351,EC:c1,1,EC:Micro,Micro,...,NCBITaxon:9606,nucleus,mature microglial cell,10x 3' v2,Alzheimer disease,Homo sapiens,male,entorhinal cortex,unknown,50-year-old human stage
EC2_ATTCTACGTTTAGGAA,EC2,2,0,C,1636.0,1093,EC:c6,4,EC:Exc.1,Exc,...,NCBITaxon:9606,nucleus,glutamatergic neuron,10x 3' v2,Alzheimer disease,Homo sapiens,male,entorhinal cortex,unknown,60-year-old human stage
EC2_TGCCCATCATACAGCT,EC2,2,0,C,314.0,280,EC:c1,1,EC:Micro,Micro,...,NCBITaxon:9606,nucleus,mature microglial cell,10x 3' v2,Alzheimer disease,Homo sapiens,male,entorhinal cortex,unknown,60-year-old human stage
EC2_ACTGTCCGTAAGTGTA,EC2,2,0,C,220.0,204,EC:c3,3,EC:OPC,OPC,...,NCBITaxon:9606,nucleus,oligodendrocyte precursor cell,10x 3' v2,Alzheimer disease,Homo sapiens,male,entorhinal cortex,unknown,60-year-old human stage


feature_id,feature_is_filtered,feature_name,feature_reference,feature_biotype
ENSG00000232030,False,MAGEB6B,NCBITaxon:9606,gene
ENSG00000044012,False,GUCA2B,NCBITaxon:9606,gene
ENSG00000256504,False,RP11-1060J15.7,NCBITaxon:9606,gene
ENSG00000167182,False,SP2,NCBITaxon:9606,gene
ENSG00000250739,False,LINC01262,NCBITaxon:9606,gene


CPU times: user 2.15 s, sys: 307 ms, total: 2.46 s
Wall time: 2.45 s


In [20]:
%%time
sfg_file = f'{public_dir}/cellxgene_collections/Leng_superior_frontal_gyrus.h5ad'
adata_ref_sfg = read_h5ad(sfg_file)
# filter by Braak Stage
adata_ref_sfg = adata_ref_sfg[adata_ref_sfg.obs.BraakStage.isin(['0', '2'])]
print(adata_ref_sfg)
if DEBUG:
    display(adata_ref_sfg.obs.sample(5))
    display(adata_ref_sfg.var.sample(5))  

View of AnnData object with n_obs × n_vars = 32240 × 32826
    obs: 'SampleID', 'donor_id', 'BraakStage', 'SampleBatch', 'nUMI', 'nGene', 'initialClusterAssignments', 'seurat.clusters', 'clusterAssignment', 'clusterCellType', 'tissue_ontology_term_id', 'cell_type_ontology_term_id', 'assay_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'suspension_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'schema_version', 'title'
    obsm: 'X_cca', 'X_cca.aligned', 'X_tsne'


Unnamed: 0,SampleID,donor_id,BraakStage,SampleBatch,nUMI,nGene,initialClusterAssignments,seurat.clusters,clusterAssignment,clusterCellType,...,organism_ontology_term_id,suspension_type,cell_type,assay,disease,organism,sex,tissue,self_reported_ethnicity,development_stage
SFG6_AGCTTGACATATACCG,SFG6,6,2,D,421.0,340,SFG:c19,14,SFG:Endo,Endo,...,NCBITaxon:9606,nucleus,endothelial cell,10x 3' v2,Alzheimer disease,Homo sapiens,male,superior frontal gyrus,unknown,87-year-old human stage
SFG2_ACGAGGACAGCTCCGA,SFG2,2,0,C,2261.0,1563,SFG:c1,7,SFG:Astro.2,Astro,...,NCBITaxon:9606,nucleus,mature astrocyte,10x 3' v2,Alzheimer disease,Homo sapiens,male,superior frontal gyrus,unknown,60-year-old human stage
SFG2_GTCACAAGTGACCAAG,SFG2,2,0,C,745.0,619,SFG:c1,1,SFG:Oligo.2,Oligo,...,NCBITaxon:9606,nucleus,oligodendrocyte,10x 3' v2,Alzheimer disease,Homo sapiens,male,superior frontal gyrus,unknown,60-year-old human stage
SFG4_AGATCTGTCTTTAGGG,SFG4,4,2,C,469.0,403,SFG:c6,3,SFG:Astro.1,Astro,...,NCBITaxon:9606,nucleus,mature astrocyte,10x 3' v2,Alzheimer disease,Homo sapiens,male,superior frontal gyrus,unknown,72-year-old human stage
SFG4_CTAAGACCACCAGCAC,SFG4,4,2,C,316.0,289,SFG:c4,4,SFG:Micro,Micro,...,NCBITaxon:9606,nucleus,mature microglial cell,10x 3' v2,Alzheimer disease,Homo sapiens,male,superior frontal gyrus,unknown,72-year-old human stage


feature_id,feature_is_filtered,feature_name,feature_reference,feature_biotype
ENSG00000151576,False,QTRT2,NCBITaxon:9606,gene
ENSG00000145723,False,GIN1,NCBITaxon:9606,gene
ENSG00000086289,False,EPDR1,NCBITaxon:9606,gene
ENSG00000167371,False,PRRT2,NCBITaxon:9606,gene
ENSG00000276407,False,RP11-407B7.3,NCBITaxon:9606,gene


CPU times: user 4.11 s, sys: 925 ms, total: 5.04 s
Wall time: 5.03 s


##### combine the Leng data

In [22]:
adate_leng = ad_concat([adata_ref_ec, adata_ref_sfg])
adate_leng.obs_names_make_unique()
# add Study attribute
adate_leng.obs['study'] = 'Leng'
print(adate_leng)
if DEBUG:
    display(adate_leng.obs.sample(5))
    display(adate_leng.var.sample(5))

AnnData object with n_obs × n_vars = 41970 × 32826
    obs: 'SampleID', 'donor_id', 'BraakStage', 'SampleBatch', 'nUMI', 'nGene', 'initialClusterAssignments', 'seurat.clusters', 'clusterAssignment', 'clusterCellType', 'tissue_ontology_term_id', 'cell_type_ontology_term_id', 'assay_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'suspension_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'study'
    obsm: 'X_cca', 'X_cca.aligned', 'X_tsne'


Unnamed: 0,SampleID,donor_id,BraakStage,SampleBatch,nUMI,nGene,initialClusterAssignments,seurat.clusters,clusterAssignment,clusterCellType,...,suspension_type,cell_type,assay,disease,organism,sex,tissue,self_reported_ethnicity,development_stage,study
EC1_AGCAGCCTCGTAGGAG,EC1,1,0,C,3983.0,2116,EC:c5,8,EC:Exc.4,Exc,...,nucleus,glutamatergic neuron,10x 3' v2,Alzheimer disease,Homo sapiens,male,entorhinal cortex,unknown,50-year-old human stage,Leng
EC3_CTAAGACAGGACTGGT,EC3,3,0,C,776.0,576,EC:c14,5,EC:Inh.1,Inh,...,nucleus,GABAergic neuron,10x 3' v2,Alzheimer disease,Homo sapiens,male,entorhinal cortex,unknown,71-year-old human stage,Leng
SFG7_TCAGATGAGCTCCTCT,SFG7,7,2,D,210.0,184,SFG:c5,13,SFG:Exc.6,Exc,...,nucleus,glutamatergic neuron,10x 3' v2,Alzheimer disease,Homo sapiens,male,superior frontal gyrus,unknown,80 year-old and over human stage,Leng
EC3_GATGAGGTCAACCATG,EC3,3,0,C,502.0,412,EC:c6,4,EC:Exc.1,Exc,...,nucleus,glutamatergic neuron,10x 3' v2,Alzheimer disease,Homo sapiens,male,entorhinal cortex,unknown,71-year-old human stage,Leng
SFG4_ATCTGCCTCAGGCAAG,SFG4,4,2,C,5393.0,2809,SFG:c2,12,SFG:Exc.5,Exc,...,nucleus,glutamatergic neuron,10x 3' v2,Alzheimer disease,Homo sapiens,male,superior frontal gyrus,unknown,72-year-old human stage,Leng


ENSG00000187260
ENSG00000104140
ENSG00000112232
ENSG00000272366
ENSG00000157193


##### load the Morabito et al data

In [48]:
%%time
morabita_data_file = f'{public_dir}/Morabita_snRNA_ATAC/GSE174367_snRNA-seq_filtered_feature_bc_matrix.h5'
adata_morabita = read_10x_h5(morabita_data_file)
adata_morabita.var_names_make_unique()
print(adata_morabita)
if DEBUG:
    display(adata_morabita.obs.sample(5))
    display(adata_morabita.var.sample(5))    

  utils.warn_names_duplicates("var")


AnnData object with n_obs × n_vars = 61770 × 58721
    var: 'gene_ids', 'feature_types', 'genome'


CTCCCAAAGTAACGAT-10
AATCACGAGCTTACGT-15
GTCGTTCCAAGTGACG-1
CACGTGGTCCTAGAGT-7
CACCGTTAGATAGCTA-12


Unnamed: 0,gene_ids,feature_types,genome
MAST4-IT1,ENSG00000249057.1,Gene Expression,GRCh38.p12.premrna2
CCDC178,ENSG00000166960.16,Gene Expression,GRCh38.p12.premrna2
RNU6-1085P,ENSG00000200446.1,Gene Expression,GRCh38.p12.premrna2
FMO10P,ENSG00000234984.1,Gene Expression,GRCh38.p12.premrna2
AC034223.2,ENSG00000251281.1,Gene Expression,GRCh38.p12.premrna2


CPU times: user 9.4 s, sys: 1.14 s, total: 10.5 s
Wall time: 10.5 s


In [49]:
morabita_info_file = f'{public_dir}/Morabita_snRNA_ATAC/GSE174367_snRNA-seq_cell_meta.csv.gz'
morabita_info = read_csv(morabita_info_file)
print(morabita_info.shape)
# add study label
morabita_info['study'] = 'Morabita'
# make cell IDs the index
morabita_info = morabita_info.set_index('Barcode')
# keep only info for cells present in data
morabita_info = morabita_info.loc[morabita_info.index.isin(adata_morabita.obs.index)]
print(morabita_info.shape)
if DEBUG:
    display(morabita_info.sample(5))

(61472, 12)
(61472, 12)


Barcode,SampleID,Diagnosis,Batch,Cell.Type,cluster,Age,Sex,PMI,Tangle.Stage,Plaque.Stage,RIN,study
TTTCACAAGTTTGAGA-9,Sample-96,Control,1,ODC,ODC10,79,M,7.0,Stage 1,Stage A,7.1,Morabita
CAATCGACAGCTGTTA-2,Sample-43,AD,1,INH,INH3,90,F,4.17,Stage 6,Stage B,8.9,Morabita
AAGTCGTGTTCTCAGA-15,Sample-45,AD,1,ODC,ODC7,89,F,3.88,Stage 6,Stage B,7.8,Morabita
CATTGAGAGAATTTGG-17,Sample-66,Control,2,ODC,ODC1,90,F,2.92,Stage 2,,10.0,Morabita
AACAAGATCAATCCAG-16,Sample-100,Control,1,ODC,ODC5,79,M,,Stage 2,Stage B,8.7,Morabita


##### add the Morabito info to the obs attribute

In [50]:
# number of cell barcodes present in only one of the metadata or the data
print(len(set(morabita_info.index) ^ set(adata_morabita.obs.index)))
# drop any cells that have no metadata
adata_morabita = adata_morabita[adata_morabita.obs.index.isin(morabita_info.index)]
print(adata_morabita)
# after filtering, the data and the metadata should hold the same cells
assert len(set(morabita_info.index) ^ set(adata_morabita.obs.index)) == 0
adata_morabita.obs = concat([adata_morabita.obs, morabita_info], axis='columns')
# keep only the control samples
adata_morabita = adata_morabita[adata_morabita.obs.Diagnosis == 'Control']
print(adata_morabita)
if DEBUG:
    display(adata_morabita.obs.sample(5))
    display(adata_morabita.var.sample(5))    

298
View of AnnData object with n_obs × n_vars = 61472 × 58721
    var: 'gene_ids', 'feature_types', 'genome'
View of AnnData object with n_obs × n_vars = 22796 × 58721
    obs: 'SampleID', 'Diagnosis', 'Batch', 'Cell.Type', 'cluster', 'Age', 'Sex', 'PMI', 'Tangle.Stage', 'Plaque.Stage', 'RIN', 'study'
    var: 'gene_ids', 'feature_types', 'genome'


Unnamed: 0,SampleID,Diagnosis,Batch,Cell.Type,cluster,Age,Sex,PMI,Tangle.Stage,Plaque.Stage,RIN,study
AGTGACTAGAGGATCC-17,Sample-66,Control,2,MG,MG1,90,F,2.92,Stage 2,,10.0,Morabita
TATTGCTAGAGTATAC-9,Sample-96,Control,1,OPC,OPC1,79,M,7.0,Stage 1,Stage A,7.1,Morabita
TTCGGTCAGACGAAGA-9,Sample-96,Control,1,ODC,ODC3,79,M,7.0,Stage 1,Stage A,7.1,Morabita
GATGACTTCGCATTGA-16,Sample-100,Control,1,ODC,ODC3,79,M,,Stage 2,Stage B,8.7,Morabita
CTAACCCAGACGCCCT-13,Sample-58,Control,3,ODC,ODC7,90,M,4.0,Stage 2,Stage B,8.1,Morabita


Unnamed: 0,gene_ids,feature_types,genome
MIR410,ENSG00000199092.4,Gene Expression,GRCh38.p12.premrna2
AL050303.1,ENSG00000224922.1,Gene Expression,GRCh38.p12.premrna2
SRP19,ENSG00000153037.14,Gene Expression,GRCh38.p12.premrna2
NSFL1C,ENSG00000088833.17,Gene Expression,GRCh38.p12.premrna2
AC091167.4,ENSG00000271396.1,Gene Expression,GRCh38.p12.premrna2


### remove doublets from the replication data