# Building New Models to Benchmark Against Celltypist
This notebook is primarily used to inspect the datasets the models will be trained on, to confirm that the data looks good and that all the required features are present. It also splits each dataset into train and test sets for use in train_models_celltypist.py and annotate_data.py (and, by extension, 'Benchmaking Models.ipynb').

### Order of Models: 
0. Train the basic CellTypist model on this data for easy comparison with other models later
1. Remove the feature selection from CellTypist (so it only trains the model once)
2. Train the model with L1 regularization instead of L2
3. Train the model only once with only Cytopus genes
4. At the feature selection step, make sure the Cytopus genes are included in the list of top genes
5. Combine models 2 & 4. Use L1 during first step, then make sure Cytopus genes are included, then switch back to L2 regularization
6. Pick top 10,000 most variable genes and have dataset include just those genes before scaling data
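
As an illustration of model 6, here is a minimal sketch of selecting the most variable genes before scaling. It assumes a dense cells × genes matrix and ranks genes by plain per-gene variance; the function name `top_variable_genes` and the toy data are hypothetical, and the actual models may rank genes differently (for example with a scanpy routine).

```python
import numpy as np

def top_variable_genes(X, gene_names, k):
    """Return the names of the k genes (columns of X) with the largest variance."""
    variances = X.var(axis=0)
    top_idx = np.argsort(variances)[::-1][:k]   # column indices, highest variance first
    return [gene_names[i] for i in top_idx]

# toy matrix: 3 cells x 3 genes (G0 is constant, G2 varies the most)
X = np.array([[0.0, 1.0, 5.0], [0.0, 2.0, 0.0], [0.0, 3.0, 10.0]])
top2 = top_variable_genes(X, ["G0", "G1", "G2"], k=2)
```

For the real data, `k` would be 10,000 and the dataset would be subset to the returned genes before scaling.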

### Data Used: 
CT_45 
- from Conde et al. 2022 ('Cross-tissue immune cell analysis reveals tissue-specific features in humans')
- CountAdded_PIP_global_object_for_cellxgene.h5ad
- models trained on this data are saved as 'ct_model_#.pkl'

CT_98
- see CT_45
- CellTypist_Immune_Reference_v2_count.h5ad
- models trained on this data are saved as '98_model_#.pkl'

COV_PBMC
- PBMC cells (Infected with COVID)
- haniffa21.processed.h5ad
- models trained on this are saved as 'COV_model_#.pkl'

Glasner
- from Glasner et al. 2023 ('Conserved transcriptional connectivity of regulatory T cells in the tumor microenvironment informs new combination cancer therapy strategies')
- Lung adenocarcinoma 
- glasner_etal_globalAnndata_20230112.vHTA.h5ad + annotations from 'ad_endo_LS_20211026.results.h5ad', 'ad_fib_scranLogNorm_filt_20220113.h5ad', 'glasner_ad_myeloid_celltypist_20230606.h5ad'
- models trained on this are saved as 'g_model_#.pkl'

HBCA 
- from Kumar et al. 2023 ('A spatially resolved single-cell genomic atlas of the adult human breast')
- Breast (healthy)
- local.h5ad from cellxgene (renamed Kumar2023_breast.h5ad)
- models produced are saved as 'HBCA_model_#.pkl'

LuCA
- from Salcher et al. 2022 ('High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer')
- local.h5ad from cellxgene (renamed Salcher2022.h5ad)
- Non-small cell lung cancer
- models saved as 'LuCA_model_#.pkl'

HuBMAP
- from Lake et al. 2023 ('An atlas of healthy and injured cell states and niches in the human kidney') 
- Kidney (healthy and injured)
- local.h5ad from cellxgene (renamed Lake2023.h5ad)
- models saved as 'HuBMAP_model_#.pkl'

Niec/Chu Data 
- from Niec et al. 2022 ('Lymphatics act as a signaling hub to regulate intestinal stem cell activity')
- Colon
- SI_adata_paper.h5ad and LI_adata_paper.h5ad -> two models for the two datasets/parts of the intestine 
- models saved as 'Niec_SI_model_#.pkl' or 'Niec_LI_model_#.pkl'

Brown Data
- from Brown et al. 2019 ('Transcriptional Basis of Mouse and Human Dendritic Cell Heterogeneity')
- Dendritic cells 
- mouse_spleen_normlog.h5ad
- models saved as 'Brown_model_#.pkl'

In [2]:
import scanpy as sc
import pandas as pd
import anndata as ad
import numpy as np
import itertools
from anndata import AnnData
from scipy.sparse import spmatrix
from datetime import datetime
from typing import Optional, Union
from sklearn import __version__ as skv

import cytopus as cp

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

#import celltypist as ct #if it's throwing an error with sklearn, installing scikit-learn version 1.1.0 should fix it
#from celltypist import logger 
#from celltypist.models import Model
import logger
from models import Model
import train

## Functions 
From celltypist/train.py with some edits/additions

In [3]:
def train_test_split(adata, frac: float = 0.75):
    """
    USING OUTLINE OF CODE FROM trVAE https://doi.org/10.1093/bioinformatics/btaa800
    Split AnnData into test and train datasets - maintains annotations
    
    Params: 
    adata
        Annotated data matrix (Anndata)
    frac
        Fraction of cells to be used in the training set
    """
    no_idx_train = int(adata.shape[0] * frac)
    indices = np.arange(adata.shape[0])
    np.random.shuffle(indices)
    train_index = indices[:no_idx_train]
    test_index = indices[no_idx_train:]
    train = adata[train_index]
    test = adata[test_index]
    return train, test
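
The split above uses NumPy's global random state, so it changes from run to run unless a seed is set elsewhere. A minimal sketch of a reproducible variant follows; the helper name `split_indices` and the `seed` parameter are assumptions, not part of the original code, and it returns index arrays rather than AnnData objects.

```python
import numpy as np

def split_indices(n_cells, frac=0.75, seed=0):
    """Return (train_idx, test_idx) as a seeded random partition of range(n_cells)."""
    rng = np.random.default_rng(seed)      # seeded generator -> same split on every run
    indices = rng.permutation(n_cells)     # shuffled copy of 0..n_cells-1
    n_train = int(n_cells * frac)          # same rounding rule as train_test_split above
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(10, frac=0.75, seed=0)
```

The index arrays could then be used the same way as in `train_test_split`, e.g. `adata[train_idx]` and `adata[test_idx]`.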

## Data
Loading cytopus cell type dictionary

In [6]:
G = cp.kb.KnowledgeBase()
cell_dict = G.identities

#make a list of genes from cytopus dict & remove NaNs
#the celltype information doesn't need to be retained since we're applying this gene list to all celltypes
cp_genes = []
for i in cell_dict.values():
    cp_genes.append(i)
cp_genes = list(itertools.chain(*cp_genes)) #make flatlist out of LoL
cp_genes = [x for x in cp_genes if str(x) != 'nan']

KnowledgeBase object containing 75 cell types and 201 cellular processes
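
To make the flattening step concrete, here is a small sketch on a toy dictionary. It assumes the dictionary maps each cell type to a list of gene names that may contain NaN entries; `flatten_gene_lists` and the toy values are hypothetical, and unlike the cell above it also removes duplicates.

```python
import itertools
import math

def flatten_gene_lists(cell_dict):
    """Flatten {cell_type: [genes...]} into a sorted list of unique gene names, skipping NaN."""
    flat = itertools.chain.from_iterable(cell_dict.values())
    genes = {g for g in flat if not (isinstance(g, float) and math.isnan(g))}
    return sorted(genes)

# toy stand-in for the cytopus dictionary: one NaN and one gene shared by two cell types
toy_dict = {"B": ["CD19", "MS4A1", float("nan")], "T": ["CD3E", "CD19"]}
toy_genes = flatten_gene_lists(toy_dict)
```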



### CT_45
Loading in data from celltypist

In [18]:
#adatact_45 = ad.read('../../Data/CountAdded_PIP_global_object_for_cellxgene.h5ad') #local location
adatact_45 = ad.read('/data/peer/adamsj5/cell_typing/CountAdded_PIP_global_object_for_cellxgene.h5ad') #lilac location
#sc.pp.subsample(adata, n_obs = 75000)

In [4]:
trainct_45, testct_45 = train_test_split(adatact_45)

In [11]:
trainct_45

View of AnnData object with n_obs × n_vars = 230833 × 36601
    obs: 'Organ', 'Donor', 'Chemistry', 'Cell_category', 'Predicted_labels_CellTypist', 'Majority_voting_CellTypist', 'Majority_voting_CellTypist_high', 'Manually_curated_celltype', 'Sex', 'Age_range'
    uns: 'Age_range_colors', 'Sex_colors'
    obsm: 'X_umap'
    layers: 'counts'

In [7]:
#trainct_45, testct_45 = train_test_split(adatact_45, 0.3)
trainct_45 = ad.read('/data/peer/adamsj5/cell_typing/train_test_data/CT_45_Train.h5ad')
indatact_45 = trainct_45.X
labelsct_45 = trainct_45.obs["Manually_curated_celltype"]
genesct_45 = trainct_45.var_names

In [8]:
(np.abs(np.expm1(indatact_45[0]).sum()-10000) > 1)

False
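
The check above undoes the log1p transform on one cell and tests whether the counts sum to roughly 10,000; `False` means the cell is within 1 of that target. A minimal sketch of the same check as a reusable function, assuming a dense 1-D row (the name `looks_normalized` and the toy counts are hypothetical):

```python
import numpy as np

def looks_normalized(log_row, target=10_000, tol=1.0):
    """True if undoing log1p on one cell's expression row sums to about `target`."""
    return bool(abs(np.expm1(log_row).sum() - target) <= tol)

counts = np.array([3.0, 1.0, 0.0, 6.0])             # toy raw counts for one cell
log_row = np.log1p(counts / counts.sum() * 10_000)  # normalize to 10,000, then log1p
```

Note the inverted sense: this helper returns `True` when the row passes, whereas the cell above prints `False` when it passes.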

Making a data table that only includes genes from cytopus (for model 3)

In [None]:
cp_and_ct_genes_45 = [x for x in cp_genes if x in trainct_45.var_names]
cp_and_ct_genes_45 = np.unique(cp_and_ct_genes_45)

In [None]:
trainct_45_cp = trainct_45[:, cp_and_ct_genes_45]
indatact_45_cp = trainct_45_cp.X
labelsct_45_cp = trainct_45_cp.obs["Manually_curated_celltype"]
genesct_45_cp = trainct_45_cp.var_names
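
The gene filtering for model 3 can also be sketched without AnnData. This assumes the gene names in the dataset are unique; `genes_in_dataset` is a hypothetical helper, and it keeps the dataset's gene order instead of the alphabetical order that `np.unique` produces.

```python
def genes_in_dataset(candidate_genes, var_names):
    """Unique candidate genes that are present in var_names, in var_names order."""
    wanted = set(candidate_genes)
    return [g for g in var_names if g in wanted]

# toy example: one candidate is duplicated and one is missing from the dataset
kept = genes_in_dataset(["CD19", "XYZ", "CD3E", "CD19"], ["CD3E", "ACTB", "CD19"])
```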

In [5]:
trainct_45.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/CT_45_Train.h5ad') #lilac location
testct_45.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/CT_45_Test.h5ad')

### CT_98
v2 of CellTypist training data that I believe the Immune_All models were trained on. 

In [6]:
#adatact_98 = ad.read('../../Data/CellTypist_Immune_Reference_v2_count.h5ad') #local location
adatact_98 = ad.read('/data/peer/adamsj5/cell_typing/CellTypist_Immune_Reference_v2_count.h5ad') #lilac location

This data is not normalized or transformed. Before training, we need to normalize the counts in each cell and then transform them. CellTypist expects data normalized to 10,000 counts per cell, but this has been shown not to be the best technique (Ahlmann-Eltze & Huber, 2023). For our models, we recommend normalizing to the median library size instead. CellTypist's expression check can be bypassed by setting the 'check_expression' argument of the train function to False. After normalizing, the standard is to log transform the data with a pseudocount of 1.
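
As a worked illustration of median library size normalization, here is a minimal NumPy sketch on a toy dense matrix. It assumes no cell has zero total counts; the function name `normalize_to_median` is hypothetical. In the notebook itself, `sc.pp.normalize_total` with no `target_sum` does the equivalent scaling to the median of the per-cell totals.

```python
import numpy as np

def normalize_to_median(counts):
    """Scale each cell (row) so its total equals the median library size, then log1p."""
    lib_size = counts.sum(axis=1)                 # total counts per cell
    target = np.median(lib_size)                  # median library size
    scaled = counts / lib_size[:, None] * target  # every row now sums to `target`
    return np.log1p(scaled), target

# toy counts: library sizes are 4, 8 and 20, so the median is 8
toy = np.array([[1.0, 1.0, 2.0], [2.0, 2.0, 4.0], [5.0, 5.0, 10.0]])
log_norm, target = normalize_to_median(toy)
```

Undoing the transform with `np.expm1(log_norm).sum(axis=1)` should give the median library size back for every cell, which is the check used later in this section.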

In [None]:
#find median library size & normalize to that value
#row sums give the library size of every cell (vectorized, no hard-coded cell count)
lib_size = np.asarray(adatact_98.X.sum(axis=1)).ravel()

In [None]:
med_ls = np.median(lib_size) #4725
#med_ls = 4725

In [7]:
sc.pp.normalize_total(adatact_98)

In [None]:
adatact_98.X[0].sum()

In [8]:
#log transform
adatact_98.X= np.log1p(adatact_98.X)

In [None]:
#check that the data looks ok (want this to be less than 1): 
np.abs(np.expm1(adatact_98.X[0]).sum()-med_ls) 

In [None]:

trainct_98 = ad.read('/data/peer/adamsj5/cell_typing/train_test_data/CT_98_train.h5ad')
indatact_98 = trainct_98.X
labelsct_98 = trainct_98.obs["Harmonised_detailed_type"]
genesct_98 = trainct_98.var_names

In [None]:
cp_and_ct_genes_98 = [x for x in cp_genes if x in trainct_98.var_names]
cp_and_ct_genes_98 = np.unique(cp_and_ct_genes_98)

In [None]:
trainct_98_cp = trainct_98[:, cp_and_ct_genes_98]
indatact_98_cp = trainct_98_cp.X
labelsct_98_cp = trainct_98_cp.obs["Harmonised_detailed_type"]
genesct_98_cp = trainct_98_cp.var_names

In [10]:
trainct_98.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/CT_98_Train.h5ad') #lilac location
testct_98.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/CT_98_Test.h5ad') #lilac location

### COV_PBMC
The last 192 features in this matrix are antibodies, not genes. The matrix is normalized to the number of counts per gene, excluding the antibodies. Since we don't want to include antibodies in our model either, I remove those columns.

In [12]:
#adata_COV = ad.read('../../Data/haniffa21.processed.h5ad') #local location
adata_COV = ad.read('/data/peer/adamsj5/cell_typing/haniffa21.processed.h5ad') #lilac location

In [13]:
#remove antibody columns
rna_only = [j for j in adata_COV.var_names if 'AB_' not in j]
rna_col_id = [adata_COV.var_names.get_loc(j) for j in rna_only]
adata_COV = adata_COV[:,np.asarray(rna_col_id)]
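
The same column filter can be written as a boolean mask instead of a list of integer positions. A minimal sketch on toy feature names, assuming the antibody features are the ones whose names contain 'AB_' (the helper `rna_mask` and the toy names are hypothetical):

```python
import numpy as np

def rna_mask(var_names, prefix="AB_"):
    """Boolean mask that is True for every feature whose name does not contain `prefix`."""
    return np.array([prefix not in name for name in var_names])

toy_names = ["CD3E", "MS4A1", "AB_CD3", "AB_CD19"]
mask = rna_mask(toy_names)
```

With the real data, a mask built from `adata_COV.var_names` should be usable directly as `adata_COV[:, mask]`.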

In [14]:
train_COV, test_COV = train_test_split(adata_COV)

In [None]:
train_COV = ad.read('/data/peer/adamsj5/cell_typing/train_test_data/COV_Train.h5ad')
indata_COV = train_COV.X
labels_COV = train_COV.obs["full_clustering"]
genes_COV = train_COV.var_names

In [None]:
cp_and_ct_genes_COV = [x for x in cp_genes if x in train_COV.var_names]
cp_and_ct_genes_COV = np.unique(cp_and_ct_genes_COV)

In [None]:
train_COV_cp = train_COV[:, cp_and_ct_genes_COV]
indata_COV_cp = train_COV_cp.X
labels_COV_cp = train_COV_cp.obs["full_clustering"]
genes_COV_cp = train_COV_cp.var_names

In [15]:
train_COV.write('/data/peer/adamsj5/cell_typing/train_test_data/COV_Train.h5ad')# lilac location
test_COV.write('/data/peer/adamsj5/cell_typing/train_test_data/COV_Test.h5ad')# lilac location

### Glasner

This dataset combines the cell type labels from 4 datasets: the overall coarsely annotated data, finely annotated endothelial cell data, finely annotated fibroblast data, and finely annotated myeloid cell data. The coarse dataset is the "base" that the other annotations were added to. Because most immune cell types here only have very high-level labels, and cytopus only contains information about immune cells at a much higher resolution of labels, I won't make models 3 & 4 for this data, since they rely on that information.

Additionally, this dataset is normalized to the median library size and then log transformed with a pseudocount of 0.1 (not 1).

##### To integrate cell type labels

In [None]:
#adata_g = ad.read('../../Data/glasner_etal_globalAnndata_20230112.vHTA.h5ad') #annotations too coarse, local location
adata_g = ad.read('../../Data/cell_typing/glasner_etal_globalAnndata_20230112.vHTA.h5ad') #annotations too coarse, lilac location

In [None]:
adata_g

In [None]:
adata_g.var = adata_g.var.set_index('gene_name')

In [None]:
#adata_g_endo = ad.read('../../Data/ad_endo_LS_20211026.results.h5ad') #local location
#adata_g_fib = ad.read('../../Data/ad_fib_scranLogNorm_filt_20220113.h5ad') #local location
#adata_g_myl = ad.read('../../Data/glasner_ad_myeloid_celltypist_20230606.h5ad') #local location

adata_g_endo = ad.read('../../Data/cell_typing/ad_endo_LS_20211026.results.h5ad') #lilac location
adata_g_fib = ad.read('../../Data/cell_typing/ad_fib_scranLogNorm_filt_20220113.h5ad') #lilac location
adata_g_myl = ad.read('../../Data/cell_typing/glasner_ad_myeloid_celltypist_20230606.h5ad') #lilac location

In [None]:
adata_glas = adata_g.copy()

In [None]:
finer_cell_types = []
orig_cell_types = [] 

for x in adata_g.obs_names:
    g_idx = np.where(adata_g.obs_names == x)[0][0]
    orig_cell_types.append(adata_glas[g_idx].obs["cell_lineage"].values[0])
    if x in adata_g_endo.obs_names:
        endo_idx = np.where(adata_g_endo.obs_names == x)[0][0]
        finer_cell_types.append(adata_g_endo[endo_idx].obs["granular_cell_type"].values[0])
    elif x in adata_g_myl.obs_names:
        myl_idx = np.where(adata_g_myl.obs_names == x)[0][0]
        finer_cell_types.append(adata_g_myl[myl_idx].obs["cell_type"].values[0])
    elif x in adata_g_fib.obs_names:
        fib_idx = np.where(adata_g_fib.obs_names == x)[0][0]
        finer_cell_types.append(adata_g_fib[fib_idx].obs["granular_cell_type"].values[0])
    else:
        finer_cell_types.append(adata_glas[g_idx].obs["cell_lineage"].values[0])
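
The precedence used in the loop above (endothelial, then myeloid, then fibroblast, then the coarse label) can be sketched with plain dictionaries keyed by cell barcode. This is only an illustration: `merge_labels`, the barcodes and the label names are hypothetical, and each dictionary stands in for one dataset's `obs` column.

```python
def merge_labels(coarse, *fine_sources):
    """For each cell id in `coarse`, take the label from the first fine source that has it."""
    merged = {}
    for cell_id, coarse_label in coarse.items():
        merged[cell_id] = next(
            (src[cell_id] for src in fine_sources if cell_id in src),
            coarse_label,                      # no finer annotation: keep the coarse label
        )
    return merged

coarse = {"c1": "Endothelial", "c2": "Myeloid", "c3": "T cell"}
endo = {"c1": "Arterial EC"}
myeloid = {"c2": "Macrophage"}
merged = merge_labels(coarse, endo, myeloid)
```

Dictionary lookups avoid the repeated `np.where` scans over all barcodes, which matters for a dataset of this size.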

In [None]:
adata_glas.obs["finer_cell_types"] = finer_cell_types
adata_glas.obs["orig_cell_types"] = orig_cell_types

In [None]:
#confirm that they're in generally the right order
f1_score(adata_glas.obs["cell_lineage"], adata_glas.obs["orig_cell_types"], average = None)

In [None]:
adata_glas.write('../../Data/cell_typing/glasner_fine_annot.h5ad')

##### More granular dataset: 

In [None]:
adata_glas = ad.read('/data/peer/adamsj5/cell_typing/glasner_fine_annot.h5ad') #lilac location

In [3]:
#train_glas, test_glas = train_test_split(adata_glas)
train_glas = ad.read('/data/peer/adamsj5/cell_typing/train_test_data/train_glas.h5ad')
indata_glas = train_glas.X
labels_glas = train_glas.obs['finer_cell_types']
genes_glas = train_glas.var_names

In [None]:
cp_and_ct_genes_glas = [x for x in cp_genes if x in train_glas.var_names]
cp_and_ct_genes_glas = np.unique(cp_and_ct_genes_glas)

In [None]:
train_glas_cp = train_glas[:, cp_and_ct_genes_glas]
indata_glas_cp = train_glas_cp.X
labels_glas_cp = train_glas_cp.obs["finer_cell_types"]
genes_glas_cp = train_glas_cp.var_names

In [None]:
train_glas.write('/data/peer/adamsj5/cell_typing/train_test_data/train_glas.h5ad')# lilac location
test_glas.write('/data/peer/adamsj5/cell_typing/train_test_data/test_glas.h5ad')# lilac location

In [None]:
model, df = train.train_1(X = indata_glas, labels = labels_glas, genes = genes_glas, check_expression = False, use_SGD = True, mini_batch = True, epochs = 1,  balance_cell_type = True, feature_selection = True)

### HBCA

In [10]:
adata_HBCA = ad.read('/data/peer/adamsj5/cell_typing/Kumar2023_breast.h5ad')

In [11]:
adata_HBCA.var = adata_HBCA.var.set_index('feature_name')

In [12]:
adata_HBCA

AnnData object with n_obs × n_vars = 714331 × 33234
    obs: 'mapped_reference_assembly', 'mapped_reference_annotation', 'alignment_software', 'donor_id', 'self_reported_ethnicity_ontology_term_id', 'donor_living_at_sample_collection', 'donor_menopausal_status', 'organism_ontology_term_id', 'sample_uuid', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'sample_derivation_process', 'sample_source', 'donor_BMI_at_collection', 'suspension_depleted_cell_types', 'suspension_derivation_process', 'suspension_dissociation_reagent', 'suspension_dissociation_time', 'suspension_percent_cell_viability', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'sequencing_platform', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'cell_state', 'disease_ontology_term_id', 'sex_ontology_term_id', 'n_count_rna', 'n_feature_rna', 'percent_mito', 'percent_rb', 'tissue_location', 'bmi_group', 'procedure_group', 'ag

In [18]:
train_HBCA, test_HBCA = train_test_split(adata_HBCA)

In [19]:
train_HBCA.write('/data/peer/adamsj5/cell_typing/train_test_data/HBCA_Train.h5ad')# lilac location
test_HBCA.write('/data/peer/adamsj5/cell_typing/train_test_data/HBCA_Test.h5ad')# lilac location

### LuCA

In [26]:
adata_LUCA = ad.read('/data/peer/adamsj5/cell_typing/Salcher2022.h5ad')

In [27]:
adata_LUCA

AnnData object with n_obs × n_vars = 892296 × 17811
    obs: 'sample', 'uicc_stage', 'ever_smoker', 'age', 'donor_id', 'origin', 'dataset', 'ann_fine', 'cell_type_predicted', 'doublet_status', 'leiden', 'n_genes_by_counts', 'total_counts', 'total_counts_mito', 'pct_counts_mito', 'ann_coarse', 'cell_type_tumor', 'tumor_stage', 'EGFR_mutation', 'TP53_mutation', 'ALK_mutation', 'BRAF_mutation', 'ERBB2_mutation', 'KRAS_mutation', 'ROS_mutation', 'origin_fine', 'study', 'platform', 'cell_type_major', 'suspension_type', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'is_highly_variable', 'mito', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total

In [28]:
adata_LUCA.X = adata_LUCA.raw.X

In [29]:
sc.pp.normalize_total(adata_LUCA)

In [30]:
adata_LUCA.X= np.log1p(adata_LUCA.X)

In [31]:
np.expm1(adata_LUCA.X[2]).sum()

2969.9846

In [32]:
adata_LUCA.var = adata_LUCA.var.set_index('feature_name')

In [34]:
adata_LUCA.var

| feature_name | is_highly_variable | mito | n_cells_by_counts | mean_counts | pct_dropout_by_counts | total_counts | feature_is_filtered | feature_reference | feature_biotype |
|---|---|---|---|---|---|---|---|---|---|
| A1BG | True | False | 86866 | 2.527138 | 90.681054 | 2355657.0 | False | NCBITaxon:9606 | gene |
| A1BG-AS1 | True | False | 12257 | 0.369194 | 98.685074 | 344142.0 | False | NCBITaxon:9606 | gene |
| A2M | True | False | 122241 | 22.579874 | 86.886039 | 21047694.0 | False | NCBITaxon:9606 | gene |
| A2M-AS1 | False | False | 8827 | 0.172223 | 99.053043 | 160537.0 | False | NCBITaxon:9606 | gene |
| A2ML1 | True | False | 5096 | 0.038011 | 99.453303 | 35432.0 | False | NCBITaxon:9606 | gene |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZXDC | False | False | 43291 | 0.442399 | 95.355760 | 412380.0 | False | NCBITaxon:9606 | gene |
| ZYG11A | False | False | 3517 | 0.009742 | 99.622698 | 9081.0 | False | NCBITaxon:9606 | gene |
| ZYG11B | False | False | 71512 | 0.275276 | 92.328224 | 256597.0 | False | NCBITaxon:9606 | gene |
| ZYX | False | False | 205370 | 4.536451 | 77.967996 | 4228626.0 | False | NCBITaxon:9606 | gene |


In [33]:
train_LUCA, test_LUCA = train_test_split(adata_LUCA)
train_LUCA.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/LuCA_Train.h5ad')
test_LUCA.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/LuCA_Test.h5ad')

In [23]:
train_LUCA = ad.read('/data/peer/adamsj5/cell_typing/train_test_data/LuCA_Train.h5ad')
test_LUCA = ad.read('/data/peer/adamsj5/cell_typing/train_test_data/LuCA_Test.h5ad')

In [25]:
test_HBCA = ad.read('/data/peer/adamsj5/cell_typing/train_test_data/HBCA_Test.h5ad')
test_HBCA

AnnData object with n_obs × n_vars = 214300 × 33234
    obs: 'mapped_reference_assembly', 'mapped_reference_annotation', 'alignment_software', 'donor_id', 'self_reported_ethnicity_ontology_term_id', 'donor_living_at_sample_collection', 'donor_menopausal_status', 'organism_ontology_term_id', 'sample_uuid', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'sample_derivation_process', 'sample_source', 'donor_BMI_at_collection', 'suspension_depleted_cell_types', 'suspension_derivation_process', 'suspension_dissociation_reagent', 'suspension_dissociation_time', 'suspension_percent_cell_viability', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'sequencing_platform', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'cell_state', 'disease_ontology_term_id', 'sex_ontology_term_id', 'n_count_rna', 'n_feature_rna', 'percent_mito', 'percent_rb', 'tissue_location', 'bmi_group', 'procedure_group', 'ag

In [12]:
len(train_LUCA.var_names)

17811

### HuBMAP

In [3]:
adata_HuBMAP = ad.read('/data/peer/adamsj5/cell_typing/Lake2023.h5ad')

In [9]:
adata_HuBMAP

AnnData object with n_obs × n_vars = 304652 × 33920
    obs: 'nCount_RNA', 'nFeature_RNA', 'library', 'percent.er', 'percent.mt', 'degen.score', 'aEpi.score', 'aStr.score', 'cyc.score', 'matrisome.score', 'collagen.score', 'glycoprotein.score', 'proteoglycan.score', 'S.Score', 'G2M.Score', 'experiment', 'specimen', 'condition.long', 'condition.l1', 'condition.l2', 'donor_id', 'region.l1', 'region.l2', 'percent.cortex', 'percent.medulla', 'tissue_type', 'id', 'pagoda_k100_infomap_coembed', 'subclass.full', 'subclass.l3', 'subclass.l2', 'subclass.l1', 'state.l2', 'state', 'class', 'structure', 'disease_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'eGFR', 'BMI', 'diabetes_history', 'hypertension', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'is_primary_data', 'suspension_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'sel

In [14]:
adata_HuBMAP.var = adata_HuBMAP.var.set_index('feature_name')

In [19]:
np.expm1(adata_HuBMAP.X[30000]).sum()

9957.399

In [21]:
train_HuBMAP, test_HuBMAP = train_test_split(adata_HuBMAP)
train_HuBMAP.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/HuBMAP_Train.h5ad')
test_HuBMAP.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/HuBMAP_Test.h5ad')

### Niec/Chu

In [21]:
adata_Chu_SI = ad.read('/data/peer/adamsj5/cell_typing/SI_adata_paper.h5ad')
adata_Chu_LI = ad.read('/data/peer/adamsj5/cell_typing/LI_adata_paper.h5ad')

In [6]:
adata_Chu_SI

AnnData object with n_obs × n_vars = 2239 × 18573
    obs: 'latent_cell_probability', 'latent_RT_efficiency', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'predicted_doublet', 'doublet_score', 'Phenograph_cluster_k15_dbRM', 'coarse_cluster_dbRM', 'cell_state'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'Phenograph_cluster_k15_dbRM_colors', 'cell_state_colors', 'coarse_cluster_dbRM_colors', 'dendrogram_Phenograph_cluster_k15_dbRM', 'hvg', 'neighbors', 'pca', 'test_elbo', 'test_epoch', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'

In [12]:
np.expm1(adata_Chu_SI.X[2]).sum()

9642.598

In [9]:
adata_Chu_LI

AnnData object with n_obs × n_vars = 5163 × 19706
    obs: 'latent_cell_probability', 'latent_RT_efficiency', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'predicted_doublet', 'doublet_score', 'Phenograph_cluster_k15_dbRM', 'coarse_cluster_dbRM', 'cell_state'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'Phenograph_cluster_k15_dbRM_colors', 'cell_state_colors', 'coarse_cluster_dbRM_colors', 'dendrogram_Phenograph_cluster_k15_dbRM', 'hvg', 'neighbors', 'pca', 'test_elbo', 'test_epoch', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'

In [10]:
np.expm1(adata_Chu_LI.X[0]).sum()

6735.3877

In [25]:
train_Chu_SI, test_Chu_SI = train_test_split(adata_Chu_SI)
train_Chu_SI.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/Niec_SI_Train.h5ad')
test_Chu_SI.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/Niec_SI_Test.h5ad')

In [26]:
train_Chu_LI, test_Chu_LI = train_test_split(adata_Chu_LI)
train_Chu_LI.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/Niec_LI_Train.h5ad')
test_Chu_LI.write_h5ad('/data/peer/adamsj5/cell_typing/train_test_data/Niec_LI_Test.h5ad')

### Brown Data

In [4]:
adata_Brown = ad.read('/data/peer/adamsj5/cell_typing/mouse_spleen_normlog.h5ad')

In [5]:
adata_Brown

AnnData object with n_obs × n_vars = 4464 × 11755
    obs: 'sample', 'background', 'total_counts', 'mitochondrial_fraction', 'cell_type', 'cluster', 'tsne_x', 'tsne_y'
    var: 'gene_id'
    obsm: 'X_tsne'

In [6]:
adata_Brown.obs['cell_type'].unique()

['cDC1', 'CCR7hi DC', 'cDC2 Tbet-', 'cDC2 Tbet+', 'Monocyte', 'Siglec-H DC', 'cDC2 Mixed']
Categories (7, object): ['CCR7hi DC', 'Monocyte', 'Siglec-H DC', 'cDC1', 'cDC2 Mixed', 'cDC2 Tbet+', 'cDC2 Tbet-']

In [8]:
adata_Brown.var_names

Index(['0610007P14RIK', '0610009B22RIK', '0610009L18RIK', '0610009O20RIK',
       '0610010F05RIK', '0610010K14RIK', '0610012G03RIK', '0610030E20RIK',
       '0610037L13RIK', '0610040J01RIK',
       ...
       'ZUFSP', 'ZW10', 'ZWILCH', 'ZWINT', 'ZXDB', 'ZXDC', 'ZYG11B', 'ZYX',
       'ZZEF1', 'ZZZ3'],
      dtype='object', length=11755)

In [10]:
np.expm1(adata_Brown.X[2]).sum()

731505.8

In [34]:
remove_some = [j for j in adata_Brown.var_names if 'RP_' not in j]
remove_some_id = [adata_Brown.var_names.get_loc(j) for j in remove_some]
adata_Brown_testing = adata_Brown[:,np.asarray(remove_some_id)]

In [35]:
adata_Brown_testing.var_names

Index(['0610007P14RIK', '0610009B22RIK', '0610009L18RIK', '0610009O20RIK',
       '0610010F05RIK', '0610010K14RIK', '0610012G03RIK', '0610030E20RIK',
       '0610037L13RIK', '0610040J01RIK',
       ...
       'ZUFSP', 'ZW10', 'ZWILCH', 'ZWINT', 'ZXDB', 'ZXDC', 'ZYG11B', 'ZYX',
       'ZZEF1', 'ZZZ3'],
      dtype='object', length=11755)

## Models - All at Once

In [None]:
make_all_models(adatact_45, annot_col = 'Manually_curated_celltype', abrev = 'CT_45', percent_train = 0.20 , write_loc = 'New Models/CT_45 Models/', train_data = '/data/peer/adamsj5/cell_typing/train_test_data/CT_45_Train.h5ad')


In [None]:
make_all_models(adatact_98, annot_col = 'Harmonised_detailed_type', abrev = 'CT_98', percent_train = 0.20 , write_loc = 'New Models/CT_98 Models/', train_data = '/data/peer/adamsj5/cell_typing/train_test_data/CT_98_train.h5ad')


In [None]:
make_all_models(adata_COV, annot_col = 'full_clustering', abrev = 'COV', percent_train = 0.20, write_loc = 'New Models/COV_PBMC Models/', train_data = '/data/peer/adamsj5/cell_typing/train_test_data/COV_Train.h5ad')


In [None]:
make_all_models(adata_glas, annot_col = 'finer_cell_types', abrev = 'g', write_loc = 'New Models/Glasner Models/', train_data = '/data/peer/adamsj5/cell_typing/train_test_data/train_glas.h5ad')


In [None]:
make_all_models(adata_HBCA, annot_col = 'cell_type', abrev = 'HBCA', percent_train = 0.15, write_loc = 'New Models/HBCA Models/', train_data = '/data/peer/adamsj5/cell_typing/train_test_data/HBCA_Train.h5ad')


In [None]:
make_all_models(adata_LUCA, annot_col = 'cell_type', abrev = 'LuCA', percent_train = 0.15, write_loc = 'New Models/LuCA Models/')
