# Testing and Benchmarking CellTypist Models
### List of models (built in `Making New Models.ipynb`)
1. CellTypist without the feature selection step, so the model is trained only once
2. Model trained with L1 regularization instead of L2
3. Model trained once, using only the Cytopus genes
4. Feature selection kept, with the Cytopus genes forced into the list of top genes
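
The idea behind model 4 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from `Making New Models.ipynb`: the function name `select_genes`, the toy coefficients, and the choice to rank genes by absolute coefficient are all hypothetical.

```python
def select_genes(coefs_by_celltype, top_n, required_genes):
    """Union of the top_n genes per cell type, plus required genes present in the data."""
    selected = set()
    for gene_coefs in coefs_by_celltype.values():
        # Rank genes for this cell type by absolute coefficient (an assumption).
        ranked = sorted(gene_coefs, key=lambda g: abs(gene_coefs[g]), reverse=True)
        selected.update(ranked[:top_n])
    # Only force in required genes that were actually measured.
    known = set().union(*coefs_by_celltype.values())
    selected.update(g for g in required_genes if g in known)
    return sorted(selected)


# Made-up first-round coefficients, for illustration only.
toy_coefs = {
    "T cells": {"CD3E": 2.0, "CD8A": 1.5, "MS4A1": -0.2, "GZMK": 0.4},
    "B cells": {"CD3E": -1.0, "CD8A": -0.1, "MS4A1": 2.5, "GZMK": 0.05},
}

# "GZMK" stands in for a curated (Cytopus-style) gene that would otherwise be dropped.
print(select_genes(toy_coefs, top_n=1, required_genes=["GZMK", "NOT_MEASURED"]))
```

The second training round would then use only the returned genes, so curated genes survive even when their first-round coefficients are small.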

In [10]:
import scanpy as sc
import pandas as pd
import anndata as ad
from anndata import AnnData
import numpy as np
from scipy.sparse import spmatrix
from datetime import datetime
import itertools
from sklearn.metrics import f1_score

import celltypist as ct
from celltypist import models

In [18]:
# Import models
# CellTypist default model
models.download_models(model = 'Immune_All_Low.pkl')

# New models
model_1 = models.Model.load('New Models/ct_model_1.pkl')
model_2 = models.Model.load('New Models/ct_model_2.pkl')
model_3 = models.Model.load('New Models/ct_model_3.pkl')
model_4 = models.Model.load('New Models/ct_model_4.pkl')

📂 Storing models in /Users/labuser/.celltypist/data/models
💾 Total models to download: 1
⏩ Skipping [1/1]: Immune_All_Low.pkl (file exists)


In [19]:
# Import test data (a subset of the CellTypist data)
test = ad.read_h5ad('../../Celltypist_test.h5ad')

## Get cell type predictions from each model
CellTypist default model

In [20]:
predictions_ct = ct.annotate(test, model = 'Immune_All_Low.pkl', majority_voting = True)
predictions_ct.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 6147 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!


Unnamed: 0,predicted_labels,over_clustering,majority_voting
CZINY-0105_CACCAAAAGTCAACAA,Tem/Temra cytotoxic T cells,71,Tem/Temra cytotoxic T cells
CZINY-0109_TGCGATACACATAACC,Tem/Temra cytotoxic T cells,101,Tem/Temra cytotoxic T cells
CZINY-0058_GCATCGGCAAGTCATC,Tcm/Naive helper T cells,7,Tcm/Naive helper T cells
CZINY-0057_AACAACCTCATGCTAG,Memory B cells,18,Memory B cells
CZINY-0106_GTGTTCCAGTGTTGTC,Alveolar macrophages,56,Alveolar macrophages
...,...,...,...
CZINY-0104_TCCTCTTTCCGTTTCG,Follicular helper T cells,206,Tem/Effector helper T cells
Pan_T7980364_CGTGTAACATGCTGGC,Memory B cells,380,Memory B cells
CZINY-0104_CGGCAGTTCCATGCAA,Tem/Temra cytotoxic T cells,369,Tem/Trm cytotoxic T cells
CZINY-0102_CGAAGTTAGCACTAGG,Follicular helper T cells,3,Follicular helper T cells
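
The `majority_voting` column can differ from `predicted_labels` because each cell takes the most common label within its over-cluster. The sketch below illustrates that idea only; it is a simplification and not CellTypist's implementation, and the function name and toy inputs are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(predicted_labels, clusters):
    """Give every cell the most common predicted label of its cluster."""
    counts_by_cluster = defaultdict(Counter)
    for label, cluster in zip(predicted_labels, clusters):
        counts_by_cluster[cluster][label] += 1
    # Pick the winning label once per cluster.
    winner = {c: counts.most_common(1)[0][0] for c, counts in counts_by_cluster.items()}
    return [winner[c] for c in clusters]


# Toy example: one cell in cluster 206 is outvoted by its neighbours.
labels = ["Tfh", "Tem", "Tem", "Memory B"]
clusters = [206, 206, 206, 380]
print(majority_vote(labels, clusters))
```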


In [21]:
pred_adatact = predictions_ct.to_adata()
pred_adatact.write_h5ad('../../pred_modelct.h5ad')

Model 1

In [None]:
predictions_1 = ct.annotate(test, model = 'New Models/ct_model_1.pkl', majority_voting = True)
predictions_1.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 30123 features used for prediction
⚖️ Scaling input data


In [None]:
pred_adata1 = predictions_1.to_adata()
pred_adata1.write_h5ad('../../pred_model1.h5ad')
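
The Model 1 log stops at the scaling step. One plausible explanation, assuming the scaled matrix is held as a dense array (scaling removes the sparsity of the counts), is memory: Model 1 uses 30123 features against 6147 for the default model. The back-of-the-envelope estimate below uses the cell and feature counts from the logs; the helper name and the float32/float64 assumption are illustrative.

```python
def dense_gigabytes(n_cells, n_features, bytes_per_value):
    """Size of a dense n_cells x n_features matrix in GB (1 GB = 1e9 bytes)."""
    return n_cells * n_features * bytes_per_value / 1e9


N_CELLS = 263810  # from the annotation log
features_by_model = {"default": 6147, "model 1": 30123, "model 3": 304}

for name, n_features in features_by_model.items():
    gb32 = dense_gigabytes(N_CELLS, n_features, 4)  # float32
    gb64 = dense_gigabytes(N_CELLS, n_features, 8)  # float64
    print(f"{name}: {gb32:.1f} GB as float32, {gb64:.1f} GB as float64")
```

If memory is the bottleneck, annotating the test set in chunks of cells and concatenating the results would be one way to test this hypothesis.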

Model 2

In [4]:
predictions_2 = ct.annotate(test, model = 'New Models/ct_model_2.pkl', majority_voting = True)
predictions_2.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 6749 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!


Unnamed: 0,predicted_labels,over_clustering,majority_voting
CZINY-0105_CACCAAAAGTCAACAA,Tem/emra_CD8,71,Tem/emra_CD8
CZINY-0109_TGCGATACACATAACC,Tem/emra_CD8,101,Tem/emra_CD8
CZINY-0058_GCATCGGCAAGTCATC,Tnaive/CM_CD4,7,Tnaive/CM_CD4
CZINY-0057_AACAACCTCATGCTAG,Memory B cells,18,Memory B cells
CZINY-0106_GTGTTCCAGTGTTGTC,Alveolar macrophages,56,Alveolar macrophages
...,...,...,...
CZINY-0104_TCCTCTTTCCGTTTCG,Tfh,206,Teffector/EM_CD4
Pan_T7980364_CGTGTAACATGCTGGC,Memory B cells,380,Memory B cells
CZINY-0104_CGGCAGTTCCATGCAA,Tem/emra_CD8,369,Trm/em_CD8
CZINY-0102_CGAAGTTAGCACTAGG,Tfh,3,Tfh


In [5]:
pred_adata2 = predictions_2.to_adata()

In [3]:
# pred_adata2.write_h5ad('../../pred_model2.h5ad')
pred_adata2 = ad.read_h5ad('../../pred_model2.h5ad')

Model 3

In [6]:
predictions_3 = ct.annotate(test, model = 'New Models/ct_model_3.pkl', majority_voting = True)
predictions_3.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 304 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!


Unnamed: 0,predicted_labels,over_clustering,majority_voting
CZINY-0105_CACCAAAAGTCAACAA,Tem/emra_CD8,71,Tem/emra_CD8
CZINY-0109_TGCGATACACATAACC,Tem/emra_CD8,101,Tem/emra_CD8
CZINY-0058_GCATCGGCAAGTCATC,Tnaive/CM_CD4,7,Tnaive/CM_CD4
CZINY-0057_AACAACCTCATGCTAG,Memory B cells,18,Memory B cells
CZINY-0106_GTGTTCCAGTGTTGTC,Alveolar macrophages,56,Alveolar macrophages
...,...,...,...
CZINY-0104_TCCTCTTTCCGTTTCG,Tfh,206,Teffector/EM_CD4
Pan_T7980364_CGTGTAACATGCTGGC,Memory B cells,380,Memory B cells
CZINY-0104_CGGCAGTTCCATGCAA,Trm/em_CD8,369,Trm/em_CD8
CZINY-0102_CGAAGTTAGCACTAGG,Tfh,3,Tfh


In [7]:
pred_adata3 = predictions_3.to_adata()

In [4]:
# pred_adata3.write_h5ad('../../pred_model3.h5ad')
pred_adata3 = ad.read_h5ad('../../pred_model3.h5ad')

Model 4

In [8]:
predictions_4 = ct.annotate(test, model = 'New Models/ct_model_4.pkl', majority_voting = True)
predictions_4.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 4712 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!


Unnamed: 0,predicted_labels,over_clustering,majority_voting
CZINY-0105_CACCAAAAGTCAACAA,Tem/emra_CD8,71,Tem/emra_CD8
CZINY-0109_TGCGATACACATAACC,Tem/emra_CD8,101,Tem/emra_CD8
CZINY-0058_GCATCGGCAAGTCATC,Tnaive/CM_CD4,7,Tnaive/CM_CD4
CZINY-0057_AACAACCTCATGCTAG,Memory B cells,18,Memory B cells
CZINY-0106_GTGTTCCAGTGTTGTC,Alveolar macrophages,56,Alveolar macrophages
...,...,...,...
CZINY-0104_TCCTCTTTCCGTTTCG,Tfh,206,Teffector/EM_CD4
Pan_T7980364_CGTGTAACATGCTGGC,Memory B cells,380,Memory B cells
CZINY-0104_CGGCAGTTCCATGCAA,Tem/emra_CD8,369,Trm/em_CD8
CZINY-0102_CGAAGTTAGCACTAGG,Tfh,3,Tfh


In [9]:
pred_adata4 = predictions_4.to_adata()

In [5]:
# pred_adata4.write_h5ad('../../pred_model4.h5ad')
pred_adata4 = ad.read_h5ad('../../pred_model4.h5ad')

## Benchmarking
### F1 scores

In [23]:
pred_adatact.obs

Unnamed: 0,Organ,Donor,Chemistry,Cell_category,Predicted_labels_CellTypist,Majority_voting_CellTypist,Majority_voting_CellTypist_high,Manually_curated_celltype,Sex,Age_range,predicted_labels,over_clustering,majority_voting,conf_score
CZINY-0105_CACCAAAAGTCAACAA,BMA,D496,3,ILCT,Cytotoxic T cells,Cytotoxic T cells,T cells,Tem/emra_CD8,Male,55-60,Tem/Temra cytotoxic T cells,71,Tem/Temra cytotoxic T cells,0.997435
CZINY-0109_TGCGATACACATAACC,BMA,D496,3,ILCT,Tem/Effector cytotoxic T cells,Cytotoxic T cells,T cells,Tem/emra_CD8,Male,55-60,Tem/Temra cytotoxic T cells,101,Tem/Temra cytotoxic T cells,0.982624
CZINY-0058_GCATCGGCAAGTCATC,BLD,D503,3,ILCT,Tcm/Naive helper T cells,Helper T cells,T cells,Tnaive/CM_CD4,Female,65-70,Tcm/Naive helper T cells,7,Tcm/Naive helper T cells,0.996335
CZINY-0057_AACAACCTCATGCTAG,LLN,D503,3,B,Memory B cells,Memory B cells,B cells,Memory B cells,Female,65-70,Memory B cells,18,Memory B cells,0.999984
CZINY-0106_GTGTTCCAGTGTTGTC,LNG,D496,3,Myeloid,Macrophages,Macrophages,Macrophages,Alveolar macrophages,Male,55-60,Alveolar macrophages,56,Alveolar macrophages,0.999999
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CZINY-0104_TCCTCTTTCCGTTTCG,SPL,D496,3,ILCT,Helper T cells,Helper T cells,T cells,Tfh,Male,55-60,Follicular helper T cells,206,Tem/Effector helper T cells,0.964137
Pan_T7980364_CGTGTAACATGCTGGC,SPL,A36,5v1,B,Memory B cells,Memory B cells,B cells,Memory B cells,Male,60-64,Memory B cells,380,Memory B cells,0.998577
CZINY-0104_CGGCAGTTCCATGCAA,BMA,D496,3,ILCT,Cytotoxic T cells,Cytotoxic T cells,T cells,Tem/emra_CD8,Male,55-60,Tem/Temra cytotoxic T cells,369,Tem/Trm cytotoxic T cells,0.943564
CZINY-0102_CGAAGTTAGCACTAGG,LLN,D496,3,ILCT,Helper T cells,Helper T cells,T cells,Tfh,Male,55-60,Follicular helper T cells,3,Follicular helper T cells,0.948143


In [28]:
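# Original CellTypist model: its label names differ from Manually_curated_celltype
# (e.g. 'Tem/Temra cytotoxic T cells' vs 'Tem/emra_CD8'), which lowers the score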
f1_score(pred_adatact.obs["Manually_curated_celltype"], pred_adatact.obs["predicted_labels"], average='weighted')

0.29303052071621205
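
The low score for the default model is consistent with a naming mismatch rather than wrong predictions: in the `obs` table, `Manually_curated_celltype` uses names such as `Tem/emra_CD8` where the default model predicts `Tem/Temra cytotoxic T cells`. The sketch below shows how harmonizing names before scoring changes the agreement on a few rows taken from that table. The mapping is an assumption inferred from those rows, not an official correspondence, and it covers only three labels.

```python
# Assumed correspondence (hypothetical, inferred from the rows shown in the obs table).
LABEL_MAP = {
    "Tem/Temra cytotoxic T cells": "Tem/emra_CD8",
    "Tcm/Naive helper T cells": "Tnaive/CM_CD4",
    "Follicular helper T cells": "Tfh",
}


def harmonize(labels, mapping):
    """Rename labels found in the mapping; leave all others unchanged."""
    return [mapping.get(label, label) for label in labels]


def exact_match_rate(truth, pred):
    """Fraction of cells whose predicted label equals the reference label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


truth = ["Tem/emra_CD8", "Tnaive/CM_CD4", "Memory B cells", "Alveolar macrophages", "Tfh"]
pred = [
    "Tem/Temra cytotoxic T cells",
    "Tcm/Naive helper T cells",
    "Memory B cells",
    "Alveolar macrophages",
    "Follicular helper T cells",
]

print("before:", exact_match_rate(truth, pred))
print("after: ", exact_match_rate(truth, harmonize(pred, LABEL_MAP)))
```

A full comparison would need a complete, reviewed mapping between the two label sets, applied to `predicted_labels` before calling `f1_score`.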

In [None]:
# Model 1: could not get it to run; annotation stalls at the "Scaling input data" step
#f1_score(pred_adata1.obs["Manually_curated_celltype"], pred_adata1.obs["predicted_labels"], average='macro')

In [29]:
# Model 2 - 0.64
f1_score(pred_adata2.obs["Manually_curated_celltype"], pred_adata2.obs["predicted_labels"], average='weighted')

0.8187335698864471

In [30]:
# Model 3 - 0.69
f1_score(pred_adata3.obs["Manually_curated_celltype"], pred_adata3.obs["predicted_labels"], average='weighted')

0.8224163133870971

In [31]:
# Model 4 - 0.8
f1_score(pred_adata4.obs["Manually_curated_celltype"], pred_adata4.obs["predicted_labels"], average='weighted')

0.9026640701746831
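
The scores above use `average='weighted'`, while the commented-out Model 1 cell uses `average='macro'`; the two are not directly comparable. The pure-Python sketch below spells out the difference on a toy example. It is meant to mirror the definitions used by `sklearn.metrics.f1_score`, but it is an illustration, and the function names are hypothetical.

```python
from collections import Counter


def per_class_f1(truth, pred):
    """F1 score for every label that appears in truth or pred."""
    scores = {}
    for c in sorted(set(truth) | set(pred)):
        tp = sum(t == c and p == c for t, p in zip(truth, pred))
        fp = sum(t != c and p == c for t, p in zip(truth, pred))
        fn = sum(t == c and p != c for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(truth, pred):
    """Unweighted mean of per-class F1: rare cell types count as much as common ones."""
    scores = per_class_f1(truth, pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(truth, pred):
    """Mean of per-class F1 weighted by the number of true cells in each class."""
    scores = per_class_f1(truth, pred)
    support = Counter(truth)
    return sum(scores[c] * support[c] for c in scores) / len(truth)


# Toy data: a common class predicted well and a rare class predicted poorly.
truth = ["B"] * 8 + ["T"] * 2
pred = ["B"] * 8 + ["B", "T"]
print("macro:   ", round(macro_f1(truth, pred), 3))
print("weighted:", round(weighted_f1(truth, pred), 3))
```

With imbalanced cell types, the weighted average is dominated by abundant populations, so reporting both averages gives a fuller picture of how each model handles rare cell types.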