# Testing/Benchmarking CellTypist Models
### List of Models (created in `Making New Models.ipynb`)
1. Remove the feature selection step from CellTypist, so the model is trained only once
2. Train the model with L1 regularization instead of L2
3. Train the model once, using only Cytopus genes
4. At the feature selection step, force the Cytopus genes into the list of top genes
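
Variant 4 amounts to taking the union of the top-ranked genes and a required gene set before the final training pass. The following is a minimal sketch of that idea only, not the code used in `Making New Models.ipynb`: the helper `select_features`, the importance scores, and the gene names are all hypothetical and are not taken from CellTypist or Cytopus.

```python
def select_features(scores, required_genes, top_n):
    """Return the top_n genes by score, plus any required genes found in scores."""
    # Rank genes from highest to lowest importance score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    selected = ranked[:top_n]
    chosen = set(selected)
    # Append required genes that were not already selected, skipping
    # genes that are absent from the data.
    for gene in required_genes:
        if gene in scores and gene not in chosen:
            selected.append(gene)
            chosen.add(gene)
    return selected


# Toy importance scores (hypothetical values).
scores = {"CD3E": 0.9, "MS4A1": 0.8, "CD8A": 0.4, "FOXP3": 0.1, "ACTB": 0.05}
features = select_features(scores, required_genes=["FOXP3", "GZMB"], top_n=2)
print(features)
```

Here `FOXP3` is added even though it ranks below the cutoff, while `GZMB` is skipped because it does not appear in the toy data.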

In [2]:
import scanpy as sc
import pandas as pd
import anndata as ad
from anndata import AnnData
import numpy as np
from scipy.sparse import spmatrix
from datetime import datetime
import itertools

import celltypist as ct
from celltypist import models



In [1]:
# Import models
# Run the import cell first: `models` comes from `from celltypist import models`
# CellTypist default model
models.download_models(model = 'Immune_All_Low.pkl')

# New models
model_1 = models.Model.load('New Models/ct_model_1.pkl')
model_2 = models.Model.load('New Models/ct_model_2.pkl')
model_3 = models.Model.load('New Models/ct_model_3.pkl')
model_4 = models.Model.load('New Models/ct_model_4.pkl')

NameError: name 'models' is not defined

In [3]:
# Import test data: a subset of the CellTypist data
test = ad.read_h5ad('../../Celltypist_test.h5ad')

#### Get cell type predictions from each model
CellTypist model

In [None]:
predictions_ct = ct.annotate(test, model = 'Immune_All_Low.pkl', majority_voting = True)
predictions_ct.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 6147 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 30
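
The "features used for prediction" count in these logs reflects how many of the model's genes are also present in the input data, which is why it differs from model to model on the same test set. A minimal sketch of that matching step, assuming plain lists of gene names; `matched_features` and the gene names are hypothetical and this is not CellTypist's internal code.

```python
def matched_features(model_genes, input_genes):
    """Return the model genes that are also present in the input data."""
    input_set = set(input_genes)  # set lookup keeps the scan linear
    return [gene for gene in model_genes if gene in input_set]


# Toy example: the model was trained on four genes, the input has three of them.
model_genes = ["CD3E", "CD8A", "MS4A1", "GZMB"]
input_genes = ["ACTB", "CD3E", "MS4A1", "CD8A"]
used = matched_features(model_genes, input_genes)
print(len(used), "features used for prediction")
```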


Model 1

In [None]:
predictions_1 = ct.annotate(test, model = 'New Models/ct_model_1.pkl', majority_voting = True)
predictions_1.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 30123 features used for prediction
⚖️ Scaling input data


In [None]:
pred_adata1 = predictions_1.to_adata()
pred_adata1.write_h5ad('../../pred_model1.h5ad')

Model 2

In [4]:
predictions_2 = ct.annotate(test, model = 'New Models/ct_model_2.pkl', majority_voting = True)
predictions_2.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 6749 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!


| | predicted_labels | over_clustering | majority_voting |
|---|---|---|---|
| CZINY-0105_CACCAAAAGTCAACAA | Tem/emra_CD8 | 71 | Tem/emra_CD8 |
| CZINY-0109_TGCGATACACATAACC | Tem/emra_CD8 | 101 | Tem/emra_CD8 |
| CZINY-0058_GCATCGGCAAGTCATC | Tnaive/CM_CD4 | 7 | Tnaive/CM_CD4 |
| CZINY-0057_AACAACCTCATGCTAG | Memory B cells | 18 | Memory B cells |
| CZINY-0106_GTGTTCCAGTGTTGTC | Alveolar macrophages | 56 | Alveolar macrophages |
| ... | ... | ... | ... |
| CZINY-0104_TCCTCTTTCCGTTTCG | Tfh | 206 | Teffector/EM_CD4 |
| Pan_T7980364_CGTGTAACATGCTGGC | Memory B cells | 380 | Memory B cells |
| CZINY-0104_CGGCAGTTCCATGCAA | Tem/emra_CD8 | 369 | Trm/em_CD8 |
| CZINY-0102_CGAAGTTAGCACTAGG | Tfh | 3 | Tfh |
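
With `majority_voting = True`, a cell's final label is taken from the dominant prediction in its over-cluster, which is why a cell predicted as `Tfh` in cluster 206 ends up as `Teffector/EM_CD4`. The sketch below illustrates the idea with a plain per-cluster vote. It is a simplification under stated assumptions: the helper `majority_vote` and the toy cluster contents are hypothetical, and refinements such as a minimum-proportion threshold are ignored.

```python
from collections import Counter, defaultdict


def majority_vote(predicted_labels, clusters):
    """Assign every cell the most common predicted label in its cluster."""
    counts_by_cluster = defaultdict(Counter)
    for label, cluster in zip(predicted_labels, clusters):
        counts_by_cluster[cluster][label] += 1
    # Pick the most frequent label per cluster.
    winners = {
        cluster: counts.most_common(1)[0][0]
        for cluster, counts in counts_by_cluster.items()
    }
    return [winners[cluster] for cluster in clusters]


# Toy data: three cells in cluster 206, one cell in cluster 380.
predicted = ["Tfh", "Teffector/EM_CD4", "Teffector/EM_CD4", "Memory B cells"]
clusters = [206, 206, 206, 380]
voted = majority_vote(predicted, clusters)
print(voted)
```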


In [5]:
pred_adata2 = predictions_2.to_adata()
#pred_adata2.write_h5ad('../../pred_model2.h5ad')

Model 3

In [6]:
predictions_3 = ct.annotate(test, model = 'New Models/ct_model_3.pkl', majority_voting = True)
predictions_3.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 304 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!


| | predicted_labels | over_clustering | majority_voting |
|---|---|---|---|
| CZINY-0105_CACCAAAAGTCAACAA | Tem/emra_CD8 | 71 | Tem/emra_CD8 |
| CZINY-0109_TGCGATACACATAACC | Tem/emra_CD8 | 101 | Tem/emra_CD8 |
| CZINY-0058_GCATCGGCAAGTCATC | Tnaive/CM_CD4 | 7 | Tnaive/CM_CD4 |
| CZINY-0057_AACAACCTCATGCTAG | Memory B cells | 18 | Memory B cells |
| CZINY-0106_GTGTTCCAGTGTTGTC | Alveolar macrophages | 56 | Alveolar macrophages |
| ... | ... | ... | ... |
| CZINY-0104_TCCTCTTTCCGTTTCG | Tfh | 206 | Teffector/EM_CD4 |
| Pan_T7980364_CGTGTAACATGCTGGC | Memory B cells | 380 | Memory B cells |
| CZINY-0104_CGGCAGTTCCATGCAA | Trm/em_CD8 | 369 | Trm/em_CD8 |
| CZINY-0102_CGAAGTTAGCACTAGG | Tfh | 3 | Tfh |


In [7]:
pred_adata3 = predictions_3.to_adata()
#pred_adata3.write_h5ad('../../pred_model3.h5ad')

Model 4

In [8]:
predictions_4 = ct.annotate(test, model = 'New Models/ct_model_4.pkl', majority_voting = True)
predictions_4.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 4712 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!


| | predicted_labels | over_clustering | majority_voting |
|---|---|---|---|
| CZINY-0105_CACCAAAAGTCAACAA | Tem/emra_CD8 | 71 | Tem/emra_CD8 |
| CZINY-0109_TGCGATACACATAACC | Tem/emra_CD8 | 101 | Tem/emra_CD8 |
| CZINY-0058_GCATCGGCAAGTCATC | Tnaive/CM_CD4 | 7 | Tnaive/CM_CD4 |
| CZINY-0057_AACAACCTCATGCTAG | Memory B cells | 18 | Memory B cells |
| CZINY-0106_GTGTTCCAGTGTTGTC | Alveolar macrophages | 56 | Alveolar macrophages |
| ... | ... | ... | ... |
| CZINY-0104_TCCTCTTTCCGTTTCG | Tfh | 206 | Teffector/EM_CD4 |
| Pan_T7980364_CGTGTAACATGCTGGC | Memory B cells | 380 | Memory B cells |
| CZINY-0104_CGGCAGTTCCATGCAA | Tem/emra_CD8 | 369 | Trm/em_CD8 |
| CZINY-0102_CGAAGTTAGCACTAGG | Tfh | 3 | Tfh |


In [9]:
pred_adata4 = predictions_4.to_adata()
#pred_adata4.write_h5ad('../../pred_model4.h5ad')
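
One simple way to compare the models is the fraction of cells on which two models assign the same label. The sketch below is only an illustration of that calculation: it uses the nine `predicted_labels` rows displayed for Models 2 to 4 in this notebook, so the numbers are not a benchmark result, and the helper `agreement` is hypothetical.

```python
from itertools import combinations


def agreement(labels_a, labels_b):
    """Fraction of positions where two label lists agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# First seven displayed rows are identical across Models 2, 3 and 4.
shared = [
    "Tem/emra_CD8", "Tem/emra_CD8", "Tnaive/CM_CD4", "Memory B cells",
    "Alveolar macrophages", "Tfh", "Memory B cells",
]
labels = {
    "model_2": shared + ["Tem/emra_CD8", "Tfh"],
    "model_3": shared + ["Trm/em_CD8", "Tfh"],
    "model_4": shared + ["Tem/emra_CD8", "Tfh"],
}

# Compare every pair of models once.
pairwise = {
    (m, n): agreement(labels[m], labels[n]) for m, n in combinations(labels, 2)
}
for pair, score in pairwise.items():
    print(pair, round(score, 3))
```

On the full data, the same function could be applied to the `predicted_labels` or `majority_voting` columns of each prediction, provided the cells are in the same order.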