# Testing/Benchmarking CellTypist Models
### List of Models (made in Making New Models.ipynb)
1. Remove the feature selection from CellTypist (so it only trains the model once)
2. Train the model with L1 regularization instead of L2
3. Train the model only once with only Cytopus genes
4. At the feature selection step, make sure the Cytopus genes are included in the list of top genes
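
Option 4 can be pictured as a union between the top-ranked genes and the Cytopus gene set. The sketch below is a minimal pure-Python illustration, not CellTypist's internal code: `select_features`, the toy coefficient table, and the gene choices are all hypothetical, and it assumes genes are ranked per cell type by absolute coefficient.

```python
def select_features(coefs, top_n, required=()):
    """Union of the top_n genes per cell type (by |coefficient|) plus required genes.

    coefs: dict mapping cell type -> dict mapping gene -> coefficient.
    required: genes to force into the feature set (e.g. Cytopus genes);
              only those present in the coefficient table are kept.
    """
    selected = set()
    known_genes = set()
    for gene_coefs in coefs.values():
        known_genes.update(gene_coefs)
        ranked = sorted(gene_coefs, key=lambda g: abs(gene_coefs[g]), reverse=True)
        selected.update(ranked[:top_n])
    # Force-include required genes that were actually measured.
    selected.update(g for g in required if g in known_genes)
    return sorted(selected)


# Toy example with made-up coefficients (hypothetical values).
toy_coefs = {
    "B cells": {"CD79A": 2.1, "MS4A1": 1.8, "CD3E": -0.4, "NKG7": 0.1},
    "T cells": {"CD79A": -0.3, "MS4A1": -0.2, "CD3E": 1.9, "NKG7": 0.5},
}
features = select_features(toy_coefs, top_n=1, required=["NKG7", "NOT_MEASURED"])
print(features)
```

Here `NKG7` is not a top gene for either cell type but is kept because it is required, while a required gene that is absent from the data is silently skipped.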

In [2]:
import scanpy as sc
import pandas as pd
import anndata as ad
from anndata import AnnData
import numpy as np
from scipy.sparse import spmatrix
from datetime import datetime
import itertools
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt

import celltypist as ct
from celltypist import models



In [18]:
#Import models 
##Celltypist default model 
models.download_models(model = 'Immune_All_Low.pkl')

##New Models
model_1 = models.Model.load('New Models/CT_45 Models/ct_model_1.pkl')
model_2 = models.Model.load('New Models/CT_45 Models/ct_model_2.pkl')
model_3 = models.Model.load('New Models/CT_45 Models/ct_model_3.pkl')
model_4 = models.Model.load('New Models/CT_45 Models/ct_model_4.pkl')

📂 Storing models in /Users/labuser/.celltypist/data/models
💾 Total models to download: 1
⏩ Skipping [1/1]: Immune_All_Low.pkl (file exists)


In [102]:
models.download_models(model = 'Healthy_COVID19_PBMC.pkl')

📂 Storing models in /Users/labuser/.celltypist/data/models
💾 Total models to download: 1
💾 Downloading model [1/1]: Healthy_COVID19_PBMC.pkl


## Get celltype predictions from each model

### Using CT_45 Models

In [19]:
#Import test data - subset of Celltypist data 
test = ad.read_h5ad('../../Data/Celltypist_test.h5ad')

In [141]:
predictions_ct = ct.annotate(test, model = 'New Models/CT_45 Models/ct_model_0.pkl', majority_voting = True)
predictions_ct.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 4759 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!


| | predicted_labels | over_clustering | majority_voting |
|---|---|---|---|
| CZINY-0105_CACCAAAAGTCAACAA | Tem/emra_CD8 | 71 | Tem/emra_CD8 |
| CZINY-0109_TGCGATACACATAACC | Tem/emra_CD8 | 101 | Tem/emra_CD8 |
| CZINY-0058_GCATCGGCAAGTCATC | Tnaive/CM_CD4 | 7 | Tnaive/CM_CD4 |
| CZINY-0057_AACAACCTCATGCTAG | Memory B cells | 18 | Memory B cells |
| CZINY-0106_GTGTTCCAGTGTTGTC | Alveolar macrophages | 56 | Alveolar macrophages |
| ... | ... | ... | ... |
| CZINY-0104_TCCTCTTTCCGTTTCG | Tfh | 206 | Teffector/EM_CD4 |
| Pan_T7980364_CGTGTAACATGCTGGC | Memory B cells | 380 | Memory B cells |
| CZINY-0104_CGGCAGTTCCATGCAA | Tem/emra_CD8 | 369 | Trm/em_CD8 |
| CZINY-0102_CGAAGTTAGCACTAGG | Tfh | 3 | Tfh |
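
The `majority_voting` column can differ from `predicted_labels`: the cell in over-cluster 206 is predicted as `Tfh` but ends up as `Teffector/EM_CD4`. A simplified sketch of that idea follows. It is not CellTypist's implementation (which has additional options), and `majority_vote` plus the toy inputs are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(predicted_labels, over_clustering):
    """Give every cell the most common predicted label within its over-cluster."""
    votes = defaultdict(Counter)
    for label, cluster in zip(predicted_labels, over_clustering):
        votes[cluster][label] += 1
    # Ties are resolved in favour of the label seen first in the cluster.
    winners = {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}
    return [winners[cluster] for cluster in over_clustering]


# Toy input: three cells share over-cluster 206, one cell sits in 380.
predicted = ["Tfh", "Teffector/EM_CD4", "Teffector/EM_CD4", "Memory B cells"]
clusters = [206, 206, 206, 380]
voted = majority_vote(predicted, clusters)
print(voted)
```

The single `Tfh` call in cluster 206 is outvoted by its neighbours, which mirrors the row shown in the table.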


In [142]:
pred_adatact = predictions_ct.to_adata()
pred_adatact.write_h5ad('../../pred_modelct.h5ad')
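
The annotate, convert, and write steps are repeated for every model in this notebook, so they could be wrapped in one helper. This is a sketch only: `prediction_path` and `annotate_and_save` are hypothetical names, and the file-name pattern is assumed from the `pred_98_model0.h5ad` style used later on.

```python
from pathlib import Path


def prediction_path(out_dir, dataset, model_id):
    """Build an output path such as ../../predictions/pred_98_model0.h5ad."""
    return Path(out_dir) / f"pred_{dataset}_model{model_id}.h5ad"


def annotate_and_save(adata, model_path, out_path):
    """Annotate adata with one model, convert to AnnData, and write it to disk."""
    import celltypist as ct  # imported lazily so the path helper works without CellTypist

    predictions = ct.annotate(adata, model=str(model_path), majority_voting=True)
    pred_adata = predictions.to_adata()
    pred_adata.write_h5ad(out_path)
    return pred_adata


# Example call (not executed here, needs the data and model files):
# annotate_and_save(test_98, 'New Models/CT_98 Models/98_model_0.pkl',
#                   prediction_path('../../predictions', '98', 0))
print(prediction_path("../../predictions", "98", 0).as_posix())
```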

Model 1

In [None]:
predictions_1 = ct.annotate(test, model = 'New Models/CT_45 Models/ct_model_1.pkl', majority_voting = True)
predictions_1.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 30123 features used for prediction
⚖️ Scaling input data


In [None]:
pred_adata1 = predictions_1.to_adata()
pred_adata1.write_h5ad('../../pred_model1.h5ad')

Model 2

In [4]:
predictions_2 = ct.annotate(test, model = 'New Models/CT_45 Models/ct_model_2.pkl', majority_voting = True)
predictions_2.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 6749 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!


| | predicted_labels | over_clustering | majority_voting |
|---|---|---|---|
| CZINY-0105_CACCAAAAGTCAACAA | Tem/emra_CD8 | 71 | Tem/emra_CD8 |
| CZINY-0109_TGCGATACACATAACC | Tem/emra_CD8 | 101 | Tem/emra_CD8 |
| CZINY-0058_GCATCGGCAAGTCATC | Tnaive/CM_CD4 | 7 | Tnaive/CM_CD4 |
| CZINY-0057_AACAACCTCATGCTAG | Memory B cells | 18 | Memory B cells |
| CZINY-0106_GTGTTCCAGTGTTGTC | Alveolar macrophages | 56 | Alveolar macrophages |
| ... | ... | ... | ... |
| CZINY-0104_TCCTCTTTCCGTTTCG | Tfh | 206 | Teffector/EM_CD4 |
| Pan_T7980364_CGTGTAACATGCTGGC | Memory B cells | 380 | Memory B cells |
| CZINY-0104_CGGCAGTTCCATGCAA | Tem/emra_CD8 | 369 | Trm/em_CD8 |
| CZINY-0102_CGAAGTTAGCACTAGG | Tfh | 3 | Tfh |


In [5]:
pred_adata2 = predictions_2.to_adata()

In [3]:
#pred_adata2.write_h5ad('../../pred_model2.h5ad')
pred_adata2 = ad.read_h5ad('../../pred_model2.h5ad')

Model 3

In [6]:
predictions_3 = ct.annotate(test, model = 'New Models/CT_45 Models/ct_model_3.pkl', majority_voting = True)
predictions_3.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 304 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!


| | predicted_labels | over_clustering | majority_voting |
|---|---|---|---|
| CZINY-0105_CACCAAAAGTCAACAA | Tem/emra_CD8 | 71 | Tem/emra_CD8 |
| CZINY-0109_TGCGATACACATAACC | Tem/emra_CD8 | 101 | Tem/emra_CD8 |
| CZINY-0058_GCATCGGCAAGTCATC | Tnaive/CM_CD4 | 7 | Tnaive/CM_CD4 |
| CZINY-0057_AACAACCTCATGCTAG | Memory B cells | 18 | Memory B cells |
| CZINY-0106_GTGTTCCAGTGTTGTC | Alveolar macrophages | 56 | Alveolar macrophages |
| ... | ... | ... | ... |
| CZINY-0104_TCCTCTTTCCGTTTCG | Tfh | 206 | Teffector/EM_CD4 |
| Pan_T7980364_CGTGTAACATGCTGGC | Memory B cells | 380 | Memory B cells |
| CZINY-0104_CGGCAGTTCCATGCAA | Trm/em_CD8 | 369 | Trm/em_CD8 |
| CZINY-0102_CGAAGTTAGCACTAGG | Tfh | 3 | Tfh |


In [7]:
pred_adata3 = predictions_3.to_adata()

In [4]:
#pred_adata3.write_h5ad('../../pred_model3.h5ad')
pred_adata3 = ad.read_h5ad('../../pred_model3.h5ad')

Model 4

In [152]:
predictions_4 = ct.annotate(test, model = 'New Models/CT_45 Models/ct_model_4.pkl', majority_voting = True)
predictions_4.predicted_labels

🔬 Input data has 263810 cells and 36601 genes
🔗 Matching reference genes in the model
🧬 4804 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 30
IOStream.flush timed out
🗳️ Majority voting the predictions
✅ Majority voting done!


| | predicted_labels | over_clustering | majority_voting |
|---|---|---|---|
| CZINY-0105_CACCAAAAGTCAACAA | Tem/emra_CD8 | 71 | Tem/emra_CD8 |
| CZINY-0109_TGCGATACACATAACC | Tem/emra_CD8 | 101 | Tem/emra_CD8 |
| CZINY-0058_GCATCGGCAAGTCATC | Tnaive/CM_CD4 | 7 | Tnaive/CM_CD4 |
| CZINY-0057_AACAACCTCATGCTAG | Memory B cells | 18 | Memory B cells |
| CZINY-0106_GTGTTCCAGTGTTGTC | Alveolar macrophages | 56 | Alveolar macrophages |
| ... | ... | ... | ... |
| CZINY-0104_TCCTCTTTCCGTTTCG | Tfh | 206 | Teffector/EM_CD4 |
| Pan_T7980364_CGTGTAACATGCTGGC | Memory B cells | 380 | Memory B cells |
| CZINY-0104_CGGCAGTTCCATGCAA | Tem/emra_CD8 | 369 | Trm/em_CD8 |
| CZINY-0102_CGAAGTTAGCACTAGG | Tfh | 3 | Tfh |


In [153]:
pred_adata4 = predictions_4.to_adata()

In [154]:
#pred_adata4.write_h5ad('../../pred_model4.h5ad')
pred_adata4 = ad.read_h5ad('../../pred_model4.h5ad')

### Using CT_98 Models

In [9]:
#Import test data - subset of CT_98 data
test_98 = ad.read_h5ad('../../Data/CT_98_Test.h5ad')

Model 0

In [3]:
predictions_98_0 = ct.annotate(test_98, model = 'New Models/CT_98 Models/98_model_0.pkl', majority_voting = True)
#predictions_98_0.predicted_labels
pred_adata98_0 = predictions_98_0.to_adata()
pred_adata98_0.write_h5ad('../../predictions/pred_98_model0.h5ad')

🔬 Input data has 540487 cells and 38995 genes
🔗 Matching reference genes in the model
🧬 7805 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 30
IOStream.flush timed out
🗳️ Majority voting the predictions
✅ Majority voting done!


Model 2

In [5]:
predictions_98_2 = ct.annotate(test_98, model = 'New Models/CT_98 Models/98_model_2.pkl', majority_voting = True)
#predictions_98_2.predicted_labels
pred_adata98_2 = predictions_98_2.to_adata()
pred_adata98_2.write_h5ad('../../predictions/pred_98_model2.h5ad')

🔬 Input data has 540487 cells and 38995 genes
🔗 Matching reference genes in the model
🧬 9121 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!


Model 3

In [10]:
predictions_98_3 = ct.annotate(test_98, model = 'New Models/CT_98 Models/98_model_3.pkl', majority_voting = True)
#predictions_98_3.predicted_labels
pred_adata98_3 = predictions_98_3.to_adata()
pred_adata98_3.write_h5ad('../../predictions/pred_98_model3.h5ad')

🔬 Input data has 540487 cells and 38995 genes
🔗 Matching reference genes in the model
🧬 303 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 30
IOStream.flush timed out
🗳️ Majority voting the predictions
✅ Majority voting done!


Model 4

In [11]:
predictions_98_4 = ct.annotate(test_98, model = 'New Models/CT_98 Models/98_model_4.pkl', majority_voting = True)
#predictions_98_4.predicted_labels
pred_adata98_4 = predictions_98_4.to_adata()
pred_adata98_4.write_h5ad('../../predictions/pred_98_model4.h5ad')

🔬 Input data has 540487 cells and 38995 genes
🔗 Matching reference genes in the model
🧬 7710 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 30
IOStream.flush timed out
🗳️ Majority voting the predictions
✅ Majority voting done!


### Using COV_PBMC Models

In [3]:
#Import test data - subset of COV_PBMC data
test_COV = ad.read_h5ad('../../Data/test_COV.h5ad')
#test_COV_cp = ad.read_h5ad('../../Data/test_COV_cp.h5ad')

Model 0

In [5]:
predictions_COV_0 = ct.annotate(test_COV, model = 'New Models/COV_PBMC Models/COV_model_0.pkl', majority_voting = True)
#predictions_COV_0.predicted_labels
pred_adataCOV_0 = predictions_COV_0.to_adata()
pred_adataCOV_0.write_h5ad('../../predictions/pred_COV_model0.h5ad')

🔬 Input data has 517893 cells and 24737 genes
🔗 Matching reference genes in the model
🧬 4716 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 30
IOStream.flush timed out
🗳️ Majority voting the predictions
✅ Majority voting done!


Model 2

In [109]:
predictions_COV_2 = ct.annotate(test_COV, model = 'New Models/COV_PBMC Models/COV_model_2.pkl', majority_voting = True)
#predictions_COV_2.predicted_labels
pred_adataCOV_2 = predictions_COV_2.to_adata()
pred_adataCOV_2.write_h5ad('../../predictions/pred_COV_model2.h5ad')

🔬 Input data has 517893 cells and 24737 genes
🔗 Matching reference genes in the model
🧬 5968 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 30
IOStream.flush timed out
🗳️ Majority voting the predictions
✅ Majority voting done!


Model 3

In [105]:
predictions_COV_3 = ct.annotate(test_COV, model = 'New Models/COV_PBMC Models/COV_model_3.pkl', majority_voting = True)
#predictions_COV_3.predicted_labels
pred_adataCOV_3 = predictions_COV_3.to_adata()
pred_adataCOV_3.write_h5ad('../../predictions/pred_COV_model3.h5ad')

In [107]:
#making sure the f1 score is the same if we use dataset with all genes vs just cytopus genes 
#predictions_COV_cp_3 = ct.annotate(test_COV_cp, model = 'New Models/COV_PBMC Models/COV_model_3.pkl', majority_voting = True)
#predictions_COV_3.predicted_labels
#pred_adataCOV_cp_3 = predictions_COV_cp_3.to_adata()
#pred_adataCOV_cp_3.write_h5ad('../../predictions/pred_COV_cp_model3.h5ad')

🔬 Input data has 517893 cells and 300 genes
🔗 Matching reference genes in the model
🧬 298 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 30
IOStream.flush timed out
🗳️ Majority voting the predictions
✅ Majority voting done!


Model 4

In [10]:
predictions_COV_4 = ct.annotate(test_COV, model = 'New Models/COV_PBMC Models/COV_model_4.pkl', majority_voting = True)
#predictions_COV_4.predicted_labels
pred_adataCOV_4 = predictions_COV_4.to_adata()
pred_adataCOV_4.write_h5ad('../../predictions/pred_COV_model4.h5ad')

🔬 Input data has 517893 cells and 24737 genes
🔗 Matching reference genes in the model
🧬 4832 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 30
IOStream.flush timed out
🗳️ Majority voting the predictions
✅ Majority voting done!


## Benchmarking
### F1 scores
Abdelaal et al. used the median F1 score as their primary statistic, so each model is scored here by the median of its per-class F1 scores.
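
To make the statistic concrete, the sketch below computes per-class F1 by hand and takes the median, which is what `np.median(f1_score(..., average=None))` does in the cells that follow. The function name `per_class_f1` and the toy labels are hypothetical; the notebook itself relies on scikit-learn.

```python
from statistics import median


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy labels: one T cell is mislabelled as NK.
y_true = ["B", "B", "T", "T", "NK", "NK"]
y_pred = ["B", "B", "T", "NK", "NK", "NK"]
scores = per_class_f1(y_true, y_pred)
median_f1 = median(scores.values())
print(scores, median_f1)
```

Because the median ignores how many cells each class contains, a rare cell type weighs as much as an abundant one.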

#### Train & Test on CT_45

In [41]:
pred_adatact.obs["predicted_labels"]

CZINY-0105_CACCAAAAGTCAACAA      Tem/Temra cytotoxic T cells
CZINY-0109_TGCGATACACATAACC      Tem/Temra cytotoxic T cells
CZINY-0058_GCATCGGCAAGTCATC         Tcm/Naive helper T cells
CZINY-0057_AACAACCTCATGCTAG                   Memory B cells
CZINY-0106_GTGTTCCAGTGTTGTC             Alveolar macrophages
                                            ...             
CZINY-0104_TCCTCTTTCCGTTTCG        Follicular helper T cells
Pan_T7980364_CGTGTAACATGCTGGC                 Memory B cells
CZINY-0104_CGGCAGTTCCATGCAA      Tem/Temra cytotoxic T cells
CZINY-0102_CGAAGTTAGCACTAGG        Follicular helper T cells
CZINY-0061_CGGGACTAGAGGCGGA                       MAIT cells
Name: predicted_labels, Length: 263810, dtype: category
Categories (83, object): ['Age-associated B cells', 'Alveolar macrophages', 'B cells', 'CD16+ NK cells', ..., 'Type 1 helper T cells', 'Type 17 helper T cells', 'gamma-delta T cells', 'pDC']
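
`f1_score(..., average=None)` returns one score per label found in either the true or the predicted column, so a label that appears on only one side contributes an F1 of 0 to the median. A quick way to spot such labels is sketched below; `label_mismatch` and the toy lists are hypothetical.

```python
def label_mismatch(true_labels, predicted_labels):
    """List labels that occur in only one of the two columns."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    return {
        "only_in_truth": sorted(true_set - pred_set),
        "only_in_predictions": sorted(pred_set - true_set),
    }


# Toy example: 'pDC' is never predicted and 'B cells' is never a true label.
truth = ["Memory B cells", "MAIT cells", "pDC"]
preds = ["Memory B cells", "MAIT cells", "B cells"]
report = label_mismatch(truth, preds)
print(report)
```

In the notebook this could be called with, for example, `pred_adatact.obs["Manually_curated_celltype"]` and `pred_adatact.obs["predicted_labels"]`.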

In [143]:
#og celltypist  - 0.892
np.median(f1_score(pred_adatact.obs["Manually_curated_celltype"], pred_adatact.obs["predicted_labels"], average=None))

0.8924290052746551

In [None]:
#model 1 - can't get it to run, gets stuck on Scaling for too long
#np.median(f1_score(pred_adata1.obs["Manually_curated_celltype"], pred_adata1.obs["predicted_labels"], average=None))
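
One possible reason model 1 stalls at the scaling step is matrix size: it uses 30123 features against 4759 for the default model. The back-of-the-envelope estimate below assumes the scaled matrix is held densely in memory as float32, which is an assumption rather than something confirmed from the CellTypist source; `dense_gigabytes` is a hypothetical helper.

```python
def dense_gigabytes(n_cells, n_features, bytes_per_value=4):
    """Size in GB (10**9 bytes) of a dense n_cells x n_features matrix."""
    return n_cells * n_features * bytes_per_value / 1e9


n_cells = 263810  # cells in the CT_45 test set
for name, n_features in [("model 0", 4759), ("model 1", 30123)]:
    print(f"{name}: {dense_gigabytes(n_cells, n_features):.1f} GB as float32")
```

Under that assumption the default model needs about 5 GB and model 1 about 32 GB, so running model 1 on a subset of cells may be a practical way to get a score.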

In [39]:
#model 2  - 0.74
np.median(f1_score(pred_adata2.obs["Manually_curated_celltype"], pred_adata2.obs["predicted_labels"], average=None))

0.7401985360473279

In [38]:
#model 3  - 0.79
np.median(f1_score(pred_adata3.obs["Manually_curated_celltype"], pred_adata3.obs["predicted_labels"], average=None))

0.7864401520440618

In [158]:
#model 4 - 0.887
np.median(f1_score(pred_adata4.obs["Manually_curated_celltype"], pred_adata4.obs["predicted_labels"], average = None))

0.8866913027637537

#### Train & Test on CT_98

In [4]:
#og celltypist  - 0.844
np.median(f1_score(pred_adata98_0.obs["Harmonised_detailed_type"], pred_adata98_0.obs["predicted_labels"], average=None))

0.8438779688779688

In [7]:
#model 2  - 0.508
np.median(f1_score(pred_adata98_2.obs["Harmonised_detailed_type"], pred_adata98_2.obs["predicted_labels"], average=None))

0.5085253579098934

In [14]:
#model 3  - 0.810?
np.median(f1_score(pred_adata98_3.obs["Harmonised_detailed_type"], pred_adata98_3.obs["predicted_labels"], average=None))

0.8097257640640716

In [15]:
#model 4 - 0.810?
np.median(f1_score(pred_adata98_4.obs["Harmonised_detailed_type"], pred_adata98_4.obs["predicted_labels"], average = None))

0.8097257640640716
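
Models 3 and 4 return exactly the same median here. That could be coincidence (the median is the F1 of a single class), or it could mean the two prediction files hold the same labels. A small check is sketched below; `fraction_identical` and the toy lists are hypothetical, and both inputs are assumed to be in the same cell order.

```python
def fraction_identical(labels_a, labels_b):
    """Fraction of positions at which two label sequences agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        return 1.0
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


# Toy example: the two models disagree on one of three cells.
model_3 = ["Tfh", "Memory B cells", "Trm/em_CD8"]
model_4 = ["Tfh", "Memory B cells", "Tem/emra_CD8"]
agreement = fraction_identical(model_3, model_4)
print(agreement)
```

In the notebook the inputs would be `list(pred_adata98_3.obs["predicted_labels"])` and `list(pred_adata98_4.obs["predicted_labels"])`; a result of 1.0 would suggest the same predictions were scored twice.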

#### Train & Test on COV_PBMC

In [7]:
#og celltypist model - 0.609
np.median(f1_score(pred_adataCOV_0.obs["full_clustering"], pred_adataCOV_0.obs["predicted_labels"], average=None))

0.6092588150635644

In [8]:
pred_adataCOV_0.obs

| covid_index | sample_id | n_genes | n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | full_clustering | initial_clustering | Resample | Collection_Day | ... | Days_from_onset | Site | time_after_LPS | Worst_Clinical_Status | Outcome | patient_id | predicted_labels | over_clustering | majority_voting | conf_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CATATGGAGGAATGGA-MH9143325 | MH9143325 | 586 | 586 | 923.0 | 27.0 | 2.925244 | CD83_CD14_mono | CD14 | Initial | D0 | ... | 7 | Ncl | | Death | Death | MH9143325 | CD14_mono | 141 | CD14_mono | 1.000000e+00 |
| GGGTCTGGTACGAAAT-MH9143372 | MH9143372 | 1096 | 1096 | 3973.0 | 121.0 | 3.045557 | CD4.Naive | CD4 | Initial | D0 | ... | 13 | Ncl | | Non-covid | Home | MH9143372 | CD4.Naive | 342 | CD4.Naive | 1.000000e+00 |
| S11_ACTATCTTCCAACCAA-1 | AP10 | 1275 | 1275 | 2684.0 | 66.0 | 2.459016 | B_exhausted | B_cell | Initial | D0 | ... | 1 | Sanger | | | unknown | AP10 | B_naive | 229 | B_naive | 1.000000e+00 |
| TGACAACCAACCGCCA-MH9143273 | MH9143273 | 654 | 654 | 1361.0 | 98.0 | 7.200588 | CD8.Naive | CD8 | Resample | D7 | ... | 6 | Ncl | | Critical | Home | MH9143273 | CD8.Naive | 150 | CD8.Naive | 1.000000e+00 |
| GGAATAATCCGAGCCA-MH9143271 | MH9143271 | 1114 | 1114 | 2566.0 | 49.0 | 1.909587 | CD83_CD14_mono | CD14 | Initial | D0 | ... | 8 | Ncl | | Mild | Home | MH9143271 | CD83_CD14_mono | 80 | CD83_CD14_mono | 1.000000e+00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ACACCCTTCGATGAGG-MH9143271 | MH9143271 | 1277 | 1277 | 3097.0 | 42.0 | 1.356151 | CD83_CD14_mono | CD14 | Initial | D0 | ... | 8 | Ncl | | Mild | Home | MH9143271 | CD83_CD14_mono | 168 | CD83_CD14_mono | 1.000000e+00 |
| S14_CAAGATCCAATGGAAT-1 | AP2 | 531 | 531 | 750.0 | 38.0 | 5.066667 | gdT | CD4 | Initial | D0 | ... | 47 | Sanger | | | unknown | AP2 | CD4.Tfh | 223 | CD4.Tfh | 1.000000e+00 |
| CAGAGAGGTTGCGTTA-MH9179825 | MH9179825 | 2142 | 2142 | 9305.0 | 191.0 | 2.052660 | CD4.CM | CD4 | Initial | D0 | ... | 13 | Ncl | | Moderate | Home | MH9179825 | CD4.CM | 256 | CD4.CM | 1.000000e+00 |
| BGCV10_CGTAGGCGTTGCTCCT-1 | BGCV10_CV0231 | 568 | 568 | 1193.0 | 59.0 | 4.945516 | CD8.EM | CD8 | Initial | D0 | ... | Healthy | Cambridge | | Asymptomatic | Home | CV0231 | CD8.TE | 50 | CD8.TE | 2.616917e-19 |


In [110]:
#model 2  - 0.556
np.median(f1_score(pred_adataCOV_2.obs["full_clustering"], pred_adataCOV_2.obs["predicted_labels"], average=None))

0.5556809631301732

In [106]:
#model 3 - 0.507
np.median(f1_score(pred_adataCOV_3.obs["full_clustering"], pred_adataCOV_3.obs["predicted_labels"], average=None))

0.5069204152249135

In [108]:
#model 3 cytopus genes dataset to make sure they are the same - 0.507
np.median(f1_score(pred_adataCOV_cp_3.obs["full_clustering"], pred_adataCOV_cp_3.obs["predicted_labels"], average=None))

0.5069204152249135

In [None]:
#model 4 - 0.576
np.median(f1_score(pred_adataCOV_4.obs["full_clustering"], pred_adataCOV_4.obs["predicted_labels"], average = None))

0.5759539236861051
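
For a side-by-side view, the median F1 scores reported above can be collected in one place. The values are copied from the cell outputs in this notebook; the dictionary layout and variable names are hypothetical, and model 1 is left out because it has no score.

```python
# Median per-class F1 by dataset and model, copied from the outputs above.
median_f1 = {
    "CT_45": {0: 0.8924290052746551, 2: 0.7401985360473279,
              3: 0.7864401520440618, 4: 0.8866913027637537},
    "CT_98": {0: 0.8438779688779688, 2: 0.5085253579098934,
              3: 0.8097257640640716, 4: 0.8097257640640716},
    "COV_PBMC": {0: 0.6092588150635644, 2: 0.5556809631301732,
                 3: 0.5069204152249135, 4: 0.5759539236861051},
}

# Best-scoring model per dataset (model 0 is the default CellTypist setup).
best = {dataset: max(scores, key=scores.get) for dataset, scores in median_f1.items()}

for dataset, scores in median_f1.items():
    row = "  ".join(f"model {m}: {s:.3f}" for m, s in sorted(scores.items()))
    print(f"{dataset:9s} {row}  (best: model {best[dataset]})")
```

On all three datasets the default model (model 0) has the highest median F1, with model 4 closest to it on CT_45 and COV_PBMC.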