# How to train a model with CellTypist

There are a few key components to vanilla CellTypist model training:
1. It is a logistic regression classifier trained with stochastic gradient descent.
2. The classifier uses L2 regularization (which is the default for this function).
3. The model is first trained on all the genes that are provided. It then performs feature selection, in which the top 300 genes are selected for every cell type based on the coefficients from the classifier. The model is then retrained on the union of the selected genes, and this retrained model is the final model.
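
The feature selection step in item 3 can be sketched with NumPy alone. This is a minimal illustration, not CellTypist's actual code: the function name `select_top_genes`, the toy coefficient matrix, and the choice to rank genes by absolute coefficient value are assumptions made for the example.

```python
import numpy as np

def select_top_genes(coef, top_genes=300):
    """Return the sorted union of the top gene indices per cell type.

    coef: array of shape (n_cell_types, n_genes), e.g. classifier coefficients.
    Hypothetical helper; ranking by absolute coefficient is an assumption.
    """
    coef = np.asarray(coef)
    k = min(top_genes, coef.shape[1])
    # Indices of the k largest |coefficients| in each row (unordered within the k).
    top_idx = np.argpartition(np.abs(coef), -k, axis=1)[:, -k:]
    # Union across cell types, sorted and without duplicates.
    return np.unique(top_idx)

# Toy example: 3 cell types, 6 genes, keep the top 2 genes per cell type.
coef = np.array([
    [0.9, -0.1, 0.0, 0.2, -0.8, 0.05],
    [0.1, 0.7, -0.6, 0.0, 0.05, 0.2],
    [-0.95, 0.0, 0.1, 0.85, 0.2, 0.0],
])
selected = select_top_genes(coef, top_genes=2)
print(selected)
```

In this picture, the second training pass would use only the columns listed in `selected`.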

For a visual representation of this workflow, see figure 1c from the Conde et al. 2022 paper. 

To train a CellTypist model, we need a dataset with a couple of necessary features. Most importantly, the dataset needs cell type annotations for each cell, which the classifier uses as labels during training. Additionally, CellTypist works with scRNA-seq data, so you will need a cell-by-gene matrix with the raw (or normalized) counts from scRNA sequencing.

## SAIL train function vs the original CellTypist train function 

We added some features to the original CellTypist train function to provide some desired functionality:
1. The primary addition is that the new train function also returns the list of genes that were chosen during feature selection (if feature selection doesn't happen, an empty DataFrame is returned). It also has the option to integrate Cytopus genes into the feature selection step. Cytopus is a knowledge base of immune cell types and some of the genes commonly associated with each cell type. By ensuring that Cytopus genes are included in the feature selection, we incorporate some prior knowledge into our model.
2. The new train function also has a second output: a pandas DataFrame that shows which 300 genes were chosen for each cell type during feature selection.
3. When training the classifier, we can now choose which kind of regularization to use (e.g., L1 instead of L2). Additionally, we can use one kind of regularization for the pre-feature-selection training and then switch to a different kind for the post-feature-selection training (either L2 -> L1 or L1 -> L2).
4. There is now the option to input raw data in AnnData form, in which case it will be normalized to median library size.


The actual mathematics behind the model is essentially unchanged.

## Our Recommendations

The easiest type of data to work with when building a CellTypist model is AnnData saved as an h5ad file. The cell type annotations need to be saved in a column in .obs. 

In [None]:
# load the data (fill in the path to your .h5ad file)
import anndata as ad

adata = ad.read_h5ad('')
adata

Before fully committing to a model, we recommend setting aside a portion of the data for testing once the model is trained. To do this, randomly split your data into train and test sets, which you can do with the `train_test_split()` function. By default, this function splits the data 70/30, but you can change the fraction used for training with the `frac` argument. If you intend to train your model using an LSF job, you will need to save this data.

In [None]:
train, test = train_test_split(adata)

#save - change to your own directory
train.write_h5ad('')
test.write_h5ad('')
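
For readers curious about what such a split involves, here is a minimal sketch of a random 70/30 split over cell indices. It is not the `train_test_split()` function used above: `split_indices` and its `seed` argument are hypothetical names chosen for illustration, and only `frac` mirrors the argument described in the text.

```python
import numpy as np

def split_indices(n_cells, frac=0.7, seed=0):
    """Randomly partition range(n_cells) into train and test index arrays.

    Hypothetical helper; `frac` is the fraction of cells assigned to training.
    """
    rng = np.random.default_rng(seed)
    # Shuffle all cell indices, then cut the shuffled order at the train size.
    perm = rng.permutation(n_cells)
    n_train = int(round(frac * n_cells))
    return np.sort(perm[:n_train]), np.sort(perm[n_train:])

train_idx, test_idx = split_indices(10, frac=0.7)
print(len(train_idx), len(test_idx))
```

With an AnnData object, index arrays like these could be applied as `adata[train_idx].copy()` and `adata[test_idx].copy()`.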

For normalization, we recommend normalizing either to the median library size of your data (https://www.nature.com/articles/s41592-023-01814-1) or to 10,000 counts per cell, and then log-normalizing. CellTypist expects data normalized to 10,000 counts per cell (and log-normalized), so it will throw an error if the `check_expression` argument is set to True and the data isn't normalized to the expected value. The `check_expression` argument defaults to False. Alternatively, you can input the raw data and set the `normalize` argument to True (default False). This will normalize your data to median library size and then log-normalize it, so you do not need to do it yourself.

In [None]:
# one way to check the normalization of your data:
import numpy as np
# the result should be very close to the value you normalized to (only the first cell is checked)
np.expm1(adata.X[0]).sum()
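
To make this check concrete, the sketch below normalizes a small dense count matrix to the median library size, log-transforms it, and then confirms that undoing the log recovers the target total for every cell. The helper name `normalize_median_log1p` and the toy counts are assumptions for illustration; this is not the code behind the `normalize` argument, and it assumes every cell has at least one count.

```python
import numpy as np

def normalize_median_log1p(counts):
    """Scale each cell to the median library size, then apply log(1 + x).

    counts: dense array of shape (n_cells, n_genes) with raw counts.
    Hypothetical helper; assumes no cell has a total count of zero.
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)  # total counts per cell
    target = np.median(lib_size)                  # median library size
    scaled = counts / lib_size * target           # every cell now sums to target
    return np.log1p(scaled), target

# Toy data: 3 cells with library sizes 10, 20 and 5 (median 10).
counts = np.array([
    [1, 3, 6],
    [10, 0, 10],
    [2, 2, 1],
])
X, target = normalize_median_log1p(counts)

# Same check as above, applied to every cell instead of only the first.
per_cell_total = np.expm1(X).sum(axis=1)
print(target, per_cell_total)
```

Each entry of `per_cell_total` should match `target` up to floating-point error.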

One more thing we recommend checking before moving on to model training is the variable names, which should be the gene names. Please ensure that your gene identifiers are the actual gene names and not Ensembl IDs or similar identifiers. This will produce a model that is much more interpretable. To check, look at the `.var_names` of your adata. If the variable names are Ensembl IDs, the gene names are often saved in a column in .var.

In [None]:
adata.var_names
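# if the gene names are stored in a column of .var (here 'feature_name'), uncomment to use them as the index:
#adata.var = adata.var.set_index('feature_name')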
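
A quick programmatic check can flag variable names that look like Ensembl IDs. The sketch below assumes human- or mouse-style gene identifiers, that is, `ENSG` or `ENSMUSG` followed by 11 digits and an optional version suffix. The helper name `fraction_ensembl_like` is hypothetical, and other species use other prefixes.

```python
import re

# Assumed pattern: human (ENSG) or mouse (ENSMUSG) gene IDs, optional ".version".
ENSEMBL_PATTERN = re.compile(r"^ENS(MUS)?G\d{11}(\.\d+)?$")

def fraction_ensembl_like(names):
    """Return the fraction of names that match the assumed Ensembl gene ID pattern."""
    names = list(names)
    if not names:
        return 0.0
    hits = sum(1 for name in names if ENSEMBL_PATTERN.match(str(name)))
    return hits / len(names)

symbols_fraction = fraction_ensembl_like(["CD3E", "MS4A1", "NKG7"])
ids_fraction = fraction_ensembl_like(["ENSG00000198851", "ENSG00000156738.17"])
print(symbols_fraction, ids_fraction)
```

Calling `fraction_ensembl_like(adata.var_names)` and getting a value near 1 would suggest the index still holds Ensembl IDs rather than gene names.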