# selenobot-detect

Before running any of the code below, follow the instructions in the `README` to install `selenobot` and download the necessary training, testing, and validation data.

In [None]:
import selenobot

## Training

`selenobot` uses a modular workflow to train `Classifier` objects on embedding data. The workflow has three steps: **data embedding**, **dataset instantiation**, and **model training**. Each step is handled by a separate object, described below.

1. `Embedder`: The `Embedder` class converts amino acid sequences, read in from a FASTA file, to a numerical representation. Two `Embedder` classes are defined: `LengthEmbedder`, which reduces a sequence to its length, and `AacEmbedder`, which represents a sequence by its amino acid composition.
2. `Dataset`: The `Dataset` class acts as a storage container for the data and manages how the `Classifier` accesses it during training and testing.
3. `Classifier`: This class defines a binary linear classifier which can be trained to distinguish between full-length proteins and truncated selenoproteins. The output of a trained `Classifier` is a prediction of which of these two classes each input sequence belongs to.
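
To make the two embeddings concrete, here is a minimal, self-contained sketch of what a length embedding and an amino acid composition embedding could look like. It does not use `selenobot`, and the function names (`length_embedding`, `aac_embedding`) are hypothetical. Normalizing counts to fractions and ignoring non-standard residues are assumptions made for illustration, so the library's `LengthEmbedder` and `AacEmbedder` may behave differently.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids, in a fixed order so every vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def length_embedding(seq: str) -> List[float]:
    """Reduce a sequence to a one-element vector holding its length."""
    return [float(len(seq))]


def aac_embedding(seq: str) -> List[float]:
    """Return the fraction of the sequence made up by each standard amino acid.

    Assumption: residues outside the 20 standard amino acids (for example
    'U' or 'X') are ignored rather than counted.
    """
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        # Empty sequence, or no standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

Under these assumptions, every sequence maps to a vector of fixed size (1 for length, 20 for composition), which is the kind of input a linear classifier can consume.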

### Data embedding

First, we need to instantiate our `Embedder` objects, which tell the `Dataset` how to process the data it contains. Because the protein language model (PLM) embeddings have been pre-computed, no embedder is needed for them; the `plm_embedder` variable is set to `None` for consistency.

In [None]:
aac_embedder = selenobot.get_embedder('aac')
len_embedder = selenobot.get_embedder('len')
# The PLM embeddings have been pre-generated, and are stored in a CSV file.
# No additional embedder is needed, but None is used as a placeholder for consistency.
plm_embedder = None

### Dataset instantiation

Now that we have our `Embedder` objects, we can load the raw sequence data (and the pre-computed PLM embeddings) into `Dataset` objects. The `Dataset` object uses the input `Embedder` to process the data it loads in from the file at the specified path. 

In [None]:
# Read the paths to the training and validation data from the selenobot.cfg file.
# These paths were set during the setup procedure. 
train_path = selenobot.get_train_path()
val_path = selenobot.get_val_path()

In [None]:
# Create Datasets for both the training and validation data. 
aac_dataset = selenobot.create_dataset(aac_embedder, train_path)
aac_val_dataset = selenobot.create_dataset(aac_embedder, val_path)
len_dataset = selenobot.create_dataset(len_embedder, train_path)
len_val_dataset = selenobot.create_dataset(len_embedder, val_path)
plm_dataset = selenobot.create_dataset(plm_embedder, train_path)
plm_val_dataset = selenobot.create_dataset(plm_embedder, val_path)
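
As a rough illustration of how a dataset can apply an embedder to sequences as it loads them, the sketch below pairs a small FASTA reader with a toy container class. It is not the `selenobot` implementation: `read_fasta` and `ToyDataset` are hypothetical names, and embedding every sequence once at load time is an assumption made to keep the example short.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


class ToyDataset:
    """Hypothetical container that embeds each sequence once, at load time."""

    def __init__(
        self,
        records: Iterable[Tuple[str, str]],
        embedder: Callable[[str], List[float]],
    ) -> None:
        self.ids: List[str] = []
        self.features: List[List[float]] = []
        for header, seq in records:
            self.ids.append(header)
            self.features.append(embedder(seq))

    def __len__(self) -> int:
        return len(self.ids)

    def __getitem__(self, index: int) -> Tuple[str, List[float]]:
        return self.ids[index], self.features[index]


# Usage: embed two records by length (the second sequence wraps over two lines).
fasta_lines = [">seq_a", "MKV", ">seq_b", "MK", "LLV"]
toy = ToyDataset(read_fasta(fasta_lines), lambda seq: [float(len(seq))])
```

Passing the embedder in as a plain callable keeps the container independent of any particular representation, which mirrors the way the same `create_dataset` call is reused above with different embedders.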


### Model training

Once we have instantiated the `Datasets` for training and validation, we can begin training!

First, we need to create the appropriate `Classifier` for each `Dataset`. The 

## Loading existing weights