# selenobot-detect

Before running any of the code below, follow the instructions in the `README` to install `selenobot` and download the necessary training, testing, and validation data.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
! pip install /home/prichter/Documents/selenobot-detect/

Processing /home/prichter/Documents/selenobot-detect
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: selenobot
  Building wheel for selenobot (setup.py) ... done
  Created wheel for selenobot: filename=selenobot-0.1-py3-none-any.whl size=28084 sha256=f1399ade498beb4bdbd4d7da20b0b8b290059308cc6e2a1c0eb7f03338c4ce9a
  Stored in directory: /home/prichter/.cache/pip/wheels/a0/aa/d8/1eccb2e865f567bdbccad42f9e1546f0e184aab89b2b2568c4
Successfully built selenobot
Installing collected packages: selenobot
  Attempting uninstall: selenobot
    Found existing installation: selenobot 0.1
    Uninstalling selenobot-0.1:
      Successfully uninstalled selenobot-0.1
Successfully installed selenobot-0.1


In [3]:
import selenobot

## Training

`selenobot` uses a modular workflow to train `Classifier` objects on embedding data. The workflow has three steps: **data embedding**, **dataset instantiation**, and **model training**. Each step is handled by a separate object, described below.

1. `Embedder`: The `Embedder` class converts amino acid sequences, read in from a FASTA file, to a numerical representation. Two `Embedder` types are defined: the `LengthEmbedder`, which reduces a sequence to its length, and the `AacEmbedder`, which represents a sequence by its amino acid composition.
2. `Dataset`: The `Dataset` class acts as a storage container for the data, and manages how the `Classifier` accesses it during training and testing.
3. `Classifier`: This class defines a binary linear classifier which can be trained to distinguish between full-length proteins and truncated selenoproteins. The output of a trained `Classifier` is a prediction of the identity of the input sequence(s).
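To make the two embeddings concrete, here is a minimal, standalone sketch of what a length embedding and an amino acid composition embedding could look like. It is not the `selenobot` implementation: the function names, the ordering of the amino acids, and the choice to ignore non-standard residues are all assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order (the ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def length_embedding(seq: str) -> list[float]:
    """Reduce a sequence to a single feature: its length."""
    return [float(len(seq))]


def aac_embedding(seq: str) -> list[float]:
    """Return the fraction of the sequence made up by each standard amino acid.

    Residues outside AMINO_ACIDS are ignored (an assumption of this sketch).
    """
    counts = Counter(seq)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


toy_seq = "MKVLAAG"  # toy sequence, not taken from the selenobot data
print(length_embedding(toy_seq))
print(aac_embedding(toy_seq))
```

The length embedding has one dimension and the composition embedding has twenty, which is why a classifier's input layer has to match the embedding type.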

### Data embedding

First, we need to instantiate our `Embedder` objects, which will tell the `Dataset` how to manage the data it contains. Because the protein-language model (PLM) embeddings have been pre-computed, we don't need to specify an embedder. The `plm_embedder` variable is set to `None` for consistency.

In [4]:
aac_embedder = selenobot.create_embedder('aac')
len_embedder = selenobot.create_embedder('len')
# The PLM embeddings have been pre-generated, and are stored in a CSV file.
# No additional embedder is needed, but None is used as a placeholder for consistency.
plm_embedder = None

### Dataset instantiation

Now that we have our `Embedder` objects, we can load the raw sequence data (and the pre-computed PLM embeddings) into `Dataset` objects. The `Dataset` object uses the input `Embedder` to process the data it loads in from the file at the specified path. 
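As a rough picture of what "a `Dataset` uses an `Embedder` to process the data it loads" means, the sketch below builds a toy container that applies an embedder to each row of a CSV file. The class name `ToyDataset`, the `seq` and `label` column names, and the in-memory CSV are hypothetical; the real `selenobot.create_dataset` and the layout of the training files may differ.

```python
import csv
import io


class ToyDataset:
    """Stores (embedding, label) pairs made by applying an embedder to each row."""

    def __init__(self, embedder, handle, nrows=None):
        self.embeddings, self.labels = [], []
        for i, row in enumerate(csv.DictReader(handle)):
            if nrows is not None and i >= nrows:
                break  # Stop early, mirroring the idea of an nrows limit.
            self.embeddings.append(embedder(row["seq"]))  # hypothetical column name
            self.labels.append(int(row["label"]))  # hypothetical column name

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.embeddings[idx], self.labels[idx]


# A tiny in-memory CSV standing in for a file on disk.
toy_csv = io.StringIO("seq,label\nMKVLAAG,0\nMKV,1\nMAAAG,0\n")
# The embedder here is a plain function that reduces a sequence to its length.
toy_dataset = ToyDataset(lambda seq: [float(len(seq))], toy_csv, nrows=2)
print(len(toy_dataset), toy_dataset[0])
```

Defining `__len__` and `__getitem__` is the convention that map-style PyTorch datasets follow, so a container shaped like this can be handed to a `DataLoader`.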

In [5]:
# Read the paths to the training and validation data from the selenobot.cfg file.
# These paths were set during the setup procedure. 
train_path = selenobot.get_train_data_path()
val_path = selenobot.get_val_data_path()

print('Training data is stored at:', train_path)
print('Validation data is stored at:', val_path)

Training data is stored at: /home/prichter/data/detect/train.csv
Validation data is stored at: /home/prichter/data/detect/val.csv


In [None]:
# Create Datasets for both the training and validation data. 
aac_dataset = selenobot.create_dataset(aac_embedder, train_path, nrows=10000)
aac_val_dataset = selenobot.create_dataset(aac_embedder, val_path, nrows=500)
len_dataset = selenobot.create_dataset(len_embedder, train_path, nrows=10000)
len_val_dataset = selenobot.create_dataset(len_embedder, val_path, nrows=500)
plm_dataset = selenobot.create_dataset(plm_embedder, train_path, nrows=10000)
plm_val_dataset = selenobot.create_dataset(plm_embedder, val_path, nrows=500)


### Model training

Once we have instantiated the `Datasets` for training and validation, we can begin training!

First, we need to create the appropriate `Classifier` for each `Dataset`. The `create_classifier` function detects the embedding type contained in the input `Dataset`, and uses it to choose the appropriate layer dimensions for the `Classifier`.

In [None]:
aac_classifier = selenobot.create_classifier(aac_dataset)
len_classifier = selenobot.create_classifier(len_dataset)
plm_classifier = selenobot.create_classifier(plm_dataset)

We can now train each model by calling the `train` function. This function uses the input datasets to instantiate a PyTorch `DataLoader`, which handles the batching of the data for training. The function's parameters are described below.

- `model`: A `Classifier` to train on the input `Datasets`.
- `dataset`: A `Dataset` containing the training data.
- `val_dataset`: A `Dataset` containing the validation data.
- `epochs`: The number of epochs to train the model for.
- `batch_size`: The size of the batches into which the training data will be split.
- `balance_batches`: Whether or not to ensure that each batch has an equal proportion of full-length and truncated proteins.
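To illustrate what balancing batches involves, the sketch below draws batches of indices so that each class fills half of every batch. This is one simple way to do it, written with the standard library only; it is not necessarily how `selenobot.train` implements `balance_batches`, and the function name `balanced_batches` and the 0/1 label encoding are assumptions.

```python
import random


def balanced_batches(labels, batch_size, seed=0):
    """Yield lists of indices in which each class fills half of every batch.

    Indices are drawn with replacement within each class, so examples from
    the rarer class are repeated more often than those from the common one.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    half = batch_size // 2
    n_batches = max(1, len(labels) // batch_size)
    for _ in range(n_batches):
        batch = rng.choices(pos, k=half) + rng.choices(neg, k=batch_size - half)
        rng.shuffle(batch)  # Avoid a fixed class order inside the batch.
        yield batch


# Toy labels with a 9:1 class imbalance.
toy_labels = [0] * 90 + [1] * 10
toy_batches = list(balanced_batches(toy_labels, batch_size=8))
print(len(toy_batches), toy_batches[0])
```

Without balancing, a batch of 8 drawn uniformly from these toy labels would often contain no examples of the rarer class at all, which is the situation this option is meant to avoid.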

In [None]:
kwargs = {'epochs': 10, 'batch_size': 128, 'balance_batches': True}

In [None]:
aac_train_reporter = selenobot.train(aac_classifier, aac_dataset, val_dataset=aac_val_dataset, **kwargs)

## Loading existing weights