# Regression Models in Selene

Selene is a flexible framework that can be used for tasks beyond simple classification.
This tutorial demonstrates how to train a regression model with Selene.
In this example, we predict mean ribosomal load (MRL) from 50-base-pair 5' UTR sequences using the model and data from [*Human 5′ UTR design and variant effect prediction from a massively parallel translation assay*](https://doi.org/10.1101/310375) by Sample et al.
The data were generated with a massively parallel reporter assay (MPRA), which is described in more detail in the preprint on [*bioRxiv*](https://doi.org/10.1101/310375).

## Setup

**Architecture:** The model is defined in [utr_model.py](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/utr_model.py), and differs only superficially from the model in [the paper](https://doi.org/10.1101/310375).
Since this is a real-valued regression problem, the `criterion` defined in `utr_model.py` uses the mean squared error as the loss.
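
To make the loss concrete, here is a minimal pure-Python sketch of the mean squared error. It is only an illustration of the formula, not the code in `utr_model.py`; the function name `mean_squared_error` and the example values are hypothetical.

```python
def mean_squared_error(predictions, targets):
    """Return the average squared difference between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one example is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Made-up MRL values, for illustration only.
predicted_mrl = [5.0, 6.5, 4.0]
observed_mrl = [5.5, 6.0, 4.0]
print(mean_squared_error(predicted_mrl, observed_mrl))
```

Because the errors are squared, large deviations from the observed MRL are penalized more heavily than small ones, which is the usual motivation for this loss in real-valued regression.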


**Data:** The data from Sample et al. are available on the [Gene Expression Omnibus (GEO)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114002).
For convenience, we have included [the `download_data.py` script](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/download_data.py), which downloads and preprocesses the data.
It should produce three files: `train.mat`, `validate.mat`, and `test.mat`.
They contain the data for training, validation, and testing, respectively.
At present, regression models can only be trained with `*.mat` files and the [`MatFileSampler`](http://selene.flatironinstitute.org/samplers.file_samplers.html#selene_sdk.samplers.file_samplers.MatFileSampler).
Further, a `MatFileSampler` must be specified for each of the `train.mat`, `validate.mat`, and `test.mat` files.
These `MatFileSampler`s are then used for the `train`, `validate`, and `test` arguments of a [`MultiFileSampler`](http://selene.flatironinstitute.org/samplers.html#selene_sdk.samplers.MultiFileSampler).
The specific syntax is demonstrated in the configuration file, [`regression_train.yml`](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_train.yml).

**Configuration file:** The configuration file [`regression_train.yml`](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_train.yml) differs slightly from the configuration files in the classification tutorials.
Specifically, `metrics` in `train_model` includes the coefficient of determination (`r2`), since the default metrics (`roc_auc` and `average_precision`) are not appropriate for regression.
Further, `report_gt_feature_n_positives` in `train_model` has been set to zero to prevent spurious filtering based on target values.
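
For readers unfamiliar with the coefficient of determination, the sketch below computes it from its standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. This is an assumption-labeled illustration rather than Selene's own implementation of `r2`; the function name `r_squared` is hypothetical.

```python
def r_squared(targets, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("targets and predictions must be non-empty and equal in length")
    mean_target = sum(targets) / len(targets)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    # Total sum of squares: error of always predicting the mean target.
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    if ss_tot == 0:
        raise ValueError("r2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # always predicting the mean
```

A value of 1 means the predictions match the targets exactly, while a value of 0 means the model does no better than always predicting the mean target.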

## Download the data

To download the data, run the [`download_data.py`](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/download_data.py) script from the command line:
```sh
python download_data.py
```
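
After the script finishes, it can be worth confirming that the three expected files exist before starting training. The following is a minimal sketch that assumes the files were written to the current working directory; the helper name `missing_data_files` is hypothetical and not part of Selene.

```python
from pathlib import Path

# File names taken from the Setup section above.
EXPECTED_FILES = ("train.mat", "validate.mat", "test.mat")


def missing_data_files(data_dir="."):
    """Return the expected data files that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


missing = missing_data_files(".")
if missing:
    print("Missing data files:", ", ".join(missing))
else:
    print("All data files are present.")
```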

## Train the model

```python
from selene_sdk.utils import load_path
from selene_sdk.utils import parse_configs_and_run
```

Before running `load_path` on `regression_train.yml`, please edit the YAML file to include the absolute path of the model file.
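
One way to obtain the absolute path is sketched below. It assumes that `utr_model.py` sits in the current working directory; the helper name `absolute_model_path` is hypothetical, and the printed value still has to be copied into the YAML file by hand.

```python
from pathlib import Path


def absolute_model_path(relative_path="utr_model.py"):
    """Resolve a path relative to the current directory into an absolute path."""
    return str(Path(relative_path).resolve())


# Paste the printed path into regression_train.yml.
print(absolute_model_path())
```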

Currently, the model is set to train on GPU.
If you do not have CUDA on your machine, please set `use_cuda` to `False` in the configuration file. Note that using the CPU instead of GPU will slow down training considerably.
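
If you are unsure whether a GPU is usable, the following sketch reports whether PyTorch can see a CUDA device. It assumes PyTorch is importable in the current environment and falls back to `False` if it is not; the helper name `cuda_is_usable` is hypothetical.

```python
def cuda_is_usable():
    """Return True only if PyTorch is installed and reports a CUDA device."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run on the GPU.
        return False
    return bool(torch.cuda.is_available())


if cuda_is_usable():
    print("CUDA is available; use_cuda can stay set to True.")
else:
    print("CUDA is not available; set use_cuda to False in regression_train.yml.")
```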

```python
configs = load_path("./regression_train.yml")
```

```python
parse_configs_and_run(configs, lr=0.001)
```

Outputs and logs saved to ./2018-12-09-15-53-59
2018-12-09 15:54:01,335 - Creating validation dataset.
2018-12-09 15:54:01,361 - 0.02456068992614746 s to load 20096 validation examples (157 validation batches) to evaluate after each training step.
2018-12-09 15:54:24,581 - [STEP 2031] average number of steps per second: 88.0
2018-12-09 15:54:25,020 - validation r2: 0.8104067907778664
2018-12-09 15:54:25,125 - training loss: 0.2401450276374817
2018-12-09 15:54:25,126 - validation loss: 0.18883540832502826
2018-12-09 15:54:47,288 - [STEP 4062] average number of steps per second: 91.9
2018-12-09 15:54:47,729 - validation r2: 0.8564685296471333
2018-12-09 15:54:47,822 - training loss: 0.193187415599823
2018-12-09 15:54:47,823 - validation loss: 0.14294122951995036
2018-12-09 15:55:09,855 - [STEP 6093] average number of steps per second: 92.5
2018-12-09 15:55:10,290 - validation r2: 0.8653072068202623
2018-12-09 15:55:10,376 - training loss: 0.2143666297197342
2018-12-09 15:55:10,377 - vali