This notebook walks through the machine-learning-based bandit design pipeline used for Bandit 1-3. The pipeline consists of the following steps:
- data pre-processing
- prediction with Gaussian process regression (GPR)
- batch UCB recommendation

Below, we show how the recommendations for Rounds 1-3 (Bandit 1-3) are generated.

In [1]:
# add the parent directory to the module search path so that `src` can be imported
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import cm, rcParams
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
import seaborn as sns

import itertools
from collections import defaultdict
import math
import json
import xarray as xr

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import PairwiseKernel, DotProduct, RBF 
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
from sklearn.model_selection import KFold
from sklearn.manifold import TSNE

from src.embedding import Embedding
from src.environment import Rewards_env
from src.evaluations import evaluate, plot_eva
from src.regression import *
from src.kernels_for_GPK import *
from src.data_generating import generate_data
from src.batch_ucb import *
import src.config

from ipywidgets import IntProgress
from IPython.display import display
import warnings
%matplotlib inline

/data4/u6015325/SynbioML/synbio_rbs/example
['/data4/u6015325/SynbioML', '/localdata/u6015325/anaconda3/envs/synbio_ml/lib/python36.zip', '/localdata/u6015325/anaconda3/envs/synbio_ml/lib/python3.6', '/localdata/u6015325/anaconda3/envs/synbio_ml/lib/python3.6/lib-dynload', '', '/home/users/u6015325/.local/lib/python3.6/site-packages', '/localdata/u6015325/anaconda3/envs/synbio_ml/lib/python3.6/site-packages', '/localdata/u6015325/anaconda3/envs/synbio_ml/lib/python3.6/site-packages/IPython/extensions', '/home/users/u6015325/.ipython', '/data4/u6015325/SynbioML/synbio_rbs']


In [2]:
from platform import python_version

print(python_version())

3.6.13


In [3]:
folder_path = '../data/'
raw = 'n'

## Data - Raw TIR
We first illustrate the raw data, which includes the following columns:
- Name: the RBS name
- Group: design groups:
    - BPS-NC: base-by-base changes in the non-core region. 
    - BPS-C: base-by-base changes in the core region. 
    - UNI: Randomly generated sequences with uniform distribution. 
    - PPM: Randomly generated sequences with distribution following the PPM for all natural RBS in *E. coli*. 
    - Bandit-0/1/2/3: sequences generated by the bandit algorithm for Rounds 0, 1, 2 and 3, respectively.
- Plate: each plate contains 90 RBS sequences (1-5)
- Round: design round (0-4)
- RBS: 20-base RBS sequences
- RBS6: 6-base core RBS sequences
- Rep1 - Rep6: GFPOD for the 4h (using derivatives) for three biological replicates.
- AVERAGE: average value of replicates
- STD: standard deviation of replicates
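
As a quick consistency check on these column definitions, AVERAGE and STD can be recomputed from the replicate columns. The sketch below hard-codes the replicate values of the first row of the raw table (RBS_1by1_0) shown in the output further down. Using the sample standard deviation (`ddof=1`, the pandas default) is an assumption; it reproduces the reported value for this row.

```python
import pandas as pd

rep_cols = ["Rep1", "Rep2", "Rep3", "Rep4", "Rep5", "Rep6"]

# Replicate values of the first raw row (RBS_1by1_0, the reference sequence).
row = pd.DataFrame(
    [[80.9197, 52.402431, 98.72044, 61.622165, 54.151485, 45.499195]],
    columns=rep_cols,
)

# Row-wise mean and sample standard deviation over the six replicate columns.
row["AVERAGE"] = row[rep_cols].mean(axis=1)
row["STD"] = row[rep_cols].std(axis=1)  # assumption: ddof=1 (pandas default)

print(row[["AVERAGE", "STD"]].round(6))
```
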

In [4]:
raw_path = folder_path + 'Results_' + raw + '.csv'

if os.path.exists(raw_path): 
    df_raw = pd.read_csv(raw_path) 
else:
    df_raw = generate_data(raw)
df_raw.head()

Unnamed: 0,Name,Group,Plate,Round,RBS,RBS6,Rep1,Rep2,Rep3,Rep4,Rep5,Rep6,AVERAGE,STD
0,RBS_1by1_0,Reference,First_Plate,0,TTTAAGAAGGAGATATACAT,AGGAGA,80.9197,52.402431,98.72044,61.622165,54.151485,45.499195,65.552569,20.281781
1,RBS_1by1_1,BPS-NC,First_Plate,0,CTTAAGAAGGAGATATACAT,AGGAGA,58.33688,40.072951,81.1362,42.042854,45.432032,41.005659,51.337763,16.073928
2,RBS_1by1_2,BPS-NC,First_Plate,0,GTTAAGAAGGAGATATACAT,AGGAGA,38.7807,28.831559,58.76333,24.48787,24.133637,25.596639,33.432289,13.55949
3,RBS_1by1_3,BPS-NC,First_Plate,0,ATTAAGAAGGAGATATACAT,AGGAGA,60.72082,43.093359,74.60529,38.641958,38.049577,31.608154,47.786526,16.424
4,RBS_1by1_4,BPS-NC,First_Plate,0,TCTAAGAAGGAGATATACAT,AGGAGA,58.09954,45.913214,70.53162,44.352931,38.394865,43.641794,50.155661,11.922263


## Data pre-processing

We define the following steps, each applied to every replicate:  
- a. In each round, subtract the mean of the reference AVERAGE from every data point, then add 100 (to make the values positive).  
- b. Take the natural logarithm (base e) of each data point.  
- c. Apply z-score normalisation.  
    - c.1 on each round, so that each replicate has zero mean and unit variance within each round after normalisation. 
    - c.2 on all data, so that each replicate has zero mean and unit variance over all data after normalisation. 
- d. Apply min-max normalisation.
    - d.1 on each round
    - d.2 on all data
- e. Apply ratio normalisation: divide each data point by the mean of the reference AVERAGE, so that the reference labels are close to 1. 
    - e.1 on each round
    - e.2 on all data
    
In Round 1 (Bandit-1), we adopted *bc1*. We observed that the reference sequences give different TIR values in each round. Thus, in Rounds 2-3 (Bandit-2 and Bandit-3), we subtracted the reference mean first (step a) and adopted *abc1*.


The data generation code for these pre-processing options is defined in src/data_generating.py.
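
The *abc1* combination can be sketched as follows. This is a minimal illustration on made-up values, not the code in src/data_generating.py: the function name `preprocess_abc1`, the toy table, and the use of the population standard deviation (`ddof=0`) in the z-score are all assumptions.

```python
import numpy as np
import pandas as pd

rep_cols = ["Rep1", "Rep2"]

# Toy data (made-up values): two rounds, each with one Reference row.
df = pd.DataFrame({
    "Group": ["Reference", "BPS-NC", "UNI", "Reference", "Bandit-1", "Bandit-1"],
    "Round": [0, 0, 0, 1, 1, 1],
    "Rep1": [80.0, 58.0, 39.0, 60.0, 75.0, 20.0],
    "Rep2": [52.0, 40.0, 29.0, 50.0, 66.0, 15.0],
})

def preprocess_abc1(df, rep_cols):
    """Hypothetical helper: steps a, b and c.1 applied round by round."""
    out = df.copy()
    for _, idx in out.groupby("Round").groups.items():
        block = out.loc[idx, rep_cols]
        is_ref = out.loc[idx, "Group"] == "Reference"
        # a. subtract the mean reference AVERAGE of this round, then add 100
        ref_mean = block[is_ref].mean(axis=1).mean()
        block = block - ref_mean + 100
        # b. natural log (requires positive values after step a)
        block = np.log(block)
        # c.1 z-score each replicate column within the round
        block = (block - block.mean()) / block.std(ddof=0)
        out.loc[idx, rep_cols] = block
    # recompute the summary columns on the transformed replicates
    out["AVERAGE"] = out[rep_cols].mean(axis=1)
    out["STD"] = out[rep_cols].std(axis=1)
    return out

processed = preprocess_abc1(df, rep_cols)
print(processed.round(3))
```

Dropping step a from the function gives the *bc1* variant used for Round 1.
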

In [5]:
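round1 = 'bc1'    # Round 1: log transform + per-round z-score
round23 = 'abc1'  # Rounds 2-3: reference subtraction, then log transform + per-round z-score

round1_path = folder_path + 'Results_' + round1 + '.csv'
round23_path = folder_path + 'Results_' + round23 + '.csv'

# Load the cached pre-processed data if it exists; otherwise generate it.
if os.path.exists(round1_path):
    df_round1 = pd.read_csv(round1_path)
else:
    df_round1 = generate_data(round1)

if os.path.exists(round23_path):
    df_round23 = pd.read_csv(round23_path)
else:
    df_round23 = generate_data(round23)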

In [6]:
df_round1.head()

Unnamed: 0,Name,Group,Plate,Round,RBS,RBS6,Rep1,Rep2,Rep3,Rep4,Rep5,Rep6,AVERAGE,STD
0,RBS_1by1_0,Reference,First_Plate,0,TTTAAGAAGGAGATATACAT,AGGAGA,1.616261,1.814182,1.760954,2.186207,2.028863,1.831982,1.873075,0.20298
1,RBS_1by1_1,BPS-NC,First_Plate,0,CTTAAGAAGGAGATATACAT,AGGAGA,1.166174,1.337018,1.417248,1.4938,1.713526,1.644568,1.462056,0.201367
2,RBS_1by1_2,BPS-NC,First_Plate,0,GTTAAGAAGGAGATATACAT,AGGAGA,0.604551,0.751384,0.851987,0.514929,0.577299,0.795227,0.682563,0.135205
3,RBS_1by1_3,BPS-NC,First_Plate,0,ATTAAGAAGGAGATATACAT,AGGAGA,1.221264,1.466278,1.270212,1.34104,1.39503,1.175433,1.311543,0.109697
4,RBS_1by1_4,BPS-NC,First_Plate,0,TCTAAGAAGGAGATATACAT,AGGAGA,1.160566,1.579025,1.171829,1.59067,1.411255,1.756862,1.445035,0.242117


In [7]:
df_round23.head()

Unnamed: 0,Name,Group,Plate,Round,RBS,RBS6,Rep1,Rep2,Rep3,Rep4,Rep5,Rep6,AVERAGE,STD
0,RBS_1by1_0,Reference,First_Plate,0,TTTAAGAAGGAGATATACAT,AGGAGA,2.482263,2.555338,2.358414,3.10295,2.999178,2.316265,2.635735,0.334474
1,RBS_1by1_1,BPS-NC,First_Plate,0,CTTAAGAAGGAGATATACAT,AGGAGA,1.592779,1.694296,1.79821,1.850738,2.381356,1.996343,1.88562,0.27901
2,RBS_1by1_2,BPS-NC,First_Plate,0,GTTAAGAAGGAGATATACAT,AGGAGA,0.626302,0.774704,0.947196,0.418233,0.532036,0.733023,0.671916,0.18766
3,RBS_1by1_3,BPS-NC,First_Plate,0,ATTAAGAAGGAGATATACAT,AGGAGA,1.696364,1.917735,1.56813,1.600853,1.803056,1.26072,1.641143,0.226911
4,RBS_1by1_4,BPS-NC,First_Plate,0,TCTAAGAAGGAGATATACAT,AGGAGA,1.582321,2.118618,1.417531,2.014216,1.831391,2.186262,1.85839,0.306855


## Prediction & Recommendation

Two types of machine learning algorithms can be applied to drive the experimental design workflow.
- The first is a prediction algorithm (**LEARN**), which helps us learn the function of TIR with respect to RBS sequences. 
- The second is a recommendation algorithm (**DESIGN**), which recommends RBS sequences to query (test) in each batch based on the predictions from LEARN. 

In round t, the prediction and design are based on the results obtained in all previous rounds. 
Our implementation of the machine learning algorithms was tested in Python 3.6 and uses the scikit-learn library.
The prediction code is mainly located in src/regression.py, and the recommendation code is in src/batch_ucb.py.
For prediction, src/kernels_for_GPK.py is called to compute the kernel functions.
When we run GP_BUCB, the prediction (GPR) is performed first, and the recommendation is then made based on the prediction. 
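
To make the LEARN and DESIGN steps concrete, here is a minimal sketch of Gaussian process batch UCB with a precomputed kernel. It is not the code in src/batch_ucb.py: the function name `gp_bucb`, the RBF stand-in kernel, and the toy data are assumptions for illustration. The acquisition score is assumed to be mean + beta * std, and within a batch only the predictive covariance is updated after each pick, as if the chosen point had already been observed.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel; a stand-in for the string kernel.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / length_scale ** 2)

def gp_bucb(K_train, K_cross, K_cand, y, alpha, beta, batch_size):
    """Pick `batch_size` candidate indices by batch UCB (illustrative sketch)."""
    n = K_train.shape[0]
    A = K_train + alpha * np.eye(n)             # alpha acts as noise variance
    mean = K_cross.T @ np.linalg.solve(A, y)    # posterior mean (fixed in batch)
    cov = K_cand - K_cross.T @ np.linalg.solve(A, K_cross)  # posterior covariance
    available = np.ones(K_cand.shape[0], dtype=bool)
    chosen = []
    for _ in range(batch_size):
        std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
        ucb = np.where(available, mean + beta * std, -np.inf)
        j = int(np.argmax(ucb))
        chosen.append(j)
        available[j] = False
        # Condition the covariance on a noisy observation at j; the mean is
        # left unchanged because the outcome is not known yet.
        cov = cov - np.outer(cov[:, j], cov[:, j]) / (cov[j, j] + alpha)
    return chosen, mean

# Toy 1-D problem (made-up data).
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(8, 1))
y_train = np.sin(X_train[:, 0])
X_cand = np.linspace(-3, 3, 50)[:, None]

K_train = rbf_kernel(X_train, X_train)
K_cross = rbf_kernel(X_train, X_cand)
K_cand = rbf_kernel(X_cand, X_cand)

batch, mean = gp_bucb(K_train, K_cross, K_cand, y_train,
                      alpha=0.1, beta=2.0, batch_size=5)
print(batch)
```

With beta = 0 the score reduces to the predicted mean, so the batch is simply the top-ranked candidates by prediction.
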

The settings are specified in the next cell. To generate the recommendations for round n, set *design_round = n*.

In [8]:
# setting

design_round = 3 # Round to be designed

rec_size = 90 # in each round, we recommend 90 RBS sequences
l = 6 # maximum k-mer length is 6
s = 1 # maximum shift is 1
alpha = 2 # GPR noise parameter, chosen by cross-validation
sigma_0 = 1 # signal parameter for the kernel matrix
kernel = 'WD_Kernel_Shift' # weighted degree kernel with shift
embedding = 'label' # first encodes the strings as categorical labels, which the kernel then uses
kernel_norm_flag = True # whether to apply kernel normalisation
centering_flag = True # whether to apply kernel centering
unit_norm_flag = True # whether to apply unit-norm normalisation to the kernel

if design_round == 3: # UCB hyperparameter
    beta = 0
else:
    beta = 2
    
if design_round == 1:  # kernel normalisation over what
    kernel_over_all_flag = False
    df = df_round1
else:
    kernel_over_all_flag = True
    df = df_round23
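
For intuition about the `l` and `s` settings, the sketch below implements a weighted degree kernel with shifts following the commonly published formulation, with substring weights 2(l-d+1)/(l(l+1)) and shift weights 1/(2(shift+1)). It is an illustration only and may differ from `WD_Kernel_Shift` in src/kernels_for_GPK.py. The names `wd_shift_kernel` and `gram_matrix` are hypothetical, and reading "unit norm" as cosine normalisation K_ij / sqrt(K_ii K_jj) is an assumption.

```python
import numpy as np

def wd_shift_kernel(x, y, l=6, s=1):
    """Weighted degree kernel with shifts for two equal-length strings."""
    assert len(x) == len(y)
    L = len(x)
    total = 0.0
    for d in range(1, l + 1):                       # substring length
        beta_d = 2.0 * (l - d + 1) / (l * (l + 1))  # weight of length d
        for shift in range(0, s + 1):
            delta = 1.0 / (2.0 * (shift + 1))       # weight of this shift
            for pos in range(0, L - d - shift + 1):
                # compare x shifted against y, and y shifted against x
                a = x[pos + shift: pos + shift + d] == y[pos: pos + d]
                b = x[pos: pos + d] == y[pos + shift: pos + shift + d]
                total += beta_d * delta * (int(a) + int(b))
    return total

def gram_matrix(seqs, l=6, s=1, unit_norm=True):
    K = np.array([[wd_shift_kernel(a, b, l, s) for b in seqs] for a in seqs])
    if unit_norm:
        # assumption: "unit norm" read as cosine normalisation
        diag = np.sqrt(np.diag(K))
        K = K / np.outer(diag, diag)
    return K

# Three 20-base RBS sequences taken from the tables in this notebook.
seqs = [
    "TTTAAGAAGGAGATATACAT",  # reference
    "CTTAAGAAGGAGATATACAT",  # one change in the non-core region
    "TTTAAGAGGGGGCTATACAT",  # three changes in the core region
]
K = gram_matrix(seqs, l=6, s=1)
print(K.round(3))
```

In this sketch the sequence with a single edge mutation stays closer to the reference than the one with three core mutations, which is the behaviour the kernel is meant to capture.
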

In [None]:
gpbucb = GP_BUCB(df[df['Round'] < design_round], kernel_name=kernel, l=l, s=s, sigma_0=sigma_0,
                embedding=embedding, alpha=alpha, rec_size=rec_size, beta=beta, 
                kernel_norm_flag=kernel_norm_flag, centering_flag = centering_flag,              
                unit_norm_flag=unit_norm_flag, kernel_over_all_flag = kernel_over_all_flag)

gpbucb_rec_df = gpbucb.run_experiment()

In [10]:
gpbucb_rec_df

idx,RBS,RBS6,AVERAGE,pred mean,pred std,ucb,lcb
2729.0,TTTAAGAGGGGGCTATACAT,GGGGGC,,1.051928,0.486657,1.051928,1.051928
2721.0,TTTAAGAGGGGACTATACAT,GGGGAC,,1.017517,0.425420,1.017517,1.017517
162.0,TTTAAGAAAGGAGTATACAT,AAGGAG,,0.989109,0.455444,0.989109,0.989109
3258.0,TTTAAGATAGTGGTATACAT,TAGTGG,,0.985374,0.474377,0.985374,0.985374
2731.0,TTTAAGAGGGGGTTATACAT,GGGGGT,,0.916163,0.350266,0.916163,0.916163
...,...,...,...,...,...,...,...
15.0,TTTAAGAAAAATTTATACAT,AAAATT,,0.573114,0.550503,0.573114,0.573114
26.0,TTTAAGAAAACGGTATACAT,AAACGG,,0.572254,0.561592,0.572254,0.572254
185.0,TTTAAGAAAGTGCTATACAT,AAGTGC,,0.567468,0.539113,0.567468,0.567468
16.0,TTTAAGAAAACAATATACAT,AAACAA,,0.567402,0.595323,0.567402,0.567402
