In this notebook, we show the pipeline of machine-learning-based bandit design for Bandit 1-3. The pipeline includes the following steps:
- data pre-processing
- prediction (GPR)
- batch UCB recommendation

Below, we illustrate how the recommendations for Round 1-3 (Bandit 1-3) are generated.
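
To make the prediction and recommendation steps concrete, here is a minimal NumPy-only sketch of Gaussian process regression followed by a sequential batch UCB selection. It is an illustration under stated assumptions, not the code in `src/`: the RBF kernel on numeric features is a stand-in for whatever kernel the notebook actually uses, the batch rule (re-computing the posterior standard deviation after each pick while keeping the mean fixed) is one common reading of "batch UCB", and the names `rbf`, `gp_posterior`, `batch_ucb`, `beta` and `noise` are all hypothetical.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and b (stand-in kernel)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and std of a zero-mean GP at x_test."""
    k_tt = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    k_ts = rbf(x_train, x_test)
    # Solve for K^-1 y and K^-1 k_ts in one call.
    sol = np.linalg.solve(k_tt, np.column_stack([y_train, k_ts]))
    mean = k_ts.T @ sol[:, 0]
    var = 1.0 - np.sum(k_ts * sol[:, 1:], axis=0)  # prior variance of rbf is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))

def batch_ucb(x_train, y_train, x_pool, batch_size, beta=2.0):
    """Pick batch_size pool indices by mean + beta * std, one at a time."""
    mean, _ = gp_posterior(x_train, y_train, x_pool)  # mean uses real labels only
    x_seen = x_train.copy()
    available = np.ones(len(x_pool), dtype=bool)
    chosen = []
    for _ in range(batch_size):
        # The posterior std does not depend on the labels, so dummy zeros suffice.
        _, std = gp_posterior(x_seen, np.zeros(len(x_seen)), x_pool)
        score = np.where(available, mean + beta * std, -np.inf)
        pick = int(np.argmax(score))
        chosen.append(pick)
        available[pick] = False
        x_seen = np.vstack([x_seen, x_pool[pick]])  # shrink uncertainty near the pick
    return chosen

# Toy one-dimensional data, purely illustrative.
x_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.0, 1.0, 0.0])
x_pool = np.linspace(-1.0, 3.0, 9).reshape(-1, 1)
batch = batch_ucb(x_train, y_train, x_pool, batch_size=3)
print(batch)
```

Updating the standard deviation after each pick discourages the batch from clustering around a single high-scoring candidate; a plain top-k on the UCB score would not have that property.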

In [1]:
# direct to proper path
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import cm, rcParams
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
import seaborn as sns

import itertools
from collections import defaultdict
import math
import json
import xarray as xr

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import PairwiseKernel, DotProduct, RBF 
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
from sklearn.model_selection import KFold
from sklearn.manifold import TSNE

from src.embedding import Embedding
from src.environment import Rewards_env
from src.evaluations import evaluate, plot_eva
from src.regression import *
from src.kernels_for_GPK import *
from src.data_generating import generate_data

from ipywidgets import IntProgress
from IPython.display import display
import warnings
%matplotlib inline



In [2]:
folder_path = '../data/'
raw = 'n'

## Data - Raw TIR
We first illustrate the raw data, which includes the following columns:
- Name: 
- Group: 
- Plate: 
- Round: 
- RBS: 20-base RBS sequences
- RBS6: 
- Rep1 - Rep6: GFPOD for the 4h (using derivatives) for three biological replicates.
- AVERAGE: average value of replicates
- STD: standard deviation of replicates

In [3]:
raw_path = folder_path + 'Results_' + raw + '.csv'

if os.path.exists(raw_path): 
    df_raw = pd.read_csv(raw_path) 
else:
    df_raw = generate_data(raw)
df_raw.head()

Unnamed: 0,Name,Group,Plate,Round,RBS,RBS6,Rep1,Rep2,Rep3,Rep4,Rep5,Rep6,AVERAGE,STD
0,RBS_1by1_0,Reference,First_Plate,0,TTTAAGAAGGAGATATACAT,AGGAGA,80.9197,52.402431,98.72044,61.622165,54.151485,45.499195,65.552569,20.281781
1,RBS_1by1_1,BPS-NC,First_Plate,0,CTTAAGAAGGAGATATACAT,AGGAGA,58.33688,40.072951,81.1362,42.042854,45.432032,41.005659,51.337763,16.073928
2,RBS_1by1_2,BPS-NC,First_Plate,0,GTTAAGAAGGAGATATACAT,AGGAGA,38.7807,28.831559,58.76333,24.48787,24.133637,25.596639,33.432289,13.55949
3,RBS_1by1_3,BPS-NC,First_Plate,0,ATTAAGAAGGAGATATACAT,AGGAGA,60.72082,43.093359,74.60529,38.641958,38.049577,31.608154,47.786526,16.424
4,RBS_1by1_4,BPS-NC,First_Plate,0,TCTAAGAAGGAGATATACAT,AGGAGA,58.09954,45.913214,70.53162,44.352931,38.394865,43.641794,50.155661,11.922263
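
As a quick sanity check on how AVERAGE and STD relate to the replicate columns, the sketch below recomputes both from the first row of the table above. Using the sample standard deviation (`ddof=1`) is an assumption; the variable names are illustrative.

```python
import numpy as np

# Rep1..Rep6 of RBS_1by1_0, copied from the first row of the table above.
reps = np.array([80.9197, 52.402431, 98.72044, 61.622165, 54.151485, 45.499195])

average = reps.mean()      # arithmetic mean over the six replicate columns
std = reps.std(ddof=1)     # sample standard deviation (assumed ddof=1)

print(round(average, 4), round(std, 4))
```

For this row, both values agree with the tabulated AVERAGE (65.552569) and STD (20.281781) to the printed precision.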


# Data pre-processing

Define the following steps on each replicate:  
- a. In each round, subtract the mean of the reference AVERAGE from every data point, and then add 100 (to make the values positive).  
- b. Take the log (base e) of each data point.  
- c. Apply z-score normalisation.  
    - c.1 on each round, so that each replicate has zero mean and unit variance within each round after normalisation. 
    - c.2 on all data, so that each replicate has zero mean and unit variance across all data after normalisation. 
- d. Apply min-max normalisation.
    - d.1 on each round
    - d.2 on all data
- e. Apply ratio normalisation. In each round, each data point is divided by the mean of the reference AVERAGE, so that the reference labels are close to 1 in each round. 
    - e.1 on each round
    - e.2 on all data
    
In Round 1 (Bandit-1), we adopt *bc1*. We observed that the reference sequences give different TIR values in each round. Thus, in Round 2-3 (Bandit-2 and Bandit-3), we subtracted the reference mean first and adopted *abc1*.


The data-generating approaches are implemented in src/data_generating.py.
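
To illustrate how the step labels combine, here is a minimal sketch of *bc1* and *abc1* on a toy data frame. It is not the code in src/data_generating.py: the helper name `preprocess_sketch`, the toy values, the reading of step a as "subtract the round's mean reference AVERAGE", and the use of the population standard deviation (`ddof=0`) in the z-score are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def preprocess_sketch(df, rep_cols, subtract_reference):
    """Hypothetical helper: optional step a, then steps b and c.1, per round."""
    out = df.copy()
    out[rep_cols] = out[rep_cols].astype(float)
    for _, block in df.groupby("Round"):
        vals = block[rep_cols].astype(float)
        if subtract_reference:
            # Step a: subtract this round's mean reference AVERAGE, then add 100.
            ref_mean = block.loc[block["Group"] == "Reference", "AVERAGE"].mean()
            vals = vals - ref_mean + 100.0
        # Step b: natural log of every data point.
        vals = np.log(vals)
        # Step c.1: z-score each replicate column within the round (assumed ddof=0).
        vals = (vals - vals.mean()) / vals.std(ddof=0)
        out.loc[vals.index, rep_cols] = vals
    return out

# Toy values, not taken from the real data set.
toy = pd.DataFrame({
    "Round": [0, 0, 0, 1, 1, 1],
    "Group": ["Reference", "Other", "Other", "Reference", "Other", "Other"],
    "Rep1": [80.0, 58.0, 39.0, 60.0, 90.0, 30.0],
    "Rep2": [52.0, 40.0, 29.0, 45.0, 70.0, 25.0],
    "AVERAGE": [66.0, 49.0, 34.0, 52.5, 80.0, 27.5],
})
reps = ["Rep1", "Rep2"]

df_bc1 = preprocess_sketch(toy, reps, subtract_reference=False)   # steps b, c.1
df_abc1 = preprocess_sketch(toy, reps, subtract_reference=True)   # steps a, b, c.1
print(df_abc1.groupby("Round")[reps].mean().round(6))
```

In both variants each replicate column ends up with zero mean and unit variance within every round; the two differ only in whether the round-specific reference level is removed before the log transform.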

In [4]:
# normalisation labels: Round 1 uses steps b + c.1, Round 2-3 use steps a + b + c.1
round1 = 'bc1'
round23 = 'abc1'

round1_path = folder_path + 'Results_' + round1 + '.csv'
round23_path = folder_path + 'Results_' + round23 + '.csv'

# load the cached CSV if it exists, otherwise generate the data
if os.path.exists(round1_path): 
    df_round1 = pd.read_csv(round1_path) 
else:
    df_round1 = generate_data(round1)

if os.path.exists(round23_path): 
    df_round23 = pd.read_csv(round23_path) 
else:
    df_round23 = generate_data(round23)

ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.