# Random Sequence Design

Design 60 random 6-base RBS sequences, half with method A and half with method B:

A) Uniformly random.  
B) Random, weighted by Position Probability Matrix (PPM) frequencies.

Some of the random sequences may repeat the 61 baseline designs or the 60 bandit recommendations. We may therefore need to design more than 60 RBS sequences and filter out the repeats.
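
The filtering step could be sketched as below. This is only a minimal illustration using the standard library: `sample_unseen`, `baseline_seqs` and `bandit_seqs` are hypothetical names, and the sequences placed in the two exclusion lists are made-up placeholders, since the actual baseline designs and bandit recommendations are not listed in this notebook.

```python
import random
from itertools import product


def sample_unseen(candidates, excluded, n, seed=None):
    """Draw n distinct sequences from candidates, skipping anything in excluded."""
    excluded = set(excluded)
    pool = [seq for seq in candidates if seq not in excluded]
    if n > len(pool):
        raise ValueError("not enough unseen sequences to sample from")
    rng = random.Random(seed)
    # random.sample draws without replacement, so the result has no duplicates
    return rng.sample(pool, n)


# All 4**6 = 4096 possible 6-base sequences
all_seqs = [''.join(p) for p in product('ACGT', repeat=6)]

# Hypothetical placeholders, NOT the real designs
baseline_seqs = ['AGGAGA', 'TTTAAG']
bandit_seqs = ['AGGAGA', 'GGGGGG']

designs = sample_unseen(all_seqs, baseline_seqs + bandit_seqs, n=30, seed=0)
```

Removing the excluded sequences before sampling avoids having to oversample and then discard, which is an alternative way to reach the same result.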

In [1]:
import numpy as np
from itertools import product

In [2]:
# Enumerate all possible 6-base sequences (4**6 = 4096 combinations)

combos = []
char_sets = ['A', 'G', 'C', 'T']
design_len = 6

for combo in product(char_sets, repeat= design_len):
    combo = ''.join(combo) 
    combos.append(combo)
    
assert len(combos) == len(char_sets) ** design_len

## Uniformly random

In [3]:
num_design_seq = 30

# Sample uniformly without replacement, so no sequence is drawn twice
randomA = np.random.choice(np.asarray(combos), num_design_seq, replace = False)
randomA

array(['ATTTAC', 'ACTGCT', 'CCAGCG', 'TGAGCC', 'AGGGGG', 'GTGATG',
       'ACACTA', 'ACCCGT', 'CTACCG', 'CGCTGT', 'GTAAAA', 'GGTGCA',
       'TCTTTT', 'CGCGCC', 'TCTCTA', 'GTTGAG', 'GCAGGA', 'CTATCC',
       'TTGACT', 'TAGAAA', 'CCACTC', 'GACACG', 'CTTGTC', 'AAGATT',
       'CGGCCG', 'GGGGGA', 'GCGTTA', 'TAGTAT', 'CACCAG', 'TTGAAT'],
      dtype='<U6')

## Random with PPM

Based on the Position Frequency Matrix (PFM) in Fig. 5 a) of the paper [Quantitative analysis of ribosome binding sites in E.coli.](https://www.ncbi.nlm.nih.gov/pubmed/8165145) (calculated using 1055 E. coli RBS), we calculate the Position Probability Matrix (PPM) by normalising the counts at each position, so that each column is an independent multinomial distribution over the four bases. A pseudocount can be added to each letter at each position to keep matrix entries from being 0; here the pseudocount is set to 0 (every count in the table is already nonzero).

Our design covers positions -13 to -8 of the RBS. Note, however, that table a) has no data for position -13, so we use uniform frequencies for -13 (counts of 264, 263, 264 and 264, which sum to 1055).
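
As a rough illustration of the normalisation described above, the sketch below converts per-position counts into probabilities with an optional pseudocount. It uses plain dictionaries rather than pandas so that it stands alone; `counts_to_ppm`, `ppm_example` and `ppm_smoothed` are hypothetical names, and only the counts for positions -12 and -9 of the PFM are included.

```python
def counts_to_ppm(counts, pseudocount=0.0):
    """Normalise {position: {base: count}} column by column.

    The pseudocount is added to every base at every position before
    normalising, so each column still sums to 1.
    """
    ppm = {}
    for pos, col in counts.items():
        total = sum(col.values()) + pseudocount * len(col)
        ppm[pos] = {base: (c + pseudocount) / total for base, c in col.items()}
    return ppm


# Counts for two positions of the PFM (each column sums to 1055)
counts_example = {
    -12: {'A': 388, 'C': 137, 'G': 330, 'T': 200},
    -9: {'A': 233, 'C': 79, 'G': 549, 'T': 194},
}

ppm_example = counts_to_ppm(counts_example, pseudocount=0)   # as in this notebook
ppm_smoothed = counts_to_ppm(counts_example, pseudocount=1)  # assumed alternative
```

With a pseudocount of 0 this reduces to dividing each count by its column total. A positive pseudocount would only matter if some count were 0.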

Discuss: PPM or PWM (position weight matrix)?
A PWM sequence score indicates how different a sequence is from a random sequence: the score is greater than 0 only when the sequence is more likely to be a functional site than a random site. In this task, however, we do not want to rule out combinations that are relatively uncommon in nature, since some of them may still produce high protein levels. We therefore sample from the PPM.
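
For comparison, a PWM score could be computed as in the sketch below. It assumes the common log-odds definition, log2(p / b), with a uniform background b = 0.25 for every base; the notebook itself does not specify a definition, and `pwm_from_ppm` and `pwm_score` are hypothetical helper names. Only positions -12 to -8 are used, taken from the PFM counts.

```python
import math

# PFM counts for positions -12 to -8 (each column sums to 1055)
pfm_cols = [
    {'A': 388, 'C': 137, 'G': 330, 'T': 200},  # -12
    {'A': 369, 'C': 104, 'G': 476, 'T': 106},  # -11
    {'A': 351, 'C': 83, 'G': 511, 'T': 110},   # -10
    {'A': 233, 'C': 79, 'G': 549, 'T': 194},   # -9
    {'A': 367, 'C': 103, 'G': 443, 'T': 142},  # -8
]
ppm_cols = [{b: c / sum(col.values()) for b, c in col.items()} for col in pfm_cols]


def pwm_from_ppm(ppm, background=0.25):
    """Log-odds weights log2(p / background); requires every p > 0."""
    return [{b: math.log2(p / background) for b, p in col.items()} for col in ppm]


def pwm_score(seq, pwm):
    """Sum the per-position weights of the bases in seq."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match the number of PWM columns")
    return sum(col[base] for base, col in zip(seq, pwm))


pwm = pwm_from_ppm(ppm_cols)
g_rich = pwm_score('GGGGG', pwm)  # G is over-represented at every position
c_rich = pwm_score('CCCCC', pwm)  # C is under-represented at every position
```

Here `g_rich` is positive and `c_rich` is negative, which illustrates the point above: ranking or thresholding by PWM score would push the design away from sequences that are rare in natural RBS.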

In [4]:
import pandas as pd

In [5]:
data = [['A', 264, 388, 369, 351, 233, 367], 
        ['C', 263, 137, 104, 83, 79, 103],
        ['G', 264, 330, 476, 511, 549, 443],
        ['T', 264, 200, 106, 110, 194, 142]]

df = pd.DataFrame(data, columns = ['pos', -13, -12, -11, -10, -9, -8])
df = df.set_index('pos')
df.index.names = [None]

df

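|   | -13 | -12 | -11 | -10 | -9 | -8 |
|---|-----|-----|-----|-----|-----|-----|
| A | 264 | 388 | 369 | 351 | 233 | 367 |
| C | 263 | 137 | 104 | 83 | 79 | 103 |
| G | 264 | 330 | 476 | 511 | 549 | 443 |
| T | 264 | 200 | 106 | 110 | 194 | 142 |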


In [6]:
df.sum(axis = 0)

-13    1055
-12    1055
-11    1055
-10    1055
-9     1055
-8     1055
dtype: int64

In [7]:
ppm = df/df.sum(axis = 0)
ppm

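|   | -13 | -12 | -11 | -10 | -9 | -8 |
|---|----------|----------|----------|----------|----------|----------|
| A | 0.250237 | 0.367773 | 0.349763 | 0.332701 | 0.220853 | 0.347867 |
| C | 0.249289 | 0.129858 | 0.098578 | 0.078673 | 0.074882 | 0.09763 |
| G | 0.250237 | 0.312796 | 0.451185 | 0.48436 | 0.520379 | 0.419905 |
| T | 0.250237 | 0.189573 | 0.100474 | 0.104265 | 0.183886 | 0.134597 |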


In [8]:
ppm_rec = []
num_design_seq = 30

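# Build each sequence by sampling every position independently from its PPM column.
# The base list must be in the same order as the rows of ppm (A, C, G, T).
# Note: independent draws can produce the same sequence more than once.
for r in range(num_design_seq): 
    rbs = ''
    for i in ppm.columns:
        rbs += np.random.choice(['A', 'C', 'G', 'T'], p = ppm[i].values)
    ppm_rec.append(rbs)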

In [9]:
ppm_rec

['AAGGAA',
 'CTGGTG',
 'TAAACA',
 'CGAGGT',
 'CGGAGC',
 'TAGGGA',
 'CAGACG',
 'AGGAGA',
 'TAAAGG',
 'GAGGGG',
 'AAAGAC',
 'CAGGGA',
 'GGAAAG',
 'GTCAGG',
 'GGGTGA',
 'GGTATG',
 'ATAATG',
 'GGAGGC',
 'AGAAGG',
 'GAGGCG',
 'CACCGG',
 'CGTCGT',
 'CTGGCG',
 'CAGGAA',
 'AAGGGG',
 'GAAGGG',
 'CAGAAA',
 'GGGGAA',
 'GGGTGG',
 'GTATAG']