# xylE dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.path.insert(0, '/Users/tareen/Desktop/Research_Projects/2020_mavenn_github/mavenn_git_ssh_local')

import mavenn
import logomaker
import seaborn as sns
import re
import urllib

%matplotlib inline

# Summary 

The *xylE* sort-seq MPRA data of Belliveau et al., 2018. The authors used fluorescence-activated cell sorting (FACS), followed by deep sequencing, to assay gene expression levels from the *xylE* promoter in *E. coli*. *xylE* encodes a xylose/proton symporter involved in xylose uptake. Note that the authors also performed similar experiments at several other *E. coli* promoters, but this notebook is restricted to *xylE*. See Belliveau et al., 2018 for more details.

The authors performed their experiment by splitting the *xylE* promoter into three regions and mutagenizing each region separately. Variant sequences from each region were then FACS-sorted into 1 of 4 bins. Thus each of the 3 regions was sorted into 4 bins, giving 12 bins across the entire promoter. This notebook combines the variant promoter subsequences into a single sequence and concatenates the bin values from the 3 regions. In each of the three raw dataframes (one per mutagenized *xylE* region), the `'x'` column lists variant sequences and the `'bin'` column records which of the 4 FACS bins each sequence was observed in. In the final dataset, the `'ct_*'` columns hold the per-bin counts and the `'set'` column indicates whether each sequence is assigned to the training set, the validation set, or the test set.

**Names**: ``'xylE'``

**Reference**: Nathan M Belliveau, Stephanie L Barnes, William T Ireland, Daniel L Jones, Michael J Sweredoski, Annie Moradian, Sonja Hess, Justin B Kinney, Rob Phillips. Systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria. [Proc Natl Acad Sci USA, 115, E4796–E4805 (2018).](https://doi.org/10.1073/pnas.1722055115)

# Download data
The following cell downloads each of the 3 mutagenized regions into its own pandas dataframe. Note that the URLs will have to be updated to replace `development` with `master`.

In [2]:
# Download the 3 mutagenized regions of the dataset into pandas dataframes

# region 1
file_name = '20160710_xylE_MG1655_M9xylose_na_mut1_4bins_pymc.csv.gz'
url = f'https://github.com/jbkinney/mavenn/blob/development/mavenn/examples/datasets/raw/{file_name}?raw=true'

data_df_mut1 = pd.read_csv(url,  
                           compression='gzip',
                           index_col=[0])

# region 2
file_name = '20160710_xylE_MG1655_M9xylose_na_mut2_4bins_pymc.csv.gz'
url = f'https://github.com/jbkinney/mavenn/blob/development/mavenn/examples/datasets/raw/{file_name}?raw=true'

data_df_mut2 = pd.read_csv(url,  
                           compression='gzip',
                           index_col=[0])


# region 3
file_name = '20160710_xylE_MG1655_M9xylose_na_mut3_4bins_pymc.csv.gz'
url = f'https://github.com/jbkinney/mavenn/blob/development/mavenn/examples/datasets/raw/{file_name}?raw=true'

data_df_mut3 = pd.read_csv(url,  
                           compression='gzip',
                           index_col=[0])

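The three download steps differ only in the region number, so they can be written as a loop. The sketch below is one possible way to do that; `region_url` and `load_regions` are hypothetical helper names introduced here for illustration, and the file-name pattern is simply copied from the cell above.

```python
import pandas as pd

# Base URL copied from the download cell; 'development' may need to become 'master'.
BASE_URL = ('https://github.com/jbkinney/mavenn/blob/development/'
            'mavenn/examples/datasets/raw/')


def region_url(region, base_url=BASE_URL):
    """Build the raw-file URL for mutagenized region 1, 2, or 3 (hypothetical helper)."""
    file_name = f'20160710_xylE_MG1655_M9xylose_na_mut{region}_4bins_pymc.csv.gz'
    return f'{base_url}{file_name}?raw=true'


def load_regions(regions=(1, 2, 3)):
    """Download each region into its own dataframe (requires network access)."""
    return {region: pd.read_csv(region_url(region),
                                compression='gzip',
                                index_col=[0])
            for region in regions}
```

Calling `load_regions()` would return a dictionary keyed by region number, so `load_regions()[1]` plays the role of `data_df_mut1`.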


The following two cells give a preview of the raw dataset from mutagenized regions 1 and 2, respectively.

In [3]:
data_df_mut1.head()

Unnamed: 0,x,bin
0,TTAACTGTCTGACGATTCAAACATCAATACACTAATAAAAGATCAG...,0
1,TTAACTGCATGACAATTCCAACCTCAATGCATTGATAAAAGATCAG...,0
2,TTAACTGCCTGACAATTCCAAGATCAATGCAGTGATAAAGGATCAG...,0
3,TTTACTGCGTGTCAATTCGGGCAGCAGTACACTTATAAGAGATCAG...,0
4,TAAACTACCTGACAACTCCAACTTTAACGCACTGATTACAGTTCAG...,0


In [4]:
data_df_mut2.head()

Unnamed: 0,x,bin
0,ACAGAAAAGACATAACGTAAACGCATTGTAAAAAATGATAGTTGCC...,0
1,ACAGAAAAGACATTACGTCAACGCATTGTTAAAATTGATTAATTCC...,0
2,ACAGAAAAGACATTACGTTAACGAATTGTAAAGAAGGATAATAGCC...,0
3,ACAGAAAAGACATTACGTAAACGCATTGTTAAAGATGAAAAATAAC...,0
4,ACAGAAAAGACATTACGTAAACGCATTGTAAAAAATGGTAACTGCC...,0


Transform the vector-format bin data into matrix form using mavenn's `vec_data_to_mat_data` utility function.

In [5]:
cts_mut_1 = mavenn.src.utils.vec_data_to_mat_data(data_df_mut1['bin'].values)
cts_mut_2 = mavenn.src.utils.vec_data_to_mat_data(data_df_mut2['bin'].values)
cts_mut_3 = mavenn.src.utils.vec_data_to_mat_data(data_df_mut3['bin'].values)

In [6]:
# this is what the transformed counts matrix looks like for region 1
cts_mut_1

(array([[1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        ...,
        [0, 0, 0, 1],
        [0, 0, 0, 1],
        [0, 0, 0, 1]]),
 array([      0,       1,       2, ..., 2897334, 2897335, 2897336]))
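To make the vector-to-matrix conversion concrete, here is a minimal NumPy sketch of the same idea: each bin index becomes a row with a single 1 in the corresponding column. This is not mavenn's implementation (the real utility also returns a second array, as the output above shows); `bins_to_counts` is a hypothetical name used only for illustration.

```python
import numpy as np


def bins_to_counts(bin_values, num_bins=4):
    """Turn a vector of bin indices into a (rows x num_bins) 0/1 count matrix."""
    bin_values = np.asarray(bin_values, dtype=int)
    counts = np.zeros((len(bin_values), num_bins), dtype=int)
    # Put a single count in the column given by each row's bin index.
    counts[np.arange(len(bin_values)), bin_values] = 1
    return counts


toy_bins = [0, 0, 3, 1]
toy_counts = bins_to_counts(toy_bins)
```

Each row of `toy_counts` sums to 1, which mirrors the structure of the matrix printed for region 1.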

In [7]:
# build per-region ct_* dataframes and merge them with the sequence dataframes
cts_mut_1_df = pd.DataFrame(cts_mut_1[0],columns=['ct_0','ct_1','ct_2','ct_3'])
cts_mut_2_df = pd.DataFrame(cts_mut_2[0],columns=['ct_0','ct_1','ct_2','ct_3'])
cts_mut_3_df = pd.DataFrame(cts_mut_3[0],columns=['ct_0','ct_1','ct_2','ct_3'])

data_df_mut1 = data_df_mut1.merge(cts_mut_1_df,how='outer', left_index=True, right_index=True)
data_df_mut2 = data_df_mut2.merge(cts_mut_2_df,how='outer', left_index=True, right_index=True)
data_df_mut3 = data_df_mut3.merge(cts_mut_3_df,how='outer', left_index=True, right_index=True)

In [8]:
# concatenate the 3 regions side by side and drop rows containing NaNs
data_df = pd.concat([data_df_mut3, data_df_mut2, data_df_mut1], axis=1).dropna()

The concatenated dataset below contains redundant information and will be cleaned up in the following cells.

In [9]:
data_df.head()

Unnamed: 0,x,bin,ct_0,ct_1,ct_2,ct_3,x.1,bin.1,ct_0.1,ct_1.1,ct_2.1,ct_3.1,x.2,bin.2,ct_0.2,ct_1.2,ct_2.2,ct_3.2
0,CAGCAATAGCATTATTTTTATCAATTTTGGATAATTATCACAATTA...,0.0,1.0,0.0,0.0,0.0,ACAGAAAAGACATAACGTAAACGCATTGTAAAAAATGATAGTTGCC...,0.0,1.0,0.0,0.0,0.0,TTAACTGTCTGACGATTCAAACATCAATACACTAATAAAAGATCAG...,0,1,0,0,0
1,CGGCAATAGTATTGTTTATATCGATTTTGGATAGTTATCTCAATTA...,0.0,1.0,0.0,0.0,0.0,ACAGAAAAGACATTACGTCAACGCATTGTTAAAATTGATTAATTCC...,0.0,1.0,0.0,0.0,0.0,TTAACTGCATGACAATTCCAACCTCAATGCATTGATAAAAGATCAG...,0,1,0,0,0
2,CGGCAATAGTATTGTTTTTATCAATTTTGGATAATTATCACAATTA...,0.0,1.0,0.0,0.0,0.0,ACAGAAAAGACATTACGTTAACGAATTGTAAAGAAGGATAATAGCC...,0.0,1.0,0.0,0.0,0.0,TTAACTGCCTGACAATTCCAAGATCAATGCAGTGATAAAGGATCAG...,0,1,0,0,0
3,TGGCAATATTATTGTTTTTGTCAATTTTGGATAATTATCACAATTA...,0.0,1.0,0.0,0.0,0.0,ACAGAAAAGACATTACGTAAACGCATTGTTAAAGATGAAAAATAAC...,0.0,1.0,0.0,0.0,0.0,TTTACTGCGTGTCAATTCGGGCAGCAGTACACTTATAAGAGATCAG...,0,1,0,0,0
4,GGGCATTAATATGTTTTTTACCAATTTTGGATTATTATCCCAATTA...,0.0,1.0,0.0,0.0,0.0,ACAGAAAAGACATTACGTAAACGCATTGTAAAAAATGGTAACTGCC...,0.0,1.0,0.0,0.0,0.0,TAAACTACCTGACAACTCCAACTTTAACGCACTGATTACAGTTCAG...,0,1,0,0,0
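The behavior of the column-wise concatenation is easy to miss, so here is a small toy sketch of it. `pd.concat(..., axis=1)` pairs rows by index label, pads the shorter frames with NaN, and `dropna()` then trims the result to the rows present in every frame. The toy frames `a` and `b` are made up for illustration.

```python
import pandas as pd

# Two toy "regions" with different numbers of rows.
a = pd.DataFrame({'x': ['AAA', 'CCC', 'GGG'], 'bin': [0, 1, 2]})
b = pd.DataFrame({'x': ['TTT', 'ACG'], 'bin': [3, 0]})

# Rows are paired by index label; the missing row of `b` is filled with NaN.
wide = pd.concat([a, b], axis=1)

# Dropping NaN rows keeps only index labels present in both frames.
trimmed = wide.dropna()
```

Here `wide` has 3 rows and `trimmed` has 2. The padded columns are upcast to float, which is consistent with the mix of `0.0`-style and `0`-style values in the preview above. Note also that the result carries repeated column labels, so `trimmed['x']` selects every column named `'x'` at once.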


The sequences from the 3 regions currently overlap: the middle region includes flanking sequence that is also present in its neighbors. The following cell creates a temporary variable that will be filled with the entire *xylE* sequence without any overlapping flanks. The index values were obtained from the description of the assay in Belliveau et al., 2018.

In [10]:
# temporary variable holding the three per-region sequences for each row ...
temp_x = data_df['x'].values

# ... its length sets the size of an empty numpy array that will be
# populated with the full-length xylE variant sequences
temp_concat = np.empty(shape=(temp_x.shape[0],1),dtype='object')

for i in range(len(temp_x)):
    # region 3 + trimmed region 2 (overlapping flanks removed) + region 1
    temp_concat[i]=temp_x[i][0]+temp_x[i][1][9:46]+temp_x[i][2]
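The stitching rule used in the loop can be illustrated on toy strings: keep the two outer pieces whole and take only the slice `[9:46]` of the middle piece. The function name `stitch` and the toy sequences are assumptions made for this sketch; only the slice indices come from the cell above.

```python
def stitch(left, middle, right, start=9, stop=46):
    """Join three pieces, keeping only middle[start:stop] to drop overlapping flanks."""
    return left + middle[start:stop] + right


# Toy pieces: the middle has 9 flank characters, 37 unique ones, then 4 more flank characters.
toy_left = 'A' * 5
toy_middle = 'C' * 9 + 'G' * 37 + 'C' * 4
toy_right = 'T' * 5

full = stitch(toy_left, toy_middle, toy_right)
```

In this toy case the middle piece contributes exactly 37 characters (46 minus 9), so `full` is 47 characters long and contains none of the `'C'` flanks.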

In [11]:
# this dataframe contains all sequences
sequences_df = pd.DataFrame(temp_concat,columns=['seq'])

In [12]:
# further clean up the dataframe by dropping the redundant columns
del data_df['x']
del data_df['bin']

This dataframe now only contains counts across all 12 bins.

In [13]:
data_df

Unnamed: 0,ct_0,ct_1,ct_2,ct_3,ct_0.1,ct_1.1,ct_2.1,ct_3.1,ct_0.2,ct_1.2,ct_2.2,ct_3.2
0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0
1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0
2,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0
3,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0
4,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1379847,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0,1,0,0
1379848,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0,1,0,0
1379849,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0,1,0,0
1379850,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0,1,0,0
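Because the 12 count columns come from three frames that each use the labels `ct_0` to `ct_3`, the labels repeat across regions. If unique labels are preferred, one optional approach is to renumber the columns consecutively. The sketch below does this on a toy frame; it is a suggestion under that assumption, not a step the notebook performs.

```python
import pandas as pd

# Toy frame with repeated count labels, standing in for the 12-column frame.
toy = pd.DataFrame([[1, 0, 0, 1],
                    [0, 1, 1, 0]],
                   columns=['ct_0', 'ct_1', 'ct_0', 'ct_1'])

# Renumber the count columns consecutively so every label is unique.
toy.columns = [f'ct_{i}' for i in range(toy.shape[1])]
```

Applied to the real frame, the same pattern would yield labels `ct_0` through `ct_11`, in the column order produced by the concatenation.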


The entire *xylE* promoter sequence is 150 bp in length.

In [14]:
len(sequences_df['seq'].values[0])

150

In [15]:
data_df.insert(0, 'x', sequences_df['seq'].values)

This is what the dataset looks like now.

In [16]:
data_df.head()

Unnamed: 0,x,ct_0,ct_1,ct_2,ct_3,ct_0.1,ct_1.1,ct_2.1,ct_3.1,ct_0.2,ct_1.2,ct_2.2,ct_3.2
0,CAGCAATAGCATTATTTTTATCAATTTTGGATAATTATCACAATTA...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0
1,CGGCAATAGTATTGTTTATATCGATTTTGGATAGTTATCTCAATTA...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0
2,CGGCAATAGTATTGTTTTTATCAATTTTGGATAATTATCACAATTA...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0
3,TGGCAATATTATTGTTTTTGTCAATTTTGGATAATTATCACAATTA...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0
4,GGGCATTAATATGTTTTTTACCAATTTTGGATTATTATCCCAATTA...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0


In [17]:
# Randomly assign sequences to training, validation, and test sets
final_df = data_df.copy()
np.random.seed(0)
final_df['set'] = np.random.choice(a=['training','test','validation'], 
                                   p=[.6,.2,.2], 
                                   size=len(final_df))
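A quick sanity check on a random split is to compare the realized fractions against the requested probabilities. The sketch below repeats the same `np.random.choice` call on a toy sample of 100,000 labels (the size is an arbitrary choice for illustration) and tabulates the proportions with pandas.

```python
import numpy as np
import pandas as pd

toy_size = 100_000  # arbitrary sample size for illustration

np.random.seed(0)
toy_sets = np.random.choice(a=['training', 'test', 'validation'],
                            p=[.6, .2, .2],
                            size=toy_size)

# Fraction of labels falling in each set; should be close to 0.6 / 0.2 / 0.2.
fractions = pd.Series(toy_sets).value_counts(normalize=True)
```

The same check can be run on the real data with `final_df['set'].value_counts(normalize=True)`.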

The final dataset

In [18]:
final_df 

Unnamed: 0,x,ct_0,ct_1,ct_2,ct_3,ct_0.1,ct_1.1,ct_2.1,ct_3.1,ct_0.2,ct_1.2,ct_2.2,ct_3.2,set
0,CAGCAATAGCATTATTTTTATCAATTTTGGATAATTATCACAATTA...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0,training
1,CGGCAATAGTATTGTTTATATCGATTTTGGATAGTTATCTCAATTA...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0,test
2,CGGCAATAGTATTGTTTTTATCAATTTTGGATAATTATCACAATTA...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0,test
3,TGGCAATATTATTGTTTTTGTCAATTTTGGATAATTATCACAATTA...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0,training
4,GGGCATTAATATGTTTTTTACCAATTTTGGATTATTATCCCAATTA...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,0,0,training
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1379847,CGGCAATAGTATTGTTTTTATCAATTTTGAATTATTATCACATTTA...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0,1,0,0,validation
1379848,AGTCAATGGTATCGTTTTTATCAATTTTGGATAATTATCACAATTA...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0,1,0,0,training
1379849,CTATAATAGTATTGTTTTTATCAATTTTGTATAATTATCACAATTA...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0,1,0,0,test
1379850,CGGCAATAGTATTGTTTTTATGAATTTTGGATAATTATCACAATTA...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0,1,0,0,training


In [19]:
# save data
final_df.to_csv('xylE_data.csv.gz',index=False, compression='gzip')
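To confirm that a gzip-compressed CSV written this way can be read back, the sketch below round-trips a small toy frame through a temporary directory. The toy data and file name are made up for illustration. One caveat worth knowing: if the saved frame has repeated column labels, `pd.read_csv` by default renames the repeats with suffixes such as `.1` and `.2`.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'x': ['ACGT', 'TTGA'],
                    'ct_0': [1, 0],
                    'set': ['training', 'test']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data.csv.gz')
    # Write without the index, as in the save cell, then read the file back.
    toy.to_csv(path, index=False, compression='gzip')
    reloaded = pd.read_csv(path, compression='gzip')
```

For the real file, `pd.read_csv('xylE_data.csv.gz', compression='gzip')` would be the corresponding call.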