# xylE dataset

In [1]:
# Standard imports
import numpy as np
import pandas as pd

# Special imports
import mavenn
import pyfasta

# Summary 

Sort-seq MPRA data from Belliveau et al., 2018. The authors used fluorescence-activated cell sorting (FACS) of *E. coli* cells, followed by deep sequencing, to assay gene expression levels from variant *xylE* promoters. Specifically, the authors created three plasmid libraries containing variant *xylE* promoters driving GFP expression. For each library, *xylE* promoters were mutagenized within a window of 53 bp (library 1), 54 bp (library 2), or 60 bp (library 3) in length, after which cells were sorted into 4 bins using FACS. The three mutagenized windows partially overlap and cover a total of 150 bp spanning positions [-127:+23] relative to the primary *xylE* transcription start site.

Below we convert the data files provided by Belliveau et al. into a format suitable for analysis in MAVE-NN. In the final formatted dataset, column `'x'` lists variant 150 bp DNA sequences (each containing the variable region embedded as appropriate within the larger wild-type sequence), columns `'ct_0'` to `'ct_11'` list the number of reads obtained for each sequence in each of the 12 FACS bins (4 bins per library), and the `'set'` column indicates whether each sequence is assigned to the training set, the validation set, or the test set.

**Name**: ``'xylE'``

**Reference**: Belliveau NM, Barnes SL, Ireland WT, Jones DL, Sweredoski MJ, Moradian A, Hess S, Kinney JB, Phillips R. Systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria. [Proc Natl Acad Sci USA 115:E4796–E4805 (2018).](https://doi.org/10.1073/pnas.1722055115)

In [2]:
mavenn.load_example_dataset('xylE')

| | set | ct_0 | ct_1 | ct_2 | ct_3 | ct_4 | ct_5 | ct_6 | ct_7 | ct_8 | ct_9 | ct_10 | ct_11 | x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | training | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | AAACAATAGCATTGTTCTTATCAATTTTGGATAAGTATTATAATTA... |
| 1 | test | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | AAACAATAGTATCGTTTTTAGCTCATTTGGATAATTATTACAATTA... |
| 2 | test | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | AAACAATAGTATTGCTTTTATCAATTTAGGATAATTATCACAATTA... |
| 3 | training | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | AAACAATAGTATTGCTTTTATCAATTTAGGATAATTATCACGATTA... |
| 4 | training | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | AAACAATAGTATTGCTTTTATCATTTTTGGAAAATTAACACGATTA... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1847223 | training | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | TTTCAATAGTCTTGTTTTTATCAATTTTGGATAATTATCACAATTT... |
| 1847224 | validation | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | TTTCACTAGAATTGTTTTTATCAATTGTGGATAATTTTCACAATTA... |
| 1847225 | training | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | TTTCACTAGAATTGTTTTTATCAATTGTGGATAATTTTCACAATTA... |
| 1847226 | validation | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | TTTCAGTAGTATGGTTTTTATCAATTTTGGATAATTATCACATTTA... |


# Preprocessing

*xylE* is transcribed in the reverse direction starting from position 4,242,291 in the *E. coli* genome. Belliveau et al. studied a 150 bp region of the *xylE* promoter, spanning genomic coordinates [4,242,269:4,242,418]. We designate the reverse-strand sequence of this genomic region as the wild-type sequence. Within this 150 bp region, library 1 spans positions [98:150], library 2 spans positions [52:105], and library 3 spans positions [1:60]. A schematic is shown below.

![Image of 150 bp region.](xylE_promoter.png)
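The coordinate bookkeeping above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that the quoted window positions are 1-based and inclusive, and that TSS-relative coordinates place the transcription start site at +1 with no position 0 (a convention that is consistent with the [-127:+23] range spanning 150 bp). The helper names `to_python_slice` and `tss_relative` are hypothetical and are not part of MAVE-NN.

```python
# Sketch: relate the 1-based inclusive window coordinates quoted above to
# Python slices and to TSS-relative positions. Values are copied from the text.

REGION_START, REGION_STOP = 4_242_269, 4_242_418  # genomic coords, inclusive
TSS = 4_242_291                                    # xylE transcription start site

# Library windows within the 150 bp region (assumed 1-based, inclusive)
lib_windows = {1: (98, 150), 2: (52, 105), 3: (1, 60)}


def to_python_slice(first, last):
    """Convert 1-based inclusive positions to a 0-based half-open slice."""
    return first - 1, last


def tss_relative(pos):
    """Map a 1-based position in the reverse-strand region to a TSS-relative
    coordinate, assuming the TSS is +1 and there is no position 0."""
    genomic = REGION_STOP - (pos - 1)  # position 1 is the highest genomic coordinate
    offset = TSS - genomic             # >= 0 at or downstream of the TSS
    return offset + 1 if offset >= 0 else offset


region_length = REGION_STOP - REGION_START + 1
print(f'region length: {region_length} bp')
for lib, (first, last) in sorted(lib_windows.items()):
    start, stop = to_python_slice(first, last)
    print(f'library {lib}: slice [{start}:{stop}], {stop - start} bp, '
          f'TSS-relative {tss_relative(first):+d} to {tss_relative(last):+d}')
```

Under these assumptions the three slices come out as `[97:150]`, `[51:105]`, and `[0:60]`, which are the Python coordinates used later in the preprocessing code.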

We begin by extracting the 150 bp wild-type sequence from the *E. coli* genome. Note that this step requires the `pyfasta` Python package, which is not a dependency of MAVE-NN.

In [3]:
# Extract WT sequence from E. coli genome
# Genome FASTA file downloaded from https://www.ncbi.nlm.nih.gov/assembly/GCF_000005845.2/ on 12/25/2021
genome_file = '../../mavenn/examples/datasets/raw/GCF_000005845.2_ASM584v2_genomic.fna'
genome = pyfasta.Fasta(genome_file)
wt_seq = genome.sequence({'chr':'NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome',
                          'start': 4242269,
                          'stop': 4242418,
                          'strand': '-'})
print(f'Wild-type sequence (length {len(wt_seq)} bp):')
wt_seq

Wild-type sequence (length 150 bp):


'CGGCAATAGTATTGTTTTTATCAATTTTGGATAATTATCACAATTAAGATCACAGAAAAGACATTACGTAAACGCATTGTAAAAAATGATAATTGCCTTAACTGCCTGACAATTCCAACATCAATGCACTGATAAAAGATCAGAATGGTC'
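To make explicit what reading a region from the `'-'` strand means, here is a minimal pure-Python sketch on a toy sequence. It is not how `pyfasta` is implemented; the function `extract_region` and the toy genome are hypothetical, and the sketch assumes 1-based, inclusive `start`/`stop` coordinates (which is consistent with the 150 bp length obtained above for coordinates 4,242,269 through 4,242,418).

```python
# Sketch: reading a region from the reverse strand of a toy "genome".
# Assumes 1-based, inclusive start/stop coordinates on the forward strand.

COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def extract_region(genome, start, stop, strand='+'):
    """Return genome[start..stop] (1-based, inclusive); reverse-complement
    the result when strand is '-'."""
    seq = genome[start - 1:stop]
    if strand == '-':
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


toy_genome = 'AAACCCGGGTTT'
forward = extract_region(toy_genome, 2, 7, strand='+')
reverse = extract_region(toy_genome, 2, 7, strand='-')
print(forward)  # AACCCG
print(reverse)  # CGGGTT
```

Because the wild-type sequence is defined on the reverse strand, position 1 of the 150 bp region corresponds to the highest genomic coordinate and position 150 to the lowest.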

Next we load and concatenate read counts for each of the three libraries assayed by Belliveau et al. In the resulting dataframe, `raw_data_df`, each row corresponds to a different Illumina read. Column `'x'` lists only the sequence of the mutagenized promoter region, column `'bin'` lists the FACS bin (0, 1, 2, or 3) in which that read was found, and column `'lib'` indicates the library from which that read was derived.

In [4]:
# Load the read data for each library and concatenate into one dataframe
data_file_template = '../../mavenn/examples/datasets/raw/xylE_lib{}_raw.csv.gz'
raw_data_df = pd.DataFrame(columns=['x','bin','lib'])
for lib_num in [1,2,3]:

    # Read in dataset
    data_file = data_file_template.format(lib_num)
    print(f'Loading lib{lib_num} data from {data_file}...')
    lib_data_df = pd.read_csv(data_file,
                              compression='gzip',
                              index_col=[0])
    lib_data_df['lib'] = lib_num
    raw_data_df = pd.concat([raw_data_df, lib_data_df],
                            ignore_index=True)

raw_data_df

Loading lib1 data from ../../mavenn/examples/datasets/raw/xylE_lib1_raw.csv.gz...
Loading lib2 data from ../../mavenn/examples/datasets/raw/xylE_lib2_raw.csv.gz...
Loading lib3 data from ../../mavenn/examples/datasets/raw/xylE_lib3_raw.csv.gz...


| | x | bin | lib |
|---|---|---|---|
| 0 | TTAACTGTCTGACGATTCAAACATCAATACACTAATAAAAGATCAG... | 0 | 1 |
| 1 | TTAACTGCATGACAATTCCAACCTCAATGCATTGATAAAAGATCAG... | 0 | 1 |
| 2 | TTAACTGCCTGACAATTCCAAGATCAATGCAGTGATAAAGGATCAG... | 0 | 1 |
| 3 | TTTACTGCGTGTCAATTCGGGCAGCAGTACACTTATAAGAGATCAG... | 0 | 1 |
| 4 | TAAACTACCTGACAACTCCAACTTTAACGCACTGATTACAGTTCAG... | 0 | 1 |
| ... | ... | ... | ... |
| 6859154 | CGGCAATAGTATTGTTTTTATCAATATTGGATAATTAACACAATTA... | 3 | 3 |
| 6859155 | CGGCTATAGTATTGTTTTTATCAATTTTGGATAATTATCACAGTTA... | 3 | 3 |
| 6859156 | CGGCAATAGTATGGTTTTTATCAATTTTGGACAATTATCAAAATTC... | 3 | 3 |
| 6859157 | CAGCAATAGTATTGTTTTTATCAATTTTTGATAATTATCCCAATTA... | 3 | 3 |


Next we pad each variant sequence in `raw_data_df` with wild-type sequence as appropriate. We also renumber the bins so that bins 0-3 correspond to library 1, bins 4-7 correspond to library 2, and bins 8-11 correspond to library 3. 
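As a toy illustration of these two operations, the sketch below embeds short variant windows into a made-up 10 bp wild-type sequence and shifts per-library bin numbers into a single global range. All sequences, window coordinates, and the helper names `embed` and `global_bin` are hypothetical; only the offset rule (4 bins per library) mirrors the description above.

```python
# Toy illustration of padding and bin renumbering (all values are made up).
wt_seq = 'ACGTACGTAC'  # hypothetical 10 bp wild-type sequence

# Hypothetical mutagenized windows as Python (start, stop) slices
lib_coords = {1: (6, 10), 2: (3, 7), 3: (0, 4)}
BINS_PER_LIB = 4


def embed(variant, lib):
    """Place a variant window back into the wild-type sequence."""
    start, stop = lib_coords[lib]
    assert len(variant) == stop - start, 'variant must fill its window'
    return wt_seq[:start] + variant + wt_seq[stop:]


def global_bin(local_bin, lib):
    """Map a per-library bin (0-3) to a global bin (0-11)."""
    return local_bin + BINS_PER_LIB * (lib - 1)


reads = [('TTTT', 0, 1), ('GGGG', 3, 2), ('CCCC', 2, 3)]  # (x, bin, lib)
tidy = [(embed(x, lib), global_bin(b, lib)) for x, b, lib in reads]
for seq, b in tidy:
    print(seq, b)
# ACGTACTTTT 0
# ACGGGGGTAC 7
# CCCCACGTAC 10
```

Every padded sequence has the same length as the wild-type sequence, which is the property the notebook asserts for the real data.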

In [5]:
# List beginning and end of each variant region, in Python coordinates
lib_coords_dict = {
    1: [97, 150],
    2: [51, 105],
    3: [0, 60]
}

tidy_data_df = raw_data_df[['x','bin']].copy()
for lib_num, (start, stop) in lib_coords_dict.items():

    # Get indices for sequences in specific library
    ix = (raw_data_df['lib']==lib_num)

    # Embed the variant sequences within the larger WT sequence
    tidy_data_df.loc[ix, 'x'] = [
        wt_seq[:start] + x + wt_seq[stop:]
        for x in tidy_data_df.loc[ix,'x']
    ]

# Verify that all sequences are the correct length
assert np.all([len(x)==len(wt_seq) for x in tidy_data_df['x']])
print(f'Verified: all sequences are now of length {len(wt_seq)}.')

# Alter bin numbers to accommodate 12 bins
tidy_data_df['bin'] = raw_data_df['bin'] + 4*(raw_data_df['lib']-1)

tidy_data_df

Verified: all sequences are now of length 150.


| | x | bin |
|---|---|---|
| 0 | CGGCAATAGTATTGTTTTTATCAATTTTGGATAATTATCACAATTA... | 0 |
| 1 | CGGCAATAGTATTGTTTTTATCAATTTTGGATAATTATCACAATTA... | 0 |
| 2 | CGGCAATAGTATTGTTTTTATCAATTTTGGATAATTATCACAATTA... | 0 |
| 3 | CGGCAATAGTATTGTTTTTATCAATTTTGGATAATTATCACAATTA... | 0 |
| 4 | CGGCAATAGTATTGTTTTTATCAATTTTGGATAATTATCACAATTA... | 0 |
| ... | ... | ... |
| 6859154 | CGGCAATAGTATTGTTTTTATCAATATTGGATAATTAACACAATTA... | 11 |
| 6859155 | CGGCTATAGTATTGTTTTTATCAATTTTGGATAATTATCACAGTTA... | 11 |
| 6859156 | CGGCAATAGTATGGTTTTTATCAATTTTGGACAATTATCAAAATTC... | 11 |
| 6859157 | CAGCAATAGTATTGTTTTTATCAATTTTTGATAATTATCCCAATTA... | 11 |


Finally, we convert these data from the "tidy" format in `tidy_data_df` to a "wide" format in which each row corresponds to a unique sequence and the read counts for each bin are listed in separate columns. Each unique variant promoter is then assigned at random to the `training`, `test`, or `validation` set (with probabilities 0.6, 0.2, and 0.2, respectively). The final dataset can then be saved to file.

In [6]:
# Convert data in "tidy" format to "wide" format
ct_my, x_m = mavenn.src.utils.vec_data_to_mat_data(
    y_n=tidy_data_df['bin'],
    x_n=tidy_data_df['x']
)
M, Y = ct_my.shape
ct_cols = [f'ct_{y}' for y in range(Y)]
wide_data_df = pd.DataFrame()
wide_data_df[ct_cols] = ct_my
wide_data_df['x'] = x_m

# Assign sequences to training, validation, and test sets
np.random.seed(0)
sets = np.random.choice(a=['training','test','validation'],
                        p=[.6,.2,.2], 
                        size=len(wide_data_df))
wide_data_df.insert(loc=0, column='set', value=sets)

# # Save dataset (uncomment to execute)
# wide_data_df.to_csv('xylE_data.csv.gz',index=False, compression='gzip')

wide_data_df

| | set | ct_0 | ct_1 | ct_2 | ct_3 | ct_4 | ct_5 | ct_6 | ct_7 | ct_8 | ct_9 | ct_10 | ct_11 | x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | training | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | AAACAATAGCATTGTTCTTATCAATTTTGGATAAGTATTATAATTA... |
| 1 | test | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | AAACAATAGTATCGTTTTTAGCTCATTTGGATAATTATTACAATTA... |
| 2 | test | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | AAACAATAGTATTGCTTTTATCAATTTAGGATAATTATCACAATTA... |
| 3 | training | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | AAACAATAGTATTGCTTTTATCAATTTAGGATAATTATCACGATTA... |
| 4 | training | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | AAACAATAGTATTGCTTTTATCATTTTTGGAAAATTAACACGATTA... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1847223 | training | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | TTTCAATAGTCTTGTTTTTATCAATTTTGGATAATTATCACAATTT... |
| 1847224 | validation | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | TTTCACTAGAATTGTTTTTATCAATTGTGGATAATTTTCACAATTA... |
| 1847225 | training | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | TTTCACTAGAATTGTTTTTATCAATTGTGGATAATTTTCACAATTA... |
| 1847226 | validation | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | TTTCAGTAGTATGGTTTTTATCAATTTTGGATAATTATCACATTTA... |


This final dataframe, `wide_data_df`, has the same format as the `'xylE'` dataset that comes with MAVE-NN. 
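For readers who want to see the tidy-to-wide conversion and the random set assignment in isolation, the following is a minimal standard-library sketch on made-up reads. It is not the implementation of `mavenn.src.utils.vec_data_to_mat_data`, whose internals are not shown here; `collections.Counter` and `random.choices` are stand-ins chosen for illustration, and the toy sequences and variable names are hypothetical.

```python
import random
from collections import Counter

# Toy tidy data: one (sequence, bin) pair per read (values are made up)
tidy_reads = [('AAAA', 0), ('AAAA', 0), ('AAAA', 2),
              ('CCCC', 1), ('GGGG', 2), ('GGGG', 2)]
num_bins = 3

# Count reads for every (sequence, bin) combination
counts = Counter(tidy_reads)
sequences = sorted({seq for seq, _ in tidy_reads})

# One row per unique sequence, one count column per bin
wide_rows = [
    {'x': seq, **{f'ct_{b}': counts[(seq, b)] for b in range(num_bins)}}
    for seq in sequences
]

# Assign each unique sequence to a set with probabilities 0.6 / 0.2 / 0.2
rng = random.Random(0)  # fixed seed so the split is reproducible
set_names = rng.choices(['training', 'test', 'validation'],
                        weights=[0.6, 0.2, 0.2], k=len(wide_rows))
for row, set_name in zip(wide_rows, set_names):
    row['set'] = set_name
    print(row)
```

Note that the split is made per unique sequence rather than per read, so all reads of a given variant end up in the same set.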