# Formatting the sort-seq MPRA dataset of Kinney et al. (2010)

In [1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Special imports
import mavenn

The sort-seq MPRA dataset of Kinney et al. (2010) is included with MAVE-NN and can be loaded as a dataframe by calling `mavenn.load_example_dataset('sortseq')`. This dataframe, which is formatted as follows, can then be used as input to `mavenn.Model.set_data()`.

In [2]:
# Load built-in sort-seq MPRA dataset
builtin_df = mavenn.load_example_dataset('sortseq')
builtin_df.head()

Unnamed: 0,set,ct_0,ct_1,ct_2,ct_3,ct_4,ct_5,ct_6,ct_7,ct_8,ct_9,x
0,test,0,0,0,0,0,0,0,0,1,0,GGCTTTACACTTTAAGCTGCCGCATCGTATGTTATGTGG
1,training,0,1,0,0,0,0,0,0,0,0,GGCTATACATTTTATGTTTCCGGGTCGTATTTTGTGTGG
2,training,0,0,0,0,0,0,0,0,1,0,GGCTTTACATTTTATGCTTCCTTCACGTATGTTGTGTCT
3,test,0,0,0,0,0,1,0,0,0,0,GGCATTACTCTTTGTGCTTCCGGCTCGTATGTTGTGTGG
4,test,0,0,0,0,0,0,0,1,0,0,GACTTTTCAATTTATGCTTTCAGTTGGTATGTTGTGTAG


Here, the `'x'` column lists variant 75 nt DNA sequences, the `'ct_0'` through `'ct_9'` columns list the read counts observed for each sequence in each of the 10 FACS bins, and the `'set'` column indicates whether each sequence is assigned to the training set, validation set, or test set.
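
As a quick sanity check on this layout, the sketch below builds a small toy dataframe with the same kinds of columns and verifies that the counts, `'set'` labels, and sequence lengths are consistent. The sequences and counts are made up for illustration, only three bins are used to keep the example short, and `check_sortseq_format` is a hypothetical helper, not part of MAVE-NN.

```python
import pandas as pd

# Toy dataframe in the same layout; values are illustrative only
toy_df = pd.DataFrame({
    'set': ['training', 'test', 'validation'],
    'ct_0': [0, 1, 0],
    'ct_1': [2, 0, 0],
    'ct_2': [0, 0, 3],
    'x': ['ACGT', 'ACGA', 'TCGT'],
})

def check_sortseq_format(df):
    """Hypothetical helper: return the ct_* columns if df looks well formed."""
    ct_cols = [c for c in df.columns if c.startswith('ct_')]
    if not ct_cols:
        raise ValueError('no ct_* columns found')
    if (df[ct_cols] < 0).any().any():
        raise ValueError('counts must be non-negative')
    if not df['set'].isin(['training', 'validation', 'test']).all():
        raise ValueError('unexpected label in the set column')
    if df['x'].str.len().nunique() != 1:
        raise ValueError('sequences differ in length')
    return ct_cols

ct_cols = check_sortseq_format(toy_df)

# Total number of reads per sequence, summed across bins
total_ct = toy_df[ct_cols].sum(axis=1)
```

The same helper could be applied to `builtin_df` above, assuming its columns follow the naming shown in the preview.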

This format is quite different from the format of the raw data provided by Kinney et al. (2010), which is available in file `file_S2.txt.gz` at https://github.com/jbkinney/09_sortseq/:

In [7]:
# Load raw dataset
raw_df = pd.read_csv('file_S2.txt.gz', 
                     sep='\t',
                     header=None, 
                     names=['experiment','bin','seq'], 
                     compression='gzip')

# Preview raw_df
raw_df.head()

Unnamed: 0,experiment,bin,seq
0,crp-wt,B0,AATTAAGGGCAGTTAACTCACCCATTAGGCACCCCAGGCTTTACAC...
1,crp-wt,B0,AATTAATATGAGTTTGCTCACCCATTAGGCACCCCAGGCTTTACAC...
2,crp-wt,B0,AATTAATAAGAGTTCACTCACTCATACGGCACCCCAGGCTTTACAC...
3,crp-wt,B0,AATTTATGTGCTTTACCTCACTGATTTGGCACCCCAGGCTTTACAC...
4,crp-wt,B0,AATTAAGGTGAGTTCGCTCGCTCATGAGGCACCCCAGGCTTTACAC...


To reformat this dataset into the one provided with MAVE-NN, we first trim the dataframe to keep only rows corresponding to the `'full-wt'` experiment. We then rename each FACS bin `'BX'` to `'ct_X'` for X = 0, 1, ..., 9, and create a `'ct'` column filled with ones.

In [8]:
# Keep only data from the full-wt experiment
ix = raw_df['experiment']=='full-wt'
sub_df = raw_df[ix].copy().reset_index(drop=True)[['bin','seq']]

# Rename bins BX -> ct_X, where X = 0, 1, ..., 9
sub_df['bin'] = [f'ct_{s[1:]}' for s in sub_df['bin']]

# Add counts column
sub_df['ct'] = 1

# Preview sub_df
sub_df.head()

Unnamed: 0,bin,seq,ct
0,ct_0,AATTGAACTGTGTTTGCTCTCTCATTATGTACCACAGGCTATACAC...,1
1,ct_0,AATTAATGAGAGGTGGTTCACTAATAAGGCACCGCAGGCTTTACAC...,1
2,ct_0,TATTAATTAGAGTTAGCTCAATCATTACGCACAGCATACTTTCCAC...,1
3,ct_0,AATTAATGTGAGTTCGCTCACTCATTAGGCACCCCAGGCTTTCCAC...,1
4,ct_0,AATTAATGTGGGTTTGCTTACTCATTAGGCACCCCAGGCTTTACAC...,1


Next, we use the `pivot()` and `groupby()` functions in Pandas to obtain a dataframe in which the `'seq'` column lists only unique sequences, each of the 10 possible values in the original `'bin'` column labels a separate column, and the values in these new columns report the number of times each sequence was observed in each FACS bin.

In [5]:
# Pivot dataframe
pivot_df = sub_df.pivot(index='seq', values='ct', columns='bin').fillna(0).astype(int)
pivot_df.columns.name = None

# Group by sequence (the index is already unique after pivot(), so counts are unchanged)
pivot_df = pivot_df.groupby('seq').sum()

# Reindex dataframe
pivot_df = pivot_df.reset_index()

# Preview pivot_df
pivot_df.head()

Unnamed: 0,seq,ct_0,ct_1,ct_2,ct_3,ct_4,ct_5,ct_6,ct_7,ct_8,ct_9
0,AAAAAAAGTGAGTTAGCCAACTAATTAGGCACCGTACGCTTTATAG...,0,1,0,0,0,0,0,0,0,0
1,AAAAAATCTGAGTTAGCTTACTCATTAGGCACCCCAGGCTTGACAC...,0,0,0,0,0,0,0,0,1,0
2,AAAAAATCTGAGTTTGCTCACTCTATCGGCACCCCAGTCTTTACAC...,0,0,0,0,0,0,1,0,0,0
3,AAAAAATGAGAGTTAGTTCACTCATTCGGCACCACAGGCTTTACAA...,0,0,0,0,0,0,0,0,0,1
4,AAAAAATGGGTGTTAGCTCTATCATTAGGCACCCCCGGCTTTACAC...,0,0,0,0,0,0,0,0,0,1
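
Note that `pivot()` raises a `ValueError` if any (sequence, bin) pair appears in more than one row. The sketch below shows one alternative that tolerates such duplicates, using `pivot_table()` with `aggfunc='sum'`. It runs on made-up toy data with three bins and is an illustration under those assumptions, not the procedure used to build the MAVE-NN dataset.

```python
import pandas as pd

# Toy long-format data; the first two rows share the same (seq, bin) pair
toy_sub_df = pd.DataFrame({
    'bin': ['ct_0', 'ct_0', 'ct_1', 'ct_2'],
    'seq': ['AAAA', 'AAAA', 'AAAA', 'CCCC'],
    'ct':  [1, 1, 1, 1],
})

# pivot() cannot reshape duplicated (seq, bin) pairs
try:
    toy_sub_df.pivot(index='seq', values='ct', columns='bin')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() sums the counts of duplicated pairs instead
toy_pivot_df = toy_sub_df.pivot_table(index='seq', columns='bin', values='ct',
                                      aggfunc='sum', fill_value=0)

# Ensure every bin column is present, even if no read landed in it
all_bins = [f'ct_{i}' for i in range(3)]
toy_pivot_df = toy_pivot_df.reindex(columns=all_bins, fill_value=0).astype(int)
toy_pivot_df.columns.name = None
toy_pivot_df = toy_pivot_df.reset_index()
```

Here the sequence `'AAAA'` ends up with a count of 2 in `'ct_0'`, because its two duplicated rows are summed rather than rejected.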


Finally, we create a `'set'` column that randomly assigns each sequence to the training, test, or validation set (using a 60:20:20 split), then reorder the columns for clarity. The resulting dataframe has the same layout as the one provided with MAVE-NN, apart from the sequence column being named `'seq'` rather than `'x'`.

In [6]:
# Randomly assign sequences to training, validation, and test sets
final_df = pivot_df.copy()
final_df['set'] = np.random.choice(a=['training','test','validation'], p=[.6,.2,.2], size=len(final_df))

# Rearrange columns
new_cols = ['set'] + list(final_df.columns[1:-1]) + ['seq']
final_df = final_df[new_cols]

# Preview final_df
final_df.head()

Unnamed: 0,set,ct_0,ct_1,ct_2,ct_3,ct_4,ct_5,ct_6,ct_7,ct_8,ct_9,seq
0,training,0,1,0,0,0,0,0,0,0,0,AAAAAAAGTGAGTTAGCCAACTAATTAGGCACCGTACGCTTTATAG...
1,test,0,0,0,0,0,0,0,0,1,0,AAAAAATCTGAGTTAGCTTACTCATTAGGCACCCCAGGCTTGACAC...
2,validation,0,0,0,0,0,0,1,0,0,0,AAAAAATCTGAGTTTGCTCACTCTATCGGCACCCCAGTCTTTACAC...
3,training,0,0,0,0,0,0,0,0,0,1,AAAAAATGAGAGTTAGTTCACTCATTCGGCACCACAGGCTTTACAA...
4,validation,0,0,0,0,0,0,0,0,0,1,AAAAAATGGGTGTTAGCTCTATCATTAGGCACCCCCGGCTTTACAC...
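
Because `np.random.choice` is called without a seed, the `'set'` labels change on every run. The following sketch uses a seeded NumPy `Generator` to make the split reproducible, and renames `'seq'` to `'x'` so that the column names match the built-in dataframe. The toy data, the seed value, and the helper name `assign_sets` are arbitrary choices made here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pivot_df; placeholder sequence names and counts
n = 10000
toy_df = pd.DataFrame({'seq': [f'SEQ{i}' for i in range(n)],
                       'ct_0': np.ones(n, dtype=int)})

def assign_sets(df, seed):
    """Hypothetical helper: copy df and add a reproducible 60:20:20 'set' column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out['set'] = rng.choice(['training', 'test', 'validation'],
                            p=[0.6, 0.2, 0.2], size=len(out))
    return out

split_df = assign_sets(toy_df, seed=0)

# Rename the sequence column and move it to the end
split_df = split_df.rename(columns={'seq': 'x'})
ct_cols = [c for c in split_df.columns if c.startswith('ct_')]
split_df = split_df[['set'] + ct_cols + ['x']]

# Fraction of sequences assigned to each set
set_fractions = split_df['set'].value_counts(normalize=True)
```

Calling `assign_sets` again with the same seed returns the same labels, and `set_fractions` gives a quick check that the realized split is close to 60:20:20.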
