# Example datasets

In [1]:
# Insert local mavenn at beginning of path
import sys
path_to_mavenn_local = '../'
sys.path.insert(0,path_to_mavenn_local)
import mavenn
print(f'Path to mavenn: {mavenn.__path__[0]}')

Path to mavenn: ../mavenn


## Overview

MAVE-NN comes with multiple built-in datasets for use in training or evaluating models. These datasets can be accessed by passing a valid dataset name to ``mavenn.load_example_dataset()``. To get a list of valid dataset names, call this function without any arguments:

In [2]:
mavenn.load_example_dataset()

Please enter a dataset name. Valid choices are:
"ace2rbd"
"gb1"
"mpsa"
"mpsa_replicate"
"sortseq"


Datasets are returned in the form of ``pandas`` dataframes. Common fields include:

- ``'x'``: Assayed sequences, all of which are the same length.
- ``'y'``: Values of continuous measurements (used to train GE models).
- ``'ct_y'``: Read counts observed in bin number ``y``, where ``y`` is an integer ranging from ``0`` to ``Y-1`` (used to train MPA models).
- ``'set'``: Indicates whether each observation was reserved for the ``'training'``, ``'validation'``, or ``'test'`` set when the example models provided with MAVE-NN were trained. 

Other fields are sometimes provided as well, e.g. the raw input and output counts used to compute measurement values. 
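
Because the ``'set'`` field labels each observation, a dataset can be partitioned the same way it was when the example models were trained. The following is a minimal sketch that uses only the Python standard library and a few rows copied from the ``'mpsa'`` preview shown later on this page. The helper name ``split_by_set`` is hypothetical and is not part of MAVE-NN. With the actual dataframe, the same selection is usually written with boolean indexing, e.g. ``data_df[data_df['set'] == 'training']``.

```python
from collections import defaultdict

# A few rows copied from the 'mpsa' preview on this page.
rows = [
    {'set': 'training',   'tot_ct': 28,  'ex_ct': 2,  'y': -3.273018, 'x': 'GGAGUGAUG'},
    {'set': 'training',   'tot_ct': 315, 'ex_ct': 7,  'y': -5.303781, 'x': 'AGUGUGCAA'},
    {'set': 'test',       'tot_ct': 193, 'ex_ct': 15, 'y': -3.599913, 'x': 'UUCGCGCCA'},
    {'set': 'validation', 'tot_ct': 27,  'ex_ct': 0,  'y': -4.807355, 'x': 'UAAGCUUUU'},
    {'set': 'training',   'tot_ct': 130, 'ex_ct': 2,  'y': -5.448461, 'x': 'AUGGUCGGG'},
]


def split_by_set(rows):
    """Group observations by the value of their 'set' field (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row['set']].append(row)
    return dict(groups)


splits = split_by_set(rows)
for name in ('training', 'validation', 'test'):
    # Report how many of the sample rows fall into each split.
    print(name, len(splits.get(name, [])))
```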

## Protein GB1

The deep mutational scanning (DMS) dataset of Olson et al., 2014. The authors used an RNA-display-based selection experiment to assay the binding of protein GB1 variants to IgG beads. These variants included all 1-point and 2-point mutations within the 55-residue GB1 sequence. Only 2-point variants are included in this dataset.

**Name**: ``'gb1'``

**Reference**: Olson C, Wu N, Sun R. (2014). A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. [Curr Biol. 24(22):2643-2651.](https://pubmed.ncbi.nlm.nih.gov/25455030/)

In [3]:
mavenn.load_example_dataset('gb1')

Unnamed: 0,set,input_ct,selected_ct,y,x
0,training,73.0,62.0,-1.021847,QYKLILNGKTLKGETTTEAHDAATAEKVFKQYANDNGVDGEWTYDD...
1,training,122.0,0.0,-7.732188,QYKLILNGKTLKGETTTEAVDAATAEKVFPQYANDNGVDGEWTYDD...
2,training,794.0,598.0,-1.198072,QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANKNGVDGEWTLDD...
3,training,1115.0,595.0,-1.694626,QYKLILNIKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDS...
4,validation,97.0,2.0,-5.819421,QYKLINNGKTLKGETTTEAVDAATAEKVFKIYANDNGVDGEWTYDD...
...,...,...,...,...,...
530732,test,3955.0,11.0,-9.154538,QYKLILNGKTAKGETTTEAVDAATAEKVFKQYAADNGVDGEWTYDD...
530733,validation,191.0,3.0,-6.374636,QYKLILNGKTLKGETTTEAVDAATAEKVFKQYYNQNGVDGEWTYDD...
530734,training,1308.0,9.0,-7.821995,QYKLILNGKTLKGETTTEAIDAATAEKVFAQYANDNGVDGEWTYDD...
530735,test,798.0,1146.0,-0.268075,QYKLILNGITLKGERTTEAVDAATAEKVFKQYANDNGVDGEWTYDD...


## Massively parallel splicing assay

The massively parallel splicing assay (MPSA) dataset of Wong et al., 2018. The authors used 3-exon minigenes to assay how inclusion of the middle exon varies with the sequence of that exon's 5' splice site. Nearly all 5' splice site variants of the form NNN/GYNNNN were measured, where the slash demarcates the exon/intron boundary. The authors performed experiments on multiple replicates of multiple libraries in three different gene contexts: *IKBKAP* exons 19-21, *SMN1* exons 6-8, and *BRCA2* exons 17-19. The dataset provided here is from library 1 replicate 1 in the *BRCA2* context. ``'tot_ct'`` reports the number of reads obtained for each splice site from total processed mRNA transcripts, while ``'ex_ct'`` reports the number of reads obtained from processed mRNA transcripts containing the central exon.
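
The pattern NNN/GYNNNN uses IUPAC nucleotide codes: N stands for any nucleotide and Y for a pyrimidine (C or U in RNA). As a rough illustration, the sketch below checks 9-nucleotide sequences against this pattern with a regular expression. It assumes that the ``'x'`` sequences are written in the RNA alphabet without the slash, as in the preview of this dataset; the names ``SPLICE_SITE`` and ``is_valid_5ss`` are hypothetical and are not part of MAVE-NN.

```python
import re

# NNN/GYNNNN without the slash: 3 exonic nucleotides, then G, then C or U,
# then 4 more intronic nucleotides.
SPLICE_SITE = re.compile(r'[ACGU]{3}G[CU][ACGU]{4}')


def is_valid_5ss(seq):
    """Return True if seq is a 9-nt RNA sequence of the form NNN/GYNNNN."""
    return SPLICE_SITE.fullmatch(seq) is not None


# Sequences taken from the preview of the 'mpsa' dataset.
examples = ['GGAGUGAUG', 'AGUGUGCAA', 'UUCGCGCCA', 'UAAGCUUUU']
for seq in examples:
    # Print each site with the exon/intron boundary marked by a slash.
    print(f'{seq[:3]}/{seq[3:]}', is_valid_5ss(seq))
```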

**Name**: ``'mpsa'``

**Reference**: Wong MS, Kinney JB, Krainer AR. Quantitative Activity Profile and Context Dependence of All Human 5' Splice Sites. [Mol Cell. 2018;71(6):1012-1026.e3.](https://doi.org/10.1016/j.molcel.2018.07.033)

In [4]:
mavenn.load_example_dataset('mpsa')

Unnamed: 0,set,tot_ct,ex_ct,y,x
0,training,28,2,-3.273018,GGAGUGAUG
1,training,315,7,-5.303781,AGUGUGCAA
2,test,193,15,-3.599913,UUCGCGCCA
3,validation,27,0,-4.807355,UAAGCUUUU
4,training,130,2,-5.448461,AUGGUCGGG
...,...,...,...,...,...
30478,validation,190,17,-3.407504,CUGGUUGCA
30479,training,154,10,-3.816693,CGCGCACAA
30480,training,407,16,-4.584963,ACUGCUCAC
30481,test,265,6,-5.247928,AUAGUCUAA
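
The ``'y'`` values in the preview above are consistent with a log2 ratio of exon-inclusion reads to total reads, with a pseudocount of 1 added to each count. This relationship is inferred here from the displayed rows rather than taken from the MAVE-NN documentation, so treat it as an assumption. The function name ``log2_inclusion_ratio`` is hypothetical.

```python
import math


def log2_inclusion_ratio(tot_ct, ex_ct, pseudocount=1):
    """Assumed formula: log2((ex_ct + pc) / (tot_ct + pc)); inferred from the preview."""
    return math.log2((ex_ct + pseudocount) / (tot_ct + pseudocount))


# (tot_ct, ex_ct, y) triples copied from the preview above.
preview = [
    (28, 2, -3.273018),
    (315, 7, -5.303781),
    (193, 15, -3.599913),
    (27, 0, -4.807355),
    (130, 2, -5.448461),
]

for tot_ct, ex_ct, y in preview:
    y_check = log2_inclusion_ratio(tot_ct, ex_ct)
    # Compare the recomputed value with the tabulated one.
    print(f'{tot_ct:4d} {ex_ct:3d} table={y:.6f} recomputed={y_check:.6f}')
```

The pseudocount keeps the value finite for splice sites with ``ex_ct == 0``, such as the fourth row.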


## Sort-Seq MPRA

The sort-seq MPRA of Kinney et al., 2010. The authors used fluorescence-activated cell sorting, followed by deep sequencing, to assay gene expression levels from variant *lac* promoters in *E. coli*. The authors performed 6 different experiments, which varied in the region of the *lac* promoter that was mutagenized, the mutation rate used, the *E. coli* host strain, cellular growth conditions, and the number of bins into which cells were sorted. The dataset provided here is from the "full-wt" experiment, in which a 75 bp region of this promoter was mutagenized at ~12% per nucleotide; the sequences provided include only the 39 bp surrounding the binding site of RNA polymerase (RNAP). Cells were sorted into 9 bins (numbered 1-9) and sequenced along with the input library (bin 0).

**Name**: ``'sortseq'``

**Reference**: Kinney J, Murugan A, Callan C, Cox E (2010). Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. [Proc Natl Acad Sci USA. 107(20):9158-9163](https://dx.doi.org/10.1073/pnas.1004290107)

In [5]:
mavenn.load_example_dataset('sortseq')

Unnamed: 0,set,ct_0,ct_1,ct_2,ct_3,ct_4,ct_5,ct_6,ct_7,ct_8,ct_9,x
0,test,0,0,0,0,0,0,0,0,1,0,GGCTTTACACTTTAAGCTGCCGCATCGTATGTTATGTGG
1,training,0,1,0,0,0,0,0,0,0,0,GGCTATACATTTTATGTTTCCGGGTCGTATTTTGTGTGG
2,training,0,0,0,0,0,0,0,0,1,0,GGCTTTACATTTTATGCTTCCTTCACGTATGTTGTGTCT
3,test,0,0,0,0,0,1,0,0,0,0,GGCATTACTCTTTGTGCTTCCGGCTCGTATGTTGTGTGG
4,test,0,0,0,0,0,0,0,1,0,0,GACTTTTCAATTTATGCTTTCAGTTGGTATGTTGTGTAG
...,...,...,...,...,...,...,...,...,...,...,...,...
45773,training,0,0,0,1,0,0,0,0,0,0,GGCTTTTCACTTTATGCTTCTGGCTCGTATGTTGTGTGG
45774,validation,2,0,0,0,0,0,0,0,0,0,GGTTTTACACTTTTTGCTTCCGGGCCAAATGTTGTGTGG
45775,training,0,0,1,0,0,0,0,0,0,0,GGCTCCACACATTATGCTTCCGGCTCGTCTGTTCGCTCG
45776,training,2,0,0,0,0,0,0,0,0,0,GGCTTTACACATTATGCTTCCGGCTCGTATGTTGTTTGG
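
Each row of this dataset stores one read count per bin in the columns ``'ct_0'`` through ``'ct_9'``. The sketch below shows one way to pull the non-empty bins out of such a row, using two rows copied from the preview above and only the standard library. The helpers ``make_row`` and ``nonzero_bins`` are hypothetical and are not part of MAVE-NN.

```python
N_BINS = 10  # bin 0 (input library) plus sorted bins 1-9


def make_row(counts, x):
    """Build a dict shaped like one dataset row from a list of per-bin counts."""
    row = {f'ct_{y}': ct for y, ct in enumerate(counts)}
    row['x'] = x
    return row


def nonzero_bins(row):
    """Return {bin number: read count} for every bin with at least one read."""
    return {y: row[f'ct_{y}'] for y in range(N_BINS) if row[f'ct_{y}'] > 0}


# Rows 0 and 45774 from the preview above.
rows = [
    make_row([0, 0, 0, 0, 0, 0, 0, 0, 1, 0], 'GGCTTTACACTTTAAGCTGCCGCATCGTATGTTATGTGG'),
    make_row([2, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'GGTTTTACACTTTTTGCTTCCGGGCCAAATGTTGTGTGG'),
]

for row in rows:
    # Show the sequence length and which bins the sequence was observed in.
    print(len(row['x']), nonzero_bins(row))
```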
