# Datasets

## Overview

MAVE-NN comes with multiple built-in datasets for use in training or evaluating models. These datasets can be accessed by passing a valid dataset name to ``mavenn.load_example_dataset()``. To get a list of valid dataset names, call this function without any arguments:

In [1]:
import mavenn
mavenn.load_example_dataset()

Please enter a dataset name. Valid choices are:
"gb1"
"mpsa"
"sortseq"


Datasets are returned in the form of ``pandas`` dataframes. Common fields include:

- ``'x'``: Assayed sequences, all of which are the same length.
- ``'y'``: Values of continuous measurements (used to train GE models).
- ``'ct_y'``: Read counts observed in bin number ``y``, where ``y`` is an integer ranging from ``0`` to ``Y-1`` and ``Y`` is the number of bins (used to train MPA models).
- ``'training_set'``: Boolean flags indicating whether each observation was reserved for training or testing when inferring the corresponding example models provided with MAVE-NN.

Other fields are sometimes provided as well, e.g. the raw input and output counts used to compute measurement values. 
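
As a minimal sketch of how the ``'training_set'`` flag might be used, the snippet below splits a dataframe into training and test subsets with ordinary ``pandas`` boolean indexing. To stay self-contained it builds a four-row stand-in from the last rows of the ``'sortseq'`` table shown later on this page, keeping only two columns; with MAVE-NN installed, the dataframe returned by ``mavenn.load_example_dataset()`` would take its place. The variable names are illustrative.

```python
import pandas as pd

# Stand-in for a MAVE-NN example dataset: four rows copied from the
# 'sortseq' table on this page, keeping only 'training_set' and 'x'.
data_df = pd.DataFrame({
    'training_set': [True, False, False, False],
    'x': [
        'TTTATTACACTTTTTGCTTCCGACTCGTATGTTGAGTGG',
        'TTTTGTAGACTTTATGCTTCTGGATCCTATGGTGTGTGG',
        'TTTTTGACACTTTATGCTTCCGGCTCGTATACTGTGAGG',
        'TTTTTTACACTTTCTGCTTCCTGCTGGTAGGTTGCGTGC',
    ],
})

# Boolean indexing on the 'training_set' flag.
train_df = data_df[data_df['training_set']]
test_df = data_df[~data_df['training_set']]

# All assayed sequences are expected to share a single length.
seq_lengths = {len(seq) for seq in data_df['x']}

print(len(train_df), len(test_df), seq_lengths)
```

Checking that ``seq_lengths`` contains a single value is a cheap way to confirm the equal-length property of ``'x'`` before training.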

## Protein GB1

The deep mutational scanning (DMS) dataset of Olson et al., 2014. The authors used an RNA-display-based selection experiment to assay the binding of protein GB1 variants to IgG beads. These variants included all 1-point and 2-point mutations within the 55-residue GB1 sequence. Only 2-point variants are included in this dataset.

**Name:** ``'gb1'``

**Reference**: Olson C, Wu N, Sun R. (2014). A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. [Curr Biol. 24(22):2643-2651.](https://pubmed.ncbi.nlm.nih.gov/25455030/)

In [2]:
mavenn.load_example_dataset('gb1')

|  | input_ct | selected_ct | hamming_dist | training_set | y | x |
|---|---|---|---|---|---|---|
| 0 | 173.0 | 33.0 | 2 | True | -3.145154 | AAKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... |
| 1 | 18.0 | 8.0 | 2 | False | -1.867676 | ACKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... |
| 2 | 66.0 | 2.0 | 2 | True | -5.270800 | ADKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... |
| 3 | 72.0 | 1.0 | 2 | False | -5.979498 | AEKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... |
| 4 | 69.0 | 168.0 | 2 | True | 0.481923 | AFKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... |
| ... | ... | ... | ... | ... | ... | ... |
| 530732 | 462.0 | 139.0 | 2 | False | -2.515259 | QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... |
| 530733 | 317.0 | 84.0 | 2 | True | -2.693165 | QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... |
| 530734 | 335.0 | 77.0 | 2 | True | -2.896589 | QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... |
| 530735 | 148.0 | 28.0 | 2 | True | -3.150861 | QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDD... |
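
To put the table size in context, the arithmetic below counts the possible 2-point variants of a 55-residue sequence. It assumes 20 standard amino acids, so that each mutated position has 19 alternatives, and it reads the row count off the last index shown above, assuming a zero-based index with no gaps. Both are assumptions of this sketch, not statements from the original paper.

```python
from math import comb

SEQ_LENGTH = 55      # residues in the GB1 sequence, as stated above
ALTERNATIVES = 19    # assumption: 20 standard amino acids minus the original one

# Choose 2 of the 55 positions, then one of 19 substitutions at each.
n_possible = comb(SEQ_LENGTH, 2) * ALTERNATIVES ** 2

# Assumption: zero-based, gap-free index, so last index + 1 = number of rows.
n_rows = 530735 + 1

print(n_possible)  # 536085
print(f'{n_rows / n_possible:.3f} of possible 2-point variants present')
```

Under these assumptions the dataset covers most, but not all, of the possible 2-point variants.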


## Massively parallel splicing assay

The massively parallel splicing assay (MPSA) dataset of Wong et al., 2018. The authors used 3-exon minigenes to assay how inclusion of the middle exon varies with the sequence of that exon's 5' splice site. Nearly all 5' splice site variants of the form NNN/GYNNNN were measured, where the slash demarcates the exon/intron boundary. The authors performed experiments on multiple replicates of multiple libraries in three different gene contexts: *IKBKAP* exons 19-21, *SMN1* exons 6-8, and *BRCA2* exons 17-19. The dataset provided here is from library 1 replicate 1 in the *BRCA2* context. ``'tot_ct'`` reports the number of reads obtained for each splice site from total processed mRNA transcripts, while ``'ex_ct'`` reports the number of reads obtained from processed mRNA transcripts containing the central exon.
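
The pattern NNN/GYNNNN can be written as a regular expression, which gives a quick sanity check on the ``'x'`` column. The sketch below assumes the standard IUPAC reading of N (any nucleotide) and Y (a pyrimidine, here C or U because the sequences are RNA), and omits the slash because the stored sequences do not include it. The sample sequences are copied from the table in this section.

```python
import re

# NNN/GYNNNN without the slash: 3 free positions, G, C-or-U, 4 free positions.
SPLICE_SITE = re.compile(r'[ACGU]{3}G[CU][ACGU]{4}')

sample = ['ACGGUCCAU', 'AUUGCCAGG', 'ACAGCGGUA', 'UAGGCAACG', 'CCCGUGUUC']
all_match = all(SPLICE_SITE.fullmatch(seq) is not None for seq in sample)

# Number of distinct sequences the pattern allows: 4^3 * 1 * 2 * 4^4.
n_possible = 4 ** 3 * 2 * 4 ** 4

print(all_match, n_possible)  # True 32768
```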

**Name**: ``'mpsa'``

**Reference**: Wong MS, Kinney JB, Krainer AR. Quantitative Activity Profile and Context Dependence of All Human 5' Splice Sites. [Mol Cell. 2018;71(6):1012-1026.e3.](https://doi.org/10.1016/j.molcel.2018.07.033)

In [3]:
mavenn.load_example_dataset('mpsa')

|  | training_set | tot_ct | ex_ct | y | x |
|---|---|---|---|---|---|
| 0 | True | 1588 | 66 | -4.567814 | ACGGUCCAU |
| 1 | True | 1533 | 118 | -3.688265 | AUUGCCAGG |
| 2 | True | 1459 | 399 | -1.867896 | ACAGCGGUA |
| 3 | True | 1414 | 246 | -2.518219 | AACGCCAGG |
| 4 | True | 1412 | 60 | -4.533808 | ACGGCUUGG |
| ... | ... | ... | ... | ... | ... |
| 30485 | True | 10 | 0 | -3.459432 | UAGGCAACG |
| 30486 | True | 10 | 1 | -2.459432 | GCUGCAAUG |
| 30487 | True | 10 | 0 | -3.459432 | CCCGUGUUC |
| 30488 | True | 10 | 2 | -1.874469 | UAGGCGGCG |
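
The ``'y'`` values in the rows above are numerically consistent with a base-2 log ratio of ``'ex_ct'`` to ``'tot_ct'`` after adding a pseudocount of 1 to each. This is an observation about the displayed rows, not a documented definition, so the formula in this sketch should be treated as an assumption; the helper name ``log2_inclusion_ratio`` is hypothetical.

```python
from math import log2


def log2_inclusion_ratio(ex_ct, tot_ct, pseudocount=1):
    """Assumed formula: log2 of (ex_ct + pseudocount) / (tot_ct + pseudocount)."""
    return log2((ex_ct + pseudocount) / (tot_ct + pseudocount))


# (tot_ct, ex_ct, y) triples copied from the table above.
rows = [
    (1588, 66, -4.567814),
    (10, 0, -3.459432),
    (10, 1, -2.459432),
    (10, 2, -1.874469),
]

# Largest gap between the assumed formula and the tabulated 'y' values.
max_abs_diff = max(
    abs(log2_inclusion_ratio(ex_ct, tot_ct) - y) for tot_ct, ex_ct, y in rows
)

print(f'max |difference| = {max_abs_diff:.1e}')
```

Agreement on a few rows does not establish how ``'y'`` was computed for the full dataset; the original publication remains the authority on the measurement definition.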


## Sort-Seq MPRA

The sort-seq MPRA of Kinney et al., 2010. The authors used fluorescence-activated cell sorting, followed by deep sequencing, to assay gene expression levels from variant *lac* promoters in *E. coli*. The authors performed 6 different experiments, which varied in the region of the *lac* promoter that was mutagenized, the mutation rate used, the *E. coli* host strain, cellular growth conditions, and the number of bins into which cells were sorted. The dataset provided here is from the "rnap-wt" experiment, in which a 39 bp region containing the RNA polymerase (RNAP) binding site was mutagenized at ~15% per nucleotide. Cells were sorted into 9 bins (numbered 1-9), and sequenced along with the input library (bin 0).
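
For a sense of scale, the stated region length and mutation rate imply roughly how many mutations each variant carries. The sketch below assumes that each of the 39 positions mutates independently with probability exactly 0.15, which simplifies the approximate rate quoted above; the actual library statistics may differ.

```python
REGION_LENGTH = 39     # bp, as stated above
MUTATION_RATE = 0.15   # assumption: the approximate "~15%" taken as exact

# Under independent per-position mutation, the mutation count is binomial.
expected_mutations = REGION_LENGTH * MUTATION_RATE
prob_unmutated = (1 - MUTATION_RATE) ** REGION_LENGTH

print(f'{expected_mutations:.2f} expected mutations per sequence')
print(f'{prob_unmutated:.4f} probability of an unmutated sequence')
```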

**Name**: ``'sortseq'``

**Reference**: Kinney J, Murugan A, Callan C, Cox E (2010). Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. [Proc Natl Acad Sci USA. 107(20):9158-9163](https://dx.doi.org/10.1073/pnas.1004290107)

In [4]:
mavenn.load_example_dataset('sortseq')

|  | training_set | ct_0 | ct_1 | ct_2 | ct_3 | ct_4 | ct_5 | ct_6 | ct_7 | ct_8 | ct_9 | x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | AAATACACACTTGCTGCTTCCGGCTCGTATGTTGTGTGG |
| 1 | True | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | AAATTTACACTGTATGCTTCCGGCTCGCATGGCGTTTGC |
| 2 | True | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | AAATTTACACTTTATGCATCAGACTCGTATGTTGTGTGG |
| 3 | True | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | AAATTTACACTTTATGCTTCTGGCGCGTATGCGGCGTGG |
| 4 | True | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | AACATTACATTTTATGCTTCCGGCTCGTATGGTGTGTGG |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45773 | True | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | TTTATTACACTTTTTGCTTCCGACTCGTATGTTGAGTGG |
| 45774 | False | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | TTTTGTAGACTTTATGCTTCTGGATCCTATGGTGTGTGG |
| 45775 | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | TTTTTGACACTTTATGCTTCCGGCTCGTATACTGTGAGG |
| 45776 | False | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | TTTTTTACACTTTCTGCTTCCTGCTGGTAGGTTGCGTGC |
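
The count columns follow the ``'ct_y'`` naming described in the overview, so the number of bins, and the bins in which each sequence was observed, can be recovered from the column names. The sketch below parses three rows copied from the table above using only the standard library; with ``pandas`` the same columns could be selected by name. Variable names are illustrative.

```python
import csv
import io

# Header and three rows copied from the table above (index column omitted).
TABLE = """training_set,ct_0,ct_1,ct_2,ct_3,ct_4,ct_5,ct_6,ct_7,ct_8,ct_9,x
True,0,0,0,1,0,0,0,0,0,0,AAATACACACTTGCTGCTTCCGGCTCGTATGTTGTGTGG
True,1,0,0,0,0,0,0,0,0,0,AAATTTACACTTTATGCATCAGACTCGTATGTTGTGTGG
False,0,0,0,0,0,0,0,0,2,0,TTTTTGACACTTTATGCTTCCGGCTCGTATACTGTGAGG
"""

rows = list(csv.DictReader(io.StringIO(TABLE)))

# Count columns are named 'ct_0' ... 'ct_{Y-1}'; Y is the number of bins.
ct_cols = [name for name in rows[0] if name.startswith('ct_')]
num_bins = len(ct_cols)

# For each sequence, map bin number -> read count for bins with any reads.
observed = [
    {int(col[3:]): int(row[col]) for col in ct_cols if int(row[col]) > 0}
    for row in rows
]

print(num_bins)   # 10
print(observed)   # [{3: 1}, {0: 1}, {8: 2}]
```

Here bin 0 is the input library and bins 1-9 are the sorted bins, so the second sequence in this sample was seen only in the input library.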
