# Mammal Enhancers Data Processing

This notebook creates the dataset needed to train a classification model on enhancer sequences from several mammalian species.

#### Mammalian Enhancer Sequences
This dataset uses the enhancer sequences associated with the paper [Enhancer Identification using Transfer and Adversarial Deep Learning of DNA Sequences](https://www.biorxiv.org/content/biorxiv/early/2018/02/14/264200.full.pdf) by Cohn et al. The data is available from the [Enhancer CNN](https://github.com/cohnDikla/enhancer_CNN) repo. It contains enhancer (positive) and negative sequences from 17 mammalian species, with 14,000 enhancer sequences and 14,000 negative sequences per species.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai import *
from fastai.text import *
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import FeatureLocation, CompoundLocation
import networkx as nx

In [3]:
sys.path.append("../../..")
from utils import *

In [4]:
path = Path('F:/genome/mammals/')

In [5]:
organisms = [i for i in os.listdir(path/'enhancers') if 'peaks' not in i]
organisms

['Cat',
 'Cow',
 'Dog',
 'Dolphin',
 'Ferret',
 'Guinea_pig',
 'Human',
 'Macaque',
 'Marmoset',
 'Mouse',
 'Naked_mole_rat',
 'Opossum',
 'Pig',
 'Rabbit',
 'Rat',
 'Tasmanian_devil',
 'Tree_shrew']

In [6]:
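def partition_data(df):
    # Randomly split df into 80% train, 10% validation and 10% test rows.
    
    train_size = int(len(df)*0.8)
    valid_size = int(len(df)*0.9) - train_size
    
    train_df = df.sample(train_size)
    # Rows not drawn for training are divided between validation and test
    test_val = df.drop(train_df.index)
    valid_df = test_val.sample(valid_size)
    test_df = test_val.drop(valid_df.index)
    # Tag each row with the split it belongs to
    train_df['set'] = 'train'
    valid_df['set'] = 'valid'
    test_df['set'] = 'test'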
    
    return (train_df, valid_df, test_df)
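
Because `df.sample` is called without a seed, `partition_data` produces a different split each time the notebook runs. A minimal sketch of a seeded variant is shown here; the function name `partition_data_seeded` and the `seed` parameter are hypothetical and not part of the original notebook.

```python
import pandas as pd

def partition_data_seeded(df, seed=42):
    """Hypothetical seeded variant of partition_data (80/10/10 split)."""
    train_size = int(len(df) * 0.8)
    valid_size = int(len(df) * 0.9) - train_size

    # random_state makes the sampling repeatable across runs
    train_df = df.sample(train_size, random_state=seed)
    rest = df.drop(train_df.index)
    valid_df = rest.sample(valid_size, random_state=seed)
    test_df = rest.drop(valid_df.index)

    # assign(...) returns new frames with the split label attached
    return (train_df.assign(set='train'),
            valid_df.assign(set='valid'),
            test_df.assign(set='test'))

# Toy data standing in for one organism's sequences
toy = pd.DataFrame({'Sequence': ['ACGT'] * 100})
train, valid, test = partition_data_seeded(toy)
print(len(train), len(valid), len(test))
```

Like the original, this relies on the input frame having a unique index, since rows are removed with `drop` by index label.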

In [7]:
fname1 = 'positive_samples'
fname2 = 'negative_samples'

In [8]:
trains = []
vals = []
tests = []

for organism in organisms:
    for fname in [fname1, fname2]:
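        # Use a context manager so the file handle is closed after reading.
        # Keep only lines whose set of characters is exactly {A, T, G, C};
        # lines containing N or other symbols are dropped.
        with open(path/'enhancers'/organism/fname) as file:
            seqs = [i.strip('\n') for i in file if set(i.strip('\n').upper()) == set('ATGC')]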
        df = pd.DataFrame(seqs, columns=['Sequence'])
        df['Enhancer'] = fname.split('_')[0]
        df['Organism'] = organism

        train, val, test = partition_data(df)
        trains.append(train)
        vals.append(val)
        tests.append(test)
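
The filter in the loop compares the set of characters in each line against `set('ATGC')`. The sketch below, using a hypothetical helper `is_clean_sequence` and made-up example lines, illustrates what that comparison keeps and drops. Note that the equality test also rejects a line that is missing one of the four bases, which is unlikely to matter for sequences of realistic length.

```python
def is_clean_sequence(line):
    """Hypothetical helper mirroring the notebook's set-equality filter."""
    return set(line.strip('\n').upper()) == set('ATGC')

# Made-up lines: upper case, lower case, contains N, missing G/T, empty
examples = ['ACGTACGT\n', 'acgtacgt\n', 'ACGTNNAC\n', 'AAAACCCC\n', '\n']

# Same comprehension shape as the notebook: filter on the upper-cased
# characters, but keep the sequence in its original case
kept = [s.strip('\n') for s in examples if is_clean_sequence(s)]
print(kept)
```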

In [9]:
train_df = pd.concat(trains)
valid_df = pd.concat(vals)
test_df = pd.concat(tests)
data_df = pd.concat([train_df, valid_df, test_df])

In [10]:
data_df.shape

(475995, 4)

In [11]:
data_df.Organism.value_counts()

Rat                28000
Dolphin            28000
Ferret             28000
Naked_mole_rat     28000
Guinea_pig         28000
Marmoset           28000
Rabbit             28000
Cat                28000
Pig                28000
Tasmanian_devil    28000
Human              28000
Dog                28000
Tree_shrew         28000
Mouse              27999
Opossum            27999
Macaque            27999
Cow                27998
Name: Organism, dtype: int64
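
As a quick consistency check, the per-organism counts above should add up to the row count reported by `data_df.shape`. The sketch below redoes that arithmetic with the numbers copied from the outputs; the variable names are illustrative only.

```python
# Counts copied from the value_counts() output: {rows per organism: number of organisms}
counts = {28000: 13, 27999: 3, 27998: 1}

n_organisms = sum(counts.values())
total_rows = sum(rows * n for rows, n in counts.items())

# 14,000 positive + 14,000 negative sequences per organism before filtering
expected_rows = n_organisms * 28000
dropped_rows = expected_rows - total_rows

print(n_organisms, total_rows, dropped_rows)
```

The total matches the 475,995 rows reported earlier, which suggests that five sequences across four organisms were removed by the character filter.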

In [14]:
data_df.reset_index(inplace=True, drop=True)

In [15]:
data_df.to_csv(path/'enhancer_data.csv', index=False)
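
A downstream notebook would typically read the CSV back and select rows by the `set` column. The following is a minimal sketch of that round trip using a small made-up frame and a temporary directory, so it does not depend on the real data path; the toy rows are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows with the same four columns as data_df
toy = pd.DataFrame({
    'Sequence': ['ACGT', 'TTGA', 'GGCA', 'CATG'],
    'Enhancer': ['positive', 'negative', 'positive', 'negative'],
    'Organism': ['Human', 'Human', 'Mouse', 'Mouse'],
    'set': ['train', 'train', 'valid', 'test'],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'enhancer_data.csv'
    toy.to_csv(csv_path, index=False)   # same call shape as the notebook
    loaded = pd.read_csv(csv_path)

# Select one split, and count rows per organism and split
train_rows = loaded[loaded['set'] == 'train']
split_counts = loaded.groupby(['Organism', 'set']).size()
print(split_counts)
```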