# Human Genome Data Processing

This notebook builds the dataset used to train a classification model on short promoter sequences from the human genome.

#### Human Promoter Classification Short Sequences
This dataset is built from sequences used in the paper [Recognition of Prokaryotic and Eukaryotic Promoters using Convolutional Deep Learning Neural Networks](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0171410). It consists of short (250 bp) sequences spanning roughly -200 to +50 bp around transcription start sites (TSS), together with negative examples of the same length. The data files `human_non_tata.fa` and `human_nonprom_big.fa` were downloaded from [this repo](https://github.com/solovictor/CNNPromoterData). The paper uses two separate models, one for classifying `tata`-containing promoters and one for `non-tata` promoters. However, the dataset for the `tata` promoters is not in the repo, so only the `non-tata` promoters are used here.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai import *
from fastai.text import *
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import FeatureLocation, CompoundLocation
import networkx as nx

In [3]:
sys.path.append("../../..")
from utils import *

In [4]:
path = Path('F:/genome/human/')

# Short Sequence Classification Data

Similar to the paper, 15% of the sequences are held out for testing. Of the remaining 85%, 90% are used for training and 10% for validation, which works out to roughly 76.5% / 8.5% / 15% of each class for train / validation / test.
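To make the split arithmetic concrete, the sketch below computes the three set sizes for a class of `n` sequences, truncating to whole sequences with `int`. The helper name `split_sizes` and its parameter names are hypothetical and only serve this illustration.

```python
def split_sizes(n, keep_frac=0.85, train_frac=0.9):
    """Return (train, valid, test) sizes for n sequences.

    Hypothetical helper: keep_frac is the share kept for train + validation,
    train_frac is the share of that kept portion used for training.
    """
    train = int(n * keep_frac * train_frac)
    valid = int(n * keep_frac) - train
    test = n - train - valid  # everything not kept goes to the test set
    return train, valid, test


train, valid, test = split_sizes(10_000)
print(train, valid, test)
```

Because the sizes are truncated, the test set absorbs any rounding remainder, so the three sizes always add up to `n`.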

In [5]:
fname1 = 'human_non_tata.fa'
fname2 = 'human_nonprom_big.fa'

In [6]:
fasta1 = SeqIO.parse(path/fname1, 'fasta')
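# Keep only records whose characters are exactly A, T, G and C (drops sequences containing N or other symbols)
seqs1 = [str(i.seq) for i in fasta1 if set(str(i.seq)) == set('ATGC')]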
seq1_df = pd.DataFrame(seqs1, columns=['Sequence'])
seq1_df['Promoter'] = 1

In [7]:
fasta2 = SeqIO.parse(path/fname2, 'fasta')
seqs2 = [str(i.seq) for i in fasta2 if set(str(i.seq)) == set('ATGC')]
seq2_df = pd.DataFrame(seqs2, columns=['Sequence'])
seq2_df['Promoter'] = 0
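The filter used in the two cells above compares the set of characters in each sequence with `set('ATGC')`. The sketch below applies the same test to a few made-up strings to show what it accepts and rejects; the function name `is_unambiguous` and the example strings are assumptions for illustration, not part of the notebook.

```python
def is_unambiguous(seq):
    """True when the characters in seq are exactly the set {A, C, G, T}."""
    return set(seq) == set('ATGC')


# Made-up examples, not taken from the FASTA files
examples = ['ATGCATGC', 'ATGNATGC', 'atgcatgc', 'AAAA']
for s in examples:
    print(s, is_unambiguous(s))
```

Since the comparison is an equality rather than a subset test, a sequence that lacks one of the four bases, or that is written in lowercase, is rejected as well. For 250 bp uppercase sequences the first case is presumably rare, but it is worth keeping in mind if the filter is reused on shorter sequences.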

In [8]:
seq1_df.shape, seq2_df.shape

((19809, 2), (27703, 2))

In [9]:
seq1_df.drop_duplicates(inplace=True)
seq2_df.drop_duplicates(inplace=True)

In [10]:
seq1_df.shape, seq2_df.shape

((19787, 2), (27038, 2))
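`drop_duplicates` removes repeated rows within each frame, but it does not say whether the same sequence occurs in both the promoter and non-promoter sets. A minimal sketch of such a check on toy data follows; the frames `pos` and `neg` and their contents are made up, and whether any overlap exists in the real data is not known from this notebook.

```python
import pandas as pd

# Toy stand-ins for the promoter / non-promoter frames (made-up sequences)
pos = pd.DataFrame({'Sequence': ['AAAC', 'AAAG', 'AAAC'], 'Promoter': 1})
neg = pd.DataFrame({'Sequence': ['AAAT', 'AAAG'], 'Promoter': 0})

# Remove duplicates within each class, as in the cell above
pos = pos.drop_duplicates()
neg = neg.drop_duplicates()

# Sequences that appear under both labels
shared = set(pos['Sequence']) & set(neg['Sequence'])
print(len(pos), len(neg), shared)
```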

In [11]:
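def partition_data(df):
    """Randomly split df into train (~76.5%), validation (~8.5%) and test (~15%) sets."""
    # 85% of the rows are kept for train + validation; 90% of those go to training
    train_size = int(len(df)*0.85*.9)
    valid_size = int(len(df)*0.85) - train_size
    
    # Sample without replacement, dropping already-assigned rows by index.
    # Note: no random_state is passed, so the split changes on every run.
    train_df = df.sample(train_size)
    test_val = df.drop(train_df.index)
    valid_df = test_val.sample(valid_size)
    test_df = test_val.drop(valid_df.index)  # the remaining rows form the test set
    
    # Label each row with the split it belongs to
    train_df['set'] = 'train'
    valid_df['set'] = 'valid'
    test_df['set'] = 'test'
    
    return (train_df, valid_df, test_df)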
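Since `df.sample` is called without a seed, rerunning the notebook produces a different split each time. If a repeatable split is wanted, one option is to pass `random_state` to `sample`. The variant below is a sketch under that assumption; the name `partition_data_seeded` and the `seed` parameter are hypothetical and not part of the notebook.

```python
import pandas as pd


def partition_data_seeded(df, seed=42):
    """Same split logic as partition_data, but repeatable (hypothetical variant)."""
    train_size = int(len(df) * 0.85 * 0.9)
    valid_size = int(len(df) * 0.85) - train_size

    train_df = df.sample(train_size, random_state=seed)
    rest = df.drop(train_df.index)
    valid_df = rest.sample(valid_size, random_state=seed)
    test_df = rest.drop(valid_df.index)

    # assign returns new frames, so the inputs are left untouched
    return (train_df.assign(set='train'),
            valid_df.assign(set='valid'),
            test_df.assign(set='test'))


# Toy data: 100 placeholder rows
toy = pd.DataFrame({'Sequence': [f'seq{i}' for i in range(100)], 'Promoter': 1})
first = partition_data_seeded(toy)
second = partition_data_seeded(toy)

# The three splits should not share any rows, and both runs should agree
print([len(part) for part in first])
print(first[0].index.equals(second[0].index))
```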

In [12]:
t1, v1, test1 = partition_data(seq1_df)
t2, v2, test2 = partition_data(seq2_df)
data_df = pd.concat([t1,t2,v1,v2,test1,test2])

In [13]:
data_df[data_df.set == 'train'].shape, data_df[data_df.set == 'valid'].shape, data_df[data_df.set == 'test'].shape, data_df.shape

((35821, 3), (3979, 3), (7025, 3), (46825, 3))

In [14]:
data_df.to_csv(path/'human_promoters_short.csv', index=False)
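After saving, it can be useful to reload the file and confirm that each split contains both classes. The sketch below does this on a toy frame written to an in-memory buffer rather than the real CSV; the data is made up, and only the column names (`Sequence`, `Promoter`, `set`) are taken from the notebook.

```python
import io

import pandas as pd

# Toy frame with the same columns as human_promoters_short.csv (made-up rows)
toy = pd.DataFrame({
    'Sequence': ['AAAC', 'AAAG', 'AAAT', 'CCCA', 'CCCG', 'CCCT'],
    'Promoter': [1, 1, 1, 0, 0, 0],
    'set': ['train', 'valid', 'test', 'train', 'train', 'test'],
})

# Round-trip through CSV, using an in-memory buffer in place of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# Rows per (split, class); a zero flags a split that is missing one class
counts = loaded.groupby(['set', 'Promoter']).size().unstack(fill_value=0)
print(counts)
```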