# Human lncRNA Classification Data

This notebook builds the datasets used to train a model that classifies human coding mRNA sequences versus long noncoding RNA (lncRNA) sequences.

#### lncRNA Data
This dataset comes from the paper [A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6144860/) by Hill et al. and is available for download [here](https://osf.io/4htpy/). 

Several data files are used.

__mRNAs.TRAIN.fa__ and __lncRNAs.TRAIN.fa__ contain the full-length training transcripts for coding mRNA and lncRNA, respectively.

__mRNAs.train16K.fa__ and __lncRNAs.train16K.fa__ are samples of the above datasets, restricted to sequences between 200 and 1000 bp in length.

__mRNAs.TEST500.fa__ and __lncRNAs.TEST500.fa__ contain the test set.

__mRNAs.CHALLENGE500.fa__ and __lncRNAs.CHALLENGE500.fa__ contain an additional challenge test set.
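
The 200 to 1000 bp screen described for the train16K files can be sketched with a small, dependency-free FASTA reader. This is only an illustration: `read_fasta` and `in_length_window` are hypothetical helpers, the two records are made up, and treating the bounds as inclusive is an assumption, since the description only says "between 200 and 1000 bp".

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def in_length_window(seq, min_len=200, max_len=1000):
    """Assumed inclusive bounds; the exact rule used for train16K is not stated."""
    return min_len <= len(seq) <= max_len

# Toy records (not real transcripts): one inside the window, one too short.
toy_fasta = io.StringIO(">tx_ok\n" + "ACGT" * 60 + "\n>tx_short\n" + "ACGT" * 10 + "\n")
records = list(read_fasta(toy_fasta))
kept = [name for name, seq in records if in_length_window(seq)]
print(kept)
```

The same predicate could be applied to the `Sequence` column produced later in this notebook to confirm that the train16K files respect the stated length range.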

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
# Explicit imports for names used directly in this notebook
import sys
from pathlib import Path
import pandas as pd

from fastai import *
from fastai.text import *
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import FeatureLocation, CompoundLocation
import networkx as nx

In [3]:
sys.path.append("../../..")
from utils import *

In [4]:
path = Path('F:/genome/human/')

# Size selected data

This builds a dataset from the __train16K__ files, together with the test and challenge sets.

In [5]:
mrna = 'lncRNA/mRNAs.train16K.fa'
incrna = 'lncRNA/lncRNAs.train16K.fa'

mrna_test = 'lncRNA/mRNAs.TEST500.fa'
incrna_test = 'lncRNA/lncRNAs.TEST500.fa'

mrna_challenge = 'lncRNA/mRNAs.CHALLENGE500.fa'
incrna_challenge = 'lncRNA/lncRNAs.CHALLENGE500.fa'

In [5]:
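def parse_fasta(filename, label):
    """Read a FASTA file under `path` into a DataFrame of labelled sequences."""
    records = SeqIO.parse(path/filename, 'fasta')
    seqs = [str(rec.seq) for rec in records]
    df = pd.DataFrame(seqs, columns=['Sequence'])
    df['Target'] = label
    # Drop exact duplicate sequences within this file; index gaps are kept.
    df = df.drop_duplicates()
    return df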

In [6]:
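def partition_data(df):
    """Randomly assign 90% of rows to 'train' and the rest to 'valid'."""
    train_size = int(len(df)*0.90)
    
    # No random_state is set, so the split differs between runs.
    train_df = df.sample(train_size)
    # Relies on a unique index so that the two splits do not overlap.
    valid_df = df.drop(train_df.index)
    
    train_df['set'] = 'train'
    valid_df['set'] = 'valid'
    
    return pd.concat([train_df, valid_df])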
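
Because `partition_data` samples without a seed, rerunning the notebook produces a different train/valid assignment. A minimal sketch of a seeded variant follows; `partition_data_seeded`, its `train_frac` and `seed` parameters, and the toy frame are all hypothetical and are not part of the original notebook.

```python
import pandas as pd

def partition_data_seeded(df, train_frac=0.90, seed=42):
    """Hypothetical seeded variant of partition_data (assumes a unique index)."""
    train_size = int(len(df) * train_frac)
    train_df = df.sample(train_size, random_state=seed)
    valid_df = df.drop(train_df.index)
    # assign() returns copies, so the input frame is left untouched.
    return pd.concat([train_df.assign(set="train"), valid_df.assign(set="valid")])

# Toy data standing in for the output of parse_fasta.
toy = pd.DataFrame({"Sequence": [f"SEQ{i}" for i in range(20)], "Target": "mRNA"})
first = partition_data_seeded(toy)
second = partition_data_seeded(toy)

print(first["set"].value_counts().to_dict())
print(first.equals(second))
```

Fixing the seed makes the saved CSV reproducible, at the cost of always evaluating on the same validation rows.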

In [8]:
mrna_df = partition_data(parse_fasta(mrna, 'mRNA'))
incrna_df = partition_data(parse_fasta(incrna, 'lncRNA'))

mrna_testdf = parse_fasta(mrna_test, 'mRNA')
mrna_testdf['set'] = 'test'
incrna_testdf = parse_fasta(incrna_test, 'lncRNA')
incrna_testdf['set'] = 'test'

mrna_challengedf = parse_fasta(mrna_challenge, 'mRNA')
mrna_challengedf['set'] = 'test_challenge'
incrna_challengedf = parse_fasta(incrna_challenge, 'lncRNA')
incrna_challengedf['set'] = 'test_challenge'

In [9]:
dfs = [mrna_df, incrna_df, mrna_testdf, incrna_testdf, mrna_challengedf, incrna_challengedf]

In [10]:
[i.shape for i in dfs]

[(15978, 3), (15950, 3), (500, 3), (500, 3), (499, 3), (500, 3)]

In [11]:
data_df = pd.concat(dfs)

In [12]:
data_df.head()

| | Sequence | Target | set |
|---|---|---|---|
| 6704 | GCTCAGCATTTGGGGACGCTCTCAGCTCTCGGCGCACGGCCCAGGT... | mRNA | train |
| 4382 | GTGACGCGCAAGCCTGGGCCGCTCCTCCTTCCCTCACCCGACGGCC... | mRNA | train |
| 8479 | ATAGGTATGATCTCGTGAAATCTTGAGAGAAACTGAATGACGAATG... | mRNA | train |
| 10611 | AGATTCAGGCGTGTAAACCAGCCGGAGCGGCGCGGCAGCGGCAGGA... | mRNA | train |
| 8735 | GTAGAGGGCTGTACCTTTTTGGCGCTGTGGAAGCCGTTGCTGTGCT... | mRNA | train |


In [13]:
data_df.to_csv(path/'lncRNA_data.csv', index=False)
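
Duplicates are removed within each FASTA file, but nothing above checks whether the same sequence appears in more than one split. A quick sanity check on the combined frame could look like the sketch below. The toy frame only mimics the `Sequence`, `Target` and `set` columns; whether any overlap exists in the real data is not stated here.

```python
import pandas as pd

# Toy stand-in for data_df; "CCCC" is deliberately placed in two splits.
toy = pd.DataFrame({
    "Sequence": ["AAAA", "CCCC", "GGGG", "CCCC", "TTTT"],
    "Target":   ["mRNA", "lncRNA", "mRNA", "lncRNA", "mRNA"],
    "set":      ["train", "train", "valid", "test", "test"],
})

# Number of distinct splits each sequence appears in.
splits_per_seq = toy.groupby("Sequence")["set"].nunique()
shared = sorted(splits_per_seq[splits_per_seq > 1].index)

# Rows per split and class, as a quick balance check.
counts = toy.groupby(["set", "Target"]).size()

print(shared)
print(counts)
```

An empty `shared` list on the real data would indicate that no sequence leaks between the train, valid and test splits.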

# Long read data

This builds a dataset from the full-length sequence files (__mRNAs.TRAIN.fa__ and __lncRNAs.TRAIN.fa__), using the same test and challenge sets as above.

In [7]:
mrna = 'lncRNA/mRNAs.TRAIN.fa'
incrna = 'lncRNA/lncRNAs.TRAIN.fa'

mrna_test = 'lncRNA/mRNAs.TEST500.fa'
incrna_test = 'lncRNA/lncRNAs.TEST500.fa'

mrna_challenge = 'lncRNA/mRNAs.CHALLENGE500.fa'
incrna_challenge = 'lncRNA/lncRNAs.CHALLENGE500.fa'

In [8]:
mrna_df = partition_data(parse_fasta(mrna, 'mRNA'))
incrna_df = partition_data(parse_fasta(incrna, 'lncRNA'))

mrna_testdf = parse_fasta(mrna_test, 'mRNA')
mrna_testdf['set'] = 'test'
incrna_testdf = parse_fasta(incrna_test, 'lncRNA')
incrna_testdf['set'] = 'test'

mrna_challengedf = parse_fasta(mrna_challenge, 'mRNA')
mrna_challengedf['set'] = 'test_challenge'
incrna_challengedf = parse_fasta(incrna_challenge, 'lncRNA')
incrna_challengedf['set'] = 'test_challenge'

In [9]:
dfs = [mrna_df, incrna_df, mrna_testdf, incrna_testdf, mrna_challengedf, incrna_challengedf]

In [10]:
[i.shape for i in dfs]

[(86978, 3), (24339, 3), (500, 3), (500, 3), (499, 3), (500, 3)]

In [11]:
data_df = pd.concat(dfs)

In [12]:
data_df.head()

| | Sequence | Target | set |
|---|---|---|---|
| 76980 | AGACCGCGGTGACGTCTCCACCGCGCCAAACTCACTGAAAATCAAA... | mRNA | train |
| 22459 | TAGCACACATTTACGCTCGCCTGCCGCGGGCCGCTCTCCGTGCTGG... | mRNA | train |
| 26784 | CCCCTCGCGCCGGGAGGAGCTGGCGGCGAGCGCCGAGCCGGGCGCG... | mRNA | train |
| 34361 | GGCGGGGCGCGGCGGTTCCGGCCCAGCCATGGCGGACGAGGCCCCG... | mRNA | train |
| 48292 | AGTCCTGAGTGCATGCTCTGCGGTCTGGGGTCACCTGGGGTGCTTA... | mRNA | train |


In [14]:
data_df.shape

(113316, 3)

In [16]:
# Drop sequences that contain the ambiguous base N (case-insensitive)
data_df = data_df[~data_df.Sequence.map(lambda x: 'N' in x.upper())]

In [17]:
data_df.shape

(113315, 3)
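
The filter above removes the single sequence that contains an `N`. A slightly broader check, sketched below, flags any character outside A, C, G and T. The helper name `has_ambiguous_base` and the toy sequences are illustrative, and it is an assumption that other IUPAC ambiguity codes could occur in these files at all.

```python
def has_ambiguous_base(seq, alphabet=frozenset("ACGT")):
    """Return True if seq contains any character outside the given alphabet."""
    return not set(seq.upper()) <= alphabet

# Toy sequences: two clean, one with a lowercase n, one with the IUPAC code R.
toy_seqs = ["ACGTACGT", "acgtnacg", "ACGRT", "GGGCCC"]
clean = [s for s in toy_seqs if not has_ambiguous_base(s)]
print(clean)
```

Applied to the `Sequence` column with `map`, this would play the same role as the `'N' in x.upper()` test while also catching other non-ACGT symbols.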

In [18]:
data_df.to_csv(path/'lncRNA_data2.csv', index=False)