The goal of this notebook is to load all the data into a single collection, such as a list. A later notebook can then split that collection into training, validation, and test sets.

In [1]:
import os
from Bio import SeqIO

In [2]:
root_dir = '/home/jonas/peppred'
os.chdir(root_dir)
os.getcwd()

'/home/jonas/peppred'

In [4]:
os.chdir('data/training_data/negative_examples/non_tm')
os.getcwd()

'/home/jonas/peppred/data/training_data/negative_examples/non_tm'

An easy way to collect all files in a directory is `os.walk`:

https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory

In [7]:
files = []
# os.walk yields (dirpath, dirnames, filenames) for each directory it visits
for dirpath, dirnames, filenames in os.walk(os.getcwd()):
    files.extend(filenames)
files

['._g9.faa',
 '._g2.faa',
 '._g5.faa',
 'g0.faa',
 'g2.faa',
 'g9.faa',
 '._g6.faa',
 '._g7.faa',
 'g6.faa',
 '._g0.faa',
 'g7.faa',
 '._g8.faa',
 'g4.faa',
 'g1.faa',
 '._g3.faa',
 '._g1.faa',
 '._g4.faa',
 'g3.faa',
 'g8.faa',
 'g5.faa']

Use the glob module to filter out files starting with '.': its `*` wildcard does not match a leading dot, so the hidden files are skipped.

https://docs.python.org/3/library/glob.html#glob.glob

In [10]:
import glob
files = glob.glob('*.faa')
files

['g0.faa',
 'g2.faa',
 'g9.faa',
 'g6.faa',
 'g7.faa',
 'g4.faa',
 'g1.faa',
 'g3.faa',
 'g8.faa',
 'g5.faa']
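To check the dot-file behaviour in isolation, here is a minimal sketch that builds a throwaway directory and compares a plain listing with a glob match. The temporary directory and the four file names are made up for illustration; they only imitate the listing above.

```python
import glob
import os
import tempfile

# Throwaway directory imitating the listing above:
# two FASTA files plus two hidden '._' companions (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ['g0.faa', 'g1.faa', '._g0.faa', '._g1.faa']:
        open(os.path.join(tmp, name), 'w').close()

    # os.listdir returns every entry, hidden or not.
    all_names = sorted(os.listdir(tmp))

    # glob's '*' does not match a leading '.', so the hidden files drop out.
    matched = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(tmp, '*.faa'))
    )

print(all_names)
print(matched)
```

Only the two files without a leading dot should survive the glob match, which is the same filtering seen in the cell above.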

We can descend into subdirectories by putting `**` in the pattern and passing `recursive=True` to `glob.glob`.

https://stackoverflow.com/questions/14798220/how-can-i-search-sub-folders-using-glob-glob-module-in-python/22388582

In [12]:
data_files = glob.glob('/home/jonas/peppred/data/**/*.faa', recursive=True)
data_files

['/home/jonas/peppred/data/training_data/positive_examples/tm/m5.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/tm/m3.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/tm/m9.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/tm/m2.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/tm/m1.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/tm/m0.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/tm/m6.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/tm/m8.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/tm/m7.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/tm/m4.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/non_tm/s2.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/non_tm/s6.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/non_tm/s8.faa',
 '/home/jonas/peppred/data/training_data/positive_examples/non_tm
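The paths above come back in no particular order, because `glob.glob` makes no ordering guarantee. If the later train/validation/test split should be reproducible, sorting the result may help. The sketch below assumes a tiny, made-up directory tree (not the real data) just to show the idea.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical miniature of the data tree, for illustration only.
    layout = [
        ('positive_examples/tm', 'm1.faa'),
        ('positive_examples/tm', 'm0.faa'),
        ('negative_examples/non_tm', 'g0.faa'),
    ]
    for sub, name in layout:
        os.makedirs(os.path.join(tmp, sub), exist_ok=True)
        open(os.path.join(tmp, sub, name), 'w').close()

    # '**' with recursive=True matches files at any depth below tmp.
    pattern = os.path.join(tmp, '**', '*.faa')

    # sorted() gives the same order on every run and every file system.
    found = sorted(glob.glob(pattern, recursive=True))
    relative = [os.path.relpath(path, tmp) for path in found]

print(relative)
```

In the notebook itself the same effect would come from wrapping the existing call in `sorted(...)`.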

Use Biopython's SeqIO.parse to parse the FASTA-formatted files. This gives a list with one iterator per file; each iterator lazily yields the records (SeqRecord objects) in that file.

In [21]:
data = [SeqIO.parse(file, 'fasta') for file in data_files]
data

[<generator object parse at 0x7f30a2a98888>,
 <generator object parse at 0x7f30a2b62e08>,
 <generator object parse at 0x7f30a2aa0570>,
 <generator object parse at 0x7f30a2aa0518>,
 <generator object parse at 0x7f30a2aa04c0>,
 <generator object parse at 0x7f30a2aa0468>,
 <generator object parse at 0x7f30a2aa0410>,
 <generator object parse at 0x7f30a2aa03b8>,
 <generator object parse at 0x7f30a2aa0360>,
 <generator object parse at 0x7f30a2aa0308>,
 <generator object parse at 0x7f30a2aa05c8>,
 <generator object parse at 0x7f30a2aa0620>,
 <generator object parse at 0x7f30a2aa0678>,
 <generator object parse at 0x7f30a2aa06d0>,
 <generator object parse at 0x7f30a2aa0728>,
 <generator object parse at 0x7f30a2aa0780>,
 <generator object parse at 0x7f30a2aa07d8>,
 <generator object parse at 0x7f30a2aa0830>,
 <generator object parse at 0x7f30a2aa0888>,
 <generator object parse at 0x7f30a2aa08e0>,
 <generator object parse at 0x7f30a2aa0938>,
 <generator object parse at 0x7f30a2aa0990>,
 <generato
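Because the parsers are lazy, each one can be consumed only once. Since the goal is a reusable list, it may be worth flattening them into a list of records up front. The sketch below uses a plain Python generator, `fake_parse`, as a stand-in for `SeqIO.parse` so that it runs without Biopython or the data files; the function and the record names passed to it are illustrative only.

```python
def fake_parse(names):
    """Hypothetical stand-in for SeqIO.parse: lazily yields one item per record."""
    for name in names:
        yield name

# One lazy iterator per "file", mirroring the list comprehension above.
data = [fake_parse(['RMP1_HUMAN', 'E315_ADE05']), fake_parse(['RIB1_RAT'])]

# Flatten into a real list so the records can be indexed and reused.
records = [record for parser in data for record in parser]

# The iterators are now spent: a second pass over them yields nothing.
second_pass = [record for parser in data for record in parser]

print(len(records), len(second_pass))

# With Biopython and the real files, the analogous line would be:
# records = [rec for path in data_files for rec in SeqIO.parse(path, 'fasta')]
```

Keeping `records` as a list means later cells can loop over it as many times as needed, whereas the original `data` list is empty after its first full iteration.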

In [24]:
for item in data:
    for subitem in item:
        print(subitem)

ID: RMP1_HUMAN
Name: RMP1_HUMAN
Description: RMP1_HUMAN O60894 148 AA.
Number of features: 0
Seq('MARALCRLPRRGLWLLLAHHLFMTTACQEANYGALLRELCLTQFQVDMEAVGET...iii', SingleLetterAlphabet())
ID: E315_ADE05
Name: E315_ADE05
Description: E315_ADE05 P06498 132 AA.
Number of features: 0
Seq('MKFTVTFLLIICTLSAFCSPTSKPQRHISCRFTRIWNIPSCYNEKSDLSEAWLY...iii', SingleLetterAlphabet())
ID: RIB1_RAT
Name: RIB1_RAT
Description: RIB1_RAT P07153; 605 AA.
Number of features: 0
Seq('MEAPIVLLLLLWLALAPTPGSASSEAPPLVNEDVKRTVDLSSHLAKVTAEVVLA...iii', SingleLetterAlphabet())
ID: GPBB_HUMAN
Name: GPBB_HUMAN
Description: GPBB_HUMAN P13224; 206 AA.
Number of features: 0
Seq('MGSGPRGALSLLLLLLAPPSRPAAGCPAPCSCAGTLVDCGRRGLTWASLPTAFP...iii', SingleLetterAlphabet())
ID: 5HT3_MOUSE
Name: 5HT3_MOUSE
Description: 5HT3_MOUSE P23979; 487 AA.
Number of features: 0
Seq('MRLCIPQVLLALFLSMLTAPGEGSRRRATQEDTTQPALLRLSDHLLANYKKGVR...ooo', SingleLetterAlphabet())
ID: EGFR_HUMAN
Name: EGFR_HUMAN
Description: EGFR_HUMAN P00533; O00688; 1210 A

ID: PSAA_SYNY3
Name: PSAA_SYNY3
Description: PSAA_SYNY3 P29254, 751 AA.
Number of features: 0
Seq('MTISPPEREAKAKVSVDNNPVPTSFEKWGKPGHFDRTLARGPKTTTWIWNLHAN...ooo', SingleLetterAlphabet())
ID: PSAB_SYNY3
Name: PSAB_SYNY3
Description: PSAB_SYNY3 P29255, 730 AA.
Number of features: 0
Seq('ATKFPKFSQDLAQDPTTRRIWYGIATAHDFETHDGMTEENLYQKIFASHFGHIA...ooo', SingleLetterAlphabet())
ID: LCND_LACLA
Name: LCND_LACLA
Description: LCND_LACLA Q00565 474 AA.
Number of features: 0
Seq('MFDKKLLESSELYDKRYRNFSTLIILPLFILLVGGVIFTFFAHKELTVISTGSI...OOO', SingleLetterAlphabet())
ID: HOKC_ECOLI
Name: HOKC_ECOLI
Description: HOKC_ECOLI P22982, 50 AA.
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVFTAYESE#iii...ooo', SingleLetterAlphabet())
ID: DHSD_BOVIN
Name: DHSD_BOVIN
Description: DHSD_BOVIN Q95123 158 AA.
Number of features: 0
Seq('MALWRLSVLCGAKEGRALFLRTPVVRPALVSAFLQDRPAQGWCGTQHIHLSPSH...ooo', SingleLetterAlphabet())
ID: CY1_BOVIN
Name: CY1_BOVIN
Description: CY1_BOVIN P00125 241 AA.
Numbe