# Tutorial One: Parsing Different Input Files

## Introduction

IPTK supports several input formats commonly used to report the identifications of proteomics experiments, for example pepXML, idXML and mzTab, along with CSV tables. These files are parsed and analyzed by the IO module to generate a simple table that is used throughout the library, referred to as the identification table. The table consists of four columns: the peptide sequence, the parent protein accession, the start index of the peptide in the parent protein, and the end index of the peptide in the parent protein. An example of the identification table is shown below. Note that if a peptide can be mapped to several proteins, each mapping is treated individually, i.e. each parent protein gets its own row.

In [10]:
# load the modules 
import os
import pandas as pd
from IPTK.Utils.DevFunctions import simulate_an_experimental_ident_table_from_fasta
from IPTK.IO.InFunctions import (
    load_identification_table,
    parse_xml_based_format_to_identification_table,
    parse_mzTab_to_identification_table,
    parse_text_table,
)

## Identification Table Example

In [11]:
# Simulate an identification table from a FASTA file
table: pd.DataFrame = simulate_an_experimental_ident_table_from_fasta(
    path2load='data/sequences/e_coli.fasta', num_pep=10, num_prot=2)
print(table)
# See the documentation for more information about the simulation function

                 peptides proteins  start_index  end_index
0          FDLSGAFRVNDATF   P11446           99        113
1         LISDLHPQLKGIVDL   P11446           43         58
2       DADLLDLNQWPVINATS   P11446          167        184
3  KGIVDLPLQPMSDISEFSPGVD   P11446           52         74
4      ELLEQAAYGLAEWCGNKL   P11446          125        143
5         TANELDTARRRTTGS   P02924          163        178
6        FQTEWKFADKAGKDLG   P02924           39         55
7              AADIIGIGIN   P02924          245        255
8            ATGFYGSLLPSP   P02924          267        279
9          GYKSSEMLYNWVAK   P02924          282        296
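
The one-row-per-parent-protein rule described in the introduction can be sketched in plain Python. This is a minimal illustration and not IPTK's implementation: the function `map_peptides_to_proteins` and the toy sequences are hypothetical. It assumes 0-based start indices with an exclusive end index, which is consistent with the output above, where `end_index - start_index` equals the peptide length.

```python
from typing import Dict, List, Tuple

# (peptide, protein accession, start_index, end_index)
Row = Tuple[str, str, int, int]


def map_peptides_to_proteins(peptides: List[str], proteins: Dict[str, str]) -> List[Row]:
    """Return one row per (peptide, parent protein) match.

    Hypothetical helper: `proteins` maps an accession to its sequence.
    Indices are assumed to be 0-based with an exclusive end, so that
    end_index - start_index == len(peptide).
    """
    rows: List[Row] = []
    for peptide in peptides:
        for accession, sequence in proteins.items():
            start = sequence.find(peptide)  # first occurrence only; -1 if absent
            if start != -1:
                rows.append((peptide, accession, start, start + len(peptide)))
    return rows


# Toy sequences invented for illustration only.
toy_proteins = {"PROT_A": "MKTAYIAKQRQISFVK", "PROT_B": "GGAYIAKQRLL"}
toy_rows = map_peptides_to_proteins(["AYIAKQR", "QISFVK"], toy_proteins)
for row in toy_rows:
    print(row)
```

The peptide `AYIAKQR` occurs in both toy proteins and therefore produces two rows, one per parent protein.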


The following sections show how to use the library to parse different input formats.

### Load a Pre-existing Table

In [12]:
table: pd.DataFrame = load_identification_table('data/inputs/Identification_table.csv', sep=',')
print(table)

            peptide protein  start_index  end_index
0    EVSHDLAPQFLEAG  P11446           82         96
1    LVRLYDKGVPALKN  P11446          268        282
2   HLGADVIFTPHLGNF  P11446          220        235
3  VGLPFCDIGFAVQGEH  P11446          283        299
4       PGVDVVFLATA  P11446           70         81
5    LYKEMQKRGWDVKE  P02924          143        157
6   NWVAKDVEPPKFTEV  P02924          291        306
7    SLLPSPDVHGYKSS  P02924          273        287
8  LPSPDVHGYKSSEMLY  P02924          275        291
9    AKGKPMDTVPLVMM  P02924          117        131
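
Before passing a pre-existing table on, it can help to check that it has the expected shape. The sketch below uses only pandas: `check_identification_table` is a hypothetical helper, not part of IPTK, and the column names are taken from the printed output above. It assumes that `end_index - start_index` should equal the peptide length, as it does in the rows shown.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["peptide", "protein", "start_index", "end_index"]


def check_identification_table(table: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sanity check for an identification table.

    Raises ValueError if an expected column is missing, and returns the
    rows whose index span does not match the peptide length.
    """
    missing = [col for col in EXPECTED_COLUMNS if col not in table.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    span = table["end_index"] - table["start_index"]
    return table[span != table["peptide"].str.len()]


# Two rows copied from the output above, plus a third row whose end index
# was deliberately altered to trigger the check.
csv_text = """peptide,protein,start_index,end_index
EVSHDLAPQFLEAG,P11446,82,96
PGVDVVFLATA,P11446,70,81
LYKEMQKRGWDVKE,P02924,143,150
"""
demo_table = pd.read_csv(io.StringIO(csv_text), sep=",")
bad_rows = check_identification_table(demo_table)
print(bad_rows)
```

Only the altered row is reported; the two unmodified rows pass the check.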


### Read mzTab File


In [13]:
table: pd.DataFrame = parse_mzTab_to_identification_table(
    path2mzTab='data/MM15_Melanom.mzTab', path2fastaDB='data/test_decoy.fasta')
print(table)

         peptide protein  start_index  end_index
0     TVFDNFFIKK  Q9H040          414        424
1      SPNDFSVSL  P16333          312        321
2      GRLFAVVHF  Q7Z2W9           92        101
3    LPIKMDYGEEL  Q9BT88          121        132
4      ARLASLMNL  P43243           49         58
..           ...     ...          ...        ...
418    TPAPPDLTL  Q32M88          180        189
419  EVVDHVFPLLK  Q14693          843        854
420   ELITLEIIHR  P56377           75         85
421   NVVFDVQIPK  P19823          102        112
422    KLLPQFLLH  Q8N4J0           99        108

[423 rows x 4 columns]


### Load a pepXML file

In [17]:
table: pd.DataFrame = parse_xml_based_format_to_identification_table(path2PepXML='data/epitheluim_all_ids_merged_psm.pepXML', 
                        is_idXML=False, path2fastaDB='data/uniprot_sprot.fasta')
print(table)

                     peptide     protein  start_index  end_index
0         LRQETYLTLATVFVLLRL      Q9ZPE9          198        216
1             NCQDGSDEDDCVDC      Q700K0         2472       2486
2      GVALVLLSLALLLTVLRIIRG      Q52965           11         32
3           AGVVKKKKKKKKKKKK      Q95LY5          560        576
4            LLRRRRIKRKRTIRC      P13388          664        679
...                      ...         ...          ...        ...
28434        DILVQRHSMLRARFR  A0A319DV72         3167       3182
28435     KGDLVAIMTLKDELVALG      Q9V1A5          282        300
28436     KGDLVAIMTLKDELVALG      C5A255          282        300
28437     KGDLVAIMTLKDELVALG      Q5JJE8          282        300
28438           STVSFEVKRPKK      Q6UIL3          452        464

[28421 rows x 4 columns]
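
The output above contains a peptide, `KGDLVAIMTLKDELVALG`, that appears once for each of three parent proteins. A short pandas sketch for listing such multi-mapped peptides follows. The small table is rebuilt by hand from the last five rows printed above, so the snippet runs without the pepXML file. The variable names are illustrative.

```python
import pandas as pd

# Last five rows of the printed output, re-entered by hand.
tail_table = pd.DataFrame({
    "peptide": ["DILVQRHSMLRARFR", "KGDLVAIMTLKDELVALG", "KGDLVAIMTLKDELVALG",
                "KGDLVAIMTLKDELVALG", "STVSFEVKRPKK"],
    "protein": ["A0A319DV72", "Q9V1A5", "C5A255", "Q5JJE8", "Q6UIL3"],
    "start_index": [3167, 282, 282, 282, 452],
    "end_index": [3182, 300, 300, 300, 464],
})

# Count distinct parent proteins per peptide and keep those with more than one.
parents_per_peptide = tail_table.groupby("peptide")["protein"].nunique()
multi_mapped = parents_per_peptide[parents_per_peptide > 1]
print(multi_mapped)
```

The same two lines of grouping code should work on any table with `peptide` and `protein` columns.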


### Load an idXML file 

In [None]:
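# idXML files go through the same parser as pepXML files: the path is still
# passed as path2PepXML, and is_idXML is set to True.
table: pd.DataFrame = parse_xml_based_format_to_identification_table(path2PepXML='data/epitheluim_all_ids_merged_psm.idXML', 
                        is_idXML=True, path2fastaDB='data/uniprot_sprot.fasta')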
print(table)

### Load a CSV Table 

In [26]:
table: pd.DataFrame = parse_text_table('data/HLA_Demo.csv', sep=';', path2fastaDB='data/uniprot_sprot.fasta')
print(table)

5840 5840 5840 5840
                peptide protein  start_index  end_index
0       DGSVIRTIPKDNAQG  Q5AFA2          176        191
1      AKSNFEKLSNDLANDA  P53707          263        279
2        GNDPNALRGFHIHQ  O59924           36         50
3      IEGNDPNALRGFHIHQ  O59924           34         50
4          NGIFITYKNVPA  Q59L12          258        270
...                 ...     ...          ...        ...
5835    GKGVVVVIKRRSGQR  P46779           56         71
5836   GKGVVVVIKRRSGQRK  P46779           56         72
5837  GKGVVVVIKRRSGQRKP  P46779           56         73
5838    GITKPAIRRLARRGG  P62803           28         43
5839  YVTRYIYNREEYARFDS  P01920           57         74

[5830 rows x 4 columns]
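
The last index label printed above is 5839, while the footer reports 5830 rows, so the index labels are not contiguous. pandas produces this pattern whenever rows are dropped without renumbering. Whether IPTK removes duplicates at this step is an assumption, not something the output confirms. The sketch below reproduces the effect on a toy table (the duplicated row is invented for illustration) and shows how `reset_index(drop=True)` restores contiguous labels if positional access is needed.

```python
import pandas as pd

# Toy table: the first row is duplicated on purpose.
raw = pd.DataFrame({
    "peptide": ["GKGVVVVIKRRSGQR", "GKGVVVVIKRRSGQR", "GITKPAIRRLARRGG"],
    "protein": ["P46779", "P46779", "P62803"],
    "start_index": [56, 56, 28],
    "end_index": [71, 71, 43],
})

# Dropping duplicates keeps the original labels, leaving a gap.
deduplicated = raw.drop_duplicates()
print(deduplicated.index.tolist())  # [0, 2]

# Renumber the rows 0..n-1 and discard the old labels.
renumbered = deduplicated.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```

With gaps in the index, `table.loc[i]` may raise a `KeyError` for a missing label, whereas `table.iloc[i]` always addresses rows by position.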
