# Tutorial One: Parsing Different Input Files

## Introduction

IPTK supports several input file formats commonly used to report peptide identifications from proteomics experiments, for example pepXML and idXML, along with CSV tables. The IO module parses these files into a simple table that is used throughout the library, referred to as the identification table. The table consists of four columns: the peptide sequence, the accession of the parent protein, and the start and end indices of the peptide within the parent protein. An example of the identification table is shown below. Note that if a peptide maps to several proteins, each mapping is treated individually, i.e., each parent protein gets its own row.

#### Note

The data used in this tutorial is available in the `data` folder of the current working directory.

In [1]:
# Load the modules used in this tutorial
import os
import pandas as pd
from IPTK.Utils.DevFunctions import simulate_an_experimental_ident_table_from_fasta
from IPTK.IO.InFunctions import load_identification_table, parse_xml_based_format_to_identification_table
 

## Identification Table Example

In [3]:
# Simulate an identification table from the human proteome FASTA file
table: pd.DataFrame = simulate_an_experimental_ident_table_from_fasta(
    path2load='data/human_proteome.fasta', num_pep=10, num_prot=2)
print(table)
# See the documentation for more information about the simulation function

           peptides proteins  start_index  end_index
0    AHGSDKSKDFYPFG   Q8N7X0           16         30
1     LFEVKKDTERADE   Q8N7X0         1373       1386
2       INSEKWDAGKG   Q8N7X0           55         66
3     GSLVLKIHTYATK   Q8N7X0          724        737
4  KALEFMDLSQYVRKTD   Q8N7X0         1532       1548
5            ETLPEI   Q5T1N1          130        136
6     SLSKLSPTSQKGT   Q5T1N1          350        363
7     ENTSDLEGPVAAG   Q5T1N1          191        204
8      QMCQKLKEQTDQ   Q5T1N1          378        390
9      LKPKRICSQRVN   Q5T1N1          724        736
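
In the table above, `end_index - start_index` equals the peptide length in every row. The following is a minimal sketch of how such rows could be derived by hand, not the library's implementation. It assumes 0-based start positions with an exclusive end, reports only the first occurrence of the peptide in each protein, and uses a hypothetical helper name (`map_peptide_to_proteins`) and toy sequences with made-up accessions.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, int, int]


def map_peptide_to_proteins(peptide: str, proteins: Dict[str, str]) -> List[Row]:
    """Return one (peptide, accession, start, end) row per protein containing the peptide."""
    rows: List[Row] = []
    for accession, sequence in proteins.items():
        start = sequence.find(peptide)  # 0-based position, -1 when the peptide is absent
        if start == -1:
            continue
        # Assumption: the end index is exclusive, so end - start == len(peptide)
        rows.append((peptide, accession, start, start + len(peptide)))
    return rows


# Toy sequences with hypothetical accessions, not real proteins
toy_proteins = {
    "PROT_A": "MKTAYIAKQRQISFVKSHFSRQ",
    "PROT_B": "GGAYIAKQRLL",
    "PROT_C": "MSSSSSS",
}

rows = map_peptide_to_proteins("AYIAKQR", toy_proteins)
for row in rows:
    print(row)
```

The peptide occurs in two of the three toy proteins, so the sketch yields two rows, one per parent protein, mirroring the one-row-per-mapping rule described in the introduction.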


Below are examples of using the library to parse different input formats.

### Load a Pre-existing Table

In [12]:
# Load an identification table stored as a comma-separated file
table: pd.DataFrame = load_identification_table('data/IdentificationTable.csv', sep=',')
print(table)

            peptide protein  start_index  end_index
0    EVSHDLAPQFLEAG  P11446           82         96
1    LVRLYDKGVPALKN  P11446          268        282
2   HLGADVIFTPHLGNF  P11446          220        235
3  VGLPFCDIGFAVQGEH  P11446          283        299
4       PGVDVVFLATA  P11446           70         81
5    LYKEMQKRGWDVKE  P02924          143        157
6   NWVAKDVEPPKFTEV  P02924          291        306
7    SLLPSPDVHGYKSS  P02924          273        287
8  LPSPDVHGYKSSEMLY  P02924          275        291
9    AKGKPMDTVPLVMM  P02924          117        131


As shown above, the identification table has four columns: peptide, protein, start_index and end_index. All other input formats are translated to this format internally.
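
Before handing a CSV file to the library, it can help to check that it follows this four-column layout. The sketch below reads a small in-memory CSV with plain pandas and runs two checks: the column names match, and `end_index - start_index` equals the peptide length, a pattern observed in the tables of this tutorial rather than a documented guarantee. The helper name `check_identification_table` is hypothetical and not part of IPTK; the three rows are copied from the output above.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["peptide", "protein", "start_index", "end_index"]


def check_identification_table(table: pd.DataFrame) -> bool:
    """Return True when the table has the expected columns and consistent indices."""
    if list(table.columns) != EXPECTED_COLUMNS:
        return False
    peptide_lengths = table["peptide"].str.len()
    # Assumption: the index span matches the peptide length in every row
    return bool((table["end_index"] - table["start_index"] == peptide_lengths).all())


csv_text = """peptide,protein,start_index,end_index
EVSHDLAPQFLEAG,P11446,82,96
LVRLYDKGVPALKN,P11446,268,282
LYKEMQKRGWDVKE,P02924,143,157
"""

table = pd.read_csv(io.StringIO(csv_text), sep=",")
is_valid = check_identification_table(table)
print(is_valid)
```

A table with a renamed column or a shifted index fails the check, which makes layout problems visible before any downstream analysis.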

### Load an idXML file

In [8]:
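# Parse an idXML file into an identification table, using the FASTA database given by path2fastaDB
table: pd.DataFrame = parse_xml_based_format_to_identification_table(
    path2XML_file='data/0810202_0.5_all_ids_merged_psm_perc_filtered.idXML',
    is_idXML=True, path2fastaDB='data/human_proteome.fasta')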
print(table)

              peptide     protein  start_index  end_index
0     LLPKKTESHHKAKGK      Q6FI13          115        130
1     LLPKKTESHHKAKGK      Q93077          115        130
2     LLPKKTESHHKAKGK      P0C0S8          115        130
3     LLPKKTESHHKAKGK      P04908          115        130
4     LLPKKTESHHKAKGK      P20671          115        130
...               ...         ...          ...        ...
4666     AQGGVLPNIQAV      H0YFX9           66         78
4667     AQGGVLPNIQAV  A0A0U1RRH7          103        115
4668     AQGGVLPNIQAV  A0A0U1RR32          103        115
4669      NAAPGVDLTQL      P13645          306        317
4670       DINTDGAVNF      P05109           58         68

[4667 rows x 4 columns]
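
Several peptides in the output above appear once per parent protein. One way to separate such shared peptides from those mapping to a single protein is a pandas `groupby`, sketched here on the rows visible in the printed output only (the elided rows are not reconstructed). The variable names are illustrative.

```python
import pandas as pd

# Rows copied from the visible part of the printed table
visible_rows = [
    ("LLPKKTESHHKAKGK", "Q6FI13", 115, 130),
    ("LLPKKTESHHKAKGK", "Q93077", 115, 130),
    ("LLPKKTESHHKAKGK", "P0C0S8", 115, 130),
    ("LLPKKTESHHKAKGK", "P04908", 115, 130),
    ("LLPKKTESHHKAKGK", "P20671", 115, 130),
    ("AQGGVLPNIQAV", "H0YFX9", 66, 78),
    ("AQGGVLPNIQAV", "A0A0U1RRH7", 103, 115),
    ("AQGGVLPNIQAV", "A0A0U1RR32", 103, 115),
    ("NAAPGVDLTQL", "P13645", 306, 317),
    ("DINTDGAVNF", "P05109", 58, 68),
]
table = pd.DataFrame(
    visible_rows, columns=["peptide", "protein", "start_index", "end_index"]
)

# Count the distinct parent proteins of each peptide
proteins_per_peptide = table.groupby("peptide")["protein"].nunique()

# Peptides mapped to more than one protein, and peptides mapped to exactly one
shared = proteins_per_peptide[proteins_per_peptide > 1].to_dict()
single = proteins_per_peptide[proteins_per_peptide == 1].index.tolist()

print(shared)
print(single)
```

The same two lines of `groupby` logic apply unchanged to the full table returned by the parser, since it uses the same column names.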
