# Tutorial One: Parsing Different Input Files

## Introduction:

IPTK supports several input file formats commonly used to report peptide identifications from proteomic experiments, for example pepXML and idXML, along with CSV tables. The IO module parses these files into a simple table that is used throughout the library, referred to as the identification table. The table consists of four columns: the peptide sequence, the parent protein accession, the start index of the peptide in the parent protein, and the end index of the peptide in the parent protein. An example of the identification table is shown below. Note that if a peptide can be mapped to several proteins, each mapping is treated individually, i.e., each parent protein has its own row.

#### Note

 - The data used in this tutorial is located in the data folder inside the Tutorials directory; the code here assumes that the current working directory is set to Tutorials.

In [1]:
## Make sure the library is installed; install it if it is not.
try: 
    import IPTK 
except ModuleNotFoundError: 
    import os
    os.system("pip install iptkl --user")

In [2]:
# Load the modules 
import os
import pandas as pd 
from IPTK.Utils.DevFunctions import simulate_an_experimental_ident_table_from_fasta
from IPTK.IO.InFunctions import load_identification_table, parse_xml_based_format_to_identification_table

## Identification table example

In [3]:
# Simulate an identification table. This function generates a dummy table that is useful when developing
# functions and tools for analyzing or parsing identification tables.
# Generate the table
table: pd.DataFrame = simulate_an_experimental_ident_table_from_fasta(
    path2load='data/human_proteome.fasta', num_pep=10, num_prot=2)
print(table)
# See the documentation for more information about the simulation function

            peptides proteins  start_index  end_index
0   TNNSVSKEIWLDFEDF   Q8N7X0          622        638
1   IVSQTTATQEKSQEEL   Q8N7X0          604        620
2        WSEADINSEKW   Q8N7X0           50         61
3      VCFSALVRWGEYG   Q8N7X0          685        698
4      VLVTRSRSCPLVA   Q8N7X0          456        469
5      EKMDESKYTSAPS   Q5T1N1          477        490
6        FSIVLHEKAPH   Q5T1N1          629        640
7     NTSDLEGPVAAGDS   Q5T1N1          192        206
8  PSPHFYSCRISGSKSLC   Q5T1N1          765        782
9     HLTLQQQVHKHEST   Q5T1N1          437        451
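
In the simulated table above every peptide happens to appear only once. The one-row-per-mapping convention from the introduction can be illustrated with a small hand-built table. This is only a sketch in plain pandas, not an IPTK call; the rows are copied from the loaded-table output shown later in this tutorial, and the name `toy_table` is purely illustrative.

```python
import pandas as pd

# Rows copied from the loaded-table output shown later in this tutorial:
# the first peptide maps to two different parent proteins, so it gets two rows.
rows = [
    ("LLPKKTESHHKAKGK", "Q6FI13", 115, 130),
    ("LLPKKTESHHKAKGK", "Q93077", 115, 130),
    ("DINTDGAVNF", "P05109", 58, 68),
]
toy_table = pd.DataFrame(
    rows, columns=["peptide", "protein", "start_index", "end_index"]
)

# Count the number of distinct parent proteins for each peptide.
proteins_per_peptide = toy_table.groupby("peptide")["protein"].nunique()
print(proteins_per_peptide)
```

Grouping by peptide in this way is a quick check for how many peptides in a table map to more than one protein.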


Below is an example of using the library to parse and analyze different inputs. 

### Load a pre-existing table

In [4]:
table: pd.DataFrame = load_identification_table('data/IdentificationTable.csv',sep=',')
print(table)

              peptide     protein  start_index  end_index
0     LLPKKTESHHKAKGK      Q6FI13          115        130
1     LLPKKTESHHKAKGK      Q93077          115        130
2     LLPKKTESHHKAKGK      P0C0S8          115        130
3     LLPKKTESHHKAKGK      P04908          115        130
4     LLPKKTESHHKAKGK      P20671          115        130
...               ...         ...          ...        ...
4662     AQGGVLPNIQAV      H0YFX9           66         78
4663     AQGGVLPNIQAV  A0A0U1RRH7          103        115
4664     AQGGVLPNIQAV  A0A0U1RR32          103        115
4665      NAAPGVDLTQL      P13645          306        317
4666       DINTDGAVNF      P05109           58         68

[4667 rows x 4 columns]


All other input formats are translated to this format internally. 
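
Because every other format ends up in this four-column layout, it can be useful to sanity-check a table before passing it on. The following is a minimal sketch using plain pandas rather than an IPTK function. It assumes the column names printed above and relies on the observation that, in the outputs shown in this tutorial, `end_index - start_index` equals the peptide length. The rows are copied from the output above; `check_table` and `expected_columns` are illustrative names.

```python
import io
import pandas as pd

# A few rows copied from the table printed above, stored as CSV text
# so that the example does not depend on any file on disk.
csv_text = """peptide,protein,start_index,end_index
LLPKKTESHHKAKGK,Q6FI13,115,130
AQGGVLPNIQAV,H0YFX9,66,78
NAAPGVDLTQL,P13645,306,317
DINTDGAVNF,P05109,58,68
"""
check_table = pd.read_csv(io.StringIO(csv_text), sep=",")

# 1) All four expected columns should be present.
expected_columns = ["peptide", "protein", "start_index", "end_index"]
missing = [col for col in expected_columns if col not in check_table.columns]

# 2) The span covered by the indices should match the peptide length
#    (an assumption inferred from the outputs in this tutorial).
span = check_table["end_index"] - check_table["start_index"]
consistent = span == check_table["peptide"].str.len()

print("missing columns:", missing)
print("all spans consistent:", bool(consistent.all()))
```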

### Load an idXML or pepXML file

In [5]:
## Below we load an idXML file that is used again in Tutorial 3. Here, it is introduced only to illustrate the general syntax of loading
## an idXML file. The same function can be used to load pepXML files by setting is_idXML to False.
## path2fastaDB is used to obtain the positions of the peptides in their parent proteins.
## This idXML file was generated using MHCquant, as described in Tutorial 3.
table: pd.DataFrame = parse_xml_based_format_to_identification_table(path2XML_file='data/0810202_0.5_all_ids_merged_psm_perc_filtered.idXML', 
                        is_idXML=True, path2fastaDB='data/human_proteome.fasta')
print(table)

              peptide     protein  start_index  end_index
0     LLPKKTESHHKAKGK      Q6FI13          115        130
1     LLPKKTESHHKAKGK      Q93077          115        130
2     LLPKKTESHHKAKGK      P0C0S8          115        130
3     LLPKKTESHHKAKGK      P04908          115        130
4     LLPKKTESHHKAKGK      P20671          115        130
...               ...         ...          ...        ...
4666     AQGGVLPNIQAV      H0YFX9           66         78
4667     AQGGVLPNIQAV  A0A0U1RRH7          103        115
4668     AQGGVLPNIQAV  A0A0U1RR32          103        115
4669      NAAPGVDLTQL      P13645          306        317
4670       DINTDGAVNF      P05109           58         68

[4667 rows x 4 columns]
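
The role of the FASTA database in the call above, locating each peptide inside its parent proteins, can be sketched in pure Python. This is not IPTK's implementation. The helpers `read_fasta` and `locate_peptide`, the toy accessions `TOY001`/`TOY002`, and their sequences are all hypothetical. The sketch also assumes UniProt-style headers and a 0-based start with an exclusive end, which is one convention compatible with the outputs above, where `end_index - start_index` equals the peptide length.

```python
def read_fasta(text):
    """Parse FASTA text into {accession: sequence}.

    Assumes UniProt-style headers such as '>sp|ACCESSION|NAME description';
    otherwise the first word of the header is used as the accession.
    """
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]
            parts = header.split("|")
            accession = parts[1] if len(parts) >= 3 else header
            records[accession] = ""
        elif accession is not None:
            records[accession] += line
    return records


def locate_peptide(peptide, proteins):
    """Return one (peptide, accession, start, end) tuple per matching protein.

    Only the first occurrence in each protein is reported.
    Assumed convention: 0-based start, exclusive end.
    """
    hits = []
    for accession, sequence in proteins.items():
        start = sequence.find(peptide)
        if start != -1:
            hits.append((peptide, accession, start, start + len(peptide)))
    return hits


# Hypothetical two-protein database; the sequences are made up.
toy_fasta = """>sp|TOY001|TOY1_HUMAN hypothetical protein one
MKTAYIAKQRQISFVKSHFSRQ
LEERLGLIEVQ
>sp|TOY002|TOY2_HUMAN hypothetical protein two
GSHMKQRQISFVKAA
"""

proteins = read_fasta(toy_fasta)
hits = locate_peptide("KQRQISFVK", proteins)
for hit in hits:
    print(hit)
```

Each tuple corresponds to one row of the identification table, so a peptide found in two proteins yields two rows.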


To learn more, please check [Tutorial 2](Tutorial_two_creating_an_experiment_object.ipynb).