# Tutorial One: Parsing Different Input Files

## Introduction:

IPTK supports several input file formats commonly used to report identifications from proteomics experiments, for example pepXML and idXML, along with CSV tables. These files are parsed by the IO module into a simple table that is used throughout the library, referred to as the identification table. The table consists of four columns: the peptide sequence, the parent protein accession, the start index of the peptide in the parent protein, and the end index of the peptide in the parent protein. An example of the identification table is shown below. Note that if a peptide can be mapped to several proteins, each mapping is treated individually, i.e., each parent protein gets its own row.
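
To make the table layout concrete, here is a minimal sketch of how one row per parent protein could be produced. It is not IPTK's implementation: `build_rows` is a hypothetical helper, the toy sequences and accessions are invented for illustration, and the indexing convention (0-based start, exclusive end) is an assumption. That assumption agrees with the printed tables in this tutorial, where `end_index - start_index` equals the peptide length.

```python
from typing import Dict, List, Tuple

# One row of the identification table:
# (peptide, protein, start_index, end_index)
Row = Tuple[str, str, int, int]


def build_rows(peptide: str, proteins: Dict[str, str]) -> List[Row]:
    """Return one row per parent protein that contains the peptide.

    Hypothetical helper for illustration; assumes a 0-based start index
    and an exclusive end index.
    """
    rows: List[Row] = []
    for accession, sequence in proteins.items():
        start = sequence.find(peptide)  # -1 if the peptide is absent
        if start == -1:
            continue
        rows.append((peptide, accession, start, start + len(peptide)))
    return rows


# Toy sequences and accessions, invented for this illustration only.
toy_proteins = {
    "PROT_A": "MKTAYIAKQRLLPKK",
    "PROT_B": "AYIAKQR",
    "PROT_C": "GGGGG",
}

rows = build_rows("AYIAKQR", toy_proteins)
for row in rows:
    print(row)
```

The peptide occurs in two of the three toy proteins, so it yields two rows, one per parent protein.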

#### Note

The data used in this tutorial is available in the GitHub repository. To run the tutorial, clone the repository, e.g. with: `git clone https://github.com/ikmb/iptoolkit.git`  
The code here assumes that the current working directory is set to the Tutorials folder.

In [1]:
## Make sure the library is installed; install it if it is missing.
try: 
    import IPTK 
except ModuleNotFoundError: 
    import os
    os.system("pip install iptkl --user")

In [2]:
# Load the modules 
import os
import pandas as pd 
from IPTK.Utils.DevFunctions import simulate_an_experimental_ident_table_from_fasta
from IPTK.IO.InFunctions import load_identification_table, parse_xml_based_format_to_identification_table

### Load a pre-existing table

#### Note

The input table is a CSV file with four columns and a header. Attention: the columns need to be in the defined order (peptide, protein, start_index, end_index).  

In [4]:
table: pd.DataFrame = load_identification_table('data/IdentificationTable.csv',sep=',')
print(table)

              peptide     protein  start_index  end_index
0     LLPKKTESHHKAKGK      Q6FI13          115        130
1     LLPKKTESHHKAKGK      Q93077          115        130
2     LLPKKTESHHKAKGK      P0C0S8          115        130
3     LLPKKTESHHKAKGK      P04908          115        130
4     LLPKKTESHHKAKGK      P20671          115        130
...               ...         ...          ...        ...
4662     AQGGVLPNIQAV      H0YFX9           66         78
4663     AQGGVLPNIQAV  A0A0U1RRH7          103        115
4664     AQGGVLPNIQAV  A0A0U1RR32          103        115
4665      NAAPGVDLTQL      P13645          306        317
4666       DINTDGAVNF      P05109           58         68

[4667 rows x 4 columns]


All other input formats are translated to this format internally. 
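
Because the loader depends on the column order, it can be useful to check a CSV header before loading the file. The sketch below uses only the standard library. `header_matches` is a hypothetical helper, not an IPTK function, and it assumes the header uses the column names shown in the printed table above.

```python
import csv
import io

# Column names in the order required for the identification table.
EXPECTED_COLUMNS = ["peptide", "protein", "start_index", "end_index"]


def header_matches(csv_text: str, sep: str = ",") -> bool:
    """Return True if the first CSV row equals the expected header.

    Hypothetical helper for illustration; not part of IPTK.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=sep)
    header = next(reader, [])  # empty list if the text has no rows
    return [name.strip() for name in header] == EXPECTED_COLUMNS


# A row taken from the table above, once with the right column order
# and once with the first two columns swapped.
good = "peptide,protein,start_index,end_index\nDINTDGAVNF,P05109,58,68\n"
bad = "protein,peptide,start_index,end_index\nP05109,DINTDGAVNF,58,68\n"

print(header_matches(good))
print(header_matches(bad))
```

For a file on disk, the same check can be applied to its first line before calling `load_identification_table`.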

### Load an idXML file

#### Note

Below we load an idXML file that will be used again in Tutorial 3. Here, it only illustrates the general syntax for loading an idXML file. The idXML files were generated using MHCQuant, as described in Tutorial 3.

In [5]:
## Load the table:
#-----------------
table: pd.DataFrame = parse_xml_based_format_to_identification_table(
    path2XML_file='data/0810202_0.5_all_ids_merged_psm_perc_filtered.idXML', ## The path to the idXML file
    is_idXML=True, ## Control flag for the function to read an idXML file 
    path2fastaDB='data/human_proteome.fasta' ## The path to the sequence database, used to get the positions of the peptides in the parent proteins.
    )
## Print the table:
#------------------
print(table)

              peptide     protein  start_index  end_index
0     LLPKKTESHHKAKGK      Q6FI13          115        130
1     LLPKKTESHHKAKGK      Q93077          115        130
2     LLPKKTESHHKAKGK      P0C0S8          115        130
3     LLPKKTESHHKAKGK      P04908          115        130
4     LLPKKTESHHKAKGK      P20671          115        130
...               ...         ...          ...        ...
4666     AQGGVLPNIQAV      H0YFX9           66         78
4667     AQGGVLPNIQAV  A0A0U1RRH7          103        115
4668     AQGGVLPNIQAV  A0A0U1RR32          103        115
4669      NAAPGVDLTQL      P13645          306        317
4670       DINTDGAVNF      P05109           58         68

[4667 rows x 4 columns]


#### Note

Sometimes the search engine maps a peptide, for example _GPDGRLLRGHNQYAYDGK_, to a protein such as the following: _XPDGRLLRGHNQYAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAARVAEQDRAYLEGTCVEWLRRYLENGKDTLERADPPKTHVTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDRTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEPSSQSTVPIVGIVAGLAVLAVVVIGAVVAAVMCRRKSSGHFLPTGGKGGSYSQAACSDSAQGSDVSLTA_

When the function tries to map the peptide to the protein to extract the start and end indices, a **ValueError** exception can be raised because the peptide is not part of the protein. To understand why, let's look at the peptide sequence and its match in the protein.

 - **Peptide**: _GPDGRLLRGHNQYAYDGK_
 - **Protein**: _XPDGRLLRGHNQYAYDGK_

As you can see, the protein has an **X** where the peptide has a **G**, hence there is no complete match. The default behaviour of the function is to skip these peptides. However, this behaviour can be overridden so that the function raises an error when a mismatch is encountered, as shown below.  



In [7]:
try:
    table: pd.DataFrame = parse_xml_based_format_to_identification_table(path2XML_file='data/0810202_0.5_all_ids_merged_psm_perc_filtered.idXML', 
                        is_idXML=True, path2fastaDB='data/human_proteome.fasta',
                        remove_if_not_matched=False # throw an error if there is no match
                        )
except ValueError as exp:
    print(f'I have encountered the following exception:\n {exp}')

I have encountered the following exception:
 Peptide sequence: GPDGRLLRGHNQYAYDGK could not be extracted from protein sequence: XPDGRLLRGHNQYAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAARVAEQDRAYLEGTCVEWLRRYLENGKDTLERADPPKTHVTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDRTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEPSSQSTVPIVGIVAGLAVLAVVVIGAVVAAVMCRRKSSGHFLPTGGKGGSYSQAACSDSAQGSDVSLTA with accession: A0A140T951
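
The two behaviours described above, skipping an unmatched peptide or raising an error, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code used by `parse_xml_based_format_to_identification_table`. The helper `locate_peptide` and its `skip_unmatched` flag are hypothetical names, and the protein is truncated to the first 18 residues of the sequence shown above.

```python
from typing import Optional, Tuple


def locate_peptide(peptide: str, protein: str, accession: str,
                   skip_unmatched: bool = True) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the peptide in the protein.

    Hypothetical helper for illustration. If the peptide is not an exact
    substring, return None when skip_unmatched is True, otherwise raise
    a ValueError.
    """
    start = protein.find(peptide)  # exact match only; -1 if not found
    if start == -1:
        if skip_unmatched:
            return None
        raise ValueError(
            f"Peptide sequence: {peptide} could not be extracted from "
            f"the protein with accession: {accession}"
        )
    return start, start + len(peptide)


# First 18 residues of the protein from the example above.
protein_prefix = "XPDGRLLRGHNQYAYDGK"
peptide = "GPDGRLLRGHNQYAYDGK"

# Skipping mode: the mismatch (X vs. G) yields None instead of an error.
skipped = locate_peptide(peptide, protein_prefix, "A0A140T951")
print(skipped)

# Strict mode: the same mismatch raises a ValueError.
try:
    locate_peptide(peptide, protein_prefix, "A0A140T951", skip_unmatched=False)
except ValueError as exp:
    print(f"I have encountered the following exception:\n {exp}")

# Without the mismatching first residue the peptide is found.
found = locate_peptide(peptide[1:], protein_prefix, "A0A140T951")
print(found)
```

Skipping keeps a parsing run going at the cost of silently dropping rows, whereas the strict mode makes such mismatches visible, which helps when checking whether the search database and the FASTA file agree.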


To learn more, please check [Tutorial 2](Tutorial_two_creating_an_experiment_object.ipynb).