# Matched pairs tutorial

This tutorial walks through several examples of extracting matched-pair information from a SAR dataset, which can be loaded from local CSV files or from dataframes in novodataset. The functions are flexible with respect to the number of reference peptides contained in the molecules. The examples below are organized by the question to answer.

Before starting, the pepfunn package can be installed as follows:

In [None]:
!pip install --user pepfunn

### 1. Using a dataset with a reference peptide

As an example, we provide a reduced tab-separated file with data from a GLP1 project, including experimental potency assays. The goal is to enumerate all the pairs and identify the mutations that increase or decrease the activity relative to a reference peptide, in this case the GLP1 substrate. First we import the matched pairs module:

In [1]:
# Import matched pairs module
from pepfunn.NN.matchedpairs import MatchedPairs

Then we read the file (or download the dataset) and define the columns containing the assays and the column holding the NNCD id. Finally, we list the reference substrates contained in the peptides, chosen from the hard-coded ones: `['INSULIN', 'INSULIN_A', 'INSULIN_B', 'GLP1', 'AMYLIN', 'CALCITONIN', 'PYY', 'EXENDIN']`. For this case we select `GLP1`.

In [2]:
# Import pandas
import pandas as pd

# Columns containing the assays
property_columns = ['GLP-1R_EC50_sema_norm', 'GLP-1R_EC50_w/HSA_sema_norm']

# Read the tsv file
df = pd.read_csv('example_glp1_hv.tsv', sep='\t')

# Disregard lines with NaN in the 'property_column' values
df = df.dropna(subset=property_columns)

# Reset the indices
df = df.reset_index(drop=True)

# Column holding the NNCD id
id_column='NNCD'

# List with the reference substrates (it could be more than one)
references=['GLP1']

df

Unnamed: 0,NNCD,Sequence,GLP1-R_pEC50_Norm,GLP-1R_EC50_sema_norm,GLP-1R_EC50_w/HSA_sema_norm,GLP-1R_pEC50_w/HSA_Norm
0,0721-0000-1909,HWEGTFTSDVSSYLERQAARXFIAWLVGGGG,0.150695,1.4148,0.7408,2.194348
1,0721-0000-1688,HGEGTFTSDVSSYLEGTAAXEFIAWIVRGQG,1.374765,23.7009,15.4223,3.51281
2,0721-0000-1911,HWEGTFTSDVSSYLERQAARXFDAWLVGGRG,2.840083,691.9625,723.1285,5.183876
3,0721-0000-1371,HAEGTFTSDVSSYLEGQAAXHFIAWLVRGRP,0.158423,1.4402,0.5072,2.02983
4,0721-0000-1455,HAEGTFTSDVSSYLEGQAAXLFIAWLRRGGG,0.158875,1.4417,0.505,2.02792
5,0721-0000-1946,HDEGTFTSDVSSYLEGQAARXFIAGLVRGRP,1.296294,19.7831,18.101,3.582365
6,0721-0000-1637,HGDGTFTSDVSSYLEGRAAXEFIAWLVRGRP,0.232157,1.7067,1.1316,2.378362
7,0721-0000-1683,HGEGTFTSDVSSYLEGQAAXEFGAWLVRPRI,1.71213,51.5383,38.4144,3.909156
8,0721-0000-1839,HGEGTFTSDVSSYLEGQAARXFVAWLVRIRI,0.494892,3.1253,0.8872,2.2727
9,0721-0000-1403,HAEGTFTSDVSSYLEWQTAXEFIAYLVRGRG,0.485267,3.0568,1.839,2.589246


The sequence does not need to be a column in the dataframe, because the function can retrieve it from the NNCD id. With these variables, we call the first function, which extracts the region of the sequence where the substrate is located, plus some additional columns explained below:

In [3]:
df_sequences = MatchedPairs.get_sequences(df, id_column, references)
df_sequences

Unnamed: 0,ID,Begin,End,Indexes,Peptide,GLP1,Modifications
0,0721-0000-1909,,G,0,HWEGTFTSDVSSYLERQAARKFIAWLVGGGG,HWEGTFTSDVSSYLERQAARKFIAWLVGGG,"{21: ['Modification1'], 31: ['amid']}"
1,0721-0000-1688,,G,0,HGEGTFTSDVSSYLEGTAAKEFIAWIVRGQG,HGEGTFTSDVSSYLEGTAAKEFIAWIVRGQ,"{20: ['Modification1'], 31: ['amid']}"
2,0721-0000-1911,,G,0,HWEGTFTSDVSSYLERQAARKFDAWLVGGRG,HWEGTFTSDVSSYLERQAARKFDAWLVGGR,"{21: ['Modification1'], 31: ['amid']}"
3,0721-0000-1371,,P,0,HAEGTFTSDVSSYLEGQAAKHFIAWLVRGRP,HAEGTFTSDVSSYLEGQAAKHFIAWLVRGR,"{20: ['Modification1'], 31: ['amid']}"
4,0721-0000-1455,,G,0,HAEGTFTSDVSSYLEGQAAKLFIAWLRRGGG,HAEGTFTSDVSSYLEGQAAKLFIAWLRRGG,"{20: ['Modification1'], 31: ['amid']}"
5,0721-0000-1946,,P,0,HDEGTFTSDVSSYLEGQAARKFIAGLVRGRP,HDEGTFTSDVSSYLEGQAARKFIAGLVRGR,"{21: ['Modification1'], 31: ['amid']}"
6,0721-0000-1637,,P,0,HGDGTFTSDVSSYLEGRAAKEFIAWLVRGRP,HGDGTFTSDVSSYLEGRAAKEFIAWLVRGR,"{20: ['Modification1'], 31: ['amid']}"
7,0721-0000-1683,,I,0,HGEGTFTSDVSSYLEGQAAKEFGAWLVRPRI,HGEGTFTSDVSSYLEGQAAKEFGAWLVRPR,"{20: ['Modification1'], 31: ['amid']}"
8,0721-0000-1839,,I,0,HGEGTFTSDVSSYLEGQAARKFVAWLVRIRI,HGEGTFTSDVSSYLEGQAARKFVAWLVRIR,"{21: ['Modification1'], 31: ['amid']}"
9,0721-0000-1403,,G,0,HAEGTFTSDVSSYLEWQTAKEFIAYLVRGRG,HAEGTFTSDVSSYLEWQTAKEFIAYLVRGR,"{20: ['Modification1'], 31: ['amid']}"


The generated dataframe contains the following:
- **ID:** NNCD number
- **Begin:** The sequence fragment before the start of the substrate sequence
- **End:** The sequence fragment after the end of the substrate sequence
- **Indexes:** Index where the fragment starts in the original sequence (useful for enumeration purposes)
- **Peptide:** Original peptide sequence
- **GLP1 (named based on the reference):** This will contain the fragment that will be compared. The numbering will be defined based on the reference.
- **Modifications:** Dictionary containing all the reported modifications of the molecules in the NNCD, with the numbering based on the original peptide
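
The relation between the `Begin`, `End` and `Indexes` columns can be pictured with a minimal sketch. It assumes the fragment is already known and is found by exact string search; `split_around_fragment` is a hypothetical helper written for illustration, not part of pepfunn, and the package itself may locate the fragment differently. The sequences are taken from the first row of the table above.

```python
def split_around_fragment(peptide: str, fragment: str):
    """Return (begin, end, index) for `fragment` inside `peptide`."""
    index = peptide.find(fragment)  # 0-based start of the fragment
    if index == -1:
        raise ValueError("fragment not found in peptide")
    begin = peptide[:index]                  # residues before the fragment
    end = peptide[index + len(fragment):]    # residues after the fragment
    return begin, end, index


# First row of the table above: peptide and its GLP1 fragment
begin, end, index = split_around_fragment(
    "HWEGTFTSDVSSYLERQAARKFIAWLVGGGG",
    "HWEGTFTSDVSSYLERQAARKFIAWLVGGG",
)
print(repr(begin), repr(end), index)  # '' 'G' 0
```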

The previous function is the most demanding step, and its run time is proportional to the number of sequences. The delay comes from extracting the modifications directly from the NNCD. If the modifications are not relevant, the function can be called as:

In [4]:
df_seq_no_mod = MatchedPairs.get_sequences(df, id_column, references, add_mod=False)
df_seq_no_mod

Unnamed: 0,ID,Begin,End,Indexes,Peptide,GLP1
0,0721-0000-1909,,G,0,HWEGTFTSDVSSYLERQAARKFIAWLVGGGG,HWEGTFTSDVSSYLERQAARKFIAWLVGGG
1,0721-0000-1688,,G,0,HGEGTFTSDVSSYLEGTAAKEFIAWIVRGQG,HGEGTFTSDVSSYLEGTAAKEFIAWIVRGQ
2,0721-0000-1911,,G,0,HWEGTFTSDVSSYLERQAARKFDAWLVGGRG,HWEGTFTSDVSSYLERQAARKFDAWLVGGR
3,0721-0000-1371,,P,0,HAEGTFTSDVSSYLEGQAAKHFIAWLVRGRP,HAEGTFTSDVSSYLEGQAAKHFIAWLVRGR
4,0721-0000-1455,,G,0,HAEGTFTSDVSSYLEGQAAKLFIAWLRRGGG,HAEGTFTSDVSSYLEGQAAKLFIAWLRRGG
5,0721-0000-1946,,P,0,HDEGTFTSDVSSYLEGQAARKFIAGLVRGRP,HDEGTFTSDVSSYLEGQAARKFIAGLVRGR
6,0721-0000-1637,,P,0,HGDGTFTSDVSSYLEGRAAKEFIAWLVRGRP,HGDGTFTSDVSSYLEGRAAKEFIAWLVRGR
7,0721-0000-1683,,I,0,HGEGTFTSDVSSYLEGQAAKEFGAWLVRPRI,HGEGTFTSDVSSYLEGQAAKEFGAWLVRPR
8,0721-0000-1839,,I,0,HGEGTFTSDVSSYLEGQAARKFVAWLVRIRI,HGEGTFTSDVSSYLEGQAARKFVAWLVRIR
9,0721-0000-1403,,G,0,HAEGTFTSDVSSYLEWQTAKEFIAYLVRGRG,HAEGTFTSDVSSYLEWQTAKEFIAYLVRGR


If more sequences are added to the initial dataframe, we can pass the previous dataframe as a backup so that the initial sequences are not requested again and the calculation runs faster:

In [5]:
df_seq_updated = MatchedPairs.get_sequences(df, id_column, references, ref_df_sequences=df_sequences)
df_seq_updated

Unnamed: 0,ID,Begin,End,Indexes,Peptide,GLP1,Modifications
0,0721-0000-1909,,G,0,HWEGTFTSDVSSYLERQAARKFIAWLVGGGG,HWEGTFTSDVSSYLERQAARKFIAWLVGGG,"{21: ['Modification1'], 31: ['amid']}"
1,0721-0000-1688,,G,0,HGEGTFTSDVSSYLEGTAAKEFIAWIVRGQG,HGEGTFTSDVSSYLEGTAAKEFIAWIVRGQ,"{20: ['Modification1'], 31: ['amid']}"
2,0721-0000-1911,,G,0,HWEGTFTSDVSSYLERQAARKFDAWLVGGRG,HWEGTFTSDVSSYLERQAARKFDAWLVGGR,"{21: ['Modification1'], 31: ['amid']}"
3,0721-0000-1371,,P,0,HAEGTFTSDVSSYLEGQAAKHFIAWLVRGRP,HAEGTFTSDVSSYLEGQAAKHFIAWLVRGR,"{20: ['Modification1'], 31: ['amid']}"
4,0721-0000-1455,,G,0,HAEGTFTSDVSSYLEGQAAKLFIAWLRRGGG,HAEGTFTSDVSSYLEGQAAKLFIAWLRRGG,"{20: ['Modification1'], 31: ['amid']}"
5,0721-0000-1946,,P,0,HDEGTFTSDVSSYLEGQAARKFIAGLVRGRP,HDEGTFTSDVSSYLEGQAARKFIAGLVRGR,"{21: ['Modification1'], 31: ['amid']}"
6,0721-0000-1637,,P,0,HGDGTFTSDVSSYLEGRAAKEFIAWLVRGRP,HGDGTFTSDVSSYLEGRAAKEFIAWLVRGR,"{20: ['Modification1'], 31: ['amid']}"
7,0721-0000-1683,,I,0,HGEGTFTSDVSSYLEGQAAKEFGAWLVRPRI,HGEGTFTSDVSSYLEGQAAKEFGAWLVRPR,"{20: ['Modification1'], 31: ['amid']}"
8,0721-0000-1839,,I,0,HGEGTFTSDVSSYLEGQAARKFVAWLVRIRI,HGEGTFTSDVSSYLEGQAARKFVAWLVRIR,"{21: ['Modification1'], 31: ['amid']}"
9,0721-0000-1403,,G,0,HAEGTFTSDVSSYLEWQTAKEFIAYLVRGRG,HAEGTFTSDVSSYLEWQTAKEFIAYLVRGR,"{20: ['Modification1'], 31: ['amid']}"
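
The idea behind passing a previous dataframe as backup can be sketched in plain pandas. This is only an illustration of the caching pattern under stated assumptions: `rows_with_cache` and `slow_lookup` are hypothetical names, the IDs and sequences are placeholders, and the actual logic inside `get_sequences` may differ.

```python
import pandas as pd


def rows_with_cache(ids, cached, compute_row):
    """Reuse rows already present in `cached`; call `compute_row` only for new IDs."""
    cached_ids = set(cached["ID"])
    missing = [pid for pid in ids if pid not in cached_ids]
    reused = cached[cached["ID"].isin(ids)]
    if not missing:
        return reused.reset_index(drop=True)
    fresh = pd.DataFrame([compute_row(pid) for pid in missing])
    return pd.concat([reused, fresh], ignore_index=True)


calls = []


def slow_lookup(pid):
    # Stand-in for an expensive request; records which IDs were actually requested
    calls.append(pid)
    return {"ID": pid, "Peptide": "PLACEHOLDER"}


cached = pd.DataFrame({"ID": ["A", "B"], "Peptide": ["HAEG", "HGEG"]})
updated = rows_with_cache(["A", "B", "C"], cached, slow_lookup)
print(calls)  # ['C']
```

Only the ID that is absent from the cached dataframe triggers the expensive call; the other rows are copied over unchanged.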


After this, we can calculate all the pairs. We use the `df_sequences` dataframe to assign all the possible mutations, taking into account the modifications extracted from the NNCD.

In [6]:
df_pairs = MatchedPairs.get_pairs(df, df_sequences, property_columns, references, operation_columns=['divide', 'substract'], use_mod=True)
df_pairs

Unnamed: 0,ID_A,ID_B,Begin_A,Begin_B,End_A,End_B,Indexes_A,Indexes_B,Mutations_GLP1,Origin_Mutations_GLP1,Distance_GLP1,Diff_GLP-1R_EC50_sema_norm,Min_GLP-1R_EC50_sema_norm,Max_GLP-1R_EC50_sema_norm,Diff_GLP-1R_EC50_w/HSA_sema_norm,Min_GLP-1R_EC50_w/HSA_sema_norm,Max_GLP-1R_EC50_w/HSA_sema_norm
0,0721-0000-1909,0721-0000-1688,,,G,G,0,0,(2)W/G | (16)R/G | (17)Q/T | (20)R/Mod1 | (21)...,(2)W->(2)G | (16)R->(16)G | (17)Q->(17)T | (20...,8,0.059694,1.4148,23.7009,-14.6815,0.7408,15.4223
1,0721-0000-1909,0721-0000-1911,,,G,G,0,0,(23)I/D | (30)G/R,(23)I->(23)D | (30)G->(30)R,2,0.002045,1.4148,691.9625,-722.3877,0.7408,723.1285
2,0721-0000-1909,0721-0000-1371,,,G,P,0,0,(2)W/A | (16)R/G | (20)R/Mod1 | (21)Mod1/H | (...,(2)W->(2)A | (16)R->(16)G | (20)R->(20)Mod1 | ...,6,0.982364,1.4148,1.4402,0.2336,0.5072,0.7408
3,0721-0000-1909,0721-0000-1455,,,G,G,0,0,(2)W/A | (16)R/G | (20)R/Mod1 | (21)Mod1/L | (...,(2)W->(2)A | (16)R->(16)G | (20)R->(20)Mod1 | ...,6,0.981341,1.4148,1.4417,0.2358,0.5050,0.7408
4,0721-0000-1909,0721-0000-1946,,,G,P,0,0,(2)W/D | (16)R/G | (25)W/G | (28)G/R | (30)G/R,(2)W->(2)D | (16)R->(16)G | (25)W->(25)G | (28...,5,0.071516,1.4148,19.7831,-17.3602,0.7408,18.1010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
751,0721-0000-1649,0721-0000-1841,,,I,I,0,0,(3)D/E | (19)V/A | (20)Mod1/R | (21)E/Mod1 | (...,(3)D->(3)E | (19)V->(19)A | (20)Mod1->(20)R | ...,6,0.006315,2.1236,336.2905,-194.3637,0.9228,195.2865
752,0721-0000-1902,0721-0000-1841,,,G,I,0,0,(29)G/P | (30)S/R,(29)G->(29)P | (30)S->(30)R,2,0.047190,15.8696,336.2905,-189.2805,6.0060,195.2865
753,0721-0000-1649,0721-0000-1855,,,I,G,0,0,(3)D/E | (15)E/T | (19)V/A | (20)Mod1/R | (21)...,(3)D->(3)E | (15)E->(15)T | (19)V->(19)A | (20...,7,0.481706,2.1236,4.4085,-0.8807,0.9228,1.8035
754,0721-0000-1902,0721-0000-1855,,,G,G,0,0,(15)E/T | (23)W/I | (24)A/T | (26)L/I | (30)S/R,(15)E->(15)T | (23)W->(23)I | (24)A->(24)T | (...,5,3.599773,4.4085,15.8696,4.2025,1.8035,6.0060


From the 27 sequences, a total of 756 matched pairs were detected with directionality A->B. The `operation_columns` argument describes how the experimental values are compared (one entry per property column in the call above), using either the `substract` or the `divide` operation (by default the values are divided). The `use_mod` argument can be switched on or off if the modifications were captured in the `df_sequences` dataframe. The following columns are generated:

- **ID_A:** NNCD number of peptide A
- **ID_B:** NNCD number of peptide B
- **Begin_A:** The sequence fragment before the start of the substrate sequence in peptide A
- **Begin_B:** The sequence fragment before the start of the substrate sequence in peptide B
- **End_A:** The sequence fragment after the end of the substrate sequence in peptide A
- **End_B:** The sequence fragment after the end of the substrate sequence in peptide B
- **Indexes_A:** Index where the fragment starts in the original sequence of peptide A
- **Indexes_B:** Index where the fragment starts in the original sequence of peptide B
- **Mutations_GLP1 (named based on the reference):** List of mutations found in the reference fragments, following the format `(1)C/D | (2)E/F`, where the number in parentheses is the position based on the reference, and the mutation replaces `C` by `D` or `E` by `F`. If multiple mutations are reported, they are separated by the `|` symbol.
- **Origin_Mutations_GLP1 (named based on the reference):** The same mutations, but numbered based on the original sequences. The position number can therefore change depending on the sequence length of each peptide. The format looks like `(1)C->(2)D | (2)E->(3)F`, where each amino acid carries its own numbering and the direction is represented by the arrow `->`. If multiple mutations are present, they are separated by the `|` symbol.
- **Distance_GLP1 (named based on the reference):** The Hamming distance (number of mutations) between the fragments containing the reference peptide.
- **Diff, Min and Max columns per property:** For each property column, the Diff column reports the result of applying the chosen operation to the assay values of the pair. This can be used to decide which mutations improve or worsen the property. The Min and Max values are also reported so that pairs with erroneous or undesired activity values can be filtered out.
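
The `Mutations` format and the Hamming distance described above can be reproduced with a short sketch for two fragments of equal length. `list_mutations` is a hypothetical helper for illustration only; it compares plain residues and ignores the chemical modifications that `get_pairs` can take into account with `use_mod=True`. The two fragments are the GLP1 fragments of `0721-0000-1909` and `0721-0000-1911` from the sequences table.

```python
def list_mutations(frag_a: str, frag_b: str):
    """Return the mutation string and the Hamming distance between two fragments."""
    if len(frag_a) != len(frag_b):
        raise ValueError("fragments must have the same length")
    # Positions are 1-based, following the numbering of the reference fragment
    mutations = [
        f"({pos}){a}/{b}"
        for pos, (a, b) in enumerate(zip(frag_a, frag_b), start=1)
        if a != b
    ]
    return " | ".join(mutations), len(mutations)


mutations, distance = list_mutations(
    "HWEGTFTSDVSSYLERQAARKFIAWLVGGG",
    "HWEGTFTSDVSSYLERQAARKFDAWLVGGR",
)
print(mutations, distance)  # (23)I/D | (30)G/R 2
```

The result matches the second row of the pairs table, where both peptides carry the same modification and only two residues differ.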

The `df_pairs` dataframe can be filtered further depending on the question to solve, for example to find cases where a given number of mutations improves the activity towards a receptor, or to focus on the mutations that decrease the activity in an experimental assay. How the data is analyzed is up to the researcher and the project. The CSV can also be analyzed in Spotfire, and such dataframes can be uploaded to novodatasets to be included as official SAR tables in RProjects (with support from Data Stewards).
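
As a minimal sketch of such a filter, the snippet below rebuilds four rows of the pairs table by hand (values copied from the output above) and keeps the pairs that differ by at most two mutations and whose ratio is below an arbitrary threshold of 0.1. With the `divide` operation the Diff values in the output are consistent with value A divided by value B (for example 1.4148 / 691.9625 ≈ 0.002045), so a ratio below 1 means that peptide A has the lower value; the threshold itself is an assumption chosen for illustration.

```python
import pandas as pd

diff_col = "Diff_GLP-1R_EC50_sema_norm"

# Four rows copied from the pairs table above (only the columns needed here)
pairs = pd.DataFrame({
    "ID_A": ["0721-0000-1909", "0721-0000-1909", "0721-0000-1902", "0721-0000-1902"],
    "ID_B": ["0721-0000-1911", "0721-0000-1371", "0721-0000-1841", "0721-0000-1855"],
    "Distance_GLP1": [2, 6, 2, 5],
    diff_col: [0.002045, 0.982364, 0.047190, 3.599773],
})

# Keep close analogues (at most 2 mutations) with a large change in the assay ratio
close = pairs[(pairs["Distance_GLP1"] <= 2) & (pairs[diff_col] < 0.1)]
close = close.sort_values(diff_col).reset_index(drop=True)
print(close[["ID_A", "ID_B", "Distance_GLP1", diff_col]])
```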

### 2. Using a dataset with a peptide containing two reference peptides provided by the user (not hard-coded)

Here we use as an example part of a dataframe containing peptides with both GLP1 and AMYLIN fragments, so each sequence contains two reference substrates. However, instead of choosing them from the hard-coded list, we provide the reference sequences directly and call the categories `GLP1-per` and `CALCITONIN-per`. The other variables are defined as in the first example.

In [2]:
# Import matched pairs module
from pepfunn.NN.matchedpairs import MatchedPairs
import pandas as pd

# Columns containing the assays
property_columns = ['GLP1R CRE luc 0% HSA EC50 [pM]_relative', 'hGIPR CRE-luc 1% Ova 0% HSA (AMKH) EC50 [pM]_relative']

# Read the tsv file
df = pd.read_csv('example_triple.tsv', sep='\t')

# Disregard lines with NaN in the 'property_column' values
df = df.dropna(subset=property_columns)

# Reset the indices
df = df.reset_index(drop=True)

# Column holding the NNC id
id_column='NNC no. (API)'

# List with the personalized reference substrates (any name can be used)
references=['GLP1-per', 'CALCITONIN-per']

# List of reference sequences
seq_refs = ['HAEGTFTSDVSSYLEGQAAKEFIAWLVKGR', 'ASHLSTAQTARLSAELHQLATLPRTETGSGSP']

df

Unnamed: 0,NNC no. (API),Protein sequence,Sequence length,Format,PepTalk Acylation Sites,hAmyR3 potency. EC50 (pM). Luciferase. 1% OVA. no HSA. 384 format EC50 [pM]_relative,hGIPR CRE-luc 1% Ova 0% HSA (AMKH) EC50 [pM]_relative,GLP1R CRE luc 0% HSA EC50 [pM]_relative
0,0487-0000-6009,YGEGTFTSDYSIYLEKQAAGEFVNWLLAGGPSSGASELSTAALGRL...,66,Linear,GLP-1 22K,2.905254,2.145647,22.915722
1,0487-0000-6010,YGEGTFTSDYSIYLEKQAAGEFVNWLLAGGASELSTAALGRLSAEL...,62,Linear,GLP-1 22K,,6.175844,68.494658
2,0487-0000-6008,YGEGTFTSDYSIYLEKQAAGEFVNWLLAGGPSSGAPPPSASELSTA...,71,Linear,GLP-1 22K,,5.619993,85.898372
3,0638-0000-0039,yGEGTFISDYSIALDRIAQQKFVEWLLAQRPGGGGEASELSTAALG...,68,Building block,-16K,3.864916,0.58281,3095.648858
4,0487-0000-0923,YGEGTFTSDYSILLEKQAAREFIEWLLAGGPSSGAPPPSGEASELS...,73,Linear,GLP-1 22K,3.152673,5.192381,55.369728
5,0487-0000-6013,YGEGTFTSDYSILLEEQAAREFIEWLLAGGPSKGAPPPSGGGEASE...,75,Linear,LS2K,4.395929,1.933707,4.43441
6,0638-0000-2002,YGEGTFTSDYSILLEKQAAREFIEWLLAGGPSSGAPPPSGGGGGCS...,76,Building block,-29K,2.14783,5.27584,27.428402
7,0638-0000-2003,YGEGTFTSDYSILLEKQAAREFIEWLLAGGPSSGAPPPSGGGGGAS...,76,Building block,-29K,1.116773,6.070733,50.413309
8,0487-0000-6014,YGEGTFTSDYSIGLDKIAQKAFVQWLIAGGPSSGAPPPSEASELST...,72,Linear,GLP-1 26K,3.319658,4.098126,121.229757
9,0487-0000-6015,YGEGTFTSDYSIGLDKIAQKAFVQWLIAGGPSSGAPPPSGGGGEAS...,76,Linear,GLP-1 26K,3.387125,2.434991,87.858201


We can call the function to get the sequences based on the reference peptide list. As in the previous case, the modifications per peptide will be stored in a dictionary, and in addition a `Linker` column is created with the sequence between the two reference peptides.

In [3]:
df_sequences = MatchedPairs.get_sequences(df, id_column, references, seq_refs=seq_refs)
df_sequences

Unnamed: 0,ID,Linker,Begin,End,Indexes,Peptide,GLP1-per,CALCITONIN-per,Modifications
0,0487-0000-6009,PSSG,,,0-34,YGEGTFTSDYSIYLEKQAAGEFVNWLLAGGPSSGASELSTAALGRL...,YGEGTFTSDYSIYLEKQAAGEFVNWLLAGG,ASELSTAALGRLSAELHELATLPRTETGSGSP,"{16: ['Modification1'], 2: ['Aib'], 20: ['Aib'..."
1,0487-0000-6010,,,,0-30,YGEGTFTSDYSIYLEKQAAGEFVNWLLAGGASELSTAALGRLSAEL...,YGEGTFTSDYSIYLEKQAAGEFVNWLLAGG,ASELSTAALGRLSAELHELATLPRTETGSGSP,"{16: ['Modification1'], 2: ['Aib'], 20: ['Aib'..."
2,0487-0000-6008,PSSGAPPPS,,,0-39,YGEGTFTSDYSIYLEKQAAGEFVNWLLAGGPSSGAPPPSASELSTA...,YGEGTFTSDYSIYLEKQAAGEFVNWLLAGG,ASELSTAALGRLSAELHELATLPRTETGSGSP,"{16: ['Modification1'], 2: ['Aib'], 20: ['Aib'..."
3,0638-0000-0039,PGGGGE,,,0-36,YGEGTFISDYSIALDRIAQQKFVEWLLAQRPGGGGEASELSTAALG...,YGEGTFISDYSIALDRIAQQKFVEWLLAQR,ASELSTAALGRLSAELHELATLPRTETGSGSP,"{21: ['Modification1'], 2: ['Aib'], 1: ['acety..."
4,0487-0000-0923,PSSGAPPPSGE,,,0-41,YGEGTFTSDYSILLEKQAAREFIEWLLAGGPSSGAPPPSGEASELS...,YGEGTFTSDYSILLEKQAAREFIEWLLAGG,ASELSTAALGRLSAELHELATLPRTETGSGSP,"{16: ['Modification1'], 2: ['Aib'], 73: ['amid']}"
5,0487-0000-6013,PSKGAPPPSGGGE,,,0-43,YGEGTFTSDYSILLEEQAAREFIEWLLAGGPSKGAPPPSGGGEASE...,YGEGTFTSDYSILLEEQAAREFIEWLLAGG,ASELSTAALGRLSAELHELATLPRTETGSGSP,"{33: ['Modification1'], 2: ['Aib'], 75: ['amid']}"
6,0638-0000-2002,PSSGAPPPSGGGGG,,,0-44,YGEGTFTSDYSILLEKQAAREFIEWLLAGGPSSGAPPPSGGGGGCS...,YGEGTFTSDYSILLEKQAAREFIEWLLAGG,CSNLSTCMLGRLSQDLHRLQTYPKTDVGANAP,"{16: ['Modification1'], 2: ['Aib'], 76: ['amid']}"
7,0638-0000-2003,PSSGAPPPSGGGGG,,,0-44,YGEGTFTSDYSILLEKQAAREFIEWLLAGGPSSGAPPPSGGGGGAS...,YGEGTFTSDYSILLEKQAAREFIEWLLAGG,ASNLSTAMLGRLSQDLHRLQTYPKTDVGANAP,"{16: ['Modification1'], 2: ['Aib'], 76: ['amid']}"
8,0487-0000-6014,PSSGAPPPSE,,,0-40,YGEGTFTSDYSIGLDKIAQKAFVQWLIAGGPSSGAPPPSEASELST...,YGEGTFTSDYSIGLDKIAQKAFVQWLIAGG,ASELSTAALGRLSAELHELATLPRTETGSGSP,"{20: ['Modification1'], 2: ['Aib'], 13: ['Aib'..."
9,0487-0000-6015,PSSGAPPPSGGGGE,,,0-44,YGEGTFTSDYSIGLDKIAQKAFVQWLIAGGPSSGAPPPSGGGGEAS...,YGEGTFTSDYSIGLDKIAQKAFVQWLIAGG,ASELSTAALGRLSAELHELATLPRTETGSGSP,"{20: ['Modification1'], 2: ['Aib'], 13: ['Aib'..."
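
How the `Linker` and `Indexes` columns relate to the two fragments can be illustrated with a small sketch. It assumes both fragments are already known and are located by exact string search; `split_two_fragments` is a hypothetical helper, not a pepfunn function. Because the `Peptide` column is truncated in the display, the full peptide is rebuilt here from the parts of the first row, which is consistent with its empty `Begin` and `End` and its reported length of 66.

```python
def split_two_fragments(peptide: str, frag_1: str, frag_2: str) -> dict:
    """Locate two consecutive fragments and return the pieces around them."""
    start_1 = peptide.find(frag_1)
    if start_1 == -1:
        raise ValueError("first fragment not found")
    end_1 = start_1 + len(frag_1)
    start_2 = peptide.find(frag_2, end_1)  # search only after the first fragment
    if start_2 == -1:
        raise ValueError("second fragment not found")
    return {
        "Begin": peptide[:start_1],
        "Linker": peptide[end_1:start_2],
        "End": peptide[start_2 + len(frag_2):],
        "Indexes": f"{start_1}-{start_2}",  # 0-based start of each fragment
    }


glp1_part = "YGEGTFTSDYSIYLEKQAAGEFVNWLLAGG"
calc_part = "ASELSTAALGRLSAELHELATLPRTETGSGSP"
peptide = glp1_part + "PSSG" + calc_part  # rebuilt from the first row above

parts = split_two_fragments(peptide, glp1_part, calc_part)
print(parts)
# {'Begin': '', 'Linker': 'PSSG', 'End': '', 'Indexes': '0-34'}
```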


To calculate the pairs we can use the same function as in the first example, but including the references and the property columns.

In [4]:
df_pairs = MatchedPairs.get_pairs(df, df_sequences, property_columns, references, operation_columns=['divide', 'divide'], use_mod=True)
df_pairs

Unnamed: 0,ID_A,ID_B,Linker_A,Linker_B,Begin_A,Begin_B,End_A,End_B,Indexes_A,Indexes_B,...,Distance_GLP1-per,Mutations_CALCITONIN-per,Origin_Mutations_CALCITONIN-per,Distance_CALCITONIN-per,Diff_GLP1R CRE luc 0% HSA EC50 [pM]_relative,Min_GLP1R CRE luc 0% HSA EC50 [pM]_relative,Max_GLP1R CRE luc 0% HSA EC50 [pM]_relative,Diff_hGIPR CRE-luc 1% Ova 0% HSA (AMKH) EC50 [pM]_relative,Min_hGIPR CRE-luc 1% Ova 0% HSA (AMKH) EC50 [pM]_relative,Max_hGIPR CRE-luc 1% Ova 0% HSA (AMKH) EC50 [pM]_relative
0,0487-0000-6009,0487-0000-6010,PSSG,,,,,,0-34,0-30,...,0,,,0,0.334562,22.915722,68.494658,0.347426,2.145647,6.175844
1,0487-0000-6009,0487-0000-6008,PSSG,PSSGAPPPS,,,,,0-34,0-39,...,0,,,0,0.266777,22.915722,85.898372,0.381788,2.145647,5.619993
2,0487-0000-6009,0638-0000-0039,PSSG,PGGGGE,,,,,0-34,0-36,...,12,,,0,0.007403,22.915722,3095.648858,3.681552,0.582810,2.145647
3,0487-0000-6009,0487-0000-0923,PSSG,PSSGAPPPSGE,,,,,0-34,0-41,...,4,,,0,0.413867,22.915722,55.369728,0.413230,2.145647,5.192381
4,0487-0000-6009,0487-0000-6013,PSSG,PSKGAPPPSGGGE,,,,,0-34,0-43,...,5,,,0,5.167704,4.434410,22.915722,1.109603,1.933707,2.145647
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301,0662-0000-0014,0662-0000-0012,PSSGAPPPSGGGGKGGE,PSSGAPPPSKGGE,,,,,0-47,0-43,...,0,(3)E/N,(50)E->(46)N,1,1.718895,19.152432,32.921013,0.788862,5.371988,6.809797
302,0662-0000-0015,0662-0000-0012,PSKGAPPPSE,PSSGAPPPSKGGE,,,,,0-40,0-43,...,0,(3)E/N,(43)E->(46)N,1,213.358704,19.152432,4086.338066,2036.350437,6.809797,13867.133457
303,0662-0000-0014,0662-0000-0013,PSSGAPPPSGGGGKGGE,PSKGAPPPSGGGGGGGE,,,,,0-47,0-47,...,0,,,0,2.863724,11.495876,32.921013,2.293270,2.342502,5.371988
304,0662-0000-0015,0662-0000-0013,PSKGAPPPSE,PSKGAPPPSGGGGGGGE,,,,,0-40,0-47,...,0,,,0,355.461206,11.495876,4086.338066,5919.796346,2.342502,13867.133457
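
The `Origin_Mutations` numbering in the last rows above can be reproduced from the `Indexes` columns: the position in the original peptide appears to be the 0-based start of the fragment plus the 1-based position within the reference. The sketch below encodes that reading; `origin_positions` is a hypothetical helper, and the rule is inferred from the displayed rows rather than taken from the package, so it may not hold if gaps occur inside a fragment.

```python
def origin_positions(ref_pos: int, indexes_a: str, indexes_b: str, fragment_rank: int):
    """Map a reference position to positions in the original peptides A and B.

    `fragment_rank` selects which start index to read from strings like '0-47'
    (0 for the first reference fragment, 1 for the second).
    """
    start_a = int(indexes_a.split("-")[fragment_rank])
    start_b = int(indexes_b.split("-")[fragment_rank])
    return start_a + ref_pos, start_b + ref_pos


# Row 301: mutation (3)E/N in the second fragment, Indexes '0-47' and '0-43'
print(origin_positions(3, "0-47", "0-43", fragment_rank=1))  # (50, 46)

# Row 302: same mutation, Indexes '0-40' and '0-43'
print(origin_positions(3, "0-40", "0-43", fragment_rank=1))  # (43, 46)
```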


In the results we have columns similar to the first case, plus the linker columns and the mutations and distance for the `CALCITONIN-per` part of the peptide. In addition, the sequences dataframe can be used to generate a table with the alignment of the sequences, taking into account the fragments that match the reference peptides and the linker section, which can include gaps in the alignment. The following example also includes the chemical modifications of each sequence in the alignment.

In [5]:
from pepfunn.NN.sarplots import SARPlots

dataset = list(df_sequences['ID'].values)

df_alignment, total_alignment = SARPlots.generate_alignments(dataset, df_sequences, references, mode='multiple', add_mod=True)
df_alignment

Unnamed: 0,nncno,p1,p2,p3,p4,p5,p6,p7,p8,p9,...,p70,p71,p72,p73,p74,p75,p76,p77,p78,p79
0,0487-0000-6009,Y,Aib,E,G,T,F,T,S,D,...,P,R,T,E,T,G,S,G,S,P
1,0487-0000-6010,Y,Aib,E,G,T,F,T,S,D,...,P,R,T,E,T,G,S,G,S,P
2,0487-0000-6008,Y,Aib,E,G,T,F,T,S,D,...,P,R,T,E,T,G,S,G,S,P
3,0638-0000-0039,"acetyl,D-Form",Aib,E,G,T,F,I,S,D,...,P,R,T,E,T,G,S,G,S,P
4,0487-0000-0923,Y,Aib,E,G,T,F,T,S,D,...,P,R,T,E,T,G,S,G,S,P
5,0487-0000-6013,Y,Aib,E,G,T,F,T,S,D,...,P,R,T,E,T,G,S,G,S,P
6,0638-0000-2002,Y,Aib,E,G,T,F,T,S,D,...,P,K,T,D,V,G,A,N,A,P
7,0638-0000-2003,Y,Aib,E,G,T,F,T,S,D,...,P,K,T,D,V,G,A,N,A,P
8,0487-0000-6014,Y,Aib,E,G,T,F,T,S,D,...,P,R,T,E,T,G,S,G,S,P
9,0487-0000-6015,Y,Aib,E,G,T,F,T,S,D,...,P,R,T,E,T,G,S,G,S,P
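
The layout of such an alignment table can be pictured with a simplified sketch: the two reference fragments keep fixed columns and the linkers are padded with gaps to a common width. `align_rows` is a hypothetical helper, the right-padding of the linker with `-` is an assumption (the gap placement used by `generate_alignments` is not shown here), and chemical modifications are ignored. The input uses the first three rows of the sequences table.

```python
import pandas as pd


def align_rows(rows, gap="-"):
    """Build a position table from (id, fragment_1, linker, fragment_2) tuples."""
    width = max(len(linker) for _, _, linker, _ in rows)
    records = {}
    for pid, frag_1, linker, frag_2 in rows:
        # Pad the linker on the right so both fragments stay in the same columns
        records[pid] = list(frag_1 + linker.ljust(width, gap) + frag_2)
    table = pd.DataFrame.from_dict(records, orient="index")
    table.columns = [f"p{i}" for i in range(1, table.shape[1] + 1)]
    return table


glp1_part = "YGEGTFTSDYSIYLEKQAAGEFVNWLLAGG"
calc_part = "ASELSTAALGRLSAELHELATLPRTETGSGSP"
rows = [
    ("0487-0000-6009", glp1_part, "PSSG", calc_part),
    ("0487-0000-6010", glp1_part, "", calc_part),
    ("0487-0000-6008", glp1_part, "PSSGAPPPS", calc_part),
]

table = align_rows(rows)
print(table.shape)  # (3, 71)
print(table.loc[:, "p31":"p40"])
```

With this layout the number of columns is the sum of the two fragment lengths and the longest linker. The longest linker visible in the pairs table has 17 residues, which would be consistent with the 79 columns reported above (30 + 17 + 32).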
