# Detection of similar sequences

For comparing sequences, PepFuNN provides functions to match peptides of different sizes, including peptides that contain non-natural amino acids. For the latter case, there are functions that work directly with the SMILES format or that use a pre-generated similarity matrix based on the monomer dictionary provided with PepFuNN.

Before starting, the pepfunn package can be installed as follows:

In [None]:
!pip install --user pepfunn

Alternatively, download the code from GitHub and install it locally from the repository folder: `pip install -e .`

### 1. Comparing sequences with the same size

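First, we can compare two sequences of the same length, either with a basic Hamming-style comparison or with a score based on a similarity matrix. Note that the value reported for the Hamming comparison is the number of identical positions, so higher means more similar: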

In [1]:
from pepfunn.similarity import Alignment

seq1='AFTGYW'
seq2='AGTGYL'

In [2]:
score1 = Alignment.align_samelen_matrix(seq1, seq2)
score2 = Alignment.align_samelen_local(seq1, seq2)

print(f'Similarity using a local matrix to score: {score1}')
print(f'Similarity using the Hamming distance: {score2}')

Similarity using a local matrix to score: 11.815126050420169
Similarity using the Hamming distance: 4
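
To make the two scores concrete, here is a minimal plain-Python sketch of position-by-position scoring. It is not PepFuNN's implementation: `count_identical` and `matrix_score` are hypothetical helper names, and `TOY_MATRIX` holds made-up substitution scores used only for illustration, not the matrix shipped with the package.

```python
def count_identical(seq1, seq2):
    """Count positions where two equal-length sequences share the same residue."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(seq1, seq2))


def matrix_score(seq1, seq2, matrix, default=0.0):
    """Sum substitution scores position by position."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    total = 0.0
    for a, b in zip(seq1, seq2):
        # Look the pair up in either order so the table can be stored half-filled.
        total += matrix.get((a, b), matrix.get((b, a), default))
    return total


# Hypothetical scores for illustration only (NOT the PepFuNN similarity matrix).
TOY_MATRIX = {
    ("A", "A"): 4, ("T", "T"): 5, ("G", "G"): 6, ("Y", "Y"): 7,
    ("F", "G"): -3, ("W", "L"): -2,
}

identical = count_identical("AFTGYW", "AGTGYL")
print(identical)                                     # 4 identical positions
print(len("AFTGYW") - identical)                     # 2 mismatches (strict Hamming distance)
print(matrix_score("AFTGYW", "AGTGYL", TOY_MATRIX))  # 17.0 with the toy matrix
```

The identity count treats every mismatch equally, while the matrix score can penalize a conservative substitution less than a drastic one.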


Let's repeat the example with two sequences that differ more in amino acid content:

In [3]:
seq1='NPVVHFFKNIVTPRTPPPSQ'
seq2='AAAAAFFKNIVAAAAAAAAA'

score1 = Alignment.align_samelen_matrix(seq1, seq2)
score2 = Alignment.align_samelen_local(seq1, seq2)

print(f'Similarity using a local matrix to score: {score1}')
print(f'Similarity using the Hamming distance: {score2}')

Similarity using a local matrix to score: 6.087468571256227
Similarity using the Hamming distance: 6


Finally, the peptides can be compared through their SMILES representations instead of their sequences:

In [9]:
seq1='AFTGYW'
seq2='AGTGYL'

score3 = Alignment.align_smiles(seq1, seq2)

print(f'Similarity using SMILES: {score3}')

Similarity using SMILES: 0.4652777777777778


### 2. Comparing sequences with different size

Peptides of different sizes can also be compared, and non-natural amino acids can be included if they are available in the monomer dictionary. The alignment is controlled by one option:

- **mode:** Either `weighted` (scored with the similarity matrix) or `unweighted` (based on the Hamming distance).

Two examples follow, starting with an unweighted alignment:

In [4]:
seq1='WWSEVNRAEF'
seq2='KTEEISEVNIVAEF'

score, align1, align2 = Alignment.align_difflen_matrix(seq1, seq2, mode="unweighted")
print(f'Score of the alignment: {score}')
print(f'Aligned sequence 1: {align1}')
print(f'Aligned sequence 2: {align2}')

Score of the alignment: 7.0
Aligned sequence 1: WW-----SEVNR--AEF
Aligned sequence 2: --KTEEISEVN-IVAEF
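
The unweighted score of 7.0 is what a global alignment yields when each match counts 1 and mismatches and gaps cost nothing. The sketch below illustrates that scoring scheme with a small dynamic-programming table. The scheme is an assumption inferred from the output above, and `unweighted_alignment_score` is a hypothetical helper, not a PepFuNN function.

```python
def unweighted_alignment_score(seq1, seq2):
    """Global alignment score with match = 1 and no mismatch or gap penalty."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # dp[i][j] holds the best score for the prefixes seq1[:i] and seq2[:j].
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            match = 1 if seq1[i - 1] == seq2[j - 1] else 0
            dp[i][j] = max(
                dp[i - 1][j - 1] + match,  # align the two residues
                dp[i - 1][j],              # gap in seq2
                dp[i][j - 1],              # gap in seq1
            )
    return dp[-1][-1]


print(unweighted_alignment_score("WWSEVNRAEF", "KTEEISEVNIVAEF"))  # 7

# Lists of monomers work too, so multi-character names such as Aib stay intact.
# Splitting on "-" is a simplification that ignores any richer BILN annotations.
print(unweighted_alignment_score("K-Aib-M-P".split("-"), "S-A-Aib-P".split("-")))  # 2
```

Because the function only compares elements for equality, it accepts plain strings of one-letter codes as well as lists of monomer names.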


We can do something similar with two peptides in BILN format that contain non-natural amino acids, using the similarity matrix to weight the alignment:

In [8]:
seq1='K-Aib-M-P'
seq2='S-A-Aib-P'

score, align1, align2 = Alignment.align_difflen_matrix(seq1, seq2, mode="weighted")
print(f'Score of the alignment: {score}')

Score of the alignment: 8.0


### 3. Overall similarity no matter the size

Finally, we can calculate an overall similarity between 0 and 1 regardless of the peptide lengths, which can be interpreted in a similar way to a percentage identity. The `mode` option indicates whether the two peptides have the same length (`same`) or different lengths (`diff`):

In [11]:
sim1 = Alignment.similarity_pair('AFTGYW', 'AGTGYL', mode='same')
sim2 = Alignment.similarity_pair('NPVVHFFKNIVTPRTPPPSQ', 'AAAAAFFKNIVAAAAAAAAA', mode='same')
sim3 = Alignment.similarity_pair('LLSHYTSY', 'LLSHYTSY', mode='same')
sim4 = Alignment.similarity_pair("W-W-S-E-V-N-R-A-E-F", "K-T-E-E-I-S-E-V-N-I-V-A-E-F", mode='diff')
sim5 = Alignment.similarity_pair("K-Aib-M-P","S-A-Aib-P", mode='diff')

print(f'Full similarity between AFTGYW and AGTGYL is: {sim1}')
print(f'Full similarity between NPVVHFFKNIVTPRTPPPSQ and AAAAAFFKNIVAAAAAAAAA is: {sim2}')
print(f'Full similarity between LLSHYTSY and LLSHYTSY is: {sim3}')
print(f'Full similarity between W-W-S-E-V-N-R-A-E-F and K-T-E-E-I-S-E-V-N-I-V-A-E-F is: {sim4}')
print(f'Full similarity between K-Aib-M-P and S-A-Aib-P is: {sim5}')

Full similarity between AFTGYW and AGTGYL is: 0.49229691876750703
Full similarity between NPVVHFFKNIVTPRTPPPSQ and AAAAAFFKNIVAAAAAAAAA is: 0.07609335714070284
Full similarity between LLSHYTSY and LLSHYTSY is: 1.0
Full similarity between W-W-S-E-V-N-R-A-E-F and K-T-E-E-I-S-E-V-N-I-V-A-E-F is: 0.5916079783099616
Full similarity between K-Aib-M-P and S-A-Aib-P is: 0.5
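
The `diff` values above are consistent with dividing the alignment score by the geometric mean of each peptide's score against itself. This is an inference from the printed numbers, not a statement about PepFuNN's internals, and `normalized_similarity` is a hypothetical helper. A minimal sketch of that normalization, assuming an unweighted alignment in which a peptide's self-score equals its length:

```python
import math


def normalized_similarity(pair_score, self_score1, self_score2):
    """Scale an alignment score by the geometric mean of the two self-scores."""
    denominator = math.sqrt(self_score1 * self_score2)
    if denominator == 0:
        return 0.0
    return pair_score / denominator


# Assumption: unweighted alignment, so each self-score is the peptide length.
# 7 matches between a 10-mer and a 14-mer.
print(normalized_similarity(7, 10, 14))  # approximately 0.5916
# 2 matches between two 4-mers.
print(normalized_similarity(2, 4, 4))    # 0.5
```

With this scaling, identical peptides score 1.0, and a short peptide fully contained in a much longer one scores below 1.0 because the longer self-score enlarges the denominator.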


### 4. Monomer-based similarity

A new functionality in PepFuNN compares peptides based on their monomer content and the monomer properties. The publication describes the monomer-based fingerprint, in which the peptide is transformed into a graph and the nodes are treated like atoms in a Morgan fingerprint. This way we can compare peptides with complex topologies, as long as the properties are available in the monomer dictionary included in the PepFuNN package. The following are two examples, one with natural and one with non-natural amino acids:

In [14]:
from pepfunn.similarity import simMonFP

The main options to configure the fingerprints are:

- **radius:** Maximum number of nodes surrounding an amino acid to generate a bit in the fingerprint.
- **nBits:** Number of bits the fingerprint will contain (e.g. 1024).
- **add_freq:** Flag to take into account the repetition of a motif within the sequence. This will generate more specific fingerprints.


In [15]:
seq1='NPVVHFFKNIVTPRTPPPSQ'
seq2='AAAAAFFKNIVAAAAAAAAA'

sim=simMonFP(seq1, seq2, radius=2, nBits=1024, add_freq=True)
print(f'Monomer-based similarity is: {sim}')

Monomer-based similarity is: 0.2037037037037037


In [16]:
seq1='K-Aib-M-P'
seq2='S-A-Aib-P'

sim=simMonFP(seq1, seq2, radius=2, nBits=1024, add_freq=True)
print(f'Monomer-based similarity is: {sim}')

Monomer-based similarity is: 0.125
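
The idea behind the fingerprint can be illustrated with a simplified sketch: each monomer neighbourhood up to a given radius is hashed to a bit, and two bit sets are compared with the Tanimoto coefficient. This sketch makes several assumptions. It handles linear peptides only, it hashes monomer names instead of monomer properties, it ignores motif frequency, and `monomer_fingerprint` and `tanimoto` are hypothetical names. It will therefore not reproduce the values returned by `simMonFP`.

```python
import hashlib


def monomer_fingerprint(monomers, radius=2, n_bits=1024):
    """Return the set of bit indices encoding monomer neighbourhoods of a linear peptide."""
    bits = set()
    for i in range(len(monomers)):
        for r in range(radius + 1):
            # Neighbourhood of monomer i: up to r positions on each side.
            window = monomers[max(0, i - r): i + r + 1]
            key = f"{r}|" + "-".join(window)
            # hashlib gives the same bit on every run, unlike the built-in hash().
            digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
            bits.add(int(digest, 16) % n_bits)
    return bits


def tanimoto(bits1, bits2):
    """Shared bits divided by the total number of distinct bits."""
    union = bits1 | bits2
    if not union:
        return 0.0
    return len(bits1 & bits2) / len(union)


fp1 = monomer_fingerprint("K-Aib-M-P".split("-"))
fp2 = monomer_fingerprint("S-A-Aib-P".split("-"))
print(tanimoto(fp1, fp1))  # 1.0 for identical peptides
print(tanimoto(fp1, fp2))  # a value between 0 and 1
```

Larger radii add bits for longer motifs, which makes the comparison more specific to the order of the monomers and less to the overall composition.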


### 5. Monomer-based descriptors

Finally, PepFuNN provides a function to generate autocorrelation descriptors for machine learning models. The sequence should be in BILN format, and the function returns a dictionary of amino acid-based descriptors. The monomers must be part of the monomer dictionary provided with the package. For details, please check the manuscript:

In [18]:
from pepfunn.similarity import pepDescriptors

desc=pepDescriptors("K-Aib-Iva-P-L-C-D")
descriptors=desc.moranCorrelation()

print(f"Predicted AA-based descriptors: {descriptors}")

Predicted AA-based descriptors: {'nrot-lag1': -0.20175438596491233, 'nrot-lag2': 0.059210526315789526, 'nrot-lag3': -0.8733552631578948, 'nrot-lag4': 0.05921052631578947, 'nrot-lag5': 0.1282894736842105, 'logp-lag1': 0.14621536728866413, 'logp-lag2': -0.3769784064899992, 'logp-lag3': -0.2190660764542613, 'logp-lag4': -0.7524979628713062, 'logp-lag5': -0.0016269255745366142, 'tpsa-lag1': 0.04386218549019564, 'tpsa-lag2': 0.026700387149926952, 'tpsa-lag3': -0.9840543967715248, 'tpsa-lag4': -0.30384299963295636, 'tpsa-lag5': -0.5206000880452041, 'mw-lag1': -0.43399083564883817, 'mw-lag2': -0.09377410766615871, 'mw-lag3': -0.4707166259042365, 'mw-lag4': 0.6020883342486392, 'mw-lag5': 0.21727062700987973}
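
The keys follow a `property-lagN` pattern: one value per monomer property and per distance (lag) along the sequence. As a rough illustration, the sketch below computes a Moran-style autocorrelation for one property at a given lag, using the common definition in which the lagged covariance is divided by the variance. Whether `moranCorrelation` uses exactly this normalization is an assumption. The property values are made up, and `moran_autocorrelation` is a hypothetical helper.

```python
def moran_autocorrelation(values, lag):
    """Moran-style autocorrelation of a per-monomer property at a given lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        return 0.0  # a constant property carries no autocorrelation signal
    # Average product of deviations for monomers separated by `lag` positions.
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    return covariance / variance


# Hypothetical property values for a 4-monomer peptide (not from the PepFuNN dictionary).
toy_property = [1.0, -1.0, 1.0, -1.0]

descriptors = {
    f"toy-lag{lag}": moran_autocorrelation(toy_property, lag) for lag in (1, 2)
}
print(descriptors)  # {'toy-lag1': -1.0, 'toy-lag2': 1.0}
```

Negative values indicate that monomers separated by that lag tend to have opposite deviations from the mean, while positive values indicate similar deviations.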


For any questions, please contact: raoc@novonordisk.com