This notebook documents the code used to analyze EBV and CMV immunodominant proteins against a list of MS-associated proteins. 
Note that requirements for PEPMatch are listed on the original GitHub repository found here: https://github.com/IEDB/PEPMatch and will not be discussed here. 

The code has three parts: (1) preprocessing the master list of proteins used as the reference (in this case, the list of MS-associated proteins); (2) assessing the overlap of actual matches between a protein of interest (say, EBNA1) and the Master List of proteins; and (3) running shuffled versions of the same protein (EBNA1) against the Master List of proteins as the control.

Packages to download and import 

In [None]:
pip install pepmatch

In [None]:
import pandas as pd
import random
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
from pepmatch import Matcher
from pepmatch import Preprocessor

1. Preprocessing the Master List of proteins used as the reference (in this case, the list of MS-associated proteins)
Note: the list of MS-associated proteins was determined from an IEDB query, and UniProt was used to compile the corresponding sequences into one large FASTA file.

In [None]:
from pepmatch import Preprocessor
# Replace 'X.fasta' with the path to the Master List FASTA file
preprocessor = Preprocessor('X.fasta', split=2, preprocess_format='pickle')
preprocessor.preprocess()
# Output: the preprocessed Master List, referred to as 'Master_List' in the following analyses

2. Assessing the overlap between protein of interest and MS-associated protein list
NOTE: for the 9-mers, max_mismatches is set to 2, taken here as the maximum number of mismatches that remains physiologically relevant for MHC I peptides while still allowing cross-presentation, and split is set to 2 to maximize the number of hits found by the analysis.
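
To make the mismatch threshold concrete, here is a minimal sketch of what "at most 2 mismatches" means for two 9-mers. The helper names `hamming` and `is_match` and the toy peptides are hypothetical and are not part of PEPMatch; this brute-force comparison only illustrates the criterion and is not how PEPMatch performs its search.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length peptides differ."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_match(query: str, target: str, max_mismatches: int = 2) -> bool:
    """Return True if the peptides differ at no more than max_mismatches positions."""
    return hamming(query, target) <= max_mismatches


# Toy 9-mers (made up for illustration, not taken from any protein in the analysis)
query = "ACDEFGHIK"
print(is_match(query, "ACDEFGHIK"))  # identical peptides: 0 mismatches
print(is_match(query, "ACDEYGHIR"))  # differs at positions 5 and 9: 2 mismatches
print(is_match(query, "ACWEYGHIR"))  # differs at positions 3, 5 and 9: 3 mismatches
```

With a threshold of 2, the first two comparisons count as hits and the third does not.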

In [None]:
# Input: amino acid sequence of the protein of interest
NAX_string = ''
k = 9
# Split the protein into overlapping 9-mers with a sliding window that advances 1 amino acid at a time,
# giving the list of all peptides derived from the protein of interest
NAX_9mer = [NAX_string[idx:idx + k] for idx in range(len(NAX_string) - k + 1)]
# Run the matching against the preprocessed Master List
matcher_NAX = Matcher(NAX_9mer, 'Master_List', max_mismatches=2, split=2)
results_NAX = matcher_NAX.match()
# Convert the matching results to a pandas DataFrame
results_NAX = pd.DataFrame(results_NAX)
# Drop duplicate matches so that each entry in column 0 is counted once
results_NAX_ = results_NAX.drop_duplicates(subset=0)
total_NAX = len(results_NAX_)
# Count the hits: rows whose column 3 equals 'Homo sapiens'
conserved_NAX = int((results_NAX_[3].astype(str) == 'Homo sapiens').sum())
print(total_NAX)
print(conserved_NAX)
# Output: the total number of possible hits (total_NAX) and the number of actual overlapping hits determined by PEPMatch (conserved_NAX)
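
As a quick check of the sliding-window step above, the sketch below applies the same list comprehension to a toy sequence. The function name `kmers` and the 13-residue string are hypothetical and chosen only for illustration. A sequence of length n yields n - k + 1 peptides, and consecutive peptides share k - 1 residues.

```python
def kmers(sequence: str, k: int = 9) -> list[str]:
    """Return every length-k window of the sequence, advancing one residue at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


toy = "MSDEGPGTGPGNG"  # made-up 13-residue sequence, not a real protein
peptides = kmers(toy)

print(len(peptides))               # 13 - 9 + 1 = 5 peptides
print(peptides[0], peptides[1])    # consecutive 9-mers share 8 residues
print(kmers("ACD"))                # sequences shorter than k give an empty list
```

The last call shows that a sequence shorter than 9 residues (including the empty placeholder string) produces no peptides, so the sequence has to be filled in before matching.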

3. Performing the same analysis as in (2), but with shuffled versions of the same protein, for use in the statistical analysis.
NOTE: this code block runs a loop that shuffles the protein 30 times, producing a different shuffled version each time; the 30 outputs are then averaged for use in the statistical analysis.
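
A shuffled control of this kind keeps the length and amino acid composition of the protein and changes only the residue order. The sketch below checks this on a toy sequence. The helper `shuffle_sequence`, the fixed seed, and the toy string are assumptions made for illustration; the notebook itself shuffles with `random.sample` and no seed.

```python
import random
from collections import Counter


def shuffle_sequence(sequence: str, rng: random.Random) -> str:
    """Return a random permutation of the residues in the sequence."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


rng = random.Random(0)  # seeded only so the illustration is reproducible
toy = "MSDEGPGTGPGNG"   # made-up sequence, not a real protein
shuffled = shuffle_sequence(toy, rng)

print(len(shuffled) == len(toy))          # same length
print(Counter(shuffled) == Counter(toy))  # same amino acid composition
```

Passing an explicit `random.Random` instance is one way to make the 30 shuffles reproducible across runs, should that be wanted.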

In [None]:
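# Run one shuffled-control iteration and return its hit count
def my_func():
    # Input: amino acid sequence of the protein of interest (same as in part 2)
    NAX_string = ''
    # Create a shuffled version of the above protein
    random_NAX_string = ''.join(random.sample(NAX_string, len(NAX_string)))
    k = 9
    NAX_9mer_random = [random_NAX_string[idx:idx + k] for idx in range(len(random_NAX_string) - k + 1)]
    matcher_NA = Matcher(NAX_9mer_random, 'Master_List', max_mismatches=2, split=2)
    results_random_NA = matcher_NA.match()
    results_random_df_NA = pd.DataFrame(results_random_NA)
    results_NA_ = results_random_df_NA.drop_duplicates(subset=0)
    conserved = int((results_NA_[3].astype(str) == 'Homo sapiens').sum())
    # Return the count; returning nothing would make the loop below print None
    return conserved

# Repeat the shuffled control 30 times and print each count
shuffled_counts = [my_func() for _ in range(30)]
for count in shuffled_counts:
    print(count)
# Average across the 30 shuffles for use in the statistical analysis
print(sum(shuffled_counts) / len(shuffled_counts))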
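
The notebook does not state which statistical test was applied to the averaged control counts. As one possible way to compare the observed count with the shuffled controls, the sketch below computes the mean, the standard deviation, and a permutation-style empirical p-value. The function `summarize_control` is hypothetical, and the numbers are placeholders made up for illustration, not results from this analysis.

```python
from statistics import mean, stdev


def summarize_control(observed: int, shuffled_counts: list[int]) -> dict:
    """Summarize shuffled-control counts and compare them with the observed count."""
    n = len(shuffled_counts)
    # Number of shuffles with at least as many hits as the real protein
    at_least = sum(c >= observed for c in shuffled_counts)
    return {
        "mean": mean(shuffled_counts),
        "sd": stdev(shuffled_counts),
        # Add-one correction so the estimate is never exactly zero
        "empirical_p": (at_least + 1) / (n + 1),
    }


# Placeholder values only; substitute conserved_NAX and the 30 shuffled counts
example_counts = [3, 5, 4, 6, 2, 4, 5, 3, 4, 4]
summary = summarize_control(observed=9, shuffled_counts=example_counts)
print(summary)
```

With only 30 shuffles, the smallest empirical p-value this estimator can return is 1/31, so more shuffles would be needed to resolve smaller values.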