# Proteomics Practical Exercises (Student Version)

This notebook contains practical exercises covering computational proteomics concepts. Each exercise demonstrates key techniques used in modern proteomics research.

**Instructions**: Complete each exercise by following the task list. Fill in the code cells marked with `# TODO:` comments.

## Table of Contents
1. **Core Exercises**: UniProt API, PRIDE queries, peptide digestion, PSM matching
2. **Intermediate Exercises**: Spectrum prediction, RNA-protein modeling, phosphoproteomics, quantitative analysis
3. **Advanced Exercises**: Variant peptides, stability prediction, multiomics integration


In [None]:
# Import required libraries
import os
import json
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyteomics import mass, parser, mgf
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, classification_report, roc_auc_score, roc_curve
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Set matplotlib style
plt.style.use('default')
%matplotlib inline


## Exercise 1: Query UniProt API for Protein Annotations

**Aim**: Learn how to call the UniProt REST API, parse JSON responses, extract protein features (domains, PTMs), and visualize domain architecture.

**Tasks**:
1. Construct the UniProt API URL for protein ID "P04637" (TP53)
2. Make an HTTP GET request to retrieve the protein data
3. Parse the JSON response
4. Extract the protein sequence
5. Extract domain features (type == "Domain")
6. Extract PTM features (type == "Modified residue")
7. Create a visualization showing:
   - Protein sequence as a horizontal line
   - Domains as colored rectangles
   - Phosphorylation sites as red stars
   - Other PTMs as orange circles
8. Add proper labels, title, and legend to the plot

**Hint**: Use `requests.get()` for API calls, `response.json()` to parse JSON, and `plt.Rectangle()` for domain visualization.
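
One possible way to organize tasks 5 and 6 is a small helper that sorts feature records into domains, phosphorylation sites, and other PTMs. The sketch below assumes each record follows the UniProt JSON layout with `type`, `description`, and `location.start.value` / `location.end.value`. The name `categorize_features` is hypothetical, and `sample_features` is hand-made illustrative data, not a real API response.

```python
def categorize_features(features):
    """Split UniProt-style feature records into domains, phospho sites, and other PTMs."""
    domains, phospho, other_ptms = [], [], []
    for f in features:
        start = f["location"]["start"]["value"]
        end = f["location"]["end"]["value"]
        desc = f.get("description", "")
        if f["type"] == "Domain":
            domains.append((start, end, desc))
        elif f["type"] == "Modified residue":
            # Assumption: phosphorylation descriptions contain "phospho",
            # e.g. "Phosphoserine" or "Phosphothreonine"
            if "phospho" in desc.lower():
                phospho.append((start, desc))
            else:
                other_ptms.append((start, desc))
    return domains, phospho, other_ptms


# Hand-made records for illustration only (not real UniProt output)
sample_features = [
    {"type": "Domain", "description": "Example domain",
     "location": {"start": {"value": 10}, "end": {"value": 50}}},
    {"type": "Modified residue", "description": "Phosphoserine",
     "location": {"start": {"value": 15}, "end": {"value": 15}}},
    {"type": "Modified residue", "description": "N6-acetyllysine",
     "location": {"start": {"value": 120}, "end": {"value": 120}}},
    {"type": "Region", "description": "Ignored by this helper",
     "location": {"start": {"value": 1}, "end": {"value": 5}}},
]

domains, phospho, other_ptms = categorize_features(sample_features)
print(len(domains), len(phospho), len(other_ptms))
```

The returned tuples can be passed straight to the plotting step: `(start, end)` pairs give the rectangle extents, and single positions give the marker x-coordinates.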


In [None]:
# Exercise 1: UniProt API Query
protein_id = "P04637"  # TP53
url = f"https://rest.uniprot.org/uniprotkb/{protein_id}.json"
print(f"Querying {url} ...")

# Task 1: the URL is already constructed above; confirm it follows the UniProt REST pattern
# url = f"..."

# TODO: Make HTTP GET request with timeout=10
# response = ...

# TODO: Check response status and parse JSON
# data = ...

# TODO: Extract sequence
# seq = ...

# TODO: Extract features
# features = ...

# TODO: Initialize lists for domains and PTMs
# domain_positions = []
# phospho_positions = []
# other_ptms = []

# TODO: Loop through features and categorize them
# for f in features:
#     if f["type"] == "Domain":
#         # Extract start and end positions
#         # Append to domain_positions
#     elif f["type"] == "Modified residue":
#         # Extract position and description
#         # Check if it's a phosphorylation site
#         # Append to appropriate list

# TODO: Print summary statistics
# print(f"Found {len(domain_positions)} domains")
# print(f"Found {len(phospho_positions)} phosphorylation sites")
# print(f"Found {len(other_ptms)} other PTMs")

# TODO: Create visualization
# fig, ax = plt.subplots(figsize=(12, 2))
# # Draw protein sequence line
# # Draw domains as rectangles
# # Mark phosphorylation sites
# # Mark other PTMs
# # Add labels, title, legend
# plt.show()


## Exercise 2: Query PRIDE for Experiment Metadata

**Aim**: Learn API access, filtering, and metadata exploration. Visualize the distribution of experiments over time.

**Tasks**:
1. Construct the PRIDE API URL for project listing
2. Make a GET request with parameters `pageSize=100` and `page=0` to get the first 100 projects
3. Parse the JSON response
4. Convert the project list to a pandas DataFrame
5. Extract and convert publication dates to datetime format
6. Extract year from publication dates
7. Create a bar plot showing number of publications per year
8. Print summary statistics (year range, total projects)

**Hint**: Use `pd.to_datetime()` for date conversion and `dt.year` to extract years.
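
The date handling in tasks 5 to 7 can be tried offline before touching the API. This is a minimal sketch on hand-made records; the accessions and dates are invented for illustration, and only the `publicationDate` field name is taken from the exercise.

```python
import pandas as pd

# Invented project records for illustration (not real PRIDE entries)
projects = [
    {"accession": "EXAMPLE_1", "publicationDate": "2019-03-07"},
    {"accession": "EXAMPLE_2", "publicationDate": "2019-11-21"},
    {"accession": "EXAMPLE_3", "publicationDate": "2021-06-02"},
]

df = pd.DataFrame(projects)

# Convert ISO date strings to datetime, then pull out the year
df["publicationDate"] = pd.to_datetime(df["publicationDate"])
df["year"] = df["publicationDate"].dt.year

# Count projects per year, ordered chronologically
per_year = df["year"].value_counts().sort_index()
print(per_year)
print(f"Year range: {df['year'].min()} to {df['year'].max()}, total projects: {len(df)}")
```

`per_year.plot(kind="bar")` would turn the same Series into the bar plot asked for in task 7.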


In [None]:
# Exercise 2: PRIDE Query
# Using PRIDE API v3: https://www.ebi.ac.uk/pride/ws/archive/v3/
print("Querying PRIDE (first 100 projects)...")

# Try PRIDE API v3 endpoint
url = "https://www.ebi.ac.uk/pride/ws/archive/v3/projects"
response = requests.get(url, params={"pageSize": 100, "page": 0}, timeout=15)

# TODO: Parse JSON response
# resp = ...

# TODO: Convert to DataFrame
# df = ...

# TODO: Convert publicationDate to datetime
# df["publicationDate"] = ...

# TODO: Extract year
# df["year"] = ...

# TODO: Create bar plot of publications per year
# plt.figure(figsize=(10, 5))
# # Count publications per year
# # Plot as bar chart
# # Add labels and title
# plt.show()

# TODO: Print summary statistics


## Exercise 3: Simulate Tryptic Digestion & Compute Peptide Masses

**Aim**: Learn peptide digestion, m/z calculation, and mass distribution relevant for LC-MS/MS analysis.

**Tasks**:
1. Define a protein sequence (use the provided example or your own)
2. Perform tryptic digestion using `parser.cleave()` with `parser.expasy_rules["trypsin"]` and `min_length=6`
3. For each peptide, calculate:
   - Monoisotopic mass using `mass.calculate_mass()`
   - m/z for 2+ charge state: (mass + 2 × 1.007276) / 2, where 1.007276 Da is the proton mass
   - m/z for 3+ charge state: (mass + 3 × 1.007276) / 3
4. Create a DataFrame with columns: peptide, length, mass, mz_2plus, mz_3plus
5. Display summary statistics and first 10 peptides
6. Create two histograms:
   - Peptide mass distribution
   - m/z distribution for 2+ charge state

**Hint**: Use `parser.cleave()` for digestion and `mass.calculate_mass(sequence=pep)` for mass calculation.
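
The digestion and m/z arithmetic can be sketched without pyteomics to see what the library calls are doing. The code below is a simplified stand-in, not the pyteomics implementation: `simple_trypsin` (a hypothetical helper) cleaves after K or R unless the next residue is P and ignores the further exceptions in the ExPASy rule, and `mz` uses the proton mass as the charge carrier. The input sequence is a toy example.

```python
import re

PROTON_MASS = 1.007276  # Da; mass of a proton, not of a hydrogen atom


def mz(neutral_mass, charge):
    """m/z of a peptide with the given neutral monoisotopic mass and positive charge."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def simple_trypsin(seq, min_length=6):
    """Simplified trypsin rule: cleave after K/R unless followed by P; no missed cleavages."""
    pieces = re.split(r"(?<=[KR])(?!P)", seq)
    return [p for p in pieces if len(p) >= min_length]


# Toy sequence: the R-P bond is not cleaved, and the short tail "CC" is filtered out
peptides = simple_trypsin("AAAAAAKGGGGGGRPLLLLLKCC")
print(peptides)

# A peptide of neutral mass 1000 Da observed at 2+ and 3+
print(round(mz(1000.0, 2), 4), round(mz(1000.0, 3), 4))
```

Higher charge states move the same peptide to lower m/z, which is why the 2+ and 3+ histograms cover different ranges.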


In [None]:
# Exercise 3: Tryptic Digestion & Peptide Mass Calculation
sequence = "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD"

print(f"Input sequence length: {len(sequence)} amino acids")

# TODO: Perform tryptic digestion
# peptides = ...

# TODO: Calculate masses and m/z for each peptide
# rows = []
# for pep in peptides:
#     # Calculate monoisotopic mass
#     # Calculate m/z for 2+ and 3+ charge states
#     # Append to rows

# TODO: Create DataFrame
# df_peptides = ...

# TODO: Display summary and first 10 peptides
# print(df_peptides.describe())
# print(df_peptides.head(10))

# TODO: Create histograms
# fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# # Plot mass distribution
# # Plot m/z distribution
# plt.show()


## Exercise 4: Peptide-Spectrum Matching (PSM)

**Aim**: Learn the concept of PSM (peptide-spectrum matching). Compute theoretical b/y fragments and match them to experimental spectra.

**Tasks**:
1. Define a target peptide (e.g., "PEPTIDE")
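2. Calculate theoretical b and y ion fragments by calling `mass.fast_mass(sub, ion_type="b", charge=1)` on each prefix of the peptide and `mass.fast_mass(sub, ion_type="y", charge=1)` on each suffix (`fast_mass` returns a single value per call)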
3. Create a synthetic experimental spectrum:
   - Include theoretical m/z values
   - Add noise peaks
   - Assign intensities (make theoretical peaks stronger)
4. Match theoretical fragments to experimental peaks with tolerance ±0.5 Da
5. Create an annotated spectrum plot:
   - Plot experimental peaks as stem plot
   - Mark matched fragments with red vertical lines and labels
   - Mark unmatched fragments with orange lines
6. Print summary of matched ions

**Hint**: Use `np.abs(mz - theo_mz) < tolerance` for matching and `ax.axvline()` for marking fragments.
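
To make the fragment arithmetic and the tolerance matching concrete, here is a minimal sketch that computes singly charged b and y ions by hand and matches them against a toy peak list. It assumes rounded monoisotopic residue masses for the residues in "PEPTIDE" only; `by_ions` and `match_fragments` are hypothetical helper names, and the three-peak spectrum is invented.

```python
import numpy as np

# Monoisotopic residue masses (Da), rounded; only the residues of "PEPTIDE"
RESIDUE_MASS = {"P": 97.05276, "E": 129.04259, "T": 101.04768,
                "I": 113.08406, "D": 115.02694}
WATER = 18.01056   # H2O, added to y ions
PROTON = 1.00728   # charge carrier for 1+ ions


def by_ions(peptide):
    """Singly charged b and y fragment m/z values, keyed by ion label (b1, y1, ...)."""
    ions = {}
    n = len(peptide)
    for i in range(1, n):
        b_mass = sum(RESIDUE_MASS[aa] for aa in peptide[:i])
        y_mass = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER
        ions[f"b{i}"] = b_mass + PROTON
        ions[f"y{n - i}"] = y_mass + PROTON
    return ions


def match_fragments(ions, exp_mz, tolerance=0.5):
    """Return {ion label: closest experimental m/z} for ions within the tolerance."""
    exp_mz = np.asarray(exp_mz, dtype=float)
    matches = {}
    for label, theo_mz in ions.items():
        diffs = np.abs(exp_mz - theo_mz)
        j = int(np.argmin(diffs))
        if diffs[j] < tolerance:
            matches[label] = float(exp_mz[j])
    return matches


ions = by_ions("PEPTIDE")
# Invented spectrum: two peaks near real fragments plus one noise peak
spectrum = [148.06, 227.10, 500.00]
matches = match_fragments(ions, spectrum)
print(matches)
```

Taking the closest peak per fragment avoids double-counting when several peaks fall inside the tolerance window; the unmatched labels are simply `set(ions) - set(matches)`.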


In [None]:
# Exercise 4: Peptide-Spectrum Matching
peptide = "PEPTIDE"
print(f"Target peptide: {peptide}")

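# TODO: Calculate theoretical fragments (singly charged b and y ions)
# mass.fast_mass() returns one value per call, so loop over prefixes and suffixes
# fragments = {}
# for i in range(1, len(peptide)):
#     fragments[f"b{i}"] = mass.fast_mass(peptide[:i], ion_type="b", charge=1)
#     fragments[f"y{len(peptide) - i}"] = mass.fast_mass(peptide[i:], ion_type="y", charge=1)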

# TODO: Create synthetic experimental spectrum
# np.random.seed(42)
# # Get theoretical m/z values
# # Add noise peaks
# # Combine and assign intensities
# # Sort by m/z

# TODO: Match theoretical fragments to experimental peaks
# tolerance = 0.5
# matches = []
# # Loop through fragments and find matches
# # Store matched ions with their m/z values

# TODO: Create annotated spectrum plot
# fig, ax = plt.subplots(figsize=(12, 6))
# # Plot experimental peaks
# # Mark matched fragments
# # Mark unmatched fragments
# # Add labels and title
# plt.show()

# TODO: Print summary of matches


## Exercise 5: Predict MS/MS Spectra using Deep Learning (Prosit)

**Aim**: Demonstrate how deep learning models such as Prosit predict fragment-ion intensities for MS/MS spectra. Plot simulated intensities at the theoretical fragment m/z values.

**Tasks**:
1. Define a peptide and charge state
2. Calculate theoretical fragments (b and y ions)
3. Simulate Prosit-style intensity predictions:
   - Assign higher intensities to b and y ions
   - Use random exponential distribution for intensities
4. Create a Prosit API payload structure (JSON format)
5. Plot the predicted spectrum with fragment annotations
6. Annotate the top 5 most intense peaks

**Note**: Real Prosit API requires authentication. This exercise demonstrates the concept.

**Hint**: Use `np.random.exponential()` for intensity simulation and `ax.annotate()` for peak labels.
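
The intensity simulation and top-5 selection in tasks 3 and 6 might look like the sketch below. It uses NumPy's `default_rng` generator in place of the legacy `np.random.exponential()` call, and the fragment labels are placeholders rather than ions computed from a peptide. The simulated values are random numbers, not Prosit output.

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder fragment labels; in the exercise these come from the theoretical b/y ions
labels = [f"b{i}" for i in range(1, 7)] + [f"y{i}" for i in range(1, 7)]

# Draw exponential intensities and scale so the most intense peak equals 1.0
raw = rng.exponential(scale=1.0, size=len(labels))
intensities = dict(zip(labels, raw / raw.max()))

# Top 5 most intense fragments, highest first
top5 = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)[:5]
for label, inten in top5:
    print(f"{label}: {inten:.2f}")
```

Each `(label, intensity)` pair in `top5` can then be fed to `ax.annotate()` at the matching fragment's m/z.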


In [None]:
# Exercise 5: Predict MS/MS Spectrum (Prosit-style)
peptide = "PEPTIDE"
charge = 2

# TODO: Calculate theoretical fragments
# fragments = ...

# TODO: Simulate predicted intensities
# np.random.seed(42)
# predicted_intensities = {}
# # Assign intensities to each fragment
# # Higher intensities for b and y ions

# TODO: Create Prosit API payload structure
# prosit_payload = {
#     "peptide_sequences": [...],
#     "charges": [...],
#     "collision_energies": [...],
#     "instrument_types": [...]
# }
# print(json.dumps(prosit_payload, indent=2))

# TODO: Plot predicted spectrum
# fig, ax = plt.subplots(figsize=(12, 6))
# # Plot fragments with intensities
# # Annotate top 5 peaks
# plt.show()


## Exercise 6: Quantitative Proteomics - Volcano Plot

**Aim**: Learn how to create a volcano plot for quantitative proteomics data (e.g., from label-free quantification or TMT experiments).

**Tasks**:
1. Generate synthetic differential expression data:
   - log2 fold changes (most near 0, some significant)
   - p-values (lower for significant changes)
2. Calculate -log10(p-value) for each protein
3. Define significance thresholds:
   - Fold change threshold: |logFC| > 1.0
   - p-value threshold: p < 0.05
4. Classify proteins as:
   - Significant upregulated
   - Significant downregulated
   - Not significant
5. Create volcano plot:
   - x-axis: log2 fold change
   - y-axis: -log10(p-value)
   - Color points by significance category
   - Add threshold lines (vertical and horizontal)
6. Print summary statistics

**Hint**: Use boolean indexing for classification and `ax.axvline()`/`ax.axhline()` for threshold lines.
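
The classification in tasks 3 and 4 comes down to combining two boolean masks. A minimal sketch on a five-row hand-made table follows (the values are invented, and the column names `logFC`, `pvalue`, and `category` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Tiny invented table for illustration (not real measurements)
df = pd.DataFrame({
    "logFC":  [2.0, -1.5, 0.2, 1.8, -0.3],
    "pvalue": [0.001, 0.01, 0.5, 0.2, 0.04],
})
df["neglog10p"] = -np.log10(df["pvalue"])

fc_threshold = 1.0
p_threshold = 0.05
neglog10p_threshold = -np.log10(p_threshold)  # height of the horizontal line

# A protein is significant only if it passes BOTH thresholds
significant = (df["logFC"].abs() > fc_threshold) & (df["pvalue"] < p_threshold)

df["category"] = "not significant"
df.loc[significant & (df["logFC"] > 0), "category"] = "up"
df.loc[significant & (df["logFC"] < 0), "category"] = "down"

print(df)
print(df["category"].value_counts())
```

Rows 4 and 5 show why both conditions matter: a large fold change with a weak p-value, or a small p-value with a negligible fold change, stays in the "not significant" group.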


In [None]:
# Exercise 6: Quantitative Proteomics Volcano Plot

np.random.seed(42)
n_proteins = 5000

# Simulate log2 fold changes (most proteins unchanged, some differentially expressed)
logFC = np.random.normal(0, 0.5, n_proteins)
# Add some significant changes
n_sig = 200
logFC[:n_sig] = np.random.choice([-1, 1], n_sig) * np.random.uniform(0.5, 2.5, n_sig)

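# TODO: Simulate p-values (smaller for the first n_sig proteins) and build the DataFrame
# pvals = ...
# df_diffexp = pd.DataFrame({"logFC": logFC, "pvalue": pvals})

# TODO: Calculate -log10(p-value)
# df_diffexp["neglog10p"] = ...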

# TODO: Define significance thresholds
# fc_threshold = 1.0
# p_threshold = 0.05
# neglog10p_threshold = ...

# TODO: Classify proteins
# df_diffexp["significant"] = ...
# df_diffexp["upregulated"] = ...
# df_diffexp["downregulated"] = ...

# TODO: Create volcano plot
# fig, ax = plt.subplots(figsize=(10, 8))
# # Plot non-significant (gray)
# # Plot upregulated (red)
# # Plot downregulated (blue)
# # Add threshold lines
# # Add labels and legend
# plt.show()

# TODO: Print summary statistics


## Exercise 9: Variant Peptide Generation

**Aim**: Demonstrate proteogenomics by applying genetic variants to protein sequences, performing digestion, and identifying variant-specific peptides.

**Tasks**:
1. Define a wild-type protein sequence
2. Define variants as a list of tuples: (position, new_amino_acid), with 1-based positions
3. Perform tryptic digestion of wild-type sequence
4. For each variant:
   - Create mutated sequence
   - Perform tryptic digestion
   - Identify variant-specific peptides (peptides not in wild-type)
5. Calculate masses for all variant peptides
6. Create a DataFrame with variant peptide information
7. Visualize:
   - Peptide length distribution
   - Peptide mass distribution
8. Display variant-specific peptides

**Hint**: Use set operations to find peptides unique to variants: `set(mut_peptides) - set(wt_peptides)`.
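
The mutate, digest, and compare loop can be tried on a toy sequence first. This sketch assumes 1-based variant positions and uses a simplified trypsin rule (cleave after K or R unless followed by P) as a stand-in for `parser.cleave()`; `apply_variant` and `simple_trypsin` are hypothetical helper names.

```python
import re


def simple_trypsin(seq, min_length=6):
    """Simplified trypsin digestion returning a set of peptides (no missed cleavages)."""
    pieces = re.split(r"(?<=[KR])(?!P)", seq)
    return {p for p in pieces if len(p) >= min_length}


def apply_variant(seq, position, new_aa):
    """Substitute one residue; `position` is 1-based, as in protein variant notation."""
    idx = position - 1
    return seq[:idx] + new_aa + seq[idx + 1:]


wild_type = "AAAAAAKGGGGGGRLLLLLLK"  # toy sequence for illustration
wt_peptides = simple_trypsin(wild_type)

# G10A changes a residue inside a peptide; R14H removes a cleavage site
variants = [(10, "A"), (14, "H")]
results = {}
for pos, aa_mut in variants:
    mutant = apply_variant(wild_type, pos, aa_mut)
    results[(pos, aa_mut)] = simple_trypsin(mutant) - wt_peptides

for variant, peptides in results.items():
    print(variant, sorted(peptides))
```

The second variant illustrates a point worth checking on real data: substituting away a K or R merges two wild-type peptides into one longer variant-specific peptide.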


In [None]:
# Exercise 9: Variant Peptide Generation
sequence = "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD"

# TODO: Define variants
# variants = [
#     (72, "R"),
#     (175, "H"),
#     (248, "W"),
# ]

# TODO: Digest wild-type sequence
# wt_peptides = ...

# TODO: Process each variant
# rows = []
# for pos, aa_mut in variants:
#     # Create mutated sequence
#     # Digest mutated sequence
#     # Find variant-specific peptides
#     # Calculate masses and store information

# TODO: Create DataFrame
# df_variants = ...

# TODO: Visualize distributions
# fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# # Plot length distribution
# # Plot mass distribution
# plt.show()

# TODO: Display variant-specific peptides
