# 🧬 Proteomics and Metabolomics: Hands-on Practice

## Table of Contents
1. [Mass Spectrometry Data Simulation](#practice-1-mass-spectrometry-data-simulation)
2. [Peptide Mass Calculation](#practice-2-peptide-mass-calculation)
3. [Protein Identification Scoring](#practice-3-protein-identification-scoring)
4. [Quantitative Proteomics Analysis](#practice-4-quantitative-proteomics-analysis)
5. [Metabolite Peak Detection](#practice-5-metabolite-peak-detection)
6. [Pathway Enrichment Analysis](#practice-6-pathway-enrichment-analysis)
7. [Biomarker Discovery Workflow](#practice-7-biomarker-discovery-workflow)
8. [Integration: Multi-omics Data Visualization](#practice-8-integration---multi-omics-data-visualization)

## Installing and Importing Essential Libraries

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import signal, stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
sns.set_style('whitegrid')
sns.set_palette('husl')

print("✅ All libraries loaded successfully!")
print("🧬 Ready for proteomics and metabolomics analysis!")

---
## Practice 1: Mass Spectrometry Data Simulation

### 🎯 Learning Objectives
- Understand the structure of MS data
- Simulate mass spectra with peaks
- Visualize m/z (mass-to-charge ratio) vs intensity

### 📖 Key Concepts
**Mass Spectrum:** A plot showing the abundance (intensity) of ions at different mass-to-charge (m/z) ratios
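
As a small illustration of this definition, a spectrum can be held as two parallel arrays, one of m/z values and one of intensities. The sketch below rescales raw intensities so that the base peak (the most intense ion) reads 100, which is how relative intensity is usually reported. The function name `to_relative_intensity` and the four-peak toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def to_relative_intensity(intensities):
    """Scale intensities so the most intense peak (base peak) equals 100."""
    intensities = np.asarray(intensities, dtype=float)
    return 100.0 * intensities / intensities.max()

# Toy spectrum (made-up values): parallel arrays of m/z and raw intensity
mz = np.array([400.2, 512.3, 785.4, 1021.5])
raw_intensity = np.array([1200.0, 4800.0, 2400.0, 600.0])

relative = to_relative_intensity(raw_intensity)
base_peak_mz = mz[np.argmax(raw_intensity)]

for m, r in zip(mz, relative):
    print(f"m/z {m:7.1f}  relative intensity {r:5.1f}")
print(f"Base peak at m/z {base_peak_mz}")
```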

In [None]:
# 1.1 Simulate a simple mass spectrum
def simulate_mass_spectrum(num_peaks=10, noise_level=0.1):
    """
    Simulate a mass spectrum with peaks
    
    Parameters:
    - num_peaks: Number of peptide peaks
    - noise_level: Baseline noise level
    """
    np.random.seed(42)
    
    # Generate the m/z axis (300-2000, a typical range for peptide ions)
    mz_range = np.linspace(300, 2000, 2000)
    
    # Initialize spectrum with baseline noise
    spectrum = np.random.exponential(noise_level, len(mz_range))
    
    # Add peptide peaks
    peak_positions = np.random.uniform(400, 1800, num_peaks)
    peak_intensities = np.random.uniform(5, 50, num_peaks)
    peak_widths = np.random.uniform(0.5, 2.0, num_peaks)
    
    for pos, intensity, width in zip(peak_positions, peak_intensities, peak_widths):
        # Add Gaussian peaks
        peak = intensity * np.exp(-((mz_range - pos) ** 2) / (2 * width ** 2))
        spectrum += peak
    
    # Create DataFrame
    ms_data = pd.DataFrame({
        'm/z': mz_range,
        'Intensity': spectrum
    })
    
    return ms_data, peak_positions, peak_intensities

# Generate and visualize
ms_data, peak_pos, peak_int = simulate_mass_spectrum(num_peaks=15)

print("📊 Mass Spectrum Generated")
print(f"   - m/z range: {ms_data['m/z'].min():.1f} - {ms_data['m/z'].max():.1f}")
print(f"   - Number of data points: {len(ms_data)}")
print(f"   - Number of peaks: {len(peak_pos)}")
print(f"   - Max intensity: {ms_data['Intensity'].max():.2f}")

# Plot
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.plot(ms_data['m/z'], ms_data['Intensity'], linewidth=0.8, color='#1E64C8')
plt.xlabel('m/z (Mass-to-Charge Ratio)', fontsize=12, fontweight='bold')
plt.ylabel('Relative Intensity', fontsize=12, fontweight='bold')
plt.title('🔬 Simulated Mass Spectrum (MS1)', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(peak_pos, peak_int, s=100, c='#FF6B6B', alpha=0.6, edgecolors='black', linewidth=1.5)
plt.xlabel('Peak m/z', fontsize=12, fontweight='bold')
plt.ylabel('Peak Intensity', fontsize=12, fontweight='bold')
plt.title('📍 Simulated Peptide Peaks', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ Visualization complete!")

---
## Practice 2: Peptide Mass Calculation

### 🎯 Learning Objectives
- Calculate theoretical peptide masses
- Understand amino acid composition
- Match theoretical to experimental masses

### 📖 Key Concepts
**Monoisotopic Mass:** The mass calculated using the most abundant isotope of each element
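
To connect a monoisotopic mass to what the instrument reports, a neutral mass M observed at positive charge z appears at m/z = (M + z × proton mass) / z. The sketch below is a minimal illustration under that assumption. The helper names `mz_from_mass` and `mass_from_mz` are hypothetical, and the proton mass is taken as 1.007276 Da. The Gly-Gly dipeptide is a hand-checkable case: two glycine residues plus one water.

```python
PROTON_MASS = 1.007276  # Da; mass of a proton (not a hydrogen atom)

def mz_from_mass(neutral_mass, charge):
    """m/z of a positively charged ion formed by adding `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge

def mass_from_mz(mz, charge):
    """Invert mz_from_mass: recover the neutral mass from an observed m/z."""
    return mz * charge - charge * PROTON_MASS

# Gly-Gly: two glycine residues (57.021464 Da each) plus one water (18.010565 Da)
gly_gly_mass = 2 * 57.021464 + 18.010565

print(f"Gly-Gly neutral mass: {gly_gly_mass:.6f} Da")
print(f"  [M+H]+   m/z: {mz_from_mass(gly_gly_mass, 1):.6f}")
print(f"  [M+2H]2+ m/z: {mz_from_mass(gly_gly_mass, 2):.6f}")
```

Inverting the relation is useful when matching experimental peaks: given an observed m/z and an assumed charge, `mass_from_mz` returns the neutral mass to compare against the theoretical value.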

In [None]:
# 2.1 Amino acid mass table and peptide mass calculator
def calculate_peptide_mass(sequence):
    """
    Calculate the monoisotopic mass of a peptide
    
    Parameters:
    - sequence: Amino acid sequence (e.g., 'PEPTIDE')
    """
    # Monoisotopic masses of amino acids (in Da)
    aa_masses = {
        'A': 71.037114,  'C': 103.009185, 'D': 115.026943, 'E': 129.042593,
        'F': 147.068414, 'G': 57.021464,  'H': 137.058912, 'I': 113.084064,
        'K': 128.094963, 'L': 113.084064, 'M': 131.040485, 'N': 114.042927,
        'P': 97.052764,  'Q': 128.058578, 'R': 156.101111, 'S': 87.032028,
        'T': 101.047679, 'V': 99.068414,  'W': 186.079313, 'Y': 163.063329
    }
    
    # H2O mass: residue masses exclude water, so add one H2O for the terminal H and OH
    water_mass = 18.010565
    
    # Calculate mass
    total_mass = water_mass  # Start with water
    for aa in sequence.upper():
        if aa in aa_masses:
            total_mass += aa_masses[aa]
        else:
            print(f"⚠️  Unknown amino acid: {aa}")
    
    return total_mass

# Example peptides from trypsin digestion
peptides = [
    'PEPTIDER',      # Example peptide 1
    'MVHLTPEEK',     # From hemoglobin
    'LFTGHPETLEK',   # From myoglobin
    'FLASVSTVLTSK',  # From hemoglobin
]

print("🧮 Peptide Mass Calculator")
print("=" * 60)

peptide_data = []
for peptide in peptides:
    mass = calculate_peptide_mass(peptide)
    peptide_data.append({
        'Sequence': peptide,
        'Length': len(peptide),
        'Mass (Da)': mass,
        'm/z (z=1)': mass + 1.007276,  # Add one proton (1.007276 Da, not the 1.007825 Da hydrogen atom)
        'm/z (z=2)': (mass + 2 * 1.007276) / 2  # Doubly charged
    })

df_peptides = pd.DataFrame(peptide_data)
print(df_peptides.to_string(index=False))
print("\n✅ Mass calculations complete!")

# Visualize charge states
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.barh(df_peptides['Sequence'], df_peptides['Mass (Da)'], color='#4CAF50', alpha=0.7, edgecolor='black')
ax1.set_xlabel('Monoisotopic Mass (Da)', fontsize=12, fontweight='bold')
ax1.set_title('💚 Peptide Masses', fontsize=14, fontweight='bold')
ax1.grid(axis='x', alpha=0.3)

width = 0.35
x = np.arange(len(df_peptides))
ax2.bar(x - width/2, df_peptides['m/z (z=1)'], width, label='z=1 (singly charged)', color='#2196F3', alpha=0.7, edgecolor='black')
ax2.bar(x + width/2, df_peptides['m/z (z=2)'], width, label='z=2 (doubly charged)', color='#FF9800', alpha=0.7, edgecolor='black')
ax2.set_xlabel('Peptide', fontsize=12, fontweight='bold')
ax2.set_ylabel('m/z', fontsize=12, fontweight='bold')
ax2.set_title('⚡ Charge State Comparison', fontsize=14, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(range(1, len(df_peptides)+1))
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

---
## Practice 3: Protein Identification Scoring

### 🎯 Learning Objectives
- Understand peptide-spectrum matching (PSM)
- Calculate basic identification scores
- Apply false discovery rate (FDR) filtering

### 📖 Key Concepts
**FDR (False Discovery Rate):** The expected proportion of false positives among identified peptides
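
A common way to turn this idea into a per-PSM confidence is the q-value: the smallest estimated FDR at which a given PSM would still be accepted. The sketch below assumes the simple target-decoy estimate FDR ≈ (# decoys) / (# targets) above a score cutoff and ignores score ties. The function name `target_decoy_qvalues` and the five-PSM toy input are illustrative assumptions.

```python
import numpy as np

def target_decoy_qvalues(scores, is_decoy):
    """Estimate q-values from PSM scores and decoy flags (higher score = better)."""
    scores = np.asarray(scores, dtype=float)
    is_decoy = np.asarray(is_decoy, dtype=bool)

    # Walk down the list from best to worst score
    order = np.argsort(-scores, kind="stable")
    n_decoys = np.cumsum(is_decoy[order])
    n_targets = np.cumsum(~is_decoy[order])

    # Running FDR estimate at each position (guard against division by zero)
    fdr = n_decoys / np.maximum(n_targets, 1)

    # q-value: minimum FDR over this cutoff and every more permissive one
    q_sorted = np.minimum.accumulate(fdr[::-1])[::-1]

    # Return q-values in the original input order
    q_values = np.empty_like(q_sorted)
    q_values[order] = q_sorted
    return q_values

# Toy example (made-up scores): the third and fifth PSMs are decoys
toy_scores = [90, 80, 70, 60, 50]
toy_is_decoy = [False, False, True, False, True]
toy_q = target_decoy_qvalues(toy_scores, toy_is_decoy)
print(np.round(toy_q, 3))
```

Unlike the raw running FDR, q-values never decrease as the score gets worse, so filtering at `q <= 0.01` gives a single well-defined cutoff.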

In [None]:
# 3.1 Simulate peptide-spectrum matches with scores
def simulate_psm_data(n_target=200, n_decoy=50):
    """
    Simulate PSM (Peptide-Spectrum Match) data with target and decoy matches
    """
    np.random.seed(42)
    
    # Target PSMs (real matches) - higher scores
    target_scores = np.random.beta(8, 2, n_target) * 100  # Skewed towards high scores
    target_data = pd.DataFrame({
        'PSM_ID': [f'T_{i:04d}' for i in range(n_target)],
        'Score': target_scores,
        'Type': 'Target'
    })
    
    # Decoy PSMs (false matches) - lower scores
    decoy_scores = np.random.beta(2, 5, n_decoy) * 100  # Skewed towards low scores
    decoy_data = pd.DataFrame({
        'PSM_ID': [f'D_{i:04d}' for i in range(n_decoy)],
        'Score': decoy_scores,
        'Type': 'Decoy'
    })
    
    # Combine and sort by score
    psm_data = pd.concat([target_data, decoy_data], ignore_index=True)
    psm_data = psm_data.sort_values('Score', ascending=False).reset_index(drop=True)
    
    return psm_data

# 3.2 Calculate FDR
def calculate_fdr(psm_data, score_threshold):
    """
    Calculate FDR at a given score threshold
    FDR = (# Decoy hits) / (# Target hits)
    """
    filtered = psm_data[psm_data['Score'] >= score_threshold]
    n_decoy = (filtered['Type'] == 'Decoy').sum()
    n_target = (filtered['Type'] == 'Target').sum()
    
    if n_target == 0:
        return 0, 0, 0
    
    fdr = n_decoy / n_target
    return fdr, n_target, n_decoy

# Generate data
psm_data = simulate_psm_data(n_target=200, n_decoy=50)

print("🔍 Protein Identification Scoring")
print("=" * 60)
print(f"Total PSMs: {len(psm_data)}")
print(f"Target PSMs: {(psm_data['Type'] == 'Target').sum()}")
print(f"Decoy PSMs: {(psm_data['Type'] == 'Decoy').sum()}")

# Calculate FDR at different thresholds
thresholds = [50, 60, 70, 80, 90]
print("\n📊 FDR at Different Score Thresholds:")
print("-" * 60)

fdr_results = []
for thresh in thresholds:
    fdr, n_target, n_decoy = calculate_fdr(psm_data, thresh)
    fdr_results.append({
        'Threshold': thresh,
        'FDR': fdr,
        'Target_PSMs': n_target,
        'Decoy_PSMs': n_decoy
    })
    print(f"  Score ≥ {thresh:3d}: FDR = {fdr:.3f} ({n_target} targets, {n_decoy} decoys)")

# Find threshold for 1% FDR
score_range = np.linspace(psm_data['Score'].min(), psm_data['Score'].max(), 100)
fdr_curve = [calculate_fdr(psm_data, s)[0] for s in score_range]
threshold_1pct = score_range[np.where(np.array(fdr_curve) <= 0.01)[0][0]] if any(np.array(fdr_curve) <= 0.01) else None

if threshold_1pct is not None:
    print(f"\n✅ Score threshold for 1% FDR: {threshold_1pct:.2f}")
else:
    print("\n⚠️  Cannot achieve 1% FDR with current data")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: Score distributions
axes[0].hist(psm_data[psm_data['Type']=='Target']['Score'], bins=30, alpha=0.7, 
             label='Target', color='#4CAF50', edgecolor='black')
axes[0].hist(psm_data[psm_data['Type']=='Decoy']['Score'], bins=30, alpha=0.7, 
             label='Decoy', color='#F44336', edgecolor='black')
axes[0].set_xlabel('PSM Score', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[0].set_title('📊 Score Distributions', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Plot 2: FDR curve
axes[1].plot(score_range, fdr_curve, linewidth=2, color='#2196F3')
axes[1].axhline(y=0.01, color='red', linestyle='--', linewidth=2, label='1% FDR')
if threshold_1pct:
    axes[1].axvline(x=threshold_1pct, color='green', linestyle='--', linewidth=2, label=f'Threshold: {threshold_1pct:.1f}')
axes[1].set_xlabel('Score Threshold', fontsize=12, fontweight='bold')
axes[1].set_ylabel('FDR', fontsize=12, fontweight='bold')
axes[1].set_title('📈 FDR vs Score Threshold', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

# Plot 3: Number of identifications vs FDR
n_targets = [calculate_fdr(psm_data, s)[1] for s in score_range]
axes[2].plot(fdr_curve, n_targets, linewidth=2, color='#9C27B0', marker='o', markersize=3)
axes[2].axvline(x=0.01, color='red', linestyle='--', linewidth=2, label='1% FDR')
axes[2].set_xlabel('FDR', fontsize=12, fontweight='bold')
axes[2].set_ylabel('Number of Target PSMs', fontsize=12, fontweight='bold')
axes[2].set_title('🎯 PSMs vs FDR', fontsize=14, fontweight='bold')
axes[2].legend()
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ Analysis complete!")

---
## Practice 4: Quantitative Proteomics Analysis

### 🎯 Learning Objectives
- Compare label-free quantification methods
- Analyze protein abundance changes
- Identify differentially expressed proteins

### 📖 Key Concepts
**Label-Free Quantification (LFQ):** Comparing protein abundances without isotope labeling
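
Because label-free runs are acquired separately, samples can differ in total signal for purely technical reasons (for example, loading amount). One common remedy, shown here as a hedged sketch rather than the notebook's own method, is to log2-transform intensities and shift each sample so that all sample medians agree before computing fold changes. The function names and the toy matrix are illustrative assumptions. In the toy data the treatment samples carry twice the overall signal, and only the first protein is truly changed (4-fold).

```python
import numpy as np

def median_normalize_log2(intensities):
    """Log2-transform a proteins x samples matrix and align sample medians."""
    log_int = np.log2(np.asarray(intensities, dtype=float))
    sample_medians = np.median(log_int, axis=0)
    # Shift every column so its median equals the average of all medians
    return log_int - sample_medians + sample_medians.mean()

def log2_fold_change(norm_log, control_idx, treatment_idx):
    """Mean log2 difference (treatment minus control) per protein."""
    return norm_log[:, treatment_idx].mean(axis=1) - norm_log[:, control_idx].mean(axis=1)

# Toy matrix: 5 proteins x 4 samples (2 control, 2 treatment), made-up values
toy_intensities = np.array([
    [100,   100,  800,  800],   # truly 4-fold up in treatment
    [200,   200,  400,  400],
    [400,   400,  800,  800],
    [800,   800, 1600, 1600],
    [1600, 1600, 3200, 3200],
])

toy_norm = median_normalize_log2(toy_intensities)
toy_lfc = log2_fold_change(toy_norm, [0, 1], [2, 3])
print(np.round(toy_lfc, 3))
```

Without normalization every protein in the toy data would show an apparent log2 fold change of at least 1. After aligning medians, only the first protein stands out.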

In [None]:
# 4.1 Simulate quantitative proteomics data
def simulate_proteomics_data(n_proteins=100, n_samples_per_group=6):
    """
    Simulate LFQ intensities for control vs treatment
    """
    np.random.seed(42)
    
    protein_names = [f'Protein_{i:03d}' for i in range(n_proteins)]
    
    # Control group
    control_data = np.random.lognormal(mean=10, sigma=1.5, size=(n_proteins, n_samples_per_group))
    
    # Treatment group (some proteins change)
    treatment_data = control_data.copy()
    
    # Make 20 proteins upregulated
    upregulated = np.random.choice(n_proteins, 20, replace=False)
    for idx in upregulated:
        treatment_data[idx] *= np.random.uniform(1.5, 3.0)
    
    # Make 20 proteins downregulated
    downregulated = np.random.choice([i for i in range(n_proteins) if i not in upregulated], 20, replace=False)
    for idx in downregulated:
        treatment_data[idx] *= np.random.uniform(0.3, 0.7)
    
    # Create DataFrame
    control_cols = [f'Control_{i+1}' for i in range(n_samples_per_group)]
    treatment_cols = [f'Treatment_{i+1}' for i in range(n_samples_per_group)]
    
    df = pd.DataFrame(
        np.column_stack([control_data, treatment_data]),
        columns=control_cols + treatment_cols,
        index=protein_names
    )
    
    # Add metadata
    df['True_Status'] = 'Unchanged'
    df.loc[[f'Protein_{i:03d}' for i in upregulated], 'True_Status'] = 'Upregulated'
    df.loc[[f'Protein_{i:03d}' for i in downregulated], 'True_Status'] = 'Downregulated'
    
    return df, control_cols, treatment_cols

# Generate data
proteomics_df, ctrl_cols, trt_cols = simulate_proteomics_data()

print("🧪 Quantitative Proteomics Data Generated")
print("=" * 60)
print(f"Number of proteins: {len(proteomics_df)}")
print(f"Control samples: {len(ctrl_cols)}")
print(f"Treatment samples: {len(trt_cols)}")
print(f"\nTrue differential expression:")
print(proteomics_df['True_Status'].value_counts())

# Calculate statistics
proteomics_df['Control_Mean'] = proteomics_df[ctrl_cols].mean(axis=1)
proteomics_df['Treatment_Mean'] = proteomics_df[trt_cols].mean(axis=1)
proteomics_df['Log2_FC'] = np.log2(proteomics_df['Treatment_Mean'] / proteomics_df['Control_Mean'])

# T-test
p_values = []
for idx in proteomics_df.index:
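    # Cast to float: selecting a row from a mixed-dtype DataFrame can yield object dtype
    ctrl = proteomics_df.loc[idx, ctrl_cols].astype(float)
    trt = proteomics_df.loc[idx, trt_cols].astype(float)
    _, p = stats.ttest_ind(ctrl, trt)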
    p_values.append(p)

proteomics_df['P_value'] = p_values
proteomics_df['-log10(P)'] = -np.log10(proteomics_df['P_value'])

# Classify by statistical cutoffs
proteomics_df['Significant'] = 'Not Significant'
proteomics_df.loc[(proteomics_df['Log2_FC'] > 1) & (proteomics_df['P_value'] < 0.05), 'Significant'] = 'Upregulated'
proteomics_df.loc[(proteomics_df['Log2_FC'] < -1) & (proteomics_df['P_value'] < 0.05), 'Significant'] = 'Downregulated'

print("\n📊 Statistical Analysis Results:")
print("-" * 60)
print(proteomics_df['Significant'].value_counts())

# Visualize: Volcano plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Volcano plot
colors = {'Not Significant': 'gray', 'Upregulated': '#F44336', 'Downregulated': '#2196F3'}
for sig_type, color in colors.items():
    data = proteomics_df[proteomics_df['Significant'] == sig_type]
    axes[0].scatter(data['Log2_FC'], data['-log10(P)'], 
                   c=color, label=sig_type, alpha=0.6, s=50, edgecolors='black', linewidth=0.5)

axes[0].axhline(y=-np.log10(0.05), color='red', linestyle='--', linewidth=1.5, label='P = 0.05')
axes[0].axvline(x=1, color='green', linestyle='--', linewidth=1.5, label='Log2FC = ±1')
axes[0].axvline(x=-1, color='green', linestyle='--', linewidth=1.5)
axes[0].set_xlabel('Log2 Fold Change', fontsize=13, fontweight='bold')
axes[0].set_ylabel('-Log10 P-value', fontsize=13, fontweight='bold')
axes[0].set_title('🌋 Volcano Plot', fontsize=15, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Heatmap of top 20 differentially expressed
top_de = proteomics_df[proteomics_df['Significant'] != 'Not Significant'].nsmallest(20, 'P_value')
heatmap_data = top_de[ctrl_cols + trt_cols]
heatmap_data_log = np.log2(heatmap_data + 1)

im = axes[1].imshow(heatmap_data_log, aspect='auto', cmap='RdBu_r')
axes[1].set_yticks(range(len(top_de)))
axes[1].set_yticklabels(top_de.index, fontsize=8)
axes[1].set_xticks(range(len(ctrl_cols + trt_cols)))
axes[1].set_xticklabels(ctrl_cols + trt_cols, rotation=45, ha='right', fontsize=9)
axes[1].set_title('🔥 Top 20 Differentially Expressed Proteins', fontsize=15, fontweight='bold')
plt.colorbar(im, ax=axes[1], label='Log2 Intensity')

plt.tight_layout()
plt.show()

print("\n✅ Quantitative analysis complete!")

---
## Practice 5: Metabolite Peak Detection

### 🎯 Learning Objectives
- Detect peaks in LC-MS chromatograms
- Extract peak features (m/z, RT, intensity)
- Understand signal processing basics

### 📖 Key Concepts
**Extracted Ion Chromatogram (XIC):** Intensity vs retention time for a specific m/z range
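
To make the definition concrete, the sketch below builds an XIC from a list of scans by summing, in each scan, the intensities whose m/z falls within a ppm tolerance of a target value. The scan layout (tuples of retention time, m/z list, intensity list), the function name `extract_xic`, and the toy values are assumptions for illustration. Real data would typically be read from a vendor or mzML file with a dedicated library.

```python
def extract_xic(scans, target_mz, ppm_tol=10.0):
    """Sum intensities within +/- ppm_tol of target_mz for each scan.

    scans: iterable of (retention_time, mz_values, intensities) tuples.
    Returns (retention_times, xic_intensities) as two lists.
    """
    tol = target_mz * ppm_tol * 1e-6  # absolute tolerance in m/z units
    retention_times, xic = [], []
    for rt, mz_values, intensities in scans:
        in_window = sum(
            inten for mz, inten in zip(mz_values, intensities)
            if abs(mz - target_mz) <= tol
        )
        retention_times.append(rt)
        xic.append(float(in_window))
    return retention_times, xic

# Toy scans (made-up values): only the first two contain an ion near 180.0634
toy_scans = [
    (1.0, [180.0630, 200.0000], [50.0, 10.0]),
    (1.1, [180.0634, 250.0000], [200.0, 5.0]),
    (1.2, [180.1000], [999.0]),
]

toy_rts, toy_xic = extract_xic(toy_scans, target_mz=180.0634, ppm_tol=10.0)
for rt, inten in zip(toy_rts, toy_xic):
    print(f"RT {rt:.1f} min  XIC intensity {inten:.1f}")
```

The third scan contributes zero because its only ion lies about 0.037 m/z units away from the target, far outside the roughly 0.0018 window that 10 ppm allows at this m/z.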

In [None]:
# 5.1 Simulate LC-MS chromatogram
def simulate_chromatogram(n_peaks=8, noise_level=0.5, duration=20):
    """
    Simulate an LC-MS chromatogram (retention time vs intensity)
    """
    np.random.seed(42)
    
    # Time points (retention time in minutes)
    time = np.linspace(0, duration, 1000)
    
    # Baseline + noise
    signal = np.random.normal(noise_level, noise_level * 0.3, len(time))
    
    # Add metabolite peaks (Gaussian)
    peak_rts = np.random.uniform(2, duration-2, n_peaks)
    peak_intensities = np.random.uniform(5, 30, n_peaks)
    peak_widths = np.random.uniform(0.15, 0.4, n_peaks)
    
    for rt, intensity, width in zip(peak_rts, peak_intensities, peak_widths):
        peak = intensity * np.exp(-((time - rt) ** 2) / (2 * width ** 2))
        signal += peak
    
    return time, signal, peak_rts, peak_intensities

# Generate chromatogram
time, signal, true_rts, true_ints = simulate_chromatogram(n_peaks=10)

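# Detect peaks using scipy. Import find_peaks directly: the name `signal` now
# holds the chromatogram array and no longer refers to the scipy.signal module.
from scipy.signal import find_peaks
peaks, properties = find_peaks(signal, height=2, prominence=1.5, width=5)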

print("📈 LC-MS Chromatogram Analysis")
print("=" * 60)
print(f"Retention time range: 0 - {time[-1]:.1f} minutes")
print(f"True number of peaks: {len(true_rts)}")
print(f"Detected peaks: {len(peaks)}")

# Extract peak features
peak_data = []
for i, peak_idx in enumerate(peaks):
    peak_data.append({
        'Peak_ID': i + 1,
        'Retention_Time (min)': time[peak_idx],
        'Intensity': signal[peak_idx],
        'Width': properties['widths'][i] * (time[1] - time[0]),
        'Prominence': properties['prominences'][i]
    })

df_peaks = pd.DataFrame(peak_data)
print("\n🎯 Detected Peak Features:")
print(df_peaks.to_string(index=False))

# Visualize
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Plot 1: Full chromatogram
axes[0].plot(time, signal, linewidth=1.5, color='#1E64C8', label='XIC Signal')
axes[0].plot(time[peaks], signal[peaks], 'ro', markersize=10, label=f'Detected Peaks (n={len(peaks)})')
axes[0].scatter(true_rts, true_ints, marker='x', s=200, c='green', linewidths=3, label='True Peaks', zorder=5)
axes[0].set_xlabel('Retention Time (min)', fontsize=13, fontweight='bold')
axes[0].set_ylabel('Intensity', fontsize=13, fontweight='bold')
axes[0].set_title('📊 Extracted Ion Chromatogram (XIC)', fontsize=15, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Plot 2: Peak features
colors_map = plt.cm.viridis(np.linspace(0, 1, len(df_peaks)))
axes[1].scatter(df_peaks['Retention_Time (min)'], df_peaks['Intensity'], 
               s=df_peaks['Width']*500, c=colors_map, alpha=0.6, edgecolors='black', linewidth=2)
axes[1].set_xlabel('Retention Time (min)', fontsize=13, fontweight='bold')
axes[1].set_ylabel('Peak Intensity', fontsize=13, fontweight='bold')
axes[1].set_title('🎨 Peak Features (size = peak width)', fontsize=15, fontweight='bold')
axes[1].grid(alpha=0.3)

# Add annotations
for i, row in df_peaks.iterrows():
    if row['Intensity'] > 10:
        axes[1].annotate(f"P{int(row['Peak_ID'])}", 
                        xy=(row['Retention_Time (min)'], row['Intensity']),
                        xytext=(5, 5), textcoords='offset points', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n✅ Peak detection complete!")

---
## Practice 6: Pathway Enrichment Analysis

### 🎯 Learning Objectives
- Understand pathway analysis concepts
- Calculate enrichment scores
- Visualize metabolic pathways

### 📖 Key Concepts
**Pathway Enrichment:** Statistical test to determine if a set of metabolites is over-represented in biological pathways
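
The over-representation test most often used here is the hypergeometric upper tail: given a background of M metabolites, of which n belong to a pathway, what is the probability of seeing at least k pathway members among N detected metabolites by chance? The sketch below computes that tail with exact integer arithmetic from the standard library. It is the same quantity that `scipy.stats.hypergeom.sf(k - 1, M, n, N)` returns. The function name `hypergeom_tail` and the example numbers are illustrative assumptions.

```python
from math import comb

def hypergeom_tail(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(population M, successes n, draws N)."""
    upper = min(n, N)  # cannot observe more hits than pathway size or draws
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / comb(M, N)

# Example: background of 10 metabolites, 5 in the pathway, 5 detected
p_all_five = hypergeom_tail(5, 10, 5, 5)   # all detected metabolites are in the pathway
p_at_least_zero = hypergeom_tail(0, 10, 5, 5)

print(f"P(X >= 5) = {p_all_five:.6f}")
print(f"P(X >= 0) = {p_at_least_zero:.1f}")
```

Note that the background M should contain every metabolite that could have been detected and annotated. Detected features outside that background should not be counted among the N draws.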

In [None]:
# 6.1 Simulate pathway enrichment analysis
def simulate_pathway_analysis():
    """
    Simulate metabolite-pathway associations and enrichment
    """
    np.random.seed(42)
    
    # Define pathways
    pathways = {
        'Glycolysis': ['Glucose', 'Glucose-6-P', 'Fructose-6-P', 'Pyruvate', 'Lactate', 'ATP', 'NADH'],
        'TCA Cycle': ['Citrate', 'Isocitrate', 'Alpha-ketoglutarate', 'Succinate', 'Fumarate', 'Malate', 'Oxaloacetate'],
        'Fatty Acid Oxidation': ['Palmitoyl-CoA', 'Acetyl-CoA', 'FADH2', 'NADH', 'ATP'],
        'Amino Acid Metabolism': ['Glutamate', 'Glutamine', 'Alanine', 'Aspartate', 'Asparagine', 'Serine'],
        'Nucleotide Metabolism': ['AMP', 'ADP', 'ATP', 'GMP', 'GDP', 'GTP', 'UMP', 'CMP'],
        'Pentose Phosphate': ['Glucose-6-P', 'Ribulose-5-P', 'Ribose-5-P', 'NADPH'],
    }
    
    # Simulated detected metabolites (enriched in Glycolysis and TCA)
    detected_metabolites = [
        'Glucose', 'Glucose-6-P', 'Pyruvate', 'Lactate', 'ATP',  # Glycolysis
        'Citrate', 'Succinate', 'Fumarate', 'Malate',  # TCA
        'Glutamate', 'Alanine',  # Amino acids
        'Palmitoyl-CoA', 'Unknown-1', 'Unknown-2'  # Others
    ]
    
    return pathways, detected_metabolites

# 6.2 Calculate enrichment
def calculate_enrichment(pathways, detected):
    """
    Calculate pathway enrichment using hypergeometric test (simplified)
    """
    results = []
    
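    # Background: every unique metabolite annotated to at least one pathway
    background = set(m for pathway in pathways.values() for m in pathway)
    total_unique = len(background)
    # Only detected metabolites inside the background count as draws
    n_detected = len(set(detected) & background)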
    
    for pathway_name, pathway_metabolites in pathways.items():
        # Count overlap
        overlap = set(pathway_metabolites) & set(detected)
        n_overlap = len(overlap)
        n_pathway = len(pathway_metabolites)
        
        # Enrichment score (simplified)
        expected = (n_detected * n_pathway) / total_unique
        enrichment = n_overlap / expected if expected > 0 else 0
        
        # Hypergeometric p-value (using scipy)
        p_value = stats.hypergeom.sf(n_overlap - 1, total_unique, n_pathway, n_detected)
        
        results.append({
            'Pathway': pathway_name,
            'Total_in_Pathway': n_pathway,
            'Detected': n_overlap,
            'Expected': expected,
            'Enrichment': enrichment,
            'P_value': p_value,
            'Metabolites': ', '.join(sorted(overlap)) if overlap else 'None'
        })
    
    return pd.DataFrame(results).sort_values('P_value')

# Run analysis
pathways, detected = simulate_pathway_analysis()
enrichment_results = calculate_enrichment(pathways, detected)

print("🧬 Pathway Enrichment Analysis")
print("=" * 80)
print(f"Total detected metabolites: {len(detected)}")
print(f"Number of pathways analyzed: {len(pathways)}")
print("\n📊 Enrichment Results:")
print(enrichment_results[['Pathway', 'Detected', 'Expected', 'Enrichment', 'P_value']].to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Enrichment scores
enrichment_results_sorted = enrichment_results.sort_values('Enrichment', ascending=True)
colors = ['#F44336' if p < 0.05 else '#BDBDBD' for p in enrichment_results_sorted['P_value']]
axes[0].barh(enrichment_results_sorted['Pathway'], enrichment_results_sorted['Enrichment'], 
            color=colors, edgecolor='black', linewidth=1.5)
axes[0].axvline(x=1, color='black', linestyle='--', linewidth=2, label='Expected')
axes[0].set_xlabel('Enrichment Score', fontsize=13, fontweight='bold')
axes[0].set_title('📊 Pathway Enrichment Scores', fontsize=15, fontweight='bold')
axes[0].legend()
axes[0].grid(axis='x', alpha=0.3)

# Plot 2: -log10(p-value)
enrichment_results_sorted['-log10(P)'] = -np.log10(enrichment_results_sorted['P_value'] + 1e-10)
colors2 = ['#2196F3' if p < 0.05 else '#BDBDBD' for p in enrichment_results_sorted['P_value']]
axes[1].barh(enrichment_results_sorted['Pathway'], enrichment_results_sorted['-log10(P)'], 
            color=colors2, edgecolor='black', linewidth=1.5)
axes[1].axvline(x=-np.log10(0.05), color='red', linestyle='--', linewidth=2, label='P = 0.05')
axes[1].set_xlabel('-Log10(P-value)', fontsize=13, fontweight='bold')
axes[1].set_title('📈 Statistical Significance', fontsize=15, fontweight='bold')
axes[1].legend()
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ Pathway analysis complete!")
print("\n🎯 Significantly enriched pathways (P < 0.05):")
sig_pathways = enrichment_results[enrichment_results['P_value'] < 0.05]
for _, row in sig_pathways.iterrows():
    print(f"  • {row['Pathway']}: {row['Detected']} metabolites (P = {row['P_value']:.4f})")

---
## Practice 7: Biomarker Discovery Workflow

### 🎯 Learning Objectives
- Build a classification model for biomarker discovery
- Perform feature selection
- Evaluate diagnostic performance with ROC curves

### 📖 Key Concepts
**ROC Curve:** Receiver Operating Characteristic - plots sensitivity vs (1-specificity) to evaluate classifier performance
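
To show where the points on a ROC curve come from, the sketch below sweeps a decision threshold over the predicted scores, records the true and false positive rates at each step, and integrates the curve with the trapezoidal rule. It is a hand-rolled illustration under simple assumptions (binary labels, higher score means more likely positive). The function names and the four-sample toy input are hypothetical, and in practice `sklearn.metrics.roc_curve` and `auc` do this work.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by thresholding at each unique score."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()

    fpr, tpr = [0.0], [0.0]  # start at the origin (nothing predicted positive)
    for threshold in np.sort(np.unique(scores))[::-1]:
        predicted_pos = scores >= threshold
        tpr.append((predicted_pos & y_true).sum() / n_pos)    # sensitivity
        fpr.append((predicted_pos & ~y_true).sum() / n_neg)   # 1 - specificity
    return np.array(fpr), np.array(tpr)

def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example (made-up scores): one positive is ranked below one negative
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_fpr, toy_tpr = roc_points(toy_y, toy_scores)
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
print(f"AUC = {toy_auc:.2f}")
```

An AUC of 1.0 means every positive outranks every negative, while 0.5 corresponds to random ranking.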

In [None]:
# 7.1 Simulate clinical metabolomics data
def simulate_clinical_data(n_samples=100, n_metabolites=50):
    """
    Simulate metabolomics data for healthy vs disease groups
    """
    np.random.seed(42)
    
    # Generate features
    n_healthy = n_samples // 2
    n_disease = n_samples - n_healthy
    
    # Most metabolites are similar between groups
    healthy_data = np.random.normal(10, 2, (n_healthy, n_metabolites))
    disease_data = np.random.normal(10, 2, (n_disease, n_metabolites))
    
    # But 10 metabolites are discriminative biomarkers
    biomarker_indices = np.random.choice(n_metabolites, 10, replace=False)
    for idx in biomarker_indices:
        # Disease samples have different levels
        disease_data[:, idx] += np.random.uniform(3, 8)
    
    # Combine data
    X = np.vstack([healthy_data, disease_data])
    y = np.array([0] * n_healthy + [1] * n_disease)  # 0=Healthy, 1=Disease
    
    # Create feature names
    feature_names = [f'Metabolite_{i+1}' for i in range(n_metabolites)]
    
    return X, y, feature_names, biomarker_indices

# Generate data
X, y, feature_names, true_biomarkers = simulate_clinical_data(n_samples=120, n_metabolites=50)

print("🏥 Clinical Biomarker Discovery")
print("=" * 60)
print(f"Total samples: {len(X)}")
print(f"  Healthy: {(y == 0).sum()}")
print(f"  Disease: {(y == 1).sum()}")
print(f"Total metabolites: {X.shape[1]}")
print(f"True biomarkers (simulated): {len(true_biomarkers)}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=5)
clf.fit(X_train_scaled, y_train)

# Predictions
y_pred = clf.predict(X_test_scaled)
y_pred_proba = clf.predict_proba(X_test_scaled)[:, 1]

# Feature importance
importances = clf.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=False)

print("\n🔬 Top 10 Important Features (Potential Biomarkers):")
print(feature_importance_df.head(10).to_string(index=False))

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

print("\n📈 Model Performance:")
print(f"  ROC AUC: {roc_auc:.3f}")
print(f"  Accuracy: {(y_pred == y_test).mean():.3f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\n📊 Confusion Matrix:")
print(f"  True Negatives:  {cm[0, 0]}")
print(f"  False Positives: {cm[0, 1]}")
print(f"  False Negatives: {cm[1, 0]}")
print(f"  True Positives:  {cm[1, 1]}")

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# Plot 1: Feature importance
top_features = feature_importance_df.head(15)
axes[0, 0].barh(top_features['Feature'], top_features['Importance'], 
               color='#4CAF50', edgecolor='black', linewidth=1.5)
axes[0, 0].set_xlabel('Importance', fontsize=12, fontweight='bold')
axes[0, 0].set_title('🔬 Top 15 Feature Importances', fontsize=14, fontweight='bold')
axes[0, 0].invert_yaxis()
axes[0, 0].grid(axis='x', alpha=0.3)

# Plot 2: ROC curve
axes[0, 1].plot(fpr, tpr, color='#2196F3', linewidth=3, label=f'ROC (AUC = {roc_auc:.3f})')
axes[0, 1].plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier')
axes[0, 1].fill_between(fpr, tpr, alpha=0.3, color='#2196F3')
axes[0, 1].set_xlabel('False Positive Rate (1 - Specificity)', fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('True Positive Rate (Sensitivity)', fontsize=12, fontweight='bold')
axes[0, 1].set_title('📈 ROC Curve', fontsize=14, fontweight='bold')
axes[0, 1].legend(loc='lower right', fontsize=11)
axes[0, 1].grid(alpha=0.3)

# Plot 3: Confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True, 
           xticklabels=['Healthy', 'Disease'], yticklabels=['Healthy', 'Disease'],
           ax=axes[1, 0], linewidths=2, linecolor='black')
axes[1, 0].set_xlabel('Predicted Label', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('True Label', fontsize=12, fontweight='bold')
axes[1, 0].set_title('📊 Confusion Matrix', fontsize=14, fontweight='bold')

# Plot 4: Predicted probabilities distribution
healthy_probs = y_pred_proba[y_test == 0]
disease_probs = y_pred_proba[y_test == 1]
axes[1, 1].hist(healthy_probs, bins=15, alpha=0.7, label='Healthy', color='#4CAF50', edgecolor='black')
axes[1, 1].hist(disease_probs, bins=15, alpha=0.7, label='Disease', color='#F44336', edgecolor='black')
axes[1, 1].axvline(x=0.5, color='black', linestyle='--', linewidth=2, label='Threshold')
axes[1, 1].set_xlabel('Predicted Probability (Disease)', fontsize=12, fontweight='bold')
axes[1, 1].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[1, 1].set_title('🎯 Prediction Distribution', fontsize=14, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ Biomarker discovery analysis complete!")

---
## Practice 8: Integration - Multi-omics Data Visualization

### 🎯 Learning Objectives
- Integrate proteomics and metabolomics data
- Perform PCA for dimensionality reduction
- Create comprehensive multi-omics visualizations

### 📖 Key Concepts
**Multi-omics Integration:** Combining different layers of biological data (proteins, metabolites) for holistic analysis
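
When layers are simply concatenated, the layer with more features contributes more total variance and can dominate a PCA. One possible remedy, sketched here as an assumption rather than the approach this notebook takes, is block scaling: z-score every feature, then divide each layer by the square root of its feature count so that each layer carries the same total variance. PCA is then computed with a plain SVD. All function names and the random toy data are illustrative.

```python
import numpy as np

def zscore(X):
    """Center each column and scale it to unit (population) variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def block_scale_and_concat(blocks):
    """Z-score each omics block, weight it by 1/sqrt(n_features), then join."""
    scaled = [zscore(b) / np.sqrt(np.asarray(b).shape[1]) for b in blocks]
    return np.hstack(scaled)

def pca_scores(X, n_components=2):
    """PCA via SVD: return sample scores and explained variance ratios."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    return U[:, :n_components] * S[:n_components], explained[:n_components]

# Toy data (random, for shape checking only): 12 samples, 30 proteins, 40 metabolites
rng = np.random.default_rng(0)
toy_proteins = rng.normal(size=(12, 30))
toy_metabolites = rng.normal(size=(12, 40))

toy_integrated = block_scale_and_concat([toy_proteins, toy_metabolites])
toy_scores, toy_explained = pca_scores(toy_integrated, n_components=2)

print(f"Integrated shape: {toy_integrated.shape}")
print(f"Protein block variance:    {toy_integrated[:, :30].var(axis=0).sum():.3f}")
print(f"Metabolite block variance: {toy_integrated[:, 30:].var(axis=0).sum():.3f}")
```

Whether equal block weights are appropriate depends on the study. Plain concatenation followed by standardization is simpler and is a reasonable default when the layers have similar sizes.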

In [None]:
# 8.1 Simulate integrated multi-omics dataset
def simulate_multiomics_data(n_samples=60):
    """
    Simulate combined proteomics and metabolomics data
    """
    np.random.seed(42)
    
    # Three groups: Control, Treatment_A, Treatment_B
    n_per_group = n_samples // 3
    
    # Proteomics data (30 proteins)
    proteins_ctrl = np.random.normal(10, 1.5, (n_per_group, 30))
    proteins_trtA = np.random.normal(12, 1.5, (n_per_group, 30))
    proteins_trtB = np.random.normal(8, 1.5, (n_per_group, 30))
    
    # Metabolomics data (40 metabolites)
    metabolites_ctrl = np.random.normal(5, 1, (n_per_group, 40))
    metabolites_trtA = np.random.normal(6, 1, (n_per_group, 40))
    metabolites_trtB = np.random.normal(4, 1, (n_per_group, 40))
    
    # Combine omics layers
    X_proteins = np.vstack([proteins_ctrl, proteins_trtA, proteins_trtB])
    X_metabolites = np.vstack([metabolites_ctrl, metabolites_trtA, metabolites_trtB])
    X_integrated = np.hstack([X_proteins, X_metabolites])
    
    # Labels
    groups = ['Control'] * n_per_group + ['Treatment_A'] * n_per_group + ['Treatment_B'] * n_per_group
    
    # Feature names
    protein_names = [f'Protein_{i+1}' for i in range(30)]
    metabolite_names = [f'Metabolite_{i+1}' for i in range(40)]
    feature_names = protein_names + metabolite_names
    
    return X_integrated, X_proteins, X_metabolites, groups, feature_names

# Generate data
X_multi, X_prot, X_metab, groups, features = simulate_multiomics_data(n_samples=90)

print("🔬 Multi-omics Integration Analysis")
print("=" * 60)
print(f"Total samples: {len(X_multi)}")
print(f"  Control: {groups.count('Control')}")
print(f"  Treatment A: {groups.count('Treatment_A')}")
print(f"  Treatment B: {groups.count('Treatment_B')}")
print(f"\nFeatures:")
print(f"  Proteins: {X_prot.shape[1]}")
print(f"  Metabolites: {X_metab.shape[1]}")
print(f"  Total integrated: {X_multi.shape[1]}")

# PCA on integrated data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_multi)

pca = PCA(n_components=10)  # keep 10 components so the scree plot shows more than three bars
X_pca = pca.fit_transform(X_scaled)

print(f"\n📊 PCA Results:")
print(f"  PC1 variance explained: {pca.explained_variance_ratio_[0]:.3f}")
print(f"  PC2 variance explained: {pca.explained_variance_ratio_[1]:.3f}")
print(f"  PC3 variance explained: {pca.explained_variance_ratio_[2]:.3f}")
print(f"  Total variance (3 PCs): {pca.explained_variance_ratio_[:3].sum():.3f}")

# Separate PCA for each omics layer
# (fresh scalers are used so the scaler fitted on X_multi is not overwritten)
pca_prot = PCA(n_components=2)
X_prot_pca = pca_prot.fit_transform(StandardScaler().fit_transform(X_prot))

pca_metab = PCA(n_components=2)
X_metab_pca = pca_metab.fit_transform(StandardScaler().fit_transform(X_metab))

# Visualize
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# Define colors
color_map = {'Control': '#4CAF50', 'Treatment_A': '#2196F3', 'Treatment_B': '#FF9800'}
colors = [color_map[g] for g in groups]

# Plot 1: Integrated PCA (2D)
ax1 = fig.add_subplot(gs[0, :2])
for group in ['Control', 'Treatment_A', 'Treatment_B']:
    mask = [g == group for g in groups]
    ax1.scatter(X_pca[mask, 0], X_pca[mask, 1], 
               c=color_map[group], label=group, s=100, alpha=0.7, edgecolors='black', linewidth=1.5)
ax1.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})', fontsize=12, fontweight='bold')
ax1.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})', fontsize=12, fontweight='bold')
ax1.set_title('🌐 Integrated Multi-omics PCA', fontsize=14, fontweight='bold')
ax1.legend(loc='best')
ax1.grid(alpha=0.3)

# Plot 2: Scree plot
ax2 = fig.add_subplot(gs[0, 2])
n_components = min(10, len(pca.explained_variance_ratio_))
ax2.bar(range(1, n_components+1), pca.explained_variance_ratio_[:n_components], 
       color='#9C27B0', edgecolor='black', linewidth=1.5, alpha=0.7)
ax2.set_xlabel('Principal Component', fontsize=11, fontweight='bold')
ax2.set_ylabel('Variance Explained', fontsize=11, fontweight='bold')
ax2.set_title('📊 Scree Plot', fontsize=13, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)

# Plot 3: Proteomics PCA
ax3 = fig.add_subplot(gs[1, 0])
for group in ['Control', 'Treatment_A', 'Treatment_B']:
    mask = [g == group for g in groups]
    ax3.scatter(X_prot_pca[mask, 0], X_prot_pca[mask, 1], 
               c=color_map[group], label=group, s=80, alpha=0.7, edgecolors='black', linewidth=1)
ax3.set_xlabel('Protein PC1', fontsize=11, fontweight='bold')
ax3.set_ylabel('Protein PC2', fontsize=11, fontweight='bold')
ax3.set_title('🧬 Proteomics Layer', fontsize=13, fontweight='bold')
ax3.grid(alpha=0.3)

# Plot 4: Metabolomics PCA
ax4 = fig.add_subplot(gs[1, 1])
for group in ['Control', 'Treatment_A', 'Treatment_B']:
    mask = [g == group for g in groups]
    ax4.scatter(X_metab_pca[mask, 0], X_metab_pca[mask, 1], 
               c=color_map[group], label=group, s=80, alpha=0.7, edgecolors='black', linewidth=1)
ax4.set_xlabel('Metabolite PC1', fontsize=11, fontweight='bold')
ax4.set_ylabel('Metabolite PC2', fontsize=11, fontweight='bold')
ax4.set_title('⚗️ Metabolomics Layer', fontsize=13, fontweight='bold')
ax4.grid(alpha=0.3)

# Plot 5: 3D PCA
ax5 = fig.add_subplot(gs[1, 2], projection='3d')
for group in ['Control', 'Treatment_A', 'Treatment_B']:
    mask = [g == group for g in groups]
    ax5.scatter(X_pca[mask, 0], X_pca[mask, 1], X_pca[mask, 2],
               c=color_map[group], label=group, s=60, alpha=0.7, edgecolors='black', linewidth=0.5)
ax5.set_xlabel('PC1', fontsize=10, fontweight='bold')
ax5.set_ylabel('PC2', fontsize=10, fontweight='bold')
ax5.set_zlabel('PC3', fontsize=10, fontweight='bold')
ax5.set_title('🎨 3D PCA View', fontsize=13, fontweight='bold')
ax5.legend(loc='best', fontsize=9)

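# Plot 6: Correlation heatmap for a random subset of features
ax6 = fig.add_subplot(gs[2, :])
# Select 20 random features (sorted so proteins precede metabolites on the axes)
selected_features = np.sort(np.random.choice(X_multi.shape[1], 20, replace=False))
selected_names = [features[i] for i in selected_features]
correlation_matrix = np.corrcoef(X_scaled[:, selected_features].T)
im = ax6.imshow(correlation_matrix, cmap='coolwarm', aspect='auto', vmin=-1, vmax=1)
# Label the ticks with feature names so each row and column can be identified
ax6.set_xticks(range(len(selected_names)))
ax6.set_xticklabels(selected_names, rotation=90, fontsize=8)
ax6.set_yticks(range(len(selected_names)))
ax6.set_yticklabels(selected_names, fontsize=8)
ax6.set_title('🔥 Feature Correlation Heatmap (Random Subset)', fontsize=14, fontweight='bold')
ax6.set_xlabel('Features', fontsize=11, fontweight='bold')
ax6.set_ylabel('Features', fontsize=11, fontweight='bold')
plt.colorbar(im, ax=ax6, label='Correlation')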

plt.show()

print("\n✅ Multi-omics integration complete!")
print("\n💡 Key Insights:")
print("  • The simulated treatment groups separate clearly in the integrated PCA, as expected from their shifted means")
print("  • Both the proteomics and metabolomics layers contribute to the overall pattern")
print("  • Combining omics layers can give a more comprehensive view than any single layer")
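
The insights printed above are stated rather than measured. A rough way to check them, sketched here under the same simulation assumptions (group means of 10/12/8 for proteins and 5/6/4 for metabolites), is to compute a silhouette score for group separation in PCA space and to sum the squared PC1 loadings per layer. The variable names are illustrative, and the data is regenerated so the block runs on its own; the numbers will not match the cell above exactly because a different random generator is used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 30  # samples per group
prot = np.vstack([rng.normal(m, 1.5, (n, 30)) for m in (10, 12, 8)])
metab = np.vstack([rng.normal(m, 1.0, (n, 40)) for m in (5, 6, 4)])
X = StandardScaler().fit_transform(np.hstack([prot, metab]))
labels = np.repeat(["Control", "Treatment_A", "Treatment_B"], n)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Silhouette near 1 means tight, well-separated groups; near 0 means overlap
sil = silhouette_score(scores, labels)

# Each PC loading vector has unit length, so its squared entries sum to 1
# and can be split into a per-layer share
pc1_sq = pca.components_[0] ** 2
protein_share = pc1_sq[:30].sum()
metabolite_share = pc1_sq[30:].sum()

print(f"Silhouette (PC1-PC2): {sil:.2f}")
print(f"PC1 loading share: proteins {protein_share:.2f}, metabolites {metabolite_share:.2f}")
```

If one layer's share is close to zero, the integrated PCA is effectively a single-omics analysis and the second insight would not hold for that dataset.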

---
## 🎯 Practice Complete!

### Summary of What We Learned:

1. **Mass Spectrometry Basics**: Simulating and visualizing MS data with m/z peaks
2. **Peptide Analysis**: Calculating theoretical masses and charge states
3. **Protein Identification**: Understanding PSM scoring and FDR control
4. **Quantitative Proteomics**: Analyzing differential expression with volcano plots
5. **Metabolite Detection**: Peak detection in LC-MS chromatograms
6. **Pathway Analysis**: Enrichment testing for biological interpretation
7. **Biomarker Discovery**: Building predictive models with ROC analysis
8. **Multi-omics Integration**: Combining proteomics and metabolomics with PCA

### Key Insights:
- Proteomics and metabolomics provide complementary views of biological systems
- Statistical methods (t-tests, FDR, ROC curves) are essential for data interpretation
- Integration of multiple omics layers enhances biological understanding
- Machine learning enables biomarker discovery from high-dimensional data
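
To make the point about statistical methods concrete, here is a minimal, self-contained sketch of per-feature Welch t-tests followed by a Benjamini-Hochberg adjustment. The data is simulated, the helper `benjamini_hochberg` is written out by hand for illustration, and the 0.05 cutoff is a conventional choice rather than something fixed by this tutorial.

```python
import numpy as np
from scipy import stats

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working back from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(ranked, 0, 1)
    return q

rng = np.random.default_rng(1)
control = rng.normal(10, 1, size=(20, 100))
disease = rng.normal(10, 1, size=(20, 100))
disease[:, :10] += 2.0  # the first 10 features are truly shifted

# Welch t-test per feature (column), then FDR adjustment across all features
_, pvals = stats.ttest_ind(control, disease, axis=0, equal_var=False)
qvals = benjamini_hochberg(pvals)
print(f"Features with q < 0.05: {(qvals < 0.05).sum()}")
```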

### Real-World Applications:
- **Clinical Diagnostics**: Disease biomarker discovery
- **Drug Development**: Understanding drug mechanisms and toxicity
- **Precision Medicine**: Patient stratification and treatment selection
- **Systems Biology**: Mapping cellular networks and pathways

### Next Steps:
- Explore real datasets from public repositories (PRIDE, MetaboLights)
- Learn advanced tools: MaxQuant, MetaboAnalyst, Perseus
- Study time-series metabolomics and flux analysis
- Integrate with genomics and transcriptomics data

---

### 📚 Additional Resources:
- **PRIDE Database**: https://www.ebi.ac.uk/pride/
- **MetaboLights**: https://www.ebi.ac.uk/metabolights/
- **Human Metabolome Database**: https://hmdb.ca/
- **KEGG Pathways**: https://www.genome.jp/kegg/pathway.html

### 🎓 Congratulations!
You've completed a comprehensive hands-on introduction to proteomics and metabolomics analysis!

---