# Antimicrobial Peptide Classification - Student Assignment

---

## 🧬 Problem Statement: Why Antimicrobial Peptide Classification?

### What are Antimicrobial Peptides (AMPs)?

Antimicrobial peptides (AMPs) are short peptides and small proteins that play a crucial role in innate immunity and are found in virtually all forms of life. They are:

- **Natural antibiotics**: Part of the body's first line of defense against pathogens
- **Broad-spectrum**: Many are active against bacteria, fungi, viruses, and even some cancer cells
- **Fast-acting**: Many kill microbes within minutes
- **Low resistance**: Pathogens tend to develop resistance to AMPs more slowly than to conventional antibiotics

### Why is This Problem Important?

#### 1. **Antibiotic Resistance Crisis** 🚨
- Traditional antibiotics are becoming ineffective due to bacterial resistance
- WHO lists antibiotic resistance as one of the top 10 global public health threats
- By 2050, antimicrobial resistance could cause 10 million deaths annually
- **AMPs offer a promising alternative** with lower resistance development

#### 2. **Drug Discovery** 💊
- Identifying AMPs from protein sequences can accelerate drug development
- Computational prediction is **faster and cheaper** than experimental screening
- Can screen millions of sequences in silico before lab testing
- Reduces time from discovery to clinical trials

#### 3. **Personalized Medicine** 🏥
- Understanding human AMPs helps in:
  - Diagnosing immune deficiencies
  - Developing targeted therapies
  - Understanding disease susceptibility
  - Creating personalized treatment plans

#### 4. **Agricultural Applications** 🌾
- AMPs can protect crops from pathogens
- Reduce need for chemical pesticides
- Improve food security

### The Computational Challenge

**Problem**: Given a protein sequence, can we predict whether it has antimicrobial properties?

**Why Machine Learning?**
- Experimental validation is expensive (\$10,000+ per peptide)
- Time-consuming (months to years)
- ML can screen thousands of candidates in hours
- Accuracy of 80-95% can significantly reduce experimental workload

**Real-World Impact**:
- Pharmaceutical companies use these models to identify drug candidates
- Researchers discover new AMPs from genomic databases
- Helps combat emerging infectious diseases

---

## 📊 Dataset Overview

### Dataset Description

**File**: `antimicrobial_peptide.csv`

**Size**: 2,000 human (Homo sapiens) protein sequences

**Source**: UniProt database (Universal Protein Resource)
- UniProt is the world's most comprehensive catalog of protein sequences
- Contains experimentally validated and computationally predicted proteins
- Our data comes from the human proteome (taxonomy ID: 9606)

### Class Labels

The dataset contains two classes:

| Label | Class | Count | Description |
|-------|-------|-------|-------------|
| **0** | Non-AMP | 1,000 | Regular proteins without antimicrobial activity |
| **1** | AMP | 1,000 | Proteins with antimicrobial properties |

**Balanced Dataset**: Equal number of samples per class (1,000 each)
- Prevents bias toward majority class
- Ensures fair model training
- Simplifies evaluation metrics

### Sequence Characteristics

**Overall Statistics**:
- **Mean length**: ~419 amino acids
- **Median length**: ~317 amino acids
- **Range**: 32 to 7,592 amino acids

**By Class**:
- **Label 0 (Non-AMP)**: Mean = 333 aa, Median = 271 aa
- **Label 1 (AMP)**: Mean = 505 aa, Median = 387 aa

### Biological Context

#### What Makes a Protein Antimicrobial?

AMPs typically have these characteristics:

1. **Positive Charge**
   - Rich in lysine (K) and arginine (R)
   - Attracted to negatively charged bacterial membranes

2. **Amphipathic Structure**
   - Contains both hydrophobic and hydrophilic regions
   - Allows insertion into lipid membranes

3. **Small to Medium Size**
   - Typically 10-50 amino acids (though our dataset includes larger proteins)
   - Small size makes it easier to penetrate cell membranes

4. **Specific Amino Acid Composition**
   - High content of: K, R, W, F, L
   - Low content of: E, D (negatively charged)
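
As a rough illustration of the first two characteristics, the sketch below estimates net charge and hydrophobic fraction from a sequence string. The function name `charge_and_hydrophobicity`, the example peptide, and the choice of hydrophobic residues are assumptions made for this example; definitions of hydrophobicity vary, and histidine is ignored in the charge count.

```python
# Illustrative residue groups; the hydrophobic set is an assumption for this sketch.
POSITIVE = set("KR")            # lysine, arginine
NEGATIVE = set("DE")            # aspartate, glutamate
HYDROPHOBIC = set("AVILMFWC")   # one common choice, not a standard definition


def charge_and_hydrophobicity(sequence):
    """Return (net charge estimate, fraction of hydrophobic residues)."""
    seq = sequence.upper()
    if not seq:
        return 0, 0.0
    positives = sum(aa in POSITIVE for aa in seq)
    negatives = sum(aa in NEGATIVE for aa in seq)
    hydrophobic = sum(aa in HYDROPHOBIC for aa in seq)
    return positives - negatives, hydrophobic / len(seq)


# Hypothetical 7-residue peptide, not taken from the dataset.
net_charge, hydrophobic_fraction = charge_and_hydrophobicity("KRWLKDE")
print(net_charge, round(hydrophobic_fraction, 3))
```

Simple summaries like these are not part of the three required feature sets, but they can help when interpreting differences between the two classes.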

#### Mechanism of Action

AMPs kill microbes by:
1. **Membrane Disruption**: Creating pores in bacterial membranes
2. **Intracellular Targets**: Inhibiting DNA/RNA synthesis
3. **Immune Modulation**: Activating immune responses

---

## 🎯 Your Task

You will implement a complete machine learning pipeline to classify antimicrobial peptides:

### Part 1: Data Exploration
- Load and examine the dataset
- Visualize class distribution
- Analyze sequence length patterns
- Compare amino acid composition between classes

### Part 2: Feature Extraction
Implement three different feature extraction methods:

#### 2.1 AAC (Amino Acid Composition)
- **What**: Frequency of each of the 20 standard amino acids
- **Dimension**: 20 features
- **Pros**: Simple, interpretable, captures basic composition
- **Cons**: Loses sequence order information
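
To make the idea concrete, here is a minimal sketch of AAC on a toy sequence. The helper name `aac_vector` is hypothetical, and the handling of non-standard residues (they count toward the length but get no feature of their own) is an assumption, not a requirement of the assignment.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def aac_vector(sequence):
    """Fraction of the sequence made up by each standard amino acid."""
    seq = sequence.upper()
    if not seq:
        return np.zeros(len(AMINO_ACIDS))
    counts = np.array([seq.count(aa) for aa in AMINO_ACIDS], dtype=float)
    # Non-standard characters (e.g. X) still count toward len(seq) here.
    return counts / len(seq)


example = aac_vector("AAKK")  # toy sequence: half alanine, half lysine
print(dict(zip(AMINO_ACIDS, example)))
```

Because the vector only records proportions, `"AAKK"` and `"KAKA"` map to the same features, which is exactly the loss of order information noted above.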

#### 2.2 k-mers (Tri-peptides)
- **What**: Frequency of overlapping 3-amino-acid subsequences
- **Dimension**: 500 features (top 500 most common tri-peptides)
- **Pros**: Captures local sequence patterns, preserves some order
- **Cons**: Still limited context window
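
The sketch below shows one possible way to turn tri-peptides into a fixed-length feature vector, using three toy sequences and a vocabulary of only 2 k-mers instead of 500. The helper names `kmers` and `kmer_frequencies` are hypothetical, and normalizing by the number of k-mers in each sequence is an assumption about what "frequency" means here.

```python
from collections import Counter


def kmers(sequence, k=3):
    """Overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Toy sequences, not taken from the dataset.
toy_sequences = ["ACDEFG", "ACDACD", "KCDE"]

# Count every k-mer across all sequences and keep the most common ones.
counts = Counter(km for seq in toy_sequences for km in kmers(seq))
vocabulary = [km for km, _ in counts.most_common(2)]


def kmer_frequencies(sequence, vocabulary, k=3):
    """Relative frequency of each vocabulary k-mer within one sequence."""
    found = Counter(kmers(sequence, k))
    total = max(sum(found.values()), 1)  # guard against sequences shorter than k
    return [found[km] / total for km in vocabulary]


features = [kmer_frequencies(seq, vocabulary) for seq in toy_sequences]
print(vocabulary)
print(features)
```

In a real experiment the vocabulary would ideally be chosen from the training sequences only, so that the test set does not influence which k-mers become features.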

#### 2.3 ProtVec (Word2Vec Embeddings)
- **What**: Learned vector representations using Word2Vec on k-mers
- **Dimension**: 100 features
- **Pros**: Captures semantic relationships, dense representation
- **Cons**: Less interpretable, requires training
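
The averaging step behind a sequence-level embedding can be sketched without training anything. In the example below, `toy_vectors` is a made-up 4-dimensional lookup table standing in for a trained Word2Vec model, and `embed_sequence` is a hypothetical helper; skipping unknown k-mers and returning zeros when none are known are assumptions for this sketch.

```python
import numpy as np

# Made-up embeddings standing in for a trained model's k-mer vectors.
toy_vectors = {
    "ACD": np.array([1.0, 0.0, 0.0, 0.0]),
    "CDE": np.array([0.0, 1.0, 0.0, 0.0]),
    "DEF": np.array([0.0, 0.0, 1.0, 0.0]),
}


def embed_sequence(sequence, vectors, k=3, dim=4):
    """Average the vectors of all k-mers that have an embedding."""
    kmer_list = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    known = [vectors[km] for km in kmer_list if km in vectors]
    if not known:
        # No k-mer was in the vocabulary: fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(known, axis=0)


embedding = embed_sequence("ACDEFG", toy_vectors)  # "EFG" has no vector and is skipped
print(embedding)
```

With a real model the same logic applies, only with 100-dimensional vectors and a lookup against the trained vocabulary; k-mers dropped by `min_count` would be skipped in the same way.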

### Part 3: Model Training
Train three classifiers on each feature set:
1. **Logistic Regression**: Linear baseline
2. **Random Forest**: Non-linear ensemble method
3. **SVM**: Kernel-based classifier

**Total**: 3 features × 3 classifiers = 9 models
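
One way to keep the nine experiments organized is to enumerate every (feature set, classifier) pair up front. The sketch below assumes scikit-learn and uses the hyperparameters listed later in the implementation section; the names `feature_names`, `make_classifiers`, and `experiments` are hypothetical.

```python
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

feature_names = ["AAC", "k-mer", "ProtVec"]


def make_classifiers():
    """Fresh, unfitted classifiers so each experiment starts from scratch."""
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "SVM": SVC(kernel="rbf", probability=True, random_state=42),
    }


# Every combination of feature set and classifier name.
experiments = list(product(feature_names, make_classifiers()))
for feature, classifier in experiments:
    print(f"{feature} + {classifier}")
```

Creating the classifiers inside a function avoids accidentally reusing a model that was already fitted on a different feature set.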

### Part 4: Evaluation
Evaluate models using:
- **Accuracy**: Overall correctness
- **Precision**: How many predicted AMPs are actually AMPs?
- **Recall**: How many actual AMPs did we find?
- **F1-Score**: Harmonic mean of precision and recall
- **AUC-ROC**: Area under ROC curve
- **Cross-Validation**: 5-fold CV for robust estimation
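
The threshold-based metrics above all derive from the four confusion-matrix counts. The sketch below computes them by hand for a hypothetical 400-sequence test set; the counts are invented for illustration and are not expected results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 200 true AMPs (170 found, 30 missed),
# 200 true non-AMPs (150 correctly rejected, 50 false alarms).
metrics = classification_metrics(tp=170, fp=50, fn=30, tn=150)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case accuracy is 0.80 while precision and recall differ, which is why the assignment asks for all of them rather than accuracy alone.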

### Part 5: Visualization & Interpretation
- Compare performance across features and classifiers
- Generate ROC curves
- Create confusion matrices
- Visualize embeddings (PCA)
- Identify best model
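
For the embedding visualization, a 2-D PCA projection might look like the sketch below. It assumes scikit-learn and uses randomly generated 100-dimensional vectors in place of real ProtVec embeddings; the variable names are hypothetical, and the plotting call itself is left out.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for ProtVec embeddings: two slightly shifted clusters.
rng = np.random.default_rng(42)
embeddings = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 100)),
    rng.normal(1.0, 1.0, size=(50, 100)),
])
labels = np.array([0] * 50 + [1] * 50)

# Project onto the two directions of largest variance.
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(embeddings)
explained = pca.explained_variance_ratio_

print(coords.shape)
print("Variance explained by 2 components:", round(float(explained.sum()), 3))
```

The two columns of `coords` can then be passed to a scatter plot colored by `labels`; reporting the explained variance alongside the plot helps readers judge how much structure the 2-D view hides.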

---

## 📖 Background: Feature Extraction Methods

### Why Different Features?

Different features capture different aspects of protein sequences:

| Feature | What it Captures | Best For |
|---------|------------------|----------|
| **AAC** | Overall composition | Quick baseline, interpretable |
| **k-mer** | Local patterns, motifs | Functional sites, domains |
| **ProtVec** | Semantic relationships | Complex patterns, context |

### Evolution of Protein Feature Extraction

1. **1990s**: Hand-crafted features (AAC, physicochemical properties)
2. **2000s**: Sequence-based features (k-mers, position-specific)
3. **2010s**: Embedding methods (ProtVec, inspired by NLP)
4. **2020s**: Deep learning (ESM, ProtBERT - transformers)

This assignment covers methods from eras 1-3, giving you a comprehensive understanding!

---

## 🔬 Expected Results

Based on similar studies, you should expect:

| Feature | Expected F1-Score | Why? |
|---------|-------------------|------|
| AAC | 0.65 - 0.75 | Simple features, limited information |
| k-mer | 0.75 - 0.85 | Captures local patterns |
| ProtVec | 0.80 - 0.88 | Learned representations, semantic meaning |

**Best Classifier**: Usually Random Forest or SVM (non-linear methods)

---

## 💡 Tips for Success

1. **Start Simple**: Implement AAC first, then move to more complex features
2. **Test Incrementally**: Test each function with a small example before running on full dataset
3. **Use Random Seeds**: Set `random_state=42` everywhere for reproducibility
4. **Standardize Features**: Scale features before training (use `StandardScaler`), fitting the scaler on the training data only
5. **Cross-Validate**: Don't rely only on test set performance
6. **Visualize**: Plots help you understand what's happening
7. **Compare**: The goal is to compare different approaches, not just get high accuracy
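
Tips 3 to 5 can be combined by wrapping the scaler and the classifier in a single scikit-learn pipeline, so that scaling is re-fitted inside every cross-validation fold. The sketch below is one possible setup on synthetic data; the random features, the shift applied to class 1, and the choice of F1 as the score are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix: 200 samples, 20 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))
y = np.array([0, 1] * 100)
X[y == 1] += 0.5  # shift class 1 so there is something to learn

# The scaler is fitted on the training part of each fold only.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print("F1 per fold:", np.round(scores, 3))
print("Mean F1:", round(float(scores.mean()), 3))
```

Scaling the whole dataset before splitting would let statistics from the validation folds leak into training, which tends to make cross-validation scores look slightly better than they should.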

**Data Splitting**:
- Training: 80% (1,600 sequences)
- Testing: 20% (400 sequences)
- Use stratified split to maintain class balance
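
A stratified 80/20 split with scikit-learn could look like the following sketch. The placeholder feature matrix and the variable names are assumptions; only the split sizes, the seed, and the use of stratification come from the description above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape and balance as the assignment dataset.
X = np.zeros((2000, 20))
y = np.array([0] * 1000 + [1] * 1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% held out for testing
    random_state=42,   # reproducible shuffle
    stratify=y,        # keep the 50/50 class ratio in both parts
)

print(X_train.shape, X_test.shape)
print("AMPs in train/test:", int(y_train.sum()), int(y_test.sum()))
```

Passing the same `random_state` and `stratify=y` for every feature set keeps the same sequences in the test set each time, so the nine models are compared on identical examples.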

**Random Seed**: 42 (for reproducibility)

---

## 🚀 Let's Begin!

Now that you understand the problem and dataset, let's start implementing!



---

# Implementation Section

## TODO: Students will complete the following sections

## 1. Install Required Packages

**TODO**: Install the necessary Python packages

```python
# TODO: Install required packages
# Hint: !pip install -q package_name
```


## 2. Import Libraries

**TODO**: Import all necessary libraries for data processing, ML, and visualization

```python
# TODO: Import libraries
# You will need: pandas, numpy, matplotlib, seaborn, sklearn modules, gensim
```


## 3. Load Dataset

**TODO**: Upload and load the antimicrobial_peptide.csv file

```python
# TODO: Upload the dataset file
# Hint: Use files.upload() in Colab

# TODO: Load the CSV file into a pandas DataFrame

# TODO: Display basic information about the dataset
# - Number of rows and columns
# - Column names
# - First few rows
```


## 4. Exploratory Data Analysis

**TODO**: Analyze and visualize the dataset

```python
# TODO: Check class distribution
# How many sequences in each class (label 0 vs label 1)?

# TODO: Add a column for sequence length

# TODO: Calculate statistics for sequence lengths
# - Overall mean, median, min, max
# - By class (label 0 and label 1)
```


```python
# TODO: Create visualizations
# 1. Bar plot of class distribution
# 2. Histogram of sequence lengths (separate by class)

# Hint: Use matplotlib or seaborn
```


## 5. Feature Extraction

### 5.1 Amino Acid Composition (AAC)

**TODO**: Implement AAC feature extraction

```python
# TODO: Define a function to calculate amino acid composition
# Input: protein sequence (string)
# Output: numpy array of 20 values (frequency of each amino acid)

def amino_acid_composition(sequence):
    """
    Calculate the frequency of each of the 20 standard amino acids.

    Args:
        sequence (str): Protein sequence

    Returns:
        np.array: 20-dimensional feature vector
    """
    # TODO: Implement this function
    # Hint: The 20 standard amino acids are: ACDEFGHIKLMNPQRSTVWY
    # Calculate: count(amino_acid) / total_length for each amino acid
    pass

# TODO: Test your function with a simple example
# Example: amino_acid_composition('ACDEFGHIKLMNPQRSTVWY')
```


```python
# TODO: Extract AAC features for all sequences in the dataset
# Create a numpy array of shape (2000, 20)
```


### 5.2 k-mer Features (Tri-peptides)

**TODO**: Implement k-mer feature extraction

```python
# TODO: Define a function to extract k-mers from a sequence
def extract_kmers(sequence, k=3):
    """
    Extract overlapping k-mers from a sequence.

    Args:
        sequence (str): Protein sequence
        k (int): Length of k-mer (default: 3)

    Returns:
        list: List of k-mers
    """
    # TODO: Implement this function
    # Hint: For sequence 'ABCDE' with k=3, k-mers are: ['ABC', 'BCD', 'CDE']
    pass

# TODO: Test your function
# Example: extract_kmers('ACDEFG', k=3) should return ['ACD', 'CDE', 'DEF', 'EFG']
```


```python
# TODO: Find the top 500 most common k-mers across all sequences
# Hint: Use Counter from collections module

# TODO: Create k-mer frequency features for each sequence
# For each sequence, calculate the frequency of each of the top 500 k-mers
# Result should be a numpy array of shape (2000, 500)
```


### 5.3 ProtVec (Word2Vec Embeddings)

**TODO**: Implement ProtVec feature extraction

```python
# TODO: Train a Word2Vec model on k-mers
# Steps:
# 1. Convert each sequence to a list of k-mers (use extract_kmers function)
# 2. Train Word2Vec model using gensim
# 3. Parameters: vector_size=100, window=5, min_count=2, seed=42
#    (also set workers=1 if you need fully deterministic training)

# Hint: from gensim.models import Word2Vec
```


```python
# TODO: Convert sequences to ProtVec embeddings
# For each sequence:
# 1. Extract k-mers
# 2. Get Word2Vec vector for each k-mer
# 3. Average all k-mer vectors to get sequence embedding
# Result should be a numpy array of shape (2000, 100)
```


## 6. Model Training and Evaluation

**TODO**: Train classifiers and evaluate performance

```python
# TODO: Split data into training and testing sets
# Use train_test_split with:
# - test_size=0.2
# - random_state=42
# - stratify=y (to maintain class balance)

# Do this for each feature set (AAC, k-mer, ProtVec)
```


```python
# TODO: Define classifiers
# 1. Logistic Regression (max_iter=1000, random_state=42)
# 2. Random Forest (n_estimators=100, random_state=42)
# 3. SVM (kernel='rbf', probability=True, random_state=42)
```


```python
# TODO: For each feature set and each classifier:
# 1. Standardize features using StandardScaler (fit on the training data only)
# 2. Perform 5-fold cross-validation on training set
# 3. Train on full training set
# 4. Predict on test set
# 5. Calculate metrics: Accuracy, Precision, Recall, F1-Score, AUC-ROC
# 6. Store results in a DataFrame

# Total: 9 models (3 features × 3 classifiers)
```


## 7. Results and Visualization

**TODO**: Visualize and compare results

```python
# TODO: Create a results table showing all metrics for all models
# Display as a pandas DataFrame
```


## 8. Analysis and Conclusions

**TODO**: Answer the following questions based on your results

### Questions to Answer:

1. **Which feature extraction method performed best? Why do you think that is?**
   - YOUR ANSWER HERE

2. **Which classifier worked best overall? Was it consistent across all feature types?**
   - YOUR ANSWER HERE

3. **Look at the confusion matrices. Are there more false positives or false negatives? What does this mean in the context of drug discovery?**
   - YOUR ANSWER HERE

4. **If you had access to more computational resources, what would you try next? (Hint: Think about deep learning methods like ESM or ProtBERT)**
   - YOUR ANSWER HERE