<a href="https://colab.research.google.com/github/gokceuludogan/protein-ml-crash-course/blob/wip/Chapter_6_Protein_Structure_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# Chapter 6: Protein Structure Prediction

## Overview

Protein structure prediction is one of the most challenging problems in bioinformatics. The goal is to predict the **3D structure** of a protein given its **amino acid sequence**. Understanding a protein’s structure is key to determining its function and interactions.

In this chapter, we will:

- Explore the basics of protein structure prediction.
- Understand the significance of **contact maps** and **distance matrices**.
- Implement basic machine learning models to predict structural properties.
- Discuss advanced techniques like **AlphaFold**, a state-of-the-art deep learning model for 3D structure prediction.

---

## 1. Basics of Protein Structure Prediction

### The Protein Folding Problem

Proteins are composed of chains of amino acids that fold into unique three-dimensional shapes. Predicting this 3D shape from the 1D amino acid sequence is called the **protein folding problem**. The structure of a protein is usually described at four levels:

- **Primary structure**: The linear sequence of amino acids.
- **Secondary structure**: Local folding into alpha helices and beta sheets.
- **Tertiary structure**: Overall 3D shape of a single polypeptide chain.
- **Quaternary structure**: Arrangement of multiple chains in a multi-protein complex.

---

## 2. Protein Contact Maps and Distance Matrices

A **contact map** is a simplified representation of a protein’s 3D structure. It’s a 2D matrix where each element represents whether two amino acids are in contact in the 3D structure. Contact maps are widely used as intermediate steps in protein structure prediction.

### Distance Matrix

A **distance matrix** is similar to a contact map but instead records the Euclidean distance between the Cα atoms (alpha carbon atoms) of amino acids. The distance matrix is key for reconstructing the 3D structure from 2D data.
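
To make the definition concrete, here is a minimal sketch that builds a distance matrix from a handful of coordinates using NumPy broadcasting. The helper name `pairwise_distances` and the toy coordinates are hypothetical and chosen only for illustration; they are not taken from a real structure.

```python
import numpy as np

def pairwise_distances(coords):
    """Return the (N, N) Euclidean distance matrix for an (N, 3) coordinate array."""
    coords = np.asarray(coords, dtype=float)
    # Broadcasting produces every pairwise difference vector: shape (N, N, 3)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Hypothetical Cα coordinates (in Å) for a 4-residue toy chain
toy_coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [3.8, 3.8, 0.0],
    [0.0, 3.8, 0.0],
])

toy_distances = pairwise_distances(toy_coords)
print(np.round(toy_distances, 2))
```

The result is symmetric with zeros on the diagonal, so only the upper triangle carries independent information.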

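### Visual Example of a Contact Map

A contact map for a protein is a binary matrix in which 1 indicates that two residues are in contact, meaning their distance falls below a chosen threshold (commonly 8 Å between Cβ atoms, although cutoffs vary between studies). The table below is a toy illustration of the layout only; in a real map, sequence-adjacent residues are always in contact because consecutive Cα atoms sit about 3.8 Å apart.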

|  | A1 | A2 | A3 | A4 | A5 |
| --- | --- | --- | --- | --- | --- |
| **A1** | 1 | 0 | 1 | 0 | 0 |
| **A2** | 0 | 1 | 0 | 1 | 0 |
| **A3** | 1 | 0 | 1 | 0 | 1 |
| **A4** | 0 | 1 | 0 | 1 | 0 |
| **A5** | 0 | 0 | 1 | 0 | 1 |
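
A contact map can be derived from a distance matrix by thresholding. The sketch below assumes an 8 Å cutoff and an optional minimum sequence separation; both parameters, the function name `contact_map`, and the toy distances are illustrative assumptions rather than part of any particular library.

```python
import numpy as np

def contact_map(distances, threshold=8.0, min_separation=0):
    """Binarize a distance matrix: 1 where distance <= threshold, else 0.

    Pairs closer than `min_separation` positions along the sequence are set
    to 0, which removes the trivial contacts between near neighbours.
    """
    distances = np.asarray(distances, dtype=float)
    contacts = (distances <= threshold).astype(int)
    if min_separation > 0:
        idx = np.arange(len(distances))
        too_close = np.abs(idx[:, None] - idx[None, :]) < min_separation
        contacts[too_close] = 0
    return contacts

# Hypothetical distances (in Å) for a 4-residue toy chain
toy_distances = np.array([
    [0.0, 3.8, 6.5, 10.2],
    [3.8, 0.0, 3.8, 7.1],
    [6.5, 3.8, 0.0, 3.8],
    [10.2, 7.1, 3.8, 0.0],
])

toy_contacts = contact_map(toy_distances, threshold=8.0)
print(toy_contacts)
```

Setting `min_separation` to a value such as 2 or more drops the diagonal and nearest-neighbour entries, leaving only the contacts that say something about the fold.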

---

## 3. Predicting Protein Structural Properties

To simplify the problem, we can predict secondary structural elements or distance matrices. In this section, we will predict **distance matrices** using sequence data.

### Data Preprocessing for Distance Matrix Prediction

We need a dataset with protein sequences and known distance matrices, which can be extracted from protein structures available in the **Protein Data Bank (PDB)**.

### Code Example: Loading and Preprocessing PDB Data

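```python
from Bio.PDB import PDBList, PDBParser
import numpy as np

# Download (if needed) and load the first model of a PDB entry
def load_structure(pdb_id):
    # retrieve_pdb_file returns the local path of the downloaded file
    pdb_path = PDBList().retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')
    parser = PDBParser(QUIET=True)  # suppress parser warnings
    structure = parser.get_structure(pdb_id, pdb_path)
    model = structure[0]  # first model in the file
    return model

# Compute the Cα-Cα distance matrix (in Å)
def compute_distance_matrix(structure):
    # Keep standard residues only (hetero flag ' '). This skips waters and
    # ligands, including calcium ions, whose atom name is also 'CA'.
    coords = np.array([
        residue['CA'].coord
        for residue in structure.get_residues()
        if residue.id[0] == ' ' and 'CA' in residue
    ])
    # Broadcasting gives all pairwise difference vectors at once: (N, N, 3)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Example usage
pdb_id = '1A3N'  # Example PDB ID; the matrix covers all chains in the model
structure = load_structure(pdb_id)
distance_matrix = compute_distance_matrix(structure)
print("Distance Matrix Shape:", distance_matrix.shape)
```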

---

## 4. Machine Learning for Distance Matrix Prediction

We can train machine learning models to predict the **distance matrix** between amino acids in a protein sequence. The model will take a one-hot encoded protein sequence as input and predict the distance matrix.
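
One-hot encoding turns each residue into a 20-dimensional indicator vector. The following is a minimal sketch, assuming the 20 standard amino acids in an arbitrary fixed order; the names `AMINO_ACIDS`, `AA_TO_INDEX`, and `one_hot_encode` are hypothetical helpers introduced here for illustration.

```python
import numpy as np

# The 20 standard amino acids as one-letter codes (ordering is an arbitrary choice)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Encode a protein sequence as a (length, 20) one-hot matrix."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence.upper()):
        idx = AA_TO_INDEX.get(aa)
        if idx is not None:  # unknown residues (e.g. 'X') stay all-zero
            encoded[pos, idx] = 1.0
    return encoded

example = one_hot_encode("MKTAYX")
print(example.shape)  # (6, 20)
```

Real proteins vary in length, so sequences usually need padding or cropping to a common size before they can be stacked into a single training array.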

### Code Example: Distance Matrix Prediction with Random Forest

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Random placeholder data (for demonstration only; it carries no real signal)
# X: 100 sequences, each 20 residues long, with 20 features per residue
#    (standing in for a one-hot encoding over the 20 amino acids)
# y: the corresponding 20 x 20 distance matrices
X_sequences = np.random.rand(100, 20, 20)
y_distances = np.random.rand(100, 20, 20)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_sequences, y_distances, test_size=0.2, random_state=42)

# Train a Random Forest Regressor for predicting distances
# (each sample is flattened to a 1D vector because scikit-learn expects 2D arrays)
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train.reshape(len(X_train), -1), y_train.reshape(len(y_train), -1))

# Predict on test set
y_pred = rf.predict(X_test.reshape(len(X_test), -1))

# Reshape the flat predictions back into distance matrices
y_pred_reshaped = y_pred.reshape(y_test.shape)
print("Predicted Distance Matrix Shape:", y_pred_reshaped.shape)
```
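
A regressor that predicts every matrix entry independently has no built-in notion that distances are symmetric, non-negative, and zero on the diagonal. One simple post-processing step, sketched below as an assumption rather than a standard recipe, is to average the prediction with its transpose and reset the diagonal. The function name `clean_distance_matrix` and the random input are hypothetical.

```python
import numpy as np

def clean_distance_matrix(pred):
    """Make a predicted (N, N) matrix symmetric, non-negative, and zero-diagonal."""
    pred = np.asarray(pred, dtype=float)
    sym = 0.5 * (pred + pred.T)      # average each entry with its mirror
    sym = np.clip(sym, 0.0, None)    # distances cannot be negative
    np.fill_diagonal(sym, 0.0)       # a residue is at distance 0 from itself
    return sym

# Hypothetical raw prediction for a 5-residue protein
rng = np.random.default_rng(0)
raw_prediction = rng.random((5, 5)) * 10.0
cleaned = clean_distance_matrix(raw_prediction)
print(np.round(cleaned, 2))
```

This does not guarantee that the matrix corresponds to a realizable 3D structure; it only enforces the most basic properties of a distance matrix.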

---

## 5. Model Evaluation

### Evaluating Distance Matrix Predictions

To evaluate how well our model predicts the distance matrix, we can use metrics like **Root Mean Squared Error (RMSE)** or **Mean Absolute Error (MAE)**.

### Code Example: RMSE for Distance Matrix Prediction

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Flatten true and predicted matrices to (n_samples, n_residues * n_residues)
y_test_flat = y_test.reshape(len(y_test), -1)
y_pred_flat = y_pred.reshape(len(y_pred), -1)

# Compute RMSE
rmse = np.sqrt(mean_squared_error(y_test_flat, y_pred_flat))
print("RMSE for Distance Matrix Prediction:", rmse)
```
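
MAE can be computed alongside RMSE. Because a distance matrix is symmetric with a zero diagonal, one reasonable choice (an assumption of this sketch, not a fixed convention) is to score only the upper triangle so that each residue pair is counted once. The helper `distance_errors` and the toy matrices are hypothetical.

```python
import numpy as np

def distance_errors(y_true, y_pred):
    """Return (RMSE, MAE) over the upper triangle of one or more distance matrices.

    Accepts a single (N, N) matrix or a batch of shape (B, N, N).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.shape[-1]
    rows, cols = np.triu_indices(n, k=1)  # pairs with i < j, diagonal excluded
    diff = y_pred[..., rows, cols] - y_true[..., rows, cols]
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    return rmse, mae

# Hypothetical true and predicted distances (in Å) for a 3-residue toy example
true = np.array([[0.0, 4.0, 6.0],
                 [4.0, 0.0, 4.0],
                 [6.0, 4.0, 0.0]])
pred = np.array([[0.0, 5.0, 6.0],
                 [5.0, 0.0, 2.0],
                 [6.0, 2.0, 0.0]])

rmse, mae = distance_errors(true, pred)
print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a fuller picture of where the predictions go wrong.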

---

## 7. Conclusion

In this chapter, we explored the fundamentals of **protein structure prediction**, focusing on distance matrices and contact maps. We built a simple machine learning model to predict distance matrices, evaluated it with RMSE, and pointed to **AlphaFold** as the state-of-the-art deep learning approach to 3D structure prediction. Protein structure prediction remains a frontier in bioinformatics, with new developments continuing to emerge.