# Lab: Protein Dynamics Analysis 🧬 (Student Version)

> "Proteins are not static structures; they are dynamic machines that dance through time!"

This lab explores protein dynamics through computational simulation. You'll learn how proteins move, which regions are flexible, and how motion relates to biological function.

**By the end of this lab, you'll be able to:**
- Download and prepare protein structures from the PDB
- Analyze static protein properties (Ramachandran plots, B-factors)
- Run molecular dynamics simulations
- Visualize and interpret protein motion
- Connect protein dynamics to biological function

**Real-world relevance**: Understanding protein dynamics is crucial for drug design, enzyme engineering, and understanding diseases like Alzheimer's, which is associated with protein misfolding and aggregation.


## 1. Download and Prepare the Structure

Your first task is to download a protein structure from the RCSB database in mmCIF format and convert it to PDB.

**What you'll learn:**
- What the Protein Data Bank (PDB) is - the world's largest repository of protein structures
- How protein structures are stored (mmCIF vs PDB formats)
- Why formats matter in computational biology
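
To make the format difference concrete, here is a minimal sketch that parses one atom record in each style. The record values are invented for illustration and are not taken from a real entry. The mmCIF row is simplified to twelve columns; real files carry more columns and have quoting rules that this sketch ignores. The helper names `parse_pdb_atom` and `parse_cif_row` are hypothetical.

```python
# Illustrative values only; not taken from a real PDB entry.
PDB_TEMPLATE = (
    "{record:<6}{serial:>5} {name:<4}{alt:1}{res:>3} {chain:1}{seq:>4}{icode:1}   "
    "{x:>8.3f}{y:>8.3f}{z:>8.3f}{occ:>6.2f}{b:>6.2f}          {elem:>2}"
)
pdb_line = PDB_TEMPLATE.format(
    record="ATOM", serial=1, name=" N", alt="", res="MET", chain="A", seq=1,
    icode="", x=27.34, y=24.43, z=2.614, occ=1.0, b=9.67, elem="N",
)


def parse_pdb_atom(line):
    """Parse a legacy PDB ATOM record by fixed column positions."""
    return {
        "serial": int(line[6:11]),       # columns 7-11
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "b_factor": float(line[60:66]),  # columns 61-66
    }


# Simplified mmCIF: the loop header names each column, rows are whitespace-delimited.
CIF_KEYS = [
    "_atom_site.group_PDB", "_atom_site.id", "_atom_site.type_symbol",
    "_atom_site.label_atom_id", "_atom_site.label_comp_id",
    "_atom_site.label_asym_id", "_atom_site.label_seq_id",
    "_atom_site.Cartn_x", "_atom_site.Cartn_y", "_atom_site.Cartn_z",
    "_atom_site.occupancy", "_atom_site.B_iso_or_equiv",
]
cif_row = "ATOM 1 N N MET A 1 27.340 24.430 2.614 1.00 9.67"


def parse_cif_row(keys, row):
    """Pair each whitespace-delimited token with its column key (values stay strings)."""
    return dict(zip(keys, row.split()))


print(parse_pdb_atom(pdb_line))
print(parse_cif_row(CIF_KEYS, cif_row))
```

The fixed columns are one reason formats matter: the PDB serial field holds only five digits and the chain ID a single character, while mmCIF's labelled tokens have no such limit.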

**💡 Think about it**: The PDB contains over 200,000 structures! Each one represents years of experimental work. What protein will you analyze?

**Tasks:**
1. Import the necessary libraries
2. Write functions to download and convert structures
3. Download a protein structure (try PDB ID: 1E0L, 1UBQ, or 1CRN)

**Outcome:** You should have a real protein structure ready to analyze.


In [None]:
# TODO: Import the necessary libraries
# You'll need: os, numpy, matplotlib, Bio.PDB, mdtraj, requests, openmm
# Hint: The task hints below name the specific classes to import
#       (e.g. MMCIFParser, PDBIO, PDBParser from Bio.PDB; mdtraj is usually imported as md)

# Write your imports here:


### Task 1.1: Write a function to download structures

Write a function `download_structure(pdb_id, out_dir="structures")` that:
- Takes a PDB ID (e.g., "1E0L") and optional output directory
- Downloads the mmCIF file from RCSB (URL: `https://files.rcsb.org/download/{pdb_id}.cif`)
- Saves it to the specified directory
- Returns the path to the downloaded file

**Hint**: Use `requests.get()` to download, and make sure to handle the response status code.
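
One possible shape for the download step is sketched below. It is not the required solution: the helper names `build_cif_url` and `fetch_cif` and the `timeout` value are assumptions made for illustration. The URL pattern is the one given in the task text.

```python
import os

# URL pattern as given in the task description.
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.cif"


def build_cif_url(pdb_id):
    """Return the download URL for a PDB ID (whitespace stripped, lowercased)."""
    return RCSB_URL.format(pdb_id=pdb_id.strip().lower())


def fetch_cif(pdb_id, out_dir="structures", timeout=30):
    """Download one mmCIF file and return its local path (hypothetical helper)."""
    import requests  # imported here so build_cif_url works without requests installed

    os.makedirs(out_dir, exist_ok=True)
    response = requests.get(build_cif_url(pdb_id), timeout=timeout)
    if response.status_code != 200:
        # An invalid PDB ID typically ends up here (for example with a 404).
        raise RuntimeError(f"Download failed for {pdb_id}: HTTP {response.status_code}")

    path = os.path.join(out_dir, f"{pdb_id.strip().lower()}.cif")
    with open(path, "wb") as handle:
        handle.write(response.content)
    return path


print(build_cif_url("1E0L"))
```

Raising an exception on a non-200 status, rather than silently writing the error page to disk, keeps a bad ID from surfacing later as a confusing parser error.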


In [None]:
# TODO: Write the download_structure function
def download_structure(pdb_id, out_dir="structures"):
    """
    Download an mmCIF file from RCSB and return the local file path.
    
    Parameters:
    - pdb_id: PDB identifier (e.g., "1E0L")
    - out_dir: Output directory (default: "structures")
    
    Returns:
    - Path to downloaded .cif file
    """
    # TODO: Implement this function
    # Hint: 
    # 1. Create output directory if it doesn't exist (os.makedirs)
    # 2. Build the URL (lowercase pdb_id)
    # 3. Use requests.get() to download
    # 4. Check status code (should be 200)
    # 5. Write the content to a file
    # 6. Print success message and return file path
    pass


### Task 1.2: Write a function to convert mmCIF to PDB

Write a function `cif_to_pdb(cif_file, out_file=None)` that:
- Takes an mmCIF file path
- Uses BioPython to parse the mmCIF file
- Converts it to PDB format
- Saves the PDB file
- Returns the path to the PDB file

**Hint**: Use `MMCIFParser` from `Bio.PDB` to parse, and `PDBIO` to save.
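
A minimal sketch of the conversion, assuming Biopython is installed. The helper names `default_pdb_path` and `convert_cif` and the structure label `"protein"` are hypothetical; the Biopython calls (`MMCIFParser`, `get_structure`, `PDBIO.set_structure`, `PDBIO.save`) are the ones named in the hint.

```python
import os


def default_pdb_path(cif_file):
    """Swap only the final extension for .pdb, leaving the rest of the path alone."""
    root, _ext = os.path.splitext(cif_file)
    return root + ".pdb"


def convert_cif(cif_file, out_file=None):
    """Parse an mmCIF file and write it back out in PDB format."""
    from Bio.PDB import MMCIFParser, PDBIO  # requires Biopython

    out_file = out_file or default_pdb_path(cif_file)
    parser = MMCIFParser(QUIET=True)  # QUIET suppresses parser warnings
    structure = parser.get_structure("protein", cif_file)

    writer = PDBIO()
    writer.set_structure(structure)
    writer.save(out_file)
    return out_file


print(default_pdb_path("structures/1e0l.cif"))
```

Using `os.path.splitext` instead of `str.replace(".cif", ".pdb")` avoids rewriting a directory name that happens to contain `.cif`.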


In [None]:
# TODO: Write the cif_to_pdb function
def cif_to_pdb(cif_file, out_file=None):
    """
    Convert mmCIF file to PDB using BioPython.
    
    Parameters:
    - cif_file: Path to input .cif file
    - out_file: Path to output .pdb file (default: replace .cif with .pdb)
    
    Returns:
    - Path to converted .pdb file
    """
    # TODO: Implement this function
    # Hint:
    # 1. If out_file is None, replace .cif with .pdb in cif_file
    # 2. Create MMCIFParser (QUIET=True)
    # 3. Parse the structure
    # 4. Create PDBIO object
    # 5. Set structure and save
    # 6. Print success message and return file path
    pass


### Task 1.3: Download and convert a structure

Now use your functions to download and convert a protein structure. Try different PDB IDs:
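- `1E0L`: Small WW domain (good for quick MD; solved by NMR, so it does not carry crystallographic B-factors)
- `1UBQ`: Ubiquitin (well-studied; compact and stable, with a flexible C-terminal tail)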
- `1CRN`: Crambin (small, stable protein)

**Questions to think about:**
- What happens if you try an invalid PDB ID?
- What's the difference between mmCIF and PDB formats?


In [None]:
# TODO: Download and convert a structure
# Choose a PDB ID and use your functions
pdb_id = "1E0L"  # Try different IDs!

# TODO: Call your download_structure function
# TODO: Call your cif_to_pdb function
# Store the result in pdb_file variable


## 2. Static Protein Analysis

Before simulating motion, we examine the protein in its static state. Think of this as a "snapshot" from a crystallography experiment.

**We will:**
- Compute phi/psi backbone torsion angles
- Plot a Ramachandran diagram → shows which conformations are allowed
- Extract B-factors → reflect how much atoms fluctuate in experiments

**This teaches:**
- Where protein chains can bend (like a flexible hose vs a rigid pipe)
- Which regions are flexible vs rigid (loops vs alpha-helices)
- How crystallography captures but also simplifies motion (a single snapshot of a dancing protein!)

**🎯 Key Question**: Can we predict which parts will move the most, just from the static structure?

**Outcome:** You'll gain a clearer understanding that even static crystallography hints at dynamics.


### Task 2.1: Compute Phi/Psi Angles and Create Ramachandran Plot

The **Ramachandran plot** visualizes backbone torsion angles (phi and psi) and shows which conformations are sterically allowed.

**Think of it like this**: Imagine a protein backbone as a chain of beads. Phi and psi are like the angles of rotation at each joint. Some combinations are impossible (like trying to fold your arm the wrong way), while others create familiar structures:
- **Alpha-helices** cluster around phi ≈ -60°, psi ≈ -45° (left side, just below the center line)
- **Beta-sheets** cluster around phi ≈ -120°, psi ≈ +130° (top-left)
- **Loops** can be anywhere but prefer certain angles

**Tasks:**
1. Write a function to compute phi and psi angles using mdtraj
2. Write a function to plot the Ramachandran plot
3. Compute and visualize the angles for your protein

**Hint**: Use `md.compute_phi()` and `md.compute_psi()` from mdtraj. Don't forget to convert radians to degrees for plotting!
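
To see what a torsion angle is numerically, the sketch below computes the dihedral defined by four points using only the standard library. It illustrates the geometry and is not a replacement for the mdtraj calls; the function name `dihedral_deg` and the test points are made up for illustration. For phi the four atoms are C(i-1), N(i), CA(i), C(i); for psi they are N(i), CA(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)

    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)

    # Components of the outer bonds perpendicular to the central bond.
    v = tuple(x - _dot(b0, b1) * y for x, y in zip(b0, b1))
    w = tuple(x - _dot(b2, b1) * y for x, y in zip(b2, b1))

    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Four made-up points: the outer atoms are a quarter turn apart around the z axis.
print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Because `atan2` returns values in (-180°, 180°], the result already matches the axis range used for the Ramachandran plot.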


In [None]:
# TODO: Write function to compute Ramachandran angles
def compute_ramachandran(pdb_file):
    """
    Compute phi and psi backbone torsion angles.
    
    Parameters:
    - pdb_file: Path to PDB file
    
    Returns:
    - phi: Array of phi angles (in radians)
    - psi: Array of psi angles (in radians)
    """
    # TODO: Implement this function
    # Hint:
    # 1. Load the PDB file using mdtraj (md.load)
    # 2. Compute phi angles (md.compute_phi)
    # 3. Compute psi angles (md.compute_psi)
    # 4. Flatten the arrays and return
    pass

# TODO: Write function to plot Ramachandran plot
def plot_ramachandran(phi, psi, title="Ramachandran Plot"):
    """
    Plot phi vs psi angles showing allowed conformations.
    
    Parameters:
    - phi: Array of phi angles (in radians)
    - psi: Array of psi angles (in radians)
    - title: Plot title
    """
    # TODO: Implement this function
    # Hint:
    # 1. Convert radians to degrees (np.degrees)
    # 2. Create scatter plot (phi vs psi)
    # 3. Set axis limits (-180 to 180)
    # 4. Add labels, title, grid
    # 5. Show the plot
    pass


In [None]:
# TODO: Compute and plot Ramachandran angles for your protein
# Use your functions above


**Questions:**
- What regions of the Ramachandran plot are populated? (alpha-helix region, beta-sheet region?)
- What does this tell you about the secondary structure of your protein?


### Task 2.2: Extract and Plot B-Factors

**B-factors** (also called temperature factors) from crystallography reflect atomic displacement around the mean position. Higher B-factors usually indicate more mobile or disordered regions of the protein, although crystal packing and model quality also contribute.

**💡 Analogy**: Think of B-factors like a "blurriness" measure in a photo. If you took many photos of a person dancing, the moving parts would be blurry (high B-factor), while still parts would be sharp (low B-factor).

**🤔 Prediction**: Before we run the MD simulation, can you guess which regions will have high B-factors? (Hint: Loops and chain termini are usually more flexible than secondary structure elements!)

**Tasks:**
1. Write a function to extract B-factors from the PDB file
2. Write a function to plot B-factors along the protein
3. Identify the most flexible regions

**Hint**: Use `PDBParser` from `Bio.PDB` to parse the structure, then iterate through atoms to get their B-factors.
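
B-factors can be put on the same scale as the RMSF computed later from MD. For isotropic motion, B = 8π²⟨u²⟩, where ⟨u²⟩ is the mean-square displacement along one direction; in terms of the 3D fluctuation this becomes B = (8π²/3)·RMSF². The sketch below applies that relation under the isotropic assumption; the function names are hypothetical, and real B-factors also absorb static disorder, so the conversion is only approximate.

```python
import math

EIGHT_PI_SQ = 8.0 * math.pi ** 2


def bfactor_to_rmsf(b_factor):
    """Convert a B-factor (Å^2) to a 3D RMS fluctuation (Å), assuming isotropic motion."""
    return math.sqrt(3.0 * b_factor / EIGHT_PI_SQ)


def rmsf_to_bfactor(rmsf_angstrom):
    """Inverse conversion: 3D RMS fluctuation (Å) to an equivalent B-factor (Å^2)."""
    return EIGHT_PI_SQ / 3.0 * rmsf_angstrom ** 2


# A B-factor of 20 Å^2 corresponds to a fluctuation a little under 1 Å.
print(round(bfactor_to_rmsf(20.0), 3))
```

Converting the simulated RMSF to B-factor units with `rmsf_to_bfactor` gives a like-for-like comparison when answering the prediction question.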


In [None]:
# TODO: Write function to extract B-factors
def extract_bfactor(pdb_file):
    """
    Extract B-factors from PDB file.
    
    Parameters:
    - pdb_file: Path to PDB file
    
    Returns:
    - Array of B-factors for all atoms
    """
    # TODO: Implement this function
    # Hint:
    # 1. Parse PDB file using PDBParser
    # 2. Iterate through all atoms in the structure
    # 3. Get B-factor for each atom (atom.get_bfactor())
    # 4. Return as numpy array
    pass

# TODO: Write function to plot B-factors
def plot_bfactor(bfactors):
    """
    Plot B-factor distribution along the protein.
    
    Parameters:
    - bfactors: Array of B-factors
    """
    # TODO: Implement this function
    # Hint:
    # 1. Create a figure and axes
    # 2. Plot B-factors vs atom index
    # 3. Add labels, title, grid
    # 4. Print statistics (mean, max, min)
    # 5. Show the plot
    pass


In [None]:
# TODO: Extract and plot B-factors for your protein


**Questions:**
- Which regions have the highest B-factors?
- Do these regions correspond to loops, termini, or secondary structure elements?
- **Make a prediction**: Will these same regions show high flexibility in the MD simulation?


## 3. Running a Molecular Dynamics Simulation 🎬

Now we let the protein move in a simulated environment! This is like creating a movie of protein motion.

**We will:**
- Add an implicit solvent (water is modeled as a continuous medium rather than as explicit molecules; the Langevin integrator supplies the "friction" of solvent collisions)
- Minimize energy (relax the structure - like letting a stretched spring settle)
- Run a short MD simulation using OpenMM (like a physics simulation in a video game!)

**This gives us:**
- A trajectory over time → like a movie of protein motion with about a thousand frames

**⏱️ Time Note**: This simulation represents 0.2 nanoseconds of "real" time. In reality, proteins move on timescales from picoseconds to seconds. We're capturing only the fastest movements!

**🎮 Think of it like**: Simulating how a protein would move if you could watch it frame-by-frame at atomic resolution.

**Outcome:** A .dcd file containing about 1000 snapshots of the protein shape changing (0.2 ns at 2 fs per step is 100,000 steps, saved every 100 steps).


### Task 3.1: Write MD Simulation Function

Write a function to run a molecular dynamics simulation. This function should:

1. Load the PDB structure
2. Create a force field (AMBER)
3. Create a system with the force field
4. Create an integrator (Langevin dynamics)
5. Minimize energy
6. Run the simulation and save trajectory

**Key parameters:**
- `temperature`: Simulation temperature (default: 300 K)
- `sim_time_ns`: Simulation time in nanoseconds (default: 0.2 ns)
- `out_dcd`: Output trajectory filename

**Hint**: Use OpenMM classes:
- `PDBFile` to load structure
- `ForceField` to create force field
- `LangevinIntegrator` for dynamics
- `Simulation` to run the simulation
- `DCDReporter` to save trajectory
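
Before running anything, it helps to check how many integration steps and saved frames the parameters imply. The sketch below does that arithmetic, assuming the 0.002 ps time step and the 100-step reporting interval suggested in this lab; the function name `md_step_count` is hypothetical.

```python
def md_step_count(sim_time_ns, timestep_ps=0.002, report_interval=100):
    """Return (number of MD steps, number of saved frames) for a simulation length."""
    total_ps = sim_time_ns * 1000.0                  # 1 ns = 1000 ps
    n_steps = int(round(total_ps / timestep_ps))     # round() guards against float error
    n_frames = n_steps // report_interval
    return n_steps, n_frames


n_steps, n_frames = md_step_count(0.2)
print(n_steps, n_frames)
```

The step count must be an integer when it is passed to the simulation, which is why the helper rounds and converts explicitly.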


In [None]:
# TODO: Write function to run MD simulation
def run_md_simulation(pdb_file, temperature=300, sim_time_ns=0.2, out_dcd="traj.dcd"):
    """
    Run an implicit-solvent MD simulation of a small protein.
    
    Parameters:
    - pdb_file: Input PDB structure
    - temperature: Simulation temperature in Kelvin (default: 300 K ≈ room temperature)
    - sim_time_ns: Simulation time in nanoseconds (default: 0.2 ns)
    - out_dcd: Output trajectory file name
    
    Returns:
    - Path to trajectory file
    """
    # TODO: Implement this function
    # Steps:
    # 1. Load PDB file using PDBFile
    # 2. Create Modeller from topology and positions
    # 3. Create ForceField (try "amber14-all.xml" together with "implicit/gbn2.xml"
    #    so that the implicit solvent model is actually included)
    #    If the structure lacks hydrogens, add them with modeller.addHydrogens(forcefield)
    # 4. Create system from force field and topology
    # 5. Create LangevinIntegrator (temperature, friction, time step)
    #    - Temperature: temperature * kelvin
    #    - Friction: 1/picosecond
    #    - Time step: 0.002 * picosecond
    # 6. Create Simulation object
    # 7. Set positions
    # 8. Minimize energy
    # 9. Add DCDReporter (save every 100 steps)
    # 10. Calculate steps needed as an integer:
    #     sim_time_ns * 1000 ps/ns / 0.002 ps per step (= sim_time_ns * 1000 * 1000 / 2)
    # 11. Run simulation (sim.step(steps))
    # 12. Return trajectory file path
    pass


In [None]:
# TODO: Run MD simulation
# Note: This may take a few minutes depending on your system
# Start with a short simulation (0.2 ns)


## 4. Analyzing Protein Motion 🎭

Now for the exciting part! We have a "movie" of protein motion. Let's analyze it to understand how the protein moves.

**We will compute:**
- **RMSD** (Root Mean Square Deviation) - how much the structure deviates over time
- **RMSF** (Root Mean Square Fluctuation) - which residues are flexible
- **Free Energy Landscape** - what stable conformations exist

**This teaches:**
- Proteins visit many shapes, not just one (like a dancer performing many poses)
- Stability is linked to energy minima (proteins "prefer" certain conformations)
- Function often involves movement between states (enzymes need to open/close, receptors need to bind/unbind)

**🤔 Key Questions to answer:**
- Do the flexible regions match what we predicted from B-factors?
- How much does the protein deviate from its starting structure?
- Are there multiple stable conformations?

**Outcome:** You'll see how biological function emerges from dynamics, not from static structure alone.


### Task 4.1: Load Trajectory and Compute RMSD/RMSF

**Tasks:**
1. Write a function to load the trajectory and compute RMSD and RMSF
2. RMSD measures overall structural change (distance from reference)
3. RMSF measures per-residue flexibility (fluctuation)

**Hint**: 
- Use `md.load()` to load trajectory
- Use `md.rmsd()` to compute RMSD (compare to frame 0)
- Use `md.rmsf()` to compute RMSF (per atom, then select CA atoms for per-residue)
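
The two quantities differ in what they average over: RMSD averages over atoms for each frame, RMSF averages over frames for each atom. The NumPy sketch below shows this on a tiny synthetic coordinate array. It assumes the frames are already superposed and skips the alignment that the mdtraj functions perform; the function names and the data are made up for illustration.

```python
import numpy as np


def rmsd_to_reference(xyz, ref_index=0):
    """RMSD of every frame to one reference frame; xyz has shape (n_frames, n_atoms, 3)."""
    diff = xyz - xyz[ref_index]
    return np.sqrt((diff ** 2).sum(axis=2).mean(axis=1))   # average over atoms


def rmsf_per_atom(xyz):
    """RMSF of every atom around its mean position."""
    diff = xyz - xyz.mean(axis=0)
    return np.sqrt((diff ** 2).sum(axis=2).mean(axis=0))   # average over frames


# Synthetic data: 3 frames, 2 atoms. Atom 0 never moves; atom 1 oscillates along x.
xyz = np.zeros((3, 2, 3))
xyz[1, 1, 0] = 1.0
xyz[2, 1, 0] = -1.0

rmsd = rmsd_to_reference(xyz)
rmsf = rmsf_per_atom(xyz)
print(rmsd, rmsf)
```

The result has one RMSD value per frame and one RMSF value per atom, which is why RMSD is plotted against time and RMSF against residue index.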


In [None]:
# TODO: Write function to analyze trajectory
def analyze_trajectory(pdb_file, traj_file):
    """
    Analyze MD trajectory to compute RMSD and RMSF.
    
    Parameters:
    - pdb_file: Path to PDB file (for topology)
    - traj_file: Path to trajectory file (.dcd)
    
    Returns:
    - traj: Trajectory object
    - rmsd: RMSD values per frame
    - rmsf: RMSF values per residue (using CA atoms)
    """
    # TODO: Implement this function
    # Steps:
    # 1. Load trajectory using md.load(traj_file, top=pdb_file)
    #    (the topology file must contain exactly the atoms that were simulated)
    # 2. Compute RMSD to first frame: md.rmsd(traj, traj, 0)
    # 3. Compute RMSF per atom: md.rmsf(traj, frame=0)
    # 4. Select CA atoms: traj.topology.select('name == CA')
    # 5. Extract RMSF for CA atoms only (for per-residue analysis)
    # 6. Print trajectory info and return traj, rmsd, rmsf
    pass


In [None]:
# TODO: Analyze your trajectory


### Task 4.2: Plot RMSD Over Time

**RMSD** shows how much the protein structure deviates from its starting conformation. For a stable protein, RMSD typically rises quickly and then levels off at a low value; a flexible or unfolding protein shows a larger or steadily drifting RMSD.

**Tasks:**
1. Write a function to plot RMSD over time
2. Convert RMSD from nm to Å (multiply by 10)
3. Add statistics and interpretation

**Questions:**
- What does the RMSD plot tell you about protein stability?
- Is the RMSD increasing, decreasing, or fluctuating?
- What would a very stable protein look like vs a very flexible one?


In [None]:
# TODO: Write function to plot RMSD
def plot_rmsd(rmsd):
    """
    Plot RMSD over time.
    
    Parameters:
    - rmsd: RMSD values (in nm)
    """
    # TODO: Implement this function
    # Steps:
    # 1. Convert nm to Å (multiply by 10)
    # 2. Create figure and axes
    # 3. Plot RMSD vs frame number
    # 4. Add mean line (axhline)
    # 5. Fill area under curve for better visualization
    # 6. Add labels, title, grid, legend
    # 7. Print statistics (mean, max, min, std)
    # 8. Interpret (rough rule of thumb): <1.0 Å = very stable, 1.0-2.5 Å = moderately stable, >2.5 Å = flexible
    pass

# TODO: Plot your RMSD data


### Task 4.3: Plot RMSF Per Residue

**RMSF** identifies which residues are most flexible. This is crucial for understanding functional regions, binding sites, and areas that might be important for conformational changes.

**Tasks:**
1. Write a function to plot RMSF per residue
2. Highlight the most flexible regions
3. Identify the top 5 most flexible residues

**Questions:**
- Which residues are most flexible?
- Do these match your B-factor predictions?
- What types of regions typically show high RMSF? (loops, termini, secondary structure?)
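
Ranking residues and picking a percentile cutoff are both one-liners in NumPy. The sketch below shows them on made-up RMSF values (not simulation output); the function name `flexible_residues` is hypothetical.

```python
import numpy as np


def flexible_residues(rmsf_nm, top_n=5, percentile=75):
    """Return (indices of the top_n most flexible residues, cutoff in Å, boolean mask)."""
    rmsf_angstrom = np.asarray(rmsf_nm) * 10.0           # nm -> Å
    threshold = np.percentile(rmsf_angstrom, percentile)
    top_idx = np.argsort(rmsf_angstrom)[::-1][:top_n]    # largest values first
    flexible_mask = rmsf_angstrom >= threshold
    return top_idx, threshold, flexible_mask


# Made-up per-residue RMSF values in nm, for illustration only.
example_rmsf_nm = [0.30, 0.05, 0.06, 0.04, 0.07, 0.05, 0.12, 0.25]
top_idx, threshold, mask = flexible_residues(example_rmsf_nm, top_n=3)
print(top_idx, threshold, mask)
```

The indices returned are zero-based positions in the CA array, not PDB residue numbers; look up the matching residues in the topology before comparing with the B-factor plot.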


In [None]:
# TODO: Write function to plot RMSF
def plot_rmsf(rmsf):
    """
    Plot RMSF per residue.
    
    Parameters:
    - rmsf: RMSF values per residue (in nm)
    """
    # TODO: Implement this function
    # Steps:
    # 1. Convert nm to Å (multiply by 10)
    # 2. Create figure and axes
    # 3. Create bar plot with color gradient (green colormap)
    # 4. Add mean line
    # 5. Highlight top 25% flexible residues (percentile 75)
    # 6. Add labels, title, grid, legend
    # 7. Find and print top 5 most flexible residues
    # 8. Print statistics
    pass

# TODO: Plot your RMSF data


### Task 4.4: Compute and Plot Free Energy Landscape

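The **Free Energy Landscape** maps how often the protein visits each region of conformational space: frequently visited conformations have low free energy, and rarely visited ones have high free energy.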

**Think of it like a topographic map:**
- **Deep valleys** (dark blue/purple in matplotlib's default viridis colormap) = stable conformations (low energy, protein "prefers" these)
- **High peaks** (yellow) = unstable conformations (high energy, rarely visited)
- **Multiple valleys** = protein can adopt multiple stable conformations
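
The map is built by counting how often each region is visited and converting the resulting probability P to a free energy F = -ln P, expressed in units of kT. The sketch below does this for synthetic two-state samples rather than a real trajectory; the function name `free_energy_kt` and the sample parameters are assumptions for illustration.

```python
import numpy as np

# kT at 300 K in kJ/mol (gas constant times temperature), for converting units.
KT_KJ_PER_MOL_300K = 8.314462618e-3 * 300.0


def free_energy_kt(samples_x, samples_y, bins=20):
    """Return the free energy surface in units of kT, with its lowest bin shifted to 0."""
    hist, x_edges, y_edges = np.histogram2d(samples_x, samples_y, bins=bins)
    prob = hist / hist.sum()
    with np.errstate(divide="ignore"):
        free_energy = -np.log(prob)        # never-visited bins become +inf
    free_energy -= free_energy[np.isfinite(free_energy)].min()
    return free_energy, x_edges, y_edges


# Synthetic samples: a heavily populated state and a less populated one.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 3000), rng.normal(2.0, 0.5, 1000)])
y = np.concatenate([rng.normal(0.0, 0.5, 3000), rng.normal(1.0, 0.5, 1000)])

free_energy, x_edges, y_edges = free_energy_kt(x, y)
print(free_energy.shape, KT_KJ_PER_MOL_300K)
```

A 0.2 ns trajectory samples only a small part of conformational space, so a landscape built from it is best read qualitatively.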

**Tasks:**
1. Write a function to compute PCA (Principal Component Analysis) on the trajectory
2. Write a function to compute and plot the free energy landscape
3. Free energy = -ln(probability), in units of kT

**Hint**: 
- Reshape trajectory coordinates to 2D: (n_frames, n_atoms*3)
- Use SVD for PCA: `np.linalg.svd()`
- Project onto first two principal components
- Create 2D histogram and compute free energy as -ln(histogram)
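
The reshape, center, SVD, and project steps can be tried on synthetic coordinates before applying them to a trajectory. In the sketch below the data contain one dominant collective motion plus small noise; the function name `pca_project` and all data are made up for illustration. With a real trajectory, the frames would normally be superposed first so that overall rotation and translation do not dominate the components.

```python
import numpy as np


def pca_project(coords, n_components=2):
    """PCA via SVD on coordinates of shape (n_frames, n_atoms, 3)."""
    X = coords.reshape(coords.shape[0], -1)       # (n_frames, n_atoms*3)
    X = X - X.mean(axis=0)                        # center each coordinate
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    projections = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / (S ** 2).sum()         # fraction of variance per component
    return projections, explained[:n_components], Vt[:n_components]


# Synthetic "trajectory": 200 frames, 4 atoms, one dominant motion plus noise.
rng = np.random.default_rng(1)
n_frames, n_atoms = 200, 4
base = rng.normal(size=(n_atoms, 3))
mode = rng.normal(size=(n_atoms, 3))
amplitude = rng.normal(scale=1.0, size=(n_frames, 1, 1))
noise = rng.normal(scale=0.05, size=(n_frames, n_atoms, 3))
coords = base + amplitude * mode + noise

projections, explained, components = pca_project(coords)
print(projections.shape, explained)
```

`U[:, k] * S[k]` is the same as projecting the centered data onto the k-th row of `Vt`, so either form gives the PC coordinates used for the histogram.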


In [None]:
# TODO: Write function to compute Free Energy Landscape using PCA
def compute_fel(traj, n_components=2):
    """
    Compute Free Energy Landscape using Principal Component Analysis.
    
    Parameters:
    - traj: Trajectory object
    - n_components: Number of principal components (default: 2)
    
    Returns:
    - pc1: First principal component
    - pc2: Second principal component
    """
    # TODO: Implement this function
    # Steps:
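    # 1. Superpose the frames first (e.g. traj.superpose(traj, 0)) so that PCA sees
    #    internal motion rather than overall rotation/translation, then reshape
    #    the coordinates: traj.xyz.reshape(traj.n_frames, -1)
    #    This converts (n_frames, n_atoms, 3) to (n_frames, n_atoms*3)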
    # 2. Center the data (subtract mean)
    # 3. Compute SVD: np.linalg.svd(X, full_matrices=False)
    # 4. Project onto first two PCs: U[:,0]*S[0], U[:,1]*S[1]
    # 5. Print variance explained by each PC
    pass

# TODO: Write function to plot Free Energy Landscape
def plot_fel(pc1, pc2, bins=50):
    """
    Plot Free Energy Landscape.
    
    Parameters:
    - pc1: First principal component
    - pc2: Second principal component
    - bins: Number of bins for histogram (default: 50)
    """
    # TODO: Implement this function
    # Steps:
    # 1. Compute 2D histogram: np.histogram2d(pc1, pc2, bins=bins)
    # 2. Normalize histogram to probability
    # 3. Compute free energy: -np.log(hist + 1e-10)
    # 4. Create figure and plot as image (imshow)
    # 5. Add contour lines for better visualization
    # 6. Add colorbar, labels, title
    # 7. Find and print energy minima
    # 8. Print interpretation
    pass

# TODO: Compute and plot free energy landscape


**Questions:**
- How many stable conformations (valleys) does your protein have?
- What does the free energy landscape tell you about protein dynamics?
- How much variance do the first two principal components explain?


## 🎯 Challenge Questions & Reflection

Answer these questions based on your analysis:

1. **Prediction vs Reality**: Did the regions with high B-factors (from crystallography) match the regions with high RMSF (from MD)? Why or why not?

2. **RMSD Interpretation**: What does your RMSD plot tell you? Is the protein becoming more or less stable over time? What would you expect for a very stable protein vs a flexible one?

3. **Free Energy Landscape**: How many valleys (stable conformations) do you see in the free energy landscape? What does this tell you about the protein's ability to adopt different shapes?

4. **Biological Function**: How might flexibility be important for a protein's function? Can you think of examples where rigidity would be important instead?

5. **Simulation Limitations**: We simulated 0.2 nanoseconds. Real biological processes can take microseconds to seconds. What events might we miss in such a short simulation?

**💡 Bonus Challenge**: Try running the simulation with a different temperature (e.g., 350 K ≈ 77 °C). How do the RMSD and flexibility change? What does this tell you about temperature effects on proteins?


## Summary & Reflection 🎓

In this lab, you've learned:

1. **Structure Download**: How to retrieve protein structures from the PDB
2. **Static Analysis**: Ramachandran plots and B-factors reveal protein geometry and flexibility hints
3. **MD Simulation**: How to simulate protein motion using OpenMM
4. **Dynamic Analysis**: RMSD, RMSF, and free energy landscapes reveal how proteins move

**🔑 Key Insights to Reflect On:**
- How do static (B-factors) and dynamic (RMSF) measures of flexibility compare?
- What does the free energy landscape tell you about protein conformations?
- How might protein dynamics relate to biological function?

**🚀 Next Steps to Explore:**
- Try different proteins! Change the `pdb_id` variable and see how different proteins behave
- Experiment with simulation parameters (temperature, time)
- Compare your results with the literature

**📚 Real-World Applications:**
Think about how protein dynamics relates to:
- Drug design (targeting specific conformations)
- Enzyme catalysis (motion enables function)
- Disease (protein misfolding)
- Evolution (optimizing function through dynamics)
