# Genomics Laboratory
## Student Notebook

This notebook contains exercises for the Genomics laboratory session. Complete each exercise by following the step-by-step instructions and filling in the code stubs.

**Learning Objectives:**
- Retrieve genomic data from public databases using REST APIs
- Analyze DNA sequences and compute basic statistics
- Work with genetic variants and their annotations
- Integrate multi-omics data (genomics, epigenomics)
- Design CRISPR guide RNAs
- Model gene expression dynamics using ODEs

**Instructions:**
- Read each exercise description carefully
- Follow the step-by-step instructions
- Fill in the code stubs marked with `# TODO:` comments
- Run each cell and verify your results
- Ask for help if you get stuck!

---

## Laboratory Pipeline Overview

This laboratory session follows a complete bioinformatics analysis pipeline:

1. **Exercise 1: Retrieve Gene Sequence**
   - Fetch genomic sequence from Ensembl REST API
   - Compute basic sequence statistics (GC content, CpG motifs, length)
   - Visualize GC content along the sequence

2. **Exercise 2: Fetch Variants**
   - Retrieve known genetic variants (SNPs/INDELs) from Ensembl Variation API
   - Plot variant distribution along the gene
   - Annotate variant impact

3. **Exercise 3: Fetch Methylation Data and Merge with Variants**
   - Query ENCODE database for whole-genome bisulfite sequencing (WGBS) data
   - Understand epigenomic datasets
   - Integrate variant positions with methylation scores
   - Identify regulatory SNPs (variants in high-methylation zones)

4. **Exercise 4: CRISPR Guide Design**
   - Identify PAM sites (NGG) for SpCas9
   - Design and score guide RNAs using heuristics

5. **Exercise 5: Synthetic Gene Circuit Simulation**
   - Model gene expression using ordinary differential equations (ODEs)
   - Simulate dynamic behavior of gene circuits


In [1]:
# Import required libraries
%matplotlib inline
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from scipy.integrate import solve_ivp
import os

# Configure plotting
plt.style.use('seaborn-v0_8')

# Set up output directory
output_dir = "lab_outputs"
os.makedirs(output_dir, exist_ok=True)

print("✓ Libraries imported successfully")


✓ Libraries imported successfully


## Exercise 1: Retrieve Gene Sequence Using an API

**Goal:** Learn API usage, JSON parsing, and basic DNA sequence manipulation.

**Background:** The Ensembl REST API provides programmatic access to genomic data including DNA sequences, gene annotations, and variant information. This exercise introduces you to working with REST APIs in bioinformatics.

**Task List:**
1. Implement the `fetch_gene_sequence_ensembl` function:
   - Construct the API URL using the Ensembl REST API endpoint
   - Make an HTTP GET request
   - Check if the request was successful
   - Parse the JSON response
   - Extract and return the sequence in uppercase
2. Choose a gene to analyze (e.g., TP53, BRCA1, or TERT)
3. Fetch the gene sequence using your function
4. Store the sequence in FASTA format:
   - Create a FASTA header line (starts with ">")
   - Format the sequence with line breaks (60 characters per line)
   - Save to a file
5. Compute basic sequence statistics:
   - Calculate sequence length
   - Calculate overall GC content
   - Count CpG motifs
   - Count each nucleotide (A, T, G, C)
6. Implement `compute_gc_content` function:
   - Use a sliding window approach
   - Calculate GC content for each window
   - Return a DataFrame with start, end, and gc_content columns
7. Run `compute_gc_content` on your sequence with a chosen window size (e.g., 200 bp)
8. Visualize GC content along the sequence:
   - Plot GC content vs. genomic position
   - Add a horizontal line for overall GC content
   - Add labels and title
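
As a rough illustration of the sliding-window steps in task 6, the sketch below uses a made-up 10-bp sequence and a hypothetical helper name (`toy_windowed_gc`). It assumes non-overlapping windows and divides by the actual window length, so a short final window is handled correctly. It returns plain tuples rather than a DataFrame.

```python
def toy_windowed_gc(seq, window=4):
    """GC fraction per non-overlapping window; the last window may be shorter."""
    rows = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = (chunk.count("G") + chunk.count("C")) / len(chunk)
        rows.append((start, start + len(chunk), gc))
    return rows

toy_seq = "GGCCATATGC"  # invented sequence, not real gene data
gc_rows = toy_windowed_gc(toy_seq, window=4)
for start, end, gc in gc_rows:
    print(start, end, gc)
```

Passing the rows to `pd.DataFrame(rows, columns=['start', 'end', 'gc_content'])` would give the table format the exercise asks for.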

**Key Concepts:**
- REST API usage for biological databases
- JSON parsing and data extraction
- DNA sequence analysis
- Sliding window operations
- FASTA file format
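
To make the FASTA layout concrete, here is a minimal sketch that wraps a sequence at 60 characters per line. The helper name `toy_fasta_record` and the record ID are hypothetical, and the sequence is a synthetic repeat rather than real data.

```python
def toy_fasta_record(header, seq, width=60):
    """Build a FASTA record: a '>' header line, then the sequence in fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{header}\n" + "\n".join(lines) + "\n"

toy_seq = "ACGT" * 35  # 140 bp synthetic sequence
record = toy_fasta_record("TOY0001 toy_gene", toy_seq)
print(record)
```

Slicing in steps of `width` keeps every line except possibly the last at exactly 60 characters, which matches the convention described in the task list.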

**Useful Ensembl Gene IDs:**
- TP53: `ENSG00000141510`
- BRCA1: `ENSG00000012048`
- TERT: `ENSG00000164362`
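
The request-and-parse steps can be tried without touching the network. The sketch below builds the URL from the endpoint given in the function hint and extracts the `seq` field from a hand-written dictionary that stands in for `response.json()`. The helper names and the mock payload are assumptions for illustration, not real Ensembl output.

```python
ENSEMBL_REST = "https://rest.ensembl.org"

def toy_sequence_url(gene_id):
    """URL for the sequence endpoint, requesting JSON."""
    return f"{ENSEMBL_REST}/sequence/id/{gene_id}?content-type=application/json"

def toy_extract_sequence(payload):
    """Pull the 'seq' field from a decoded JSON payload and uppercase it."""
    if "seq" not in payload:
        raise KeyError("response has no 'seq' field")
    return payload["seq"].upper()

# Hand-written stand-in for response.json(); the sequence is invented.
mock_payload = {"id": "ENSG00000141510", "seq": "acgtnngc"}

print(toy_sequence_url("ENSG00000141510"))
print(toy_extract_sequence(mock_payload))
```

In the real function, `requests.get(url)` followed by `response.raise_for_status()` would come before the parsing step, so that HTTP errors surface before any attempt to read the payload.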


In [None]:
# TODO: Implement fetch_gene_sequence_ensembl function
def fetch_gene_sequence_ensembl(gene_id):
    """
    Fetch DNA sequence for a given human gene using Ensembl REST API.
    
    Parameters:
    - gene_id (str): Ensembl gene ID (e.g., "ENSG00000141510" for TP53)
    
    Returns:
    - str: Gene sequence in uppercase A/C/G/T
    
    Hint: Use the Ensembl REST API endpoint:
    https://rest.ensembl.org/sequence/id/{gene_id}?content-type=application/json
    """
    # TODO: Construct the API URL
    url = None  # Fill in the URL
    
    # TODO: Make the API request
    response = None  # Use requests.get()
    
    # TODO: Check if request was successful
    # Use response.ok or response.raise_for_status()
    
    # TODO: Parse JSON response
    data = None  # Use response.json()
    
    # TODO: Extract sequence and convert to uppercase
    seq = None  # Get 'seq' from data and convert to uppercase
    
    return seq


In [None]:
# Exercise 1: Retrieve Gene Sequence
print("=" * 60)
print("Exercise 1: Retrieve Gene Sequence Using Ensembl API")
print("=" * 60)

# TODO: Choose a gene to analyze
# Options: TP53 (ENSG00000141510), BRCA1 (ENSG00000012048), TERT (ENSG00000164362)
gene_id = None  # Fill in an Ensembl gene ID
gene_name = None  # Fill in the gene name (e.g., "TP53")

print(f"\nFetching sequence for {gene_name} ({gene_id})...")

try:
    # TODO: Fetch the gene sequence using fetch_gene_sequence_ensembl
    sequence = None
    
    print(f"✓ Successfully retrieved sequence")
    print(f"  Sequence length: {len(sequence):,} bp")
    print(f"  First 50 bases: {sequence[:50]}")
    
except Exception as e:
    print(f"✗ Error fetching sequence: {e}")
    sequence = None


In [None]:
# TODO: Store sequence as FASTA format
if sequence is not None:
    # TODO: Create FASTA format string
    # FASTA format: first line starts with ">", followed by description
    # Subsequent lines contain the sequence (typically 60-80 characters per line)
    fasta_header = None  # Format: f">{gene_id} {gene_name}"
    
    # TODO: Format sequence with line breaks (every 60 characters)
    fasta_sequence = None  # Use list comprehension or loop
    
    # TODO: Combine header and sequence
    fasta_content = None
    
    # TODO: Save to file
    fasta_filename = None  # e.g., f"{output_dir}/{gene_name}_sequence.fasta"
    # TODO: Write fasta_content to file using open() and write()
    
    print(f"✓ Sequence saved to {fasta_filename}")
    print(f"\nFASTA preview:")
    print(fasta_content[:200] + "..." if len(fasta_content) > 200 else fasta_content)
else:
    print("⚠ No sequence available to save")


In [None]:
# TODO: Compute basic sequence statistics
if sequence is not None:
    print("\n" + "=" * 60)
    print("Basic Sequence Statistics")
    print("=" * 60)
    
    # TODO: Calculate sequence length
    seq_length = None  # Use len(sequence)
    
    # TODO: Calculate overall GC content
    # GC content = (G + C) / total_length
    g_count = None  # Count 'G' in sequence
    c_count = None  # Count 'C' in sequence
    gc_count = None  # Sum of G and C
    gc_content = None  # Fraction between 0 and 1 (the :.2% format below turns it into a percentage)
    
    # TODO: Count CpG motifs
    # CpG motif = "CG" dinucleotide
    cpg_count = None  # Count occurrences of "CG" in sequence
    cpg_density = None  # CpG count per 1000 bp
    
    # TODO: Count each nucleotide
    a_count = None  # Count 'A'
    t_count = None  # Count 'T'
    
    # TODO: Print statistics
    print(f"\nSequence Statistics for {gene_name}:")
    print(f"  Length: {seq_length:,} bp")
    print(f"  GC content: {gc_content:.2%}")
    print(f"  CpG motifs: {cpg_count:,}")
    print(f"  CpG density: {cpg_density:.2f} per 1000 bp")
    print(f"\nNucleotide composition:")
    print(f"  A: {a_count:,} ({a_count/seq_length:.2%})")
    print(f"  T: {t_count:,} ({t_count/seq_length:.2%})")
    print(f"  G: {g_count:,} ({g_count/seq_length:.2%})")
    print(f"  C: {c_count:,} ({c_count/seq_length:.2%})")
else:
    print("⚠ No sequence available for analysis")


In [None]:
# TODO: Implement compute_gc_content function
def compute_gc_content(sequence, window=200):
    """
    Compute GC-content using a sliding window across the sequence.
    
    Parameters:
    - sequence (str): DNA sequence
    - window (int): Window size for sliding window
    
    Returns:
    - pandas.DataFrame: Columns ['start', 'end', 'gc_content']
    
    Algorithm:
    1. Slide a window of size 'window' along the sequence
    2. For each window, calculate GC content = (G + C) / window length
       (the final window may be shorter than 'window')
    3. Store start position, end position, and GC content
    """
    results = []
    seq_len = len(sequence)
    
    # TODO: Implement sliding window
    # Loop through sequence with step size = window
    # For each window:
    #   - Extract window sequence
    #   - Count G and C
    #   - Calculate GC content
    #   - Append [start, end, gc_content] to results
    
    # TODO: Create DataFrame
    df_gc = None  # Use pd.DataFrame() with columns ['start', 'end', 'gc_content']
    
    return df_gc


In [None]:
# TODO: Compute GC content using sliding window
if sequence is not None:
    print("\n" + "=" * 60)
    print("Computing GC Content with Sliding Window")
    print("=" * 60)
    
    # TODO: Set window size
    window_size = None  # e.g., 200 bp
    
    # TODO: Compute GC content using compute_gc_content function
    df_gc = None
    
    print(f"✓ Computed GC content for {len(df_gc)} windows")
    print(f"\nFirst 5 windows:")
    # TODO: Display first 5 rows of df_gc
else:
    print("⚠ No sequence available for GC content analysis")


In [None]:
# TODO: Plot GC content along the sequence
if sequence is not None and 'df_gc' in locals() and df_gc is not None:
    print("\n" + "=" * 60)
    print("Visualizing GC Content Along Gene")
    print("=" * 60)
    
    # TODO: Create plot
    plt.figure(figsize=(12, 5))
    
    # TODO: Plot GC content
    # Use plt.plot() with df_gc['start'] and df_gc['gc_content']
    
    # TODO: Add labels and title
    plt.xlabel("Genomic position (bp)", fontsize=12, fontweight='bold')
    plt.ylabel("GC content", fontsize=12, fontweight='bold')
    plt.title(f"GC Content Along {gene_name} Gene (window size: {window_size} bp)", 
              fontsize=14, fontweight='bold')
    
    # TODO: Add horizontal line for overall GC content
    # Use plt.axhline() to show the overall GC content as reference
    
    # TODO: Add grid for better readability
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # TODO: Print summary statistics
    print(f"\nGC Content Statistics:")
    print(f"  Mean: {df_gc['gc_content'].mean():.3f}")
    print(f"  Median: {df_gc['gc_content'].median():.3f}")
    print(f"  Min: {df_gc['gc_content'].min():.3f}")
    print(f"  Max: {df_gc['gc_content'].max():.3f}")
    print(f"  Std Dev: {df_gc['gc_content'].std():.3f}")
else:
    print("⚠ No GC content data available for plotting")


## Exercise 2: Fetch and Analyze Known Variants

**Goal:** Introduce variant data, annotation, and visualization.

**Background:** Genetic variants (SNPs, INDELs) are differences in DNA sequence between individuals. Knowing where variants occur and what functional impact they have is central to studying disease mechanisms and to personalized medicine. The Ensembl Variation API provides access to annotated variants from multiple sources.

**Task List:**
1. Implement the `fetch_variants_for_gene` function:
   - Construct the Ensembl Variation API URL
   - Make an HTTP GET request
   - Parse the JSON response
   - Extract variant information (id, start, end, alleles, consequence_type)
   - Return a DataFrame
2. Fetch variants for your chosen gene
3. Convert start/end positions to numeric
4. Filter variants by location and impact:
   - Promoter region variants (within 1000 bp upstream of the gene start)
   - Coding region variants (based on consequence types)
   - High-impact variants (missense, nonsense, etc.)
5. Assign impact scores to variants:
   - Create a mapping dictionary for consequence types to impact scores
   - Assign scores to each variant
   - Display impact score distribution
6. Create two visualizations:
   - Histogram of variant density along the gene
   - Scatter plot of variant impact score vs. genomic position (highlight high-impact variants)
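
The scoring and filtering steps in tasks 4 and 5 can be previewed on a toy table. In the sketch below the variant IDs, positions, and score values are all invented, and the default score of 1 for unmapped consequence types is an assumption rather than a prescribed rule.

```python
import pandas as pd

# Illustrative scores only; the ranking and values are assumptions.
toy_scores = {"stop_gained": 7, "missense_variant": 6, "synonymous_variant": 2}

df_toy = pd.DataFrame({
    "id": ["v1", "v2", "v3", "v4"],
    "start": [100, 250, 400, 900],
    "consequence_type": ["missense_variant", "intron_variant",
                         "stop_gained", "synonymous_variant"],
})

# Consequence types missing from the dictionary become NaN after map();
# here they fall back to a default score of 1.
df_toy["impact_score"] = (
    df_toy["consequence_type"].map(toy_scores).fillna(1).astype(int)
)

# Boolean indexing with isin() selects the high-impact subset.
high_impact = df_toy[df_toy["consequence_type"].isin(["stop_gained", "missense_variant"])]

print(df_toy["impact_score"].value_counts().sort_index())
print(high_impact)
```

Using a dictionary with `map()` keeps the scoring rule in one place, so changing a score does not require touching the filtering code.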

**Key Concepts:**
- Variant annotation and consequence types
- Promoter regions and regulatory elements
- Coding vs. non-coding variants
- Variant impact classification
- Data filtering and visualization
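
For the variant request, one plausible route is Ensembl's `overlap/id` endpoint restricted to variation features. The URL shape below is an assumption to check against the Ensembl REST documentation, and the records are hand-written stand-ins for the decoded JSON list, not real variants.

```python
FIELDS = ["id", "start", "end", "alleles", "consequence_type"]

def toy_variation_url(gene_id):
    # Assumed endpoint: overlap/id filtered to variation features.
    return (f"https://rest.ensembl.org/overlap/id/{gene_id}"
            "?feature=variation;content-type=application/json")

def toy_variant_rows(records):
    """Keep only the fields of interest; missing fields become None."""
    rows = []
    for rec in records:
        row = {field: rec.get(field) for field in FIELDS}
        # Alleles may arrive as a list; join them for a flat table column.
        if isinstance(row["alleles"], list):
            row["alleles"] = "/".join(row["alleles"])
        rows.append(row)
    return rows

# Invented records standing in for response.json().
mock_records = [
    {"id": "rs_toy1", "start": 105, "end": 105, "alleles": ["C", "T"],
     "consequence_type": "missense_variant", "strand": 1},
    {"id": "rs_toy2", "start": 230, "end": 231, "alleles": ["AG", "-"]},
]

rows = toy_variant_rows(mock_records)
for row in rows:
    print(row)
```

Using `dict.get()` means a record that lacks a field does not abort the loop, and `pd.DataFrame(rows)` would turn the result into the table the exercise expects.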


In [None]:
# TODO: Implement fetch_variants_for_gene function
def fetch_variants_for_gene(gene_id):
    """
    Retrieve SNPs/INDELs for a given gene using Ensembl Variation API.
    
    Parameters:
    - gene_id (str): Ensembl gene ID
    
    Returns:
    - DataFrame: Variation table with columns:
        ['id', 'start', 'end', 'alleles', 'consequence_type']
    """
    # TODO: Construct the API URL
    url = None  # Fill in the URL
    
    # TODO: Make the API request
    response = None  # Use requests.get()
    
    # TODO: Check if request was successful
    
    # TODO: Parse JSON response
    data = None
    
    # TODO: Extract variant information
    rows = []
    # TODO: Loop through data and extract: id, start, end, alleles, consequence_type
    
    # TODO: Create DataFrame
    df = None  # Use pd.DataFrame() with appropriate columns
    
    return df


In [None]:
# Exercise 2: Fetch and Analyze Known Variants
print("=" * 60)
print("Exercise 2: Fetch and Analyze Known Variants")
print("=" * 60)

# TODO: Check if we have the gene sequence from Exercise 1
# If not, re-fetch it or use gene_id if available

# TODO: Fetch variants using fetch_variants_for_gene function
df_variants = None

# TODO: Convert start/end to numeric if needed
# Use pd.to_numeric() with errors='coerce'

# TODO: Remove rows with invalid positions
# Use dropna() on 'start' and 'end' columns

# TODO: Display variant statistics
# Print total variants, first 5 variants, consequence type counts


In [None]:
# TODO: Filter variants by location and impact
if df_variants is not None and len(df_variants) > 0:
    print("\n" + "=" * 60)
    print("Filtering Variants")
    print("=" * 60)
    
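    # TODO: Filter 1: Promoter region variants (within 1000 bp upstream of the gene start)
    # Use minimum start position as gene start reference (a simplification that
    # assumes a forward-strand gene; on the reverse strand the promoter lies
    # beyond the largest coordinate)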
    min_start = None  # Calculate from df_variants
    promoter_start = None  # min_start - 1000
    promoter_end = None  # min_start
    
    # TODO: Filter variants in promoter region
    df_promoter = None  # Use boolean indexing
    
    # TODO: Filter 2: Coding region variants
    # Define coding-related consequence types
    coding_consequences = None  # List of consequence types
    
    # TODO: Filter variants with coding consequences
    df_coding = None
    
    # TODO: Filter 3: High-impact variants (missense, nonsense)
    high_impact_consequences = None  # List of high-impact types
    
    # TODO: Filter high-impact variants
    df_high_impact = None
    
    # TODO: Print filtering results
    print(f"\n1. Promoter region variants: {len(df_promoter)}")
    print(f"2. Coding region variants: {len(df_coding)}")
    print(f"3. High-impact variants: {len(df_high_impact)}")
    
else:
    print("⚠ No variants available for filtering")


In [None]:
# TODO: Assign impact scores to variants
if df_variants is not None and len(df_variants) > 0:
    print("\n" + "=" * 60)
    print("Assigning Impact Scores")
    print("=" * 60)
    
    # TODO: Define impact score mapping dictionary
    # Higher score = more severe impact
    # Include: transcript_ablation, splice variants, stop_gained (nonsense),
    #          frameshift_variant, missense_variant, synonymous_variant, etc.
    impact_scores = {
        # TODO: Fill in the mapping
        # Example: 'missense_variant': 6,
        #          'stop_gained': 7,
        #          etc.
    }
    
    # TODO: Assign impact scores using map()
    df_variants['impact_score'] = None  # Use map() with lambda function
    
    # TODO: Print impact score distribution
    # Use value_counts() and sort_index()
    
    # TODO: Print mean, max, min impact scores
    
else:
    print("⚠ No variants available for impact scoring")


In [None]:
# TODO: Plot 1: Histogram of variant density along the gene
if df_variants is not None and len(df_variants) > 0:
    print("\n" + "=" * 60)
    print("Plot 1: Variant Density Histogram")
    print("=" * 60)
    
    plt.figure(figsize=(12, 5))
    
    # TODO: Create histogram
    # Use plt.hist() with df_variants['start']
    bins = None  # e.g., 50
    
    # TODO: Add vertical span for promoter region if available
    # Use plt.axvspan() for promoter_start to promoter_end
    
    # TODO: Add labels, title, grid, legend
    plt.xlabel("Genomic position (bp)", fontsize=12, fontweight='bold')
    plt.ylabel("Variant count", fontsize=12, fontweight='bold')
    plt.title(f"Variant Density Along {gene_name} Gene", fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print(f"✓ Histogram created")
else:
    print("⚠ No variants available for density plot")


In [None]:
# TODO: Plot 2: Scatter plot of variant impact score vs. genomic position
if df_variants is not None and len(df_variants) > 0 and 'impact_score' in df_variants.columns:
    print("\n" + "=" * 60)
    print("Plot 2: Variant Impact Score vs. Genomic Position")
    print("=" * 60)
    
    plt.figure(figsize=(14, 6))
    
    # TODO: Create scatter plot
    # Color code by impact score
    # Use plt.scatter() with df_variants['start'] and df_variants['impact_score']
    # Set c=df_variants['impact_score'] and use a colormap (e.g., 'RdYlGn_r')
    
    # TODO: Highlight high-impact variants
    # If df_high_impact exists and has impact_score column:
    #   - Plot them with larger markers, different color, marker='*'
    #   - Add label for legend
    
    # TODO: Add colorbar
    # Use plt.colorbar()
    
    # TODO: Add labels, title, grid, legend
    plt.xlabel("Genomic position (bp)", fontsize=12, fontweight='bold')
    plt.ylabel("Impact Score", fontsize=12, fontweight='bold')
    plt.title(f"Variant Impact Score vs. Genomic Position for {gene_name}", 
              fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # TODO: Calculate correlation between position and impact score
    correlation = None  # Use df_variants['start'].corr(df_variants['impact_score'])
    print(f"\nCorrelation between position and impact score: {correlation:.3f}")
else:
    print("⚠ No variants available for impact score plot")


## Exercise 3: Retrieve Methylation Profile (ENCODE API)

**Goal:** Integrate multi-omics data to identify regulatory variants.

**Background:** DNA methylation is an epigenetic modification that can regulate gene expression. Whole-genome bisulfite sequencing (WGBS) provides genome-wide methylation profiles. Integrating variant data with methylation data helps identify regulatory SNPs (rSNPs) that may affect gene expression through epigenetic mechanisms.

**Task List:**
1. Implement the `fetch_encode_methylation` function:
   - Query ENCODE REST API for WGBS files
   - Parse JSON response
   - Extract file metadata (accession, format, output_type, url)
   - Return a DataFrame
2. Query ENCODE for WGBS methylation data
3. Generate methylation profile for variant positions:
   - Find all CpG sites in the sequence
   - For each variant position, find nearest CpG site
   - Simulate methylation beta values based on:
     - Distance to CpG site
     - Whether variant is in promoter region
   - Create DataFrame with position, methylation, nearest_cpg, distance_to_cpg
4. Identify CpG islands:
   - Use sliding window approach
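   - Calculate the CpG observed/expected ratio for each window: (CpG count × window length) / (C count × G count)
   - Identify regions with ratio > 0.6 (a raw per-base CpG density can never exceed 0.5)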
5. Merge variants with methylation scores:
   - Merge on position (start)
   - Fill missing values with median
6. Identify candidate regulatory SNPs:
   - SNPs in highly methylated CpG islands (methylation > 0.7)
   - Variants in promoters with methylation > 0.7
   - Combine and deduplicate results
7. Visualize results:
   - Plot methylation profile along gene with regulatory SNPs highlighted
   - Plot methylation vs. impact score

**Key Concepts:**
- Epigenomics and DNA methylation
- Multi-omics data integration
- Regulatory SNPs (rSNPs)
- CpG islands and promoter methylation
- ENCODE database
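
The merge and fill steps (tasks 5 and 6) follow a common pandas pattern. The sketch below uses two invented tables; the column names mirror those in the task list, and the 0.7 threshold is the one stated there.

```python
import pandas as pd

# Invented toy data: three variants, methylation known for two positions.
variants = pd.DataFrame({"id": ["v1", "v2", "v3"], "start": [100, 250, 400]})
methylation = pd.DataFrame({"position": [100, 400], "methylation": [0.8, 0.4]})

# A left merge keeps every variant, even those without a methylation value.
merged = variants.merge(methylation, left_on="start", right_on="position", how="left")

# Fill gaps with the median of the observed values.
median_beta = merged["methylation"].median()
merged["methylation"] = merged["methylation"].fillna(median_beta)

# Candidate filter using the threshold from the task list.
candidates = merged[merged["methylation"] > 0.7]

print(merged)
print(candidates)
```

A left merge is used here so that no variant is dropped silently. An inner merge would discard `v2`, which has no matching methylation row.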


In [None]:
# TODO: Implement fetch_encode_methylation function
def fetch_encode_methylation(gene_name, limit=200):
    """
    Retrieve WGBS methylation track files from ENCODE.
    
    Parameters:
    - gene_name (str): Used only for reference; ENCODE query is global
    - limit (int): Max returned records
    
    Returns:
    - DataFrame: Minimal metadata table for methylation files
    """
    url = "https://www.encodeproject.org/search/"
    params = {
        "type": "File",
        "assay_title": "WGBS",
        "status": "released",
        "limit": limit,
        "format": "json"
    }
    
    # TODO: Make API request with headers that ask for JSON
    response = None  # Use requests.get() with params=params and headers={"accept": "application/json"}
    
    # TODO: Check response and parse JSON
    data = None
    
    # TODO: Extract file information from data['@graph']
    rows = []
    # TODO: Loop through hits and extract: accession, file_format, output_type, href
    
    # TODO: Create DataFrame
    df = None  # Use pd.DataFrame() with columns ['accession', 'format', 'output_type', 'url']
    
    return df


In [None]:
# Exercise 3: Retrieve Methylation Profile
print("=" * 60)
print("Exercise 3: Retrieve Methylation Profile (ENCODE API)")
print("=" * 60)

# TODO: Query ENCODE API for WGBS files
df_methylation_files = None  # Use fetch_encode_methylation()

# TODO: Display file metadata
# Print number of files, first 5 files, file formats, output types


In [None]:
# TODO: Generate methylation profile for variant positions
if 'sequence' in locals() and sequence is not None and df_variants is not None and len(df_variants) > 0:
    print("\n" + "=" * 60)
    print("Generating Methylation Profile")
    print("=" * 60)
    
    # TODO: Find all CpG positions in the sequence
    cpg_positions = []
    # TODO: Loop through sequence and find "CG" dinucleotides
    
    # TODO: Get variant positions
    variant_positions = None  # Extract from df_variants['start']
    
    # TODO: Create methylation DataFrame
    methylation_data = []
    # TODO: For each variant position:
    #   - Find nearest CpG site
    #   - Calculate distance to CpG
    #   - Determine if in promoter region
    #   - Calculate methylation value (simulate based on distance and promoter status)
    #   - Append to methylation_data
    
    df_methylation = None  # Create DataFrame from methylation_data
    
    # TODO: Identify CpG islands
    # Use sliding window approach
    # Calculate the CpG observed/expected ratio for each window:
    #   (CpG count * window length) / (C count * G count)
    # Identify regions with ratio > 0.6
    cpg_islands = []
    # TODO: Implement CpG island detection
    
    print(f"✓ Generated methylation profile")
    print(f"✓ Identified {len(cpg_islands)} CpG islands")
    
else:
    print("⚠ Sequence or variants not available")
    df_methylation = None


In [None]:
# TODO: Merge variants with methylation scores
if df_variants is not None and len(df_variants) > 0 and df_methylation is not None and len(df_methylation) > 0:
    print("\n" + "=" * 60)
    print("Merging Variants with Methylation Data")
    print("=" * 60)
    
    # TODO: Merge on position
    # Use df_variants.merge() with df_methylation on 'start' and 'position'
    df_merged = None
    
    # TODO: Fill missing methylation values with median
    # Use fillna() with median value
    
    # TODO: Display merged data preview and statistics
    print(f"✓ Merged {len(df_merged)} variants with methylation data")
    
else:
    print("⚠ Cannot merge: variants or methylation data not available")
    df_merged = None


In [None]:
# TODO: Identify candidate regulatory SNPs
if df_merged is not None and len(df_merged) > 0:
    print("\n" + "=" * 60)
    print("Identifying Candidate Regulatory SNPs")
    print("=" * 60)
    
    # TODO: Criterion 1: SNPs in highly methylated CpG islands
    # Check if variant is in a CpG island AND methylation > 0.7
    regulatory_snps_cpg = []
    # TODO: Loop through df_merged and check conditions
    
    df_regulatory_cpg = None  # Create DataFrame from regulatory_snps_cpg
    
    # TODO: Criterion 2: Variants in promoters with methylation > 0.7
    # Filter promoter variants with methylation > 0.7
    df_regulatory_promoter = None
    
    # TODO: Combine all regulatory SNPs and remove duplicates
    df_regulatory_all = None  # Use pd.concat() and drop_duplicates()
    
    # TODO: Display summary statistics
    print(f"  Total candidate regulatory SNPs: {len(df_regulatory_all)}")
    print(f"  From CpG islands: {len(df_regulatory_cpg)}")
    print(f"  From promoters: {len(df_regulatory_promoter)}")
    
else:
    print("⚠ Cannot identify regulatory SNPs: merged data not available")
    df_regulatory_all = pd.DataFrame()


In [None]:
# TODO: Visualize methylation profile and regulatory SNPs
if df_merged is not None and len(df_merged) > 0:
    print("\n" + "=" * 60)
    print("Visualizing Methylation Profile and Regulatory SNPs")
    print("=" * 60)
    
    fig, axes = plt.subplots(2, 1, figsize=(14, 10))
    
    # TODO: Plot 1: Methylation along gene with variant positions
    ax1 = axes[0]
    # TODO: Scatter plot of all variants colored by methylation
    # TODO: Highlight regulatory SNPs with larger markers, different color, marker='*'
    # TODO: Add promoter region highlight (axvspan)
    # TODO: Add CpG island highlights (axvspan for each island)
    # TODO: Add colorbar, labels, title, grid, legend
    
    # TODO: Plot 2: Methylation vs Impact Score
    ax2 = axes[1]
    # TODO: Scatter plot of impact_score vs methylation
    # TODO: Highlight regulatory SNPs
    # TODO: Add threshold line at methylation = 0.7
    # TODO: Add colorbar, labels, title, grid, legend
    
    plt.tight_layout()
    plt.show()
    
    print(f"✓ Visualization complete")
else:
    print("⚠ No data available for visualization")


## Exercise 4: Design CRISPR Guide RNAs

**Goal:** Connect sequence data with computational CRISPR design.

**Background:** CRISPR-Cas9 is a genome editing technology that uses a guide RNA (gRNA) to direct the Cas9 nuclease to specific DNA sequences. The SpCas9 system requires a Protospacer Adjacent Motif (PAM) sequence "NGG" (where N is any nucleotide) immediately downstream of the target site. Effective guide design requires considering multiple factors including GC content, avoiding repetitive sequences, and minimizing off-target effects.

**Task List:**
1. Implement the `find_ngg_sites` function:
   - Use regex to find all NGG patterns in the sequence
   - For each PAM site, extract the 20-bp guide sequence upstream
   - Return DataFrame with guide, pam_start, pam_end
2. Identify all PAM sites and extract guide sequences
3. Implement the `score_guides` function:
   - Calculate GC content score (optimal ~50%)
   - Apply poly-T penalty (avoid TTTT runs)
   - Calculate position score (prefer guides near TSS)
   - Calculate combined weighted score
4. Score all guides using multiple criteria
5. Perform off-target prediction analysis:
   - Calculate sequence complexity
   - Estimate off-target count based on complexity and GC content
   - Simulate top off-target sites
   - Add off-target penalty to final score
6. Visualize guide RNA design results:
   - Distribution of guide scores
   - GC content distribution
   - Guide score vs. genomic position
   - Off-target count vs. guide score
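
The PAM search in task 1 depends on a lookahead so that overlapping NGG sites are all reported. The sketch below runs the idea on an invented 26-bp sequence; the helper name `toy_ngg_guides` is hypothetical and it returns tuples rather than a DataFrame.

```python
import re

def toy_ngg_guides(seq, guide_len=20):
    """Return (guide, pam_start, pam_end) for each forward-strand NGG with room for a guide."""
    hits = []
    # The lookahead matches at every position without consuming characters,
    # so overlapping PAMs such as AGG and GGG in 'AGGG' are both found.
    for m in re.finditer(r"(?=([ACGT]GG))", seq):
        pam_start = m.start()
        if pam_start >= guide_len:
            guide = seq[pam_start - guide_len:pam_start]
            hits.append((guide, pam_start, pam_start + 3))
    return hits

toy_seq = "ACGTACGTACGTACGTACGTAGGGTT"  # invented sequence
hits = toy_ngg_guides(toy_seq)
for guide, pam_start, pam_end in hits:
    print(guide, pam_start, pam_end)
```

This scans the forward strand only. Sites on the reverse strand appear as `CCN` in the forward sequence and would need a separate pass or a reverse-complement step.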

**Key Concepts:**
- CRISPR-Cas9 system and PAM sites
- Guide RNA design principles
- Off-target prediction
- Scoring algorithms for guide efficiency
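
One way to picture the scoring heuristics in task 3 is as three small functions combined with the weights given in the function stub (0.4 for GC, 0.3 for position, minus the poly-T penalty). The linear shapes of `toy_gc_score` and `toy_position_score`, the 1000-bp scale, and all helper names are assumptions made for this sketch.

```python
def toy_gc_fraction(guide):
    return (guide.count("G") + guide.count("C")) / len(guide)

def toy_gc_score(guide):
    """1.0 at 50% GC, falling linearly to 0.0 at 0% or 100% GC (assumed shape)."""
    return 1.0 - 2.0 * abs(toy_gc_fraction(guide) - 0.5)

def toy_poly_t_penalty(guide):
    """0.3 for a TTTT run, 0.1 for TTT, otherwise 0.0."""
    if "TTTT" in guide:
        return 0.3
    if "TTT" in guide:
        return 0.1
    return 0.0

def toy_position_score(pam_start, tss_position=0, scale=1000.0):
    """Assumed linear decay: 1.0 at the TSS, 0.0 at `scale` bp away or further."""
    return max(0.0, 1.0 - abs(pam_start - tss_position) / scale)

def toy_combined_score(guide, pam_start, tss_position=0):
    return (0.4 * toy_gc_score(guide)
            + 0.3 * toy_position_score(pam_start, tss_position)
            - toy_poly_t_penalty(guide))

guide = "ACGTACGTACGTACGTACGT"  # invented 20-mer with 50% GC
score = toy_combined_score(guide, pam_start=20)
print(round(score, 3))
```

With these weights the combined score cannot exceed 0.7, since both component scores are capped at 1.0. Keep that ceiling in mind when interpreting quality categories.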


In [None]:
# TODO: Implement find_ngg_sites function
def find_ngg_sites(sequence):
    """
    Identify PAM sites (NGG) for SpCas9 and extract guide sequences.
    
    Parameters:
    - sequence (str): DNA sequence
    
    Returns:
    - DataFrame: ['guide', 'pam_start', 'pam_end']
    """
    guides = []
    # TODO: Use regex to find all NGG patterns
    # Use re.finditer() with lookahead pattern: r'(?=(.GG))'
    # For each match:
    #   - Get PAM start position
    #   - If pam_start >= 20, extract 20-bp guide upstream
    #   - Append [guide, pam_start, pam_start+3] to guides
    
    # TODO: Create DataFrame
    df = None  # Use pd.DataFrame() with columns ['guide', 'pam_start', 'pam_end']
    
    return df


In [None]:
# TODO: Implement score_guides function
def score_guides(df_guides, tss_position=0):
    """
    Score guides by GC content, poly-T avoidance, and position relative to TSS.
    
    Parameters:
    - df_guides (DataFrame): Guide sequences with PAM positions
    - tss_position (int): Transcription start site position
    
    Returns:
    - DataFrame with additional scoring columns
    """
    scores = []
    gc_scores = []
    poly_t_penalties = []
    position_scores = []
    
    # TODO: Loop through each guide
    for _, row in df_guides.iterrows():
        g = row['guide']
        pam_pos = row['pam_start']
        
        # TODO: 1. Calculate GC content score (optimal ~50%)
        gc = None  # Calculate GC content
        gc_score = None  # Score: 1.0 for 50% GC, decreasing as it deviates
        
        # TODO: 2. Calculate poly-T penalty
        poly_t_penalty = None  # 0.3 for TTTT, 0.1 for TTT, 0.0 otherwise
        
        # TODO: 3. Calculate position score
        distance_from_tss = None  # Calculate distance
        position_score = None  # Higher score for guides near TSS
        
        # TODO: 4. Calculate combined score (weighted)
        combined_score = None  # Weight: gc_score*0.4 + position_score*0.3 - poly_t_penalty
        
        scores.append(combined_score)
        gc_scores.append(gc_score)
        poly_t_penalties.append(poly_t_penalty)
        position_scores.append(position_score)
    
    # TODO: Add columns to df_guides
    df_guides['gc_content'] = None  # Calculate for each guide
    df_guides['gc_score'] = gc_scores
    df_guides['poly_t_penalty'] = poly_t_penalties
    df_guides['position_score'] = position_scores
    df_guides['combined_score'] = scores
    
    return df_guides


In [None]:
# Exercise 4: Design CRISPR Guide RNAs
print("=" * 60)
print("Exercise 4: Design CRISPR Guide RNAs")
print("=" * 60)

# TODO: Check if we have the gene sequence from Exercise 1
# If not, re-fetch it

# TODO: Find all NGG PAM sites and extract guide sequences
df_guides = None  # Use find_ngg_sites()

# TODO: Display PAM site distribution
# Print number of guides, first 5 guides, position range


In [None]:
# TODO: Score guides using multiple criteria
if df_guides is not None and len(df_guides) > 0:
    print("\n" + "=" * 60)
    print("Scoring Guide RNAs")
    print("=" * 60)
    
    # TODO: Set TSS position (use 0 as default)
    tss_position = None
    
    # TODO: Score guides using score_guides function
    df_guides = None  # Use score_guides()
    
    # TODO: Sort by combined score (descending)
    df_guides = None  # Use sort_values()
    
    # TODO: Display scoring statistics
    # Print mean, max, min combined score
    # Print mean GC content, guides with poly-T runs
    
    # TODO: Display top 10 guides by combined score
    
    # TODO: Analyze guide quality distribution
    # Categorize: Excellent (≥0.7), Good (0.5-0.7), Fair (0.3-0.5), Poor (<0.3)
    
else:
    print("⚠ No guides available for scoring")


In [None]:
# TODO: Off-Target Prediction Analysis
if df_guides is not None and len(df_guides) > 0:
    print("\n" + "=" * 60)
    print("Off-Target Prediction Analysis")
    print("=" * 60)
    
    # TODO: Select top guides for off-target analysis
    top_guides = None  # Use head(10)
    
    # TODO: For each guide, simulate off-target predictions
    off_target_data = []
    # TODO: Loop through top_guides:
    #   - Calculate sequence complexity (unique k-mers)
    #   - Estimate off-target count based on complexity and GC content
    #   - Simulate top 3 off-target sites with mismatches
    #   - Append to off_target_data
    
    df_off_target = None  # Create DataFrame from off_target_data
    
    # TODO: Add off-target count to main guides dataframe
    # Use map() to add 'estimated_off_targets' column
    
    # TODO: Recalculate final score including off-target penalty
    df_guides['off_target_penalty'] = None  # Normalize off-target count
    df_guides['final_score'] = None  # combined_score - off_target_penalty
    
    # TODO: Re-sort by final score
    
    # TODO: Display top 5 guides by final score
    
else:
    print("⚠ No guides available for off-target analysis")
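
**Hint (illustrative sketch):** The two building blocks of this cell, sequence complexity from unique k-mers and a normalized off-target penalty, might look like the following. The helper names, the toy rule that turns complexity into an off-target count, the `guide` column name, and the 0.3 penalty cap are all assumptions for illustration. Real off-target prediction requires searching the genome, which this exercise only simulates.

```python
import pandas as pd

def kmer_complexity(seq, k=3):
    """Fraction of distinct k-mers among all k-mers in seq (1.0 = no repeats)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return len(set(kmers)) / len(kmers)

def normalize_penalty(counts, max_penalty=0.3):
    """Min-max scale off-target counts to [0, max_penalty]; equal counts give 0."""
    span = counts.max() - counts.min()
    if span == 0:
        return pd.Series(0.0, index=counts.index)
    return (counts - counts.min()) / span * max_penalty

# Hypothetical guides and scores, for illustration only
guides = pd.DataFrame({
    "guide": ["GATTACAGCTTGCACGTAGC", "AAAAAAAAAAAAAAAAAAAA"],
    "combined_score": [0.7, 0.4],
})
# Assumed toy estimate: fewer distinct k-mers -> more predicted off-targets
guides["estimated_off_targets"] = guides["guide"].map(
    lambda g: round(10 * (1.0 - kmer_complexity(g)))
)
guides["off_target_penalty"] = normalize_penalty(guides["estimated_off_targets"])
guides["final_score"] = guides["combined_score"] - guides["off_target_penalty"]
print(guides.sort_values("final_score", ascending=False))
```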


In [None]:
# TODO: Visualize guide RNA design results
if df_guides is not None and len(df_guides) > 0:
    print("\n" + "=" * 60)
    print("Visualizing Guide RNA Design Results")
    print("=" * 60)
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # TODO: Plot 1: Guide scores distribution
    ax1 = axes[0, 0]
    # TODO: Histogram of combined_score
    # TODO: Add vertical line for mean
    
    # TODO: Plot 2: GC content distribution
    ax2 = axes[0, 1]
    # TODO: Histogram of gc_content
    # TODO: Add vertical line at 0.5 (optimal) and mean
    
    # TODO: Plot 3: Score vs Position
    ax3 = axes[1, 0]
    # TODO: Scatter plot of pam_start vs combined_score
    # TODO: Color by gc_content
    # TODO: Highlight top 10 guides
    
    # TODO: Plot 4: Off-target vs Score
    ax4 = axes[1, 1]
    # TODO: Scatter plot of combined_score vs estimated_off_targets
    # TODO: Color by final_score
    # TODO: Highlight top 10 guides
    
    plt.tight_layout()
    plt.show()
    
    print("✓ Visualization complete")
else:
    print("⚠ No data available for visualization")


## Exercise 5: Synthetic Biology Mini-Model in Python

**Goal:** Introduce simple dynamical modeling (ODEs) of gene circuits.

**Background:** Synthetic biology uses mathematical models to predict and design the behavior of biological systems. Ordinary differential equations (ODEs) are commonly used to model gene expression dynamics. A toggle switch is a circuit in which two genes mutually repress each other; when repression is sufficiently strong and cooperative, the circuit is bistable and settles into one of two stable states depending on its initial conditions. An activator circuit consists of a gene that activates its own expression, creating positive feedback.

**Task List:**
1. Implement `simulate_gene_ode` function:
   - Define ODE: dX/dt = alpha - beta * X
   - Use `scipy.integrate.solve_ivp` to integrate
   - Return DataFrame with time and expression columns
2. Simulate simple gene expression model
3. Implement `simulate_toggle_switch` function:
   - Define ODEs for mutual repression
   - Integrate using solve_ivp
   - Return DataFrame with time, X1, X2 columns
4. Simulate toggle switch circuit
5. Implement `simulate_activator_circuit` function:
   - Define ODE with positive feedback
   - Include knockdown_factor parameter
   - Integrate using solve_ivp
6. Simulate activator circuit without knockdown
7. Simulate activator circuit with CRISPR knockdown:
   - Use top guide from Exercise 4 (if available)
   - Simulate with different knockdown strengths
   - Compare results
8. Create summary visualization comparing all models

**Key Concepts:**
- Ordinary Differential Equations (ODEs)
- Gene circuit modeling
- Toggle switches and activator circuits
- Numerical integration
- CRISPR-mediated gene knockdown
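
**Hint (illustrative sketch):** For the simple model the ODE has a closed-form solution, X(t) = (alpha/beta) * (1 - exp(-beta * t)) when X(0) = 0, which gives a handy check on the numerical integration. The sketch below compares `scipy.integrate.solve_ivp` against that formula. The tightened tolerances (`rtol`, `atol`) are a choice made here so the comparison is sharp; they are not required by the exercise.

```python
import numpy as np
from scipy.integrate import solve_ivp

alpha, beta, t_end = 1.0, 0.5, 20
t_eval = np.linspace(0, t_end, 200)

# Numerical solution of dX/dt = alpha - beta * X with X(0) = 0
sol = solve_ivp(
    lambda t, X: alpha - beta * X,
    t_span=[0, t_end],
    y0=[0.0],
    t_eval=t_eval,
    rtol=1e-8,
    atol=1e-10,
)

# Closed-form solution for comparison
analytic = (alpha / beta) * (1.0 - np.exp(-beta * t_eval))
max_error = np.max(np.abs(sol.y[0] - analytic))

print(f"Final expression: {sol.y[0][-1]:.4f} (steady state alpha/beta = {alpha / beta})")
print(f"Max |numerical - analytic|: {max_error:.2e}")
```

Expression approaches the steady state alpha/beta with time constant 1/beta, so a larger degradation rate gives both a lower plateau and a faster approach to it.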


In [None]:
# TODO: Implement simple gene expression ODE simulation
def simulate_gene_ode(alpha=1.0, beta=0.5, t_end=20):
    """
    Simulate simple gene expression ODE: dX/dt = alpha - beta * X
    
    Parameters:
    - alpha (float): Production rate
    - beta (float): Degradation rate
    - t_end (float): End time of the simulation
    
    Returns:
    - DataFrame with time series
    """
    def ode(t, X):
        # TODO: Return dX/dt = alpha - beta * X
        return None
    
    # TODO: Use solve_ivp to integrate
    # t_span=[0, t_end], y0=[0], t_eval=np.linspace(0, t_end, 200)
    sol = None
    
    # TODO: Create DataFrame
    df = None  # Use pd.DataFrame() with 'time' and 'expression' columns
    
    return df


In [None]:
# TODO: Implement toggle switch ODE model
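def simulate_toggle_switch(alpha1=2.0, alpha2=2.0, beta1=1.0, beta2=1.0,
                           n=2, K=1.0, t_end=50, initial_conditions=(0.1, 0.1)):
    """
    Simulate a toggle switch circuit (^ denotes exponentiation; use ** in Python):
        dX1/dt = alpha1 / (1 + (X2/K)^n) - beta1 * X1
        dX2/dt = alpha2 / (1 + (X1/K)^n) - beta2 * X2
    
    Parameters:
    - alpha1, alpha2 (float): Maximal production rates of genes 1 and 2
    - beta1, beta2 (float): Degradation rates of genes 1 and 2
    - n (float): Hill coefficient (cooperativity of repression)
    - K (float): Repressor level giving half-maximal repression
    - t_end (float): End time of the simulation
    - initial_conditions (sequence): Initial levels (X1, X2)
    
    Returns:
    - DataFrame with time series for both genes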
    """
    def ode(t, y):
        X1, X2 = y
        # TODO: Calculate dX1_dt and dX2_dt
        dX1_dt = None
        dX2_dt = None
        return [dX1_dt, dX2_dt]
    
    # TODO: Use solve_ivp to integrate
    sol = None
    
    # TODO: Create DataFrame with time, X1, X2 columns
    df = None
    
    return df
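
**Hint (illustrative sketch):** Bistability means the final state depends on where the system starts. The standalone sketch below integrates the toggle equations from two mirrored starting points and reports which gene ends up high. Note two assumptions that differ from the function defaults above: it uses `alpha=10.0` and unequal initial levels. For the symmetric case with `n=2`, `K=1` and `beta=1`, a short stability calculation suggests the symmetric steady state only becomes unstable when alpha exceeds 2, and a start with X1 equal to X2 stays on the diagonal, so the defaults may not show switching. The names `toggle_rhs` and `final_state` are hypothetical.

```python
import numpy as np
from scipy.integrate import solve_ivp

def toggle_rhs(t, y, alpha=10.0, beta=1.0, n=2, K=1.0):
    """Right-hand side of the symmetric toggle switch (mutual repression)."""
    X1, X2 = y
    dX1_dt = alpha / (1.0 + (X2 / K) ** n) - beta * X1
    dX2_dt = alpha / (1.0 + (X1 / K) ** n) - beta * X2
    return [dX1_dt, dX2_dt]

def final_state(y0, t_end=50):
    """Integrate from y0 and return the levels (X1, X2) at t_end."""
    sol = solve_ivp(toggle_rhs, [0, t_end], y0)
    return sol.y[:, -1]

x1_wins = final_state([1.0, 0.1])  # X1 starts ahead
x2_wins = final_state([0.1, 1.0])  # X2 starts ahead

for label, (x1, x2) in [("X1 ahead", x1_wins), ("X2 ahead", x2_wins)]:
    state = "X1 high / X2 low" if x1 > x2 else "X2 high / X1 low"
    print(f"{label}: X1={x1:.2f}, X2={x2:.2f} -> {state}")
```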


In [None]:
# TODO: Implement activator circuit ODE model
def simulate_activator_circuit(alpha=1.0, beta=0.5, n=2, K=0.5, t_end=20, 
                               initial_condition=0.1, knockdown_factor=1.0):
    """
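    Simulate an activator circuit (positive feedback; ^ denotes exponentiation):
        dX/dt = alpha * (X^n / (K^n + X^n)) - beta * X * knockdown_factor
    
    Parameters:
    - alpha (float): Maximal production rate
    - beta (float): Degradation rate
    - n (float): Hill coefficient (cooperativity of activation)
    - K (float): Expression level giving half-maximal activation
    - t_end (float): End time of the simulation
    - initial_condition (float): Initial expression level
    - knockdown_factor (float): Multiplier for degradation (1.0 = no knockdown, >1.0 = knockdown)
    
    Returns:
    - DataFrame with time series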
    """
    def ode(t, X):
        # TODO: Calculate production term (positive feedback)
        production = None  # alpha * X**n / (K**n + X**n)   (in Python, ^ is XOR, not power)
        
        # TODO: Calculate degradation term (with knockdown)
        degradation = None  # beta * X * knockdown_factor
        
        # TODO: Return dX/dt
        return None
    
    # TODO: Use solve_ivp to integrate
    sol = None
    
    # TODO: Create DataFrame with time and expression columns
    df = None
    
    return df
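
**Hint (illustrative sketch):** With `n=2` the steady states of the activator circuit can be found by hand, which helps when interpreting the simulations. Setting dX/dt = 0 and dividing by X leaves a quadratic, solved below. The helper name `nonzero_fixed_points` is hypothetical and the derivation covers only `n=2`. Of the two nonzero roots, the lower one acts as a threshold: trajectories starting below it decay to zero, and those starting above it rise to the upper root. With the default parameters the threshold comes out near 0.134, so a run started at the default `initial_condition=0.1` would be expected to switch off and not on.

```python
import numpy as np

def nonzero_fixed_points(alpha=1.0, beta=0.5, K=0.5, knockdown_factor=1.0):
    """Nonzero steady states of dX/dt = alpha*X^2/(K^2 + X^2) - beta*kd*X (n = 2 only)."""
    b = beta * knockdown_factor
    # dX/dt = 0 with X != 0 reduces to  b*X^2 - alpha*X + b*K^2 = 0
    disc = alpha ** 2 - 4.0 * b ** 2 * K ** 2
    if disc < 0:
        return []  # only X = 0 remains, so the circuit always switches off
    root = np.sqrt(disc)
    return [float((alpha - root) / (2 * b)), float((alpha + root) / (2 * b))]

for kd in [1.0, 1.5, 2.0, 3.0, 5.0]:
    points = nonzero_fixed_points(knockdown_factor=kd)
    print(f"knockdown_factor={kd}: nonzero steady states = {[round(p, 3) for p in points]}")
```

In this model a strong enough knockdown removes the high state entirely, which is a qualitatively different outcome from a proportional drop in expression.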


In [None]:
# Exercise 5: Synthetic Biology Mini-Model
print("=" * 60)
print("Exercise 5: Synthetic Biology Mini-Model in Python")
print("=" * 60)

# TODO: Part 1: Simple gene expression model
print("\n" + "=" * 60)
print("Part 1: Simple Gene Expression Model")
print("=" * 60)

# TODO: Simulate simple gene expression
df_simple = None  # Use simulate_gene_ode()

# TODO: Plot simple model
# Plot time vs expression
# Add horizontal line for steady state (alpha/beta)
# Add labels, title, legend, grid

plt.tight_layout()
plt.show()


In [None]:
# TODO: Part 2: Toggle Switch Circuit
print("\n" + "=" * 60)
print("Part 2: Toggle Switch Circuit")
print("=" * 60)

# TODO: Simulate toggle switch
df_toggle = None  # Use simulate_toggle_switch()

# TODO: Plot toggle switch
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# TODO: Plot 1: Time course
ax1 = axes[0]
# Plot X1 and X2 vs time
# Add labels, title, legend, grid

# TODO: Plot 2: Phase portrait
ax2 = axes[1]
# Plot X1 vs X2
# Mark start and end points
# Add labels, title, legend, grid

plt.tight_layout()
plt.show()

# TODO: Analyze toggle switch behavior
# Print final X1 and X2 levels
# Determine which state the circuit switched to


In [None]:
# TODO: Part 3: Activator Circuit (Positive Feedback)
print("\n" + "=" * 60)
print("Part 3: Activator Circuit (Positive Feedback)")
print("=" * 60)

# TODO: Simulate activator circuit without knockdown
df_activator_no_kd = None  # Use simulate_activator_circuit() with knockdown_factor=1.0

# TODO: Plot activator circuit
# Plot time vs expression
# Add labels, title, legend, grid

plt.tight_layout()
plt.show()

# TODO: Print analysis
# Print final expression level


In [None]:
# TODO: Part 4: CRISPR Knockdown Effect
print("\n" + "=" * 60)
print("Part 4: CRISPR-Mediated Knockdown Effect")
print("=" * 60)

# TODO: Check if we have guides from Exercise 4
# If available, select top guide and display its information

# TODO: Simulate different knockdown strengths
# Use knockdown_factors: [1.0, 1.5, 2.0, 3.0, 5.0]
knockdown_factors = None
knockdown_labels = None

knockdown_results = {}
# TODO: Loop through knockdown factors and simulate
# Store results in knockdown_results dictionary

# TODO: Plot comparison
# Plot all knockdown scenarios on same plot
# Use different colors for each
# Add labels, title, legend, grid

plt.tight_layout()
plt.show()

# TODO: Calculate knockdown efficiency
# For each knockdown scenario, calculate reduction percentage
# Print efficiency analysis
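
**Hint (illustrative sketch):** Knockdown efficiency can be reported as the percentage reduction in final expression relative to the no-knockdown run. The standalone sketch below re-implements the activator ODE in a hypothetical helper, `final_expression`, so that it runs on its own. It starts from `x0=1.0`, an assumption chosen here so that the baseline run reaches the high-expression state; with a lower starting level the baseline itself may decay to zero, making the percentages meaningless.

```python
import numpy as np
from scipy.integrate import solve_ivp

def final_expression(knockdown_factor, alpha=1.0, beta=0.5, n=2, K=0.5,
                     x0=1.0, t_end=20):
    """Expression level at t_end for the activator circuit with knockdown."""
    def ode(t, X):
        production = alpha * X ** n / (K ** n + X ** n)
        degradation = beta * X * knockdown_factor
        return production - degradation
    sol = solve_ivp(ode, [0, t_end], [x0])
    return sol.y[0, -1]

knockdown_factors = [1.0, 1.5, 2.0, 3.0, 5.0]
baseline = final_expression(1.0)

# Percentage reduction relative to the run without knockdown
reduction_pct = {
    kd: 100.0 * (1.0 - final_expression(kd) / baseline)
    for kd in knockdown_factors
}

for kd, pct in reduction_pct.items():
    print(f"knockdown_factor={kd}: reduction = {pct:.1f}%")
```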


In [None]:
# TODO: Summary visualization: All models together
print("\n" + "=" * 60)
print("Summary: Comparison of All Models")
print("=" * 60)

fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# TODO: Plot 1: Simple model
ax1 = axes[0, 0]
# Plot simple model
# Add steady state line

# TODO: Plot 2: Toggle switch
ax2 = axes[0, 1]
# Plot X1 and X2 vs time

# TODO: Plot 3: Activator (no knockdown)
ax3 = axes[1, 0]
# Plot activator circuit

# TODO: Plot 4: Knockdown comparison
ax4 = axes[1, 1]
# Plot all knockdown scenarios

plt.tight_layout()
plt.show()

# TODO: Print key takeaways
print("\nKey Takeaways:")
print("  1. Simple model: Constant production and first-order degradation → steady state at alpha/beta")
print("  2. Toggle switch: Bistable system with two stable states")
print("  3. Activator circuit: Positive feedback → high expression once X exceeds a threshold")
print("  4. CRISPR knockdown: Modeled as increased degradation → reduces expression")
