# Protein–Protein Interactions Laboratory
## Student Notebook

This notebook contains exercises for the Protein–Protein Interactions laboratory session. Complete each task by following the step-by-step instructions and filling in the code stubs.

**Learning Objectives:**
- Retrieve protein-protein interaction data from public databases
- Build and analyze protein interaction networks
- Apply graph theory concepts to biological systems
- Understand network robustness and biological implications

**Instructions:**
- Read each task description carefully
- Follow the step-by-step instructions
- Fill in the code stubs marked with `# TODO:` comments
- Run each cell and verify your results
- Ask for help if you get stuck!


In [None]:
# Import required libraries
import requests
import networkx as nx
import matplotlib.pyplot as plt
import json
import os
from collections import defaultdict
import numpy as np
import pandas as pd
import seaborn as sns

# Configure plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✓ Libraries imported successfully")


## Task 1: Retrieve PPI Data from STRING API

**Aim:** Retrieve first-neighbor protein-protein interactions (PPIs) from the STRING database for two proteins: **PARG** (Poly(ADP-ribose) glycohydrolase) and **TP53** (p53 tumor suppressor). Compare their interaction networks by analyzing the interaction scores.

**Background:** STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a database of known and predicted protein-protein interactions. It integrates data from multiple sources including experimental evidence, computational predictions, and text mining. First neighbors are proteins that directly interact with the query protein.

**Steps to complete:**
1. Implement the `string_first_neighbors` function to query the STRING API
2. Retrieve interactions for both PARG and TP53
3. Create DataFrames for each protein's interactions
4. Calculate and display statistics (mean, max, min) for interaction scores
5. Compare the two networks

**Key Concepts:**
- REST API usage for biological databases
- Protein identifiers and species taxonomy IDs (9606 = human)
- Interaction confidence scores
- DataFrame manipulation
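
As a rough illustration of the parsing and statistics steps, the sketch below works on a small hand-written sample shaped like a STRING JSON response, so it runs without network access. It assumes the response records carry `preferredName_A`, `preferredName_B`, and `score` keys; the partner names and scores in `sample_response`, the helper name `records_to_edges`, and the `other_partners` set are all made up for illustration.

```python
import pandas as pd

# Hypothetical records mimicking the shape of a STRING JSON response.
sample_response = [
    {"preferredName_A": "QUERY", "preferredName_B": "PARTNER1", "score": 0.95},
    {"preferredName_A": "QUERY", "preferredName_B": "PARTNER2", "score": 0.72},
    {"preferredName_A": "QUERY", "preferredName_B": "PARTNER3", "score": 0.41},
]

def records_to_edges(records, min_score=0.7):
    """Convert JSON-like records to (protein, partner, score) tuples above a threshold."""
    return [
        (r["preferredName_A"], r["preferredName_B"], float(r["score"]))
        for r in records
        if float(r["score"]) >= min_score
    ]

edges = records_to_edges(sample_response, min_score=0.7)
df = pd.DataFrame(edges, columns=["Protein_A", "Protein_B", "Score"])

# Summary statistics on the confidence scores.
stats = df["Score"].agg(["mean", "max", "min"])
print(df)
print(stats.round(3))

# Common interactors between two networks reduce to a set intersection.
other_partners = {"PARTNER2", "PARTNER9"}  # hypothetical second network
common = set(df["Protein_B"]) & other_partners
print(common)
```

The same pattern should carry over to real responses once the request and error handling are in place; only the source of `records` changes.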


In [None]:
# TODO: Implement string_first_neighbors function
def string_first_neighbors(protein, species=9606, min_score=0.7):
    """
    Retrieve first neighbors of a protein using STRING interactors API.
    
    Parameters:
    - protein: protein name (e.g., "TP53")
    - species: NCBI taxonomy ID (9606 = human)
    - min_score: minimum confidence score threshold (0-1 scale, as in the JSON 'score' field)
    
    Returns:
    - list of tuples: [(protein1, protein2, score), ...]
    
    Hint: Use the STRING API endpoint:
    https://string-db.org/api/json/interaction_partners?identifiers={protein}&species={species}
    """
    # TODO: Construct the API URL
    url = None  # Fill in the URL
    
    # TODO: Make the API request
    response = None  # Use requests.get()
    
    # TODO: Check if request was successful
    # Use response.ok or response.raise_for_status()
    
    # TODO: Parse JSON response
    data = None  # Use response.json()
    
    # TODO: Extract interactions as list of tuples (protein, partner, score)
    # Filter by min_score threshold
    edges = []  # Fill in the list comprehension
    
    return edges


In [None]:
# Task 1: Retrieve PPI data from STRING API for PARG and TP53
# TODO: Set up parameters
proteins = None  # List of proteins to query
species_id = None  # Human species ID
min_score = None  # Minimum confidence score threshold

print("=" * 60)
print("Task 1: Retrieving First Neighbors from STRING Database")
print("=" * 60)

# TODO: Create dictionary to store DataFrames
dataframes = {}

# TODO: Loop through each protein
for protein in proteins:
    print(f"\nFetching STRING first neighbors for {protein}...")
    
    try:
        # TODO: Get interactions using string_first_neighbors function
        interactions = None
        
        # TODO: Create DataFrame with columns: ['Protein_A', 'Protein_B', 'Score']
        df = None
        
        # TODO: Store DataFrame in dataframes dictionary
        dataframes[protein] = None
        
        # TODO: Print summary information
        print(f"✓ Retrieved {len(df)} interactions for {protein}")
        print(f"  First 3 interactions:")
        # TODO: Display first 3 interactions
        
    except Exception as e:
        print(f"✗ Error: {e}")
        # Create empty dataframe as fallback
        dataframes[protein] = pd.DataFrame(columns=['Protein_A', 'Protein_B', 'Score'])

# TODO: Display statistics for each protein
print("\n" + "=" * 60)
print("Interaction Score Statistics")
print("=" * 60)

for protein in proteins:
    df = dataframes[protein]
    if len(df) > 0:
        scores = df['Score']
        print(f"\n{protein} ({len(df)} interactions):")
        # TODO: Calculate and print mean, max, min scores
        print(f"  Mean score: {None:.3f}")  # Fill in
        print(f"  Max score:  {None:.3f}")  # Fill in
        print(f"  Min score:  {None:.3f}")  # Fill in
    else:
        print(f"\n{protein}: No interactions found")

# TODO: Display DataFrames (first 10 rows)
print("\n" + "=" * 60)
print("DataFrames Summary")
print("=" * 60)
# TODO: Print head of each dataframe

# TODO: Compare the two networks
# TODO: Find common interactors between PARG and TP53 networks


## Task 2: Build n-Hop Neighborhood Graphs using BFS

**Aim:** Use Breadth-First Search (BFS) to find all proteins within n hops (default n=2) from the target proteins (PARG and TP53), and also retrieve all interactions between these discovered proteins. This creates a complete interaction subgraph centered on each target protein.

**Background:** A protein-centered network includes not only direct interactors but also their interactions with each other. This provides a more complete view of the local interaction neighborhood. BFS systematically explores the network layer by layer, ensuring we capture all proteins within the specified distance.

**Steps to complete:**
1. Implement the `string_n_hop_neighbors` function using BFS:
   - Step 1: Use BFS to discover all nodes within n hops
   - Step 2: For each discovered node, get all its first neighbors
   - Step 3: Filter to keep only edges between discovered nodes
2. Build NetworkX graphs for both PARG and TP53
3. Display graph statistics

**Key Concepts:**
- Breadth-First Search (BFS) algorithm
- n-hop neighborhood discovery
- Building complete subgraphs
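
The layer-by-layer expansion and the edge filtering can be sketched on a toy network without any API calls. In the example below, `toy_network`, `first_neighbors`, and `n_hop_subgraph` are hypothetical stand-ins; in the lab, the neighbor lookup would come from the STRING query instead of a dictionary.

```python
# Hypothetical toy interaction map: node -> {neighbor: score}.
toy_network = {
    "A": {"B": 0.9, "C": 0.8},
    "B": {"A": 0.9, "C": 0.75, "D": 0.85},
    "C": {"A": 0.8, "B": 0.75},
    "D": {"B": 0.85, "E": 0.9},
    "E": {"D": 0.9},
}

def first_neighbors(node):
    """Stand-in for an API call: return (node, partner, score) tuples."""
    return [(node, partner, score) for partner, score in toy_network.get(node, {}).items()]

def n_hop_subgraph(start, n=2):
    # Step 1: BFS, expanding one layer (hop) per iteration.
    visited = {start}
    frontier = {start}
    for hop in range(n):
        next_frontier = set()
        for node in frontier:
            for _, partner, _score in first_neighbors(node):
                if partner not in visited:
                    visited.add(partner)
                    next_frontier.add(partner)
        frontier = next_frontier
        if not frontier:
            break

    # Steps 2-3: collect edges whose endpoints were both discovered,
    # using a sorted tuple as the key so A-B and B-A count once.
    seen = set()
    unique_edges = []
    for node in sorted(visited):
        for a, b, score in first_neighbors(node):
            key = tuple(sorted((a, b)))
            if b in visited and key not in seen:
                seen.add(key)
                unique_edges.append((a, b, score))
    return visited, unique_edges

nodes, edges = n_hop_subgraph("A", n=2)
print(sorted(nodes))
print(edges)
```

Node `E` sits three hops from `A`, so it is excluded, and the `D`-`E` edge is dropped because one endpoint lies outside the discovered set.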


In [None]:
# TODO: Implement string_n_hop_neighbors function
def string_n_hop_neighbors(protein, n=2, species=9606, min_score=0.7):
    """
    Retrieve nodes and edges up to n hops away from the input protein using BFS.
    Also retrieves all interactions between the discovered nodes.
    
    Algorithm:
    1. Use BFS to discover all nodes within n hops
    2. For each discovered node, get all its first neighbors
    3. Filter to keep only edges between discovered nodes
    4. Return all edges in the subgraph
    
    Returns: list of tuples [(protein1, protein2, score), ...]
    """
    # Step 1: BFS to discover all nodes within n hops
    frontier = {protein}  # Starting set
    visited = {protein}   # All discovered nodes
    
    # TODO: Implement BFS loop for n hops
    
    
    # Step 2: Get all interactions between discovered nodes
    all_edges = []
    discovered_nodes = list(visited)
    
    print(f"  Discovered {len(discovered_nodes)} nodes, fetching all interactions...")
    
    # TODO: Fetch interactions for each discovered node
    
    # TODO: Remove duplicate edges if any (keep first occurrence)
    # Hint: Use a set to track seen edges (use sorted tuple as key)
    seen = set()
    unique_edges = []
    # TODO: Filter duplicates
    
    print(f"  Retrieved {len(unique_edges)} unique interactions")
    return unique_edges


In [None]:
# Task 2: Build n-hop neighborhood graphs for PARG and TP53
# TODO: Set parameters
n_hops = None  # Number of hops (default: 2)
species_id = None  # Human species ID
min_score = None  # Minimum confidence score

print("=" * 60)
print("Task 2: Building n-Hop Neighborhood Graphs (BFS)")
print("=" * 60)

# TODO: Create dictionary to store graphs
graphs = {}

# TODO: Loop through each protein
for protein in ["PARG", "TP53"]:
    print(f"\n{'='*60}")
    print(f"Processing {protein} (n={n_hops} hops)")
    print(f"{'='*60}")
    
    try:
        # TODO: Get all edges in the n-hop neighborhood
        edges = None
        
        # TODO: Build NetworkX graph
        G = nx.Graph()
        # TODO: Add edges to graph (with weight=score)
        
        # TODO: Store graph
        graphs[protein] = None
        
        # TODO: Print graph statistics
        print(f"\n✓ Graph constructed for {protein}:")
        print(f"  Nodes: {None}")  # Fill in
        print(f"  Edges: {None}")  # Fill in
        print(f"  Density: {None:.4f}")  # Fill in
        print(f"  Is connected: {None}")  # Fill in
        
        # TODO: If not connected, print component information
        
    except Exception as e:
        print(f"✗ Error building graph for {protein}: {e}")
        graphs[protein] = nx.Graph()  # Empty graph as fallback

# TODO: Compare the two graphs
print("\n" + "=" * 60)
print("Graph Comparison")
print("=" * 60)
# TODO: Print comparison statistics


## Task 3: Visualize PPI Networks

**Aim:** Visualize the n-hop neighborhood graphs created in Task 2 for PARG and TP53. Create publication-quality network visualizations with proper node highlighting, edge styling, and layout algorithms.

**Steps to complete:**
1. Check if graphs exist from Task 2
2. For each graph, create a network visualization with:
   - Target protein highlighted in red
   - Other nodes in light blue
   - Edge widths based on interaction scores
   - Node labels
3. Create degree distribution plots
4. Display all visualizations inline

**Key Concepts:**
- Network visualization techniques
- Layout algorithms (spring layout)
- Node and edge styling
- Highlighting important nodes
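
A minimal sketch of the styling steps on a four-node toy graph is shown below. The node names, the weights, and the scaling factors for sizes and widths are arbitrary choices for illustration, not values prescribed by the lab.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical toy graph; weights play the role of interaction scores.
target = "HUB"
G = nx.Graph()
G.add_weighted_edges_from([
    ("HUB", "P1", 0.95),
    ("HUB", "P2", 0.80),
    ("HUB", "P3", 0.70),
    ("P1", "P2", 0.90),
])

# A fixed seed makes the spring layout reproducible between runs.
pos = nx.spring_layout(G, seed=42)

# One colour and size per node, in the same order as G.nodes().
node_colors = ["red" if node == target else "lightblue" for node in G.nodes()]
node_sizes = [900 if node == target else 400 for node in G.nodes()]

# One width per edge, scaled from the stored weight.
edge_widths = [1 + 3 * data["weight"] for _, _, data in G.edges(data=True)]

fig, ax = plt.subplots(figsize=(6, 5))
nx.draw_networkx_nodes(G, pos, node_color=node_colors, node_size=node_sizes, ax=ax)
nx.draw_networkx_edges(G, pos, width=edge_widths, alpha=0.6, ax=ax)
nx.draw_networkx_labels(G, pos, font_size=9, ax=ax)
ax.set_axis_off()
fig.tight_layout()
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().

# Degree values feed directly into a histogram or a top-N bar plot.
degrees = dict(G.degree())
top_nodes = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
print(top_nodes)
```

Building the colour, size, and width lists in the iteration order of `G.nodes()` and `G.edges()` matters, because the drawing functions match list entries to nodes and edges by position.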


In [None]:
# Task 3: Visualize the n-hop neighborhood graphs
print("=" * 60)
print("Task 3: Visualizing PPI Networks")
print("=" * 60)

# TODO: Check if graphs were created in Task 2
if 'graphs' not in locals() or len(graphs) == 0:
    print("⚠ Warning: Graphs not found. Please run Task 2 first.")
else:
    for protein in ["PARG", "TP53"]:
        if protein not in graphs or graphs[protein].number_of_nodes() == 0:
            print(f"⚠ Warning: No graph available for {protein}")
            continue
        
        G = graphs[protein]
        print(f"\n{'='*60}")
        print(f"Visualizing {protein} network")
        print(f"{'='*60}")
        print(f"Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
        
        # TODO: Create visualization
        plt.figure(figsize=(14, 10))
        
        # TODO: Use spring layout (seed=42 for reproducibility)
        pos = None
        
        # TODO: Create node colors (red for target protein, lightblue for others)
        node_colors = None
        
        # TODO: Create node sizes (larger for target protein)
        node_sizes = None
        
        # TODO: Draw nodes
        # Use nx.draw_networkx_nodes()
        
        # TODO: Draw edges (optionally weight by score)
        # Get edges with data: G.edges(data=True)
        # Calculate edge widths based on weight
        edge_widths = None  # Calculate from edge weights
        # Use nx.draw_networkx_edges()
        
        # TODO: Draw labels
        # Use nx.draw_networkx_labels()
        
        # TODO: Set title and axis
        plt.title(f"{protein} {n_hops}-Hop Neighborhood Network\n"
                 f"({G.number_of_nodes()} nodes, {G.number_of_edges()} edges)",
                 fontsize=16, fontweight='bold', pad=20)
        plt.axis('off')
        plt.tight_layout()
        plt.show()
        
        # TODO: Additional visualization: Degree distribution
        # Calculate degrees
        degrees = None  # Use dict(G.degree())
        degree_values = None  # Extract values
        
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
        
        # TODO: Plot degree distribution histogram
        # Use ax1.hist()
        
        # TODO: Plot top 10 nodes by degree
        # Sort degrees, get top 10, create bar plot on ax2
        
        plt.tight_layout()
        plt.show()
    
    print("\n" + "=" * 60)
    print("Visualization Complete")
    print("=" * 60)


## Task 4: Graph Analysis and Centrality Measures

**Aim:** Perform comprehensive graph analysis for both PARG and TP53 networks. Calculate basic graph statistics (number of nodes, connected components) and compute centrality measures (degree, betweenness, closeness) to identify key hub proteins. Visualize the networks with top-centrality nodes highlighted.

**Steps to complete:**
1. For each graph, calculate basic statistics:
   - Number of nodes and edges
   - Number of connected components
   - Whether the graph is connected
2. Calculate centrality measures:
   - Degree centrality
   - Betweenness centrality
   - Closeness centrality
3. Create bar plots showing top 10 nodes for each centrality measure
4. Visualize networks with top 5 nodes by each centrality highlighted
5. Compare results between PARG and TP53

**Key Concepts:**
- Basic graph statistics
- Centrality measures and their interpretation
- Network visualization with node highlighting
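
To see how the three centrality measures differ, it can help to run them on a graph small enough to check by hand. The toy graph below is hypothetical: a hub `H` with three leaves and a two-node tail.

```python
import networkx as nx
import pandas as pd

# Toy graph: hub "H" with three leaves, plus a short tail H - T1 - T2.
G = nx.Graph([("H", "L1"), ("H", "L2"), ("H", "L3"), ("H", "T1"), ("T1", "T2")])

# Each centrality function returns a {node: value} dict, so a dict of
# dicts becomes a DataFrame with one row per node.
centrality_df = pd.DataFrame({
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
}).sort_values("degree", ascending=False)

print(centrality_df.round(3))

# Pearson correlation between the three measures.
print(centrality_df.corr().round(2))
```

In this toy case `T1` has a low degree but non-zero betweenness, because every shortest path to `T2` runs through it; this is the kind of difference the correlation step in the task is meant to surface.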


In [None]:
# Task 4: Graph Analysis and Centrality Measures
print("=" * 60)
print("Task 4: Graph Analysis and Centrality Measures")
print("=" * 60)

# TODO: Check if graphs were created in Task 2
if 'graphs' not in locals() or len(graphs) == 0:
    print("⚠ Warning: Graphs not found. Please run Task 2 first.")
else:
    # TODO: Store results for both graphs
    analysis_results = {}
    
    for protein in ["PARG", "TP53"]:
        if protein not in graphs or graphs[protein].number_of_nodes() == 0:
            print(f"⚠ Warning: No graph available for {protein}")
            continue
        
        G = graphs[protein]
        print(f"\n{'='*60}")
        print(f"Analyzing {protein} Network")
        print(f"{'='*60}")
        
        # TODO: Calculate basic graph statistics
        num_nodes = None  # Use G.number_of_nodes()
        num_edges = None  # Use G.number_of_edges()
        components = None  # Use nx.connected_components(G)
        num_components = None  # Count components
        is_connected = None  # Use nx.is_connected(G)
        
        print(f"\nBasic Graph Statistics:")
        print(f"  Number of nodes: {num_nodes}")
        print(f"  Number of edges: {num_edges}")
        print(f"  Number of connected components: {num_components}")
        print(f"  Is connected: {is_connected}")
        # TODO: If not connected, print largest component size
        
        # TODO: Calculate centrality measures
        print(f"\nComputing centrality measures...")
        degree_cent = None  # Use nx.degree_centrality(G)
        betweenness_cent = None  # Use nx.betweenness_centrality(G)
        closeness_cent = None  # Use nx.closeness_centrality(G)
        
        # TODO: Create DataFrame from centrality measures
        centrality_df = None  # Use pd.DataFrame()
        # TODO: Sort by degree centrality
        
        # TODO: Store results
        analysis_results[protein] = {
            'graph': G,
            'centrality_df': centrality_df,
            'stats': {
                'nodes': num_nodes,
                'edges': num_edges,
                'components': num_components,
                'is_connected': is_connected
            }
        }
        
        print(f"✓ Centrality measures computed")
        print(f"\nTop 10 hub proteins (by degree centrality):")
        # TODO: Print top 10
        
        # TODO: Visualize centrality measures with bar plots
        fig, axes = plt.subplots(1, 3, figsize=(18, 5))
        
        # TODO: Plot top 10 by degree centrality on axes[0]
        # TODO: Plot top 10 by betweenness centrality on axes[1]
        # TODO: Plot top 10 by closeness centrality on axes[2]
        
        plt.tight_layout()
        plt.show()
        
        # TODO: Correlation between centrality measures
        print(f"\nCorrelation between centrality measures ({protein}):")
        # TODO: Calculate and print correlation matrix
        
        # TODO: Visualize network with top-centrality nodes highlighted
        print(f"\nVisualizing network with top-centrality nodes highlighted...")
        
        # TODO: Get top 5 nodes for each centrality measure
        top_degree_nodes = None
        top_between_nodes = None
        top_close_nodes = None
        
        # TODO: Create three visualizations, one for each centrality measure
        # Use subplots(1, 3) and highlight nodes in different colors
        
        plt.tight_layout()
        plt.show()
        
        # TODO: Print top nodes for each centrality
        print(f"\nTop 5 nodes by centrality ({protein}):")
        # TODO: Print lists
    
    # TODO: Compare results between PARG and TP53
    if len(analysis_results) == 2:
        print("\n" + "=" * 60)
        print("Comparison: PARG vs TP53")
        print("=" * 60)
        # TODO: Compare graph sizes, connectivity, average centrality values


## Task 5: Community Detection (Louvain Clustering)

**Aim:** Identify functional communities/modules in both PARG and TP53 networks using the Louvain algorithm. Compare the community structure between the two networks.

**Steps to complete:**
1. Check if python-louvain is available (`import community.community_louvain`)
2. For each graph:
   - Apply Louvain clustering to find communities
   - Calculate modularity score
   - Organize proteins by cluster
   - Display community sizes
3. Visualize networks with nodes colored by community
4. Compare community structures between PARG and TP53

**Key Concepts:**
- Community detection algorithms
- Modularity optimization
- Network partitioning
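
The sketch below illustrates modularity-based partitioning on a graph with an obvious two-module structure. It uses NetworkX's greedy modularity method (the fallback named in this task) rather than Louvain, so that it runs without python-louvain installed; the barbell graph is an arbitrary toy example.

```python
from collections import defaultdict

import networkx as nx
from networkx.algorithms import community as nx_comm

# Two 5-node cliques joined by a single bridging edge.
G = nx.barbell_graph(5, 0)

# Returns a list of node sets, largest community first.
communities = nx_comm.greedy_modularity_communities(G)
modularity = nx_comm.modularity(G, communities)

# node -> community ID, the same shape that python-louvain's
# best_partition() returns.
partition = {node: cid for cid, members in enumerate(communities) for node in members}

# community ID -> list of member nodes.
clusters = defaultdict(list)
for node, cid in partition.items():
    clusters[cid].append(node)

for cid, members in sorted(clusters.items()):
    print(f"Community {cid}: {len(members)} nodes -> {sorted(members)}")
print(f"Modularity: {modularity:.4f}")
```

Converting both methods to the same `partition` and `clusters` shapes keeps the downstream plotting and comparison code independent of which algorithm ran.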


In [None]:
# Task 5: Community Detection using Louvain algorithm
print("=" * 60)
print("Task 5: Community Detection (Louvain Clustering)")
print("=" * 60)

# TODO: Check if Louvain is available
try:
    import community.community_louvain as community_louvain
    louvain_available = True
except ImportError:
    louvain_available = False
    print("⚠ python-louvain not installed. Using NetworkX alternative method...")

# TODO: Check if graphs were created in Task 2
if 'graphs' not in locals() or len(graphs) == 0:
    print("⚠ Warning: Graphs not found. Please run Task 2 first.")
else:
    # TODO: Store clustering results
    clustering_results = {}
    
    for protein in ["PARG", "TP53"]:
        if protein not in graphs or graphs[protein].number_of_nodes() == 0:
            print(f"⚠ Warning: No graph available for {protein}")
            continue
        
        G = graphs[protein]
        print(f"\n{'='*60}")
        print(f"Community Detection for {protein} Network")
        print(f"{'='*60}")
        
        if louvain_available:
            print("Performing Louvain clustering...")
            
            # TODO: Compute best partition
            partition = None  # Use community_louvain.best_partition(G)
            
            # TODO: Organize proteins by cluster
            clusters = defaultdict(list)
            # TODO: Fill clusters dictionary
            
            print(f"✓ Identified {len(clusters)} communities")
            print(f"\nCommunity sizes:")
            # TODO: Print community sizes and members (for small clusters)
            
            # TODO: Calculate modularity
            modularity = None  # Use community_louvain.modularity(partition, G)
            print(f"\nNetwork modularity: {modularity:.4f}")
            
            clustering_results[protein] = {
                'partition': partition,
                'clusters': clusters,
                'modularity': modularity
            }
            
        else:
            # TODO: Fallback: Use NetworkX built-in community detection
            print("Using NetworkX greedy modularity communities...")
            communities = None  # Use nx.community.greedy_modularity_communities(G)
            
            # TODO: Create partition dict and calculate modularity
            partition = {}
            # TODO: Fill partition dictionary
            
            modularity = None  # Use nx.community.modularity(G, communities)
            
            # TODO: Convert to clusters dict format
            clusters = defaultdict(list)
            # TODO: Fill clusters dictionary
            
            clustering_results[protein] = {
                'partition': partition,
                'clusters': clusters,
                'modularity': modularity
            }
        
        # TODO: Visualize network with communities
        plt.figure(figsize=(14, 10))
        
        # TODO: Use consistent layout
        pos = None
        
        # TODO: Color nodes by community
        node_colors = None  # List of cluster IDs
        cmap = plt.cm.tab20
        
        # TODO: Highlight target protein
        node_sizes = None
        node_edge_colors = None
        node_linewidths = None
        
        # TODO: Draw network
        # Use nx.draw_networkx_nodes(), nx.draw_networkx_edges(), nx.draw_networkx_labels()
        
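        # Label the method that actually ran (Louvain or the NetworkX fallback)
        method_name = "Louvain" if louvain_available else "greedy modularity"
        plt.title(f"{protein} Network with Communities ({method_name}, {len(clusters)} clusters, modularity={modularity:.3f})",
                  fontsize=16, fontweight='bold', pad=20)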
        plt.axis('off')
        plt.tight_layout()
        plt.show()
    
    # TODO: Compare community structures between PARG and TP53
    if len(clustering_results) == 2:
        print("\n" + "=" * 60)
        print("Comparison: PARG vs TP53 Communities")
        print("=" * 60)
        # TODO: Compare number of communities, modularity, community sizes, common nodes


## Task 6: Functional Enrichment Analysis (g:Profiler)

**Aim:** Query g:Profiler API to identify enriched Gene Ontology (GO) terms and KEGG pathways for proteins in both PARG and TP53 networks. Compare the functional enrichment between the two networks to understand their biological roles.

**Background:** Functional enrichment analysis identifies which biological processes, molecular functions, cellular components, or pathways are overrepresented in a set of genes/proteins compared to the background. g:Profiler is a web-based tool that performs enrichment analysis using multiple databases including:
- **GO:BP** (Gene Ontology Biological Process): Biological processes proteins are involved in
- **GO:MF** (Gene Ontology Molecular Function): Molecular functions proteins perform
- **GO:CC** (Gene Ontology Cellular Component): Cellular locations where proteins are found
- **KEGG**: Kyoto Encyclopedia of Genes and Genomes pathways
- **REAC**: Reactome pathways

**Biological Significance:** Enrichment analysis helps interpret network data by:
- Identifying common biological functions of interacting proteins
- Revealing pathways that are overrepresented in the network
- Understanding the biological context of protein interactions
- Comparing functional profiles between different networks

**Key Concepts:**
- Functional enrichment analysis
- Statistical significance (p-values, multiple-testing correction such as FDR)
- Gene Ontology and pathway databases
- Interpreting enrichment results
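
The idea of over-representation can be made concrete with a hypergeometric tail probability, a common basis for enrichment tests. The sketch below is only an illustration of that idea under made-up numbers; g:Profiler's own statistics and multiple-testing correction may differ, and the function name `hypergeom_enrichment_p` is hypothetical.

```python
from math import comb, log10

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K carry the annotation."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Hypothetical numbers: 20,000 background genes, 200 annotated with a term,
# and a 50-gene network containing 8 of those annotated genes.
p = hypergeom_enrichment_p(N=20000, K=200, n=50, k=8)
print(f"p-value = {p:.3e}, -log10(p) = {-log10(p):.2f}")
```

By chance alone, about 0.5 of the 50 genes would carry the annotation (50 × 200 / 20,000), so observing 8 gives a very small p-value. Plotting `-log10(p)` turns small p-values into tall bars, which is why the task's bar plot uses that transform.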

In [None]:
# Task 6: Functional Enrichment Analysis using g:Profiler
print("=" * 60)
print("Task 6: Functional Enrichment Analysis (g:Profiler)")
print("=" * 60)

# TODO: Check if graphs were created in Task 2
if 'graphs' not in locals() or len(graphs) == 0:
    print("⚠ Warning: Graphs not found. Please run Task 2 first.")
else:
    # TODO: Store enrichment results
    enrichment_results = {}
    
    # TODO: Query g:Profiler API
    url = "https://biit.cs.ut.ee/gprofiler/api/gost/profile/"
    
    # Color map for different sources
    source_colors = {
        'GO:BP': 'steelblue',
        'GO:MF': 'coral',
        'GO:CC': 'mediumseagreen',
        'KEGG': 'purple',
        'REAC': 'orange'
    }
    
    for protein in ["PARG", "TP53"]:
        if protein not in graphs or graphs[protein].number_of_nodes() == 0:
            print(f"⚠ Warning: No graph available for {protein}")
            continue
        
        G = graphs[protein]
        # TODO: Get list of all protein nodes
        genes = None
        
        print(f"\n{'='*60}")
        print(f"Enrichment Analysis for {protein} Network")
        print(f"{'='*60}")
        print(f"Querying g:Profiler for {len(genes)} genes...")
        
        # TODO: Prepare API payload
        payload = {
            "organism": "hsapiens",
            "query": None,  # Fill in genes list
            "sources": ["GO:BP", "GO:MF", "GO:CC", "KEGG", "REAC"],
            "user_threshold": None,  # P-value threshold (e.g., 0.05)
            "all_results": False,
            "ordered": False  # node order is not a ranking, so treat the query as unordered
        }
        
        try:
            # TODO: Make API request
            response = None  # Use requests.post()
            
            # TODO: Check response and parse JSON
            results = None
            
            # TODO: Parse and display top enriched terms
            if 'result' in results and len(results['result']) > 0:
                enrichment_df = None  # Create DataFrame from results['result']
                
                # TODO: Filter by p-value and sort
                enrichment_df = None  # Filter p_value < 0.05 and sort
                
                enrichment_results[protein] = enrichment_df
                
                print(f"✓ Found {len(enrichment_df)} significantly enriched terms (p < 0.05)")
                print(f"\nTop 10 enriched terms:")
                # TODO: Display top 10 terms
                
                # TODO: Visualize top enriched terms
                top_terms = None  # Get top 15
                
                if len(top_terms) > 0:
                    fig, ax = plt.subplots(figsize=(12, 8))
                    
                    # TODO: Create bar plot with -log10(p-value)
                    # Color bars by source
                    # Add legend
                    
                    plt.tight_layout()
                    plt.show()
                
                # TODO: Breakdown by source
                print(f"\nEnrichment by source ({protein}):")
                # TODO: Count terms by source
                
            else:
                print("⚠ No enrichment results found")
                enrichment_results[protein] = pd.DataFrame()
                
        except requests.exceptions.RequestException as e:
            print(f"✗ Error querying g:Profiler: {e}")
            enrichment_results[protein] = pd.DataFrame()
        except Exception as e:
            print(f"✗ Error processing enrichment results: {e}")
            enrichment_results[protein] = pd.DataFrame()
    
    # TODO: Compare enrichment between PARG and TP53
    if len(enrichment_results) == 2:
        parg_df = enrichment_results['PARG']
        tp53_df = enrichment_results['TP53']
        
        if len(parg_df) > 0 and len(tp53_df) > 0:
            print("\n" + "=" * 60)
            print("Comparison: PARG vs TP53 Enrichment")
            print("=" * 60)
            
            # TODO: Compare number of enriched terms
            # TODO: Find common enriched terms
            # TODO: Compare by source


## Task 7: Simulated Network Attack (Protein Inhibition)

**Aim:** Simulate the effect of inhibiting the central protein (TP53 or PARG) by removing it from each network. Compare graph properties before and after removal to understand the impact of targeted protein inhibition on network structure and connectivity.

**Steps to complete:**
1. For each graph, calculate metrics BEFORE removal:
   - Number of nodes and edges
   - Network density
   - Connectivity status
   - Number of connected components
   - Largest component size
   - Average degree, clustering, path length
2. Remove the central protein from the graph
3. Calculate the same metrics AFTER removal
4. Calculate impact metrics:
   - Nodes/edges lost
   - Largest component retention
   - Network fragmentation
   - Connectivity loss
5. Visualize before and after networks
6. Create bar chart comparing metrics
7. Compare impact between PARG and TP53 removal

**Key Concepts:**
- Network robustness and vulnerability
- Targeted node removal
- Graph connectivity and fragmentation
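
A before/after comparison can be sketched on a toy graph in which one hub holds two modules together. The graph, the helper name `summarize`, and the selection of metrics below are illustrative assumptions; the task asks for a longer list of metrics.

```python
import networkx as nx

def summarize(G):
    """Collect a few structural metrics; path length only if the graph is connected."""
    components = list(nx.connected_components(G))
    n = G.number_of_nodes()
    return {
        "nodes": n,
        "edges": G.number_of_edges(),
        "density": nx.density(G),
        "components": len(components),
        "largest_component": max((len(c) for c in components), default=0),
        "avg_degree": 2 * G.number_of_edges() / n if n else 0.0,
        "avg_path_length": (
            nx.average_shortest_path_length(G) if n > 1 and nx.is_connected(G) else None
        ),
    }

# Toy network: hub "H" holds two triangles together.
G = nx.Graph([("H", "A"), ("H", "B"), ("A", "B"), ("H", "C"), ("H", "D"), ("C", "D")])

before = summarize(G)
G_attacked = G.copy()          # keep the original graph intact
G_attacked.remove_node("H")    # also removes every edge touching H
after = summarize(G_attacked)

impact = {
    "edges_lost": before["edges"] - after["edges"],
    "fragmentation": after["components"] - before["components"],
    "largest_component_retention": 100 * after["largest_component"] / before["largest_component"],
}
print(before, after, impact, sep="\n")
```

Average path length is undefined for a disconnected graph, which is why the helper returns `None` after the hub is removed instead of calling `nx.average_shortest_path_length` on the fragmented network.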


In [None]:
# Task 7: Simulated Network Attack (Removing Central Proteins)
print("=" * 60)
print("Task 7: Simulated Network Attack (Protein Inhibition)")
print("=" * 60)

# TODO: Check if graphs were created in Task 2
if 'graphs' not in locals() or len(graphs) == 0:
    print("⚠ Warning: Graphs not found. Please run Task 2 first.")
else:
    # TODO: Store attack results
    attack_results = {}
    
    for protein in ["PARG", "TP53"]:
        if protein not in graphs or graphs[protein].number_of_nodes() == 0:
            print(f"⚠ Warning: No graph available for {protein}")
            continue
        
        G = graphs[protein]
        print(f"\n{'='*60}")
        print(f"Simulating Attack: Removing {protein} from Network")
        print(f"{'='*60}")
        
        # TODO: Calculate metrics BEFORE removal
        print(f"\nBEFORE removal of {protein}:")
        before_metrics = {
            'nodes': None,  # Use G.number_of_nodes()
            'edges': None,  # Use G.number_of_edges()
            'density': None,  # Use nx.density(G)
            'is_connected': None,  # Use nx.is_connected(G)
            'components': None,  # Count connected components
            'largest_component': None,  # Size of largest component
            'avg_degree': None,  # Calculate from degrees
            'clustering': None,  # Use nx.average_clustering(G)
        }
        
        # TODO: If connected, calculate path length and diameter
        if before_metrics['is_connected']:
            before_metrics['avg_path_length'] = None  # Use nx.average_shortest_path_length(G)
            before_metrics['diameter'] = None  # Use nx.diameter(G)
        else:
            before_metrics['avg_path_length'] = None
            before_metrics['diameter'] = None
        
        # TODO: Print before metrics
        
        # TODO: Remove the central protein
        G_attacked = G.copy()
        if protein in G_attacked.nodes():
            # TODO: Remove the node
            pass
            print(f"\n✓ Removed {protein} from network")
        
        # TODO: Calculate metrics AFTER removal
        print(f"\nAFTER removal of {protein}:")
        after_metrics = {
            'nodes': None,  # Calculate from G_attacked
            'edges': None,
            'density': None,
            'is_connected': None,
            'components': None,
            'largest_component': None,
            'avg_degree': None,
            'clustering': None,
        }
        
        # TODO: If connected after removal, calculate path length and diameter
        
        # TODO: Print after metrics with changes
        
        # TODO: Calculate impact metrics
        impact = {
            'nodes_lost': None,  # Calculate difference
            'edges_lost': None,
            'largest_component_loss': None,
            'largest_component_retention': None,  # Percentage
            'fragmentation': None,  # Additional components
            'connectivity_lost': None  # Boolean: was connected, now not
        }
        
        # TODO: Print impact summary
        
        attack_results[protein] = {
            'before': before_metrics,
            'after': after_metrics,
            'impact': impact,
            'graph_attacked': G_attacked
        }
        
        # TODO: Visualize before and after
        fig, axes = plt.subplots(1, 2, figsize=(20, 8))
        
        # TODO: Plot network BEFORE removal (highlight target protein in red)
        # TODO: Plot network AFTER removal
        
        plt.tight_layout()
        plt.show()
        
        # TODO: Bar plot comparing metrics
        # Compare: nodes, edges, density, avg_degree, clustering
        # Use side-by-side bars (before vs after)
        
        plt.tight_layout()
        plt.show()
    
    # TODO: Compare impact between PARG and TP53 removal
    if len(attack_results) == 2:
        print("\n" + "=" * 60)
        print("Comparison: Impact of Removing PARG vs TP53")
        print("=" * 60)
        # TODO: Compare fragmentation, component retention, connectivity loss, edges lost
