# Multi-Gene Pathfinder Query System

This notebook runs multi-gene PathFinder queries using the NCATS Translator TCT library.

**Workflow:**
1. Gene normalization (HUGO/Ensembl → CURIEs)
2. Multi-gene TCT PathFinder queries
3. Knowledge graph collection and merging
4. Graph clustering
5. Semantic labeling (future)
6. Neo4j export (future)

**Test Dataset:** COVID-19 differentially expressed genes from PMC11255397

## Setup & Dependencies

In [79]:
import json
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from typing import List, Dict, Any, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from pathlib import Path

# TCT imports - matching PathFinder notebook pattern
try:
    from TCT import TCT
    from TCT import name_resolver
    from TCT import translator_metakg
    from TCT import translator_kpinfo
    from TCT import translator_query
    TCT_AVAILABLE = True
    print("✓ TCT library loaded successfully")
except ImportError as e:
    print("⚠️  TCT not installed. Run: pip install TCT")
    print(f"   Error: {e}")
    TCT_AVAILABLE = False

# Visualization imports
try:
    import ipycytoscape
    CYTOSCAPE_AVAILABLE = True
    print("✓ ipycytoscape loaded successfully")
except ImportError:
    print("⚠️  ipycytoscape not installed. Run: pip install ipycytoscape")
    CYTOSCAPE_AVAILABLE = False

# Configuration
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)
(DATA_DIR / "raw").mkdir(exist_ok=True)
(DATA_DIR / "processed").mkdir(exist_ok=True)

print(f"\n✓ Setup complete")
print(f"  TCT available: {TCT_AVAILABLE}")
print(f"  Cytoscape available: {CYTOSCAPE_AVAILABLE}")

✓ TCT library loaded successfully
✓ ipycytoscape loaded successfully

✓ Setup complete
  TCT available: True
  Cytoscape available: True


## 1. Gene Normalization

Use TCT's built-in `name_resolver.batch_lookup()` to normalize gene symbols to CURIEs.

In [80]:
def normalize_genes_with_tct(gene_symbols: List[str]) -> pd.DataFrame:
    """Normalize gene symbols using TCT's name_resolver.
    
    Args:
        gene_symbols: List of HUGO gene symbols (e.g., ['CD6', 'IFITM3'])
        
    Returns:
        DataFrame with columns: original, curie, label, types
    """
    if not TCT_AVAILABLE:
        print("⚠️  TCT not available - cannot normalize genes")
        return pd.DataFrame()
    
    print(f"Normalizing {len(gene_symbols)} genes using TCT name_resolver...")
    
    # Use TCT's batch_lookup for efficient normalization
    gene_info_dict = name_resolver.batch_lookup(gene_symbols)
    
    results = []
    for gene_symbol in gene_symbols:
        if gene_symbol in gene_info_dict:
            info = gene_info_dict[gene_symbol]
            results.append({
                'original': gene_symbol,
                'curie': info.curie if hasattr(info, 'curie') else None,
                'label': info.name if hasattr(info, 'name') else None,
                'types': info.types if hasattr(info, 'types') else []
            })
            print(f"  ✓ {gene_symbol} → {info.curie if hasattr(info, 'curie') else 'N/A'}")
        else:
            print(f"  ✗ {gene_symbol} - not found")
            results.append({
                'original': gene_symbol,
                'curie': None,
                'label': None,
                'types': []
            })
    
    df = pd.DataFrame(results)
    # Filter out genes that weren't found
    df = df[df['curie'].notna()]
    
    print(f"\n✓ Successfully normalized {len(df)}/{len(gene_symbols)} genes")
    return df

# Test with a single gene if TCT is available
if TCT_AVAILABLE:
    test_result = name_resolver.batch_lookup(['CD6'])
    if 'CD6' in test_result:
        print(f"\nTest: CD6 → {test_result['CD6'].curie if hasattr(test_result['CD6'], 'curie') else 'N/A'}")
else:
    print("\n⚠️  Install TCT to use gene normalization: pip install TCT")


Test: CD6 → NCBIGene:113886265
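
A CURIE such as `NCBIGene:10410` pairs a namespace prefix with a local identifier. The sketch below shows one way to split and sanity-check the strings returned by the resolver. It is not part of TCT: `split_curie` and `is_gene_curie` are hypothetical helper names, and the set of accepted gene prefixes is an assumption made for illustration.

```python
from typing import Tuple

# Assumed set of gene namespaces, for this illustration only.
GENE_PREFIXES = {"NCBIGene", "HGNC", "ENSEMBL"}


def split_curie(curie: str) -> Tuple[str, str]:
    """Split 'PREFIX:local_id' at the first colon."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"Not a CURIE: {curie!r}")
    return prefix, local_id


def is_gene_curie(curie: str) -> bool:
    """Return True when the string is a CURIE with an accepted gene prefix."""
    try:
        prefix, _ = split_curie(curie)
    except ValueError:
        return False
    return prefix in GENE_PREFIXES


print(split_curie("NCBIGene:10410"))
print(is_gene_curie("MONDO:0100096"))
```

A prefix check like this only confirms the namespace. It cannot tell whether the local identifier points at the intended species or gene record.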


## 2. Test Gene List (COVID-19)

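Define the test gene list from PMC11255397 and normalize it. In the recorded run below, five symbols (IFITM3, IFITM2, STAT5A, DPP4, IL32) resolve to their human NCBIGene records. The identifiers returned for CD6, KLRG1, PIK3AP1, FYN, and IL4R do not appear to be the human entries (human CD6, for example, is NCBIGene:923), so review the resolver output before querying.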

In [81]:
# COVID-19 differential expression genes from README
COVID_GENES = [
    "CD6",
    "IFITM3",
    "IFITM2",
    "STAT5A",
    "KLRG1",
    "DPP4",
    "IL32",
    "PIK3AP1",
    "FYN",
    "IL4R"
]

# Normalize all genes using TCT
normalized_genes_df = normalize_genes_with_tct(COVID_GENES)

# Display results
if len(normalized_genes_df) > 0:
    print("\nNormalized Genes:")
    display(normalized_genes_df[["original", "curie", "label"]])
else:
    print("\n⚠️  No genes were successfully normalized")

Normalizing 10 genes using TCT name_resolver...
  ✓ CD6 → NCBIGene:113886265
  ✓ IFITM3 → NCBIGene:10410
  ✓ IFITM2 → NCBIGene:10581
  ✓ STAT5A → NCBIGene:6776
  ✓ KLRG1 → NCBIGene:611464
  ✓ DPP4 → NCBIGene:1803
  ✓ IL32 → NCBIGene:9235
  ✓ PIK3AP1 → NCBIGene:100155165
  ✓ FYN → NCBIGene:791125
  ✓ IL4R → NCBIGene:705404

✓ Successfully normalized 10/10 genes

Normalized Genes:


|   | original | curie | label |
|---|----------|-------|-------|
| 0 | CD6 | NCBIGene:113886265 | |
| 1 | IFITM3 | NCBIGene:10410 | |
| 2 | IFITM2 | NCBIGene:10581 | |
| 3 | STAT5A | NCBIGene:6776 | |
| 4 | KLRG1 | NCBIGene:611464 | |
| 5 | DPP4 | NCBIGene:1803 | |
| 6 | IL32 | NCBIGene:9235 | |
| 7 | PIK3AP1 | NCBIGene:100155165 | |
| 8 | FYN | NCBIGene:791125 | |
| 9 | IL4R | NCBIGene:705404 | |


## 3. TCT Query Setup

Load Translator resources and prepare query configuration.

In [82]:
if not TCT_AVAILABLE:
    print("⚠️  TCT not available - skipping query setup")
else:
    print("Loading Translator resources...")
    print("(This may take 1-2 minutes and some APIs may timeout - this is normal)\n")
    
    try:
        # Load resources - matching PathFinder notebook pattern
        APInames, metaKG, Translator_KP_info = translator_metakg.load_translator_resources()
        
        print(f"\n✓ Successfully loaded resources:")
        print(f"  APIs: {len(APInames)}")
        print(f"  MetaKG edges: {len(metaKG)}")
        
    except Exception as e:
        print(f"\n⚠️  Error during resource loading: {e}")
        print("\nAttempting fallback: loading without Plover APIs...")
        
        # Fallback: load components separately, skip add_plover_API
        Translator_KP_info, APInames = translator_kpinfo.get_translator_kp_info()
        metaKG = translator_metakg.get_KP_metadata(APInames)
        
        print(f"\n✓ Loaded resources (fallback mode):")
        print(f"  APIs: {len(APInames)}")
        print(f"  MetaKG edges: {len(metaKG)}")
    
    # Prepare query inputs
    input_gene_curies = normalized_genes_df['curie'].tolist()
    input_gene_categories = ['biolink:Gene']
    
    # Disease: COVID-19
    COVID19_CURIE = "MONDO:0100096"
    disease_categories = ['biolink:Disease']
    
    # Select predicates for gene-disease connections
    all_predicates = list(set(
        TCT.select_concept(
            sub_list=input_gene_categories,
            obj_list=disease_categories,
            metaKG=metaKG
        )
    ))
    
    print(f"\n✓ Found {len(all_predicates)} predicates for gene→disease queries")
    if len(all_predicates) > 0:
        print(f"  Examples: {[p.replace('biolink:', '') for p in all_predicates[:5]]}")
    
    # Select APIs capable of answering gene-disease queries
    selected_APIs = TCT.select_API(
        sub_list=input_gene_categories,
        obj_list=disease_categories,
        metaKG=metaKG
    )
    
    print(f"\n✓ Selected {len(selected_APIs)} APIs for querying")
    if len(selected_APIs) > 0:
        print(f"  Examples: {selected_APIs[:3]}")

Loading Translator resources...
(This may take 1-2 minutes and some APIs may timeout - this is normal)


⚠️  Error during resource loading: HTTPSConnectionPool(host='kg2cplover.rtx.ai', port=9990): Max retries exceeded with url: /kg2c/meta_knowledge_graph (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x134d28b90>, 'Connection to kg2cplover.rtx.ai timed out. (connect timeout=None)'))

Attempting fallback: loading without Plover APIs...

✓ Loaded resources (fallback mode):
  APIs: 54
  MetaKG edges: 10593

✓ Found 39 predicates for gene→disease queries
  Examples: ['associated_with_decreased_likelihood_of', 'associated_with', 'gene_associated_with_condition', 'caused_by', 'has_biomarker']

✓ Selected 15 APIs for querying
  Examples: ['BioThings Explorer (BTE) TRAPI', 'Automat-ubergraph(Trapi v1.5.0)', 'Automat-ehr-clinical-connections-kp(Trapi v1.5.0)']
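
The predicate and API selection above can be pictured as a filter over the metaKG table. The following is a minimal sketch on toy data, not TCT's implementation. The `API` and `Predicate` columns are the ones this notebook reads from `metaKG`; the `Subject` and `Object` column names, the toy rows, and the helper `filter_metakg` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the metaKG table. 'Subject' and 'Object' are assumed
# column names; the rows are invented for this example.
toy_metakg = pd.DataFrame(
    [
        ("KP-A", "biolink:Gene", "biolink:Disease", "biolink:associated_with"),
        ("KP-A", "biolink:Gene", "biolink:Disease", "biolink:related_to"),
        ("KP-B", "biolink:Gene", "biolink:Disease", "biolink:related_to"),
        ("KP-B", "biolink:Gene", "biolink:Pathway", "biolink:participates_in"),
        ("KP-C", "biolink:Drug", "biolink:Disease", "biolink:treats"),
    ],
    columns=["API", "Subject", "Object", "Predicate"],
)


def filter_metakg(metakg: pd.DataFrame, sub_list, obj_list) -> pd.DataFrame:
    """Keep rows whose subject and object categories are both requested."""
    mask = metakg["Subject"].isin(sub_list) & metakg["Object"].isin(obj_list)
    return metakg[mask]


matches = filter_metakg(toy_metakg, ["biolink:Gene"], ["biolink:Disease"])
predicates = sorted(set(matches["Predicate"]))  # unique predicates for the pair
apis = sorted(set(matches["API"]))              # APIs advertising that pair

print(predicates)
print(apis)
```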


## 4. Execute TCT Queries

Query the selected Translator APIs in parallel for disease nodes connected to the input genes.

**Note on API Errors:**
- `argument of type 'NoneType' is not iterable` means the API returned no response body for this query, either because it has no matching data or because it was unreachable
- This is expected: not every KP has data for every gene/predicate combination
- APIs that succeed still return usable results
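
The error in the note above is what Python raises when a membership test is applied to `None`. The sketch below reproduces that situation with two stand-in functions run in a thread pool and records a per-source outcome. The names `fake_kp_with_data`, `fake_kp_without_body`, `count_edges`, and `run_query` are hypothetical, and the response shape is an assumption; this is not how `translator_query.parallel_api_query` is implemented.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_kp_with_data(query):
    # Stand-in for a knowledge provider that answers with one edge.
    return {"message": {"knowledge_graph": {"edges": {"e1": {}}}}}


def fake_kp_without_body(query):
    # Stand-in for a provider that yields no response body.
    return None


def count_edges(response):
    # `"message" in None` raises TypeError, the kind of error seen in the log.
    if "message" in response:
        return len(response["message"]["knowledge_graph"]["edges"])
    return 0


def run_query(kp, query):
    return count_edges(kp(query))


kps = {"kp-with-data": fake_kp_with_data, "kp-no-body": fake_kp_without_body}
outcomes = {}

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(run_query, fn, {}): name for name, fn in kps.items()}
    for future in as_completed(futures):
        name = futures[future]
        try:
            outcomes[name] = f"ok ({future.result()} edges)"
        except TypeError as exc:
            outcomes[name] = f"failed: {exc}"

for name in sorted(outcomes):
    print(name, "->", outcomes[name])
```

Recording an outcome per source, as `outcomes` does here, is one way to tell empty answers apart from failures after the run.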

In [83]:
if not TCT_AVAILABLE or len(normalized_genes_df) == 0:
    print("⚠️  Skipping queries - TCT not available or no genes normalized")
    query_results = {}
else:
    print("Creating TRAPI query for gene neighborhood discovery...")
    print("Strategy: Find all entities connected to COVID-19 genes\n")
    
    # Format query: Find neighbors of genes (empty target list = find anything connected)
    # This matches the PathFinder/NetworkFinder pattern from TCT examples
    query_json = TCT.format_query_json(
        input_gene_curies,              # Input genes
        [],                             # Empty = find neighbors (not specific target)
        input_gene_categories,          # Gene categories
        disease_categories,             # Look for disease connections
        all_predicates                  # Use discovered predicates
    )
    
    print(f"✓ Query created:")
    print(f"  Input genes: {len(input_gene_curies)}")
    print(f"  Query type: Neighborhood discovery (find all connections)")
    print(f"  Predicates: {len(all_predicates)}")
    
    # Build API predicates dictionary
    API_predicates = {}
    API_withMetaKG = list(set(metaKG['API']))
    for api in API_withMetaKG:
        API_predicates[api] = list(set(metaKG[metaKG['API'] == api]['Predicate']))
    
    print(f"\nQuerying {len(selected_APIs)} Translator APIs in parallel...")
    print("⏳ This will take 1-2 minutes.")
    print("\nℹ️  About API Errors:")
    print("  • 'NoneType is not iterable' = API returned empty/no data for this query")
    print("  • Common causes: API has no edges for these genes/predicates, or API is down")
    print("  • This is NORMAL - we only need some APIs to succeed\n")
    
    # Execute parallel queries - matching PathFinder pattern
    # Limit max_workers to 5 to avoid network saturation
    query_results = translator_query.parallel_api_query(
        query_json=query_json,
        select_APIs=selected_APIs,
        APInames=APInames,
        API_predicates=API_predicates,
        max_workers=min(5, len(selected_APIs))
    )
    
    # Save raw results
    output_file = DATA_DIR / "raw" / f"tct_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    
    # Convert to serializable format
    results_to_save = {}
    for k, v in query_results.items():
        if isinstance(v, dict):
            results_to_save[k] = v
    
    with open(output_file, 'w') as f:
        json.dump(results_to_save, f, indent=2)
    
    # Count the results that parsed as edge dictionaries.
    # Per-API success tracking would need source information on each edge,
    # which is not collected here.
    successful = sum(1 for v in query_results.values() if isinstance(v, dict))
    
    print(f"\n{'='*60}")
    print("QUERY RESULTS")
    print(f"{'='*60}")
    print(f"APIs queried: {len(selected_APIs)}")
    print(f"Edges found: {len(results_to_save)}")
    print(f"Results saved: {output_file}")
    print(f"\nInterpretation:")
    print(f"  ✓ Found {len(results_to_save):,} edges from successful APIs")
    print(f"  ℹ️  Failed APIs returned no usable data (no matching edges, or the API was unavailable)")
    print(f"  ✓ This is sufficient data for analysis and visualization")
    print(f"{'='*60}\n")

Creating TRAPI query for gene neighborhood discovery...
Strategy: Find all entities connected to COVID-19 genes

✓ Query created:
  Input genes: 10
  Query type: Neighborhood discovery (find all connections)
  Predicates: 39

Querying 15 Translator APIs in parallel...
⏳ This will take 1-2 minutes.

ℹ️  About API Errors:
  • 'NoneType is not iterable' = API returned empty/no data for this query
  • Common causes: API has no edges for these genes/predicates, or API is down
  • This is NORMAL - we only need some APIs to succeed

'Automat-ehr-clinical-connections-kp(Trapi v1.5.0)' generated an exception: argument of type 'NoneType' is not iterable
Automat-ubergraph(Trapi v1.5.0): Success!
'Automat-ctd(Trapi v1.5.0)' generated an exception: argument of type 'NoneType' is not iterable
'Genetics Data Provider for NCATS Biomedical Translator Reasoners' generated an exception: argument of type 'NoneType' is not iterable
'Automat-ehr-may-treat-kp(Trapi v1.5.0)' generated an excepti
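
For orientation, a one-hop TRAPI query with pinned subject nodes and an unpinned object node generally has the shape sketched here. This is an illustration of the TRAPI message layout, not the exact dictionary that `TCT.format_query_json` produces, which may differ. The helper name `build_one_hop_query` and the node and edge keys `n0`, `n1`, `e0` are hypothetical.

```python
import json
from typing import Dict, List


def build_one_hop_query(
    subject_ids: List[str],
    subject_categories: List[str],
    object_categories: List[str],
    predicates: List[str],
) -> Dict:
    """Build a one-hop query graph: pinned subjects, any object of a category."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    # Pinned node: the input gene CURIEs.
                    "n0": {"ids": subject_ids, "categories": subject_categories},
                    # Unpinned node: no 'ids', so any matching neighbor qualifies.
                    "n1": {"categories": object_categories},
                },
                "edges": {
                    "e0": {"subject": "n0", "object": "n1", "predicates": predicates}
                },
            }
        }
    }


query = build_one_hop_query(
    subject_ids=["NCBIGene:1803", "NCBIGene:6776"],
    subject_categories=["biolink:Gene"],
    object_categories=["biolink:Disease"],
    predicates=["biolink:related_to"],
)
print(json.dumps(query, indent=2))
```

Leaving `ids` off the object node corresponds to passing an empty target list in the cell above: the query asks for neighbors of a category, not for a specific target.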

## 5. Parse Query Results

Extract subject-predicate-object triples from TCT results.

In [84]:
def parse_tct_results_to_dataframe(result_dict: Dict, gene_curies: List[str]) -> pd.DataFrame:
    """Parse TCT query results into a DataFrame of edges.
    
    Args:
        result_dict: Dictionary from translator_query.parallel_api_query()
        gene_curies: List of input gene CURIEs (currently unused; no filtering is applied)
        
    Returns:
        DataFrame with columns: Subject, SubjectCURIE, Object, ObjectCURIE, Predicate
    """
    if not result_dict:
        print("⚠️  No results to parse")
        return pd.DataFrame()
    
    print(f"Parsing {len(result_dict)} result entries...")
    
    # Extract edges
    subjects = []
    objects = []
    predicates = []
    
    for k, v in result_dict.items():
        if isinstance(v, dict):
            subjects.append(v.get('subject', ''))
            objects.append(v.get('object', ''))
            predicates.append(v.get('predicate', ''))
    
    if not subjects:
        print("⚠️  No edges found in results")
        return pd.DataFrame()
    
    # Get unique nodes for name lookup
    unique_nodes = list(set(subjects + objects))
    print(f"Looking up names for {len(unique_nodes)} unique nodes...")
    
    node_info_dict = name_resolver.batch_lookup(unique_nodes)
    
    # Map CURIEs to names
    subject_names = []
    object_names = []
    
    for subj, obj in zip(subjects, objects):
        subj_info = node_info_dict.get(subj)
        obj_info = node_info_dict.get(obj)
        
        subject_names.append(subj_info.name if subj_info and hasattr(subj_info, 'name') else subj)
        object_names.append(obj_info.name if obj_info and hasattr(obj_info, 'name') else obj)
    
    # Create DataFrame
    df = pd.DataFrame({
        'Subject': subject_names,
        'SubjectCURIE': subjects,
        'Object': object_names,
        'ObjectCURIE': objects,
        'Predicate': predicates
    })
    
    # Remove duplicates
    df = df.drop_duplicates()
    
    print(f"✓ Parsed {len(df)} unique edges")
    
    return df

# Parse results if available
if 'query_results' in globals() and query_results:
    edges_df = parse_tct_results_to_dataframe(query_results, input_gene_curies)
    
    if len(edges_df) > 0:
        print("\nSample edges:")
        display(edges_df[['Subject', 'Predicate', 'Object']].head(10))
else:
    edges_df = pd.DataFrame()
    print("⚠️  No query results available to parse")

Parsing 1537 result entries...
Looking up names for 476 unique nodes...
✓ Parsed 812 unique edges

Sample edges:


|   | Subject | Predicate | Object |
|---|---------|-----------|--------|
| 0 | MONDO:0008903 | biolink:regulates | NCBIGene:10581 |
| 1 | MONDO:0005101 | biolink:regulates | NCBIGene:10581 |
| 2 | MONDO:0012268 | biolink:regulates | NCBIGene:10581 |
| 3 | MONDO:0005011 | biolink:regulates | NCBIGene:10581 |
| 4 | MONDO:0004892 | biolink:related_to | NCBIGene:10581 |
| 5 | MONDO:0005105 | biolink:regulates | NCBIGene:1803 |
| 6 | MONDO:0005350 | biolink:regulates | NCBIGene:1803 |
| 7 | MONDO:0011122 | biolink:related_to | NCBIGene:1803 |
| 8 | MONDO:0005148 | biolink:related_to | NCBIGene:1803 |
| 9 | MONDO:0005090 | biolink:related_to | NCBIGene:1803 |
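
The parsing step reduces to flattening a dictionary of edge records into rows and dropping repeats, which is why the number of unique edges can be smaller than the number of result entries. A compact sketch with pandas follows. The dictionary keys (`edge-1` and so on) and the `None` entry are invented for illustration; the triples reuse CURIEs from the sample output above.

```python
import pandas as pd

# Toy result dictionary: two identical triples, one distinct triple, and one
# entry that is not an edge record.
toy_results = {
    "edge-1": {"subject": "MONDO:0005101", "predicate": "biolink:regulates", "object": "NCBIGene:10581"},
    "edge-2": {"subject": "MONDO:0005101", "predicate": "biolink:regulates", "object": "NCBIGene:10581"},
    "edge-3": {"subject": "MONDO:0005148", "predicate": "biolink:related_to", "object": "NCBIGene:1803"},
    "edge-4": None,
}

# Keep only dictionary values, mirroring the isinstance check in the parser.
edge_records = [v for v in toy_results.values() if isinstance(v, dict)]

triples = (
    pd.DataFrame(edge_records, columns=["subject", "predicate", "object"])
    .drop_duplicates()
    .reset_index(drop=True)
)
print(triples)
```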


## 6. Interactive Graph Visualization

Visualize the knowledge graph with ipycytoscape.

In [85]:
def visualize_knowledge_graph(edges_df: pd.DataFrame, title: str = "Knowledge Graph", 
                              max_edges: int = 50) -> Optional[Any]:
    """Create interactive Cytoscape visualization from edge DataFrame.
    
    Args:
        edges_df: DataFrame with Subject, Object, Predicate columns
        title: Graph title
        max_edges: Maximum edges to visualize (default 50 to avoid overwhelming browser)
        
    Returns:
        CytoscapeWidget or None if visualization unavailable
    """
    if not CYTOSCAPE_AVAILABLE:
        print("⚠️  ipycytoscape not available - install with: pip install ipycytoscape")
        return None
    
    if len(edges_df) == 0:
        print("⚠️  No edges to visualize")
        return None
    
    # Sample edges - ENSURE ALL QUERY GENES ARE REPRESENTED!
    if len(edges_df) > max_edges:
        print(f"⚠️  Dataset has {len(edges_df)} edges - sampling {max_edges} for visualization")
        print(f"   Strategy: Include at least one edge per query gene")
        
        # Build full graph to analyze connectivity
        G_full = nx.from_pandas_edgelist(
            edges_df,
            source='SubjectCURIE',
            target='ObjectCURIE',
            create_using=nx.DiGraph
        )
        
        # Get degree (connectivity) for all nodes
        degrees = dict(G_full.degree())
        
        # Identify which input genes are in the graph
        input_genes_in_graph = [g for g in input_gene_curies if g in G_full.nodes()]
        
        print(f"   Found {len(input_genes_in_graph)}/{len(input_gene_curies)} query genes in results")
        
        # COVID-19 disease node
        covid_curie = "MONDO:0100096"
        
        # STEP 1: Get at least one edge for EACH query gene
        edges_per_gene = []
        for gene_curie in input_genes_in_graph:
            gene_edges = edges_df[
                (edges_df['SubjectCURIE'] == gene_curie) | 
                (edges_df['ObjectCURIE'] == gene_curie)
            ]
            
            if len(gene_edges) > 0:
                # Score each edge by the combined degree of its two endpoints
                gene_edges = gene_edges.copy()
                gene_edges['score'] = gene_edges.apply(
                    lambda row: degrees.get(row['SubjectCURIE'], 0) + degrees.get(row['ObjectCURIE'], 0),
                    axis=1
                )
                # Take top 2 edges per gene to show some context
                top_edges = gene_edges.nlargest(min(2, len(gene_edges)), 'score')
                edges_per_gene.append(top_edges.drop('score', axis=1))
        
        required_edges = pd.concat(edges_per_gene) if edges_per_gene else pd.DataFrame()
        
        print(f"   Selected {len(required_edges)} edges covering query genes")
        
        # STEP 2: Fill remaining budget with high-degree edges
        remaining_budget = max_edges - len(required_edges)
        
        if remaining_budget > 0:
            # Get other high-connectivity edges not already selected
            other_edges = edges_df[~edges_df.index.isin(required_edges.index)].copy()
            
            # Prioritize edges involving query genes or COVID-19
            priority_edges = other_edges[
                other_edges['SubjectCURIE'].isin(input_genes_in_graph + [covid_curie]) | 
                other_edges['ObjectCURIE'].isin(input_genes_in_graph + [covid_curie])
            ].copy()  # copy so the score column is not assigned on a slice of other_edges
            
            if len(priority_edges) > 0:
                priority_edges['score'] = priority_edges.apply(
                    lambda row: degrees.get(row['SubjectCURIE'], 0) + degrees.get(row['ObjectCURIE'], 0),
                    axis=1
                )
                additional = priority_edges.nlargest(min(remaining_budget, len(priority_edges)), 'score')
                edges_to_plot = pd.concat([required_edges, additional.drop('score', axis=1)])
            else:
                edges_to_plot = required_edges
        else:
            edges_to_plot = required_edges[:max_edges]
        
        # Count coverage
        genes_in_viz = set()
        for _, row in edges_to_plot.iterrows():
            if row['SubjectCURIE'] in input_gene_curies:
                genes_in_viz.add(row['SubjectCURIE'])
            if row['ObjectCURIE'] in input_gene_curies:
                genes_in_viz.add(row['ObjectCURIE'])
        
        print(f"   Final: {len(edges_to_plot)} edges showing {len(genes_in_viz)}/{len(input_gene_curies)} query genes")
    else:
        edges_to_plot = edges_df
    
    print(f"\nBuilding graph with {len(edges_to_plot)} edges...")
    
    # Build NetworkX graph
    G = nx.DiGraph()
    
    for _, row in edges_to_plot.iterrows():
        G.add_edge(row['Subject'], row['Object'])
    
    print(f"  Nodes: {G.number_of_nodes()}")
    print(f"  Edges: {G.number_of_edges()}")
    
    # Add node attributes
    input_gene_names = dict(zip(normalized_genes_df['curie'], normalized_genes_df['original']))
    covid_curie = "MONDO:0100096"
    
    for node in G.nodes():
        # Find CURIE for this node
        node_curie = None
        matching_rows = edges_to_plot[
            (edges_to_plot['Subject'] == node) | (edges_to_plot['Object'] == node)
        ]
        
        if len(matching_rows) > 0:
            row = matching_rows.iloc[0]
            node_curie = row['SubjectCURIE'] if row['Subject'] == node else row['ObjectCURIE']
        
        # Classify node
        if node_curie in input_gene_curies:
            G.nodes[node]['node_type'] = 'de_gene'
        elif node_curie == covid_curie or 'COVID' in str(node).upper():
            G.nodes[node]['node_type'] = 'disease'
        else:
            G.nodes[node]['node_type'] = 'other'
    
    # Count node types
    de_gene_count = sum(1 for n in G.nodes() if G.nodes[n].get('node_type') == 'de_gene')
    disease_count = sum(1 for n in G.nodes() if G.nodes[n].get('node_type') == 'disease')
    other_count = G.number_of_nodes() - de_gene_count - disease_count
    
    print(f"  🔴 {de_gene_count} DE genes | 💜 {disease_count} diseases | 🔵 {other_count} other")
    
    # Simplified styling
    graph_style = [
        {
            'selector': 'node[node_type = "de_gene"]',
            'style': {
                'background-color': '#E74C3C',
                'shape': 'rectangle',
                'label': 'data(id)',
                'font-size': '9px',
                'width': '40px',
                'height': '40px',
                'color': '#FFF'
            }
        },
        {
            'selector': 'node[node_type = "disease"]',
            'style': {
                'background-color': '#9B59B6',
                'shape': 'diamond',
                'label': 'data(id)',
                'font-size': '10px',
                'width': '50px',
                'height': '50px',
                'color': '#FFF'
            }
        },
        {
            'selector': 'node[node_type = "other"]',
            'style': {
                'background-color': '#3498DB',
                'shape': 'ellipse',
                'label': 'data(id)',
                'font-size': '7px',
                'width': '25px',
                'height': '25px'
            }
        },
        {
            'selector': 'edge',
            'style': {
                'curve-style': 'bezier',
                'target-arrow-shape': 'triangle',
                'line-color': '#DDD',
                'width': 1,
                'opacity': 0.4
            }
        }
    ]
    
    print(f"\nRendering visualization...")
    
    # Create widget
    cyto_graph = ipycytoscape.CytoscapeWidget()
    cyto_graph.graph.add_graph_from_networkx(G, directed=True)
    cyto_graph.set_style(graph_style)
    
    # Circle layout (fast, user can rearrange)
    cyto_graph.set_layout(name='circle')
    
    print(f"✓ Graph ready!")
    print(f"  💡 Drag nodes to rearrange • Scroll to zoom")
    
    return cyto_graph

# Create visualization if edges available
if 'edges_df' in globals() and len(edges_df) > 0:
    print(f"\n{'='*60}")
    print("INTERACTIVE VISUALIZATION")
    print(f"{'='*60}\n")
    
    cyto_widget = visualize_knowledge_graph(
        edges_df, 
        title="COVID-19 Gene Network",
        max_edges=50
    )
    
    if cyto_widget:
        display(cyto_widget)
else:
    print("⚠️  No edges available for visualization")


INTERACTIVE VISUALIZATION

⚠️  Dataset has 812 edges - sampling 50 for visualization
   Strategy: Include at least one edge per query gene
   Found 5/10 query genes in results
   Selected 10 edges covering query genes
   Final: 50 edges showing 5/10 query genes

Building graph with 50 edges...
  Nodes: 30
  Edges: 36
  🔴 5 DE genes | 💜 1 diseases | 🔵 24 other

Rendering visualization...
✓ Graph ready!
  💡 Drag nodes to rearrange • Scroll to zoom


CytoscapeWidget(cytoscape_layout={'name': 'circle'}, cytoscape_style=[{'selector': 'node[node_type = "de_gene"…
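
The sampling strategy used by `visualize_knowledge_graph` (cover every query gene first, then spend the remaining budget on high-degree edges) can be shown on a small table. This is a simplified sketch under stated assumptions: the node names, `max_edges`, and `per_gene` values are invented, and unlike the function above the fill step draws from all remaining edges instead of only those touching a query gene or COVID-19.

```python
import pandas as pd

edges = pd.DataFrame(
    [("G1", "D1"), ("G1", "D2"), ("G1", "D3"), ("G2", "D1"), ("X1", "D1"), ("X2", "D4")],
    columns=["source", "target"],
)
query_genes = ["G1", "G2"]
max_edges = 4   # total edge budget for the plot
per_gene = 1    # edges guaranteed to each query gene

# Score each edge by the combined degree of its endpoints.
degree = pd.concat([edges["source"], edges["target"]]).value_counts()
edges["score"] = edges["source"].map(degree) + edges["target"].map(degree)

# Step 1: reserve the top-scoring edge(s) for every query gene.
reserved = []
for gene in query_genes:
    touching = edges[(edges["source"] == gene) | (edges["target"] == gene)]
    reserved.append(touching.nlargest(per_gene, "score"))
required = pd.concat(reserved)
required = required[~required.index.duplicated()]  # an edge may touch two genes

# Step 2: fill what is left of the budget with the best remaining edges.
budget = max(0, max_edges - len(required))
fill = edges.drop(required.index).nlargest(budget, "score")

sample = pd.concat([required, fill])
print(sample)
```

Reserving edges before filling is what keeps a low-degree query gene such as `G2` in the plot even when better-connected nodes would otherwise use up the budget.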

## 7. Graph Statistics & Analysis

In [86]:
if 'edges_df' in globals() and len(edges_df) > 0:
    print(f"\n{'='*60}")
    print("KNOWLEDGE GRAPH STATISTICS")
    print(f"{'='*60}\n")
    
    # Build graph for analysis
    G = nx.from_pandas_edgelist(
        edges_df,
        source='Subject',
        target='Object',
        create_using=nx.DiGraph
    )
    
    print(f"Nodes: {G.number_of_nodes()}")
    print(f"Edges: {G.number_of_edges()}")
    print(f"Density: {nx.density(G):.4f}")
    
    if nx.is_weakly_connected(G):
        print(f"Connected: Yes (single component)")
    else:
        num_components = nx.number_weakly_connected_components(G)
        print(f"Connected: No ({num_components} components)")
    
    # Top predicates
    print(f"\nTop 10 Predicates:")
    predicate_counts = edges_df['Predicate'].value_counts()
    for pred, count in predicate_counts.head(10).items():
        pred_short = pred.replace('biolink:', '')
        print(f"  {pred_short}: {count}")
    
    # Node degree analysis
    degree_dict = dict(G.degree())
    if degree_dict:
        top_nodes = sorted(degree_dict.items(), key=lambda x: x[1], reverse=True)[:10]
        print(f"\nTop 10 Highly Connected Nodes:")
        for node, degree in top_nodes:
            print(f"  {node}: {degree} connections")
    
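    # Input genes in results.
    # Graph nodes are labeled with resolved names, falling back to CURIEs, so
    # comparing gene symbols against G.nodes() can report 0 matches (as in the
    # recorded output below). Compare CURIEs against the CURIE columns instead.
    input_genes_in_graph = set(normalized_genes_df['curie'].tolist())
    curies_in_edges = set(edges_df['SubjectCURIE']) | set(edges_df['ObjectCURIE'])
    genes_found = sorted(g for g in input_genes_in_graph if g in curies_in_edges)
    
    print(f"\nInput Genes Found in Graph: {len(genes_found)}/{len(input_genes_in_graph)}")
    if genes_found:
        print(f"  {', '.join(genes_found)}")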
    
    print(f"\n{'='*60}\n")
else:
    print("\n⚠️  No graph data available for statistics")


KNOWLEDGE GRAPH STATISTICS

Nodes: 476
Edges: 723
Density: 0.0032
Connected: Yes (single component)

Top 10 Predicates:
  occurs_together_in_literature_with: 425
  condition_associated_with_gene: 165
  genetically_associated_with: 137
  related_to: 38
  regulates: 13
  has_target: 13
  associated_with: 7
  has_biomarker: 7
  correlated_with: 2
  target_for: 1

Top 10 Highly Connected Nodes:
  NCBIGene:6776: 376 connections
  NCBIGene:1803: 158 connections
  NCBIGene:10410: 68 connections
  NCBIGene:9235: 62 connections
  NCBIGene:10581: 59 connections
  MONDO:0004992: 6 connections
  MONDO:0008903: 5 connections
  MONDO:0005010: 5 connections
  MONDO:0100096: 5 connections
  MONDO:0011057: 5 connections

Input Genes Found in Graph: 0/10
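
The statistics report 723 graph edges although 812 unique rows were parsed, and the visualization shows 36 edges from 50 sampled rows. One likely contributor is that `nx.DiGraph` stores a single edge per ordered node pair, so rows that differ only in predicate collapse into one edge. The sketch below demonstrates that behavior on toy identifiers (the `EX:` CURIEs are invented) and shows `nx.MultiDiGraph` as an alternative that keeps one edge per predicate.

```python
import networkx as nx

# Two rows share the same node pair and differ only in predicate.
toy_edges = [
    ("EX:disease1", "EX:gene1", "biolink:related_to"),
    ("EX:disease1", "EX:gene1", "biolink:regulates"),
    ("EX:disease2", "EX:gene1", "biolink:regulates"),
]

simple = nx.DiGraph()
multi = nx.MultiDiGraph()
for subj, obj, pred in toy_edges:
    simple.add_edge(subj, obj, predicate=pred)           # overwrites an existing pair
    multi.add_edge(subj, obj, key=pred, predicate=pred)  # one edge per predicate

print(simple.number_of_edges())
print(multi.number_of_edges())
```

Whether to keep parallel edges depends on the analysis: density and degree computed on a `DiGraph` count connected pairs, while a `MultiDiGraph` counts individual assertions.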


