# Long COVID Gene Discovery

Find all genes associated with **Long COVID** (MONDO:0100233) using the NCATS Translator knowledge graph system.

## Query Pattern
```
Long COVID Disease → [predicates] → Genes
```

## What This Notebook Does
1. Loads Translator MetaKG to discover available predicates
2. Builds a 1-hop TRAPI query (Disease → Gene)
3. Queries multiple Translator APIs in parallel
4. Returns all genes associated with Long COVID

## Expected Results
- **API Success Rate**: 40-60% is normal (not all knowledge providers have Long COVID data)
- **Output**: DataFrame of genes with their predicates and sources

In [1]:
# Setup & Dependencies
import json
import pandas as pd
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any, Optional, Tuple

# TCT imports with graceful fallback
try:
    from TCT import TCT
    from TCT import name_resolver
    from TCT import translator_metakg
    from TCT import translator_kpinfo
    from TCT import translator_query
    TCT_AVAILABLE = True
    print("✓ TCT library loaded successfully")
except ImportError as e:
    print(f"✗ TCT not installed. Run: pip install TCT")
    print(f"  Error: {e}")
    TCT_AVAILABLE = False

# Data directory setup
DATA_DIR = Path("../data")
CACHE_DIR = DATA_DIR / "cache"
CACHE_DIR.mkdir(parents=True, exist_ok=True)

print(f"\n✓ Setup complete")
print(f"  TCT available: {TCT_AVAILABLE}")
print(f"  Cache directory: {CACHE_DIR}")

✓ TCT library loaded successfully

✓ Setup complete
  TCT available: True
  Cache directory: ../data/cache


In [2]:
# Load Translator Resources (MetaKG)
if not TCT_AVAILABLE:
    raise RuntimeError("TCT library required. Install with: pip install TCT")

print("Loading Translator resources...")
print("(This may take 1-2 minutes - some APIs may timeout, which is normal)\n")

try:
    # Primary loading method
    APInames, metaKG, Translator_KP_info = translator_metakg.load_translator_resources()
    print(f"\n✓ Successfully loaded resources:")
    print(f"  APIs available: {len(APInames)}")
    print(f"  MetaKG edges: {len(metaKG):,}")
    
except Exception as e:
    print(f"\n⚠️  Primary loading failed: {e}")
    print("Attempting fallback (loading without Plover APIs)...")
    
    # Fallback: load components separately
    Translator_KP_info, APInames = translator_kpinfo.get_translator_kp_info()
    metaKG = translator_metakg.get_KP_metadata(APInames)
    
    print(f"\n✓ Loaded resources (fallback mode):")
    print(f"  APIs available: {len(APInames)}")
    print(f"  MetaKG edges: {len(metaKG):,}")

Loading Translator resources...
(This may take 1-2 minutes - some APIs may timeout, which is normal)


✓ Successfully loaded resources:
  APIs available: 58
  MetaKG edges: 24,036


In [3]:
# Discover Available Predicates for Disease → Gene queries

# Get all predicates that connect Disease to Gene
all_predicates = list(set(
    TCT.select_concept(
        sub_list=["biolink:Disease"],
        obj_list=["biolink:Gene"],
        metaKG=metaKG
    )
))

print(f"Found {len(all_predicates)} predicates for Disease → Gene queries\n")

# Display as DataFrame for easy viewing
predicates_df = pd.DataFrame({
    'Predicate': all_predicates,
    'Short Name': [p.replace('biolink:', '') for p in all_predicates]
}).sort_values('Short Name')

display(predicates_df)

# Helper function to filter predicates by pattern
def filter_predicates(pattern: str, predicates: List[str] = all_predicates) -> List[str]:
    """Filter predicates by substring match (case-insensitive).
    
    Examples:
        filter_predicates('associated')  # predicates containing 'associated'
        filter_predicates('gene')        # predicates containing 'gene'
    """
    pattern_lower = pattern.lower()
    return [p for p in predicates if pattern_lower in p.lower()]

print("\n💡 Use filter_predicates('pattern') to find specific predicates")
print(f"   Example: filter_predicates('associated') → {filter_predicates('associated')[:3]}...")

Found 52 predicates for Disease → Gene queries



Unnamed: 0,Predicate,Short Name
36,biolink:actively_involved_in,actively_involved_in
48,biolink:affected_by,affected_by
25,biolink:affects,affects
9,biolink:associated_with,associated_with
33,biolink:associated_with_decreased_likelihood_of,associated_with_decreased_likelihood_of
27,biolink:associated_with_increased_likelihood_of,associated_with_increased_likelihood_of
16,biolink:associated_with_resistance_to,associated_with_resistance_to
39,biolink:associated_with_sensitivity_to,associated_with_sensitivity_to
38,biolink:biomarker_for,biomarker_for
40,biolink:caused_by,caused_by



💡 Use filter_predicates('pattern') to find specific predicates
   Example: filter_predicates('associated') → ['biolink:associated_with', 'biolink:associated_with_resistance_to', 'biolink:associated_with_increased_likelihood_of']...
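
Predicate discovery can be pictured as a filter over MetaKG rows. The sketch below is a stand-in for illustration, not TCT's implementation. It assumes each row carries `Subject`, `Object`, `Predicate` and `API` fields; only `API` and `Predicate` appear as column names in this notebook, so the other two names, the helper `predicates_between`, and all sample rows are assumptions.

```python
from typing import Dict, List

# Invented MetaKG rows for illustration; real rows come from the loaded metaKG.
SAMPLE_METAKG: List[Dict[str, str]] = [
    {"API": "kp-a", "Subject": "biolink:Disease", "Object": "biolink:Gene",
     "Predicate": "biolink:related_to"},
    {"API": "kp-b", "Subject": "biolink:Disease", "Object": "biolink:Gene",
     "Predicate": "biolink:related_to"},
    {"API": "kp-b", "Subject": "biolink:Disease", "Object": "biolink:Gene",
     "Predicate": "biolink:associated_with"},
    {"API": "kp-c", "Subject": "biolink:Gene", "Object": "biolink:Disease",
     "Predicate": "biolink:gene_associated_with_condition"},
    {"API": "kp-c", "Subject": "biolink:Disease", "Object": "biolink:SmallMolecule",
     "Predicate": "biolink:related_to"},
]


def predicates_between(rows: List[Dict[str, str]], subject: str, obj: str) -> List[str]:
    """Return the sorted, de-duplicated predicates linking subject -> obj."""
    return sorted(
        {r["Predicate"] for r in rows if r["Subject"] == subject and r["Object"] == obj}
    )


disease_gene_predicates = predicates_between(
    SAMPLE_METAKG, "biolink:Disease", "biolink:Gene"
)
print(disease_gene_predicates)
```

Because the filter is direction-sensitive, a predicate that is only reported for Gene → Disease edges does not show up in a Disease → Gene listing. That is worth keeping in mind when choosing predicates in the configuration cell.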


In [5]:
# ╔═══════════════════════════════════════════════════════════════════════╗
# ║                         USER CONFIGURATION                            ║
# ╚═══════════════════════════════════════════════════════════════════════╝

# Disease to query
DISEASE_CURIE = "MONDO:0100233"  # Long COVID
DISEASE_NAME = "Long COVID"

# ─────────────────────────────────────────────────────────────────────────
# PREDICATE OPTIONS (uncomment ONE line to customize):
# ─────────────────────────────────────────────────────────────────────────

# DEFAULT: Use ALL predicates (most comprehensive results)
selected_predicates = None

# OPTION 1: Use a specific predicate
# selected_predicates = ["biolink:gene_associated_with_condition"]

# OPTION 2: Use multiple specific predicates
# selected_predicates = ["biolink:gene_associated_with_condition", "biolink:related_to"]

# OPTION 3: Filter predicates by pattern (e.g., all containing 'associated')
# selected_predicates = filter_predicates("associated")

# ─────────────────────────────────────────────────────────────────────────

print(f"Query Configuration:")
print(f"  Disease: {DISEASE_NAME} ({DISEASE_CURIE})")
if selected_predicates is None:
    print(f"  Predicates: ALL ({len(all_predicates)} available)")
else:
    print(f"  Predicates: {len(selected_predicates)} selected")
    for p in selected_predicates[:5]:
        print(f"    • {p.replace('biolink:', '')}")
    if len(selected_predicates) > 5:
        print(f"    ... and {len(selected_predicates) - 5} more")

Query Configuration:
  Disease: Long COVID (MONDO:0100233)
  Predicates: ALL (52 available)
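
A hand-typed predicate list may contain a name that the MetaKG did not report for this query direction, and such a mistake may only surface later as a query with few or no results. The sketch below shows one way to check the selection up front. The helper name `resolve_predicates` and the sample lists are hypothetical and not part of TCT.

```python
from typing import List, Optional


def resolve_predicates(selected: Optional[List[str]], available: List[str]) -> List[str]:
    """Return the predicates to query, rejecting names missing from `available`."""
    if selected is None:
        # Mirrors the notebook's default: None means "use everything available".
        return sorted(available)
    unknown = sorted(set(selected) - set(available))
    if unknown:
        raise ValueError(f"Predicates not available for this query: {unknown}")
    # dict.fromkeys drops duplicates while keeping the user's order.
    return list(dict.fromkeys(selected))


# Invented list standing in for all_predicates.
available_demo = ["biolink:related_to", "biolink:associated_with", "biolink:affects"]

print(resolve_predicates(None, available_demo))
print(resolve_predicates(["biolink:affects", "biolink:affects"], available_demo))

try:
    resolve_predicates(["biolink:relatd_to"], available_demo)  # deliberate typo
except ValueError as err:
    print(err)
```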


In [6]:
# Build 1-Hop Query (abstracts away JSON complexity)

def build_1hop_query(
    source_curie: str,
    source_category: str,
    target_category: str,
    predicates: Optional[List[str]] = None,
    metaKG: pd.DataFrame = metaKG
) -> Tuple[Dict, List[str], Dict]:
    """Build a 1-hop TRAPI query.
    
    Args:
        source_curie: The source entity CURIE (e.g., 'MONDO:0100233')
        source_category: Biolink category (e.g., 'biolink:Disease')
        target_category: Biolink category to find (e.g., 'biolink:Gene')
        predicates: List of predicates to use, or None for auto-discovery
        metaKG: MetaKG DataFrame
        
    Returns:
        Tuple of (query_json, selected_APIs, metadata)
    """
    # Auto-discover predicates if not specified
    if predicates is None:
        predicates = list(set(
            TCT.select_concept(
                sub_list=[source_category],
                obj_list=[target_category],
                metaKG=metaKG
            )
        ))
    
    # Select APIs capable of answering this query
    selected_APIs = TCT.select_API(
        sub_list=[source_category],
        obj_list=[target_category],
        metaKG=metaKG
    )
    
    # Build TRAPI query_graph structure
    query_json = {
        'message': {
            'query_graph': {
                'nodes': {
                    'n00': {
                        'ids': [source_curie],
                        'categories': [source_category]
                    },
                    'n01': {
                        'categories': [target_category]
                    }
                },
                'edges': {
                    'e00': {
                        'subject': 'n00',
                        'object': 'n01',
                        'predicates': predicates
                    }
                }
            }
        }
    }
    
    metadata = {
        'source_curie': source_curie,
        'source_category': source_category,
        'target_category': target_category,
        'predicate_count': len(predicates),
        'api_count': len(selected_APIs)
    }
    
    return query_json, selected_APIs, metadata

# Build the query
query_json, selected_APIs, query_metadata = build_1hop_query(
    source_curie=DISEASE_CURIE,
    source_category="biolink:Disease",
    target_category="biolink:Gene",
    predicates=selected_predicates
)

print(f"✓ Query built successfully")
print(f"")
print(f"  Query: {DISEASE_NAME} → Genes")
print(f"  Predicates: {query_metadata['predicate_count']}")
print(f"  APIs available: {query_metadata['api_count']}")
print(f"")
print(f"  Sample APIs: {selected_APIs[:3]}")

✓ Query built successfully

  Query: Long COVID → Genes
  Predicates: 52
  APIs available: 19

  Sample APIs: ['CATRAX BigGIM DrugResponse Performance Phase KP - TRAPI 1.5.0', 'Automat-hetionet(Trapi v1.5.0)', 'Automat-icees-kg(Trapi v1.5.0)']
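
Before sending a query graph to many APIs, it can help to check that it is internally consistent. The sketch below checks two things that matter for this notebook's 1-hop pattern: every edge refers to nodes that exist, and at least one node is pinned to a CURIE through `ids`. The helper `check_query_graph` is hypothetical, and the second check is a convention of this notebook rather than a rule taken from the TRAPI specification.

```python
import copy
from typing import Any, Dict, List


def check_query_graph(query: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []
    graph = query.get("message", {}).get("query_graph", {})
    nodes = graph.get("nodes", {})
    edges = graph.get("edges", {})

    if not any(node.get("ids") for node in nodes.values()):
        problems.append("no node is pinned to a CURIE via 'ids'")

    for edge_id, edge in edges.items():
        for end in ("subject", "object"):
            if edge.get(end) not in nodes:
                problems.append(
                    f"edge {edge_id}: {end} {edge.get(end)!r} is not a defined node"
                )
    return problems


# Same shape as the query built above, with a single predicate for brevity.
sample_query = {
    "message": {
        "query_graph": {
            "nodes": {
                "n00": {"ids": ["MONDO:0100233"], "categories": ["biolink:Disease"]},
                "n01": {"categories": ["biolink:Gene"]},
            },
            "edges": {
                "e00": {"subject": "n00", "object": "n01",
                        "predicates": ["biolink:related_to"]},
            },
        }
    }
}

broken_query = copy.deepcopy(sample_query)
broken_query["message"]["query_graph"]["edges"]["e00"]["object"] = "n02"

print(check_query_graph(sample_query))
print(check_query_graph(broken_query))
```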


In [7]:
# Execute Query with Diagnostics

def classify_error(error_msg: str) -> str:
    """Convert technical errors to user-friendly messages."""
    error_lower = str(error_msg).lower()
    
    if 'nonetype' in error_lower or 'none' in error_lower:
        return "No data for this query (normal)"
    elif 'timeout' in error_lower:
        return "API timed out (may be slow)"
    elif 'connection' in error_lower or 'connect' in error_lower:
        return "Could not reach API"
    elif 'empty' in error_lower or 'no results' in error_lower:
        return "No results found"
    else:
        return str(error_msg)[:50]  # Truncate long errors

# Build API predicates dictionary (required by translator_query)
API_predicates = {}
API_withMetaKG = list(set(metaKG['API']))
for api in API_withMetaKG:
    API_predicates[api] = list(set(metaKG[metaKG['API'] == api]['Predicate']))

print(f"Querying {len(selected_APIs)} Translator APIs in parallel...")
print(f"⏳ This will take 1-2 minutes.")
print(f"")
print(f"ℹ️  Note: 40-60% API success rate is NORMAL")
print(f"   Not all knowledge providers have Long COVID data.")
print(f"")

# Execute parallel queries
query_results = translator_query.parallel_api_query(
    query_json=query_json,
    select_APIs=selected_APIs,
    APInames=APInames,
    API_predicates=API_predicates,
    max_workers=min(5, len(selected_APIs))
)

print(f"\n✓ Query complete!")

Querying 19 Translator APIs in parallel...
⏳ This will take 1-2 minutes.

ℹ️  Note: 40-60% API success rate is NORMAL
   Not all knowledge providers have Long COVID data.

Automat-pharos(Trapi v1.5.0): Success!
RTX KG2 - TRAPI 1.5.0: Success!
Automat-robokop(Trapi v1.5.0): Success!
Service Provider TRAPI: Success!
BioThings Explorer (BTE) TRAPI: Success!

✓ Query complete!
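
The cell above defines `classify_error` but does not call it, so failed APIs are silent in the output. The sketch below shows one way a per-API outcome table could be collected with the standard library's `ThreadPoolExecutor`. It is not how `translator_query.parallel_api_query` works internally; `fake_query`, `classify`, `query_all` and the API names are all invented stand-ins, and no network call is made.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Union


def fake_query(api_name: str) -> List[dict]:
    """Hypothetical stand-in for a single API call."""
    if api_name == "slow-kp":
        raise TimeoutError("request timeout after 60s")
    if api_name == "down-kp":
        raise ConnectionError("connection refused")
    return [{"subject": "MONDO:0100233", "object": "NCBIGene:969"}]


def classify(exc: Exception) -> str:
    """Map an exception to a short, user-friendly label."""
    text = str(exc).lower()
    if "timeout" in text:
        return "API timed out"
    if "connect" in text:
        return "Could not reach API"
    return text[:50]


def query_all(
    api_names: List[str],
    query_fn: Callable[[str], List[dict]],
    max_workers: int = 5,
) -> Dict[str, Union[List[dict], str]]:
    """Run query_fn for every API; store edges on success, a label on failure."""
    outcomes: Dict[str, Union[List[dict], str]] = {}
    workers = max(1, min(max_workers, len(api_names)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(query_fn, name): name for name in api_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                outcomes[name] = future.result()
            except Exception as exc:  # keep going when one API fails
                outcomes[name] = classify(exc)
    return outcomes


outcomes = query_all(["good-kp", "slow-kp", "down-kp"], fake_query)
for name in sorted(outcomes):
    print(name, "->", outcomes[name])
```

Keeping failures in the same mapping as successes makes it straightforward to report which APIs timed out and which returned nothing.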


In [8]:
# Display Results Summary

# Count valid edges
edges = []
for k, v in query_results.items():
    if isinstance(v, dict) and 'subject' in v and 'object' in v:
        edges.append(v)

# Track which APIs returned data
successful_apis = set()
for edge in edges:
    for source in edge.get('sources', []):
        if isinstance(source, dict) and source.get('resource_role') == 'aggregator_knowledge_source':
            successful_apis.add(source.get('resource_id', 'unknown'))

print(f"{'='*60}")
print(f"QUERY RESULTS")
print(f"{'='*60}")
print(f"")
print(f"  Disease: {DISEASE_NAME} ({DISEASE_CURIE})")
print(f"  Target: Genes")
print(f"")
print(f"  APIs queried: {len(selected_APIs)}")
print(f"  APIs with data: {len(successful_apis)}")
print(f"  Success rate: {len(successful_apis)/len(selected_APIs)*100:.0f}%")
print(f"")
print(f"  Edges found: {len(edges):,}")
print(f"{'='*60}")

if len(edges) == 0:
    print(f"")
    print(f"⚠️  No edges found. Possible reasons:")
    print(f"   • Long COVID is a relatively new disease")
    print(f"   • Try different predicates (e.g., 'related_to')")
    print(f"   • Some knowledge providers may not have this data yet")

QUERY RESULTS

  Disease: Long COVID (MONDO:0100233)
  Target: Genes

  APIs queried: 19
  APIs with data: 7
  Success rate: 37%

  Edges found: 9
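
The "APIs with data" figure above is derived from the `resource_id` of sources whose role is `aggregator_knowledge_source`, not from the per-API status lines, so it need not match the number of "Success!" messages (five in this run, against seven reported here). The sketch below tallies sources by role on invented edges and computes the rate with a guard against an empty API list. The helper names and sample data are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List

# Invented edges in the shape the summary cell expects.
sample_edges: List[dict] = [
    {"subject": "MONDO:0100233", "object": "NCBIGene:969",
     "sources": [
         {"resource_id": "infores:kp-a", "resource_role": "primary_knowledge_source"},
         {"resource_id": "infores:agg-x", "resource_role": "aggregator_knowledge_source"},
     ]},
    {"subject": "MONDO:0100233", "object": "NCBIGene:55364",
     "sources": [
         {"resource_id": "infores:kp-a", "resource_role": "primary_knowledge_source"},
     ]},
]


def sources_by_role(edges: List[dict]) -> Dict[str, Counter]:
    """Count how often each resource_id appears, grouped by resource_role."""
    tally: Dict[str, Counter] = {}
    for edge in edges:
        for source in edge.get("sources", []):
            if not isinstance(source, dict):
                continue
            role = source.get("resource_role", "unknown")
            tally.setdefault(role, Counter())[source.get("resource_id", "unknown")] += 1
    return tally


def success_rate(n_with_data: int, n_queried: int) -> float:
    """Percentage of queried APIs that returned data; 0.0 when none were queried."""
    return 0.0 if n_queried == 0 else 100.0 * n_with_data / n_queried


role_tally = sources_by_role(sample_edges)
print(role_tally)
print(f"{success_rate(7, 19):.0f}%")
```

With the run's own numbers (7 of 19), the rate rounds to the 37% shown above.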


In [9]:
# Parse Results to DataFrame

if len(edges) == 0:
    print("No edges to parse.")
    genes_df = pd.DataFrame()
else:
    print(f"Parsing {len(edges)} edges...")
    
    # Extract edge data
    rows = []
    for edge in edges:
        rows.append({
            'Subject': edge.get('subject', ''),
            'Predicate': edge.get('predicate', ''),
            'Object': edge.get('object', ''),
            'Sources': ', '.join([s.get('resource_id', '') for s in edge.get('sources', []) if isinstance(s, dict)])
        })
    
    genes_df = pd.DataFrame(rows)
    
    # The 'Object' column is assumed to hold the gene (the query is Disease → Gene).
    # Edges returned in the reverse direction would put the disease CURIE here instead.
    genes_df = genes_df.rename(columns={'Object': 'Gene_CURIE', 'Subject': 'Disease_CURIE'})
    
    # Remove duplicates
    genes_df = genes_df.drop_duplicates()
    
    # Get unique gene CURIEs for name lookup
    unique_genes = genes_df['Gene_CURIE'].unique().tolist()
    print(f"Looking up names for {len(unique_genes)} unique genes...")
    
    # Lookup gene names
    gene_info = name_resolver.batch_lookup(unique_genes)
    
    # Map CURIE to name
    curie_to_name = {}
    for curie in unique_genes:
        if curie in gene_info:
            info = gene_info[curie]
            curie_to_name[curie] = info.name if hasattr(info, 'name') and info.name else curie
        else:
            curie_to_name[curie] = curie
    
    genes_df['Gene_Name'] = genes_df['Gene_CURIE'].map(curie_to_name)
    genes_df['Predicate_Short'] = genes_df['Predicate'].str.replace('biolink:', '')
    
    # Reorder columns
    genes_df = genes_df[['Gene_Name', 'Gene_CURIE', 'Predicate_Short', 'Disease_CURIE', 'Sources']]
    
    print(f"\n✓ Found {len(genes_df)} gene associations ({genes_df['Gene_CURIE'].nunique()} unique genes)")
    print(f"")
    
    # Show top genes by frequency
    print(f"Top 20 Genes (by number of associations):")
    gene_counts = genes_df.groupby(['Gene_Name', 'Gene_CURIE']).size().reset_index(name='Count')
    gene_counts = gene_counts.sort_values('Count', ascending=False)
    display(gene_counts.head(20))
    
    print(f"\nPredicates used:")
    display(genes_df['Predicate_Short'].value_counts())

Parsing 9 edges...
Looking up names for 4 unique genes...

✓ Found 9 gene associations (4 unique genes)

Top 20 Genes (by number of associations):


Unnamed: 0,Gene_Name,Gene_CURIE,Count
0,MONDO:0100233,MONDO:0100233,3
1,NCBIGene:114548,NCBIGene:114548,2
2,NCBIGene:55364,NCBIGene:55364,2
3,NCBIGene:969,NCBIGene:969,2



Predicates used:


Predicate_Short
related_to                     7
genetically_associated_with    1
in_clinical_trials_for         1
Name: count, dtype: int64
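
The gene table above lists `MONDO:0100233`, the disease itself, in the gene column. One plausible reading is that some edges came back with the disease as the object rather than the subject, so treating `Object` as the gene mislabels them. The sketch below picks whichever end of the edge is not the disease. The helper `partner_of` and the sample edges are hypothetical, and whether this matches the actual edges in this run has not been verified.

```python
from typing import Dict, List, Optional

DISEASE = "MONDO:0100233"


def partner_of(edge: Dict[str, str], disease_curie: str) -> Optional[str]:
    """Return the node at the other end of the edge from the disease."""
    subject, obj = edge.get("subject"), edge.get("object")
    if subject == disease_curie and obj != disease_curie:
        return obj
    if obj == disease_curie and subject != disease_curie:
        return subject
    return None  # edge does not touch the disease, or is a self-loop


# Invented edges: one in each direction, plus one unrelated edge.
sample_edges: List[Dict[str, str]] = [
    {"subject": "MONDO:0100233", "predicate": "biolink:related_to",
     "object": "NCBIGene:969"},
    {"subject": "NCBIGene:55364", "predicate": "biolink:genetically_associated_with",
     "object": "MONDO:0100233"},
    {"subject": "NCBIGene:1", "predicate": "biolink:related_to",
     "object": "NCBIGene:2"},
]

partners = [partner_of(edge, DISEASE) for edge in sample_edges]
print(partners)
```

Rows where the partner is `None` could be dropped before counting, which keeps the disease CURIE out of the gene tally.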

In [10]:
# Save Results

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

# Save full results to cache (JSON)
cache_file = CACHE_DIR / f"long_covid_genes_{timestamp}.json"

results_to_save = {
    'metadata': {
        'disease_curie': DISEASE_CURIE,
        'disease_name': DISEASE_NAME,
        'query_timestamp': timestamp,
        'predicate_count': query_metadata['predicate_count'],
        'api_count': query_metadata['api_count'],
        'edge_count': len(edges)
    },
    'edges': edges
}

with open(cache_file, 'w') as f:
    json.dump(results_to_save, f, indent=2, default=str)

print(f"✓ Results saved to cache:")
print(f"  {cache_file}")

# Save DataFrame to CSV (if we have results)
if len(genes_df) > 0:
    csv_file = CACHE_DIR / f"long_covid_genes_{timestamp}.csv"
    genes_df.to_csv(csv_file, index=False)
    print(f"")
    print(f"✓ Gene list saved to CSV:")
    print(f"  {csv_file}")
    print(f"")
    print(f"Summary:")
    print(f"  Total associations: {len(genes_df)}")
    print(f"  Unique genes: {genes_df['Gene_CURIE'].nunique()}")

✓ Results saved to cache:
  ../data/cache/long_covid_genes_20251203_094947.json

✓ Gene list saved to CSV:
  ../data/cache/long_covid_genes_20251203_094947.csv

Summary:
  Total associations: 9
  Unique genes: 4
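
A later session can reuse the cached JSON instead of querying the APIs again. The sketch below reads a file with the same top-level keys that the save cell writes (`metadata`, `edges`, and `metadata.edge_count`) and checks that the edge count matches. The helper name `load_cached_edges` is hypothetical, and the round trip runs against a temporary demo file rather than the real cache.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def load_cached_edges(path: Path) -> List[dict]:
    """Load edges from a cache file and verify the recorded edge count."""
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    edges = payload["edges"]
    expected = payload["metadata"]["edge_count"]
    if len(edges) != expected:
        raise ValueError(f"cache reports {expected} edges but contains {len(edges)}")
    return edges


# Round trip against a temporary file with invented content.
demo_payload = {
    "metadata": {"disease_curie": "MONDO:0100233", "edge_count": 1},
    "edges": [{"subject": "MONDO:0100233", "object": "NCBIGene:969"}],
}

with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "long_covid_genes_demo.json"
    demo_file.write_text(json.dumps(demo_payload), encoding="utf-8")
    reloaded = load_cached_edges(demo_file)

print(len(reloaded), "edge(s) reloaded")
```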
