# Intermediate → BiologicalProcess MetaKG Exploration

**Purpose**: Determine which intermediate entity types can connect to BiologicalProcess in the Translator MetaKG.

## Questions to Answer
1. Does ChemicalEntity → BiologicalProcess exist? What predicates?
2. Does Protein → BiologicalProcess exist? What predicates?
3. Does Gene → BiologicalProcess exist? What predicates?
4. Does Pathway → BiologicalProcess exist? What predicates?
5. Which APIs support each Intermediate → BiologicalProcess pattern?

## Goal
Determine the default intermediate categories for the new query pattern:
```
Gene → [Intermediate] → Disease-associated BiologicalProcess
```
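
One way to picture this pattern is as a two-hop TRAPI query graph. The sketch below is only an illustration, not the project's implementation: the function name `build_two_hop_query`, the node and edge keys (`n0`, `e0`, and so on), the use of `biolink:related_to` as a catch-all predicate, and the example gene CURIE are all assumptions chosen for the example.

```python
from typing import Any, Dict, List, Optional


def build_two_hop_query(
    gene_curie: str,
    intermediate_categories: List[str],
    process_curies: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a Gene -> [Intermediate] -> BiologicalProcess query graph (hypothetical helper)."""
    nodes: Dict[str, Dict[str, Any]] = {
        # Start node: a single pinned gene.
        "n0": {"ids": [gene_curie], "categories": ["biolink:Gene"]},
        # Middle node: unpinned, restricted only by category.
        "n1": {"categories": list(intermediate_categories)},
        # End node: any BiologicalProcess unless specific CURIEs are supplied.
        "n2": {"categories": ["biolink:BiologicalProcess"]},
    }
    if process_curies:
        nodes["n2"]["ids"] = list(process_curies)

    edges = {
        # Assumption: a broad predicate so that any more specific predicate can match.
        "e0": {"subject": "n0", "object": "n1", "predicates": ["biolink:related_to"]},
        "e1": {"subject": "n1", "object": "n2", "predicates": ["biolink:related_to"]},
    }
    return {"message": {"query_graph": {"nodes": nodes, "edges": edges}}}


# Example usage with an illustrative gene CURIE.
query = build_two_hop_query("NCBIGene:7157", ["biolink:Protein", "biolink:Pathway"])
print(sorted(query["message"]["query_graph"]["nodes"]))
```

Leaving `n1` without `ids` is what makes it an open intermediate: only its categories constrain the middle hop, which is why the choice of default categories matters.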

## 1. Setup & Dependencies

In [1]:
import json
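import pandas as pd
from IPython.display import display  # explicit import so display() also works outside a notebook kernel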
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any

# TCT imports
try:
    from TCT import TCT
    from TCT import name_resolver
    from TCT import translator_metakg
    from TCT import translator_kpinfo
    TCT_AVAILABLE = True
    print("TCT library loaded successfully")
except ImportError as e:
    print("TCT not installed. Run: pip install TCT")
    print(f"  Error: {e}")
    TCT_AVAILABLE = False

print("\nSetup complete")
print(f"  TCT available: {TCT_AVAILABLE}")

TCT library loaded successfully

Setup complete
  TCT available: True


## 2. Load Translator Resources (MetaKG)

In [2]:
if not TCT_AVAILABLE:
    raise RuntimeError("TCT library required. Install with: pip install TCT")

print("Loading Translator resources...")
print("(This may take 1-2 minutes - some APIs may timeout, which is normal)\n")

try:
    # Primary loading method
    APInames, metaKG, Translator_KP_info = translator_metakg.load_translator_resources()
    print("\nSuccessfully loaded resources:")
    print(f"  APIs available: {len(APInames)}")
    print(f"  MetaKG edges: {len(metaKG):,}")
    
except Exception as e:
    print(f"\nPrimary loading failed: {e}")
    print("Attempting fallback (loading without Plover APIs)...")
    
    # Fallback: load components separately
    Translator_KP_info, APInames = translator_kpinfo.get_translator_kp_info()
    metaKG = translator_metakg.get_KP_metadata(APInames)
    
    print("\nLoaded resources (fallback mode):")
    print(f"  APIs available: {len(APInames)}")
    print(f"  MetaKG edges: {len(metaKG):,}")

Loading Translator resources...
(This may take 1-2 minutes - some APIs may timeout, which is normal)


Successfully loaded resources:
  APIs available: 58
  MetaKG edges: 22,252
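
The try/fallback structure of the loading cell can be isolated into a small reusable helper. The sketch below assumes nothing about TCT itself: `load_with_fallback` and the two stand-in loader functions are hypothetical names, and the stand-ins only simulate a failing primary loader.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> Tuple[T, str]:
    """Try the primary loader; if it raises, report the error and use the fallback."""
    try:
        return primary(), "primary"
    except Exception as exc:
        print(f"Primary loading failed: {exc}")
        return fallback(), "fallback"


# Stand-in loaders (hypothetical) that simulate a timeout in the primary path.
def failing_primary() -> dict:
    raise TimeoutError("simulated API timeout")


def simple_fallback() -> dict:
    return {"edges": 0}


resources, mode = load_with_fallback(failing_primary, simple_fallback)
print(f"Loaded in {mode} mode: {resources}")
```

Returning the mode alongside the result makes it easy to record whether the counts reported later came from the full load or the reduced fallback load.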


## 3. Explore Intermediate → BiologicalProcess Connections

Check which intermediate entity types can connect to BiologicalProcess.

In [3]:
# Define intermediate categories to test
INTERMEDIATE_CATEGORIES = [
    "biolink:ChemicalEntity",
    "biolink:Protein",
    "biolink:Gene",
    "biolink:Pathway",
    "biolink:MolecularActivity",
    "biolink:CellularComponent",
    "biolink:AnatomicalEntity",
    "biolink:PhenotypicFeature",
]

TARGET = ["biolink:BiologicalProcess"]

print("="*70)
print("INTERMEDIATE → BiologicalProcess CONNECTIVITY")
print("="*70)

results = []

for category in INTERMEDIATE_CATEGORIES:
    # Get predicates
    predicates = list(set(
        TCT.select_concept(
            sub_list=[category],
            obj_list=TARGET,
            metaKG=metaKG
        )
    ))
    
    # Get APIs
    apis = TCT.select_API(
        sub_list=[category],
        obj_list=TARGET,
        metaKG=metaKG
    )
    
    short_name = category.replace("biolink:", "")
    
    results.append({
        'Category': short_name,
        'Full Category': category,
        'Predicates': len(predicates),
        'APIs': len(apis),
        'Predicate List': predicates,
        'API List': apis
    })
    
    status = "✓" if len(predicates) > 0 else "✗"
    print(f"\n{status} {short_name} → BiologicalProcess")
    print(f"    Predicates: {len(predicates)}")
    print(f"    APIs: {len(apis)}")

# Create summary DataFrame
summary_df = pd.DataFrame(results)[['Category', 'Predicates', 'APIs']]
print("\n" + "="*70)
print("SUMMARY")
print("="*70)
display(summary_df)

INTERMEDIATE → BiologicalProcess CONNECTIVITY

✓ ChemicalEntity → BiologicalProcess
    Predicates: 24
    APIs: 8

✓ Protein → BiologicalProcess
    Predicates: 32
    APIs: 11

✓ Gene → BiologicalProcess
    Predicates: 33
    APIs: 9

✓ Pathway → BiologicalProcess
    Predicates: 20
    APIs: 14

✓ MolecularActivity → BiologicalProcess
    Predicates: 23
    APIs: 8

✓ CellularComponent → BiologicalProcess
    Predicates: 25
    APIs: 8

✓ AnatomicalEntity → BiologicalProcess
    Predicates: 25
    APIs: 5

✓ PhenotypicFeature → BiologicalProcess
    Predicates: 29
    APIs: 9

SUMMARY


| | Category | Predicates | APIs |
|---|---|---|---|
| 0 | ChemicalEntity | 24 | 8 |
| 1 | Protein | 32 | 11 |
| 2 | Gene | 33 | 9 |
| 3 | Pathway | 20 | 14 |
| 4 | MolecularActivity | 23 | 8 |
| 5 | CellularComponent | 25 | 8 |
| 6 | AnatomicalEntity | 25 | 5 |
| 7 | PhenotypicFeature | 29 | 9 |
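
Conceptually, the counts above come from filtering MetaKG edges by subject and object category and collecting the distinct predicates and APIs. The sketch below shows that idea on a toy edge list. It is not how `TCT.select_concept` or `TCT.select_API` are implemented: the record layout (`subject`, `object`, `predicate`, `api`), the helper name, and the toy rows are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Dict[str, str]

# Toy edges, invented for this example; they are not real MetaKG content.
TOY_EDGES: List[Edge] = [
    {"subject": "biolink:Gene", "object": "biolink:BiologicalProcess",
     "predicate": "biolink:participates_in", "api": "api-A"},
    {"subject": "biolink:Gene", "object": "biolink:BiologicalProcess",
     "predicate": "biolink:participates_in", "api": "api-B"},
    {"subject": "biolink:Gene", "object": "biolink:BiologicalProcess",
     "predicate": "biolink:regulates", "api": "api-A"},
    {"subject": "biolink:Pathway", "object": "biolink:BiologicalProcess",
     "predicate": "biolink:has_part", "api": "api-C"},
]


def select_predicates_and_apis(
    edges: Iterable[Edge], subject: str, obj: str
) -> Tuple[List[str], List[str]]:
    """Return the distinct predicates and APIs for one subject/object category pair."""
    predicates = set()
    apis = set()
    for edge in edges:
        if edge["subject"] == subject and edge["object"] == obj:
            predicates.add(edge["predicate"])
            apis.add(edge["api"])
    return sorted(predicates), sorted(apis)


preds, apis = select_predicates_and_apis(
    TOY_EDGES, "biolink:Gene", "biolink:BiologicalProcess"
)
print(f"Predicates: {len(preds)}, APIs: {len(apis)}")
```

Note that predicate and API counts are deduplicated independently, so a high predicate count does not imply that every API serves every predicate.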


## 4. Detailed Analysis: ChemicalEntity → BiologicalProcess

In [4]:
category = "biolink:ChemicalEntity"
short_name = "ChemicalEntity"

predicates = list(set(
    TCT.select_concept(
        sub_list=[category],
        obj_list=TARGET,
        metaKG=metaKG
    )
))

apis = TCT.select_API(
    sub_list=[category],
    obj_list=TARGET,
    metaKG=metaKG
)

print(f"\n{'='*60}")
print(f"{short_name} → BiologicalProcess")
print(f"{'='*60}")

print(f"\nPredicates ({len(predicates)}):")
for p in sorted(predicates):
    print(f"  - {p.replace('biolink:', '')}")

print(f"\nAPIs ({len(apis)}):")
for api in sorted(apis):
    print(f"  - {api}")


ChemicalEntity → BiologicalProcess

Predicates (24):
  - actively_involved_in
  - acts_upstream_of_or_within
  - affects
  - ameliorates_condition
  - associated_with
  - capable_of
  - causes
  - close_match
  - coexists_with
  - correlated_with
  - disrupts
  - expressed_in
  - has_input
  - has_output
  - has_part
  - has_participant
  - negatively_correlated_with
  - occurs_in
  - occurs_together_in_literature_with
  - positively_correlated_with
  - produces
  - regulates
  - related_to
  - subclass_of

APIs (8):
  - Automat-cam-kp(Trapi v1.5.0)
  - Automat-robokop(Trapi v1.5.0)
  - Automat-ubergraph(Trapi v1.5.0)
  - BioThings Explorer (BTE) TRAPI
  - Microbiome KP - TRAPI 1.5.0
  - Multiomics KP - TRAPI 1.5.0
  - RTX KG2 - TRAPI 1.5.0
  - Text Mined Cooccurrence API


## 5. Detailed Analysis: Protein → BiologicalProcess

In [5]:
category = "biolink:Protein"
short_name = "Protein"

predicates = list(set(
    TCT.select_concept(
        sub_list=[category],
        obj_list=TARGET,
        metaKG=metaKG
    )
))

apis = TCT.select_API(
    sub_list=[category],
    obj_list=TARGET,
    metaKG=metaKG
)

print(f"\n{'='*60}")
print(f"{short_name} → BiologicalProcess")
print(f"{'='*60}")

print(f"\nPredicates ({len(predicates)}):")
for p in sorted(predicates):
    print(f"  - {p.replace('biolink:', '')}")

print(f"\nAPIs ({len(apis)}):")
for api in sorted(apis):
    print(f"  - {api}")


Protein → BiologicalProcess

Predicates (32):
  - actively_involved_in
  - acts_upstream_of
  - acts_upstream_of_negative_effect
  - acts_upstream_of_or_within
  - acts_upstream_of_or_within_negative_effect
  - acts_upstream_of_or_within_positive_effect
  - acts_upstream_of_positive_effect
  - affects
  - associated_with
  - capable_of
  - causes
  - contraindicated_in
  - contributes_to
  - correlated_with
  - disrupts
  - enabled_by
  - enables
  - expressed_in
  - gene_associated_with_condition
  - has_input
  - has_output
  - has_part
  - has_participant
  - negatively_correlated_with
  - occurs_in
  - occurs_together_in_literature_with
  - positively_correlated_with
  - produces
  - regulates
  - related_to
  - subclass_of
  - temporally_related_to

APIs (11):
  - Automat-cam-kp(Trapi v1.5.0)
  - Automat-human-goa(Trapi v1.5.0)
  - Automat-robokop(Trapi v1.5.0)
  - Automat-ubergraph(Trapi v1.5.0)
  - Automat-viral-proteome(Trapi v1.5.0)
  - BioThings Explorer (BTE) TRAPI
  - Micr

## 6. Detailed Analysis: Gene → BiologicalProcess

In [6]:
category = "biolink:Gene"
short_name = "Gene"

predicates = list(set(
    TCT.select_concept(
        sub_list=[category],
        obj_list=TARGET,
        metaKG=metaKG
    )
))

apis = TCT.select_API(
    sub_list=[category],
    obj_list=TARGET,
    metaKG=metaKG
)

print(f"\n{'='*60}")
print(f"{short_name} → BiologicalProcess")
print(f"{'='*60}")

print(f"\nPredicates ({len(predicates)}):")
for p in sorted(predicates):
    print(f"  - {p.replace('biolink:', '')}")

print(f"\nAPIs ({len(apis)}):")
for api in sorted(apis):
    print(f"  - {api}")


Gene → BiologicalProcess

Predicates (33):
  - active_in
  - actively_involved_in
  - acts_upstream_of
  - acts_upstream_of_negative_effect
  - acts_upstream_of_or_within
  - acts_upstream_of_or_within_negative_effect
  - acts_upstream_of_or_within_positive_effect
  - acts_upstream_of_positive_effect
  - affects
  - associated_with
  - capable_of
  - causes
  - contributes_to
  - correlated_with
  - disrupts
  - enabled_by
  - enables
  - expressed_in
  - gene_associated_with_condition
  - gene_product_of
  - has_input
  - has_output
  - has_part
  - has_participant
  - negatively_correlated_with
  - occurs_in
  - overlaps
  - participates_in
  - positively_correlated_with
  - produces
  - regulates
  - related_to
  - subclass_of

APIs (9):
  - Automat-cam-kp(Trapi v1.5.0)
  - Automat-hetionet(Trapi v1.5.0)
  - Automat-robokop(Trapi v1.5.0)
  - BioThings Explorer (BTE) TRAPI
  - Microbiome KP - TRAPI 1.5.0
  - Multiomics KP - TRAPI 1.5.0
  - RTX KG2 - TRAPI 1.5.0
  - Retriever
  - Ser

## 7. Detailed Analysis: Pathway → BiologicalProcess

In [7]:
category = "biolink:Pathway"
short_name = "Pathway"

predicates = list(set(
    TCT.select_concept(
        sub_list=[category],
        obj_list=TARGET,
        metaKG=metaKG
    )
))

apis = TCT.select_API(
    sub_list=[category],
    obj_list=TARGET,
    metaKG=metaKG
)

print(f"\n{'='*60}")
print(f"{short_name} → BiologicalProcess")
print(f"{'='*60}")

print(f"\nPredicates ({len(predicates)}):")
for p in sorted(predicates):
    print(f"  - {p.replace('biolink:', '')}")

print(f"\nAPIs ({len(apis)}):")
for api in sorted(apis):
    print(f"  - {api}")


Pathway → BiologicalProcess

Predicates (20):
  - affects
  - associated_with
  - causes
  - close_match
  - coexists_with
  - contributes_to
  - has_part
  - has_participant
  - negatively_correlated_with
  - occurs_in
  - occurs_together_in_literature_with
  - overlaps
  - positively_correlated_with
  - preceded_by
  - precedes
  - regulates
  - related_to
  - same_as
  - subclass_of
  - temporally_related_to

APIs (14):
  - Automat-cam-kp(Trapi v1.5.0)
  - Automat-gwas-catalog(Trapi v1.5.0)
  - Automat-hetionet(Trapi v1.5.0)
  - Automat-human-goa(Trapi v1.5.0)
  - Automat-panther(Trapi v1.5.0)
  - Automat-reactome(Trapi v1.5.0)
  - Automat-robokop(Trapi v1.5.0)
  - Automat-ubergraph(Trapi v1.5.0)
  - Automat-viral-proteome(Trapi v1.5.0)
  - Microbiome KP - TRAPI 1.5.0
  - Multiomics KP - TRAPI 1.5.0
  - RTX KG2 - TRAPI 1.5.0
  - Retriever
  - Text Mined Cooccurrence API
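
Since the per-category predicate lists differ, it can help to see which predicates two categories share. The sketch below compares the ChemicalEntity and Pathway lists exactly as printed above using plain set operations; the variable and function names are hypothetical.

```python
# Predicate lists copied from the ChemicalEntity and Pathway outputs above.
CHEMICAL_PREDICATES = {
    "actively_involved_in", "acts_upstream_of_or_within", "affects",
    "ameliorates_condition", "associated_with", "capable_of", "causes",
    "close_match", "coexists_with", "correlated_with", "disrupts",
    "expressed_in", "has_input", "has_output", "has_part", "has_participant",
    "negatively_correlated_with", "occurs_in",
    "occurs_together_in_literature_with", "positively_correlated_with",
    "produces", "regulates", "related_to", "subclass_of",
}

PATHWAY_PREDICATES = {
    "affects", "associated_with", "causes", "close_match", "coexists_with",
    "contributes_to", "has_part", "has_participant",
    "negatively_correlated_with", "occurs_in",
    "occurs_together_in_literature_with", "overlaps",
    "positively_correlated_with", "preceded_by", "precedes", "regulates",
    "related_to", "same_as", "subclass_of", "temporally_related_to",
}


def compare_predicates(left: set, right: set) -> dict:
    """Split two predicate sets into shared, left-only, and right-only groups."""
    return {
        "shared": sorted(left & right),
        "left_only": sorted(left - right),
        "right_only": sorted(right - left),
    }


comparison = compare_predicates(CHEMICAL_PREDICATES, PATHWAY_PREDICATES)
print(f"Shared: {len(comparison['shared'])}")
print(f"ChemicalEntity only: {comparison['left_only']}")
print(f"Pathway only: {comparison['right_only']}")
```

The Pathway-only group includes ordering predicates such as `precedes` and `preceded_by`, which a predicate filter tuned for ChemicalEntity would miss.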


## 8. Also Check Gene → Intermediate Connections

For the full query pattern `Gene → [Intermediate] → BiologicalProcess`, we also need to verify Gene → Intermediate connections.

In [8]:
SOURCE = ["biolink:Gene"]

print("="*70)
print("Gene → INTERMEDIATE CONNECTIVITY")
print("="*70)

gene_to_intermediate_results = []

for category in INTERMEDIATE_CATEGORIES:
    # Get predicates
    predicates = list(set(
        TCT.select_concept(
            sub_list=SOURCE,
            obj_list=[category],
            metaKG=metaKG
        )
    ))
    
    # Get APIs
    apis = TCT.select_API(
        sub_list=SOURCE,
        obj_list=[category],
        metaKG=metaKG
    )
    
    short_name = category.replace("biolink:", "")
    
    gene_to_intermediate_results.append({
        'Category': short_name,
        'Full Category': category,
        'Predicates': len(predicates),
        'APIs': len(apis),
    })
    
    status = "✓" if len(predicates) > 0 else "✗"
    print(f"\n{status} Gene → {short_name}")
    print(f"    Predicates: {len(predicates)}")
    print(f"    APIs: {len(apis)}")

# Create summary DataFrame
gene_summary_df = pd.DataFrame(gene_to_intermediate_results)[['Category', 'Predicates', 'APIs']]
print("\n" + "="*70)
print("SUMMARY: Gene → Intermediate")
print("="*70)
display(gene_summary_df)

Gene → INTERMEDIATE CONNECTIVITY

✓ Gene → ChemicalEntity
    Predicates: 45
    APIs: 17

✓ Gene → Protein
    Predicates: 44
    APIs: 12

✓ Gene → Gene
    Predicates: 40
    APIs: 18

✓ Gene → Pathway
    Predicates: 24
    APIs: 13

✓ Gene → MolecularActivity
    Predicates: 34
    APIs: 11

✓ Gene → CellularComponent
    Predicates: 40
    APIs: 10

✓ Gene → AnatomicalEntity
    Predicates: 24
    APIs: 9

✓ Gene → PhenotypicFeature
    Predicates: 44
    APIs: 13

SUMMARY: Gene → Intermediate


| | Category | Predicates | APIs |
|---|---|---|---|
| 0 | ChemicalEntity | 45 | 17 |
| 1 | Protein | 44 | 12 |
| 2 | Gene | 40 | 18 |
| 3 | Pathway | 24 | 13 |
| 4 | MolecularActivity | 34 | 11 |
| 5 | CellularComponent | 40 | 10 |
| 6 | AnatomicalEntity | 24 | 9 |
| 7 | PhenotypicFeature | 44 | 13 |


## 9. Combined Analysis: Full Path Viability

For the full query pattern `Gene → [Intermediate] → BiologicalProcess`, we need BOTH connections to exist.

In [9]:
print("="*70)
print("FULL PATH VIABILITY: Gene → [Intermediate] → BiologicalProcess")
print("="*70)

viable_intermediates = []

for i, category in enumerate(INTERMEDIATE_CATEGORIES):
    short_name = category.replace("biolink:", "")
    
    # Get Gene → Intermediate info from our previous results
    gene_to_int = gene_to_intermediate_results[i]
    
    # Get Intermediate → BiologicalProcess info from our previous results  
    int_to_bp = results[i]
    
    # Both must have predicates for the path to be viable
    is_viable = gene_to_int['Predicates'] > 0 and int_to_bp['Predicates'] > 0
    
    if is_viable:
        viable_intermediates.append({
            'Category': short_name,
            'Full Category': category,
            'Gene→Int Predicates': gene_to_int['Predicates'],
            'Gene→Int APIs': gene_to_int['APIs'],
            'Int→BP Predicates': int_to_bp['Predicates'],
            'Int→BP APIs': int_to_bp['APIs'],
        })
        print(f"\n✓ {short_name}")
        print(f"    Gene → {short_name}: {gene_to_int['Predicates']} predicates, {gene_to_int['APIs']} APIs")
        print(f"    {short_name} → BiologicalProcess: {int_to_bp['Predicates']} predicates, {int_to_bp['APIs']} APIs")
    else:
        print(f"\n✗ {short_name} (missing connection)")
        if gene_to_int['Predicates'] == 0:
            print(f"    Gene → {short_name}: NO CONNECTION")
        else:
            print(f"    Gene → {short_name}: {gene_to_int['Predicates']} predicates")
        if int_to_bp['Predicates'] == 0:
            print(f"    {short_name} → BiologicalProcess: NO CONNECTION")
        else:
            print(f"    {short_name} → BiologicalProcess: {int_to_bp['Predicates']} predicates")

# Create summary DataFrame
if viable_intermediates:
    viable_df = pd.DataFrame(viable_intermediates)
    print("\n" + "="*70)
    print("VIABLE INTERMEDIATES FOR: Gene → [Intermediate] → BiologicalProcess")
    print("="*70)
    display(viable_df)
else:
    print("\n⚠️ NO VIABLE INTERMEDIATE CATEGORIES FOUND!")

FULL PATH VIABILITY: Gene → [Intermediate] → BiologicalProcess

✓ ChemicalEntity
    Gene → ChemicalEntity: 45 predicates, 17 APIs
    ChemicalEntity → BiologicalProcess: 24 predicates, 8 APIs

✓ Protein
    Gene → Protein: 44 predicates, 12 APIs
    Protein → BiologicalProcess: 32 predicates, 11 APIs

✓ Gene
    Gene → Gene: 40 predicates, 18 APIs
    Gene → BiologicalProcess: 33 predicates, 9 APIs

✓ Pathway
    Gene → Pathway: 24 predicates, 13 APIs
    Pathway → BiologicalProcess: 20 predicates, 14 APIs

✓ MolecularActivity
    Gene → MolecularActivity: 34 predicates, 11 APIs
    MolecularActivity → BiologicalProcess: 23 predicates, 8 APIs

✓ CellularComponent
    Gene → CellularComponent: 40 predicates, 10 APIs
    CellularComponent → BiologicalProcess: 25 predicates, 8 APIs

✓ AnatomicalEntity
    Gene → AnatomicalEntity: 24 predicates, 9 APIs
    AnatomicalEntity → BiologicalProcess: 25 predicates, 5 APIs

✓ PhenotypicFeature
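    Gene → PhenotypicFeature: 44 predicates, 13 APIs
    PhenotypicFeature → BiologicalProcess: 29 predicates, 9 APIs

VIABLE INTERMEDIATES FOR: Gene → [Intermediate] → BiologicalProcess

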
| | Category | Full Category | Gene→Int Predicates | Gene→Int APIs | Int→BP Predicates | Int→BP APIs |
|---|---|---|---|---|---|---|
| 0 | ChemicalEntity | biolink:ChemicalEntity | 45 | 17 | 24 | 8 |
| 1 | Protein | biolink:Protein | 44 | 12 | 32 | 11 |
| 2 | Gene | biolink:Gene | 40 | 18 | 33 | 9 |
| 3 | Pathway | biolink:Pathway | 24 | 13 | 20 | 14 |
| 4 | MolecularActivity | biolink:MolecularActivity | 34 | 11 | 23 | 8 |
| 5 | CellularComponent | biolink:CellularComponent | 40 | 10 | 25 | 8 |
| 6 | AnatomicalEntity | biolink:AnatomicalEntity | 24 | 9 | 25 | 5 |
| 7 | PhenotypicFeature | biolink:PhenotypicFeature | 44 | 13 | 29 | 9 |
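
The viability cell pairs the two result lists by position, which works only while both lists are built from `INTERMEDIATE_CATEGORIES` in the same order. A minimal sketch of a more defensive alternative is to join on the category name instead. The helper name `join_by_category` is hypothetical; the counts are copied from the two summary tables above.

```python
from typing import Dict, List

# Counts copied from the Gene → Intermediate summary.
GENE_TO_INT = [
    {"Category": "ChemicalEntity", "Predicates": 45, "APIs": 17},
    {"Category": "Protein", "Predicates": 44, "APIs": 12},
    {"Category": "Gene", "Predicates": 40, "APIs": 18},
    {"Category": "Pathway", "Predicates": 24, "APIs": 13},
    {"Category": "MolecularActivity", "Predicates": 34, "APIs": 11},
    {"Category": "CellularComponent", "Predicates": 40, "APIs": 10},
    {"Category": "AnatomicalEntity", "Predicates": 24, "APIs": 9},
    {"Category": "PhenotypicFeature", "Predicates": 44, "APIs": 13},
]

# Counts copied from the Intermediate → BiologicalProcess summary.
INT_TO_BP = [
    {"Category": "ChemicalEntity", "Predicates": 24, "APIs": 8},
    {"Category": "Protein", "Predicates": 32, "APIs": 11},
    {"Category": "Gene", "Predicates": 33, "APIs": 9},
    {"Category": "Pathway", "Predicates": 20, "APIs": 14},
    {"Category": "MolecularActivity", "Predicates": 23, "APIs": 8},
    {"Category": "CellularComponent", "Predicates": 25, "APIs": 8},
    {"Category": "AnatomicalEntity", "Predicates": 25, "APIs": 5},
    {"Category": "PhenotypicFeature", "Predicates": 29, "APIs": 9},
]


def join_by_category(first_hop: List[Dict], second_hop: List[Dict]) -> List[Dict]:
    """Keep categories that have at least one predicate on both hops, matched by name."""
    second_by_name = {row["Category"]: row for row in second_hop}
    viable = []
    for row in first_hop:
        match = second_by_name.get(row["Category"])
        if match is None:
            continue  # category missing from the second hop entirely
        if row["Predicates"] > 0 and match["Predicates"] > 0:
            viable.append({
                "Category": row["Category"],
                "Gene→Int APIs": row["APIs"],
                "Int→BP APIs": match["APIs"],
            })
    return viable


viable = join_by_category(GENE_TO_INT, INT_TO_BP)
print(f"Viable intermediates: {len(viable)}")
```

Because the lookup is by name, reordering or filtering either list does not silently pair the wrong rows.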


## 10. Conclusions & Recommendations

In [10]:
print("="*70)
print("CONCLUSIONS")
print("="*70)

if viable_intermediates:
    print(f"\n✓ Found {len(viable_intermediates)} viable intermediate categories:")
    for v in viable_intermediates:
        print(f"  - {v['Category']}")
    
    print("\n" + "-"*70)
    print("RECOMMENDED DEFAULT INTERMEDIATES for Disease-BP Query Pattern:")
    print("-"*70)
    
    # Recommend based on API coverage on both hops (predicate counts are not used here)
    recommended = []
    for v in viable_intermediates:
        # Prioritize categories with good coverage on both hops
        if v['Gene→Int APIs'] >= 5 and v['Int→BP APIs'] >= 5:
            recommended.append(v['Full Category'])
            print(f"  ✓ {v['Category']} (high coverage)")
        elif v['Gene→Int APIs'] >= 2 and v['Int→BP APIs'] >= 2:
            recommended.append(v['Full Category'])
            print(f"  ○ {v['Category']} (moderate coverage)")
        else:
            print(f"  △ {v['Category']} (low coverage - consider optional)")
    
    print("\n" + "-"*70)
    print("Python list for implementation:")
    print("-"*70)
    print(f"DEFAULT_BP_INTERMEDIATE_CATEGORIES = {recommended}")
else:
    print("\n⚠️ NO VIABLE INTERMEDIATE CATEGORIES FOUND!")
    print("\nThis means the query pattern Gene → [Intermediate] → BiologicalProcess")
    print("is not well-supported in the current Translator MetaKG.")
    print("\nConsider alternative approaches:")
    print("  1. Use direct Gene → BiologicalProcess queries")
    print("  2. Query Gene → Disease → BiologicalProcess (different pattern)")

CONCLUSIONS

✓ Found 8 viable intermediate categories:
  - ChemicalEntity
  - Protein
  - Gene
  - Pathway
  - MolecularActivity
  - CellularComponent
  - AnatomicalEntity
  - PhenotypicFeature

----------------------------------------------------------------------
RECOMMENDED DEFAULT INTERMEDIATES for Disease-BP Query Pattern:
----------------------------------------------------------------------
  ✓ ChemicalEntity (high coverage)
  ✓ Protein (high coverage)
  ✓ Gene (high coverage)
  ✓ Pathway (high coverage)
  ✓ MolecularActivity (high coverage)
  ✓ CellularComponent (high coverage)
  ✓ AnatomicalEntity (high coverage)
  ✓ PhenotypicFeature (high coverage)

----------------------------------------------------------------------
Python list for implementation:
----------------------------------------------------------------------
DEFAULT_BP_INTERMEDIATE_CATEGORIES = ['biolink:ChemicalEntity', 'biolink:Protein', 'biolink:Gene', 'biolink:Pathway', 'biolink:MolecularActivity', 'biolink:C
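
The tiering rule used in the cell above can be expressed as a small pure function, which makes the thresholds easier to test and adjust. This is a sketch under the same thresholds (5 and 2 APIs per hop); the name `coverage_tier` and the keyword parameters are hypothetical, and the API counts are copied from the viability table.

```python
def coverage_tier(first_hop_apis: int, second_hop_apis: int,
                  high: int = 5, moderate: int = 2) -> str:
    """Classify a two-hop path by its weaker hop, since both hops must clear the threshold."""
    weakest = min(first_hop_apis, second_hop_apis)
    if weakest >= high:
        return "high"
    if weakest >= moderate:
        return "moderate"
    return "low"


# (Gene→Int APIs, Int→BP APIs) copied from the viability table.
API_COUNTS = {
    "ChemicalEntity": (17, 8),
    "Protein": (12, 11),
    "Gene": (18, 9),
    "Pathway": (13, 14),
    "MolecularActivity": (11, 8),
    "CellularComponent": (10, 8),
    "AnatomicalEntity": (9, 5),
    "PhenotypicFeature": (13, 9),
}

for name, (first, second) in API_COUNTS.items():
    print(f"{name}: {coverage_tier(first, second)}")
```

With these thresholds every category lands in the high tier, so the rule does not discriminate between them. AnatomicalEntity sits exactly at the boundary with 5 APIs on the second hop, so raising `high` to 6 would move it to the moderate tier.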

---

## Next Steps

Based on the results above:
1. Update the implementation plan with the recommended intermediate categories.
2. ChemicalEntity (drugs) → BiologicalProcess is supported (24 predicates, 8 APIs), so we can find drug targets that affect disease-related processes.
3. Protein → BiologicalProcess is supported (32 predicates, 11 APIs), so we can find protein targets involved in disease processes.