# Disease -> BiologicalProcess Query Exploration

**Purpose**: Explore which BiologicalProcesses are associated with diseases, using TRAPI queries.

This notebook investigates Disease -> BiologicalProcess connections for:
- **Alzheimer's Disease** (MONDO:0004975)
- **Parkinson's Disease** (MONDO:0005180)

## Goal
Determine whether disease-associated BiologicalProcesses can serve as endpoints for a new query pattern:
```
Gene -> [Intermediate] -> Disease-associated BiologicalProcess
```
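
A minimal sketch of how this pattern could be written as a two-hop TRAPI query graph. The helper name `build_two_hop_query`, the node and edge keys, the example CURIEs, and the choice of `biolink:NamedThing` for the intermediate node are illustrative assumptions and not part of this notebook's pipeline.

```python
from typing import Any, Dict, List


def build_two_hop_query(gene_curie: str, bioprocess_curies: List[str]) -> Dict[str, Any]:
    """Build a Gene -> [Intermediate] -> BiologicalProcess query graph (hypothetical helper)."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    # Pinned start node: the gene of interest
                    "n00": {"ids": [gene_curie], "categories": ["biolink:Gene"]},
                    # Unpinned intermediate; the category here is an assumption
                    "n01": {"categories": ["biolink:NamedThing"]},
                    # Pinned end nodes: disease-associated BiologicalProcesses
                    "n02": {
                        "ids": bioprocess_curies,
                        "categories": ["biolink:BiologicalProcess"],
                    },
                },
                "edges": {
                    "e00": {"subject": "n00", "object": "n01"},
                    "e01": {"subject": "n01", "object": "n02"},
                },
            }
        }
    }


# Example CURIEs are placeholders chosen for illustration only
query = build_two_hop_query("NCBIGene:351", ["GO:0006914"])
print(list(query["message"]["query_graph"]["edges"]))
```

Pinning the end node to the BiologicalProcess CURIEs found for a disease is what would tie this pattern back to the Disease -> BiologicalProcess results explored below.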

## Questions to Answer
1. What predicates connect Disease -> BiologicalProcess?
2. How many BiologicalProcesses are returned per disease?
3. Are the results biologically meaningful (GO terms, UMLS processes)?
4. Which predicates should we filter to get useful results?

## 1. Setup & Dependencies

In [1]:
import json
import requests
import pandas as pd
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any, Optional, Tuple

# TCT imports
try:
    from TCT import TCT
    from TCT import name_resolver
    from TCT import translator_metakg
    from TCT import translator_kpinfo
    from TCT import translator_query
    TCT_AVAILABLE = True
    print("TCT library loaded successfully")
except ImportError as e:
    print("TCT not installed. Run: pip install TCT")
    print(f"  Error: {e}")
    TCT_AVAILABLE = False

# Data directory setup
DATA_DIR = Path("../data")
CACHE_DIR = DATA_DIR / "cache"
CACHE_DIR.mkdir(parents=True, exist_ok=True)

print(f"\nSetup complete")
print(f"  TCT available: {TCT_AVAILABLE}")
print(f"  Cache directory: {CACHE_DIR}")

TCT library loaded successfully

Setup complete
  TCT available: True
  Cache directory: ../data/cache


## 2. Load Translator Resources (MetaKG)

In [2]:
if not TCT_AVAILABLE:
    raise RuntimeError("TCT library required. Install with: pip install TCT")

print("Loading Translator resources...")
print("(This may take 1-2 minutes - some APIs may timeout, which is normal)\n")

try:
    # Primary loading method
    APInames, metaKG, Translator_KP_info = translator_metakg.load_translator_resources()
    print(f"\nSuccessfully loaded resources:")
    print(f"  APIs available: {len(APInames)}")
    print(f"  MetaKG edges: {len(metaKG):,}")
    
except Exception as e:
    print(f"\nPrimary loading failed: {e}")
    print("Attempting fallback (loading without Plover APIs)...")
    
    # Fallback: load components separately
    Translator_KP_info, APInames = translator_kpinfo.get_translator_kp_info()
    metaKG = translator_metakg.get_KP_metadata(APInames)
    
    print(f"\nLoaded resources (fallback mode):")
    print(f"  APIs available: {len(APInames)}")
    print(f"  MetaKG edges: {len(metaKG):,}")

Loading Translator resources...
(This may take 1-2 minutes - some APIs may timeout, which is normal)


Successfully loaded resources:
  APIs available: 58
  MetaKG edges: 22,252


## 3. Discover Disease -> BiologicalProcess Predicates

In [3]:
# Get all predicates that connect Disease to BiologicalProcess
disease_bioprocess_predicates = list(set(
    TCT.select_concept(
        sub_list=["biolink:Disease"],
        obj_list=["biolink:BiologicalProcess"],
        metaKG=metaKG
    )
))

print(f"Found {len(disease_bioprocess_predicates)} predicates for Disease -> BiologicalProcess queries\n")

# Display as DataFrame
predicates_df = pd.DataFrame({
    'Predicate': disease_bioprocess_predicates,
    'Short Name': [p.replace('biolink:', '') for p in disease_bioprocess_predicates]
}).sort_values('Short Name')

display(predicates_df)

Found 30 predicates for Disease -> BiologicalProcess queries



| | Predicate | Short Name |
|---|---|---|
| 29 | biolink:affects | affects |
| 7 | biolink:associated_with | associated_with |
| 15 | biolink:biomarker_for | biomarker_for |
| 6 | biolink:causes | causes |
| 11 | biolink:close_match | close_match |
| 25 | biolink:coexists_with | coexists_with |
| 20 | biolink:contributes_to | contributes_to |
| 19 | biolink:correlated_with | correlated_with |
| 12 | biolink:decreases_response_to | decreases_response_to |
| 10 | biolink:diagnoses | diagnoses |


In [4]:
# Select APIs capable of Disease -> BiologicalProcess queries
selected_APIs = TCT.select_API(
    sub_list=["biolink:Disease"],
    obj_list=["biolink:BiologicalProcess"],
    metaKG=metaKG
)

print(f"APIs capable of Disease -> BiologicalProcess queries: {len(selected_APIs)}")
for api in selected_APIs:
    print(f"  - {api}")

APIs capable of Disease -> BiologicalProcess queries: 10
  - Automat-monarchinitiative(Trapi v1.5.0)
  - Automat-ubergraph(Trapi v1.5.0)
  - Automat-robokop(Trapi v1.5.0)
  - BioThings Explorer (BTE) TRAPI
  - Microbiome KP - TRAPI 1.5.0
  - Multiomics KP - TRAPI 1.5.0
  - Text Mined Cooccurrence API
  - Automat-reactome(Trapi v1.5.0)
  - Automat-cam-kp(Trapi v1.5.0)
  - RTX KG2 - TRAPI 1.5.0
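
A rough sketch of what selecting APIs and predicates by subject and object category amounts to on a MetaKG-like table. This is not TCT's implementation. The `API` and `Predicate` columns appear later in this notebook, while the `Subject` and `Object` column names, the toy rows, and the API names `KP-A`/`KP-B` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for metaKG; "Subject" and "Object" column names are assumed
toy_metaKG = pd.DataFrame(
    [
        {"API": "KP-A", "Subject": "biolink:Disease",
         "Object": "biolink:BiologicalProcess", "Predicate": "biolink:affects"},
        {"API": "KP-A", "Subject": "biolink:Disease",
         "Object": "biolink:BiologicalProcess", "Predicate": "biolink:causes"},
        {"API": "KP-B", "Subject": "biolink:Disease",
         "Object": "biolink:BiologicalProcess", "Predicate": "biolink:affects"},
        {"API": "KP-B", "Subject": "biolink:Gene",
         "Object": "biolink:Disease", "Predicate": "biolink:related_to"},
    ]
)

# Keep only rows describing Disease -> BiologicalProcess edges
mask = (toy_metaKG["Subject"] == "biolink:Disease") & (
    toy_metaKG["Object"] == "biolink:BiologicalProcess"
)
subset = toy_metaKG[mask]

# APIs that can answer the query, and the predicates each one offers
capable_apis = sorted(subset["API"].unique())
predicates_by_api = (
    subset.groupby("API")["Predicate"].apply(lambda s: sorted(set(s))).to_dict()
)

print(capable_apis)
print(predicates_by_api)
```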


## 4. Query Helper Functions

In [5]:
def query_disease_to_bioprocess(
    disease_curie: str,
    disease_name: str,
    predicates: Optional[List[str]] = None,
    metaKG: pd.DataFrame = metaKG,
    APInames: Dict = APInames
) -> Tuple[pd.DataFrame, Dict]:
    """Query TRAPI for BiologicalProcesses associated with a disease.
    
    Args:
        disease_curie: MONDO CURIE (e.g., 'MONDO:0004975')
        disease_name: Human-readable name for display
        predicates: Optional list of predicates (None = use all)
        metaKG: MetaKG DataFrame
        APInames: API names dictionary
        
    Returns:
        Tuple of (results_df, metadata)
    """
    print(f"\n{'='*60}")
    print(f"Querying: {disease_name} ({disease_curie}) -> BiologicalProcess")
    print(f"{'='*60}\n")
    
    # Auto-discover predicates if not specified
    if predicates is None:
        predicates = list(set(
            TCT.select_concept(
                sub_list=["biolink:Disease"],
                obj_list=["biolink:BiologicalProcess"],
                metaKG=metaKG
            )
        ))
    
    print(f"Using {len(predicates)} predicates")
    
    # Select APIs
    selected_APIs = TCT.select_API(
        sub_list=["biolink:Disease"],
        obj_list=["biolink:BiologicalProcess"],
        metaKG=metaKG
    )
    
    print(f"Querying {len(selected_APIs)} APIs...\n")
    
    # Build TRAPI query
    query_json = {
        'message': {
            'query_graph': {
                'nodes': {
                    'n00': {
                        'ids': [disease_curie],
                        'categories': ['biolink:Disease']
                    },
                    'n01': {
                        'categories': ['biolink:BiologicalProcess']
                    }
                },
                'edges': {
                    'e00': {
                        'subject': 'n00',
                        'object': 'n01',
                        'predicates': predicates
                    }
                }
            }
        }
    }
    
    # Build API predicates dictionary
    API_predicates = {}
    API_withMetaKG = list(set(metaKG['API']))
    for api in API_withMetaKG:
        API_predicates[api] = list(set(metaKG[metaKG['API'] == api]['Predicate']))
    
    # Execute query
    query_results = translator_query.parallel_api_query(
        query_json=query_json,
        select_APIs=selected_APIs,
        APInames=APInames,
        API_predicates=API_predicates,
        max_workers=min(5, len(selected_APIs))
    )
    
    # Parse results
    edges = []
    for k, v in query_results.items():
        if isinstance(v, dict) and 'subject' in v and 'object' in v:
            edges.append(v)
    
    print(f"\nQuery complete!")
    print(f"  Edges found: {len(edges):,}")
    
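    if len(edges) == 0:
        # Return an empty frame with the expected columns so later cells
        # (e.g. results_df['BP_CURIE']) do not raise KeyError
        empty_df = pd.DataFrame(columns=['Disease_CURIE', 'Predicate', 'BP_CURIE', 'Sources'])
        return empty_df, {
            'disease_curie': disease_curie,
            'disease_name': disease_name,
            'edge_count': 0,
            'unique_bioprocesses': 0,
            'predicates_used': []
        }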
    
    # Extract edge data
    rows = []
    for edge in edges:
        rows.append({
            'Disease_CURIE': edge.get('subject', ''),
            'Predicate': edge.get('predicate', '').replace('biolink:', ''),
            'BP_CURIE': edge.get('object', ''),
            'Sources': ', '.join([s.get('resource_id', '') for s in edge.get('sources', []) if isinstance(s, dict)])
        })
    
    results_df = pd.DataFrame(rows).drop_duplicates()
    
    # Get unique BiologicalProcess CURIEs
    unique_bps = results_df['BP_CURIE'].unique().tolist()
    print(f"  Unique BiologicalProcesses: {len(unique_bps)}")
    
    metadata = {
        'disease_curie': disease_curie,
        'disease_name': disease_name,
        'edge_count': len(edges),
        'unique_bioprocesses': len(unique_bps),
        'predicates_used': results_df['Predicate'].unique().tolist()
    }
    
    return results_df, metadata
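
The parsing step in the helper above can be tried offline. The sketch below runs the same filter and row extraction on a hand-made dictionary. The keys, the `infores:example-kp` source, and the non-edge `status` entry are invented for illustration; the real return value of `parallel_api_query` may be shaped differently.

```python
import pandas as pd

# Hand-made stand-in for the query result; real keys and fields may differ
toy_results = {
    "edge-1": {
        "subject": "MONDO:0004975",
        "predicate": "biolink:affects",
        "object": "GO:0006914",
        "sources": [{"resource_id": "infores:example-kp"}],
    },
    # Same triple and source as edge-1, so it collapses on drop_duplicates()
    "edge-2": {
        "subject": "MONDO:0004975",
        "predicate": "biolink:affects",
        "object": "GO:0006914",
        "sources": [{"resource_id": "infores:example-kp"}],
    },
    # Non-edge entry that the isinstance/key check skips
    "status": "Success!",
}

edges = [
    v for v in toy_results.values()
    if isinstance(v, dict) and "subject" in v and "object" in v
]

rows = [
    {
        "Disease_CURIE": e.get("subject", ""),
        "Predicate": e.get("predicate", "").replace("biolink:", ""),
        "BP_CURIE": e.get("object", ""),
        "Sources": ", ".join(
            s.get("resource_id", "") for s in e.get("sources", []) if isinstance(s, dict)
        ),
    }
    for e in edges
]
toy_df = pd.DataFrame(rows).drop_duplicates()

print(len(edges), len(toy_df))
```

Note that `edge_count` in the metadata is taken before de-duplication, so it can exceed the number of rows in the returned DataFrame whenever APIs return identical edges.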

In [6]:
def batch_lookup_names(curies: List[str], batch_size: int = 1000) -> Dict[str, str]:
    """Batch lookup names for CURIEs using Node Normalizer API.
    
    Handles mixed ID types (GO, UMLS, REACT, etc.) in batched POST requests,
    which is much faster than one request per CURIE. CURIEs the service
    cannot resolve are omitted from the result.
    
    Args:
        curies: List of CURIEs to look up
        batch_size: Number of CURIEs per batch (API limit)
        
    Returns:
        Dict mapping CURIE -> label
    """
    if not curies:
        return {}
    
    names = {}
    url = "https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes"
    
    # Process in batches
    total_batches = (len(curies) + batch_size - 1) // batch_size
    print(f"Looking up {len(curies)} CURIEs via Node Normalizer ({total_batches} batches)...")
    
    for i in range(0, len(curies), batch_size):
        batch = curies[i:i + batch_size]
        batch_num = i // batch_size + 1
        
        try:
            # POST request with batch of CURIEs
            resp = requests.post(url, json={"curies": batch}, timeout=60)
            
            if resp.status_code == 200:
                data = resp.json()
                for curie in batch:
                    if curie in data and data[curie]:
                        label = data[curie].get('id', {}).get('label')
                        if label:
                            names[curie] = label
                print(f"  Batch {batch_num}/{total_batches}: resolved {len([c for c in batch if c in names])} names")
            else:
                print(f"  Batch {batch_num}/{total_batches}: API error {resp.status_code}")
                
        except Exception as e:
            print(f"  Batch {batch_num}/{total_batches}: request failed - {e}")
    
    print(f"  Total: resolved {len(names)}/{len(curies)} names")
    return names
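
The batching and label extraction can be checked without network access. This is a minimal sketch: `chunked` and `extract_labels` are hypothetical helpers, and `mock_response` is a hand-written stand-in that only mimics the `id.label` fields the function above reads, with `None` assumed for an unresolved CURIE.

```python
from typing import Dict, Iterator, List, Optional


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_labels(response: Dict[str, Optional[dict]]) -> Dict[str, str]:
    """Map CURIE -> label, skipping entries that are empty or lack a label."""
    labels = {}
    for curie, entry in response.items():
        label = (entry or {}).get("id", {}).get("label")
        if label:
            labels[curie] = label
    return labels


# Mock response; the placeholder UMLS CURIE stands for an unresolved identifier
mock_response = {
    "GO:0006914": {"id": {"identifier": "GO:0006914", "label": "autophagy"}},
    "UMLS:C0000000": None,
}

n_batches = len(list(chunked(["x"] * 1968, 1000)))
labels = extract_labels(mock_response)
print(n_batches, labels)
```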

## 5. Query Alzheimer's Disease -> BiologicalProcess

In [7]:
AD_CURIE = "MONDO:0004975"
AD_NAME = "Alzheimer's Disease"

ad_df, ad_metadata = query_disease_to_bioprocess(
    disease_curie=AD_CURIE,
    disease_name=AD_NAME
)


Querying: Alzheimer's Disease (MONDO:0004975) -> BiologicalProcess

Using 30 predicates
Querying 10 APIs...

Automat-robokop(Trapi v1.5.0): Success!
Automat-reactome(Trapi v1.5.0): Success!
RTX KG2 - TRAPI 1.5.0: Success!
BioThings Explorer (BTE) TRAPI: Success!
Text Mined Cooccurrence API: Success!

Query complete!
  Edges found: 2,147
  Unique BiologicalProcesses: 1636


In [8]:
# Display Alzheimer's results
if len(ad_df) > 0:
    print(f"\n{'='*60}")
    print(f"Alzheimer's Disease -> BiologicalProcess Results")
    print(f"{'='*60}\n")
    
    print(f"Summary:")
    print(f"  Total edges: {ad_metadata['edge_count']}")
    print(f"  Unique BiologicalProcesses: {ad_metadata['unique_bioprocesses']}")
    
    print(f"\nPredicate distribution:")
    display(ad_df['Predicate'].value_counts())
else:
    print("No results found for Alzheimer's Disease -> BiologicalProcess")


Alzheimer's Disease -> BiologicalProcess Results

Summary:
  Total edges: 2147
  Unique BiologicalProcesses: 1636

Predicate distribution:


Predicate
occurs_together_in_literature_with       1296
actively_involves                         326
related_to                                115
affects                                   109
coexists_with                              91
affected_by                                73
causes                                     26
caused_by                                  25
condition_predisposed_by                   24
disrupts                                   20
manifestation_of                           17
subclass_of                                 8
predisposes_to_condition                    6
disease_has_basis_in                        6
has_manifestation                           2
exacerbates_condition                       1
preventative_for_condition                  1
treats_or_applied_or_studied_to_treat       1
Name: count, dtype: int64
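
To put the distribution in proportion, the counts above can be converted to shares of all edges. The sketch below copies the numbers from the output by hand rather than reading them from `ad_df`, so it runs on its own.

```python
# Edge counts per predicate, copied from the Alzheimer's output above
ad_counts = {
    "occurs_together_in_literature_with": 1296,
    "actively_involves": 326,
    "related_to": 115,
    "affects": 109,
    "coexists_with": 91,
    "affected_by": 73,
    "causes": 26,
    "caused_by": 25,
    "condition_predisposed_by": 24,
    "disrupts": 20,
    "manifestation_of": 17,
    "subclass_of": 8,
    "predisposes_to_condition": 6,
    "disease_has_basis_in": 6,
    "has_manifestation": 2,
    "exacerbates_condition": 1,
    "preventative_for_condition": 1,
    "treats_or_applied_or_studied_to_treat": 1,
}

total = sum(ad_counts.values())
shares = {pred: 100 * n / total for pred, n in ad_counts.items()}

# Print the three largest predicates with their share of all edges
for pred, pct in sorted(shares.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{pred}: {pct:.1f}%")
```

Text-mined co-occurrence alone accounts for roughly 60% of the Alzheimer's edges, which is why it dominates any unfiltered result set.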

## 6. Query Parkinson's Disease -> BiologicalProcess

In [9]:
PD_CURIE = "MONDO:0005180"
PD_NAME = "Parkinson's Disease"

pd_df, pd_metadata = query_disease_to_bioprocess(
    disease_curie=PD_CURIE,
    disease_name=PD_NAME
)


Querying: Parkinson's Disease (MONDO:0005180) -> BiologicalProcess

Using 30 predicates
Querying 10 APIs...

Automat-robokop(Trapi v1.5.0): Success!
RTX KG2 - TRAPI 1.5.0: Success!
Text Mined Cooccurrence API: Success!
BioThings Explorer (BTE) TRAPI: Success!

Query complete!
  Edges found: 1,526
  Unique BiologicalProcesses: 1160


In [10]:
# Display Parkinson's results
if len(pd_df) > 0:
    print(f"\n{'='*60}")
    print(f"Parkinson's Disease -> BiologicalProcess Results")
    print(f"{'='*60}\n")
    
    print(f"Summary:")
    print(f"  Total edges: {pd_metadata['edge_count']}")
    print(f"  Unique BiologicalProcesses: {pd_metadata['unique_bioprocesses']}")
    
    print(f"\nPredicate distribution:")
    display(pd_df['Predicate'].value_counts())
else:
    print("No results found for Parkinson's Disease -> BiologicalProcess")


Parkinson's Disease -> BiologicalProcess Results

Summary:
  Total edges: 1526
  Unique BiologicalProcesses: 1160

Predicate distribution:


Predicate
occurs_together_in_literature_with       885
actively_involves                        222
related_to                               148
affects                                   83
coexists_with                             72
affected_by                               37
caused_by                                 17
causes                                    12
manifestation_of                          11
subclass_of                               10
disrupts                                   7
predisposes_to_condition                   7
condition_predisposed_by                   5
negatively_correlated_with                 2
has_manifestation                          2
positively_correlated_with                 2
treats_or_applied_or_studied_to_treat      1
preceded_by                                1
precedes                                   1
correlated_with                            1
Name: count, dtype: int64

## 7. Resolve BiologicalProcess Names

In [11]:
# Collect all unique CURIEs
all_curies = list(set(ad_df['BP_CURIE'].tolist() + pd_df['BP_CURIE'].tolist()))

# Show distribution by prefix
prefix_counts = {}
for c in all_curies:
    prefix = c.split(':')[0] if ':' in c else 'unknown'
    prefix_counts[prefix] = prefix_counts.get(prefix, 0) + 1

print(f"BiologicalProcess types found ({len(all_curies)} total):")
for prefix, count in sorted(prefix_counts.items(), key=lambda x: -x[1]):
    print(f"  {prefix}: {count}")

BiologicalProcess types found (1968 total):
  GO: 1488
  ncats.bioplanet: 223
  UMLS: 218
  MONDO: 17
  HP: 13
  NCIT: 5
  REACT: 3
  EFO: 1
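
The same prefix tally can be written more compactly with `collections.Counter`. This is an equivalent sketch on a small made-up sample; `curie_prefix` is a hypothetical helper name.

```python
from collections import Counter


def curie_prefix(curie: str) -> str:
    """Return the part before the first ':' or 'unknown' if there is none."""
    return curie.split(":", 1)[0] if ":" in curie else "unknown"


# Illustrative sample; the notebook applies this to all_curies
sample = ["GO:0006914", "GO:0006915", "UMLS:C0001811", "REACT:R-HSA-8862803", "no_prefix"]

prefix_counts = Counter(curie_prefix(c) for c in sample)

# most_common() returns (prefix, count) pairs sorted by descending count
for prefix, count in prefix_counts.most_common():
    print(f"  {prefix}: {count}")
```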


In [12]:
# Batch lookup all names using Node Normalizer API
# This is much faster than individual requests
all_names = batch_lookup_names(all_curies)

Looking up 1968 CURIEs via Node Normalizer (2 batches)...
  Batch 1/2: resolved 840 names
  Batch 2/2: resolved 834 names
  Total: resolved 1674/1968 names


In [13]:
# Add names to dataframes
ad_df['BP_Name'] = ad_df['BP_CURIE'].map(lambda x: all_names.get(x, x))
pd_df['BP_Name'] = pd_df['BP_CURIE'].map(lambda x: all_names.get(x, x))

## 8. Analyze Results by Predicate

In [15]:
# Define informative predicates (exclude text mining noise)
INFORMATIVE_PREDICATES = [
    'affects', 'causes', 'disrupts', 'disease_has_basis_in', 
    'related_to', 'correlated_with', 'contributes_to',
    'positively_correlated_with', 'negatively_correlated_with',
    'positively_associated_with', 'negatively_associated_with'
]

NOISE_PREDICATES = [
    'occurs_together_in_literature_with',  # Text mining - very noisy
    'manifestation_of',  # Too broad
    'coexists_with',  # Non-specific
    'actively_involves'  # Often duplicates
]

print("Informative predicates (KEEP):")
for p in INFORMATIVE_PREDICATES:
    print(f"  - {p}")

print("\nNoisy predicates (FILTER OUT):")
for p in NOISE_PREDICATES:
    print(f"  - {p}")

Informative predicates (KEEP):
  - affects
  - causes
  - disrupts
  - disease_has_basis_in
  - related_to
  - correlated_with
  - contributes_to
  - positively_correlated_with
  - negatively_correlated_with
  - positively_associated_with
  - negatively_associated_with

Noisy predicates (FILTER OUT):
  - occurs_together_in_literature_with
  - manifestation_of
  - coexists_with
  - actively_involves


In [16]:
# Filter to informative predicates
ad_informative = ad_df[ad_df['Predicate'].isin(INFORMATIVE_PREDICATES)]
pd_informative = pd_df[pd_df['Predicate'].isin(INFORMATIVE_PREDICATES)]

print(f"After filtering to informative predicates:")
print(f"  AD: {len(ad_informative)} edges (from {len(ad_df)} total)")
print(f"  PD: {len(pd_informative)} edges (from {len(pd_df)} total)")

After filtering to informative predicates:
  AD: 276 edges (from 2147 total)
  PD: 255 edges (from 1526 total)
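
Because the filter uses `isin(INFORMATIVE_PREDICATES)`, it also drops predicates that appear in neither list. The sketch below sorts the Parkinson's counts from Section 6 into three buckets to make that visible. The counts are copied by hand from the output, and the `bucket` helper is a hypothetical name.

```python
INFORMATIVE = {
    "affects", "causes", "disrupts", "disease_has_basis_in", "related_to",
    "correlated_with", "contributes_to", "positively_correlated_with",
    "negatively_correlated_with", "positively_associated_with",
    "negatively_associated_with",
}
NOISE = {
    "occurs_together_in_literature_with", "manifestation_of",
    "coexists_with", "actively_involves",
}

# Edge counts per predicate, copied from the Parkinson's output
pd_counts = {
    "occurs_together_in_literature_with": 885, "actively_involves": 222,
    "related_to": 148, "affects": 83, "coexists_with": 72, "affected_by": 37,
    "caused_by": 17, "causes": 12, "manifestation_of": 11, "subclass_of": 10,
    "disrupts": 7, "predisposes_to_condition": 7, "condition_predisposed_by": 5,
    "negatively_correlated_with": 2, "has_manifestation": 2,
    "positively_correlated_with": 2, "treats_or_applied_or_studied_to_treat": 1,
    "preceded_by": 1, "precedes": 1, "correlated_with": 1,
}


def bucket(predicate: str) -> str:
    """Classify a predicate by which list (if any) it belongs to."""
    if predicate in INFORMATIVE:
        return "informative"
    if predicate in NOISE:
        return "noise"
    return "unclassified"


totals = {"informative": 0, "noise": 0, "unclassified": 0}
for predicate, n in pd_counts.items():
    totals[bucket(predicate)] += n

unclassified = sorted(p for p in pd_counts if bucket(p) == "unclassified")
print(totals)
print(unclassified)
```

The informative total reproduces the 255 edges reported above. The unclassified bucket holds mostly inverse predicates such as `affected_by` and `caused_by`, which may deserve an explicit keep-or-drop decision.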


In [17]:
# Show AD BiologicalProcesses by predicate
print("\n" + "="*60)
print("ALZHEIMER'S DISEASE - BiologicalProcesses by Predicate")
print("="*60)

for pred in sorted(ad_informative['Predicate'].unique()):
    subset = ad_informative[ad_informative['Predicate'] == pred]
    unique_bps = subset[['BP_CURIE', 'BP_Name']].drop_duplicates()
    print(f"\n[{pred}] ({len(unique_bps)} processes)")
    for _, row in unique_bps.head(10).iterrows():
        print(f"  - {row['BP_Name']}")
    if len(unique_bps) > 10:
        print(f"  ... and {len(unique_bps) - 10} more")


ALZHEIMER'S DISEASE - BiologicalProcesses by Predicate

[affects] (76 processes)
  - Aging
  - Autophagy
  - Axonal Transport
  - cell cycle
  - Cell Death Process
  - Cell physiology
  - Cell Survival
  - Cerebrovascular Circulation
  - Coitus
  - Cessation of life
  ... and 66 more

[causes] (22 processes)
  - Aging
  - Cessation of life
  - Sleep
  - immune response
  - Neurogenesis
  - Equilibrium
  - Pathogenesis
  - Insulin resistance
  - Pathologic Processes
  - Inflammation
  ... and 12 more

[disease_has_basis_in] (3 processes)
  - Deregulated CDK5 triggers multiple neurodegenerative pathways in Alzheimer's disease models (Homo sapiens)
  - Defective OGG1 Substrate Processing (Homo sapiens)
  - Defective Base Excision Repair Associated with OGG1 (Homo sapiens)

[disrupts] (11 processes)
  - Cytokinesis of the fertilized ovum
  - Axonal Transport
  - Cell Death Process
  - Cell physiology
  - Signal Transduction
  - Apoptosis
  - homologous chromosome pairing at meiosis
  - Au

In [18]:
# Show PD BiologicalProcesses by predicate
print("\n" + "="*60)
print("PARKINSON'S DISEASE - BiologicalProcesses by Predicate")
print("="*60)

for pred in sorted(pd_informative['Predicate'].unique()):
    subset = pd_informative[pd_informative['Predicate'] == pred]
    unique_bps = subset[['BP_CURIE', 'BP_Name']].drop_duplicates()
    print(f"\n[{pred}] ({len(unique_bps)} processes)")
    for _, row in unique_bps.head(10).iterrows():
        print(f"  - {row['BP_Name']}")
    if len(unique_bps) > 10:
        print(f"  ... and {len(unique_bps) - 10} more")


PARKINSON'S DISEASE - BiologicalProcesses by Predicate

[affects] (54 processes)
  - Aging
  - Autophagy
  - Blood Pressure
  - Cell Death Process
  - Cell physiology
  - Cell Survival
  - Cessation of life
  - Energy Metabolism
  - Equilibrium
  - Eye Movements
  ... and 44 more

[causes] (9 processes)
  - Cessation of life
  - Movement
  - Speech
  - Inflammation
  - Atrophic
  - Disability
  - Dysmotility
  - Inflammatory Response
  - Abnormal degeneration

[correlated_with] (1 processes)
  - Parkinson disease

[disrupts] (5 processes)
  - dopamine secretion, neurotransmission
  - Autophagy
  - Apoptosis
  - Synaptic Transmission
  - Ferroptosis

[negatively_correlated_with] (1 processes)
  - Parkinson disease

[positively_correlated_with] (1 processes)
  - Parkinson disease

[related_to] (86 processes)
  - Parkinson disease
  - Biological Evolution
  - gene expression
  - Mutation
  - oxidative phosphorylation
  - phosphorylation
  - Genetic Polymorphism
  - DNA Methylation
  - Ge

## 9. Compare AD vs Parkinson's

In [19]:
print("="*60)
print("COMPARISON: Disease -> BiologicalProcess")
print("="*60)

comparison_data = {
    'Disease': [AD_NAME, PD_NAME],
    'CURIE': [AD_CURIE, PD_CURIE],
    'Total Edges': [ad_metadata.get('edge_count', 0), pd_metadata.get('edge_count', 0)],
    'Unique BiologicalProcesses': [ad_metadata.get('unique_bioprocesses', 0), pd_metadata.get('unique_bioprocesses', 0)],
    'After Filtering': [len(ad_informative['BP_CURIE'].unique()), len(pd_informative['BP_CURIE'].unique())]
}

comparison_df = pd.DataFrame(comparison_data)
display(comparison_df)

COMPARISON: Disease -> BiologicalProcess


| | Disease | CURIE | Total Edges | Unique BiologicalProcesses | After Filtering |
|---|---|---|---|---|---|
| 0 | Alzheimer's Disease | MONDO:0004975 | 2147 | 1636 | 135 |
| 1 | Parkinson's Disease | MONDO:0005180 | 1526 | 1160 | 123 |


In [20]:
# Find shared BiologicalProcesses (using filtered data)
ad_bps = set(ad_informative['BP_CURIE'].unique())
pd_bps = set(pd_informative['BP_CURIE'].unique())
common_bps = ad_bps & pd_bps

print(f"\nShared BiologicalProcesses (informative predicates):")
print(f"  AD unique: {len(ad_bps)}")
print(f"  PD unique: {len(pd_bps)}")
print(f"  Shared: {len(common_bps)}")

if common_bps:
    print(f"\nShared processes (both neurodegenerative diseases):")
    for curie in sorted(common_bps):
        name = all_names.get(curie, curie)
        print(f"  - {name}")


Shared BiologicalProcesses (informative predicates):
  AD unique: 135
  PD unique: 123
  Shared: 75

Shared processes (both neurodegenerative diseases):
  - glucose metabolic process
  - lipid metabolic process
  - homologous chromosome pairing at meiosis
  - Vision
  - gene expression
  - phosphorylation
  - Inflammation
  - Acclimatization
  - Aging
  - Autophagy
  - Bioenergetics
  - Birth
  - Blood Pressure
  - Cell Death Process
  - Cell physiology
  - Cell Survival
  - Circadian Rhythms
  - Coitus
  - Cessation of life
  - Abnormal degeneration
  - Down-Regulation
  - Eating
  - Energy Metabolism
  - Equilibrium
  - Biological Evolution
  - Growth
  - Homeostasis
  - detoxification
  - Metabolism
  - Movement
  - Mutation
  - Synaptic Transmission
  - Neuronal Plasticity
  - Pathologic Processes
  - physiological aspects
  - Genetic Polymorphism
  - Pregnancy
  - Signal Transduction
  - Sleep
  - sensory perception of smell
  - Speech
  - Vision
  - wound healing
  - Apoptosis
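
One way to summarize the overlap in a single number is the Jaccard index, the size of the intersection divided by the size of the union. This is offered as an optional sketch, since the notebook itself only reports raw counts; the `jaccard` helper is a hypothetical name.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Set sizes reported above for the filtered results
ad_n, pd_n, shared_n = 135, 123, 75

# |A union B| = |A| + |B| - |A intersect B|
union_n = ad_n + pd_n - shared_n
overlap = shared_n / union_n

print(f"Union: {union_n}, Jaccard: {overlap:.2f}")

# The same function applies directly to the CURIE sets, e.g. jaccard(ad_bps, pd_bps)
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
```

With 75 shared processes out of a union of 183, the index is about 0.41.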
 

In [21]:
# Disease-specific processes
ad_only = ad_bps - pd_bps
pd_only = pd_bps - ad_bps

print(f"\n{'='*60}")
print(f"DISEASE-SPECIFIC BiologicalProcesses")
print(f"{'='*60}")

print(f"\nAlzheimer's-specific ({len(ad_only)} processes):")
for curie in sorted(list(ad_only))[:15]:
    name = all_names.get(curie, curie)
    print(f"  - {name}")
if len(ad_only) > 15:
    print(f"  ... and {len(ad_only) - 15} more")

print(f"\nParkinson's-specific ({len(pd_only)} processes):")
for curie in sorted(list(pd_only))[:15]:
    name = all_names.get(curie, curie)
    print(f"  - {name}")
if len(pd_only) > 15:
    print(f"  ... and {len(pd_only) - 15} more")


DISEASE-SPECIFIC BiologicalProcesses

Alzheimer's-specific (60 processes):
  - translation
  - sphingolipid metabolic process
  - phagocytosis
  - immune response
  - cell cycle
  - brain development
  - cholesterol metabolic process
  - Ubiquitination
  - middle viral transcription
  - amyloid precursor protein metabolic process
  - long-term synaptic potentiation
  - Insulin resistance
  - Mental deterioration
  - Hypoxemia
  - Alzheimer disease
  ... and 45 more

Parkinson's-specific (48 processes):
  - oxidative phosphorylation
  - sensory perception of smell
  - locomotion
  - dopamine secretion, neurotransmission
  - Parkinson disease
  - defecation
  - Deglutition
  - digestion
  - Water consumption
  - Exertion
  - Eye Movements
  - nerve supply
  - Kinesthesis
  - locomotion
  - Motor Skills
  ... and 33 more


## 10. Reactome Pathway Analysis

In [22]:
# Look at disease_has_basis_in predicate (Reactome pathways)
ad_reactome = ad_df[ad_df['Predicate'] == 'disease_has_basis_in']
pd_reactome = pd_df[pd_df['Predicate'] == 'disease_has_basis_in']

print("="*60)
print("REACTOME PATHWAYS (disease_has_basis_in)")
print("="*60)

print(f"\nAlzheimer's Disease Reactome pathways ({len(ad_reactome)} edges):")
for _, row in ad_reactome.drop_duplicates(subset='BP_CURIE').iterrows():
    name = all_names.get(row['BP_CURIE'], row['BP_CURIE'])
    print(f"  - {name}")
    print(f"    ({row['BP_CURIE']})")

print(f"\nParkinson's Disease Reactome pathways ({len(pd_reactome)} edges):")
if len(pd_reactome) > 0:
    for _, row in pd_reactome.drop_duplicates(subset='BP_CURIE').iterrows():
        name = all_names.get(row['BP_CURIE'], row['BP_CURIE'])
        print(f"  - {name}")
        print(f"    ({row['BP_CURIE']})")
else:
    print("  (none found)")

REACTOME PATHWAYS (disease_has_basis_in)

Alzheimer's Disease Reactome pathways (6 edges):
  - Deregulated CDK5 triggers multiple neurodegenerative pathways in Alzheimer's disease models (Homo sapiens)
    (REACT:R-HSA-8862803)
  - Defective OGG1 Substrate Processing (Homo sapiens)
    (REACT:R-HSA-9656256)
  - Defective Base Excision Repair Associated with OGG1 (Homo sapiens)
    (REACT:R-HSA-9656249)

Parkinson's Disease Reactome pathways (0 edges):
  (none found)


## 11. Save Results

In [None]:
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

# Save full results
output = {
    'timestamp': timestamp,
    'alzheimers': {
        'metadata': ad_metadata,
        'results': ad_df.to_dict('records')
    },
    'parkinsons': {
        'metadata': pd_metadata,
        'results': pd_df.to_dict('records')
    },
    'name_mappings': all_names
}

output_file = CACHE_DIR / f"disease_bioprocess_comparison_{timestamp}.json"
with open(output_file, 'w') as f:
    json.dump(output, f, indent=2, default=str)

print(f"Results saved to: {output_file}")

# Also save filtered CSVs
ad_informative.to_csv(CACHE_DIR / f"alzheimers_bioprocesses_filtered_{timestamp}.csv", index=False)
pd_informative.to_csv(CACHE_DIR / f"parkinsons_bioprocesses_filtered_{timestamp}.csv", index=False)
print(f"Filtered CSVs saved to: {CACHE_DIR}")
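
A small sketch for reading the most recent saved comparison back in a later session. `load_latest_results` is a hypothetical helper; it relies on the `%Y%m%d_%H%M%S` timestamp in the filename sorting chronologically when sorted as text. The demo writes two dummy files to a temporary directory instead of touching `CACHE_DIR`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def load_latest_results(cache_dir: Path) -> Optional[Dict[str, Any]]:
    """Load the newest disease_bioprocess_comparison_*.json, or None if absent."""
    files = sorted(cache_dir.glob("disease_bioprocess_comparison_*.json"))
    if not files:
        return None
    with open(files[-1]) as f:
        return json.load(f)


# Demo with dummy files; the timestamps are made up
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for stamp in ("20250101_120000", "20250102_090000"):
        path = tmp_dir / f"disease_bioprocess_comparison_{stamp}.json"
        path.write_text(json.dumps({"timestamp": stamp}))
    latest = load_latest_results(tmp_dir)
    missing = load_latest_results(tmp_dir / "does_not_exist")

print(latest)
```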

---

## 12. Conclusions & Recommendations

### Predicate Recommendations

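**HIGH VALUE (keep for implementation):**
- `actively_involves` - Largest predicate after text mining (326 AD edges, 222 PD edges), biologically meaningful. The Section 8 filter currently lists it under `NOISE_PREDICATES`, so that list needs updating to match.
- `causes` - Causal relationships
- `disrupts` - Disruption relationships
- `disease_has_basis_in` - Reactome pathways (rare but high quality: 3 pathways for AD, none for PD)
- `related_to` - General associations

**FILTER OUT:**
- `occurs_together_in_literature_with` - Text-mining noise (about 60% of AD edges and 58% of PD edges)
- `coexists_with` - Non-specific
- `affects` - Too broad. The Section 8 filter currently lists it under `INFORMATIVE_PREDICATES`, so that list needs updating to match.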

### BiologicalProcess Categories Found

1. **Cellular Processes**: Autophagy, Apoptosis, Cell cycle, Phagocytosis
2. **Neurological**: Synaptic Transmission, Neuronal Plasticity, Neurogenesis, Long-term potentiation
3. **Metabolic**: Glucose metabolism, Lipid metabolism, Cholesterol homeostasis
4. **Pathological**: Inflammatory Response, Disease Progression, Amyloid deposition
5. **Disease-specific**:
   - AD: Amyloid deposition, Cholesterol homeostasis, Microglial activation
   - PD: Dopamine secretion, Locomotion, Motor skills, Mitophagy
