# Ontology Term Validation and Querying

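This notebook demonstrates how to:
1. Check whether a list of ontology terms exists in the CDM ontology tables
2. Query OMP (Ontology of Microbial Phenotypes) terms
3. Query ECO (Evidence and Conclusion Ontology) terms
4. Find relationships between terms from different ontologies
5. Validate batches of terms with per-ontology reporting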

## Setup

In [1]:
# Import required libraries
from spark.utils import get_spark_session
import pandas as pd
from IPython.display import display
from typing import List, Dict, Tuple

# Initialize Spark session
spark = get_spark_session()
namespace = 'ontology_data'

## 1. Term Existence Validation

This function takes a comma-separated string of ontology term IDs and checks which ones exist in the CDM `statements` table. A term counts as found only if it has an `rdfs:label` statement.

In [2]:
def validate_ontology_terms(term_list: str) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Validate a list of ontology terms against the CDM ontology database.
    
    Args:
        term_list: Comma-separated string of ontology terms (e.g., "OMP:0006023, ECO:0006056")
    
    Returns:
        Tuple of (found_terms_df, missing_terms_df)
    """
    # Parse and clean the input terms
    terms = [term.strip() for term in term_list.split(',')]
    terms = [term for term in terms if term]  # Remove empty strings
    
    print(f"Checking {len(terms)} terms: {terms}")
    
    # Convert to SQL-friendly format
    terms_sql = "','".join(terms)
    
    # Query to find which terms exist in statements table
    query = f"""
    WITH input_terms AS (
        SELECT explode(array('{terms_sql}')) as term
    ),
    found_terms AS (
        SELECT DISTINCT
            it.term as input_term,
            s.subject as found_term,
            s.predicate,
            s.value as label
        FROM input_terms it
        LEFT JOIN {namespace}.statements s
            ON it.term = s.subject
            -- Filter in the ON clause: a WHERE on s.predicate would drop unmatched rows
            AND s.predicate = 'rdfs:label'
    ),
    all_found AS (
        SELECT DISTINCT input_term as term
        FROM found_terms
        WHERE found_term IS NOT NULL
    ),
    missing AS (
        SELECT term
        FROM input_terms
        WHERE term NOT IN (SELECT term FROM all_found)
    )
    SELECT 
        'found' as status,
        f.input_term as term,
        f.label as term_label
    FROM found_terms f
    WHERE f.found_term IS NOT NULL
    
    UNION ALL
    
    SELECT 
        'missing' as status,
        m.term,
        NULL as term_label
    FROM missing m
    """
    
    # Execute query
    results_df = spark.sql(query).toPandas()
    
    # Separate found and missing terms
    found_df = results_df[results_df['status'] == 'found'][['term', 'term_label']]
    missing_df = results_df[results_df['status'] == 'missing'][['term']]
    
    # Print summary
    print(f"\nValidation Results:")
    print(f"- Found: {len(found_df)} terms")
    print(f"- Missing: {len(missing_df)} terms")
    
    if len(missing_df) == 0:
        print("\n✓ All terms are part of the loaded ontology records!")
    else:
        print(f"\n✗ {len(missing_df)} terms are NOT part of the loaded ontology records")
    
    return found_df, missing_df

# Example usage
test_terms = "OMP:0006023, ECO:0006056, OMP:9999999, ECO:0000001, FAKE:12345"
found_terms, missing_terms = validate_ontology_terms(test_terms)

print("\nFound terms:")
display(found_terms)

if len(missing_terms) > 0:
    print("\nMissing terms:")
    display(missing_terms)

Checking 5 terms: ['OMP:0006023', 'ECO:0006056', 'OMP:9999999', 'ECO:0000001', 'FAKE:12345']


                                                                                

ModuleNotFoundError: No module named 'distutils'
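
`validate_ontology_terms` interpolates the input terms directly into the SQL text, so a stray quote in the input breaks the query. Below is a minimal sketch of a pre-check, assuming term IDs follow a `PREFIX:LOCAL_ID` shape. The helper name `parse_term_list` and the regular expression are illustrative assumptions, not part of the notebook or the CDM schema.

```python
import re
from typing import List, Tuple

# Assumed ID shape: a prefix of letters/digits/underscores, a colon, then a local ID
# made of letters, digits, dots, underscores, or hyphens (e.g. "OMP:0006023", "EC:1.1.1.n1").
TERM_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*:[A-Za-z0-9._-]+")

def parse_term_list(term_list: str) -> Tuple[List[str], List[str]]:
    """Split a comma-separated string into (well-formed, rejected) term IDs."""
    valid: List[str] = []
    rejected: List[str] = []
    for raw in term_list.split(","):
        term = raw.strip()
        if not term:
            continue  # skip empty entries such as a trailing comma
        if TERM_PATTERN.fullmatch(term):
            if term not in valid:  # drop duplicates, keep first-seen order
                valid.append(term)
        else:
            rejected.append(term)
    return valid, rejected

valid, rejected = parse_term_list("OMP:0006023, ECO:0006056, , bad'term, OMP:0006023")
print(valid)     # ['OMP:0006023', 'ECO:0006056']
print(rejected)  # ["bad'term"]
```

Only the well-formed IDs would then be joined into the SQL array literal; the rejected ones can be reported to the user instead of reaching Spark.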

## 2. Alternative Validation Method - Check Multiple Tables

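This method counts how often each term appears as a subject or object in three CDM tables (`statements`, `entailed_edge`, and `term_association`), so a term counts as existing even if it has no label. It runs four queries per term, which suits short lists.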

In [3]:
def validate_terms_comprehensive(term_list: str) -> pd.DataFrame:
    """
    Comprehensive validation checking multiple tables.
    Returns detailed information about where each term was found.
    """
    terms = [term.strip() for term in term_list.split(',')]
    terms = [term for term in terms if term]
    
    results = {}
    
    for term in terms:
        # Check in statements table
        statements_query = f"""
        SELECT COUNT(*) as count
        FROM {namespace}.statements
        WHERE subject = '{term}' OR object = '{term}'
        """
        statements_count = spark.sql(statements_query).collect()[0]['count']
        
        # Check in entailed_edge table
        entailed_query = f"""
        SELECT COUNT(*) as count
        FROM {namespace}.entailed_edge
        WHERE subject = '{term}' OR object = '{term}'
        """
        entailed_count = spark.sql(entailed_query).collect()[0]['count']
        
        # Check in term_association table
        term_assoc_query = f"""
        SELECT COUNT(*) as count
        FROM {namespace}.term_association
        WHERE subject = '{term}' OR object = '{term}'
        """
        term_assoc_count = spark.sql(term_assoc_query).collect()[0]['count']
        
        # Get label if exists
        label_query = f"""
        SELECT value as label
        FROM {namespace}.statements
        WHERE subject = '{term}' AND predicate = 'rdfs:label'
        LIMIT 1
        """
        label_result = spark.sql(label_query).collect()
        label = label_result[0]['label'] if label_result else None
        
        results[term] = {
            'exists': (statements_count + entailed_count + term_assoc_count) > 0,
            'label': label,
            'in_statements': statements_count,
            'in_entailed_edge': entailed_count,
            'in_term_association': term_assoc_count,
            'total_occurrences': statements_count + entailed_count + term_assoc_count
        }
    
    # Create summary DataFrame
    summary_data = []
    for term, info in results.items():
        summary_data.append({
            'term': term,
            'exists': info['exists'],
            'label': info['label'],
            'statements': info['in_statements'],
            'entailed_edge': info['in_entailed_edge'],
            'term_association': info['in_term_association'],
            'total': info['total_occurrences']
        })
    
    summary_df = pd.DataFrame(summary_data)
    
    # Print summary
    existing = summary_df[summary_df['exists']]
    missing = summary_df[~summary_df['exists']]
    
    print(f"\nComprehensive Validation Results:")
    print(f"- Found: {len(existing)} terms")
    print(f"- Missing: {len(missing)} terms")
    
    return summary_df

# Test comprehensive validation
comprehensive_results = validate_terms_comprehensive(test_terms)
display(comprehensive_results)

                                                                                


Comprehensive Validation Results:
- Found: 3 terms
- Missing: 2 terms

index,term,exists,label,statements,entailed_edge,term_association,total
0,OMP:0006023,True,carbon source utilization phenotype,25,15,0,40
1,ECO:0006056,True,high throughput evidence used in manual assertion,24,5,0,29
2,OMP:9999999,False,,0,0,0,0
3,ECO:0000001,True,inference from background scientific knowledge,11,5,0,16
4,FAKE:12345,False,,0,0,0,0
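
The per-term loop above sends several queries for every term. One possible alternative is a single query per table that counts all terms at once. The sketch below only builds the SQL text; the function name `build_occurrence_query` is a hypothetical helper, and it assumes the terms have already been checked to contain no quote characters.

```python
from typing import Iterable

def build_occurrence_query(namespace: str, table: str, terms: Iterable[str]) -> str:
    """Build one query that counts subject/object occurrences for all terms at once."""
    term_list = list(terms)
    if not term_list:
        raise ValueError("at least one term is required")  # IN () would be invalid SQL
    # Assumption: terms are pre-validated IDs without quote characters.
    in_list = ", ".join(f"'{t}'" for t in term_list)
    return f"""
    SELECT term, COUNT(*) AS count
    FROM (
        SELECT subject AS term FROM {namespace}.{table} WHERE subject IN ({in_list})
        UNION ALL
        SELECT object AS term FROM {namespace}.{table} WHERE object IN ({in_list})
    ) AS hits
    GROUP BY term
    """

query = build_occurrence_query("ontology_data", "statements", ["OMP:0006023", "FAKE:12345"])
print(query)
```

Two differences from the loop are worth keeping in mind: terms with no occurrences are simply absent from the result rather than reported with a count of 0, and a row whose subject and object are both the same term is counted twice here but once by the `OR` filter.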


## 3. Querying OMP (Ontology of Microbial Phenotypes) Terms

Let's explore OMP terms and their relationships in the CDM ontology.

In [4]:
# Count OMP terms
def explore_omp_terms():
    # Count total OMP terms
    count_query = f"""
    SELECT COUNT(DISTINCT subject) as omp_term_count
    FROM {namespace}.statements
    WHERE subject LIKE 'OMP:%'
    """
    
    count = spark.sql(count_query).collect()[0]['omp_term_count']
    print(f"Total OMP terms in the ontology: {count:,}\n")
    
    # Get sample OMP terms with labels
    sample_query = f"""
    SELECT 
        subject as omp_term,
        value as label
    FROM {namespace}.statements
    WHERE subject LIKE 'OMP:%'
    AND predicate = 'rdfs:label'
    LIMIT 20
    """
    
    sample_df = spark.sql(sample_query).toPandas()
    print("Sample OMP terms:")
    display(sample_df)
    
    return sample_df

omp_samples = explore_omp_terms()

                                                                                

Total OMP terms in the ontology: 2,059

Sample OMP terms:

index,omp_term,label
0,OMP:0000000,microbial phenotype
1,OMP:0000001,motility phenotype
2,OMP:0000002,decreased motility
3,OMP:0000003,cell arrangement phenotype
4,OMP:0000004,taxis phenotype
5,OMP:0000005,presence of motility
6,OMP:0000006,presence of akinete formation
7,OMP:0000007,absence of akinete formation
8,OMP:0000008,presence of budding
9,OMP:0000009,decreased cell surface area-to-volume ratio


In [5]:
# Query OMP term hierarchy
def get_omp_hierarchy(omp_term: str):
    """
    Get the hierarchy (parents and children) for a specific OMP term.
    """
    print(f"Querying hierarchy for: {omp_term}\n")
    
    # Get term label
    label_query = f"""
    SELECT value as label
    FROM {namespace}.statements
    WHERE subject = '{omp_term}' AND predicate = 'rdfs:label'
    """
    label_result = spark.sql(label_query).collect()
    if label_result:
        print(f"Term: {omp_term}")
        print(f"Label: {label_result[0]['label']}\n")
    
    # Get parents
    parents_query = f"""
    SELECT 
        s1.object as parent_term,
        s2.value as parent_label
    FROM {namespace}.statements s1
    LEFT JOIN {namespace}.statements s2
        ON s1.object = s2.subject AND s2.predicate = 'rdfs:label'
    WHERE s1.subject = '{omp_term}'
    AND s1.predicate = 'rdfs:subClassOf'
    AND s1.object LIKE 'OMP:%'
    """
    
    parents_df = spark.sql(parents_query).toPandas()
    print(f"Parent terms ({len(parents_df)}):")
    display(parents_df)
    
    # Get children
    children_query = f"""
    SELECT 
        s1.subject as child_term,
        s2.value as child_label
    FROM {namespace}.statements s1
    LEFT JOIN {namespace}.statements s2
        ON s1.subject = s2.subject AND s2.predicate = 'rdfs:label'
    WHERE s1.object = '{omp_term}'
    AND s1.predicate = 'rdfs:subClassOf'
    AND s1.subject LIKE 'OMP:%'
    LIMIT 10
    """
    
    children_df = spark.sql(children_query).toPandas()
    print(f"\nChild terms (showing up to 10 of possibly more):")
    display(children_df)
    
    return parents_df, children_df

# Example: Get hierarchy for a phenotype term
if len(omp_samples) > 0:
    example_term = omp_samples.iloc[0]['omp_term']
    parents, children = get_omp_hierarchy(example_term)

Querying hierarchy for: OMP:0000000

Term: OMP:0000000
Label: microbial phenotype

Parent terms (0):

index,parent_term,parent_label

Child terms (showing up to 10 of possibly more):

index,child_term,child_label
0,OMP:0000003,cell arrangement phenotype
1,OMP:0000026,host-virus interaction phenotype
2,OMP:0000180,metabolic phenotype
3,OMP:0000197,microbe-host interaction phenotype
4,OMP:0000207,serotype phenotype
5,OMP:0000214,cell staining phenotype
6,OMP:0000221,enzymatic activity phenotype
7,OMP:0000290,genetic material phenotype
8,OMP:0005116,multi-organism process phenotype
9,OMP:0006022,nutrient utilization phenotype
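
`get_omp_hierarchy` returns only direct parents and children. To walk the whole chain up to the root, the `rdfs:subClassOf` rows can be collected into a child-to-parents map and traversed in plain Python. This is a minimal sketch under that assumption; `get_ancestors` is a hypothetical helper, and one of the toy edges is invented for illustration.

```python
from collections import deque
from typing import Dict, List, Set

def get_ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every ancestor of `term` by following child -> parent edges breadth-first."""
    seen: Set[str] = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path (also guards against cycles)
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen

# Toy edge map. The first two edges match the child list shown above;
# the OMP:0006023 -> OMP:0006022 edge is a hypothetical example, not verified data.
parents = {
    "OMP:0000003": ["OMP:0000000"],
    "OMP:0006022": ["OMP:0000000"],
    "OMP:0006023": ["OMP:0006022"],  # hypothetical
}
print(sorted(get_ancestors("OMP:0006023", parents)))  # ['OMP:0000000', 'OMP:0006022']
```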


In [6]:
# Search OMP terms by keyword
def search_omp_by_keyword(keyword: str):
    """
    Search for OMP terms containing a specific keyword in their label.
    """
    query = f"""
    SELECT 
        subject as omp_term,
        value as label
    FROM {namespace}.statements
    WHERE subject LIKE 'OMP:%'
    AND predicate = 'rdfs:label'
    AND LOWER(value) LIKE LOWER('%{keyword}%')
    ORDER BY value
    LIMIT 20
    """
    
    results_df = spark.sql(query).toPandas()
    print(f"OMP terms containing '{keyword}' (showing up to 20):")
    display(results_df)
    
    # Count total matches
    count_query = f"""
    SELECT COUNT(*) as total_matches
    FROM {namespace}.statements
    WHERE subject LIKE 'OMP:%'
    AND predicate = 'rdfs:label'
    AND LOWER(value) LIKE LOWER('%{keyword}%')
    """
    
    total = spark.sql(count_query).collect()[0]['total_matches']
    print(f"\nTotal OMP terms matching '{keyword}': {total}")
    
    return results_df

# Search examples
growth_terms = search_omp_by_keyword('growth')
print("\n" + "="*60 + "\n")
resistance_terms = search_omp_by_keyword('resistance')



OMP terms containing 'growth' (showing up to 20):

index,omp_term,label
0,OMP:0000039,O2 effects on population growth phenotype
1,OMP:0005195,abolished anaerobic population growth
2,OMP:0007690,abolished cell growth
3,OMP:0007789,abolished cell separation after cell division ...
4,OMP:0007270,abolished drug-dependence of population growth
5,OMP:0005247,abolished filamentous growth
6,OMP:0007599,abolished population growth
7,OMP:0006116,abolished population growth at acidic pH
8,OMP:0006117,abolished population growth at alkaline pH
9,OMP:0007955,abolished population growth at high temperature


                                                                                


Total OMP terms matching 'growth': 179

OMP terms containing 'resistance' (showing up to 20):

index,omp_term,label
0,OMP:0007869,A22 antimicrobial agent resistance phenotype
1,OMP:0007362,Increased resistance to a nucleoside
2,OMP:0006101,abolished caffeine resistance
3,OMP:0007883,abolished chemical resistance
4,OMP:0007861,abolished fosfomycin resistance
5,OMP:0007874,abolished resistance to A22
6,OMP:0005135,abolished resistance to SDS-EDTA stress
7,OMP:0005134,abolished resistance to UV radiation
8,OMP:0007202,abolished resistance to UV-C radiation
9,OMP:0007914,abolished resistance to a chelator





Total OMP terms matching 'resistance': 286


                                                                                

## 4. Querying ECO (Evidence and Conclusion Ontology) Terms

Let's explore ECO terms which describe evidence types used in biological research.

In [7]:
# Explore ECO terms
def explore_eco_terms():
    # Count total ECO terms
    count_query = f"""
    SELECT COUNT(DISTINCT subject) as eco_term_count
    FROM {namespace}.statements
    WHERE subject LIKE 'ECO:%'
    """
    
    count = spark.sql(count_query).collect()[0]['eco_term_count']
    print(f"Total ECO terms in the ontology: {count:,}\n")
    
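    # Get ECO term categories, assigned by case-sensitive keyword match on the label.
    # The LEFT JOIN on rdfs:subClassOf yields one row per parent, so a term with
    # several parents appears more than once in the result.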
    categories_query = f"""
    WITH eco_top_level AS (
        SELECT DISTINCT
            s1.subject as eco_term,
            s1.value as label,
            s2.object as parent
        FROM {namespace}.statements s1
        LEFT JOIN {namespace}.statements s2
            ON s1.subject = s2.subject AND s2.predicate = 'rdfs:subClassOf'
        WHERE s1.subject LIKE 'ECO:%'
        AND s1.predicate = 'rdfs:label'
    )
    SELECT 
        eco_term,
        label,
        CASE 
            WHEN label LIKE '%experimental%' THEN 'Experimental evidence'
            WHEN label LIKE '%computational%' THEN 'Computational evidence'
            WHEN label LIKE '%similarity%' THEN 'Similarity evidence'
            WHEN label LIKE '%manual%' THEN 'Manual assertion'
            WHEN label LIKE '%automatic%' THEN 'Automatic assertion'
            ELSE 'Other evidence type'
        END as evidence_category
    FROM eco_top_level
    LIMIT 30
    """
    
    categories_df = spark.sql(categories_query).toPandas()
    print("Sample ECO terms by category:")
    display(categories_df)
    
    return categories_df

eco_samples = explore_eco_terms()

                                                                                

Total ECO terms in the ontology: 2,239



                                                                                

Sample ECO terms by category:

index,eco_term,label,evidence_category
0,ECO:0000030,BLAST evidence used in manual assertion,Manual assertion
1,ECO:0000030,BLAST evidence used in manual assertion,Manual assertion
2,ECO:0000069,differential methylation hybridization evidence,Other evidence type
3,ECO:0001247,point mutation phenotypic evidence used in man...,Manual assertion
4,ECO:0001247,point mutation phenotypic evidence used in man...,Manual assertion
5,ECO:0006258,GAL4-VP16 functional complementation evidence,Other evidence type
6,ECO:0006320,cell aggregation evidence used in automatic as...,Automatic assertion
7,ECO:0006320,cell aggregation evidence used in automatic as...,Automatic assertion
8,ECO:0007363,imaging assay evidence used in automatic asser...,Automatic assertion
9,ECO:0007363,imaging assay evidence used in automatic asser...,Automatic assertion
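
The `CASE` expression in the query assigns each label the first category whose keyword it contains, so the order of the branches matters. The same rule can be sketched in plain Python; `categorize_label` is a hypothetical helper, and unlike the SQL `LIKE` it lowercases the label before matching.

```python
from typing import List, Tuple

# (keyword, category) pairs in the same order as the CASE expression: first match wins.
CATEGORY_RULES: List[Tuple[str, str]] = [
    ("experimental", "Experimental evidence"),
    ("computational", "Computational evidence"),
    ("similarity", "Similarity evidence"),
    ("manual", "Manual assertion"),
    ("automatic", "Automatic assertion"),
]

def categorize_label(label: str) -> str:
    """Return the first matching category, or the fallback if no keyword matches."""
    lowered = label.lower()  # case-insensitive, unlike the SQL LIKE in the query
    for keyword, category in CATEGORY_RULES:
        if keyword in lowered:
            return category
    return "Other evidence type"

labels = [
    "BLAST evidence used in manual assertion",
    "differential methylation hybridization evidence",
    "experimental evidence used in manual assertion",
]
for label in labels:
    print(f"{label} -> {categorize_label(label)}")
```

The third label contains both `experimental` and `manual`; it lands in "Experimental evidence" because that rule comes first.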


In [8]:
# Get ECO evidence types used in feature annotations
def get_eco_usage_in_annotations():
    """
    Find which ECO terms are actually used in feature annotations.
    """
    # Check if ECO terms are used in term_association
    usage_query = f"""
    WITH eco_in_associations AS (
        SELECT 
            ta.predicate,
            ta.object as eco_term,
            COUNT(*) as usage_count
        FROM {namespace}.term_association ta
        WHERE ta.object LIKE 'ECO:%'
        GROUP BY ta.predicate, ta.object
    ),
    eco_with_labels AS (
        SELECT 
            ea.predicate,
            ea.eco_term,
            ea.usage_count,
            s.value as eco_label
        FROM eco_in_associations ea
        LEFT JOIN {namespace}.statements s
            ON ea.eco_term = s.subject AND s.predicate = 'rdfs:label'
    )
    SELECT * FROM eco_with_labels
    ORDER BY usage_count DESC
    LIMIT 20
    """
    
    usage_df = spark.sql(usage_query).toPandas()
    
    if len(usage_df) > 0:
        print("ECO terms used in term associations:")
        display(usage_df)
    else:
        print("No ECO terms found in term_association table.")
        
        # Check other tables
        print("\nChecking for ECO terms in statements as evidence qualifiers...")
        
        evidence_query = f"""
        SELECT 
            predicate,
            object as eco_term,
            COUNT(*) as usage_count
        FROM {namespace}.statements
        WHERE object LIKE 'ECO:%'
        GROUP BY predicate, object
        ORDER BY usage_count DESC
        LIMIT 10
        """
        
        evidence_df = spark.sql(evidence_query).toPandas()
        if len(evidence_df) > 0:
            print("ECO terms used as objects in statements:")
            display(evidence_df)
    
    return usage_df

eco_usage = get_eco_usage_in_annotations()

                                                                                

No ECO terms found in term_association table.

Checking for ECO terms in statements as evidence qualifiers...




ECO terms used as objects in statements:

index,predicate,eco_term,usage_count
0,owl:onProperty,ECO:9000000,2194
1,owl:someValuesFrom,ECO:0000218,1271
2,owl:someValuesFrom,ECO:0000203,923
3,rdfs:subClassOf,ECO:0000002,57
4,rdfs:subClassOf,ECO:0008039,41
5,rdfs:subClassOf,ECO:0000015,26
6,owl:annotatedProperty,ECO:9000002,26
7,rdfs:subClassOf,ECO:0000006,21
8,rdfs:subClassOf,ECO:0000059,19
9,rdfs:subClassOf,ECO:0000021,18


In [9]:
# Search ECO terms by evidence type
def search_eco_by_type(evidence_type: str):
    """
    Search for ECO terms by evidence type (e.g., 'experimental', 'computational', 'manual')
    """
    query = f"""
    SELECT 
        subject as eco_term,
        value as label
    FROM {namespace}.statements
    WHERE subject LIKE 'ECO:%'
    AND predicate = 'rdfs:label'
    AND LOWER(value) LIKE LOWER('%{evidence_type}%')
    ORDER BY value
    LIMIT 15
    """
    
    results_df = spark.sql(query).toPandas()
    print(f"ECO terms for '{evidence_type}' evidence:")
    display(results_df)
    
    return results_df

# Examples of different evidence types
experimental_eco = search_eco_by_type('experimental')
print("\n" + "="*60 + "\n")
computational_eco = search_eco_by_type('computational')
print("\n" + "="*60 + "\n")
similarity_eco = search_eco_by_type('similarity')



ECO terms for 'experimental' evidence:

index,eco_term,label
0,ECO:0007665,automatically integrated combinatorial computa...
1,ECO:0007667,automatically integrated combinatorial computa...
2,ECO:0007666,automatically integrated combinatorial computa...
3,ECO:0007658,automatically integrated combinatorial experim...
4,ECO:0007660,automatically integrated combinatorial experim...
5,ECO:0007659,automatically integrated combinatorial experim...
6,ECO:0005551,biological system reconstruction evidence by e...
7,ECO:0005552,biological system reconstruction evidence by e...
8,ECO:0007480,biological system reconstruction evidence by e...
9,ECO:0005543,biological system reconstruction evidence by e...








ECO terms for 'computational' evidence:

index,eco_term,label
0,ECO:0007665,automatically integrated combinatorial computa...
1,ECO:0007667,automatically integrated combinatorial computa...
2,ECO:0007666,automatically integrated combinatorial computa...
3,ECO:0007651,automatically integrated combinatorial computa...
4,ECO:0007653,automatically integrated combinatorial computa...
5,ECO:0007652,automatically integrated combinatorial computa...
6,ECO:0007661,combinatorial computational and experimental e...
7,ECO:0007829,combinatorial computational and experimental e...
8,ECO:0007744,combinatorial computational and experimental e...
9,ECO:0007677,combinatorial computational evidence








ECO terms for 'similarity' evidence:

index,eco_term,label
0,ECO:0000063,compositional similarity evidence
1,ECO:0007200,compositional similarity evidence used in auto...
2,ECO:0007096,compositional similarity evidence used in manu...
3,ECO:0000067,developmental similarity evidence
4,ECO:0007201,developmental similarity evidence used in auto...
5,ECO:0007097,developmental similarity evidence used in manu...
6,ECO:0000075,gene expression similarity evidence
7,ECO:0007203,gene expression similarity evidence used in au...
8,ECO:0007099,gene expression similarity evidence used in ma...
9,ECO:0000051,genetic similarity evidence


## 5. Cross-Ontology Queries

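Let's look at statements whose subject and object carry different prefixes. Blank nodes (IDs starting with `_:`) and namespaces such as `owl` and `obo` also count as prefixes here, so many of the top results are OWL axioms rather than links between two ontologies.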

In [10]:
def find_cross_ontology_relationships():
    """
    Find relationships between terms from different ontologies.
    """
    # Look for statements that connect different ontology prefixes
    query = f"""
    WITH cross_refs AS (
        SELECT 
            s.subject,
            s.predicate,
            s.object,
            SUBSTRING_INDEX(s.subject, ':', 1) as subject_prefix,
            SUBSTRING_INDEX(s.object, ':', 1) as object_prefix
        FROM {namespace}.statements s
        WHERE s.subject LIKE '%:%' 
        AND s.object LIKE '%:%'
        AND SUBSTRING_INDEX(s.subject, ':', 1) != SUBSTRING_INDEX(s.object, ':', 1)
    )
    SELECT 
        subject_prefix,
        object_prefix,
        predicate,
        COUNT(*) as relationship_count
    FROM cross_refs
    WHERE subject_prefix IN ('OMP', 'ECO', 'GO', 'seed', 'EC')
    OR object_prefix IN ('OMP', 'ECO', 'GO', 'seed', 'EC')
    GROUP BY subject_prefix, object_prefix, predicate
    ORDER BY relationship_count DESC
    LIMIT 20
    """
    
    cross_ref_df = spark.sql(query).toPandas()
    print("Cross-ontology relationships:")
    display(cross_ref_df)
    
    # Show some examples
    if len(cross_ref_df) > 0:
        top_relationship = cross_ref_df.iloc[0]
        example_query = f"""
        SELECT 
            s.subject,
            s.predicate,
            s.object,
            s1.value as subject_label,
            s2.value as object_label
        FROM {namespace}.statements s
        LEFT JOIN {namespace}.statements s1
            ON s.subject = s1.subject AND s1.predicate = 'rdfs:label'
        LEFT JOIN {namespace}.statements s2
            ON s.object = s2.subject AND s2.predicate = 'rdfs:label'
        WHERE SUBSTRING_INDEX(s.subject, ':', 1) = '{top_relationship['subject_prefix']}'
        AND SUBSTRING_INDEX(s.object, ':', 1) = '{top_relationship['object_prefix']}'
        AND s.predicate = '{top_relationship['predicate']}'
        LIMIT 5
        """
        
        examples_df = spark.sql(example_query).toPandas()
        print(f"\nExample relationships between {top_relationship['subject_prefix']} and {top_relationship['object_prefix']}:")
        display(examples_df)
    
    return cross_ref_df

cross_ontology_rels = find_cross_ontology_relationships()



Cross-ontology relationships:

index,subject_prefix,object_prefix,predicate,relationship_count
0,EC,_,rdfs:subClassOf,259726
1,_,GO,owl:annotatedSource,131427
2,_,GO,owl:someValuesFrom,67511
3,GO,obo,rdfs:isDefinedBy,51747
4,GO,owl,rdf:type,51747
5,GO,_,rdfs:subClassOf,17199
6,EC,owl,rdf:type,14856
7,_,GO,rdf:first,10566
8,GO,_,owl:equivalentClass,10116
9,_,EC,owl:someValuesFrom,9253


                                                                                


Example relationships between EC and _:

index,subject,predicate,object,subject_label,object_label
0,EC:1.1.1.n1,rdfs:subClassOf,_:riog04912881,,
1,EC:1.1.1.n10,rdfs:subClassOf,_:riog04912883,,
2,EC:1.1.1.n11,rdfs:subClassOf,_:riog04912889,succinic semialdehyde reductase,
3,EC:1.1.1.n11,rdfs:subClassOf,_:riog04912888,succinic semialdehyde reductase,
4,EC:1.1.1.n11,rdfs:subClassOf,_:riog04912887,succinic semialdehyde reductase,
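
The `_` prefix in the results comes from blank-node IDs such as `_:riog04912881`: `SUBSTRING_INDEX(term, ':', 1)` returns the text before the first colon, which for a blank node is just `_`. The sketch below mirrors that logic in plain Python and shows one way to skip blank nodes when counting prefix pairs. The helper names are hypothetical, and the last sample pair is an invented example rather than data from the tables.

```python
from collections import Counter
from typing import Iterable, Tuple

def curie_prefix(term: str) -> str:
    """Mirror SUBSTRING_INDEX(term, ':', 1): the text before the first colon."""
    return term.split(":", 1)[0]

def is_blank_node(term: str) -> bool:
    """Blank-node IDs start with '_:' in the statements shown above."""
    return term.startswith("_:")

def count_prefix_pairs(pairs: Iterable[Tuple[str, str]], skip_blank_nodes: bool = True) -> Counter:
    """Count (subject_prefix, object_prefix) pairs whose prefixes differ."""
    counts: Counter = Counter()
    for subject, obj in pairs:
        if skip_blank_nodes and (is_blank_node(subject) or is_blank_node(obj)):
            continue
        s_prefix, o_prefix = curie_prefix(subject), curie_prefix(obj)
        if s_prefix != o_prefix:
            counts[(s_prefix, o_prefix)] += 1
    return counts

pairs = [
    ("EC:1.1.1.n1", "_:riog04912881"),   # from the example output above
    ("EC:1.1.1.n10", "_:riog04912883"),  # from the example output above
    ("OMP:0006023", "GO:0008150"),       # hypothetical cross-ontology link
]
print(count_prefix_pairs(pairs))                          # Counter({('OMP', 'GO'): 1})
print(count_prefix_pairs(pairs, skip_blank_nodes=False))  # Counter({('EC', '_'): 2, ('OMP', 'GO'): 1})
```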


## 6. Batch Validation Function

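A convenience function that validates a list of terms against the `statements` table, reports results per ontology prefix, and optionally saves them to a CSV file. It runs one query per term, so very long lists may be slow.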

In [11]:
def batch_validate_terms(terms_list: List[str], output_file: str = None) -> pd.DataFrame:
    """
    Validate a large batch of ontology terms and optionally save results to a file.
    
    Args:
        terms_list: List of ontology term strings
        output_file: Optional CSV file path to save results
    
    Returns:
        DataFrame with validation results
    """
    results = []
    
    for term in terms_list:
        term = term.strip()
        
        # Check existence and get label
        query = f"""
        SELECT 
            '{term}' as term,
            CASE WHEN COUNT(*) > 0 THEN true ELSE false END as exists,
            MAX(CASE WHEN predicate = 'rdfs:label' THEN value END) as label,
            COUNT(*) as total_statements,
            COUNT(DISTINCT predicate) as predicate_count,
            SUBSTRING_INDEX('{term}', ':', 1) as ontology_prefix
        FROM {namespace}.statements
        WHERE subject = '{term}'
        """
        
        result = spark.sql(query).collect()[0]
        
        results.append({
            'term': term,
            'exists': result['exists'],
            'label': result['label'],
            'ontology': result['ontology_prefix'],
            'statement_count': result['total_statements'],
            'property_count': result['predicate_count']
        })
    
    # Create DataFrame
    results_df = pd.DataFrame(results)
    
    # Add summary statistics
    total_terms = len(results_df)
    found_terms = len(results_df[results_df['exists']])
    missing_terms = total_terms - found_terms
    
    print(f"\nBatch Validation Summary:")
    print(f"- Total terms checked: {total_terms}")
    print(f"- Found: {found_terms} ({found_terms/total_terms*100:.1f}%)")
    print(f"- Missing: {missing_terms} ({missing_terms/total_terms*100:.1f}%)")
    
    # Group by ontology
    ontology_summary = results_df.groupby('ontology').agg({
        'term': 'count',
        'exists': 'sum'
    }).rename(columns={'term': 'total', 'exists': 'found'})
    ontology_summary['missing'] = ontology_summary['total'] - ontology_summary['found']
    
    print("\nBy Ontology:")
    display(ontology_summary)
    
    # Save to file if requested
    if output_file:
        results_df.to_csv(output_file, index=False)
        print(f"\nResults saved to: {output_file}")
    
    return results_df

# Example usage with multiple ontology terms
test_batch = [
    "OMP:0006023", "OMP:0000144", "OMP:0007564",
    "ECO:0006056", "ECO:0000001", "ECO:0000269",
    "GO:0008150", "GO:0003674", "GO:0005575",
    "FAKE:00001", "TEST:12345"
]

batch_results = batch_validate_terms(test_batch)
print("\nDetailed Results:")
display(batch_results)




Batch Validation Summary:
- Total terms checked: 11
- Found: 9 (81.8%)
- Missing: 2 (18.2%)

By Ontology:

ontology,total,found,missing
ECO,3,3,0
FAKE,1,0,1
GO,3,3,0
OMP,3,3,0
TEST,1,0,1



Detailed Results:

index,term,exists,label,ontology,statement_count,property_count
0,OMP:0006023,True,carbon source utilization phenotype,OMP,18,12
1,OMP:0000144,True,multiple nucleoids,OMP,9,9
2,OMP:0007564,True,altered meiotic nuclear division,OMP,10,9
3,ECO:0006056,True,high throughput evidence used in manual assertion,ECO,19,15
4,ECO:0000001,True,inference from background scientific knowledge,ECO,7,7
5,ECO:0000269,True,experimental evidence used in manual assertion,ECO,16,14
6,GO:0008150,True,biological_process,GO,25,15
7,GO:0003674,True,molecular_function,GO,15,10
8,GO:0005575,True,cellular_component,GO,19,13
9,FAKE:00001,False,,FAKE,0,0
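
Because `batch_validate_terms` issues one query per term, a long input list means many round trips to Spark. One option, sketched here under the assumption that each chunk would then be checked with a single `IN (...)` filter, is to de-duplicate the list and split it into fixed-size chunks first. Both helper names are hypothetical.

```python
from typing import Iterator, List, Sequence

def unique_terms(terms: Sequence[str]) -> List[str]:
    """Strip whitespace, drop empty strings and duplicates, keep first-seen order."""
    cleaned = (t.strip() for t in terms)
    return list(dict.fromkeys(t for t in cleaned if t))

def chunked(terms: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` terms."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(terms), size):
        yield list(terms[start:start + size])

terms = unique_terms([
    "OMP:0006023", "OMP:0000144", "OMP:0007564",
    "ECO:0006056", "ECO:0000001", "ECO:0000269",
    "GO:0008150", "GO:0003674", "GO:0005575",
    "FAKE:00001", "TEST:12345", " OMP:0006023 ",  # trailing duplicate is dropped
])
print([len(chunk) for chunk in chunked(terms, 4)])  # [4, 4, 3]
```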


## Summary

This notebook provides several methods for:

1. **Validating ontology terms**: Check if terms exist in the CDM ontology database
2. **Comprehensive validation**: Check terms across multiple tables
3. **OMP queries**: Explore microbial phenotype terms, hierarchies, and search by keywords
4. **ECO queries**: Explore evidence ontology terms and their usage
5. **Cross-ontology analysis**: Find relationships between different ontologies
6. **Batch validation**: Process large lists of terms with detailed reporting

The validation functions return pandas DataFrames and print a short summary. Together they report:
- Found and missing terms
- Term labels where available
- Occurrence counts per table
- Summary messages indicating whether all terms exist or how many are missing