# Demo Notebook - July 10th

This notebook demonstrates the building blocks that lead to Query 0 (Genome Feature to Reaction Mapping) and explores comparative genomics across 50 E. coli strains.

## Part 1: Building Blocks for Query 0

Let's explore each component that makes the genome-to-reaction mapping possible.

In [1]:
# Setup - Import required libraries and initialize Spark
from spark.utils import get_spark_session
import time
import pandas as pd
from IPython.display import display

spark = get_spark_session()
namespace = 'ontology_data'

# Helper function to time queries
def time_query(query_name, query_func):
    """Execute a query and print execution time"""
    print(f"\n{'='*60}")
    print(f"Executing: {query_name}")
    print(f"{'='*60}")
    start_time = time.time()
    result = query_func()
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"\nQuery execution time: {execution_time:.2f} seconds")
    return result

25/07/10 02:52:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/10 02:52:03 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
25/07/10 02:52:05 WARN S3ABlockOutputStream: Application invoked the Syncable API against stream writing to spark-job-logs/jplfaria/app-20250710025202-0022.inprogress. This is unsupported
25/07/10 02:52:05 WARN Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.


### 1. Exploring SEED Reactions in the Ontology

SEED reactions are stored in the statements table with their labels. Let's see what they look like:

In [2]:
def explore_seed_reactions():
    query = f"""
    SELECT 
        subject as reaction_id,
        predicate,
        value as reaction_name
    FROM {namespace}.statements
    WHERE subject LIKE 'seed.reaction:%'
    AND predicate = 'rdfs:label'
    LIMIT 20
    """
    
    df = spark.sql(query).toPandas()
    print(f"Sample SEED reactions from the ontology:")
    display(df)
    
    # Count total reactions
    count_query = f"""
    SELECT COUNT(DISTINCT subject) as total_reactions
    FROM {namespace}.statements
    WHERE subject LIKE 'seed.reaction:%'
    AND predicate = 'rdfs:label'
    """
    count = spark.sql(count_query).collect()[0]['total_reactions']
    print(f"\nTotal SEED reactions in ontology: {count:,}")
    
    return df

time_query("Explore SEED Reactions", explore_seed_reactions)


Executing: Explore SEED Reactions


[Stage 4:>                                                          (0 + 1) / 1]

Sample SEED reactions from the ontology:


                                                                                

Unnamed: 0,reaction_id,predicate,reaction_name
0,seed.reaction:rxn00001,rdfs:label,diphosphate phosphohydrolase
1,seed.reaction:rxn00002,rdfs:label,urea-1-carboxylate amidohydrolase
2,seed.reaction:rxn00003,rdfs:label,pyruvate:pyruvate acetaldehydetransferase (dec...
3,seed.reaction:rxn00004,rdfs:label,4-hydroxy-4-methyl-2-oxoglutarate pyruvate-lya...
4,seed.reaction:rxn00006,rdfs:label,hydrogen-peroxide:hydrogen-peroxide oxidoreduc...
5,seed.reaction:rxn00007,rdfs:label,"alpha,alpha-trehalose glucohydrolase"
6,seed.reaction:rxn00008,rdfs:label,Mn(II):hydrogen-peroxide oxidoreductase
7,seed.reaction:rxn00009,rdfs:label,GTP:GTP guanylyltransferase
8,seed.reaction:rxn00010,rdfs:label,glyoxylate carboxy-lyase (dimerizing; tartrona...
9,seed.reaction:rxn00011,rdfs:label,pyruvate:thiamin diphosphate acetaldehydetrans...





Total SEED reactions in ontology: 44,901

Query execution time: 45.72 seconds


                                                                                

Unnamed: 0,reaction_id,predicate,reaction_name
0,seed.reaction:rxn00001,rdfs:label,diphosphate phosphohydrolase
1,seed.reaction:rxn00002,rdfs:label,urea-1-carboxylate amidohydrolase
2,seed.reaction:rxn00003,rdfs:label,pyruvate:pyruvate acetaldehydetransferase (dec...
3,seed.reaction:rxn00004,rdfs:label,4-hydroxy-4-methyl-2-oxoglutarate pyruvate-lya...
4,seed.reaction:rxn00006,rdfs:label,hydrogen-peroxide:hydrogen-peroxide oxidoreduc...
5,seed.reaction:rxn00007,rdfs:label,"alpha,alpha-trehalose glucohydrolase"
6,seed.reaction:rxn00008,rdfs:label,Mn(II):hydrogen-peroxide oxidoreductase
7,seed.reaction:rxn00009,rdfs:label,GTP:GTP guanylyltransferase
8,seed.reaction:rxn00010,rdfs:label,glyoxylate carboxy-lyase (dimerizing; tartrona...
9,seed.reaction:rxn00011,rdfs:label,pyruvate:thiamin diphosphate acetaldehydetrans...


### 2. Exploring SEED Roles in the Ontology

SEED roles represent enzyme functions. Let's examine them:

In [3]:
def explore_seed_roles():
    query = f"""
    SELECT 
        subject as role_id,
        predicate,
        value as role_name
    FROM {namespace}.statements
    WHERE subject LIKE 'seed.role:%'
    AND predicate = 'rdfs:label'
    LIMIT 20
    """
    
    df = spark.sql(query).toPandas()
    print(f"Sample SEED roles (enzyme functions) from the ontology:")
    display(df)
    
    # Count total roles
    count_query = f"""
    SELECT COUNT(DISTINCT subject) as total_roles
    FROM {namespace}.statements
    WHERE subject LIKE 'seed.role:%'
    AND predicate = 'rdfs:label'
    """
    count = spark.sql(count_query).collect()[0]['total_roles']
    print(f"\nTotal SEED roles in ontology: {count:,}")
    
    return df

time_query("Explore SEED Roles", explore_seed_roles)


Executing: Explore SEED Roles




Sample SEED roles (enzyme functions) from the ontology:


                                                                                

Unnamed: 0,role_id,predicate,role_name
0,seed.role:0000000000001,rdfs:label,(+)-caryolan-1-ol synthase (EC 4.2.1.138)
1,seed.role:0000000000002,rdfs:label,"(2E,6E)-farnesyl diphosphate synthase (EC 2.5...."
2,seed.role:0000000000003,rdfs:label,"(2E,6Z)-farnesyl diphosphate synthase (EC 2.5...."
3,seed.role:0000000000004,rdfs:label,(2R)-sulfolactate sulfo-lyase subunit alpha (E...
4,seed.role:0000000000005,rdfs:label,(2R)-sulfolactate sulfo-lyase subunit beta (EC...
5,seed.role:0000000000006,rdfs:label,(3R)-hydroxyacyl-ACP dehydratase subunit HadA
6,seed.role:0000000000007,rdfs:label,(3R)-hydroxyacyl-ACP dehydratase subunit HadB
7,seed.role:0000000000008,rdfs:label,(3R)-hydroxyacyl-ACP dehydratase subunit HadC
8,seed.role:0000000000011,rdfs:label,(Carboxyethyl)arginine beta-lactam-synthase (E...
9,seed.role:0000000000013,rdfs:label,"(R)-2-hydroxyacid dehydrogenase, similar to L-..."





Total SEED roles in ontology: 15,383

Query execution time: 13.69 seconds


                                                                                

Unnamed: 0,role_id,predicate,role_name
0,seed.role:0000000000001,rdfs:label,(+)-caryolan-1-ol synthase (EC 4.2.1.138)
1,seed.role:0000000000002,rdfs:label,"(2E,6E)-farnesyl diphosphate synthase (EC 2.5...."
2,seed.role:0000000000003,rdfs:label,"(2E,6Z)-farnesyl diphosphate synthase (EC 2.5...."
3,seed.role:0000000000004,rdfs:label,(2R)-sulfolactate sulfo-lyase subunit alpha (E...
4,seed.role:0000000000005,rdfs:label,(2R)-sulfolactate sulfo-lyase subunit beta (EC...
5,seed.role:0000000000006,rdfs:label,(3R)-hydroxyacyl-ACP dehydratase subunit HadA
6,seed.role:0000000000007,rdfs:label,(3R)-hydroxyacyl-ACP dehydratase subunit HadB
7,seed.role:0000000000008,rdfs:label,(3R)-hydroxyacyl-ACP dehydratase subunit HadC
8,seed.role:0000000000011,rdfs:label,(Carboxyethyl)arginine beta-lactam-synthase (E...
9,seed.role:0000000000013,rdfs:label,"(R)-2-hydroxyacid dehydrogenase, similar to L-..."


### 3. Understanding Term Associations: How Roles Map to Reactions

The term_association table connects SEED roles to reactions they catalyze:

In [4]:
def explore_term_associations():
    query = f"""
    WITH role_reaction_mappings AS (
        SELECT 
            ta.subject as role_id,
            ta.predicate,
            ta.object as reaction_id
        FROM {namespace}.term_association ta
        WHERE ta.subject LIKE 'seed.role:%'
        AND ta.object LIKE 'seed.reaction:%'
        LIMIT 20
    ),
    enriched_mappings AS (
        SELECT 
            m.role_id,
            r.value as role_name,
            m.reaction_id,
            rxn.value as reaction_name,
            m.predicate
        FROM role_reaction_mappings m
        LEFT JOIN {namespace}.statements r 
            ON m.role_id = r.subject AND r.predicate = 'rdfs:label'
        LEFT JOIN {namespace}.statements rxn 
            ON m.reaction_id = rxn.subject AND rxn.predicate = 'rdfs:label'
    )
    SELECT * FROM enriched_mappings
    """
    
    df = spark.sql(query).toPandas()
    print(f"Sample role-to-reaction mappings:")
    display(df)
    
    # Show predicate meaning
    print("\nNote: predicate 'RO:0002327' means 'enables' - the role enables/catalyzes the reaction")
    
    return df

time_query("Explore Term Associations", explore_term_associations)


Executing: Explore Term Associations


                                                                                

Sample role-to-reaction mappings:


Unnamed: 0,role_id,role_name,reaction_id,reaction_name,predicate



Note: predicate 'RO:0002327' means 'enables' - the role enables/catalyzes the reaction

Query execution time: 14.79 seconds


Unnamed: 0,role_id,role_name,reaction_id,reaction_name,predicate


### 4. Exploring Feature Annotations: RAST Roles in Genomes

The feature_annotation table contains RAST annotations that match SEED role names:

In [5]:
def explore_feature_annotations():
    query = f"""
    SELECT 
        genome_id,
        feature_id,
        rast,
        bakta_gene,
        bakta_product
    FROM {namespace}.feature_annotation
    WHERE genome_id = '562.61239'
    AND rast IS NOT NULL
    LIMIT 20
    """
    
    df = spark.sql(query).toPandas()
    print(f"Sample RAST annotations from E. coli genome 562.61239:")
    display(df)
    
    # Count features with RAST annotations
    count_query = f"""
    SELECT 
        COUNT(*) as features_with_rast,
        COUNT(DISTINCT rast) as unique_rast_roles
    FROM {namespace}.feature_annotation
    WHERE genome_id = '562.61239'
    AND rast IS NOT NULL
    """
    counts = spark.sql(count_query).collect()[0]
    print(f"\nGenome 562.61239 has:")
    print(f"  - {counts['features_with_rast']:,} features with RAST annotations")
    print(f"  - {counts['unique_rast_roles']:,} unique RAST roles")
    
    return df

time_query("Explore Feature Annotations", explore_feature_annotations)


Executing: Explore Feature Annotations


                                                                                

Sample RAST annotations from E. coli genome 562.61239:


Unnamed: 0,genome_id,feature_id,rast,bakta_gene,bakta_product
0,562.61239,562.61239_1,Alpha-ketoglutarate permease,,Alpha-ketoglutarate permease
1,562.61239,562.61239_2,Alpha-ketoglutarate permease,,hypothetical protein
2,562.61239,562.61239_3,Putative outer membrane lipoprotein,yfiM,YfiM family lipoprotein
3,562.61239,562.61239_4,CDP-diacylglycerol--serine O-phosphatidyltrans...,pssA,CDP-diacylglycerol--serine O-phosphatidyltrans...
4,562.61239,562.61239_5,Protein lysine acetyltransferase Pat (EC 2.3.1.-),pat,protein lysine acetyltransferase
5,562.61239,562.61239_6,"Uncharacterized conserved protein YfiP, contai...",tapT,tRNA-uridine aminocarboxypropyltransferase
6,562.61239,562.61239_7,Thioredoxin 2,trxC,thioredoxin TrxC
7,562.61239,562.61239_8,Uncharacterized tRNA/rRNA methyltransferase YfiF,yfiF,tRNA/rRNA methyltransferase
8,562.61239,562.61239_9,"Uracil-DNA glycosylase, family 1 (EC 3.2.2.27)",ung,uracil-DNA glycosylase
9,562.61239,562.61239_10,Autonomous glycyl radical cofactor,grcA,autonomous glycyl radical cofactor GrcA





Genome 562.61239 has:
  - 4,410 features with RAST annotations
  - 3,832 unique RAST roles

Query execution time: 6.95 seconds


                                                                                

Unnamed: 0,genome_id,feature_id,rast,bakta_gene,bakta_product
0,562.61239,562.61239_1,Alpha-ketoglutarate permease,,Alpha-ketoglutarate permease
1,562.61239,562.61239_2,Alpha-ketoglutarate permease,,hypothetical protein
2,562.61239,562.61239_3,Putative outer membrane lipoprotein,yfiM,YfiM family lipoprotein
3,562.61239,562.61239_4,CDP-diacylglycerol--serine O-phosphatidyltrans...,pssA,CDP-diacylglycerol--serine O-phosphatidyltrans...
4,562.61239,562.61239_5,Protein lysine acetyltransferase Pat (EC 2.3.1.-),pat,protein lysine acetyltransferase
5,562.61239,562.61239_6,"Uncharacterized conserved protein YfiP, contai...",tapT,tRNA-uridine aminocarboxypropyltransferase
6,562.61239,562.61239_7,Thioredoxin 2,trxC,thioredoxin TrxC
7,562.61239,562.61239_8,Uncharacterized tRNA/rRNA methyltransferase YfiF,yfiF,tRNA/rRNA methyltransferase
8,562.61239,562.61239_9,"Uracil-DNA glycosylase, family 1 (EC 3.2.2.27)",ung,uracil-DNA glycosylase
9,562.61239,562.61239_10,Autonomous glycyl radical cofactor,grcA,autonomous glycyl radical cofactor GrcA


### 5. Connecting Features to Roles: The Key Link

Let's verify that RAST annotations in feature_annotation match SEED role subjects in term_association:

In [6]:
def verify_rast_to_role_connection():
    query = f"""
    WITH genome_rast_roles AS (
        -- Get unique RAST roles from our genome
        SELECT DISTINCT rast
        FROM {namespace}.feature_annotation
        WHERE genome_id = '562.61239'
        AND rast IS NOT NULL
    ),
    matching_term_associations AS (
        -- Find which RAST roles exist in term_association
        SELECT 
            gr.rast,
            COUNT(DISTINCT ta.object) as reaction_count
        FROM genome_rast_roles gr
        INNER JOIN {namespace}.term_association ta
            ON gr.rast = ta.subject
        WHERE ta.object LIKE 'seed.reaction:%'
        GROUP BY gr.rast
    )
    SELECT 
        rast as role_string,
        reaction_count
    FROM matching_term_associations
    ORDER BY reaction_count DESC
    LIMIT 20
    """
    
    df = spark.sql(query).toPandas()
    print(f"RAST roles that successfully map to reactions:")
    display(df)
    
    # Summary statistics
    stats_query = f"""
    WITH genome_rast AS (
        SELECT DISTINCT rast
        FROM {namespace}.feature_annotation
        WHERE genome_id = '562.61239'
        AND rast IS NOT NULL
    ),
    mappable_rast AS (
        SELECT DISTINCT gr.rast
        FROM genome_rast gr
        INNER JOIN {namespace}.term_association ta
            ON gr.rast = ta.subject
        WHERE ta.object LIKE 'seed.reaction:%'
    )
    SELECT 
        (SELECT COUNT(*) FROM genome_rast) as total_rast_roles,
        (SELECT COUNT(*) FROM mappable_rast) as mappable_rast_roles
    """
    stats = spark.sql(stats_query).collect()[0]
    print(f"\nMapping success rate:")
    print(f"  - Total unique RAST roles: {stats['total_rast_roles']}")
    print(f"  - Roles that map to reactions: {stats['mappable_rast_roles']}")
    print(f"  - Success rate: {stats['mappable_rast_roles']/stats['total_rast_roles']*100:.1f}%")
    
    return df

time_query("Verify RAST to Role Connection", verify_rast_to_role_connection)


Executing: Verify RAST to Role Connection


                                                                                

RAST roles that successfully map to reactions:


                                                                                

Unnamed: 0,role_string,reaction_count
0,3-hydroxyacyl-[acyl-carrier-protein] dehydrata...,20
1,Peptidase B (EC 3.4.11.23),19
2,Cytosol aminopeptidase PepA (EC 3.4.11.1),19
3,"Aminopeptidase YpdF (MP-, MA-, MS-, AP-, NP- s...",19
4,Membrane alanine aminopeptidase N (EC 3.4.11.2),18
5,Uridine kinase (EC 2.7.1.48),18
6,Methionine aminopeptidase (EC 3.4.11.18),17
7,Glycerol-3-phosphate acyltransferase (EC 2.3.1...,16
8,CDP-diacylglycerol--glycerol-3-phosphate 3-pho...,13
9,Phosphatidate cytidylyltransferase (EC 2.7.7.41),13



Mapping success rate:
  - Total unique RAST roles: 3832
  - Roles that map to reactions: 808
  - Success rate: 21.1%

Query execution time: 6.73 seconds


Unnamed: 0,role_string,reaction_count
0,3-hydroxyacyl-[acyl-carrier-protein] dehydrata...,20
1,Peptidase B (EC 3.4.11.23),19
2,Cytosol aminopeptidase PepA (EC 3.4.11.1),19
3,"Aminopeptidase YpdF (MP-, MA-, MS-, AP-, NP- s...",19
4,Membrane alanine aminopeptidase N (EC 3.4.11.2),18
5,Uridine kinase (EC 2.7.1.48),18
6,Methionine aminopeptidase (EC 3.4.11.18),17
7,Glycerol-3-phosphate acyltransferase (EC 2.3.1...,16
8,CDP-diacylglycerol--glycerol-3-phosphate 3-pho...,13
9,Phosphatidate cytidylyltransferase (EC 2.7.7.41),13


### 6. Final Query 0: Genome Feature to Reaction Mapping

Now let's put it all together - this is the complete query that maps genome features to reactions:

In [7]:
def query_genome_reactions():
    query = f"""
    WITH genome_features AS (
        -- Step 1: Get features with RAST annotations from our genome
        SELECT 
            f.genome_id,
            f.feature_id,
            f.rast
        FROM {namespace}.feature_annotation f
        WHERE f.genome_id = '562.61239'
        AND f.rast IS NOT NULL
    ),
    feature_reactions AS (
        -- Step 2: Map RAST roles to SEED reactions via term_association
        SELECT DISTINCT
            gf.genome_id,
            gf.feature_id,
            gf.rast,
            ta.object as seed_reaction
        FROM genome_features gf
        INNER JOIN {namespace}.term_association ta
            ON gf.rast = ta.subject  -- This is the key join!
        WHERE ta.object LIKE 'seed.reaction:%'
    ),
    reaction_names AS (
        -- Step 3: Get human-readable reaction names from statements
        SELECT 
            subject as reaction_id,
            value as reaction_name
        FROM {namespace}.statements
        WHERE predicate = 'rdfs:label'
        AND subject LIKE 'seed.reaction:%'
    )
    -- Step 4: Combine everything
    SELECT 
        fr.genome_id,
        fr.feature_id,
        fr.rast,
        fr.seed_reaction,
        rn.reaction_name
    FROM feature_reactions fr
    LEFT JOIN reaction_names rn ON fr.seed_reaction = rn.reaction_id
    ORDER BY fr.genome_id, fr.feature_id
    LIMIT 100
    """
    
    df = spark.sql(query).toPandas()
    print(f"Genome features mapped to their catalyzed reactions:")
    display(df.head(20))
    print(f"\nTotal features with reaction mappings shown: {len(df)}")
    return df

time_query("Complete Genome Feature to Reaction Mapping", query_genome_reactions)


Executing: Complete Genome Feature to Reaction Mapping


                                                                                

Genome features mapped to their catalyzed reactions:


Unnamed: 0,genome_id,feature_id,rast,seed_reaction,reaction_name
0,562.61239,562.61239_1000,Malate synthase G (EC 2.3.3.9),seed.reaction:rxn00330,acetyl-CoA:glyoxylate C-acetyltransferase (thi...
1,562.61239,562.61239_1001,Glycolate permease,seed.reaction:rxn05470,"glycolate transport via proton symport, revers..."
2,562.61239,562.61239_1028,L-asparaginase (EC 3.5.1.1),seed.reaction:rxn00342,L-Asparagine amidohydrolase
3,562.61239,562.61239_1031,"Nucleoside 5-triphosphatase RdgB (dHAPTP, dITP...",seed.reaction:rxn00514,Inosine 5'-triphosphate pyrophosphohydrolase
4,562.61239,562.61239_1031,"Nucleoside 5-triphosphatase RdgB (dHAPTP, dITP...",seed.reaction:rxn02518,2'-Deoxyinosine-5'-triphosphate pyrophosphohyd...
5,562.61239,562.61239_1031,"Nucleoside 5-triphosphatase RdgB (dHAPTP, dITP...",seed.reaction:rxn01962,XTP pyrophosphohydrolase
6,562.61239,562.61239_1038,Glutathione synthetase (EC 6.3.2.3),seed.reaction:rxn00351,gamma-L-glutamyl-L-cysteine:glycine ligase (AD...
7,562.61239,562.61239_1043,S-adenosylmethionine synthetase (EC 2.5.1.6),seed.reaction:rxn00126,ATP:L-methionine S-adenosyltransferase
8,562.61239,562.61239_1043,S-adenosylmethionine synthetase (EC 2.5.1.6),seed.reaction:rxn03264,ATP:L-selenomethione S-adenosyltransferase
9,562.61239,562.61239_1047,Biosynthetic arginine decarboxylase (EC 4.1.1.19),seed.reaction:rxn00405,L-arginine carboxy-lyase (agmatine-forming)



Total features with reaction mappings shown: 100

Query execution time: 5.13 seconds


Unnamed: 0,genome_id,feature_id,rast,seed_reaction,reaction_name
0,562.61239,562.61239_1000,Malate synthase G (EC 2.3.3.9),seed.reaction:rxn00330,acetyl-CoA:glyoxylate C-acetyltransferase (thi...
1,562.61239,562.61239_1001,Glycolate permease,seed.reaction:rxn05470,"glycolate transport via proton symport, revers..."
2,562.61239,562.61239_1028,L-asparaginase (EC 3.5.1.1),seed.reaction:rxn00342,L-Asparagine amidohydrolase
3,562.61239,562.61239_1031,"Nucleoside 5-triphosphatase RdgB (dHAPTP, dITP...",seed.reaction:rxn00514,Inosine 5'-triphosphate pyrophosphohydrolase
4,562.61239,562.61239_1031,"Nucleoside 5-triphosphatase RdgB (dHAPTP, dITP...",seed.reaction:rxn02518,2'-Deoxyinosine-5'-triphosphate pyrophosphohyd...
...,...,...,...,...,...
95,562.61239,562.61239_1268,GTP cyclohydrolase I (EC 3.5.4.16) type 1,seed.reaction:rxn00299,"GTP 7,8-8,9-dihydrolase"
96,562.61239,562.61239_1268,GTP cyclohydrolase I (EC 3.5.4.16) type 1,seed.reaction:rxn03174,"2-Amino-4-hydroxy-6-(erythro-1,2,3-trihydroxyp..."
97,562.61239,562.61239_1268,GTP cyclohydrolase I (EC 3.5.4.16) type 1,seed.reaction:rxn00302,"GTP 8,9-hydrolase"
98,562.61239,562.61239_1269,S-formylglutathione hydrolase (EC 3.1.2.12),seed.reaction:rxn00377,S-Formylglutathione hydrolase


---

## Part 2: Extra Queries - Comparative Analysis of 50 E. coli Strains

Now let's explore the diversity across all 50 E. coli genomes using multi-table queries.

### 1. Core vs Accessory Genes: What's Universal vs Strain-Specific?

In [8]:
def analyze_core_accessory_genes():
    query = f"""
    WITH gene_distribution AS (
        -- Count how many genomes have each UniRef cluster
        SELECT 
            bakta_uniref,
            bakta_product,
            COUNT(DISTINCT genome_id) as genome_count,
            COLLECT_SET(genome_id) as genome_list
        FROM {namespace}.feature_annotation
        WHERE bakta_uniref IS NOT NULL
        GROUP BY bakta_uniref, bakta_product
    ),
    categorized_genes AS (
        SELECT 
            *,
            CASE 
                WHEN genome_count = 50 THEN 'Core'
                WHEN genome_count >= 45 THEN 'Soft-core'
                WHEN genome_count >= 25 THEN 'Shell'
                WHEN genome_count >= 2 THEN 'Cloud'
                ELSE 'Unique'
            END as gene_category
        FROM gene_distribution
    )
    SELECT 
        gene_category,
        COUNT(*) as gene_count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
    FROM categorized_genes
    GROUP BY gene_category
    ORDER BY 
        CASE gene_category
            WHEN 'Core' THEN 1
            WHEN 'Soft-core' THEN 2
            WHEN 'Shell' THEN 3
            WHEN 'Cloud' THEN 4
            WHEN 'Unique' THEN 5
        END
    """
    
    df = spark.sql(query).toPandas()
    print(f"E. coli pangenome structure across 50 strains:")
    display(df)
    
    # Show examples of core genes
    core_examples_query = f"""
    SELECT DISTINCT
        bakta_uniref,
        bakta_product,
        bakta_gene
    FROM {namespace}.feature_annotation
    WHERE bakta_uniref IN (
        SELECT bakta_uniref
        FROM {namespace}.feature_annotation
        WHERE bakta_uniref IS NOT NULL
        GROUP BY bakta_uniref
        HAVING COUNT(DISTINCT genome_id) = 50
    )
    LIMIT 10
    """
    
    core_df = spark.sql(core_examples_query).toPandas()
    print(f"\nExamples of core genes (present in all 50 strains):")
    display(core_df)
    
    return df

time_query("Core vs Accessory Gene Analysis", analyze_core_accessory_genes)


Executing: Core vs Accessory Gene Analysis


[Stage 102:>                                                        (0 + 1) / 1]

E. coli pangenome structure across 50 strains:


                                                                                

Unnamed: 0,gene_category,gene_count,percentage
0,Core,737,3.18
1,Soft-core,2240,9.67
2,Shell,1011,4.36
3,Cloud,8272,35.69
4,Unique,10916,47.1


                                                                                


Examples of core genes (present in all 50 strains):


Unnamed: 0,bakta_uniref,bakta_product,bakta_gene
0,"UniRef50_P0AEK4,UniRef90_P0AEK4",enoyl-ACP reductase FabI,fabI
1,"UniRef50_P23524,UniRef90_P23524",glycerate 2-kinase,garK
2,"UniRef50_P0A6T3,UniRef90_P0A6T3",galactokinase,galK
3,"UniRef50_P27249,UniRef90_Q3Z5J1",bifunctional uridylyltransferase/uridylyl-remo...,glnD
4,"UniRef50_A0A4V2JFY6,UniRef90_UPI00193345D9",flagellar type III secretion system pore prote...,fliP
5,"UniRef50_P21507,UniRef90_P21507",ATP-dependent RNA helicase SrmB,srmB
6,"UniRef50_P77301,UniRef90_P77301",Uncharacterized protein YbaP,ybaP
7,"UniRef50_P64506,UniRef90_P64506",Uncharacterized protein YebY,yebY
8,"UniRef50_P37902,UniRef90_P37902",glutamate/aspartate ABC transporter substrate-...,gltI
9,"UniRef50_P0A8F4,UniRef90_P0A8F4",uridine kinase,udk



Query execution time: 11.71 seconds


Unnamed: 0,gene_category,gene_count,percentage
0,Core,737,3.18
1,Soft-core,2240,9.67
2,Shell,1011,4.36
3,Cloud,8272,35.69
4,Unique,10916,47.1


### 2. Functional Diversity: EC Number Distribution Across Strains

In [9]:
def analyze_ec_diversity():
    query = f"""
    WITH strain_ec_profiles AS (
        -- Get EC number profiles for each strain
        SELECT 
            f.genome_id,
            f.bakta_ec,
            s.value as ec_name
        FROM {namespace}.feature_annotation f
        LEFT JOIN {namespace}.statements s
            ON CONCAT('EC:', f.bakta_ec) = s.subject
            AND s.predicate = 'rdfs:label'
        WHERE f.bakta_ec IS NOT NULL
    ),
    ec_distribution AS (
        -- Count how many strains have each EC number
        SELECT 
            bakta_ec,
            MAX(ec_name) as ec_name,
            COUNT(DISTINCT genome_id) as strain_count,
            ROUND(COUNT(DISTINCT genome_id) * 100.0 / 50, 1) as prevalence_pct
        FROM strain_ec_profiles
        GROUP BY bakta_ec
    ),
    categorized_ec AS (
        SELECT 
            *,
            CASE 
                WHEN strain_count = 50 THEN 'Universal'
                WHEN strain_count >= 40 THEN 'Highly conserved'
                WHEN strain_count >= 25 THEN 'Common'
                WHEN strain_count >= 10 THEN 'Variable'
                ELSE 'Rare'
            END as conservation_category
        FROM ec_distribution
    )
    SELECT 
        conservation_category,
        COUNT(*) as ec_count,
        MIN(strain_count) as min_strains,
        MAX(strain_count) as max_strains,
        ROUND(AVG(prevalence_pct), 1) as avg_prevalence_pct
    FROM categorized_ec
    GROUP BY conservation_category
    ORDER BY max_strains DESC
    """
    
    df = spark.sql(query).toPandas()
    print(f"EC number conservation across 50 E. coli strains:")
    display(df)
    
    # Show variable EC functions
    variable_ec_query = f"""
    WITH ec_counts AS (
        SELECT 
            f.bakta_ec,
            s.value as ec_name,
            COUNT(DISTINCT f.genome_id) as strain_count
        FROM {namespace}.feature_annotation f
        LEFT JOIN {namespace}.statements s
            ON CONCAT('EC:', f.bakta_ec) = s.subject
            AND s.predicate = 'rdfs:label'
        WHERE f.bakta_ec IS NOT NULL
        GROUP BY f.bakta_ec, s.value
    )
    SELECT * FROM ec_counts
    WHERE strain_count BETWEEN 10 AND 40
    ORDER BY strain_count DESC
    LIMIT 15
    """
    
    variable_df = spark.sql(variable_ec_query).toPandas()
    print(f"\nVariable enzymatic functions (present in 10-40 strains):")
    display(variable_df)
    
    return df

time_query("EC Number Diversity Analysis", analyze_ec_diversity)


Executing: EC Number Diversity Analysis


                                                                                

EC number conservation across 50 E. coli strains:


Unnamed: 0,conservation_category,ec_count,min_strains,max_strains,avg_prevalence_pct
0,Universal,478,50,50,100.0
1,Highly conserved,637,40,49,95.8
2,Common,39,26,39,68.4
3,Variable,30,10,23,31.0
4,Rare,138,1,9,6.6





Variable enzymatic functions (present in 10-40 strains):


                                                                                

Unnamed: 0,bakta_ec,ec_name,strain_count
0,"2.5.1.-,2.9.1.3",,40
1,1.6.99.-,,39
2,1.4.3.21,primary-amine oxidase,39
3,2.3.2.2,gamma-glutamyltransferase,39
4,2.4.1.352,glucosylglycerate phosphorylase,39
5,1.2.99.6,carboxylate reductase,39
6,3.5.1.42,nicotinamide-nucleotide amidase,39
7,2.7.1.195,phosphotransferase,39
8,6.2.1.30,phenylacetate--CoA ligase,39
9,5.3.3.10,5-carboxymethyl-2-hydroxymuconate Delta-isomerase,38



Query execution time: 14.17 seconds


Unnamed: 0,conservation_category,ec_count,min_strains,max_strains,avg_prevalence_pct
0,Universal,478,50,50,100.0
1,Highly conserved,637,40,49,95.8
2,Common,39,26,39,68.4
3,Variable,30,10,23,31.0
4,Rare,138,1,9,6.6


### 3. Metabolic Capability Differences: Reaction Sets Between Strains

In [10]:
def compare_metabolic_capabilities():
    query = f"""
    WITH strain_reactions AS (
        -- Map each strain to its set of reactions
        SELECT DISTINCT
            f.genome_id,
            ta.object as reaction_id
        FROM {namespace}.feature_annotation f
        INNER JOIN {namespace}.term_association ta
            ON f.rast = ta.subject
        WHERE f.rast IS NOT NULL
        AND ta.object LIKE 'seed.reaction:%'
    ),
    strain_reaction_counts AS (
        -- Count reactions per strain
        SELECT 
            genome_id,
            COUNT(DISTINCT reaction_id) as reaction_count
        FROM strain_reactions
        GROUP BY genome_id
    ),
    reaction_distribution AS (
        -- See how reactions are distributed
        SELECT 
            reaction_id,
            COUNT(DISTINCT genome_id) as strain_count
        FROM strain_reactions
        GROUP BY reaction_id
    ),
    stats AS (
        SELECT 
            MIN(reaction_count) as min_reactions,
            MAX(reaction_count) as max_reactions,
            AVG(reaction_count) as avg_reactions,
            STDDEV(reaction_count) as std_reactions
        FROM strain_reaction_counts
    )
    SELECT 
        'Metabolic Capacity Statistics' as metric,
        min_reactions,
        max_reactions,
        ROUND(avg_reactions, 1) as avg_reactions,
        ROUND(std_reactions, 1) as std_reactions,
        max_reactions - min_reactions as reaction_range
    FROM stats
    """
    
    df = spark.sql(query).toPandas()
    print(f"Metabolic reaction capacity across strains:")
    display(df)
    
    # Show strains with extreme metabolic capacities
    extremes_query = f"""
    WITH strain_reactions AS (
        SELECT DISTINCT
            f.genome_id,
            ta.object as reaction_id
        FROM {namespace}.feature_annotation f
        INNER JOIN {namespace}.term_association ta
            ON f.rast = ta.subject
        WHERE f.rast IS NOT NULL
        AND ta.object LIKE 'seed.reaction:%'
    ),
    strain_counts AS (
        SELECT 
            sr.genome_id,
            COUNT(DISTINCT sr.reaction_id) as reaction_count,
            MAX(s.value) as organism_name
        FROM strain_reactions sr
        LEFT JOIN {namespace}.feature_annotation fa ON sr.genome_id = fa.genome_id
        LEFT JOIN {namespace}.statements s 
            ON fa.genome_taxa = s.subject AND s.predicate = 'rdfs:label'
        GROUP BY sr.genome_id
    )
    SELECT * FROM (
        SELECT *, 'Highest capacity' as category
        FROM strain_counts
        ORDER BY reaction_count DESC
        LIMIT 5
    )
    UNION ALL
    SELECT * FROM (
        SELECT *, 'Lowest capacity' as category
        FROM strain_counts
        ORDER BY reaction_count ASC
        LIMIT 5
    )
    ORDER BY category, reaction_count DESC
    """
    
    extremes_df = spark.sql(extremes_query).toPandas()
    print(f"\nStrains with extreme metabolic capacities:")
    display(extremes_df)
    
    return df

time_query("Metabolic Capability Comparison", compare_metabolic_capabilities)


Executing: Metabolic Capability Comparison


                                                                                

Metabolic reaction capacity across strains:


Unnamed: 0,metric,min_reactions,max_reactions,avg_reactions,std_reactions,reaction_range
0,Metabolic Capacity Statistics,1027,1099,1082.5,14.4,72


                                                                                


Strains with extreme metabolic capacities:


Unnamed: 0,genome_id,reaction_count,organism_name,category
0,562.61119,1099,,Highest capacity
1,562.61097,1099,,Highest capacity
2,562.55859,1098,,Highest capacity
3,562.55864,1096,,Highest capacity
4,562.61192,1096,,Highest capacity
5,562.6121,1065,,Lowest capacity
6,562.61073,1062,,Lowest capacity
7,562.55868,1056,,Lowest capacity
8,562.61106,1042,,Lowest capacity
9,562.55845,1027,,Lowest capacity



Query execution time: 490.39 seconds


Unnamed: 0,metric,min_reactions,max_reactions,avg_reactions,std_reactions,reaction_range
0,Metabolic Capacity Statistics,1027,1099,1082.5,14.4,72


### 4. Taxonomic Clustering by Functional Profiles

In [11]:
def analyze_taxonomic_functional_clusters():
    query = f"""
    WITH strain_taxonomy AS (
        -- Get taxonomic info for each strain
        SELECT DISTINCT
            f.genome_id,
            f.genome_taxa,
            s.value as strain_name
        FROM {namespace}.feature_annotation f
        LEFT JOIN {namespace}.statements s
            ON f.genome_taxa = s.subject AND s.predicate = 'rdfs:label'
    ),
    strain_functions AS (
        -- Get functional profile (EC numbers) for each strain
        SELECT 
            genome_id,
            COUNT(DISTINCT bakta_ec) as ec_count,
            COUNT(DISTINCT bakta_go) as go_count,
            COUNT(DISTINCT CASE WHEN bakta_ec LIKE '1.%' THEN bakta_ec END) as oxidoreductases,
            COUNT(DISTINCT CASE WHEN bakta_ec LIKE '2.%' THEN bakta_ec END) as transferases,
            COUNT(DISTINCT CASE WHEN bakta_ec LIKE '3.%' THEN bakta_ec END) as hydrolases,
            COUNT(DISTINCT CASE WHEN bakta_ec LIKE '4.%' THEN bakta_ec END) as lyases,
            COUNT(DISTINCT CASE WHEN bakta_ec LIKE '5.%' THEN bakta_ec END) as isomerases,
            COUNT(DISTINCT CASE WHEN bakta_ec LIKE '6.%' THEN bakta_ec END) as ligases
        FROM {namespace}.feature_annotation
        WHERE bakta_ec IS NOT NULL
        GROUP BY genome_id
    )
    SELECT 
        st.genome_id,
        st.strain_name,
        sf.ec_count,
        sf.go_count,
        sf.oxidoreductases,
        sf.transferases,
        sf.hydrolases,
        sf.lyases,
        sf.isomerases,
        sf.ligases,
        ROUND(sf.transferases * 100.0 / sf.ec_count, 1) as transferase_pct,
        ROUND(sf.hydrolases * 100.0 / sf.ec_count, 1) as hydrolase_pct
    FROM strain_taxonomy st
    JOIN strain_functions sf ON st.genome_id = sf.genome_id
    ORDER BY sf.ec_count DESC
    LIMIT 20
    """
    
    df = spark.sql(query).toPandas()
    print(f"Functional enzyme profiles by strain:")
    display(df)
    
    return df

time_query("Taxonomic-Functional Clustering", analyze_taxonomic_functional_clusters)


Executing: Taxonomic-Functional Clustering




Functional enzyme profiles by strain:


                                                                                

Unnamed: 0,genome_id,strain_name,ec_count,go_count,oxidoreductases,transferases,hydrolases,lyases,isomerases,ligases,transferase_pct,hydrolase_pct
0,562.61071,,1151,1498,219,378,245,118,72,71,32.8,21.3
1,562.61115,,1151,1497,223,372,246,119,72,71,32.3,21.4
2,562.61207,,1151,1497,223,375,247,116,71,70,32.6,21.5
3,562.61119,,1150,1501,227,369,247,118,72,70,32.1,21.5
4,562.61195,,1149,1502,221,369,250,118,73,69,32.1,21.8
5,562.61198,,1148,1493,219,371,248,120,73,70,32.3,21.6
6,562.6124,,1146,1497,220,370,248,118,73,71,32.3,21.6
7,562.61097,,1143,1499,221,368,243,122,72,70,32.2,21.3
8,562.61163,,1143,1499,220,367,246,121,72,70,32.1,21.5
9,562.55864,,1142,1485,221,371,246,116,71,71,32.5,21.5



Query execution time: 9.02 seconds


Unnamed: 0,genome_id,strain_name,ec_count,go_count,oxidoreductases,transferases,hydrolases,lyases,isomerases,ligases,transferase_pct,hydrolase_pct
0,562.61071,,1151,1498,219,378,245,118,72,71,32.8,21.3
1,562.61115,,1151,1497,223,372,246,119,72,71,32.3,21.4
2,562.61207,,1151,1497,223,375,247,116,71,70,32.6,21.5
3,562.61119,,1150,1501,227,369,247,118,72,70,32.1,21.5
4,562.61195,,1149,1502,221,369,250,118,73,69,32.1,21.8
5,562.61198,,1148,1493,219,371,248,120,73,70,32.3,21.6
6,562.6124,,1146,1497,220,370,248,118,73,71,32.3,21.6
7,562.61097,,1143,1499,221,368,243,122,72,70,32.2,21.3
8,562.61163,,1143,1499,220,367,246,121,72,70,32.1,21.5
9,562.55864,,1142,1485,221,371,246,116,71,71,32.5,21.5


### 5. Pathway Completeness Analysis

In [12]:
def analyze_pathway_completeness():
    query = f"""
    WITH glycolysis_reactions AS (
        -- Define key glycolysis reactions (example pathway)
        SELECT reaction_id, reaction_name FROM (
            VALUES 
            ('seed.reaction:rxn00558', 'Glucose-6-phosphate isomerase'),
            ('seed.reaction:rxn00604', 'Phosphofructokinase'),
            ('seed.reaction:rxn00711', 'Fructose-bisphosphate aldolase'),
            ('seed.reaction:rxn00024', 'Glyceraldehyde-3-phosphate dehydrogenase'),
            ('seed.reaction:rxn00083', 'Phosphoglycerate kinase'),
            ('seed.reaction:rxn00119', 'Phosphoglycerate mutase'),
            ('seed.reaction:rxn00094', 'Enolase'),
            ('seed.reaction:rxn00200', 'Pyruvate kinase')
        ) AS t(reaction_id, reaction_name)
    ),
    strain_glycolysis_coverage AS (
        -- Check which strains have which glycolysis reactions
        SELECT 
            f.genome_id,
            COUNT(DISTINCT gr.reaction_id) as glycolysis_reactions_present,
            8 as total_glycolysis_reactions,
            ROUND(COUNT(DISTINCT gr.reaction_id) * 100.0 / 8, 1) as pathway_completeness_pct
        FROM {namespace}.feature_annotation f
        INNER JOIN {namespace}.term_association ta ON f.rast = ta.subject
        INNER JOIN glycolysis_reactions gr ON ta.object = gr.reaction_id
        WHERE f.rast IS NOT NULL
        GROUP BY f.genome_id
    ),
    completeness_summary AS (
        SELECT 
            CASE 
                WHEN pathway_completeness_pct = 100 THEN 'Complete'
                WHEN pathway_completeness_pct >= 75 THEN 'Nearly complete'
                WHEN pathway_completeness_pct >= 50 THEN 'Partial'
                ELSE 'Incomplete'
            END as completeness_category,
            COUNT(*) as strain_count
        FROM strain_glycolysis_coverage
        GROUP BY completeness_category
    )
    SELECT * FROM completeness_summary
    ORDER BY 
        CASE completeness_category
            WHEN 'Complete' THEN 1
            WHEN 'Nearly complete' THEN 2
            WHEN 'Partial' THEN 3
            ELSE 4
        END
    """
    
    df = spark.sql(query).toPandas()
    print(f"Glycolysis pathway completeness across strains:")
    display(df)
    
    # Show which reactions are most commonly missing
    missing_reactions_query = f"""
    WITH glycolysis_reactions AS (
        SELECT reaction_id, reaction_name FROM (
            VALUES 
            ('seed.reaction:rxn00558', 'Glucose-6-phosphate isomerase'),
            ('seed.reaction:rxn00604', 'Phosphofructokinase'),
            ('seed.reaction:rxn00711', 'Fructose-bisphosphate aldolase'),
            ('seed.reaction:rxn00024', 'Glyceraldehyde-3-phosphate dehydrogenase'),
            ('seed.reaction:rxn00083', 'Phosphoglycerate kinase'),
            ('seed.reaction:rxn00119', 'Phosphoglycerate mutase'),
            ('seed.reaction:rxn00094', 'Enolase'),
            ('seed.reaction:rxn00200', 'Pyruvate kinase')
        ) AS t(reaction_id, reaction_name)
    ),
    reaction_presence AS (
        SELECT 
            gr.reaction_id,
            gr.reaction_name,
            COUNT(DISTINCT f.genome_id) as strains_with_reaction
        FROM glycolysis_reactions gr
        LEFT JOIN {namespace}.term_association ta ON gr.reaction_id = ta.object
        LEFT JOIN {namespace}.feature_annotation f 
            ON ta.subject = f.rast AND f.rast IS NOT NULL
        GROUP BY gr.reaction_id, gr.reaction_name
    )
    SELECT 
        reaction_name,
        strains_with_reaction,
        50 - strains_with_reaction as strains_missing_reaction,
        ROUND(strains_with_reaction * 100.0 / 50, 1) as presence_pct
    FROM reaction_presence
    ORDER BY strains_with_reaction DESC
    """
    
    missing_df = spark.sql(missing_reactions_query).toPandas()
    print(f"\nGlycolysis reaction presence across 50 strains:")
    display(missing_df)
    
    return df

time_query("Pathway Completeness Analysis", analyze_pathway_completeness)


Executing: Pathway Completeness Analysis


                                                                                

Glycolysis pathway completeness across strains:


Unnamed: 0,completeness_category,strain_count
0,Partial,45
1,Incomplete,5



Glycolysis reaction presence across 50 strains:


Unnamed: 0,reaction_name,strains_with_reaction,strains_missing_reaction,presence_pct
0,Fructose-bisphosphate aldolase,50,0,100.0
1,Phosphoglycerate kinase,50,0,100.0
2,Glucose-6-phosphate isomerase,49,1,98.0
3,Phosphoglycerate mutase,46,4,92.0
4,Pyruvate kinase,0,50,0.0
5,Enolase,0,50,0.0
6,Glyceraldehyde-3-phosphate dehydrogenase,0,50,0.0
7,Phosphofructokinase,0,50,0.0



Query execution time: 6.32 seconds


Unnamed: 0,completeness_category,strain_count
0,Partial,45
1,Incomplete,5


### 6. Unique Functional Features by Strain

In [13]:
def find_unique_strain_features():
    query = f"""
    WITH feature_distribution AS (
        -- Find features unique to single strains
        SELECT 
            rast,
            COUNT(DISTINCT genome_id) as strain_count,
            COLLECT_SET(genome_id)[0] as unique_to_genome
        FROM {namespace}.feature_annotation
        WHERE rast IS NOT NULL
        GROUP BY rast
        HAVING COUNT(DISTINCT genome_id) = 1
    ),
    unique_features_with_reactions AS (
        -- See if these unique features have known reactions
        SELECT 
            fd.unique_to_genome,
            fd.rast,
            ta.object as reaction_id,
            s1.value as reaction_name,
            s2.value as strain_name
        FROM feature_distribution fd
        LEFT JOIN {namespace}.term_association ta ON fd.rast = ta.subject
        LEFT JOIN {namespace}.statements s1 
            ON ta.object = s1.subject AND s1.predicate = 'rdfs:label'
        LEFT JOIN {namespace}.feature_annotation fa ON fd.unique_to_genome = fa.genome_id
        LEFT JOIN {namespace}.statements s2 
            ON fa.genome_taxa = s2.subject AND s2.predicate = 'rdfs:label'
        WHERE ta.object LIKE 'seed.reaction:%'
    ),
    strain_unique_counts AS (
        SELECT 
            unique_to_genome,
            MAX(strain_name) as strain_name,
            COUNT(DISTINCT rast) as unique_functions,
            COUNT(DISTINCT reaction_id) as unique_reactions
        FROM unique_features_with_reactions
        GROUP BY unique_to_genome
    )
    SELECT * FROM strain_unique_counts
    WHERE unique_reactions > 0
    ORDER BY unique_reactions DESC
    LIMIT 15
    """
    
    df = spark.sql(query).toPandas()
    print(f"Strains with unique metabolic capabilities:")
    display(df)
    
    # Show examples of unique functions
    examples_query = f"""
    WITH unique_features AS (
        SELECT 
            rast,
            COLLECT_SET(genome_id)[0] as genome_id
        FROM {namespace}.feature_annotation
        WHERE rast IS NOT NULL
        GROUP BY rast
        HAVING COUNT(DISTINCT genome_id) = 1
    )
    SELECT 
        uf.genome_id,
        uf.rast as unique_function,
        ta.object as reaction_id,
        s.value as reaction_name
    FROM unique_features uf
    INNER JOIN {namespace}.term_association ta ON uf.rast = ta.subject
    LEFT JOIN {namespace}.statements s 
        ON ta.object = s.subject AND s.predicate = 'rdfs:label'
    WHERE ta.object LIKE 'seed.reaction:%'
    LIMIT 10
    """
    
    examples_df = spark.sql(examples_query).toPandas()
    print(f"\nExamples of strain-specific functions and reactions:")
    display(examples_df)
    
    return df

time_query("Unique Strain Features Analysis", find_unique_strain_features)


Executing: Unique Strain Features Analysis




Strains with unique metabolic capabilities:


                                                                                

Unnamed: 0,unique_to_genome,strain_name,unique_functions,unique_reactions
0,562.61106,,4,4
1,562.55868,,1,2
2,562.61167,,2,2
3,562.61179,,2,2
4,562.61197,,2,2
5,562.8507,,2,2
6,562.55577,,1,1
7,562.55845,,1,1
8,562.61175,,1,1


                                                                                


Examples of strain-specific functions and reactions:


Unnamed: 0,genome_id,unique_function,reaction_id,reaction_name
0,562.61179,5-methylthioribose kinase (EC 2.7.1.100),seed.reaction:rxn02894,ATP:S5-methyl-5-thio-D-ribose 1-phosphotransfe...
1,562.61175,Beta-ketoadipate enol-lactone hydrolase (EC 3....,seed.reaction:rxn02144,4-carboxymethylbut-3-en-4-olide enol-lactonohy...
2,562.61167,Maleate cis-trans isomerase (EC 5.2.1.1),seed.reaction:rxn00803,Maleate cis-trans-isomerase
3,562.8507,Shikimate 5-dehydrogenase I gamma (EC 1.1.1.25),seed.reaction:rxn01740,Shikimate:NADP+ 3-oxidoreductase
4,562.61106,3-dehydroquinate dehydratase II (EC 4.2.1.10),seed.reaction:rxn02213,3-Dehydroquinate hydro-lyase
5,562.55868,4-hydroxyphenylpyruvate dioxygenase (EC 1.13.1...,seed.reaction:rxn01827,4-Hydroxyphenylpyruvate:oxygen oxidoreductase ...
6,562.55868,4-hydroxyphenylpyruvate dioxygenase (EC 1.13.1...,seed.reaction:rxn00999,Phenylpyruvate:oxygen oxidoreductase (hydroxyl...
7,562.8507,Arginine decarboxylase (EC 4.1.1.19),seed.reaction:rxn00405,L-arginine carboxy-lyase (agmatine-forming)
8,562.61197,D-arabinitol 4-dehydrogenase (EC 1.1.1.11),seed.reaction:rxn03883,D-arabinitol:NAD 4-oxidoreductase
9,562.55577,D-arabino-3-hexulose 6-phosphate formaldehyde-...,seed.reaction:rxn03643,D-arabino-hex-3-ulose-6-phosphate formaldehyde...



Query execution time: 15.51 seconds


Unnamed: 0,unique_to_genome,strain_name,unique_functions,unique_reactions
0,562.61106,,4,4
1,562.55868,,1,2
2,562.61167,,2,2
3,562.61179,,2,2
4,562.61197,,2,2
5,562.8507,,2,2
6,562.55577,,1,1
7,562.55845,,1,1
8,562.61175,,1,1


### 7. Conservation Analysis: Most and Least Conserved Functions

In [14]:
def analyze_function_conservation():
    query = f"""
    WITH go_conservation AS (
        -- Analyze GO term conservation
        SELECT 
            f.bakta_go as go_term,
            s.value as go_name,
            COUNT(DISTINCT f.genome_id) as strain_count,
            COUNT(*) as total_annotations,
            ROUND(COUNT(DISTINCT f.genome_id) * 100.0 / 50, 1) as conservation_pct
        FROM {namespace}.feature_annotation f
        LEFT JOIN {namespace}.statements s
            ON f.bakta_go = s.subject AND s.predicate = 'rdfs:label'
        WHERE f.bakta_go IS NOT NULL
        GROUP BY f.bakta_go, s.value
    ),
    go_categories AS (
        -- Get GO categories (biological process, molecular function, cellular component)
        SELECT 
            gc.*,
            CASE 
                WHEN gc.go_term LIKE 'GO:00%' THEN 'Molecular Function'
                WHEN gc.go_term LIKE 'GO:000%' THEN 'Biological Process'
                WHEN gc.go_term LIKE 'GO:0005%' THEN 'Cellular Component'
                ELSE 'Other'
            END as go_category
        FROM go_conservation gc
    )
    SELECT 
        'Most Conserved' as conservation_type,
        go_term,
        go_name,
        go_category,
        strain_count,
        conservation_pct
    FROM go_categories
    WHERE strain_count >= 45
    ORDER BY strain_count DESC, total_annotations DESC
    LIMIT 10
    
    UNION ALL
    
    SELECT 
        'Least Conserved' as conservation_type,
        go_term,
        go_name,
        go_category,
        strain_count,
        conservation_pct
    FROM go_categories
    WHERE strain_count <= 5 AND strain_count > 1
    ORDER BY strain_count ASC, total_annotations DESC
    LIMIT 10
    """
    
    df = spark.sql(query).toPandas()
    print(f"Most and least conserved GO terms across E. coli strains:")
    display(df)
    
    # Summary by GO category
    category_summary_query = f"""
    WITH go_conservation AS (
        SELECT 
            f.bakta_go as go_term,
            COUNT(DISTINCT f.genome_id) as strain_count
        FROM {namespace}.feature_annotation f
        WHERE f.bakta_go IS NOT NULL
        GROUP BY f.bakta_go
    ),
    go_categories AS (
        SELECT 
            CASE 
                WHEN go_term LIKE 'GO:00%' THEN 'Molecular Function'
                WHEN go_term LIKE 'GO:000%' THEN 'Biological Process'
                WHEN go_term LIKE 'GO:0005%' THEN 'Cellular Component'
                ELSE 'Other'
            END as go_category,
            strain_count
        FROM go_conservation
    )
    SELECT 
        go_category,
        COUNT(*) as term_count,
        ROUND(AVG(strain_count), 1) as avg_strain_count,
        MIN(strain_count) as min_strains,
        MAX(strain_count) as max_strains
    FROM go_categories
    GROUP BY go_category
    ORDER BY avg_strain_count DESC
    """
    
    summary_df = spark.sql(category_summary_query).toPandas()
    print(f"\nConservation summary by GO category:")
    display(summary_df)
    
    return df

time_query("Function Conservation Analysis", analyze_function_conservation)


Executing: Function Conservation Analysis


ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near 'UNION'.(line 40, pos 4)

== SQL ==

    WITH go_conservation AS (
        -- Analyze GO term conservation
        SELECT 
            f.bakta_go as go_term,
            s.value as go_name,
            COUNT(DISTINCT f.genome_id) as strain_count,
            COUNT(*) as total_annotations,
            ROUND(COUNT(DISTINCT f.genome_id) * 100.0 / 50, 1) as conservation_pct
        FROM ontology_data.feature_annotation f
        LEFT JOIN ontology_data.statements s
            ON f.bakta_go = s.subject AND s.predicate = 'rdfs:label'
        WHERE f.bakta_go IS NOT NULL
        GROUP BY f.bakta_go, s.value
    ),
    go_categories AS (
        -- Get GO categories (biological process, molecular function, cellular component)
        SELECT 
            gc.*,
            CASE 
                WHEN gc.go_term LIKE 'GO:00%' THEN 'Molecular Function'
                WHEN gc.go_term LIKE 'GO:000%' THEN 'Biological Process'
                WHEN gc.go_term LIKE 'GO:0005%' THEN 'Cellular Component'
                ELSE 'Other'
            END as go_category
        FROM go_conservation gc
    )
    SELECT 
        'Most Conserved' as conservation_type,
        go_term,
        go_name,
        go_category,
        strain_count,
        conservation_pct
    FROM go_categories
    WHERE strain_count >= 45
    ORDER BY strain_count DESC, total_annotations DESC
    LIMIT 10

    UNION ALL
----^^^

    SELECT 
        'Least Conserved' as conservation_type,
        go_term,
        go_name,
        go_category,
        strain_count,
        conservation_pct
    FROM go_categories
    WHERE strain_count <= 5 AND strain_count > 1
    ORDER BY strain_count ASC, total_annotations DESC
    LIMIT 10
    


## Summary

This notebook demonstrated:

1. **Building blocks of Query 0**: How SEED reactions, roles, term associations, and feature annotations connect to map genome features to metabolic reactions

2. **Comparative genomics insights**: Analysis of 50 E. coli strains revealed:
   - Core vs accessory genome structure
   - Functional diversity in enzymatic capabilities
   - Metabolic capacity variations
   - Pathway completeness patterns
   - Strain-specific unique features
   - Conservation patterns of molecular functions

These queries showcase the power of integrating ontology data with genomic annotations to understand bacterial diversity and evolution.