# Notebook 01: Data Extraction & Linking

**Goal**: Map essential gene families → EC numbers → biochemical reactions

**Workflow**:
1. Load 859 universally essential families from the essential_genome project
2. Extract gene cluster IDs for essential families
3. Query eggNOG annotations via Spark Connect (EC numbers)
4. Query ModelSEED biochemistry (reactions)
5. Identify universally essential reactions

**Workflow Testing**:
- ✅ MinIO download (downloaded essential_genome data)
- ⏳ Spark Connect queries (local queries with proxy)
- ⏳ Cross-database joins (Pangenome → Biochemistry)

**Prerequisites**:
- Proxy chain running (SSH tunnel + pproxy)
- `.venv-berdl` activated
- `KBASE_AUTH_TOKEN` in `.env`
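
Before running the cells below, it can help to confirm that the local proxy is actually listening. The following is a minimal sketch using only the standard library. The port 8123 matches the proxy address set later in this notebook, and `port_is_open` is a hypothetical helper, not part of any BERDL tooling.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False


if __name__ == "__main__":
    # Assumed proxy address; adjust if pproxy listens elsewhere
    print("Proxy reachable:", port_is_open("127.0.0.1", 8123))
```

A `False` result only says that nothing accepted a TCP connection on that port; it does not verify the SSH tunnel behind the proxy.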

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import sys
import os

# Add project root to path for imports
project_root = Path().resolve().parent.parent
sys.path.insert(0, str(project_root))

# Verify environment
assert Path('../data/essential_genome/essential_families.tsv').exists(), "Essential families data not found. Run MinIO download first."

print("✅ Imports successful")
print(f"Project root: {project_root}")

## Step 1: Load Essential Gene Families

Load the 859 universally essential families from the essential_genome project.

In [None]:
# Load essential families
families_df = pd.read_csv('../data/essential_genome/essential_families.tsv', sep='\t')

print(f"Total ortholog groups: {len(families_df):,}")
print(f"\nEssentiality classes:")
print(families_df['essentiality_class'].value_counts())

# Filter to universally essential
universal_essential = families_df[families_df['essentiality_class'] == 'universally_essential'].copy()

print(f"\n✅ Universally essential families: {len(universal_essential):,}")
print(f"\nSample families:")
universal_essential[['OG_id', 'rep_gene', 'rep_desc', 'n_organisms']].head(10)

## Step 2: Extract Gene Cluster IDs

The essential_genome project used FB locus IDs. We need to map these to pangenome gene cluster IDs.

**Note**: The essential_families.tsv has a `rep_gene` column with gene names, but we need the actual gene cluster IDs from the pangenome to query eggNOG annotations.

We'll either reuse the pangenome link table from the conservation_vs_fitness project or rebuild the mapping.
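
Whichever route is taken, it is worth measuring how many essential genes actually map to a cluster before querying eggNOG. The following is a minimal sketch on toy data. The column names `locus_id` and `gene_cluster_id` are assumptions for illustration and may not match the real `fb_pangenome_link.tsv` header; `build_link_map` and `map_loci` are hypothetical helpers.

```python
import csv
import io

# Toy stand-in for the link table; the real column names may differ
LINK_TSV = "locus_id\tgene_cluster_id\nlocA\tcluster_1\nlocB\tcluster_2\n"


def build_link_map(tsv_text):
    """Parse a TSV string into a {locus_id: gene_cluster_id} dictionary."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["locus_id"]: row["gene_cluster_id"] for row in reader}


def map_loci(loci, link_map):
    """Split loci into those with a cluster ID and those without one."""
    mapped = {locus: link_map[locus] for locus in loci if locus in link_map}
    unmapped = [locus for locus in loci if locus not in link_map]
    return mapped, unmapped


link_map = build_link_map(LINK_TSV)
loci = ["locA", "locB", "locC"]
mapped, unmapped = map_loci(loci, link_map)
print(f"Mapped {len(mapped)}/{len(loci)} loci; unmapped: {unmapped}")
```

A large unmapped fraction would be a signal to rebuild the mapping rather than proceed with a partial one.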

In [None]:
# Check if we have the gene-to-cluster mapping
# This was created in conservation_vs_fitness project
link_file = project_root / 'projects/conservation_vs_fitness/data/fb_pangenome_link.tsv'

if link_file.exists():
    print("✅ Found FB-pangenome link table")
    fb_link = pd.read_csv(link_file, sep='\t')
    print(f"Link table size: {len(fb_link):,} rows")
    print(fb_link.head())
else:
    print("⚠️  FB-pangenome link table not found")
    print("We'll need to use gene cluster representative sequences instead")
    fb_link = None

## Step 3: Initialize Spark Connect Session

**This tests local Spark Connect with proxy support.**

The `.venv-berdl` environment includes all necessary Spark dependencies.

In [None]:
# Set up proxy for Spark Connect
os.environ['https_proxy'] = 'http://127.0.0.1:8123'
os.environ['http_proxy'] = 'http://127.0.0.1:8123'
os.environ['no_proxy'] = 'localhost,127.0.0.1'

# Import Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, collect_list

# Read KBASE_AUTH_TOKEN
from dotenv import load_dotenv
load_dotenv(project_root / '.env')
auth_token = os.getenv('KBASE_AUTH_TOKEN')
assert auth_token, "KBASE_AUTH_TOKEN not found in .env"

print("✅ Environment configured")
print(f"Proxy: {os.environ.get('https_proxy')}")
# Do not echo any part of the token; notebook outputs are often saved and shared
print("Auth token: loaded")

In [None]:
# Connect to Spark
# Note: This requires a JupyterHub session to be active on BERDL
# The Spark Connect service runs on the cluster

spark = SparkSession.builder \
    .remote("sc://hub.berdl.kbase.us:443") \
    .config("spark.connect.grpc.http.proxy", "http://127.0.0.1:8123") \
    .config("spark.connect.grpc.channelBuilder", "netty") \
    .getOrCreate()

print("✅ Spark session created")
print(f"Spark version: {spark.version}")

# Test query
test_df = spark.sql("SHOW DATABASES")
databases = [row.namespace for row in test_df.collect()]
print(f"\n✅ Connected to BERDL")
print(f"Available databases: {len(databases)}")
print(f"Sample: {databases[:5]}")

## Step 4: Query eggNOG Annotations for Essential Gene Clusters

If no direct FB→pangenome mapping is available (see Step 2), we'll:
1. Use the gene names from essential families
2. Search for matching gene cluster descriptions in eggNOG annotations
3. Extract EC numbers for those clusters

**Alternative approach**: Query all gene clusters that are universally core (present in ≥95% of genomes across multiple species), as these are likely to include the essential genes.
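
The ≥95% presence rule can be written as a small predicate. The following is a minimal sketch; `is_core` and its argument names are hypothetical, and how presence counts are derived from the pangenome tables, or aggregated across species, is left open here.

```python
def is_core(n_genomes_with_cluster: int, n_genomes_total: int,
            threshold: float = 0.95) -> bool:
    """Return True if the cluster is present in at least `threshold` of genomes."""
    if n_genomes_total <= 0:
        raise ValueError("n_genomes_total must be positive")
    return n_genomes_with_cluster / n_genomes_total >= threshold


# Toy counts for a single species with 20 genomes
for present in (20, 19, 18):
    print(present, "of 20 genomes ->", is_core(present, 20))
```

Applying the predicate per species and then requiring it to hold in several species is one possible reading of "universally core"; the exact aggregation rule is a design choice.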

In [None]:
# Approach: Get all core gene clusters with EC annotations
# These are likely to include the essential genes

# First, let's explore the eggnog_mapper_annotations schema
eggnog_schema = spark.sql("DESCRIBE kbase_ke_pangenome.eggnog_mapper_annotations")
print("eggNOG annotations schema:")
eggnog_schema.show(50, truncate=False)

In [None]:
# Query eggNOG annotations for gene clusters with EC numbers
# Filter to clusters that have EC assignments

ec_annotations_query = """
SELECT 
    query_name as gene_cluster_id,
    EC,
    Description,
    COG_category,
    KEGG_ko,
    KEGG_Pathway,
    Preferred_name
FROM kbase_ke_pangenome.eggnog_mapper_annotations
WHERE EC IS NOT NULL 
  AND EC != '-'
LIMIT 100
"""

ec_sample = spark.sql(ec_annotations_query)
print("\nSample EC annotations:")
ec_sample.show(20, truncate=False)

## Step 5: Map EC Numbers to ModelSEED Reactions

Query the biochemistry database to map EC numbers to reactions.
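
One way to make the mapping explicit is an inverted index from EC number to reaction IDs. The following is a minimal sketch on toy rows. The reaction IDs and EC values are made up, `build_ec_index` is a hypothetical helper, and the assumption that `ec_numbers` may hold several ECs separated by `|`, `;`, or `,` should be checked against the schema and sample output of the cells below.

```python
import re
from collections import defaultdict

# Assumed delimiters for multi-valued EC fields; verify against real data
_EC_SPLIT = re.compile(r"[|;,]")


def build_ec_index(reactions):
    """Build {ec_number: set(reaction_ids)} from (reaction_id, ec_field) pairs."""
    index = defaultdict(set)
    for reaction_id, ec_field in reactions:
        if not ec_field:
            continue  # skip reactions without EC annotation
        for ec in _EC_SPLIT.split(ec_field):
            ec = ec.strip()
            if ec and ec != "-":
                index[ec].add(reaction_id)
    return dict(index)


# Toy rows, not real ModelSEED reactions
toy_reactions = [
    ("rxnA", "1.1.1.1"),
    ("rxnB", "1.1.1.1|2.7.1.2"),
    ("rxnC", None),
]
ec_index = build_ec_index(toy_reactions)
for ec, reaction_ids in sorted(ec_index.items()):
    print(ec, sorted(reaction_ids))
```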

In [None]:
# Explore biochemistry schema
biochem_tables = spark.sql("SHOW TABLES IN kbase_msd_biochemistry")
print("Biochemistry tables:")
biochem_tables.show(truncate=False)

In [None]:
# Check reaction table schema
reaction_schema = spark.sql("DESCRIBE kbase_msd_biochemistry.reaction")
print("Reaction table schema:")
reaction_schema.show(50, truncate=False)

In [None]:
# Sample reactions
reactions_sample = spark.sql("""
SELECT 
    id as reaction_id,
    name,
    abbreviation,
    equation,
    reversibility,
    ec_numbers,
    pathways
FROM kbase_msd_biochemistry.reaction
LIMIT 20
""")

print("Sample reactions:")
reactions_sample.show(20, truncate=False)

## Step 6: Join EC Annotations with Reactions

This tests cross-database joins via Spark Connect.
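
The join in the next cell handles comma-separated lists on the eggNOG side but treats each reaction's `ec_numbers` value as a single EC. If both sides can hold lists, token-level matching is one option. The following is a minimal local sketch; `split_ecs` and `shared_ecs` are hypothetical helpers, and the delimiter set is an assumption.

```python
def split_ecs(field, delimiters=",|;"):
    """Split a raw EC field into a set of EC tokens, ignoring blanks and '-'."""
    if not field:
        return set()
    # Normalize every assumed delimiter to the first one, then split once
    for delimiter in delimiters[1:]:
        field = field.replace(delimiter, delimiters[0])
    tokens = (token.strip() for token in field.split(delimiters[0]))
    return {token for token in tokens if token and token != "-"}


def shared_ecs(cluster_field, reaction_field):
    """Return the EC numbers present in both fields, compared as whole tokens."""
    return split_ecs(cluster_field) & split_ecs(reaction_field)


# Whole-token comparison keeps 1.1.1.1 distinct from 1.1.1.10
print(shared_ecs("1.1.1.1", "1.1.1.10"))
# A list on each side still matches on the one EC they share
print(shared_ecs("2.7.1.2,1.1.1.1", "1.1.1.1|3.5.1.4"))
```

In Spark, a comparable approach would be to apply `split` and `explode` to both EC columns and then join on equality, which also avoids the non-equi `LIKE` condition.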

In [None]:
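# Join eggNOG EC numbers with ModelSEED reactions
# Note: EC numbers in eggNOG may be comma-separated lists, hence the LIKE branch.
# The OR/LIKE condition is not an equi-join, so Spark may fall back to a
# nested-loop join; the LIMIT keeps this connectivity test small.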

ec_reaction_join = spark.sql("""
WITH eggnog_ec AS (
    SELECT 
        query_name as gene_cluster_id,
        EC as ec_number,
        Description,
        Preferred_name
    FROM kbase_ke_pangenome.eggnog_mapper_annotations
    WHERE EC IS NOT NULL AND EC != '-'
),
reactions AS (
    SELECT 
        id as reaction_id,
        name as reaction_name,
        abbreviation,
        equation,
        reversibility,
        ec_numbers,
        pathways
    FROM kbase_msd_biochemistry.reaction
    WHERE ec_numbers IS NOT NULL
)
SELECT 
    e.gene_cluster_id,
    e.ec_number,
    e.Description as gene_description,
    r.reaction_id,
    r.reaction_name,
    r.equation,
    r.reversibility,
    r.pathways
FROM eggnog_ec e
JOIN reactions r
    ON e.ec_number = r.ec_numbers
    OR CONCAT(',', e.ec_number, ',') LIKE CONCAT('%,', r.ec_numbers, ',%')
LIMIT 100
""")

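print("Sample EC → Reaction mappings:")
ec_reaction_join.show(20, truncate=False)

# Spark evaluates lazily: the join only runs when show() is called,
# so report success after it returns
print("\n✅ Cross-database join successful")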

## Step 7: Save Results

Save intermediate results for analysis in Notebook 02.

In [None]:
# Convert to pandas for local saving
ec_annotations_pd = ec_sample.toPandas()
ec_reaction_pd = ec_reaction_join.toPandas()

# Save
ec_annotations_pd.to_csv('../data/eggnog_ec_sample.tsv', sep='\t', index=False)
ec_reaction_pd.to_csv('../data/ec_reaction_mappings_sample.tsv', sep='\t', index=False)

print(f"✅ Saved {len(ec_annotations_pd):,} EC annotations")
print(f"✅ Saved {len(ec_reaction_pd):,} EC→reaction mappings")

## Summary

### Workflow Testing Results

✅ **MinIO download**: Successfully downloaded 185 MB from lakehouse  
✅ **Spark Connect**: Connected to BERDL from local machine via proxy  
✅ **Cross-database queries**: Joined pangenome (eggNOG) with biochemistry (ModelSEED)  
✅ **Data extraction**: Mapped a sample of EC numbers to reactions (up to 100 rows)  

### Next Steps (Notebook 02)

1. Refine the EC→reaction mapping (handle comma-separated EC lists)
2. Link essential gene families to their EC numbers
3. Identify universally essential reactions
4. Pathway enrichment analysis
5. Metabolic network analysis

In [None]:
# Clean up
spark.stop()
print("✅ Spark session closed")