# Create GenomeDataLakeTables Object

This notebook creates a GenomeDataLakeTables object for the Acinetobacter ADP1 pangenome
and saves it to KBase workspace 76990 (appdev).

**Workflow:**
1. Load overlapping genome data from ADP1 pangenome analysis
2. Upload SQLite database file to KBase handle service
3. Build PangenomeData structure for Acinetobacter ADP1
4. Create and save GenomeDataLakeTables object to workspace

**Input:**
- datacache/overlapping_genomes.json - Genome overlap data from skani analysis
- data/lims_mirror.db - SQLite database with pangenome tables

**Output:**
- GenomeDataLakeTables object "ADP1Test" saved to workspace 76990

## Step 1: Build PangenomeData Structure

Create the PangenomeData structure for Acinetobacter ADP1.
This includes uploading the SQLite database to the handle service.

**Key objectives:**
- Verify SQLite database file exists
- Upload SQLite file to KBase handle service
- Build PangenomeData dict with all required fields

**Input:**
- data/lims_mirror.db: SQLite database file

**Output:**
- `pangenome_data`: PangenomeData structure ready for saving
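
The first objective, verifying the SQLite file, can go a little further than an existence check. The sketch below is an illustration only: `inspect_sqlite_file` is a hypothetical helper, not part of `util.py`. It assumes the file is a non-empty SQLite 3 database, checks the 16-byte file header, and lists the tables using only the Python standard library.

```python
import sqlite3
from pathlib import Path

# Every non-empty SQLite 3 database file starts with this 16-byte header.
SQLITE_MAGIC = b"SQLite format 3\x00"


def inspect_sqlite_file(path):
    """Return the sorted table names in a SQLite file.

    Raises FileNotFoundError if the file is missing and ValueError if the
    header does not look like a SQLite 3 database.
    """
    db_path = Path(path).resolve()
    if not db_path.is_file():
        raise FileNotFoundError(f"SQLite file not found: {db_path}")

    with open(db_path, "rb") as fh:
        if fh.read(16) != SQLITE_MAGIC:
            raise ValueError(f"Not a SQLite 3 database: {db_path}")

    # Open read-only so the check cannot modify the file being uploaded.
    conn = sqlite3.connect(db_path.as_uri() + "?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Calling `inspect_sqlite_file(SQLITE_FILE)` before the upload would stop the cell on a missing or corrupt file and show which tables are about to be published.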

In [1]:
%run util.py

import os

print("Building PangenomeData structure for Acinetobacter ADP1...")
print("="*80)

# Configuration
WORKSPACE_ID = 76990
SQLITE_FILE = os.path.join(util.notebook_folder, 'data', 'lims_mirror.db')
PANGENOME_ID = "GCF_000368685.1"
PANGENOME_TAXONOMY = "Acinetobacter baylyi ADP1"

print(f"Workspace ID: {WORKSPACE_ID}")
print(f"SQLite file: {SQLITE_FILE}")
print(f"File exists: {os.path.exists(SQLITE_FILE)}")

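if not os.path.exists(SQLITE_FILE):
    # Fail early instead of attempting to upload a missing file
    raise FileNotFoundError(f"SQLite database not found: {SQLITE_FILE}")

file_size = os.path.getsize(SQLITE_FILE)
print(f"File size: {file_size:,} bytes")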

print()

# Upload SQLite file to handle service
print("Uploading SQLite database to KBase handle service...")
shock_id, handle_ref = util.upload_blob_file(SQLITE_FILE)
print(f"Handle reference: {handle_ref}")

# Build PangenomeData structure
pangenome_data = {
    "pangenome_id": PANGENOME_ID,
    "pangenome_taxonomy": PANGENOME_TAXONOMY,
    "user_genomes": ["76990/Acinetobacter_baylyi_ADP1"],  # One user genome, referenced by workspace/object name
    "datalake_genomes": ["GCF_000368685.1"],
    "sqllite_tables_handle_ref": handle_ref
}

print("\nPangenomeData structure:")
print("-" * 80)
print(f"  pangenome_id: {pangenome_data['pangenome_id']}")
print(f"  pangenome_taxonomy: {pangenome_data['pangenome_taxonomy']}")
print(f"  user_genomes: {len(pangenome_data['user_genomes'])} genomes")
print(f"  datalake_genomes: {len(pangenome_data['datalake_genomes'])} genomes")
print(f"  sqllite_tables_handle_ref: {pangenome_data['sqllite_tables_handle_ref']}")

# Save pangenome data
util.save('adp1_pangenome_data', pangenome_data)

print("\n" + "="*80)
print("PangenomeData saved to datacache.")

/Users/chenry/Dropbox/Projects/KBUtilLib/src


2025-12-14 08:59:19,823 - __main__.NotebookUtil - INFO - Loaded configuration from: /Users/chenry/.kbutillib/config.yaml
2025-12-14 08:59:19,823 - __main__.NotebookUtil - INFO - Loaded 0 tokens from /Users/chenry/.tokens
2025-12-14 08:59:19,824 - __main__.NotebookUtil - INFO - Loaded kbase tokens from /Users/chenry/.kbase/token
2025-12-14 08:59:19,824 - __main__.NotebookUtil - INFO - SKANI database cache: /Users/chenry/.kbutillib/skani_databases.json
  Via conda: conda install -c bioconda skani
  Via cargo: cargo install skani
  From source: https://github.com/bluenote-1577/skani
  Or set 'skani.executable' in config.yaml to the full path
2025-12-14 08:59:19,839 - __main__.NotebookUtil - INFO - Notebook environment detected
2025-12-14 08:59:19,840 - __main__.NotebookUtil - INFO - Uploading file to Shock: /Users/chenry/Dropbox/Projects/PangenomeAnalysis/notebooks/data/lims_mirror.db


Building PangenomeData structure for Acinetobacter ADP1...
Workspace ID: 76990
SQLite file: /Users/chenry/Dropbox/Projects/PangenomeAnalysis/notebooks/data/lims_mirror.db
File exists: True
File size: 7,364,608 bytes

Uploading SQLite database to KBase handle service...


2025-12-14 08:59:24,890 - __main__.NotebookUtil - INFO - File uploaded to Shock: c849023b-2ad9-450c-9804-f2d6229e2d1d, Handle: KBH_248028


Handle reference: KBH_248028

PangenomeData structure:
--------------------------------------------------------------------------------
  pangenome_id: GCF_000368685.1
  pangenome_taxonomy: Acinetobacter baylyi ADP1
  user_genomes: 1 genomes
  datalake_genomes: 1 genomes
  sqllite_tables_handle_ref: KBH_248028

PangenomeData saved to datacache.


## Step 2: Create and Save GenomeDataLakeTables Object

Create the GenomeDataLakeTables object and save it to KBase workspace 76990.

**Key objectives:**
- Build GenomeDataLakeTables with one pangenome (Acinetobacter ADP1)
- Save object as "ADP1Test" to workspace 76990
- Verify successful save

**Input:**
- `pangenome_data`: PangenomeData structure from Step 1

**Output:**
- `save_result`: Object info with workspace reference
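
Before saving, it can help to check that the object has the shape this notebook builds. The sketch below is an assumption-laden illustration: `find_structure_problems` is a hypothetical helper, the field lists are copied from the dicts constructed in this notebook rather than from the `KBaseFBA.GenomeDataLakeTables` type specification, and the `KBH_` prefix check is based only on the handle printed in Step 1.

```python
# Field names taken from the dicts built in this notebook (not from the type spec).
REQUIRED_TOP_LEVEL = ("name", "description", "genomeset_ref", "pangenome_data")
REQUIRED_PANGENOME = (
    "pangenome_id",
    "pangenome_taxonomy",
    "user_genomes",
    "datalake_genomes",
    "sqllite_tables_handle_ref",
)


def find_structure_problems(obj):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in obj:
            problems.append(f"missing top-level field: {key}")

    entries = obj.get("pangenome_data", [])
    if not isinstance(entries, list):
        problems.append("pangenome_data must be a list")
        return problems

    for i, entry in enumerate(entries):
        for key in REQUIRED_PANGENOME:
            if key not in entry:
                problems.append(f"pangenome_data[{i}] missing field: {key}")
        handle = entry.get("sqllite_tables_handle_ref")
        # Assumption: handle references look like the 'KBH_...' value seen in Step 1.
        if handle is not None and not str(handle).startswith("KBH_"):
            problems.append(f"pangenome_data[{i}] has unexpected handle: {handle}")
    return problems
```

Running `find_structure_problems(data_lake_tables)` just before the save call and stopping on a non-empty result would catch a missing field locally instead of relying on the workspace to reject the object.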

In [5]:
%run util.py

print("Creating GenomeDataLakeTables object...")
print("="*80)

# Load pangenome data from Step 1
pangenome_data = util.load('adp1_pangenome_data')

# Configuration
OBJECT_NAME = "ADP1Test"
WORKSPACE_ID = 76990
DESCRIPTION = "GenomeDataLakeTables for Acinetobacter ADP1 pangenome analysis. Contains SQLite tables with contextual data for related genomes identified via skani search."

print(f"Object name: {OBJECT_NAME}")
print(f"Workspace ID: {WORKSPACE_ID}")
print(f"Description: {DESCRIPTION}")
print()


data_lake_tables = {
    "name": OBJECT_NAME,
    "description": DESCRIPTION,
    "genomeset_ref": "76990/ADP1Set",
    "pangenome_data": [pangenome_data]
}

# Save the object to workspace
save_result = util.save_ws_object(
    objid=OBJECT_NAME,
    workspace=WORKSPACE_ID,
    obj_json=data_lake_tables,
    obj_type="KBaseFBA.GenomeDataLakeTables"
)

print("\nSave successful!")
print("-" * 80)

# Save result info
util.save('adp1_save_result', save_result)

print("\n" + "="*80)
print("GenomeDataLakeTables object saved successfully!")

2025-12-14 09:06:07,461 - __main__.NotebookUtil - INFO - Loaded configuration from: /Users/chenry/.kbutillib/config.yaml
2025-12-14 09:06:07,461 - __main__.NotebookUtil - INFO - Loaded 0 tokens from /Users/chenry/.tokens
2025-12-14 09:06:07,462 - __main__.NotebookUtil - INFO - Loaded kbase tokens from /Users/chenry/.kbase/token
2025-12-14 09:06:07,463 - __main__.NotebookUtil - INFO - SKANI database cache: /Users/chenry/.kbutillib/skani_databases.json
  Via conda: conda install -c bioconda skani
  Via cargo: cargo install skani
  From source: https://github.com/bluenote-1577/skani
  Or set 'skani.executable' in config.yaml to the full path
2025-12-14 09:06:07,476 - __main__.NotebookUtil - INFO - Notebook environment detected


/Users/chenry/Dropbox/Projects/KBUtilLib/src
Creating GenomeDataLakeTables object...
Object name: ADP1Test
Workspace ID: 76990
Description: GenomeDataLakeTables for Acinetobacter ADP1 pangenome analysis. Contains SQLite tables with contextual data for related genomes identified via skani search.


Save successful!
--------------------------------------------------------------------------------

GenomeDataLakeTables object saved successfully!


## Summary

This cell provides a summary of the GenomeDataLakeTables creation process.

**Key objectives:**
- Load and display all results
- Verify object was created correctly
- Display link to view object in KBase

**Input:**
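- Cached results from previous steps (`adp1_pangenome_data`, `adp1_save_result`)
- `adp1_genome_info`: not written by this notebook, so it must already exist in the datacache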

**Output:**
- Summary display with KBase link
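
The summary cell reads `name`, `workspace_ref`, and `type` from `save_result`, but the return format of `util.save_ws_object` is not shown in this notebook. As a sketch under one assumption, namely that the save returns a standard KBase Workspace `object_info` tuple (object ID, name, type, save date, version, saved by, workspace ID, workspace name, checksum, size, metadata), the hypothetical helper below converts such a tuple into a dict with those three keys.

```python
def summarize_object_info(info):
    """Convert a Workspace object_info tuple into a small summary dict.

    Assumes the standard 11-element layout:
    (objid, name, type, save_date, version, saved_by,
     wsid, workspace, chsum, size, meta)
    """
    if len(info) < 11:
        raise ValueError(f"expected an 11-element object_info, got {len(info)} items")
    objid, name, obj_type, _save_date, version, _saved_by, wsid = info[:7]
    return {
        "name": name,
        "type": obj_type,
        # Fully versioned reference in the form wsid/objid/version
        "workspace_ref": f"{wsid}/{objid}/{version}",
    }
```

If `save_result` turns out to already be a dict with these keys, no conversion is needed and this helper can be skipped.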

In [None]:
%run util.py

# Load all results
genome_info = util.load('adp1_genome_info')
pangenome_data = util.load('adp1_pangenome_data')
save_result = util.load('adp1_save_result')

print("GenomeDataLakeTables Creation Summary")
print("="*80)

print("\n1. Genome Data:")
print(f"   Datalake genomes: {len(genome_info['datalake_genomes'])}")
print(f"   Taxonomy: {genome_info['taxonomy']}")
print(f"   Overlap rate: {genome_info['overlap_stats']['overlap_rate']*100:.1f}%")

print("\n2. PangenomeData:")
print(f"   ID: {pangenome_data['pangenome_id']}")
print(f"   Taxonomy: {pangenome_data['pangenome_taxonomy']}")
print(f"   Handle ref: {pangenome_data['sqllite_tables_handle_ref']}")

print("\n3. Saved Object:")
print(f"   Name: {save_result['name']}")
print(f"   Reference: {save_result['workspace_ref']}")
print(f"   Type: {save_result['type']}")

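# Generate KBase narrative link
# NOTE: the notebook header places workspace 76990 on appdev; this production
# host may need to be replaced with the appdev host for the link to resolve.
kbase_url = f"https://narrative.kbase.us/#data/{save_result['workspace_ref']}"
print("\n4. View in KBase:")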
print(f"   {kbase_url}")

print("\n" + "="*80)
print("GenomeDataLakeTables creation complete!")