## CDM ONTOLOGIES WORKFLOW

### Full workflow

In [None]:
%%time
import sys
import os
# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)
# Import all required functions
from analyze_core_ontologies import analyze_core_ontologies
import analyze_non_core_ontologies
from create_pseudo_base_ontology import create_pseudo_base_ontologies
from merge_ontologies import merge_ontologies
from create_semantic_sql_db import create_semantic_sql_db
from extract_sql_tables_to_tsv import extract_sql_tables_to_tsv

# Print workflow start
print("Starting CDM Ontologies Workflow...")

print("\n1. Analyzing Core Ontologies...")
analyze_core_ontologies(repo_path)

print("\n2. Analyzing Non-Core Ontologies...")
analyze_non_core_ontologies.analyze_non_core_ontologies(repo_path)

print("\n3. Creating Pseudo Base Ontologies...")
create_pseudo_base_ontologies(repo_path)

print("\n4. Merging Ontologies...")
if not merge_ontologies(repo_path):
    raise Exception("Ontology merge failed")

print("\n5. Creating Semantic SQL Database...")
if not create_semantic_sql_db(repo_path):
    raise Exception("Database creation failed")

print("\n6. Extracting SQL Tables to TSV...")
if not extract_sql_tables_to_tsv(repo_path):
    raise Exception("TSV extraction failed")

print("\nWorkflow completed successfully!")

Starting CDM Ontologies Workflow...

1. Analyzing Core Ontologies...

Analysis Results:
Skipping download, envo.owl already present

File: envo.owl
  Has imports: No
  Ontology IRI: http://purl.obolibrary.org/obo/envo.owl
  Own terms: 4385
  External terms: 3043
  Classification: Non-Base. Base version available in OBO Foundry: http://purl.obolibrary.org/obo/envo/envo-base.owl.
  External Terms Subject of Triples? Yes
  Number of external terms that are subjects of triples: 2984
  First 5 external terms that are subject of triples:
    http://purl.obolibrary.org/obo/BFO_0000001
    http://purl.obolibrary.org/obo/BFO_0000002
    http://purl.obolibrary.org/obo/BFO_0000003
    http://purl.obolibrary.org/obo/BFO_0000004
    http://purl.obolibrary.org/obo/BFO_0000006
  First 5 own terms:
    http://purl.obolibrary.org/obo/ENVO_00000000
    http://purl.obolibrary.org/obo/ENVO_00000001
    http://purl.obolibrary.org/obo/ENVO_00000002
    http://purl.obolibrary.org/obo/ENVO_00000003
    http:/

### Workflow in separate steps

#### Analyze and Download Core Ontologies

In [9]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import and run the analysis
from analyze_core_ontologies import analyze_core_ontologies

# Run the analysis
analyze_core_ontologies(repo_path)


Analysis Results:
Downloading envo.owl...
Successfully downloaded: envo.owl

File: envo.owl
  Has imports: No
  Ontology IRI: http://purl.obolibrary.org/obo/envo.owl
  Own terms: 4514
  External terms: 2858
  Classification: Non-Base. Base version available in OBO Foundry: http://purl.obolibrary.org/obo/envo/envo-base.owl.
  External Terms Subject of Triples? Yes
  Number of external terms that are subjects of triples: 2807
  First 5 external terms that are subject of triples:
    http://purl.obolibrary.org/obo/BFO_0000001
    http://purl.obolibrary.org/obo/BFO_0000002
    http://purl.obolibrary.org/obo/BFO_0000003
    http://purl.obolibrary.org/obo/BFO_0000004
    http://purl.obolibrary.org/obo/BFO_0000006
  First 5 own terms:
    http://purl.obolibrary.org/obo/ENVO_00000000
    http://purl.obolibrary.org/obo/ENVO_00000001
    http://purl.obolibrary.org/obo/ENVO_00000002
    http://purl.obolibrary.org/obo/ENVO_00000003
    http://purl.obolibrary.org/obo/ENVO_00000004
  First 5 extern
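
The report above separates an ontology's own terms from external ones. A minimal sketch of how such a split could be computed, assuming terms are classified by whether their IRI starts with the ontology's base IRI (for ENVO, `http://purl.obolibrary.org/obo/ENVO_`). `split_terms` is a hypothetical helper for illustration and is not the function used inside `analyze_core_ontologies`.

```python
def split_terms(term_iris, base_iri):
    """Partition term IRIs into (own, external) by base-IRI prefix."""
    unique = set(term_iris)
    own = sorted(iri for iri in unique if iri.startswith(base_iri))
    external = sorted(iri for iri in unique if not iri.startswith(base_iri))
    return own, external


# Tiny illustrative input taken from the IRIs shown in the report
terms = [
    "http://purl.obolibrary.org/obo/ENVO_00000000",
    "http://purl.obolibrary.org/obo/ENVO_00000001",
    "http://purl.obolibrary.org/obo/BFO_0000001",
]
own, external = split_terms(terms, "http://purl.obolibrary.org/obo/ENVO_")
print(f"Own terms: {len(own)}")
print(f"External terms: {len(external)}")
```

A prefix test like this only works when every term of the ontology shares one base IRI, which is the same assumption the `--base-iri` option relies on in the pseudo base step.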

#### Analyze and Download Non-Core Ontologies

In [10]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import and run the analysis
import analyze_non_core_ontologies

# Run the analysis
analyze_non_core_ontologies.analyze_non_core_ontologies(repo_path)


Processing external terms from core ontologies...
Downloading base version of ro...
Successfully downloaded: ro-base.owl
Downloading base version of pato...
Successfully downloaded: pato-base.owl
Downloading base version of obi...
Successfully downloaded: obi-base.owl
Downloading base version of fao...
Successfully downloaded: fao-base.owl
Downloading regular version of po...
Successfully downloaded: po.owl
Downloading base version of pco...
Successfully downloaded: pco-base.owl
Downloading regular version of iao...
Successfully downloaded: iao.owl
Downloading base version of uberon...
Successfully downloaded: uberon-base.owl
Downloading regular version of omo...
Successfully downloaded: omo.owl
Downloading regular version of bfo...
Successfully downloaded: bfo.owl
Downloading regular version of foodon...
Successfully downloaded: foodon.owl

Updating ontologies.txt...

Processing Additional OBO Foundry, PyOBO and In-house ontologies...
Downloading uo.owl...
Successfully downloaded: uo

#### Recreate pseudo base versions

In [11]:
%%time
import sys
import os
# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)
# Import and run the pseudo base creation
from create_pseudo_base_ontology import create_pseudo_base_ontologies
# Run the creation
create_pseudo_base_ontologies(repo_path)

Using ROBOT at: /scratch/jplfaria/install_stuff/robot/robot
Processing po.owl...
Using base IRI: http://purl.obolibrary.org/obo/PO_
Executing command:
robot remove --input /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/non-base-ontologies/po.owl --base-iri http://purl.obolibrary.org/obo/PO_ --axioms external --preserve-structure false --trim false remove --select imports --trim false --output /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/po-base.owl
Created base version for po.owl: po-base.owl
Processing iao.owl...
Using base IRI: http://purl.obolibrary.org/obo/IAO_
Executing command:
robot remove --input /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/non-base-ontologies/iao.owl --base-iri http://purl.obolibrary.org/obo/IAO_ --axioms external --preserve-structure false --trim false remove --select imports --trim false --output /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/iao-base.owl
Created base version for iao.owl: iao-base.owl
Processing om

True
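
The logged ROBOT calls chain two `remove` steps: the first drops axioms external to the base IRI, the second drops import declarations. A sketch of assembling that argument list for `subprocess.run`, mirroring the flags in the log above. `build_pseudo_base_command` is a hypothetical helper, and the relative file names in the example are placeholders rather than the paths used by `create_pseudo_base_ontologies`.

```python
def build_pseudo_base_command(input_path, base_iri, output_path, robot="robot"):
    """Return the ROBOT argument list that strips external axioms and imports."""
    return [
        robot, "remove",
        "--input", input_path,
        "--base-iri", base_iri,
        "--axioms", "external",
        "--preserve-structure", "false",
        "--trim", "false",
        # Second chained command: drop the owl:imports declarations
        "remove",
        "--select", "imports",
        "--trim", "false",
        "--output", output_path,
    ]


cmd = build_pseudo_base_command(
    "po.owl", "http://purl.obolibrary.org/obo/PO_", "po-base.owl"
)
print(" ".join(cmd))
```

Passing the list (rather than a joined string) to `subprocess.run(cmd, check=True)` avoids shell quoting problems with paths that contain spaces.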

#### Analyze the prefixes

In [6]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import the prefix analyzer
from analyze_prefixes import analyze_all_ontologies, generate_prefix_mapping

# Define input directory path
input_dir = os.path.join(repo_path, 'ontology_data_owl_core')

# Run the analysis
print(f"Analyzing ontologies in: {input_dir}")
results = analyze_all_ontologies(input_dir)

# Generate and save prefix mapping
mapping_content = generate_prefix_mapping(results)
mapping_file = os.path.join(repo_path, 'prefix_mapping.txt')

with open(mapping_file, 'w') as f:
    f.write(mapping_content)

print(f"\nPrefix mapping file generated at: {mapping_file}")

# Print summary of analysis
print("\nSummary of analysis:")
for filename, data in results.items():
    print(f"\n{filename}:")
    print(f"  Declared prefixes: {len(data['prefixes'])}")
    additional_prefixes = set(data['prefix_to_iris'].keys()) - data['prefixes']
    print(f"  Potential additional prefixes needed: {len(additional_prefixes)}")
    if additional_prefixes:
        print("  Additional prefixes:")
        for prefix in sorted(additional_prefixes):
            iris = data['prefix_to_iris'][prefix]
            if iris:
                print(f"    - {prefix}: {next(iter(iris))}")

Analyzing ontologies in: /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_core

Analyzing ncbitaxon.owl...



KeyboardInterrupt



#### Merge Ontologies

In [8]:
%%time
import sys
import os
import subprocess
from threading import Thread
import time
from pathlib import Path
from datetime import datetime
import logging

# Set up logging
logging.basicConfig(
    filename='merge_memory.log',
    level=logging.INFO,
    format='%(asctime)s - %(message)s'
)

# Initialize peak memory tracking
peak_memory_gb = 0
last_logged_memory = 0

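def parse_memory_output(mem_line):
    """Return used memory in GiB from the 'Mem:' line of `free -h`."""
    used_mem = mem_line.split()[2]  # columns: label, total, used, ...
    # Scale the binary-unit suffix to GiB; unknown suffixes are treated as GiB
    scale = {'Ki': 1 / 1024**2, 'Mi': 1 / 1024, 'Gi': 1, 'Ti': 1024}
    return float(used_mem[:-2]) * scale.get(used_mem[-2:], 1)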

def monitor_memory():
    global peak_memory_gb, last_logged_memory
    while True:
        try:
            mem_info = subprocess.check_output(['free', '-h']).decode('utf-8').split('\n')
            current_memory_gb = parse_memory_output(mem_info[1])
            peak_memory_gb = max(peak_memory_gb, current_memory_gb)
            
            # Log if memory changed by more than 10GB
            if abs(current_memory_gb - last_logged_memory) > 10:
                logging.info(f"Memory: {current_memory_gb:.0f}GB (Peak: {peak_memory_gb:.0f}GB)")
                last_logged_memory = current_memory_gb
            
            time.sleep(60)
        except Exception:
            # Stop monitoring if `free` fails or its output cannot be parsed
            break

# Start memory monitoring in background
monitor_thread = Thread(target=monitor_memory, daemon=True)
monitor_thread.start()

# Run merge
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)
from merge_ontologies import merge_ontologies

print("Starting ontology merge - Check merge_memory.log for memory usage details")
logging.info("Starting ontology merge")

merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl',
    output_filename='ontology_data_owl.owl'
)

# Log final statistics
logging.info(f"Merge completed - Final peak memory: {peak_memory_gb:.0f}GB")
print(f"\nMerge completed - Peak memory usage: {peak_memory_gb:.0f}GB")

Starting ontology merge - Check merge_memory.log for memory usage details
Using ROBOT at: /scratch/jplfaria/install_stuff/robot/robot
Looking for ontology files in: /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl
Found 30 ontology files:
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/kegg.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/metacyc.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/modelseed.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/envo.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/go.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/ncbitaxon.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/chebi.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/cl-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/pato-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/caro-base.owl
  - /scratch/jplfaria/KBase_
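
The monitoring thread in the cell above loops forever, so re-running the cell starts another monitor alongside the old one. One possible alternative, sketched here under the assumption that a stoppable monitor is wanted: a context manager that uses `threading.Event` to end the loop when the merge finishes. `PeakTracker` and the `sample` callable are hypothetical names; in the notebook, `sample` could wrap the `free -h` parsing shown above.

```python
import threading


class PeakTracker:
    """Track the peak of a sampled value in a background thread."""

    def __init__(self, sample, interval=60.0):
        self._sample = sample        # callable returning a number, e.g. used GiB
        self._interval = interval    # seconds between samples
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak = 0.0

    def _run(self):
        while True:
            self.peak = max(self.peak, self._sample())
            # wait() returns True as soon as stop is set, ending the loop
            if self._stop.wait(self._interval):
                break

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        # Take one last reading so the final state is not missed
        self.peak = max(self.peak, self._sample())
        return False


# Fake readings stand in for real memory samples in this illustration
readings = iter([12.0, 48.0])
with PeakTracker(lambda: next(readings), interval=60.0) as tracker:
    pass  # the long-running merge call would go here
print(f"Peak: {tracker.peak:.0f}GB")
```

Because `Event.wait` doubles as the sleep, leaving the `with` block stops the thread immediately instead of waiting out the 60-second interval.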

In [7]:
%%time
import sys
import os
import subprocess
from threading import Thread
import time
from pathlib import Path
from datetime import datetime
import logging

# Set up logging
logging.basicConfig(
    filename='merge_memory.log',
    level=logging.INFO,
    format='%(asctime)s - %(message)s'
)

# Initialize peak memory tracking
peak_memory_gb = 0
last_logged_memory = 0

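def parse_memory_output(mem_line):
    """Return used memory in GiB from the 'Mem:' line of `free -h`."""
    used_mem = mem_line.split()[2]  # columns: label, total, used, ...
    # Scale the binary-unit suffix to GiB; unknown suffixes are treated as GiB
    scale = {'Ki': 1 / 1024**2, 'Mi': 1 / 1024, 'Gi': 1, 'Ti': 1024}
    return float(used_mem[:-2]) * scale.get(used_mem[-2:], 1)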

def monitor_memory():
    global peak_memory_gb, last_logged_memory
    while True:
        try:
            mem_info = subprocess.check_output(['free', '-h']).decode('utf-8').split('\n')
            current_memory_gb = parse_memory_output(mem_info[1])
            peak_memory_gb = max(peak_memory_gb, current_memory_gb)
            
            # Log if memory changed by more than 10GB
            if abs(current_memory_gb - last_logged_memory) > 10:
                logging.info(f"Memory: {current_memory_gb:.0f}GB (Peak: {peak_memory_gb:.0f}GB)")
                last_logged_memory = current_memory_gb
            
            time.sleep(60)
        except Exception:
            # Stop monitoring if `free` fails or its output cannot be parsed
            break

# Start memory monitoring in background
monitor_thread = Thread(target=monitor_memory, daemon=True)
monitor_thread.start()

# Run merge
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)
from merge_ontologies_v2 import merge_ontologies

print("Starting ontology merge - Check merge_memory.log for memory usage details")
logging.info("Starting ontology merge")

merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl',
    output_filename='ontology_data_owl_prefix_EC_UP.owl'
)

# Log final statistics
logging.info(f"Merge completed - Final peak memory: {peak_memory_gb:.0f}GB")
print(f"\nMerge completed - Peak memory usage: {peak_memory_gb:.0f}GB")

Starting ontology merge - Check merge_memory.log for memory usage details
Found 30 ontology files:
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/kegg.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/metacyc.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/modelseed.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/envo.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/go.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/ncbitaxon.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/chebi.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/cl-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/pato-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/caro-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/uberon-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/fao-base.owl
  - /scratch/jplfaria/KBa

In [1]:
%%time
import sys
import os
import subprocess
from threading import Thread
import time
from pathlib import Path
from datetime import datetime
import logging

# Set up logging
logging.basicConfig(
    filename='merge_memory.log',
    level=logging.INFO,
    format='%(asctime)s - %(message)s'
)

# Initialize peak memory tracking
peak_memory_gb = 0
last_logged_memory = 0

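def parse_memory_output(mem_line):
    """Return used memory in GiB from the 'Mem:' line of `free -h`."""
    used_mem = mem_line.split()[2]  # columns: label, total, used, ...
    # Scale the binary-unit suffix to GiB; unknown suffixes are treated as GiB
    scale = {'Ki': 1 / 1024**2, 'Mi': 1 / 1024, 'Gi': 1, 'Ti': 1024}
    return float(used_mem[:-2]) * scale.get(used_mem[-2:], 1)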

def monitor_memory():
    global peak_memory_gb, last_logged_memory
    while True:
        try:
            mem_info = subprocess.check_output(['free', '-h']).decode('utf-8').split('\n')
            current_memory_gb = parse_memory_output(mem_info[1])
            peak_memory_gb = max(peak_memory_gb, current_memory_gb)
            
            # Log if memory changed by more than 10GB
            if abs(current_memory_gb - last_logged_memory) > 10:
                logging.info(f"Memory: {current_memory_gb:.0f}GB (Peak: {peak_memory_gb:.0f}GB)")
                last_logged_memory = current_memory_gb
            
            time.sleep(60)
        except Exception:
            # Stop monitoring if `free` fails or its output cannot be parsed
            break

# Start memory monitoring in background
monitor_thread = Thread(target=monitor_memory, daemon=True)
monitor_thread.start()

# Run merge
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)
from merge_ontologies_v3 import merge_ontologies

print("Starting ontology merge - Check merge_memory.log for memory usage details")
logging.info("Starting ontology merge")

merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl',
    output_filename='ontology_data_owl_set_prefixes.owl'
)

# Log final statistics
logging.info(f"Merge completed - Final peak memory: {peak_memory_gb:.0f}GB")
print(f"\nMerge completed - Peak memory usage: {peak_memory_gb:.0f}GB")

Starting ontology merge - Check merge_memory.log for memory usage details
Found 30 ontology files:
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/kegg.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/metacyc.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/modelseed.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/envo.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/go.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/ncbitaxon.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/chebi.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/cl-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/pato-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/caro-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/uberon-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/fao-base.owl
  - /scratch/jplfaria/KBa


KeyboardInterrupt



In [2]:
%%time
import sys
import os
import subprocess

repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
output_dir = os.path.join(repo_path, 'outputs')
intermediate_dir = os.path.join(output_dir, 'intermediary_ontology')

# Input file (your existing prefix_fixed.owl)
intermediate_file2 = os.path.join(intermediate_dir, 'prefix_fixed.owl')

# Output file
output_file = os.path.join(output_dir, 'ontology_data_owl_set_prefixes.owl')

# Step 3: Remove the oio:id annotation property with `robot remove`
filter_command = [
    'robot', 'remove',
    '--input', intermediate_file2,
    '--term', 'oio:id',
    '--select', 'annotation-properties',
    '--output', output_file
]

print("\nExecuting remove command:")
print(f"{' '.join(filter_command)}")

try:
    subprocess.run(filter_command, check=True, capture_output=True, text=True)
    print(f"\nSuccessfully created final ontology at {output_file}")
except subprocess.CalledProcessError as e:
    print(f"\nError executing ROBOT command:")
    print(f"Return code: {e.returncode}")
    if e.stdout:
        print("STDOUT:", e.stdout)
    if e.stderr:
        print("STDERR:", e.stderr)


Executing remove command:
robot remove --input /scratch/jplfaria/KBase_CDM_Ontologies/outputs/intermediary_ontology/prefix_fixed.owl --term oio:id --select annotation-properties --output /scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl_set_prefixes.owl

Successfully created final ontology at /scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl_set_prefixes.owl
CPU times: user 222 ms, sys: 191 ms, total: 414 ms
Wall time: 1h 7min 21s


In [None]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import and run the merge
from merge_ontologies_v4 import merge_ontologies

# Run the merge
merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl_test',  # Your input directory
    output_filename='test_merged_ontologies.owl'  # Your output filename
)

In [9]:
import os
from rdflib import Graph
import csv
from typing import Dict, Tuple

def extract_prefixes_from_ontology(file_path: str) -> dict:
    """Extract prefixes from a single ontology file."""
    g = Graph()
    # RDF/XML for .owl and .rdf files; anything else is attempted as Turtle
    g.parse(file_path, format="xml" if file_path.endswith((".owl", ".rdf")) else "turtle")
    return {prefix: str(namespace) for prefix, namespace in g.namespaces()}

def extract_prefixes_from_directory(directory: str) -> Tuple[Dict[str, Dict[str, str]], Dict[str, str]]:
    """
    Extract prefixes from all ontology files in a directory and track per-file prefixes.
    
    Returns:
        Tuple containing:
        - Dictionary mapping filenames to their prefix dictionaries
        - Combined dictionary of all prefixes
    """
    per_file_prefixes = {}
    all_prefixes = {}
    
    for filename in os.listdir(directory):
        if filename.endswith((".owl", ".ttl", ".ofn", ".rdf")):
            file_path = os.path.join(directory, filename)
            print(f"Processing {file_path}...")
            try:
                file_prefixes = extract_prefixes_from_ontology(file_path)
                per_file_prefixes[filename] = file_prefixes
                all_prefixes.update(file_prefixes)
            except Exception as e:
                print(f"Error processing {file_path}: {e}")
    
    return per_file_prefixes, all_prefixes

def save_prefixes_to_csv(prefixes: dict, output_file: str, include_source: bool = False):
    """Save prefixes to a CSV file."""
    with open(output_file, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        if include_source:
            writer.writerow(["Source", "Prefix", "Namespace"])
            for source, prefix_dict in prefixes.items():
                for prefix, namespace in sorted(prefix_dict.items()):
                    writer.writerow([source, prefix, namespace])
        else:
            writer.writerow(["Prefix", "Namespace"])
            for prefix, namespace in sorted(prefixes.items()):
                writer.writerow([prefix, namespace])

def analyze_prefixes(input_dir: str, merged_file: str, output_dir: str):
    """
    Analyze prefixes in both individual ontologies and merged ontology.
    
    Args:
        input_dir: Directory containing individual ontologies
        merged_file: Path to the merged ontology file
        output_dir: Directory for output files
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Analyze individual ontologies
    print("\nAnalyzing individual ontologies...")
    per_file_prefixes, all_prefixes = extract_prefixes_from_directory(input_dir)
    
    # Save individual ontology prefixes
    individual_output = os.path.join(output_dir, "individual_prefixes.csv")
    save_prefixes_to_csv(per_file_prefixes, individual_output, include_source=True)
    print(f"Individual ontology prefixes saved to {individual_output}")
    
    # Save combined prefixes from all individual ontologies
    combined_output = os.path.join(output_dir, "combined_prefixes.csv")
    save_prefixes_to_csv(all_prefixes, combined_output)
    print(f"Combined prefixes saved to {combined_output}")
    
    # Analyze merged ontology
    if os.path.exists(merged_file):
        print("\nAnalyzing merged ontology...")
        try:
            merged_prefixes = extract_prefixes_from_ontology(merged_file)
            merged_output = os.path.join(output_dir, "merged_prefixes.csv")
            save_prefixes_to_csv(merged_prefixes, merged_output)
            print(f"Merged ontology prefixes saved to {merged_output}")
            
            # Compare prefixes
            print("\nPrefix Analysis:")
            original_prefixes = set(all_prefixes.keys())
            merged_prefix_set = set(merged_prefixes.keys())
            
            print(f"\nTotal prefixes in individual ontologies: {len(original_prefixes)}")
            print(f"Total prefixes in merged ontology: {len(merged_prefix_set)}")
            
            lost_prefixes = original_prefixes - merged_prefix_set
            new_prefixes = merged_prefix_set - original_prefixes
            
            if lost_prefixes:
                print(f"\nPrefixes lost after merging ({len(lost_prefixes)}):")
                for prefix in sorted(lost_prefixes):
                    print(f"  - {prefix}: {all_prefixes[prefix]}")
            
            if new_prefixes:
                print(f"\nNew prefixes in merged ontology ({len(new_prefixes)}):")
                for prefix in sorted(new_prefixes):
                    print(f"  - {prefix}: {merged_prefixes[prefix]}")
                    
        except Exception as e:
            print(f"Error analyzing merged ontology: {e}")
    else:
        print(f"\nMerged ontology file not found: {merged_file}")

if __name__ == "__main__":
    # Configuration
    input_dir = "/scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl"
    merged_file = "/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl.owl"
    output_dir = "/scratch/jplfaria/KBase_CDM_Ontologies/outputs/prefix_analysis"
    
    analyze_prefixes(input_dir, merged_file, output_dir)


Analyzing individual ontologies...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/kegg.owl...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/metacyc.owl...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/modelseed.owl...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/envo.owl...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/go.owl...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/ncbitaxon.owl...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/chebi.owl...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/cl-base.owl...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/pato-base.owl...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/caro-base.owl...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/uberon-base.owl...
Processing /scratch/jplfaria/KBase_CDM_Ontologies/on

In [2]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import and run the merge
from merge_ontologies import merge_ontologies

# Run the merge with custom input directory and output filename
merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl_core',  # Custom input directory
    output_filename='ontology_data_owl_core.owl'    # Custom output filename
)

Using ROBOT at: /cdm_shared_workspace/user_shared_workspace/Jose/install_stuff/robot/robot
Looking for ontology files in: /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontology_data_owl_core
Found 4 ontology files:
  - /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontology_data_owl_core/ncbitaxon.owl
  - /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontology_data_owl_core/envo.owl
  - /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontology_data_owl_core/chebi.owl
  - /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontology_data_owl_core/go.owl
Created list of merged ontologies at: /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontologies_merged.txt
Saving output to: /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/ontology_data_owl_core.owl

Executing command:
robot merge --annotate-defined-by true --input /cdm_shared_w

True

In [13]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import the merge function
from merge_ontologies_custom_order import merge_ontologies

# Define the desired order for merging (filenames must be present in the input directory)
desired_order = [
    'ncbitaxon.owl',
    'go.owl',
    'envo.owl',
    'chebi.owl',
    'bfo-base.owl',
    'caro-base.owl',
    'cl-base.owl',
    'fao-base.owl',
    'foodon-base.owl',
    'iao-base.owl',
    'obi-base.owl',
    'omo-base.owl',
    'pato-base.owl',
    'pco-base.owl',
    'po-base.owl',
    'ro-base.owl',
    'so-base.owl',
    'taxrank-base.owl',
    'uberon-base.owl',
    'uo-base.owl',
    'gtdb.owl',
    'eccode.owl',
    'pfam.owl',
    'rhea.owl',
    'credit.owl',
    'ror.owl',
    'interpro.owl',
    'seed.owl',
    'modelseed.owl',
    'kegg.owl',
    'metacyc.owl'
]

# Run the merge with custom input directory, output filename, and the specified file order
merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl',  # Custom input directory
    output_filename='ontology_data_owl_31.owl',  # Custom output filename
    ontology_order=desired_order
)

Using ROBOT at: /scratch/jplfaria/install_stuff/robot/robot
Looking for ontology files in: /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl
Found 31 ontology files:
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/ncbitaxon.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/go.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/envo.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/chebi.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/bfo-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/caro-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/cl-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/fao-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/foodon-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/iao-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/obi-base.owl
  - /scratch/jplfaria/

True
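
The comment in the cell above notes that every filename in `desired_order` must be present in the input directory. A sketch of checking this before a long merge starts; `resolve_order` is a hypothetical helper, and how `merge_ontologies_custom_order` itself handles missing files is not shown in this notebook.

```python
import os
import tempfile


def resolve_order(input_dir, ontology_order):
    """Return full paths in the requested order, failing fast on missing files."""
    present = set(os.listdir(input_dir))
    missing = [name for name in ontology_order if name not in present]
    if missing:
        raise FileNotFoundError(f"Not found in {input_dir}: {', '.join(missing)}")
    return [os.path.join(input_dir, name) for name in ontology_order]


# Demonstrate on a throwaway directory with two empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("go.owl", "chebi.owl"):
        open(os.path.join(tmp, name), "w").close()

    paths = resolve_order(tmp, ["go.owl", "chebi.owl"])
    ordered = [os.path.basename(p) for p in paths]

    try:
        resolve_order(tmp, ["go.owl", "rhea.owl"])
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

print(ordered, missing_detected)
```

Failing before ROBOT is invoked matters here because a merge of this size can run for over an hour.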

In [15]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import the merge function
from merge_ontologies_custom_order import merge_ontologies

# Define the desired order for merging (filenames must be present in the input directory)
desired_order = [
    'go.owl',
    'chebi.owl',
    'eccode.owl',
    'rhea.owl',
    'modelseed.owl',
    'kegg.owl',
    'metacyc.owl'
]

# Run the merge with custom input directory, output filename, and the specified file order
merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl',  # Custom input directory
    output_filename='7_biochemistry.owl',  # Custom output filename
    ontology_order=desired_order
)

Using ROBOT at: /scratch/jplfaria/install_stuff/robot/robot
Looking for ontology files in: /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl
Found 7 ontology files:
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/go.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/chebi.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/eccode.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/rhea.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/modelseed.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/kegg.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/metacyc.owl
Created list of merged ontologies at: /scratch/jplfaria/KBase_CDM_Ontologies/ontologies_merged.txt
Saving output to: /scratch/jplfaria/KBase_CDM_Ontologies/outputs/7_biochemistry.owl

Executing command:
robot merge --annotate-defined-by true --input /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/go.owl --input /scratc

True
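
The logged command starts with `robot merge --annotate-defined-by true` followed by one `--input` per ontology, in the requested order. A sketch of building that argument list from an ordered set of paths. `build_merge_command` is a hypothetical helper, and the trailing `--output` argument is an assumption, since the logged command is cut off before its end.

```python
def build_merge_command(input_paths, output_path, robot="robot"):
    """Return a ROBOT merge argument list that preserves the input order."""
    cmd = [robot, "merge", "--annotate-defined-by", "true"]
    for path in input_paths:
        cmd += ["--input", path]
    # Assumed: the merged ontology is written with --output
    cmd += ["--output", output_path]
    return cmd


# Placeholder file names; the notebook uses absolute paths
cmd = build_merge_command(["go.owl", "chebi.owl"], "7_biochemistry.owl")
print(" ".join(cmd))
```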

In [18]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import the merge function
from merge_ontologies_custom_order import merge_ontologies

# Define the desired order for merging (filenames must be present in the input directory)
desired_order = [
    'ncbitaxon.owl',
    'go.owl',
    'envo.owl',
    'chebi.owl',
    'gtdb.owl',
    'rhea.owl',
    'eccode.owl',
    'modelseed.owl'
]

# Run the merge with custom input directory, output filename, and the specified file order
merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl',  # Custom input directory
    output_filename='core_plus_ec_rhea_gtdb_modelseed.owl',  # Custom output filename
    ontology_order=desired_order
)

Using ROBOT at: /scratch/jplfaria/install_stuff/robot/robot
Looking for ontology files in: /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl
Found 8 ontology files:
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/ncbitaxon.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/go.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/envo.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/chebi.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/gtdb.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/rhea.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/eccode.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/modelseed.owl
Created list of merged ontologies at: /scratch/jplfaria/KBase_CDM_Ontologies/ontologies_merged.txt
Saving output to: /scratch/jplfaria/KBase_CDM_Ontologies/outputs/core_plus_ec_rhea_gtdb_modelseed.owl

Executing command:
robot merge --annotate-defined-by tr

True
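
The comment in the cells above notes that every filename in `desired_order` must be present in the input directory. A minimal sketch of a pre-flight check for that condition follows. The helper name `missing_ontology_files` is hypothetical and is not part of `merge_ontologies_custom_order`; the demo uses a temporary directory rather than the real `ontology_data_owl` folder.

```python
import os
import tempfile


def missing_ontology_files(input_dir, desired_order):
    """Return the filenames from desired_order that are not in input_dir (hypothetical helper)."""
    present = set(os.listdir(input_dir)) if os.path.isdir(input_dir) else set()
    return [name for name in desired_order if name not in present]


# Demo on a throwaway directory that contains only go.owl
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "go.owl"), "w").close()
    missing = missing_ontology_files(demo_dir, ["ncbitaxon.owl", "go.owl"])
    print("Missing files:", missing)
```

Running such a check before calling `merge_ontologies` would surface a misspelled or absent file before a long ROBOT run starts.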

In [8]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import the merge function
from merge_ontologies_custom_order import merge_ontologies

# Define the desired order for merging (filenames must be present in the input directory)
desired_order = [
    'ncbitaxon.owl',
    'go.owl'
]

# Run the merge with custom input directory, output filename, and the specified file order
merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl',  # Custom input directory
    output_filename='ncbi_go.owl',  # Custom output filename
    ontology_order=desired_order
)

Using ROBOT at: /scratch/jplfaria/install_stuff/robot/robot
Looking for ontology files in: /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl
Found 2 ontology files:
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/ncbitaxon.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/go.owl
Created list of merged ontologies at: /scratch/jplfaria/KBase_CDM_Ontologies/ontologies_merged.txt
Saving output to: /scratch/jplfaria/KBase_CDM_Ontologies/outputs/ncbi_go.owl

Executing command:
robot merge --annotate-defined-by true --input /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/ncbitaxon.owl --input /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl/go.owl remove --axioms disjoint --trim true --preserve-structure false remove --term owl:Nothing --trim true --preserve-structure false --output /scratch/jplfaria/KBase_CDM_Ontologies/outputs/ncbi_go.owl

ROBOT Output:
Successfully merged ontologies into /scratch/jplfaria/KBase_CDM_Ontologies/outputs/ncbi

True

In [None]:
%%time
import sys
import os
import subprocess
from threading import Thread
import time
from pathlib import Path
from datetime import datetime
import logging

# Set up logging
logging.basicConfig(
    filename='merge_memory.log',
    level=logging.INFO,
    format='%(asctime)s - %(message)s'
)

# Initialize peak memory tracking
peak_memory_gb = 0
last_logged_memory = 0

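def parse_memory_output(mem_line):
    """Return the 'used' column of a `free -h` memory line in GB."""
    # `free -h` chooses a unit per value (e.g. 512Mi, 96Gi, 1.2Ti), so scale by suffix
    factors = {'Ki': 1 / 1024**2, 'Mi': 1 / 1024, 'Gi': 1, 'Ti': 1024}
    used_mem = mem_line.split()[2]
    unit = used_mem[-2:]
    if unit not in factors:
        raise ValueError(f"Unrecognized memory value from free -h: {used_mem!r}")
    return float(used_mem[:-2]) * factors[unit]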

def monitor_memory():
    global peak_memory_gb, last_logged_memory
    while True:
        try:
            mem_info = subprocess.check_output(['free', '-h']).decode('utf-8').split('\n')
            current_memory_gb = parse_memory_output(mem_info[1])
            peak_memory_gb = max(peak_memory_gb, current_memory_gb)
            
            # Log if memory changed by more than 10GB
            if abs(current_memory_gb - last_logged_memory) > 10:
                logging.info(f"Memory: {current_memory_gb:.0f}GB (Peak: {peak_memory_gb:.0f}GB)")
                last_logged_memory = current_memory_gb
            
            time.sleep(60)
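        except Exception as exc:
            # Stop monitoring on any failure, but record the reason in the log
            logging.error(f"Memory monitoring stopped: {exc}")
            break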

# Start memory monitoring in background
monitor_thread = Thread(target=monitor_memory, daemon=True)
monitor_thread.start()

# Run merge
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)
from merge_ontologies_v2 import merge_ontologies

print("Starting ontology merge - Check merge_memory.log for memory usage details")
logging.info("Starting ontology merge")

merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl',
    output_filename='ontology_data_owl_no_eccode.owl'
)

# Log final statistics
logging.info(f"Merge completed - Final peak memory: {peak_memory_gb:.0f}GB")
print(f"\nMerge completed - Peak memory usage: {peak_memory_gb:.0f}GB")
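
The monitoring cell above parses the human-readable output of `free -h`, where the unit suffix can differ from one sample to the next. A possible alternative, sketched here under the assumption of a Linux host with the procps `free` command, is to request byte counts with `free -b` and convert them once. The function names are illustrative and the sample line is made up for the demo.

```python
import subprocess

BYTES_PER_GIB = 1024 ** 3


def used_gib_from_free_line(mem_line):
    """Convert the 'used' column (third field) of a `free -b` memory line to GiB."""
    parts = mem_line.split()
    if len(parts) < 3:
        raise ValueError(f"Unexpected line from free: {mem_line!r}")
    return int(parts[2]) / BYTES_PER_GIB


def current_used_gib():
    """Run `free -b` and return used memory in GiB (assumes procps `free` is installed)."""
    output = subprocess.check_output(["free", "-b"], text=True)
    return used_gib_from_free_line(output.splitlines()[1])


# Made-up sample line in the column order printed by `free -b`
sample = "Mem: 1099511627776 549755813888 274877906944 0 274877906944 549755813888"
print(f"{used_gib_from_free_line(sample):.0f} GiB used")
```

Because the values are plain integers, no suffix handling is needed, and the threshold comparison in the monitoring loop would work the same way.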

In [1]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import and run the merge
from merge_ontologies import merge_ontologies

# Run the merge with custom input directory and output filename
merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl_core_ec',  # Custom input directory
    output_filename='ontology_data_owl_core_ec.owl'    # Custom output filename
)

Using ROBOT at: /cdm_shared_workspace/user_shared_workspace/Jose/install_stuff/robot/robot
Looking for ontology files in: /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontology_data_owl_core_ec
Found 5 ontology files:
  - /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontology_data_owl_core_ec/ncbitaxon.owl
  - /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontology_data_owl_core_ec/envo.owl
  - /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontology_data_owl_core_ec/chebi.owl
  - /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontology_data_owl_core_ec/eccode.owl
  - /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontology_data_owl_core_ec/go.owl
Created list of merged ontologies at: /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/ontologies_merged.txt
Saving output to: /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_

True

In [1]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import and run the merge
from merge_ontologies import merge_ontologies

# Run the merge with custom input directory and output filename
merge_ontologies(
    repo_path,
    input_dir_name='ontology_data_owl_base',  # Custom input directory
    output_filename='ontology_data_owl_base.owl'    # Custom output filename
)

Using ROBOT at: /scratch/jplfaria/install_stuff/robot/robot
Looking for ontology files in: /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_base
Found 14 ontology files:
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_base/foodon-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_base/cl-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_base/pco-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_base/iao-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_base/caro-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_base/uberon-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_base/uo-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_base/omo-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_base/bfo-base.owl
  - /scratch/jplfaria/KBase_CDM_Ontologies/ontology_data_owl_base/po-base.owl
  - /scratch/jplfaria/KBase_

True

#### Create Semantic SQL DB

In [5]:
import sys
print(f"Current Python executable: {sys.executable}")

Current Python executable: /home/jplfaria/miniconda3/bin/python


In [6]:
!which python

/usr/local/bin/python
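
The two cells above report different interpreters: the kernel runs from one location while `which python` resolves to another. That can make it harder to tell which environment's command-line tools a script will find. The sketch below, using only the standard library, reports where a tool resolves on `PATH` and whether it sits next to the kernel's interpreter. The helper name `describe_tool` is hypothetical.

```python
import os
import shutil
import sys


def describe_tool(name):
    """Report where `name` resolves on PATH and whether it shares the kernel's bin directory."""
    kernel_bin = os.path.dirname(os.path.abspath(sys.executable))
    found = shutil.which(name)
    same_env = found is not None and os.path.dirname(os.path.abspath(found)) == kernel_bin
    return {"tool": name, "path": found, "in_kernel_env": same_env}


for tool in ("python", "semsql"):
    print(describe_tool(tool))
```

A `path` of `None` means the tool is not on the `PATH` seen by the kernel process, which may differ from the `PATH` of an interactive shell.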


In [1]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import and run the database creation
from create_semantic_sql_db import create_semantic_sql_db

# Run the database creation with custom input filename
create_semantic_sql_db(
    repo_path,
    input_owl_filename='eccode.owl'
)

Error occurred: Could not find semsql installation location
CPU times: user 24.9 ms, sys: 128 μs, total: 25.1 ms
Wall time: 398 ms


Traceback (most recent call last):
  File "/scratch/jplfaria/KBase_CDM_Ontologies/scripts/create_semantic_sql_db.py", line 24, in create_semantic_sql_db
    raise Exception("Could not find semsql installation location")
Exception: Could not find semsql installation location


False

In [11]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import and run the database creation
from create_semantic_sql_db import create_semantic_sql_db

# Run the database creation with custom input filename
create_semantic_sql_db(
    repo_path,
    input_owl_filename='ontology_data_owl.owl'
)

Input OWL file: ontology_data_owl.owl
Output DB file: ontology_data_owl.db
Working directory: /scratch/jplfaria/KBase_CDM_Ontologies/outputs

Executing command:

        export PATH="/scratch/jplfaria/bin:$PATH"
        source /home/jplfaria/scratch/install_stuff/setup_oak_env.sh
        /home/jplfaria/scratch/install_stuff/semsql/bin/semsql make ontology_data_owl.db
        

Output:
robot \
remove -i ontology_data_owl.owl --axioms "equivalent disjoint annotation abox type" \
filter --exclude-terms /home/jplfaria/scratch/install_stuff/semsql/semsql/builder//exclude-terms.txt \
-o ontology_data_owl-min.owl
touch ontology_data_owl-properties.txt
grep -v ^prefix, /home/jplfaria/scratch/install_stuff/semsql/semsql/builder/prefixes/prefixes.csv | grep -v ^obo, | perl -npe 's@,(.*)@: "$1"@' > ontology_data_owl-prefixes.yaml.tmp && mv ontology_data_owl-prefixes.yaml.tmp ontology_data_owl-prefixes.yaml
relation-graph --disable-owl-nothing true \
--ontology-file ontology_data_owl-min.owl \
 \


False

#### Extract SQLite tables to .tsv

In [4]:
%%time
import sys
import os

# Add the scripts directory to the Python path
repo_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
scripts_path = os.path.join(repo_path, 'scripts')
sys.path.append(scripts_path)

# Import and run the extraction
from extract_sql_tables_to_tsv import extract_sql_tables_to_tsv

# Run the extraction
extract_sql_tables_to_tsv(repo_path)

Reading database from: /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/CDM_merged_ontologies.db
Saving TSV files to: /cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/tsv_tables
Processing table: term_association
Exported 'term_association' to '/cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/tsv_tables/term_association.tsv'
Processing table: has_oio_synonym_statement
Exported 'has_oio_synonym_statement' to '/cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/tsv_tables/has_oio_synonym_statement.tsv'
Processing table: anonymous_expression
Exported 'anonymous_expression' to '/cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/tsv_tables/anonymous_expression.tsv'
Processing table: anonymous_class_expression
Exported 'anonymous_class_expression' to '/cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/tsv_tables/anonymous_class

True

In [14]:
%%time
import os
import sqlite3
import pandas as pd
# Specify the output directory
output_dir = '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl_31/'

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# Connect to the SQLite database
conn = sqlite3.connect('/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl_31.db')

# Get a list of all tables in the database
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()

# Extract each table and save as TSV
for table in tables:
    table_name = table[0]
    df = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)
    output_path = os.path.join(output_dir, f"{table_name}.tsv")
    df.to_csv(output_path, sep='\t', index=False)
    print(f"Table '{table_name}' has been exported to '{output_path}'")

# Close the connection
conn.close()
print("All tables have been exported to TSV files in the specified directory.")

Table 'prefix' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl_31/prefix.tsv'
Table 'rdf_list_statement' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl_31/rdf_list_statement.tsv'
Table 'rdf_level_summary_statistic' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl_31/rdf_level_summary_statistic.tsv'
Table 'anonymous_expression' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl_31/anonymous_expression.tsv'
Table 'anonymous_class_expression' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl_31/anonymous_class_expression.tsv'
Table 'anonymous_property_expression' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl_31/anonymous_property_expression.tsv'
Table 'anonymous_individual_expression' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_
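
The export cell above loads each table into a pandas DataFrame before writing it. For very large tables, a streaming variant may keep memory use lower. The sketch below swaps pandas for the standard-library `csv` module, fetches rows in batches, and quotes table names so that unusual identifiers do not break the query. The function name `export_tables_streaming` and the `batch_size` default are assumptions for illustration, and the demo database is a temporary one.

```python
import csv
import os
import sqlite3
import tempfile


def export_tables_streaming(db_path, output_dir, batch_size=50_000):
    """Write every user table in a SQLite file to TSV, fetching rows in batches."""
    os.makedirs(output_dir, exist_ok=True)
    conn = sqlite3.connect(db_path)
    written = {}
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type='table' AND name NOT LIKE 'sqlite_%'"
        )]
        for name in names:
            # Double any embedded quotes, then wrap the identifier in quotes
            quoted = '"' + name.replace('"', '""') + '"'
            cursor = conn.execute(f"SELECT * FROM {quoted}")
            out_path = os.path.join(output_dir, f"{name}.tsv")
            with open(out_path, "w", newline="", encoding="utf-8") as handle:
                writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
                writer.writerow([col[0] for col in cursor.description])
                count = 0
                while True:
                    rows = cursor.fetchmany(batch_size)
                    if not rows:
                        break
                    writer.writerows(rows)
                    count += len(rows)
            written[name] = count
    finally:
        conn.close()
    return written


# Demo on a small temporary database
with tempfile.TemporaryDirectory() as demo_dir:
    demo_db = os.path.join(demo_dir, "demo.db")
    demo_conn = sqlite3.connect(demo_db)
    demo_conn.execute("CREATE TABLE prefix (prefix TEXT, base TEXT)")
    demo_conn.execute("INSERT INTO prefix VALUES ('obo', 'http://purl.obolibrary.org/obo/')")
    demo_conn.commit()
    demo_conn.close()
    print(export_tables_streaming(demo_db, os.path.join(demo_dir, "tsv")))
```

NULL values are written as empty fields, which matches the default behavior of `DataFrame.to_csv`.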

In [11]:
%%time
import os
import sqlite3
import pandas as pd

# Specify the output directory
output_dir = '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/parquet_tables_ontology_data_owl'

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# Connect to the SQLite database
conn = sqlite3.connect('/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl.db')

# Get a list of all tables in the database
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()

# Extract each table and save as Parquet
for table in tables:
    table_name = table[0]
    df = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)
    output_path = os.path.join(output_dir, f"{table_name}.parquet")
    df.to_parquet(output_path, index=False)
    print(f"Table '{table_name}' has been exported to '{output_path}'")

# Close the connection
conn.close()

print("All tables have been exported to Parquet files in the specified directory.")

Table 'term_association' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/parquet_tables_ontology_data_owl/term_association.parquet'
Table 'has_oio_synonym_statement' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/parquet_tables_ontology_data_owl/has_oio_synonym_statement.parquet'
Table 'anonymous_expression' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/parquet_tables_ontology_data_owl/anonymous_expression.parquet'
Table 'anonymous_class_expression' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/parquet_tables_ontology_data_owl/anonymous_class_expression.parquet'
Table 'anonymous_property_expression' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/parquet_tables_ontology_data_owl/anonymous_property_expression.parquet'
Table 'anonymous_individual_expression' has been exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/parquet_tables_ontology_data_owl/anonymous_individual_

In [13]:
%%time
import pandas as pd

# Read the parquet file and display first 200 rows
df = pd.read_parquet('/scratch/jplfaria/KBase_CDM_Ontologies/outputs/parquet_tables_ontology_data_owl/statements.parquet')

# Display first 200 rows
print("First 200 rows:")
print(df.head(200))

# Also display basic information about the dataframe
print("\nDataframe Info:")
df.info()  # info() prints its summary directly and returns None

First 200 rows:
           stanza       subject         predicate        object  \
0    obo:envo.owl  obo:envo.owl     foaf:homepage          None   
1    obo:envo.owl  obo:envo.owl   owl:versionInfo          None   
2    obo:envo.owl  obo:envo.owl      rdfs:comment          None   
3    obo:envo.owl  obo:envo.owl      rdfs:comment          None   
4    obo:envo.owl  obo:envo.owl      rdfs:comment          None   
..            ...           ...               ...           ...   
195   IAO:0000425   IAO:0000425        rdfs:label          None   
196   IAO:0000425   IAO:0000425        rdfs:label          None   
197   IAO:0000425   IAO:0000425  rdfs:isDefinedBy  obo:envo.owl   
198   IAO:0000425   IAO:0000425       IAO:0000117          None   
199   IAO:0000425   IAO:0000425       IAO:0000115          None   

                                                 value    datatype language  \
0                      http://environmentontology.org/  xsd:anyURI     None   
1                    

In [None]:
%%time
import pandas as pd
import re

# Read the parquet file
df = pd.read_parquet('/scratch/jplfaria/KBase_CDM_Ontologies/outputs/parquet_tables_ontology_data_owl/statements.parquet')

def extract_prefix(value):
    if pd.isna(value):
        return None
    # Match patterns like 'obo:', 'rdfs:', 'owl:', etc.
    match = re.match(r'^([a-zA-Z]+:)', str(value))
    if match:
        return match.group(1)
    return None

# Extract unique prefixes from subject, predicate, and object columns
subject_prefixes = set(df['subject'].apply(extract_prefix).dropna())
predicate_prefixes = set(df['predicate'].apply(extract_prefix).dropna())
object_prefixes = set(df['object'].apply(extract_prefix).dropna())

# Combine all unique prefixes
all_prefixes = sorted(subject_prefixes | predicate_prefixes | object_prefixes)

print("Unique prefixes found in the data:")
for prefix in all_prefixes:
    print(f"- {prefix}")

print("\nTotal number of unique prefixes:", len(all_prefixes))

# Show count of each prefix usage
print("\nPrefix usage counts:")
for column in ['subject', 'predicate', 'object']:
    prefix_counts = df[column].apply(extract_prefix).value_counts()
    print(f"\nTop 10 prefixes in {column} column:")
    print(prefix_counts.head(10))
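
The `extract_prefix` helper above accepts only ASCII letters before the colon. CURIE prefixes may also contain digits, underscores, hyphens, or periods, so a broader pattern could be worth comparing. The sketch below runs both patterns over a few sample values with the standard library only. `ex_2:thing` and `_:genid1` are made-up examples, and the `NCNAME_LIKE` pattern is an approximation rather than a full implementation of the CURIE grammar.

```python
import re
from collections import Counter

LETTERS_ONLY = re.compile(r'^([a-zA-Z]+:)')
# Approximation: letter or underscore first, then letters, digits, '_', '.', '-'
NCNAME_LIKE = re.compile(r'^([A-Za-z_][A-Za-z0-9_.\-]*:)')


def extract_prefix(value, pattern=NCNAME_LIKE):
    """Return the leading 'prefix:' of a value, or None if there is none."""
    if value is None:
        return None
    match = pattern.match(str(value))
    return match.group(1) if match else None


samples = ['obo:envo.owl', 'IAO:0000425', 'rdfs:label', 'ex_2:thing', '_:genid1', None]

for label, pattern in (('letters only', LETTERS_ONLY), ('NCName-like', NCNAME_LIKE)):
    found = (extract_prefix(value, pattern) for value in samples)
    counts = Counter(prefix for prefix in found if prefix is not None)
    print(label, dict(counts))
```

Both patterns also match the scheme of a full IRI such as `http:`, so uncontracted IRIs would need separate handling if they occur in the data.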

In [12]:
%%time
import os
import pandas as pd

# Input and output directories
input_dir = '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl'
output_dir = '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl/parquet_tables_ontology_data_owl'

# Ensure output directory exists
os.makedirs(output_dir, exist_ok=True)

# Get all TSV files in the input directory
tsv_files = [f for f in os.listdir(input_dir) if f.endswith('.tsv')]

# Convert each TSV file to Parquet
for tsv_file in tsv_files:
    # Construct full file paths
    tsv_path = os.path.join(input_dir, tsv_file)
    parquet_file = os.path.splitext(tsv_file)[0] + '.parquet'  # swap only the extension
    parquet_path = os.path.join(output_dir, parquet_file)
    
    # Read TSV and write to Parquet
    try:
        df = pd.read_csv(tsv_path, sep='\t')
        df.to_parquet(parquet_path, index=False)
        print(f"Converted '{tsv_file}' to '{parquet_file}'")
    except Exception as e:
        print(f"Error converting {tsv_file}: {str(e)}")

print("\nConversion complete! All TSV files have been converted to Parquet format.")

Converted 'term_association.tsv' to 'term_association.parquet'
Converted 'has_oio_synonym_statement.tsv' to 'has_oio_synonym_statement.parquet'
Converted 'anonymous_expression.tsv' to 'anonymous_expression.parquet'
Converted 'anonymous_class_expression.tsv' to 'anonymous_class_expression.parquet'
Converted 'anonymous_property_expression.tsv' to 'anonymous_property_expression.parquet'
Converted 'anonymous_individual_expression.tsv' to 'anonymous_individual_expression.parquet'
Converted 'owl_restriction.tsv' to 'owl_restriction.parquet'
Converted 'owl_complex_axiom.tsv' to 'owl_complex_axiom.parquet'
Converted 'prefix.tsv' to 'prefix.parquet'
Converted 'rdf_list_statement.tsv' to 'rdf_list_statement.parquet'
Converted 'rdf_level_summary_statistic.tsv' to 'rdf_level_summary_statistic.parquet'
Converted 'relation_graph_construct.tsv' to 'relation_graph_construct.parquet'
Converted 'subgraph_query.tsv' to 'subgraph_query.parquet'
Converted 'entailed_edge.tsv' to 'entailed_edge.parquet'
Conv



Converted 'statements.tsv' to 'statements.parquet'

Conversion complete! All TSV files have been converted to Parquet format.
CPU times: user 4min 31s, sys: 50.7 s, total: 5min 21s
Wall time: 5min 59s


In [7]:
%%time
import os
import sqlite3
import pandas as pd
# Specify the output directory
output_dir = '/cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl_ec'

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# Connect to the SQLite database
conn = sqlite3.connect('/cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/eccode.db')

# Get a list of all tables in the database
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()

# Extract each table and save as TSV
for table in tables:
    table_name = table[0]
    df = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)
    output_path = os.path.join(output_dir, f"{table_name}.tsv")
    df.to_csv(output_path, sep='\t', index=False)
    print(f"Table '{table_name}' has been exported to '{output_path}'")

# Close the connection
conn.close()
print("All tables have been exported to TSV files in the specified directory.")

Table 'term_association' has been exported to '/cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl_ec/term_association.tsv'
Table 'has_oio_synonym_statement' has been exported to '/cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl_ec/has_oio_synonym_statement.tsv'
Table 'anonymous_expression' has been exported to '/cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl_ec/anonymous_expression.tsv'
Table 'anonymous_class_expression' has been exported to '/cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl_ec/anonymous_class_expression.tsv'
Table 'anonymous_property_expression' has been exported to '/cdm_shared_workspace/user_shared_workspace/Jose/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl_ec/anonymous_property_expression.tsv'
Table 'anonymous_individual_expre

In [None]:
%%time
import os
import sqlite3
import pandas as pd

def export_database_objects(db_path, output_dir):
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Connect to the SQLite database
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    
    # Get both tables and views
    cursor.execute("""
        SELECT name, type 
        FROM sqlite_master 
        WHERE type IN ('table', 'view')
        AND name NOT LIKE 'sqlite_%';
    """)
    
    db_objects = cursor.fetchall()
    
    # Extract each object and save as TSV
    for obj_name, obj_type in db_objects:
        try:
            df = pd.read_sql_query(f"SELECT * FROM {obj_name}", conn)
            output_path = os.path.join(output_dir, f"{obj_name}_{obj_type}.tsv")
            df.to_csv(output_path, sep='\t', index=False)
            print(f"{obj_type.capitalize()} '{obj_name}' exported to '{output_path}'")
            print(f"Number of rows: {len(df)}")
        except Exception as e:
            print(f"Error processing {obj_type} {obj_name}: {str(e)}")
    
    conn.close()
    print("\nExport complete!")

# Usage
db_path = '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl.db'
output_dir = '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl'

export_database_objects(db_path, output_dir)

Table 'term_association' exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl/term_association_table.tsv'
Number of rows: 0
Table 'has_oio_synonym_statement' exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl/has_oio_synonym_statement_table.tsv'
Number of rows: 0
Table 'anonymous_expression' exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl/anonymous_expression_table.tsv'
Number of rows: 0
Table 'anonymous_class_expression' exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl/anonymous_class_expression_table.tsv'
Number of rows: 0
Table 'anonymous_property_expression' exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl/anonymous_property_expression_table.tsv'
Number of rows: 0
Table 'anonymous_individual_expression' exported to '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl

In [None]:
import os
import sqlite3
import pandas as pd
import textwrap

def export_database_objects(db_path, output_dir):
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Connect to the SQLite database
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    
    # Get both tables and views
    cursor.execute("""
        SELECT name, type 
        FROM sqlite_master 
        WHERE type IN ('table', 'view')
        AND name NOT LIKE 'sqlite_%';
    """)
    
    db_objects = cursor.fetchall()
    
    # Extract each object and save as TSV
    for obj_name, obj_type in db_objects:
        try:
            # Read the data into a DataFrame
            df = pd.read_sql_query(f"SELECT * FROM {obj_name}", conn)
            
            # Save to TSV
            output_path = os.path.join(output_dir, f"{obj_name}_{obj_type}.tsv")
            df.to_csv(output_path, sep='\t', index=False)
            
            # Print detailed information
            print("\n" + "="*80)
            print(f"{obj_type.capitalize()}: '{obj_name}'")
            print(f"Number of rows: {len(df)}")
            print("\nColumns:")
            for col in df.columns:
                print(f"- {col}")
                
            if len(df) > 0:
                print("\nFirst row of data:")
                # Format the first row nicely
                first_row = df.iloc[0]
                for col, val in first_row.items():
                    # Shorten long values for better readability
                    if isinstance(val, str) and len(val) > 50:
                        val = textwrap.shorten(val, width=50, placeholder="...")
                    print(f"  {col}: {val}")
            else:
                print("\nNo data in this table/view")
            
            print(f"\nExported to: {output_path}")
            
        except Exception as e:
            print(f"Error processing {obj_type} {obj_name}: {str(e)}")
    
    conn.close()
    print("\nExport complete!")

# Usage
db_path = '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/ontology_data_owl.db'
output_dir = '/scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl'

export_database_objects(db_path, output_dir)


Table: 'term_association'
Number of rows: 0

Columns:
- id
- subject
- predicate
- object
- evidence_type
- publication
- source

No data in this table/view

Exported to: /scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl/term_association_table.tsv

Table: 'has_oio_synonym_statement'
Number of rows: 0

Columns:
- subject
- predicate
- object
- value
- datatype
- language

No data in this table/view

Exported to: /scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl/has_oio_synonym_statement_table.tsv

Table: 'anonymous_expression'
Number of rows: 0

Columns:
- id

No data in this table/view

Exported to: /scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl/anonymous_expression_table.tsv

Table: 'anonymous_class_expression'
Number of rows: 0

Columns:
- id

No data in this table/view

Exported to: /scratch/jplfaria/KBase_CDM_Ontologies/outputs/tsv_tables_ontology_data_owl/anonymous_class_expression_table.tsv

Table: 