# XMLC-LOTUS: Bookmark Data Enrichment with Ontology Labels

This notebook implements a pipeline for enriching bookmark data (JSONL) using a custom ontology (YAML) and the LOTUS framework. The goal is to prepare data for eXtreme Multi-Label Classification (XMLC) and potential Knowledge Graph construction.

## Overview

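The notebook is structured in the following sections. The list is numbered from 1, while the section headings below are numbered from 0 (item 1 corresponds to SECTION 0):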

1. **Initial Setup**: Configure credentials and load libraries
2. **Configure LOTUS Environment**: Initialize and register LOTUS models
3. **Load Bookmark Data**: Read and validate the JSONL dataset
4. **Explore and Clean Data**: Analyze and preprocess the data
5. **Prepare Data for LOTUS**: Structure data for semantic processing
6. **Ontology Loading and Preparation**: Load and structure the ontology labels
7. **XMLC - Assign Ontology Labels**: Enrich bookmarks with relevant labels
8. **Ontology Expansion Analysis**: Optional analysis for ontology refinement
9. **Store Enriched Data**: Save the final enriched dataset
10. **Finish Task & Summarize**: Provide execution summary

In [9]:
!pip install pandas numpy pyyaml beautifulsoup4 html5lib lotus-ai


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## SECTION 0: Initial Setup

In this section, we configure secure credentials and import necessary libraries.

In [2]:
# Configure Secure Credentials
import os
import warnings

# Prioritize environment variables for API keys
# Check if OpenAI API key is available
api_key = os.environ.get('OPENAI_API_KEY')

if not api_key:
    # For demonstration purposes, you can set the key here
    # WARNING: This is not recommended for production code
    # api_key = "your-api-key-here"  # Uncomment and replace with your key if needed
    print("⚠️ Warning: OpenAI API key not found in environment variables.")
    print("Please set your OPENAI_API_KEY environment variable before proceeding.")
    print("Example: export OPENAI_API_KEY='your-api-key-here'")
else:
    print("✅ OpenAI API key found in environment variables.")

✅ OpenAI API key found in environment variables.


In [13]:
!python temp_fix.py

No changes needed for import or configure calls.


In [14]:
# Load Libraries
try:
    # Core libraries
    import pandas as pd
    import numpy as np
    import yaml
    from bs4 import BeautifulSoup
    import html
    import json
    from datetime import datetime
    import re
    from tqdm.auto import tqdm
    
    # LOTUS framework
    import lotus
    # Do not import configure directly
    
    # Suppress common warnings for cleaner execution logs
    warnings.filterwarnings('ignore', category=UserWarning)
    warnings.filterwarnings('ignore', category=FutureWarning)
    
    # Suppress BeautifulSoup warnings about URLs and filenames
    from bs4 import MarkupResemblesLocatorWarning
    warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)
    
    print("✅ Core libraries successfully imported.")
except ImportError as e:
    print(f"❌ Error importing libraries: {e}")
    print("Please install missing packages using pip:")
    print("pip install pandas numpy pyyaml beautifulsoup4 html5lib lotus-ai")

✅ Core libraries successfully imported.


## SECTION 1: Configure LOTUS Environment

In this section, we initialize and register the LOTUS models for semantic operations.

In [39]:
# Debug: inspect which model classes this LOTUS version exposes
# (safe to remove once the class names used below are confirmed)
print("--- Debugging LOTUS Model Location ---")
print("dir(lotus):", dir(lotus))
if hasattr(lotus, 'models'):
    print("dir(lotus.models):", dir(lotus.models))
    # Check if Models are directly in lotus.models
    print(f"LanguageModel in lotus.models: {hasattr(lotus.models, 'LanguageModel')}")
    print(f"RetrievalModel in lotus.models: {hasattr(lotus.models, 'RetrievalModel')}")
    print(f"RerankerModel in lotus.models: {hasattr(lotus.models, 'RerankerModel')}")
else:
    print("lotus does NOT have 'models' attribute")

# Check if Models are directly in lotus
print(f"LanguageModel in lotus: {hasattr(lotus, 'LanguageModel')}")
print(f"RetrievalModel in lotus: {hasattr(lotus, 'RetrievalModel')}")
print(f"RerankerModel in lotus: {hasattr(lotus, 'RerankerModel')}")

print("--- End Model Location Debug ---")
# --- End of added lines ---


# Configure LOTUS Framework
try:
    # 1. Initialize Language Model (LM) - Use lotus.models.LM
    lm = lotus.models.LM("gpt-4o-mini")  # Balance capability/cost

    # 2. Initialize Retrieval Model (RM) - Use lotus.models.SentenceTransformersRM
    rm = lotus.models.SentenceTransformersRM("intfloat/e5-base-v2")  # Strong sentence embeddings

    # 3. Initialize Reranker Model (optional) - Use lotus.models.CrossEncoderReranker
    reranker = lotus.models.CrossEncoderReranker("mixedbread-ai/mxbai-rerank-large-v1")

    # Register models with LOTUS via lotus.settings
    # (dir(lotus) in the debug output lists 'settings' but no top-level 'configure')
    lotus.settings.configure(lm=lm, rm=rm, reranker=reranker)

    # Verify configuration
    print(f"✅ LOTUS configured successfully with:")
    # Access model_name if it exists, otherwise handle potential AttributeError
    print(f"   - Language Model: {getattr(lm, 'model_name', 'N/A')}")
    print(f"   - Retrieval Model: {getattr(rm, 'model_name', 'N/A')}")
    print(f"   - Reranker Model: {getattr(reranker, 'model_name', 'N/A')}")

except Exception as e:
    print(f"❌ Error configuring LOTUS: {e}")
    print("Please check your API key and ensure LOTUS is properly installed.")
    print("Installation: pip install lotus-ai")


--- Debugging LOTUS Model Location ---
dir(lotus): ['WebSearchCorpus', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'cache', 'dtype_extensions', 'load_sem_index', 'logger', 'logging', 'lotus', 'models', 'nl_expression', 'sem_agg', 'sem_cluster_by', 'sem_dedup', 'sem_extract', 'sem_filter', 'sem_index', 'sem_join', 'sem_map', 'sem_ops', 'sem_partition_by', 'sem_search', 'sem_sim_join', 'sem_topk', 'settings', 'templates', 'types', 'utils', 'vector_store', 'web_search']
dir(lotus.models): ['ColBERTv2RM', 'CrossEncoderReranker', 'LM', 'LiteLLMRM', 'RM', 'Reranker', 'SentenceTransformersRM', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'colbertv2_rm', 'cross_encoder_reranker', 'litellm_rm', 'lm', 'reranker', 'rm', 'sentence_transformers_rm']
LanguageModel in lotus.models: False
RetrievalModel in lotus.models: False
RerankerM

## SECTION 2: Load Bookmark Data

In this section, we load and validate the bookmark data from the JSONL file.

In [19]:
# Load Bookmark Data
try:
    # Define the path to the JSONL file
    data_path = "Dataset.jsonl"  # Path to the dataset
    
    # Check if file exists
    if os.path.exists(data_path):
        # Read JSONL file into a Pandas DataFrame
        df = pd.read_json(data_path, lines=True)
        
        # Verify successful loading
        print(f"✅ Data loaded successfully from {data_path}")
        print(f"   - Shape: {df.shape[0]} rows, {df.shape[1]} columns")
        
        # Display basic info
        print("\nDataFrame Info:")
        df.info()
        
        print("\nSample Data (first 2 rows):")
        display(df.head(2))
        
        # Check for expected columns
        expected_columns = ['id', 'url', 'source', 'title', 'content', 'created_at', 'domain', 'metadata']
        missing_columns = [col for col in expected_columns if col not in df.columns]
        
        if missing_columns:
            print(f"\n⚠️ Warning: Missing expected columns: {missing_columns}")
        else:
            print("\n✅ All expected columns are present.")
            
    else:
        print(f"❌ Error: File not found at {data_path}")
        print("Please check the file path and ensure the file exists.")
        
except Exception as e:
    print(f"❌ Error loading data: {e}")

✅ Data loaded successfully from Dataset.jsonl
   - Shape: 8154 rows, 8 columns

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8154 entries, 0 to 8153
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   id          8154 non-null   int64              
 1   url         8154 non-null   object             
 2   source      8154 non-null   object             
 3   title       150 non-null    object             
 4   content     8154 non-null   object             
 5   created_at  8154 non-null   datetime64[ns, UTC]
 6   domain      150 non-null    object             
 7   metadata    8154 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(1), object(6)
memory usage: 509.8+ KB

Sample Data (first 2 rows):


| | id | url | source | title | content | created_at | domain | metadata |
|---|---|---|---|---|---|---|---|---|
| 0 | 999815155 | https://x.com/lauriewired/status/1904582573046... | twitter | LaurieWired (@lauriewired) on X | Just built an MCP for Ghidra.\n\nNow basically... | 2025-03-25 22:04:07.055000+00:00 | x.com | {'raindrop_id': 999815155, 'raindrop_created':... |
| 1 | 999814823 | https://github.com/LaurieWired/GhidraMCP | github | GitHub - LaurieWired/GhidraMCP: MCP Server for... | MCP Server for Ghidra. Contribute to LaurieWir... | 2025-03-25 22:03:50.122000+00:00 | github.com | {'raindrop_id': 999814823, 'raindrop_created':... |



✅ All expected columns are present.
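
The load-and-validate step can also be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the notebook's actual loader: the three sample lines and the `load_jsonl` helper are hypothetical, and only the expected column names come from the cell above.

```python
import io
import json

EXPECTED_COLUMNS = ['id', 'url', 'source', 'title', 'content',
                    'created_at', 'domain', 'metadata']

def load_jsonl(stream):
    """Parse one JSON object per non-empty line; record the line numbers that fail to parse."""
    rows, bad_lines = [], []
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(line_no)
    return rows, bad_lines

# Hypothetical sample: a complete record, a record without 'title' and 'domain',
# and a truncated (invalid) line.
sample = io.StringIO(
    '{"id": 1, "url": "https://example.com/a", "source": "web", "title": "A", '
    '"content": "text", "created_at": "2025-03-25T22:04:07Z", '
    '"domain": "example.com", "metadata": {}}\n'
    '{"id": 2, "url": "https://example.com/b", "source": "raindrop", '
    '"content": "text", "created_at": "2025-03-25T22:03:50Z", "metadata": {}}\n'
    '{"id": 3, "url": \n'
)

rows, bad_lines = load_jsonl(sample)

# Per-row check: which expected keys are absent from each parsed record?
missing_per_row = {
    row["id"]: [col for col in EXPECTED_COLUMNS if col not in row]
    for row in rows
}
print(missing_per_row)
print(bad_lines)
```

A per-row check like this complements the column-level check in the cell above: a column counts as present as soon as a single record carries the key, which is how `title` and `domain` can pass the check while holding only 150 non-null values.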


## SECTION 3: Explore and Clean Data

In this section, we analyze the data characteristics and clean it for semantic quality.

In [20]:
# Subsection 3.1: Initial Exploration
print("=== Data Exploration ===\n")

# Schema: Column names, data types, non-null counts
print("Schema:")
df.info()

# Distributions: Value counts for categorical fields
print("\nSource Distribution:")
print(df['source'].value_counts())

print("\nDomain Distribution (top 10):")
print(df['domain'].value_counts().head(10))

# Missing Values: Quantify nulls per column
print("\nMissing Values:")
print(df.isnull().sum())

# Text Lengths: Statistics for title and content lengths
df['title_length'] = df['title'].apply(lambda x: len(str(x)) if pd.notna(x) else 0)
df['content_length'] = df['content'].apply(lambda x: len(str(x)) if pd.notna(x) else 0)

print("\nText Length Statistics:")
print(df[['title_length', 'content_length']].describe())

# Sample Review: Look at actual content examples
print("\nSample Content Review:")
sample_idx = 0  # First row
print(f"Title: {df.iloc[sample_idx]['title']}")
content = df.iloc[sample_idx]['content']
if pd.notna(content) and len(str(content)) > 200:
    print(f"Content: {str(content)[:200]}...")
else:
    print(f"Content: {content}")

=== Data Exploration ===

Schema:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8154 entries, 0 to 8153
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   id          8154 non-null   int64              
 1   url         8154 non-null   object             
 2   source      8154 non-null   object             
 3   title       150 non-null    object             
 4   content     8154 non-null   object             
 5   created_at  8154 non-null   datetime64[ns, UTC]
 6   domain      150 non-null    object             
 7   metadata    8154 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(1), object(6)
memory usage: 509.8+ KB

Source Distribution:
source
raindrop    8000
web           60
twitter       53
github        41
Name: count, dtype: int64

Domain Distribution (top 10):
domain
x.com                                   53
github.com                              29
arxiv.o

In [21]:
# Subsection 3.2: Cleaning Steps
print("=== Data Cleaning ===\n")

# 1. Clean Text Fields
def clean_text(text):
    """Clean text by removing HTML tags, decoding HTML entities, and normalizing whitespace."""
    if pd.isna(text) or not isinstance(text, str):
        return ""
    
    # Remove HTML tags
    text = BeautifulSoup(text, 'html.parser').get_text()
    
    # Decode HTML entities
    text = html.unescape(text)
    
    # Normalize whitespace (multiple spaces/newlines -> single space)
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply cleaning to title and content
df['cleaned_title'] = df['title'].apply(clean_text)
df['cleaned_content'] = df['content'].apply(clean_text)

print("Text cleaning applied to 'title' and 'content' columns.")
print("\nSample of cleaned text:")
for i in range(2):
    print(f"Original title: {df.iloc[i]['title']}")
    print(f"Cleaned title: {df.iloc[i]['cleaned_title']}")
    print(f"Original content: {df.iloc[i]['content']}")
    print(f"Cleaned content: {df.iloc[i]['cleaned_content']}")
    print("---")

# 2. Handle Missing Essential Text
# Count rows before dropping
rows_before = len(df)

# Drop rows if both cleaned_title AND cleaned_content are empty
# (.copy() avoids SettingWithCopyWarning on the column assignments below)
df = df[(df['cleaned_title'] != "") | (df['cleaned_content'] != "")].copy()

# Count rows dropped
rows_dropped = rows_before - len(df)
print(f"Dropped {rows_dropped} rows with empty title AND content.")

# Fill remaining NaNs in cleaned text columns with empty string
df['cleaned_title'] = df['cleaned_title'].fillna("")
df['cleaned_content'] = df['cleaned_content'].fillna("")

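# 3. Parse Dates
# read_json already parsed created_at as datetime64[ns, UTC] (see df.info() above);
# to_datetime with errors='coerce' is kept as a safeguard that turns unparseable values into NaT
df['created_at_dt'] = pd.to_datetime(df['created_at'], errors='coerce', utc=True)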

# Count invalid dates
invalid_dates = df['created_at_dt'].isna().sum()
print(f"Found {invalid_dates} unparseable dates (converted to NaT).")

# 4. Handle Duplicates
# Count duplicates before removal
duplicates_before = df.duplicated(subset=['url']).sum()
print(f"Found {duplicates_before} duplicate URLs.")

# Sort by url then created_at_dt (desc), then drop duplicates keeping the first (most recent)
if duplicates_before > 0:
    df = df.sort_values(['url', 'created_at_dt'], ascending=[True, False])
    df = df.drop_duplicates(subset=['url'], keep='first')
    print(f"Kept the most recent entry for each duplicate URL.")

# Verification
print("\nCleaning completed. Final DataFrame shape:", df.shape)
print("\nSample of cleaned data:")
for i in range(2):
    print(f"URL: {df.iloc[i]['url']}")
    print(f"Cleaned Title: {df.iloc[i]['cleaned_title']}")
    print(f"Cleaned Content: {df.iloc[i]['cleaned_content']}")
    print(f"Created At (DT): {df.iloc[i]['created_at_dt']}")
    print("---")

print("✅ Data exploration and cleaning completed successfully")

=== Data Cleaning ===

Text cleaning applied to 'title' and 'content' columns.

Sample of cleaned text:
Original title: LaurieWired (@lauriewired) on X
Cleaned title: LaurieWired (@lauriewired) on X
Original content: Just built an MCP for Ghidra.

Now basically any LLM (Claude, Gemini, local...) can Reverse Engineer malware for you.  With the right prompting, it automates a *ton* of tedious tasks.

One-shot markups of entire binaries with just a click.

Open source, on Github now.
Cleaned content: Just built an MCP for Ghidra. Now basically any LLM (Claude, Gemini, local...) can Reverse Engineer malware for you. With the right prompting, it automates a *ton* of tedious tasks. One-shot markups of entire binaries with just a click. Open source, on Github now.
---
Original title: GitHub - LaurieWired/GhidraMCP: MCP Server for Ghidra
Cleaned title: GitHub - LaurieWired/GhidraMCP: MCP Server for Ghidra
Original content: MCP Server for Ghidra. Contribute to LaurieWired/GhidraMCP development 
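
The duplicate-handling rule in the cleaning cell (keep the most recent entry per URL, with undated entries losing to dated ones) can be sketched without pandas. The records and the helper names `recency_key` and `keep_most_recent` are hypothetical and only illustrate the rule.

```python
from datetime import datetime, timezone

EPOCH_MIN = datetime.min.replace(tzinfo=timezone.utc)

def recency_key(row):
    """Sort key: dated rows rank above undated ones, then by timestamp."""
    ts = row["created_at"]
    return (ts is not None, ts if ts is not None else EPOCH_MIN)

def keep_most_recent(rows):
    """Return one row per URL, preferring the row with the latest created_at."""
    latest = {}
    for row in rows:
        current = latest.get(row["url"])
        if current is None or recency_key(row) > recency_key(current):
            latest[row["url"]] = row
    return list(latest.values())

# Hypothetical bookmarks: two entries share a URL, one entry has no timestamp.
bookmarks = [
    {"url": "https://example.com/a", "title": "old",
     "created_at": datetime(2025, 3, 1, tzinfo=timezone.utc)},
    {"url": "https://example.com/a", "title": "new",
     "created_at": datetime(2025, 3, 25, tzinfo=timezone.utc)},
    {"url": "https://example.com/b", "title": "undated", "created_at": None},
]

deduplicated = keep_most_recent(bookmarks)
print([row["title"] for row in deduplicated])
```

In the pandas version above, the same preference falls out of sorting `created_at_dt` in descending order, where missing timestamps are placed last by default, and then keeping the first row per URL.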

## SECTION 4: Prepare Data for LOTUS Operators

In this section, we structure the data for semantic processing with LOTUS.

In [22]:
# Step 4.1: Create Unified Text Input
print("=== Preparing Data for LOTUS ===\n")

# Concatenate cleaned_title and cleaned_content into a single field
df['combined_text'] = df.apply(
    lambda row: (row['cleaned_title'] + " " + row['cleaned_content']).strip(),
    axis=1
)

print("Created 'combined_text' column by concatenating title and content.")

# Step 4.2: Extract Existing Metadata Labels
def extract_tags(metadata):
    """Extract tags from metadata dictionary, safely handling different structures."""
    if not isinstance(metadata, dict):
        return []
    
    # Try to extract tags from 'raindrop_tags' key (adjust based on actual structure)
    tags = metadata.get('raindrop_tags', [])
    
    # Ensure result is a list
    if not isinstance(tags, list):
        return []
    
    return tags

# Apply extraction to metadata column
df['existing_tags'] = df['metadata'].apply(extract_tags)

# Count rows with tags
rows_with_tags = (df['existing_tags'].str.len() > 0).sum()
print(f"Extracted existing tags from {rows_with_tags} rows.")

# Verification
print("\nSample rows showing combined text and existing tags:")
for i in range(3):
    print(f"URL: {df.iloc[i]['url']}")
    print(f"Combined Text: {df.iloc[i]['combined_text'][:100]}..." if len(df.iloc[i]['combined_text']) > 100 else f"Combined Text: {df.iloc[i]['combined_text']}")
    print(f"Existing Tags: {df.iloc[i]['existing_tags']}")
    print("---")

print(f"Final DataFrame shape: {df.shape}")
print("✅ Data preparation for LOTUS completed successfully")

=== Preparing Data for LOTUS ===

Created 'combined_text' column by concatenating title and content.
Extracted existing tags from 0 rows.

Sample rows showing combined text and existing tags:
URL: chrome-extension://ldgfbffkinooeloadekpmfoklnobpien/index.html#/add?link=https%3A%2F%2Fgithub.com%2Fbytarnish%2FAGILE%3Ftab%3Dreadme-ov-file
Combined Text: Bookmark saved
Existing Tags: []
---
URL: chrome-extension://mhjfbmdgcfjbbpaeojofohoefgiehjai/index.html
Combined Text: Index
Existing Tags: []
---
URL: chrome://newtab/
Combined Text: New Tab
Existing Tags: []
---
Final DataFrame shape: (7588, 15)
✅ Data preparation for LOTUS completed successfully
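
The output above reports tags extracted from 0 rows, which may mean that `raindrop_tags` is not the key the metadata actually uses. One way to check is to count which keys occur across the metadata dictionaries. The sketch below assumes made-up metadata values: only `raindrop_id` and `raindrop_created` appear in the sample output earlier, while the `tags` key and the `count_metadata_keys` helper are hypothetical.

```python
from collections import Counter

def count_metadata_keys(metadata_values):
    """Count how often each key occurs across metadata dicts, ignoring non-dict values."""
    counts = Counter()
    for metadata in metadata_values:
        if isinstance(metadata, dict):
            counts.update(metadata.keys())
    return counts

# Hypothetical metadata values, including one missing entry.
sample_metadata = [
    {"raindrop_id": 1, "raindrop_created": "2025-03-25", "tags": ["ai"]},
    {"raindrop_id": 2, "raindrop_created": "2025-03-24"},
    None,
]

key_counts = count_metadata_keys(sample_metadata)
print(key_counts.most_common())
```

On the real data the call would be `count_metadata_keys(df['metadata'])`; if a tag-like key shows up in the counts, `extract_tags` can be pointed at that key instead.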


## SECTION 5: Ontology Loading and Preparation

In this section, we load and structure the ontology labels from the YAML file.

In [24]:
# Step 5.1: Load Ontology Definition
print("=== Loading and Preparing Ontology ===\n")

# Define the path to the YAML file
ontology_path = "Ontology.yaml"

# Parse the YAML file into a Python dictionary (explicit encoding for portability)
with open(ontology_path, 'r', encoding='utf-8') as file:
    ontology_structure = yaml.safe_load(file)

print(f"✅ Ontology loaded successfully from {ontology_path}")

# Print a snippet of the loaded structure
print("\nOntology Structure (first domain):\n")
first_domain = list(ontology_structure.keys())[0]
print(f"Domain: {first_domain}")
print(f"Description: {ontology_structure[first_domain].get('description', 'No description')}")
print(f"Number of subcategories: {len(ontology_structure[first_domain].get('subcategories', []))}")

=== Loading and Preparing Ontology ===

✅ Ontology loaded successfully from Ontology.yaml

Ontology Structure (first domain):

Domain: AI_Machine_Learning
Description: Concepts, tools, and applications related to Artificial Intelligence and Machine Learning.
Number of subcategories: 5


In [28]:
# Step 5.2: Extract Labels and Descriptions
def traverse_ontology(structure):
    """Walk the three fixed ontology levels (domain -> subcategory -> instance) and collect (label, description, path) tuples."""
    labels_collection = []
    
    # Handle top-level domains
    if not isinstance(structure, dict):
        print(f"⚠️ Warning: Expected dictionary structure at top level, but got {type(structure)}. Skipping.")
        return labels_collection

    for domain_name, domain_data in structure.items():
        # Skip comments and metadata (keys starting with #)
        if isinstance(domain_name, str) and domain_name.startswith('#'):
            continue
        
        # Check if domain_data is a dictionary before accessing description/subcategories
        if isinstance(domain_data, dict):
            domain_path = domain_name
            domain_desc = domain_data.get('description', f"Domain: {domain_name}")
            
            # Add the domain itself as a label
            labels_collection.append((domain_name, domain_desc, domain_path))
            
            # Process subcategories if they exist and are a list
            subcategories = domain_data.get('subcategories', [])
            if isinstance(subcategories, list):
                for subcat in subcategories:
                    # Check if subcat is a dictionary
                    if isinstance(subcat, dict):
                        for subcat_name, subcat_data in subcat.items():
                            # Check if subcat_data is a dictionary
                            if isinstance(subcat_data, dict):
                                subcat_path = f"{domain_path}/{subcat_name}"
                                subcat_desc = subcat_data.get('description', f"Subcategory: {subcat_name}")
                                
                                # Add the subcategory as a label
                                labels_collection.append((subcat_name, subcat_desc, subcat_path))
                                
                                # Process instances if they exist and are a list
                                instances = subcat_data.get('instances', [])
                                if isinstance(instances, list):
                                    for instance in instances:
                                        # Check if instance is a dictionary
                                        if isinstance(instance, dict):
                                            for instance_name, instance_data in instance.items():
                                                instance_path = f"{subcat_path}/{instance_name}"
                                                
                                                # Get description or provide a default, checking instance_data type
                                                if isinstance(instance_data, dict):
                                                    instance_desc = instance_data.get('description', f"Instance: {instance_name}")
                                                else:
                                                    # Handle cases where instance_data is not a dict (e.g., just a string or null)
                                                    instance_desc = f"Instance: {instance_name}" 
                                                    # Optionally print a warning if needed:
                                                    # print(f"⚠️ Warning: Expected dict for instance '{instance_name}', got {type(instance_data)}")
                                                
                                                # Add the instance as a label
                                                labels_collection.append((instance_name, instance_desc, instance_path))
                                        else:
                                            print(f"⚠️ Warning: Expected dict for instance item in '{subcat_name}', got {type(instance)}. Item: {instance}")
                                else:
                                    if instances: # Only warn if it's not an empty list or None
                                        print(f"⚠️ Warning: Expected list for instances in '{subcat_name}', got {type(instances)}. Data: {instances}")
                            else:
                                print(f"⚠️ Warning: Expected dict for subcategory data '{subcat_name}', got {type(subcat_data)}. Data: {subcat_data}")
                    else:
                         print(f"⚠️ Warning: Expected dict for subcategory item in '{domain_name}', got {type(subcat)}. Item: {subcat}")
            else:
                if subcategories: # Only warn if it's not an empty list or None
                    print(f"⚠️ Warning: Expected list for subcategories in '{domain_name}', got {type(subcategories)}. Data: {subcategories}")
        else:
            # Handle cases where domain_data is not a dictionary (e.g., just a string, list, or null)
            print(f"⚠️ Warning: Expected dictionary for domain '{domain_name}', but got {type(domain_data)}. Skipping description and subcategories for this domain. Data: {domain_data}")
            # Optionally add the domain with a default description if needed
            # domain_path = domain_name
            # domain_desc = f"Domain: {domain_name} (Data format warning)"
            # labels_collection.append((domain_name, domain_desc, domain_path))

    return labels_collection

# Extract labels from the ontology structure
labels_collection = traverse_ontology(ontology_structure)

# Warn if no labels were found
if not labels_collection:
    print("⚠️ Warning: No labels were extracted from the ontology.")
else:
    print(f"✅ Extracted {len(labels_collection)} labels from the ontology.")
    print("\nSample of extracted labels:")
    for label, desc, path in labels_collection[:5]:  # Show first 5 labels
        print(f"Label: {label}\nDescription: {desc}\nPath: {path}\n")

✅ Extracted 315 labels from the ontology.

Sample of extracted labels:
Label: AI_Machine_Learning
Description: Concepts, tools, and applications related to Artificial Intelligence and Machine Learning.
Path: AI_Machine_Learning

Label: Core_Concepts
Description: Fundamental ideas and architectures in AI/ML.
Path: AI_Machine_Learning/Core_Concepts

Label: LLM
Description: Large Language Models
Path: AI_Machine_Learning/Core_Concepts/LLM

Label: RAG
Description: Retrieval-Augmented Generation
Path: AI_Machine_Learning/Core_Concepts/RAG

Label: Vector_Database
Description: Databases optimized for vector similarity search
Path: AI_Machine_Learning/Core_Concepts/Vector_Database
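
The nesting that `traverse_ontology` expects can be read off its type checks: a mapping of domains, each with a `subcategories` list of single-key mappings, each of which may carry an `instances` list of single-key mappings. The sketch below shows that shape as an already-parsed Python dictionary, together with a compact generator that yields the same `(label, description, path)` tuples. It is an assumption about the layout of `Ontology.yaml`, which is not shown here; the `RAG: None` entry is hypothetical and exercises the default description, and the generator omits the warnings and type guards of the original.

```python
# Parsed form of a small ontology fragment (assumed layout, see lead-in).
ontology = {
    "AI_Machine_Learning": {
        "description": "Concepts, tools, and applications related to "
                       "Artificial Intelligence and Machine Learning.",
        "subcategories": [
            {"Core_Concepts": {
                "description": "Fundamental ideas and architectures in AI/ML.",
                "instances": [
                    {"LLM": {"description": "Large Language Models"}},
                    {"RAG": None},  # hypothetical: instance without a description
                ],
            }},
        ],
    },
}

def flatten(structure):
    """Yield (label, description, path) for every domain, subcategory, and instance."""
    for domain, domain_data in structure.items():
        yield domain, domain_data.get("description", f"Domain: {domain}"), domain
        for subcat in domain_data.get("subcategories") or []:
            for sub_name, sub_data in subcat.items():
                sub_path = f"{domain}/{sub_name}"
                yield (sub_name,
                       sub_data.get("description", f"Subcategory: {sub_name}"),
                       sub_path)
                for instance in sub_data.get("instances") or []:
                    for inst_name, inst_data in instance.items():
                        desc = inst_data.get("description") if isinstance(inst_data, dict) else None
                        yield (inst_name,
                               desc or f"Instance: {inst_name}",
                               f"{sub_path}/{inst_name}")

flat = list(flatten(ontology))
paths = [path for _, _, path in flat]
print(paths)
```

The path is what makes each entry unique: the label alone can repeat across branches, while the slash-joined path records where in the hierarchy it sits.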



In [29]:
# Step 5.3: Create Label DataFrame for LOTUS
# Convert the extracted labels_collection into a DataFrame
labels_df = pd.DataFrame(labels_collection, columns=['label', 'description', 'path'])

# Create a combined field for improved embedding quality
labels_df['label_plus_desc'] = labels_df.apply(
    lambda row: f"{row['label']}: {row['description']}",
    axis=1
)

print("Created labels DataFrame with combined 'label_plus_desc' field.")
print(f"Labels DataFrame shape: {labels_df.shape}")
print("\nSample of labels DataFrame:")
display(labels_df.head())

# Step 5.4: Prepare Label Lists
# Extract unique label names
all_label_names = labels_df['label'].unique().tolist()

print(f"\nPrepared list of {len(all_label_names)} unique label names.")
print("\nSample of label names:")
print(all_label_names[:10])  # Show first 10 labels

# Optional: Create more specific lists based on ontology structure
# For example, extract top-level domains
top_level_domains = [label for label, _, path in labels_collection if '/' not in path]
print(f"\nIdentified {len(top_level_domains)} top-level domains:")
print(top_level_domains)

print("\n✅ Ontology loading and preparation completed successfully")

Created labels DataFrame with combined 'label_plus_desc' field.
Labels DataFrame shape: (315, 4)

Sample of labels DataFrame:


| | label | description | path | label_plus_desc |
|---|---|---|---|---|
| 0 | AI_Machine_Learning | Concepts, tools, and applications related to A... | AI_Machine_Learning | AI_Machine_Learning: Concepts, tools, and appl... |
| 1 | Core_Concepts | Fundamental ideas and architectures in AI/ML. | AI_Machine_Learning/Core_Concepts | Core_Concepts: Fundamental ideas and architect... |
| 2 | LLM | Large Language Models | AI_Machine_Learning/Core_Concepts/LLM | LLM: Large Language Models |
| 3 | RAG | Retrieval-Augmented Generation | AI_Machine_Learning/Core_Concepts/RAG | RAG: Retrieval-Augmented Generation |
| 4 | Vector_Database | Databases optimized for vector similarity search | AI_Machine_Learning/Core_Concepts/Vector_Database | Vector_Database: Databases optimized for vecto... |



Prepared list of 296 unique label names.

Sample of label names:
['AI_Machine_Learning', 'Core_Concepts', 'LLM', 'RAG', 'Vector_Database', 'AI_Agents', 'Prompt_Engineering', 'LMops', 'XMLC', 'Multi_Label_Classification']

Identified 7 top-level domains:
['AI_Machine_Learning', 'Software_Development', 'Project_Business_Development', 'Personal_Knowledge_Management_Productivity', 'Data_Management_Processing', 'Architecture_Construction', 'XMLC']

✅ Ontology loading and preparation completed successfully
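
The labels DataFrame has 315 rows but only 296 unique label names, so some names occur under more than one path (`XMLC`, for instance, appears both among the first label names and as a top-level domain). A small sketch for listing such ambiguous names follows. The tuples mimic `labels_collection`, but the nested `XMLC` path and its description are hypothetical.

```python
from collections import defaultdict

# (label, description, path) tuples in the shape of labels_collection.
labels = [
    ("XMLC", "Domain: XMLC", "XMLC"),
    ("XMLC", "eXtreme Multi-Label Classification",
     "AI_Machine_Learning/Core_Concepts/XMLC"),  # hypothetical path
    ("LLM", "Large Language Models", "AI_Machine_Learning/Core_Concepts/LLM"),
]

# Group every path under its label name.
paths_by_label = defaultdict(list)
for label, _, path in labels:
    paths_by_label[label].append(path)

# A label is ambiguous when it resolves to more than one path.
ambiguous = {label: paths for label, paths in paths_by_label.items() if len(paths) > 1}
print(ambiguous)
```

Where a name is ambiguous, keying downstream results on `path` rather than `label` keeps the two entries apart.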


## SECTION 6: XMLC - Assign Ontology Labels using LOTUS

In this section, we use LOTUS to assign ontology labels to the bookmarks.

In [34]:
# --- Add these lines for debugging ---
print("--- Debugging LOTUS structure ---")
try:
    print("dir(lotus):", dir(lotus))
    if hasattr(lotus, 'sem_ops'):
        print("dir(lotus.sem_ops):", dir(lotus.sem_ops))
        if hasattr(lotus.sem_ops, 'sem_index'):
            print("dir(lotus.sem_ops.sem_index):", dir(lotus.sem_ops.sem_index))
            # Let's also check if the function is directly in lotus
            if hasattr(lotus, 'sem_index') and callable(lotus.sem_index):
                 print("lotus.sem_index IS callable")
            else:
                 print("lotus.sem_index is NOT callable")
            # And check if the function is directly in lotus.sem_ops
            if hasattr(lotus.sem_ops, 'sem_index') and callable(lotus.sem_ops.sem_index):
                 print("lotus.sem_ops.sem_index IS callable")
            else:
                 print("lotus.sem_ops.sem_index is NOT callable")

    else:
        print("lotus does not have attribute 'sem_ops'")
except Exception as debug_e:
    print(f"Error during debug printing: {debug_e}")
print("--- End Debugging ---")
# --- End of added lines ---

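# lotus.sem_ops.sem_index is a module (see the debug output), so calling it raises
# "TypeError: 'module' object is not callable". LOTUS registers the operator as a
# pandas DataFrame accessor, which is used here instead.
index_dir = "ontology_label_index"  # assumed directory name; index_dir was not defined earlier
indexed_labels_df = labels_df.sem_index("label_plus_desc", index_dir)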


--- Debugging LOTUS structure ---
dir(lotus): ['WebSearchCorpus', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'cache', 'dtype_extensions', 'load_sem_index', 'logger', 'logging', 'lotus', 'models', 'nl_expression', 'sem_agg', 'sem_cluster_by', 'sem_dedup', 'sem_extract', 'sem_filter', 'sem_index', 'sem_join', 'sem_map', 'sem_ops', 'sem_partition_by', 'sem_search', 'sem_sim_join', 'sem_topk', 'settings', 'templates', 'types', 'utils', 'vector_store', 'web_search']
dir(lotus.sem_ops): ['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'cascade_utils', 'load_sem_index', 'postprocessors', 'sem_agg', 'sem_cluster_by', 'sem_dedup', 'sem_extract', 'sem_filter', 'sem_index', 'sem_join', 'sem_map', 'sem_partition_by', 'sem_search', 'sem_sim_join', 'sem_topk']
dir(lotus.sem_ops.sem_index): ['Any', 'SemIndexDataframe', '__builtins__', '

TypeError: 'module' object is not callable
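
The `TypeError` arises because an attribute can exist (`hasattr` returns `True`) and still be a module instead of a function. The sketch below shows the distinction with a standard-library pair that has the same shape: `os.path` is a submodule and `os.path.join` is a function inside it. The `describe` helper is hypothetical.

```python
import inspect
import os

def describe(obj):
    """Classify an object as a module, a callable, or something else."""
    if inspect.ismodule(obj):
        return "module"
    return "callable" if callable(obj) else "other"

kinds = {
    "os.path": describe(os.path),            # a submodule
    "os.path.join": describe(os.path.join),  # a function defined in that submodule
}
print(kinds)
```

Running `describe` on `lotus.sem_ops.sem_index` and on the names listed by `dir(lotus.sem_ops.sem_index)` is a quick way to tell which objects can actually be called.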

In [None]:
# Step 6.A.2: Perform Semantic Similarity Join
print("\nPerforming semantic similarity join...")

# Define the number of top labels to retrieve per bookmark
K = 10

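# Perform the join with the LOTUS sem_sim_join DataFrame accessor.
# Like sem_index, lotus.sem_sim_join is expected to be a module and not a function;
# the accessor takes the right-hand DataFrame, left_on/right_on column names, and K.
# It ranks by embedding similarity only, so no reranker flag is passed.
enriched_df = df.sem_sim_join(
    indexed_labels_df,
    left_on="combined_text",
    right_on="label_plus_desc",
    K=K,
)

# Assumption: LOTUS names the similarity column '_scores'; rename it to the
# '_score' name used in the cells below (a no-op if the column is absent).
enriched_df = enriched_df.rename(columns={"_scores": "_score"})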

print(f"✅ Semantic similarity join completed with K={K}")
print(f"Enriched DataFrame shape: {enriched_df.shape}")
print("\nSample of matched rows:")
display(enriched_df[['url', 'combined_text', 'label', '_score']].head(3))
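
Conceptually, a similarity join embeds both sides and keeps, for each bookmark, the K labels whose embeddings are closest. The sketch below illustrates that idea with cosine similarity over made-up three-dimensional vectors. It is not LOTUS's implementation, and the helpers `cosine` and `top_k_labels` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_labels(text_vec, label_vecs, k):
    """Return the k (label, score) pairs with the highest cosine similarity."""
    scored = [(label, cosine(text_vec, vec)) for label, vec in label_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up "embeddings"; real ones come from the retrieval model.
label_vecs = {
    "LLM": [1.0, 0.0, 0.0],
    "RAG": [0.8, 0.6, 0.0],
    "Vector_Database": [0.0, 1.0, 0.0],
}
bookmark_vec = [0.9, 0.1, 0.0]

matches = top_k_labels(bookmark_vec, label_vecs, k=2)
print([label for label, _ in matches])
```

With K = 10 and 315 label rows, every bookmark receives ten candidate labels regardless of how relevant they are, so the scores matter as much as the labels when the results are interpreted.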

In [None]:
# Step 6.A.3: Aggregate Top K Labels
print("\nAggregating top K labels per bookmark...")

# Sort by bookmark ID and score (descending)
enriched_df = enriched_df.sort_values(['id', '_score'], ascending=[True, False])

# Group by bookmark ID and aggregate labels, scores, and paths
agg_labels = enriched_df.groupby('id').agg({
    'label': list,
    '_score': list,
    'path': list
}).reset_index()

# Rename columns for clarity
agg_labels = agg_labels.rename(columns={
    'label': 'ontology_labels_k',
    '_score': 'ontology_scores_k',
    'path': 'ontology_paths_k'
})

print(f"✅ Aggregated labels for {len(agg_labels)} bookmarks")

# Step 6.A.4: Merge Aggregated Labels into Final DataFrame
print("\nMerging aggregated labels back to main DataFrame...")

# Merge the aggregated results with the cleaned DataFrame
df_final = pd.merge(df, agg_labels, on='id', how='left')

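# Handle bookmarks with no matches
# Note: pd.isna() on a list returns an array, which raises ValueError inside an `if`,
# so test for list-ness instead
for col in ['ontology_labels_k', 'ontology_scores_k', 'ontology_paths_k']:
    df_final[col] = df_final[col].apply(lambda x: x if isinstance(x, list) else [])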

print(f"✅ Final DataFrame with aggregated labels created")
print(f"Final DataFrame shape: {df_final.shape}")
print("\nSample of final DataFrame with aggregated labels:")
for i in range(min(2, len(df_final))):  # Guard against DataFrames with fewer than 2 rows
    text = df_final.iloc[i]['combined_text']
    print(f"URL: {df_final.iloc[i]['url']}")
    print(f"Combined Text: {text[:100]}..." if len(text) > 100 else f"Combined Text: {text}")
    print(f"Ontology Labels: {df_final.iloc[i]['ontology_labels_k']}")
    print("---")
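
The sort, group, merge, and fill pattern from Steps 6.A.3 and 6.A.4 can be checked on a toy example. This is a minimal sketch with invented data; the `bookmarks` and `matches` frames and the `score` column name are placeholders, not the notebook's real columns.

```python
import pandas as pd

bookmarks = pd.DataFrame({"id": [1, 2, 3], "url": ["a", "b", "c"]})
matches = pd.DataFrame({
    "id": [1, 1, 2],
    "label": ["Python", "Pandas", "SQL"],
    "score": [0.7, 0.9, 0.8],
})

# Highest score first within each bookmark, then collapse to one row per id
matches = matches.sort_values(["id", "score"], ascending=[True, False])
agg = matches.groupby("id").agg({"label": list, "score": list}).reset_index()

# Left merge keeps bookmark 3, which has no matches
merged = bookmarks.merge(agg, on="id", how="left")

# Unmatched rows hold NaN and matched rows hold lists, so test for list-ness
# rather than calling pd.isna() on each cell (pd.isna(list) returns an array)
for col in ["label", "score"]:
    merged[col] = merged[col].apply(lambda x: x if isinstance(x, list) else [])

print(merged)
```

Bookmark 3 ends up with empty lists instead of NaN, so later `len()` calls and list concatenation work on every row.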

In [None]:
# Supplementary Strategy (C): Keyword Extraction and Matching
print("\n=== Strategy C: Keyword Extraction and Matching ===\n")

# Step 6.C.1: Extract Mentioned Entities
print("Extracting mentioned entities from bookmark text...")

# Define the extraction prompt
extraction_prompt = """
Extract specific tools, libraries, concepts, and technical terms mentioned in the text.
Focus on named entities related to technology, software, AI/ML, and development.

Format your response as a JSON object with this structure:
{"mentioned_entities": ["entity1", "entity2", ...]}
"""

# Use LOTUS sem_extract operator on a sample of the data
# In production, you would process the entire dataset
sample_size = min(20, len(df_final))  # Process up to 20 rows for demonstration
sample_df = df_final.head(sample_size).copy()

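# Assumption: sem_extract is a DataFrame accessor called as df.sem_extract(input_cols, output_cols);
# `lotus.sem_extract` itself is a module (see the dir() output above). The accessor has no
# `prompt` argument that we know of, so `extraction_prompt` is not passed and the
# output_cols description carries the instruction instead.
extracted_df = sample_df.sem_extract(
    ["combined_text"],
    {"mentioned_entities": "List specific tools, libraries, concepts..."},
)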

print(f"✅ Extracted entities from {len(extracted_df)} bookmarks")
print("\nSample of extracted entities:")
display(extracted_df[['url', 'mentioned_entities']].head(2))

In [None]:
# Step 6.C.2: Match Extracted Entities to Ontology
print("\nMatching extracted entities to ontology labels...")

# Convert all label names to lowercase for case-insensitive matching
all_label_names_lower = [label.lower() for label in all_label_names]
label_map = {label.lower(): label for label in all_label_names}  # Map to preserve original case

def match_entities_to_ontology(entities):
    """Match extracted entities to ontology labels using case-insensitive matching."""
    if not isinstance(entities, list):
        return []
    
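    matched_labels = []
    for entity in entities:
        # Skip non-string items (e.g. None) that the extraction may return
        if not isinstance(entity, str):
            continue
        # Exact, case-insensitive match via the dict (constant-time lookup)
        label = label_map.get(entity.strip().lower())
        if label is not None:
            matched_labels.append(label)
    
    # Return unique matches, preserving first-seen order
    return list(dict.fromkeys(matched_labels))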

# Apply matching to the extracted entities
extracted_df['keyword_matched_labels'] = extracted_df['mentioned_entities'].apply(match_entities_to_ontology)

# Count matches
match_counts = extracted_df['keyword_matched_labels'].apply(len)
total_matches = match_counts.sum()
rows_with_matches = (match_counts > 0).sum()

print(f"✅ Found {total_matches} ontology label matches across {rows_with_matches} bookmarks")
print("\nSample of matched labels:")
display(extracted_df[['url', 'mentioned_entities', 'keyword_matched_labels']].head(2))

# Merge the keyword matched labels back to the main DataFrame
# For demonstration, we'll only merge the processed sample
# In production, you would process and merge the entire dataset
df_final = pd.merge(df_final, extracted_df[['id', 'mentioned_entities', 'keyword_matched_labels']], 
                    on='id', how='left')

# Fill NaN values with empty lists
# (pd.isna() on a list returns an array and fails inside an `if`, so test for list-ness)
df_final['mentioned_entities'] = df_final['mentioned_entities'].apply(lambda x: x if isinstance(x, list) else [])
df_final['keyword_matched_labels'] = df_final['keyword_matched_labels'].apply(lambda x: x if isinstance(x, list) else [])
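
Exact case-insensitive matching misses spelling variants such as "scikit learn" versus "Scikit-Learn". One possible refinement, sketched here under the assumption that punctuation and spacing carry no meaning in label names, is to normalize both sides before the lookup. The helpers `normalize`, `build_label_index`, and `match_entities` are hypothetical and not part of the notebook.

```python
import re

def normalize(term):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", term.lower())

def build_label_index(label_names):
    """Map each normalized label name to its original spelling."""
    return {normalize(name): name for name in label_names}

def match_entities(entities, label_index):
    """Return ontology labels matched by the entities, in first-seen order."""
    matched = []
    for entity in entities:
        if isinstance(entity, str):
            label = label_index.get(normalize(entity))
            if label is not None and label not in matched:
                matched.append(label)
    return matched

toy_labels = ["Scikit-Learn", "PyTorch", "Knowledge Graph"]
toy_index = build_label_index(toy_labels)
toy_matches = match_entities(["scikit learn", "PYTORCH", "pytorch", "Rust", None], toy_index)
print(toy_matches)
```

Aggressive normalization can merge distinct names (for example "C" and "C++" both become "c"), so it is worth checking the index for collisions before relying on it.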

In [None]:
# Combine Labels
print("\n=== Combining Labels from Different Strategies ===\n")

# Ensure both label columns exist
if 'ontology_labels_k' in df_final.columns and 'keyword_matched_labels' in df_final.columns:
    # Create a unified list of labels from both strategies
    def combine_labels(row):
        # Get labels from both strategies
        semantic_labels = row['ontology_labels_k'] if isinstance(row['ontology_labels_k'], list) else []
        keyword_labels = row['keyword_matched_labels'] if isinstance(row['keyword_matched_labels'], list) else []
        
        # Combine and deduplicate, keeping first-seen order (semantic labels first)
        combined = list(dict.fromkeys(semantic_labels + keyword_labels))
        return combined
    
    # Apply the combination function
    df_final['combined_ontology_labels'] = df_final.apply(combine_labels, axis=1)
    
    # Count labels
    total_combined = df_final['combined_ontology_labels'].apply(len).sum()
    total_semantic = df_final['ontology_labels_k'].apply(len).sum()
    total_keyword = df_final['keyword_matched_labels'].apply(len).sum()
    
    print(f"✅ Combined labels created")
    print(f"Total semantic labels: {total_semantic}")
    print(f"Total keyword labels: {total_keyword}")
    print(f"Total combined unique labels: {total_combined}")
    
    print("\nSample comparison of label sources:")
    for i in range(min(2, len(df_final))):
        print(f"URL: {df_final.iloc[i]['url']}")
        print(f"Semantic Labels: {df_final.iloc[i]['ontology_labels_k']}")
        print(f"Keyword Labels: {df_final.iloc[i]['keyword_matched_labels']}")
        print(f"Combined Labels: {df_final.iloc[i]['combined_ontology_labels']}")
        print("---")
else:
    print("⚠️ Warning: Required label columns not found. Skipping label combination.")

print("\n✅ XMLC label assignment completed successfully")
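
The union above discards which strategy proposed each label. If that provenance is useful, for instance to trust labels that both strategies agree on, it could be tracked along the lines of the sketch below. `combine_with_provenance` and the toy inputs are hypothetical.

```python
def combine_with_provenance(semantic_labels, keyword_labels):
    """Map each label to the sorted list of strategies that proposed it."""
    sources = {}
    for strategy, labels in (("semantic", semantic_labels), ("keyword", keyword_labels)):
        for label in labels:
            sources.setdefault(label, set()).add(strategy)
    # Dicts keep insertion order, so labels stay in first-seen order
    return {label: sorted(found) for label, found in sources.items()}

toy_sources = combine_with_provenance(["Python", "Pandas"], ["Pandas", "SQL"])
agreed = [label for label, found in toy_sources.items() if len(found) == 2]

print(toy_sources)
print(agreed)
```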

## SECTION 7: Ontology Expansion Analysis

In this section, we analyze the data to identify potential new ontology labels.

In [None]:
# Step 7.1: Cluster Bookmarks by Content Similarity
print("=== Clustering Bookmarks by Content Similarity ===\n")

# Define the number of clusters
n_clusters = 10  # Adjust based on dataset size and diversity

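# Cluster with the LOTUS sem_cluster_by DataFrame accessor. `lotus.sem_cluster` is not in the
# dir(lotus) output shown earlier, and `lotus.sem_cluster_by` is a module, not a function.
# Assumptions: sem_cluster_by needs a semantic index on the column, writes its assignments
# to a `cluster_id` column, and "combined_text_index" is a placeholder index directory.
clustered_df = (
    df_final.copy()
    .sem_index("combined_text", "combined_text_index")
    .sem_cluster_by("combined_text", n_clusters)
    .rename(columns={"cluster_id": "content_cluster"})
)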

# Count bookmarks per cluster
cluster_counts = clustered_df['content_cluster'].value_counts().sort_index()

print(f"✅ Clustered bookmarks into {n_clusters} groups")
print("\nBookmarks per cluster:")
print(cluster_counts)

In [None]:
# Step 7.2: Extract Cluster Themes
print("=== Extracting Cluster Themes ===\n")

# Define the extraction prompt
theme_prompt = """
Analyze the text and identify the main themes or topics.
Focus on technical domains, concepts, and subject areas.

Format your response as a JSON object with this structure:
{"cluster_theme": "Main theme", "subtopics": ["subtopic1", "subtopic2", ...]}
"""

# Sample a few bookmarks from each cluster for theme extraction
cluster_samples = []
for cluster_id in range(n_clusters):
    # Get bookmarks from this cluster
    cluster_bookmarks = clustered_df[clustered_df['content_cluster'] == cluster_id]
    
    # Skip empty clusters
    if len(cluster_bookmarks) == 0:
        continue
        
    # Sample up to 5 bookmarks from this cluster
    sample_size = min(5, len(cluster_bookmarks))
    samples = cluster_bookmarks.sample(sample_size)
    
    # Concatenate the text from all samples
    combined_sample_text = " ".join(samples['combined_text'].tolist())
    
    # Add to the list of samples
    cluster_samples.append({
        'cluster_id': cluster_id,
        'sample_text': combined_sample_text,
        'bookmark_count': len(cluster_bookmarks)
    })

# Convert to DataFrame
cluster_samples_df = pd.DataFrame(cluster_samples)

# Extract themes using LOTUS sem_extract
if len(cluster_samples_df) > 0:
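    # Assumption: sem_extract is a DataFrame accessor called as df.sem_extract(input_cols, output_cols);
    # `lotus.sem_extract` itself is a module. The accessor has no `prompt` argument that we
    # know of, so `theme_prompt` is not passed.
    themes_df = cluster_samples_df.sem_extract(
        ["sample_text"],
        {"cluster_theme": "Main theme", "subtopics": "List of subtopics"},
    )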
    
    print(f"✅ Extracted themes for {len(themes_df)} clusters")
    print("\nSample of cluster themes:")
    for i in range(min(3, len(themes_df))):
        print(f"Cluster {themes_df.iloc[i]['cluster_id']} ({themes_df.iloc[i]['bookmark_count']} bookmarks):")
        print(f"  Theme: {themes_df.iloc[i]['cluster_theme']}")
        print(f"  Subtopics: {themes_df.iloc[i]['subtopics']}")
        print()
else:
    print("⚠️ No clusters available for theme extraction")

In [None]:
# Step 7.3: Identify Potential New Ontology Labels
print("=== Identifying Potential New Ontology Labels ===\n")

# Collect existing ontology labels: the full ontology plus any labels already assigned
existing_labels = set(all_label_names)
for labels in df_final['combined_ontology_labels']:
    if isinstance(labels, list):
        existing_labels.update(labels)

# Extract potential new labels from cluster themes
if 'themes_df' in locals() and len(themes_df) > 0:
    # Collect all themes and subtopics
    potential_labels = set()
    for _, row in themes_df.iterrows():
        # Add the main theme
        if isinstance(row['cluster_theme'], str):
            potential_labels.add(row['cluster_theme'])
        
        # Add all subtopics
        if isinstance(row['subtopics'], list):
            potential_labels.update(row['subtopics'])
    
    # Find labels that don't exist in the current ontology
    new_labels = potential_labels - existing_labels
    
    print(f"✅ Identified {len(new_labels)} potential new ontology labels")
    print("\nSample of potential new labels:")
    print(list(new_labels)[:10])  # Show up to 10 new labels
else:
    print("⚠️ No cluster themes available for new label identification")
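
A plain set difference treats "machine learning" and "Machine Learning" as different labels. The following sketch shows one way to compare candidates against the ontology case-insensitively while also collapsing duplicate candidates; `find_new_label_candidates` and the toy lists are hypothetical.

```python
def find_new_label_candidates(candidates, ontology_labels):
    """Return candidates whose lowercase, trimmed form is not already in the ontology."""
    known = {label.strip().lower() for label in ontology_labels}
    seen = set()
    new_candidates = []
    for candidate in candidates:
        if not isinstance(candidate, str):
            continue
        key = candidate.strip().lower()
        # Keep the first spelling of each unseen, non-empty candidate
        if key and key not in known and key not in seen:
            seen.add(key)
            new_candidates.append(candidate.strip())
    return new_candidates

toy_ontology = ["Machine Learning", "Python"]
toy_themes = ["machine learning", "Vector Databases", "python ", "vector databases", "Prompt Engineering"]
toy_new = find_new_label_candidates(toy_themes, toy_ontology)
print(toy_new)
```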

In [None]:
# Step 7.4: Entity Extraction for Ontology Expansion
print("=== Entity Extraction for Ontology Expansion ===\n")

# Define the entity extraction prompt
entity_prompt = """
Extract specific named entities from the text that could be valuable additions to a technical ontology.
Focus on:
1. Technical tools and frameworks
2. Programming languages and libraries
3. AI/ML models and techniques
4. Technical concepts and methodologies

Format your response as a JSON object with this structure:
{"extracted_entities": ["entity1", "entity2", ...], "entity_categories": {"category1": ["entity1", "entity2"], "category2": ["entity3"]}}
"""

# Sample bookmarks for entity extraction
sample_size = min(20, len(df_final))  # Process up to 20 rows for demonstration
entity_sample_df = df_final.sample(sample_size).copy()

# Extract entities using LOTUS sem_extract
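# Assumption: sem_extract is a DataFrame accessor called as df.sem_extract(input_cols, output_cols);
# `lotus.sem_extract` itself is a module. The accessor has no `prompt` argument that we
# know of, so `entity_prompt` is not passed.
entities_df = entity_sample_df.sem_extract(
    ["combined_text"],
    {
        "extracted_entities": "List of entities",
        "entity_categories": "Categorized entities",
    },
)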

print(f"✅ Extracted entities from {len(entities_df)} bookmarks")
print("\nSample of extracted entities:")
for i in range(min(2, len(entities_df))):
    print(f"Bookmark: {entities_df.iloc[i]['url']}")
    print(f"  Entities: {entities_df.iloc[i]['extracted_entities'][:5]}..." if len(entities_df.iloc[i]['extracted_entities']) > 5 else f"  Entities: {entities_df.iloc[i]['extracted_entities']}")
    print(f"  Categories: {entities_df.iloc[i]['entity_categories']}")
    print()

# Aggregate all extracted entities
all_entities = []
for entities in entities_df['extracted_entities']:
    if isinstance(entities, list):
        all_entities.extend(entities)

# Count entity frequencies
from collections import Counter
entity_counts = Counter(all_entities)
top_entities = entity_counts.most_common(20)  # Get top 20 entities

print("\nTop extracted entities:")
for entity, count in top_entities:
    print(f"  {entity}: {count} occurrences")

print("\n✅ Ontology expansion analysis completed successfully")

## SECTION 8: Store Enriched Data

In this section, we save the enriched data to files for further use.

In [None]:
# Step 8.1: Prepare Output Directory
print("=== Preparing Output Directory ===\n")

# Create output directory with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_dir = f"enriched_data_{timestamp}"

if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"✅ Created output directory: {output_dir}")
else:
    print(f"⚠️ Output directory already exists: {output_dir}")

In [None]:
# Step 8.2: Save Full Enriched Dataset
print("=== Saving Full Enriched Dataset ===\n")

# Save as CSV (for easy viewing and basic analysis)
csv_path = os.path.join(output_dir, "enriched_bookmarks.csv")

# Select columns to save
# Exclude large or complex columns that don't work well in CSV
csv_columns = [col for col in df_final.columns if not any(x in col for x in ['metadata', '_scores', '_paths'])]

# For list columns, convert to string representation
df_csv = df_final[csv_columns].copy()
for col in df_csv.columns:
    if df_csv[col].apply(lambda x: isinstance(x, list)).any():
        # map(str, ...) keeps the join from failing on non-string list items
        df_csv[col] = df_csv[col].apply(lambda x: ', '.join(map(str, x)) if isinstance(x, list) else x)

# Save to CSV
df_csv.to_csv(csv_path, index=False)
print(f"✅ Saved CSV file: {csv_path}")

# Save as JSONL (preserving all data structures)
jsonl_path = os.path.join(output_dir, "enriched_bookmarks.jsonl")

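# Convert DataFrame to JSONL
with open(jsonl_path, 'w', encoding='utf-8') as f:
    for _, row in df_final.iterrows():
        # Convert row to dictionary and handle non-serializable objects
        row_dict = row.to_dict()
        for key, value in row_dict.items():
            # Only test scalars: pd.isna() on a list returns an array,
            # which raises ValueError when used in an `if`
            if pd.api.types.is_scalar(value) and pd.isna(value):
                row_dict[key] = None
            elif isinstance(value, pd.Timestamp):
                row_dict[key] = value.isoformat()
        
        # Write as JSON line; `default` turns NumPy scalars/arrays into plain Python
        # values and falls back to str() for anything else json cannot encode
        f.write(json.dumps(row_dict, default=lambda o: o.tolist() if hasattr(o, 'tolist') else str(o)) + '\n')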

print(f"✅ Saved JSONL file: {jsonl_path}")
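
A quick way to confirm that list-valued columns survive the JSONL export is to read the file back and compare. The sketch below does this with toy records in a temporary directory; `write_jsonl`, `read_jsonl`, and the example records are hypothetical stand-ins for the real dataset.

```python
import json
import os
import tempfile

records = [
    {"id": 1, "url": "https://example.com/a", "combined_ontology_labels": ["Python", "Pandas"]},
    {"id": 2, "url": "https://example.com/b", "combined_ontology_labels": []},
]

def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.jsonl")
    write_jsonl(toy_path, records)
    loaded = read_jsonl(toy_path)

print(loaded == records)
```

Unlike the CSV export, which flattens lists into comma-separated strings, the JSONL round trip returns the label lists as lists.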

In [None]:
# Step 8.3: Save Label Statistics
print("=== Saving Label Statistics ===\n")

# Create a DataFrame with label statistics
label_stats = []

# Check if combined_ontology_labels column exists
if 'combined_ontology_labels' in df_final.columns:
    # Collect all labels
    all_labels = []
    for labels in df_final['combined_ontology_labels']:
        if isinstance(labels, list):
            all_labels.extend(labels)
    
    # Count label frequencies
    from collections import Counter
    label_counts = Counter(all_labels)
    
    # Convert to DataFrame
    for label, count in label_counts.most_common():
        label_stats.append({
            'label': label,
            'count': count,
            'percentage': round(count / len(df_final) * 100, 2)
        })
    
    # Create DataFrame
    label_stats_df = pd.DataFrame(label_stats)
    
    # Save to CSV
    stats_path = os.path.join(output_dir, "label_statistics.csv")
    label_stats_df.to_csv(stats_path, index=False)
    
    print(f"✅ Saved label statistics: {stats_path}")
    print(f"Total unique labels: {len(label_stats_df)}")
    print("\nTop 10 labels:")
    for i in range(min(10, len(label_stats_df))):
        print(f"  {label_stats_df.iloc[i]['label']}: {label_stats_df.iloc[i]['count']} bookmarks ({label_stats_df.iloc[i]['percentage']}%)")
else:
    print("⚠️ No 'combined_ontology_labels' column found. Skipping label statistics.")

In [None]:
# Step 8.4: Save Summary Report
print("=== Saving Summary Report ===\n")

# Create a summary report
summary = {
    'timestamp': datetime.now().isoformat(),
    'total_bookmarks': len(df_final),
    'output_directory': output_dir,
    'files_created': [
        {'name': 'enriched_bookmarks.csv', 'path': csv_path},
        {'name': 'enriched_bookmarks.jsonl', 'path': jsonl_path}
    ]
}

# Add label statistics if available
if 'label_stats_df' in locals():
    summary['total_unique_labels'] = len(label_stats_df)
    summary['files_created'].append({'name': 'label_statistics.csv', 'path': stats_path})

# Save summary as JSON
summary_path = os.path.join(output_dir, "summary.json")
with open(summary_path, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"✅ Saved summary report: {summary_path}")

print("\n✅ Data storage completed successfully")
print(f"All enriched data saved to directory: {output_dir}")

## SECTION 9: Finish Task & Summarize

In this section, we summarize the execution and provide final insights.

In [None]:
# Execution Summary
print("=== XMLC-LOTUS Execution Summary ===\n")

# Collect key statistics
stats = {
    'total_bookmarks_processed': len(df_final),
    'total_ontology_labels': len(all_label_names) if 'all_label_names' in locals() else 0,
    'total_labels_assigned': df_final['combined_ontology_labels'].apply(len).sum() if 'combined_ontology_labels' in df_final.columns else 0,
    'avg_labels_per_bookmark': round(df_final['combined_ontology_labels'].apply(len).mean(), 2) if 'combined_ontology_labels' in df_final.columns else 0,
    'output_directory': output_dir if 'output_dir' in locals() else 'Not saved',
    'execution_timestamp': datetime.now().isoformat()
}

# Print summary
print(f"Bookmarks Processed: {stats['total_bookmarks_processed']}")
print(f"Ontology Labels Available: {stats['total_ontology_labels']}")
print(f"Total Labels Assigned: {stats['total_labels_assigned']}")
print(f"Average Labels per Bookmark: {stats['avg_labels_per_bookmark']}")
print(f"Output Directory: {stats['output_directory']}")
print(f"Execution Completed: {stats['execution_timestamp']}")

print("\n✅ XMLC-LOTUS pipeline execution completed successfully")
print("The bookmark data has been enriched with ontology labels and is ready for further use.")