# Agent Module: Rule Extraction and Consolidation

This notebook demonstrates the **Agent layer** of the rule extraction system. The agent is responsible for:

1. **Document Processing**: Loading and chunking technical documents
2. **Rule Extraction**: Using LLMs to extract operational rules from text
3. **Sensor Resolution**: Mapping natural language sensor references to actual sensor IDs
4. **Rule Consolidation**: Identifying and removing redundant rules

---
## 1. Setup

Initialize the environment, configure containers, and set up database connections.


In [1]:
import sys
from pathlib import Path

# Add workspace root to path for imports
workspace_root = Path.cwd().parent
if str(workspace_root) not in sys.path:
    sys.path.insert(0, str(workspace_root))

from src.config import AppConfig
from src.agent.infrastructure.container import get_agent_container
from src.api.infrastructure.container import init_container, get_container

# Initialize configuration
config = AppConfig()

# Initialize Agent container (handles LLM, vector store, document loader)
agent_container = get_agent_container()
use_case = agent_container.rule_extraction_use_case()
workflow = use_case.get_workflow()

# Initialize API container (handles database, file storage)
api_config = {
    "database": {
        "url": config.database.url,
        "async_url": config.database.async_url,
    },
    "storage": {"path": config.storage.path},
}
init_container(api_config)
api_container = get_container()

# Create database tables
db = api_container.database()
db.create_all()

print("Agent container initialized")
print("API container initialized")
print("Database tables created")


  from .autonotebook import tqdm as notebook_tqdm
2025-12-14 17:12:49,810 - INFO - HTTP Request: GET http://localhost:6333 "HTTP/1.1 200 OK"
[32m2025-12-14 17:12:50.002[0m | [1mINFO    [0m | [36msrc.agent.application.extraction_workflow[0m:[36m__init__[0m:[36m109[0m - [1m‚úì Structured output supported by LLM provider[0m
[32m2025-12-14 17:12:50.167[0m | [1mINFO    [0m | [36msrc.api.infrastructure.database[0m:[36m__init__[0m:[36m61[0m - [1mDatabase connection pool configured: pool_size=20, max_overflow=10, pool_recycle=3600s[0m
[32m2025-12-14 17:12:50.168[0m | [1mINFO    [0m | [36msrc.api.infrastructure.database[0m:[36mcreate_all[0m:[36m68[0m - [1mCreating database tables...[0m
[32m2025-12-14 17:12:50.362[0m | [1mINFO    [0m | [36msrc.api.infrastructure.database[0m:[36mcreate_all[0m:[36m70[0m - [1m‚úì Database tables created[0m


Agent container initialized
API container initialized
Database tables created


In [2]:
# Initialize all services
from src.api.application.collection_service import CollectionService
from src.api.application.document_service import DocumentService
from src.api.application.qdrant_sync_service import QdrantSyncService
from src.api.application.processing_service import ProcessingService
from src.api.application.sensor_service import SensorService
from src.api.application.consolidation_service import ConsolidationService
from src.api.domain.schemas import CollectionCreate, ConsolidationJobCreate
from src.api.infrastructure.repositories import (
    CollectionRepository,
    DocumentRepository,
    ChunkRepository,
    RuleRepository,
)

# Initialize services
file_storage = api_container.file_storage()
collection_service = CollectionService(file_storage=file_storage)
document_service = DocumentService(file_storage=file_storage)
sync_service = QdrantSyncService(
    file_storage=file_storage,
    document_loader=agent_container.document_loader(),
    vector_store_provider=agent_container.vector_store(),
)
processing_service = ProcessingService(file_storage=file_storage)
sensor_service = SensorService()

print("All services initialized")


[32m2025-12-14 17:12:50.396[0m | [1mINFO    [0m | [36msrc.api.infrastructure.storage[0m:[36m__init__[0m:[36m22[0m - [1mFile storage initialized at: storage[0m
2025-12-14 17:12:50,444 - INFO - HTTP Request: GET http://localhost:6333 "HTTP/1.1 200 OK"


All services initialized


---
## 2. Cleanup (Optional)

Reset the database and Qdrant vector store. Run this cell to start fresh.


In [None]:
async def cleanup_everything():
    """
    Complete cleanup: Delete all collections, documents, chunks, jobs, and Qdrant data.
    WARNING: This will delete ALL data!
    """
    print("Starting cleanup process...\n")
    
    # Step 1: Delete all collections from database (cascades to documents, chunks, rules)
    print("Step 1: Deleting all collections from database...")
    async with db.get_async_session() as session:
        collections = await collection_service.list_collections(session)
        
        if not collections:
            print("  No collections to delete")
        else:
            for col in collections:
                print(f"  Deleting collection: {col.name} (ID: {col.id})")
                await collection_service.delete_collection(session, col.id)
            await session.commit()
            print(f"  Deleted {len(collections)} collections")
    
    # Step 2: Clear all Qdrant collections
    print("\nStep 2: Clearing all Qdrant collections...")
    qdrant_collections = use_case.list_collections()
    
    if not qdrant_collections:
        print("  No Qdrant collections to clear")
    else:
        for qcol in qdrant_collections:
            print(f"  Clearing Qdrant collection: {qcol}")
            use_case.clear_vector_store(collection_name=qcol)
        print(f"  Cleared {len(qdrant_collections)} Qdrant collections")
    
    # Step 3: Verify cleanup
    print("\nStep 3: Verifying cleanup...")
    async with db.get_async_session() as session:
        remaining_collections = await collection_service.list_collections(session)
        print(f"  Database collections remaining: {len(remaining_collections)}")
    
    remaining_qdrant = use_case.list_collections()
    print(f"  Qdrant collections remaining: {len(remaining_qdrant)}")
    
    print("\nCleanup complete!")

# Uncomment to run cleanup
# await cleanup_everything()


---
## 3. Collections

Collections group related documents together. Each collection has:
- A corresponding Qdrant vector collection for semantic search
- Associated sensors for entity resolution
- Extracted rules from its documents


In [3]:
# Create a new collection
async def create_collection(name: str, description: str):
    """Create a new document collection."""
    async with db.get_async_session() as session:
        collection = await collection_service.create_collection(
            session,
            CollectionCreate(name=name, description=description)
        )
        await session.commit()
        
        print(f"Created Collection: {collection.name}")
        print(f"  ID: {collection.id}")
        print(f"  Qdrant collection: {collection.qdrant_collection_name}")
        print(f"  Created at: {collection.created_at}")
        return collection

# Create the collection
collection = await create_collection(
    name="algae_mock",
    description="Algae Mock - Mock process for testing"
)


[32m2025-12-14 17:13:13.723[0m | [1mINFO    [0m | [36msrc.api.application.collection_service[0m:[36mcreate_collection[0m:[36m43[0m - [1m‚úì Created collection: algae_mock (ID: 13)[0m


Created Collection: algae_mock
  ID: 13
  Qdrant collection: collection_algae_mock
  Created at: 2025-12-14 16:13:13.706657+00:00


In [4]:
# List all collections
async def list_collections():
    """List all collections with their statistics."""
    async with db.get_async_session() as session:
        collections = await collection_service.list_collections(session)
        
        print(f"Found {len(collections)} collections:\n")
        for col in collections:
            print(f"[{col.id}] {col.name}")
            print(f"    Description: {col.description}")
            print(f"    Qdrant: {col.qdrant_collection_name}")
            print(f"    Documents: {col.document_count}")
            print(f"    Chunks: {col.total_chunks}")
            print()
        return collections

collections = await list_collections()


Found 11 collections:

[13] algae_mock
    Description: Algae Mock - Mock process for testing
    Qdrant: collection_algae_mock
    Documents: 0
    Chunks: 0

[12] c3c4_splitter
    Description: C3/C4 Splitter - Refinery specifications and operational procedures
    Qdrant: collection_c3c4_splitter
    Documents: 1
    Chunks: 29

[9] Eval_184424
    Description: Evaluation test
    Qdrant: collection_eval_184424
    Documents: 4
    Chunks: 82

[8] Eval_160020
    Description: Evaluation test
    Qdrant: collection_eval_160020
    Documents: 4
    Chunks: 80

[7] Eval_151223
    Description: Evaluation test
    Qdrant: collection_eval_151223
    Documents: 4
    Chunks: 76

[6] Eval_142835
    Description: Evaluation test
    Qdrant: collection_eval_142835
    Documents: 4
    Chunks: 79

[5] Eval_020656
    Description: Evaluation test
    Qdrant: collection_eval_020656
    Documents: 4
    Chunks: 75

[4] Eval_010913
    Description: Evaluation test
    Qdrant: collection_eval_0109

---
## 4. Documents

Upload technical documents to a collection


In [5]:
# Upload documents from the resources folder
async def upload_documents(collection_id: int, folder_path: str):
    """Upload all markdown documents from a folder."""
    folder = Path(folder_path)
    files = [f for f in folder.iterdir() if f.is_file() and f.suffix in [".md", ".pdf", ".docx"]]
    
    print(f"Uploading {len(files)} documents to collection {collection_id}...\n")
    
    uploaded_docs = []
    for doc_path in files:
        with open(doc_path, 'rb') as f:
            async with db.get_async_session() as session:
                mime_type = "text/markdown" if doc_path.suffix == ".md" else "application/octet-stream"
                doc = await document_service.upload_document(
                    session=session,
                    collection_id=collection_id,
                    filename=doc_path.name,
                    file=f,
                    mime_type=mime_type
                )
                await session.commit()
                uploaded_docs.append(doc)
                print(f"Uploaded: {doc.filename}")
                print(f"  ID: {doc.id}")
                print(f"  Size: {doc.file_size} bytes")
                print(f"  Status: {doc.qdrant_status}")
    
    return uploaded_docs

# Upload documents
docs = await upload_documents(collection.id, "../resources/algae_mock")


[32m2025-12-14 17:13:38.720[0m | [1mINFO    [0m | [36msrc.api.infrastructure.storage[0m:[36msave_file[0m:[36m64[0m - [1m‚úì Saved file: collection_13/process.md (14354 bytes)[0m


[32m2025-12-14 17:13:38.766[0m | [1mINFO    [0m | [36msrc.api.application.document_service[0m:[36mupload_document[0m:[36m57[0m - [1m‚úì Uploaded document: process.md (ID: 33)[0m


Uploading 1 documents to collection 13...

Uploaded: process.md
  ID: 33
  Size: 14354 bytes
  Status: QdrantStatus.NOT_UPLOADED


In [6]:
# Sync documents to Qdrant (chunk and embed)
async def sync_to_qdrant(collection_id: int):
    """Sync documents to Qdrant vector store."""
    print(f"Syncing collection {collection_id} to Qdrant...")
    
    async with db.get_async_session() as session:
        result = await sync_service.sync_collection_to_qdrant(session, collection_id)
        await session.commit()
        
        print(f"\nSync complete:")
        print(f"  Documents synced: {result['synced_documents']}")
        print(f"  Failed documents: {result['failed_documents']}")
        print(f"  Total chunks: {result['total_chunks']}")
        return result

sync_result = await sync_to_qdrant(collection.id)


[32m2025-12-14 17:13:46.683[0m | [1mINFO    [0m | [36msrc.api.application.qdrant_sync_service[0m:[36msync_collection_to_qdrant[0m:[36m74[0m - [1mSyncing 1 documents to Qdrant...[0m
[32m2025-12-14 17:13:46.698[0m | [1mINFO    [0m | [36msrc.agent.infrastructure.document_loaders[0m:[36mload_documents[0m:[36m27[0m - [1mLoading document: storage/collection_13/process.md[0m
2025-12-14 17:13:46,708 - INFO - detected formats: [<InputFormat.MD: 'md'>]
2025-12-14 17:13:46,714 - INFO - Going to convert document batch...
2025-12-14 17:13:46,715 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-12-14 17:13:46,742 - INFO - Loading plugin 'docling_defaults'
2025-12-14 17:13:46,747 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-12-14 17:13:46,748 - INFO - Processing document process.md


Syncing collection 13 to Qdrant...


2025-12-14 17:13:47,203 - INFO - Finished converting document process.md in 0.50 sec.
[32m2025-12-14 17:13:47.248[0m | [1mINFO    [0m | [36msrc.agent.infrastructure.document_loaders[0m:[36mload_documents[0m:[36m33[0m - [1m‚úì Loaded 23 chunks from storage/collection_13/process.md[0m
[32m2025-12-14 17:13:47.248[0m | [1mINFO    [0m | [36msrc.agent.infrastructure.document_loaders[0m:[36mload_documents[0m:[36m37[0m - [1mTotal chunks loaded: 23[0m
2025-12-14 17:13:47,371 - INFO - HTTP Request: GET http://localhost:6333 "HTTP/1.1 200 OK"
2025-12-14 17:13:47,383 - INFO - HTTP Request: GET http://localhost:6333/collections/collection_algae_mock/exists "HTTP/1.1 200 OK"
2025-12-14 17:13:49,558 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-12-14 17:13:49,687 - INFO - HTTP Request: PUT http://localhost:6333/collections/collection_algae_mock "HTTP/1.1 200 OK"
2025-12-14 17:13:51,017 - INFO - HTTP Request: POST http://localhost:11434/api


Sync complete:
  Documents synced: 1
  Failed documents: 0
  Total chunks: 23


---
## 5. Chunks

Documents are split into chunks for semantic search. Each chunk is embedded and stored in Qdrant.


In [7]:
# View chunks in a collection
async def list_chunks(collection_id: int, limit: int = 5):
    """List chunks in a collection."""
    async with db.get_async_session() as session:
        chunk_repo = ChunkRepository(session)
        chunks = await chunk_repo.list_by_collection(collection_id)
        
        # Extract data within session context
        chunk_data = []
        for chunk in chunks:
            chunk_data.append({
                'id': chunk.id,
                'chunk_index': chunk.chunk_index,
                'document_filename': chunk.document.filename,
                'content_preview': chunk.content_preview,
                'qdrant_point_id': chunk.qdrant_point_id,
            })
    
    print(f"Total chunks: {len(chunk_data)}\n")
    for data in chunk_data[:limit]:
        print(f"Chunk #{data['chunk_index']} (ID: {data['id']})")
        print(f"  Document: {data['document_filename']}")
        print(f"  Preview: {data['content_preview'][:100]}...")
        print(f"  Qdrant ID: {data['qdrant_point_id']}")
        print()
    
    if len(chunk_data) > limit:
        print(f"... and {len(chunk_data) - limit} more chunks")
    
    return chunk_data

chunks = await list_chunks(collection.id)


Total chunks: 23

Chunk #0 (ID: 669)
  Document: process.md
  Preview: # Astro-Algae Bioreactor: Process Description and Control Strategy

## 1. Process Overview

The **As...
  Qdrant ID: def59788-ffb0-4e7a-b427-95b8d73e324f

Chunk #1 (ID: 670)
  Document: process.md
  Preview: ## 2. Critical Alarm Limits and Thresholds

### 2.1 Temperature Alarms

The following temperature li...
  Qdrant ID: 450c704b-d44d-4bb5-bfc6-b7213e9cc514

Chunk #2 (ID: 671)
  Document: process.md
  Preview: **CRITICAL** : If product storage temperature exceeds **10¬∞C** , protein degradation begins within 3...
  Qdrant ID: 628deb20-979e-4975-99e8-0e18cc9aadb9

Chunk #3 (ID: 672)
  Document: process.md
  Preview: **CRITICAL** : If filter differential pressure exceeds **1.0 bar** , membrane fouling is severe. Sto...
  Qdrant ID: 50f3e4d1-60b1-4771-be0a-12683e51c96a

Chunk #4 (ID: 673)
  Document: process.md
  Preview: **CRITICAL** : If storage tank level exceeds **98%** , overflow risk. Stop product transfer.

##

---
## 6. Sensors

Sensors provide context for rule extraction. The LLM uses sensor metadata to map natural language references (e.g., "column temperature") to actual sensor IDs (e.g., "TI-101").


In [8]:
import pandas as pd

# Import sensors from CSV
async def import_sensors(collection_id: int, csv_path: str):
    """Import sensors from a CSV file."""
    async with db.get_async_session() as session:
        with open(csv_path, 'rb') as f:
            result = await sensor_service.import_from_csv(
                session,
                collection_id=collection_id,
                file=f
            )
        print(f"Successfully imported {result.total} sensors")
        return result

# Import sensors for collection
sensors_result = await import_sensors(
    collection.id, 
    "../resources/algae_mock/sensors.csv"
)


[32m2025-12-14 17:14:08.472[0m | [1mINFO    [0m | [36msrc.api.application.sensor_service[0m:[36mimport_from_csv[0m:[36m108[0m - [1müìÑ Parsed 27 sensors from CSV[0m
[32m2025-12-14 17:14:08.627[0m | [1mINFO    [0m | [36msrc.api.application.sensor_service[0m:[36mimport_from_csv[0m:[36m120[0m - [1m‚úì Imported 27 sensors for collection 13[0m


Successfully imported 27 sensors


In [9]:
# Display sensors as a table
async def display_sensors(collection_id: int):
    """Display sensors in a formatted table."""
    async with db.get_async_session() as session:
        result = await sensor_service.list_sensors(session, collection_id=collection_id)
        sensors = result.sensors
        
        df = pd.DataFrame([
            {
                'Sensor ID': s.sensor_id,
                'Name': s.name[:40] + '...' if len(s.name) > 40 else s.name,
                'Unit': s.unit or 'N/A',
                'Example': s.example or 'N/A',
            }
            for s in sensors
        ])
        
        print(f"Total sensors: {result.total}\n")
        return df

sensors_df = await display_sensors(collection.id)
display(sensors_df)


Total sensors: 27



Unnamed: 0,Sensor ID,Name,Unit,Example
0,AA_ACTIVITY,Culture Activity,%,95.4
1,AA_AERATION_RATE,Aeration Rate,L/min,25.0
2,AA_AGITATOR_POWER,Agitator Power,%,85.0
3,AA_CONC_PRODUCT,Product Concentration,g/L,10.2
4,AA_DENSITY_BIO,Culture Density,g/L,85.3
5,AA_FEED_FLOW,Nutrient Feed Flow,L/hr,250.0
6,AA_GLUCOSE_MOL,Glucose Concentration,%,5.5
7,AA_LEVEL_BIO,Bioreactor Level,%,75.0
8,AA_LEVEL_CULTURE,Culture Tank Level,%,65.8
9,AA_LEVEL_STORAGE,Storage Tank Level,%,80.2


---
## 7. Rule Extraction

The extraction workflow:
1. Creates a processing job with one task per chunk
2. For each chunk, the LLM extracts operational rules
3. Rules are validated (Python syntax check)
4. Sensor references are resolved to actual IDs
5. Time expressions are parsed and validated


In [10]:
# Create a processing job
async def create_extraction_job(collection_id: int, use_grounding: bool = True):
    """Create a rule extraction job."""
    print(f"Creating extraction job...")
    
    async with db.get_async_session() as session:
        job = await processing_service.create_job(
            session, 
            collection_id, 
            use_grounding=use_grounding
        )
        await session.commit()
        
        print(f"\nCreated job {job.id}:")
        print(f"  Status: {job.status}")
        print(f"  Total chunks: {job.total_chunks}")
        print(f"  Use grounding: {use_grounding}")
        return job

extraction_job = await create_extraction_job(collection.id, use_grounding=True)


Creating extraction job...


[32m2025-12-14 17:14:21.234[0m | [1mINFO    [0m | [36msrc.api.application.processing_service[0m:[36mcreate_job[0m:[36m60[0m - [1m‚úì Created processing job 13 with 23 tasks for collection 13[0m



Created job 13:
  Status: ProcessingStatus.PENDING
  Total chunks: 23
  Use grounding: True


In [12]:
# View job tasks
async def list_tasks(job_id: int, limit: int = 5):
    """List tasks in a job."""
    async with db.get_async_session() as session:
        tasks = await processing_service.get_job_tasks(session, job_id)
        
        print(f"Job {job_id}: {len(tasks)} tasks\n")
        
        for task in tasks[:limit]:
            print(f"Task {task.id}:")
            print(f"  Chunk: #{task.chunk.chunk_index}")
            print(f"  Status: {task.status}")
            print(f"  Preview: {task.chunk.content_preview[:60]}...")
            print()
        
        if len(tasks) > limit:
            print(f"... and {len(tasks) - limit} more tasks")
        
        return tasks

tasks = await list_tasks(extraction_job.id)


Job 12: 29 tasks

Task 827:
  Chunk: #0
  Status: ProcessingStatus.PENDING
  Preview: # C3/C4 Splitter: Process Description and Control Strategy

...

Task 828:
  Chunk: #1
  Status: ProcessingStatus.PENDING
  Preview: ### Fractionation Column (T-1405)

- The column separates th...

Task 829:
  Chunk: #2
  Status: ProcessingStatus.PENDING
  Preview: ### Reboiler System (E-1407)

- The heat necessary for fract...

Task 830:
  Chunk: #3
  Status: ProcessingStatus.PENDING
  Preview: ### Overhead System

- The overhead vapors (primarily C3 com...

Task 831:
  Chunk: #4
  Status: ProcessingStatus.PENDING
  Preview: ### Water Handling

- Sour water is collected in the boots o...

... and 24 more tasks


In [11]:
from src.api.application.job_executor import execute_job
import time

# Execute the extraction job
async def run_extraction(job_id: int):
    """Execute the extraction job and monitor progress."""
    print(f"Starting extraction job {job_id}...")
    print("This may take a few minutes depending on the number of chunks.\n")
    
    start_time = time.time()
    await execute_job(job_id)
    elapsed = time.time() - start_time
    
    # Check final status
    async with db.get_async_session() as session:
        final_job = await processing_service.get_job(session, job_id)
        
        print(f"\nJob {job_id} completed in {elapsed:.1f}s")
        print(f"  Status: {final_job.status}")
        print(f"  Progress: {final_job.progress_percentage:.1f}%")
        print(f"  Completed: {final_job.completed_chunks}/{final_job.total_chunks}")
        print(f"  Failed: {final_job.failed_chunks}")
        return final_job

extraction_result = await run_extraction(extraction_job.id)


[32m2025-12-14 17:14:26.694[0m | [1mINFO    [0m | [36msrc.api.application.job_executor[0m:[36mexecute_job[0m:[36m37[0m - [1müöÄ Starting execution of job 13[0m
[32m2025-12-14 17:14:26.731[0m | [1mINFO    [0m | [36msrc.api.application.job_executor[0m:[36mexecute_job[0m:[36m77[0m - [1m‚ö° Processing 23 tasks with max 5 concurrent workers[0m
[32m2025-12-14 17:14:26.733[0m | [1mINFO    [0m | [36msrc.api.application.job_executor[0m:[36m_process_task[0m:[36m149[0m - [1müìã Processing task 856 (chunk 669)[0m
[32m2025-12-14 17:14:26.734[0m | [1mINFO    [0m | [36msrc.api.application.job_executor[0m:[36m_process_task[0m:[36m149[0m - [1müìã Processing task 857 (chunk 670)[0m
[32m2025-12-14 17:14:26.735[0m | [1mINFO    [0m | [36msrc.api.application.job_executor[0m:[36m_process_task[0m:[36m149[0m - [1müìã Processing task 858 (chunk 671)[0m
[32m2025-12-14 17:14:26.736[0m | [1mINFO    [0m | [36msrc.api.application.job_executor[0m:

Starting extraction job 13...
This may take a few minutes depending on the number of chunks.



[32m2025-12-14 17:14:26.896[0m | [1mINFO    [0m | [36msrc.api.application.job_executor[0m:[36m_execute_workflow_with_callbacks[0m:[36m463[0m - [1mRunning task 856 without Langfuse tracing[0m
[32m2025-12-14 17:14:26.897[0m | [1mINFO    [0m | [36msrc.agent.application.extraction_workflow[0m:[36m_gather_context[0m:[36m287[0m - [1müìö Gathering context for chunk from: process.md (collection: collection_algae_mock)[0m
[32m2025-12-14 17:14:26.946[0m | [1mINFO    [0m | [36msrc.api.application.job_executor[0m:[36m_process_task[0m:[36m205[0m - [1müì° Loaded 27 sensors for collection algae_mock[0m
[32m2025-12-14 17:14:27.001[0m | [1mINFO    [0m | [36msrc.agent.application.extraction_workflow[0m:[36m__init__[0m:[36m109[0m - [1m‚úì Structured output supported by LLM provider[0m
[32m2025-12-14 17:14:27.025[0m | [1mINFO    [0m | [36msrc.api.application.job_executor[0m:[36m_execute_workflow_with_callbacks[0m:[36m463[0m - [1mRunning task 859


Job 13 completed in 881.0s
  Status: ProcessingStatus.RUNNING
  Progress: 0.0%
  Completed: 0/23
  Failed: 0


In [12]:
# View extracted rules
async def get_rules(collection_id: int, limit: int = 100):
    """Get all extracted rules for a collection."""
    async with db.get_async_session() as session:
        rule_repo = RuleRepository(session)
        
        rules = await rule_repo.list_by_collection(collection_id, limit=limit)
        stats = await rule_repo.get_stats_by_collection(collection_id)
        
        print(f"Extracted Rules Summary:")
        print(f"  Total rules: {stats['total_rules']}")
        print(f"  By type: {stats['rules_by_type']}")
        print(f"  Latest extraction: {stats['latest_extraction']}")
        
        print(f"\nSample rules (first 5):\n")
        for i, rule in enumerate(rules[:5], 1):
            print(f"[{i}] {rule.rule_name}")
            print(f"    Type: {rule.rule_type or 'general'}")
            print(f"    Description: {rule.rule_description[:80]}..." if len(rule.rule_description) > 80 else f"    Description: {rule.rule_description}")
            print(f"    Status: {rule.verification_status}")
            print()
        
        if len(rules) > 5:
            print(f"... and {len(rules) - 5} more rules")
        
        return rules, stats

rules, rule_stats = await get_rules(collection.id)


Extracted Rules Summary:
  Total rules: 281
  By type: {'operational': 108, 'safety': 114, 'maintenance': 18, 'quality': 24, 'optimization': 17}
  Latest extraction: 2025-12-14 16:29:07.562662+00:00

Sample rules (first 5):

[1] bioreactor_low_low_level_alert
    Type: safety
    Description: Alert when bioreactor level falls below critical low-low limit
    Status: VerificationStatus.OK

[2] low_oxygen_sustained_alert
    Type: operational
    Description: Alert when oxygen level remains below threshold for extended period
    Status: VerificationStatus.OK

[3] storage_tank_high_high_level_alert
    Type: safety
    Description: Alert when storage tank level exceeds critical high-high limit
    Status: VerificationStatus.OK

[4] bioreactor_critical_zone_alert
    Type: safety
    Description: Alert when bioreactor enters critical operating zone requiring emergency shutdow...
    Status: VerificationStatus.OK

[5] filter_pressure_trend_alert
    Type: maintenance
    Description: Alert

In [13]:
# Display a sample rule body
if rules:
    sample_rule = rules[0]
    print(f"Sample Rule: {sample_rule.rule_name}")
    print(f"Description: {sample_rule.rule_description}")
    print(f"\nRule Body:")
    print("-" * 60)
    print(sample_rule.rule_body)
    print("-" * 60)


Sample Rule: bioreactor_low_low_level_alert
Description: Alert when bioreactor level falls below critical low-low limit

Rule Body:
------------------------------------------------------------
def bioreactor_low_low_level_alert(status) -> str:
    current_bio_level = status.get("AA_LEVEL_BIO", "0")
    if current_bio_level and current_bio_level < 20:
        return "bioreactor_low_low_level_alert"
    return None
------------------------------------------------------------


---
## 8. Rule Consolidation

The consolidation workflow identifies and handles:
- **Redundant rules**: Exact or semantic duplicates
- **Mergeable rules**: Multiple conditions that can be combined
- **Simplifiable rules**: Complex logic that can be optimized

Key metric: **Consolidation Ratio** = Input Rules / Output Rules


In [15]:
# Display rules before consolidation
async def display_rules_summary(collection_id: int):
    """Display summary of rules before consolidation."""
    async with db.get_async_session() as session:
        rule_repo = RuleRepository(session)
        
        all_rules = await rule_repo.list_by_collection(collection_id, limit=500)
        active_rules = await rule_repo.list_active_by_collection(collection_id)
        
        # Calculate statistics by status
        verified_rules = [r for r in active_rules if r.verification_status == "OK"]
        sensor_issues = [r for r in active_rules if r.sensor_parsing_status in ["SENSORS_NOT_FOUND"]]
        syntax_errors = [r for r in active_rules if r.verification_status == "SYNTAX_ERROR"]
        
        print(f"Rules Summary:")
        print(f"  Total rules: {len(all_rules)}")
        print(f"  Active rules: {len(active_rules)}")
        print(f"    - Fully verified: {len(verified_rules)}")
        print(f"    - Sensor issues: {len(sensor_issues)}")
        print(f"    - Syntax errors: {len(syntax_errors)}")
        
        return active_rules

active_rules_before = await display_rules_summary(collection.id)


Rules Summary:
  Total rules: 281
  Active rules: 281
    - Fully verified: 0
    - Sensor issues: 0
    - Syntax errors: 0


In [16]:
# Create consolidation job
async def create_consolidation_job(collection_id: int, confidence_threshold: float = 0.7):
    """Create a consolidation job."""
    async with db.get_async_session() as session:
        consolidation_service = ConsolidationService(session)
        
        job_data = ConsolidationJobCreate(
            collection_id=collection_id,
            confidence_threshold=confidence_threshold
        )
        
        job = await consolidation_service.create_consolidation_job(job_data)
        
        print(f"Created consolidation job:")
        print(f"  Job ID: {job.id}")
        print(f"  Collection ID: {job.collection_id}")
        print(f"  Confidence threshold: {job.confidence_threshold}")
        print(f"  Input rules: {job.input_rules_count}")
        print(f"  Status: {job.status}")
        
        return job

consolidation_job = await create_consolidation_job(
    collection.id,
    confidence_threshold=0.7
)


[32m2025-12-14 18:37:12.068[0m | [1mINFO    [0m | [36msrc.api.application.consolidation_service[0m:[36mcreate_consolidation_job[0m:[36m69[0m - [1mCreated consolidation job 3 for collection=13, job=None[0m


Created consolidation job:
  Job ID: 3
  Collection ID: 13
  Confidence threshold: 0.7
  Input rules: 0
  Status: ProcessingStatus.PENDING


In [17]:
from src.api.application.consolidation_executor import execute_consolidation_job

# Execute consolidation
async def run_consolidation(job_id: int):
    """Execute the consolidation workflow."""
    print(f"Running consolidation job {job_id}...")
    print("This may take a few minutes...\n")
    
    start_time = time.time()
    await execute_consolidation_job(job_id)
    elapsed = time.time() - start_time
    
    print(f"\nConsolidation completed in {elapsed:.1f}s")

await run_consolidation(consolidation_job.id)


[32m2025-12-14 18:37:13.514[0m | [1mINFO    [0m | [36msrc.api.application.consolidation_executor[0m:[36mexecute_consolidation_job[0m:[36m42[0m - [1müöÄ Starting consolidation job[0m


Running consolidation job 3...
This may take a few minutes...



[32m2025-12-14 18:37:13.770[0m | [1mINFO    [0m | [36msrc.api.application.consolidation_executor[0m:[36mexecute_consolidation_job[0m:[36m89[0m - [1mLoaded 281 active rules and 27 sensors from collection 13[0m
[32m2025-12-14 18:37:13.771[0m | [1mINFO    [0m | [36msrc.api.application.consolidation_executor[0m:[36mexecute_consolidation_job[0m:[36m135[0m - [1mRunning consolidation workflow...[0m
[32m2025-12-14 18:37:13.799[0m | [1mINFO    [0m | [36msrc.agent.application.consolidation_workflow[0m:[36m__init__[0m:[36m77[0m - [1m‚úì Structured output supported by LLM provider for consolidation[0m
[32m2025-12-14 18:37:13.813[0m | [1mINFO    [0m | [36msrc.agent.application.consolidation_workflow[0m:[36m_analyze_rules[0m:[36m188[0m - [1müìä Analyzing rules for consolidation opportunities...[0m
[32m2025-12-14 18:37:13.813[0m | [1mINFO    [0m | [36msrc.agent.application.consolidation_workflow[0m:[36m_analyze_rules[0m:[36m198[0m - [1mAnaly


Consolidation completed in 905.5s


In [18]:
from src.api.infrastructure.repositories import ConsolidationJobRepository

# Display consolidation results
async def display_consolidation_results(job_id: int):
    """Display the results of a consolidation job."""
    async with db.get_async_session() as session:
        consolidation_repo = ConsolidationJobRepository(session)
        job = await consolidation_repo.get_by_id(job_id)
        
        if not job:
            print("Consolidation job not found")
            return
        
        print(f"Consolidation Job #{job.id}")
        print(f"  Status: {job.status}")
        print(f"  Started: {job.started_at}")
        print(f"  Completed: {job.completed_at}")
        
        if job.completed_at and job.started_at:
            duration = (job.completed_at - job.started_at).total_seconds()
            print(f"  Duration: {duration:.2f}s")
        
        print(f"\nStatistics:")
        print(f"  Input rules: {job.input_rules_count}")
        print(f"  Output rules: {job.output_rules_count}")
        print(f"  Rules removed: {job.rules_removed}")
        print(f"  Rules merged: {job.rules_merged}")
        print(f"  Rules simplified: {job.rules_simplified}")
        
        if job.input_rules_count and job.output_rules_count:
            reduction = (job.input_rules_count - job.output_rules_count) / job.input_rules_count * 100
            ratio = job.input_rules_count / job.output_rules_count
            print(f"  Reduction: {reduction:.1f}%")
            print(f"  Consolidation ratio: {ratio:.2f}x")
        
        if job.error:
            print(f"\nError: {job.error}")
        
        return job

consolidation_result = await display_consolidation_results(consolidation_job.id)


Consolidation Job #3
  Status: ProcessingStatus.COMPLETED
  Started: 2025-12-14 16:37:13.699352+00:00
  Completed: 2025-12-14 16:52:19.028282+00:00
  Duration: 905.33s

Statistics:
  Input rules: 281
  Output rules: 68
  Rules removed: 20
  Rules merged: 60
  Rules simplified: 17
  Reduction: 75.8%
  Consolidation ratio: 4.13x


---
## 9. Summary

This notebook demonstrated the complete rule extraction and consolidation pipeline:

1. **Setup**: Initialized agent and API containers
2. **Collections**: Created document collections with Qdrant integration
3. **Documents**: Uploaded and chunked technical documents
4. **Sensors**: Imported sensor metadata for entity resolution
5. **Extraction**: Executed the LLM-based rule extraction workflow
6. **Consolidation**: Optimized the rule set by removing redundancies

The extracted rules can be used with the **Streaming** module for real-time anomaly detection. See `streaming.ipynb` for details.
