feat: GEO Index#1

Merged: rorybyrne merged 19 commits into main from rory/geo-index on Dec 26, 2025

@rorybyrne

No description provided.

…sync validation

- Add validation domain with trait registration and validation execution
- Implement event-driven architecture with DepositionSubmitted->ValidationCompleted flow
- Add shadow domain event listener for ValidationCompleted events
- Create OCI validator runner using Docker for containerized validation
- Add index and ingest SDK protocols with ChromaDB vector backend and GEO ingestor
- Implement comprehensive dependency injection wiring across all domains
- Add REST API endpoints for trait management and async validation
- Update SRN model to support new resource types (trait, convention, vocab)
- Add comprehensive database tables for depositions, traits, and validation runs
- Refactor command handlers to be async and use dataclass pattern
- Add in-memory event bus for prototype event handling
- Update Python version requirement from 3.14 to 3.13

Add comprehensive CLI functionality including:
- Server daemon management (start/stop/status) with background process handling
- Vector search command that communicates with REST API
- Health and search REST endpoints with proper error handling
- Local infrastructure for managing ~/.osa directory and server state
- Integration tests for GEO ingestor against live NCBI API

refactor: simplify dependency injection and remove unused domain components

Remove shadow domain integration, unused command handlers, and simplify
event listener architecture to prepare for outbox pattern implementation
- Remove shadow domain and replace with simplified validation flow
- Implement transactional outbox pattern for reliable event delivery
- Add BackgroundWorker for unified event processing and scheduling
- Replace typer CLI with cyclopts for better type safety
- Add comprehensive configuration system with YAML support
- Implement domain event listeners for deposition lifecycle
- Add initial ingestion and scheduled tasks support
- Restructure DI with proper scoping (APP/UOW) using custom Scope enum
- Add logging configuration and admin commands for local development
- Simplify validation model by removing trait concept
- Add record publishing and indexing pipeline
- Implement auto-approval curation for zero-config operation

refactor: update dependency injection to use custom UOW scope

Replace dishka's REQUEST scope with custom UOW (Unit of Work) scope
throughout the codebase to better align with domain boundaries and
support both HTTP requests and background event processing

Add Alembic configuration for database schema management with initial
migration creating tables for depositions, validation runs, records,
and events. Include extensive unit tests covering domain models,
aggregates, commands, services, and infrastructure mappers to ensure
proper functionality across deposition, shadow, validation, and shared
domains.

Copilot AI left a comment

Pull request overview

This PR implements a comprehensive GEO (Gene Expression Omnibus) data ingestion and vector-based indexing system. The changes include:

  • Complete GEO ingestor using NCBI E-utilities API for fetching genomics metadata
  • Vector storage backend using ChromaDB and sentence-transformers for semantic search
  • Event-driven architecture with background workers, outbox pattern, and scheduled tasks
  • Full domain model reorganization with validation, records, ingestion, and indexing domains
  • CLI interface for server management, search, and configuration
  • Comprehensive test coverage across unit and integration levels

Key Changes

  • GEO Ingestion: Implements fetching of GSE (GEO Series) records from NCBI with configurable scheduling and initial runs
  • Vector Search: ChromaDB-backed semantic search over metadata using sentence-transformers embeddings
  • Event System: Background worker with APScheduler for event processing and scheduled ingestion tasks
  • Infrastructure: SQLAlchemy persistence with Alembic migrations, dependency injection with Dishka using custom scopes

Reviewed changes

Copilot reviewed 160 out of 166 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Downgraded Python from 3.14 to 3.13, added comprehensive dependencies including aiodocker, chromadb, sentence-transformers, APScheduler |
| osa/infrastructure/ingest/geo/* | Complete GEO ingestor implementation using NCBI E-utilities API |
| osa/infrastructure/index/vector/* | Vector storage backend with ChromaDB and sentence-transformers |
| osa/infrastructure/event/worker.py | Background worker for event processing and scheduled tasks |
| osa/infrastructure/persistence/* | SQLAlchemy repositories, mappers, and database configuration |
| osa/domain/validation/* | Validation domain with models, services, and OCI container runner |
| osa/domain/ingest/* | Ingestion domain with listeners and scheduled tasks |
| osa/domain/index/* | Index domain with projector listener |
| osa/domain/record/* | Record domain with publication events and listeners |
| osa/config.py | Expanded configuration with YAML support, database, logging, indexes, and ingestors |
| osa/cli/* | Complete CLI with server management, search, config, and admin commands |
| tests/* | Comprehensive unit and integration tests |

Comment thread osa/infrastructure/index/di.py Outdated
Comment on lines +28 to +29
# elif idx_config.backend == "keyword":
# backends[name] = KeywordStorageBackend(name, idx_config.config)

Copilot AI Dec 21, 2025

This comment appears to contain commented-out code.

class _CommandHandlerMeta(ABCMeta):
    """Metaclass that combines ABC with auto-dataclass for subclasses."""

    def __new__(mcs, name: str, bases: tuple, namespace: dict):

Copilot AI Dec 21, 2025

Class methods or methods of a type deriving from type should have 'cls', rather than 'mcs', as their first parameter.
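
For readers unfamiliar with the pattern in the docstring above, a metaclass that combines `ABCMeta` with automatic `@dataclass` application might look roughly like the sketch below. This is an illustration under assumptions, not the PR's code: `_AutoDataclassMeta`, `Handler`, and `Greet` are hypothetical names. The first parameter is called `mcs` to match the snippet under review; the name itself has no effect on behavior.

```python
from abc import ABCMeta, abstractmethod
from dataclasses import dataclass, is_dataclass


class _AutoDataclassMeta(ABCMeta):
    """Hypothetical metaclass: ABC semantics plus @dataclass on subclasses."""

    def __new__(mcs, name: str, bases: tuple, namespace: dict, **kwargs):
        new_cls = super().__new__(mcs, name, bases, namespace, **kwargs)
        if bases:
            # Only subclasses are converted; the root base class is left alone
            new_cls = dataclass(new_cls)
        return new_cls


class Handler(metaclass=_AutoDataclassMeta):
    @abstractmethod
    def run(self) -> str: ...


class Greet(Handler):
    # Becomes a dataclass field because the metaclass applies @dataclass
    name: str

    def run(self) -> str:
        return f"hello {self.name}"
```

With this sketch, `Greet(name="GEO").run()` works without an explicit `__init__`, while `Handler` itself stays abstract and cannot be instantiated.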

class _EventListenerMeta(ABCMeta):
    """Metaclass that applies @dataclass and extracts __event_type__ from EventListener[E]."""

    def __new__(mcs, name: str, bases: tuple[type, ...], namespace: dict[str, Any]) -> type:

Copilot AI Dec 21, 2025

Class methods or methods of a type deriving from type should have 'cls', rather than 'mcs', as their first parameter.

from abc import ABC, abstractmethod
from typing import NewType
from uuid import UUID
from datetime import UTC, datetime

Copilot AI Dec 21, 2025

Import of 'UTC' is not used.
Import of 'datetime' is not used.

Comment thread osa/cli/util/daemon.py
try:
    os.kill(pid, signal.SIGKILL)
    time.sleep(0.5)
except ProcessLookupError:

Copilot AI Dec 21, 2025

'except' clause does nothing but pass and there is no explanatory comment.

    await container.delete(force=True)
except Exception:
    logfire.warning("Failed to delete container", container_id=container.id)
    pass

Copilot AI Dec 21, 2025

Unnecessary 'pass' statement.

Comment on lines +30 to +35
curation_required = False  # False for v1
if curation_required:
    logger.info(
        f"Curation required for {event.deposition_srn}, not auto-approving"
    )
    return

Copilot AI Dec 21, 2025

This statement is unreachable.

Suggested change
curation_required = False  # False for v1
if curation_required:
    logger.info(
        f"Curation required for {event.deposition_srn}, not auto-approving"
    )
    return


# UOW-scoped providers for listeners
for _listener_type in LISTENER_TYPES:
    locals()[_listener_type.__name__] = provide(_listener_type, scope=Scope.UOW)

Copilot AI Dec 21, 2025

Modification of the locals() dictionary will have no effect on the local variables.
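
Whether this warning applies depends on where the loop runs: inside a function, writes to `locals()` do not create local variables, while at class scope `locals()` returns the namespace that is passed to the metaclass. The sketch below shows the function-scope behavior and one way to avoid relying on `locals()` at all, by attaching attributes explicitly with `setattr`. `ListenerA`, `ListenerB`, `make_provider`, and `ListenerProviders` are hypothetical stand-ins; `make_provider` takes the place of `provide(..., scope=Scope.UOW)`, and whether a DI framework picks up attributes attached after class creation is not confirmed here.

```python
def write_through_locals() -> bool:
    """Return True if a write to locals() creates a readable local name."""
    locals()["answer"] = 42
    try:
        answer  # compiled as a global lookup, so the write above is not seen
    except NameError:
        return False
    return True


class ListenerA: ...


class ListenerB: ...


LISTENER_TYPES = [ListenerA, ListenerB]


def make_provider(listener_type: type) -> tuple:
    # Hypothetical stand-in for provide(listener_type, scope=Scope.UOW)
    return ("provider", listener_type)


class ListenerProviders:
    """Container whose attributes are attached explicitly after creation."""


for _listener_type in LISTENER_TYPES:
    # setattr makes the intent explicit and does not depend on locals()
    setattr(ListenerProviders, _listener_type.__name__, make_provider(_listener_type))
```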

Move IndexConfig and IngestConfig from domain value objects to main
config module to centralize configuration management. Rename event
listeners to use descriptive action-based names that better express
their purpose. Remove unused UoW abstraction and validation listener.

Move API routes from domain-specific packages to centralized v1 structure
with /api/v1 prefix. Add global OSA error handling and comprehensive CLI
improvements including search result caching, server log viewing, and
record detail display. Convert vector backend operations to async using
thread pools to prevent event loop blocking.
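
As a rough illustration of moving blocking backend calls off the event loop, the sketch below wraps a synchronous function with `asyncio.to_thread`. The names `blocking_query` and `query_async` are hypothetical, and the sleep merely stands in for a slow embedding or vector lookup; the PR may use a different thread-pool mechanism.

```python
import asyncio
import time


def blocking_query(text: str) -> list[str]:
    # Stand-in for a synchronous, potentially slow backend call
    time.sleep(0.05)
    return [text.upper()]


async def query_async(text: str) -> list[str]:
    # Run the blocking call in a worker thread so the event loop stays free
    return await asyncio.to_thread(blocking_query, text)
```

Calling `asyncio.run(query_async("geo"))` runs the blocking function in a worker thread and returns its result to the coroutine.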
Introduce structured console output using rich library for better UX:
- Add Console class wrapping rich for consistent formatting
- Create Pydantic models for type-safe data handling
- Replace print statements with styled success/error/warning messages
- Add structured display for search results and record details
- Improve error messages with hints and better formatting
- Add relative time formatting for server status

…tics

Add /api/v1/stats endpoint that returns record counts and index health status.
Add 'osa stats' CLI command to display this information in a user-friendly format.

feat: add init command to set up OSA configuration and directories

Add 'osa init' command that creates the XDG-compliant directory structure
and generates a default config file with GEO ingestor setup.

refactor: migrate from ~/.osa to XDG Base Directory specification

Move from single ~/.osa directory to XDG-compliant structure:
- ~/.config/osa/ for configuration
- ~/.local/share/osa/ for data (database, vectors)
- ~/.local/state/osa/ for runtime state and logs
- ~/.cache/osa/ for temporary cache files

This improves organization and follows Linux filesystem standards.
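
A minimal sketch of how the four locations above could be resolved, assuming the usual XDG environment variables with the listed home-relative defaults. The helper names `xdg_dir` and `osa_paths` are hypothetical and not taken from the PR.

```python
import os
from pathlib import Path


def xdg_dir(env_var: str, default: str, app: str = "osa") -> Path:
    """Resolve one XDG base directory for the app, falling back to the default."""
    base = os.environ.get(env_var, "")
    # The XDG spec says relative paths in these variables should be ignored
    if base and Path(base).is_absolute():
        root = Path(base)
    else:
        root = Path.home() / default
    return root / app


def osa_paths() -> dict[str, Path]:
    return {
        "config": xdg_dir("XDG_CONFIG_HOME", ".config"),
        "data": xdg_dir("XDG_DATA_HOME", ".local/share"),
        "state": xdg_dir("XDG_STATE_HOME", ".local/state"),
        "cache": xdg_dir("XDG_CACHE_HOME", ".cache"),
    }
```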

fix: prevent duplicate initial ingestion runs

Check if initial ingestion already completed for each ingestor before
triggering new runs to avoid redundant data processing on server restart.

fix: prevent duplicate log entries in daemon mode

Only add console handler when not using log file to avoid double logging
when stderr is redirected to log file in daemon mode.

Remove standalone config command module and integrate config
management into init command to simplify CLI structure and
reduce code duplication

Add REST API endpoint at /api/v1/events for listing domain events
with cursor-based pagination, filtering, and ordering support.

Add CLI command 'osa events' to view recent events from command line.

Add server restart command for easier development workflow.

Update event model to include created_at timestamp and extend
event repository with list_events and count methods.

Change default database path to XDG-compliant location.

…fig resolution

Remove optional config parameter from start function and update
_resolve_config to use non-optional Path parameter. Use paths.config_dir
as default config location and remove hint about creating ./osa.yaml
since config resolution is now simplified.

Replaces hardcoded GEO ingestor with entry point-based discovery system.
Adds GEOEntrezIngestor using NCBI E-utilities API for incremental updates.
Validates ingestor configs at startup for fail-fast error handling.

…ngestors

Change indexes and ingestors from dict[str, Config] to list[Config]
to simplify configuration and avoid naming conflicts. Add name field
to IndexConfig and use ingestor type as name for IngestConfig.
Update validation to check for duplicate ingestor types and improve
error handling with ConfigError class for better user experience.

Display hit score as percentage in green color at the beginning of
metadata line to help users quickly identify result relevance