feat: GEO Index #1
…sync validation

- Add validation domain with trait registration and validation execution
- Implement event-driven architecture with DepositionSubmitted->ValidationCompleted flow
- Add shadow domain event listener for ValidationCompleted events
- Create OCI validator runner using Docker for containerized validation
- Add index and ingest SDK protocols with ChromaDB vector backend and GEO ingestor
- Implement comprehensive dependency injection wiring across all domains
- Add REST API endpoints for trait management and async validation
- Update SRN model to support new resource types (trait, convention, vocab)
- Add comprehensive database tables for depositions, traits, and validation runs
- Refactor command handlers to be async and use dataclass pattern
- Add in-memory event bus for prototype event handling
- Update Python version requirement from 3.14 to 3.13
Add comprehensive CLI functionality including:

- Server daemon management (start/stop/status) with background process handling
- Vector search command that communicates with REST API
- Health and search REST endpoints with proper error handling
- Local infrastructure for managing ~/.osa directory and server state
- Integration tests for GEO ingestor against live NCBI API

refactor: simplify dependency injection and remove unused domain components

Remove shadow domain integration, unused command handlers, and simplify event listener architecture to prepare for outbox pattern implementation
- Remove shadow domain and replace with simplified validation flow
- Implement transactional outbox pattern for reliable event delivery
- Add BackgroundWorker for unified event processing and scheduling
- Replace typer CLI with cyclopts for better type safety
- Add comprehensive configuration system with YAML support
- Implement domain event listeners for deposition lifecycle
- Add initial ingestion and scheduled tasks support
- Restructure DI with proper scoping (APP/UOW) using custom Scope enum
- Add logging configuration and admin commands for local development
- Simplify validation model by removing trait concept
- Add record publishing and indexing pipeline
- Implement auto-approval curation for zero-config operation

refactor: update dependency injection to use custom UOW scope

Replace dishka's REQUEST scope with custom UOW (Unit of Work) scope throughout the codebase to better align with domain boundaries and support both HTTP requests and background event processing
Add Alembic configuration for database schema management with initial migration creating tables for depositions, validation runs, records, and events. Include extensive unit tests covering domain models, aggregates, commands, services, and infrastructure mappers to ensure proper functionality across deposition, shadow, validation, and shared domains.
Pull request overview
This PR implements a comprehensive GEO (Gene Expression Omnibus) data ingestion and vector-based indexing system. The changes include:
- Complete GEO ingestor using NCBI E-utilities API for fetching genomics metadata
- Vector storage backend using ChromaDB and sentence-transformers for semantic search
- Event-driven architecture with background workers, outbox pattern, and scheduled tasks
- Full domain model reorganization with validation, records, ingestion, and indexing domains
- CLI interface for server management, search, and configuration
- Comprehensive test coverage across unit and integration levels
Key Changes
- GEO Ingestion: Implements fetching of GSE (GEO Series) records from NCBI with configurable scheduling and initial runs
- Vector Search: ChromaDB-backed semantic search over metadata using sentence-transformers embeddings
- Event System: Background worker with APScheduler for event processing and scheduled ingestion tasks
- Infrastructure: SQLAlchemy persistence with Alembic migrations, dependency injection with Dishka using custom scopes
Reviewed changes
Copilot reviewed 160 out of 166 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| pyproject.toml | Downgraded Python from 3.14 to 3.13, added comprehensive dependencies including aiodocker, chromadb, sentence-transformers, APScheduler |
| osa/infrastructure/ingest/geo/* | Complete GEO ingestor implementation using NCBI E-utilities API |
| osa/infrastructure/index/vector/* | Vector storage backend with ChromaDB and sentence-transformers |
| osa/infrastructure/event/worker.py | Background worker for event processing and scheduled tasks |
| osa/infrastructure/persistence/* | SQLAlchemy repositories, mappers, and database configuration |
| osa/domain/validation/* | Validation domain with models, services, and OCI container runner |
| osa/domain/ingest/* | Ingestion domain with listeners and scheduled tasks |
| osa/domain/index/* | Index domain with projector listener |
| osa/domain/record/* | Record domain with publication events and listeners |
| osa/config.py | Expanded configuration with YAML support, database, logging, indexes, and ingestors |
| osa/cli/* | Complete CLI with server management, search, config, and admin commands |
| tests/* | Comprehensive unit and integration tests |
# elif idx_config.backend == "keyword":
# backends[name] = KeywordStorageBackend(name, idx_config.config)
This comment appears to contain commented-out code.
class _CommandHandlerMeta(ABCMeta):
    """Metaclass that combines ABC with auto-dataclass for subclasses."""

    def __new__(mcs, name: str, bases: tuple, namespace: dict):
Class methods or methods of a type deriving from type should have 'cls', rather than 'mcs', as their first parameter.
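One way the suggested rename could look is sketched below. Only the metaclass name and the general idea (ABC plus automatic `@dataclass`) come from the reviewed snippet; the body, the `CommandHandler` root class, and the `EchoHandler` example are assumptions. The first parameter is called `cls`, and the class being created gets a separate name to avoid confusion.

```python
from abc import ABCMeta, abstractmethod
from dataclasses import dataclass, is_dataclass


class _CommandHandlerMeta(ABCMeta):
    """Applies @dataclass to every subclass of the root handler class."""

    def __new__(cls, name: str, bases: tuple, namespace: dict, **kwargs):
        # `cls` is the metaclass itself; `new_cls` is the class being created.
        new_cls = super().__new__(cls, name, bases, namespace, **kwargs)
        if bases:  # leave the root class alone, decorate only subclasses
            new_cls = dataclass(new_cls)
        return new_cls


class CommandHandler(metaclass=_CommandHandlerMeta):
    """Hypothetical root class for handlers."""

    @abstractmethod
    def run(self, command: str) -> str: ...


class EchoHandler(CommandHandler):  # hypothetical handler with one dependency
    prefix: str

    def run(self, command: str) -> str:
        return f"{self.prefix}{command}"
```

With this in place, `EchoHandler(prefix="> ")` gets a generated `__init__` without an explicit decorator, while the abstract root class still cannot be instantiated.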
class _EventListenerMeta(ABCMeta):
    """Metaclass that applies @dataclass and extracts __event_type__ from EventListener[E]."""

    def __new__(mcs, name: str, bases: tuple[type, ...], namespace: dict[str, Any]) -> type:
Class methods or methods of a type deriving from type should have 'cls', rather than 'mcs', as their first parameter.
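For readers unfamiliar with the technique in the docstring, here is a minimal sketch of extracting the event type from `EventListener[E]` inside a metaclass, written with `cls` as the first parameter. The mechanism shown (reading `__orig_bases__` and `typing.get_args`) is an assumption about how such a metaclass can work, not a copy of the PR's code; the `LogSubmission` listener and the `DepositionSubmitted` stand-in class are hypothetical.

```python
from abc import ABCMeta
from dataclasses import dataclass
from typing import Any, Generic, TypeVar, get_args, get_origin

E = TypeVar("E")


class _EventListenerMeta(ABCMeta):
    """Applies @dataclass and records the E of EventListener[E] as __event_type__."""

    def __new__(
        cls, name: str, bases: tuple[type, ...], namespace: dict[str, Any], **kwargs: Any
    ) -> type:
        new_cls = super().__new__(cls, name, bases, namespace, **kwargs)
        # __orig_bases__ keeps subscripted bases such as EventListener[SomeEvent].
        for base in namespace.get("__orig_bases__", ()):
            origin = get_origin(base)
            if isinstance(origin, cls) and get_args(base):
                (event_type,) = get_args(base)
                if not isinstance(event_type, TypeVar):
                    new_cls.__event_type__ = event_type
        # Decorate only classes that derive from a class built by this metaclass.
        if any(isinstance(b, cls) for b in bases):
            new_cls = dataclass(new_cls)
        return new_cls


class EventListener(Generic[E], metaclass=_EventListenerMeta):
    """Root listener class; concrete listeners subscript it with an event type."""


@dataclass(frozen=True)
class DepositionSubmitted:  # stand-in event for the example
    srn: str


class LogSubmission(EventListener[DepositionSubmitted]):  # hypothetical listener
    log: list

    def handle(self, event: DepositionSubmitted) -> None:
        self.log.append(event.srn)
```

A dispatcher could then route events by looking up `LogSubmission.__event_type__` instead of requiring each listener to declare its event type twice.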
from abc import ABC, abstractmethod
from typing import NewType
from uuid import UUID
from datetime import UTC, datetime
Import of 'UTC' is not used.
Import of 'datetime' is not used.
try:
    os.kill(pid, signal.SIGKILL)
    time.sleep(0.5)
except ProcessLookupError:
'except' clause does nothing but pass and there is no explanatory comment.
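One possible way to address this, assuming the intent is "ignore a process that has already exited", is `contextlib.suppress` with a comment stating why the error is safe to drop. The function name `force_kill` and the `grace` parameter are hypothetical, and `signal.SIGKILL` restricts the sketch to POSIX systems.

```python
import contextlib
import os
import signal
import time


def force_kill(pid: int, grace: float = 0.5) -> None:
    """Send SIGKILL to pid, ignoring the case where it has already exited."""
    # ProcessLookupError means the process is already gone, which is the
    # outcome we wanted, so there is nothing left to clean up.
    with contextlib.suppress(ProcessLookupError):
        os.kill(pid, signal.SIGKILL)
        time.sleep(grace)  # give the OS a moment to tear the process down
```

An equivalent fix is to keep the `try`/`except` and put the same explanation in a comment next to `pass`.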
    await container.delete(force=True)
except Exception:
    logfire.warning("Failed to delete container", container_id=container.id)
    pass
Unnecessary 'pass' statement.
curation_required = False  # False for v1
if curation_required:
    logger.info(
        f"Curation required for {event.deposition_srn}, not auto-approving"
    )
    return
This statement is unreachable.
# UOW-scoped providers for listeners
for _listener_type in LISTENER_TYPES:
    locals()[_listener_type.__name__] = provide(_listener_type, scope=Scope.UOW)
Modification of the locals() dictionary will have no effect on the local variables.
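For context on this warning: inside a function, writes to the dict returned by `locals()` do not rebind local variables, which the first function below demonstrates. At class scope, CPython's `locals()` returns the namespace the class is built from, so the reviewed loop may still work if it sits in a class body, but it relies on behaviour that is easy to misread. A more explicit option, sketched here with hypothetical `ListenerA`/`ListenerB` types and a stand-in `make_provider` (it is not dishka's `provide`), is to build the attribute dict first and create the class with `type()`. Whether a DI library accepts a class built this way is not confirmed by the source.

```python
def assign_via_locals() -> int:
    """Show that writing to locals() in a function does not rebind a local."""
    x = 1
    locals()["x"] = 2  # updates a dict, not the real local variable
    return x


class ListenerA: ...


class ListenerB: ...


LISTENER_TYPES = [ListenerA, ListenerB]  # hypothetical listener registry


def make_provider(listener_type: type) -> tuple[str, type]:
    """Stand-in for a DI provider factory (assumption, not a real library call)."""
    return ("provider", listener_type)


# Build the class attributes explicitly, then create the class from them.
_namespace = {t.__name__: make_provider(t) for t in LISTENER_TYPES}
Providers = type("Providers", (), _namespace)
```

Creating the class from a prepared namespace means every attribute exists at class-creation time, which is the same moment a class body would have defined them.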
Move IndexConfig and IngestConfig from domain value objects to main config module to centralize configuration management. Rename event listeners to use descriptive action-based names that better express their purpose. Remove unused UoW abstraction and validation listener.
Move API routes from domain-specific packages to centralized v1 structure with /api/v1 prefix. Add global OSA error handling and comprehensive CLI improvements including search result caching, server log viewing, and record detail display. Convert vector backend operations to async using thread pools to prevent event loop blocking.
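The thread-pool conversion mentioned in the last sentence can be illustrated with `asyncio.to_thread`. This is a sketch under assumptions: `BlockingVectorStore` is a hypothetical stand-in for a synchronous vector client (it is not ChromaDB), and the PR may use a different executor mechanism.

```python
import asyncio
import time


class BlockingVectorStore:
    """Hypothetical stand-in for a synchronous vector client."""

    def __init__(self, docs: list[str]) -> None:
        self._docs = docs

    def query(self, text: str, k: int) -> list[str]:
        time.sleep(0.05)  # simulate blocking embedding and lookup work
        return [d for d in self._docs if text.lower() in d.lower()][:k]


class AsyncVectorBackend:
    """Async facade that keeps blocking calls off the event loop."""

    def __init__(self, store: BlockingVectorStore) -> None:
        self._store = store

    async def query(self, text: str, k: int = 3) -> list[str]:
        # Run the blocking call in the default thread pool.
        return await asyncio.to_thread(self._store.query, text, k)


async def demo() -> tuple[list[str], int]:
    """Query while a ticker task runs, to show the loop is not blocked."""
    backend = AsyncVectorBackend(
        BlockingVectorStore(["GEO series A", "Proteomics set", "geo series B"])
    )
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    task = asyncio.create_task(ticker())
    hits = await backend.query("geo")
    task.cancel()
    return hits, ticks
```

Running `asyncio.run(demo())` returns the matching documents together with a tick count above zero, indicating that other coroutines kept running during the blocking query.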
Introduce structured console output using rich library for better UX:

- Add Console class wrapping rich for consistent formatting
- Create Pydantic models for type-safe data handling
- Replace print statements with styled success/error/warning messages
- Add structured display for search results and record details
- Improve error messages with hints and better formatting
- Add relative time formatting for server status
…tics

Add /api/v1/stats endpoint that returns record counts and index health status. Add 'osa stats' CLI command to display this information in a user-friendly format.

feat: add init command to set up OSA configuration and directories

Add 'osa init' command that creates the XDG-compliant directory structure and generates a default config file with GEO ingestor setup.

refactor: migrate from ~/.osa to XDG Base Directory specification

Move from single ~/.osa directory to XDG-compliant structure:

- ~/.config/osa/ for configuration
- ~/.local/share/osa/ for data (database, vectors)
- ~/.local/state/osa/ for runtime state and logs
- ~/.cache/osa/ for temporary cache files

This improves organization and follows Linux filesystem standards.

fix: prevent duplicate initial ingestion runs

Check if initial ingestion already completed for each ingestor before triggering new runs to avoid redundant data processing on server restart.

fix: prevent duplicate log entries in daemon mode

Only add console handler when not using log file to avoid double logging when stderr is redirected to log file in daemon mode.
a91a236 to d0c5d06
Remove standalone config command module and integrate config management into init command to simplify CLI structure and reduce code duplication
Add REST API endpoint at /api/v1/events for listing domain events with cursor-based pagination, filtering, and ordering support. Add CLI command 'osa events' to view recent events from command line. Add server restart command for easier development workflow. Update event model to include created_at timestamp and extend event repository with list_events and count methods. Change default database path to XDG-compliant location.
…fig resolution Remove optional config parameter from start function and update _resolve_config to use non-optional Path parameter. Use paths.config_dir as default config location and remove hint about creating ./osa.yaml since config resolution is now simplified.
Replaces hardcoded GEO ingestor with entry point-based discovery system. Adds GEOEntrezIngestor using NCBI E-utilities API for incremental updates. Validates ingestor configs at startup for fail-fast error handling.
…ngestors Change indexes and ingestors from dict[str, Config] to list[Config] to simplify configuration and avoid naming conflicts. Add name field to IndexConfig and use ingestor type as name for IngestConfig. Update validation to check for duplicate ingestor types and improve error handling with ConfigError class for better user experience.
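The duplicate check described in this commit could be sketched as follows. `ConfigError` and `IngestConfig` are names taken from the commit message, but their fields and the `validate_ingestors` helper are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


class ConfigError(Exception):
    """Raised for invalid user configuration."""


@dataclass
class IngestConfig:
    type: str
    config: dict = field(default_factory=dict)

    @property
    def name(self) -> str:
        # The ingestor type doubles as its name, as the commit describes.
        return self.type


def validate_ingestors(ingestors: list[IngestConfig]) -> None:
    """Reject a list that configures the same ingestor type more than once."""
    counts = Counter(cfg.type for cfg in ingestors)
    duplicates = sorted(t for t, n in counts.items() if n > 1)
    if duplicates:
        raise ConfigError(f"Duplicate ingestor types: {', '.join(duplicates)}")
```

Moving from a dict to a list removes the separate key, so uniqueness has to be enforced explicitly by a check like this one.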
…tion limit configuration
Display hit score as percentage in green color at the beginning of metadata line to help users quickly identify result relevance