A local NotebookLM-like research application that helps you organize, analyze, and generate insights from your research materials. Built with Clean Architecture principles for maintainability and testability.
Discovery empowers researchers, students, and knowledge workers to build comprehensive research notebooks by collecting sources from various formats (PDFs, documents, web articles) and generating intelligent summaries and insights. Think of it as your personal research assistant that:
- Organizes your research materials into focused notebooks
- Ingests content from files (PDF, DOCX, TXT, MD) and web URLs
- Analyzes your sources using vector-based semantic search
- Generates summaries, blog posts, and research outputs using AI
- Maintains full data privacy with local-first storage
- Pluggable Infrastructure: Support for offline LLMs and embedding models for complete data sovereignty
- Output Modules: Generate specialized research artifacts like comparative analyses, executive briefings, and research reports
- Enhanced Collaboration: Export and share research notebooks while maintaining privacy controls
This FastAPI-based application follows Clean Architecture principles, ensuring clear separation between business logic, infrastructure concerns, and API layers. All your data stays local while leveraging the power of modern AI for content analysis and generation.
🎯 Intelligent Research Management
- Multi-Source Ingestion: Import content from PDFs, DOCX, TXT, Markdown files, and web URLs
- Vector-Powered Search: Semantic similarity search across all your research materials using Weaviate
- AI-Driven Insights: Generate summaries, blog posts, and research outputs using Google Gemini
- Question Answering: Ask natural language questions and get AI-powered answers from your sources using RAG (Retrieval-Augmented Generation)
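The RAG flow behind question answering can be sketched in a few lines. This is a toy illustration: naive keyword-overlap scoring stands in for Weaviate's vector search, and the prompt template is an assumption, not Discovery's actual one:

```python
# Toy RAG sketch: retrieve the best-matching chunks, then assemble an
# augmented prompt for the LLM. The scoring and template below are
# illustrative stand-ins, not Discovery's real implementation (which
# uses Weaviate embeddings and Gemini).

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the question (stand-in for vector search)."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the prompt that grounds the LLM in the retrieved sources."""
    sources = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {question}"

chunks = [
    "Deep learning powers image recognition and machine translation.",
    "PostgreSQL stores the notebook metadata locally.",
]
print(build_prompt("What does deep learning power?", retrieve("What does deep learning power?", chunks, k=1)))
```

The same retrieve-then-generate shape underlies the `qa ask` CLI command and keeps answers grounded in your own sources rather than the model's general knowledge.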
🛠️ Developer-First Design
- Clean Architecture: Framework-independent core business logic, fully testable
- RESTful API: Comprehensive FastAPI backend with interactive documentation
- CLI Tool: Full-featured command-line interface for all operations
- Local-First: All data stored locally with PostgreSQL and Weaviate
🔒 Privacy & Control
- Data Sovereignty: Everything runs locally on your infrastructure
- No Cloud Lock-in: Works entirely offline (except for optional Gemini API calls)
- Configurable AI: Support for custom LLM and embedding model configurations
- Notebooks: A collection of related sources for a specific project or topic.
- Sources: Research materials imported into a notebook, such as files (PDF, DOCX, TXT, MD) and URLs.
- Outputs: Generated content, such as summaries or blog posts, created from the sources in a notebook.
- Vector Search: Semantic similarity search powered by Weaviate vector database for finding relevant content chunks within notebooks.
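To make "semantic similarity" concrete, here is a toy sketch of the comparison step: each chunk is represented by an embedding vector, and a query vector is ranked against them by cosine similarity. The three-dimensional vectors are made up for illustration; real embeddings come from the text-to-vector transformer running alongside Weaviate:

```python
import math

# Toy similarity search: rank chunks by cosine similarity between their
# embedding vectors and a query vector. These tiny hand-written vectors
# are illustrative only; real embeddings are high-dimensional.

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

chunks = {
    "Monsters, Inc. is a Pixar film.": [0.9, 0.1, 0.0],
    "PostgreSQL stores relational data.": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding for "Pixar animated movies"
best = max(chunks, key=lambda text: cosine(query_vec, chunks[text]))
print(best)  # the Pixar chunk ranks highest
```

Because similarity is computed on meaning-bearing vectors rather than exact words, a query like "Pixar animated movies" can match a chunk that never contains those words.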
- Python 3.12+ - Modern Python runtime
- Docker - For running PostgreSQL and Weaviate services
- uv - Fast Python package manager (recommended)
1. Clone and navigate to the repository:

   ```bash
   git clone <repository-url>
   cd discovery
   ```

2. Install uv (if not already installed):

   ```bash
   # Unix/macOS/Linux
   curl -LsSf https://astral.sh/uv/install.sh | sh

   # Or using pip
   pip install uv
   ```

3. Set up the Python environment:

   ```bash
   # Creates virtual environment and installs all dependencies
   uv sync

   # Activate the environment
   source .venv/bin/activate   # Unix/macOS
   # or
   .venv\Scripts\activate      # Windows
   ```
Create a .env file in the project root with these required variables:

```bash
# Database Configuration
DATABASE_URL="postgresql://postgres:Foobar321@localhost:5432/postgres"

# AI Services
GEMINI_API_KEY="your_gemini_api_key_here"  # For Google Gemini LLM
GEMINI_MODEL="gemini-3-pro-preview"        # Gemini model to use (optional, defaults to gemini-2.0-flash-001)

# Google Search Services
GOOGLE_CUSTOM_SEARCH_API_KEY="your_google_search_api_key"  # For web search features
GOOGLE_CUSTOM_SEARCH_ENGINE_ID="your_search_engine_id"     # Custom search engine ID

# Vector Database (optional - defaults to localhost)
WEAVIATE_URL="http://localhost:8080"        # Local Weaviate instance
WEAVIATE_API_KEY="your_weaviate_cloud_key"  # Only for cloud instances
```

Environment Variable Details:
| Variable | Purpose | Required | Default |
|---|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | Yes | None |
| `GEMINI_API_KEY` | Google Gemini API access for AI features | Yes | None |
| `GEMINI_MODEL` | Gemini model name (e.g., gemini-3-pro-preview) | No | gemini-2.0-flash-001 |
| `GOOGLE_CUSTOM_SEARCH_API_KEY` | Google Custom Search API for web search features | Yes | None |
| `GOOGLE_CUSTOM_SEARCH_ENGINE_ID` | Custom search engine identifier for Google search | Yes | None |
| `WEAVIATE_URL` | Weaviate vector database URL | No | http://localhost:8080 |
| `WEAVIATE_API_KEY` | Weaviate cloud authentication | No | None |
Start the PostgreSQL database using Docker:

```bash
# Start PostgreSQL container
docker-compose -f pgDockerCompose/docker-compose.yaml up -d

# Apply database migrations
alembic upgrade head
```

For semantic search capabilities, start Weaviate:

```bash
# Start Weaviate vector database
docker-compose -f weaviateDockerCompose/docker-compose.yaml up -d
```

This provides:

- Weaviate vector database on port 8080
- Text-to-vector transformer for generating embeddings
```bash
# Start the FastAPI server
./scripts/dev.sh

# Or manually:
uv run uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
```

- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
- Create your first notebook: Use the interactive docs or API endpoints
Verify everything works correctly:

```bash
# Run all tests (should pass ~42 tests)
./scripts/test.sh

# Or using uv directly
uv run pytest tests/ -v
```

A demonstration script is provided to showcase the vector search capabilities:

```bash
python scripts/ingest_wikipedia_notebook.py
```

This script will:

- Create a new notebook with a random name
- Import two Wikipedia articles (Monsters, Inc. and Monsters University)
- Ingest the content into the vector database
- Perform sample similarity search queries
- Display the results

Make sure both the API server and Weaviate are running before executing the demo.
Discovery follows Clean Architecture principles as advocated by Robert C. Martin and Steve Smith (Ardalis), ensuring maintainable, testable, and framework-independent code.
- Dependency Inversion: Dependencies point inward toward the Core business logic
- Framework Independence: Core business logic has zero dependencies on external frameworks
- Interface-Driven Design: Inner layers define interfaces; outer layers implement them
- Separation of Concerns: Clear boundaries between business logic, infrastructure, and presentation
```
┌─────────────────────┐
│      API Layer      │  ← FastAPI, Routes, DTOs
│     (src/api/)      │
└──────────┬──────────┘
           │ depends on
┌──────────▼──────────┐
│     Core Layer      │  ← Entities, Services, Interfaces
│     (src/core/)     │    (Framework Independent)
└──────────┬──────────┘
           │ implements
┌──────────▼──────────┐
│   Infrastructure    │  ← Repositories, Providers, Database
│(src/infrastructure/)│
└─────────────────────┘
```
| Rule | Implementation |
|---|---|
| Core Independence | src/core/ has minimal dependencies - only domain logic |
| Interface Definition | Core defines INotebookRepository, Infrastructure implements SqlNotebookRepository |
| Dependency Direction | API → Core ← Infrastructure (never Core → Infrastructure) |
| Command/Query Pattern | Services use structured command/query objects as inputs |
| Result Pattern | All services return Result<T> objects for consistent error handling |
| Unit Testing | Core services are easily testable without external dependencies |
The project is organized into three main layers:
Core Layer (src/core/):
- `entities/`: Domain entities (Notebook, Source, etc.)
- `services/`: Business logic services
- `interfaces/`: Abstract interfaces for repositories and providers
- `commands/` & `queries/`: Structured input objects
- `results/`: Standardized result types
Infrastructure Layer (src/infrastructure/):
- `repositories/`: Database access implementations
- `providers/`: External service implementations (LLM, Vector DB)
- `database/`: Database models and migrations
API Layer (src/api/):
- `main.py`: FastAPI application setup
- `*_router.py`: Route definitions
- `dtos.py`: Data transfer objects for API serialization
Notebook Management:
- `POST /api/notebooks` - Create new research notebook
- `GET /api/notebooks` - List all notebooks with metadata
- `GET /api/notebooks/{id}` - Get specific notebook details
- `PUT /api/notebooks/{id}` - Update notebook properties
- `DELETE /api/notebooks/{id}` - Delete notebook and all sources
Source Management:
- `POST /api/notebooks/{id}/sources/file` - Upload file source (PDF, DOCX, TXT, MD)
- `POST /api/notebooks/{id}/sources/url` - Add web URL as source
- `GET /api/notebooks/{id}/sources` - List all sources in notebook
- `DELETE /api/sources/{id}` - Remove source from notebook
Content Generation:
- `POST /api/notebooks/{id}/generate-summary` - Generate AI summary from selected sources
- `POST /api/notebooks/{id}/generate-output` - Create structured outputs (blog posts, briefs)
Enable semantic search across your research materials:
- `POST /api/notebooks/{id}/ingest` - Ingest content into vector database
- `GET /api/notebooks/{id}/similar` - Semantic similarity search
- `GET /api/notebooks/{id}/vectors/count` - Get vector count for notebook
- `DELETE /api/notebooks/{id}/vectors` - Clear all vectors for notebook
Access the full API documentation at: http://localhost:8000/docs
Try the vector search capabilities:

```bash
# Run the Wikipedia demo (creates notebook with sample content)
python src/apps/ingest_notebook_into_vectordb.py

# This demonstrates:
# 1. Creating a notebook
# 2. Adding Wikipedia articles as sources
# 3. Ingesting content for semantic search
# 4. Performing similarity queries
```

Discovery includes a powerful CLI built with Typer for managing your research notebooks from the terminal. Perfect for automation, scripting, and quick workflows.
The CLI is automatically installed when you set up the project:

```bash
# After running 'uv sync', the CLI is available as:
python -m src.cli

# Or install globally with pipx for convenience:
pipx install .

# Then use directly:
discovery --help
```

1. Start the API server first:

   ```bash
   ./scripts/dev.sh
   # Server runs at http://localhost:8000
   ```

2. Configure the CLI to connect to your API:

   ```bash
   # Initialize a configuration profile
   python -m src.cli config init --url http://localhost:8000

   # Test the connection
   python -m src.cli config test

   # View current configuration
   python -m src.cli config show
   ```

Configuration is stored in `~/.discovery/config.toml`. You can manage multiple profiles for different environments.
Notebook Management:

```bash
# List all notebooks
python -m src.cli notebooks list

# Create a new notebook
python -m src.cli notebooks create --name "AI Research" --tags "machine-learning,llm"

# Show notebook details
python -m src.cli notebooks show <notebook-id>

# Update notebook
python -m src.cli notebooks update <notebook-id> --name "Updated Name"

# Delete notebook
python -m src.cli notebooks delete <notebook-id>
```

Source Management:
```bash
# Add a web URL source
python -m src.cli sources add url \
  --notebook <notebook-id> \
  --url "https://en.wikipedia.org/wiki/Artificial_intelligence"

# Add a file source
python -m src.cli sources add file \
  --notebook <notebook-id> \
  --path /path/to/research-paper.pdf

# Add text content directly
python -m src.cli sources add text \
  --notebook <notebook-id> \
  --content "Your research notes here"

# List sources in a notebook
python -m src.cli sources list --notebook <notebook-id>

# Remove a source
python -m src.cli sources remove <source-id>
```

Vector Database Operations:
```bash
# Ingest notebook sources into vector database for semantic search
python -m src.cli vectors ingest --notebook <notebook-id>

# Perform similarity search
python -m src.cli vectors search \
  --notebook <notebook-id> \
  --query "machine learning applications" \
  --limit 5

# Check vector count
python -m src.cli vectors count --notebook <notebook-id>

# Clear vectors for a notebook
python -m src.cli vectors delete --notebook <notebook-id>
```

Question Answering (RAG):
```bash
# Ask a question and get AI-powered answers from your sources
python -m src.cli qa ask \
  --notebook <notebook-id> \
  --question "What are the main applications of deep learning?"

# Output in JSON format for scripting
python -m src.cli qa ask \
  --notebook <notebook-id> \
  --question "Summarize the key findings" \
  --format json
```

Output Generation:
```bash
# Generate a blog post from sources
python -m src.cli outputs create \
  --notebook <notebook-id> \
  --type "blog_post" \
  --prompt "Write a blog post about AI trends"

# List generated outputs
python -m src.cli outputs list --notebook <notebook-id>

# View output content
python -m src.cli outputs show <output-id>
```

All list and show commands support multiple output formats:
```bash
# Human-readable table (default)
python -m src.cli notebooks list --format table

# JSON for scripting
python -m src.cli notebooks list --format json

# YAML for configuration
python -m src.cli notebooks list --format yaml

# Plain text
python -m src.cli notebooks list --format text
```

Complete research workflow:
```bash
#!/bin/bash

# Create notebook and add sources
NOTEBOOK_ID=$(python -m src.cli notebooks create \
  --name "AI Ethics Research" \
  --format json | jq -r '.id')

# Add multiple sources
for url in \
  "https://en.wikipedia.org/wiki/Ethics_of_artificial_intelligence" \
  "https://en.wikipedia.org/wiki/AI_safety"; do
  python -m src.cli sources add url --notebook $NOTEBOOK_ID --url "$url"
done

# Ingest into vector database
python -m src.cli vectors ingest --notebook $NOTEBOOK_ID

# Ask questions
python -m src.cli qa ask \
  --notebook $NOTEBOOK_ID \
  --question "What are the main ethical concerns with AI?" \
  --format json
```

Extract notebook information:
```bash
# Get all notebook IDs
python -m src.cli notebooks list --format json | jq -r '.notebooks[].id'

# Count sources per notebook
python -m src.cli notebooks list --format json | \
  jq '.notebooks[] | "\(.name): \(.source_count) sources"'
```

The CLI provides convenient short aliases:
- `notebooks` → `nb`
- `sources` → `src`
- `vectors` → `vec`
- `outputs` → `out`

```bash
python -m src.cli nb list        # Same as 'notebooks list'
python -m src.cli src add url    # Same as 'sources add url'
python -m src.cli vec ingest     # Same as 'vectors ingest'
```

Profile Management:
```bash
# Create profiles for different environments
python -m src.cli config init --profile production --url https://api.production.com
python -m src.cli config init --profile staging --url https://api.staging.com

# Use specific profile
python -m src.cli notebooks list --profile production

# Switch default profile
python -m src.cli config use --profile staging
```

Environment Variables:
```bash
# Export configuration as environment variables
python -m src.cli config env

# Use in scripts:
eval $(python -m src.cli config env)
echo $DISCOVERY_API_URL
```

Notebook State Tracking:
```bash
# CLI remembers your most recent notebook
python -m src.cli notebooks recent --set <notebook-id>

# Then commands work without specifying notebook ID
python -m src.cli sources list     # Uses recent notebook
python -m src.cli vectors ingest   # Uses recent notebook
```

For complete CLI documentation, see `src/cli/README.md`.
The project includes comprehensive test coverage:

```bash
# Run all tests (~42 tests should pass)
./scripts/test.sh

# Or using uv directly
uv run pytest tests/ -v

# Run specific test suites
uv run pytest tests/unit/ -v         # Unit tests (38 tests)
uv run pytest tests/integration/ -v  # Integration tests (4 tests)
```

```bash
# Start development environment
./scripts/dev.sh

# Or activate environment manually
source .venv/bin/activate
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
```

- Follow Clean Architecture principles
- Write unit tests for all core business logic
- Use command/query objects for service inputs
- Return Result objects from services
- Keep core layer framework-independent
- Write unit tests for all core business logic
- Use command/query objects for service inputs
- Return Result objects from services
- Keep core layer framework-independent
- User Stories: `specs/core_stories.md` - Detailed feature requirements
- Domain Model: `specs/domain_model.md` - Entity relationships and design
- Clean Architecture: `specs/clean_architecture.md` - Architecture guidelines
- Quick Start: `QUICK_START.md` - Minimal setup guide
This project is open source. See the repository for license details.

