Discovery

A local NotebookLM-like research application that helps you organize, analyze, and generate insights from your research materials. Built with Clean Architecture principles for maintainability and testability.

Purpose & Vision

Discovery empowers researchers, students, and knowledge workers to build comprehensive research notebooks by collecting sources from various formats (PDFs, documents, web articles) and generating intelligent summaries and insights. Think of it as your personal research assistant that:

  • Organizes your research materials into focused notebooks
  • Ingests content from files (PDF, DOCX, TXT, MD) and web URLs
  • Analyzes your sources using vector-based semantic search
  • Generates summaries, blog posts, and research outputs using AI
  • Maintains full data privacy with local-first storage

Future Roadmap

  • Pluggable Infrastructure: Support for offline LLMs and embedding models for complete data sovereignty
  • Output Modules: Generate specialized research artifacts like comparative analyses, executive briefings, and research reports
  • Enhanced Collaboration: Export and share research notebooks while maintaining privacy controls

Overview

This FastAPI-based application follows Clean Architecture principles, ensuring clear separation between business logic, infrastructure concerns, and API layers. All your data stays local while leveraging the power of modern AI for content analysis and generation.

Notebook and sources

(Screenshot: the Discovery notebooks view)

Ask questions of your sources

(Screenshot: the Discovery QA view)

Major Features

🎯 Intelligent Research Management

  • Multi-Source Ingestion: Import content from PDFs, DOCX, TXT, Markdown files, and web URLs
  • Vector-Powered Search: Semantic similarity search across all your research materials using Weaviate
  • AI-Driven Insights: Generate summaries, blog posts, and research outputs using Google Gemini
  • Question Answering: Ask natural language questions and get AI-powered answers from your sources using RAG (Retrieval-Augmented Generation)

🛠️ Developer-First Design

  • Clean Architecture: Framework-independent core business logic, fully testable
  • RESTful API: Comprehensive FastAPI backend with interactive documentation
  • CLI Tool: Full-featured command-line interface for all operations
  • Local-First: All data stored locally with PostgreSQL and Weaviate

🔒 Privacy & Control

  • Data Sovereignty: Everything runs locally on your infrastructure
  • No Cloud Lock-in: Everything except the Gemini API calls used by the AI features stays on your machine
  • Configurable AI: Support for custom LLM and embedding model configurations

Core Concepts

  • Notebooks: A collection of related sources for a specific project or topic.
  • Sources: Research materials imported into a notebook, such as files (PDF, DOCX, TXT, MD) and URLs.
  • Outputs: Generated content, such as summaries or blog posts, created from the sources in a notebook.
  • Vector Search: Semantic similarity search powered by Weaviate vector database for finding relevant content chunks within notebooks.
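
A minimal sketch of how these concepts might relate as Python domain entities. The field names here are illustrative assumptions, not the project's actual src/core/entities schema:

from dataclasses import dataclass, field
from datetime import datetime

# Illustrative only: field names are assumptions, not the actual
# src/core/entities schema.

@dataclass
class Source:
    id: str
    notebook_id: str
    kind: str             # "pdf", "docx", "txt", "md", or "url"
    location: str         # file path or web URL
    content: str = ""     # extracted text, chunked later for vector search

@dataclass
class Output:
    id: str
    notebook_id: str
    output_type: str      # e.g., "summary" or "blog_post"
    content: str = ""

@dataclass
class Notebook:
    id: str
    name: str
    sources: list[Source] = field(default_factory=list)
    outputs: list[Output] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)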

Getting Started for Developers

Prerequisites

  • Python 3.12+ - Modern Python runtime
  • Docker - For running PostgreSQL and Weaviate services
  • uv - Fast Python package manager (recommended)

Quick Setup

  1. Clone and navigate to the repository:

    git clone <repository-url>
    cd discovery
  2. Install uv (if not already installed):

    # Unix/macOS/Linux
    curl -LsSf https://astral.sh/uv/install.sh | sh
    
    # Or using pip
    pip install uv
  3. Set up the Python environment:

    # Creates virtual environment and installs all dependencies
    uv sync
    
    # Activate the environment  
    source .venv/bin/activate  # Unix/macOS
    # or
    .venv\Scripts\activate     # Windows

Environment Variables

Create a .env file in the project root with these required variables:

# Database Configuration
DATABASE_URL="postgresql://postgres:Foobar321@localhost:5432/postgres"

# AI Services
GEMINI_API_KEY="your_gemini_api_key_here"        # For Google Gemini LLM
GEMINI_MODEL="gemini-3-pro-preview"              # Gemini model to use (optional, defaults to gemini-2.0-flash-001)

# Google Search Services
GOOGLE_CUSTOM_SEARCH_API_KEY="your_google_search_api_key"     # For web search features
GOOGLE_CUSTOM_SEARCH_ENGINE_ID="your_search_engine_id"        # Custom search engine ID

# Vector Database (optional - defaults to localhost)
WEAVIATE_URL="http://localhost:8080"             # Local Weaviate instance
WEAVIATE_API_KEY="your_weaviate_cloud_key"       # Only for cloud instances

Environment Variable Details:

Variable                         Purpose                                             Required   Default
DATABASE_URL                     PostgreSQL connection string                        Yes        None
GEMINI_API_KEY                   Google Gemini API access for AI features           Yes        None
GEMINI_MODEL                     Gemini model name (e.g., gemini-3-pro-preview)     No         gemini-2.0-flash-001
GOOGLE_CUSTOM_SEARCH_API_KEY     Google Custom Search API for web search features   Yes        None
GOOGLE_CUSTOM_SEARCH_ENGINE_ID   Custom search engine identifier for Google search  Yes        None
WEAVIATE_URL                     Weaviate vector database URL                        No         http://localhost:8080
WEAVIATE_API_KEY                 Weaviate cloud authentication                       No         None
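
A sketch of how these variables might be read at application startup, using the defaults from the table above. The loader itself is illustrative; the project may wire configuration differently:

import os

# Illustrative settings loader with the documented defaults; assumes the
# .env file has already been loaded (e.g., via python-dotenv).
DATABASE_URL = os.environ["DATABASE_URL"]                          # required
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]                      # required
GOOGLE_CUSTOM_SEARCH_API_KEY = os.environ["GOOGLE_CUSTOM_SEARCH_API_KEY"]
GOOGLE_CUSTOM_SEARCH_ENGINE_ID = os.environ["GOOGLE_CUSTOM_SEARCH_ENGINE_ID"]
GEMINI_MODEL = os.getenv("GEMINI_MODEL", "gemini-2.0-flash-001")   # optional
WEAVIATE_URL = os.getenv("WEAVIATE_URL", "http://localhost:8080")  # optional
WEAVIATE_API_KEY = os.getenv("WEAVIATE_API_KEY")                   # cloud only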

Starting Your Discovery Instance

1. Database Setup

Start the PostgreSQL database using Docker:

# Start PostgreSQL container
docker-compose -f pgDockerCompose/docker-compose.yaml up -d

# Apply database migrations
alembic upgrade head

2. Vector Database Setup (Optional but Recommended)

For semantic search capabilities, start Weaviate:

# Start Weaviate vector database
docker-compose -f weaviateDockerCompose/docker-compose.yaml up -d

This provides:

  • Weaviate vector database on port 8080
  • Text-to-vector transformer for generating embeddings

3. Launch the Application

# Start the FastAPI server
./scripts/dev.sh

# Or manually:
uv run uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000

4. Verify Installation

Open http://localhost:8000/docs in your browser; if the interactive API documentation loads, the server is running correctly.

5. Run the Test Suite

Verify everything works correctly:

# Run all tests (should pass ~42 tests)
./scripts/test.sh

# Or using uv directly
uv run pytest tests/ -v

Vector Search Demo

A demonstration script is provided to showcase the vector search capabilities:

python scripts/ingest_wikipedia_notebook.py

This script will:

  1. Create a new notebook with a random name
  2. Import two Wikipedia articles (Monsters, Inc. and Monsters University)
  3. Ingest the content into the vector database
  4. Perform sample similarity search queries
  5. Display the results

Make sure both the API server and Weaviate are running before executing the demo.

Architecture & Clean Code Principles

Discovery follows Clean Architecture principles as advocated by Robert C. Martin and Steve Smith (Ardalis), ensuring maintainable, testable, and framework-independent code.

Core Principles

  • Dependency Inversion: Dependencies point inward toward the Core business logic
  • Framework Independence: Core business logic has zero dependencies on external frameworks
  • Interface-Driven Design: Inner layers define interfaces; outer layers implement them
  • Separation of Concerns: Clear boundaries between business logic, infrastructure, and presentation

Architecture Layers

┌──────────────────────┐
│      API Layer       │  ← FastAPI, Routes, DTOs
│      (src/api/)      │
└──────────┬───────────┘
           │ depends on
┌──────────▼───────────┐
│      Core Layer      │  ← Entities, Services, Interfaces
│      (src/core/)     │     (Framework Independent)
└──────────▲───────────┘
           │ implements Core interfaces
┌──────────┴───────────┐
│    Infrastructure    │  ← Repositories, Providers, Database
│(src/infrastructure/) │
└──────────────────────┘

Clean Architecture Rules Applied

Rule                    Implementation
Core Independence       src/core/ has minimal dependencies - only domain logic
Interface Definition    Core defines INotebookRepository; Infrastructure implements SqlNotebookRepository
Dependency Direction    API → Core ← Infrastructure (never Core → Infrastructure)
Command/Query Pattern   Services use structured command/query objects as inputs
Result Pattern          All services return Result<T> objects for consistent error handling
Unit Testing            Core services are easily testable without external dependencies
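
A condensed sketch of how these rules fit together. INotebookRepository and Result come straight from the table above, while the method names and fields are assumptions for illustration:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

@dataclass
class Result(Generic[T]):
    # Result pattern: every service reports success or failure explicitly.
    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None

@dataclass
class CreateNotebookCommand:
    # Command pattern: a structured input object instead of loose arguments.
    name: str

class INotebookRepository(ABC):
    # Defined in Core; implemented in Infrastructure (e.g., SqlNotebookRepository).
    @abstractmethod
    def add(self, name: str) -> str: ...

class NotebookService:
    # Core service: depends only on the interface, so it stays framework-independent.
    def __init__(self, repo: INotebookRepository):
        self.repo = repo

    def create(self, cmd: CreateNotebookCommand) -> Result[str]:
        if not cmd.name.strip():
            return Result(error="Notebook name is required")
        return Result(value=self.repo.add(cmd.name))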

Project Structure

The project is organized into three main layers:

Core Layer (src/core/):

  • entities/: Domain entities (Notebook, Source, etc.)
  • services/: Business logic services
  • interfaces/: Abstract interfaces for repositories and providers
  • commands/ & queries/: Structured input objects
  • results/: Standardized result types

Infrastructure Layer (src/infrastructure/):

  • repositories/: Database access implementations
  • providers/: External service implementations (LLM, Vector DB)
  • database/: Database models and migrations

API Layer (src/api/):

  • main.py: FastAPI application setup
  • *_router.py: Route definitions
  • dtos.py: Data transfer objects for API serialization
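
A minimal sketch of how a router and DTO might fit together in this layout; the route and DTO names are illustrative, not the project's actual definitions:

from fastapi import APIRouter
from pydantic import BaseModel

class CreateNotebookDto(BaseModel):
    # DTO: serialization concerns stay in the API layer, never in Core.
    name: str

router = APIRouter(prefix="/api/notebooks")

@router.post("")
def create_notebook(dto: CreateNotebookDto):
    # Translate the DTO into a Core command and invoke the service
    # (dependency wiring omitted in this sketch).
    return {"name": dto.name}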

API Features & Endpoints

Core Functionality

Notebook Management:

  • POST /api/notebooks - Create new research notebook
  • GET /api/notebooks - List all notebooks with metadata
  • GET /api/notebooks/{id} - Get specific notebook details
  • PUT /api/notebooks/{id} - Update notebook properties
  • DELETE /api/notebooks/{id} - Delete notebook and all sources

Source Management:

  • POST /api/notebooks/{id}/sources/file - Upload file source (PDF, DOCX, TXT, MD)
  • POST /api/notebooks/{id}/sources/url - Add web URL as source
  • GET /api/notebooks/{id}/sources - List all sources in notebook
  • DELETE /api/sources/{id} - Remove source from notebook

Content Generation:

  • POST /api/notebooks/{id}/generate-summary - Generate AI summary from selected sources
  • POST /api/notebooks/{id}/generate-output - Create structured outputs (blog posts, briefs)

Vector Search API

Enable semantic search across your research materials:

  • POST /api/notebooks/{id}/ingest - Ingest content into vector database
  • GET /api/notebooks/{id}/similar - Semantic similarity search
  • GET /api/notebooks/{id}/vectors/count - Get vector count for notebook
  • DELETE /api/notebooks/{id}/vectors - Clear all vectors for notebook
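
For example, a minimal Python client for the similarity endpoint might look like the sketch below. The query-parameter names mirror the CLI flags and are assumptions; check /docs for the authoritative schema:

import requests

# Hypothetical client sketch; "query" and "limit" are assumed parameter names.
BASE = "http://localhost:8000"
notebook_id = "YOUR_NOTEBOOK_ID"

resp = requests.get(
    f"{BASE}/api/notebooks/{notebook_id}/similar",
    params={"query": "machine learning applications", "limit": 5},
)
resp.raise_for_status()
print(resp.json())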

Interactive Documentation

Access the full API documentation at: http://localhost:8000/docs

Demo & Testing

Try the vector search capabilities:

# Run the Wikipedia demo (creates notebook with sample content)
python src/apps/ingest_notebook_into_vectordb.py

# This demonstrates:
# 1. Creating a notebook
# 2. Adding Wikipedia articles as sources  
# 3. Ingesting content for semantic search
# 4. Performing similarity queries

Command-Line Interface (CLI)

Discovery includes a powerful CLI built with Typer for managing your research notebooks from the terminal. Perfect for automation, scripting, and quick workflows.

CLI Installation & Setup

The CLI is automatically installed when you set up the project:

# After running 'uv sync', the CLI is available as:
python -m src.cli

# Or install globally with pipx for convenience:
pipx install .

# Then use directly:
discovery --help

Initial Configuration

1. Start the API server first:

./scripts/dev.sh
# Server runs at http://localhost:8000

2. Configure the CLI to connect to your API:

# Initialize a configuration profile
python -m src.cli config init --url http://localhost:8000

# Test the connection
python -m src.cli config test

# View current configuration
python -m src.cli config show

Configuration is stored in ~/.discovery/config.toml. You can manage multiple profiles for different environments.

Essential CLI Commands

Notebook Management:

# List all notebooks
python -m src.cli notebooks list

# Create a new notebook
python -m src.cli notebooks create --name "AI Research" --tags "machine-learning,llm"

# Show notebook details
python -m src.cli notebooks show <notebook-id>

# Update notebook
python -m src.cli notebooks update <notebook-id> --name "Updated Name"

# Delete notebook
python -m src.cli notebooks delete <notebook-id>

Source Management:

# Add a web URL source
python -m src.cli sources add url \
  --notebook <notebook-id> \
  --url "https://en.wikipedia.org/wiki/Artificial_intelligence"

# Add a file source
python -m src.cli sources add file \
  --notebook <notebook-id> \
  --path /path/to/research-paper.pdf

# Add text content directly
python -m src.cli sources add text \
  --notebook <notebook-id> \
  --content "Your research notes here"

# List sources in a notebook
python -m src.cli sources list --notebook <notebook-id>

# Remove a source
python -m src.cli sources remove <source-id>

Vector Database Operations:

# Ingest notebook sources into vector database for semantic search
python -m src.cli vectors ingest --notebook <notebook-id>

# Perform similarity search
python -m src.cli vectors search \
  --notebook <notebook-id> \
  --query "machine learning applications" \
  --limit 5

# Check vector count
python -m src.cli vectors count --notebook <notebook-id>

# Clear vectors for a notebook
python -m src.cli vectors delete --notebook <notebook-id>

Question Answering (RAG):

# Ask a question and get AI-powered answers from your sources
python -m src.cli qa ask \
  --notebook <notebook-id> \
  --question "What are the main applications of deep learning?"

# Output in JSON format for scripting
python -m src.cli qa ask \
  --notebook <notebook-id> \
  --question "Summarize the key findings" \
  --format json

Output Generation:

# Generate a blog post from sources
python -m src.cli outputs create \
  --notebook <notebook-id> \
  --type "blog_post" \
  --prompt "Write a blog post about AI trends"

# List generated outputs
python -m src.cli outputs list --notebook <notebook-id>

# View output content
python -m src.cli outputs show <output-id>

CLI Output Formats

All list and show commands support multiple output formats:

# Human-readable table (default)
python -m src.cli notebooks list --format table

# JSON for scripting
python -m src.cli notebooks list --format json

# YAML for configuration
python -m src.cli notebooks list --format yaml

# Plain text
python -m src.cli notebooks list --format text

Scripting Examples

Complete research workflow:

#!/bin/bash
# Create notebook and add sources

NOTEBOOK_ID=$(python -m src.cli notebooks create \
  --name "AI Ethics Research" \
  --format json | jq -r '.id')

# Add multiple sources
for url in \
  "https://en.wikipedia.org/wiki/Ethics_of_artificial_intelligence" \
  "https://en.wikipedia.org/wiki/AI_safety"; do
  python -m src.cli sources add url --notebook "$NOTEBOOK_ID" --url "$url"
done

# Ingest into vector database
python -m src.cli vectors ingest --notebook "$NOTEBOOK_ID"

# Ask questions
python -m src.cli qa ask \
  --notebook "$NOTEBOOK_ID" \
  --question "What are the main ethical concerns with AI?" \
  --format json

Extract notebook information:

# Get all notebook IDs
python -m src.cli notebooks list --format json | jq -r '.notebooks[].id'

# Count sources per notebook
python -m src.cli notebooks list --format json | \
  jq '.notebooks[] | "\(.name): \(.source_count) sources"'

CLI Aliases

The CLI provides convenient short aliases:

  • notebooks → nb
  • sources → src
  • vectors → vec
  • outputs → out

python -m src.cli nb list        # Same as 'notebooks list'
python -m src.cli src add url    # Same as 'sources add url'
python -m src.cli vec ingest     # Same as 'vectors ingest'

Advanced CLI Features

Profile Management:

# Create profiles for different environments
python -m src.cli config init --profile production --url https://api.production.com
python -m src.cli config init --profile staging --url https://api.staging.com

# Use specific profile
python -m src.cli notebooks list --profile production

# Switch default profile
python -m src.cli config use --profile staging

Environment Variables:

# Export configuration as environment variables
python -m src.cli config env

# Use in scripts:
eval $(python -m src.cli config env)
echo $DISCOVERY_API_URL

Notebook State Tracking:

# CLI remembers your most recent notebook
python -m src.cli notebooks recent --set <notebook-id>

# Then commands work without specifying notebook ID
python -m src.cli sources list  # Uses recent notebook
python -m src.cli vectors ingest  # Uses recent notebook

For complete CLI documentation, see src/cli/README.md.

Development & Testing

Running Tests

The project includes comprehensive test coverage:

# Run all tests (~42 tests should pass)
./scripts/test.sh

# Or using uv directly  
uv run pytest tests/ -v

# Run specific test suites
uv run pytest tests/unit/ -v      # Unit tests (38 tests)
uv run pytest tests/integration/ -v  # Integration tests (4 tests)

Development Workflow

# Start development environment
./scripts/dev.sh

# Or activate environment manually
source .venv/bin/activate
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000

Contributing

  1. Follow Clean Architecture principles
  2. Write unit tests for all core business logic
  3. Use command/query objects for service inputs
  4. Return Result objects from services
  5. Keep core layer framework-independent
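
Because the Core layer depends only on interfaces, unit tests can substitute a fake repository. A sketch reusing the illustrative NotebookService from the Architecture section above:

# Illustrative tests building on the NotebookService sketch above; no database
# or web framework is needed to exercise Core logic.

class FakeNotebookRepository(INotebookRepository):
    def add(self, name: str) -> str:
        return "fake-id"

def test_create_notebook_returns_ok_result():
    service = NotebookService(FakeNotebookRepository())
    result = service.create(CreateNotebookCommand(name="AI Research"))
    assert result.ok and result.value == "fake-id"

def test_create_notebook_rejects_blank_name():
    result = NotebookService(FakeNotebookRepository()).create(
        CreateNotebookCommand(name="   ")
    )
    assert not result.ok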

Additional Resources

  • User Stories: specs/core_stories.md - Detailed feature requirements
  • Domain Model: specs/domain_model.md - Entity relationships and design
  • Clean Architecture: specs/clean_architecture.md - Architecture guidelines
  • Quick Start: QUICK_START.md - Minimal setup guide

License

This project is open source. See the repository for license details.
