corprag

Multimodal RAG service built on RAGAnything + LightRAG, with PostgreSQL defaults and additional enhancements, packaged as a unified service.

Features

  • 🌐 Flexible data sourcing -- Ingest from the local filesystem, Azure Blob Storage, or Snowflake tables
  • 🗂️ Multimodal ingestion with granular enhancements -- PDF, Word, Excel, PowerPoint, images, and more via a parsing engine
  • 🤖 Multi-provider LLM -- OpenAI, Azure OpenAI, Anthropic, Google Gemini, Qwen, MiniMax, Ollama, OpenRouter, xInference
  • 🔭 Knowledge graph + vector search -- Dual retrieval with Apache AGE (graph) and pgvector (vectors) in a single PostgreSQL instance
  • ↕️ Reranking -- LLM-based listwise reranking, or rerankers from Cohere, Jina, Aliyun, and Azure Cohere; point any backend at a custom endpoint (Xinference, Ollama, etc.)
  • ✨ Retrieval enrichment -- Enriched answer and retrieval formatting for better citations and references
  • 🔌 Three interfaces -- Python SDK, REST API, and MCP server

Quick Start

Option A: Python SDK

pip install corprag   # requires Python 3.12
import asyncio
from corprag import RAGService, CorpragConfig

async def main():
    # Minimal config -- just needs an LLM API key
    config = CorpragConfig(openai_api_key="sk-...")

    # Initialize (connects to PostgreSQL, sets up RAG engine)
    service = await RAGService.create(config=config)

    # Ingest documents
    result = await service.aingest(source_type="local", path="./docs")
    print(f"Ingested {result['ingested']} documents")

    # Retrieve (structured contexts + sources, no LLM answer)
    result = await service.aretrieve(query="What are the key findings?")
    print(result.contexts)

    # Answer (LLM-generated answer + structured contexts + sources)
    result = await service.aanswer(query="What are the key findings?")
    print(result.answer)

    await service.close()

asyncio.run(main())

Note: The SDK requires a running PostgreSQL instance with pgvector + AGE extensions, or use JSON fallback for development (see Configuration).

Option B: Self-Hosted Server (Docker)

git clone https://github.com/hanlianlu/corprag.git
cd corprag
cp .env.example .env
# Edit .env -- at minimum set CORPRAG_OPENAI_API_KEY
docker compose up

Everything is included: PostgreSQL (pgvector + AGE), REST API (:8100), and MCP server (:8101).

# Health check
curl http://localhost:8100/health

# Ingest documents
curl -X POST http://localhost:8100/ingest \
  -H "Content-Type: application/json" \
  -d '{"source_type": "local", "path": "/app/corprag_storage/sources"}'

# Retrieve (contexts + sources, no LLM answer)
curl -X POST http://localhost:8100/retrieve \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the key findings?"}'

# Answer (LLM-generated answer + contexts + sources)
curl -X POST http://localhost:8100/answer \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the key findings?"}'

Option C: MCP Server (for AI Agents)

pip install corprag
corprag-mcp  # stdio mode

Add to Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "corprag": {
      "command": "corprag-mcp"
    }
  }
}

Available MCP tools: retrieve, answer, ingest, list_files, delete_files.

Note: Like the SDK, the MCP server requires PostgreSQL with pgvector + AGE, or JSON fallback storage (see Configuration).

Local Development

git clone https://github.com/hanlianlu/corprag.git
cd corprag

# Configure environment
cp .env.example .env

# Edit .env -- at minimum set CORPRAG_OPENAI_API_KEY

# Install dependencies
uv sync

# Start PostgreSQL only (run the API/MCP yourself)
docker compose up postgres -d

# ...or start all services (PostgreSQL, API, and MCP)
docker compose up -d

Testing

uv run pytest tests/unit                    # unit tests (no external services)
uv run pytest tests/integration             # integration tests (requires PostgreSQL)
uv run pytest                               # all tests
uv run pytest --cov-report=html             # + HTML report → htmlcov/index.html

Linting

uv run ruff check src/ tests/ scripts/              # lint check
uv run ruff format --check src/ tests/ scripts/     # format check

uv run ruff check --fix src/ tests/ scripts/        # auto-fix lint issues
uv run ruff format src/ tests/ scripts/             # apply formatting

Tip: To skip PostgreSQL entirely during development, set these in your .env:

CORPRAG_VECTOR_STORAGE=NanoVectorDBStorage
CORPRAG_GRAPH_STORAGE=NetworkXStorage
CORPRAG_KV_STORAGE=JsonKVStorage
CORPRAG_DOC_STATUS_STORAGE=JsonDocStatusStorage

Note: Excel-to-PDF conversion requires LibreOffice (libreoffice on PATH). If not installed, Excel files are ingested as-is without conversion. The Docker image includes LibreOffice.

Configuration

All configuration is via CORPRAG_-prefixed environment variables, a .env file, or constructor arguments.

Priority order (highest to lowest):

  1. Constructor args -- CorpragConfig(openai_api_key="sk-...")
  2. Environment variables -- CORPRAG_OPENAI_API_KEY=sk-...
  3. .env file
  4. Defaults
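
The precedence above can be sketched as a small resolver. `resolve_setting` is a hypothetical helper for illustration only, not part of the corprag API (CorpragConfig handles this internally):

```python
import os

def resolve_setting(name, constructor_value=None, env=None, dotenv=None, default=None):
    """Illustrate the documented precedence:
    constructor arg > environment variable > .env entry > default."""
    env = os.environ if env is None else env
    dotenv = dotenv or {}
    if constructor_value is not None:
        return constructor_value          # 1. constructor args win
    env_key = f"CORPRAG_{name.upper()}"
    if env_key in env:
        return env[env_key]               # 2. then environment variables
    if env_key in dotenv:
        return dotenv[env_key]            # 3. then the .env file
    return default                        # 4. finally built-in defaults

# A constructor argument shadows the environment variable:
key = resolve_setting(
    "openai_api_key",
    constructor_value="sk-from-code",
    env={"CORPRAG_OPENAI_API_KEY": "sk-from-env"},
)
```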

LLM Provider

| Variable | Default | Description |
| --- | --- | --- |
| CORPRAG_LLM_PROVIDER | openai | openai, azure_openai, anthropic, google_gemini, qwen, minimax, ollama, openrouter |
| CORPRAG_EMBEDDING_PROVIDER | (follows llm_provider) | Override embedding provider (e.g., openai when using Anthropic) |
| CORPRAG_VISION_PROVIDER | (follows llm_provider) | Override vision provider |
| CORPRAG_EMBEDDING_MODEL | text-embedding-3-large | Embedding model |

Each provider has its own API key. Model names are unified across providers.

See .env.example for all provider-specific variables.

Storage Backends

| Variable | Default | Options |
| --- | --- | --- |
| CORPRAG_VECTOR_STORAGE | PGVectorStorage | PGVectorStorage, MilvusVectorDBStorage, NanoVectorDBStorage, ... |
| CORPRAG_GRAPH_STORAGE | PGGraphStorage | PGGraphStorage, Neo4JStorage, NetworkXStorage, ... |
| CORPRAG_KV_STORAGE | PGKVStorage | PGKVStorage, JsonKVStorage, RedisKVStorage, ... |
| CORPRAG_DOC_STATUS_STORAGE | PGDocStatusStorage | PGDocStatusStorage, JsonDocStatusStorage, ... |

See .env.example for all available configuration options.

Reranking

Five backends are available. The cohere, jina, and aliyun backends use LightRAG's built-in rerank functions and can target any API-compatible service via RERANK_BASE_URL.

| Variable | Default | Description |
| --- | --- | --- |
| CORPRAG_RERANK_BACKEND | llm | llm, cohere, jina, aliyun, azure_cohere |
| CORPRAG_RERANK_MODEL | (backend default) | Model name sent to the endpoint |
| CORPRAG_RERANK_BASE_URL | (provider default) | Custom endpoint URL for any compatible service |
| CORPRAG_RERANK_API_KEY | — | Generic API key (falls back to provider-specific keys) |

Backend defaults (used when RERANK_MODEL / RERANK_API_KEY are not set):

| Backend | Default model | Provider-specific key |
| --- | --- | --- |
| llm | (follows INGESTION_MODEL) | (follows LLM_PROVIDER credentials) |
| cohere | rerank-v4.0-pro | CORPRAG_COHERE_API_KEY |
| jina | jina-reranker-v2-base-multilingual | CORPRAG_JINA_API_KEY |
| aliyun | gte-rerank-v2 | CORPRAG_ALIYUN_RERANK_API_KEY |
| azure_cohere | Cohere-rerank-v4.0-pro | CORPRAG_AZURE_COHERE_API_KEY + CORPRAG_AZURE_COHERE_ENDPOINT |

Examples:

# Cohere (direct)
CORPRAG_RERANK_BACKEND=cohere
CORPRAG_COHERE_API_KEY=your-key

# Local reranker via Xinference / LiteLLM / any Cohere-compatible endpoint
CORPRAG_RERANK_BACKEND=cohere
CORPRAG_RERANK_MODEL=bge-reranker-v2-m3
CORPRAG_RERANK_BASE_URL=http://localhost:9997/v1/rerank

# LLM-based listwise reranker (default -- no extra config needed)
CORPRAG_RERANK_BACKEND=llm

See .env.example for all reranking options.
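
A "Cohere-compatible" endpoint here means one that accepts the Cohere rerank request shape (model, query, documents, optional top_n) and returns scored results. A minimal sketch of that request body, assuming this shape; `build_rerank_payload` is illustrative, not a corprag function:

```python
import json

def build_rerank_payload(query, documents, model, top_n=None):
    """Build a Cohere-style rerank request body, the shape a custom
    RERANK_BASE_URL endpoint (e.g. Xinference's /v1/rerank) accepts."""
    payload = {"model": model, "query": query, "documents": documents}
    if top_n is not None:
        payload["top_n"] = top_n  # limit how many scored documents come back
    return json.dumps(payload)

body = build_rerank_payload(
    "What are the key findings?",
    ["chunk one", "chunk two"],
    model="bge-reranker-v2-m3",
    top_n=1,
)
```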

REST API

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /ingest | Ingest documents from local, Azure Blob, or Snowflake |
| POST | /retrieve | Retrieve contexts and sources (no LLM answer) |
| POST | /answer | LLM-generated answer with contexts and sources |
| GET | /files | List ingested documents |
| DELETE | /files | Delete documents |
| GET | /health | Health check with storage status |

Set CORPRAG_API_AUTH_TOKEN to enable bearer token authentication.
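
From Python, a request to /answer only needs a JSON body plus, when auth is enabled, a bearer header. A stdlib-only sketch; the endpoint path and payload come from the table above, while the helper name `build_answer_request` is ours:

```python
import json
import urllib.request

def build_answer_request(query, base_url="http://localhost:8100", token=None):
    """Prepare a POST /answer request; the Authorization header is only
    needed when the server sets CORPRAG_API_AUTH_TOKEN."""
    req = urllib.request.Request(
        f"{base_url}/answer",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req

# With a server running:
#   with urllib.request.urlopen(build_answer_request("What are the key findings?")) as resp:
#       print(json.load(resp))
```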

Architecture

┌──────────────────────────────────────────────────────┐
│   Python SDK  ·  REST API (:8100)  ·  MCP (:8101)    │
└─────────────────────────┬────────────────────────────┘
                          │  CorpragConfig
                 ┌────────▼────────┐
                 │   RAGService    │
                 └────┬───────┬────┘
                      │       │
          ┌───────────▼─┐  ┌──▼─────────────┐
          │  Ingestion  │  │   Retrieval    │
          │  Pipeline   │  │ (RAGAnything   │
          │             │  │  + LightRAG)   │
          │  local      │  │                │
          │  azure blob │  │  retrieve()    │
          │  snowflake  │  │  answer()      │
          └──────┬──────┘  └───────┬────────┘
                 └─────────┬───────┘
                           │
              ┌────────────┴───────────────────┐
              │                                │
   ┌──────────▼────────┐      ┌────────────────▼─────┐
   │   LLM Providers   │      │      Storage         │
   │  Chat · Embed     │      │  PostgreSQL          │
   │  Vision · Rerank  │      │  (pgvector + AGE)    │
   │                   │      │                      │
   │  OpenAI · Azure   │      │  Neo4j · Milvus      │
   │  Anthropic ·      │      │  Redis · JSON · ...  │
   │  Gemini · Qwen    │      └──────────────────────┘
   │  Ollama · ...     │
   └───────────────────┘

License

Apache License 2.0. See LICENSE for details.


Built by HanlianLyu. Contributions welcome! Please open issues or pull requests on GitHub.
