corprag

Multimodal RAG service built on RAGAnything + LightRAG, with PostgreSQL defaults and additional enhancements, packaged as a unified service.

Features

  • 🌐 Flexible data sourcing -- Ingest from the local filesystem, Azure Blob Storage, or Snowflake tables
  • 🗂️ Multimodal ingestion with granular enhancements -- PDF, Word, Excel, PowerPoint, images, and more via a parsing engine
  • 🤖 Multi-provider LLM -- OpenAI, Azure OpenAI, Anthropic, Google Gemini, Qwen, MiniMax, Ollama, OpenRouter, xInference
  • 🔭 Knowledge graph + vector search -- Dual retrieval with Apache AGE (graph) and pgvector (vectors) in a single PostgreSQL instance
  • ↕️ Reranking -- LLM-based listwise reranking, or rerankers from Cohere, Jina, Aliyun, and Azure Cohere; point any backend at a custom endpoint (Xinference, Ollama, etc.)
  • ✨ Retrieval enrichment -- Enriched answer and retrieval formatting for better citations and references
  • 🔌 Three interfaces -- Python SDK, REST API, and MCP server

Quick Start

Option A: Python SDK

pip install corprag   # requires Python 3.12
import asyncio
from corprag import RAGService, CorpragConfig

async def main():
    # Minimal config -- just needs an LLM API key
    config = CorpragConfig(openai_api_key="sk-...")

    # Initialize (connects to PostgreSQL, sets up RAG engine)
    service = await RAGService.create(config=config)

    # Ingest documents
    result = await service.aingest(source_type="local", path="./docs")
    print(f"Ingested {result['ingested']} documents")

    # Retrieve (structured contexts + sources, no LLM answer)
    result = await service.aretrieve(query="What are the key findings?")
    print(result.contexts)

    # Answer (LLM-generated answer + structured contexts + sources)
    result = await service.aanswer(query="What are the key findings?")
    print(result.answer)

    await service.close()

asyncio.run(main())

Note: The SDK requires a running PostgreSQL instance with pgvector + AGE extensions, or use JSON fallback for development (see Configuration).

Option B: Self-Hosted Server (Docker)

git clone https://github.com/hanlianlu/corprag.git
cd corprag
cp .env.example .env
# Edit .env -- at minimum set CORPRAG_OPENAI_API_KEY
docker compose up

Everything is included: PostgreSQL (pgvector + AGE), REST API (:8100), and MCP server (:8101).

# Health check
curl http://localhost:8100/health

# Ingest documents
curl -X POST http://localhost:8100/ingest \
  -H "Content-Type: application/json" \
  -d '{"source_type": "local", "path": "/app/corprag_storage/sources"}'

# Retrieve (contexts + sources, no LLM answer)
curl -X POST http://localhost:8100/retrieve \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the key findings?"}'

# Answer (LLM-generated answer + contexts + sources)
curl -X POST http://localhost:8100/answer \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the key findings?"}'

Option C: MCP Server (for AI Agents)

pip install corprag
corprag-mcp  # stdio mode

Add to Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "corprag": {
      "command": "corprag-mcp"
    }
  }
}

Available MCP tools: retrieve, answer, ingest, list_files, delete_files.

Note: Like the SDK, the MCP server requires PostgreSQL with pgvector + AGE, or JSON fallback storage (see Configuration).

Local Development

git clone https://github.com/hanlianlu/corprag.git
cd corprag

# Configure environment
cp .env.example .env

# Edit .env -- at minimum set CORPRAG_OPENAI_API_KEY

# Install dependencies
uv sync

# Start PostgreSQL only (run the API/MCP yourself)
docker compose up postgres -d

# ...or start all services (PostgreSQL, API, and MCP)
docker compose up -d

Testing

uv run pytest tests/unit                    # unit tests (no external services)
uv run pytest tests/integration             # integration tests (requires PostgreSQL)
uv run pytest                               # all tests
uv run pytest --cov-report=html             # + HTML report → htmlcov/index.html

Linting

uv run ruff check src/ tests/ scripts/              # lint check
uv run ruff format --check src/ tests/ scripts/     # format check

uv run ruff check --fix src/ tests/ scripts/        # auto-fix lint issues
uv run ruff format src/ tests/ scripts/             # apply formatting

Tip: To skip PostgreSQL entirely during development, set these in your .env:

CORPRAG_VECTOR_STORAGE=NanoVectorDBStorage
CORPRAG_GRAPH_STORAGE=NetworkXStorage
CORPRAG_KV_STORAGE=JsonKVStorage
CORPRAG_DOC_STATUS_STORAGE=JsonDocStatusStorage

Note: Excel-to-PDF conversion requires LibreOffice (libreoffice on PATH). If not installed, Excel files are ingested as-is without conversion. The Docker image includes LibreOffice.

Configuration

All configuration is via CORPRAG_-prefixed environment variables, a .env file, or constructor arguments.

Priority order (highest to lowest):

  1. Constructor args -- CorpragConfig(openai_api_key="sk-...")
  2. Environment variables -- CORPRAG_OPENAI_API_KEY=sk-...
  3. .env file
  4. Defaults
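
The precedence above can be sketched as a small resolver. `resolve_setting` is a hypothetical helper for illustration only, not part of the corprag API (CorpragConfig handles this internally):

```python
import os

def resolve_setting(name, constructor_value=None, env=None, dotenv=None, default=None):
    """Illustrate the documented precedence:
    constructor arg > environment variable > .env entry > default."""
    env = os.environ if env is None else env
    dotenv = dotenv or {}
    if constructor_value is not None:
        return constructor_value          # 1. constructor args win
    env_key = f"CORPRAG_{name.upper()}"
    if env_key in env:
        return env[env_key]               # 2. then environment variables
    if env_key in dotenv:
        return dotenv[env_key]            # 3. then the .env file
    return default                        # 4. finally built-in defaults

# A constructor argument shadows the environment variable:
key = resolve_setting(
    "openai_api_key",
    constructor_value="sk-from-code",
    env={"CORPRAG_OPENAI_API_KEY": "sk-from-env"},
)
```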

LLM Provider

| Variable | Default | Description |
| --- | --- | --- |
| CORPRAG_LLM_PROVIDER | openai | openai, azure_openai, anthropic, google_gemini, qwen, minimax, ollama, openrouter |
| CORPRAG_EMBEDDING_PROVIDER | (follows llm_provider) | Override embedding provider (e.g., openai when using Anthropic) |
| CORPRAG_VISION_PROVIDER | (follows llm_provider) | Override vision provider |
| CORPRAG_EMBEDDING_MODEL | text-embedding-3-large | Embedding model |

Each provider has its own API key. Model names are unified across providers.

See .env.example for all provider-specific variables.

Storage Backends

| Variable | Default | Options |
| --- | --- | --- |
| CORPRAG_VECTOR_STORAGE | PGVectorStorage | PGVectorStorage, MilvusVectorDBStorage, NanoVectorDBStorage, ... |
| CORPRAG_GRAPH_STORAGE | PGGraphStorage | PGGraphStorage, Neo4JStorage, NetworkXStorage, ... |
| CORPRAG_KV_STORAGE | PGKVStorage | PGKVStorage, JsonKVStorage, RedisKVStorage, ... |
| CORPRAG_DOC_STATUS_STORAGE | PGDocStatusStorage | PGDocStatusStorage, JsonDocStatusStorage, ... |

See .env.example for all available configuration options.

Reranking

Five backends are available. The cohere, jina, and aliyun backends use LightRAG's built-in rerank functions and can target any API-compatible service via RERANK_BASE_URL.

| Variable | Default | Description |
| --- | --- | --- |
| CORPRAG_RERANK_BACKEND | llm | llm, cohere, jina, aliyun, azure_cohere |
| CORPRAG_RERANK_MODEL | (backend default) | Model name sent to the endpoint |
| CORPRAG_RERANK_BASE_URL | (provider default) | Custom endpoint URL for any compatible service |
| CORPRAG_RERANK_API_KEY | — | Generic API key (falls back to provider-specific keys) |

Backend defaults (used when RERANK_MODEL / RERANK_API_KEY are not set):

| Backend | Default model | Provider-specific key |
| --- | --- | --- |
| llm | (follows INGESTION_MODEL) | (follows LLM_PROVIDER credentials) |
| cohere | rerank-v4.0-pro | CORPRAG_COHERE_API_KEY |
| jina | jina-reranker-v2-base-multilingual | CORPRAG_JINA_API_KEY |
| aliyun | gte-rerank-v2 | CORPRAG_ALIYUN_RERANK_API_KEY |
| azure_cohere | Cohere-rerank-v4.0-pro | CORPRAG_AZURE_COHERE_API_KEY + CORPRAG_AZURE_COHERE_ENDPOINT |

Examples:

# Cohere (direct)
CORPRAG_RERANK_BACKEND=cohere
CORPRAG_COHERE_API_KEY=your-key

# Local reranker via Xinference / LiteLLM / any Cohere-compatible endpoint
CORPRAG_RERANK_BACKEND=cohere
CORPRAG_RERANK_MODEL=bge-reranker-v2-m3
CORPRAG_RERANK_BASE_URL=http://localhost:9997/v1/rerank

# LLM-based listwise reranker (default -- no extra config needed)
CORPRAG_RERANK_BACKEND=llm

See .env.example for all reranking options.
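
A "Cohere-compatible" endpoint here means one that accepts the Cohere rerank request shape (model, query, documents, optional top_n) and returns scored results. A minimal sketch of that request body, assuming this shape; `build_rerank_payload` is illustrative, not a corprag function:

```python
import json

def build_rerank_payload(query, documents, model, top_n=None):
    """Build a Cohere-style rerank request body, the shape a custom
    RERANK_BASE_URL endpoint (e.g. Xinference's /v1/rerank) accepts."""
    payload = {"model": model, "query": query, "documents": documents}
    if top_n is not None:
        payload["top_n"] = top_n  # limit how many scored documents come back
    return json.dumps(payload)

body = build_rerank_payload(
    "What are the key findings?",
    ["chunk one", "chunk two"],
    model="bge-reranker-v2-m3",
    top_n=1,
)
```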

REST API

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /ingest | Ingest documents from local, Azure Blob, or Snowflake |
| POST | /retrieve | Retrieve contexts and sources (no LLM answer) |
| POST | /answer | LLM-generated answer with contexts and sources |
| GET | /files | List ingested documents |
| DELETE | /files | Delete documents |
| GET | /health | Health check with storage status |

Set CORPRAG_API_AUTH_TOKEN to enable bearer token authentication.
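
From Python, a request to /answer only needs a JSON body plus, when auth is enabled, a bearer header. A stdlib-only sketch; the endpoint path and payload come from the table above, while the helper name `build_answer_request` is ours:

```python
import json
import urllib.request

def build_answer_request(query, base_url="http://localhost:8100", token=None):
    """Prepare a POST /answer request; the Authorization header is only
    needed when the server sets CORPRAG_API_AUTH_TOKEN."""
    req = urllib.request.Request(
        f"{base_url}/answer",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req

# With a server running:
#   with urllib.request.urlopen(build_answer_request("What are the key findings?")) as resp:
#       print(json.load(resp))
```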

Architecture

┌──────────────────────────────────────────────────────┐
│   Python SDK  ·  REST API (:8100)  ·  MCP (:8101)    │
└─────────────────────────┬────────────────────────────┘
                          │  CorpragConfig
                 ┌────────▼────────┐
                 │   RAGService    │
                 └────┬───────┬────┘
                      │       │
          ┌───────────▼─┐  ┌──▼─────────────┐
          │  Ingestion  │  │   Retrieval    │
          │  Pipeline   │  │ (RAGAnything   │
          │             │  │  + LightRAG)   │
          │  local      │  │                │
          │  azure blob │  │  retrieve()    │
          │  snowflake  │  │  answer()      │
          └──────┬──────┘  └───────┬────────┘
                 └─────────┬───────┘
                           │
              ┌────────────┴───────────────────┐
              │                                │
   ┌──────────▼────────┐      ┌────────────────▼─────┐
   │   LLM Providers   │      │      Storage         │
   │  Chat · Embed     │      │  PostgreSQL          │
   │  Vision · Rerank  │      │  (pgvector + AGE)    │
   │                   │      │                      │
   │  OpenAI · Azure   │      │  Neo4j · Milvus      │
   │  Anthropic ·      │      │  Redis · JSON · ...  │
   │  Gemini · Qwen    │      └──────────────────────┘
   │  Ollama · ...     │
   └───────────────────┘

License

Apache License 2.0. See LICENSE for details.


Built by HanlianLyu. Contributions welcome! Please open issues or pull requests on GitHub.
