jcvikl/code-graph-rag

 
 

Graph-Code: A Graph-Based RAG System for Python Codebases

A Retrieval-Augmented Generation (RAG) system that parses Python repositories, builds a knowledge graph of their structure, and answers natural language questions about the codebase and the relationships within it.

🚀 Features

  • AST-based Code Analysis: Deep parsing of Python files to extract classes, functions, methods, and their relationships
  • Knowledge Graph Storage: Uses Memgraph to store codebase structure as an interconnected graph
  • Natural Language Querying: Ask questions about your codebase in plain English
  • AI-Powered Cypher Generation: Leverages Google Gemini to translate natural language to Cypher queries
  • Code Snippet Retrieval: Retrieves actual source code snippets for found functions/methods
  • Dependency Analysis: Parses pyproject.toml to understand external dependencies

🏗️ Architecture

The system consists of two main components:

  1. Repository Parser (repo_parser.py): Analyzes Python codebases and ingests data into Memgraph
  2. RAG System (codebase_rag/): Interactive CLI for querying the stored knowledge graph

Core Components

  • Graph Database: Memgraph for storing code structure as nodes and relationships
  • LLM Integration: Google Gemini for natural language processing
  • Code Analysis: AST traversal for extracting code elements
  • Query Tools: Specialized tools for graph querying and code retrieval
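
To make the AST traversal concrete, here is a small sketch using Python's standard ast module. It collects top-level classes, top-level functions, and the methods defined directly in each class. The function name extract_definitions and the sample source are assumptions for illustration; the repository's own parser may walk nested definitions and record more detail.

```python
import ast


def extract_definitions(source: str) -> dict[str, list[str]]:
    """Collect top-level classes and functions, plus methods per class."""
    tree = ast.parse(source)
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    found: dict[str, list[str]] = {"classes": [], "functions": [], "methods": []}
    for node in tree.body:
        if isinstance(node, function_types):
            found["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            found["classes"].append(node.name)
            # Only definitions directly in the class body count as methods.
            for item in node.body:
                if isinstance(item, function_types):
                    found["methods"].append(f"{node.name}.{item.name}")
    return found


# Hypothetical module source, for illustration only.
SAMPLE = '''
class User:
    def login(self): ...
    async def refresh(self): ...

def connect(): ...
'''

definitions = extract_definitions(SAMPLE)
print(definitions)
```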

📋 Prerequisites

  • Python 3.12+
  • Docker & Docker Compose (for Memgraph)
  • Google Gemini API key
  • uv package manager

🛠️ Installation

  1. Clone the repository:
git clone <repository-url>
cd graph-code
  2. Install dependencies:
uv sync
  3. Set up environment variables:
cp .env.example .env
# Edit .env with your configuration

Required environment variables:

GEMINI_API_KEY=your-api-key
GEMINI_MODEL_ID=gemini-2.5-pro
MODEL_CYPHER_ID=gemini-2.5-flash-lite-preview-06-17
MEMGRAPH_HOST=localhost
MEMGRAPH_PORT=7687
  4. Start Memgraph database:
docker-compose up -d

🎯 Usage

Step 1: Parse a Repository

Parse and ingest a Python repository into the knowledge graph:

python repo_parser.py /path/to/your/python/repo --clean

Options:

  • --clean: Clear existing data before parsing
  • --host: Memgraph host (default: localhost)
  • --port: Memgraph port (default: 7687)
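
The options above map naturally onto a command-line parser. The sketch below uses argparse from the standard library to show how the positional path and the three flags could be declared; it is an assumption about the shape of the interface, not a copy of repo_parser.py, which may be built with a different library.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented arguments: a repo path, --clean, --host, --port."""
    parser = argparse.ArgumentParser(
        prog="repo_parser.py",
        description="Parse a Python repository and ingest it into Memgraph.",
    )
    parser.add_argument("repo_path", help="Path to the repository to parse")
    parser.add_argument(
        "--clean", action="store_true", help="Clear existing data before parsing"
    )
    parser.add_argument("--host", default="localhost", help="Memgraph host")
    parser.add_argument("--port", type=int, default=7687, help="Memgraph port")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(["/path/to/your/python/repo", "--clean"])
print(args.repo_path, args.clean, args.host, args.port)
```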

Step 2: Query the Codebase

Start the interactive RAG CLI:

python -m codebase_rag.main --repo-path /path/to/your/repo

Example queries:

  • "Show me all classes that contain 'user' in their name"
  • "Find functions related to database operations"
  • "What methods does the User class have?"
  • "Show me functions that handle authentication"

📊 Graph Schema

The knowledge graph uses the following node types and relationships:

Node Types

  • Project: Root node representing the entire repository
  • Package: Python packages (directories with __init__.py)
  • Module: Individual Python files
  • Class: Class definitions
  • Function: Module-level functions
  • Method: Methods defined in classes
  • Folder: Regular directories
  • File: Non-Python files
  • ExternalPackage: External dependencies
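
The distinction between Package, Folder, Module, and File nodes can be sketched as a small classifier over filesystem paths. The helper name node_label is hypothetical, and the rules are only those stated in the list above: a directory with __init__.py is a Package, any other directory is a Folder, a .py file is a Module, and anything else is a File.

```python
import tempfile
from pathlib import Path


def node_label(path: Path) -> str:
    """Map a filesystem path to one of the structural node labels."""
    if path.is_dir():
        return "Package" if (path / "__init__.py").is_file() else "Folder"
    return "Module" if path.suffix == ".py" else "File"


# Build a throwaway directory tree to demonstrate the classification.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg").mkdir()
    (root / "pkg" / "__init__.py").write_text("")
    (root / "docs").mkdir()
    (root / "docs" / "notes.md").write_text("# notes")
    labels = {
        p.relative_to(root).as_posix(): node_label(p)
        for p in sorted(root.rglob("*"))
    }

print(labels)
```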

Relationships

  • CONTAINS_PACKAGE, CONTAINS_MODULE, CONTAINS_FILE, CONTAINS_FOLDER: Hierarchical containment
  • DEFINES: Module defines classes/functions
  • DEFINES_METHOD: Class defines methods
  • DEPENDS_ON_EXTERNAL: Project depends on external packages
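
With this schema, a question such as "What methods does the User class have?" corresponds to a short Cypher pattern over Class and Method nodes. The sketch below builds that query with a parameter rather than string interpolation. The node labels and the DEFINES_METHOD relationship come from the schema above; the name property and the function methods_of_class_query are assumptions, since the stored property names are not listed here.

```python
def methods_of_class_query(class_name: str) -> tuple[str, dict[str, str]]:
    """Return a parameterized Cypher query and its parameters."""
    query = (
        "MATCH (c:Class)-[:DEFINES_METHOD]->(m:Method) "
        "WHERE c.name = $class_name "  # assumes nodes carry a `name` property
        "RETURN m.name AS method "
        "ORDER BY method"
    )
    return query, {"class_name": class_name}


query, params = methods_of_class_query("User")
print(query)
print(params)
```

Passing the class name as a parameter keeps user-supplied text out of the query string, which matters when the query originates from free-form natural language input.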

🔧 Configuration

Configuration is managed through environment variables and the config.py file:

MEMGRAPH_HOST = "localhost"
MEMGRAPH_PORT = 7687
GEMINI_MODEL_ID = "gemini-2.5-pro"  # Main RAG orchestrator model
MODEL_CYPHER_ID = "gemini-2.5-flash-lite-preview-06-17"  # Cypher generation model
TARGET_REPO_PATH = "."
GEMINI_API_KEY = "required"

🏃‍♂️ Development

Project Structure

graph-code/
├── repo_parser.py              # Repository analysis and ingestion
├── codebase_rag/              # RAG system package
│   ├── main.py                # CLI entry point
│   ├── config.py              # Configuration management
│   ├── prompts.py             # LLM prompts and schemas
│   ├── schemas.py             # Pydantic models
│   ├── services/              # Core services
│   │   ├── graph_db.py        # Memgraph integration
│   │   └── llm.py             # Gemini LLM integration
│   └── tools/                 # RAG tools
│       ├── codebase_query.py  # Graph querying tool
│       └── code_retrieval.py  # Code snippet retrieval
├── docker-compose.yaml        # Memgraph setup
└── pyproject.toml            # Project dependencies

Key Dependencies

  • pydantic-ai: AI agent framework
  • pymgclient: Memgraph Python client
  • loguru: Advanced logging
  • python-dotenv: Environment variable management

🐛 Debugging

  1. Check Memgraph connection:

    • Ensure Docker containers are running: docker-compose ps
    • Verify Memgraph is accessible on port 7687
  2. View database in Memgraph Lab:

  3. Enable debug logging:

    • The RAG orchestrator runs in debug mode by default
    • Check logs for detailed execution traces
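
For the connection check in step 1, a quick way to confirm that something is listening on the Memgraph port is a plain TCP connection attempt. The sketch below uses only the standard socket module; the helper name port_is_open is hypothetical, and a successful connection shows only that the port accepts connections, not that Memgraph itself is healthy.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 7687, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if port_is_open():
    print("Port 7687 is reachable")
else:
    print("Nothing is accepting connections on port 7687")
```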

🤝 Contributing

  1. Follow the established code structure
  2. Keep files under 100 lines (as per user rules)
  3. Use type annotations
  4. Follow conventional commit messages
  5. Use DRY principles

🙋‍♂️ Support

For issues or questions:

  1. Check the logs for error details
  2. Verify Memgraph connection
  3. Ensure all environment variables are set
  4. Check that the graph schema matches your expectations

About

Search Monorepos and get relevant answers

Languages

  • Python 100.0%