An AI-powered tool that enables interactive conversations with any code repository, using an iterative RAG (Retrieval-Augmented Generation) pipeline with self-reflection.
This project implements the requested chatbot application that works with GitHub repositories as context. It can answer questions like:
- "Is there any authentication-related logic in the code? If so, where is it?"
- "What does the following application do?"
- "What is the flow (classes and function calls) of an executable signing in this application?"
The solution uses Azure OpenAI (GPT-4o for chat and text-embedding-ada-002 for embeddings) with an iterative RAG architecture for accurate code understanding.
- Repository Analysis: Loads and processes code repositories into a searchable vector store
- Iterative RAG: Agent iteratively searches and refines queries until sufficient information is found
- Self-Reflection: Assesses information completeness and continues searching if needed
- Memory System: Maintains conversation context across interactions
- Multi-Language Support: Works with Python
- Interactive CLI: Easy-to-use command-line interface
Ensure you have a `.env` file with Azure OpenAI credentials:

```
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=gpt-4
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=text-embedding-ada-002
```

You can install CodeRAG directly from the repository:
```shell
git clone https://github.com/m3et/CodeRAG.git
cd CodeRAG
pip install -e .
```

This will install all dependencies and create a `coderag` command for easy access.
CodeRAG features a streamlined single-command interface that automatically loads a repository and starts an interactive chat session:
```shell
# Using the installed command (recommended)
coderag .                       # Analyze the current project
coderag /path/to/repository     # Analyze any repository
coderag ./repo --force-reload   # Force reload repository data
coderag ./repo --verbose        # Enable verbose logging

# Or using Python directly
python src/cli/cli.py .
python src/cli/cli.py /path/to/repository --verbose
```

That's it! CodeRAG will automatically:
- 🔍 Detect file types in your repository (Python, Java, JS, TS, Go, etc.)
- 🧩 Process and chunk your code intelligently using AST parsing
- 💾 Store code chunks in a searchable vector database
- 🤖 Start an interactive chat session where you can ask questions
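The AST-based chunking step can be illustrated with a minimal sketch built on Python's standard `ast` module. This is a simplified stand-in for CodeRAG's actual chunker, not its real API; the function and field names here are illustrative only:

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function/class,
    keeping names and line ranges so search results can point back to code."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks

source = '''
def login(user, password):
    return user == "admin"

class Session:
    pass
'''
print([c["name"] for c in chunk_python_source(source)])  # → ['login', 'Session']
```

Chunking on AST boundaries keeps each function or class intact, which is why results can cite a specific definition rather than an arbitrary slice of text.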
```shell
# Quick start with the current directory
coderag .
```

When prompted, ask questions like:

```
💬 You: What does this application do?
💬 You: Is there any authentication logic? Where?
💬 You: How is error handling implemented?
💬 You: What are the main components and their relationships?
```

With the streamlined interface, you simply load a repository and start asking questions interactively:
```shell
# Start analyzing a repository (here, the signify example project)
coderag signify
```

The CLI will guide you through the process:

```
🚀 CodeRAG - Interactive AI Code Repository Analysis
============================================================
💡 Analyzing your repository and starting an interactive AI chat session

🔧 Initializing components...
✅ Components initialized successfully!

📁 Repository Analysis Phase
==================================================
📍 Analyzing repository: /path/to/your/repo
🔍 Detected languages: Python
📝 Processing file patterns: **/*.py
🚀 Starting repository processing...

📊 Processing Complete!
📁 Total files scanned: 45
✅ Successfully processed: 42
🧩 Total code chunks created: 156

🤖 CodeRAG Interactive Chat
==================================================
🎯 You can now ask questions about the repository!
💡 Type 'help' to see example questions
🚪 Type 'quit', 'exit', or 'q' to stop

💬 You: What does this application do?
💬 You: Is there any authentication logic? Where?
💬 You: How is error handling implemented?
```

The CLI works seamlessly with the suggested repositories from the task:
```shell
# Python signing module
git clone https://github.com/ralphje/signify
coderag ./signify
# Then ask: "What is the flow of signature verification?"

# Enable detailed logging for debugging
coderag ./my-repo --verbose

# Get help on available commands
coderag --help
```
- Iterative RAG Agent (`src/agents/iterative_rag_agent.py`)
  - Main agent with thinking loop
  - Self-reflection and query refinement
  - Memory management
- Code Chunker (`src/code_chunker.py`)
  - AST-based intelligent code parsing
  - Extracts functions, classes, and modules
  - Preserves code context and relationships
- Vector Store (`src/storage/vector_store.py`)
  - ChromaDB integration for similarity search
  - Efficient storage and retrieval of code chunks
- Repository Processor (`src/repository_processor.py`)
  - Orchestrates the entire processing pipeline
  - Handles multiple file types and patterns
  - Batch processing with progress tracking
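Conceptually, the vector store maps each code chunk to an embedding and answers queries by cosine similarity. A dependency-free sketch of that idea follows; the toy bag-of-words `embed` is purely illustrative (CodeRAG uses text-embedding-ada-002 and ChromaDB for the real thing):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real embeddings come from Azure OpenAI.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(store: dict[str, Counter], query: str, k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda cid: cosine(store[cid], q), reverse=True)
    return ranked[:k]

store = {
    "auth.py::login": embed("def login user password verify credentials"),
    "db.py::connect": embed("def connect database pool"),
}
print(search(store, "where is the authentication login logic", k=1))
# → ['auth.py::login']
```

The production store does the same ranking, but over dense embeddings and with ChromaDB handling persistence and nearest-neighbor lookup.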
- Query Refinement: Adapts search queries based on context and missing information
- Vector Search: Searches the code repository using semantic similarity
- Fact Extraction: Extracts key insights from retrieved code chunks
- Self-Assessment: Evaluates if sufficient information exists to answer the question
- Iteration: Continues with refined queries if more information is needed
- Final Answer: Generates comprehensive response with confidence scoring
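The loop above can be sketched as a schematic with stubbed-out components. None of these names are CodeRAG's real API; they just mirror the refine → search → extract → self-assess steps:

```python
def iterative_rag(question, refine, search, extract, is_sufficient, answer, max_iters=5):
    """Schematic iterative RAG loop: search, extract facts, self-assess,
    and refine the query until enough is known or an iteration cap is hit."""
    facts, query = [], question
    for _ in range(max_iters):
        chunks = search(query)                   # vector search over code chunks
        facts.extend(extract(question, chunks))  # pull key insights from results
        if is_sufficient(question, facts):       # self-reflection gate
            break
        query = refine(question, facts)          # adapt query to what is missing
    return answer(question, facts)               # final answer from gathered facts

# Tiny demo with stub components:
calls = []
result = iterative_rag(
    "where is auth?",
    refine=lambda q, f: q + " login",
    search=lambda q: (calls.append(q), ["auth.py"])[1],
    extract=lambda q, c: c,
    is_sufficient=lambda q, f: len(f) >= 2,
    answer=lambda q, f: f,
)
print(result)  # → ['auth.py', 'auth.py']
```

The iteration cap matters in practice: it bounds token spend when the self-assessment never reports "sufficient".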
- Iterative vs Single-Shot RAG: Chose iterative approach for better accuracy on complex questions
- AST-based Chunking: Preserves code structure better than simple text splitting
- Self-Reflection: Allows the agent to determine when it has sufficient information
- Memory System: Maintains context across conversations for better user experience
- Graph RAG: Could represent code relationships as graphs for better understanding
- Few-Shot Learning: Could include examples of good Q&A pairs for better responses
- Multi-Agent Systems: Could use specialized agents for different types of questions
- Question Rewriting: Could rephrase questions for better retrieval
- Chunking Strategy: Balances context preservation with search efficiency
- Batch Processing: Processes repositories efficiently with configurable batch sizes
- Memory Management: Limits conversation history to prevent memory bloat
- Caching: ChromaDB provides efficient similarity search caching
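The conversation-history cap mentioned above can be implemented with a bounded deque, as in this sketch (the limit of 10 turns is an arbitrary illustrative choice, not CodeRAG's actual setting):

```python
from collections import deque

class ConversationMemory:
    """Keeps only the most recent exchanges so the context passed to the LLM
    (and process memory) stays bounded."""
    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)  # old turns are dropped automatically

    def add(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append((user_msg, assistant_msg))

    def as_context(self) -> str:
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)

memory = ConversationMemory(max_turns=2)
for i in range(3):
    memory.add(f"question {i}", f"answer {i}")
print(len(memory.turns))  # → 2 (the oldest turn was evicted)
```

Using `deque(maxlen=...)` makes eviction automatic, so no explicit trimming code is needed.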
```
.
├── README.md
├── notes.md
├── pyproject.toml
├── requirements.txt
├── src
│   ├── __init__.py
│   ├── agents
│   │   ├── __init__.py
│   │   ├── base.py
│   │   └── iterative_rag_agent.py
│   ├── chunkers
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── code.py
│   │   ├── document.py
│   │   ├── markdown.py
│   │   └── text.py
│   ├── cli
│   │   ├── __init__.py
│   │   └── cli.py
│   ├── config
│   │   ├── __init__.py
│   │   └── openai.py
│   ├── llm_provider.py
│   ├── loaders
│   │   ├── __init__.py
│   │   ├── base.py
│   │   └── repository.py
│   ├── repository_processor.py
│   └── storage
│       ├── __init__.py
│       ├── chroma_vector_store.py
│       ├── vector_store.py
│       └── vector_store_factory.py
└── uv.lock

8 directories, 28 files
```
- Environment Variables: Ensure all Azure OpenAI credentials are set correctly in your `.env` file
- Memory Usage: Large repositories may require increased system memory
- API Limits: Azure OpenAI rate limits may slow down processing
- File Permissions: Ensure the CLI has read access to repository directories
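A quick way to check the first item is a small script that verifies the required variables before launch. The variable names are taken from the configuration section above; the script itself is just a convenience sketch, not part of CodeRAG:

```python
import os

REQUIRED = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
]

def missing_env_vars(env=os.environ) -> list[str]:
    """Return the names of required Azure OpenAI settings that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing Azure OpenAI settings:", ", ".join(missing))
    else:
        print("All Azure OpenAI settings present.")
```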
- Use `coderag --help` to see all available options
- Use `coderag <repo> --verbose` for detailed logging during troubleshooting
- Type `help` during interactive chat to see example questions
- If repository processing fails, check file permissions and ensure the directory contains supported file types