m3et/CodeRAG

CodeRAG - Interactive AI Code Repository Analysis

An AI-powered tool for holding interactive conversations with any code repository, built on iterative RAG (Retrieval-Augmented Generation) with self-reflection.

Task Completion ✅

This project implements the requested chatbot application that works with GitHub repositories as context. It can answer questions like:

  1. "Is there any authentication-related logic in the code? If so, where is it?"
  2. "What does the following application do?"
  3. "What is the flow (classes and function calls) of an executable signing in this application?"

The solution uses Azure OpenAI (GPT-4o + Ada-2 embeddings) with an iterative RAG architecture for accurate code understanding.

Features

  • Repository Analysis: Loads and processes code repositories into a searchable vector store
  • Iterative RAG: Agent iteratively searches and refines queries until sufficient information is found
  • Self-Reflection: Assesses information completeness and continues searching if needed
  • Memory System: Maintains conversation context across interactions
  • Language Support: Works with Python
  • Interactive CLI: Easy-to-use command-line interface

Quick Start

1. Setup Environment

Ensure you have a .env file with Azure OpenAI credentials:

AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=gpt-4
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=text-embedding-ada-002
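
A quick way to confirm the four settings above are visible before starting a session is a small check like the one below. It is a minimal sketch, not part of CodeRAG: the helper name `missing_azure_vars` is hypothetical, and it inspects the process environment only, so it assumes the values from .env have already been exported or loaded by whatever loader you use.

```python
import os

# Variable names exactly as listed in the .env example above.
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
)


def missing_azure_vars(env=None):
    """Return the required variable names that are unset or blank.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_azure_vars()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All Azure OpenAI settings are present.")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to exercise in isolation.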

2. Install CodeRAG

You can install CodeRAG directly from the repository:

git clone https://github.com/m3et/CodeRAG.git
cd CodeRAG
pip install -e .

This will install all dependencies and create a coderag command for easy access.

3. Analyze Any Repository

CodeRAG features a streamlined single-command interface that automatically loads a repository and starts an interactive chat session:

# Using the installed command (recommended)
coderag .                          # Analyze current project
coderag /path/to/repository        # Analyze any repository
coderag ./repo --force-reload      # Force reload repository data  
coderag ./repo --verbose           # Enable verbose logging

# Or using Python directly
python src/cli/cli.py .
python src/cli/cli.py /path/to/repository --verbose

That's it! CodeRAG will automatically:

  1. 🔍 Detect file types in your repository (Python, Java, JS, TS, Go, etc.)
  2. 🧩 Process and chunk your code intelligently using AST parsing
  3. 💾 Store code chunks in a searchable vector database
  4. 🤖 Start an interactive chat session where you can ask questions
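
The detection step (step 1 above) could be approximated by counting files per extension and turning the result into glob patterns such as `**/*.py`. The sketch below is an illustration under assumptions: the extension table and both function names are hypothetical, and CodeRAG's own detection logic may differ.

```python
from collections import Counter
from pathlib import Path

# Hypothetical extension-to-language table; the real mapping may differ.
EXTENSION_LANGUAGES = {
    ".py": "Python",
    ".java": "Java",
    ".js": "JavaScript",
    ".ts": "TypeScript",
    ".go": "Go",
}


def detect_languages(repo_root):
    """Count files per language under repo_root, keyed by file extension."""
    counts = Counter()
    for path in Path(repo_root).rglob("*"):
        language = EXTENSION_LANGUAGES.get(path.suffix.lower())
        if language and path.is_file():
            counts[language] += 1
    return counts


def glob_patterns(counts):
    """Build recursive glob patterns (e.g. '**/*.py') for detected languages."""
    return sorted(
        f"**/*{ext}"
        for ext, language in EXTENSION_LANGUAGES.items()
        if counts.get(language)
    )
```

Calling `glob_patterns(detect_languages("."))` on a repository that contains only Python sources would yield a single `**/*.py` pattern.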

4. Example Usage

# Quick start with current directory
coderag .

# When prompted, ask questions like:
💬 You: What does this application do?
💬 You: Is there any authentication logic? Where?
💬 You: How is error handling implemented?
💬 You: What are the main components and their relationships?

Usage Examples

Interactive Chat Sessions

With the streamlined interface, you simply load a repository and start asking questions interactively:

# Start analyzing a repository (here, a local clone named signify)
coderag signify

# The CLI will guide you through the process:
🚀 CodeRAG - Interactive AI Code Repository Analysis
============================================================
💡 Analyzing your repository and starting an interactive AI chat session

🔧 Initializing components...
✅ Components initialized successfully!

📁 Repository Analysis Phase
==================================================
📍 Analyzing repository: /path/to/your/repo
🔍 Detected languages: Python
📝 Processing file patterns: **/*.py

🚀 Starting repository processing...
📊 Processing Complete!
  📁 Total files scanned: 45
  ✅ Successfully processed: 42
  🧩 Total code chunks created: 156

🤖 CodeRAG Interactive Chat
==================================================
🎯 You can now ask questions about the repository!
💡 Type 'help' to see example questions
🚪 Type 'quit', 'exit', or 'q' to stop

💬 You: What does this application do?
💬 You: Is there any authentication logic? Where?
💬 You: How is error handling implemented?

Testing with Suggested Repositories

The CLI works with the repositories suggested in the task, for example:

# Python signing module  
git clone https://github.com/ralphje/signify
coderag ./signify
# Then ask: "What is the flow of signature verification?"

Advanced Usage

# Enable detailed logging for debugging
coderag ./my-repo --verbose

# Get help on available commands
coderag --help

Architecture

Core Components

  1. Iterative RAG Agent (src/agents/iterative_rag_agent.py)

    • Main agent with thinking loop
    • Self-reflection and query refinement
    • Memory management
  2. Code Chunker (src/chunkers/code.py)

    • AST-based intelligent code parsing
    • Extracts functions, classes, and modules
    • Preserves code context and relationships
  3. Vector Store (src/storage/vector_store.py)

    • ChromaDB integration for similarity search
    • Efficient storage and retrieval of code chunks
  4. Repository Processor (src/repository_processor.py)

    • Orchestrates the entire processing pipeline
    • Handles multiple file types and patterns
    • Batch processing with progress tracking
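
To make the Code Chunker's role concrete, the sketch below shows one way AST-based chunking can work for Python sources, using only the standard `ast` module. It is a simplified illustration, not the project's implementation: the function name `chunk_python_source` and the chunk dictionary layout are assumptions, and only top-level functions and classes are extracted.

```python
import ast


def chunk_python_source(source, file_path="<memory>"):
    """Split Python source into one chunk per top-level function or class.

    Hypothetical helper: the chunk fields are chosen for illustration only.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "name": node.name,
                    "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                    "file": file_path,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                    # Source text of the definition itself (decorators excluded).
                    "text": ast.get_source_segment(source, node),
                }
            )
    return chunks
```

Because each chunk is a whole definition with its line range, a retrieved chunk can be cited back to an exact location in the file, which plain fixed-size text splitting does not guarantee.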

Thinking Loop Process

  1. Query Refinement: Adapts search queries based on context and missing information
  2. Vector Search: Searches the code repository using semantic similarity
  3. Fact Extraction: Extracts key insights from retrieved code chunks
  4. Self-Assessment: Evaluates if sufficient information exists to answer the question
  5. Iteration: Continues with refined queries if more information is needed
  6. Final Answer: Generates comprehensive response with confidence scoring
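
The control flow of steps 1 to 5 can be sketched as a bounded loop. This is a minimal outline under assumptions, not the agent's actual code: the function name, the `max_iterations` cap, and the four callables (`search`, `extract_facts`, `is_sufficient`, `refine_query`) are hypothetical stand-ins for the vector store and LLM calls, and the final answer generation of step 6 is left to the caller.

```python
def answer_with_iterations(question, search, extract_facts, is_sufficient,
                           refine_query, max_iterations=3):
    """Run a search, assess, refine loop and return the collected facts.

    All four callables are hypothetical stand-ins:
      search(query) -> list of code chunks
      extract_facts(question, chunks) -> list of facts
      is_sufficient(question, facts) -> bool
      refine_query(question, facts) -> next query string
    """
    query = question
    facts = []
    for iteration in range(1, max_iterations + 1):
        chunks = search(query)                         # vector search
        for fact in extract_facts(question, chunks):   # fact extraction
            if fact not in facts:                      # skip repeated findings
                facts.append(fact)
        if is_sufficient(question, facts):             # self-assessment
            return {"facts": facts, "iterations": iteration, "sufficient": True}
        query = refine_query(question, facts)          # refine, then iterate
    # Iteration budget exhausted without enough information.
    return {"facts": facts, "iterations": max_iterations, "sufficient": False}
```

Capping the number of iterations keeps the loop from running indefinitely when the repository simply does not contain the requested information; the `sufficient` flag lets the caller reflect that in the confidence of the final answer.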

Implementation Notes

Design Decisions

  1. Iterative vs Single-Shot RAG: Chose iterative approach for better accuracy on complex questions
  2. AST-based Chunking: Preserves code structure better than simple text splitting
  3. Self-Reflection: Allows the agent to determine when it has sufficient information
  4. Memory System: Maintains context across conversations for better user experience

Alternative Approaches Considered

  1. Graph RAG: Could represent code relationships as graphs for better understanding
  2. Few-Shot Learning: Could include examples of good Q&A pairs for better responses
  3. Multi-Agent Systems: Could use specialized agents for different types of questions
  4. Question Rewriting: Could rephrase questions for better retrieval

Performance Considerations

  • Chunking Strategy: Balances context preservation with search efficiency
  • Batch Processing: Processes repositories efficiently with configurable batch sizes
  • Memory Management: Limits conversation history to prevent memory bloat
  • Caching: ChromaDB provides efficient similarity search caching
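
Two of the points above, batch processing and bounded conversation history, can be illustrated with the standard library alone. The names `batched` and `ConversationMemory`, and the default of 10 turns, are assumptions made for this sketch and do not describe CodeRAG's actual classes or settings.

```python
from collections import deque
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


class ConversationMemory:
    """Keep only the most recent question/answer turns (hypothetical class)."""

    def __init__(self, max_turns=10):
        # A deque with maxlen silently drops the oldest turn when full.
        self._turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self._turns.append((question, answer))

    def history(self):
        return list(self._turns)
```

Feeding chunks to the embedding step through `batched(chunks, batch_size)` bounds the size of each request, and the fixed-length deque keeps memory use constant however long a chat session runs.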

Project Structure

.
├── README.md
├── notes.md
├── pyproject.toml
├── requirements.txt
├── src
│   ├── __init__.py
│   ├── agents
│   │   ├── __init__.py
│   │   ├── base.py
│   │   └── iterative_rag_agent.py
│   ├── chunkers
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── code.py
│   │   ├── document.py
│   │   ├── markdown.py
│   │   └── text.py
│   ├── cli
│   │   ├── __init__.py
│   │   └── cli.py
│   ├── config
│   │   ├── __init__.py
│   │   └── openai.py
│   ├── llm_provider.py
│   ├── loaders
│   │   ├── __init__.py
│   │   ├── base.py
│   │   └── repository.py
│   ├── repository_processor.py
│   └── storage
│       ├── __init__.py
│       ├── chroma_vector_store.py
│       ├── vector_store.py
│       └── vector_store_factory.py
└── uv.lock

8 directories, 28 files

Troubleshooting

Common Issues

  1. Environment Variables: Ensure all Azure OpenAI credentials are set correctly in your .env file
  2. Memory Usage: Large repositories may require increased system memory
  3. API Limits: Azure OpenAI rate limits may slow down processing
  4. File Permissions: Ensure the CLI has read access to repository directories

Getting Help

  • Use coderag --help to see all available options
  • Use coderag <repo> --verbose for detailed logging during troubleshooting
  • Type help during interactive chat to see example questions
  • If repository processing fails, check file permissions and ensure the directory contains supported file types
