An AI-powered tool that enables interactive conversations with any code repository, using an iterative RAG (Retrieval-Augmented Generation) pipeline with self-reflection.
This project implements the requested chatbot application that works with GitHub repositories as context. It can answer questions like:
- "Is there any authentication-related logic in the code? If so, where is it?"
- "What does the following application do?"
- "What is the flow (classes and function calls) of an executable signing in this application?"
The solution uses Azure OpenAI (GPT-4o for chat and text-embedding-ada-002 for embeddings) with an iterative RAG architecture for accurate code understanding.
- Repository Analysis: Loads and processes code repositories into a searchable vector store
- Iterative RAG: Agent iteratively searches and refines queries until sufficient information is found
- Self-Reflection: Assesses information completeness and continues searching if needed
- Memory System: Maintains conversation context across interactions
- Multi-Language Support: Works with Python
- Interactive CLI: Easy-to-use command-line interface
Ensure you have a `.env` file with Azure OpenAI credentials:

```
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=gpt-4
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=text-embedding-ada-002
```

You can install CodeRAG directly from the repository:
```shell
git clone https://github.com/m3et/CodeRAG.git
cd CodeRAG
pip install -e .
```

This will install all dependencies and create a `coderag` command for easy access.
CodeRAG features a streamlined single-command interface that automatically loads a repository and starts an interactive chat session:
```shell
# Using the installed command (recommended)
coderag .                       # Analyze the current project
coderag /path/to/repository     # Analyze any repository
coderag ./repo --force-reload   # Force reload repository data
coderag ./repo --verbose        # Enable verbose logging

# Or using Python directly
python src/cli/cli.py .
python src/cli/cli.py /path/to/repository --verbose
```

That's it! CodeRAG will automatically:
- 🔍 Detect file types in your repository (Python, Java, JS, TS, Go, etc.)
- 🧩 Process and chunk your code intelligently using AST parsing
- 💾 Store code chunks in a searchable vector database
- 🤖 Start an interactive chat session where you can ask questions
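The AST-based chunking step can be illustrated with a minimal sketch built on Python's standard `ast` module. This is a simplified stand-in for CodeRAG's actual chunker, not its real API; the function and field names here are illustrative only:

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function/class,
    keeping names and line ranges so search results can point back to code."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks

source = '''
def login(user, password):
    return user == "admin"

class Session:
    pass
'''
print([c["name"] for c in chunk_python_source(source)])  # → ['login', 'Session']
```

Chunking on AST boundaries keeps each function or class intact, which is why results can cite a specific definition rather than an arbitrary slice of text.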
```shell
# Quick start with the current directory
coderag .
```

When prompted, ask questions like:

```
💬 You: What does this application do?
💬 You: Is there any authentication logic? Where?
💬 You: How is error handling implemented?
💬 You: What are the main components and their relationships?
```

With the streamlined interface, you simply load a repository and start asking questions interactively:
```shell
# Start analyzing a repository (here, the signify example project)
coderag signify
```

The CLI will guide you through the process:

```
🚀 CodeRAG - Interactive AI Code Repository Analysis
============================================================
💡 Analyzing your repository and starting an interactive AI chat session

🔧 Initializing components...
✅ Components initialized successfully!

📁 Repository Analysis Phase
==================================================
📍 Analyzing repository: /path/to/your/repo
🔍 Detected languages: Python
📝 Processing file patterns: **/*.py
🚀 Starting repository processing...

📊 Processing Complete!
📁 Total files scanned: 45
✅ Successfully processed: 42
🧩 Total code chunks created: 156

🤖 CodeRAG Interactive Chat
==================================================
🎯 You can now ask questions about the repository!
💡 Type 'help' to see example questions
🚪 Type 'quit', 'exit', or 'q' to stop

💬 You: What does this application do?
💬 You: Is there any authentication logic? Where?
💬 You: How is error handling implemented?
```

The CLI works seamlessly with the suggested repositories from the task:
```shell
# Python signing module
git clone https://github.com/ralphje/signify
coderag ./signify
# Then ask: "What is the flow of signature verification?"

# Enable detailed logging for debugging
coderag ./my-repo --verbose

# Get help on available commands
coderag --help
```
- Iterative RAG Agent (`src/agents/iterative_rag_agent.py`)
  - Main agent with thinking loop
  - Self-reflection and query refinement
  - Memory management
- Code Chunker (`src/code_chunker.py`)
  - AST-based intelligent code parsing
  - Extracts functions, classes, and modules
  - Preserves code context and relationships
- Vector Store (`src/storage/vector_store.py`)
  - ChromaDB integration for similarity search
  - Efficient storage and retrieval of code chunks
- Repository Processor (`src/repository_processor.py`)
  - Orchestrates the entire processing pipeline
  - Handles multiple file types and patterns
  - Batch processing with progress tracking
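Conceptually, the vector store maps each code chunk to an embedding and answers queries by cosine similarity. A dependency-free sketch of that idea follows; the toy bag-of-words `embed` is purely illustrative (CodeRAG uses text-embedding-ada-002 and ChromaDB for the real thing):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real embeddings come from Azure OpenAI.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(store: dict[str, Counter], query: str, k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda cid: cosine(store[cid], q), reverse=True)
    return ranked[:k]

store = {
    "auth.py::login": embed("def login user password verify credentials"),
    "db.py::connect": embed("def connect database pool"),
}
print(search(store, "where is the authentication login logic", k=1))
# → ['auth.py::login']
```

The production store does the same ranking, but over dense embeddings and with ChromaDB handling persistence and nearest-neighbor lookup.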
- Query Refinement: Adapts search queries based on context and missing information
- Vector Search: Searches the code repository using semantic similarity
- Fact Extraction: Extracts key insights from retrieved code chunks
- Self-Assessment: Evaluates if sufficient information exists to answer the question
- Iteration: Continues with refined queries if more information is needed
- Final Answer: Generates comprehensive response with confidence scoring
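The loop above can be sketched as a schematic with stubbed-out components. None of these names are CodeRAG's real API; they just mirror the refine → search → extract → self-assess steps:

```python
def iterative_rag(question, refine, search, extract, is_sufficient, answer, max_iters=5):
    """Schematic iterative RAG loop: search, extract facts, self-assess,
    and refine the query until enough is known or an iteration cap is hit."""
    facts, query = [], question
    for _ in range(max_iters):
        chunks = search(query)                   # vector search over code chunks
        facts.extend(extract(question, chunks))  # pull key insights from results
        if is_sufficient(question, facts):       # self-reflection gate
            break
        query = refine(question, facts)          # adapt query to what is missing
    return answer(question, facts)               # final answer from gathered facts

# Tiny demo with stub components:
calls = []
result = iterative_rag(
    "where is auth?",
    refine=lambda q, f: q + " login",
    search=lambda q: (calls.append(q), ["auth.py"])[1],
    extract=lambda q, c: c,
    is_sufficient=lambda q, f: len(f) >= 2,
    answer=lambda q, f: f,
)
print(result)  # → ['auth.py', 'auth.py']
```

The iteration cap matters in practice: it bounds token spend when the self-assessment never reports "sufficient".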
- Iterative vs Single-Shot RAG: Chose iterative approach for better accuracy on complex questions
- AST-based Chunking: Preserves code structure better than simple text splitting
- Self-Reflection: Allows the agent to determine when it has sufficient information
- Memory System: Maintains context across conversations for better user experience
- Graph RAG: Could represent code relationships as graphs for better understanding
- Few-Shot Learning: Could include examples of good Q&A pairs for better responses
- Multi-Agent Systems: Could use specialized agents for different types of questions
- Question Rewriting: Could rephrase questions for better retrieval
- Chunking Strategy: Balances context preservation with search efficiency
- Batch Processing: Processes repositories efficiently with configurable batch sizes
- Memory Management: Limits conversation history to prevent memory bloat
- Caching: ChromaDB provides efficient similarity search caching
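The conversation-history cap mentioned above can be implemented with a bounded deque, as in this sketch (the limit of 10 turns is an arbitrary illustrative choice, not CodeRAG's actual setting):

```python
from collections import deque

class ConversationMemory:
    """Keeps only the most recent exchanges so the context passed to the LLM
    (and process memory) stays bounded."""
    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)  # old turns are dropped automatically

    def add(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append((user_msg, assistant_msg))

    def as_context(self) -> str:
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)

memory = ConversationMemory(max_turns=2)
for i in range(3):
    memory.add(f"question {i}", f"answer {i}")
print(len(memory.turns))  # → 2 (the oldest turn was evicted)
```

Using `deque(maxlen=...)` makes eviction automatic, so no explicit trimming code is needed.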
```
.
├── README.md
├── notes.md
├── pyproject.toml
├── requirements.txt
├── src
│   ├── __init__.py
│   ├── agents
│   │   ├── __init__.py
│   │   ├── base.py
│   │   └── iterative_rag_agent.py
│   ├── chunkers
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── code.py
│   │   ├── document.py
│   │   ├── markdown.py
│   │   └── text.py
│   ├── cli
│   │   ├── __init__.py
│   │   └── cli.py
│   ├── config
│   │   ├── __init__.py
│   │   └── openai.py
│   ├── llm_provider.py
│   ├── loaders
│   │   ├── __init__.py
│   │   ├── base.py
│   │   └── repository.py
│   ├── repository_processor.py
│   └── storage
│       ├── __init__.py
│       ├── chroma_vector_store.py
│       ├── vector_store.py
│       └── vector_store_factory.py
└── uv.lock

8 directories, 28 files
```
- Environment Variables: Ensure all Azure OpenAI credentials are set correctly in your `.env` file
- Memory Usage: Large repositories may require increased system memory
- API Limits: Azure OpenAI rate limits may slow down processing
- File Permissions: Ensure the CLI has read access to repository directories
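A quick way to check the first item is a small script that verifies the required variables before launch. The variable names are taken from the configuration section above; the script itself is just a convenience sketch, not part of CodeRAG:

```python
import os

REQUIRED = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
]

def missing_env_vars(env=os.environ) -> list[str]:
    """Return the names of required Azure OpenAI settings that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing Azure OpenAI settings:", ", ".join(missing))
    else:
        print("All Azure OpenAI settings present.")
```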
- Use `coderag --help` to see all available options
- Use `coderag <repo> --verbose` for detailed logging during troubleshooting
- Type `help` during interactive chat to see example questions
- If repository processing fails, check file permissions and ensure the directory contains supported file types