ccpPy - RAG-Based C/C++ to Python Translator

A local-first code translation system that converts C/C++ codebases to Python using dependency-aware RAG (Retrieval-Augmented Generation) and local LLMs via Ollama.

Features

AST-Based Parsing: Uses tree-sitter for accurate C and C++ code structure extraction
Dependency-Aware Translation: Builds call graphs to translate in optimal order (leaf nodes first)
Semantic Code Search: FAISS-powered vector search finds similar code patterns for context
Local LLM Integration: Uses Ollama for privacy-preserving, cost-free translation
Validation Pipeline: Automatic syntax, type hint, and PEP 8 validation
Translation Memory: Maintains consistency across related code translations
Multi-Language Support: Handles both C and C++ source files
Iterative Refinement: Retries failed translations with validation feedback

Requirements

Python 3.9+
Ollama with a supported model (default: qwen3:4b)
8GB RAM recommended

Installation

# Clone the repository
git clone https://github.com/lexg-colorado/ccpPy.git
cd ccpPy

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Pull the LLM model
ollama pull qwen3:4b

Quick Start

1. Configure Your Project

Edit config.yaml to point to your C/C++ source:

source:
  source_path: "/path/to/your/c/project"
  language: "c"  # or "cpp" or "auto"

2. Run the Pipeline

# Phase 1: Parse source code into AST
python scripts/01_parse_c_code.py

# Phase 2: Build dependency graph
python scripts/02_build_graph.py

# Phase 3: Create semantic index
python scripts/03_index_code.py

# Phase 4: Translate to Python
python scripts/04_translate.py --limit 10

CLI Reference

Parse C/C++ Code

python scripts/01_parse_c_code.py [OPTIONS]

Options:
  --lang LANG      Language to parse (c, cpp, auto)
  --force          Force re-parsing of all files
  --verbose        Enable debug logging
  --profile NAME   Use a project profile from profiles/

Build Dependency Graph

python scripts/02_build_graph.py [OPTIONS]

Options:
  --lang LANG        Language to analyze
  --analyze-only     Only show analysis, don't save
  --output-format    Output format (pickle, json, both)
  --verbose          Enable debug logging

Create Semantic Index

python scripts/03_index_code.py [OPTIONS]

Options:
  --lang LANG      Language to index
  --rebuild        Force rebuild of index
  --query TEXT     Test query against index
  --top-k N        Number of results for test query
  --verbose        Enable debug logging

Translate Code

python scripts/04_translate.py [OPTIONS]

Options:
  --limit N          Translate only N functions
  --function NAME    Translate specific function
  --dry-run          Preview without translating
  --verbose          Enable debug logging
  --debug            Save failed translations for analysis
  --no-leaves        Skip leaf node priority ordering
  --continue         Resume from previous translation
  --output-dir DIR   Custom output directory

Architecture

C/C++ Source -> AST Parser -> Dependency Graph -> Vector Store -> RAG Translator -> Python Output
                    |              |                   |                |
              tree-sitter      networkx           FAISS +          Ollama LLM
                                              sentence-transformers

Pipeline Phases

Phase	Component	Description
1	AST Parser	Extracts functions, structs, includes, call relationships
2	Dependency Graph	Builds call graphs, determines translation order
3	Semantic Index	Creates embeddings for code similarity search
4	Translation	RAG-powered translation with iterative refinement
5	Validation	Ensures output quality and correctness

How It Works

This project demonstrates how commercial "unlimited context" translation tools work:

Smart Chunking: Code is parsed into function-level units using tree-sitter AST parsing
Dependency Ordering: Translation follows the call graph (leaves first) to ensure dependencies are translated before dependents
Semantic Retrieval: Similar code patterns provide few-shot examples via FAISS similarity search
Iterative Refinement: Failed translations receive validation feedback and retry (up to 3 attempts)
Translation Memory: Maintains C-to-Python entity mappings for consistency

Configuration

All settings are centralized in config.yaml:

Section	Key Settings
`source`	Source path, language (c/cpp/auto), exclude patterns
`llm`	Backend (ollama), model (qwen3:4b), temperature, max tokens
`embeddings`	Model (all-MiniLM-L6-v2), chunk size, batch size
`translation`	Max iterations, example count, Python style preferences
`validation`	Syntax check, type check, linting options

Environment Variables

Variable	Description	Default
`OLLAMA_HOST`	Ollama server URL	`http://localhost:11434`
`C2PY_CONFIG`	Path to config file	`config.yaml`

Library Mappings

Built-in mappings for common C libraries:

C Library	Python Equivalent
`stdlib.h`	Python builtins, `os`, `sys`
`ncurses.h`	`curses`, `rich`, `textual`
`pthread.h`	`threading`, `concurrent.futures`
`string.h`	Python string methods
`math.h`	`math` module

Custom mappings can be added in mappings/ directory.

Project Structure

c2py/
├── src/
│   ├── parser/        # tree-sitter AST parsing (C and C++)
│   ├── analysis/      # Dependency graph building
│   ├── indexing/      # Semantic embeddings & FAISS
│   ├── translation/   # RAG translation engine
│   ├── validation/    # Python code validation
│   └── utils/         # Config, logging, CLI utilities
├── scripts/           # CLI pipeline tools
├── profiles/          # Project-specific configurations
├── mappings/          # Custom library mappings
├── tests/             # Unit tests
├── data/              # Generated cache (gitignored)
└── output/            # Translated Python (gitignored)

Development

# Run tests
pytest

# Run with coverage
pytest --cov=src tests/

# Format code
black src/ tests/ scripts/
isort src/ tests/ scripts/

# Type checking
mypy src/

Limitations

Complex macro expansion may require manual adjustment
Platform-specific code needs review
Pointer arithmetic is translated to Python idioms (verify correctness)
Performance-critical sections may need optimization
Deeply interconnected code may exceed context window

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

License

Apache License 2.0 - See LICENSE for details.

Author

Lex Gaines

Acknowledgments

tree-sitter for robust C/C++ parsing
sentence-transformers for semantic embeddings
FAISS for efficient similarity search
Ollama for local LLM inference
NetworkX for dependency graph analysis

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ccpPy - RAG-Based C/C++ to Python Translator

Features

Requirements

Installation

Quick Start

1. Configure Your Project

2. Run the Pipeline

CLI Reference

Parse C/C++ Code

Build Dependency Graph

Create Semantic Index

Translate Code

Architecture

Pipeline Phases

How It Works

Configuration

Environment Variables

Library Mappings

Project Structure

Development

Limitations

Contributing

License

Author

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
mappings		mappings
profiles		profiles
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.yaml		config.yaml
requirements.txt		requirements.txt
setup_project.py		setup_project.py

Folders and files

Latest commit

History

Repository files navigation

ccpPy - RAG-Based C/C++ to Python Translator

Features

Requirements

Installation

Quick Start

1. Configure Your Project

2. Run the Pipeline

CLI Reference

Parse C/C++ Code

Build Dependency Graph

Create Semantic Index

Translate Code

Architecture

Pipeline Phases

How It Works

Configuration

Environment Variables

Library Mappings

Project Structure

Development

Limitations

Contributing

License

Author

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages