An embedding-based code duplication detector for C# codebases. Uses Tree-sitter for code parsing, vector embeddings for semantic similarity, and SQLite with sqlite-vec for efficient storage and retrieval.
- Tree-sitter C# parsing: Extracts methods, control structures, lambdas, and other code blocks
- Semantic similarity: Uses embedding models to find semantically similar code, not just text matches
- Sliding window: Handles large methods by splitting them into overlapping windows
- Clustering: Groups similar code chunks into clusters
- Multiple output formats: Console and JSON reports
- MCP server: Exposes functionality via Model Context Protocol for AI agent integration
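
As a rough illustration of the sliding-window feature, the sketch below splits a token list into overlapping windows. It is not the project's actual chunker: the function name `sliding_windows` is hypothetical, the input is assumed to be already tokenized, the defaults simply mirror the `window_size` and `stride` values shown in config.json, and aligning the last window to the end of the input is an assumption.

```python
from typing import List, Sequence


def sliding_windows(
    tokens: Sequence[str], window_size: int = 100, stride: int = 50
) -> List[List[str]]:
    """Split tokens into overlapping windows (hypothetical helper, not the tool's API)."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    # Short inputs fit in a single window and are returned unchanged.
    if len(tokens) <= window_size:
        return [list(tokens)]
    windows = []
    start = 0
    # Emit full windows while at least one token remains beyond the current one.
    while start + window_size < len(tokens):
        windows.append(list(tokens[start:start + window_size]))
        start += stride
    # Assumption: the final window is aligned to the end so no trailing tokens are dropped.
    windows.append(list(tokens[len(tokens) - window_size:]))
    return windows


tokens = [str(i) for i in range(250)]
windows = sliding_windows(tokens)
print(len(windows))  # 4
```

With a stride smaller than the window size, neighbouring windows share tokens, so a duplicated fragment that straddles a window boundary still appears whole in at least one window.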
- Python 3.10+
- C compiler (MSVC Build Tools on Windows, gcc/clang on Linux/macOS)
- Clone the repository:
git clone <repository-url>
cd code_duplication_detector
- Install dependencies:
pip install -r requirements.txt
- Add the Tree-sitter C# grammar as a submodule:
git submodule add https://github.com/tree-sitter/tree-sitter-c-sharp.git vendor/tree-sitter-c-sharp
git submodule update --init --recursive
- Build the grammar:
python scripts/build_grammar.py
- Set up your API key:
# For OpenAI (default)
export OPENAI_API_KEY=your-api-key
# Or for Voyage AI
export VOYAGE_API_KEY=your-api-key

Scan a directory:
python cli.py scan /path/to/csharp/project

Generate a report:
python cli.py report

Generate JSON report:
python cli.py report --json --output report.json

Adjust similarity threshold:
python cli.py report --threshold 0.90

List all chunks:
python cli.py list

Clear the database:
python cli.py clear --force

Start the MCP server for AI agent integration:
python cli.py mcp

Available MCP tools:
- scan_directory: Scan a C# codebase for code chunks
- generate_report: Generate a duplication report
- list_chunks: List all stored chunks
- clear_embeddings: Clear the database

Edit config.json to customize behavior:
{
  "embedding_provider": "openai",
  "similarity_threshold": 0.85,
  "database_path": "data/embeddings.db",
  "max_tokens": 200,
  "window_size": 100,
  "stride": 50,
  "min_tokens": 20
}

Found 3 cluster(s) with 8 total occurrences.
=== Cluster #1 (3 occurrences) ===
Similarity: 0.90-0.96
1. src/Services/InvoiceService.cs:12-38
2. src/Controllers/BillingController.cs:55-80
3. src/Utils/InvoiceHelpers.cs:100-130
=== Cluster #2 (2 occurrences) ===
Similarity: 0.88-0.92
1. src/Models/User.cs:45-60
2. src/Models/Admin.cs:30-45

/code_duplication_detector
  /chunking
    chunk.py                # Chunk dataclass
    csharp_chunker.py       # Tree-sitter C# chunker
  /embedding
    provider.py             # Abstract embedding provider
    voyageai_provider.py    # Voyage AI and OpenAI providers
  /repository
    embedding_store.py      # SQLite + sqlite-vec storage
  /similarity
    similarity_engine.py    # Cosine similarity computation
    clusterer.py            # Threshold-based clustering
  /reporting
    console_reporter.py     # Console output
    json_reporter.py        # JSON output
  /mcp
    server.py               # MCP server implementation
  /scripts
    build_grammar.py        # Grammar build script
  cli.py                    # Command-line interface
  config.json               # Configuration file

Run the test suite:
pytest tests/

Run with coverage:
pytest tests/ --cov=. --cov-report=html

- Chunking: The Tree-sitter parser extracts code blocks (methods, control structures, etc.) from C# files
- Embedding: Each chunk's source code is converted to a vector embedding using an AI model
- Storage: Chunks and embeddings are stored in SQLite with sqlite-vec for efficient vector operations
- Similarity: Cosine similarity is computed between all chunk pairs
- Clustering: Similar chunks are grouped using union-find with threshold filtering
- Reporting: Results are formatted for console or JSON output
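
The similarity and clustering steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in similarity_engine.py or clusterer.py: the function names are hypothetical, embeddings are assumed to be plain lists of floats, pairs are assumed to be linked when their similarity is greater than or equal to the threshold, and singleton groups are assumed to be dropped from the result.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_threshold(
    vectors: Sequence[Sequence[float]], threshold: float = 0.85
) -> List[List[int]]:
    """Group vector indices with union-find (hypothetical helper, not the tool's API)."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        # Walk up to the root, shortening the path along the way (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    # Compare every pair once and link the pairs that meet the threshold.
    for i, j in combinations(range(len(vectors)), 2):
        if cosine_similarity(vectors[i], vectors[j]) >= threshold:
            union(i, j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    # Assumption: a chunk with no similar partner is not reported as a cluster.
    return [members for members in groups.values() if len(members) > 1]


vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99], [1.0, 1.0]]
print(cluster_by_threshold(vectors))  # [[0, 1], [2, 3]]
```

Because union-find merges transitively, two chunks can land in the same cluster through a chain of similar neighbours even when their own similarity is below the threshold. That would be consistent with a cluster being reported with a similarity range rather than a single value, though the tool's exact behaviour may differ.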
MIT License