Code Duplication Detector

An embedding-based code duplication detector for C# codebases. It uses Tree-sitter to parse source files, vector embeddings to measure semantic similarity, and SQLite with sqlite-vec to store and query the embeddings.

Features

Tree-sitter C# parsing: Extracts methods, control structures, lambdas, and other code blocks
Semantic similarity: Uses embedding models to find semantically similar code, not just text matches
Sliding window: Handles large methods by splitting them into overlapping windows
Clustering: Groups similar code chunks into clusters
Multiple output formats: Console and JSON reports
MCP server: Exposes functionality via Model Context Protocol for AI agent integration
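
To illustrate the sliding-window feature, here is a minimal sketch of splitting a long token sequence into overlapping windows. The function name `sliding_windows` is hypothetical, and the defaults are borrowed from the `window_size` and `stride` values in the sample config.json; how the project tokenizes code and applies those settings internally is an assumption.

```python
def sliding_windows(tokens, window_size=100, stride=50):
    """Split a token list into overlapping windows.

    Returns a list of (start_index, window_tokens) pairs. Hypothetical
    helper, not the project's actual chunker.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")

    # Short inputs fit in a single window and need no splitting.
    if len(tokens) <= window_size:
        return [(0, list(tokens))]

    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + window_size]))
        # Stop once the current window reaches the end of the input.
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows


if __name__ == "__main__":
    # Stand-in tokens; a real chunker would tokenize C# source instead.
    tokens = [f"t{i}" for i in range(230)]
    for start, window in sliding_windows(tokens):
        print(start, len(window))
```

With a stride of half the window size, each token in the middle of a method appears in two windows, so a duplicated fragment that straddles one window boundary still falls entirely inside a neighbouring window.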

Installation

Prerequisites

Python 3.10+
C compiler (MSVC Build Tools on Windows, gcc/clang on Linux/macOS)

Setup

Clone the repository:

git clone <repository-url>
cd code_duplication_detector

Install dependencies:

pip install -r requirements.txt

Add the Tree-sitter C# grammar as a submodule:

git submodule add https://github.com/tree-sitter/tree-sitter-c-sharp.git vendor/tree-sitter-c-sharp
git submodule update --init --recursive

Build the grammar:

python scripts/build_grammar.py

Set up your API key:

# For OpenAI (default)
export OPENAI_API_KEY=your-api-key

# Or for Voyage AI
export VOYAGE_API_KEY=your-api-key

Usage

Command Line Interface

Scan a directory:

python cli.py scan /path/to/csharp/project

Generate a report:

python cli.py report

Generate a JSON report:

python cli.py report --json --output report.json

Adjust similarity threshold:

python cli.py report --threshold 0.90

List all chunks:

python cli.py list

Clear the database:

python cli.py clear --force

MCP Server

Start the MCP server for AI agent integration:

python cli.py mcp

Available MCP tools:

scan_directory: Scan a C# codebase for code chunks
generate_report: Generate a duplication report
list_chunks: List all stored chunks
clear_embeddings: Clear the database
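
For orientation, MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for `scan_directory`. The helper name `build_tool_call` and the argument name `path` are hypothetical; the actual argument names are defined by the tool schema in mcp/server.py and should be checked there.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


if __name__ == "__main__":
    # "path" is an assumed argument name, used only for illustration.
    request = build_tool_call(
        1, "scan_directory", {"path": "/path/to/csharp/project"}
    )
    print(json.dumps(request, indent=2))
```

In practice an MCP client library performs the initialization handshake and sends these messages itself, so a request like this rarely needs to be written by hand.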

Configuration

Edit config.json to customize behavior:

{
  "embedding_provider": "openai",
  "similarity_threshold": 0.85,
  "database_path": "data/embeddings.db",
  "max_tokens": 200,
  "window_size": 100,
  "stride": 50,
  "min_tokens": 20
}
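
A small sketch of how these settings could be loaded and sanity-checked before a scan. The names `DEFAULTS`, `load_config`, and `validate_config` are hypothetical, and the checks encode assumptions about what the values mean (a cosine-similarity threshold, and a sliding window described by `window_size` and `stride`); they are not taken from the project's code.

```python
import json
from pathlib import Path

# Values copied from the sample config.json above.
DEFAULTS = {
    "embedding_provider": "openai",
    "similarity_threshold": 0.85,
    "database_path": "data/embeddings.db",
    "max_tokens": 200,
    "window_size": 100,
    "stride": 50,
    "min_tokens": 20,
}


def load_config(path="config.json"):
    """Return DEFAULTS overlaid with whatever the file at `path` defines."""
    config = dict(DEFAULTS)
    config_file = Path(path)
    if config_file.exists():
        config.update(json.loads(config_file.read_text(encoding="utf-8")))
    return config


def validate_config(config):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    # Assumption: the threshold is a cosine-similarity cut-off.
    if not 0.0 < config["similarity_threshold"] <= 1.0:
        problems.append("similarity_threshold should be in (0, 1]")
    # Assumption: a stride larger than the window skips tokens between windows.
    if config["stride"] > config["window_size"]:
        problems.append("stride should not exceed window_size")
    return problems


if __name__ == "__main__":
    cfg = load_config()
    for problem in validate_config(cfg):
        print("config warning:", problem)
```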

Output Example

Found 3 cluster(s) with 8 total occurrences.

=== Cluster #1 (3 occurrences) ===
Similarity: 0.90-0.96

1. src/Services/InvoiceService.cs:12-38
2. src/Controllers/BillingController.cs:55-80
3. src/Utils/InvoiceHelpers.cs:100-130

=== Cluster #2 (2 occurrences) ===
Similarity: 0.88-0.92

1. src/Models/User.cs:45-60
2. src/Models/Admin.cs:30-45

Project Structure

/code_duplication_detector
  /chunking
    chunk.py           # Chunk dataclass
    csharp_chunker.py  # Tree-sitter C# chunker
  /embedding
    provider.py        # Abstract embedding provider
    voyageai_provider.py  # Voyage AI and OpenAI providers
  /repository
    embedding_store.py # SQLite + sqlite-vec storage
  /similarity
    similarity_engine.py  # Cosine similarity computation
    clusterer.py       # Threshold-based clustering
  /reporting
    console_reporter.py   # Console output
    json_reporter.py      # JSON output
  /mcp
    server.py          # MCP server implementation
  /scripts
    build_grammar.py   # Grammar build script
  cli.py               # Command-line interface
  config.json          # Configuration file

Testing

Run the test suite:

pytest tests/

Run with coverage:

pytest tests/ --cov=. --cov-report=html

How It Works

1. Chunking: The Tree-sitter parser extracts code blocks (methods, control structures, etc.) from C# files
2. Embedding: Each chunk's source code is converted to a vector embedding using an AI model
3. Storage: Chunks and embeddings are stored in SQLite with sqlite-vec for efficient vector operations
4. Similarity: Cosine similarity is computed between all chunk pairs
5. Clustering: Similar chunks are grouped using union-find with threshold filtering
6. Reporting: Results are formatted for console or JSON output
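
The similarity and clustering steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in similarity_engine.py or clusterer.py: the function names are hypothetical, embeddings are plain lists of floats, and all pairs are compared in memory rather than through sqlite-vec.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat zero vectors as dissimilar to everything.
    return dot / (norm_a * norm_b)


def cluster_by_threshold(embeddings, threshold=0.85):
    """Group vector indices whose pairwise similarity meets the threshold.

    Uses union-find, so clusters are the connected components of the
    "similar enough" graph. Singletons are dropped.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    for i, j in combinations(range(len(embeddings)), 2):
        if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
            union(i, j)

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) > 1]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real embeddings.
    vectors = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0], [0.1, 0.99], [-1.0, 0.0]]
    print(cluster_by_threshold(vectors))  # [[0, 1], [2, 3]]
```

Because union-find merges transitively, two chunks can land in the same cluster through a chain of similar neighbours even when their own pairwise similarity is below the threshold, which is one reason a report would show a similarity range per cluster rather than a single value.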

License

MIT License
