mstracha/code-duplication-detector

Code Duplication Detector

An embedding-based code duplication detector for C# codebases. Uses Tree-sitter for code parsing, vector embeddings for semantic similarity, and SQLite with sqlite-vec for efficient storage and retrieval.

Features

  • Tree-sitter C# parsing: Extracts methods, control structures, lambdas, and other code blocks
  • Semantic similarity: Uses embedding models to find semantically similar code, not just text matches
  • Sliding window: Handles large methods by splitting them into overlapping windows
  • Clustering: Groups chunks whose pairwise similarity meets the threshold into duplicate clusters
  • Multiple output formats: Console and JSON reports
  • MCP server: Exposes functionality via Model Context Protocol for AI agent integration

Installation

Prerequisites

  • Python 3.10+
  • C compiler (MSVC Build Tools on Windows, gcc/clang on Linux/macOS)

Setup

  1. Clone the repository:
git clone <repository-url>
cd code_duplication_detector
  2. Install dependencies:
pip install -r requirements.txt
  3. Add the Tree-sitter C# grammar as a submodule:
git submodule add https://github.com/tree-sitter/tree-sitter-c-sharp.git vendor/tree-sitter-c-sharp
git submodule update --init --recursive
  4. Build the grammar:
python scripts/build_grammar.py
  5. Set up your API key:
# For OpenAI (default)
export OPENAI_API_KEY=your-api-key

# Or for Voyage AI
export VOYAGE_API_KEY=your-api-key

Usage

Command Line Interface

Scan a directory:

python cli.py scan /path/to/csharp/project

Generate a report:

python cli.py report

Generate a JSON report:

python cli.py report --json --output report.json

Adjust similarity threshold:

python cli.py report --threshold 0.90

List all chunks:

python cli.py list

Clear the database:

python cli.py clear --force

MCP Server

Start the MCP server for AI agent integration:

python cli.py mcp

Available MCP tools:

  • scan_directory: Scan a C# codebase for code chunks
  • generate_report: Generate a duplication report
  • list_chunks: List all stored chunks
  • clear_embeddings: Clear the database

Configuration

Edit config.json to customize behavior:

{
  "embedding_provider": "openai",
  "similarity_threshold": 0.85,
  "database_path": "data/embeddings.db",
  "max_tokens": 200,
  "window_size": 100,
  "stride": 50,
  "min_tokens": 20
}
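
To illustrate how the token-related settings might interact, here is a minimal sketch of overlapping-window splitting. It is not the project's actual chunker: the helper name split_chunk is hypothetical, and the interpretation of the settings (chunks under min_tokens are skipped, chunks up to max_tokens are kept whole, longer chunks are cut into windows of window_size tokens that advance by stride) is an assumption based on the setting names.

```python
import json

# Token-related values copied from the config.json example above.
CONFIG = json.loads("""
{
  "max_tokens": 200,
  "window_size": 100,
  "stride": 50,
  "min_tokens": 20
}
""")


def split_chunk(tokens, config=CONFIG):
    """Hypothetical helper: split one chunk's tokens into overlapping windows.

    Assumed semantics (not confirmed by the project):
      - fewer than min_tokens   -> chunk is skipped
      - at most max_tokens      -> chunk is kept whole
      - more than max_tokens    -> windows of window_size, advancing by stride
    """
    n = len(tokens)
    if n < config["min_tokens"]:
        return []
    if n <= config["max_tokens"]:
        return [tokens]

    size, stride = config["window_size"], config["stride"]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + size])
        # Stop once the current window reaches the end of the chunk.
        if start + size >= n:
            break
        start += stride
    return windows


if __name__ == "__main__":
    tokens = [f"t{i}" for i in range(260)]
    print([len(w) for w in split_chunk(tokens)])  # [100, 100, 100, 100, 60]
```

With window_size 100 and stride 50, consecutive windows share half of their tokens, so a duplicated fragment that straddles a window boundary still appears whole in at least one window.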

Output Example

Found 2 cluster(s) with 5 total occurrences.

=== Cluster #1 (3 occurrences) ===
Similarity: 0.90-0.96

1. src/Services/InvoiceService.cs:12-38
2. src/Controllers/BillingController.cs:55-80
3. src/Utils/InvoiceHelpers.cs:100-130

=== Cluster #2 (2 occurrences) ===
Similarity: 0.88-0.92

1. src/Models/User.cs:45-60
2. src/Models/Admin.cs:30-45

Project Structure

/code_duplication_detector
  /chunking
    chunk.py           # Chunk dataclass
    csharp_chunker.py  # Tree-sitter C# chunker
  /embedding
    provider.py        # Abstract embedding provider
    voyageai_provider.py  # Voyage AI and OpenAI providers
  /repository
    embedding_store.py # SQLite + sqlite-vec storage
  /similarity
    similarity_engine.py  # Cosine similarity computation
    clusterer.py       # Threshold-based clustering
  /reporting
    console_reporter.py   # Console output
    json_reporter.py      # JSON output
  /mcp
    server.py          # MCP server implementation
  /scripts
    build_grammar.py   # Grammar build script
  cli.py               # Command-line interface
  config.json          # Configuration file

Testing

Run the test suite:

pytest tests/

Run with coverage:

pytest tests/ --cov=. --cov-report=html

How It Works

  1. Chunking: The Tree-sitter parser extracts code blocks (methods, control structures, etc.) from C# files
  2. Embedding: Each chunk's source code is converted to a vector embedding by the configured embedding provider (OpenAI by default, or Voyage AI)
  3. Storage: Chunks and embeddings are stored in SQLite with sqlite-vec for efficient vector operations
  4. Similarity: Cosine similarity is computed between all chunk pairs
  5. Clustering: Similar chunks are grouped using union-find with threshold filtering
  6. Reporting: Results are formatted for console or JSON output
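
The similarity and clustering steps above can be sketched in plain Python. This is an illustrative sketch, not the code in similarity_engine.py or clusterer.py: the function names, the use of >= for the threshold comparison, and the toy two-dimensional vectors are all assumptions.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def cluster(embeddings, threshold=0.85):
    """Group vector indices with union-find (hypothetical helper).

    Any pair whose cosine similarity is >= threshold is merged into the
    same set; sets with a single member are dropped from the result.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        # Walk up to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Compare every pair of chunks (quadratic in the number of chunks).
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
            union(i, j)

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) > 1]


if __name__ == "__main__":
    # Toy vectors standing in for real chunk embeddings.
    vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
    print(cluster(vectors))  # [[0, 1], [2, 3]]
```

Because union-find merges transitively, two chunks can land in the same cluster without being directly similar to each other, as long as a chain of above-threshold pairs connects them. That would explain a reported similarity range whose lower end sits close to the threshold.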

License

MIT License
