# 08. CLONE DETECTION

- Introduction
- Approaches
- LLMs
- Exercise
- References

# 1. Introduction

## What is a Code Clone

A **code clone** is a fragment of source code that is structurally or functionally similar to another fragment.

To understand clones, it's helpful to consider the relationship between **syntax** (the structure of the code) and **semantics** (its meaning or behavior).

Relationship between syntax and semantics:
- _Same syntax_ implies _same semantics_.
- _Different syntax_ in regular code typically leads to _different semantics_.
- _Similar syntax_ with _different semantics_ often indicates a bug or anti-pattern.
- _Different syntax_ with _identical semantics_ defines a code clone.

The most common source of code clones is the "copy-paste-modify" practice, where a developer duplicates a code fragment to reuse its logic and then makes minor adaptations.


## Types of Code Clones

| Type | Name | Description |
| :--- | :--- | :--- |
| **Type-1** | Exact Clones | Code is identical, ignoring differences in whitespace, comments, and layout. |
| **Type-2** | Renamed / Parameterized Clones | Structurally identical but with different identifier names and literal values. (Includes Type-1 differences). |
| **Type-3** | Near-Miss / Gapped Clones | Syntactically similar but with added, modified, or removed statements. (Includes Type-1 & Type-2 differences). |
| **Type-4** | Semantic Clones | Implement the same functionality using different syntax and code structure. |


> Examples:
> ![types](./res/08_clone_types.png)
>
> ![types](./res/08_equivalent_transformations.png)
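
The four types can also be illustrated with a small, made-up Python example. The fragments below are hypothetical and not taken from any benchmark. The function names differ only so that all fragments can live in one file; a real Type-1 clone would normally keep the original name.

```python
# Reference fragment: sum of the even numbers in a list.
def sum_even(values):
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v
    return total


# Type-1: identical apart from layout and comments.
def sum_even_t1(values):
    total = 0  # running total
    for v in values:
        if v % 2 == 0: total += v
    return total


# Type-2: same structure, renamed identifiers.
def add_evens(items):
    acc = 0
    for x in items:
        if x % 2 == 0:
            acc += x
    return acc


# Type-3: renamed identifiers plus an added statement.
def add_evens_checked(items):
    acc = 0
    for x in items:
        if x is None:  # added statement
            continue
        if x % 2 == 0:
            acc += x
    return acc


# Type-4: same behavior, different syntax and structure.
def sum_even_functional(values):
    return sum(v for v in values if v % 2 == 0)
```

Note that the Type-3 fragment is only similar to the original, not equivalent: it additionally tolerates `None` entries.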

## Benefits of Code Clone Detection

While code cloning can be useful in the short term, it often creates significant challenges for software maintenance and development. Detecting these clones provides several key benefits:

- _Efficient Bug Fixes_: When a bug is discovered in a code fragment, clone detection ensures that all similar fragments are identified and checked for the same bug, preventing it from persisting in the system.
- _Simplified Maintenance_: By identifying duplicated code, developers can reduce redundant work during improvements or adaptations, as a change only needs to be made once.
- _Improved Code Quality_: Clone detection facilitates refactoring by highlighting opportunities to eliminate redundancy, leading to a cleaner, more manageable codebase.
- _Data Set Preparation_: In software research, clone detection is a standard process for deduplicating code corpora to create high-quality datasets for analysis and machine learning.

## How many clones are there

Reusing code fragments by copying and pasting with little or no adaptation is common in software development.

As a result, software systems often contain clones. Studies commonly report that 7% to 23% of the code in a typical software system has been cloned, and figures for individual systems can be higher:

- Linux: 22.3% ([Sheneamer and Kalita, 2016](https://www.ijcaonline.org/archives/volume137/number10/24308-2016908896))
- JDK: 29% ([Kamiya et al, 2002](https://www.cs.drexel.edu/~spiros/teaching/CS675/papers/clone-kamiya.pdf))

According to Lopes et al (2017), 70% of the code on GitHub consists of clones of previously created files.

## Code duplication map ([Lopes et al, 2017](http://janvitek.org/pubs/oopsla17b.pdf))

- X-axis: number of files per project
- Y-axis: number of commits per project

The number in each square is the percentage of duplicate files for all projects in that square.

![dejavu](./res/08_dejavu_tiles_1.png)
![dejavu](./res/08_dejavu_tiles_2.png)

## Datasets

- [BigCloneBench](https://github.com/clonebench/BigCloneBench) --- includes over 8M clones from 25K Java repositories
- [BigCloneEval](https://github.com/jeffsvajlenko/BigCloneEval) --- a framework for testing different clone detection methods

# 2. Approaches

## Text-based


Text-based clone detection uses string comparison and generally follows these steps:

1.  _Normalization:_ The source code is normalized by removing non-functional elements such as whitespace and comments, and sometimes by standardizing identifier names.
2.  _Hashing:_ A hash function is applied to each normalized line or predefined code segment to generate a compact fingerprint for efficient comparison.
3.  _Matrix Construction:_ A similarity matrix is constructed where each cell $(i, j)$ is set to $1$ if the hash values for code units $i$ and $j$ are identical, and $0$ otherwise.
4.  _Pattern Analysis:_ The matrix is analyzed for patterns. Contiguous runs of $1$s along a diagonal correspond to sequences of identical code units, indicating potential clones.
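
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from any particular tool: the normalizer is deliberately crude (it strips `#` comments and all whitespace, and would also cut a `#` inside a string literal), and the function names and the `min_len` parameter are assumptions made for this example.

```python
import hashlib
import re


def normalize(line):
    """Strip '#' comments and all whitespace (a deliberately crude normalizer)."""
    line = re.sub(r"#.*", "", line)
    return re.sub(r"\s+", "", line)


def fingerprints(source):
    """Hash every non-empty normalized line."""
    lines = [normalize(raw) for raw in source.splitlines()]
    return [hashlib.sha256(l.encode()).hexdigest() for l in lines if l]


def similarity_matrix(a, b):
    """matrix[i][j] is 1 when unit i of `a` and unit j of `b` have equal hashes."""
    return [[1 if x == y else 0 for y in b] for x in a]


def diagonal_runs(matrix, min_len=2):
    """Return (i, j, length) for maximal diagonal runs of 1s of at least min_len."""
    runs = []
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    for i in range(rows):
        for j in range(cols):
            if not matrix[i][j]:
                continue
            # Skip cells that continue a run that started further up-left.
            if i > 0 and j > 0 and matrix[i - 1][j - 1]:
                continue
            length = 0
            while (i + length < rows and j + length < cols
                   and matrix[i + length][j + length]):
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


file_a = "x = read()   # input\ny = x * 2\nprint(y)\n"
file_b = "setup()\nx = read()\ny = x*2\nprint(y)\n"
matrix = similarity_matrix(fingerprints(file_a), fingerprints(file_b))
print(diagonal_runs(matrix))  # [(0, 1, 3)]
```

The reported run says that three consecutive lines starting at line 0 of the first file match three lines starting at line 1 of the second file.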

> Example:
>
> ([Ducasse et al, 1999](https://ieeexplore.ieee.org/document/792593))
>
> ![](./res/08_clones_text.png)

## Token-based

Instead of comparing raw characters, token-based methods compare sequences of tokens.


In token-based approaches, each line of source code is divided into tokens according to the lexical rules of the programming language of interest. Together, the tokens form a token sequence used for comparison. All whitespace (including line breaks and tabs) and comments between tokens are removed from the token sequences.

For example, if code fragments differ only in variable names, the different names can be mapped to the same identifier token and therefore do not interfere with the comparison.
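
As a rough illustration of this idea (and not of how CCFinder itself works), the sketch below uses Python's standard `tokenize` module to turn source code into a normalized token sequence. The placeholder names `ID` and `LIT` are assumptions made for this example. Because indentation is syntactically significant in Python, indent and dedent events are kept as symbolic tokens while the actual amount of whitespace is discarded.

```python
import io
import keyword
import tokenize

SKIPPED = (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE, tokenize.ENDMARKER)


def normalized_tokens(source):
    """Tokenize Python source, abstracting identifiers and literals."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue  # comments and line breaks are not part of the sequence
        if tok.type in (tokenize.INDENT, tokenize.DEDENT):
            result.append(tokenize.tok_name[tok.type])  # keep block structure only
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")  # every identifier maps to the same token
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append("LIT")  # every literal maps to the same token
        else:
            result.append(tok.string)  # keywords and operators stay as they are
    return result


a = "total = price * 2  # double\n"
b = "sum_ = cost*3\n"
print(normalized_tokens(a))  # ['ID', '=', 'ID', '*', 'LIT']
print(normalized_tokens(a) == normalized_tokens(b))  # True
```

With this normalization, Type-1 and Type-2 differences disappear from the token sequence, so the comparison reduces to finding common subsequences.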

More:
- [CCFinder](https://www.cs.drexel.edu/~spiros/teaching/CS675/papers/clone-kamiya.pdf)
- [CCFinderX](https://github.com/gpoo/ccfinderx)

## AST-based



This method compares code by analyzing its Abstract Syntax Tree (AST). The process typically follows these steps:

1. _Parse the Code:_ An Abstract Syntax Tree is built from the source code, capturing its syntactic structure.
2. _Extract Subtrees:_ Potential code segments are identified by extracting subtrees from the main AST.
3. _Generate Hashes:_ A hash function is applied to each subtree to create a compact fingerprint for efficient comparison.
4. _Compare Candidates:_ Subtrees with matching hash values are flagged as potential clones and undergo a detailed, structural comparison to confirm the match.
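
A minimal sketch of these steps with Python's standard `ast` module is shown below. It is an illustration under simplifying assumptions: only function definitions are considered as subtrees, and identifiers and constants are replaced with placeholders before hashing so that renamed (Type-2) clones fall into the same bucket. The helper names are hypothetical.

```python
import ast
import copy
import hashlib
from collections import defaultdict


class _Normalizer(ast.NodeTransformer):
    """Replace identifiers and constants with placeholders."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value=0), node)


def subtree_hash(node):
    """Hash the normalized structure of a subtree (the original is not modified)."""
    normalized = _Normalizer().visit(copy.deepcopy(node))
    return hashlib.sha256(ast.dump(normalized).encode()).hexdigest()


def clone_candidates(source):
    """Group function names whose normalized subtrees hash to the same value."""
    buckets = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            buckets[subtree_hash(node)].append(node.name)
    return [names for names in buckets.values() if len(names) > 1]


source = """
def area(w, h):
    return w * h + 1

def volume(a, b):
    return a * b + 2

def diff(a, b):
    return a - b
"""
print(clone_candidates(source))  # [['area', 'volume']]
```

Since the hash here is computed over the whole normalized subtree, a match already implies structural equality up to hash collisions; a real tool would still run the detailed comparison from step 4 and would usually consider smaller subtrees as well.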

![](https://leanovate.github.io/bedcon/talk/abstract_syntax_tree.png)

## Data-flow-based

Token-based and syntax-based clone detection methods depend on the order in which program statements are written.
If a developer reorders statements in the copied code, these methods may no longer detect it as a clone.

However, statements cannot be reordered arbitrarily without changing the meaning of the program.
A program dependence graph (PDG) represents only the control and data dependencies between statements, so it is insensitive to reorderings that preserve those dependencies.

![](https://www.researchgate.net/profile/Sergey-Troshin/publication/358740900/figure/fig1/AS:11431281080328895@1661254252291/Example-of-Data-Flow-Graph-from-GraphCodeBERT-Guo-et-al-2021_W640.jpg)

Clones can be identified as isomorphic subgraphs of the program dependence graph.

Finding isomorphic subgraphs is NP-hard in general, so approximate algorithms are used in practice.

> Isomorphic graphs
>
> ![](./res/08_isomorphic_graphs.png)
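
To see why dependencies are more robust than statement order, consider the much-simplified sketch below. It is not a subgraph-isomorphism algorithm and does not build a full PDG: it assumes straight-line code consisting only of `name = expression` statements with binary operators, and describes each statement by the data dependencies of its operands. All helper names are hypothetical.

```python
import ast


def _signature(node, defined):
    """Describe an expression by its operators and what its operands depend on."""
    if isinstance(node, ast.Name):
        # A variable is replaced by the signature of the statement that defined it;
        # variables never assigned in the fragment are treated as inputs.
        return defined.get(node.id, "input")
    if isinstance(node, ast.Constant):
        return "const"
    if isinstance(node, ast.BinOp):
        left = _signature(node.left, defined)
        right = _signature(node.right, defined)
        return "(%s %s %s)" % (left, type(node.op).__name__, right)
    raise ValueError("unsupported expression in this sketch")


def dependence_signatures(source):
    """Return an order-independent description of a straight-line fragment."""
    defined = {}  # variable name -> signature of its latest definition
    signatures = []
    for stmt in ast.parse(source).body:
        simple = (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                  and isinstance(stmt.targets[0], ast.Name))
        if not simple:
            raise ValueError("only `name = expression` statements are supported")
        sig = _signature(stmt.value, defined)
        defined[stmt.targets[0].id] = sig
        signatures.append(sig)
    return sorted(signatures)


original = "a = x + 1\nb = y * 2\nc = a - b\n"
reordered = "q = v * 2\np = u + 1\nr = p - q\n"  # statements swapped and renamed
different = "q = v * 2\np = u + 1\nr = q - p\n"  # operands of the last step swapped

print(dependence_signatures(original) == dependence_signatures(reordered))  # True
print(dependence_signatures(original) == dependence_signatures(different))  # False
```

The reordered fragment is recognized because swapping two independent statements does not change any dependency, while swapping the operands of the subtraction does.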

## Image-based

[Ragkhitwetsagul et al - A picture is worth a thousand words Code clone detection based on image similarity 2018](https://ieeexplore.ieee.org/document/8327318)

![scheme](./res/08_clone_detection_image_scheme.png)

Approach:

1. Preprocess the code:
   - parse the code and extract methods
   - remove comments
   - pretty-print the code
   - convert it to HTML with syntax highlighting
2. Render images:
   - convert each method to a PNG image
   - obtain the RGB image
3. Process the images:
   - convert each image to its negative
   - apply filters (for example, a Gaussian filter)
4. Compare the images:
   - compare two images, for example with the **Earth Mover’s Distance (EMD)**, a metric that treats image comparison as an optimal transportation problem and finds the minimum cost of transforming one distribution into the other
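
To give an intuition for the Earth Mover's Distance, the sketch below computes it for the one-dimensional special case, where the distance between two histograms over the same bins equals the sum of absolute differences of their cumulative distributions. This is a simplification for illustration only: the histograms (for example, pixel-intensity counts) are made up, and the cited paper works with rendered images rather than 1-D histograms.

```python
from itertools import accumulate


def normalize_hist(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive total mass")
    return [h / total for h in hist]


def emd_1d(hist_a, hist_b):
    """Earth Mover's Distance between two 1-D histograms over the same bins.

    Moving one unit of mass by one bin costs 1.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    cdf_a = accumulate(normalize_hist(hist_a))
    cdf_b = accumulate(normalize_hist(hist_b))
    return sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))


print(emd_1d([1, 0, 0], [1, 0, 0]))  # 0.0
print(emd_1d([1, 0, 0], [0, 1, 0]))  # 1.0
print(emd_1d([1, 0, 0], [0, 0, 1]))  # 2.0
```

Unlike a bin-by-bin comparison, the distance grows with how far the mass has to be moved, which is what makes it tolerant to small shifts.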

![](./res/08_clone_detection_image_1.png)
![](./res/08_clone_detection_image_2.png)
![](./res/08_clone_detection_image_3.png)

## Embedding-based

Code fragments are mapped to vectors (embeddings) and compared in a metric space. The vectors are built either from hand-crafted code metrics or with pre-trained models (BERT, RoBERTa, CodeBERT, GraphCodeBERT, ...).

**Contrastive learning** is a machine learning approach that trains a model to distinguish similar objects from dissimilar ones.
The model learns to build embeddings such that similar objects have nearby vectors and dissimilar objects have distant ones.
The decision whether two programs are clones is then made based on the distance between their vectors.
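
The decision step can be sketched as follows, assuming cosine similarity as the measure and a fixed threshold. Both choices are assumptions made for this example: the vectors are made up rather than produced by a real encoder, and the threshold of 0.9 is arbitrary and would have to be tuned on labeled data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def is_clone(u, v, threshold=0.9):
    """Treat two fragments as clones when their embeddings are close enough."""
    return cosine_similarity(u, v) >= threshold


emb_a = [1.0, 0.0]  # hypothetical embedding of fragment A
emb_b = [2.0, 0.0]  # same direction as A, different length
emb_c = [0.0, 1.0]  # orthogonal to A

print(is_clone(emb_a, emb_b))  # True
print(is_clone(emb_a, emb_c))  # False
```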

![](./res/08_contrastive_learning_pipeline.png)

- $o$, $o'$: two different objects
- $f$: an encoder model
- $q$, $k$: embeddings of the objects
- $g$: a projector model specific to the algorithm (in the simplest case, $g$ is the identity function and passes $q$ and $k$ through unchanged)
- $q'$, $k'$: the projected vectors passed to the loss function $L$
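
The text does not fix a particular loss $L$. One common choice in contrastive learning is the InfoNCE loss, sketched below as an assumption rather than as the loss used in the pipeline above. Here `q` plays the role of $q'$, `k_pos` is the projected vector of a similar object, `k_negs` are projected vectors of dissimilar objects, and the `temperature` value is arbitrary.

```python
import math


def info_nce(q, k_pos, k_negs, temperature=0.1):
    """InfoNCE loss for one query, one positive key, and a list of negative keys."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Similarity of the query to the positive key comes first, then the negatives.
    logits = [dot(q, k_pos) / temperature]
    logits += [dot(q, k) / temperature for k in k_negs]

    # Numerically stable log-sum-exp over all logits.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))

    # Negative log-probability of picking the positive key.
    return log_sum - logits[0]


good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])  # positive is aligned with q
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])   # a negative is aligned with q
print(good < bad)  # True
```

Minimizing this loss pulls the query towards its positive key and pushes it away from the negatives, which is the behavior described above.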

More:
- [Zubkov et al - Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection](https://arxiv.org/abs/2206.08726)

# 3. LLMs

1. Embedding-based approaches
2. Generative approaches:
   - Idea: ask the LLM to analyze a pair of code fragments directly.
   - Example prompt: "Are the following two functions semantically equivalent? That is, do they produce the same output for the same input? Function A: ... Function B: ..."
   - Pros: powerful; no need to precompute embeddings for the whole codebase.
   - Cons: computationally expensive for large projects, latency, possible hallucinations.
3. Hybrid approaches: combine fast classical methods (for Type-1 to Type-3 clones) with powerful LLMs (for hard Type-4 clones).
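
A hybrid pipeline of this kind could look roughly like the sketch below. Everything here is an assumption made for illustration: `ask_llm` is a hypothetical callable that sends a prompt to some model and returns its text answer, the textual similarity from the standard `difflib` module stands in for a real classical detector, and the threshold is arbitrary.

```python
from difflib import SequenceMatcher

PROMPT = (
    "Are the following two functions semantically equivalent? "
    "That is, do they produce the same output for the same input? "
    "Answer yes or no.\n\nFunction A:\n{a}\n\nFunction B:\n{b}\n"
)


def detect_clone(code_a, code_b, ask_llm, threshold=0.9):
    """Cheap textual check first; fall back to the LLM only when it is inconclusive."""
    # Whitespace-insensitive similarity over whitespace-separated chunks.
    ratio = SequenceMatcher(None, code_a.split(), code_b.split()).ratio()
    if ratio >= threshold:
        return True  # near-identical text: no need to pay for an LLM call

    # Possibly a Type-4 clone: ask the (hypothetical) LLM for a yes/no verdict.
    answer = ask_llm(PROMPT.format(a=code_a, b=code_b))
    return answer.strip().lower().startswith("yes")


def fake_llm(prompt):
    """Stand-in for a real model call, used only to make the example runnable."""
    return "Yes, both functions add one to their argument."


print(detect_clone("def f(x): return x + 1", "def f(x):\n    return x + 1", fake_llm))
print(detect_clone("def f(x): return x + 1", "g = lambda y: 1 + y", fake_llm))
```

The first pair is accepted by the textual check alone; only the second pair triggers a model call, which keeps cost and latency down on large codebases.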


Strengths and weaknesses of the LLM-based approach:
- Strengths:
    - High accuracy on semantic clones
    - Robustness to refactoring, renaming, and structural changes
    - Support for multiple programming languages
- Weaknesses:
    - Significant computational cost
    - "Black box" behavior: it is harder to interpret why two fragments were judged to be clones
    - Dependence on the quality of training data and prompts

## [Zhu et al - An Empirical Study of LLM-Based Code Clone Detection](https://arxiv.org/abs/2511.01176)

Goal:
- Investigate two key aspects of using LLMs for code clone detection: **generalizability** (performance across different datasets) and **consistency of responses**.

Methodology:
- Seven datasets of code pairs (clones and non-clones) in Java, C, C++, and Python were built from CodeNet (competitive programming) and BigCloneBench (real-world code).
- Five models were evaluated:
    - o3-mini,
    - GPT-4o,
    - GPT-4o-mini,
    - Llama 3.1, and
    - Mistral.
- Several different prompts were tested.
- Precision, recall, F1 score, and response consistency (how stable the model's answer is across repeated queries) were measured.
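
The metrics can be computed as in the sketch below. Precision, recall, and F1 follow their standard definitions. The consistency measure shown here (the share of pairs that receive the same answer in every run) is an assumption for illustration; the paper's exact definition may differ. The labels and predictions are made up.

```python
def precision_recall_f1(predicted, actual):
    """Standard metrics for binary clone / non-clone predictions."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def consistency(runs):
    """Share of pairs answered identically in every run.

    runs[r][i] is the model's answer for code pair i in repetition r.
    """
    per_pair = list(zip(*runs))
    stable = sum(1 for answers in per_pair if len(set(answers)) == 1)
    return stable / len(per_pair)


predicted = [True, True, False, False]
actual = [True, False, True, False]
print(precision_recall_f1(predicted, actual))  # (0.5, 0.5, 0.5)

runs = [
    [True, False, True, True],
    [True, False, False, True],
]
print(consistency(runs))  # 0.75
```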

Findings:

1.  **Generalizability:**
    - LLMs perform very well on data from CodeNet (the best model, `o3-mini`, reached F1=0.943)
    - performance drops sharply on data from BigCloneBench. This means that models that detect clones well in educational or competitive programming tasks struggle with code from real-world projects
    - the models generalize well across programming languages (within CodeNet)
    - the choice of prompt is critical and often affects the result more than the choice of model
2.  **Consistency:**
    - most models (except Llama 3.1) show high response consistency (>90% of answers did not change across repeated queries)
    - the temperature parameter has minimal effect on consistency and accuracy. The choice of prompt matters far more

## [Dou et al - Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey](https://arxiv.org/abs/2308.01191)

Additional sub-categories for Type-3 and Type-4 clones based on their syntactical similarity scores:
- Very Strongly Type-3 (VST3) clones, with similarity scores in the range of [0.9, 1.0)
- Strongly Type-3 (ST3) clones, with similarity scores in the range of [0.7, 0.9)
- Moderately Type-3 (MT3) clones, with similarity scores in the range of [0.5, 0.7)
- Weakly Type-3/Type-4 (WT3/T4) clones, with similarity scores in the range of [0.0, 0.5)
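
The ranges above translate directly into a small lookup function. The function name is hypothetical; how the similarity score itself is computed is not specified here, so the sketch takes it as an input. A score of exactly 1.0 is rejected because it falls outside the listed half-open ranges.

```python
def clone_subcategory(similarity):
    """Map a syntactic similarity score in [0.0, 1.0) to its sub-category."""
    if not 0.0 <= similarity < 1.0:
        raise ValueError("similarity must be in the range [0.0, 1.0)")
    if similarity >= 0.9:
        return "VST3"
    if similarity >= 0.7:
        return "ST3"
    if similarity >= 0.5:
        return "MT3"
    return "WT3/T4"


for score in (0.95, 0.8, 0.6, 0.2):
    print(score, clone_subcategory(score))
```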

Instructions:

![](res/08_prompts.png)

## Can LLMs detect code clones with a simple prompt?

![](res/08_llms_vs_non-llms.png)


> Using open-source LLMs for clone detection yields superior results in identifying Type-3 and Type-4 clone pairs when relying solely on a simple prompt. However, it does exhibit slightly poorer performance when detecting Type-1 and Type-2 clone pairs compared to existing tools. Notably, GPT-3.5-Turbo and GPT-4 stand out with the highest recall and accuracy rates across nearly all clone types.

## How do LLMs perform by using one-step chain-of-thought prompts?

* In one-step prompt engineering, the model is tasked with detecting code clones from various perspectives (i.e. clone type, similarity, and analogous lines of code pair).
* In multi-step prompt engineering, the model initially analyzes each function from multiple perspectives, subsequently integrating all the intermediate reasonings. This approach enables the model to detect code clones with prior knowledge, rather than merely following human instructions to provide a binary "yes" or "no" response.


![](res/08_clone_type_reasoning.png)

![](res/08_similar_line_reasoning.png)

> The clone detection performance of GPT-3.5-Turbo and GPT-4 can be improved by requiring models to provide clone type, similarity, reasoning, and similarity lines. Using one-step chain-of-thought prompts allows the models to analyze code pairs and intermediate reasoning, leading to better clone detection.

## Can LLMs perform better by using multi-step chain-of-thought prompts?

![](res/08_separate_explanation_codes.png)

> The clone detection performance of GPT-3.5-Turbo and GPT-4 can be improved by Multi-Step Chain-of-Thought prompts, including separating explanations and codes. Different from RQ2, separating explanations provide models of independent intermediate reasoning of code, and separating codes provide models of independent explanation of code, which avoid the influences between generated information.

## How do LLMs perform using code embedding?

This question focuses on whether LLMs can provide better results than traditional pre-trained language models when used to produce code embeddings.

![](res/08_similarity_distribution.png)

> Text-embedding-ada-002 is more effective than specialized CodeBERT models in identifying cloned code, exhibiting superior overall performance. The advantage of Text-embedding-ada-002 lies in its capacity to generate a wider range of similarity scores, leading to better discrimination between true and false positives.

## How does the performance of LLMs in code clone detection vary across different programming languages?

![](res/08_different_languages.png)

> The performance of LLMs in code clone detection varies across different programming languages, with a trend of superior results in Python, likely due to its inherent simplicity and prevalence in training data.

# 4. Exercise

Develop a hybrid approach for clone detection that combines embeddings and generative models. Expectations:
- Use an LLM to implement the approach
- Test it on a dataset
- Draw justified conclusions

# 5. References

- Allamanis - The adverse effects of code duplication in machine learning models of code
- Baker - On finding duplication and near-duplication in large software systems
- Bogomolov et al - Sosed: A tool for finding similar software projects
- [Code clone analysis](https://link.springer.com/book/10.1007/978-981-16-1927-4)
- Ducasse et al - A language independent approach for detecting duplicated code
- Gupta and Gupta - Literature survey of clone detection techniques
- Huang et al - Code clone detection based on event embedding and event dependency
- Ivanov et al - AntiCopyPaster: Extracting code duplicates as soon as they are introduced in the IDE
- Khajezade et al - Evaluating few shot and contrastive learning methods for code clone detection
- Kim et al - Revisiting binary code similarity analysis using interpretable feature engineering and lessons learned
- Koschke - Survey of research on software clones
- Lopes et al - DejaVu: A map of code duplicates on GitHub
- Nadim et al - Evaluating the performance of clone detection tools in detecting cloned co-change candidates
- Ragkhitwetsagul - Code similarity and clone search in large-scale source code data
- Ragkhitwetsagul et al - A picture is worth a thousand words: Code clone detection based on image similarity
- Rahman et al - Clone detection on large Scala codebases
- Roy et al - Comparison and evaluation of code clone detection techniques and tools
- Saini et al - Oreo: Detection of clones in the twilight zone
- Sheneamer and Kalita - A survey of software clone detection techniques
- [Svajlenko and Roy - A survey on the evaluation of clone detection performance and benchmarking](https://arxiv.org/abs/2006.15682)
- Svajlenko and Roy - Efficiently measuring an accurate and generalized clone detection precision using clone clustering
- White et al - Deep learning code fragments for code clone detection
- Yahya and Kim - Cross-language source code clone detection using deep learning with InferCode
- Zhang et al - Challenging machine learning-based clone detectors via semantic-preserving code transformations
- Zhang et al - The development and prospect of code clone
- Zubkov et al - Evaluation of contrastive learning with various code representations for code clone detection