# RQ2 

## Setup

* Follow the instructions in ```Readme.md```.
* Inside the *pipeline* folder, run ```python -m src.rq2```. This parses the .json file containing the dataset into a ```bigcodebench_clone_dataset.xml``` file containing all clone pairs (excluding the original code).
* Check out [CloneCognition](https://github.com/pseudoPixels/CloneCognition).
* Paste the ```bigcodebench_clone_dataset.xml``` file into ```input_clone_pairs/``` in the CloneCognition project.
    * Make sure to follow the instructions in the CloneCognition README.
* Run ```python validateClones.py 0.76 input_clone_pairs/ out/```. This produces the file ```bigcodebench_clone_dataset.xml.mlValidated```.
* The file ```bigcodebench_clone_dataset.xml.mlValidated``` is also available in our repo under ```results/RQ2```.

In [3]:
from src.config import DATASET_NAME  # the only config value used in this notebook section
RESULTS = f"../results/RQ2/{DATASET_NAME}_clone_dataset.xml.mlValidated"

## Results

In [37]:
# Summary
from collections import Counter
 
counts = Counter()
total = 0

with open(RESULTS, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip empty lines

        first_field = line.split(",", 1)[0].lower()
        if first_field in {"true", "false"}:
            counts[first_field] += 1
            total += 1

true_count = counts.get("true", 0)
false_count = counts.get("false", 0)

true_pct = (true_count / total * 100) if total > 0 else 0
false_pct = (false_count / total * 100) if total > 0 else 0

print(f"Total entries : {total}")
print(f"Detected clones : {true_count} ({true_pct:.2f}%)")
print(f"Not detected : {false_count} ({false_pct:.2f}%)")


Total entries : 352
Detected clones : 1 (0.28%)
Not detected : 351 (99.72%)


In [None]:
# Extract and print entries marked as true
true_entries = []

with open(RESULTS, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue

        cols = [c.strip() for c in line.split(",")]

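        # Each identifier contains a bracketed list with commas, so a plain
        # comma split spreads it over three columns: cols[1:4] and cols[6:9].
        # Require at least 9 columns so the indexing below cannot fail.
        if cols[0].lower() == "true" and len(cols) >= 9:
            true_entries.append(((cols[1], cols[2], cols[3]), (cols[6], cols[7], cols[8])))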
 
print("Detected clones:")
for id1, id2 in true_entries:
    print(f"{id1}, {id2}")

print(f"\nTotal detected clones: {len(true_entries)}")

Detected clones:
("BigCodeBench/293_zero-shot deepseek-r1:14b-test 1 ['refac_1'", "'refac_3'", "'refac_4']"), ("BigCodeBench/293_zero-shot deepseek-r1:14b-test 1 ['refac_2'", "'refac_6'", "'refac_7']")

Total detected clones: 1
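
The identifiers in the output above are split into three pieces because each one contains a bracketed list with commas. A minimal sketch of a bracket-aware split that keeps each identifier in one piece is shown below. The helper name `split_fields` and the sample line are hypothetical; the sample only imitates the shape of the printed entries (a flag, an identifier, two intermediate fields, a second identifier) and is not real data from the `.mlValidated` file.

```python
def split_fields(line):
    """Split a line on commas, ignoring commas inside square brackets."""
    fields, current, depth = [], [], 0
    for ch in line:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            # Top-level comma: close the current field.
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    fields.append("".join(current).strip())
    return fields


# Hypothetical line modeled on the output shape above (not real data).
sample = "true,task_a ['refac_1', 'refac_3'],1,10,task_a ['refac_2'],1,12"
cols = split_fields(sample)
print(cols[1], "|", cols[4])
```

With this split, and assuming the layout described above, the first identifier is `cols[1]` and the second is `cols[4]`, instead of `cols[1:4]` and `cols[6:9]`.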
