In [None]:
!pip install nbformat

In [2]:
# We need four callable functions from the interactive notebook, so we convert it to a Python file and import them.
# A more seamless option is %run, but this code was already written and the approach works.
!jupyter nbconvert --to python interactive.ipynb


[NbConvertApp] Converting notebook interactive.ipynb to python
[NbConvertApp] Writing 8494 bytes to interactive.py


I implemented a full evaluation framework for plagiarism detection systems using a pre-generated test dataset. I did the following:

- Loaded a labeled test dataset of code samples with positive (plagiarized) and negative (non-plagiarized) examples.
- Built functions to evaluate any detection system, recording predictions, confidence scores, execution time, and errors.
- Calculated standard metrics (accuracy, precision, recall, F1) and confusion matrix components.
- Conducted error analysis, identifying false positives and false negatives, and generated a structured error report.
- Implemented ablation studies to test the effect of parameters such as:
  - Number of retrieved documents (k) in RAG systems
  - Similarity thresholds for embedding-based detection
  - The reciprocal rank fusion (RRF) constant for hybrid RAG
- Created visualization utilities for:
  - Comparing metrics across systems
  - Confusion matrices
  - Ablation study results
  - Trade-off between F1 score and execution time
- Integrated a complete pipeline to run evaluation, error analysis, ablation studies, visualizations, and generate a comprehensive markdown report with recommendations.
- Saved results, figures, and metrics summaries to disk for reproducibility and further analysis.


In [1]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from tqdm import tqdm
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
import time
from interactive import *

Embedding: 100%|██████████| 1/1 [00:01<00:00,  1.16s/it]


{'method': 'embedding', 'is_plagiarized': False, 'max_similarity': 0.4037216901779175, 'threshold': 0.8, 'top_matches': [{'function_name': 'polar_force', 'similarity': 0.4037216901779175, 'file_path': 'data\\repos\\Python\\physics\\in_static_equilibrium.py'}, {'function_name': 'create_canvas', 'similarity': 0.3870880603790283, 'file_path': 'data\\repos\\Python\\cellular_automata\\game_of_life.py'}, {'function_name': 'random_vector', 'similarity': 0.32630038261413574, 'file_path': 'data\\repos\\Python\\linear_algebra\\src\\lib.py'}, {'function_name': 'component', 'similarity': 0.3261561989784241, 'file_path': 'data\\repos\\Python\\linear_algebra\\src\\lib.py'}, {'function_name': 'projection', 'similarity': 0.32339245080947876, 'file_path': 'data\\repos\\Python\\linear_algebra\\src\\transformations_2d.py'}]}
{'method': 'direct_llm', 'is_plagiarized': False, 'confidence': 0.25, 'explanation': 'The QUERY code implements a polar-to-Cartesian conversion function vector_components with an in_

Embedding: 100%|██████████| 1/1 [00:00<00:00,  1.85it/s]


{'method': 'rag', 'is_plagiarized': True, 'confidence': 0.85, 'explanation': "The query vector_components function closely matches REFERENCE 1's polar_force function: both convert polar coordinates to Cartesian components using cos and sin, with a boolean flag to indicate if the input angle is already in radians or needs conversion. The parameter roles map (size/magnitude, direction/angle, in_radians/radian_mode) and the conditional structure are essentially identical, differing mainly in naming. This strongly indicates copying from REFERENCE 1.", 'retrieved_functions': [{'function_name': 'polar_force', 'similarity': 0.403698205947876}, {'function_name': 'create_canvas', 'similarity': 0.3870975375175476}, {'function_name': 'random_vector', 'similarity': 0.3263242244720459}, {'function_name': 'component', 'similarity': 0.32613325119018555}, {'function_name': 'projection', 'similarity': 0.3233739733695984}]}


Embedding: 100%|██████████| 1/1 [00:00<00:00,  1.93it/s]


{'method': 'hybrid_rag', 'is_plagiarized': True, 'confidence': 0.85, 'explanation': "QUERY's vector_components performs polar-to-rectangular conversion using cosine and sine with an optional radians flag. This exactly mirrors REFERENCE 1's polar_force implementation (same math and control flow; only parameter names differ: size/magnitude, direction/angle, in_radians/radian_mode).", 'retrieved_functions': [{'function_name': 'polar_force', 'fused_score': 0.03333333333333333}, {'function_name': 'minCost', 'fused_score': 0.01639344262295082}, {'function_name': 'create_canvas', 'fused_score': 0.01639344262295082}, {'function_name': 'exits_word', 'fused_score': 0.016129032258064516}, {'function_name': 'random_vector', 'fused_score': 0.016129032258064516}]}
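
The `fused_score` values printed for `hybrid_rag` are consistent with reciprocal rank fusion (RRF), where each ranked list contributes 1 / (k + rank) to a function's score. With k = 60 and zero-based ranks, a function ranked first in both lists scores 2/60 ≈ 0.0333, and a function that appears in only one list at the second or third position scores 1/61 ≈ 0.01639 or 1/62 ≈ 0.01613, which matches the output above. The sketch below reproduces those numbers. The two input rankings and the name `rrf_fuse` are assumptions made for illustration, since the fusion code in `interactive.py` is not shown here.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists with reciprocal rank fusion (zero-based ranks)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, name in enumerate(ranking):
            scores[name] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by name for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical rankings: one from embedding search, one from keyword search
dense_ranking = ["polar_force", "create_canvas", "random_vector"]
sparse_ranking = ["polar_force", "minCost", "exits_word"]

fused = rrf_fuse([dense_ranking, sparse_ranking])
for name, score in fused:
    print(f"{name:<15} {score:.6f}")
```

A larger k shrinks the gap between adjacent ranks, which is the effect the RRF ablation later in this notebook is meant to probe.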


In [2]:
TEST_DATASET_PATH = Path("data/test_dataset.json")
RESULTS_DIR = Path("results")
RESULTS_DIR.mkdir(exist_ok=True)

In [3]:
def load_test_dataset(path=TEST_DATASET_PATH):
    """Load pre-generated test dataset"""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    df = pd.DataFrame(data)
    print(f"✅ Loaded {len(df)} test cases")
    print(f"   Positive (plagiarized): {(df['label'] == 1).sum()}")
    print(f"   Negative (not plagiarized): {(df['label'] == 0).sum()}")

    return df
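
`load_test_dataset` and the evaluation loop read two fields from every record: `code` (the snippet passed to the detector) and `label` (1 for plagiarized, 0 for not). Below is a minimal sketch of a check that could run before loading, assuming the JSON file is a list of objects. The helper name `validate_records` and the sample records are hypothetical.

```python
import json

REQUIRED_FIELDS = ("code", "label")

def validate_records(records):
    """Return a list of (index, problem) pairs for malformed records."""
    problems = []
    for i, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append((i, f"missing '{field}'"))
        # A missing label is already reported above, so only check present ones
        if record.get("label") not in (0, 1, None):
            problems.append((i, f"label must be 0 or 1, got {record['label']!r}"))
    return problems

# Hypothetical sample in the assumed file layout
sample_json = """
[
  {"code": "def add(a, b):\\n    return a + b", "label": 0},
  {"code": "def area(r):\\n    return 3.14159 * r * r", "label": 1},
  {"code": "print('hi')", "label": 2},
  {"label": 1}
]
"""
records = json.loads(sample_json)
problems = validate_records(records)
print(problems)
```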


In [4]:

def calculate_metrics(y_true, y_pred):
    """Calculate all evaluation metrics"""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
        'f1': f1_score(y_true, y_pred, zero_division=0)
    }

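def get_confusion_matrix(y_true, y_pred):
    """Get confusion matrix components"""
    # labels=[0, 1] keeps the matrix 2x2 even when only one class appears,
    # so the four-way unpacking below cannot fail.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # Cast NumPy integers to plain ints so the result is JSON-serializable.
    return {'TP': int(tp), 'TN': int(tn), 'FP': int(fp), 'FN': int(fn)}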
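
The four metrics follow directly from the confusion-matrix counts, and `zero_division=0` means a ratio with an empty denominator is reported as 0. Below is a plain-Python sketch of the same arithmetic. The counts are back-calculated from the comparison table at the end of this notebook (61 cases, 30 of them positive), not read from saved confusion matrices, and `metrics_from_counts` is a hypothetical helper.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def metrics_from_counts(tp, tn, fp, fn):
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# Counts inferred from the reported hybrid_rag row
hybrid = metrics_from_counts(tp=22, tn=27, fp=4, fn=8)
# Counts inferred from the embedding row: every case predicted negative
embedding = metrics_from_counts(tp=0, tn=31, fp=0, fn=30)

print({name: round(value, 3) for name, value in hybrid.items()})
print({name: round(value, 3) for name, value in embedding.items()})
```

The embedding row shows why accuracy alone is misleading here: predicting "not plagiarized" for every case still scores 31/61 on this nearly balanced dataset.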


In [5]:

def evaluate_system(detect_func, test_df, **kwargs):
    """Evaluate a detection system on test dataset"""
    predictions = []
    confidences = []
    execution_times = []
    errors = []

    for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Evaluating"):
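        try:
            # perf_counter is a monotonic clock intended for measuring elapsed time
            start_time = time.perf_counter()
            result = detect_func(row['code'], **kwargs)
            elapsed = time.perf_counter() - start_time

            predictions.append(1 if result['is_plagiarized'] else 0)
            # Fall back to max_similarity (embedding method), then to a neutral 0.5
            confidences.append(result.get('confidence', result.get('max_similarity', 0.5)))
            execution_times.append(elapsed)

        except Exception as e:
            # A failed call is scored as "not plagiarized" with zero elapsed time,
            # which can lower recall and avg_time; check the returned 'errors' list.
            predictions.append(0)
            confidences.append(0.0)
            execution_times.append(0.0)
            errors.append({'idx': idx, 'error': str(e)})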

    y_true = test_df['label'].values
    y_pred = np.array(predictions)

    metrics = calculate_metrics(y_true, y_pred)
    cm = get_confusion_matrix(y_true, y_pred)

    return {
        'metrics': metrics,
        'confusion_matrix': cm,
        'predictions': predictions,
        'confidences': confidences,
        'avg_time': np.mean(execution_times),
        'total_time': np.sum(execution_times),
        'errors': errors
    }
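
Because `evaluate_system` records a failed call as a negative prediction, a detector that rejects one of the keyword arguments still yields a full set of metrics, just very quickly and with every case predicted negative. The RRF ablation logged later in this notebook finishes all 61 cases in a fraction of a second, which would be consistent with that failure mode, though the log alone does not confirm it. The sketch below reproduces the effect with a stand-in detector. `fake_detector` and `run_cases` are hypothetical and simplify the loop above.

```python
import time

def fake_detector(code, top_k=5):
    """Stand-in detector that does not accept an rrf_k argument."""
    return {"is_plagiarized": "copied" in code, "confidence": 0.9}

def run_cases(detect_func, cases, **kwargs):
    """Simplified evaluation loop that keeps failures visible."""
    predictions, times, errors = [], [], []
    for idx, code in enumerate(cases):
        start = time.perf_counter()
        try:
            result = detect_func(code, **kwargs)
            predictions.append(1 if result["is_plagiarized"] else 0)
            times.append(time.perf_counter() - start)
        except Exception as exc:
            predictions.append(0)  # same fallback as the loop above
            errors.append({"idx": idx, "error": f"{type(exc).__name__}: {exc}"})
    # Average only over calls that actually completed
    avg_time = sum(times) / len(times) if times else float("nan")
    return {"predictions": predictions, "avg_time": avg_time, "errors": errors}

cases = ["copied snippet", "original snippet", "copied again"]

ok = run_cases(fake_detector, cases, top_k=3)
bad = run_cases(fake_detector, cases, top_k=3, rrf_k=60)  # unexpected keyword

print(ok["predictions"], len(ok["errors"]))
print(bad["predictions"], len(bad["errors"]))
print(bad["errors"][0]["error"])
```

Checking `len(result['errors'])` after each run is a cheap way to tell a genuinely weak configuration from one that never executed.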


## ERROR ANALYSIS

In [6]:
def analyze_errors(test_df, predictions, error_type='FP'):
    """Analyze specific error types (FP: False Positives, FN: False Negatives)"""
    y_true = test_df['label'].values
    y_pred = np.array(predictions)

    if error_type == 'FP':
        # Predicted plagiarized but actually not
        error_mask = (y_true == 0) & (y_pred == 1)
    else:  # FN
        # Predicted not plagiarized but actually is
        error_mask = (y_true == 1) & (y_pred == 0)

    error_cases = test_df[error_mask].copy()
    error_cases['predicted'] = y_pred[error_mask]

    return error_cases

def generate_error_report(test_df, all_results):
    """Generate comprehensive error analysis report"""
    report = {}

    for method, results in all_results.items():
        predictions = results['predictions']

        fp_cases = analyze_errors(test_df, predictions, 'FP')
        fn_cases = analyze_errors(test_df, predictions, 'FN')

        report[method] = {
            'false_positives': len(fp_cases),
            'false_negatives': len(fn_cases),
            'fp_examples': fp_cases.head(3).to_dict('records') if len(fp_cases) > 0 else [],
            'fn_examples': fn_cases.head(3).to_dict('records') if len(fn_cases) > 0 else []
        }

    return report


## ABLATION STUDIES

In [7]:

def ablation_k_values(detect_func, test_df, k_values=(3, 5, 10)):
    """Test different k values for RAG systems"""
    results = {}

    for k in k_values:
        print(f"Testing k={k}...")
        result = evaluate_system(detect_func, test_df, top_k=k)
        results[k] = result['metrics']

    return pd.DataFrame(results).T

def ablation_threshold_values(test_df, thresholds=(0.7, 0.8, 0.9)):
    """Test different similarity thresholds for embedding method"""
    results = {}

    for thresh in thresholds:
        print(f"Testing threshold={thresh}...")
        result = evaluate_system(detect_embedding, test_df, threshold=thresh)
        results[thresh] = result['metrics']

    return pd.DataFrame(results).T

def ablation_rrf_k_values(test_df, rrf_k_values=(30, 60, 90)):
    """Test different RRF k values for hybrid RAG"""
    results = {}

    for rrf_k in rrf_k_values:
        print(f"Testing rrf_k={rrf_k}...")
        result = evaluate_system(detect_hybrid_rag, test_df, top_k=5, rrf_k=rrf_k)
        results[rrf_k] = result['metrics']

    return pd.DataFrame(results).T
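
For the embedding method, the threshold ablation amounts to relabelling the same similarity scores at different cut-offs, so it can also be computed offline from stored `max_similarity` values without re-running the embedder. The sample output at the top of this notebook shows a query that both RAG systems judged plagiarized scoring only about 0.40, well below the 0.7 to 0.9 range swept here. Below is a minimal sketch of an offline sweep. The scores and labels are made-up illustrative values, `sweep_thresholds` is a hypothetical helper, and the code assumes a sample is flagged when its score is at or above the threshold.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when a sample is flagged if its score is >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def sweep_thresholds(scores, labels, thresholds):
    """Map each threshold to the F1 it would produce on the stored scores."""
    return {t: f1_at_threshold(scores, labels, t) for t in thresholds}

# Made-up similarity scores and ground-truth labels for illustration
scores = [0.40, 0.55, 0.62, 0.35, 0.30, 0.48]
labels = [1, 1, 1, 0, 0, 0]

results = sweep_thresholds(scores, labels, [0.3, 0.4, 0.5, 0.8])
for threshold, f1 in results.items():
    print(f"threshold={threshold:.1f}  f1={f1:.3f}")

best = max(results, key=results.get)
print("best threshold:", best)
```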

## VISUALIZATION

In [10]:
def plot_metrics_comparison(all_results, save_path=RESULTS_DIR / "metrics_comparison.png"):
    """Plot comparison of all metrics across systems"""
    metrics_df = pd.DataFrame({
        method: results['metrics']
        for method, results in all_results.items()
    }).T

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    metrics = ['accuracy', 'precision', 'recall', 'f1']

    for ax, metric in zip(axes.flat, metrics):
        metrics_df[metric].plot(kind='bar', ax=ax, color='steelblue')
        ax.set_title(f'{metric.capitalize()} by Method', fontsize=14, fontweight='bold')
        ax.set_ylabel(metric.capitalize())
        ax.set_xlabel('')
        ax.set_ylim(0, 1.0)
        ax.grid(axis='y', alpha=0.3)

        # Add value labels on bars
        for i, v in enumerate(metrics_df[metric]):
            ax.text(i, v + 0.02, f'{v:.3f}', ha='center', fontweight='bold')

    plt.tight_layout()
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
    print(f"✅ Saved metrics comparison to {save_path}")
    plt.close()

def plot_confusion_matrices(all_results, save_path=RESULTS_DIR / "confusion_matrices.png"):
    """Plot confusion matrices for all systems"""
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    for ax, (method, results) in zip(axes.flat, all_results.items()):
        cm = results['confusion_matrix']
        cm_array = np.array([[cm['TN'], cm['FP']], [cm['FN'], cm['TP']]])

        sns.heatmap(cm_array, annot=True, fmt='d', cmap='Blues', ax=ax,
                    xticklabels=['Predicted Not Plagiarized', 'Predicted Plagiarized'],
                    yticklabels=['Actually Not Plagiarized', 'Actually Plagiarized'])
        ax.set_title(f'{method.replace("_", " ").title()}', fontsize=14, fontweight='bold')

    plt.tight_layout()
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
    print(f"✅ Saved confusion matrices to {save_path}")
    plt.close()

def plot_ablation_study(ablation_df, title, xlabel, save_path):
    """Plot ablation study results"""
    fig, ax = plt.subplots(figsize=(10, 6))

    ablation_df.plot(ax=ax, marker='o', linewidth=2)
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.set_xlabel(xlabel, fontsize=12)
    ax.set_ylabel('Score', fontsize=12)
    ax.set_ylim(0, 1.0)
    ax.legend(title='Metrics', fontsize=10)
    ax.grid(alpha=0.3)

    plt.tight_layout()
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
    print(f"✅ Saved ablation study to {save_path}")
    plt.close()

def plot_time_cost_tradeoff(all_results, save_path=RESULTS_DIR / "time_cost_tradeoff.png"):
    """Plot F1 score vs execution time tradeoff"""
    methods = list(all_results.keys())
    f1_scores = [all_results[m]['metrics']['f1'] for m in methods]
    avg_times = [all_results[m]['avg_time'] for m in methods]

    fig, ax = plt.subplots(figsize=(10, 6))

    scatter = ax.scatter(avg_times, f1_scores, s=200, alpha=0.6, c=range(len(methods)), cmap='viridis')

    for i, method in enumerate(methods):
        ax.annotate(method.replace('_', '\n'), (avg_times[i], f1_scores[i]),
                   fontsize=10, ha='center', fontweight='bold')

    ax.set_xlabel('Average Execution Time (seconds)', fontsize=12)
    ax.set_ylabel('F1 Score', fontsize=12)
    ax.set_title('Performance vs Computational Cost', fontsize=14, fontweight='bold')
    ax.grid(alpha=0.3)

    plt.tight_layout()
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
    print(f"✅ Saved time-cost tradeoff to {save_path}")
    plt.close()

In [12]:

def run_full_evaluation():
    """Execute complete evaluation pipeline"""
    try:
        print("\n" + "="*60)
        print("PHASE 1: LOADING TEST DATA")
        print("="*60)
        test_df = load_test_dataset()

        print("\n" + "="*60)
        print("PHASE 2: EVALUATING ALL SYSTEMS")
        print("="*60)

        all_results = {}

        print("\n[1/4] Evaluating Embedding Search...")
        all_results['embedding'] = evaluate_system(detect_embedding, test_df, threshold=0.8)

        print("\n[2/4] Evaluating Direct LLM...")
        all_results['direct_llm'] = evaluate_system(detect_llm, test_df, max_context_functions=100)

        print("\n[3/4] Evaluating Standard RAG...")
        all_results['rag'] = evaluate_system(detect_rag, test_df, top_k=5)

        print("\n[4/4] Evaluating Hybrid RAG...")
        all_results['hybrid_rag'] = evaluate_system(detect_hybrid_rag, test_df, top_k=5)

        print("\n" + "="*60)
        print("PHASE 3: ERROR ANALYSIS")
        print("="*60)
        error_report = generate_error_report(test_df, all_results)
        print("\n" + "="*60)
        print("PHASE 4: GENERATING VISUALIZATIONS")
        print("="*60)
        plot_metrics_comparison(all_results)
        plot_confusion_matrices(all_results)
        plot_time_cost_tradeoff(all_results)



        print("\n" + "="*60)
        print("PHASE 5: ABLATION STUDIES")
        print("="*60)

        print("\nAblation 1: RAG k-values...")
        ablation_k = ablation_k_values(detect_rag, test_df)
        plot_ablation_study(ablation_k, "RAG: Effect of k on Performance",
                           "Number of Retrieved Documents (k)",
                           RESULTS_DIR / "ablation_k_values.png")

        print("\nAblation 2: Embedding thresholds...")
        ablation_thresh = ablation_threshold_values(test_df)
        plot_ablation_study(ablation_thresh, "Embedding: Effect of Similarity Threshold",
                           "Similarity Threshold",
                           RESULTS_DIR / "ablation_thresholds.png")

        print("\nAblation 3: Hybrid RAG RRF-k values...")
        ablation_rrf = ablation_rrf_k_values(test_df)
        plot_ablation_study(ablation_rrf, "Hybrid RAG: Effect of RRF Parameter",
                           "RRF k Parameter",
                           RESULTS_DIR / "ablation_rrf_k.png")

        print("\n" + "="*60)
        print("PHASE 6: GENERATING REPORT")
        print("="*60)
        generate_analysis_report(all_results, error_report, test_df)

        # Save numerical results
        results_summary = pd.DataFrame({
            method: results['metrics']
            for method, results in all_results.items()
        }).T

        results_summary.to_csv(RESULTS_DIR / "metrics_summary.csv")
        print(f"✅ Saved metrics summary to {RESULTS_DIR / 'metrics_summary.csv'}")

        print("\n" + "="*60)
        print("✅ EVALUATION COMPLETE")
        print("="*60)
        print(f"\nAll results saved to: {RESULTS_DIR.absolute()}")

        return all_results, error_report
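    except Exception as e:
        # Report the failure, then re-raise: returning None here would make the
        # caller's tuple unpacking fail with a less informative TypeError.
        print(f"❌ Evaluation failed: {e}")
        raise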


In [13]:

def print_comparison_table(all_results):
    """Print formatted comparison table"""

    print("\n" + "="*80)
    print("PERFORMANCE COMPARISON TABLE")
    print("="*80)

    headers = ["Method", "Accuracy", "Precision", "Recall", "F1", "Avg Time (s)"]
    print(f"{headers[0]:<20} {headers[1]:<12} {headers[2]:<12} {headers[3]:<12} {headers[4]:<12} {headers[5]:<12}")
    print("-"*80)

    for method, results in all_results.items():
        m = results['metrics']
        print(f"{method:<20} {m['accuracy']:<12.3f} {m['precision']:<12.3f} "
              f"{m['recall']:<12.3f} {m['f1']:<12.3f} {results['avg_time']:<12.4f}")

    print("="*80 + "\n")


In [12]:
all_results, error_report = run_full_evaluation()
print_comparison_table(all_results)


PHASE 1: LOADING TEST DATA
✅ Loaded 61 test cases
   Positive (plagiarized): 30
   Negative (not plagiarized): 31

PHASE 2: EVALUATING ALL SYSTEMS

[1/4] Evaluating Embedding Search...




[2/4] Evaluating Direct LLM...


Evaluating: 100%|██████████| 61/61 [09:29<00:00,  9.34s/it]



[3/4] Evaluating Standard RAG...




[4/4] Evaluating Hybrid RAG...




PHASE 3: ERROR ANALYSIS

PHASE 4: GENERATING VISUALIZATIONS
✅ Saved metrics comparison to results\metrics_comparison.png
✅ Saved confusion matrices to results\confusion_matrices.png
✅ Saved time-cost tradeoff to results\time_cost_tradeoff.png

PHASE 5: ABLATION STUDIES

Ablation 1: RAG k-values...
Testing k=3...



Testing k=5...



Testing k=10...



✅ Saved ablation study to results\ablation_k_values.png

Ablation 2: Embedding thresholds...
Testing threshold=0.7...



Testing threshold=0.8...



Testing threshold=0.9...



✅ Saved ablation study to results\ablation_thresholds.png

Ablation 3: Hybrid RAG RRF-k values...
Testing rrf_k=30...


Evaluating: 100%|██████████| 61/61 [00:00<00:00, 4857.10it/s]


Testing rrf_k=60...


Evaluating: 100%|██████████| 61/61 [00:00<00:00, 6131.44it/s]


Testing rrf_k=90...


Evaluating: 100%|██████████| 61/61 [00:00<00:00, 3917.09it/s]


✅ Saved ablation study to results\ablation_rrf_k.png

PHASE 6: GENERATING REPORT
✅ Saved analysis report to results\analysis_report.md
✅ Saved metrics summary to results\metrics_summary.csv

✅ EVALUATION COMPLETE

All results saved to: C:\Users\root\Desktop\LLMs\Code_Plagiarism_Checker\results

PERFORMANCE COMPARISON TABLE
Method               Accuracy     Precision    Recall       F1           Avg Time (s)
--------------------------------------------------------------------------------
embedding            0.508        0.000        0.000        0.000        0.7124      
direct_llm           0.508        0.500        0.100        0.167        9.3332      
rag                  0.770        0.767        0.767        0.767        15.1601     
hybrid_rag           0.803        0.846        0.733        0.786        13.7696     
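
One way to read the table is to ask which methods are not dominated, meaning no other method is both at least as accurate and at least as fast. Below is a short sketch using the F1 and average-time columns above. `pareto_front` is a hypothetical helper and is not part of the notebook's pipeline.

```python
# (F1, average seconds per query) copied from the comparison table
results = {
    "embedding": (0.000, 0.7124),
    "direct_llm": (0.167, 9.3332),
    "rag": (0.767, 15.1601),
    "hybrid_rag": (0.786, 13.7696),
}

def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    f1_a, time_a = a
    f1_b, time_b = b
    no_worse = f1_a >= f1_b and time_a <= time_b
    strictly_better = f1_a > f1_b or time_a < time_b
    return no_worse and strictly_better

def pareto_front(points):
    """Return the names of all points that no other point dominates."""
    return [
        name for name, point in points.items()
        if not any(dominates(other, point)
                   for other_name, other in points.items() if other_name != name)
    ]

front = pareto_front(results)
print(front)
```

On these numbers `rag` is dominated by `hybrid_rag`, which is both faster and more accurate. `embedding` stays on the front only because it is the fastest method, despite an F1 of 0.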

