# Security String Classification Inference with DeepSeek 7B

This notebook runs inference with a fine-tuned security classification model. The model classifies each candidate string as either "Secret" or "Non-sensitive" based on its context in an issue report.

## Overview
- **Task**: Binary classification of security-sensitive strings
- **Model**: Fine-tuned DeepSeek 7B with LoRA adapters
- **Framework**: Unsloth for efficient inference
- **Data**: CSV format with candidate strings and issue reports

## Step 1: Environment Setup and Imports

Import all necessary libraries and set up the environment for inference.

In [None]:
import pandas as pd
import numpy as np
import os

# Set environment for Unsloth
os.environ["UNSLOTH_IS_PRESENT"] = "1"

from tqdm import tqdm
import torch
from unsloth import FastLanguageModel
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
import json
from datetime import datetime
import re

# Set random seed for reproducibility
torch.manual_seed(69420)

print("✅ Environment setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

In [None]:
# Create model results directory
import os
from datetime import datetime

# Create timestamp for this run
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
model_name = "deepseek-7b-security-classifier"
results_dir = f"model_results/{model_name}/{timestamp}"

# Create directories
os.makedirs(results_dir, exist_ok=True)
os.makedirs(f"{results_dir}/metrics", exist_ok=True)
os.makedirs(f"{results_dir}/analysis", exist_ok=True)
os.makedirs(f"{results_dir}/predictions", exist_ok=True)

print(f"✅ Created model results directory: {results_dir}")
print(f"📁 Directory structure:")
print(f"  - {results_dir}/metrics/     (performance metrics)")
print(f"  - {results_dir}/analysis/    (error analysis)")
print(f"  - {results_dir}/predictions/ (detailed predictions)")

## Step 2: Load Test Data

Load the test dataset for evaluation.

In [None]:
# Load test data
df_test = pd.read_csv("../Data/test.csv")

print(f"Test shape: {df_test.shape}")

# Display data structure
print("\n📊 Test data sample:")
print(df_test.head(2))

print("\n📋 Column information:")
print(df_test.columns.tolist())
print(f"\nData types:\n{df_test.dtypes}")

## Step 3: Data Preprocessing

Apply the same preprocessing as used during training.

In [None]:

def create_context_window(text, target_string, window_size=200):
    """Create context window around target string"""
    target_index = text.find(target_string)
    if target_index != -1:
        start_index = max(0, target_index - window_size)
        end_index = min(len(text), target_index + len(target_string) + window_size)
        context_window = text[start_index:end_index]
        return context_window
    return None

# Apply preprocessing
print("🔄 Preprocessing test data...")
df_test['modified_text'] = df_test.apply(lambda row: create_context_window(row['text'], row['candidate_string']), axis=1)

# Convert labels to text format
df_test['label'] = df_test['label'].replace({0: 'Non-sensitive', 1: 'Secret'})

print("✅ Data preprocessing complete!")
print(f"📊 Label distribution in test data:")
print(df_test['label'].value_counts())
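
`create_context_window` returns `None` when the candidate string does not occur in the text, and that `None` then flows into the prompt and the later analysis cells. The sketch below shows one possible way to handle that case. The name `context_window_or_fallback` and the head-of-text fallback are assumptions made for illustration, not part of the original pipeline; if training used the `None` behavior, changing it here would make inference preprocessing differ from training.

```python
def context_window_or_fallback(text, target, window_size=200):
    """Return the text around the first occurrence of target.

    Hypothetical variant of create_context_window: instead of None it
    returns the first 2 * window_size characters when target is missing,
    and an empty string when either argument is not a string.
    """
    if not isinstance(text, str) or not isinstance(target, str):
        return ""
    position = text.find(target)
    if position == -1:
        # Assumed fallback: keep the head of the report
        return text[: 2 * window_size]
    start = max(0, position - window_size)
    end = min(len(text), position + len(target) + window_size)
    return text[start:end]


report = "x" * 300 + " token=abc123 " + "y" * 300
found = context_window_or_fallback(report, "abc123", window_size=10)
missing = context_window_or_fallback(report, "not-present", window_size=10)

print(repr(found))
print(len(missing))
```

Counting rows where the target is not found before running inference is a cheap way to see how often such a fallback would be triggered.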

## Step 4: Load Pre-trained Model

Load the saved fine-tuned model for inference.

In [None]:
# Model configuration
max_seq_length = 1024
dtype = None
load_in_4bit = True

# Path to the saved model (adjust as needed)
model_path = "../models/deepseek-7b-ft-unsloth_merged"  # Use the merged model for easier inference

print("🔧 Loading fine-tuned model...")
print(f"Model path: {model_path}")
print(f"Max sequence length: {max_seq_length}")
print(f"4-bit quantization: {load_in_4bit}")

# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_path,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Set model to inference mode
FastLanguageModel.for_inference(model)

print("✅ Model loaded successfully and set to inference mode!")

## Step 5: Inference Functions

Define the inference functions using the same prompt template as training.

In [None]:
def format_prompt_for_inference(candidate_string, issue_report):
    """Format prompt for inference using ChatML-style markers, matching the fine-tuning template"""
    
    system_prompt = """You are a security auditor or classifier specialized in identifying and categorizing sensitive secrets from issue reports.. Classify the given candidate string as either "Non-sensitive" or "Secret" based on its context.

A "Secret" includes sensitive information such as: 
- API keys and secrets (e.g., `sk_test_ABC123`)  
- Private and secret keys (e.g., private SSH keys, private cryptographic keys)  
- Authentication keys and tokens (e.g., `Bearer <token>`)  
- Database connection strings with credentials (e.g., `mongodb://user:password@host:port`)  
- Passwords, usernames, and any other private information that should not be shared openly.  

A "Non-sensitive" string is not considered secret and can be shared openly. This includes:  
- Public keys of any form (e.g., public SSH keys)  
- Non-sensitive configuration values or identifiers  
- Actual-looking keys that are clearly marked as dummy/test (e.g., with comments like '# dummy key' or variable names like 'test_key')  
- Strings that just look random or patterned but are not actually secrets (e.g., `xyz123`, 'xxxx', `abc123`, `EXAMPLE_KEY`, `token_value`)  
- Strings that are clearly placeholders or redacted text (e.g., 'XXXXXXXX', '[REDACTED]', '[TRUNCATED]')  
- **Obfuscated or masked values (e.g., '****', '****123', 'abc...xyz')**  

These are always considered **"Non-sensitive"**, even if they appear in a sensitive context.

Reply with only the classification: "Non-sensitive" or "Secret"."""

    user_prompt = f"""Classify the given candidate string based on its role in the provided issue report.

candidate_string: {candidate_string}
issue_report: {issue_report}"""

    # Inference format using ChatML-style markers (<|im_start|> / <|im_end|>); keep identical to the training template
    prompt = f"""<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{user_prompt}<|im_end|>
<|im_start|>assistant
"""
    
    return prompt

def extract_label(model_response):
    """Extract label from model response"""
    if "Secret" in model_response:
        return "Secret"
    else:
        return "Non-sensitive"

def predict_single(candidate_string, issue_report, model, tokenizer):
    """Single prediction function for testing"""
    # Format prompt for inference
    test_prompt = format_prompt_for_inference(candidate_string, issue_report)
    
    # Tokenize
    inputs = tokenizer(
        test_prompt,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_seq_length
    )
    
    # Move inputs to the device the model is on
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    # Generate prediction
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_new_tokens=5,
            do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            use_cache=False
        )
    
    # Decode only the newly generated tokens (everything after the prompt)
    prompt_length = inputs['input_ids'].shape[1]
    model_response = tokenizer.decode(
        outputs[0][prompt_length:],
        skip_special_tokens=True
    ).strip()
    
    predicted_label = extract_label(model_response)
    return predicted_label, model_response

print("✅ Inference functions defined!")
print("  - format_prompt_for_inference: Formats prompts for inference")
print("  - extract_label: Extracts classification labels from model responses")
print("  - predict_single: Makes single predictions with the trained model")
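
`extract_label` maps every response that does not contain the exact substring "Secret" to "Non-sensitive", so an empty or malformed generation is indistinguishable from a real "Non-sensitive" answer. A stricter parser might look like the sketch below. The function name `extract_label_strict` and the choice to return `None` for unparseable output are assumptions for illustration only.

```python
import re


def extract_label_strict(model_response):
    """Return 'Secret', 'Non-sensitive', or None if no label is recognisable.

    Hypothetical stricter alternative to extract_label: the response must
    start with one of the two labels (case-insensitive) after cleanup.
    """
    # Remove a trailing ChatML end marker, whitespace, and surrounding quotes
    cleaned = model_response.replace("<|im_end|>", " ").strip().strip('"\'').lower()
    if re.match(r"non[-\s]?sensitive\b", cleaned):
        return "Non-sensitive"
    if re.match(r"secret\b", cleaned):
        return "Secret"
    return None


for reply in ['Secret<|im_end|>', ' "Non-sensitive"', "not a secret", ""]:
    print(repr(reply), "->", extract_label_strict(reply))
```

Returning `None` lets the caller count unparseable generations separately instead of folding them into one class.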

## Step 6: Batch Prediction Function

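Define a prediction loop that walks the test set in chunks of `batch_size` rows. Each row is still sent to the model individually through `predict_single`; the chunk size only sets the granularity of the progress bar.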

In [None]:
def predict_batch(test_df, model, tokenizer, batch_size=8):
    """Run predict_single over test_df in chunks of batch_size rows (one generate call per row)"""
    y_pred = []
    errors = []
    
    print(f"🔄 Running batch predictions on {len(test_df):,} examples...")
    
    for i in tqdm(range(0, len(test_df), batch_size), desc="Predicting"):
        batch = test_df.iloc[i:i+batch_size]
        
        for idx, row in batch.iterrows():
            try:
                predicted_label, _ = predict_single(
                    row["candidate_string"], 
                    row["modified_text"], 
                    model, 
                    tokenizer
                )
                y_pred.append(predicted_label)
            except Exception as e:
                errors.append(f"Error at index {idx}: {e}")
                y_pred.append("Non-sensitive")  # Fallback label: a failed row is counted as a Non-sensitive prediction
                continue
    
    if errors:
        print(f"⚠️  {len(errors)} errors occurred during prediction:")
        for error in errors[:3]:  # Show first 3 errors
            print(f"  - {error}")
        if len(errors) > 3:
            print(f"  - ... and {len(errors) - 3} more errors")
    
    return y_pred

print("✅ Batch prediction function defined!")
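
Because `predict_batch` records a failed row as "Non-sensitive", failures are invisible in the metrics computed later. The sketch below shows one way to keep failures separate. It is a toy, model-free illustration: `chunked`, `predict_with_failures`, and `toy_classify` are hypothetical names, and `toy_classify` only stands in for a call such as `predict_single`.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_with_failures(rows, classify, batch_size=8):
    """Classify (candidate, context) pairs chunk by chunk.

    Failed rows get None instead of a default label, and their
    positions are returned so they can be reported or retried.
    """
    predictions, failed = [], []
    offset = 0
    for chunk in chunked(rows, batch_size):
        for i, (candidate, context) in enumerate(chunk):
            try:
                predictions.append(classify(candidate, context))
            except Exception:
                predictions.append(None)
                failed.append(offset + i)
        offset += len(chunk)
    return predictions, failed


def toy_classify(candidate, context):
    """Stand-in for the real model call; raises on an empty context."""
    if not context:
        raise ValueError("empty context")
    return "Secret" if candidate.startswith("sk_") else "Non-sensitive"


rows = [("sk_test_ABC123", "key=sk_test_ABC123"), ("xyz123", ""), ("abc", "id=abc")]
preds, failed = predict_with_failures(rows, toy_classify, batch_size=2)
print(preds, failed)
```

Rows listed in `failed` would need to be excluded or handled explicitly before computing metrics, since scikit-learn will not accept `None` mixed with string labels.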

## Step 7: Run Comprehensive Evaluation

Execute the model on the test set and collect predictions.

In [None]:
# Run comprehensive evaluation on test set
print("🚀 Running comprehensive evaluation on test set...")

# Get predictions for the entire test set
X_test = df_test
y_pred = predict_batch(X_test, model, tokenizer)
y_true_test = X_test['label'].tolist()

print(f"✅ Evaluation completed!")
print(f"📊 Prediction Summary:")
print(f"  - Total predictions: {len(y_pred):,}")
print(f"  - Unique predicted labels: {set(y_pred)}")
print(f"  - True label distribution: {X_test['label'].value_counts().to_dict()}")

# Quick accuracy check
correct_predictions = sum(1 for true, pred in zip(y_true_test, y_pred) if true == pred)
quick_accuracy = correct_predictions / len(y_pred) if len(y_pred) > 0 else 0
print(f"  - Quick accuracy: {quick_accuracy:.3f} ({correct_predictions}/{len(y_pred)})")

## Step 8: Detailed Performance Metrics

Calculate and display comprehensive performance metrics.

In [None]:
# ============================================
# Detailed Performance Metrics
# ============================================
accuracy = 0.0
precision_avg = 0.0
recall_avg = 0.0
f1_avg = 0.0

if len(y_pred) > 0:
    print("\n" + "="*50)
    print("📈 DETAILED PERFORMANCE METRICS")
    print("="*50)
    
    # Classification Report with 3 decimal places
    print("\n📊 Classification Report:")
    print(classification_report(y_true_test, y_pred, digits=3))
    
    # Calculate precision, recall, F1-score for each class
    labels = sorted(set(y_true_test))
    precision, recall, f1, support = precision_recall_fscore_support(
        y_true_test, 
        y_pred, 
        labels=labels
    )
    
    print(f"\n🏷️  Per-Class Detailed Metrics:")
    for i, label in enumerate(labels):
        print(f"\n  {label.upper()}:")
        print(f"    - Precision: {precision[i]:.3f}")
        print(f"    - Recall: {recall[i]:.3f}")
        print(f"    - F1-score: {f1[i]:.3f}")
        print(f"    - Support: {support[i]:,}")
    
    # Overall accuracy and binary metrics
    def map_func(x):
        return 1 if x == "Secret" else 0

    y_true_mapped = np.array([map_func(label) for label in y_true_test])
    y_pred_mapped = np.array([map_func(label) for label in y_pred])

    # Calculate overall accuracy
    overall_accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
    
    # Calculate weighted averages for overall metrics
    precision_overall, recall_overall, f1_overall, _ = precision_recall_fscore_support(
        y_true_mapped, y_pred_mapped, average='weighted'
    )
    
    # Store these for later use in performance summary
    accuracy = overall_accuracy
    precision_avg = precision_overall
    recall_avg = recall_overall
    f1_avg = f1_overall
    
    print(f"\n🎯 Overall Performance:")
    print(f"  - Overall Accuracy: {overall_accuracy:.3f}")
    print(f"  - Weighted Precision: {precision_overall:.3f}")
    print(f"  - Weighted Recall: {recall_overall:.3f}")
    print(f"  - Weighted F1-Score: {f1_overall:.3f}")
    
    # Per-class accuracy
    for label_val, name in zip([0, 1], ["Non-sensitive", "Secret"]):
        label_indices = np.where(y_true_mapped == label_val)[0]
        if len(label_indices) > 0:
            label_accuracy = accuracy_score(
                y_true=y_true_mapped[label_indices], 
                y_pred=y_pred_mapped[label_indices]
            )
            print(f"  - Accuracy for {name}: {label_accuracy:.3f}")

else:
    print("❌ Cannot calculate metrics - no valid predictions made.")
    # Set default values if no predictions
    accuracy = 0.0
    precision_avg = 0.0
    recall_avg = 0.0
    f1_avg = 0.0
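
To make the "weighted" figures above concrete, here is a small worked example in plain Python. The counts are made up for illustration and are not results from this model. It computes per-class precision, recall, and F1 from confusion counts, then averages them weighted by class support, which is what `average='weighted'` does.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class, with zero division guarded."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts: 90 Non-sensitive rows and 10 Secret rows
tn, fp, fn, tp = 85, 5, 2, 8

secret = precision_recall_f1(tp=tp, fp=fp, fn=fn)
# For the Non-sensitive class the roles flip: its hits are tn, its false
# positives are the missed secrets (fn), and its false negatives are fp.
non_sensitive = precision_recall_f1(tp=tn, fp=fn, fn=fp)

support = {"Secret": tp + fn, "Non-sensitive": tn + fp}
total = sum(support.values())


def weighted(index):
    """Support-weighted average of one metric across both classes."""
    return (secret[index] * support["Secret"]
            + non_sensitive[index] * support["Non-sensitive"]) / total


weighted_precision, weighted_recall, weighted_f1 = weighted(0), weighted(1), weighted(2)
accuracy = (tp + tn) / total

print(f"Secret recall:   {secret[1]:.3f}")
print(f"Weighted recall: {weighted_recall:.3f}")
print(f"Accuracy:        {accuracy:.3f}")
```

Support-weighted recall always equals overall accuracy, so on an imbalanced test set the weighted numbers are dominated by the majority class. The per-class "Secret" row is the more informative one for judging missed secrets.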

## Step 9: Confusion Matrix and Error Analysis

Analyze prediction errors and build a confusion matrix.

In [None]:
# ============================================
# Confusion Matrix and Error Analysis
# ============================================

if len(y_pred) > 0:
    print("\n" + "="*50)
    print("🔍 CONFUSION MATRIX & ERROR ANALYSIS")
    print("="*50)
    
    # Confusion Matrix
    # Pass labels explicitly so the matrix is always 2x2, even if one class is absent
    cm = confusion_matrix(y_true=y_true_mapped, y_pred=y_pred_mapped, labels=[0, 1])
    print("\n📊 Confusion Matrix:")
    print("Predicted →")
    print("Actual ↓     Non-sens  Secret")
    print(f"Non-sens      {cm[0,0]:6d}   {cm[0,1]:6d}")
    print(f"Secret        {cm[1,0]:6d}   {cm[1,1]:6d}")
    
    # Calculate derived metrics
    tn, fp, fn, tp = cm.ravel()
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0  # Recall for Secret
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0  # Recall for Non-sensitive
    
    print(f"\n📊 Additional Metrics:")
    print(f"  - True Positives (Secret correctly identified): {tp:,}")
    print(f"  - True Negatives (Non-sensitive correctly identified): {tn:,}")
    print(f"  - False Positives (Non-sensitive labeled as Secret): {fp:,}")
    print(f"  - False Negatives (Secret labeled as Non-sensitive): {fn:,}")
    print(f"  - Sensitivity (Secret recall): {sensitivity:.3f}")
    print(f"  - Specificity (Non-sensitive recall): {specificity:.3f}")
    
    # Error Analysis
    print(f"\n🔍 Error Breakdown:")
    false_positives = [(true, pred, idx) for idx, (true, pred) in enumerate(zip(y_true_test, y_pred)) 
                      if true == 'Non-sensitive' and pred == 'Secret']
    false_negatives = [(true, pred, idx) for idx, (true, pred) in enumerate(zip(y_true_test, y_pred)) 
                      if true == 'Secret' and pred == 'Non-sensitive']
    
    print(f"  - False Positives: {len(false_positives):,} (Non-sensitive → Secret)")
    print(f"  - False Negatives: {len(false_negatives):,} (Secret → Non-sensitive)")
    
    # Show sample errors
    if false_negatives:
        print(f"\n❌ Sample False Negatives (Security Risk):")
        for i, (true, pred, idx) in enumerate(false_negatives[:3]):
            candidate = X_test.iloc[idx]['candidate_string']
            print(f"  {i+1}. Candidate: {candidate[:80]}{'...' if len(candidate) > 80 else ''}")
            print(f"     True: {true} → Predicted: {pred}")
    
    if false_positives:
        print(f"\n⚠️  Sample False Positives (Usability Impact):")
        for i, (true, pred, idx) in enumerate(false_positives[:3]):
            candidate = X_test.iloc[idx]['candidate_string']
            print(f"  {i+1}. Candidate: {candidate[:80]}{'...' if len(candidate) > 80 else ''}")
            print(f"     True: {true} → Predicted: {pred}")
    
    # Risk Assessment
    print(f"\n🚨 Risk Assessment:")
    if (tp + fn) > 0:  # report the rate even when there are no false negatives
        fn_rate = fn / (tp + fn)
        if fn_rate < 0.05:
            print(f"  ✅ LOW SECURITY RISK: False negative rate = {fn_rate:.3f}")
        elif fn_rate < 0.10:
            print(f"  ⚠️  MODERATE SECURITY RISK: False negative rate = {fn_rate:.3f}")
        else:
            print(f"  ❌ HIGH SECURITY RISK: False negative rate = {fn_rate:.3f}")
    
    if (tn + fp) > 0:  # report the rate even when there are no false positives
        fp_rate = fp / (tn + fp)
        if fp_rate < 0.05:
            print(f"  ✅ LOW USABILITY IMPACT: False positive rate = {fp_rate:.3f}")
        elif fp_rate < 0.10:
            print(f"  ⚠️  MODERATE USABILITY IMPACT: False positive rate = {fp_rate:.3f}")
        else:
            print(f"  ❌ HIGH USABILITY IMPACT: False positive rate = {fp_rate:.3f}")

else:
    print("❌ Cannot perform error analysis - no valid predictions made.")
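
The risk tiers in the cell above depend only on the false negative and false positive rates. As a rough, dependency-free sketch, the same logic can be written as small functions and checked on toy labels. `confusion_counts`, `error_rates`, and `rate_level` are hypothetical helper names; the 0.05 and 0.10 thresholds are the ones used above.

```python
def confusion_counts(y_true, y_pred, positive="Secret"):
    """Count (tn, fp, fn, tp); always returns four numbers, even for one class."""
    tn = fp = fn = tp = 0
    for true, pred in zip(y_true, y_pred):
        if true == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def error_rates(tn, fp, fn, tp):
    """Return (false negative rate, false positive rate), guarding empty classes."""
    fn_rate = fn / (tp + fn) if (tp + fn) else 0.0
    fp_rate = fp / (tn + fp) if (tn + fp) else 0.0
    return fn_rate, fp_rate


def rate_level(rate, low=0.05, moderate=0.10):
    """Map an error rate to the LOW / MODERATE / HIGH tiers."""
    if rate < low:
        return "LOW"
    if rate < moderate:
        return "MODERATE"
    return "HIGH"


y_true = ["Secret", "Secret", "Non-sensitive", "Non-sensitive", "Non-sensitive"]
y_hat = ["Secret", "Non-sensitive", "Non-sensitive", "Secret", "Non-sensitive"]

counts = confusion_counts(y_true, y_hat)
fn_rate, fp_rate = error_rates(*counts)
print(counts, rate_level(fn_rate), rate_level(fp_rate))
```

Counting the four cells directly avoids a shape problem when the test set happens to contain only one class.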

In [None]:
# ============================================
# Detailed False Positives and False Negatives Analysis
# ============================================

if len(y_pred) > 0:
    print("\n" + "="*60)
    print("🔍 DETAILED FALSE POSITIVES AND FALSE NEGATIVES ANALYSIS")
    print("="*60)
    
    # Identify all error cases
    false_positives = []
    false_negatives = []
    
    for idx, (true_label, pred_label) in enumerate(zip(y_true_test, y_pred)):
        if true_label == 'Non-sensitive' and pred_label == 'Secret':
            false_positives.append({
                'index': idx,
                'candidate_string': X_test.iloc[idx]['candidate_string'],
                'context_window': X_test.iloc[idx]['modified_text'],
                'actual_label': true_label,
                'predicted_label': pred_label
            })
        elif true_label == 'Secret' and pred_label == 'Non-sensitive':
            false_negatives.append({
                'index': idx,
                'candidate_string': X_test.iloc[idx]['candidate_string'],
                'context_window': X_test.iloc[idx]['modified_text'],
                'actual_label': true_label,
                'predicted_label': pred_label
            })
    
    # FALSE NEGATIVES Analysis (High Security Risk)
    print(f"\n❌ FALSE NEGATIVES (Secrets missed - HIGH SECURITY RISK)")
    print(f"Total Count: {len(false_negatives)}")
    print("-" * 60)
    
    if false_negatives:
        for i, fn_case in enumerate(false_negatives[:10]):  # Show top 10
            print(f"\n{i+1}. Index: {fn_case['index']}")
            candidate_string = X_test.iloc[fn_case['index']]['candidate_string']
            print(f"   Candidate: '{candidate_string}'")
            print(f"   Actual: {fn_case['actual_label']} → Predicted: {fn_case['predicted_label']}")
            context_text = fn_case['context_window'] or ''  # context is None when the candidate was not found in the text
            context_preview = context_text[:200] + '...' if len(context_text) > 200 else context_text
            print(f"   Context: {context_preview}")
            print(f"   Risk: ⚠️  SECURITY BREACH - Secret not detected!")
        
        if len(false_negatives) > 10:
            print(f"\n... and {len(false_negatives) - 10} more false negatives")
        
        # Save false negatives to CSV with context window, actual label, and predicted label
        if 'results_dir' in locals():
            fn_df = pd.DataFrame(false_negatives)
            fn_file = f"{results_dir}/analysis/false_negatives.csv"
            fn_df.to_csv(fn_file, index=False)
            print(f"\n📄 False negatives saved to: {fn_file}")
    else:
        print("✅ No false negatives found! Perfect security detection.")
    
    # FALSE POSITIVES Analysis (Usability Impact)
    print(f"\n⚠️  FALSE POSITIVES (Non-secrets flagged - USABILITY IMPACT)")
    print(f"Total Count: {len(false_positives)}")
    print("-" * 60)
    
    if false_positives:
        for i, fp_case in enumerate(false_positives[:10]):  # Show top 10
            print(f"\n{i+1}. Index: {fp_case['index']}")
            candidate_string = X_test.iloc[fp_case['index']]['candidate_string']
            print(f"   Candidate: '{candidate_string}'")
            print(f"   Actual: {fp_case['actual_label']} → Predicted: {fp_case['predicted_label']}")
            context_text = fp_case['context_window'] or ''  # context is None when the candidate was not found in the text
            context_preview = context_text[:200] + '...' if len(context_text) > 200 else context_text
            print(f"   Context: {context_preview}")
            print(f"   Impact: 📢 USABILITY - Non-secret flagged as secret")
        
        if len(false_positives) > 10:
            print(f"\n... and {len(false_positives) - 10} more false positives")
        
        # Save false positives to CSV with context window, actual label, and predicted label
        if 'results_dir' in locals():
            fp_df = pd.DataFrame(false_positives)
            fp_file = f"{results_dir}/analysis/false_positives.csv"
            fp_df.to_csv(fp_file, index=False)
            print(f"\n📄 False positives saved to: {fp_file}")
    else:
        print("✅ No false positives found! Perfect precision.")
    
    # Pattern Analysis
    print(f"\n🔍 PATTERN ANALYSIS")
    print("-" * 60)
    
    if false_negatives:
        fn_candidate_strings = [X_test.iloc[case['index']]['candidate_string'] for case in false_negatives]
        fn_lengths = [len(candidate) for candidate in fn_candidate_strings]
        print(f"False Negatives - Candidate String Lengths:")
        print(f"  Average: {np.mean(fn_lengths):.1f} chars")
        print(f"  Range: {min(fn_lengths)} - {max(fn_lengths)} chars")
    
    if false_positives:
        fp_candidate_strings = [X_test.iloc[case['index']]['candidate_string'] for case in false_positives]
        fp_lengths = [len(candidate) for candidate in fp_candidate_strings]
        print(f"False Positives - Candidate String Lengths:")
        print(f"  Average: {np.mean(fp_lengths):.1f} chars")
        print(f"  Range: {min(fp_lengths)} - {max(fp_lengths)} chars")

else:
    print("❌ Cannot perform false positives/negatives analysis - no valid predictions made.")

## Step 10: Performance Summary and Results Saving

Generate final performance summary and save results.

In [None]:
# ============================================
# Performance Summary and Results Saving
# ============================================

if len(y_pred) > 0:
    print("\n" + "="*50)
    print("📋 PERFORMANCE SUMMARY")
    print("="*50)
    
    # Create performance summary (precision, recall, and F1 are the weighted averages from Step 8)
    performance_summary = {
        'total_predictions': len(y_pred),
        'successful_predictions': len([p for p in y_pred if p in ['Secret', 'Non-sensitive']]),
        'failed_predictions': len([p for p in y_pred if p not in ['Secret', 'Non-sensitive']]),
        'accuracy': float(accuracy),
        'precision': float(precision_avg),
        'recall': float(recall_avg),
        'f1_score': float(f1_avg),
        'false_negatives': int(fn) if 'fn' in locals() else 0,
        'false_positives': int(fp) if 'fp' in locals() else 0,
        'true_positives': int(tp) if 'tp' in locals() else 0,
        'true_negatives': int(tn) if 'tn' in locals() else 0
    }
    
    print(f"✅ Performance Overview:")
    print(f"  - Total Test Samples: {performance_summary['total_predictions']:,}")
    print(f"  - Successful Predictions: {performance_summary['successful_predictions']:,}")
    print(f"  - Failed Predictions: {performance_summary['failed_predictions']:,}")
    print(f"  - Accuracy: {performance_summary['accuracy']:.3f}")
    print(f"  - Precision: {performance_summary['precision']:.3f}")
    print(f"  - Recall: {performance_summary['recall']:.3f}")
    print(f"  - F1-Score: {performance_summary['f1_score']:.3f}")
    
    # Model quality assessment
    print(f"\n🎯 Model Quality Assessment:")
    if accuracy >= 0.95:
        print("  ✅ EXCELLENT performance (≥95% accuracy)")
    elif accuracy >= 0.90:
        print("  ✅ GOOD performance (≥90% accuracy)")
    elif accuracy >= 0.80:
        print("  ⚠️  ACCEPTABLE performance (≥80% accuracy)")
    else:
        print("  ❌ POOR performance (<80% accuracy)")
    
    # Save detailed results to CSV
    try:
        print(f"\n💾 Saving Results to {results_dir}...")
        
        # Prepare detailed results with enhanced information
        detailed_results = []
        for i, (true_label, pred_label) in enumerate(zip(y_true_test, y_pred)):
            result = {
                'index': i,
                'candidate_string': X_test.iloc[i]['candidate_string'],
                'context_window': X_test.iloc[i]['modified_text'],
                'original_text': X_test.iloc[i]['text'] if 'text' in X_test.columns else '',
                'actual_label': true_label,
                'predicted_label': pred_label,
                'correct': true_label == pred_label,
                'error_type': 'Correct' if true_label == pred_label else 
                             'False Positive' if true_label == 'Non-sensitive' and pred_label == 'Secret' else
                             'False Negative' if true_label == 'Secret' and pred_label == 'Non-sensitive' else
                             'Other Error',
                'candidate_length': len(X_test.iloc[i]['candidate_string']),
                'context_length': len(X_test.iloc[i]['modified_text']) if X_test.iloc[i]['modified_text'] else 0
            }
            detailed_results.append(result)
        
        # Save detailed predictions to CSV
        results_df = pd.DataFrame(detailed_results)
        results_file = f"{results_dir}/predictions/detailed_predictions.csv"
        results_df.to_csv(results_file, index=False)
        print(f"  ✅ Detailed results saved to: {results_file}")
        
        # Save performance summary
        summary_file = f"{results_dir}/metrics/performance_summary.json"
        with open(summary_file, 'w') as f:
            json.dump(performance_summary, f, indent=2)
        print(f"  ✅ Performance summary saved to: {summary_file}")
        
        # Show error samples summary
        error_samples = results_df[results_df['correct'] == False]
        if len(error_samples) > 0:
            print(f"\n🔍 Error Samples Preview:")
            print(f"  - Total Errors: {len(error_samples):,}")
            print(f"  - False Negatives: {len(error_samples[error_samples['error_type'] == 'False Negative']):,}")
            print(f"  - False Positives: {len(error_samples[error_samples['error_type'] == 'False Positive']):,}")
            print(f"  - Other Errors: {len(error_samples[error_samples['error_type'] == 'Other Error']):,}")
            
            # Save error samples separately
            error_file = f"{results_dir}/analysis/error_samples.csv"
            error_samples.to_csv(error_file, index=False)
            print(f"  ✅ Error samples saved to: {error_file}")
        
        # Save classification report as text file with 3 decimal places
        report_file = f"{results_dir}/metrics/classification_report.txt"
        with open(report_file, 'w') as f:
            f.write("SECURITY STRING CLASSIFICATION - DETAILED PERFORMANCE REPORT\n")
            f.write("=" * 70 + "\n\n")
            f.write(f"Model: {model_name}\n")
            f.write(f"Timestamp: {timestamp}\n")
            f.write(f"Test Samples: {len(y_pred):,}\n\n")
            f.write("CLASSIFICATION REPORT:\n")
            f.write(classification_report(y_true_test, y_pred, digits=3))
            f.write(f"\n\nOVERALL METRICS:\n")
            f.write(f"Accuracy: {accuracy:.3f}\n")
            f.write(f"Precision: {precision_avg:.3f}\n")
            f.write(f"Recall: {recall_avg:.3f}\n")
            f.write(f"F1-Score: {f1_avg:.3f}\n")
        print(f"  ✅ Classification report saved to: {report_file}")
    
    except Exception as e:
        print(f"❌ Error saving results: {e}")
    
    print(f"\n🎉 INFERENCE TESTING COMPLETE!")
    print(f"Model evaluated successfully on {len(y_pred):,} test samples.")
    print(f"\n📁 All results saved to: {results_dir}")
    print(f"📊 Files created:")
    print(f"  - predictions/detailed_predictions.csv (enhanced with context and metadata)")
    print(f"  - metrics/performance_summary.json")
    print(f"  - metrics/classification_report.txt") 
    print(f"  - analysis/error_samples.csv")
    print(f"  - analysis/false_negatives.csv (context_window, actual_label, predicted_label)")
    print(f"  - analysis/false_positives.csv (context_window, actual_label, predicted_label)")
    print(f"\n✅ Ready for deployment or further analysis.")

else:
    print("❌ No valid predictions to summarize.")
    print("⚠️  Consider debugging the model inference process.")
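
Since each run writes `metrics/performance_summary.json` under its own timestamped directory, the summaries can be read back later to compare runs. The sketch below shows the write and read round trip against a temporary directory. `save_summary` and `load_summary` are hypothetical helpers, and the numbers are placeholders rather than real results.

```python
import json
import tempfile
from pathlib import Path


def save_summary(summary, results_dir):
    """Write summary as JSON to <results_dir>/metrics/performance_summary.json."""
    metrics_dir = Path(results_dir) / "metrics"
    metrics_dir.mkdir(parents=True, exist_ok=True)
    path = metrics_dir / "performance_summary.json"
    path.write_text(json.dumps(summary, indent=2))
    return path


def load_summary(results_dir):
    """Read the summary back as a plain dict."""
    path = Path(results_dir) / "metrics" / "performance_summary.json"
    return json.loads(path.read_text())


# Placeholder values, used only to exercise the round trip
with tempfile.TemporaryDirectory() as tmp:
    written = save_summary({"accuracy": 0.93, "false_negatives": 2}, tmp)
    reloaded = load_summary(tmp)

print(written.name, reloaded)
```

Values must be plain Python numbers before serialization, which is why the summary cell wraps its metrics in `float(...)` and `int(...)`.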

## Step 11: Single Sample Testing (Optional)

Test the model on individual samples for interactive exploration.

In [None]:
# Test individual sample
sample_idx = 0  # Change this to test different samples
sample_row = X_test.iloc[sample_idx]

print("🔍 Single Sample Testing")
print(f"Sample Index: {sample_idx}")
print(f"Candidate String: {sample_row['candidate_string']}")
print(f"True Label: {sample_row['label']}")
print(f"Context Preview: {(sample_row['modified_text'] or '')[:200]}...")

# Make prediction
predicted_label, full_response = predict_single(
    sample_row['candidate_string'],
    sample_row['modified_text'],
    model,
    tokenizer
)

print(f"\n📝 Model Response: {full_response[:100]}...")
print(f"🎯 Predicted Label: {predicted_label}")
print(f"✅ Correct: {predicted_label == sample_row['label']}")