# DeBERTa Single Model with TTA - Strategy B

## Overview
Earlier experiments showed that **the simplest configuration scored best**. This notebook therefore uses:
- A single model: DeBERTa v3 base
- Three training runs that differ only in random seed (a seed ensemble, loosely analogous to TTA)
- An equal-weight ensemble of the three runs' predictions

## Strategy: Simplicity + Test-Time Augmentation

### Key Insight from Past Experiments:
- Experiment 7 (simple DeBERTa, 3 epochs) → **0.917 AUC (best)**
- Experiment 9 (optimized DeBERTa, 4 epochs) → 0.916 AUC (worse)
- **Lesson**: Simple is better, but can we improve through diversity?

### Approach:
Train DeBERTa v3 **3 times** with different random seeds:
1. Seed 42 (baseline)
2. Seed 123
3. Seed 456

Then rank-normalize each run's predictions and average the three with equal weights.

### Why This Should Work:
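- **Same pretrained weights, different seeds** → The classifier head initialization, data order, and dropout masks differ, so each run can settle in a different local optimum
- **Reduces variance** → Averaging smooths out seed-specific noise, giving more stable predictions
- **No complexity added** → Keeps training simple (3 epochs, proven config)
- **TTA-like effect** → Analogous to test-time augmentation, except the variation comes from training seeds rather than augmented test inputs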
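
The variance-reduction argument can be illustrated with a toy simulation. This is a minimal sketch, not a model of the actual training runs: it assumes each run's prediction is a fixed true score plus independent Gaussian noise, and `noisy_score`, `simulate`, and the noise level are hypothetical names and values chosen for illustration. Real seed runs share data and pretrained weights, so their errors are correlated and the gain is likely smaller than shown here.

```python
import random
import statistics


def noisy_score(true_score: float, noise_sd: float, rng: random.Random) -> float:
    """One hypothetical run's prediction: the true score plus seed-dependent noise."""
    return true_score + rng.gauss(0.0, noise_sd)


def simulate(n_trials: int = 20000, n_models: int = 3,
             noise_sd: float = 0.1, seed: int = 0):
    """Compare the variance of one run against the mean of n_models runs."""
    rng = random.Random(seed)
    single, averaged = [], []
    for _ in range(n_trials):
        preds = [noisy_score(0.5, noise_sd, rng) for _ in range(n_models)]
        single.append(preds[0])                  # one run on its own
        averaged.append(sum(preds) / n_models)   # equal-weight ensemble
    return statistics.pvariance(single), statistics.pvariance(averaged)


var_single, var_avg = simulate()
print(f"single-run variance:     {var_single:.5f}")
print(f"3-run average variance:  {var_avg:.5f}")
```

Under the independence assumption, the averaged variance comes out near one third of the single-run variance, which is the best case for a three-seed ensemble.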

## Expected Performance
- **Target**: 0.918-0.919 AUC
- **Rationale**: Seed diversity should reduce variance without adding harmful complexity
- **Risk**: Low (same proven approach, just repeated)

## Acknowledgments

This work builds upon:

**Experiment 7: DeBERTa Large 2epochs 1hr (0.917 AUC)**
- Author: [itahiro](https://www.kaggle.com/itahiro)
- Notebook: https://www.kaggle.com/code/itahiro/deberta-large-2epochs-1hr
- Contribution: Proven DeBERTa v3 training configuration, URL semantic extraction

**Modification**: Train same model 3x with different seeds for ensemble diversity

In [None]:
%%writefile utils.py
import pandas as pd
import re

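def url_to_semantics(text: str) -> str:
    """Build a "URL Keywords: ..." suffix from the domains and path segments of any URLs in text.

    Returns an empty string when text is not a string or yields no keywords.
    """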
    if not isinstance(text, str):
        return ""

    url_pattern = r'https?://[^\s/$.?#].[^\s]*'
    urls = re.findall(url_pattern, text)
    
    if not urls:
        return "" 

    all_semantics = []
    seen_semantics = set()

    for url in urls:
        url_lower = url.lower()
        
        domain_match = re.search(r"(?:https?://)?([a-z0-9\-\.]+)\.[a-z]{2,}", url_lower)
        if domain_match:
            full_domain = domain_match.group(1)
            parts = full_domain.split('.')
            for part in parts:
                if part and part not in seen_semantics and len(part) > 3:
                    all_semantics.append(f"domain:{part}")
                    seen_semantics.add(part)

        path = re.sub(r"^(?:https?://)?[a-z0-9\.-]+\.[a-z]{2,}/?", "", url_lower)
        path_parts = [p for p in re.split(r'[/_.-]+', path) if p and p.isalnum()]

        for part in path_parts:
            part_clean = re.sub(r"\.(html?|php|asp|jsp)$|#.*|\?.*", "", part)
            if part_clean and part_clean not in seen_semantics and len(part_clean) > 3:
                all_semantics.append(f"path:{part_clean}")
                seen_semantics.add(part_clean)

    if not all_semantics:
        return ""

    return f"\nURL Keywords: {' '.join(all_semantics)}"


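def get_dataframe_to_train(data_path):
    """Flatten train.csv rows and the example columns of train.csv and test.csv into one labeled frame."""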
    train_dataset = pd.read_csv(f"{data_path}/train.csv") 
    test_dataset = pd.read_csv(f"{data_path}/test.csv")

    flatten = []

    flatten.append(train_dataset[["body", "rule", "subreddit","rule_violation"]].copy())

    # The positive/negative example columns in both train.csv and test.csv carry implicit labels,
    # so each one is flattened into extra (body, rule, subreddit, rule_violation) rows.
    for source_dataset in (train_dataset, test_dataset):
        for violation_type in ["positive", "negative"]:
            for i in range(1, 3):
                col_name = f"{violation_type}_example_{i}"

                if col_name not in source_dataset.columns:
                    continue

                sub_dataset = source_dataset[[col_name, "rule", "subreddit"]].copy()
                sub_dataset = sub_dataset.rename(columns={col_name: "body"})
                sub_dataset["rule_violation"] = 1 if violation_type == "positive" else 0

                # Drop missing or whitespace-only examples
                sub_dataset = sub_dataset.dropna(subset=['body'])
                sub_dataset = sub_dataset[sub_dataset['body'].str.strip().str.len() > 0]

                if not sub_dataset.empty:
                    flatten.append(sub_dataset)
    
    dataframe = pd.concat(flatten, axis=0)
    # Deduplicating on (body, rule) also covers duplicates on (body, rule, subreddit)
    dataframe = dataframe.drop_duplicates(subset=['body', 'rule'], keep='first', ignore_index=True)
    
    return dataframe

In [None]:
%%writefile train_deberta_seed.py
import os
import sys
import pandas as pd
import torch
import random
import numpy as np
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)

from utils import get_dataframe_to_train, url_to_semantics

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed) 
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

class CFG:
    model_name_or_path = "/kaggle/input/huggingfacedebertav3variants/deberta-v3-base"
    data_path = "/kaggle/input/jigsaw-agile-community-rules/"
    
    # Use Experiment 7's proven config
    EPOCHS = 3
    LEARNING_RATE = 2e-5  
    MAX_LENGTH = 512
    BATCH_SIZE = 8

class JigsawDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
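        # Compare against None so the check does not depend on the truthiness of the labels container
        if self.labels is not None: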
            item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])

def main(seed, output_suffix):
    print(f"\n{'='*60}")
    print(f"Training with seed: {seed}")
    print(f"{'='*60}\n")
    
    seed_everything(seed)
    
    training_data_df = get_dataframe_to_train(CFG.data_path)
    # Reshuffle with the new seed
    training_data_df = training_data_df.sample(frac=1, random_state=seed).reset_index(drop=True)
    print(f"Training dataset size: {len(training_data_df)}")

    test_df_for_prediction = pd.read_csv(f"{CFG.data_path}/test.csv")
    
    training_data_df['body_with_url'] = training_data_df['body'].apply(lambda x: x + url_to_semantics(x))
    training_data_df['input_text'] = training_data_df['rule'] + "[SEP]" + training_data_df['body_with_url']

    tokenizer = AutoTokenizer.from_pretrained(CFG.model_name_or_path)
    train_encodings = tokenizer(training_data_df['input_text'].tolist(), truncation=True, padding=True, max_length=CFG.MAX_LENGTH)
    train_labels = training_data_df['rule_violation'].tolist()
    train_dataset = JigsawDataset(train_encodings, train_labels)

    model = AutoModelForSequenceClassification.from_pretrained(CFG.model_name_or_path, num_labels=2)
    
    output_dir = f"./deberta_seed_{seed}"
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=CFG.EPOCHS,
        learning_rate=CFG.LEARNING_RATE,
        per_device_train_batch_size=CFG.BATCH_SIZE,
        warmup_ratio=0.1,
        weight_decay=0.01,
        report_to="none",
        save_strategy="no",
        logging_steps=1,
        seed=seed,
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
    )
    
    trainer.train()

    test_df_for_prediction['body_with_url'] = test_df_for_prediction['body'].apply(lambda x: x + url_to_semantics(x))
    test_df_for_prediction['input_text'] = test_df_for_prediction['rule'] + "[SEP]" + test_df_for_prediction['body_with_url']
    
    test_encodings = tokenizer(test_df_for_prediction['input_text'].tolist(), truncation=True, padding=True, max_length=CFG.MAX_LENGTH)
    test_dataset = JigsawDataset(test_encodings)
    
    predictions = trainer.predict(test_dataset)
    probs = torch.nn.functional.softmax(torch.tensor(predictions.predictions), dim=1)[:, 1].numpy()

    submission_df = pd.DataFrame({
        "row_id": test_df_for_prediction["row_id"],
        "rule_violation": probs
    })
    
    output_file = f"submission_deberta_{output_suffix}.csv"
    submission_df.to_csv(output_file, index=False)
    print(f"\nSaved predictions to: {output_file}")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python train_deberta_seed.py <seed> <output_suffix>")
        sys.exit(1)
    
    seed = int(sys.argv[1])
    output_suffix = sys.argv[2]
    main(seed, output_suffix)

## Training with 3 Different Seeds

We'll train the same DeBERTa v3 model three times with different random seeds.
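
The three runs below are launched one cell at a time. As an alternative, the same invocations could be generated in a loop. The following is a minimal sketch: `build_command` is a hypothetical helper, and it assumes `train_deberta_seed.py` sits in the working directory and takes `<seed> <output_suffix>` as its two arguments, as in the script above. The `subprocess.run` call is left commented out so the sketch only prints the commands.

```python
import sys
from typing import List

SEEDS = [42, 123, 456]


def build_command(seed: int, script: str = "train_deberta_seed.py") -> List[str]:
    """Return the argv list for one training run (hypothetical helper)."""
    return [sys.executable, script, str(seed), f"seed{seed}"]


commands = [build_command(seed) for seed in SEEDS]

for cmd in commands:
    # Show the command without the interpreter path
    print(" ".join(cmd[1:]))
    # import subprocess; subprocess.run(cmd, check=True)  # uncomment to launch the run
```

Running the seeds in separate cells, as this notebook does, has the advantage that a failed run can be retried without repeating the others.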

In [None]:
# Train with seed 42 (baseline from Experiment 7)
!python train_deberta_seed.py 42 seed42

In [None]:
# Train with seed 123
!python train_deberta_seed.py 123 seed123

In [None]:
# Train with seed 456
!python train_deberta_seed.py 456 seed456

## Ensemble the Three Models
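
The ensemble cell rank-normalizes each seed's predictions before averaging, so a run whose probabilities are systematically higher or lower does not dominate the mean. AUC depends only on the ordering of scores, which makes ranks a natural common scale. The sketch below reproduces the idea in plain Python on made-up numbers; `average_ranks` and `rank_average` are hypothetical helpers written for illustration, with ties sharing the mean of their positions as in `Series.rank(method='average')`.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # i and j are 0-based positions
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_normalize(values):
    """Scale ranks into (0, 1) by dividing by n + 1."""
    n = len(values)
    return [r / (n + 1) for r in average_ranks(values)]


def rank_average(*prediction_lists):
    """Equal-weight mean of the rank-normalized predictions, row by row."""
    normalized = [rank_normalize(p) for p in prediction_lists]
    return [sum(col) / len(col) for col in zip(*normalized)]


# Made-up predictions from three hypothetical seeds on four rows
seed_a = [0.10, 0.90, 0.40, 0.40]
seed_b = [0.02, 0.30, 0.10, 0.20]
seed_c = [0.50, 0.99, 0.60, 0.70]

ensemble = rank_average(seed_a, seed_b, seed_c)
print([round(x, 4) for x in ensemble])
```

Although `seed_b` outputs much smaller probabilities than `seed_c`, both contribute identical normalized ranks here because they order the four rows the same way.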

In [None]:
import pandas as pd
import numpy as np

# Load all three predictions
print("Loading predictions from 3 different seeds...")
pred1 = pd.read_csv('submission_deberta_seed42.csv')
pred2 = pd.read_csv('submission_deberta_seed123.csv')
pred3 = pd.read_csv('submission_deberta_seed456.csv')

print(f"All predictions loaded. Shape: {pred1.shape}")

# Rank normalization
def rank_normalize(series):
    return series.rank(method='average') / (len(series) + 1)

r1 = rank_normalize(pred1['rule_violation'])
r2 = rank_normalize(pred2['rule_violation'])
r3 = rank_normalize(pred3['rule_violation'])

# Equal weight ensemble
ensemble = (r1 + r2 + r3) / 3.0

print(f"\nEnsemble Statistics:")
print(f"  Mean: {ensemble.mean():.4f}")
print(f"  Std: {ensemble.std():.4f}")
print(f"  Min: {ensemble.min():.4f}")
print(f"  Max: {ensemble.max():.4f}")

# Measure how much the rank-normalized predictions disagree across seeds
variance = np.var([r1, r2, r3], axis=0).mean()
print(f"  Average variance across seeds: {variance:.6f}")
print("  (Lower means the seeds agree more, so averaging changes the ranking less)")

In [None]:
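# Create final submission
# All three files come from the same test.csv, so row order should match; fail fast if it does not.
assert pred1['row_id'].equals(pred2['row_id']) and pred1['row_id'].equals(pred3['row_id'])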
submission = pd.DataFrame({
    'row_id': pred1['row_id'],
    'rule_violation': ensemble
})

submission.to_csv('/kaggle/working/submission.csv', index=False)

print("\n=== Submission Created ===")
print(f"Saved to: /kaggle/working/submission.csv")
print(f"\nFirst 10 rows:")
print(submission.head(10))

print("\n=== Expected Performance ===")
print(f"Target: 0.918-0.919 AUC")
print(f"Rationale: Seed diversity reduces variance without adding complexity")
print(f"Risk Level: Low (same proven config from Exp 7, just repeated)")

## Summary

This notebook implements a **simple yet effective** improvement strategy:

### What We Did:
- Used Experiment 7's proven DeBERTa v3 config (3 epochs, LR=2e-5)
- Trained 3 times with different seeds (42, 123, 456)
- Rank-normalized each run's predictions and averaged them with equal weights

### Why This Should Work:
1. **Proven base**: Experiment 7 scored 0.917 with seed 42
2. **Variance reduction**: Averaging runs that settled in different local optima smooths out seed-specific noise
3. **No added complexity**: Same simple training, no over-optimization
4. **TTA-like effect**: Seed diversity at training time plays the role that input augmentation plays in TTA

### Philosophy:
Sometimes the best improvement is not adding complexity, but adding **diversity** to what already works.