# AI Text Detection Model Training

This notebook implements a fine-tuning pipeline for a RoBERTa model using LoRA (Low-Rank Adaptation) to detect AI-generated text. The model is trained to classify text as either human-written or AI-generated.

## Overview of the Training Process
1. Setup and Dependencies Installation
2. Data Preparation and Tokenization
3. Model Configuration with LoRA
4. Training Process
5. Model Saving and Export

This approach uses Parameter-Efficient Fine-Tuning (PEFT): the pre-trained RoBERTa weights stay frozen and only a small set of added parameters is trained, which reduces the memory and compute needed compared with full fine-tuning.

## 1. Dependencies Installation

First, we'll install all required packages. These include:
- `transformers`: For the RoBERTa model and tokenizer
- `datasets`: For efficient data handling
- `peft`: For Parameter-Efficient Fine-Tuning
- Supporting libraries for deep learning and data processing

In [None]:
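# Quote each requirement so the shell does not treat ">" and "<" as redirections.
# transformers>=4.41.0 is required for the `eval_strategy` argument used in the training configuration.
!pip install "datasets>=2.15.0" "python-dotenv>=1.0.0" "torch>=2.1.0" "transformers>=4.41.0" \
    "numpy>=1.24.0,<2.0.0" "pandas>=2.0.0" "hf-xet>=1.1.2" "accelerate>=0.23.0" "peft>=0.6.0"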

## 2. Import Required Libraries

We import the necessary components:
- `RobertaTokenizerFast`: Fast tokenizer implementation for RoBERTa
- `RobertaForSequenceClassification`: Pre-trained RoBERTa model adapted for classification
- `Trainer` and `TrainingArguments`: HuggingFace's training utilities
- `LoraConfig`, `get_peft_model`, and `TaskType`: PEFT components for configuring and applying LoRA

In [None]:
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification, Trainer, TrainingArguments
from datasets import load_from_disk
from peft import get_peft_model, LoraConfig, TaskType

## 3. Label Configuration

We define our label mappings for binary classification:
- HUMAN (0): Represents human-written text
- AI (1): Represents AI-generated text

We create bidirectional mappings (`label2id` and `id2label`) for easy conversion between text labels and numeric indices.

In [None]:
label2id = {"HUMAN": 0, "AI": 1}
id2label = {0: "HUMAN", 1: "AI"}

def map_labels_of_dataframe(frame):
    """Convert the text labels in frame["label"] to numeric indices.

    The frame is modified in place; labels missing from label2id become NaN.
    """
    frame["label"] = frame["label"].map(label2id)
    return frame
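
As a small illustration of the two mappings, the sketch below derives `id2label` from `label2id` and checks that encoding and decoding round-trip. The helper `encode_labels` is hypothetical (it is not part of the notebook) and raises on unknown labels instead of silently producing missing values.

```python
label2id = {"HUMAN": 0, "AI": 1}
# Derive the reverse mapping so the two dictionaries cannot drift apart.
id2label = {idx: name for name, idx in label2id.items()}


def encode_labels(labels):
    """Map text labels to ids, raising on anything outside label2id."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise ValueError(f"Unknown labels: {unknown}")
    return [label2id[name] for name in labels]


encoded = encode_labels(["HUMAN", "AI", "AI"])
decoded = [id2label[idx] for idx in encoded]
print(encoded)  # [0, 1, 1]
print(decoded)  # ['HUMAN', 'AI', 'AI']
```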

## 4. Tokenizer and Dataset Preparation

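We load the RoBERTa tokenizer and the datasets that were tokenized ahead of time and saved to disk. `tokenize_function` is not called in this notebook; it records the tokenization settings:
- Pads every sequence to the model's maximum length (512 tokens for `roberta-base`)
- Truncates sequences that exceed that length
- Gives every example the same shape, so batches need no further padding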

In [None]:
model_name = "roberta-base"
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)

def tokenize_function(examples):
    """Tokenize text with padding and truncation"""
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Load pre-processed datasets
print("Loading tokenized datasets...")
tokenized_training_dataset = load_from_disk("data/tokenized_training")
tokenized_validation_dataset = load_from_disk("data/tokenized_validation")
print(f"Training samples: {len(tokenized_training_dataset)}")
print(f"Validation samples: {len(tokenized_validation_dataset)}")
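
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
PAD_ID = 1        # pad token id used by roberta-base
MAX_LENGTH = 512  # maximum sequence length of roberta-base


def pad_or_truncate(input_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return fixed-length input_ids and the matching attention_mask."""
    kept = list(input_ids[:max_length])  # cut sequences that are too long
    n_pad = max_length - len(kept)       # fill sequences that are too short
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Made-up token ids, with a small max_length so the result is easy to read.
short_example = pad_or_truncate([0, 100, 200, 2], max_length=8)
long_example = pad_or_truncate(list(range(10)), max_length=8)

print(short_example["input_ids"])       # [0, 100, 200, 2, 1, 1, 1, 1]
print(short_example["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
print(len(long_example["input_ids"]))   # 8
```

The attention mask is what lets the model ignore the padded positions.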

## 5. Model Initialization

We initialize the RoBERTa model for sequence classification with:
- Binary classification setup (2 labels)
- Proper label mappings
- Base RoBERTa architecture

In [None]:
model = RobertaForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label2id),
    label2id=label2id,
    id2label=id2label
)

## 6. Training Configuration

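We set up the training arguments with:
- Evaluation and checkpoint saving at the end of every epoch
- A batch size of 8 per device for both training and evaluation, to keep memory use modest
- 3 training epochs
- Weight decay of 0.01 for regularization
- Logs written under `./results/logs`, with external experiment trackers disabled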

In [None]:
output_dir = "./results"

training_args = TrainingArguments(
    output_dir=output_dir,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_dir=f'{output_dir}/logs',
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none"  # Disable all experiment-tracking integrations (e.g. wandb)
)
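
The batch size and epoch count determine how many optimizer steps a run takes. The sketch below works through that arithmetic for a single device with no gradient accumulation, which is what the arguments above imply when one GPU (or the CPU) is used. The helper name and the dataset size of 10,000 are hypothetical.

```python
import math


def total_optimizer_steps(num_samples, per_device_batch_size=8, num_epochs=3):
    """Optimizer steps for a single device without gradient accumulation."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_samples / per_device_batch_size)
    return steps_per_epoch * num_epochs


print(total_optimizer_steps(10_000))  # 3750
print(total_optimizer_steps(10_001))  # 3753
```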

## 7. LoRA Configuration

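We configure LoRA (Low-Rank Adaptation) for efficient fine-tuning:
- Task type: sequence classification
- Rank (`r`): 8, the inner dimension of the low-rank update matrices
- Alpha (`lora_alpha`): 32, so the low-rank update is scaled by `lora_alpha / r = 4`
- Dropout (`lora_dropout`): 0.1, applied to the input of the LoRA layers for regularization

LoRA freezes the pre-trained weights and trains only small low-rank matrices added to selected layers, so far fewer parameters are updated than in full fine-tuning.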

In [None]:
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1
)

peft_model = get_peft_model(model, peft_config)
# Display the number of trainable parameters
peft_model.print_trainable_parameters()
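
To give a rough feel for the savings, the following sketch counts the parameters that LoRA adds. It assumes the adapters are attached to the query and value projections of each of the 12 encoder layers of `roberta-base` (hidden size 768); that target-module choice is an assumption here, since the configuration above does not set `target_modules` explicitly. The number printed by `print_trainable_parameters()` will be larger, because it also counts the classification head, which stays trainable.

```python
def lora_params(in_features, out_features, r):
    """Parameters in one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


hidden_size = 768      # roberta-base hidden size
num_layers = 12        # roberta-base encoder layers
r = 8
lora_alpha = 32
adapted_per_layer = 2  # assumption: query and value projections

per_adapter = lora_params(hidden_size, hidden_size, r)
adapter_total = num_layers * adapted_per_layer * per_adapter
full_matrix = hidden_size * hidden_size  # weights in one frozen projection
scaling = lora_alpha / r                 # factor applied to the low-rank update

print(per_adapter)    # 12288
print(full_matrix)    # 589824
print(adapter_total)  # 294912
print(scaling)        # 4.0
```

Under these assumptions each adapter holds about 2% of the weights of the projection it modifies.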

## 8. Training Process

We initialize the trainer with our LoRA-configured model and start the training process.
The trainer will:
- Handle batching and optimization
- Perform evaluation after each epoch (validation loss only, since no `compute_metrics` function is passed)
- Save checkpoints
- Log training metrics

In [None]:
peft_lora_finetuning_trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_training_dataset,
    eval_dataset=tokenized_validation_dataset
)

print("Starting training...")
training_results = peft_lora_finetuning_trainer.train()
print("Training completed!")
print(f"Final training metrics: {training_results.metrics}")
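
If classification metrics are wanted in addition to the loss, `Trainer` accepts a `compute_metrics` callable. The function below is one possible sketch, not part of the original notebook. It assumes the label ids defined earlier (AI = 1 is treated as the positive class) and uses only NumPy.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy, precision, recall and F1 with AI (label 1) as the positive class."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": float(np.mean(preds == labels)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Tiny hand-made example: four predictions, two of them correct.
example_logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.2, 0.4]])
example_labels = np.array([0, 1, 0, 1])
example_metrics = compute_metrics((example_logits, example_labels))
print(example_metrics)
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the values then appear alongside the evaluation loss after each epoch.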

## 9. Save the Fine-tuned Model

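Finally, we save both the model and tokenizer for future use:
- The model is saved as a LoRA adapter: `save_pretrained` writes the adapter weights and configuration rather than a full copy of `roberta-base`, so the base model is needed again at load time
- The tokenizer is saved to ensure consistent preprocessing at inference time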

In [None]:
model_output_dir = "./finetuned_roberta"
print(f"Saving model to {model_output_dir}...")
peft_model.save_pretrained(model_output_dir)
tokenizer.save_pretrained(model_output_dir)
print("Model and tokenizer saved successfully!")

## Next Steps

After training, you can:
1. Evaluate the model on a test set
2. Use the model for inference on new text
3. Analyze the model's performance metrics
4. Tune hyperparameters if needed

The saved adapter can be loaded for inference with the same `peft` and `transformers` libraries by applying it on top of the `roberta-base` base model.
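
A minimal sketch of that loading step is shown below, assuming the adapter and tokenizer were saved to `./finetuned_roberta` as above. The helper names (`logits_to_prediction`, `load_detector`, `classify`) are hypothetical, and the heavy libraries are imported inside the functions so the softmax helper can be tried without them.

```python
import math

id2label = {0: "HUMAN", 1: "AI"}


def logits_to_prediction(logits):
    """Apply a softmax to the logits and return (label, probability)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def load_detector(adapter_dir="./finetuned_roberta", base_name="roberta-base"):
    """Rebuild the base classifier and attach the saved LoRA adapter."""
    from peft import PeftModel
    from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

    base = RobertaForSequenceClassification.from_pretrained(
        base_name,
        num_labels=len(id2label),
        id2label=id2label,
        label2id={name: idx for idx, name in id2label.items()},
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    tokenizer = RobertaTokenizerFast.from_pretrained(adapter_dir)
    return model.eval(), tokenizer


def classify(text, model, tokenizer):
    """Return (label, probability) for a single piece of text."""
    import torch

    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return logits_to_prediction(logits)


# The softmax helper can be checked on its own with made-up logits.
print(logits_to_prediction([-1.0, 2.0]))
```

With the saved files in place, `model, tokenizer = load_detector()` followed by `classify("some text", model, tokenizer)` would return the predicted label and its probability.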