# AI Text Detection Model Training for Vertex AI

This notebook fine-tunes a RoBERTa model with LoRA (Low-Rank Adaptation) to detect AI-generated text, packaged to run as a custom training job on Google Cloud Vertex AI.

## Overview
1. Install Required Packages
2. Set up Google Cloud Project
3. Create Training Module
4. Create setup.py
5. Build and Upload Training Package
6. Launch Vertex AI Training Job

## 1. Install Required Packages

First, install the Vertex AI SDK for Python and the Cloud Storage client library:

In [None]:
# Quote version specifiers so the shell does not treat ">" as a redirect
!pip install --upgrade google-cloud-aiplatform "google-cloud-storage>=2.8.0"

## 2. Set up Google Cloud Project

Configure your Google Cloud project and create necessary directories:

In [None]:
import os

# Set your Google Cloud project ID and region
PROJECT_ID = "your-project-id"  # Replace with your project ID
REGION = "us-central1"  # Replace with your desired region
BUCKET_NAME = "your-bucket-name"  # Replace with your GCS bucket name

# Create trainer package directory structure
!mkdir -p trainer
!touch trainer/__init__.py

## 3. Create Training Module

Create the main training module (trainer/task.py) that will be executed by Vertex AI:

In [None]:
%%writefile trainer/task.py

import os
import argparse
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification, Trainer, TrainingArguments
from datasets import load_from_disk
from peft import get_peft_model, LoraConfig, TaskType

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-dir', type=str, required=True)
    parser.add_argument('--train-data', type=str, required=True)
    parser.add_argument('--valid-data', type=str, required=True)
    parser.add_argument('--epochs', type=int, default=3)
    parser.add_argument('--batch-size', type=int, default=8)
    return parser.parse_args()

def train_model(args):
    # Label configuration
    label2id = {"HUMAN": 0, "AI": 1}
    id2label = {0: "HUMAN", 1: "AI"}

    # Initialize model and tokenizer
    model_name = "roberta-base"
    tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
    model = RobertaForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label
    )

    # Load datasets
    tokenized_training_dataset = load_from_disk(args.train_data)
    tokenized_validation_dataset = load_from_disk(args.valid_data)

    # Configure LoRA
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1
    )
    peft_model = get_peft_model(model, peft_config)

    # Training arguments
    training_args = TrainingArguments(
        output_dir=args.model_dir,
        eval_strategy="epoch",
        save_strategy="epoch",
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        num_train_epochs=args.epochs,
        weight_decay=0.01,
        report_to="tensorboard"
    )

    # Initialize trainer
    trainer = Trainer(
        model=peft_model,
        args=training_args,
        train_dataset=tokenized_training_dataset,
        eval_dataset=tokenized_validation_dataset
    )

    # Train, then save the LoRA adapter (not the full base model) and the tokenizer
    trainer.train()
    peft_model.save_pretrained(args.model_dir)
    tokenizer.save_pretrained(args.model_dir)

if __name__ == '__main__':
    args = get_args()
    train_model(args)

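To get a feel for how little of the model LoRA trains, the adapter size for the configuration above can be estimated by hand. The sketch below assumes PEFT's default target modules for RoBERTa (the attention `query` and `value` projections) and that the classification head stays fully trainable under `TaskType.SEQ_CLS`. The helper name `lora_trainable_params` is hypothetical and not part of PEFT.

```python
def lora_trainable_params(hidden_size, num_layers, r, num_labels,
                          targets_per_layer=2):
    """Estimate trainable parameters for LoRA on a RoBERTa-style classifier."""
    # Each adapted hidden x hidden projection gains two low-rank matrices:
    # A (r x hidden) and B (hidden x r).
    per_module = 2 * r * hidden_size
    lora_params = per_module * targets_per_layer * num_layers

    # RoBERTa's classification head: a dense hidden x hidden layer plus an
    # output projection to num_labels, each with a bias.
    head_params = (hidden_size * hidden_size + hidden_size) \
        + (hidden_size * num_labels + num_labels)
    return lora_params, head_params


# roberta-base: 12 layers, hidden size 768; r=8 and 2 labels as in task.py
lora, head = lora_trainable_params(hidden_size=768, num_layers=12, r=8, num_labels=2)
print(f"LoRA matrices:   {lora:,}")
print(f"Classifier head: {head:,}")
print(f"Total trainable: {lora + head:,}")
```

Under these assumptions fewer than one million parameters are trained, against roughly 125 million in the full `roberta-base` model. Calling `peft_model.print_trainable_parameters()` inside `train_model` reports the actual count at runtime.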
## 4. Create setup.py

Create the setup.py file for packaging the training application:

In [None]:
%%writefile setup.py

from setuptools import find_packages
from setuptools import setup

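REQUIRED_PACKAGES = [
    'datasets>=2.15.0',
    'torch>=2.1.0',
    'transformers>=4.41.0',  # eval_strategy replaced evaluation_strategy in 4.41
    'numpy>=1.24.0,<2.0.0',
    'pandas>=2.0.0',
    'accelerate>=0.23.0',
    'peft>=0.6.0',
    'tensorboard'  # required by report_to="tensorboard" in task.py
]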

setup(
    name='ai_text_detection_trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='AI Text Detection training application for Vertex AI'
)

## 5. Build and Upload Training Package

Build the training package and upload it to Google Cloud Storage:

In [None]:
# Build the package
!python setup.py sdist --formats=gztar

# Upload to GCS
!gsutil cp dist/ai_text_detection_trainer-0.1.tar.gz gs://$BUCKET_NAME/trainer/

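Before uploading, it can help to confirm that the archive actually contains the trainer module, since `find_packages()` silently skips any directory that lacks an `__init__.py`. A minimal sketch using only the standard library; the helper name `sdist_has_module` is hypothetical, and the archive path is the one produced by the build step above.

```python
import os
import tarfile


def sdist_has_module(archive_path, module_path="trainer/task.py"):
    """Return True if the sdist archive contains the given module file."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # sdist members are prefixed with "<name>-<version>/"
        return any(name.endswith("/" + module_path) for name in archive.getnames())


archive_path = "dist/ai_text_detection_trainer-0.1.tar.gz"
if os.path.exists(archive_path):
    print("trainer/task.py packaged:", sdist_has_module(archive_path))
else:
    print(f"{archive_path} not found; run the build step first")
```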
## 6. Launch Vertex AI Training Job

Configure and launch the training job on Vertex AI:

In [None]:
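from google.cloud import aiplatform

# Initialize Vertex AI; the staging bucket holds job artifacts
aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=f"gs://{BUCKET_NAME}"
)

# Configure the training job from the package uploaded in step 5.
# Check the Vertex AI prebuilt container list for current image tags.
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name="ai-text-detection-training",
    python_package_gcs_uri=f"gs://{BUCKET_NAME}/trainer/ai_text_detection_trainer-0.1.tar.gz",
    python_module_name="trainer.task",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-1.py310:latest"
)

# Launch the training job. Machine and accelerator settings are arguments of
# run(), not of the constructor. Paths use the /gcs/<bucket> mount so the
# trainer can read and write them as local directories.
job.run(
    args=[
        f"--model-dir=/gcs/{BUCKET_NAME}/model_output",
        f"--train-data=/gcs/{BUCKET_NAME}/data/tokenized_training",
        f"--valid-data=/gcs/{BUCKET_NAME}/data/tokenized_validation",
        "--epochs=3",
        "--batch-size=8"
    ],
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    sync=True
)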

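Vertex AI custom training jobs expose Cloud Storage buckets as local directories under `/gcs/` through Cloud Storage FUSE. That matters here because `TrainingArguments(output_dir=...)` and `save_pretrained` write to the local filesystem rather than to `gs://` URIs. The mapping can be sketched as follows; the helper name `gcs_uri_to_fuse_path` is hypothetical and not part of any SDK.

```python
def gcs_uri_to_fuse_path(uri):
    """Map gs://bucket/path to the /gcs/bucket/path mount used inside Vertex AI jobs."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    return "/gcs/" + uri[len(prefix):]


print(gcs_uri_to_fuse_path("gs://your-bucket-name/model_output"))
# /gcs/your-bucket-name/model_output
```

Passing the `/gcs/...` form in the job arguments lets `load_from_disk` and the `Trainer` treat the bucket like an ordinary directory.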
## Next Steps

After the training job completes:
1. Find the LoRA adapter weights and tokenizer files under `model_output` in your GCS bucket
2. Download the adapter and evaluate it locally
3. Deploy the model to a Vertex AI endpoint for predictions
4. Monitor the model's performance in production

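For the local evaluation in step 2, note that `task.py` passes no `compute_metrics` function to the `Trainer`, so evaluation during training reports only the loss. Below is a minimal sketch of accuracy, precision, recall, and F1 for the `AI` class (label 1) in plain Python; the function name `binary_metrics` is hypothetical. It accepts any sequence of logit rows, so it could also be wrapped as `compute_metrics=lambda p: binary_metrics(p.predictions, p.label_ids)`.

```python
def binary_metrics(logits, labels, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    # Predicted class = index of the largest logit in each row
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    pairs = list(zip(preds, labels))

    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    correct = sum(1 for p, y in pairs if p == y)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example: four texts, label 1 = AI, label 0 = HUMAN
example_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
example_labels = [0, 1, 0, 1]
print(binary_metrics(example_logits, example_labels))
```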
Remember to clean up resources when they're no longer needed to avoid unnecessary charges.