# 🇻🇳 Vietnamese GEC with Contrastive Learning - Google Colab

**Clean & Simple**: Clone the repository and run the training pipeline for Vietnamese Grammatical Error Correction (GEC) with BARTpho or ViT5 plus contrastive learning.

## 📋 Pipeline Overview:
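1. **Setup & Clone Repository** - Install dependencies and clone the source code
2. **Data Preparation & System Check** - Verify the environment, then load and preprocess the viGEC dataset
3. **Model Selection** - Load BARTpho or ViT5 and test tokenization
4. **Base Model Training** - Fine-tune the model, optionally with hyperparameter optimization
5. **Negative Sample Generation** - Generate negative samples for contrastive learning
6. **Contrastive Learning Training** - Train with contrastive loss + R-Drop
7. **Inference & Evaluation** - Test the model and compute F0.5 on the test set
8. **Save & Export** - Save the results and package the models for download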

⏰ **Estimated Total Time**: 4-9 hours (depending on GPU)  
🚀 **Ready to Run**: All import issues fixed, clean codebase

## 🚀 Step 1: Setup and Clone Repository

In [None]:
# Check GPU availability
import torch
print(f"🔥 CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🎮 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️ No GPU available - training will be very slow!")

In [None]:
# Install required packages
print("📦 Installing dependencies...")
!pip install numpy
# Colab already ships with PyTorch; this line reinstalls or updates it
!pip3 install torch torchaudio torchvision torchtext torchdata
!pip install transformers datasets accelerate
!pip install optuna wandb lightning
!pip install sentencepiece tokenizers nltk sacrebleu evaluate rouge-score
!pip install pandas scikit-learn tqdm rich omegaconf hydra-core
!pip install underthesea pyvi ipywidgets matplotlib seaborn
!pip install -U datasets huggingface_hub fsspec
# Quote the extras specifier so the shell does not treat [...] as a glob
!pip install "optuna-integration[pytorch_lightning]"

print("✅ Installation finished. Check the output above for any pip errors.")

In [None]:
# Clone the repository (replace with your actual GitHub repository URL)
import os

# Change this to your actual repository URL
REPO_URL = "https://github.com/YOUR_USERNAME/CL_GEC.git"  # Update this!
PROJECT_DIR = "/content/CL_GEC"

# Clone or update repository
if not os.path.exists(PROJECT_DIR):
    print(f"📥 Cloning repository from {REPO_URL}...")
    !git clone {REPO_URL} {PROJECT_DIR}
else:
    print("📁 Repository already exists, pulling latest changes...")
    %cd {PROJECT_DIR}
    !git pull

# Change to project directory
%cd {PROJECT_DIR}
print(f"📂 Working directory: {os.getcwd()}")

# List files to verify
print("\n📋 Project files:")
!ls -la *.py

## 📊 Step 2: Data Preparation and System Check

## 🔧 New Features & Parameters

### ✨ Enhanced BaseTrainer Features:

1. **📊 Dataset Configuration**:
   - `dataset_name`: Choose dataset version (e.g., "phuhuy-se1/viGEC-v2")
   - `train_subset_ratio`: Use subset of training data (0.0-1.0) 
   - `validation_subset_ratio`: Use subset of validation data
   - `test_subset_ratio`: Use subset of test data

2. **🔍 Customizable Search Space**:
   - Define learning rate ranges
   - Configure weight decay options
   - Set batch size choices
   - Customize warmup ratios

3. **⚡ Flexible Training Modes**:
   - Hyperparameter optimization only
   - Training with specific parameters
   - Combined optimization + training
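
A search space like the one configured in Step 4 could be turned into Optuna suggestions roughly as follows. This is a minimal sketch, not the repository's `BaseTrainer` code: `suggest_from_space` is a hypothetical helper, and `StubTrial` is a stand-in that mimics only the two `optuna.trial.Trial` methods used here (`suggest_float` and `suggest_categorical`) so the block runs without Optuna installed.

```python
import math
import random


def suggest_from_space(trial, space):
    """Hypothetical helper: draw one value per entry of a search-space dict.

    Dict entries ({'low', 'high', optional 'log'}) are treated as float ranges;
    list entries are treated as categorical choices.
    """
    params = {}
    for name, spec in space.items():
        if isinstance(spec, dict):
            params[name] = trial.suggest_float(
                name, spec["low"], spec["high"], log=spec.get("log", False)
            )
        else:
            params[name] = trial.suggest_categorical(name, list(spec))
    return params


class StubTrial:
    """Stand-in for optuna.trial.Trial, for illustration only."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def suggest_float(self, name, low, high, log=False):
        if log:
            # Sample uniformly in log space, as a log-scaled range implies
            return math.exp(self._rng.uniform(math.log(low), math.log(high)))
        return self._rng.uniform(low, high)

    def suggest_categorical(self, name, choices):
        return self._rng.choice(choices)


search_space = {
    "learning_rate": {"low": 1e-5, "high": 5e-4, "log": True},
    "label_smoothing": {"low": 0.0, "high": 0.2},
    "batch_size": [8, 16, 24],
}
params = suggest_from_space(StubTrial(seed=0), search_space)
print(params)
```

Inside a real Optuna objective, the `trial` object passed to the objective function would take the place of `StubTrial`.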

### 💡 Benefits:
- **Faster experimentation** with data subsets
- **Better hyperparameter control** 
- **Dataset version management**
- **Memory-efficient training** for limited resources
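
The subset ratios can be pictured as keeping a random fraction of each split. The sketch below is an assumption about that behaviour, not the code in `data_utils`: `take_subset` is a hypothetical helper, and the real `load_vigec_dataset` may select samples differently (for example, by taking the first rows).

```python
import random


def take_subset(samples, ratio, seed=42):
    """Hypothetical helper: keep a reproducible random fraction of a split."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if not samples:
        return []
    # Keep at least one sample so that a small split never becomes empty
    keep = max(1, int(len(samples) * ratio))
    rng = random.Random(seed)
    return rng.sample(samples, keep)


train = [{"source": f"câu {i}", "target": f"câu {i}"} for i in range(1000)]
subset = take_subset(train, ratio=0.1)
print(len(subset))
```

Fixing the seed keeps the subset identical across runs, which makes comparisons between experiments on the reduced data meaningful.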

In [None]:
# Test imports and system readiness
from data_utils import (
    load_vigec_dataset, 
    get_model_and_tokenizer, 
    can_train_base_model,
    check_dataset_format
)
from rich.console import Console

console = Console()

# Check system readiness
console.print("[bold blue]🔍 System Readiness Check[/bold blue]")
system_ready = can_train_base_model()

# Check dataset format
console.print("\n[bold blue]📋 Dataset Format Check[/bold blue]")
dataset_ready = check_dataset_format()

if system_ready and dataset_ready:
    console.print("\n[bold green]✅ All checks passed! Ready to proceed.[/bold green]")
else:
    console.print("\n[bold red]❌ System not ready. Please check requirements.[/bold red]")

In [None]:
# Load and prepare dataset with configurable parameters
console.print("[bold blue]📊 Loading viGEC Dataset[/bold blue]")

# Dataset configuration - modify these as needed
DATASET_CONFIG = {
    "dataset_name": "phuhuy-se1/viGEC",  # Change to "phuhuy-se1/viGEC-v2" for version 2
    "train_subset_ratio": 0.1,  # Use 10% of training data for faster processing in Colab
    "validation_subset_ratio": 0.2,  # Use 20% of validation data  
    "test_subset_ratio": 0.05   # Use 5% of test data for faster evaluation
}

console.print(f"[yellow]📋 Dataset Configuration:[/yellow]")
for key, value in DATASET_CONFIG.items():
    console.print(f"  {key}: {value}")

# Load dataset with configurable parameters
data = load_vigec_dataset(
    dataset_name=DATASET_CONFIG["dataset_name"],
    train_subset_ratio=DATASET_CONFIG["train_subset_ratio"],
    validation_subset_ratio=DATASET_CONFIG["validation_subset_ratio"],
    test_subset_ratio=DATASET_CONFIG["test_subset_ratio"]
)

console.print(f"\n[green]Dataset loaded successfully![/green]")
for split, split_data in data.items():
    console.print(f"  {split}: {len(split_data)} samples")
    
# Show subset ratios effect
console.print(f"\n[blue]📊 Subset Effects:[/blue]")
console.print(f"  Training samples: ~{len(data['train'])} (subset ratio: {DATASET_CONFIG['train_subset_ratio']})")
console.print(f"  Validation samples: ~{len(data['validation'])} (subset ratio: {DATASET_CONFIG['validation_subset_ratio']})")
console.print(f"  Test samples: ~{len(data['test'])} (subset ratio: {DATASET_CONFIG['test_subset_ratio']})")

# Save processed data
from data_utils import save_processed_data
save_processed_data(data, "./data/processed")
console.print("\n[blue]✅ Data saved to ./data/processed/[/blue]")

## 🤖 Step 3: Model Selection and Testing

In [None]:
# Choose your model - uncomment one of these:
MODEL_NAME = "vinai/bartpho-syllable"  # Recommended for Vietnamese
# MODEL_NAME = "VietAI/vit5-base"     # Alternative option
# MODEL_NAME = "VietAI/vit5-large"    # Larger model (requires more GPU memory)

console.print(f"[bold blue]🤖 Loading Model: {MODEL_NAME}[/bold blue]")

# Load model and tokenizer
model, tokenizer = get_model_and_tokenizer(MODEL_NAME)

console.print(f"[green]✅ Model loaded successfully![/green]")
console.print(f"  Model: {model.__class__.__name__}")
console.print(f"  Tokenizer: {tokenizer.__class__.__name__}")
console.print(f"  Vocabulary size: {len(tokenizer)}")

# Test tokenization
test_text = "Tôi đang học tiếng việt."
tokens = tokenizer(test_text, return_tensors="pt")
console.print(f"\n[blue]🧪 Tokenization Test:[/blue]")
console.print(f"  Input: {test_text}")
console.print(f"  Tokens: {tokens['input_ids'].shape}")

## 🏋️ Step 4: Base Model Training

In [None]:
# Configure training parameters
TRAINING_CONFIG = {
    "model_name": MODEL_NAME,
    "output_dir": "./models/base_model",
    "max_epochs": 3,  # Reduced for Colab
    "batch_size": 8,  # Adjust based on GPU memory
    "use_wandb": True,  # Set to False if you don't want to use Weights & Biases
    "run_optimization": False,  # Set to True for hyperparameter optimization (takes longer)
    
    # New dataset parameters
    "dataset_name": "phuhuy-se1/viGEC",  # Change to "phuhuy-se1/viGEC-v2" for version 2
    "train_subset_ratio": 0.1,  # Use 10% of training data for faster training in Colab
    "validation_subset_ratio": 0.2,  # Use 20% of validation data
    "test_subset_ratio": 0.05,  # Use 5% of test data
    
    # Custom search space for hyperparameter optimization (if enabled)
    "search_space": {
        'learning_rate': {'low': 1e-5, 'high': 5e-4, 'log': True},
        'weight_decay': {'low': 0.001, 'high': 0.05, 'log': True},
        'label_smoothing': {'low': 0.0, 'high': 0.2},
        'batch_size': [8, 16, 24],  # Smaller batch sizes for Colab
        'warmup_ratio': {'low': 0.05, 'high': 0.15}
    }
}

console.print("[bold blue]🏋️ Base Model Training Configuration:[/bold blue]")
for key, value in TRAINING_CONFIG.items():
    if key != "search_space":  # Don't print the search space dict for brevity
        console.print(f"  {key}: {value}")

if TRAINING_CONFIG["run_optimization"]:
    console.print("\n[yellow]⚠️ Hyperparameter optimization enabled - this will take longer but may improve results[/yellow]")
    console.print(f"[blue]Search space configured with {len(TRAINING_CONFIG['search_space'])} parameters[/blue]")
else:
    console.print("\n[blue]ℹ️ Using default parameters for faster training[/blue]")

In [None]:
# Start base model training
from base_trainer import BaseTrainer

console.print("[bold green]🚀 Starting Base Model Training...[/bold green]")

# Create base trainer with enhanced parameters
base_trainer = BaseTrainer(
    model_name=TRAINING_CONFIG["model_name"],
    data_dir="./data/processed",  # Use the processed data directory
    output_dir=TRAINING_CONFIG["output_dir"],
    hyperopt=TRAINING_CONFIG["run_optimization"],  # Enable/disable hyperopt
    use_wandb=TRAINING_CONFIG["use_wandb"],
    
    # New dataset parameters
    dataset_name=TRAINING_CONFIG["dataset_name"],
    train_subset_ratio=TRAINING_CONFIG["train_subset_ratio"],
    validation_subset_ratio=TRAINING_CONFIG["validation_subset_ratio"],
    test_subset_ratio=TRAINING_CONFIG["test_subset_ratio"]
)

# Run the hyperparameter search first if enabled, otherwise train directly
if TRAINING_CONFIG["run_optimization"]:
    console.print("[yellow]🔍 Running hyperparameter optimization...[/yellow]")
    study = base_trainer.optimize_hyperparameters(
        n_trials=5,  # Reduced for Colab (increase to 10-20 for better results)
        batch_size=TRAINING_CONFIG["batch_size"],
        search_space=TRAINING_CONFIG["search_space"]
    )
    console.print(f"[green]✅ Best parameters: {study.best_params}[/green]")
    console.print(f"[green]✅ Best F0.5 score: {study.best_value:.4f}[/green]")
    
    # Train final model with best parameters  
    console.print("[blue]🏃 Training final model with best parameters...[/blue]")
    trained_model = base_trainer.train_with_params(
        params=study.best_params,
        max_epochs=TRAINING_CONFIG["max_epochs"],
        batch_size=study.best_params.get('batch_size', TRAINING_CONFIG["batch_size"])
    )
else:
    console.print("[blue]🏃 Training with default parameters...[/blue]")
    
    # Train the model (hyperopt is controlled by the hyperopt parameter in constructor)
    trained_model = base_trainer.train(
        max_epochs=TRAINING_CONFIG["max_epochs"],
        batch_size=TRAINING_CONFIG["batch_size"],
        search_space=None  # No search space needed for default training
    )

console.print("[bold green]✅ Base model training completed![/bold green]")

## 🎯 Step 5: Negative Sample Generation
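
One common way to obtain negatives for contrastive GEC is to use the model's own incorrect hypotheses: candidates (for example, from beam search) that differ from the reference correction. The sketch below assumes that approach. `pick_negatives` is a hypothetical helper, and `NegativeSampler` in the repository may work differently. The record layout (`source`, `positive`, `negatives`) follows the format used for the validation data in Step 6.

```python
def pick_negatives(source, target, candidates, num_negatives=3):
    """Hypothetical helper: build one contrastive record from model candidates."""
    negatives = []
    for cand in candidates:
        # A negative must differ from the reference and must not be a duplicate
        if cand != target and cand not in negatives:
            negatives.append(cand)
        if len(negatives) == num_negatives:
            break
    # Fall back to the uncorrected source when no wrong hypothesis is available
    if not negatives and source != target:
        negatives.append(source)
    return {"source": source, "positive": target, "negatives": negatives}


record = pick_negatives(
    source="Tôi đang học tiếng việt.",
    target="Tôi đang học tiếng Việt.",
    candidates=[
        "Tôi đang học tiếng Việt.",  # equals the reference, so it is skipped
        "Tôi đang học tiếng việt.",  # unchanged source, kept as a negative
        "Tôi đang học tiếng viêt.",  # wrong hypothesis, kept as a negative
    ],
)
print(record["negatives"])
```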

In [None]:
# Generate negative samples for contrastive learning
from negative_sampler import NegativeSampler
import os

console.print("[bold blue]🎯 Generating Negative Samples...[/bold blue]")

# Create negative sampler (use the final model from training)
base_model_path = os.path.join(TRAINING_CONFIG["output_dir"], "final_model")

# Check if trained model exists
if os.path.exists(base_model_path):
    console.print(f"[green]✅ Using trained model from {base_model_path}[/green]")
    model_path = base_model_path
else:
    console.print(f"[yellow]⚠️ Trained model not found, using base model {MODEL_NAME}[/yellow]")
    model_path = MODEL_NAME

negative_sampler = NegativeSampler(
    model_path=model_path,
    model_name=MODEL_NAME
)

# Generate negative samples for training data
# Use smaller subset for Colab to avoid memory issues
train_subset = data['train'][:1000] if len(data['train']) > 1000 else data['train']

contrastive_data = negative_sampler.generate_contrastive_dataset(
    data=train_subset,
    num_negatives=3,  # Generate 3 negative samples per positive
    output_file="./data/contrastive_train.json"
)

console.print(f"[green]✅ Generated {len(contrastive_data)} contrastive samples![/green]")
console.print("[blue]💾 Saved to ./data/contrastive_train.json[/blue]")

## 🔥 Step 6: Contrastive Learning Training
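
The two loss terms named in this step can be written down compactly. The sketch below is illustrative plain Python, not the `ContrastiveTrainer` implementation. It assumes an InfoNCE-style contrastive term over sequence scores (for example, length-normalised log-likelihoods of the positive and the negatives) and the symmetric-KL consistency term from R-Drop, computed between the output distributions of two dropout-perturbed forward passes. The function names and the temperature are assumptions.

```python
import math


def contrastive_loss(pos_score, neg_scores, temperature=1.0):
    """InfoNCE-style loss: negative log softmax probability of the positive."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[0]


def kl_div(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rdrop_loss(p, q):
    """R-Drop regulariser: average of the two KL directions."""
    return 0.5 * (kl_div(p, q) + kl_div(q, p))


# A positive scored above the negatives gives a small contrastive loss
print(contrastive_loss(pos_score=-0.2, neg_scores=[-1.5, -2.0, -1.1]))
# Identical distributions from both passes give a zero R-Drop penalty
print(rdrop_loss([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))
```

In training, terms like these would typically be added to the usual cross-entropy loss with tunable weights. The weighting used by this repository is not shown in the notebook.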

In [None]:
# Contrastive learning training
from contrastive_trainer import ContrastiveTrainer
import os
import json
import shutil

console.print("[bold blue]🔥 Starting Contrastive Learning Training...[/bold blue]")

# First, we need to prepare the contrastive data in the expected format
contrastive_data_dir = "./data/contrastive"
os.makedirs(contrastive_data_dir, exist_ok=True)

# Convert the contrastive data to the expected format for validation
validation_contrastive = []
for item in data['validation'][:200]:  # Use subset for validation
    validation_contrastive.append({
        'source': item['source'],
        'positive': item['target'],
        'negatives': [item['source']]  # Simple negative sample
    })

# Save validation data
with open(os.path.join(contrastive_data_dir, "validation_contrastive.json"), "w", encoding="utf-8") as f:
    json.dump(validation_contrastive, f, indent=2, ensure_ascii=False)

# Copy training contrastive data to the expected location
if os.path.exists("./data/contrastive_train.json"):
    shutil.copy("./data/contrastive_train.json", 
                os.path.join(contrastive_data_dir, "train_contrastive.json"))

# Create contrastive trainer
contrastive_trainer = ContrastiveTrainer(
    base_model_path=os.path.join(TRAINING_CONFIG["output_dir"], "final_model"),
    contrastive_data_dir=contrastive_data_dir,
    output_dir="./models/contrastive_model",
    hyperopt=False  # Disable hyperopt for faster training in Colab
)

# Train with contrastive learning
contrastive_trainer.train()

console.print("[bold green]✅ Contrastive learning training completed![/bold green]")

## 🧪 Step 7: Inference and Evaluation
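
The evaluation below reports F0.5, which weights precision more heavily than recall. A minimal sketch of the score computed from edit counts follows. `f_beta` is a hypothetical helper, and how `F05Evaluator` extracts and matches edits is not shown in this notebook.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from true positive, false positive and false negative edit counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# 6 correct edits, 2 spurious edits, 4 missed edits
print(round(f_beta(tp=6, fp=2, fn=4), 4))
```

With `beta=0.5`, a system that proposes few but accurate edits scores higher than one that finds the same number of errors while also introducing spurious changes.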

In [None]:
# Load the best model for inference
from inference import GECInference
import os

console.print("[bold blue]🧪 Setting up Inference...[/bold blue]")

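# Determine which model to use for inference
contrastive_model_path = "./models/contrastive_model"
# Steps 5 and 6 load the base model from the "final_model" subdirectory
base_model_path = os.path.join(TRAINING_CONFIG["output_dir"], "final_model")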

if os.path.exists(contrastive_model_path) and os.listdir(contrastive_model_path):
    model_path = contrastive_model_path
    console.print(f"[green]✅ Using contrastive model from {model_path}[/green]")
elif os.path.exists(base_model_path) and os.listdir(base_model_path):
    model_path = base_model_path
    console.print(f"[yellow]⚠️ Using base model from {model_path}[/yellow]")
else:
    model_path = MODEL_NAME
    console.print(f"[blue]ℹ️ Using original model {model_path}[/blue]")

# Create inference engine with the best available model
gec_inference = GECInference(
    model_path=model_path,
    model_name=MODEL_NAME
)

console.print("[green]✅ Inference engine ready![/green]")

In [None]:
# Interactive testing
console.print("[bold blue]🎮 Interactive Testing[/bold blue]")

# Test samples
test_sentences = [
    "Tôi đang học tiếng việt ở trường đại học.",
    "Hôm nay trời rất đẹp và tôi muốn đi chơi.",
    "Cô ấy làm việc tại một công ty lớn ở Hà Nội.",
    "Chúng tôi sẽ đi du lịch vào cuối tuần này."
]

console.print("\n[yellow]📝 Test Results:[/yellow]")
for i, sentence in enumerate(test_sentences, 1):
    corrected = gec_inference.correct_text(sentence)
    console.print(f"\n{i}. Original: {sentence}")
    console.print(f"   Corrected: {corrected}")

# Custom input
console.print("\n[bold cyan]✏️ Try your own text:[/bold cyan]")
print("Enter Vietnamese text to correct (or 'quit' to exit):")

while True:
    user_input = input("> ").strip()
    if user_input.lower() == 'quit':
        break
    if not user_input:
        continue  # ignore empty lines

    corrected = gec_inference.correct_text(user_input)
    print(f"Corrected: {corrected}\n")

In [None]:
# Evaluate on test set
import numpy as np  # used by the fallback metric computation below
from evaluator import F05Evaluator

console.print("[bold blue]📊 Evaluating on Test Set...[/bold blue]")

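# Create evaluator. The constructor signature is not fixed here, so try
# passing the tokenizer first and fall back to the no-argument form.
try:
    evaluator = F05Evaluator(tokenizer=gec_inference.tokenizer)
except (TypeError, AttributeError):
    # The constructor takes no tokenizer, or the inference engine exposes none
    evaluator = F05Evaluator()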

# Evaluate on test set (using subset for faster evaluation)
test_data_subset = data['test'][:100]  # Use 100 samples for evaluation
sources = [item['source'] for item in test_data_subset]
references = [item['target'] for item in test_data_subset]

# Generate predictions
console.print("[yellow]🔮 Generating predictions...[/yellow]")
predictions = []
for i, source in enumerate(sources):
    if i % 20 == 0:  # Progress indicator
        console.print(f"[blue]Processing {i+1}/{len(sources)}...[/blue]")
    pred = gec_inference.correct_text(source)
    predictions.append(pred)

# Calculate metrics
console.print("[yellow]📈 Calculating metrics...[/yellow]")
try:
    # Try batch evaluation first
    results = evaluator.evaluate_batch(predictions, references, sources)
except AttributeError:
    # Fallback to individual evaluation
    f05_scores = []
    for pred, ref, src in zip(predictions, references, sources):
        f05 = evaluator.calculate_f05(src, pred, ref)
        f05_scores.append(f05)
    
    results = {
        "f05_score": np.mean(f05_scores),
        "num_samples": len(f05_scores)
    }

console.print("\n[bold green]📈 Evaluation Results:[/bold green]")
for metric, value in results.items():
    if isinstance(value, float):
        console.print(f"  {metric}: {value:.4f}")
    else:
        console.print(f"  {metric}: {value}")

# Show some examples
console.print("\n[bold cyan]🔍 Sample Results:[/bold cyan]")
for i in range(min(5, len(sources))):
    console.print(f"\n{i+1}. Source: {sources[i]}")
    console.print(f"   Target: {references[i]}")
    console.print(f"   Prediction: {predictions[i]}")

## 💾 Step 8: Save and Export Results

In [None]:
# Save results and create export package
import json
import zipfile
from datetime import datetime

console.print("[bold blue]💾 Saving Results and Creating Export Package...[/bold blue]")

# Create results summary
results_summary = {
    "timestamp": datetime.now().isoformat(),
    "model_name": MODEL_NAME,
    "training_config": TRAINING_CONFIG,
    "evaluation_results": results,
    # Number of test samples actually evaluated in Step 7
    "test_samples": len(test_data_subset),
    "model_paths": {
        "base_model": TRAINING_CONFIG["output_dir"],
        "contrastive_model": "./models/contrastive_model"
    }
}

# Save results
with open("./results_summary.json", "w", encoding="utf-8") as f:
    json.dump(results_summary, f, indent=2, ensure_ascii=False)

console.print("[green]✅ Results saved to ./results_summary.json[/green]")

# Create downloadable package
console.print("[yellow]📦 Creating export package...[/yellow]")

with zipfile.ZipFile("vietnamese_gec_models.zip", "w", zipfile.ZIP_DEFLATED) as zipf:
    # Add results
    zipf.write("results_summary.json")
    
    # Add model files (if they exist). Recent transformers versions save
    # weights as .safetensors rather than .bin, so collect both.
    import glob
    for pattern in ("*.bin", "*.safetensors", "config.json"):
        for model_file in glob.glob(f"./models/**/{pattern}", recursive=True):
            zipf.write(model_file)
    
    # Add data samples
    if os.path.exists("./data/contrastive_train.json"):
        zipf.write("./data/contrastive_train.json")

console.print("[bold green]🎉 Export package created: vietnamese_gec_models.zip[/bold green]")
console.print("[blue]📁 You can download this file from the Colab file browser[/blue]")

# Display final summary
console.print("\n[bold cyan]🏆 Training Pipeline Completed Successfully![/bold cyan]")
console.print("[green]✅ Base model trained and saved[/green]")
console.print("[green]✅ Contrastive learning applied[/green]")
console.print("[green]✅ Model evaluated on test set[/green]")
console.print("[green]✅ Results exported for download[/green]")