# üáªüá≥ Vietnamese GEC with Contrastive Learning - Google Colab (BARTpho Fixed)

**✅ Bug Fix Applied:** Fixed the `'BartphoTokenizer' object has no attribute 'vocab'` error

Complete pipeline for training Vietnamese Grammatical Error Correction models with Contrastive Learning.

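## 🐛 Recent Fixes:
- **BARTpho Tokenizer Fix**: Resolved the vocabulary access issue for SentencePiece tokenizers such as BARTpho's, which has no `.vocab` attribute
- **Improved Compatibility**: Model loading handles BARTpho, ViT5, and generic seq2seq checkpoints
- **Error Handling**: Vocabulary checks fall back to `get_vocab()`, and a failed check skips special-token addition with a warning instead of raising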
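
The vocabulary fix described above can be illustrated without loading a model. The sketch below is a minimal, standalone version of the idea: prefer `get_vocab()`, fall back to `.vocab`, and otherwise treat the vocabulary as empty. `get_vocab_safely`, `missing_tokens`, and the stub tokenizer class are hypothetical names used only for illustration; the stub stands in for a tokenizer that, like BARTpho's, has no `.vocab` attribute.

```python
from typing import Dict, List


def get_vocab_safely(tokenizer) -> Dict[str, int]:
    """Return a token-to-id mapping without assuming a `.vocab` attribute."""
    get_vocab = getattr(tokenizer, "get_vocab", None)
    if callable(get_vocab):
        return dict(get_vocab())
    vocab = getattr(tokenizer, "vocab", None)
    return dict(vocab) if vocab is not None else {}


def missing_tokens(tokenizer, wanted: List[str]) -> List[str]:
    """List the tokens from `wanted` that the tokenizer does not know yet."""
    vocab = get_vocab_safely(tokenizer)
    return [tok for tok in wanted if tok not in vocab]


class StubSentencePieceTokenizer:
    """Hypothetical stand-in: exposes get_vocab() but no .vocab attribute."""

    def get_vocab(self) -> Dict[str, int]:
        return {"<s>": 0, "</s>": 1, "<gec>": 2}


# Only '</gec>' is unknown to the stub, so only it would need to be added.
print(missing_tokens(StubSentencePieceTokenizer(), ["<gec>", "</gec>"]))
```

With a real tokenizer, the returned list is what would be passed to `tokenizer.add_tokens(...)` before resizing the model embeddings.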

## 📋 Pipeline Overview:
0. **Setup & Installation** - Install dependencies and create project structure
1. **Data Preparation** - Load and preprocess viGEC dataset
2. **Base Model Training** - Fine-tune BARTpho/ViT5 with hyperparameter optimization
3. **Negative Sample Generation** - Generate negative samples for contrastive learning
4. **Contrastive Learning Training** - Train with contrastive loss + R-Drop
5. **Inference** - Correct text with contrastive search or beam search
6. **Evaluation** - Test and evaluate the model

⏰ **Estimated Total Time**: 4-9 hours (depending on GPU)
🔧 **BARTpho Issue**: RESOLVED ✅

## 🚀 Setup and Installation

In [None]:
# Install required packages
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install transformers==4.36.0 datasets==2.15.0 accelerate==0.25.0
!pip install optuna==3.4.0 wandb==0.16.0 lightning==2.1.0
!pip install sentencepiece tokenizers nltk sacrebleu evaluate rouge-score
!pip install pandas numpy scikit-learn tqdm rich omegaconf hydra-core
!pip install underthesea pyvi ipywidgets matplotlib seaborn

In [None]:
# 🔧 BARTpho Fix Verification Test
# Run this cell right after installation to verify the fix is working

import sys
import traceback
from pathlib import Path

print("🧪 Testing BARTpho tokenizer fix...")

# Create the fixed data_utils.py file directly in Colab
data_utils_fixed = '''"""
Data utilities for Vietnamese GEC with viGEC dataset - FIXED VERSION
"""

import os
import re
import unicodedata
from typing import Dict, List, Tuple, Optional
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset, Dataset as HFDataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, T5ForConditionalGeneration, T5Tokenizer
import logging
import traceback  # used by test_tokenizer_vocab_access below
from rich.console import Console
from rich.progress import track

console = Console()
logger = logging.getLogger(__name__)

def get_model_and_tokenizer(model_name: str):
    """Get model and tokenizer for Vietnamese GEC - FIXED VERSION"""
    
    console.print(f"[bold blue]Loading model: {model_name}[/bold blue]")
    
    if 'bartpho' in model_name.lower():
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        console.print("[green]✅ BARTpho model loaded[/green]")
        
    elif 'vit5' in model_name.lower():
        tokenizer = T5Tokenizer.from_pretrained(model_name)
        model = T5ForConditionalGeneration.from_pretrained(model_name)
        
        # Add task prefix for ViT5
        if not hasattr(tokenizer, 'task_prefix'):
            tokenizer.task_prefix = "grammatical error correction: "
            console.print(f"[yellow]Added ViT5 task prefix: {tokenizer.task_prefix}[/yellow]")
        
        console.print("[green]✅ ViT5 model loaded[/green]")
        
    else:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        console.print("[green]✅ Generic model loaded[/green]")
    
    # FIXED: Add special tokens if needed - Safe vocabulary checking
    special_tokens = ['<gec>', '</gec>']
    
    try:
        if hasattr(tokenizer, 'vocab'):
            # Standard tokenizers (BERT, etc.)
            vocab = tokenizer.vocab
            console.print("[blue]Using .vocab attribute[/blue]")
        elif hasattr(tokenizer, 'get_vocab'):
            # SentencePiece tokenizers (BARTpho, etc.)
            vocab = tokenizer.get_vocab()
            console.print(f"[blue]Using .get_vocab() method - vocab size: {len(vocab)}[/blue]")
        else:
            # Fallback: assume all tokens are new
            vocab = {}
            console.print("[yellow]No vocab access method found, skipping token check[/yellow]")
        
        new_tokens = [token for token in special_tokens if token not in vocab]
        
        if new_tokens:
            tokenizer.add_tokens(new_tokens)
            model.resize_token_embeddings(len(tokenizer))
            console.print(f"[yellow]Added {len(new_tokens)} new tokens: {new_tokens}[/yellow]")
        else:
            console.print("[blue]No new tokens needed[/blue]")
            
    except Exception as e:
        console.print(f"[yellow]Warning: Could not check vocabulary - {e}[/yellow]")
        console.print("[yellow]Skipping special token addition[/yellow]")
    
    return model, tokenizer

def test_tokenizer_vocab_access(model_name: str):
    """Test vocabulary access for different tokenizer types"""
    console.print(f"[bold]Testing vocabulary access for: {model_name}[/bold]")
    
    try:
        model, tokenizer = get_model_and_tokenizer(model_name)
        
        # Test tokenization
        test_text = "Tôi đi học trường đại học."
        if hasattr(tokenizer, 'task_prefix'):
            test_text = tokenizer.task_prefix + test_text

        tokens = tokenizer(test_text, return_tensors="pt")
        console.print(f"[green]✅ Tokenization successful - shape: {tokens['input_ids'].shape}[/green]")

        # Test vocabulary access methods
        if hasattr(tokenizer, 'vocab'):
            console.print("[green]✅ Has .vocab attribute[/green]")
        elif hasattr(tokenizer, 'get_vocab'):
            vocab = tokenizer.get_vocab()
            console.print(f"[green]✅ Has .get_vocab() method - size: {len(vocab)}[/green]")
        else:
            console.print("[yellow]⚠️ No standard vocab access method[/yellow]")

        return True

    except Exception as e:
        console.print(f"[red]❌ Error: {e}[/red]")
        traceback.print_exc()
        return False
'''

# Write the fixed file
with open('data_utils_fixed.py', 'w', encoding='utf-8') as f:
    f.write(data_utils_fixed)

print("✅ Created data_utils_fixed.py")

# Test the fix
try:
    from data_utils_fixed import test_tokenizer_vocab_access, get_model_and_tokenizer
    
    print("\n🔍 Testing BARTpho tokenizer...")
    success_bartpho = test_tokenizer_vocab_access("vinai/bartpho-syllable")

    if success_bartpho:
        print("🎉 BARTpho tokenizer fix is working!")
    else:
        print("❌ BARTpho test failed")

except Exception as e:
    print(f"❌ Test failed: {e}")
    traceback.print_exc()

print("\n" + "="*50)
# Report the status with the same prefix whether the test passed or failed
status = "✅ READY FOR USE" if locals().get("success_bartpho") else "❌ NEEDS ATTENTION"
print(f"🎯 BARTpho Fix Status: {status}")
print("="*50)

In [None]:
# Clone the repository (if needed)
# !git clone https://github.com/your-repo/CL_GEC.git
# %cd CL_GEC

# Or upload files directly to Colab
import os
os.makedirs('./models', exist_ok=True)
os.makedirs('./data', exist_ok=True)
os.makedirs('./evaluation_results', exist_ok=True)

print("📁 Directories created successfully!")

In [None]:
# Upload all Python files to Colab
# Use the file upload button in Colab to upload:
# - data_utils.py
# - base_trainer.py  
# - negative_sampler.py
# - contrastive_trainer.py
# - inference.py
# - evaluator.py
# - evaluate_model.py

# Verify files are uploaded
required_files = [
    'data_utils.py', 'base_trainer.py', 'negative_sampler.py',
    'contrastive_trainer.py', 'inference.py', 'evaluator.py', 'evaluate_model.py'
]

for file in required_files:
    if os.path.exists(file):
        print(f"✅ {file} found")
    else:
        print(f"❌ {file} missing - please upload this file")

## 📊 Step 1: Data Preparation

In [None]:
# Import necessary modules
import torch
import wandb
from rich.console import Console
from data_utils import load_vigec_dataset, save_processed_data, get_model_and_tokenizer

console = Console()

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
console.print(f"🔥 Using device: {device}")

if torch.cuda.is_available():
    console.print(f"GPU: {torch.cuda.get_device_name(0)}")
    console.print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Log in to Weights & Biases for experiment tracking
# (prompts for an API key if none is stored yet)
wandb.login()
console.print("📈 Wandb setup complete!")

In [None]:
# Load and preprocess viGEC dataset
console.print("📥 Loading viGEC dataset...")

# Load the dataset
data = load_vigec_dataset(dataset_name="phuhuy-se1/viGEC")

# Save processed data
save_processed_data(data, "./data/processed")

console.print("✅ Data preprocessing completed!")
for split, split_data in data.items():
    console.print(f"  {split}: {len(split_data)} samples")
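
The implementation of `load_vigec_dataset` is not shown in this notebook. As a rough sketch of the kind of cleanup such a loader might apply (an assumption, and `normalize_vietnamese` is a hypothetical name), Unicode NFC normalization is worth doing for Vietnamese because the same accented letter can be stored either as one code point or as a base letter followed by combining marks. Without normalization, two visually identical sentences can compare as different strings.

```python
import re
import unicodedata


def normalize_vietnamese(text: str) -> str:
    """NFC-normalize and collapse whitespace (assumed preprocessing step)."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


# "o" + combining circumflex (U+0302) and "o" + combining dot below (U+0323)
# are composed into single code points by NFC.
decomposed = "To\u0302i  \u0111i  ho\u0323c "
composed = normalize_vietnamese(decomposed)
print(repr(composed), len(decomposed), len(composed))
```

Applying the same function to both the source and the target side keeps the training pairs consistent with whatever the tokenizer saw during pretraining.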

## 🎯 Step 2: Base Model Training with Hyperparameter Optimization

In [None]:
# Choose your base model
# Options: "vinai/bartpho-syllable", "vinai/bartpho-word", "VietAI/vit5-base", "VietAI/vit5-large"

MODEL_NAME = "vinai/bartpho-syllable"  # Change this as needed

console.print(f"🤖 Selected model: {MODEL_NAME}")

In [None]:
from base_trainer import BaseTrainer

# Create base trainer
base_trainer = BaseTrainer(
    model_name=MODEL_NAME,
    data_dir="./data/processed",
    output_dir="./models/base",
    hyperopt=True  # Set to False to skip hyperparameter optimization
)

console.print("🏗️ Base trainer initialized!")

In [None]:
# Start base model training
# This will:
# 1. Run hyperparameter optimization (30 trials)
# 2. Train final model with best parameters
# 3. Save model and tokenizer

console.print("🚀 Starting base model training...")
console.print("⏰ This may take 2-4 hours depending on your setup")

base_trainer.train()

console.print("✅ Base model training completed!")

## 🎭 Step 3: Negative Sample Generation

In [None]:
from negative_sampler import NegativeSampleGenerator

# Create negative sample generator using the trained base model
BASE_MODEL_PATH = "./models/base/final"

console.print("🎭 Initializing negative sample generator...")

generator = NegativeSampleGenerator(
    model_path=BASE_MODEL_PATH,
    device="auto"
)

console.print("✅ Negative sample generator ready!")

In [None]:
from data_utils import load_processed_data
import os

# Load processed data
data = load_processed_data("./data/processed")

# Generate contrastive datasets
os.makedirs("./data/contrastive", exist_ok=True)

console.print("🔄 Generating negative samples...")
console.print("⏰ This may take 1-2 hours depending on dataset size")

for split in ['train', 'validation']:
    if split in data:
        console.print(f"Processing {split} split...")
        
        output_path = f"./data/contrastive/{split}_contrastive.json"
        
        contrastive_data = generator.generate_contrastive_dataset(
            data[split],
            output_path,
            batch_size=8,
            max_samples=None  # Set to smaller number for testing, e.g., 1000
        )
        
        # Analyze quality
        generator.analyze_negatives_quality(contrastive_data, sample_size=5)

console.print("✅ Negative sample generation completed!")
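
How `NegativeSampleGenerator` selects negatives is not shown in this notebook. One common approach, sketched here purely as an assumption, is to decode several candidates from the base model and keep the distinct ones that differ from the reference correction. `pick_negatives` and its arguments are hypothetical names, and the candidate strings are made-up placeholders.

```python
from typing import List


def pick_negatives(candidates: List[str], reference: str, k: int = 3) -> List[str]:
    """Keep up to k distinct candidates that differ from the reference."""
    if k <= 0:
        return []
    target = reference.strip()
    seen = set()
    negatives: List[str] = []
    for cand in candidates:
        cand = cand.strip()
        # Skip empty outputs, exact matches with the reference, and repeats.
        if not cand or cand == target or cand in seen:
            continue
        seen.add(cand)
        negatives.append(cand)
        if len(negatives) == k:
            break
    return negatives


# Placeholder candidates, ordered as a decoder might return them.
beams = ["the cat sat", "the cat sit", "the cat sit", "", "a cat sat"]
print(pick_negatives(beams, reference="the cat sat", k=2))
```

Keeping candidate order means the highest-scoring wrong outputs are used first, which tends to give harder negatives than random corruptions.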

## 🔄 Step 4: Contrastive Learning Training

In [None]:
from contrastive_trainer import ContrastiveTrainer

# Create contrastive trainer
contrastive_trainer = ContrastiveTrainer(
    base_model_path=BASE_MODEL_PATH,
    contrastive_data_dir="./data/contrastive",
    output_dir="./models/contrastive",
    hyperopt=True  # Set to False to skip hyperparameter optimization
)

console.print("🔄 Contrastive trainer initialized!")

In [None]:
# Start contrastive learning training
# This will:
# 1. Run hyperparameter optimization for λ, γ, k
# 2. Train final model with contrastive loss + R-Drop
# 3. Save final contrastive model

console.print("🚀 Starting contrastive learning training...")
console.print("⏰ This may take 1-3 hours")

contrastive_trainer.train()

console.print("✅ Contrastive learning training completed!")

## 🔮 Step 5: Inference with Contrastive Search

In [None]:
from inference import GECInference

# Load the trained contrastive model
CONTRASTIVE_MODEL_PATH = "./models/contrastive/final"

# Create inference engines
console.print("🔮 Initializing inference engines...")

# Contrastive search inference
contrastive_inference = GECInference(
    model_path=CONTRASTIVE_MODEL_PATH,
    use_contrastive_search=True,
    contrastive_alpha=0.7,
    contrastive_k=5
)

# Beam search inference for comparison
beam_inference = GECInference(
    model_path=CONTRASTIVE_MODEL_PATH,
    use_contrastive_search=False
)

console.print("✅ Inference engines ready!")

In [None]:
# Test inference with sample texts
test_texts = [
    "Tôi đi học trường đại học.",
    "Hôm nay tôi không đi làm.",
    "Cô ấy rất đẹp và thông minh.",
    "Chúng tôi sẽ đi du lịch vào tuần tới.",
    "Anh ấy làm việc ở công ty lớn."
]

console.print("🧪 Testing inference on sample texts...")

for i, text in enumerate(test_texts):
    console.print(f"\n[bold cyan]Example {i+1}:[/bold cyan]")
    console.print(f"[yellow]Original:[/yellow] {text}")
    
    # Contrastive search
    contrastive_result = contrastive_inference.correct_text(text)
    console.print(f"[green]Contrastive:[/green] {contrastive_result}")
    
    # Beam search
    beam_result = beam_inference.correct_text(text)
    console.print(f"[blue]Beam:[/blue] {beam_result}")

In [None]:
# Interactive correction (optional)
# Uncomment to enable interactive mode

# console.print("🎮 Interactive mode - Enter text to correct (type 'quit' to exit):")
# contrastive_inference.interactive_correction()

## 📊 Step 6: Comprehensive Evaluation

In [None]:
from evaluate_model import ModelEvaluator

# Create model evaluator
evaluator = ModelEvaluator(
    model_path=CONTRASTIVE_MODEL_PATH,
    data_dir="./data/processed",
    output_dir="./evaluation_results"
)

console.print("📊 Model evaluator initialized!")

In [None]:
# Run comprehensive evaluation
console.print("🔍 Starting comprehensive evaluation...")
console.print("⏰ This may take 30-60 minutes")

# Evaluate on test set with different decoding strategies
evaluation_results = evaluator.evaluate_on_test_set(
    max_samples=None,  # Set to smaller number for testing, e.g., 500
    batch_size=8
)

console.print("✅ Evaluation completed!")

In [None]:
# Error type analysis
console.print("🔬 Running error type analysis...")

error_analysis = evaluator.evaluate_error_types(
    max_samples=1000  # Limit for faster analysis
)

console.print("✅ Error type analysis completed!")

In [None]:
# Display evaluation visualizations
from IPython.display import Image, display
import os

# Show evaluation comparison plot
plot_path = "./evaluation_results/evaluation_comparison.png"
if os.path.exists(plot_path):
    console.print("📈 Evaluation Comparison Visualization:")
    display(Image(plot_path))
else:
    console.print("❌ Visualization not found")

In [None]:
# Show evaluation results summary
import pandas as pd

# Load and display comparison table
csv_path = "./evaluation_results/strategy_comparison.csv"
if os.path.exists(csv_path):
    df = pd.read_csv(csv_path)
    console.print("📋 Strategy Comparison Results:")
    display(df)
else:
    console.print("❌ Comparison table not found")
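
The metrics computed by `ModelEvaluator` are not shown in this notebook. To make the idea of comparing decoding strategies concrete, the sketch below uses sentence-level exact match, chosen here only because it is simple to verify by hand. It is an illustrative assumption, not necessarily a metric the evaluator reports, and the function names and sample strings are hypothetical.

```python
from typing import Dict, List


def exact_match_rate(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions identical to their reference after stripping."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def compare_strategies(outputs: Dict[str, List[str]],
                       references: List[str]) -> Dict[str, float]:
    """Score each decoding strategy's outputs against the same references."""
    return {name: exact_match_rate(preds, references)
            for name, preds in outputs.items()}


# Placeholder outputs for two strategies on a two-sentence test set.
references = ["a b c", "d e f"]
outputs = {"beam": ["a b c", "d e"], "contrastive": ["a b c", "d e f "]}
print(compare_strategies(outputs, references))
```

Exact match is strict: a correction that fixes most errors but misses one still scores zero, so it is usually read alongside softer overlap metrics.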

## 📁 Results and Model Export

In [None]:
# Summary of trained models and results
console.print("\n[bold green]🎉 Training Pipeline Completed Successfully![/bold green]")

console.print("\n📁 [bold]Generated Models and Results:[/bold]")
console.print("  📦 Base Model: ./models/base/final")
console.print("  🔄 Contrastive Model: ./models/contrastive/final")
console.print("  📊 Evaluation Results: ./evaluation_results/")
console.print("  🎭 Contrastive Data: ./data/contrastive/")

# List all generated files
import os

def list_files_recursive(directory):
    files = []
    for root, dirs, filenames in os.walk(directory):
        for filename in filenames:
            files.append(os.path.join(root, filename))
    return files

console.print("\n📋 [bold]All Generated Files:[/bold]")

for directory in ['./models', './evaluation_results', './data/contrastive']:
    if os.path.exists(directory):
        files = list_files_recursive(directory)
        console.print(f"\n  📂 {directory}:")
        for file in files[:10]:  # Show first 10 files
            console.print(f"    📄 {file}")
        if len(files) > 10:
            console.print(f"    ... and {len(files) - 10} more files")

In [None]:
# Download models and results (for local use)
# Uncomment to create zip files for download

# import shutil

# console.print("📦 Creating downloadable archives...")

# # Create zip files
# shutil.make_archive('contrastive_gec_model', 'zip', './models/contrastive/final')
# shutil.make_archive('evaluation_results', 'zip', './evaluation_results')

# console.print("✅ Archives created:")
# console.print("  📦 contrastive_gec_model.zip - Trained model")
# console.print("  📦 evaluation_results.zip - Evaluation results")
# console.print("\n💾 Use the file browser to download these files")

## 🚀 Quick Usage Guide

Once training is complete, you can use the model for inference:

In [None]:
# Quick usage example
console.print("🚀 [bold]Quick Usage Example:[/bold]")

# Load the model
from inference import GECInference

# Initialize
gec_model = GECInference(
    model_path="./models/contrastive/final",
    use_contrastive_search=True
)

# Correct text
text = "Tôi đi học trường đại học."
corrected = gec_model.correct_text(text)

console.print(f"Original: {text}")
console.print(f"Corrected: {corrected}")

console.print("\n💡 [bold]Usage Tips:[/bold]")
console.print("  🎯 Use use_contrastive_search=True for better quality")
console.print("  ⚡ Use use_contrastive_search=False for faster inference")
console.print("  📊 Adjust contrastive_alpha and contrastive_k to tune decoding")
console.print("  📁 Process files with correct_file() method")

## 📝 Configuration and Hyperparameters

Key hyperparameters used in this pipeline:

### Base Training:
- **Learning Rate**: Optimized via Optuna (typically 1e-5 to 1e-4)
- **Label Smoothing**: 0.1
- **Batch Size**: 8-32 (depending on GPU memory)
- **Max Length**: 384 tokens
- **Epochs**: 5-10
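
To show how a search over these ranges can be set up, here is a small sketch that uses plain random search from the standard library as a stand-in for Optuna, so it runs without extra dependencies. The learning rate is drawn log-uniformly, which corresponds to `trial.suggest_float(..., log=True)` in Optuna. `sample_config`, `random_search`, and `toy_objective` are hypothetical names, and the toy objective only mimics a validation loss; a real objective would train and evaluate the model.

```python
import math
import random
from typing import Callable, Dict, Tuple


def sample_config(rng: random.Random) -> Dict[str, float]:
    """Draw one configuration from the ranges listed above."""
    log_lr = rng.uniform(math.log(1e-5), math.log(1e-4))
    return {
        "learning_rate": math.exp(log_lr),  # log-uniform in [1e-5, 1e-4]
        "batch_size": rng.choice([8, 16, 32]),
        "num_epochs": rng.randint(5, 10),
        "label_smoothing": 0.1,  # fixed, not searched
    }


def random_search(objective: Callable[[Dict[str, float]], float],
                  n_trials: int = 30,
                  seed: int = 0) -> Tuple[Dict[str, float], float]:
    """Return the sampled configuration with the lowest objective value."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = objective(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_objective(cfg: Dict[str, float]) -> float:
    """Placeholder for validation loss, minimized near lr = 10**-4.5."""
    return abs(math.log10(cfg["learning_rate"]) + 4.5)


best, score = random_search(toy_objective)
print(best, round(score, 4))
```

Sampling the learning rate on a log scale spreads trials evenly across orders of magnitude instead of clustering them near the upper bound.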

### Contrastive Learning:
- **λ (lambda_cl)**: 1.0 (weight of the contrastive loss relative to the cross-entropy loss)
- **γ (temperature)**: 0.25 (contrastive loss temperature)
- **R-Drop α**: 4.0 (R-Drop regularization strength)
- **Epochs**: 3-5
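
The exact loss in `contrastive_trainer.py` is not shown in this notebook. A common way to combine these three terms, sketched here as an assumption, is `CE + λ · CL + α · KL`: `CL` is an InfoNCE-style contrastive loss with temperature γ, and `KL` is the symmetric KL divergence between two dropout passes, as in R-Drop. The function names and scalar inputs below are illustrative; a real trainer would compute these over batched tensors.

```python
import math
from typing import Sequence


def info_nce(pos_sim: float, neg_sims: Sequence[float],
             temperature: float = 0.25) -> float:
    """-log softmax of the positive similarity against the negatives."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


def symmetric_kl(p: Sequence[float], q: Sequence[float],
                 eps: float = 1e-12) -> float:
    """Average of KL(p||q) and KL(q||p) for two discrete distributions."""
    kl_pq = sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
    kl_qp = sum(b * math.log((b + eps) / (a + eps)) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def total_loss(ce: float, cl: float, kl: float,
               lambda_cl: float = 1.0, rdrop_alpha: float = 4.0) -> float:
    """Combine cross-entropy, contrastive, and R-Drop terms."""
    return ce + lambda_cl * cl + rdrop_alpha * kl


# Toy values: the positive is closer to the source than the negatives.
cl = info_nce(pos_sim=0.9, neg_sims=[0.2, 0.1])
kl = symmetric_kl([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
print(total_loss(ce=1.2, cl=cl, kl=kl))
```

A lower temperature sharpens the softmax, so the loss penalizes negatives that are close to the positive more strongly.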

### Contrastive Search:
- **α (alpha)**: 0.7 (balance between confidence and diversity)
- **k**: 5 (top-k candidates)
- **Beam Size**: 1 (as recommended in paper)
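
To make the roles of α and k concrete, the sketch below shows the usual contrastive search selection rule for one decoding step: among the k most probable tokens, pick the one that maximizes `(1 - α) · probability - α · max_similarity`, where the similarity is measured against the representations of previously generated tokens. How `GECInference` implements this is not shown in this notebook, so the names and toy numbers here are assumptions. In Hugging Face `generate`, the corresponding arguments are `penalty_alpha` and `top_k`.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    token: str
    prob: float      # model probability of the token at this step
    max_sim: float   # highest cosine similarity to any earlier token state


def contrastive_score(c: Candidate, alpha: float = 0.7) -> float:
    """Trade model confidence off against similarity to the context."""
    return (1.0 - alpha) * c.prob - alpha * c.max_sim


def pick_token(candidates: List[Candidate], alpha: float = 0.7, k: int = 5) -> str:
    """Restrict to the top-k most probable tokens, then maximize the score."""
    top_k = sorted(candidates, key=lambda c: c.prob, reverse=True)[:k]
    return max(top_k, key=lambda c: contrastive_score(c, alpha)).token


# Toy step: "a" is more probable but nearly repeats the context.
step = [Candidate("a", 0.6, 0.9), Candidate("b", 0.3, 0.1)]
print(pick_token(step, alpha=0.7))  # degeneration penalty dominates
print(pick_token(step, alpha=0.0))  # reduces to greedy decoding
```

With α = 0 the rule reduces to greedy decoding, while larger values increasingly penalize tokens whose representations repeat the existing context.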

These parameters can be adjusted based on your specific needs and computational resources.