# Machine Translation with GPT-OSS Models - Tutorial

This notebook demonstrates how to fine-tune GPT-OSS models for machine translation using the provided toolkit.

## Quick Start Guide

### 1. Installation

In [None]:
# Install required packages (%pip installs into the environment of the running kernel)
%pip install torch transformers datasets trl peft accelerate sacrebleu pandas

### 2. Import Libraries

In [None]:
import pandas as pd
import torch
from datasets import load_dataset
import subprocess
import os

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

### 3. Prepare Your Data

Create a CSV file with 'source' and 'target' columns. The five pairs below only illustrate the file format; a usable model needs far more data.

In [None]:
# Create sample translation data
sample_data = {
    'source': [
        "Hello, how are you today?",
        "I would like to order a coffee, please.",
        "The weather is beautiful today.",
        "Can you help me find the train station?",
        "Thank you for your help."
    ],
    'target': [
        "Hola, ¿cómo estás hoy?",
        "Me gustaría pedir un café, por favor.",
        "El clima está hermoso hoy.",
        "¿Puedes ayudarme a encontrar la estación de tren?",
        "Gracias por tu ayuda."
    ]
}

df = pd.DataFrame(sample_data)
df.to_csv('my_translation_data.csv', index=False)
print("Sample data created!")
print(df)

### 4. Train the Model

Use the machine translation script to train your model:

In [None]:
# Training command
cmd = [
    "python", "machine_translation.py",
    "--config", "configs/mt_lora.yaml",
    "--source_lang", "en",
    "--target_lang", "es",
    "--dataset_name", "csv",
    "--dataset_config", "data_files=my_translation_data.csv",
    "--num_train_epochs", "3",
    "--per_device_train_batch_size", "2",
    "--output_dir", "./my_translator"
]

print("Starting training...")
print(" ".join(cmd))

# Run training (uncomment to execute)
# result = subprocess.run(cmd, capture_output=True, text=True)
# print(result.stdout)
# Warnings and progress bars also go to stderr, so check the exit code instead
# if result.returncode != 0:
#     print("Training failed:", result.stderr)
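The call above collects all output and prints it only after the process exits, which is inconvenient for a long training run. A minimal sketch of an alternative that echoes output line by line and returns the exit code is shown here; `run_and_stream` is a hypothetical helper written for this tutorial, not part of the toolkit.

```python
import subprocess
import sys


def run_and_stream(command):
    """Run a command, echo its output line by line, and return the exit code.

    Hypothetical helper for illustration; not part of the toolkit.
    """
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so log lines appear in order
        text=True,
    )
    for line in process.stdout:
        print(line, end="")
    return process.wait()


# Demo with a harmless command; pass the training command list instead.
exit_code = run_and_stream([sys.executable, "-c", "print('training would run here')"])
if exit_code != 0:
    print(f"Command failed with exit code {exit_code}")
```

Passing `cmd` from the training cell in place of the demo command would start the actual run.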

### 5. Test Your Model

Use the demo script to test translations:

In [None]:
import shlex

# Test translation
test_cmd = [
    "python", "demo_translation.py",
    "--model_path", "./my_translator",
    "--text", "Good morning, how can I help you?"
]

print("Testing translation...")
# shlex.join quotes the --text value so the printed command can be pasted into a shell
print(shlex.join(test_cmd))

# Run test (uncomment to execute)
# result = subprocess.run(test_cmd, capture_output=True, text=True)
# print(result.stdout)

### 6. Batch Translation

Translate multiple sentences from a file:

In [None]:
# Create test input file
test_sentences = [
    "Welcome to our store.",
    "What time do you close?",
    "I need help with my order.",
    "The food was delicious."
]

with open('test_sentences.txt', 'w', encoding='utf-8') as f:
    for sentence in test_sentences:
        f.write(sentence + '\n')

print("Test sentences saved to test_sentences.txt")

In [None]:
# Batch translation command
batch_cmd = [
    "python", "generate_translation.py",
    "--model_path", "./my_translator",
    "--input_file", "test_sentences.txt",
    "--output_file", "translations.txt",
    "--source_lang", "en",
    "--target_lang", "es"
]

print("Batch translation command:")
print(" ".join(batch_cmd))

# Run batch translation (uncomment to execute)
# result = subprocess.run(batch_cmd, capture_output=True, text=True)
# print(result.stdout)
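To review the results, it helps to view each input sentence next to its translation. The sketch below assumes that the output file contains one translation per line, in the same order as the input file; that layout is an assumption about `generate_translation.py`, so check the actual output first. The helper names and the Spanish lines in the demo are placeholders, not model output.

```python
from pathlib import Path
import tempfile


def read_lines(path):
    """Return all lines of a UTF-8 text file, stripped of surrounding whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines()]


def pair_translations(input_path, output_path):
    """Pair each source sentence with the translation on the same line.

    Assumes one translation per line, in input order (not confirmed by the toolkit).
    """
    sources = read_lines(input_path)
    translations = read_lines(output_path)
    if len(sources) != len(translations):
        raise ValueError(
            f"{len(sources)} source lines but {len(translations)} translated lines"
        )
    return list(zip(sources, translations))


# Demo on throwaway files so the sketch runs without a trained model.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "test_sentences.txt"
    out = Path(tmp) / "translations.txt"
    src.write_text("Welcome to our store.\nWhat time do you close?\n", encoding="utf-8")
    # Placeholder translations written by hand for the demo.
    out.write_text("Bienvenido a nuestra tienda.\n¿A qué hora cierran?\n", encoding="utf-8")
    pairs = pair_translations(src, out)

for source, translation in pairs:
    print(f"{source} -> {translation}")
```

Empty lines are kept on purpose: dropping them would shift every later pair if the model produced an empty translation.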

### 7. Evaluate Your Model

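Calculate BLEU scores and other metrics. The command below reuses the training CSV to keep the example short; for a meaningful score, point `--dataset_config` at data the model was not trained on: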

In [None]:
# Evaluation command
eval_cmd = [
    "python", "evaluate_translation.py",
    "--model_path", "./my_translator",
    "--dataset_name", "csv",
    "--dataset_config", "data_files=my_translation_data.csv",
    "--source_lang", "en",
    "--target_lang", "es",
    "--output_file", "evaluation_results.json"
]

print("Evaluation command:")
print(" ".join(eval_cmd))

# Run evaluation (uncomment to execute)
# result = subprocess.run(eval_cmd, capture_output=True, text=True)
# print(result.stdout)

## Configuration Options

### Language Pairs
- English ↔ Spanish: `--source_lang en --target_lang es`
- English ↔ French: `--source_lang en --target_lang fr`
- German ↔ English: `--source_lang de --target_lang en`

### Training Modes
- **LoRA (Recommended)**: `--config configs/mt_lora.yaml`
- **Full Fine-tuning**: `--config configs/mt_full.yaml`
- **Memory Optimized**: `--config configs/mt_lora_memory_optimized.yaml`

### Dataset Formats
- **CSV**: `--dataset_name csv --dataset_config data_files=your_data.csv`
- **WMT**: `--dataset_name wmt14 --dataset_config de-en`
- **OPUS**: `--dataset_name opus100 --dataset_config en-es`

## Tips for Better Results

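1. **Data Quality**: Use accurate, diverse translation pairs; noisy or misaligned pairs teach the model the wrong mappings
2. **Data Size**: More data generally leads to better results; the five-sentence sample in step 3 only demonstrates the file format
3. **Domain Matching**: Train on data similar to your use case
4. **Hyperparameters**: Tune the learning rate, batch size, and number of epochs for your data; `--num_train_epochs` and `--per_device_train_batch_size` can be set on the command line as in step 4
5. **Evaluation**: Always evaluate on held-out test data, that is, pairs the model did not see during training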
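One simple way to follow the held-out advice is to split the CSV before training. The sketch below uses only the standard library; the file names `train_split.csv` and `test_split.csv`, the 20% test fraction, and the helper names are illustrative assumptions.

```python
import csv
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    if len(rows) < 2:
        raise ValueError("need at least two rows to make a split")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(path, rows):
    """Write rows (dicts with 'source' and 'target') to a UTF-8 CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["source", "target"])
        writer.writeheader()
        writer.writerows(rows)


# Same pairs as the sample data in step 3.
rows = [
    {"source": "Hello, how are you today?", "target": "Hola, ¿cómo estás hoy?"},
    {"source": "I would like to order a coffee, please.", "target": "Me gustaría pedir un café, por favor."},
    {"source": "The weather is beautiful today.", "target": "El clima está hermoso hoy."},
    {"source": "Can you help me find the train station?", "target": "¿Puedes ayudarme a encontrar la estación de tren?"},
    {"source": "Thank you for your help.", "target": "Gracias por tu ayuda."},
]

train_rows, test_rows = split_rows(rows)
write_csv("train_split.csv", train_rows)  # hypothetical file names
write_csv("test_split.csv", test_rows)
print(f"{len(train_rows)} training pairs, {len(test_rows)} held-out pairs")
```

Training would then use `--dataset_config data_files=train_split.csv` and evaluation `--dataset_config data_files=test_split.csv`.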

## Troubleshooting

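- **Memory Issues**: Lower `--per_device_train_batch_size` or switch to `configs/mt_lora_memory_optimized.yaml`
- **CSV Parsing Errors**: Ensure the file is valid CSV with a header row containing 'source' and 'target'; fields that contain commas must be quoted
- **Poor Translations**: Try more training epochs or more and better-quality data
- **CUDA Errors**: Check free GPU memory (for example with `nvidia-smi`) and confirm that the installed PyTorch build supports your CUDA driver; the check in step 2 should print `CUDA available: True`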
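Many CSV problems can be caught before training starts. The sketch below checks for the two required columns and for empty cells, using only the standard library. `find_csv_problems` is a hypothetical helper, and the checks are a guess at what commonly goes wrong, not the validation performed by the training script.

```python
import csv
import io

REQUIRED_COLUMNS = ("source", "target")


def find_csv_problems(csv_file):
    """Return a list of human-readable problems found in an open CSV file."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for row in reader:
        for column in REQUIRED_COLUMNS:
            # Short rows yield None for absent cells, hence the `or ""`.
            if not (row[column] or "").strip():
                problems.append(f"line {reader.line_num}: empty '{column}'")
    return problems


# In-memory demo: the second data row has no translation.
sample = io.StringIO(
    "source,target\n"
    '"Hello, how are you today?","Hola, ¿cómo estás hoy?"\n'
    "Thank you for your help.,\n"
)
problems = find_csv_problems(sample)
print(problems or "No problems found")
```

To check a real file, open it with `open('my_translation_data.csv', newline='', encoding='utf-8')` and pass the file object to the function.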

## Next Steps

1. **Scale Up**: Use larger datasets and models
2. **Multi-language**: Train models for multiple language pairs
3. **Domain Adaptation**: Fine-tune for specific domains (medical, legal, etc.)
4. **Production**: Deploy your model using the provided scripts

For more details, see the `MACHINE_TRANSLATION_README.md` and `EXAMPLE_WALKTHROUGH.md` files.