# Azerbaijani Lemmatizer - Cloud Training

**Before running:**
1. Enable GPU: Settings → Accelerator → GPU T4 x2
2. Upload code as Kaggle Dataset
3. Upload data files as separate Dataset
4. Add both datasets to this notebook (right sidebar)

**Expected time:** 8-10 hours for 20 epochs
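
As a rough sanity check, this estimate can be compared with the 12-hour session limit mentioned under Troubleshooting. The sketch below assumes every epoch takes about the same time, which may not hold in practice. `epochs_within_session` is a hypothetical helper, not part of the project code.

```python
def epochs_within_session(total_epochs: int, total_hours: int, session_hours: int) -> int:
    """Return how many whole epochs fit in one session, assuming equal epoch times."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return (total_epochs * session_hours) // total_hours

# Slow end of the estimate: 20 epochs in 10 hours, 12-hour session.
print(epochs_within_session(20, 10, 12))  # 24
# Fast end of the estimate: 20 epochs in 8 hours.
print(epochs_within_session(20, 8, 12))   # 30
```

Under that assumption, even the slow case leaves room for the planned 20 epochs in one session.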

## 1. Setup Environment

In [None]:
# Install dependencies
!pip install -q transformers torch pyyaml tqdm sentencepiece

# Verify GPU
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("WARNING: No GPU! Enable it in Settings → Accelerator → GPU")

## 2. Load Code and Data

**Option A:** If you uploaded code as Kaggle Dataset

In [None]:
# Copy code from input dataset
!cp -r /kaggle/input/azerbaijani-lemmatizer/* /kaggle/working/
%cd /kaggle/working

# Verify structure
!ls -la

**Option B:** If you're cloning from GitHub

In [None]:
# Clone repository
!git clone https://github.com/yourusername/azerbaijani-lemmatizer.git
%cd azerbaijani-lemmatizer
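
Whichever option you use, the later cells expect `configs/improved_training.yaml`, `scripts/train.py`, and `scripts/evaluate.py` relative to the current directory. Here is a minimal sketch for checking this programmatically instead of reading the `ls` output. `missing_paths` is a hypothetical helper, and the list of expected files is taken only from the paths this notebook references.

```python
from pathlib import Path

# Files that later cells in this notebook reference (relative to the project root).
EXPECTED = [
    "configs/improved_training.yaml",
    "scripts/train.py",
    "scripts/evaluate.py",
]

def missing_paths(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]

missing = missing_paths(".")
if missing:
    print("Missing from project directory:", ", ".join(missing))
else:
    print("Project layout looks complete")
```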

## 3. Update Config for Kaggle Paths

In [None]:
import yaml

# Load config
config_path = 'configs/improved_training.yaml'
with open(config_path) as f:
    config = yaml.safe_load(f)

# Update paths to use Kaggle input dataset
# IMPORTANT: Replace 'az-lemmatizer-data' with your actual dataset name!
DATA_PATH = '/kaggle/input/az-lemmatizer-data'  # Change this!

config['data']['train_path'] = f'{DATA_PATH}/moraz_500k_train_filtered.json'
config['data']['val_path'] = f'{DATA_PATH}/moraz_500k_val.json'
config['data']['test_path'] = f'{DATA_PATH}/ud_test.json'
config['vocab_path'] = f'{DATA_PATH}/char_vocab.json'

# Save updated config
with open('configs/kaggle_training.yaml', 'w') as f:
    yaml.dump(config, f)

print("✓ Config updated for Kaggle paths")
print(f"\nData directory: {DATA_PATH}")
print("\nVerifying data files...")

import os
for key, path in config['data'].items():
    if isinstance(path, str) and path.endswith('.json'):
        exists = os.path.exists(path)
        print(f"  {key}: {'✓' if exists else '✗'} {path}")

vocab_exists = os.path.exists(config['vocab_path'])
print(f"  vocab: {'✓' if vocab_exists else '✗'} {config['vocab_path']}")
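
The check above only prints a marker per file. To stop before launching a multi-hour run, a sketch along these lines raises an error instead. `require_files` is a hypothetical helper. It assumes the same `config` layout used in this cell: a `data` mapping of paths plus a top-level `vocab_path`.

```python
import os

def require_files(config):
    """Raise FileNotFoundError listing every configured .json path that is missing."""
    candidates = [
        p for p in config.get("data", {}).values()
        if isinstance(p, str) and p.endswith(".json")
    ]
    if "vocab_path" in config:
        candidates.append(config["vocab_path"])
    missing = [p for p in candidates if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError("Missing data files: " + ", ".join(missing))
    return candidates
```

Calling `require_files(config)` right after the paths are updated would surface a wrong dataset name before any training time is spent.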

## 4. Start Training

In [None]:
# Start training
!python scripts/train.py --config configs/kaggle_training.yaml

## 5. Save Results

**IMPORTANT:** Kaggle sessions expire! Save results to /kaggle/working/ so you can download them.

In [None]:
import shutil
import os

# Create output directory
os.makedirs('/kaggle/working/results', exist_ok=True)

# Copy checkpoints
checkpoint_dir = 'checkpoints/improved_training'
if os.path.exists(checkpoint_dir):
    # Copy best model
    if os.path.exists(f'{checkpoint_dir}/best_model.pt'):
        shutil.copy(
            f'{checkpoint_dir}/best_model.pt',
            '/kaggle/working/results/best_model.pt'
        )
        print("✓ Saved best model")
    
    # Copy latest checkpoint
    if os.path.exists(f'{checkpoint_dir}/latest.pt'):
        shutil.copy(
            f'{checkpoint_dir}/latest.pt',
            '/kaggle/working/results/latest.pt'
        )
        print("✓ Saved latest checkpoint")
    
    # Copy training history
    if os.path.exists(f'{checkpoint_dir}/training_history.json'):
        shutil.copy(
            f'{checkpoint_dir}/training_history.json',
            '/kaggle/working/results/training_history.json'
        )
        print("✓ Saved training history")

print("\n📁 Results saved to /kaggle/working/results/")
print("   Download from: Data → Output Files (right sidebar)")

## 6. Quick Evaluation (Optional)

In [None]:
# Quick evaluation on UD test set
!python scripts/evaluate.py \
    --checkpoint /kaggle/working/results/best_model.pt \
    --test-data /kaggle/input/az-lemmatizer-data/ud_test.json \
    --config configs/kaggle_training.yaml \
    --output-dir /kaggle/working/results/evaluation

## 7. Download Results

After training completes:
1. Go to: Data → Output (right sidebar)
2. Download `results/` folder
3. Contains:
   - `best_model.pt` - Best model checkpoint
   - `latest.pt` - Latest checkpoint (for resuming)
   - `training_history.json` - Loss curves
   - `evaluation/` - Evaluation metrics (if you ran evaluation)
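
Downloading a folder file by file can be tedious. One option, sketched here as an assumption rather than part of the original workflow, is to bundle `results/` into a single archive with the standard library's `shutil.make_archive`. `zip_results` is a hypothetical helper name.

```python
import os
import shutil

def zip_results(results_dir="/kaggle/working/results", archive_base=None):
    """Zip results_dir and return the path of the created .zip file."""
    if not os.path.isdir(results_dir):
        raise FileNotFoundError(f"No results directory at {results_dir}")
    # By default, write the archive next to the directory
    # (e.g. /kaggle/working/results.zip), not inside it.
    if archive_base is None:
        archive_base = results_dir.rstrip("/")
    return shutil.make_archive(archive_base, "zip", root_dir=results_dir)
```

Running `zip_results()` after section 5 should leave a `results.zip` in `/kaggle/working/` alongside the folder.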

## Troubleshooting

**Session timeout:**
- Kaggle sessions run for a maximum of 12 hours
- Training should complete in ~8-10 hours
- If it times out, resume from latest checkpoint:
  ```python
  !python scripts/train.py \
      --config configs/kaggle_training.yaml \
      --resume /kaggle/working/results/latest.pt
  ```
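
Note that `/kaggle/working/results/latest.pt` exists only if the cell in section 5 ran before the session ended. The following hedged sketch falls back to the training checkpoint directory used in section 5 (`checkpoints/improved_training`) when the copied file is absent. `find_resume_checkpoint` is a hypothetical helper. Whether either file is still present after a timeout depends on how the session was run.

```python
import os

# Candidate locations, in order of preference (both paths appear earlier in this notebook).
CANDIDATES = [
    "/kaggle/working/results/latest.pt",
    "checkpoints/improved_training/latest.pt",
]

def find_resume_checkpoint(candidates=CANDIDATES):
    """Return the first existing checkpoint path, or None if there is none."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None

checkpoint = find_resume_checkpoint()
if checkpoint is None:
    print("No checkpoint found; training would start from scratch")
else:
    print(f"Resume with: --resume {checkpoint}")
```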

**Data not found:**
- Check that the dataset is added to the notebook (right sidebar)
- Verify that `DATA_PATH` in section 3 matches your dataset name
- List files: `!ls /kaggle/input/`

**No GPU:**
- Settings → Accelerator → GPU T4 x2
- Save settings
- Restart kernel

**Out of memory:**
- Reduce the batch size in the config (section 3, before the config is saved), then rerun training:
  ```python
  config['training']['batch_size'] = 16  # Default is 32
  ```
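
If memory errors persist, the batch size can be lowered step by step. This is a minimal sketch, assuming the `config['training']['batch_size']` key shown above. `halve_batch_size` is a hypothetical helper that only modifies the loaded config dictionary.

```python
def halve_batch_size(config, minimum=1):
    """Halve config['training']['batch_size'] in place, never going below minimum."""
    current = config["training"]["batch_size"]
    config["training"]["batch_size"] = max(minimum, current // 2)
    return config["training"]["batch_size"]

# Stand-in for the loaded YAML config; 32 is the default stated above.
example_config = {"training": {"batch_size": 32}}
print(halve_batch_size(example_config))  # 16
print(halve_batch_size(example_config))  # 8
```

After changing the value, write the config back with `yaml.dump` as in section 3 so that `scripts/train.py` picks it up.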