# ArtEmis Image Captioning - Multi-Model Training (Colab T4 GPU)

## 📋 SETUP CHECKLIST (Complete Before Running)

**Before running this notebook:**

1. **Enable GPU Runtime:**
   - Go to `Runtime` → `Change runtime type` → Select `T4 GPU`

2. **Upload Data to Google Drive:**
   - Create folder: `Google Drive/artemis-captioning/`
   - Upload these from your local machine:
     ```
     artemis-captioning/
     ├── data/
     │   ├── processed/
     │   │   ├── images/        (5000 pre-resized 128x128 images, ~57 MB)
     │   │   ├── splits/        (train.json, val.json, test.json)
     │   │   ├── captions/      (caption JSON files)
     │   │   └── vocabulary.json
     │   └── embeddings/        (optional - glove, word2vec, tfidf)
     ├── utils/                 (all Python files)
     ├── models/                (all Python files)
     └── train.py
     ```

3. **Update `DRIVE_DATA_PATH`** in the "Set up paths" cell of Section 1 if your folder lives elsewhere (default: `/content/drive/MyDrive/artemis-captioning`)

## 🚀 Training Configurations

| Config | Model | Images | Epochs | Est. Time |
|--------|-------|--------|--------|-----------|
| colab_cnn_large | CNN+LSTM (512 embed, 1024 hidden) | 15,000 | 50 | ~3-4 hours |
| colab_vit_standard | ViT (256 embed, 6 layers) | 15,000 | 50 | ~2-3 hours |
| colab_cnn_glove | CNN+LSTM + GloVe embeddings | 15,000 | 40 | ~2-3 hours |

**Total estimated time: ~8-10 hours** (run overnight or train one at a time)

## 1. Setup Environment

In [None]:
# Check GPU
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    raise RuntimeError("No GPU available! Please enable GPU in Runtime -> Change runtime type")

In [None]:
# Install required packages
!pip install -q nltk gensim pillow tqdm

In [None]:
# Download NLTK data
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('wordnet', quiet=True)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Set up paths - MODIFY THIS to match your Drive folder
import os
from pathlib import Path

DRIVE_DATA_PATH = '/content/drive/MyDrive/artemis-captioning'

# Check if data exists
required_files = [
    'data/processed/vocabulary.json',
    'data/processed/splits/train.json',
    'train.py',
    'models/cnn_lstm.py',
    'utils/data_loader.py'
]

print("Checking required files...")
all_found = True
for f in required_files:
    path = os.path.join(DRIVE_DATA_PATH, f)
    if os.path.exists(path):
        print(f"  ✓ {f}")
    else:
        print(f"  ✗ {f} - NOT FOUND")
        all_found = False

# Check for preprocessed images
preprocessed_path = os.path.join(DRIVE_DATA_PATH, 'data/processed/images')
if os.path.exists(preprocessed_path) and os.listdir(preprocessed_path):
    num_imgs = sum(1 for _ in Path(preprocessed_path).rglob('*.jpg'))
    print(f"  ✓ Preprocessed images: {num_imgs} files")
else:
    print("  ⚠ No preprocessed images found - will use raw wikiart")

if not all_found:
    raise FileNotFoundError("Missing required files! See above.")

In [None]:
# Copy data to local storage for faster access
!mkdir -p /content/artemis
!cp -r "{DRIVE_DATA_PATH}/data" /content/artemis/
!cp -r "{DRIVE_DATA_PATH}/utils" /content/artemis/
!cp -r "{DRIVE_DATA_PATH}/models" /content/artemis/
!cp "{DRIVE_DATA_PATH}/train.py" /content/artemis/

print("✓ Data copied to /content/artemis/")

In [None]:
# Setup Python path
import sys
os.chdir('/content/artemis')
sys.path.insert(0, '/content/artemis')
print(f"Working directory: {os.getcwd()}")

In [None]:
# Import modules
import torch
import torch.nn as nn
import torch.optim as optim
import json
import time
from datetime import datetime
import numpy as np

from utils.data_loader import create_dataloaders
from utils.evaluation import BLEUScore
from train import Trainer

print("✓ All modules imported")

## 2. Configuration

In [None]:
# Global settings
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
NUM_WORKERS = 2

# Three model configurations for Colab
COLAB_CONFIGS = {
    'colab_cnn_large': {
        'model_type': 'cnn_lstm',
        'description': 'CNN+LSTM Large Capacity',
        'batch_size': 32,
        'num_images': 15000,
        'epochs': 50,
        'learning_rate': 1e-4,
        'embed_dim': 512,
        'hidden_dim': 1024,
        'attention_dim': 512,
        'dropout': 0.4,
        'encoder_lr_factor': 0.1,
    },
    'colab_vit_standard': {
        'model_type': 'vit',
        'description': 'Vision Transformer Standard',
        'batch_size': 32,
        'num_images': 15000,
        'epochs': 50,
        'learning_rate': 1e-4,
        'embed_dim': 256,
        'num_heads': 8,
        'num_layers': 6,
        'ff_dim': 1024,
        'dropout': 0.1,
    },
    'colab_cnn_glove': {
        'model_type': 'cnn_lstm',
        'description': 'CNN+LSTM with GloVe Embeddings',
        'batch_size': 32,
        'num_images': 15000,
        'epochs': 40,  # Slightly fewer since pretrained embeddings converge faster
        'learning_rate': 5e-5,  # Lower LR for pretrained embeddings
        'embed_dim': 300,  # GloVe dimension
        'hidden_dim': 512,
        'attention_dim': 256,
        'dropout': 0.3,
        'encoder_lr_factor': 0.1,
        'use_glove': True,
    }
}

print("Training Configurations:")
print("=" * 70)
for name, cfg in COLAB_CONFIGS.items():
    print(f"\n{name}: {cfg['description']}")
    print(f"  Model: {cfg['model_type']}")
    print(f"  Images: ~{cfg['num_images']}, Epochs: {cfg['epochs']}")
    print(f"  Batch size: {cfg['batch_size']}, LR: {cfg['learning_rate']}")
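
For orientation, the `num_images` and `batch_size` entries translate into batch counts roughly as follows. This is a minimal sketch: it assumes the per-epoch cap of `num_images // batch_size` that `train_model` applies later in this notebook, and that the underlying loader has at least that many batches (with fewer images available, the real count is lower). The helper name `steps_for` is hypothetical.

```python
def steps_for(num_images, batch_size, epochs):
    """Return (batches per epoch, total batches) under the num_images cap."""
    per_epoch = num_images // batch_size  # same integer division used in train_model
    return per_epoch, per_epoch * epochs


# (name, epochs) pairs copied from COLAB_CONFIGS; all three use
# num_images=15000 and batch_size=32.
for name, epochs in [
    ("colab_cnn_large", 50),
    ("colab_vit_standard", 50),
    ("colab_cnn_glove", 40),
]:
    per_epoch, total = steps_for(15000, 32, epochs)
    print(f"{name}: {per_epoch} batches/epoch, {total} batches total")
```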

In [None]:
# Load vocabulary
with open('data/processed/vocabulary.json', 'r', encoding='utf-8') as f:
    vocab_data = json.load(f)

vocab_size = vocab_data['vocab_size']
word_to_idx = vocab_data['word2idx']
idx_to_word = {int(k): v for k, v in vocab_data['idx2word'].items()}

print(f"Vocabulary size: {vocab_size}")
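
The `int(k)` conversion in the cell above is needed because JSON object keys are always strings, so an index-to-word mapping saved with integer keys comes back with string keys. A small sketch with a made-up two-word vocabulary (the words and the `_demo` names are illustrative assumptions, not the project's data):

```python
import json

# Hypothetical two-word vocabulary, only to illustrate the key types
idx2word = {0: "<pad>", 1: "painting"}

# Round-trip through JSON: the integer keys come back as strings
restored = json.loads(json.dumps({"idx2word": idx2word}))["idx2word"]
print(list(restored.keys()))

# Same conversion as in the vocabulary-loading cell
idx_to_word_demo = {int(k): v for k, v in restored.items()}
print(idx_to_word_demo[1])
```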

In [None]:
# Load GloVe embeddings if available
glove_embeddings = None
glove_path = 'data/embeddings/glove_embeddings.npy'
if os.path.exists(glove_path):
    glove_embeddings = np.load(glove_path)
    print(f"✓ GloVe embeddings loaded: {glove_embeddings.shape}")
else:
    print("⚠ GloVe embeddings not found - will use random initialization")
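
The GloVe matrix is later copied into the decoder's embedding layer in `create_cnn_lstm_model`, so its shape is expected to match `(vocab_size, embed_dim)`. A minimal sketch of a shape check that could be run before training; the helper name `embedding_shape_ok` and the sizes below are assumptions for illustration only:

```python
def embedding_shape_ok(shape, vocab_size, embed_dim):
    """True if a 2-D embedding matrix has one row per vocabulary entry."""
    return tuple(shape) == (vocab_size, embed_dim)


# Made-up sizes; in the notebook you would pass glove_embeddings.shape,
# vocab_size, and the config's embed_dim (300 for colab_cnn_glove).
print(embedding_shape_ok((5000, 300), 5000, 300))
print(embedding_shape_ok((5000, 100), 5000, 300))
```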

## 3. Helper Functions

In [None]:
def create_cnn_lstm_model(config):
    """Create CNN+LSTM model."""
    from models.cnn_lstm import ImageCaptioningModel
    
    model = ImageCaptioningModel(
        embed_dim=config['embed_dim'],
        attention_dim=config['attention_dim'],
        decoder_dim=config['hidden_dim'],
        vocab_size=vocab_size,
        encoder_dim=2048,
        dropout=config['dropout'],
        pretrained_encoder=True
    )
    
    # Load GloVe embeddings if specified
    if config.get('use_glove') and glove_embeddings is not None:
        model.decoder.embedding.weight.data.copy_(
            torch.tensor(glove_embeddings, dtype=torch.float32)
        )
        print("  ✓ GloVe embeddings loaded into model")
    
    return model


def create_vit_model(config):
    """Create Vision Transformer model."""
    from models.vision_transformer import VisionTransformerCaptioning
    
    model = VisionTransformerCaptioning(
        vocab_size=vocab_size,
        embed_dim=config['embed_dim'],
        num_heads=config['num_heads'],
        num_encoder_layers=config['num_layers'],
        num_decoder_layers=config['num_layers'],
        ff_dim=config['ff_dim'],
        max_seq_len=30,
        dropout=config['dropout'],
        img_size=128,
        patch_size=16
    )
    return model


class LimitedLoader:
    """Wrapper to limit batches per epoch."""
    def __init__(self, loader, max_batches):
        self.loader = loader
        self.max_batches = max_batches
    
    def __iter__(self):
        for i, batch in enumerate(self.loader):
            if i >= self.max_batches:
                break
            yield batch
    
    def __len__(self):
        return min(len(self.loader), self.max_batches)


def train_model(config_name, config):
    """Train a single model configuration."""
    print("\n" + "=" * 70)
    print(f"TRAINING: {config_name}")
    print(f"Description: {config['description']}")
    print("=" * 70)
    
    # Create directories
    checkpoint_dir = f'checkpoints/{config_name}'
    output_dir = f'outputs/{config_name}'
    os.makedirs(checkpoint_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)
    
    # Save config
    with open(f'{output_dir}/config.json', 'w') as f:
        json.dump(config, f, indent=2)
    
    # Create data loaders
    print("\nCreating data loaders...")
    train_loader, val_loader, _ = create_dataloaders(
        images_dir='data/processed/images',
        captions_dir='data/processed/captions',
        splits_dir='data/processed/splits',
        vocab_file='data/processed/vocabulary.json',
        batch_size=config['batch_size'],
        num_workers=NUM_WORKERS
    )
    
    # Limit batches
    max_train_batches = config['num_images'] // config['batch_size']
    max_val_batches = max(20, max_train_batches // 5)
    
    train_loader = LimitedLoader(train_loader, max_train_batches)
    val_loader = LimitedLoader(val_loader, max_val_batches)
    
    print(f"  Train batches: {len(train_loader)} (~{len(train_loader) * config['batch_size']} images)")
    print(f"  Val batches: {len(val_loader)}")
    
    # Create model
    print("\nCreating model...")
    if config['model_type'] == 'cnn_lstm':
        model = create_cnn_lstm_model(config)
    else:
        model = create_vit_model(config)
    
    model = model.to(DEVICE)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"  Parameters: {total_params:,}")
    
    # Create optimizer
    if config['model_type'] == 'cnn_lstm':
        encoder_params = list(model.encoder.parameters())
        decoder_params = list(model.decoder.parameters()) + list(model.attention.parameters())
        optimizer = torch.optim.Adam([
            {'params': encoder_params, 'lr': config['learning_rate'] * config['encoder_lr_factor']},
            {'params': decoder_params, 'lr': config['learning_rate']}
        ], weight_decay=1e-5)
    else:
        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=config['learning_rate'],
            weight_decay=1e-5
        )
    
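    # Halve the LR once validation loss has not improved for more than 3 epochs.
    # The `verbose` argument is deprecated in recent PyTorch releases, so it is omitted.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=3
    )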
    
    # Create trainer
    bleu_scorer = BLEUScore(idx_to_word, word_to_idx)
    trainer = Trainer(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        optimizer=optimizer,
        scheduler=scheduler,
        evaluator=bleu_scorer,
        device=DEVICE,
        checkpoint_dir=checkpoint_dir,
        grad_clip=5.0
    )
    
    # Train
    print(f"\nStarting training for {config['epochs']} epochs...")
    start_time = time.time()
    history = trainer.train(num_epochs=config['epochs'])
    duration = time.time() - start_time
    
    # Save results
    results = {
        'config_name': config_name,
        'description': config['description'],
        'model_type': config['model_type'],
        'num_images': config['num_images'],
        'epochs': config['epochs'],
        'parameters': total_params,
        'final_train_loss': history[-1]['train_loss'],
        'final_val_loss': history[-1]['val_loss'],
        'best_val_loss': min(h['val_loss'] for h in history),
        'best_bleu': max(h.get('bleu', 0) for h in history),
        'duration_minutes': duration / 60,
        'history': history
    }
    
    with open(f'{output_dir}/results.json', 'w') as f:
        json.dump(results, f, indent=2)
    
    print(f"\n{'=' * 70}")
    print(f"COMPLETED: {config_name}")
    print(f"Duration: {duration/60:.1f} minutes")
    print(f"Best Val Loss: {results['best_val_loss']:.4f}")
    print(f"Best BLEU: {results['best_bleu']:.4f}")
    print(f"{'=' * 70}")
    
    return results
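
To see what `LimitedLoader` does in isolation, the sketch below wraps a plain list instead of a real `DataLoader`. The class is repeated so the example runs on its own; the batch labels are placeholders.

```python
class LimitedLoader:
    """Same wrapper as above, repeated so this sketch is self-contained."""

    def __init__(self, loader, max_batches):
        self.loader = loader
        self.max_batches = max_batches

    def __iter__(self):
        for i, batch in enumerate(self.loader):
            if i >= self.max_batches:
                break
            yield batch

    def __len__(self):
        return min(len(self.loader), self.max_batches)


batches = ["b0", "b1", "b2", "b3", "b4"]  # stand-ins for real batches

limited = LimitedLoader(batches, max_batches=3)
print(list(limited), len(limited))  # only the first three batches are yielded

uncapped = LimitedLoader(batches, max_batches=10)
print(len(uncapped))  # the cap has no effect when the loader is shorter
```

The wrapper always takes the first `max_batches` batches in whatever order the wrapped loader produces them. Whether that is a different subset each epoch depends on whether `create_dataloaders` shuffles, which is not shown in this notebook.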

## 4. Train Models

**Option A:** Run next cell to train ALL 3 models sequentially (~8-10 hours total)

**Option B:** Use the individual training cells below to train one at a time

In [None]:
# Train all three configurations
all_results = {}

for config_name, config in COLAB_CONFIGS.items():
    try:
        results = train_model(config_name, config)
        all_results[config_name] = results
    except Exception as e:
        print(f"\n❌ Error training {config_name}: {e}")
        import traceback
        traceback.print_exc()
        all_results[config_name] = {'error': str(e)}
    
    # Clear GPU memory between models
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

print("\n" + "=" * 70)
print("ALL TRAINING COMPLETE!")
print("=" * 70)

### Option B: Train Individual Models (run only ONE of these cells)

In [None]:
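# Train Model 1: CNN+LSTM Large (~3-4 hours)
config_name = 'colab_cnn_large'
results_1 = train_model(config_name, COLAB_CONFIGS[config_name])
# Register the run so the summary and plotting cells in Section 5 can find it
all_results = globals().get('all_results', {})
all_results[config_name] = results_1
print(f"\n✅ Model 1 complete! Best BLEU: {results_1['best_bleu']:.4f}")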

In [None]:
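# Train Model 2: ViT Standard (~2-3 hours)
config_name = 'colab_vit_standard'
results_2 = train_model(config_name, COLAB_CONFIGS[config_name])
# Register the run so the summary and plotting cells in Section 5 can find it
all_results = globals().get('all_results', {})
all_results[config_name] = results_2
print(f"\n✅ Model 2 complete! Best BLEU: {results_2['best_bleu']:.4f}")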

In [None]:
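# Train Model 3: CNN+LSTM with GloVe (~2-3 hours)
config_name = 'colab_cnn_glove'
results_3 = train_model(config_name, COLAB_CONFIGS[config_name])
# Register the run so the summary and plotting cells in Section 5 can find it
all_results = globals().get('all_results', {})
all_results[config_name] = results_3
print(f"\n✅ Model 3 complete! Best BLEU: {results_3['best_bleu']:.4f}")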

## 5. Results Summary

In [None]:
# Print summary
print("\n" + "=" * 70)
print("TRAINING RESULTS SUMMARY")
print("=" * 70)

for name, results in all_results.items():
    if 'error' in results:
        print(f"\n{name}: FAILED - {results['error']}")
    else:
        print(f"\n{name}: {results['description']}")
        print(f"  Parameters: {results['parameters']:,}")
        print(f"  Duration: {results['duration_minutes']:.1f} minutes")
        print(f"  Final Val Loss: {results['final_val_loss']:.4f}")
        print(f"  Best BLEU: {results['best_bleu']:.4f}")

# Save combined results. Per-epoch history is left out here; each run's own
# results.json (written by train_model) still contains it.
save_results = {}
for name, res in all_results.items():
    if 'error' not in res:
        save_results[name] = {k: v for k, v in res.items() if k != 'history'}
    else:
        save_results[name] = res

# Make sure the folder exists even if every run failed before creating it
os.makedirs('outputs', exist_ok=True)
with open('outputs/colab_all_results.json', 'w') as f:
    json.dump(save_results, f, indent=2)

print("\n✓ Results saved to outputs/colab_all_results.json")
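
To pick a winner from the summary, one option is to rank the successful runs by `best_bleu`. A minimal sketch, assuming the same dictionary layout as `all_results` (failed runs carry an `'error'` key); the helper name `best_by_bleu` and the numbers in `demo` are placeholders, not measured results:

```python
def best_by_bleu(all_results):
    """Return (name, bleu) of the best successful run, or None if all failed."""
    ok = {n: r for n, r in all_results.items() if "error" not in r}
    if not ok:
        return None
    name = max(ok, key=lambda n: ok[n]["best_bleu"])
    return name, ok[name]["best_bleu"]


# Placeholder values for illustration only
demo = {
    "run_a": {"best_bleu": 0.10},
    "run_b": {"best_bleu": 0.20},
    "run_c": {"error": "out of memory"},
}
print(best_by_bleu(demo))
```

In the notebook this would be called as `best_by_bleu(all_results)` after training finishes.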

In [None]:
# Plot training curves
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

colors = ['blue', 'green', 'orange']
for i, (name, results) in enumerate(all_results.items()):
    if 'error' in results:
        continue
    
    history = results['history']
    epochs = range(1, len(history) + 1)
    val_losses = [h['val_loss'] for h in history]
    bleu_scores = [h.get('bleu', 0) for h in history]
    
    axes[0].plot(epochs, val_losses, color=colors[i], label=name)
    axes[1].plot(epochs, bleu_scores, color=colors[i], label=name)

axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Validation Loss')
axes[0].set_title('Validation Loss Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('BLEU Score')
axes[1].set_title('BLEU Score Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('outputs/colab_training_comparison.png', dpi=150)
plt.show()

## 6. Save to Google Drive

In [None]:
# Copy results to Google Drive
!mkdir -p "{DRIVE_DATA_PATH}/checkpoints"
!mkdir -p "{DRIVE_DATA_PATH}/outputs"

# Copy all checkpoint and output folders
for name in COLAB_CONFIGS.keys():
    if os.path.exists(f'checkpoints/{name}'):
        !cp -r "checkpoints/{name}" "{DRIVE_DATA_PATH}/checkpoints/"
    if os.path.exists(f'outputs/{name}'):
        !cp -r "outputs/{name}" "{DRIVE_DATA_PATH}/outputs/"

# Copy combined results
!cp outputs/colab_all_results.json "{DRIVE_DATA_PATH}/outputs/"
!cp outputs/colab_training_comparison.png "{DRIVE_DATA_PATH}/outputs/"

print("\n✓ All results saved to Google Drive!")
print(f"Location: {DRIVE_DATA_PATH}")