# 📚 Books Classification - Fixed Cloud GPU Training

This notebook trains a book sentence classification model using GPU acceleration on Google Colab.

**Fixed Issues:**
- CUDA version mismatches
- Datasets protocol compatibility
- Package version conflicts

**Selected Books:**
- Anna Karenina (Classics)
- The Adventures of Alice in Wonderland (Children's Books)
- Frankenstein (Science-Fiction)
- The Life of Julius Caesar (Biographies)

**Model:** Constructive Learning Model with BERT encoder

## 🔍 GPU Check

First, let's verify we have GPU access:

In [None]:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"GPU Memory Allocated: {torch.cuda.memory_allocated(0) / 1e9:.1f} GB")
else:
    print("⚠️  No GPU detected!")
    print("   Enable one via Runtime > Change runtime type > Hardware accelerator: GPU")

## 🔧 Fix Package Versions

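This cell pins package versions to fix CUDA version mismatches and `datasets` protocol issues. If `torch` was already imported in this session (for example by the GPU check above), restart the runtime once the installs finish so that the pinned versions are the ones loaded: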

In [None]:
# Uninstall problematic packages
!pip uninstall -y torch torchvision torchaudio datasets transformers

# Install compatible PyTorch with CUDA 11.8
!pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

# Install compatible datasets and transformers
!pip install datasets==2.14.0 transformers==4.30.2 tokenizers==0.13.3

# Install other dependencies
!pip install accelerate==0.20.3 deepspeed==0.9.5
!pip install nltk scikit-learn pandas numpy matplotlib seaborn
!pip install wandb tensorboard tqdm PyYAML

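# Verify installation (a module imported before the reinstall keeps reporting its old version until the runtime is restarted)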
import torch
import datasets
import transformers
print(f"✅ PyTorch {torch.__version__} installed successfully")
print(f"✅ Datasets {datasets.__version__} installed successfully")
print(f"✅ Transformers {transformers.__version__} installed successfully")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
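
The version pins above can also be checked against the installed package metadata instead of against imported modules. The following is a minimal sketch, not part of the original notebook: the helper name `find_mismatches` is hypothetical, and `torch` is left out on purpose because its installed version string carries a CUDA suffix (such as `+cu118`) that an exact comparison would flag.

```python
from importlib import metadata

# Pins copied from the install commands in this notebook.
PINNED = {
    "datasets": "2.14.0",
    "transformers": "4.30.2",
    "tokenizers": "0.13.3",
    "accelerate": "0.20.3",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Because `importlib.metadata` reads what is installed on disk, this check is not affected by a stale module that was imported before the reinstall.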

## 📥 Download NLTK Data

Download required NLTK data for sentence tokenization:

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
print("✅ NLTK data downloaded successfully!")

## 📤 Upload Project Files

Upload your project files to Colab:

In [None]:
from google.colab import files
import zipfile
import os

print("📤 Please upload the books_classification_colab.zip file:")
uploaded = files.upload()

# Extract the first uploaded ZIP archive
extracted = False
for filename in uploaded.keys():
    if filename.endswith('.zip'):
        print(f"📦 Extracting {filename}...")
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall('.')
        print(f"✅ Extracted {filename}")
        extracted = True
        break
    else:
        print(f"⚠️  {filename} is not a ZIP file. Please upload books_classification_colab.zip")

if extracted:
    print("📁 Files uploaded and extracted!")
else:
    print("❌ No ZIP file was extracted. Re-run this cell and upload books_classification_colab.zip")
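
Before extracting, it can help to confirm that the archive actually contains the project files. The sketch below is an illustration under one assumption: the archive stores paths relative to the project root, which is what extracting into `.` implies. The helper name `missing_from_archive` is hypothetical.

```python
import zipfile

# Paths taken from the "Verify Project Files" step; adjust to your archive.
REQUIRED = [
    "configs/config.yaml",
    "data/prepare_data.py",
    "models/constructive_model.py",
    "train_cloud.py",
]


def missing_from_archive(zip_path, required):
    """Return the required paths that the archive does not contain."""
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
    return [path for path in required if path not in names]


# Usage in the notebook:
# missing = missing_from_archive("books_classification_colab.zip", REQUIRED)
# if missing:
#     print("Archive is missing:", missing)
```

If the archive wraps everything in a top-level folder, the paths will not match and every file will be reported as missing, which is itself a useful signal that the extracted layout will differ from what later cells expect.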

## ✅ Verify Project Files

Let's verify that all necessary files are present:

In [None]:
import os

required_files = [
    'configs/config.yaml',
    'data/prepare_data.py',
    'models/constructive_model.py',
    'train_cloud.py',
    'requirements-cloud.txt',
    'test_prediction.py'  # run later in the "Test Model" step
]

print("🔍 Checking required files:")
all_present = True
for file_path in required_files:
    if os.path.exists(file_path):
        print(f"✅ {file_path}")
    else:
        print(f"❌ {file_path} - MISSING")
        all_present = False

if all_present:
    print("\n🎉 All files present! Ready to proceed.")
else:
    print("\n⚠️  Some files are missing. Please check the upload.")

## 🗂️ Prepare Data

Run data preparation to download and process the books dataset:

In [None]:
import sys
sys.path.append('.')

print("📚 Starting data preparation...")
!python data/prepare_data.py
print("✅ Data preparation completed!")
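
The cell above prints its success message even when the script fails, because `!python` does not stop the cell on a non-zero exit code. One possible alternative, sketched here with a hypothetical helper `run_script`, is to launch the script with `subprocess` and branch on the return code. The output is captured and printed after the script ends, so it does not stream live.

```python
import subprocess
import sys


def run_script(args):
    """Run a Python script in a child process; return True on exit code 0."""
    result = subprocess.run(
        [sys.executable, *args], capture_output=True, text=True
    )
    print(result.stdout, end="")
    if result.returncode != 0:
        # Surface the error output only when the script failed
        print(result.stderr, end="", file=sys.stderr)
    return result.returncode == 0


# Usage in the notebook (script path from the cell above):
# if run_script(["data/prepare_data.py"]):
#     print("✅ Data preparation completed!")
# else:
#     print("❌ Data preparation failed; see the error output above.")
```

The same pattern would apply to the training and test scripts later in the notebook, although for long training runs the delayed output may be a reason to keep `!python` and read the log instead.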

## 📊 Verify Dataset

Let's check the prepared dataset with the fixed datasets version:

In [None]:
from datasets import load_from_disk
import json

# Load dataset with fixed datasets version
try:
    dataset = load_from_disk('data/processed_dataset')
    print(f"📊 Dataset loaded successfully!")
    print(f"   Train: {len(dataset['train'])} samples")
    print(f"   Validation: {len(dataset['validation'])} samples")
    print(f"   Test: {len(dataset['test'])} samples")
    
    # Load metadata
    with open('data/metadata.json', 'r') as f:
        metadata = json.load(f)
    
    print(f"\n📚 Books in dataset:")
    for book_id, book_name in metadata['id_to_label'].items():
        print(f"   {book_id}: {book_name}")
    
    # Show sample
    print(f"\n📝 Sample sentence:")
    print(f"   {dataset['train'][0]['sentence'][:100]}...")
    
except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    print("This might be due to datasets version incompatibility.")
    print("Please restart the runtime and run the fix cell again.")
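
With four books of very different lengths, the classes may be imbalanced, so it can be worth checking label shares in addition to split sizes. This is a minimal sketch: the helper `label_shares` is hypothetical, and the column name `label` in the usage comment is an assumption, since only the `sentence` column appears in this notebook.

```python
from collections import Counter


def label_shares(labels):
    """Map each label to its share of the samples (between 0.0 and 1.0)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Usage in the notebook ("label" is an assumed column name):
# for label, share in sorted(label_shares(dataset["train"]["label"]).items()):
#     print(f"   {label}: {share:.1%}")
```

A strongly skewed distribution would suggest reading accuracy with caution and looking at per-class metrics as well.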

## 📊 Setup Weights & Biases (Optional)

Setup experiment tracking for monitoring training progress:

In [None]:
import wandb

# Uncomment the line below to login to WandB
# wandb.login()

print("📊 WandB setup completed!")
print("💡 To enable experiment tracking, uncomment 'wandb.login()' above")

## 🚀 Start Training

Start the cloud-optimized training with GPU acceleration:

In [None]:
import torch  # needed again if the runtime was restarted after the version fix

print("🚀 Starting GPU training...")
print(f"🎯 Training on: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")

# Start training with 5 epochs for quick testing
!python train_cloud.py --epochs 5

print("✅ Training completed!")

## 📈 Check Training Results

Let's check the training results and model performance:

In [None]:
import os
import glob

print("📊 Training Results:")

# Check for checkpoints
checkpoints = glob.glob('experiments/checkpoints/*.pt')
if checkpoints:
    print(f"✅ Found {len(checkpoints)} checkpoints:")
    for checkpoint in sorted(checkpoints):
        size_mb = os.path.getsize(checkpoint) / 1024 / 1024
        print(f"   📁 {os.path.basename(checkpoint)} ({size_mb:.1f} MB)")
else:
    print("⚠️  No checkpoints found")

# Check for logs
logs = glob.glob('experiments/logs/*.log')
if logs:
    print(f"\n📝 Found {len(logs)} log files:")
    for log in logs:
        print(f"   📄 {os.path.basename(log)}")

print("\n🎉 Training results ready!")
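
Sorting checkpoint names alphabetically does not necessarily put the newest one last (for example, `epoch_10` sorts before `epoch_2`). A small sketch that picks the most recently modified checkpoint instead is shown below; the helper name `latest_checkpoint` is hypothetical, and the default pattern is the one used in the cell above.

```python
import glob
import os


def latest_checkpoint(pattern="experiments/checkpoints/*.pt"):
    """Return the most recently modified file matching `pattern`, or None."""
    paths = glob.glob(pattern)
    return max(paths, key=os.path.getmtime) if paths else None


# Usage in the notebook:
# newest = latest_checkpoint()
# print(f"Newest checkpoint: {newest}" if newest else "No checkpoints found")
```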

## 🧪 Test Model

Test the trained model with sample sentences:

In [None]:
print("🧪 Testing trained model...")
!python test_prediction.py
print("✅ Model testing completed!")

## 📥 Download Results

Download the training results and model files:

In [None]:
import zipfile
import os
from google.colab import files

print("📦 Creating results package...")

# Create a compressed zip file with results
with zipfile.ZipFile('training_results.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Add the experiments and data directories if they exist
    for directory in ('experiments', 'data'):
        if os.path.exists(directory):
            # Name the loop variable `filenames`, not `files`, so the
            # google.colab `files` module used below is not shadowed
            for root, dirs, filenames in os.walk(directory):
                for name in filenames:
                    file_path = os.path.join(root, name)
                    zipf.write(file_path, file_path)

# Download the results
files.download('training_results.zip')
print("✅ Results downloaded to your computer!")

## 🎯 Next Steps

### What you can do next:

1. **📊 Analyze Results**: Check the training logs and metrics
2. **🔧 Fine-tune**: Adjust hyperparameters in `configs/config.yaml`
3. **🚀 Scale Up**: Train for more epochs or use larger models
4. **📱 Deploy**: Use the trained model for inference
5. **📈 Monitor**: Set up WandB for better experiment tracking

### Performance Tips:

- **Free Colab**: Sessions are limited to roughly 12 hours; use it for testing
- **Colab Pro**: More hours, better GPUs (V100/A100)
- **Batch Size**: Reduce it if you hit out-of-memory errors; increase it if GPU memory is underused
- **Checkpointing**: Progress is saved every epoch
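
The batch size tip above can be automated by trying a step and halving the batch size whenever it runs out of memory. The sketch below is generic and is not taken from `train_cloud.py`: the helper `largest_working_batch_size` and the callable `try_batch` are hypothetical, and it assumes that an out-of-memory failure surfaces as a `RuntimeError` whose message contains "out of memory", as PyTorch CUDA errors do.

```python
def largest_working_batch_size(try_batch, start=64, minimum=1):
    """Halve the batch size until `try_batch(size)` no longer runs out of memory."""
    size = start
    while size >= max(minimum, 1):
        try:
            try_batch(size)  # e.g. one forward/backward pass at this size
            return size
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated failure: do not hide it
            size //= 2
    raise RuntimeError(f"No batch size >= {minimum} fits in memory")
```

In a real run, `try_batch` would execute a single training step on the GPU, and it is usually worth calling `torch.cuda.empty_cache()` between attempts; both details are left out here to keep the sketch independent of the training script.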

### 🎉 Congratulations!

You've successfully trained a book sentence classification model on Google Colab with GPU acceleration!