# SageMaker Environment Setup

Single consolidated notebook for setting up the GL RL Model environment on SageMaker.

- **Instance Type**: ml.t2.medium (CPU) or any SageMaker instance
- **Kernel**: Python 3

## Option 1: Quick Setup (Run Shell Script)

In [None]:
# Run the consolidated setup script
!bash ../../sagemaker/1_setup/setup.sh

## Option 2: Manual Setup (Step by Step)

In [None]:
# Step 1: Install compiled packages with conda (avoids CMake build issues)
%conda install -c conda-forge sentencepiece pyarrow -y

In [None]:
# Step 2: Install PyTorch (CPU-only wheels from the PyTorch CPU index)
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

In [None]:
# Step 3: Install ML libraries
%pip install transformers datasets peft trl accelerate huggingface-hub tokenizers

In [None]:
# Step 4: Install utilities and fix dependencies
# The version specifier is quoted so the shell does not treat ">" as output redirection
%pip install --upgrade numpy pandas protobuf tqdm fsspec aiohttp "multiprocess>=0.70.18"
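
The same installs can also be driven from Python, which avoids shell quoting of version specifiers such as `multiprocess>=0.70.18` (an unquoted `>` can be read by the shell as output redirection). The following is a minimal sketch of that approach, not part of the project's setup script; the helper name `build_pip_command` is hypothetical.

```python
import subprocess
import sys


def build_pip_command(packages, upgrade=False):
    """Return an argument list for `python -m pip install` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        cmd.append("--upgrade")
    # Each requirement is its own list element, so ">=" never reaches a shell.
    cmd.extend(packages)
    return cmd


cmd = build_pip_command(["numpy", "multiprocess>=0.70.18"], upgrade=True)
print(" ".join(cmd))
# To actually run the install, uncomment the next line:
# subprocess.run(cmd, check=True)
```

Using `sys.executable` targets the interpreter behind the current kernel, so packages land in the environment the notebook is actually running in.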

## Verify Installation

In [None]:
import sys
import platform

print("System Information:")
print(f"Python: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"Machine: {platform.machine()}")
print("\n" + "="*50 + "\n")

# Test all imports
packages = {
    "torch": "PyTorch",
    "transformers": "Transformers",
    "datasets": "Datasets",
    "peft": "PEFT",
    "trl": "TRL",
    "sentencepiece": "Sentencepiece",
    "pyarrow": "PyArrow",
    "accelerate": "Accelerate"
}

all_good = True
for module_name, display_name in packages.items():
    try:
        module = __import__(module_name)
        version = getattr(module, "__version__", "unknown")
        print(f"✅ {display_name}: {version}")
    except ImportError:
        print(f"❌ {display_name}: Not installed")
        all_good = False

# Check CUDA (guarded so the cell still finishes if PyTorch failed to import above)
print("\n" + "="*50)
try:
    import torch
except ImportError:
    print("Device check skipped: PyTorch is not installed")
else:
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    else:
        print("Device: CPU (Expected on ml.t2.medium)")
        print("For GPU training, use sagemaker/2_training/GPU_Training.ipynb")

print("\n" + "="*50)
if all_good:
    print("✅ All packages installed successfully!")
else:
    print("⚠️ Some packages are missing. Rerun the setup steps above.")
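
Installed versions can also be read from package metadata without importing each library, which is quicker for heavy packages such as PyTorch. The sketch below uses the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the names passed in are pip distribution names, which can differ from import names.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as passed to pip; adjust the list as needed.
report = installed_versions(["torch", "transformers", "datasets", "peft", "trl"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

This only confirms that a distribution is present. It does not prove the package imports cleanly, so it complements the import test above rather than replacing it.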

## Test Model Loading

In [None]:
from transformers import AutoTokenizer

# Test Qwen model tokenizer
model_name = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
print(f"Loading {model_name} tokenizer...")

try:
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    
    # Test tokenization
    test_text = "SELECT * FROM users WHERE age > 25;"
    tokens = tokenizer.encode(test_text)
    decoded = tokenizer.decode(tokens)
    
    print("✅ Tokenizer loaded successfully!")
    print(f"\nTest: '{test_text}'")
    print(f"Tokens: {len(tokens)}")
    print(f"Decoded: '{decoded}'")
    print(f"\nVocabulary size: {tokenizer.vocab_size}")
except Exception as e:
    print(f"⚠️ Error: {e}")
    print("This may be a network issue. Retry, or check the instance's internet connection.")
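
Because the tokenizer download can fail on a transient network error, the retry suggested above can be automated. This is a minimal sketch with exponential backoff; the `retry` helper and its parameters are hypothetical and not part of `transformers`.

```python
import time


def retry(func, attempts=3, delay=2.0, exceptions=(Exception,)):
    """Call func() until it succeeds or attempts run out (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f"Attempt {attempt} failed ({err}); retrying in {delay:.0f}s...")
            time.sleep(delay)
            delay *= 2  # exponential backoff


# Usage in this notebook (assumes `AutoTokenizer` and `model_name` from the cell above):
# tokenizer = retry(lambda: AutoTokenizer.from_pretrained(model_name))
```

Retrying helps with temporary failures only. A wrong model name or a blocked outbound connection will fail on every attempt and re-raise the final error.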

## Setup Training Data

In [None]:
import os
import json

# Create training data directory
os.makedirs("data/training", exist_ok=True)

# Sample training data (the SQL uses MySQL functions such as DATE_FORMAT and CURDATE)
sample_data = [
    {"query": "Show me all customers", "sql": "SELECT * FROM customers;", "context": "customers(id, name, email, created_at)"},
    {"query": "Get total sales by month", "sql": "SELECT DATE_FORMAT(date, '%Y-%m') as month, SUM(amount) as total FROM sales GROUP BY month;", "context": "sales(id, date, amount, product_id)"},
    {"query": "Find top 5 products by revenue", "sql": "SELECT p.name, SUM(s.amount) as revenue FROM products p JOIN sales s ON p.id = s.product_id GROUP BY p.id ORDER BY revenue DESC LIMIT 5;", "context": "products(id, name, price), sales(id, product_id, amount)"},
    {"query": "List users who registered today", "sql": "SELECT * FROM users WHERE DATE(created_at) = CURDATE();", "context": "users(id, name, email, created_at)"},
    {"query": "Calculate average order value", "sql": "SELECT AVG(total_amount) as avg_order_value FROM orders;", "context": "orders(id, customer_id, total_amount, order_date)"}
]

# Write to file
data_file = "data/training/query_pairs.jsonl"
with open(data_file, 'w') as f:
    for item in sample_data:
        f.write(json.dumps(item) + '\n')

print(f"✅ Created training data: {data_file}")
print(f"Number of examples: {len(sample_data)}")

# Display sample
print("\nSample data:")
for i, item in enumerate(sample_data[:2], 1):
    print(f"\nExample {i}:")
    print(f"  Query: {item['query']}")
    print(f"  SQL: {item['sql']}")
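
Before training, it can be worth reading the JSONL file back and confirming that every record carries the three fields used above (`query`, `sql`, `context`). The sketch below is one way to do that; `load_query_pairs` and `REQUIRED_KEYS` are hypothetical names, not part of the project code.

```python
import json

REQUIRED_KEYS = {"query", "sql", "context"}


def load_query_pairs(path):
    """Read a JSONL file and check every record has the expected keys."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, 1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - set(record)
            if missing:
                raise ValueError(f"Line {line_number} is missing keys: {sorted(missing)}")
            records.append(record)
    return records


# Usage (assumes the cell above has written the file):
# records = load_query_pairs("data/training/query_pairs.jsonl")
# print(f"Loaded {len(records)} validated examples")
```

Failing fast on a malformed line, with its line number, is usually easier to debug than a missing-key error that surfaces partway through training.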

## ✅ Setup Complete!

### Next Steps:

1. **For GPU Training**: Open `sagemaker/2_training/GPU_Training.ipynb`
2. **For CPU Inference**: Open `sagemaker/3_inference/CPU_Inference.ipynb`

### Key Points:
- ✅ All dependencies installed (conda + pip approach)
- ✅ No CMake build issues (using precompiled binaries)
- ✅ Ready for model training and inference
- 💡 Remember to stop the instance when not in use to avoid unnecessary charges!