# Deploying Llama 3.2 on Vertex AI with a Custom Container

This notebook provides a complete walkthrough for deploying Meta's Llama 3.2 model on Google Cloud Vertex AI using a custom container image.

## Prerequisites
- Google Cloud Project with billing enabled
- Vertex AI API enabled
- Access to Llama 3.2 model weights (from Meta or Hugging Face)
- Docker installed locally (for building the container)
- `gcloud` CLI configured

## Overview
1. Set up environment and dependencies
2. Create custom container with model serving code
3. Build and push container to Artifact Registry
4. Upload model to Vertex AI Model Registry
5. Deploy to endpoint
6. Test the deployment

## Step 1: Environment Setup

In [None]:
# Install required packages
!pip install google-cloud-aiplatform transformers torch accelerate sentencepiece protobuf

In [None]:
# Set up your Google Cloud project variables
PROJECT_ID = "your-project-id"  # Replace with your project ID
REGION = "us-central1"  # Choose your preferred region
BUCKET_NAME = f"{PROJECT_ID}-llama-models"  # GCS bucket for model artifacts
REPOSITORY = "llama-models"  # Artifact Registry repository name
IMAGE_NAME = "llama32-serve"
IMAGE_TAG = "latest"
IMAGE_URI = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{REPOSITORY}/{IMAGE_NAME}:{IMAGE_TAG}"

# Model configuration
MODEL_NAME = "llama-3.2-3b"  # or llama-3.2-1b; the 11B and 90B releases are vision models that this text-only serving code does not cover
HUGGINGFACE_MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"  # Update to match MODEL_NAME

In [None]:
# Initialize Vertex AI
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")

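The image URI above is assembled from several variables, and a leftover placeholder such as `your-project-id` only surfaces much later as a failed `docker push`. The following is a minimal sketch of an early check; the helper name `artifact_registry_image_uri` and its validation rules are illustrative assumptions, not part of any Google Cloud SDK.

```python
import re


def artifact_registry_image_uri(project_id, region, repository, image, tag="latest"):
    """Build an Artifact Registry Docker image URI, rejecting obvious placeholders.

    Hypothetical helper for this notebook; the checks are deliberately simple.
    """
    fields = {
        "project_id": project_id,
        "region": region,
        "repository": repository,
        "image": image,
        "tag": tag,
    }
    for name, value in fields.items():
        if not value or value.startswith("your-"):
            raise ValueError(f"{name} is unset or still a placeholder: {value!r}")
    # Docker image names must be lowercase.
    if not re.fullmatch(r"[a-z0-9][a-z0-9._-]*", image):
        raise ValueError(f"image name must be lowercase alphanumerics, '.', '_' or '-': {image!r}")
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"
```

Calling it with the variables from the cell above, for example `artifact_registry_image_uri(PROJECT_ID, REGION, REPOSITORY, IMAGE_NAME, IMAGE_TAG)`, raises a `ValueError` until `PROJECT_ID` has been replaced with a real project ID.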
## Step 2: Create GCS Bucket and Artifact Registry

In [None]:
# Create GCS bucket for model artifacts
!gsutil mb -l {REGION} gs://{BUCKET_NAME} || echo "Bucket already exists"

In [None]:
# Create Artifact Registry repository
!gcloud artifacts repositories create {REPOSITORY} \
    --repository-format=docker \
    --location={REGION} \
    --description="Repository for Llama models" || echo "Repository already exists"

In [None]:
# Configure Docker to authenticate with Artifact Registry
!gcloud auth configure-docker {REGION}-docker.pkg.dev

## Step 3: Create Custom Container Files

We'll create the necessary files for our custom prediction container.

In [None]:
import os

# Create directory structure
os.makedirs("container", exist_ok=True)
os.chdir("container")

### Create Dockerfile

In [None]:
%%writefile Dockerfile
FROM python:3.10-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY predictor.py .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV AIP_HTTP_PORT=8080
ENV AIP_HEALTH_ROUTE=/health
ENV AIP_PREDICT_ROUTE=/predict

# Expose port
EXPOSE 8080

# Run the web service
CMD ["python", "predictor.py"]

### Create requirements.txt

In [None]:
%%writefile requirements.txt
transformers>=4.45.0
torch>=2.1.0
accelerate>=0.25.0
sentencepiece>=0.1.99
protobuf>=3.20.0
flask>=3.0.0
gunicorn>=21.2.0
google-cloud-storage>=2.14.0

### Create predictor.py - Main Serving Code

In [None]:
%%writefile predictor.py
import os
import logging
from flask import Flask, request, jsonify
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = Flask(__name__)

# Global variables for model and tokenizer
model = None
tokenizer = None

# Environment variables
AIP_HTTP_PORT = os.getenv("AIP_HTTP_PORT", "8080")
AIP_HEALTH_ROUTE = os.getenv("AIP_HEALTH_ROUTE", "/health")
AIP_PREDICT_ROUTE = os.getenv("AIP_PREDICT_ROUTE", "/predict")
MODEL_PATH = os.getenv("MODEL_PATH", "/models")
HUGGINGFACE_MODEL_ID = os.getenv("HUGGINGFACE_MODEL_ID", "meta-llama/Llama-3.2-3B-Instruct")
HF_TOKEN = os.getenv("HF_TOKEN", None)  # Hugging Face token for gated models

def load_model():
    """Load the Llama model and tokenizer"""
    global model, tokenizer
    
    logger.info(f"Loading model from {HUGGINGFACE_MODEL_ID}...")
    
    try:
        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(
            HUGGINGFACE_MODEL_ID,
            token=HF_TOKEN,
            trust_remote_code=True
        )
        
        # Load model with appropriate device settings
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"Using device: {device}")
        
        model = AutoModelForCausalLM.from_pretrained(
            HUGGINGFACE_MODEL_ID,
            token=HF_TOKEN,
            torch_dtype=torch.float16 if device == "cuda" else torch.float32,
            device_map="auto" if device == "cuda" else None,
            trust_remote_code=True
        )
        
        if device == "cpu":
            model = model.to(device)
        
        model.eval()
        logger.info("Model loaded successfully")
        
    except Exception as e:
        logger.error(f"Error loading model: {str(e)}")
        raise

@app.route(AIP_HEALTH_ROUTE, methods=["GET"])
def health_check():
    """Health check endpoint"""
    return jsonify({"status": "healthy"}), 200

@app.route(AIP_PREDICT_ROUTE, methods=["POST"])
def predict():
    """Prediction endpoint"""
    try:
        # Parse request; a missing or invalid JSON body is treated as empty
        request_json = request.get_json(silent=True) or {}
        instances = request_json.get("instances", [])
        
        if not instances:
            return jsonify({"error": "No instances provided"}), 400
        
        predictions = []
        
        for instance in instances:
            # Extract prompt and parameters
            prompt = instance.get("prompt", "")
            max_tokens = instance.get("max_tokens", 512)
            temperature = instance.get("temperature", 0.7)
            top_p = instance.get("top_p", 0.9)
            
            # Tokenize input
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            
            # Generate response
            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=max_tokens,
                    temperature=temperature,
                    top_p=top_p,
                    do_sample=True,
                    pad_token_id=tokenizer.eos_token_id
                )
            
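            # Decode only the newly generated tokens. Slicing the decoded text by
            # len(prompt) breaks when the prompt contains special tokens, because
            # skip_special_tokens=True removes them from the decoded output.
            prompt_length = inputs["input_ids"].shape[-1]
            response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
            
            # Full decoded sequence (prompt plus completion), kept for debugging
            generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)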
            
            predictions.append({
                "generated_text": response,
                "full_output": generated_text
            })
        
        return jsonify({"predictions": predictions}), 200
        
    except Exception as e:
        logger.error(f"Prediction error: {str(e)}")
        return jsonify({"error": str(e)}), 500

if __name__ == "__main__":
    # Load model at startup
    load_model()
    
    # Run Flask app
    app.run(host="0.0.0.0", port=int(AIP_HTTP_PORT), debug=False)

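The `predict` handler above passes request values straight to `model.generate`. A minimal sketch of validating each instance first is shown here; the function name `parse_instance` and the `max_tokens_cap` limit are hypothetical choices for illustration. It also switches to greedy decoding when `temperature` is 0, since sampling with a zero temperature is rejected by `generate`.

```python
def parse_instance(instance, max_tokens_cap=1024):
    """Validate one request instance and return (prompt, generate_kwargs).

    Hypothetical helper: the cap and the error messages are illustrative.
    """
    if not isinstance(instance, dict):
        raise ValueError("each instance must be a JSON object")

    prompt = instance.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("'prompt' must be a non-empty string")

    max_tokens = int(instance.get("max_tokens", 512))
    if max_tokens < 1:
        raise ValueError("'max_tokens' must be at least 1")
    max_tokens = min(max_tokens, max_tokens_cap)  # bound the work per request

    temperature = float(instance.get("temperature", 0.7))
    top_p = float(instance.get("top_p", 0.9))
    if temperature < 0:
        raise ValueError("'temperature' must be non-negative")
    if not 0 < top_p <= 1:
        raise ValueError("'top_p' must be in (0, 1]")

    generate_kwargs = {"max_new_tokens": max_tokens}
    if temperature == 0:
        # Greedy decoding; sampling parameters are omitted on purpose.
        generate_kwargs["do_sample"] = False
    else:
        generate_kwargs.update(do_sample=True, temperature=temperature, top_p=top_p)
    return prompt, generate_kwargs
```

Inside the handler, the returned dictionary could be unpacked as `model.generate(**inputs, **generate_kwargs, pad_token_id=tokenizer.eos_token_id)`, with a `ValueError` mapped to an HTTP 400 response instead of a 500.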
## Step 4: Build and Push Container Image

In [None]:
# Build the Docker image
!docker build -t {IMAGE_URI} .

In [None]:
# Push the image to Artifact Registry
!docker push {IMAGE_URI}

In [None]:
# Go back to parent directory
os.chdir("..")

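Before uploading the model, it can save a slow deployment cycle to exercise the container locally. The sketch below assumes the image was started on your machine with something like `docker run --rm -p 8080:8080 -e HF_TOKEN=<token> <IMAGE_URI>`, so that `http://localhost:8080` is reachable; the helper names are hypothetical, and only the standard library is used.

```python
import json
import urllib.request


def build_predict_request(base_url, instances, route="/predict"):
    """Build a POST request in the {"instances": [...]} shape the container expects."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + route,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url="http://localhost:8080", timeout=300):
    """Check /health, then send one short prompt to /predict and return the JSON reply."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/health", timeout=10) as resp:
        if resp.status != 200:
            raise RuntimeError(f"health check returned HTTP {resp.status}")
    request = build_predict_request(base_url, [{"prompt": "Hello", "max_tokens": 16}])
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())
```

The generous default timeout reflects that the first CPU generation can be slow; `smoke_test()` is only meaningful once the container log reports that the model has loaded.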
## Step 5: Upload Model to Vertex AI

**Important:** Llama models on Hugging Face are gated. Before uploading the model, you need to:
1. Request access on the Hugging Face model page
2. Create a Hugging Face access token
3. Pass the token to the container as an environment variable when you upload the model

In [None]:
# Optional: Set your Hugging Face token if needed
HF_TOKEN = "your-huggingface-token"  # Replace with your token or leave empty

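A token pasted into a notebook cell is easy to commit or share by accident. As a minimal sketch, the token can instead be read from the `HF_TOKEN` environment variable of the notebook process, with an optional hidden prompt as a fallback; the helper name `resolve_hf_token` is a hypothetical choice.

```python
import os
from getpass import getpass


def resolve_hf_token(env=os.environ, prompt_if_missing=False):
    """Return the Hugging Face token from the environment, or None if unavailable."""
    token = env.get("HF_TOKEN", "").strip()
    if not token and prompt_if_missing:
        # getpass hides the input so the token does not end up in cell output.
        token = getpass("Hugging Face token (input hidden): ").strip()
    return token or None


# Example use in this notebook (uncomment to replace the hardcoded value):
# HF_TOKEN = resolve_hf_token(prompt_if_missing=True)
```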
In [None]:
from google.cloud import aiplatform

# Define environment variables for the container
env_vars = {
    "HUGGINGFACE_MODEL_ID": HUGGINGFACE_MODEL_ID,
}

if HF_TOKEN and HF_TOKEN != "your-huggingface-token":
    env_vars["HF_TOKEN"] = HF_TOKEN

# Upload model to Vertex AI Model Registry
model = aiplatform.Model.upload(
    display_name=f"{MODEL_NAME}-custom",
    serving_container_image_uri=IMAGE_URI,
    serving_container_environment_variables=env_vars,
    serving_container_ports=[8080],
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
)

print(f"Model uploaded with resource name: {model.resource_name}")

## Step 6: Deploy Model to Endpoint

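Choose a machine type based on your model size. The serving code loads float32 weights on CPU and float16 weights on GPU, so budget roughly 4 bytes per parameter on CPU and 2 bytes per parameter on GPU, plus headroom for activations:
- **Llama 3.2 1B**: `n1-standard-4` (15 GB RAM) or larger
- **Llama 3.2 3B**: `n1-standard-8` (30 GB RAM) on CPU, or a single GPU with at least 16 GB of memory
- **Llama 3.2 11B and 90B**: these are multimodal (vision) checkpoints; they require GPUs and are not covered by the text-only serving code in `predictor.py`

For GPU deployment, request GPU quota in your region and set the machine and accelerator types accordingly.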

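The machine-type guidance above comes down to a simple estimate: weight memory is roughly the parameter count times the bytes per parameter. The sketch below makes that arithmetic explicit. The parameter counts are rounded approximations and the function name is hypothetical; activations, the KV cache, and framework overhead are not included, so treat the result as a lower bound.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gib(num_params, dtype="float16"):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Rounded parameter counts (approximate, for illustration only).
for name, params in {"Llama 3.2 1B": 1.2e9, "Llama 3.2 3B": 3.2e9}.items():
    for dtype in ("float32", "float16"):
        print(f"{name} in {dtype}: ~{weight_memory_gib(params, dtype):.1f} GiB")
```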
In [None]:
# Create endpoint
endpoint = aiplatform.Endpoint.create(
    display_name=f"{MODEL_NAME}-endpoint",
    project=PROJECT_ID,
    location=REGION,
)

print(f"Endpoint created: {endpoint.display_name}")
print(f"Endpoint resource name: {endpoint.resource_name}")

In [None]:
# Deploy model to endpoint
# This can take 10-20 minutes
MACHINE_TYPE = "n1-standard-8"  # Adjust based on your model size

deployed_model = model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=f"{MODEL_NAME}-deployment",
    machine_type=MACHINE_TYPE,
    min_replica_count=1,
    max_replica_count=1,
    traffic_percentage=100,
)

print(f"Model deployed to endpoint: {endpoint.resource_name}")

### Optional: Deploy with GPU

For larger models, use GPU acceleration:

In [None]:
# Uncomment to deploy with GPU
# MACHINE_TYPE = "n1-standard-8"  # a single V100 pairs with smaller n1 machine types
# ACCELERATOR_TYPE = "NVIDIA_TESLA_V100"  # A100 GPUs require an a2-highgpu-* machine type instead of n1
# ACCELERATOR_COUNT = 1

# deployed_model = model.deploy(
#     endpoint=endpoint,
#     deployed_model_display_name=f"{MODEL_NAME}-deployment",
#     machine_type=MACHINE_TYPE,
#     accelerator_type=ACCELERATOR_TYPE,
#     accelerator_count=ACCELERATOR_COUNT,
#     min_replica_count=1,
#     max_replica_count=1,
#     traffic_percentage=100,
# )

## Step 7: Test the Deployment

In [None]:
# Test prediction
test_instance = {
    "prompt": "Write a short poem about artificial intelligence:",
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.9
}

prediction = endpoint.predict(instances=[test_instance])
print("\nPrediction result:")
print(prediction.predictions[0]['generated_text'])

In [None]:
# Test with a conversational prompt in the Llama 3 Instruct chat format.
# The tokenizer prepends <|begin_of_text|> itself, so the prompt omits it.
conversation_instance = {
    "prompt": (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        "You are a helpful AI assistant.<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        "What are the three most important things to know about machine learning?<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    ),
    "max_tokens": 200,
    "temperature": 0.7
}

prediction = endpoint.predict(instances=[conversation_instance])
print("\nConversational prediction:")
print(prediction.predictions[0]['generated_text'])

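Writing the chat markup by hand, as in the conversational test above, is error-prone. The sketch below assembles the same kind of prompt from a list of messages. It is a simplified approximation of the Llama 3 Instruct format and the function name is hypothetical; where the tokenizer is available, `tokenizer.apply_chat_template` from Transformers is the more reliable source of the exact template, which may differ in detail.

```python
def build_llama3_prompt(messages, add_bos=False):
    """Assemble a Llama 3 style chat prompt ending with an open assistant turn.

    add_bos defaults to False on the assumption that the serving tokenizer
    prepends <|begin_of_text|> itself.
    """
    allowed_roles = {"system", "user", "assistant"}
    parts = ["<|begin_of_text|>"] if add_bos else []
    for message in messages:
        role = message["role"]
        if role not in allowed_roles:
            raise ValueError(f"unsupported role: {role!r}")
        content = message["content"].strip()
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>")
    # Leave the assistant header open so the model writes the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

A request instance can then be built as `{"prompt": build_llama3_prompt([{"role": "user", "content": "Hello"}]), "max_tokens": 200}`.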
## Step 8: Using the Endpoint via REST API

You can also call the endpoint through the REST API from any application:

In [None]:
import requests
import json
from google.auth import default
from google.auth.transport.requests import Request

# Get authentication credentials
credentials, project = default()
credentials.refresh(Request())

# Prepare the API endpoint URL
endpoint_id = endpoint.name.split('/')[-1]
api_endpoint = f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_id}:predict"

# Prepare request
headers = {
    "Authorization": f"Bearer {credentials.token}",
    "Content-Type": "application/json"
}

payload = {
    "instances": [
        {
            "prompt": "Explain quantum computing in simple terms:",
            "max_tokens": 150,
            "temperature": 0.7
        }
    ]
}

# Make request and fail fast on HTTP errors
response = requests.post(api_endpoint, headers=headers, json=payload, timeout=120)
response.raise_for_status()
result = response.json()

print("\nREST API Response:")
print(json.dumps(result, indent=2))

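Client applications calling the endpoint over REST will occasionally see transient failures, for example while replicas are starting. The following is a minimal retry sketch with exponential backoff. The set of retryable status codes and the helper name `post_with_retry` are assumptions for illustration; `send` is any zero-argument callable that returns an object with `status_code`, `text`, and `json()`, such as a `requests` response.

```python
import time

# Status codes treated as transient in this sketch (an assumption, adjust as needed).
RETRYABLE_STATUS = {429, 502, 503, 504}


def post_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns HTTP 200, backing off between transient failures."""
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code == 200:
            return response.json()
        if response.status_code not in RETRYABLE_STATUS or attempt == max_attempts:
            raise RuntimeError(
                f"Prediction failed with HTTP {response.status_code}: {response.text[:200]}"
            )
        # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
        sleep(base_delay * 2 ** (attempt - 1))
```

With the variables from the cell above, a call could look like `post_with_retry(lambda: requests.post(api_endpoint, headers=headers, json=payload, timeout=120))`.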
## Step 9: Monitoring and Management

In [None]:
# List all endpoints
endpoints = aiplatform.Endpoint.list()
print("Available endpoints:")
for ep in endpoints:
    print(f"- {ep.display_name}: {ep.resource_name}")

In [None]:
# Get endpoint details
print(f"\nEndpoint details:")
print(f"Display name: {endpoint.display_name}")
print(f"Resource name: {endpoint.resource_name}")
print(f"Create time: {endpoint.create_time}")
print(f"\nDeployed models:")
for deployed_model in endpoint.gca_resource.deployed_models:
    print(f"- {deployed_model.display_name}")

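Alongside listing resources, a quick client-side latency check gives a baseline to compare against later. This sketch times any zero-argument callable a few times and summarizes the results; the helper name `measure_latency` is hypothetical, and client-side timings include network overhead, so they complement rather than replace server-side metrics.

```python
import statistics
import time


def measure_latency(call, runs=5, clock=time.perf_counter):
    """Time call() `runs` times and return basic latency statistics in seconds."""
    samples = []
    for _ in range(runs):
        start = clock()
        call()
        samples.append(clock() - start)
    return {
        "runs": runs,
        "min_s": min(samples),
        "mean_s": statistics.mean(samples),
        "max_s": max(samples),
    }
```

For example, `measure_latency(lambda: endpoint.predict(instances=[{"prompt": "Hello", "max_tokens": 16}]), runs=3)` sends three short requests to the deployed endpoint.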
## Step 10: Cleanup (Optional)

**Warning:** Running these cells will delete your endpoint and model. Only run if you want to clean up resources.

In [None]:
# Undeploy model from endpoint
# endpoint.undeploy_all()
# print("All models undeployed from endpoint")

In [None]:
# Delete endpoint
# endpoint.delete(force=True)
# print("Endpoint deleted")

In [None]:
# Delete model
# model.delete()
# print("Model deleted")

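The cleanup steps above must run in order: a model has to be undeployed before its endpoint is deleted, and the model itself should be deleted last. A minimal sketch that wraps the same three calls in one function follows; the name `cleanup` is hypothetical, and the arguments are expected to be the `endpoint` and `model` objects created earlier in this notebook.

```python
def cleanup(endpoint=None, model=None):
    """Undeploy, delete the endpoint, then delete the model. Returns the steps performed."""
    steps = []
    if endpoint is not None:
        endpoint.undeploy_all()  # stops the serving replicas
        steps.append("undeployed")
        endpoint.delete()
        steps.append("endpoint deleted")
    if model is not None:
        model.delete()
        steps.append("model deleted")
    return steps
```

Calling `cleanup(endpoint, model)` removes everything created in Steps 5 and 6, so the same warning applies: only run it when the deployment is no longer needed.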
## Additional Resources

- [Vertex AI Documentation](https://cloud.google.com/vertex-ai/docs)
- [Custom Container Serving](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements)
- [Meta Llama Models on Hugging Face](https://huggingface.co/meta-llama)
- [Transformers Documentation](https://huggingface.co/docs/transformers/)

## Tips and Best Practices

1. **Model Size**: Choose the appropriate model size based on your use case. Smaller models (1B, 3B) are faster and cheaper.
2. **GPU vs CPU**: For production workloads with larger models, GPUs are recommended for better performance.
3. **Autoscaling**: Configure min/max replicas based on expected traffic patterns.
4. **Monitoring**: Set up Cloud Monitoring alerts for endpoint health and latency.
5. **Cost Optimization**: Use Spot VMs or preemptible instances for development/testing.
6. **Security**: Store HF tokens in Secret Manager instead of environment variables for production.

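For the Secret Manager tip above, a minimal sketch of reading a token at runtime is shown here. It assumes the `google-cloud-secret-manager` package is installed, that a secret (hypothetically named `hf-token`) already exists, and that the calling identity is allowed to access it. The client library is imported lazily so the helper for building the resource name works without it.

```python
def secret_version_name(project_id, secret_id, version="latest"):
    """Build the full Secret Manager resource name for one secret version."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


def read_secret(project_id, secret_id, version="latest"):
    """Fetch a secret payload as text (requires google-cloud-secret-manager)."""
    from google.cloud import secretmanager  # imported lazily on purpose

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(
        request={"name": secret_version_name(project_id, secret_id, version)}
    )
    return response.payload.data.decode("utf-8")
```

In `predictor.py`, something like `HF_TOKEN = read_secret(project_id, "hf-token")` could then replace the `HF_TOKEN` environment variable, where the project ID and secret name are placeholders for your own values.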
## Troubleshooting

**Issue**: Container fails to start
- Check Cloud Logging for container logs
- Verify Hugging Face token is valid
- Ensure sufficient memory/CPU for model size

**Issue**: Slow predictions
- Consider using GPU instances
- Enable model quantization (8-bit or 4-bit)
- Use smaller models for faster inference

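As a sketch of the quantization option above, the model can be loaded through the `bitsandbytes` integration in Transformers. This assumes a CUDA GPU on the serving machine and that `bitsandbytes` has been added to `requirements.txt`; the function name `load_quantized` is hypothetical, and output quality should be checked after quantizing.

```python
def load_quantized(model_id, bits=4, token=None):
    """Load a causal language model with 4-bit or 8-bit bitsandbytes quantization."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")

    # Imported lazily so the function can be defined without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    if bits == 4:
        config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    else:
        config = BitsAndBytesConfig(load_in_8bit=True)

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        token=token,
        quantization_config=config,
        device_map="auto",  # place the quantized weights on the available GPU(s)
    )
```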
**Issue**: Out of memory errors
- Increase machine type memory
- Reduce max_tokens in requests
- Load the model in half precision (float16 or bfloat16) or with quantization to shrink the weights