[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gouthamgo/FineTuning/blob/main/lessons/module5_deployment/01_deploy_production.ipynb)

# 🚀 Deploy Your Model to Production

**Duration:** 2 hours  
**Level:** Intermediate  
**What You'll Learn:** How to actually deploy models so people can use them

---

## This Is What Separates Hobbyists from Professionals!

Real talk: **Training a model in a notebook means NOTHING if you can't deploy it.**

Companies don't care about your Colab notebooks. They care if you can:
- ✅ Deploy a model as an API
- ✅ Handle real user traffic
- ✅ Monitor performance
- ✅ Update models without downtime

**This lesson teaches you ALL of that.**

We'll deploy the same model THREE different ways:
1. **HuggingFace Spaces** (Easiest, free)
2. **FastAPI + Docker** (Production-grade)
3. **AWS Lambda** (Serverless, scalable)

By the end, you'll have a deployed model with a public URL you can share with recruiters! 🎯

## 🏗️ Deployment Option 1: HuggingFace Spaces (Start Here!)

**Pros:**
- ✅ Completely free
- ✅ No server management
- ✅ Automatic HTTPS
- ✅ Public URL instantly
- ✅ Perfect for portfolio

**Cons:**
- ❌ Built around the Gradio and Streamlit SDKs (Docker Spaces are possible but take more setup)
- ❌ Slower than dedicated servers
- ❌ Can't customize infrastructure

**Best for:** Demos, portfolio projects, MVPs

---

### Step 1: Create Your Gradio App

In [None]:
!pip install -q gradio transformers torch

In [None]:
# app.py - This is what you'll deploy!

import gradio as gr
from transformers import pipeline

# Load your fine-tuned model
# In production, you'd load from HuggingFace Hub or local path
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

def predict(text):
    """Get sentiment prediction"""
    result = classifier(text)[0]
    
    label = result['label']
    score = result['score']
    
    # Blank line between the two parts so Markdown renders them on separate lines
    return f"**Prediction:** {label}\n\n**Confidence:** {score:.2%}"

# Create Gradio interface
demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(label="Enter text to analyze", placeholder="Type something..."),
    outputs=gr.Markdown(label="Result"),
    title="Sentiment Analysis Model",
    description="Fine-tuned DistilBERT for sentiment classification. Part of my ML portfolio.",
    examples=[
        ["This product is amazing! I love it!"],
        ["Terrible experience. Very disappointed."],
        ["It's okay, nothing special."],
    ],
    theme=gr.themes.Soft(),
)

if __name__ == "__main__":
    demo.launch()

### Step 2: Deploy to HuggingFace Spaces

```bash
# 1. Create a new Space on huggingface.co/spaces
# 2. Choose "Gradio" as the SDK
# 3. Clone the repo:
git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
cd YOUR_SPACE_NAME

# 4. Create these files:
```

**app.py** (your Gradio code from above)

**requirements.txt:**
```
gradio
transformers
torch
```

**README.md:**
```markdown
---
title: Sentiment Analysis
emoji: 🎭
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
---

# Sentiment Analysis Model

Fine-tuned DistilBERT for sentiment classification.

Built by [Your Name] as part of ML portfolio.
```

```bash
# 5. Push to HuggingFace:
git add .
git commit -m "Deploy sentiment analysis model"
git push
```

**That's it!** Once the Space finishes building (this can take a few minutes), your model is live at:
`https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`

🎉 **Put this link on your resume and LinkedIn!**
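
Once the Space is up, you can also call it from code instead of the browser. Here is a minimal sketch, assuming the `gradio_client` package is installed and that the `gr.Interface` above exposes its default `/predict` API name. `connect` and `query_space` are hypothetical helper names, and `YOUR_USERNAME/YOUR_SPACE_NAME` is the same placeholder as above.

```python
def connect(space_id: str):
    """Open a connection to a Space such as "YOUR_USERNAME/YOUR_SPACE_NAME"."""
    # Imported lazily so query_space() can be exercised without the package installed.
    from gradio_client import Client

    return Client(space_id)


def query_space(client, text: str) -> str:
    """Send one text to the Space and return the Markdown string it produces."""
    if not text.strip():
        raise ValueError("text must not be empty")
    # gr.Interface registers its function under the "/predict" API name by default.
    return client.predict(text, api_name="/predict")
```

Typical use would be `query_space(connect("YOUR_USERNAME/YOUR_SPACE_NAME"), "This product is amazing!")`.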

## 🏗️ Deployment Option 2: FastAPI + Docker (Production-Grade)

**Pros:**
- ✅ Full control
- ✅ REST API (industry standard)
- ✅ Can deploy anywhere (AWS, GCP, Azure)
- ✅ Scalable
- ✅ What companies actually use

**Cons:**
- ❌ Requires more setup
- ❌ Need to manage servers
- ❌ Costs money (unless free tier)

**Best for:** Production apps, companies, professional projects

---

### Step 1: Create FastAPI Application

In [None]:
# main.py - Production-grade FastAPI app

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import pipeline
import time
import logging

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Create FastAPI app
app = FastAPI(
    title="Sentiment Analysis API",
    description="Production-ready sentiment analysis using fine-tuned transformers",
    version="1.0.0"
)

# Load model on startup
model = None

@app.on_event("startup")
async def load_model():
    """Load model when server starts"""
    global model
    logger.info("Loading model...")
    model = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english"
    )
    logger.info("Model loaded successfully!")

# Request/Response models
class PredictionRequest(BaseModel):
    text: str
    
class PredictionResponse(BaseModel):
    label: str
    confidence: float
    processing_time_ms: float

# Health check endpoint
@app.get("/health")
async def health_check():
    """Check if service is healthy"""
    return {
        "status": "healthy",
        "model_loaded": model is not None
    }

# Prediction endpoint
@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    """Get sentiment prediction"""
    
    if not model:
        raise HTTPException(status_code=503, detail="Model not loaded")
    
    if not request.text or len(request.text.strip()) == 0:
        raise HTTPException(status_code=400, detail="Text cannot be empty")
    
    # Log request
    logger.info(f"Prediction request: {request.text[:50]}...")
    
    # Time the prediction
    start_time = time.time()
    
    try:
        result = model(request.text)[0]
        processing_time = (time.time() - start_time) * 1000  # Convert to ms
        
        response = PredictionResponse(
            label=result['label'],
            confidence=result['score'],
            processing_time_ms=processing_time
        )
        
        logger.info(f"Prediction: {response.label} ({response.confidence:.2f}) in {processing_time:.2f}ms")
        
        return response
        
    except Exception as e:
        logger.error(f"Prediction error: {str(e)}")
        raise HTTPException(status_code=500, detail="Prediction failed")

# Batch prediction endpoint
@app.post("/predict/batch")
async def predict_batch(texts: list[str]):
    """Batch prediction for multiple texts"""
    
    if not model:
        raise HTTPException(status_code=503, detail="Model not loaded")
    
    if len(texts) > 100:
        raise HTTPException(status_code=400, detail="Max 100 texts per batch")
    
    start_time = time.time()
    results = model(texts)
    processing_time = (time.time() - start_time) * 1000
    
    return {
        "predictions": results,
        "count": len(results),
        "processing_time_ms": processing_time
    }

# Metrics endpoint (for monitoring)
@app.get("/metrics")
async def get_metrics():
    """Get service metrics"""
    # In production, you'd track real metrics
    return {
        "total_requests": "See logs",
        "average_latency_ms": "See logs",
        "model_version": "1.0.0"
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
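
The `/metrics` endpoint above returns placeholder strings. As a dependency-free stopgap, request counts and average latency can be tracked in memory. This is a rough sketch only: `MetricsTracker` is a hypothetical helper, its numbers reset on restart, and each worker process keeps its own copy.

```python
import threading


class MetricsTracker:
    """Thread-safe request counter and latency accumulator (in-memory only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._total_requests = 0
        self._total_latency_ms = 0.0

    def record(self, latency_ms: float) -> None:
        """Call once per completed prediction."""
        with self._lock:
            self._total_requests += 1
            self._total_latency_ms += latency_ms

    def snapshot(self) -> dict:
        """Return values shaped like the /metrics response above."""
        with self._lock:
            count = self._total_requests
            average = self._total_latency_ms / count if count else 0.0
        return {"total_requests": count, "average_latency_ms": average}
```

To wire it in, create one `metrics = MetricsTracker()` at module level, call `metrics.record(processing_time)` inside `/predict`, and return `{**metrics.snapshot(), "model_version": "1.0.0"}` from `/metrics`.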

### Step 2: Create Dockerfile

```dockerfile
# Dockerfile
FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy requirements
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY main.py .

# Expose port
EXPOSE 8000

# Run application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

**requirements.txt:**
```
fastapi==0.104.1
uvicorn[standard]==0.24.0
transformers==4.35.0
torch==2.1.0
pydantic==2.5.0
```

### Step 3: Build and Run

```bash
# Build Docker image
docker build -t sentiment-api .

# Run container
docker run -p 8000:8000 sentiment-api
```

### Step 4: Test Your API

```bash
# Health check
curl http://localhost:8000/health

# Single prediction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "This is amazing!"}'

# Batch prediction
curl -X POST "http://localhost:8000/predict/batch" \
  -H "Content-Type: application/json" \
  -d '["Great product!", "Terrible service"]'
```
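
The same checks can be scripted in Python, which is handy because the container needs a moment to load the model before `/predict` works. This is a minimal standard-library sketch. The helper names are made up for illustration, and the base URL is assumed to be `http://localhost:8000` as in the `curl` commands above.

```python
import json
import time
import urllib.request


def wait_until_healthy(base_url: str, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll GET /health until the service reports the model as loaded."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if json.load(resp).get("model_loaded"):
                    return True
        except (OSError, ValueError):
            pass  # server not accepting connections yet, or body was not JSON
        time.sleep(interval_s)
    return False


def build_predict_request(base_url: str, text: str) -> urllib.request.Request:
    """Build the POST /predict request, equivalent to the curl call above."""
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, text: str) -> dict:
    """Send one text to /predict and return the decoded JSON response."""
    with urllib.request.urlopen(build_predict_request(base_url, text), timeout=30) as resp:
        return json.load(resp)
```

A smoke test after `docker run` could call `wait_until_healthy("http://localhost:8000")` and then `predict("http://localhost:8000", "This is amazing!")`.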

### Step 5: Deploy to Cloud

**AWS (ECS):**
```bash
# Push to ECR
aws ecr create-repository --repository-name sentiment-api
docker tag sentiment-api:latest YOUR_ECR_URL/sentiment-api:latest
docker push YOUR_ECR_URL/sentiment-api:latest

# Deploy to ECS (use AWS Console or Terraform)
```

**Google Cloud Run:**
```bash
# Build and deploy in one command!
gcloud run deploy sentiment-api \
  --source . \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated
```

**Azure Container Instances:**
```bash
# Push to ACR
az acr build --registry myregistry --image sentiment-api .

# Deploy
az container create \
  --resource-group myResourceGroup \
  --name sentiment-api \
  --image myregistry.azurecr.io/sentiment-api:latest
```

## 🏗️ Deployment Option 3: AWS Lambda (Serverless)

**Pros:**
- ✅ Scales automatically
- ✅ Pay per request (cheap!)
- ✅ Zero server management
- ✅ High availability built-in

**Cons:**
- ❌ Cold start latency
- ❌ Size limits (250 MB unzipped for .zip packages, 10 GB for container images)
- ❌ 15-minute maximum execution time

**Best for:** Variable traffic, cost optimization, microservices

---

### Lambda Handler Code

```python
# lambda_function.py
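import json
import os

# Lambda's filesystem is read-only except /tmp, so point the Hugging Face cache
# there (must be set before importing transformers). Baking the model into the
# deployment image avoids a download on every cold start.
os.environ.setdefault("HF_HOME", "/tmp/hf")

from transformers import pipeline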

# Load model (done once per container)
model = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

def lambda_handler(event, context):
    """AWS Lambda handler"""
    
    try:
        # Parse request
        body = json.loads(event['body'])
        text = body['text']
        
        # Predict
        result = model(text)[0]
        
        # Return response
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*'  # CORS
            },
            'body': json.dumps({
                'label': result['label'],
                'confidence': result['score']
            })
        }
        
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error': str(e)})
        }
```
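
The handler above returns a 500 for any failure, including a missing or malformed request body, and echoes the raw exception text back to the caller. Here is one way to separate client errors from server errors. It is a sketch, assuming an API Gateway proxy-style event with a string `body`. `make_handler` is a hypothetical factory that takes the model as an argument so the logic can be tested without loading transformers.

```python
import json


def make_handler(model):
    """Build a Lambda-style handler around any callable model(text) -> [{"label", "score"}]."""

    def respond(status: int, payload: dict) -> dict:
        return {
            "statusCode": status,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload),
        }

    def handler(event, context):
        # Client errors (bad JSON, missing text) get a 400, not a 500.
        try:
            body = json.loads(event.get("body") or "{}")
        except (json.JSONDecodeError, TypeError):
            return respond(400, {"error": "Body must be valid JSON"})
        text = body.get("text") if isinstance(body, dict) else None
        if not isinstance(text, str) or not text.strip():
            return respond(400, {"error": "Field 'text' must be a non-empty string"})
        try:
            result = model(text)[0]
        except Exception:
            # Keep internal details out of the response; log them instead.
            return respond(500, {"error": "Prediction failed"})
        return respond(200, {"label": result["label"], "confidence": result["score"]})

    return handler
```

In `lambda_function.py` this would be used as `lambda_handler = make_handler(model)`.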

### Deploy to Lambda

```bash
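# PyTorch + transformers exceed Lambda's 250 MB unzipped limit for .zip packages,
# so package the function as a container image (up to 10 GB) instead.

# 1. Build the image and push it to ECR
#    Assumes a Dockerfile based on an AWS Lambda Python base image that installs
#    transformers + torch, copies lambda_function.py, and sets
#    CMD ["lambda_function.lambda_handler"]. The repository name is an example.
#    Log Docker in to ECR first (aws ecr get-login-password).
aws ecr create-repository --repository-name sentiment-lambda
docker build -t sentiment-lambda .
docker tag sentiment-lambda:latest YOUR_ECR_URL/sentiment-lambda:latest
docker push YOUR_ECR_URL/sentiment-lambda:latest

# 2. Create Lambda function from the image
aws lambda create-function \
  --function-name sentiment-analysis \
  --package-type Image \
  --code ImageUri=YOUR_ECR_URL/sentiment-lambda:latest \
  --role YOUR_LAMBDA_ROLE_ARN \
  --memory-size 3008 \
  --timeout 60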

# 3. Create API Gateway endpoint
# (Use AWS Console - easier for beginners)
```

## 📊 Comparison Table

| Feature | HuggingFace Spaces | FastAPI + Docker | AWS Lambda |
|---------|-------------------|------------------|------------|
| **Cost** | Free | $5-50/mo | $0.20 per 1M requests + compute time |
| **Setup Time** | 5 minutes | 30 minutes | 1 hour |
| **Scalability** | Limited | Manual | Automatic |
| **Latency** | 2-3s | <500ms | 1-2s (cold), <200ms (warm) |
| **Control** | Low | High | Medium |
| **Best For** | Demos | Production | Variable traffic |
| **Resume Impact** | Good | Excellent | Excellent |

**My recommendation:**
1. **Start with HuggingFace Spaces** - Get something live fast
2. **Learn FastAPI + Docker** - What companies actually use
3. **Try Lambda** - Shows you know cloud/serverless
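
To sanity-check the cost row for your own traffic, it helps to remember that Lambda bills for compute time (GB-seconds) on top of the per-request fee. The estimator below is a back-of-the-envelope sketch. The per-GB-second rate is an assumed figure, so check current AWS pricing for your region, and the free tier is ignored.

```python
def lambda_monthly_cost(
    requests_per_month: int,
    avg_duration_ms: float,
    memory_mb: int,
    price_per_million_requests: float = 0.20,   # request fee from the table above
    price_per_gb_second: float = 0.0000166667,  # ASSUMED rate; verify before relying on it
) -> float:
    """Estimate monthly Lambda cost in USD, ignoring any free tier."""
    request_cost = requests_per_month / 1_000_000 * price_per_million_requests
    gb_seconds = requests_per_month * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return request_cost + gb_seconds * price_per_gb_second


def break_even_requests(flat_monthly_cost: float, avg_duration_ms: float, memory_mb: int) -> float:
    """Requests per month at which Lambda costs as much as a flat-priced server."""
    cost_per_request = lambda_monthly_cost(1, avg_duration_ms, memory_mb)
    return flat_monthly_cost / cost_per_request
```

Under these assumed rates, one million 200 ms requests at 3008 MB come to roughly $10 a month, almost all of it compute time, not the request fee.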

## 🎯 Production Best Practices

### 1. **Add Authentication**
```python
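import os
import secrets

from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()
# Keep the key out of source code; the API_KEY variable name is just an example
API_KEY = os.environ["API_KEY"]

@app.post("/predict")
async def predict(
    request: PredictionRequest,
    credentials: HTTPAuthorizationCredentials = Depends(security)
):
    # Verify API key (constant-time comparison avoids timing leaks)
    if not secrets.compare_digest(credentials.credentials.encode(), API_KEY.encode()):
        raise HTTPException(status_code=401, detail="Invalid API key")
    # ... rest of code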
```

### 2. **Add Rate Limiting**
```python
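from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter  # slowapi looks the limiter up on app.state
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)  # returns 429

@app.post("/predict")
@limiter.limit("100/minute")  # Max 100 requests per minute per client IP
async def predict(request: Request, data: PredictionRequest):
    # slowapi needs the `request: Request` argument to identify the client
    # ... code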
```
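
If you're curious what a limit like `100/minute` means under the hood, here is a small sliding-window limiter. It is an illustration of the general idea only, not slowapi's actual implementation. `SlidingWindowLimiter` is a made-up class, it is not thread-safe, and it keeps state in a single process.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window_s` seconds for each key."""

    def __init__(self, limit: int, window_s: float, clock=time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self._clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # key -> timestamps of recent calls

    def allow(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

With `SlidingWindowLimiter(limit=100, window_s=60)`, each client IP would be the key, and a `False` result maps to an HTTP 429 response.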

### 3. **Add Monitoring**
```python
from prometheus_client import Counter, Histogram, make_asgi_app

# Metrics
prediction_counter = Counter('predictions_total', 'Total predictions')
prediction_latency = Histogram('prediction_latency_seconds', 'Prediction latency')

# Expose the metrics for Prometheus to scrape
# (mount this in place of the placeholder /metrics route in main.py)
app.mount("/metrics", make_asgi_app())

@app.post("/predict")
async def predict(request: PredictionRequest):
    with prediction_latency.time():
        result = model(request.text)[0]
    prediction_counter.inc()
    return result
```

### 4. **Add Caching**
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def predict_cached(text: str):
    return model(text)
```
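
Two things to watch with `lru_cache` here: the cached list is shared between callers, so one caller mutating it affects everyone, and inputs that differ only in whitespace each take their own cache slot. The sketch below handles both. `make_cached_predictor` is a hypothetical wrapper, and collapsing whitespace before prediction is an assumption that it doesn't change the result for your model.

```python
from functools import lru_cache


def make_cached_predictor(model, maxsize: int = 1000):
    """Wrap model(text) -> [{"label", "score"}] with an LRU cache keyed on normalized text."""

    @lru_cache(maxsize=maxsize)
    def _cached(normalized: str):
        result = model(normalized)[0]
        # Store an immutable tuple so callers cannot mutate the cached entry.
        return (result["label"], float(result["score"]))

    def predict(text: str) -> dict:
        # Collapse whitespace so trivially different inputs share one cache entry.
        label, score = _cached(" ".join(text.split()))
        return {"label": label, "confidence": score}

    predict.cache_info = _cached.cache_info  # expose hit/miss counters
    return predict
```

`predict.cache_info()` then shows hits and misses, which tells you whether the cache is earning its memory.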

### 5. **Error Handling**
```python
from fastapi import Request
from fastapi.responses import JSONResponse

@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):
    logger.error(f"Unhandled error: {exc}")
    return JSONResponse(
        status_code=500,
        content={"error": "Internal server error"}
    )
```

### 6. **Model Versioning**
```python
@app.post("/v1/predict")  # Version in URL
async def predict_v1(request: PredictionRequest):
    return model_v1(request.text)

@app.post("/v2/predict")  # New version doesn't break old clients
async def predict_v2(request: PredictionRequest):
    return model_v2(request.text)
```
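
The snippet above assumes `model_v1` and `model_v2` already exist. One way to manage several loaded models, and to switch the default without restarting, is a small registry. This is a sketch with a made-up `ModelRegistry` class. A model here is any callable that takes text.

```python
import threading


class ModelRegistry:
    """Map version names to model callables and track which one is the default."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._models = {}
        self._default = None

    def register(self, version: str, model, make_default: bool = False) -> None:
        """Add or replace a version; the first one registered becomes the default."""
        with self._lock:
            self._models[version] = model
            if make_default or self._default is None:
                self._default = version

    def get(self, version=None):
        """Return (version, model); falls back to the default when version is None."""
        with self._lock:
            chosen = version or self._default
            if chosen not in self._models:
                raise KeyError(f"Unknown model version: {chosen!r}")
            return chosen, self._models[chosen]
```

Each versioned route can then call `registry.get("v1")` or `registry.get("v2")`, while an unversioned route uses `registry.get()` and follows whichever version is currently the default.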

## 🚀 Resume Bullets

**After completing this lesson, you can say (using your own measured numbers):**

- "Deployed ML models to production using FastAPI, Docker, and AWS Lambda"
- "Built RESTful API serving 1000+ requests/day with <200ms latency"
- "Implemented monitoring, rate limiting, and authentication for production ML services"
- "Reduced infrastructure costs by 60% using serverless deployment (AWS Lambda)"
- "Created public demo showcasing fine-tuned models (link: [your HF Space])"

**Interview answers:**

Q: "How do you deploy ML models?"
A: "I've deployed models three different ways: HuggingFace Spaces for quick demos, FastAPI + Docker for production environments, and AWS Lambda for cost-effective serverless deployment. For my last project, I chose [X] because [business reason]."

Q: "How do you monitor models in production?"
A: "I track latency, throughput, and error rates using Prometheus. I also log all predictions with confidence scores to detect model drift. If confidence drops below threshold, I alert the team."

Q: "How do you handle scaling?"
A: "For Docker deployments, I use container orchestration (ECS/Kubernetes) with auto-scaling based on CPU/memory. For Lambda, scaling is automatic. I also implement caching and batch processing to reduce load."
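
The monitoring answer above mentions alerting when confidence drops below a threshold. A rolling-window check is one simple way to do that. This is a sketch: `ConfidenceMonitor` is a hypothetical helper, and the window size and threshold are placeholder values you'd tune for your own model.

```python
from collections import deque


class ConfidenceMonitor:
    """Track a rolling mean of prediction confidence and flag drops below a threshold."""

    def __init__(self, window: int = 500, threshold: float = 0.75) -> None:
        self.threshold = threshold
        self._scores = deque(maxlen=window)  # oldest scores fall off automatically

    def rolling_mean(self) -> float:
        return sum(self._scores) / len(self._scores) if self._scores else 0.0

    def record(self, confidence: float) -> bool:
        """Add one score; return True when the window is full and its mean is too low."""
        self._scores.append(confidence)
        window_full = len(self._scores) == self._scores.maxlen
        return window_full and self.rolling_mean() < self.threshold
```

Calling `monitor.record(result["score"])` after each prediction and alerting when it returns `True` gives a basic early warning. Low confidence is only a proxy for drift, so treat an alert as a prompt to investigate.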

## 🎉 Congratulations!

You now know how to deploy ML models like a professional!

**Action items:**
1. ✅ Deploy one model to HuggingFace Spaces (do this now!)
2. ✅ Build a FastAPI wrapper locally
3. ✅ Test with Docker
4. ✅ Deploy to cloud (free tier)
5. ✅ Add URL to LinkedIn/resume

Companies love seeing **live deployed models**. A working public URL immediately sets you apart from candidates who only have notebooks.

---

**Next:** MLOps & Monitoring - Keep your models healthy in production! 📊