# ML Zoomcamp 2025 — Homework 5 (Deployment)

This notebook walks you through all tasks using **uv**, **FastAPI**, and **Docker**.

**Tip:** Use Python **3.12** or **3.13**.

> Run each cell in order. If a cell that uses external tools (uv, Docker, wget) fails,
> run its commands in a terminal, then return to the notebook for the parsing and verification steps.

## Setup
This notebook assumes you have internet access and can install packages. If you're in a managed environment, you may need to run the shell commands in a local terminal instead.

## Question 1 — Install `uv` and check the version

**Goal:** Find the version of `uv` you installed (using `uv --version` or `uv self version`).

Expected answer format (example from the assignment text): **0.9.5**
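
`uv --version` prints more than the bare number (the program name first, and often build details after it), so the value for the form has to be picked out of that line. A minimal sketch, assuming the output starts with `uv` followed by a dotted version; `parse_uv_version` is a hypothetical helper, not part of uv, and the sample line is made up for illustration:

```python
import re


def parse_uv_version(output: str) -> str:
    """Return the dotted version number from a `uv --version` line."""
    match = re.search(r"\buv\s+(\d+\.\d+\.\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized uv version output: {output!r}")
    return match.group(1)


# Illustrative input only: the commit hash and date are invented.
print(parse_uv_version("uv 0.9.5 (abc1234 2025-10-21)"))
```

To feed it real output, the text could come from `subprocess.run(["uv", "--version"], capture_output=True, text=True).stdout`.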

In [None]:
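%%bash
# If you don't have uv yet, install it (choose one):
# Option A: pip (works in many environments)
python -m pip install -U uv

# Option B: pipx (recommended if available)
# pipx install uv

# Check the version. `uv self version` is only available on recent releases,
# so a failure there should not fail the whole cell.
uv --version
uv self version || true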


## Project skeleton (optional but recommended)
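You can do this homework in a clean folder. The next cell creates a `hw05` folder containing a uv project, which Question 2 uses for the lock file. The minimal `pyproject.toml` used by Docker is written later, in the Dockerfile section.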


In [None]:
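%%bash
mkdir -p hw05
cd hw05 || exit 1

# Initialize a uv project in this folder (creates pyproject.toml) for Python 3.13.
# Note: `--package` is a boolean flag, so `uv init --package fastapi` would create
# a project in a subfolder named `fastapi` rather than add FastAPI as a dependency.
# `|| true` keeps the cell from failing when the project already exists.
uv init --python 3.13 || true

# Create the virtual environment and uv.lock
uv sync || true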


## Question 2 — Lock `scikit-learn==1.6.1` and extract the first hash from `uv.lock`

We will:
1. Add `scikit-learn==1.6.1` to the project without syncing, so nothing has to be installed.
2. Refresh the lock file.
3. Parse the first `sha256:` hash under the `[[package]]` section for `scikit-learn`.

**Expected output:** A string like `sha256:...` (copy this into the submission form).

In [None]:
%%bash
# Each bash cell starts in the notebook's directory, so enter hw05 again
# (run the previous setup cell first)
cd hw05 || exit 1

# Add scikit-learn 1.6.1 to pyproject.toml and uv.lock without syncing (installing)
uv add scikit-learn==1.6.1 --no-sync

# Refresh the lock file. `uv lock` only writes uv.lock and never installs packages,
# so no extra "no install" flag is needed.
uv lock --refresh

# Show a small excerpt for manual confirmation (optional)
grep -n -A 3 '^name = "scikit-learn"' uv.lock || true


In [None]:
# Parse the first sha256 hash for scikit-learn from uv.lock
import re, pathlib
lock_path = pathlib.Path('hw05') / 'uv.lock'
text = lock_path.read_text(encoding='utf-8')
m = re.search(r"\[\[package\]\]\s*name\s*=\s*\"scikit-learn\"(.*?)(?=\n\[\[package\]\]|\Z)", text, re.S)
assert m, 'Could not find scikit-learn section in uv.lock'
section = m.group(1)
hashes = re.findall(r'hash\s*=\s*\"(sha256:[0-9a-f]{64})\"', section)
assert hashes, 'No hashes found for scikit-learn in uv.lock'
first_hash = hashes[0]
print('First scikit-learn 1.6.1 hash:', first_hash)


## Models (pre-trained pipeline)
We'll use the provided pipeline with `DictVectorizer` and `LogisticRegression` saved in `pipeline_v1.bin`.

## Question 3 — Load `pipeline_v1.bin` and score a record

Record to score:
```json
{
  "lead_source": "paid_ads",
  "number_of_courses_viewed": 2,
  "annual_income": 79276.0
}
```
Print the probability of conversion (positive class).
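
Because `pickle.load` can execute code embedded in the file it reads, it is worth confirming that the download is the expected file before loading it. On systems without `md5sum`, the same check can be sketched in Python with `hashlib`; `md5_of_file` is a hypothetical helper name, and the demo hashes a throwaway file rather than the real download:

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    # usedforsecurity=False (Python 3.9+) marks this as an integrity check only.
    digest = hashlib.md5(usedforsecurity=False)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a temporary file so the example runs without the real download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
try:
    print(md5_of_file(tmp.name))
finally:
    os.remove(tmp.name)
```

For the homework file, the comparison would be `md5_of_file('pipeline_v1.bin') == '7d17d2e4dfbaf1e408e1a62e6e880d49'`.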

In [None]:
%%bash
# Download the pipeline file (GitHub raw link)
wget -O pipeline_v1.bin https://github.com/DataTalksClub/machine-learning-zoomcamp/raw/refs/heads/master/cohorts/2025/05-deployment/pipeline_v1.bin

# (Optional) verify the MD5 checksum; it should be 7d17d2e4dfbaf1e408e1a62e6e880d49
md5sum pipeline_v1.bin | awk '{print $1}'


In [None]:
import pickle
with open('pipeline_v1.bin', 'rb') as f:
    pipeline = pickle.load(f)

record = {
    "lead_source": "paid_ads",
    "number_of_courses_viewed": 2,
    "annual_income": 79276.0,
}

proba = float(pipeline.predict_proba([record])[0, 1])
print('Q3 probability:', round(proba, 3))


## Question 4 — Serve the model with FastAPI and score a client

Client to score via POST:
```python
client = {
    "lead_source": "organic_search",
    "number_of_courses_viewed": 4,
    "annual_income": 80304.0
}
```
We will write a small FastAPI service and a client request.
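
The client request only succeeds once Uvicorn has finished starting, which can take a moment after launching it. One way to avoid a spurious connection error is to poll the service's `GET /` route before posting. A minimal standard-library sketch; `server_is_up` and `wait_until` are hypothetical helpers, and the attempt count and delay are arbitrary choices:

```python
import time
import urllib.request


def server_is_up(url, timeout=1.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, refused connections, and timeouts are all OSError
        return False


def wait_until(probe, attempts=20, delay=0.5, sleep=time.sleep):
    """Call `probe` up to `attempts` times; return True as soon as it succeeds."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False


# Example use (needs the server from this section running):
# ready = wait_until(lambda: server_is_up("http://127.0.0.1:8000/"))
```

Passing the probe and the sleep function as arguments keeps the retry loop testable without a network connection.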

In [None]:
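# Ensure FastAPI, Uvicorn, and the `requests` client library are available in your environment.
# scikit-learn is pinned to 1.6.1, the version this homework locks in Question 2.
import sys, subprocess
subprocess.run([sys.executable, '-m', 'pip', 'install', '-U', 'fastapi', 'uvicorn', 'pydantic', 'scikit-learn==1.6.1', 'scipy', 'requests'], check=False)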

from textwrap import dedent
app_code = dedent('''
from fastapi import FastAPI
from pydantic import BaseModel
import pickle

class Lead(BaseModel):
    lead_source: str | None = None
    number_of_courses_viewed: float | int | None = 0
    annual_income: float | int | None = 0

with open('pipeline_v1.bin', 'rb') as f:
    pipeline = pickle.load(f)

app = FastAPI()

@app.get('/')
def root():
    return {'status': 'ok'}

@app.post('/predict')
def predict(lead: Lead):
    record = lead.model_dump()
    proba = float(pipeline.predict_proba([record])[0, 1])
    return {'probability': proba}
''')
# Use a context manager so the file is flushed and closed before it is used
with open('app.py', 'w', encoding='utf-8') as f:
    f.write(app_code)
print('Wrote app.py')
print('To run locally:  uvicorn app:app --reload')


In [None]:
# Client call (run this after starting `uvicorn app:app --reload` in a separate terminal)
import requests

url = 'http://127.0.0.1:8000/predict'
client = {
    "lead_source": "organic_search",
    "number_of_courses_viewed": 4,
    "annual_income": 80304.0,
}
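try:
    resp = requests.post(url, json=client, timeout=5)
    resp.raise_for_status()  # surface 4xx/5xx responses instead of parsing an error body
    print('Response:', resp.json())
except requests.RequestException as e:
    print('Request failed; make sure the server is running (uvicorn app:app --reload). Error:', e)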



## Question 5 — Docker base image size

Pull the base image and check its size with `docker images` (look under the `SIZE` column).
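
If you capture the `docker images` listing as text, the size can be picked out programmatically. The sketch below assumes the classic table layout, in which `SIZE` is the last column and the repository and tag are the first two; `image_size` is a hypothetical helper, and the sample listing is illustrative (the image ID and creation time are made up):

```python
def image_size(listing: str, repository: str, tag: str):
    """Return the SIZE field for repository:tag, or None if it is not listed."""
    for line in listing.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3 and fields[0] == repository and fields[1] == tag:
            # CREATED contains spaces ("2 weeks ago"), so take the last field.
            return fields[-1]
    return None


# Illustrative listing only; the ID and age are invented.
SAMPLE = """\
REPOSITORY                 TAG       IMAGE ID       CREATED       SIZE
agrigorev/zoomcamp-model   2025      0123456789ab   2 weeks ago   121MB
"""

print(image_size(SAMPLE, "agrigorev/zoomcamp-model", "2025"))
```

Docker can also print the field directly with `docker images --format '{{.Size}}' agrigorev/zoomcamp-model:2025`, which avoids parsing the table.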

In [None]:
%%bash
# Pull the prepared base image
docker pull agrigorev/zoomcamp-model:2025

# List only this image; its size is in the SIZE column
docker images agrigorev/zoomcamp-model:2025


## Dockerfile — build your service image

Create a `Dockerfile` that starts FROM the provided base and installs dependencies from `pyproject.toml`, then runs the FastAPI app with Uvicorn.

In [None]:
from textwrap import dedent
dockerfile = dedent('''
FROM agrigorev/zoomcamp-model:2025
WORKDIR /code

# Install uv (fast installer)
RUN python -m pip install -U uv

# Copy project files
COPY pyproject.toml ./
COPY app.py ./
COPY pipeline_v1.bin ./

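# Install dependencies via uv (no dev deps). uv.lock is not copied into the image,
# so `--frozen` is expected to fail and the plain `uv sync` resolves from pyproject.toml.
RUN uv sync --no-dev --frozen || uv sync --no-dev

# uv installs into /code/.venv; put its bin directory on PATH so `uvicorn` resolves
ENV PATH="/code/.venv/bin:$PATH"

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]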
''')
with open('Dockerfile', 'w', encoding='utf-8') as f:
    f.write(dockerfile)
print('Wrote Dockerfile')

pyproject = dedent('''
[project]
name = "mlzoomcamp-hw5"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
  "fastapi",
  "uvicorn",
  "scikit-learn==1.6.1",
  "scipy",
  "pydantic>=2",
]
''')
with open('pyproject.toml', 'w', encoding='utf-8') as f:
    f.write(pyproject)
print('Wrote pyproject.toml')
print('Build with:  docker build -t hw5-service .')
print('Run with:    docker run --rm -p 8000:8000 hw5-service')


## Question 6 — Run your container and score the client

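After building and running your container, POST the same `client` to the running service and print the probability. Note that the Dockerfile above copies `pipeline_v1.bin` and `app.py` loads that file, so the container serves the same model as Question 4. If the assignment expects the model bundled in the base image instead, change the filename opened in `app.py` before building.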

In [None]:
import requests
url = 'http://127.0.0.1:8000/predict'  # if running locally with -p 8000:8000
client = {
    "lead_source": "organic_search",
    "number_of_courses_viewed": 4,
    "annual_income": 80304.0,
}
try:
    resp = requests.post(url, json=client, timeout=5)
    resp.raise_for_status()  # surface 4xx/5xx responses instead of parsing an error body
    print('Q6 probability:', round(resp.json().get('probability', float('nan')), 3))
except requests.RequestException as e:
    print('Request failed; ensure the Docker container is running and listening on port 8000. Error:', e)


## Answers — Fill these into the course form
- Q1: uv version. Run the cell in *Question 1* and copy the version (the assignment text gives **0.9.5** as an example).
- Q2: First `scikit-learn==1.6.1` hash — run the parser cell in *Question 2* and copy the first `sha256:` value.
- Q3: Probability from `pipeline_v1.bin` — run *Question 3* (expected ~**0.533**).
- Q4: Probability from FastAPI app — run *Question 4* client cell (expected ~**0.534**).
- Q5: Size of `agrigorev/zoomcamp-model:2025` — run *Question 5* (expected **121 MB**).
- Q6: Probability from Dockerized service — run *Question 6* (expected ~**0.79**).

> If your numbers differ slightly, pick the closest option in the multiple choice.
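
Picking the closest option amounts to minimizing the absolute difference between your result and each choice. A small sketch; `closest_option` is a hypothetical helper, and the option values below are placeholders rather than the actual choices on the course form:

```python
def closest_option(value, options):
    """Return the option with the smallest absolute distance to `value`."""
    return min(options, key=lambda option: abs(option - value))


# Placeholder choices; substitute the ones shown on the form.
print(closest_option(0.5336, [0.333, 0.533, 0.733, 0.933]))
```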

---
_Notebook generated: 2025-10-22T21:13:28Z_
