## Homework

> Note: sometimes your answer doesn't match one of the options exactly. 
> That's fine. 
> Select the option that's closest to your solution.

We recommend using Python 3.12 or 3.13 in this homework.

In this homework, we're going to continue working with the lead scoring dataset. You don't need the dataset: we will provide the model for you.

## My setup with `uv`

I aim to create two virtual environments:
- One in the project root – the standard Python virtual environment.
- Another in the homework folder (`cohorts/2025/05-deployment`) using `uv`

Here's how I set it up (maybe not the best way to do it though):
1.  I activate the standard virtual environment (created with `venv`) and install `uv`
```bash (venv)
pip install uv
```
2. Then, I navigate to the homework folder and initialize a new `uv` project:
```bash (venv)
uv init
```

***Note:*** At this point there is no virtual environment managed by `uv` yet

## Question 1

* Install `uv`
* What's the version of uv you installed?
* Use `--version` to find out

### Solution 1

I have already installed `uv` above, so I only have to run the following (in the activated `venv`):
```bash (venv)
uv --version
```
Which returns
```
uv 0.9.5
```

## Initialize an empty uv project

You should create an empty folder for homework
and do it there. 

## Question 2

* Use uv to install Scikit-Learn version 1.6.1 
* What's the first hash for Scikit-Learn you get in the lock file?
* Include the entire string starting with sha256:, don't include quotes

### Solution 2 & Jupyter setup

I want to do two things:
1. do the exercise
2. connect this Jupyter notebook with the `uv` managed virtual environment

The first part is easy. Run the following from the homework folder (as far as I can tell, this applies to all subsequent `uv add` commands as well):
```bash (venv)
uv add scikit-learn==1.6.1
```
On my setup this always returns a warning
```
warning: `VIRTUAL_ENV=/home/jx/projects/machine-learning-zoomcamp/venv` does not match the project environment path `.venv` and will be ignored; use `--active` to target the active environment instead
```
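This warning means that `uv` detected a different active virtual environment (the standard `venv`). It is safe to ignore: `uv` still installs into the project's `.venv`, and we want to keep the standard `venv` active in the shell. Admittedly, having two virtual environments in play is a bit confusing.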

After installing our first package, `uv` also creates a virtual environment in the homework folder. Now we have two virtual environments:
- The standard Python virtual environment in `venv/` (activation script: `venv/bin/activate`). This one has `uv` installed.
- The `uv`-managed virtual environment in `cohorts/2025/05-deployment/.venv/` (activation script: `cohorts/2025/05-deployment/.venv/bin/activate`). This one has `scikit-learn==1.6.1` installed.

**Note**: To activate the `uv` virtual environment from the project root, run `source cohorts/2025/05-deployment/.venv/bin/activate` in the console. To leave the environment, run `deactivate`.

Next I want to use the `uv` virtual environment for this notebook. First I have to install `ipykernel`:
```bash (venv)
uv add ipykernel
```
I'm using VS Code to run the `.ipynb` files. My VS Code doesn't find the `uv` virtual environment automatically. I had to press `Ctrl + Shift + P` and select `Python: Select Interpreter`, where I could choose `select interpreter path`. There I entered `cohorts/2025/05-deployment/.venv/bin/python`. This gave an error, but the environment was then selectable for the notebook.
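Because the interpreter selection reported an error, it can be worth confirming from inside the notebook which environment the kernel actually runs in. A minimal sketch, assuming the notebook's working directory is the project root; the helper name `kernel_uses_env` is made up for this example:

```python
import sys
from pathlib import Path


def kernel_uses_env(env_dir: str) -> bool:
    """Return True if the running interpreter belongs to the venv at env_dir."""
    return Path(sys.prefix).resolve() == Path(env_dir).resolve()


# Path of the Python binary that executes this cell.
print("interpreter:", sys.executable)
# Relative path, so this assumes the working directory is the project root.
print("in uv env:", kernel_uses_env("cohorts/2025/05-deployment/.venv"))
```

Inside a virtual environment, `sys.prefix` points at the environment's folder, which is what the comparison relies on.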

Now we can check if we are running the right version of scikit-learn



In [10]:
import sklearn

sklearn.__version__

'1.6.1'


## Models

We have prepared a pipeline with a dictionary vectorizer and a model.

It was trained (roughly) using this code:

```python
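from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# `df` and `y_train` come from the lead scoring dataset (not shown here)
categorical = ['lead_source']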
numeric = ['number_of_courses_viewed', 'annual_income']

df[categorical] = df[categorical].fillna('NA')
df[numeric] = df[numeric].fillna(0)

train_dict = df[categorical + numeric].to_dict(orient='records')

pipeline = make_pipeline(
    DictVectorizer(),
    LogisticRegression(solver='liblinear')
)

pipeline.fit(train_dict, y_train)
```

> **Note**: You don't need to train the model. This code is just for your reference.

And then saved with Pickle. Download it [here](https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master/cohorts/2025/05-deployment/pipeline_v1.bin).

With `wget`:

```bash
wget https://github.com/DataTalksClub/machine-learning-zoomcamp/raw/refs/heads/master/cohorts/2025/05-deployment/pipeline_v1.bin
```

When downloading the file we have to make sure that it lands in the right folder for this notebook to find it. The notebook looks for files in the project root as we can see here:

In [None]:
import os

print("Current working directory:", os.getcwd())

Current working directory: /home/jx/projects/machine-learning-zoomcamp


The shell reports the same working directory:

In [14]:
!pwd

/home/jx/projects/machine-learning-zoomcamp


So everything is fine and we can run the command below (ideally just once):

In [15]:
#!wget https://github.com/DataTalksClub/machine-learning-zoomcamp/raw/refs/heads/master/cohorts/2025/05-deployment/pipeline_v1.bin
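The "just once" part can also be enforced in code instead of commenting the cell out. This is only a sketch using the standard library; the helper name `download_once` is made up for this example, and the URL is the same one used with `wget` above.

```python
import urllib.request
from pathlib import Path

MODEL_URL = (
    "https://github.com/DataTalksClub/machine-learning-zoomcamp/raw/refs/heads/"
    "master/cohorts/2025/05-deployment/pipeline_v1.bin"
)


def download_once(url: str, dest: str) -> bool:
    """Download `url` to `dest` unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    """
    target = Path(dest)
    if target.exists():
        return False
    urllib.request.urlretrieve(url, str(target))
    return True
```

Calling `download_once(MODEL_URL, "pipeline_v1.bin")` from the notebook would place the file in the current working directory and do nothing on later runs.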

## Question 3

Let's use the model!

* Write a script for loading the pipeline with pickle
* Score this record:

```json
{
    "lead_source": "paid_ads",
    "number_of_courses_viewed": 2,
    "annual_income": 79276.0
}
```

What's the probability that this lead will convert? 

* 0.333
* 0.533
* 0.733
* 0.933

If you're getting errors when unpickling the files, check their checksum:

```bash
$ md5sum pipeline_v1.bin
7d17d2e4dfbaf1e408e1a62e6e880d49 *pipeline_v1.bin
```
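If `md5sum` is not available (for example on Windows), the same check can be done from Python. A minimal sketch using `hashlib`; the helper name `md5_of_file` is made up for this example, and the expected digest is the one quoted above.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as file_in:
        for chunk in iter(lambda: file_in.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    expected = "7d17d2e4dfbaf1e408e1a62e6e880d49"  # from the homework text
    model_path = Path("pipeline_v1.bin")
    if model_path.exists():
        print("checksum ok:", md5_of_file(str(model_path)) == expected)
```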

### Solution 3

In [32]:
import pickle

from sklearn.pipeline import Pipeline

with open("pipeline_v1.bin", "rb") as file_in:
    pipeline: Pipeline = pickle.load(file_in)

record = {
    "lead_source": "paid_ads",
    "number_of_courses_viewed": 2,
    "annual_income": 79276.0,
}

print("The classes", pipeline.classes_)
pipeline.predict_proba(record)

The classes [0 1]


array([[0.46639273, 0.53360727]])

The second column belongs to class `1` (the lead converts), so the correct probability is:

* 0.333
* **0.533**
* 0.733
* 0.933

## Question 4

Now let's serve this model as a web service

* Install FastAPI
* Write FastAPI code for serving the model
* Now score this client using `requests`:

```python
url = "YOUR_URL"
client = {
    "lead_source": "organic_search",
    "number_of_courses_viewed": 4,
    "annual_income": 80304.0
}
requests.post(url, json=client).json()
```

What's the probability that this client will get a subscription?

* 0.334
* 0.534
* 0.734
* 0.934

### Solution 4

We have to install three packages for this:
```bash (venv)
uv add requests fastapi[standard] uvicorn
```

- requests: the exercise asks us to use `requests` to send the HTTP request to the server
- fastapi: to write the web server in Python
- uvicorn: to host the web server. We could also use fastapi itself for this, but we need uvicorn later anyway

After that we have to run `question_4_server.py` (make sure to be in the homework folder, otherwise `uv run` won't find `uvicorn`):

```bash (venv)
uv run uvicorn question_4_server:app
```

uvicorn then prints the address and port it is listening on (port 8000 by default), and we can execute the following code to get the answer:

In [None]:
import requests

url = "http://localhost:8000/predict"
client = {
    "lead_source": "organic_search",
    "number_of_courses_viewed": 4,
    "annual_income": 80304.0,
}
requests.post(url, json=client).json()

{'prob': 0.5340417283801275}

So the answer is:

* 0.334
* **0.534**
* 0.734
* 0.934

## Docker

Install [Docker](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/05-deployment/06-docker.md). 
We will use it for the next two questions.

For these questions, we prepared a base image: `agrigorev/zoomcamp-model:2025`. 
You'll need to use it (see Question 5 for an example).

This image is based on `3.13.5-slim-bookworm` and has
a pipeline with logistic regression (a different one)
as well as a dictionary vectorizer inside. 

This is what the Dockerfile for this image looks like:

```docker 
FROM python:3.13.5-slim-bookworm
WORKDIR /code
COPY pipeline_v2.bin .
```

We already built it and then pushed it to [`agrigorev/zoomcamp-model:2025`](https://hub.docker.com/r/agrigorev/zoomcamp-model).

> **Note**: You don't need to build this docker image, it's just for your reference.

## My Docker setup

I can't explain how to install Docker in detail. On Ubuntu I followed this guide: https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository. The official site says that the recommended way is to install Docker Desktop. I don't really know why I chose the harder way, but I learned some Linux along the way.

I also added my user to the `docker` group (https://docs.docker.com/engine/install/linux-postinstall) so I can run Docker commands without `sudo`.

Once all of this is done we can do the next exercise.

**Note:** I didn't know how to make the shell used by this Jupyter notebook pick up the new group membership, so in the end I just restarted my PC before the next exercise. Logging out and back in should also be enough, since group changes only apply to new login sessions.

## Question 5

Download the base image `agrigorev/zoomcamp-model:2025`. You can easily get it with the [docker pull](https://docs.docker.com/engine/reference/commandline/pull/) command.

So what's the size of this base image?

* 45 MB
* 121 MB
* 245 MB
* 330 MB

You can get this information when running `docker images` - it'll be in the "SIZE" column.

### Solution 5

In [24]:
! docker pull agrigorev/zoomcamp-model:2025

2025: Pulling from agrigorev/zoomcamp-model
Digest: sha256:14d79fde0bbf078eb18c99c2bd007205917b758ec11060b2994963a1e485c2ae
Status: Image is up to date for agrigorev/zoomcamp-model:2025
docker.io/agrigorev/zoomcamp-model:2025


In [25]:
! docker images

REPOSITORY                 TAG       IMAGE ID       CREATED        SIZE
ml-zoomcamp-homework-05    2025      1f1fe8338dc7   13 hours ago   204MB
agrigorev/zoomcamp-model   2025      4a9ecc576ae9   4 days ago     121MB
hello-world                latest    1b44b5a3e06a   2 months ago   10.1kB


So the answer is:

* 45 MB
* **121 MB**
* 245 MB
* 330 MB

## Dockerfile

Now create your own `Dockerfile` based on the image we prepared.

It should start like that:

```docker
FROM agrigorev/zoomcamp-model:2025
# add your stuff here
```

Now complete it:

* Install all the dependencies from pyproject.toml
* Copy your FastAPI script
* Run it with uvicorn 

After that, you can build your docker image.

## How I wrote my Dockerfile

You can inspect the Dockerfile yourself, but a few points are worth noting. First, let's look at the history of the image that we pulled earlier:

In [27]:
! docker history agrigorev/zoomcamp-model:2025

IMAGE          CREATED        CREATED BY                                      SIZE      COMMENT
4a9ecc576ae9   4 days ago     COPY pipeline_v2.bin . # buildkit               1.3kB     buildkit.dockerfile.v0
<missing>      4 days ago     WORKDIR /code                                   0B        buildkit.dockerfile.v0
<missing>      4 months ago   CMD ["python3"]                                 0B        buildkit.dockerfile.v0
<missing>      4 months ago   RUN /bin/sh -c set -eux;  for src in idle3 p…   36B       buildkit.dockerfile.v0
<missing>      4 months ago   RUN /bin/sh -c set -eux;   savedAptMark="$(a…   36.7MB    buildkit.dockerfile.v0
<missing>      4 months ago   ENV PYTHON_SHA256=93e583f243454e6e9e4588ca2c…   0B        buildkit.dockerfile.v0
<missing>      4 months ago   ENV PYTHON_VERSION=3.13.5                       0B        buildkit.dockerfile.v0
<missing>      4 months ago   ENV GPG_KEY=7169605F62C751356D054A26A821E680…   0B        buildkit.dockerfile.v0
<missing>      4

So there are two remarks:
1. The pipeline is called `pipeline_v2.bin`, so the code from Question 4 wouldn't work because it loads `pipeline_v1.bin`. So I made a copy of `question_4_server.py` named `question_6_server.py`, which has basically just this one change.
2. Since `pipeline_v2.bin` already sits in the image's working directory (`/code`), we shouldn't set a different working directory in our Dockerfile.

In [28]:
# build docker image
! docker build -t ml-zoomcamp-homework-05:2025 cohorts/2025/05-deployment

[+] Building 0.1s (9/9) FINISHED                                 docker:default
 => [internal] load build definition from Dockerfile                       0.0s
 => => transferring dockerfile: 672B                                       0.0s
 => [internal] load metadata for docker.io/agrigorev/zoomcamp-model:2025   0.0s
 => [internal] load .dockerignore                                          0.0s
 => => transferring context: 2B                                            0.0s
 => [1/4] FROM docker.io/agrigorev/zoomcamp-model:2025                     0.0s
 => [internal] load build context                                          0.0s
 => => transferring context: 104B                                          0.0s
 => CACHED [2/4] RUN pip install uv                                        0.0s

## Question 6

Let's run your docker container!

After running it, score this client once again:

```python
url = "YOUR_URL"
client = {
    "lead_source": "organic_search",
    "number_of_courses_viewed": 4,
    "annual_income": 80304.0
}
requests.post(url, json=client).json()
```

What's the probability that this lead will convert?

* 0.39
* 0.59
* 0.79
* 0.99

### Solution 6

We have to run the container:

```bash
docker run -p 9696:9696 ml-zoomcamp-homework-05:2025
```
(I didn't run it from the notebook because the process does not terminate. Running the container detached with `-d` would probably work, but I prefer a separate terminal for this anyway.)

Then, very similarly to before, we can send a request to the container:
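Since the container is started in a separate terminal, it can help to check from the notebook that something is already listening on the published port before sending the request. A minimal sketch with the standard library; the helper name `wait_for_port` is made up for this example:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)
    return False
```

For this exercise the call would be `wait_for_port("localhost", 9696)`. A successful TCP connection only shows that the port is open, not that the model has finished loading, so treat it as a quick sanity check.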

In [None]:
import requests

url = "http://localhost:9696/predict"
client = {
    "lead_source": "organic_search",
    "number_of_courses_viewed": 4,
    "annual_income": 80304.0,
}
requests.post(url, json=client).json()

{'prob': 0.9933071490756734}

So the solution is:

* 0.39
* 0.59
* 0.79
* **0.99**

## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw05
* If your answer doesn't match options exactly, select the closest one
