## Homework

In this homework, we'll deploy the ride duration model in batch mode. Like in homework 1, we'll use the Yellow Taxi Trip Records dataset. 

You'll find the starter code in the [homework](homework) directory.

Solution: [homework_solution/](homework_solution/)

## Q1. Notebook

We'll start with the same notebook we ended up with in homework 1.
We cleaned it a little bit and kept only the scoring part. You can find the initial notebook [here](homework/starter.ipynb).

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?

* ~~1.24~~
* **6.24**
* ~~12.28~~
* ~~18.28~~


In [1]:
!pip freeze | grep scikit-learn

scikit-learn==1.5.2


In [2]:
!python -V

Python 3.12.4


In [119]:
!pipenv run pip freeze | grep scikit-learn

scikit-learn==1.5.0


In [118]:
import os
import pickle
import json

import numpy as np
import pandas as pd

from IPython.display import JSON, Code

In [5]:
with open('./homework/model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [6]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [7]:
df = read_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet')

In [8]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

In [9]:
print(f"Standard deviation of the predicted duration is: {np.std(y_pred)=:.3f}")

Standard deviation of the predicted duration is: np.std(y_pred)=6.247
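
One detail that can explain small mismatches with the listed options: `np.std` computes the population standard deviation by default (`ddof=0`), whereas pandas' `Series.std` defaults to the sample version (`ddof=1`). The sketch below uses the standard library's `statistics` module on a toy list (hypothetical values, not taxi data) to show the two definitions side by side. On millions of rows the gap is negligible.

```python
import statistics

# Toy values for illustration only (not taken from the taxi dataset).
values = [10.0, 12.0, 14.0, 16.0]

# Population standard deviation: divide by n (np.std default, ddof=0).
population_std = statistics.pstdev(values)

# Sample standard deviation: divide by n - 1 (pandas Series.std default, ddof=1).
sample_std = statistics.stdev(values)

print(f"population: {population_std:.3f}, sample: {sample_std:.3f}")
```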




## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output. 

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```

Next, write the ride id and the predictions to a dataframe with results. 

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What's the size of the output file?

* ~~36M~~
* ~~46M~~
* ~~56M~~
* **66M**

__Note:__ Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the
dtypes of the columns and use `pyarrow`, not `fastparquet`. 
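
To make the `ride_id` format concrete, here is a minimal pure-Python sketch of what the f-string above produces for a single row. `make_ride_id` is a hypothetical helper used only for illustration; the notebook builds the column with the vectorized pandas expression.

```python
def make_ride_id(year: int, month: int, row_index: int) -> str:
    """Build one ride id the same way the pandas expression does per row."""
    return f"{year:04d}/{month:02d}_{row_index}"


print(make_ride_id(2023, 3, 0))  # 2023/03_0
```

Because `read_data` filters rows without resetting the index, the ids follow the original row labels and may contain gaps.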

In [12]:
year, month = 2023, 3

In [27]:
year = 2023
month = 3
taxi_type = 'yellow'

#input_file = f'https://s3.amazonaws.com/nyc-tlc/trip+data/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'
output_file = f'../data/output/{taxi_type}_{year:04d}-{month:02d}.parquet'

In [21]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'Airport_fee', 'duration',
       'ride_id'],
      dtype='object')

In [13]:
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

In [39]:
df_result = pd.DataFrame()
df_result['ride_id'] = df['ride_id']
df_result['predicted_duration'] = y_pred

In [31]:
df_result.head()

|   | ride_id | predicted_duration |
|---|---|---|
| 0 | 2023/03_0 | 16.245906 |
| 1 | 2023/03_1 | 26.134796 |
| 2 | 2023/03_2 | 11.884264 |
| 3 | 2023/03_3 | 11.99772 |
| 4 | 2023/03_4 | 10.234486 |


In [40]:
output_file

'../data/output/yellow_2023-03.parquet'

In [33]:
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

In [34]:
! ls -hAlt ../data/output/

total 73M
-rw-r--r-- 1 Gleb None  66M Jun 14 22:36 yellow_2023-03.parquet
-rw-r--r-- 1 Gleb None 2.2M Jun 14 22:10 val.pkl
-rw-r--r-- 1 Gleb None 2.3M Jun 14 22:10 train.pkl
-rw-r--r-- 1 Gleb None 2.4M Jun 14 22:10 test.pkl
-rw-r--r-- 1 Gleb None 128K Jun 14 22:10 dv.pkl


In [38]:
file_size_bytes = os.path.getsize(output_file)
file_size_mb = file_size_bytes / (1024 * 1024)

print(f"The size of the output file is: {file_size_mb:.2f} MB")

The size of the output file is: 65.46 MB



## Q3. Creating the scoring script

Now let's turn the notebook into a script. 

Which command do you need to execute for that?

---

##### **A:** `jupyter nbconvert ./homework-4.ipynb --to script`


In [47]:
! jupyter nbconvert ./homework-4.ipynb --to script

[NbConvertApp] Converting notebook ./homework-4.ipynb to script
[NbConvertApp] Writing 7422 bytes to homework-4.py


## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter
notebook.

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

---

##### **A:** *'sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c'*

In [48]:
!pip freeze | grep scikit-learn

scikit-learn==1.5.2


In [49]:
!python -V

Python 3.12.4


In [55]:
#! pipenv install scikit-learn==1.5.0 pandas pyarrow

In [56]:
# Successfully created virtual environment!
# Virtualenv location: C:\Users\Gleb\.virtualenvs\04-deployment-eiJutVa_
# Creating a Pipfile for this project...
# Pipfile.lock not found, creating...
# Locking  dependencies...
# Locking  dependencies...
# Updated Pipfile.lock 
# (702ad05de9bc9de99a4807c8dde1686f31e0041d7b5f6f6b74861195a52110f5)!
# Upgrading scikit-learn==1.5.0 in  dependencies.

In [120]:
!pipenv run pip freeze | grep scikit-learn

scikit-learn==1.5.0


In [54]:
! ls -hAlt .

total 465K
-rw-r--r-- 1 Gleb None 324K Jun 14 23:38 .jupyter_ystore.db
-rw-r--r-- 1 Gleb None  41K Jun 14 23:38 homework-4.ipynb
-rw-r--r-- 1 Gleb None  13K Jun 14 23:36 Pipfile.lock
-rw-r--r-- 1 Gleb None  164 Jun 14 23:36 Pipfile
-rwxr-xr-x 1 Gleb None 7.8K Jun 14 22:43 run_predict.py
drwxr-xr-x 1 Gleb None    0 Jun 14 22:21 homework
drwxr-xr-x 1 Gleb None    0 Jun 14 21:44 .ipynb_checkpoints
-rw-r--r-- 1 Gleb None 4.6K Jun 14 21:42 homework.md


In [123]:
Code(filename='./Pipfile')

In [87]:
! head ./Pipfile.lock

{
    "_meta": {
        "hash": {
            "sha256": "d2754ce48be28d727735e2ba7e5ebbd2aad511f6c92d6c205ae58c4698123e14"
        },
        "pipfile-spec": 6,
        "requires": {
            "python_version": "3.12"
        },
        "sources": [


In [60]:
with open('./Pipfile.lock', 'r') as f:
    dict_ = json.load(f)

In [64]:
JSON(dict_)

<IPython.core.display.JSON object>

In [66]:
dict_['default']['scikit-learn']['hashes'][:5]

['sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c',
 'sha256:118a8d229a41158c9f90093e46b3737120a165181a1b58c03461447aa4657415',
 'sha256:12e40ac48555e6b551f0a0a5743cc94cc5a765c9513fe708e01f0aa001da2801',
 'sha256:174beb56e3e881c90424e21f576fa69c4ffcf5174632a79ab4461c4c960315ac',
 'sha256:1b94d6440603752b27842eda97f6395f570941857456c606eb1d638efdb38184']

## Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two 
parameters: year and month.

Run the script for April 2023. 

What's the mean predicted duration? 

* ~~7.29~~
* **14.29**
* ~~21.29~~
* ~~28.29~~

Hint: just add a print statement to your script.
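
A minimal sketch of the CLI parsing, assuming `argparse` and option names that match the `--year`, `--month`, and `--model-path` flags in the help output shown for `run_predict.py`. The actual script is not reproduced in this document, so the defaults and the `parse_args` helper are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the batch-scoring options (defaults are illustrative assumptions)."""
    parser = argparse.ArgumentParser(description="Batch-score taxi ride durations")
    parser.add_argument("--year", type=int, default=2023, help="Year to process")
    parser.add_argument("--month", type=int, default=3, help="Month to process")
    parser.add_argument("--model-path", default="model.bin", help="Path to the model file")
    return parser.parse_args(argv)


# Passing an explicit list makes the example independent of sys.argv.
args = parse_args(["--year", "2023", "--month", "4"])

# argparse exposes --model-path as args.model_path.
print(f"Scoring {args.year:04d}-{args.month:02d} with {args.model_path}")
```

Using `type=int` lets the script reuse the `{year:04d}` and `{month:02d}` format specifiers from the notebook without extra conversion.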


In [71]:
! python ./run_predict.py -h

usage: run_predict.py [-h] [--year YEAR] [--month MONTH]
                      [--model-path MODEL_PATH]

options:
  -h, --help            show this help message and exit
  --year YEAR           Year to process
  --month MONTH         Month to process
  --model-path MODEL_PATH
                        Path to the model file


In [75]:
! pipenv run python ./run_predict.py --year 2023 --month 4 --model-path ./homework/model.bin

Mean predicted duration: 14.29


## Q6. Docker container 

Finally, we'll package the script in a Docker container. 
For that, you'll need to use a base image that we prepared. 

This is the Dockerfile the base image was built from:

```dockerfile
FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

Note: you don't need to build this image yourself. We have already done it.

It is pushed to [`agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim`](https://hub.docker.com/layers/agrigorev/zoomcamp-model/mlops-2024-3.10.13-slim/images/sha256-f54535b73a8c3ef91967d5588de57d4e251b22addcbbfb6e71304a91c1c7027f?context=repo),
which you need to use as your base image.

That is, your Dockerfile should start with:

```dockerfile
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer
and a model. You will need to use them.

Important: don't copy the model to the Docker image. You will need
to use the pickle file already in the image. 

Now run the script with Docker. What's the mean predicted duration
for May 2023? 

* **0.19**
* ~~7.24~~
* ~~14.24~~
* ~~21.19~~


In [122]:
Code(filename='Dockerfile')

In [121]:
Code(filename='requirements.txt')

In [None]:
# >docker build -t taxi-prediction .

# [+] Building 66.7s (10/10) FINISHED                                                                docker:desktop-linux
#  => [internal] load build definition from Dockerfile                                                               0.0s
#  => => transferring dockerfile: 240B                                                                               0.0s
#  => [internal] load metadata for docker.io/agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim                        1.3s
#  => [internal] load .dockerignore                                                                                  0.0s
#  => => transferring context: 2B                                                                                    0.0s
#  => [1/5] FROM docker.io/agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim@sha256:f54535b73a8c3ef91967d5588de57d4e  0.1s
#  => => resolve docker.io/agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim@sha256:f54535b73a8c3ef91967d5588de57d4e  0.0s
#  => [internal] load build context                                                                                  0.0s
#  => => transferring context: 71B                                                                                   0.0s
#  => CACHED [2/5] WORKDIR /app                                                                                      0.0s
#  => CACHED [3/5] COPY requirements.txt .                                                                           0.0s
#  => [4/5] RUN pip install -r requirements.txt                                                                     34.2s
#  => [5/5] COPY run_predict.py .                                                                                    0.2s
#  => exporting to image                                                                                            30.5s
#  => => exporting layers                                                                                           24.6s
#  => => exporting manifest sha256:5f85273341e79eca82c84b2d24ad7ee5fbaa954307a54f51b66ba87b34f07f62                  0.0s
#  => => exporting config sha256:5172bb0d29706b78fce24f3d534908da8e2ba9145f0966e38f4b5974174424f1                    0.0s
#  => => exporting attestation manifest sha256:571b87bf14f058232ccea886dca2178e5acbea535493dce2dd776d530ca1d1e8      0.1s
#  => => exporting manifest list sha256:a4ed179fec620e0be68c8c025196e38bd1ecde33c05f4c8b4a3e6c8ed6745e14             0.0s
#  => => naming to docker.io/library/taxi-prediction:latest                                                          0.0s
#  => => unpacking to docker.io/library/taxi-prediction:latest                                                       5.8s

In [93]:
! docker run taxi-prediction --year 2023 --month 5 --model-path /app/model.bin

Mean predicted duration: 0.19


## Bonus: upload the result to the cloud (Not graded)

Just printing the mean duration inside the Docker container 
isn't very practical. Typically, after creating the output 
file, we upload it to cloud storage.

Modify your code to upload the parquet file to S3/GCS/etc.
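
For S3, one option is the `upload_file` method of a `boto3` S3 client. The sketch below rests on several assumptions: the bucket name and key layout are hypothetical, credentials are expected to come from the environment, and `upload_result` is an illustrative helper rather than part of the homework code. The client is injectable so the function can be exercised without AWS access.

```python
def upload_result(local_path: str, bucket: str, key: str, s3_client=None) -> str:
    """Upload a local file to S3 and return its s3:// URI."""
    if s3_client is None:
        # Imported lazily so this module loads even when boto3 is not installed.
        import boto3

        s3_client = boto3.client("s3")

    # upload_file(Filename, Bucket, Key) copies the local file to the bucket.
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

A call such as `upload_result(output_file, "my-bucket", "taxi/yellow_2023-05.parquet")` would run right after `df_result.to_parquet(...)`; both the bucket and the key here are placeholders.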

## Bonus: Use an orchestrator for batch inference

Here we didn't use any orchestration. In practice we usually do.

* Split the code into logical code blocks
* Use a workflow orchestrator for the code execution
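
As a starting point, the scoring script can be split into three steps (load, score, save) that an orchestrator could later wrap as tasks. The sketch below is framework-agnostic plain Python: `BatchJob` and the step signatures are hypothetical names chosen for illustration, and the toy lambdas stand in for the real pandas and model code.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class BatchJob:
    """Three independent steps, each a candidate for an orchestrator task."""

    load: Callable[[int, int], Any]       # (year, month) -> input data
    score: Callable[[Any], Any]           # input data -> predictions
    save: Callable[[Any, int, int], str]  # (predictions, year, month) -> output location

    def run(self, year: int, month: int) -> str:
        data = self.load(year, month)
        predictions = self.score(data)
        return self.save(predictions, year, month)


# Toy stand-ins so the sketch runs without pandas or a model file.
job = BatchJob(
    load=lambda year, month: [1.0, 2.0, 3.0],
    score=lambda rows: [2 * r for r in rows],
    save=lambda preds, year, month: f"output/{year:04d}-{month:02d}.parquet",
)
print(job.run(2023, 5))  # output/2023-05.parquet
```

Keeping each step free of side effects on the others makes it easier to retry or cache a single step once an orchestrator is introduced.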

## Publishing the image to Docker Hub

This is how we published the image to Docker Hub:

```bash
docker build -t mlops-zoomcamp-model:2024-3.10.13-slim .
docker tag mlops-zoomcamp-model:2024-3.10.13-slim agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

docker login --username USERNAME
docker push agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim
```

This is just for your reference. You don't need to do it.


## Submit the results

* Submit your results here: https://courses.datatalks.club/mlops-zoomcamp-2025/homework/hw4
* It's possible that your answers won't match exactly. If that's the case, select the closest one.