In [1]:
!pip freeze | grep scikit-learn

scikit-learn==1.5.0


In [2]:
!python -V

Python 3.9.19


In [3]:
import os
import pickle
import pandas as pd

In [4]:
with open("model.bin", "rb") as f_in:
    dv, model = pickle.load(f_in)

In [5]:
categorical = ["PULocationID", "DOLocationID"]

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df["duration"] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df["duration"] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype("int").astype("str")
    
    return df

In [6]:
year = 2023
month = 3
taxi_type = "yellow"
input_file = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"
output_file = f"output/{taxi_type}/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"
os.makedirs(os.path.split(output_file)[0], exist_ok=True)

In [7]:
df = read_data(input_file)

In [8]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

## Q1. Notebook

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?


In [9]:
print(f"std: {y_pred.std():.2f}")

std: 6.25


**Solution:** `6.24` (the closest of the offered answer options to the computed 6.25)
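When comparing a computed standard deviation against answer options, it helps to know which definition is in use. NumPy's `ndarray.std()` defaults to the population form (`ddof=0`), while pandas' `Series.std()` defaults to the sample form (`ddof=1`). The sketch below uses made-up values (`toy_pred` is not the real `y_pred`) to show the difference; on millions of rows the gap becomes negligible.

```python
import numpy as np

# Made-up values standing in for predictions; not the real y_pred.
toy_pred = np.array([10.0, 12.0, 14.0, 16.0])

population_std = toy_pred.std()    # ddof=0, NumPy's default
sample_std = toy_pred.std(ddof=1)  # ddof=1, the default of pandas Series.std()

print(f"population std: {population_std:.4f}")
print(f"sample std:     {sample_std:.4f}")
```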

## Q2. Preparing the output

In [10]:
# Creating an artificial `ride_id` column
df["ride_id"] = f"{year:04d}/{month:02d}_" + df.index.astype("str")

In [11]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'Airport_fee', 'duration',
       'ride_id'],
      dtype='object')

In [12]:
df_result = pd.DataFrame()
df_result["predicted_duration"] = y_pred
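# df's index has gaps after filtering, so copy the values positionally
# instead of letting pandas align the Series on index labels.
df_result["ride_id"] = df["ride_id"].to_numpy()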
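A point worth checking in this step: pandas aligns on index labels when a Series is assigned as a column, and `read_data` filters rows without resetting the index. The following is a minimal sketch on a toy frame (all names and values here are made up for illustration) that contrasts label-aligned assignment with positional assignment via `.to_numpy()`.

```python
import numpy as np
import pandas as pd

# Toy frame; filtering leaves a non-contiguous index (0, 2, 3).
toy = pd.DataFrame({"ride_id": ["a", "b", "c", "d"], "duration": [5, 90, 7, 9]})
toy = toy[toy.duration <= 60].copy()
toy_pred = np.array([1.0, 2.0, 3.0])  # one made-up prediction per remaining row

# Assigning the Series aligns on index labels: the result index is 0..2,
# so label 1 has no match (missing value) and label 3 ("d") is dropped.
aligned = pd.DataFrame()
aligned["predicted_duration"] = toy_pred
aligned["ride_id"] = toy["ride_id"]

# Assigning the underlying array keeps the row order instead.
positional = pd.DataFrame()
positional["predicted_duration"] = toy_pred
positional["ride_id"] = toy["ride_id"].to_numpy()

print(aligned)
print(positional)
```

Calling `reset_index(drop=True)` on the filtered frame would be another way to keep labels and positions in sync.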

In [13]:
# Writing the results dataframe (ride id and predictions) to a Parquet file
df_result.to_parquet(
    output_file,
    engine="pyarrow",
    compression=None,
    index=False
)
print(f"Saved: {output_file}")

Saved: output/yellow/yellow_tripdata_2023-03.parquet


In [14]:
!du -sh $output_file

65M	output/yellow/yellow_tripdata_2023-03.parquet


**Solution:** `66M` (the closest of the offered answer options to the 65M reported by `du`)

## Q3. Creating the scoring script

Which command converts the Jupyter notebook into a script?

**Solution:** `jupyter nbconvert --to script homework.ipynb`

## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter notebook.

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?


Creating the Pipenv environment with the pinned Scikit-Learn version:
```bash
pipenv install scikit-learn==1.5.0 pandas pyarrow
```

**Solution:** `057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c`
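The hash can be read off by opening `Pipfile.lock` in an editor, but since the file is plain JSON it can also be looked up programmatically. The sketch below assumes the usual layout, where each package under `"default"` has a `"hashes"` list of `sha256:<digest>` strings. The helper name `first_hash` is hypothetical, and the demo runs against a tiny stand-in file with placeholder digests rather than a real lock file.

```python
import json
import os
import tempfile


def first_hash(lock_path, package, section="default"):
    """Return the first hash listed for `package` in a Pipfile.lock file."""
    with open(lock_path) as f_in:
        lock = json.load(f_in)
    # Each entry carries a "hashes" list of "sha256:<digest>" strings.
    return lock[section][package]["hashes"][0].removeprefix("sha256:")


# Stand-in lock content; the digests are placeholders, not real hashes.
sample_lock = {
    "default": {
        "scikit-learn": {
            "hashes": ["sha256:" + "0" * 64, "sha256:" + "1" * 64],
            "version": "==1.5.0",
        }
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "Pipfile.lock")
    with open(sample_path, "w") as f_out:
        json.dump(sample_lock, f_out)
    demo_hash = first_hash(sample_path, "scikit-learn")

print(demo_hash)
```

Against a real project, `first_hash("Pipfile.lock", "scikit-learn")` would be the equivalent call. Note that `str.removeprefix` needs Python 3.9 or newer.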

## Q5. Parametrize the script


Let's now make the script configurable via CLI. We'll create two 
parameters: year and month.

Run the script for April 2023. 

What's the mean predicted duration? 

* 7.29
* 14.29
* 21.29
* 28.29

Hint: just add a print statement to your script.

**Solution:** `14.29`

The code is in [homework.py](homework.py). The result above was produced with the following command:
```bash
py homework.py --year 2023 --month 4 --model model.bin --taxi_type yellow
```
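The contents of `homework.py` are not reproduced here, so the following is only a sketch of how the four flags from the command above could be parsed with `argparse`. The function name `parse_args`, the defaults, and the month range check are assumptions, not necessarily what the actual script does.

```python
import argparse


def parse_args(argv=None):
    """Parse the CLI flags used in the command above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Score taxi trips for one month.")
    parser.add_argument("--year", type=int, required=True)
    # Restricting the month to 1..12 is an assumption added for safety.
    parser.add_argument("--month", type=int, required=True, choices=range(1, 13))
    parser.add_argument("--model", default="model.bin")
    parser.add_argument("--taxi_type", default="yellow")
    return parser.parse_args(argv)


# Passing a list instead of reading sys.argv makes the parser easy to try out.
args = parse_args(
    ["--year", "2023", "--month", "4", "--model", "model.bin", "--taxi_type", "yellow"]
)
print(args.year, args.month, args.model, args.taxi_type)
```

In a script, calling `parse_args()` with no argument reads the flags from the command line, and the parsed `year` and `month` can then feed the same f-strings that build `input_file` and `output_file` in the notebook.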

## Q6. Docker container 


Finally, we'll package the script in a Docker container.
For that, you'll need to use a base image that we prepared. 

This is the Dockerfile used to build that image:

```dockerfile
FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```
*Note*: you don't need to run it. We have already done it.

It is pushed to [`agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim`](https://hub.docker.com/layers/agrigorev/zoomcamp-model/mlops-2024-3.10.13-slim/images/sha256-f54535b73a8c3ef91967d5588de57d4e251b22addcbbfb6e71304a91c1c7027f?context=repo),
which you need to use as your base image.

**Solution:** `0.19` (the mean predicted duration printed by the May 2023 container run shown below)

Relevant files:
- [Dockerfile](Dockerfile)

Building the image:
```bash
docker build -t hw4-deployment .
```

Running the container for May 2023:
```bash
docker run -it --rm hw4-deployment --year 2023 --month 5 --model model.bin --taxi_type yellow
```