# MLOps Zoomcamp 2023 - Session #4

Author: José Victor

Starter code is in the [homework](https://github.com/DataTalksClub/mlops-zoomcamp/tree/main/cohorts/2023/04-deployment/homework) directory.

In [1]:
import pickle
import numpy as np
import pandas as pd

## Q1 Notebook

We'll start with the same notebook we ended up with in homework 1. We cleaned it up a little and kept only the scoring part. You can find the initial notebook [here](https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2023/04-deployment/homework/starter.ipynb).

Run this notebook for the February 2022 data.

What's the standard deviation of the predicted duration for this dataset?

* (X) 5.28
* ( ) 10.28
* ( ) 15.28
* ( ) 20.28

In [2]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [3]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [4]:
df = read_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet')

In [5]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

In [7]:
print(f"Predictions standard deviation: {np.std(y_pred)}")

Predictions standard deviation: 5.28140357655334
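
As a side note, `np.std` computes the population standard deviation by default (`ddof=0`). The sketch below uses the standard library and made-up durations, not the real predictions, to show how that differs from the sample standard deviation:

```python
import statistics

# Hypothetical predicted durations in minutes (illustrative values only).
sample_predictions = [10.0, 12.0, 14.0, 16.0]

# Population standard deviation: divides by n, like np.std with its default ddof=0.
population_std = statistics.pstdev(sample_predictions)

# Sample standard deviation: divides by n - 1, like np.std(..., ddof=1).
sample_std = statistics.stdev(sample_predictions)

print(f"population: {population_std:.4f}, sample: {sample_std:.4f}")
```

With a dataset as large as a month of taxi trips, the two values are practically the same, so the choice should not change which option is selected.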


## Q2 Preparing the output

Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f"{year:04d}/{month:02d}" + df.index.astype("str")
```

Next, write the ride id and the predictions to a dataframe with results.

Save it as parquet:

```python
df_result.to_parquet(output_file, engine="pyarrow", compression=None, index=False)
```

What's the size of the output file?

* ( ) 28M
* ( ) 38M
* ( ) 48M
* (X) 58M

$\textbf{Note}$: Make sure you use the snippet above for saving the file. The file should contain only these two columns. For this question, don't change the dtypes of the columns, and use pyarrow, not fastparquet.

In [8]:
year = 2022
month = 2

In [9]:
df["ride_id"] = f"{year:04d}/{month:02d}" + df.index.astype("str")

In [10]:
df["predictions"] = y_pred.copy()

In [11]:
df_result = df[["ride_id", "predictions"]].copy()

In [12]:
output_file = "predictions.parquet"

In [13]:
df_result.to_parquet(output_file, engine="pyarrow", compression=None, index=False)

In [14]:
!dir

 Volume in drive J is NVME Kingston
 Volume Serial Number is 4C3B-EF96

 Directory of j:\Coding\mlops-zoomcamp2023\04_deployment

18/06/2023  16:05    <DIR>          .
18/06/2023  16:05    <DIR>          ..
18/06/2023  15:46                77 Dockerfile
16/06/2023  23:09             5.870 homework04.ipynb
18/06/2023  15:47            17.369 model.bin
18/06/2023  16:14        57.092.901 predictions.parquet
18/06/2023  16:05             3.196 scoring.ipynb
18/06/2023  15:47             2.198 starter.ipynb
               6 File(s)     57.121.611 bytes
               2 Dir(s)   584.763.977.728 bytes free
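
The `!dir` listing is specific to Windows. As a portable alternative, the file size can be read from Python directly. This is a minimal sketch; the helper name `file_size_mb` is a hypothetical choice, not part of the homework:

```python
import os


def file_size_mb(path, base=1000):
    """Return the size of `path` in megabytes (base=1000) or mebibytes (base=1024)."""
    return os.path.getsize(path) / (base * base)


# Only report the size if the predictions file from the cells above exists.
if os.path.exists("predictions.parquet"):
    print(f"predictions.parquet: {file_size_mb('predictions.parquet'):.1f} MB")
```

For the 57,092,901 bytes listed above, this gives about 57.1 MB, which is closest to the 58M option.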


## Q3 Creating the scoring script

Now let's turn the notebook into a script.

Which command do you need to execute for that?

In [16]:
!jupyter nbconvert --to script scoring.ipynb 

[NbConvertApp] Converting notebook scoring.ipynb to script
[NbConvertApp] Writing 1238 bytes to scoring.py


## Q4 Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: check the starter notebook for details.

After installing the libraries, pipenv creates two files: `Pipfile` and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

Answer: $\textbf{065e9673e24e0dc5113e2dd2b4ca30c9d8aa2fa90f4c0597241c93b63130d233}$
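
Instead of searching the lock file by eye, the hash can be read programmatically, since `Pipfile.lock` is a JSON file. The sketch below assumes the package sits in the `default` section; the helper name `first_hash` is hypothetical:

```python
import json


def first_hash(lock_path, package, section="default"):
    """Return the first hash listed for `package` in a Pipfile.lock file."""
    with open(lock_path) as f_in:
        lock = json.load(f_in)
    hashes = lock[section][package]["hashes"]
    # Entries look like "sha256:<hex digest>"; drop the algorithm prefix.
    return hashes[0].split(":", 1)[1]
```

Calling `first_hash("Pipfile.lock", "scikit-learn")` from the project directory should return the digest without the `sha256:` prefix.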

## Q5 Parametrize the script

Let's now make the script configurable via CLI. We'll create two parameters: year and month.

Run the script for March 2022.

What's the mean predicted duration?

* ( ) 7.76
* (X) 12.76
* ( ) 17.76
* ( ) 22.76

Hint: just add a print statement to your script.

In [17]:
!python scoring.py 2022 03

Duration mean predictions: 12.758556818790902
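
The converted `scoring.py` is not shown in this notebook. One way to wire up the two positional parameters is sketched below. The URL pattern comes from the `read_data` call earlier; the output file name and the helpers `parse_args` and `build_paths` are assumptions made for illustration:

```python
# URL pattern taken from the notebook cell above; the output name is a hypothetical choice.
INPUT_URL = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_{year:04d}-{month:02d}.parquet"
)
OUTPUT_FILE = "predictions_{year:04d}-{month:02d}.parquet"


def parse_args(argv):
    """Parse `year month` from a list of CLI arguments (script name excluded)."""
    if len(argv) != 2:
        raise SystemExit("usage: python scoring.py <year> <month>")
    year, month = int(argv[0]), int(argv[1])
    if not 1 <= month <= 12:
        raise SystemExit(f"month must be between 1 and 12, got {month}")
    return year, month


def build_paths(year, month):
    """Return the input URL and output file name for the given period."""
    return (
        INPUT_URL.format(year=year, month=month),
        OUTPUT_FILE.format(year=year, month=month),
    )


# Example with the arguments used in the cell above.
year, month = parse_args(["2022", "03"])
input_url, output_file = build_paths(year, month)
print(input_url)
print(output_file)
```

In the real script, `parse_args(sys.argv[1:])` would replace the hard-coded list, followed by the scoring steps from the notebook and a `print` of `y_pred.mean()`.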


## Q6 Docker container

Finally, we'll package the script in a Docker container. For that, you'll need to use a base image that we prepared.

This is what it looks like:

```
FROM python:3.10.0-slim

WORKDIR /app

COPY [ "model2.bin", "model.bin" ]
```
(see [`homework/Dockerfile`](https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2023/04-deployment/homework/Dockerfile))

We pushed it to `svizor/zoomcamp-model:mlops-3.10.0-slim`, which you should use as your base image. That is, this is how your Dockerfile should start:

```
FROM svizor/zoomcamp-model:mlops-3.10.0-slim

# do stuff here
```
This image already has a pickle file with a dictionary vectorizer and a model. You will need to use them.

Important: don't copy the model to the docker image. You will need to use the pickle file already in the image.

Now run the script with docker. What's the mean predicted duration for April 2022?

* ( ) 7.92
* (X) 12.83
* ( ) 17.92
* ( ) 22.83

In [18]:
!python scoring.py 2022 04

Duration mean predictions: 12.865128336784926
