In [1]:
!pip freeze | grep scikit-learn

scikit-learn @ file:///Users/runner/miniforge3/conda-bld/scikit-learn_1652391811680/work


In [2]:
import pickle
import pandas as pd

In [3]:
with open('model.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)

https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations


In [4]:
categorical = ['PUlocationID', 'DOlocationID']

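def read_data(filename):
    df = pd.read_parquet(filename)

    # Trip duration in minutes
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    # Keep only trips lasting between 1 and 60 minutes (inclusive)
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Encode missing location IDs as -1 and treat all IDs as strings
    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')

    return df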

In [5]:
year = 2021
month = 2

In [6]:
df = read_data(
    f"https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_{year:04d}-{month:02d}.parquet"
)

In [7]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = lr.predict(X_val)

## Q1. Notebook

What's the mean predicted duration for this dataset?

* 11.19
* 16.19
* 21.19
* 26.19

In [8]:
y_pred.mean()

16.191691679979066

## Q2. Preparing the output

What's the size of the output file?

* 9M
* 19M
* 29M
* 39M

In [9]:
df["ride_id"] = f"{year:04d}/{month:02d}_" + df.index.astype("str")

def save_results(df_result: pd.DataFrame, output_file: str):
    df_result.to_parquet(
        output_file,
        engine="pyarrow",
        compression=None,
        index=False,
    )

results = df[["ride_id"]].copy()
results["pred"] = y_pred

In [11]:
import os

print(results.head())
save_results(results, output_file="tmp")
print(os.path.getsize("tmp") // 1024 ** 2, "M")
os.remove("tmp")

     ride_id       pred
1  2021/02_1  14.539865
2  2021/02_2  13.740422
3  2021/02_3  15.593339
4  2021/02_4  15.188118
5  2021/02_5  13.817206
18 M


## Q3. Creating the scoring script

Now let's turn the notebook into a script. 

Which command do you need to execute for that?

In [28]:
!jupyter nbconvert --to script starter_olegtaratuhin.ipynb

[NbConvertApp] Converting notebook starter.ipynb to script
[NbConvertApp] Writing 3744 bytes to starter.py


## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version:
check the starter notebook for details. 

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

In [12]:
"08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b"

'08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b'
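
The hash can also be read programmatically instead of by eye. Below is a minimal sketch, assuming the lock file follows the usual pipenv layout in which the `default` section maps each package name to an entry with a `hashes` list. The helper name `first_hash` is hypothetical, and the sample data uses made-up placeholder hashes, not the real ones.

```python
import json


def first_hash(lock: dict, package: str) -> str:
    """Return the first hash listed for `package` in a parsed Pipfile.lock."""
    entry = lock["default"][package]
    first = entry["hashes"][0]
    # Entries are written as "sha256:<hex digest>"; drop the algorithm prefix.
    return first.split(":", 1)[-1]


# Illustrative in-memory lock data (placeholder hashes, not real values).
sample_lock = json.loads("""
{
  "default": {
    "scikit-learn": {
      "hashes": ["sha256:aaaa", "sha256:bbbb"],
      "version": "==0.0.0"
    }
  }
}
""")

print(first_hash(sample_lock, "scikit-learn"))
```

With a real project, the dictionary would come from `json.load` on the `Pipfile.lock` file in the project directory.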

## Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two 
parameters: year and month.

Run the script for March 2021. 

What's the mean predicted duration? 

* 11.29
* 16.29
* 21.29
* 26.29

In [13]:
!pipenv run python starter_olegtaratuhin.py --year 2021 --month 3

Launch script with: Namespace(year=2021, month=3)
Mean duration: 16.298821614015107
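
The converted script itself is not shown here. The `Namespace(...)` line in the output suggests it parses its arguments with `argparse`, so the sketch below shows one way the two parameters could be wired up. The function names `parse_args` and `input_file` are hypothetical. The URL pattern is the one used in the notebook above.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the --year and --month options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Score FHV trips for one month")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument(
        "--month", type=int, required=True, choices=range(1, 13), metavar="MONTH"
    )
    return parser.parse_args(argv)


def input_file(year: int, month: int) -> str:
    """Build the input URL with the same pattern as the notebook."""
    return (
        "https://nyc-tlc.s3.amazonaws.com/trip+data/"
        f"fhv_tripdata_{year:04d}-{month:02d}.parquet"
    )


# Passing a list makes the example runnable outside a shell;
# a real script would call parse_args() with no arguments.
args = parse_args(["--year", "2021", "--month", "3"])
print(f"Launch script with: {args}")
print(input_file(args.year, args.month))
```

Restricting `--month` to 1 through 12 makes the script fail early on a bad value instead of requesting a file that does not exist.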


## Q6. Docker container

Finally, we'll package the script in a Docker container. 
For that, you'll need to use a base image that we prepared. 

This is what it looks like:

```
FROM python:3.9.7-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

(see [`homework/Dockerfile`](homework/Dockerfile))

We pushed it to [`agrigorev/zoomcamp-model:mlops-3.9.7-slim`](https://hub.docker.com/layers/zoomcamp-model/agrigorev/zoomcamp-model/mlops-3.9.7-slim/images/sha256-7fac33c783cc6018356ce16a4b408f6c977b55a4df52bdb6c4d0215edf83af5d?context=explore),
which you should use as your base image.

That is, this is how your Dockerfile should start:

```docker
FROM agrigorev/zoomcamp-model:mlops-3.9.7-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer
and a model. You will need to use them.

Important: don't copy the model to the Docker image. You will need
to use the pickle file already in the image. 

Now run the script with Docker. What's the mean predicted duration
for April 2021? 


* 9.96
* 16.55
* 25.96
* 36.55

In [14]:
9.96

9.96