In [16]:
!pip freeze | grep scikit-learn
!pip show scikit-learn

scikit-learn @ file:///croot/scikit-learn_1684954695550/work
Name: scikit-learn
Version: 1.2.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: /home/pxmopsadmin/anaconda3/lib/python3.11/site-packages
Requires: joblib, numpy, scipy, threadpoolctl
Required-by: imbalanced-learn, mlflow


In [2]:
import pickle
import pandas as pd

In [3]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [4]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [6]:
df = read_data('./data/yellow_tripdata_2022-02.parquet')

In [7]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

#### Homework

In this homework, we'll deploy the ride duration model in batch mode. Like in homework 1, we'll use the Yellow Taxi Trip Records dataset. 

You'll find the starter code in the [homework](homework) directory.


#### Q1. Notebook

We'll start with the same notebook we ended up with in homework 1.
We cleaned it a little bit and kept only the scoring part. You can find the initial notebook [here](homework/starter.ipynb).

Run this notebook for the February 2022 data.

What's the standard deviation of the predicted duration for this dataset?

* 5.28
* 10.28
* 15.28
* 20.28

In [8]:
y_pred.std()

5.28140357655334
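
One detail worth keeping in mind when comparing against the answer options: `y_pred` is a NumPy array, and NumPy's `std()` defaults to the population standard deviation (`ddof=0`), whereas pandas' `Series.std()` defaults to the sample version (`ddof=1`). A minimal sketch on a made-up four-element array (the values are illustrative, not taken from the taxi data):

```python
import numpy as np

# Toy predictions, purely illustrative
y = np.array([10.0, 12.0, 14.0, 16.0])

pop_std = y.std()           # NumPy default: ddof=0 (population)
sample_std = y.std(ddof=1)  # what pandas' Series.std() would report by default

print(pop_std, sample_std)
```

With millions of rows the two values are nearly identical, so the choice does not affect which option is correct here.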

#### Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output. 

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```
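
To see what this expression produces, here is a small sketch on a toy dataframe. The index values `0, 2, 5` are made up to mimic the gaps left after the duration filter drops rows; the `ride_id` is built from whatever index survives.

```python
import pandas as pd

year = 2022
month = 2

# Toy frame with a non-contiguous index, as happens after filtering rows
df_toy = pd.DataFrame({'duration': [5.0, 12.5, 30.0]}, index=[0, 2, 5])

# Same expression as above: a string prefix concatenated with the stringified index
df_toy['ride_id'] = f'{year:04d}/{month:02d}_' + df_toy.index.astype('str')

print(df_toy['ride_id'].tolist())
```

Each id has the form `YYYY/MM_<original row index>`, so ids stay stable even though filtered rows are missing.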

Next, write the ride id and the predictions to a dataframe with results. 

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What's the size of the output file?

* 28M
* 38M
* 48M
* 58M

__Note:__ Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the
dtypes of the columns and use pyarrow, not fastparquet. 
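
If you prefer to check the size from Python rather than the shell, the following is a rough sketch. `human_size` is a hypothetical helper, not part of the homework code, and it formats the apparent file size; `du -h` reports disk usage, so the two can differ slightly.

```python
import os
import tempfile

def human_size(num_bytes):
    """Format a byte count with binary units (hypothetical helper)."""
    size = float(num_bytes)
    for unit in ('B', 'K', 'M'):
        if size < 1024:
            return f'{size:.1f}{unit}'
        size /= 1024
    return f'{size:.1f}G'

# Demo on a throwaway 2048-byte file; for the homework you would pass
# the path of the parquet file you wrote instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\0' * 2048)
    tmp_path = tmp.name

demo_size = human_size(os.path.getsize(tmp_path))
os.remove(tmp_path)
print(demo_size)
```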


In [12]:
year = 2022
month = 2
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
df_result = pd.DataFrame({'ride_id': df['ride_id'], 'predictions': y_pred})

df_result.to_parquet(
    'output/predictions_2022-02.parquet',
    engine='pyarrow',
    compression=None,
    index=False
)

!du -h ./output/predictions_2022-02.parquet

58M	./output/predictions_2022-02.parquet


#### Q3. Creating the scoring script

Now let's turn the notebook into a script. 

Which command do you need to execute for that?

In [15]:
!jupyter nbconvert --to script starter.ipynb

[NbConvertApp] Converting notebook starter.ipynb to script
[NbConvertApp] Writing 4841 bytes to starter.py


#### Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version:
it should be `scikit-learn==1.2.2`. 

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?
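
`Pipfile.lock` is a JSON file, so the hash can also be read programmatically. The sketch below assumes the usual layout, where regular dependencies live under the `default` key and each package entry carries a `hashes` list. The sample dictionary and its hash strings are placeholders for illustration, not real values.

```python
import json

def first_hash(lock, package, section='default'):
    """Return the first hash listed for a package in a parsed Pipfile.lock."""
    return lock[section][package]['hashes'][0]

# Placeholder structure standing in for json.load(open('Pipfile.lock'))
sample_lock = {
    '_meta': {},
    'default': {
        'scikit-learn': {
            'hashes': ['sha256:aaaa', 'sha256:bbbb'],  # made-up hashes
            'version': '==1.2.2',
        }
    },
    'develop': {},
}

print(first_hash(sample_lock, 'scikit-learn'))

# With a real lock file:
# with open('Pipfile.lock') as f_in:
#     print(first_hash(json.load(f_in), 'scikit-learn'))
```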

In [32]:
!pipenv install scikit-learn==1.2.2 pandas pyarrow
# sha256:065e9673e24e0dc5113e2dd2b4ca30c9d8aa2fa90f4c0597241c93b63130d233

Installing scikit-learn==1.2.2...
Resolving scikit-learn==1.2.2...
✔ Installation Succeeded
Installing pandas...
Resolving pandas...
✔ Installation Succeeded
Installing pyarrow...
Resolving pyarrow...
Added pyarrow to Pipfile's [packages] ...
✔ Installation Succeeded
Pipfile.lock (6fc9fa) out of date, updating to (ac6219)...
Locking [packages] dependencies...
Building requirements...
Resolving dependencies...
✔ Success! Locking...

#### Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two 
parameters: year and month.

Run the script for March 2022. 

What's the mean predicted duration? 

* 7.76
* 12.76
* 17.76
* 22.76

Hint: just add a print statement to your script.
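
One possible shape for the parametrized script is sketched below. The function names `parse_args` and `build_paths` are hypothetical, and the path templates simply follow the file names used earlier in this notebook; the actual `starter.py` may be organized differently.

```python
import sys

def parse_args(argv):
    """Read year and month from a sys.argv-style list."""
    year = int(argv[1])
    month = int(argv[2])
    return year, month

def build_paths(year, month):
    """Derive input and output file names from year and month."""
    input_file = f'./data/yellow_tripdata_{year:04d}-{month:02d}.parquet'
    output_file = f'output/predictions_{year:04d}-{month:02d}.parquet'
    return input_file, output_file

# In the real script this would be parse_args(sys.argv);
# an explicit list is used here so the sketch runs on its own.
year, month = parse_args(['starter.py', '2022', '3'])
input_file, output_file = build_paths(year, month)
print(input_file, output_file)
```

The `:02d` format keeps single-digit months zero-padded, which matches the naming of the dataset files.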

In [22]:
# 1. Move all the code into a run() function
# 2. Call it under: if __name__ == '__main__': run()
# 3. import sys
# 4. Replace the hard-coded year and month with sys.argv[1] and sys.argv[2]
!python starter.py 2022 3

The mean predicted duration is 12.758556818790902.


#### Q6. Docker container 

Finally, we'll package the script in a Docker container. 
For that, you'll need to use a base image that we prepared. 

This is what it looks like:

```
FROM python:3.10.0-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

(see [`homework/Dockerfile`](homework/Dockerfile))

We pushed it to [`svizor/zoomcamp-model:mlops-3.10.0-slim`](https://hub.docker.com/layers/svizor/zoomcamp-model/mlops-3.10.0-slim/images/sha256-595bf690875f5b9075550b61c609be10f05e6915609ef4ea4ce9797116c99eff?context=repo),
which you should use as your base image.

That is, this is how your Dockerfile should start:

```docker
FROM svizor/zoomcamp-model:mlops-3.10.0-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer
and a model. You will need to use them.

Important: don't copy the model to the docker image. You will need
to use the pickle file already in the image. 

Now run the script with docker. What's the mean predicted duration
for April 2022? 


* 7.92
* 12.83
* 17.92
* 22.83


In [35]:
## Dockerfile (matches the 8 build steps logged below)
# FROM svizor/zoomcamp-model:mlops-3.10.0-slim
#
# RUN pip install -U pip
# RUN pip install pipenv
#
# WORKDIR /app
# COPY ["Pipfile", "Pipfile.lock", "starter.py", "./"]
#
# RUN pipenv install --system --deploy
#
# # model.bin is already in the base image, so it is not copied here
# RUN mkdir -p /app/output
#
# # Set the entrypoint to run the prediction script
# ENTRYPOINT ["python", "starter.py"]
!docker build -t ride-duration-predictor .

DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  234.8MB
Step 1/8 : FROM svizor/zoomcamp-model:mlops-3.10.0-slim
 ---> 9c46916c0687
Step 2/8 : RUN pip install -U pip
 ---> Using cache
 ---> 794281cc0659
Step 3/8 : RUN pip install pipenv
 ---> Using cache
 ---> ee2f90400090
Step 4/8 : WORKDIR /app
 ---> Using cache
 ---> 5cf947c7e49f
Step 5/8 : COPY ["Pipfile", "Pipfile.lock", "starter.py", "./"]
 ---> Using cache
 ---> c6f21e5b6b8b
Step 6/8 : RUN pipenv install --system --deploy
 ---> Using cache
 ---> ee2b586e5d01
Step 7/8 : RUN mkdir -p /app/output
 ---> Running in e2aeab9295a0
Removing intermediate container e2aeab9295a0
 ---> df97c4bde005
Step 8/8 : ENTRYPOINT ["python", "starter.py"]
 ---> Running in 26e174887f52
Removing intermediate container 26e174887f52
 ---> 130a4ff97950
Successfull

In [36]:
!docker run --rm ride-duration-predictor 2022 4

The mean predicted duration is 12.827242870079969.
