## Homework

In this homework, we'll deploy the ride duration model in batch mode. Like in homework 1, we'll use the Yellow Taxi Trip Records dataset. 

You'll find the starter code in the [homework](homework) directory.

In [1]:
!pip freeze | grep scikit-learn

scikit-learn @ file:///home/conda/feedstock_root/build_artifacts/scikit-learn_1685023709438/work


In [2]:
!conda list scikit-learn

# packages in environment at /home/zatoichi/anaconda3/envs/mlops:
#
# Name                    Version                   Build  Channel
scikit-learn              1.2.2            py39hc236052_2    conda-forge


In [3]:
import pickle
import pandas as pd

In [4]:
with open('./homework/model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [5]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

## Q1. Notebook

We'll start with the same notebook we ended up with in homework 1.
We cleaned it a little bit and kept only the scoring part. You can find the initial notebook [here](homework/starter.ipynb).

Run this notebook for the February 2022 data.

What's the standard deviation of the predicted duration for this dataset?

* **5.28**
* 10.28
* 15.28
* 20.28

In [6]:
year = 2022
month = 2

In [7]:
df = read_data(f'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{year:04}-{month:02}.parquet')

In [8]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

In [9]:
round(y_pred.std(),2)

5.28

## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output. 

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```

Next, write the ride id and the predictions to a dataframe with results. 

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What's the size of the output file?

* 28M
* 38M
* 48M
* **58M**

__Note:__ Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the
dtypes of the columns and use pyarrow, not fastparquet. 

In [10]:
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

In [11]:
df_result = df[['ride_id']].copy()
df_result['prediction'] = y_pred
output_file = f'./predictions/pred_yellow_tripdata_{year:04}-{month:02}.parquet'

In [12]:
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)

In [13]:
import os

file_size = os.path.getsize(f'predictions/pred_yellow_tripdata_{year:04}-{month:02}.parquet')
print("File Size is :", round(file_size/(1024*1024),2),"MB")

File Size is : 57.22 MB
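The 57.22 MB printed above is closest to the 58M option. Tools that round sizes up to the next whole unit (GNU `ls -lh` does this) would report the same file as 58M. The sketch below shows the conversion. The helper `bytes_to_mib` and the byte count are illustrative assumptions, not the measured size of the real output file.

```python
import math
import os
import tempfile

MIB = 1024 * 1024  # bytes per mebibyte


def bytes_to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


# Illustrative byte count (an assumption chosen to land near the value above).
example_bytes = 60_000_000
exact = round(bytes_to_mib(example_bytes), 2)
rounded_up = math.ceil(bytes_to_mib(example_bytes))
print(f"{exact} MiB, shown as {rounded_up}M when rounded up")

# os.path.getsize supplies the byte count for a real file; a small
# temporary file stands in for the parquet output here.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 2048)
size_on_disk = os.path.getsize(tmp.name)
os.remove(tmp.name)
```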


## Q3. Creating the scoring script

Now let's turn the notebook into a script. 

Which command do you need to execute for that?

In [14]:
# q3-score.ipynb was created to find the mean riding time for a given month and year (after 2022).

In [15]:
# To convert q3-score.ipynb into python executable file, the following command was used 
# jupyter nbconvert --to script <file-to-convert>.ipynb
!jupyter nbconvert --to script q3-score.ipynb

[NbConvertApp] Converting notebook q3-score.ipynb to script
[NbConvertApp] Writing 1711 bytes to q3-score.py


## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version:
it should be `scikit-learn==1.2.2`. 

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

- Open a terminal in the project folder and run the following commands.
    
     `pip install pipenv` <br>
     `pipenv shell` <br>
     `pipenv install -r requirements.txt` <br>
    
    ***This installs the libraries listed in requirements.txt.***

    ***Pipfile and Pipfile.lock are updated. Open Pipfile.lock, scroll to the first hash listed for scikit-learn, and copy it.***


***The first hash for the scikit-learn dependency in Pipfile.lock is:*** <br>
`065e9673e24e0dc5113e2dd2b4ca30c9d8aa2fa90f4c0597241c93b63130d233` <br>
OR <br>
`"sha256:065e9673e24e0dc5113e2dd2b4ca30c9d8aa2fa90f4c0597241c93b63130d233"`
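Instead of scrolling through the file, the hash can be read programmatically, because `Pipfile.lock` is plain JSON. This is a minimal sketch: the helper `first_hash` is hypothetical, and the sample content is trimmed to one entry for illustration (a real lock file lists many hashes per package).

```python
import json


def first_hash(lock_data, package, section="default"):
    """Return the first hash listed for `package` in a parsed Pipfile.lock."""
    return lock_data[section][package]["hashes"][0]


# Trimmed, illustrative lock content with a single hash.
sample_lock = json.loads("""
{
  "default": {
    "scikit-learn": {
      "hashes": [
        "sha256:065e9673e24e0dc5113e2dd2b4ca30c9d8aa2fa90f4c0597241c93b63130d233"
      ],
      "version": "==1.2.2"
    }
  }
}
""")

print(first_hash(sample_lock, "scikit-learn"))

# With the real file in the project folder:
# with open("Pipfile.lock") as f_in:
#     print(first_hash(json.load(f_in), "scikit-learn"))
```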

## Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two 
parameters: year and month.

Run the script for March 2022. 

What's the mean predicted duration? 

* 7.76
* **12.76**
* 17.76
* 22.76

Hint: just add a print statement to your script.

In [16]:
# q5-score.py is a copy of the q3-score.py with some modifications for parameterization and for use with docker
!python q5-score.py 2022 3  

Reading Data for the month:03 of the year:2022 to predict the mean riding time

Predicting...

The predicted mean riding time for the month:03 of the year:2022 is 12.76
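The contents of `q5-score.py` are not shown here, so the following is only a sketch of one way to accept `year` and `month` as positional CLI arguments with `argparse`. The function names `parse_args` and `input_url` are assumptions. The URL pattern is the one used in the notebook cell for Q1.

```python
import argparse

URL_TEMPLATE = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_{year:04d}-{month:02d}.parquet"
)


def parse_args(argv=None):
    """Parse `year` and `month`; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser(description="Score one month of Yellow Taxi trips.")
    parser.add_argument("year", type=int, help="four-digit year, e.g. 2022")
    parser.add_argument(
        "month", type=int, choices=range(1, 13), metavar="month",
        help="month number, 1-12",
    )
    return parser.parse_args(argv)


def input_url(year, month):
    """Build the source URL for the given month of trip data."""
    return URL_TEMPLATE.format(year=year, month=month)


# Equivalent to running: python q5-score.py 2022 3
args = parse_args(["2022", "3"])
print(input_url(args.year, args.month))
```

Inside the script itself, `parse_args()` would be called without arguments so that the values come from the command line.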


## Q6. Docker container 

Finally, we'll package the script in the docker container. 
For that, you'll need to use a base image that we prepared. 

This is what it looks like:

```
FROM python:3.10.0-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

(see [`homework/Dockerfile`](homework/Dockerfile))

We pushed it to [`svizor/zoomcamp-model:mlops-3.10.0-slim`](https://hub.docker.com/layers/svizor/zoomcamp-model/mlops-3.10.0-slim/images/sha256-595bf690875f5b9075550b61c609be10f05e6915609ef4ea4ce9797116c99eff?context=repo),
which you should use as your base image.

That is, this is how your Dockerfile should start:

```docker
FROM svizor/zoomcamp-model:mlops-3.10.0-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer
and a model. You will need to use them.

Important: don't copy the model to the docker image. You will need
to use the pickle file already in the image. 

Now run the script with docker. What's the mean predicted duration
for April 2022? 


* 7.92
* **12.83**
* 17.92
* 22.83

***To build the Docker image, use the following command.*** <br>
***Make sure the Dockerfile is in the same folder as the files and folders it refers to.***<br>
***The command is: <br> `docker build -t <<container_image_name>>:<<tag>> .`***

In [17]:
!docker build -t week4_deployment:latest .

[1A[1B[0G[?25l[+] Building 0.0s (0/1)                                                         
[?25h[1A[0G[?25l[+] Building 0.1s (12/12) FINISHED                                              
[34m => [internal] load .dockerignore                                          0.0s
[0m[34m => => transferring context: 2B                                            0.0s
[0m[34m => [internal] load build definition from Dockerfile                       0.0s
[0m[34m => => transferring dockerfile: 342B                                       0.0s
[0m[34m => [internal] load metadata for docker.io/svizor/zoomcamp-model:mlops-3.  0.0s
[0m[34m => [1/7] FROM docker.io/svizor/zoomcamp-model:mlops-3.10.0-slim           0.0s
[0m[34m => [internal] load build context                                          0.0s
[0m[34m => => transferring context: 93B                                           0.0s
[0m[34m => CACHED [2/7] RUN pip install -U pip                                    0.0s
[0

***To run the docker container with the parameterized inputs, run the following command:<br>***
***`docker run -it <<container_image_name>>:<<tag>>  <<input_parameter1>> <<input_parameter2>>`***
    

In [18]:
# Mean Ride Duration Prediction for April 2022
!docker run -it week4_deployment:latest 2022 4

Reading Data for the month:04 of the year:2022 to predict the mean riding time

Predicting...

The predicted mean riding time for the month:04 of the year:2022 is 12.83


## Bonus: upload the result to the cloud (Not graded)

Just printing the mean duration inside the docker image 
doesn't seem very practical. Typically, after creating the output 
file, we upload it to the cloud storage.

Modify your code to upload the parquet file to S3/GCS/etc.
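One possible shape for the S3 variant is sketched below. It assumes `boto3` as the client library (its S3 client provides `upload_file(Filename, Bucket, Key)`). The bucket name, the key layout, and the helper names are hypothetical. The client is passed in as an argument, so the function can be tried with a stand-in object before any credentials are configured.

```python
def output_key(year, month, prefix="predictions"):
    """Build the object key for one month's predictions (layout is an assumption)."""
    return f"{prefix}/pred_yellow_tripdata_{year:04d}-{month:02d}.parquet"


def upload_predictions(s3_client, local_path, bucket, year, month):
    """Upload the local parquet file and return the destination URI."""
    key = output_key(year, month)
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


class RecordingClient:
    """Stand-in for a real S3 client: records calls instead of uploading."""

    def __init__(self):
        self.calls = []

    def upload_file(self, filename, bucket, key):
        self.calls.append((filename, bucket, key))


# Dry run with the stand-in client and a hypothetical bucket name.
client = RecordingClient()
uri = upload_predictions(
    client, "./predictions/pred_yellow_tripdata_2022-04.parquet", "my-bucket", 2022, 4
)
print(uri)

# Real wiring, assuming boto3 is installed and AWS credentials are set up:
# import boto3
# upload_predictions(boto3.client("s3"), output_file, "my-bucket", year, month)
```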

## Publishing the image to Docker Hub

This is how we published the image to Docker Hub:

```bash
docker build -t mlops-zoomcamp-model:v1 .
docker tag mlops-zoomcamp-model:v1 svizor/zoomcamp-model:mlops-3.10.0-slim
docker push svizor/zoomcamp-model:mlops-3.10.0-slim
```

In [19]:
!docker build -t mlops-zoomcamp-model-week4-deployment:latest .

[1A[1B[0G[?25l[+] Building 0.0s (0/0)                                                         
[?25h[1A[0G[?25l[+] Building 0.1s (2/3)                                                         
[34m => [internal] load build definition from Dockerfile                       0.1s
[0m[34m => => transferring dockerfile: 342B                                       0.0s
[0m[34m => [internal] load .dockerignore                                          0.0s
[0m[34m => => transferring context: 2B                                            0.0s
[0m => [internal] load metadata for docker.io/svizor/zoomcamp-model:mlops-3.  0.0s
[?25h[1A[1A[1A[1A[1A[1A[0G[?25l[+] Building 0.2s (12/12) FINISHED                                              
[34m => [internal] load build definition from Dockerfile                       0.1s
[0m[34m => => transferring dockerfile: 342B                                       0.0s
[0m[34m => [internal] load .dockerignore                           

In [20]:
!docker tag mlops-zoomcamp-model-week4-deployment:latest zatoichi/mlops-zoomcamp-model-week4-deployment:mlops-3.10.0-slim

In [21]:
!docker push zatoichi/mlops-zoomcamp-model-week4-deployment:mlops-3.10.0-slim

The push refers to repository [docker.io/zatoichi/mlops-zoomcamp-model-week4-deployment]

70a54087: Preparing
33802461: Preparing
856be0eb: Preparing
bf18a086: Preparing
84c2fefd: Preparing
4e5f9742: Preparing
53d86b70: Preparing
a10cb66d: Preparing
f6564658: Preparing
83285c91: Preparing
f803d22a: Preparing
21b9bc30: Preparing
ae00a1e0: Preparing
e3a13052: Preparing
565baf43: Preparing
10ac81d3: Preparing
10ac81d3: Layer already exists
mlops-3.10.0-slim: digest: sha256:92bf9c5050185c7a284c131c3671c18a077c6e9468c506d2de7e8829e079e1a0 size: 4292


## Submit the results

* Submit your results here: https://forms.gle/4tnqB5yGeMrTtKKa6
* It's possible that your answers won't match exactly. If it's the case, select the closest one.
* You can submit your answers multiple times. In this case, the last submission will be used for scoring.


## Deadline

The deadline for submitting is 26 June 2023 (Monday) 23:00 CEST. 
After that, the form will be closed.