# Homework - Module 04

In this homework, we'll deploy the ride duration model in batch mode. Like in homework 1, we'll use [the NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), more specifically the **Yellow** Taxi Trip Records dataset.

### Question 1. Notebook

We'll start with the same notebook we ended up with in homework 1. We cleaned it a little bit and kept only the scoring part. You can find the initial notebook [here](./starter.ipynb).

After running this notebook for the March 2023 data, the standard deviation of the predicted duration for this dataset is one of: `1.24`, `6.24`, `12.28`, `18.28`.

### Question 2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output. 

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```

We can then write the ride id and the predictions to a dataframe with results, saving it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine = 'pyarrow',
    compression = None,
    index = False
)
```

__Note:__ We used the snippet above for saving the file. It should contain only these two columns. For this question, we didn't change the
dtypes of the columns and used `pyarrow`, not `fastparquet`. 
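
As a minimal sketch, the `df_result` frame passed to `to_parquet` above could be assembled as follows. The helper name `build_result`, the column name `predicted_duration`, and the `y_pred` argument (the model's predictions) are assumptions made for illustration; only `ride_id` and its format come from this homework.

```python
import pandas as pd


def build_result(df: pd.DataFrame, y_pred, year: int, month: int) -> pd.DataFrame:
    """Return a two-column frame: ride_id and predicted_duration (assumed name)."""
    # Same ride_id scheme as above: "YYYY/MM_" followed by the row index.
    ride_id = f"{year:04d}/{month:02d}_" + df.index.astype("str")
    return pd.DataFrame({"ride_id": ride_id, "predicted_duration": y_pred})


# Tiny stand-in data instead of the real taxi dataframe and predictions.
demo_df = pd.DataFrame({"trip_distance": [1.2, 3.4, 0.8]})
demo_result = build_result(demo_df, [10.0, 11.5, 12.0], year=2023, month=3)
print(demo_result)
```

Keeping only these two columns is what makes the file size comparable across runs.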

To check the size of the output file from the notebook, we can run `!ls -lh {output_file}`.

The size of the output file is one of: `36M`, `46M`, `56M`, `66M`.

### Question 3. Creating the scoring script

Now let's turn the notebook into a script. From a notebook cell, this can be done with `!jupyter nbconvert --to=script starter.ipynb`.


### Question 4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

We first install all the required libraries, paying attention to the Scikit-Learn version (same version as in the starter notebook).

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.
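
One way to look up a hash without scrolling through the lock file is to parse it as JSON. The sketch below assumes the usual `Pipfile.lock` layout, where runtime packages live under the `default` section and each has a `hashes` list. The helper name `first_hash` is hypothetical, and the hash values in the example are placeholders, not real hashes.

```python
import json


def first_hash(lock: dict, package: str, section: str = "default") -> str:
    """Return the first hash listed for `package` in a parsed Pipfile.lock."""
    return lock[section][package]["hashes"][0]


# Tiny stand-in for a parsed Pipfile.lock; hashes and version are placeholders.
example_lock = json.loads("""
{
  "default": {
    "scikit-learn": {
      "hashes": ["sha256:placeholder-1", "sha256:placeholder-2"],
      "version": "==0.0.0"
    }
  }
}
""")

print(first_hash(example_lock, "scikit-learn"))
```

With a real project, load the file with `json.load(open("Pipfile.lock"))` and pass the result to the same function.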

The first hash for the Scikit-Learn dependency is:

### Question 5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two
parameters, `year` and `month`, add a print statement for the mean predicted duration, and run the script for April 2023.

The mean predicted duration is one of: `7.29`, `14.29`, `21.29`, `28.29`.
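
The year/month CLI described in this question can be sketched with `argparse` from the standard library. The flag names `--year` and `--month` and the function name `parse_args` are assumptions; the homework only says the script takes a year and a month. The example passes an explicit argument list so it runs without a real command line.

```python
import argparse


def parse_args(argv=None):
    """Parse --year and --month; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Score taxi rides for one month.")
    parser.add_argument("--year", type=int, required=True, help="e.g. 2023")
    parser.add_argument("--month", type=int, required=True,
                        choices=range(1, 13), metavar="MONTH", help="1-12")
    return parser.parse_args(argv)


# Explicit argv here so the example runs without a real command line.
args = parse_args(["--year", "2023", "--month", "4"])
ride_id_prefix = f"{args.year:04d}/{args.month:02d}_"
print(ride_id_prefix)  # 2023/04_
```

In the real script, calling `parse_args()` with no arguments reads the values from the command line.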

### Question 6. Docker container 

Finally, we'll package the script in a Docker container. For that, we'll use a base image that has already been prepared. This is the content of that image:

```dockerfile
FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

Note: There is no need to build it yourself. It has already been built and pushed to [`agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim`](https://hub.docker.com/layers/agrigorev/zoomcamp-model/mlops-2024-3.10.13-slim/images/sha256-f54535b73a8c3ef91967d5588de57d4e251b22addcbbfb6e71304a91c1c7027f?context=repo), which we will use as the base image.

Our Dockerfile is as follows:

```dockerfile
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

# do stuff here
```

This image already has a pickle file with a dictionary vectorizer
and a model. We will use them. There is no need to copy the model to the Docker image; just use the pickle file that is already in the image.
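
Inside the container, the script can load that pickle file from the working directory. The sketch below assumes the file stores the pair in the order `(dv, model)`, which the homework does not state explicitly, and uses a hypothetical helper name `load_model`. The round trip at the end uses stand-in strings rather than a real vectorizer and model.

```python
import os
import pickle
import tempfile


def load_model(path="model.bin"):
    """Load the (dv, model) pair from a pickle file. Only unpickle trusted files."""
    with open(path, "rb") as f_in:
        dv, model = pickle.load(f_in)
    return dv, model


# Round-trip check with stand-in objects instead of a real vectorizer/model.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "model.bin")
    with open(demo_path, "wb") as f_out:
        pickle.dump(("stand-in dv", "stand-in model"), f_out)
    dv, model = load_model(demo_path)
```

Because the base image sets `WORKDIR /app` and copies the file there as `model.bin`, the default relative path should resolve as long as the script runs from that directory.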

Now let's run the script with Docker.

The mean predicted duration for May 2023 is one of: `0.19`, `7.24`, `14.24`, `21.19`.

#### Bonus: upload the result to the cloud (Not graded)

Just printing the mean duration inside the docker image 
doesn't seem very practical. Typically, after creating the output 
file, we upload it to the cloud storage.

Modify your code to upload the parquet file to S3/GCS/etc.
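
For S3, one possible shape is a small helper that receives an S3 client and calls its `upload_file` method. Everything named here is an assumption for illustration: the helper names `result_key` and `upload_result`, the key layout, and the bucket name. The client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def result_key(year: int, month: int, prefix: str = "taxi/predictions") -> str:
    """Build an object key for one month's predictions (hypothetical layout)."""
    return f"{prefix}/yellow_{year:04d}-{month:02d}.parquet"


def upload_result(s3_client, local_path: str, bucket: str, key: str) -> str:
    """Upload a local file and return its s3:// URI.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

With boto3 installed and credentials configured, the call would look like `upload_result(boto3.client("s3"), output_file, "my-bucket", result_key(2023, 5))`, where `my-bucket` is a placeholder.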


#### Bonus: Use an orchestrator for batch inference

Here we didn't use any orchestration. In practice we usually do.

* Split the code into logical code blocks
* Use a workflow orchestrator for the code execution

#### Publishing the image to Docker Hub

This is how we published the image to Docker Hub:

```bash
docker build -t mlops-zoomcamp-model:2024-3.10.13-slim .
docker tag mlops-zoomcamp-model:2024-3.10.13-slim agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

docker login --username USERNAME
docker push agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim
```

This is just for your reference.

---