## Homework 4

### Q1. Notebook
We'll start with the same notebook we ended up with in homework 1. We cleaned it up a little and kept only the scoring part. You can find the initial notebook here.

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?

In [1]:
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error
import numpy as np

In [3]:
def read_dataframe(filename):
    # Load one month of yellow taxi trips from a CSV or Parquet file
    if filename.endswith('.csv'):
        df = pd.read_csv(filename)

        # CSV stores timestamps as strings; yellow taxi data uses the tpep_ prefix
        df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
        df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    elif filename.endswith('.parquet'):
        df = pd.read_parquet(filename)

    # Trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes; copy to avoid SettingWithCopyWarning
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Treat location IDs as categorical (string) features
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df

In [4]:
df_train = read_dataframe('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet')
df_val = read_dataframe('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-04.parquet')

In [5]:
df_train['PU_DO'] = df_train['PULocationID'] + '_' + df_train['DOLocationID']
df_val['PU_DO'] = df_val['PULocationID'] + '_' + df_val['DOLocationID']

In [6]:
# Use pickup and dropoff location IDs as separate categorical features
# (the combined 'PU_DO' column built in the previous cell is not used)
categorical = ['PULocationID', 'DOLocationID']

numerical = ['trip_distance']

dv = DictVectorizer()

train_dicts = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

In [7]:
# Save the fitted vectorizer (assumes the models/ directory already exists)
file_name = 'models/dv.bin'
with open(file_name, 'wb') as f_out:
    pickle.dump(dv, f_out)

In [8]:
target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

In [9]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

std = np.std(y_pred)

In [10]:
# Save the fitted model (assumes the models/ directory already exists)
file_name = 'models/lr.bin'
with open(file_name, 'wb') as f_out:
    pickle.dump(lr, f_out)

In [11]:
print(f"Standard deviation {std}")

Standard deviation 6.770460857201022
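
The value above comes from `np.std`, which by default computes the population standard deviation (`ddof=0`), whereas pandas' `Series.std` defaults to the sample version (`ddof=1`). On millions of rows the two are practically identical, but the toy sketch below (made-up numbers, hypothetical variable names) shows the difference on a small array.

```python
import numpy as np

# Four made-up predictions, in minutes
y_pred_demo = np.array([10.0, 12.0, 14.0, 16.0])

std_population = np.std(y_pred_demo)       # divides by n (NumPy default, ddof=0)
std_sample = np.std(y_pred_demo, ddof=1)   # divides by n - 1

print(std_population, std_sample)
```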


### Q2. Preparing the output
Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial ride_id column:

In [12]:
year = 2023
month = 3
df_train['ride_id'] = f'{year:04d}/{month:02d}_' + df_train.index.astype('str')
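
As a small illustration of what this produces, the sketch below builds `ride_id` on a made-up four-row dataframe after applying the same 1 to 60 minute filter. Because filtering keeps the original index labels, the IDs are unique but not consecutive. The `df_demo` name and the durations are hypothetical.

```python
import pandas as pd

year, month = 2023, 3

# Made-up durations; rows 0 and 2 fall outside the 1-60 minute range
df_demo = pd.DataFrame({'duration': [0.5, 12.0, 75.0, 30.0]})
df_demo = df_demo[(df_demo.duration >= 1) & (df_demo.duration <= 60)].copy()

# Same expression as in the cell above: zero-padded year/month plus the row index
df_demo['ride_id'] = f'{year:04d}/{month:02d}_' + df_demo.index.astype('str')

ride_ids = df_demo['ride_id'].tolist()
print(ride_ids)
```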

In [13]:

# Combine the ride IDs from df_train with the predictions in y_pred
df_result = pd.DataFrame({
    'ride_id': df_train['ride_id'],
    'prediction': y_pred
})

output_file = 'results.parquet'
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    index=False
)

### Q3. Creating the scoring script
Now let's turn the notebook into a script.

Which command do you need to execute for that?

In [14]:
!jupyter nbconvert --to script homework_week_4.ipynb

[NbConvertApp] Converting notebook homework_week_4.ipynb to script
[NbConvertApp] Writing 5118 bytes to homework_week_4.py


General form: jupyter nbconvert --to script <notebook-name>.ipynb

### Q4. Virtual environment

In [15]:
! pip freeze | findstr scikit-learn

scikit-learn==1.7.0


In [16]:
! pipenv install scikit-learn==1.7.0

To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.
Installing scikit-learn==1.7.0...
Installation Succeeded
To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.
Installing dependencies from Pipfile.lock (52a7b3)...
All dependencies are now up-to-date!
Building requirements...
[    ] Locking packages...
Resolving dependencies...

Courtesy Notice:
Pipenv found itself running within a virtual environment,  so it will 
automatically use that environment, instead of  creating its own for any 
project. You can set
PIPENV_IGNORE_VIRTUALENVS=1 to force pipenv to ignore that environment and 
create  its own instead.
Upgrading scikit-learn==1.7.0 in  dependencies.


In [17]:
print("The first hash is sha256:014e07a23fe02e65f9392898143c542a50b6001dbe89cb867e19688e468d049b")

The first hash is sha256:014e07a23fe02e65f9392898143c542a50b6001dbe89cb867e19688e468d049b


### Q5. Parametrize the script
Let's now make the script configurable via CLI. We'll create two parameters: year and month.

Run the script for April 2023.

What's the mean predicted duration?
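
One way to expose `year` and `month` on the command line is `argparse` from the standard library. The sketch below is an assumption about how the script could be structured, not the course's reference solution: the `parse_args` and `input_url` helper names are hypothetical, while the URL pattern follows the one used in cell In [4].

```python
import argparse


def parse_args(argv=None):
    """Parse --year and --month; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description='Score one month of yellow taxi trips.')
    parser.add_argument('--year', type=int, required=True, help='e.g. 2023')
    parser.add_argument('--month', type=int, required=True, choices=range(1, 13), help='1-12')
    return parser.parse_args(argv)


def input_url(year, month):
    """Build the trip-data URL following the pattern used earlier in the notebook."""
    return (
        'https://d37ci6vzurychx.cloudfront.net/trip-data/'
        f'yellow_tripdata_{year:04d}-{month:02d}.parquet'
    )


# Explicit argument list so the sketch runs anywhere; a real script would call parse_args()
args = parse_args(['--year', '2023', '--month', '4'])
url = input_url(args.year, args.month)
print(url)
```

With a script built this way, the April 2023 run would look like `python homework_week_4.py --year 2023 --month 4`.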



In [18]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_val)

std = np.std(y_pred)
mean = np.mean(y_pred)

In [19]:
print(f"Mean predicted duration {mean}")

Mean predicted duration 15.084995420295018


### Q6. Docker container
Finally, we'll package the script in a Docker container. For that, you'll need to use a base image that we prepared.

This is the content of that image:

FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]

Note: you don't need to run it. We have already done it.

It is pushed to agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim, which you need to use as your base image.

That is, your Dockerfile should start with:

FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

This image already has a pickle file with a dictionary vectorizer and a model. You will need to use them.

Important: don't copy the model to the docker image. You will need to use the pickle file already in the image.
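
Inside the container, the script would read that pickle instead of the `models/dv.bin` and `models/lr.bin` files written earlier. The sketch below assumes the file holds a `(vectorizer, model)` tuple; that layout is an assumption, since the text above only says the file contains both objects. To stay runnable outside the image, it writes stand-in objects to a temporary `model.bin` first.

```python
import os
import pickle
import tempfile

# Stand-ins for the real objects; in the image these would be a fitted
# DictVectorizer and a trained model (tuple layout is an assumption)
stand_in = ({'kind': 'vectorizer'}, {'kind': 'model'})

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.bin')

    # Only needed for this sketch; the base image already ships model.bin
    with open(model_path, 'wb') as f_out:
        pickle.dump(stand_in, f_out)

    # This is the part the scoring script would do, with model_path = 'model.bin'
    with open(model_path, 'rb') as f_in:
        dv_loaded, model_loaded = pickle.load(f_in)

print(dv_loaded, model_loaded)
```

Since the base image sets `WORKDIR /app` and copies the file to `model.bin`, a script run from that directory can open it by the relative name.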

Now run the script with Docker. What's the mean predicted duration for May 2023?