In [3]:
!pip install scikit-learn==1.0.2


In [11]:
import pickle
import pandas as pd
import numpy as np

In [6]:
with open('model.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)

In [7]:
categorical = ['PUlocationID', 'DOlocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [8]:
df = read_data('../data/fhv_tripdata_2021-02.parquet')

In [9]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = lr.predict(X_val)

## Q1. Notebook
Run this notebook for the February 2021 FHV data.

What's the mean predicted duration for this dataset?

- 11.19
- 16.19
- 21.19
- 26.19

In [13]:
print(f"The mean predicted duration for this dataset is {np.mean(y_pred)}")

The mean predicted duration for this dataset is 16.191691679979066


## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial ride_id column:

In [17]:
year = 2021
month = 2

In [18]:
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

Next, write the ride id and the predictions to a dataframe with results.

In [20]:
df_result = df.copy(deep=True)

In [31]:
# Keep only the ride_id column; `columns=` replaces the deprecated positional axis argument
df_result.drop(columns=df_result.columns.difference(['ride_id']), inplace=True)


In [36]:
df_result["predictions"] = y_pred

In [37]:
df_result.to_parquet(
    "../data/results.parquet",
    engine='pyarrow',
    compression=None,
    index=False
)

In [39]:
print("The size of the output file is 19.7MB")

The size of the output file is 19.7MB
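The size above is typed in by hand. As a minimal sketch, it could instead be measured with `os.path.getsize`. The helper name `file_size_mb` is hypothetical, and the sketch assumes 1 MB = 10^6 bytes; the notebook does not say which convention produced the 19.7MB figure. The demo runs on a throwaway file so that it is self-contained.

```python
import os
import tempfile


def file_size_mb(path):
    """Return the size of the file at `path` in megabytes.

    Assumption: 1 MB = 10**6 bytes (not 2**20).
    """
    return os.path.getsize(path) / 1_000_000


# Demo on a temporary file of exactly 2,500,000 bytes.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bin")
    with open(demo_path, "wb") as f_out:
        f_out.write(b"\0" * 2_500_000)
    demo_size = file_size_mb(demo_path)

print(f"The size of the demo file is {demo_size:.1f}MB")
```

In the notebook, the same helper would be called on `"../data/results.parquet"` after the `to_parquet` cell has run.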


## Q3. Creating the scoring script


Now let's turn the notebook into a script.

Which command do you need to execute for that?

**Answer**: 

```
jupyter nbconvert --to script homework_answrs.ipynb
```

## Q4. Virtual environment
Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: check the starter notebook for details.

After installing the libraries, pipenv creates two files: Pipfile and Pipfile.lock. The Pipfile.lock file keeps the hashes of the dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?
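`Pipfile.lock` is a JSON file, so the hash can also be looked up programmatically rather than by eye. The sketch below assumes the usual layout, in which each package under the `default` section has a `hashes` list. The helper name `first_hash` is hypothetical, and the demo uses a stub lock file with placeholder hashes, not real ones.

```python
import json
import os
import tempfile


def first_hash(lock_path, package, section="default"):
    """Return the first hash listed for `package` in a Pipfile.lock file."""
    with open(lock_path) as f_in:
        lock = json.load(f_in)
    return lock[section][package]["hashes"][0]


# Stub lock file with placeholder hashes, for illustration only.
stub_lock = {
    "default": {
        "scikit-learn": {
            "hashes": ["sha256:aaaa", "sha256:bbbb"],
            "version": "==1.0.2",
        }
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    stub_path = os.path.join(tmp_dir, "Pipfile.lock")
    with open(stub_path, "w") as f_out:
        json.dump(stub_lock, f_out)
    result = first_hash(stub_path, "scikit-learn")

print(result)
```

Run against the real `Pipfile.lock` in the project directory, `first_hash("Pipfile.lock", "scikit-learn")` would return the value to report.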

**Answer**:

"sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b"

## Q5. Parametrize the script


Let's now make the script configurable via CLI. We'll create two parameters: year and month.

Run the script for March 2021.
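One way to add the two parameters is `argparse`. This is a sketch under assumptions, not the course's reference solution: the flag names `--year` and `--month` and the helpers `parse_args` and `input_path` are hypothetical, and the path template simply generalizes the `../data/fhv_tripdata_2021-02.parquet` path used earlier in the notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse the year and month from the command line (or from `argv`)."""
    parser = argparse.ArgumentParser(description="Score FHV trips for one month.")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True, choices=range(1, 13))
    return parser.parse_args(argv)


def input_path(year, month):
    """Build the input file path for the given year and month."""
    return f"../data/fhv_tripdata_{year:04d}-{month:02d}.parquet"


# Demo with an explicit argument list; a real script would call parse_args()
# with no arguments so that the values are read from sys.argv.
args = parse_args(["--year", "2021", "--month", "3"])
print(input_path(args.year, args.month))
```

With these names, the March 2021 run would be invoked with `--year 2021 --month 3`, and the resulting path would be passed to `read_data`.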

What's the mean predicted duration?

- 11.29
- 16.29
- 21.29
- 26.29