In [23]:
%%capture
%pip install scikit-learn==1.5.0

In [24]:
!pip show scikit-learn

Name: scikit-learn
Version: 1.5.0
Summary: A set of python modules for machine learning and data mining
Home-page: https://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: D:\programming\mlops-zc\.venv\Lib\site-packages
Requires: joblib, numpy, scipy, threadpoolctl
Required-by: mlflow


In [25]:
!python -V

Python 3.11.2


In [26]:
import pickle
import pandas as pd

In [27]:
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [28]:
categorical = ['PULocationID', 'DOLocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [29]:
year = 2023
month = 3

df = read_data(f'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{year}-{month:02d}.parquet')

In [30]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict(X_val)

### Q1. Notebook
Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?

In [31]:
y_pred.std()

6.247488852238703
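
A detail worth keeping in mind when comparing answers: assuming `y_pred` is a NumPy array (which is what scikit-learn's `predict` returns), `.std()` computes the population standard deviation (`ddof=0`), whereas pandas' `Series.std()` defaults to the sample version (`ddof=1`). The sketch below illustrates the difference using the standard library on made-up numbers, not on the actual predictions:

```python
import statistics

# Made-up values for illustration only; these are not real predictions.
values = [10.0, 12.0, 14.0, 16.0, 18.0]

# Population standard deviation: divides by n (NumPy's default, ddof=0).
population_std = statistics.pstdev(values)

# Sample standard deviation: divides by n - 1 (pandas' default, ddof=1).
sample_std = statistics.stdev(values)

print(f"population: {population_std:.4f}, sample: {sample_std:.4f}")
```

On a dataset as large as a month of trip data the two values are nearly identical, so the distinction matters mainly for small samples.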

### Q2. Preparing the output
Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial `ride_id` column:

```
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```
Next, write the ride id and the predictions to a dataframe with results.

Save it as parquet:
```
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```
What's the size of the output file?

Note: Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the dtypes of the columns and use `pyarrow`, not `fastparquet`.
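
To see what the `ride_id` values look like, the same format string can be applied in plain Python. This is a sketch on a hypothetical handful of index values; `make_ride_id` is an illustrative helper, not part of the homework code:

```python
def make_ride_id(year: int, month: int, index_value: int) -> str:
    """Build an id such as '2023/03_0' from the year, month, and row index."""
    return f"{year:04d}/{month:02d}_{index_value}"


# Hypothetical index values; because read_data filters rows without
# resetting the index, the real index can have gaps like this as well.
ride_ids = [make_ride_id(2023, 3, i) for i in (0, 1, 5)]
print(ride_ids)  # ['2023/03_0', '2023/03_1', '2023/03_5']
```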

In [35]:
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

# Build the result with only the two columns the question asks for.
df_result = pd.DataFrame()
df_result["ride_id"] = df.ride_id
df_result["predicted_duration"] = y_pred

df_result.to_parquet(
    f"./output/yellow_tripdata_{year}-{month:02d}.parquet",
    engine='pyarrow',
    compression=None,
    index=False
)
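
The cell above writes the file but does not report its size, which is what Q2 asks for. One way to check it is sketched below with the standard library; `file_size_mb` is a hypothetical helper, the path is assumed to be the one passed to `to_parquet` above, and a megabyte is taken here as 1024 * 1024 bytes:

```python
from pathlib import Path


def file_size_mb(path: str) -> float:
    """Return the size of the file at `path` in MB (1 MB = 1024 * 1024 bytes)."""
    return Path(path).stat().st_size / (1024 * 1024)


# Assumed output path, matching year=2023 and month=3 from the earlier cell.
output_file = "./output/yellow_tripdata_2023-03.parquet"

if Path(output_file).exists():
    print(f"{file_size_mb(output_file):.1f} MB")
else:
    print(f"{output_file} not found; run the cell that writes it first")
```

Note that `to_parquet` does not create missing directories, so `./output` has to exist before the file is written.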