In [1]:
import polars as pl
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from datetime import datetime

## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride - similar to what we did in this module.

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2023.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19

In [2]:
def clean_df(df: pl.DataFrame) -> pl.DataFrame:
    # normalise column names: lower case, underscores instead of spaces
    df = df.rename(lambda x: x.lower().replace(' ', '_'))

    for col_name, datatype in df.schema.items():
        # Cast every numeric column to Float64 so both months share one schema.
        # Casting ints as floats is a dirty move, but at least one column is a
        # float in one dataframe and an int in the other.
        if datatype.is_float() or datatype.is_integer():
            df = df.with_columns(pl.col(col_name).cast(pl.Float64))

    return df

df_list = [
    pl.read_parquet(i)
    .pipe(clean_df)
     for i in ['../Downloads/yellow_tripdata_2023-01.parquet', '../Downloads/yellow_tripdata_2023-02.parquet']
]

df = pl.concat(df_list)

df.shape

(5980721, 19)

19 columns

## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

* 32.59
* 42.59
* 52.59
* 62.59
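As a quick illustration of the computation, here is a minimal standard-library sketch on three made-up rides (the timestamps are invented for illustration, not taken from the dataset). It uses the sample standard deviation (`n - 1` denominator), which is also the default for polars' `.std()`.

```python
from datetime import datetime
from statistics import stdev

# Hypothetical (pickup, dropoff) pairs, invented for illustration
rides = [
    (datetime(2023, 1, 1, 0, 0, 0), datetime(2023, 1, 1, 0, 10, 0)),    # 10 min
    (datetime(2023, 1, 1, 8, 30, 0), datetime(2023, 1, 1, 8, 50, 0)),   # 20 min
    (datetime(2023, 1, 2, 23, 45, 0), datetime(2023, 1, 3, 0, 15, 0)),  # 30 min
]

# timedelta.total_seconds() does not depend on any storage time unit
durations = [(dropoff - pickup).total_seconds() / 60 for pickup, dropoff in rides]

# sample standard deviation (n - 1 denominator)
duration_std = stdev(durations)
print(durations, duration_std)
```

Going through seconds rather than a raw integer cast avoids having to know whether the timestamps are stored in nanoseconds or microseconds.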

In [3]:
df = (
    df
    .with_columns(
        #cast as int (nanoseconds here; this assumes the datetime columns use the 'ns' time unit), then convert to minutes
        duration =  (pl.col('tpep_dropoff_datetime') - pl.col('tpep_pickup_datetime')).cast(int)/1e9/60,
        #add a month column for filtering in later questions:
        month=pl.col('tpep_pickup_datetime').dt.strftime('%Y-%m')
    )
)

(
    df
    .filter(pl.col('month')=='2023-01')
    .select('duration')
    .std()
)

duration
f64
42.585592


42.59 minutes

## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* 98%
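The inclusive filter can be sketched in plain Python as follows. The durations are made up for illustration; the point is only that both bounds are kept and that the mean of the boolean mask is the fraction of surviving rows.

```python
# Hypothetical durations in minutes, invented for illustration
durations = [0.5, 1.0, 12.3, 60.0, 60.01, 240.0, 7.5, 33.0]

# Inclusive bounds: a ride of exactly 1 or exactly 60 minutes is kept
keep = [1 <= d <= 60 for d in durations]

# The mean of a boolean mask is the fraction of rows kept
fraction_kept = sum(keep) / len(keep)
print(fraction_kept)
```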

In [4]:
df = (
    df
    .with_columns(
        keep_me = (pl.col('duration') >= 1) & (pl.col('duration') <= 60)
    )
)

(
    df
    .select('keep_me')
    .mean()
)

keep_me
f64
0.980672


About 98% of the rows are kept (computed here over January and February combined).

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to re-cast the ids to strings - otherwise it will 
  label encode them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715

In [5]:
df_X = (
    df
    .filter(pl.col('keep_me'))
    .select('pulocationid', 'dolocationid')
    .cast(str)
)

ohe = OneHotEncoder()

X = ohe.fit_transform(df_X)

In [6]:
X.shape

(5865124, 519)

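tbh I don't really see why the question asks for a dictionary vectorizer when a `OneHotEncoder` on the string-cast IDs does the same job.  
The 519 columns above come from January and February combined; fitting on January only (see `n_features_in_` in Q5) gives 515 columns, which is the intended answer.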

## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters, where duration is the response variable
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 3.64
* 7.64
* 11.64
* 16.64
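As a reminder of what the metric measures, RMSE is the square root of the mean squared residual. A minimal standard-library sketch, with made-up durations and predictions (the `rmse` helper is written here for illustration and is not the scikit-learn function):

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of the squared residuals."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical durations (minutes) and predictions, invented for illustration
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [13.0, 16.0, 30.0, 40.0]

print(rmse(y_true, y_pred))
```

The result is in the same unit as the target, so an RMSE of 7.6 means predictions are off by roughly seven and a half minutes in the root-mean-square sense.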

In [7]:
model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), ['pulocationid','dolocationid']),
        remainder='drop'
    ),
    LinearRegression(n_jobs=-1)
)

In [13]:
df_train = (
    df
    .filter(
        pl.col('keep_me')
        & (pl.col('month') == '2023-01')
    )
)

df_val = (
    df
    .filter(
        pl.col('keep_me')
        & (pl.col('month') == '2023-02')
    )
)

In [9]:
model.fit(df_train, df_train.select('duration'))

In [10]:
model[-1].n_features_in_

515

yeah, so the answer to Q4 is the 515-column option: the model fitted on January alone sees 515 one-hot features.

In [11]:
y_train_pred = model.predict(df_train)
print(root_mean_squared_error(df_train.select('duration'), y_train_pred))

7.649236741030097


RMSE ≈ 7.65, so the closest option is 7.64

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2023). 

What's the RMSE on validation?

* 3.81
* 7.81
* 11.81
* 16.81

In [14]:
y_val_pred = model.predict(df_val)
print(root_mean_squared_error(df_val.select('duration'), y_val_pred))

7.811816694756555


RMSE = 7.81

## Submit the results

* Submit your results here: https://courses.datatalks.club/mlops-zoomcamp-2024/homework/hw1
* If your answer doesn't match options exactly, select the closest one