## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.
I basically followed the process in the optional lecture [here](https://www.youtube.com/watch?v=iRunifGSHFc&list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK&index=5&pp=iAQB). There are certain things I could have approached differently, but doing so would skew the provided answers.

__Used packages__

In [26]:
import pandas as pd
from pathlib import Path
from typing import Dict, List

In [27]:
data_dir = "../data/raw_data"

__Download Data__. I added it to a central location to avoid littering data all over my space.
We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records". To download the data for January and February 2022, use

```bash
! wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet -P ../data/raw_data
! wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet -P ../data/raw_data
```

## Reading data

In [28]:
!python -V

Python 3.9.16


In [29]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [30]:
def load_data(path: Path, file_name: str) -> pd.DataFrame:
    """Load a parquet object from file into a DataFrame.

    Args:
        path (Path): The base path of the data
        file_name (str): The parquet data file name, without the extension

    Returns:
        pd.DataFrame: DataFrame of raw data
    """
    # Build the location from the `path` argument rather than the global `data_dir`
    return pd.read_parquet(Path(path) / f"{file_name}.parquet")

def add_trip_duration(
    df: pd.DataFrame, pick_up_time: str = "tpep_pickup_datetime", drop_off_time: str = "tpep_dropoff_datetime"
) -> pd.DataFrame:
    """Adds the column `duration` to the DataFrame.

    The column duration is derived from the pickup and drop off time stamps.

    Args:
        df (pd.DataFrame): Raw data of the taxi trip data
        pick_up_time (str, optional): The pickup time. Defaults to "tpep_pickup_datetime".
        drop_off_time (str, optional): Dropoff time. Defaults to "tpep_dropoff_datetime".

    Returns:
        pd.DataFrame: The taxi data with duration added
    """
    df[pick_up_time] = pd.to_datetime(df[pick_up_time])
    df[drop_off_time] = pd.to_datetime(df[drop_off_time])
    # Duration in minutes, computed with the vectorized .dt accessor instead of a row-wise apply
    df["duration"] = (df[drop_off_time] - df[pick_up_time]).dt.total_seconds() / 60

    return df


def remove_outliers(df: pd.DataFrame, strategy: Dict[str, Dict[str, float]]) -> pd.DataFrame:
    """Removes outliers as defined by a strategy.

    The strategy is a dict with column names as keys. Each value is another dict
    with optional keys `min` and `max` giving the inclusive cutoffs for that column.

    Args:
        df (pd.DataFrame): DataFrame with outliers present.
        strategy (Dict[str, Dict[str, float]]): The strategy to remove outliers

    Returns:
        pd.DataFrame: The processed data with outliers removed.
    """
    for column, outlier in strategy.items():
        # TODO: Build the combined mask on the series and filter once outside the loop
        # A missing `min` or `max` falls back to the column's own extreme, i.e. no cutoff
        mask = (
            (df[column] >= outlier.get("min", df[column].min()))
            & (df[column] <= outlier.get("max", df[column].max()))
        )
        df = df[mask]

    return df

def preprocess_taxi_data(
    df: pd.DataFrame,
    pickup_dropoff: Dict[str, str],
    strategy: Dict[str, Dict[str, float]],
    categorical_features: List[str],
) -> pd.DataFrame:
    """Adds duration using `add_trip_duration`, removes outliers using `remove_outliers`,
    and casts the categorical features using `categorial_feature_prepocessing`.

    Args:
        df (pd.DataFrame): DataFrame with outliers present.
        pickup_dropoff (Dict[str, str]): Column names of the timestamps under the keys `pickup` and `dropoff`
        strategy (Dict[str, Dict[str, float]]): The strategy to remove outliers
        categorical_features (List[str]): List of categorical features to pass to categorical feature processing

    Returns:
        pd.DataFrame: The processed data with outliers removed.
    """
    if "duration" not in df.columns:
        df = add_trip_duration(df=df, pick_up_time=pickup_dropoff["pickup"], drop_off_time=pickup_dropoff["dropoff"])
    df = remove_outliers(df=df, strategy=strategy)
    df = categorial_feature_prepocessing(df, categorical_features=categorical_features)

    return df

def categorial_feature_prepocessing(df: pd.DataFrame, categorical_features: List[str]) -> pd.DataFrame:
    """Preprocess categorical features.

    Here we simply preprocess them by casting them as strings.

    Args:
        df (pd.DataFrame): Data with both numerical and categorical columns
        categorical_features (List[str]): List of categorical features

    Returns:
        pd.DataFrame: Processed DataFrame
    """
    df[categorical_features] = df[categorical_features].astype(str)

    return df 

def train_model(df_train: pd.DataFrame, df_test: pd.DataFrame, categorial_features: List[str], target: str = "duration"):
    """Train a linear regression and calculate the RMSE on the training and test data.

    Args:
        df_train (pd.DataFrame): Training data
        df_test (pd.DataFrame): Test data
        categorial_features (List[str]): List of categorical features
        target (str, optional): Target column. Defaults to "duration".

    Returns:
        Tuple[DictVectorizer, Dict[str, float]]: The fitted vectorizer and a dict of RMSE
            values for train and test, stored under the keys `train-mse` and `test-mse`
    """
    
    dv = DictVectorizer()   
    train_dicts = df_train[categorial_features].to_dict(orient='records')
    test_dicts = df_test[categorial_features].to_dict(orient='records')
    
    X_train = dv.fit_transform(train_dicts)
    X_test = dv.transform(test_dicts)
    
    y_train = df_train[target].values
    y_test = df_test[target].values

    lr = LinearRegression()
    lr.fit(X_train, y_train)

    y_pred_train = lr.predict(X_train)
    y_pred_test = lr.predict(X_test)


    # squared=False returns the root of the MSE, so these values are RMSEs despite the key names
    mse = {
        "train-mse": mean_squared_error(y_train, y_pred_train, squared=False),
        "test-mse": mean_squared_error(y_test, y_pred_test, squared=False)
        }
    
    return dv, mse 

### Read the data

In [31]:
raw_train = load_data(Path(data_dir), file_name="yellow_tripdata_2022-01") 
raw_train.head(2)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0


In [32]:
raw_train.shape

(2463931, 19)

## Q1. Downloading the data

Read the data for January. How many columns are there?



In [33]:
print(f'There are {len(raw_train.columns)} columns in data for January')

There are 19 columns in data for January


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

In [34]:
raw_train_with_duration = add_trip_duration(raw_train)

In [35]:
raw_train_with_duration.head(2)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,duration
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0,17.816667
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0,8.4


In [36]:
print(f"The standard deviation of trips in January is {raw_train_with_duration['duration'].std():.4f}")

The standard deviation of trips in January is 46.4453
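
As a quick sanity check, the duration calculation can be reproduced on the two rows shown in the `head(2)` output above. This is a minimal sketch rather than the notebook's `add_trip_duration`; the frame `toy` is introduced only for illustration, and it uses the vectorized `.dt.total_seconds()` accessor.

```python
import pandas as pd

# Two trips copied from the head of the January data shown above.
toy = pd.DataFrame(
    {
        "tpep_pickup_datetime": ["2022-01-01 00:35:40", "2022-01-01 00:33:43"],
        "tpep_dropoff_datetime": ["2022-01-01 00:53:29", "2022-01-01 00:42:07"],
    }
)

pickup = pd.to_datetime(toy["tpep_pickup_datetime"])
dropoff = pd.to_datetime(toy["tpep_dropoff_datetime"])

# Subtracting datetimes gives a Timedelta series; convert it to minutes.
toy["duration"] = (dropoff - pickup).dt.total_seconds() / 60

print(toy["duration"].round(6).tolist())
```

The two values should match the `duration` column in the output above (17 min 49 s and 8 min 24 s).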


## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* 98%


In [37]:
preprocess_train = remove_outliers(raw_train_with_duration, strategy={"duration":{"min": 1, "max": 60}})

In [38]:
preprocess_train.head(2)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,duration
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0,17.816667
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0,8.4


In [39]:
raw_train_with_duration.shape

(2463931, 20)

In [40]:
print(f"Fraction of records kept: {(preprocess_train.shape[0] / raw_train_with_duration.shape[0]):.2f}")

Fraction of records kept: 0.98


The fraction of records left after dropping the outliers is 98%.
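
The fraction is a ratio of row counts (`shape[0]` or `len`), not column counts (`shape[1]`). A minimal sketch on made-up durations, assuming the same inclusive 1 to 60 minute window; the frame `toy` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical durations in minutes; not taken from the taxi data.
toy = pd.DataFrame({"duration": [0.5, 1.0, 12.3, 60.0, 75.0]})

# Series.between is inclusive on both ends by default.
kept = toy[toy["duration"].between(1, 60)]

# Compare the number of rows before and after filtering.
fraction_kept = len(kept) / len(toy)
print(f"{fraction_kept:.0%}")
```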

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?
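
The three steps above can be sketched on a handful of made-up records. This is an illustration under assumptions, not the notebook's code: the records are hypothetical, and the IDs are strings because `DictVectorizer` one-hot encodes string values while passing numeric values through as a single numeric column.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records; IDs are strings so they are treated as categories.
records = [
    {"PULocationID": "142", "DOLocationID": "236"},
    {"PULocationID": "236", "DOLocationID": "42"},
    {"PULocationID": "142", "DOLocationID": "42"},
]

dv = DictVectorizer()
# Sparse matrix with one column per distinct (feature, value) pair.
X = dv.fit_transform(records)

print(dv.feature_names_)
print(X.shape)
```

The number of columns equals the number of distinct pickup IDs plus the number of distinct dropoff IDs seen during fitting.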

In [41]:
categorical = ['PULocationID', 'DOLocationID']
TARGET = 'duration'

In [42]:
train_data = categorial_feature_prepocessing(preprocess_train[categorical + [TARGET]].copy(), categorical_features=categorical)
train_data.head()

Unnamed: 0,PULocationID,DOLocationID,duration
0,142,236,17.816667
1,236,42,8.4
2,166,166,8.966667
3,114,68,10.033333
4,68,163,37.533333


In [43]:
raw_test = load_data(Path(data_dir), file_name="yellow_tripdata_2022-02") 
raw_test.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-02-01 00:06:58,2022-02-01 00:19:24,1.0,5.4,1.0,N,138,252,1,17.0,1.75,0.5,3.9,0.0,0.3,23.45,0.0,1.25
1,1,2022-02-01 00:38:22,2022-02-01 00:55:55,1.0,6.4,1.0,N,138,41,2,21.0,1.75,0.5,0.0,6.55,0.3,30.1,0.0,1.25
2,1,2022-02-01 00:03:20,2022-02-01 00:26:59,1.0,12.5,1.0,N,138,200,2,35.5,1.75,0.5,0.0,6.55,0.3,44.6,0.0,1.25
3,2,2022-02-01 00:08:00,2022-02-01 00:28:05,1.0,9.88,1.0,N,239,200,2,28.0,0.5,0.5,0.0,3.0,0.3,34.8,2.5,0.0
4,2,2022-02-01 00:06:48,2022-02-01 00:33:07,1.0,12.16,1.0,N,138,125,1,35.5,0.5,0.5,8.11,0.0,0.3,48.66,2.5,1.25


In [44]:
test_data = preprocess_taxi_data(
    df = raw_test, 
    pickup_dropoff = {"pickup": "tpep_pickup_datetime", "dropoff": "tpep_dropoff_datetime"},
    strategy={"duration":{"min": 1, "max": 60}}, categorical_features=categorical
    )
test_data.head(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[categorical_features] = df[categorical_features].astype(str)


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,duration
0,1,2022-02-01 00:06:58,2022-02-01 00:19:24,1.0,5.4,1.0,N,138,252,1,17.0,1.75,0.5,3.9,0.0,0.3,23.45,0.0,1.25,12.433333
1,1,2022-02-01 00:38:22,2022-02-01 00:55:55,1.0,6.4,1.0,N,138,41,2,21.0,1.75,0.5,0.0,6.55,0.3,30.1,0.0,1.25,17.55
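
The warning above appears because `remove_outliers` returns a filtered frame and `categorial_feature_prepocessing` then assigns into it. One common way to avoid it, sketched here on a hypothetical frame rather than the taxi data, is to take an explicit copy after filtering:

```python
import pandas as pd

# Hypothetical frame standing in for the taxi data.
df = pd.DataFrame(
    {"PULocationID": [138, 138, 239], "duration": [12.4, 90.0, 20.1]}
)

# An explicit copy makes it unambiguous that later assignments
# target the filtered frame and not the original.
filtered = df[df["duration"].between(1, 60)].copy()

# Replace the column with its string version; the original df is untouched.
filtered["PULocationID"] = filtered["PULocationID"].astype(str)

print(filtered)
```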


In [45]:
dv, mse = train_model(train_data, test_data, categorial_features=categorical, target=TARGET)

In [46]:
len(dv.feature_names_)

515

## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 6.99
* 11.99
* 16.99
* 21.99

In [48]:
print(f'Train RMSE is: {mse["train-mse"]:.2f}')

Train RMSE is: 6.99
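
For reference, the RMSE is the square root of the mean squared residual, which is what `mean_squared_error(..., squared=False)` computes. A minimal NumPy sketch on made-up numbers (the arrays are hypothetical); computing it by hand also avoids depending on the `squared` argument, which, as far as I know, newer scikit-learn releases deprecate in favor of `root_mean_squared_error`:

```python
import numpy as np

# Hypothetical true and predicted durations in minutes.
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# RMSE: square root of the mean of the squared residuals.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f"{rmse:.2f}")
```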


## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022). 

What's the RMSE on validation?

* 7.79
* 12.79
* 17.79
* 22.79

In [49]:
print(f'Validation RMSE is: {mse["test-mse"]:.2f}')

Validation RMSE is: 7.79
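
One detail worth keeping in mind when applying the model to February: the vectorizer is only fitted on the training data, and `transform` reuses that vocabulary. A small sketch with hypothetical records shows that a location ID never seen during fitting is silently ignored, so that row gets all zeros for the feature:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training and validation records.
train_records = [{"PULocationID": "142"}, {"PULocationID": "236"}]
# "999" does not occur in the training records.
val_records = [{"PULocationID": "236"}, {"PULocationID": "999"}]

dv = DictVectorizer()
dv.fit(train_records)

# transform keeps the fitted columns; unseen values add no new column.
X_val = dv.transform(val_records)
print(X_val.toarray())
```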
