## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.


## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2023.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19 [X]


In [30]:
import pandas as pd
import numpy as np

In [8]:
DATA_PATH = 'Homeworks/data/'


In [9]:
def read_taxis_data(filepath,
    taxi_type = 'green',
    year = '2022',
    train_month = '01',
    valid_month = '02'):

  df = pd.DataFrame()
  data_months = [train_month, valid_month]

  for i in data_months:
    df_part = pd.read_parquet(filepath + f'{year}/{taxi_type}_tripdata_{year}-{i}.parquet')
    df = pd.concat([df, df_part], ignore_index=True)

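  # Green and yellow taxis use different prefixes for their datetime columns
  if taxi_type == "green":
      dropoff_column = "lpep_dropoff_datetime"
      pickup_column = "lpep_pickup_datetime"
  elif taxi_type == "yellow":
      dropoff_column = "tpep_dropoff_datetime"
      pickup_column = "tpep_pickup_datetime"
  else:
      # Fail early: without a known taxi type the datetime columns are undefined
      raise ValueError("taxi_type must be 'green' or 'yellow'")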

  # Ride duration in minutes
  df['duration'] = (df[dropoff_column] - df[pickup_column])
  df['duration'] = df['duration'].dt.total_seconds().div(60)
  # Flag rows picked up in the validation month or later (1) versus earlier rows (0)
  df['valid'] = np.where(df[pickup_column] >= (year + '-' + valid_month), 1, 0)
    
  return df


In [22]:
data = read_taxis_data(DATA_PATH,
    taxi_type = 'yellow',
    year='2023',
    train_month =  '01',
    valid_month = '02')


In [23]:
data.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee', 'Airport_fee',
       'duration', 'valid'],
      dtype='object')

In [24]:
# The February 2023 file names this column 'Airport_fee' (capital A), while January uses 'airport_fee'
data[~(data["Airport_fee"].isna())]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,Airport_fee,duration,valid
3066766,1,2023-02-01 00:32:53,2023-02-01 00:34:34,2.0,0.30,1.0,N,142,163,2,...,0.5,0.00,0.00,1.0,9.40,2.5,,0.00,1.683333,1
3066767,2,2023-02-01 00:35:16,2023-02-01 00:35:30,1.0,0.00,1.0,N,71,71,4,...,-0.5,0.00,0.00,-1.0,-5.50,0.0,,0.00,0.233333,1
3066768,2,2023-02-01 00:35:16,2023-02-01 00:35:30,1.0,0.00,1.0,N,71,71,4,...,0.5,0.00,0.00,1.0,5.50,0.0,,0.00,0.233333,1
3066769,1,2023-02-01 00:29:33,2023-02-01 01:01:38,0.0,18.80,1.0,N,132,26,1,...,0.5,0.00,0.00,1.0,74.65,0.0,,1.25,32.083333,1
3066770,2,2023-02-01 00:12:28,2023-02-01 00:25:46,1.0,3.22,1.0,N,161,145,1,...,0.5,3.30,0.00,1.0,25.30,2.5,,0.00,13.300000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5903899,2,2023-02-28 23:47:42,2023-02-28 23:54:17,1.0,1.76,1.0,N,239,50,1,...,0.5,3.00,0.00,1.0,18.00,2.5,,0.00,6.583333,1
5903900,2,2023-02-28 23:10:57,2023-02-28 23:17:52,2.0,1.86,1.0,N,50,239,1,...,0.5,3.14,0.00,1.0,18.84,2.5,,0.00,6.916667,1
5903901,2,2023-02-28 23:09:54,2023-02-28 23:23:41,1.0,2.75,1.0,N,142,234,1,...,0.5,4.26,0.00,1.0,25.56,2.5,,0.00,13.783333,1
5903902,2,2023-02-28 23:50:17,2023-03-01 00:14:33,1.0,8.36,1.0,N,186,7,1,...,0.5,0.00,6.55,1.0,48.85,2.5,,0.00,24.266667,1
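
Because the two monthly files spell the airport fee column differently, concatenating them leaves two half-empty columns. One possible way to avoid this is to rename the column before concatenating. The sketch below uses toy frames with made-up values, and `harmonize_columns` is a hypothetical helper that is not part of the original notebook:

```python
import pandas as pd

def harmonize_columns(df):
    # Hypothetical helper: map the February spelling onto the January one
    return df.rename(columns={"Airport_fee": "airport_fee"})

# Toy stand-ins for the two monthly files (values are illustrative only)
jan = pd.DataFrame({"VendorID": [1, 2], "airport_fee": [0.0, 1.25]})
feb = pd.DataFrame({"VendorID": [2], "Airport_fee": [0.0]})

# Naive concatenation keeps both spellings as separate columns
naive = pd.concat([jan, feb], ignore_index=True)

# Renaming first yields a single, fully populated column
clean = pd.concat(
    [harmonize_columns(jan), harmonize_columns(feb)],
    ignore_index=True,
)

print(list(naive.columns))
print(list(clean.columns))
```

With a rename like this, the combined frame would carry one fewer extra column, so any column count derived from it would need adjusting accordingly.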


In [27]:
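# Exclude the three columns that are not in the January file:
# 'Airport_fee' (February only), plus the derived 'duration' and 'valid'
print("Q1. January 2023 taxis data has", len(data.columns[:-3]), "columns")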

Q1. January 2023 taxis data has 19 columns


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

* 32.59
* 42.59 [X]
* 52.59
* 62.59

In [31]:
std = np.std(data[data["valid"] == 0]["duration"]).round(2)
print("Q2. Standard deviation of the trips duration in January: ", std)


Q2. Standard deviation of the trips duration in January:  42.6
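
As a small illustration of the duration computation, the sketch below converts a timedelta to minutes on three made-up trips. It also shows that `np.std` uses `ddof=0` by default while `Series.std` uses `ddof=1`. On millions of rows the two barely differ, but on small samples the gap is visible:

```python
import numpy as np
import pandas as pd

# Toy trips (timestamps are illustrative only)
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 00:10:00", "2023-01-01 00:30:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:05:00", "2023-01-01 00:25:00", "2023-01-01 01:10:00"]),
})

# Timedelta -> seconds -> minutes
delta = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
trips["duration"] = delta.dt.total_seconds() / 60

# Population standard deviation (divides by n)
population_std = np.std(trips["duration"].to_numpy())

# Sample standard deviation (divides by n - 1), the pandas default
sample_std = trips["duration"].std()

print(trips["duration"].tolist())
print(round(population_std, 2), round(sample_std, 2))
```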


## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* 98% [X]

In [34]:
def trim_data(data):
  df = data.copy()
  df = df[(df.duration >= 1) & (df.duration <= 60)]
  return df

data_trimmed = trim_data(data)

In [35]:
df_shape_before = data.shape[0]
df_shape_after = data_trimmed.shape[0]

print("Q3. After removing outliers, we keep ",round((df_shape_after/df_shape_before)*100, 2) ,"% of the data")


Q3. After removing outliers, we keep  98.07 % of the data
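
The same inclusive filter can also be written with `Series.between`, and the kept fraction falls out as the mean of the boolean mask. A minimal sketch on made-up durations:

```python
import pandas as pd

# Durations in minutes (illustrative values, including both boundaries)
toy_durations = pd.Series([0.2, 1.0, 12.5, 60.0, 75.0])

# between() is inclusive on both ends by default
keep_mask = toy_durations.between(1, 60)

# The mean of a boolean mask is the share of True values
fraction_kept = keep_mask.mean()

print(keep_mask.tolist())
print(f"{fraction_kept:.0%}")
```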


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to re-cast the ids to strings; otherwise the vectorizer
  treats them as numeric features instead of one-hot encoding them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2 
* 155
* 345
* 515 [X]
* 715

In [38]:
from sklearn.feature_extraction import DictVectorizer

def one_hot_encoding(data):
    df = data.copy()

    # Re-cast to get IDs treated as categories
    df['PULocationID'] = df['PULocationID'].astype(str)
    df['DOLocationID'] = df['DOLocationID'].astype(str)

    # Variable section
    categorical = ['PULocationID', 'DOLocationID']

    # DictVectorizer
    dv = DictVectorizer()

    dict_train = df[df.valid == 0][categorical].to_dict(orient='records')
    dict_valid = df[df.valid == 1][categorical].to_dict(orient='records')

    X_train = dv.fit_transform(dict_train)
    X_valid = dv.transform(dict_valid)

    y_train = df[df.valid == 0].duration.values
    y_valid = df[df.valid == 1].duration.values

    return X_train, y_train, X_valid, y_valid, dv

X_train, y_train, X_valid, y_valid, dv = one_hot_encoding(data_trimmed)


In [39]:
print("Q4. Number of columns of X vectorized:", X_train.shape[1])

Q4. Number of columns of X vectorized: 515


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 3.64
* 7.64 [X]
* 11.64
* 16.64
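
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand with NumPy on made-up numbers, which can serve as a fallback where scikit-learn's `root_mean_squared_error` is not available:

```python
import numpy as np

# Illustrative durations and predictions in minutes
toy_true = np.array([10.0, 20.0, 30.0])
toy_pred = np.array([12.0, 18.0, 33.0])

# Residuals -> squared -> mean -> square root
residuals = toy_true - toy_pred
toy_rmse = np.sqrt(np.mean(residuals ** 2))

print(round(float(toy_rmse), 2))
```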

In [41]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

In [42]:
def run_linear_regressor(X_train, y_train,
                         X_valid, y_valid):

    lr = LinearRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_valid)
    metric = root_mean_squared_error(y_valid, y_pred).round(2)

    return metric

train_rmse = run_linear_regressor(X_train, y_train,
                                  X_train, y_train)

In [43]:
print("Q5. RMSE for a LinearRegression only on train data:", train_rmse)


Q5. RMSE for a LinearRegression only on train data: 7.65


## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2023). 

What's the RMSE on validation?

* 3.81
* 7.81 [X]
* 11.81
* 16.81
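
One detail worth keeping in mind when transforming the validation month: a fitted `DictVectorizer` silently ignores categories it did not see during fitting, so the validation matrix keeps the training column layout. A minimal sketch with made-up location IDs:

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up IDs; "99" appears only in the validation records
toy_train = [{"PULocationID": "10"}, {"PULocationID": "30"}]
toy_valid = [{"PULocationID": "10"}, {"PULocationID": "99"}]

toy_dv = DictVectorizer()
toy_dv.fit(toy_train)

# transform() reuses the columns learned in fit()
toy_X_valid = toy_dv.transform(toy_valid)

print(toy_X_valid.shape)
print(toy_X_valid.toarray())
```

The row for the unseen ID comes out as all zeros, so the model falls back to its intercept for that feature.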

In [45]:
rmse = run_linear_regressor(
        X_train, y_train,
        X_valid, y_valid)

In [46]:
print("Q6. RMSE for a LinearRegression using train/validation data:", rmse)

Q6. RMSE for a LinearRegression using train/validation data: 7.81
