In [3]:
!python -V

Python 3.12.3


In [4]:
LINK_YELLOW_TAXI_DATA_202301 = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"
LINK_YELLOW_TAXI_DATA_202302 = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet"

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2023.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* **19** (answer)

In [37]:
# !wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
# !wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet
# !wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet

--2024-05-14 01:23:11--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 2600:9000:2503:9200:b:20a5:b140:21, 2600:9000:2503:ea00:b:20a5:b140:21, 2600:9000:2503:0:b:20a5:b140:21, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|2600:9000:2503:9200:b:20a5:b140:21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45616512 (44M) [application/x-www-form-urlencoded]
Saving to: ‘yellow_tripdata_2022-02.parquet’


2024-05-14 01:23:23 (4.17 MB/s) - ‘yellow_tripdata_2022-02.parquet’ saved [45616512/45616512]



In [6]:
import pandas as pd

In [7]:
# Load the January 2023 data (df2 is loaded later, in the Q6 cells)
df1 = pd.read_parquet("yellow_tripdata_2023-01.parquet")

In [8]:
list_columns_202301 = list(df1.columns)
print(list_columns_202301)
print("len(list_columns)", len(list_columns_202301))

['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge', 'airport_fee']
len(list_columns) 19


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip duration in January?

* 32.59
* **42.59** (answer)
* 52.59
* 62.59

In [9]:
# Trip duration as a timedelta, then converted to minutes
df1['duration_td'] = df1['tpep_dropoff_datetime'] - df1['tpep_pickup_datetime']
df1['duration'] = df1['duration_td'].dt.total_seconds() / 60
std_dev = df1['duration'].std()
print("df1['duration'].std()", std_dev)

df1['duration'].std() 42.594351241920904
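
The duration calculation can be sanity-checked on a tiny hand-made frame. This is a minimal sketch: the `toy` frame and its timestamps are made up for illustration and are not part of the taxi data. It uses the vectorized `.dt.total_seconds()` accessor and relies on `Series.std()` defaulting to the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical toy data: three trips lasting 10, 30 and 90 minutes
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 01:00:00", "2023-01-01 02:00:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:10:00", "2023-01-01 01:30:00", "2023-01-01 03:30:00"]),
})

# Subtracting two datetime columns gives a timedelta column;
# total_seconds() / 60 converts it to minutes as floats
toy["duration"] = (
    toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
).dt.total_seconds() / 60

toy_durations = toy["duration"].tolist()
toy_std = toy["duration"].std()  # sample standard deviation (ddof=1)
print(toy_durations, toy_std)
```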


## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* **98%** (answer)


In [10]:
original_len = len(df1)

df1 = df1[(df1['duration'] >= 1) & (df1['duration'] <= 60)]

new_len = len(df1)
ratio = new_len / original_len

print("ratio:", ratio)


ratio: 0.9812202822125979
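
The inclusive bounds matter at the edges of the range. As a small sketch on made-up durations (the `toy` frame is hypothetical), `Series.between` expresses the same filter and includes both endpoints by default:

```python
import pandas as pd

# Hypothetical durations in minutes, including values exactly on the bounds
toy = pd.DataFrame({"duration": [0.5, 1.0, 12.0, 60.0, 60.5, 240.0]})

# between() is inclusive on both ends by default, matching (>= 1) & (<= 60)
mask = toy["duration"].between(1, 60)
kept = toy[mask]

kept_durations = kept["duration"].tolist()
toy_ratio = len(kept) / len(toy)
print(kept_durations, toy_ratio)
```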


In [18]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error
from sklearn.metrics import root_mean_squared_error

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to re-cast the ids to strings - otherwise it will 
  label encode them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* **515** (answer)
* 715

In [12]:
# df1['PU_DO'] = df1['PULocationID'].astype(str) + '_' + df1['DOLocationID'].astype(str)
df1['PU'] = 'PU_' + df1['PULocationID'].astype(str)
df1['DO'] = 'DO_' + df1['DOLocationID'].astype(str)


categorical = ['PU', 'DO']
# categorical = ['PU_DO']
numerical = []

dv = DictVectorizer()

train_dicts = df1[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

target = 'duration'
y_train = df1[target].values

In [13]:
print(X_train.shape)

(3009173, 515)


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 3.64
* **7.64** (answer)
* 11.64
* 16.64

In [28]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

# mean_squared_error(y_train, y_pred, squared=False)
root_mean_squared_error(y_train, y_pred)

7.649261929201487
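
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand with NumPy on made-up numbers; the `rmse` helper is a hypothetical name, not part of scikit-learn:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Hypothetical helper: root mean squared error of two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are -2, 0 and 4, so the mean squared error is 20 / 3
toy_rmse = rmse([10.0, 20.0, 30.0], [12.0, 20.0, 26.0])
print(toy_rmse)
```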

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022). 

What's the RMSE on validation?

* 3.81
* 7.81
* 11.81
* **16.81** (closest option; the computed value is 46.81)

In [38]:
df2 = pd.read_parquet("yellow_tripdata_2022-02.parquet")

In [39]:
PU_set = set(df1['PU'])
DO_set = set(df1['DO'])

In [40]:
# Same duration calculation as for the training data
df2['duration_td'] = df2['tpep_dropoff_datetime'] - df2['tpep_pickup_datetime']
df2['duration'] = df2['duration_td'].dt.total_seconds() / 60


df2['PU'] = 'PU_' + df2['PULocationID'].astype(str)
df2['DO'] = 'DO_' + df2['DOLocationID'].astype(str)


# df2 = df2[df2['PU'].isin(PU_set) & df2['DO'].isin(DO_set)]

categorical = ['PU', 'DO']
# categorical = ['PU_DO']
numerical = []

val_dicts = df2[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

target = 'duration'
y_val = df2[target].values

In [41]:
# lr2 = LinearRegression()
# lr2.fit(X_train, y_train)
y_pred_val = lr.predict(X_val)

# Older scikit-learn equivalent: mean_squared_error(y_val, y_pred_val, squared=False)
root_mean_squared_error(y_val, y_pred_val)

46.811056597558185
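
The validation cells above compute `duration` for `df2` but do not apply the 1 to 60 minute filter that was applied to `df1` in Q3, so extreme durations remain in `y_val`. That may be one reason the validation RMSE is so much larger than the training RMSE, though this notebook does not verify it. A minimal sketch of sharing the preprocessing between both frames, assuming a hypothetical `prepare` helper and made-up sample rows:

```python
import pandas as pd

def prepare(df):
    """Hypothetical helper: apply the same steps to training and validation frames."""
    df = df.copy()
    df["duration"] = (
        df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    ).dt.total_seconds() / 60
    # Keep only trips between 1 and 60 minutes (inclusive), as in Q3
    df = df[df["duration"].between(1, 60)].copy()
    # Cast location IDs to prefixed strings so they are one-hot encoded
    df["PU"] = "PU_" + df["PULocationID"].astype(str)
    df["DO"] = "DO_" + df["DOLocationID"].astype(str)
    return df

# Made-up rows: a 15-minute trip and a 5-hour trip
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-02-01 08:00", "2023-02-01 09:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-02-01 08:15", "2023-02-01 14:00"]),
    "PULocationID": [161, 237],
    "DOLocationID": [236, 132],
})

prepared = prepare(toy)
print(prepared[["duration", "PU", "DO"]])
```

With a helper like this, both months would go through identical filtering before `dv.fit_transform` and `dv.transform` are called.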

In [42]:
ls = Lasso(alpha=0.01)  # L1-regularized linear regression
ls.fit(X_train, y_train)

y_pred_val2 = ls.predict(X_val)

# Older scikit-learn equivalent: mean_squared_error(y_val, y_pred_val2, squared=False)
root_mean_squared_error(y_val, y_pred_val2)

46.883219189655975