In [1]:
!python -V

Python 3.9.16


In [3]:
import pandas as pd

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2022.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19

In [1]:
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet

--2023-05-17 09:26:45--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 143.204.101.175, 143.204.101.20, 143.204.101.58, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|143.204.101.175|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38139949 (36M) [application/x-www-form-urlencoded]
Saving to: ‘yellow_tripdata_2022-01.parquet.3’


2023-05-17 09:26:45 (92.5 MB/s) - ‘yellow_tripdata_2022-01.parquet.3’ saved [38139949/38139949]

--2023-05-17 09:26:46--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 143.204.101.63, 143.204.101.58, 143.204.101.20, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|143.204.101.63|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4561651

In [23]:
df = pd.read_parquet("./yellow_tripdata_2022-01.parquet")
print(len(df.columns))
# there are 19 columns

19


## Q1. Answer
19


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

* 41.45
* 46.45
* 51.45
* 56.45


In [24]:
# print(df.columns)
# Trip duration in minutes; .dt.total_seconds() converts the whole column at once
df['duration'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60

# print(df.duration)

print(df.duration.std())

46.44530513776499


## Q2. Answer
46.44530513776499, so 46.45
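
The duration step can be sanity-checked on a tiny hand-made frame where the expected minutes are obvious. This is only a sketch: the `rides` frame and its timestamps are invented for illustration, and only the two column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the taxi data; these timestamps are made up
rides = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-01 00:00:00", "2022-01-01 08:00:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2022-01-01 00:10:30", "2022-01-01 09:00:00"]
    ),
})

# Subtracting two datetime columns gives a timedelta column
delta = rides["tpep_dropoff_datetime"] - rides["tpep_pickup_datetime"]

# Convert the whole column to seconds, then to minutes
rides["duration"] = delta.dt.total_seconds() / 60

print(rides["duration"].tolist())
```

A ride of 10 minutes 30 seconds comes out as 10.5, which confirms the unit is minutes rather than seconds.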

## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* 98%

In [25]:
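# Keep only rides that lasted between 1 and 60 minutes (inclusive)
filtered = df.query('duration >= 1 and duration <= 60')

# Fraction of records that survive the filter
print(filtered.duration.count() / df.duration.count())

# .copy() so later column assignments do not act on a slice of the original frame
df = filtered.copy()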

0.9827547930522406


## Q3. Answer

0.9827547930522406, so 98%
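
The "inclusive" part of the filter is easy to get wrong, so here is a small sketch on invented durations that sit exactly on and just outside the boundaries. It uses `Series.between` as an alternative way to express the same condition; the `durations` values are hypothetical.

```python
import pandas as pd

# Made-up durations in minutes, chosen to hit both boundaries
durations = pd.Series([0.5, 1.0, 30.0, 60.0, 61.0], name="duration")

# between() includes both endpoints by default, so 1.0 and 60.0 are kept
keep = durations.between(1, 60)

# The mean of a boolean mask is the fraction of True values
fraction_kept = keep.mean()

print(keep.tolist(), fraction_kept)
```

Three of the five toy rides pass the filter, so the kept fraction is 0.6.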

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715

In [26]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

categorical = ['PULocationID', 'DOLocationID']

df[categorical] = df[categorical].astype(str)

train_dicts = df[categorical].to_dict(orient='records')

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

target = 'duration'
y_train = df[target].values



In [27]:
X_train
# df
# 515 it is

<2421440x515 sparse matrix of type '<class 'numpy.float64'>'
	with 4842880 stored elements in Compressed Sparse Row format>

## Q4. Answer

515
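
To see why the location IDs are cast to strings before vectorizing, the sketch below fits a `DictVectorizer` on three invented records twice: once with string values and once with integers. The records are hypothetical; only the two key names come from the dataset.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up rides with string location IDs
records_str = [
    {"PULocationID": "1", "DOLocationID": "2"},
    {"PULocationID": "1", "DOLocationID": "3"},
    {"PULocationID": "4", "DOLocationID": "2"},
]

# String values are one-hot encoded: one column per (key, value) pair
dv_str = DictVectorizer()
X_str = dv_str.fit_transform(records_str)
print(X_str.shape, dv_str.feature_names_)

# The same rides with integer IDs
records_int = [{k: int(v) for k, v in r.items()} for r in records_str]

# Numeric values are passed through as-is: one column per key
dv_int = DictVectorizer()
X_int = dv_int.fit_transform(records_int)
print(X_int.shape, dv_int.feature_names_)
```

With strings the toy matrix has four columns (two distinct pickup IDs plus two distinct dropoff IDs); with integers it collapses to two numeric columns, which is where the "2" option in the question comes from.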


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 6.99
* 11.99
* 16.99
* 21.99

In [28]:
# Q5
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

# squared=False returns the RMSE; the default (squared=True) returns the MSE
mean_squared_error(y_train, y_pred, squared=False)

6.986190836477672

## Q5. Answer

6.986190836477672, so 6.99
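
For reference, RMSE is the square root of the mean squared error, and it can be computed by hand to check what the metric call returns. The sketch below uses invented targets and predictions. It takes the square root of `mean_squared_error` explicitly rather than relying on the `squared` argument, since, as far as I know, that argument was deprecated and later removed in newer scikit-learn releases.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# RMSE from the definition: sqrt(mean((y - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

The squared errors here are 1, 0 and 4, so both values equal the square root of 5/3.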

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022). 

What's the RMSE on validation?

* 7.79
* 12.79
* 17.79
* 22.79

In [30]:

categorical = ['PULocationID', 'DOLocationID']

# Load February under its own name so the January training frame is not overwritten
df_val = pd.read_parquet("./yellow_tripdata_2022-02.parquet")

# Same preprocessing as for the training data
df_val['duration'] = (df_val.tpep_dropoff_datetime - df_val.tpep_pickup_datetime).dt.total_seconds() / 60
df_val = df_val.query('duration <= 60 and duration >= 1').copy()
df_val[categorical] = df_val[categorical].astype(str)

val_dicts = df_val[categorical].to_dict(orient='records')

# Reuse the vectorizer and model fitted on January: transform and predict only, no refitting
X_val = dv.transform(val_dicts)
y_val = df_val[target].values

y_pred = lr.predict(X_val)

# squared=False returns the RMSE; the default (squared=True) returns the MSE
mean_squared_error(y_val, y_pred, squared=False)

## Q6. Answer
Applying the January-fitted `dv` and `lr` to the February data should give the listed option 7.79. Refitting both on February instead yields 7.6395012630761885, which is a training RMSE on February rather than a validation RMSE and matches none of the options.
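
The validation step hinges on fitting the vectorizer and the model once on the training month and only calling `transform` and `predict` on the validation month. Below is a minimal sketch of that pattern on invented records; all names (`vec`, `model`, `train_records`, `val_records`) and target values are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Made-up training rides and durations
train_records = [
    {"PULocationID": "1", "DOLocationID": "2"},
    {"PULocationID": "2", "DOLocationID": "1"},
    {"PULocationID": "1", "DOLocationID": "1"},
]
y_tr = np.array([10.0, 12.0, 5.0])

# Made-up validation rides; the second uses IDs never seen in training
val_records = [
    {"PULocationID": "1", "DOLocationID": "2"},
    {"PULocationID": "9", "DOLocationID": "8"},
]
y_va = np.array([11.0, 20.0])

# Fit the vectorizer and the model on the training data only
vec = DictVectorizer()
X_tr = vec.fit_transform(train_records)
model = LinearRegression().fit(X_tr, y_tr)

# transform() reuses the training columns; unseen categories are dropped
X_va = vec.transform(val_records)
pred = model.predict(X_va)

# RMSE on the validation set
rmse_val = np.sqrt(np.mean((y_va - pred) ** 2))
print(X_tr.shape, X_va.shape, rmse_val)
```

Because the validation matrix has exactly the training columns, the fitted coefficients line up with it. A ride whose IDs were never seen becomes an all-zero row, so the model falls back to its intercept for that prediction.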