## Homework
**Zoomcamp homework link:** https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2024/01-intro/homework.md  

For this homework, we will use the 2023 January and February Yellow Taxi trip records.

### Data Download Links
- **January:** https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
- **February:** https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet
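
The cells below read both files from a local `data/` directory. A minimal sketch of fetching them there, assuming the URL pattern in the links above; the helper names `trip_data_url` and `download_trip_data` are hypothetical and not part of the homework:

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def trip_data_url(year: int, month: int) -> str:
    """Build the download URL for one month of Yellow Taxi trip records."""
    return f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"


def download_trip_data(year: int, month: int, target_dir: str = "data") -> Path:
    """Download one month into target_dir, skipping files that already exist."""
    url = trip_data_url(year, month)
    target = Path(target_dir) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(url, target)
    return target


# Usage (needs network access):
# for month in (1, 2):
#     download_trip_data(2023, month)
```

The download call is left commented out so the block can be run without touching the network.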

### Reading the data

In [1]:
import pandas as pd

df_yellow_jan23 = pd.read_parquet('data/yellow_tripdata_2023-01.parquet')
df_yellow_feb23 = pd.read_parquet('data/yellow_tripdata_2023-02.parquet')

In [None]:
df_yellow_feb23.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee
0,1,2023-02-01 00:32:53,2023-02-01 00:34:34,2.0,0.3,1.0,N,142,163,2,4.4,3.5,0.5,0.0,0.0,1.0,9.4,2.5,0.0
1,2,2023-02-01 00:35:16,2023-02-01 00:35:30,1.0,0.0,1.0,N,71,71,4,-3.0,-1.0,-0.5,0.0,0.0,-1.0,-5.5,0.0,0.0
2,2,2023-02-01 00:35:16,2023-02-01 00:35:30,1.0,0.0,1.0,N,71,71,4,3.0,1.0,0.5,0.0,0.0,1.0,5.5,0.0,0.0
3,1,2023-02-01 00:29:33,2023-02-01 01:01:38,0.0,18.8,1.0,N,132,26,1,70.9,2.25,0.5,0.0,0.0,1.0,74.65,0.0,1.25
4,2,2023-02-01 00:12:28,2023-02-01 00:25:46,1.0,3.22,1.0,N,161,145,1,17.0,1.0,0.5,3.3,0.0,1.0,25.3,2.5,0.0


In [3]:
print(f"The data for January has {len(df_yellow_jan23.columns)} columns")

The data for January has 19 columns


### Computing Duration

In [4]:
df_train = df_yellow_jan23.copy()
# trip duration in minutes, using the vectorized .dt accessor instead of a per-row apply
df_train['duration'] = (df_train.tpep_dropoff_datetime - df_train.tpep_pickup_datetime).dt.total_seconds() / 60
df_train.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,duration
0,2,2023-01-01 00:32:10,2023-01-01 00:40:36,1.0,0.97,1.0,N,161,141,2,9.3,1.0,0.5,0.0,0.0,1.0,14.3,2.5,0.0,8.433333
1,2,2023-01-01 00:55:08,2023-01-01 01:01:27,1.0,1.1,1.0,N,43,237,1,7.9,1.0,0.5,4.0,0.0,1.0,16.9,2.5,0.0,6.316667
2,2,2023-01-01 00:25:04,2023-01-01 00:37:49,1.0,2.51,1.0,N,48,238,1,14.9,1.0,0.5,15.0,0.0,1.0,34.9,2.5,0.0,12.75
3,1,2023-01-01 00:03:48,2023-01-01 00:13:25,0.0,1.9,1.0,N,138,7,1,12.1,7.25,0.5,0.0,0.0,1.0,20.85,0.0,1.25,9.616667
4,2,2023-01-01 00:10:29,2023-01-01 00:21:19,1.0,1.43,1.0,N,107,79,1,11.4,1.0,0.5,3.28,0.0,1.0,19.68,2.5,0.0,10.833333


In [5]:
std_dev = df_train['duration'].std()
print(f'The standard deviation of the trip durations in January is {std_dev}')

The standard deviation of the trip durations in January is 42.594351241920904
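
The duration step is a datetime subtraction followed by a conversion to minutes. A small sketch on a toy frame, using the pickup and dropoff times of the first two January rows shown above; the names `trips` and `delta` are only for illustration:

```python
import pandas as pd

# Two trips taken from the first rows of the January data shown above
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-01-01 00:32:10", "2023-01-01 00:55:08"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-01-01 00:40:36", "2023-01-01 01:01:27"]),
})

# Subtracting two datetime columns gives a timedelta column;
# .dt.total_seconds() converts the whole column to seconds at once
delta = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
trips["duration"] = delta.dt.total_seconds() / 60

print(trips["duration"].round(6).tolist())
```

Note that `Series.std()` in pandas uses the sample standard deviation (`ddof=1`) by default, which is what the figure above reports.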


### Dropping outliers

In [6]:
# We will keep only the records where the duration was between 1 and 60 minutes
len_data = len(df_train)
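# .copy() keeps the filtered frame independent, so assigning columns later does not raise SettingWithCopyWarning
df_train = df_train[(df_train.duration >= 1) & (df_train.duration <= 60)].copy()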
len_data_betw_1_and_60 = len(df_train)

print(f"After dropping the outliers, {len_data_betw_1_and_60/len_data * 100}% of the records are left.")

After dropping the outliers, 98.1220282212598% of the records are left.
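
The same inclusive 1 to 60 minute filter can also be written with `Series.between`. A sketch on made-up durations (the values in `toy` are hypothetical):

```python
import pandas as pd

# Hypothetical durations in minutes
toy = pd.DataFrame({"duration": [0.2, 1.0, 8.4, 60.0, 75.5]})

# between() is inclusive on both ends by default, matching (>= 1) & (<= 60)
mask = toy["duration"].between(1, 60)
kept = toy[mask].copy()

# The mean of a boolean mask is the fraction of True values
share_kept = mask.mean() * 100
print(f"{share_kept:.1f}% of the records are left")
```

Taking the mean of the mask gives the retained share directly, without storing the row counts before and after filtering.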


### One-hot encoding

In [7]:
categorical_features = ['PULocationID', 'DOLocationID']
numerical_features = ['trip_distance']

# casting feature columns as strings for one hot encoding
df_train[categorical_features] = df_train[categorical_features].astype(str)

# turning the df into a list of dictionaries
train_dicts = df_train[categorical_features + numerical_features].to_dict(orient='records')

In [8]:
train_dicts[:5]

[{'PULocationID': '161', 'DOLocationID': '141', 'trip_distance': 0.97},
 {'PULocationID': '43', 'DOLocationID': '237', 'trip_distance': 1.1},
 {'PULocationID': '48', 'DOLocationID': '238', 'trip_distance': 2.51},
 {'PULocationID': '138', 'DOLocationID': '7', 'trip_distance': 1.9},
 {'PULocationID': '107', 'DOLocationID': '79', 'trip_distance': 1.43}]

In [9]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

# The dictionary vectorizer automatically converts string features into one-hot encoded columns

In [10]:
X_train

<3009173x516 sparse matrix of type '<class 'numpy.float64'>'
	with 9027519 stored elements in Compressed Sparse Row format>

In [11]:
dv.get_feature_names_out()

array(['DOLocationID=1', 'DOLocationID=10', 'DOLocationID=100',
       'DOLocationID=101', 'DOLocationID=102', 'DOLocationID=106',
       'DOLocationID=107', 'DOLocationID=108', 'DOLocationID=109',
       'DOLocationID=11', 'DOLocationID=111', 'DOLocationID=112',
       'DOLocationID=113', 'DOLocationID=114', 'DOLocationID=115',
       'DOLocationID=116', 'DOLocationID=117', 'DOLocationID=118',
       'DOLocationID=119', 'DOLocationID=12', 'DOLocationID=120',
       'DOLocationID=121', 'DOLocationID=122', 'DOLocationID=123',
       'DOLocationID=124', 'DOLocationID=125', 'DOLocationID=126',
       'DOLocationID=127', 'DOLocationID=128', 'DOLocationID=129',
       'DOLocationID=13', 'DOLocationID=130', 'DOLocationID=131',
       'DOLocationID=132', 'DOLocationID=133', 'DOLocationID=134',
       'DOLocationID=135', 'DOLocationID=136', 'DOLocationID=137',
       'DOLocationID=138', 'DOLocationID=139', 'DOLocationID=14',
       'DOLocationID=140', 'DOLocationID=141', 'DOLocationID=142',
  

In [12]:
print(f'The dimensionality of the feature matrix is {X_train.shape[1]}')

The dimensionality of the feature matrix is 516


### Training a model

In [13]:
target = 'duration'
y_train = df_train[target].values
y_train

array([ 8.43333333,  6.31666667, 12.75      , ..., 24.51666667,
       13.        , 14.4       ])

In [14]:
from sklearn.metrics import root_mean_squared_error

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

rmse_train = root_mean_squared_error(y_train, y_pred)

In [15]:
print(f'The RMSE on training data is {rmse_train}')

The RMSE on training data is 7.649143388169879
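
RMSE is the square root of the mean squared difference between actual and predicted durations, so it is expressed in minutes like the target itself. A plain-Python sketch of the formula on made-up numbers (the `rmse` helper and the values are hypothetical, not the notebook's data):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean of squared residuals."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical durations (minutes) and predictions
actual = [8.0, 6.0, 13.0]
predicted = [10.0, 6.0, 12.0]

# Squared errors are 4, 0 and 1, so the result is sqrt(5 / 3)
error = rmse(actual, predicted)
print(error)
```

`root_mean_squared_error` from `sklearn.metrics` computes the same quantity; it is only available in recent scikit-learn releases, so an older installation may need an upgrade.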


### Evaluating the model

In [None]:
# Executing this all in one cell seems to keep killing the kernel, so the steps are split across the cells below

# similarly, generating X_validation from february data

# def get_training_data(raw_data):
#     categorical_features = ['PULocationID', 'DOLocationID']
#     numerical_features = ['trip_distance']

#     # casting feature columns as strings for one hot encoding
#     raw_data[categorical_features] = raw_data[categorical_features].astype(str)

#     # turning the df into a list of dictionaries
#     train_dicts = raw_data[categorical_features + numerical_features].to_dict(orient='records')

#     dv = DictVectorizer()
#     X_train = dv.fit_transform(train_dicts)

#     return X_train

# X_val = get_training_data(df_yellow_feb23)
# X_val

In [16]:
categorical_features = ['PULocationID', 'DOLocationID']
numerical_features = ['trip_distance']

# casting feature columns as strings for one hot encoding
df_yellow_feb23[categorical_features] = df_yellow_feb23[categorical_features].astype(str)

In [17]:
# turning the df into a list of dictionaries
val_dicts = df_yellow_feb23[categorical_features + numerical_features].to_dict(orient='records')

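# reuse the DictVectorizer fitted on the training data so that X_val has the same
# columns as X_train; fitting a new vectorizer here would produce a different feature space
X_val = dv.transform(val_dicts)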
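
For the validation features to line up with the columns the model was trained on, the vectorizer fitted on the training dicts has to be applied with `transform` rather than fitted again. A sketch on toy records (the `*_demo` names and the zone `'999'` are made up for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

train_records = [
    {"PULocationID": "161", "DOLocationID": "141", "trip_distance": 0.97},
    {"PULocationID": "43", "DOLocationID": "237", "trip_distance": 1.1},
]
# The second validation record has a pickup zone that never occurs in training
val_records = [
    {"PULocationID": "161", "DOLocationID": "237", "trip_distance": 2.0},
    {"PULocationID": "999", "DOLocationID": "141", "trip_distance": 0.5},
]

dv_demo = DictVectorizer()
X_train_demo = dv_demo.fit_transform(train_records)  # learns the column layout
X_val_demo = dv_demo.transform(val_records)  # reuses that layout

# Both matrices have the same number of columns; the unseen zone is silently dropped
print(X_train_demo.shape, X_val_demo.shape)
print(X_val_demo.toarray())
```

A value that was not seen during fitting gets no column, so that record simply has zeros in all pickup columns. Before an RMSE on February can be computed, the February frame presumably also needs the same duration column and 1 to 60 minute filter as the training data.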
