<a href="https://colab.research.google.com/github/na-learning/ML/blob/main/ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

Load the NYC taxi dataset into a Pandas DataFrame and run a few basic checks to confirm the data loaded properly. Several months of data are available; for simplicity, use the Yellow Taxi 2022-01 parquet file here. Here are your tasks:

1. Load the yellow_tripdata_2022-01.parquet file into Pandas.
2. Print the first 5 rows of data.
3. Drop any rows of data that contain NULL values.
4. Create a new feature, 'trip_duration' that captures the duration of the trip in minutes.
5. Create a variable named 'target_variable' to store the name of the column we're trying to predict, 'total_amount'.
6. Create a list called 'feature_cols' containing the feature names that we'll be using to predict our target variable. The list should contain 'VendorID', 'trip_distance', 'payment_type', 'PULocationID', 'DOLocationID', and 'trip_duration'.

In [76]:
# Load the dataset into a pandas DataFrame (from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
data = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet', engine='pyarrow')

In [77]:
# Display the first few rows of the dataset
print(data.shape)
print(data.head(5))

(2463931, 19)
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2022-01-01 00:35:40   2022-01-01 00:53:29              2.0   
1         1  2022-01-01 00:33:43   2022-01-01 00:42:07              1.0   
2         2  2022-01-01 00:53:21   2022-01-01 01:02:19              1.0   
3         2  2022-01-01 00:25:21   2022-01-01 00:35:23              1.0   
4         2  2022-01-01 00:36:48   2022-01-01 01:14:20              1.0   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0           3.80         1.0                  N           142           236   
1           2.10         1.0                  N           236            42   
2           0.97         1.0                  N           166           166   
3           1.09         1.0                  N           114            68   
4           4.30         1.0                  N            68           163   

   payment_type  fare_amount  extra  mta_tax  tip_amount  to

In [78]:
# Drop rows with missing values.
data = data.dropna(axis=0)
print(data.shape)

(2392428, 19)


In [79]:
# Create new feature, 'trip_duration', in minutes.
# .dt.seconds is the seconds component of the timedelta; whole days are not counted.
duration = data['tpep_dropoff_datetime'] - data['tpep_pickup_datetime']
data = data.assign(trip_duration=duration.dt.seconds / 60)
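One caveat on the duration calculation: the `.seconds` attribute of a timedelta returns only the seconds component (0 to 86399) and drops whole days, while `total_seconds()` returns the full elapsed time. The sketch below uses made-up timestamps (not rows from the taxi file) and hypothetical column names `pickup` and `dropoff` to show the difference.

```python
import pandas as pd

# Hypothetical pickup/dropoff times, invented for illustration.
trips = pd.DataFrame({
    "pickup": pd.to_datetime(["2022-01-01 00:35:40", "2022-01-01 23:50:00"]),
    "dropoff": pd.to_datetime(["2022-01-01 00:53:29", "2022-01-03 00:10:00"]),
})

delta = trips["dropoff"] - trips["pickup"]

# .dt.seconds keeps only the seconds component and ignores whole days.
minutes_component = delta.dt.seconds / 60
# .dt.total_seconds() covers the full elapsed time.
minutes_total = delta.dt.total_seconds() / 60

print(minutes_component.tolist())
print(minutes_total.tolist())
```

For the second invented trip, which spans more than a day, the two columns differ by exactly 1440 minutes. Whether such rows exist in the January file is not checked here.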

In [80]:
target_variable = "total_amount"

In [81]:
# Create a list called feature_cols to store the feature column names
feature_cols = ['VendorID', 'trip_distance', 'payment_type', 'PULocationID', 'DOLocationID', 'trip_duration']

Use Scikit-Learn's train_test_split to split the data into training and test sets. Don't forget to set the random state.

In [None]:
# Split dataset into training and test sets
X = data[feature_cols]
y = data[target_variable]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
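As a quick illustration of why the random state matters, the sketch below splits a tiny invented array twice with the same seed and checks that both splits select the same rows. The names `X_demo` and `y_demo` are placeholders, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny invented dataset: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same seed, same rows in the test set.
print(split_a[3], split_b[3])
# With test_size=0.2, 8 rows go to training and 2 to testing.
print(len(split_a[0]), len(split_a[1]))
```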


Create a model that always predicts the mean total fare of the training dataset. Use Scikit-Learn's mean_absolute_error to evaluate this model. Is it any good?

In [83]:
# Create a baseline for mean absolute error of total amount
from sklearn.dummy import DummyRegressor
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X_train, y_train)
y_pred = dummy_regr.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print(mae)


9.198227928516678
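To see what this baseline measures: `DummyRegressor(strategy="mean")` ignores the features and predicts the training-set mean for every row, so its MAE is the average distance of the test targets from that mean. A toy sketch with invented fare values (the `_toy` names are placeholders):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Invented fare values, for illustration only.
y_train_toy = np.array([10.0, 12.0, 14.0, 20.0])
y_test_toy = np.array([8.0, 16.0, 30.0])
X_train_toy = np.zeros((len(y_train_toy), 1))  # features are ignored by the dummy model
X_test_toy = np.zeros((len(y_test_toy), 1))

dummy = DummyRegressor(strategy="mean").fit(X_train_toy, y_train_toy)
mae_dummy = mean_absolute_error(y_test_toy, dummy.predict(X_test_toy))

# The same number computed by hand from the training mean.
mae_manual = np.mean(np.abs(y_test_toy - y_train_toy.mean()))

print(mae_dummy, mae_manual)
```

Any model worth keeping should land clearly below this number on the same test set.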


1. Use Scikit-Learn's ColumnTransformer to preprocess the categorical and continuous features independently. Apply the StandardScaler to the continuous columns and OneHotEncoder to the categorical columns.

2. Integrate the preprocessor in the previous step with Scikit-Learn's LinearRegression model using a Pipeline.

3. Train the pipeline on the training data.

4. Evaluate the model using mean absolute error as a metric on the test data. Does the model beat the baseline?

In [84]:
# Use Scikit-Learn's ColumnTransformer to preprocess the categorical and
# continuous features independently.
numerical_x = X.select_dtypes(include=['int64', 'float64']).columns
categorical_x = X.select_dtypes(include=['object', 'bool']).columns

print(numerical_x)
print(categorical_x)


t = [('num', StandardScaler(), numerical_x), ('cat', OneHotEncoder(), categorical_x)]

transformer = ColumnTransformer(transformers=t, remainder='passthrough')

Index(['VendorID', 'trip_distance', 'payment_type', 'PULocationID',
       'DOLocationID', 'trip_duration'],
      dtype='object')
Index([], dtype='object')
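The printed output shows that dtype-based selection finds no categorical columns, because the ID-like features are stored as integers. As a result, every feature is standardized and `OneHotEncoder` receives nothing. One possible alternative is to list the categorical columns explicitly. The sketch below does this on a tiny invented sample; the split into categorical and continuous columns is an assumption about how the features should be treated, not something the dataset declares.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed split of the feature columns; the ID-like columns are integer codes.
categorical_cols = ["VendorID", "payment_type", "PULocationID", "DOLocationID"]
continuous_cols = ["trip_distance", "trip_duration"]

# Tiny invented sample with the same column names as the notebook.
X_toy = pd.DataFrame({
    "VendorID": [1, 2, 2, 1],
    "trip_distance": [3.8, 2.1, 0.97, 1.09],
    "payment_type": [1, 1, 2, 1],
    "PULocationID": [142, 236, 166, 114],
    "DOLocationID": [236, 42, 166, 68],
    "trip_duration": [17.8, 8.4, 9.0, 10.0],
})
y_toy = pd.Series([21.95, 13.3, 10.56, 11.8])  # invented targets

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), continuous_cols),
    # handle_unknown="ignore" encodes unseen categories as all zeros at predict time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

toy_pipeline = Pipeline(steps=[("trans", preprocessor), ("model", LinearRegression())])
toy_pipeline.fit(X_toy, y_toy)

# 2 scaled columns + 2 + 2 + 4 + 4 one-hot columns.
encoded = preprocessor.fit_transform(X_toy)
print(encoded.shape)
```

On the full dataset the location IDs would expand into a few hundred one-hot columns each, so the effect on the MAE would need to be measured rather than assumed.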


In [85]:
# Create a pipeline object containing the column transformations and regression
# model.
pipeline = Pipeline(steps=[('trans', transformer), ('model', LinearRegression())])

In [87]:
# Fit the pipeline on the training data.
pipeline.fit(X_train, y_train)

In [88]:
# Make predictions on the test data.
y_pred = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(mae)

2.9656198158340654


Yes. The linear regression model beats the baseline: its test MAE of about 2.97 is well below the baseline's 9.20.

1. Build a Random Forest Regressor model using Scikit-Learn's RandomForestRegressor and train it on the train data.

2. Evaluate the performance of the model on the test data using mean absolute error as a metric. Mess around with various input parameter configurations to see how they affect the model. Can you beat the performance of the linear regression model?

In [None]:
# Build random forest regressor model
rfr = RandomForestRegressor(n_estimators=100, random_state=42)
rfr.fit(X_train, y_train)
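The evaluation step from task 2 could look like the following. This is a minimal sketch on synthetic data so that it runs quickly. The parameter values are arbitrary examples, and the `_syn` names and the `results` dictionary are placeholders rather than part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the taxi features: the target grows with both inputs.
rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 10, size=(500, 2))
y_syn = 2.5 + 3.0 * X_syn[:, 0] + 0.5 * X_syn[:, 1] + rng.normal(0, 0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Arbitrary example configurations; n_jobs=-1 uses all available CPU cores.
results = {}
for n_estimators, max_depth in [(10, 3), (50, 8), (100, None)]:
    model = RandomForestRegressor(
        n_estimators=n_estimators, max_depth=max_depth, random_state=42, n_jobs=-1
    )
    model.fit(X_tr, y_tr)
    results[(n_estimators, max_depth)] = mean_absolute_error(y_te, model.predict(X_te))

for config, mae_value in results.items():
    print(config, round(mae_value, 3))
```

On the real data, the same loop would use `X_train`, `X_test`, `y_train`, and `y_test`. Fitting on roughly 1.9 million training rows can take a while, so trying configurations on a sample first may be practical.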

