PACE stages
Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.


PACE: Plan
Consider the questions in your PACE Strategy Document to reflect on the Plan stage.

In this stage, consider the following questions:

What are you being asked to do?
What are the ethical implications of the model? What are the consequences of your model making errors?

What is the likely effect of the model when it predicts a false negative (i.e., when the model says a customer will give a tip, but they actually won't)?

What is the likely effect of the model when it predicts a false positive (i.e., when the model says a customer will not give a tip, but they actually will)?

Do the benefits of such a model outweigh the potential problems?

Would you proceed with the request to build this model? Why or why not?

Can the objective be modified to make it less problematic?

Exemplar responses:

Question 1:

Predict if a customer will not leave a tip.

Question 2:

Drivers who didn't receive tips will probably be upset that the app told them a customer would leave a tip. If it happened often, drivers might not trust the app. Drivers are unlikely to pick up people who are predicted to not leave tips. Customers will have difficulty finding a taxi that will pick them up, and might get angry at the taxi company. Even when the model is correct, people who can't afford to tip will find it more difficult to get taxis, which limits the accessibility of taxi service to those who pay extra.

Question 3:

It's not good to disincentivize drivers from picking up customers. It could also cause a customer backlash. The problems seem to outweigh the benefits.

Question 4:

No. Effectively limiting equal access to taxis is ethically problematic, and carries a lot of risk.

Question 5:

We can build a model that predicts the most generous customers. This could accomplish the goal of helping taxi drivers increase their earnings from tips while preventing the wrongful exclusion of certain people from using taxis.

Suppose you modified the modeling objective so that, instead of predicting people who won't tip at all, you predicted people who are particularly generous: those who will tip 20% or more. Consider the following questions:

Exemplar responses:

Question 1: What features do you need to make this prediction?

Ideally, we'd have behavioral history for each customer, so we could know how much they tipped on previous taxi rides. We'd also want times, dates, and locations of both pickups and dropoffs, estimated fares, and payment method.

Question 2: What would be the target variable?

The target variable would be a binary variable (1 or 0) that indicates whether or not the customer is expected to tip ≥ 20%.

Question 3: What metric should you use to evaluate the model?

This is a supervised learning, classification task. We could use accuracy, precision, recall, F-score, area under the ROC curve, or a number of other metrics. However, we don't have enough information at this time to know which are most appropriate. We need to know the class balance of the target variable.
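To make the candidate metrics concrete, here is a minimal sketch that computes four of them by hand from a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not results from the taxi data.

```python
# Hypothetical confusion-matrix counts (assumed values, for illustration only)
tp, fp, fn, tn = 30, 10, 20, 140

# Share of all predictions that are correct
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Of the customers predicted generous, the share who actually were
precision = tp / (tp + fp)

# Of the customers who actually were generous, the share the model found
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these assumed counts, accuracy looks strong while recall is noticeably lower. This is one reason the class balance matters when choosing a metric.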

Complete the following steps to begin:

Task 1. Imports and data loading
Import packages and libraries needed to build and evaluate random forest and XGBoost classification models.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, ConfusionMatrixDisplay,
                             RocCurveDisplay)

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# This is the function that helps plot feature importance 
from xgboost import plot_importance

In [3]:
pd.set_option('display.max_columns', None)

In [5]:
nyc_preds_means = pd.read_csv('nyc_preds_means.csv')
df0 = pd.read_csv('C:../data/raw/2017_Yellow_Taxi_Trip_Data.csv')

In [6]:
df0.head()

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
1,35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
2,106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
3,38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
4,30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


In [7]:
nyc_preds_means.head()

Unnamed: 0,mean_duration,mean_distance,predicted_fare
0,22.847222,3.521667,16.434245
1,24.47037,3.108889,16.052218
2,7.25,0.881429,7.053706
3,30.25,3.7,18.73165
4,14.616667,4.435,15.845642


In [8]:
# Merge datasets
df0 = df0.merge(nyc_preds_means,
                left_index=True,
                right_index=True)

df0.head()

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,mean_duration,mean_distance,predicted_fare
0,24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56,22.847222,3.521667,16.434245
1,35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8,24.47037,3.108889,16.052218
2,106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75,7.25,0.881429,7.053706
3,38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69,30.25,3.7,18.73165
4,30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8,14.616667,4.435,15.845642



PACE: Analyze
Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

Task 2. Feature engineering
You have already prepared much of this data and performed exploratory data analysis (EDA) in previous courses.

Call info() on the dataframe.

In [9]:
df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             22699 non-null  int64  
 1   VendorID               22699 non-null  int64  
 2   tpep_pickup_datetime   22699 non-null  object 
 3   tpep_dropoff_datetime  22699 non-null  object 
 4   passenger_count        22699 non-null  int64  
 5   trip_distance          22699 non-null  float64
 6   RatecodeID             22699 non-null  int64  
 7   store_and_fwd_flag     22699 non-null  object 
 8   PULocationID           22699 non-null  int64  
 9   DOLocationID           22699 non-null  int64  
 10  payment_type           22699 non-null  int64  
 11  fare_amount            22699 non-null  float64
 12  extra                  22699 non-null  float64
 13  mta_tax                22699 non-null  float64
 14  tip_amount             22699 non-null  float64
 15  to

You know from your EDA that customers who pay cash generally have a tip amount of $0. To meet the modeling objective, you'll need to sample the data to select only the customers who pay with credit card.

Copy df0 and assign the result to a variable called df1. Then, use a Boolean mask to filter df1 so it contains only customers who paid with credit card.

In [10]:
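# Subset the data to isolate only customers who paid by credit card
# .copy() gives df1 its own data, avoiding SettingWithCopyWarning on later column assignments
df1 = df0[df0['payment_type'] == 1].copy()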

Target
Notice that there isn't a column that indicates tip percent, which is what you need to create the target variable. You'll have to engineer it.

Add a tip_percent column to the dataframe by performing the following calculation:

$$tip\ percent = \frac{tip\ amount}{total\ amount - tip\ amount}$$

Round the result to three places beyond the decimal. This is an important step. It affects how many customers are labeled as generous tippers. In fact, without performing this step, approximately 1,800 people who do tip ≥ 20% would be labeled as not generous.

To understand why, you must consider how floats work. Computers make their calculations using floating-point arithmetic (hence the word "float"). Floating-point arithmetic is a system that allows computers to express both very large numbers and very small numbers with a high degree of precision, encoded in binary. However, precision is limited by the number of bits used to represent a number, which is generally 32 or 64. Python floats and the float64 columns in this dataframe use 64 bits.

As a result, calculations that should produce clean, exact values are sometimes stored as very long decimals that land just above or just below the intended value. This is why the tip percent calculation below rounds its result to three decimal places:
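A minimal sketch of this effect, using illustrative values that are not taken from the taxi dataset. The subtraction should give exactly 0.2, but the stored result falls slightly below it, so a `>= 0.2` comparison fails until the value is rounded.

```python
# Illustrative values only; these are not taken from the taxi dataset
raw = 0.3 - 0.1

print(raw)                    # 0.19999999999999998
print(raw >= 0.2)             # False: falls just short of the threshold
print(round(raw, 3) >= 0.2)   # True: rounding restores the intended value
```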

In [33]:
# Create tip % col, rounded to three decimal places
df0['tip_percent'] = (df0['tip_amount'] / (df0['total_amount'] - df0['tip_amount'])).round(3)

In [34]:
# Create 'generous' col (target): 1 if tip_percent >= 20%, else 0
df0['generous'] = (df0['tip_percent'] >= 0.2).astype(int)

In [35]:
# Convert pickup and dropoff cols to datetime
df0['tpep_pickup_datetime'] = pd.to_datetime(df0['tpep_pickup_datetime'], format='%m/%d/%Y %I:%M:%S %p')
df0['tpep_dropoff_datetime'] = pd.to_datetime(df0['tpep_dropoff_datetime'], format='%m/%d/%Y %I:%M:%S %p')

In [36]:
# Create a 'day' col
df0['day'] = df0['tpep_pickup_datetime'].dt.day_name().str.lower()

In [37]:
# Create 'am_rush' col
df0['am_rush'] = df0['tpep_pickup_datetime'].dt.hour

# Create 'daytime' col
df0['daytime'] = df0['tpep_pickup_datetime'].dt.hour

# Create 'pm_rush' col
df0['pm_rush'] = df0['tpep_pickup_datetime'].dt.hour

# Create 'nighttime' col
df0['nighttime'] = df0['tpep_pickup_datetime'].dt.hour

In [38]:
# Define 'am_rush()' conversion function [06:00–10:00)
def am_rush(hour):
    if 6 <= hour['am_rush'] < 10:
        val = 1
    else:
        val = 0
    return val

In [39]:
# Apply 'am_rush' function to the 'am_rush' series
df0['am_rush'] = df0.apply(am_rush, axis=1)
df0['am_rush'].head()

0    1
1    0
2    1
3    0
4    0
Name: am_rush, dtype: int64

In [19]:
# Define 'daytime()' conversion function [10:00–16:00)
def daytime(hour):
    if 10 <= hour['daytime'] < 16:
        val = 1
    else:
        val = 0
    return val

In [40]:
# Apply 'daytime' function to the 'daytime' series
df0['daytime'] = df0.apply(daytime, axis=1)

In [41]:
# Define 'pm_rush()' conversion function [16:00–20:00)
def pm_rush(hour):
    if 16 <= hour['pm_rush'] < 20:
        val = 1
    else:
        val = 0
    return val

In [42]:
# Apply 'pm_rush' function to the 'pm_rush' series
df0['pm_rush'] = df0.apply(pm_rush, axis=1)

In [43]:
# Define 'nighttime()' conversion function [20:00–06:00)
def nighttime(hour):
    if 20 <= hour['nighttime'] < 24:
        val = 1
    elif 0 <= hour['nighttime'] < 6:
        val = 1
    else:
        val = 0
    return val

In [44]:
# Apply 'nighttime' function to the 'nighttime' series
df0['nighttime'] = df0.apply(nighttime, axis=1)
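The four row-wise `apply` calls above can also be written as vectorized comparisons on the pickup hour, which is usually faster on large dataframes. The following is a sketch under assumptions, not the notebook's method: the `pickups` series is a hypothetical stand-in for `df0['tpep_pickup_datetime']`, and `flags` is an illustrative name.

```python
import pandas as pd

# Hypothetical pickup timestamps standing in for df0['tpep_pickup_datetime']
pickups = pd.to_datetime(pd.Series([
    '2017-03-25 08:55:43',
    '2017-04-11 14:53:28',
    '2017-05-07 18:17:59',
    '2017-04-15 23:32:20',
    '2017-06-01 03:10:00',
]))
hour = pickups.dt.hour

# Same half-open intervals as the conversion functions above
flags = pd.DataFrame({
    'am_rush': ((hour >= 6) & (hour < 10)).astype(int),    # [06:00-10:00)
    'daytime': ((hour >= 10) & (hour < 16)).astype(int),   # [10:00-16:00)
    'pm_rush': ((hour >= 16) & (hour < 20)).astype(int),   # [16:00-20:00)
    'nighttime': ((hour >= 20) | (hour < 6)).astype(int),  # [20:00-06:00)
})

# Each trip should fall into exactly one time-of-day bin
print(flags.sum(axis=1).eq(1).all())
```

Because the four intervals cover the whole day without overlapping, the final check doubles as a quick sanity test of the bin boundaries.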

In [45]:
# Create 'month' col
df0['month'] = df0['tpep_pickup_datetime'].dt.strftime('%b').str.lower()

In [46]:
# Drop columns
drop_cols = ['Unnamed: 0', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
             'payment_type', 'trip_distance', 'store_and_fwd_flag',
             'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
             'improvement_surcharge', 'total_amount', 'tip_percent']

df0 = df0.drop(drop_cols, axis=1)
df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   VendorID         22699 non-null  int64  
 1   passenger_count  22699 non-null  int64  
 2   RatecodeID       22699 non-null  int64  
 3   PULocationID     22699 non-null  int64  
 4   DOLocationID     22699 non-null  int64  
 5   mean_duration    22699 non-null  float64
 6   mean_distance    22699 non-null  float64
 7   predicted_fare   22699 non-null  float64
 8   generous         22699 non-null  int32  
 9   day              22699 non-null  object 
 10  am_rush          22699 non-null  int64  
 11  daytime          22699 non-null  int64  
 12  pm_rush          22699 non-null  int64  
 13  nighttime        22699 non-null  int64  
 14  month            22699 non-null  object 
dtypes: float64(3), int32(1), int64(9), object(2)
memory usage: 2.5+ MB


In [47]:
# 1. Define list of cols to convert to string
cols_to_str = ['RatecodeID', 'PULocationID', 'DOLocationID', 'VendorID']

# 2. Convert each column to string
for col in cols_to_str:
    df0[col] = df0[col].astype('str')

In [48]:
# Convert categoricals to binary
df2 = pd.get_dummies(df0, drop_first=True)
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Columns: 398 entries, passenger_count to month_sep
dtypes: bool(389), float64(3), int32(1), int64(5)
memory usage: 9.9 MB


In [49]:
# Get class balance of 'generous' col
df2['generous'].value_counts(normalize=True)

generous
0    0.64602
1    0.35398
Name: proportion, dtype: float64
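One way to read this class balance is as a baseline: a model that always predicts the majority class would already be right about 65% of the time, so accuracy alone is a weak yardstick here. A small sketch, with the proportions copied from the output above and illustrative variable names:

```python
# Class proportions copied from the value_counts(normalize=True) output above
class_balance = {0: 0.64602, 1: 0.35398}

# A trivial model that always predicts the most common class
majority_class = max(class_balance, key=class_balance.get)
baseline_accuracy = class_balance[majority_class]

print(f"Always predicting {majority_class} gives accuracy {baseline_accuracy:.3f}")
```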

In [50]:
# Isolate target variable (y)
y = df2['generous']

# Isolate the features (X)
X = df2.drop('generous', axis=1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

Random forest
Begin by using GridSearchCV to tune a random forest model.

Instantiate the random forest classifier rf and set the random state.

Create a dictionary cv_params of any of the following hyperparameters and their corresponding values to tune. Searching more values improves the chance of finding a model that fits the data well, but the search takes longer.

max_depth
max_features
max_samples
min_samples_leaf
min_samples_split
n_estimators
Define a list, scoring, of scoring metrics for GridSearchCV to capture (precision, recall, F1 score, and accuracy).

Instantiate the GridSearchCV object rf1. Pass to it as arguments:

estimator=rf
param_grid=cv_params
scoring=scoring
cv: define the number of cross-validation folds you want (cv=_)
refit: indicate which evaluation metric you want to use to select the model (refit=_)
Note: refit should be set to 'f1'.

In [54]:
# 1. Instantiate the random forest classifier
rf = RandomForestClassifier(random_state=42)

# 2. Create a dictionary of hyperparameters to tune
# Note that this example only contains 1 value for each parameter for simplicity,
# but in practice you should assign a list of several candidate values to each key
cv_params = {'max_depth': [None],
             'max_features': [1.0],
             'max_samples': [0.7],
             'min_samples_leaf': [1],
             'min_samples_split': [2],
             'n_estimators': [300]
             }

# 3. Define a list of scoring metrics to capture
# (GridSearchCV accepts a str, callable, list, tuple, or dict here, but not a set)
scoring = ['accuracy', 'precision', 'recall', 'f1']

# 4. Instantiate the GridSearchCV object
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='f1')

In [52]:
%%time
rf1.fit(X_train, y_train)
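As a sketch of what a grid with several candidate values per hyperparameter might look like, the example below runs a small multi-metric search on synthetic data. Everything named with `demo` is an assumption made for illustration: the data comes from `make_classification` rather than the taxi dataset, and the candidate values are arbitrary, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption: not the taxi data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

rf_demo = RandomForestClassifier(random_state=42)

# Illustrative candidate values: 2 x 2 = 4 combinations to evaluate
demo_params = {
    'max_depth': [2, None],
    'n_estimators': [10, 20],
}

# Multi-metric scoring passed as a list of scorer names
demo_scoring = ['accuracy', 'precision', 'recall', 'f1']

# refit='f1' picks the combination with the best mean cross-validated F1 score
demo_search = GridSearchCV(rf_demo, demo_params, scoring=demo_scoring, cv=2, refit='f1')
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(demo_search.cv_results_['mean_test_f1'])
```

With multiple metrics, `cv_results_` holds one `mean_test_<metric>` entry per scorer, so the other metrics can still be inspected for the combination that `refit` selected.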