# Predicting On-time Delivery of Online Purchases
## Part II: Data Cleaning, Preparation and Feature Engineering
## AAI-510 Team 7 Final Project

Team 7:  Ken Devoe, Tyler Foreman, Geoffrey Fadera

University of San Diego, Applied Artificial Intelligence

Date:  June 24, 2024

GitHub Repository: https://github.com/kdevoe/aai510-group7

## Imports and Setup

In [15]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

### Load Dataset and Interrogate

In [28]:
datafile = 'shipping.csv'
TARGET = 'Reached.on.Time_Y.N'

# Load dataset
df = pd.read_csv(datafile)

# Drop the ID column
df = df.drop(columns='ID')

# Display dataset
df

Unnamed: 0,Warehouse_block,Mode_of_Shipment,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Product_importance,Gender,Discount_offered,Weight_in_gms,Reached.on.Time_Y.N
0,D,Flight,4,2,177,3,low,F,44,1233,1
1,F,Flight,4,5,216,2,low,M,59,3088,1
2,A,Flight,2,2,183,4,low,M,48,3374,1
3,B,Flight,3,3,176,4,medium,M,10,1177,1
4,C,Flight,2,2,184,3,medium,F,46,2484,1
...,...,...,...,...,...,...,...,...,...,...,...
10994,A,Ship,4,1,252,5,medium,F,1,1538,1
10995,B,Ship,4,1,232,5,medium,F,6,1247,0
10996,C,Ship,5,4,242,5,low,F,4,1155,0
10997,F,Ship,5,2,223,6,medium,M,2,1210,0
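
Before encoding, a few quick checks help confirm which columns are categorical, whether any values are missing, and how the target is distributed. The following is a minimal sketch, not part of the original notebook: `sample` is a stand-in frame built from the first five rows displayed above, and in practice the same calls would be run on `df`.

```python
import pandas as pd

TARGET = 'Reached.on.Time_Y.N'

# Stand-in frame built from the first five rows shown above;
# in the notebook these checks would be run on `df` directly.
sample = pd.DataFrame({
    'Warehouse_block': ['D', 'F', 'A', 'B', 'C'],
    'Mode_of_Shipment': ['Flight'] * 5,
    'Customer_care_calls': [4, 4, 2, 3, 2],
    'Customer_rating': [2, 5, 2, 3, 2],
    'Cost_of_the_Product': [177, 216, 183, 176, 184],
    'Prior_purchases': [3, 2, 4, 4, 3],
    'Product_importance': ['low', 'low', 'low', 'medium', 'medium'],
    'Gender': ['F', 'M', 'M', 'M', 'F'],
    'Discount_offered': [44, 59, 48, 10, 46],
    'Weight_in_gms': [1233, 3088, 3374, 1177, 2484],
    TARGET: [1, 1, 1, 1, 1],
})

# Non-numeric columns are the candidates for categorical encoding
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()

# Count of missing values in each column
missing_counts = sample.isna().sum()

# Share of each target class (only meaningful on the full dataset)
class_balance = sample[TARGET].value_counts(normalize=True)

print(categorical_cols)
print(missing_counts.sum())
print(class_balance)
```

On the full dataset, the class balance is the most useful of the three, since it indicates whether a stratified split is worth considering later.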


In [29]:
# separate features from target
X = df.drop(columns=TARGET)
y = df[TARGET]

### Encode Categorical Features

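The dataset contains four categorical features (`Warehouse_block`, `Mode_of_Shipment`, `Product_importance`, and `Gender`) that must be converted to numeric form before downstream models can use them. These features are one-hot encoded with `pd.get_dummies`, which replaces each one with a 0/1 indicator column per category.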


In [30]:
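# one-hot encode the categorical features
# dtype=int keeps the indicators as 0/1 (recent pandas versions default to True/False)
X_encoded = pd.get_dummies(X, dtype=int)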

In [31]:
# inspect results
X_encoded.head()

Unnamed: 0,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Discount_offered,Weight_in_gms,Warehouse_block_A,Warehouse_block_B,Warehouse_block_C,Warehouse_block_D,Warehouse_block_F,Mode_of_Shipment_Flight,Mode_of_Shipment_Road,Mode_of_Shipment_Ship,Product_importance_high,Product_importance_low,Product_importance_medium,Gender_F,Gender_M
0,4,2,177,3,44,1233,0,0,0,1,0,1,0,0,0,1,0,1,0
1,4,5,216,2,59,3088,0,0,0,0,1,1,0,0,0,1,0,0,1
2,2,2,183,4,48,3374,1,0,0,0,0,1,0,0,0,1,0,0,1
3,3,3,176,4,10,1177,0,1,0,0,0,1,0,0,0,0,1,0,1
4,2,2,184,3,46,2484,0,0,1,0,0,1,0,0,0,0,1,1,0
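
To make the encoding step concrete, here is a small sketch on a toy frame (the values are illustrative, not taken from the dataset). It shows the full set of indicator columns and, as an optional variant the notebook does not use, `drop_first=True`, which removes one indicator per feature and can help linear models avoid perfectly collinear inputs.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (illustrative values)
toy = pd.DataFrame({
    'Mode_of_Shipment': ['Flight', 'Ship', 'Road', 'Ship'],
    'Weight_in_gms': [1233, 1538, 2000, 1247],
})

# One indicator column per category; numeric columns pass through unchanged
toy_encoded = pd.get_dummies(toy, dtype=int)

# Variant: drop the first category of each feature to avoid redundant columns
toy_reduced = pd.get_dummies(toy, dtype=int, drop_first=True)

print(toy_encoded.columns.tolist())
print(toy_reduced.columns.tolist())
```

Tree-based models are generally indifferent to the redundant column, so keeping all indicators, as the notebook does, is a reasonable default.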


### Normalize Continuous Numerical Features

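The continuous numerical features are standardized with `StandardScaler`, which rescales each selected column to zero mean and unit variance. `Customer_rating` is not in the list and is left on its original scale. Note that the scaler is fit on the full dataset here, before the train/test/validation split, so rows from all three sets contribute to the scaling statistics.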

In [32]:
#list for cols to scale
cols_to_scale = ['Cost_of_the_Product', 'Discount_offered', 'Weight_in_gms', 'Prior_purchases', 'Customer_care_calls']

#create and fit scaler
scaler = StandardScaler()
scaler.fit(X_encoded[cols_to_scale])

#scale selected data
X_encoded[cols_to_scale] = scaler.transform(X_encoded[cols_to_scale])

In [34]:
# inspect results
X_encoded.head()

Unnamed: 0,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Discount_offered,Weight_in_gms,Warehouse_block_A,Warehouse_block_B,Warehouse_block_C,Warehouse_block_D,Warehouse_block_F,Mode_of_Shipment_Flight,Mode_of_Shipment_Road,Mode_of_Shipment_Ship,Product_importance_high,Product_importance_low,Product_importance_medium,Gender_F,Gender_M
0,-0.047711,2,-0.690722,-0.372735,1.889983,-1.46824,0,0,0,1,0,1,0,0,0,1,0,1,0
1,-0.047711,5,0.120746,-1.029424,2.815636,-0.333893,0,0,0,0,1,1,0,0,0,1,0,0,1
2,-1.799887,2,-0.565881,0.283954,2.136824,-0.159002,1,0,0,0,0,1,0,0,0,1,0,0,1
3,-0.923799,3,-0.711529,0.283954,-0.208162,-1.502484,0,1,0,0,0,1,0,0,0,0,1,0,1
4,-1.799887,2,-0.545074,-0.372735,2.013404,-0.703244,0,0,1,0,0,1,0,0,0,0,1,1,0
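
As a sanity check on what the scaler does, the sketch below standardizes the five `Cost_of_the_Product` values shown earlier and compares the result with the manual formula z = (x - mean) / std, using the population standard deviation. Because the scaler here is fit on five rows only, the numbers will not match the table above, which was scaled using the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Cost_of_the_Product values from the first five rows shown earlier
costs = pd.DataFrame({'Cost_of_the_Product': [177, 216, 183, 176, 184]})

# Fit and apply the scaler in one step
scaled = StandardScaler().fit_transform(costs)

# Manual z-score; StandardScaler uses the population std (ddof=0)
values = costs['Cost_of_the_Product'].to_numpy(dtype=float)
manual = (values - values.mean()) / values.std(ddof=0)

# The two computations should agree up to floating-point error
print(np.allclose(scaled.ravel(), manual))
```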


### Perform Train/Test/Validation Split

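Split the prepared dataset into training, test, and validation sets for the modeling stages that follow, using an 80/10/10 split. This is done in two steps: 20% of the rows are held out first, and that holdout is then divided evenly between the test and validation sets.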

In [35]:
# split the data into a training set (80%) and a combined test/validation holdout (20%)
X_train, X_test_val, y_train, y_test_val = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# split the holdout evenly into test and validation sets (10% of the full data each)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, test_size=0.5, random_state=42)
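
The notebook fits the scaler before splitting. A common alternative is to split first and fit the scaler on the training rows only, so that no test or validation statistics reach the training pipeline. The sketch below illustrates that ordering under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins with made-up values, and the `stratify` argument is an optional addition that the original split does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and target (hypothetical values)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    'Cost_of_the_Product': rng.integers(100, 300, size=1000),
    'Weight_in_gms': rng.integers(1000, 5000, size=1000),
})
y_demo = pd.Series(rng.integers(0, 2, size=1000), name='Reached.on.Time_Y.N')

# 80% train, 20% holdout; stratify keeps the class ratio similar in each part
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Split the holdout evenly into test and validation (10% of the data each)
X_te, X_va, y_te, y_va = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

# Fit the scaler on training rows only, then apply it to all three parts
cols = ['Cost_of_the_Product', 'Weight_in_gms']
scaler = StandardScaler().fit(X_tr[cols])
X_tr_s, X_te_s, X_va_s = (
    pd.DataFrame(scaler.transform(part[cols]), columns=cols, index=part.index)
    for part in (X_tr, X_te, X_va)
)

print(len(X_tr_s), len(X_te_s), len(X_va_s))  # 800 100 100
```

With this ordering only the training set is guaranteed to have exactly zero mean and unit variance; the test and validation sets are scaled with the training statistics and will deviate slightly, which is expected.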

### Write prepared data to file for use by modeling experiments

In [36]:
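import os

# make sure the output directory exists; to_csv does not create it
os.makedirs('./data', exist_ok=True)

# train data
X_train.to_csv('./data/x_train.csv')
y_train.to_csv('./data/y_train.csv')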

# test data
X_test.to_csv('./data/x_test.csv')
y_test.to_csv('./data/y_test.csv')

# val data
X_val.to_csv('./data/x_val.csv')
y_val.to_csv('./data/y_val.csv')
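
Because `to_csv` writes the row index as the first column by default, the modeling notebooks need to read that column back as the index. The sketch below shows a round trip on toy data, under stated assumptions: it writes to a temporary directory rather than `./data`, and `X_part` and `y_part` are illustrative stand-ins for the real splits.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for one split; the shuffled index mimics rows after train_test_split
X_part = pd.DataFrame({'Weight_in_gms': [1233, 3088, 3374]}, index=[7, 2, 5])
y_part = pd.Series([1, 1, 0], index=[7, 2, 5], name='Reached.on.Time_Y.N')

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / 'data'
    data_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create folders

    X_part.to_csv(data_dir / 'x_train.csv')
    y_part.to_csv(data_dir / 'y_train.csv')

    # index_col=0 restores the saved row labels instead of adding an extra column
    X_back = pd.read_csv(data_dir / 'x_train.csv', index_col=0)
    y_back = pd.read_csv(data_dir / 'y_train.csv', index_col=0)['Reached.on.Time_Y.N']

print(X_back.index.tolist())  # [7, 2, 5]
```

Reading without `index_col=0` would leave the old row labels in the frame as an `Unnamed: 0` column, which a model could then mistake for a feature.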