## HW4 - Moody Billah

Loading the necessary libraries

In [1]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2

### Original Time

The following shows the total execution time for the original modeling code, which is about 641 milliseconds per run. This is already a very good execution time for the business use case, and no step stands out as a significant computational bottleneck.

In [2]:
%%timeit

data_path = 'https://raw.githubusercontent.com/ulabox/datasets/master/data/ulabox_orders_with_categories_partials_2017.csv'
ulabox_data = pd.read_csv(data_path)

ulabox_data.drop(columns=['customer', 'order'], inplace=True)

day_names = {1:'Mon', 2:'Tue', 3:'Wed', 4:'Thu', 5:'Fri', 6:'Sat', 7:'Sun'}
ulabox_data['weekday'] = ulabox_data['weekday'].replace(day_names)

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
ulabox_data['weekday'] = ulabox_data['weekday'].astype('category').cat.reorder_categories(day_order)

ulabox_data['hour'] = ulabox_data['hour'].astype(str).str.pad(2, fillchar='0') + 'h'

ulabox_data.columns = ulabox_data.columns.astype(str).str.replace('%', '')

ulabox_data.loc[:,'Food':'Pets'] = ulabox_data.loc[:,'Food':'Pets']/100

ulabox_data[['total_items', 'discount']] = MinMaxScaler().fit_transform(ulabox_data[['total_items', 'discount']])

ulabox_data = pd.get_dummies(ulabox_data, drop_first=True)

ulabox_train, ulabox_test = train_test_split(ulabox_data, test_size=0.2, random_state=100)

ulabox_train.reset_index(drop=True, inplace=True)
ulabox_test.reset_index(drop=True, inplace=True)

response_cols = ulabox_data.loc[:,'Food':'Pets'].columns
response_train = ulabox_train[response_cols]
response_test = ulabox_test[response_cols]

response_train_max = response_train.idxmax(axis='columns')
response_test_max = response_test.idxmax(axis='columns')

explanatory_train = ulabox_train.drop(columns=response_cols)
explanatory_test = ulabox_test.drop(columns=response_cols)

priors = list(response_train.mean().sort_index())
priors[0] = priors[0] + (1 - sum(priors))

model2 = MultinomialNB(class_prior=priors)
model2.fit(explanatory_train, response_train_max)

response_pred2 = pd.DataFrame(model2.predict_proba(explanatory_test))
response_pred2.columns = sorted(response_cols)

641 ms ± 99.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
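
The timed cell covers every step at once, including the CSV download, so the total alone does not show where the time goes. One way to check for a bottleneck is to time each stage separately. The following is a minimal sketch and not part of the original analysis: it uses a small synthetic DataFrame instead of the Ulabox file, and the helper `time_stage` and the stage labels are hypothetical names chosen for illustration.

```python
import time

import numpy as np
import pandas as pd


def time_stage(label, func, timings):
    """Run func(), store its wall-clock duration (seconds) under label, return its result."""
    start = time.perf_counter()
    result = func()
    timings[label] = time.perf_counter() - start
    return result


# Synthetic stand-in for the downloaded data (assumption, not the real Ulabox file).
rng = np.random.default_rng(100)
demo = pd.DataFrame({
    'weekday': rng.integers(1, 8, size=1000),
    'Food': rng.uniform(0, 100, size=1000),
    'Pets': rng.uniform(0, 100, size=1000),
})

timings = {}
# Stage 1: convert percentages to proportions.
shares = time_stage('rescale', lambda: demo[['Food', 'Pets']] / 100, timings)
# Stage 2: one-hot encode a categorical column.
dummies = time_stage(
    'dummies',
    lambda: pd.get_dummies(demo['weekday'].astype(str), drop_first=True),
    timings,
)

for label, seconds in timings.items():
    print(f'{label}: {seconds * 1000:.3f} ms')
```

Wrapping the `pd.read_csv` call in the same helper would show how much of the total is network time rather than computation.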


### Improved Time

The following shows the total execution time for the code with slight modifications, which is about 549 milliseconds per run. In lines 18 and 29 of the cell, the label-based 'loc' indexer is replaced with the position-based 'iloc' indexer.

In [3]:
%%timeit

data_path = 'https://raw.githubusercontent.com/ulabox/datasets/master/data/ulabox_orders_with_categories_partials_2017.csv'
ulabox_data = pd.read_csv(data_path)

ulabox_data.drop(columns=['customer', 'order'], inplace=True)

day_names = {1:'Mon', 2:'Tue', 3:'Wed', 4:'Thu', 5:'Fri', 6:'Sat', 7:'Sun'}
ulabox_data['weekday'] = ulabox_data['weekday'].replace(day_names)

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
ulabox_data['weekday'] = ulabox_data['weekday'].astype('category').cat.reorder_categories(day_order)

ulabox_data['hour'] = ulabox_data['hour'].astype(str).str.pad(2, fillchar='0') + 'h'

ulabox_data.columns = ulabox_data.columns.astype(str).str.replace('%', '')

ulabox_data.iloc[:, 4:12] = ulabox_data.iloc[:, 4:12]/100

ulabox_data[['total_items', 'discount']] = MinMaxScaler().fit_transform(ulabox_data[['total_items', 'discount']])

ulabox_data = pd.get_dummies(ulabox_data, drop_first=True)

ulabox_train, ulabox_test = train_test_split(ulabox_data, test_size=0.2, random_state=100)

ulabox_train.reset_index(drop=True, inplace=True)
ulabox_test.reset_index(drop=True, inplace=True)

response_cols = ulabox_data.iloc[:, 2:10].columns
response_train = ulabox_train[response_cols]
response_test = ulabox_test[response_cols]

response_train_max = response_train.idxmax(axis='columns')
response_test_max = response_test.idxmax(axis='columns')

explanatory_train = ulabox_train.drop(columns=response_cols)
explanatory_test = ulabox_test.drop(columns=response_cols)

priors = list(response_train.mean().sort_index())
priors[0] = priors[0] + (1 - sum(priors))

model2 = MultinomialNB(class_prior=priors)
model2.fit(explanatory_train, response_train_max)

response_pred2 = pd.DataFrame(model2.predict_proba(explanatory_test))
response_pred2.columns = sorted(response_cols)

549 ms ± 35.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
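
To isolate the effect of the indexer from the rest of the pipeline, the slicing step can be benchmarked on its own. The sketch below is an illustration under stated assumptions: the data are synthetic, the category names between 'Food' and 'Pets' are assumed, and the row count and repeat settings are arbitrary. It first confirms that both indexers select the same columns and then times each one with the standard `timeit` module.

```python
import timeit

import numpy as np
import pandas as pd

# Column layout after dropping 'customer' and 'order' (middle category names are assumed).
categories = ['Food', 'Fresh', 'Drinks', 'Home', 'Beauty', 'Health', 'Baby', 'Pets']

rng = np.random.default_rng(100)
n_rows = 5000
demo = pd.DataFrame(rng.uniform(0, 100, size=(n_rows, 8)), columns=categories)
# Insert the four leading columns so that 'Food' sits at position 4.
demo.insert(0, 'hour', rng.integers(0, 24, size=n_rows))
demo.insert(0, 'weekday', rng.integers(1, 8, size=n_rows))
demo.insert(0, 'discount', rng.uniform(0, 100, size=n_rows))
demo.insert(0, 'total_items', rng.integers(1, 50, size=n_rows))

# Both indexers should return the same eight columns.
by_label = demo.loc[:, 'Food':'Pets'] / 100
by_position = demo.iloc[:, 4:12] / 100
assert by_label.equals(by_position)

# Best of 5 repeats, 200 calls each, to reduce noise.
loc_time = min(timeit.repeat(lambda: demo.loc[:, 'Food':'Pets'] / 100, number=200, repeat=5))
iloc_time = min(timeit.repeat(lambda: demo.iloc[:, 4:12] / 100, number=200, repeat=5))

print(f'loc:  {loc_time / 200 * 1e6:.1f} microseconds per call')
print(f'iloc: {iloc_time / 200 * 1e6:.1f} microseconds per call')
```

The timings depend on the machine and the pandas version, so the numbers are only meaningful relative to each other.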


### Explanation

Since the original execution time is already quite fast, there is not much room for improvement. Most of the changes that were tried actually resulted in a slower execution time, so those changes are not present in the code above. The only change that showed any speed gain was switching from 'loc' to 'iloc', as shown in lines 18 and 29 of the code. However, the average execution time decreased by only 92 milliseconds, which is smaller than the ± 99.5 ms run-to-run variation reported for the original code and is insignificant for the business use case. Using the 'loc' indexer may actually be preferable because the explicit column labels make the code more readable.
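
If positional indexing were kept, one possible compromise for readability is to derive the positions from the labels instead of hard-coding them. This is a sketch of that idea rather than something used in the code above. It relies on `Index.get_loc`, which returns an integer position when the label is unique, and the placeholder column names `c1` to `c6` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with the same layout: four leading columns, then eight category columns.
columns = ['total_items', 'discount', 'weekday', 'hour',
           'Food', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'Pets']
demo = pd.DataFrame(np.arange(24).reshape(2, 12), columns=columns)

# Look up the positions once from the labels; 'stop' is exclusive, hence the + 1.
start = demo.columns.get_loc('Food')
stop = demo.columns.get_loc('Pets') + 1

selected = demo.iloc[:, start:stop]
print(start, stop, list(selected.columns))
```

The slice still names 'Food' and 'Pets' explicitly, so it keeps working if columns are added or removed ahead of the category block.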