PROBLEM STATEMENT

The sharp rise in shipping demand has not been matched by a comparable increase in the capacity of logistics companies.
Delayed delivery is a risk in many sectors, including e-commerce retail. A late delivery disrupts the product supply chain and reduces the retailer's credibility. Delays by the courier also disappoint buyers, which in turn hurts the retailer.

GOAL

Build a binary classification machine learning model that predicts delays in logistics/product delivery in e-commerce with high accuracy.

1) Analyze the data and determine the target feature/binary label (is_late -> 1 or 0), according to the problem statement (delay in delivery)
2) Process the data to remove noise
3) Carry out feature engineering, creating new features that expose additional patterns in the data and make classification easier for the model (with the expectation that accuracy will increase)
4) Select high-importance features using feature selection techniques (Pearson Correlation Matrix, KBest, ChiSquare, and SHAP) to reduce model complexity and computational load and to improve model performance
5) Model with several baseline algorithms (Logistic Regression, SVM, and Decision Tree) as well as more advanced ensemble learning algorithms (XGBoost, LGBM, CatBoost, AdaBoost, and Random Forest)
6) Evaluate the model with accuracy metrics

IMPORTING THE LIBRARIES

In [1]:
# importing common libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# show up to 99 columns when displaying dataframes
pd.set_option('display.max_columns', 99)

# libraries for evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay, precision_score, recall_score, f1_score, roc_auc_score


# importing time libraries
from datetime import timedelta
from datetime import datetime

# libraries for baseline algorithm
# from sklearn import svm
# from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import linear_model, tree, ensemble
from sklearn.neighbors import KNeighborsClassifier

# libraries for advanced algorithm
import xgboost
import xgboost as xgb
from xgboost import XGBClassifier
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier

# libraries for feature selection
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [2]:
# loading the train datasets into dataframes

train_orders = pd.read_csv('train/df_Orders.csv')
train_payments = pd.read_csv('train/df_Payments.csv')
train_products = pd.read_csv('train/df_Products.csv')
train_customers = pd.read_csv('train/df_customers.csv')
train_orderitems = pd.read_csv('train/df_Orderitems.csv')

In [3]:
# loading the test datasets into dataframes

test_orders = pd.read_csv('test/df_Orders.csv')
test_payments = pd.read_csv('test/df_Payments.csv')
test_products = pd.read_csv('test/df_Products.csv')
test_customers = pd.read_csv('test/df_customers.csv')
test_orderitems = pd.read_csv('test/df_Orderitems.csv')

EXPLORATORY DATA ANALYSIS

In [4]:
# MERGING THE DATA
# drop duplicate product rows first, so the product merge does not multiply order rows
train_products = train_products.drop_duplicates()

In [5]:
#merging the train data
train_data = train_orders.merge(train_customers, on="customer_id", how="left")
train_data = train_data.merge(train_orderitems, on="order_id", how="left")
train_data = train_data.merge(train_payments, on="order_id", how="left")
train_data = train_data.merge(train_products, on="product_id", how="left")

In [6]:
#merging the test data

test_data = test_orders.merge(test_customers, on="customer_id", how="left")
test_data = test_data.merge(test_orderitems, on="order_id", how="left")
test_data = test_data.merge(test_payments, on="order_id", how="left")
test_products = test_products.drop_duplicates()
test_data = test_data.merge(test_products, on="product_id", how="left")
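
The drop_duplicates() calls before the product merge matter because a left merge repeats a left-hand row once for every matching right-hand row. The sketch below uses made-up toy frames (the names items and products and all values are hypothetical, not the competition data) to show the effect, and how the validate argument of merge can guard against it:

```python
import pandas as pd

# Toy stand-ins for the order-item and product tables (hypothetical values).
items = pd.DataFrame({"order_id": ["o1", "o2"], "product_id": ["p1", "p2"]})
products = pd.DataFrame({
    "product_id": ["p1", "p1", "p2"],          # "p1" appears twice
    "product_weight_g": [500, 500, 1200],
})

# Without de-duplication, the left merge yields one extra row for order "o1".
inflated = items.merge(products, on="product_id", how="left")

# After de-duplication each product_id is unique, so the row count is preserved.
# validate="many_to_one" makes pandas raise MergeError if the right-hand keys
# are not unique.
clean = items.merge(products.drop_duplicates(), on="product_id",
                    how="left", validate="many_to_one")

print(len(items), len(inflated), len(clean))
```

Note that drop_duplicates() only removes rows that are identical in every column. If the same product_id appeared with different attributes, the validate check would raise instead of silently inflating the data.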

In [7]:
# Lateness can only be determined for orders that were actually delivered, so keep only those.
# .copy() avoids pandas SettingWithCopyWarning when columns are added later.

train_data = train_data[train_data.order_status == 'delivered'].copy()


When working with time-related data in Python, especially for comparisons, calculations, or formatting, it is good practice to convert the values into datetime objects first.

In [8]:
# Convert order date and shipping date features to datetime type
train_data[["order_purchase_timestamp", "order_approved_at", "order_delivered_timestamp", "order_estimated_delivery_date"]] = train_data[["order_purchase_timestamp", "order_approved_at", "order_delivered_timestamp", "order_estimated_delivery_date"]].apply(pd.to_datetime)
test_data[["order_purchase_timestamp", "order_approved_at"]] = test_data[["order_purchase_timestamp", "order_approved_at"]].apply(pd.to_datetime)
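
As a small illustration of why the conversion above matters, the sketch below uses two made-up timestamps (not taken from the dataset). Once the string columns are converted with pd.to_datetime, both comparison and subtraction work directly:

```python
import pandas as pd

# Hypothetical example values, stored as strings the way read_csv returns them.
df = pd.DataFrame({
    "order_delivered_timestamp": ["2018-01-10 14:00:00"],
    "order_estimated_delivery_date": ["2018-01-09 00:00:00"],
})

cols = ["order_delivered_timestamp", "order_estimated_delivery_date"]
df[cols] = df[cols].apply(pd.to_datetime)

# Datetime columns support direct comparison and subtraction.
late = df["order_delivered_timestamp"] > df["order_estimated_delivery_date"]
delay = df["order_delivered_timestamp"] - df["order_estimated_delivery_date"]

print(late.iloc[0], delay.iloc[0])
```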

In [9]:
# Remove the order_status feature: it now contains only 'delivered', so it carries no information for model training
train_data = train_data.drop(['order_status'], axis=1)

In [10]:
# drop rows with missing values in the columns needed for the label and the engineered features
train_data = train_data.dropna(subset=['order_approved_at', 'order_delivered_timestamp','product_category_name','product_weight_g','product_length_cm','product_height_cm','product_width_cm'])
test_data = test_data.dropna(subset=['order_approved_at', 'product_category_name','product_weight_g','product_length_cm','product_height_cm','product_width_cm'])


In [11]:
# Target label: 1 if the order was delivered after the estimated delivery date, otherwise 0
train_data.loc[:, 'is_late'] = (train_data['order_delivered_timestamp'] > train_data['order_estimated_delivery_date']).astype(int)


In [12]:
# The distribution of late vs. on-time orders is quite imbalanced, which is normal for real-world data
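
The imbalance can be quantified from counts that appear in this notebook's own outputs: 6,706 late rows out of 87,109 training rows. The sketch below rebuilds a label column with those counts (a stand-in for train_data['is_late'], not the real frame) and also computes the accuracy that a trivial "never late" predictor would reach, which is a useful floor to keep in mind when accuracy is the evaluation metric:

```python
import numpy as np
import pandas as pd

# Stand-in label column built from counts shown in this notebook's outputs:
# 6,706 late rows among 87,109 training rows.
n_total, n_late = 87109, 6706
is_late = pd.Series(
    np.r_[np.ones(n_late, dtype=int), np.zeros(n_total - n_late, dtype=int)],
    name="is_late",
)

# Share of each class (index 0 = on time, 1 = late).
class_share = is_late.value_counts(normalize=True)

# Accuracy of always predicting the majority class.
majority_baseline = class_share.max()

print(class_share.round(4).to_dict(), round(majority_baseline, 4))
```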

Bivariate Analysis

In [13]:
train_data['product_volume'] = train_data['product_length_cm'] * train_data['product_height_cm'] * train_data['product_width_cm']

In [14]:
test_data['product_volume'] = test_data['product_length_cm'] * test_data['product_height_cm'] * test_data['product_width_cm']

In [15]:
test_data.shape

(38094, 21)

Volume Binning

In [16]:
# Bin volume by the 25th and 75th percentiles of the training data: small < Q1 <= medium < Q3 <= large
sectil = train_data['product_volume'].quantile([0.25]).values[0]
thirtil = train_data['product_volume'].quantile([0.75]).values[0]
train_data['volume_category'] = train_data['product_volume'].apply(lambda x:'large' if x>=thirtil else ('medium' if x>=sectil and x<thirtil else 'small'))

In [17]:
test_data['volume_category'] = test_data['product_volume'].apply(lambda x:'large' if x>=thirtil else ('medium' if x>=sectil and x<thirtil else 'small'))
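
The same quartile binning can also be written without a row-wise lambda. The sketch below is one possible alternative using pd.cut, not the notebook's own method; the volumes are hypothetical, and it assumes the two thresholds are distinct (pd.cut rejects duplicate bin edges by default):

```python
import numpy as np
import pandas as pd

# Hypothetical volumes in cm^3; "train" fixes the thresholds, "test" reuses them.
train_vol = pd.Series([100, 200, 300, 400, 500, 600, 700, 800])
test_vol = pd.Series([50, 450, 5000])

q1, q3 = train_vol.quantile([0.25, 0.75])

# right=False makes each bin closed on the left: [-inf, q1), [q1, q3), [q3, inf)
bins = [-np.inf, q1, q3, np.inf]
labels = ["small", "medium", "large"]

train_cat = pd.cut(train_vol, bins=bins, labels=labels, right=False)
test_cat = pd.cut(test_vol, bins=bins, labels=labels, right=False)

print(test_cat.tolist())
```

Reusing the training thresholds for the test data, as the notebook does, keeps the category boundaries identical across both sets.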

In [18]:
# late_prop = share of late orders within each volume category (late_counts / all_counts),
# computed on the training data only
volume_late = train_data[train_data.is_late == 1]['volume_category'].value_counts().rename_axis('volume_category').reset_index(name='late_counts')
volume_all = train_data['volume_category'].value_counts().rename_axis('volume_category').reset_index(name='all_counts')
volume_late['all_count'] = volume_late.volume_category.map(volume_all.set_index('volume_category')['all_counts'])
volume_late['late_prop'] = volume_late['late_counts'] / volume_late['all_count']
volume_list = volume_late['volume_category'].to_list()


In [19]:
def propVolLate(data):
    if(data in volume_list):
        return volume_late[volume_late.volume_category == data]['late_prop'].values[0]
    else:
        return 0

train_data['vol_late_prop'] = train_data['volume_category'].apply(propVolLate)


In [20]:
test_data['vol_late_prop'] = test_data['volume_category'].apply(propVolLate)
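
The lookup pattern used for vol_late_prop (and repeated for the other *_late_prop features in this notebook) amounts to the mean of the 0/1 label per category. A compact sketch of the same idea, on hypothetical toy data and using groupby plus map instead of a per-row function:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's.
train = pd.DataFrame({
    "volume_category": ["small", "small", "medium", "large", "large", "large"],
    "is_late":         [0,       1,       0,        1,       1,       0],
})
test = pd.DataFrame({"volume_category": ["small", "large", "unseen"]})

# Mean of a 0/1 label per category equals late_counts / all_counts.
late_prop = train.groupby("volume_category")["is_late"].mean()

# map() looks each category up; categories absent from train become NaN, then 0.
train["vol_late_prop"] = train["volume_category"].map(late_prop).fillna(0)
test["vol_late_prop"] = test["volume_category"].map(late_prop).fillna(0)

print(test["vol_late_prop"].tolist())
```

One caveat worth keeping in mind: the proportions are computed from the same training rows they are attached to, so each row's own label contributes to its feature. Computing them out-of-fold is a common way to get a less optimistic estimate.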


In [21]:
train_data['purchase_day_of_week'] = train_data['order_purchase_timestamp'].dt.dayofweek
train_data['approved_day_of_week'] = train_data['order_approved_at'].dt.dayofweek

In [22]:
test_data['purchase_day_of_week'] = test_data['order_purchase_timestamp'].dt.dayofweek
test_data['approved_day_of_week'] = test_data['order_approved_at'].dt.dayofweek
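
For reference, a short sketch of what the .dt accessor returns, using two made-up timestamps. dayofweek numbers the days from Monday=0 to Sunday=6, and day_name() gives readable labels without a hand-written mapping:

```python
import pandas as pd

# Hypothetical timestamps: 2018-01-01 was a Monday, 2018-01-07 a Sunday.
ts = pd.Series(pd.to_datetime(["2018-01-01 09:15:00", "2018-01-07 23:40:00"]))

day_number = ts.dt.dayofweek      # Monday=0 ... Sunday=6
day_name = ts.dt.day_name()       # readable labels, no manual dict needed
hour = ts.dt.hour                 # 0-23

print(day_number.tolist(), day_name.tolist(), hour.tolist())
```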

In [23]:
pur_late = train_data[train_data.is_late == 1]['purchase_day_of_week'].value_counts().rename_axis('purchase_day_of_week').reset_index(name='late_counts')
pur_all = train_data['purchase_day_of_week'].value_counts().rename_axis('purchase_day_of_week').reset_index(name='all_counts')
pur_late['all_count'] = pur_late.purchase_day_of_week.map(pur_all.set_index('purchase_day_of_week')['all_counts'])
pur_late['late_prop'] = pur_late['late_counts'] / pur_late['all_count']
pur_late['day_of_week'] = pur_late['purchase_day_of_week'].map({0: 'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'})
pur_list = pur_late['purchase_day_of_week'].to_list()


In [24]:
def propPurLate(data):
    if(data in pur_list):
        return pur_late[pur_late.purchase_day_of_week == data]['late_prop'].values[0]
    else:
        return 0

train_data['pur_late_prop'] = train_data['purchase_day_of_week'].apply(propPurLate)


In [25]:
test_data['pur_late_prop'] = test_data['purchase_day_of_week'].apply(propPurLate)


In [26]:
app_late = train_data[train_data.is_late == 1]['approved_day_of_week'].value_counts().rename_axis('approved_day_of_week').reset_index(name='late_counts')
app_all = train_data['approved_day_of_week'].value_counts().rename_axis('approved_day_of_week').reset_index(name='all_counts')
app_late['all_count'] = app_late.approved_day_of_week.map(app_all.set_index('approved_day_of_week')['all_counts'])
app_late['late_prop'] = app_late['late_counts'] / app_late['all_count']
app_late['day_of_week'] = app_late['approved_day_of_week'].map({0: 'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'})
app_list = app_late['approved_day_of_week'].to_list()


In [27]:
def propAppLate(data):
    if(data in app_list):
        return app_late[app_late.approved_day_of_week == data]['late_prop'].values[0]
    else:
        return 0

train_data['app_late_prop'] = train_data['approved_day_of_week'].apply(propAppLate)


In [28]:
test_data['app_late_prop'] = test_data['approved_day_of_week'].apply(propAppLate)


In [29]:
train_data['purchase_hour'] = train_data['order_purchase_timestamp'].dt.hour
train_data['approved_hour'] = train_data['order_approved_at'].dt.hour


In [30]:
test_data['purchase_hour'] = test_data['order_purchase_timestamp'].dt.hour
test_data['approved_hour'] = test_data['order_approved_at'].dt.hour


In [31]:
purh_late = train_data[train_data.is_late == 1]['purchase_hour'].value_counts().rename_axis('purchase_hour').reset_index(name='late_counts')
purh_all = train_data['purchase_hour'].value_counts().rename_axis('purchase_hour').reset_index(name='all_counts')
purh_late['all_count'] = purh_late.purchase_hour.map(purh_all.set_index('purchase_hour')['all_counts'])
purh_late['late_prop'] = purh_late['late_counts'] / purh_late['all_count']
purh_list =purh_late['purchase_hour'].to_list()
purh_late['hour'] = purh_late['purchase_hour'].map({0: '00.00', 1: '01.00', 2: '02.00', 3:'03.00', 4:'04.00', 5:'05.00', 6:'06.00', 7:'07.00',8:'08.00',9:'09.00',10:'10.00',11:'11.00',12:'12.00',13:'13.00',14:'14.00',15:'15.00',16:'16.00',17:'17.00',18:'18.00',19:'19.00',20:'20.00',21:'21.00',22:'22.00',23:'23.00'})


In [32]:
def propPurhLate(data):
    if(data in purh_list):
        return purh_late[purh_late.purchase_hour == data]['late_prop'].values[0]
    else:
        return 0

train_data['purh_late_prop'] = train_data['purchase_hour'].apply(propPurhLate)


In [33]:
test_data['purh_late_prop'] = test_data['purchase_hour'].apply(propPurhLate)


In [34]:
apph_late = train_data[train_data.is_late == 1]['approved_hour'].value_counts().rename_axis('approved_hour').reset_index(name='late_counts')
apph_all = train_data['approved_hour'].value_counts().rename_axis('approved_hour').reset_index(name='all_counts')
apph_late['all_count'] = apph_late.approved_hour.map(apph_all.set_index('approved_hour')['all_counts'])
apph_late['late_prop'] = apph_late['late_counts'] / apph_late['all_count']
apph_list =apph_late['approved_hour'].to_list()
apph_late['hour'] = apph_late['approved_hour'].map({0: '00.00', 1: '01.00', 2: '02.00', 3:'03.00', 4:'04.00', 5:'05.00', 6:'06.00', 7:'07.00',8:'08.00',9:'09.00',10:'10.00',11:'11.00',12:'12.00',13:'13.00',14:'14.00',15:'15.00',16:'16.00',17:'17.00',18:'18.00',19:'19.00',20:'20.00',21:'21.00',22:'22.00',23:'23.00'})


In [35]:
def propApphLate(data):
    if(data in apph_list):
        return apph_late[apph_late.approved_hour == data]['late_prop'].values[0]
    else:
        return 0

train_data['apph_late_prop'] = train_data['approved_hour'].apply(propApphLate)


In [36]:
test_data['apph_late_prop'] = test_data['approved_hour'].apply(propApphLate)


In [37]:
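# approved_min_purchase = approval time minus purchase time, in whole hours (truncated)
train_data['approved_min_purchase'] = ((train_data['order_approved_at'] - train_data['order_purchase_timestamp']).astype('timedelta64[s]') / pd.Timedelta(hours=1)).astype(int)
# display only: the float cast below is not assigned back to the column
train_data['approved_min_purchase'].astype(float)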


0         0.0
1         0.0
2        23.0
3         0.0
4         0.0
         ... 
89311    35.0
89312    15.0
89313     4.0
89314     0.0
89315     0.0
Name: approved_min_purchase, Length: 87109, dtype: float64
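
The column name reads as "approved minus purchase": the number of whole hours between purchase and payment approval. A minimal sketch of the same calculation on made-up timestamps, written with dt.total_seconds() as an alternative to the timedelta cast used above:

```python
import pandas as pd

# Hypothetical purchase/approval times.
df = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(["2018-01-01 10:00:00", "2018-01-01 10:00:00"]),
    "order_approved_at":        pd.to_datetime(["2018-01-01 10:20:00", "2018-01-02 09:30:00"]),
})

elapsed = df["order_approved_at"] - df["order_purchase_timestamp"]

# total_seconds() / 3600 gives fractional hours; astype(int) truncates to whole hours.
df["approved_min_purchase"] = (elapsed.dt.total_seconds() / 3600).astype(int)

print(df["approved_min_purchase"].tolist())
```

Because of the truncation, an approval that arrives 20 minutes after purchase counts as 0 hours, which is consistent with the many zeros in the output above.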

In [38]:
train_data[train_data['is_late'] == 1]['approved_min_purchase'].describe()

count    6706.000000
mean       12.325529
std        29.669757
min         0.000000
25%         0.000000
50%         0.000000
75%        20.000000
max       741.000000
Name: approved_min_purchase, dtype: float64

In [39]:
test_data['approved_min_purchase'] = ((test_data['order_approved_at'] - test_data['order_purchase_timestamp']).astype('timedelta64[s]') / pd.Timedelta(hours=1)).astype(int)
test_data['approved_min_purchase'].astype(float)


0         0.0
1         2.0
2        20.0
3         0.0
4        66.0
         ... 
38274     0.0
38275     0.0
38276    13.0
38277    19.0
38278     0.0
Name: approved_min_purchase, Length: 38094, dtype: float64

In [40]:
# Only two categories are produced: 'slow' at or above the 75th percentile of the training data, otherwise 'fast'
thirtil = train_data['approved_min_purchase'].quantile([0.75]).values[0]
train_data['apmpu_category'] = train_data['approved_min_purchase'].apply(lambda x: 'slow' if x >= thirtil else 'fast')
test_data['apmpu_category'] = test_data['approved_min_purchase'].apply(lambda x: 'slow' if x >= thirtil else 'fast')



In [41]:
apmpu_late = train_data[train_data.is_late == 1]['apmpu_category'].value_counts().rename_axis('apmpu_category').reset_index(name='late_counts')
apmpu_all = train_data['apmpu_category'].value_counts().rename_axis('apmpu_category').reset_index(name='all_counts')
apmpu_late['all_count'] = apmpu_late.apmpu_category.map(apmpu_all.set_index('apmpu_category')['all_counts'])
apmpu_late['late_prop'] = apmpu_late['late_counts'] / apmpu_late['all_count']
apmpu_list = apmpu_late['apmpu_category'].to_list()


In [42]:
def propApmpuLate(data):
    if(data in apmpu_list):
        return apmpu_late[apmpu_late.apmpu_category == data]['late_prop'].values[0]
    else:
        return 0

train_data['apmpu_late_prop'] = train_data['apmpu_category'].apply(propApmpuLate)
test_data['apmpu_late_prop'] = test_data['apmpu_category'].apply(propApmpuLate)



In [43]:
sectil = train_data['price'].quantile([0.25]).values[0]
thirtil = train_data['price'].quantile([0.75]).values[0]
train_data['price_category'] = train_data['price'].apply(lambda x:'expensive' if x>=thirtil else ('affordable' if x>=sectil and x<thirtil else 'cheap'))

In [44]:
price_late = train_data[train_data.is_late == 1]['price_category'].value_counts().rename_axis('price_category').reset_index(name='late_counts')
price_all = train_data['price_category'].value_counts().rename_axis('price_category').reset_index(name='all_counts')
price_late['all_count'] = price_late.price_category.map(price_all.set_index('price_category')['all_counts'])
price_late['late_prop'] =price_late['late_counts'] / price_late['all_count']
price_list =price_late['price_category'].to_list()


In [45]:
def priceLate(data):
    if(data in price_list):
        return price_late[price_late.price_category == data]['late_prop'].values[0]
    else:
        return 0

train_data['price_late_prop'] = train_data['price_category'].apply(priceLate)


In [46]:
test_data['price_category'] = test_data['price'].apply(lambda x:'expensive' if x>=thirtil else ('affordable' if x>=sectil and x<thirtil else 'cheap'))


In [47]:
test_data['price_late_prop'] = test_data['price_category'].apply(priceLate)


In [48]:
install_late = train_data[train_data.is_late == 1]['payment_installments'].value_counts().rename_axis('payment_installments').reset_index(name='late_counts')
install_all = train_data['payment_installments'].value_counts().rename_axis('payment_installments').reset_index(name='all_counts')
install_late['all_count'] = install_late.payment_installments.map(install_all.set_index('payment_installments')['all_counts'])
install_late['late_prop'] =install_late['late_counts'] / install_late['all_count']
install_list =install_late['payment_installments'].to_list()
install_late['install'] = install_late['payment_installments'].map({1: '1', 2: '2', 3: '3', 4:'4', 5:'5', 6:'6', 7:'7', 8:'8',9:'9',10:'10',11:'11',12:'12',13:'13',14:'14',15:'15',16:'16',17:'17',18:'18',19:'19',20:'20',21:'21',22:'22',23:'23',24:'24'})


In [49]:
def installLate(data):
    if(data in install_list):
        return install_late[install_late.payment_installments == data]['late_prop'].values[0]
    else:
        return 0

train_data['install_late_prop'] = train_data['payment_installments'].apply(installLate)


In [50]:
test_data['install_late_prop'] = test_data['payment_installments'].apply(installLate)


In [51]:
pt_late = train_data[train_data.is_late == 1]['payment_type'].value_counts().rename_axis('payment_type').reset_index(name='late_counts')
pt_all = train_data['payment_type'].value_counts().rename_axis('payment_type').reset_index(name='all_counts')
pt_late['all_count'] = pt_late.payment_type.map(pt_all.set_index('payment_type')['all_counts'])
pt_late['late_prop'] =pt_late['late_counts'] / pt_late['all_count']
pt_list =pt_late['payment_type'].to_list()


In [52]:
def ptLate(data):
    if(data in pt_list):
        return pt_late[pt_late.payment_type == data]['late_prop'].values[0]
    else:
        return 0

train_data['pt_late_prop'] = train_data['payment_type'].apply(ptLate)


In [53]:
test_data['pt_late_prop'] = test_data['payment_type'].apply(ptLate)


In [54]:
sectil = train_data['product_weight_g'].quantile([0.25]).values[0]
thirtil = train_data['product_weight_g'].quantile([0.75]).values[0]
train_data['weight_category'] = train_data['product_weight_g'].apply(lambda x:'heavy' if x>=thirtil else ('medium' if x>=sectil and x<thirtil else 'light'))


In [55]:
test_data['weight_category'] = test_data['product_weight_g'].apply(lambda x:'heavy' if x>=thirtil else ('medium' if x>=sectil and x<thirtil else 'light'))


In [56]:
pw_late = train_data[train_data.is_late == 1]['weight_category'].value_counts().rename_axis('weight_category').reset_index(name='late_counts')
pw_all = train_data['weight_category'].value_counts().rename_axis('weight_category').reset_index(name='all_counts')
pw_late['all_count'] = pw_late.weight_category.map(pw_all.set_index('weight_category')['all_counts'])
pw_late['late_prop'] =pw_late['late_counts'] / pw_late['all_count']
pw_list =pw_late['weight_category'].to_list()


In [57]:
def pwLate(data):
    if(data in pw_list):
        return pw_late[pw_late.weight_category == data]['late_prop'].values[0]
    else:
        return 0

train_data['pw_late_prop'] = train_data['weight_category'].apply(pwLate)


In [58]:
test_data['pw_late_prop'] = test_data['weight_category'].apply(pwLate)


In [59]:
train_data['shipping_charges'].describe()

count    87109.000000
mean        44.212876
std         37.582343
min          0.000000
25%         20.020000
50%         35.010000
75%         57.140000
max        409.680000
Name: shipping_charges, dtype: float64

In [60]:
sectil = train_data['shipping_charges'].quantile([0.25]).values[0]
thirtil = train_data['shipping_charges'].quantile([0.75]).values[0]
train_data['freight_category'] = train_data['shipping_charges'].apply(lambda x:'expensive' if x>=thirtil else ('medium' if x>=sectil and x<thirtil else 'cheap'))


In [61]:
test_data['freight_category'] = test_data['shipping_charges'].apply(lambda x:'expensive' if x>=thirtil else ('medium' if x>=sectil and x<thirtil else 'cheap'))


In [62]:
sc_late = train_data[train_data.is_late == 1]['freight_category'].value_counts().rename_axis('freight_category').reset_index(name='late_counts')
sc_all = train_data['freight_category'].value_counts().rename_axis('freight_category').reset_index(name='all_counts')
sc_late['all_count'] = sc_late.freight_category.map(sc_all.set_index('freight_category')['all_counts'])
sc_late['late_prop'] =sc_late['late_counts'] / sc_late['all_count']
sc_list =sc_late['freight_category'].to_list()


In [63]:
def scLate(data):
    if(data in sc_list):
        return sc_late[sc_late.freight_category == data]['late_prop'].values[0]
    else:
        return 0

train_data['sc_late_prop'] = train_data['freight_category'].apply(scLate)


In [64]:
test_data['sc_late_prop'] = test_data['freight_category'].apply(scLate)


In [65]:
ps_late = train_data[train_data.is_late == 1]['payment_sequential'].value_counts().rename_axis('payment_sequential').reset_index(name='late_counts')
ps_all = train_data['payment_sequential'].value_counts().rename_axis('payment_sequential').reset_index(name='all_counts')
ps_late['all_count'] = ps_late.payment_sequential.map(ps_all.set_index('payment_sequential')['all_counts'])
ps_late['late_prop'] =ps_late['late_counts'] / ps_late['all_count']
ps_list =ps_late['payment_sequential'].to_list()
# ps_list['seq'] = ps_list['payment_sequential'].map({1: '1', 2: '2', 3: '3', 4:'4', 5:'5', 6:'6', 7:'7', 8:'8',9:'9',10:'10',11:'11',12:'12',13:'13',14:'14',15:'15',16:'16',17:'17',18:'18',19:'19',20:'20',21:'21',22:'22',23:'23',24:'24'})


In [66]:
def psLate(data):
    if(data in ps_list):
        return ps_late[ps_late.payment_sequential == data]['late_prop'].values[0]
    else:
        return 0

train_data['ps_late_prop'] = train_data['payment_sequential'].apply(psLate)


In [67]:
test_data['ps_late_prop'] = test_data['payment_sequential'].apply(psLate)


In [68]:
zip_late = train_data[train_data.is_late == 1]['customer_zip_code_prefix'].value_counts().rename_axis('zip_code_prefix').reset_index(name='late_counts')


In [69]:
sectil = zip_late['late_counts'].quantile([0.25]).values[0]
thirtil = zip_late['late_counts'].quantile([0.75]).values[0]

# Frequency labels are Indonesian: 'sering' = often, 'lumayan' = fairly often, 'jarang' = rarely
zip_late['dist_late'] = zip_late['late_counts'].apply(lambda x:'sering' if x>=thirtil else ('lumayan' if x>=sectil and x<thirtil else 'jarang'))


In [70]:
zip_all = train_data['customer_zip_code_prefix'].value_counts().rename_axis('zip_code_prefix').reset_index(name='all_counts')
zip_late['all_count'] = zip_late.zip_code_prefix.map(zip_all.set_index('zip_code_prefix')['all_counts'])


In [71]:
zip_late['late_prop'] = zip_late['late_counts'] / zip_late['all_count']
zip_late['zip_code_prefix'] = zip_late['zip_code_prefix'].astype('category')


In [72]:
zip_list = zip_late['zip_code_prefix'].to_list()

def propZipLate(data):
    if(data in zip_list):
        return zip_late[zip_late.zip_code_prefix == data]['late_prop'].values[0]
    else:
        return 0

train_data['zip_late_prop'] = train_data['customer_zip_code_prefix'].apply(propZipLate)


In [73]:
test_data['zip_late_prop'] = test_data['customer_zip_code_prefix'].apply(propZipLate)


In [74]:
def countZipLate(data):
    if(data in zip_list):
        return zip_late[zip_late.zip_code_prefix == data]['late_counts'].values[0]
    else:
        return 0

train_data['zip_late_count'] = train_data['customer_zip_code_prefix'].apply(countZipLate)


In [75]:
test_data['zip_late_count'] = test_data['customer_zip_code_prefix'].apply(countZipLate)


In [76]:
sering = zip_late[zip_late.dist_late == 'sering']['zip_code_prefix'].to_list()
lumayan = zip_late[zip_late.dist_late == 'lumayan']['zip_code_prefix'].to_list()
jarang = zip_late[zip_late.dist_late == 'jarang']['zip_code_prefix'].to_list()

In [77]:
def zipLate(data):
    if(data in sering):
        return 'sering'
    elif(data in lumayan):
        return 'lumayan'
    elif(data in jarang):
        return 'jarang'
    else:
        return 'never'

train_data['zip_late_freq'] = train_data['customer_zip_code_prefix'].apply(zipLate)


In [78]:
test_data['zip_late_freq'] = test_data['customer_zip_code_prefix'].apply(zipLate)
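
The zip-code features above check list membership for every row, which gets slow when there are many distinct prefixes. The sketch below shows one possible vectorised equivalent on hypothetical toy data: a Series indexed by zip prefix acts as the lookup table, and prefixes with no late orders fall back to 'never' or 0. The label strings match the notebook's ('sering' = often, 'lumayan' = fairly often, 'jarang' = rarely).

```python
import pandas as pd

# Hypothetical toy data with zip prefixes and the late label.
train = pd.DataFrame({
    "customer_zip_code_prefix": [1001, 1001, 1001, 2002, 2002, 3003, 4004],
    "is_late":                  [1,    1,    1,    1,    0,    1,    0],
})
test = pd.DataFrame({"customer_zip_code_prefix": [1001, 4004, 9999]})

# Late deliveries per zip prefix (only prefixes with at least one late order).
late_counts = train.loc[train["is_late"] == 1, "customer_zip_code_prefix"].value_counts()

# Quartile thresholds over the late counts, as in the notebook.
q1, q3 = late_counts.quantile([0.25, 0.75])
freq = late_counts.apply(
    lambda c: "sering" if c >= q3 else ("lumayan" if c >= q1 else "jarang")
)

# One vectorised lookup each; prefixes never seen among late orders get a default.
test["zip_late_freq"] = test["customer_zip_code_prefix"].map(freq).fillna("never")
test["zip_late_count"] = test["customer_zip_code_prefix"].map(late_counts).fillna(0).astype(int)

print(test)
```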


In [79]:
def hasLate(data):
    if(data == 'never'):
        return 0
    else:
        return 1
train_data['zip_ever_late'] = train_data['zip_late_freq'].apply(hasLate)


In [80]:
test_data['zip_ever_late'] = test_data['zip_late_freq'].apply(hasLate)


In [81]:
city_late = train_data[train_data.is_late == 1]['customer_city'].value_counts().rename_axis('customer_city').reset_index(name='late_counts')


In [82]:
city_all = train_data['customer_city'].value_counts().rename_axis('customer_city').reset_index(name='all_counts')
city_late['all_count'] = city_late.customer_city.map(city_all.set_index('customer_city')['all_counts'])
city_late['late_prop'] = city_late['late_counts'] / city_late['all_count']


In [83]:
sectil = city_late['late_counts'].quantile([0.25]).values[0]
thirtil = city_late['late_counts'].quantile([0.75]).values[0]
city_late['dist_late'] = city_late['late_counts'].apply(lambda x:'sering' if x>=thirtil else ('lumayan' if x>=sectil and x<thirtil else 'jarang'))



In [84]:
city_list = city_late['customer_city'].to_list()

def propCityLate(data):
    if(data in city_list):
        return city_late[city_late.customer_city == data]['late_prop'].values[0]
    else:
        return 0

train_data['city_late_prop'] = train_data['customer_city'].apply(propCityLate)


In [85]:
test_data['city_late_prop'] = test_data['customer_city'].apply(propCityLate)


In [86]:
def countCityLate(data):
    if(data in city_list):
        return city_late[city_late.customer_city == data]['late_counts'].values[0]
    else:
        return 0

train_data['city_late_count'] = train_data['customer_city'].apply(countCityLate)


In [87]:
test_data['city_late_count'] = test_data['customer_city'].apply(countCityLate)


In [88]:
c_sering = city_late[city_late.dist_late == 'sering']['customer_city'].to_list()
c_lumayan = city_late[city_late.dist_late == 'lumayan']['customer_city'].to_list()
c_jarang = city_late[city_late.dist_late == 'jarang']['customer_city'].to_list()

def cityLate(data):
    if(data in c_sering):
        return 'sering'
    elif(data in c_lumayan):
        return 'lumayan'
    elif(data in c_jarang):
        return 'jarang'
    else:
        return 'never'

train_data['city_late_freq'] = train_data['customer_city'].apply(cityLate)


In [89]:
test_data['city_late_freq'] = test_data['customer_city'].apply(cityLate)


In [90]:
train_data['city_ever_late'] = train_data['city_late_freq'].apply(hasLate)


In [91]:
test_data['city_ever_late'] = test_data['city_late_freq'].apply(hasLate)


In [92]:
state_late = train_data[train_data.is_late == 1]['customer_state'].value_counts().rename_axis('customer_state').reset_index(name='late_counts')


In [94]:
state_all = train_data['customer_state'].value_counts().rename_axis('customer_state').reset_index(name='all_counts')
state_late['all_count'] = state_late.customer_state.map(state_all.set_index('customer_state')['all_counts'])
state_late['late_prop'] = state_late['late_counts'] / state_late['all_count']


In [95]:
sectil = state_late['late_counts'].quantile([0.25]).values[0]
thirtil = state_late['late_counts'].quantile([0.75]).values[0]
state_late['dist_late'] = state_late['late_counts'].apply(lambda x:'sering' if x>=thirtil else ('lumayan' if x>=sectil and x<thirtil else 'jarang'))


In [96]:
state_list = state_late['customer_state'].to_list()

def propStateLate(data):
    if(data in state_list):
        return state_late[state_late.customer_state == data]['late_prop'].values[0]
    else:
        return 0

train_data['state_late_prop'] = train_data['customer_state'].apply(propStateLate)


In [97]:
test_data['state_late_prop'] = test_data['customer_state'].apply(propStateLate)


In [98]:
def countStateLate(data):
    if(data in state_list):
        return state_late[state_late.customer_state == data]['late_counts'].values[0]
    else:
        return 0

train_data['state_late_count'] = train_data['customer_state'].apply(countStateLate)


In [99]:
test_data['state_late_count'] = test_data['customer_state'].apply(countStateLate)


In [100]:
s_sering = state_late[state_late.dist_late == 'sering']['customer_state'].to_list()
s_lumayan = state_late[state_late.dist_late == 'lumayan']['customer_state'].to_list()
s_jarang = state_late[state_late.dist_late == 'jarang']['customer_state'].to_list()

def stateLate(data):
    if(data in s_sering):
        return 'sering'
    elif(data in s_lumayan):
        return 'lumayan'
    elif(data in s_jarang):
        return 'jarang'
    else:
        return 'never'

train_data['state_late_freq'] = train_data['customer_state'].apply(stateLate)


In [101]:
test_data['state_late_freq'] = test_data['customer_state'].apply(stateLate)


In [102]:
train_data['state_ever_late'] = train_data['state_late_freq'].apply(hasLate)


In [103]:
test_data['state_ever_late'] = test_data['state_late_freq'].apply(hasLate)


In [104]:
train_data[train_data['state_ever_late'] == 0]

Unnamed: 0,order_id,customer_id,order_purchase_timestamp,order_approved_at,order_delivered_timestamp,order_estimated_delivery_date,customer_zip_code_prefix,customer_city,customer_state,product_id,seller_id,price,shipping_charges,payment_sequential,payment_type,payment_installments,payment_value,product_category_name,product_weight_g,product_length_cm,product_height_cm,product_width_cm,is_late,product_volume,volume_category,vol_late_prop,purchase_day_of_week,approved_day_of_week,pur_late_prop,app_late_prop,purchase_hour,approved_hour,purh_late_prop,apph_late_prop,approved_min_purchase,apmpu_category,apmpu_late_prop,price_category,price_late_prop,install_late_prop,pt_late_prop,weight_category,pw_late_prop,freight_category,sc_late_prop,ps_late_prop,zip_late_prop,zip_late_count,zip_late_freq,zip_ever_late,city_late_prop,city_late_count,city_late_freq,city_ever_late,state_late_prop,state_late_count,state_late_freq,state_ever_late


In [105]:
test_data[test_data['state_ever_late'] == 0]

Unnamed: 0,order_id,customer_id,order_purchase_timestamp,order_approved_at,customer_zip_code_prefix,customer_city,customer_state,product_id,seller_id,price,shipping_charges,payment_sequential,payment_type,payment_installments,payment_value,product_category_name,product_weight_g,product_length_cm,product_height_cm,product_width_cm,product_volume,volume_category,vol_late_prop,purchase_day_of_week,approved_day_of_week,pur_late_prop,app_late_prop,purchase_hour,approved_hour,purh_late_prop,apph_late_prop,approved_min_purchase,apmpu_category,apmpu_late_prop,price_category,price_late_prop,install_late_prop,pt_late_prop,weight_category,pw_late_prop,freight_category,sc_late_prop,ps_late_prop,zip_late_prop,zip_late_count,zip_late_freq,zip_ever_late,city_late_prop,city_late_count,city_late_freq,city_ever_late,state_late_prop,state_late_count,state_late_freq,state_ever_late


In [106]:
# Every state has at least one late delivery, so state_ever_late is constant; drop it
train_data = train_data.drop(['state_ever_late'], axis=1)


In [107]:
# Every state has at least one late delivery, so state_ever_late is constant; drop it
test_data = test_data.drop(['state_ever_late'], axis=1)
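
The empty results above were checked by eye. A programmatic version of the same check, sketched on a hypothetical two-column frame, flags any column that holds a single distinct value and drops it:

```python
import pandas as pd

# Hypothetical frame: one flag is constant, the other is not.
df = pd.DataFrame({
    "state_ever_late": [1, 1, 1, 1],
    "city_ever_late":  [1, 0, 1, 1],
})

# A column with a single distinct value cannot help a classifier separate classes.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
df = df.drop(columns=constant_cols)

print(constant_cols, df.columns.tolist())
```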


In [108]:
seller_late = train_data[train_data.is_late == 1]['seller_id'].value_counts().rename_axis('seller_id').reset_index(name='late_counts')


In [109]:
seller_all = train_data['seller_id'].value_counts().rename_axis('seller_id').reset_index(name='all_counts')
seller_late['all_count'] = seller_late.seller_id.map(seller_all.set_index('seller_id')['all_counts'])
seller_late['late_prop'] = seller_late['late_counts'] / seller_late['all_count']


In [110]:
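# Bin sellers by their late-delivery count using the quartiles of late_counts.
# Labels are Indonesian: 'sering' = often, 'lumayan' = fairly often, 'jarang' = rarely.
sectil = seller_late['late_counts'].quantile(0.25)   # first quartile (Q1)
thirtil = seller_late['late_counts'].quantile(0.75)  # third quartile (Q3)
seller_late['dist_late'] = seller_late['late_counts'].apply(lambda x: 'sering' if x >= thirtil else ('lumayan' if x >= sectil else 'jarang'))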
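
To illustrate the binning rule in the cell above, the following sketch applies the same thresholds to a small, made-up series of late counts with np.select. The values in late_counts are hypothetical; only the rule itself (Q3 and above, between Q1 and Q3, below Q1) comes from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical late counts per seller (not taken from the dataset)
late_counts = pd.Series([1, 1, 2, 3, 5, 8, 13, 21], name='late_counts')

q1 = late_counts.quantile(0.25)   # first quartile
q3 = late_counts.quantile(0.75)   # third quartile

# Conditions are checked in order, so the second one only applies to values below Q3:
# >= Q3 -> 'sering' (often), [Q1, Q3) -> 'lumayan' (fairly often), < Q1 -> 'jarang' (rarely)
dist_late = np.select(
    [late_counts >= q3, late_counts >= q1],
    ['sering', 'lumayan'],
    default='jarang',
)
print(q1, q3, dist_late.tolist())
```

Note that when Q1 and Q3 coincide (for example when most entities have exactly one late order), the 'lumayan' bin stays empty under this rule.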



In [111]:
seller_list = seller_late['seller_id'].to_list()

def propSellerLate(data):
    if(data in seller_list):
        return seller_late[seller_late.seller_id == data]['late_prop'].values[0]
    else:
        return 0

train_data['seller_late_prop'] = train_data['seller_id'].apply(propSellerLate)


In [112]:
test_data['seller_late_prop'] = test_data['seller_id'].apply(propSellerLate)


In [113]:
def countSellerLate(data):
    if(data in seller_list):
        return seller_late[seller_late.seller_id == data]['late_counts'].values[0]
    else:
        return 0

train_data['seller_late_count'] = train_data['seller_id'].apply(countSellerLate)


In [114]:
test_data['seller_late_count'] = test_data['seller_id'].apply(countSellerLate)


In [115]:
sel_sering = seller_late[seller_late.dist_late == 'sering']['seller_id'].to_list()
sel_lumayan = seller_late[seller_late.dist_late == 'lumayan']['seller_id'].to_list()
sel_jarang = seller_late[seller_late.dist_late == 'jarang']['seller_id'].to_list()

def sellerLate(data):
    if(data in sel_sering):
        return 'sering'
    elif(data in sel_lumayan):
        return 'lumayan'
    elif(data in sel_jarang):
        return 'jarang'
    else:
        return 'never'

train_data['seller_late_freq'] = train_data['seller_id'].apply(sellerLate)


In [116]:
test_data['seller_late_freq'] = test_data['seller_id'].apply(sellerLate)
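
The seller lookups above scan seller_list and filter seller_late once per row. A possible alternative, sketched here on toy data, computes the per-seller statistics once with groupby and attaches them with Series.map. The frames toy_train and toy_test, their values, and the name late_stats are illustrative assumptions and not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for train_data / test_data (hypothetical values)
toy_train = pd.DataFrame({
    'seller_id': ['s1', 's1', 's2', 's2', 's2', 's3'],
    'is_late':   [1,    0,    1,    1,    0,    0],
})
toy_test = pd.DataFrame({'seller_id': ['s1', 's3', 's9']})

# Per-seller late count and late proportion, computed once from the training data
late_stats = toy_train.groupby('seller_id')['is_late'].agg(late_count='sum', late_prop='mean')

for df in (toy_train, toy_test):
    # Sellers that never appear in training get 0; sellers that were never late already are 0
    df['seller_late_prop'] = df['seller_id'].map(late_stats['late_prop']).fillna(0)
    df['seller_late_count'] = df['seller_id'].map(late_stats['late_count']).fillna(0).astype(int)

print(toy_test)
```

The same pattern would apply to product_id and product_category_name by changing the grouping column.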


In [117]:
prod_late = train_data[train_data.is_late == 1]['product_id'].value_counts().rename_axis('product_id').reset_index(name='late_counts')


In [129]:
prod_all = train_data['product_id'].value_counts().rename_axis('product_id').reset_index(name='all_counts')
prod_late['all_count'] = prod_late.product_id.map(prod_all.set_index('product_id')['all_counts'])
prod_late['late_prop'] = prod_late['late_counts'] / prod_late['all_count']


In [130]:
# Same quartile binning as for sellers ('sering' = often, 'lumayan' = fairly often, 'jarang' = rarely)
sectil = prod_late['late_counts'].quantile(0.25)   # first quartile (Q1)
thirtil = prod_late['late_counts'].quantile(0.75)  # third quartile (Q3)
prod_late['dist_late'] = prod_late['late_counts'].apply(lambda x: 'sering' if x >= thirtil else ('lumayan' if x >= sectil else 'jarang'))



In [131]:
print(prod_late.head())  # This will show the first few rows along with column names


     product_id  late_counts  all_count dist_late  late_prop
0  9NwzO0Pm0fDM           48        383    sering   0.125326
1  SLTlrWtcYt1m           34        319    sering   0.106583
2  8IhgV2nH9kXE           30        106    sering   0.283019
3  ssZQDTdv1ISb           29        150    sering   0.193333
4  UgkSjxoiV9Ev           29        378    sering   0.076720


In [132]:
prod_list = prod_late['product_id'].to_list()

def propProductLate(data):
    if(data in prod_list):
        return prod_late[prod_late.product_id == data]['late_prop'].values[0]
    else:
        return 0
train_data['product_late_prop'] = train_data['product_id'].apply(propProductLate)


In [133]:
test_data['product_late_prop'] = test_data['product_id'].apply(propProductLate)


In [134]:
def countProductLate(data):
    if(data in prod_list):
        return prod_late[prod_late.product_id == data]['late_counts'].values[0]
    else:
        return 0

train_data['product_late_count'] = train_data['product_id'].apply(countProductLate)


In [135]:
test_data['product_late_count'] = test_data['product_id'].apply(countProductLate)


In [136]:
prod_sering = prod_late[prod_late.dist_late == 'sering']['product_id'].to_list()
prod_lumayan = prod_late[prod_late.dist_late == 'lumayan']['product_id'].to_list()
prod_jarang = prod_late[prod_late.dist_late == 'jarang']['product_id'].to_list()

def productLate(data):
    if(data in prod_sering):
        return 'sering'
    elif(data in prod_lumayan):
        return 'lumayan'
    elif(data in prod_jarang):
        return 'jarang'
    else:
        return 'never'

train_data['product_late_freq'] = train_data['product_id'].apply(productLate)


In [137]:
test_data['product_late_freq'] = test_data['product_id'].apply(productLate)


In [138]:
cat_name = train_data['product_category_name'].value_counts().rename_axis('product_category_name').reset_index(name='counts')


In [139]:
cat_late = train_data[train_data.is_late == 1]['product_category_name'].value_counts().rename_axis('product_category_name').reset_index(name='count_late')


In [140]:
cat_late['all_count'] = cat_late.product_category_name.map(cat_name.set_index('product_category_name')['counts'])

In [141]:
cat_late['late_prop'] = cat_late['count_late'] / cat_late['all_count']


In [142]:
cat_late.describe()

Unnamed: 0,count_late,all_count,late_prop
count,50.0,50.0,50.0
mean,134.12,1737.22,0.079256
std,715.871487,9237.587361,0.032917
min,1.0,7.0,0.029412
25%,4.0,59.5,0.056876
50%,11.0,121.0,0.076691
75%,48.75,637.25,0.092
max,5084.0,65618.0,0.210526


In [143]:
# Same quartile binning, applied to product categories ('sering' = often, 'lumayan' = fairly often, 'jarang' = rarely)
sectil = cat_late['count_late'].quantile(0.25)   # first quartile (Q1)
thirtil = cat_late['count_late'].quantile(0.75)  # third quartile (Q3)
cat_late['dist_late'] = cat_late['count_late'].apply(lambda x: 'sering' if x >= thirtil else ('lumayan' if x >= sectil else 'jarang'))


In [144]:
cat_list = cat_late['product_category_name'].to_list()

def propCatLate(data):
    if(data in cat_list):
        return cat_late[cat_late.product_category_name == data]['late_prop'].values[0]
    else:
        return 0

train_data['cat_late_prop'] = train_data['product_category_name'].apply(propCatLate)


In [145]:
test_data['cat_late_prop'] = test_data['product_category_name'].apply(propCatLate)


In [146]:
cat_sering = cat_late[cat_late.dist_late == 'sering']['product_category_name'].to_list()
cat_lumayan = cat_late[cat_late.dist_late == 'lumayan']['product_category_name'].to_list()
cat_jarang = cat_late[cat_late.dist_late == 'jarang']['product_category_name'].to_list()

def catLate(data):
    if(data in cat_sering):
        return 'sering'
    elif(data in cat_lumayan):
        return 'lumayan'
    elif(data in cat_jarang):
        return 'jarang'
    else:
        return 'never'

train_data['prod_cat_late_freq'] = train_data['product_category_name'].apply(catLate)


In [147]:
test_data['prod_cat_late_freq'] = test_data['product_category_name'].apply(catLate)


In [148]:
train_data.to_csv('./raw-train-intel2.csv', index=False)

In [149]:
train_data['prob_mean'] = train_data[['vol_late_prop', 'pur_late_prop', 'app_late_prop',
                                'purh_late_prop', 'apph_late_prop', 'price_late_prop',
                                'install_late_prop', 'pt_late_prop', 'pw_late_prop',
                                'sc_late_prop', 'ps_late_prop', 'zip_late_prop',
                                'city_late_prop', 'state_late_prop', 'seller_late_prop', 
                                'product_late_prop', 'cat_late_prop', 'apmpu_late_prop']].mean(axis = 1, skipna = True)


In [150]:
test_data['prob_mean'] = test_data[['vol_late_prop', 'pur_late_prop', 'app_late_prop',
                                'purh_late_prop', 'apph_late_prop', 'price_late_prop',
                                'install_late_prop', 'pt_late_prop', 'pw_late_prop',
                                'sc_late_prop', 'ps_late_prop', 'zip_late_prop',
                                'city_late_prop', 'state_late_prop', 'seller_late_prop', 
                                'product_late_prop', 'apmpu_late_prop','cat_late_prop']].mean(axis = 1, skipna = True)
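
The two cells above list the same columns in a slightly different order, which does not affect a row-wise mean. As a small sketch of what prob_mean computes, the example below uses a shared column list on made-up values. The name prop_cols and the numbers are assumptions for illustration; only three of the eighteen columns are shown.

```python
import pandas as pd

# Hypothetical subset of the *_late_prop columns used above
prop_cols = ['vol_late_prop', 'seller_late_prop', 'cat_late_prop']

toy = pd.DataFrame({
    'vol_late_prop':    [0.10, 0.20],
    'seller_late_prop': [0.30, None],   # missing value in the second row
    'cat_late_prop':    [0.20, 0.40],
})

# Row-wise mean; skipna=True averages over the values that are present
toy['prob_mean'] = toy[prop_cols].mean(axis=1, skipna=True)
print(toy)
```

Defining the list once and reusing it for both train and test would keep the two frames from drifting apart if a column is added later.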


In [151]:
train_fix = train_data.drop(['customer_id', 'order_purchase_timestamp', 'order_approved_at', 
                             'order_delivered_timestamp', 'order_estimated_delivery_date',
                             'product_length_cm', 'product_height_cm', 'product_width_cm'
                               ], axis=1)

In [152]:
test_fix = test_data.drop(['customer_id', 'order_purchase_timestamp', 'order_approved_at', 
                       'product_length_cm', 'product_height_cm', 'product_width_cm',
                      ], axis=1)

In [153]:
train_fix['approved_day_of_week'] = train_fix['approved_day_of_week'].astype(int)
train_fix['approved_hour'] = train_fix['approved_hour'].astype(int)
train_fix['approved_min_purchase'] = train_fix['approved_min_purchase'].astype(float)
for col in train_fix.select_dtypes(include=["object"]).columns:
    train_fix[col] = train_fix[col].astype('category')

In [154]:
test_fix['approved_day_of_week'] = test_fix['approved_day_of_week'].astype(int)
test_fix['approved_hour'] = test_fix['approved_hour'].astype(int)
test_fix['approved_min_purchase'] = test_fix['approved_min_purchase'].astype(float)
for col in test_fix.select_dtypes(include=["object"]).columns:
    test_fix[col] = test_fix[col].astype('category')
    


In [155]:
ohe_tr = pd.get_dummies(train_fix, columns = ['payment_type'])


In [156]:
ohe_ts = pd.get_dummies(test_fix, columns = ['payment_type'])


In [157]:
for col in ohe_tr.select_dtypes(include=["bool"]).columns:
    ohe_tr[col] = ohe_tr[col].astype('int')

In [158]:
for col in ohe_ts.select_dtypes(include=["bool"]).columns:
    ohe_ts[col] = ohe_ts[col].astype('int')

In [159]:
ohe_tr['customer_zip_code_prefix'] = ohe_tr['customer_zip_code_prefix'].astype('category')
ohe_ts['customer_zip_code_prefix'] = ohe_ts['customer_zip_code_prefix'].astype('category')


In [160]:
label_encoder = preprocessing.LabelEncoder()

# Label-encode the ID-like columns. The encoder is fit on train and test together
# so that values appearing only in the test set still receive a code.
for col in ['customer_zip_code_prefix', 'customer_city', 'customer_state',
            'product_id', 'seller_id', 'product_category_name']:
    label_encoder.fit(pd.concat([ohe_tr[col], ohe_ts[col]]))
    ohe_tr[col] = label_encoder.transform(ohe_tr[col])
    ohe_ts[col] = label_encoder.transform(ohe_ts[col])
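
The cell above fits each encoder on train and test combined so that no test value is unseen. One alternative, assuming it is acceptable to give unseen values a sentinel code, is scikit-learn's OrdinalEncoder fitted on the training data only. The frames and state codes below are toy values, and the choice of -1 as the sentinel is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy_train = pd.DataFrame({'customer_state': ['SP', 'RJ', 'MG', 'SP']})
toy_test = pd.DataFrame({'customer_state': ['RJ', 'BA']})  # 'BA' does not occur in training

# Fit on the training data only; categories unseen at fit time are encoded as -1
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_codes = encoder.fit_transform(toy_train[['customer_state']])
test_codes = encoder.transform(toy_test[['customer_state']])

print(train_codes.ravel(), test_codes.ravel())
```

Either way, integer codes for IDs impose an arbitrary order, which tree-based models tolerate better than distance-based ones such as KNN.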


In [161]:
# Ordinal-encode the ordered categorical features. The result is assigned back to the
# column because pandas warns that the chained `inplace=True` form stops working in 3.0.
late_freq_map = {'never': 0, 'jarang': 1, 'lumayan': 2, 'sering': 3}
ordinal_maps = {
    'price_category': {'cheap': 0, 'affordable': 1, 'expensive': 2},
    'volume_category': {'small': 0, 'medium': 1, 'large': 2},
    'weight_category': {'light': 0, 'medium': 1, 'heavy': 2},
    'freight_category': {'cheap': 0, 'medium': 1, 'expensive': 2},
    'apmpu_category': {'fast': 0, 'medium': 1, 'slow': 2},
    'zip_late_freq': late_freq_map,
    'city_late_freq': late_freq_map,
    'state_late_freq': late_freq_map,
    'seller_late_freq': late_freq_map,
    'product_late_freq': late_freq_map,
    'prod_cat_late_freq': late_freq_map,
}
for col, mapping in ordinal_maps.items():
    ohe_tr[col] = ohe_tr[col].replace(mapping)

In [162]:
# Apply the same ordinal encoding to the test data, again assigning the result back
# instead of using the chained `inplace=True` form that pandas warns about.
late_freq_map = {'never': 0, 'jarang': 1, 'lumayan': 2, 'sering': 3}
ordinal_maps = {
    'price_category': {'cheap': 0, 'affordable': 1, 'expensive': 2},
    'volume_category': {'small': 0, 'medium': 1, 'large': 2},
    'weight_category': {'light': 0, 'medium': 1, 'heavy': 2},
    'freight_category': {'cheap': 0, 'medium': 1, 'expensive': 2},
    'apmpu_category': {'fast': 0, 'medium': 1, 'slow': 2},
    'zip_late_freq': late_freq_map,
    'city_late_freq': late_freq_map,
    'state_late_freq': late_freq_map,
    'seller_late_freq': late_freq_map,
    'product_late_freq': late_freq_map,
    'prod_cat_late_freq': late_freq_map,
}
for col, mapping in ordinal_maps.items():
    ohe_ts[col] = ohe_ts[col].replace(mapping)

In [163]:
fix_data = ohe_tr.copy()
fix_data2 = ohe_ts.copy()

In [164]:
fix_data.to_csv('./intel-train-finn3.csv', index=False)
fix_data2.to_csv('./intel-test-finn3.csv', index=False)

In [165]:
roro = fix_data.drop(['order_id'], axis=1)


In [166]:
# Create correlation matrix
corr_matrix = roro.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Keep every feature whose absolute correlation with at least one earlier column exceeds 0.5
to_use = [column for column in upper.columns if any(upper[column] > 0.5)]
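
As a worked example of what the upper-triangle selection above returns, the sketch below runs the same three steps on a three-column toy frame. The data and the toy_ variable names are assumptions chosen so that the result can be checked by hand.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],   # exact multiple of 'a', so |corr(a, b)| = 1
    'c': [1, -1, 0, -1, 1],  # uncorrelated with 'a' and 'b'
})

toy_corr = toy.corr().abs()
# Keep only entries above the diagonal; everything else becomes NaN
toy_upper = toy_corr.where(np.triu(np.ones(toy_corr.shape), k=1).astype(bool))
# A column is selected when it correlates above 0.5 with a column to its left
toy_selected = [column for column in toy_upper.columns if any(toy_upper[column] > 0.5)]
print(toy_selected)  # ['b']
```

Only 'b' is selected: 'a' has no column to its left, so the first member of a correlated pair is not picked up by this rule.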


In [167]:
X = fix_data.drop(['is_late', 'order_id'], axis=1)
y = fix_data['is_late']

In [168]:
X[X.select_dtypes(include=["category"]).columns] = X[X.select_dtypes(include=["category"]).columns].astype(int)


In [169]:
X_br = X[to_use]
y_br = fix_data['is_late']

In [170]:
X_ts = fix_data2.drop(['order_id'], axis=1)

In [171]:
X_ts[X_ts.select_dtypes(include=["category"]).columns] = X_ts[X_ts.select_dtypes(include=["category"]).columns].astype(int)


In [172]:
# from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_br, y_br, test_size=0.2, random_state=42)

print('Shape of X_train', X_train.shape)
print('Shape of X_test', X_test.shape)
print('Shape of y_train', y_train.shape)
print('Shape of y_test', y_test.shape)

Shape of X_train (69687, 30)
Shape of X_test (17422, 30)
Shape of y_train (69687,)
Shape of y_test (17422,)
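
The target is_late is imbalanced (the classification reports further down show 1317 late orders among the 17422 hold-out rows). A common option in that situation, sketched here on synthetic labels, is to pass stratify to train_test_split so both splits keep the same class ratio. The arrays X_toy and y_toy and the 8% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 positives out of 1000 rows
y_toy = np.array([1] * 80 + [0] * 920)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, the positive rate is preserved in both splits
print(y_tr.mean(), y_te.mean())
```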


In [173]:
X_ts = X_ts[X_test.columns]


In [174]:
eval_df = pd.DataFrame(columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC'])
eval_df

Unnamed: 0,Model,Accuracy,Precision,Recall,F1-Score,ROC-AUC


MODELS

In [175]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_predKNN = knn.predict(X_test)

model = 'KNN'
# classification report
report = classification_report(y_test, y_predKNN)
print('Classification Report:\n', report)

# calculate and print model accuracy
knn_acc = accuracy_score(y_test, y_predKNN)
print("KNN Model Accuracy:", accuracy_score(y_test, y_predKNN))

# calculate and print model precision
knn_prc = precision_score(y_test, y_predKNN)
print("KNN Model Precision:", precision_score(y_test, y_predKNN))

# calculate and print model recall
knn_rcl = recall_score(y_test, y_predKNN)
print("KNN Model Recall:", recall_score(y_test, y_predKNN))

# calculate and print model f1-score
knn_f1s = f1_score(y_test, y_predKNN)
print("KNN Model F1-Score:", f1_score(y_test, y_predKNN))

# calculate and print model roc-auc
knn_rau = roc_auc_score(y_test, y_predKNN)
print("KNN Model ROC-AUC:", roc_auc_score(y_test, y_predKNN))

new_eval = {'Model': model, 'Accuracy': knn_acc, 'Precision': knn_prc, 'Recall': knn_rcl, 'F1-Score': knn_f1s, 'ROC-AUC': knn_rau}
eval_df.loc[len(eval_df)] = new_eval

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.98      0.96     16105
           1       0.38      0.18      0.24      1317

    accuracy                           0.92     17422
   macro avg       0.66      0.58      0.60     17422
weighted avg       0.89      0.92      0.90     17422

KNN Model Accuracy: 0.9158535185397773
KNN Model Precision: 0.37886178861788616
KNN Model Recall: 0.17691723614274868
KNN Model F1-Score: 0.2412008281573499
KNN Model ROC-AUC: 0.5765989471617189
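
KNN relies on distances, so if the features are not standardized, columns with large numeric ranges (for example label-encoded IDs) can dominate the neighbor search. The sketch below uses synthetic data to compare a plain KNN with one wrapped in a StandardScaler pipeline. The data generator, the split, and all variable names are assumptions, and the effect on this notebook's data is not something the sketch can show.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: feature 0 carries the signal, feature 1 is large-scale noise
signal = rng.normal(size=400)
noise = rng.normal(scale=1000.0, size=400)
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)

X_fit, X_val = X_toy[:300], X_toy[300:]
y_fit, y_val = y_toy[:300], y_toy[300:]

raw_knn = KNeighborsClassifier(n_neighbors=3).fit(X_fit, y_fit)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)).fit(X_fit, y_fit)

raw_acc = raw_knn.score(X_val, y_val)
scaled_acc = scaled_knn.score(X_val, y_val)
print('raw accuracy   :', raw_acc)
print('scaled accuracy:', scaled_acc)
```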


In [176]:
y_pred_knn = knn.predict(X_ts)
y_id = fix_data2[['order_id']]
label = pd.DataFrame(y_pred_knn, columns=['is_late'])
submissionKNN = pd.concat([y_id, label], axis=1)
submissionKNN.to_csv('./subsfix-KNN.csv', index=False)
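
The submission cell above joins order_id and the predictions with pd.concat(axis=1), which aligns rows by index. That works when the test frame has the default 0..n-1 index. If rows were ever filtered without resetting the index, the join would misalign. A minimal sketch of a position-based alternative follows; toy_test, toy_pred and the index values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical test frame whose index is no longer 0..n-1 (e.g. after filtering rows)
toy_test = pd.DataFrame({'order_id': ['o1', 'o2', 'o3']}, index=[10, 11, 12])
toy_pred = np.array([0, 1, 0])

# Index-based concat: the two frames share no index labels, so rows do not line up
misaligned = pd.concat([toy_test[['order_id']], pd.DataFrame(toy_pred, columns=['is_late'])], axis=1)

# Position-based construction: ids and predictions are paired by row order
submission = pd.DataFrame({
    'order_id': toy_test['order_id'].to_numpy(),
    'is_late': toy_pred,
})
print(len(misaligned), len(submission))
```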

In [177]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_predDT = dt.predict(X_test)

model = 'Decision Tree'
# classification report
report = classification_report(y_test, y_predDT)
print('Classification Report:\n', report)

# calculate and print model accuracy
dt_acc = accuracy_score(y_test, y_predDT)
print("DT Model Accuracy:", accuracy_score(y_test, y_predDT))

# calculate and print model precision
dt_prc = precision_score(y_test, y_predDT)
print("DT Model Precision:", precision_score(y_test, y_predDT))

# calculate and print model recall
dt_rcl = recall_score(y_test, y_predDT)
print("DT Model Recall:", recall_score(y_test, y_predDT))

# calculate and print model f1-score
dt_f1s = f1_score(y_test, y_predDT)
print("DT Model F1-Score:", f1_score(y_test, y_predDT))

# calculate and print model roc-auc
dt_rau = roc_auc_score(y_test, y_predDT)
print("DT Model ROC-AUC:", roc_auc_score(y_test, y_predDT))

new_eval = {'Model': model, 'Accuracy': dt_acc, 'Precision': dt_prc, 'Recall': dt_rcl, 'F1-Score': dt_f1s, 'ROC-AUC': dt_rau}
eval_df.loc[len(eval_df)] = new_eval

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.97      0.97     16105
           1       0.66      0.67      0.66      1317

    accuracy                           0.95     17422
   macro avg       0.82      0.82      0.82     17422
weighted avg       0.95      0.95      0.95     17422

DT Model Accuracy: 0.9488003673516244
DT Model Precision: 0.6596543951915853
DT Model Recall: 0.6666666666666666
DT Model F1-Score: 0.6631419939577039
DT Model ROC-AUC: 0.8192693780399462


In [178]:
y_pred_dt = dt.predict(X_ts)
y_id = fix_data2[['order_id']]
label = pd.DataFrame(y_pred_dt, columns=['is_late'])
submissionDT = pd.concat([y_id, label], axis=1)
submissionDT.to_csv('./subsfix-DT.csv', index=False)

In [179]:
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_predXGB = xgb.predict(X_test)

model = 'XGBoost'
# classification report
report = classification_report(y_test, y_predXGB)
print('Classification Report:\n', report)

# calculate and print model accuracy
xgb_acc = accuracy_score(y_test, y_predXGB)
print("XGB Model Accuracy:", accuracy_score(y_test, y_predXGB))
# calculate and print model precision
xgb_prc = precision_score(y_test, y_predXGB)
print("XGB Model Precision:", precision_score(y_test, y_predXGB))

# calculate and print model recall
xgb_rcl = recall_score(y_test, y_predXGB)
print("XGB Model Recall:", recall_score(y_test, y_predXGB))

# calculate and print model f1-score
xgb_f1s = f1_score(y_test, y_predXGB)
print("XGB Model F1-Score:", f1_score(y_test, y_predXGB))

# calculate and print model roc-auc
xgb_rau = roc_auc_score(y_test, y_predXGB)
print("XGB Model ROC-AUC:", roc_auc_score(y_test, y_predXGB))

new_eval = {'Model': model, 'Accuracy': xgb_acc, 'Precision': xgb_prc, 'Recall': xgb_rcl, 'F1-Score': xgb_f1s, 'ROC-AUC': xgb_rau}
eval_df.loc[len(eval_df)] = new_eval

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98     16105
           1       0.78      0.63      0.70      1317

    accuracy                           0.96     17422
   macro avg       0.88      0.81      0.84     17422
weighted avg       0.96      0.96      0.96     17422

XGB Model Accuracy: 0.9590747330960854
XGB Model Precision: 0.7838345864661654
XGB Model Recall: 0.6332574031890661
XGB Model F1-Score: 0.7005459890802184
XGB Model ROC-AUC: 0.8094880620415992
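
The ROC-AUC values in these cells are computed from hard 0/1 predictions. In that case the score reduces to the mean of the recall on both classes, which is why it tracks the macro-average recall in the reports. ROC-AUC is usually computed from scores or probabilities instead, for example the second column of predict_proba. The toy labels and probabilities below are made up to show the difference between the two computations.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
# Hypothetical predicted probabilities of the positive class
y_score = np.array([0.10, 0.20, 0.45, 0.30, 0.40, 0.90])
y_label = (y_score >= 0.5).astype(int)   # hard labels at a 0.5 threshold

auc_labels = roc_auc_score(y_true, y_label)   # (TPR + TNR) / 2 = (0.5 + 1.0) / 2
auc_scores = roc_auc_score(y_true, y_score)   # 7 of 8 positive/negative pairs ranked correctly
print('AUC from hard labels  :', auc_labels)   # 0.75
print('AUC from probabilities:', auc_scores)   # 0.875
```

For the models in this notebook, the corresponding call would pass something like model.predict_proba(X_test)[:, 1] as the second argument; the resulting values would differ from the ones printed above.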


In [180]:
y_pred_xgb = xgb.predict(X_ts)
y_id = fix_data2[['order_id']]
label = pd.DataFrame(y_pred_xgb, columns=['is_late'])
submissionXGB = pd.concat([y_id, label], axis=1)
submissionXGB.to_csv('./subsfix-XGB.csv', index=False)

In [181]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import pandas as pd
import pickle

# Initialize and fit the Random Forest model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Make predictions
y_predRF = rf.predict(X_test)

model = 'Random Forest'

# Classification report
report = classification_report(y_test, y_predRF)
print('Classification Report:\n', report)

# Calculate and print model accuracy
rf_acc = accuracy_score(y_test, y_predRF)
print("RF Model Accuracy:", rf_acc)

# Calculate and print model precision
rf_prc = precision_score(y_test, y_predRF)
print("RF Model Precision:", rf_prc)

# Calculate and print model recall
rf_rcl = recall_score(y_test, y_predRF)
print("RF Model Recall:", rf_rcl)

# Calculate and print model f1-score
rf_f1s = f1_score(y_test, y_predRF)
print("RF Model F1-Score:", rf_f1s)

# Calculate and print model roc-auc
rf_rau = roc_auc_score(y_test, y_predRF)
print("RF Model ROC-AUC:", rf_rau)

# Update evaluation DataFrame
new_eval = {
    'Model': model,
    'Accuracy': rf_acc,
    'Precision': rf_prc,
    'Recall': rf_rcl,
    'F1-Score': rf_f1s,
    'ROC-AUC': rf_rau
}
eval_df.loc[len(eval_df)] = new_eval

# Save the fitted Random Forest so it can be reloaded later without retraining
with open('regmodel.pkl', 'wb') as f:
    pickle.dump(rf, f)



Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98     16105
           1       0.83      0.64      0.73      1317

    accuracy                           0.96     17422
   macro avg       0.90      0.82      0.85     17422
weighted avg       0.96      0.96      0.96     17422

RF Model Accuracy: 0.9630926414877741
RF Model Precision: 0.8291015625
RF Model Recall: 0.6446469248291572
RF Model F1-Score: 0.7253310551046561
RF Model ROC-AUC: 0.8168903671025636
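
The four model cells above repeat the same metric code. One way to reduce the repetition, sketched below, is a small helper that returns one row for eval_df. The function name evaluate and the toy labels are assumptions; the metrics are the same ones used above and are computed from hard predictions, as in the notebook.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(name, y_true, y_pred):
    """Return one row of metrics for the comparison table (hypothetical helper)."""
    return {
        'Model': name,
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1-Score': f1_score(y_true, y_pred),
        'ROC-AUC': roc_auc_score(y_true, y_pred),
    }

# Toy check: one false positive among four predictions
row = evaluate('toy', [0, 0, 1, 1], [0, 1, 1, 1])
print(row)
```

In the notebook this could be used as eval_df.loc[len(eval_df)] = evaluate('Random Forest', y_test, y_predRF).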


In [182]:
eval_df.sort_values(by=['Accuracy'], ascending=False)

Unnamed: 0,Model,Accuracy,Precision,Recall,F1-Score,ROC-AUC
3,Random Forest,0.963093,0.829102,0.644647,0.725331,0.81689
2,XGBoost,0.959075,0.783835,0.633257,0.700546,0.809488
1,Decision Tree,0.9488,0.659654,0.666667,0.663142,0.819269
0,KNN,0.915854,0.378862,0.176917,0.241201,0.576599
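
Because the classes are imbalanced, it helps to read the accuracy column against a majority-class baseline. The short calculation below uses the class supports from the classification reports above; the variable names are illustrative.

```python
# Class supports taken from the classification reports above
n_on_time, n_late = 16105, 1317

# Accuracy of a model that always predicts "on time" (class 0)
baseline_acc = n_on_time / (n_on_time + n_late)
print(round(baseline_acc, 4))  # 0.9244
```

Against this baseline of about 0.9244, KNN's accuracy of 0.9159 is lower than always predicting "on time", while Decision Tree, XGBoost and Random Forest are above it. Recall and F1-Score on the late class give a clearer picture of how the models differ.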
