# Hackerearth Exhibit A(rt)

Hackathon for predicting the cost of shipping an art piece.

In [1]:
import numpy as np
import pandas as pd
import pandas_profiling

# Training and test data files
train_file = "./dataset/train.csv"
test_file = "./dataset/test.csv"

## Data Specification

| Column name | Description |
| --- | --- |
| Customer Id | Represents the unique identification number of the customers |
| Artist Name | Represents the name of the artist |
| Artist Reputation | Represents the reputation of an artist in the market (the greater the reputation value, the higher the reputation of the artist in the market) |
| Height | Represents the height of the sculpture |
| Width | Represents the width of the sculpture |
| Weight | Represents the weight of the sculpture |
| Material | Represents the material that the sculpture is made of |
| Price Of Sculpture | Represents the price of the sculpture |
| Base Shipping Price | Represents the base price for shipping a sculpture |
| International | Represents whether the shipping is international |
| Express Shipment | Represents whether the shipping was in the express (fast) mode |
| Installation Included | Represents whether the order had installation included in the purchase of the sculpture |
| Transport | Represents the mode of transport of the order |
| Fragile | Represents whether the order is fragile |
| Customer Information | Represents details about a customer |
| Remote Location | Represents whether the customer resides in a remote location |
| Scheduled Date | Represents the date when the order was placed |
| Delivery Date | Represents the date of delivery of the order |
| Customer Location | Represents the location of the customer |
| Cost | Represents the cost of the order |

## Load the data into a pandas DataFrame

In [2]:
train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)
#train_df.profile_report()

In [3]:
train_df.head()

Unnamed: 0,Customer Id,Artist Name,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,Scheduled Date,Delivery Date,Customer Location,Cost
0,fffe3900350033003300,Billy Jenkins,0.26,17.0,6.0,4128.0,Brass,13.91,16.27,Yes,Yes,No,Airways,No,Working Class,No,06/07/15,06/03/15,"New Michelle, OH 50777",-283.29
1,fffe3800330031003900,Jean Bryant,0.28,3.0,3.0,61.0,Brass,6.83,15.0,No,No,No,Roadways,No,Working Class,No,03/06/17,03/05/17,"New Michaelport, WY 12072",-159.96
2,fffe3600370035003100,Laura Miller,0.07,8.0,5.0,237.0,Clay,4.96,21.18,No,No,No,Roadways,Yes,Working Class,Yes,03/09/15,03/08/15,"Bowmanshire, WA 19241",-154.29
3,fffe350031003300,Robert Chaires,0.12,9.0,,,Aluminium,5.81,16.31,No,No,No,,No,Wealthy,Yes,05/24/15,05/20/15,"East Robyn, KY 86375",-161.16
4,fffe3900320038003400,Rosalyn Krol,0.15,17.0,6.0,324.0,Aluminium,3.18,11.94,Yes,Yes,Yes,Airways,No,Working Class,No,12/18/16,12/14/16,"Aprilside, PA 52793",-159.23


In [4]:
#Separate the target in dataframe
y_train = train_df["Cost"].abs()
x_train = train_df.drop(["Cost"], axis=1)
x_test = test_df
missing_values_count = x_train.isnull().sum()
missing_values_count

Customer Id                 0
Artist Name                 0
Artist Reputation         750
Height                    375
Width                     584
Weight                    587
Material                  764
Price Of Sculpture          0
Base Shipping Price         0
International               0
Express Shipment            0
Installation Included       0
Transport                1392
Fragile                     0
Customer Information        0
Remote Location           771
Scheduled Date              0
Delivery Date               0
Customer Location           0
dtype: int64

## Inspect the data
### Observations:
- Customer Id values are 100% unique and may actually represent a transaction ID (candidate for removal)
- Customer Location values are 100% unique (candidate for removal, since distance cannot be derived from the data provided)
- Cost is highly skewed (target to be predicted)
- Price Of Sculpture is highly skewed
- Height, Width, Weight and Material may be correlated (try PCA and a separate regressor to impute missing values)
- Transport has a lot of missing values
### Columns with missing data
- Artist Reputation
- Height
- Width
- Weight
- Material
- Transport
- Remote Location
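
The observations above can be checked numerically. The sketch below uses a small made-up frame (the values are illustrative, not taken from the real dataset) to show how the share of missing values per column and the skewness of a numeric column might be computed with pandas.

```python
import pandas as pd

# Toy frame standing in for the training data (illustrative values only).
toy = pd.DataFrame({
    "Weight": [61.0, 237.0, 324.0, 4128.0, None],
    "Transport": ["Roadways", "Roadways", None, "Airways", "Airways"],
})

# Share of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Sample skewness; values well above 0 indicate a long right tail.
weight_skew = toy["Weight"].skew()

print(missing_share)
print(weight_skew)
```

A strongly positive skew is one reason to consider a log transform of the column before fitting a model.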

## Approach 1: Straightforward, use only the features that affect shipping cost based on experience
Remove columns that are probably not useful:
Customer Id, Customer Location, Material, Artist Name, Artist Reputation, Scheduled Date, Delivery Date
### Note
Consider adding Artist Reputation back in a later attempt

In [5]:
drop_columns = ["Customer Id", "Customer Location", "Material", "Artist Name", "Artist Reputation", "Scheduled Date", "Delivery Date"]
X_train_dropped = x_train.drop(drop_columns, axis=1)
X_test_dropped = x_test.drop(drop_columns, axis=1)
label = X_train_dropped.dtypes == "object"
label_enc_cols = list(label[label].index)

### Fill missing values of Transport and Remote Location

In [6]:
replacement = X_train_dropped["Transport"].mode()[0]
X_train_dropped["Transport"] = X_train_dropped["Transport"].fillna(replacement)
replacement = X_train_dropped["Remote Location"].mode()[0]
X_train_dropped["Remote Location"] = X_train_dropped["Remote Location"].fillna(replacement)
replacement = X_test_dropped["Transport"].mode()[0]
X_test_dropped["Transport"] = X_test_dropped["Transport"].fillna(replacement)
replacement = X_test_dropped["Remote Location"].mode()[0]
X_test_dropped["Remote Location"] = X_test_dropped["Remote Location"].fillna(replacement)
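
The cell above computes the mode separately for the training and the test frame. An alternative, sketched here on made-up data, is to learn the fill values from the training frame only and apply them to both frames, so that both sets are imputed with the same value. The names `train_toy`, `test_toy` and `fill_values` are hypothetical.

```python
import pandas as pd

train_toy = pd.DataFrame({"Transport": ["Roadways", "Airways", "Roadways", None]})
test_toy = pd.DataFrame({"Transport": [None, "Airways", "Airways"]})

# Most frequent value per column, learned from the training frame only.
fill_values = train_toy.mode().iloc[0]

# fillna with a Series fills each column with the value stored under its name.
train_filled = train_toy.fillna(fill_values)
test_filled = test_toy.fillna(fill_values)

print(test_filled)
```

In this toy example the test frame on its own would have been filled with "Airways", while the training mode is "Roadways".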

### Label Encode

In [7]:
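from sklearn.preprocessing import LabelEncoder
for col in label_enc_cols:
    # Fit on the training column only and reuse the mapping for the test column,
    # so the same category always gets the same integer in both sets.
    label_encoder = LabelEncoder()
    X_train_dropped[col] = label_encoder.fit_transform(X_train_dropped[col])
    X_test_dropped[col] = label_encoder.transform(X_test_dropped[col])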
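
Integer codes are only comparable between training and test data when both use the same category-to-code mapping. The following sketch, on made-up columns, derives the mapping from the training column and applies it to the test column with `pd.Categorical`; categories that never occurred in training get the code -1 instead of raising an error. The variable names are hypothetical.

```python
import pandas as pd

train_col = pd.Series(["Yes", "No", "No", "Yes"])
test_col = pd.Series(["No", "No", "Maybe"])  # "Maybe" never appears in training

# Sorted unique training values; LabelEncoder orders its classes the same way.
categories = sorted(train_col.dropna().unique())

train_codes = pd.Categorical(train_col, categories=categories).codes
test_codes = pd.Categorical(test_col, categories=categories).codes

print(list(train_codes))
print(list(test_codes))
```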

In [8]:
X_train_dropped.head()

Unnamed: 0,Height,Width,Weight,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location
0,17.0,6.0,4128.0,13.91,16.27,1,1,0,0,0,1,0
1,3.0,3.0,61.0,6.83,15.0,0,0,0,1,0,1,0
2,8.0,5.0,237.0,4.96,21.18,0,0,0,1,1,1,1
3,9.0,,,5.81,16.31,0,0,0,1,0,0,1
4,17.0,6.0,324.0,3.18,11.94,1,1,1,0,0,1,0


## Impute missing values

In [9]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="most_frequent")
X_imputed = pd.DataFrame(imputer.fit_transform(X_train_dropped), columns=X_train_dropped.columns)

## Scale the data

In [10]:
from sklearn.preprocessing import MinMaxScaler
feat_to_scale = ["Height", "Width", "Weight","Price Of Sculpture", "Base Shipping Price"]
X_scaled = X_imputed
scaled_features = MinMaxScaler().fit_transform(X_scaled[feat_to_scale])
#Remove unscaled columns from original DF
scaled_feature_df = pd.DataFrame(scaled_features, columns=feat_to_scale)
scaled_feature_df.head()
for col in feat_to_scale:
    X_scaled[col] = scaled_feature_df[col]
    
X_scaled.head()
# Log-transform the target; remember to reverse the operation (np.expm1) on the predictions
y_train = np.log1p(y_train)
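
The target is trained on a log scale, so predictions have to be mapped back before they can be read as costs. A minimal round-trip sketch, using the first three Cost values shown in the `train_df.head()` output:

```python
import numpy as np

# First three Cost values from train_df.head(); the notebook takes abs() first.
cost = np.array([-283.29, -159.96, -154.29])

target = np.log1p(np.abs(cost))   # log(1 + |cost|), the scale the models are fit on
recovered = np.expm1(target)      # exp(x) - 1, the exact inverse of log1p

print(target)
print(recovered)
```

Note that the sign of the original Cost is discarded by `abs()` and cannot be recovered by the inverse transform.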

In [11]:
from sklearn.model_selection import train_test_split
# Split the currently available training set for evaluation
train_feat, test_feat, cost_train, cost_test = train_test_split(
    X_scaled.to_numpy(),
    y_train.to_numpy(),
    train_size=0.8,
    random_state=0
    )


## Try AdaBoost Regressor

In [12]:
from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor(random_state=0, n_estimators=150, learning_rate=0.0001, loss="exponential")

In [13]:
ada.fit(train_feat, cost_train)
print(ada.score(test_feat, cost_test))

0.8606118766716299


## RandomForest Regressor

In [14]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=150,random_state=42)

In [15]:
rfr.fit(train_feat, cost_train)
print(rfr.score(test_feat, cost_test))

0.8892530789905283


## Linear Regressor

In [16]:
from sklearn.linear_model import LinearRegression
# normalize=True was removed in scikit-learn 1.2; for ordinary least squares it does not change the predictions
lnr = LinearRegression()

In [17]:
lnr.fit(train_feat, cost_train)
print(lnr.score(test_feat, cost_test))

0.5699499216959859


## Ridge Regression

In [18]:
from sklearn.linear_model import Ridge
# normalize was removed in scikit-learn 1.2, so this call needs an older release to run as written
rdg = Ridge(alpha=0.1, normalize=True)
rdg.fit(train_feat, cost_train)
print(rdg.score(test_feat, cost_test))

0.5665261893013835


## Kernel Ridge

In [19]:
from sklearn.kernel_ridge import KernelRidge
# degree only applies to the polynomial kernel; with the default kernel="linear" it is ignored
krn = KernelRidge(alpha=0.1, degree=12)
krn.fit(train_feat, cost_train)
print(krn.score(test_feat, cost_test))

-0.037706487114584464


## GradientBoosting

In [20]:
from sklearn.ensemble import GradientBoostingRegressor
grd = GradientBoostingRegressor(random_state=0)
grd.fit(train_feat, cost_train)
print(grd.score(test_feat, cost_test))

0.8896623379793243


## Approach 1 result
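- Gradient Boosting (R² 0.8897) and Random Forest (R² 0.8893) score almost identically on the hold-out split, ahead of AdaBoost (0.8606) and the linear models (about 0.57); Kernel Ridge scores below zero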
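
The numbers compared here come from `score`, which for scikit-learn regressors returns the coefficient of determination R². The hand-written version below (the helper name `r2_score_manual` is hypothetical) shows why the value can drop below zero: that happens whenever the predictions are worse than always predicting the mean of the true values.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_manual(y_true, y_true)                # exact predictions
mean_only = r2_score_manual(y_true, [2.5] * 4)           # always predict the mean
worse = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])    # worse than the mean

print(perfect, mean_only, worse)
```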

## Approach 2: Generate additional features from present data

In [21]:
# Original Data
# Work on copies so the in-place edits below do not modify train_df / test_df
x_modified = train_df.copy()
final_test = test_df.copy()
y_target = x_modified["Cost"]
customer_id = final_test["Customer Id"]
# Drop features we will not use
drop_columns = ["Customer Id", "Artist Name"]
x_modified.drop(drop_columns, axis=1, inplace=True)
final_features = final_test.drop(drop_columns, axis=1)

x_modified["Scheduled Date"] = pd.to_datetime(x_modified["Scheduled Date"], format="%m/%d/%y")
x_modified["Delivery Date"] = pd.to_datetime(x_modified["Delivery Date"], format="%m/%d/%y")
final_features["Scheduled Date"] = pd.to_datetime(final_features["Scheduled Date"], format="%m/%d/%y")
final_features["Delivery Date"] = pd.to_datetime(final_features["Delivery Date"], format="%m/%d/%y")
x_modified.head()

Unnamed: 0,Artist Reputation,Height,Width,Weight,Material,Price Of Sculpture,Base Shipping Price,International,Express Shipment,Installation Included,Transport,Fragile,Customer Information,Remote Location,Scheduled Date,Delivery Date,Customer Location,Cost
0,0.26,17.0,6.0,4128.0,Brass,13.91,16.27,Yes,Yes,No,Airways,No,Working Class,No,2015-06-07,2015-06-03,"New Michelle, OH 50777",-283.29
1,0.28,3.0,3.0,61.0,Brass,6.83,15.0,No,No,No,Roadways,No,Working Class,No,2017-03-06,2017-03-05,"New Michaelport, WY 12072",-159.96
2,0.07,8.0,5.0,237.0,Clay,4.96,21.18,No,No,No,Roadways,Yes,Working Class,Yes,2015-03-09,2015-03-08,"Bowmanshire, WA 19241",-154.29
3,0.12,9.0,,,Aluminium,5.81,16.31,No,No,No,,No,Wealthy,Yes,2015-05-24,2015-05-20,"East Robyn, KY 86375",-161.16
4,0.15,17.0,6.0,324.0,Aluminium,3.18,11.94,Yes,Yes,Yes,Airways,No,Working Class,No,2016-12-18,2016-12-14,"Aprilside, PA 52793",-159.23


### Create new feature from difference of scheduled date and delivery date

In [22]:
# Days between scheduling and delivery (negative when the delivery date precedes the scheduled date)
x_modified["ScheduleDiff"] = (x_modified["Delivery Date"] - x_modified["Scheduled Date"]).dt.days
x_modified.drop(["Delivery Date", "Scheduled Date"], axis=1, inplace=True)

final_features["ScheduleDiff"] = (final_features["Delivery Date"] - final_features["Scheduled Date"]).dt.days
final_features.drop(["Delivery Date", "Scheduled Date"], axis=1, inplace=True)

### Extract state from customer location

In [23]:
x_modified["State"] = x_modified["Customer Location"].map(lambda x:str(x).split()[-2])
x_modified.drop(["Customer Location"], axis=1, inplace=True)

final_features["State"] = final_features["Customer Location"].map(lambda x:str(x).split()[-2])
final_features.drop(["Customer Location"], axis=1, inplace=True)
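
Taking the second-to-last whitespace-separated token works as long as every location ends in a two-letter code followed by a ZIP code. A slightly more defensive sketch is given below; it assumes that same ending and returns `None` when the pattern is not found. The helper name `extract_state` and the pattern are illustrative assumptions, not part of the notebook.

```python
import re

# Two capital letters followed by a five-digit ZIP code at the end of the string.
STATE_ZIP = re.compile(r"\b([A-Z]{2})\s+\d{5}$")

def extract_state(location):
    """Return the two-letter code before the ZIP code, or None if absent."""
    match = STATE_ZIP.search(str(location).strip())
    return match.group(1) if match else None

print(extract_state("New Michelle, OH 50777"))
print(extract_state("East Robyn, KY 86375"))
print(extract_state("no state here"))
```

With pandas, such a helper could be applied through `Series.map`, in the same way the lambda is used above.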

In [24]:
#x_modified.profile_report()

### One Hot Encode Transport

In [25]:
from sklearn.preprocessing import OneHotEncoder
# On scikit-learn 1.2 or later, use sparse_output=False here and get_feature_names_out below
ohe = OneHotEncoder(sparse=False)
#Fill NaN transport
replacement = x_modified["Transport"].mode()[0]
x_modified["Transport"] = x_modified["Transport"].fillna(replacement)
x_transport = pd.DataFrame(ohe.fit_transform(x_modified[["Transport"]]), columns=ohe.get_feature_names(["Transport"]))
x_modified.drop(["Transport"], axis=1, inplace=True)
x_modified = pd.concat([x_modified, x_transport], axis=1)

final_features["Transport"] = final_features["Transport"].fillna(final_features["Transport"].mode()[0])
# Reuse the encoder fitted on the training data so the test columns match the training columns
x_transport = pd.DataFrame(ohe.transform(final_features[["Transport"]]), columns=ohe.get_feature_names(["Transport"]))
final_features.drop(["Transport"], axis=1, inplace=True)
final_features = pd.concat([final_features, x_transport], axis=1)
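
One-hot encoded training and test frames need identical columns in identical order, even when a category is absent from one of them. As an alternative to the scikit-learn encoder, the sketch below uses `pd.get_dummies` on made-up data and aligns the test columns to the training columns with `reindex`; the variable names are hypothetical.

```python
import pandas as pd

train_t = pd.Series(["Airways", "Roadways", "Waterways"], name="Transport")
test_t = pd.Series(["Roadways", "Roadways"], name="Transport")  # two categories missing

train_ohe = pd.get_dummies(train_t, prefix="Transport", dtype=float)

# Encode the test column, then force the training column layout;
# categories unseen in the test data become all-zero columns.
test_ohe = pd.get_dummies(test_t, prefix="Transport", dtype=float).reindex(
    columns=train_ohe.columns, fill_value=0.0
)

print(test_ohe)
```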

In [26]:
#x_modified.profile_report()

### Impute missing values

In [27]:
x_modified.isnull().sum()

Artist Reputation        750
Height                   375
Width                    584
Weight                   587
Material                 764
Price Of Sculpture         0
Base Shipping Price        0
International              0
Express Shipment           0
Installation Included      0
Fragile                    0
Customer Information       0
Remote Location          771
Cost                       0
ScheduleDiff               0
State                      0
Transport_Airways          0
Transport_Roadways         0
Transport_Waterways        0
dtype: int64

In [28]:
def fillNan(df, column, value):
    # Assign the result back; fillna(inplace=True) on a column selection is unreliable in recent pandas
    df[column] = df[column].fillna(value)

### Impute Artist Reputation

In [29]:
fillNan(x_modified, "Artist Reputation", x_modified["Artist Reputation"].mean())
x_modified["Artist Reputation"].isna().any()

fillNan(final_features, "Artist Reputation", final_features["Artist Reputation"].mean())
final_features["Artist Reputation"].isna().any()

False

### Impute Height, Width, and Weight

In [30]:
fillNan(x_modified, "Height", x_modified["Height"].mean())
x_modified["Height"].isna().any()
fillNan(x_modified, "Width", x_modified["Width"].mean())
x_modified["Width"].isna().any()
fillNan(x_modified, "Weight", x_modified["Weight"].mean())
x_modified["Weight"].isna().any()

fillNan(final_features, "Height", final_features["Height"].mean())
final_features["Height"].isna().any()
fillNan(final_features, "Width", final_features["Width"].mean())
final_features["Width"].isna().any()
fillNan(final_features, "Weight", final_features["Weight"].mean())
final_features["Weight"].isna().any()

False

### Impute Material

In [31]:
fillNan(x_modified, "Material", x_modified["Material"].mode()[0])
x_modified["Material"].isna().any()

fillNan(final_features, "Material", final_features["Material"].mode()[0])
final_features["Material"].isna().any()

False

### Impute Remote Location

In [32]:
fillNan(x_modified, "Remote Location", x_modified["Remote Location"].mode()[0])
x_modified["Remote Location"].isna().any()

fillNan(final_features, "Remote Location", final_features["Remote Location"].mode()[0])
final_features["Remote Location"].isna().any()

False

### Label Encode non-numerical columns

In [33]:
label = x_modified.dtypes == "object"
label_enc_cols = list(label[label].index)
label_encoder = LabelEncoder()
for col in label_enc_cols:
    x_modified[col] = label_encoder.fit_transform(x_modified[col])
    
label = final_features.dtypes == "object"
label_enc_cols = list(label[label].index)
label_encoder = LabelEncoder()
for col in label_enc_cols:
    final_features[col] = label_encoder.fit_transform(final_features[col])

In [34]:
#x_modified.profile_report()

In [35]:
#from sklearn.preprocessing import StandardScaler
X_normalized = x_modified
from sklearn.preprocessing import MinMaxScaler
feat_to_scale = ["Height", "Width", "Weight","Price Of Sculpture", "Base Shipping Price", "State"]
scaled_features = MinMaxScaler().fit_transform(X_normalized[feat_to_scale])
#Remove unscaled columns from original DF
scaled_feature_df = pd.DataFrame(scaled_features, columns=feat_to_scale)
for col in feat_to_scale:
    X_normalized[col] = scaled_feature_df[col]
X_normalized["Cost"] = np.log1p(abs(X_normalized["Cost"]))
#X_normalized.profile_report()

final_features_norm = final_features
from sklearn.preprocessing import MinMaxScaler
feat_to_scale = ["Height", "Width", "Weight","Price Of Sculpture", "Base Shipping Price", "State"]
scaled_features = MinMaxScaler().fit_transform(final_features_norm[feat_to_scale])
#Remove unscaled columns from original DF
scaled_feature_df = pd.DataFrame(scaled_features, columns=feat_to_scale)
for col in feat_to_scale:
    final_features_norm[col] = scaled_feature_df[col]
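
Min-max scaling maps each value to (x - min) / (max - min). In the cell above, the training and test features are each scaled with their own minimum and maximum, so the same raw value can end up at different positions in the two sets. The sketch below, on made-up weights, shows the alternative of learning the range from the training values and reusing it for the test values, which is what calling `fit` on the training data and `transform` on the test data does with `MinMaxScaler`.

```python
import numpy as np

train_weight = np.array([61.0, 237.0, 324.0, 4128.0])
test_weight = np.array([61.0, 5000.0])

# Range learned from the training values only.
lo, hi = train_weight.min(), train_weight.max()

train_scaled = (train_weight - lo) / (hi - lo)
test_scaled = (test_weight - lo) / (hi - lo)   # may fall outside [0, 1]

print(train_scaled)
print(test_scaled)
```

Test values outside the training range land outside [0, 1]; tree-based models such as gradient boosting are generally insensitive to that.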

### Separate target from features

In [36]:
y_cost = X_normalized["Cost"]
X_normalized.drop(["Cost"], axis=1, inplace=True)

### Split training and testing data

In [37]:
from sklearn.model_selection import train_test_split
# Split the currently available training set for evaluation
train_feat, test_feat, cost_train, cost_test = train_test_split(
    X_normalized.to_numpy(),
    y_cost.to_numpy(),
    train_size=0.8,
    random_state=0
    )

## Try out models

### Gradient boosting

In [38]:
from sklearn.ensemble import GradientBoostingRegressor
grd = GradientBoostingRegressor(random_state=42, learning_rate=0.2, subsample=0.8)
grd.fit(train_feat, cost_train)
print(grd.score(test_feat, cost_test))

0.9702842163830093


### AdaBoost

In [39]:
from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor(random_state=0, n_estimators=150, learning_rate=0.001, loss="square")
ada.fit(train_feat, cost_train)
print(ada.score(test_feat, cost_test))

0.8831304789993801


### RandomForest Regressor

In [40]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=150,random_state=42)
rfr.fit(train_feat, cost_train)
print(rfr.score(test_feat, cost_test))

0.9631836355326009


### LinearRegressor

In [41]:
from sklearn.linear_model import LinearRegression
# normalize=True was removed in scikit-learn 1.2; for ordinary least squares it does not change the predictions
lnr = LinearRegression()
lnr.fit(train_feat, cost_train)
print(lnr.score(test_feat, cost_test))

0.6926237060423044


### Approach 2 Findings:
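Gradient boosting scored highest (R² 0.9703), closely followed by Random Forest (0.9632); AdaBoost (0.8831) and linear regression (0.6926) trail behind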

## Cross Validation and parameter tuning

In [42]:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

kf = KFold(shuffle=True, random_state=42)#5 Splits

x_array = X_normalized.to_numpy()
y_array = y_cost.to_numpy()
#kf.get_n_splits(x_array)

# Each iteration overwrites the previous split, so only the last fold is kept for the cells below
for train_index, test_index in kf.split(x_array):
    X_train, X_test = x_array[train_index], x_array[test_index]
    y_train, y_test = y_array[train_index], y_array[test_index]

In [43]:
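# grd was last fitted in In [38] on a different split that overlaps this fold, so this score is optimistic
#grd.fit(X_train, y_train)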
print(grd.score(X_test, y_test))

0.9782109505962051
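
A cross-validated estimate usually refits the model on every fold and reports the spread of the per-fold scores. The following is a minimal sketch with `cross_val_score` on synthetic data; `X_demo`, `y_demo` and the data-generating formula are made up for illustration and are unrelated to the shipping dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
model_demo = GradientBoostingRegressor(random_state=42)

# The model is cloned and refitted for each fold; one R^2 value per fold is returned.
fold_scores = cross_val_score(model_demo, X_demo, y_demo, cv=cv)

print(fold_scores.mean(), fold_scores.std())
```

Because each fold is scored by a model that never saw it, the mean is a less optimistic estimate than scoring an already fitted model on a single fold.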


In [44]:
#Best estimator: GradientBoostingRegressor(learning_rate=0.2, n_estimators=250, random_state=30)
#parameters = {"random_state":[0, 10, 20, 30, 40], "learning_rate":[0.1, 0.2, 0.3, 0.4], "subsample":[1.0, 0.9, 0.8, 0.7], "n_estimators":[100, 150, 200, 250]}
#grd = GradientBoostingRegressor()
#reg = GridSearchCV(grd, parameters, n_jobs=4)
#reg.fit(X_train, y_train)

#print(f"Score: {reg.score(X_test, y_test)}")
#print(f"Best estimator: {reg.best_estimator_}")

In [45]:
def inverseLog(cost):
    return np.expm1(cost)

In [46]:
from sklearn.metrics import mean_squared_error

model = GradientBoostingRegressor(learning_rate=0.2, n_estimators=250, random_state=30)
model.fit(X_train, y_train)
preds = model.predict(X_test)

# Evaluate the model
score = mean_squared_error(y_test, preds)
print('MSE:', score)

MSE: 0.07152636773387103
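
The error above is measured on the log1p-transformed target, not in currency units. The sketch below uses made-up log-scale values to show how mean squared error, mean absolute error and root mean squared error differ on the same predictions.

```python
import numpy as np

# Made-up log-scale targets and predictions (illustrative only).
y_true_log = np.array([5.0, 5.5, 6.0])
y_pred_log = np.array([5.1, 5.3, 6.0])

errors = y_pred_log - y_true_log
mse = np.mean(errors ** 2)       # mean squared error
mae = np.mean(np.abs(errors))    # mean absolute error
rmse = np.sqrt(mse)              # root mean squared error, same unit as the target

print(mse, mae, rmse)
```

To report an error in the original cost units, both arrays would first have to be mapped back with `np.expm1`.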


## Prepare submission

In [54]:
result = inverseLog(model.predict(final_features_norm)).round(2)
submission = pd.DataFrame({"Customer Id":customer_id, "Cost":result})
submission.head()

Unnamed: 0,Customer Id,Cost
0,fffe3400310033003300,409.18
1,fffe3600350035003400,682.41
2,fffe3700360030003500,232.32
3,fffe350038003600,250.19
4,fffe3500390032003500,378.98


In [55]:
submission.to_csv('./submission_GB.csv', index=False)