# About
This is an attempt to give the models a day off and do some of the optimization work for them by hand.

I'll look at the Shipping column, which currently provides only binary information, and see whether I can squeeze some more information out of it.

The Shipping column is described as:
> shipping - 1 if shipping fee is paid by seller and 0 by buyer

The usual assumption is that if shipping is to be paid by the buyer, the price should be discounted by the amount of the shipping rate. Let's look at this in detail.

In the code below I add a fixed amount to the predicted price if shipping is to be paid by the buyer, on the assumption that the buyer thinks in terms of the total purchase price, including shipping and handling fees.

As a base model I use plain CatBoost on all categorical columns: no frills, little data preprocessing.

# Preparation

In [None]:
import numpy as np 
import pandas as pd 

In [None]:
train = pd.read_csv('../input/train.tsv',delimiter='\t', index_col='train_id')
test = pd.read_csv('../input/test.tsv',delimiter='\t', index_col='test_id')

In [None]:
#fill in NaNs (assign back instead of calling inplace on a column selection)
train['brand_name'] = train['brand_name'].fillna('NONAME')
train['category_name'] = train['category_name'].fillna('NOCAT')

In [None]:
#split train into a train and a validation set;
#we will fit the model on log price from the beginning
from sklearn.model_selection import train_test_split

X_train_part, X_valid, y_train_part, y_valid = \
    train_test_split(train, np.log1p(train['price'].values), random_state=17)
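
Fitting on `np.log1p(price)` means that a plain RMSE computed on the model output is the same quantity as the root mean squared logarithmic error on the dollar scale, and `np.expm1` maps predictions back to dollars. A minimal sketch on made-up numbers (the `prices` and `predicted` arrays and both helper functions are illustrative assumptions, not part of the notebook):

```python
import numpy as np

# Hypothetical prices and predictions in dollars, for illustration only
prices = np.array([10.0, 25.0, 3.0, 120.0])
predicted = np.array([12.0, 20.0, 4.0, 100.0])

def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, written out explicitly."""
    return float(np.sqrt(np.mean((np.log(1 + y_pred) - np.log(1 + y_true)) ** 2)))

# RMSE on the log1p scale equals RMSLE on the dollar scale
score_log_scale = rmse(np.log1p(prices), np.log1p(predicted))
score_rmsle = rmsle(prices, predicted)

# expm1 undoes log1p, so log-scale values can be restored to dollars
restored = np.expm1(np.log1p(prices))
```

This is why the cells below compare `y_valid` directly with `np.log1p` of any adjusted dollar prediction.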

# Base Model

In [None]:
from catboost import CatBoostRegressor
#Set some pre-tuned parameters. Want a better score? Increase iterations.
I=200; lr = 0.5
cb_params={'has_time':False, 'eval_metric':'RMSE', 'logging_level':'Silent', 'train_dir':'/tmp'}
cb_fit_columns=['item_condition_id', 'category_name', 'brand_name','shipping']
cb_fit_params={'cat_features':[0,1,2,3]} #all train features are categorical
cb=CatBoostRegressor(**cb_params,iterations=I, learning_rate=lr)
cb.fit(X_train_part[cb_fit_columns], y_train_part, **cb_fit_params)

In [None]:
from sklearn.metrics import mean_squared_error
import math
pred = cb.predict(X_valid[cb_fit_columns])
print(math.sqrt(mean_squared_error(y_valid, pred )))

Ok, so far so good. Let's now remove the shipping column and fit the model again.

# Model with no Shipping column

In [None]:
cb_fit_columns=['item_condition_id', 'category_name', 'brand_name']
cb_fit_params={'cat_features':[0,1,2]} #all train features are categorical
cb=CatBoostRegressor(**cb_params,iterations=I, learning_rate=lr)
cb.fit(X_train_part[cb_fit_columns], y_train_part, **cb_fit_params)

In [None]:
pred_no_ship = cb.predict(X_valid[cb_fit_columns])
print(math.sqrt(mean_squared_error(y_valid, pred_no_ship )))

The score does not get any better, which means shipping really is a significant feature.
Let's see now whether we can tweak the predicted price a bit to compensate for the missing shipping info.
We will add a fixed amount to the predicted price of items whose shipping is paid by the buyer and re-evaluate the metric.
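
The idea can be sketched on synthetic data before running it on the real predictions. In the sketch below everything is an assumption made for illustration: the prices are random, the "true" price is constructed to be 2.0 higher when `shipping == 0`, and the model is assumed to predict the base price exactly. A grid search over the surcharge should then recover the constructed value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption, not the competition data):
# the "true" price is 2.0 higher when the buyer pays shipping (shipping == 0)
TRUE_SURCHARGE = 2.0
base_price = rng.uniform(5.0, 50.0, size=1000)
shipping = rng.integers(0, 2, size=1000)
true_price = base_price + TRUE_SURCHARGE * (shipping == 0)

# A model that ignores shipping predicts only the base price
pred_price = base_price

def score(surcharge):
    """RMSE on the log1p scale after adding the surcharge to buyer-paid items."""
    adjusted = pred_price + surcharge * (shipping == 0)
    return float(np.sqrt(np.mean((np.log1p(true_price) - np.log1p(adjusted)) ** 2)))

candidates = np.linspace(0.0, 4.0, 41)  # 0.0, 0.1, ..., 4.0
scores = [score(s) for s in candidates]
best_surcharge = float(candidates[int(np.argmin(scores))])
```

On real data the optimum is less clean, because the model's errors are not explained by shipping alone.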

# Let's try to account for shipping rate

In [None]:
X_valid_pred = X_valid.copy()
X_valid_pred['pred'] = np.expm1(pred_no_ship) #restore to "human" scale

#to keep track of price changes
ship_surcharges = []
score_vs_ship_surcharge = []

for ship_surcharge in np.arange(0.0, 4.0, 0.1):
    X_valid_pred["pred_shipped"] = X_valid_pred['pred']
    #we only amend prices with shipping paid by the buyer
    X_valid_pred.loc[X_valid_pred['shipping']==0,"pred_shipped"] += ship_surcharge
    
    #y_valid is already on the log1p scale, so only the prediction is transformed
    pred_mod_score = math.sqrt(mean_squared_error(y_valid,
                                                  np.log1p(X_valid_pred['pred_shipped'].clip(0))))
    ship_surcharges.append(ship_surcharge)
    score_vs_ship_surcharge.append(pred_mod_score)
    
best_score = min(score_vs_ship_surcharge)
best_ship = ship_surcharges[np.argmin(score_vs_ship_surcharge)]
print(best_ship, best_score)

In [None]:
from matplotlib import pyplot as plt
plt.plot(ship_surcharges, score_vs_ship_surcharge)

We see that the score is best when about 1.9 dollars is added to the predicted price of items whose shipping is paid by the buyer.
Can we now play the same trick with the baseline?

# Shipping Rate in the baseline?

In [None]:
X_valid_pred = X_valid.copy()
X_valid_pred['pred'] = np.expm1(pred) #restore to "human" scale

#to keep track of price changes
ship_surcharges = []
score_vs_ship_surcharge = []

for ship_surcharge in np.arange(-3.0, 4.0, 0.1):
    X_valid_pred["pred_shipped"] = X_valid_pred['pred']
    #we only amend prices with shipping paid by the buyer
    X_valid_pred.loc[X_valid_pred['shipping']==0,"pred_shipped"] += ship_surcharge
    
    #y_valid is already on the log1p scale, so only the prediction is transformed
    pred_mod_score = math.sqrt(mean_squared_error(y_valid,
                                                  np.log1p(X_valid_pred['pred_shipped'].clip(0))))
    ship_surcharges.append(ship_surcharge)
    score_vs_ship_surcharge.append(pred_mod_score)
    
best_score = min(score_vs_ship_surcharge)
best_ship = ship_surcharges[np.argmin(score_vs_ship_surcharge)]
print(best_ship, best_score)

In [None]:
plt.plot(ship_surcharges, score_vs_ship_surcharge)

We see that the baseline model, which is fit with the shipping column, is pretty efficient!
Ok, let's try one last thing. It is reasonable to assume that shipping prices vary from category to category: TVs are more expensive to ship than lipsticks. We therefore want to optimize the shipping surcharge for each category separately. This time we use scipy optimize functions instead of a simple brute-force "attack".
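
As a rough illustration of per-category optimization, here is a sketch on synthetic data. It is not the notebook's method: it swaps in `scipy.optimize.minimize_scalar` with the bounded method, which suits a single scalar parameter, while the cells below use `optimize.minimize` with Nelder-Mead. The category names, the per-category surcharges, the search bounds, and the assumption that the model predicts the base price exactly are all made up for the example.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(1)

# Synthetic data with hypothetical per-category surcharges (illustration only)
true_surcharge = {"tv": 5.0, "lipstick": 1.0}
n = 500
category = np.array(["tv"] * n + ["lipstick"] * n)
shipping = rng.integers(0, 2, size=2 * n)
base_price = rng.uniform(5.0, 50.0, size=2 * n)
offset = np.array([true_surcharge[c] for c in category])
true_price = base_price + offset * (shipping == 0)

def score(s, pred, ship, y_log):
    """RMSE on the log1p scale after adding s to buyer-paid predictions."""
    adjusted = pred + s * (ship == 0)
    return float(np.sqrt(np.mean((y_log - np.log1p(adjusted)) ** 2)))

estimated = {}
for cat in true_surcharge:
    mask = category == cat
    res = optimize.minimize_scalar(
        score,
        bounds=(0.0, 10.0),  # assumed search range for the surcharge
        method="bounded",
        args=(base_price[mask], shipping[mask], np.log1p(true_price[mask])),
    )
    estimated[cat] = float(res.x)
```

Each category gets its own one-dimensional search, so the cost grows linearly with the number of categories.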

# Per category optimization

In [None]:
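#optimize this function of score vs. shipping surcharge (s)
def f(s, X, y):
    s = float(np.ravel(s)[0]) #optimizers pass s as a 1-element array
    #add the surcharge only where the buyer pays for shipping; X itself is not modified
    pred_mod = X['pred'] + s * (X['shipping']==0)
    score = math.sqrt(mean_squared_error(y, np.log1p(pred_mod.clip(lower=0))))
    return score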

Let's validate this approach first:

In [None]:
#I take the first optimizer from the list, but others may yield better results
from scipy import optimize
min_s = optimize.minimize(f, 0, args=(X_valid_pred, y_valid), method='Nelder-Mead')

In [None]:
#We got the same values - good!
min_s.x[0], min_s.fun

In [None]:
#sort categories by frequency; below we optimize only the 10 most frequent ones,
#otherwise the optimization process takes too long.
all_cats = X_valid.groupby('category_name')['name'].count().to_frame()\
    .sort_values(by='name', ascending=False).index

In [None]:
X_valid_pred = X_valid.copy()
X_valid_pred['pred'] = np.expm1(pred_no_ship) #restore to "human" scale
X_valid_pred['pred_opt'] = np.expm1(pred_no_ship) 

MIN0=1.9 #initial value for the optimizer. Recall, this is our best average across all categories
NCAT=10 # number of categories to optimize. this is enough to get an idea.
METHOD='Nelder-Mead'

scores_cat = []

for cat in all_cats[:NCAT]:
    mask = X_valid_pred['category_name']==cat
    res = optimize.minimize(f, MIN0, args=(X_valid_pred[mask], y_valid[mask.values]), method=METHOD)
    print("{0:>2.2f} {1:1.3f} {2:s}".format(res.x[0], res.fun, cat))
    #update predicted values; the surcharge is added to every row of the category here,
    #whereas f applies it only to rows with shipping==0
    X_valid_pred.loc[mask,'pred_opt'] = X_valid_pred.loc[mask,'pred'] + res.x[0]
    #calculate the total score with the price adjustment for the current category included
    scores_cat.append( math.sqrt(mean_squared_error(y_valid, 
                                    np.log1p(X_valid_pred['pred_opt'].clip(0)))) )

Take a glance at what all these prices look like:

In [None]:
X_valid_pred[X_valid_pred['category_name'].isin(all_cats[:NCAT])][['category_name','price','pred','pred_opt']].head(10)

In [None]:
#score for this optimization
score = math.sqrt(mean_squared_error(y_valid, 
                                    np.log1p(X_valid_pred['pred_opt'].clip(0))))
score

Huh? We don't see any improvement. In fact, the score drifts upward, away from the best one, as soon as we start "optimizing" the price for the largest categories.

In [None]:
plt.plot(scores_cat)
    

Compare to previously obtained "baseline" scores:

In [None]:
#catboost on full data
math.sqrt(mean_squared_error(y_valid, pred))

In [None]:
#catboost with no shipping column in data
math.sqrt(mean_squared_error(y_valid, pred_no_ship))

# Random Shipping

As Rene Wang suggested in the comments, I ran an experiment with the shipping column permuted, i.e. randomly reassigned.
Interestingly, this gives almost the same result as removing the shipping column.
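
Why would a permuted column behave like a missing one? Permutation keeps the counts of 0s and 1s but breaks the link between each row's shipping flag and its price. A small sketch on synthetic data (the price formula, the noise level, and the seed are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: price depends on the shipping flag plus some noise
n = 10_000
shipping = rng.integers(0, 2, size=n)
price = 10.0 + 5.0 * (shipping == 0) + rng.normal(0.0, 1.0, size=n)

# Shuffle the flags across rows; the values themselves are unchanged
shipping_permuted = rng.permutation(shipping)

# Correlation with price before and after the permutation
corr_before = float(np.corrcoef(shipping, price)[0, 1])
corr_after = float(np.corrcoef(shipping_permuted, price)[0, 1])

# The marginal distribution is preserved exactly
same_counts = np.array_equal(np.bincount(shipping), np.bincount(shipping_permuted))
```

With the association gone, a model has nothing to learn from the column, which is consistent with the score matching the "no shipping" model.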

In [None]:
# permute the shipping column
train['shipping'] = np.random.permutation(train['shipping'])

X_train_part, X_valid, y_train_part, y_valid = \
    train_test_split(train, np.log1p(train['price'].values), random_state=17)

In [None]:
#Set some pre-tuned parameters. Want a better score? Increase iterations.
I=200; lr = 0.5
cb_params={'has_time':False, 'eval_metric':'RMSE', 'logging_level':'Silent', 'train_dir':'/tmp'}
cb_fit_columns=['item_condition_id', 'category_name', 'brand_name','shipping']
cb_fit_params={'cat_features':[0,1,2,3]} #all train features are categorical
cb=CatBoostRegressor(**cb_params,iterations=I, learning_rate=lr)
cb.fit(X_train_part[cb_fit_columns], y_train_part, **cb_fit_params)

In [None]:
from sklearn.metrics import mean_squared_error
import math
pred_ship_permute = cb.predict(X_valid[cb_fit_columns])
print(math.sqrt(mean_squared_error(y_valid, pred_ship_permute)))


# Conclusion

The model works better than manual optimization. Who would have thought? :)

But let me submit the "no shipping" version anyway.

In [None]:
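#let's retrain the "no shipping" model on the full dataset, predict the test set, and submit
cb_fit_columns=['item_condition_id', 'category_name', 'brand_name'] #drop the (permuted) shipping column
cb_fit_params={'cat_features':[0,1,2]} #all train features are categorical
cb.fit(train[cb_fit_columns], np.log1p(train['price'].values), **cb_fit_params)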



In [None]:
#fill in NaNs (assign back instead of calling inplace on a column selection)
test['brand_name'] = test['brand_name'].fillna('NONAME')
test['category_name'] = test['category_name'].fillna('NOCAT')

pred_test=cb.predict(test[cb_fit_columns])

In [None]:
test['price'] = np.expm1(pred_test)
test.loc[test['shipping']==0,"price"] += MIN0

In [None]:
#submission
pd.DataFrame({'price':test['price']}, index=test.index)\
  .to_csv("submission.csv", index_label="test_id")

