When ensembling by blending, it is important to blend models of similar performance whose predictions are as weakly correlated with each other as possible.
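
As a rough illustration of that criterion, the sketch below compares three sets of out-of-fold predictions by RMSE and by pairwise correlation before averaging them. Everything in it is made up for the example: the synthetic target, the arrays `preds_a`, `preds_b`, `preds_c`, and the helper `rmse` are assumptions, not part of this kernel. Correlating the errors (rather than the raw predictions) is just one possible choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic target and three hypothetical sets of out-of-fold predictions
y_true = rng.normal(loc=8.0, scale=0.9, size=5000)
preds_a = y_true + rng.normal(scale=0.70, size=y_true.size)
preds_b = y_true + rng.normal(scale=0.70, size=y_true.size)
preds_c = y_true + rng.normal(scale=0.72, size=y_true.size)
oof = np.column_stack([preds_a, preds_b, preds_c])


def rmse(y, p):
    """Root mean squared error between targets and predictions."""
    return float(np.sqrt(np.mean((y - p) ** 2)))


# 1) Similar performance: the per-model RMSE values should be close
model_rmse = [rmse(y_true, oof[:, j]) for j in range(oof.shape[1])]

# 2) Low correlation: pairwise correlation of the model errors
errors = oof - y_true[:, None]
error_corr = np.corrcoef(errors, rowvar=False)

# Simple blend: unweighted average of the three models
blend = oof.mean(axis=1)
blend_rmse = rmse(y_true, blend)

print("Per-model RMSE:", np.round(model_rmse, 4))
print("Error correlation matrix:\n", np.round(error_corr, 3))
print("Blend RMSE:", round(blend_rmse, 4))
```

In this synthetic setup the errors are independent by construction, so the average scores better than any single model; with strongly correlated errors the gain would shrink.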

When stacking, instead, it is only important to stack models that are different from each other. The stacking algorithm will decide whether even an under-performing model can help your ensemble in any way.
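
A minimal sketch of that idea, assuming scikit-learn's `StackingRegressor` as the stacking algorithm (this kernel does not prescribe one). The data is synthetic, and the choice of base models and of `Ridge` as the meta-learner is only illustrative.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge, TweedieRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# Strictly positive synthetic target, so a Tweedie model with power > 1 is valid
y = (np.exp(0.3 * X[:, 0] - 0.2 * X[:, 1])
     + 0.1 * np.abs(X[:, 2])
     + rng.uniform(0.01, 0.1, size=500))

# Three deliberately different base models
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
        ("tweedie", TweedieRegressor(power=1.75, alpha=0.0001, max_iter=1000)),
    ],
    final_estimator=Ridge(alpha=1.0),  # meta-learner fitted on out-of-fold predictions
    cv=5,
)
stack.fit(X, y)

# One coefficient per base model: how much the meta-learner relies on each
meta_weights = stack.final_estimator_.coef_
print(dict(zip(["ridge", "tree", "tweedie"], np.round(meta_weights, 3))))
```

The meta-learner's coefficients give a rough view of which base models it found useful, which is how a weaker but different model can still earn a place in the ensemble.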

In this kernel I propose Tweedie regression, a regression based on the Tweedie distribution, which is suitable for modelling claims in insurance (our artificial data has been taken from an insurance dataset).
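
For context, the Tweedie family is characterized by a power relation between the variance and the mean. The notation below ($\mu$ for the mean, $\phi$ for the dispersion, $p$ for the power) is the usual textbook one and is not taken from this kernel:

```latex
\operatorname{Var}(Y) = \phi\,\mu^{p}, \qquad
\begin{cases}
p = 0 & \text{normal} \\
p = 1 & \text{Poisson} \\
1 < p < 2 & \text{compound Poisson-gamma} \\
p = 2 & \text{gamma}
\end{cases}
```

For $1 < p < 2$ the distribution has a point mass at zero plus a continuous positive part, which is why it is a common choice for claim amounts. In scikit-learn's `TweedieRegressor`, $p$ is the `power` argument; the cross-validation cell of this kernel sets `power=1.75`.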

The model uses some "tricks" to perform better:

* binarization of continuous variables based on k-means cluster analysis
* some selected interactions

Actually, if you are using this model in a stacking ensemble, you don't need to add any of these engineered features.
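
To make the two tricks concrete, here is a small sketch on a single hypothetical feature. The array `x`, the bin count `n_bins`, and the particular interaction are assumptions for illustration; the kernel's own cells use 16 clusters per `cont` feature and a hand-picked list of interactions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(1000, 1))  # one hypothetical continuous feature

# Trick 1: binarize the feature with 1-D k-means
n_bins = 4
km = KMeans(n_clusters=n_bins, n_init=10, random_state=0).fit(x)
labels = km.predict(x)

# One-hot encode the cluster ids; indexing an identity matrix
# always yields exactly n_bins columns
one_hot = np.eye(n_bins, dtype=int)[labels]

# Trick 2: an interaction is the element-wise product of two columns,
# here a bin indicator times the raw feature
interaction = one_hot[:, 0] * x[:, 0]

print(one_hot[:5])
print(interaction[:5])
```

The indicator columns let a linear model fit a step function of the feature, which is what corrects for non-linearity. Indexing an identity matrix also keeps the number of columns fixed even when some cluster never appears in a smaller dataset.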

Make your stacking richer and more varied!

Happy Kaggling!


In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_squared_error

In [2]:
# Loading data 
X_train = pd.read_csv("../input/30-days-of-ml/train.csv")
X_test = pd.read_csv("../input/30-days-of-ml/test.csv")

In [3]:
# Preparing data as a tabular matrix
y_train = X_train.target
X_train = X_train.set_index('id').drop('target', axis='columns')
X_test = X_test.set_index('id')

In [4]:
# Pointing out categorical features
categoricals = [item for item in X_train.columns if 'cat' in item]

In [5]:
# Dealing with categorical data using get_dummies
dummies = pd.get_dummies(pd.concat([X_train, X_test])[categoricals])
X_train[dummies.columns] = dummies.iloc[:len(X_train), :]
X_test[dummies.columns] = dummies.iloc[len(X_train): , :]
del(dummies)

In [6]:
# Dealing with categorical data using OrdinalEncoder (only when there are 3 or more levels)
ordinal_encoder = OrdinalEncoder()
X_train[categoricals[3:]] = ordinal_encoder.fit_transform(X_train[categoricals[3:]]).astype(int)
X_test[categoricals[3:]] = ordinal_encoder.transform(X_test[categoricals[3:]]).astype(int)
X_train = X_train.drop(categoricals[:3], axis="columns")
X_test = X_test.drop(categoricals[:3], axis="columns")

In [7]:
# Feature selection (https://www.kaggle.com/lucamassaron/tutorial-feature-selection-with-boruta-shap)
important_features = ['cat1_A', 'cat1_B', 'cat5', 'cat8', 'cat8_C', 'cat8_E', 'cont0', 
                      'cont1', 'cont10', 'cont11', 'cont12', 'cont13', 'cont2', 'cont3', 
                      'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9']

categoricals = ['cat5', 'cat8']

X_train = X_train[important_features]
X_test = X_test[important_features]

In [8]:
# Stratifying the data
km = KMeans(n_clusters=32, random_state=0)
pca = PCA(n_components=16, random_state=0)

pca.fit(X_train)
km.fit(pca.transform(X_train))

print(np.unique(km.labels_, return_counts=True))

y_stratified = km.labels_

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
      dtype=int32), array([ 8885, 23161, 18215, 13810, 13461, 18955,  8334,  8874,  3903,
       11152,  8617,  3471,  8492, 10460,  2726,  3428, 12002,  8632,
        7151,  7087, 14235, 10591,  2337, 12269,  3762, 10066,  1579,
       14104, 14622,  2005,  7083,  6531]))


In [9]:
# Clustering continuous values (acting as non-linearity correctors)
n_clusters = 16
km = KMeans(n_clusters=n_clusters, random_state=0)
for i in range(14):
    feats = [f'cont{i}_{j}' for j in range(n_clusters)]
    km.fit(X_train[[f'cont{i}']])
    X_train[feats] = pd.get_dummies(km.predict(X_train[[f'cont{i}']])).values
    X_test[feats] = pd.get_dummies(km.predict(X_test[[f'cont{i}']])).values

In [10]:
interactions =(
[('cont2_11', 'cont7_9'), ('cont2_11', 'cont7_13'), ('cont2_6', 'cont4_5'), ('cat8', 'cont8_1'),
 ('cont5_6', 'cont8_0'), ('cont5_6', 'cont7_0'), ('cont7_0', 'cont8_0'), ('cat8_C', 'cont4_5'), 
 ('cont4_5', 'cont6_5'), ('cont4_5', 'cont7'), ('cont0_13', 'cont8_15'), ('cont5', 'cont5_6'),
 ('cont7_13', 'cont7_9'), ('cont4_3', 'cont7_0'), ('cont5_14', 'cont5_6'), ('cont11', 'cont5_14'),
 ('cont2_11', 'cont2_12'), ('cont2_11', 'cont5_3'), ('cont1_1', 'cont7_6'), ('cont7_9', 'cont8_6'),
 ('cont2_12', 'cont7_13'), ('cont4_3', 'cont5_6'), ('cont2_6', 'cont6_5'), ('cont2_6', 'cont5_5'),
 ('cat8_C', 'cont2_6'), ('cat8', 'cont7_9'), ('cont4_9', 'cont8_15'), ('cont4_14', 'cont7_9'),
 ('cont1_15', 'cont5_11'), ('cont4_3', 'cont7_4'), ('cont7_9', 'cont8_1'), ('cont4_3', 'cont8_0'),
 ('cont4_3', 'cont5'), ('cont2_12', 'cont7_9'), ('cont5_6', 'cont6_4'), ('cont1_1', 'cont1_13'), 
 ('cont4_5', 'cont7_9'), ('cont4_12', 'cont8_15'), ('cont5_3', 'cont7_9'), ('cont4_5', 'cont5_5'),
 ('cont0_3', 'cont4_5'), ('cont5_6', 'cont6_0'), ('cat8', 'cont8_2'), ('cont5_5', 'cont6_5'),
 ('cont4_3', 'cont6_0'), ('cat8', 'cont2_4'), ('cont1_1', 'cont4_5'), ('cont5', 'cont8_0'),
 ('cont0_1', 'cont4_0'), ('cont13', 'cont8_0')]
)

for a, b in interactions:
    X_train[f"{a}*{b}"] = np.prod(X_train[[a, b]].values, axis=1)
    X_test[f"{a}*{b}"] = np.prod(X_test[[a, b]].values, axis=1)

In [11]:
# Cross-validation prediction
folds = 10
skf = StratifiedKFold(n_splits=folds,
                      shuffle=True, 
                      random_state=0)

fold_idxs = list(skf.split(X_train, y_stratified))

predictions = np.zeros(len(X_test))
score = list()

for k, (train_idx, val_idx) in enumerate(fold_idxs):
    
    ss = StandardScaler()
    
    y = y_train[train_idx]
    X = ss.fit_transform(X_train.iloc[train_idx, :]).astype(np.float32)
    Xv = ss.transform(X_train.iloc[val_idx, :]).astype(np.float32)
    Xt = ss.transform(X_test).astype(np.float32)
    
    glm = TweedieRegressor(power=1.75, alpha=0.0001, max_iter=10000)
    glm.fit(X, y)
    
    val_preds = glm.predict(Xv)
    val_rmse = np.sqrt(mean_squared_error(y_true=y_train[val_idx], y_pred=val_preds))
    print(f"Fold {k} RMSE: {val_rmse:0.5f}")
    predictions += glm.predict(Xt).ravel()
    score.append(val_rmse)
    
predictions /= folds
print(f"CV RMSE {np.mean(score):0.5f} ({np.std(score):0.5f})")

Fold 0 RMSE: 0.72873
Fold 1 RMSE: 0.72822
Fold 2 RMSE: 0.72524
Fold 3 RMSE: 0.72563
Fold 4 RMSE: 0.72664
Fold 5 RMSE: 0.72444
Fold 6 RMSE: 0.72410
Fold 7 RMSE: 0.72564
Fold 8 RMSE: 0.73232
Fold 9 RMSE: 0.72392
CV RMSE 0.72649 (0.00248)


In [12]:
# Preparing the submission
submission = pd.DataFrame({'id':X_test.index, 
                           'target': predictions})

submission.to_csv("submission.csv", index = False)

In [13]:
submission

            id    target
0            0  8.060932
1            5  8.356644
2           15  8.380574
3           16  8.513334
4           17  8.325107
...        ...       ...
199995  499987  7.973437
199996  499990  8.441941
199997  499991  8.392145
199998  499994  8.199370
