# Ensembling

In other notebooks we trained various models on our train data and saved each model's predictions for both the train and test sets.

In this notebook we train two ensembling models: a meta-network (for non-linear ensembling) and a lasso regression model (for linear ensembling). Each takes the predictions of all the other models as input and outputs the final prediction. Both are trained against the labels (`deal_probability` scores) of the train set.
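
As a rough illustration of the idea, here is a minimal sketch of linear ensembling on toy data. It is not the notebook's pipeline: the targets and the three "base models" are synthetic, and plain least squares stands in for the lasso model used later. The names `base_preds`, `weights` and `rmse` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy targets in [0, 1], standing in for deal_probability.
y = rng.uniform(0.0, 1.0, size=200)

# Three hypothetical base models: the target plus independent noise.
base_preds = np.column_stack(
    [y + rng.normal(0.0, s, size=y.shape) for s in (0.10, 0.15, 0.20)]
)


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Linear ensembling: one weight per base model, fitted by least squares
# (an assumption for this sketch; the notebook uses Lasso instead).
weights, *_ = np.linalg.lstsq(base_preds, y, rcond=None)

# Final prediction: weighted sum of base predictions, clipped to [0, 1].
blended = np.clip(base_preds @ weights, 0.0, 1.0)

single_scores = [rmse(y, base_preds[:, j]) for j in range(base_preds.shape[1])]
blend_score = rmse(y, base_preds @ weights)
```

On the data it was fitted on, the least-squares blend cannot score worse than any single base model, because giving one model all the weight is one of the candidate solutions. That guarantee only holds in-sample, so it says nothing about performance on unseen data.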

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%run stephan_modules.ipynb


Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.

Using TensorFlow backend.


In [3]:
DATA_PATH = './data'
PREDICTIONS_DATA_PATH = './predictions'

## Feature enrichment

In [116]:
print('loading data...')
train, test = load_data(DATA_PATH)

train_items = pd.DataFrame(train['item_id'])
test_items = pd.DataFrame(test['item_id'])
train_expected = train['deal_probability']

loading data...


In [117]:
from os import path

file_names = ['NN-COUNTVEC-MERGED-BIGRAMS.csv.gz',
              'NN-COUNTVEC-MERGED-UNI-nodropout.csv.gz',
              'NN-COUNTVEC-SEPARATED-BIGRAMS-nodropout.csv.gz',
              'NN-COUNTVEC-SEPARATED-UNIGRAMS.csv.gz',
              'NN-TFIDF-MERGED-BIGRAMS.csv.gz',
              'NN-TFIDF-SEPARATED-BIGRAMS-nodropout.csv.gz',
              'NN-TFIDF-SEPARATED-UNIGRAMS.csv.gz',
              'NN-TFIDF-UNI-MERGED-nodropout.csv.gz',              
              'all_data_lgbm_regression_results_0.2281.csv.gz',
              'all_data_lgbm_poisson_results_0.2281.csv.gz',
              'NN-JUST-LSTM-MERGED.csv.gz',
              'NN-LSTM-COMBINED-MERGED.csv.gz']


for filename in file_names:
    print("Loading predictions for:{}".format(filename))
    predictions = load_df(PREDICTIONS_DATA_PATH, filename)
    # Give each model's prediction column a unique name before merging.
    model_name = 'model_' + filename
    predictions = predictions.rename({'deal_probability': model_name}, axis='columns')
    # Left merges keep every train/test item and add one column per model.
    train_items = train_items.merge(predictions, on='item_id', how='left')
    test_items = test_items.merge(predictions, on='item_id', how='left')

Loading predictions for:NN-COUNTVEC-MERGED-BIGRAMS.csv.gz
Loading predictions for:NN-COUNTVEC-MERGED-UNI-nodropout.csv.gz
Loading predictions for:NN-COUNTVEC-SEPARATED-BIGRAMS-nodropout.csv.gz
Loading predictions for:NN-COUNTVEC-SEPARATED-UNIGRAMS.csv.gz
Loading predictions for:NN-TFIDF-MERGED-BIGRAMS.csv.gz
Loading predictions for:NN-TFIDF-SEPARATED-BIGRAMS-nodropout.csv.gz
Loading predictions for:NN-TFIDF-SEPARATED-UNIGRAMS.csv.gz
Loading predictions for:NN-TFIDF-UNI-MERGED-nodropout.csv.gz
Loading predictions for:all_data_lgbm_regression_results_0.2281.csv.gz
Loading predictions for:all_data_lgbm_poisson_results_0.2281.csv.gz
Loading predictions for:NN-JUST-LSTM-MERGED.csv.gz
Loading predictions for:NN-LSTM-COMBINED-MERGED.csv.gz
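
The rename-and-merge step can be tried in isolation on toy data. The sketch below assumes, as the loop above suggests, that each prediction file holds an `item_id` column and a `deal_probability` column covering both train and test items. The ids, values and the `model_example.csv.gz` name are invented. The final check is an addition of this example: a left merge leaves `NaN` wherever a file has no prediction for an item, so counting missing values is a cheap sanity test.

```python
import pandas as pd

# Hypothetical stand-ins for the train/test item ids.
train_items = pd.DataFrame({'item_id': ['a', 'b', 'c']})
test_items = pd.DataFrame({'item_id': ['d', 'e']})

# One hypothetical prediction file covering both train and test items.
predictions = pd.DataFrame({
    'item_id': ['a', 'b', 'c', 'd', 'e'],
    'deal_probability': [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Rename the prediction column so every model gets its own feature column.
model_name = 'model_example.csv.gz'
predictions = predictions.rename(columns={'deal_probability': model_name})

# A left merge keeps every item on the left side; unmatched ids become NaN.
train_items = train_items.merge(predictions, on='item_id', how='left')
test_items = test_items.merge(predictions, on='item_id', how='left')

# Count items that the prediction file did not cover.
missing_train = int(train_items[model_name].isna().sum())
missing_test = int(test_items[model_name].isna().sum())
```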


In [118]:
list(train_items.columns)

['item_id',
 'model_NN-COUNTVEC-MERGED-BIGRAMS.csv.gz',
 'model_NN-COUNTVEC-MERGED-UNI-nodropout.csv.gz',
 'model_NN-COUNTVEC-SEPARATED-BIGRAMS-nodropout.csv.gz',
 'model_NN-COUNTVEC-SEPARATED-UNIGRAMS.csv.gz',
 'model_NN-TFIDF-MERGED-BIGRAMS.csv.gz',
 'model_NN-TFIDF-SEPARATED-BIGRAMS-nodropout.csv.gz',
 'model_NN-TFIDF-SEPARATED-UNIGRAMS.csv.gz',
 'model_NN-TFIDF-UNI-MERGED-nodropout.csv.gz',
 'model_all_data_lgbm_regression_results_0.2281.csv.gz',
 'model_all_data_lgbm_poisson_results_0.2281.csv.gz',
 'model_NN-JUST-LSTM-MERGED.csv.gz',
 'model_NN-LSTM-COMBINED-MERGED.csv.gz']

In [119]:
train_items.drop(['item_id'], axis=1, inplace=True)
test_items.drop(['item_id'], axis=1, inplace=True)

# Ensembling through a Neural Net

### Define network

In [120]:
hidden_dim1 = 128
out_dim = 1

In [124]:
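# Input: one feature per base model (the prediction columns of train_items).
l_input = Input(shape=(train_items.shape[1],), name="input_predictions")
do1 = Dropout(0.2)(l_input)
# The hidden layer is linear; the only non-linearity is the sigmoid output.
hidden_dim = Dense(hidden_dim1, activation='linear', kernel_regularizer=regularizers.l1(1e-3), name='hidden_dim1')(do1)
do2 = Dropout(0.2)(hidden_dim)
# The sigmoid keeps the prediction in (0, 1), matching the probability target.
output = Dense(out_dim, activation='sigmoid', kernel_regularizer=regularizers.l1(1e-3), name='output')(do2)

adam_opt = Adam(lr=0.001)

def rmse_err(y_true, y_pred):
    """Root mean squared error over the batch."""
    return K.sqrt(K.mean(K.square(y_true - y_pred)))

model = Model(l_input, output)
model.compile(optimizer=adam_opt,
              loss=rmse_err)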
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_predictions (InputLaye (None, 12)                0
_________________________________________________________________
dropout_41 (Dropout)         (None, 12)                0
_________________________________________________________________
hidden_dim1 (Dense)          (None, 128)               1664
_________________________________________________________________
dropout_42 (Dropout)         (None, 128)               0
_________________________________________________________________
output (Dense)               (None, 1)                 129
=================================================================
Total params: 1,793
Trainable params: 1,793
Non-trainable params: 0
_________________________________________________________________
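
Two details of the cell above can be checked by hand: what the `rmse_err` loss computes, and where the parameter counts in the summary come from. The sketch below is a NumPy counterpart written for illustration only. `rmse` and `dense_params` are helper names introduced here, not part of the notebook or of Keras.

```python
import numpy as np


def rmse(y_true, y_pred):
    """NumPy counterpart of the Keras-backend rmse_err loss."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(np.square(y_true - y_pred))))


def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units


hidden = dense_params(12, 128)  # 12 model predictions -> 128 hidden units
output = dense_params(128, 1)   # 128 hidden units -> 1 output unit
total = hidden + output         # dropout layers add no parameters
```

The two counts match the 1,664 and 129 reported in the summary, and their sum matches the 1,793 total.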


In [125]:
nbatch_size = 1024
model.fit(train_items, train_expected,
          #validation_split = 0.3, 
          epochs=1, 
          batch_size=nbatch_size)

Epoch 1/1


<keras.callbacks.History at 0x15308fec4c50>

### Create NN submission file

In [126]:
y_pred = model.predict(test_items)

In [127]:
y_pred = np.clip(y_pred, 0, 1)
res_df = pd.DataFrame(test['item_id'])
res_df['deal_probability'] = y_pred
res_df.head()

| | item_id | deal_probability |
|---|---|---|
| 0 | 6544e41a8817 | 0.194396 |
| 1 | 65b9484d670f | 0.108244 |
| 2 | 8bab230b2ecd | 0.177688 |
| 3 | 8e348601fefc | 0.120174 |
| 4 | 8bd2fe400b89 | 0.130631 |
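
The post-processing step is small enough to sketch on made-up numbers. The example assumes the model returns a single-column 2-D array, which is the usual shape for a one-unit output layer. `raw_pred` and the ids are invented. Flattening with `ravel()` is a choice made for this sketch so the column assignment receives a 1-D array. A sigmoid output already lies in (0, 1), so the clip acts as a safeguard there; for an unbounded linear model it can actually change values.

```python
import numpy as np
import pandas as pd

# Hypothetical raw predictions with shape (n_samples, 1).
raw_pred = np.array([[-0.05], [0.25], [1.10]])

# Force predictions into the valid probability range, then flatten to 1-D.
y_pred = np.clip(raw_pred, 0, 1).ravel()

# Build the submission frame: one id column, one prediction column.
res_df = pd.DataFrame({'item_id': ['id_1', 'id_2', 'id_3']})
res_df['deal_probability'] = y_pred
```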


In [128]:
timestamp = datetime.datetime.now().strftime('%Y%m%d-%H%M%S')  # no spaces or colons in the file name
res_df.to_csv('./submissions/NN_ensemble_res_%s.csv.gz' % timestamp, index=False, compression='gzip')

# Ensembling through a Lasso Regression

### Define lasso regression model

In [129]:
from sklearn.linear_model import Lasso

# positive=True constrains every model's weight to be non-negative;
# the small alpha applies only light L1 shrinkage.
lasso_model = Lasso(alpha=0.0001, precompute=True, max_iter=1000, positive=True, random_state=9999, selection='random')
lasso_model.fit(train_items, train_expected)
y_pred = lasso_model.predict(test_items)
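
To see what the `positive=True` constraint and the L1 penalty do to the blend weights, it can help to fit the same kind of model on synthetic data and inspect `coef_`. The sketch below is illustrative only: the three columns (`good`, `noisy`, `random`) are hypothetical base models invented here, and the weights say nothing about the real models in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(9999)

# Toy target in [0, 1], standing in for deal_probability.
y = rng.uniform(0.0, 1.0, size=500)

# Three hypothetical base models of decreasing quality.
X = np.column_stack([
    y + rng.normal(0.0, 0.05, size=500),  # accurate model
    y + rng.normal(0.0, 0.30, size=500),  # noisy model
    rng.uniform(0.0, 1.0, size=500),      # unrelated predictions
])

# Same alpha and sign constraint as the cell above, on the toy features.
toy_lasso = Lasso(alpha=0.0001, positive=True, max_iter=1000)
toy_lasso.fit(X, y)

# One non-negative weight per base model, plus an intercept.
weights = dict(zip(['good', 'noisy', 'random'], toy_lasso.coef_))
intercept = toy_lasso.intercept_
```

Running `lasso_model.coef_` after the fit above gives the corresponding view for the real ensemble: one non-negative weight per prediction file.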

In [130]:
y_pred = np.clip(y_pred, 0, 1)
res_df = pd.DataFrame(test['item_id'])
res_df['deal_probability'] = y_pred
res_df.head()

| | item_id | deal_probability |
|---|---|---|
| 0 | 6544e41a8817 | 0.279048 |
| 1 | 65b9484d670f | 0.104908 |
| 2 | 8bab230b2ecd | 0.210024 |
| 3 | 8e348601fefc | 0.14464 |
| 4 | 8bd2fe400b89 | 0.102979 |


In [131]:
timestamp = datetime.datetime.now().strftime('%Y%m%d-%H%M%S')  # no spaces or colons in the file name
res_df.to_csv('./submissions/Lasso_regression_ensemble_res_%s.csv.gz' % timestamp, index=False, compression='gzip')