## Santander Product Recommendation - Part 4
#### Part 4: Model Training and Validation
This is a work demo for the Santander Product Recommendation project, which was also a Kaggle contest. We ranked 12th on the public leaderboard and 16th on the private leaderboard. The goal of the project was to recommend new products to customers based on their historical behavioral patterns, product purchase records, and demographic information. The demo gives a step-by-step walkthrough of my work. The workflow is split into four parts:
- Part 1 - Data cleaning
- Part 2 - Feature Bank Generation
- Part 3 - EDA and feature exploration
- Part 4 - Model Training and Validation

**Note:** *We use only the provided training data for demonstration and validation, because the true labels were not provided for the test data.*


In [115]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


In [116]:
## path to the SQLite database
config_db = "../input/santander_full.sqlite"

## disable warnings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import datetime
import gc
import seaborn as sns
import matplotlib.pyplot as plt
%pylab inline
pylab.rcParams['figure.figsize'] = (10, 6)

import sqlite3 as sq

## connect to database
sq_conn = sq.connect(config_db)

Populating the interactive namespace from numpy and matplotlib


In [None]:
## load data
data_train_label_stacked = pd.read_sql_query("SELECT * FROM data_train_label_stacked;", sq_conn)
data_label_profile = pd.read_sql_query("SELECT * FROM data_label_profile;", sq_conn)

## 1 Data Processing

In [163]:
## Modules for data processing
from sklearn import preprocessing
def bin_numeric(df, cols_numeric, bins):
    """
    Convert numeric features into categorical ones with pd.cut.
    `bins` is a dictionary mapping each column name to its list of bin edges.
    """
    for col in cols_numeric:
        df.loc[:, col] = pd.cut(df.loc[:, col], bins[col], right = True)
    
    return df
    
def generate_map_dict(df_train, df_test,  cols):
    """
    generate mapping dictionary for cols of data    
    """
    map_dict = {k:{} for k in cols}
    df = pd.concat([df_train.loc[:, cols], df_test.loc[:, cols]], axis = 0)
    for col in cols:
        val_unq = df.loc[:, col].unique().astype(str).tolist()
        map_dict[col].update({k:v for k, v in zip(val_unq, range(len(val_unq)))})
    return map_dict

def preprocess(df, cols, map_dict):
    """
    map categorical values into unique integer index specified in columns
    """
    for col in cols:
        df.loc[:, col] = df.loc[:, col].apply(lambda x: map_dict[col][str(x)])
        
    return df

def create_features_onehot_encode(df, cols, map_dict):
    """
    return numpy array of onehot encoded features
    """
    
    data = df.loc[:, cols].values.astype(int)
    #data[:, -1] = data[:, -1] - 1 # for month
        
    n_values=[len(map_dict[x]) if x != "month" else 12 for x in cols]    
    enc = preprocessing.OneHotEncoder(n_values = n_values,
                                    sparse=False, dtype=np.uint8)
    enc.fit(data)
    encoded_data = enc.transform(data)
    
    return encoded_data
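
The encoder above is given an explicit number of values per column (taken from `map_dict`, or 12 for `month`), so the train and validation arrays end up with the same width even when a category never appears in one of them. The following is a minimal, dependency-free sketch of that idea. The helper name `onehot_fixed_width` is hypothetical and this is not the scikit-learn implementation.

```python
def onehot_fixed_width(rows, n_values):
    """One-hot encode integer-coded rows using a fixed width per column.

    rows:     list of rows, each a list of integer category indices
    n_values: number of possible categories for each column
    """
    # Offset of each column's block inside the output row.
    offsets = [0]
    for n in n_values[:-1]:
        offsets.append(offsets[-1] + n)
    width = sum(n_values)

    encoded = []
    for row in rows:
        out = [0] * width
        for value, offset, n in zip(row, offsets, n_values):
            if not 0 <= value < n:
                raise ValueError("category index %d out of range [0, %d)" % (value, n))
            out[offset + value] = 1
        encoded.append(out)
    return encoded


# Two columns: one with 3 categories, one with 12 (e.g. month 0..11).
toy = onehot_fixed_width([[0, 11], [2, 0]], n_values=[3, 12])
```

Every row has width 3 + 12 = 15 regardless of which categories occur in the data. Note that recent scikit-learn releases no longer accept the `n_values` argument; the `categories` argument plays this role there.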

In [155]:
## define feature columns
## NOTE: cols_product (the list of product column names) is not defined in this notebook;
## it is assumed to come from an earlier part
feat_profile_cate = ['idx_active', 'idx_primary',
                     'type_cust', 'idx_foreigner', 'segmentation',
                     'idx_new_cust', 'type_cust_relation']
feat_profile_num = ['age', 'income']
lags_prod = range(1, 12)  ## lags 1 to 11
feat_prod = [x + "_lag_" + str(y) for x in cols_product for y in lags_prod]  ## lag features

In [156]:
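## 1. Bin numerical variables into categorical ones
## NOTE: data_val_label is not loaded in the cells above; it is assumed to be available from an earlier step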
bins = {}
bins.update({"age": list(range(0, 101, 10)) + [200]})
bins.update({"income": [0] + list(range(20000, 200001, 10000)) + list(range(300000, 1000001, 100000)) + [2000000, 100000000]})
data_train_label_stacked = bin_numeric(data_train_label_stacked, feat_profile_num, bins)
data_val_label = bin_numeric(data_val_label, feat_profile_num, bins)
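
Because `pd.cut` is called with `right = True`, each bin is a right-closed interval `(low, high]`. With the default `include_lowest=False`, a value equal to the first edge (for example an age or income of exactly 0) does not fall into any bin. The sketch below reproduces that rule with the standard library only; `right_closed_bin` is a hypothetical helper written for illustration, not part of pandas.

```python
from bisect import bisect_left


def right_closed_bin(value, edges):
    """Return the index i such that edges[i] < value <= edges[i + 1], or None."""
    i = bisect_left(edges, value) - 1
    if i < 0 or i >= len(edges) - 1:
        return None  # outside (edges[0], edges[-1]]
    return i


# Same edges as bins["age"] above: 0, 10, ..., 100, 200.
age_edges = list(range(0, 101, 10)) + [200]

bin_of_10 = right_closed_bin(10, age_edges)    # falls in (0, 10]
bin_of_11 = right_closed_bin(11, age_edges)    # falls in (10, 20]
bin_of_0 = right_closed_bin(0, age_edges)      # below the first open edge
bin_of_200 = right_closed_bin(200, age_edges)  # falls in (100, 200]
```

If zero values should be kept in the first bin, passing `include_lowest=True` to `pd.cut` is one option.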

In [162]:
## add the month variable back in (values 0-11, matching the 12 one-hot slots used for "month")
for df in [data_train_label_stacked, data_val_label]:
    df.loc[:, "month"] = df.loc[:, "date_record"] % 12

In [164]:
## 2. Get mapping dictionary from the entire dataset (including train and val)
map_dict = generate_map_dict(data_train_label_stacked, data_val_label, feat_profile_cate + feat_profile_num)

In [166]:
## 3. Map feature columns into integer indices (preprocess modifies each frame in place)
for df in [data_train_label_stacked, data_val_label]:
    preprocess(df, feat_profile_cate + feat_profile_num, map_dict)
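
The mapping dictionary is built from the union of train and validation values so that a value seen only in the validation data still has an index and the lookup in `preprocess` does not raise a `KeyError`. Here is a minimal sketch of that idea on plain lists; `build_index_map` and the toy values are hypothetical.

```python
def build_index_map(*value_lists):
    """Map each distinct value (as a string) to an integer index in first-seen order."""
    mapping = {}
    for values in value_lists:
        for v in values:
            mapping.setdefault(str(v), len(mapping))
    return mapping


train_vals = ["A", "B", "A"]
val_vals = ["B", "C"]  # "C" never appears in the training values

seg_map = build_index_map(train_vals, val_vals)
encoded_val = [seg_map[str(v)] for v in val_vals]
```

The number of entries in each mapping is also what later fixes the width of the corresponding one-hot block.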

In [195]:
## -- Save processed data into sql
data_train_label_stacked.to_sql(name='data_train_label_stacked', \
                                con=sq_conn, if_exists='replace', index=False, index_label=None)
data_val_label.to_sql(name='data_val_label', \
                      con=sq_conn, if_exists='replace', index=False, index_label=None)

In [187]:
## 4. Create train and val data after onehot, with product features
X_tr = create_features_onehot_encode(data_train_label_stacked, feat_profile_cate + feat_profile_num + ["month"], map_dict)
X_tr = np.concatenate((X_tr, data_train_label_stacked.loc[:, feat_prod].values.astype(int)), axis = 1)

X_val = create_features_onehot_encode(data_val_label, feat_profile_cate + feat_profile_num + ["month"], map_dict)
X_val = np.concatenate((X_val, data_val_label.loc[:, feat_prod].values.astype(int)), axis = 1)


In [189]:
## 5. Generate labels for train and validation
#### Note that the validation labels are lists of purchased products, so a special routine is needed
#### to compute the validation score, as we will see later
from ast import literal_eval
Y_tr = data_train_label_stacked.label.values
Y_val = data_val_label.new_products.apply(literal_eval).values

In [190]:
print(Y_tr.shape, Y_val.shape)

(631611,) (29717,)


In [191]:
## 6. Define model and train on X_tr, Y_tr
import lightgbm as lgb
PARAMS = {
'n_estimators': 100,
'nthread': 8
}
clf = lgb.LGBMClassifier(**PARAMS)
unq_lb = sorted(np.unique(Y_tr).tolist())    
clf.fit(X_tr, Y_tr, eval_metric="multi_logloss") 

LGBMClassifier(boosting_type='gbdt', colsample_bytree=1, drop_rate=0.1,
        is_unbalance=False, learning_rate=0.1, max_bin=255, max_depth=-1,
        max_drop=50, min_child_samples=10, min_child_weight=5,
        min_split_gain=0, n_estimators=100, nthread=8, num_leaves=31,
        objective='multiclass', reg_alpha=0, reg_lambda=0,
        scale_pos_weight=1, seed=0, sigmoid=1.0, silent=True,
        skip_drop=0.5, subsample=1, subsample_for_bin=50000,
        subsample_freq=1, uniform_drop=False, xgboost_dart_mode=False)

In [192]:
## 7. Predict model on X_val and output MAP@7 score
from helper.average_precision import mapk
def create_prediction(model, X, previous_products, unq_lb):
    """
    Makes a prediction using the given model and parameters
    
    model: trained model
    X: test set
    previous_products: previous product records
    unq_lb: unique labels
    
    """    
    rank = model.predict_proba(X)
    # if some labels are missing, fill zeros in rank so that the shape matches nsamp * 24
    if rank.shape[1] < 24:
        rank_copy = np.zeros((rank.shape[0], 24))
        rank_copy[:, unq_lb] = rank.copy()
        filtered_rank = np.equal(previous_products, 0) * rank_copy
    else:
        filtered_rank = np.equal(previous_products, 0) * rank
    predictions = np.argsort(filtered_rank, axis=1)
    predictions = predictions[:,::-1][:,0:7]

    return predictions 

def validation(Y_val, predictions, k = 7):
    """
    compute the MAP@k score of the predictions on the validation set
    """
    score = mapk(Y_val, predictions, k = k)
    
    return score
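
The `mapk` helper imported from `helper.average_precision` is not shown in this notebook. The sketch below assumes it follows the common definition of mean average precision at k; the names `apk_sketch` and `mapk_sketch` are hypothetical and the actual helper may differ in details such as the handling of empty label lists.

```python
def apk_sketch(actual, predicted, k=7):
    """Average precision at k for one customer.

    actual:    list of products the customer really added
    predicted: ranked list of recommended products (best first)
    """
    if not actual:
        return 0.0
    predicted = list(predicted)[:k]
    hits = 0
    score = 0.0
    for i, p in enumerate(predicted):
        # Count a hit only the first time a correct product appears.
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)  # precision at rank i + 1
    return score / min(len(actual), k)


def mapk_sketch(actual, predicted, k=7):
    """Mean of apk_sketch over all customers."""
    return sum(apk_sketch(a, p, k) for a, p in zip(actual, predicted)) / len(actual)


# Customer 1 added products 2 and 5, found at ranks 1 and 3.
# Customer 2 added product 7, found at rank 2.
toy_actual = [[2, 5], [7]]
toy_predicted = [[2, 9, 5, 1, 0, 3, 4], [1, 7, 2, 3, 4, 5, 6]]
toy_score = mapk_sketch(toy_actual, toy_predicted, k=7)
```

For the first customer the average precision is (1/1 + 2/3) / 2 = 5/6, for the second it is (1/2) / 1 = 1/2, so the mean is 2/3.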


In [193]:
previous_products = data_val_label.loc[:, [x + "_lag_1" for x in cols_product]].values.astype(int)
predictions = create_prediction(clf, X_val, previous_products, unq_lb)
score = validation(Y_val, predictions, k = 7)
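
The core of `create_prediction` is to zero out the probability of every product the customer already held in the previous month and then keep the seven highest-scoring products. The following sketch shows the same logic for a single customer using plain lists. The function `top_k_new_products` is hypothetical, and ties may be ordered differently than with `np.argsort`.

```python
def top_k_new_products(probabilities, owned_last_month, k=7):
    """Rank products by probability after zeroing those already owned."""
    masked = [p if owned == 0 else 0.0
              for p, owned in zip(probabilities, owned_last_month)]
    # Sort product indices by descending masked probability.
    order = sorted(range(len(masked)), key=lambda j: masked[j], reverse=True)
    return order[:k]


# Product 0 has the highest probability but is already owned, so it is pushed back.
toy_probs = [0.5, 0.3, 0.1, 0.1]
toy_owned = [1, 0, 0, 0]
toy_top2 = top_k_new_products(toy_probs, toy_owned, k=2)
```

Zeroing rather than removing owned products keeps every row the same length, which is what allows the notebook version to work on whole arrays at once.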

In [194]:
score

0.72586666228672492

In [186]:
predictions

array([[18, 23, 12, ..., 22, 21,  4],
       [23, 13, 11, ...,  4,  7, 17],
       [23, 12, 11, ...,  4, 21, 22],
       ..., 
       [ 2,  6,  9, ..., 11, 21, 22],
       [ 2,  9, 23, ...,  6, 13, 22],
       [ 2,  6, 22, ..., 23,  4,  9]])

## Results

```
MAP@7 = 0.89665718035371766 with lag features
```

```
MAP@7 = 0.72586666228672492 without lag features
```