# Santander Customer Transaction Prediction - LightGBM

In this Kaggle competition, the objective is to predict which customers will make a transaction in the future.

**Link to the competition**: https://www.kaggle.com/c/santander-customer-transaction-prediction/  
**Type of Problem**: Classification  
**Metric for evaluation**: AUC (area under the ROC curve)
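
To make the metric concrete: ROC AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as half. The following is a minimal brute-force sketch of that definition on made-up labels and scores; `pairwise_auc` is a hypothetical helper written for illustration, not a library function.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    Quadratic in the number of examples, so only suitable for illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up toy data: two positives and three negatives.
auc_labels = [0, 0, 1, 0, 1]
auc_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
toy_auc = pairwise_auc(auc_labels, auc_scores)
print(round(toy_auc, 4))  # 4 of the 6 pairs are ordered correctly
```

On real data, `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.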

This Python 3 environment comes with many helpful analytics libraries installed.
It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split
import lightgbm
from sklearn.metrics import roc_auc_score

import matplotlib.pyplot as plt

In [2]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/santander-customer-transaction-prediction/sample_submission.csv
/kaggle/input/santander-customer-transaction-prediction/train.csv
/kaggle/input/santander-customer-transaction-prediction/test.csv


## Step1: Read Training Data from CSV
Use the pandas `read_csv` function to read `train.csv`.

In [3]:
input_dir = '/kaggle/input/santander-customer-transaction-prediction/'
df_train = pd.read_csv(input_dir + 'train.csv')  # input_dir already ends with '/'
df_train

| | ID_code | target | var_0 | var_1 | var_2 | var_3 | var_4 | var_5 | var_6 | var_7 | ... | var_190 | var_191 | var_192 | var_193 | var_194 | var_195 | var_196 | var_197 | var_198 | var_199 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train_0 | 0 | 8.9255 | -6.7863 | 11.9081 | 5.0930 | 11.4607 | -9.2834 | 5.1187 | 18.6266 | ... | 4.4354 | 3.9642 | 3.1364 | 1.6910 | 18.5227 | -2.3978 | 7.8784 | 8.5635 | 12.7803 | -1.0914 |
| 1 | train_1 | 0 | 11.5006 | -4.1473 | 13.8588 | 5.3890 | 12.3622 | 7.0433 | 5.6208 | 16.5338 | ... | 7.6421 | 7.7214 | 2.5837 | 10.9516 | 15.4305 | 2.0339 | 8.1267 | 8.7889 | 18.3560 | 1.9518 |
| 2 | train_2 | 0 | 8.6093 | -2.7457 | 12.0805 | 7.8928 | 10.5825 | -9.0837 | 6.9427 | 14.6155 | ... | 2.9057 | 9.7905 | 1.6704 | 1.6858 | 21.6042 | 3.1417 | -6.5213 | 8.2675 | 14.7222 | 0.3965 |
| 3 | train_3 | 0 | 11.0604 | -2.1518 | 8.9522 | 7.1957 | 12.5846 | -1.8361 | 5.8428 | 14.9250 | ... | 4.4666 | 4.7433 | 0.7178 | 1.4214 | 23.0347 | -1.2706 | -2.9275 | 10.2922 | 17.9697 | -8.9996 |
| 4 | train_4 | 0 | 9.8369 | -1.4834 | 12.8746 | 6.6375 | 12.2772 | 2.4486 | 5.9405 | 19.2514 | ... | -1.4905 | 9.5214 | -0.1508 | 9.1942 | 13.2876 | -1.5121 | 3.9267 | 9.5031 | 17.9974 | -8.8104 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 199995 | train_199995 | 0 | 11.4880 | -0.4956 | 8.2622 | 3.5142 | 10.3404 | 11.6081 | 5.6709 | 15.1516 | ... | 6.1415 | 13.2305 | 3.9901 | 0.9388 | 18.0249 | -1.7939 | 2.1661 | 8.5326 | 16.6660 | -17.8661 |
| 199996 | train_199996 | 0 | 4.9149 | -2.4484 | 16.7052 | 6.6345 | 8.3096 | -10.5628 | 5.8802 | 21.5940 | ... | 4.9611 | 4.6549 | 0.6998 | 1.8341 | 22.2717 | 1.7337 | -2.1651 | 6.7419 | 15.9054 | 0.3388 |
| 199997 | train_199997 | 0 | 11.2232 | -5.0518 | 10.5127 | 5.6456 | 9.3410 | -5.4086 | 4.5555 | 21.5571 | ... | 4.0651 | 5.4414 | 3.1032 | 4.8793 | 23.5311 | -1.5736 | 1.2832 | 8.7155 | 13.8329 | 4.1995 |
| 199998 | train_199998 | 0 | 9.7148 | -8.6098 | 13.6104 | 5.7930 | 12.5173 | 0.5339 | 6.0479 | 17.0152 | ... | 2.6840 | 8.6587 | 2.7337 | 11.1178 | 20.4158 | -0.0786 | 6.7980 | 10.0342 | 15.5289 | -13.9001 |


Separate the data into the independent variables (the 200 `var_*` feature columns) and the dependent variable (`target`).  
Use sklearn's `train_test_split` function to hold out 20% of the rows as validation data.

In [4]:
var_columns = [c for c in df_train.columns if c not in ['ID_code','target']]

X = df_train.loc[:,var_columns]
y = df_train.loc[:,'target']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

((160000, 200), (40000, 200), (160000,), (40000,))

## Step2: Create a simple LightGBM Model and evaluate performance
LightGBM provides a `Dataset` class that wraps the features and labels. `lightgbm.train` expects its training and validation data in this form.

In [5]:
train_data = lightgbm.Dataset(X_train, label=y_train)
valid_data = lightgbm.Dataset(X_valid, label=y_valid)

Specify the parameters for LightGBM

In [6]:
parameters = {'objective': 'binary',
              'metric': 'auc',
              'is_unbalance': True,
              'boosting': 'gbdt',
              'num_leaves': 63,
              'feature_fraction': 0.5,
              'bagging_fraction': 0.5,
              'bagging_freq': 20,
              'learning_rate': 0.01,
              'verbose': -1
             }
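
The `is_unbalance` flag asks LightGBM to compensate for class imbalance in the binary objective. This sketch assumes the effect is comparable to up-weighting the rarer class by the ratio of majority to minority examples, which is the value one would otherwise pass as `scale_pos_weight` when positives are the minority (the LightGBM documentation says to use only one of the two parameters). The code only computes that ratio on made-up labels; `imbalance_ratio` is a hypothetical helper, and the exact weighting inside LightGBM is an assumption here.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (#negatives / #positives) for a sequence of 0/1 labels."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive examples")
    return counts[0] / counts[1]


# Made-up labels: 9 negatives and 1 positive.
example_labels = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
ratio = imbalance_ratio(example_labels)
print(ratio)  # 9.0
```

With the split above, the analogous quantity would be `(y_train == 0).sum() / (y_train == 1).sum()`.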

Train the LightGBM model for at most 5000 boosting rounds, stopping early if the validation AUC has not improved for 50 consecutive rounds.

In [7]:
model_lgbm = lightgbm.train(parameters,
                            train_data,
                            valid_sets=valid_data,
                            num_boost_round=5000,
                            early_stopping_rounds=50)

[1]	valid_0's auc: 0.66387
Training until validation scores don't improve for 50 rounds
[2]	valid_0's auc: 0.691844
[3]	valid_0's auc: 0.714632
[4]	valid_0's auc: 0.720863
[5]	valid_0's auc: 0.730256
[6]	valid_0's auc: 0.744463
[7]	valid_0's auc: 0.750754
[8]	valid_0's auc: 0.761351
[9]	valid_0's auc: 0.766762
[10]	valid_0's auc: 0.770928
[11]	valid_0's auc: 0.775191
[12]	valid_0's auc: 0.777102
[13]	valid_0's auc: 0.780203
[14]	valid_0's auc: 0.783726
[15]	valid_0's auc: 0.783711
[16]	valid_0's auc: 0.784068
[17]	valid_0's auc: 0.78349
[18]	valid_0's auc: 0.783665
[19]	valid_0's auc: 0.7836
[20]	valid_0's auc: 0.783855
[21]	valid_0's auc: 0.785852
[22]	valid_0's auc: 0.786739
[23]	valid_0's auc: 0.787076
[24]	valid_0's auc: 0.78861
[25]	valid_0's auc: 0.7896
[26]	valid_0's auc: 0.791877
[27]	valid_0's auc: 0.793067
[28]	valid_0's auc: 0.793554
[29]	valid_0's auc: 0.79517
[30]	valid_0's auc: 0.795741
[31]	valid_0's auc: 0.797234
[32]	valid_0's auc: 0.79732
[33]	valid_0's auc: 0.798988
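
The early-stopping rule can be illustrated without LightGBM: track the best validation score seen so far and stop once a given number of rounds pass without a strict improvement. The sketch below replays that rule over a made-up list of AUC values; `simulate_early_stopping` is a hypothetical helper, and treating only strict improvements as progress is an assumption about the usual behaviour rather than a guarantee for any particular LightGBM version.

```python
def simulate_early_stopping(scores, patience):
    """Replay early stopping over validation scores (higher is better).

    Returns (best_round, stopped_round), both 1-based. stopped_round is None
    when the scores run out before the patience budget is exhausted.
    """
    best_score = float("-inf")
    best_round = 0
    for round_no, score in enumerate(scores, start=1):
        if score > best_score:
            # Strict improvement: remember this round as the new best.
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here.
            return best_round, round_no
    return best_round, None


# Made-up validation AUC values, one per boosting round.
toy_scores = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.79, 0.78]
result = simulate_early_stopping(toy_scores, patience=3)
print(result)  # (3, 6): best at round 3, stopped at round 6
```

Training stops at round 6 even though round 7 would have been better, which is the trade-off a small patience value makes. With `patience=50`, the 33 rounds shown in the log above would not trigger a stop, since the score is still improving at round 33. Newer LightGBM releases expect early stopping to be requested through `callbacks=[lightgbm.early_stopping(stopping_rounds=50)]` rather than the `early_stopping_rounds` argument of `lightgbm.train`.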


In [8]:
y_train_pred = model_lgbm.predict(X_train)
y_valid_pred = model_lgbm.predict(X_valid)

print("AUC Train: {:.4f}\nAUC Valid: {:.4f}".format(roc_auc_score(y_train, y_train_pred),
                                                    roc_auc_score(y_valid, y_valid_pred)))

AUC Train: 0.9883
AUC Valid: 0.8953


## Step3: Find predictions for test data
Read the test and sample submission data

In [9]:
df_test = pd.read_csv(input_dir + 'test.csv')  # input_dir already ends with '/'
df_sample_submission = pd.read_csv(input_dir + 'sample_submission.csv')

In [10]:
X_test = df_test.loc[:,var_columns]
df_sample_submission['target'] = model_lgbm.predict(X_test)
df_sample_submission

| | ID_code | target |
|---|---|---|
| 0 | test_0 | 0.348873 |
| 1 | test_1 | 0.493735 |
| 2 | test_2 | 0.520182 |
| 3 | test_3 | 0.433698 |
| 4 | test_4 | 0.207041 |
| ... | ... | ... |
| 199995 | test_199995 | 0.181327 |
| 199996 | test_199996 | 0.027922 |
| 199997 | test_199997 | 0.011246 |
| 199998 | test_199998 | 0.286368 |


In [11]:
output_dir = '/kaggle/working/'
df_sample_submission.to_csv(output_dir + "04_lgbm_scores.csv", index=False)