# Santander Customer Satisfaction

In this competition, Santander asked us to identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.
https://www.kaggle.com/c/santander-customer-satisfaction

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white", color_codes=True)

# Loading data
train = pd.read_csv("train.csv")  # the training set is now a pandas DataFrame
test = pd.read_csv("test.csv")  # the test set is now a pandas DataFrame

# Let's see what's in the training data
train.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,1,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
1,3,2,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
2,4,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
3,8,2,37,0.0,195.0,195.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
4,10,2,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


Happy customers are represented by 0, unhappy customers by 1.

In [2]:
df = train.TARGET.value_counts().to_frame('TARGET')  # name the count column explicitly so this works across pandas versions
df['Percentage'] = 100*df['TARGET']/train.shape[0]
df

Unnamed: 0,TARGET,Percentage
0,73012,96.043147
1,3008,3.956853


The dataset is heavily imbalanced: only about 4% of the customers are unhappy.
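
With this much imbalance, plain accuracy is a poor yardstick, which is one reason to prefer the ROC AUC metric used later in this notebook. As a rough sketch, using only the class counts from the table above, a "model" that always predicts 0 already looks very accurate:

```python
# Class counts taken from the value_counts() output above.
n_happy, n_unhappy = 73012, 3008
n_total = n_happy + n_unhappy

# A "model" that always predicts 0 (happy) is right for every happy customer
# and wrong for every unhappy one.
baseline_accuracy = n_happy / n_total
print("Always-happy baseline accuracy: {:.4f}".format(baseline_accuracy))
```

Such a baseline never flags a single unhappy customer, so its high accuracy says nothing about how useful it is.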

In [3]:
print('Train: {}\nTest: {}'.format(train.shape, test.shape))

Train: (76020, 371)
Test: (75818, 370)


Checking for nulls

In [4]:
# Count the columns that contain at least one null value
print('Train: {}\nTest: {}'.format((train.isnull().sum() > 0).sum(), (test.isnull().sum() > 0).sum()))

Train: 0
Test: 0
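
To make the null check concrete, here is a minimal sketch on a made-up frame (`toy` and its columns are hypothetical, not part of the competition data). It counts both the columns that contain nulls and the total number of null cells:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" has two missing values, "a" has none.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [np.nan, 5.0, np.nan]})

nulls_per_column = toy.isnull().sum()              # a -> 0, b -> 2
columns_with_nulls = (nulls_per_column > 0).sum()  # how many columns have any null
total_nulls = nulls_per_column.sum()               # how many cells are null overall

print(columns_with_nulls, total_nulls)
```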


Let's drop constant features: a column that holds a single value for every row carries no information for the model.

In [5]:
def identify_constant_features(dataframe):
    count_uniques = dataframe.apply(lambda x: len(x.unique()))
    constants = count_uniques[count_uniques == 1].index.tolist()
    return constants

constant_features_train = identify_constant_features(train)
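
The same idea can be tried on a small made-up frame. This is only a sketch: `toy` and its column names are hypothetical, and it uses `DataFrame.nunique()` as a shorthand for counting distinct values per column.

```python
import pandas as pd

# Hypothetical toy frame: "always_zero" never changes, so it carries no signal.
toy = pd.DataFrame({
    "age": [23, 34, 37],
    "always_zero": [0, 0, 0],
})

# Same idea as identify_constant_features: count distinct values per column
# and keep the names of the columns that have exactly one.
unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)  # ['always_zero']
```

Note that `nunique()` ignores NaN by default, whereas `len(x.unique())` counts it, so the two can disagree on columns with missing values.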


Now we drop the constant features found in the training set from both datasets.

In [6]:
train.drop(constant_features_train, inplace=True, axis=1)
test.drop(constant_features_train, inplace=True, axis=1)

We also need to remove duplicated features (columns whose values are identical to those of another column), since they add no new information.

In [7]:
from itertools import combinations

def identify_equal_features(dataframe):
    pairs = list(combinations(dataframe.columns.tolist(),2))
    eq = []
    for p in pairs:
        is_eq = np.array_equal(dataframe[p[0]],dataframe[p[1]])
        if is_eq:
            eq.append(list(p))
    return eq

equal_features_train = identify_equal_features(train)
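
As a small illustration of the pairwise comparison, the sketch below runs the same kind of scan on a made-up frame (`toy` and its columns are hypothetical). It uses `Series.equals` in place of `np.array_equal`, then drops the second column of each identical pair:

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy frame: "x_copy" duplicates "x"; "y" is different.
toy = pd.DataFrame({
    "x": [1, 2, 3],
    "x_copy": [1, 2, 3],
    "y": [3, 2, 1],
})

# Compare every pair of columns and record the ones that match exactly.
equal_pairs = [
    [first, second]
    for first, second in combinations(toy.columns, 2)
    if toy[first].equals(toy[second])
]
print(equal_pairs)  # [['x', 'x_copy']]

# Keep the first column of each pair, drop the second.
to_drop = sorted({second for _, second in equal_pairs})
deduplicated = toy.drop(columns=to_drop)
print(deduplicated.columns.tolist())  # ['x', 'y']
```

The scan compares every pair of columns, so its cost grows quadratically with the number of columns. That is acceptable for a few hundred features but would become slow for much wider tables.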

Now we drop the second column of each identical pair from both datasets.

In [8]:

train.drop(np.array(equal_features_train)[:,1], axis=1, inplace=True)
test.drop(np.array(equal_features_train)[:,1], axis=1, inplace=True)

Checking the shape

In [9]:
print(train.shape)
print(test.shape)

(76020, 308)
(75818, 307)


That looks right: the training set has one extra column, TARGET, which is why it has 308 columns instead of 307.
Now we separate this column into y, keep the remaining columns in X, and start building our model.

In [10]:
X = train.iloc[:,:-1]
y = train.TARGET

We don't know yet which model will work best, so we are going to use cross-validation to compare their performance.

In [11]:
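# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection as cv
from sklearn import tree
from sklearn import metrics
from sklearn import ensemble
from sklearn import linear_model
from sklearn import naive_bayes

# Defining parameters

# Stratified folds keep the share of unhappy customers roughly equal in every fold
skf = cv.StratifiedKFold(n_splits=3, shuffle=True)
score_metric = 'roc_auc'
scores = {}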
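
Stratification matters here because the positive class is rare. The sketch below uses made-up labels (`y_toy` and `X_toy` are hypothetical) and assumes a recent scikit-learn in which the splitters live in `sklearn.model_selection`. It shows that each test fold receives the same number of positives:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 90 negatives and 9 positives.
y_toy = np.array([0] * 90 + [1] * 9)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant for splitting

splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Count how many positive samples land in each test fold.
positives_per_fold = [
    int(y_toy[test_idx].sum()) for _, test_idx in splitter.split(X_toy, y_toy)
]
print(positives_per_fold)  # [3, 3, 3]
```

With a plain, unstratified split, a fold could end up with very few positives, which would make its ROC AUC estimate noisy.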



In [12]:
# Building a function to test ML models

def score_model(model):
    return cv.cross_val_score(model, X, y, cv=skf, scoring=score_metric)
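
For intuition about the `roc_auc` score: it can be read as the probability that a randomly chosen unhappy customer gets a higher predicted score than a randomly chosen happy one. The helper below, `pairwise_auc`, is a hypothetical name written for this illustration only, not a scikit-learn function. It is a brute-force sketch that is fine for tiny inputs but far too slow for the full dataset:

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # diff[i, j] > 0 means positive i is scored above negative j.
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. A model that scores everyone identically gets 0.5.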

Let's test AdaBoost, GradientBoosting, and XGBoost.

In [13]:
# ada_boost
scores['ada_boost'] = score_model(ensemble.AdaBoostClassifier())

In [14]:
# grad_boost
scores['grad_boost'] = score_model(ensemble.GradientBoostingClassifier())

In [18]:
import xgboost as xgb
# xgboost
scores['xgboost'] = score_model(xgb.XGBClassifier())

Checking the results

In [19]:
print(pd.DataFrame(scores).mean())

ada_boost     0.826987
grad_boost    0.834328
xgboost       0.836663
dtype: float64


XGBoost had the best mean ROC AUC, although only by a small margin over gradient boosting (0.8367 vs. 0.8343).

Now we are going to use it, train the model and make predictions :)

In [20]:
model = xgb.XGBClassifier()
model.fit(X, y)
y_pred = model.predict_proba(test)

It's time to prepare our submission file for Kaggle.

In [21]:
test_id = test.ID
submission = pd.DataFrame({"ID":test_id, "TARGET": y_pred[:,1]})

In [22]:
submission.to_csv("submission.csv", index=False)
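
As a quick sanity check of the file layout, the sketch below builds a tiny submission from made-up values (`toy_ids` and `toy_proba` are hypothetical stand-ins for `test.ID` and the `predict_proba` output) and writes it to an in-memory buffer instead of disk:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for test.ID and the output of predict_proba.
toy_ids = pd.Series([2, 5, 6], name="ID")
toy_proba = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]])

# Column 1 of predict_proba holds the probability of class 1 (unhappy).
toy_submission = pd.DataFrame({"ID": toy_ids, "TARGET": toy_proba[:, 1]})

# index=False keeps the DataFrame index out of the file.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])  # ID,TARGET
```

The file should contain exactly two columns, `ID` and `TARGET`, with one row per test customer.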

Our submission scored 0.836339 on Kaggle (the competition is evaluated on ROC AUC).