#  <font color="3d62f5"> Santander</font> Customer Satisfaction

From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving.

Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.

In this competition, you'll work with hundreds of anonymized features to predict whether a customer is satisfied or dissatisfied with their banking experience.

---

In [102]:
#@title #### Modules installation
!pip install opendatasets pyjanitor icecream orjson --quiet
!pip install --upgrade xgboost --quiet

In [103]:
#@title Download Santander Dataset
import opendatasets as od

api_key = {'...'}

# opendatasets download
kaggle_link = 'https://www.kaggle.com/c/santander-customer-satisfaction/data'
od.download(kaggle_link)

Skipping, found downloaded files in "./santander-customer-satisfaction" (use force=True to force download)


### Essential Modules

---

For this machine learning project we'll be using our standard modules for data
analysis along with interactive graphs from plotly.

In [104]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

from scipy.stats import entropy
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree, export_text
from xgboost import XGBClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, confusion_matrix
import xgboost

import janitor
from icecream import ic
import warnings
import time
from typing import List
from tqdm import tqdm_notebook


pio.templates.default = 'plotly_white'
warnings.filterwarnings(action='ignore')
color_map = ["#fa7fb7","#ea5fd5","#b34df7","#994af3",
    "#8948f2","#7a46f1","#8d89df","#91a3f5","#93c1f6","#95e0f6"]

In [105]:
# create a data load function
train_fp = '/content/santander-customer-satisfaction/train.csv'
test_fp = '/content/santander-customer-satisfaction/test.csv'
def load_dataset(filepath: str) -> pd.DataFrame:
    """
    :param filepath: file directory of the data
    :returns: DataFrame Object
    """
    dataframe = pd.read_csv(filepath)
    dataframe = dataframe.remove_empty()
    missing_data = dataframe.isna().sum(axis=1).sum()
    print(f"""
    The dataframe has {dataframe.shape[0]} entries.
        with {dataframe.shape[1]} features. With
        {missing_data} total missing features
    """)
    return dataframe

### Load Datasets

In [106]:
# Training data
train = load_dataset(train_fp)


    The dataframe has 76020 entries.
        with 371 features. With
        0 total missing features
    


In [107]:
# Test data
test = load_dataset(test_fp)


    The dataframe has 75818 entries.
        with 370 features. With
        0 total missing features
    


Both the train and test sets are wide: the test set has 370 columns, and the
training set has 371 because it also carries the `TARGET` label. Our first goal
is to check the entropy of each column and then look for constant,
quasi-constant, and duplicated features.

### Training Data Entropy

In [108]:
entropy_list = []
# iterate over the columns directly; tqdm draws the progress bar
for col in tqdm_notebook(train.columns):
    entropy_list.append(entropy(train[col]))


In [109]:
def get_entropy(vals: List[float]) -> None:
    """
    :param vals: Entropy values in list
    """
    # use the argument rather than the global entropy_list
    entropy_values = pd.Series(vals)
    nan_entropy = entropy_values.isna().sum()
    median_entropy = entropy_values.median()
    # np.float was removed in NumPy 1.24; np.isneginf / np.isposinf
    # count the infinite values without a manual loop
    inf_count_neg = int(np.isneginf(entropy_values).sum())
    inf_count_pos = int(np.isposinf(entropy_values).sum())

    print(f"""
    null entropy counts: {nan_entropy}
    negative infinite entropy counts: {inf_count_neg}
    positive infinite entropy counts: {inf_count_pos}
    median entropy: {median_entropy}
    """)

get_entropy(entropy_list)


    null entropy counts: 34
    negative infinite entropy counts: 31
    positive infinite entropy counts: 0
    median entropy: 5.078338330919761
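
The `nan` and `-inf` counts above are worth a second look. `scipy.stats.entropy` treats its input as a (possibly unnormalized) probability vector, so passing raw column values means all-zero columns and columns with negative values are likely to produce undefined or infinite results. A minimal sketch of an alternative, assuming we want the Shannon entropy of each column's empirical value distribution instead (the helper `column_entropy` and the toy frame `demo` are hypothetical, not part of the original notebook):

```python
import numpy as np
import pandas as pd
from scipy.stats import entropy


def column_entropy(col: pd.Series) -> float:
    """Shannon entropy (in nats) of a column's empirical value distribution."""
    # value_counts(normalize=True) yields proper probabilities that sum to 1
    probs = col.value_counts(normalize=True)
    return float(entropy(probs))


# toy data: a constant column, a balanced binary column, a column with negatives
demo = pd.DataFrame({
    "constant": [0, 0, 0, 0],
    "binary": [0, 1, 0, 1],
    "with_negatives": [-1, -1, 2, 3],
})

# apply the helper column by column; the result is a Series indexed by column name
demo_entropy = demo.apply(column_entropy)
print(demo_entropy)
```

With this definition a constant column has entropy 0, which makes it a direct signal for the constant-feature check that follows.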
    


### Feature Split 

The correlation plot below shows that some features are strongly positively
correlated, while others show no correlation at all. In the target variable, the majority of customers are satisfied.

In [110]:
# select x and y features
X = train.drop('TARGET', axis=1)
y = train['TARGET']

train_corrs = X.corr()

In [111]:
#@title Training Correlation
def plot_corr_heatmap() -> go.Figure:

    fig = px.imshow(train_corrs,
            color_continuous_scale=color_map)\
        .update_layout(
            title="<b>Training Data</b><br> Correlation",
            width=780) \
        .update_yaxes(visible=False) \
        .update_xaxes(visible=False)
    
    return fig.show()

plot_corr_heatmap()
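
With several hundred features the heatmap is hard to read pair by pair. As a rough sketch, the strongly correlated pairs can also be listed directly from the correlation matrix. The helper name `top_correlated_pairs`, the 0.9 threshold, and the toy frame are assumptions for illustration only:

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr().abs()
    # keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) never pass the comparison, so they drop out here
    return pairs[pairs > threshold].sort_values(ascending=False)


demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # exact multiple of "a"
    "c": [5, 1, 4, 2, 3],    # weakly related to "a"
})
pairs = top_correlated_pairs(demo)
print(pairs)
```

On the real data this would be called as `top_correlated_pairs(X)`.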

In [112]:
#@title Target Variable Bar Plot
def plot_class_bar() -> go.Figure:
    """
    :returns: Bar plot of class variable.
    """
    title = '<b>Target Variable</b>'    
    fig = go.Figure()
    labels = ['Satisfied', 'Unsatisfied']

    # add multiple traces to figure
    i = 0
    for idx, name in zip(y.value_counts(), labels):
        fig.add_trace(go.Bar(
            x=[name],
            y=[idx],
            name=name,
            marker_color=color_map[i]))
        i += 5

    # tweak layout
    fig.update_layout(title=title, width=780)
    return fig.show()

plot_class_bar()

### Train Test Split

In [113]:
# split our data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=25)

In [114]:
def check_split() -> None:
    """
    Prints split sizes
    """
    return print(f"""
    Train Test Split Output
    -------------------------------
    X_train size: {X_train.shape[0]}
    X_test size: {X_test.shape[0]}
    y_train: {y_train.shape[0]}
    y_test: {y_test.shape[0]}
    """)

check_split()


    Train Test Split Output
    -------------------------------
    X_train size: 60816
    X_test size: 15204
    y_train: 60816
    y_test: 15204
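
Because the target is heavily skewed toward satisfied customers, a plain random split can leave the two partitions with slightly different class ratios. A minimal sketch of a stratified split, assuming synthetic data with roughly 4% positives (the arrays `X_demo` and `y_demo` are made up for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(25)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 40 + [0] * 960)  # about 4% positive class

# stratify=y_demo keeps the class ratio the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=25, stratify=y_demo)

print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

In the notebook this would amount to passing `stratify=y` to the existing `train_test_split` call.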
    


# Feature <font color="3d62f5">Selection</font>

---

### Constant Features

In [115]:
constant_feat = VarianceThreshold(threshold=0)
constant_feat.fit(X_train)

xtrain_cons = pd.DataFrame(constant_feat.transform(X_train))
xtest_cons = pd.DataFrame(constant_feat.transform(X_test))

cols = constant_feat.get_support()   
col_names = [*X_train.loc[:, cols].columns]

xtrain_cons.columns = col_names
xtest_cons.columns = col_names

In [116]:
xtrain_cons.shape, xtest_cons.shape

((60816, 333), (15204, 333))

### Quasi-constant Features

In [117]:
quasi_const_features = VarianceThreshold(threshold=0.01)
quasi_const_features.fit(xtrain_cons)

xtrain_qcons = pd.DataFrame(quasi_const_features.transform(xtrain_cons))
xtest_qcons = pd.DataFrame(quasi_const_features.transform(xtest_cons))

cols = quasi_const_features.get_support()
col_names = [*xtrain_cons.loc[:, cols].columns]

In [118]:
xtrain_qcons.columns = col_names
xtest_qcons.columns = col_names

In [119]:
xtrain_qcons.shape, xtest_qcons.shape

((60816, 273), (15204, 273))

### Duplicated Features

In [120]:
xtrain_t = xtrain_qcons.T
xtest_t = xtest_qcons.T
xtrain_t_duple = xtrain_t.duplicated()
keep = [not idx for idx in xtrain_t_duple]
X_train = xtrain_t[keep].T
X_test = xtest_t[keep].T

In [121]:
X_train.shape, X_test.shape

((60816, 256), (15204, 256))
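
The steps above (constant, quasi-constant, duplicated) can be sketched as a single helper that keeps the column names throughout. This is an illustrative rewrite under stated assumptions, not the notebook's own code: the function name `select_features` and the toy frames are hypothetical, and a variance threshold of 0.01 also covers the constant case.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def select_features(train_df: pd.DataFrame, test_df: pd.DataFrame,
                    threshold: float = 0.01):
    """Drop low-variance and duplicated columns, fitted on the training data only."""
    selector = VarianceThreshold(threshold=threshold).fit(train_df)
    kept = train_df.columns[selector.get_support()]
    # duplicated() on the transpose flags columns identical to an earlier one
    duplicated = train_df[kept].T.duplicated()
    final_cols = kept[~duplicated.to_numpy()]
    return train_df[final_cols], test_df[final_cols]


demo_train = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, 2, 3, 4],          # duplicate of "a"
    "c": [0, 0, 0, 0],          # constant
    "d": [0.0, 0.0, 0.0, 0.1],  # quasi-constant (variance below 0.01)
})
demo_test = demo_train.copy()

tr, te = select_features(demo_train, demo_test)
print(list(tr.columns))
```

Selecting by column name avoids having to reassign `columns` after each `transform`, and the same column list is applied to the test frame.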

In [122]:
#@title Feature Selection Outcome
def plot_fs_outcome() -> go.Figure:
    """
    :returns: Plotly Bar Figure
    """
    title = '<b>Feature Selection</b> Outcome'

    # create labels and values
    labels = ['Before F. Selection', 'After F. Selection']
    values = [train.shape[1] - 1, X_train.shape[1]]

    # create figure
    fig = go.Figure()

    # add traces in for loop
    i = 0
    for val, trace_label in zip(values, labels):
        fig.add_trace(go.Bar(
            x=[trace_label],
            y=[val],
            name=trace_label,
            text=[val],
            textposition='inside',
            marker_color=color_map[i]))
        i += 5
    
    # tweak layout
    fig.update_layout(title=title, width=780)
    fig.update_yaxes(visible=False)
    
    return fig.show()

plot_fs_outcome()

# <font color="3d62f5">Model</font> Building

---

Since we're dealing with a classification problem, this kernel uses two models: a decision tree and XGBoost.

## Decision Tree | Model # 1

In [123]:
tree = DecisionTreeClassifier(random_state=25) \
    .fit(X_train , y_train)

In [124]:
tree

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=25, splitter='best')

### Training Score

In [125]:
tree.score(X_train, y_train)

1.0

### Model Score

In [126]:
tree.score(X_test, y_test)

0.9273217574322546

The model is overfitting the training data: it scores 1.0 on the training set
but about 0.927 on the held-out set. So we'll need to tune its hyperparameters.

### Decision Tree Hyperparameter Tuning

The functions below help us run the model tests: `run_tree` fits a tree and
records its train and test errors, and `show_errors` plots those errors in a line chart.

In [127]:
def run_tree(X_train, X_test, y_train, y_test, **kwargs) -> None:
    """
    Get Scores for Hyperparameters
    """
    t = DecisionTreeClassifier(
        **kwargs, random_state=25).fit(X_train, y_train)
    train_score = 1 - t.score(X_train, y_train)
    test_score = 1 - t.score(X_test, y_test)
    train_score_list.append(train_score)
    test_score_list.append(test_score)

    print(f"""Train Error: {train_score}, Test Error: {test_score}""")


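def show_errors(train_score_list: List[float], test_score_list: List[float], hyperparam: str, start: int) -> None:
    """
    :param train_score_list: training errors, one per hyperparameter value
    :param test_score_list: test errors, one per hyperparameter value
    :param hyperparam: hyperparameter name used for the title and x-axis
    :param start: first hyperparameter value, used to offset the index
    :returns: None; the line chart is displayed
    """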
    train_test_scores = pd.DataFrame()
    train_test_scores['train_error'] = pd.Series(train_score_list)
    train_test_scores['test_error'] = pd.Series(test_score_list)
    train_test_scores.index = train_test_scores.index + start

    def plot_max_depth_error() -> go.Figure:
        """
        :returns: Plotly Line Graph for Max Depth Error in 
            Training and Test Data.
        """
        title = f"{hyperparam} Hyperparameter"   
        i = 0
        fig = go.Figure()
        for cols in train_test_scores:
            fig.add_trace(go.Scatter(
                mode='lines',
                x=train_test_scores.index,
                y=train_test_scores[cols],
                name=cols,))
        fig.update_yaxes(title="Prediction Error (1 - Score)")
        fig.update_xaxes(title=f"{hyperparam}")
        fig.update_layout(title=title,
            width=780, hovermode="x")
        
        return fig.show()

    return plot_max_depth_error()

#### max_depth

For our first hyperparameter we'll test `max_depth` over the range 1 to 49.
Below is a for loop that prints the training and test errors and
appends them to lists for plotting.

In [128]:
%%time
train_score_list = []
test_score_list = []
# fit one tree per max_depth value from 1 to 49
for depth in tqdm_notebook(range(1, 50)):
    run_tree(X_train, X_test, y_train, y_test,
        max_depth=depth)


Train Error: 0.03936464088397795, Test Error: 0.04038410944488291
Train Error: 0.03936464088397795, Test Error: 0.04038410944488291
Train Error: 0.03933175480136808, Test Error: 0.04038410944488291
Train Error: 0.039282425677453325, Test Error: 0.040449881610102656
Train Error: 0.03910155222309919, Test Error: 0.040581425940541926
Train Error: 0.038706919231781156, Test Error: 0.04084451460142069
Train Error: 0.038230071033938384, Test Error: 0.04176532491449614
Train Error: 0.037638121546961334, Test Error: 0.04156800841883712
Train Error: 0.0371283872665088, Test Error: 0.04275190739279133
Train Error: 0.03628979215995787, Test Error: 0.043804262036306274
Train Error: 0.03546764009471193, Test Error: 0.04446198368850307
Train Error: 0.03443172849250198, Test Error: 0.04525124967113914
Train Error: 0.03347803209681666, Test Error: 0.04610628781899495
Train Error: 0.032179031833727945, Test Error: 0.046829781636411494
Train Error: 0.0306004998684557, Test Error: 0.04814522494080509
Tra

In [129]:
show_errors(train_score_list, test_score_list, 'Max Depth', start=1)

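The test error is lowest for the shallowest trees and starts to climb and
fluctuate around `max_depth` 6, while the train error keeps decreasing slowly.
Deeper trees therefore fit the training data better without generalizing better.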
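
Rather than reading the best value off the chart, it can be picked programmatically. A small sketch, using the first 15 test errors printed above (rounded to five decimals) as input; the variable names are chosen for this example:

```python
import pandas as pd

# test errors printed above for max_depth = 1..15, rounded to 5 decimals
test_errors = [
    0.04038, 0.04038, 0.04038, 0.04045, 0.04058,
    0.04084, 0.04177, 0.04157, 0.04275, 0.04380,
    0.04446, 0.04525, 0.04611, 0.04683, 0.04815,
]
errors = pd.Series(test_errors, index=range(1, len(test_errors) + 1),
                   name="test_error")

# idxmin returns the first depth that reaches the minimum error
best_depth = int(errors.idxmin())
# all depths tied at the minimum
tied_depths = errors[errors == errors.min()].index.tolist()

print(best_depth, tied_depths)
```

Within these 15 values, the lowest test error is already reached by the shallowest trees, which suggests that accuracy alone may not separate these models well on such an imbalanced target.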

#### Min Samples Split

---

For our second hyperparameter we'll test `min_samples_split`.
The for loop below mirrors the one used for `max_depth`, this time over the values 2 to 59.

In [130]:
%%time
train_score_list = []
test_score_list = []
# fit one tree per min_samples_split value from 2 to 59
for min_split in tqdm_notebook(range(2, 60)):
    run_tree(X_train, X_test, y_train, y_test,
        min_samples_split=min_split)


Train Error: 0.0, Test Error: 0.07267824256774535
Train Error: 0.0036503551696921432, Test Error: 0.06886345698500396
Train Error: 0.0075637990002630495, Test Error: 0.06952117863720075
Train Error: 0.0101289134438306, Test Error: 0.06846882399368592
Train Error: 0.012496711391739024, Test Error: 0.06715338068929233
Train Error: 0.015061825835306464, Test Error: 0.06544330439358059
Train Error: 0.016245724809260675, Test Error: 0.06524598789792158
Train Error: 0.017758484609313285, Test Error: 0.06399631675874773
Train Error: 0.019008155748487243, Test Error: 0.06261510128913439
Train Error: 0.019863193896343057, Test Error: 0.062088923967376974
Train Error: 0.020455143383320218, Test Error: 0.06300973428045253
Train Error: 0.021688371481189184, Test Error: 0.061628518810839306
Train Error: 0.022378979215995742, Test Error: 0.06004998684556695
Train Error: 0.02290515653775327, Test Error: 0.059984214680347314
Train Error: 0.023842409892133598, Test Error: 0.058734543541173356
Train Err

In [131]:
show_errors(train_score_list, test_score_list, 'Min Sample Split', start=2)

Here we can see that our errors for train and test set are s

### Combining both Hyperparameters

In [132]:
run_tree(X_train, X_test, y_train, y_test,
    max_depth=7, min_samples_split=58)

Train Error: 0.038476716653512266, Test Error: 0.04143646408839774
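
Tuning each hyperparameter separately and then combining the picks ignores how the two interact, and scoring on the held-out set risks tuning to it. One possible alternative is a cross-validated grid search over both at once. The sketch below assumes synthetic imbalanced data from `make_classification` and a small, arbitrary grid; neither comes from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the training data, about 10% positives
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=25)

param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_split": [2, 20, 60],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=25),
    param_grid=param_grid,
    scoring="roc_auc",  # less sensitive to class imbalance than accuracy
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

On the real data, `X_train` and `y_train` would replace the synthetic arrays, leaving `X_test` untouched until the final evaluation.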


In [133]:
model = DecisionTreeClassifier()

## XGboost | Model #2

### Preprocessing | MinMax Scaler

In [134]:
scale = MinMaxScaler()
scale.fit(X_train)

xtrain_scale = scale.transform(X_train)
xtest_scale = scale.transform(X_test)

### XGBClassifier Model

In [135]:
gb = XGBClassifier(random_state=25)
gb.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=2, num_parallel_tree=1,
              objective='binary:logistic', random_state=25, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)

In [136]:
pred = gb.predict(X_train)
accuracy_score(y_train, pred)

0.9666370691923178

### Feature Importance

In [137]:
importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': gb.feature_importances_
}).sort_values('importance', ascending=False)

In [138]:
#@title Feature Importance
title = '<b>Feature Importance</b>'
px.bar(importance[:10], y='feature', x='importance')\
    .update_traces(orientation='h', marker_color=np.flip(color_map))\
    .update_layout(width=780, title=title, margin=dict(pad=10))

### Prediction Score

In [139]:
pred = gb.predict(X_test)
accuracy_score(y_test, pred)

0.9593528018942383
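
An accuracy near 0.96 should be read against the class balance: the depth-1 tree above already reaches a test error of about 0.04. As a hedged illustration on made-up labels (96 satisfied, 4 dissatisfied), a baseline that always predicts the majority class gets the same kind of accuracy while ranking customers no better than chance:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# toy labels: 96 satisfied (0) and 4 dissatisfied (1) customers
y_true = np.array([0] * 96 + [1] * 4)
X_dummy = np.zeros((100, 1))  # features are ignored by the baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
pred = baseline.predict(X_dummy)
proba = baseline.predict_proba(X_dummy)[:, 1]

baseline_accuracy = accuracy_score(y_true, pred)
baseline_auc = roc_auc_score(y_true, proba)
cm = confusion_matrix(y_true, pred)

print(baseline_accuracy, baseline_auc)
print(cm)
```

Reporting `roc_auc_score` and the confusion matrix alongside accuracy would make it easier to tell whether the XGBoost model actually finds dissatisfied customers.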

## Hyperparameter Tuning

In [140]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold

In [141]:
model = XGBClassifier()

In [142]:
params = {
    'max_depth': [4, 5, 6, 7],
    'learning_rate': [0.10, 0.12, 0.15, 0.16],
    'random_state': [25]}

In [143]:
clf = GridSearchCV(
    model, params, n_jobs=-1,
    scoring='roc_auc', cv=5,
    verbose=3)

In [144]:
clf.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed: 20.5min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed: 60.9min finished




GridSearchCV(cv=5, error_score=nan,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estim...
                                     random_state=None, reg_alpha=None,
                                     reg_lambda=None, scale_pos_weight=None,
                                     subsample=None, tree_method=None,
                                     use_label_encoder=True,
  

In [145]:
clf.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=2, num_parallel_tree=1,
              objective='binary:logistic', random_state=25, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)

In [146]:
clf.best_params_

{'learning_rate': 0.1, 'max_depth': 4, 'random_state': 25}

In [147]:
clf.best_score_

0.8382948729812201

# Conclusion

The hyperparameter tuning did not go as well as expected: neither the training
nor the test accuracy improved.
Hence, we keep the first XGBoost model (`gb`), trained with the default parameters.

In [148]:
import joblib

# Model 1
joblib.dump(gb, 'santander_xgboost.z')

['santander_xgboost.z']

## Predictions

In [149]:
model = joblib.load('santander_xgboost.z')

In [150]:
y_pred = model.predict(X_test)

In [151]:
pred_df = pd.Series(y_pred).value_counts().to_frame().rename_columns(
    {0: 'PREDICTIONS'})

pred_df.index = ['Satisfied Customers', 'Dissatisfied Customers']

In [152]:
accuracy_score(y_test, y_pred)

0.9593528018942383

In [153]:
pred_df

| | PREDICTIONS |
|---|---|
| Satisfied Customers | 15188 |
| Dissatisfied Customers | 16 |
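
Only 16 customers are flagged as dissatisfied because `predict` uses a 0.5 probability cutoff. If the goal is to catch dissatisfied customers early, one option is to threshold `predict_proba` at a lower value. The sketch below uses a `LogisticRegression` on synthetic data as a stand-in for the saved XGBoost model, and the 0.1 cutoff is an arbitrary assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic imbalanced data, about 4% positives (stand-in for the real set)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.96, 0.04], random_state=25)

# stand-in classifier; any model exposing predict_proba works the same way
clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
proba = clf_demo.predict_proba(X_demo)[:, 1]

flagged_default = int((proba >= 0.5).sum())  # what predict() would flag
flagged_lowered = int((proba >= 0.1).sum())  # more customers flagged for follow-up

print(flagged_default, flagged_lowered)
```

Lowering the cutoff trades more false alarms for fewer missed dissatisfied customers, so the right value depends on the cost of each kind of error.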


## References

jovian lesson module:
https://jovian.ai/learn/machine-learning-with-python-zero-to-gbms/lesson/gradient-boosting-with-xgboost

guide on xgboost hyperparameters:
https://www.kaggle.com/prashant111/a-guide-on-xgboost-hyperparameters-tuning

sklearn gridsearchcv:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

feature selection:
https://michael-fuchs-python.netlify.app/2019/08/09/dealing-with-constant-and-duplicate-features/
