<a href="https://colab.research.google.com/github/vectice/vectice-examples/blob/master/Samples/Customer_satisfaction_challenge/customer_satisfaction_challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Santander customer satisfaction challenge

## Problem

Customer satisfaction is a key measure of success for any business. Unhappy customers rarely stay with the same provider, and they rarely voice their dissatisfaction before leaving. In this context, Santander Bank launched a challenge on Kaggle to build models that identify potentially unhappy customers.

---



## Objective

The objective of this competition is to identify unhappy customers early and anticipate their departure, which would allow the company to take proactive steps to improve a customer's happiness before it's too late. In this competition, you'll work with hundreds of anonymized features to predict whether a customer is satisfied or dissatisfied with their banking experience.

## Data

The data is an anonymized dataset containing a large number of numeric variables. The "TARGET" column is the variable to predict. It equals 1 for unsatisfied customers and 0 for satisfied customers. The task is to predict the probability that each customer in the test set is an unsatisfied customer.

- train.csv (371 columns): the training set, including the target
- test.csv (370 columns): the test set, without the target

## Install Vectice and GCS packages


Vectice provides a generic metadata layer that is potentially suitable for most data science workflows. In this notebook we use the scikit-learn library for modeling and track experiments directly through the Vectice Python SDK, to illustrate how to fine-tune exactly what you would like to track (metrics, for example). The same mechanisms apply to R, Java, or the more generic REST APIs, which can track metadata from any programming language and library.

Here is a link to the [Vectice Python library documentation](https://doc.vectice.com/).

In [None]:
# Install the GCS packages
!pip install -q fsspec
!pip install -q gcsfs

# Install the Vectice Python library.
# In this notebook we do code versioning with GitHub; GitLab and Bitbucket are also supported:
# !pip install -q "vectice[github, gitlab, bitbucket]"
!pip install -q "vectice[github]==2.2.3"

In [None]:
!pip3 show vectice

## Import the required packages


In [None]:
import os
import numpy as np # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, auc
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve
import lightgbm as lgb
from lightgbm import plot_importance
from imblearn.over_sampling import SMOTE
from collections import Counter

plt.style.use('seaborn')
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

## Retrieve the data from GCS

We load data stored in Google Cloud Storage (GCS), which Vectice provides for this notebook.

In [None]:
# Download the "JSON file" from the "Vectice tutorial Page" in the application so that 
# you can access the GCS bucket. The name of the JSON file should be "readerKey.json"

from google.colab import files
uploaded = files.upload()

In [None]:
# Double check the json file name below so that it matches the name of the file that you uploaded.
# Note that the key provided for this notebook does not have permissions for you to write to GCS. 
# You can only use it to read the data.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'readerKey.json'
## Get the dataset from GCS
train_df = pd.read_csv("gs://vectice-examples-samples/Customer_satisfaction_challenge/dataset.csv")
# Run head to make sure the data was loaded properly
print(train_df.head())

## Data exploration

In [None]:
print("Train Data Shape : ",train_df.shape)

In [None]:
train_df['TARGET'].value_counts()


In [None]:
train_df.info()

In [None]:
train_df.describe()

In [None]:
features = train_df.drop(['ID','TARGET'],axis=1)

## Exploratory data analysis (EDA)
* Target Percent
* Check Multicollinearity
* Check Outlier

In [None]:
pd.DataFrame(train_df['TARGET'].value_counts())

The training set is heavily imbalanced (73012 zeros vs 3008 ones), so some algorithms may learn mostly from the 0 class, which can distort our predictions. We address this with oversampling.
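
To make the imbalance concrete, the short sketch below uses only the two class counts quoted above. It shows why plain accuracy is a weak yardstick here: a model that always predicts 0 already scores about 96%.

```python
# Class counts reported above for the TARGET column.
n_satisfied = 73012      # TARGET == 0
n_unsatisfied = 3008     # TARGET == 1
total = n_satisfied + n_unsatisfied

minority_share = n_unsatisfied / total
# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = n_satisfied / total

print(f"Unsatisfied share: {minority_share:.2%}")              # 3.96%
print(f"Always-predict-0 accuracy: {baseline_accuracy:.2%}")   # 96.04%
```

This is why the modeling section also reports precision, recall, F1 and ROC AUC.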


In [None]:
f, ax = plt.subplots(1,2,figsize=(10,4))
train_df['TARGET'].value_counts().plot.pie(
    explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True
)
sns.countplot(x='TARGET', data=train_df, ax=ax[1])
plt.show()

In [None]:
null_value = train_df.isnull().sum().sort_values(ascending=False)
null_percent = round(train_df.isnull().sum().sort_values(ascending=False)/len(train_df)*100,2)
pd.concat([null_value, null_percent], axis=1, keys=['Null values', 'Percent'])

There is no column with null values.

**Correlation**

If two or more features are highly correlated, we have a multicollinearity problem: some features depend on others, so we should reduce the dimensionality of the data. If A depends on B, we can either combine the two features into a single variable or drop one of them. Principal component analysis (PCA) is one way to address this.

In [None]:
features[features.columns[:8]].corr()

In [None]:
sns.heatmap(features[features.columns[:8]].corr(),annot=True,cmap='YlGnBu')
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()

=> We Can Check Multicollinearity

Multicollinearity is a phenomenon in which one independent variable is highly correlated with one or more of the other independent variables.

In [None]:
plt.figure(figsize=(16,6))
plt.title("Distribution of mean values per row in the train set")
sns.distplot(train_df[features.columns].mean(axis=1),color="black", kde=True,bins=120, label='train')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(16,6))
plt.title("Distribution of std values per row in the train set")
sns.distplot(train_df[features.columns].std(axis=1),color="blue",kde=True,bins=120, label='train')
plt.legend(); plt.show()

In [None]:
t0 = train_df[train_df['TARGET'] == 0]
t1 = train_df[train_df['TARGET'] == 1]
plt.figure(figsize=(16,6))
plt.title("Distribution of skew values per row in the train set")
sns.distplot(t0[features.columns].skew(axis=1),color="red", kde=True,bins=120, label='target = 0')
sns.distplot(t1[features.columns].skew(axis=1),color="blue", kde=True,bins=120, label='target = 1')
plt.legend(); plt.show()

=> We Can Check Outliers

An outlier is a value or point that differs substantially from the rest of the data

In [None]:
train_df.describe()

In [None]:
plt.boxplot(train_df['var3'])

In [None]:
plt.boxplot(train_df['var38'])

The training set:
- Contains both continuous and categorical data. Categorical codes (for example IDs) should not be treated as numeric values, because a model would otherwise read 10000 as greater than 1 even though the codes carry no order
- Contains variables with zero variance or no predictive value
- Contains placeholder values (-999999) that were introduced to replace missing data
- Is heavily imbalanced

## Preprocessing

* Processing Outlier Values

In [None]:
# Replace the -999999 placeholder in var3 with the value 2
train_df['var3'] = train_df['var3'].replace(-999999, 2)
train_df.describe()

## Vectice Configuration

In [None]:
from vectice import Experiment
from vectice.api.json import ModelType
from vectice.api.json import JobType

# Specify the API endpoint for Vectice.
# You can specify your API endpoint here in the notebook, but we recommend adding it to a .env file
os.environ['VECTICE_API_ENDPOINT']= "app.vectice.com"

# To use the Vectice Python library, you first need to authenticate your account using an API key.
# You can generate an API key from the Vectice UI, by going to the "My API Keys" section under your profile picture
# You can specify your API token here in the notebook, but we recommend adding it to a .env file
os.environ['VECTICE_API_TOKEN'] = "Your API Token"

# Add your project id. The project id can be found in the project settings page in the Vectice UI
# ID is a placeholder: replace it with your own project id before running this cell
PROJECT_ID = ID

## Feature Engineering

In this part we will:
* Split the data into train and test sets
* Standardize the features with StandardScaler
* Oversample the minority class with SMOTE

In [None]:
train_df.drop('ID',axis=1,inplace=True)

In [None]:
x = train_df.drop('TARGET',axis=1)
y = train_df['TARGET']

### Resolving the problem of multicollinearity

Here we use the Pearson correlation coefficient, the most common correlation measure for numerical variables. It takes values between −1 and 1, where 0 means no linear correlation, 1 a total positive correlation, and −1 a total negative correlation. For example, a value of 0.7 between two variables indicates a strong positive relationship. A positive correlation means that when variable A goes up, B tends to go up as well; a negative correlation means that when A increases, B tends to decrease.
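
For reference, the standard definition of the Pearson coefficient for two variables $A$ and $B$ observed on $n$ rows is given below. The symbols $a_i$, $b_i$, $\bar{a}$ and $\bar{b}$ are notation introduced here for the row values and column means; they do not appear elsewhere in the notebook.

$$
r_{AB} = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n} (b_i - \bar{b})^2}}
$$

The numerator measures how the two variables vary together, and the denominator rescales it so that $r_{AB}$ always lies between −1 and 1.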

In [None]:
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

We use a threshold of 0.9: a feature is dropped when its absolute correlation with an earlier feature exceeds 0.9.
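
To see which column of a correlated pair gets flagged, here is a dependency-free sketch of the same pairwise scan on three toy columns. The column names and values are made up for illustration, and `pearson` and `correlated_columns` are hypothetical helpers that stand in for `DataFrame.corr()` and the `correlation` function above.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical toy columns: f2 is almost a multiple of f1, f3 is unrelated.
columns = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}

def correlated_columns(columns, threshold):
    """Same scan as correlation(): flag the later column of each correlated pair."""
    names = list(columns)
    flagged = set()
    for i in range(len(names)):
        for j in range(i):
            if abs(pearson(columns[names[i]], columns[names[j]])) > threshold:
                flagged.add(names[i])
    return flagged

to_drop = correlated_columns(columns, 0.9)
print(to_drop)  # {'f2'}
```

Because the inner loop only looks at earlier columns, the first column of each correlated pair is kept and the later one is dropped.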

In [None]:
corr_features = correlation(x, 0.9)
len(set(corr_features))

In [None]:
x = x.drop(corr_features,axis=1)

### Standardize data

In [None]:
scaler = StandardScaler().fit(x)
x_scaler = scaler.transform(x)
x_scaler_df = pd.DataFrame(x_scaler, columns=x.columns)
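
As a minimal illustration of what standardization does to one column, the sketch below computes z-scores by hand for a hypothetical list of values. It uses the population standard deviation, which is the convention `StandardScaler` follows.

```python
from statistics import mean, pstdev

# Hypothetical values for a single feature column.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = mean(values)
sigma = pstdev(values)  # population standard deviation (ddof=0)

# Each value becomes "how many standard deviations away from the mean".
z_scores = [(v - mu) / sigma for v in values]
print(mu, sigma)   # 5.0 2.0
print(z_scores)    # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After this transformation every column has mean 0 and unit variance, so features measured on very different scales contribute comparably to PCA and to the logistic regression.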

**Principal component analysis (PCA)** 

In [None]:
pca = PCA(n_components=0.95)
x_scaler_pca = pca.fit_transform(x_scaler)
x_scaler_pca_df = pd.DataFrame(x_scaler_pca)

In [None]:
x_scaler_pca_df.head()

In [None]:
pca.explained_variance_ratio_

In [None]:
plt.scatter(x_scaler_pca_df.loc[:, 0], x_scaler_pca_df.loc[:, 1], c=y,  cmap="copper_r")
plt.axis('off')
plt.colorbar()
plt.show()

=> We can't use PCA here because it does not reduce the dimensionality enough: the variance is spread across many components, and no small number of components captures a considerable part of it.
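
One way to check this is to count how many components are needed to reach the 95% target passed as `n_components=0.95`. The sketch below does that on a hypothetical list of ratios; in the notebook you would use `pca.explained_variance_ratio_` instead and compare the result with the number of input features.

```python
from itertools import accumulate

# Hypothetical explained-variance ratios, sorted in decreasing order
# as pca.explained_variance_ratio_ would be.
ratios = [0.40, 0.30, 0.15, 0.08, 0.04, 0.03]

cumulative = list(accumulate(ratios))
# Smallest number of components whose cumulative ratio reaches the 95% target.
n_needed = next(k for k, total in enumerate(cumulative, start=1) if total >= 0.95)
print(n_needed)  # 5
```

If the number of components needed is close to the number of original features, as in this toy case (5 out of 6), PCA brings little benefit.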

## Split the data and use oversampling

In [None]:
# We create our first experiment for data preparation and specify the workspace and the project we will be working on
# Each experiment only contains one job. Each invocation of the job is called a run.
# auto_code=True enables you to track your git changes for your code automatically every time you execute a run (see below).
experiment = Experiment(job="jobSplitData_Customer_Satisfaction", job_type = JobType.PREPARATION, project=PROJECT_ID, auto_code = True)

We can check if the datasets are already created in our workspace by calling **experiment.vectice.list_datasets()** which lists all the datasets existing in the project

In [None]:
experiment.vectice.list_datasets()

Create a dataset version based on the created/existing dataset that contains your data. For this notebook, we'll use some datasets that have already been created in Vectice to illustrate datasets auto-versioning.

The following code splits the dataset into train and test sets and uses the SMOTE method for oversampling in order to balance the training set.

In [None]:
# We use auto-versioning here.
# The Vectice library automatically detects if there have been changes to the dataset you are using.
# If it detects changes, it will generate a new version of your dataset automatically.
# For this notebook, we changed the data to illustrate dataset auto-versioning.
# So, the Vectice Python library will create a new dataset version when this code is executed for the first time.
input_ds_version = experiment.add_dataset_version(dataset="customer_satisfaction_dataset")

# Because we are using Colab in this tutorial example we are going to declare a reference to the code
## manually. This will be added as a reference to the run we are going to create next.
# If you are using your local environment with GIT installed or JupyterLab etc... the code
# tracking is automated.
uri = "https://github.com/vectice/vectice-examples"
entrypoint="Samples/Customer_satisfaction_challenge/customer_satisfaction_challenge.ipynb"
input_code = experiment.add_code_version_uri(git_uri=uri, entrypoint=entrypoint)

# The created dataset version and code version will be automatically attached as inputs of the run as they come before the experiment.start
experiment.start(run_properties={"Property1": "Value 1", "property2": "Value 2"})

#Split data
scaler_x_train, scaler_x_test, scaler_y_train, scaler_y_test = train_test_split(x_scaler, y, test_size=0.3)
#Use SMOTE to oversample the dataset
x_over, y_over = SMOTE().fit_resample(scaler_x_train,scaler_y_train)
print(sorted(Counter(y_over).items()))


# We commented out the code to persist the training and testing sets in GCS,
# because we already generated it for you, but feel free to uncomment it and execute it.
# The key (service account (readerKey.json)) existing in the tutorial page may not have writing permissions to GCS.
# Let us know if you want to be able to write files as well and we can issue you a different key.

## Get training and testing data in dataframes in order to upload them to GCS
#train_set = pd.DataFrame(x_over, columns=x.columns).join(pd.DataFrame(y_over, columns=["TARGET"]))
#test_set = pd.DataFrame(scaler_x_test, columns=x.columns).join(pd.DataFrame(scaler_y_test, columns=["TARGET"]))
#train_set.to_csv (r'gs://vectice-examples-samples/Customer_satisfaction_challenge/training_data.csv', index = False, header = True)
#test_set.to_csv (r'gs://vectice-examples-samples/Customer_satisfaction_challenge/testing_data.csv', index = False, header = True)

# We add new dataset versions 
train_ds_version = experiment.add_dataset_version(dataset="customer_satisfaction_training_dataset")
test_ds_version = experiment.add_dataset_version(dataset="customer_satisfaction_testing_dataset")

# We complete the current experiment's run 
## The added dataset versions will be automatically attached as outputs of the run
### as they come after the start run and before the experiment.complete
experiment.complete()

Our training data now contains the same number of zeros and ones.
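
For intuition about how SMOTE produces the extra minority rows, here is a simplified sketch of its core step: a synthetic point is placed at a random position on the segment between a minority sample and one of its minority-class neighbors. The points and the `interpolate` helper are hypothetical, and the neighbor is given directly rather than found by a nearest-neighbor search as the real algorithm does.

```python
import random

def interpolate(sample, neighbor, rng):
    """Create one synthetic point on the segment between sample and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
minority_sample = [1.0, 2.0]      # hypothetical minority-class point
minority_neighbor = [3.0, 6.0]    # hypothetical nearby minority-class point

synthetic = interpolate(minority_sample, minority_neighbor, rng)
print(synthetic)
```

Note that the cell above applies SMOTE to the training split only, so the test set keeps the original class distribution.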

## Modeling
* LogisticRegression
* LightGBM Classification

Here we create a function that calculates and shows the confusion matrix and the accuracy, precision, recall, f1_score, roc_auc metrics.

In [None]:
def get_clf_eval(y_test, pred = None, pred_proba = None):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    roc_auc = roc_auc_score(y_test, pred_proba)
    
    print('confusion')
    print(confusion)
    print('Accuracy : {}'.format(np.around(accuracy,4)))
    print('Precision: {}'.format(np.around(precision,4)))
    print('Recall : {}'.format(np.around(recall,4)))
    print('F1 : {}'.format(np.around(f1,4)))  
    print('ROC_AUC : {}'.format(np.around(roc_auc,4)))
    return confusion, accuracy, precision, recall, f1, roc_auc
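
As a worked example of how these metrics relate to the confusion matrix, the sketch below derives them from four hypothetical counts. For binary labels 0 and 1, scikit-learn's `confusion_matrix` lays the counts out as `[[tn, fp], [fn, tp]]`.

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tn, fp, fn, tp = 880, 70, 20, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(f1, 4))  # 0.91 0.3 0.6 0.4
```

In this toy case the accuracy looks high (0.91) while precision is only 0.3, which is typical of imbalanced problems like this one.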

In [None]:
# We create our second experiment for modeling and specify the workspace and the project we will be working on
# Each experiment only contains one job. Each invocation of the job is called a run.
# auto_code=True enables you to track your git changes for your code automatically every time you execute a run (see below).
experiment = Experiment(job="Modeling", project=PROJECT_ID, job_type=JobType.TRAINING, auto_code=True)

We can get the list of the models existing in our project by calling **experiment.vectice.list_models()**

In [None]:
experiment.vectice.list_models()

* **LogisticRegression**

In [None]:
## Logistic Regression
# we declare the dataset versions and code to use as inputs of our run
experiment.start(inputs=[train_ds_version,test_ds_version, input_code],
                run_properties={"Property1": "Value 1", "property2": "Value 2"})

lg_reg = LogisticRegression()

lg_reg.fit(x_over, y_over)
pred = lg_reg.predict(scaler_x_test)
pred_proba = lg_reg.predict_proba(scaler_x_test)[:,1]

confusion, accuracy, precision, recall, f1, roc_auc = get_clf_eval(scaler_y_test, pred=pred, pred_proba=pred_proba)
    
metrics = {'Accuracy score': round(accuracy, 3), "Precision": round(precision, 3),
            "Recall": round(recall, 3), 'f1 score': round(f1, 3), 'AUC score': round(roc_auc, 3)}
# We create a new model version 
model_version1 = experiment.add_model_version("Customer_Satisfaction_Classifier", algorithm="Logistic Regression", metrics=metrics)
# We complete the current experiment's run 
## The created model version will be automatically attached as outputs of the run
experiment.complete()

* **LightGBM Classifier**

In [None]:
scaler_x_test, scaler_x_val, scaler_y_test, scaler_y_val = train_test_split(scaler_x_test, scaler_y_test, test_size=0.5)

In [None]:
##Setting up the model's parameters
## Feel free to play with the parameters
train_data = lgb.Dataset(x_over, label=y_over)
val_data = lgb.Dataset(scaler_x_val, label=scaler_y_val)
n_estimators = 5000
num_leaves = 20
min_data_in_leaf = 80
learning_rate = 0.001
boosting = 'gbdt'
objective = 'binary'
metric = 'auc'
params = {
    'n_estimators': n_estimators,
    'num_leaves': num_leaves,
    'min_data_in_leaf': min_data_in_leaf,
    'learning_rate': learning_rate,
    'boosting': boosting,
    'objective': objective,
    'metric': metric,
}

In [None]:
## LightGBM Classifier
# we declare the dataset versions and code to use as inputs of our run
experiment.start(inputs=[train_ds_version,test_ds_version, input_code], run_properties={"Property1": "Value 1", "property2": "Value 2"})

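lgbm = lgb.train(params,
                  train_data,
                  valid_sets=[val_data],
                  valid_names=['valid'],
                  # Stop if the validation AUC has not improved for 300 rounds.
                  # The callback form works on LightGBM 3.x and 4.x; the
                  # early_stopping_rounds argument of lgb.train was removed in 4.0.
                  callbacks=[lgb.early_stopping(stopping_rounds=300)])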

# Predicting the output on the Test Dataset
# With the 'binary' objective, predict() returns one probability per row
ypred_lgbm = lgbm.predict(scaler_x_test)
# Convert the probabilities to class labels with a 0.5 threshold
y_pred_lgbm_class = (ypred_lgbm >= 0.5).astype(int)
accuracy_lgbm=accuracy_score(scaler_y_test,y_pred_lgbm_class)
print(accuracy_lgbm)
#Print Area Under Curve
plt.figure()
false_positive_rate, recall, thresholds = roc_curve(scaler_y_test, ypred_lgbm)
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic (ROC)')
plt.plot(false_positive_rate, recall, 'b', label = 'AUC = %0.3f' %roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1], [0,1], 'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out (1-Specificity)')
plt.savefig("ROC_curve.png")
plt.show()
print('AUC score:', roc_auc)

metrics = {"Accuracy score": round(accuracy_lgbm, 3), "AUC score": round(roc_auc, 3)} 
hyper_parameters =  {"n_estimators": n_estimators, "num_leaves": num_leaves,
              "min_data_in_leaf": min_data_in_leaf, "learning_rate": learning_rate, "boosting": boosting,
              "objective": objective, "metric": metric}
# We create a new model version 
model_version2 = experiment.add_model_version(model="Customer_Satisfaction_Classifier", algorithm="Light GBM", metrics=metrics,
                                             hyper_parameters=hyper_parameters, attachment=["ROC_curve.png"])
# We complete the current experiment's run 
## The created model version will be automatically attached as outputs of the run
experiment.complete()
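
To clarify what the AUC score measures, the sketch below computes it by hand on four hypothetical predictions: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The `pairwise_auc` helper is illustrative only; the notebook itself relies on `roc_curve` and `auc` from scikit-learn.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Class labels obtained with a 0.5 threshold on the probabilities.
predicted_classes = [int(s >= 0.5) for s in scores]

print(pairwise_auc(labels, scores))  # 0.75
print(predicted_classes)             # [0, 0, 0, 1]
```

Unlike accuracy, the AUC does not depend on the 0.5 threshold, which makes it a better fit for comparing models on imbalanced data such as this dataset.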