# Data exploration



In [None]:
library(tidyverse)
library(caret)
library(xgboost)
library(Ckmeans.1d.dp)
library(DiagrammeR)
library(precrec)
library(SHAPforxgboost)
options(warn = -1)

## Load data

In [None]:
data <- read.csv("data/churn.csv")
head(data)

## Check dimensions

**Tidyverse**

*R packages for data science*
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Visit the [Learn section](https://www.tidyverse.org/learn/) of the webpage to find great resources on Tidyverse.

In [None]:
# Transformations:
# 1) Remove unnecessary variables (identifiers carry no predictive signal)
# 2) Convert categorical variables to numeric (done in the next cell)

data <- data %>%
    select(-RowNumber, -CustomerId, -Surname)
head(data)

In [None]:
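# One-hot encode the categorical variables with caret's dummyVars
dummy_obj <- dummyVars(~ Geography + Gender, data)
dummy_df <- predict(dummy_obj, newdata = data)
# Replace the original categorical columns with their indicator columns
data <- cbind(select(data, -Geography, -Gender), dummy_df)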
head(data)
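
For intuition, one-hot encoding can be sketched in base R on a toy factor. The `toy` data frame below is made up for illustration and is not part of the churn data; `dummyVars` does the equivalent work for us, with its own column naming.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
    Geography = factor(c("France", "Spain", "Germany", "France"))
)

# One indicator column per level: 1 if the row has that level, 0 otherwise
one_hot <- sapply(levels(toy$Geography), function(lv) {
    as.integer(toy$Geography == lv)
})

one_hot
# Every row has exactly one indicator switched on
rowSums(one_hot)
```

Each category becomes its own 0/1 column, which is the numeric form that `xgboost` needs.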

**ggplot2**

It is always important to explore data before working on models. Let's have some fun with ggplot2. 

In [None]:
options(repr.plot.width=12, repr.plot.height=4)

# Histogram and density of each predictor, split by churn status
for (var in setdiff(names(data), "Exited")) {
    p <- ggplot(data, aes(x = .data[[var]], fill = factor(Exited)))
    p <- p + geom_histogram(aes(y = after_stat(density)), alpha = 0.2)
    p <- p + geom_density(aes(colour = factor(Exited)), alpha = 0.35)
    p <- p + scale_x_continuous(name = var)
    p <- p + theme_minimal()
    print(p)
}

In [None]:
# Boxplots of each predictor by churn status, to spot outliers
for (var in setdiff(names(data), "Exited")) {
    p <- ggplot(data)
    p <- p + geom_boxplot(aes(x = factor(Exited), y = .data[[var]], fill = factor(Exited)), alpha = 0.2)
    p <- p + scale_y_continuous(name = var)
    p <- p + theme_minimal()
    print(p)
}

## Data partition
Let's build a training dataset and a test dataset. We can select observations randomly while preserving the balance of the classes 0/1.

**caret** 

The `caret` package has a bunch of amazing functions for machine learning tasks. `createDataPartition` is one of them.

In [None]:
# Data partition
set.seed(42)

# Pass the outcome as a factor so that sampling is done within each class
train_index <- createDataPartition(factor(data$Exited), p = .7, list = FALSE, times = 1)

# First partition: 70% train - 30% test 
train <- data[train_index, ]  
test <- data[-train_index, ]

round(table(train$Exited)/nrow(train)*100, 2)
round(table(test$Exited)/nrow(test)*100, 2)
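
The idea behind a class-preserving split can be sketched in base R: sample the same fraction of rows within each class. This is only an illustration on made-up labels (`y` below is hypothetical), not how `createDataPartition` is implemented.

```r
set.seed(42)

# Toy labels (hypothetical): 80 negatives and 20 positives
y <- c(rep(0, 80), rep(1, 20))

# Sample 70% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
    idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))
train_idx <- sort(unname(train_idx))

# Class proportions are the same in both partitions
prop.table(table(y[train_idx]))
prop.table(table(y[-train_idx]))
```

Because the sampling happens class by class, the share of positives in train and test matches the share in the full data.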

In [None]:
head(train)

## Model training

**xgboost** 

We are using the `xgboost` package to build a model that predicts whether a customer is going to leave the company, based on some features. This is a binary classification problem, but `XGBoost` can also be used on regression problems (see the [package documentation](https://xgboost.readthedocs.io/en/latest/R-package/index.html)).

In [None]:
class(train)

In [None]:
# Predictive variables in training dataset
X_train <- train %>%
    select(-Exited) %>%
    data.matrix()
# Labels in training dataset
y_train <- train$Exited

X_test <- test %>%
    select(-Exited) %>%
    data.matrix()
y_test <- test$Exited

In [None]:
set.seed(42)
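xgb <- xgboost(
    data = X_train,
    label = y_train,
    eta = 0.2,                      # learning rate
    max_depth = 3,                  # maximum depth of each tree
    nrounds = 10,                   # number of boosting rounds
    subsample = 0.5,                # fraction of rows sampled per tree
    colsample_bytree = 0.5,         # fraction of columns sampled per tree
    seed = 1,
    eval_metric = "auc",
    objective = "binary:logistic",
    nthread = 3,
    scale_pos_weight = 4            # up-weight the positive (churn) class
)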

In [None]:
# The evaluation log is accessible from the fitted model
xgb$evaluation_log

In [None]:
pred <- predict(xgb, X_test)
head(pred)

In [None]:
head(pred > 0.5)

In [None]:
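# Confusion matrix at a 0.5 threshold; fixed factor levels keep the table 2x2
pred_class <- factor(as.integer(pred > 0.5), levels = c(0, 1))
confusionMatrix(data = pred_class,
                reference = factor(y_test, levels = c(0, 1)),
                positive = "1")    # from caret package again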

In [None]:
options(repr.plot.width=8, repr.plot.height=6.5)

precrec_obj <- evalmod(scores = pred, labels = y_test)
autoplot(precrec_obj)

## Feature importance

In [None]:
feature_importance <- xgb.importance(feature_names = xgb$feature_names, model = xgb)
feature_importance

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)
xgb.ggplot.importance(importance_matrix = feature_importance, rel_to_first = TRUE)

In [None]:
xgb.plot.tree(feature_names = xgb$feature_names, model = xgb, trees = 0)

## Shapley values

Shapley values calculate the importance of a feature by comparing what a model predicts with and without the feature. However, since the order in which a model sees features can affect its predictions, this is done in every possible order, so that the features are fairly compared. 
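
To make the idea concrete, here is a minimal brute-force sketch in base R. The `value()` function and its numbers are invented for illustration (they do not come from the churn model), and real SHAP computations for tree models use far more efficient algorithms than enumerating every ordering.

```r
# Toy "model": the payoff of knowing a given set of features (hypothetical numbers)
features <- c("a", "b", "c")
value <- function(coalition) {
    v <- 0
    if ("a" %in% coalition) v <- v + 10
    if ("b" %in% coalition) v <- v + 20
    if (all(c("a", "b") %in% coalition)) v <- v + 6   # interaction between a and b
    v                                                 # "c" contributes nothing
}

# All orderings of a vector, built recursively
permutations <- function(x) {
    if (length(x) <= 1) return(list(x))
    out <- list()
    for (i in seq_along(x)) {
        for (p in permutations(x[-i])) out[[length(out) + 1]] <- c(x[i], p)
    }
    out
}

# Average each feature's marginal contribution over every ordering
shapley <- setNames(numeric(length(features)), features)
perms <- permutations(features)
for (p in perms) {
    for (k in seq_along(p)) {
        before <- p[seq_len(k - 1)]
        shapley[p[k]] <- shapley[p[k]] + value(c(before, p[k])) - value(before)
    }
}
shapley <- shapley / length(perms)
shapley
```

The interaction payoff is shared equally between `a` and `b`, the irrelevant feature `c` gets zero, and the values add up to the payoff of the full feature set.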

In [None]:
# To prepare the long-format data:
shap_long <- shap.prep(xgb_model = xgb, X_train = X_train)

In [None]:
options(repr.plot.width = 12, repr.plot.height = 10)
shap.plot.summary(shap_long)

Check [here](https://liuyanguu.github.io/post/2019/07/18/visualization-of-shap-for-xgboost/) for other Shapley value plots. Amazing work!