# Churn Rate

## Project Overview

This project was created as graded coursework for the Reproducible Research class at the University of Warsaw.  

Our group:
1. Orkhan Amrullayev
2. Srinesh Heshan
3. Laura Florencia

The project aims to predict the behaviour of telecom customers. The dataset comes from Kaggle (see the URLs under Resources). We analyze the customer data to see what churn implies for customer retention, taking one base paper as our starting point and improving on it. 

We predict whether a customer will churn by analyzing the following details:  
* Account information
* Demographic information
* Services information  

From these parameters we also hope to learn how to increase customer satisfaction and to improve on the previous research. What differentiates our work is the use of the TensorFlow library and some additional algorithms, to see whether they give different results.

## Weka Data Mining Tool

The base paper used Weka, a data mining tool for small-scale projects. 

Weka features include machine learning, data mining, preprocessing, classification, regression, clustering, association rules, attribute selection, experiments, workflow and visualization. Weka is written in Java, developed at the University of Waikato, New Zealand.

## Google Colab

Why are we using Colab?

Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary Python code through the browser, and is especially well suited to machine learning, data analysis and education.


With Colab Pro, notebooks can stay connected for up to 24 hours, compared to 12 hours in the free version of Colab. Subscribers also get priority access to high-memory VMs, which generally have double the memory of standard Colab VMs and twice as many CPUs. Users might even be given a high-memory VM automatically when Colab detects the need. This is another feature that is absent in the free version. To offer faster GPUs, longer runtimes and more memory in Colab for a relatively low price, Google needs to maintain the flexibility to adjust usage limits and the availability of hardware on the fly. 

Resources in Colab Pro are prioritised for subscribers who have recently used fewer resources, in order to prevent the monopolisation of limited resources by a small number of users. To get the most out of Colab Pro, consider closing your Colab tabs when you are done with your work, and avoid opting for GPUs or extra memory when it is not needed for your work. 
This will make it less likely for a user to run into usage limits within Colab Pro.

# Dataset Detail  

The dataset contains 7043 rows and 34 columns. Each row is one customer's data and each column represents a feature. We drop some columns and use only 19 independent variables and 1 dependent variable.

Target (dependent) variable: the "churn value" column. The dataset also has a "churn label" column, but we use "churn value" because it is an integer and therefore easier to work with.  

Independent variables: 19 columns which represent the characteristics of customers.  

# Load libraries


In [46]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import resample

%matplotlib inline
pd.options.display.max_columns = 500

import warnings
warnings.filterwarnings('ignore')

# Load data

Load the dataset that we are using as the source of the research.

In [47]:
df = pd.read_csv('Churn_dataset.csv')

## Exploratory Data Analysis (EDA)

We start with a quick look at the dataset.

In [53]:
df.head()

In [54]:
df.info()

At first sight there are no null values. However, TotalCharges has the wrong type (object). We fix this by converting it to float (float64) and filling the missing values generated during the conversion with 0.

In [55]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(value=0)

df['tenure'] = df['tenure'].astype('float64')
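
As a small illustration of why the conversion generates missing values, the sketch below runs the same two steps on a made-up three-value column. The toy values are assumptions for demonstration only; the blank string stands in for whatever unparseable entries the real column contains.

```python
import pandas as pd

# Hypothetical toy column: the blank string mimics an entry that cannot be parsed as a number.
raw = pd.Series(["29.85", " ", "1889.5"])

as_float = pd.to_numeric(raw, errors="coerce")  # unparseable entries become NaN
n_missing = int(as_float.isna().sum())          # how many values failed to parse
cleaned = as_float.fillna(0)                    # same fill strategy as in the cell above

print(n_missing)
print(cleaned.tolist())
```

Counting the NaN values before filling them is a cheap way to check how many rows the fill actually affects.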

Drop customerID, as it is not a relevant field for the analysis.

In [56]:
df.drop('customerID', axis=1, inplace=True)

Split the columns into categorical and numerical ones.

In [57]:
col_cat = df.select_dtypes(include='object').drop('Churn', axis=1).columns.tolist()
col_num = df.select_dtypes(exclude='object').columns.tolist()

# Data Analysis in Categorical Fields

For our categorical fields, we check how many unique values each column has, so we can decide whether feature engineering (merging values if there are too many of them) is needed.
Each column has 2 to 4 unique values, which is ideal.

In [58]:
for c in col_cat:
    print('Column {} unique values: {}'.format(c, len(df[c].unique())))

Let's look at the distribution of Churn across all categorical variables. Gender shows no visible relationship with Churn, whereas Contract is strongly related to it: customers on month-to-month contracts are more likely to churn than customers on one-year or two-year contracts. This suggests the company could promote one-year and two-year contracts.

In [59]:
plt.figure(figsize=(20,20))
for i,c in enumerate(col_cat):
    plt.subplot(5,4,i+1)
    sns.countplot(data=df, x=c, hue='Churn')  # keyword arguments work across seaborn versions
    plt.title(c)
    plt.xlabel('')

Here is the distribution of our numerical features.
Tenure appears to be related to Churn.

In [60]:
plt.figure(figsize=(20,5))
for i,c in enumerate(['tenure', 'MonthlyCharges', 'TotalCharges']):
    plt.subplot(1,3,i+1)
    # sns.distplot is deprecated; sns.kdeplot draws the same density curves
    sns.kdeplot(x=df.loc[df['Churn'] == 'No', c], color='blue', linewidth=2, label='No')
    sns.kdeplot(x=df.loc[df['Churn'] == 'Yes', c], color='orange', linewidth=2, label='Yes')
    plt.title(c)
    plt.legend()

### Violin plot

A violin plot is a hybrid of a box plot and a kernel density plot, *which shows peaks in the data*. We use it to visualize the distribution of the numerical variables we prepared earlier. Unlike a box plot, which only shows summary statistics, a violin plot depicts both the summary statistics and the density of each variable.  

As an illustration from a different dataset (see the violin plot reference under Resources), consider a violin plot of chick weight by feed type. The box plot elements show that the median weight for horsebean-fed chicks is lower than for other feed types. The shape of the distribution (extremely skinny on each end and wide in the middle) indicates that the weights of sunflower-fed chicks are highly concentrated around the median.  

In [61]:
plt.figure(figsize=(20,5))
for i,c in enumerate(col_num):
    plt.subplot(1,4,i+1)
    sns.violinplot(x=df['Churn'], y=df[c])
    plt.title(c)

# Data preprocessing

After EDA, let's prepare our data for the machine learning algorithms. We start with one-hot encoding of our categorical features.

In [62]:
df.head()

In [63]:
dfT = pd.get_dummies(df, columns=col_cat)
dfT.head()
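
To show what one-hot encoding does to a single column, here is a minimal sketch on a made-up three-row frame. The frame `toy` and its values are assumptions for illustration, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numerical column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "MonthlyCharges": [70.0, 55.5, 20.25],
})

# Each distinct Contract value becomes its own indicator column;
# the numerical column is passed through unchanged.
encoded = pd.get_dummies(toy, columns=["Contract"])

print(encoded.columns.tolist())
```

Every row has exactly one active indicator among the new `Contract_*` columns, which is why columns with only 2 to 4 unique values keep the encoded frame small.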

Now do simple label encoding of our target variable Churn.

In [64]:
dfT['Churn'] = dfT['Churn'].map(lambda x: 1 if x == 'Yes' else 0)

## Balanced or imbalanced?
Let's see whether our dataset is balanced and whether any action is needed. The data turn out to be highly imbalanced, so we use the resample function to upsample the minority group.

In [65]:
plt.figure(figsize=(5, 5))
sns.countplot(x=dfT['Churn'])
plt.title('Imbalanced dataset, it seems ratio is 2:5 for Yes:No')
plt.show()

Let's divide our data into two groups, majority (0) and minority (1), and create a new dataset by upsampling the minority group.

In [66]:
minority = dfT[dfT.Churn==1]
majority = dfT[dfT.Churn==0]

minority_upsample = resample(minority, replace=True, n_samples=majority.shape[0])
dfT = pd.concat([minority_upsample, majority], axis=0)
dfT = dfT.sample(frac=1).reset_index(drop=True)
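
One caveat with upsampling before the train/test split is that copies of the same minority row can end up in both partitions, which may make test scores look better than they are. A common alternative, sketched below on made-up data, is to split first and upsample only the training part. This is an illustrative variant under stated assumptions (the toy frame, the `feature` column and the fixed `random_state` values are hypothetical), not the procedure used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 80 non-churners (0) and 20 churners (1).
toy = pd.DataFrame({"feature": np.arange(100), "Churn": [0] * 80 + [1] * 20})

# Split first, stratified so both parts keep the original class ratio.
train, test = train_test_split(toy, test_size=0.3, stratify=toy["Churn"], random_state=42)

# Upsample the minority class inside the training part only.
train_min = train[train["Churn"] == 1]
train_maj = train[train["Churn"] == 0]
train_min_up = resample(train_min, replace=True, n_samples=len(train_maj), random_state=42)
train_balanced = pd.concat([train_maj, train_min_up]).sample(frac=1, random_state=42)

# Rows of the test part never appear in the balanced training set.
overlap = set(train_balanced["feature"]) & set(test["feature"])
print(train_balanced["Churn"].value_counts().to_dict(), len(overlap))
```

Fixing `random_state` also makes the resampling repeatable between runs.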

Let's quickly compare how the data looked before and after balancing.

In [67]:
plt.figure(figsize=(10, 5))
plt.subplot(1,2,1)
sns.countplot(x=df['Churn'])
plt.title('Imbalanced dataset')

plt.subplot(1,2,2)
sns.countplot(x=dfT['Churn'])
plt.title('Balanced dataset')
plt.show()

## Time to Scale!

Many machine learning algorithms are sensitive to features that are not on the same scale. We use the robust scaler, which handles outliers well, although the standard scaler might also work. RobustScaler removes the median and scales the data according to the quantile range (by default the IQR, the interquartile range).

In [68]:
# Scale the three numerical columns in one call; each column is still scaled independently
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
rs = RobustScaler()
dfT[num_cols] = rs.fit_transform(dfT[num_cols])
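
The description of the robust scaler can be checked by hand. The sketch below, on a made-up column with one outlier, compares the scaler's output with the formula (x - median) / IQR. The sample values are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical column with one large outlier (100).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0]).reshape(-1, 1)

scaled = RobustScaler().fit_transform(x)

# Manual version of the same transformation: subtract the median, divide by the IQR.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

Because the median and the quartiles barely move when the outlier changes, the bulk of the values stays on a comparable scale.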

## Data Split

Split our data into train and test partitions. The train partition is used to train the ML model and the test partition to validate its performance. *70% goes to train, 30% goes to test*. The split could also be 80:20 or 60:40, but we chose 70:30 for our research.

In [73]:
X_train, X_test, y_train, y_test = train_test_split(dfT.drop('Churn', axis=1).values, dfT['Churn'].values, test_size=0.3)

# Modeling

We will use Logistic Regression, Decision Tree, Random Forest, XGBoost and a neural network.


## Logistic Regression

Logistic regression is a process of modeling the probability of a discrete outcome given an input variable. The most common logistic regression models a binary outcome; something that can take two values such as true/false, yes/no, and so on.  The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system or product. 
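
As a minimal sketch of how a logistic model turns inputs into a probability, the code below applies the sigmoid function to a weighted sum of two features. The weights, bias and feature values are hypothetical numbers picked for illustration; a fitted `LogisticRegression` learns its own coefficients from the data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one hypothetical customer with two features.
weights = np.array([0.8, -1.2])
bias = 0.1
x = np.array([1.0, 0.5])

p = sigmoid(weights @ x + bias)  # probability of the positive class
label = int(p > 0.5)             # decision at the usual 0.5 threshold

print(round(float(p), 3), label)
```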

In [74]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


In [75]:
model_lg = LogisticRegression(max_iter=500,random_state=0, n_jobs=-1)

In [76]:
model_lg.fit(X_train, y_train)

In [77]:
# Making Predictions
pred_lg = model_lg.predict(X_test)

In [78]:
print(classification_report(y_test, pred_lg))
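
To make the numbers in a classification report easier to read, here is a small sketch of how precision and recall follow from a confusion matrix. The label arrays are made up for illustration and are not predictions from the models in this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = churn, 0 = no churn).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of real churners the model found

print(precision, recall)
```

For a churn problem, recall on the churn class is often the figure to watch, since a missed churner is a customer nobody tried to retain.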

## Decision Tree Classifier

Decision trees are a popular and time-tested method of applying logic to complex problems where the variables are many and the options are specific and dependent, and they play an important role in machine learning.

Decision trees are popular because they have two key properties: 
* Simplicity: Decision trees are simple, visually appealing and easy to interpret.
* Accuracy: Advanced decision tree models show exceptional performance in predicting patterns in complex data.  

In [79]:
from sklearn.tree import DecisionTreeClassifier

In [80]:
# Creating object of the model
model_dt = DecisionTreeClassifier(max_depth=4, random_state=42)

In [81]:
model_dt.fit(X_train, y_train)

In [82]:
pred_dt = model_dt.predict(X_test)

In [83]:
print(classification_report(y_test, pred_dt))

## Random Forest

Random forest consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.  

The low correlation between models is the key. Uncorrelated models can produce ensemble predictions that are more accurate than any of the individual predictions. The reason for this wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly all err in the same direction). While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction. So the prerequisites for random forest to perform well are:
* There needs to be some actual signal in our features so that models built using those features do better than random guessing.
* The predictions (and therefore the errors) made by the individual trees need to have low correlations with each other.
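
The effect of combining weakly correlated models can be illustrated with a small simulation. The sketch below does not train any trees: it assumes 101 hypothetical voters that are each right 60% of the time and fully independent of each other, which is an idealised version of the second prerequisite above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_voters, p_correct = 2000, 101, 0.6

# True means the voter classified the sample correctly; votes are independent.
correct = rng.random((n_samples, n_voters)) < p_correct

# Average accuracy of a single voter.
individual_acc = correct.mean(axis=0).mean()

# In a binary problem the majority vote is right when more than half the voters are right.
ensemble_acc = (correct.sum(axis=1) > n_voters / 2).mean()

print(individual_acc, ensemble_acc)
```

Real trees in a forest are trained on overlapping data, so their errors are partly correlated and the gain is smaller than in this idealised setting.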

In [84]:
from sklearn.ensemble import RandomForestClassifier

In [85]:
model_rf = RandomForestClassifier(n_estimators=400,min_samples_leaf=0.13, random_state=42)

In [86]:
model_rf.fit(X_train, y_train)

In [87]:
# Making Prediction
pred_rf = model_rf.predict(X_test)

In [88]:
print(classification_report(y_test,pred_rf))

## XGBoost

Next, let's try the popular XGBClassifier and check its performance. The main reasons to use XGBoost are flexibility, execution speed and model performance. XGBoost often dominates structured or tabular datasets on classification and regression predictive modeling problems. Its strength comes not only from the algorithm, but also from the underlying system optimization.  

In [89]:
xg = XGBClassifier()
xg.fit(X_train, y_train)
y_test_hat_xg = xg.predict(X_test)

In [124]:
print(classification_report(y_test, y_test_hat_xg))

We could also try some hyperparameter tuning with the models above.
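
One possible way to do such tuning is a cross-validated grid search. The sketch below is an assumption about how it could look, not something run in this project: it uses synthetic data from `make_classification` and a decision tree so that it stays small, and the parameter grid is a hypothetical example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook this would be X_train / y_train.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Hypothetical grid: every combination is evaluated with 3-fold cross-validation.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3, scoring="f1")
search.fit(X_tr, y_tr)

test_f1 = search.score(X_te, y_te)  # F1 of the best model on held-out data
print(search.best_params_, test_f1)
```

The same pattern should apply to the other scikit-learn compatible estimators used above, with a grid suited to each model.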

## Deep neural networks
A deep neural network (DNN), or deep net for short, is a neural network with a certain level of complexity, usually one with at least two layers. Deep nets process data in complex ways by employing sophisticated mathematical modeling.

In [125]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

We use sequential model with multiple dense & dropout layers.

In [126]:
model = Sequential()

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(256, activation='relu'))
model.add(Dropout(0.25))

model.add(Dense(128, activation='relu'))
model.add(Dropout(0.45))

model.add(Dense(1, activation='sigmoid'))

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.3, verbose=1,patience=10, min_lr=0.0000000001)
early_stopping_cb = EarlyStopping(patience=10, restore_best_weights=True)

model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(x=X_train, y=y_train, batch_size=128, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stopping_cb, reduce_lr])


The model is trained. Increasing dropout helps reduce overfitting.

In [127]:
y_test_hat_tf = model.predict(X_test)

The outputs of the prediction are probabilities, so let's convert them into 0/1 labels.

In [128]:
y_test_hat_tf2 = [1 if x > 0.5 else 0 for x in y_test_hat_tf ]
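
The same conversion can also be written as a vectorised NumPy expression. The sketch below uses a made-up probability array with shape (n, 1), which is the shape a single sigmoid output unit is assumed to produce here.

```python
import numpy as np

# Hypothetical model output: one probability per row, shape (4, 1).
probs = np.array([[0.91], [0.12], [0.50], [0.73]])

# Compare against the threshold, cast booleans to integers, flatten to 1-D.
labels = (probs > 0.5).astype(int).ravel()

print(labels.tolist())
```

Note that a probability of exactly 0.5 maps to class 0 with a strict `>` comparison, the same as in the list comprehension above.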

And finally, check out the classification report!

In [129]:
print(classification_report(y_test, y_test_hat_tf2))

XGBoost is slightly better than the neural network.

# Summary  

We tested five algorithms and got the best accuracy from XGBoost, compared to Logistic Regression, Decision Tree, Random Forest and the DNN.  

The results from the other models are also satisfactory (more than 75%). This suggests that our churn predictions are reasonably reliable. Based on the analysis, our recommendations are:

* Target young and middle-aged customers more, because they adapt easily to modern technology and generally have the budget to spend on the service.
* Offer an extra discount to returning customers and to those who choose a one-year or two-year contract, which makes them more likely to stay with the package and contract.
* An overall discount will make a big difference, because price is a major factor for customers.

# Recommendations and Limitations  

We should mention the following limitations of the model and the dataset.
The number of observations is sufficient, but with more feature columns, such as the customers’ geographic location (or other important data), we could gain more ideas and insight than we have now.  

The dataset is cross-sectional, which means it contains no time-series factors. Since the goal is to predict churn, the available contract options (monthly, one-year and two-year) are sufficient for our purposes. This is the best prediction we can provide to support future market decisions, and it is supported by the results of the algorithms we used. 

# Resources

1. Kaggle: https://www.kaggle.com/mytymohan/teleco-customer-churn-analysis
2. https://www.kaggle.com/chinmaybgaikwad/customer-churn-in-telcos
3. Dataset: https://www.kaggle.com/blastchar/telco-customer-churn
4. Original dataset from IBM: https://www.ibm.com/docs/en/cognos-analytics/11.1.0?topic=samples-telco-customer-churn
5. https://mode.com/blog/violin-plot-examples/
6. https://towardsdatascience.com/understanding-random-forest-58381e0602d2
7. https://towardsdatascience.com/xgboost-fine-tune-and-optimize-your-model-23d996fab663
