# Churn Analysis

Customer churn, also known as customer attrition or customer turnover, is the loss of customers. It is an important and challenging problem for e-commerce and online businesses.

For any business in a designated period of time, customers can fall into 3 main categories:
   
1. **Newly Acquired Customers**  
2. **Existing Customers**   
3. **Churned Customers** 
  

Churned customers are those who have decided to end their relationship with a company. Because each of them translates into a direct loss of revenue, predicting which customers are likely to churn can help a company prevent this loss.

We can classify customer churn by grouping cases into different categories:  

1. **Contractual Churn**: This type of churn applies to businesses that provide contract-based services, such as cable companies. It happens when customers decide not to renew their expired contracts.

2. **Voluntary Churn**: This happens when a customer decides to cancel an existing service. It applies to companies that provide a service not based on a fixed-term contract, such as prepaid cellphones or streaming subscriptions.

3. **Non-contractual Churn**: This type of churn applies to businesses that depend on retail locations or online stores, for example. It can be associated with consumers abandoning a possible purchase without completing the transaction.

4. **Involuntary Churn**: This happens when a customer can no longer stay with the company, for example a credit card customer who cannot pay their bill.

Customers churn for many different reasons. Among them, we can find lack of usage of the product, poor service, or better pricing from similar services.

Nevertheless, one thing holds true for all industries:

> **It costs more to acquire new customers than it does to retain existing ones**
        

Because it costs so much to acquire customers, it is wise to work hard towards retaining them. A company can reduce customer churn by knowing its customers, and one of the best ways to achieve this is through the analysis of historical and new customer data.

One of the metrics used to keep track of customer churn is **Retention Rate**, an indication of the degree to which a product satisfies a strong market demand, known as product-market fit. If product-market fit is not satisfactory, a company will likely experience customers churning.
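
As a rough illustration, one common way to compute a retention rate for a period is to take the customers present at the end, subtract those acquired during the period, and divide by the customers present at the start. The function name and the customer counts below are hypothetical and are not taken from the dataset used in this project.

```python
def retention_rate(customers_start, customers_end, customers_acquired):
    """Percentage of the starting customers still present at the end of the period."""
    if customers_start <= 0:
        raise ValueError("customers_start must be positive")
    # Customers at the end who were already customers at the start.
    retained = customers_end - customers_acquired
    return retained / customers_start * 100


# Hypothetical numbers for one quarter (assumed for illustration only).
rate = retention_rate(customers_start=1000, customers_end=1050, customers_acquired=200)
churn = 100 - rate
print(f"Retention rate: {rate:.1f}%, churn rate: {churn:.1f}%")
```

Under this definition the churn rate for the same period is simply the complement of the retention rate.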

A powerful tool to analyse and improve Retention Rate is Churn Prediction, a technique that helps find out which customers are more likely to churn in a given period of time.

For the current churn analysis project, we'll try to achieve the following aims:  

1. Investigate how the features affect retention by using Logistic Regression.
2. Build a classification model that predicts customer behavior, so that all relevant customer data can be analyzed and focused customer retention programs can be developed.

---

### Import required libraries

As always, let's import the required libraries. We'll start by importing `pandas` to handle the data and `numpy` to perform some numerical calculations.

In [1]:
import numpy as np
import pandas as pd

Now, we'll import the libraries for making plots.

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

To preprocess the data and perform feature engineering, we need the `LabelEncoder` from the `scikit-learn` library.

In [3]:
from sklearn.preprocessing import LabelEncoder

And last, we import the tools for splitting the dataset, finding the best model and evaluating the final models.

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [6]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [7]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

---

## About the Data

The [dataset](https://www.kaggle.com/adammaus/predicting-churn-for-bank-customers) used for this project was obtained from Kaggle. It contains data about the customers of a bank and whether they closed their account.

Let's first load the data and understand its content.

### Read the data

In [8]:
df = pd.read_csv('Churn_Modelling.csv')

### Review the data

In [14]:
print(f"The dataset contains {df.shape[0]} observations, from which we have {df.shape[1]} attributes")

The dataset contains 10000 observations, from which we have 14 attributes


In [19]:
print(f"Among their attributes, we observe {df.columns.tolist()}")

Among their attributes, we observe ['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited']


Let's check if we have any missing values.

In [21]:
print("Column          Missing values")
print("------------------------------")
df.isnull().sum()

Column          Missing values
------------------------------


RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

And which type of data we have in each column.

In [24]:
print("Column               types")
print("--------------------------")
df.dtypes

Column               types
--------------------------


RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

In [13]:
df.nunique()

RowNumber          10000
CustomerId         10000
Surname             2932
CreditScore          460
Geography              3
Gender                 2
Age                   70
Tenure                11
Balance             6382
NumOfProducts          4
HasCrCard              2
IsActiveMember         2
EstimatedSalary     9999
Exited                 2
dtype: int64

### Clean the data

In [None]:
df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

In [None]:
new_names = {
    'CreditScore': 'credit_score',
    'Geography': 'country',
    'Gender': 'gender',
    'Age': 'age',
    'Tenure': 'tenure',
    'Balance': 'balance',
    'NumOfProducts': 'number_products',
    'HasCrCard': 'owns_credit_card',
    'IsActiveMember': 'is_active_member',
    'EstimatedSalary': 'estimated_salary',
    'Exited': 'exited'
}

In [None]:
df.rename(columns=new_names, inplace=True)

---

## Exploratory Data Analysis

In [None]:
amount_retained = df[df['exited'] == 0]['exited'].count() / df.shape[0] * 100
amount_lost = df[df['exited'] == 1]['exited'].count() / df.shape[0] * 100
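
The same percentages can also be obtained in one step with `value_counts(normalize=True)`. The sketch below uses a small made-up `exited` column rather than the real data, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Small synthetic stand-in for the `exited` column (not the real dataset).
toy = pd.DataFrame({'exited': [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]})

# normalize=True returns fractions; multiply by 100 to get percentages.
shares = toy['exited'].value_counts(normalize=True) * 100
retained_pct = shares.loc[0]
lost_pct = shares.loc[1]
print(f"Retained: {retained_pct:.1f}%, lost: {lost_pct:.1f}%")
```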

In [None]:
fig, ax = plt.subplots()
sns.countplot(x='exited', palette="Set3", data=df)
plt.xticks([0, 1], ['Retained', 'Lost'])
plt.xlabel('Condition', size=15, labelpad=12, color='grey')
plt.ylabel('Amount of customers', size=15, labelpad=12, color='grey')
plt.title("Proportion of customers lost and retained", size=15, pad=20)
plt.ylim(0, 9000)
plt.text(-0.15, 7000, f"{round(amount_retained, 2)}%", fontsize=12)
plt.text(0.85, 1000, f"{round(amount_lost, 2)}%", fontsize=12)
sns.despine()
plt.show()

In [None]:
categorical_labels = [['gender', 'country'], ['owns_credit_card', 'is_active_member']]
colors = [['Set1', 'Set2'], ['Set3', 'PuRd']]

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
for i in range(2):
    for j in range(2):
        feature = categorical_labels[i][j]
        color = colors[i][j]
        ax1 = sns.countplot(x=feature, hue='exited', palette=color, data=df, ax=ax[i][j])
        ax1.set_xlabel(feature, labelpad=10)
        ax1.set_ylim(0, 6000)
        ax1.legend(title='Exited', labels= ['No', 'Yes'])
        if i == 1:
            ax1.set_xticklabels(['No', 'Yes'])
sns.despine()

In [None]:
df.columns

In [None]:
numerical_labels = [['age', 'credit_score'],
                    ['tenure', 'balance'],
                    ['number_products', 'estimated_salary']]
num_colors = [['Set1', 'Set2'],
              ['Set3', 'PuRd'],
              ['Spectral', 'Wistia']]

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(12, 12))
for i in range(3):
    for j in range(2):
        feature = numerical_labels[i][j]
        color = num_colors[i][j]
        ax1 = sns.boxplot(x='exited', y=feature, palette=color, data=df, ax=ax[i][j])
        ax1.set_xlabel('Exited', labelpad=10)
        ax1.set_xticklabels(['No', 'Yes'])
sns.despine()

In [None]:
sns.pairplot(df, vars=['age', 'credit_score', 'balance', 'estimated_salary'], 
             hue="exited", palette='husl')
sns.despine()

---

## Feature Engineering

### New Variable Creation

In [None]:
df['creditscore_age_ratio'] = df['credit_score'] / df['age']

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
sns.boxplot(y='creditscore_age_ratio', x='exited', palette='summer', data=df)
ax.set_xticklabels(['No', 'Yes'])
sns.despine()

In [None]:
df['balance_salary_ratio'] = df['balance'] / df['estimated_salary']

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
sns.boxplot(y='balance_salary_ratio', x='exited', palette='winter', data=df)
ax.set_xticklabels(['No', 'Yes'])
ax.set_ylim(-1, 6)
sns.despine()

### Encoding Categorical Variables

In [None]:
x = df.drop('exited', axis=1)
y = df['exited']

In [None]:
for label in ['gender', 'country']:
    le = LabelEncoder()
    le.fit(x[label])
    # Assign the whole column so it takes the integer dtype of the encoded labels
    x[label] = le.transform(x[label])
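
`LabelEncoder` maps each country to an integer, which a linear model may read as an ordering between countries. One possible alternative, sketched here on a toy frame with made-up values, is one-hot encoding with `pd.get_dummies`. This is an optional variation and not the encoding used in the rest of this notebook.

```python
import pandas as pd

# Toy frame mimicking the two categorical columns (values are illustrative).
toy = pd.DataFrame({
    'country': ['France', 'Spain', 'Germany', 'France'],
    'gender': ['Female', 'Male', 'Male', 'Female'],
    'age': [42, 41, 39, 50],
})

# One indicator column per category; drop_first removes one redundant column
# per feature (here 'France' and 'Female' become the implicit baseline).
encoded = pd.get_dummies(toy, columns=['country', 'gender'], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```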

### Split the data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, 
                                                    shuffle=True, stratify=y)
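
The `stratify=y` argument keeps the share of churned customers the same in both splits. The sketch below checks this on synthetic labels with 20% positives. The data and the `random_state` value are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with 20% positives (a stand-in for `exited`).
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 3))
labels = np.array([1] * 200 + [0] * 800)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42
)

# With stratify, both splits keep the 20% positive rate.
print(f"train positives: {y_tr.mean():.3f}, test positives: {y_te.mean():.3f}")
```

Passing a fixed `random_state` also makes the split reproducible between runs.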

---

## Model fitting

In [None]:
def print_best_model(model):
    print(f"The best parameters are: {model.best_params_}")
    print(f"The best model score is: {model.best_score_}")    
    print(f"The best estimator is: {model.best_estimator_}")

In [None]:
def get_auc_scores(y_actual, method, method2):
    # AUC is computed from `method`; the ROC curve points are computed from `method2`
    auc_score = roc_auc_score(y_actual, method)
    fpr_df, tpr_df, _ = roc_curve(y_actual, method2)
    return (auc_score, fpr_df, tpr_df)
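
It is worth noting that `roc_auc_score` can give different values depending on whether it receives hard class predictions or continuous scores. The sketch below shows this on four hand-made observations. The score values and the 0.5 threshold are assumptions chosen for the example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.6, 0.35, 0.8])  # hypothetical predicted probabilities
hard = (scores >= 0.5).astype(int)        # thresholded class predictions

# AUC from continuous scores uses the full ranking of the observations.
auc_from_scores = roc_auc_score(y_true, scores)
# AUC from hard labels only sees a single operating point.
auc_from_labels = roc_auc_score(y_true, hard)

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(auc_from_scores, auc_from_labels)
```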

### 1. Parameter Searching

#### Logistic Regression

In [None]:
param_grid_log = {
    'C': [0.1, 1, 10, 50, 100, 200],
    'max_iter': [200, 300],
    'penalty': ['l2'],
    'tol':[0.00001, 0.0001],
}

In [None]:
log_first = LogisticRegression(solver='lbfgs')

In [None]:
log_grid = GridSearchCV(log_first, param_grid=param_grid_log, cv=10, verbose=1)

In [None]:
# Search on the training split only, so the test split stays unseen
log_grid.fit(X_train, y_train)

In [None]:
best_log_estimator = log_grid.best_estimator_

In [None]:
print_best_model(log_grid)
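
Features such as balance and tenure live on very different scales, which can slow down the convergence of the `lbfgs` solver. One option, sketched below on synthetic data from `make_classification`, is to put a `StandardScaler` and the model in a pipeline and run the grid search over the pipeline. The dataset, the grid values and the variable names are assumptions for the example and not the configuration used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each CV training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs', max_iter=200))

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
grid = GridSearchCV(pipe, param_grid={'logisticregression__C': [0.1, 1, 10]}, cv=5)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```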

#### Support Vector Machine

In [None]:
param_grid_svm = {
    'C': [0.5, 100, 150],
    'kernel': ['rbf'],
    'gamma': [0.1, 0.01, 0.001]
}

In [None]:
svm_first = SVC()

In [None]:
svm_grid = GridSearchCV(svm_first, param_grid=param_grid_svm, cv=3, verbose=3, n_jobs=-2)

In [None]:
# Search on the training split only, so the test split stays unseen
svm_grid.fit(X_train, y_train)

In [None]:
best_svm_estimator = svm_grid.best_estimator_

In [None]:
print_best_model(svm_grid)

In [None]:
param_grid_svm_poly = {
    'C': [0.5, 1, 10],
    'kernel': ['poly'],
    'degree': [2, 3],
    'gamma': [0.1, 0.01, 0.001]
}

In [None]:
svm_poly_first = SVC()

In [None]:
svm_grid_poly = GridSearchCV(svm_poly_first, param_grid=param_grid_svm_poly, cv=3, verbose=3, n_jobs=-2)

In [None]:
# Search on the training split only, so the test split stays unseen
svm_grid_poly.fit(X_train, y_train)

### 2. Fitting Best Models

#### Logistic Regression

In [None]:
best_log_estimator.fit(X_train, y_train)

#### Support Vector Machine

In [None]:
best_svm_estimator.fit(X_train, y_train)

### 3. Metrics of Best Models

#### Logistic Regression

In [None]:
log_predict_train = best_log_estimator.predict(X_train)

In [None]:
log_predict_test = best_log_estimator.predict(X_test)
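
Predictions like these can be summarised with `accuracy_score` and `classification_report`, both imported at the top of the notebook. The sketch below shows the calls on a synthetic, imbalanced dataset. The data, the model settings and the class names `Retained` and `Lost` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 20% positives, standing in for the churn data.
X_demo, y_demo = make_classification(n_samples=400, n_features=5,
                                     weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=1)

clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)

# Comparing train and test accuracy gives a first hint of overfitting.
train_acc = accuracy_score(y_tr, clf.predict(X_tr))
test_acc = accuracy_score(y_te, clf.predict(X_te))

# Per-class precision, recall and F1 on the held-out split.
report = classification_report(y_te, clf.predict(X_te),
                               target_names=['Retained', 'Lost'])
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print(report)
```

With imbalanced classes, the per-class recall in the report is usually more informative than accuracy alone.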

#### Support Vector Machine

In [None]:
svm_predict_train = best_svm_estimator.predict(X_train)

In [None]:
svm_predict_test = best_svm_estimator.predict(X_test)
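
An `SVC` created with default settings does not expose `predict_proba`, so a ROC curve for it needs another continuous score. A minimal sketch, assuming synthetic data and arbitrary hyperparameters, uses `decision_function` for that purpose.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the first 200 rows train, the last 100 evaluate.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=2)
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma=0.1))
svm.fit(X_demo[:200], y_demo[:200])

# Signed distance to the decision boundary; larger values lean towards class 1.
margins = svm.decision_function(X_demo[200:])
auc = roc_auc_score(y_demo[200:], margins)
print(f"AUC from decision_function scores: {auc:.3f}")
```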

In [None]:
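# ROC/AUC on the held-out test split; SVC has no predict_proba by default, so decision_function scores are used
log_scores = get_auc_scores(y_test, best_log_estimator.predict(X_test),
                            best_log_estimator.predict_proba(X_test)[:, 1])
svm_scores = get_auc_scores(y_test, best_svm_estimator.predict(X_test),
                            best_svm_estimator.decision_function(X_test))
auc_log_primal, fpr_log_primal, tpr_log_primal = log_scores
auc_SVM_RBF, fpr_SVM_RBF, tpr_SVM_RBF = svm_scores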

In [None]:
plt.figure(figsize = (12,6), linewidth= 1)
plt.plot(fpr_log_primal, tpr_log_primal, label = 'log primal Score: ' + str(round(auc_log_primal, 5)))
plt.plot(fpr_SVM_RBF, tpr_SVM_RBF, label = 'SVM RBF Score: ' + str(round(auc_SVM_RBF, 5)))
plt.plot([0,1], [0,1], 'k--', label = 'Random: 0.5')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()