# Machine Learning for Customer Churn Prediction

This project builds on my prior experience with machine learning methods, where I contributed to a group project predicting the success of Kickstarter campaigns. My responsibilities included conducting exploratory data analysis (EDA) and training an XGBoost model. For this solo project, I apply a similar process to address the business-critical problem of customer retention.

# Project Details

In this project, I tackle the challenge of predicting customer churn by building a Customer Churn Prediction model. Customer churn refers to the rate at which customers stop doing business with a company, often measured as the number of customers who leave or fail to renew their subscription. Understanding and predicting churn is crucial for businesses aiming to retain customers and improve long-term growth.

__INFORMATION ON DATA__   
The data was retrieved from [Kaggle](https://www.kaggle.com/datasets/willianoliveiragibin/customer-churn), along with the descriptions of the columns. Kaggle describes the dataset as follows: "The Customer Churn Classification dataset is a vital resource for businesses seeking to understand and predict customer churn, a critical metric that represents the rate at which customers stop doing business with a company over a given period. Understanding churn is essential for any customer-focused company, as retaining customers is generally more cost-effective than acquiring new ones. The dataset is designed to provide a detailed view of customer characteristics and behaviors that could potentially lead to churn, allowing companies to take preemptive action to improve customer retention."

__BUSINESS CASE__    
You work as a Data Scientist for a bank that offers subscription-based packages and experiences fluctuations in customer retention, with a mix of new, loyal, and departing customers. The bank has tasked you with developing a predictive model that estimates the likelihood of customer churn, that is, one that identifies which customers are at risk of canceling their subscriptions.

# Getting started

Setting up the working environment.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# import xgboost as xgb
# from sklearn.datasets import make_classification
# from sklearn.model_selection import train_test_split
# from sklearn.model_selection import RandomizedSearchCV
# from sklearn.preprocessing import StandardScaler
# from sklearn.linear_model import LogisticRegression, LinearRegression
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, r2_score, mean_squared_error, confusion_matrix, log_loss, classification_report, roc_curve
# from xgboost import XGBClassifier
# from xgboost import XGBRegressor
# from sklearn.preprocessing import OneHotEncoder

# Understanding the dataset

This section provides an overview of the dataset, including its structure, the number of rows and columns, and a preview of the data.

In [2]:
# read in csv file and display
df = pd.read_csv("data_churn/customer_churn.csv")
df.head(20)

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | EstimatedSalary | Exited |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 747 | 15787619 | Hsieh | 844 | France | Male | 18 | 2 | 160980.03 | 145936.28 | 0 |
| 1 | 1620 | 15770309 | McDonald | 656 | France | Male | 18 | 10 | 151762.74 | 127014.32 | 0 |
| 2 | 1679 | 15569178 | Kharlamov | 570 | France | Female | 18 | 4 | 82767.42 | 71811.9 | 0 |
| 3 | 2022 | 15795519 | Vasiliev | 716 | Germany | Female | 18 | 3 | 128743.8 | 197322.13 | 0 |
| 4 | 2137 | 15621893 | Bellucci | 727 | France | Male | 18 | 4 | 133550.67 | 46941.41 | 0 |
| 5 | 2142 | 15758372 | Wallace | 674 | France | Male | 18 | 7 | 0.0 | 55753.12 | 1 |
| 6 | 3331 | 15657439 | Chao | 738 | France | Male | 18 | 4 | 0.0 | 47799.15 | 0 |
| 7 | 3513 | 15657779 | Boylan | 806 | Spain | Male | 18 | 3 | 0.0 | 86994.54 | 0 |
| 8 | 3518 | 15757821 | Burgess | 771 | Spain | Male | 18 | 1 | 0.0 | 41542.95 | 0 |
| 9 | 3687 | 15665327 | Cattaneo | 706 | France | Male | 18 | 2 | 176139.5 | 129654.22 | 0 |


There are two categorical features: `Geography` and `Gender`. (`Surname` is also stored as text, but it is a name rather than a category.)
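
One quick way to see which columns hold text rather than numbers is `select_dtypes`. The sketch below is only an illustration: it uses a small hand-built frame (`toy`, a hypothetical name) with a few values copied from the preview above, not the full dataset. Note that it also lists `Surname`, which is free text rather than a category.

```python
import pandas as pd

# Hypothetical toy frame: a few columns and rows copied from the preview above.
toy = pd.DataFrame({
    "Surname": ["Hsieh", "McDonald", "Kharlamov"],
    "CreditScore": [844, 656, 570],
    "Geography": ["France", "France", "France"],
    "Gender": ["Male", "Male", "Female"],
    "Age": [18, 18, 18],
})

# Keep every column that is not numeric; these are the candidates for encoding.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Running the same call on the full `df` should give the three text columns reported by `df.info()`.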

In [3]:
# check the columns
df.columns

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'EstimatedSalary', 'Exited'],
      dtype='object')

Here are the descriptions of the columns:

| Column name | Description |
| --- | ----------- |
| CustomerId | A unique identifier for each customer |
| Surname | Contains the surname of the customer |
| CreditScore | A key financial indicator, the credit score reflects a customer's financial health |
| Geography | The geographical location of customers |
| Gender | Identifies the gender of the customer |
| Age | Contains the age of the customer |
| Tenure | Reflects how long a customer has been with the company (a bank in our case) |
| Balance | The account balance of customers |
| EstimatedSalary | A customer's estimated salary provides an indication of their financial well-being |
| Exited | This is the target column, which indicates whether the customer churned (1 for churned and 0 for not churned) |

In [4]:
# look at the shape of dataset
df.shape

(10000, 11)

There are 10,000 rows and 11 columns in our dataset.

In [5]:
# check data-types and for possible missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(2), int64(6), object(3)
memory usage: 859.5+ KB


__Important Notes on Data__     
- There are no missing or duplicated values.
- Only the `Geography` and `Gender` columns need to be converted into dummy variables so that they can be used in the analysis and in model training.

This will be done in the next steps.
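
The two checks behind the first note can be sketched as follows. This is a minimal illustration on a hand-built frame (`toy` is a hypothetical name, with values copied from the preview), not the notebook's own cell; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn data.
toy = pd.DataFrame({
    "CustomerId": [15787619, 15770309, 15569178],
    "CreditScore": [844, 656, 570],
    "Exited": [0, 0, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that are exact copies of an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

# True when no customer ID appears more than once.
ids_are_unique = bool(toy["CustomerId"].is_unique)

print(missing_per_column)
print(n_duplicate_rows, ids_are_unique)
```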

# Exploratory Data Analysis

In [6]:
# check for duplicate values in the CustomerId column
df["CustomerId"].duplicated().value_counts()

CustomerId
False    10000
Name: count, dtype: int64

In [7]:
# drop columns we don't need
df.drop(["RowNumber", "CustomerId", "Surname"], axis=1, inplace=True)

In [8]:
# check result
df.head()

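| | CreditScore | Geography | Gender | Age | Tenure | Balance | EstimatedSalary | Exited |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 844 | France | Male | 18 | 2 | 160980.03 | 145936.28 | 0 |
| 1 | 656 | France | Male | 18 | 10 | 151762.74 | 127014.32 | 0 |
| 2 | 570 | France | Female | 18 | 4 | 82767.42 | 71811.9 | 0 |
| 3 | 716 | Germany | Female | 18 | 3 | 128743.8 | 197322.13 | 0 |
| 4 | 727 | France | Male | 18 | 4 | 133550.67 | 46941.41 | 0 |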


__Feature engineering__  
As noted above, the categorical columns `Geography` and `Gender` must be converted into a numeric format that a model can use.   
pandas' `get_dummies()` function performs this transformation.

In [9]:
# convert categorical to dummies
df = pd.get_dummies(df, drop_first=True)

In [10]:
# check result
df.head()

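| | CreditScore | Age | Tenure | Balance | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 844 | 18 | 2 | 160980.03 | 145936.28 | 0 | False | False | True |
| 1 | 656 | 18 | 10 | 151762.74 | 127014.32 | 0 | False | False | True |
| 2 | 570 | 18 | 4 | 82767.42 | 71811.9 | 0 | False | False | False |
| 3 | 716 | 18 | 3 | 128743.8 | 197322.13 | 0 | True | False | False |
| 4 | 727 | 18 | 4 | 133550.67 | 46941.41 | 0 | False | False | True |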


When both `Geography_Germany` and `Geography_Spain` are `False`, the customer's location is `France`, the category that `drop_first=True` removed. The same applies to `Gender`: a `False` value in `Gender_Male` represents `Female`.
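
The effect of `drop_first=True` can be reproduced on a tiny example. The sketch below is an illustration on a hypothetical three-row frame (`toy`), not the project data. It passes `dtype=bool` explicitly, which is an assumption added here so that the result is boolean regardless of the pandas version, and then recovers the dropped categories from the remaining columns.

```python
import pandas as pd

# Hypothetical toy frame with one customer per country.
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain"],
    "Gender": ["Male", "Female", "Male"],
})

# drop_first=True drops the first category of each column in sorted order
# ("France" and "Female"); dtype=bool is set explicitly for consistency.
encoded = pd.get_dummies(toy, drop_first=True, dtype=bool)
print(encoded)

# The dropped categories are implied by the remaining columns.
is_france = ~(encoded["Geography_Germany"] | encoded["Geography_Spain"])
is_female = ~encoded["Gender_Male"]
print(is_france.tolist(), is_female.tolist())
```

Dropping one category per feature avoids a redundant column, since its value is fully determined by the others.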

In [11]:
# get an overview with the describe() function
df.describe()

| | CreditScore | Age | Tenure | Balance | EstimatedSalary | Exited |
| --- | --- | --- | --- | --- | --- | --- |
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 100090.239881 | 0.2037 |
| std | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 57510.492818 | 0.402769 |
| min | 350.0 | 18.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 584.0 | 32.0 | 3.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 652.0 | 37.0 | 5.0 | 97198.54 | 100193.915 | 0.0 |
| 75% | 718.0 | 44.0 | 7.0 | 127644.24 | 149388.2475 | 0.0 |
| max | 850.0 | 92.0 | 10.0 | 250898.09 | 199992.48 | 1.0 |
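
Because `Exited` is coded as 0 or 1, its mean in the summary above (0.2037) is the share of churned customers, i.e. 2,037 of the 10,000 rows. The sketch below shows the same relationship on a hypothetical five-value series (`exited` is an illustrative name, not the project data); on the real data the calls would be applied to `df["Exited"]`.

```python
import pandas as pd

# Hypothetical toy target: 2 churned customers out of 5.
exited = pd.Series([0, 0, 1, 0, 1], name="Exited")

# For a 0/1 column, the mean equals the fraction of ones (the churn rate).
churn_rate = exited.mean()

# Share of each class; useful for spotting class imbalance before modelling.
class_shares = exited.value_counts(normalize=True)

print(churn_rate)
print(class_shares)
```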
