# Customer Churn Prediction

Customers have been leaving Beta Bank every month: not in large numbers, but noticeably. The bank's marketers have calculated that retaining current clients is cheaper than attracting new ones.

The task is to forecast whether a customer will leave the bank in the near future or not. You are provided with historical data on customer behavior and contract terminations with the bank.

Build a model with the highest possible F1-score. To complete the project successfully, you need an F1-score of at least 0.59. Verify the F1-score on the test set independently.

Additionally, measure the AUC-ROC and compare its value with the F1-score.

## Open and study the file

In [24]:
import pandas as pd
from IPython.display import display
import time

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

In [25]:
df = pd.read_csv('./datasets/churn.csv')
display(df.head())
df.info()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


The dataset contains 10,000 objects and 14 columns: 13 features and the target, `Exited`. Each object describes one customer's behavior and whether they terminated their contract with the bank. The following is known:

- *RowNumber* — index of the row in the data
- *CustomerId* — unique customer identifier
- *Surname* — surname
- *CreditScore* — credit score
- *Geography* — country of residence
- *Gender* — gender
- *Age* — age
- *Tenure* — number of years the person has been a customer of the bank
- *Balance* — account balance
- *NumOfProducts* — number of bank products used by the customer
- *HasCrCard* — presence of a credit card
- *IsActiveMember* — customer's activity
- *EstimatedSalary* — estimated salary
- *Exited* — fact of customer churn

About 9% of the values in the `Tenure` column are missing (9,091 of 10,000 are non-null), and the column names use CamelCase rather than snake_case. Both issues need to be addressed before we proceed.

## Data Preparation

### Header Style
Let's rename the columns:

In [26]:
df.columns = ['row_number', 'customer_id', 'surname', 'credit_score', 'geography', 'gender', 'age', 'tenure', 'balance', 'num_of_products', 'has_cr_card', 'is_active_member', 'estimated_salary', 'excited']

### Handling Missing Values

Most likely, the blanks in the `tenure` column indicate that this user has recently become a bank customer. Therefore, let's fill these missing values with zero:


In [27]:
df['tenure'] = df['tenure'].fillna(0)

### Useless Data

Our table contains columns that have no plausible connection with the outcome. Such data are not only useless for the model but can also be harmful. Therefore, we will drop the `row_number`, `customer_id`, and `surname` columns.

In [28]:
df = df.drop(['row_number', 'customer_id', 'surname'], axis=1)

### One-Hot Encoding (OHE)

Our data contains categorical features `geography` and `gender`. To avoid errors during model training, we will transform categorical features into numerical ones using the technique of One-Hot Encoding (OHE), and to avoid falling into the dummy variable trap, we'll use the `pd.get_dummies()` function with the `drop_first` argument.

In [29]:
df = pd.get_dummies(df, drop_first=True)
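To make the effect of `drop_first` concrete, here is a minimal sketch on a toy frame. The values are hypothetical; only the column names mirror the notebook.

```python
import pandas as pd

# Toy data (hypothetical values); column names mirror the notebook's features.
toy = pd.DataFrame({
    'geography': ['France', 'Spain', 'Germany', 'France'],
    'gender': ['Female', 'Male', 'Male', 'Female'],
    'age': [42, 41, 39, 43],
})

# Without drop_first: one indicator column per category.
full = pd.get_dummies(toy)

# With drop_first: the first category of each feature is dropped,
# since it is implied when all remaining indicators are 0.
reduced = pd.get_dummies(toy, drop_first=True)

full_columns = list(full.columns)
reduced_columns = list(reduced.columns)
print(full_columns)
print(reduced_columns)
```

With `drop_first=True`, `geography_France` and `gender_Female` disappear: a row with zeros in both remaining `geography_*` columns can only be France.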

Let's convert all column names to lowercase:

In [30]:
df.columns = df.columns.str.lower()

### Data Splitting

To perform the classification task, we first need to split the data into three sets: training, validation, and test. We'll split the original data in a 3:1:1 ratio (60% / 20% / 20%). First, we'll use the `train_test_split` function to separate the training set from the rest of the data. Then, using the same function, we'll split the remainder into validation and test sets.

In [31]:
df_train, df_test = train_test_split(df, test_size=0.4, random_state=12345)
df_test, df_valid = train_test_split(df_test, test_size=0.5, random_state=12345)
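As a quick sanity check of the 3:1:1 ratio, the same two-step split can be run on a stand-in frame of 10,000 rows. This is a sketch: `toy` is a hypothetical placeholder for `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df: 10,000 rows, one column.
toy = pd.DataFrame({'x': range(10000)})

# Step 1: hold out 40% of the rows.
train, rest = train_test_split(toy, test_size=0.4, random_state=12345)
# Step 2: split the held-out 40% in half.
test, valid = train_test_split(rest, test_size=0.5, random_state=12345)

sizes = (len(train), len(valid), len(test))
print(sizes)

# The three sets should not share any rows.
no_overlap = (
    set(train['x']).isdisjoint(valid['x'])
    and set(train['x']).isdisjoint(test['x'])
    and set(valid['x']).isdisjoint(test['x'])
)
print(no_overlap)
```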

For each dataset, let's designate the target feature (`target`) and other features (`features`).

In [32]:
def get_features_and_target(data):
    return data.drop('excited', axis=1), data['excited']

features_train, target_train = get_features_and_target(df_train)

features_valid, target_valid = get_features_and_target(df_valid)

features_test, target_test = get_features_and_target(df_test)

### Feature Scaling

To keep the algorithm from treating one feature as more important than another merely because its values are larger, we bring the numeric features to a common scale. Let's standardize them using `StandardScaler`, fitting it on the training set only and then applying it to all three sets.

In [33]:
numeric = ['credit_score', 'age', 'tenure', 'balance', 'num_of_products', 'estimated_salary']

scaler = StandardScaler()
scaler.fit(features_train[numeric])
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])

features_valid.head()

| | credit_score | age | tenure | balance | num_of_products | has_cr_card | is_active_member | estimated_salary | geography_germany | geography_spain | gender_male |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7041 | -2.226392 | -0.088482 | -0.825373 | -1.233163 | 0.830152 | 1 | 0 | 0.647083 | False | False | True |
| 5709 | -0.08712 | 0.006422 | 1.426375 | -1.233163 | -0.89156 | 1 | 0 | -1.65841 | False | False | False |
| 7117 | -0.917905 | -0.752805 | 0.139662 | 0.722307 | -0.89156 | 1 | 1 | -1.369334 | False | True | True |
| 7775 | -0.253277 | 0.101325 | 1.748053 | -1.233163 | 0.830152 | 1 | 0 | 0.075086 | False | True | True |
| 8735 | 0.785204 | -0.847708 | 1.748053 | 0.615625 | -0.89156 | 0 | 1 | -1.070919 | False | False | True |
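The following sketch shows what standardization does, using synthetic data (the numbers are hypothetical and not taken from the bank dataset). The scaler learns the mean and standard deviation from the training set only, so the training column ends up with mean 0 and standard deviation 1, while the validation column is only approximately standardized.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic "balance" values (hypothetical, for illustration only).
rng = np.random.default_rng(0)
train = pd.DataFrame({'balance': rng.normal(70000, 30000, 500)})
valid = pd.DataFrame({'balance': rng.normal(70000, 30000, 200)})

scaler = StandardScaler()
scaler.fit(train[['balance']])                       # statistics come from train only
train_scaled = scaler.transform(train[['balance']])
valid_scaled = scaler.transform(valid[['balance']])  # reuses the train statistics

train_mean, train_std = train_scaled.mean(), train_scaled.std()
valid_mean = valid_scaled.mean()
print(train_mean, train_std, valid_mean)
```

Fitting on the training set alone keeps information about the validation and test sets from leaking into preprocessing.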


### Class Balance

Let's check if our classes are balanced in the data.

In [34]:
df['excited'].value_counts(normalize=True)

excited
0    0.7963
1    0.2037
Name: proportion, dtype: float64

In our task, there is a significant class imbalance (roughly 4:1), which can negatively affect model training. To address this imbalance, we can use techniques such as class weighting, upsampling, and downsampling. However, we'll follow the sequence of tasks in the project, and for now, we'll train the models on imbalanced data.

## Prototype Solution Preparation

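Our target feature is categorical, which means we are dealing with a classification task, specifically binary classification, since there are only two categories: "customer churned" (`excited = 1`) and "customer stayed" (`excited = 0`). Note that the target column, originally `Exited`, was renamed to `excited` in the code above, and that name is used throughout.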

For solving this task, the following models will be suitable:
- Decision Tree
- Random Forest
- Logistic Regression

We will sequentially train these three models and then evaluate them.

First, we will train the decision tree model. To achieve the highest level of prediction quality, we will try different tree depths ranging from 1 to 29 during the training process.

Since there is class imbalance in the data, we will use the F1-score (the harmonic mean of precision and recall) as the metric for evaluating all models, instead of accuracy.
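To illustrate why accuracy can mislead on imbalanced data, here is a small worked example with hypothetical labels. It computes F1 by hand as 2 · precision · recall / (precision + recall) and compares the result with `f1_score`.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]   # TP = 2, FN = 2, FP = 1

precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)         # TP / (TP + FN) = 2 / 4

# Harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)
f1_library = f1_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)     # 7 of 10 predictions are correct

print(f'precision={precision:.4f} recall={recall:.4f}')
print(f'F1 (manual)={f1_manual:.4f} F1 (sklearn)={f1_library:.4f} accuracy={accuracy:.2f}')
```

Accuracy is 0.7 even though half of the positives are missed; F1 (4/7, about 0.57) reflects that more directly.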

In [35]:
random_state = 12345

def decision_tree(features_train, target_train, features_valid, target_valid, class_weight=None):
    best_model = None
    best_result = 0
    best_depth = 1

    for depth in range(1, 30, 1):
        model = DecisionTreeClassifier(random_state=random_state, max_depth=depth, class_weight=class_weight)
        model.fit(features_train, target_train)
        predicted_valid = model.predict(features_valid)
        result = f1_score(target_valid, predicted_valid)
        if result > best_result:
            best_model = model
            best_depth = depth
            best_result = result
    return best_result, best_depth, best_model

decision_tree_result, decision_tree_depth, _ = decision_tree(
    features_train,
    target_train,
    features_valid,
    target_valid
)

print(f'F1-score of the best decision tree model on the validation set: {decision_tree_result:.4}. ',
      f'Tree depth: {decision_tree_depth}', sep='\n')

F1-score of the best decision tree model on the validation set: 0.5378. 
Tree depth: 9


The model with the best F1-score (0.5378) on the validation set turned out to be the model with a decision tree depth of 9. This is not a very good result, but it's important to remember that we didn't account for class imbalance.

Let's try training a random forest model. To find the best model, in addition to the tree depth we will tune another hyperparameter, the number of trees (`n_estimators`), from 10 to 90 with a step of 10.

In [36]:
def random_forest(features_train, target_train, features_valid, target_valid, class_weight=None):
    best_model = None
    best_result = 0
    best_est = 10
    best_depth = 1

    for est in range(10, 100, 10):
        for depth in range(1, 30, 1):
            model = RandomForestClassifier(
                random_state=random_state,
                n_estimators=est,
                max_depth=depth,
                class_weight=class_weight
            )
            model.fit(features_train, target_train)
            predicted_valid = model.predict(features_valid)
            result = f1_score(target_valid, predicted_valid)
            if result > best_result:
                best_model = model
                best_result = result
                best_est = est
                best_depth = depth
    return best_result, best_est, best_depth, best_model

forest_result, forest_est, forest_depth, _ = random_forest(
    features_train,
    target_train,
    features_valid,
    target_valid
)

print(f'F1-score of the best random forest model on the validation set: {forest_result:.4}',
      f'Number of trees: {forest_est}',
      f'Tree depth: {forest_depth}', sep='\n')

F1-score of the best random forest model on the validation set: 0.5531
Number of trees: 50
Tree depth: 18


The model with the best F1-score (0.5531) on the validation set turned out to be the model with 50 trees and a depth of 18. As we can see, the F1-score of the random forest model is slightly higher than that of the decision tree model, but this value is still insufficient for an acceptable result. Additionally, a drawback of the random forest model is its execution speed: the more trees, the slower the model works.

Let's see what F1-score the logistic regression model will achieve on the imbalanced classes.

In [37]:
def logistic_regression(features_train, target_train, features_valid, target_valid, class_weight=None):
    model = LogisticRegression(random_state=random_state, solver='liblinear', class_weight=class_weight)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    result = f1_score(target_valid,predicted_valid)
    return result, model

logistic_regression_result, _ = logistic_regression(
    features_train,
    target_train,
    features_valid,
    target_valid
)

print(f'F1-score of the logistic regression model on the validation set: {logistic_regression_result:.4}')

F1-score of the logistic regression model on the validation set: 0.2743


Among the three models, the logistic regression model has the lowest F1-score - 0.2743.

However, let's move on to class balancing. Perhaps on balanced data, this model will show a better result.

## Solving Imbalance

As noted earlier, to address the imbalance we can use techniques such as class weighting, upsampling, and downsampling. Let's start with the first method: class weighting.

### Class Weighting

When we specify *class_weight='balanced'* in the parameters of our algorithms, each class is weighted inversely to its frequency in the training data. If class "0" occurs N times more often than class "1", the class weights relate to each other as follows:
- Class "0" weight = 1.0
- Class "1" weight = N
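To see the actual numbers, scikit-learn's `compute_class_weight` helper can be called directly. This is a minimal sketch on a hypothetical 4:1 target; the absolute weights are `n_samples / (n_classes * class_count)`, so they differ from 1.0 and N, but their ratio is N.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical target with a 4:1 imbalance: eight zeros, two ones.
y = np.array([0] * 8 + [1] * 2)

weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
# class 0: 10 / (2 * 8) = 0.625
# class 1: 10 / (2 * 2) = 2.5

ratio = weights[1] / weights[0]   # how much heavier a class "1" object counts
print(weights, ratio)
```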

In [38]:
logistic_regression_result, _ = logistic_regression(
    features_train,
    target_train,
    features_valid,
    target_valid,
    'balanced'
)
print(f'F1-score of the logistic regression model on the validation set: {logistic_regression_result:.4}')

F1-score of the logistic regression model on the validation set: 0.4797


In [39]:
forest_result, forest_est, forest_depth, _ = random_forest(
    features_train,
    target_train,
    features_valid,
    target_valid,
    'balanced'
)

print(f'F1-score of the best random forest model on the validation set: {forest_result:.4}',
      f'Number of trees: {forest_est}',
      f'Tree depth: {forest_depth}', sep='\n')

F1-score of the best random forest model on the validation set: 0.6197
Number of trees: 70
Tree depth: 9


In [40]:
decision_tree_result, decision_tree_depth, _ = decision_tree(
    features_train,
    target_train,
    features_valid,
    target_valid,
    'balanced'
)

print(f'F1-score of the best decision tree model on the validation set: {decision_tree_result:.4}. ',
      f'Tree depth: {decision_tree_depth}', sep='\n')

F1-score of the best decision tree model on the validation set: 0.5809. 
Tree depth: 5


**Conclusion**
Class weighting helped us achieve a decent result. Among the three algorithms, RandomForest stands out, with a model using 70 trees and a tree depth of 9 giving us an F1-score of 0.6197.

### Upsampling

We will repeat the rare class several times (in our case, 4 times).

In [41]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)

    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 4)
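The effect on the class shares can be checked on a toy target. This sketch uses a hypothetical 4:1 series and repeats the same concatenation step as `upsample`, leaving out the shuffle.

```python
import pandas as pd

# Hypothetical target with a 4:1 imbalance.
target = pd.Series([0] * 8 + [1] * 2)

zeros = target[target == 0]
ones = target[target == 1]

# Repeat the rare class 4 times, as in upsample(..., repeat=4).
upsampled = pd.concat([zeros] + [ones] * 4)

share_before = target.value_counts(normalize=True)[1]
share_after = upsampled.value_counts(normalize=True)[1]
print(share_before, share_after)
```

Only the training set is upsampled; the validation and test sets keep their original class proportions so that the metrics reflect real data.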

In [42]:
logistic_regression_result, _ = logistic_regression(
    features_upsampled,
    target_upsampled,
    features_valid,
    target_valid
)
print(f'F1-score of the logistic regression model on the validation set: {logistic_regression_result:.4}')

F1-score of the logistic regression model on the validation set: 0.4779


In [43]:
decision_tree_result, decision_tree_depth, _ = decision_tree(
    features_upsampled,
    target_upsampled,
    features_valid,
    target_valid
)

print(f'F1-score of the best decision tree model on the validation set: {decision_tree_result:.4}. ',
      f'Tree depth: {decision_tree_depth}', sep='\n')

F1-score of the best decision tree model on the validation set: 0.5809. 
Tree depth: 5


In [44]:
forest_result, forest_est, forest_depth, _ = random_forest(
    features_upsampled,
    target_upsampled,
    features_valid,
    target_valid
)

print(f'F1-score of the best random forest model on the validation set: {forest_result:.4}',
      f'Number of trees: {forest_est}',
      f'Tree depth: {forest_depth}', sep='\n')

F1-score of the best random forest model on the validation set: 0.6206
Number of trees: 30
Tree depth: 11


**Conclusion**
By duplicating instances of the minority class 4 times, we balanced the classes. This helped us achieve an F1-score of 0.6206 (Random Forest model with 30 trees and depth of 11).

### Downsampling

Instead of repeating the rare class (1), we'll remove a portion of class 0.

In [45]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=random_state)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=random_state)] + [target_ones])

    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=random_state)

    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.25)

In [46]:
logistic_regression_result, _ = logistic_regression(
    features_downsampled,
    target_downsampled,
    features_valid,
    target_valid
)
print(f'F1-score of the logistic regression model on the validation set: {logistic_regression_result:.4}')

F1-score of the logistic regression model on the validation set: 0.4863


In [47]:
decision_tree_result, decision_tree_depth, _ = decision_tree(
    features_downsampled,
    target_downsampled,
    features_valid,
    target_valid
)

print(f'F1-score of the best decision tree model on the validation set: {decision_tree_result:.4}. ',
      f'Tree depth: {decision_tree_depth}', sep='\n')

F1-score of the best decision tree model on the validation set: 0.6074. 
Tree depth: 5


In [48]:
forest_result, forest_est, forest_depth, _ = random_forest(
    features_downsampled,
    target_downsampled,
    features_valid,
    target_valid
)

print(f'F1-score of the best random forest model on the validation set: {forest_result:.4}',
      f'Number of trees: {forest_est}',
      f'Tree depth: {forest_depth}', sep='\n')

F1-score of the best random forest model on the validation set: 0.5906
Number of trees: 10
Tree depth: 5


**Conclusion**
When downsampling the dataset, two models achieved an F1-score above 0.59:
- Decision tree model with tree depth 5 - 0.6074
- Random forest model with 10 trees and depth 5 - 0.5906.

## Model Testing

Let's select the four models from the previous task that achieved an F1-score above 0.59 on the validation set and compare their performance on the test set.

Additionally, we'll measure the AUC-ROC value on the test set and compare it with the F1-score.

After completing this task, we'll draw conclusions and choose the most suitable model for our problem.
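Before comparing the two metrics, it helps to recall that they are computed from different inputs: F1 uses hard class predictions, while AUC-ROC uses the predicted probabilities of class "1" across all thresholds. A minimal sketch with hypothetical probabilities:

```python
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical true labels and predicted probabilities of class "1".
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.10, 0.20, 0.30, 0.45, 0.40, 0.60]

# Hard predictions at a 0.5 threshold: only the last object is labeled "1".
labels = [int(p >= 0.5) for p in proba]

f1 = f1_score(y_true, labels)        # precision 1.0, recall 0.5
auc = roc_auc_score(y_true, proba)   # 7 of 8 (positive, negative) pairs are ranked correctly
print(f'F1={f1:.4f} AUC-ROC={auc:.4f}')
```

Here the model ranks objects well (AUC-ROC 0.875) but the fixed threshold misses one positive (F1 about 0.67), which is why the two values are not expected to match.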

In [49]:
models = [
    {
        'name': 'Random Forest: class weighting',
        'model': RandomForestClassifier(
            random_state=random_state, n_estimators=70, max_depth=9, class_weight='balanced'),
        'features': features_train,
        'target': target_train,
        'f1_score_on_valid': 0.6197
    },
    {
        'name': 'Random Forest: upsampling',
        'model': RandomForestClassifier(random_state=random_state, n_estimators=30, max_depth=11),
        'features': features_upsampled,
        'target': target_upsampled,
        'f1_score_on_valid': 0.6206
    },
    {
        'name': 'Decision Tree: downsampling',
        'model': DecisionTreeClassifier(random_state=random_state, max_depth=5),
        'features': features_downsampled,
        'target': target_downsampled,
        'f1_score_on_valid': 0.6074
    },
    {
        'name': 'Random Forest: downsampling',
        'model': RandomForestClassifier(random_state=random_state, n_estimators=10, max_depth=5),
        'features': features_downsampled,
        'target': target_downsampled,
        'f1_score_on_valid': 0.5906
    },
]

for model_obj in models:
    model = model_obj['model']
    model.fit(model_obj['features'], model_obj['target'])

    #speed
    start = time.time()
    predicted_test = model.predict(features_test)
    end = time.time()
    speed = end - start

    #f1_score
    f1 =  f1_score(target_test, predicted_test)

    #auc_roc
    probabilities_test = model.predict_proba(features_test)
    probabilities_one_test = probabilities_test[:, 1]
    auc_roc = roc_auc_score(target_test, probabilities_one_test)

    model_obj['f1_score_on_test'] = round(f1, 4)
    model_obj['speed'] = round(speed, 5)
    model_obj['auc_roc_on_test'] = round(auc_roc, 4)

display(pd.DataFrame(models, columns=['name', 'f1_score_on_valid', 'f1_score_on_test', 'auc_roc_on_test', 'speed']))

| | name | f1_score_on_valid | f1_score_on_test | auc_roc_on_test | speed |
|---|---|---|---|---|---|
| 0 | Random Forest: class weighting | 0.6197 | 0.6224 | 0.8537 | 0.00962 |
| 1 | Random Forest: upsampling | 0.6206 | 0.6121 | 0.8434 | 0.00527 |
| 2 | Decision Tree: downsampling | 0.6074 | 0.5931 | 0.8229 | 0.0005 |
| 3 | Random Forest: downsampling | 0.5906 | 0.589 | 0.8379 | 0.00122 |


### Conclusion

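On the test set, the F1-scores range from 0.589 to 0.622 and the AUC-ROC values from 0.82 to 0.85. The two metrics are not directly comparable: F1 is computed from hard class predictions, while AUC-ROC is computed from predicted probabilities across all thresholds. The selected models are not perfect, but all of them are well above the AUC-ROC of 0.5 that a random model would give. Note that the Random Forest with downsampling (0.589) falls just short of the 0.59 target on the test set.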

On the test set, the best F1-score (0.6224) was achieved by the **Random Forest** model with 70 trees and depth 9, using class weighting for balancing. It also has the highest AUC-ROC (0.8537). However, this model showed the slowest prediction time (about 0.0096 s on the test set). If speed is an important parameter for the client, then the best model in this case is the **Decision Tree** (depth 5, balanced using downsampling): its F1-score is lower (0.5931), but its prediction took about 0.0005 s, roughly 19 times faster.
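A single `time.time()` measurement of a call that takes less than a millisecond is noisy, so the speed comparison is best treated as approximate. One way to make it more stable is to repeat the call and take the median using `time.perf_counter`. The sketch below assumes synthetic data with 11 features, and `median_predict_time` is a hypothetical helper, not part of the notebook.

```python
import time

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): 2,000 objects, 11 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 11))
y = (X[:, 0] > 0.8).astype(int)

model = DecisionTreeClassifier(max_depth=5, random_state=12345).fit(X, y)

def median_predict_time(model, features, repeats=20):
    """Return the median wall-clock time of model.predict over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        model.predict(features)
        timings.append(time.perf_counter() - start)
    return float(np.median(timings))

elapsed = median_predict_time(model, X)
print(f'median prediction time: {elapsed:.6f} s')
```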