# **Customer Attrition Study**

## Objectives

* Answer business requirement 1:
    * The client would like to better understand the patterns in the customer base so that the client can learn which variables characterise a customer least likely to attrition.

## Inputs

* outputs/datasets/collection/EmployeeAttrition.csv

## Outputs

* Generate code and seaborn plots that answer business requirement 1 and can be used for the Streamlit App


---

# Change working directory

* We need to change the working directory from the current jupyter_notebooks folder to its parent folder in order to access the whole project
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
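Re-running the `os.chdir` cell above moves one level further up each time. A minimal sketch of a guard against that follows. The helper name `resolve_project_root` is hypothetical, and it assumes the notebooks folder is literally named `jupyter_notebooks`:

```python
from pathlib import Path


def resolve_project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent folder if cwd is the notebooks folder, else cwd unchanged."""
    cwd = Path(cwd)
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Illustrative paths only; they do not need to exist on disk
print(resolve_project_root("/workspace/project/jupyter_notebooks"))
print(resolve_project_root("/workspace/project"))
```

In the notebook this could be used as `os.chdir(resolve_project_root(os.getcwd()))`, which is safe to run more than once.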

# Load Data

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/EmployeeAttrition.csv").drop(['EmployeeCount', 'EmployeeNumber'], axis=1))
df.head(3)

# Data Exploration #

We wish to become familiar with the dataset, check variable types and their distributions, check for any missing data, and understand what these variables mean in the business context.

In [None]:
%pip show pydantic

In [24]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

PydanticImportError: `BaseSettings` has been moved to the `pydantic-settings` package. See https://docs.pydantic.dev/2.0.2/migration/#basesettings-has-moved-to-pydantic-settings for more details.

For further information visit https://errors.pydantic.dev/2.0.2/u/import-error
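If the profiling report cannot be generated, as with the import error above, a similar first pass over types, missing values and distributions can be sketched with pandas alone. The `toy` frame below is made-up illustrative data, not the project dataset. In the notebook the same calls would be applied to `df`:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Gender": ["M", "F", "F", None],
    "Credit_Limit": [12000.0, 3500.0, None, 8000.0],
    "Dependent_count": [3, 1, 2, 0],
})

# One row per variable: type, number of missing values, number of distinct values
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "n_unique": toy.nunique(),
})
print(overview)

# Summary statistics for the numerical variables
print(toy.describe())
```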

---

# Correlation study

We can use `OrdinalEncoder` to transform categorical variables into integer values so that they can be correlated with `Attrition_Flag`. First, we determine the categorical variables:

In [None]:
cols = df.columns[df.dtypes=='object'].to_list()
df_oe = df.copy()

for col in cols:
    print(col)
    print(df[col].unique())

Some of these variables, such as `Education_Level`, have a ranking. This ordering will be assigned using preset lists for `OrdinalEncoder`'s `categories` argument.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

cat_list = [['M', 'F'],
            ['Uneducated', 'High School', 'College',
            'Unknown', 'Graduate', 'Post-Graduate',
            'Doctorate'],
            ['Single', 'Unknown', 'Divorced', 'Married'],
            ['Less than $40K', '$40K - $60K', 'Unknown',
            '$60K - $80K', '$80K - $120K', '$120K +'],
            ['Blue', 'Silver', 'Gold', 'Platinum']]

encoder = OrdinalEncoder(categories=cat_list)
encoded_array = encoder.fit_transform(df[cols])

for i, col in enumerate(cols):
    df_oe[col] = encoded_array[:,i]

df_oe.head(3)
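The way the `categories` argument fixes the integer assigned to each level can be seen on a small example. This is a sketch on made-up data: `toy`, `toy_categories` and `toy_encoder` are illustrative names, not part of the project.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data with one ranked and one unranked variable
toy = pd.DataFrame({
    "Card_Category": ["Gold", "Blue", "Platinum", "Silver"],
    "Gender": ["F", "M", "M", "F"],
})

# One list per column, in the same order as the columns passed to fit_transform.
# The position of a level in its list becomes its encoded value.
toy_categories = [
    ["Blue", "Silver", "Gold", "Platinum"],
    ["M", "F"],
]

toy_encoder = OrdinalEncoder(categories=toy_categories)
toy_encoded = pd.DataFrame(toy_encoder.fit_transform(toy), columns=toy.columns)
print(toy_encoded)
```

Because the lists are matched to columns by position, the number and order of lists has to agree with the columns being encoded.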

Use `.corr()` with both the `spearman` and `pearson` methods and, for each, return the 10 strongest correlations with `Attrition_Flag`, sorted by descending absolute value of the coefficient and excluding the target's correlation with itself.

In [None]:
corr_spearman = df_oe.corr(method='spearman')['Attrition_Flag'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

The same for `pearson`

In [None]:
corr_pearson = df_oe.corr(method='pearson')['Attrition_Flag'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson
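The ranking step used in the cells above can be illustrated on a small made-up frame. In this sketch the target is removed by name with `.drop()` rather than by position with `[1:]`. This is an alternative offered as an assumption about what is more robust, since positional slicing relies on the target sorting first. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one strongly and one weakly related variable
toy = pd.DataFrame({
    "target": [0, 0, 1, 1, 1, 0],
    "strong_neg": [9, 8, 2, 1, 3, 7],
    "weak_pos": [1, 3, 2, 3, 2, 2],
})

ranked = (
    toy.corr(method="spearman")["target"]
    .drop("target")                         # exclude the target's self-correlation
    .sort_values(key=abs, ascending=False)  # strongest first, regardless of sign
)
print(ranked)
```

Sorting with `key=abs` keeps the sign of each coefficient in the output while ordering by strength, so a strong negative correlation is listed ahead of a weak positive one.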

We notice that the most correlated variables are either redundant variables included by the dataset uploader or variables related to customer account usage. We therefore drop these and keep only the variables that would be available for a prospect.

In [None]:
df_oe_dropped = df_oe.drop(['Unnamed: 0',
                            'Months_on_book', 
                            'Months_Inactive_12_mon',
                            'Contacts_Count_12_mon',
                            'Total_Revolving_Bal',
                            'Avg_Open_To_Buy',
                            'Total_Amt_Chng_Q4_Q1',
                            'Total_Trans_Amt',
                            'Total_Trans_Ct',
                            'Total_Ct_Chng_Q4_Q1',
                            'Avg_Utilization_Ratio',
                            'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
                            'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'], axis=1)

df_oe_dropped.head()

We now repeat the correlation methods for the dataset with the usage and redundant variables dropped

In [None]:
corr_spearman = df_oe_dropped.corr(method='spearman')['Attrition_Flag'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

In [None]:
corr_pearson = df_oe_dropped.corr(method='pearson')['Attrition_Flag'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

It appears that most variables that would be available for a prospect have a very weak correlation with attrition, with only `Total_Relationship_Count` having any appreciable correlation at all. We will consider the top 5 correlated variables presented here and study their distribution among attritioned customers.

In [None]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

Therefore we will study the following variables. We will investigate if:

* An attritioned customer typically has a lower credit limit
* An attritioned customer typically has more dependents
* An attritioned customer tends to be female
* An attritioned customer tends to be single
* An attritioned customer tends to have fewer existing relationships with the bank

In [None]:
vars_to_study = ['Credit_Limit', 'Dependent_count', 'Gender', 'Marital_Status', 'Total_Relationship_Count']
vars_to_study

---

# EDA on selected variables

In [None]:
df_eda = df.filter(vars_to_study + ['Attrition_Flag'])
df_eda.head()

## Variables Distribution by Attrition

Plot the distributions (numerical and categorical) coloured by attrition

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_categorical(df, col, target_var):

    plt.figure(figsize=(12, 5))
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


target_var = 'Attrition_Flag'
for col in vars_to_study:
    if df_eda[col].dtype == 'object':
        plot_categorical(df_eda, col, target_var)
        print("\n\n")
    else:
        plot_numerical(df_eda, col, target_var)
        print("\n\n")

---

## Parallel Plot

Create a separate DataFrame in which `Credit_Limit` is transformed from a numerical variable into a binned categorical variable for visualizing on a `parallel_categories()` plot

In [None]:
from feature_engine.discretisation import ArbitraryDiscretiser
import numpy as np
cred_lim_map = [-np.inf, 7000, 14000, 21000, 28000, np.inf]  # np.Inf was removed in NumPy 2.0
disc = ArbitraryDiscretiser(binning_dict={'Credit_Limit': cred_lim_map})
df_parallel = disc.fit_transform(df_eda)
df_parallel.head()

In [None]:
disc.binner_dict_['Credit_Limit']

Create map to replace `Credit_Limit` with more informative levels

In [None]:
n_classes = len(cred_lim_map) - 1
classes_ranges = disc.binner_dict_['Credit_Limit'][1:-1]

labels_map = {}
for n in range(0, n_classes):
    if n == 0:
        labels_map[n] = f"<{int(classes_ranges[0]/1000)}k"
    elif n == n_classes-1:
        labels_map[n] = f"+{int(classes_ranges[-1]/1000)}k"
    else:
        labels_map[n] = f"{int(classes_ranges[n-1]/1000)}k to {int(classes_ranges[n]/1000)}k"

labels_map

Replace using `.replace()`

In [None]:
df_parallel['Credit_Limit'] = df_parallel['Credit_Limit'].replace(labels_map)
df_parallel.head()
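For comparison, the binning and relabelling steps above can be sketched in one call with `pandas.cut`. This is a swapped-in technique, not the `ArbitraryDiscretiser` used in the notebook. It assumes right-closed intervals, which is the `pd.cut` default. The `toy_limits` values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical credit limits, including one value exactly on a bin edge
toy_limits = pd.Series([1500, 7000, 9500, 20000, 34000], name="Credit_Limit")

# Same edges and label texts as cred_lim_map and labels_map in the notebook
edges = [-np.inf, 7000, 14000, 21000, 28000, np.inf]
labels = ["<7k", "7k to 14k", "14k to 21k", "21k to 28k", "+28k"]

binned = pd.cut(toy_limits, bins=edges, labels=labels)
print(binned.tolist())
# ['<7k', '<7k', '7k to 14k', '14k to 21k', '+28k']
```

With right-closed intervals a limit of exactly 7000 falls in the first bin, so the label `<7k` is slightly loose at the edge.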

Create a multi-dimensional categorical data plot

In [None]:
import plotly.express as px
fig = px.parallel_categories(df_parallel, color="Attrition_Flag")
fig.show()

---

# Conclusions

The correlations and plot interpretations converge to a certain extent. For example, the `Marital_Status` plot shows that single customers attrition at a higher rate than married customers, and the `Total_Relationship_Count` plot shows that customers with fewer relationships attrition at a higher rate than customers with more relationships. However, these correlations are very weak. The bank would be advised to collect different data that might better predict a customer's tendency to attrition.

* An attritioned customer typically has a lower credit limit
* An attritioned customer typically has more dependents
* An attritioned customer tends to be female
* An attritioned customer tends to be single
* An attritioned customer tends to have fewer existing relationships with the bank
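Statements about one group attritioning "at a higher rate" than another can be checked numerically as well as read off the plots. A minimal sketch follows. It assumes `Attrition_Flag` is encoded as 0/1 with 1 meaning attritioned, and the `toy` rows are made up rather than taken from the dataset:

```python
import pandas as pd

# Hypothetical toy data; 1 is assumed to mean the customer attritioned
toy = pd.DataFrame({
    "Marital_Status": ["Single", "Single", "Single",
                       "Married", "Married", "Married", "Married"],
    "Attrition_Flag": [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 flag within each group is that group's attrition rate
rate_by_status = (
    toy.groupby("Marital_Status")["Attrition_Flag"]
    .mean()
    .sort_values(ascending=False)
)
print(rate_by_status)
```

Comparing rates rather than raw counts avoids being misled by groups of very different sizes.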