In [117]:
import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score

In [44]:
def printest(args, value):
    # Print a label followed by its value on the next line
    print("{} : \n {} \n".format(args, value))

# Initial Data Preparation

Churn prediction is about identifying customers who are likely to cancel their contracts soon. If the company can do that, it can offer discounts on these services in an effort to keep the users. Here we use the dataset of churn prediction for a telecom company.

In [45]:
df = pd.read_csv('Data/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head(2)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No


In [46]:
df.head(1).T

Unnamed: 0,0
customerID,7590-VHVEG
gender,Female
SeniorCitizen,0
Partner,Yes
Dependents,No
tenure,1
PhoneService,No
MultipleLines,No phone service
InternetService,DSL
OnlineSecurity,No


We see that the dataset has a few columns:
- CustomerID: the ID of the customer
- Gender: male/female
- SeniorCitizen: whether the customer is a senior citizen (0/1)
- Partner: whether they live with a partner (yes/no)
- Dependents: whether they have dependents (yes/no)
- Tenure: number of months since the start of the contract
- PhoneService: whether they have phone service (yes/no)
- MultipleLines: whether they have multiple phone lines (yes/no/no phone service)
- InternetService: the type of internet service (no/DSL/fiber optic)
- OnlineSecurity: if online security is enabled (yes/no/no internet)
- OnlineBackup: if online backup service is enabled (yes/no/no internet)
- DeviceProtection: if the device protection service is enabled (yes/no/no internet)
- TechSupport: if the customer has tech support (yes/no/no internet)
- StreamingTV: if the TV streaming service is enabled (yes/no/no internet)
- StreamingMovies: if the movie streaming service is enabled (yes/no/no internet)
- Contract: the type of contract (month-to-month/one year/two year)
- PaperlessBilling: if the billing is paperless (yes/no)
- PaymentMethod: payment method (electronic check, mailed check, bank transfer, credit card)
- MonthlyCharges: the amount charged monthly (numeric)
- TotalCharges: the total amount charged (numeric)
- Churn: if the client has canceled the contract (yes/no)

When importing a CSV file, pandas tries to guess the right type for each column. Sometimes it doesn't get it right, so it's a good idea to double-check the types using ``df.dtypes``.

In [47]:
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

We observe that the 'TotalCharges' column poses an issue: pandas infers it as an object type rather than a numeric type such as float or integer. This suggests that some of its values cannot be parsed as numbers.
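One way to see why a column ends up as `object` is to list the raw values that fail numeric conversion. The sketch below runs on a small made-up frame: the column name matches the dataset, but the values are illustrative only, and the assumption that the offending entries are blank or placeholder strings is ours rather than something this notebook confirms.

```python
import pandas as pd

# Toy data: illustrative values, not rows from the Telco file
toy = pd.DataFrame({'TotalCharges': ['29.85', ' ', '1889.5', 'n/a']})

# errors='coerce' turns anything that cannot be parsed into NaN
converted = pd.to_numeric(toy['TotalCharges'], errors='coerce')

# Raw values that failed to parse, taken from the original column
bad_values = toy.loc[converted.isna(), 'TotalCharges']
print(bad_values.tolist())
```

Looking at the raw values before filling them helps decide whether zero is a sensible replacement or whether another strategy fits better.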

In [48]:
# Convert 'TotalCharges' to numeric, replace non-numeric with NaN
df['TotalCharges'] = pd.to_numeric(df.TotalCharges, errors='coerce')

# Boolean mask of rows where 'TotalCharges' could not be parsed
# (named missing_mask to avoid shadowing the built-in filter)
missing_mask = df['TotalCharges'].isna()

# Display the rows where 'TotalCharges' is NaN
print('Before:')
display(df[missing_mask][['customerID','TotalCharges']].head(2))

# Fill NaN values with zero
df['TotalCharges'] = df['TotalCharges'].fillna(0)

# Display the same rows after the fillna operation
print('After:')
display(df[missing_mask][['customerID','TotalCharges']].head(2))

Before:


Unnamed: 0,customerID,TotalCharges
488,4472-LVYGI,
753,3115-CZMZD,


After:


Unnamed: 0,customerID,TotalCharges
488,4472-LVYGI,0.0
753,3115-CZMZD,0.0


In [49]:
# Columns

# Lowercase the column names and replace spaces with underscores
df_columns_lower = df.columns.str.lower()
df.columns  = df_columns_lower.str.replace(' ', '_')

# Rows

# boolean mask for columns with strings
column_mask = df.dtypes == 'object' 
string_columns = list(df.dtypes[column_mask].index)

# Lowercase the string values and replace spaces with underscores
for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')

df.head(2)

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,7590-vhveg,female,0,yes,no,1,no,no_phone_service,dsl,no,...,no,no,no,no,month-to-month,yes,electronic_check,29.85,29.85,no
1,5575-gnvde,male,0,no,no,34,yes,no,dsl,yes,...,yes,no,no,no,one_year,no,mailed_check,56.95,1889.5,no


We see that some columns contain 'yes' or 'no' strings, which we can convert to 0/1 values. First, consider the target variable churn:

In [50]:
df.churn = (df.churn == 'yes').astype(int)

df[['customerid', 'churn']].head().T

Unnamed: 0,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
churn,0,0,1,0,1


In [51]:
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state = 1)
df_train, df_val= train_test_split(df_train_full, test_size=0.33, random_state = 11)

y_train = df_train['churn'].values
y_val = df_val['churn'].values

del df_train['churn']
del df_val['churn']

display(df_train.head(2))

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,onlinebackup,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges
2935,9435-jmlsx,male,0,yes,no,71,yes,no,dsl,yes,yes,yes,yes,yes,yes,two_year,yes,bank_transfer_(automatic),86.1,6045.9
3639,0512-flfdw,female,1,yes,no,60,yes,yes,fiber_optic,no,no,yes,no,yes,yes,one_year,yes,credit_card_(automatic),100.5,6029.0
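Because the second split is applied to `df_train_full` rather than to the whole dataset, `test_size=0.33` does not mean that a third of all customers go to validation. The arithmetic below is a sketch of the resulting sizes. The row count of 5634 comes from the value counts reported later in this notebook (4113 + 1521), and the rounding-up rule for a fractional `test_size` is our assumption about how the split size is derived.

```python
import math

n_train_full = 5634      # rows in df_train_full (4113 + 1521)
val_fraction = 0.33      # test_size used for the second split

# Assumption: a fractional test_size is rounded up to whole rows
n_val = math.ceil(val_fraction * n_train_full)
n_train = n_train_full - n_val

# Shares relative to the full dataset, given that df_train_full
# holds 80% of it (test_size=0.2 in the first split)
train_share_of_all = 0.8 * (1 - val_fraction)
val_share_of_all = 0.8 * val_fraction

print(n_train, n_val)
print(round(train_share_of_all, 3), round(val_share_of_all, 3))
```

Under these assumptions the three parts hold roughly 54%, 26% and 20% of the data for training, validation and testing.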


# Exploratory Data Analysis (EDA)

We have already found a problem with the TotalCharges column and replaced the missing values with zeros. Now let’s see if we need to perform any additional null handling:

In [52]:
df_train_full.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

Let’s check the proportion of churned users among all customers. This is the **Global Churn Rate**, the overall churn rate for the entire customer base in the dataset.

For that, we need to divide the number of customers who churned by the total number of customers as follows:

In [77]:
# Count how many customers fall into each class of the target variable
churn_stats = df_train_full['churn'].value_counts().to_frame(name='value_counts')

# Proportion of each class
total_values = churn_stats['value_counts'].sum()
churn_stats['global_mean'] = round(churn_stats['value_counts']/total_values, 3)

display(churn_stats)


Unnamed: 0,value_counts,global_mean
0,4113,0.73
1,1521,0.27


This gives us the proportion of churned users, which we can read as the probability that a randomly chosen customer will churn. As we see, approximately 27% of the customers stopped using our services, and the rest remained as customers.

The dataset is also imbalanced: almost three times as many customers stayed as churned.
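The imbalance can be quantified directly from the class counts shown above. The short calculation below is a sketch using those two numbers; the "always predict no churn" baseline is our own illustration of why imbalance matters and is not part of the original analysis.

```python
# Class counts taken from the value_counts output above
n_stayed, n_churned = 4113, 1521
total = n_stayed + n_churned

churn_rate = n_churned / total
imbalance_ratio = n_stayed / n_churned

# Accuracy of a trivial model that always predicts "no churn"
baseline_accuracy = n_stayed / total

print(f"churn rate:        {churn_rate:.3f}")
print(f"stayed per churn:  {imbalance_ratio:.2f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

A model therefore has to beat roughly 73% accuracy before it tells us anything beyond the class distribution.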

Let's separate the dataset in categorical and numerical variables:

In [64]:
# All categorical columns except 'customerid'
categorical_mask = df_train.dtypes == 'object'
categorical = list(df_train.dtypes[categorical_mask].index)
categorical.remove('customerid') 

# Manually add 'seniorcitizen' because it's an int boolean (0 or 1)
categorical.append('seniorcitizen')
printest('categorical', categorical)

# All numerical columns except 'seniorcitizen' because it's an int boolean
numerical_mask = df_train.dtypes != 'object'
numerical = list(df_train.dtypes[numerical_mask].index)
numerical.remove('seniorcitizen')
printest('numerical', numerical)

categorical : 
 ['gender', 'partner', 'dependents', 'phoneservice', 'multiplelines', 'internetservice', 'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport', 'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling', 'paymentmethod', 'seniorcitizen'] 

numerical : 
 ['tenure', 'monthlycharges', 'totalcharges'] 



In [65]:
# Count the number of distinct values in each categorical column
df_train_full[categorical].nunique()

gender              2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
seniorcitizen       2
dtype: int64

## Feature importance

**Global Churn Ratio**

The global churn ratio is the overall churn rate for the entire customer base of the dataset.


In [119]:
global_mean = df_train_full.churn.mean()
display(global_mean)

0.26996805111821087

**Group Churn Ratio**

The churn rate within a specific customer segment, known as the group churn rate, allows for targeted analysis. By comparing this group churn rate with the overall churn rate, we can better understand how a particular group's behavior deviates from the average customer behavior. If there's a minimal difference between the group and global churn rates, it indicates that this specific group's churn behavior is not significantly different from the overall customer base. Therefore, this group's characteristics might not be a critical factor in predicting churn.

Now, let's start our analysis with the gender variable:

In [74]:
gender_mean = df_train_full.groupby('gender')['churn'].mean()
display(gender_mean)

gender
female    0.276824
male      0.263214
Name: churn, dtype: float64

The difference between the churn rates for females and males is quite small, which suggests that knowing the gender of the customer doesn’t help us identify whether they will churn.

Now let’s take a look at another variable, partner:

In [75]:
partner_mean = df_train_full.groupby('partner')['churn'].mean()
display(partner_mean)

partner
no     0.329809
yes    0.205033
Name: churn, dtype: float64

The churn rate for people with a partner is considerably lower than the rate for those without a partner: 20.5% versus 33%. This means that clients with no partner are more likely to churn than those with a partner.

**Risk Ratio**

Apart from the difference between group and global churn rates, it's also useful to calculate their ratio. In statistics, this ratio is known as the 'risk ratio,' with 'risk' referring to the chance of the event of interest occurring - in our case, churning.

$$\text{Risk Ratio} =\frac{\text{Group Rate}}{\text{Global Rate}} $$

The risk can range from zero to infinity. This gives us a clear idea of how likely members of a particular group are to churn compared to the entire customer base.

- $\text{Risk Ratio} > 1$: The group exhibits a higher churn rate compared to the overall population. This implies that members of this group are more likely to churn.

- $\text{Risk Ratio} = 1$: The group's churn rate is equivalent to that of the overall population. This means the group's likelihood to churn is average.

- $\text{Risk Ratio} < 1$: The group has a lower churn rate compared to the overall population. Customers within this group are less likely to churn.
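As a worked example of the three cases above, the snippet below plugs in the partner churn rates and the global rate reported in this notebook. It is a plain-Python sketch of the arithmetic only; the variable names are ours.

```python
global_rate = 0.26996805111821087                 # global churn rate
group_rates = {'no': 0.329809, 'yes': 0.205033}   # churn rate by partner status

# Difference and ratio of each group rate against the global rate
diffs = {group: rate - global_rate for group, rate in group_rates.items()}
risks = {group: rate / global_rate for group, rate in group_rates.items()}

for group in group_rates:
    label = 'above average' if risks[group] > 1 else 'below average'
    print(f"partner={group}: diff={diffs[group]:+.3f}, risk={risks[group]:.2f} ({label})")
```

Customers without a partner come out at a risk of about 1.22 and customers with a partner at about 0.76.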
  



In [125]:
# Alternative to groupby: pivot_table gives the churn rate per gender
df_group = df_train_full.pivot_table(values = 'churn', index = 'gender', aggfunc = 'mean')
df_group = df_group.rename(columns={'churn': 'mean'})
df_group['diff'] = df_group['mean'] - global_mean
df_group['risk'] = df_group['mean']/ global_mean
display(df_group)

# Same computation for partner
df_group = df_train_full.pivot_table(values = 'churn', index = 'partner', aggfunc = 'mean')
df_group = df_group.rename(columns={'churn': 'mean'})
df_group['diff'] = df_group['mean'] - global_mean
df_group['risk'] = df_group['mean']/ global_mean
display(df_group)

Unnamed: 0_level_0,mean,diff,risk
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0.276824,0.006856,1.025396
male,0.263214,-0.006755,0.97498


Unnamed: 0_level_0,mean,diff,risk
partner,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.329809,0.059841,1.221659
yes,0.205033,-0.064935,0.759472


The churn rates for females and males are not significantly different from the global churn rate, so neither group is at elevated risk: both have risk values close to 1. On the other hand, the churn rate for people with no partner is noticeably higher than average, making them risky, with a risk value of 1.22. People with partners tend to churn less, so for them the risk is only 0.76.

We can repeat the same computation for all categorical variables:

In [114]:
for col in categorical:
    df_group = df_train_full.groupby(by=col).churn.agg(['mean'])
    df_group['diff'] = df_group['mean'] - global_mean
    df_group['risk'] = df_group['mean'] / global_mean
    display(df_group)

Unnamed: 0_level_0,mean,diff,risk
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0.276824,0.006856,1.025396
male,0.263214,-0.006755,0.97498


Unnamed: 0_level_0,mean,diff,risk
partner,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.329809,0.059841,1.221659
yes,0.205033,-0.064935,0.759472


Unnamed: 0_level_0,mean,diff,risk
dependents,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.31376,0.043792,1.162212
yes,0.165666,-0.104302,0.613651


Unnamed: 0_level_0,mean,diff,risk
phoneservice,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.241316,-0.028652,0.89387
yes,0.273049,0.003081,1.011412


Unnamed: 0_level_0,mean,diff,risk
multiplelines,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.257407,-0.012561,0.953474
no_phone_service,0.241316,-0.028652,0.89387
yes,0.290742,0.020773,1.076948


Unnamed: 0_level_0,mean,diff,risk
internetservice,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
dsl,0.192347,-0.077621,0.712482
fiber_optic,0.425171,0.155203,1.574895
no,0.077805,-0.192163,0.288201


Unnamed: 0_level_0,mean,diff,risk
onlinesecurity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.420921,0.150953,1.559152
no_internet_service,0.077805,-0.192163,0.288201
yes,0.153226,-0.116742,0.56757


Unnamed: 0_level_0,mean,diff,risk
onlinebackup,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.404323,0.134355,1.497672
no_internet_service,0.077805,-0.192163,0.288201
yes,0.217232,-0.052736,0.80466


Unnamed: 0_level_0,mean,diff,risk
deviceprotection,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.395875,0.125907,1.466379
no_internet_service,0.077805,-0.192163,0.288201
yes,0.230412,-0.039556,0.85348


Unnamed: 0_level_0,mean,diff,risk
techsupport,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.418914,0.148946,1.551717
no_internet_service,0.077805,-0.192163,0.288201
yes,0.159926,-0.110042,0.59239


Unnamed: 0_level_0,mean,diff,risk
streamingtv,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.342832,0.072864,1.269897
no_internet_service,0.077805,-0.192163,0.288201
yes,0.302723,0.032755,1.121328


Unnamed: 0_level_0,mean,diff,risk
streamingmovies,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.338906,0.068938,1.255358
no_internet_service,0.077805,-0.192163,0.288201
yes,0.307273,0.037305,1.138182


Unnamed: 0_level_0,mean,diff,risk
contract,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
month-to-month,0.431701,0.161733,1.599082
one_year,0.120573,-0.149395,0.446621
two_year,0.028274,-0.241694,0.10473


Unnamed: 0_level_0,mean,diff,risk
paperlessbilling,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
no,0.172071,-0.097897,0.637375
yes,0.338151,0.068183,1.25256


Unnamed: 0_level_0,mean,diff,risk
paymentmethod,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bank_transfer_(automatic),0.168171,-0.101797,0.622928
credit_card_(automatic),0.164339,-0.10563,0.608733
electronic_check,0.45589,0.185922,1.688682
mailed_check,0.19387,-0.076098,0.718121


Unnamed: 0_level_0,mean,diff,risk
seniorcitizen,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.24227,-0.027698,0.897403
1,0.413377,0.143409,1.531208


- Gender: Little difference in churn rates between females and males, with similar means and risks close to 1 for both groups.

- Senior Citizens: Higher churn risk at 1.53, compared to 0.90 for non-seniors.

- Partner Status: Lower churn risk for customers with a partner at 0.76, compared to 1.22 for those without.

- Phone Service Usage: Churn risk close to the global rate for users (1.01), and slightly lower for non-users (0.89).

- Tech Support: Higher churn risk for clients without tech support with risk 1.55.

- Contract Length: Highest churn risk for monthly contract clients, while two-year contract clients churn very rarely.
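Reading sixteen separate tables is tedious, so it can help to stack them into one frame and sort by risk. The helper below is a sketch on toy data; `risk_table` and its arguments are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd

def risk_table(df, columns, target):
    """Stack per-category churn rate, diff and risk for several columns."""
    global_rate = df[target].mean()
    frames = []
    for col in columns:
        stats = df.groupby(col)[target].agg(['mean', 'count'])
        stats['diff'] = stats['mean'] - global_rate
        stats['risk'] = stats['mean'] / global_rate
        # Label each row with (feature, value) so the tables can be stacked
        stats.index = pd.MultiIndex.from_product(
            [[col], stats.index], names=['feature', 'value'])
        frames.append(stats)
    return pd.concat(frames).sort_values('risk', ascending=False)

# Toy data: illustrative values only
toy = pd.DataFrame({
    'contract': ['monthly', 'monthly', 'yearly', 'yearly'],
    'partner':  ['no', 'yes', 'no', 'yes'],
    'churn':    [1, 1, 0, 0],
})
table = risk_table(toy, ['contract', 'partner'], 'churn')
print(table)
```

In this notebook the call would be `risk_table(df_train_full, categorical, 'churn')`. The `count` column is included because a high risk computed on a handful of customers deserves less trust than one computed on thousands.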

**Entropy and Mutual Information**

Mutual Information is a measure that quantifies the amount of information obtained about one random variable through observing another random variable. In essence, it evaluates how dependent the two variables are on each other.

In order to fully understand this concept, we need to explore the concept of entropy in information theory. Let's assume we have a dataset with outcomes denoted by $Y$ and an attribute symbolized by $X$, both of which are categorical variables.

Entropy for the independent variable $X$ is defined as:

$$H(X) = -\sum_i p(x_i)\log{p(x_i)}$$

Here, $p(x_i)$ represents the probability of occurrence of a specific value $x_i$ of attribute $X$. This probability is calculated as:

$$p(x_i) =\frac{n_i}{n}$$

In this equation, $n_i$ is the total number of instances where the specific $x_i$ occurs, and $n$ is the total number of instances. We can similarly compute the entropy for the target variable $Y$.

For example, in a churn analysis, $X$ could represent the random variable for gender, with $x_i$ representing a specific gender (male or female). This formula allows us to calculate the entropy of the gender distribution in the dataset.
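The entropy formula can be checked on a tiny example. The function below is a minimal sketch in plain Python that estimates $p(x_i)$ by $n_i/n$ and uses the natural logarithm, so the result is in nats; the sample lists are made up for illustration.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in nats of a sequence of categorical labels."""
    n = len(values)
    counts = Counter(values)
    # -p*log(p) written as p*log(1/p), with p = n_i / n
    return sum((c / n) * math.log(n / c) for c in counts.values())

balanced = ['female', 'male', 'female', 'male']   # toy sample
constant = ['male'] * 4                           # only one value occurs

print(entropy(balanced))   # equals log(2): maximal for two categories
print(entropy(constant))   # zero: there is no uncertainty
```

Entropy is largest when the categories are equally likely and drops to zero when a single value always occurs.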

The next step involves connecting the entropies of $X$ and $Y$. We accomplish this by constructing a conditional entropy of $X$ given a specific target value $y_i$, formulated as:

$$H(X|y_i) = - \sum_j p(x_j|y_i)\log p(x_j|y_i) $$ 

Here, $p(x_j|y_i)$ stands for the conditional probability, computed as:

$$p(x_j|y_i) = \frac{n_{ij}}{n_i}$$

In this formula, $n_{ij}$ denotes the count of instances where a specific attribute value $x_j$ (e.g., male) and a specific target value $y_i$ (e.g., churn) occur together. On the other hand, $n_i$ represents the total count of instances where the specific target value $y_i$ (either churn or not churn) occurs, regardless of the attribute value.

With $X$ representing gender and $Y$ as churn, the entropy measure $H(X|y_i)$ calculates the entropy of the gender category conditional on churn $(y_1)$ and no churn $(y_0)$.

We can also express the conditional probability in terms of the joint probability, $p(x_j,y_i)$, computed as:

$$p(x_j,y_i)  = \frac{n_{ij}}{n}$$  

This allows us to express the conditional probability as:

$$p(x_j|y_i) = \frac{p(x_j,y_i)}{p(y_i)}$$

This way we can rewrite the conditional entropy as:

$$H(X|y_i) = - \sum_j \frac{p(x_j,y_i)}{p(y_i)}\log \bigg(\frac{p(x_j,y_i)}{p(y_i)}\bigg) $$ 

To calculate the entropy of variable $X$ conditional on the entire set $Y$, we need to compute the weighted sum of the conditional entropies for each possible outcome in $Y$. This can be represented mathematically as:

$$
\begin{align*}
H(X|Y) &= \sum_i p(y_i)H(X|y_i) \\
       &= - \sum_i\sum_j p(y_i)\bigg[\frac{p(x_j,y_i)}{p(y_i)}\log \bigg(\frac{p(x_j,y_i)}{p(y_i)}\bigg)\bigg]\\
       &= - \sum_i\sum_j p(x_j,y_i)\log \bigg(\frac{p(x_j,y_i)}{p(y_i)}\bigg)
\end{align*}
$$

This computation yields the expected conditional entropy over all possible outcomes of $Y$.

Further, the equation can be rearranged using logarithm properties to obtain:

$$
\begin{align*}
H(X|Y) &= - \sum_i\sum_j p(x_j,y_i)\log{p(x_j,y_i)} + \sum_i  \bigg(\log{p(y_i)}\sum_j p(x_j,y_i)\bigg) \\
       &= - \sum_i\sum_j p(x_j,y_i)\log{p(x_j,y_i)} + \sum_i  p(y_i)\log{p(y_i)}\\
       &= H(X,Y) - H(Y)
\end{align*}
$$

In this context, joint entropy $H(X,Y)$ quantifies the combined uncertainty of $X$ and $Y$, encapsulating the question, "How much do I not know about both $X$ and $Y$ jointly?" Conversely, conditional entropy $H(X|Y)$ expresses the remaining uncertainty of $X$ when we have some knowledge of $Y$, answering the query, "Given some knowledge of $Y$, what is my remaining uncertainty about $X$?"

Thus, the formula

$$H(X|Y) = H(X,Y) - H(Y)$$ 

is interpreted as the removal of uncertainty about $Y$ from the total joint uncertainty, leaving behind the conditional entropy $H(X|Y)$. This represents the residual uncertainty about $X$ once knowledge about $Y$ is accounted for. This interpretation underscores the inherent connection between these entropy measures, providing a clear mathematical understanding of their relationships. 
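The identity $H(X|Y) = H(X,Y) - H(Y)$ can be verified numerically. The sketch below computes the left-hand side as the $p(y_i)$-weighted average of the entropies $H(X|y_i)$ and the right-hand side from the joint and marginal entropies. The data are a made-up sample and the function names are ours.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in nats of a sequence of hashable labels."""
    n = len(values)
    return sum((c / n) * math.log(n / c) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X|Y) as the p(y)-weighted average of H(X|y)."""
    n = len(y)
    total = 0.0
    for y_value, y_count in Counter(y).items():
        # Values of x restricted to the rows where y takes this value
        x_given_y = [xi for xi, yi in zip(x, y) if yi == y_value]
        total += (y_count / n) * entropy(x_given_y)
    return total

# Toy sample: gender (X) and churn (Y)
x = ['female', 'female', 'male', 'male', 'male', 'female']
y = [1, 0, 0, 0, 1, 0]

lhs = conditional_entropy(x, y)
rhs = entropy(list(zip(x, y))) - entropy(y)   # H(X,Y) - H(Y)
print(lhs, rhs)
```

Both numbers agree up to floating-point error, as the derivation predicts.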

Similarly, we can define Mutual Information (MI) between two variables $X$ and $Y$ as:

$$MI(X;Y) = H(X) - H(X|Y)$$

The Mutual Information, $MI(X;Y)$, denotes the reduction in uncertainty about $X$ due to the knowledge of $Y$, and vice versa. In other words, it quantifies the amount of information gained about $X$ after learning about $Y$, and the amount of information gained about $Y$ after learning about $X$. This is because mutual information is symmetric, i.e., $MI(X;Y) = MI(Y;X)$.

Here's how to interpret mutual information:

- **If $MI(X;Y)$ is high:** knowing $Y$ greatly reduces our uncertainty about $X$, and knowing $X$ greatly reduces our uncertainty about $Y$. This is the case when $X$ and $Y$ are strongly dependent.
 
- **If $MI(X;Y)$ is zero:**  knowing $Y$ does nothing to decrease our uncertainty about $X$, and knowing $X$ does nothing to decrease our uncertainty about $Y$. This is the case when $X$ and $Y$ are independent.

Therefore, Mutual Information, $MI(X;Y)$, serves as a measure of the decrease in uncertainty about $X$ (alternatively, the information about $X$ that we gain) as a result of knowing $Y$, and the decrease in uncertainty about $Y$ (or the information about $Y$ that we gain) as a result of knowing $X$.

Mutual Information can also be expressed in terms of joint and marginal probabilities as follows:

$$MI(X;Y) = \sum_i \sum_j p(x_j,y_i) \log \bigg(\frac{p(x_j,y_i)}{p(x_j)p(y_i)}\bigg)$$
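To make the joint-probability form concrete, the sketch below evaluates it on two toy features using empirical frequencies. It is a plain-Python illustration with our own function name and made-up data. It uses the natural logarithm, which to our knowledge is also the unit (nats) reported by scikit-learn's `mutual_info_score`.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """MI in nats from empirical joint and marginal frequencies."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (x_value, y_value), c in count_xy.items():
        # p(x,y) / (p(x) p(y)) = (c/n) / ((cx/n)(cy/n)) = c*n / (cx*cy)
        ratio = c * n / (count_x[x_value] * count_y[y_value])
        mi += (c / n) * math.log(ratio)
    return mi

# Toy sample: contract determines churn, gender is independent of it
churn    = [1, 1, 0, 0]
contract = ['monthly', 'monthly', 'yearly', 'yearly']
gender   = ['f', 'm', 'f', 'm']

mi_contract = mutual_information(contract, churn)
mi_gender = mutual_information(gender, churn)
print(mi_contract, mi_gender)
```

Here the fully informative feature reaches $\log 2$, which is the entropy of the balanced target, while the independent feature scores zero. Swapping the arguments gives the same values, reflecting the symmetry of mutual information.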


We can apply this concept for categorical variables as follows:

In [120]:
def calculate_mi(series):
    return mutual_info_score(series, df_train_full.churn)

df_mi = df_train_full[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')


display(df_mi.head())
display(df_mi.tail())

Unnamed: 0,MI
contract,0.09832
onlinesecurity,0.063085
techsupport,0.061032
internetservice,0.055868
onlinebackup,0.046923


Unnamed: 0,MI
partner,0.009968
seniorcitizen,0.00941
multiplelines,0.000857
phoneservice,0.000229
gender,0.000117


As we see, contract, onlinesecurity, and techsupport are among the most important features. Gender is among the least important, which is not surprising given that its risk values were close to 1, so we shouldn’t expect it to be useful for the model.

**Correlation Coefficient**