# Explore a churn dataset II
We will now apply some other strategies to find out whether our dataset looks promising for detecting, with machine learning, whether a customer churns.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pickle
import os
 
%config InlineBackend.figure_format = 'retina'

In [None]:
input_file = '../../.assets/data/churn/churn_persona.pkl.zip'

try:
    df = pd.read_pickle(input_file)
    print('SUCCESS: Everything seems fine, we are good to go.')
except FileNotFoundError:
    print(f'ERROR: File {input_file} not found. Did you forget to run the create_churn_persona notebook first?')

In [None]:
#print the columns in the dataset
df.columns

## Back to Work

We want to find out whether the features show separation power. This time we look at it in higher dimensionality by combining two features in one plot.

Let's define a function that shows our customer data depending on whether the customers are going to terminate their contracts or not. Play around with different combinations of variables. Do you see some separation?

In [None]:
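# Boolean masks for churning and non-churning customers
churn = df['churn']
not_churn = ~churn

def scatter_plot(vars):
    """Scatter two features: non-churners in C0, churners in C1."""
    ax = plt.subplot(111)
    df[not_churn].plot.scatter(vars[0], vars[1], ax=ax, color='C0', alpha=0.3, s=5)
    df[churn].plot.scatter(vars[0], vars[1], ax=ax, color='C1', alpha=0.3, s=5)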

In [None]:
scatter_plot(['age','year'])

In [None]:
scatter_plot(['age','contacts'])

In [None]:
scatter_plot(['pay','d_pay'])

## Recognize Shapes
OK, that looks like a bit of a mess. Let's try something more sophisticated: a 2D histogram that shows the churn ratio, meaning the number of customers terminating their contracts divided by the total number of customers in each bin.

In [None]:
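def rate_hist(vars, bins=[20,20], scatter=False):
    dc = df[churn]  # churning customers only

    # Histogram all customers first, then the churners on the same bin edges
    datan = plt.hist2d(df[vars[0]],df[vars[1]], bins=bins)
    datac = plt.hist2d(dc[vars[0]],dc[vars[1]], bins=[datan[1],datan[2]])
    plt.close()

    plt.figure(figsize=[8,6])
    # Churn ratio per bin; empty bins (0/0) are set to 0
    with np.errstate(divide='ignore', invalid='ignore'):
        ratio = np.nan_to_num(datac[0]/datan[0])
    # hist2d indexes [x, y]; transpose and flip so that y increases upwards in imshow
    data = np.transpose(ratio)[::-1]
    plt.imshow(data, extent=[min(datan[1]),max(datan[1]),min(datan[2]),max(datan[2])], aspect='auto')
    plt.xlabel(f'{vars[0]}')
    plt.ylabel(f'{vars[1]}')
    plt.colorbar()
    if scatter:
        # Overlay the individual churners as red dots
        plt.scatter(dc[vars[0]],dc[vars[1]], color='red', alpha=0.1)
    plt.tight_layout()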

Again, we can look for areas with a high churn rate (greenish and yellow areas) and a large total number of clients willing to terminate (red dots, shown when you pass `scatter=True`). Try some different combinations of variables.

In [None]:
rate_hist(['age','contacts'])

In [None]:
rate_hist(['pay', 'd_pay'], bins=[41,41])
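
The ratio computed inside `rate_hist` can be checked in isolation. The following is a minimal sketch on made-up data: the arrays `x`, `y` and `is_churn` are hypothetical and not part of the churn dataset. It uses `np.histogram2d` instead of `plt.hist2d`, and `np.divide` with a `where` mask instead of `np.nan_to_num`, which is one possible way to leave empty bins at zero without a division warning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two features and a boolean churn flag.
x = rng.uniform(0, 10, size=1000)
y = rng.uniform(0, 10, size=1000)
is_churn = rng.random(1000) < 0.2

# Count all customers per bin, then the churners on the same bin edges.
total, xedges, yedges = np.histogram2d(x, y, bins=[5, 5])
churned, _, _ = np.histogram2d(x[is_churn], y[is_churn], bins=[xedges, yedges])

# Divide only where a bin is populated; empty bins keep the initial 0.
rate = np.divide(churned, total, out=np.zeros_like(total), where=total > 0)

print(rate.round(2))
```

Reusing the edges of the first histogram for the second one is what makes the two count arrays comparable bin by bin.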

## Discrete features
For discrete data we have to be a bit more careful, as automatic binning functions often fail. So either set the binning manually...

In [None]:
rate_hist(['bank_n', 'mail_n'],bins=[np.linspace(0.5,5.5,6),np.linspace(0.5,5.5,6)])

...or invent a new way of plotting binned data, here with a `groupby`. This example is a bit tricky, but it gives us the information we need.

In [None]:
data = (df[churn].groupby(['bank_n','mail_n'])['age'].count()/
    df.groupby(['bank_n','mail_n'])['age'].count()).unstack()

fig = plt.figure(figsize=(8,6))
ax = plt.imshow(data)
plt.colorbar()

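# Tick labels ordered by the numeric codes, so that label i matches row/column i
indx = [item[1] for item in df.groupby(['bank_n','bank'])['age'].unique().index]
indy = [item[1] for item in df.groupby(['mail_n','mail'])['age'].unique().index]

# Rows of data are bank_n (y axis), columns are mail_n (x axis)
plt.yticks(data.index-1, indx)
plt.xticks(data.columns-1, indy, rotation='vertical')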
plt.show()
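
Coming back to the manual binning used for `bank_n` and `mail_n`: the edges from `np.linspace(0.5, 5.5, 6)` sit on half-integers, so each integer code falls into the middle of its own bin. A minimal sketch, assuming a hypothetical feature with integer codes 1 to 5 (the `codes` array is made up for illustration):

```python
import numpy as np

# Hypothetical discrete feature with integer codes 1..5.
codes = np.array([1, 1, 2, 3, 3, 3, 5])

# Six edges at 0.5, 1.5, ..., 5.5 give five bins, one per integer code.
edges = np.linspace(0.5, 5.5, 6)
counts, _ = np.histogram(codes, bins=edges)

print(edges)
print(counts)  # one count per code: [2 1 3 0 1]
```

With automatically chosen edges, an integer value can land exactly on a bin boundary, which is one reason automatic binning tends to misbehave for discrete features.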

In addition, we can compare the churn rates (left plot) and absolute numbers of terminating customers (right plot).

In [None]:
plt.figure(figsize=[16,6])

plt.subplot(121)

data = (df[churn].groupby(['bank_n','mail_n'])['age'].count()/
    df.groupby(['bank_n','mail_n'])['age'].count()).unstack()

ax = plt.imshow(data)
plt.colorbar()

# Tick labels ordered by the numeric codes, so that label i matches row/column i
indx = [item[1] for item in df.groupby(['bank_n','bank'])['age'].unique().index]
indy = [item[1] for item in df.groupby(['mail_n','mail'])['age'].unique().index]

# Rows of data are bank_n (y axis), columns are mail_n (x axis)
plt.yticks(data.index-1, indx)
plt.xticks(data.columns-1, indy, rotation='vertical')

plt.subplot(122)
data = (df[churn].groupby(['bank_n','mail_n'])['age'].count()).unstack()
plt.imshow(data)
plt.yticks(data.index-1, indx)
plt.xticks(data.columns-1, indy, rotation='vertical')
plt.colorbar()

plt.tight_layout()

On the one hand, the left plot tells us that customers who have not given an email address and have their bank account at *Interbank* have a very high churn rate. On the other hand, the right plot shows that only a few customers with this feature combination actually terminate their contracts. There are far more churning customers with a bank account at *Stadtbank* and an email address from *brief.de*.
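
The difference between a high rate and a high absolute number can also be read from a small table. The sketch below uses a made-up frame `toy` whose column names mirror the notebook; the values and the segment labels `P`, `Q`, `none` and `m` are hypothetical. It relies on the mean of a boolean column being the share of `True` values.

```python
import pandas as pd

# Hypothetical toy data: a small segment where everyone churns
# and a large segment where 4 out of 10 customers churn.
toy = pd.DataFrame({
    'bank': ['P'] * 2 + ['Q'] * 10,
    'mail': ['none'] * 2 + ['m'] * 10,
    'churn': [True, True] + [True] * 4 + [False] * 6,
})

# Rate, number of churners and segment size per feature combination,
# sorted by the absolute number of churners.
summary = (toy.groupby(['bank', 'mail'])['churn']
              .agg(rate='mean', churners='sum', customers='count')
              .sort_values('churners', ascending=False))

print(summary)
```

Sorting by `churners` puts the large segment first even though its rate is lower, which is the same trade-off the two plots show.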

## What a Mess
So you see, looking around in large datasets is challenging in any case. Additionally, human behavior is not easily predictable. Would you feel confident deciding which customers to speak to?

Try it out, and then see what machine learning algorithms can do.

## Exercise

  * Create the single most impressive plot to explain a key feature of churn behaviour.
  * What would you derive for your marketing efforts?
  

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_