In [None]:
from IPython.display import HTML

# Explore a churn dataset

We have extracted customer data from our ERP system. Our aim is to develop a simple churn model: we want to identify customers who are likely to terminate their contract soon so that we can target them directly with a marketing campaign. We must avoid targeting satisfied customers, though, so it is important to achieve a good separation between dissatisfied and loyal customers.

## Overview:
- [Read in data](#Prerequisites)
- [Exploring data](#Exploring-data)
- [Visualize the data](#Visualize-the-data)
- [Significance and uncertainties](#Significance-and-uncertainties)

#### Remark:
All features are generated by a simulation from random distributions with some underlying assumptions about how people (could) behave.
In the end, every customer is in one of two states: they churn or they do not. Under the hood we implemented three types of customers. These types determine how customers behave if they churn and how the features for our simulation are set. For example, an angry customer has a higher churn probability than a standard customer, but standard customers still have a churn rate greater than zero. In addition, there are "sleepy" customers, who behave like standard customers but have a higher churn rate if woken up (by a call or an e-mail). In most churn scenarios sleepy customers should not be woken up, and they make it more difficult to build an efficient churn-detection model. See the [create_churn_persona](create_churn_persona.ipynb) notebook for more details.

#### Some Words About Toy Data
High-quality datasets are hard to find in reality. In fact, preparing to collect high-quality data often takes far more time than large parts of the actual analysis. However, starting analysis projects early ensures that you know at least some of the traps before data taking begins.

Building simplified models that generate datasets from first principles is therefore a common workaround. With such models you can set up the analysis machinery while data taking is getting started. Our dataset is such a toy set, so in several respects it may not fully reflect reality, but it still holds some key features of real data.

## Prerequisites 

Let's get everything we need and run some checks…

In [None]:
import os
import errno

from IPython.display import display, Markdown, Latex

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


%config InlineBackend.figure_format = 'retina'


input_file = '../../.assets/data/churn/churn_persona.pkl.zip'

try:
    df = pd.read_pickle(input_file)
    display(Markdown('**SUCCESS:** Everything seems fine, we are good to go.'))
except FileNotFoundError:
    display(Markdown(f'**ERROR:** File {input_file} not found. Did you forget to run the create_churn_persona notebook first?'))

## Exploring data

We have the condensed ERP information already transformed into a [Pandas](http://pandas.pydata.org) [dataframe](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). The dataframe is called `df` here.

In [None]:
df.head(5)

The columns in this dataframe are:
- **age:** customer age in years (as a floating-point number for simplicity)
- **amount:** energy consumption in the last accounting period (e.g. in kWh)
- **bank:** the name of the customer's bank
- **churn:** whether the customer ended their contract
- **contacts:** number of contacts the customer had with us (everything from phone calls, mails, bills, meter readings and so on)
- **d_amount:** _delta_ amount, i.e. the change in energy consumption compared to the next-to-last accounting period (positive: the customer consumed more energy)
- **d_pay:** _delta_ pay, i.e. the change in invoice amount compared to the next-to-last accounting period (positive: the customer pays more than last year)
- **mail:** the customer's mail provider
- **pay:** the invoice amount of the last accounting period
- **size:** number of people living in the customer's household
- **year:** the year in which the customer became our customer
- **bank_r, bank_s, mail_r, mail_s, contacts_r, contacts_s:** deduced quantities, we'll cover these later

Start by exploring the dataset. This gives you a feel for the data and also serves as a consistency check (you should always be sure your data is sensible).

# Exercises

### Extract basic statistical quantities

Get means, standard deviations and medians of your dataset. You may start with the feature `age`. Check whether the data is consistent and whether the statistical quantities make sense.
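
A possible starting point, sketched on a small made-up table so that it runs on its own. The DataFrame `toy` and its values are invented for illustration; in the notebook you would apply the same calls to `df`.

```python
import pandas as pd

# Hypothetical stand-in for `df`; the numbers are invented for illustration.
toy = pd.DataFrame({
    "age": [23.5, 31.0, 47.2, 52.8, 38.5],
    "size": [1, 2, 4, 2, 3],
})

# Mean, standard deviation and median of a single feature
age_mean = toy["age"].mean()
age_std = toy["age"].std()        # sample standard deviation (ddof=1)
age_median = toy["age"].median()

# describe() summarizes all numeric columns at once
summary = toy.describe()

# Simple consistency check: all ages should lie in a plausible range
ages_plausible = toy["age"].between(0, 120).all()

print(summary)
print(age_mean, age_std, age_median, ages_plausible)
```

The range used in the plausibility check (0 to 120 years) is an assumption; pick limits that match what you know about your customers.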

In [None]:
## Let's start...







This all sounds sensible. 

### Explore non-numeric columns

The dataset also contains non-numeric and discrete values like the customer's bank, mail provider and total contacts. We can explore these values as well.
* What are the unique entries (`unique()`)?
* What are the `value_counts()` of each feature?
* Can you find out what the feature columns with the suffixes *_r*, *_s* and *_n* stand for?
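
As a rough sketch of the first two questions, the following uses a tiny invented table (`toy`, with hypothetical mail providers) rather than the real `df`:

```python
import pandas as pd

# Hypothetical stand-in for `df`; the entries are invented for illustration.
toy = pd.DataFrame({
    "bank": ["Sparbank", "Interbank", "Sparbank", "Volkskasse", "Sparbank"],
    "mail": ["a.example", "b.example", "a.example", "a.example", "b.example"],
})

# Distinct entries of a column, in order of first appearance
banks = toy["bank"].unique()

# Number of rows per entry, most frequent entry first
bank_counts = toy["bank"].value_counts()

print(banks)
print(bank_counts)
```

The same two calls work for any other column, for example `toy["mail"]`.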

In [None]:
## Let's start...











### Visualize the data

After performing basic operations on the data, continue with visualizations. This helps to get a good feel for the dataset. Start with a histogram of one of the numeric, non-discrete features.

Here are some tasks to perform:
* set the size of the figure to a width of 10 and a height of 4
* set the `bins` with [np.linspace](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html)
* set title and axis labeling
* set the limits of the axes and/or use a logscaled axis
* normalize your data, add a grid to the plot, change the color...
* go on with another feature
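
One way these tasks could fit together is sketched below. It uses randomly generated ages as a hypothetical stand-in for `df['age']`; the bin range, labels and color are arbitrary choices, not values taken from the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical ages standing in for df['age']
rng = np.random.default_rng(seed=0)
ages = rng.normal(loc=45, scale=12, size=1000)

# 30 equally wide bins between 0 and 100 need 31 bin edges
bins = np.linspace(0, 100, 31)

fig, ax = plt.subplots(figsize=(10, 4))
# density=True normalizes the histogram so that its area is 1
counts, edges, _ = ax.hist(ages, bins=bins, density=True, color="steelblue")
ax.set_title("Age distribution")
ax.set_xlabel("Age in years")
ax.set_ylabel("Normalized number of customers")
ax.set_xlim(0, 100)
ax.grid(True)
plt.show()
```

For a feature that spans several orders of magnitude, `ax.set_yscale("log")` switches the y axis to a logarithmic scale.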

In [None]:
# create a new figure to plot in
plt.figure()

# plot the age distribution (this returns the axis object)
# use plt.hist(df['age']) or ax = df.age.plot(kind='hist')
plt.hist(df['age'])








plt.show()

This histogram shows the age distribution of our customers, i.e. how many (y axis) of our customers are in which age segment (x axis). Again, this looks plausible.

Let's do the same for a non-numeric column which is a bit more complicated:

In [None]:
# create a new figure to plot in
plt.figure(figsize=(10,4))

# group the customers by bank name, count number of customers for each 
# bank and plot this distribution
ax = df.groupby('bank')['bank'].count().plot(kind='bar')
ax.set_xlabel("Customer's bank")

plt.show()

This histogram shows us how many customers are with each bank.

Adapt the plot for another feature with discrete values, such as the contacts or the e-mail provider.

In [None]:
# create a new figure to plot in
plt.figure(figsize=(10,4))








plt.show()

### Compare different subsets of the data in one plot (e.g. by churn)

Now compare a distribution for two different subsets of customers. In our case it makes sense to compare customers who churn with customers who do not churn.

* create two boolean masks for loyal customers (`churn == False`) and terminated customers (`churn == True`)
* plot different features and compare deduced quantities like mean or median
* make sure that you have the same bins for both subsets
* you may have to set the transparency (see setting for *alpha*)
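
The list above could be approached along the following lines. This is only a sketch on randomly generated data: the table `toy`, the 10% churn probability and the bin range are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for `df` with one numeric feature and the churn flag
rng = np.random.default_rng(seed=1)
toy = pd.DataFrame({
    "age": rng.normal(45, 12, size=500),
    "churn": rng.random(500) < 0.1,
})

# Boolean masks selecting the two subsets
churn = toy["churn"] == True
not_churn = ~churn

# Identical bin edges for both subsets keep the histograms comparable
bins = np.linspace(0, 100, 26)

fig, ax = plt.subplots(figsize=(10, 4))
ax.hist(toy.loc[not_churn, "age"], bins=bins, density=True, alpha=0.5, label="loyal")
ax.hist(toy.loc[churn, "age"], bins=bins, density=True, alpha=0.5, label="terminated")
ax.set_xlabel("Age in years")
ax.legend()
plt.show()

# Compare a deduced quantity between the subsets
print(toy.loc[churn, "age"].median(), toy.loc[not_churn, "age"].median())
```

Normalizing with `density=True` helps here because the two subsets usually differ a lot in size.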

In [None]:
# create boolean lists
# churn =
# not_churn = ~churn

# create figure
plt.figure(figsize = (10,4))

# plotting
# df[churn]
# df[not_churn]







plt.show()

### Customer comparison with non-numeric data
We can also gain insight from the non-numeric data. For example, we might want to know how the ratio of terminated contracts depends on the customer's bank: a customer of a small local public bank might be more loyal than a customer of a large international direct bank.

To do that: 
* select again loyal/terminated customers
* group these subsets by bank 
* count the total number for each bank
* divide these numbers to get the ratio
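
A minimal sketch of these steps on an invented table is shown here. It divides terminated by loyal counts, which is the convention of the `add_ratios` helper used later in this notebook; dividing by the total number of customers per bank instead would give the fraction of terminated contracts.

```python
import pandas as pd

# Hypothetical stand-in for `df`; the rows are invented for illustration.
toy = pd.DataFrame({
    "bank": ["Interbank"] * 4 + ["Volkskasse"] * 6,
    "churn": [True, True, False, False,
              True, False, False, False, False, False],
})

# Count terminated and loyal customers per bank
n_terminated = toy[toy["churn"]].groupby("bank")["bank"].count()
n_loyal = toy[~toy["churn"]].groupby("bank")["bank"].count()

# The division is aligned on the bank names in the index
ratio = n_terminated / n_loyal
print(ratio)
```

The resulting Series can be plotted directly with `ratio.plot(kind='bar')`.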

In [None]:
plt.figure(figsize = (10,4))







plt.show()

## Significance and uncertainties

The plot below shows that customers with an Interbank account are more likely to terminate their contract than customers with a Volkskasse account. (It is the same plot you created before, except that here we use the deduced quantities.) **The crucial question is: Is this increase in terminations significant?**

In [None]:
plt.figure(figsize=(10,4))
ax = df.groupby('bank').first()['bank_r'].plot(kind='bar')
ax.set_xlabel("Customer's Bank")
ax.set_ylabel("Ratio of terminated contracts")
plt.show()

**Consider the following example:** Imagine your dataset contains a few hundred thousand customers, but only 10 with a bank account from "Landwirtschaftskasse Gammelsberg". Out of these 10 customers, 2 have terminated their contract. Thus, the rate of terminated contracts for "Landwirtschaftskasse Gammelsberg" (20%) is far higher than the 3%–8% we've seen above. It is very likely that this effect is just a statistical fluctuation. The number of customers with an account at "Landwirtschaftskasse Gammelsberg" is too low to draw any conclusions.
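
To put a number on this, the sketch below estimates the uncertainty of the 20% rate. It assumes the simple binomial approximation $\sqrt{r(1-r)/n}$, the same formula that the `add_ratios` helper later in this notebook uses; for a sample of only 10 customers this approximation is crude, so treat the result as an order of magnitude.

```python
import numpy as np

n_customers = 10    # customers of the bank in the example
n_terminated = 2    # of which terminated their contract

rate = n_terminated / n_customers
# Binomial approximation for the uncertainty of a rate
sigma = np.sqrt(rate * (1 - rate) / n_customers)

print(f"rate = {rate:.2f} +/- {sigma:.2f}")  # rate = 0.20 +/- 0.13
```

With an uncertainty of about 13 percentage points, the 20% rate lies roughly one standard deviation above the upper end of the 3%–8% range, which is well within what chance alone can produce.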

The quantities we are dealing with here are basically measurements, and nothing can ever be measured with perfect precision. Each measurement therefore comes with an **uncertainty**, which quantifies how precisely (or rather imprecisely) the quantity is known.

This is closely related to **[significance](https://en.wikipedia.org/wiki/Statistical_significance)**. In this example, the significance tells us whether the increased termination ratio reflects that Interbank customers really are more likely to terminate their contracts, or whether it is merely a fluctuation due to low statistics. In general, the larger the statistical sample, the higher the significance of the conclusions we can draw from it.
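
One common way to express this is to divide the difference of two ratios by their combined uncertainty. The sketch below does so for two hypothetical banks; the counts are invented, the helper `ratio_with_uncertainty` is not part of the notebook, and the binomial approximation for the uncertainties is an assumption.

```python
import numpy as np

def ratio_with_uncertainty(n_terminated, n_total):
    """Return a termination ratio and its binomial uncertainty (hypothetical helper)."""
    r = n_terminated / n_total
    return r, np.sqrt(r * (1 - r) / n_total)

# Invented counts for two banks with 10,000 customers each
r_a, s_a = ratio_with_uncertainty(800, 10_000)   # 8% terminated
r_b, s_b = ratio_with_uncertainty(300, 10_000)   # 3% terminated

# Difference in units of the combined uncertainty
# (the two uncertainties are added in quadrature)
z = (r_a - r_b) / np.hypot(s_a, s_b)
print(z)
```

A value of `z` far above 1 indicates that the difference is unlikely to be a mere fluctuation; values around 1 or below are compatible with chance.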

### Example from the customer dataset

The following example is our plot from above with the added uncertainties of each ratio (as a black bar). The height of the bar quantifies how much we expect the ratio to fluctuate just from statistical effects. 

In [None]:
plt.figure(figsize=(10,4))
ax = df.groupby('bank').first()['bank_r'].plot(kind='bar', yerr=df.groupby('bank').first()['bank_s'])
ax.set_xlabel("Customer's Bank")
ax.set_ylabel("Ratio of terminated contracts")

_Notes:_ 
- Here, we use the columns `bank_r` and `bank_s` we saw earlier. These columns carry the pre-calculated termination ratios for the corresponding bank (`bank_r`) and the corresponding uncertainty of this ratio (`bank_s`).

We can see how significance depends on the sample size if we consider a smaller sample. Here, we randomly select a 1% subset of our original data and recalculate the `bank_r` and `bank_s` quantities. 
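
Before running the real comparison, it may help to see what to expect. Assuming the binomial formula $\sqrt{r(1-r)/n}$ for the uncertainty, a sample that is 100 times smaller gives uncertainties that are 10 times larger. The ratio `r` and the customer numbers below are hypothetical.

```python
import numpy as np

r = 0.05                   # hypothetical termination ratio
n_full = 20_000            # hypothetical number of customers of one bank
n_small = n_full // 100    # the same bank in a 1% subset

s_full = np.sqrt(r * (1 - r) / n_full)
s_small = np.sqrt(r * (1 - r) / n_small)

# The uncertainty scales with 1 / sqrt(n)
print(s_full, s_small, s_small / s_full)
```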

In [None]:
# This function comes from the dataset generation notebook.
# It adds the three columns of deduced quantities for one feature.
# We use it to generate new features for a subset of our dataset.

def add_ratios(df, column):

    n1 = df[df['churn'] == True].groupby(column)[column].count()
    n2 = df[df['churn'] == False].groupby(column)[column].count()
    r=n1/n2
    n=n1+n2
    s = np.sqrt((r*(1-r)/n))
   
    index = np.arange(len(df.groupby(column)[column].count().index))+1
    dtest = pd.DataFrame(np.transpose([r,s,index]))
    dtest.columns=[column+'_r',column+'_s',column+'_n']
    dtest.index=df.groupby(column)[column].count().index
    return df.join(dtest, on=column)

In [None]:
df_small_sample = df.sample(frac=0.01)[['age', 'amount', 'bank', 'churn', 'contacts', 'd_amount', 'd_pay',
       'mail', 'pay', 'size', 'year']]

# Add deduced quantities
df_small_sample = add_ratios(df_small_sample, 'bank')

plt.figure(figsize=(10,4))
ax = df_small_sample.groupby('bank').first()['bank_r'].plot(kind='bar', yerr=df_small_sample.groupby('bank').first()['bank_s'])
ax.set_xlabel("Customer's Bank")
ax.set_ylabel("Ratio of terminated contracts")
df_small_sample.size

We get ratios that clearly differ from the earlier values. The uncertainty bars are larger, showing that we are less certain about the ratios. The difference between the ratio for "Solidbank" and those for "Sparbank", "Stadtbank", and "Volkskasse" is no longer statistically significant.

Try to adapt the same plots to check out 
* how the e-mail providers are distributed between terminated and loyal customers
* and how certain you are with these ratios

In [None]:
## Let's start










---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_