# Machine Learning Models
We will now apply several machine learning models to our data. First, we import the Python packages needed for model building and validation.

* [Data Import and Preparation](#Fetch-the-data)
* Data Exploration (see notebooks [churn-1](churn-1-exploration.ipynb) and [churn-2](churn-2-exploration-II.ipynb))
* [Feature Selection](#Feature-Selection) and Engineering
* ...
* [Exercise](#Exercise): It will be your task to finish the pipeline.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Fetch the data
You can now choose between different sets of clients, each with different issues to solve. We will start with a very basic sample; you can then try out the others.

In [None]:
input_file = '../../.assets/data/churn/churn_persona.pkl.zip'
try:
    df = pd.read_pickle(input_file)
    print('SUCCESS: Everything seems fine, we are good to go.')
except FileNotFoundError:
    print(f'ERROR: File {input_file} not found. Did you forget to run the create_churn_persona notebook first?')

In [None]:
# Print the columns in the dataset
df.columns

## Data Preparation

This step ensures data quality. Some operations on the datasets may have produced NaN values (e.g. from divide-by-zero operations). We replace them with 0 here, as they might confuse our machine learning algorithms.

In [None]:
# Replace NaN values in the affected columns with 0
nan_columns = ['mail_r', 'mail_s', 'bank_r', 'bank_s', 'contacts_r', 'contacts_s']
df[nan_columns] = df[nan_columns].fillna(0)
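
To illustrate where such NaN values can come from, here is a minimal sketch on a toy DataFrame. The column names `received`, `sent` and `ratio` are hypothetical and not part of the churn dataset. The sketch also replaces infinite values, which is an extra assumption beyond what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one client has zero activity in both columns.
toy = pd.DataFrame({"received": [4, 0, 2], "sent": [2, 0, 0]})

# In pandas, 0 / 0 yields NaN and 2 / 0 yields inf; no exception is raised.
toy["ratio"] = toy["received"] / toy["sent"]

# Assumption: treat inf like NaN, then fill every missing value with 0.
toy["ratio"] = toy["ratio"].replace([np.inf, -np.inf], np.nan).fillna(0)
print(toy)
```

Whether 0 is a sensible replacement depends on what the column means, so it is worth checking this per feature.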

## Feature Selection
We have taken a very close look at our data. Here you can select the relevant features from our dataset. In this case, you might choose to take them all into account. In practice, you might want to select only the most important ones, since real-world data can be nearly unlimited while resources are not.

In [None]:
# Comment or uncomment lines to select the features you want.
# Keep the "churn" variable. It is needed for the training.

training_features = [
    'age',
    'amount',
    'churn', # we will delete it later from our data, as we want to predict it
    'contacts',
    'd_amount',
    'd_pay',
    'pay',
    'size',
    'year',
    'bank_r',
    'bank_s',
    'bank_n',
    'mail_r',
    'mail_s',
    'mail_n',
    'contacts_r',
    'contacts_s',
    'contacts_n'
]
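
If you want to narrow the list down to the most important features, one possible approach is to rank them with the feature importances of a tree ensemble. The sketch below assumes scikit-learn is installed and uses synthetic data from `make_classification` as a stand-in. The names `X_demo`, `y_demo`, `cols` and `top_features` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the churn data (not the real dataset).
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
cols = [f"f{i}" for i in range(6)]
X_demo = pd.DataFrame(X_arr, columns=cols)

# Fit a random forest and read off its impurity-based importances.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Rank the features and keep the three highest-ranked ones.
ranking = pd.Series(forest.feature_importances_, index=cols).sort_values(ascending=False)
top_features = ranking.head(3).index.tolist()
print(ranking)
```

Impurity-based importances are only one heuristic; they can favor features with many distinct values, so treat the ranking as a hint rather than a final answer.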

## Variables and results
We now split our dataset into the variables used for our predictive model and the result that should be predicted (our churn state). We call the variables X and the results y.

In the first line of this block, all rows containing a NaN value are dropped.

In [None]:
X = df[training_features].dropna()
y = X.churn
X.drop('churn', axis=1, inplace=True)
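
The reason for keeping `churn` in the feature list until after `dropna()` can be seen in a small sketch: rows are removed from the features and the target together, so both stay aligned. The toy values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in two different rows (made-up data).
toy = pd.DataFrame({
    "age": [30, np.nan, 45, 52],
    "amount": [10.0, 20.0, np.nan, 40.0],
    "churn": [0, 1, 0, 1],
})

# Drop incomplete rows while the target is still part of the frame.
X_toy = toy[["age", "amount", "churn"]].dropna()
y_toy = X_toy["churn"]
X_toy = X_toy.drop(columns="churn")

# Features and target share the same index after the split.
print(X_toy.shape, y_toy.shape)
print(y_toy.value_counts(normalize=True))
```

The last line prints the class balance, which is a useful check before training because churn datasets are often imbalanced.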

## Exercise
Set up the machine learning pipeline.

1. Prepare the dataset for validation by performing a reasonable `train-test-split`.
2. Define the ML model you want to use and set some standard hyperparameters.
3. Perform the training by fitting the model to your training data. Try to find a way to add a `sample_weight` in this step.
4. Do a proper validation using hypothesis tests, ROC curves, a confusion matrix, scores, and feature importance.
5. Save your model to disk.
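
As a hint, the steps above could be sketched as follows. This is one possible setup, not the intended solution: it assumes scikit-learn and joblib are installed, uses synthetic data instead of `X` and `y`, and picks a random forest with arbitrary hyperparameters. The file name `churn_model.joblib` is hypothetical. Hypothesis tests and ROC plots are not shown.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced stand-in for X and y (not the churn data).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42
)

# 1. Stratified split keeps the class ratio equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=42
)

# 2. Model with some standard hyperparameters (arbitrary choices).
model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42)

# 3. Fit with per-sample weights that upweight the minority class.
weights = compute_sample_weight(class_weight="balanced", y=y_train)
model.fit(X_train, y_train, sample_weight=weights)

# 4. Validation: ROC AUC, confusion matrix, feature importances.
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
cm = confusion_matrix(y_test, model.predict(X_test))
importances = model.feature_importances_
print(f"ROC AUC: {auc:.3f}")
print(cm)

# 5. Save the model to disk and load it again as a sanity check.
model_path = os.path.join(tempfile.mkdtemp(), "churn_model.joblib")
joblib.dump(model, model_path)
restored = joblib.load(model_path)
```

To adapt the sketch, replace `X_syn` and `y_syn` with `X` and `y` from above and choose a save location that fits your project layout.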

In [None]:
# Let's start







---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_