Predict Customer Churn Using Production-Level Code

This repository contains a project that analyzes and predicts Customer Churn using the Credit Card Customers dataset from Kaggle.

Data analysis, modeling and inference are implemented in the project to end up with an interpretable model that is also able to predict customer churn. However, the focus of the project is neither the business case nor the analysis and modeling techniques; instead, the goal is to provide a boilerplate that shows how to transform a research notebook into development/production code using as few additional tools as possible. Concretely, the research notebook churn_notebook.ipynb is transformed into a Python package contained in customer_churn.

Note that not all aspects necessary in an ML workflow are covered:

  • The Exploratory Data Analysis (EDA) and Feature Engineering (FE) done during the data processing are very simple.
  • The generated artifacts are not tracked or versioned (e.g., model-pipelines, processed data, etc.)
  • No thorough deployment or API is created, although Docker containerization is introduced.
  • etc.

In other words, you can re-use the code of this repository as a template, but for small projects only. If you have something bigger on your hands and are interested in some of the aforementioned topics, you can visit my other boilerplate projects listed in the section Interesting Links. However, let me reiterate the leitmotiv of this repository: production-level data science code with as few tools as possible 😄

The starter code comes from a Udacity Machine Learning DevOps Engineer Nanodegree project, although the initial code was substantially changed.

Overview of contents:

  • Project Description
  • Files and Workflow Description
  • How to Use This
  • Limitations and Alternatives of This Boilerplate
  • Possible Improvements
  • Interesting Links
  • Authorship

Project Description

In the project, credit card customers that are most likely to churn are identified.
A first version of the data analysis and modeling is provided in the notebook churn_notebook.ipynb. As mentioned, the main goal of the project is to transform that notebook content into a production-ready package, applying clean code principles and tools:

  • Readable, simple, concise code
  • PEP8 conventions applied, checked with pylint and autopep8
  • Modular and efficient code, with Object-Oriented patterns
  • Documentation provided at different stages: code, README files, etc.
  • Error/exception handling
  • Execution and data testing with pytest
  • Logging implemented during production execution and testing
  • Installable python package
  • Basic containerization with Docker

The Credit Card Customers dataset from Kaggle used in the project has the following properties:

  • Original dataset size: 22 columns, 10127 data points.
  • It consists of categorical and numerical features of clients who churn (i.e., cease to be clients) or not; the churn status is the binary classification target used for training, encoded in the feature Attrition_Flag.
  • Number of features after data processing is carried out: 19.
  • Modeling used on the data: logistic regression, support vector machines and random forests with grid search.

More details on the dataset can be found in data/README.md.
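
For orientation, loading the data and deriving the binary target boils down to something like the following sketch; the concrete column values are assumed from the Kaggle dataset and the derived column name is hypothetical, so the actual code in the notebook/package may differ:

# Hypothetical sketch: load the dataset and derive the binary churn target.
# Column values assumed from the Kaggle Credit Card Customers dataset.
import pandas as pd

df = pd.read_csv("./data/bank_data.csv")
print(df.shape)  # roughly (10127, 22) before processing

# 1 = churned (attrited) customer, 0 = existing customer
df["Churn"] = (df["Attrition_Flag"] == "Attrited Customer").astype(int)
print(df["Churn"].value_counts())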

Files and Workflow Description

The repository contains the following files and folders:

.
├── Dockerfile                          # Docker image definition
├── Instructions.md                     # Original instructions from Udacity (irrelevant here)
├── Makefile                            # Simple Makefile for common chores: clean, etc.
├── README.md                           # This file
├── churn_notebook.ipynb                # Research notebook
├── config.yaml                         # Configuration file for production code
├── customer_churn                      # Production library folder, package
│   ├── __init__.py
│   ├── churn_library.py                # Production library
│   └── transformations.py              # Utilities for the library
├── data
│   ├── README.md                       # Explanation of dataset columns
│   └── bank_data.csv                   # Dataset
├── main.py                             # Executable of the production code
├── pics
│   ├── pipeline_diagram.png            # Pipeline diagram
│   └── sequencediagram_org.jpeg        # Original/old sequence diagram from Udacity (irrelevant here)
├── requirements.txt                    # Dependencies
├── setup.py
└── tests                               # Pytest testing scripts
    ├── __init__.py
    ├── conftest.py                     # Pytest fixtures
    └── test_churn_library.py           # Tests of the module churn_library.py

All the research work of the project is contained in the notebook churn_notebook.ipynb; in particular simplified implementations of the typical data processing and modeling tasks are performed:

  • Dataset loading
  • Exploratory Data Analysis (EDA) and Data Cleaning
  • Feature Engineering (FE)
  • Training: Logistic Regression, Support Vector Machine and Random Forest models are fit and persisted as pickles.
  • Generation of classification report plots.

The code from churn_notebook.ipynb has been transformed to create the package customer_churn, which contains two files:

  • churn_library.py: this file contains most of the refactored and modified code from the notebook.
  • transformations.py: definition of auxiliary transformations used in the data processing.

Additionally, a tests folder is provided, which contains test_churn_library.py. This script performs unit tests on the different functions of churn_library.py using pytest.

The executable or main function is provided in main.py; this script imports the package customer_churn and runs three functions from churn_library.py (a sketch of how they fit together follows the list):

  • run_setup(): the configuration file config.yaml is loaded and auxiliary folders are created, if not there yet:
    • images: it will contain the images of the EDA and the model evaluation.
    • models: it will contain the inference models/pipelines as serialized pickles.
    • artifacts: it will contain the data processing parameters created during the training and required for the inference, serialized as pickles.
  • run_training(): it performs the EDA, the data processing and the data modeling, and it generates the inference artifacts (the model/pipeline).
  • run_inference(): it shows how the inference artifacts need to be used to perform a prediction; an exemplary dataset sample created during the training is used.
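
The following is a minimal sketch of how main.py might wire these functions together; the exact signatures (e.g., the produce_images parameter of run_training()) are assumptions rather than the actual implementation:

# Hypothetical sketch of main.py; the real signatures in churn_library.py may differ.
import argparse

from customer_churn import churn_library as cl

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Customer churn: training and inference")
    parser.add_argument("--produce_images", type=int, default=1,
                        help="Set to 0 to switch off EDA and evaluation images")
    args = parser.parse_args()

    cl.run_setup()                                             # load config.yaml, create folders
    cl.run_training(produce_images=bool(args.produce_images))  # EDA, processing, modeling
    cl.run_inference()                                         # score an exemplary sample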

The following diagram shows the workflow:

Pipeline diagram (pics/pipeline_diagram.png)

A central property of the package is that run_training() and run_inference() share the function perform_data_processing(). When perform_data_processing() is executed in run_training(), it generates the processing parameters and stores them to disk. In contrast, when perform_data_processing() is executed in run_inference(), it loads those stored parameters to perform the data processing for the inference.
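
The following sketch illustrates that fit-and-persist vs. load-and-apply pattern; it is not the actual implementation (the parameter contents, file names and the train flag are assumptions), only the structure matches the description above:

# Illustrative sketch of the shared processing function; names and parameter
# contents are assumed, only the pattern matches the description above.
import pickle

def perform_data_processing(df, train=True, params_path="./artifacts/processing_params.pkl"):
    if train:
        # Training: compute the processing parameters and persist them for inference
        params = {"fill_values": df.mean(numeric_only=True)}
        with open(params_path, "wb") as file:
            pickle.dump(params, file)
    else:
        # Inference: load the parameters generated during training
        with open(params_path, "rb") as file:
            params = pickle.load(file)
    # Apply the same transformation in both modes
    return df.fillna(params["fill_values"])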

Note that the implemented run_inference() is exemplary and needs to be adapted:

  • Currently, it is triggered manually and it scores a sample dataset from a CSV file offline; instead, we should wait for external requests that feed new data to be scored.
  • The data processing parameters and the model should be loaded once in the beginning (hence, the dashed box) and used every time new data is scored.

Those loose ends are to be tied when deploying the model, since their final form depends on the deployment technologies adopted. Even though containerization with Docker is briefly shown in the section Docker Image, deployment is not really within the scope of this repository.

How to Use This

Running the Package

First, create an environment (e.g., with conda) and install the dependencies:

cd /path/to/repository/folder
# Create the environment
conda create -n my-env python=3.8
conda activate my-env
conda install pip
# Check that the pip we use points to our environment
which pip
# Install dependencies
pip install -r requirements.txt

Then, we execute the main file:

python main.py

... and all the magic happens, as explained in the previous section.

Note that the EDA and the model evaluation are optional and can be controlled with the argument --produce_images in main.py; by default, they are active. If we'd like to switch them off, we need to run:

python main.py --produce_images 0

Switching off EDA and evaluation is important for the Docker container: saving matplotlib figures in a container requires a more sophisticated setup than the one explained in the section Docker Image.

Optionally, you can install the package in your environment:

cd /path/to/repository/folder
conda activate my-env
# The first installation
pip install .
# If we change things in the package
pip install --upgrade .

Once the package is installed, we can work from anywhere, as long as we have the dataset and the configuration file:

cd /path/to/new/folder
cp -r /path/to/repository/folder/data .
cp /path/to/repository/folder/main.py .
cp /path/to/repository/folder/config.yaml .
python main.py

Testing the Package

To test the functions from churn_library.py using pytest:

cd /path/to/repository/folder
python tests/test_churn_library.py

This command not only tests the package, but it also fully runs the training and inference pipelines, generating models and other artifacts.

Note that, since the logging module is used, it is not enough to simply run pytest in the terminal; we need to run the command above to test the package.
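
For orientation, the tests follow the usual pytest style with fixtures defined in conftest.py; the following is a hypothetical, simplified example (the fixture, function and log file names are assumptions, not the actual ones in the repository):

# Hypothetical, simplified test in the style of tests/test_churn_library.py;
# the real fixtures live in conftest.py and the function names may differ.
import logging

import pytest
from customer_churn import churn_library as cl

logging.basicConfig(filename="./test_results.log", level=logging.INFO)  # assumed log file

@pytest.fixture
def dataset_path():
    return "./data/bank_data.csv"

def test_import_data(dataset_path):
    df = cl.import_data(dataset_path)  # assumed data loading function
    logging.info("Dataset loaded with shape %s", df.shape)
    assert df.shape[0] > 0
    assert df.shape[1] > 0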

If you would like to check the PEP8 conformity of the files:

pylint customer_churn/churn_library.py # should yield 8.07/10
pylint tests/test_churn_library.py # should yield 8.49/10

Tip: If you'd like to automatically edit and improve the score of a file that already has a good score, you can try autopep8:

autopep8 --in-place --aggressive --aggressive customer_churn/churn_library.py

Good Ol' make

Because the C universe already had most of the required tools.

# If you'd like to delete the models, artifacts, folders, etc.
# created when the training pipelines are run
make clean

Docker Image

The Dockerfile defines a simple image based on Python 3.8 which can be built and executed as a container:

# Build image
docker image build -t customer_churn_app .
# Run with shell access and remove container when finished
docker container run -it --rm --name customer_churn_app customer_churn_app

As explained in Running the Package, the EDA and the model evaluation are switched off for the Docker image, because saving matplotlib images requires a more sophisticated setup than the one provided.

Limitations and Alternatives of This Boilerplate

  • No tracking of datasets / models / experiments / artifacts is done. Doing that is fundamental if we want to scale, and it implies using additional tools, such as Weights and Biases. I have another boilerplate where that is done: music_genre_classification.
  • This boilerplate works for small datasets/projects, which are not that uncommon in small/medium enterprises; in my experience, its structure is easy to understand, implement and adapt. However, as the complexity increases, it is recommended to apply these changes to the architecture:
    • All data processing steps should be written in an Object Oriented style and packed into a Scikit-Learn Pipeline (or similar), as done with transformations.py.
    • Any data processing that must be applied to new data should be integrated in the inference pipeline generated in train_models(); that means that we should integrate most of the content in perform_data_processing() as a Pipeline in train_models() (see the sketch after this list). Thus, perform_data_processing() would be reduced to basic tasks related to cleaning (e.g., duplicate removal) and checking.
  • No serving via an API is discussed.
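
A minimal sketch of that suggested architecture, assuming scikit-learn's ColumnTransformer and Pipeline (this is not code from the repository, and the function and column names are placeholders):

# Assumed sketch of packing processing + model into a single scikit-learn
# Pipeline, so that one serialized object covers training and inference.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_model_pipeline(categorical_cols, numerical_cols):
    processing = ColumnTransformer([
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("numerical", StandardScaler(), numerical_cols),
    ])
    # Persisting this single pipeline captures both the processing parameters
    # and the fitted model.
    return Pipeline([
        ("processing", processing),
        ("model", LogisticRegression(max_iter=1000)),
    ])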

Please visit the section Interesting Links, where I list further resources related to those points.

Possible Improvements

  • Add dependencies and libraries to README.md.
  • Re-factor data processing functions to work as transformer classes.
  • Create two separate simplified pipelines: training and inference.
  • Re-factor data processing function to be able to run it during training or inference.
  • Add logging to pipeline executions, too, not only to the testing module.
  • Create a conftest.py file for testing.
  • Move function parameters to a YAML configuration file.
  • Create a python package.
  • Update README.md with new contents: new files, new execution commands, etc.
  • Draw new sequence diagram.
  • Create a docker image.
  • Explain the columns from the dataset in data/README.md.
  • Add function load_processing_parameters() for run_inference() before perform_data_processing(); that way, we pass the parameters to the processing function, the same way we pass the model-pipeline to predict().
  • Address the fact that load_processing_parameters() and load_model_pipeline() should be run only once at the beginning.
  • Further parametrize all hard-coded variables from conftest.py and include them in config.yaml; a config object should be passed to the functions in churn_library.py to avoid hard-coding.
  • Especially for classification problems, the training/inference pipeline should be able to deal with imbalanced datasets (e.g., performing trainings with over/undersampling variations in addition to the parameter search); that aspect is not covered yet.
  • Feature selection could be added as a step; however, that probably means re-defining the spot where the PolynomialFeatures are applied now.
  • Further re-factor functions; e.g., some EDA plot generations contain repeated computations that can be parametrized.
  • Work towards pylint score of 10/10. However, note that some variable names were chosen to be non-PEP8-conform due to their popular use in the field (e.g., X_train).

Interesting Links

Authorship

Mikel Sagardia, 2022.
No guarantees.

The original content of this project was taken from the Udacity Machine Learning DevOps Engineer Nanodegree, but it has been modified and extended to the present state.
