Apply unsupervised learning techniques to identify segments of a customer base
Apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company. These segments can then be used to direct marketing campaigns toward audiences that will have the highest expected rate of returns.
We use two datasets: a demographics data for the general population and another demographics data for the customers of the mail-order company.
- Step 1: We cluster the general population into groups with similar demographic properties.
- Step 2: We see how the people in the customers dataset fit into the clusters created previously.
The hope is that certain clusters are over-represented in the customer data, as compared to the general population. They will be assument to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.
We compute the proportion of data in each cluster for the general population and the customer data:
The over-represented clusters in the customers dataset are the clusters 3, 13 and 20.
The under-represented ones the clusters 4, 6, 10 and 12.
In reality, we do not have higher proportion of persons in a cluster for the customer data compared to the general population. So it suggests that the people we are looking for are outside of the target demographics.
-
Preprocessing
- Missing values have been re-encoded. The columns with large amount of missing values have been removed.
- Mixed-type features have been re-engineered
- The data has been splitted according to how much data is missing. The subsets have compared. One has been selected.
- Categorical features have been explored and transformed.
-
Feature Transformation
- We applied feature scaling and used Imputation to remove remaining missing values.
- Principal component analysis has been applied to the data. We performed a variability analysis.
-
Clustering
- We tested multiple cluster counts, then made a decision on the number of clusters to be used.
After all that on the general demographics data, we applied cleaning, feature transformation, dimensionality reduction, and clustering models to the customer demographics data.
This project requires Python 3.x and the following Python libraries installed:
You will also need to have software installed to run and execute an iPython Notebook
We recommend to install Anaconda, a pre-packaged Python distribution that contains all of the necessary libraries and software for this project.
The code is provided in the Identify_Customer_Segments.ipynb
notebook file.
In a terminal or command window, navigate to the top-level project directory identify_customer_segments/
(that contains this README) and run one of the following commands:
ipython notebook Identify_Customer_Segments.ipynb
or
jupyter notebook Identify_Customer_Segments.ipynb
This will open the iPython Notebook software and project file in your browser.
The proprietary data we use has been provided by Bertelsmann Arvato Analytics, and consequently is not available here.