oelghira/Customer-Segmentation-from-RFM-Clustering

Customer Segmentation from RFM Clustering

I describe the methods used to segment customers of a Brazilian online retailer via K-means clustering of their recency, frequency, and monetary value of purchases.

Introduction

In today's world of online retail, it is critical to a business's survival to know who its customers are and how those customers behave. Today's consumers develop a relationship with the brands and retailers they choose to patronize. For that relationship not to end in a breakup, retailers must provide a more personal touch, with targeted communication that speaks to the individual needs of each customer. The problem with a personalized approach is that an online retailer can serve a far wider range of consumers than a brick-and-mortar store in a fixed location. Creating something personalized for each individual is a nearly impossible task, but the good news is that we can gain an in-depth understanding of the customer base by clustering customers with similar behavior! In this example, I will show how clustering can support better targeted marketing, using how recently customers last purchased, how frequently they have shopped, and how much they have spent with a Brazilian, Amazon-style online retailer.

Data

https://www.kaggle.com/olistbr/brazilian-ecommerce

All data was provided via Kaggle using the link above. The datasets and descriptions of the relevant data used are as follows:

olist_orders_dataset
"order_id" unique identifier of the order.
"customer_id" key to the customer dataset. Each order has a unique customer_id.
"order_purchase_timestamp" Shows the purchase timestamp.

olist_customers_dataset
"customer_id" key to the orders dataset. Each order has a unique customer_id.
"customer_unique_id" unique identifier of a customer.
"customer_state" customer state.

olist_order_payments_dataset
"order_id" unique identifier of an order.
"payment_sequential" a customer may pay for an order with more than one payment method. If they do so, a sequence is created to accommodate all payments.
"payment_type" method of payment chosen by the customer.
"payment_installments" number of installments chosen by the customer.
"payment_value" transaction value.

olist_order_items_dataset
"order_id" order unique identifier.
"order_item_id" sequential number identifying number of items included in the same order.
"product_id" product unique identifier.

olist_products_dataset
"product_id" unique product identifier.
"product_category_name" root category of product, in Portuguese.
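To make the role of each dataset concrete, here is a minimal sketch of how the three RFM variables could be derived from the columns listed above. It is not the code used for this analysis: the rows are made up, the helper name `build_rfm` is hypothetical, the timestamp format is an assumption, and a real run would read the Kaggle CSV files instead of in-memory lists.

```python
from collections import defaultdict
from datetime import datetime

# Made-up rows that mimic the columns described above (not real Olist data).
orders = [
    {"order_id": "o1", "customer_id": "c1", "order_purchase_timestamp": "2016-04-01 10:00:00"},
    {"order_id": "o2", "customer_id": "c2", "order_purchase_timestamp": "2016-04-09 09:30:00"},
    {"order_id": "o3", "customer_id": "c3", "order_purchase_timestamp": "2016-03-10 18:45:00"},
]
customers = [
    {"customer_id": "c1", "customer_unique_id": "u1"},
    {"customer_id": "c2", "customer_unique_id": "u1"},  # same person, second order
    {"customer_id": "c3", "customer_unique_id": "u2"},
]
payments = [
    {"order_id": "o1", "payment_sequential": 1, "payment_value": 50.0},
    {"order_id": "o1", "payment_sequential": 2, "payment_value": 25.0},
    {"order_id": "o2", "payment_sequential": 1, "payment_value": 100.0},
    {"order_id": "o3", "payment_sequential": 1, "payment_value": 40.0},
]


def build_rfm(orders, customers, payments):
    """Return {customer_unique_id: {"recency", "frequency", "monetary"}}."""
    # customer_id is per order, so map it to the person-level identifier.
    unique_id = {c["customer_id"]: c["customer_unique_id"] for c in customers}

    # An order can be paid in several parts; add them up per order.
    order_total = defaultdict(float)
    for p in payments:
        order_total[p["order_id"]] += p["payment_value"]

    purchase_days = defaultdict(list)
    spend = defaultdict(float)
    for o in orders:
        uid = unique_id[o["customer_id"]]
        # Assumed timestamp format: "YYYY-MM-DD HH:MM:SS".
        day = datetime.strptime(o["order_purchase_timestamp"], "%Y-%m-%d %H:%M:%S").date()
        purchase_days[uid].append(day)
        spend[uid] += order_total[o["order_id"]]

    # Day 0 is the most recent purchase anywhere in the data.
    day_zero = max(d for days in purchase_days.values() for d in days)
    return {
        uid: {
            "recency": (day_zero - max(days)).days,  # days since last purchase
            "frequency": len(days),                  # number of orders
            "monetary": round(spend[uid], 2),        # total amount paid
        }
        for uid, days in purchase_days.items()
    }


rfm = build_rfm(orders, customers, payments)
for uid, row in sorted(rfm.items()):
    print(uid, row)
```

Orders are tied to people through "customer_unique_id" rather than "customer_id" because, as noted above, each order carries its own "customer_id".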

Exploratory Data Analysis (EDA)

Recency

[Figure: distribution of unscaled recency]

The most recent purchase in the data was made on April 9th, 2016. This date was treated as day 0, and every other purchase date was given a numeric recency equal to the number of days it fell before day 0. For example, a purchase made on April 1st, 2016 had a numeric recency of 8 (i.e., 8 days in the past) and a purchase made on April 9th, 2015 had a numeric recency of 366 (i.e., 366 days in the past, because the interval includes February 29th, 2016).

The table below shows that most shoppers in this dataset have not returned to the site in over a year since their last purchase: even the first quartile is 372 days. Given this fact, almost all clusters or groupings will have a very long tail for recency.

| Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|
| 0 | 372 | 501 | 482.2 | 607 | 773 |

Frequency

[Figure: distribution of unscaled frequency]

An overwhelming 96% of customers shopped with this online retailer only once! Given this fact, there will not be much variation when it comes to clustering customers based on frequency.

The table below shows the summary statistics of the frequency of purchases by the customers in the dataset.

| Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1.035 | 1 | 17 |

Monetary

[Figure: distribution of unscaled payment]

As with frequency, payment (pymt) is heavily skewed to the right: most of its mass lies under $200, with a long tail reaching just under $14K.

The table below shows the summary statistics. At least 75% of all customers in the dataset spent less than $200 with the retailer.

| Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|
| $9.59 | $63.13 | $108 | $166.6 | $183.53 | $13,644.08 |

Because frequency and monetary value (pymt) are so heavily skewed, clustering has to take practicality into account. It could very easily be the case that more clusters make sense from a K-means perspective, but those additional clusters could amount to "splitting hairs" in practice. Knowing what we know, the question to ask when clustering is, "Does it really make sense to treat these customers differently?"
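The summary tables in this section can be reproduced with a few lines of standard-library Python. The sketch below is illustrative only: the frequency values are made up, the function names are hypothetical, and the quartile method (`inclusive`) is an assumption that may differ slightly from whatever tool produced the tables above.

```python
import statistics


def six_number_summary(values):
    """Min, first quartile, median, mean, third quartile, and max of a list."""
    # n=4 returns the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }


def share_of_one_time_shoppers(frequencies):
    """Fraction of customers with exactly one purchase."""
    return sum(1 for f in frequencies if f == 1) / len(frequencies)


# Made-up purchase counts for ten customers, skewed like the real data.
frequency = [1, 1, 1, 1, 1, 1, 1, 1, 2, 17]

summary = six_number_summary(frequency)
one_time_share = share_of_one_time_shoppers(frequency)
print(summary)
print(f"{one_time_share:.0%} of customers purchased once")
```

With data this skewed, the quartiles all collapse onto 1 while the mean is pulled upward by the single heavy shopper, which is the same pattern the frequency table shows.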

K-means Clustering

Before diving into the K-means clustering algorithm, each of the 3 variables (recency, frequency, payment) was scaled to put the variables into a similar context. Each variable was normalized using min-max scaling, which maps it onto a scale from 0 to 1. In min-max scaling, each individual observation is transformed by the following formula:
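$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where $x$ is an individual observation of a variable, $\min(x)$ and $\max(x)$ are the smallest and largest observed values of that variable, and $x'$ is the scaled value.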
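As a small illustration of min-max scaling as described above, the sketch below rescales the five recency values from the EDA summary table (minimum, quartiles, and maximum). The function name `min_max_scale` is hypothetical, and the handling of a constant column is an assumption added only to avoid division by zero.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant variable carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Min, first quartile, median, third quartile, and max of unscaled recency.
recency_days = [0, 372, 501, 607, 773]

scaled = min_max_scale(recency_days)
for days, value in zip(recency_days, scaled):
    print(f"{days:>3} days -> {value:.3f}")
```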

After normalizing each variable, we apply the K-means algorithm to each variable separately. It is assumed here that the reader understands the methods behind the K-means algorithm, so the details of its mechanics have been omitted.

Recency
[Figure: scree plot for scaled recency]

Based on the scree plot of the scaled recency variable, it appears that the optimal number of clusters is somewhere between 2 and 4. Because the unscaled recency variable ranges from 0 (adjusted to 0.5 for the min-max calculation) to 773, it did not make sense to use only 2 clusters. Using 2 clusters would split customers at roughly the midpoint, placing those who shopped within 387 days in one category and everyone further out in another. Using 4 clusters seems to add clusters without creating additional value. 3 seems to be the optimal number: the algorithm places the most recent customers in one cluster, those who haven't shopped in over a year in another, and those who made their last purchase almost 2 years ago or longer in a third.
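For readers who want to see what sits behind a scree plot like the ones discussed here, the following sketch runs a plain-Python version of Lloyd's K-means algorithm on a single variable and records the within-cluster sum of squares (WSS) for several values of k. It is an illustration under stated assumptions, not the original implementation: the scaled values are made up, the function names are hypothetical, and the deterministic starting centroids are a simplification (library implementations typically use random or k-means++ starts).

```python
def assign(values, centroids):
    """Label each value with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: (v - centroids[c]) ** 2)
        for v in values
    ]


def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on one variable; returns (centroids, labels, wss)."""
    data = sorted(values)
    # Assumption: start from k values spread evenly through the sorted data.
    centroids = [data[int((i + 0.5) * len(data) / k)] for i in range(k)]
    for _ in range(max_iter):
        labels = assign(values, centroids)
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:  # no centroid moved: converged
            break
        centroids = updated
    labels = assign(values, centroids)
    wss = sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))
    return centroids, labels, wss


# Made-up scaled recency values with three visible groups.
scaled_recency = [0.02, 0.05, 0.08, 0.45, 0.50, 0.55, 0.90, 0.95, 1.00]

# WSS for each candidate k; plotting these against k gives the scree plot.
wss_by_k = {k: kmeans_1d(scaled_recency, k)[2] for k in range(1, 5)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 4))

centroids_3, labels_3, wss_3 = kmeans_1d(scaled_recency, 3)
```

The "elbow" is the point where adding another cluster stops reducing WSS by much. As argued above, that point is only a starting range, and the final choice still depends on whether the extra cluster is useful in practice.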

Frequency

[Figure: scree plot for scaled frequency]

The recurring theme here is that the optimal number of clusters from our scree plot lies between 2 and 4. Given that we know 96% of customers made only 1 purchase, it does make some sense to use only 2 clusters. This would result in 1 cluster of one-time shoppers and another of repeat shoppers. In a practical sense, this is too simple a solution, because it assumes there are no other differences among customers who shopped multiple times. Using 4 clusters would put the outlier customer who shopped 17 times in a cluster of their own, and a cluster devoted to only 1 shopper adds no value. Thus, the optimal number of clusters, algorithmically and practically, seems to be 3: 1 cluster for one-time shoppers, another for those who made 2 or 3 purchases, and a final cluster for those who shopped 4 or more times.

Monetary

[Figure: scree plot for scaled payment]

The monetary value of customers is where determining the optimal number of clusters gets interesting. From our EDA, we know the spread of payment amounts is very wide, ranging from less than $10 to almost $14K. We also know that most customers are not high spenders, because the payment distribution has most of its mass closer to 0 than to the tail at the right (see the EDA section). Too few clusters would treat most customers as one segment and assume there is not much variability among those lower amounts. On the basis of what can best be applied to a targeted marketing campaign, I lean towards 4 clusters as the optimal number. Using 4 clusters separates shoppers into distinct categories of low spenders, moderate spenders, high spenders, and the highest spenders, who could be purchasing not only for themselves but also for businesses or entire families.

From Clusters to Targeted Marketing Segments

The next step is to add up the cluster "scores" from our previous exercise to get one final score made up of components from recency, frequency, and monetary value of purchases. To keep things consistent, clusters were numbered so that the cluster holding the highest values of a variable received the highest cluster number. Thus, the highest spending customers were in cluster 4 for monetary value and the most frequent shoppers were in cluster 3 for frequency. For recency the scale was reversed: the most recent shoppers were in cluster 3, while the customers who shopped furthest in the past, and who are therefore less attainable, were in cluster 1.
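The numbering convention described above can be sketched as follows. K-means returns arbitrary cluster labels, so each label is first mapped to a rank based on its centroid, with the direction flipped for recency, and the three ranks are then summed. Everything here is hypothetical: the function name, the labels, and the centroid values are made up for three example customers.

```python
def ordered_scores(labels, centroids, reverse=False):
    """Map arbitrary cluster labels to scores 1..k ordered by centroid value.

    With reverse=False the cluster with the largest centroid scores highest.
    With reverse=True the cluster with the smallest centroid scores highest,
    which is what recency needs (fewer days since purchase is better).
    """
    order = sorted(range(len(centroids)), key=lambda c: centroids[c], reverse=reverse)
    score_of = {cluster: rank for rank, cluster in enumerate(order, start=1)}
    return [score_of[lab] for lab in labels]


# Made-up K-means output for three customers (labels index into centroids).
recency_labels, recency_centroids = [0, 1, 2], [636.0, 318.0, 60.0]
frequency_labels, frequency_centroids = [2, 2, 0], [5.0, 2.4, 1.0]
monetary_labels, monetary_centroids = [1, 3, 0], [2500.0, 90.0, 900.0, 300.0]

recency_scores = ordered_scores(recency_labels, recency_centroids, reverse=True)
frequency_scores = ordered_scores(frequency_labels, frequency_centroids)
monetary_scores = ordered_scores(monetary_labels, monetary_centroids)

# The RFM score is the sum of the three component scores.
rfm_scores = [r + f + m for r, f, m in zip(recency_scores, frequency_scores, monetary_scores)]
print(rfm_scores)
```

With 3 recency clusters, 3 frequency clusters, and 4 monetary clusters, the summed score can range from 3 to 10.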

Adding our cluster scores results in the following breakdown of our customer base:

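| RFM Score | Count | Min Recency | Median Recency | Mean Recency | Max Recency | Min Frequency | Median Frequency | Mean Frequency | Max Frequency | Min Pymt | Median Pymt | Mean Pymt | Max Pymt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 28,251 | 553 | 636 | 637 | 773 | 1 | 1 | 1 | 1 | 9.59 | 87.4 | 94.5 | 200 |
| 4 | 34,642 | 367 | 492 | 500 | 739 | 1 | 1 | 1.01 | 3 | 10.1 | 103 | 131 | 585 |
| 5 | 25,381 | 0.5 | 318 | 336 | 724 | 1 | 1 | 1.03 | 3 | 11.6 | 117 | 178 | 1,571 |
| 6 | 6,068 | 29 | 315 | 329 | 732 | 1 | 1 | 1.2 | 3 | 38.2 | 305 | 446 | 7,275 |
| 7 | 1,499 | 30 | 290 | 294 | 688 | 1 | 1 | 1.51 | 6 | 73.5 | 669 | 785 | 13,664 |
| 8 | 225 | 30 | 265 | 271 | 702 | 1 | 2 | 1.92 | 9 | 320 | 1,294 | 1,541 | 6,929 |
| 9 | 26 | 179 | 293 | 291 | 494 | 2 | 4 | 4.31 | 17 | 604 | 1,104 | 1,684 | 7,572 |
| 10 | 1 | 287 | 287 | 287 | 287 | 4 | 4 | 4 | 4 | 1,761 | 1,761 | 1,761 | 1,761 |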

We now have a breakdown of all our customers by their RFM scores. For our purposes, treating all 8 different scores as their own segment does not make sense and we will look to further group these scores into segments for targeted marketing and communication. The goal of the segmenting was to group the RFM Scores into similar groups that would result in as little change as possible to the summary statistics above. The result of our segmenting is in the table below.

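| Segment | RFM Score | Count | Min Recency | Median Recency | Mean Recency | Max Recency | Min Frequency | Median Frequency | Mean Frequency | Max Frequency | Min Pymt | Median Pymt | Mean Pymt | Max Pymt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| At Risk | 3,4 | 62,893 | 367 | 567 | 561 | 773 | 1 | 1 | 1 | 3 | 9.59 | 94.5 | 115 | 585 |
| Needs Attention | 5,6 | 31,449 | 0.5 | 317 | 335 | 732 | 1 | 1 | 1.06 | 3 | 11.6 | 149 | 230 | 7,275 |
| Promising | 7 | 1,499 | 30 | 290 | 294 | 688 | 1 | 1 | 1.51 | 6 | 73.5 | 669 | 785 | 13,664 |
| Champions | 8,9,10 | 252 | 30 | 268 | 273 | 702 | 1 | 2 | 2.17 | 17 | 320 | 1,262 | 1,557 | 7,275 |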
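The grouping of RFM scores into segments is a simple lookup. The sketch below encodes the score-to-segment mapping from the segment table and checks it against the customer counts from the score table. The variable and function names are hypothetical; the mapping and the counts come from the tables in this section.

```python
from collections import Counter

# Score-to-segment mapping taken from the segment table.
SEGMENT_BY_SCORE = {
    3: "At Risk", 4: "At Risk",
    5: "Needs Attention", 6: "Needs Attention",
    7: "Promising",
    8: "Champions", 9: "Champions", 10: "Champions",
}

# Number of customers per RFM score, taken from the score table.
score_counts = {3: 28251, 4: 34642, 5: 25381, 6: 6068, 7: 1499, 8: 225, 9: 26, 10: 1}


def segment_for(score):
    """Return the marketing segment for a summed RFM score (3 to 10)."""
    return SEGMENT_BY_SCORE[score]


# Roll the per-score counts up to the segment level.
segment_counts = Counter()
for score, count in score_counts.items():
    segment_counts[segment_for(score)] += count

for segment, count in segment_counts.items():
    print(f"{segment}: {count:,}")
```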

[Figure: RFM scores and segments]

Conclusion

Looking at our table with all segments there are many insights that help us understand the behavior of our customers.

When it comes to recency, we know that most of our customer base has not purchased recently, so it makes sense that each segment has a long tail. We notice that the "At Risk" segment is the only one with a minimum recency of a year or more. Given that most customers are neither very recent nor very frequent shoppers, it does not make much sense to treat the other 3 segments any differently from each other in terms of recency. Overall, it would be beneficial to reach out to all customers with some sort of communication to remind them of the products and services they could benefit from.

Approximately 44% of customers in the "Promising" segment and 60% in the "Champions" segment purchased multiple times. These segments most likely contain the customers with the potential to become loyal shoppers, and even to recommend the retailer by word of mouth. Those in the "Needs Attention" group overwhelmingly purchased once but did spend a decent amount when they did. This group is more likely to have found value in what they purchased than our "At Risk" group. Thus the "Needs Attention" group needs to be reminded not only that what they purchased provided a benefit to them, but also that there are many other products available that can give them the same or more value.

Focusing on monetary value, it is clear that the "At Risk" segment spent the least. The other 3 segments have a mixture of low and high spenders. In the "Needs Attention" segment, 7% of customers spent more than anyone in the "At Risk" segment. 50% of the customers in the "Promising" segment spent more than everyone in the "At Risk" group. The highest-spending customer is also in the "Promising" segment, and more communication with that customer could benefit both the retailer and the customer. 50% of the customers in the "Champions" segment spent over $1,000 and 25% of them spent more than $2,000.

Given all that we have found, it is clear which segments present the highest potential value to the retailer. From here it is possible to see what products are purchased by customers in each segment, where they reside, how they are paying, etc. These segments open many doors for the marketing and product teams to improve engagement and support the survival of the Brazilian online retailer.
