# RFM Analysis Example

RFM (Recency, Frequency, Monetary):
    
- **RECENCY (R)**: Time since last order
- **FREQUENCY (F)**: Total number of orders or average time between orders
- **MONETARY VALUE (M)**: Total or average order value

In [None]:
# import modules
import pandas as pd
import datetime as dt

In [None]:
# load the data
cust = pd.read_excel('./data/US_Data.xlsx')
cust.head()

In [None]:
# look at .info()
cust.info()

Let's look at some basic information about this data.

* How many unique customers?
* How many invoices for each customer?

In [None]:
# How many unique customers?
cust.CustomerID.nunique()

In [None]:
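# How many invoices for each customer?
# NOTE: count() counts rows, so an invoice that spans several rows is counted
#       once per row; nunique() would count distinct invoice numbers instead.
cust.groupby('CustomerID')['InvoiceNo'].count().sort_values(ascending=False)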

## Now, we'll go through the RFM analysis step by step.

### Recency

In [None]:
# get the reference date - the last date we have in data set
ref_date = cust.InvoiceDate.dt.date.max()
ref_date

In [None]:
# We currently have timestamps for invoice date
# convert InvoiceDate to date only to simplify
cust.InvoiceDate = cust.InvoiceDate.dt.date

In [None]:
cust.head()

We want to know, for each customer, how many days passed between their last purchase and our reference date, which we called `ref_date`. Let's do it in steps.

In [None]:
# What was the most recent date that each customer made a purchase?
cust.groupby('CustomerID')['InvoiceDate'].max()

In [None]:
# We can subtract that date found above from ref_date
# NOTE: What is the dtype of resulting Series object?
#       Will that be a problem?
ref_date - cust.groupby('CustomerID')['InvoiceDate'].max()

In [None]:
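# Let's just pull out the days and see if it will be an int
# InvoiceDate now holds Python date objects, so the difference may come back
# with object dtype; pd.to_timedelta() makes .dt.days available either way
pd.to_timedelta(ref_date - cust.groupby('CustomerID')['InvoiceDate'].max()).dt.days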

In [None]:
# You can also use a lambda function
# Storing Series in a variable named recency
recency = cust.groupby('CustomerID')['InvoiceDate'].apply(lambda x: (ref_date - x.max()).days)

recency

In [None]:
# Change the name of the Series to 'Recency'
recency.name = 'Recency'
recency.info()
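
The dtype question raised in the cells above can be sidestepped by keeping `InvoiceDate` as a `datetime64` column and truncating the time with `.dt.normalize()` instead of converting to Python `date` objects. The following is a minimal sketch of that alternative, not the notebook's own method. The `toy` frame, its values, and the names `toy_ref_date` and `toy_recency` are hypothetical; only the column names mirror the data set.

```python
import pandas as pd

# Hypothetical toy invoices; only the column names mirror the notebook's data.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2023-01-05 09:30", "2023-03-01 14:00",
        "2023-02-10 11:15", "2023-03-10 16:45",
    ]),
})

# normalize() sets the time to midnight but keeps the datetime64 dtype.
toy["InvoiceDate"] = toy["InvoiceDate"].dt.normalize()
toy_ref_date = toy["InvoiceDate"].max()

# Timestamp minus a datetime64 Series gives timedelta64, so .dt.days works.
last_purchase = toy.groupby("CustomerID")["InvoiceDate"].max()
toy_recency = (toy_ref_date - last_purchase).dt.days.rename("Recency")
print(toy_recency)
```

With this toy data, customer 3 bought on the reference date and gets a recency of 0, while customers 1 and 2 get 9 and 28 days.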

### Frequency

We'll use the number of unique `InvoiceNo` values per customer as the frequency.

In [None]:
# Store the Series in a variable named frequency
frequency = cust.groupby('CustomerID')['InvoiceNo'].nunique() 
frequency

In [None]:
# Change the name of the Series to 'Frequency'
frequency.name = 'Frequency'
frequency.info()

### Monetary

We'll use the total amount spent by customers over the period to get the Monetary values.

In [None]:
# Store the Series in a variable named monetary
monetary = cust.groupby('CustomerID')['Amount'].sum()
monetary

In [None]:
# Change the name of the Series to 'Monetary'
monetary.name = 'Monetary'
monetary.info()

Let's combine the three RFM values into a single `DataFrame` named `rfm`.

In [None]:
rfm = pd.concat([recency, frequency, monetary], axis=1)
rfm
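
As an alternative to building three separate Series and concatenating them, the same kind of table can be sketched in a single `groupby` with named aggregation. This is an illustrative variant under stated assumptions, not the notebook's approach: the `toy` frame and its values are made up, and `toy_rfm`, `toy_ref_date`, and the intermediate `LastPurchase` column are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2023-01-05", "2023-01-05", "2023-03-01",
        "2023-02-10", "2023-02-20", "2023-03-10",
    ]),
    "Amount": [10.0, 5.0, 20.0, 7.5, 2.5, 100.0],
})
toy_ref_date = toy["InvoiceDate"].max()

# Named aggregation: one output column per (input column, function) pair.
toy_rfm = toy.groupby("CustomerID").agg(
    LastPurchase=("InvoiceDate", "max"),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("Amount", "sum"),
)

# Recency is the number of days between the reference date and the last purchase.
toy_rfm["Recency"] = (toy_ref_date - toy_rfm["LastPurchase"]).dt.days
toy_rfm = toy_rfm[["Recency", "Frequency", "Monetary"]]
print(toy_rfm)
```

Customer 1 has three rows but only two distinct invoice numbers, which is why `nunique` rather than a row count is used for the frequency.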

### RFM Scores

Now, let's assign scale scores. The goal is to map each raw RFM value in the `rfm` `DataFrame` to a score between 1 and 5. We can do this with quantiles: sort the customers by a value and split them at the 20th, 40th, 60th, and 80th percentiles. This produces 5 groups with (roughly) the same number of customers in each. Alternatively, you could pick 5 ranges of equal width, so that each group spans the same high-minus-low range. That second approach almost always produces groups that do **not** have the same number of customers.
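
The difference between the two binning strategies shows up clearly on a small, deliberately skewed series. The sketch below uses made-up values (the `values` Series is hypothetical) and compares `pd.qcut()` (quantile bins) with `pd.cut()` (equal-width bins).

```python
import pandas as pd

# Hypothetical skewed values: most are small, one is very large.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quantile bins: edges are chosen so each bin holds about the same number of rows.
by_quantile = pd.qcut(values, q=5)

# Equal-width bins: each bin spans the same range, whatever the counts are.
by_width = pd.cut(values, bins=5)

quantile_counts = by_quantile.value_counts(sort=False)
width_counts = by_width.value_counts(sort=False)
print(quantile_counts)
print(width_counts)
```

Here the quantile bins each hold two values, while the equal-width bins put nine of the ten values into the lowest bin and leave the middle bins empty.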

Let's use the quantile approach - the one that creates 5 groups, each with (approximately) the same number of customers in it. The `pandas` function `pd.qcut()` makes this easy.

In [None]:
# Let's try it and spit out how many are in each group 
# to get a better idea of what it is doing
pd.qcut(rfm.Recency, q=5).value_counts()

Remember that a **smaller** recency is better because it means that the customer has purchased an item more recently. Therefore, we will assign a 5 to the 20% of customers that have the most recent purchases, etc.

In [None]:
# create a copy of the dataframe and play with that copy
rfm_copy = rfm.copy()

In [None]:
# convert to R score
rfm_copy['Rscore'] = pd.qcut(rfm_copy.Recency, q=5, labels=[5,4,3,2,1])
rfm_copy.head()

In [None]:
rfm_copy.Rscore.value_counts()

We want to similarly assign both the frequency and monetary scores to each customer. Again, we'll use 5 quantiles. This time, however, larger numbers are better. We have to be careful with the frequency score: many customers can share the same low frequency (for example, a single order), so several quantile boundaries may fall on the same value. `qcut()` raises an error when its bin edges are not unique, so we need a different method. Instead, we can create the quantiles by first pre-ranking the column.

The next code cell will **fail**. The one following corrects this error.

In [None]:
# convert to F score -- this will FAIL
rfm_copy['Fscore'] = pd.qcut(rfm_copy.Frequency, q=5, labels=[1,2,3,4,5])
rfm_copy.head()

Let's rank each customer based on their frequency. Because there will be customers with the same frequency, we need to indicate how to break those ties. We'll use `method='first'`, which assigns tied customers distinct ranks in the order they appear in the data.
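
Both the failure and the ranking workaround can be reproduced on a toy series. This is a minimal sketch with made-up frequencies; the names `freq`, `ranks`, `fscore`, and `qcut_failed` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical frequencies: most customers ordered once, so ties dominate.
freq = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 3, 10])

try:
    pd.qcut(freq, q=5, labels=[1, 2, 3, 4, 5])
    qcut_failed = False
except ValueError as err:
    # Several quantile edges equal 1, so the bin edges are not unique.
    qcut_failed = True
    print(f"qcut failed: {err}")

# rank(method='first') gives every row a distinct rank (ties are broken by
# order of appearance), so the quantile edges computed on the ranks are unique.
ranks = freq.rank(method="first")
fscore = pd.qcut(ranks, q=5, labels=[1, 2, 3, 4, 5])
print(fscore.value_counts(sort=False))
```

One consequence worth keeping in mind: customers with identical frequencies can receive different scores. In this toy series the six customers with a single order are spread across scores 1, 2, and 3 purely by their position in the data.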

In [None]:
# Rank customers by frequency and sort from small to large
rfm_copy.Frequency.rank(method='first').sort_values()

In [None]:
# Look at the first-ranked customer (smallest frequency)
rfm_copy.loc[12821]

In [None]:
# Look at the second-ranked customer (smallest frequency)
rfm_copy.loc[12824]

In [None]:
# convert to F score by using ranking
rfm_copy['Fscore'] = pd.qcut(rfm_copy.Frequency.rank(method='first'), q=5, labels=[1,2,3,4,5])
rfm_copy.head()

We can use `qcut()` directly on our `Monetary` column to generate the `Mscore`.

In [None]:
# convert to M score
rfm_copy['Mscore'] = pd.qcut(rfm_copy.Monetary, q=5, labels=[1,2,3,4,5])
rfm_copy.head()

### Aggregating Scores

We will now create a single `RFMscore` by simply taking the average of the R, F, and M scores for each customer. We'll store it in our `DataFrame` as a new column.
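
Because `qcut()` with `labels=` returns ordered categorical columns rather than plain integers, one cautious way to average them is to cast the labels to integers first. The sketch below illustrates that step on made-up scores; the `scores` frame and its customer IDs are hypothetical.

```python
import pandas as pd

# Hypothetical scores for three customers, stored the way qcut returns them:
# as ordered categoricals rather than plain integers.
scores = pd.DataFrame(
    {
        "Rscore": pd.Categorical([5, 3, 1], categories=[5, 4, 3, 2, 1], ordered=True),
        "Fscore": pd.Categorical([4, 3, 1], categories=[1, 2, 3, 4, 5], ordered=True),
        "Mscore": pd.Categorical([5, 2, 1], categories=[1, 2, 3, 4, 5], ordered=True),
    },
    index=pd.Index([101, 102, 103], name="CustomerID"),
)

# Cast the labels to integers so the row-wise mean is plain arithmetic.
score_cols = ["Rscore", "Fscore", "Mscore"]
scores["RFMscore"] = scores[score_cols].astype(int).mean(axis=1)
print(scores)
```

The average of three integer scores is always a multiple of 1/3, so the possible `RFMscore` values run from 1.0 to 5.0 in steps of one third.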

In [None]:
# we use the average of R, F, and M scores for RFM score
# (qcut labels are categorical, so cast them to int before averaging)
rfm_copy['RFMscore'] = rfm_copy[['Rscore', 'Fscore', 'Mscore']].astype(int).mean(axis=1)

rfm_copy.head()

In [None]:
# Take a quick look at the number of each aggregated RFM score
rfm_copy.RFMscore.value_counts()

### Segmenting Customers

Segmenting your customers depends highly on what you are going to do with those segments. You may have a pre-defined number of segments with specific labels. For example, you could segment customers into the following groups:

- Frequent buyers
- Big spenders
- Platinum customers
- Gold customers, etc.

Let's (arbitrarily) create 6 customer segments according to the following criteria:

| Segment Label | RFM Score |
| ------------- | --------- |
| Basic | value &le; 1 |
| Bronze | 1 < value &le; 2 |
| Silver | 2 < value &le; 3 |
| Gold | 3 < value &le; 4 |
| Platinum | 4 < value &le; 4.5 |
| Diamond | value > 4.5 |
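
To see how the boundaries in the table behave, here is a small sketch that applies them with `pd.cut()` to a handful of made-up scores placed on and around the edges. The `toy_scores` values and the `tiers` name are hypothetical.

```python
import pandas as pd

# Hypothetical RFM scores chosen to sit on and around the tier boundaries.
toy_scores = pd.Series([1.0, 2.0, 2.33, 4.0, 4.33, 4.5, 4.67, 5.0])

tiers = pd.cut(
    toy_scores,
    bins=[0, 1, 2, 3, 4, 4.5, 5],  # edges taken from the table above
    right=True,                    # each bin includes its upper edge
    labels=["Basic", "Bronze", "Silver", "Gold", "Platinum", "Diamond"],
)
print(pd.DataFrame({"RFMscore": toy_scores, "Tier": tiers}))
```

With `right=True`, a score of exactly 4.0 is Gold and exactly 4.5 is Platinum. The 4.5 is included only to show the edge behavior: an average of three integer scores can never equal 4.5, so in practice Platinum holds only the score 13/3 (about 4.33) and Diamond holds 14/3 (about 4.67) and 5.0.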

In [None]:
# Let's try using `pd.cut()` and see what it looks like
# Give it the bin edges as a list
# Tell it to use the rightmost edge as *inclusive*
pd.cut(rfm_copy.RFMscore, bins=[0,1,2,3,4,4.5,5], right=True)    

In [None]:
# Now let's add a new column to rfm_copy
rfm_copy['LoyaltyTier'] = pd.cut(rfm_copy.RFMscore, bins=[0,1,2,3,4,4.5,5], right=True,
                                labels=['Basic','Bronze','Silver','Gold','Platinum','Diamond'])

rfm_copy

In [None]:
# How many customers are in each tier?
rfm_copy.LoyaltyTier.value_counts()

**&copy; 2023 - Present: Matthew D. Dean, Ph.D.   
Clinical Associate Professor of Business Analytics at William \& Mary.**