# RFM Analysis: Example

To figure out how to implement RFM (recency, frequency, monetary) analysis,
I'll start with a small fake dataset taken from this
[blog post](https://clevertap.com/blog/rfm-analysis/). The goal is to
reproduce the results presented in that post.

## Imports

In [1]:
from typing import Literal, cast, get_args

import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

## Fake dataset

For comparison, here's the original dataset:

![Fake dataset](./rfm_example_1.png)

Recreating this dataset using pandas:

In [2]:
df = pd.DataFrame(
    data={
        "Recency": [4, 6, 46, 23, 15, 32, 7, 50, 34, 10, 3, 1, 27, 18, 5],
        "Frequency": [6, 11, 1, 3, 4, 2, 3, 1, 15, 5, 8, 10, 3, 2, 1],
        "Monetary": [540, 940, 35, 65, 179, 56, 140, 950, 2630, 191, 845, 1510, 54, 40, 25],
    },
    index=list(range(1, 16)),
)
df.index.name = "CustomerId"
df

CustomerId,Recency,Frequency,Monetary
1,4,6,540
2,6,11,940
3,46,1,35
4,23,3,65
5,15,4,179
6,32,2,56
7,7,3,140
8,50,1,950
9,34,15,2630
10,10,5,191


## R score

We begin by calculating the R score. For comparison, these are the results
we're trying to reproduce:

![R scores](./rfm_example_2.png)

Note that the rank in the fourth row of the reference table is wrong.
Recreating the table using pandas:

In [3]:
df_r = df.loc[:, ["Recency"]]
df_r = cast(pd.DataFrame, df_r)

In [4]:
# Use `Recency` to compute the rank
df_r["Rank"] = df_r["Recency"].rank().astype(np.int_)
df_r = df_r.sort_values(by="Rank")
df_r

CustomerId,Recency,Rank
12,1,1
11,3,2
1,4,3
15,5,4
2,6,5
7,7,6
10,10,7
5,15,8
14,18,9
4,23,10


In [5]:
# Use `Rank` to compute the R score
NUM_BINS = 5
SCORE_LABELS = list(range(NUM_BINS, 0, -1))

df_r["RScore"] = pd.cut(df_r["Rank"], NUM_BINS, labels=SCORE_LABELS)
df_r["RScore"] = df_r["RScore"].cat.reorder_categories(SCORE_LABELS[::-1], ordered=True)
df_r

CustomerId,Recency,Rank,RScore
12,1,1,5
11,3,2,5
1,4,3,5
15,5,4,4
2,6,5,4
7,7,6,4
10,10,7,3
5,15,8,3
14,18,9,3
4,23,10,2


Notice that these results agree with the reference values.
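
To make the binning step more concrete, here is a minimal sketch of what `pd.cut` does with ranks 1 to 15 and five equal-width bins. The variable names are only illustrative. Because the labels are passed in descending order, the resulting categorical is ordered `5 < 4 < ... < 1`, which is why the categories are reordered afterwards.

```python
import pandas as pd

# Ranks 1..15, one per customer (the Recency values have no ties).
ranks = pd.Series(range(1, 16))

# Five equal-width bins over [1, 15]; the lowest ranks get label 5.
scores = pd.cut(ranks, 5, labels=[5, 4, 3, 2, 1])
print(scores.cat.categories.tolist())  # category order as passed to pd.cut

# Reorder so that comparisons treat 5 as the highest category.
scores = scores.cat.reorder_categories([1, 2, 3, 4, 5], ordered=True)
print(scores.astype(int).tolist())
print(scores.max())  # 5
```

Each bin ends up holding three consecutive ranks, which matches the R scores in the table above.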

## F score and M score

The next step is to calculate the F and M scores. To do so, I'll generalize
the code above. The following function can be used to compute all scores:

In [6]:
RFMAttribute = Literal["Recency", "Frequency", "Monetary"]


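def compute_score(
    df: pd.DataFrame,
    attr: RFMAttribute,
    num_bins: int = 5,
) -> pd.DataFrame:
    """Return a DataFrame with `attr` and its score, sorted from best to worst."""
    df_score = df.loc[:, [attr]]
    df_score = cast(pd.DataFrame, df_score)

    # Tied values share the same (lowest) rank, so they end up with the same score
    df_score["Rank"] = df_score[attr].rank(method="min").astype(np.int_)
    # Best customers first: lowest rank for recency, highest rank otherwise
    df_score = df_score.sort_values(by="Rank", ascending=attr == "Recency")

    score_name = f"{attr[0]}Score"
    # Low recency is good (descending labels); high frequency/monetary is good (ascending labels)
    score_labels = list(range(num_bins, 0, -1)) if attr == "Recency" else list(range(1, num_bins + 1))
    df_score[score_name] = pd.cut(df_score["Rank"], num_bins, labels=score_labels)
    if attr == "Recency":
        # Make the category order 1 < 2 < ... < num_bins, as for the other scores
        df_score[score_name] = df_score[score_name].cat.reorder_categories(score_labels[::-1], ordered=True)

    df_score = df_score.drop(columns="Rank")
    return df_score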

Compared with the previous code, the most important change is how the score
labels are chosen: their order depends on which values are desirable for each
RFM attribute.

In the case of recency, we want small values (the customer purchased
recently), so small values are assigned the best R scores. For frequency and
monetary the opposite holds: we want high values (the customer buys
frequently or spends a lot of money), so high values are assigned the best F
and M scores. The function above takes this difference into account.
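
As a small illustration of this direction flip, the sketch below scores the same toy values twice, once as if they were recency and once as if they were frequency. The helper `toy_score` and its `low_is_best` parameter are hypothetical names introduced only for this example.

```python
import pandas as pd


def toy_score(values: pd.Series, low_is_best: bool, num_bins: int = 5) -> pd.Series:
    """Hypothetical helper: rank the values, bin the ranks, return integer scores."""
    if low_is_best:
        labels = list(range(num_bins, 0, -1))  # smallest values get the top score
    else:
        labels = list(range(1, num_bins + 1))  # largest values get the top score
    ranks = values.rank(method="min")
    return pd.cut(ranks, num_bins, labels=labels).astype(int)


values = pd.Series([1, 10, 20, 30, 40])
recency_like = toy_score(values, low_is_best=True)
frequency_like = toy_score(values, low_is_best=False)
print(recency_like.tolist())    # [5, 4, 3, 2, 1]
print(frequency_like.tolist())  # [1, 2, 3, 4, 5]
```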

As a test, we'll re-calculate the R scores:

In [7]:
df_r_func = compute_score(df, "Recency")
assert_frame_equal(df_r_func, df_r.drop(columns="Rank"))
df_r_func

CustomerId,Recency,RScore
12,1,5
11,3,5
1,4,5
15,5,4
2,6,4
7,7,4
10,10,3
5,15,3
14,18,3
4,23,2


In [8]:
# Checking that categories are ordered correctly
df_r_func["RScore"]

CustomerId
12    5
11    5
1     5
15    4
2     4
7     4
10    3
5     3
14    3
4     2
13    2
6     2
9     1
3     1
8     1
Name: RScore, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [9]:
del df_r_func

Finally, let's compute the F and M scores. For comparison, these are the
values we want to obtain:

![F and M scores](./rfm_example_3.png)

Calculating the F score:

In [10]:
df_f = compute_score(df, "Frequency")
df_f

CustomerId,Frequency,FScore
9,15,5
2,11,5
12,10,5
11,8,4
1,6,4
10,5,4
5,4,3
4,3,2
7,3,2
13,3,2


Note that the results above differ from the reference values, specifically
where a frequency value isn't unique. It makes more sense for customers with
the same frequency to get the same F score. Our results follow this rule;
those in the original post don't. So the discrepancy isn't an error in our
approach: it simply handles ties more consistently.
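
The effect of the tie-breaking rule can be sketched with the Frequency column from the dataset above. With `rank(method="min")`, tied customers share a rank and therefore a score. With `rank(method="first")`, ties are broken by order of appearance, so tied customers can fall into different bins. The post doesn't say how it ranks ties, so `method="first"` is used here purely as an illustration.

```python
import pandas as pd

frequency = pd.Series(
    [6, 11, 1, 3, 4, 2, 3, 1, 15, 5, 8, 10, 3, 2, 1],
    index=pd.RangeIndex(1, 16, name="CustomerId"),
)
labels = [1, 2, 3, 4, 5]

# method="min": tied values share the lowest rank of their group.
score_min = pd.cut(frequency.rank(method="min"), 5, labels=labels).astype(int)

# method="first": ties are broken by order of appearance.
score_first = pd.cut(frequency.rank(method="first"), 5, labels=labels).astype(int)

tied = [4, 7, 13]  # all three customers have Frequency == 3
print(score_min.loc[tied].tolist())    # [2, 2, 2]
print(score_first.loc[tied].tolist())  # [2, 3, 3]
```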

In [11]:
# Checking that categories are ordered correctly
df_f["FScore"]

CustomerId
9     5
2     5
12    5
11    4
1     4
10    4
5     3
4     2
7     2
13    2
6     2
14    2
3     1
8     1
15    1
Name: FScore, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

Calculating the M score:

In [12]:
df_m = compute_score(df, "Monetary")
df_m

CustomerId,Monetary,MScore
9,2630,5
12,1510,5
8,950,5
2,940,4
11,845,4
1,540,4
10,191,3
5,179,3
7,140,3
4,65,2


Notice that these results agree with the reference values.

## RFM score

Next, we'll combine the results obtained above to calculate the RFM scores.
More precisely, we'll reproduce the following table:

![RFM scores](./rfm_example_4.png)

As already explained, our F scores differ from the reference values. For this
reason, two rows in this table won't be reproduced exactly, although our
values will be very close.

In [13]:
# Concatenate scores from different DataFrames
df_rfm = pd.concat(
    [df_r["RScore"], df_f["FScore"], df_m["MScore"]],
    axis=1,
)
df_rfm = df_rfm.sort_index()
df_rfm = df_rfm.astype(np.int_)
df_rfm

CustomerId,RScore,FScore,MScore
1,5,4,4
2,4,5,4
3,1,1,1
4,2,2,2
5,3,3,3
6,2,2,2
7,4,2,3
8,1,1,5
9,1,5,5
10,3,4,3
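
The three score DataFrames are each sorted differently, so the concatenation relies on `pd.concat` aligning rows by index label rather than by position. A minimal sketch of that behaviour, using the R and F scores of three customers from the outputs above:

```python
import pandas as pd

# Same customers, but listed in a different order in each Series.
r = pd.Series([5, 4, 3], index=[12, 15, 10], name="RScore")
f = pd.Series([5, 4, 1], index=[12, 10, 15], name="FScore")

# axis=1 places the Series side by side, matching rows by index label.
combined = pd.concat([r, f], axis=1).sort_index()
print(combined)
```

Customer 10 ends up with `RScore` 3 and `FScore` 4, and customer 15 with `RScore` 4 and `FScore` 1, regardless of the order in which the rows appeared in each input.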


In [14]:
# Compute RFM cells
df_rfm["RFMCell"] = df_rfm.agg(
    lambda r: f"{r.iloc[0]},{r.iloc[1]},{r.iloc[2]}",
    axis="columns",
)
df_rfm

CustomerId,RScore,FScore,MScore,RFMCell
1,5,4,4,544
2,4,5,4,454
3,1,1,1,111
4,2,2,2,222
5,3,3,3,333
6,2,2,2,222
7,4,2,3,423
8,1,1,5,115
9,1,5,5,155
10,3,4,3,343


In [15]:
# Compute RFM scores
df_rfm["RFMScore"] = df_rfm.iloc[:, :3].agg("mean", axis="columns")
df_rfm.loc[:, ["RFMCell", "RFMScore"]]

CustomerId,RFMCell,RFMScore
1,544,4.333333
2,454,4.333333
3,111,1.0
4,222,2.0
5,333,3.0
6,222,2.0
7,423,3.0
8,115,2.333333
9,155,3.666667
10,343,3.333333


Notice that these results agree with the reference values, except for the
rows with `CustomerId` 7 and 13. As expected, the difference is in the F
score values.
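
For reference, the cell label and the mean score can also be built without a row-wise lambda. The sketch below is one possible alternative, not the approach used in this notebook. It uses the scores of customers 1, 8 and 9 from the table above and joins them with commas, as the lambda in the earlier cell does.

```python
import pandas as pd

scores = pd.DataFrame(
    {"RScore": [5, 1, 1], "FScore": [4, 1, 5], "MScore": [4, 5, 5]},
    index=pd.Index([1, 8, 9], name="CustomerId"),
)

# Join the three scores as strings, column by column.
as_text = scores.astype(str)
cell = as_text["RScore"].str.cat([as_text["FScore"], as_text["MScore"]], sep=",")

# The RFM score is the plain average of the three scores.
rfm_score = scores.mean(axis="columns")

print(cell.tolist())  # ['5,4,4', '1,1,5', '1,5,5']
print(rfm_score.round(6).tolist())
```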

Now I'm confident that my implementation is correct. So I'll collect the
essential parts of the code above, and write a function that adds the RFM
scores to the original DataFrame.

In [16]:
def add_score_column(df: pd.DataFrame, attr: RFMAttribute, num_bins: int = 5) -> pd.DataFrame:
    score_name = f"{attr[0]}Score"
    score_labels = list(range(num_bins, 0, -1)) if attr == "Recency" else list(range(1, num_bins + 1))

    rank = df[attr].rank(method="min").astype(np.int_)
    df[score_name] = pd.cut(rank, num_bins, labels=score_labels)
    if attr == "Recency":
        df[score_name] = df[score_name].cat.reorder_categories(score_labels[::-1], ordered=True)

    return df

In [17]:
def add_rfm_scores(df: pd.DataFrame, num_bins: int = 5) -> pd.DataFrame:
    for attr in get_args(RFMAttribute):
        df = add_score_column(df, attr, num_bins)

    score_cols = [f"{attr[0]}Score" for attr in get_args(RFMAttribute)]
    df["RFMCell"] = df[score_cols].agg(lambda r: f"{r.iloc[0]},{r.iloc[1]},{r.iloc[2]}", axis="columns")
    df["RFMScore"] = df[score_cols].astype(np.int_).agg("mean", axis="columns")

    return df

In [18]:
# Quick test
df = add_rfm_scores(df)
assert_frame_equal(df[["RFMCell", "RFMScore"]], df_rfm[["RFMCell", "RFMScore"]])
df

CustomerId,Recency,Frequency,Monetary,RScore,FScore,MScore,RFMCell,RFMScore
1,4,6,540,5,4,4,544,4.333333
2,6,11,940,4,5,4,454,4.333333
3,46,1,35,1,1,1,111,1.0
4,23,3,65,2,2,2,222,2.0
5,15,4,179,3,3,3,333,3.0
6,32,2,56,2,2,2,222,2.0
7,7,3,140,4,2,3,423,3.0
8,50,1,950,1,1,5,115,2.333333
9,34,15,2630,1,5,5,155,3.666667
10,10,5,191,3,4,3,343,3.333333
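
One thing to keep in mind is that `add_score_column` assigns new columns to the DataFrame it receives, so the caller's object is modified as well. The sketch below illustrates this with a hypothetical stand-in function, `add_flag_column`, which assigns a column in the same way. Passing a copy is one way to leave the original untouched.

```python
import pandas as pd


def add_flag_column(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in: assigns into the frame it receives,
    # in the same way add_score_column does.
    df["Flag"] = df["Recency"] < 10
    return df


original = pd.DataFrame({"Recency": [4, 46]})

# Working on a copy leaves `original` unchanged.
scored_copy = add_flag_column(original.copy())
print("Flag" in original.columns)  # False

# Passing the frame itself modifies it in place and returns the same object.
scored_same = add_flag_column(original)
print(scored_same is original, "Flag" in original.columns)  # True True
```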
