# RFM Analysis
<img src="https://simplepage.vn/blog/wp-content/uploads/Mo-hinh-RFM.png" />

RFM Analysis is a technique used by data science professionals, especially in marketing, to understand and segment customers based on their buying behaviour.

Using RFM Analysis, a business can assess customers’:

- recency (the date they made their last purchase)
- frequency (how often they make purchases)
- monetary value (the amount spent on purchases)

Recency, Frequency, and Monetary value of a customer are three key metrics that provide information about customer engagement, loyalty, and value to a business.
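
As a small illustration of these three metrics, the sketch below computes them for a single customer using only the standard library. The `purchases` list and the `snapshot` date are made-up values for demonstration and are not part of the dataset used later.

```python
from datetime import date

# Hypothetical purchase history for one customer: (purchase date, amount)
purchases = [
    (date(2023, 1, 1), 150.0),
    (date(2023, 5, 15), 200.0),
]

# Assumed reference date from which recency is measured
snapshot = date(2023, 7, 1)

# Recency: days since the most recent purchase
recency = (snapshot - max(d for d, _ in purchases)).days
# Frequency: number of purchases
frequency = len(purchases)
# Monetary value: total amount spent
monetary = sum(amount for _, amount in purchases)

print(recency, frequency, monetary)
```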

## Libraries


In [71]:
import pandas as pd
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
pio.templates.default = "plotly_white"

In [72]:
# Load Data
data = pd.read_csv("rfm_data.csv")
data.head()

Unnamed: 0,CustomerID,PurchaseDate,TransactionAmount,ProductInformation,OrderID,Location
0,8814,2023-04-11,943.31,Product C,890075,Tokyo
1,2188,2023-04-11,463.7,Product A,176819,London
2,4608,2023-04-11,80.28,Product A,340062,New York
3,2559,2023-04-11,221.29,Product A,239145,London
4,9482,2023-04-11,739.56,Product A,194545,Paris


## Calculating RFM Values

In [73]:
# Calculate the Recency, Frequency, and Monetary values of the customers:
from datetime import datetime

# Sample data for demonstration
data = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'OrderID': [101, 102, 201, 202, 301],
    'PurchaseDate': ['2023-01-01', '2023-05-15', '2023-02-10', '2023-06-25', '2023-03-30'],
    'TransactionAmount': [150, 200, 50, 100, 300]
})

# Convert 'PurchaseDate' to datetime
data['PurchaseDate'] = pd.to_datetime(data['PurchaseDate'])

# Calculate Recency
data['Recency'] = (datetime.now() - data['PurchaseDate']).dt.days

# Calculate Frequency
frequency_data = data.groupby('CustomerID')['OrderID'].count().reset_index()
frequency_data.rename(columns={'OrderID': 'Frequency'}, inplace=True)

# Merge Frequency Data
data = data.merge(frequency_data, on='CustomerID', how='left')

# Calculate Monetary Value
monetary_data = data.groupby('CustomerID')['TransactionAmount'].sum().reset_index()
monetary_data.rename(columns={'TransactionAmount': 'MonetaryValue'}, inplace=True)

# Merge Monetary Data
data = data.merge(monetary_data, on='CustomerID', how='left')

# Optional: Drop duplicate rows if needed
data = data.drop_duplicates()

# Display the final DataFrame
print(data)

   CustomerID  OrderID PurchaseDate  TransactionAmount  Recency  Frequency  \
0           1      101   2023-01-01                150      633          2   
1           1      102   2023-05-15                200      499          2   
2           2      201   2023-02-10                 50      593          2   
3           2      202   2023-06-25                100      458          2   
4           3      301   2023-03-30                300      545          1   

   MonetaryValue  
0            350  
1            350  
2            150  
3            150  
4            300  


To calculate recency, we subtracted each purchase date from the current timestamp returned by datetime.now() and extracted the number of whole days with .dt.days. It gives us the number of days since each purchase; a customer’s recency value is the smallest of these, which corresponds to their last purchase. Because the calculation depends on the current date, the values change each time the code is run.

After that, we calculated the frequency for each customer. We grouped the data by ‘CustomerID’ and counted the ‘OrderID’ values to determine the number of purchases made by each customer. Since every order has its own ‘OrderID’, this count is the frequency value, representing the total number of purchases made by each customer.

Finally, we calculated the monetary value for each customer. We grouped the data by ‘CustomerID’ and summed the ‘TransactionAmount’ values to calculate the total amount spent by each customer. It gives us the monetary value, representing the total monetary contribution of each customer.

By performing these calculations, we now have the necessary RFM values (recency, frequency, monetary value) for each customer, which are important indicators for understanding customer behaviour and segmentation in RFM analysis.
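
The cell above stores the RFM values on every transaction row. As a complementary sketch, the same sample data can be reduced to one row per customer, with recency measured from each customer's last purchase. The fixed `snapshot_date` and the `rfm` and `transactions` names are assumptions introduced here so that the result is reproducible; they are not part of the original notebook.

```python
import pandas as pd

# Same sample transactions as in the cell above
transactions = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'OrderID': [101, 102, 201, 202, 301],
    'PurchaseDate': pd.to_datetime(['2023-01-01', '2023-05-15', '2023-02-10',
                                    '2023-06-25', '2023-03-30']),
    'TransactionAmount': [150, 200, 50, 100, 300],
})

# Assumption: a fixed snapshot date, so the output does not depend on the run date
snapshot_date = pd.Timestamp('2023-07-01')

# One row per customer: last purchase date, number of orders, total spend
rfm = transactions.groupby('CustomerID').agg(
    LastPurchase=('PurchaseDate', 'max'),
    Frequency=('OrderID', 'nunique'),
    MonetaryValue=('TransactionAmount', 'sum'),
).reset_index()

# Recency: days between the snapshot date and each customer's last purchase
rfm['Recency'] = (snapshot_date - rfm['LastPurchase']).dt.days

print(rfm[['CustomerID', 'Recency', 'Frequency', 'MonetaryValue']])
```

Scoring a table like this would rate each customer once, rather than once per transaction.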

## Calculating RFM Scores

In [74]:
# Define scoring criteria for each RFM value
recency_scores = [5, 4, 3, 2, 1]  # Higher score for lower recency (more recent)
frequency_scores = [1, 2, 3, 4, 5]  # Higher score for higher frequency
monetary_scores = [1, 2, 3, 4, 5]  # Higher score for higher monetary value

# Calculate RFM scores
data['RecencyScore'] = pd.cut(data['Recency'], bins=5, labels=recency_scores)
data['FrequencyScore'] = pd.cut(data['Frequency'], bins=5, labels=frequency_scores)
data['MonetaryScore'] = pd.cut(data['MonetaryValue'], bins=5, labels=monetary_scores)

We assigned scores from 5 to 1 to calculate the recency score, where a higher score indicates a more recent purchase. It means that customers who have purchased more recently will receive higher recency scores.

We assigned scores from 1 to 5 to calculate the frequency score, where a higher score indicates a higher purchase frequency. Customers who made more frequent purchases will receive higher frequency scores.

To calculate the monetary score, we assigned scores from 1 to 5, where a higher score indicates a higher amount spent by the customer.

To calculate RFM scores, we used the pd.cut() function to divide recency, frequency, and monetary values into bins. We defined 5 equal-width bins for each value and assigned the corresponding scores to each bin.
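
To see where pd.cut() places the bin boundaries, the following sketch reruns the recency scoring on the five recency values shown in the earlier output and asks for the edges with `retbins=True`. The variable names `recency`, `scores`, and `edges` are chosen here for illustration.

```python
import pandas as pd

# Recency values (in days) taken from the output shown earlier
recency = pd.Series([633, 499, 593, 458, 545])

# bins=5 splits the observed range into five equal-width intervals;
# retbins=True also returns the interval edges
scores, edges = pd.cut(recency, bins=5, labels=[5, 4, 3, 2, 1], retbins=True)

print(edges)                        # interval boundaries chosen by pd.cut
print(scores.astype(int).tolist())  # one score per recency value
```

Because the bins are equal-width rather than equal-count, skewed data can leave some scores with few or no rows; pd.qcut() is the quantile-based alternative.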

Once the scores are added to the data, you will notice that they are categorical variables. You can use the data.info() method to confirm this. So we need to convert their datatype into integers to use these scores further:

In [75]:
# Convert RFM scores to numeric type
data['RecencyScore'] = data['RecencyScore'].astype(int)
data['FrequencyScore'] = data['FrequencyScore'].astype(int)
data['MonetaryScore'] = data['MonetaryScore'].astype(int)

## RFM Value Segmentation

In [76]:
# Calculate RFM score by combining the individual scores
data['RFM_Score'] = data['RecencyScore'] + data['FrequencyScore'] + data['MonetaryScore']

# Create RFM segments based on the RFM score
segment_labels = ['Low-Value', 'Mid-Value', 'High-Value']
data['Value Segment'] = pd.qcut(data['RFM_Score'], q=3, labels=segment_labels)

To calculate the RFM score, we add the scores obtained for recency, frequency and monetary value. For example, if a customer has a recency score of 3, a frequency score of 4, and a monetary score of 5, their RFM score will be 12.

After calculating the RFM scores, we created RFM segments based on the scores. We divided RFM scores into three segments, namely “Low-Value”, “Mid-Value”, and “High-Value”. Segmentation is done using the pd.qcut() function, which places the cut points at quantiles of the scores so that each segment receives roughly the same number of rows.
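
As a quick check of how pd.qcut() chooses its cut points, this sketch applies it to the five RFM scores of the sample data and prints the edges alongside the assigned segments. The names `rfm_score`, `segments`, and `edges` are introduced here for illustration only.

```python
import pandas as pd

# RFM scores of the five sample rows
rfm_score = pd.Series([11, 14, 8, 11, 8])

segment_labels = ['Low-Value', 'Mid-Value', 'High-Value']

# q=3 cuts at the 1/3 and 2/3 quantiles; retbins=True returns the cut points
segments, edges = pd.qcut(rfm_score, q=3, labels=segment_labels, retbins=True)

print(edges)
print(segments.tolist())
```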

In [77]:
data.head()

Unnamed: 0,CustomerID,OrderID,PurchaseDate,TransactionAmount,Recency,Frequency,MonetaryValue,RecencyScore,FrequencyScore,MonetaryScore,RFM_Score,Value Segment
0,1,101,2023-01-01,150,633,2,350,1,5,5,11,Mid-Value
1,1,102,2023-05-15,200,499,2,350,4,5,5,14,High-Value
2,2,201,2023-02-10,50,593,2,150,2,5,1,8,Low-Value
3,2,202,2023-06-25,100,458,2,150,5,5,1,11,Mid-Value
4,3,301,2023-03-30,300,545,1,300,3,1,4,8,Low-Value


In [78]:
# RFM Segment Distribution
segment_counts = data['Value Segment'].value_counts().reset_index()
segment_counts.columns = ['Value Segment', 'Count']

pastel_colors = px.colors.qualitative.Pastel

# Create the bar chart
fig_segment_dist = px.bar(segment_counts, x='Value Segment', y='Count',
                          color='Value Segment', color_discrete_sequence=pastel_colors,
                          title='RFM Value Segment Distribution')

# Update the layout
fig_segment_dist.update_layout(xaxis_title='RFM Value Segment',
                              yaxis_title='Count',
                              showlegend=False)

# Show the figure
fig_segment_dist.show()

## RFM Customer Segments

The segments we calculated above are RFM value segments. They categorize customers by their total RFM score into three groups, “Low-Value”, “Mid-Value”, and “High-Value”, and help us understand the relative value of customers in terms of recency, frequency, and monetary aspects.

Now let’s create and analyze RFM Customer Segments that are broader classifications based on the RFM scores. These segments, such as “Champions”, “Potential Loyalists”, and “Can’t Lose” provide a more strategic perspective on customer behaviour and characteristics in terms of recency, frequency, and monetary aspects. Here’s how to create the RFM customer segments:

In [79]:
# Create a new column for RFM Customer Segments
data['RFM Customer Segments'] = ''

# Assign RFM segments based on the RFM score
data.loc[data['RFM_Score'] >= 9, 'RFM Customer Segments'] = 'Champions'
data.loc[(data['RFM_Score'] >= 6) & (data['RFM_Score'] < 9), 'RFM Customer Segments'] = 'Potential Loyalists'
data.loc[(data['RFM_Score'] >= 5) & (data['RFM_Score'] < 6), 'RFM Customer Segments'] = 'At Risk Customers'
data.loc[(data['RFM_Score'] >= 4) & (data['RFM_Score'] < 5), 'RFM Customer Segments'] = "Can't Lose"
data.loc[(data['RFM_Score'] >= 3) & (data['RFM_Score'] < 4), 'RFM Customer Segments'] = "Lost"

# Print the updated data with RFM segments
print(data[['CustomerID', 'RFM Customer Segments']])

   CustomerID RFM Customer Segments
0           1             Champions
1           1             Champions
2           2   Potential Loyalists
3           2             Champions
4           3   Potential Loyalists


In the above code, we first create a new column called “RFM Customer Segments” in the data and then assign a segment to each row based on its RFM score.
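
The same thresholds can also be written as a plain function, which makes the score ranges easier to read and test. This is only a sketch; `segment_for_score` is a hypothetical helper name that does not appear in the original notebook.

```python
def segment_for_score(rfm_score: int) -> str:
    """Map a combined RFM score (3 to 15) to a customer segment name."""
    if rfm_score >= 9:
        return 'Champions'
    if rfm_score >= 6:
        return 'Potential Loyalists'
    if rfm_score >= 5:
        return 'At Risk Customers'
    if rfm_score >= 4:
        return "Can't Lose"
    if rfm_score >= 3:
        return 'Lost'
    return ''  # scores below 3 cannot occur when each score is at least 1


# RFM scores of the five sample rows
for score in [11, 14, 8, 11, 8]:
    print(score, segment_for_score(score))
```

With the DataFrame above, `data['RFM_Score'].map(segment_for_score)` would fill the column in a single pass.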

## RFM Analysis

In [80]:
segment_product_counts = data.groupby(['Value Segment', 'RFM Customer Segments']).size().reset_index(name='Count')

segment_product_counts = segment_product_counts.sort_values('Count', ascending=False)

fig_treemap_segment_product = px.treemap(segment_product_counts,
                                         path=['Value Segment', 'RFM Customer Segments'],
                                         values='Count',
                                         color='Value Segment', color_discrete_sequence=px.colors.qualitative.Pastel,
                                         title='RFM Customer Segments by Value')
fig_treemap_segment_product.show()









In [81]:
# Analyze the distribution of RFM values within the Champions segment:
# Filter the data to include only the customers in the Champions segment
champions_segment = data[data['RFM Customer Segments'] == 'Champions']

fig = go.Figure()
fig.add_trace(go.Box(y=champions_segment['RecencyScore'], name='Recency'))
fig.add_trace(go.Box(y=champions_segment['FrequencyScore'], name='Frequency'))
fig.add_trace(go.Box(y=champions_segment['MonetaryScore'], name='Monetary'))

fig.update_layout(title='Distribution of RFM Values within Champions Segment',
                  yaxis_title='RFM Value',
                  showlegend=True)

fig.show()

In [82]:
# Analyze the correlation of the recency, frequency, and monetary scores within the Champions segment:
correlation_matrix = champions_segment[['RecencyScore', 'FrequencyScore', 'MonetaryScore']].corr()

# Visualize the correlation matrix using a heatmap
fig_heatmap = go.Figure(data=go.Heatmap(
                   z=correlation_matrix.values,
                   x=correlation_matrix.columns,
                   y=correlation_matrix.columns,
                   colorscale='RdBu',
                   colorbar=dict(title='Correlation')))

fig_heatmap.update_layout(title='Correlation Matrix of RFM Values within Champions Segment')

fig_heatmap.show()

In [83]:
import plotly.colors

pastel_colors = plotly.colors.qualitative.Pastel

segment_counts = data['RFM Customer Segments'].value_counts()

# Create a bar chart to compare segment counts
fig = go.Figure(data=[go.Bar(x=segment_counts.index, y=segment_counts.values,
                            marker=dict(color=pastel_colors))])

# Set the color of the Champions segment as a different color
champions_color = 'rgb(158, 202, 225)'
fig.update_traces(marker_color=[champions_color if segment == 'Champions' else pastel_colors[i]
                                for i, segment in enumerate(segment_counts.index)],
                  marker_line_color='rgb(8, 48, 107)',
                  marker_line_width=1.5, opacity=0.6)

# Update the layout
fig.update_layout(title='Comparison of RFM Segments',
                  xaxis_title='RFM Segments',
                  yaxis_title='Number of Customers',
                  showlegend=False)

fig.show()

In [84]:
# Compare the recency, frequency, and monetary scores of all the segments:
# Calculate the average Recency, Frequency, and Monetary scores for each segment
segment_scores = data.groupby('RFM Customer Segments')[['RecencyScore', 'FrequencyScore', 'MonetaryScore']].mean().reset_index()

# Create a grouped bar chart to compare segment scores
fig = go.Figure()

# Add bars for Recency score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segments'],
    y=segment_scores['RecencyScore'],
    name='Recency Score',
    marker_color='rgb(158,202,225)'
))

# Add bars for Frequency score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segments'],
    y=segment_scores['FrequencyScore'],
    name='Frequency Score',
    marker_color='rgb(94,158,217)'
))

# Add bars for Monetary score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segments'],
    y=segment_scores['MonetaryScore'],
    name='Monetary Score',
    marker_color='rgb(32,102,148)'
))

# Update the layout
fig.update_layout(
    title='Comparison of RFM Segments based on Recency, Frequency, and Monetary Scores',
    xaxis_title='RFM Segments',
    yaxis_title='Score',
    barmode='group',
    showlegend=True
)

# Show the figure
fig.show()

## Summary

RFM Analysis is used to understand and segment customers based on their buying behaviour. RFM stands for recency, frequency, and monetary value, which are three key metrics that provide information about customer engagement, loyalty, and value to a business.

## Deep Learning

In [90]:
# Libraries
import numpy as np  # needed for np.argmax when decoding predictions
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

In [91]:
# Display the columns to check their names
print("Columns in the DataFrame:")
print(data.columns.tolist())

# Display the first few rows
print("First few rows of the DataFrame:")
print(data.head())

# Check for missing values
print("Missing values in each column:")
print(data.isnull().sum())

Columns in the DataFrame:
['CustomerID', 'OrderID', 'PurchaseDate', 'TransactionAmount', 'Recency', 'Frequency', 'MonetaryValue', 'RecencyScore', 'FrequencyScore', 'MonetaryScore', 'RFM_Score', 'Value Segment', 'RFM Customer Segments']
First few rows of the DataFrame:
   CustomerID  OrderID PurchaseDate  TransactionAmount  Recency  Frequency  \
0           1      101   2023-01-01                150      633          2   
1           1      102   2023-05-15                200      499          2   
2           2      201   2023-02-10                 50      593          2   
3           2      202   2023-06-25                100      458          2   
4           3      301   2023-03-30                300      545          1   

   MonetaryValue  RecencyScore  FrequencyScore  MonetaryScore  RFM_Score  \
0            350             1               5              5         11   
1            350             4               5              5         14   
2            150             2    

In [92]:
# Drop rows with missing values
data.dropna(subset=['Recency', 'Frequency', 'MonetaryValue', 'Value Segment'], inplace=True)

# Prepare features and target variable
features = data[['Recency', 'Frequency', 'MonetaryValue']]
target = data['Value Segment']

# Check shapes
print("Shape of features (X):", features.shape)
print("Shape of target (y):", target.shape)

# Convert target variable to categorical codes and then one-hot encode
target = pd.Categorical(target).codes  # Numeric codes
target = to_categorical(target)         # One-hot encode

# Check target shape again
print("Shape of target after encoding:", target.shape)


Shape of features (X): (5, 3)
Shape of target (y): (5,)
Shape of target after encoding: (5, 3)


In [93]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Standardize the feature values
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In [94]:
# Build the model
model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(16, activation='relu'))
model.add(Dense(y_train.shape[1], activation='softmax'))  # Output layer for multi-class classification

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.




In [97]:
model.fit(X_train, y_train, epochs=50, batch_size=16, validation_split=0.2)

Epoch 1/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 103ms/step - accuracy: 0.6667 - loss: 0.6313 - val_accuracy: 0.0000e+00 - val_loss: 1.5952
Epoch 2/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 86ms/step - accuracy: 0.6667 - loss: 0.6243 - val_accuracy: 0.0000e+00 - val_loss: 1.6038
Epoch 3/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 53ms/step - accuracy: 0.6667 - loss: 0.6176 - val_accuracy: 0.0000e+00 - val_loss: 1.6123
Epoch 4/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 54ms/step - accuracy: 0.6667 - loss: 0.6110 - val_accuracy: 0.0000e+00 - val_loss: 1.6211
Epoch 5/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 54ms/step - accuracy: 0.6667 - loss: 0.6044 - val_accuracy: 0.0000e+00 - val_loss: 1.6301
Epoch 6/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 61ms/step - accuracy: 1.0000 - loss: 0.5979 - val_accuracy: 0.0000e+00 - val_loss: 1.6392
Epoch 7/50
1/1

<keras.src.callbacks.history.History at 0x7b3c6c11de70>

In [98]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy:.2f}')

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - accuracy: 0.0000e+00 - loss: 1.8160
Test Accuracy: 0.00
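
The accuracy values in the training log move in steps of one third, and the test accuracy can only be 0 or 1, because each split receives very few of the five sample rows. The sketch below works through the arithmetic. It assumes that scikit-learn rounds the test-set size up and that Keras holds out the trailing fraction of the training data for validation, rounding the fitted share down; treat the exact rounding as an assumption to confirm against the library documentation.

```python
import math

n_rows = 5              # rows in the sample DataFrame
test_size = 0.2         # passed to train_test_split
validation_split = 0.2  # passed to model.fit

# Assumed rounding: test count rounded up, the remainder used for training
n_test = math.ceil(n_rows * test_size)
n_train_total = n_rows - n_test

# Assumed rounding: the first floor(n * (1 - validation_split)) rows are used for fitting
n_fit = math.floor(n_train_total * (1 - validation_split))
n_val = n_train_total - n_fit

print(n_fit, n_val, n_test)
```

With a single test row, a test accuracy of 0.00 says little about the model itself; a larger dataset would be needed for a meaningful evaluation.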


In [100]:
predictions = model.predict(X_test)
predicted_classes = np.argmax(predictions, axis=1)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
