# Predict customer lifetime values of e-commerce customers


## Introduction

Predicting Customer Lifetime Value (CLV) helps e-commerce businesses estimate the total revenue they can expect from a single customer account. This project builds a predictive model for CLV, approximated here by each customer's total spend across their recorded transactions. Such a model is useful for optimizing marketing strategies, focusing on customer retention, and allocating resources to the most promising customer segments.

## Data Preparation

We use a dataset from the UCI Machine Learning Repository with detailed transaction records from a UK-based online retailer from December 2009 to December 2011.

In [None]:
import pandas as pd
import numpy as np
dt = pd.read_excel('sample_online_ec.xlsx', engine='openpyxl')
dt.info()

In [None]:
# remove rows missing a description or a customer ID, then save the result to a new csv file
# (both columns go in one dropna call so that neither filter overwrites the other)
dt_clean = dt.dropna(subset=['Description', 'Customer ID']).copy()
dt_clean.to_csv('dt_clean.csv', index=False)
dt_clean.head()

## Data Understanding - Univariate Analysis of Quantitative Features

### 'Quantity' data

The histogram shows that the most frequent purchase quantity per transaction falls between 0 and 10, with a notable peak at around 4 items. The summary statistics confirm this central tendency: the median ('50%') purchase quantity is 4 items.

Interestingly, there are negative quantities, which could indicate returned items, other adjustments to orders, or data entry errors.
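One way to probe the negative quantities is to check how many of them belong to cancellation invoices. The sketch below uses a small hand-made frame rather than the real data. The `Invoice` column name and the convention that cancelled invoices start with `C` are assumptions about the dataset, not something established in this notebook.

```python
import pandas as pd

# Toy stand-in for dt_clean. 'Invoice' and the 'C' prefix are assumptions.
toy = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435", "C489450", "489436"],
    "Quantity": [12, -3, 4, -1, -2],
})

# Rows with negative quantity, and which of them look like cancellations
negative = toy[toy["Quantity"] < 0]
is_cancel = negative["Invoice"].astype(str).str.startswith("C")

n_negative = len(negative)
n_cancelled = int(is_cancel.sum())
# Negative rows that are not cancellations: candidates for adjustments or entry errors
n_unexplained = n_negative - n_cancelled

print(f"negative rows: {n_negative}, cancellations: {n_cancelled}, other: {n_unexplained}")
```

If most negative rows turn out to be cancellations, treating them as returns when computing spend is reasonable; the remainder would need a closer look.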

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# filter the outliers with the interquartile range (IQR) and set up lower/upper bound
Q1 = dt_clean['Quantity'].quantile(0.25)
Q3 = dt_clean['Quantity'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

filtered_quantities = dt_clean[(dt_clean['Quantity'] >= lower_bound) & (dt_clean['Quantity'] <= upper_bound)]['Quantity']

# plot
plt.figure(figsize=(10, 6))
sns.histplot(filtered_quantities, bins=50, kde=False)
plt.title('Distribution of Quantity within 1.5 * IQR Fences')
plt.xlabel('Quantity')
plt.ylabel('Frequency')
plt.show()

# statistics for the filtered Quantity data
filtered_quantity_description = filtered_quantities.describe()
print("Quantity Data: ", filtered_quantity_description)
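To make the filtering rule concrete, here is a small worked example of the same 1.5 * IQR fences on a toy series. The helper name `iqr_bounds` and the sample values are illustrative only.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy quantities with one obvious outlier
quantities = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

lower, upper = iqr_bounds(quantities)
kept = quantities[quantities.between(lower, upper)]

print(f"fences: [{lower}, {upper}], kept {len(kept)} of {len(quantities)} values")
```

Here Q1 is 3 and Q3 is 7, so the fences are -3 and 13 and only the value 100 is dropped. Note that a negative lower fence keeps small negative quantities in the plot.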

### Visual Analytics

#### Violin Plot for Quantity by Country

The violin plot below illustrates the distribution of order quantities by country.
It is particularly useful for identifying trends in purchasing behavior across different markets, since it shows the distribution density along with the range of the data.

Interpretation:
*   The bulk of orders cluster near zero, which suggests a high volume of small-quantity transactions across all countries.
*   The length of each violin indicates the range of order quantities, while the width shows the frequency. Wider sections mark common order sizes.
*   Several countries show extreme negative quantities, which likely correspond to returns or canceled orders.

In [None]:
plt.figure(figsize=(12, 8))
sns.violinplot(x='Country', y='Quantity', data=dt_clean[dt_clean['Quantity'] < 50])  # Limiting to reasonable quantities
plt.title('Quantity Distribution by Country')
plt.xticks(rotation=90)
plt.show()

#### Faceted Grid Plot for Quantity and Price

The collection of plots allows for a comparative analysis of sales by country, highlighting areas of success, challenges, and growth opportunities.

Key Findings:

* United Kingdom: Primary market, high volume of transactions, mostly small quantities.
* France and Germany: Fewer transactions, wider range of quantities, smaller but more variable markets.
* Nigeria and Malta: Very few transactions, higher-quantity sales potentially indicate bulk purchases.

* Negative Quantities: Returns or cancelled orders, significant in some countries like Iceland and West Indies, may skew average quantity figures.
* Distribution of Sales: Informs marketing efforts, expansion/reduction decisions, logistics and supply chain strategies.

In [None]:
#Faceted Grid Plot for Quantity and Price
grid = sns.FacetGrid(dt_clean, col='Country', col_wrap=4, height=4)
grid.map(sns.scatterplot, 'Quantity', 'Price')
grid.add_legend()
plt.show()
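The per-country findings can also be checked numerically. The following is a minimal sketch on made-up rows; with the real data, `dt_clean` would take the place of `toy`. The output column names are illustrative.

```python
import pandas as pd

# Toy stand-in for dt_clean with only the columns needed here
toy = pd.DataFrame({
    "Country": ["United Kingdom"] * 4 + ["France"] * 2 + ["Malta"],
    "Quantity": [2, 4, -1, 6, 12, 24, 100],
})

# Transaction count, median quantity, and share of negative-quantity rows per country
summary = (
    toy.groupby("Country")["Quantity"]
    .agg(
        transactions="count",
        median_quantity="median",
        negative_share=lambda q: (q < 0).mean(),
    )
    .sort_values("transactions", ascending=False)
)

print(summary)
```

A table like this makes it easier to see which countries have too few transactions for their plots to support firm conclusions.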


## Prediction & Evaluation


### 1. Linear Regression

In [None]:
# add a per-row spend column (price * quantity), then sum it per customer to get CLV;
# assign() is safe to re-run, whereas insert() raises if the column already exists
dt_clean = dt_clean.assign(TotalSpend=dt_clean['Price'] * dt_clean['Quantity'])
clv = dt_clean.groupby('Customer ID')['TotalSpend'].sum()


# aggregate features by customer
X_aggregated = dt_clean.groupby('Customer ID').agg({'Quantity': 'sum', 'Price': 'mean'})
y_aggregated = dt_clean.groupby('Customer ID')['TotalSpend'].sum()


# make the train and test datasets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_aggregated, y_aggregated, test_size=0.2, random_state=42)


# train the model
from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)


# predict on the test dataset & plot
y_pred = linear_reg.predict(X_test)

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.title('Actual vs Predicted CLV - Linear Regression')
plt.xlabel('Actual CLV')
plt.ylabel('Predicted CLV')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)  # Diagonal line
plt.show()

In [None]:
# evaluate the model (root_mean_squared_error requires scikit-learn 1.4 or later)
from sklearn.metrics import root_mean_squared_error, r2_score
y_pred = linear_reg.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)
print(f'RMSE: {rmse}')
print(f'R-squared: {r_squared}')
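The target used here is each customer's total spend over the same period the features are drawn from. A forward-looking setup would instead build features from an earlier window and the target from a later one. The sketch below illustrates that idea on toy rows. The `InvoiceDate` column and the cutoff date are assumptions, not part of the analysis in this notebook.

```python
import pandas as pd

# Toy transactions. 'InvoiceDate' is an assumed column name.
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-06-10", "2011-02-01",
        "2010-03-15", "2011-05-20", "2010-11-30",
    ]),
    "Quantity": [2, 1, 3, 5, 2, 4],
    "Price": [10.0, 20.0, 10.0, 4.0, 5.0, 2.5],
})
toy["TotalSpend"] = toy["Quantity"] * toy["Price"]

cutoff = pd.Timestamp("2011-01-01")  # hypothetical split date
history = toy[toy["InvoiceDate"] < cutoff]
future = toy[toy["InvoiceDate"] >= cutoff]

# Features come only from transactions before the cutoff
features = history.groupby("Customer ID").agg(
    past_quantity=("Quantity", "sum"),
    past_mean_price=("Price", "mean"),
    past_spend=("TotalSpend", "sum"),
)

# Target is spend after the cutoff; customers with no later purchases get 0
target = (
    future.groupby("Customer ID")["TotalSpend"].sum()
    .reindex(features.index, fill_value=0.0)
    .rename("future_spend")
)

print(features.join(target))
```

With this layout the model cannot see the transactions that make up its target, so the test metrics say more about forecasting than about reconstructing spend.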

#### Conclusion
RMSE: 3538.44 (a typical prediction error of about 3,538 currency units; RMSE weights large errors more heavily than a plain average)

R-squared: 0.8656 (the model explains 86.56% of the variance in the test set)

Performance Analysis:
* Accurate for lower CLV: Model excels at predicting CLV for customers with lower lifetime value.
* Less precise for higher CLV: Predictions scatter more widely for customers with exceptionally high CLV, indicating room for refinement.
* Strong model fit: High R-squared suggests that the selected features are influential in predicting CLV.
* Overfitting risk: A high test-set R-squared does not by itself indicate overfitting. The more relevant caveat is that the target (total spend) is computed from the same transactions as the features (summed quantity and mean price), which may inflate the score.
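A simple way to check for overfitting is to compare training and test scores and to look at cross-validated scores. The sketch below does this on synthetic data so that it runs on its own; in the notebook, `X_aggregated` and `y_aggregated` would take the place of `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in: two features with a linear relationship plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# A large gap between these two would point to overfitting
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# Five-fold cross-validation on the training data gives a spread of scores
cv_r2 = cross_val_score(LinearRegression(), X_train, y_train, cv=5, scoring="r2")

print(f"train R2: {train_r2:.3f}, test R2: {test_r2:.3f}")
print(f"cross-validated R2: {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")
```

Similar training and test scores, with a narrow spread across folds, would suggest the fit is stable rather than overfit.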

### 2. Decision Tree


* Can handle non-linear relationships


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV


X_train, X_test, y_train, y_test = train_test_split(X_aggregated, y_aggregated, test_size=0.2, random_state=42)

decision_tree = DecisionTreeRegressor(random_state=42)

# define a grid of parameters to search over
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# use GridSearchCV to find the best parameters & model
grid_search = GridSearchCV(decision_tree, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
best_tree = grid_search.best_estimator_

# prediction
y_pred = best_tree.predict(X_test)

plt.scatter(y_test, y_pred, alpha=0.3)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], '--', color='red')  # Diagonal line
plt.title('Decision Tree Regressor: Actual vs Predicted CLV')
plt.xlabel('Actual CLV')
plt.ylabel('Predicted CLV')
plt.show()

# evaluation
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Decision Tree RMSE: {rmse}')
print(f'Decision Tree R-squared: {r2}')

#### Conclusion
##### Scatter Plot Analysis:
* Strong performance for lower CLV: The model accurately predicts CLV for most customers, especially those with lower CLV values.
* Increased variance for higher CLV: The model's predictions are less accurate for customers with higher CLV, indicating potential limitations in capturing complex relationships at higher CLV levels.

##### Model Performance Metrics:
* Lower RMSE: The Decision Tree Regressor has a lower RMSE compared to Linear Regression, indicating better overall prediction accuracy.
* Higher R-squared: The Decision Tree Regressor explains a larger portion of the variance in CLV than Linear Regression, suggesting a better fit to the data.

##### Best Parameters:
* No depth constraint: The model was allowed to grow to its maximum depth, potentially increasing its complexity and risk of overfitting.
* Minimal leaf size: The model allowed for very small leaf nodes, providing flexibility but potentially increasing the risk of overfitting.
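To see how much complexity an unconstrained tree takes on, its depth and leaf count can be compared with those of a constrained tree. This is a minimal sketch on synthetic data; the constraint values are illustrative, not the ones selected by the grid search.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with a non-linear (multiplicative) relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=1.0, size=300)

# No depth or leaf-size limits: the tree can grow until every leaf is pure
unconstrained = DecisionTreeRegressor(random_state=42).fit(X, y)

# Illustrative limits on depth and leaf size
constrained = DecisionTreeRegressor(
    max_depth=5, min_samples_leaf=4, random_state=42
).fit(X, y)

for name, tree in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train R2={tree.score(X, y):.3f}"
    )
```

A tree with nearly one leaf per training row and a training R-squared close to 1 has memorized the data, which is why test-set or cross-validated scores matter more than the training fit.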

### 3. Random Forest

* An ensemble of decision trees
* Less prone to overfitting
* Can capture complex interactions between features

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X_aggregated, y_aggregated, test_size=0.2, random_state=42)
random_forest = RandomForestRegressor(n_estimators=100, random_state=42)

# model training
random_forest.fit(X_train, y_train)


# prediction
y_pred_rf = random_forest.predict(X_test)

plt.scatter(y_test, y_pred_rf)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
plt.title('Random Forest Regressor: Actual vs Predicted CLV')
plt.xlabel('Actual CLV')
plt.ylabel('Predicted CLV')
plt.show()


# evaluation
rmse_rf = root_mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print(f'Random Forest RMSE: {rmse_rf}')
print(f'Random Forest R-squared: {r2_rf}')


#### Conclusion
##### Scatter Plot Analysis:
* Model accurately predicts low CLV values.
* Predictions become less accurate for higher CLV values.
* Outliers exist where CLV is significantly overestimated.

##### Model Performance Metrics:
* RMSE: Higher than Linear Regression and Decision Tree, indicating higher prediction error.
* R-squared: Slightly lower than previous models, indicating less explained variance in CLV.
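Random forests offer a built-in generalization estimate, the out-of-bag (OOB) score, as well as per-feature importances. Both can help explain a weaker result. The sketch below uses synthetic data and an illustrative number of trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in with an interaction between the two features
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(400, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=2.0, size=400)

# oob_score=True scores each sample using only trees that did not train on it
forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

train_r2 = forest.score(X, y)
oob_r2 = forest.oob_score_
importances = forest.feature_importances_

print(f"train R2: {train_r2:.3f}, OOB R2: {oob_r2:.3f}")
print(f"feature importances: {importances.round(3)}")
```

The OOB score is usually lower than the training score and closer to what a held-out test set would show.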


### 4. Gradient Boosting
Rationale: Gradient boosting is a powerful ensemble method that combines weak predictive models to create a strong predictive model, often leading to high performance.


In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X_train, X_test, y_train, y_test = train_test_split(X_aggregated, y_aggregated, test_size=0.2, random_state=42)
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbr.fit(X_train, y_train)
y_pred_gbr = gbr.predict(X_test)

plt.scatter(y_test, y_pred_gbr)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
plt.title('Gradient Boosting Regressor: Actual vs Predicted CLV')
plt.xlabel('Actual CLV')
plt.ylabel('Predicted CLV')
plt.show()

rmse_gbr = root_mean_squared_error(y_test, y_pred_gbr)
r2_gbr = r2_score(y_test, y_pred_gbr)
print(f'Gradient Boosting RMSE: {rmse_gbr}')
print(f'Gradient Boosting R-squared: {r2_gbr}')

#### Conclusion
##### Model Performance:
* RMSE: 4014.27, slightly higher than Decision Tree and Random Forest, indicating potential for improvement.
* R-squared: 0.8269, strong but lower than previous models, suggesting some variance is not captured.

##### Visualization Insights:
* Model accuracy: Good for lower CLV values, but decreases for higher CLV values.
* Overestimation: Model overestimates CLV for some high-value customers.

##### Potential Improvements:
* Hyperparameter tuning to reduce overfitting.
* Additional features or feature engineering for high-value customers.
* Handling outliers or segmenting customers.
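The hyperparameter tuning suggested above could follow the same grid-search pattern used for the decision tree. The sketch below runs on synthetic data with a deliberately small, illustrative grid; suitable ranges for the real data are not known from this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the aggregated customer features and target
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=2.0, size=300)

# Illustrative grid: 2 x 2 x 2 = 8 candidate settings
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# The scorer returns negated RMSE, so flip the sign for reporting
best_rmse = -search.best_score_
print(f"best parameters: {search.best_params_}")
print(f"cross-validated RMSE: {best_rmse:.3f}")
```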

### 5. Neural Network (Multilayer Perceptron)
Rationale: Neural networks can model complex, non-linear patterns that other algorithms may miss. They are especially useful if the relationships in the data are not well captured by traditional algorithms.


In [None]:
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X_aggregated, y_aggregated, test_size=0.2, random_state=42)
mlp = MLPRegressor(hidden_layer_sizes=(100,), activation='relu', solver='adam', max_iter=500, random_state=42)
mlp.fit(X_train, y_train)

y_pred_mlp = mlp.predict(X_test)

plt.scatter(y_test, y_pred_mlp)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
plt.title('Neural Network (MLP) Regressor: Actual vs Predicted CLV')
plt.xlabel('Actual CLV')
plt.ylabel('Predicted CLV')
plt.show()

rmse_mlp = root_mean_squared_error(y_test, y_pred_mlp)
r2_mlp = r2_score(y_test, y_pred_mlp)
print(f'Neural Network RMSE: {rmse_mlp}')
print(f'Neural Network R-squared: {r2_mlp}')

#### Conclusion
* RMSE (3426.86): The model's typical prediction error is about 3,427 currency units, which is competitive with the other models. Note that RMSE weights large errors more heavily than a plain average.
* R-squared (0.8739): The model explains 87.39% of the variation in CLV, indicating a strong fit to the data.

##### Visualization Analysis:
* The model is highly accurate for customers with lower CLV values.
* The model may struggle with predicting higher-end CLV values.
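Neural networks are sensitive to feature scale, and summed quantity and mean price can differ by orders of magnitude. A possible refinement, sketched here on synthetic data, is to standardize both the features and the target around the MLP. This is an assumption about what might help, not a result from this analysis.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two features on very different scales
rng = np.random.default_rng(3)
X = np.column_stack([rng.uniform(0, 5000, 500), rng.uniform(0.5, 10, 500)])
y = 2.0 * X[:, 0] + 100.0 * X[:, 1] + rng.normal(scale=50.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize the features inside a pipeline and the target via a wrapper,
# so the scalers are fitted on the training data only
scaled_mlp = TransformedTargetRegressor(
    regressor=make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(100,), max_iter=500, random_state=42),
    ),
    transformer=StandardScaler(),
)
scaled_mlp.fit(X_train, y_train)

r2_scaled = scaled_mlp.score(X_test, y_test)
print(f"test R2 with scaling: {r2_scaled:.3f}")
```

Predictions from the wrapper are returned in the original units, so RMSE stays comparable with the other models.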

## Summary

We predicted Customer Lifetime Value (CLV) for e-commerce customers by applying multiple predictive models:
* Linear Regression
* Decision Tree
* Random Forest
* Gradient Boosting
* Neural Network (MLP regressor)


Based on RMSE and R-squared, the **Neural Network** produced the most accurate predictions (RMSE 3426.86, R-squared 0.8739), making it the strongest of the five models for predicting CLV.
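To keep such a comparison reproducible, the models can be fitted on one shared split and their metrics collected in a single table. The helper `compare_models` below is a hypothetical sketch on synthetic data; its numbers are unrelated to the results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def compare_models(models, X_train, X_test, y_train, y_test):
    """Fit each model on the same split and return RMSE and R2, best RMSE first."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rows.append({
            "model": name,
            "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
            "r2": float(r2_score(y_test, pred)),
        })
    return pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)

# Synthetic stand-in for the aggregated customer data
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=2.0, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

results = compare_models(models, X_train, X_test, y_train, y_test)
print(results)
```

Reporting all models side by side makes it clear how large the gaps between them are, which matters when the leading scores are close.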