# Problem Description
***alles um tier GmbH*** is a pet supplies company. They are currently auditing their promotional activities and the CEO, one of the main stakeholders, feels that the promotions they offer are too generic and not targeted. They have asked us to devise a customer segmentation model that they can use to run targeted promotional activities.

The client is interested in seeing what kind of customers are buying at ***alles um tier GmbH***. They assume that, in addition to private individuals, there are also smaller companies that purchase from ***alles um tier GmbH***. The project scope is to build a segmentation model and analyze the resulting customer segments.

# Data

You are given a customer-level dataset for the past year with the following data points: number of transactions in the past year (*num_transactions*), order amount in the past year (*total_order_value*), days between transactions in the past year (*days_between_trans*), re-order rate in the past year (*repeat_share*), and % of dog products bought (*dog_share*).

### Data Set
The dataset consists of 100k rows and has the following columns:

* CustomerID (int): UUID for the customer
* num_transactions (int): number of transactions in a given year
* total_order_value (float): total order value in € for the time period
* days_between_trans (float): average days between transactions for a user
* repeat_share (float): product share repeated every order
* dog_share (float): percentage of products ordered that are dog food related
    
# Technical Environment
* Python
* numpy
* pandas
* scikit-learn
* matplotlib / scipy / seaborn / altair / plotly

# Approach
The solution is assessed on the following skills:
* A thorough evaluation of the data set using statistical measures and visualization
* Elegant Python coding skills
* Machine learning modelling fundamentals
* Model & result evaluation

# Output
Please provide your solution in a jupyter notebook with clear markdown comments.
The final output should be in the form of a DataFrame with two columns, the CustomerID and the assigned cluster.

--------------------------------------------------------------------------------------------------------------


# Data Loading and Preprocessing

In [1]:
# Import all needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.impute import KNNImputer
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score
import altair as alt
import plotly.express as px
from scipy import stats

## Loading the Data

In [None]:
# load the data and make sure to specify the correct delimiter
df = pd.read_csv("DataSet_JuniorCodingChallenge.csv", delimiter='|')
df

## Handling Missing Data

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Check how often two values are missing in one row
print(f"Number of rows with two missing values: {len(df[df.isnull().sum(axis=1) == 2])}")

dropping = len(df[df.isnull().sum(axis=1) == 1])/len(df)
print(f"Percentage of rows with one missing value: {dropping:.2%}")

* Only 0.46% of all customer data have missing values, with just one entry missing at a time.

* Given the low percentage of missing data, one approach could be to drop the incomplete rows. However, since the missing data occurs sparsely across the dataset, it’s preferable to fill these gaps rather than discard potentially valuable records.

* To keep these records, we apply K-Nearest Neighbors (KNN) imputation to fill in the missing numerical values. Each gap is filled from the most similar customers, which helps keep the imputed values consistent with the general structure of the dataset.

In [None]:
# Create a copy and drop the CustomerID column
df_knn = df.drop(columns='CustomerID')

# Initialize the KNN imputer, choosing k=3 for nearest neighbors
imputer = KNNImputer(n_neighbors=3)

# Apply the imputer to the dataset (on numeric columns)
df_imputed = pd.DataFrame(imputer.fit_transform(df_knn), columns=df_knn.columns)

# After imputation, check if any missing values remain
assert df_imputed.isnull().sum().sum() == 0, "There are still missing values in the data."

# The assert above only confirms that no missing values remain; plausibility is reviewed separately.
print("Missing values handled using KNN imputation.")

* After applying the KNN Imputer, all missing values were successfully filled. The imputed data was manually reviewed, and all values appeared reasonable and consistent with the rest of the dataset.
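`KNNImputer` measures similarity with Euclidean distances, so features on large scales (such as total_order_value) can dominate the neighbor search when the data is not scaled. The following is a minimal sketch of one possible variant that standardizes the features before imputing and maps the result back to the original units. The helper name `knn_impute_scaled` and the toy values are assumptions for illustration and are not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler


def knn_impute_scaled(frame: pd.DataFrame, n_neighbors: int = 3) -> pd.DataFrame:
    """Impute missing values with KNN on standardized features (hypothetical helper)."""
    scaler = StandardScaler()
    # StandardScaler ignores NaNs while fitting and keeps them in the output
    scaled = scaler.fit_transform(frame)
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(scaled)
    # Map the imputed values back to the original units
    restored = scaler.inverse_transform(imputed)
    return pd.DataFrame(restored, columns=frame.columns, index=frame.index)


# Toy data, made up for illustration
toy = pd.DataFrame({
    "total_order_value": [100.0, 110.0, 5000.0, 5200.0, np.nan],
    "repeat_share": [0.10, 0.12, 0.90, 0.88, 0.89],
})
toy_imputed = knn_impute_scaled(toy, n_neighbors=2)
print(toy_imputed)
```

In the toy example, the missing total_order_value is filled from the two customers with the closest repeat_share. Whether scaling changes the imputed values noticeably on the real dataset would need to be checked.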

In [None]:
# Set df to the imputed data and add the CustomerID column back as the first column
df = pd.concat([df['CustomerID'], df_imputed], axis=1)
df

## Data Integrity Check

First, we make sure that num_transactions holds integer values and that the other numerical features are stored as floats.

We review all columns to ensure that each field has the correct data type (e.g., numerical, categorical).

In [None]:
# Check the current data types
df.dtypes

In [8]:
def find_floats(df, column):
    """
    Count the number of unique entries in the specified column of a DataFrame
    that are floats with non-zero decimal parts.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    column (str): The name of the column to analyze.

    Returns:
    int: The number of unique floats with non-zero decimal parts in the column.
    """
    count = 0
    for i in df[column].unique():
        if isinstance(i, float) and i % 1 != 0:
            count += 1
    return count


In [None]:
# State how many entries need to be rounded in the num_transactions column
print(f"There are {find_floats(df, 'num_transactions')} entries that are not .00 floats and need to be rounded.")

# Round all float values before converting them to integers
df['num_transactions'] = df['num_transactions'].round()

# Double Check if all values are rounded now with a print statement
print(f"There are {find_floats(df, 'num_transactions')} entries that are not .00 floats left.")

# Convert num_transactions to integers
df['num_transactions'] = df['num_transactions'].astype(int)

# Check the data types again
df.dtypes

* The data types were adjusted where necessary to match their intended use.

* Float values are rounded before being converted to integers, because a direct integer cast would truncate the decimal part instead of rounding it.
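The difference between casting directly and rounding first can be seen on a few made-up values (not taken from the dataset):

```python
import pandas as pd

# Made-up values, similar to what averaging neighbors during imputation can produce
values = pd.Series([2.0, 2.67, 3.33, 7.99])

truncated = values.astype(int)        # drops the fractional part
rounded = values.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())  # [2, 2, 3, 7]
print(rounded.tolist())    # [2, 3, 3, 8]
```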

In [None]:
# Count the negative values in num_transactions, total_order_value and days_between_trans
print(f"Number of negative values for num_transactions: {len(df[df['num_transactions'] < 0])}")
print(f"Number of negative values for total_order_value: {len(df[df['total_order_value'] < 0])}")
print(f"Number of negative values for days_between_trans: {len(df[df['days_between_trans'] < 0])}")

# repeat_share and dog_share must lie between 0 and 1, so count the values outside this range
print(f"Number of values outside of the range [0, 1] for repeat_share: {len(df[(df['repeat_share'] < 0) | (df['repeat_share'] > 1)])}")
print(f"Number of values outside of the range [0, 1] for dog_share: {len(df[(df['dog_share'] < 0) | (df['dog_share'] > 1)])}")

In [None]:
# Drop the negative values in the 3 columns
df = df[df['num_transactions'] >= 0]
df = df[df['total_order_value'] >= 0]
df = df[df['days_between_trans'] >= 0]

# Drop the values outside of the range 0 and 1 for the last two columns
df = df[(df['repeat_share'] >= 0) & (df['repeat_share'] <= 1)]
df = df[(df['dog_share'] >= 0) & (df['dog_share'] <= 1)]

df

* A comprehensive check was performed to ensure all numerical values were within logical and reasonable ranges. 

* Any data points falling outside acceptable ranges were identified and removed to maintain data accuracy.
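The range checks above repeat the same pattern for every column. As a sketch, they could be driven by a small table of bounds instead. The helper `count_out_of_range` and the toy values below are hypothetical and only illustrate the idea.

```python
import pandas as pd


def count_out_of_range(frame, bounds):
    """Count values outside [low, high] per column; None means no bound on that side."""
    counts = {}
    for col, (low, high) in bounds.items():
        outside = pd.Series(False, index=frame.index)
        if low is not None:
            outside |= frame[col] < low
        if high is not None:
            outside |= frame[col] > high
        counts[col] = int(outside.sum())
    return counts


# Toy data, made up for illustration
toy = pd.DataFrame({
    "num_transactions": [3, -1, 7],
    "repeat_share": [0.2, 1.4, -0.1],
})
violations = count_out_of_range(
    toy,
    {"num_transactions": (0, None), "repeat_share": (0, 1)},
)
print(violations)  # {'num_transactions': 1, 'repeat_share': 2}
```

Keeping the bounds in one dictionary makes it easier to see at a glance which assumptions are applied to which column.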

In [12]:
def find_inconsistent_duplicates(df, object_col):
    """
    Identify indices of rows where duplicated values in the specified column
    have inconsistent data.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    object_col (str): The name of the column to check for duplicates.

    Returns:
    list of int: Indices of rows with inconsistent data among duplicates.
    """
    # Get the indices of duplicated entries in the object column
    duplicated_indices = df[df.duplicated(subset=[object_col], keep=False)].index

    # List to store the indices of inconsistent duplicates
    inconsistent_indices = []

    # Group by the object column and iterate over each group
    for key, group in df.loc[duplicated_indices].groupby(object_col):
        # Get the first row's data (excluding the object column, which is assumed to be the first column)
        reference_row = group.iloc[0, 1:].values
        
        # Check if all rows in the group match the first row
        for idx, row in group.iterrows():
            if not (row.iloc[1:].values == reference_row).all():
                inconsistent_indices.append(idx)

    return inconsistent_indices

In [None]:
indices_with_inconsistencies = find_inconsistent_duplicates(df, 'CustomerID')

# Print the number of inconsistent duplicates
print(f"Number of inconsistent duplicates: {len(indices_with_inconsistencies)}.")
df

In [None]:
# State how many CustomerIDs are duplicated and that they will be dropped
print(f"There are {df['CustomerID'].duplicated().sum()} duplicated CustomerIDs. They will be dropped.")

# Drop the duplicated CustomerIDs
df = df.drop_duplicates(subset='CustomerID')

# Reset the index
df = df.reset_index(drop=True)
df

* A duplicate check was conducted across the dataset.

* All duplicate entries were found to be consistent with no conflicting values. These duplicates were safely dropped to avoid any skewing of the analysis.
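The consistency check can also be expressed without explicit row loops. The sketch below counts distinct values per ID and column, and reports IDs where any column has more than one distinct value. Unlike `find_inconsistent_duplicates`, it returns the affected IDs rather than row indices. The function name `conflicting_ids` and the toy data are assumptions for illustration.

```python
import pandas as pd


def conflicting_ids(frame, id_col):
    """Return IDs that occur more than once with differing values in any other column."""
    dupes = frame[frame.duplicated(subset=[id_col], keep=False)]
    # Number of distinct values per ID and column; more than one means a conflict
    distinct = dupes.groupby(id_col).nunique(dropna=False)
    return distinct.index[(distinct > 1).any(axis=1)].tolist()


# Toy data, made up for illustration
toy = pd.DataFrame({
    "CustomerID": ["a", "a", "b", "b", "c"],
    "num_transactions": [3, 3, 5, 6, 1],
    "dog_share": [0.2, 0.2, 0.5, 0.5, 0.9],
})
conflicts = conflicting_ids(toy, "CustomerID")
print(conflicts)  # ['b']
```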

Steps Taken:

* Data was loaded and reviewed for missing entries.

* Missing values were handled using K-Nearest Neighbors (KNN) Imputation to fill gaps efficiently.

* The data types of each column were verified and corrected where necessary.

* Numerical ranges were validated, and any values outside acceptable ranges were removed.

* Duplicate entries were detected, verified for consistency, and dropped.

Outcome:

* The dataset now contains 99,105 unique customer records, with clean and verified data, ready for further analysis.

# Exploratory Data Analysis **(EDA)**

## Statistical Summary

In [None]:
# Create summary statistics for the data
df.describe()

* Using the describe() function, we obtained a detailed overview of each variable, including key statistics such as mean, standard deviation, and quartiles. This provided an initial understanding of the distribution of the features and highlighted the presence of potential outliers, which were explored further in the subsequent visualizations.

## Data Visualization

In [None]:
# Create a histogram for each column
df.hist(figsize=(10, 10))
plt.show()

* The histograms of all numerical variables revealed varying distributions across features, with some exhibiting skewness. For example, num_transactions and total_order_value are skewed to the right, indicating the presence of a few customers with a high volume of transactions or large order values. This may support the CEO's hypothesis that there are also corporate customers.

In [None]:
df_copy = df.drop(columns='CustomerID')

# Creating multiple boxplots, one for each column
plt.figure(figsize=(10, 10))
for i, col in enumerate(df_copy.columns):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=df_copy[col])
    plt.title(col)
plt.tight_layout()
plt.show()

* The boxplots further reinforced the hypothesis by highlighting significant outliers, particularly in total_order_value and days_between_trans. These extreme values likely represent corporate customers, who order infrequently but in large volumes. Understanding the spread and identifying outliers is crucial for later feature engineering and clustering.

In [18]:
def plot_correlation_matrix(df):
    """
    Create and visualize a correlation matrix for the DataFrame using a heatmap.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the data to analyze.

    Returns:
    None: Displays a heatmap plot of the correlation matrix.
    """
    corr = df.corr()
    plt.figure(figsize=(10, 6))
    sns.heatmap(corr, annot=True, cmap='coolwarm')
    plt.title('Correlation Matrix')
    plt.show()


In [None]:
plot_correlation_matrix(df_copy)

The correlation matrix provided valuable insights into the relationships between features:

* days_between_trans and repeat_share (-0.67): This negative linear correlation suggests that customers with higher reordering rates tend to have shorter intervals between purchases. These could be customers who place routine, frequent orders, possibly on a weekly or monthly basis.

* days_between_trans and dog_share (0.41): The positive correlation indicates that customers who order less frequently tend to have a higher share of dog-related products in their purchases. However, this relationship requires further exploration as it could indicate differing customer types (e.g., occasional buyers focused on specific products like dog food).

## Feature Relationships and Engineering

In [None]:
# Creating pairplots for all numerical data
sns.pairplot(df)
plt.show()

The pairplot helped visualize the relationships between all features:

* num_transactions vs. total_order_value: This relationship highlights two distinct customer behaviors. Frequent, smaller orders could suggest private customers who need regular household supplies, while infrequent but large orders could represent smaller companies making bulk purchases. This supports the hypothesis of different customer segments: private individuals versus corporate clients.

* num_transactions & days_between_trans vs. repeat_share: The plots suggest three distinct customer segments. Two noticeable clusters of possibly private customers (those making regular, repeated purchases) are evident, as well as a group of outliers, likely representing corporate clients. The corporate clients typically have fewer transactions but higher order volumes.

### Dropping some of the Outliers in a 1st Step

In [None]:
# Drop outliers beyond the chosen z-score threshold, working on a new copy
df_no_outliers = df.copy()

# Define the columns to check for outliers
columns = ['num_transactions', 'total_order_value', 'days_between_trans', 'repeat_share', 'dog_share']

# Iterate over the columns and remove the outliers
for col in columns:
    # Calculate the z-scores for each value in the column
    z_scores = np.abs(stats.zscore(df_no_outliers[col]))

    # 99.9% confidence interval (2194 outliers)
    outlier_indices = np.where(z_scores > 3.291)[0]

    # 99.5% confidence interval (3523 outliers)
    # outlier_indices = np.where(z_scores > 2.807)[0]

    # 99% confidence interval (4532 outliers)
    # outlier_indices = np.where(z_scores > 2.58)[0]

    # 95% confidence interval (17560 outliers)
    # outlier_indices = np.where(z_scores > 1.96)[0] 

    # 90% confidence interval (29567 outliers)
    # outlier_indices = np.where(z_scores > 1.64)[0]

    # Drop the outliers
    df_no_outliers = df_no_outliers.drop(index=outlier_indices)

    # Reset the index
    df_no_outliers = df_no_outliers.reset_index(drop=True)

# How many outliers were removed
print(f"{len(df) - len(df_no_outliers)} outliers were removed.")

# Visualize the pairplots for the data without outliers
sns.pairplot(df_no_outliers)
plt.show()

After removing outliers with a z-score threshold of 3.291 (the two-sided 99.9% range under a normal distribution), we gained clearer insights into customer segments:

* total_order_value vs. num_transactions: There is now a distinct relationship between higher total order values and increased transaction frequency. Customers who frequently place orders also tend to reorder the same products, suggesting a routine purchasing behavior.

* Customer Segments: Two key customer segments are visible:
    1. Low-frequency, low-volume buyers: These customers tend to have a high number of days between transactions and low repeat orders. They may represent trial users who were not fully convinced by the product offerings and could be targeted with win-back strategies (e.g., special offers or discounts).
    2. High-frequency, high-volume buyers: These customers order frequently and tend to reorder the same products. They likely have a stable shopping pattern, making them ideal candidates for loyalty programs or early access to new products.

* Repeat Share vs. Days Between Transactions: A clear segmentation emerges here. Customers with short intervals between transactions tend to have higher repeat share percentages, indicating they rely on certain staple products. On the other hand, less frequent buyers have lower repeat shares, suggesting they might experiment with different products. This insight could be useful for targeted promotions focusing on new or complementary products for loyal customers.
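The z-score filter used for this step can also be written in a vectorized form. Note that this sketch is a variant, not the same procedure: it computes all z-scores once on the full data, whereas the loop above recomputes them after each column has been filtered, so the number of removed rows may differ. The helper `drop_zscore_outliers` and the toy data are assumptions for illustration.

```python
import pandas as pd


def drop_zscore_outliers(frame, columns, threshold=3.291):
    """Drop rows where any listed column has an absolute z-score above the threshold."""
    values = frame[columns]
    # Population standard deviation (ddof=0), matching scipy.stats.zscore's default
    z = (values - values.mean()) / values.std(ddof=0)
    keep = (z.abs() <= threshold).all(axis=1)
    return frame[keep].reset_index(drop=True)


# Toy data, made up for illustration: one extreme value in column "a"
toy = pd.DataFrame({
    "a": [1.0] * 99 + [1000.0],
    "b": [float(i) for i in range(100)],
})
filtered = drop_zscore_outliers(toy, ["a", "b"])
print(len(toy) - len(filtered), "rows removed")
```

A column with zero variance would produce undefined z-scores, so constant columns should be excluded before calling the helper.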

In [None]:
plot_correlation_matrix(df_no_outliers.drop(columns='CustomerID'))

After removing the outliers, the correlation matrix further validated several key relationships:

* num_transactions and total_order_value (0.98): This strong positive correlation highlights that frequent buyers naturally accumulate higher total order values over time. 

* days_between_trans and other features: The negative correlations between days_between_trans and both num_transactions (-0.84) and total_order_value (-0.79) emphasize that customers who order more frequently tend to have shorter gaps between transactions. This is expected and supports the segmentation of frequent buyers with routine purchasing behaviors.

* repeat_share vs. days_between_trans (-0.8): The negative correlation between the repeat order rate and days between transactions further indicates that customers with shorter transaction intervals tend to reorder the same products. These insights could inform the development of targeted promotions that focus on encouraging product trials or cross-selling to customers who rely on routine purchases.

* dog_share vs. num_transactions (-0.38): The inverse relationship here suggests that customers who place frequent orders tend to have a lower percentage of dog-related products. This is an important insight for product targeting, as infrequent buyers are more likely to be interested in dog-related products, whereas frequent buyers diversify their purchases beyond dog supplies.

### Creating a new Feature – avg_order_value

* Given the high correlation between total_order_value and num_transactions (0.98), we will derive a new feature, avg_order_value, to represent the average value of each order per customer. This transformation simplifies the analysis by consolidating these two highly correlated features into a more interpretable metric.

* Calculating the average order value provides a clearer insight into how much a customer spends per transaction, removing the noise of transaction frequency. This newly engineered feature will help in better understanding the purchasing behavior of different customer segments.
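Because the earlier range check keeps rows with num_transactions equal to 0, a plain division could produce infinite or undefined averages if such rows exist in the data. Whether they do is not stated here, so the following is only a defensive sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; the second customer has no transactions
toy = pd.DataFrame({
    "total_order_value": [120.0, 0.0, 45.0],
    "num_transactions": [4, 0, 3],
})

# Replace zero counts with NaN so the division yields NaN instead of inf or 0/0
safe_counts = toy["num_transactions"].replace(0, np.nan)
toy["avg_order_value"] = toy["total_order_value"] / safe_counts
print(toy)
```

Rows with a missing avg_order_value can then be inspected or excluded explicitly before clustering.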

In [None]:
# Calculate new feature avg_order_value
df_no_outliers['avg_order_value'] = df_no_outliers['total_order_value'] / df_no_outliers['num_transactions']

# Set the new column as the first column and drop the old columns
df_no_outliers = df_no_outliers[['CustomerID', 'avg_order_value', 'days_between_trans', 'repeat_share', 'dog_share']]

# Do the same for the original data
df['avg_order_value'] = df['total_order_value'] / df['num_transactions']
# df = df[['CustomerID', 'avg_order_value', 'days_between_trans', 'repeat_share', 'dog_share']]

# Check the new data
df_no_outliers

### Visualizing the Dataset with new Feature

In [None]:
# Use .describe but only for the new feature
df_no_outliers['avg_order_value'].describe()

In [None]:
# Create a pairplot that only shows the relationship of the new feature with the other features
sns.pairplot(df_no_outliers, y_vars='avg_order_value', x_vars=['days_between_trans', 'repeat_share', 'dog_share'])
plt.show()

* avg_order_value vs. days_between_trans: Customers who order more frequently (i.e., have lower days_between_trans) tend to have higher average order values. This relationship could indicate that frequent buyers are consistently purchasing higher quantities or more expensive products in each order.

* avg_order_value vs. repeat_share: A clear linear relationship is visible here. The higher the average order value, the greater the percentage of repeat items in each order. This suggests that customers with larger average purchases are more likely to reorder the same products regularly, indicating loyalty to specific products.

* avg_order_value vs. dog_share: The relationship between these two features is less pronounced. However, there is a slight tendency indicating that customers with lower average order values might have a higher percentage of dog-related products. This could hint at occasional or first-time buyers focused on specific pet-related needs.

In [None]:
# Create the correlation only for the new feature with the other features
corr_avg_order = df_no_outliers.drop(columns='CustomerID').corr()['avg_order_value']
corr_avg_order = corr_avg_order.drop('avg_order_value')

# Visualize these findings
plt.figure(figsize=(10, 6))
sns.barplot(x=corr_avg_order.values, y=corr_avg_order.index, palette='coolwarm', hue=corr_avg_order.values)
plt.title('Correlation of Average Order Value with the other Features')
plt.show()

The correlations between the new feature avg_order_value and the remaining features provide further support for the patterns observed in the pairplot:

* days_between_trans (-0.78): The negative correlation suggests that customers with shorter intervals between transactions tend to have higher average order values. This is consistent with the idea that frequent buyers make larger or more valuable purchases per transaction.

* repeat_share (0.86): The strong positive correlation shows a clear and significant relationship between average order value and the percentage of repeat items in each order. Customers with higher average order values are more likely to reorder the same products regularly, which indicates loyalty and possibly a reliance on staple products. This also implies that high-value customers may be ideal targets for promotions involving new products, as they already exhibit strong purchasing habits.

* dog_share (-0.40): The negative correlation between avg_order_value and dog_share suggests that customers with higher average order values tend to have a lower percentage of dog-related products in their orders. This supports the observation that occasional buyers, particularly those with smaller average order values, may be focusing more on dog-related items.

In [None]:
# Visualize the avg_order_value column in a boxplot
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(x=df['avg_order_value'])
plt.title('Original Data')

plt.subplot(1, 2, 2)
sns.boxplot(x=df_no_outliers['avg_order_value'])
plt.title('Data without Outliers')
plt.tight_layout()
plt.show()

* Full Dataset: The left boxplot, which includes the entire dataset, clearly highlights the presence of corporate customers with significantly higher average order values. These outliers, representing large, less frequent purchases, are distinct from the general population of private customers. This visual representation reinforces the need to segment these corporate customers for more targeted analysis later on.

* Outlier-Free Dataset: The right boxplot, after removing outliers, provides a more refined view of the average order value distribution among private customers. It shows a narrower range, giving a clearer sense of the central tendency and spread of average order values. Most regular customers have average order values clustered within a range of 14 to 25 euros, confirming that corporate customers were skewing the previous analysis.

This refined understanding of avg_order_value sets the stage for the next analysis, where corporate customers will be more precisely identified and characterized.

### Identifying Corporate Customers

#### Using avg_order_value

Step 1: Detecting Outliers with Interquartile Range (IQR)

* To precisely identify corporate customers, we begin by applying the Interquartile Range (IQR) method on the newly created feature, avg_order_value. This classic approach allows us to systematically identify outliers, particularly those with high average order values, which are indicative of corporate clients.

In [None]:
# using the IQR method to identify the outliers in the avg_order_value column
Q1 = df['avg_order_value'].quantile(0.25)
Q3 = df['avg_order_value'].quantile(0.75)
IQR = Q3 - Q1

# Calculate the number of outliers
outliers = df[(df['avg_order_value'] < (Q1 - 1.5 * IQR)) | (df['avg_order_value'] > (Q3 + 1.5 * IQR))]
print(f"Number of outliers in the avg_order_value column: {len(outliers)}")

* After computing the IQR, we identified 99 customers as potential outliers. However, a more detailed, manual inspection of the data will help to find a more refined boundary for accurate segmentation.
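Since the same IQR rule is applied to more than one column in this notebook, it could be wrapped in a small helper. The function name `iqr_bounds` and the toy values are assumptions for illustration.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) outlier fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data, made up for illustration
toy = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 500], dtype=float)
low, high = iqr_bounds(toy)
upper_outliers = toy[toy > high]
print(low, high, upper_outliers.tolist())  # 2.0 34.0 [500.0]
```

For identifying corporate customers, only the upper fence is of interest, since unusually low order values do not indicate a business buyer.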

In [None]:
# Use the original data and rank the dataset by the avg_order_value column in descending order
df_ranked = df.sort_values(by='avg_order_value', ascending=False).reset_index(drop=True)
df_ranked.head(len(outliers))

Step 2: Refining the Boundary for Corporate Customers

* Upon further analysis, we found a clear and significant boundary at index 95, where the average order value drops sharply from €2,370 to just €96.50 for the next customer (index 96). This strong difference suggests that the first 96 customers are very likely corporate clients, distinguished by their substantially higher average order values.

* Interestingly, the customer at index 96 (ID: tvs855) has a much lower avg_order_value of €96.50 but shares a total order value similar to the lower range of corporate clients. This observation prompts us to investigate total_order_value more thoroughly to detect other potential corporate customers based on total spending rather than per-order averages.
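The boundary described above was found by inspecting the ranked table. A sharp drop like this can also be located programmatically, for example by looking for the largest ratio between consecutive values in the descending ranking. This is a sketch under the assumption that all values are positive; the helper `count_above_largest_drop` and the toy values are hypothetical.

```python
import pandas as pd


def count_above_largest_drop(series):
    """Sort descending and return how many values lie above the largest relative drop."""
    ranked = series.sort_values(ascending=False).reset_index(drop=True)
    # Ratio of each value to its successor; a large ratio marks a sharp drop
    ratios = ranked / ranked.shift(-1)
    return int(ratios.idxmax()) + 1


# Toy data, made up for illustration
toy = pd.Series([40.0, 2500.0, 96.5, 2370.0, 20.0, 2400.0, 90.0])
n_above = count_above_largest_drop(toy)
print(n_above)  # 3
```

A result obtained this way should still be checked by eye, because a single isolated extreme value at the top of the ranking would also produce a large ratio.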

#### Using total_order_value

In [None]:
# Use the IQR method to identify the outliers in the total_order_value column
Q1 = df['total_order_value'].quantile(0.25)
Q3 = df['total_order_value'].quantile(0.75)
IQR = Q3 - Q1

# Calculate the number of outliers
outliers = df[(df['total_order_value'] < (Q1 - 1.5 * IQR)) | (df['total_order_value'] > (Q3 + 1.5 * IQR))]
print(f"Number of outliers in the total_order_value column: {len(outliers)}")

Step 1: Analyzing total_order_value for Further Outliers

* In our next step, we applied the same IQR technique to the total_order_value column. This method revealed 1,587 customers with notably high total order values.

In [None]:
# Rank the df by the total order value
df_ranked_total = df.sort_values(by='total_order_value', ascending=False).reset_index(drop=True)
df_ranked_total.iloc[1580:1595]

* At the lower boundary, the total order value drops from €6,726 at index 1586 to just €658 for the next customer. This sharp decline points to a second distinct cluster of likely corporate customers, possibly smaller businesses that place frequent but lower-value orders. These companies appear to order on a near-daily basis, a behavior uncommon for private individuals, and thus represent another segment of corporate clients.

In [None]:
# Within df_ranked_total.iloc[97:1587], check for customers with a days_between_trans value above 3
outliers = df_ranked_total.iloc[97:1587][df_ranked_total.iloc[97:1587]['days_between_trans'] > 3].index
print(outliers)

* While analyzing days_between_trans, we found two extreme outliers that suggested data entry errors. Given that the dataset covers a one-year period, it is impossible for the number of days between transactions to exceed 365, yet two customers had significantly higher values, suggesting a potential typo or miscalculation.

In [None]:
# Select days_between_trans for customers with exactly 401 transactions,
# excluding the two implausible values of 1000 days or more
days_subset = df.loc[(df['num_transactions'] == 401) & (df['days_between_trans'] < 1000), 'days_between_trans']

plt.figure(figsize=(10, 6))
sns.boxplot(x=days_subset)
plt.title('Boxplot of the days_between_trans column for the first 1587 companies of total_order_value')
plt.show()

# Calculate the mean of that column for the same selection
print(f"The mean of the days_between_trans column for the first 1587 companies of total_order_value is {days_subset.mean():.2f}")

* After visualizing the data without these two outliers in a boxplot, we observed that the remaining customers have days_between_trans values ranging from 1.4 to 2.5 days, with an average of 1.90 days. This suggests that the two erroneous entries are too large by a factor of 1,000, potentially due to a lost decimal separator, and should have been recorded as 2.346 and 2.349, respectively.

* To address this issue, we divide the problematic values by 1,000 to align them with the rest of the dataset. After correction, the data reflects realistic transaction frequencies and no longer skews the analysis.
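The correction can also be applied through a condition on the values rather than through hard-coded row labels, which keeps it valid if the ranking or index changes. The sketch below uses made-up rows and assumes that every value above 365 days in the selection is affected by the same factor of 1,000; on the real data, the mask should be restricted to the rows that were actually inspected.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "CustomerID": ["a", "b", "c", "d"],
    "days_between_trans": [1.9, 2346.0, 2.1, 2349.0],
})

# Within a one-year window, gaps above 365 days cannot be valid
too_large = toy["days_between_trans"] > 365
toy.loc[too_large, "days_between_trans"] = toy.loc[too_large, "days_between_trans"] / 1000
print(toy)
```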

In [34]:
# Correct the wrong values
df_ranked_total.loc[1231, 'days_between_trans'] = df_ranked_total.loc[1231, 'days_between_trans'] / 1000
df_ranked_total.loc[957, 'days_between_trans'] = df_ranked_total.loc[957, 'days_between_trans'] / 1000

In [None]:
# Check if there is any outlier left
print(f"Number of companies with a days_between_trans value of above 3: {len(df_ranked_total.iloc[97:1587][df_ranked_total.iloc[97:1587]['days_between_trans'] > 3])}")
print(f"Number of companies with a days_between_trans value of below 0.1: {len(df_ranked_total.iloc[97:1587][df_ranked_total.iloc[97:1587]['days_between_trans'] < 0.1])}")

Based on the analysis of both avg_order_value and total_order_value, along with the corrected days_between_trans values, we established three distinct segments:

1. Corporate Customers (first 96 customers in the avg_order_value ranking): 
    * Identified based on a significant boundary in avg_order_value and supported by their high total order value.

2. Frequent Corporate Customers (positions 97 to 1586 in the total_order_value ranking): 
    * Identified by their high total order values and low days_between_trans values, representing companies that order frequently but with smaller individual transactions.

3. Private Customers (Remaining customers): 
    * These customers exhibit lower values in both avg_order_value and total_order_value and likely represent individual buyers with more sporadic purchasing behavior.

With this segmentation, we can now tailor future analyses and strategies to better understand and target each group.

### Dividing the Dataset into Corporate and Private Customers

In [None]:
# Create the corporate customer dataframe by using only the first 96 rows
df_corporate1 = df_ranked.head(96)
df_corporate1

In [None]:
# Create the second corporate dataframe from positions 97 to 1586 of the total_order_value ranking
df_corporate2 = df_ranked_total.iloc[97:1587]
df_corporate2

In [None]:
# Create the private customer dataframe by using the remaining rows
df_private = df_ranked_total.iloc[1587:]
df_private = df_private.reset_index(drop=True)
df_private

In [None]:
# Check if the 3 datasets have a common CustomerID with each other
common_ids1 = df_corporate1['CustomerID'].isin(df_corporate2['CustomerID'])
common_ids2 = df_corporate1['CustomerID'].isin(df_private['CustomerID'])
common_ids3 = df_corporate2['CustomerID'].isin(df_private['CustomerID'])

# Print the results
print(f"Number of common CustomerIDs between corporate1 and corporate2: {common_ids1.sum()}")
print(f"Number of common CustomerIDs between corporate1 and private: {common_ids2.sum()}")
print(f"Number of common CustomerIDs between corporate2 and private: {common_ids3.sum()}")

The dataset is now divided into three segments:

* Corporate Customers 1: The first 96 customers based on their high average order values.

* Corporate Customers 2: Customers ranked 97 to 1586, distinguished by their frequent and high-volume transactions.

* Private Customers: The remaining customers, presumed to be individual buyers with lower total and average order values.
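
A pairwise `isin` check detects customers that appear in two segments, but it does not detect customers that fall into none of them. A gap between slice boundaries (for example a `head(96)` followed by an `iloc[97:...]` on the same ranking) would leave one position unassigned. The following is a minimal sketch of a check for both cases; `check_partition` is a hypothetical helper and the toy IDs are made up for illustration.

```python
import pandas as pd

def check_partition(full_ids, *segments):
    """Return (duplicated, missing) ID lists for a set of segments (hypothetical helper)."""
    combined = pd.concat([pd.Series(s) for s in segments], ignore_index=True)
    # IDs that occur in more than one segment
    duplicated = combined[combined.duplicated()].unique().tolist()
    # IDs of the full dataset that no segment contains
    missing = pd.Index(full_ids).difference(pd.Index(combined)).tolist()
    return duplicated, missing

# Toy ranking of 10 customers, sliced with a deliberate gap at position 3
ids = pd.Series(range(100, 110), name="CustomerID")
seg1 = ids.head(3)      # positions 0-2
seg2 = ids.iloc[4:7]    # positions 4-6 (position 3 is skipped)
seg3 = ids.iloc[7:]     # positions 7-9

duplicated, missing = check_partition(ids, seg1, seg2, seg3)
print(duplicated)  # []
print(missing)     # [103]
```

Applied to the real data, the same call would take the full `CustomerID` column and the `CustomerID` columns of the three segment dataframes.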

We then proceed by visualizing each dataset with boxplots and histograms, focusing on avg_order_value and total_order_value.

In [None]:
# Now in 2x3 grid, plot the boxplots for the avg_order_value and total_order_value columns for the 3 datasets
plt.figure(figsize=(18, 12))

# First row: Boxplots
plt.subplot(2, 3, 1)
sns.boxplot(x=df_corporate1['avg_order_value'])
plt.title('Corporate Customers 1 - Average Order Value')

plt.subplot(2, 3, 2)
sns.boxplot(x=df_corporate2['avg_order_value'])
plt.title('Corporate Customers 2 - Average Order Value')

plt.subplot(2, 3, 3)
sns.boxplot(x=df_private['avg_order_value'])
plt.title('Private Customers - Average Order Value')

# Second row: Boxplots
plt.subplot(2, 3, 4)
sns.boxplot(x=df_corporate1['total_order_value'])
plt.title('Corporate Customers 1 - Total Order Value')

plt.subplot(2, 3, 5)
sns.boxplot(x=df_corporate2['total_order_value'])
plt.title('Corporate Customers 2 - Total Order Value')

plt.subplot(2, 3, 6)
sns.boxplot(x=df_private['total_order_value'])
plt.title('Private Customers - Total Order Value')

# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
# Create a 2x3 grid of histograms for the three dataframes with avg_order_value and total_order_value
plt.figure(figsize=(18, 12))

# First row: Histograms of avg_order_value
plt.subplot(2, 3, 1)
sns.histplot(df_corporate1['avg_order_value'], bins=20)
plt.title('Corporate Customers 1 - Average Order Value')

plt.subplot(2, 3, 2)
sns.histplot(df_corporate2['avg_order_value'], bins=20)
plt.title('Corporate Customers 2 - Average Order Value')

plt.subplot(2, 3, 3)
sns.histplot(df_private['avg_order_value'], bins=20)
plt.title('Private Customers - Average Order Value')

# Second row: Histograms of total_order_value
plt.subplot(2, 3, 4)
sns.histplot(df_corporate1['total_order_value'], bins=20)
plt.title('Corporate Customers 1 - Total Order Value')

plt.subplot(2, 3, 5)
sns.histplot(df_corporate2['total_order_value'], bins=20)
plt.title('Corporate Customers 2 - Total Order Value')

plt.subplot(2, 3, 6)
sns.histplot(df_private['total_order_value'], bins=20)
plt.title('Private Customers - Total Order Value')

# Adjust layout
plt.tight_layout()
plt.show()

The boxplots and histograms generated for each segment are consistent with our segmentation. Each customer group occupies a distinct value range, with only a few outliers in each dataset.

* Corporate Customers 1: The range of avg_order_value and total_order_value aligns with our expectations for large, less frequent orders from corporate clients.

* Corporate Customers 2: A high total order value but with lower average order values, characteristic of companies making frequent but smaller orders.

* Private Customers: Interestingly, the histogram for private customers suggests at least two possible clusters, indicating varied spending behavior within this group.

In [None]:
# Show the correlation matrix for all customer datasets
plt.figure(figsize=(18, 6))

# First plot: Corporate Customers 1
plt.subplot(1, 3, 1)
corr_corporate1 = df_corporate1.drop(columns=['CustomerID','num_transactions','total_order_value']).corr()
sns.heatmap(corr_corporate1, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix - Corporate Customers 1')

# Second plot: Corporate Customers 2
plt.subplot(1, 3, 2)
corr_corporate2 = df_corporate2.drop(columns=['CustomerID','num_transactions','total_order_value']).corr()
sns.heatmap(corr_corporate2, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix - Corporate Customers 2')

# Third plot: Private Customers
plt.subplot(1, 3, 3)
corr_private = df_private.drop(columns=['CustomerID','num_transactions','total_order_value']).corr()
sns.heatmap(corr_private, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix - Private Customers')

# Adjust layout
plt.tight_layout()
plt.show()

After calculating the correlation matrix for each segment, we gained additional insights into the behavior of different customer groups.

1. Corporate Customers 1

    * Avg Order Value vs. Days Between Transactions (0.87): Customers who order less frequently tend to have higher average order values.

    * Avg Order Value vs. Repeat Share (-0.65): Customers with a higher average order value tend to have a lower share of repeated orders.
    
    * Avg Order Value vs. Dog Share (0.20): There is a slight positive correlation suggesting that higher-spending corporate customers might also order more dog-related products.

2. Corporate Customers 2

    * All correlations are very weak, ranging between -0.034 and 0.035. The features show no meaningful linear relationship with one another, which is consistent with a fairly homogeneous group of high-frequency, lower-average-order customers.

3. Private Customers

    * Avg Order Value vs. Days Between Transactions (-0.69): Higher-spending customers tend to place orders more frequently, suggesting a division into high-spending frequent buyers and low-spending infrequent buyers.

    * Avg Order Value vs. Repeat Share (0.86): Strong positive correlation, indicating that higher-spending customers also have a higher proportion of repeated orders.
    
    * Avg Order Value vs. Dog Share (-0.40): Interestingly, higher-spending private customers purchase fewer dog-related products, unlike the corporate segments.
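
The three heatmaps can also be condensed into one table that lists, per segment, the correlation of each feature with `avg_order_value`. The sketch below assumes each segment is a dataframe of numeric features; `corr_with_target` is a hypothetical helper, and the two toy segments are constructed so that the correlations are exactly +1 and -1.

```python
import pandas as pd

def corr_with_target(segments, target):
    """Collect each segment's Pearson correlations with `target` (hypothetical helper)."""
    columns = {
        name: frame.corr()[target].drop(target)
        for name, frame in segments.items()
    }
    return pd.DataFrame(columns)

# Toy segments with a perfectly positive and a perfectly negative relationship
seg_a = pd.DataFrame({"avg_order_value": [10.0, 20.0, 30.0, 40.0],
                      "days_between_trans": [5.0, 10.0, 15.0, 20.0]})
seg_b = pd.DataFrame({"avg_order_value": [40.0, 30.0, 20.0, 10.0],
                      "days_between_trans": [5.0, 10.0, 15.0, 20.0]})

table = corr_with_target({"segment_a": seg_a, "segment_b": seg_b}, "avg_order_value")
print(table.round(2))
```

A side-by-side table like this makes sign flips between segments, such as the one for `dog_share`, easier to spot than three separate plots.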

In [43]:
def plot_customers_3d(df):
    """
    Visualize a customer segment in a 3D scatter plot using average order value,
    days between transactions, and dog share, with points colored by repeat share.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the customer data of one segment.
    
    Returns:
    None: Displays a 3D scatter plot.
    """
    # Create a 3D scatter plot
    fig = px.scatter_3d(df, 
                        x='avg_order_value', 
                        y='days_between_trans', 
                        z='dog_share', 
                        color='repeat_share',
                        title='Customers - 3D Scatter Plot')

    # Set the size of the figure
    fig.update_layout(width=1600, height=800)
    
    # Show the plot
    fig.show()

In [None]:
plot_customers_3d(df_corporate1)

We visualized Corporate Customers 1 in a 3D scatter plot, using the following axes: avg_order_value, days_between_trans, and dog_share, with color representing the repeat_share. The plot revealed:

* A wide range of dog_share values, from 3% to 50%, for customers with lower average order values and more frequent orders.

* Less frequent but higher-spending corporate customers showed more variability in their dog_share, with no distinct separation. One high-spending corporate customer had a dog share of 62%, while another had only 8%.

In [None]:
plot_customers_3d(df_corporate2)

For the second corporate segment:

* The plot reaffirmed their high repeat_share values and low days_between_trans, with only two outliers in terms of avg_order_value. These outliers warrant closer inspection later.

In [None]:
plot_customers_3d(df_private)

* In this plot, we discovered some anomalous data points where days_between_trans exceeded 364 days. Given that the dataset covers a one-year period, these entries are likely errors and should be corrected.

In [None]:
# Select the private customers that have a days_between_trans value higher than 364 (the maximum)
df_private_hd = df_private[df_private['days_between_trans'] > 364]

plot_customers_3d(df_private_hd)

* After visualizing the data points with more than 364 days between transactions, we found that, except for their anomalous days between transactions, the rest of their data appeared reasonable. As such, we will correct these errors by setting their days_between_trans to a more realistic value.

In [None]:
plot_customers_3d(df_private[df_private['days_between_trans'] <= 364])

In [None]:
def count_customers_by_days_between(df, thresholds):
    """
    Print the number of customers with 'days_between_trans' exceeding specified thresholds.

    Parameters:
    df (pd.DataFrame): The DataFrame containing customer data.
    thresholds (list of int): List of thresholds for 'days_between_trans'.

    Returns:
    None: Prints the counts for each threshold.
    """
    for threshold in thresholds:
        count = len(df[df['days_between_trans'] > threshold])
        print(f"Number of customers that have more than {threshold} days between transactions: {count}")

# Example usage
thresholds = [364, 240, 236, 235, 234, 233, 1000]
count_customers_by_days_between(df_private, thresholds)


* We further visualized the days_between_trans for customers below 364 days and identified additional outliers. By analyzing the distribution of these values, we determined that a boundary of 235 days is appropriate. All values exceeding 235 days will be set to 235, providing a more consistent dataset for future analysis.
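
Capping at a boundary keeps every customer in the dataset and only limits the extreme values. As a minimal sketch on toy data, `Series.clip` expresses the same rule; the separate capped column and the count of affected rows are assumptions added here so the change stays auditable.

```python
import pandas as pd

toy = pd.DataFrame({"days_between_trans": [12.0, 180.5, 235.0, 240.0, 1000.0]})
cap = 235  # boundary chosen in the text above

# Count affected rows before changing anything
n_capped = int((toy["days_between_trans"] > cap).sum())

# Values above the cap are set to the cap; all other values are unchanged
toy["days_between_trans_capped"] = toy["days_between_trans"].clip(upper=cap)

print(n_capped)  # 2
print(toy["days_between_trans_capped"].tolist())  # [12.0, 180.5, 235.0, 235.0, 235.0]
```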

In [None]:
# Set the number of days between transactions to 235 for the private customers that have more than 235 days
df_private.loc[df_private['days_between_trans'] > 235, 'days_between_trans'] = 235

plot_customers_3d(df_private)

* After correcting the data, we visualized the private customer data once again. This final visualization offers valuable insights into the distribution of private customers, clearly indicating different clusters of spending behavior. These clusters will be the focus of further analysis in the next section.

For now, we will save the segmented datasets—Corporate Customers 1, Corporate Customers 2, and Private Customers—into CSV files for future use and modeling.

In [51]:
# Save the dataframes to csv files
df_corporate1.to_csv('corporate_customers1.csv', index=False)
df_corporate2.to_csv('corporate_customers2.csv', index=False)
df_private.to_csv('private_customers.csv', index=False)
df.to_csv('cleaned_data.csv', index=False)

# Clustering and Segmentation

What Has Been Done So Far:

* Data Preparation and Feature Engineering: We created new features such as avg_order_value to better understand customer purchasing behavior.

* Identification of Corporate Customers: Using statistical methods like the Interquartile Range (IQR), we segmented the dataset into Corporate Customers 1, Corporate Customers 2, and Private Customers.

* Data Visualization and Correction: Through boxplots, histograms, and 3D scatter plots, we visualized the data, identified anomalies, and corrected data errors to ensure accuracy.

What We Aim to Do Next:

* Finalize Clusters for Corporate and Private Customers: Use clustering techniques to identify distinct groups within each customer segment.

* Define Customer Characteristics: Analyze each cluster to understand their purchasing patterns and key characteristics.

* Develop Targeted Marketing Strategies: Based on the clusters, devise customized marketing approaches to effectively engage each customer group.

In [52]:
# Load all dataframes again (can be used as a new starting point for the code - except the libraries at the beginning)
df_corporate1 = pd.read_csv('corporate_customers1.csv')
df_corporate2 = pd.read_csv('corporate_customers2.csv')
df_private = pd.read_csv('private_customers.csv')
df = pd.read_csv('cleaned_data.csv')

## Corporate Customers

### Corporate Customers 1

* Although this segment is already quite small, it is worth breaking down further: it accounts for a high level of expenditure over the period under consideration, and its behavior is very broad, with no clear pattern in its distribution.

In [None]:
# Define a 3D scatter plot function
def scatter_plot3d(df, columns):
    """Plot columns[0..2] on the x, y, z axes and color the points by columns[3]."""
    fig = px.scatter_3d(df, x=columns[0], y=columns[1], z=columns[2], color=columns[3])

    # Set the size and the title of the figure
    fig.update_layout(width=1600, height=800, title='Customers - 3D Scatter Plot')
    fig.show()

scatter_plot3d(df_corporate1, ['avg_order_value', 'days_between_trans', 'dog_share', 'repeat_share'])

In [None]:
scatter_plot3d(df_corporate1, ['total_order_value', 'days_between_trans', 'dog_share', 'repeat_share'])

* These visualizations revealed potential clusters within the dataset. To further explore these clusters, we apply the K-means clustering algorithm using the features mentioned above, experimenting with two and four clusters.

In [None]:
def kmeans_clustering(df, columns_to_use, columns_to_drop, n_clusters):
    """
    Perform KMeans clustering on selected columns of a DataFrame and visualize the clusters in a 3D scatter plot.
    
    Parameters:
    df (pd.DataFrame): The DataFrame containing the data to be clustered.
    columns_to_use (list of str): List of column names to be used for clustering and visualization (x, y, z axes).
    columns_to_drop (list of str): List of column names to be dropped before clustering (e.g., identifiers and irrelevant columns).
    n_clusters (int): The number of clusters to form with KMeans.

    Returns:
    None: Adds a 'cluster' column to df in place and displays a 3D scatter plot of the clusters.
    """
    
    # Validate inputs
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input df must be a pandas DataFrame.")
    
    if not all(col in df.columns for col in columns_to_use):
        raise ValueError("Some columns_to_use are not present in the DataFrame.")
    
    if not all(col in df.columns for col in columns_to_drop):
        raise ValueError("Some columns_to_drop are not present in the DataFrame.")
    
    if not isinstance(n_clusters, int) or n_clusters <= 0:
        raise ValueError("n_clusters must be a positive integer.")

    # Create a copy of the DataFrame to avoid modifying the original data
    df_cluster = df.copy()

    # Drop specified columns
    df_cluster = df_cluster.drop(columns=columns_to_drop, errors='ignore')

    # Validate columns_to_use after dropping irrelevant columns
    if not all(col in df_cluster.columns for col in columns_to_use):
        raise ValueError("Not all columns_to_use are present in the DataFrame after dropping irrelevant columns.")

    # Standardize the data
    scaler = StandardScaler()
    df_cluster_scaled = scaler.fit_transform(df_cluster[columns_to_use])

    # Create and fit the KMeans model
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    df['cluster'] = kmeans.fit_predict(df_cluster_scaled)

    # Create a 3D scatter plot to visualize the clusters
    fig = px.scatter_3d(
        df,
        x=columns_to_use[0],
        y=columns_to_use[1],
        z=columns_to_use[2],
        color='cluster',
        title='3D Scatter Plot with KMeans Clusters'
    )
    fig.update_layout(width=1600, height=800)
    fig.show()

# Example usage
kmeans_clustering(
    df_corporate1, 
    ['avg_order_value', 'days_between_trans', 'dog_share'],  # Columns for visualization (x, y, z axes)
    ['CustomerID', 'total_order_value', 'num_transactions'],  # Columns to drop (irrelevant for clustering)
    2  # Number of clusters
)


In [None]:
# Call the function with the corporate customers dataframe, the columns and the number of clusters
kmeans_clustering(
    df_corporate1, ['total_order_value', 'days_between_trans', 'dog_share'], 
    ['CustomerID', 'avg_order_value', 'num_transactions'], 
    2
    )

In [None]:
# Use k-means clustering to cluster the corporate customers into four groups
kmeans_clustering(
    df_corporate1, ['avg_order_value', 'days_between_trans', 'dog_share'], 
    ['CustomerID', 'total_order_value', 'num_transactions'], 
    4
    )

In [None]:
# Now the same but with total_order_value instead of avg_order_value
kmeans_clustering(
    df_corporate1, ['total_order_value', 'days_between_trans', 'dog_share'], 
    ['CustomerID', 'avg_order_value', 'num_transactions'],
    4
    )

* While K-means provided initial insights, given the small size of this segment (96 customers), we opted for a more straightforward approach for segmentation. Instead of relying solely on clustering algorithms, we decided to segment customers based on key features that align closely with business objectives.

* We divide Corporate Customers 1 using total_order_value and num_transactions, categorizing each as "High" or "Low" relative to its mean value. This method considers both the volume and frequency of purchases, providing a practical framework for marketing strategies.

In [None]:
# Create a copy of the corporate customers dataframe
df_corporate_split = df_corporate1.copy()

# Calculate the mean for total_order_value and num_transactions
mean_total_order_value = df_corporate_split['total_order_value'].mean()
mean_num_transactions = df_corporate_split['num_transactions'].mean()

# Split based on whether the values are above or below the mean
df_corporate_split['total_order_value_split'] = df_corporate_split['total_order_value'].apply(
    lambda x: 'low' if x < mean_total_order_value else 'high'
)

df_corporate_split['num_transactions_split'] = df_corporate_split['num_transactions'].apply(
    lambda x: 'low' if x < mean_num_transactions else 'high'
)

# Combine the two splits into a single group variable for four distinct categories
df_corporate_split['group'] = (
    df_corporate_split['total_order_value_split'].astype(str) + '_' +
    df_corporate_split['num_transactions_split'].astype(str)
)

# Visualize the 2x2 grid in a 3D scatter plot with four different colors
fig = px.scatter_3d(
    df_corporate_split,
    x='total_order_value',
    y='days_between_trans',
    z='dog_share',
    color='group',  # Use the combined group variable for color
    symbol='group',  # Optionally, use different symbols for clarity
    labels={
        'total_order_value': 'Total Order Value',
        'days_between_trans': 'Days Between Transactions',
        'dog_share': 'Dog Share',
        'group': 'Group'
    }
)
fig.update_layout(
    width=1600,
    height=800,
    title='Corporate Customers - 3D Scatter Plot with 2x2 Grid (Split by Mean)',
    legend_title_text='Group (Total Order Value vs Num Transactions)'
)
fig.show()

# Set the found clusters as df_corporate1['cluster']
df_corporate1['cluster'] = df_corporate_split['group']
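
Because the same mean-based split is reused later with a different feature pair, it can be wrapped in a small function. The sketch below is one possible way to do this; `mean_split_groups` is a hypothetical helper, not part of the notebook. The first part of each label refers to the first column and the second part to the second column.

```python
import numpy as np
import pandas as pd

def mean_split_groups(df, first, second):
    """Label rows 'low'/'high' per column relative to the column mean (hypothetical helper)."""
    first_label = np.where(df[first] < df[first].mean(), "low", "high")
    second_label = np.where(df[second] < df[second].mean(), "low", "high")
    return pd.Series(first_label, index=df.index).str.cat(
        pd.Series(second_label, index=df.index), sep="_"
    )

toy = pd.DataFrame({
    "total_order_value": [100.0, 200.0, 900.0, 1000.0],  # mean 550
    "num_transactions": [2, 20, 4, 30],                   # mean 14
})
groups = mean_split_groups(toy, "total_order_value", "num_transactions")
print(groups.tolist())  # ['low_low', 'low_high', 'high_low', 'high_high']
```

Note that the mean is sensitive to extreme values, so in a skewed segment the "high" groups can end up much smaller than the "low" groups.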

### Model Evaluation

In [None]:
# Scatter plot with color based on 'cluster'
scatter_plot = alt.Chart(df_corporate1).mark_circle(size=60).encode(
    x=alt.X('avg_order_value', title='Average Order Value'),
    y=alt.Y('days_between_trans', title='Days Between Transactions'),
    color=alt.Color('cluster:N', scale=alt.Scale(scheme='category10'), title='Customer Segment'),
    tooltip=['CustomerID', 'avg_order_value', 'days_between_trans', 'repeat_share']
).properties(
    width=1200,
    height=800,
    title="Customer Segments by Order Value and Days Between Transactions"
).interactive()

scatter_plot.show()

In [None]:
df_corporate1['cluster'].value_counts()

## 2x2 Matrix: Customer Segmentation Based on Total Order Value and Number of Transactions

|                          | **Low Number of Transactions** <br> (Customers below the Average Number of Transactions) | **High Number of Transactions** <br> (Customers above the Average Number of Transactions) |
|--------------------------|-----------------------------------------------------------------------------|-----------------------------------------------------------------------------|
| **Low Total Order Value** <br> (Customers below the Average Total Order Value) | **Budget-Conscious Occasional Customers** <br> These customers make infrequent purchases and spend a low total amount. They may be cost-sensitive or not heavily engaged. | **Budget-Conscious Frequent Customers** <br> These customers make frequent purchases but still maintain a relatively low total order value, indicating smaller transaction sizes. |
| **High Total Order Value** <br> (Customers above the Average Total Order Value) | **High-Spending Occasional Customers** <br> These customers do not transact often, but when they do, they tend to spend a significant amount. They might be selective but valuable. | **High-Spending Frequent Customers** <br> These are highly valuable customers who make frequent purchases with high total order value, indicating strong engagement and consistent spending. |


In [None]:
# Visualize the clusters in a boxplot for each feature
def boxplot_clusters(df):
    # Create a boxplot for each feature in the corporate customers dataframe, but split by the cluster
    plt.figure(figsize=(18, 12))

    # Iterate over the columns
    for i, col in enumerate(df.columns[1:-1]):
        plt.subplot(3, 3, i + 1)
        sns.boxplot(x='cluster', y=col, data=df)
        plt.title(col)

    # Adjust layout
    plt.tight_layout()
    plt.show()

# Call the function with the corporate customers dataframe
boxplot_clusters(df_corporate1)

### Marketing Strategy Recommendations

* Budget-Conscious Occasional Clients:
    * Objective: Increase engagement and order frequency.
    * Strategy: Offer incentives such as discounts on repeat purchases or loyalty programs to encourage more frequent ordering.

* Budget-Conscious Frequent Clients:
    * Objective: Upsell and increase average order value.
    * Strategy: Promote bundled products or premium offerings to encourage larger purchases.

* High-Spending Occasional Clients:
    * Objective: Increase purchase frequency.
    * Strategy: Provide personalized communication highlighting new products and exclusive deals to entice more regular engagement.

* High-Spending Frequent Clients:
    * Objective: Retain and reward loyalty.
    * Strategy: Offer VIP programs, early access to new products, and personalized services to strengthen the relationship.

Note: While dog_share did not show a clear pattern within these segments, it may still be beneficial to customize marketing content based on this feature by categorizing it into intervals (e.g., low, medium, high) and tailoring product recommendations accordingly.

In [63]:
# Convert the cluster strings into integers
df_corporate1['cluster'] = df_corporate1['cluster'].map({'low_low': 0, 'low_high': 1, 'high_low': 2, 'high_high': 3})

## Corporate Customers 2

In [None]:
scatter_plot3d(df_corporate2, ['avg_order_value', 'days_between_trans', 'dog_share', 'repeat_share'])

In [None]:
scatter_plot3d(df_corporate2, ['total_order_value', 'days_between_trans', 'dog_share', 'repeat_share'])

We created 3D scatter plots for Corporate Customers 2 using the same features as before.

Observations:

* The dataset is highly homogeneous.

* Three customers stand out with significantly higher total_order_value (> €15,000).

* Wide range in dog_share values.

In [None]:
df_corporate2[df_corporate2['total_order_value']>15000]

In [None]:
# Save them in a dataframe of individually handled customers
df_individual = df_corporate2[df_corporate2['total_order_value']>15000]

# Drop them from the corporate customers 2
df_corporate2 = df_corporate2[df_corporate2['total_order_value']<=15000]
# Reindex the dataframe
df_corporate2 = df_corporate2.reset_index(drop=True)

# Visualize the corporate customers again with the 3d scatter plot (the outliers are already removed)
scatter_plot3d(df_corporate2, ['avg_order_value', 'days_between_trans', 'dog_share', 'repeat_share'])

* Isolating High-Value Customers: The three outliers were saved separately in df_individual for specialized attention. They are removed from df_corporate2 for the analysis that follows, but they are still meant to be assigned to a cluster.

* The remaining customers show close similarity in avg_order_value, total_order_value, and days_between_trans.

* The main differentiators are dog_share and repeat_share.

In [None]:
# Call the function with the corporate customers dataframe, the columns and the number of clusters
kmeans_clustering(df_corporate2, ['avg_order_value', 'days_between_trans', 'dog_share'], ['CustomerID', 'total_order_value', 'num_transactions'], 4)

* We attempted K-Means clustering with different numbers of clusters.

* No clear patterns emerged that would facilitate an effective marketing strategy.

* Instead, we focused on segmenting based on dog_share and repeat_share, splitting the data manually using the mean values of these features.
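
The statement that no clear patterns emerged can be backed with a number by computing the silhouette score for several values of k. Uniformly low scores would be consistent with a homogeneous segment. The following is a sketch on synthetic data with three obvious groups, so it shows what a clear result looks like. `silhouette_by_k` is a hypothetical helper, and `n_init=10` is set explicitly as an assumption to keep results stable across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def silhouette_by_k(features, k_values, random_state=42):
    """Return {k: silhouette score} for k-means on standardized features (hypothetical helper)."""
    scaled = StandardScaler().fit_transform(features)
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(scaled)
        scores[k] = silhouette_score(scaled, labels)
    return scores

# Synthetic data: three well-separated groups in two dimensions
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0.0, 5.0, 10.0)])

scores = silhouette_by_k(blobs, range(2, 6))
best_k = max(scores, key=scores.get)
print(best_k, {k: round(v, 2) for k, v in scores.items()})
```

On the real segment, the input would be the feature columns of `df_corporate2` instead of `blobs`.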

In [None]:
# Create a copy of the corporate customers dataframe
df_corporate_split = df_corporate2.copy()

# Calculate the mean for dog_share and repeat_share
mean_dog_share = df_corporate_split['dog_share'].mean()
mean_repeat_share = df_corporate_split['repeat_share'].mean()

# Split based on whether the values are above or below the mean
df_corporate_split['dog_share_split'] = df_corporate_split['dog_share'].apply(
    lambda x: 'low' if x < mean_dog_share else 'high'
)

df_corporate_split['repeat_share_split'] = df_corporate_split['repeat_share'].apply(
    lambda x: 'low' if x < mean_repeat_share else 'high'
)

# Combine the two splits into a single group variable for four distinct categories
df_corporate_split['group'] = (
    df_corporate_split['dog_share_split'].astype(str) + '_' +
    df_corporate_split['repeat_share_split'].astype(str)
)

# Visualize the 2x2 grid in a 3D scatter plot with four different colors
fig = px.scatter_3d(
    df_corporate_split,
    x='total_order_value',
    y='num_transactions',
    z='dog_share',
    color='group',  # Use the combined group variable for color
    symbol='group',  # Optionally, use different symbols for clarity
    labels={
        'total_order_value': 'Total Order Value',
        'num_transactions': 'Num Transactions',
        'dog_share': 'Dog Share',
        'group': 'Group'
    }
)
fig.update_layout(
    width=1600,
    height=800,
    title='Corporate Customers - 3D Scatter Plot with 2x2 Grid (Split by Mean)',
    legend_title_text='Group (Dog Share vs Repeat Share)'
)
fig.show()

# Set the found clusters as df_corporate2['cluster']
df_corporate2['cluster'] = df_corporate_split['group']

Segments:

* Low Dog Share & Low Repeat Share (low_low)

* Low Dog Share & High Repeat Share (low_high)

* High Dog Share & Low Repeat Share (high_low)

* High Dog Share & High Repeat Share (high_high)

### Model Evaluation

In [None]:
# Analyse the found clusters
df_corporate2['cluster'].value_counts()

In [None]:
# Visualize the clusters in a boxplot
boxplot_clusters(df_corporate2)

The boxplots confirmed that:

* avg_order_value, total_order_value, days_between_trans, and num_transactions are similar across clusters.

* The primary differences lie in repeat_share and dog_share.
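
A per-cluster summary table gives the same comparison in numbers. The sketch below uses toy values chosen for illustration (not taken from the dataset) and reports the median of each feature per cluster.

```python
import pandas as pd

toy = pd.DataFrame({
    "cluster": ["low_low", "low_low", "high_high", "high_high"],
    "dog_share": [0.10, 0.20, 0.70, 0.90],
    "repeat_share": [0.20, 0.30, 0.80, 0.90],
    "avg_order_value": [50.0, 54.0, 51.0, 53.0],
})

# Median of every feature per cluster; similar rows mean the feature does not separate the clusters
summary = toy.groupby("cluster").median()
print(summary)
```

In this toy table `avg_order_value` has the same median in both clusters, while `dog_share` and `repeat_share` differ clearly, which is the pattern the boxplots describe.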

## 2x2 Matrix: Customer Segmentation Based on Dog Share and Repeat Share

|                                                  | **Low Dog Share**<br>(Below Average)                                     | **High Dog Share**<br>(Above Average)                                     |
|--------------------------------------------------|-------------------------------------------------------------------------|---------------------------------------------------------------------------|
| **Low Repeat Share**<br>(Below Average)          | **Variety-Seeking Customers**<br>Purchase a diverse range of products with fewer repeats and less focus on dog-related items. | **Dog Product Explorers**<br>Low repeat purchases but higher interest in dog-related products. Opportunity to promote new dog products. |
| **High Repeat Share**<br>(Above Average)         | **Loyal Customers**<br>Frequently reorder the same products, not heavily focused on dog-related items. | **Dog Product Loyalists**<br>High repeat purchases and strong preference for dog-related products. Highly engaged in this category. |


### Marketing Strategy Recommendations

* Variety-Seeking Customers:
    * Approach: Introduce loyalty programs to encourage repeat purchases. Highlight new and diverse product offerings.
    * Content: Personalized recommendations based on past diverse purchases.

* Dog Product Explorers:
    * Approach: Promote new dog-related products and exclusive deals. Encourage trial of other product categories.
    * Content: Tailored promotions on dog products, cross-selling opportunities.

* Loyal Customers:
    * Approach: Reward loyalty with special offers. Introduce them to complementary products to enhance their usual purchases.
    * Content: Exclusive discounts on frequently purchased items, bundled offers.

* Dog Product Loyalists:
    * Approach: Offer premium dog-related products and services. Engage them with community events or content related to dog care.
    * Content: Early access to new dog products, invitations to dog-related events, informative content.

In [72]:
# Change the cluster names into integer values
df_corporate2['cluster'] = df_corporate2['cluster'].map({'low_low': 4, 'low_high': 5, 'high_low': 6, 'high_high': 7})

## Private Customers

* To begin analyzing the private customers, we first plot two 3D scatter plots to gain a visual understanding of the dataset. These plots help us identify patterns and potential clusters within the data.

In [None]:
scatter_plot3d(df_private, ['avg_order_value', 'days_between_trans', 'dog_share', 'repeat_share'])

In [None]:
scatter_plot3d(df_private, ['total_order_value', 'days_between_trans', 'dog_share', 'repeat_share'])

* Compared to the corporate customer segments, the private customer data appears more dispersed, making it less straightforward to segment. While we can visually detect the possibility of two or three clusters, this approach lacks precision.

* To refine our understanding, we will employ the elbow method to determine the optimal number of clusters. After that, we will use the k-means algorithm to effectively cluster the customers.

The Elbow Method

* The elbow method is a commonly used technique in cluster analysis to identify the optimal number of clusters in a dataset. 

* It works by running the k-means algorithm for a range of cluster numbers (k) and plotting the total within-cluster sum of squares (inertia) against the number of clusters. 

* The goal is to find the point where the inertia starts to decrease more slowly, forming an "elbow" shape in the graph. 

* This point typically indicates the optimal number of clusters, where adding more clusters does not significantly improve the model.
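
For reference, the inertia can be written as follows. The notation is an assumption introduced here: $C_j$ is the set of points assigned to cluster $j$, and $\mu_j$ is the centroid of that cluster.

$$
W(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

For the best partition at each $k$, $W(k)$ cannot increase as $k$ grows, which is why the curve is read for its bend rather than for its minimum.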

In [None]:
def elbow_method(df, max_clusters=10):
    """
    Runs the elbow method to determine the optimal number of clusters.
    Args:
        df: Input DataFrame with relevant features.
        max_clusters: The maximum number of clusters to test.
    Returns:
        A plot showing the inertia values for each k.
    """
    # Standardize the data to ensure each feature contributes equally
    scaler = StandardScaler()
    df_scaled = scaler.fit_transform(df)

    inertia_values = []

    # Iterate over possible cluster sizes (k) to calculate inertia for each
    for k in range(1, max_clusters + 1):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(df_scaled)
        inertia_values.append(kmeans.inertia_)

    # Plotting the inertia values to visualize the "elbow"
    plt.figure(figsize=(8, 5))
    plt.plot(range(1, max_clusters + 1), inertia_values, marker='o', linestyle='-', color='b')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia')
    plt.title('Elbow Method: Optimal Number of Clusters')
    plt.show()

# Run the elbow method for the private customers
elbow_method(df_private.drop(columns=['CustomerID', 'total_order_value', 'num_transactions']))


* Based on the elbow graph, we proceed with k-means clustering using 3 clusters. Note that the elbow run above includes repeat_share as a feature, whereas the clustering call below uses only avg_order_value, days_between_trans, and dog_share.

In [None]:
# kmeans first
kmeans_clustering(df_private, ['avg_order_value', 'days_between_trans', 'dog_share'], ['CustomerID', 'total_order_value', 'num_transactions'], 3)

### Model Evaluation

In [None]:
df_private

In [None]:
# Visualize the clusters in a boxplot
boxplot_clusters(df_private)

For the corporate customers, we were able to divide customers into segments using our own rule-based logic. This was not as straightforward for the private customers, so we use two common clustering metrics to further validate the k-means result.

In [None]:
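# Create a copy of df_private without the CustomerID, total_order_value, and num_transactions columns
df_private_cluster = df_private.drop(columns=['CustomerID', 'total_order_value', 'num_transactions'])

# Calculate the silhouette score for the KMeans model.
# The 'cluster' column is excluded from the features so the labels do not influence the distances.
silhouette_score(df_private_cluster.drop(columns=['cluster']), df_private_cluster['cluster'])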

Silhouette Score (0.61): 

* The silhouette score ranges from -1 to 1, with higher values indicating better-defined clusters. 

* A score of 0.61 suggests that the clusters are reasonably well-defined and separated: on average, data points lie considerably closer to the members of their own cluster than to those of the nearest neighboring cluster. 

* This indicates that the clustering algorithm has effectively grouped customers with similar behavioral patterns, such as purchase frequency, order value, and product preferences.

* While not perfect, this score implies that the segmentation is strong enough to identify meaningful patterns for targeted marketing strategies.
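Because the silhouette score is distance-based, it depends on which columns are passed as features and on their scale. The following is a minimal sketch on synthetic data (not the customer data), assuming the score is computed on the same standardized features the model was fitted on. It also uses the `sample_size` parameter of `silhouette_score`, which scores a random subset and keeps the pairwise-distance computation manageable for roughly 100k customers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features: three well-separated blobs
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(200, 2)) for c in centers])

# Score on the same standardized features that the model was fitted on
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# sample_size evaluates a random subset instead of all pairwise distances
score = silhouette_score(X_scaled, labels, sample_size=300, random_state=42)
print(f"Silhouette score: {score:.2f}")
```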

In [None]:
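# Calculate the Davies-Bouldin score for the KMeans model.
# The 'cluster' column is excluded from the features so the labels do not influence the distances.
davies_bouldin_score(df_private_cluster.drop(columns=['cluster']), df_private_cluster['cluster'])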

Davies-Bouldin Score (0.54): 

* The Davies-Bouldin score measures the average similarity between each cluster and its most similar neighbor, with lower values indicating better clustering (0 is the minimum). 

* A score of 0.54 suggests that the clusters are relatively distinct, with minimal overlap. 

* This means that the differences between customer groups (such as high vs. low spenders or frequent vs. infrequent buyers) are clearly defined, allowing for more precise targeting in marketing efforts.
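To illustrate why lower is better, the sketch below compares the Davies-Bouldin score of two well-separated clusters with that of two overlapping clusters. The data is synthetic and only serves as an illustration; it says nothing about the customer segments themselves.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)

# Two clusters whose centres are far apart relative to their spread
separated = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(200, 2)),
])

# The same two clusters moved close together so that they overlap
overlapping = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=1.5, scale=1.0, size=(200, 2)),
])

db_separated = davies_bouldin_score(separated, labels)
db_overlapping = davies_bouldin_score(overlapping, labels)
print(f"separated: {db_separated:.2f}, overlapping: {db_overlapping:.2f}")
```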

### Marketing Strategy Recommendations

In [None]:
# df_private filtered for cluster 0
df_private[df_private['cluster'] == 0].describe()

Segment 0: Frequent High-Spending Customers (57,844 customers)

* Transaction Frequency: Highest among all segments, with an average of 14 transactions (max 24)

* Total Order Value: Highest among all segments, with an average of €340 (standard deviation of 80)

* Days Between Transactions: Lowest, averaging 36 days (mostly between 31 and 36 days)

* Repeat Share: Highest, at 39%

* Dog Share: Low, averaging 20%

* Average Order Value: Highest, at €24 per order

Marketing Strategy: This group is the most valuable segment, making frequent, high-value purchases. Their high repeat share suggests strong loyalty to certain products. A targeted strategy for this segment could focus on:

* Exclusive Loyalty Programs: Offering personalized loyalty rewards to encourage repeat purchasing.

* Upselling and Cross-selling: Suggest complementary products based on their high spending behavior.

* Early Access to New Products: These customers are engaged and likely to appreciate early access or exclusive products.

* Subscription Services: Given their frequent orders, offer them a convenient subscription model with recurring delivery options.

In [None]:
# df_private filtered for cluster 1
df_private[df_private['cluster'] == 1].describe()

Segment 1: Infrequent Dog-Centric Customers (10,326 customers)

* Transaction Frequency: Low, with an average of only 4 transactions

* Total Order Value: Lowest, at an average of €11

* Days Between Transactions: Highest, averaging 230 days

* Repeat Share: Lowest, at 10%

* Dog Share: By far the highest, at 59%

* Average Order Value: Lowest, at €11

Marketing Strategy: These customers primarily buy dog-related products and do so infrequently. Their low repeat share and high dog share suggest they come for specific needs and then disengage. The marketing strategy here should focus on:

* Targeted Promotions on Dog Products: Offer discounts and special offers on dog-related items to drive engagement and repeat purchases.

* Educational Content: Provide personalized, informative content such as newsletters or blog posts about dog care, nutrition, or seasonal product recommendations.

* Re-engagement Campaigns: Since they have a long gap between transactions, automated re-engagement emails with time-sensitive offers could help bring them back sooner.

* Bundle Offers: Encourage them to purchase more items per transaction by offering bundle deals focused on dog care essentials.

In [None]:
# df_private filtered for cluster 2
df_private[df_private['cluster'] == 2].describe()

Segment 2: Occasional Moderate-Spending Customers (29,348 customers)

* Transaction Frequency: Medium, averaging 10 transactions (but some outliers up to 4000)

* Total Order Value: Moderate, averaging €70 (standard deviation of 54)

* Days Between Transactions: Medium, at an average of 102 days

* Repeat Share: Medium, at 21%

* Dog Share: Low, averaging 21%

* Average Order Value: Moderate, at €15 per order

Marketing Strategy: This segment represents a more balanced customer group that orders occasionally but shows moderate spending and engagement. To increase their frequency or spending:

* Seasonal Campaigns: Offer seasonal promotions or limited-time deals to spur additional purchases throughout the year.

* Loyalty Incentives: Introduce tiered loyalty programs to gradually increase engagement and reward customers based on purchase frequency or order value.

* Personalized Recommendations: Leverage purchase data to provide personalized product recommendations that align with their past behavior.

* Targeted Emails: Sending customized emails with product suggestions based on past transactions and browsing habits may increase engagement and total order value.

In [84]:
# Relabel clusters 0, 1, 2 as 8, 9, 10
df_private['cluster'] = df_private['cluster'].map({0: 8, 1: 9, 2: 10})

Each customer segment presents unique behaviors and opportunities for targeted marketing. Frequent High-Spending Customers are loyal and valuable, requiring a focus on personalization and exclusivity. Infrequent Dog-Centric Customers should be re-engaged with targeted promotions and educational content on dog-related products, while Occasional Moderate-Spending Customers benefit from a balanced approach combining personalized offers, loyalty rewards, and seasonal campaigns.

By tailoring the marketing strategies to these distinct segments, the company can optimize customer engagement, drive repeat business, and maximize revenue potential across the board.
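The per-segment figures quoted above come from three separate `describe()` calls. As a minimal sketch, assuming a frame with a `cluster` column and numeric feature columns like `df_private`, the same overview can be collected in a single table. `profile_segments` is a hypothetical helper, and the toy values below are made up for illustration.

```python
import pandas as pd


def profile_segments(df, cluster_col="cluster", id_col="CustomerID"):
    """Return one row per segment with its size and the mean of each numeric feature."""
    features = df.drop(columns=[id_col], errors="ignore")
    sizes = features.groupby(cluster_col).size().rename("n_customers")
    means = features.groupby(cluster_col).mean(numeric_only=True).round(2)
    return pd.concat([sizes, means], axis=1)


# Tiny made-up frame using the notebook's column names (values are illustrative only)
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "days_between_trans": [30, 35, 220, 240, 100],
    "dog_share": [0.2, 0.2, 0.6, 0.58, 0.21],
    "cluster": [0, 0, 1, 1, 2],
})
profile = profile_segments(toy)
print(profile)
```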

# Summary

## Corporate Customers 1

### 2x2 Matrix: Customer Segmentation Based on Total Order Value and Number of Transactions

|                          | **Low Number of Transactions** <br> (Customers below the average number of transactions) | **High Number of Transactions** <br> (Customers above the average number of transactions) |
|--------------------------|-----------------------------------------------------------------------------|-----------------------------------------------------------------------------|
| **Low Total Order Value** <br> (Customers below the average total order value) | **Budget-Conscious Occasional Customers** <br> These customers make infrequent purchases and spend a low total amount. They may be cost-sensitive or not heavily engaged. | **Budget-Conscious Frequent Customers** <br> These customers make frequent purchases but still maintain a relatively low total order value, indicating smaller transaction sizes. |
| **High Total Order Value** <br> (Customers above the average total order value) | **High-Spending Occasional Customers** <br> These customers do not transact often, but when they do, they tend to spend a significant amount. They might be selective but valuable. | **High-Spending Frequent Customers** <br> These are highly valuable customers who make frequent purchases with high total order value, indicating strong engagement and consistent spending. |


### Key Marketing Strategies:

* Budget-Conscious Occasional Customers: Incentivize repeat purchases through discounts and loyalty programs.

* Budget-Conscious Frequent Customers: Upsell premium or bundled products to increase basket size.

* High-Spending Occasional Customers: Personalized offers for exclusive products and services to increase purchase frequency.

* High-Spending Frequent Customers: VIP treatment, early access to products, and personalized communication to retain loyalty.
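The 2x2 logic can be sketched as follows. This is a minimal illustration, assuming that "average" means the arithmetic mean of each column within the corporate subset and that values equal to the mean count as "low". `assign_quadrant` is a hypothetical helper, the integer codes 0 to 3 are illustrative rather than the cluster labels used in this notebook, and the toy values are made up.

```python
import pandas as pd


def assign_quadrant(df, row_col, col_col):
    """Label each row by whether it lies above the mean of row_col and col_col."""
    high_row = df[row_col] > df[row_col].mean()
    high_col = df[col_col] > df[col_col].mean()
    # Encode the four combinations as 0..3: low/low, low/high, high/low, high/high
    return (high_row.astype(int) * 2 + high_col.astype(int)).rename("quadrant")


# Made-up corporate customers, one per quadrant
toy = pd.DataFrame({
    "total_order_value": [100.0, 120.0, 900.0, 950.0],
    "num_transactions": [2, 40, 3, 45],
})
toy["quadrant"] = assign_quadrant(toy, "total_order_value", "num_transactions")
print(toy)
```

The same helper would apply to the second matrix by passing `repeat_share` and `dog_share` as the two columns.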

## Corporate Customers 2

### 2x2 Matrix: Customer Segmentation Based on Dog Share and Repeat Share

|                                                  | **Low Dog Share**<br>(Below Average)                                     | **High Dog Share**<br>(Above Average)                                     |
|--------------------------------------------------|-------------------------------------------------------------------------|---------------------------------------------------------------------------|
| **Low Repeat Share**<br>(Below Average)          | **Variety-Seeking Customers**<br>Purchase a diverse range of products with fewer repeats and less focus on dog-related items. | **Dog Product Explorers**<br>Low repeat purchases but higher interest in dog-related products. Opportunity to promote new dog products. |
| **High Repeat Share**<br>(Above Average)         | **Loyal Customers**<br>Frequently reorder the same products, not heavily focused on dog-related items. | **Dog Product Loyalists**<br>High repeat purchases and strong preference for dog-related products. Highly engaged in this category. |


### Key Marketing Strategies:

* Variety-Seeking Customers: Highlight new and diverse product offerings, introduce loyalty programs.

* Dog Product Explorers: Promote new dog-related products, encourage cross-category purchases.

* Loyal Customers: Reward loyalty with special offers, cross-sell complementary products.

* Dog Product Loyalists: Offer premium dog products and services, create engagement through exclusive dog-related content.

## Private Customers

### Key Marketing Strategies:

Frequent High-Spending Customers:

* Exclusive loyalty programs to drive further engagement.

* Upselling and cross-selling opportunities to maximize value.

* Early access to new products to reward loyalty.

* Subscription services to maintain consistent transactions.


Infrequent Dog-Centric Customers:

* Targeted promotions and discounts on dog-related products.

* Educational content on dog care to build trust and engagement.

* Re-engagement campaigns to reduce long transaction gaps.

* Bundle offers for dog-related products to increase transaction size.


Occasional Moderate-Spending Customers:

* Seasonal campaigns to drive purchases during key times.

* Loyalty programs to boost engagement and encourage repeat buying.

* Personalized recommendations to drive higher average order values.

* Targeted emails to increase relevance and customer interaction.

# Recommendations and Next Steps

Based on the insights gathered from the customer segmentation analysis, the following strategic recommendations and future initiatives are proposed to enhance customer engagement, optimize marketing efforts, and drive business growth.

1. Implement an Automated Customer Segmentation Model

    * Leverage the customer segments identified in this analysis to build an automated model that dynamically manages and targets different customer groups. This will enable the business to:

        * Personalize marketing strategies according to the unique needs and behaviors of each segment.

        * Streamline communications, promotions, and product recommendations for high-value and high-potential customer groups.

        * Continuously update customer classifications as new data comes in, ensuring that the company can respond proactively to changing customer behaviors.

2. Incorporate Additional Data for a Refined Analysis

    * RFM Analysis: Use detailed order data, including purchase dates, to implement a Recency, Frequency, and Monetary (RFM) model. This model will allow for:
        
        * A deeper understanding of customer purchasing habits over time.
        
        * Identifying customers who are likely to engage with offers based on how recently and frequently they have made purchases.

    * Customer Lifetime Value (CLV) Modeling: For businesses with historical data spanning at least 2-3 years, implementing a CLV model can provide critical insights into:
        
        * Which customer segments are likely to deliver the highest long-term value.
        
        * Which customers should be prioritized for loyalty programs, personalized offers, and retention strategies.
        
        * Tools such as the PyMC-Marketing library can be employed to develop a probabilistic CLV model that factors in future revenue potential, helping to optimize investment in customer retention and acquisition.

3. Develop a Personalized Product Recommendation System

    * Utilize customer transaction history to build a recommendation engine that suggests products based on individual purchasing behavior. This model could:
        
        * Predict customer preferences, improving upselling and cross-selling efforts by offering highly relevant product suggestions.
        
        * Increase average order value and customer satisfaction by offering tailored product bundles or promotions.
        
        * Improve customer retention by providing timely recommendations for repeat purchases based on prior buying patterns.
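As a starting point for the automated segmentation model in recommendation 1, the sketch below bundles scaling and clustering into a single scikit-learn `Pipeline` so that new customers can be assigned to a segment with `predict`. It is a sketch under assumptions rather than a production design: the feature list mirrors the private-customer clustering above, and the training data is randomly generated for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["avg_order_value", "days_between_trans", "dog_share"]

# Bundling the scaler with the model ensures that new customers are scaled
# with the statistics learned at training time
segmenter = Pipeline([
    ("scale", StandardScaler()),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=42)),
])

# Made-up training data standing in for the private-customer features
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "avg_order_value": np.concatenate(
        [rng.normal(24, 2, 50), rng.normal(11, 2, 50), rng.normal(15, 2, 50)]),
    "days_between_trans": np.concatenate(
        [rng.normal(36, 5, 50), rng.normal(230, 20, 50), rng.normal(102, 10, 50)]),
    "dog_share": np.concatenate(
        [rng.normal(0.2, 0.03, 50), rng.normal(0.59, 0.03, 50), rng.normal(0.21, 0.03, 50)]),
})
segmenter.fit(train[FEATURES])

# Assign segments to customers that arrive after the model was fitted
new_customers = pd.DataFrame({
    "avg_order_value": [25.0, 10.0],
    "days_between_trans": [33.0, 240.0],
    "dog_share": [0.18, 0.60],
})
new_segments = segmenter.predict(new_customers[FEATURES])
print(new_segments)
```

Refitting on a schedule would change the cluster numbering between runs, so a stable mapping from cluster to segment name would need to be maintained separately.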

# Final Output

In [None]:
# Check that the cluster labels do not overlap across the three DataFrames
print(df_corporate1['cluster'].unique())
print(df_corporate2['cluster'].unique())
print(df_private['cluster'].unique())

In [None]:
# Combine all three DataFrames, keeping only the CustomerID and cluster columns
df_final = pd.concat(
    [df[['CustomerID', 'cluster']] for df in (df_corporate1, df_corporate2, df_private)],
    ignore_index=True,
)
df_final
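Before saving, it can be worth checking that the combined table has the requested shape. The following is a minimal sketch, assuming the expected output has exactly the columns `CustomerID` and `cluster` and that each customer belongs to one segment only. `validate_final_output` is a hypothetical helper and the example frame is made up.

```python
import pandas as pd


def validate_final_output(df_final):
    """Raise a ValueError if the final table does not have the expected shape."""
    if list(df_final.columns) != ["CustomerID", "cluster"]:
        raise ValueError("expected exactly the columns CustomerID and cluster")
    if not df_final["CustomerID"].is_unique:
        raise ValueError("a CustomerID appears more than once")
    if df_final["cluster"].isna().any():
        raise ValueError("some customers have no cluster assigned")
    return True


# Made-up example with one customer per segment
example = pd.DataFrame({"CustomerID": [1, 2, 3], "cluster": [0, 9, 10]})
is_valid = validate_final_output(example)
print(is_valid)
```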

In [87]:
# Save the clustered data to a CSV file
df_final.to_csv('Clustered_Data.csv', index=False)