## TL;DR

- **Unsupervised Learning** -> Clustering algorithms are used for unsupervised learning, ideal for exploratory data analysis.
- **Grouping Data** -> These algorithms group similar data into clusters based on specific criteria.
- **Variety of Applications** -> They're used in diverse fields like customer segmentation, anomaly detection, and more.
- **Different Techniques** -> Various types exist, like K-means and DBSCAN, each with unique strengths and suited for specific data types.
- **Choice of Parameters** -> The selection and tuning of parameters, like the number of clusters, significantly influence the results.

In machine learning, clustering algorithms uncover hidden patterns in data. They group data points by the similarity of their features, without the need for labeled data; these groups are referred to as clusters.
Several families of algorithms can be used for cluster analysis:

- centroid-based (K-means)
- connectivity-based, also known as hierarchical clustering (agglomerative clustering)
- distribution-based (Gaussian mixture modelling)
- density-based (DBSCAN)
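
To make the four families concrete, here is a minimal sketch that runs one scikit-learn representative of each on the same toy data. The blobs, `eps`, `min_samples`, and the other parameter values are assumptions picked for illustration, not settings used later in this post.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# toy data: three well-separated blobs (an assumption, not the mall data)
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [8, 0], [0, 8]],
    cluster_std=0.6,
    random_state=0,
)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
agglo = AgglomerativeClustering(n_clusters=3)
gmm = GaussianMixture(n_components=3, random_state=0)
# DBSCAN infers the number of clusters itself; the label -1 marks noise points
dbscan = DBSCAN(eps=1.0, min_samples=5)

labels = {
    "centroid (K-means)": kmeans.fit_predict(X),
    "connectivity (agglomerative)": agglo.fit_predict(X),
    "distribution (Gaussian mixture)": gmm.fit_predict(X),
    "density (DBSCAN)": dbscan.fit_predict(X),
}

for name, lab in labels.items():
    n_found = len(set(lab) - {-1})  # ignore the DBSCAN noise label
    print(f"{name}: {n_found} clusters")
```

Note that the first three need the number of clusters up front, while DBSCAN is steered by a neighbourhood radius and a minimum number of neighbours instead.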

Clustering algorithms are used in a diverse range of real-world applications, from customer segmentation in marketing and image segmentation in computer vision to anomaly detection in cybersecurity.

To demonstrate the workflow, I will use a K-means clustering algorithm to group together similar shoppers at a shopping mall.

## Business Problem Introduction

As the owners of a thriving supermarket mall, you seek a deeper understanding of your customer base. Data points including demographics, spending habits, and loyalty program membership details are at your disposal. Your goal is to identify distinct groups within your customers to tailor marketing initiatives, thereby maximizing the efficiency of your promotional efforts.

To achieve this, you opt for clustering algorithms to group similar customers based on their characteristics and behaviors. By isolating specific segments of your customer base, you can tailor marketing strategies to each subgroup, increasing the likelihood of a successful outcome and maximizing the bang for your marketing buck.

![a lovely shopping mall ](/posts/cluster/artifacts/sung-jin-cho-BbVGAjfAQ4o-unsplash.jpg)


Photo by <a href="https://unsplash.com/@mbuff?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Sung Jin Cho</a> on <a href="https://unsplash.com/photos/people-walking-inside-white-building-BbVGAjfAQ4o?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Unsplash</a>

## Data Preprocessing

In this initial stage, the goal is to prepare the data for analysis. This involves cleaning the data by removing or filling in missing values, which could be done through various strategies like dropping the missing rows, filling them with mean/median/mode, or using a prediction model. It's also crucial to handle outliers and potentially normalize features if they're on different scales. This stage might also involve dealing with categorical variables using encoding techniques. Effective preprocessing is crucial for reliable results in the subsequent stages.
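
As a minimal sketch of the strategies mentioned above, the snippet below drops incomplete rows in one variant and fills the gaps with the median (numeric) or mode (categorical) in another. The `df_raw` frame and its values are made up for illustration; the mall dataset itself turns out to have no missing values.

```python
import numpy as np
import pandas as pd

# hypothetical raw data with gaps; not the mall dataset
df_raw = pd.DataFrame(
    {
        "age": [23, np.nan, 41, 35],
        "income": [30.0, 52.0, np.nan, 75.0],
        "gender": ["Male", "Female", "Female", None],
    }
)

# option 1: drop every row that has at least one missing value
df_dropped = df_raw.dropna()

# option 2: fill numeric gaps with the median and categorical gaps with the mode
df_filled = df_raw.assign(
    age=df_raw["age"].fillna(df_raw["age"].median()),
    income=df_raw["income"].fillna(df_raw["income"].median()),
    gender=df_raw["gender"].fillna(df_raw["gender"].mode().iloc[0]),
)

print(df_dropped)
print(df_filled)
```

Dropping rows is the simplest option but throws away data; filling keeps every row at the cost of slightly flattening the distributions.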


The owner of the mall has provided the following [dataset](https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python):

| Field                  | Description                                                               |
| ---------------------- | ------------------------------------------------------------------------- |
| CustomerID             | Unique ID assigned to the customer                                        |
| Gender                 | Gender of the customer                                                    |
| Age                    | Age of the customer                                                       |
| Annual Income (k$)     | Annual income of the customer                                             |
| Spending Score (1-100) | Score assigned by the mall based on customer behavior and spending nature |


In [None]:
# imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import umap
from matplotlib_inline.backend_inline import set_matplotlib_formats
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import power_transform
from yellowbrick.cluster import kelbow_visualizer, silhouette_visualizer

# setting global plotting settings
# set_matplotlib_formats("svg")
sns.set_palette("tab10")
sns.set_style("darkgrid")
FIGSIZE = (12, 6)

In [None]:
# | code-fold: show


# Load the customer dataset to analyze shopping patterns
df_mall = pd.read_csv("artifacts/Mall_Customers.csv")

# strip stray whitespace from the column names,
# then rename them to short lowercase names, for easy typing
df_mall.columns = df_mall.columns.str.strip()
df_mall = df_mall.rename(
    columns={
        "CustomerID": "id",
        "Gender": "gender",
        "Age": "age",
        "Annual Income (k$)": "income",
        "Spending Score (1-100)": "spending",
    }
)
# normalise the gender labels to lowercase without surrounding whitespace
df_mall["gender"] = df_mall["gender"].str.lower().str.strip()


# count the missing values per column
print(f"amount of NULL \n{df_mall.isna().sum()} \n")

# look at a random sample to validate the contents
df_mall.sample(6)

So, upon taking a look at the dataset, it seems like we've got a mix of numeric and non-numeric columns. Specifically, the `gender` column is the only non-numeric feature - it's all categorical data, with customers labeled as either "Male" or "Female". All the other columns - `id`, `age`, `income`, and `spending` - are numeric data types.

The `id` column looks like it's just a unique identifier for each customer, so we can exclude that one from our feature set for clustering. Makes sense, right?

Now, the other features need to be in the same ballpark (i.e., normalized) to work effectively with clustering algorithms. Otherwise, the algorithm will group together instances based on the features with the highest numbers, rather than looking at all the features. That'd be like trying to compare apples and oranges!
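
To see why scale matters, here is a minimal sketch with four made-up customers (not taken from the mall data), with age in years and income in dollars. On the raw values the Euclidean distance is driven almost entirely by income; after standardizing with `StandardScaler`, the large age gap counts again.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical customers: [age in years, income in dollars]
X = np.array(
    [
        [25.0, 50_000.0],
        [60.0, 50_500.0],  # very different age, similar income
        [26.0, 52_000.0],  # similar age, somewhat different income
        [40.0, 90_000.0],
    ]
)


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


# raw features: the income column dominates the distance
raw_01 = dist(X[0], X[1])
raw_02 = dist(X[0], X[2])

# standardized features: every column has zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

On the raw data customer 0 looks closer to the 60-year-old than to the 26-year-old; after scaling the order flips.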

So, we need to process the `gender` column by encoding those categories as numbers. One common way to do this is to map "Male" and "Female" to 1 and 0, respectively. Once we've done that, gender will be represented numerically like the other features.

In a nutshell, we need to encode the gender categorical data, exclude the customer `id` column, and normalize the remaining columns before we can apply clustering algorithms. Luckily for us, there are no NULL values in the data, so we don't have to worry about dealing with those. Easy peasy!

In [None]:
# | code-fold: show

# convert gender to a numerical value via a binary (0/1) mapping
# clustering models usually need numerical values
df_mall = df_mall.assign(gender=df_mall["gender"].map({"male": 1, "female": 0}))

# list with features for easy reference
features = ["age", "income", "spending", "gender"]
df_feature = df_mall[features]


# look at a random sample to validate the contents
df_feature.sample(5)

I used binary encoding to convert the string values of the gender column into numerical values, using the pandas `map` method with a simple mapping of "male" to 1 and "female" to 0. The `gender` column now holds 0/1 values.
The `id` column is left out of the feature dataframe.
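
For comparison, here is a minimal sketch on made-up values of how a single 0/1 column obtained with `map` differs from one indicator column per category obtained with `pd.get_dummies`. With only two categories the single column already carries all the information.

```python
import pandas as pd

# hypothetical values, not the mall data
gender = pd.Series(["male", "female", "female", "male"], name="gender")

# binary encoding: one column with 0/1 values
binary = gender.map({"male": 1, "female": 0})

# one-hot encoding: one indicator column per category
one_hot = pd.get_dummies(gender, prefix="gender", dtype=int)

print(binary.tolist())
print(one_hot)
```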

## Exploratory Data Analysis

In this section, we visualize the data and hope to gain some insights into meaningful patterns that can inform our clustering analysis during the next phase.

In [None]:
print(df_feature.describe().T)

The dataset contains information on 200 customers. The average (mean) age is 38.85 years. Ages range from 18 to 70, with 50% of customers aged 36 years or below.

The average annual income is $60,560, ranging from $15,000 to $137,000. 50% of customers earn $61,500 or less.

For the spending score (1-100), the average is 50.2. Half the customers have a spending score of 50 or below. The minimum is 1 and maximum 99, showing a wide range in spending habits.

Overall, we see variation among customers in age, income levels, and purchasing patterns. Clustering algorithms can help segment customers into groups based on these attributes to develop targeted marketing approaches. Let us first look at the distributions of the features.


In [None]:
# | code-fold: show

fig, ax = plt.subplots(1, 2, figsize=FIGSIZE)

curr_ax = ax[0]
sns.countplot(data=df_feature, x="gender", stat="count", hue="gender", ax=curr_ax)
curr_ax.legend(["female", "male"])
# little hacky way of getting the numbers to show in the plot
curr_ax.bar_label(curr_ax.containers[0], fontsize=10)
curr_ax.bar_label(curr_ax.containers[1], fontsize=10)

curr_ax = ax[1]
sns.kdeplot(x="age", data=df_feature, common_norm=False, hue="gender", ax=curr_ax)
curr_ax.legend(["male", "female"])

plt.show()

The bar plot on the left shows the distribution of gender. Our dataset contains more females (112) than males (88).

Now let's focus on the age distributions shown in the kernel density estimation plots on the right. The most striking observation here is that for both genders, there's a noticeable peak around 30 years of age. But, if you look closely, there seems to be a more significant number of men falling into the older age groups (55+).



In [None]:
fig, ax = plt.subplots(1, 2, figsize=FIGSIZE)

curr_ax = ax[0]
sns.regplot(
    data=df_feature[df_feature["gender"] == 0],
    x="age",
    y="income",
    color=sns.color_palette()[0],
    order=1,
    lowess=True,
    truncate=True,
    ax=curr_ax,
)


curr_ax = ax[0]
sns.regplot(
    data=df_feature[df_feature["gender"] == 1],
    x="age",
    y="income",
    color=sns.color_palette()[1],
    order=1,
    lowess=True,
    truncate=True,
    ax=curr_ax,
)
curr_ax.legend(["gender=0", "lowess regression", "gender=1", "lowess regression"])
curr_ax.set_title("age vs. income")


curr_ax = ax[1]
sns.regplot(
    data=df_feature[df_feature["gender"] == 0],
    x="age",
    y="spending",
    color=sns.color_palette()[0],
    order=1,
    lowess=True,
    truncate=True,
    ax=curr_ax,
)


curr_ax = ax[1]
sns.regplot(
    data=df_feature[df_feature["gender"] == 1],
    x="age",
    y="spending",
    color=sns.color_palette()[1],
    order=1,
    lowess=True,
    truncate=True,
    ax=curr_ax,
)
curr_ax.legend(["gender=0", "lowess regression", "gender=1", "lowess regression"])
curr_ax.set_title("age vs. spending")
plt.show()

Exploring the relation between age, income, and spending is an intriguing aspect of our dataset. The scatterplots above add a [lowess regression](https://en.wikipedia.org/wiki/Local_regression) line to make the trends easier to see.

Starting with the left scatterplot, we can see a positive correlation between age and income up until about 35 years old, with income tapering off for older individuals. The first part is in line with our intuition: early in a career, people typically earn more over time due to career advancement and increased earning potential. Interestingly, this trend appears to be consistent for both genders.

However, the spending pattern is a different story! Up until 30 years of age, we see a slight increase in spending. But around that age, there's a noticeable decrease in spending that lasts until approximately 50 years old. After that point, there's a slight uptick in spending again, but it doesn't quite reach the previous level. These findings suggest that people tend to spend less as they get older.

The difference between genders regarding age and spending appears to be relatively small compared to the overall age effect. This means that both males and females exhibit similar trends in spending throughout their lives, with some differences in the exact shapes of their spending curves.


In [None]:
fig, ax = plt.subplots(1, 2, figsize=FIGSIZE)

curr_ax = ax[0]
sns.scatterplot(
    data=df_feature,
    x="income",
    y="spending",
    hue="age",
    palette=sns.color_palette("viridis", as_cmap=True),
    ax=curr_ax,
)
curr_ax.set_title("age")


curr_ax = ax[1]
sns.scatterplot(
    data=df_feature,
    x="income",
    y="spending",
    hue="gender",
    ax=curr_ax,
)
curr_ax.set_title("gender")
curr_ax.legend(["female", "male"])

plt.suptitle("the relation between income and spending")
plt.show()

In the left scatterplot you can clearly see that younger people do seem to have higher levels of spending, but as we move towards older age groups, the spending levels decrease for most individuals. There is no clear overall relation between income and spending. The cluster of people with spending in the 40-60 range and income in the 40-65 range could indicate that these individuals are in a specific life stage, such as starting families or paying off mortgages. Alternatively, it might suggest that there's an external factor influencing both income and spending for this group.

Regarding the right scatterplot, it seems that there's no apparent relationship between income/spending and gender, as the colors representing males and females appear to be evenly distributed throughout the plot. This suggests that income and spending levels are relatively similar for both genders in our dataset.

In [None]:
# create a ratio between income and spending
df_ratio = df_feature.assign(si_ratio=df_feature["income"] / df_feature["spending"])

fig, ax = plt.subplots(1, 2, figsize=FIGSIZE)
curr_ax = ax[0]

sns.scatterplot(
    data=df_ratio,
    x="si_ratio",
    y="age",
    hue="gender",
    alpha=0.7,
    ax=curr_ax,
)
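# relabel the legend via its handles: hue level 0 is female, 1 is male
handles, _ = curr_ax.get_legend_handles_labels()
curr_ax.legend(handles, ["female", "male"])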
curr_ax.set_title("income / spending - unclipped")

# clip to 95% quantile of the si ratio,
# this will zoom into the interesting part of the plot
df_ratio = df_ratio.assign(
    si_ratio_clip=df_ratio["si_ratio"].clip(
        upper=np.quantile(df_ratio["si_ratio"], 0.95)
    )
)

curr_ax = ax[1]
sns.scatterplot(
    data=df_ratio,
    x="si_ratio_clip",
    y="age",
    hue="gender",
    alpha=0.7,
    ax=curr_ax,
)
curr_ax.set_xlim(right=np.ceil(np.quantile(df_ratio["si_ratio"], 0.95)))
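# relabel the legend via its handles: hue level 0 is female, 1 is male
handles, _ = curr_ax.get_legend_handles_labels()
curr_ax.legend(handles, ["female", "male"])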
curr_ax.set_title(
    f"income / spending - clipped @ {np.quantile(df_ratio['si_ratio'], 0.95):.2f}"
)

plt.show()

By calculating the ratio of income to spending (i.e., income divided by spending), we can gain a better understanding of an individual's spending habits in relation to their income. The raw numbers alone don't provide a complete picture, but examining the relationship between these ratios can give us valuable insights. 

In the left plot, there are some outliers with ratios around 80. Their income is much higher than the spending score assigned to them; one possible cause is a data entry error. To better visualize the trends while still showing the majority of the data points, we can compress the tail of the data by [clipping](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.clip.html#pandas.DataFrame.clip) the values at the 0.95 quantile. This helps us focus on the more typical spending behaviors.
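
A minimal sketch of what clipping at a quantile does, using a made-up series with one extreme value: everything above the 0.95 quantile is pulled down to that threshold, and all other values are left untouched.

```python
import pandas as pd

# hypothetical ratios with one extreme value; not the mall data
ratio = pd.Series([0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 80.0])

# threshold at the 0.95 quantile, then cap everything above it
upper = ratio.quantile(0.95)
clipped = ratio.clip(upper=upper)

print(f"threshold: {upper:.2f}")
print(clipped.tolist())
```

Clipping keeps every row, unlike filtering, so the outliers still show up in the plot, just squeezed against the right edge.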

The right plot shows a clearer separation between age groups, with younger people generally having ratios below 1 and older people above 1. Lower ratios indicate that individuals are spending relatively more than their income levels compared to their peers. Conversely, higher ratios suggest that they're spending less than their income levels.

These findings could be due to various factors such as different life stages (e.g., younger people might have more debt or be starting families), savings goals, or lifestyle choices.

In [None]:
# | code-fold: show

# apply yeo-johnson transform to si_ratio
# this will make the data more like a normal distribution
df_ratio = df_ratio.assign(
    si_ratio_transform=power_transform(
        df_ratio["si_ratio"].to_numpy().reshape(-1, 1), method="yeo-johnson"
    )
)

fig, ax = plt.subplots(1, 2, figsize=FIGSIZE)

curr_ax = ax[0]
sns.histplot(data=df_ratio["si_ratio"], stat="percent", ax=curr_ax)
curr_ax.set_title("Original")

curr_ax = ax[1]
sns.histplot(
    data=df_ratio["si_ratio_transform"],
    stat="percent",
    ax=curr_ax,
)
curr_ax.set_title("applied Yeo-Johnson transform")
plt.show()

When using distance-based clustering algorithms like K-means, it's essential that all features are on a comparable scale, so that no single feature with larger values influences the algorithm disproportionately. The [Yeo-Johnson](https://en.wikipedia.org/wiki/Power_transform#Yeo%E2%80%93Johnson_transformation) power transformation is a popular method for making skewed data, like our income/spending ratio, more closely resemble a normal distribution. In the right plot, you can see that the transformed feature behaves much more like a normally distributed one. Because scikit-learn's `power_transform` also standardizes its output to zero mean and unit variance by default, the transformed features end up on a comparable scale, which helps the clustering capture the relationships between income, spending, and the other demographic variables.
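
To put a number on the effect, here is a minimal sketch that measures skewness before and after the transform. It uses a synthetic log-normal sample as a stand-in for a right-skewed feature; the sample and its parameters are assumptions, not the mall data.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import power_transform

# synthetic right-skewed feature (an assumption for illustration)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500).reshape(-1, 1)

# yeo-johnson transform; the output is standardized by default
x_t = power_transform(x, method="yeo-johnson")

skew_before = skew(x.ravel())
skew_after = skew(x_t.ravel())

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
print(f"mean after: {x_t.mean():.2f}, std after: {x_t.std():.2f}")
```

A skewness close to zero after the transform indicates a roughly symmetric distribution.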

## Feature Engineering

In the EDA phase, we've observed a connection between age and spending/income. However, what constitutes high spending for a 20-year-old isn't the same as for a 60-year-old. To account for this, I've grouped ages into bins and calculated the difference between each bin's mean value and individual observations. The result is a number that shows whether someone spends more or less than average for their age bracket. Similarly, I applied this methodology to gender as well.

In [None]:
# | code-fold: show

# bin the age variable into 7 bins
# fmt: off
binned = pd.cut(
    df_feature["age"],
    bins=[0,20,30,40,50,60,70,80],
    labels=[1,2,3,4,5,6,7],
    )
# fmt: on
df_feature = df_feature.assign(age_bin=binned)

# turn off the formatter, to increase readability
# fmt: off
df_feature = df_feature.assign(
    # difference between each customer's value and the mean of their gender group
    income_vs_gender_mean=df_feature["income"] - df_feature.groupby("gender")["income"].transform("mean"),
    spending_vs_gender_mean=df_feature["spending"] - df_feature.groupby("gender")["spending"].transform("mean"),

    # difference between each customer's value and the mean of their age bin
    income_vs_age_mean=df_feature["income"] - df_feature.groupby("age_bin", observed=False)["income"].transform("mean"),
    spending_vs_age_mean=df_feature["spending"] - df_feature.groupby("age_bin", observed=False)["spending"].transform("mean"),
)
# fmt: on

df_feature.sample(10)

To capture whether a person's spending behaviour deviates from what we might expect based on their income or age, I have opted to calculate the ratio between income (or age) and spending for each observation. This allows us to identify, for example, the young spenders or the old savers. At this stage no clipping or normalization of the ratios is applied, which means that extreme values are reflected in the data as they are.

In [None]:
# | code-fold: show

# calculate the ratios:
# income / spending -> income relative to the spending score
# age / spending -> age relative to the spending score
df_feature = df_feature.assign(
    si_ratio=df_feature["income"] / df_feature["spending"],
    sa_ratio=df_feature["age"] / df_feature["spending"],
)
df_feature.sample(10)

Now that we have all our features, it is time to apply a power transform to them. Applying the Yeo-Johnson method to each input feature brings its distribution closer to normal, which makes the subsequent clustering more reliable. Because the output is also standardized, no single feature with large values will dominate the result; all features contribute on a comparable scale.

In [None]:
# cast the age bins to integers so they are included in the numeric scaling below
df_feature = df_feature.assign(
    age_bin=df_feature["age_bin"].astype(int),
)

df_feature[df_feature.select_dtypes(include="number").columns] = power_transform(
    X=df_feature[df_feature.select_dtypes(include="number").columns],
    method="yeo-johnson",
)
df_feature.describe().T

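After the transform, each feature has a mean close to zero and a standard deviation of 1. This is expected, because `power_transform` standardizes its output by default. On its own it does not prove that a feature is normally distributed, but combined with the Yeo-Johnson transform it gives us features that are on a comparable scale and less skewed.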
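
As a complementary check, note that a mean of 0 and a standard deviation of 1 are what standardization produces for any input. A minimal sketch on synthetic exponential data (an assumption, not the mall data) shows why it helps to also look at a shape measure such as skewness.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

# synthetic, strongly right-skewed sample (an assumption for illustration)
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=1000).reshape(-1, 1)

# plain standardization: shifts and rescales, but keeps the shape
x_std = StandardScaler().fit_transform(x)

mean_std = float(x_std.mean())
sd_std = float(x_std.std())
skew_std = float(skew(x_std.ravel()))

print(f"mean={mean_std:.2f}, std={sd_std:.2f}, skewness={skew_std:.2f}")
```

The standardized sample has mean 0 and standard deviation 1 but is still clearly skewed, so the summary table alone cannot confirm normality.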

## Clustering

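With the features prepared, we use K-means to divide the customers into distinct groups. The elbow plot and the silhouette plots below help to choose the number of clusters; the remaining plots use the model at index 4 of `kmeans_clusters`, which has six clusters.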
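
Before applying this to the customer features, here is a minimal sketch of what the elbow method looks for. It fits K-means for a range of `k` on synthetic blobs (four groups by construction, which is an assumption of the example) and records the inertia, the sum of squared distances to the nearest centre.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with four well-separated groups (not the mall data)
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# inertia: sum of squared distances of the samples to their closest centre
inertias = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# the "elbow" is where adding another cluster stops paying off
for k in range(3, 8):
    drop = inertias[k - 1] - inertias[k]
    print(f"k={k}: inertia={inertias[k]:.1f}, improvement over k-1={drop:.1f}")
```

On this toy data the improvement collapses after `k=4`. On real data the bend is usually much less pronounced, which is why a second criterion such as the silhouette plot is useful.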


In [None]:
fig, ax = plt.subplots(figsize=FIGSIZE)

_ = kelbow_visualizer(
    KMeans(
        n_init=10,
    ),
    X=df_feature,
    timings=False,
    metric="distortion",
    ax=ax,
)  # distortion: mean sum of squared distances to centers

In [None]:
# create (unfitted) K-means models for 2 to 10 clusters,
# they are fitted by the silhouette visualizer in the next cell
kmeans_clusters = [
    KMeans(n_clusters=i, n_init="auto", max_iter=900) for i in range(2, 11)
]

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(18, 16), layout="constrained")

for i, ax in enumerate(axes.flatten()):
    silhouette_visualizer(
        kmeans_clusters[i],
        X=df_feature,
        ax=ax,
        is_fitted=False,
        show=False,
        colors=sns.color_palette("tab10"),
    )
plt.tight_layout()
plt.show()

In [None]:
decomp = PCA(n_components=2)
decomp_components = decomp.fit_transform(df_feature)


fig, ax = plt.subplots(figsize=FIGSIZE)
sns.scatterplot(
    x=decomp_components[:, 0],
    y=decomp_components[:, 1],
    hue=kmeans_clusters[4].predict(df_feature),
    palette=sns.color_palette("tab10", 6),
    ax=ax,
)
ax.set_xticklabels([""])
ax.set_yticklabels([""])
plt.suptitle("clusters visualised with PCA")
plt.show()

In [None]:
decomp_umap = umap.UMAP(
    n_components=2, min_dist=0.5, n_neighbors=12, n_jobs=1, random_state=91
)
decomp_components_umap = decomp_umap.fit_transform(df_feature)

fig, ax = plt.subplots(figsize=FIGSIZE)

sns.scatterplot(
    x=decomp_components_umap[:, 0],
    y=decomp_components_umap[:, 1],
    hue=kmeans_clusters[4].predict(df_feature),
    palette=sns.color_palette("tab10", 6),
    ax=ax,
)
ax.set_xticklabels([""])
ax.set_yticklabels([""])
plt.suptitle("clusters visualised with UMAP")

plt.show()

## Analysis and Evaluation

In this section we analyze each customer group's traits, like average age or buying habits. Metrics like the silhouette score or the Dunn index assess clustering quality by evaluating cluster cohesion and separation. A successful clustering result scores well on these metrics and provides actionable business insights.
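
As a minimal sketch of both metrics: scikit-learn provides `silhouette_score`, while the Dunn index is not part of scikit-learn, so the `dunn_index` helper below is a hand-written, hypothetical implementation of one common definition (smallest distance between points of different clusters divided by the largest cluster diameter). The data is synthetic, not the mall data.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def dunn_index(X, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # diameter: largest pairwise distance inside a single cluster
    max_diameter = max(pdist(c).max() for c in clusters if len(c) > 1)
    # separation: smallest distance between points of two different clusters
    min_separation = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1 :]
    )
    return min_separation / max_diameter


# synthetic data with three well-separated groups
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)
dunn = dunn_index(X, labels)
print(f"silhouette score: {sil:.2f}, Dunn index: {dunn:.2f}")
```

For both metrics higher is better. The silhouette score lies between -1 and 1, and a Dunn index above 1 means the closest pair of clusters is further apart than the widest cluster is across.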


In [None]:
df_cluster = df_feature.assign(cluster=kmeans_clusters[4].predict(df_feature))
df_mall_cluster = df_mall.assign(cluster=kmeans_clusters[4].predict(df_feature)).drop(
    columns=["id"]
)

In [None]:
# combine the original columns with the transformed features and cluster labels
df_all = pd.concat([df_mall.drop(columns=["id"]), df_cluster], axis=1)

In [None]:
grouper = df_mall_cluster.groupby("cluster", as_index=False)
df_cluster_agg = grouper.mean().round(2)
df_cluster_agg = df_cluster_agg.assign(
    count=grouper.count()["age"], age=(df_cluster_agg["age"].astype("int"))
)

In [None]:
df_cluster_agg

In [None]:
sns.boxplot(
    df_mall_cluster,
    y="cluster",
    x="age",
    orient="h",
    hue="cluster",
    palette=sns.color_palette("tab10", 6),
)

plt.title("Age of each customer segment")
plt.show()

In [None]:
sns.countplot(
    df_cluster,
    x="cluster",
    hue="gender",
    stat="proportion",
    palette=sns.color_palette("tab10", 2),
)
plt.legend(
    [
        "female",
        "male",
    ]
)
plt.title("gender ratio between the customer segments")
plt.show()

In [None]:
a = 0.2
mean_gender = df_mall["gender"].mean()
l_mean_gender, u_mean_gender = mean_gender * (1 - a), mean_gender * (1 + a)

# define a mostly gender column
# determine which cluster falls outside of the bounds
df_gender = df_cluster_agg[["cluster", "gender"]].assign(type_gender="neutral")
df_gender.loc[df_gender["gender"].gt(u_mean_gender), "type_gender"] = "mostly male"
df_gender.loc[df_gender["gender"].lt(l_mean_gender), "type_gender"] = "mostly female"
df_gender

In [None]:
fig, ax = plt.subplots(1, 1, figsize=FIGSIZE)

curr_ax = ax
_ = sns.scatterplot(
    data=df_cluster_agg,
    x="income",
    y="spending",
    hue="cluster",
    palette=sns.color_palette("tab10", 6),
    s=150,
    ax=curr_ax,
)

xlim0, xlim1 = ax.get_xlim()
ylim0, ylim1 = ax.get_ylim()

plt.vlines(
    df_mall_cluster["income"].median(), ylim0, ylim1, color="grey", linestyles="--"
)
plt.hlines(
    df_mall_cluster["spending"].median(),
    xlim0,
    xlim1,
    color="grey",
    linestyles="--",
)
plt.suptitle("Income vs. Spending")
plt.show()

## Insights and Business Applications

The resulting segments can be used to tailor marketing strategies to each group, with the aim of improving customer engagement and retention.
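
One possible starting point, sketched below, is to give each segment a readable description based on where its averages fall relative to a cut-off, similar in spirit to the quadrant plot above. The per-cluster numbers and the cut-offs are hypothetical placeholders, not results from this analysis.

```python
import pandas as pd

# hypothetical per-cluster averages, standing in for a table like df_cluster_agg
df_segments = pd.DataFrame(
    {
        "cluster": [0, 1, 2, 3],
        "income": [25.0, 85.0, 88.0, 26.0],
        "spending": [80.0, 82.0, 18.0, 20.0],
    }
)

# hypothetical cut-offs; the customer medians would be a natural choice
income_cut, spending_cut = 60.0, 50.0


def describe(row):
    """Describe a segment by its position relative to the two cut-offs."""
    income = "high income" if row["income"] > income_cut else "low income"
    spending = "high spending" if row["spending"] > spending_cut else "low spending"
    return f"{income}, {spending}"


df_segments = df_segments.assign(segment=df_segments.apply(describe, axis=1))
print(df_segments)
```

Such labels make it easier to discuss the segments with a marketing team than bare cluster numbers do.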



In [None]:
# | echo: false
# | output: false
### creating the front image

In [None]:
# | echo: false
# | output: false
import gif  # noqa

In [None]:
# add the cluster labels to the dataframe
df_mall_cluster_centroid = df_mall_cluster.merge(
    df_cluster_agg.drop(columns=["count"]), on="cluster", suffixes=("", "_cluster")
)

In [None]:
def create_blend(df_in: pd.DataFrame, col1: str, steps: int):
    """
    Applies a function that generates a linear sequence between the value of
    a specified column and the corresponding cluster value in each row of the dataframe.

    Parameters
    ----------
    df_in : pd.DataFrame
        Input dataframe with at least two columns: one specified by `col1` and another with `col1` suffix '_cluster'.
    col1 : str
        The name of the column in the dataframe from which to start the linear sequence.
    steps : int
        The number of steps in the linear sequence.

    Returns
    -------
    pd.Series
        A series in which each element is a linear sequence (numpy array) between the corresponding values of `col1` and `col1_cluster`.

    """
    return df_in.apply(
        lambda row: np.linspace(row[col1], row[f"{col1}_cluster"], steps), axis=1
    )


# add the blend columns to the three variables of interest
df_mall_cluster_centroid = df_mall_cluster_centroid.assign(
    spending_blend=create_blend(df_mall_cluster_centroid, "spending", 50),
    age_blend=create_blend(df_mall_cluster_centroid, "age", 50),
    income_blend=create_blend(df_mall_cluster_centroid, "income", 50),
)
df_mall_cluster_centroid

In [None]:
fig, ax = plt.subplots(1, 1, figsize=FIGSIZE)

sns.scatterplot(
    x=df_mall_cluster_centroid["spending_blend"].apply(lambda row: row[0]),
    y=df_mall_cluster_centroid["age_blend"].apply(lambda row: row[0]),
    hue=df_mall_cluster_centroid["cluster"],
    palette=sns.color_palette("tab10", 6),
    s=200,
    alpha=0.7,
    ax=ax,
)

xlim_begin, ylim_begin = ax.get_xlim(), ax.get_ylim()

###

fig, ax = plt.subplots(1, 1, figsize=FIGSIZE)


sns.scatterplot(
    x=df_mall_cluster_centroid["spending_blend"].apply(lambda row: row[-1]),
    y=df_mall_cluster_centroid["age_blend"].apply(lambda row: row[-1]),
    hue=df_mall_cluster_centroid["cluster"],
    palette=sns.color_palette("tab10", 6),
    s=200,
    alpha=0.7,
    ax=ax,
)

xlim_end, ylim_end = ax.get_xlim(), ax.get_ylim()

zoom_factor = 0.1
# get the global axes limits
overal_xlim = (
    min(xlim_begin[0], xlim_end[0]) * (1 - zoom_factor),
    max(xlim_begin[1], xlim_end[1]) * (1 + zoom_factor),
)
overal_ylim = (
    min(ylim_begin[0], ylim_end[0]) * (1 - zoom_factor),
    max(ylim_begin[1], ylim_end[1]) * (1 + zoom_factor),
)

In [None]:
# | echo: false
# | output: false


@gif.frame
def plot_step(i, overal_xlim, overal_ylim):
    _, ax = plt.subplots(1, 1, figsize=FIGSIZE)

    curr_ax = ax
    sns.scatterplot(
        x=df_mall_cluster_centroid["spending_blend"].apply(lambda row: row[i]),
        y=df_mall_cluster_centroid["age_blend"].apply(lambda row: row[i]),
        hue=df_mall_cluster_centroid["cluster"],
        palette=sns.color_palette("tab10", 6),
        s=200,
        alpha=0.7,
        ax=curr_ax,
    )
    # set the overall axes limits
    ax.set_xlim(overal_xlim)
    ax.set_ylim(overal_ylim)

    # remove the ticks and labels from the axes
    xticks = ax.get_xticks()
    ax.set_xticks(xticks, labels=[])
    ax.set_xlabel("")

    yticks = ax.get_yticks()
    ax.set_yticks(yticks, labels=[])
    ax.set_ylabel("")
    plt.legend("")
    plt.tight_layout()

In [None]:
freeze_frames = 10
num_of_steps = len(df_mall_cluster_centroid["spending_blend"][0])
# create the animation
gif_frames = [plot_step(i, overal_xlim, overal_ylim) for i in range(num_of_steps)]

# freeze on the bounce point
gif_frames.extend([gif_frames[-1] for _ in range(freeze_frames // 6)])

# add the original series in reverse
gif_frames.extend(gif_frames[::-1])


gif.save(gif_frames, "artifacts/clumper.gif", duration=1)