## TL;DR

- **Unsupervised Learning** -> Clustering algorithms are used for unsupervised learning, ideal for exploratory data analysis.
- **Grouping Data** -> These algorithms group similar data into clusters based on specific criteria.
- **Variety of Applications** -> They're used in diverse fields like customer segmentation, anomaly detection, and more.
- **Different Techniques** -> Various types exist, like K-means and DBSCAN, each with unique strengths and suited for specific data types.
- **Choice of Parameters** -> The selection and tuning of parameters, like the number of clusters, significantly influence the results.

In machine learning, clustering algorithms help uncover hidden patterns in data. They group data points based on the similarity of their features, without the need for labeled data. These groups are referred to as clusters.
There are multiple families of algorithms that can be used to perform cluster analysis:

- centroid-based (K-means)
- connectivity-based, aka hierarchical clustering (agglomerative clustering)
- distribution-based (Gaussian mixture modelling)
- density-based (DBSCAN)
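
To make the four families a bit more concrete, here is a minimal sketch that fits one scikit-learn estimator from each family on synthetic toy data. The blob data and all parameter values (`n_clusters=3`, `eps=0.5`, `min_samples=5`) are assumptions picked for illustration, not settings used elsewhere in this post.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# synthetic toy data: 300 points around 3 centers (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# one estimator per family; parameter values are illustrative assumptions
models = {
    "centroid (K-means)": KMeans(n_clusters=3, n_init=10, random_state=42),
    "connectivity (agglomerative)": AgglomerativeClustering(n_clusters=3),
    "distribution (Gaussian mixture)": GaussianMixture(n_components=3, random_state=42),
    "density (DBSCAN)": DBSCAN(eps=0.5, min_samples=5),
}

labels_by_model = {}
for name, model in models.items():
    # every estimator here offers fit_predict, returning one label per point
    labels = model.fit_predict(X)
    labels_by_model[name] = labels
    # DBSCAN marks noise points with the label -1, so leave it out of the count
    n_clusters = len(set(labels) - {-1})
    print(f"{name}: {n_clusters} clusters")
```

Note that K-means, agglomerative clustering, and the Gaussian mixture need the number of clusters up front, while DBSCAN derives it from the density parameters.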

Clustering algorithms have found their place in a diverse range of real-world applications, from customer segmentation in marketing strategies to image segmentation in computer vision, and anomaly detection in cybersecurity for insightful data-driven decision making.

To demonstrate the workflow, I will use a K-means clustering algorithm to group together similar shoppers at a shopping mall.


## Business Problem Introduction

As the owners of a thriving supermarket mall, you seek a deeper understanding of your customer base. Data points including demographics, spending habits, and loyalty program membership details are at your disposal. Your goal is to identify distinct groups within your customers to tailor marketing initiatives, thereby maximizing the efficiency of your promotional efforts.

To achieve this, you opt for clustering algorithms to group similar customers based on characteristics and behaviors. By isolating specific segments of your customer base, you can tailor marketing strategies to each subgroup, increasing the likelihood of a successful outcome and getting the most bang for your marketing buck.


## Data Preprocessing

In this initial stage, the goal is to prepare the data for analysis. This involves cleaning the data by removing or filling in missing values, which could be done through various strategies like dropping the missing rows, filling them with mean/median/mode, or using a prediction model. It's also crucial to handle outliers and potentially normalize features if they're on different scales. This stage might also involve dealing with categorical variables using encoding techniques. Effective preprocessing is crucial for reliable results in the subsequent stages.
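
As a rough illustration of the cleaning strategies mentioned above, here is a minimal sketch on a made-up toy dataframe. `df_toy` and its values are hypothetical and not part of the mall dataset, and the 1.5 * IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with gaps; not the mall dataset
df_toy = pd.DataFrame(
    {
        "age": [23, 35, np.nan, 51, 42],
        "income": [15.0, 40.0, 62.0, np.nan, 300.0],
        "gender": ["Male", None, "Female", "Female", "Female"],
    }
)

# option 1: drop every row that has at least one missing value
df_dropped = df_toy.dropna()

# option 2: fill numeric gaps with the median, categorical gaps with the mode
df_filled = df_toy.assign(
    age=df_toy["age"].fillna(df_toy["age"].median()),
    income=df_toy["income"].fillna(df_toy["income"].median()),
    gender=df_toy["gender"].fillna(df_toy["gender"].mode()[0]),
)

# flag outliers with the 1.5 * IQR rule
q1, q3 = df_filled["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df_filled["income"] < q1 - 1.5 * iqr) | (
    df_filled["income"] > q3 + 1.5 * iqr
)
print(df_filled[is_outlier])
```

Dropping rows is the simplest option but throws away data. Filling keeps every row at the cost of introducing estimated values.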

The owner of the mall has provided the following [dataset](https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python):

| Field                  | Description                                                               |
| ---------------------- | ------------------------------------------------------------------------- |
| CustomerID             | Unique ID assigned to the customer                                        |
| Gender                 | Gender of the customer                                                    |
| Age                    | Age of the customer                                                       |
| Annual Income (k$)     | Annual income of the customer                                             |
| Spending Score (1-100) | Score assigned by the mall based on customer behavior and spending nature |


In [None]:
# imports
import pandas as pd
import seaborn as sns

# setting global plotting settings
# set_matplotlib_formats("svg")
sns.set_palette("tab10")
sns.set_style("darkgrid")
FIGSIZE = (12, 6)

In [None]:
# | code-fold: show


# Load the customer dataset to analyze shopping patterns
df_mall = pd.read_csv("artifacts/Mall_Customers.csv")

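# strip stray whitespace from the headers, then rename to short lowercase names
# (rename silently skips keys that do not match, so the keys must be exact)
df_mall.columns = df_mall.columns.str.strip()
df_mall = df_mall.rename(
    columns={
        "CustomerID": "id",
        "Gender": "gender",
        "Age": "age",
        "Annual Income (k$)": "income",
        "Spending Score (1-100)": "spending",
    }
)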
df_mall["gender"] = df_mall["gender"].str.lower()
df_mall["gender"] = df_mall["gender"].str.strip()


# count the missing values per column
print(f"number of missing values per column \n{df_mall.isna().sum()} \n")

# look at a random sample to validate the contents
df_mall.sample(6)

So, upon taking a look at the dataset, it seems like we've got a mix of numeric and non-numeric columns. Specifically, the `gender` column is the only non-numeric feature - it's all categorical data, with customers labeled as either "Male" or "Female". All the other columns - `id`, `age`, `income`, and `spending` - are numeric data types.

The `id` column looks like it's just a unique identifier for each customer, so we can exclude that one from our feature set for clustering. Makes sense, right?

Now, the other features need to be in the same ballpark (i.e., normalized) to work effectively with clustering algorithms. K-means groups instances by the distance between them, so a feature with a large numeric range would dominate that distance and the features with smaller ranges would barely count. That'd be like trying to compare apples and oranges!
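
To show what "same ballpark" means in practice, here is a minimal sketch using scikit-learn's `StandardScaler`, which rescales each column to zero mean and unit variance. The dataframe `df_demo` and its values are made up for illustration, and standardization is only one of several possible scaling choices.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical values that mimic the numeric mall columns
df_demo = pd.DataFrame(
    {
        "age": [19, 35, 52, 64],
        "income": [15, 54, 78, 137],
        "spending": [39, 81, 6, 50],
    }
)

# rescale every column to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df_demo), columns=df_demo.columns)

# after scaling, no single column dominates a distance calculation
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

`StandardScaler` uses the population standard deviation, which is why the check above passes `ddof=0` to pandas.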

So, we need to process the `gender` column by encoding those categories as numbers. One common way to do this is to map "Male" and "Female" to 1 and 0, respectively. Once we've done that, gender will be represented numerically like the other features.

In a nutshell, we need to encode the gender categorical data, exclude the customer `id` column, and normalize the remaining columns before we can apply clustering algorithms. Luckily for us, there are no NULL values in the data, so we don't have to worry about dealing with those. Easy peasy!


In [None]:
# | code-fold: show

# convert gender to a numerical value via binary encoding (male=1, female=0)
# clustering models usually need numerical values
df_mall = df_mall.assign(gender=df_mall["gender"].map({"male": 1, "female": 0}))

# list with features for easy reference
features = ["age", "income", "spending", "gender"]
df_feature = df_mall[features]


# look at a random sample to validate the contents
df_feature.sample(5)

I converted the string values of the gender column into numerical values, using the pandas `map` method with a simple mapping of "male" to 1 and "female" to 0. This overwrites the `gender` column with a 0/1 encoding. Strictly speaking this is a binary encoding rather than one-hot encoding, which would create a separate indicator column for each category.
The `id` column is left out of the feature dataframe.
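
For comparison, here is a small sketch of how a single 0/1 column from `map` differs from a one-hot encoding built with `pd.get_dummies`. The `gender` series below is a made-up example, not the mall data.

```python
import pandas as pd

# made-up example values
gender = pd.Series(["male", "female", "female", "male"], name="gender")

# binary encoding: a single 0/1 column
binary = gender.map({"male": 1, "female": 0})

# one-hot encoding: one indicator column per category
one_hot = pd.get_dummies(gender, prefix="gender", dtype=int)

print(binary.tolist())
print(one_hot)
```

With only two categories the single binary column carries the same information as the two one-hot columns, so the simpler mapping is enough here.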
