<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Before-your-start:" data-toc-modified-id="Before-your-start:-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Before your start:</a></span></li><li><span><a href="#Challenge-1---Import-and-Describe-the-Dataset" data-toc-modified-id="Challenge-1---Import-and-Describe-the-Dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Challenge 1 - Import and Describe the Dataset</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?" data-toc-modified-id="Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?-2.0.0.1"><span class="toc-item-num">2.0.0.1&nbsp;&nbsp;</span>Explore the dataset with mathematical and visualization techniques. What do you find?</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-2---Data-Cleaning-and-Transformation" data-toc-modified-id="Challenge-2---Data-Cleaning-and-Transformation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Challenge 2 - Data Cleaning and Transformation</a></span></li><li><span><a href="#Challenge-3---Data-Preprocessing" data-toc-modified-id="Challenge-3---Data-Preprocessing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Challenge 3 - Data Preprocessing</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here." data-toc-modified-id="We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here.-4.0.0.1"><span class="toc-item-num">4.0.0.1&nbsp;&nbsp;</span>We will use the <code>StandardScaler</code> from <code>sklearn.preprocessing</code> and scale our data. 
Read more about <code>StandardScaler</code> <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler" target="_blank">here</a>.</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-4---Data-Clustering-with-K-Means" data-toc-modified-id="Challenge-4---Data-Clustering-with-K-Means-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Challenge 4 - Data Clustering with K-Means</a></span></li><li><span><a href="#Challenge-5---Data-Clustering-with-DBSCAN" data-toc-modified-id="Challenge-5---Data-Clustering-with-DBSCAN-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Challenge 5 - Data Clustering with DBSCAN</a></span></li><li><span><a href="#Challenge-6---Compare-K-Means-with-DBSCAN" data-toc-modified-id="Challenge-6---Compare-K-Means-with-DBSCAN-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Challenge 6 - Compare K-Means with DBSCAN</a></span></li><li><span><a href="#Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters" data-toc-modified-id="Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Bonus Challenge 2 - Changing K-Means Number of Clusters</a></span></li><li><span><a href="#Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples" data-toc-modified-id="Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Bonus Challenge 3 - Changing DBSCAN <code>eps</code> and <code>min_samples</code></a></span></li></ul></div>

# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# Import my libraries

# this makes plots show up in the notebook
%matplotlib inline

# matplotlib for making plots and charts
import matplotlib.pyplot as plt

# numpy for working with arrays and numbers
import numpy as np

# pandas for working with data tables (dataframes)
import pandas as pd

# seaborn makes prettier plots than matplotlib
import seaborn as sns

# these next lines just hide some annoying warning messages
import warnings                                              
from sklearn.exceptions import DataConversionWarning          
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

# Challenge 1 - Import and Describe the Dataset

In this lab, we will use a dataset containing information about customer preferences. We will look at how much each customer spends in a year on each subcategory in the grocery store and try to find similarities using clustering.

The origin of the dataset is [here](https://archive.ics.uci.edu/ml/datasets/wholesale+customers).

In [None]:
# loading the data: Wholesale customers data
customers = pd.read_csv('../data/Wholesale customers data.csv')

# let me see what the data looks like
customers.head()

#### Explore the dataset with mathematical and visualization techniques. What do you find?

Checklist:

* What does each column mean?
* Any categorical data to convert?
* Any missing data to remove?
* Column collinearity - any high correlations?
* Descriptive statistics - any outliers to remove?
* Column-wise data distribution - is the distribution skewed?
* Etc.

Additional info: Over a century ago, an Italian economist named Vilfredo Pareto discovered that roughly 20% of the customers account for 80% of the typical retail sales. This is called the [Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle). Check if this dataset displays this characteristic.

In [None]:
# let me explore the data

# check how many rows and columns
print("Shape:", customers.shape)
print("\n" + "="*50 + "\n")

# get info about the columns and data types
print("Info:")
customers.info()
print("\n" + "="*50 + "\n")

# check if there are any missing values
print("Missing values:")
print(customers.isnull().sum())
print("\n" + "="*50 + "\n")

# get basic statistics for each column
print("Statistics:")
print(customers.describe())
print("\n" + "="*50 + "\n")

# check how columns are related to each other
# values close to 1 or -1 mean strong correlation
print("Correlations:")
correlation_matrix = customers.corr()
print(correlation_matrix)
print("\n" + "="*50 + "\n")

# make a heatmap to visualize correlations
# darker colors mean stronger relationships
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()

# make histograms to see how data is distributed
customers.hist(figsize=(15, 10), bins=30)
plt.suptitle('Distribution of Each Feature')
plt.show()

# box plots help me spot outliers (the dots outside the boxes)
plt.figure(figsize=(15, 6))
customers.boxplot()
plt.xticks(rotation=45)
plt.title('Box Plots - Looking for Outliers')
plt.show()

**My observations after exploring the data:**

- The dataset has 440 rows and 8 columns - no missing values which is great!
- Grocery, Milk and Detergents_Paper are strongly correlated with each other (Grocery and Detergents_Paper most of all); Frozen shows no comparably strong correlation
- Looking at the histograms, most columns are right-skewed (most values are low, few are very high)
- The box plots show lots of outliers - some customers spend WAY more than others
- Channel (1 or 2) and Region (1, 2 or 3) are categorical codes; the rest are numerical spending amounts
- The value ranges are very different - Fresh goes up to 100k+ while Delicassen is much smaller
- The data might follow the Pareto principle (a few customers accounting for most sales), but the plots above do not test this directly
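
A quick way to test the Pareto hunch from the last bullet is to rank customers by total spend and measure the share contributed by the top 20%. The sketch below runs on made-up totals; `pareto_share` is a hypothetical helper written for this illustration, not something provided by the lab.

```python
import pandas as pd

def pareto_share(totals, top_fraction=0.2):
    """Share of overall spend contributed by the top `top_fraction` of customers."""
    ranked = totals.sort_values(ascending=False)
    # number of customers in the top group (at least one)
    n_top = max(1, int(round(len(ranked) * top_fraction)))
    return ranked.iloc[:n_top].sum() / ranked.sum()

# made-up yearly totals for ten customers (illustration only, not lab data)
toy_totals = pd.Series([100, 50, 20, 10, 10, 5, 3, 1, 1, 0])
share = pareto_share(toy_totals)
print(f"Top 20% of customers account for {share:.0%} of spend")
```

On the lab data, the totals could be built with `customers.drop(columns=['Channel', 'Region']).sum(axis=1)`; a share close to 80% would support the Pareto reading, a much lower one would not.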



# Challenge 2 - Data Cleaning and Transformation

If your conclusion from the previous challenge is the data need cleaning/transformation, do it in the cells below. However, if your conclusion is the data need not be cleaned or transformed, feel free to skip this challenge. But if you do choose the latter, please provide rationale.

In [None]:
# cleaning the data

# make a copy so I don't mess up the original
clean_customers = customers.copy()

# drop Channel and Region columns - I don't need them for clustering
# errors='ignore' skips a column that is already gone, so the cell can be re-run
clean_customers = clean_customers.drop(columns=['Channel', 'Region'], errors='ignore')

# remove outliers using the IQR method
# Q1 is the 25th percentile, Q3 is the 75th percentile
Q1 = clean_customers.quantile(0.25)
Q3 = clean_customers.quantile(0.75)
IQR = Q3 - Q1  # this is the interquartile range

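# anything below Q1-1.5*IQR or above Q3+1.5*IQR is considered an outlier
# the ~ means "not" - so I'm keeping rows where NO column is an outlier
# .copy() gives an independent dataframe, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
clean_customers = clean_customers[~((clean_customers < (Q1 - 1.5 * IQR)) |
                                     (clean_customers > (Q3 + 1.5 * IQR))).any(axis=1)].copy()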

# see how many rows I removed
print(f"Started with: {customers.shape[0]} rows")
print(f"Now have: {clean_customers.shape[0]} rows")
print(f"Removed: {customers.shape[0] - clean_customers.shape[0]} rows")

clean_customers.head()

**My cleaning decisions:**

- I removed Channel and Region because they're categorical and I want to focus on spending patterns
- I removed outliers using the IQR method to avoid extreme values messing up my clustering
- After cleaning, fewer than the original 440 rows remain; the exact count is printed by the cell above and depends on the 1.5 * IQR threshold
- The data looks much cleaner now and should cluster better
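
To make the IQR rule concrete, here is a small sketch on a made-up two-column table (the values are invented for illustration). It applies the same filter as the cell above and shows that a row is dropped as soon as any one of its columns falls outside the fences.

```python
import pandas as pd

# made-up data: only the last value in column "a" is extreme
toy = pd.DataFrame({
    "a": [10, 12, 11, 13, 12, 100],
    "b": [1, 2, 3, 2, 1, 2],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# True where a cell lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for its column
outside = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

# keep only the rows with no flagged cell in any column
kept = toy[~outside.any(axis=1)].copy()

print("Lower fences:", (q1 - 1.5 * iqr).to_dict())
print("Upper fences:", (q3 + 1.5 * iqr).to_dict())
print(kept)
```

Because the test is applied per column and combined with `any(axis=1)`, a dataset with many skewed columns can lose a large share of its rows, which is worth keeping in mind when reading the row counts above.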

# Challenge 3 - Data Preprocessing

One problem with the dataset is that the value ranges differ remarkably across categories (e.g. `Fresh` and `Grocery` compared to `Detergents_Paper` and `Delicassen`). If you made this observation in the first challenge, you've done a great job! This means you not only completed the bonus questions in the previous Supervised Learning lab but also dug deep into [*feature scaling*](https://en.wikipedia.org/wiki/Feature_scaling). Keep up the good work!

Diverse value ranges in different features could cause issues in our clustering. The way to reduce the problem is through feature scaling. We'll use this technique again with this dataset.
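
As a minimal sketch of what standardization does, the example below scales a made-up array with two columns on very different ranges and compares the result with the manual formula `(x - mean) / std`. The toy numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data: the second column is 100 times larger than the first
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaled = StandardScaler().fit_transform(X)

# same computation by hand; NumPy's std defaults to the population std (ddof=0),
# which is what StandardScaler uses
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled)
print("Matches manual formula:", np.allclose(scaled, manual))
```

After scaling, both columns contain the same values, so neither dominates a distance-based algorithm such as K-Means or DBSCAN.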

#### We will use the `StandardScaler` from `sklearn.preprocessing` and scale our data. Read more about `StandardScaler` [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

*After scaling your data, assign the transformed data to a new variable `customers_scale`.*

In [15]:
# need to scale the data so all features are on the same scale
from sklearn.preprocessing import StandardScaler

# create the scaler
scaler = StandardScaler()

# fit and transform the data in one step
# this converts each column to have mean=0 and std=1
customers_scale = scaler.fit_transform(clean_customers)

# convert back to a dataframe to make it easier to work with
customers_scale = pd.DataFrame(customers_scale, 
                                columns=clean_customers.columns,
                                index=clean_customers.index)

# check that it worked
print("Scaled data:")
print(customers_scale.head())
print("\n" + "="*50 + "\n")

# the mean should be close to 0 now
print("Mean (should be ~0):")
print(customers_scale.mean())
print("\n" + "="*50 + "\n")

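# the std should be 1 now
# pandas .std() defaults to the sample std (ddof=1), while StandardScaler
# divides by the population std, so I pass ddof=0 to compare like with like
print("Std (should be ~1):")
print(customers_scale.std(ddof=0))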

# Challenge 4 - Data Clustering with K-Means

Now let's cluster the data with K-Means first. Initiate the K-Means model, then fit your scaled data. In the data returned from the `.fit` method, there is an attribute called `labels_` which is the cluster number assigned to each data record. What you can do is to assign these labels back to `customers` in a new column called `customers['labels']`. Then you'll see the cluster results of the original data.
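
The sketch below shows the `labels_` mechanics on a made-up table, under the assumption that some rows were removed earlier so the index has gaps. `labels_` holds one entry per fitted row, in row order, so assigning it as a new column attaches each label to the row it was computed from.

```python
import pandas as pd
from sklearn.cluster import KMeans

# made-up 2-D points forming two obvious groups; the index has gaps,
# as it would after filtering rows out of a larger table
toy = pd.DataFrame(
    {"x": [0, 0, 1, 10, 10, 11], "y": [0, 1, 0, 10, 11, 10]},
    index=[3, 7, 8, 20, 21, 30],
)

# n_init is set explicitly because its default differs across scikit-learn versions
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy[["x", "y"]])

# labels_ is a NumPy array, so the assignment is positional (row order)
toy["labels"] = model.labels_
print(toy)
```

The cluster numbers themselves are arbitrary: which group is called 0 and which is called 1 can change between runs or seeds.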

In [19]:
from sklearn.cluster import KMeans

# first I need to find the best number of clusters
# using the elbow method

# try different numbers of clusters and see which is best
inertias = []  # this will store how tight the clusters are

for k in range(1, 11):
    # create a model with k clusters
    kmeans = KMeans(n_clusters=k, random_state=42)
    # fit it to my data
    kmeans.fit(customers_scale)
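    # save the inertia (within-cluster sum of squares); it always shrinks as k grows,
    # so what matters is where the drop levels off, not the lowest value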
    inertias.append(kmeans.inertia_)

# plot the elbow curve
# I'm looking for where the curve "bends" - that's the best k
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method - Finding Best Number of Clusters')
plt.grid(True)
plt.show()

### Looking at the elbow, we can choose 2 as the number of clusters

In [21]:
# based on the elbow curve, I'll use 2 clusters
kmeans_2 = KMeans(n_clusters=2, random_state=42)

# fit the model to my scaled data
kmeans_2.fit(customers_scale)

# get the cluster labels (which cluster each customer belongs to)
# labels_ is filled in by fit(), so a separate predict() call is not needed
labels = kmeans_2.labels_

# save the labels as a list
clusters = labels.tolist()

In [None]:
# add the cluster labels to my cleaned data
clean_customers['Label'] = clusters

# now I can see which cluster each customer is in
clean_customers.head()

Count the values in `labels`.

In [None]:
# count how many customers are in each cluster
print("Customers per cluster:")
print(clean_customers['Label'].value_counts())

# visualize it
clean_customers['Label'].value_counts().plot(kind='bar')
plt.xlabel('Cluster')
plt.ylabel('Number of Customers')
plt.title('How Many Customers in Each Cluster')
plt.show()

# Challenge 5 - Data Clustering with DBSCAN

Now let's cluster the data using DBSCAN. Use `DBSCAN(eps=0.5)` to initiate the model, then fit your scaled data. In the data returned from the `.fit` method, assign the `labels_` back to `customers['labels_DBSCAN']`. Now your original data have two labels, one from K-Means and the other from DBSCAN.
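
Before running DBSCAN on the real data, a toy example may help show how it labels points. The coordinates below are made up: two tight groups of five points and one isolated point. `min_samples` is left at its default of 5.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# made-up points: two tight groups and one point far from both
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.05, 0.05],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1], [5.05, 5.05],
    [10.0, 10.0],
])

toy_labels = DBSCAN(eps=0.5).fit(X).labels_
print(toy_labels)

# noise points get the label -1 and do not count as a cluster
n_found = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
print("clusters:", n_found, "noise points:", int((toy_labels == -1).sum()))
```

Each group has exactly five points within `eps` of one another, which is just enough to form a cluster; the isolated point has no neighbors and is marked as noise.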

In [None]:
from sklearn.cluster import DBSCAN 

# DBSCAN is different from K-Means
# it can find clusters of any shape and marks outliers as -1

# create the model
# eps is the maximum distance between two points for them to count as neighbors
# (min_samples is left at its default of 5)
dbscan = DBSCAN(eps=0.5)

# fit it to my scaled data
dbscan.fit(customers_scale)

# get the labels (-1 means outlier)
labels_dbscan = dbscan.labels_

# add these labels to my data
clean_customers['labels_DBSCAN'] = labels_dbscan

# now I have both K-Means and DBSCAN labels
clean_customers.head()

Count the values in `labels_DBSCAN`.

In [26]:
# count how many in each DBSCAN cluster
print("DBSCAN clusters:")
print(clean_customers['labels_DBSCAN'].value_counts())

# count how many clusters (not including outliers)
n_clusters = len(set(labels_dbscan)) - (1 if -1 in labels_dbscan else 0)
print(f"\nNumber of clusters: {n_clusters}")

# count outliers
n_outliers = list(labels_dbscan).count(-1)
print(f"Number of outliers: {n_outliers}")

# Challenge 6 - Compare K-Means with DBSCAN

Now we want to visually compare how K-Means and DBSCAN have clustered our data. We will create scatter plots for several columns. For each of the following column pairs, plot a scatter plot using `labels` and another using `labels_DBSCAN`. Put them side by side to compare. Which clustering algorithm makes better sense?

Columns to visualize:

* `Detergents_Paper` as X and `Milk` as y
* `Grocery` as X and `Fresh` as y
* `Frozen` as X and `Delicassen` as y

Visualize `Detergents_Paper` as X and `Milk` as y by `labels` and `labels_DBSCAN` respectively

In [30]:
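# helper function to draw one scatter plot colored by cluster label
# the title is a parameter so the helper can be reused for other column pairs
def plot(x, y, hue, title='Detergents Paper vs Milk'):
    sns.scatterplot(x=x, y=y, hue=hue)
    plt.title(title)
    plt.show()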

In [None]:
# compare K-Means vs DBSCAN side by side
# for Detergents_Paper vs Milk

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# K-Means on the left
sns.scatterplot(data=clean_customers,
                x='Detergents_Paper',
                y='Milk',
                hue='Label',
                palette='viridis',
                ax=axes[0])
axes[0].set_title('K-Means: Detergents_Paper vs Milk')

# DBSCAN on the right
sns.scatterplot(data=clean_customers,
                x='Detergents_Paper',
                y='Milk',
                hue='labels_DBSCAN',
                palette='viridis',
                ax=axes[1])
axes[1].set_title('DBSCAN: Detergents_Paper vs Milk')

plt.tight_layout()
plt.show()

Visualize `Grocery` as X and `Fresh` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# compare K-Means vs DBSCAN for Grocery vs Fresh

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# K-Means
sns.scatterplot(data=clean_customers,
                x='Grocery',
                y='Fresh',
                hue='Label',
                palette='viridis',
                ax=axes[0])
axes[0].set_title('K-Means: Grocery vs Fresh')

# DBSCAN
sns.scatterplot(data=clean_customers,
                x='Grocery',
                y='Fresh',
                hue='labels_DBSCAN',
                palette='viridis',
                ax=axes[1])
axes[1].set_title('DBSCAN: Grocery vs Fresh')

plt.tight_layout()
plt.show()

Visualize `Frozen` as X and `Delicassen` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# compare K-Means vs DBSCAN for Frozen vs Delicassen

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# K-Means
sns.scatterplot(data=clean_customers,
                x='Frozen',
                y='Delicassen',
                hue='Label',
                palette='viridis',
                ax=axes[0])
axes[0].set_title('K-Means: Frozen vs Delicassen')

# DBSCAN
sns.scatterplot(data=clean_customers,
                x='Frozen',
                y='Delicassen',
                hue='labels_DBSCAN',
                palette='viridis',
                ax=axes[1])
axes[1].set_title('DBSCAN: Frozen vs Delicassen')

plt.tight_layout()
plt.show()

Let's use a groupby to see how the mean differs between the groups. Group `customers` by `labels` and `labels_DBSCAN` respectively and compute the means for all columns.

In [None]:
# see what's different between the clusters
# by looking at the average values

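# the other algorithm's label column is dropped first so it isn't averaged
# as if it were a spending feature
print("K-Means cluster averages:")
print("="*60)
kmeans_means = clean_customers.drop(columns='labels_DBSCAN').groupby('Label').mean()
print(kmeans_means)
print("\n" + "="*60 + "\n")

print("DBSCAN cluster averages:")
print("="*60)
dbscan_means = clean_customers.drop(columns='Label').groupby('labels_DBSCAN').mean()
print(dbscan_means)
print("\n" + "="*60 + "\n")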

# this helps me understand what makes each cluster unique

Which algorithm appears to perform better?

**Which algorithm works better?**

- K-Means seems to create clearer, more balanced clusters
- DBSCAN found some outliers (labeled as -1) which is interesting - these might be unusual customers
- Looking at the scatter plots, K-Means separates the data into distinct groups
- The groupby means show that K-Means clusters have very different average spending patterns
- For this dataset, I think K-Means with 2 clusters makes the most business sense - maybe retail vs food service customers?
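
The retail vs food service guess in the last bullet could be checked against the `Channel` column, which the dataset description says distinguishes Horeca (hotel/restaurant/café) customers from Retail ones. The sketch below uses made-up frames standing in for `customers` and `clean_customers`; it relies only on the cleaned table keeping the original row index.

```python
import pandas as pd

# made-up stand-in for the original table (Channel still present)
original = pd.DataFrame({"Channel": [1, 1, 2, 2, 1, 2]}, index=range(6))

# made-up stand-in for the cleaned table: rows 4 and 5 were removed as outliers
clustered = pd.DataFrame({"Label": [0, 0, 1, 1]}, index=[0, 1, 2, 3])

# look up Channel for the surviving rows via the shared index
channel = original.loc[clustered.index, "Channel"]

# rows: cluster label, columns: channel code, cells: number of customers
table = pd.crosstab(clustered["Label"], channel)
print(table)
```

If each cluster row is dominated by a single channel, the business interpretation holds up; an even spread across both channels would suggest the clusters capture something else, such as overall spending volume.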

# Bonus Challenge 2 - Changing K-Means Number of Clusters

As we mentioned earlier, we don't need to worry about the number of clusters with DBSCAN because it automatically decides that based on the parameters we send to it. But with K-Means, we have to supply the `n_clusters` param (if you don't supply `n_clusters`, the algorithm will use `8` by default). You need to know that the optimal number of clusters differs case by case based on the dataset. K-Means can perform badly if the wrong number of clusters is used.

In advanced machine learning, data scientists try different numbers of clusters and evaluate the results with statistical measures (read [here](https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation)). We are not using statistical measures today; we'll use our eyes instead. In the cells below, experiment with different numbers of clusters and visualize the results with scatter plots. What number of clusters seems to work best for K-Means?
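
For readers curious what such a statistical measure looks like, here is a minimal sketch using the silhouette score, an internal measure that needs no ground-truth labels. It is not part of the lab's required workflow, and the data are synthetic blobs generated with `make_blobs` around three centers chosen for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data: three well-separated groups (illustration only)
X, _ = make_blobs(n_samples=150,
                  centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5,
                  random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels_k)
    print(f"k={k}: silhouette={scores[k]:.3f}")

# higher is better; values range from -1 to 1
best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

On the lab data the same loop could be run on `customers_scale`; real data rarely gives as clean an answer as these blobs, so the score is best read alongside the scatter plots rather than instead of them.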

In [41]:
# let me try different numbers of clusters to see which looks best

cluster_numbers = [3, 4, 5]

fig, axes = plt.subplots(1, 3, figsize=(20, 5))

for i, n_clusters in enumerate(cluster_numbers):
    # create and fit K-Means with n_clusters
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(customers_scale)
    labels_temp = kmeans.labels_
    
    # plot it
    plt.subplot(1, 3, i+1)
    scatter = plt.scatter(clean_customers['Grocery'], 
                          clean_customers['Fresh'],
                          c=labels_temp, 
                          cmap='viridis',
                          alpha=0.6)
    
    plt.xlabel('Grocery')
    plt.ylabel('Fresh')
    plt.title(f'K-Means with {n_clusters} Clusters')
    plt.colorbar(scatter)

plt.tight_layout()
plt.show()

# which one looks like it separates the data best?

**My thoughts on different cluster numbers:**

- With 3 clusters: the separation is okay but one cluster seems too small
- With 4 clusters: starting to look over-complicated, some clusters are very similar
- With 5 clusters: too many! Hard to see clear differences between groups
- I think 2 or 3 clusters work best for this data - keeps it simple and interpretable

# Bonus Challenge 3 - Changing DBSCAN `eps` and `min_samples`

Experiment changing the `eps` and `min_samples` params for DBSCAN. See how the results differ with scatter plot visualization.

In [None]:
# let me try different DBSCAN settings
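# eps = maximum distance between two points for them to count as neighbors
# min_samples = how many points (including the point itself) must lie within eps
#               for a point to count as a core point of a cluster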

eps_values = [0.3, 0.5, 0.7]
min_samples_values = [3, 5, 10]

fig, axes = plt.subplots(3, 3, figsize=(20, 15))

for i, eps in enumerate(eps_values):
    for j, min_samples in enumerate(min_samples_values):
        # create and fit DBSCAN
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        dbscan.fit(customers_scale)
        labels_temp = dbscan.labels_
        
        # count clusters and outliers
        n_clusters = len(set(labels_temp)) - (1 if -1 in labels_temp else 0)
        n_outliers = list(labels_temp).count(-1)
        
        # plot it
        plt.subplot(3, 3, i*3 + j + 1)
        scatter = plt.scatter(clean_customers['Grocery'], 
                              clean_customers['Fresh'],
                              c=labels_temp, 
                              cmap='viridis',
                              alpha=0.6)
        
        plt.xlabel('Grocery')
        plt.ylabel('Fresh')
        plt.title(f'eps={eps}, min_samples={min_samples}\n'
                  f'Clusters: {n_clusters}, Outliers: {n_outliers}')
        plt.colorbar(scatter)

plt.tight_layout()
plt.show()

# smaller eps = stricter neighborhoods, so usually more outliers
#               (the number of clusters can go up or down)
# larger min_samples = more outliers

**What I learned about DBSCAN parameters:**

- Smaller eps (like 0.3) is stricter: more points end up as outliers, and the clusters that remain tend to be smaller and more fragmented
- Larger eps (like 0.7) is more lenient: fewer outliers, and nearby clusters tend to merge into fewer, bigger ones
- Higher min_samples makes it harder to form clusters, so more points become outliers
- For this data, eps=0.5 and min_samples=5 seems like a good balance
- DBSCAN is cool because it finds outliers automatically, but K-Means is easier to interpret for business use
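
One common heuristic for narrowing down `eps`, offered here as an optional sketch rather than part of the lab, is to look at each point's distance to its k-th nearest neighbor, with k equal to `min_samples`. Points in dense regions have small k-distances, outliers have large ones, and a sharp rise in the sorted values hints at a reasonable `eps`. The data below are made up.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# made-up data: one dense group near the origin plus a single far-away point
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
X = np.vstack([dense, [[5.0, 5.0]]])

min_samples = 5

# querying the training points themselves returns each point as its own
# first neighbor (distance 0), matching DBSCAN's convention of counting
# the point itself toward min_samples
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# distance to the k-th neighbor for every point, sorted ascending
k_dist = np.sort(distances[:, -1])
print("smallest k-distances:", np.round(k_dist[:3], 3))
print("largest k-distances: ", np.round(k_dist[-3:], 3))
```

On the lab data, the same computation could be run on `customers_scale` and the sorted values plotted with `plt.plot(k_dist)` to look for the bend.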