<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Before-your-start:" data-toc-modified-id="Before-your-start:-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Before your start:</a></span></li><li><span><a href="#Challenge-1---Import-and-Describe-the-Dataset" data-toc-modified-id="Challenge-1---Import-and-Describe-the-Dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Challenge 1 - Import and Describe the Dataset</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?" data-toc-modified-id="Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?-2.0.0.1"><span class="toc-item-num">2.0.0.1&nbsp;&nbsp;</span>Explore the dataset with mathematical and visualization techniques. What do you find?</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-2---Data-Cleaning-and-Transformation" data-toc-modified-id="Challenge-2---Data-Cleaning-and-Transformation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Challenge 2 - Data Cleaning and Transformation</a></span></li><li><span><a href="#Challenge-3---Data-Preprocessing" data-toc-modified-id="Challenge-3---Data-Preprocessing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Challenge 3 - Data Preprocessing</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here." data-toc-modified-id="We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here.-4.0.0.1"><span class="toc-item-num">4.0.0.1&nbsp;&nbsp;</span>We will use the <code>StandardScaler</code> from <code>sklearn.preprocessing</code> and scale our data. Read more about <code>StandardScaler</code> <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler" target="_blank">here</a>.</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-4---Data-Clustering-with-K-Means" data-toc-modified-id="Challenge-4---Data-Clustering-with-K-Means-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Challenge 4 - Data Clustering with K-Means</a></span></li><li><span><a href="#Challenge-5---Data-Clustering-with-DBSCAN" data-toc-modified-id="Challenge-5---Data-Clustering-with-DBSCAN-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Challenge 5 - Data Clustering with DBSCAN</a></span></li><li><span><a href="#Challenge-6---Compare-K-Means-with-DBSCAN" data-toc-modified-id="Challenge-6---Compare-K-Means-with-DBSCAN-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Challenge 6 - Compare K-Means with DBSCAN</a></span></li><li><span><a href="#Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters" data-toc-modified-id="Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Bonus Challenge 2 - Changing K-Means Number of Clusters</a></span></li><li><span><a href="#Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples" data-toc-modified-id="Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Bonus Challenge 3 - Changing DBSCAN <code>eps</code> and <code>min_samples</code></a></span></li></ul></div>

# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In this lab, you are working with a dataset containing information about customer spending in a wholesale retail setting. The goal is to apply clustering algorithms such as K-Means and DBSCAN to segment the customers based on their purchasing behavior. By doing so, you aim to identify groups of customers with similar characteristics, which can be valuable for targeted marketing, customer segmentation, and understanding the customer base.

The specific objectives of this lab may include:

Exploring customer spending patterns and behavior.
Uncovering distinct customer segments within the dataset.
Applying clustering algorithms to identify groups of customers with similar purchasing characteristics.
Comparing the performance of different clustering algorithms (e.g., K-Means and DBSCAN).
Assessing the effectiveness of the clusterings and their potential insights for the business.
Ultimately, the lab provides an opportunity to gain practical experience with clustering techniques, understand customer segmentation, and draw actionable insights from the data to drive business decisions.

In [3]:
# Import your libraries:

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings                                              
from sklearn.exceptions import DataConversionWarning          
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

# Challenge 1 - Import and Describe the Dataset

In this lab, we will use a dataset containing information about customer preferences. We will look at how much each customer spends in a year on each subcategory in the grocery store and try to find similarities using clustering.

The origin of the dataset is [here](https://archive.ics.uci.edu/ml/datasets/wholesale+customers).

In [None]:
# Load the dataset into a pandas DataFrame
file_path = 'C:/Users/Latif-Calderón/lab-unsupervised-learning-en/data/Wholesale_customers_data.csv'
wholesale_data = pd.read_csv(file_path) #pd. is pandas module and need to be imported to run

# Display the first few rows of the dataset to understand its structure
print(wholesale_data.head())

# Summary information about the dataset, including data types and missing values
print(wholesale_data.info())

#### Explore the dataset with mathematical and visualization techniques. What do you find?

Checklist:

* What does each column mean?
* Any categorical data to convert?
* Any missing data to remove?
* Column collinearity - any high correlations?
* Descriptive statistics - any outliers to remove?
* Column-wise data distribution - is the distribution skewed?
* Etc.

Additional info: Over a century ago, an Italian economist named Vilfredo Pareto discovered that roughly 20% of the customers account for 80% of the typical retail sales. This is called the [Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle). Check if this dataset displays this characteristic.

In [None]:
# Your code here:
#1 What does each column mean?
# Display the first few rows of the dataset to understand its structure
print(wholesale_data.head())

In [None]:
#2 Any categorical data to convert?
# Display the data types of the columns
print(wholesale_data.dtypes)

In [None]:
#3 Any missing data to remove?
# Check for missing values
missing_values = wholesale_data.isnull().sum()
print(missing_values)

In [None]:
#4 Column collinearity - any high correlations?
# Check for missing values
# Calculate the correlation matrix
correlation_matrix = wholesale_data.corr()

# Show the correlation matrix
print(correlation_matrix)

# Visualize the correlation matrix
import seaborn as sns
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="YlGnBu")
plt.show()

In [None]:
#5 Descriptive statistics - any outliers to remove?
# Descriptive statistics
print(wholesale_data.describe())

In [None]:
#6 Column-wise data distribution - is the distribution skewed?
# Data distribution for each column
import matplotlib.pyplot as plt
wholesale_data.hist(bins=20, figsize=(15,10))
plt.show()

In [None]:
import matplotlib.pyplot as plt

# Assuming you have generated a histogram
plt.hist(wholesale_data, bins=20)
plt.title('Histogram of Your Data')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Save the histogram as an image file
plt.savefig('histogram.png')

## Pareto Principal

In [None]:
#Analyze the annual spending of customers across different product categories
import matplotlib.pyplot as plt

# Box plot of annual spending in different product categories
plt.figure(figsize=(12, 8))
wholesale_data.boxplot(column=['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen'])
plt.title('Annual Spending in Different Product Categories')
plt.ylabel('Annual Spending')
plt.xlabel('Product Categories')
plt.xticks(rotation=45)
plt.show()

In [None]:
#Calculate the cumulative percentage of customer spending.
# Calculate the cumulative percentage of customer spending
total_spending = wholesale_data.sum(axis=1)
cumulative_percentage = total_spending.sort_values().cumsum() / total_spending.sum() * 100

# Display the cumulative percentage
print(cumulative_percentage)

In [None]:
#Evaluate whether around 20% of the customers account for approximately 80% of the total spending
# Evaluate the concentration of spending
percentage_spending_80 = cumulative_percentage[cumulative_percentage <= 80].index.max() / cumulative_percentage.index.max() * 100
percentage_customers_20 = cumulative_percentage[cumulative_percentage <= 20].index.max() / cumulative_percentage.index.max() * 100

print(f"Around {percentage_customers_20:.2f}% of the customers account for approximately {percentage_spending_80:.2f}% of the total spending")

**Your observations here**

- ex.: Frozen, Grocery, Milk and Detergents Paper have a high...
- ...



# Challenge 2 - Data Cleaning and Transformation

If your conclusion from the previous challenge is the data need cleaning/transformation, do it in the cells below. However, if your conclusion is the data need not be cleaned or transformed, feel free to skip this challenge. But if you do choose the latter, please provide rationale.

In [81]:
# Your code here

**Your comment here**

-  Around 100.00% of the customers account for approximately 100.00% of the total spending
-  ...

# Challenge 3 - Data Preprocessing

One problem with the dataset is the value ranges are remarkably different across various categories (e.g. `Fresh` and `Grocery` compared to `Detergents_Paper` and `Delicassen`). If you made this observation in the first challenge, you've done a great job! This means you not only completed the bonus questions in the previous Supervised Learning lab but also researched deep into [*feature scaling*](https://en.wikipedia.org/wiki/Feature_scaling). Keep on the good work!

Diverse value ranges in different features could cause issues in our clustering. The way to reduce the problem is through feature scaling. We'll use this technique again with this dataset.

#### We will use the `StandardScaler` from `sklearn.preprocessing` and scale our data. Read more about `StandardScaler` [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

*After scaling your data, assign the transformed data to a new variable `customers_scale`.*

In [None]:
# Your import here:

from sklearn.preprocessing import StandardScaler

# Your code here:
# Scale the data using StandardScaler
scaler = StandardScaler()  # Initialize the StandardScaler
customers_scale = scaler.fit_transform(wholesale_data)  # Fit and transform the data

# The scaled data is now stored in the customers_scale variable
print(customers_scale)

# Challenge 4 - Data Clustering with K-Means

Now let's cluster the data with K-Means first. Initiate the K-Means model, then fit your scaled data. In the data returned from the `.fit` method, there is an attribute called `labels_` which is the cluster number assigned to each data record. What you can do is to assign these labels back to `customers` in a new column called `customers['labels']`. Then you'll see the cluster results of the original data.

In [83]:
from sklearn.cluster import KMeans

# Initialize the KMeans model and specify the number of clusters
kmeans = KMeans(n_clusters=5)  # For example, let's use 5 clusters

# Fit the scaled data to the KMeans model
kmeans.fit(customers_scale)

# Assign the cluster labels to the original customers dataset
wholesale_data['labels'] = kmeans.labels_

### Looking to the elbow we can choose 2 like the correct number of clusters

In [84]:
kmeans_2 = KMeans(n_clusters=2).fit(customers_scale)

labels = kmeans_2.predict(customers_scale)

clusters = kmeans_2.labels_.tolist()

In [85]:
wholesale_data['Label'] = clusters

Count the values in `labels`.

In [None]:
# Your code here:
# Count the values in the 'labels' column
label_counts = wholesale_data['Label'].value_counts()
print(label_counts)

# Challenge 5 - Data Clustering with DBSCAN

Now let's cluster the data using DBSCAN. Use `DBSCAN(eps=0.5)` to initiate the model, then fit your scaled data. In the data returned from the `.fit` method, assign the `labels_` back to `customers['labels_DBSCAN']`. Now your original data have two labels, one from K-Means and the other from DBSCAN.

In [87]:
from sklearn.cluster import DBSCAN 

# Your code here
# Initialize the DBSCAN model
dbscan = DBSCAN(eps=0.5)  # Using eps=0.5 as an example

# Fit the scaled data to the DBSCAN model
dbscan.fit(customers_scale)

# Assign the cluster labels to the original customers dataset
wholesale_data['labels_DBSCAN'] = dbscan.labels_

Count the values in `labels_DBSCAN`.

In [None]:
# Your code here
# Count the values in the 'labels_DBSCAN' column
label_counts_DBSCAN = wholesale_data['labels_DBSCAN'].value_counts()
print(label_counts_DBSCAN)

# Challenge 6 - Compare K-Means with DBSCAN

Now we want to visually compare how K-Means and DBSCAN have clustered our data. We will create scatter plots for several columns. For each of the following column pairs, plot a scatter plot using `labels` and another using `labels_DBSCAN`. Put them side by side to compare. Which clustering algorithm makes better sense?

Columns to visualize:

* `Detergents_Paper` as X and `Milk` as y
* `Grocery` as X and `Fresh` as y
* `Frozen` as X and `Delicassen` as y

Visualize `Detergents_Paper` as X and `Milk` as y by `labels` and `labels_DBSCAN` respectively

In [89]:
def plot(x,y,hue):
    sns.scatterplot(x=x, 
                    y=y,
                    hue=hue)
    plt.title('Detergents Paper vs Milk ')
    return plt.show();

In [None]:
# Your code here:
##import seaborn as sns
##import matplotlib.pyplot as plt

def plot(x, y, hue, data, title):
    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    sns.scatterplot(x=x, 
                    y=y,
                    hue=hue,
                    data=data)
    plt.title(f'{title} - K-Means')

    plt.subplot(1, 2, 2)
    sns.scatterplot(x=x, 
                    y=y,
                    hue='labels_DBSCAN',
                    data=data)
    plt.title(f'{title} - DBSCAN')
    plt.show()

# Use the 'plot' function to visualize the Detergents_Paper vs. Milk by K-Means and DBSCAN labels
plot('Detergents_Paper', 'Milk', 'labels', wholesale_data, 'Detergents Paper vs Milk')

Visualize `Grocery` as X and `Fresh` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# Your code here:
# Use the 'plot' function to visualize the Grocery vs. Fresh by K-Means and DBSCAN labels
plot('Grocery', 'Fresh', 'labels', wholesale_data, 'Grocery vs Fresh')

Visualize `Frozen` as X and `Delicassen` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# Your code here:
# Use the 'plot' function to visualize the Frozen vs. Delicassen by K-Means and DBSCAN labels
plot('Frozen', 'Delicassen', 'labels', wholesale_data, 'Frozen vs Delicassen')

Let's use a groupby to see how the mean differs between the groups. Group `customers` by `labels` and `labels_DBSCAN` respectively and compute the means for all columns.

In [None]:
# Your code here:
# Group customers by K-Means labels and compute the column means
kmeans_means = wholesale_data.groupby('labels').mean()
print("K-Means Cluster Means:")
print(kmeans_means)

# Group customers by DBSCAN labels and compute the column means
dbscan_means = wholesale_data.groupby('labels_DBSCAN').mean()
print("\nDBSCAN Cluster Means:")
print(dbscan_means)

Which algorithm appears to perform better?

-The best algorithm seems to be K-Means because it seems to indentify the most recurrent customers of the products which they buy. 

**Your observations here**

- When comparing K-Means and DBSCAN, there are several considerations to take into account:

K-Means is sensitive to the number of clusters specified and may not perform well with irregular-shaped clusters or varying densities.
DBSCAN is robust to noise and can find arbitrarily-shaped clusters, but it requires more care in setting its hyperparameters such as epsilon and min_samples.
Based on the scatter plots, cluster means, and the average spending behaviors of the different clusters, you can make a comparative assessment of which algorithm better captures the inherent structure of the Wholesale customers dataset. It's important to note that the "better" algorithm depends on the specific characteristics of the data and the problem being addressed.

As we have conducted a comparative analysis, considering the insights gained from the visualizations, the consistency and separation of the clusters, and the average spending behaviors, you can draw conclusions about which algorithm appears to better capture the structure of the dataset.

# Bonus Challenge 2 - Changing K-Means Number of Clusters

As we mentioned earlier, we don't need to worry about the number of clusters with DBSCAN because it automatically decides that based on the parameters we send to it. But with K-Means, we have to supply the `n_clusters` param (if you don't supply `n_clusters`, the algorithm will use `8` by default). You need to know that the optimal number of clusters differs case by case based on the dataset. K-Means can perform badly if the wrong number of clusters is used.

In advanced machine learning, data scientists try different numbers of clusters and evaluate the results with statistical measures (read [here](https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation)). We are not using statistical measures today but we'll use our eyes instead. In the cells below, experiment with different number of clusters and visualize with scatter plots. What number of clusters seems to work best for K-Means?

In [None]:
# Your code here:.
# Experiment with different numbers of clusters and visualize the results
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

for n_clusters, ax in zip([2, 3, 4, 5], axes.flatten()):
    kmeans_n = KMeans(n_clusters=n_clusters)
    kmeans_n.fit(customers_scale)
    labels_n = kmeans_n.labels_
    
    sns.scatterplot(x='Detergents_Paper', y='Milk', hue=labels_n, data=wholesale_data, ax=ax)
    ax.set_title(f'Number of clusters: {n_clusters}')
    ax.legend(loc='upper right')

plt.show()

**Your comment here**

- when the cluster is less dense , 2 clusters for example, it's easier to identify the most recurrent customers that buy detergent paper. whe the cluster is divided in more groups, each groups can be targeted to there especific needs. 

# Bonus Challenge 3 - Changing DBSCAN `eps` and `min_samples`

Experiment changing the `eps` and `min_samples` params for DBSCAN. See how the results differ with scatter plot visualization.

In [None]:
# Your code here
eps_values = [0.2, 0.5, 1.0]  # Example values for eps
min_samples_values = [5, 10, 20]  # Example values for min_samples
#this changes in comparison of the tables./fig,axes...
fig, axes = plt.subplots(len(eps_values), len(min_samples_values), figsize=(15, 12))

for i, eps in enumerate(eps_values):
    for j, min_samples in enumerate(min_samples_values):
        dbscan_param = DBSCAN(eps=eps, min_samples=min_samples)
        dbscan_param.fit(customers_scale)
        labels_param = dbscan_param.labels_
        #this is replace with the plt.figure...
        sns.scatterplot(x='Detergents_Paper', y='Milk', hue=labels_param, data=wholesale_data, ax=axes[i, j])
        axes[i, j].set_title(f'eps={eps}, min_samples={min_samples}')
        axes[i, j].legend(loc='upper right')

plt.show()

In [None]:
# Example values for eps and min_samples
eps_values = [0.2, 0.5, 1.0]
min_samples_values = [5, 10, 20]

for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan_param = DBSCAN(eps=eps, min_samples=min_samples)
        dbscan_param.fit(customers_scale)
        labels_param = dbscan_param.labels_
        
        plt.figure(figsize=(8, 6))
        sns.scatterplot(x='Detergents_Paper', y='Milk', hue=labels_param, data=wholesale_data)
        plt.title(f'eps={eps}, min_samples={min_samples}')
        plt.legend(loc='upper right')
        plt.show()

**Your comment here**

- After observing the visual representations, you can analyze the effects of parameter variations on the cluster formation and determine the settings that yield the most meaningful clustering results.