<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Before-your-start:" data-toc-modified-id="Before-your-start:-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Before your start:</a></span></li><li><span><a href="#Challenge-1---Import-and-Describe-the-Dataset" data-toc-modified-id="Challenge-1---Import-and-Describe-the-Dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Challenge 1 - Import and Describe the Dataset</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?" data-toc-modified-id="Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?-2.0.0.1"><span class="toc-item-num">2.0.0.1&nbsp;&nbsp;</span>Explore the dataset with mathematical and visualization techniques. What do you find?</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-2---Data-Cleaning-and-Transformation" data-toc-modified-id="Challenge-2---Data-Cleaning-and-Transformation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Challenge 2 - Data Cleaning and Transformation</a></span></li><li><span><a href="#Challenge-3---Data-Preprocessing" data-toc-modified-id="Challenge-3---Data-Preprocessing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Challenge 3 - Data Preprocessing</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here." data-toc-modified-id="We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here.-4.0.0.1"><span class="toc-item-num">4.0.0.1&nbsp;&nbsp;</span>We will use the <code>StandardScaler</code> from <code>sklearn.preprocessing</code> and scale our data. 
Read more about <code>StandardScaler</code> <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler" target="_blank">here</a>.</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-4---Data-Clustering-with-K-Means" data-toc-modified-id="Challenge-4---Data-Clustering-with-K-Means-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Challenge 4 - Data Clustering with K-Means</a></span></li><li><span><a href="#Challenge-5---Data-Clustering-with-DBSCAN" data-toc-modified-id="Challenge-5---Data-Clustering-with-DBSCAN-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Challenge 5 - Data Clustering with DBSCAN</a></span></li><li><span><a href="#Challenge-6---Compare-K-Means-with-DBSCAN" data-toc-modified-id="Challenge-6---Compare-K-Means-with-DBSCAN-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Challenge 6 - Compare K-Means with DBSCAN</a></span></li><li><span><a href="#Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters" data-toc-modified-id="Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Bonus Challenge 2 - Changing K-Means Number of Clusters</a></span></li><li><span><a href="#Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples" data-toc-modified-id="Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Bonus Challenge 3 - Changing DBSCAN <code>eps</code> and <code>min_samples</code></a></span></li></ul></div>

# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [106]:
# Import your libraries:

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings                                              
from sklearn.exceptions import DataConversionWarning          
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

# Challenge 1 - Import and Describe the Dataset

In this lab, we will use a dataset containing information about customer preferences. We will look at how much each customer spends in a year on each subcategory in the grocery store and try to find similarities using clustering.

The dataset comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/wholesale+customers).

In [143]:
# loading the data: Wholesale customers data
data = pd.read_csv('../data/Wholesale customers data.csv')

In [None]:
display(data.head())
display(data.describe())
data.info()  # info() prints its summary directly and returns None

In [None]:
# sum columns to see the total of each
data.sum(axis=0).sort_values(ascending=False)

In [None]:
display(data['Channel'].value_counts())
display(data['Region'].value_counts())


In [None]:
# correlation matrix
corr = data.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

In [None]:
# plot pairwise feature relationships (pairplot creates its own figure)
sns.pairplot(data)
plt.show()

In [None]:
# see columnwise distribution
plt.figure(figsize=(12, 8))
data.boxplot()
plt.show()

In [None]:
# check skewness and kurtosis
display(data.skew())
display(data.kurt())


In [None]:
# check if data follows the pareto principle
total_spent = data.drop(columns=['Channel', 'Region']).sum(axis=1)
total_spent.sort_values(ascending=False, inplace=True)


great_total = data.drop(columns=['Channel', 'Region']).sum().sum()

cumulative = 0
for i, value in enumerate(total_spent):
    cumulative += value
    if cumulative >= great_total * 0.8:
        break

print(f'The top {i+1} customers account for 80% of the total spent.')
print(f'{i+1} is {((i+1)/len(data))*100:.2f}% of the total customers.')


# The Pareto principle is not followed in this dataset.

In [None]:
cumulative_spend = total_spent.cumsum() / total_spent.sum()

# Generates the pareto plot
plt.figure(figsize=(10, 6))
plt.plot(cumulative_spend.values, label="Cumulative Spend", color="blue")
plt.axhline(y=0.8, color="red", linestyle="--", label="80% Threshold")
# plot vertical line at 20% of customers
plt.axvline(x=len(data)*0.2, color="green", linestyle="--", label="20% Customers")

# plot a vertical line at i+1 (the number of customers needed to reach 80% of spend)
plt.axvline(x=i+1, color="orange", linestyle=":", label="Customers that account for 80% of the total spent")

# show label also on tick i+1
# plt.xticks(list(plt.xticks()[0]) + [i+1])

plt.title("Pareto Analysis: Cumulative Spend")
plt.xlabel("Customers (sorted by spending)")
plt.ylabel("Cumulative Spend Ratio")
plt.legend()
plt.show()

#### Explore the dataset with mathematical and visualization techniques. What do you find?

Checklist:

* What does each column mean?
    * > 1) FRESH: annual spending (m.u.) on fresh products (Continuous);
      > 2) MILK: annual spending (m.u.) on milk products (Continuous);
      > 3) GROCERY: annual spending (m.u.) on grocery products (Continuous);
      > 4) FROZEN: annual spending (m.u.) on frozen products (Continuous);
      > 5) DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
      > 6) DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
      > 7) CHANNEL: customer Channel - Horeca (Hotel/Restaurant/Cafe) or Retail channel (Nominal);
      > 8) REGION: customer Region - Lisbon, Oporto or Other (Nominal)

* Any categorical data to convert?
    * > CHANNEL and REGION

* Any missing data to remove?
    * > No missing data

* Column collinearity - any high correlations?
    * > Yes, DETERGENTS_PAPER and GROCERY

* Descriptive statistics - any outliers to remove?
    * > Yes, there are outliers in all columns
* Column-wise data distribution - is the distribution skewed?
    * > Yes, most columns are skewed to the right, and some columns have a lot of outliers. Most columns also have positive kurtosis, meaning heavier tails than a normal distribution.
* Etc.

Additional info: Over a century ago, the Italian economist Vilfredo Pareto observed that roughly 80% of the land in Italy was owned by 20% of the population. Applied to retail, the [Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle) suggests that roughly 20% of the customers account for 80% of the sales. Check if this dataset displays this characteristic.
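
As a compact way to run this check, the sketch below counts how many of the top spenders are needed to reach a given share of total spend. The helper name `customers_needed` and the toy numbers are hypothetical and are not part of the lab; only NumPy is assumed.

```python
import numpy as np

def customers_needed(spend, share=0.8):
    """Return how many top spenders are needed to reach `share` of total spend."""
    ordered = np.sort(np.asarray(spend, dtype=float))[::-1]  # largest first
    cumulative = np.cumsum(ordered) / ordered.sum()          # cumulative share
    # first position where the cumulative share reaches the threshold
    return int(np.searchsorted(cumulative, share) + 1)

# Hypothetical toy data: one big customer and nine small ones
toy_spend = [900, 20, 15, 15, 10, 10, 10, 10, 5, 5]
n_top = customers_needed(toy_spend)
print(n_top, "of", len(toy_spend), "customers reach 80% of the spend")
```

Dividing the result by the number of customers gives the percentage to compare against the 20% that the principle suggests.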

**Your observations here**

> - GROCERY and DETERGENTS_PAPER are highly correlated
> - Channel and region are categorical data
> - There is no missing data in the dataset
> - The columns are skewed to the right
> - There are outliers in all columns

# Challenge 2 - Data Cleaning and Transformation

If you concluded in the previous challenge that the data need cleaning or transformation, do it in the cells below. If you concluded that the data do not need to be cleaned or transformed, feel free to skip this challenge, but please provide a rationale.

In [None]:
# fix skewness with a Box-Cox transform (requires strictly positive values)
from scipy.stats import boxcox

spend_columns = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
for column in spend_columns:
    # boxcox returns (transformed values, fitted lambda); keep only the values
    data[column] = boxcox(data[column])[0]


In [None]:
print('skewness after fixing:')
display(data.skew())  # display() is needed: only a cell's last expression is shown

print('kurtosis after fixing:')
display(data.kurt())

In [None]:
# visualize histograms of the dataset columns (hist creates its own figure)
data.hist(figsize=(12, 8))
plt.show()

In [None]:
data.kurtosis()

In [None]:
# correlation matrix
corr = data.corr()
plt.figure(figsize=(6, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

In [None]:
# boxplot
plt.figure(figsize=(12, 8))
data.boxplot()
plt.show()

> After fixing the skewness, the correlation between GROCERY and DETERGENTS_PAPER diminished, so there is no need to drop one of the columns.
> The boxplot shows outliers in all columns, but we will not remove them: they are not errors, just extreme values, and there are not many of them.

In [None]:
# # Remove one of the highly correlated columns
# data.drop('Grocery', axis=1, inplace=True)

# # Channel and Region are categorical variables, so we will convert them to dummy variables 
# data = pd.get_dummies(data, columns=['Channel', 'Region'], drop_first=True)
# data.info()

**Your comment here**

-  ...
-  ...

# Challenge 3 - Data Preprocessing

One problem with the dataset is that the value ranges are remarkably different across various categories (e.g. `Fresh` and `Grocery` compared to `Detergents_Paper` and `Delicassen`). If you made this observation in the first challenge, you've done a great job! This means you not only completed the bonus questions in the previous Supervised Learning lab but also dug deep into [*feature scaling*](https://en.wikipedia.org/wiki/Feature_scaling). Keep up the good work!

Diverse value ranges in different features could cause issues in our clustering. The way to reduce the problem is through feature scaling. We'll use this technique again with this dataset.
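
To make the idea concrete, here is a minimal sketch of standardization done by hand with NumPy: each column has its mean subtracted and is divided by its population standard deviation, which is the same calculation `StandardScaler` applies with its default settings. The toy array is hypothetical.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
X = np.array([[12000.0, 3.0],
              [ 3000.0, 9.0],
              [ 9000.0, 6.0]])

# Standardization: subtract each column's mean, divide by its population std
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both features contribute on a comparable scale to the Euclidean distances that K-Means and DBSCAN rely on.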

#### We will use the `StandardScaler` from `sklearn.preprocessing` and scale our data. Read more about `StandardScaler` [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

*After scaling your data, assign the transformed data to a new variable `customers_scale`.*

In [None]:
# Your import here:
from sklearn.preprocessing import StandardScaler

# Your code here:
customers_scale = StandardScaler().fit_transform(data)
print(customers_scale.shape)
customers_scale.max(), customers_scale.min()


# Challenge 4 - Data Clustering with K-Means

Now let's cluster the data with K-Means first. Initiate the K-Means model, then fit your scaled data. The fitted model returned by the `.fit` method has an attribute called `labels_`, which holds the cluster number assigned to each data record. Assign these labels back to `customers` in a new column called `customers['labels']`. Then you'll see the cluster results alongside the original data.

In [None]:
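from sklearn.cluster import KMeans

# fit K-Means with 2 clusters; random_state makes the run reproducible
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(customers_scale)
clusters = km.labels_.tolist()

data['Label'] = clusters
display(data['Label'].value_counts())

# predict() assigns each point to its nearest fitted centroid
predicted_labels = km.predict(customers_scale)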

### Looking at the elbow, we can choose 2 as the number of clusters
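
The elbow heuristic referred to here can be sketched as follows: fit K-Means for a range of `k`, record the inertia (within-cluster sum of squared distances), and look for the `k` after which the decrease flattens out. This is an illustration on hypothetical synthetic blobs, not the computation used in this notebook; with the lab data you would pass `customers_scale` instead of `X_toy`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated blobs in 2D
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

# The "elbow" is where adding another cluster stops reducing inertia much
for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against `k` with `plt.plot` gives the usual elbow curve.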

In [21]:
# This was already written here...

# kmeans_2 = KMeans(n_clusters=2).fit(customers_scale)

# labels = kmeans_2.predict(customers_scale)

# clusters = kmeans_2.labels_.tolist()

In [None]:
# clean_customers['Label'] = clusters

Count the values in `labels`.

In [None]:
# count values in predicted_labels
print(np.bincount(predicted_labels))

data['Label'].value_counts()

# Challenge 5 - Data Clustering with DBSCAN

Now let's cluster the data using DBSCAN. Use `DBSCAN(eps=0.5)` to initiate the model, then fit your scaled data. In the data returned from the `.fit` method, assign the `labels_` back to `customers['labels_DBSCAN']`. Now your original data have two labels, one from K-Means and the other from DBSCAN.
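
Before reading the DBSCAN label counts, it may help to see how `labels_` is encoded: cluster ids start at 0, and points that belong to no cluster are marked as noise with the label `-1`. The sketch below uses hypothetical toy points and an illustrative `min_samples=3`, not the lab data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups plus one far-away point
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                  [20.0, 20.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit(X_toy).labels_
print(labels)

# -1 is the noise label, so it is not counted as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, "clusters,", n_noise, "noise point(s)")
```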

In [177]:
from sklearn.cluster import DBSCAN 

# Your code here
dbscan = DBSCAN(eps=0.5).fit(customers_scale)
data['labels_DBSCAN'] = dbscan.labels_


Count the values in `labels_DBSCAN`.

In [None]:
# Your code here
data['labels_DBSCAN'].value_counts()

# Challenge 6 - Compare K-Means with DBSCAN

Now we want to visually compare how K-Means and DBSCAN have clustered our data. We will create scatter plots for several columns. For each of the following column pairs, plot a scatter plot using `labels` and another using `labels_DBSCAN`. Put them side by side to compare. Which clustering algorithm makes more sense?

Columns to visualize:

* `Detergents_Paper` as X and `Milk` as y
* `Grocery` as X and `Fresh` as y
* `Frozen` as X and `Delicassen` as y

Visualize `Detergents_Paper` as X and `Milk` as y by `labels` and `labels_DBSCAN` respectively

In [179]:
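def plot(x, y, hue1, hue2):
    """Scatter x against y twice, side by side, colored by hue1 and by hue2."""
    plt.figure(figsize=(15, 6))

    # left panel: colored by the first set of labels
    plt.subplot(1, 2, 1)
    sns.scatterplot(x=x, y=y, hue=hue1)
    plt.title(x.name + ' vs ' + y.name + ' (' + hue1.name + ')')

    # right panel: colored by the second set of labels
    plt.subplot(1, 2, 2)
    sns.scatterplot(x=x, y=y, hue=hue2)
    plt.title(x.name + ' vs ' + y.name + ' (' + hue2.name + ')')

    plt.tight_layout()
    plt.show()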

In [None]:
# Your code here:
plot(data['Detergents_Paper'], data['Milk'], data['Label'], data['labels_DBSCAN'])


Visualize `Grocery` as X and `Fresh` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# Your code here:
plot(data['Grocery'], data['Fresh'], data['Label'], data['labels_DBSCAN'])

Visualize `Frozen` as X and `Delicassen` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# Your code here:
plot(data['Frozen'], data['Delicassen'], data['Label'], data['labels_DBSCAN'])

Let's use a groupby to see how the mean differs between the groups. Group `customers` by `labels` and `labels_DBSCAN` respectively and compute the means for all columns.

In [None]:
# Your code here:
display(data.groupby(['Label']).mean())

data.groupby(['labels_DBSCAN']).mean()

Which algorithm appears to perform better?

**Your observations here**

> - K-Means performed better than DBSCAN. With `eps=0.5`, DBSCAN labeled all the data as outliers (noise, label `-1`), while K-Means was able to split the data into clusters.
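
One way to check an observation like this numerically is a contingency table of the two label columns. The sketch below uses `pd.crosstab` on hypothetical labels that stand in for `Label` and `labels_DBSCAN`; the values are made up for illustration only.

```python
import pandas as pd

# Hypothetical label columns, standing in for `Label` and `labels_DBSCAN`
toy = pd.DataFrame({
    "Label":         [0, 0, 0, 1, 1, 1],
    "labels_DBSCAN": [-1, -1, 0, -1, -1, -1],
})

# Rows: K-Means clusters; columns: DBSCAN labels (-1 means noise)
table = pd.crosstab(toy["Label"], toy["labels_DBSCAN"])
print(table)
```

A table dominated by the `-1` column would indicate that DBSCAN treated most points as noise regardless of their K-Means cluster.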

# Bonus Challenge 2 - Changing K-Means Number of Clusters

As we mentioned earlier, we don't need to worry about the number of clusters with DBSCAN because it automatically decides that based on the parameters we send to it. But with K-Means, we have to supply the `n_clusters` param (if you don't supply `n_clusters`, the algorithm will use `8` by default). You need to know that the optimal number of clusters differs case by case based on the dataset. K-Means can perform badly if the wrong number of clusters is used.

In advanced machine learning, data scientists try different numbers of clusters and evaluate the results with statistical measures (read [here](https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation)). We are not using statistical measures today but we'll use our eyes instead. In the cells below, experiment with different numbers of clusters and visualize with scatter plots. What number of clusters seems to work best for K-Means?

In [None]:
# grid search kmeans clusters
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import silhouette_score
import os
os.environ["OMP_NUM_THREADS"] = "2"

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10]

for k in n_clusters:
    # random_state makes each fit reproducible
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(customers_scale)
    # cluster sizes: number of points assigned to each cluster
    print('\n', np.bincount(kmeans.labels_))

    print(f'Silhouette score for {k} clusters: {silhouette_score(customers_scale, kmeans.labels_)}')
# param_grid = {'n_clusters': [2, 3, 4, 5, 6, 7, 8, 9, 10]}
# kmeans = KMeans()
# kmeans_cv = GridSearchCV(kmeans, param_grid, cv=3)
# kmeans_cv.fit(customers_scale)
# print(kmeans_cv.best_params_)
# print(kmeans_cv.best_score_)

print("The silhouette score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). \nIt ranges from -1 to 1, and values close to 1 are better. Here the silhouette score is highest when the number of \nclusters is 2, which suggests the data does not separate cleanly into a larger number of clusters.")


**Your comment here**

- 

# Bonus Challenge 3 - Changing DBSCAN `eps` and `min_samples`

Experiment with changing the `eps` and `min_samples` params for DBSCAN. See how the results differ with scatter plot visualization.
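
A common heuristic for narrowing down `eps`, which is not part of the original lab, is to look at k-distances: for a chosen `min_samples`, compute each point's distance to its `min_samples`-th nearest neighbor (counting the point itself) and inspect the sorted values. The sketch below assumes scikit-learn's `NearestNeighbors` and uses hypothetical random data in place of `customers_scale`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data standing in for `customers_scale`
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))

min_samples = 5
eps = 0.3

# Distance from each point to its min_samples-th nearest neighbor,
# counting the point itself (it comes back as the first neighbor)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_toy)
distances, _ = nn.kneighbors(X_toy)
k_dist = np.sort(distances[:, -1])

# Points with k-distance <= eps have at least min_samples points within eps,
# which is the condition DBSCAN uses for core points
n_core_expected = int(np.sum(k_dist <= eps))
n_core_dbscan = len(DBSCAN(eps=eps, min_samples=min_samples).fit(X_toy).core_sample_indices_)
print(n_core_expected, n_core_dbscan)

# Quantiles of the k-distances give a feel for plausible eps values
print(np.round(np.quantile(k_dist, [0.5, 0.9, 0.99]), 2))
```

If every k-distance is well above the chosen `eps`, no point qualifies as a core point and DBSCAN marks everything as noise, so plotting `k_dist` is a quick sanity check before running a grid of parameters.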

In [None]:
# manual grid search for DBSCAN over eps and min_samples
eps_values = [0.1, 0.5, 1, 1.5, 2]
min_samples_values = [5, 10, 15, 20, 25]

for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        dbscan.fit(customers_scale)
        print(f'eps: {eps}, min_samples: {min_samples}')
        # label -1 marks noise points; other labels are cluster ids
        print(pd.Series(dbscan.labels_).value_counts())
        print()


# dbscan = DBSCAN()
# dbscan_cv = GridSearchCV(dbscan, param_grid, cv=3,scoring='adjusted_rand_score')
# dbscan_cv.fit(customers_scale)
# print(dbscan_cv.best_params_)
# print(dbscan_cv.best_score_)


**Your comment here**

- 