# Clustering: Practical Exercise

In [None]:
# Uncomment to upgrade packages
# !pip install pandas --upgrade --user --quiet
# !pip install numpy --upgrade --user --quiet
# !pip install scipy --upgrade --user --quiet
# !pip install statsmodels --upgrade --user --quiet
# !pip install scikit-learn --upgrade --user --quiet
# !pip install matplotlib --upgrade --user --quiet

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

We will now explore a dataset of customers of a wholesale distributor, taken from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wholesale+customers), using the clustering algorithms that we have seen in the lab session.

The data is provided as a CSV file among the files of this session. We can load it this way:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


In [None]:
wholesale = pd.read_csv("Wholesale customers data.csv", sep=',')

These are the dimensions of the dataset and the names of the variables:

In [None]:
wholesale.shape
wholesale.columns

According to the documentation, the variables **Channel** and **Region** are categorical, so we will convert them to the categorical data type and use the labels that appear in the documentation.

In [None]:
wholesale.Channel = wholesale.Channel.replace([1,2],['horeca', 'retail']).astype('category')
wholesale.Region = wholesale.Region.replace([1,2,3],['Lisbon', 'Oporto', 'Other']).astype('category')

wholesale.dtypes

As usual, we will start with the descriptive statistics of the data.

In [None]:
wholesale.describe(include='all')

We will also plot the variables to check their distributions and to spot outliers or skewed distributions.

In [None]:
wholesale.hist(bins=20,figsize=(10,10));

In [None]:
# boxplot draws on the current axes, so the figure size is set once here
fig = plt.figure(figsize=(10,10))
wholesale.boxplot();

You will see that there are some extreme values that skew the distributions (basically outliers).

Do these values belong to the same examples? Check it by querying the dataframe, for instance:

In [None]:
wholesale[wholesale.Fresh>80000]

You can consider removing some of the examples with extreme values.
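One possible way to check whether the extreme values fall on the same examples, and to drop them, is sketched below. This is only an illustration under stated assumptions: the dataframe `demo` is synthetic stand-in data (not the wholesale dataset), and the 99th percentile cutoff is an arbitrary choice that you should adapt after looking at the plots.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for some numeric spending columns (assumption, not the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.lognormal(mean=8, sigma=1, size=(200, 3)),
    columns=["Fresh", "Frozen", "Detergents_Paper"],
)

# Per-column cutoff: the 99th percentile (arbitrary choice for illustration)
thresholds = demo.quantile(0.99)

# Boolean dataframe: True where a value exceeds the cutoff of its own column
is_extreme = demo.gt(thresholds, axis=1)

# Number of variables in which each example is extreme
extreme_counts = is_extreme.sum(axis=1)
extreme_counts[extreme_counts > 0]

# Keep only the examples that are not extreme in any variable
trimmed = demo[extreme_counts == 0]
trimmed.shape
```

With the real data, the same steps could be applied to the numeric columns only, for example `wholesale.select_dtypes('number')`.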

# Visualizing patterns

We cannot visualize all the variables at the same time, but we can visualize pairs of variables to see if there are clusters of examples.

This is no guarantee that there are clusters in more dimensions, but until we learn how to transform multidimensional data to two or three dimensions it will do.

Visualize different groups of variables and check if outliers are still a problem.

In [None]:
from pandas.plotting import scatter_matrix

scatter_matrix(wholesale.loc[:,['Fresh','Frozen','Detergents_Paper']],
               alpha=0.2, figsize=(12, 12),
               diagonal='kde', marker='o');

We can also make some 3D visualizations.

Also check for remaining outliers.

In [None]:
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
# Draw on the 3D axes directly instead of going through plt.scatter
ax.scatter(wholesale['Fresh'], wholesale['Frozen'], zs=wholesale['Detergents_Paper'], depthshade=False, s=100);


# Clustering

In [None]:
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

Now we can try to partition the data using the two clustering algorithms that we have seen in this session. First we will generate a dataset without the categorical variables, because we cannot use them with these algorithms. These variables may also be related to the clusters in the data, so we do not want to make it easy for the algorithms :-)

In [None]:
newwholesale = wholesale.drop(columns=['Channel', 'Region'])

In [None]:
newwholesale.describe()

Now it is your turn to find clusters in the data:

- Review the notebook of this session and replicate the code to assess how many clusters are adequate for this data using K-means
    - Explore a range of numbers of clusters and save the Calinski-Harabasz index for each number of clusters
    - Plot the number of clusters vs the index and see if there is a number of clusters that seems more adequate
- Use the EM algorithm with the best number of clusters and find a clustering for the data using the "diag" and "full" covariance types
- Extract the assignments of the best solutions for K-means and EM, create a dataframe with them and use the crosstab function to compute a contingency table of the assignments. Do the partitions from both methods agree?
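The steps above could be sketched roughly as follows. This is a hedged outline, not the reference solution of the session: it runs on synthetic blobs generated with `make_blobs` (an assumption, used so that the cell is self-contained), and the range of 2 to 8 clusters, the `n_init` values and the random seeds are arbitrary choices. In the exercise you would replace `X` with `newwholesale`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption); use newwholesale in the exercise
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)

# 1. K-means over a range of k, saving the Calinski-Harabasz index for each k
scores = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = calinski_harabasz_score(X, km.labels_)

# Highest index is taken as "best" here; also inspect the curve by eye
best_k = max(scores, key=scores.get)

# 2. Final K-means and EM (Gaussian mixture) with "diag" and "full" covariance
km_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
gm_labels = {}
for cov in ("diag", "full"):
    gmm = GaussianMixture(n_components=best_k, covariance_type=cov,
                          n_init=5, random_state=0)
    gm_labels[cov] = gmm.fit_predict(X)

# 3. Contingency table of the K-means and EM ("full") assignments
assignments = pd.DataFrame({"kmeans": km_labels, "em_full": gm_labels["full"]})
table = pd.crosstab(assignments["kmeans"], assignments["em_full"])
table
```

To plot the index against the number of clusters, something like `plt.plot(list(scores), list(scores.values()), marker='o')` should be enough. Cluster labels are arbitrary, so two partitions agree when each row of the table has most of its counts in a single column, not necessarily on the diagonal.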

