# Unsupervised Project

In this project, you will work with a dataset specifying the sale quantities for different products. The rows indicate different products, and the columns will indicate the amount of sales that took place in a given week. You will put to use what you have learned from preprocessing, principal component analysis, kmeans unsupervised clustering, pipelines, and model persistence. 

## Part 0 - Importing the Dataset

The cell below imports the relevant libraries you need and imports the dataset. Run the cell below without modifying it, and then you can proceed.

In [85]:
# Imports needed in this exercise set
import numpy as np
import pandas as pd
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

# Save the dataset in the same folder as this notebook
sales = pd.read_csv("Sales_Transactions_Dataset_Weekly.csv")

## Part 1 - Exploring the Dataset

Let's start as usual with exploring the dataset.

In [86]:
# Check out the first 5 rows of the dataset
sales.head(5)

Unnamed: 0,Product_Code,W0,W1,W2,W3,W4,W5,W6,W7,W8,...,Normalized 42,Normalized 43,Normalized 44,Normalized 45,Normalized 46,Normalized 47,Normalized 48,Normalized 49,Normalized 50,Normalized 51
0,P1,11,12,10,8,13,12,14,21,6,...,0.06,0.22,0.28,0.39,0.5,0.0,0.22,0.17,0.11,0.39
1,P2,7,6,3,2,7,1,6,3,3,...,0.2,0.4,0.5,0.1,0.1,0.4,0.5,0.1,0.6,0.0
2,P3,7,11,8,9,10,8,7,13,12,...,0.27,1.0,0.18,0.18,0.36,0.45,1.0,0.45,0.45,0.36
3,P4,12,8,13,5,9,6,9,13,13,...,0.41,0.47,0.06,0.12,0.24,0.35,0.71,0.35,0.29,0.35
4,P5,8,5,13,11,6,7,9,14,9,...,0.27,0.53,0.27,0.6,0.2,0.2,0.13,0.53,0.33,0.4


Not surprisingly, the naming `W0`, `W1`, ..., `W51` represents the 52 weeks of the year, while `P1`, `P2`, ... represents the different products.

Here we can see that the column `Product_Code` should really be the index, as this column represents the observations uniquely, and does not give more information about them.

In [87]:
# Set the index to be the Product_Code column and remove the Product_Code column afterward
sales.set_index("Product_Code", inplace=True)

There are many columns (107). The last part of the columns in the dataset gives the same columns as previously, only normalized.

In [88]:
# Check out the columns with sales.columns
sales.columns

Index(['W0', 'W1', 'W2', 'W3', 'W4', 'W5', 'W6', 'W7', 'W8', 'W9',
       ...
       'Normalized 42', 'Normalized 43', 'Normalized 44', 'Normalized 45',
       'Normalized 46', 'Normalized 47', 'Normalized 48', 'Normalized 49',
       'Normalized 50', 'Normalized 51'],
      dtype='object', length=106)

As you can see, there is an MIN column, a MAX column, and 52 columns giving normalized information. Remove all of these as we only need the 52 columns representing weekly sales.

In [89]:
# Keep only the unnormalized columns
columns_keep = [c for c in sales.columns if c.startswith("W")]
sales = sales[columns_keep]

In [90]:
# Show the first 5 rows again to make sure that everything is as you suspect
sales.head(5)

Unnamed: 0_level_0,W0,W1,W2,W3,W4,W5,W6,W7,W8,W9,...,W42,W43,W44,W45,W46,W47,W48,W49,W50,W51
Product_Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
P1,11,12,10,8,13,12,14,21,6,14,...,4,7,8,10,12,3,7,6,5,10
P2,7,6,3,2,7,1,6,3,3,3,...,2,4,5,1,1,4,5,1,6,0
P3,7,11,8,9,10,8,7,13,12,6,...,6,14,5,5,7,8,14,8,8,7
P4,12,8,13,5,9,6,9,13,13,11,...,9,10,3,4,6,8,14,8,7,8
P5,8,5,13,11,6,7,9,14,9,9,...,7,11,7,12,6,6,5,11,8,9


In [91]:
# Make sure that none of the columns have missing values
sales.info()

<class 'pandas.core.frame.DataFrame'>
Index: 811 entries, P1 to P819
Data columns (total 52 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   W0      811 non-null    int64
 1   W1      811 non-null    int64
 2   W2      811 non-null    int64
 3   W3      811 non-null    int64
 4   W4      811 non-null    int64
 5   W5      811 non-null    int64
 6   W6      811 non-null    int64
 7   W7      811 non-null    int64
 8   W8      811 non-null    int64
 9   W9      811 non-null    int64
 10  W10     811 non-null    int64
 11  W11     811 non-null    int64
 12  W12     811 non-null    int64
 13  W13     811 non-null    int64
 14  W14     811 non-null    int64
 15  W15     811 non-null    int64
 16  W16     811 non-null    int64
 17  W17     811 non-null    int64
 18  W18     811 non-null    int64
 19  W19     811 non-null    int64
 20  W20     811 non-null    int64
 21  W21     811 non-null    int64
 22  W22     811 non-null    int64
 23  W23     811 non-nu

In [92]:
# Finally, use .describe() to look at some statistical summaries of the data
sales.describe()

Unnamed: 0,W0,W1,W2,W3,W4,W5,W6,W7,W8,W9,...,W42,W43,W44,W45,W46,W47,W48,W49,W50,W51
count,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0,...,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0
mean,8.902589,9.12947,9.389642,9.717633,9.574599,9.466091,9.720099,9.585697,9.784217,9.681874,...,8.394575,8.318126,8.434032,8.556104,8.720099,8.670777,8.674476,8.895191,8.861899,8.889026
std,12.067163,12.564766,13.045073,13.553294,13.095765,12.823195,13.347375,13.049138,13.550237,13.137916,...,11.348777,11.250455,11.223499,11.382041,11.621684,11.43587,11.222996,10.941375,10.49271,9.558011
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
50%,3.0,3.0,3.0,4.0,4.0,3.0,4.0,4.0,4.0,4.0,...,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,5.0
75%,12.0,12.0,12.0,13.0,13.0,12.5,13.0,12.5,13.0,13.0,...,10.0,11.0,11.0,11.0,11.0,12.0,12.0,12.0,13.0,14.0
max,54.0,53.0,56.0,59.0,61.0,52.0,56.0,62.0,63.0,52.0,...,52.0,50.0,46.0,46.0,55.0,49.0,50.0,52.0,57.0,73.0


You should feel free to explore the dataset further if you want to. There are many things you can do, like to create visualizations to understand the data better. Proceed when you feel like you understand the data well.

## Part 2 - PCA for Dimensionality Reduction

The dataset has now only 52 columns left, but this is still quite a lot. We want to reduce the amount of columns to avoid the curse of dimensionality. In this section, we will use the fan favorite algorithm PCA to reduce the number of dimensions from 52 to 5.

Before using a PCA algorithm, it is good practice to scale the data. By doing this, we avoid that the PCA gives higher priority to a column based on having a different scale than the others.

In [93]:
# Initiate a StandardScaler instance
scaler_sales = StandardScaler()

In [94]:
# Scale the sales data by using the .fit_transform method
scaler_sales.fit_transform(sales)

array([[ 0.17391867,  0.22859969,  0.04681724, ..., -0.26477272,
        -0.36828257,  0.11630659],
       [-0.15776396, -0.2492208 , -0.49011498, ..., -0.72203567,
        -0.27291949, -0.93058184],
       [-0.15776396,  0.14896294, -0.10659197, ..., -0.08186755,
        -0.08219333, -0.19775994],
       ...,
       [-0.65528791, -0.72704129, -0.72022878, ..., -0.81348826,
        -0.46364565, -0.61651531],
       [-0.73820856, -0.72704129, -0.72022878, ..., -0.81348826,
        -0.65437181, -0.93058184],
       [-0.73820856, -0.64740454, -0.72022878, ..., -0.81348826,
        -0.84509797, -0.82589299]])

Now our data is scaled, and we can proceed to using PCA to reduce the number of dimensions.

In [95]:
# Initiate a PCA instance with 5 as the value for n_components
pca_sales = PCA(n_components=5)

In [96]:
# Use .fit_transform to reduce the number of dimensions
sales_dim_reduced = pd.DataFrame(pca.fit_transform(sales))

## Part 3 - Clustering with KMeans

We now have the most important parts of our data represented with only 5 columns. It's time to use a clustering algorithm to cluster the data into cluster groups. We will choose to cluster the data into 3 cluster groups.

In [105]:
# Initiate a KMeans with 3 clusters
k_model = KMeans(n_clusters=3)

In [106]:
# Fit the KMeans model to the processed data
k_model.fit(sales_dim_reduced)

In [109]:
# Get the cluster centers
sales_centers = k_model.cluster_centers_
sales_centers

array([[-5.25138861e+01, -1.38221540e+00,  1.58404158e-01,
         1.27853920e-01,  8.38471413e-02],
       [ 1.77786679e+02, -4.26962301e+00,  3.74300971e-01,
         3.41064598e-01,  1.99247607e-02],
       [ 1.87119592e+01,  6.12547613e+00, -6.29600802e-01,
        -5.32692541e-01, -2.21095277e-01]])

In [111]:
# Get the labels of the observations
sales_labels = k_model.labels_
sales_labels

array([2, 0, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2,
       0, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 2, 1, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1,
       1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 0, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 1, 2, 2, 2, 1, 1, 0, 2, 2, 1, 1, 2, 0, 0, 2, 2, 0, 2, 2,
       0, 1, 1, 2, 2, 2, 0, 2, 1, 1, 2, 2, 0, 0, 2, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 2,
       0, 0, 2, 0, 0, 2, 0, 2, 0, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 2, 2,
       0, 2, 0, 2, 0, 0, 2, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 2, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2,
       0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0,

As you can see, each observation (e.g., product) is grouped into one of three classes with the labels `0`, `1`, or `2`. 

## Part 4 - Create a Pipeline and Persist the Pipeline

We've done several steps to get the clustering of our data. It is now time to put this into a pipeline for simplicity! 

In [120]:
# Create a pipeline for the three steps (Standard Scaler, PCA, and KMeans)
pipeline = Pipeline(steps=[('sales_scaler', StandardScaler()),
                    ('sales_pca', PCA(n_components=5)),
                    ('sales_k_model', KMeans(n_clusters=3))])

We can now fit our pipeline to the data. Remember to fit the pipeline to the data after we have removed the extra normalized columns, but before scaling anything ourselves. The data used here should have 52 columns.

In [121]:
# Fit the pipeline
pipeline.fit(sales)

We can again get for example the cluster labels of the observations. Just remember to access the KMeans algorithm from the pipeline first.

In [124]:
# Get the labels
pipeline["sales_k_model"].labels_

array([2, 0, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2,
       0, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 2, 1, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1,
       1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 0, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 1, 2, 2, 2, 1, 1, 0, 2, 2, 1, 1, 2, 0, 0, 2, 2, 0, 2, 2,
       0, 1, 1, 2, 2, 2, 0, 2, 1, 1, 2, 2, 0, 0, 2, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 2,
       0, 0, 2, 0, 0, 2, 0, 2, 0, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 2, 2,
       0, 2, 0, 2, 0, 0, 2, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 2, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2,
       0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0,

Finally, we can persist the whole pipeline to a file by using joblib. This is super convenient as the pipeline bundles up most of what is being done to the data.

In [125]:
# Persist the pipeline as a joblib model
joblib.dump(pipeline, "sales_pipeline.joblib")

['sales_pipeline.joblib']

When a new observation arrive (e.g., a new product that has been on the market for 52 weeks), then you can load the pipeline and use the `.predict()` method to predict the label of the new observation. Which cluster this belongs to can help you to understand how to market the product, or which products should be given as <i>related products</i> on a website.