<a href="https://colab.research.google.com/github/kalc1/CIT-99-Machine-Learning/blob/main/Unsupervised_Learning_KevinAlcocer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsupervised Project

In this project, you will work with a dataset of weekly sales quantities for different products. Each row is a product, and each column gives the sales in a given week. You will apply what you have learned about preprocessing, principal component analysis, KMeans clustering, pipelines, and model persistence.

## Part 0 - Importing the Dataset

The cells below mount Google Drive, import the libraries you need, and load the dataset. Run them without modifying them, and then you can proceed.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Imports needed in this exercise set
import numpy as np
import pandas as pd
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

# Load the dataset from Google Drive (adjust the path to wherever you saved the CSV)
sales = pd.read_csv("/content/drive/MyDrive/FCC/Machine Learning/Week 10 CIT 99/Sales_Transactions_Dataset_Weekly.csv")

## Part 1 - Exploring the Dataset

Let's start as usual with exploring the dataset.

In [None]:
# Check out the first 5 rows of the dataset
sales.head()

,Product_Code,W0,W1,W2,W3,W4,W5,W6,W7,W8,...,Normalized 42,Normalized 43,Normalized 44,Normalized 45,Normalized 46,Normalized 47,Normalized 48,Normalized 49,Normalized 50,Normalized 51
0,P1,11,12,10,8,13,12,14,21,6,...,0.06,0.22,0.28,0.39,0.5,0.0,0.22,0.17,0.11,0.39
1,P2,7,6,3,2,7,1,6,3,3,...,0.2,0.4,0.5,0.1,0.1,0.4,0.5,0.1,0.6,0.0
2,P3,7,11,8,9,10,8,7,13,12,...,0.27,1.0,0.18,0.18,0.36,0.45,1.0,0.45,0.45,0.36
3,P4,12,8,13,5,9,6,9,13,13,...,0.41,0.47,0.06,0.12,0.24,0.35,0.71,0.35,0.29,0.35
4,P5,8,5,13,11,6,7,9,14,9,...,0.27,0.53,0.27,0.6,0.2,0.2,0.13,0.53,0.33,0.4


Not surprisingly, the names `W0`, `W1`, ..., `W51` represent the 52 weeks of the year, while `P1`, `P2`, ... represent the different products.

Here we can see that the column `Product_Code` should really be the index, as it uniquely identifies each observation and carries no further information about it.

In [None]:
# Set the index to be the Product_Code column and remove the Product_Code column afterward
sales.set_index("Product_Code", inplace=True)
sales.head()

Product_Code,W0,W1,W2,W3,W4,W5,W6,W7,W8,W9,...,Normalized 42,Normalized 43,Normalized 44,Normalized 45,Normalized 46,Normalized 47,Normalized 48,Normalized 49,Normalized 50,Normalized 51
P1,11,12,10,8,13,12,14,21,6,14,...,0.06,0.22,0.28,0.39,0.5,0.0,0.22,0.17,0.11,0.39
P2,7,6,3,2,7,1,6,3,3,3,...,0.2,0.4,0.5,0.1,0.1,0.4,0.5,0.1,0.6,0.0
P3,7,11,8,9,10,8,7,13,12,6,...,0.27,1.0,0.18,0.18,0.36,0.45,1.0,0.45,0.45,0.36
P4,12,8,13,5,9,6,9,13,13,11,...,0.41,0.47,0.06,0.12,0.24,0.35,0.71,0.35,0.29,0.35
P5,8,5,13,11,6,7,9,14,9,9,...,0.27,0.53,0.27,0.6,0.2,0.2,0.13,0.53,0.33,0.4


There are many columns: 106 now that `Product_Code` is the index. The later columns repeat the weekly sales in normalized form.

In [None]:
# Check out the columns with sales.columns
print(list(sales.columns))

['W0', 'W1', 'W2', 'W3', 'W4', 'W5', 'W6', 'W7', 'W8', 'W9', 'W10', 'W11', 'W12', 'W13', 'W14', 'W15', 'W16', 'W17', 'W18', 'W19', 'W20', 'W21', 'W22', 'W23', 'W24', 'W25', 'W26', 'W27', 'W28', 'W29', 'W30', 'W31', 'W32', 'W33', 'W34', 'W35', 'W36', 'W37', 'W38', 'W39', 'W40', 'W41', 'W42', 'W43', 'W44', 'W45', 'W46', 'W47', 'W48', 'W49', 'W50', 'W51', 'MIN', 'MAX', 'Normalized 0', 'Normalized 1', 'Normalized 2', 'Normalized 3', 'Normalized 4', 'Normalized 5', 'Normalized 6', 'Normalized 7', 'Normalized 8', 'Normalized 9', 'Normalized 10', 'Normalized 11', 'Normalized 12', 'Normalized 13', 'Normalized 14', 'Normalized 15', 'Normalized 16', 'Normalized 17', 'Normalized 18', 'Normalized 19', 'Normalized 20', 'Normalized 21', 'Normalized 22', 'Normalized 23', 'Normalized 24', 'Normalized 25', 'Normalized 26', 'Normalized 27', 'Normalized 28', 'Normalized 29', 'Normalized 30', 'Normalized 31', 'Normalized 32', 'Normalized 33', 'Normalized 34', 'Normalized 35', 'Normalized 36', 'Normalized 

As you can see, there is a `MIN` column, a `MAX` column, and 52 columns of normalized values. Remove all of these, as we only need the 52 columns representing weekly sales.

In [None]:
# Keep only the unnormalized columns
sales = sales.iloc[:, 0:52]
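The positional slice above relies on the weekly columns coming first. As a minimal sketch of an alternative, assuming the weekly columns are always named `W` followed by digits, the same selection can be made by name. The small `demo` frame below is a hypothetical stand-in for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 3 weekly columns instead of 52.
demo = pd.DataFrame(
    {
        "W0": [11, 7],
        "W1": [12, 6],
        "W2": [10, 3],
        "MIN": [10, 3],
        "MAX": [12, 7],
        "Normalized 0": [0.5, 1.0],
        "Normalized 1": [1.0, 0.75],
        "Normalized 2": [0.0, 0.0],
    },
    index=pd.Index(["P1", "P2"], name="Product_Code"),
)

# Keep only columns whose name is "W" followed by digits (W0, W1, ...).
weekly = demo.filter(regex=r"^W\d+$")
print(list(weekly.columns))
```

Selecting by name keeps working even if the column order in the file changes, at the cost of depending on the naming scheme.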

In [None]:
# Show the first 5 rows again to make sure that everything is as you expect
sales.head()

Product_Code,W0,W1,W2,W3,W4,W5,W6,W7,W8,W9,...,W42,W43,W44,W45,W46,W47,W48,W49,W50,W51
P1,11,12,10,8,13,12,14,21,6,14,...,4,7,8,10,12,3,7,6,5,10
P2,7,6,3,2,7,1,6,3,3,3,...,2,4,5,1,1,4,5,1,6,0
P3,7,11,8,9,10,8,7,13,12,6,...,6,14,5,5,7,8,14,8,8,7
P4,12,8,13,5,9,6,9,13,13,11,...,9,10,3,4,6,8,14,8,7,8
P5,8,5,13,11,6,7,9,14,9,9,...,7,11,7,12,6,6,5,11,8,9


In [None]:
# Make sure that none of the columns have missing values
sales.info()

<class 'pandas.core.frame.DataFrame'>
Index: 811 entries, P1 to P819
Data columns (total 52 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   W0      811 non-null    int64
 1   W1      811 non-null    int64
 2   W2      811 non-null    int64
 3   W3      811 non-null    int64
 4   W4      811 non-null    int64
 5   W5      811 non-null    int64
 6   W6      811 non-null    int64
 7   W7      811 non-null    int64
 8   W8      811 non-null    int64
 9   W9      811 non-null    int64
 10  W10     811 non-null    int64
 11  W11     811 non-null    int64
 12  W12     811 non-null    int64
 13  W13     811 non-null    int64
 14  W14     811 non-null    int64
 15  W15     811 non-null    int64
 16  W16     811 non-null    int64
 17  W17     811 non-null    int64
 18  W18     811 non-null    int64
 19  W19     811 non-null    int64
 20  W20     811 non-null    int64
 21  W21     811 non-null    int64
 22  W22     811 non-null    int64
 23  W23     811 non-nu

In [None]:
# Finally, use .describe() to look at some statistical summaries of the data
sales.describe()

,W0,W1,W2,W3,W4,W5,W6,W7,W8,W9,...,W42,W43,W44,W45,W46,W47,W48,W49,W50,W51
count,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0,...,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0,811.0
mean,8.902589,9.12947,9.389642,9.717633,9.574599,9.466091,9.720099,9.585697,9.784217,9.681874,...,8.394575,8.318126,8.434032,8.556104,8.720099,8.670777,8.674476,8.895191,8.861899,8.889026
std,12.067163,12.564766,13.045073,13.553294,13.095765,12.823195,13.347375,13.049138,13.550237,13.137916,...,11.348777,11.250455,11.223499,11.382041,11.621684,11.43587,11.222996,10.941375,10.49271,9.558011
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
50%,3.0,3.0,3.0,4.0,4.0,3.0,4.0,4.0,4.0,4.0,...,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,5.0
75%,12.0,12.0,12.0,13.0,13.0,12.5,13.0,12.5,13.0,13.0,...,10.0,11.0,11.0,11.0,11.0,12.0,12.0,12.0,13.0,14.0
max,54.0,53.0,56.0,59.0,61.0,52.0,56.0,62.0,63.0,52.0,...,52.0,50.0,46.0,46.0,55.0,49.0,50.0,52.0,57.0,73.0


Feel free to explore the dataset further. There are many things you can do, such as creating visualizations to understand the data better. Proceed when you feel you understand the data well.
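As one possible starting point for further exploration, the sketch below counts missing values, ranks products by total sales, and measures how often products sell nothing in a week (the summary above shows a 25% quantile of 0 for many weeks). The `demo` frame is a hypothetical stand-in generated at random; on the real data you would use `sales` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 6 products, 4 weeks of sales counts.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.poisson(lam=9, size=(6, 4)),
    columns=[f"W{i}" for i in range(4)],
    index=pd.Index([f"P{i}" for i in range(1, 7)], name="Product_Code"),
)

# Total number of missing values across the whole table.
n_missing = int(demo.isna().sum().sum())

# Total sales per product over all weeks, best sellers first.
totals = demo.sum(axis=1).sort_values(ascending=False)

# Share of products that sold nothing in each week.
zero_share = (demo == 0).mean()

print(n_missing)
print(totals.head(3))
print(zero_share)
```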

## Part 2 - PCA for Dimensionality Reduction

The dataset now has only 52 columns left, but this is still quite a lot. We want to reduce the number of columns to avoid the curse of dimensionality. In this section, we will use the fan-favorite algorithm PCA to reduce the number of dimensions from 52 to 5.

Before using PCA, it is good practice to scale the data. PCA looks for directions of maximum variance, so without scaling, a column with a larger scale than the others would dominate the components.

In [None]:
# Initiate a StandardScaler instance
scaler = StandardScaler()

In [None]:
# Scale the sales data by using the .fit_transform method
sales_scaled = scaler.fit_transform(sales)
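To illustrate why scaling matters for PCA, here is a minimal sketch on a hypothetical two-column array (not the sales data) whose columns differ in scale by roughly a factor of 100. It checks that `StandardScaler` gives each column mean 0 and standard deviation 1, and compares the first principal axis before and after scaling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two positively correlated columns on very different scales.
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 700.0]])

X_scaled = StandardScaler().fit_transform(X)

# After scaling, each column has mean 0 and (population) standard deviation 1.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

# Absolute weights of the first principal axis; the sign of an axis is arbitrary.
raw_axis = np.abs(PCA(n_components=1).fit(X).components_[0])
scaled_axis = np.abs(PCA(n_components=1).fit(X_scaled).components_[0])

print(col_means, col_stds)
print(raw_axis)     # dominated by the large-scale second column
print(scaled_axis)  # both columns weighted equally
```

On the unscaled data the first axis points almost entirely along the second column, simply because its numbers are bigger.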

Now our data is scaled, and we can proceed to using PCA to reduce the number of dimensions.

In [None]:
# Initiate a PCA instance with 5 as the value for n_components
pca = PCA(n_components=5)

In [None]:
# Use .fit_transform to reduce the number of dimensions
sales_pca = pca.fit_transform(sales_scaled)
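A fitted PCA reports how much of the variance each component keeps through its `explained_variance_ratio_` attribute, which is one way to judge whether 5 components are enough. The sketch below uses randomly generated, hypothetical data with 52 correlated columns in place of `sales_scaled`, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in: 200 rows, 52 correlated columns driven by one hidden factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
X = factor @ rng.normal(size=(1, 52)) + 0.3 * rng.normal(size=(200, 52))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each of the 5 components.
ratios = pca.explained_variance_ratio_

print(X_pca.shape)
print(ratios)
print(ratios.sum())
```

On the notebook's own data, the equivalent check would be `pca.explained_variance_ratio_.sum()` after the cell above.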

## Part 3 - Clustering with KMeans

We now have the main directions of variation in our data represented with only 5 columns. It's time to use a clustering algorithm to group the products. We will cluster the data into 3 groups.

In [None]:
# Initiate a KMeans with 3 clusters
kmeans = KMeans(n_clusters=3)

In [None]:
# Fit the KMeans model to the processed data
sales_kmeans = kmeans.fit(sales_pca)
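The notebook fixes the number of clusters at 3. One common way to sanity-check such a choice, sketched here under the assumption that inertia and silhouette score are acceptable criteria, is to fit KMeans for several values of `k` and compare. The data comes from `make_blobs` as a hypothetical stand-in for `sales_pca`; `n_init` and `random_state` are set explicitly only to make the sketch repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for sales_pca: 300 points in 5 dimensions around 3 centers.
X, _ = make_blobs(n_samples=300, n_features=5, centers=3, cluster_std=1.0, random_state=0)

results = {}
for k in range(2, 7):
    # Fixed n_init and random_state so repeated runs give the same result.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = (model.inertia_, silhouette_score(X, model.labels_))

# Lower inertia and higher silhouette both suggest tighter clusters.
for k, (inertia, silhouette) in results.items():
    print(k, round(inertia, 1), round(silhouette, 3))
```

Inertia always tends to fall as `k` grows, so it is usually read by looking for a bend in the curve rather than for a minimum.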



In [None]:
# Get the cluster centers
cluster_centers = sales_kmeans.cluster_centers_
cluster_centers

array([[-4.38393244e+00, -1.35227937e-01, -1.18555140e-02,
        -3.74544257e-03,  1.71787735e-03],
       [ 1.48109967e+01, -4.17365071e-01, -3.33757388e-02,
        -1.77454704e-02, -1.43982111e-02],
       [ 1.58153963e+00,  5.99060701e-01,  5.04964136e-02,
         2.04858132e-02,  4.78994048e-03]])

In [None]:
# Get the labels of the observations
labels = kmeans.labels_
labels

array([2, 0, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2,
       0, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 2, 1, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1,
       1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 0, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 1, 2, 2, 2, 1, 1, 0, 2, 2, 1, 1, 2, 0, 0, 2, 2, 0, 2, 2,
       0, 1, 1, 2, 2, 2, 0, 2, 1, 1, 2, 2, 0, 0, 2, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 2,
       0, 0, 2, 0, 0, 2, 0, 2, 0, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 2, 2,
       0, 2, 0, 2, 0, 0, 2, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 2, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2,
       0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0,

As you can see, each observation (i.e., each product) is assigned to one of three clusters, labeled `0`, `1`, or `2`.
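A bare label array is hard to read. The sketch below attaches labels to product codes and counts the products in each cluster. The `labels` and `product_codes` values are short hypothetical stand-ins for `kmeans.labels_` and `sales.index`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for kmeans.labels_ and sales.index.
labels = np.array([2, 0, 2, 2, 2, 0, 0, 1])
product_codes = pd.Index([f"P{i}" for i in range(1, 9)], name="Product_Code")

# Attach each label to its product code.
clusters = pd.Series(labels, index=product_codes, name="cluster")

# Number of products in each cluster, ordered by cluster label.
sizes = clusters.value_counts().sort_index()

print(clusters.head())
print(sizes)
```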

## Part 4 - Create a Pipeline and Persist the Pipeline

We've done several steps to get the clustering of our data. It is now time to put this into a pipeline for simplicity!

In [None]:
# Create a pipeline for the three steps (Standard Scaler, PCA, and KMeans)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=5)),
    ('kmeans', KMeans(n_clusters=3))
])

We can now fit our pipeline to the data. Fit it to the data as it stands after removing the extra columns but before any manual scaling, since the pipeline does its own scaling. The data used here should have 52 columns.

In [None]:
# Fit the pipeline
pipeline.fit(sales)
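Once fitted, the pipeline can also return intermediate results. As a sketch on randomly generated, hypothetical data in place of `sales`, slicing off the last step gives the scaled and PCA-reduced coordinates, while `predict` runs all three steps in one call.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the 52-column sales table.
rng = np.random.default_rng(0)
X = rng.poisson(lam=9, size=(120, 52)).astype(float)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])
pipeline.fit(X)

# Slicing off the last step leaves a scaler + PCA pipeline that only transforms.
X_reduced = pipeline[:-1].transform(X)

# predict() runs scaling, PCA, and the nearest-center lookup in one call.
labels = pipeline.predict(X)

print(X_reduced.shape)
print(np.array_equal(labels, pipeline["kmeans"].labels_))
```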



As before, we can get the cluster labels of the observations. Just remember to access the KMeans step of the pipeline first. The label numbers need not match those from Part 3, because KMeans numbers its clusters arbitrarily.

In [None]:
# Get the labels
pipeline['kmeans'].labels_

array([0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
       2, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 2, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 0, 0, 1, 1, 2, 0, 0, 1, 1, 0, 2, 2, 0, 0, 2, 0, 0,
       2, 1, 1, 0, 0, 0, 2, 0, 1, 1, 0, 0, 2, 2, 0, 2, 2, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 0, 2, 2, 0, 0, 0,
       2, 2, 0, 2, 2, 0, 2, 0, 2, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 0,
       2, 0, 2, 0, 2, 2, 0, 2, 2, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 0, 0,
       2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2,
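The pipeline output uses different label numbers than the Part 3 output, yet the grouping looks the same. One way to check this, sketched here with scikit-learn's `adjusted_rand_score`, is to compare the two labelings in a way that ignores the label names. The arrays hold only the first 16 labels copied from each output, so this is an illustration rather than a full comparison.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# First 16 labels from the Part 3 output and from the pipeline output.
manual = np.array([2, 0, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 2, 1, 1])
piped = np.array([0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 0, 1, 1])

# The raw label numbers disagree for most products...
same_numbers = np.array_equal(manual, piped)

# ...but the adjusted Rand index compares groupings, not label names.
ari = adjusted_rand_score(manual, piped)

print(same_numbers)
print(ari)
```

A score of 1 means the two labelings describe the same partition; on the full data you would pass `kmeans.labels_` and `pipeline['kmeans'].labels_`.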

Finally, we can persist the whole pipeline to a file by using joblib. This is super convenient, as the pipeline bundles the scaling, PCA, and clustering steps into a single object.

In [None]:
# Persist the pipeline as a joblib model
joblib.dump(pipeline, 'sales_pipeline.joblib')

['sales_pipeline.joblib']

When a new observation arrives (e.g., a new product that has been on the market for 52 weeks), you can load the pipeline and use the `.predict()` method to predict its label. Knowing which cluster the product belongs to can help you understand how to market it, or which products to show as <i>related products</i> on a website.
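The load-and-predict step described above can be sketched as follows. To stay self-contained, the example fits a pipeline on randomly generated, hypothetical data, saves it to a temporary directory, and reloads it. In the notebook you would call `joblib.load('sales_pipeline.joblib')` directly.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the fitted pipeline from Part 4.
rng = np.random.default_rng(0)
X = rng.poisson(lam=9, size=(120, 52)).astype(float)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
]).fit(X)

# Save and reload the pipeline, using a temporary directory for the file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sales_pipeline.joblib")
    joblib.dump(pipeline, path)
    loaded = joblib.load(path)

# One new product with 52 weekly sales figures; predict() expects a 2-D input.
new_product = rng.poisson(lam=9, size=(1, 52)).astype(float)
label = loaded.predict(new_product)

print(label)
```

Because the notebook's pipeline was fitted on a DataFrame, passing new data as a DataFrame with the same `W0` to `W51` column names avoids a feature-name warning. Only load joblib files from sources you trust, ideally with the same scikit-learn version that saved them.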