# Unsupervised Project

In this project, you will work with a dataset of weekly sales quantities for different products. Each row represents a product, and each weekly column gives the quantity sold in that week. You will apply what you have learned about preprocessing, principal component analysis (PCA), k-means clustering, pipelines, and model persistence.

## Part 0 - Importing the Dataset

The cell below imports the libraries you need and loads the dataset. Run it without modifying it, and then proceed.

In [None]:
# Imports needed in this exercise set
import numpy as np
import pandas as pd
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

# Save the dataset in the same folder as this notebook
sales = pd.read_csv("Sales_Transactions_Dataset_Weekly.csv")

## Part 1 - Exploring the Dataset

Let's start as usual with exploring the dataset.

In [None]:
# Check out the first 5 rows of the dataset


Not surprisingly, the names `W0`, `W1`, ..., `W51` represent the 52 weeks of the year, while `P1`, `P2`, ... represent the different products.

Here we can see that the column `Product_Code` should really be the index, as it uniquely identifies each observation and carries no further information about it.
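
As a minimal sketch of what moving an identifier column into the index looks like, the example below uses a small made-up frame (the name `toy` and its values are illustrative assumptions, not the real dataset). It only shows the behavior of `set_index`; applying it to `sales` is left to the cell that follows.

```python
import pandas as pd

# Hypothetical toy frame; the real dataset has many more rows and columns
toy = pd.DataFrame({
    "Product_Code": ["P1", "P2", "P3"],
    "W0": [11, 7, 7],
    "W1": [12, 6, 11],
})

# set_index moves the column into the index and removes it from the columns
toy_indexed = toy.set_index("Product_Code")

print(toy_indexed.columns.tolist())  # ['W0', 'W1']
print(toy_indexed.loc["P2", "W1"])   # 6
```

Note that `set_index` returns a new frame by default, so the result has to be assigned to a variable to be kept.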

In [None]:
# Set the index to be the Product_Code column and remove the Product_Code column afterward


There are many columns (107). The later columns repeat the weekly sales columns in normalized form.

In [None]:
# Check out the columns with sales.columns


As you can see, there is a `MIN` column, a `MAX` column, and 52 columns giving normalized information. Remove all of these, as we only need the 52 columns representing weekly sales.
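
One possible way to select columns by name pattern is sketched below on a made-up frame. The frame `toy` and the exact spelling of the normalized column names are assumptions for illustration; check `sales.columns` for the real names before adapting it.

```python
import pandas as pd

# Hypothetical toy frame; the non-weekly column names here are assumptions
toy = pd.DataFrame(
    {
        "W0": [11, 7],
        "W1": [12, 6],
        "MIN": [11, 6],
        "MAX": [12, 7],
        "Normalized 0": [0.0, 1.0],
        "Normalized 1": [1.0, 0.0],
    },
    index=["P1", "P2"],
)

# Keep only columns whose name is "W" followed by digits and nothing else
weekly = toy.filter(regex=r"^W\d+$")

print(weekly.columns.tolist())  # ['W0', 'W1']
```

Selecting the columns to keep, rather than listing the ones to drop, avoids having to spell out every unwanted column name.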

In [None]:
# Keep only the unnormalized columns


In [None]:
# Show the first 5 rows again to make sure that everything is as you expect


In [None]:
# Make sure that none of the columns have missing values


In [None]:
# Finally, use .describe() to look at some statistical summaries of the data


Feel free to explore the dataset further if you want to. There are many things you can do, such as creating visualizations to understand the data better. Proceed when you feel that you understand the data well.

## Part 2 - PCA for Dimensionality Reduction

The dataset now has only 52 columns left, but this is still quite a lot. We want to reduce the number of columns to mitigate the curse of dimensionality. In this section, we will use the popular PCA algorithm to reduce the number of dimensions from 52 to 5.

Before using PCA, it is good practice to scale the data. PCA looks for directions of maximal variance, so without scaling, a column measured on a larger scale than the others would dominate the components simply because of its units.
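
The effect of standardization can be seen on a tiny made-up array (the values below are illustrative assumptions). This is a sketch of what `StandardScaler` does to each column, not the solution for the sales data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the second column is on a much larger scale
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean 0 and (population) standard deviation 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```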

In [None]:
# Instantiate a StandardScaler


In [None]:
# Scale the sales data by using the .fit_transform method


Now our data is scaled, and we can proceed to using PCA to reduce the number of dimensions.
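
As a rough illustration of the PCA interface, the sketch below reduces random made-up data from 6 to 2 dimensions (the sizes and the random data are assumptions chosen to keep the example small). The `explained_variance_ratio_` attribute shows how much of the total variance each kept component captures.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))  # hypothetical: 30 observations, 6 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # (30, 2)
# Fraction of the total variance captured by each kept component
print(pca.explained_variance_ratio_)
```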

In [None]:
# Instantiate a PCA with 5 as the value for n_components


In [None]:
# Use .fit_transform to reduce the number of dimensions


## Part 3 - Clustering with KMeans

We now have the main directions of variation in our data represented with only 5 columns. It's time to use a clustering algorithm to group the observations. We will cluster the data into 3 groups.
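
To get a feel for the `KMeans` interface before applying it to the reduced sales data, here is a small sketch on three artificial, well-separated blobs (the blob centers, sizes, and the `n_init` and `random_state` values are assumptions made for reproducibility).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated blobs of 20 points each in 2D
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_.shape)  # (3, 2): one center per cluster
print(kmeans.labels_.shape)           # (60,): one label per observation
```

The numeric labels are arbitrary: which blob ends up as cluster `0`, `1`, or `2` depends on the initialization.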

In [None]:
# Instantiate a KMeans with 3 clusters


In [None]:
# Fit the KMeans model to the processed data


In [None]:
# Get the cluster centers


In [None]:
# Get the labels of the observations


As you can see, each observation (i.e., product) is assigned to one of three clusters with the labels `0`, `1`, or `2`.

## Part 4 - Create a Pipeline and Persist the Pipeline

We've done several steps to get the clustering of our data. It is now time to put this into a pipeline for simplicity! 

In [None]:
# Create a pipeline for the three steps (Standard Scaler, PCA, and KMeans)


We can now fit our pipeline to the data. Remember to use the data as it was after removing the extra columns but before any manual scaling, since the pipeline does the scaling itself. The data used here should have 52 columns.
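
A minimal sketch of chaining the three steps is shown below on random made-up data. The step names `"scaler"`, `"pca"`, and `"kmeans"`, the data shape, and the reduced number of components are illustrative assumptions; for the sales data you would use the settings from the earlier parts.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))  # hypothetical stand-in for the weekly columns

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

# One call fits the scaler, then PCA on the scaled data,
# then KMeans on the PCA output
pipe.fit(X)

# predict pushes unscaled rows through all three fitted steps
print(pipe.predict(X[:5]))
```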

In [None]:
# Fit the pipeline


As before, we can retrieve, for example, the cluster labels of the observations. Just remember to access the KMeans step of the pipeline first.
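
Fitted steps can be reached by name through the pipeline's `named_steps` attribute. The sketch below assumes a pipeline whose last step was named `"kmeans"` (a hypothetical name chosen here) and uses random made-up data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(25, 4))  # hypothetical toy data

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])
pipe.fit(X)

# Look up the fitted KMeans step by the name it was given in the pipeline
fitted_kmeans = pipe.named_steps["kmeans"]
labels = fitted_kmeans.labels_

print(labels.shape)  # (25,): one label per training observation
```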

In [None]:
# Get the labels


Finally, we can persist the whole pipeline to a file by using joblib. This is super convenient as the pipeline bundles up most of what is being done to the data.

In [None]:
# Persist the pipeline as a joblib model


When a new observation arrives (e.g., a new product that has been on the market for 52 weeks), you can load the pipeline and use the `.predict()` method to predict its cluster label. Knowing which cluster it belongs to can help you understand how to market the product, or which products should be shown as <i>related products</i> on a website.
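
The save, load, and predict round trip can be sketched as follows. The data is random and made up, the file name `toy_pipeline.joblib` is hypothetical, and a temporary directory is used only so the example leaves no file behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))  # hypothetical training data

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])
pipe.fit(X)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)      # write the fitted pipeline to disk
    loaded = joblib.load(path)   # read it back, e.g. in a later session

# A new observation must have the same number of columns as the training data
new_obs = rng.normal(size=(1, 4))
print(loaded.predict(new_obs))
```

Only load joblib files from sources you trust, and ideally with the same scikit-learn version that was used to save them.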