# Clustering of SKUs

One of your warehouses currently stores over 1,200 SKUs for your company. As the newly hired manager of the warehouse, you want to improve the operation of picking items for delivery. You have decided to form groups of similar SKUs so they can be stored together in the warehouse.

One of your data analysts has pulled together a dataset containing SKU data at the pallet level. There are four features in total:

1) unit price
2) number of units per pallet
3) pallet gross weight
4) pallet height

Your goal is to use these features to determine which SKUs should be grouped together.

In [None]:
# import packages and modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Going to use KMeans to cluster
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_score

In [None]:
# Read the data into a DataFrame named df
df = pd.read_csv('./data/sku_data.csv')

In [None]:
# take a look at it
df

In [None]:
# Look at summary statistics
df.describe()

In [None]:
# Look at .info()
df.info()

In [None]:
# Create a pairplot with seaborn
sns.pairplot(df)

In [None]:
# Look at correlations among features
corr = df.corr()
corr

In [None]:
# Create a heatmap of the correlation matrix
sns.heatmap(corr, vmin=-1, vmax=1, annot=True, cmap='Blues')

In [None]:
# Scale the data
scaler = MinMaxScaler()
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)
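
As a rough illustration of what the scaler does, the sketch below applies min-max scaling by hand, `(x - min) / (max - min)` per column, and compares the result with `MinMaxScaler`. The toy DataFrame and its column names are assumptions made up for this example; they are not taken from the SKU dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data standing in for two pallet-level features
toy = pd.DataFrame({
    "unit_price": [2.0, 4.0, 10.0],
    "pallet_height": [40.0, 50.0, 60.0],
})

# Min-max scaling by hand: (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation via scikit-learn
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(manual)
print(np.allclose(manual, scaled))
```

After scaling, every feature runs from 0 to 1, so a feature with large raw values should no longer dominate the distance calculations that KMeans relies on.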

In [None]:
# See what scaled data looks like
scaled_df

In [None]:
# End Goal: Create box plots of the scaled data

# Create empty list to hold the DataFrames
dfs = []

# Loop over the columns
for i in scaled_df.columns:
    # Create a temporary DataFrame
    tmp = pd.DataFrame(scaled_df[i])
    # name the column 'values'
    tmp.rename(columns={i: 'values'}, inplace=True)
    # add a column named 'Feature' that contains the feature name
    tmp['Feature'] = i
    # append tmp DataFrame to list dfs
    dfs.append(tmp)

# Combine all DataFrames found in the list dfs
data = pd.concat(dfs)

# look at the DataFrame data
data
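
The loop above reshapes the data from wide to long format. As a hedged alternative, `DataFrame.melt` can do the same reshaping in one call. The sketch below uses a small made-up frame (the column names are assumptions) and checks that both approaches give the same feature/value pairs.

```python
import pandas as pd

# Hypothetical stand-in for scaled_df (column names are assumptions)
scaled_toy = pd.DataFrame({
    "unit_price": [0.0, 0.25, 1.0],
    "pallet_height": [0.0, 0.5, 1.0],
})

# Long format in one call: one row per (feature, value) pair
long_df = scaled_toy.melt(var_name="Feature", value_name="values")

# The loop-and-concat approach, for comparison
parts = []
for col in scaled_toy.columns:
    tmp = scaled_toy[[col]].rename(columns={col: "values"}).assign(Feature=col)
    parts.append(tmp)
looped = pd.concat(parts)

# Same content, in the same order
print(long_df["Feature"].tolist() == looped["Feature"].tolist())
print(long_df["values"].tolist() == looped["values"].tolist())
```

One difference worth knowing: `melt` returns a fresh index, whereas `pd.concat` keeps each piece's original index labels.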

In [None]:
# Look at its info()
data.info()

In [None]:
# Take a look at the last index
data.index[-1]

In [None]:
# Create boxplot using our scaled data
sns.boxplot(x="Feature", y="values", data=data)

In [None]:
# Reset the index
data.reset_index(inplace=True)

In [None]:
# See .info() from resetting index
# What do you notice?
data.info()

In [None]:
# Create KDE plots of the scaled data to help with distributional analysis
sns.displot(data, x="values", hue="Feature", kind="kde")

## Time to Create Clusters

We'll try 2 through 9 clusters, capturing both the inertia and the silhouette score for each.
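
To make the two metrics concrete, here is a minimal sketch on synthetic data (an assumption, not the SKU dataset). Inertia is the sum of squared distances from each point to its assigned cluster center, so it can be recomputed by hand and compared with `kmeans.inertia_`. The silhouette score lies between -1 and 1, with higher values indicating tighter, better-separated clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only: three blobs in two dimensions
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Inertia by hand: squared distance from each point to its assigned centroid
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

# Mean silhouette coefficient over all points
sil = silhouette_score(X, km.labels_)

print(np.isclose(manual_inertia, km.inertia_))
print(sil)
```

Inertia always shrinks as the number of clusters grows, which is why it is usually read for an "elbow" rather than a minimum; the silhouette score has no such built-in trend.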

In [None]:
# Create 2 empty dictionaries to capture metrics
inertia = {}
ss = {}

# Try 2 thru 9 clusters
for k in range(2,10):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_df)
    # capture the inertia
    inertia[k] = kmeans.inertia_
    
    # capture the silhouette score
    ss[k] = silhouette_score(scaled_df, kmeans.labels_)

In [None]:
# Look at inertia
inertia

In [None]:
# Plot the results
# Create DataFrame for easier plotting
df_inertia = pd.DataFrame.from_dict(inertia, orient='index', columns=['inertia'])
df_inertia

In [None]:
# Plot inertia as line plot
df_inertia.plot(marker='o')

In [None]:
# Look at silhouette scores
ss

In [None]:
# Create a DataFrame to make plotting easier
df_ss = pd.DataFrame.from_dict(ss, orient='index', columns=['silhouette_score'])
df_ss

In [None]:
# Plot silhouette score as line plot
df_ss.plot(marker='o')
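
One common heuristic for reading these results is to pick the number of clusters with the highest silhouette score and then sanity-check it against the elbow in the inertia plot. The sketch below shows that heuristic on synthetic data; it is an illustration only and not necessarily the rule behind any particular choice of k in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Silhouette score for each candidate number of clusters
ss = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    ss[k] = silhouette_score(X, labels)

# Candidate k with the highest silhouette score
best_k = max(ss, key=ss.get)
print(best_k, round(ss[best_k], 3))
```

In practice the statistically "best" k is only a starting point; a warehouse may prefer a different number of groups for operational reasons such as the number of available storage zones.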

In [None]:
# Make 5 clusters (same random_state as the loop above, so the labels are reproducible)
five_clusters = KMeans(n_clusters=5, random_state=42)
five_clusters.fit_predict(scaled_df)

In [None]:
# Add the column Clusters which is the labels
df['Clusters'] = five_clusters.labels_ 

# Create a pairplot
sns.pairplot(df, hue='Clusters',
             palette=sns.color_palette(n_colors=5),
             markers=["o", "s", "D", "v", "^"])
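
Beyond the pairplot, a per-cluster summary can help describe what each group looks like. The sketch below counts the members of each cluster and averages the features per cluster. The toy frame, its column names, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame; column names and values are made up
toy = pd.DataFrame({
    "unit_price": [1.0, 1.2, 9.5, 10.0, 5.0],
    "pallet_height": [40, 42, 60, 58, 50],
    "Clusters": [0, 0, 1, 1, 2],
})

# Number of SKUs in each cluster
sizes = toy["Clusters"].value_counts().sort_index()

# Average feature values per cluster, in the original (unscaled) units
profile = toy.groupby("Clusters").mean()

print(sizes)
print(profile)
```

Profiling on the unscaled values keeps the averages in units that warehouse staff can interpret directly.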

**&copy; 2023 - Present: Matthew D. Dean, Ph.D.   
Clinical Associate Professor of Business Analytics at William \& Mary.**