# eCommerce Item Clustering with Python

In this notebook I attempt to cluster eCommerce items by their names. The data is from an outdoor apparel brand's catalog. I want to use the item names to find similar items and group them together. For example, a t-shirt should end up in the t-shirt group.

The steps to accomplish this goal will be:
1. Clean the data to keep only the name (pandas)
2. Transform the corpus into a vector space using TF-IDF (scikit-learn)
3. Calculate the cosine distance between each pair of documents as a measure of similarity (scikit-learn)
4. Run hierarchical clustering and plot a dendrogram (SciPy)
5. Cluster the documents with k-means (scikit-learn)
6. Use MDS to reduce the dimensionality (scikit-learn)
7. Plot the clusters (matplotlib)


The dataset consists of 500 actual SKUs from an outdoor apparel brand's product catalog downloaded from Kaggle (https://www.kaggle.com/cclark/product-item-data). 


I used http://brandonrose.org/clustering as a reference for this project. He has a lot of interesting projects with great explanations on his blog.

## Cleaning Data
Import the packages needed

In [4]:
import os
import pandas as pd
import re
import numpy as np

Read the data.

In [5]:
df = pd.read_csv('sample-data.csv')

A quick look at the data. There are 2 columns, id and description.

In [3]:
df.head()

Let's take a closer look at what the description contains. It starts with the name, continues with a long description, and ends with material details. I am only interested in the name for this project, so I will separate it out.

In [4]:
print(df['description'][5])

This function splits the description returning only the name.

In [5]:
def split_description(string):
    # name
    string_split = string.split(' - ',1)
    name = string_split[0]
    
    return name

Let's put the clean data into a new data frame.

In [6]:
df_new = pd.DataFrame()
df_new['name'] = df['description'].apply(split_description)
df_new['id'] = df['id']

This function removes numbers and extra spaces from the name.

In [7]:
def remove(name):
    new_name = re.sub("[0-9]", '', name)
    new_name = ' '.join(new_name.split())
    return new_name

Let's apply the function above.

In [8]:
df_new['name'] = df_new['name'].apply(remove)

Now the data is all nice and clean.

In [9]:
df_new.head()
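
The two cleaning steps above can be tried out on a few strings without loading the dataset. This is a minimal sketch: clean_name is a hypothetical helper that combines split_description and remove, and the sample descriptions are invented for illustration rather than taken from the catalog.

```python
import re

def clean_name(description):
    """Return the item name: text before the first ' - ', without digits or extra spaces."""
    name = description.split(' - ', 1)[0]   # keep only the part before the first ' - '
    name = re.sub(r'[0-9]', '', name)       # drop digits
    return ' '.join(name.split())           # collapse repeated whitespace

# Hypothetical descriptions, written only for this example
samples = [
    "Trail tee 2 - A light shirt for warm days - 100% cotton",
    "Guide  pants - Durable pants - nylon blend",
    "Wool cap",
]
cleaned = [clean_name(s) for s in samples]
for original, name in zip(samples, cleaned):
    print(repr(name), '<-', original)
```

A description without the ' - ' separator is returned whole, so a missing long description does not raise an error.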

## TF-IDF
Import TF-IDF vectorizer from sklearn.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

Let's set up the parameters for our TF-IDF vectorizer.

I want to use the inverse document frequency, so I set use_idf to True.

Setting stop_words to 'english' removes irrelevant words such as "to" and "and".

The n-gram range (1, 4) breaks each document into sequences of 1, 2, 3, and 4 consecutive terms.

min_df removes terms that appear too infrequently: 0.01 means ignore terms that appear in fewer than 1% of the documents. max_df does the opposite: 0.8 means ignore terms that appear in more than 80% of the documents.

In [11]:
tfidf_vectorizer = TfidfVectorizer(
                                   use_idf=True,
                                   stop_words = 'english',
                                   ngram_range=(1,4), min_df = 0.01, max_df = 0.8)

Now that the vectorizer is set I will fit and transform the data. 

In [12]:
%time tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])

These parameters narrow the vocabulary down to 85 terms in the matrix.

In [13]:
print(tfidf_matrix.shape)
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
print(tfidf_vectorizer.get_feature_names_out())

I calculate the cosine similarity between each pair of documents. Subtracting it from 1 gives the cosine distance, which I use later for plotting on a 2-dimensional plane.

In [19]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1.0 - cosine_similarity(tfidf_matrix)
print(dist)
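
As a sanity check on what 1 - cosine similarity means, the same quantity can be computed by hand with NumPy. This is a sketch on made-up vectors. The function name cosine_distance_matrix is hypothetical, and it assumes no row is all zeros (a zero row would divide by zero).

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance (1 - cosine similarity) between the rows of X."""
    X = np.asarray(X, dtype=float)
    # scale every row to unit length so the dot product equals the cosine similarity
    unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - unit_rows @ unit_rows.T

# Hypothetical term weights for three documents over three terms
toy_vectors = np.array([
    [1.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 2.0],
])
toy_dist = cosine_distance_matrix(toy_vectors)
print(np.round(toy_dist, 2))
```

Each document is at distance 0 from itself, and documents that share no terms (the first and third rows) are at distance 1.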

Before running k-means, I use hierarchical clustering to estimate how many clusters I should have. I truncate the dendrogram because the full tree is hard to read. I cut at a height of 20 because that is where the second-biggest distance jump occurs (the biggest is at 60). The cut yields 7 clusters.

In [41]:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import ward, dendrogram
%matplotlib inline

# Ward linkage. Note: SciPy treats a square 2-D array as observation vectors,
# so each row of dist acts as that document's feature vector here.
linkage_matrix = ward(dist)

fig, ax = plt.subplots(figsize=(15, 20)) # set size
dendrogram(linkage_matrix,
           truncate_mode='lastp', # show only the last p merged clusters
           p=20,
           leaf_rotation=90.,
           leaf_font_size=12.,
           labels=list(df_new['name']),
           ax=ax)

plt.axhline(y=20, linewidth=2, color='black') # cut height

fig.suptitle("Hierarchical Clustering Dendrogram (Truncated)", fontsize=35, fontweight='bold')

plt.show()
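
Counting branches under the cut line by eye is easy to get wrong, so the count can also be checked programmatically. The sketch below uses SciPy's fcluster on invented 2-D points. The points and the cut height of 5 are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster

# Hypothetical 2-D points forming three well-separated pairs
toy_points = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [10.0, 0.0], [10.0, 1.0],
    [20.0, 0.0], [20.0, 1.0],
])
toy_linkage = ward(toy_points)

# Cut the tree at height 5: only merges at or below that distance are kept
toy_labels = fcluster(toy_linkage, t=5, criterion='distance')
n_toy_clusters = len(set(toy_labels))
print(n_toy_clusters)
```

Applied to this notebook, the same call with linkage_matrix and t=20 should return one label per item, and the number of distinct labels is the cluster count at that cut.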

## K-Means Clustering
Let's fit k-means on the matrix with a range of clusters 1 - 19.

In [3]:
from sklearn.cluster import KMeans
num_clusters = range(1,20)

%time KM = [KMeans(n_clusters=k, random_state = 1).fit(tfidf_matrix) for k in num_clusters]

Let's plot the within cluster sum of squares for each k to see which k I should choose.

The plot shows a steady decline from k = 1 to k = 19. Since there is no clear elbow, I choose k = 7 based on the previous dendrogram.

In [15]:
import matplotlib.pyplot as plt
%matplotlib inline
with_in_cluster = [KM[k].inertia_ for k in range(0,len(num_clusters))]
plt.plot(num_clusters, with_in_cluster)
plt.ylim(min(with_in_cluster)-1000, max(with_in_cluster)+1000)
plt.ylabel('with-in cluster sum of squares')
plt.xlabel('# of clusters')
plt.title('kmeans within ss for k value')
plt.show()
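
When the within-cluster sum of squares shows no elbow, the silhouette score is another common way to compare values of k. This is not what the notebook does. The sketch below only illustrates the idea on invented, well-separated data, and every name in it (toy_X, scores, best_k) is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three tight, well-separated groups of 2-D points
rng = np.random.RandomState(1)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_X = np.vstack([c + 0.5 * rng.randn(30, 2) for c in centers])

# The silhouette score needs at least 2 clusters, so start at k = 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(toy_X)
    scores[k] = silhouette_score(toy_X, labels)

# The k with the highest average silhouette is the preferred choice
best_k = max(scores, key=scores.get)
print(best_k)
```

On short item names the scores may be much closer together than on this toy data, so this would be a cross-check on the dendrogram rather than a replacement for it.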

I add the cluster label to each record in df_new.

In [43]:
model = KM[6]  # index 6 is the k = 7 model, because num_clusters starts at 1
clusters = model.labels_.tolist()
df_new['cluster'] = clusters

Here is the distribution of clusters. Cluster 0 has the most records, followed by cluster 1. Clusters 2 to 4 seem fairly even.

In [44]:
df_new['cluster'].value_counts()

I print the top terms per cluster and the names in the respective cluster.

In [46]:
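print("Top terms per cluster:")
print()
# sort each centroid's term weights in descending order
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
terms = tfidf_vectorizer.get_feature_names_out()
for i in range(model.n_clusters):
    print("Cluster %d:" % i, end=' ')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end=' ')
    print()
    print("Cluster %d names:" % i, end=' ')
    for name in df_new[df_new['cluster'] == i]['name'].sample(n=10):
        print(' %s' % name, end=' ')
    print()
    print()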
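
The argsort()[:, ::-1] step is what turns centroid weights into a ranking of terms. Here is a small sketch with an invented vocabulary and invented centroid weights to show the mechanics in isolation.

```python
import numpy as np

# Hypothetical vocabulary and centroid weights (2 clusters x 4 terms)
toy_terms = np.array(['cap', 'pants', 'shirt', 'shorts'])
toy_centers = np.array([
    [0.05, 0.70, 0.10, 0.30],
    [0.60, 0.00, 0.20, 0.10],
])

# argsort sorts ascending, so reverse each row to get the largest weights first
toy_order = toy_centers.argsort()[:, ::-1]

# look up the two highest-weighted terms for every cluster
top_two = [[str(t) for t in toy_terms[row[:2]]] for row in toy_order]
print(top_two)
```

The first cluster is led by "pants" and "shorts" and the second by "cap" and "shirt", matching the largest weights in each row.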

I reduce the distance matrix to 2 dimensions with MDS. The dissimilarity is set to precomputed because I pass in the 1 - cosine similarity matrix directly. Then I assign the x and y coordinates.

In [47]:
import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn.manifold import MDS

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)

pos = mds.fit_transform(dist)

xs, ys = pos[:, 0], pos[:, 1]

Let's plot the clusters with colors and name each cluster after its top terms so it is easier to view. The clusters look good except perhaps cluster 0, "vest dress print", where the grouping seems less certain. I am not sure what causes this, but I did find that TF-IDF tends to work better on longer text, and item names are short.

In [53]:
cluster_colors = {0: '#85C1E9', 1: '#FF0000', 2: '#800000', 3: '#04B320', 
                  4: '#6033FF', 5: '#33FF49', 6: '#F9E79F', 7: '#935116',
                  8: '#9B59B6', 9: '#95A5A6'}
cluster_labels = {0: 'vest  dress  print', 1: 'shirt  merino  island',
                  2: 'pants  guide pants  guide', 3: 'shorts  board  board shorts',
                  4: 'simply  live  live simply', 5: 'cap  cap bottoms  bottoms',
                  6: 'jkt  zip jkt  guide'}

#some ipython magic to show the matplotlib plots inline
%matplotlib inline 

#create data frame that has the result of the MDS plus the cluster numbers and titles
df_plot = pd.DataFrame(dict(x=xs, y=ys, label=clusters, name=df_new['name'])) 

#group by cluster
groups = df_plot.groupby('label')

# set up plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size

for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12,
            label = cluster_labels[name], 
            color = cluster_colors[name])
    ax.set_aspect('auto')
    
ax.legend(numpoints = 1)  

fig.suptitle("SKU Clustering", fontsize = 35, fontweight = 'bold')

plt.show()