# Approach

* Broad groupings
    - L2 categories


* Similarity measure
    - TF-IDF on title/description, weighted by query frequency


* Clustering algorithm: AgglomerativeClustering


* Evaluation:
    - Exploratory analysis of random samples of items within a random selection of clusters

# Load dataset

In [38]:
import pandas as pd
%matplotlib inline

In [39]:
dfl = pd.read_csv('../za_sample_listings_incl_cat_clean.csv')

# Split dataset by L2 categories

In [40]:
l2_dfs = {l2cat: rows for l2cat, rows in dfl.groupby('category_l2_name_en')}

I'll pick a small category for quick experimentation.

In [41]:
catdf = l2_dfs['Musical Instruments']

# Computing TF-IDF features

We repeat the title several times in each document (five by default, via `title_boost`) so that its terms weigh more than those of the description.

In [42]:
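def get_cat_combined_docs(catdf, title_boost=5):
    # One document per listing: the title repeated title_boost times, then the description.
    titles, descriptions = catdf.listing_title.values, catdf.listing_description.values
    return ['\n'.join([t] * title_boost + [d]) for t, d in zip(titles, descriptions)]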

In [43]:
cat_docs = get_cat_combined_docs(catdf)
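
To see what the repetition does to raw term counts, here is a minimal sketch on a single made-up listing. The title and description strings are invented for illustration, `combine` is a hypothetical one-listing version of the joining logic above, and counting whitespace-separated tokens is a simplification of what the vectorizer does.

```python
from collections import Counter

def combine(title, description, title_boost=5):
    # Same joining logic as get_cat_combined_docs, for a single listing.
    return '\n'.join([title] * title_boost + [description])

# Hypothetical listing, invented for this illustration.
doc = combine('yamaha keyboard', 'keyboard in good condition')

# Count lowercase, whitespace-separated tokens in the combined document.
counts = Counter(doc.lower().split())
print(counts['yamaha'], counts['keyboard'], counts['condition'])
```

Title-only terms are counted `title_boost` times, while a term that appears only in the description is counted once.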

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

To match search queries more closely, we skip lemmatization this time and extend the vocabulary with bigrams.

In [45]:
vec = TfidfVectorizer(lowercase=True, ngram_range=(1,2))
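
As a rough illustration of what `ngram_range=(1,2)` adds to the vocabulary, the sketch below lists the unigrams and bigrams of a made-up title. `word_ngrams` is a hypothetical helper written for this note, and splitting on whitespace is a simplification of the vectorizer's own tokenizer.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    # Lowercase and split on whitespace (simpler than the vectorizer's tokenizer).
    tokens = text.lower().split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens over the text.
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

# Hypothetical listing title, invented for this illustration.
terms = word_ngrams('Ibanez bass guitar')
print(terms)
```

With bigrams in the vocabulary, a two-word query such as "bass guitar" can match a single feature instead of two unrelated ones.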

In [46]:
%time tfidf = vec.fit_transform(cat_docs)

CPU times: user 244 ms, sys: 4 ms, total: 248 ms
Wall time: 250 ms


In [47]:
tfidf

<2488x38349 sparse matrix of type '<class 'numpy.float64'>'
	with 103361 stored elements in Compressed Sparse Row format>

### Using user queries to improve TF-IDF scores

In [48]:
kws = pd.read_csv('../za_queries_sample.csv')
searchkw_freqs = dict(zip(kws['search_term'], kws['cnt']))

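We boost vocabulary terms that also appear as search queries. The matching terms are ranked by query frequency and split into five equal-sized groups, from least to most searched. A term in group i (1 to 5) has its TF-IDF weight multiplied by 1 + i, that is, by a factor between 2 and 6, while all other terms keep a factor of 1.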

In [49]:
import numpy as np

# Vocabulary terms (unigrams and bigrams) that also occur as search queries.
search_vocab = set(vec.vocabulary_).intersection(searchkw_freqs)

# Rank the terms by query frequency and cut the ranking into five equal-sized groups.
sorted_kws = sorted(search_vocab, key=lambda w: searchkw_freqs[w])
group_size = max(1, len(sorted_kws) // 5)
# Leftover top-ranked terms (when the count is not a multiple of 5) join the top group.
boost_factors = {w: min(5, rank // group_size + 1) for rank, w in enumerate(sorted_kws)}

# Multiplier per vocabulary column: 1 for unmatched terms, 2 to 6 for matched ones.
boost_vector = np.ones(len(vec.vocabulary_))
for w, ind in vec.vocabulary_.items():
    if w in boost_factors:
        boost_vector[ind] += boost_factors[w]

In [50]:
_tfidf = tfidf
tfidf = _tfidf.multiply(boost_vector)
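
The rank-based boost can be sketched on a handful of made-up query counts. Everything here is illustrative: the counts are invented, `quintile_boosts` is a hypothetical helper, and putting leftover terms into the top group is an assumption of this sketch.

```python
def quintile_boosts(freqs, n_groups=5):
    # Rank terms by ascending frequency, then cut the ranking into n_groups slices.
    ranked = sorted(freqs, key=freqs.get)
    group_size = max(1, len(ranked) // n_groups)
    # Assumption: terms past the last full slice fall into the top group.
    return {w: min(n_groups, rank // group_size + 1) for rank, w in enumerate(ranked)}

# Hypothetical query counts, invented for this illustration.
freqs = {'guitar': 900, 'piano': 400, 'amp': 150, 'drum kit': 60, 'ukulele': 5}
boosts = quintile_boosts(freqs)

# The multiplier starts at 1 (no boost) and adds the group index.
multipliers = {w: 1 + b for w, b in boosts.items()}
print(multipliers)
```

Because the groups depend on rank rather than on raw counts, a term searched 900 times and one searched 400 times end up only one step apart.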

### Normalization

We now L2-normalize the TF-IDF vectors to compensate for different document lengths. On unit-length vectors, Euclidean distance is a monotonic function of cosine similarity, so we can use it to compare documents.

In [51]:
from sklearn.preprocessing import normalize

In [52]:
tfidf = normalize(tfidf)

Let's check that the normalization worked correctly:

In [53]:
from scipy.sparse.linalg import norm
norms = norm(tfidf, axis=1)
len(norms), norms

(2488, array([ 1.,  1.,  1., ...,  1.,  1.,  1.]))

## TF-IDF distances

In [54]:
from sklearn.metrics.pairwise import pairwise_distances
text_dists = pairwise_distances(tfidf)

In [55]:
text_dists

array([[ 0.        ,  1.41421356,  1.41421356, ...,  1.41128556,
         1.41421356,  1.41421356],
       [ 1.41421356,  0.        ,  1.40882666, ...,  1.41086926,
         1.41421356,  1.41392296],
       [ 1.41421356,  1.40882666,  0.        , ...,  0.97300259,
         1.41421356,  1.41402328],
       ..., 
       [ 1.41128556,  1.41086926,  0.97300259, ...,  0.        ,
         1.41421356,  1.41386514],
       [ 1.41421356,  1.41421356,  1.41421356, ...,  1.41421356,
         0.        ,  1.41421356],
       [ 1.41421356,  1.41392296,  1.41402328, ...,  1.41386514,
         1.41421356,  0.        ]])

In [56]:
text_dists.max()

1.4142135623730963
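
This maximum is the square root of 2, which is what we expect for unit-length vectors with non-negative entries: their Euclidean distance equals `sqrt(2 - 2 * cos)`, where `cos` is their cosine similarity, and two documents with no term in common have `cos = 0`. The sketch below checks the identity on small made-up vectors (the vectors and helper names are invented for illustration).

```python
import math

def unit(v):
    # Scale a vector to unit L2 norm.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # A plain dot product; equals cosine similarity only for unit vectors.
    return sum(x * y for x, y in zip(a, b))

a = unit([3.0, 0.0, 4.0])
b = unit([0.0, 2.0, 0.0])  # shares no non-zero dimension with a
c = unit([1.0, 0.0, 1.0])

# Check the identity on one disjoint pair and one overlapping pair.
for u, v in [(a, b), (a, c)]:
    assert math.isclose(euclidean(u, v), math.sqrt(2 - 2 * cosine(u, v)))

print(euclidean(a, b), math.sqrt(2))
```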

In [57]:
# Rescale the distances to the [0, 1] range.
text_dists /= text_dists.max()

In [58]:
dists = text_dists

# Clustering

We want to adjust the number of clusters to our desired average cluster size.

In [59]:
cluster_size = 25
n_clusters = len(catdf) // cluster_size

In [60]:
n_clusters

99
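
The same arithmetic as a small helper, in case it needs to be reused for other categories. `n_clusters_for` is a hypothetical name, and the floor of one cluster for categories smaller than the target size is an addition of this sketch, not something the cells above do.

```python
def n_clusters_for(n_items, target_size):
    # Floor division, with a minimum of one cluster for very small categories.
    return max(1, n_items // target_size)

k = n_clusters_for(2488, 25)
# Number of clusters and the resulting mean cluster size.
print(k, 2488 / k)
```

Since the division rounds down, the resulting mean cluster size is slightly above the target.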

In [61]:
from sklearn.cluster import AgglomerativeClustering
cl = AgglomerativeClustering(
        n_clusters=n_clusters,
        affinity='precomputed',  # dists is already a distance matrix; scikit-learn >= 1.2 names this parameter `metric`
        linkage='complete',
)
cl.fit(dists)

AgglomerativeClustering(affinity='precomputed', compute_full_tree='auto',
            connectivity=None, linkage='complete', memory=None,
            n_clusters=99, pooling_func=<function mean at 0x7fe640478598>)

We chose the _complete_ linkage method because it produces a less concentrated distribution of cluster sizes, that is, fewer very large clusters alongside many tiny ones.
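
For intuition about the linkage criteria, the sketch below computes the distance between two candidate clusters under each one, using a made-up distance matrix. `linkage_distance` is a hypothetical helper written for this note, not part of scikit-learn.

```python
from itertools import product

def linkage_distance(dist, cluster_a, cluster_b, linkage='complete'):
    # dist is a square matrix (list of lists) of pairwise item distances.
    pair_dists = [dist[i][j] for i, j in product(cluster_a, cluster_b)]
    if linkage == 'complete':
        return max(pair_dists)  # the farthest pair decides
    if linkage == 'single':
        return min(pair_dists)  # the closest pair decides
    if linkage == 'average':
        return sum(pair_dists) / len(pair_dists)
    raise ValueError('unknown linkage: {}'.format(linkage))

# Hypothetical distances between four items, invented for this illustration.
dist = [
    [0.0, 0.2, 0.3, 0.9],
    [0.2, 0.0, 0.4, 1.0],
    [0.3, 0.4, 0.0, 0.5],
    [0.9, 1.0, 0.5, 0.0],
]
left, right = [0, 1], [2, 3]
print(linkage_distance(dist, left, right, 'complete'))
print(linkage_distance(dist, left, right, 'single'))
```

Under complete linkage, two clusters are close only if all of their members are close, which tends to discourage clusters from growing by chaining through a single nearby pair.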

In [62]:
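# assign() returns a new frame, avoiding pandas' SettingWithCopyWarning on a groupby slice
catdf = catdf.assign(cluster_labels=cl.labels_)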

Let's sort the clusters by size, largest first.

In [63]:
from collections import Counter
# (label, size) pairs, largest cluster first
cluster_sizes = Counter(cl.labels_).most_common()

Let us explore a random sample of 20 clusters and display a random selection of up to 8 items for each of them:

In [64]:
from tabulate import tabulate
from random import sample
for ind, size in sample(cluster_sizes, 20):
    print("\n\nCluster {}, size {}".format(ind, size))
    items = catdf[catdf.cluster_labels == ind][['listing_title', 'listing_price']]
    if len(items) > 8:
        items = items.sample(8)
    print(tabulate(items, headers='keys', showindex=False, tablefmt='rst'))



Cluster 91, size 14
listing_title                                                listing_price
HYBRID mixer 24 channel                                               6500
hybrid                                                                 250
Brand new : HYBRID B-1600 and HYBRID B-2200 amps for sale            10000
Hybrid 4 channel mixer                                                2500
Hybrid C3000 Amp                                                      4000
Hybrid mixer mc12usb New                                              3000
hybrid 8chunel powered table mixer                                    1200
Hybrid mixer with usb input                                           5500


Cluster 90, size 49
listing_title                              listing_price
Ibanez S540 Electric Inc OHC                        6500
ibanez gio soundgear 5 string bass                  2499
IBANEZ AEW21/22/23 semi acoustic guitar             4999
Ibanez guitar amplifier.                            1