Task details on challenge blog post.
python3
is required due to use of built-in string translation and statistics methods.
pip3 install requirements.txt
python3 run.py --strict
results are in output/results.txt
passing --strict
increases match precision:
- disallows constructing valid
product
model
names from theproduct
family
name when themodel
name is unacceptable for hashing. - only matches a
listing
if theproduct
manufacturer
matches thelisting
manufacturer
I tried a couple approaches, first a KMeans
clustering algorithm, and then a simpler (but more robust) hashing
method.
The hashing
approach is what has been retained for use with run.py
, but I left the the generally useful KMeans text clustering class I wrote in source/cluster.py
(using this this requires you pip3 install -r source/kmeans_requirements.txt
).
- parse
product.txt
andlistings.txt
, converting and sanitizing any provided JSON keys - hash the
products
into a dictionary keyed bymanufacturer
:model
, anyproducts
with un-hashable model names are saved tounhashed_products.json
- match
listings
intoproducts
, store any unknown matches by keyunknown
, later dump them tounmatched_listings.json
- filter out matched, hashed
listings
by-s
standard deviations from the median, converting costs to a fixed currency at current date using theCurrencyConverter
library, and dump these outliers tooutlier_listings.json
sci-kit
learn is packed with tools for analyzing data, so I started looking at what tools were available to solve this kind of 'matching' problem.
-
According to the sklearn algorithm cheatsheet, this problem fits into the path of:
categorization
-->un-labelled data
-->known # categories
(products) -->>10k samples
- which means thatKMeans
clustering is recommended.The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares.
-
Existing text analysis and comparison work pointed me towards clustering
Listings
by text contents using a TF-IDF feature of thetitle
field
- can reduce the search-size of an even greater number of more varied listings (i.e not just imaging products as provided)
- model converges well on listing
title
data - can estimate similarity between listings even if the similarity would be complex to express via direct string operations
~80%
of the top15
words in each cluster centroid do not contain desired manufacturer or model keywords, making it challenging to associate products to clusters- the model is sensitive to the language of the text, and with data that is mostly english, it is more challenging to produce accurate predictions for all listings.
- it is non-trivial to cluster by text data alongside cost, date, etc.
- Products don't have enough text data to make an accurate cluster prediction to match them
- when clustering by the number of products, and using listing
titles
as text features,Kmeans
is able to match9%
ofproducts
to%20
oflistings
. - model fitting takes approximately 10 seconds on an 8-core machine
- there are many
listings
that do not contain themodel
ormanufacturer
strings in the matchedproduct
.
Because the product
data contains strings that should be explicitly present in associated Listings
, I decided to try a hashing approach where I build a dict()
keyed by manufacturer
and model
.
- much faster than KMeans
- is not very language specific, since the model name and manufacturer do not change significantly across languages.
- performing operations on a dict hashed by manufacturer and model minimizes the number of comparison operations required
- hashing short
model
ormanufacturer
strings (ex:hp
) leads to an increase in false-positivelisting
-product
matches - requires programmer time to develop the string comparison and sanitization
- has many edge-cases in the formation of the product hash
- requires more assertions to function
- the hashing approach matches
98%
of products to82%
listings with-s 2
standard deviations for cost outliers and with--strict
rules. - runs in well-under 5 seconds on an 8-core machine
Below sections detail the task, objects and caveats of analyzing the data provided.
Generate a file full of Result
objects which associate each known Product
to its associated Listing(s)
. Precision is valued over Recall, but both should be maximized.
The Result
should match product listings from a 3rd party retailers, e.g. “Nikon D90 12.3MP Digital SLR Camera (Body Only)”
against a set of known products, e.g. “Nikon D90”
.
There are 20196
listings and 743
known products in the provided dataset.
Listing
:
{'title': 'LED Flash Macro Ring Light (48 X LED) with 6 Adapter Rings for For Canon/Sony/Nikon/Sigma Lenses',
'manufacturer': 'Neewer Electronics Accessories',
'currency': 'CAD', 'price': '35.99'}
Product
:
{'announced-date': '2010-01-06T19:00:00.000-05:00',
'family': 'Cyber-shot',
'manufacturer': 'Sony',
'model': 'DSC-W310',
'product_name': 'Sony_Cyber-shot_DSC-W310'}
Result
:
{"product_name": String,
"listings": list(Listings) }
-
some listings will contain the titles of other listings ex: tripods/cases/batteries, which also includes a list of compatible cameras
-
prices are in multiple currencies, across multiple dates
-
family
keyword is not always inProduct
-
similar products may have numerical ratings that are slightly different ex:
3.2mp
vs3mp
-
listings are in different languages
- 67% of
Listings
(13454
) appear to be in english language - 33% of
Listings
(6742
) appear to be in foreign languages
- 67% of
-
many product names are very similar:
'Fujifilm_FinePix_S200EXR', 'Fujifilm_FinePix_S2500HD', 'Fujifilm_FinePix_S2600HD', 'Fujifilm_FinePix_S2800HD', 'Fujifilm_FinePix_S2900HD',