# Get Data

For the purposes of this introduction, we'll use product data from Bed Bath & Beyond, restricted to a single category: kitchen electrics (KE).  I chose KE because it's a decent size (3882 listed products) and has interesting name properties.  Since this is live code, we can branch out and check other categories as the spirit moves us.

In [15]:
import unicodecsv
kitchenElectrics = []

# Collect the product names for a single category.
with open('bedbathbeyond_products.csv', 'rU') as bbb:
    reader = unicodecsv.DictReader(bbb)
    for r in reader:
        try:
            if r['CategoryName'] == u'KITCHEN ELECTRICS':
                kitchenElectrics.append(r['Name'])
        except TypeError:
            # Show the offending row before re-raising.
            print r
            raise

print len(kitchenElectrics)

3882


# Create TFIDF model and fit a KMeans Clusterer

Follow the links for explanations of [TFIDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) and [KMeans](http://en.wikipedia.org/wiki/K-means_clustering), or just ask!  There's a lot of complexity packed into very few lines of code here, so we should spend some time going over it.

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

bigram_tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), token_pattern=r'\b\w+\b')
tfidf = bigram_tfidf_vectorizer.fit_transform(kitchenElectrics)

clusterer = KMeans(n_clusters = 700, n_jobs = 4)
predictions = clusterer.fit_predict(tfidf)
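
To unpack what the vectorizer is doing, here is a rough pure-Python sketch of the same idea on three product names. The helper names `unigrams_and_bigrams` and `tfidf_rows` are made up for illustration, and the smoothed idf formula with L2 normalisation is my understanding of scikit-learn's defaults rather than a copy of its implementation. One detail worth noticing: the custom `token_pattern` keeps single-character tokens such as the `4` in `4-Slice`, which the library's default pattern (two or more word characters) would drop.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r'\b\w+\b')  # same pattern as the vectorizer above

def unigrams_and_bigrams(name):
    """Lowercase a product name and return its 1-grams followed by its 2-grams."""
    tokens = TOKEN_RE.findall(name.lower())
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def tfidf_rows(names):
    """Return one {term: weight} dict per name, L2-normalised."""
    docs = [Counter(unigrams_and_bigrams(n)) for n in names]
    n_docs = len(docs)
    # Document frequency: in how many names does each term occur?
    df = Counter(term for doc in docs for term in doc)
    rows = []
    for doc in docs:
        # Smoothed idf (assumed to match scikit-learn's default settings).
        raw = dict((t, tf * (math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0))
                   for t, tf in doc.items())
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append(dict((t, w / norm) for t, w in raw.items()))
    return rows

sample = [
    'Hario V60 Drip Decanter',
    'Hario V60 Pour-Over Kit',
    'Dualit 4-Slice Chrome Toaster',
]
rows = tfidf_rows(sample)

# Terms shared by several names (here the brand) get a lower weight
# than terms that appear in only one name.
print(rows[0]['hario'] < rows[0]['decanter'])
```

The real `TfidfVectorizer` returns a sparse matrix rather than dicts, but the weights should follow the same pattern: rare terms count for more, and every row has unit length.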

# Inspect the Clusters

Let's look over a few of the clusters and see if they make any sense.

In [42]:
def inspectClusters(predictions, products, numClusters = 10):
    # Print the members of the first numClusters clusters, one cluster at a time.
    for j in range(numClusters):
        for i in range(len(predictions)):
            if predictions[i] == j:
                print '%d : %s' % (j, products[i])

inspectClusters(predictions, kitchenElectrics)

0 : Hario V60 Coffee Drip Bouno Kettle
0 : Hario V60 Pour-Over Kit
0 : Hario Acrylic Stand with Drip Tray for V60 Coffee Dripper
0 : Hario V60 Drip Decanter
0 : Hario V60 Ceramic Coffee Dripper in White
0 : Hario Filter Paper for 02 V60 Dripper
0 : Hario Ceramic Coffee Mini Mill Grinder
0 : Hario V60 Coffee Drip Scale/Timer
1 : Oxo Good Grips 4-Cup French Press Coffee Maker in Stainless Steel
1 : OXO Good Grips 8-Cup French Press Coffee Maker
1 : Oxo Good Grips Replacement 4-Cup French Press Carafe
1 : Oxo Good Grips Replacement 8-Cup French Press Carafe
1 : OXO Good Grips 4-Cup French Press Coffee Maker
1 : OXO Good Grips 8-Cup French Press Coffee Maker
1 : OXO Good Grips Cold Brew Coffee Maker
1 : OXO Good Grips Cold Brew Coffee Maker Paper Filters
2 : Dualit 4-Slice NewGen Classic Toaster in Pink
2 : Dualit 4-Slice NewGen Classic Toaster in Charcoal
2 : Dualit 4-Slice NewGen Classic Toaster in Grey
2 : Dualit 4-Slice Chrome Toaster
2 : Dualit 4-Slice NewGen Classic Toaster in Light 

These aren't bad, but most of these clusters seem to fall along brand lines.
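
Eyeballing ten clusters only goes so far. One way to put a number on the "brand lines" impression is to check, for each cluster, how many members share the same first word. This is only a sketch: `leading_word_share` is a hypothetical helper, and treating the first word of a name as its brand is an assumption that will not hold for every product.

```python
from collections import Counter, defaultdict

def leading_word_share(predictions, products):
    """Map each cluster label to (most common first word, share of members using it)."""
    first_words = defaultdict(Counter)
    for label, name in zip(predictions, products):
        words = name.split()
        if words:
            # Lowercase so that 'OXO' and 'Oxo' count as the same word.
            first_words[label][words[0].lower()] += 1
    summary = {}
    for label, counts in first_words.items():
        word, hits = counts.most_common(1)[0]
        summary[label] = (word, hits / float(sum(counts.values())))
    return summary

# A handful of names and labels copied from the output above.
demo_products = [
    'Hario V60 Pour-Over Kit',
    'Hario V60 Drip Decanter',
    'Oxo Good Grips Cold Brew Coffee Maker',
    'OXO Good Grips 8-Cup French Press Coffee Maker',
    'Dualit 4-Slice Chrome Toaster',
]
demo_predictions = [0, 0, 1, 1, 2]
summary = leading_word_share(demo_predictions, demo_products)
print(summary[1])
```

Calling `leading_word_share(predictions, kitchenElectrics)` on the full run would show whether shares close to 1.0 are the norm across all 700 clusters or just in the few we printed.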

## Remember, the TFIDF model is _very_ high-dimensional

In [29]:
tfidf.shape

(3882, 11544)

It's operating on both single words and bigrams right now.  One thing we need to experiment with and test is the effect of K (and implicitly the size of the clusters) on downstream outcomes.  It's not inherently important that these clusters map to human intuition, but it would certainly be nice if that happened.
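
To make the link between K and cluster size concrete: 3882 products spread over 700 clusters averages out to about 5.5 products per cluster, though the actual sizes will be uneven. A small helper like the one below (the name `cluster_size_summary` is hypothetical) would let us compare size distributions as we vary K.

```python
from collections import Counter

def cluster_size_summary(predictions):
    """Return (number of non-empty clusters, mean size, histogram of sizes)."""
    sizes = Counter(predictions)           # cluster label -> member count
    histogram = Counter(sizes.values())    # cluster size -> number of clusters
    mean_size = len(predictions) / float(len(sizes))
    return len(sizes), mean_size, histogram

# Back-of-the-envelope average for the run above.
n_products, k = 3882, 700
print(n_products / float(k))

# Toy labels: three clusters with 3, 2 and 1 members.
n_clusters, mean_size, histogram = cluster_size_summary([0, 0, 0, 1, 1, 2])
print(n_clusters, mean_size)
```

On the real data the call would be `cluster_size_summary(predictions)`; a histogram dominated by clusters of size 1 or 2 would suggest K is too large for this category.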

## Experiments with Stopwords and Brands

In [39]:
from nltk.corpus import stopwords
import string

stopset = set(stopwords.words('english'))
stopset.update(string.punctuation)
stopset.update([';', 'reg', '&', u';', u'&']) #for unicode

This is a good default set of stopwords.  TFIDF controls for a lot of the effect of normal stopwords, but completely excluding them will make n-grams more useful.  Let's see what changes when we add stopwords to the vectorizer.
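
Why would dropping stopwords help the n-grams? As far as I understand it, scikit-learn removes stop words before it pairs tokens up, so bigrams end up joining content words instead of pairing them with "with" or "for". The sketch below imitates that order of operations in plain Python; `bigrams` and `demo_stops` are illustrative stand-ins, not the library's code or the full NLTK list. It also shows that the punctuation entries in `stopset` are harmless but inert here, since the `\b\w+\b` pattern never emits punctuation as a token (whereas `reg` can appear as a token).

```python
import re

TOKEN_RE = re.compile(r'\b\w+\b')  # same pattern as the vectorizer

def bigrams(name, stop_words=frozenset()):
    """Tokenize, drop stop words, then pair up neighbouring tokens."""
    tokens = [t for t in TOKEN_RE.findall(name.lower()) if t not in stop_words]
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

name = 'Hario Acrylic Stand with Drip Tray for V60 Coffee Dripper'
demo_stops = set(['with', 'for', 'in'])  # tiny stand-in for the NLTK list

before = bigrams(name)
after = bigrams(name, demo_stops)

print('stand with' in before)   # bigram built around a stopword
print('stand drip' in after)    # content words now sit next to each other

# Punctuation never becomes a token under this pattern.
print(TOKEN_RE.findall('Black & Decker'))
```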

In [41]:
bigram_tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), token_pattern=r'\b\w+\b', stop_words=list(stopset))

tfidf = bigram_tfidf_vectorizer.fit_transform(kitchenElectrics)

clusterer = KMeans(n_clusters = 700, n_jobs = 4)
predictions = clusterer.fit_predict(tfidf)

inspectClusters(predictions, kitchenElectrics)

0 : Hario V60 Coffee Drip Bouno Kettle
0 : Hario V60 Pour-Over Kit
0 : Hario Acrylic Stand with Drip Tray for V60 Coffee Dripper
0 : Hario V60 Drip Decanter
0 : Hario V60 Ceramic Coffee Dripper in White
0 : Hario Filter Paper for 02 V60 Dripper
0 : Hario Ceramic Coffee Mini Mill Grinder
0 : Hario V60 Coffee Drip Scale/Timer
1 : Oxo Good Grips 4-Cup French Press Coffee Maker in Stainless Steel
1 : OXO Good Grips 8-Cup French Press Coffee Maker
1 : Oxo Good Grips Replacement 4-Cup French Press Carafe
1 : Oxo Good Grips Replacement 8-Cup French Press Carafe
1 : OXO Good Grips 4-Cup French Press Coffee Maker
1 : OXO Good Grips 8-Cup French Press Coffee Maker
1 : OXO Good Grips Cold Brew Coffee Maker
1 : OXO Good Grips Cold Brew Coffee Maker Paper Filters
2 : Dualit 4-Slice NewGen Classic Toaster in Pink
2 : Dualit 4-Slice NewGen Classic Toaster in Charcoal
2 : Dualit 4-Slice NewGen Classic Toaster in Grey
2 : Dualit 4-Slice Chrome Toaster
2 : Dualit 4-Slice NewGen Classic Toaster in Light 

The clusters have changed (though they'll naturally change every time we run it), but are they much better?  Though this is too small a sample to make any sweeping proclamations, it seems possible that the effect of brand names has become stronger.

Should we remove brand names?  If so, how should we go about it?
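
One possible starting point, offered as a sketch rather than an answer: guess brand tokens from the data itself by collecting first words that open several product names, then treat them like stopwords. Both `candidate_brands` and `strip_brands` are hypothetical helpers, the `min_count` threshold is arbitrary, and the first-word heuristic will miss multi-word brands ("Good Grips" survives) and may flag generic words that happen to start many names.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r'\b\w+\b')

def candidate_brands(names, min_count=2):
    """Guess brand tokens: lowercase first words that open at least min_count names."""
    firsts = Counter()
    for name in names:
        tokens = TOKEN_RE.findall(name.lower())
        if tokens:
            firsts[tokens[0]] += 1
    return set(word for word, n in firsts.items() if n >= min_count)

def strip_brands(name, brands):
    """Drop tokens found in brands; note this also flattens punctuation to spaces."""
    return ' '.join(t for t in TOKEN_RE.findall(name) if t.lower() not in brands)

names = [
    'Hario V60 Pour-Over Kit',
    'Hario V60 Drip Decanter',
    'OXO Good Grips Cold Brew Coffee Maker',
    'Oxo Good Grips Replacement 8-Cup French Press Carafe',
    'Dualit 4-Slice Chrome Toaster',
]
brands = candidate_brands(names)
print(sorted(brands))
print(strip_brands(names[0], brands))
```

In the notebook this could feed straight into the existing stopword set, for example `stopset.update(candidate_brands(kitchenElectrics, min_count=5))`, after a manual look at the candidates to weed out false positives.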