# Problem Statement

for each item belonging to one of the four categories (cases, laptop, mp3, and cellphones), do the following: (1) find its nearest neighbors; (2) assign it to the relevant cluster.

the following notebook shows a multi-step pipeline that (1) preprocesses titles (e.g., extracts nouns and named entities); (2) generates term document matrices; (3) indexes feature vectors for approximate nearest neighbor (ANN) search; and (4) clusters via k-means.

# step 0: dataset

In [147]:
from sklearn.externals import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.decomposition import PCA
from spacy.en import English
from sklearn.feature_extraction.text import CountVectorizer
from annoy import AnnoyIndex
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

In [148]:
data = pd.read_table('data/mp3 players 73839.txt', header = None)
data.columns = ['id', 'site', 'category', 'title']
data.head()

| | id | site | category | title |
|---|---|---|---|---|
| 0 | 301074677039 | 0 | 73839 | Apple iPod nano 7th Generation Purple (16 GB) ... |
| 1 | 191049351783 | 0 | 73839 | APPLE IPOD TOUCH 16GB 4TH GEN WHITE MP3 PLAYER |
| 2 | 251402664662 | 0 | 73839 | Sport Sunglasses Headset Sun Glasses FOR IPHO... |
| 3 | 370975588476 | 0 | 73839 | DISNEY PARKS WHERE DREAMS COME TRUE MP3 PLAYER... |
| 4 | 251398873497 | 0 | 73839 | Mp3 Player Sunglasses 8gb Black w/ Bluetooth b... |


In [149]:
data.shape

(24691, 4)

In [150]:
# subset records during exploration if necessary!
# note: label-based slicing is inclusive, so [:10000] keeps 10001 rows

data_2 = data.loc[:10000]
data_2.shape

(10001, 4)

# step 1: text preprocessing

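let's focus on nouns, since we're generating a term document matrix from product-centric titles. note that the function below relies on the spaCy `parser` that is instantiated in step 4, so that cell must be run first.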

In [151]:
def extract_nouns_from_title(title):
    '''
    function:
    ---------
    extract nouns from a product title. by doing so, we effectively fit tdm to the most
    relevant aspects of the title string.
    
    parameter:
    ----------
    @title: str, referring to product title.
    
    returns:
    --------
    @title_nouns_only: str, referring to product title with nouns only. this output will be 
    used to fit a term document matrix.
    '''
    
    try:
        title = unicode(title)
        parsedData = parser(title)
        
        # keep only the first sentence that spaCy detects in the title
        for span in parsedData.sents:
            sent = [parsedData[i] for i in range(span.start, span.end)]
            break
        
        # collect tokens only if they are nouns
        title_nouns_only = [token.orth_ for token in sent if token.pos_ == 'NOUN']
        return " ".join(title_nouns_only)
    
    except:
        # fall back to the raw title on any failure, e.g. when it cannot be cast to unicode
        return title

In [152]:
# create new column by mapping extract_nouns_from_title() to titles

data_2['title_clean'] = map(lambda x: extract_nouns_from_title(x), data_2['title'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


# step 2: n-grams & tdm

generate a tf-idf weighted term document matrix of ngrams (unigrams, bigrams and trigrams). ngrams that occur in many titles will be weighted less, whereas rare ngrams will be weighted more.
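to make the weighting concrete, here is a minimal pure-python sketch of ngram extraction and inverse document frequency. it is an illustration only, not the `TfidfVectorizer` implementation: it skips stop words, term frequency and normalization. the smoothed formula `ln((1 + N) / (1 + df)) + 1` is the one scikit-learn documents for its default settings, and the toy titles and the helper names `ngrams` and `smoothed_idf` are assumptions made up for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def smoothed_idf(docs):
    """Map each n-gram to ln((1 + N) / (1 + df)) + 1, where df is its document frequency."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # count each n-gram at most once per document
        df.update(set(ngrams(doc.lower().split())))
    return {term: math.log((1.0 + n_docs) / (1.0 + count)) + 1.0
            for term, count in df.items()}

# hypothetical toy titles, not taken from the dataset
titles = ["ipod nano purple", "ipod touch white", "zune black"]
idf = smoothed_idf(titles)

print(ngrams("ipod nano purple".split()))
print(idf["ipod"] < idf["zune"])  # "ipod" appears in two titles, "zune" in one
```

in this toy corpus "ipod" receives a lower weight than "zune" because it appears in more titles, which is the behaviour described above.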

In [153]:
# PRE-FITTED VECTORIZER IF IT EXISTS

# vec = joblib.load('models/mp3/mp3_vec_v1.pkl')
# tdm_sparse = vec.transform(data_2['title_clean'])

In [154]:
# FIT NEW VECTORIZER

vec = TfidfVectorizer(stop_words= 'english',ngram_range=(1,3),lowercase=True, max_features=1000)
tdm_sparse = vec.fit_transform(data_2['title_clean'])
joblib.dump(vec, 'models/mp3/mp3_vec_v1.pkl')

['models/mp3/mp3_vec_v1.pkl',
 'models/mp3/mp3_vec_v1.pkl_01.npy',
 'models/mp3/mp3_vec_v1.pkl_02.npy']

In [155]:
# spot check tdm features

vec.get_feature_names()[100:105]

[u'band', u'battery', u'bk', u'black', u'black 32gb']

In [156]:
# convert to dense array for dimensionality reduction - might not be necessary

tdm_dense = tdm_sparse.toarray()
tdm_dense_df = pd.DataFrame(tdm_dense)
tdm_dense_df.head()

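| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0.229953 | 0.302025 | 0 | 0.325897 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |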


# step 3: reduce dimensionality

dimensionality reduction may be optional since ANN handles high dimensional feature vectors very well!

In [157]:
# FIT PCA - should skip this step since ANN can search high dimensional vectors

# reducer = PCA(n_components = 500)
# tdm_reduced = reducer.fit_transform(tdm_dense)
# joblib.dump(reducer, 'models/mp3/mp3_reducer_v1.pkl')

In [158]:
# PCA is skipped above, so the "reduced" matrix is simply the dense tdm
tdm_reduced_df = pd.DataFrame(tdm_dense_df)
tdm_reduced_df.shape

(10001, 1000)

# step 4: extract named entities

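extracting named entities helps offset a limitation of inverse weighting, which down-weights frequent terms even when they identify the product. named entities (approximated below by the first noun chunk of each title) are then captured in a separate, unweighted tdm.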

In [159]:
# instantiate parser from spacy.io (nltk didn't work well)

parser = English()

In [160]:
def extract_named_entities(sentence):
    '''
    function:
    ---------
    given a product title string, return its first noun chunk as a proxy for the named entity.
    a separate tdm will be fitted with these named entities.
    
    parameter:
    ----------
    @sentence: str, referring to product title string.
    
    returns:
    --------
    @named entity: str, referring to named entity in product title.
    '''
    
    try:
        sentence = unicode(sentence)
        doc = parser(sentence)
        # use the first noun chunk as a proxy for the named entity
        return [chunk.orth_ for chunk in doc.noun_chunks][0]
    
    # catch titles that cannot be converted into unicode or that yield no noun chunks
    except:
        return 'NA'

In [161]:
# create new column by mapping extract_named_entities() to titles

data_2['nouns'] = map(lambda x: extract_named_entities(x), data_2['title'])
data_2.head()


| | id | site | category | title | title_clean | nouns |
|---|---|---|---|---|---|---|
| 0 | 301074677039 | 0 | 73839 | Apple iPod nano 7th Generation Purple (16 GB) ... | Apple iPod nano Generation Purple GB Model | |
| 1 | 191049351783 | 0 | 73839 | APPLE IPOD TOUCH 16GB 4TH GEN WHITE MP3 PLAYER | IPOD TOUCH 16GB GEN WHITE MP3 PLAYER | IPOD TOUCH 16GB 4TH GEN WHITE MP3 PLAYER |
| 2 | 251402664662 | 0 | 73839 | Sport Sunglasses Headset Sun Glasses FOR IPHO... | Sport Sunglasses Headset Sun Glasses IPHONE SA... | IPHONE SAMSUNG HTC |
| 3 | 370975588476 | 0 | 73839 | DISNEY PARKS WHERE DREAMS COME TRUE MP3 PLAYER... | DISNEY PARKS DREAMS COME TRUE MP3 PLAYER USB F... | |
| 4 | 251398873497 | 0 | 73839 | Mp3 Player Sunglasses 8gb Black w/ Bluetooth b... | Mp3 Player Sunglasses Black w/ Bluetooth + Ext... | Black w/ Bluetooth |


In [162]:
# FIT NAMED ENTITY VECTORIZER

ne_vec = CountVectorizer(stop_words= None,ngram_range=(2,3),lowercase=True, max_features=200)
ne_tdm_sparse = ne_vec.fit_transform(data_2['nouns'])
ne_tdm_dense = ne_tdm_sparse.toarray()
ne_tdm_dense_df = pd.DataFrame(ne_tdm_dense)
ne_tdm_reduced_df = pd.DataFrame(ne_tdm_dense_df)
joblib.dump(ne_vec, 'models/mp3/mp3_ne_vec_v1.pkl')
ne_tdm_reduced_df.shape

(10001, 200)

# step 5: merge dataframes

in this section, i merge the product tdm (with inverse weighting) with the named entity tdm. as a result, each item is represented as a 1,200-dimensional vector (1,000 weighted ngram features plus 200 named entity counts).

In [163]:
features = pd.concat([tdm_reduced_df, ne_tdm_reduced_df], axis = 1)
features_df = pd.DataFrame(features)
features_df.shape

(10001, 1200)
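the merge is a plain column-wise concatenation of two row-aligned matrices. a minimal sketch with lists of lists, assuming both matrices hold one row per item in the same order (the helper `concat_columns` and the toy values are hypothetical):

```python
def concat_columns(left_rows, right_rows):
    """Concatenate two row-aligned matrices (lists of lists) column-wise."""
    if len(left_rows) != len(right_rows):
        raise ValueError("both matrices need one row per item")
    return [list(left) + list(right) for left, right in zip(left_rows, right_rows)]

# hypothetical toy matrices: 3 weighted ngram features and 2 named entity counts per item
tfidf_rows = [[0.0, 0.7, 0.7], [1.0, 0.0, 0.0]]
entity_rows = [[1, 0], [0, 2]]

merged = concat_columns(tfidf_rows, entity_rows)
print(len(merged))     # number of items is unchanged
print(len(merged[0]))  # feature count is the sum of both widths
```

one thing to keep in mind with `pd.concat(..., axis = 1)` on two frames that both use default integer column labels: the labels 0 to 199 appear twice in the merged frame, so positional access is safer than label access on its columns.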

# step 6: ANN

ANN is used to search high dimensional feature vectors.

### 6a: populate ANN

In [164]:
ann = AnnoyIndex(features_df.shape[1])

# populate ANN with each item 
for i in range(len(features_df)):
    
    vector = features_df.iloc[i].tolist() # positional lookup, matching the ANN item number
    ann.add_item(i, vector) 
    
ann.build(10) # build a forest of 10 trees; more trees improve precision but enlarge the index

True
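since ANN results are approximate, it can help to compare them against an exact search on a small sample. below is a minimal brute-force sketch in pure python that ranks rows by cosine similarity. it is not part of the original pipeline: the helpers `cosine` and `exact_neighbors` and the toy vectors are assumptions for illustration. annoy's angular distance is, as far as its documentation describes it, a monotone function of cosine similarity, so the two rankings should be comparable.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def exact_neighbors(vectors, query_row, k):
    """Return the k row indices with the highest cosine similarity to query_row.

    The query row is scored as well, so it is normally part of the result.
    """
    scored = [(cosine(vectors[query_row], vec), row) for row, vec in enumerate(vectors)]
    # sort by similarity descending, breaking ties by row index
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [row for _, row in scored[:k]]

# hypothetical 3-dimensional feature vectors
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(exact_neighbors(toy_vectors, 0, 2))
```

this scan is linear in the number of items per query, which is exactly the cost the ANN index is meant to avoid, so it is only practical as a spot check.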

### 6b: inspect top 10 nearest neighbors

In [165]:
def retrieve_nearest_neighbors(query_idx):
    '''
    function:
    ---------
    given an item ID, print its 10 closest neighbors according to the pre-defined ANN.
    
    parameters:
    -----------
    @query_idx: int, representing the 12 digit item ID
    
    '''
    
    row_idx = np.where(data_2['id'] == query_idx)[0][0] # look up the row position for this item ID
    
    neighbors = ann.get_nns_by_item(row_idx,11) # request 11 neighbors since the first one is the query item itself
    
    for n in neighbors:
        print data_2['title'].iloc[n], cosine_similarity(features_df.iloc[row_idx], features_df.iloc[n])[0][0]
        print '-' * 50

In [166]:
# test case

retrieve_nearest_neighbors(301074677039)

Apple iPod nano 7th Generation Purple (16 GB) (Latest Model) 1.0
--------------------------------------------------
Apple iPod nano 7th Generation Purple (16 GB) (Latest Model) 1.0
--------------------------------------------------
Very NICE!!  Apple iPod nano 7th Generation Purple (16 GB) (Latest Model)  1.0
--------------------------------------------------
Apple iPod nano 7th Generation Purple (16 GB) (Latest Model) 1.0
--------------------------------------------------
Apple iPod nano 7th Generation Purple (16 GB) (Latest Model) 1.0
--------------------------------------------------
Apple iPod nano 7th Generation Purple (16 GB) (Latest Model) 1.0
--------------------------------------------------
Apple iPod nano 7th Generation Purple (16 GB) (Latest Model) 1.0
--------------------------------------------------
Apple iPod nano 7th Generation Purple (16 GB) (Latest Model) 1.0
--------------------------------------------------
Apple iPod nano 7th Generation Purple (16 GB) (Latest Mode

# step 7: store nearest neighbors to csv

for each record, find and store its nearest neighbors.

In [167]:
def find_nearest_neighbors(row_id, num_neighbors):
    
    '''
    function:
    ---------
    given a row index, return its nearest neighbors as a single space-delimited string of
    item IDs. this version takes a row index; accepting an item ID is left for a later version.
    
    returning one string per item makes it easier to construct a tdm of neighbors
    for clustering purposes.
    
    parameters:
    -----------
    @row_id: int, referring to item's row index in the dataframe.
    @num_neighbors: int, number of neighbors to request from ANN. the query item itself is
    returned first and dropped, so 11 yields 10 neighbors.
    
    returns:
    --------
    @neighbors: str, space-delimited item IDs of the nearest neighbors.
    '''
    neighbors_by_itemID = map(lambda x: rowID_to_itemID(x), ann.get_nns_by_item(row_id,num_neighbors))
    
    # convert item IDs to str
    neighbors = map(lambda x: str(x), neighbors_by_itemID) 
    
    neighbors.pop(0) # remove the first element, which is the query item
    
    neighbors = ' '.join(neighbors)
    
    return neighbors


def rowID_to_itemID(row_id):
    '''
    helper function for converting dataframe index to item ID
    '''
    return data_2['id'][row_id]

In [168]:
# create new column called "nearest neighbors" by mapping find_nearest_neighbors() to index ID

data_2['nearest_neighbors'] = map(lambda x: find_nearest_neighbors(x, 11), data_2.index)
data_2.head()


| | id | site | category | title | title_clean | nouns | nearest_neighbors |
|---|---|---|---|---|---|---|---|
| 0 | 301074677039 | 0 | 73839 | Apple iPod nano 7th Generation Purple (16 GB) ... | Apple iPod nano Generation Purple GB Model | | 261381459395 321303260713 281252260666 2513495... |
| 1 | 191049351783 | 0 | 73839 | APPLE IPOD TOUCH 16GB 4TH GEN WHITE MP3 PLAYER | IPOD TOUCH 16GB GEN WHITE MP3 PLAYER | IPOD TOUCH 16GB 4TH GEN WHITE MP3 PLAYER | 191049423545 201025866125 380825954957 3108546... |
| 2 | 251402664662 | 0 | 73839 | Sport Sunglasses Headset Sun Glasses FOR IPHO... | Sport Sunglasses Headset Sun Glasses IPHONE SA... | IPHONE SAMSUNG HTC | 181231327891 331097733903 321184323358 4006496... |
| 3 | 370975588476 | 0 | 73839 | DISNEY PARKS WHERE DREAMS COME TRUE MP3 PLAYER... | DISNEY PARKS DREAMS COME TRUE MP3 PLAYER USB F... | | 251340354504 281228373930 151216417741 3709557... |
| 4 | 251398873497 | 0 | 73839 | Mp3 Player Sunglasses 8gb Black w/ Bluetooth b... | Mp3 Player Sunglasses Black w/ Bluetooth + Ext... | Black w/ Bluetooth | 291064340570 251430628863 251430613032 2910612... |


In [169]:
# store data in csv format

data_2['nearest_neighbors'].to_csv('output/mp3_neighbors.csv')

# step 8: cluster

clustering was accomplished via kmeans. 

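a tdm over each item's 10 nearest neighbors was generated, treating every neighbor's item ID as a token. since the vectorizer is capped at max_features=1000, each item was represented by a 1000-dimensional count vector, where each feature corresponded to one of the 1000 most frequently occurring neighbor IDs.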
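the idea of turning neighbor lists into count vectors can be sketched without scikit-learn. the helpers `fit_vocabulary` and `to_count_vector` and the short IDs below are hypothetical; the alphabetical tie-breaking is an assumption of this sketch and may differ from what `CountVectorizer` does when several IDs are equally frequent.

```python
from collections import Counter

def fit_vocabulary(neighbor_strings, max_features):
    """Keep the max_features most frequent IDs and assign each one a column."""
    counts = Counter()
    for text in neighbor_strings:
        counts.update(text.split())
    # most frequent first; ties broken alphabetically (an assumption of this sketch)
    top = sorted(counts, key=lambda tok: (-counts[tok], tok))[:max_features]
    return {tok: col for col, tok in enumerate(sorted(top))}

def to_count_vector(neighbor_string, vocabulary):
    """Count how often each vocabulary ID occurs in one space-delimited neighbor string."""
    vec = [0] * len(vocabulary)
    for tok in neighbor_string.split():
        col = vocabulary.get(tok)
        if col is not None:  # IDs outside the vocabulary are ignored
            vec[col] += 1
    return vec

# hypothetical neighbor strings using short made-up IDs
rows = ["101 102 103", "101 102 104", "105 106 101"]
vocab = fit_vocabulary(rows, max_features=2)

print(vocab)
print([to_count_vector(r, vocab) for r in rows])
```

items that share many neighbors end up with similar vectors, which is what lets k-means group them. an item whose neighbors all fall outside the retained vocabulary becomes an all-zero vector, so the cap on features is worth revisiting for larger subsets.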

In [170]:
neighbors_vec = CountVectorizer(max_features=1000)

In [171]:
# fit tdm to 10 nearest neighbors

neighbors_tdm_sparse = neighbors_vec.fit_transform(data_2['nearest_neighbors'])
neighbors_tdm_dense = neighbors_tdm_sparse.toarray()
neighbors_tdm_dense_df = pd.DataFrame(neighbors_tdm_dense)
neighbors_tdm_dense_df.shape

(10001, 1000)

In [172]:
model = KMeans(n_clusters=50, init='k-means++', max_iter=100, n_init=1)
model.fit(neighbors_tdm_dense_df)

KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=50, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [173]:
categories = model.predict(neighbors_tdm_dense_df)
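for intuition about what `fit` and `predict` do, here is a minimal pure-python sketch of the assign/update loop (Lloyd's algorithm) that k-means is based on. it is not the scikit-learn implementation: the starting centroids are passed in by hand instead of being chosen with k-means++, and all names and toy points are assumptions for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its points; empty clusters stay where they are."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append([sum(dim) / float(len(members)) for dim in zip(*members)])
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate update and assign steps until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# hypothetical 2-dimensional points forming two obvious groups
toy_points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(toy_points, centroids=[[0.0, 0.0], [10.0, 10.0]])

print(labels)
print(centroids)
```

because the result depends on the starting centroids, a single initialization (n_init=1 above) can land in a poor local optimum; running several initializations and keeping the best is a common safeguard.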

In [176]:
# inspect clusters

cluster_idx = 19

selectors = np.where(categories == cluster_idx)[0] # row positions assigned to this cluster

print data_2['title'].iloc[selectors]

48      Microsoft Zune Black (120 GB) Digital Media Pl...
750     Microsoft Zune 120 Black (120 GB) Digital Medi...
962     Microsoft Zune 30 Black (30 GB) Digital Media ...
1299    Microsoft Zune 30 Black (30 GB) Digital Media ...
1617    Microsoft Zune Black (120 GB) Digital Media Pl...
1744    ****FACTORY SEALED**** Microsoft Zune Black (8...
2739                 Microsoft Zune 16GB Black USA Seller
2973    Microsoft Zune 30 Black (30 GB) Digital Media ...
2998    Microsoft Zune 80 Black (80 GB) Media Player &...
3264    Microsoft Zune Black (120 GB) Digital Media Pl...
3533    Microsoft Zune 80 Black (80 GB) Digital Media ...
3987    Microsoft Zune 30 Black (30 GB) Digital Media ...
4498    Bundled Orig Microsoft Zune 30 Black (30 GB) D...
4802    Microsoft Zune Black (120 GB) Digital Media Pl...
5335    Microsoft Zune 30 Black (30 GB) Digital Media ...
6752    Microsoft Zune 8 Black 8 GB 8GB FM Wi-Fi Digit...
7166    Microsoft Zune 80 Black (80 GB) Digital Media ...
7662    Micros