# Overview

problem: (1) for each item in one of four categories (cases, laptops, mp3, and cellphones), find its nearest neighbors; (2) cluster the items. this notebook works on the laptops category.

this notebook walks through a multi-step pipeline that (1) preprocesses titles (e.g. extracts nouns and named entities); (2) generates term-document matrices (tdm); (3) indexes the feature vectors with approximate nearest neighbor (ANN) search; and (4) clusters items via k-means.

# step 0: dataset

In [7]:
from sklearn.externals import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.decomposition import PCA
from spacy.en import English
from sklearn.feature_extraction.text import CountVectorizer
from annoy import AnnoyIndex
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

In [8]:
data = pd.read_table('data/laptops 177.txt', header = None)
data.columns = ['id', 'site', 'category', 'title']
data.head()

Unnamed: 0,id,site,category,title
0,251432713573,0,177,LENOVO THINKPAD T500/CORE 2 DUO T9400/2.53GHZ/...
1,360826556319,0,177,"HP Compaq 6715b Turion 64x2 1.8Ghz 512MB 15"" W..."
2,181185253117,0,177,Lenovo ThinkPad Edge E430 3254-ACU Notebook PC...
3,271331653100,0,177,"HP F0Q65UA 15.6"" LED Notebook,AMD A4-5000 1.5G..."
4,111221678527,0,177,CANDY PINK Dell Latitude D630 C2D 2.4GHz 4GB R...


In [9]:
data.shape

(56638, 4)

In [10]:
data_2 = data.loc[:1000] # label-based slice, end label inclusive, hence 1001 rows (.ix is deprecated)
data_2.shape

(1001, 4)

# step 1: text preprocessing

since we're generating a term-document matrix from product-centric titles, let's focus on nouns.

In [11]:
def extract_nouns_from_title(title):
    '''
    function:
    ---------
    extract nouns from a product title. by doing so, we effectively fit tdm to the most
    relevant aspects of the title string.
    
    parameter:
    ----------
    @title: str, referring to product title.
    
    returns:
    --------
    @title_nouns_only: str, referring to product title with nouns only. this output will be 
    used to fit a term document matrix.
    '''
    
    try:
        title = unicode(title)
        parsedData = parser(title)
        
        # keep only the first sentence of the title (the loop breaks after one span)
        for span in parsedData.sents:
            sent = [parsedData[i] for i in range(span.start, span.end)]
            break
        
        # collect tokens only if they are nouns
        title_nouns_only = [token.orth_ for token in sent if token.pos_ == 'NOUN']
        return " ".join(title_nouns_only)
    
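    except:
        # fall back to the raw title, e.g. when it cannot be recast to unicode.
        # note: a bare except also hides any other error (such as `parser` not
        # being defined yet; it is instantiated in step 4).
        return title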

In [12]:
# create new column by mapping extract_nouns_from_title() to titles

data_2['title_clean'] = map(lambda x: extract_nouns_from_title(x), data_2['title'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


# step 2: n-grams & tdm

generate term document matrix of ngrams using "clean" titles. in this specific example, i'll extract unigrams, bigrams and trigrams.
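
to make the n-gram range concrete, here is a minimal pure-python sketch (written for python 3) of what unigrams, bigrams and trigrams look like for one title. the helper `word_ngrams` is hypothetical and not part of scikit-learn; it also skips the stop-word removal and tf-idf weighting that the vectorizer applies.

```python
def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max, lowercased."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # slide a window of n tokens across the title
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


example = word_ngrams("Lenovo ThinkPad Edge Notebook")
print(example)
```

a four-word title yields 4 unigrams, 3 bigrams and 2 trigrams, which is why `max_features` is used to cap the vocabulary.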

In [13]:
# PRE-FITTED VECTORIZER IF IT EXISTS

# vec = joblib.load('models/mp3/mp3_vec_v1.pkl')
# tdm_sparse = vec.transform(data_2['title_clean'])

In [14]:
# FIT NEW VECTORIZER

vec = TfidfVectorizer(stop_words= 'english',ngram_range=(1,3),lowercase=True, max_features=1000)
tdm_sparse = vec.fit_transform(data_2['title_clean'])
joblib.dump(vec, 'models/laptops/laptops_vec_v1.pkl')

['models/laptops/laptops_vec_v1.pkl',
 'models/laptops/laptops_vec_v1.pkl_01.npy',
 'models/laptops/laptops_vec_v1.pkl_02.npy']

In [15]:
# spot check tdm features
# note: max_features=1000, so a slice that starts at index 1000 comes back empty

vec.get_feature_names()[1000:1005]

[]

In [16]:
# convert to dense array for dimensionality reduction

tdm_dense = tdm_sparse.toarray()
tdm_dense_df = pd.DataFrame(tdm_dense)
tdm_dense_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# step 3: reduce dimensionality

In [17]:
# FIT PCA - should skip this step since ANN can search high dimensional vectors

# reducer = PCA(n_components = 500)
# tdm_reduced = reducer.fit_transform(tdm_dense)
# joblib.dump(reducer, 'models/mp3/mp3_reducer_v1.pkl')

In [18]:
# LOAD PRE-FITTED REDUCER

# reducer = joblib.load('models/mp3/mp3_reducer_v1.pkl')
# tdm_reduced = reducer.transform(tdm_dense)

In [19]:
tdm_reduced_df = pd.DataFrame(tdm_dense_df)
tdm_reduced_df.shape

(1001, 1000)

# step 4: extract named entities

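extracting named entities is meant to offset limitations of inverse-document-frequency (tf-idf) weighting: the entities get their own count-based tdm, which is not idf-weighted. note that the function below returns the first noun chunk of each title as a proxy for the named entity.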
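
one limitation of inverse weighting, sketched under assumptions: idf down-weights terms that occur in most titles, even when they are informative. the python 3 snippet below applies the smoothed idf formula that scikit-learn's `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1, to made-up document frequencies. the counts are illustrative and not taken from the laptops data, and `smoothed_idf` is a hypothetical helper.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0


n_docs = 1000                        # hypothetical corpus size
common = smoothed_idf(n_docs, 900)   # term present in 900 of 1000 titles
rare = smoothed_idf(n_docs, 5)       # term present in 5 of 1000 titles

print(round(common, 3), round(rare, 3))
```

the frequent term ends up with a much smaller weight than the rare one, which is the behavior a separate count-based matrix is meant to counterbalance.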

In [20]:
# instantiate parser from spaCy (spacy.io); nltk didn't work well

parser = English()

In [21]:
def extract_named_entities(sentence):
    '''
    function:
    ---------
    given a product title string, return its first noun chunk, which is used here as a
    stand-in for the named entity. a separate tdm will be fitted with these.
    
    parameter:
    ----------
    @sentence: str, referring to product title string.
    
    returns:
    --------
    @named_entity: str, referring to the first noun chunk in product title, or 'NA' on failure.
    '''
    
    try:
        sentence = unicode(sentence)
        doc = parser(sentence)
        return [chunk.orth_ for chunk in doc.noun_chunks][0]
    
    # catch titles that cannot be converted into unicode or that contain no noun chunk
    except:
        return 'NA'

In [22]:
# create new column by mapping extract_named_entities() to titles
# (the column is called 'nouns' but holds the extracted entity)

data_2['nouns'] = map(lambda x: extract_named_entities(x), data_2['title'])
data_2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


Unnamed: 0,id,site,category,title,title_clean,nouns
0,251432713573,0,177,LENOVO THINKPAD T500/CORE 2 DUO T9400/2.53GHZ/...,LENOVO THINKPAD T500/CORE 2 DUO T9400/2.53GHZ/...,
1,360826556319,0,177,"HP Compaq 6715b Turion 64x2 1.8Ghz 512MB 15"" W...","HP Compaq 6715b Turion 64x2 1.8Ghz 512MB 15"" W...",
2,181185253117,0,177,Lenovo ThinkPad Edge E430 3254-ACU Notebook PC...,Lenovo ThinkPad Edge E430 3254-ACU Notebook PC...,
3,271331653100,0,177,"HP F0Q65UA 15.6"" LED Notebook,AMD A4-5000 1.5G...","HP F0Q65UA 15.6"" LED Notebook,AMD A4-5000 1.5G...","HDD,DVD-W,Win8"
4,111221678527,0,177,CANDY PINK Dell Latitude D630 C2D 2.4GHz 4GB R...,CANDY PINK Dell Latitude D630 C2D 2.4GHz 4GB R...,


In [23]:
# FIT NAMED ENTITY VECTORIZER
# note: the dump below writes to the mp3 model path even though this is the laptops data

ne_vec = CountVectorizer(stop_words= None,ngram_range=(2,3),lowercase=True, max_features=200)
ne_tdm_sparse = ne_vec.fit_transform(data_2['nouns'])
joblib.dump(ne_vec, 'models/mp3/mp3_ne_vec_v1.pkl')

['models/mp3/mp3_ne_vec_v1.pkl']

In [24]:
# USE PRE-FITTED VECTORIZER

# ne_vec = joblib.load('models/mp3/mp3_ne_vec_v1.pkl')
# ne_tdm_sparse = ne_vec.transform(data_2['nouns'])

In [25]:
ne_tdm_dense = ne_tdm_sparse.toarray()
ne_tdm_dense_df = pd.DataFrame(ne_tdm_dense)

ne_tdm_dense_df.shape

(1001, 200)

In [26]:
# FIT NEW PCA

# ne_reducer = PCA(n_components = 500)
#ne_tdm_reduced = ne_reducer.fit_transform(ne_tdm_dense)
#joblib.dump(ne_reducer, 'models/laptops/laptops_ne_reducer_v1.pkl')

In [27]:
# LOAD FITTED PCA
# ne_reducer = joblib.load('models/mp3/mp3_reducer_v1.pkl')
# ne_tdm_reduced = ne_reducer.transform(ne_tdm_dense)

In [28]:
ne_tdm_reduced_df = pd.DataFrame(ne_tdm_dense_df)
ne_tdm_reduced_df.shape

(1001, 200)

# step 5: merge dataframes

in this section, i merge the product tdm (tf-idf weighted) with the named entity tdm (raw counts) column-wise.

In [29]:
features = pd.concat([tdm_reduced_df, ne_tdm_reduced_df], axis = 1)
features_df = pd.DataFrame(features)
features_df.shape

(1001, 1200)

# step 6: instantiate ANN

i use an approximate nearest neighbor (ANN) index, built with annoy, to impose structure on the items. the index reduces the search space for downstream nearest neighbor queries.
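
for comparison, an exact nearest neighbor search scores the query against every other item, so each query costs one similarity computation per item. an ANN index avoids most of that work at the price of approximate results. a minimal brute-force sketch in pure python 3, where `cosine`, `exact_neighbors` and the toy vectors are hypothetical:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def exact_neighbors(query_idx, vectors, k):
    """Rank every other item by cosine similarity to the query; return the top k indices."""
    scored = [(cosine(vectors[query_idx], vec), idx)
              for idx, vec in enumerate(vectors) if idx != query_idx]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
print(exact_neighbors(0, toy_vectors, k=2))
```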

### 6a: populate ANN

In [31]:
ann = AnnoyIndex(features_df.shape[1])

# populate ANN with each item 
for i in range(len(features_df)):
    
    vector = features_df.iloc[i].tolist() # positional row lookup (.ix is deprecated)
    ann.add_item(i, vector) 
    
ann.build(10) # build a forest of 10 trees

True
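
annoy's `angular` metric (what `AnnoyIndex` falls back to when no metric is passed) is the euclidean distance between length-normalized vectors, i.e. sqrt(2 * (1 - cos)). if distances are requested from the index (for example via `include_distances=True`), they can be mapped back to cosine similarity. a small python 3 sketch of that conversion; the helper names are hypothetical:

```python
import math


def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


def cosine_to_angular(similarity):
    """Angular distance for a given cosine similarity."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - similarity)))


print(angular_to_cosine(0.0))             # vectors pointing the same way
print(angular_to_cosine(math.sqrt(2.0)))  # orthogonal vectors, close to zero
```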

### 6b: inspect top 10 nearest neighbors

In [40]:
def retrieve_nearest_neighbors(query_idx):
    '''
    function:
    ---------
    given an item ID, print its 10 closest neighbors according to the pre-defined ANN.
    
    parameters:
    -----------
    @query_idx: int, representing the 12 digit item ID
    
    '''
    
    row_idx = np.where(data_2['id'] == query_idx)[0][0] # look up the row position for this item ID (the ANN only holds data_2)
    
    neighbors = ann.get_nns_by_item(row_idx,11) # request 11 neighbors since the first one is the query item itself
    
    for n in neighbors:
        # double brackets keep each row 2-dimensional, as cosine_similarity expects
        print data_2['title'].iloc[n], cosine_similarity(features_df.iloc[[row_idx]], features_df.iloc[[n]])[0][0]
        print '-' * 50

In [41]:
# test case

retrieve_nearest_neighbors(251432713573)

LENOVO THINKPAD T500/CORE 2 DUO T9400/2.53GHZ/4GB RAM/160GB HD item# SKU 2089881 1.0
--------------------------------------------------
LENOVO THINKPAD T400/ CORE 2 DUO 2.40GHZ/4GB RAM/160GB HD item# SKU 2103911 0.71370073873
--------------------------------------------------
Dell Latitude E6400 Laptop 2.53ghz Core 2 Duo 4GB Ram 160GB HD DVD-RW Win 7 PRO 0.569046782221
--------------------------------------------------
PANASONIC TOUGHBOOK CF30/ WIFI/TOUCHSCREEN  / 3GB RAM/ 160GB HD/ WIN 8/  0.413448865023
--------------------------------------------------
Dell Latitude E4300 13.3" Laptop Core 2 Duo P9600 2.53GHz 4GB 160GB *No Battery* 0.330585679128
--------------------------------------------------
IBM ThinkPad X60 12.1" Core Duo 1.6GHz 1GB RAM 160GB HDD Tablet WiFi 0.267544149274
--------------------------------------------------
Lenovo ThinkPad T400 Intel Core2Duo T9400 1GB 2.53GHz 0.267201647908
--------------------------------------------------
IBM THINKPAD LENOVO T400 Core 2 DUO 

# step 7: store nearest neighbors to csv

for each record, find and store its nearest neighbors.

In [45]:
def find_nearest_neighbors(row_id, num_neighbors):
    
    '''
    function:
    ---------
    given a row index, return its nearest neighbors. the query item itself is dropped, so
    num_neighbors = 11 yields 10 neighbors. this version uses the row index but
    should be converted to accept item id. next time.
    
    returns a single space-separated string of item IDs. it'll make it easier to construct a tdm
    of neighbors for clustering purposes.
    
    parameters:
    -----------
    @row_id: int, referring to item's index id in the dataframe.
    @num_neighbors: int, number of neighbors to request from the ANN, including the query item.
    
    returns:
    --------
    @neighbors: str, space-separated item IDs of the nearest neighbors.
    '''
    neighbors_by_itemID = map(lambda x: rowID_to_itemID(x), ann.get_nns_by_item(row_id,num_neighbors))
    
    # convert item IDs to str
    neighbors = map(lambda x: str(x), neighbors_by_itemID) 
    
    neighbors.pop(0) # remove the first element, which is the query item (duplicate)
    
    neighbors = ' '.join(neighbors)
    
    return neighbors


def rowID_to_itemID(row_id):
    '''
    helper function for converting dataframe index to item ID
    '''
    return data_2['id'][row_id]
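
the cell above relies on python 2, where `map` returns a list; under python 3 `map` returns an iterator and `.pop(0)` fails. a hedged python 3 sketch of the same formatting step, with the ANN lookup replaced by a plain list so that it runs on its own. `format_neighbors` is a hypothetical name, the IDs are the first three from the table above, and the neighbor order is made up. like the original, it assumes the query item comes back first.

```python
def format_neighbors(neighbor_rows, row_to_item_id):
    """Turn ANN row indices into a space-separated string of item IDs.

    neighbor_rows: row indices as returned by the index, query item first.
    row_to_item_id: mapping from row index to item ID.
    """
    item_ids = [str(row_to_item_id[row]) for row in neighbor_rows]
    return " ".join(item_ids[1:])  # drop the first entry (the query item)


toy_ids = {0: 251432713573, 1: 360826556319, 2: 181185253117}
print(format_neighbors([0, 2, 1], toy_ids))
```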

In [46]:
# create new column called "nearest neighbors" by mapping find_nearest_neighbors() to index ID

data_2['nearest_neighbors'] = map(lambda x: find_nearest_neighbors(x, 11), data_2.index)
data_2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


Unnamed: 0,id,site,category,title,title_clean,nouns,nearest_neighbors
0,251432713573,0,177,LENOVO THINKPAD T500/CORE 2 DUO T9400/2.53GHZ/...,LENOVO THINKPAD T500/CORE 2 DUO T9400/2.53GHZ/...,,121261643802 121261069573 301001430347 1411734...
1,360826556319,0,177,"HP Compaq 6715b Turion 64x2 1.8Ghz 512MB 15"" W...","HP Compaq 6715b Turion 64x2 1.8Ghz 512MB 15"" W...",,151215670348 201024236281 111241032287 2513157...
2,181185253117,0,177,Lenovo ThinkPad Edge E430 3254-ACU Notebook PC...,Lenovo ThinkPad Edge E430 3254-ACU Notebook PC...,,301032497619 181144309508 191029299535 2713745...
3,271331653100,0,177,"HP F0Q65UA 15.6"" LED Notebook,AMD A4-5000 1.5G...","HP F0Q65UA 15.6"" LED Notebook,AMD A4-5000 1.5G...","HDD,DVD-W,Win8",390740843954 400620260842 370987892664 3807681...
4,111221678527,0,177,CANDY PINK Dell Latitude D630 C2D 2.4GHz 4GB R...,CANDY PINK Dell Latitude D630 C2D 2.4GHz 4GB R...,,121203685985 121261069573 301079281716 3006943...


In [48]:
# store data in csv format

data_2['nearest_neighbors'].to_csv('output/laptops/laptops_neighbors.csv')

# step 8: cluster

clustering was done via k-means.

a tdm of each item's 10 nearest neighbors was generated. each item is therefore represented by a vector of at most 1000 dimensions (`max_features=1000`; 977 in this run), where each feature corresponds to a neighboring item ID.
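
the representation can be sketched without scikit-learn: every distinct neighbor ID becomes one column, and an item's row counts which IDs appear among its neighbors. the number of columns therefore equals the number of distinct IDs seen, up to the `max_features` cap. in this python 3 sketch, `neighbor_matrix` and the short IDs are hypothetical:

```python
def neighbor_matrix(neighbor_strings):
    """Build a count matrix with one column per distinct neighbor ID."""
    vocabulary = sorted({tok for s in neighbor_strings for tok in s.split()})
    column = {tok: j for j, tok in enumerate(vocabulary)}
    rows = []
    for s in neighbor_strings:
        row = [0] * len(vocabulary)
        for tok in s.split():
            row[column[tok]] += 1
        rows.append(row)
    return vocabulary, rows


vocab, matrix = neighbor_matrix(["101 102", "102 103", "101 103"])
print(vocab)
for row in matrix:
    print(row)
```

items that share many neighbors get similar rows, which is what lets k-means group them.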

In [54]:
neighbors_vec = CountVectorizer(max_features=1000)

In [55]:
# fit tdm to 10 nearest neighbors

neighbors_tdm_sparse = neighbors_vec.fit_transform(data_2['nearest_neighbors'])
neighbors_tdm_dense = neighbors_tdm_sparse.toarray()
neighbors_tdm_dense_df = pd.DataFrame(neighbors_tdm_dense)
neighbors_tdm_dense_df.shape

(1001, 977)

In [56]:
model = KMeans(n_clusters=50, init='k-means++', max_iter=100, n_init=1)
model.fit(neighbors_tdm_dense_df)

KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=50, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
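
for readers unfamiliar with what `KMeans` does, here is a bare-bones python 3 sketch of lloyd's algorithm: assign each point to its closest centroid, move each centroid to the mean of its points, and repeat until assignments stop changing. it is not scikit-learn's implementation. it seeds centroids with the first k points instead of k-means++, and `lloyd_kmeans` is a hypothetical name.

```python
def squared_distance(p, q):
    """Squared euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; centroids start as the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # assignment step: closest centroid for every point
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


toy_points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
labels, centroids = lloyd_kmeans(toy_points, k=2)
print(labels)
```

note that with `random_state=None` and `n_init=1`, as in the cell above, cluster numbers such as `cluster_idx = 10` are not reproducible from one run to the next.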

In [57]:
categories = model.predict(neighbors_tdm_dense_df)

In [59]:
# inspect clusters

cluster_idx = 10

# np.where returns a tuple; its first element holds the matching row positions
selectors = np.where(categories == cluster_idx)[0]

print data_2['title'][selectors]

29                                  Compaq Presario 1900
33     Compaq Presario F500 Laptop w/ Battery, AMD Se...
121                COMPAQ PRESARIO MODEL 1700T  WORKING 
206    Compaq Presario CQ60-224NR Dual Core 2.16GHz/3...
238    Compaq Presario V6120US 15.4" Notebook Windows...
420    Compaq Presario CQ56-219WM Windows 7 Home Prem...
544    Compaq Presario V2000 CELERON 1.40GHZ 512MB NO...
609    HP Compaq Presario CQ60-211 15.6" Notebook - C...
631                    Compaq Presario M2000 (For Parts)
697    Compaq Presario CQ57-339WM Intel Celeron Proce...
873    Compaq Presario X1000 Laptop *Boots to Bios* F...
930    HP Compaq Presario F750US 15.4" (160 GB, AMD A...
942                      Compaq 12" Presario 1245 Laptop
984    COMPAQ PRESARIO C700/ INTEL PENTIUM DUAL/ 1.40...
Name: title, dtype: object
