On Day 1 we used many core Python functions to process and tokenize the text data. That approach is overly complex and time-consuming for real data.

We would prefer to reuse existing work in this area. NLTK is one library that is designed to simplify working with text. It has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

The Python 2 edition of the NLTK book is available [here](http://nltk.org/book_1ed):

Bird, Steven, Edward Loper and Ewan Klein (2009), _Natural Language Processing with Python._ O’Reilly Media Inc.

Before we can get going, we need to download the NLTK resources:

In [1]:
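import nltk
# Uncomment on first run to open the interactive NLTK downloader.
# This notebook needs the 'punkt' tokenizer models and the 'stopwords' corpus.
#nltk.download()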

Now we can load our bike items from Day 1:

In [2]:
import csv
# Read the first column of each row, skipping the first two rows of the file.
with open('data/bike-items.txt') as f:
    r = csv.reader(f, delimiter=',', quotechar='"')
    items = [line[0] for i, line in enumerate(r) if i > 1]
print(items[0:5])

['Cycling Bicycle MTB Bike Fixie Gloss 3K Carbon Fiber Riser Bar Handlebar 31.8mm', 'BICYCLE RIMS 26"x 50MM RED 3 SPEED INTERNAL HUB WHEEL SET BEACH CRUISER BIKE', 'Mavic Crossride 26" Mountain bike wheels and WTB Weirwolf Tires', 'New KCNC ARROW 7050 Alloy Stem , 31.8x150mm , 149g , Black', 'ROTOR QXL Aero Oval Road Chainring BCD110x5 53t']


## Tokenizing

In [3]:
from nltk.tokenize import word_tokenize
items_t = [ word_tokenize(item) for item in items]
print(items_t[0:5])

[['Cycling', 'Bicycle', 'MTB', 'Bike', 'Fixie', 'Gloss', '3K', 'Carbon', 'Fiber', 'Riser', 'Bar', 'Handlebar', '31.8mm'], ['BICYCLE', 'RIMS', '26', "''", 'x', '50MM', 'RED', '3', 'SPEED', 'INTERNAL', 'HUB', 'WHEEL', 'SET', 'BEACH', 'CRUISER', 'BIKE'], ['Mavic', 'Crossride', '26', "''", 'Mountain', 'bike', 'wheels', 'and', 'WTB', 'Weirwolf', 'Tires'], ['New', 'KCNC', 'ARROW', '7050', 'Alloy', 'Stem', ',', '31.8x150mm', ',', '149g', ',', 'Black'], ['ROTOR', 'QXL', 'Aero', 'Oval', 'Road', 'Chainring', 'BCD110x5', '53t']]
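
Note in the output above that `word_tokenize` turns the inch mark in `26"` into a separate `''` token and keeps commas as tokens. If that punctuation is unwanted in item titles, one option is a regular-expression tokenizer. The following is a minimal sketch using NLTK's `RegexpTokenizer`; the pattern and the name `title_tokenizer` are illustrative choices, not part of the original notebook.

```python
from nltk.tokenize import RegexpTokenizer

# Illustrative pattern: runs of word characters, optionally joined by dots,
# so that measurements such as '31.8x150mm' stay in one piece.
title_tokenizer = RegexpTokenizer(r'\w+(?:\.\w+)*')

titles = [
    'Mavic Crossride 26" Mountain bike wheels and WTB Weirwolf Tires',
    'New KCNC ARROW 7050 Alloy Stem , 31.8x150mm , 149g , Black',
]
tokens = [title_tokenizer.tokenize(title) for title in titles]
for t in tokens:
    print(t)
```

This tokenizer needs no downloaded models, but it also drops every punctuation character, which may or may not suit a given data set.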


## Stopword Removal

Some words carry little meaning on their own, such as:

In [11]:
from nltk.corpus import stopwords
s = set(stopwords.words('english'))
s

{u'a',
 u'about',
 u'above',
 u'after',
 u'again',
 u'against',
 u'ain',
 u'all',
 u'am',
 u'an',
 u'and',
 u'any',
 u'are',
 u'aren',
 u'as',
 u'at',
 u'be',
 u'because',
 u'been',
 u'before',
 u'being',
 u'below',
 u'between',
 u'both',
 u'but',
 u'by',
 u'can',
 u'couldn',
 u'd',
 u'did',
 u'didn',
 u'do',
 u'does',
 u'doesn',
 u'doing',
 u'don',
 u'down',
 u'during',
 u'each',
 u'few',
 u'for',
 u'from',
 u'further',
 u'had',
 u'hadn',
 u'has',
 u'hasn',
 u'have',
 u'haven',
 u'having',
 u'he',
 u'her',
 u'here',
 u'hers',
 u'herself',
 u'him',
 u'himself',
 u'his',
 u'how',
 u'i',
 u'if',
 u'in',
 u'into',
 u'is',
 u'isn',
 u'it',
 u'its',
 u'itself',
 u'just',
 u'll',
 u'm',
 u'ma',
 u'me',
 u'mightn',
 u'more',
 u'most',
 u'mustn',
 u'my',
 u'myself',
 u'needn',
 u'no',
 u'nor',
 u'not',
 u'now',
 u'o',
 u'of',
 u'off',
 u'on',
 u'once',
 u'only',
 u'or',
 u'other',
 u'our',
 u'ours',
 u'ourselves',
 u'out',
 u'over',
 u'own',
 u're',
 u's',
 u'same',
 u'shan',
 u'she',
 u'shoul

In [9]:
t = 'Mavic Crossride 26" Mountain bike wheels and WTB Weirwolf Tires'
tt = word_tokenize(t)
print(tt)
ttt = set(tt)
print(ttt)
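# difference_update() modifies the set in place and returns None,
# so use the - operator to get a new set without the stopwords.
tttt = ttt - s
print(sorted(tttt))


# The same idea applied to every tokenized item:
#items_t2 = [ set(words) - s for words in items_t]
#items_t2[0:5]

['Mavic', 'Crossride', '26', "''", 'Mountain', 'bike', 'wheels', 'and', 'WTB', 'Weirwolf', 'Tires']
set(['and', 'Mountain', '26', 'Mavic', 'Crossride', "''", 'Tires', 'WTB', 'bike', 'Weirwolf', 'wheels'])
["''", '26', 'Crossride', 'Mavic', 'Mountain', 'Tires', 'WTB', 'Weirwolf', 'bike', 'wheels']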
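
Converting the tokens to a set discards their order and any repeated words. The NLTK stopword list is also lowercase, so an upper-case `AND` in a title would not match it. A minimal sketch of an order-preserving, case-insensitive filter follows. The function name `remove_stopwords` and the tiny `demo_stopwords` set are assumptions for illustration; in the notebook you would pass the set `s` built above.

```python
def remove_stopwords(tokens, stopword_set):
    """Return tokens in their original order, dropping stopwords case-insensitively."""
    return [tok for tok in tokens if tok.lower() not in stopword_set]

# Stand-in for the NLTK stopword set `s`, so the snippet runs without the corpus.
demo_stopwords = {'and', 'the', 'of', 'for'}

tokens = ['Mavic', 'Crossride', '26', 'Mountain', 'bike', 'wheels',
          'AND', 'WTB', 'Weirwolf', 'Tires']
filtered = remove_stopwords(tokens, demo_stopwords)
print(filtered)
```

Only the comparison is lower-cased; the surviving tokens keep their original spelling.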


## Stemming

Stemming can be useful, but is it appropriate in the context of item titles? Titles are full of brand and model names, which a stemmer may alter in unhelpful ways.
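
One way to judge this is to stem a few title tokens and inspect the result. The sketch below uses NLTK's `PorterStemmer`, which needs no downloaded data; the token list is an illustrative sample taken from the titles above, and the choice of the Porter algorithm is an assumption rather than a recommendation.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Lower-case tokens so the result does not depend on how the
# stemmer treats capital letters.
tokens = ['mountain', 'bike', 'wheels', 'tires', 'crossride', 'weirwolf']
stems = {tok: stemmer.stem(tok) for tok in tokens}
for tok in tokens:
    print('%s -> %s' % (tok, stems[tok]))
```

Plural forms such as `wheels` and `tires` collapse to their singular stems, which helps matching. The brand and model names are worth checking by eye before stemming is applied to a whole catalogue.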

## Synonyms 

Query expansion using synonyms is especially useful when the result set contains few or no items.
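
A minimal sketch of the idea follows. The function `expand_query` and the hand-written `demo_synonyms` table are hypothetical and exist only for illustration; in practice the table might be derived from a resource such as WordNet, which NLTK exposes through `nltk.corpus.wordnet`.

```python
def expand_query(query_tokens, synonyms):
    """Return the query tokens plus any known synonyms, without duplicates."""
    expanded = []
    for tok in query_tokens:
        # Keep the original token first, then its synonyms in sorted order.
        for candidate in [tok] + sorted(synonyms.get(tok, [])):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

# Hypothetical, hand-written synonym table for illustration only.
demo_synonyms = {
    'bike': {'bicycle', 'cycle'},
    'tire': {'tyre'},
}

expanded = expand_query(['mountain', 'bike', 'tire'], demo_synonyms)
print(expanded)
```

A search front end could run the original query first and fall back to the expanded one only when too few items come back.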

## Spelling

We could correct spelling mistakes in the item titles, but we should also ensure that the query has no spelling mistakes.
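
One simple approach to the query side is to replace an unknown query term with the closest known word by edit distance. The sketch below uses `edit_distance` from `nltk.metrics.distance`; the helper `correct_term`, the `max_distance` threshold of 2, and the small `demo_vocabulary` are assumptions for illustration, and a real vocabulary could be built from the tokenized item titles.

```python
from nltk.metrics.distance import edit_distance

def correct_term(term, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or the term itself if none is close enough."""
    if term in vocabulary:
        return term
    # Sort for a deterministic choice when two words are equally close.
    best = min(sorted(vocabulary), key=lambda word: edit_distance(term, word))
    return best if edit_distance(term, best) <= max_distance else term

# Hypothetical vocabulary; in the notebook it could be built from items_t.
demo_vocabulary = {'handlebar', 'chainring', 'wheels', 'bike', 'stem'}

print(correct_term('handelbar', demo_vocabulary))
print(correct_term('chainrng', demo_vocabulary))
```

Comparing against every vocabulary word is slow for large catalogues, so this is a starting point rather than a production spell checker.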