## 1.3 Introduction to Information Retrieval

Here we work with a data set scraped from eBay.  The data contains 9895 item titles and descriptions.

First we load the data - this is easiest with a `csv.reader`:

In [16]:
import csv
import re
import pandas as pd
from collections import Counter
from scipy.sparse import csr_matrix
from bokeh.plotting import figure, output_notebook, show
from sys import getsizeof

In [4]:
with open("data/bike-items.txt") as f:
    r = csv.reader(f, delimiter=',', quotechar='"')
    rgx = re.compile(r'\b[a-zA-Z]+\b') 
    docs = [ (' '.join(re.findall(rgx, x[0])).lower(), ' '.join(re.findall(rgx, x[1])).lower())  for i,x in enumerate(r) if i > 1 ]

print('We have a list of (item title, description) tuples :\n + %s\n + %s' % (docs[0][0],docs[0][1]))

items_t = [ d[0] for d in docs ] # item titles
items_d = [ d[1] for d in docs ] # item descriptions
items_i = range(0, len(items_t)) # item id


We have a list of (item title, description) tuples :
 + cycling bicycle mtb bike fixie gloss carbon fiber riser bar handlebar
 + description feature easy to use made of high quality carbon fiber with the special design can save for a long time the carbon fiber handlebar is made of high quality carbon fiber so that you can use it relieved this quick disassembling carbon fiber handlebar is easy to use and one of the best gifts to your friends specification material carbon fiber color black handlebar clamp diameter mm length package included x cycling carbon fiber rise


Our raw data is in text form.  We need to convert it into a form more amenable to analysis.  In this notebook we look at ways of converting a collection of documents into a collection of vectors (a matrix).  To do this we need to *tokenize* the text - i.e. split it into words - and then create a vector of token frequencies for each document.  We will start by doing this the hard way and then look at how we can scale this up using scikit-learn.  Later on we will repeat this exercise using map-reduce.
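As a rough illustration of the two steps just described, the sketch below tokenizes two made-up strings (they are not taken from the eBay data) and turns them into count vectors of equal length. The helper `tokenize` and the names `toy_docs`, `vocab` and `vectors` are assumptions introduced only for this example; the word pattern is the same one used in the loading cell.

```python
import re
from collections import Counter

# Same word pattern as the loading cell: runs of ASCII letters only.
WORD = re.compile(r'\b[a-zA-Z]+\b')

def tokenize(text):
    """Lower-case the text and return its alphabetic tokens in order."""
    return WORD.findall(text.lower())

# Two made-up documents (hypothetical, not from the data set).
toy_docs = ["Red bike, red helmet!", "Carbon bike frame 26in"]

tokens = [tokenize(d) for d in toy_docs]

# The vocabulary fixes the meaning of each vector position.
vocab = sorted(set(t for doc in tokens for t in doc))

# One fixed-length count vector per document, aligned with `vocab`.
vectors = []
for doc in tokens:
    counts = Counter(doc)
    vectors.append([counts[term] for term in vocab])

print(vocab)    # ['bike', 'carbon', 'frame', 'helmet', 'red']
print(vectors)  # [[1, 0, 0, 1, 2], [1, 1, 1, 0, 0]]
```

Note that `26in` produces no token at all, because the pattern only matches runs of letters that are delimited by word boundaries.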

We will proceed as follows:
1. Compute term frequency as a dictionary, a matrix and a sparse matrix
2. Implement a boolean search against the TF matrix
3. Introduce scikit-learn and the 'hashing trick'
4. Compute TF.IDF for the set of documents

## Basic Term Frequency (TF) Matrix

Please note that this code is for understanding - it is not optimised or intended to scale!

Let's start with the first 5 item titles as our corpus:

In [5]:
corpus = items_t[0:5]
print(corpus)

['cycling bicycle mtb bike fixie gloss carbon fiber riser bar handlebar', 'bicycle rims x red speed internal hub wheel set beach cruiser bike', 'mavic crossride mountain bike wheels and wtb weirwolf tires', 'new kcnc arrow alloy stem black', 'rotor qxl aero oval road chainring']


### TF Dictionary 

Now we can compute the frequency of each term across the entire corpus:

In [6]:
tf = {}
for doc in corpus:
    for word in doc.split(' '):
        if word in tf:
            tf[word] += 1
        else:
            tf[word] = 1

print(tf)

{'qxl': 1, 'x': 1, 'internal': 1, 'handlebar': 1, 'cycling': 1, 'and': 1, 'carbon': 1, 'road': 1, 'weirwolf': 1, 'hub': 1, 'rims': 1, 'kcnc': 1, 'aero': 1, 'beach': 1, 'fiber': 1, 'red': 1, 'fixie': 1, 'arrow': 1, 'set': 1, 'alloy': 1, 'cruiser': 1, 'new': 1, 'riser': 1, 'speed': 1, 'rotor': 1, 'wheel': 1, 'black': 1, 'bike': 3, 'wtb': 1, 'stem': 1, 'mountain': 1, 'mtb': 1, 'tires': 1, 'bicycle': 2, 'wheels': 1, 'oval': 1, 'gloss': 1, 'crossride': 1, 'chainring': 1, 'mavic': 1, 'bar': 1}


We can simplify by using a Counter rather than a dictionary:

In [7]:
tf = Counter()
for doc in corpus:
    for word in doc.split(' '):
        tf[word] += 1
        
print(tf)

Counter({'bike': 3, 'bicycle': 2, 'qxl': 1, 'x': 1, 'internal': 1, 'handlebar': 1, 'cycling': 1, 'and': 1, 'carbon': 1, 'road': 1, 'weirwolf': 1, 'hub': 1, 'rims': 1, 'kcnc': 1, 'aero': 1, 'beach': 1, 'fiber': 1, 'red': 1, 'fixie': 1, 'arrow': 1, 'set': 1, 'alloy': 1, 'cruiser': 1, 'new': 1, 'riser': 1, 'speed': 1, 'rotor': 1, 'wheel': 1, 'black': 1, 'wtb': 1, 'stem': 1, 'mountain': 1, 'mtb': 1, 'tires': 1, 'wheels': 1, 'oval': 1, 'gloss': 1, 'crossride': 1, 'chainring': 1, 'mavic': 1, 'bar': 1})
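A `Counter` also offers a couple of conveniences beyond the `+= 1` pattern. The sketch below uses a hypothetical three-title list (`sample`, not the real corpus) to show `Counter.update`, which counts every item of an iterable in one call, and `most_common`, which returns the highest counts first.

```python
from collections import Counter

# Hypothetical three-title corpus, standing in for `corpus` above.
sample = ["red bike wheel", "bike stem black", "road bike wheel set"]

# Counter.update adds one to the count of every item in the iterable.
tf_sample = Counter()
for doc in sample:
    tf_sample.update(doc.split())

# most_common(n) returns the n highest counts as (term, count) pairs.
print(tf_sample.most_common(2))  # [('bike', 3), ('wheel', 2)]
```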


The Counter version is no faster, but the code is cleaner:

In [8]:
def tf1(corpus):
    tf = {}  # local dictionary, so the function does not modify the global `tf`
    for doc in corpus:
        for word in doc.split(' '):
            if word in tf:
                tf[word] += 1
            else:
                tf[word] = 1        
    return tf

def tf2(corpus):
    tf = Counter()
    for doc in corpus:
        for word in doc.split(' '):
            tf[word] += 1
    return tf

%timeit tf1(corpus)
%timeit tf2(corpus)


The slowest run took 11.71 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 23.8 µs per loop
10000 loops, best of 3: 26 µs per loop


### TF Matrix

Whilst the TF dictionary is a compact way to store term frequencies, it is not much use for analysis.  We need a TF matrix with one row per document, where every document vector has the same length (one entry per term in the lexicon).  We build it from the corpus as follows:

In [9]:
def get_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

lexicon = get_lexicon(corpus)

tfm =[]
for doc in corpus:
    for term in doc.split():
        tfv = [doc.split().count(word) for word in lexicon]
    tfm.append(tfv)
        
print([ x for x in tfm])

[[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0], [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]]


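As the number of terms increases this method becomes inefficient: it calls `doc.split().count(word)` for every word in the lexicon, and repeats that work once per term in the document.  Here is a faster implementation that makes a single pass over each document's terms: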

In [10]:
def get_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return list(lexicon)

lexicon = get_lexicon(corpus)

tfm =[]
for doc in corpus:
    tfv = [0]*len(lexicon)
    for term in doc.split():
        tfv[lexicon.index(term)] += 1
    tfm.append(tfv)
        
print(tfm)

[[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0], [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]]


We can compare the running time of each:

In [11]:
def tfm1(corpus):
    
    def get_lexicon(corpus):
        lexicon = set()
        for doc in corpus:
            lexicon.update([word for word in doc.split()])
        return lexicon
    
    lexicon = get_lexicon(corpus)

    tfm =[]
    for doc in corpus:
        for term in doc.split():
            tfv = [doc.split().count(word) for word in lexicon]
        tfm.append(tfv)
    
    return tfm

def tfm2(corpus):
    
    def get_lexicon(corpus):
        lexicon = set()
        for doc in corpus:
            lexicon.update([word for word in doc.split()])
        return list(lexicon)

    lexicon = get_lexicon(corpus)

    tfm =[]
    for doc in corpus:
        tfv = [0]*len(lexicon)
        for term in doc.split():
            tfv[lexicon.index(term)] += 1
        tfm.append(tfv)
    
    return tfm

%timeit tfm1(corpus)
%timeit tfm2(corpus)


The slowest run took 6.08 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 1.75 ms per loop
10000 loops, best of 3: 35 µs per loop
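Even in the faster version, `lexicon.index(term)` scans the list from the start, so each lookup takes time proportional to the size of the lexicon. One possible refinement, sketched here under the assumption that a term-to-column dictionary is acceptable, is to look the column up in a `dict` instead. The function `tfm_indexed` is a hypothetical helper, not part of the notebook, and it has not been timed against `tfm2`.

```python
def tfm_indexed(corpus):
    """Dense TF matrix using a term -> column dictionary (hypothetical helper)."""
    # Sorting makes the column order reproducible between runs.
    lexicon = sorted({word for doc in corpus for word in doc.split()})
    column = {term: j for j, term in enumerate(lexicon)}

    tfm = []
    for doc in corpus:
        tfv = [0] * len(lexicon)
        for term in doc.split():
            tfv[column[term]] += 1  # dictionary lookup instead of a list scan
        tfm.append(tfv)
    return lexicon, tfm

lexicon_s, tfm_s = tfm_indexed(["red bike red", "bike stem"])
print(lexicon_s)  # ['bike', 'red', 'stem']
print(tfm_s)      # [[1, 2, 0], [1, 0, 1]]
```

Returning the sorted lexicon alongside the matrix keeps the column labels available, which the set-based versions above do not guarantee across runs.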


In [12]:
# as the size of the corpus increases so does the sparsity
# (sparsity = fraction of entries in the TF matrix that are zero)

n = []
s = []
for i in range(100,1000,100):
    corpus = items_t[0:i]
    tfm = tfm2(corpus)
    n_zero = sum(x.count(0) for x in tfm)
    n_total = sum(len(x) for x in tfm)
    s.append(n_zero / n_total)
    n.append(i)
    
output_notebook(hide_banner=True)
p = figure(x_axis_label='Documents', y_axis_label='Sparsity',
          plot_width=400, plot_height=400)
p.line(n, s, line_width=2)
p.circle(n, s, fill_color="white", size=8)
show(p)


We can take advantage of the sparsity and only store the non-zero elements of the TF matrix.

### Sparse matrix storage

In [13]:
def tfm3(corpus):
    
    def get_lexicon(corpus):
        lexicon = set()
        for doc in corpus:
            lexicon.update([word for word in doc.split()])
        return list(lexicon)

    lexicon = get_lexicon(corpus)

    tfm =[]
    for doc_id, doc in enumerate(corpus):
        tfv = [0]*len(lexicon)
        for term in doc.split():
            tfv[lexicon.index(term)] += 1
        tfm.append([[(doc_id, t_id), t] for t_id, t in enumerate(tfv) if t > 0])
    
    return tfm

tfm = tfm3(corpus)
print(tfm[0])

[[(0, 303), 1], [(0, 341), 1], [(0, 424), 1], [(0, 446), 1], [(0, 522), 1], [(0, 636), 1], [(0, 1003), 1], [(0, 1009), 1], [(0, 1399), 1], [(0, 1500), 1], [(0, 2019), 1]]
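The coordinate pairs above repeat the document id for every entry. A compressed sparse row (CSR) layout avoids that repetition by keeping three flat arrays: the non-zero values, their column indices, and a row-pointer array that marks where each row starts. The plain-Python sketch below illustrates the idea on a small made-up matrix; `to_csr` and `csr_row` are hypothetical helpers for illustration only and are not how SciPy implements it.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix to (data, indices, indptr) lists."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value
                indices.append(col)   # the column it sits in
        indptr.append(len(data))      # offset in `data` where the next row starts
    return data, indices, indptr

def csr_row(data, indices, indptr, i):
    """Return row i as a list of (column, value) pairs."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))

# Small made-up matrix: 3 documents, 4 terms.
dense = [[0, 2, 0, 1],
         [0, 0, 0, 0],
         [3, 0, 0, 0]]

data, indices, indptr = to_csr(dense)
print(data, indices, indptr)              # [2, 1, 3] [1, 3, 0] [0, 2, 2, 3]
print(csr_row(data, indices, indptr, 0))  # [(1, 2), (3, 1)]
```

An empty row costs only one extra entry in `indptr`, which is why this layout suits very sparse TF matrices.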


We can also store this data more compactly - SciPy provides a compressed sparse row (CSR) matrix, `scipy.sparse.csr_matrix`:

In [14]:
tfm=csr_matrix(tfm2(corpus))
print(tfm[0,:])

l = ['tf2','tfm2','csr']
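# note: getsizeof is shallow - it does not count nested lists or the arrays held inside the CSR matrix
s = [getsizeof(tf2(corpus)) , getsizeof(tfm2(corpus)), getsizeof(csr_matrix(tfm2(corpus)))]

df = pd.DataFrame({'Type':l, 'Size':s}).sort_values('Size')

output_notebook(hide_banner=True)
# bar chart of object sizes, drawn with figure.vbar
p = figure(x_range=list(df['Type']), x_axis_label='Type', y_axis_label='Size')
p.vbar(x=list(df['Type']), top=list(df['Size']), width=0.8)
show(p)

  (0, 303)	1
  (0, 341)	1
  (0, 424)	1
  (0, 446)	1
  (0, 522)	1
  (0, 636)	1
  (0, 1003)	1
  (0, 1009)	1
  (0, 1399)	1
  (0, 1500)	1
  (0, 2019)	1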

## Boolean Search

Now that we have a TF matrix we can use it to find documents that contain the words in a query.  We will start by simply returning the documents from the corpus that match terms in our query: