## 1.3 Introduction to Information Retrieval

Here we work with a data set scraped from eBay.  The data contains 9895 item titles and descriptions.

First we load the data - this is easiest with a `csv.reader`:

In [29]:
import csv
import re
import pandas as pd
import math
from collections import Counter
from scipy.sparse import csr_matrix

from bokeh.plotting import figure, output_notebook, show, vplot
from bokeh.charts import Bar, Scatter, BoxPlot
from bokeh.charts.attributes import CatAttr

from sys import getsizeof

In [30]:
with open("data/bike-items.txt") as f:
    r = csv.reader(f, delimiter=',', quotechar='"')
    rgx = re.compile(r'\b[a-zA-Z]+\b') 
    docs = [ (' '.join(re.findall(rgx, x[0])).lower(), ' '.join(re.findall(rgx, x[1])).lower())  for i,x in enumerate(r) if i > 1 ]

print('We have a list of (item title, description) tuples :\n + %s\n + %s' % (docs[0][0],docs[0][1]))

items_t = [ d[0] for d in docs ] # item titles
items_d = [ d[1] for d in docs ] # item descriptions
items_i = range(0, len(items_t)) # item id


We have a list of (item title, description) tuples :
 + cycling bicycle mtb bike fixie gloss carbon fiber riser bar handlebar
 + description feature easy to use made of high quality carbon fiber with the special design can save for a long time the carbon fiber handlebar is made of high quality carbon fiber so that you can use it relieved this quick disassembling carbon fiber handlebar is easy to use and one of the best gifts to your friends specification material carbon fiber color black handlebar clamp diameter mm length package included x cycling carbon fiber rise


Our raw data is in text form.  We need to convert it into a form more amenable to analysis.  In this notebook we look at ways of converting a collection of documents into a collection of vectors (a matrix).  To do this we need to *tokenize* the text - i.e. split it into words - and then create vectors of token frequencies.  We will start by doing this the hard way and then look at how we can scale this up using scikit-learn.  Later on we will repeat this exercise using map-reduce.

We will proceed as follows:
1. Compute term frequency as a dictionary, a matrix and a sparse matrix
2. Implement a boolean search against the TF matrix
3. Introduce scikit-learn and the 'hashing-trick'
4. Compute TF.IDF for the set of documents

## Basic Term Frequency (TF) Matrix

Please note that this code is for understanding - it is not optimised or intended to scale!

Let's start with the first 5 item titles as our corpus:

In [31]:
corpus = items_t[0:5]
print(corpus)

['cycling bicycle mtb bike fixie gloss carbon fiber riser bar handlebar', 'bicycle rims x red speed internal hub wheel set beach cruiser bike', 'mavic crossride mountain bike wheels and wtb weirwolf tires', 'new kcnc arrow alloy stem black', 'rotor qxl aero oval road chainring']


### TF Dictionary 

Now we can compute the frequency of each term across the entire corpus:

In [32]:
tf = {}
for doc in corpus:
    for word in doc.split():
        if word in tf:
            tf[word] += 1
        else:
            tf[word] = 1

print(tf)

{'and': 1, 'set': 1, 'bicycle': 2, 'cruiser': 1, 'tires': 1, 'fixie': 1, 'oval': 1, 'speed': 1, 'internal': 1, 'mountain': 1, 'cycling': 1, 'handlebar': 1, 'gloss': 1, 'chainring': 1, 'bike': 3, 'black': 1, 'new': 1, 'beach': 1, 'red': 1, 'kcnc': 1, 'wheel': 1, 'rotor': 1, 'fiber': 1, 'hub': 1, 'rims': 1, 'mavic': 1, 'aero': 1, 'stem': 1, 'alloy': 1, 'wtb': 1, 'carbon': 1, 'riser': 1, 'bar': 1, 'qxl': 1, 'crossride': 1, 'arrow': 1, 'weirwolf': 1, 'mtb': 1, 'x': 1, 'wheels': 1, 'road': 1}


We can simplify by using a Counter rather than a dictionary:

In [33]:
tf = Counter()
for doc in corpus:
    for word in doc.split():
        tf[word] += 1
        
print(tf)

Counter({'bike': 3, 'bicycle': 2, 'and': 1, 'set': 1, 'cruiser': 1, 'tires': 1, 'fixie': 1, 'oval': 1, 'speed': 1, 'internal': 1, 'mountain': 1, 'cycling': 1, 'handlebar': 1, 'gloss': 1, 'chainring': 1, 'black': 1, 'new': 1, 'beach': 1, 'red': 1, 'kcnc': 1, 'wheel': 1, 'rotor': 1, 'fiber': 1, 'hub': 1, 'rims': 1, 'mavic': 1, 'aero': 1, 'stem': 1, 'alloy': 1, 'wtb': 1, 'carbon': 1, 'riser': 1, 'bar': 1, 'qxl': 1, 'crossride': 1, 'arrow': 1, 'weirwolf': 1, 'mtb': 1, 'x': 1, 'wheels': 1, 'road': 1})


No speed difference - but cleaner code:

In [34]:
def tf1(corpus):
    # Use a local dictionary so repeated calls do not accumulate into the global tf
    tf = {}
    for doc in corpus:
        for word in doc.split(' '):
            if word in tf:
                tf[word] += 1
            else:
                tf[word] = 1        
    return tf

def tf2(corpus):
    tf = Counter()
    for doc in corpus:
        for word in doc.split(' '):
            tf[word] += 1
    return tf

%timeit tf1(corpus)
%timeit tf2(corpus)


The slowest run took 6.97 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 20.6 µs per loop
10000 loops, best of 3: 21.8 µs per loop


### TF Matrix

Whilst the TF dictionary is a compact way to store the term frequency it is not much use for analysis.  We need a TF matrix where each document vector is the same length.  Now we convert the dictionary to a matrix:

In [35]:
def get_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

lexicon = get_lexicon(corpus)

tfm =[]
for doc in corpus:
    for term in doc.split():
        tfv = [doc.split().count(word) for word in lexicon]
    tfm.append(tfv)
        
print([ x for x in tfm])

[[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]]


As the number of terms increases this method becomes inefficient, because every document is counted against every word in the lexicon.  Here is a faster implementation:

In [36]:
def get_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return list(lexicon)

lexicon = get_lexicon(corpus)

tfm =[]
for doc in corpus:
    tfv = [0]*len(lexicon)
    for term in doc.split():
        tfv[lexicon.index(term)] += 1
    tfm.append(tfv)
        
print(tfm)

[[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]]


We can compare time for each:

In [37]:
def tfm1(corpus):
    
    def get_lexicon(corpus):
        lexicon = set()
        for doc in corpus:
            lexicon.update([word for word in doc.split()])
        return lexicon
    
    lexicon = get_lexicon(corpus)

    tfm =[]
    for doc in corpus:
        for term in doc.split():
            tfv = [doc.split().count(word) for word in lexicon]
        tfm.append(tfv)
    
    return tfm, lexicon

def tfm2(corpus):
    
    def get_lexicon(corpus):
        lexicon = set()
        for doc in corpus:
            lexicon.update([word for word in doc.split()])
        return list(lexicon)

    lexicon = get_lexicon(corpus)

    tfm =[]
    for doc in corpus:
        tfv = [0]*len(lexicon)
        for term in doc.split():
            tfv[lexicon.index(term)] += 1
        tfm.append(tfv)
    
    return (tfm, lexicon)

%timeit tfm1(corpus)
%timeit tfm2(corpus)


The slowest run took 5.71 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 964 µs per loop
10000 loops, best of 3: 30.7 µs per loop
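
`tfm2` still calls `lexicon.index(term)`, which scans the lexicon list for every token. One possible further refinement, sketched here as an assumption rather than as part of the original notebook, is to precompute a dictionary from term to column position. The names `tfm_indexed` and `column` are illustrative only.

```python
def tfm_indexed(corpus):
    """Term-frequency matrix using a dict for term -> column lookup."""
    # Sort the lexicon so the column order is reproducible between runs
    lexicon = sorted({word for doc in corpus for word in doc.split()})
    column = {term: j for j, term in enumerate(lexicon)}

    tfm = []
    for doc in corpus:
        tfv = [0] * len(lexicon)
        for term in doc.split():
            tfv[column[term]] += 1  # dictionary lookup instead of list.index
        tfm.append(tfv)
    return tfm, lexicon


sample = ['red bike wheel', 'bike bike light']
matrix, terms = tfm_indexed(sample)
print(terms)   # ['bike', 'light', 'red', 'wheel']
print(matrix)  # [[1, 0, 1, 1], [2, 1, 0, 0]]
```

On a five-title corpus the difference is unlikely to be measurable; the lookup cost only starts to matter as the lexicon grows.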


In [38]:
# as the size of the corpus increases so does the sparsity
# (the fraction of entries in the TF matrix that are zero)

n = []
s = []
for i in range(100,1000,100):
    corpus = items_t[0:i]
    tfm, lexicon = tfm2(corpus)
    n_zero = sum([x.count(0) for x in tfm])
    n_total = sum([len(x) for x in tfm])
    # float() forces true division under Python 2
    s.append(float(n_zero) / n_total)
    n.append(i)
    
output_notebook(hide_banner=True)
p = figure(x_axis_label='Documents', y_axis_label='Sparsity',
          plot_width=400, plot_height=400)
p.line(n, s, line_width=2)
p.circle(n, s, fill_color="white", size=8)
show(p)



We can take advantage of the sparsity and only store the non-zero elements of the TF matrix.

### Sparse matrix storage

In [39]:
def tfm3(corpus):
    
    def get_lexicon(corpus):
        lexicon = set()
        for doc in corpus:
            lexicon.update([word for word in doc.split()])
        return list(lexicon)

    lexicon = get_lexicon(corpus)

    tfm =[]
    for doc_id, doc in enumerate(corpus):
        tfv = [0]*len(lexicon)
        for term in doc.split():
            tfv[lexicon.index(term)] += 1
        tfm.append([[(doc_id, t_id), t] for t_id, t in enumerate(tfv) if t > 0])
    
    return (tfm, lexicon)

tfm, lexicon = tfm3(corpus)
print(tfm[0])

[[(0, 12), 1], [(0, 29), 1], [(0, 38), 1], [(0, 75), 1], [(0, 80), 1], [(0, 308), 1], [(0, 347), 1], [(0, 514), 1], [(0, 886), 1], [(0, 1244), 1], [(0, 1250), 1]]


We can also use compression to store this data even more efficiently - SciPy provides a compressed sparse row matrix (`csr_matrix`):

In [40]:
tfm=csr_matrix(tfm2(corpus)[0])
print(tfm[0,:])

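# getsizeof is shallow, so add up the rows / underlying arrays explicitly
# (the sizes are approximate: the dictionary's keys are not included)
dense = tfm2(corpus)[0]
l = ['tf2','tfm2','csr']
s = [getsizeof(tf2(corpus)),
     getsizeof(dense) + sum(getsizeof(row) for row in dense),
     tfm.data.nbytes + tfm.indices.nbytes + tfm.indptr.nbytes]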

df = pd.DataFrame({'Type':l, 'Size':s})

output_notebook(hide_banner=True)
p = Bar(df.sort_values(by='Size'), label='Type', values='Size',
        plot_width=400, plot_height=400)
show(p)

  (0, 12)	1
  (0, 29)	1
  (0, 38)	1
  (0, 75)	1
  (0, 80)	1
  (0, 308)	1
  (0, 347)	1
  (0, 514)	1
  (0, 886)	1
  (0, 1244)	1
  (0, 1250)	1
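
To make the compressed sparse row layout more concrete, here is a minimal pure-Python sketch of the three arrays a CSR matrix keeps (`data`, `indices` and `indptr`). The helpers `to_csr` and `csr_row` are hypothetical names written for illustration; they are not SciPy's implementation.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(j)     # the column it sits in
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr


def csr_row(data, indices, indptr, i):
    """Return the (column, value) pairs stored for row i."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))


dense = [[0, 2, 0, 1],
         [0, 0, 0, 0],
         [3, 0, 0, 0]]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)              # [2, 1, 3] [1, 3, 0] [0, 2, 2, 3]
print(csr_row(data, indices, indptr, 0))  # [(1, 2), (3, 1)]
```

Row ids are never stored per element: they are implied by `indptr`, which is what makes this layout more compact than the list of `[(doc_id, t_id), t]` entries produced by `tfm3`.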



## Boolean Search

Now we have a tf matrix we can start to use it to find documents that contain words included in a query.  We will start by simply returning the documents from the corpus that match terms in our query and rank by raw term frequency:

In [41]:
def get_results1(qry, tfm, lexicon):
    qrv = [0]*len(lexicon)
    for term in qry.split():
        if term in lexicon:
            qrv[lexicon.index(term)] = 1

    results = []      
    for i, tfv in enumerate(tfm):
        score = sum([x[0] * x[1] for x in zip(tfv, qrv)])
        if score > 0:
            results.append([score, i])
    return results

def print_results(results,n, head=True):
    if head:    
        print('\nTop %d from recall set of %d items ordered by tf-idf:' % (n,len(results)))
        for r in sorted(results, key=lambda t: t[0] * -1 )[:n]:
            print('\t%0.2f - %s'%(r[0],items_t[r[1]]))
    else:
        print('\nBottom %d from recall set of %d items ordered by tf-idf:' % (n,len(results)))
        for r in sorted(results, key=lambda t: t[0] * 1 )[:n]:
            print('\t%0.2f - %s'%(r[0],items_t[r[1]]))
    
tfm, lexicon = tfm2(items_t)
results = get_results1('front rear back led bike light', tfm , lexicon)

print_results(results,10)



Top 10 from recall set of 5070 items ordered by tf-idf:
	8.00 - frog waterproof bike light set led white front light led red rear light
	7.00 - bicycle bike led front head torch light led back rear tail flashlight lamp
	6.00 - ultra bright waterproof silicon led bicycle light set led front rear light
	6.00 - planet bike spok micro led front and back bike light set
	6.00 - waterproof white red led front head lamp led rear bike light set
	5.00 - x led bicycle bike cycling silicone head front rear wheel safety light lamp tl
	5.00 - usb rechargeable bike bicycle light rear back safety tail light red new be
	5.00 - cycling bike bicycle led front light head light torch mount aaa
	5.00 - x led bicycle bike cycling silicone head front rear wheel safety light lamp
	5.00 - waterproof led lamp bike bicycle front head light rear safety flashlight


But this is an expensive operation.  Each query has to be compared to all documents in the corpus.  We can speed this up by creating an inverted index:

In [42]:
def create_inverted_index(corpus):
    idx={}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            if word in idx:
                idx[word].append(i)
            else:
                idx[word] = [i]
    return idx
            
idx = create_inverted_index(items_t)

print(set(idx['front']).intersection(set(idx['rear'])))
print(items_t[7676])


set([512, 2049, 9733, 5131, 3597, 1039, 5648, 8212, 8729, 2075, 7708, 1825, 7198, 5893, 3618, 9753, 2597, 9768, 6697, 1582, 9263, 9264, 8242, 7859, 565, 2615, 6712, 8763, 4164, 581, 7437, 8269, 9295, 1107, 3156, 599, 1380, 7258, 5612, 8288, 609, 6244, 8807, 9660, 6247, 3181, 1135, 3697, 4722, 1651, 2676, 8824, 5140, 2682, 4219, 1660, 9341, 7806, 9344, 6274, 9859, 9860, 136, 5771, 7823, 9873, 1175, 3224, 9370, 9883, 7837, 8867, 9841, 3671, 2219, 2734, 9393, 9587, 3611, 7881, 7864, 6772, 699, 2749, 8896, 7457, 6344, 4809, 6859, 5330, 3795, 9429, 5241, 8408, 6361, 1254, 4839, 4331, 3310, 7805, 8437, 9257, 8954, 251, 7933, 2302, 9471, 3840, 5889, 4866, 9477, 6407, 1884, 9346, 1294, 4370, 6419, 7956, 1815, 6424, 8473, 794, 289, 9216, 5937, 1320, 9513, 4402, 7985, 6962, 8499, 4404, 5429, 3977, 6456, 1209, 4000, 7495, 9546, 6711, 4942, 2387, 2903, 9564, 1373, 9060, 1895, 8553, 9815, 5351, 2412, 3437, 9582, 8560, 6514, 4467, 2025, 7033, 8571, 1404, 9086, 1941, 8577, 9602, 4483, 7613, 905, 9879

Now we just have to query for each of the terms and produce a set of results:

In [43]:
def get_results2(qry, idx):
    # Count, for each document, how many query-term occurrences it contains
    score = Counter()
    for term in qry.split():
        # Terms missing from the index contribute nothing
        for doc in idx.get(term, []):
            score[doc] += 1

    # output [0] score, [1] doc_id
    return [[s, doc] for doc, s in score.items() if s > 0]


idx = create_inverted_index(items_t)
%timeit get_results2('front rear back led bike light', idx)

# %timeit does not keep the assignment, so run the query once more to store the results
results = get_results2('front rear back led bike light', idx)

print_results(results,10)


100 loops, best of 3: 4.86 ms per loop

Top 10 from recall set of 5070 items ordered by tf-idf:
	8.00 - frog waterproof bike light set led white front light led red rear light
	7.00 - bicycle bike led front head torch light led back rear tail flashlight lamp
	6.00 - ultra bright waterproof silicon led bicycle light set led front rear light
	6.00 - planet bike spok micro led front and back bike light set
	6.00 - waterproof white red led front head lamp led rear bike light set
	5.00 - x led bicycle bike cycling silicone head front rear wheel safety light lamp tl
	5.00 - usb rechargeable bike bicycle light rear back safety tail light red new be
	5.00 - cycling bike bicycle led front light head light torch mount aaa
	5.00 - x led bicycle bike cycling silicone head front rear wheel safety light lamp
	5.00 - waterproof led lamp bike bicycle front head light rear safety flashlight


We get a lot of documents in the recall set since many match on one of the words - bike is present in almost every other document!
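
One way to shrink the recall set, not used in the rest of this notebook, is to require every query term to be present (a boolean AND) by intersecting the document lists, as in the `front` / `rear` intersection printed earlier. The sketch below is self-contained and uses a tiny made-up corpus; `build_index` and `and_query` are illustrative names.

```python
def build_index(corpus):
    """Map each term to the set of document ids that contain it."""
    idx = {}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            idx.setdefault(word, set()).add(i)
    return idx


def and_query(qry, idx):
    """Return ids of documents containing every query term."""
    postings = [idx.get(term, set()) for term in qry.split()]
    if not postings:
        return set()
    return set.intersection(*postings)


titles = ['led front bike light', 'rear bike light', 'mountain bike frame']
index = build_index(titles)
print(sorted(and_query('bike light', index)))   # [0, 1]
print(sorted(and_query('front light', index)))  # [0]
print(sorted(and_query('bike pump', index)))    # []
```

The trade-off is that a strict AND returns nothing as soon as one query term is missing from a title, which is why the notebook continues with weighting terms instead.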

In [44]:
df = pd.DataFrame({'term':[x for x in idx.keys()],'freq':[len(x) for x in idx.values()]})

output_notebook(hide_banner=True)
p = Bar(df.sort_values('freq', ascending=False)[:30], label=CatAttr(columns=['term'], sort=False), values='freq',
        plot_width=800, plot_height=400)
show(p)



## Inverse Document Frequency (IDF)

It would seem sensible to down-weight words that are very common in the corpus - the word 'bike' in a query is not as discriminating as the word 'front'.  IDF is a way to quantify how common or rare a term is in the corpus.  It is the log of the total number of documents divided by the number of documents in which the term appears.  To avoid division by zero it is common to add 1 to the number of documents in which the term appears.

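Most of the work for IDF was already done when we constructed the inverted index: the document frequency of a term is the number of documents it appears in, in other words the length of its document list in the inverted index.  This only holds if each document is listed once per term, so the index below stores a dictionary of document id to term count rather than a list.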
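
As a small worked example of the formula, assuming a made-up corpus of 1000 documents rather than the eBay data, a term found in about half the documents gets a much lower weight than one found in only a handful. `idf_from_counts` is an illustrative helper that takes the counts directly.

```python
import math

def idf_from_counts(n_docs, doc_freq):
    """IDF with 1 added to the document frequency, as described above."""
    return math.log(float(n_docs) / (1 + doc_freq))

n_docs = 1000  # illustrative corpus size, not the eBay data
for label, doc_freq in [('common', 499), ('rare', 9)]:
    print('%s: %0.2f' % (label, idf_from_counts(n_docs, doc_freq)))
# common: 0.69
# rare: 4.61
```

Note that with the added 1, a term present in every document has a slightly negative IDF, since the ratio drops just below 1.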

In [45]:
def create_inverted_index(corpus):
    idx={}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            if word in idx:
                if i in idx[word]:
                    # Update document's frequency
                    idx[word][i] += 1
                else:
                    # Add document
                    idx[word][i] = 1
            else:
                # Add term
                idx[word] = {i:1}
    return idx

def idf(term, idx, n):
    return math.log( float(n) / (1 + len(idx[term])))

idx = create_inverted_index(items_t)


df = pd.DataFrame({'term':[x for x in idx.keys()],'freq':[len(x) for x in idx.values()],
                  'idf':[idf(x, idx, len(items_t)) for x in idx.keys()]})

output_notebook(hide_banner=True)
p1 = Bar(df.sort_values('freq', ascending=False)[:30], label=CatAttr(columns=['term'], sort=False), values='freq',
        plot_width=800, plot_height=400)

p2 = Bar(df.sort_values('freq', ascending=False)[:30], label=CatAttr(columns=['term'], sort=False), values='idf',
        plot_width=800, plot_height=400)

p = vplot(p1, p2)

show(p)


## Ranking by TF-IDF

We can now combine term frequency and inverse document frequency when computing the score for each item in the recall set.  Until now we have just computed the score as the raw frequency of the query terms in each document.  Now we want to weight the raw frequency by the inverse document frequency.

In [46]:
def get_results3(qry, idx, n):
    score = Counter()
    for term in qry.split():
        i = idf(term, idx, n)
        for doc in idx[term]:
            score[doc] += idx[term][doc] * i
        
    results=[]
    for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
        if x[1] > 0:
            # output [0] score, [1] doc_id
            results.append([x[1],x[0]])

    return results;

In [47]:
idx = create_inverted_index(items_t)
results = get_results3('front led bike light', idx, len(items_t))

print_results(results,10)


Top 10 from recall set of 4835 items ordered by tf-idf:
	17.43 - frog waterproof bike light set led white front light led red rear light
	13.93 - ultra bright waterproof silicon led bicycle light set led front rear light
	12.01 - bicycle bike led front head torch light led back rear tail flashlight lamp
	12.01 - waterproof white red led front head lamp led rear bike light set
	11.87 - cycling bike bicycle led front light head light torch mount aaa
	11.87 - lm cree led cycling front bike bicycle light headlight only light
	11.87 - lm cree led cycling front bike bicycle light headlight only light
	11.87 - usb cycling xml led front bike light bicycle light headlamp headlight
	11.87 - lm cree led cycling front bike bicycle light headlight only light
	11.87 - cree xm l led front bicycle light bike headlamp lamp light modes


## Problematic queries!

With this corpus we cannot search for mountain bikes without returning a heap of accessories:

In [48]:
idx = create_inverted_index(items_t)
results = get_results3('mountain bike', idx, len(items_t))

print_results(results,10)


Top 10 from recall set of 4593 items ordered by tf-idf:
	5.73 - oakley mens automatic mountain mtb factory lite mountain bmx bike gloves large
	5.73 - mavic crossride wheelset mountain bike xc all mountain qr flat speed
	4.06 - salsa front black bike wheel skewer road or mountain bike quick release qr
	4.06 - kmc xsp speed chain bike bicycle links mtb mountain bike new
	4.06 - fat bike mountain bike frame and fork plus all components no wheels or tires
	4.06 - new gloves mountain bike motocross bike bmx blue black size l large
	4.06 - oem jagwire brake shifter cable housing kit road bike mountain bike
	4.06 - kmc xxsp speed chain bike bicycle links mtb mountain bike new
	4.06 - oem jagwire brake shifter cable housing kit road bike mountain bike
	4.06 - mtb road bike mountain bicycle adjustable alloy bike kick stand side kickstand


We need to penalise items whose titles contain many terms that are not in the query.  For example the terms "mountain" and "bike" account for only 3 of the 12 tokens in "oakley mens automatic mountain mtb factory lite mountain bmx bike gloves large" yet it scores highly because there is no penalty for all the other terms in the item title.

In addition this scheme creates discrete score levels, one for each combination of term frequencies:


In [49]:
df = pd.DataFrame({'score':[float(x[0]) for x in results],
                   'title':[items_t[x[1]] for x in results]})

d = df.groupby('score').first().reset_index()

r1 = re.compile('(bike)')
r2 = re.compile('(mountain)')

for i, t in enumerate(d.title):
    n1 = r1.findall(t)
    n2 = r2.findall(t)
    print('%d x Bike, %d x Mountain, Score = %0.2f'%(len(n1),len(n2),d.score[i]))
    

1 x Bike, 0 x Mountain, Score = 0.80
2 x Bike, 0 x Mountain, Score = 1.59
0 x Bike, 1 x Mountain, Score = 2.47
1 x Bike, 1 x Mountain, Score = 3.26
2 x Bike, 1 x Mountain, Score = 4.06
1 x Bike, 2 x Mountain, Score = 5.73


In [53]:
# Plot score vs item length
df = pd.DataFrame({'score':[float(x[0]) for x in results],
                   'length':[len(items_t[x[1]].split()) for x in results]})

output_notebook(hide_banner=True)
p = Scatter(df, x='score', y='length')
show(p)


Ideally we do not want lots of documents to share the same score.  We could try boosting the score for documents that are shorter than average.
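
A minimal sketch of that idea: divide the raw term frequency by the document's length relative to the average. The numbers below are made up for illustration and `length_normalised_tf` is a hypothetical helper name.

```python
def length_normalised_tf(f, doc_len, avg_len):
    """Divide the raw term frequency by the document's relative length."""
    return f / (float(doc_len) / avg_len)

avg_len = 8.0  # illustrative average title length, in tokens
short = length_normalised_tf(1, 4, avg_len)   # 4-token title
long_ = length_normalised_tf(1, 16, avg_len)  # 16-token title
print(short, long_)  # 2.0 0.5
```

A single match in a title half the average length counts double, while the same match in a title twice the average length counts half.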

In [62]:
def get_results4(qry, corpus):
    idx = create_inverted_index(corpus)
    n = len(corpus)
    d = [len(x.split()) for x in corpus]
    d_avg = float(sum(d)) / len(d)
    score = Counter()
    for term in qry.split():
        i = idf(term, idx, n)
        for doc in idx[term]:
            f = float(idx[term][doc])
            score[doc] += i *  ( f / (float(d[doc]) / d_avg) )
        
    results=[]
    for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
        if x[1] > 0:
            # output [0] score, [1] doc_id
            results.append([x[1],x[0]])

    return results;

In [63]:
results = get_results4('mountain bike', items_t)
print_results(results, 10)

# Plot score vs item length
df = pd.DataFrame({'score':[float(x[0]) for x in results],
                   'length':[len(items_t[x[1]].split()) for x in results]})

output_notebook()
p = Scatter(df, x='score', y='length')
show(p)


Top 10 from recall set of 4593 items ordered by tf-idf:
	10.76 - trek mountain bike
	8.13 - rocky mountain etsx
	8.13 - rocky mountain slayer
	8.07 - shadow nine mountain bike
	8.07 - cannondale jekyll mountain bike
	6.46 - mens mongoose mountain bike new
	6.46 - magna glacier point mountain bike
	6.46 - truvativ xx mountain bike chainring
	6.46 - gt timberline mountain bike small
	6.46 - banshee scream downhill mountain bike



## Okapi BM25

In [64]:
def get_results5(qry, corpus, k1=1.5, b=0.75):
    idx = create_inverted_index(corpus)
    n = len(corpus)
    d = [len(x.split()) for x in corpus]
    d_avg = float(sum(d)) / len(d)                
    score = Counter()
    for term in qry.split():
        i = idf(term, idx, n)
        for doc in idx[term]:
            f = float(idx[term][doc])
            score[doc] += i * (( f * (k1 + 1) ) / (f + k1 * (1 - b + (b * (float(d[doc]) / d_avg)))))
        
    results=[]
    for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
        if x[1] > 0:
            # output [0] score, [1] doc_id
            results.append([x[1],x[0]])

    return results;
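
The per-term weight inside `get_results5` can be isolated to see how it behaves. The sketch below copies that expression into a standalone function (`bm25_tf_weight` is an illustrative name) and uses made-up lengths; the IDF factor is left out so that only the effect of term frequency and document length is visible.

```python
def bm25_tf_weight(f, doc_len, avg_len, k1=1.5, b=0.75):
    """Saturating, length-normalised term-frequency component of the score."""
    norm = 1 - b + b * (float(doc_len) / avg_len)
    return (f * (k1 + 1)) / (f + k1 * norm)

avg_len = 8.0  # illustrative average title length, in tokens

# Repeating a term gives diminishing returns, bounded above by k1 + 1
for f in (1, 2, 5, 50):
    print(f, round(bm25_tf_weight(f, 8, avg_len), 3))

# Shorter documents receive a higher weight for the same frequency
print(bm25_tf_weight(1, 4, avg_len) > bm25_tf_weight(1, 16, avg_len))  # True
```

Here `k1` controls how quickly repeated occurrences saturate and `b` controls how strongly document length is penalised; with `b=0` the length has no effect at all.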

In [65]:
results = get_results5('mountain bike', items_t)
print_results(results, 10)

# Plot score vs item length
df = pd.DataFrame({'score':[float(x[0]) for x in results],
                   'length':[len(items_t[x[1]].split()) for x in results]})

output_notebook()
p = Scatter(df, x='score', y='length')
show(p)
    


Top 10 from recall set of 4593 items ordered by tf-idf:
	4.75 - trek mountain bike
	4.46 - shadow nine mountain bike
	4.46 - cannondale jekyll mountain bike
	4.20 - mens mongoose mountain bike new
	4.20 - magna glacier point mountain bike
	4.20 - truvativ xx mountain bike chainring
	4.20 - gt timberline mountain bike small
	4.20 - banshee scream downhill mountain bike
	4.20 - lake winter mountain bike shoe
	4.20 - mountain bike tubular tires new

