On day 1 we saw how to build an inverted index to efficiently generate a recall set of documents. We then ranked those documents using term frequency, inverse document frequency and BM25.

Today we have looked at building machine learning models for ranking. This notebook shows how a machine-learnt model is used in a search engine: we "deploy" one of the models we created.

In [1]:
import os 
import pandas as pd

In [2]:
data = pd.read_csv("data/fullDataset.tsv", sep="\t",header=0)
data = data.dropna()



## Build inverted index

In [3]:
data.columns

Index([u'key', u'query', u'Title', u'LeafCats', u'ItemID', u'X_unit_id',
       u'SCORE', u'label_relevanceGrade', u'label_relevanceBinary',
       u'feature_1', u'feature_2', u'feature_3', u'feature_4', u'feature_5',
       u'feature_6', u'feature_7', u'feature_8', u'feature_9', u'feature_10'],
      dtype='object')

In [108]:
import re

rgx = re.compile(r'\b[a-zA-Z]+\b')
corpus = [ ' '.join(re.findall(rgx, str(x))).lower() for x in data.Title]
corpus[0:5]

['disney world all star music all inclusive package feb',
 'mgtc td tf mga morris minor top wing bolts set of',
 'new cross vice gel ink pen gift set document marker',
 'white paper towel roll holder cabinet wall mount sturdy',
 'engine cooling head traxxas jato trx nitro rustler']

In [109]:
def create_inverted_index(corpus):
    idx={}
    for i, doc in enumerate(corpus):
        # << POPULATE INVERTED INDEX >> CODE HERE
        ## HIDE
        for word in doc.split():
            if word in idx:
                idx[word].append(i)
            else:
                idx[word] = [i]
        ## HIDE
    return idx

idx = create_inverted_index(corpus)

We still have bike items in our corpus:

In [110]:
print(idx['bike'][0:10])
print(corpus[55])

[55, 421, 546, 559, 648, 691, 702, 983, 1234, 1262]
waterproof anti shock led bike taillight tail rear light caution aaa battery


Now that we have the index we can generate a recall set; all that is left is to score the documents.
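
As a minimal sketch of that recall step, the snippet below unions the posting lists of the query terms. It assumes OR semantics (a document is recalled if it contains any query term), which the notebook's scoring function later appears to follow. The names `build_toy_index`, `recall_set` and `toy_corpus` are illustrative and not part of the notebook.

```python
def build_toy_index(corpus):
    # Same structure as the index above: term -> list of document ids
    idx = {}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            idx.setdefault(word, []).append(i)
    return idx

def recall_set(qry, idx):
    # Union of the posting lists of every query term (OR semantics, an assumption)
    docs = set()
    for term in qry.split():
        docs.update(idx.get(term, []))  # terms missing from the index contribute nothing
    return sorted(docs)

# Hypothetical toy corpus, for illustration only
toy_corpus = ['led bike taillight', 'white paper towel holder',
              'kids bike helmet', 'gel ink pen']
toy_idx = build_toy_index(toy_corpus)
print(recall_set('bike helmet', toy_idx))
```

Only documents 0 and 2 mention a query term, so they form the recall set that would then be passed on for scoring.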

### Questions?
Here I normalise the features. But what would we do in a real MLR problem, where new data keeps accumulating? Would we use algorithms that do not require the features to be normalised?

Also, we are not computing the query features at run time here: I simply take the feature vector stored with the document, which is wrong. How would we compute the query features?
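
One possible direction, sketched below, is to compute a few query-document features when the query arrives. This is only an illustration: the features shown (query length, title length, number and fraction of matched terms) are hypothetical and are not the notebook's `feature_1` to `feature_10`, whose definitions are not given here. The function name `query_doc_features` is likewise an assumption.

```python
def query_doc_features(qry, doc):
    # Hypothetical query-document features computed at query time
    q_terms = qry.lower().split()
    d_terms = doc.lower().split()
    d_set = set(d_terms)
    matched = [t for t in q_terms if t in d_set]
    return {
        'query_len': len(q_terms),
        'doc_len': len(d_terms),
        'n_matched': len(matched),
        # Guard against an empty query to avoid division by zero
        'frac_matched': len(matched) / float(len(q_terms)) if q_terms else 0.0,
    }

feats = query_doc_features(
    'white iphone',
    'apple iphone latest model space gray at t smartphone')
print(feats)
```

In a deployed system, values like these would presumably be combined with the stored document features before the model is called, and the model would have to be trained on the same combined feature set.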

In [111]:
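try:
    import joblib  # standalone package, required from scikit-learn 0.23 onwards
except ImportError:
    from sklearn.externals import joblib  # bundled copy in older scikit-learn releases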
features = data.loc[:,'feature_1':'feature_10']
scaler = joblib.load('models/scaler.pkl')
features = scaler.transform(features)
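
To make the normalisation step concrete, here is a minimal sketch of what a standardising scaler does. It assumes `scaler.pkl` holds something like scikit-learn's `StandardScaler`; the notebook does not say which scaler was saved. The key point is that the statistics are estimated once on the training data and then reused unchanged on new data. `fit_standardiser` and `apply_standardiser` are hypothetical helper names.

```python
import math

def fit_standardiser(rows):
    # Per-column mean and population standard deviation from the training rows
    n = float(len(rows))
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
            for c, m in zip(cols, means)]
    return means, stds

def apply_standardiser(row, means, stds):
    # Reuse the training statistics; constant columns (std == 0) are only centred
    return [(v - m) / s if s > 0 else v - m
            for v, m, s in zip(row, means, stds)]

train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # toy training features
means, stds = fit_standardiser(train)
new_row = [4.0, 20.0]                            # unseen item scored later
scaled = apply_standardiser(new_row, means, stds)
print(scaled)
```

If the feature distribution drifts as new data accumulates, statistics fitted this way become stale, which is the concern raised in the first question above.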

For efficiency we are going to store the feature vector for each document in the index:

In [112]:
def create_inverted_index(corpus, features):
    idx={}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            if word in idx:
                if i not in idx[word]:
                    # Add document
                    idx[word][i] = features[i]
            else:
                # Add term
                idx[word] = {i:features[i]}
    return idx


idx = create_inverted_index(corpus, features)
print(idx['tricycle'].keys())
print(idx['tricycle'].values())

idx['tricycle']

[52481, 14155, 40812, 50745, 63003, 41183]
[array([-0.17379223, -1.13411908, -0.39806753,  0.63575084,  0.52457065,
       -0.45178351, -0.22241248,  0.45609258, -0.82743001, -0.75991721]), array([-0.17379223, -1.13411908, -0.10008165, -2.27973959,  3.52820582,
        1.23859683, -0.22241248,  0.45609258,  0.93583338,  1.35527817]), array([-0.17379223, -1.13411908, -1.13742724, -0.1531646 , -0.17649033,
        1.05240075, -0.22241248,  0.45609258, -0.73081284,  1.30008465]), array([-0.17379223, -1.13411908, -0.36959671, -1.13632595,  2.83878868,
        1.23859683, -0.22241248,  0.45609258,  0.88752479,  1.35527817]), array([-0.17379223, -1.13411908,  0.07580077, -1.7126046 , -0.51903625,
       -1.33313304, -0.21538468, -1.43918018, -0.00618405, -0.75991721]), array([-0.65272925,  0.88174163, -0.5299772 , -0.34945512, -1.62504047,
       -0.76156505,  7.32817495,  0.65148004, -0.85158431, -0.75991721])]


{14155: array([-0.17379223, -1.13411908, -0.10008165, -2.27973959,  3.52820582,
         1.23859683, -0.22241248,  0.45609258,  0.93583338,  1.35527817]),
 40812: array([-0.17379223, -1.13411908, -1.13742724, -0.1531646 , -0.17649033,
         1.05240075, -0.22241248,  0.45609258, -0.73081284,  1.30008465]),
 41183: array([-0.65272925,  0.88174163, -0.5299772 , -0.34945512, -1.62504047,
        -0.76156505,  7.32817495,  0.65148004, -0.85158431, -0.75991721]),
 50745: array([-0.17379223, -1.13411908, -0.36959671, -1.13632595,  2.83878868,
         1.23859683, -0.22241248,  0.45609258,  0.88752479,  1.35527817]),
 52481: array([-0.17379223, -1.13411908, -0.39806753,  0.63575084,  0.52457065,
        -0.45178351, -0.22241248,  0.45609258, -0.82743001, -0.75991721]),
 63003: array([-0.17379223, -1.13411908,  0.07580077, -1.7126046 , -0.51903625,
        -1.33313304, -0.21538468, -1.43918018, -0.00618405, -0.75991721])}

We can also load the model; in this case it is a gradient boosted tree model:

In [113]:
clf = joblib.load('models/mlr.pkl')

In [116]:
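from collections import Counter

def get_results_mlr(qry, idx, model):
    score = Counter()
    for term in qry.split():
        # .get() skips query terms that are not in the index instead of raising KeyError;
        # .items() works on both Python 2 and Python 3
        for doc, features in idx.get(term, {}).items():
            # Item-query and query level features should be computed here;
            # for now we reuse the feature vector stored with the document.
            # Column 1 of predict_proba is the probability of the model's second class.
            score[doc] = model.predict_proba(features.reshape(1,-1))[0][1]

    # Keep documents with a positive score as [score, doc_id] pairs
    results = [[s, doc] for doc, s in score.items() if s > 0]

    # Highest score first
    sorted_results = sorted(results, key=lambda t: t[0] * -1 )
    return sorted_results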

def print_results(results,n, head=True):
    ''' Helper function to print results
    '''
    if head:    
        print('\nTop %d from recall set of %d items:' % (n,len(results)))
        for r in results[:n]:
            print('\t%0.10f - %s'%(r[0],corpus[r[1]]))
    else:
        print('\nBottom %d from recall set of %d items:' % (n,len(results)))
        for r in results[-n:]:
            print('\t%0.10f - %s'%(r[0],corpus[r[1]]))
            
            
            
results = get_results_mlr('nike air yeezy', idx, clf)
print_results(results,10)
print_results(results,10,head=False)



Top 10 from recall set of 2053 items:
	1.0000000000 - new nike drawstring backpack school book shoe tote gym bag swim golf sport pack
	0.9999999992 - timemist metered air freshener refills bayberry refills tms
	0.9999999644 - men s new authentic nike shox nz eu running shoes sizes
	0.9999999629 - crosman sheridan cowboy youth single shot lever action air rifle
	0.9999999564 - nike grillroom golf shoes men s anthracite white black
	0.9999999526 - chevy fuel gas tank chevrolet bel air new
	0.9999998845 - new frigidaire room air conditioner remote control transmitter
	0.9999998763 - nike fuelband se plus health fitness tracker bluetooth nike
	0.9999998511 - nike oregon ducks footall t tyner mens sz spring game worn pants
	0.9999998493 - men s new authentic nike total shox running shoes sizes

Bottom 10 from recall set of 2053 items:
	0.0000000837 - air conditioning vent clock time thermometer celsius digital blue led backlight
	0.0000000768 - new nike lunar control mens golf shoes pick s

In [117]:
results = get_results_mlr('white iphone', idx, clf)
print_results(results,10)
print_results(results,10,head=False)


Top 10 from recall set of 4347 items:
	1.0000000000 - bullets ammunition custom case for nexus lg snap on black white
	0.9999999998 - squatty potty ecco toilet stool white inch
	0.9999999984 - for apple iphone plus tpu rubber ultra thin bumper case frame cover
	0.9999999951 - hard matte clear back case with soft silicone tpu bumper cover for apple iphone
	0.9999999928 - aibocn dual usb power bank charger for mobile phone iphone tablets
	0.9999999588 - lg volt white virgin mobile smartphone
	0.9999999586 - apple iphone plus verizon unlocked gold silver gray
	0.9999999564 - nike grillroom golf shoes men s anthracite white black
	0.9999999481 - sangean ps pillow speaker with in line volume control and amplifier white
	0.9999999475 - apple iphone latest model space gray at t smartphone

Bottom 10 from recall set of 4347 items:
	0.0000000897 - real white pearl beads white gold plated pendant and necklace
	0.0000000627 - new nike park iv cushioned otc sock large soccer white blue
	0.0000000