## Machine Learning for Ranking - Deployment

During day 1 we saw how to build an inverted index to efficiently generate a recall set of documents. We then ranked documents using term frequency, inverse document frequency and BM25.

Today we have looked at building machine learning algorithms for ranking.  This notebook shows how we use the machine learnt models in a search engine.  We "deploy" one of the models we created.

In [1]:
import os 
import pandas as pd

In [2]:
data = pd.read_csv("data/bike-queries-scraped.tsv", sep="\t",header=0)
data = data.dropna()

In [3]:
data.columns

Index([u'query', u'Score', u'ItemID', u'feature_2', u'feature_3', u'feature_4',
       u'feature_5', u'feature_6', u'feature_7', u'feature_8', u'feature_9',
       u'feature_1', u'LeafCats', u'Title', u'feature_10'],
      dtype='object')

In [4]:
data[['query','Title']].head(n=20)

|   | query | Title |
|---|-------|-------|
| 0 | bike | BIKE 1" Swimmer Jock Strap / Jogging Jockstrap... |
| 1 | bike | Topeak Super Bicycle Chain Tool |
| 2 | bike | 2008 FUJI FC-770 18 Speed Men Bicycle 56cm Car... |
| 3 | bike | New Black Roko Goggle Quick Straps Release Hel... |
| 4 | bike | Bike & Waterproof Bicycle Motorcycle Mount Hol... |
| 5 | bike | Carbon Fiber MTB Road Mountain Bike Bicycle fl... |
| 6 | bike | PRADA MEN'S SHOES NYLON TRAINERS SNEAKERS NEW ... |
| 7 | bike | Kids Adult Protect Helmets for Riding Bike Cyc... |
| 8 | bike | Digital Age Sport Cycling Jersey Bike Bicycle ... |
| 9 | bike | Carbon MTB Road Mountain Bike Bicycle rise Han... |


We have the following unique queries:

In [5]:
data['query'].unique()

array(['bike', 'bicycle', 'mens red bicycle', 'led bike light',
       'trek 7.2 hybrid', 'bike seat', 'spoke light',
       'shimano rear derailleur', 'unicycle', 'tricycle'], dtype=object)

## Build inverted index

Before we can build an inverted index we need to tokenize the item Titles:

In [7]:
import re

rgx = re.compile(r'\b[a-zA-Z]+\b')
corpus = [ ' '.join(re.findall(rgx, str(x))).lower() for x in data.Title]
corpus[0:5]

['bike swimmer jock strap jogging jockstrap waistband uk freepost',
 'topeak super bicycle chain tool',
 'fuji fc speed men bicycle carbon aluminum no suspension',
 'new black roko goggle quick straps release helmet dirt bike atv mx smith scott',
 'bike waterproof bicycle motorcycle mount holder for nokia nok lumia phones new']
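
It is worth seeing what this pattern discards. The sketch below wraps the same regular expression in a helper (the name `tokenize` is ours, not part of the notebook) and runs it on a title and on one of the queries. Any token that contains a digit is dropped entirely, so `56cm` and `7.2` never reach the index.

```python
import re

# Same pattern as in the cell above: runs of ASCII letters bounded by word boundaries
TOKEN_RE = re.compile(r'\b[a-zA-Z]+\b')

def tokenize(title):
    """Hypothetical helper: lower-cased, space-joined alphabetic tokens of a title."""
    return ' '.join(TOKEN_RE.findall(str(title))).lower()

# Digits and mixed tokens such as '2008', '770', '18' and '56cm' are removed
print(tokenize('2008 FUJI FC-770 18 Speed Men Bicycle 56cm'))  # fuji fc speed men bicycle

# The same happens to a query term such as '7.2'
print(tokenize('trek 7.2 hybrid'))  # trek hybrid
```

A query term such as `7.2` therefore has no posting list in an index built from these tokens, so any lookup by query term has to cope with missing terms.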

Our inverted index for offline evaluation of the Machine Learnt Ranker is going to be different to the inverted index we built in Day 1.  The original inverted index was built exclusively using the corpus of item titles.  The inverted index we require for offline testing is extended in the following ways:
     
*  For offline testing we need to create a document index for each query in our query-item data set.  We can only test against this 'Golden Set' of queries.  For offline evaluation of the MLR model we are working with a dataset that we scraped from the eBay search engine.  The data includes the query and the items returned for that query.  The dataset also includes the input features we previously used to train the MLR model.  In an online search engine the feature vector would be computed in real time for any query.
*  When we trained our MLR model we scaled the features to have zero mean and unit standard deviation.  Before we can apply the MLR model to the test index we need to make sure the features are normalised with the same scaling parameters that were fitted on the training data.  That is why the scaling parameters were serialised previously.  In an online search engine this normalisation would be done in real time.
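
To make the second point concrete, here is a minimal standard-library sketch of "fit once, serialise, reuse". It does not use the pickled scaler from this notebook: `fit_scaler`, `transform` and the toy numbers are illustrative assumptions, and JSON stands in for the pickle file.

```python
import json
import statistics

def fit_scaler(rows):
    """Hypothetical helper: per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return {"mean": means, "std": stds}

def transform(rows, params):
    """Scale rows with previously fitted parameters (never refit on test data)."""
    return [[(v - m) / s for v, m, s in zip(row, params["mean"], params["std"])]
            for row in rows]

# Toy training features (made up for illustration)
train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
params = fit_scaler(train_rows)

# Serialise at training time, load again at scoring time
saved = json.dumps(params)
loaded = json.loads(saved)

# New rows are scaled with the training mean and standard deviation
test_rows = [[2.0, 40.0]]
scaled = transform(test_rows, loaded)
print(scaled)
```

The important design choice is that `transform` only reads the stored parameters. Refitting on the evaluation data would shift the feature values away from what the model saw during training.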

With such an index we can generate a recall set for a small set of queries; all that is left is to score the documents using the model.

First we scale the features:

In [8]:
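import joblib  # sklearn.externals.joblib is no longer available in recent scikit-learn releases

features = data[['feature_1','feature_2','feature_3','feature_4','feature_5',
                 'feature_6','feature_7','feature_8','feature_9','feature_10']]
scaler = joblib.load('models/scaler.pkl')
# Apply the scaling parameters that were fitted on the training data
features = scaler.transform(features)
# Reset the index so positional lookups such as queries[i] still work after dropna()
queries = data['query'].reset_index(drop=True)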

Now we create our index for each query.  In addition we store the feature vector for each document in the index:

In [11]:
def create_inverted_index_mlr(corpus, queries, features):
    idx={}
    for i, doc in enumerate(corpus):
        query = queries[i]
        feature = features[i]
        if query not in idx:
            idx[query] = {}
            
        for word in doc.split():
            if word in idx[query]:
                idx[query][word][i] = feature
            else:
                idx[query][word] = {i:feature}
                
    return idx

# queries = ['bike','bike','bike helmet','bike helmet']
# corpus = ['sunday bike','bike light led','motorcycle helmet','blue helmet']
# features = [(1,2,3), (4,5,6), (2,4,6), (3,6,9)]

idxq = create_inverted_index_mlr(corpus, queries, features)
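
To see the shape of the resulting structure, the toy data from the commented-out lines above can be indexed with a compact equivalent of the function. This is a sketch: `build_query_index` and the `toy_` names are ours, chosen so that running it does not overwrite the notebook's variables.

```python
def build_query_index(corpus, queries, features):
    """Compact sketch: query -> term -> {doc_id: feature vector}."""
    idx = {}
    for doc_id, (doc, query, feature) in enumerate(zip(corpus, queries, features)):
        postings = idx.setdefault(query, {})
        for word in doc.split():
            postings.setdefault(word, {})[doc_id] = feature
    return idx

# Toy data taken from the commented-out example
toy_queries = ['bike', 'bike', 'bike helmet', 'bike helmet']
toy_corpus = ['sunday bike', 'bike light led', 'motorcycle helmet', 'blue helmet']
toy_features = [(1, 2, 3), (4, 5, 6), (2, 4, 6), (3, 6, 9)]

toy_idx = build_query_index(toy_corpus, toy_queries, toy_features)

# Each query has its own term index; each posting carries the document's features
print(toy_idx['bike']['bike'])           # {0: (1, 2, 3), 1: (4, 5, 6)}
print(toy_idx['bike helmet']['helmet'])  # {2: (2, 4, 6), 3: (3, 6, 9)}

# A query term is not guaranteed to occur in any title returned for that query
print('bike' in toy_idx['bike helmet'])  # False
```

The last line shows why a lookup by query term needs a guard: neither title scraped for `bike helmet` in the toy data contains the word `bike`.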

We can also load the model, in this case a gradient boosted tree classifier:

In [12]:
clf = joblib.load('models/mlr.pkl')

In [31]:
from collections import Counter

def get_results_mlr(qry, idxq, model):
    score = Counter()

    if qry in idxq:
        idx = idxq[qry]
        for term in qry.split():
            # Skip query terms that do not occur in any title for this query
            for doc, features in idx.get(term, {}).items():
                # The item-query and query level features are precomputed;
                # take the probability of class index 1 as the ranking score
                score[doc] = model.predict_proba(features.reshape(1, -1))[0][1]

    # Keep documents with a positive score as [score, doc_id] pairs
    results = [[s, doc] for doc, s in score.items() if s > 0]

    # Sort by descending score
    return sorted(results, key=lambda t: -t[0])

def print_results(corpus, results,n, head=True):
    ''' Helper function to print results
    '''
    if head:    
        print('\nTop %d from recall set of %d items:' % (n,len(results)))
        for r in results[:n]:
            print('\t%0.10f - %s'%(r[0],corpus[r[1]]))
    else:
        print('\nBottom %d from recall set of %d items:' % (n,len(results)))
        for r in results[-n:]:
            print('\t%0.10f - %s'%(r[0],corpus[r[1]]))
            
    
    
results = get_results_mlr('led bike light', idxq, clf)
print_results(corpus, results,10)
print_results(corpus, results,10,head=False)



Top 10 from recall set of 50 items:
	0.9999946067 - cycling bicycle led solar energy usb rechargeable bike headlight lamp
	0.9999302850 - led waterproof flexible strip car bike light lamp bulb blue red green warm white
	0.9998667451 - lumen cree xm usb rechargeable led bike bicycle headlight headlamp light
	0.9998652580 - new laser beam led bike bicycle cycling rear tail light lamp modes
	0.9993209796 - night ride blue lights bicycle bike cycling wheel tire spoke led light lamp mt
	0.9959341931 - red waterproof silicone bike bicycle cycle cycling rear led light
	0.9954208404 - rechargeable cree led front bicycle lamp bike light charger
	0.9930440450 - led lamp flash tyre wheel valve cap light for car bike bicycle motorbicycle
	0.9904830402 - gold a cree led front bicycle light bike headlamp headlight kit

Bottom 10 from recall set of 50 items:
	0.0017743213 - elastic rubber o rings for headlamp light mtb bike bicycle led light best
	0.0004777810 - motorcycle bike leds turn signal ligh

In [43]:
results = get_results_mlr('tricycle', idxq, clf)
print_results(corpus, results,10)
print_results(corpus, results,10,head=False)


Top 10 from recall set of 47 items:
	0.9989410966 - tricycle trike kids red radio flyer dual deck bike play outdoor ride adjustable
	0.9435015487 - trike red radio flyer stroller kids toddler tricycle bike ride infant toy
	0.8970217776 - tricycle pick your ride cycle dvd tricycle pick your ride cycle
	0.7006430107 - tricycle patent print green chalkboard
	0.5339988355 - real photo postcard children sister brother on tricycle los angeles ca
	0.3798914381 - antique vintage photo little girl on trike tricycle
	0.1077623580 - kick it old school funny retro tricycle graphic t shirt
	0.0762950473 - boy on tricycle vinyl sticker wall art
	0.0225111446 - advert gendron motor bike bicycle tricycle streamline race cycle motorcycle
	0.0136251418 - vintage s tricycle photos funny little blonde girl riding her trike

Bottom 10 from recall set of 47 items:
	0.0000012753 - rear wheel for trike tricycle black tire white mag vintage inch od mm id
	0.0000012731 - vintage gay nineties tricycle salt pepp

Let's also implement BM25 on this query corpus for comparison:

In [38]:
import math

def idf(term, idx, n):
    return math.log( float(n) / (1 + len(idx[term])))    

def create_inverted_index_bm25(corpus):
    idx={}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            if word in idx:
                if i in idx[word]:
                    # Update document's frequency
                    idx[word][i] += 1
                else:
                    # Add document
                    idx[word][i] = 1
            else:
                # Add term
                idx[word] = {i:1}
    return idx

def get_results_bm25(qry, corpus, k1=1.5, b=0.75):
    idx = create_inverted_index_bm25(corpus)
    n = len(corpus)
    d = [len(x.split()) for x in corpus]
    d_avg = float(sum(d)) / len(d)                
    score = Counter()
    for term in qry.split():
        if term in idx:
            i = idf(term, idx, n)
            for doc in idx[term]:
                f = float(idx[term][doc])
                s = i * (( f * (k1 + 1) ) / (f + k1 * (1 - b + (b * (float(d[doc]) / d_avg)))))
                score[doc] += s
                
    results=[]
    for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
        if x[1] > 0:
            results.append([x[1],x[0]])

    sorted_results = sorted(results, key=lambda t: t[0] * -1 )
    return sorted_results
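
Written out, the scoring loop above computes the following for a query $q$ and a document $d$, where $f_{t,d}$ is the frequency of term $t$ in $d$, $|d|$ is the document length in tokens, $\mathrm{avgdl}$ is the average document length (`d_avg`), $n$ is the number of documents and $n_t$ is the number of documents containing $t$. Note that the `idf` function uses the simple smoothed form $\ln\frac{n}{1 + n_t}$ rather than the Robertson-Spärck Jones variant often quoted alongside BM25.

$$
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t)\,\frac{f_{t,d}\,(k_1 + 1)}{f_{t,d} + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{idf}(t) = \ln\frac{n}{1 + n_t}
$$

With this form of idf, a term that occurs in every document gets a negative weight, because $\ln\frac{n}{1+n} < 0$. Documents whose total score ends up at or below zero are removed by the `x[1] > 0` filter.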

In [39]:
qry = 'led bike light'
# Keep only the titles that were returned for this query
corpus_q = [doc for doc, q in zip(corpus, queries) if q == qry]
results = get_results_bm25(qry, corpus_q)
print_results(corpus_q, results,10)
print_results(corpus_q, results,10,head=False)


Top 10 from recall set of 50 items:
	0.5190954638 - waterproof led bicycle bike cycling wheel light flash safe spoke light
	0.4869835585 - elastic rubber o rings for headlamp light mtb bike bicycle led light best
	0.4604563024 - rechargeable cree led front bicycle lamp bike light charger
	0.4411580500 - red waterproof silicone bike bicycle cycle cycling rear led light
	0.4234123521 - griplit led bike handlebar light disc o pack of md glt
	0.4234123521 - bike bicycle cycling silicone white wheel tyre waterproof led light white
	0.4234123521 - led bike bicycle safety tail flashlight waterproof lamp back rear light
	0.4234123521 - gold a cree led front bicycle light bike headlamp headlight kit
	0.4234123521 - lumen cree led flashlight torch cycling bicycle bike head light mount

Bottom 10 from recall set of 50 items:
	0.2201121937 - skull bicycle tire led flash light valve dust cap car motor tyre white qgs
	0.2054825474 - cycling bicycle led solar energy usb rechargeable bike headlight l

In [42]:
qry = 'tricycle'
# Keep only the titles that were returned for this query
corpus_q = [doc for doc, q in zip(corpus, queries) if q == qry]
results = get_results_bm25(qry, corpus_q)
print_results(corpus_q, results,10)
print_results(corpus_q, results,10,head=False)


Top 10 from recall set of 47 items:
	0.0594037897 - blue kent tricycle
	0.0594037897 - water tricycle photo
	0.0561880579 - tricycle pick your ride cycle dvd tricycle pick your ride cycle
	0.0524260540 - tricycle patent print green chalkboard
	0.0524260540 - vintage tricycle collar sidewalk bicycle
	0.0524260540 - angeles myrider space buggy tricycle
	0.0495178040 - wrought iron tricycle planter whatever lol
	0.0469152565 - boy on tricycle vinyl sticker wall art
	0.0469152565 - hungary stampday early postman motorized tricycle mnh
	0.0469152565 - e flite to size tricycle electric retracts

Bottom 10 from recall set of 47 items:
	0.0371520927 - dual deck tricycle girl s pink chrome bell classic streamers adjustable outdoor
	0.0371520927 - tricycle hollow hub bearing for shaft beach cruiser low rider bicycle trike
	0.0371520927 - novelty dress it up buttons red fire engine tricycle cement truck dinosaur
	0.0371520927 - railway tricycle patent print white in a honey red oak wood frame
	0