# Introduction to Learning to Rank with Pyterrier

This notebook covers the basic steps to set up a Learning to Rank (LTR) pipeline with Pyterrier. After the setup, we take a quick look at "pipes" to understand what kind of features can generally be used to train LTR methods.

Afterward, we have a look at Pyterrier's support of common Machine Learning (ML) software frameworks - `scikit-learn`, `xgboost`, and `lightgbm`.

Finally, we have a look at how you can define custom features. LTR is not only about ML methods; it also involves feature engineering. For this reason, we will add the "number of authors" from the previous notebook as an additional feature to our LTR pipeline.

## Setup

Install Pyterrier.

In [None]:
!pip install python-terrier

Imports.

In [3]:
import os
import numpy as np
import pandas as pd
import pyterrier as pt
if not pt.started():
  pt.init()

Download and index the dataset, get topics and qrels.

In [5]:
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
pt_index_path = './indices/cord19'

if not os.path.exists(pt_index_path + "/data.properties"):
  indexer = pt.index.IterDictIndexer(pt_index_path, blocks=True)
  index_ref = indexer.index(dataset.get_corpus_iter(), 
                            fields=['title', 'doi', 'abstract'], 
                            meta=('docno',))
  
else:
  index_ref = pt.IndexRef.of(pt_index_path + "/data.properties")
  
index = pt.IndexFactory.of(index_ref)
topics = dataset.get_topics('title')
qrels = dataset.get_qrels()

[INFO] [starting] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml: [00:00] [18.7kB] [8.59MB/s]
[INFO] [starting] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt: [00:00] [1.14MB] [3.65MB/s]


## Quick intro to pipes.

Later, we set up a two-stage ranking pipeline with a BM25-based first-stage ranker whose outputs are reranked by an ML method using predefined features. In the example below, we set up three different batch retrievers.

In [189]:
BM25 = pt.BatchRetrieve(index, controls = {"wmodel": "BM25"})
TF_IDF =  pt.BatchRetrieve(index, controls = {"wmodel": "TF_IDF"})
PL2 =  pt.BatchRetrieve(index, controls = {"wmodel": "PL2"})

We build a `pipe` in which the BM25 results are passed on (`>>`) to TF-IDF and PL2. The scores of these two batch retrievers are combined (`**`) into one feature vector per document, so they serve as our features.

In [190]:
pipe = BM25 >> (TF_IDF ** PL2)

Let's use an example query and look at the output. The results are ranked by the BM25 score, and the `features` column contains a 1-D array with the TF-IDF and PL2 scores. Later on, we use these BM25 candidates together with their feature representations.

In [191]:
pipe.transform("coronavirus immunity")




Unnamed: 0,qid,docid,docno,rank,score,query,features
0,1,187945,sp212tai,0,10.118259,coronavirus immunity,"[6.262270748212185, 6.300463024778679]"
1,1,126990,e1mw9lx1,1,10.001470,coronavirus immunity,"[6.221824283576121, 5.9616508350637245]"
2,1,179948,ltmuw6f8,2,9.974369,coronavirus immunity,"[6.204301824595198, 5.9242165698170055]"
3,1,156456,1oruu33o,3,9.955978,coronavirus immunity,"[6.261722521435452, 6.080975784142439]"
4,1,94922,5jl6ltfj,4,9.734640,coronavirus immunity,"[6.0955155779479995, 5.642759107219646]"
...,...,...,...,...,...,...,...
1073,1,107309,t6gqa48n,995,7.153615,coronavirus immunity,"[4.490926556339275, 3.312256733881456]"
1074,1,118090,cij94qxl,996,7.153615,coronavirus immunity,"[4.490926556339275, 3.312256733881456]"
1075,1,142138,bxz9278z,997,7.152973,coronavirus immunity,"[4.377525236902022, 3.3080236155861185]"
1076,1,66073,x9piyivm,998,7.152920,coronavirus immunity,"[4.559628851404359, 3.5124279884117215]"
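To make the structure of the `features` column more tangible, here is a minimal conceptual sketch in plain Python. It is not Pyterrier's implementation: `first_stage`, `scorer_a`, `scorer_b`, and `feature_union` are hypothetical stand-ins with made-up scores. It only illustrates the idea that every candidate from the first stage is scored by each feature scorer and that the scores are collected in one vector per document.

```python
# Conceptual sketch only: toy functions stand in for BM25, TF_IDF and PL2.

def first_stage(query):
    """Hypothetical first-stage ranker: returns (docno, score) candidates."""
    return [("d1", 10.1), ("d2", 9.7), ("d3", 8.2)]

def scorer_a(query, docno):
    """Hypothetical feature scorer (plays the role of TF_IDF)."""
    return {"d1": 6.2, "d2": 6.0, "d3": 4.4}[docno]

def scorer_b(query, docno):
    """Hypothetical feature scorer (plays the role of PL2)."""
    return {"d1": 6.3, "d2": 5.6, "d3": 3.3}[docno]

def feature_union(query, candidates, scorers):
    """Attach one feature vector per candidate, keeping the first-stage order."""
    rows = []
    for rank, (docno, score) in enumerate(candidates):
        features = [scorer(query, docno) for scorer in scorers]
        rows.append({"docno": docno, "rank": rank, "score": score, "features": features})
    return rows

query = "coronavirus immunity"
rows = feature_union(query, first_stage(query), [scorer_a, scorer_b])
for row in rows:
    print(row)
```

The ranking and the `score` still come from the first stage; the feature scorers only contribute the entries of `features`.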


As an alternative to the previous code snippets, we can also create a `FeaturesBatchRetrieve` object.

In [193]:
fbr = pt.FeaturesBatchRetrieve(index, controls = {"wmodel": "BM25"}, features=["WMODEL:TF_IDF", "WMODEL:PL2"]) 
(fbr % 5).search("coronavirus immunity")

Unnamed: 0,qid,query,docid,rank,features,docno,score
0,1,coronavirus immunity,187945,0,"[4.540221677858543, 3.477440707307063]",sp212tai,10.118259
1,1,coronavirus immunity,126990,1,"[4.377525236902022, 3.3080236155861185]",e1mw9lx1,10.00147
2,1,coronavirus immunity,179948,2,"[4.559628851404359, 3.5124279884117215]",ltmuw6f8,9.974369
3,1,coronavirus immunity,156456,3,"[4.600248221319108, 3.649953983552013]",1oruu33o,9.955978
4,1,coronavirus immunity,94922,4,"[4.490926556339275, 3.312256733881456]",5jl6ltfj,9.73464


## Learning-to-rank

Now we can have a look at the actual ML methods and Pyterrier's support for well-known ML software frameworks.

As it is crucial to avoid leakage between training and test data, we have to split our topics into training, validation, and test sets. In this introduction, we use a simple "static" 60/20/20 split. For more reliable evaluations, however, you should consider [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)), i.e., evaluating over several different training/test splits.

In [34]:
train_topics, validation_topics, test_topics = np.split(topics, [int(.6*len(topics)), int(.8*len(topics))])
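As a rough illustration of the cross-validation idea mentioned above, the following sketch partitions a list of query ids into k folds so that every topic is used for testing exactly once. The helper `k_fold_splits` and the toy query ids are assumptions made for this example; they are not part of Pyterrier.

```python
def k_fold_splits(qids, k):
    """Yield (train_qids, test_qids) pairs; each qid lands in exactly one test fold."""
    # Distribute the qids round-robin over k folds.
    folds = [qids[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        # All remaining folds together form the training set.
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train, test_fold

toy_qids = [str(i) for i in range(1, 11)]  # toy query ids "1" .. "10"
splits = list(k_fold_splits(toy_qids, k=5))
for train_qids, test_qids in splits:
    print("train:", train_qids, "test:", test_qids)
```

With a topics DataFrame, each fold could then be selected with something like `topics[topics['qid'].isin(test_qids)]`, and the evaluation scores would be averaged over the folds.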

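Likewise, some ML methods require the qrels to be split accordingly. The code below selects the qrels by the minimum and maximum query id of each topic subset, which assumes that each subset covers a contiguous range of query ids.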

In [64]:
train_min = train_topics['qid'].astype(int).min()
train_max = train_topics['qid'].astype(int).max()
train_qrels = qrels[(qrels['qid'].astype(int) >= train_min) & (qrels['qid'].astype(int) <= train_max)]

val_min = validation_topics['qid'].astype(int).min()
val_max = validation_topics['qid'].astype(int).max()
validation_qrels = qrels[(qrels['qid'].astype(int) >= val_min) & (qrels['qid'].astype(int) <= val_max)]

test_min = test_topics['qid'].astype(int).min()
test_max = test_topics['qid'].astype(int).max()
test_qrels = qrels[(qrels['qid'].astype(int) >= test_min) & (qrels['qid'].astype(int) <= test_max)]
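If the topic subsets do not cover contiguous id ranges (for example after shuffling), a range filter would pick up judgments from other subsets. A more general alternative is to filter by membership. The sketch below shows the idea on toy data in plain Python; `split_qrels` and the toy judgments are hypothetical names and values used only for illustration.

```python
def split_qrels(qrels, qids):
    """Keep only the judgments whose qid belongs to the given topic ids."""
    wanted = {str(q) for q in qids}
    return [row for row in qrels if str(row["qid"]) in wanted]

# Toy relevance judgments (made-up values).
toy_qrels = [
    {"qid": "1", "docno": "a", "label": 2},
    {"qid": "2", "docno": "b", "label": 0},
    {"qid": "3", "docno": "c", "label": 1},
    {"qid": "1", "docno": "d", "label": 0},
]

train_part = split_qrels(toy_qrels, ["1", "3"])  # non-contiguous ids work as well
test_part = split_qrels(toy_qrels, ["2"])
print(len(train_part), len(test_part))
```

On DataFrames, the same selection can be written as `qrels[qrels['qid'].isin(train_topics['qid'])]`, provided both `qid` columns have the same dtype.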

### `scikit-learn`

First of all, we have a look at how Pyterrier supports [`scikit-learn`](https://scikit-learn.org/stable/). In the example below, we use Random Forest regression, Logistic regression, and Support Vector regression, but the framework supports more ML methods. Please have a look at the documentation.

In [198]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn import svm

# Create the regressor object.
rf = RandomForestRegressor(n_estimators=400)
# Pipe the outputs of the first stage ranking and the corresponding features into the regressor.
rf_pipe = fbr >> pt.ltr.apply_learned_model(rf)
# Fit the regressor with the given documents corresponding to the training topics.
rf_pipe.fit(train_topics, qrels)

# Logistic regression (default parametrization, have a look at the documentation to specify hyperparameters)
lr = LogisticRegression()
lr_pipe = fbr >> pt.ltr.apply_learned_model(lr)
lr_pipe.fit(train_topics, qrels)

# Support Vector regression (default parametrization, have a look at the documentation to specify hyperparameters)
svr = svm.SVR()
svr_pipe = fbr >> pt.ltr.apply_learned_model(svr)
svr_pipe.fit(train_topics, qrels)

# Determine the results with the help of the test topics
results = pt.Experiment([PL2, rf_pipe, lr_pipe, svr_pipe], test_topics, qrels, ["map"], names=["PL2 (Baseline)", "Random forest", "Logistic regression", "Support vector regression"])
results

Unnamed: 0,name,map
0,PL2 (Baseline),0.313517
1,Random forest,0.143222
2,Logistic regression,0.132631
3,Support vector regression,0.120909
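For reference, the `map` measure is the mean of the per-topic average precision (AP). The sketch below computes it for toy rankings, assuming binary relevance; the function names and data are illustrative, and the evaluation library used by `pt.Experiment` may handle details such as graded labels or unjudged topics differently.

```python
def average_precision(ranked_docnos, relevant):
    """AP for one topic: mean of precision@i over the ranks i that hold a relevant document."""
    hits = 0
    precision_sum = 0.0
    for i, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / i
    # Relevant documents that were never retrieved contribute a precision of 0.
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, rels):
    """runs: qid -> ranked docnos; rels: qid -> set of relevant docnos."""
    return sum(average_precision(runs[q], rels.get(q, set())) for q in runs) / len(runs)

toy_runs = {"1": ["a", "b", "c", "d"], "2": ["x", "y"]}
toy_rels = {"1": {"a", "c"}, "2": {"y"}}
toy_map = mean_average_precision(toy_runs, toy_rels)
print(round(toy_map, 4))
```

For topic 1 the relevant documents appear at ranks 1 and 3, giving AP = (1/1 + 2/3) / 2; for topic 2 the single relevant document is at rank 2, giving AP = 1/2.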


### LambdaMART and Gradient Boosting with `xgboost` and `lightgbm`

[LambdaMART](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf) is a well-known and established LTR technique. It is implemented in both `xgboost` and `lightgbm` (both packages are pre-installed on Colab). In the example below, both rankers use the nDCG measure in their objective function. Besides topics and qrels for training, `fit()` also requires topics and qrels for validation, which are used to evaluate the model during training. Apart from that, the implementation is similar to the previous example. For more details, please have a look at the linked paper.
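Since nDCG plays a central role here, the following sketch shows how nDCG@k can be computed for a single ranked list of relevance labels. It assumes the variant with exponential gain (2^rel - 1) and a log2 rank discount; other variants (for example with linear gain) exist, and the libraries may differ in details.

```python
import math

def dcg_at_k(labels, k):
    """Discounted cumulative gain of the first k labels (exponential gain, log2 discount)."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels[:k]))

def ndcg_at_k(labels, k):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

# Relevance labels of the documents in ranked order (toy values).
print(ndcg_at_k([2, 1, 0], k=3))  # already ideally ordered
print(ndcg_at_k([0, 1], k=2))     # the relevant document is ranked second
```

A ranking that places the most relevant documents first reaches an nDCG of 1, and the score drops as relevant documents move down the list.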


In [65]:
import xgboost as xgb

lmart_x = xgb.sklearn.XGBRanker(objective='rank:ndcg',
      learning_rate=0.1,
      gamma=1.0,
      min_child_weight=0.1,
      max_depth=6,
      verbose=2,
      random_state=42)

lmart_x_pipe = fbr >> pt.ltr.apply_learned_model(lmart_x, form="ltr")
lmart_x_pipe.fit(train_topics, train_qrels, validation_topics, validation_qrels)

import lightgbm as lgb

lmart_l = lgb.LGBMRanker(task="train",
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=100,
    max_bin=255,
    num_leaves=7,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[1, 3, 5, 10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=10)
lmart_l_pipe = fbr >> pt.ltr.apply_learned_model(lmart_l, form="ltr")
lmart_l_pipe.fit(train_topics, train_qrels, validation_topics, validation_qrels)

pt.Experiment(
    [PL2, lmart_x_pipe, lmart_l_pipe],
    test_topics,
    test_qrels,
    ["map"],
    names=["PL2 Baseline", "LambdaMART (xgBoost)", "LambdaMART (LightGBM)" ]
)



[1]	valid_0's ndcg@1: 0.2
[2]	valid_0's ndcg@1: 0
[3]	valid_0's ndcg@1: 0
[4]	valid_0's ndcg@1: 0
[5]	valid_0's ndcg@1: 0.0333333
[6]	valid_0's ndcg@1: 0.0333333
[7]	valid_0's ndcg@1: 0.0333333
[8]	valid_0's ndcg@1: 0.133333
[9]	valid_0's ndcg@1: 0.133333
[10]	valid_0's ndcg@1: 0.133333


Unnamed: 0,name,map
0,PL2 Baseline,0.313517
1,LambdaMART (xgBoost),0.165912
2,LambdaMART (LightGBM),0.172132


## Custom features a.k.a. "feature engineering"

In the earlier examples, we used the scores of lexical matching functions as features. However, in our case, we might want to include other bibliometric or network-based metadata as additional features. In the following example, we include the "number of authors" from the previous notebook as an LTR feature.

First of all, we have to load the metadata file into a DataFrame.

In [195]:
# '/root/.ir_datasets/cord19/2020-07-16/metadata.csv' on Google Colab
metadata = pd.read_csv('/root/.ir_datasets/cord19/2020-07-16/metadata.csv', low_memory=False)

Afterward, we use the code from the previous notebook in a slightly modified way to determine the raw author count, which is added as an additional feature to the TF-IDF and PL2 scores. If you do not want to use the TF-IDF and PL2 features, simply use a plain `BatchRetrieve` instead (see the commented-out lines) or modify the combination of features in `_features()`.

The output shows that there is an additional third feature that represents the number of authors.

In [196]:
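# For faster access, write the author information into a dictionary (cord_uid -> raw author string).
author_dict = dict(zip(metadata['cord_uid'], metadata['authors']))

def authors(docno):
    """Return the number of authors of a document (1 if no author string is available)."""
    # .get() returns None for unknown docnos instead of raising a KeyError.
    raw_authors = author_dict.get(docno)
    # Authors are stored as one ';'-separated string; missing values are NaN (a float).
    if isinstance(raw_authors, str):
        return len(raw_authors.split(';'))

    return 1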

def _features(row):
    f1 = authors(row["docno"])
    features = np.append(row['features'], np.array([f1]))
    return features

fbr = pt.FeaturesBatchRetrieve(index, controls = {"wmodel": "BM25"}, features=["WMODEL:TF_IDF", "WMODEL:PL2"]) 
# br = pt.BatchRetrieve(index, wmodel="BM25") >> pt.apply.doc_features(_features)

p = fbr >> pt.apply.doc_features(_features)
# p = br >> pt.apply.doc_features(_features)

p.transform("coronavirus immunity")



Unnamed: 0,qid,query,docid,rank,features,docno,score
0,1,coronavirus immunity,187945,0,"[4.540221677858543, 3.477440707307063, 6.0]",sp212tai,10.118259
1,1,coronavirus immunity,126990,1,"[4.377525236902022, 3.3080236155861185, 5.0]",e1mw9lx1,10.001470
2,1,coronavirus immunity,179948,2,"[4.559628851404359, 3.5124279884117215, 5.0]",ltmuw6f8,9.974369
3,1,coronavirus immunity,156456,3,"[4.600248221319108, 3.649953983552013, 3.0]",1oruu33o,9.955978
4,1,coronavirus immunity,94922,4,"[4.490926556339275, 3.312256733881456, 3.0]",5jl6ltfj,9.734640
...,...,...,...,...,...,...,...
995,1,coronavirus immunity,107309,995,"[5.246637878274745, 4.376124849935644, 1.0]",t6gqa48n,7.153615
996,1,coronavirus immunity,118090,996,"[5.61520763991354, 5.186165183189388, 16.0]",cij94qxl,7.153615
997,1,coronavirus immunity,142138,997,"[5.379220010061081, 4.61397822144015, 20.0]",bxz9278z,7.152973
998,1,coronavirus immunity,66073,998,"[6.0197830061095505, 5.485601944147103, 26.0]",x9piyivm,7.152920
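The mechanics of extending a feature vector can also be tried out without an index. The sketch below uses toy rows and a toy author lookup; `count_authors`, `extend_features`, and all values are hypothetical and only mirror the idea of appending the author count to the existing features, including the fallback of 1 when no author string is available.

```python
def count_authors(raw_authors):
    """Count ';'-separated authors; fall back to 1 if the value is not a string."""
    if isinstance(raw_authors, str):
        return len(raw_authors.split(";"))
    return 1

def extend_features(row, author_lookup):
    """Return the row's features with the author count appended as a float."""
    n_authors = count_authors(author_lookup.get(row["docno"]))
    return list(row["features"]) + [float(n_authors)]

# Toy metadata: one proper author string, one missing value (NaN).
toy_authors = {"d1": "Doe, J.; Roe, R.; Poe, E.", "d2": float("nan")}
toy_rows = [
    {"docno": "d1", "features": [4.5, 3.5]},
    {"docno": "d2", "features": [4.4, 3.3]},
    {"docno": "d3", "features": [4.6, 3.6]},  # not in the metadata at all
]

extended = [extend_features(row, toy_authors) for row in toy_rows]
for features in extended:
    print(features)
```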


Let's train and compare the same regressor with different feature sets.

In [197]:
rf = RandomForestRegressor(n_estimators=400)
rf_pipe = fbr >> pt.ltr.apply_learned_model(rf)
rf_pipe.fit(train_topics, qrels)

rfa = RandomForestRegressor(n_estimators=400)
rfa_pipe = fbr >> pt.apply.doc_features(_features) >> pt.ltr.apply_learned_model(rfa)
rfa_pipe.fit(train_topics, qrels)

results = pt.Experiment([PL2, rf_pipe, rfa_pipe], test_topics, qrels, ["map"], names=["PL2 (Baseline)", "Random forest (without authors)", "Random forest (with authors)"])
results

Unnamed: 0,name,map
0,PL2 (Baseline),0.313517
1,Random forest (without authors),0.142471
2,Random forest (with authors),0.136353


Unfortunately, the "number of authors" feature did not improve the retrieval performance. Now it is up to you to find some reasonable and useful features! :)

### Additional resources 

Pyterrier
- [Pyterrier documentation: Learning to Rank](https://pyterrier.readthedocs.io/en/latest/ltr.html#introduction)
- [Pyterrier documentation: apply.doc_features](https://pyterrier.readthedocs.io/en/latest/apply.html#pyterrier.apply.doc_features)
- [Pyterrier Notebook: Learning to Rank](https://colab.research.google.com/github/terrier-org/pyterrier/blob/master/examples/notebooks/ltr.ipynb#scrollTo=YTI_ax4K19nl)

Literature
- [Paper: From RankNet to LambdaRank to LambdaMART: An Overview](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf)
- [Wikipedia: Cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

Software
- [scikit-learn](https://scikit-learn.org/stable/)
- [lightgbm](https://lightgbm.readthedocs.io)
- [xgboost](https://xgboost.readthedocs.io/)
- [fastrank (another interesting LTR toolkit)](https://github.com/jjfiv/fastrank)