# Assignment 3 - Part 3 solution

This notebook focuses only on the learning-to-rank (LTR) part of the assignment. It takes pre-computed feature vectors for query-document pairs as input. Importantly, query-document pairs are sorted by the BM25 score (feature #0).

You should be able to run this notebook as is. If you upload the resulting `ltr2.txt` output file to Kaggle, it'll give you an NDCG@20 score of 0.10044.
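For reference, NDCG@k for a single query can be sketched as follows. This is a minimal sketch that assumes the common exponential-gain formulation (gain `2^rel - 1` with a `log2` discount); the exact variant Kaggle uses for this competition is not stated in this notebook, and the names `dcg_at_k` and `ndcg_at_k` are illustrative.

```python
import math


def dcg_at_k(rels, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(ranked_rels, k=20, all_rels=None):
    """NDCG@k for one query.

    ranked_rels: relevance labels in the order the system ranked the docs.
    all_rels: labels of all judged docs for the query (assumption: used to
              build the ideal ranking); defaults to ranked_rels.
    """
    pool = all_rels if all_rels is not None else ranked_rels
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:  # no relevant documents for this query
        return 0.0
    return dcg_at_k(ranked_rels, k) / ideal_dcg
```

The per-query values would then be averaged over all queries to obtain a single score.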

In [1]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor

## Utility functions

Loading queries

In [2]:
def load_queries(query_file):
    """Loads queries from a file with one "<query ID> <query text>" pair per line."""
    queries = {}
    with open(query_file, "r") as fin:
        for line in fin:
            qid, query = line.strip().split(" ", 1)
            queries[qid] = query
    return queries

**Loading features file**

  - `ignore` contains a list of feature IDs that are ignored when loading
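  - `max_docs` is the maximum number of docs loaded per query; labeled docs are always loaded, even beyond this limit (mind that you should sort instances beforehand if you want to use this, since only the first `max_docs` lines of each query are kept)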

In [3]:
def load_features(features_file, ignore=(), max_docs=10000):
    """Loads feature vectors, labels, query IDs, and doc IDs from a features file."""
    X, y, qids, doc_ids = [], [], [], []
    with open(features_file, "r") as f:
        s_qid = None  # query ID of the previous line
        n = 0  # number of docs seen so far for the current query
        for line in f:
            items = line.strip().split()
            label = 0 if items[0] == "?" else int(items[0])
            qid = items[1]
            doc_id = items[2]
            # Each feature item has the form "<feature ID>:<value>"
            values = [float(item.split(":")[1]) for item in items[3:]]
            features = [v for j, v in enumerate(values) if j not in ignore]

            if s_qid != qid:  # new query seen
                s_qid = qid
                n = 0
            n += 1

            # Keep the first max_docs docs per query, plus any labeled doc
            if (n <= max_docs) or (items[0] != "?"):
                X.append(features)
                y.append(label)
                qids.append(qid)
                doc_ids.append(doc_id)

    return X, y, qids, doc_ids
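To make the expected line format concrete, here is a minimal sketch of parsing a single line the way `load_features` does. The helper name `parse_feature_line` and the sample line (doc ID and feature values) are made up for illustration; only the `label qid doc_id id:value ...` layout is taken from the loader above.

```python
def parse_feature_line(line):
    """Parse one line of the features file (hypothetical helper)."""
    items = line.strip().split()
    labeled = items[0] != "?"
    label = int(items[0]) if labeled else 0  # missing label -> non-relevant
    qid, doc_id = items[1], items[2]
    # Each remaining item looks like "<feature ID>:<value>".
    features = [float(item.split(":", 1)[1]) for item in items[3:]]
    return label, labeled, qid, doc_id, features


# Made-up example line; the IDs and values are illustrative only.
sample = "? 251 doc-0001 0:12.5 1:-1 2:0.25"
print(parse_feature_line(sample))
```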

Min-max feature normalization

In [4]:
def minmax_norm(features, qids):
    """Normalizes all features per query, in place. Values of -1 are skipped."""
    n = len(features[0])  # number of features
    for qid in set(qids):
        min_x = [float("inf")] * n  # larger than any feature value
        max_x = [float("-inf")] * n  # smaller than any feature value

        # finding min and max values        
        for i in range(len(features)):
            if qids[i] == qid:
                for j in range(n):
                    x = features[i][j]
                    if x != -1:
                        if x < min_x[j]:
                            min_x[j] = x
                        if x > max_x[j]:
                            max_x[j] = x

        # applying normalization
        for i in range(len(features)):
            if qids[i] == qid:
                for j in range(n):
                    if max_x[j] == min_x[j]:  # no norm for that feature
                        continue
                    x = features[i][j]
                    if x != -1:
                        features[i][j] = (x - min_x[j]) / (max_x[j] - min_x[j])
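The same per-query normalization can also be written with NumPy. The following is a minimal sketch, not a drop-in replacement: the name `minmax_norm_np` and the `missing` parameter are assumptions, and unlike the function above it returns a new array instead of modifying its input. As above, entries equal to -1 are skipped and features with a constant value within a query are left unchanged.

```python
import numpy as np


def minmax_norm_np(features, qids, missing=-1):
    """Per-query min-max normalization; returns a new float array."""
    X = np.asarray(features, dtype=float)
    qids = np.asarray(qids)
    out = X.copy()
    for qid in np.unique(qids):
        rows = qids == qid
        block = X[rows]
        valid = block != missing
        # Ignore skipped values when computing the per-feature range.
        lo = np.where(valid, block, np.inf).min(axis=0)
        hi = np.where(valid, block, -np.inf).max(axis=0)
        span = hi - lo
        usable = np.isfinite(span) & (span > 0)
        # Placeholders avoid division by zero for features that are not scaled.
        safe_span = np.where(usable, span, 1.0)
        safe_lo = np.where(np.isfinite(lo), lo, 0.0)
        out[rows] = np.where(valid & usable, (block - safe_lo) / safe_span, block)
    return out


X_demo = [[1, -1], [3, 5], [10, 7]]
print(minmax_norm_np(X_demo, ["a", "a", "b"]))
```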

## Pointwise LTR class

In [5]:
class PointWiseLTRModel(object):
    def __init__(self, regressor):
        """
        :param regressor: an instance of scikit-learn regressor
        """
        self.regressor = regressor
        self.model = None  # set by _train()

    def _train(self, X, y):
        """
        Trains an LTR model.
        :param X: features of training instances
        :param y: relevance assessments of training instances
        :return:
        """
        assert self.regressor is not None
        self.model = self.regressor.fit(X, y)

    def rank(self, ft, doc_ids):
        """
        Predicts relevance labels and ranks documents for a given query.
        :param ft: a list of features for query-doc pairs
        :param doc_ids: a list of document ids
        :return: a list of (doc_id, predicted label) pairs, sorted by predicted label in descending order
        """
        assert self.model is not None
        rel_labels = self.model.predict(ft)
        results = []
        for i in range(len(doc_ids)):
            results.append((doc_ids[i], rel_labels[i]))

        return sorted(results, key=lambda x: x[1], reverse=True)
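One detail of `rank` is worth spelling out: Python's `sorted` is stable, also with `reverse=True`, so documents with equal predicted scores keep their input order. Since the input is sorted by BM25 score, ties fall back to the BM25 ordering. The sketch below illustrates this with made-up doc IDs and scores; `rank_by_score` is a hypothetical stand-in for the sorting step only.

```python
def rank_by_score(doc_ids, scores):
    """Sort documents by predicted score, highest first (stable for ties)."""
    return sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)


# Made-up doc IDs listed in BM25 order, with made-up predicted scores.
doc_ids = ["d1", "d2", "d3", "d4"]
scores = [0.2, 0.7, 0.2, 0.7]
ranking = rank_by_score(doc_ids, scores)
print(ranking)  # [('d2', 0.7), ('d4', 0.7), ('d1', 0.2), ('d3', 0.2)]
```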

## Main

### Training the model

The `features.txt` file contains the top-200 documents retrieved using BM25 on the content field. The first number in this file is the relevance label according to the ground truth (`qrels.csv`) if the query-document pair is present there; otherwise, the value is "?". We will treat documents with missing relevance labels as non-relevant (0).

In [6]:
FEATURES_FILE = "data/features_sorted.txt"

In [7]:
FEATURES = ["BM25 content score", "BM25 title score", "BM25 anchors score", "LM content score", 
            "LM title score", "LM anchors score", "query length", "sum query term IDF in content", 
            "avg query term IDF in content", "sum query term IDF in title", 
            "avg query term IDF in title", "sum query term IDF in anchors", 
            "avg query term IDF in anchors", "PageRank", "content field length", 
            "title field length", "anchors field length", "content sum TFIDF", "title sum TFIDF", 
            "anchors sum TFIDF"]

In [8]:
train_X, train_y, train_qids, train_doc_ids = load_features(FEATURES_FILE)

In [9]:
minmax_norm(train_X, train_qids)

In [10]:
clf = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0)
ltr = PointWiseLTRModel(clf)
ltr._train(train_X, train_y)

Output feature importance according to the learned model

In [11]:
imp = zip(FEATURES, clf.feature_importances_)
imp_sorted = sorted(imp, key=lambda imps: imps[1], reverse=True)
print("\n".join(["{:30}: {}".format(k,v) for k, v in imp_sorted]))

BM25 title score              : 0.22798805739892966
BM25 content score            : 0.13885445479937622
content sum TFIDF             : 0.12117450414966097
LM title score                : 0.1135795023442336
sum query term IDF in title   : 0.09042303921443162
content field length          : 0.0687523840711894
BM25 anchors score            : 0.0533051958591014
sum query term IDF in anchors : 0.05208598514389976
LM content score              : 0.032096002884159223
avg query term IDF in anchors : 0.02443032849236756
avg query term IDF in content : 0.02331145185636168
anchors sum TFIDF             : 0.013677520504176012
sum query term IDF in content : 0.009995155621257669
title field length            : 0.008774318029963708
title sum TFIDF               : 0.0069109412495633234
LM anchors score              : 0.006778708992783746
anchors field length          : 0.003698505015740769
PageRank                      : 0.0036012033532626707
query length                  : 0.0005627410195409994
avg

**NOTE** One key thing to notice here is that the BM25 title score is considered the most important feature. Using a single-field BM25 baseline on the training query set, the NDCG@20 scores on the title and content fields are 0.1106 and 0.1290, respectively. On the test query set, however, the scores reported by Kaggle are 0.03336 for the title field and 0.09268 for the content field. Notice that the title field works much worse on the test query set. (That is why both the title and content baselines were made available on Kaggle, so that you can spot this.) Thus, a learned model that assigns very high importance to the title field is not expected to work well on the test data. Therefore, we drop the "BM25 title" (#1) and "LM title" (#4) features from our feature set.

After removing these two title features, we find that "BM25 anchors" (#2) ends up as the top feature. We find experimentally that removing it as well helps a bit further. This suggests that the training and test data have slightly different characteristics, and that we would otherwise be overfitting on the training data.

We load the features again (except features #1, #2, and #4) and train a new model.

In [12]:
features_drop = [1, 2, 4]

In [13]:
for f in sorted(features_drop, reverse=True):
    FEATURES.pop(f)
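Note that popping from `FEATURES` modifies the list in place, so running that cell a second time would remove three more names and the labels would no longer line up with the feature columns. A non-mutating alternative can be sketched as follows; the helper name `drop_by_index` is an assumption, and the shortened name list is for illustration only.

```python
def drop_by_index(names, drop):
    """Return a new list without the entries at the given indices."""
    drop = set(drop)
    return [name for idx, name in enumerate(names) if idx not in drop]


all_names = ["BM25 content score", "BM25 title score", "BM25 anchors score",
             "LM content score", "LM title score", "LM anchors score"]
kept = drop_by_index(all_names, [1, 2, 4])
print(kept)  # ['BM25 content score', 'LM content score', 'LM anchors score']
```

Because `all_names` is left untouched, the cell can be rerun safely.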

In [14]:
train_X, train_y, train_qids, train_doc_ids = load_features(FEATURES_FILE, features_drop)

In [15]:
minmax_norm(train_X, train_qids)

In [16]:
clf = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0)
ltr = PointWiseLTRModel(clf)
ltr._train(train_X, train_y)

Feature importances for our updated model

In [17]:
imp = zip(FEATURES, clf.feature_importances_)
imp_sorted = sorted(imp, key=lambda imps: imps[1], reverse=True)
print("\n".join(["{:30}: {}".format(k,v) for k, v in imp_sorted]))

BM25 content score            : 0.2660943723057572
LM anchors score              : 0.22969573699217333
sum query term IDF in anchors : 0.1453840894831454
content field length          : 0.10866177320580166
sum query term IDF in title   : 0.10798806527648358
content sum TFIDF             : 0.07006541937063403
LM content score              : 0.017759206525096313
avg query term IDF in content : 0.01378558972528977
sum query term IDF in content : 0.012020659268348596
PageRank                      : 0.010507015533364005
anchors field length          : 0.008601025562756796
anchors sum TFIDF             : 0.0045592198415815265
avg query term IDF in anchors : 0.00232070100566144
title sum TFIDF               : 0.0018263705085941981
avg query term IDF in title   : 0.0006495009408379531
query length                  : 8.12544544743667e-05
title field length            : 0.0


### Applying the model on unseen queries

In [18]:
QUERY2_FILE = "data/queries2.txt"
FEATURES2_FILE = "data/features2_sorted.txt"
OUTPUT_FILE = "data/ltr2.txt"
TOP_DOCS = 20  # this many top docs to write to output file

In [19]:
queries = load_queries(QUERY2_FILE)

**NOTE** Our final trick is here. Instead of reranking the top-200 documents for a new, unseen query, we only rerank the top-50. We trust the baseline BM25 ranker and play it a bit safer. (This makes a difference of about 0.007 NDCG@20 on Kaggle: just that extra final bit that you needed.)

In [20]:
X, _, qids, doc_ids = load_features(FEATURES2_FILE, features_drop, max_docs=50)
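The effect of `max_docs` on an unlabeled file can be sketched in isolation. This is a simplified illustration, assuming rows are already grouped by query and sorted by BM25 score within each query; the helper name `cap_per_query` and the toy rows are made up, and the sketch leaves out the rule that labeled documents are always kept.

```python
from itertools import groupby


def cap_per_query(rows, max_docs):
    """Keep the first max_docs rows of each query.

    rows: (qid, doc_id) pairs, grouped by query and sorted by BM25 score
          within each query.
    """
    kept = []
    for _, group in groupby(rows, key=lambda row: row[0]):
        kept.extend(list(group)[:max_docs])
    return kept


rows = [("251", "a"), ("251", "b"), ("251", "c"), ("252", "d"), ("252", "e")]
print(cap_per_query(rows, 2))
```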

In [21]:
minmax_norm(X, qids)

Apply model and write results to output file

In [22]:
with open(OUTPUT_FILE, "w") as fout:
    
    fout.write("QueryId,DocumentId\n")
    
    for qid, query in sorted(queries.items()):
        print("Ranking query #{} '{}'".format(qid, query))

        fts = []
        ids = []
        
        for i in range(len(X)):
            if qids[i] == qid:
                fts.append(X[i])
                ids.append(doc_ids[i])
                
        # Get ranking
        r = ltr.rank(fts, ids)
        # Write the top-ranked documents to file
        for doc_id, score in r[:TOP_DOCS]:
            fout.write(qid + "," + doc_id + "\n")

Ranking query #251 'identifying spider bites'
Ranking query #252 'history of orcas island'
Ranking query #253 'tooth abscess'
Ranking query #254 'barrett's esophagus'
Ranking query #255 'teddy bears'
Ranking query #256 'patron saint of mental illness'
Ranking query #257 'holes by louis sachar'
Ranking query #258 'hip roof'
Ranking query #259 'carpenter bee'
Ranking query #260 'the american revolutionary'
Ranking query #261 'folk remedies sore throat'
Ranking query #262 'balding cure'
Ranking query #263 'evidence for evolution'
Ranking query #264 'tribe formerly living in alabama'
Ranking query #265 'F5 tornado'
Ranking query #266 'symptoms of heart attack'
Ranking query #267 'feliz navidad lyrics'
Ranking query #268 'benefits of running'
Ranking query #269 'marshall county schools'
Ranking query #270 'sun tzu'
Ranking query #271 'halloween activities for middle school'
Ranking query #272 'dreams interpretation'
Ranking query #273 'wilson's disease'
Ranking query #274 'golf instruction'
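The output loop above rescans every row of `X` for each query. That is fine for a couple of dozen queries, but the rows can also be grouped in a single pass. A minimal sketch follows; the helper name `group_by_query` and the toy data are assumptions.

```python
from collections import defaultdict


def group_by_query(qids, doc_ids, X):
    """Map each query ID to its (feature vectors, doc IDs), preserving row order."""
    grouped = defaultdict(lambda: ([], []))
    for qid, doc_id, features in zip(qids, doc_ids, X):
        fts, ids = grouped[qid]
        fts.append(features)
        ids.append(doc_id)
    return grouped


grouped = group_by_query(["251", "251", "252"], ["a", "b", "c"],
                         [[1.0], [2.0], [3.0]])
print(grouped["251"])  # ([[1.0], [2.0]], ['a', 'b'])
```

Each query's `fts` and `ids` could then be passed to `ltr.rank` directly.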

That's all folks.