## Applied LTR, Part II.

Building on Part I of this exercise, expand it as follows.

1. **Apply min-max normalization on the feature values** and see if it changes retrieval performance. Note that you need to use the same folds for cross-validation, in order to make it a fair comparison.
  - Min-max normalization: $\tilde{x}_i = \frac{x_i -\min(x)}{\max(x) - \min(x)}$
    - $x_1,\dots,x_n$ are the original values for a given feature
    - $\tilde{x}_i$ is the transformed feature value for the $i$th instance
2. **Add query and document features to your feature vector** and evaluate performance. 
  - Example query features
    - The length of the query, i.e., number of terms
    - Avg. IDF score of query terms
  - Example document features
    - The length of each field (title, content), i.e., number of terms

In [7]:
from elasticsearch import Elasticsearch

INDEX_NAME = "aquaint"
DOC_TYPE = "doc"

es = Elasticsearch()

In [8]:
QUERY_FILE = "data/queries.txt"  # make sure the query file exists on this location
QRELS_FILE = "data/qrels2.csv"  # file with the relevance judgments (ground truth)

In [9]:
FEATURES_FILE = "data/features.txt"  # output the features in this file
OUTPUT_FILE = "data/ltr.txt"  # output the ranking

## Utility functions

#### Load queries

In [10]:
def load_queries(query_file):
    queries = {}
    with open(query_file, "r") as fin:
        for line in fin.readlines():
            qid, query = line.strip().split(" ", 1)
            queries[qid] = query
    return queries

queries = load_queries(QUERY_FILE)

#### Load qrels

In [11]:
def load_qrels(qrels_file):
    gt = {}  # holds a list of relevant documents for each queryID
    with open(qrels_file, "r") as fin:
        header = fin.readline().strip()
        if header != "queryID,docIDs":
            raise Exception("Incorrect file format!")
        for line in fin.readlines():
            qid, docids = line.strip().split(",")
            gt[qid] = docids.split()
    return gt
            
qrels = load_qrels(QRELS_FILE)

## Step 1) Creating training data and writing it to a file

### Extracting features for query-document pairs

We have 6 features in total. Each feature here is a retrieval score, which we obtain using a different ES configuration.

In [12]:
ES_CONFIG = {
    1: {
        "field": "content",
        "similarity": {
            "default": {
                "type": "BM25", 
                "b": 0.75, 
                "k1": 1.2
            } 
        }
    },
    2: {
        "field": "title",
        "similarity": {
            "default": {
                "type": "BM25", 
                "b": 0.75, 
                "k1": 1.2
            } 
        }
    },    
    3: {
        "field": "content",
        "similarity": {
            "default": {
                "type": "LMDirichlet", 
                "mu": 2000  # larger for content
            } 
        }
    },
    4: {
        "field": "title",
        "similarity": {
            "default": {
                "type": "LMDirichlet", 
                "mu": 200  # small for title
            } 
        }
    },
    5: {
        "field": "content",
        "similarity": {
            "default": {
                "type": "LMJelinekMercer", 
                "lambda": 0.1  
            } 
        }
    },    
    6: {
        "field": "title",
        "similarity": {
            "default": {
                "type": "LMJelinekMercer", 
                "lambda": 0.1  
            } 
        }
    }

}

NUM_FEAT = len(ES_CONFIG)

Min-max feature normalization

In [22]:
def minmax_norm(features, fid):
    """Normalizes a given feature."""
    # this is to be done for each query separately
    for qid, fts in features.items():
        min_x = 10000 # sufficiently large number
        max_x = -10000 # # sufficiently small number
        for docid in fts.keys():
            x = features[qid][docid][fid]
            if x < min_x:
                min_x = x
            if x > max_x:
                max_x = x
        for docid in fts.keys():
            x = features[qid][docid][fid]
            features[qid][docid][fid] = (x - min_x) / (max_x - min_x) 

In [14]:
import time

### Feature computation

Collecting feature values in the `features` dict. It has the structure `features[qid][docid][fid] = value`, where fid is a feature ID (1..6).

In [15]:
features = {}

#### Query-document features

  * The first feature in our set will be the BM25 content retrieval score. This is special in that this is the **candidate document set** we will rerank. We only consider the top-100 documents here.
  * For other field/model combinations, we retrieve the top-1000 documents. However, we only keep those that are in the candidate document set.

In [16]:
for fid in range(1, len(ES_CONFIG) + 1):
    print("Computing values for feature #", fid)
    # Set ES similarity config
    es.indices.close(index=INDEX_NAME)
    es.indices.put_settings(index=INDEX_NAME, body={"similarity": ES_CONFIG[fid]["similarity"]})
    es.indices.open(index=INDEX_NAME)

    time.sleep(1)  # wait until it takes effect

    for qid, query in queries.items():
        if qid not in features:
            features[qid] = {}
        num_docs = 100 if fid == 1 else 1000
        res = es.search(index=INDEX_NAME, q=query, df=ES_CONFIG[fid]["field"], _source=False, size=num_docs).get('hits', {})
        for doc in res.get("hits", {}):
            docid = doc.get("_id")
            if fid == 1:  # for BM25 content, we keep all docs; this is our candidate document set
                if docid not in features[qid]:
                    features[qid][docid] = {}
            else:  # for other features, we ignore the doc if it's not in the candidate document set
                if docid not in features[qid]:
                    continue
            features[qid][docid][fid] = doc.get("_score")

Computing values for feature # 1
Computing values for feature # 2
Computing values for feature # 3
Computing values for feature # 4
Computing values for feature # 5
Computing values for feature # 6


#### Document features

  - Feature #7: length of content field
  - Feature #8: length of title field

In [19]:
for qid, query in queries.items():
    print("Computing document features for query #", qid)
    for doc_id in features[qid].keys():
        # get document term vector from elastic
        tv = es.termvectors(index=INDEX_NAME, doc_type=DOC_TYPE, id=doc_id, fields=["title", "content"],
                              term_statistics=True).get("term_vectors", {})

        # content field length
        len_content = sum([s["term_freq"] for t, s in tv.get("content", {}).get("terms", {}).items()])
        features[qid][doc_id][NUM_FEAT+1] = len_content
        
        # title field length
        len_title = sum([s["term_freq"] for t, s in tv.get("title", {}).get("terms", {}).items()])
        features[qid][doc_id][NUM_FEAT+2] = len_title
            
NUM_FEAT += 2        

Computing document features for query # 443
Computing document features for query # 404
Computing document features for query # 622
Computing document features for query # 436
Computing document features for query # 448
Computing document features for query # 658
Computing document features for query # 310
Computing document features for query # 367
Computing document features for query # 378
Computing document features for query # 389
Computing document features for query # 651
Computing document features for query # 408
Computing document features for query # 374
Computing document features for query # 303
Computing document features for query # 393
Computing document features for query # 354
Computing document features for query # 330
Computing document features for query # 325
Computing document features for query # 648
Computing document features for query # 322
Computing document features for query # 433
Computing document features for query # 416
Computing document features for 

#### Query features

  - Feature #9: query length

In [20]:
for qid, query in queries.items():
    print("Computing query features for query #", qid)
    len_query = len(query.split())
    for doc_id in features[qid].keys():
        features[qid][doc_id][NUM_FEAT+1] = len_query
            
NUM_FEAT += 1

Computing query features for query # 443
Computing query features for query # 404
Computing query features for query # 622
Computing query features for query # 436
Computing query features for query # 448
Computing query features for query # 658
Computing query features for query # 310
Computing query features for query # 367
Computing query features for query # 378
Computing query features for query # 389
Computing query features for query # 651
Computing query features for query # 408
Computing query features for query # 374
Computing query features for query # 303
Computing query features for query # 393
Computing query features for query # 354
Computing query features for query # 330
Computing query features for query # 325
Computing query features for query # 648
Computing query features for query # 322
Computing query features for query # 433
Computing query features for query # 416
Computing query features for query # 419
Computing query features for query # 426
Computing query 

### Generating training data and writing it to file

**NOTE** For IR tasks, there should be no "missing" features. When comparing a document field against a query, there is always a retrieval score (which may be 0).  To keep this exercise simple, we use this trick of getting top-1000 docs for each method/field combination from Elasticsearch and using those retrieval scores. If the doc was not returned in the top-1000, then we'll take the retrieval score for that field to be 0. 

In Assignment-3, you should score each of document fields by computing the retrieval scores based on termvector information.

In [23]:
with open(FEATURES_FILE, "w") as fout:
    for qid, query in queries.items():
        for docid, ft in features[qid].items():
            # Note that docid will not have a feature value for feature ID i
            # if it was not retrieved in the top-1000 positions for that feature
            # Then we use 0 as retrieval score
            for fid in range(1, len(ES_CONFIG) + 1):
                if fid not in ft:
                    ft[fid] = 0
            
            # min-max normalization: this is only to be applied to features that
            # are not compatible (i.e., comparable) across queries
            # document and query lengths are comparable => no normalization needed
            # document retrieval scores depend on the query length (i.e., not comparable) => normalization needed
            for fid in range(1, len(ES_CONFIG) + 1):  # normalize first 6 features
                minmax_norm(features, fid)
            
            # relevance label is determined based on the ground truth (qrels) file
            label = 1 if docid in qrels.get(qid, []) else 0
                        
            feat_str = ['{}:{}'.format(k,v) for k,v in ft.items()]
            fout.write(" ".join([str(label), qid, docid] + feat_str) + "\n")

## Step 2) Loading training data from file and performing retrieval

Learning-to-rank code copy-pasted from the example (Task 1).

In [42]:
from sklearn.ensemble import RandomForestRegressor
import numpy as np

### A class for pointwise-based learning to rank model

In [25]:
class PointWiseLTRModel(object):
    def __init__(self, regressor):
        """
        :param classifier: an instance of scikit-learn regressor
        """
        self.regressor = regressor

    def _train(self, X, y):
        """
        Trains and LTR model.
        :param X: features of training instances
        :param y: relevance assessments of training instances
        :return:
        """
        assert self.regressor is not None
        self.model = self.regressor.fit(X, y)

    def rank(self, ft, doc_ids):
        """
        Predicts relevance labels and rank documents for a given query
        :param ft: a list of features for query-doc pairs
        :param ft: a list of document ids
        :return:
        """
        assert self.model is not None
        rel_labels = self.model.predict(ft)
        sort_indices = np.argsort(rel_labels)[::-1]

        results = []
        for i in sort_indices:
            results.append((doc_ids[i], rel_labels[i]))
        return results

### Read training data from file

In [26]:
def read_data_from_file(path):
    """
    :param path: path of file
    :return: X features of data, y labels of data, group a list of numbers indicate how many instances for each query
    """
    X, y, qids, doc_ids = [], [], [], []
    with open(path, "r") as f:
        i, s_qid = 0, None
        for line in f:
            items = line.strip().split()
            label = int(items[0])
            qid = items[1]
            doc_id = items[2]
            features = np.array([float(i.split(":")[1]) for i in items[3:]])
            # replace -1 values with np.nan
            for j in range(len(features)):
                if features[j] == -1:
                    features[j] = 0
            X.append(features)
            y.append(label)
            qids.append(qid)
            doc_ids.append(doc_id)

    return X, y, qids, doc_ids

Now, applying LTR for this data.

### Loading training data

In [27]:
X, y, qids, doc_ids = read_data_from_file(path=FEATURES_FILE)
qids_unique= list(set(qids))

print("#queries: ", len(qids_unique))
print("#query-doc pairs: ", len(y))

#queries:  50
#query-doc pairs:  5000


### Applying 5-fold cross-validation

In [44]:
FOLDS = 5

fout = open(OUTPUT_FILE, "w")
# write header
fout.write("QueryId,DocumentId\n")
    
for f in range(FOLDS):
    print("Fold #{}".format(f + 1))
    
    train_qids, test_qids = [], []  # holds the IDs of train and test queries
    train_ids, test_ids = [], []  # holds the instance IDs (indices in X )

    for i in range(len(qids_unique)):
        qid = qids_unique[i]
        if i % FOLDS == f:  # test query
            test_qids.append(qid)
        else:  # train query
            train_qids.append(qid)

    train_X, train_y = [], []  # training feature values and target labels
    test_X = []  # for testing we only have feature values

    for i in range(len(X)):
        if qids[i] in train_qids:
            train_X.append(X[i])
            train_y.append(y[i])
        else:
            test_X.append(X[i])

    # Create and train LTR model
    print("\tTraining model ...")
    clf = RandomForestRegressor(max_features=4, random_state=0) 
    ltr = PointWiseLTRModel(clf)
    ltr._train(train_X, train_y)
    
    # Apply LTR model on the remaining fold (test queries)
    print("\tApplying model ...")
    
    for qid in set(test_qids):
        print("\t\tRanking docs for queryID {}".format(qid))
        # Collect the features and docids for that (test) query `qid`
        test_ft, test_docids = [], []
        for i in range(len(X)):
            if qids[i] == qid:
                test_ft.append(X[i])
                test_docids.append(doc_ids[i])
        
        # Get ranking
        r = ltr.rank(test_ft, test_docids)    
        # Write the results to file
        for doc, score in r:
            fout.write(qid + "," + doc + "\n")
        
fout.close()

Fold #1
	Training model ...
	Applying model ...
		Ranking docs for queryID 443
		Ranking docs for queryID 648
		Ranking docs for queryID 639
		Ranking docs for queryID 651
		Ranking docs for queryID 336
		Ranking docs for queryID 419
		Ranking docs for queryID 344
		Ranking docs for queryID 433
		Ranking docs for queryID 409
		Ranking docs for queryID 399
Fold #2
	Training model ...
	Applying model ...
		Ranking docs for queryID 404
		Ranking docs for queryID 435
		Ranking docs for queryID 625
		Ranking docs for queryID 310
		Ranking docs for queryID 363
		Ranking docs for queryID 439
		Ranking docs for queryID 347
		Ranking docs for queryID 341
		Ranking docs for queryID 397
		Ranking docs for queryID 427
Fold #3
	Training model ...
	Applying model ...
		Ranking docs for queryID 622
		Ranking docs for queryID 367
		Ranking docs for queryID 314
		Ranking docs for queryID 345
		Ranking docs for queryID 408
		Ranking docs for queryID 353
		Ranking docs for queryID 638
		Ranking docs for 