## Applied LTR, Part I.

Based on the previous example, apply LTR using the queries and collection from Assignment 1. (The query and qrels files to use are [here](https://github.com/kbalog/uis-dat630-fall2017/tree/master/assignment-1/data).)

1. **Extract features** for all document-query pairs from the qrels (i.e., all documents with relevance assessments).
Use the following features (all are retrieval scores that you have computed for Assignment 1; we do not apply any normalization here):  
  - BM25 retrieval score of the query against each field (title, content)
  - LM retrieval score of the query against each field (title, content) using Jelinek-Mercer smoothing
  - LM retrieval score of the query against each field (title, content) using Dirichlet smoothing

2. **Train and evaluate the model using 5-fold cross-validation**
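
A minimal sketch of how the 5-fold split in step 2 could be set up, assuming the folds are formed over query IDs so that all documents of a query land in the same fold. That grouping is an assumption here, not something the assignment text prescribes. The helper names `make_query_folds` and `cross_validation_splits` and the example query IDs are hypothetical.

```python
import random


def make_query_folds(qids, n_folds=5, seed=42):
    """Assign query IDs to n_folds disjoint folds (hypothetical helper)."""
    shuffled = sorted(qids)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cross_validation_splits(qids, n_folds=5, seed=42):
    """Yield one (train_qids, test_qids) pair per fold."""
    folds = make_query_folds(qids, n_folds, seed)
    for i, test_qids in enumerate(folds):
        # Training queries are all queries from the other folds
        train_qids = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train_qids, test_qids


# Illustrative query IDs only; in the assignment they would come from the query file
example_qids = [str(q) for q in range(301, 321)]
for fold, (train_qids, test_qids) in enumerate(cross_validation_splits(example_qids), 1):
    print("Fold %d: %d training queries, %d test queries"
          % (fold, len(train_qids), len(test_qids)))
```

Each query is used for testing exactly once, and the model for a given fold would be trained only on the feature vectors of the remaining queries.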

In [13]:
from elasticsearch import Elasticsearch

INDEX_NAME = "aquaint"
DOC_TYPE = "doc"

es = Elasticsearch()

In [8]:
QUERY_FILE = "data/queries.txt"  # make sure the query file exists at this location

In [9]:
OUTPUT_FILE = "data/features.txt"  # the extracted features are written to this file

### Load queries

In [11]:
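def load_queries(query_file):
    """Load queries from a file with one "<qid> <query text>" entry per line."""
    queries = {}
    with open(query_file, "r") as fin:
        for line in fin:
            line = line.strip()
            if not line:  # skip blank lines, which would break the split below
                continue
            qid, query = line.split(" ", 1)
            queries[qid] = query
    return queries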

queries = load_queries(QUERY_FILE)

### Extract features for query-document pairs

In [26]:
# Each feature is a retrieval score obtained using a different ES configuration (field + similarity)
ES_CONFIG = {
    1: {
        "field": "title",
        "similarity": {
            "default": {
                "type": "BM25", 
                "b": 0.75, 
                "k1": 1.2
            } 
        }
    },
    2: {
        "field": "content",
        "similarity": {
            "default": {
                "type": "BM25", 
                "b": 0.75, 
                "k1": 1.2
            } 
        }
    }
}
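
The configuration above covers only the two BM25 features. A sketch of how the four language-model features from the list could be configured in the same shape follows. `LMJelinekMercer` (parameter `lambda`) and `LMDirichlet` (parameter `mu`) are Elasticsearch's built-in similarity types. The parameter values shown are Elasticsearch's defaults, and the names `lm_config`, `LM_SIMILARITIES`, and `FIELDS`, as well as the feature numbering 3 to 6, are assumptions rather than values given in the assignment.

```python
# Hypothetical continuation of ES_CONFIG: features 3-6 use language-model similarities.
# The smoothing parameters are Elasticsearch defaults, not values from the assignment.
LM_SIMILARITIES = [
    {"type": "LMJelinekMercer", "lambda": 0.1},
    {"type": "LMDirichlet", "mu": 2000},
]
FIELDS = ["title", "content"]

lm_config = {}
fid = 3  # continue numbering after the two BM25 features
for similarity in LM_SIMILARITIES:
    for field in FIELDS:
        lm_config[fid] = {
            "field": field,
            "similarity": {"default": dict(similarity)},
        }
        fid += 1

for fid in sorted(lm_config):
    print(fid, lm_config[fid]["field"], lm_config[fid]["similarity"]["default"]["type"])
```

Merging these entries into `ES_CONFIG` (for example with `ES_CONFIG.update(lm_config)`) would let the extraction loop compute all six features without further changes.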

In [29]:
import time

In [38]:
features = {}  # features[qid][docid][fid] = value, where fid is a retrieval score

for fid in range(1, len(ES_CONFIG) + 1):
    print("Computing values for feature #", fid)
    # Set ES similarity config
    es.indices.close(index=INDEX_NAME)
    es.indices.put_settings(index=INDEX_NAME, body={"similarity": ES_CONFIG[fid]["similarity"]})
    es.indices.open(index=INDEX_NAME)

    time.sleep(1)  # give the reopened index a moment before querying with the new similarity

    for qid, query in queries.items():
        if qid not in features:
            features[qid] = {}
        #print("Ranking documents for [%s] '%s'" % (qid, query))
        # Retrieve the top 100 documents for this query from the configured field
        res = es.search(index=INDEX_NAME, q=query, df=ES_CONFIG[fid]["field"],
                        _source=False, size=100).get("hits", {})
        # Documents outside the top 100 for this configuration get no value for fid
        for doc in res.get("hits", []):
            docid = doc.get("_id")
            if docid not in features[qid]:
                features[qid][docid] = {}
            features[qid][docid][fid] = doc.get("_score")

Computing values for feature # 1
Computing values for feature # 2
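
Because each configuration returns only its own top 100 documents, some documents end up with values for only a subset of the features. The sketch below shows one way to turn the nested `features` dictionary into fixed-length rows and write them to a file. Filling missing scores with `0.0` and the tab-separated `qid docid f1 f2 ...` layout are both assumptions, and `format_feature_line` and `write_features` are hypothetical helpers.

```python
def format_feature_line(qid, docid, doc_features, num_features):
    """Build one tab-separated row: qid, docid, then one value per feature ID."""
    # Assumption: a feature with no retrieved score is filled with 0.0
    values = [doc_features.get(fid, 0.0) for fid in range(1, num_features + 1)]
    return "\t".join([qid, docid] + [str(v) for v in values])


def write_features(features, num_features, output_file):
    """Write all query-document feature rows to output_file, sorted by qid and docid."""
    with open(output_file, "w") as fout:
        for qid in sorted(features):
            for docid in sorted(features[qid]):
                line = format_feature_line(qid, docid, features[qid][docid], num_features)
                fout.write(line + "\n")


# Two entries in the same shape as the features dictionary built above
example = {
    "303": {
        "NYT19991216.0309": {1: 16.04793},
        "NYT19981008.0331": {1: 11.925177, 2: 18.954027},
    }
}
for docid in sorted(example["303"]):
    print(format_feature_line("303", docid, example["303"][docid], 2))
```

Calling `write_features(features, len(ES_CONFIG), OUTPUT_FILE)` after the extraction loop would produce one row per query-document pair.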


In [39]:
print(features["303"])

{'NYT19991115.0396': {1: 21.474487, 2: 22.602192}, 'APW19990310.0063': {1: 21.474487, 2: 24.121357}, 'APW19990310.0167': {1: 21.474487, 2: 24.121357}, 'APW19990311.0166': {1: 21.474487, 2: 24.121357}, 'APW19990224.0040': {1: 21.474487, 2: 24.678314}, 'APW19990224.0042': {1: 21.474487, 2: 24.678314}, 'NYT19990310.0333': {1: 19.051239, 2: 25.083767}, 'NYT19991215.0229': {1: 19.051239, 2: 25.051338}, 'NYT19990310.0503': {1: 16.04793, 2: 24.678314}, 'NYT19991222.0420': {1: 16.04793, 2: 23.97029}, 'NYT19991216.0309': {1: 16.04793}, 'NYT19981008.0331': {1: 11.925177, 2: 18.954027}, 'APW19990624.0050': {1: 11.481674, 2: 19.768824}, 'APW19990624.0293': {1: 11.481674}, 'APW19990723.0142': {1: 11.481674}, 'APW19990723.0145': {1: 11.481674}, 'APW19990709.0230': {1: 11.481674}, 'APW19990709.0233': {1: 11.481674}, 'APW19991006.0154': {1: 11.481674, 2: 21.764845}, 'APW19991006.0285': {1: 11.481674, 2: 23.50713}, 'NYT19991223.0360': {1: 10.940688, 2: 22.846245}, 'APW19990414.0120': {1: 10.940688}, 'A