# Assignment 2B: Feature computation

This notebook computes the features for document-query pairs.

Some features can be expensive to compute, so avoid recomputing them. Instead, write a set of relatively simple feature extractors, each computing one or more features, and save their output to separate files. Then load the precomputed features from those files in the learning step (in the [ranking notebook](2_Ranking.ipynb)).

## Feature extractors

Example feature extractors.

In [5]:
def feature_qlen(query, doc):
    """Feature: query length (number of whitespace-separated terms).

    This is a query feature, so it has the same value for all documents.
    """
    return len(query.split())

In [11]:
def feature_bm25(query, doc, field):
    """Feature: BM25 retrieval score on a given field."""
    # TODO
    return 0
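
The BM25 extractor above is left as a TODO. As a rough starting point, the sketch below shows one common variant of the BM25 formula on already tokenized text. It is not the assignment's reference solution: the function name `bm25_score`, its parameters, and the `doc_freqs` mapping are assumptions made for illustration, and how you obtain the term statistics depends on your index.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """BM25 score of one document field for a tokenized query (illustrative sketch).

    doc_freqs is a hypothetical dict mapping a term to the number of
    documents whose field contains that term.
    """
    term_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = term_counts.get(term, 0)
        df = doc_freqs.get(term, 0)
        if tf == 0 or df == 0:
            continue  # term does not contribute to the score
        # Non-negative IDF variant (the +1 inside the log avoids negative values)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturation with document length normalization
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / norm
    return score
```

A `feature_bm25(query, doc, field)` implementation could wrap such a function after looking up the field's terms and collection statistics. The defaults `k1=1.2` and `b=0.75` are commonly used values, not values prescribed by the assignment.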

## Feature computation

This step computes features for document-query pairs and saves them to a file.

Specifically, we save the features to a JSON file using a nested map structure, with queries on the first level, documents on the second level, and individual features on the third level.

```python
  features = {
      'query_i': {
          'doc_j': {
              'feature_1': 0,  # value of feature_1 for (query_i, doc_j) pair
              'feature_2': 0,  # value of feature_2 for (query_i, doc_j) pair
              ...
          }
          ...
      }
      ...
  }
```
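
One way to build this nested structure incrementally is a small helper based on `dict.setdefault`. The helper name `add_feature` and the toy values below are assumptions for illustration only.

```python
import json


def add_feature(features, query_id, doc_id, name, value):
    """Store one feature value under features[query_id][doc_id][name]."""
    features.setdefault(query_id, {}).setdefault(doc_id, {})[name] = value


features = {}
add_feature(features, "q1", "d1", "qlen", 2)
add_feature(features, "q1", "d1", "bm25_title", 0.0)
add_feature(features, "q1", "d2", "qlen", 2)

# JSON keeps the nesting intact, so the structure survives a save/load cycle.
restored = json.loads(json.dumps(features))
```

Because all keys are strings, the same map can be written with `json.dump` and read back with `json.load` without any conversion.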

**Note**: The set of documents for which you compute features for a given query should be the union of the documents that have relevance labels for that query and the top-100 documents retrieved in first-pass retrieval.
You can then decide in the learning part whether and how to deal with class imbalance.
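
The candidate set described in the note can be sketched as follows. The function name `candidate_docs` and its inputs are hypothetical: `labeled_docs` stands for the document IDs with relevance labels for a query, and `retrieved_docs` for the ranked list returned by first-pass retrieval.

```python
def candidate_docs(labeled_docs, retrieved_docs, k=100):
    """Union of labeled documents and the top-k first-pass results.

    Duplicates are removed while the original order is preserved
    (labeled documents first, then retrieved documents by rank).
    """
    combined = list(labeled_docs) + list(retrieved_docs[:k])
    return list(dict.fromkeys(combined))
```

For example, `candidate_docs(["d1", "d2"], ["d2", "d3", "d4"], k=2)` yields `["d1", "d2", "d3"]`: `d2` appears once, and `d4` falls outside the top 2.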

In [19]:
import json

In [18]:
# TODO load actual queries from file
queries = ["q1", "q2", "q3"]

features_1 = {}
features_2 = {}

for q in queries:
    features_1[q] = {}
    features_2[q] = {}
    # TODO load actual candidate documents from file
    docs = ["d1", "d2", "d3"]
    for d in docs:
        # Here, two sets of features are computed in a single go to produce some toy data.
        # Normally, you would run these sequentially.
        features_1[q][d] = {
            'qlen': feature_qlen(q, d)
        }
        features_2[q][d] = {
            'bm25_content': feature_bm25(q, d, "content"),
            'bm25_title': feature_bm25(q, d, "title")
        }
        
# Write computed features to file (create the output directory if it is missing)
import os
os.makedirs("data", exist_ok=True)

with open("data/sample_features_1.json", "w") as f:
    json.dump(features_1, f, indent=4, sort_keys=True)

with open("data/sample_features_2.json", "w") as f:
    json.dump(features_2, f, indent=4, sort_keys=True)
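
In the learning step, the separately saved files need to be combined again. The sketch below shows one possible way to merge several feature files into a single nested map; the function name `load_features` is an assumption, not part of the assignment code.

```python
import json


def load_features(paths):
    """Merge several feature files into one map: query -> doc -> feature -> value."""
    merged = {}
    for path in paths:
        with open(path) as f:
            features = json.load(f)
        for q, docs in features.items():
            for d, values in docs.items():
                # Later files add to (or overwrite) features from earlier files.
                merged.setdefault(q, {}).setdefault(d, {}).update(values)
    return merged
```

With the toy files written above, this could be called as `load_features(["data/sample_features_1.json", "data/sample_features_2.json"])`. If two files define a feature with the same name, the value from the later file wins, so distinct feature names across files are advisable.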