# Search API usage example

This notebook shows how to use the Search API.

This API (see the [source code here](api.py)) acts as a broker between users and the Elasticsearch service, which is hosted in the cloud.  It receives requests, passes them on to Elasticsearch (via the [Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html)), and returns the results as JSON.  It mirrors the parameterization of the respective Elasticsearch API methods.

Note that you don't need to run anything locally; the source code is provided merely for transparency.

![Search API](search_api.png)

You talk to the Search API using HTTP requests. 

In [1]:
import urllib.parse
import requests
import json

In [2]:
API = "http://gustav1.ux.uis.no:5002"

MAIN_INDEX = "clueweb12b"
ANCHORS_INDEX = "clueweb12b_anchors"

## Search

Executes a search query using [es.search()](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.search) and returns the search hits.

Parameters:
  - `q` (mandatory): query
  - `df` (mandatory): field to search in
  - `size` (optional): number of hits to return (default: 10)
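
To make the parameter mapping concrete, the sketch below builds the request URL for one query without sending anything. The base URL and index name are the ones used in this notebook; `build_search_url` is a hypothetical helper name introduced only for illustration.

```python
from urllib.parse import urlencode


def build_search_url(api, indexname, query, field, size=10):
    """Return the GET URL for a search request (no request is sent)."""
    params = {"q": query, "df": field, "size": size}
    return "{}/{}/_search?{}".format(api, indexname, urlencode(params))


url = build_search_url(
    "http://gustav1.ux.uis.no:5002", "clueweb12b", "rules of golf", "content", size=100
)
print(url)
# http://gustav1.ux.uis.no:5002/clueweb12b/_search?q=rules+of+golf&df=content&size=100
```

`urlencode` takes care of escaping, so spaces and apostrophes in the query text do not need special handling.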

In [3]:
def search(indexname, query, field, size=10):
    url = "/".join([API, indexname, "_search"]) + "?" \
          + urllib.parse.urlencode({"q": query, "df": field, "size": size})
    response = requests.get(url).text
    return json.loads(response)
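
The returned JSON can be unpacked as in the minimal sketch below. It assumes the Search API passes the Elasticsearch response through unchanged, so that hits sit under `hits.hits` and each hit carries `_id` and `_score`. The helper name `top_hits` and the sample response are made up for illustration.

```python
def top_hits(res):
    """Return (doc_id, score) pairs from a search response, in ranked order."""
    hits = res.get("hits", {}).get("hits", [])
    return [(hit.get("_id"), hit.get("_score")) for hit in hits]


# Made-up response; the document IDs and scores are not real results.
sample = {"hits": {"hits": [
    {"_id": "doc-a", "_score": 12.5},
    {"_id": "doc-b", "_score": 9.1},
]}}

print(top_hits(sample))  # [('doc-a', 12.5), ('doc-b', 9.1)]
print(top_hits({}))      # [] when the response contains no hits
```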

### Generating rankings for a set of queries

The code below generates the first-pass (BM25) ranking: the top 100 documents per query, retrieved from the `content` field.

(For your actual submissions, make sure you use the file that is provided in your private repository. You may use the code below to generate a temporary file that can be used during development.) 

In [4]:
QUERIES_FILE = "data/queries.txt"
OUTPUT_FILE = "data/ranking_bm25_temp.csv"

In [5]:
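def load_queries(query_file):
    """Load queries from a file with one "<qid> <query>" pair per line."""
    queries = {}
    with open(query_file, "r") as fin:
        for line in fin:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            qid, query = line.split(" ", 1)
            queries[qid] = query
    return queries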

In [6]:
queries = load_queries(QUERIES_FILE)

with open(OUTPUT_FILE, "w") as fout:
    fout.write("QueryId,DocumentId\n")  # header
    for qid, query in sorted(queries.items()):
        print("Ranking documents for [{}] '{}'".format(qid, query))
        res = search(MAIN_INDEX, query, "content", size=100)
        for doc in res.get("hits", {}).get("hits", []):
            fout.write("{},{}\n".format(qid, doc.get("_id")))

Ranking documents for [201] 'raspberry pi'
Ranking documents for [202] 'uss carl vinson'
Ranking documents for [203] 'reviews of les miserables'
Ranking documents for [204] 'rules of golf'
Ranking documents for [205] 'average charitable donation'
Ranking documents for [206] 'wind power'
Ranking documents for [207] 'bph treatment'
Ranking documents for [208] 'doctor zhivago'
Ranking documents for [209] 'land surveyor'
Ranking documents for [210] 'golf gps'
Ranking documents for [211] 'what is madagascar known for'
Ranking documents for [212] 'home theater systems'
Ranking documents for [213] 'carpal tunnel syndrome'
Ranking documents for [214] 'capital gains tax rate'
Ranking documents for [215] 'maryland department of natural resources'
Ranking documents for [216] 'nicolas cage movies'
Ranking documents for [217] 'kids earth day activities'
Ranking documents for [218] 'solar water fountains'
Ranking documents for [219] 'what was the name of elvis presley's home'
Ranking documents for [
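
During development it can be handy to read the ranking file back in, for example to feed a re-ranking step. The sketch below assumes the two-column `QueryId,DocumentId` layout written above; `load_ranking` is a hypothetical helper, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def load_ranking(fin):
    """Map each QueryId to its list of DocumentIds, preserving rank order."""
    ranking = defaultdict(list)
    for row in csv.DictReader(fin):
        ranking[row["QueryId"]].append(row["DocumentId"])
    return dict(ranking)


# Made-up file content with the same header as the output file.
sample = io.StringIO("QueryId,DocumentId\n201,doc-a\n201,doc-b\n202,doc-c\n")
ranking = load_ranking(sample)
print(ranking)  # {'201': ['doc-a', 'doc-b'], '202': ['doc-c']}
```

For the real file, pass an open file handle instead of the `StringIO` object, e.g. `with open(OUTPUT_FILE, newline="") as fin: ranking = load_ranking(fin)`.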

## Exists

Checks whether the given document ID exists in the given index using [es.exists()](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.exists).

In [7]:
def exists(indexname, doc_id):
    url = "/".join([API, indexname, doc_id, "_exists"])
    response = requests.get(url).text
    return json.loads(response)['exists']
    
print(exists(MAIN_INDEX, "clueweb12-0713wb-35-00870"))
print(exists(MAIN_INDEX, "clueweb12-0906wb-09-33744"))

True
False


## Analyzer

Returns the analyzed version of the input text using [es.indices.analyze()](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.analyze).

Use this request to tokenize the query text instead of simply splitting on spaces.

Parameters:
  - `text` (mandatory): text to be analyzed

In [8]:
def analyze_query(indexname, query):
    url = "/".join([API, indexname, "_analyze"]) + "?" \
          + urllib.parse.urlencode({"text": query})
    response = requests.get(url).text
    r = json.loads(response)
    return [t["token"] for t in r["tokens"]]
    
print("{} => {}".format(queries['219'], analyze_query(MAIN_INDEX, queries['219'])))

what was the name of elvis presley's home => ['what', 'name', 'elvi', 'preslei', 'home']
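
To see why this matters, the sketch below compares whitespace splitting with the analyzer output recorded above. Nothing is requested from the API here; the token list is copied from the output of the previous cell.

```python
# Query text and analyzer output copied from the cell above.
query = "what was the name of elvis presley's home"
analyzed = ['what', 'name', 'elvi', 'preslei', 'home']

naive = query.split(" ")

# Tokens that whitespace splitting yields but the analyzer does not.
only_naive = [t for t in naive if t not in analyzed]
# Tokens that only the analyzer yields (stemmed forms).
only_analyzed = [t for t in analyzed if t not in naive]

print(only_naive)     # ['was', 'the', 'of', 'elvis', "presley's"]
print(only_analyzed)  # ['elvi', 'preslei']
```

Judging from this output, the analyzer drops some common words and stems the rest, so a lookup with the raw token `elvis` would presumably miss the indexed term `elvi`.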


## Termvectors

Returns information and statistics on terms in the fields of a particular document using [es.termvectors()](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.termvectors).

Parameters:
  - `term_statistics` (optional): set to `true` to return term statistics (default: `false`)

In [9]:
def term_vectors(indexname, doc_id, term_statistics=False):
    ret = {}    
    url = "/".join([API, indexname, doc_id, "_termvectors"]) + "?" \
          + urllib.parse.urlencode({"term_statistics": str(term_statistics).lower()})
    response = requests.get(url).text
    try:
        ret = json.loads(response)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        print("Failed to json-decode this response:\n{}".format(response))
    return ret

print(term_vectors(MAIN_INDEX, "clueweb12-0713wb-35-00870", term_statistics=True)['term_vectors']['title'])

{'field_statistics': {'doc_count': 5304355, 'sum_doc_freq': 31308821, 'sum_ttf': 34674338}, 'terms': {'bbc': {'doc_freq': 1826, 'term_freq': 1, 'ttf': 1935}, 'comput': {'doc_freq': 23036, 'term_freq': 1, 'ttf': 25856}, 'softwar': {'doc_freq': 28921, 'term_freq': 1, 'ttf': 32599}}}
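
These statistics are the raw ingredients for term weighting. As a minimal sketch, the code below computes a plain `log(N / df)` inverse document frequency for each title term, using the output above as input. This is one common IDF variant chosen for illustration, not the formula Elasticsearch uses internally, and the helper name `idf` is hypothetical.

```python
import math

# Term vector of the title field, copied from the output above.
title = {
    'field_statistics': {'doc_count': 5304355, 'sum_doc_freq': 31308821, 'sum_ttf': 34674338},
    'terms': {
        'bbc': {'doc_freq': 1826, 'term_freq': 1, 'ttf': 1935},
        'comput': {'doc_freq': 23036, 'term_freq': 1, 'ttf': 25856},
        'softwar': {'doc_freq': 28921, 'term_freq': 1, 'ttf': 32599},
    },
}

# Number of documents that have this field.
doc_count = title['field_statistics']['doc_count']


def idf(term_info, n_docs):
    """Plain log(N / df) inverse document frequency."""
    return math.log(n_docs / term_info['doc_freq'])


# Field length of this document, counted in indexed terms.
doc_len = sum(info['term_freq'] for info in title['terms'].values())
print("title length:", doc_len)

for term, info in sorted(title['terms'].items()):
    print(term, info['term_freq'], round(idf(info, doc_count), 2))
```

Rarer terms such as `bbc` get a higher weight than frequent ones such as `softwar`.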
