# Elasticsearch

In this exercise, you'll first build an Elasticsearch index of a toy document collection, then request various term statistics from that index.

Remember to make sure that the Elasticsearch service is running (i.e., has been started in a terminal window).

See [this document](Elasticsearch.md) for help on Elasticsearch usage.

In [None]:
from elasticsearch import Elasticsearch
from typing import Dict, List, Optional

import ipytest
import pytest

ipytest.autoconfig()

Create a client for the Elasticsearch service running on your machine.

In [None]:
es = Elasticsearch()

## Indexing

We use a toy data collection with 5 documents, each with title and content fields.

In [None]:
DOCS = [
    {"doc_id": "D1",
     "title": "First document",
     "content": "House on the hill"
    },
    {"doc_id": "D2",
     "title": "Second title",
     "content": "Downtown Stavanger is beautiful"
    },
    {"doc_id": "D3",
     "title": "First, second, and third",
     "content": "Never step on snakes"
    },
    {"doc_id": "D4",
     "title": "Document number four",
     "content": "House, house. It's a beautiful house you have"
    },
    {"doc_id": "D5",
     "title": "This document is the last document",
     "content": "There can be only one matching result"
    }    
]

In [None]:
INDEX_SETTINGS = {  # single shard with a single replica
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        }
    }
}

In [None]:
INDEX_NAME = "test_e6-3"

In [None]:
if es.indices.exists(index=INDEX_NAME):
    es.indices.delete(index=INDEX_NAME)
es.indices.create(index=INDEX_NAME, body=INDEX_SETTINGS)

Add the documents in `DOCS` to the index.
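
One possible way to prepare the documents is sketched here, under the assumption that `doc_id` should become the Elasticsearch document ID and that the remaining fields form the document body. `build_actions` is a hypothetical helper name; the action layout (`_index`, `_id`, `_source`) is the one accepted by `elasticsearch.helpers.bulk`. The sketch only builds the actions, so it runs without a server.

```python
from typing import Dict, List


def build_actions(docs: List[Dict[str, str]], index_name: str) -> List[Dict]:
    """Turns documents into bulk actions keyed by their `doc_id`."""
    actions = []
    for doc in docs:
        # Everything except the ID goes into the stored document body.
        source = {key: value for key, value in doc.items() if key != "doc_id"}
        actions.append({"_index": index_name, "_id": doc["doc_id"], "_source": source})
    return actions


sample_docs = [{"doc_id": "D1", "title": "First document", "content": "House on the hill"}]
actions = build_actions(sample_docs, "test_e6-3")
```

With a live client, the actions could be sent with `helpers.bulk(es, actions)`. For five documents, indexing them one at a time with `es.index(...)` works just as well.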

In [None]:
# TODO

## Term statistics

Complete the functions below for getting various term statistics from the index.

Consult [this notebook](2-Elasticsearch.ipynb) for the interpretation of term vector statistics.
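
As a quick orientation, the sketch below walks through a term vectors response for the `title` field of `D2`. The `response` dict is written by hand to mimic the shape the term vectors API returns when term statistics are requested (with the Python client, something like `es.termvectors(index=..., id=..., fields=..., term_statistics=True)`). Position and offset details are omitted, and the numbers should be treated as illustrative rather than as output from a live index.

```python
response = {
    "_id": "D2",
    "found": True,
    "term_vectors": {
        "title": {
            # Statistics over the whole field, across all documents.
            "field_statistics": {"doc_count": 5, "sum_doc_freq": 16, "sum_ttf": 17},
            "terms": {
                # term_freq: occurrences in this document's field
                # doc_freq:  number of documents containing the term
                # ttf:       total occurrences across all documents
                "second": {"term_freq": 1, "doc_freq": 2, "ttf": 2},
                "title": {"term_freq": 1, "doc_freq": 1, "ttf": 1},
            },
        }
    },
}

terms = response["term_vectors"]["title"]["terms"]
for term, stats in sorted(terms.items()):
    print(term, stats["term_freq"], stats["doc_freq"], stats["ttf"])
```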

In [None]:
def get_doc_term_freqs(index_name: str, doc_id: str, field: str) -> Dict[str, int]:
    """Returns the terms along with their frequencies contained in a given document.
    
    Args:
        index_name: Name of index.
        doc_id: Document ID.
        field: Field name.
    
    Returns:
        Dict with terms as keys and corresponding frequencies (i.e., 
        number of occurrences within the given document field) as values.
    """
    # TODO: Complete
    return None

In [None]:
def get_doc_field_len(index_name: str, doc_id: str, field: str) -> int:
    """Returns the length of a given document field.
    
    Length is defined as the total number of terms contained in that field.
    
    Args:
        index_name: Name of index.
        doc_id: Document ID.
        field: Field name.
    
    Returns:
        Field length.    
    """
    # TODO: Complete
    return None
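
Note that this definition of length counts term occurrences, not distinct terms. A small worked example, using the term frequencies that the tests expect for the `content` field of `D4`:

```python
# Term frequencies for the `content` field of D4, as expected by the tests.
term_freqs = {"a": 1, "beautiful": 1, "have": 1, "house": 3, "it's": 1, "you": 1}

# Number of distinct terms in the field.
distinct_terms = len(term_freqs)

# Field length: every occurrence counts, so "house" contributes 3.
field_len = sum(term_freqs.values())
```

Here `distinct_terms` is 6 while `field_len` is 8, which matches the value checked in the tests.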

In [None]:
def get_doc_containing_term(index_name: str, field: str, term: str) -> Optional[str]:
    """Returns the ID of any document that contains `term` in the given field, or None if there is no such document.
    
    Args:
        index_name: Name of index.
        field: Field name.
        term: Term.

    Returns:
        ID of a document that contains `term` or None.
    """
    # TODO: Complete
    # Hint: Use a boolean query to find a document that contains the term.
    return None
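
One way to read the hint is sketched below; it is not the reference solution. It assumes that a `bool` query with a single `match` clause and `size` set to 1 is sufficient, and that the search response has the usual `hits.hits[*]._id` layout. `build_term_query` and `first_hit_id` are hypothetical helper names, and `fake_response` is written by hand so that the example runs without a server.

```python
from typing import Dict, Optional


def build_term_query(field: str, term: str) -> Dict:
    """Builds a search body that asks for at most one document matching `term`."""
    return {
        "query": {"bool": {"must": [{"match": {field: term}}]}},
        "size": 1,
    }


def first_hit_id(response: Dict) -> Optional[str]:
    """Returns the ID of the first hit, or None if nothing matched."""
    hits = response.get("hits", {}).get("hits", [])
    return hits[0]["_id"] if hits else None


body = build_term_query("content", "house")
fake_response = {"hits": {"hits": [{"_id": "D1", "_score": 0.6}]}}
```

A `match` clause analyzes the query text the same way the field was analyzed at indexing time, which is why a lowercased term such as `house` can match `House` in the original text.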

In [None]:
def get_term_doc_count(index_name: str, field: str, term: str) -> int:
    """Returns the total number of documents that contain a given term within a specific field.
    
    Args:
        index_name: Name of index.
        field: Field name.
        term: Term.
        
    Returns:
        Number of documents that contain the given term within `field`.
    """
    # Find a document that contains the term.
    doc_id = get_doc_containing_term(index_name, field, term)
    if doc_id is None:
        return 0
    # Request term statistics for that document and extract the 
    # requested information from there.
    # TODO: Complete
    return None    

In [None]:
def get_term_coll_freq(index_name: str, field: str, term: str) -> int:
    """Returns the total collection term frequency of a term in a given field.
    
    Args:
        index_name: Name of index.
        field: Field name.
        term: Term.
        
    Returns:
        Total number of occurrences of `term` in all documents within `field`.
    """
    # Find a document that contains the term.
    doc_id = get_doc_containing_term(index_name, field, term)
    if doc_id is None:
        return 0
    # Request term statistics for that document and extract the 
    # requested information from there.
    # TODO: Complete
    return None
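
Both functions above rely on the same observation: document frequency and total term frequency describe the whole collection, so they can be read from the term vector of any one document that contains the term. The sketch below uses hand-written entries for the term `house` in the `content` field, shaped like the per-term part of a term vectors response with term statistics enabled. The key names `term_freq`, `doc_freq`, and `ttf` are the ones the API uses; the values are filled in by hand from the toy collection.

```python
# Hand-written per-term entries for "house", as seen from two documents.
house_in_d1 = {"term_freq": 1, "doc_freq": 2, "ttf": 4}
house_in_d4 = {"term_freq": 3, "doc_freq": 2, "ttf": 4}

# term_freq is specific to each document, so it differs between D1 and D4.
per_doc_differs = house_in_d1["term_freq"] != house_in_d4["term_freq"]

# doc_freq and ttf describe the whole collection, so both entries agree.
same_collection_stats = (
    house_in_d1["doc_freq"] == house_in_d4["doc_freq"]
    and house_in_d1["ttf"] == house_in_d4["ttf"]
)

# ttf is the sum of term_freq over all documents containing the term.
ttf_from_docs = house_in_d1["term_freq"] + house_in_d4["term_freq"]
```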

Tests.

In [None]:
%%run_pytest[clean]

def test_doc_term_freqs():
    assert get_doc_term_freqs(INDEX_NAME, "D2", "title") == {"second": 1, "title": 1}
    assert get_doc_term_freqs(INDEX_NAME, "D4", "content") == {"a": 1, "beautiful": 1, "have": 1,
                                                               "house": 3, "it's": 1, "you": 1}

def test_doc_field_len():
    assert get_doc_field_len(INDEX_NAME, "D2", "title") == 2
    assert get_doc_field_len(INDEX_NAME, "D4", "content") == 8
    
def test_doc_containing_term():
    assert get_doc_containing_term(INDEX_NAME, "title", "document") in ["D1", "D4", "D5"]
    assert get_doc_containing_term(INDEX_NAME, "content", "house") in ["D1", "D4"]
    
def test_term_doc_count():
    assert get_term_doc_count(INDEX_NAME, "title", "document") == 3
    assert get_term_doc_count(INDEX_NAME, "content", "house") == 2    
    
def test_term_coll_freq():
    assert get_term_coll_freq(INDEX_NAME, "title", "this") == 1
    assert get_term_coll_freq(INDEX_NAME, "title", "document") == 4
    assert get_term_coll_freq(INDEX_NAME, "content", "house") == 4       