# Using Elasticsearch in PyTerrier experiments
Elasticsearch can hold very large indices that could not easily be searched with PyTerrier alone.
Using the Elasticsearch API via the [`elasticsearch`](https://pypi.org/project/elasticsearch/) Python package,
we can integrate large indices into PyTerrier experiments and take advantage of Elasticsearch's distribution capabilities.

## Configuration
To access Elasticsearch, we connect to a cluster using its URL, a username, and a password. Refer to the [API documentation](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html) for other ways to connect to a cluster.

In [None]:
url: str = input("Elasticsearch URL: ")

In [None]:
username: str = input("Elasticsearch username: ")

In [None]:
password: str = input("Elasticsearch password: ")

In [None]:
index: str = input("Elasticsearch index: ")
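
For non-interactive runs, typing each setting by hand is impractical. As a minimal sketch, the helper below reads a setting from an environment variable and only prompts when the variable is unset. The helper name `read_setting` and variable names such as `ES_URL` are hypothetical and not part of any library.

```python
from getpass import getpass
from os import environ


def read_setting(name: str, prompt: str, secret: bool = False) -> str:
    """Return the environment variable `name`, or prompt for it if unset.

    `name` is a hypothetical variable name chosen for this sketch.
    """
    value = environ.get(name)
    if value:
        return value
    # getpass hides the typed value, which is preferable for passwords.
    return getpass(prompt) if secret else input(prompt)
```

For instance, `password = read_setting("ES_PASSWORD", "Elasticsearch password: ", secret=True)` would replace the corresponding `input` call above.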

## Setup

Install the required Python packages when running in Google Colab.

In [None]:
from sys import modules

if "google.colab" in modules:
    !pip install -q chatnoir-pyterrier python-terrier

Initialize PyTerrier.

In [None]:
from pyterrier import init, started

In [None]:
if not started():
    init()

Connect to the Elasticsearch cluster.

In [None]:
from elasticsearch import Elasticsearch

client = Elasticsearch(
    hosts=url,
    basic_auth=(username, password)
)
client
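
Creating the client does not by itself show that the cluster is reachable. The sketch below, with the hypothetical helper name `check_connection`, uses the client's `ping()` and `info()` methods to fail early if the URL or credentials are wrong.

```python
def check_connection(client):
    """Ping the cluster and return its basic information.

    `client` is expected to be an `elasticsearch.Elasticsearch` instance;
    any object with `ping()` and `info()` methods works for this sketch.
    """
    if not client.ping():
        raise ConnectionError("Could not reach the Elasticsearch cluster.")
    # `info()` returns metadata such as the cluster name and server version.
    return client.info()
```

Calling `check_connection(client)` right after creating the client surfaces connection problems before any retrieval is attempted.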

## Retrieval pipeline
We can now create a retrieval pipeline that retrieves results from Elasticsearch.
Create an `ElasticsearchRetrieve` transformer by specifying the Elasticsearch client and the index to search.
You can then use the pipeline in the same way as `BatchRetrieve`.

The `fields` parameter specifies which fields of the Elasticsearch index the query terms are matched against.
The `columns` parameter specifies which Elasticsearch fields are mapped to which columns in the result data frame.

(We [cache](https://pyterrier.readthedocs.io/en/latest/operators.html#caching) the transformer results with `~`.)

In [None]:
from pyterrier_elasticsearch import ElasticsearchRetrieve

es_text_title = ~ElasticsearchRetrieve(
    client=client,
    index=index,
    fields=["text", "title"],
    columns={
        # source field -> destination column
        "text": "text",
        "title": "title",
    },
    verbose=True,
)
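
To make the roles of `fields` and `columns` more concrete, here is a rough sketch of what such a transformer might do for one query. It is an illustration under assumptions, not the actual implementation of `ElasticsearchRetrieve`: the helper names `build_query` and `hits_to_rows` are hypothetical, and a `multi_match` query is only one plausible way to match terms on several fields.

```python
from typing import Any


def build_query(query: str, fields: list[str]) -> dict[str, Any]:
    """Build a `multi_match` query that matches the text on all given fields.

    Assumption: the real transformer may construct a different query.
    """
    return {"multi_match": {"query": query, "fields": fields}}


def hits_to_rows(
    qid: str, hits: list[dict[str, Any]], columns: dict[str, str]
) -> list[dict[str, Any]]:
    """Convert Elasticsearch hits into PyTerrier-style result rows."""
    rows = []
    # Ranks start at 0 in this sketch.
    for rank, hit in enumerate(hits):
        row = {"qid": qid, "docno": hit["_id"], "score": hit["_score"], "rank": rank}
        # Copy each source field into its destination column (None if missing).
        for field, column in columns.items():
            row[column] = hit["_source"].get(field)
        rows.append(row)
    return rows
```

Passing the resulting rows to `pandas.DataFrame` would give a frame with the usual `qid`, `docno`, `score`, and `rank` columns plus one column per mapped field.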

### Search
For example, assuming the index holds the ClueWeb12 corpus, we can search it for documents matching `python library`:

In [None]:
es_text_title.search("python library")

### Evaluation
We can also use the pipeline in a PyTerrier `Experiment` (and compare it to other retrieval pipelines).
First, we need to download the test topics, for example from the TREC Web Track 2014.
(Refer to the [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/datasets.html#examples) for more detailed guides.)

In [None]:
from pandas import DataFrame
from pyterrier.datasets import Dataset, get_dataset

dataset: Dataset = get_dataset("irds:clueweb12/trec-web-2014")
topics: DataFrame = dataset.get_topics(variant="query").iloc[:5]

Now we can, for example, retrieve documents for the TREC Web Track 2014 topics.

In [None]:
es_text_title.transform(topics)

For comparison, we can create a second pipeline that searches only the `text` field.

In [None]:
from pyterrier_elasticsearch import ElasticsearchRetrieve

es_text = ~ElasticsearchRetrieve(
    client=client,
    index=index,
    fields=["text"],
    columns={
        # source field -> destination column
        "text": "text",
        "title": "title",
    },
    verbose=True,
)

Then we run an experiment that compares both pipelines:

In [None]:
from ir_measures import nDCG, RR, MAP
from pyterrier.pipelines import Experiment

Experiment(
    [es_text_title, es_text],
    topics,
    dataset.get_qrels(),
    eval_metrics=[nDCG @ 5, MAP, RR],
    names=["ES (text+title)", "ES (text)"],
)
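
To help interpret the reported numbers, the following sketch computes the reciprocal rank and nDCG for a single ranked list of relevance labels. It uses one common formulation (linear gain, log2 discount); `ir_measures` may use a different gain function or handle edge cases differently, so treat it as an illustration only.

```python
from math import log2


def reciprocal_rank(relevances: list[int]) -> float:
    """Return 1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, relevance in enumerate(relevances, start=1):
        if relevance > 0:
            return 1.0 / rank
    return 0.0


def dcg(relevances: list[int], k: int) -> float:
    """Discounted cumulative gain over the top-k results."""
    return sum(
        relevance / log2(rank + 1)
        for rank, relevance in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances: list[int], judged: list[int], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering.

    `relevances` are the labels of the retrieved documents in rank order;
    `judged` are the labels of all judged documents for the topic.
    """
    ideal = dcg(sorted(judged, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that places the only relevant document at rank 3 has a reciprocal rank of 1/3, while a ranking that already matches the ideal ordering has an nDCG of 1.0.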