# Searching through the index
In this tutorial we demonstrate how to search through an index build with HuggingFace and Pyserini. Check out previous tutorials to learn how to build the index in the first place:
- https://github.com/huggingface/gaia/blob/main/notebooks/00-indexing.ipynb
- https://github.com/huggingface/gaia/blob/main/notebooks/01-tokenization.ipynb

We start by creating a `searcher` object which gives us access to underlying Lucene API.

In [19]:
from datasets import load_dataset
from pyserini.search.lucene import LuceneSearcher
from pyserini.analysis import get_lucene_analyzer

In [20]:
searcher = LuceneSearcher("../indexes/imdb")

We can search through the index with a smiple `search` function, providing the query and a maximum number of requested results.

In [21]:
%time
hits = searcher.search("Horrible movie", k=5)

CPU times: user 6 µs, sys: 6 µs, total: 12 µs
Wall time: 23.6 µs


In [22]:
for hit in hits:
    print(f"Document {hit.docid}, score: {hit.score}")
    print(f"{hit.contents[:500]}...\n\n")

Document 1765, score: 2.965100049972534
this movie was horrible. I could barely stay awake through it. I would never see this movie again if I were payed to. The so-called horror scenes in it were increadably predictable and over played. There was really nothing about this movie that would have made it original or worth the $7.50 I payed to see it. Don't go see it, don't rent it, don't read about it online because any of these things would be a complete waste of your time. Sarah Michelle Geller gave a lackluster performance and really...


Document 3685, score: 2.9449000358581543
A patient escapes from a mental hospital, killing one of his keepers and then a University professor after he makes his way to the local college. Next semester, the late prof's replacement and a new group of students have to deal with a new batch of killings. The dialogue is so clichéd it is hard to believe that I was able to predict lines in quotes. This is one of those cheap movies that was thrown together i

## Tying the search results back to the dataset
Because we used HF datasets as the source of indexed documents preserving their original document IDs, we can leverage it when performing the analysis of retrieved results. This may come in handy if the HF datasets contains useful metadata which we don't want to store inside the Pyserini index not to inflate its size.

In [26]:
dset = load_dataset("imdb", split="train")

Found cached dataset imdb (/home/piktus_huggingface_co/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


In [27]:
for hit in hits:
    print(f"Document {hit.docid}, score: {hit.score}")
    print(f"{hit.contents[:500]}...")
    print(dset[int(hit.docid)])
    print()

Document 1765, score: 2.965100049972534
this movie was horrible. I could barely stay awake through it. I would never see this movie again if I were payed to. The so-called horror scenes in it were increadably predictable and over played. There was really nothing about this movie that would have made it original or worth the $7.50 I payed to see it. Don't go see it, don't rent it, don't read about it online because any of these things would be a complete waste of your time. Sarah Michelle Geller gave a lackluster performance and really...
{'text': "this movie was horrible. I could barely stay awake through it. I would never see this movie again if I were payed to. The so-called horror scenes in it were increadably predictable and over played. There was really nothing about this movie that would have made it original or worth the $7.50 I payed to see it. Don't go see it, don't rent it, don't read about it online because any of these things would be a complete waste of your time. Sarah Mi

## Health check on Streamed Datasets
Let's make sure the index build in https://github.com/huggingface/gaia/blob/main/notebooks/00-indexing.ipynb using streaming is the same as the one build from offline data. 

In [28]:
searcher = LuceneSearcher("../indexes/imdb-streaming/")

In [29]:
%time
hits = searcher.search("Horrible movie", k=10)

CPU times: user 5 µs, sys: 5 µs, total: 10 µs
Wall time: 37.4 µs


In [30]:
for hit in hits:
    print(f"Document {hit.docid}, score {hit.score}")

Document 1765, score 2.965100049972534
Document 3685, score 2.9449000358581543
Document 11077, score 2.902400016784668
Document 9233, score 2.8859000205993652
Document 2873, score 2.8440001010894775
Document 1140, score 2.837399959564209
Document 3335, score 2.81850004196167
Document 5370, score 2.8125998973846436
Document 5692, score 2.812598943710327
Document 6744, score 2.7985999584198


## Search Results for the Index with Hugging Face tokenization
The index differes from the one using the default English Analyser, so search results are also different.

In [31]:
hf_tokenizer_searcher = LuceneSearcher("../indexes/bpe-imdb-25k/")
hf_analyzer = get_lucene_analyzer(
    language="hgf_tokenizer", huggingFaceTokenizer="spacerini/bpe-imdb-25k"
)
hf_tokenizer_searcher.set_analyzer(hf_analyzer)

In [32]:
%time
hits = hf_tokenizer_searcher.search("Horrible movie", k=10)

CPU times: user 8 µs, sys: 7 µs, total: 15 µs
Wall time: 30.3 µs


In [35]:
for hit in hits:
    print(f"Document {hit.docid}, score: {hit.score}")
    print(f"{hit.contents[:500]}...")
    print()

Document 2873, score: 5.376500129699707
Horrible, Horrible, Horrible do not waste your money to rent this movie. Its like a low budget made for TV Canadian movie. Absolutely the worst movie I have ever seen and there have been many others out there. This movie is not worth the time it takes to put it in the DVD player or VCR. :~( . Is it possible to write ten lines? The acting was horrific. It had absolutely no flow. I saw the made for TV movie on the BTK killer and it was much better(in comparison to this one). I am not sure what the...

Document 1140, score: 4.942500114440918
Horrible acting, horrible cast and cheap props. Would've been a lot better if was set as an action parody style movie. What a waste. Starting from the name of the movie.<br /><br />"The Enemy" Naming it "Action Movie" would've made it better. (contributing to the parody effect). The cop looking like a 60 Year old player, the blond girl just having the same blank boring look on her face at all times. Towards the 