# Semantic Search with Haystack and Elastic

This notebook demonstrates how to use Haystack to upload a set of documents to an Elastic index and build a query pipeline that retrieves documents from the document store.

In this notebook you will:
- Create an Elastic document store in Haystack
- Generate text embeddings for documents in your document store
- Build a search pipeline to retrieve documents
- Incrementally add documents to the document store

In [1]:
from haystack.utils import launch_es

launch_es()

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/
ERROR - root -  Failed to import 'magic' (from 'python-magic' and 'python-magic-bin' on Windows). FileTypeClassifier will not perform mimetype detection on extensionless files. Please make sure the necessary OS libraries are installed if you need this functionality.


In [2]:
from haystack.document_stores import ElasticsearchDocumentStore

# Connect to the Elastic instance and create a document store.
document_store = ElasticsearchDocumentStore(
    host="localhost",
    username="",
    password="",
    index="document",
    create_index=True,
    similarity="dot_product"
)

In [3]:
# Download Game of Thrones data and preprocess the text.
from haystack.utils import clean_wiki_text, fetch_archive_from_http
from haystack.utils.preprocessing import convert_files_to_docs

# Download the archive from S3 and extract the text files to the specified directory.
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to documents.
docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text)

INFO - haystack.utils.import_utils -  Found data stored in `data/article_txt_got`. Delete this first if you really want to fetch new data.
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/145_Elio_M._García_Jr._and_Linda_Antonsson.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/368_Jaime_Lannister.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/133_Game_of_Thrones__Season_5__soundtrack_.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/515_The_Door__Game_of_Thrones_.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/119_Walk_of_Punishment.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/369_Samwell_Tarly.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/356_Tales_of_Dunk_and_Egg.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/195_World_of_A_Song_of_Ice_and_Fire.txt
INFO - haystack.utils

In [4]:
# View what a document looks like.
docs[0]

<Document: {'content': "Linda Antonsson and Elio García at Archipelacon on June 28, 2015.\n'''Elio Miguel García Jr.''' (born May 6, 1978) and '''Linda Maria Antonsson''' (born November 18, 1974) are authors known for their contributions and expertise in the ''A Song of Ice and Fire'' series by George R. R. Martin, co-writing in 2014 with Martin ''The World of Ice & Fire'', a companion book for the series. They are also the founders of the fansite Westeros.org, one of the earliest fan websites for ''A Song of Ice and Fire''.", 'content_type': 'text', 'score': None, 'meta': {'name': '145_Elio_M._García_Jr._and_Linda_Antonsson.txt'}, 'embedding': None, 'id': '41655cc804bb07b1569f3118ce70e05'}>

In [5]:
# Write documents to the document store.
# All but the last 10 documents are stored. We'll use the remaining 10 later.
document_store.write_documents(docs[:-10])

In [6]:
# Define the retriever.
from haystack.nodes import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"
)

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-question_encoder-single-nq-base locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...
INFO - haystack.modeling.model.language_model -  Loaded facebook/dpr-question_encoder-single-nq-base
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-ctx_encoder-single-nq-base locally.
INFO 

In [7]:
# Compute embeddings for all documents in the document store using the retriever.
document_store.update_embeddings(retriever)

INFO - haystack.document_stores.elasticsearch -  Updating embeddings for all 2357 docs ...


Updating embeddings:   0%|          | 0/2357 [00:00<?, ? Docs/s]

Create embeddings:   0%|          | 0/2368 [00:00<?, ? Docs/s]

In [18]:
from haystack.pipelines import DocumentSearchPipeline
from haystack.utils import print_documents

# Build and execute the query pipeline.
pipeline = DocumentSearchPipeline(retriever)
query = "what is the Red Wedding?"
result = pipeline.run(query, params={"Retriever": {"top_k": 2}})

# View the query results.
# You should see a document for "The Rains of Castamere", the episode in
# which the Red Wedding occurs, so the top result is highly relevant.
print_documents(result, max_text_len=100, print_name=True, print_meta=True)


Query: what is the Red Wedding?

{   'content': '"\'\'\'The Rains of Castamere\'\'\'" is the ninth and '
               "penultimate episode of the third season of HBO's fantasy "
               "television series ''Game of Thrones'', and the 29th episode of "
               'the series. The episode was written by executive producers '
               'David Benioff and D. B. Weiss, and directed by David Nutter. '
               'It aired on .\n'
               'The episode is centered on the wedding of Edmure Tully and '
               'Roslin Frey, one of the most memorable events of the book '
               'series, commonly called "The Red Wedding", during which Robb '
               'Stark and his banner...',
    'meta': {'name': '78_The_Rains_of_Castamere.txt'},
    'name': '78_The_Rains_of_Castamere.txt'}

{   'content': '\n'
               '===Writing===\n'
               '"The Rains of Castamere" was written by executive producers '
               "David Benioff and D. B. Wei

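Conceptually, the retriever embeds the query and ranks the stored passages by the similarity the document store was configured with (`dot_product` here), returning the `top_k` best matches. The following is a minimal sketch of that ranking step in plain Python. It is not Haystack's implementation: the toy vectors, the document names, and the `rank_by_dot_product` helper are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_dot_product(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (name, score) pairs, highest dot product first."""
    scored = [(name, dot(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings, made up for illustration only.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.2, 0.8, 0.1],
    "doc_c": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

# Keep only the two best-scoring documents, as `top_k=2` does in the pipeline.
top = rank_by_dot_product(query_vec, doc_vecs, top_k=2)
print([name for name, _ in top])
```

In the real pipeline the vectors come from the DPR encoders and the ranking runs inside Elasticsearch, so this sketch only mirrors the scoring logic, not the performance characteristics.
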
### Add New Documents

What if you want to add new documents to the document store? To avoid recomputing embeddings for every document, set the `update_existing_embeddings` parameter to `False` when you update the embeddings. Only documents that are missing an embedding are then processed.

In [None]:
# Add new documents to the document store.
document_store.write_documents(docs[-10:])

# Update embeddings for documents without an embedding by setting
# `update_existing_embeddings=False`. This should run much faster
# since only 10 documents need to be updated.
document_store.update_embeddings(
    retriever,
    update_existing_embeddings=False
)
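
The effect of `update_existing_embeddings=False` can be pictured as a filter over the stored documents: only those whose embedding is still missing are passed to the encoder. The sketch below mimics that behaviour with plain dictionaries. It is not Haystack's code, and `fake_embed` and `update_missing_embeddings` are hypothetical stand-ins for the retriever and the document store.

```python
from typing import Callable, List


def fake_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real encoder: returns a toy 1-d vector."""
    return [float(len(text))]


def update_missing_embeddings(
    docs: List[dict],
    embed: Callable[[str], List[float]],
) -> int:
    """Embed only documents whose 'embedding' is None; return how many were updated."""
    updated = 0
    for doc in docs:
        if doc["embedding"] is None:
            doc["embedding"] = embed(doc["content"])
            updated += 1
    return updated


# One document already has an embedding; two were added later without one.
store = [
    {"content": "already embedded", "embedding": [16.0]},
    {"content": "new document", "embedding": None},
    {"content": "another new one", "embedding": None},
]

n_updated = update_missing_embeddings(store, fake_embed)
print(n_updated)  # prints 2: the existing embedding is left untouched
```

Running the update a second time would process nothing, which is why the incremental call is much cheaper than re-embedding the whole index.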