<a href="https://colab.research.google.com/github/martindevoto/machine-learning-notebooks-personal/blob/main/Intro_Haystack_pt_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Build Your First QA System


In [None]:
#  A very common use of question answering: navigating
#  complex knowledge bases or long documents ("search setting").

In [None]:
# Make sure you have a GPU running
!nvidia-smi

Sun Feb  6 14:51:57 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P8    34W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install the latest release of Haystack in your environment
#!pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

Collecting pip
  Downloading pip-22.0.3-py3-none-any.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 5.1 MB/s
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
Successfully installed pip-22.0.3
Collecting farm-haystack[colab]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-6v_p0upn/farm-haystack_ee213f65283442349723c929573a214d
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-6v_p0upn/farm-haystack_ee213f65283442349723c929573a214d
  Resolved https://github.com/deepset-ai/haystack.git to commit a095aea21ea9f9a6dff155d571ec7be3f92fcbfa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pydantic
  Downloading pydantic-1.9.0

In [None]:
from haystack.utils import (clean_wiki_text, convert_files_to_dicts,
                            fetch_archive_from_http, print_answers)
from haystack.nodes import FARMReader, TransformersReader

## Document Store
Haystack finds answers to queries within the documents stored in a DocumentStore. The current implementations of DocumentStore include ElasticsearchDocumentStore, FAISSDocumentStore, SQLDocumentStore, and InMemoryDocumentStore.

Here: We recommend Elasticsearch as it comes preloaded with features like full-text queries, BM25 retrieval, and vector storage for text embeddings.

Alternatives: If you are unable to set up an Elasticsearch instance, follow Tutorial 3 for using SQL/InMemory document stores.

Hint: This tutorial creates a new document store instance with Wikipedia articles on Game of Thrones. However, you can configure Haystack to work with your existing document stores.

## Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g. in Colab notebooks), you can manually download and execute Elasticsearch from source.

In [None]:
# Recommended: start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()



In [None]:
# In Colab / no-Docker environments: start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)) # as daemon

# wait until ES has started
! sleep 30   
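A fixed 30-second sleep can be too short on a slow machine and needlessly long on a fast one. A minimal sketch of an alternative, assuming Elasticsearch answers on its default HTTP port 9200 on localhost; `wait_for_http` is a hypothetical helper, not part of Haystack:

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass.

    Hypothetical helper (not part of Haystack). Returns True once the
    server responds, False if the deadline is reached first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not up yet (connection refused, reset, or timed out).
            pass
        time.sleep(interval)
    return False


# Assumed default address of a local Elasticsearch instance:
# ready = wait_for_http("http://localhost:9200", timeout=60)
```

If the call returns `False`, the server did not come up in time, and connecting to the document store would likely fail.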

In [None]:
# Connect to Elasticsearch

from haystack.document_stores import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host='localhost', username='',
                                             password='', index='document')

## Preprocessing of documents
Haystack provides a customizable pipeline for:

* converting files into texts
* cleaning texts
* splitting texts
* writing them to a Document Store

In this tutorial, we download Wikipedia articles about Game of Thrones, apply a basic cleaning function, and index them in Elasticsearch.
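To make the cleaning and splitting steps concrete, here is a minimal pure-Python sketch. The helpers `clean_text` and `split_paragraphs` are hypothetical stand-ins written for illustration; they are not Haystack's `clean_wiki_text` or its splitting logic, and treating blank lines as paragraph boundaries is an assumption:

```python
import re


def clean_text(text):
    """Collapse runs of spaces/tabs and trim each line (illustrative only)."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


# Made-up input with irregular spacing and extra blank lines.
raw = "Arya  Stark is a   character.\n\n\nShe is the daughter of Eddard.\n"
paragraphs = split_paragraphs(clean_text(raw))
print(paragraphs)
```

Each resulting paragraph would then become one document in the store.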

In [None]:
# Let's first fetch some documents that we want to query
# Here: 517 Wikipedia articles for Game of Thrones
doc_dir = 'data/article_txt_got'
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)



INFO - haystack.utils.import_utils -  Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip to `data/article_txt_got`


True

In [None]:
# Convert files to dicts
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/77_Game_of_Thrones_Ascent.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/450_Baelor.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/131_Mhysa.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/504_List_of_A_Song_of_Ice_and_Fire_video_games.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/273_High_Sparrow.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/395_Game_of_Thrones__season_5_.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/208_Robb_Stark.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/485_Oathkeeper.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/362_Winter_Is_Coming.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/201_A_Game_of_Thrones__card_game_.txt
INFO - haystack.utils.p

In [None]:
# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course
# skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is:
# {
#     'content': '<DOCUMENT_TEXT_HERE>',
#     'meta': {'name': '<DOCUMENT_NAME_HERE>', ...}
# }
# (Optionally, you can also add more key-value pairs here that will be indexed
# as fields in Elasticsearch and can be accessed later for filtering or shown
# in the responses of the Pipeline.)

# Let's have a look at the first 3 entries:
print(dicts[:3])

[{'content': '\'\'\'\'\'Game of Thrones Ascent\'\'\'\'\' was a strategy video game developed by Disruptor Beam for iOS, Facebook, Kongregate, and Android. The game was a 2013 Facebook Game of the Year in the Staff Picks category and a winner of a 2013 Friendie Award. The game is an adaptation of the novel series \'\'A Song of Ice and Fire\'\' by George R. R. Martin and the HBO TV series \'\'Game of Thrones\'\', and is the first such social network game. According to Martin, the game features "alliance building, treachery, marriages, murders, and most of all the constant struggle to be the greatest house in Westeros." The game includes the ability to engage in the dynamic political and social intrigue featured in the books and television show. The game has over 9 million registered players though daily activity suggests 3 thousand active players.\nDisruptor Beam released the first expansion for the game, titled "The Long Night", in October 2014. The expansion allows players to travel be
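If the texts come from somewhere other than files, the same structure can be built by hand. A small sketch, assuming the `content` and `meta` keys visible in the printed entries; the `rows` list and the extra `source` field are made up for illustration:

```python
# Hypothetical rows, e.g. fetched from a database: (title, body, source)
rows = [
    ("Arya Stark", "Arya is the younger daughter of Eddard Stark.", "wiki"),
    ("Jon Snow", "Jon Snow joins the Night's Watch.", "wiki"),
    ("Empty page", "   ", "wiki"),
]


def rows_to_dicts(rows):
    """Build one document dict per non-empty row."""
    dicts = []
    for title, body, source in rows:
        body = body.strip()
        if not body:
            continue  # skip rows without any text
        dicts.append({"content": body, "meta": {"name": title, "source": source}})
    return dicts


custom_dicts = rows_to_dicts(rows)
print(len(custom_dicts))
```

A list built this way could then be passed to `document_store.write_documents(...)` in the same way as `dicts`.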

## Initialize Retriever, Reader, & Pipeline
### Retriever
Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered. They use simple but fast algorithms.

Here: We use Elasticsearch's default BM25 algorithm.

Alternatives:

* Customize the ElasticsearchRetriever with custom queries (e.g. boosting) and filters
* Use TfidfRetriever in combination with a SQL or InMemory Document store for simple prototyping and debugging
* Use EmbeddingRetriever to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
* Use DensePassageRetriever to use different embedding models for passage and query (see Tutorial 6)
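To give a feel for what BM25 ranking does, here is a small self-contained sketch. It illustrates the formula with commonly used parameter values (`k1=1.2`, `b=0.75`) and naive whitespace tokenization; it is not Elasticsearch's implementation, and all function names and toy documents are hypothetical:

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (illustrative only)."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue  # term absent from this document
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = [
    "arya stark is the daughter of eddard stark",
    "the dothraki language was created for the series",
    "jon snow joins the night's watch",
]
scores = bm25_scores("father of arya stark", documents)
best = max(range(len(documents)), key=lambda i: scores[i])
print(best, scores)
```

Documents that share no term with the query score zero, which is why a sparse retriever like this is fast but blind to paraphrases; the embedding-based alternatives listed above address that.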

In [None]:
# Now, let's write the dicts containing the documents to our DB.
document_store.write_documents(dicts)

In [None]:
from haystack.nodes import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [None]:
# Alternative: An in-memory TfidfRetriever based on Pandas dataframes for
# building quick prototypes with an SQLite document store

# from haystack.nodes import TfidfRetriever
# retriever = TfidfRetriever(document_store=document_store)

## Reader
A Reader scans the texts returned by the Retriever in detail and extracts the k best answers. Readers are based on powerful but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers. With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

Here: a medium-sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)

Alternatives (Reader): TransformersReader (leveraging the pipeline of the Transformers package)

Alternatives (Models): e.g. "distilbert-base-uncased-distilled-squad" (fast) or "deepset/bert-large-uncased-whole-word-masking-squad2" (good accuracy)

Hint: You can adjust how readily the model returns "no answer possible" with the no_ans_boost parameter. Higher values mean the model prefers "no answer possible".
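As a rough intuition for how an extractive reader picks an answer span and how a no-answer boost shifts the decision, consider this simplified sketch. The scoring scheme (sum of a start score and an end score, compared against a boosted no-answer score) is an assumption made for illustration and does not reproduce FARM's internals; `best_span`, `pick_answer`, and the toy scores are hypothetical:

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (score, start, end) of the best span with start <= end."""
    best = None
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


def pick_answer(tokens, start_scores, end_scores, no_answer_score, no_ans_boost=0.0):
    """Return the answer text, or None when 'no answer' scores higher."""
    score, start, end = best_span(start_scores, end_scores)
    if no_answer_score + no_ans_boost > score:
        return None
    return " ".join(tokens[start:end + 1])


# Toy per-token scores, invented for this example.
tokens = ["her", "father", "Eddard", "Stark", "rides", "south"]
start_scores = [0.1, 0.2, 4.0, 0.5, 0.1, 0.0]
end_scores = [0.0, 0.1, 1.0, 3.5, 0.2, 0.1]

print(pick_answer(tokens, start_scores, end_scores, no_answer_score=2.0))
print(pick_answer(tokens, start_scores, end_scores, no_answer_score=2.0, no_ans_boost=10.0))
```

With no boost the span "Eddard Stark" wins; with a large boost the no-answer option overtakes it and the function returns `None`.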

### FARMReader

In [None]:
# Load a local model or any of the QA models on 
# HuggingFace's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path='deepset/roberta-base-squad2', use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/473M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2


Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/772 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


In [None]:
# TransformersReader
# Alternative:
# reader = TransformersReader(model_name_or_path='distilbert-base-uncased-distilled-squad',
#                             tokenizer='distilbert-base-uncased', use_gpu=-1)

## Pipeline
With a Haystack Pipeline you can combine your building blocks into a search pipeline. Under the hood, Pipelines are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases. To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the ExtractiveQAPipeline, which combines a retriever and a reader to answer our questions. You can learn more about Pipelines in the docs.
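The retriever-then-reader flow can be pictured with a toy example. This is a conceptual sketch only: `ToyRetriever`, `ToyReader`, and `run_pipeline` are hypothetical stand-ins that mimic the shape of a per-node `params` dictionary, and they do not use Haystack's Pipeline classes. The documents are made up as well:

```python
class ToyRetriever:
    """Ranks documents by how many query words they share (fast, crude)."""

    def __init__(self, documents):
        self.documents = documents

    def run(self, query, top_k=10):
        words = set(query.lower().split())
        ranked = sorted(
            self.documents,
            key=lambda doc: len(words & set(doc.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


class ToyReader:
    """Returns the sentences sharing the most query words (stand-in for a QA model)."""

    def run(self, query, documents, top_k=5):
        words = set(query.lower().split())
        sentences = [s.strip() for doc in documents for s in doc.split(".") if s.strip()]
        ranked = sorted(
            sentences,
            key=lambda s: len(words & set(s.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


def run_pipeline(retriever, reader, query, params):
    """Chain the two nodes: the retriever's output feeds the reader."""
    candidates = retriever.run(query, **params.get("Retriever", {}))
    answers = reader.run(query, candidates, **params.get("Reader", {}))
    return {"query": query, "documents": candidates, "answers": answers}


docs = [
    "Arya is the daughter of Eddard Stark. She has a sister named Sansa.",
    "The Dothraki language was created by David Peterson.",
    "Jon Snow joins the Night's Watch.",
]
result = run_pipeline(
    ToyRetriever(docs), ToyReader(), "who created the dothraki language",
    params={"Retriever": {"top_k": 2}, "Reader": {"top_k": 1}},
)
print(result["answers"])
```

The key design point is the narrowing: the cheap node looks at every document, while the expensive node only sees the few candidates that survive.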

In [None]:
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

In [None]:
# Voilà, ask a question!

# You can configure how many candidates the reader and retriever shall return.
# The higher the retriever's top_k, the better (but also slower) your answers.
prediction = pipe.run(
    query='Who is the father of Arya Stark?', params={'Retriever': {'top_k': 10},
                                                      'Reader': {'top_k': 5}}
)

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.75 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.57 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.96 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.12 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.47 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 20.14 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.95 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.74 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 14.92 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 14.64 Batches/s]


In [None]:
# prediction = pipe.run(query="Who created the Dothraki vocabulary?", params={"Reader": {"top_k": 5}})
# prediction = pipe.run(query="Who is the sister of Sansa?", params={"Reader": {"top_k": 5}})

In [None]:
# Now you can either print the object directly...
from pprint import pprint
pprint(prediction)
# Sample output:    
# {
#     'answers': [ <Answer: answer='Eddard', type='extractive', score=0.9919578731060028, offsets_in_document=[{'start': 608, 'end': 615}], offsets_in_context=[{'start': 72, 'end': 79}], document_id='cc75f739897ecbf8c14657b13dda890e', meta={'name': '454_Music_of_Game_of_Thrones.txt'}}, context='...' >,
#                  <Answer: answer='Ned', type='extractive', score=0.9767240881919861, offsets_in_document=[{'start': 3687, 'end': 3801}], offsets_in_context=[{'start': 18, 'end': 132}], document_id='9acf17ec9083c4022f69eb4a37187080', meta={'name': '454_Music_of_Game_of_Thrones.txt'}}, context='...' >,
#                  ...
#                ]
#     'documents': [ <Document: content_type='text', score=0.8034909798951382, meta={'name': '332_Sansa_Stark.txt'}, embedding=None, id='d1f36ec7170e4c46cde65787fe125dfe', content='\n===\'\'A Game of Thrones\'\'===\nSansa Stark begins the novel by being betrothed to Crown ...'>,
#                    <Document: content_type='text', score=0.8002150354529785, meta={'name': '191_Gendry.txt'}, embedding=None, id='dd4e070a22896afa81748d6510006d2', content='\n===Season 2===\nGendry travels North with Yoren and other Night\'s Watch recruits, including Arya ...'>,
#                    ...
#                  ],
#     'no_ans_gap':  11.688868522644043,
#     'node_id': 'Reader',
#     'params': {'Reader': {'top_k': 5}, 'Retriever': {'top_k': 5}},
#     'query': 'Who is the father of Arya Stark?',
#     'root_node': 'Query'
# }

{'answers': [<Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9919579923152924, 'context': "s Nymeria after a legendary warrior queen. She travels with her father, Eddard, to King's Landing when he is made Hand of the King. Before she leaves,", 'offsets_in_document': [{'start': 147, 'end': 153}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_id': 'ba2a8e87ddd95e380bec55983ee7d55f', 'meta': {'name': '43_Arya_Stark.txt'}}>,
             <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 0.9767242670059204, 'context': "\n====Season 1====\nArya accompanies her father Ned and her sister Sansa to King's Landing. Before their departure, Arya's half-brother Jon Snow gifts A", 'offsets_in_document': [{'start': 46, 'end': 49}], 'offsets_in_context': [{'start': 46, 'end': 49}], 'document_id': '180c2a6b36369712b361a80842e79356', 'meta': {'name': '43_Arya_Stark.txt'}}>,
             <Answer {'answer': 'Lord Eddard Stark', 'type': 'extractive', 'score': 0.89303985238
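When post-processing results, it is often handy to keep only the confident answers. A minimal sketch, assuming each answer exposes `answer` and `score` attributes as the printed output suggests; `FakeAnswer` is a hypothetical stand-in so the snippet runs without Haystack, and the 0.9 threshold is arbitrary:

```python
from dataclasses import dataclass


@dataclass
class FakeAnswer:
    """Stand-in with the two fields used below (not Haystack's Answer class)."""
    answer: str
    score: float


def confident_answers(answers, min_score=0.9):
    """Return (answer, score) pairs at or above min_score, best first."""
    kept = [(a.answer, a.score) for a in answers if a.score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Scores rounded from the printed output.
sample = [
    FakeAnswer("Ned", 0.977),
    FakeAnswer("Eddard", 0.992),
    FakeAnswer("Lord Eddard Stark", 0.893),
]
print(confident_answers(sample))
```

With a real result, the same function could be called as `confident_answers(prediction['answers'])`, provided the attribute names match.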

In [None]:
# ...or use a util to simplify the output
# change 'minimum' to 'medium' or 'all' to raise the level of detail
print_answers(prediction, details='minimum')


Query: Who is the father of Arya Stark?
Answers:
[   {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Ned',
        'context': '\n'
                   '====Season 1====\n'
                   'Arya accompanies her father Ned and her sister Sansa to '
                   "King's Landing. Before their departure, Arya's "
                   'half-brother Jon Snow gifts A'},
    {   'answer': 'Lord Eddard Stark',
        'context': 'ark daughters.\n'
                   'During the Tourney of the Hand to honour her father Lord '
                   'Eddard Stark, Sansa Stark is enchanted by the knights '
                   'performing in the event.'},
    {   'answer': 'Joffrey',
        'context': 'laying with one of his wooden toys.\n'
                   "After Eddard discovers the 