# Build a QA System Without Elasticsearch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb)

Haystack provides alternatives to Elasticsearch for developing quick prototypes.

You can use an `InMemoryDocumentStore` or a `SQLDocumentStore` (with SQLite) as the document store.

If you are interested in the more feature-rich Elasticsearch setup, please refer to Tutorial 1.

In [None]:
# Make sure you have a GPU running
!nvidia-smi

/bin/bash: nvidia-smi: command not found


In [None]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

Collecting farm-haystack[colab]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-tll5jkpd/farm-haystack_00e8919311684c908025942373e6a635
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-tll5jkpd/farm-haystack_00e8919311684c908025942373e6a635
  Resolved https://github.com/deepset-ai/haystack.git to commit a095aea21ea9f9a6dff155d571ec7be3f92fcbfa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

In [None]:
from haystack.utils import (clean_wiki_text, convert_files_to_dicts,
                            fetch_archive_from_http, print_answers)
from haystack.nodes import FARMReader, TransformersReader

## Document Store

In [None]:
# In-Memory Document Store
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0


In [None]:
# SQLite Document Store
# from haystack.document_stores import SQLDocumentStore
# document_store = SQLDocumentStore(url="sqlite:///qa.db")

## Preprocessing of documents

Haystack provides a customizable pipeline for:
 - converting files into texts
 - cleaning texts
 - splitting texts
 - writing them to a Document Store

In this tutorial, we download Wikipedia articles on Game of Thrones, apply a basic cleaning function, and index them in the document store created above.
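
To make the cleaning and splitting steps concrete, here is a minimal, standard-library-only sketch of what happens to a single file. It is not Haystack's implementation: `strip_footer` and `to_dicts` are hypothetical helpers, splitting on blank lines is an assumption, and the `content`/`meta` dictionary layout simply mirrors the entries printed later in this notebook.

```python
import re


def strip_footer(text: str) -> str:
    """Hypothetical cleaning function: takes a str and returns a str."""
    # Drop everything from a trailing "References" heading onwards.
    text = re.split(r"\n==\s*References\s*==", text)[0]
    # Collapse runs of three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def to_dicts(name: str, raw_text: str, clean_func=None) -> list:
    """Clean one file's text, split it into paragraphs, build one dict per paragraph."""
    text = clean_func(raw_text) if clean_func else raw_text
    # Assumption: paragraphs are separated by a blank line.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [{"content": p, "meta": {"name": name}} for p in paragraphs]


raw = (
    "Arya Stark is a fictional character.\n\n\n\n"
    "She is the daughter of Eddard Stark.\n"
    "== References ==\n[1] ..."
)
docs = to_dicts("43_Arya_Stark.txt", raw, clean_func=strip_footer)
for d in docs:
    print(d)
```

Each resulting dictionary is one unit that the document store indexes, which is why splitting long articles into paragraphs gives the retriever smaller candidates to rank.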

In [None]:
# Let's first get some documents that we want to query
# Here: 517 Wikipedia articles for Game of Thrones
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# convert files to dicts containing documents that can be indexed to our datastore
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is: {"content": "<the-actual-text>", "meta": {"name": "<some-document-name>"}}

# Let's have a look at the first 3 entries:
print(dicts[:3])
# Now, let's write the docs to our DB.
document_store.write_documents(dicts)

INFO - haystack.utils.import_utils -  Found data stored in `data/article_txt_got`. Delete this first if you really want to fetch new data.
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/7_The_Spoils_of_War__Game_of_Thrones_.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/485_Oathkeeper.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/460_Battle_of_the_Bastards.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/52_Catch_the_Throne.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/98_Black_Friday__South_Park_.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/121_The_Bear_and_the_Maiden_Fair.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/330_Oberyn_Martell.txt
INFO - haystack.utils.preprocessing -  Converting data/article_txt_got/148_Game_of_Thrones__Winter_Is_Coming.txt
INFO - haystack.utils.preprocessing -  Con

[{'content': '"\'\'\'The Spoils of War\'\'\'" is the fourth episode of the seventh season of HBO\'s fantasy television series \'\'Game of Thrones\'\', and the 64th overall. It was written by series co-creators David Benioff and D. B. Weiss, and directed by Matt Shakman.\nAt Dragonstone, Daenerys Targaryen and Jon Snow observe cave drawings left by the Children of the Forest, indicating that the First Men and the Children fought together against the White Walkers. In King\'s Landing, Cersei Lannister seeks further investment from the Iron Bank, after reassuring them that their debt will soon be paid. In the North, Arya Stark returns to Winterfell, reunites with her siblings, Sansa and Bran Stark, and spars with Brienne of Tarth. On the road to King\'s Landing, Jaime Lannister, Bronn, and the Lannister and Tarly armies are caught in an attack led by Daenerys, her dragon Drogon, and the Dothraki army.\nThe title of the episode refers to the Tyrell gold and other resources in possession of

INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '9e9a3181b6bc168b4a25429b641e8c86' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '5d79fbf801011475553a09ab068f02e2' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '7ed12f389f7f085bb30c7d00abd26f81' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '9e9a3181b6bc168b4a25429b641e8c86' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'bbcb394a991cab6a7f8c18e5a294452f' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'c8b51f62e0fccac8361c4464cc2c8f70' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '9e9a3181b6bc168b4a25429b641e8c86'

## Initialize Retriever, Reader & Pipeline

### Retriever

Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered.

With `InMemoryDocumentStore` or `SQLDocumentStore`, you can use the `TfidfRetriever`. For more retrievers, please refer to Tutorial 1.
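
As a rough illustration of how TF-IDF scoring narrows down candidates, the sketch below ranks a few toy paragraphs against a query using only the standard library. It is a textbook TF-IDF variant written for this example, not the code behind `TfidfRetriever`; the function names and the sample paragraphs are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query, paragraphs, top_k=2):
    """Rank paragraphs by a simple TF-IDF overlap score with the query."""
    tokenized = [tokenize(p) for p in paragraphs]
    n_docs = len(tokenized)
    # Document frequency: in how many paragraphs each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term in tf:
                idf = math.log(n_docs / df[term])  # rare terms weigh more
                score += (tf[term] / len(tokens)) * idf
        scores.append((score, idx))
    # Highest score first; ties keep the original paragraph order.
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [(paragraphs[i], score) for score, i in ranked[:top_k] if score > 0]


paragraphs = [
    "Arya Stark travels with her father Eddard to King's Landing.",
    "Gendry is the bastard son of Robert Baratheon.",
    "The soundtrack of the series was composed by Ramin Djawadi.",
]
results = tfidf_rank("Who is the father of Arya Stark?", paragraphs, top_k=2)
for text, score in results:
    print(round(score, 3), text)
```

Words that occur in few paragraphs ("father", "arya") contribute more than common ones ("the", "of"), so the paragraph about Arya's father ends up first and only the top candidates are handed to the Reader.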

In [None]:
# An in-memory TfidfRetriever based on Pandas dataframes

from haystack.nodes import TfidfRetriever

retriever = TfidfRetriever(document_store=document_store)

INFO - haystack.nodes.retriever.sparse -  Found 2357 candidate paragraphs from 2357 docs in DB


### Reader

A Reader scans the texts returned by the Retriever in detail and extracts the k best answers. Readers are based
on powerful but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers.
With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

**Here:** a medium-sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)

**Alternatives (Reader):** TransformersReader (leveraging the `pipeline` of the Transformers package)

**Alternatives (Models):** e.g. "distilbert-base-uncased-distilled-squad" (fast) or "deepset/bert-large-uncased-whole-word-masking-squad2" (good accuracy)

**Hint:** You can adjust how readily the model returns "no answer possible" with the `no_ans_boost` parameter. Higher values mean the model prefers "no answer possible".
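
The effect of such a boost can be pictured with a small, simplified sketch. It assumes the boost is simply added to the score of the "no answer" option before comparing it with the best extracted span; the function name and the scores are hypothetical, and the actual Reader logic may differ in detail.

```python
def pick_answer(span_candidates, no_answer_score, no_ans_boost=0.0):
    """Return the best span, or None when the boosted "no answer" score wins.

    span_candidates: list of (answer_text, score) pairs with made-up raw scores.
    """
    best_text, best_score = max(span_candidates, key=lambda c: c[1])
    # Assumption: the boost is added to the "no answer" score before comparing.
    if no_answer_score + no_ans_boost > best_score:
        return None
    return best_text


candidates = [("Eddard", 8.2), ("Ned", 7.9)]
without_boost = pick_answer(candidates, no_answer_score=5.0)
with_boost = pick_answer(candidates, no_answer_score=5.0, no_ans_boost=4.0)
print(without_boost)  # Eddard
print(with_boost)     # None
```

In this picture, a negative boost would have the opposite effect and make the model more willing to commit to an extracted span.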

#### FARMReader

In [None]:
# Load a local model or any of the QA models on 
# HuggingFace's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                    use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...
INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2
INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


In [None]:
# TransformersReader
# Alternative:
# reader = TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad",
# tokenizer="distilbert-base-uncased", use_gpu=-1)

### Pipeline

With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
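
To show the idea of chaining a retriever and a reader, here is a toy, standard-library-only sketch of a linear pipeline (the simplest kind of DAG). `MiniPipeline`, `toy_retriever`, and `toy_reader` are hypothetical stand-ins invented for this example and do not reflect how Haystack implements its nodes; only the shape of the `params` argument follows the call used later in this notebook.

```python
PARAGRAPHS = [
    "Arya travels with her father Eddard to King's Landing.",
    "Gendry is the son of Robert Baratheon.",
    "Ramin Djawadi composed the soundtrack.",
]


class MiniPipeline:
    """Hypothetical stand-in: runs named nodes in order, passing each output on."""

    def __init__(self):
        self.nodes = []  # list of (name, callable) in execution order

    def add_node(self, name, func):
        self.nodes.append((name, func))

    def run(self, query, params=None):
        params = params or {}
        data = None
        for name, func in self.nodes:
            # Each node receives the query, the previous output, and its own params.
            data = func(query, data, **params.get(name, {}))
        return data


def toy_retriever(query, _previous, top_k=10):
    """Keep the top_k paragraphs sharing the most words with the query."""
    words = set(query.lower().rstrip("?").split())
    scored = [(len(words & set(p.lower().rstrip(".").split())), p) for p in PARAGRAPHS]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: -s[0])
    return [p for _, p in scored[:top_k]]


def toy_reader(query, documents, top_k=5):
    """Pretend to 'read': return the word following 'father' in each document."""
    answers = []
    for doc in documents:
        tokens = doc.rstrip(".").split()
        if "father" in tokens and tokens.index("father") + 1 < len(tokens):
            answers.append(tokens[tokens.index("father") + 1])
    return answers[:top_k]


toy_pipe = MiniPipeline()
toy_pipe.add_node("Retriever", toy_retriever)
toy_pipe.add_node("Reader", toy_reader)

question = "Who is the father of Arya Stark?"
wide = toy_pipe.run(question, params={"Retriever": {"top_k": 2}, "Reader": {"top_k": 1}})
narrow = toy_pipe.run(question, params={"Retriever": {"top_k": 1}, "Reader": {"top_k": 1}})
print(wide)    # ['Eddard']
print(narrow)  # []
```

The second call shows why the retriever's `top_k` matters: the crude word-overlap ranking puts an irrelevant paragraph first, so with `top_k=1` the reader never sees the paragraph that contains the answer.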

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

In [None]:
## Voilà! Ask a question!

# You can configure how many candidates the Reader and the Retriever return.
# The higher the Retriever's top_k, the better (but also slower) your answers.
prediction = pipe.run(
    query="Who is the father of Arya Stark?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.85s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:04<00:00,  4.34s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.75s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:02<00:00,  2.57s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.77s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.74s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.77s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:02<00:00,  2.58s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.00 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.02 Batches/s]


In [None]:
# Now you can either print the object directly...
from pprint import pprint

pprint(prediction)

# Sample output:
# {
#     'answers': [ <Answer: answer='Eddard', type='extractive', score=0.9919578731060028, offsets_in_document=[{'start': 608, 'end': 615}], offsets_in_context=[{'start': 72, 'end': 79}], document_id='cc75f739897ecbf8c14657b13dda890e', meta={'name': '454_Music_of_Game_of_Thrones.txt'}}, context='...' >,
#                  <Answer: answer='Ned', type='extractive', score=0.9767240881919861, offsets_in_document=[{'start': 3687, 'end': 3801}], offsets_in_context=[{'start': 18, 'end': 132}], document_id='9acf17ec9083c4022f69eb4a37187080', meta={'name': '454_Music_of_Game_of_Thrones.txt'}}, context='...' >,
#                  ...
#                ],
#     'documents': [ <Document: content_type='text', score=0.8034909798951382, meta={'name': '332_Sansa_Stark.txt'}, embedding=None, id='d1f36ec7170e4c46cde65787fe125dfe', content='\n===\'\'A Game of Thrones\'\'===\nSansa Stark begins the novel by being betrothed to Crown ...'>,
#                    <Document: content_type='text', score=0.8002150354529785, meta={'name': '191_Gendry.txt'}, embedding=None, id='dd4e070a22896afa81748d6510006d2', content='\n===Season 2===\nGendry travels North with Yoren and other Night\'s Watch recruits, including Arya ...'>,
#                    ...
#                  ],
#     'no_ans_gap':  11.688868522644043,
#     'node_id': 'Reader',
#     'params': {'Reader': {'top_k': 5}, 'Retriever': {'top_k': 5}},
#     'query': 'Who is the father of Arya Stark?',
#     'root_node': 'Query'
# }

{'answers': [<Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9919578731060028, 'context': "s Nymeria after a legendary warrior queen. She travels with her father, Eddard, to King's Landing when he is made Hand of the King. Before she leaves,", 'offsets_in_document': [{'start': 147, 'end': 153}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_id': 'ba2a8e87ddd95e380bec55983ee7d55f', 'meta': {'name': '43_Arya_Stark.txt'}}>,
             <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 0.9767240881919861, 'context': "\n====Season 1====\nArya accompanies her father Ned and her sister Sansa to King's Landing. Before their departure, Arya's half-brother Jon Snow gifts A", 'offsets_in_document': [{'start': 46, 'end': 49}], 'offsets_in_context': [{'start': 46, 'end': 49}], 'document_id': '180c2a6b36369712b361a80842e79356', 'meta': {'name': '43_Arya_Stark.txt'}}>,
             <Answer {'answer': 'Robert Baratheon', 'type': 'extractive', 'score': 0.940885215997

In [None]:
# ...or use a util to simplify the output
# Change `minimum` to `medium` or `all` to raise the level of detail
print_answers(prediction, details="minimum")


Query: Who is the father of Arya Stark?
Answers:
[   {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Ned',
        'context': '\n'
                   '====Season 1====\n'
                   'Arya accompanies her father Ned and her sister Sansa to '
                   "King's Landing. Before their departure, Arya's "
                   'half-brother Jon Snow gifts A'},
    {   'answer': 'Robert Baratheon',
        'context': 'hen Gendry gives it to Arya, he tells her he is the '
                   'bastard son of Robert Baratheon. Aware of their chances of '
                   'dying in the upcoming battle and Arya w'},
    {   'answer': 'Eddard and Catelyn Stark',
        'context': 'tark ===\n'
                   'Arya Stark is the third child and younger daughter of '
     