# Reader

We've already explored the *reader* model in the first few sections of this chapter, where we loaded a pretrained model using the `transformers` library.

Now we'll do the same through Haystack, which uses the same `transformers` library in the background, so the same pretrained model names work.

To initialize our reader via Haystack all we need is:

In [1]:
from haystack.nodes import FARMReader

In [2]:
reader = FARMReader(model_name_or_path='deepset/bert-base-cased-squad2', use_gpu=True)

Now that our reader model is initialized, let's load our FAISS index and DPR retriever just like before.

In [3]:
# import paths for the Haystack version used here (older releases used
# haystack.document_store.faiss and haystack.retriever.dense)
from haystack.document_stores.faiss import FAISSDocumentStore
from haystack.nodes.retriever.dense import DensePassageRetriever

path = '../../models/faiss'

# load FAISS from file (the SQuAD validation set index); this version of
# load() takes the index file and its JSON config rather than a SQLite URL
document_store = FAISSDocumentStore.load(index_path=f'{path}/squad_dev.faiss', config_path=f'{path}/squad_dev.json')

# initialize DPR model
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model='facebook/dpr-question_encoder-single-nq-base',
    passage_embedding_model='facebook/dpr-ctx_encoder-single-nq-base',
    use_gpu=True,
    embed_title=True
)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.


Haystack provides an `ExtractiveQAPipeline` class that takes our `reader` and `retriever` and combines them into an easy-to-use *extractive* Q&A pipeline:

In [4]:
# older Haystack releases used: from haystack.pipeline import ExtractiveQAPipeline
from haystack.pipelines import ExtractiveQAPipeline

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

We can ask questions with the `run` method, passing the question as the `query` parameter:

In [16]:
pipeline.run(query='What is the best statistical modeling approach?', params={"Retriever": {"top_k": 3}})['answers']

Inferencing Samples: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.19 Batches/s]


[<Answer {'answer': 'complexity-theoretic theorems regularly assume some concrete choice of input encoding, one tries to keep the discussion abstract enough to be independent of the choice of encoding', 'type': 'extractive', 'score': 0.005286967847496271, 'context': ' complexity-theoretic theorems regularly assume some concrete choice of input encoding, one tries to keep the discussion abstract enough to be independent of the choice of encoding', 'offsets_in_document': [{'start': 27, 'end': 206}], 'offsets_in_context': [{'start': 1, 'end': 180}], 'document_id': 'fdafa6182e072da1f6005c49f858c175', 'meta': {'vector_id': '1195'}}>,
 <Answer {'answer': 'integer factorization problem.', 'type': 'extractive', 'score': 0.004754381254315376, 'context': "f a decision problem, that is, it isn't just yes or no. Notable examples include the traveling salesman problem and the integer factorization problem.", 'offsets_in_document': [{'start': 281, 'end': 311}], 'offsets_in_context': [{'start': 120, 

In [15]:
pipeline.run(query='What is the best statistical modeling approach?', params={"Reader": {"top_k": 3}})['answers']

Inferencing Samples: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.88 Batches/s]


[<Answer {'answer': 'Fermat primality test', 'type': 'extractive', 'score': 0.11868560314178467, 'context': "A particularly simple example of a probabilistic test is the Fermat primality test, which relies on the fact (Fermat's little theorem) that np≡n (mod ", 'offsets_in_document': [{'start': 61, 'end': 82}], 'offsets_in_context': [{'start': 61, 'end': 82}], 'document_id': '70355cf08ce24ae883dce1eceb422e54', 'meta': {'vector_id': '493'}}>,
 <Answer {'answer': 'probabilistic', 'type': 'extractive', 'score': 0.027851182967424393, 'context': 'ty tests for general numbers n can be divided into two main classes, probabilistic (or "Monte Carlo") and deterministic algorithms. Deterministic algo', 'offsets_in_document': [{'start': 83, 'end': 96}], 'offsets_in_context': [{'start': 69, 'end': 82}], 'document_id': '4b80ccd6b10877e13ce75ebcf6945a6c', 'meta': {'vector_id': '298'}}>,
 <Answer {'answer': 'The best case occurs when each pivoting divides the list in half, also needing O(n log n) time
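The raw `Answer` objects printed above are verbose, so it can help to reduce them to just the text and score. The helper below is a minimal sketch: `summarize` is a hypothetical name, and it assumes each answer exposes `answer` and `score` attributes, as the printed fields suggest. `SimpleNamespace` objects stand in for real Haystack answers so the snippet runs on its own.

```python
from types import SimpleNamespace


def summarize(answers, ndigits=3):
    """Return (answer text, rounded score) pairs, highest score first."""
    ranked = sorted(answers, key=lambda a: a.score, reverse=True)
    return [(a.answer, round(a.score, ndigits)) for a in ranked]


# Stand-ins for Haystack answers, using two values from the output above.
demo_answers = [
    SimpleNamespace(answer='probabilistic', score=0.027851182967424393),
    SimpleNamespace(answer='Fermat primality test', score=0.11868560314178467),
]

summary = summarize(demo_answers)
for text, score in summary:
    print(f'{score:.3f}  {text}')
```

With a real pipeline, the same helper would be applied to the list returned under the `'answers'` key.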

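These calls can return a lot of different answers, which is not really necessary. We can limit the number of items returned at each stage with a `top_k` setting, passed through `params` under the node names `"Retriever"` and `"Reader"` (older Haystack releases used the `top_k_retriever` and `top_k_reader` keyword arguments instead). The retriever's `top_k` caps how many documents are passed on to the reader, and the reader's `top_k` caps how many answers are returned.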

In [14]:
# The older `top_k_retriever=3` keyword argument no longer works in this
# Haystack version; the limit is passed per node through `params` instead.

pipeline.run(query='What does theoretical computer science cover?', params={"Retriever": {"top_k": 3}})['answers']

Inferencing Samples: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.66 Batches/s]


[<Answer {'answer': 'analysis of algorithms and computability theory', 'type': 'extractive', 'score': 0.8070324659347534, 'context': ' related fields in theoretical computer science are analysis of algorithms and computability theory. A key distinction between analysis of algorithms ', 'offsets_in_document': [{'start': 59, 'end': 106}], 'offsets_in_context': [{'start': 52, 'end': 99}], 'document_id': '42b3e25793765b085da634955c175f98', 'meta': {'vector_id': '254'}}>,
 <Answer {'answer': 'Science of the Physical Universe', 'type': 'extractive', 'score': 0.3831784725189209, 'context': 'l Reasoning, Ethical Reasoning, Science of Living Systems, Science of the Physical Universe, Societies of the World, and United States in the World. H', 'offsets_in_document': [{'start': 556, 'end': 588}], 'offsets_in_context': [{'start': 59, 'end': 91}], 'document_id': 'fbccb09bae381f66ed2ec74583b5dc09', 'meta': {'vector_id': '1182'}}>,
 <Answer {'answer': 'classifying computational problems according to 

And the same for the `reader`:

In [13]:
# The older `top_k_reader=3` keyword argument no longer works in this
# Haystack version; the limit is passed per node through `params` instead.
pipeline.run(query='What does theoretical computer science cover?', params={"Reader": {"top_k": 3}})['answers']

Inferencing Samples: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.02 Batches/s]


[<Answer {'answer': 'analysis of algorithms and computability theory', 'type': 'extractive', 'score': 0.8070326447486877, 'context': ' related fields in theoretical computer science are analysis of algorithms and computability theory. A key distinction between analysis of algorithms ', 'offsets_in_document': [{'start': 59, 'end': 106}], 'offsets_in_context': [{'start': 52, 'end': 99}], 'document_id': '42b3e25793765b085da634955c175f98', 'meta': {'vector_id': '254'}}>,
 <Answer {'answer': 'algorithmic problems', 'type': 'extractive', 'score': 0.42935144901275635, 'context': 'fore the actual research explicitly devoted to the complexity of algorithmic problems started off, numerous foundations were laid out by various resea', 'offsets_in_document': [{'start': 67, 'end': 87}], 'offsets_in_context': [{'start': 65, 'end': 85}], 'document_id': '27a839b0a3dba79e809de956f6832be6', 'meta': {'vector_id': '114'}}>,
 <Answer {'answer': 'Science of the Physical Universe', 'type': 'extractive', 'scor
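Both limits can also be set in a single call, since `params` is just a nested dictionary keyed by node name. The sketch below builds that dictionary; `build_params` is a hypothetical helper, and the values 10 and 3 are arbitrary examples. The `pipeline.run` call is left as a comment because it needs the models and index loaded earlier.

```python
def build_params(retriever_top_k=None, reader_top_k=None):
    """Build the nested `params` dict for `pipeline.run`, skipping unset limits."""
    params = {}
    if retriever_top_k is not None:
        params['Retriever'] = {'top_k': retriever_top_k}
    if reader_top_k is not None:
        params['Reader'] = {'top_k': reader_top_k}
    return params


# Retrieve 10 candidate documents, but return only the 3 best answers.
params = build_params(retriever_top_k=10, reader_top_k=3)
print(params)

# Intended use with the pipeline built earlier:
# pipeline.run(query='What does theoretical computer science cover?', params=params)['answers']
```

Retrieving more documents than the number of answers requested gives the reader more candidates to choose from, at the cost of extra inference time.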

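The same pipeline works with other document stores. Next we swap the FAISS index for an Elasticsearch index named `squad_docs`, which means re-initializing the retriever against it and recomputing the document embeddings.

In [19]:
import ast

from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

def read_credentials(file_name='credentials.txt'):
    # The file is expected to hold a Python dict literal with the keys
    # 'username', 'pwd' and 'ca_certs'. ast.literal_eval parses literals only,
    # so unlike eval it cannot execute code placed in the file.
    with open(file_name, 'r') as f:
        return ast.literal_eval(f.read())

credential_dict = read_credentials()

document_store = ElasticsearchDocumentStore(
    host='localhost',
    scheme='https',
    username=credential_dict['username'],
    password=credential_dict['pwd'],
    ca_certs=credential_dict['ca_certs'],
    index='squad_docs'
)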
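The cell above reads `credentials.txt` and looks up the keys `username`, `pwd` and `ca_certs`, which suggests the file holds a Python dict literal. The sketch below shows one way such a file could be written and parsed safely with `ast.literal_eval`; the values are placeholders, not real credentials, and the temporary file path is only for illustration.

```python
import ast
import os
import tempfile

# Placeholder values only; the keys match those read in the cell above.
example = {
    'username': 'elastic',
    'pwd': 'changeme',
    'ca_certs': '/path/to/http_ca.crt',
}

with tempfile.TemporaryDirectory() as tmp:
    file_name = os.path.join(tmp, 'credentials.txt')
    # Write the dict as a Python literal, then parse it back.
    with open(file_name, 'w') as f:
        f.write(repr(example))
    with open(file_name) as f:
        loaded = ast.literal_eval(f.read())

print(sorted(loaded))

# literal_eval accepts literals only, so executable code in the file is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Keeping credentials in a separate file that is excluded from version control avoids hard-coding them in the notebook.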

In [20]:
# re-initialize DPR so that it uses the Elasticsearch document store
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model='facebook/dpr-question_encoder-single-nq-base',
    passage_embedding_model='facebook/dpr-ctx_encoder-single-nq-base',
    use_gpu=True,
    embed_title=True
)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.


In [21]:
document_store.update_embeddings(retriever=retriever)

Updating embeddings:   0%|          | 0/1204 [00:00<?, ? Docs/s]

Create embeddings:   0%|          | 0/1216 [00:00<?, ? Docs/s]

In [22]:
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

In [23]:
pipeline.run(query='Will there ever be born a boy who can swim as fast as a shark?')['answers']

Inferencing Samples: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.60 Batches/s]


[<Answer {'answer': '1653 the poet Zygmunt Laukowski', 'type': 'extractive', 'score': 0.03928980231285095, 'context': ' use of a crude form of a sea monster with a female upper body and holding a sword in its claws. In 1653 the poet Zygmunt Laukowski asks the question:', 'offsets_in_document': [{'start': 521, 'end': 552}], 'offsets_in_context': [{'start': 100, 'end': 131}], 'document_id': 'bf117db136b152b4b30ad6fb942b6aae', 'meta': {}}>,
 <Answer {'answer': 'juveniles are capable of reproduction before reaching the adult size and shape.', 'type': 'extractive', 'score': 0.0005937825189903378, 'context': 'ult form. In at least some species, juveniles are capable of reproduction before reaching the adult size and shape. The combination of hermaphroditism', 'offsets_in_document': [{'start': 912, 'end': 991}], 'offsets_in_context': [{'start': 36, 'end': 115}], 'document_id': 'e88147b5aea54c44d788c726b5555167', 'meta': {}}>,
 <Answer {'answer': '2009', 'type': 'extractive', 'score': 0.000492
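The scores here are much lower than for the computer science question earlier (about 0.04 for the top answer, against about 0.81), which is what we would expect for a question the indexed passages are unlikely to answer. One simple way to act on this is to discard answers below a score threshold. The sketch below assumes answers expose `answer` and `score` attributes; `confident_answers` and the 0.5 threshold are illustrative choices, not Haystack defaults.

```python
from types import SimpleNamespace


def confident_answers(answers, min_score=0.5):
    """Keep only answers whose score is at least `min_score`."""
    return [a for a in answers if a.score is not None and a.score >= min_score]


# Stand-ins for Haystack answers, using scores from the outputs above.
demo = [
    SimpleNamespace(answer='analysis of algorithms and computability theory',
                    score=0.8070326447486877),
    SimpleNamespace(answer='1653 the poet Zygmunt Laukowski',
                    score=0.03928980231285095),
    SimpleNamespace(answer='2009', score=None),  # score missing: dropped
]

kept = confident_answers(demo)
print([a.answer for a in kept])
```

An empty result can then be treated as "no answer found" instead of showing a low-confidence span to the user.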