## Installing Haystack

To start, let's install the latest release of Haystack with `pip`:

In [1]:
!pip install --upgrade --quiet pip
!pip install --quiet farm-haystack[colab,faiss]==1.17.2
print('pip install haystack complete.')

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.
chex 0.1.82 requires numpy>=1.25.0, but you have numpy 1.23.5 which is incompatible.
google-cloud-aiplatform 0.6.0a1 requires google-api-core[grpc]<2.0.0dev,>=1.22.2, but you have google-api-core 2.11.1 which is incompatible.
google-cloud-automl 1.0.1 requires google-api-core[grpc]<2.0.0dev,>=1.14.0, but you have google-api-core 2.11.1 which is incompatible.
ipython-sql 0.5.0 requires sqlalchemy>=2.0, but you have sqlalchemy 1.4.52 which is incompatible.
kfp 2.0.1 requires google-cloud-storage<3,>=2.2.1, but you have google-cloud-storage 1.44.0 which is incompatible.
pymc3 3.11.5 requires numpy<1.22.2,>=1.15.0, but you have numpy 1.23.5 which is incompatible.
pymc3 3.11.5 requires scipy<1.8.0,>=1.7.3, but 

In [2]:
# import logging

# logging.basicConfig(format='%(levelname)s - %(name)s -  %(message)s', level=logging.WARNING)
# logging.getLogger('haystack').setLevel(logging.INFO)

## Initializing the DocumentStore

We'll start creating our question answering system by initializing a DocumentStore. A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. In this tutorial, we're using the `FAISSDocumentStore`.

Let's initialize our DocumentStore.

In [3]:
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(embedding_dim=128, faiss_index_factory_str='Flat')
print('created our document store')

created our document store


> To learn more about the DocumentStore and the different types of external databases that we support, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

In [4]:
!apt-get install --quiet poppler-utils -y
print('pip install poppler utils complete.')
!sudo apt-get update
!sudo apt-get install --quiet poppler-utils -y
print('pip install poppler complete for real and for true')

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  libpoppler97 poppler-data
Suggested packages:
  ghostscript fonts-japanese-mincho | fonts-ipafont-mincho
  fonts-japanese-gothic | fonts-ipafont-gothic fonts-arphic-ukai
  fonts-arphic-uming fonts-nanum
The following NEW packages will be installed:
  libpoppler97 poppler-data poppler-utils
0 upgraded, 3 newly installed, 0 to remove and 40 not upgraded.
Need to get 2562 kB of archives.
After this operation, 16.8 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/main amd64 poppler-data all 0.4.9-2 [1475 kB]
Ign:2 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libpoppler97 amd64 0.86.1-0ubuntu1.3
Ign:3 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 poppler-utils amd64 0.86.1-0ubuntu1.3
Err:2 http://security.ubuntu.com/ubuntu focal-updates/main amd64 libpoppler97 amd64 0.86.1-0ubu

In [5]:
# Initialize a PDFToTextConverter to convert the PDF to text
from haystack.utils import convert_files_to_docs, fetch_archive_from_http, clean_wiki_text
from haystack.utils import fetch_archive_from_http, convert_files_to_docs
from haystack.nodes import PDFToTextConverter, PreProcessor, DensePassageRetriever
from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor
print('many imports complete.')

many imports complete.


In [6]:
#replace this paths with your input Data path
pdf_paths = ['/kaggle/input/book-for-qa/book.pdf']

# Convert the PDF documents to Haystack-compatible format
pdf_converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=['en'])
pdf_docs = []

for pdf_path in pdf_paths:
    doc = pdf_converter.convert(file_path=pdf_path, meta=None)[0]
    pdf_docs.append(doc)
print('converted {} documents'.format(len(pdf_docs)))

converted 1 documents


pdftotext version 0.86.1
Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC


In [7]:
# Initializing the PreProcessor to clean and split the text
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by='word',
    split_length=100,
    split_respect_sentence_boundary=True,
)
processed_docs = preprocessor.process(pdf_docs)
print('document processing done.')

Preprocessing:   0%|          | 0/1 [00:00<?, ?docs/s]

document processing done.


## Initializing the Retriever

We use a `DensePassageRetriever` and we invoke `update_embeddings` to index the embeddings of documents in the `FAISSDocumentStore`.

In [8]:
# Initialize a DensePassageRetriever
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model='vblagoje/dpr-question_encoder-single-lfqa-wiki',
    passage_embedding_model='vblagoje/dpr-ctx_encoder-single-lfqa-wiki',
)
print('built retriever.')

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/495 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()


Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/494 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

built retriever.


In [9]:
document_store.write_documents(processed_docs)
document_store.update_embeddings(retriever)
print('updated embeddings.')

Writing Documents:   0%|          | 0/11 [00:00<?, ?it/s]

Updating Embedding:   0%|          | 0/11 [00:00<?, ? docs/s]

Create embeddings:   0%|          | 0/16 [00:00<?, ? Docs/s]

updated embeddings.


## Initializing the Generator

we now initalize our Generator.

Here we use a `Seq2SeqGenerator` with the [*vblagoje/bart_lfqa*](https://huggingface.co/vblagoje/bart_lfqa) model.

In [10]:
from haystack.nodes import Seq2SeqGenerator

generator = Seq2SeqGenerator(model_name_or_path='vblagoje/bart_lfqa')
print('built generator.')

Downloading tokenizer_config.json:   0%|          | 0.00/27.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

built generator.


## Initializing the Reader
we now initalize our Reader.

Here we use a FARMReader

In [11]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path='deepset/deberta-v3-large-squad2')
print('built the reader.')

Downloading config.json:   0%|          | 0.00/1.06k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/1.74G [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/392 [00:00<?, ?B/s]

Downloading spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/8.65M [00:00<?, ?B/s]

Downloading added_tokens.json:   0%|          | 0.00/18.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/156 [00:00<?, ?B/s]

built the reader.


## Initializing the Pipeline

With a Haystack `Pipeline` you can stick together your building blocks to a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `GenerativeQAPipeline` that combines a Retriever and a Generator to answer our questions.
You can learn more about `Pipelines` in the [docs](https://docs.haystack.deepset.ai/docs/pipelines).

In [12]:
from haystack.pipelines import GenerativeQAPipeline
from haystack.pipelines import ExtractiveQAPipeline

generative_pipeline = GenerativeQAPipeline(generator, retriever)
extractive_pipeline = ExtractiveQAPipeline(reader, retriever)

## Ask Questions

In [13]:
query = 'Where did the bulls live?'
result = extractive_pipeline.run(query=query, params={'Retriever': {'top_k': 5}, 'Reader': {'top_k': 1}})
print(result)

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

{'query': 'Where did the bulls live?', 'no_ans_gap': -7.116244792938232, 'answers': [<Answer {'answer': 'Where did the bulls live?\n___________________________________________________________________________\n____________________________________________________________________\nb. How many bulls were united', 'type': 'extractive', 'score': 0.023241646587848663, 'context': 'Where did the bulls live?\n___________________________________________________________________________\n____________________________________________________________________\nb. How many bulls were united', 'offsets_in_document': [{'start': 175, 'end': 375}], 'offsets_in_context': [{'start': 0, 'end': 200}], 'document_ids': ['296fde43f0dea180cde7938f42c9181d'], 'meta': {'_split_id': 2, 'vector_id': '0'}}>], 'documents': [<Document: {'content': 'If he fought with one bull, the other three would also join in. They were\nvery united. Then lion made a plan. He went to each one separately and told them that the other\nthre

In [14]:
print(result['answers'][0].answer)

Where did the bulls live?
___________________________________________________________________________
____________________________________________________________________
b. How many bulls were united


In [15]:
piperun=generative_pipeline.run(
    query='Where did the bulls live?', params={'Retriever': {'top_k': 5}} #output varies with different top_k values
)
for answer in piperun['answers']:
    print(answer.answer)

I'm not sure if this is what you're looking for, but I'll give it a shot. The bulls were domesticated by the Romans. They were bred for their meat, not for their ability to fight.


In [16]:
piperun=generative_pipeline.run(
    query='What are Pointers in C++?', params={'Retriever': {'top_k': 5}}
)
piperun['answers'][0].answer

'Pointers in C++ are a special type of variable called a pointer. A pointer is a variable that can be used to describe a function. For example, if you have a function that returns a value, you can use a pointer to tell the function to return a value that is equal to the value of the pointer.'