# Modelling

## References

- [Extractive Question Answering](https://haystack.deepset.ai/tutorials/01_basic_qa_pipeline)
- [Generative Question Answering](https://haystack.deepset.ai/tutorials/07_rag_generator)
- [Open-Domain QA on Tables](https://haystack.deepset.ai/tutorials/15_tableqa)

## Imports

In [1]:
from pathlib import Path

In [2]:
import sys 
sys.path.append('..')

In [3]:
from utils.data import pdf_to_text_and_tables, preprocess_text_documents



## Data Prep

## Extractive Question Answering

### Initializing the DocumentStore

In [4]:
from haystack.document_stores import InMemoryDocumentStore

In [5]:
# DocumentStore stores the Documents that the question answering system uses to find answers to your questions
document_store = InMemoryDocumentStore(use_bm25=True)

INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0
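
`use_bm25=True` asks the in-memory store to keep a BM25 index for keyword retrieval. As a rough illustration of what a BM25 score measures, here is a minimal pure-Python sketch of the Okapi BM25 formula. The function name `bm25_scores`, the whitespace tokenization, and the defaults `k1=1.5` and `b=0.75` are assumptions made for this example; the implementation Haystack uses may tokenize and parameterize differently.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF (the "+ 1" inside the log keeps it non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: long documents need more matches to score high.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (illustrative strings, not the notebook's data).
docs = [
    "stockholders of record of our Class C stock",
    "cost of revenues increased during the quarter",
    "marketable securities and money market funds",
]
scores = bm25_scores("Class C stockholders", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Documents that share no term with the query score zero, which is why keyword retrieval can miss passages that phrase the same fact differently.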


### Preparing Documents

In [6]:
DATA_DIR = Path("../data/")

In [10]:
%%time
text_list, table_list = pdf_to_text_and_tables(DATA_DIR)

CPU times: total: 6.03 s
Wall time: 8min 22s


In [11]:
len(text_list), len(table_list)

(12, 854)

In [12]:
%%time
processed_text_list = preprocess_text_documents(text_list)

Preprocessing: 100%|█████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 12.13docs/s]

CPU times: total: 984 ms
Wall time: 995 ms





In [13]:
len(processed_text_list)

4124
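
The jump from 12 source documents to 4124 entries suggests that `preprocess_text_documents` splits each document into smaller passages. Its implementation lives in `utils.data` and is not shown here, so the following is only a sketch of one common approach, word-window splitting with overlap. `split_into_passages`, `window`, and `overlap` are hypothetical names, not the helper's actual parameters.

```python
def split_into_passages(text, window=100, overlap=20):
    """Split `text` into passages of at most `window` words that overlap by `overlap` words."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    words = text.split()
    step = window - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this window already reached the end of the text
    return passages


# Synthetic 250-word text, used only to show the windowing.
sample = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(sample, window=100, overlap=20)
print(len(passages))
```

Shorter passages keep each candidate small enough for a reader model, and the overlap reduces the chance that an answer is cut in half at a boundary.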

In [14]:
# write documents into document store
document_store.write_documents(processed_text_list, duplicate_documents='skip')

INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'fa4c5afcc6aa500f113179bd3a30165e' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'fa4c5afcc6aa500f113179bd3a30165e' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '14eb724bc56eb875073aa2b77bc95d91' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '9bfe67cf1a03426007557bd2d75cf35e' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '86fbacce2f3ed0cd6f8ee1d68042f7ee' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'a809989dd800380367695af4dd38659b' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '342db4ad8a6c21f09739a25d973f08f5'
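
With `duplicate_documents='skip'`, a document whose id already exists in the index is left out rather than raising an error, which is what the log lines above report. The sketch below mimics that behaviour with a plain dictionary, assuming ids are derived from a hash of the content. `content_id` and `write_skipping_duplicates` are illustrative names, and the MD5 hash is a stand-in, not Haystack's actual id scheme.

```python
import hashlib


def content_id(text):
    """Derive a deterministic id from the document text (stand-in hash)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def write_skipping_duplicates(store, texts):
    """Add texts to `store` (id -> text); return how many were skipped as duplicates."""
    skipped = 0
    for text in texts:
        doc_id = content_id(text)
        if doc_id in store:
            skipped += 1  # same id already indexed: keep the existing entry
            continue
        store[doc_id] = text
    return skipped


store = {}
skipped = write_skipping_duplicates(
    store, ["Table of Contents", "Alphabet Inc.", "Table of Contents"]
)
print(len(store), skipped)
```

This is consistent with the counts reported in this notebook: 4124 passages are written, while the embedding-update logs report 3338 documents.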

### Initializing the Retriever

In [15]:
from haystack.nodes import BM25Retriever
from haystack.nodes.retriever import EmbeddingRetriever

In [18]:
retriever = BM25Retriever(document_store=document_store)
# retriever = EmbeddingRetriever(document_store=document_store, embedding_model="deepset/all-mpnet-base-v2-table")

In [17]:
# Only needed for the EmbeddingRetriever (BM25 needs no embeddings):
# compute embeddings for the documents in the DocumentStore
# document_store.update_embeddings(retriever=retriever)

INFO - haystack.document_stores.memory -  Updating embeddings for 0 docs ...
Updating Embedding:   0%|                                                                  | 0/3338 [00:00<?, ? docs/s]
Batches:   0%|                                                                                 | 0/105 [00:00<?, ?it/s]
Batches:   1%|▋                                                                      | 1/105 [00:59<1:43:36, 59.78s/it]
Batches:   2%|█▎                                                                     | 2/105 [01:59<1:42:59, 59.99s/it]
Batches:   3%|██                                                                     | 3/105 [02:46<1:34:27, 55.57s/it]
Updating Embedding:   0%|                                                                  | 0/3338 [02:47<?, ? docs/s]


KeyboardInterrupt: 

### Initializing the Reader

In [19]:
from haystack.nodes import FARMReader

In [20]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0
INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0
INFO - haystack.modeling.model.language_model -   * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
INFO - haystack.modeling.model.language_model -  Auto-detected model language: english
INFO - haystack.modeling.model.language_model -  Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0


### Creating the Retriever-Reader Pipeline

In [21]:
from haystack.pipelines import ExtractiveQAPipeline

In [22]:
pipe = ExtractiveQAPipeline(reader, retriever)

### Asking a Question

In [23]:
from haystack.utils import print_answers

In [32]:
prediction = pipe.run(
    query="What is the number of Class C stockholders as of 2022?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 5}
    }
)

Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.70s/ Batches]
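
The `prediction` returned by `pipe.run` is a dictionary whose `"answers"` entry holds the answer objects, so results can also be filtered in code. The helper below is a sketch: `confident_answers` and the `min_score` threshold are illustrative, and the `SimpleNamespace` objects are stand-ins with invented values for objects that expose `answer` and `score` attributes, as Haystack's `Answer` does.

```python
from types import SimpleNamespace


def confident_answers(prediction, min_score=0.5):
    """Return (answer, score) pairs at or above `min_score`, best first."""
    kept = [
        (ans.answer, ans.score)
        for ans in prediction["answers"]
        if ans.score is not None and ans.score >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Stand-in prediction with invented values, shaped like a pipeline result.
fake_prediction = {
    "answers": [
        SimpleNamespace(answer="alpha", score=0.42),
        SimpleNamespace(answer="beta", score=0.91),
        SimpleNamespace(answer="gamma", score=0.77),
    ]
}
filtered = confident_answers(fake_prediction, min_score=0.5)
print(filtered)
```

With the real pipeline output, the call would be `confident_answers(prediction, min_score=0.5)`; a suitable threshold depends on the reader model and has to be chosen empirically.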


In [33]:
print_answers(
    prediction,
    details="minimum"  # Choose from `minimum`, `medium`, and `all`
)

'Query: What is the number of Class C stockholders as of 2022?'
'Answers:'
[   {   'answer': '1,657',
        'context': 'rs of Record\n'
                   'As of December 31, 2022, there were approximately 6,670 '
                   'and 1,657 stockholders of record of our Class A stock\n'
                   'and Class C stock, respecti'},
    {   'answer': '315,639,479',
        'context': '44,576,938 shares of\n'
                   'the registrant’s Class B common stock outstanding, and '
                   '315,639,479 shares of the registrant’s Class C capital '
                   'stock outstanding.\n'},
    {   'answer': 'Total Number of',
        'context': ' December 31, 2022:\n'
                   'Period\n'
                   'Total Number of\n'
                   'Class A Shares\n'
                   'Purchased\n'
                   'Total Number of\n'
                   'Class C Shares\n'
                   'Purchased\n'
                   'Average Price\n'
              

## Open-Domain QA on Tables

### Imports

In [6]:
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever
from haystack.nodes import TableReader
from haystack import Pipeline
from haystack.nodes import FARMReader, RouteDocuments, JoinAnswers
from haystack.utils import print_answers

### Initializing the DocumentStore

In [7]:
# DocumentStore stores the Documents that the question answering system uses to find answers to your questions
document_store = InMemoryDocumentStore(use_bm25=True)

INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0


### Data Prep

In [8]:
DATA_DIR = Path("../data")

In [9]:
%%time
text_list, table_list = pdf_to_text_and_tables(DATA_DIR)

CPU times: total: 6.09 s
Wall time: 8min 22s


In [10]:
len(text_list), len(table_list)

(12, 854)

In [11]:
%%time
processed_text_list = preprocess_text_documents(text_list)

Preprocessing: 100%|█████████████████████████████████████████████████████████████████| 12/12 [00:01<00:00, 11.92docs/s]

CPU times: total: 984 ms
Wall time: 1.02 s





In [12]:
len(processed_text_list)

4124

### Write text and tables to DocumentStore

In [11]:
document_store.delete_documents()

In [12]:
document_store.write_documents(processed_text_list, duplicate_documents='skip')

INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'fa4c5afcc6aa500f113179bd3a30165e' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'fa4c5afcc6aa500f113179bd3a30165e' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '14eb724bc56eb875073aa2b77bc95d91' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '9bfe67cf1a03426007557bd2d75cf35e' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '86fbacce2f3ed0cd6f8ee1d68042f7ee' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id 'a809989dd800380367695af4dd38659b' already exists in index 'document'
INFO - haystack.document_stores.base -  Duplicate Documents: Document with id '342db4ad8a6c21f09739a25d973f08f5'

In [13]:
document_store.write_documents(table_list)

Updating BM25 representation...: 100%|███████████████████████████████████████| 4192/4192 [00:00<00:00, 22412.14 docs/s]


### Initialize Retriever

In [15]:
retriever = BM25Retriever(document_store=document_store)

### Initialize text and table readers

In [20]:
text_reader = FARMReader("deepset/roberta-base-squad2", use_gpu=False)

INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0
INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0
INFO - haystack.modeling.model.language_model -   * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
INFO - haystack.modeling.model.language_model -  Auto-detected model language: english
INFO - haystack.modeling.model.language_model -  Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0


In [21]:
# reader = TableReader(model_name_or_path="google/tapas-base-finetuned-wtq", use_gpu=False)

# In order to get meaningful scores from the TableReader, use "deepset/tapas-large-nq-hn-reader" or
# "deepset/tapas-large-nq-reader" as TableReader models. The disadvantage of these models is, however,
# that they are not capable of doing aggregations over multiple table cells.
table_reader = TableReader("deepset/tapas-large-nq-hn-reader", use_gpu=False)

INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0
Downloading (…)lve/main/config.json: 100%|████████████████████████████████████████| 1.48k/1.48k [00:00<00:00, 87.2kB/s]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████| 1.35G/1.35G [01:55<00:00, 11.7MB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████████████████████████████████████| 232k/232k [00:00<00:00, 13.6MB/s]
Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████| 154/154 [00:00<00:00, 22.0kB/s]
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████| 558/558 [00:00<00:00, 93.0kB/s]


### Other Nodes

`RouteDocuments` splits the list of Documents returned by the retriever into two lists: one containing only Documents of type "text" and one containing only Documents of type "table".

In [22]:
route_documents = RouteDocuments()
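
As a rough illustration of that split, the sketch below partitions a mixed list by content type. Plain dictionaries with a `content_type` key stand in for Haystack `Document` objects, and `route_by_content_type` is a hypothetical helper, not the node's implementation.

```python
def route_by_content_type(documents):
    """Partition documents into (text_docs, table_docs) by their content type."""
    text_docs = [doc for doc in documents if doc["content_type"] == "text"]
    table_docs = [doc for doc in documents if doc["content_type"] == "table"]
    return text_docs, table_docs


# Stand-in retrieval result with invented ids.
retrieved = [
    {"id": "a", "content_type": "text"},
    {"id": "b", "content_type": "table"},
    {"id": "c", "content_type": "text"},
]
text_docs, table_docs = route_by_content_type(retrieved)
print([d["id"] for d in text_docs], [d["id"] for d in table_docs])
```

In the pipeline defined in this section, the text list corresponds to `RouteDocuments.output_1` and the table list to `RouteDocuments.output_2`, which is why the two readers are wired to different outputs.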

`JoinAnswers` takes the Answers produced by two different Readers (here `FARMReader` and `TableReader`) and merges them into a single list of Answers.

In [23]:
join_answers = JoinAnswers()
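
A minimal sketch of such a merge: concatenate both answer lists and order them by score. Tuples of `(answer, score)` stand in for `Answer` objects, `join_by_score` is a hypothetical name, and the values are invented. Whether scores from two different reader models are directly comparable is a separate question that this sketch does not address.

```python
def join_by_score(*answer_lists, top_k=None):
    """Concatenate answer lists and sort by score, highest first."""
    merged = [ans for answers in answer_lists for ans in answers]
    merged.sort(key=lambda ans: ans[1], reverse=True)
    return merged if top_k is None else merged[:top_k]


# Stand-in reader outputs with invented scores.
text_answers = [("from text A", 0.80), ("from text B", 0.35)]
table_answers = [("from table A", 0.60)]
joined = join_by_score(text_answers, table_answers, top_k=2)
print(joined)
```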

### Pipeline

In [24]:
text_table_qa_pipeline = Pipeline()

In [25]:
text_table_qa_pipeline.add_node(component=retriever, name="BM25Retriever", inputs=["Query"])
text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["BM25Retriever"])
text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])

![](https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/img/table-qa-pipeline.png?raw=true)

### Prediction

In [36]:
prediction = text_table_qa_pipeline.run(
    query="How much marketable securities does Google have as of December 31, 2020?"
)

Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.28s/ Batches]


In [37]:
print_answers(prediction, details="minimum")

('Query: How much marketable securities does Google have as of December 31, '
 '2020?')
'Answers:'
[   {   'answer': 'Cash and\n'
                  'Cash\n'
                  'Equivalents\n'
                  'Marketable\n'
                  'Securities\n'
                  '(unaudited)\n'
                  'Money market funds\n'
                  '$\n'
                  'Marketable equity securities(1)(2)\n'
                  '\n'
                  'Mutual funds\n'
                  '\n'
                  'Total\n'
                  '$',
        'context': 'ities\n'
                   'Cash and\n'
                   'Cash\n'
                   'Equivalents\n'
                   'Marketable\n'
                   'Securities\n'
                   '(unaudited)\n'
                   'Money market funds\n'
                   '$\n'
                   'Marketable equity securities(1)(2)\n'
                   '\n'
                   'Mutual funds\n'
                   '\n'
                   

In [40]:
prediction = text_table_qa_pipeline.run(
    query="How much is the total cost of revenues in 2022?"
)

Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.56s/ Batches]


In [41]:
print_answers(prediction, details="minimum")

'Query: How much is the total cost of revenues in 2022?'
'Answers:'
[   {   'answer': '$3.2 billion',
        'context': '021 to the three months\n'
                   'ended September 30, 2022 due to an increase in other cost '
                   'of revenues and TAC of $3.2 billion and $328 million,\n'
                   'respectively. '},
    {   'answer': '$\n'
                  '$\n'
                  'Total cost of revenues as a percentage of revenues\n'
                  'Table of Contents\n'
                  'Alphabet Inc.\x0c'
                  'Cost of revenues increased $5.5 billion from the three '
                  'months ended March 31, 2021 to the three months\n'
                  'ended March 31, 2022. The increase was due to increases in '
                  'other cost of revenues and TAC of $3.2 billion and\n'
                  '$2.3 billion',
        'context': '$\n'
                   '$\n'
                   'Total cost of revenues as a percentage of revenues\n'


## Generative QA

### Data Prep

In [4]:
DATA_DIR = Path("../data")

In [5]:
%%time
text_list, table_list = pdf_to_text_and_tables(DATA_DIR)

CPU times: total: 5.58 s
Wall time: 7min 39s


In [6]:
len(text_list), len(table_list)

(12, 854)

In [7]:
%%time
processed_text_list = preprocess_text_documents(text_list)

Preprocessing: 100%|█████████████████████████████████████████████████████████████████| 12/12 [00:01<00:00, 11.21docs/s]

CPU times: total: 1.03 s
Wall time: 1.08 s





In [8]:
len(processed_text_list)

4124

### FAISSDocumentStore, DensePassageRetriever and RAGenerator

In [9]:
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import RAGenerator, DensePassageRetriever

In [10]:
# Initialize FAISS document store.
# Set `return_embedding` to `True` so the generator does not have to re-embed the retrieved documents
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", return_embedding=True)
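
The `"Flat"` index factory string selects an exhaustive index: every query vector is compared with every stored vector, so results are exact but search time grows linearly with the number of documents. The sketch below shows the idea in pure Python using dot-product similarity. `flat_search` and the toy 3-dimensional vectors are illustrative assumptions; FAISS performs the same kind of comparison with optimized native code.

```python
def flat_search(query_vec, doc_vecs, top_k=2):
    """Exhaustively score every stored vector by dot product; return top_k (index, score)."""
    scored = [
        (idx, sum(q * d for q, d in zip(query_vec, vec)))
        for idx, vec in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; real DPR embeddings have hundreds of dimensions.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
hits = flat_search([1.0, 0.2, 0.0], doc_vecs, top_k=2)
print(hits)
```

For a few thousand passages an exhaustive scan is usually fast enough; approximate index types trade some accuracy for speed on much larger collections.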

In [11]:
# Initialize the DPR retriever, which encodes documents and questions and retrieves the closest documents
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=False,
    embed_title=True,
)

INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0
INFO - haystack.modeling.model.language_model -  Auto-detected model language: english
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.
INFO - haystack.modeling.model.language_model -  Auto-detected model language: english


In [12]:
# Initialize RAG Generator
generator = RAGenerator(
    model_name_or_path="facebook/rag-token-nq",
    use_gpu=False,
    top_k=1,
    max_length=200,
    min_length=2,
    embed_title=True,
    num_beams=2,
)

INFO - haystack.modeling.utils -  Using devices: CPU - Number of GPUs: 0
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.
The tokenizer class you load from thi

### Write Documents

In [13]:
# Delete existing documents in documents store
document_store.delete_documents()

# Write documents to document store
document_store.write_documents(processed_text_list)

# Add documents embeddings to index
document_store.update_embeddings(retriever=retriever)

Writing Documents: 10000it [00:10, 949.02it/s]                                                                         
INFO - haystack.document_stores.faiss -  Updating embeddings for 3338 docs...
Updating Embedding:   0%|                                                                  | 0/3338 [00:00<?, ? docs/s]
Create embeddings:   0%|                                                                   | 0/3344 [00:00<?, ? Docs/s]
Create embeddings:   0%|▎                                                         | 16/3344 [00:14<50:46,  1.09 Docs/s]
Create embeddings:   1%|▌                                                         | 32/3344 [00:28<49:34,  1.11 Docs/s]
Create embeddings:   1%|▊                                                         | 48/3344 [00:44<50:38,  1.08 Docs/s]
Create embeddings:   2%|█                                                         | 64/3344 [00:59<51:09,  1.07 Docs/s]
Create embeddings:   2%|█▍                                         

### Prediction

In [14]:
# Combine the retriever and generator with the ready-made GenerativeQAPipeline class
from haystack.pipelines import GenerativeQAPipeline
from haystack.utils import print_answers

In [15]:
pipe = GenerativeQAPipeline(generator=generator, retriever=retriever)

In [27]:
query = "How much was the net income for the fiscal year of 2020?"

In [28]:
res = pipe.run(query=query, params={"Generator": {"top_k": 1}, "Retriever": {"top_k": 5}})
print_answers(res, details="minimum")

'Query: How much was the net income for the fiscal year of 2020?'
'Answers:'
[{'answer': ' $1.06 billion'}]


In [35]:
res['answers'][0].meta

{'doc_scores': [0.6939489475567304,
  0.6933349249940063,
  0.693076143595989,
  0.692747936738709,
  0.6920426482928368],
 'content': ['Sales and marketing expenses increased $1.4 billion from the six months ended June 30, 2020 to the six\nmonths ended June 30, 2021, primarily driven by an increase in compensation expenses of $864 million and\nadvertising and promotional activities of $672 million. The increase in compensation expenses was largely due to an\n11% increase in headcount. The increase in advertising and promotional activities was largely affected by reduced\nspending in the prior year comparable period as a result of COVID-19.\n',
  'The increase was primarily driven by an increase in advertising and\npromotional activities of $708 million and an increase in compensation expenses of $515 million, largely resulting\nfrom a 19% increase in headcount.\nSales and marketing expenses increased $4.1 billion from the nine months ended September 30, 2021 to the\nnine months ended 