<a href="https://colab.research.google.com/github/martindevoto/machine-learning-notebooks-personal/blob/main/Intro_Haystack_pt_13.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question Generation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial13_Question_generation.ipynb)

This bare-bones tutorial shows what the QuestionGenerator node and its pipelines can do: they automatically
generate questions that the question generation model thinks a given document can answer.

In [None]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]


In [None]:
# Imports needed to run this notebook

from pprint import pprint
from tqdm import tqdm
from haystack.nodes import QuestionGenerator, ElasticsearchRetriever, FARMReader
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.pipelines import (
    QuestionGenerationPipeline,
    RetrieverQuestionGenerationPipeline,
    QuestionAnswerGenerationPipeline,
)

from haystack.utils import launch_es, print_questions

Let's start an Elasticsearch instance with one of the options below:

In [None]:
# Option 1: Start Elasticsearch service via Docker
launch_es()



In [None]:
# Option 2: In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT,
    preexec_fn=lambda: os.setuid(1) # as daemon
)

# wait until ES has started
! sleep 30
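
A fixed `sleep 30` can be too short on a slow machine and needlessly long on a fast one. Below is a minimal sketch of a polling alternative, assuming Elasticsearch answers on its default HTTP port at `http://localhost:9200`. The helper name `wait_for_http`, its default timeouts, and the `probe` hook are illustrative choices, not part of Haystack.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=60.0, interval=2.0, probe=None):
    """Poll `url` until it answers or `timeout` seconds have passed.

    `probe` is a hypothetical hook (a callable that takes the URL and
    returns True or False) so the loop can be tried without a server.
    """
    def default_probe(target):
        # Any connection problem simply counts as "not ready yet".
        try:
            with urllib.request.urlopen(target, timeout=interval) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            return False

    check = probe or default_probe
    deadline = time.monotonic() + timeout
    while True:
        if check(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_http("http://localhost:9200")` in place of the `sleep` returns `True` as soon as the server responds and `False` if it never does within the timeout.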

Let's initialize some core components: a few sample documents, the document store, and the question generator.

In [None]:
text1 = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace."
text2 = "Princess Arya Stark is the third child and second daughter of Lord Eddard Stark and his wife, Lady Catelyn Stark. She is the sister of the incumbent Westerosi monarchs, Sansa, Queen in the North, and Brandon, King of the Andals and the First Men. After narrowly escaping the persecution of House Stark by House Lannister, Arya is trained as a Faceless Man at the House of Black and White in Braavos, using her abilities to avenge her family. Upon her return to Westeros, she exacts retribution for the Red Wedding by exterminating the Frey male line."
text3 = "Dry Cleaning are an English post-punk band who formed in South London in 2018.[3] The band is composed of vocalist Florence Shaw, guitarist Tom Dowse, bassist Lewis Maynard and drummer Nick Buxton. They are noted for their use of spoken word primarily in lieu of sung vocals, as well as their unconventional lyrics. Their musical stylings have been compared to Wire, Magazine and Joy Division.[4] The band released their debut single, 'Magic of Meghan' in 2019. Shaw wrote the song after going through a break-up and moving out of her former partner's apartment the same day that Meghan Markle and Prince Harry announced they were engaged.[5] This was followed by the release of two EPs that year: Sweet Princess in August and Boundary Road Snacks and Drinks in October. The band were included as part of the NME 100 of 2020,[6] as well as DIY magazine's Class of 2020.[7] The band signed to 4AD in late 2020 and shared a new single, 'Scratchcard Lanyard'.[8] In February 2021, the band shared details of their debut studio album, New Long Leg. They also shared the single 'Strong Feelings'.[9] The album, which was produced by John Parish, was released on 2 April 2021.[10]"

docs = [{'content': text1}, {'content': text2}, {'content': text3}]
         
# Initialize document store and write in the documents
document_store = ElasticsearchDocumentStore()
document_store.write_documents(docs)

# Initialize Question Generator
question_generator = QuestionGenerator()


## Question Generation Pipeline

The most basic version of a question generator pipeline takes a document as input and outputs generated questions
that the document can answer.

In [None]:
question_generation_pipeline = QuestionGenerationPipeline(question_generator)
for idx, document in enumerate(document_store):
    print(f"\n * Generating questions for document {idx}: {document.content[:100]}...\n")
    result = question_generation_pipeline.run(documents=[document])
    print_questions(result)


 * Generating questions for document 0: Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Ros...


Generated questions:
 -  Who created Python?
 -  When was Python first released?
 -  What is Python's design philosophy?

 * Generating questions for document 1: Princess Arya Stark is the third child and second daughter of Lord Eddard Stark and his wife, Lady C...


Generated questions:
 -  Who is the third child and second daughter of Lord Eddard Stark and his wife, Lady Catelyn Stark?
 -  Princess Arya Stark is the sister of what Westerosi monarchs?
 -  What is Sansa, Queen in the North, and Brandon, King of the Andals?
 -  What is Arya trained as?
 -  Where is the House of Black and White located?
 -  What is the name of the first men?
 -  What is the name of the line that Frey exterminates?
 -  Where does the Red Wedding take place?

 * Generating questions for document 2: Dry Cleaning are an English post-punk band who formed in South L

## Retriever Question Generation Pipeline

This pipeline takes a query as input. It retrieves relevant documents and then generates questions based on these.

In [None]:
retriever = ElasticsearchRetriever(document_store=document_store)
rgq_pipeline = RetrieverQuestionGenerationPipeline(retriever,
                                                   question_generator)

print("\n * Generating questions for documents matching the query 'Arya Stark'\n")
result = rgq_pipeline.run(query='Arya Stark')
print_questions(result)


 * Generating questions for documents matching the query 'Arya Stark'


Generated questions:
 -  Who is the third child and second daughter of Lord Eddard Stark and his wife, Lady Catelyn Stark?
 -  Princess Arya Stark is the sister of what Westerosi monarchs?
 -  What is Sansa, Queen in the North, and Brandon, King of the Andals?
 -  What is Arya trained as?
 -  Where is the House of Black and White located?
 -  What is the name of the first men?
 -  What is the name of the line that Frey exterminates?
 -  Where does the Red Wedding take place?
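
To make the two-step data flow concrete, here is a toy sketch in plain Python. It is not how Haystack implements the pipeline: `toy_retrieve` ranks documents by naive word overlap with the query, and `toy_generate_questions` is a hypothetical stand-in for the question generation model.

```python
def toy_retrieve(query, documents, top_k=1):
    """Rank documents by how many query words they contain (toy scoring)."""
    query_words = set(query.lower().split())

    def overlap(doc):
        return len(query_words & set(doc["content"].lower().split()))

    ranked = sorted(documents, key=overlap, reverse=True)
    # Keep only the best top_k documents that share at least one word.
    return [doc for doc in ranked[:top_k] if overlap(doc) > 0]


def toy_generate_questions(document):
    """Hypothetical stand-in for a question generation model."""
    first_words = " ".join(document["content"].split()[:3])
    return [f"What does the passage starting with '{first_words}' say?"]


def retrieve_then_generate(query, documents, top_k=1):
    """Step 1: retrieve matching documents. Step 2: generate questions per hit."""
    results = []
    for doc in toy_retrieve(query, documents, top_k=top_k):
        results.append({"document": doc, "questions": toy_generate_questions(doc)})
    return results


toy_docs = [
    {"content": "Python is an interpreted, high-level, general-purpose programming language."},
    {"content": "Princess Arya Stark is the third child and second daughter of Lord Eddard Stark."},
]
hits = retrieve_then_generate("Arya Stark", toy_docs)
```

Only the Arya Stark document reaches the generation step, which mirrors why the pipeline output above contains questions about a single document.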


## Question Answer Generation Pipeline

This pipeline takes a document as input, generates questions about it, and attempts to answer these questions using
a Reader model.

In [None]:
reader = FARMReader('deepset/roberta-base-squad2')
qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)
for idx, document in enumerate(tqdm(document_store)):
    print(f"\n * Generating questions and answers for documents {idx}: {document.content[:100]}...\n")
    result = qag_pipeline.run(documents=[document])
    print_questions(result)



 * Generating questions and answers for documents 0: Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Ros...






Generated pairs:
 - Q: Who created Python?
      A: Guido van Rossum
 - Q: When was Python first released?
      A: 1991
 - Q: What is Python's design philosophy?
      A: emphasizes code readability

 * Generating questions and answers for documents 1: Princess Arya Stark is the third child and second daughter of Lord Eddard Stark and his wife, Lady C...






Generated pairs:
 - Q: Who is the third child and second daughter of Lord Eddard Stark and his wife, Lady Catelyn Stark?
      A: Princess Arya Stark
 - Q: Princess Arya Stark is the sister of what Westerosi monarchs?
      A: Sansa, Queen in the North, and Brandon, King of the Andals and the First Men
 - Q: What is Sansa, Queen in the North, and Brandon, King of the Andals?
      A: sister
 - Q: What is Arya trained as?
      A: Faceless Man
 - Q: Where is the House of Black and White located?
      A: Braavos
 - Q: What is the name of the first men?
      A: Brandon
 - Q: What is the name of the line that Frey exterminates?
      A: Frey male line
 - Q: Where does the Red Wedding take place?
      A: Westeros

 * Generating questions and answers for documents 2: Dry Cleaning are an English post-punk band who formed in South London in 2018.[3] The band is compos...






Generated pairs:
 - Q: What is the name of the English post-punk band that formed in South London in 2018?
      A: Dry Cleaning
      A: Boundary Road Snacks and Drinks
 - Q: Who is the vocalist of Dry Cleaning?
      A: Florence Shaw
      A: ghan Markle and Prince Harry announced they were engaged.[5] This was followed by the release of two EPs that year: Sweet Princess in August and Boundary Road Snacks and Drinks in October. The band were included as part of the NME 100 of 2020,[6] as well as DIY magazine's Class of 2020.[7] The band signed to 4AD in late 2020 and shared a new single, 'Scratchcard Lanyard'.[8] In February 2021, the band shared details of their debut studio album, New Long Leg. They also shared the single 'Strong Feelings'.[9] The album, which was produced by John Parish
 - Q: Where did Dry Cleaning form?
      A: South London
      A: Boundary Road
 - Q: What does the band use instead of sung vocals?
      A: spoken word
      A: 2020,[6] as well as DIY magazine'




## Translated Question Answer Generation Pipeline
Trained models for question answer generation are rare in languages other than English. Haystack
provides a workaround: the TranslationWrapperPipeline machine-translates a pipeline's inputs and outputs.
The following example generates German questions and answers for a German text
document using an English model for question answer generation.

In [None]:
# Fill the document store with a German document
text1 = "Python ist eine interpretierte Hochsprachenprogrammiersprache für allgemeine Zwecke. Sie wurde von Guido van Rossum entwickelt und 1991 erstmals veröffentlicht. Die Design-Philosophie von Python legt den Schwerpunkt auf die Lesbarkeit des Codes und die Verwendung von viel Leerraum (Whitespace)."
docs = [{'content': text1}]
document_store.delete_documents()
document_store.write_documents(docs)

# Load machine translation models
from haystack.nodes import TransformersTranslator

in_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-de-en")
out_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-en-de")

# Wrap the previously defined QuestionAnswerGenerationPipeline
from haystack.pipelines import TranslationWrapperPipeline

pipeline_with_translation = TranslationWrapperPipeline(
    input_translator=in_translator, output_translator=out_translator,
    pipeline=qag_pipeline
)

for idx, document in enumerate(tqdm(document_store)):
    print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
    result = pipeline_with_translation.run(documents=[document])
    print_questions(result)





 * Generating questions and answers for document 0: Python ist eine interpretierte Hochsprachenprogrammiersprache für allgemeine Zwecke. Sie wurde von G...






Generated pairs:
 - Q:Wer hat Python entwickelt?
      A: Guido van Rossum
 - Q:Wann wurde Python zum ersten Mal veröffentlicht?
      A: 1991
 - Q:Worauf konzentriert sich Pythons Designphilosophie?
      A: die Lesbarkeit des Codes
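
The wrapping idea behind this section can be sketched in a few lines of plain Python. This illustrates the pattern only and is not the `TranslationWrapperPipeline` implementation: the lookup-table translators and `english_qa_generation` are hypothetical stand-ins for the translation models and the English pipeline.

```python
def make_table_translator(table):
    """Return a toy translator backed by a lookup table (no real translation)."""
    def translate(text):
        # Unknown text passes through unchanged.
        return table.get(text, text)
    return translate


def with_translation(translate_in, pipeline, translate_out):
    """Wrap `pipeline` so its input is translated before it runs and its output after."""
    def wrapped(document):
        english_document = translate_in(document)
        pairs = pipeline(english_document)
        return [(translate_out(q), translate_out(a)) for q, a in pairs]
    return wrapped


def english_qa_generation(document):
    """Hypothetical English-only pipeline returning (question, answer) pairs."""
    if "Guido van Rossum" in document:
        return [("Who created Python?", "Guido van Rossum")]
    return []


de_to_en = make_table_translator(
    {"Python wurde von Guido van Rossum entwickelt.": "Python was created by Guido van Rossum."}
)
en_to_de = make_table_translator({"Who created Python?": "Wer hat Python entwickelt?"})

german_qa_generation = with_translation(de_to_en, english_qa_generation, en_to_de)
pairs = german_qa_generation("Python wurde von Guido van Rossum entwickelt.")
```

The wrapped function accepts and returns German text while the inner pipeline only ever sees English, which is the same division of labour the translated pipeline above relies on.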



