<a href="https://colab.research.google.com/github/plaban1981/Haystack_NLP/blob/main/Haystack%E2%80%99s_new_Question_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What Is Question Generation?

* Question generation is the process of automatically creating questions from a text. 
* It is typically concerned with generating fact-seeking questions, such as those involving people, locations, or dates.

* A successful question generation model has to account for both syntax and semantics. 
* It’s not enough for the model to generate grammatically correct questions — it also needs to grasp which parts of a text are central and make for interesting questions. 

* **Question generation** is closely related to **natural language understanding** and **text summarization**.

https://www.deepset.ai/blog/generate-questions-automatically-for-faster-annotation

## When Is Question Generation Useful?

* **Learning Material:**
A classic use case for question generation is the automated creation of learning material. In this scenario, the question generator comes up with questions that are used to check whether the learner comprehended a given text.

* **Data Exploration:**
Question generation can also drive automatic question suggestion. The Google search engine, for example, uses question suggestions to help users navigate and explore data.

* **Automated Annotation of Question-Answer Datasets:** Annotating datasets is an arduous and expensive process that often requires human annotators to manually label every data point. At the same time, representative and high-quality datasets are central to machine learning: not only do they allow you to build models that can generalize, but they’re also needed to evaluate your system’s performance.

  Using the Question Generator to extract questions from documents can save annotators a lot of time. With the questions generated automatically, annotators only need to check question quality: they keep the relevant, well-formed questions and discard or correct the irrelevant or ill-formed ones.

## How Does Haystack’s Question Generator Work?

* The Question Generator uses a Transformer-based language model trained on a large number of question-answer pairs. 
* The default model is a version of the T5 model, which excels at creating questions from general-domain texts. 
* If your documents come from a specialized domain (say, law or medicine) or are in a language other than English, you can plug in a different model.
* The Hugging Face model hub is a good place for finding models for a variety of use cases.

In [2]:
# Install needed libraries

!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git

# If you run this notebook on Google Colab, you might need to
# restart the runtime after installing haystack.


In [25]:
from haystack.nodes import QuestionGenerator
question_generator = QuestionGenerator(model_name_or_path="valhalla/t5-base-e2e-qg")

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1


## Create a text snippet from which to generate questions:


In [3]:
text = "On November 8 2016 India announced demonetization. As per the announcement all Rs 500 Rs 1000 high value notes became invalid by midnight. There were three main economic objectives of demonetisation—fighting black money fake notes and creating a cashless economy by pushing digital transactions. Demonetisation is an act of cancelling the legal tender status of a currency unit in circulation. Anticipating positive changes on the liquidity structure as a whole nations often adopt Demonetisation policy as a measure to counterbalance the current economic condition. Countries across the globe have used Demonetisation at some or the other point to control situations such as inflation and to boost economy."

## Run the question generator on our text:

In [4]:
question_generator.generate(text)


[' When did India announce demonetization?',
 ' How many main economic objectives of demonetisation were there?',
 ' What is an act of cancelling?',
 ' When did all Rs 500 Rs 1000 high value notes become invalid?',
 ' Demonetisation is an act of cancelling what status of a currency unit in circulation?',
 ' Countries across the globe have used what type of policy to counterbalance the current economic condition?',
 ' What is the current economic condition?',
 ' Countries across the globe have used Demonetisation to control situations such as inflation and to boost economy?']
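
The raw output above has leading spaces, and some items are statements with a question mark attached rather than true interrogatives. A minimal sketch of a pre-filter that an annotator might run before manual review is shown here. The helper name `triage_questions`, the word list, and the `min_words` threshold are illustrative assumptions and not part of Haystack; the heuristic only sorts questions into "looks fine" and "needs a human look".

```python
# Hypothetical helper (not a Haystack API): rough triage of generated questions.
QUESTION_WORDS = {"what", "when", "where", "who", "whom", "whose", "which", "why", "how"}


def triage_questions(raw_questions, min_words=4):
    """Split questions into (kept, flagged) using simple surface heuristics."""
    kept, flagged, seen = [], [], set()
    for raw in raw_questions:
        question = raw.strip()
        key = question.lower()
        if not question or key in seen:
            continue  # skip blanks and exact duplicates
        seen.add(key)
        words = key.split()
        looks_ok = (
            question.endswith("?")
            and words[0] in QUESTION_WORDS
            and len(words) >= min_words
        )
        (kept if looks_ok else flagged).append(question)
    return kept, flagged


# A few of the questions generated above, copied verbatim
generated = [
    " When did India announce demonetization?",
    " What is an act of cancelling?",
    " Demonetisation is an act of cancelling what status of a currency unit in circulation?",
    " Countries across the globe have used Demonetisation to control situations such as inflation and to boost economy?",
]
kept, flagged = triage_questions(generated)
print("kept:", kept)
print("flagged for review:", flagged)
```

Surface checks like these cannot judge meaning: a vague question such as "What is an act of cancelling?" still passes, so the flagged list is a starting point for review rather than a verdict.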

## How to Generate Questions in Haystack?

Haystack has **two main pipeline classes** that get a question generation system up and running quickly.

* **Question Generation**

  - QuestionGenerationPipeline is a wrapper for the Question Generator. It has the same functionality as the Question Generator itself: **documents in, questions out**.

* **Question and Answer Generation**

  - QuestionAnswerGenerationPipeline combines the **Question Generator with the Reader**.
  - In this pipeline, the Question Generator first creates questions in the same way as in QuestionGenerationPipeline.
  - The **Reader** then returns the answers, as well as the contexts from which the answers were extracted.
  - You can use this pipeline to create entire synthetic question answering datasets in a fully automated process.
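
To make the synthetic dataset idea concrete, here is a minimal sketch that packs question and answer pairs for one context into the widely used SQuAD-style JSON layout. The function name `to_squad`, the input tuple layout, and the `title` value are assumptions made for illustration. The answer offset is recomputed with `str.find`, so pairs whose answer text does not literally occur in the context are skipped.

```python
import json
import uuid


def to_squad(title, context, qa_pairs):
    """Build a SQuAD-style dict from (question, answer_text) pairs for one context."""
    qas = []
    for question, answer_text in qa_pairs:
        start = context.find(answer_text)
        if start == -1:
            continue  # answer is not a literal span of the context
        qas.append({
            "id": uuid.uuid4().hex,
            "question": question.strip(),
            "answers": [{"text": answer_text, "answer_start": start}],
            "is_impossible": False,
        })
    paragraph = {"context": context, "qas": qas}
    return {"version": "v2.0", "data": [{"title": title, "paragraphs": [paragraph]}]}


context = ("Recently, NITI Aayog has suggested a material recovery facility (MRF) "
           "model for sustainable management of urban plastic waste.")
pairs = [
    (" What has NITI Aayog suggested for sustainable management of urban plastic waste?",
     "material recovery facility (MRF) model"),
    ("An example whose answer is missing?", "text that is not in the context"),
]
dataset = to_squad("Plastic Management", context, pairs)
print(json.dumps(dataset, indent=2))
```

Recomputing the offset keeps the record internally consistent even when the context handed to the exporter is only an excerpt of the original document.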

In [3]:
from google.colab import files
files.upload()

Saving QuestionGeneration_Edukemy_Newsletters.xlsx to QuestionGeneration_Edukemy_Newsletters.xlsx



In [8]:
import pandas as pd
df = pd.read_excel("/content/QuestionGeneration_Edukemy_Newsletters.xlsx")

In [7]:
df.head()

| Unnamed: 0 | S.No. | Document | Subject | Content | MCQ generated | Ans | Distractors | BCQ_Generated |
|---|---|---|---|---|---|---|---|---|
| 0 | | Final Newsletter - 28th Oct, 2021 | Plastic Management | Recently, NITI Aayog has suggested a material recovery facility (MRF) model ... | No questions | | | |
| 1 | 2.0 | Final Newsletter - 28th Oct, 2021 | Plastic Management | What is Plastic Waste and what is its status in India Plastic waste is ‘the ... | 1) 'What is the accumulation of plastic objects e.g. plastic waste?\n2) 'Wha... | 1) 'plastic waste'\n2) 'india'\n3) 'accumulation'\n4) 'grams' | 1) ['Landfills', 'Nuclear Waste', 'Environmental Cost']\n2) '['Bangladesh', ... | |
| 2 | 3.0 | Final Newsletter - 28th Oct, 2021 | Plastic Management | Plastic waste is ‘the accumulation of plastic objects e.g. plastic bottles a... | 1) 'What country is responsible for plastic pollution?'\n2) | 1) 'india' | 1) ['Bangladesh', 'Indonesia', 'China'] | |
| 3 | 4.0 | Final Newsletter - 28th Oct, 2021 | Plastic Management | Encouraging responsible consumption The fight against COVID19 has shown us t... | 1) 'What does the MRF and PWM Rules 2021 create?\n2) 'What is the best way t... | 1) 'framework'\n2) 'choices'\n3) 'alternatives'\n4) 'waste' | 1) ['Specific Implementation', 'Abstractions', 'Paradigm']\n2) ['Decisions',... | |
| 4 | 5.0 | Final Newsletter - 28th Oct, 2021 | Subsidized Electricity | Promises of free or heavily subsidized electricity announced in the election... | 1) 'What is the long term impact of government paying for loss of?'\n2) 'Wha... | 1) 'discoms'\n2) 'tariffs'\n3) 'stakeholders' | 1) ['Delhi Government', 'Jaitley', 'Ambani']\n2) ['Tarrifs', 'Price Controls... | |


In [9]:
context = df['Content'].tolist()

In [6]:
context

['Recently, NITI Aayog has suggested a material recovery facility (MRF) model for sustainable management of urban plastic waste.\nAbout Material Recovery Facility (MRF)\n•\tWhat is it?: A materials recovery facility, also known as a materials reclamation facility or recycling facility is a specialized plant that receives, separates, and prepares recyclable materials for marketing to end-user manufacturers.\n•\tHow will it be funded?: The model will initially funded by private players, supported by urban local bodies, and operated by service providers including local organizations and waste management agencies. \n•\tHow does it connect with the current PWM system?: The model not only focuses on managing plastic waste but also on the social inclusion and protection of waste pickers by improving their socio-economic conditions. \nMRF envisages to address burgeoning issue of Plastic Waste, but before we delve into PWM and its significance, it is important to understand what plastic waste i

In [10]:
len(context)

10

In [14]:
# Wrap each text in the dictionary format expected by the document store
texts = [{"content": text} for text in context]

In [15]:
texts

[{'content': 'Recently, NITI Aayog has suggested a material recovery facility (MRF) model for sustainable management of urban plastic waste.\nAbout Material Recovery Facility (MRF)\n•\tWhat is it?: A materials recovery facility, also known as a materials reclamation facility or recycling facility is a specialized plant that receives, separates, and prepares recyclable materials for marketing to end-user manufacturers.\n•\tHow will it be funded?: The model will initially funded by private players, supported by urban local bodies, and operated by service providers including local organizations and waste management agencies. \n•\tHow does it connect with the current PWM system?: The model not only focuses on managing plastic waste but also on the social inclusion and protection of waste pickers by improving their socio-economic conditions. \nMRF envisages to address burgeoning issue of Plastic Waste, but before we delve into PWM and its significance, it is important to understand what pla
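
Spreadsheet columns often contain empty cells, which pandas reads as `NaN` (a float). A minimal sketch of a more defensive conversion follows, assuming the same `{"content": ...}` dictionary layout used above. The helper name `build_documents` and the keys placed under `meta` are illustrative choices, not something the notebook itself uses.

```python
import math


def build_documents(contents, subjects=None):
    """Turn raw cell values into document dicts, skipping empty or non-text cells."""
    documents = []
    for i, value in enumerate(contents):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # empty spreadsheet cell
        text = str(value).strip()
        if not text:
            continue  # whitespace-only cell
        doc = {"content": text}
        if subjects is not None:
            # Optional metadata; the key names here are assumptions
            doc["meta"] = {"subject": subjects[i], "row": i}
        documents.append(doc)
    return documents


sample = ["Recently, NITI Aayog has suggested ...", float("nan"), "   ", None, "Plastic waste is ..."]
docs = build_documents(sample, subjects=["Plastic Management"] * 5)
print(len(docs), "documents kept")
```

Keeping the original row number in the metadata makes it easier to trace a generated question back to the spreadsheet row it came from.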

## Question Answer Generation Model

In [15]:
!pip install grpcio==1.34.1



## Let's start an Elasticsearch instance with one of the options below:

In [None]:
# Option 1: Start Elasticsearch service via Docker
from haystack.utils import launch_es

launch_es()

In [4]:
# Option 2: In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
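
A fixed `sleep 30` can be too short on a slow machine and wasteful on a fast one. One hedged alternative is to poll until the server answers. The sketch below assumes Elasticsearch listens on `http://localhost:9200` (its default HTTP port); `es_is_up` and `wait_until_ready` are hypothetical helper names, not Haystack functions.

```python
import time
import urllib.error
import urllib.request


def es_is_up(url="http://localhost:9200", timeout=2):
    """Return True if an HTTP GET to the (assumed) Elasticsearch URL succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or HTTP error


def wait_until_ready(probe=es_is_up, max_wait=60.0, interval=2.0):
    """Call `probe` repeatedly until it returns True or `max_wait` seconds pass."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_ready()` in place of the fixed sleep returns as soon as the server responds and reports `False` if it never comes up within `max_wait` seconds. Passing the probe as an argument keeps the waiting logic easy to test without a running server.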

## Initialize document store and write in the documents

In [16]:
from haystack.document_stores import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore()
document_store.write_documents(texts)

In [1]:
from haystack.nodes import QuestionGenerator, FARMReader

question_generator = QuestionGenerator()
reader = FARMReader("deepset/roberta-base-squad2")

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2


INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 3 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0    0 
INFO - haystack.modeling.infer -  /w\  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \  /'\


## Place the two nodes into the pipeline

In [2]:
from haystack.pipelines import QuestionAnswerGenerationPipeline

# Chain the question generator and the reader into a single pipeline
qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)



## Run the pipeline on our small text corpus and store the results in a list:

In [17]:
results = []
# Generate question-answer pairs for one document at a time
for doc in document_store:
    results.append(qag_pipeline.run(documents=[doc]))


## Print the automatically generated question-answer pairs:

In [35]:
for doc in results:
    for qa_pair in doc["results"]:
        # Each entry holds a generated question and the reader's candidate answers
        print(f' Question : {qa_pair["query"]}')
        print(f' Answer : {qa_pair["answers"]}')
        print(len(qa_pair["answers"]))

 Question :  What has NITI Aayog suggested for sustainable management of urban plastic waste?
 Answer : [<Answer {'answer': 'material recovery facility (MRF) model', 'type': 'extractive', 'score': 0.5171805918216705, 'context': 'Recently, NITI Aayog has suggested a material recovery facility (MRF) model for sustainable management of urban plastic waste.\nAbout Material Recovery', 'offsets_in_document': [{'start': 37, 'end': 75}], 'offsets_in_context': [{'start': 37, 'end': 75}], 'document_id': '5486d88f4f95f3fe89cae48518c4507b', 'meta': {}}>, <Answer {'answer': 'Plastic Waste Management Rules, 2021', 'type': 'extractive', 'score': 0.12562621012330055, 'context': 'wards eliminating the issue of dry waste altogether. How Plastic Waste Management Rules, 2021 aims to solve this problem? \n•\tBanning Single Use Plasti', 'offsets_in_document': [{'start': 3469, 'end': 3505}], 'offsets_in_context': [{'start': 57, 'end': 93}], 'document_id': '5486d88f4f95f3fe89cae48518c4507b', 'meta': {}}>, <An
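
Each question comes back with several candidate answers, each carrying a score, as the output above shows. A minimal sketch for keeping only the best answer per question, and dropping pairs below a confidence threshold, is given here. It assumes each answer exposes `answer` and `score` attributes, as the printed `Answer` objects suggest. The `SimpleAnswer` stand-in class, the `top_qa_pairs` name, and the `0.3` threshold are illustrative choices, not Haystack defaults.

```python
from dataclasses import dataclass


@dataclass
class SimpleAnswer:
    """Stand-in for an answer object; only the two fields used below."""
    answer: str
    score: float


def top_qa_pairs(qa_results, min_score=0.3):
    """Return (question, answer, score) for the best answer of each question."""
    pairs = []
    for item in qa_results:
        answers = item["answers"]
        if not answers:
            continue  # the reader found no answer for this question
        best = max(answers, key=lambda a: a.score)
        if best.score >= min_score:
            pairs.append((item["query"].strip(), best.answer, round(best.score, 3)))
    return pairs


# Values copied from the first question-answer pair printed above
qa_results = [
    {
        "query": " What has NITI Aayog suggested for sustainable management of urban plastic waste?",
        "answers": [
            SimpleAnswer("material recovery facility (MRF) model", 0.5171805918216705),
            SimpleAnswer("Plastic Waste Management Rules, 2021", 0.12562621012330055),
        ],
    },
]
for question, answer, score in top_qa_pairs(qa_results):
    print(f"Q: {question}\nA: {answer} (score={score})")
```

The threshold trades dataset size against noise: a higher value keeps fewer pairs, which leaves annotators less to correct.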