<a href="https://colab.research.google.com/github/plaban1981/NLP_Question_Answer_Model/blob/main/Questgen_NLP_Library_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Questgen - An open source NLP library for Question generation algorithms.

* Question generation has a lot of use cases with the most prominent one being the ability to generate quick assessments from any given content. It would help school teachers in generating worksheets from any given chapter quickly and decrease their work burden during Covid-19.

* It uses state-of-the-art T5 transformer model from 
Hugging Face library.

* The currently supported question generation capabilities of the library are 
    * MCQs, 
    * Yes/No questions, 
    * FAQs, Paraphrasing, and 
    * Question Answering.


https://towardsdatascience.com/questgen-an-open-source-nlp-library-for-question-generation-algorithms-1e18067fcdc6

# Installations

In [1]:
!pip install git+https://github.com/ramsrigouthamg/Questgen.ai

Collecting git+https://github.com/ramsrigouthamg/Questgen.ai
  Cloning https://github.com/ramsrigouthamg/Questgen.ai to /tmp/pip-req-build-km1yccog
  Running command git clone -q https://github.com/ramsrigouthamg/Questgen.ai /tmp/pip-req-build-km1yccog
Collecting transformers==3.0.2
  Downloading transformers-3.0.2-py3-none-any.whl (769 kB)
[K     |████████████████████████████████| 769 kB 2.0 MB/s 
[?25hCollecting pytorch_lightning==0.8.1
  Downloading pytorch_lightning-0.8.1-py3-none-any.whl (293 kB)
[K     |████████████████████████████████| 293 kB 45.9 MB/s 
[?25hCollecting sense2vec==1.0.3
  Downloading sense2vec-1.0.3-py2.py3-none-any.whl (35 kB)
Collecting strsim==0.0.3
  Downloading strsim-0.0.3-py3-none-any.whl (42 kB)
[K     |████████████████████████████████| 42 kB 1.4 MB/s 
Collecting networkx==2.4.0
  Downloading networkx-2.4-py3-none-any.whl (1.6 MB)
[K     |████████████████████████████████| 1.6 MB 54.7 MB/s 
Collecting unidecode==1.1.1
  Downloading Unidecode-1.1.1-py

In [1]:
!pip install --quiet git+https://github.com/boudinfl/pke.git


  Building wheel for pke (setup.py) ... [?25l[?25hdone


## Download and extract zip of Sense2vec word vectors that are used for the generation of multiple choices.

In [2]:
!python -m nltk.downloader universal_tagset
!python -m spacy download en 

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
[K     |████████████████████████████████| 12.0 MB 2.0 MB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
[38;5;2m✔ Linking successful[0m
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


# Download Sense2vec wordvectors for generation of multiple choices

In [3]:
!wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz

--2021-09-15 15:55:05--  https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/50261113/52126080-0993-11ea-8190-8f0e295df22a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210915%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210915T155506Z&X-Amz-Expires=300&X-Amz-Signature=bfb0a872348910f3a8383e555d45067120b78dba8d735bf5ed9bb21823c5d34d&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=50261113&response-content-disposition=attachment%3B%20filename%3Ds2v_reddit_2015_md.tar.gz&response-content-type=application%2Foctet-stream [following]
--2021-09-15 15:55:06--  https://github-releases.githubusercontent.com/50261113/52126080-0993-11ea-8190-8f0e295df22a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=

* results with 2015 trained vectorsare better compared to reddit_2019

In [4]:
!tar -xvf  s2v_reddit_2015_md.tar.gz

./._s2v_old
./s2v_old/
./s2v_old/._freqs.json
./s2v_old/freqs.json
./s2v_old/._vectors
./s2v_old/vectors
./s2v_old/._cfg
./s2v_old/cfg
./s2v_old/._strings.json
./s2v_old/strings.json
./s2v_old/._key2row
./s2v_old/key2row


In [6]:
!ls s2v_old

cfg  freqs.json  key2row  strings.json	vectors


In [7]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Generate boolean (Yes/No) Questions

In [8]:
from pprint import pprint
from Questgen import main

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unz

* BoolQGen() ==> Generate boolean (Yes/No) Questions

In [9]:
qe= main.BoolQGen()

Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/892M [00:00<?, ?B/s]

In [None]:
payload = {
            "input_text": "Sachin Ramesh Tendulkar is a former international cricketer from India and a former captain of the Indian national team. He is widely regarded as one of the greatest batsmen in the history of cricket. He is the highest run scorer of all time in International cricket."
        }

In [None]:
output = qe.predict_boolq(payload)
pprint (output)

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


{'Boolean Questions': ['Is sachin ramesh tendulkar the highest run scorer in '
                       'cricket?',
                       'Is sachin ramesh tendulkar the highest run scorer in '
                       'cricket?',
                       'Is sachin ramesh tendulkar the greatest batsman in '
                       'cricket?'],
 'Count': 4,
 'Text': 'Sachin Ramesh Tendulkar is a former international cricketer from '
         'India and a former captain of the Indian national team. He is widely '
         'regarded as one of the greatest batsmen in the history of cricket. '
         'He is the highest run scorer of all time in International cricket.'}


In [None]:
output['Boolean Questions']

['Is sachin ramesh tendulkar the highest run scorer in cricket?',
 'Is sachin ramesh tendulkar the highest run scorer in cricket?',
 'Is sachin ramesh tendulkar the greatest batsman in cricket?']

In [None]:
output['Text']

'Sachin Ramesh Tendulkar is a former international cricketer from India and a former captain of the Indian national team. He is widely regarded as one of the greatest batsmen in the history of cricket. He is the highest run scorer of all time in International cricket.'

In [12]:
payload1 = {
            "input_text": "Starting from back in November, we’ve had a lot of really important studies that showed us that memory B cells and memory T cells were forming in response to natural infection,” says Gandhi. Studies are also showing, she says, that these memory cells will respond by producing antibodies to the variants at hand.91011Gandhi included a list of some 20 references on natural immunity to covid in a long Twitter thread supporting the durability of both vaccine and infection induced immunity.12 “I stopped adding papers to it in December because it was getting so long,” she tells The BMJ.But the studies kept coming. A National Institutes of Health (NIH) funded study from La Jolla Institute for Immunology found “durable immune responses” in 95% of the 200 participants up to eight months after infection.13 One of the largest studies to date, published in Science in February 2021, found that although antibodies declined over 8 months, memory B cells increased over time, and the half life of memory CD8+ and CD4+ T cells suggests a steady presence.9Real world data have also been supportive.14 Several studies (in Qatar,15 England,16 Israel,17 and the US18) have found infection rates at equally low levels among people who are fully vaccinated and those who have previously had covid-19. Cleveland Clinic surveyed its more than 50 000 employees to compare four groups based on history of SARS-CoV-2 infection and vaccination status.18 Not one of over 1300 unvaccinated employees who had been previously infected tested positive during the five months of the study. Researchers concluded that that cohort “are unlikely to benefit from covid-19 vaccination.” In Israel, researchers accessed a database of the entire population to compare the efficacy of vaccination with previous infection and found nearly identical numbers. “Our results question the need to vaccinate previously infected individuals, they concluded.17As covid cases surged in Israel this summer, the Ministry of Health reported the numbers by immunity status. Between 5 July and 3 August, just 1% of weekly new cases were in people who had previously had covid-19. Given that 6% of the population are previously infected and unvaccinated,these numbers look very low,” says Dvir Aran, a biomedical data scientist at the Technion–Israel Institute of Technology, who has been analysing Israeli data on vaccine effectiveness and provided weekly ministry reports to The BMJ. While Aran is cautious about drawing definitive conclusions, he acknowledged “the data suggest that the recovered have better protection than people who were vaccinated.But as the delta variant and rising case counts have the US on edge, renewed vaccination incentives and mandates apply regardless of infection history.8 To attend Harvard University or a Foo Fighters concert or enter indoor venues in San Francisco and New York City, you need proof of vaccination. The ire being directed at people who are unvaccinated is also indiscriminate—and emanating from America’s highest office. In a recent speech to federal intelligence employees who, along with all federal workers, will be required to get vaccinated or submit to regular testing, President Biden left no room for those questioning the public health necessity or personal benefit of vaccinating people who have had covid-19. We have a pandemic because of the unvaccinated.So, get vaccinated. If you haven’t, you’re not nearly as smart as I said you were."
        }

In [13]:
output_1 = qe.predict_boolq(payload1)
pprint (output_1)

Token indices sequence length is longer than the specified maximum sequence length for this model (764 > 512). Running this sequence through the model will result in indexing errors


{'Boolean Questions': ['Do you have to be vaccinated to get covid 19?',
                       'Do you have to be vaccinated to get covid-19?',
                       'Do you have to be vaccinated for covid 19?'],
 'Count': 4,
 'Text': 'Starting from back in November, we’ve had a lot of really important '
         'studies that showed us that memory B cells and memory T cells were '
         'forming in response to natural infection,” says Gandhi. Studies are '
         'also showing, she says, that these memory cells will respond by '
         'producing antibodies to the variants at hand.91011Gandhi included a '
         'list of some 20 references on natural immunity to covid in a long '
         'Twitter thread supporting the durability of both vaccine and '
         'infection induced immunity.12 “I stopped adding papers to it in '
         'December because it was getting so long,” she tells The BMJ.But the '
         'studies kept coming. A National Institutes of Health (NIH) fu

# Generate MCQ Questions

In [None]:
qg = main.QGen()

Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/892M [00:00<?, ?B/s]

In [None]:
output2 = qg.predict_mcq(payload1)
print(output2['questions'])
print(output2['statement'])

Running model for generation
 Sense2vec_distractors successful for word :  plateaus
 Sense2vec_distractors successful for word :  iberian peninsula
 Sense2vec_distractors successful for word :  heating
 Sense2vec_distractors successful for word :  origin
[{'question_statement': 'What are the plateaus that are underlain by unusually hot material?', 'question_type': 'MCQ', 'answer': 'plateaus', 'id': 1, 'options': ['Plateu', 'Strength Gains', 'Linear Progression'], 'options_algorithm': 'sense2vec', 'extra_options': ['Lifts', 'Higher Weights', 'Fat Loss', 'Training Volume', 'Muscle/Strength', 'Workouts'], 'context': 'Some plateaus, like the Colorado Plateau, the Ordos Plateau in northern China, or the East African Highlands, do not seem to be related to hot spots or to vigorous upwelling in the asthenosphere but appear to be underlain by unusually hot material. The reason for localized heating beneath such areas is poorly understood, and thus an explanation for the distribution of plateau

In [None]:
output2['questions']

[{'answer': 'plateaus',
  'context': 'Some plateaus, like the Colorado Plateau, the Ordos Plateau in northern China, or the East African Highlands, do not seem to be related to hot spots or to vigorous upwelling in the asthenosphere but appear to be underlain by unusually hot material. The reason for localized heating beneath such areas is poorly understood, and thus an explanation for the distribution of plateaus of that type is not known.There are some plateaus whose origin is not known. The reason for localized heating beneath such areas is poorly understood, and thus an explanation for the distribution of plateaus of that type is not known.There are some plateaus whose origin is not known.',
  'extra_options': ['Lifts',
   'Higher Weights',
   'Fat Loss',
   'Training Volume',
   'Muscle/Strength',
   'Workouts'],
  'id': 1,
  'options': ['Plateu', 'Strength Gains', 'Linear Progression'],
  'options_algorithm': 'sense2vec',
  'question_statement': 'What are the plateaus that are un

In [None]:
payload2 = {"input_text":"The First Battle of Panipat, on 21 April 1526, was fought between the invading forces of Babur and the Lodi dynasty. It took place in North India and marked the beginning of the Mughal Empire and the end of the Delhi Sultanate. This was one of the earliest battles involving gunpowder firearms and field artillery in the Indian subcontinent which were introduced by Mughals in this battle."}

#### True / False Questions

In [None]:
output3 = qe.predict_boolq(payload2)
pprint (output3)


{'Boolean Questions': ['Was the first battle of panipat fought between lodi '
                       'and babur?',
                       'Was the first battle of panipat fought by the mughals?',
                       'Was the first battle of panipat a battle?'],
 'Count': 4,
 'Text': 'The First Battle of Panipat, on 21 April 1526, was fought between '
         'the invading forces of Babur and the Lodi dynasty. It took place in '
         'North India and marked the beginning of the Mughal Empire and the '
         'end of the Delhi Sultanate. This was one of the earliest battles '
         'involving gunpowder firearms and field artillery in the Indian '
         'subcontinent which were introduced by Mughals in this battle.'}


#### MCQ questions

In [None]:
output3 = qg.predict_mcq(payload2)
pprint(output3)

Running model for generation
 Sense2vec_distractors successful for word :  first battle
 Sense2vec_distractors successful for word :  mughal empire
{'questions': [{'answer': 'first battle',
                'context': 'The First Battle of Panipat, on 21 April 1526, was '
                           'fought between the invading forces of Babur and '
                           'the Lodi dynasty.',
                'extra_options': [],
                'id': 1,
                'options': ['Second Battle'],
                'options_algorithm': 'sense2vec',
                'question_statement': 'What was the Battle of Panipat?',
                'question_type': 'MCQ'},
               {'answer': 'mughal empire',
                'context': 'It took place in North India and marked the '
                           'beginning of the Mughal Empire and the end of the '
                           'Delhi Sultanate.',
                'extra_options': ['Papal States'],
                'id': 2,
           

# Generate FAQs

In [None]:
output = qg.predict_shortq(payload)
pprint (output)

Running model for generation
{'questions': [{'Question': "What is Sachin Ramesh Tendulkar's career?", 'Answer': 'cricketer', 'id': 1, 'context': 'Sachin Ramesh Tendulkar is a former international cricketer from India and a former captain of the Indian national team.'}, {'Question': 'Where is Sachin Ramesh Tendulkar from?', 'Answer': 'india', 'id': 2, 'context': 'Sachin Ramesh Tendulkar is a former international cricketer from India and a former captain of the Indian national team.'}, {'Question': 'What is the best cricketer?', 'Answer': 'batsmen', 'id': 3, 'context': 'He is widely regarded as one of the greatest batsmen in the history of cricket.'}]}
{'questions': [{'Answer': 'cricketer',
                'Question': "What is Sachin Ramesh Tendulkar's career?",
                'context': 'Sachin Ramesh Tendulkar is a former international '
                           'cricketer from India and a former captain of the '
                           'Indian national team.',
                'i

In [None]:
output1 = qg.predict_shortq(payload1)
pprint (output1)

# Paraphrasing Questions

In [None]:
payload2 = {
    "input_text" : "What is Sachin Tendulkar profession?",
    "max_questions": 5
}

payload3 = {
    "input_text" : "What is machine Learning",
    "max_questions": 3
}

In [None]:
output = qg.paraphrase(payload3)
print ("Output :")
pprint (output)

0: ParaphrasedTarget: What is machine learning?
1: ParaphrasedTarget: What does machine learning mean?
2: paraphrasedTarget: What is machine learning?
Output :
{'Count': 3,
 'Paraphrased Questions': ['ParaphrasedTarget: What is machine learning?',
                           'ParaphrasedTarget: What does machine learning '
                           'mean?',
                           'paraphrasedTarget: What is machine learning?'],
 'Question': 'What is machine Learning'}


In [None]:
output = qg.paraphrase(payload2)
print ("Output :")
pprint (output)

0: ParaphrasedTarget: What is Sachin Tendulkar's profession?
1: ParaphrasedTarget: What is Sachin Tendulkar's career?
2: ParaphrasedTarget: What is Sachin Tendulkar's job?
3: ParaphrasedTarget: What is Sachin Tendulkar?
4: ParaphrasedTarget: What is Sachin Tendulkar's occupation?
Output :
{'Count': 5,
 'Paraphrased Questions': ["ParaphrasedTarget: What is Sachin Tendulkar's "
                           'profession?',
                           "ParaphrasedTarget: What is Sachin Tendulkar's "
                           'career?',
                           "ParaphrasedTarget: What is Sachin Tendulkar's job?",
                           'ParaphrasedTarget: What is Sachin Tendulkar?',
                           "ParaphrasedTarget: What is Sachin Tendulkar's "
                           'occupation?'],
 'Question': 'What is Sachin Tendulkar profession?'}


# Question Answering




## Simple Question Answering

In [None]:
answer = main.AnswerPredictor()

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1206.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=242068027.0, style=ProgressStyle(descri…




In [None]:
payload3 = {
    "input_text" : '''Sachin Ramesh Tendulkar is a former international cricketer from 
              India and a former captain of the Indian national team. He is widely regarded 
              as one of the greatest batsmen in the history of cricket. He is the highest
               run scorer of all time in International cricket.''',
    "input_question" : "Who is Sachin tendulkar ? "

}

In [None]:
answer.predict_answer(payload3)

'Sachin ramesh tendulkar is a former international cricketer from india and a former captain of the indian national team.'

## Boolean Question Answering

In [None]:
payload4 = {
    "input_text" : '''Sachin Ramesh Tendulkar is a former international cricketer from 
              India and a former captain of the Indian national team. He is widely regarded 
              as one of the greatest batsmen in the history of cricket. He is the highest
               run scorer of all time in International cricket.''',
    "input_question" : "Is Sachin tendulkar  a former cricketer? "

}

In [None]:
answer.predict_answer(payload4)

'Yes, sachin tendulkar is a former cricketer.'