# TyDi Question Generation: Inference example

This notebook shows how to use a pretrained multilingual PassageQG model to generate questions. Given a text snippet, spaCy identifies noun chunks (named entities), which become the answers, and an mT5 model generates a question from each answer and the text snippet.

## Dependencies

If not already done, make sure to install PrimeQA with notebooks extras before getting started.
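
Before running the cells below, it can help to confirm that the package is importable. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of PrimeQA.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Warn early instead of failing on the first import cell.
if not is_installed("primeqa"):
    print("PrimeQA not found; install it before running the cells below.")
```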

In [1]:
from primeqa.qg.models.qg_model import QGModel

2022-07-28 08:08:32 INFO: Downloading default packages for language: multilingual (multilingual)...
2022-07-28 08:08:32 INFO: File exists: /u/jaydesen/stanza_resources/multilingual/default.zip
2022-07-28 08:08:32 INFO: Finished downloading models and saved to /u/jaydesen/stanza_resources.


## Loading the pretrained model from Hugging Face

This model was trained with the PrimeQA library and uploaded to the Hugging Face Hub.

In [2]:
model_name = 'PrimeQA/mt5-base-tydi-question-generator'
passage_qg_model = QGModel(model_name, modality='passage')

Loaded NER model for  Arabic
Loaded NER model for  English
Loaded NER model for  Finnish
Loaded NER model for  Russian


## Sample instance

Passages should be passed as a `list` of `str`. Here we use one English and one Russian text to generate questions.

In [6]:
text_list = ["Sachin tendulkar was an Indian cricketer born in Mumbai. He had a record of 100 centuries in his international carrier. He scored nearly 350000 international runs",
            
"Симби́рская губе́рния (с 1924 года Ульяновская губерния)\xa0— административно-территориальная\
единица Российской империи, Российской республики и РСФСР, существовавшая в 1796—1928 годах.\
Губернский город\xa0— Симбирск (с 1924 года Ульяновск)"]

id_list = ["abcID123", "xyzID456"]

## Generate questions

The `generate_questions` function accepts two optional control arguments. An `id_list` can also be passed, as in the call below, so that each generated question carries the ID of its passage.

#### Controls:
- `num_questions_per_instance`: Number of questions to generate per passage (default=5)
- `answers_list`: Answers that the generated questions should have. It should be a list of lists,
        where each inner list corresponds to a passage in `text_list`. (default=[])

When `answers_list` is not provided, named entity recognition is used to sample answers from each passage.
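
Since `answers_list` has to line up with `text_list`, a small sanity check before calling the model can catch shape mistakes. The sketch below is not part of PrimeQA: `check_answers_list` is a hypothetical helper, and the idea that each answer should appear verbatim in its passage is an assumption, so those cases are only reported, not rejected.

```python
from typing import List, Tuple


def check_answers_list(
    text_list: List[str], answers_list: List[List[str]]
) -> List[Tuple[int, str]]:
    """Validate the list-of-lists shape.

    Returns (passage_index, answer) pairs whose answer does not occur
    verbatim in the corresponding passage (assumption: answers are spans).
    """
    if len(answers_list) != len(text_list):
        raise ValueError("answers_list needs one inner list per passage in text_list")
    missing = []
    for i, (text, answers) in enumerate(zip(text_list, answers_list)):
        # A bare string would be iterated character by character, so reject it.
        if isinstance(answers, str) or not all(isinstance(a, str) for a in answers):
            raise TypeError(f"answers_list[{i}] must be a list of str")
        missing.extend((i, a) for a in answers if a not in text)
    return missing


texts = ["Sachin tendulkar was an Indian cricketer born in Mumbai."]
answers = [["Mumbai", "Indian"]]
print(check_answers_list(texts, answers))  # prints []
```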

In [7]:
passage_qg_model.generate_questions(text_list, 
                    num_questions_per_instance = 2, id_list=id_list)

Input language en
Input language ru


[{'id': 'abcID123',
  'context': 'Sachin tendulkar was an Indian cricketer born in Mumbai. He had a record of 100 centuries in his international carrier. He scored nearly 350000 international runs',
  'question': 'What country did Sachin tendulkar play?',
  'answer': 'Indian'},
 {'id': 'abcID123',
  'context': 'Sachin tendulkar was an Indian cricketer born in Mumbai. He had a record of 100 centuries in his international carrier. He scored nearly 350000 international runs',
  'question': 'How many runs did Sachin tendulkar have?',
  'answer': '100 centuries'},
 {'id': 'xyzID456',
  'context': 'Симби́рская губе́рния (с 1924 года Ульяновская губерния)\xa0— административно-территориальнаяединица Российской империи, Российской республики и РСФСР, существовавшая в 1796—1928 годах.Губернский город\xa0— Симбирск (с 1924 года Ульяновск)',
  'question': 'Как называется Губернский город?',
  'answer': 'Симбирск'},
 {'id': 'xyzID456',
  'context': 'Симби́рская губе́рния (с 1924 года Ульяновская гу
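
The output is a flat list with one dictionary per generated question, so questions from the same passage share an `id`. A minimal sketch of grouping them by passage follows; `questions_by_id` is a hypothetical helper, and the sample records copy the `id`, `question` and `answer` values shown above with the `context` abbreviated.

```python
from collections import defaultdict
from typing import Dict, List


def questions_by_id(results: List[dict]) -> Dict[str, List[dict]]:
    """Group question/answer pairs under the id of the passage they came from."""
    grouped = defaultdict(list)
    for item in results:
        grouped[item["id"]].append(
            {"question": item["question"], "answer": item["answer"]}
        )
    return dict(grouped)


# Records in the same shape as the output above (context abbreviated here).
sample = [
    {"id": "abcID123", "context": "...",
     "question": "What country did Sachin tendulkar play?", "answer": "Indian"},
    {"id": "abcID123", "context": "...",
     "question": "How many runs did Sachin tendulkar have?", "answer": "100 centuries"},
    {"id": "xyzID456", "context": "...",
     "question": "Как называется Губернский город?", "answer": "Симбирск"},
]

for passage_id, pairs in questions_by_id(sample).items():
    print(passage_id, len(pairs))
```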

The answer sampler currently supports only Arabic, English, Finnish and Russian. For the other languages in the TyDi dataset,
the answers must be provided explicitly, as in the Bengali example below.

In [None]:
text_list = ["শচীন টেন্ডুলকারকে ক্রিকেট ইতিহাসের অন্যতম সেরা ব্যাটসম্যান হিসেবে গণ্য করা হয়।"]
answers_list = [["শচীন টেন্ডুলকার"]]
passage_qg_model.generate_questions(text_list, 
                                answers_list = answers_list)
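
The language restriction can also be checked up front. The following is a minimal sketch, assuming the caller already knows each passage's ISO 639-1 language code; `SAMPLER_LANGUAGES` and `needs_explicit_answers` are hypothetical names, not part of PrimeQA.

```python
# ISO 639-1 codes for the four languages the answer sampler supports:
# Arabic, English, Finnish and Russian.
SAMPLER_LANGUAGES = {"ar", "en", "fi", "ru"}


def needs_explicit_answers(lang_code: str) -> bool:
    """Return True when answers_list must be supplied for this language."""
    return lang_code.lower() not in SAMPLER_LANGUAGES


print(needs_explicit_answers("bn"))  # Bengali: prints True
print(needs_explicit_answers("ru"))  # Russian: prints False
```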