# Background
Recent advances in machine learning allow computers to perform tasks that were previously believed to require human intuition.
Could generating a quiz to gauge reading comprehension be such a task?

In this notebook I present an approach to generate simple fact questions based on text data.
Example questions:
* Carl XVI Gustaf was born in _year_
* In 1976 Carl XVI Gustaf married _person_

Ideally the questions would be reformatted as:
* When was Carl XVI Gustaf born?
* Who did Carl XVI Gustaf marry in 1976?

# Method
Raw text data is pre-processed and a dependency graph is constructed for each sentence.
[Here](https://ai.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html) is a good resource from Google introducing the concept of dependency parsing.
Dependency graphs provide a structured representation of the grammatical relations between the words in a sentence, from which we will extract relations to turn into questions.

A question is generated from a relation by withholding part of that relation.
For example:
"Carl was born in Sweden" could be turned into the question "Carl was born in _location_"

Dependency graphs for a text can be created using a custom dependency parser, or a pre-trained one.
In this notebook I will use the [Stanford Parser](https://nlp.stanford.edu/software/lex-parser.shtml#Sample).
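
Before setting up the parser, the withholding idea described above can be illustrated on a relation that is written out by hand. This is only a minimal sketch: the `Relation` tuple, its field names and the `make_question` helper are hypothetical names introduced for illustration, and no parser is involved yet.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """A hand-written relation; the field names are assumptions for this sketch."""
    subject: str
    predicate: str
    obj: str
    obj_type: str  # kind of information the object carries, e.g. "location"


def make_question(relation):
    """Withhold the object of the relation and return (question, answer)."""
    question = f"{relation.subject} {relation.predicate} _{relation.obj_type}_"
    return question, relation.obj


relation = Relation("Carl", "was born in", "Sweden", "location")
question, answer = make_question(relation)
print(question)  # Carl was born in _location_
print(answer)    # Sweden
```

Keeping the withheld part together with the question makes it possible to check a reader's answer later.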

In [10]:
from nltk.parse.stanford import StanfordDependencyParser

# Directory containing the unpacked Stanford Parser distribution (adjust to your local setup).
path = '../stanford_parser'
parser_dir = path + '/stanford-parser-full-2018-02-27/stanford-parser-full-2018-02-27'

# The parser needs both the main jar and the jar with the pre-trained English models.
path_to_jar = parser_dir + '/stanford-parser.jar'
path_to_models_jar = parser_dir + '/stanford-english-corenlp-2018-02-27-models.jar'
#path_to_models_jar = path + '/stanford-english-corenlp-2018-02-27-models.jar'

dependency_parser = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models_jar)

In [None]:
# raw_parse returns an iterator over DependencyGraph objects for the sentence
result = dependency_parser.raw_parse('Carl XVI Gustaf was born in 1946.')

Example output:

TODO
1. Build logic to truncate dependency tree to core components that will be used in questions
2. Build logic to withhold relevant parts of tree in order to generate a question
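
One possible direction for the two TODO items is sketched below. It assumes the parse is available as triples of the form `((head_word, head_tag), relation, (dependent_word, dependent_tag))`, which is the shape yielded by NLTK's `DependencyGraph.triples()`. The triples here are written by hand for illustration and are not actual parser output; the set of "core" relations and the helper names `core_triples` and `withhold` are likewise assumptions.

```python
# Relations assumed to carry the core of a fact; this choice is hypothetical.
CORE_RELATIONS = {"nsubj", "nsubjpass", "dobj", "nmod"}

# Hand-written, illustrative triples for 'Carl XVI Gustaf was born in 1946.'
# (NOT real parser output; labels and tags may differ in practice).
triples = [
    (("born", "VBN"), "nsubjpass", ("Gustaf", "NNP")),
    (("Gustaf", "NNP"), "compound", ("Carl", "NNP")),
    (("Gustaf", "NNP"), "compound", ("XVI", "NNP")),
    (("born", "VBN"), "auxpass", ("was", "VBD")),
    (("born", "VBN"), "nmod", ("1946", "CD")),
    (("1946", "CD"), "case", ("in", "IN")),
]


def core_triples(triples, keep=CORE_RELATIONS):
    """Step 1: keep only triples whose relation is considered core."""
    return [triple for triple in triples if triple[1] in keep]


def withhold(triples, relation, placeholder="_blank_"):
    """Step 2: blank out the dependent of the first triple with the given relation.

    Returns the masked triples and the withheld word (None if nothing matched).
    """
    masked, answer = [], None
    for head, rel, (word, tag) in triples:
        if rel == relation and answer is None:
            answer = word
            word = placeholder
        masked.append((head, rel, (word, tag)))
    return masked, answer


core = core_triples(triples)
masked, answer = withhold(core, "nmod")
print(masked)
print(answer)
```

Turning the masked triples back into a readable sentence, including multi-word names such as "Carl XVI Gustaf", is left open here.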

# Limitations
The model in its current form has several limitations.
I divide these into two categories: inherent limitations and extendable limitations.
The inherent limitations are hard to address without a major overhaul of the model, whereas the extendable limitations could more easily be addressed with extensions to the model.

Inherent limitations:
* The model only asks fact-based questions, which limits the depth of understanding its questions can gauge.
* Questions are generated from predefined rules; the model does not learn anything about asking questions.
* The model requires dependency parsed input. A parser with sufficient performance in the relevant language is necessary.

Extendable limitations:
* The model's understanding of a text is limited to direct semantic relationships within a single sentence.
    * It does not understand that two mentions of the same named entity represent the same entity.
    * It does not attempt to understand co-references such as "**Carl** was born 1946, **he** was crowned 1973".
* The question format is limited to fill-in-the-blank questions.

# Extending the Model
The model in its current state is very limited in what questions it can generate.
There is some low-hanging fruit that could extend its understanding of a text and allow slightly more complex questions to be generated.
## Improving the Question Format
A simple method of improving the question format would be to create rules based on what kind of information is withheld.
If the location in a relationship like "Entity Verb Location" is withheld then the question might be formulated like "Where did Entity Verb?" etc.

This method is limited in that it requires human-engineered rules.
However, human-engineered rules might be a good way to ensure the quality of questions.
If this is the case then this method might not add many additional restrictions.
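
The rule-based reformulation described in this section could look roughly like the sketch below. Only the "location" rule comes from the text above; the other templates, the `QUESTION_TEMPLATES` table and the `format_question` helper are hypothetical, and the verb is assumed to be supplied in its base form.

```python
# Map the type of the withheld information to a question template.
# Only the "location" rule is taken from the text; the rest are illustrative guesses.
QUESTION_TEMPLATES = {
    "location": "Where did {entity} {verb}?",
    "person": "Who did {entity} {verb}?",
    "year": "When did {entity} {verb}?",
}


def format_question(entity, verb, withheld_type):
    """Return a wh-question if a rule exists, else a fill-in-the-blank fallback."""
    template = QUESTION_TEMPLATES.get(withheld_type)
    if template is None:
        # No rule for this type: fall back to the basic question format.
        return f"{entity} {verb} _{withheld_type}_"
    return template.format(entity=entity, verb=verb)


print(format_question("Carl", "live", "location"))  # Where did Carl live?
print(format_question("Carl", "paid", "amount"))    # Carl paid _amount_
```

Falling back to the fill-in-the-blank format means that a missing rule degrades the wording of a question rather than dropping it.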
## Connecting Named Entities
If a named entity appears in multiple statements, the model should be able to connect this knowledge.
For example, if the model reads the text "Carl is the king of Sweden. Carl was born 1946." it should be able to connect these two facts about Carl and form a question like: "When was the king of Sweden born?".
Such relations could be represented by, for example, a simple relational database.
The challenge is detecting words that should be deemed to represent the same entity.
This problem is called co-reference resolution.
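
As a rough illustration of connecting facts about one entity, the sketch below uses an in-memory dictionary in place of a database. It sidesteps the hard part entirely: two mentions are treated as the same entity only if their strings match exactly. The helper names `build_fact_store` and `cross_fact_question` and the `(entity, predicate, value)` layout are assumptions made for this example.

```python
from collections import defaultdict


def build_fact_store(relations):
    """Group (entity, predicate, value) relations by entity.

    Naive assumption: identical strings refer to the same entity.
    """
    store = defaultdict(dict)
    for entity, predicate, value in relations:
        store[entity][predicate] = value
    return store


def cross_fact_question(store, entity, describe_by, ask_about):
    """Describe the entity through one fact and ask about another.

    Returns (question, answer) in the fill-in-the-blank format.
    """
    facts = store[entity]
    description = facts[describe_by]
    # Capitalize only the first letter so that names like "Sweden" stay intact.
    description = description[0].upper() + description[1:]
    return f"{description} {ask_about} _blank_", facts[ask_about]


store = build_fact_store([
    ("Carl", "is", "the king of Sweden"),
    ("Carl", "was born", "1946"),
])
question, answer = cross_fact_question(store, "Carl", "is", "was born")
print(question)  # The king of Sweden was born _blank_
print(answer)    # 1946
```

A real extension would replace the exact string match with a co-reference or entity-linking step before facts are stored.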