This is a Knowledge Graph Question Answering System based on the paper:  
S. Aghaei, E. Raad and A. Fensel, "Question Answering Over Knowledge Graphs: A Case Study in Tourism," in IEEE Access, vol. 10, pp. 69788-69801, 2022, doi: 10.1109/ACCESS.2022.31871

There are some deviations from the paper, and I have tried to note them. In particular, the knowledge graph (KG) for this project is a couple of orders of magnitude smaller (about 400 facts/triples). Due to time constraints, some corners might have been cut.

I discuss this in the sections below.

# Zoning Question Answering

This is a simple example showing the system at work.

In [1]:
%%time
import semantic_parsing
sem_par = semantic_parsing.SemanticParsingClass()

  from .autonotebook import tqdm as notebook_tqdm
2022-12-02 17:12:38 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 20.7MB/s]
2022-12-02 17:12:39 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |
| depparse  | combined |

2022-12-02 17:12:39 INFO: Use device: cpu
2022-12-02 17:12:39 INFO: Loading: tokenize
2022-12-02 17:12:39 INFO: Loading: pos
2022-12-02 17:12:39 INFO: Loading: lemma
2022-12-02 17:12:39 INFO: Loading: depparse
2022-12-02 17:12:39 INFO: Done loading processors!


CPU times: user 6.01 s, sys: 886 ms, total: 6.89 s
Wall time: 4.16 s


Below are a few random questions you could ask. Rerun the cell to generate new ones, or change the `n` value to get more or fewer questions.  

These come from the templates in `generate_template.py`, which execute SPARQL to get the answers. The question corpus has 900 questions in all. Before exclusion there were 1062; 162 questions were excluded because they had bad answers that would take time to fix.
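
As a rough illustration of how a template-driven corpus like this can be produced, the sketch below expands one question template over a few uses and zoning districts. The template text mirrors a question shown in this notebook, but the helper name `expand_template` and the sample lists are assumptions made for illustration; the real logic lives in `generate_template.py` and may differ.

```python
from itertools import product
from string import Template

# Hypothetical question template, modeled on questions shown in this notebook.
QUESTION = Template("Are ${use} permitted in a ${zone} zoning district?")

def expand_template(template, uses, zones):
    """Return one question string per (use, zone) combination."""
    return [template.substitute(use=u, zone=z) for u, z in product(uses, zones)]

# Small made-up sample; the real corpus is built from the knowledge graph.
questions = expand_template(QUESTION, ["breweries", "restaurants"], ["C1", "FI3"])
for q in questions:
    print(q)
```

Two uses and two districts give four questions, which is how a small KG can still yield a corpus of several hundred questions.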

In [18]:
%%time
itr = semantic_parsing.get_random_questions_answers(sem_par.generate_filtered_corpus(), n=5)
for qa in itr:
    print(f"Question: {qa['question']}")
    if qa['answer'] is True:
        print(f"Answer: Yes")
    else:
        print(f"Answer: {qa['answer']}")

Question: Are storage yards allowed?
Answer: Yes
Question: I would like to build places of religious worship.  Which zoning districts permits this use?
Answer: ['C1', 'C2', 'C3', 'C4', 'FI1', 'FI2', 'FI3']
Question: Are public and governmental services permitted in a FI3 zoning district?
Answer: Yes
Question: Are breweries permitted in a FI3 zoning district?
Answer: Yes
Question: I would like to build restaurants.  Which zoning districts permits this use?
Answer: ['C1', 'C2', 'C3', 'C4']
CPU times: user 1.62 s, sys: 8.08 ms, total: 1.62 s
Wall time: 1.62 s


Below is the Zoning KGQAS at work.

Copy one of the questions above, or try a variation on one.

Example questions:
* "Are quarries allowed?" is in the dataset, but mining is not, so "Is mining allowed?" asks about a similar use (the answer should be "Yes"). The system correctly identifies that quarries and mining are similar words.

* The International Zoning Code uses "financial services", so "Is a bank allowed?" asks about a similar use (the answer should be "Yes").

I then asked a few more questions, starting with one that, off the top of my head, I think should have a "No" answer: 
* "Is a trailer park allowed?" It answers "Yes" because it matches the use to "publicly owned and operated parks".  
* Then I asked "Which zoning districts allow trailer parks?" Again it answers "Yes", matching the same question classification. An answer I could believe would be "R1" (or some other R district). I guess I'll have to accept that Zoning KGQAS is a people pleaser. Kidding aside, it has overfitted because only questions with "Yes" answers were sent to it in training.
* I asked "Is a marijuana dispensery allowed?" It matches that use to "A2", which is an agricultural zoning district. It gives the correct answer of "No", but it doesn't know that it should be "No".
* I asked "I would like to build a marijuana dispensery.  Which zoning districts permits this use?" It correctly gives an empty list, [].

This is thought-provoking, but I'd like more experience with the system before I suggest ways to fix it.
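
The "trailer park" behavior can be reproduced in miniature. The sketch below is a toy under stated assumptions, not the project's code: the triples, the `ask_permits_use` helper, and the word-overlap matcher `closest_use` are all hypothetical stand-ins (the real system computes its own similarity scores and runs SPARQL). It shows that once the slot is filled with the nearest known use, an ASK-style query can only confirm that this known use exists, so an unknown use inherits a "Yes".

```python
# Toy knowledge graph as (subject, predicate, object) triples (made-up sample).
TRIPLES = {
    ("C1", "permitsUse", "publicly owned and operated parks"),
    ("FI3", "permitsUse", "quarries"),
}

def known_uses(triples):
    """Collect every use label that appears in a permitsUse triple."""
    return sorted(o for _, p, o in triples if p == "permitsUse")

def normalize(phrase):
    """Lowercase the words and crudely strip trailing 's' (toy plural handling)."""
    return {w.lower().rstrip("s") for w in phrase.split()}

def closest_use(phrase, uses):
    """Hypothetical matcher: pick the known use sharing the most words."""
    words = normalize(phrase)
    return max(uses, key=lambda use: len(words & normalize(use)))

def ask_permits_use(triples, use):
    """Mimic ASK { ?zoning :permitsUse "<use>" . } on the toy graph."""
    return any(p == "permitsUse" and o == use for _, p, o in triples)

slot = closest_use("trailer park", known_uses(TRIPLES))
print(slot)
print(ask_permits_use(TRIPLES, slot))
```

Because the matcher always returns some known use, the query is asked about "publicly owned and operated parks" rather than about trailer parks, and the toy graph answers accordingly.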

In [22]:
%%time
# question = "What is the minimum lot depth in the C3 zoning district?"   # Answer: 100 [ft_i]
# question = "Are quarries allowed?"  # A: Yes
question = "Is mining allowed?"       # A: Yes
# question = "Is a bank allowed?"     # A: Yes
# question = "Is a trailer park allowed?"   # A: Yes   (Seems incorrect)
# question = "Which zoning districts allow trailer parks?"  # A: Yes   (Seems incorrect, incorrect kind of answer)
# question = "Is a marijuana dispensery allowed?"  # A: No
# question = "I would like to build a marijuana dispensery.  Which zoning districts permits this use?"
# A: []

answer = sem_par.classify(question)

print(f"Question: {question}")
if answer is True:
    print(f"Answer: Yes")
else:
    print(f"Answer: {answer}")

2022-12-02 17:43:47.594 | INFO     | question_classification:load_model:292 - Loading XGBoost model: question_classification_model.ubj


[[ 83570  85230  85717 120000 120000 120000 120000 120000 120000 120000
  120000 120000 120000 120000 120000 120000 120000 120000 120000 120000
  120000 120000 120000 120000 120000 120000 120000 120000 120000 120000
  120000 120000 120000 120000 120000 120000 120000 120000 120000 120000]]
[2]
template name: template_use_1var_yn_answer
SPARQL TEMPLATE: 
ASK {
        ?zoning :permitsUse "${use}" .
}

VARIABLES: ('use',)
RELATIONS: None
SLOTS: {}
num_entity_slots: 1
SIMILARITY SCORES for num_entity_slots: [(1.7109150355375984, 'quarries')]
SLOTS: {'use': 'quarries'}
SPARQL:  
ASK {
        ?zoning :permitsUse "quarries" .
}

Yes
Question: Is mining allowed?
Answer: Yes
CPU times: user 4.62 s, sys: 163 ms, total: 4.78 s
Wall time: 1.32 s
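
The `${use}` placeholder in the logged template is the syntax used by Python's `string.Template`, so the slot-filling step can be sketched as below. This is a guess at the mechanism rather than the project's actual code, and `fill_template` is a hypothetical helper.

```python
from string import Template

# Template text copied from the log output above.
SPARQL_TEMPLATE = Template('''ASK {
        ?zoning :permitsUse "${use}" .
}''')

def fill_template(template, slots):
    """Substitute slot values; raises KeyError if a variable has no slot."""
    return template.substitute(slots)

query = fill_template(SPARQL_TEMPLATE, {"use": "quarries"})
print(query)
```

`substitute` fails loudly when a slot is missing, whereas `safe_substitute` would leave `${use}` in the query text, which is usually harder to debug.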


# Metrics

The cell below computes the metrics. It takes about 44 minutes to run on CPU.

In [None]:
%%time
sem_par.measure_accuracy()
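
What `measure_accuracy()` computes internally is not shown in this notebook. As a minimal sketch of one plausible metric, assuming accuracy means the fraction of questions whose predicted answer equals the expected one, the helper below (hypothetical, with made-up sample data) also shows why a corpus dominated by "Yes" answers can flatter a system that always says "Yes".

```python
def accuracy(pairs, answer_fn):
    """Fraction of (question, expected) pairs that answer_fn gets right."""
    if not pairs:
        raise ValueError("need at least one question")
    correct = sum(1 for question, expected in pairs if answer_fn(question) == expected)
    return correct / len(pairs)

def always_yes(question):
    """Made-up stand-in for a system that answers "Yes" to everything."""
    return True

# Tiny illustrative sample; expected answers follow the discussion in this notebook.
sample = [
    ("Are storage yards allowed?", True),
    ("Is mining allowed?", True),
    ("Is a marijuana dispensery allowed?", False),
]
print(accuracy(sample, always_yes))
```

Reporting accuracy separately for "Yes" and "No" questions would make this kind of bias visible.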

# Training

Training is fast because it is a single simple fit. No cross-validation is performed on the data.
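
For readers unfamiliar with the missing step, here is a minimal, dependency-free sketch of how k-fold cross-validation splits could be generated for a corpus of 900 questions. It is not part of this project; `k_fold_indices` is a hypothetical helper, and in practice a library routine such as scikit-learn's `KFold` would be the usual choice. Shuffling matters here because the encoded labels printed during training appear grouped by class.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index lists for k shuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # avoid folds made of a single class
    folds = [indices[i::k] for i in range(k)]  # every k-th shuffled index
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

folds = list(k_fold_indices(900, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each question lands in exactly one test fold, so averaging the score over the folds gives an estimate that does not depend on a single lucky split.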

In [1]:
%%time
# takes approx. 2 minutes of CPU time
import semantic_parsing
semantic_parsing.train_all()

CPU times: user 9 µs, sys: 1 µs, total: 10 µs
Wall time: 21 µs


  from .autonotebook import tqdm as notebook_tqdm
2022-12-02 16:25:34 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 29.0MB/s]
2022-12-02 16:25:35 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |
| depparse  | combined |

2022-12-02 16:25:35 INFO: Use device: cpu
2022-12-02 16:25:35 INFO: Loading: tokenize
2022-12-02 16:25:35 INFO: Loading: pos
2022-12-02 16:25:35 INFO: Loading: lemma
2022-12-02 16:25:35 INFO: Loading: depparse
2022-12-02 16:25:35 INFO: Done loading processors!
2022-12-02 16:28:09.807 | INFO     | question_classification:train2:260 - Training XGBoost Classifier took: 151.37237477302

Number of Questions Encoded: 900
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 



Iteration 1, loss = 6.18207124
Iteration 2, loss = 5.75650376
Iteration 3, loss = 5.35298823
Iteration 4, loss = 4.92779234
Iteration 5, loss = 4.47552798
Iteration 6, loss = 3.99844282
Iteration 7, loss = 3.51693592
Iteration 8, loss = 3.05103115
Iteration 9, loss = 2.61582491
Iteration 10, loss = 2.22430647
Iteration 11, loss = 1.88732362
Iteration 12, loss = 1.60622919
Iteration 13, loss = 1.37869731
Iteration 14, loss = 1.19859700
Iteration 15, loss = 1.05739790
Iteration 16, loss = 0.94994859
Iteration 17, loss = 0.86606615
Iteration 18, loss = 0.80163448
Iteration 19, loss = 0.75207464
Iteration 20, loss = 0.71258353
Iteration 21, loss = 0.67965179
Iteration 22, loss = 0.65290649
Iteration 23, loss = 0.63044698
Iteration 24, loss = 0.61156127
Iteration 25, loss = 0.59418470
Iteration 26, loss = 0.57907829
Iteration 27, loss = 0.56515743
Iteration 28, loss = 0.55211142
Iteration 29, loss = 0.54000027
Iteration 30, loss = 0.52863642
Iteration 31, loss = 0.51769550
Iteration 32, los

Iteration 258, loss = 0.04174633
Iteration 259, loss = 0.04134369
Iteration 260, loss = 0.04089712
Iteration 261, loss = 0.04055682
Iteration 262, loss = 0.04009746
Iteration 263, loss = 0.03965293
Iteration 264, loss = 0.03926974
Iteration 265, loss = 0.03881296
Iteration 266, loss = 0.03843513
Iteration 267, loss = 0.03803761
Iteration 268, loss = 0.03763794
Iteration 269, loss = 0.03722590
Iteration 270, loss = 0.03683251
Iteration 271, loss = 0.03644934
Iteration 272, loss = 0.03606607
Iteration 273, loss = 0.03567180
Iteration 274, loss = 0.03531040
Iteration 275, loss = 0.03493642
Iteration 276, loss = 0.03457641
Iteration 277, loss = 0.03425476
Iteration 278, loss = 0.03392155
Iteration 279, loss = 0.03358439
Iteration 280, loss = 0.03324444
Iteration 281, loss = 0.03293378
Iteration 282, loss = 0.03258051
Iteration 283, loss = 0.03226855
Iteration 284, loss = 0.03189602
Iteration 285, loss = 0.03160872
Iteration 286, loss = 0.03127016
Iteration 287, loss = 0.03097142
Iteration 