# Trivial Analytics
The purpose of this experiment is to answer this question:

> If you could only study 100 topics in preparation for Jeopardy!, which topics should you study?

Hopefully this will give me some practice doing data analytics on a relatively small data set, give me some insight into something I am interested in, and expose me to some natural language processing topics.

I found a Reddit post sharing a data set with 200,000+ questions here: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

In [None]:
import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="service",
  password="jeopardy!",
)

cursor = mydb.cursor()
cursor.execute("SELECT question, answer FROM jeapordy_questions.question WHERE question like '%Egypt%' LIMIT 10")

# Iterate over the result set so the loop ends cleanly if fewer than 10 rows match
for question, answer in cursor:
    print("Question: | " + question)
    print("Answer:   | " + answer)
    print()

## Named Entity Recognition
So what do I mean by 'topic'? The subject of a question could be pretty broad or pretty granular. Looking only at the category of a question clearly doesn't tell us enough about what to study. Categories like 'history' are way too broad to be useful, while many of the Jeopardy! categories are unique to the game: playful rhymes or word games.

In general, it looks like we are trying to extract people, places, times, etc. In NLP there is a name for annotating this type of information: 'Named Entity Recognition'. Fortunately, there are handy Python libraries like spaCy that can do the heavy lifting for this. https://spacy.io/api/entityrecognizer


In [41]:
# Reference https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

q = '"The Prince of Egypt" featured Ralph Fiennes as the voice of this stubborn ruler'

doc = nlp(q)
print([(X.text, X.label_) for X in doc.ents])

[('The Prince of Egypt', 'WORK_OF_ART'), ('Ralph Fiennes', 'PERSON')]


That's great, so we can put in a string and spaCy can help us identify the named entities.

The information that is useful is most often in the body of the question or the answer.

For example, "Galileo was sentenced to house arrest after supporting the theories of this astronomer" is a question about Copernicus, who appears in the answer. Galileo is also a useful piece of information: if you knew a lot about Galileo, you could probably get to Copernicus.

The main subject could also appear in the question text itself. Take "Copernicus was prosecuted by the church for publishing a paper on this model of the solar system". Copernicus is still the main topic of the question, even though he doesn't appear in the answer, 'heliocentric'.

Seems like it would be worthwhile to create a new table for named entities and then a mapping table to map q/a combinations that contain those named entities.


## Mapping Questions to Named Entities
To map question and answer text to named entities, we need a new table to track those entities and their types, as well as a mapping table to handle the many-to-many relationship between question and named_entity. With those tables set up and foreign keys in place, I should be able to populate them pretty easily.
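As a rough illustration of that layout, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the MySQL instance. The table names match the queries in this notebook, but the column definitions (for example `named_entity_id`) and the `PERSON` labels are assumptions for illustration, not the actual schema or real spaCy output.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database used in this notebook.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Column definitions below are assumed for illustration.
conn.executescript("""
CREATE TABLE question (
    question_id INTEGER PRIMARY KEY,
    question    TEXT NOT NULL,
    answer      TEXT NOT NULL
);
CREATE TABLE named_entity (
    named_entity_id INTEGER PRIMARY KEY,
    name            TEXT NOT NULL UNIQUE,
    type            TEXT NOT NULL
);
CREATE TABLE question_named_entity (
    question     INTEGER NOT NULL REFERENCES question (question_id),
    named_entity INTEGER NOT NULL REFERENCES named_entity (named_entity_id),
    PRIMARY KEY (question, named_entity)
);
""")

conn.executemany(
    "INSERT INTO question (question_id, question, answer) VALUES (?, ?, ?)",
    [
        (1, "Galileo was sentenced to house arrest after supporting the theories of this astronomer", "Copernicus"),
        (2, "Copernicus was prosecuted by the church for publishing a paper on this model of the solar system", "heliocentric"),
    ],
)

# Insert each entity once and remember the id the database assigned to it.
entity_ids = {}
for name, label in [("galileo", "PERSON"), ("copernicus", "PERSON")]:
    cur = conn.execute("INSERT INTO named_entity (name, type) VALUES (?, ?)", (name, label))
    entity_ids[name] = cur.lastrowid

# One mapping row per (question, entity) pair.
conn.executemany(
    "INSERT INTO question_named_entity (question, named_entity) VALUES (?, ?)",
    [(1, entity_ids["galileo"]), (1, entity_ids["copernicus"]), (2, entity_ids["copernicus"])],
)

# Count how many questions mention each entity, most frequent first.
rows = conn.execute("""
    SELECT ne.name, COUNT(*) AS n
    FROM question_named_entity AS qne
    JOIN named_entity AS ne ON ne.named_entity_id = qne.named_entity
    GROUP BY ne.name
    ORDER BY n DESC, ne.name
""").fetchall()
print(rows)
```

With this shape, the "top 100 topics" question reduces to the final `GROUP BY` query with a `LIMIT 100` added.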

In [63]:
import mysql.connector
import regex
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

mydb = mysql.connector.connect(
  host="localhost",
  user="service",
  password="jeopardy!",
)

# Queries
get_all_questions = ("SELECT question_id, question, answer FROM jeapordy_questions.question LIMIT %s OFFSET %s")
delete_mappings = "DELETE FROM jeapordy_questions.question_named_entity"
delete_named_entities = "DELETE FROM jeapordy_questions.named_entity"
add_named_entity = ("INSERT INTO jeapordy_questions.named_entity (name, type) VALUES (%s, %s)")
add_mapping = ("INSERT INTO jeapordy_questions.question_named_entity (question, named_entity) VALUES (%s, %s)")

cursor = mydb.cursor()
print("Deleting old records...")
cursor.execute(delete_mappings)
cursor.execute(delete_named_entities)

limit = 100
offset = 0
data_entities = dict()
all_data_entities = dict()
data_mappings = []

# Page through the first 200,000 questions, `limit` rows at a time (rows beyond that are not read)
for offset in range(0, 200000, limit):
  print("Starting get...")
  get_data = (limit, offset)
  print(str(get_data))
  cursor.execute(get_all_questions, get_data)

  print("Done, starting mapping...")
  for (question_id, question, answer) in cursor:
    # Run NER over the question and answer together, with apostrophes stripped
    q = regex.sub("'", "", question + " " + answer)
    doc = nlp(q)
    for X in doc.ents:
      entity = X.text.lower()
      # Queue an entity for insert only if no batch has seen it yet
      if entity not in data_entities and entity not in all_data_entities:
        data_entities[entity] = X.label_
      data_mappings.append((question_id, entity))
  print("Done, starting insert...")
  cursor.executemany(add_named_entity, list(data_entities.items()))
  cursor.executemany(add_mapping, data_mappings)

  all_data_entities.update(data_entities)
  data_entities.clear()
  data_mappings = []
print("Done, closing...")
mydb.commit()  # mysql.connector does not autocommit by default
cursor.close()

Deleting old records...
Starting get...
(100, 0)
Done, starting mapping...
Done, starting insert...


IntegrityError: 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (`jeapordy_questions`.`question_named_entity`, CONSTRAINT `question_fk` FOREIGN KEY (`id`) REFERENCES `question` (`question_id`))

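I'm hitting a foreign key constraint failure on the mapping insert. It is probably just something in the data, though the error message shows `question_fk` defined on the mapping table's `id` column, so the constraint definition itself is worth checking too. Need to revisit.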
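One candidate cause, and this is an assumption since the real schema is not shown here, is that the mapping rows carry the entity's name where the foreign key column expects the entity's id. The sketch below reproduces that class of failure in an in-memory SQLite database (a stand-in for MySQL, with hypothetical column definitions) and then resolves names to ids before inserting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce foreign keys

# Hypothetical, simplified schema: the real MySQL tables may differ.
conn.executescript("""
CREATE TABLE question (question_id INTEGER PRIMARY KEY);
CREATE TABLE named_entity (id INTEGER PRIMARY KEY, name TEXT UNIQUE, type TEXT);
CREATE TABLE question_named_entity (
    question     INTEGER REFERENCES question (question_id),
    named_entity INTEGER REFERENCES named_entity (id)
);
INSERT INTO question (question_id) VALUES (1);
""")

entities = {"copernicus": "PERSON", "galileo": "PERSON"}
mappings = [(1, "copernicus"), (1, "galileo")]  # (question_id, entity name) pairs

# Inserting a name where an id is expected violates the foreign key.
try:
    conn.executemany("INSERT INTO question_named_entity VALUES (?, ?)", mappings)
    failed = False
except sqlite3.IntegrityError as err:
    failed = True
    print("Rejected:", err)

# Insert the entities first, then look up the ids the database assigned.
conn.executemany(
    "INSERT INTO named_entity (name, type) VALUES (?, ?)", list(entities.items())
)
name_to_id = dict(conn.execute("SELECT name, id FROM named_entity"))

# Swap each entity name for its id before inserting the mapping rows.
resolved = [(question_id, name_to_id[name]) for question_id, name in mappings]
conn.executemany("INSERT INTO question_named_entity VALUES (?, ?)", resolved)

count = conn.execute("SELECT COUNT(*) FROM question_named_entity").fetchone()[0]
print(count)
```

If the real tables turn out to be keyed differently, the same check still applies: confirm that every value going into a foreign key column already exists in the parent table.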