# Trivial Analytics
The purpose of this post is to try to use some data analytics to answer a question that came up in a conversation between me and my trivia teammates. Before the coronavirus put our favorite bar trivia night on hold, my friends and I had a ritual of appearing there every Wednesday night at 7pm and answering 8 rounds of trivia questions on a variety of [geek pop-culture subjects](https://www.geekswhodrink.com/). The question that arose at our table was along these lines:

> What would you give to know the 10 most important topics to study for trivia?

It's pretty natural to start to try to answer this question with data. For our particular trivia game, if we had the data it would be great to know which *things* appear the most in questions and answers. Knowing, for example, that Beyonce is 10% more likely to appear in an audio round than Bruno Mars is a pretty critical piece of information for somebody with a limited amount of time to prepare to win that sweet, sweet $20 bar cash.

Extracting this type of insight is 'non-trivial', even assuming a perfect world where I had access to a data set with my bar trivia game's questions and answers. No such data set exists. But, after some sleuthing I found a reddit poster sharing a data set with 200,000+ Jeopardy! questions [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). Bad news for my trivia team, I don't have the resources to crack the code our game, but maybe we can learn something by asking an amended question of the Jeopardy! data:

> If you only could study 10 topics in preparation for Jeopardy!, which topics should you study?

## Working with the data
The first step was to set up the data in a way that would be easy for me to work with in this project. I spun up a mySQL database and added a `question` table to hold the question data from reddit. There were a number of columns that I imported, but mostly we care about the `question` and `answer` fields.

The Juptyer notebook for this post is set up to easily connect to a local mySQL database assuming it is set up a similar way. Python can connect to mySQL using a package called `mysql.connector`.

In [16]:
import mysql.connector

def connectToMySQL():
  mydb = mysql.connector.connect(
    host="localhost",
    user="service",
    password="jeopardy!",
  )
  print("Connected.")
  print()
  return mydb



With a connection to the database open, you can execute normal SQL queries. Right away we are able to ask some fairly smart things if we know what we are looking for, like *show me five questions about Egypt*.

In [20]:
mydb = connectToMySQL()
cursor = mydb.cursor()
cursor.execute("SELECT question, answer FROM jeapordy_questions.question WHERE question like '%Egypt%' LIMIT 5")

for (question, answer) in cursor:
    print("Question: | " + question)
    print("Answer:   | " + answer)
    print()

cursor.close()

Connected.

Question: | 'Cleopatra's Needle is a short walk from this Egyptian Temple in the Metropolitan Museum of Art'
Answer:   | the Temple of Dendur

Question: | 'In 46 B.C. this Egyptian came with Caesar to Rome, where her statue was placed in the temple of Venus Genetrix'
Answer:   | Cleopatra

Question: | '"The Prince of Egypt" featured Ralph Fiennes as the voice of this stubborn ruler'
Answer:   | the Pharaoh

Question: | 'This city of east central Egypt is the southern half of the site of ancient Thebes'
Answer:   | Luxor

Question: | 'A short war between Israel & Egypt & Syria in October 1973 was named for this high holiday'
Answer:   | Yom Kippur



True

So now we have a database set up and we can write queries to ask it smart things. Like mentioned above, however, this requires us to know what we are asking. Questions like *what 10 things should I study* won't fly because we can't write a query yet for *things* we don't know we care about. We need some way to figure out what *things* in the questions are important.

## Named Entity Recognition
So what do I mean by *thing*? One naive solution to our problem might be to just look for common appearances of certain words. For example, if "America" appears regularly in questions, then that might be an important country to study, right?Well, practiced trivia players know that trivia is all about going more fine-grained than that. American History may be a very important subject to study, but at the end of the day, you may need to know some specifics about Hamilton that you may gloss over if you only study American History broadly. 

Consider another issue of the word count solution, it may tell you that it is quite important to know about "Alexander", but *Alexander-who?* Alexander Hamilton and Alexander the Great might both be important, but the word count solution doesn't tell us who is *more* important.

Another idea is to use the `category` of a question. That should help us get to the meat of what a question is about, but viewers of Jeopardy! will know well that the category is usually not useful, if not downright distracting. Categories like "African Geography" are way too broad to be useful. Meanwhile, many of the Jeapordy! categories are unique to the game, playful rhymes or word games.

In general, it looks like we are trying to extract people, places, times, etc. from the questions. In Natural Language Processing (NLP) there is a name for annotating this type of information, "Named Entity Recognition". Fortunately there are handy Python libraries out there like [spaCy](https://spacy.io/api/entityrecognizer) that can do the heavy lifting for this.


In [28]:
# Reference https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

def printAnnotation(q):
    doc = nlp(q)
    print([(X.text, X.label_) for X in doc.ents])

# We won't use this function right away, but let's set it up now so we can use it later
def getAnnotation(q):
    return nlp(q)

q = '"The Prince of Egypt" featured Ralph Fiennes as the voice of this stubborn ruler'
printAnnotation(q)

[('The Prince of Egypt', 'WORK_OF_ART'), ('Ralph Fiennes', 'PERSON')]


Here we can see spaCy was able to identidy to named entities in this question, "The Prince of Egypt" was labeled as a work of art (animated film, go watch it), and "Ralph Fiennes" was labeled as a person. This works pretty well generally, but it isn't perfect. For example, below it has decided that "Israel & Egypt & Syria" are an organization all-together. Hopefully these things will *come out in the wash* so to speak, but we should keep an eye out for misleading entities.


In [27]:
q = "A short war between Israel & Egypt & Syria in October 1973 was named for this high holiday"
printAnnotation(q)

[('Israel & Egypt & Syria', 'ORG'), ('October 1973', 'DATE')]


## Mapping Questions to Named Entities
To map question and answer text to Named Entities, we need a new table to track those entities and their types, as well as a mapping table to handle the many to many relationship of question to named_entity. With that set up and foreign keys in place, I should be able to populate those tables pretty easily.

This seeding script reads questions and answers from the database, creates named enitities and maps them to questions by inserting rows in named_entitu and mapping rows in question_named_entity. For now it is ignoring CARDINAL and MONEY named entities, as I found them to be not so useful (examples: 100, $1 billing, etc.).

In [29]:
# Query to get all questions from question table with limit and offset to paginate
get_all_questions = ("SELECT question_id, question, answer FROM jeapordy_questions.question LIMIT %s OFFSET %s")

# Queries to wipe out tables before re-seeding
delete_mappings = "DELETE FROM jeapordy_questions.question_named_entity"
delete_named_entities = "DELETE FROM jeapordy_questions.named_entity"

# Queries to add named entities and mappings
add_named_entity = ("INSERT INTO jeapordy_questions.named_entity (name, type) VALUES (%s, %s)")
add_mapping = ("INSERT INTO jeapordy_questions.question_named_entity (question, named_entity) VALUES (%s, %s)")

In [19]:
mydb = connectToMySQL()
cursor = mydb.cursor()

print("Deleting old records...")
cursor.execute(delete_mappings)
cursor.execute(delete_named_entities)
mydb.commit()

limit = 10000
offset = 0
data_entities = dict()
all_data_entities = dict()
data_mappings = []

for i in range(limit, 200000, limit):
  print("Starting get...")
  get_data = (limit, offset)
  print(str(get_data))
  cursor.execute(get_all_questions, get_data)

  print("Done, starting mapping...")
  for (question_id, question, answer) in cursor:
      q = unidecode.unidecode(regex.sub("'", "", question + " " + answer))
      doc = getAnnotation(q)
      for X in doc.ents:
        if not X.label_ == 'CARDINAL' and not X.label_ == 'MONEY':
          entity = X.text.lower()
          if not entity in data_entities:
            if not entity in all_data_entities:
              data_entities[entity] = X.label_
          data_mapping = (question_id, entity)
          data_mappings.append(data_mapping)
  print("Done, starting insert...")
  cursor.executemany(add_named_entity, (list(data_entities.items())))
  mydb.commit()
  cursor.executemany(add_mapping, (data_mappings))
  mydb.commit()

  all_data_entities = {**all_data_entities, **data_entities} 
  data_entities.clear()
  data_mappings = []
  offset = i
print("Done, closing...")
cursor.close()

Deleting old records...
Starting get...
(10000, 0)
Done, starting mapping...
Done, starting insert...
Starting get...
(10000, 10000)
Done, starting mapping...
Done, starting insert...
Starting get...
(10000, 20000)
Done, starting mapping...
Done, starting insert...
Starting get...
(10000, 30000)
Done, starting mapping...
Done, starting insert...
Starting get...
(10000, 40000)
Done, starting mapping...
Done, starting insert...
Starting get...
(10000, 50000)
Done, starting mapping...
Done, starting insert...
Starting get...
(10000, 60000)
Done, starting mapping...
Done, starting insert...
Starting get...
(10000, 70000)
Done, starting mapping...
Done, starting insert...
Starting get...
(10000, 80000)
Done, starting mapping...
Done, starting insert...
Starting get...
(10000, 90000)
Done, starting mapping...
Done, starting insert...
Starting get...
(10000, 100000)
Done, starting mapping...
Done, starting insert...
Starting get...
(10000, 110000)
Done, starting mapping...
Done, starting inse

True

With that in place, we should be able to query the mapping table for information about occurrences of certain named entities in questions. We should be able to write a rudimentary query to answer our question- what are the most important topics to study?

In [33]:
import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="service",
  password="jeopardy!",
)

cursor = mydb.cursor()
cursor.execute("SELECT named_entity,COUNT(*) FROM jeapordy_questions.question_named_entity GROUP BY named_entity ORDER BY COUNT(*) DESC LIMIT 10;")

for r in cursor:
    print(str(r))
cursor.close()

('first', 7848)
('u.s.', 3895)
('french', 2251)
('british', 1687)
('greek', 1449)
('american', 1388)
('latin', 1386)
('english', 1068)
('second', 1004)
('german', 957)


True

We're getting closer. Not too surprisingly, we see frequent occurences of what look like geographic or linquistic designations- French, American, English, Greek, German. First and Second don't really seem like topics, but more like just a part of speech, let's look into that.

In [42]:
import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="service",
  password="jeopardy!",
)

cursor = mydb.cursor()
cursor.execute(
    "SELECT question.question, named_entity.type FROM jeapordy_questions.question_named_entity " + 
    "LEFT JOIN jeapordy_questions.question " +
    "ON jeapordy_questions.question.question_id = jeapordy_questions.question_named_entity.question " +
    "LEFT JOIN jeapordy_questions.named_entity " +
    "ON jeapordy_questions.question_named_entity.named_entity = jeapordy_questions.named_entity.name " +
    "WHERE jeapordy_questions.question_named_entity.named_entity = 'first' LIMIT 30;")

for r in cursor:
    print(str(r))

cursor.close()

("'Cows regurgitate this from the first stomach to the mouth & chew it again'", 'ORDINAL')
("'Karl led the first of these Marxist organizational efforts; the second one began in 1889'", 'ORDINAL')
('\'This "Modern Girl" first hit the Billboard Top 10 with "Morning Train (Nine To Five)"\'', 'ORDINAL')
("'Warhol became the manager of this Lou Reed rock group in 1965 & produced their first album'", 'ORDINAL')
("'His first act after being sworn in as president of the Confederacy was to send a peace commission to Washington, D.C.'", 'ORDINAL')
("'The first 50-star U.S. flag was officially raised on July 4 of this year'", 'ORDINAL')
('\'On March 19, 2009 he said, "I\'m excited and honored to introduce my first guest... Barack Obama"\'', 'ORDINAL')
("'The first controlled nuclear chain reaction'", 'ORDINAL')
('\'He reviewed films & TV for the New Republic before his first book, "Goodbye, Columbus", was published in 1959\'', 'ORDINAL')
("'Colo was the first of these great apes born in captivit

True

That looks like enough information to rule out ORDINAL. Out of curiosity lets repeat this process for things like American, or German. Maybe there is another broad category to rule out.

In [46]:
import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="service",
  password="jeopardy!",
)

cursor = mydb.cursor()
cursor.execute(
    "SELECT question.question, named_entity.type FROM jeapordy_questions.question_named_entity " + 
    "LEFT JOIN jeapordy_questions.question " +
    "ON jeapordy_questions.question.question_id = jeapordy_questions.question_named_entity.question " +
    "LEFT JOIN jeapordy_questions.named_entity " +
    "ON jeapordy_questions.question_named_entity.named_entity = jeapordy_questions.named_entity.name " +
    "WHERE jeapordy_questions.question_named_entity.named_entity = 'german' LIMIT 30;")

for r in cursor:
    print(str(r))

cursor.close()

("'In 1811 this German family began its steel-making business by constructing a plant in Essen'", 'NORP')
("'In 1905 German scientist Alfred Einhorn created this first injectable local anesthetic used in dentistry'", 'NORP')
('\'Some of these are produced by bremsstrahlung, from the German for "breaking radiation"\'', 'NORP')
("'Named for a German neuropathologist, this memory loss disease may be caused by a gene on chromosome 21'", 'NORP')
("'This German composer's 5th Symphony in C Minor has a famous opening'", 'NORP')
("'In German, berg is this topographical feature on a map'", 'NORP')
("'The U.S. U-2, first built in the 1950s, was an airplane; the German U-1, first built in the 1910s, was one of these'", 'NORP')
("'Paul Baumer, a young German soldier'", 'NORP')
("'A German circus performer has made the Guinness record book for riding a bicycle with this distinction'", 'NORP')
("'Laboratory culture dish named for the German bacteriologist who invented it'", 'NORP')
('\'In "Sahara", 

True

Seems like NORP might be worth ruling out also. At this point I am starting to wonder if it is less about NLP labels we don't care about, and more about the few we DO care about. Interestingly, the first appearance of a topic that feels "trivial" in nature is "The Clue Crew" with 357 occurrences in questions. Let's take a look at those results and see how it is labeled. "The Clue Crew", obviously the group from the Nancy Drew childrens book series, is labeld as an "ORG". Logically seems like if we care about organizations, we also care about people.

In [52]:
import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="service",
  password="jeopardy!",
)

cursor = mydb.cursor()
cursor.execute(
    "SELECT named_entity.name, COUNT(*) FROM jeapordy_questions.question_named_entity " + 
    "LEFT JOIN jeapordy_questions.named_entity " +
    "ON jeapordy_questions.question_named_entity.named_entity = jeapordy_questions.named_entity.name " +
    "WHERE jeapordy_questions.named_entity.type = 'ORG' OR jeapordy_questions.named_entity.type = 'PERSON'" +
    "GROUP BY named_entity ORDER BY COUNT(*) DESC LIMIT 10;")

for r in cursor:
    print(str(r))

cursor.close()

('congress', 388)
('nyc', 359)
('the clue crew', 357)
('shakespeare', 287)
('jesus', 274)
('senate', 257)
('sarah', 232)
('house', 220)
('nba', 219)
('john', 210)


True

By merging our two queries together, paying attention to occurrences of entities by name as long as their label is either a person or an organization, we can quickly see that a student of the game (or any well-rounded individual I suppose) would benefit by reading up on the U.S. Congress, the city of new york, Nancy Drew, Shakespeare, and Jesus.