# Simple, Interactive, and Dynamic Jeopardy [Web App](http://www.jacobtbigham.com/jeopardy)
I recently finished reading Madeleine Albright's engaging *Fascism: A Warning*, and upon completing it two thoughts crossed my mind. First, I was thankful to have access to Albright's insights and experiences with world leaders; her unique pedigree and history offer a perspective I could not dream of. Second, I was mildly alarmed that with this book, more than with most others I have read, I knew I would forget the vast majority of its information. Keeping track of Orban, Erdogan, Chavez, and the names of cities, places, and other dictators is just too difficult for me to handle. I could, of course, have kept diligent notes while reading, but I figured that even then I might not be able to recall names and ideas when needed.

Long story much shorter: I want to develop a simple Jeopardy game into which I can feed a keyword and then receive Jeopardy questions related to that keyword. Fortunately, the Jeopardy [Archive](http://www.j-archive.com) stores hundreds of thousands of past Jeopardy questions, which others have dutifully scraped from the JSON files in the archive.

Because my personal [website](http://www.jacobtbigham.com) is hosted on Squarespace (don't @ me), databasing these questions was a little difficult, since I was forced to use JavaScript (and I can't keep Promises, apparently) and could not use SQL. Fortunately, I learned a lot more from coding this project than I bargained for.

Here's an example of gameplay:

![Game](https://i.imgur.com/xm4DLKq.gif)

A couple of features (and lack of features) worth noting:
- Users input a single keyword or key phrase, which the app then uses to query the database for matches.
- Those keywords can match exactly or as substrings within other words (e.g., dog: dog, or erdogan).
- I did not employ any ML algorithms to extract key words, phrases, or topics from questions and answers.
- Instead, I give users the option to either view clues that *only* match the keyword **or** to get clues that match the keyword *and* the other clues in the same categories as those clues. In this way, I let the Jeopardy writers do my ML for me.
- If there are not enough clues that match a given keyword, then the keyword is rejected.
- I used the Levenshtein distance between users' answers and correct answers to determine whether answers were correct.
- I separated Final Jeopardy and regular Jeopardy questions so that users can get the full Jeopardy experience.
- The first round contains one Daily Double question, and the second round contains two Daily Double questions.
- Users can restart the game with a new keyword at any time.
- Users can share their score to social media sites, along with the keyword for their custom game.
- Users can alternatively opt to play a random Jeopardy game, with clues from random and unrelated categories.
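
The keyword behavior described in the list above can be sketched roughly as follows. This is a Python illustration on a toy table, not the app's actual JavaScript; the function name `find_clues`, the `min_clues` threshold, and the column names `category_id` and `text` are assumptions made for the example.

```python
import pandas as pd

def find_clues(clues, keyword, keep_categories=False, min_clues=2):
    """Return clues whose text contains `keyword`, or None if too few match.

    `min_clues` is a hypothetical threshold chosen for this example.
    """
    # The toy text is uppercase, so upper-casing the keyword makes the
    # substring match case-insensitive ("dog" also matches "ERDOGAN").
    hits = clues["text"].str.contains(keyword.upper(), regex=False)
    if keep_categories:
        # Also pull in every clue that shares a category with a direct match.
        matched_ids = clues.loc[hits, "category_id"].unique()
        hits = clues["category_id"].isin(matched_ids)
    result = clues[hits]
    return result if len(result) >= min_clues else None

toy = pd.DataFrame({
    "category_id": [1, 1, 2, 3],
    "text": [
        "WORLD LEADERS|TURKEY'S PRESIDENT SINCE 2014|ERDOGAN",
        "WORLD LEADERS|HUNGARY'S PRIME MINISTER SINCE 2010|ORBAN",
        "PETS|THIS ANIMAL IS MAN'S BEST FRIEND|DOG",
        "RIVERS|IT FLOWS THROUGH CAIRO|NILE",
    ],
})

print(len(find_clues(toy, "dog")))                        # direct matches only
print(len(find_clues(toy, "dog", keep_categories=True)))  # plus same-category clues
print(find_clues(toy, "zebra"))                           # rejected: too few matches
```

With `keep_categories=True`, the ORBAN clue comes along because it shares a category with the ERDOGAN clue, which is the "let the writers do the ML" idea in miniature.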

My overall approach was the following:
1. Construct a database to hold regular Jeopardy and Final Jeopardy questions
2. Build a user interface with HTML and JavaScript to interact with the database and present questions
3. Enjoy!

Feedback, suggestions, and corrections are greatly appreciated!

## Wrangling

First, let's load the data, which is available [here](https://drive.google.com/drive/folders/1fxY181PdiA1KoJRG23ZVLC2Y5CIRqQWx), and take care of some of the easy wrangling. (I converted from .tsv to .csv format on my computer so that it would open automatically in Excel, wherein I removed the comments and notes sections.)

In [None]:
import pandas as pd
import numpy as np
import unicodedata

In [None]:
data = pd.read_csv("../input/jeopardy.csv", dtype = {"round": np.int16, "value": np.int16})
data.head()

Let's first verify that the daily_double only has "yes" and "no" values, and then let's change the values to True and False:

In [None]:
data.daily_double.describe()

In [None]:
# Map in one step; chained .loc assignment on a column may not write back to the DataFrame
data["daily_double"] = data["daily_double"].map({"no": False, "yes": True})
data.head()

Great! Now, just to be consistent with the capitalization of the categories (and the style of the actual game show answers), let's convert our answers and questions to uppercase:

In [None]:
data["answer"] = data["answer"].str.upper()
data["question"] = data["question"].str.upper()
data.head()

Stellar! Furthermore, though it's not immediately obvious here, many of the categories, answers, and questions contain quotation marks that are formatted grotesquely. Let's get those in order:

In [None]:
#for example
data["answer"].iloc[11]

In [None]:
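# Strip every backslash; regex=False treats the pattern as a literal string
data["answer"] = data["answer"].str.replace("\\", "", regex=False)
data["question"] = data["question"].str.replace("\\", "", regex=False)
data["category"] = data["category"].str.replace("\\", "", regex=False)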
data["answer"].iloc[11]

Beautiful! I also want to remove any special/foreign/accented characters (since otherwise a search would not be able to locate them!), so let's do that here using [this](https://stackoverflow.com/a/44433664) function:

In [None]:
#notice the question
data.iloc[16:17]

In [None]:
def strip_accents(text):
    try:
        text = unicode(text, 'utf-8')
    except NameError: # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode("utf-8")
    return str(text)

data["category"] = data["category"].apply(strip_accents)
data["answer"] = data["answer"].apply(strip_accents)
data["question"] = data["question"].apply(strip_accents)

#and now
data.iloc[16:17]
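
As a quick check of what this kind of normalization does to individual strings, here is a small sketch. The helper `to_ascii` is a simplified Python 3 stand-in for `strip_accents`, written only for this illustration, and the sample strings are my own.

```python
import unicodedata

def to_ascii(text):
    # Simplified Python 3 stand-in for strip_accents above.
    # NFD splits an accented letter into a base letter plus combining marks;
    # encoding to ASCII with "ignore" then drops the marks.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("ORBÁN"))    # ORBAN
print(to_ascii("ERDOĞAN"))  # ERDOGAN
print(to_ascii("ØRSTED"))   # RSTED
```

The last line shows a limitation worth keeping in mind: letters such as "Ø" have no base-letter decomposition, so they are removed entirely rather than replaced.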

Fantastic! Now, because the show doubled point values on November 26, 2001, we need to adjust clue values prior to that date to match. That the data are arranged chronologically makes this simple:

In [None]:
#notice
data.iloc[129490:129497, ]

In [None]:
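# Assign through a single .iloc call so the change is written to the DataFrame itself
value_col = data.columns.get_loc("value")
data.iloc[:129493, value_col] = 2 * data.iloc[:129493, value_col]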
data.iloc[129490:129497, ]
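
The hard-coded position 129493 only works while the rows stay in their original order. A date-based variant might look like the sketch below, shown on a toy frame and assuming the `air_date` column holds strings such as `11/26/2001`.

```python
import pandas as pd

toy = pd.DataFrame({
    "air_date": ["11/23/2001", "11/26/2001", "11/27/2001"],
    "value": [100, 200, 400],
})

# Parse the dates (assumed month/day/year format) and flag rows
# that aired before the day the show doubled its values.
dates = pd.to_datetime(toy["air_date"], format="%m/%d/%Y")
before_doubling = dates < pd.Timestamp("2001-11-26")

# Double only the older clues.
toy.loc[before_doubling, "value"] *= 2
print(toy["value"].tolist())  # [200, 200, 400]
```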

Miraculous! By the way, notice that Final Jeopardy questions are indicated by a round value of 3, and their corresponding value is 0. Now, also not obvious is that not every point value is correct or possible:

In [None]:
ACCEPTABLE_VALUES = [200, 400, 600, 800, 1000, 1200, 1600, 2000]

data.loc[(data["round"] != 3) & ((~data["value"].isin(ACCEPTABLE_VALUES))|data["daily_double"])]

In [None]:
data.loc[349562:349571, ]

As if by providence, we notice here another problem as well: not all the clues are here, but not for the reason we might first suppose (that the clue was never revealed). If we look at the Jeopardy archive [page](http://www.j-archive.com/showgame.php?game_id=6388) for 7-25-2019, the reasons for these problems become clear.

The 4000-point clues, for example, are Daily Doubles! And the missing clue in the "GATES" category is a clue that contained an image—and those clues are omitted from the dataset.

I'm going to impute the Daily Double values by assigning them whatever value is missing based on the other clues for each category. Notably, sometimes there are missing clues in categories with Daily Doubles. In such cases, I will assign to the Daily Double question the highest of missing values, since, in the past, Daily Double questions tended lower on the board.

It's mildly more costly but easier to implement this by running over the entire dataset, not just the problem-values we identified before. This approach makes it easier to identify category chunks. I could have used a swifter Pandas aggregate and sort approach, but the iterative approach below works and isn't so slow that it's worth scrapping.

In [None]:
indexes_with_problems = [index for index in range(0, len(data)) if data["round"].iloc[index] != 3 \
                                                                and ((data["value"].iloc[index] not in ACCEPTABLE_VALUES)
                                                                     or data["daily_double"].iloc[index])]
len(indexes_with_problems)
assert(len(indexes_with_problems) == 17025)

In [None]:
ROUND_ONE_VALUES = set([200, 400, 600,  800,  1000])
ROUND_TWO_VALUES = set([400, 800, 1200, 1600, 2000])

index = 0
while(index < 349630):        #the last problem is at 349625
    current_category = data["category"].iloc[index]
    indexes_in_category = [index]
    index += 1
    while(data["category"].iloc[index] == current_category):
        indexes_in_category.append(index)
        index += 1
    for i in indexes_in_category:
        if i in indexes_with_problems:
            indexes_in_category.remove(i)
            possible_values = ROUND_ONE_VALUES.copy() if data["round"].iloc[i] == 1 else ROUND_TWO_VALUES.copy()
            for j in indexes_in_category:
                if data["value"].iloc[j] in possible_values:
                    possible_values.remove(data["value"].iloc[j])
                else:
                    print("Problem removing value from line", j, end= "\n")
            data.iloc[i, data.columns.get_loc("value")] = max(possible_values)
            break
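
For comparison, the grouped approach mentioned earlier could be sketched along these lines. This is not the code used to build the dataset: the function name `impute_daily_doubles` is hypothetical, it only handles rows flagged as Daily Doubles, and it assumes `daily_double` is a boolean column. It starts a new block whenever the category or the round changes, and gives each Daily Double the highest board value missing from its block.

```python
import pandas as pd

# Board values per round, mirroring ROUND_ONE_VALUES and ROUND_TWO_VALUES.
BOARD_VALUES = {
    1: {200, 400, 600, 800, 1000},
    2: {400, 800, 1200, 1600, 2000},
}

def impute_daily_doubles(clues):
    """Return a copy in which each Daily Double wager is replaced by the
    highest board value missing from its category block."""
    clues = clues.copy()
    # Consecutive rows sharing a category and a round form one block.
    new_block = (clues["category"] != clues["category"].shift()) | (
        clues["round"] != clues["round"].shift()
    )
    block_id = new_block.cumsum()
    for _, block in clues.groupby(block_id, sort=False):
        round_number = int(block["round"].iloc[0])
        if round_number not in BOARD_VALUES:
            continue  # Final Jeopardy rows have no board value
        doubles = block.index[block["daily_double"]]
        if len(doubles) != 1:
            continue  # nothing to impute, or too ambiguous for this sketch
        seen = set(block.loc[~block["daily_double"], "value"])
        missing = BOARD_VALUES[round_number] - seen
        if missing:
            clues.loc[doubles[0], "value"] = max(missing)
    return clues

toy = pd.DataFrame({
    "round": [1, 1, 1, 1, 1, 3],
    "category": ["GATES"] * 5 + ["EUROPE"],
    "value": [200, 400, 1500, 800, 1000, 0],
    "daily_double": [False, False, True, False, False, False],
})
fixed = impute_daily_doubles(toy)
print(fixed["value"].tolist())  # [200, 400, 600, 800, 1000, 0]
```

Because the round is part of the block boundary, a Final Jeopardy category that happens to share its name with the next show's first category is not merged into that category.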

Interesting problems were at fault here! For element 4257, which was a Final Jeopardy question, its category—EUROPE—was the same as the first category for the next show, so the code saw them all as one category. There's no need for an edit here, but I show the cells below to verify. The other two problematic values occurred for episodes that contained "bonus" rounds, where contestants could give either one or two of the two possible answers. I'll edit these manually since there are only two of these rounds:

In [None]:
data.iloc[4257:4263]

In [None]:
data.iloc[77347:77352]

In [None]:
data.iloc[77347, data.columns.get_loc("value")] = 400

In [None]:
data.iloc[79592:79597]

In [None]:
data.iloc[79592, data.columns.get_loc("value")] = 400

Now, just as a final sanity check, let's make sure that every category has *unique* point values and adjust any that don't:

In [None]:
data.tail(6)

In [None]:
index = 0
while(index < 349636):        #since we see above there is no problem with the last round
    while data["round"].iloc[index] == 3: #some Final Jeopardy questions appear successively without any round content
        index += 1
    current_category = data["category"].iloc[index]
    indexes_in_category = [index]
    index += 1
    while(data["category"].iloc[index] == current_category):
        indexes_in_category.append(index)
        index += 1
    for i in indexes_in_category:
        possible_values = ROUND_ONE_VALUES.copy() if data["round"].iloc[i] == 1 else ROUND_TWO_VALUES.copy()
        for j in indexes_in_category:
            if data["value"].iloc[j] in possible_values:
                possible_values.remove(data["value"].iloc[j])
            else:
                print("Problem removing value from line ", j, end= "\n")
                break

In [None]:
data.iloc[44064:44067]

This is simply a typo in the J-Archive data, and I will fix it:

In [None]:
data.iloc[44065, data.columns.get_loc("value")] = 400
data.iloc[44064:44067]
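
The uniqueness check can also be written without an explicit loop. The sketch below runs on a toy frame; the `block` column is a helper introduced here, built by starting a new block whenever the category or round changes.

```python
import pandas as pd

toy = pd.DataFrame({
    "round": [1, 1, 1, 2, 2],
    "category": ["RIVERS", "RIVERS", "RIVERS", "OPERA", "OPERA"],
    "value": [200, 200, 600, 400, 800],
})

# Label each run of consecutive rows that share a category and a round.
new_block = (toy["category"] != toy["category"].shift()) | (
    toy["round"] != toy["round"].shift()
)
toy["block"] = new_block.cumsum()

# keep=False marks every row involved in a repeated (block, value) pair.
duplicates = toy[toy.duplicated(["block", "value"], keep=False)]
print(duplicates.index.tolist())  # [0, 1]
```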

All set! There are some further changes I will make to the data for reasons idiosyncratic to my use case, but I'm exporting this set as `jeopardy_clean.csv` for anyone else to use. I think many would find the clean-ups beneficial. (Plus, I'll likely need to come back to this as a starting point for any changes, and I don't want to rerun the above code every time!)

In [None]:
data.to_csv("jeopardy_clean.csv", index = False)

## App-Specific Wrangling
Thinking ahead about implementation, I'd like to make a few little fixes. 

First, since I really don't care about the exact air date of each question, but I *do* care about the relative air dates (I'd like to prioritize newer questions), **I'm going to remove the air_date column**. The relative order of air dates is contained in the DataFrame index. Indeed, since database queries will access questions from top to bottom, **I'm going to reverse the vertical order of the entries**—with the caveat (CC) introduced below.

Second, I really don't care whether a clue was a Daily Double question, for two reasons: one, since I'm going to be pulling from questions across many categories anyways, it might be disruptive and unnecessarily costly to preserve this data; and two, and more importantly, the game has changed such that the location and relative difficulty of Daily Double questions seem random. So, **I'm going to drop the daily_double column**.

Third, I don't much care which round a question appeared in. I do, however, care whether it was a Final Jeopardy question, since the syntax of those questions (and their difficulty) tends to differ from that of the normal rounds. I could add a separate column to track this (and remove the round column), but what **I'm going to do instead is create a separate DataFrame with just Final Jeopardy questions**. My rationale for this is speed: I'll only want to search for Final Jeopardy questions *when it's time for Final Jeopardy*, so there's no sense in keeping an unneeded extra column (round) or in keeping unneeded extra clues (the Final Jeopardy clues) in the same data structure as the normal clues. **I'll remove the round column after that separation**.

Fourth, there are some clues that originally had audio and video accompaniments that are not replicated here. Often these clues contain the phrase "heard here" or "seen here." **I will remove clues that contain these phrases.**

Finally, I also don't *really* care how much each question is worth; I only care about the relative order, which is contained in the index. So, **I'm going to drop the value column**. I do care, however, whether questions were in the same category group. Because category names are not unique, **I'm going to give each category from each game a distinct category_id**. (CC) Because I'll be flipping the vertical order of the clues, **I'll first flip the vertical order of the clues in each category**, such that after overall flipping, the harder (higher-valued) questions appear lower in the database.

So, let's go ahead and make these final changes:
1. Remove the daily_double column
2. Give each category a unique category_id
3. Flip the vertical order of questions in each category
4. Remove the value column
5. Remove the air_date column
6. Flip the overall vertical order of clues
7. Remove any clues with "heard here" or "seen here" in the answer
8. Separate regular and Final Jeopardy questions
9. Remove the round column

In [None]:
data = pd.read_csv("jeopardy_clean.csv")

In [None]:
del data["daily_double"] #1

data["air_date"] = pd.to_datetime(data["air_date"], format = "%m/%d/%Y") #so that ids are in order by date

data["category_id"] = data.groupby(["air_date", "category"]).ngroup() + 1 #2

data.sort_values(by=["category_id", "value"], ascending=[False, False], inplace = True) #3 and #6

del data["value"] #4
del data["air_date"] #5

bad_strings = ["HEARD HERE", "SEEN HERE", "SHOWN HERE", "PICTURED HERE"]
for string in bad_strings:
    data = data[~data["answer"].str.contains(string)] #7

normal_questions = data.loc[data["round"] != 3].copy() #8 (copy so the column deletions below act on independent frames)
final_questions = data.loc[data["round"] == 3].copy()

del normal_questions["round"] #9
del final_questions["round"]

normal_questions.reset_index(inplace = True, drop = True)
final_questions.reset_index(inplace = True, drop = True)

normal_questions.head(8)

In [None]:
final_questions.head(5)

So that I need not run this processing code every time I want to use the cleaned data, let's go ahead and save these DataFrames in new .csv files called `jeopardy_normal.csv` and `jeopardy_final.csv`.

In [None]:
normal_questions.to_csv("jeopardy_normal.csv", index = False)
final_questions.to_csv("jeopardy_final.csv", index = False)

In [None]:
normals = pd.read_csv("jeopardy_normal.csv")
finals = pd.read_csv("jeopardy_final.csv")

Finally, let's see how many total clues, categories, and Final Jeopardy questions we have in this set:

In [None]:
num_categories = len(normals["category_id"].unique())
num_finals = len(finals)
num_clues = len(normals)
print("There are {:,} clues across {:,} categories and {:,} Final Jeopardy questions"\
          .format(num_clues,        num_categories,     num_finals), end=".")

## Preparation for Database Querying
The fact that I have separate columns for category, answer, and question is not especially memory-intensive. Namely, I have to separate the values *somehow*, and it's no costlier to store a comma than it is to store any other character. However, when I run database queries, it *is* costlier and requires unnecessarily more lines of code to run three queries (on category, answer, and question) and then merge the results than it is to just run one query for matches on the aggregate of those three values. So, I'm going to aggregate the data into just one block of text, where the category, question, and answer are separated by a `|` symbol.
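
As a rough illustration of the single-query idea, the toy example below builds one combined field, searches it once, and splits a match back into its parts. The sample clues are made up, and the split assumes that no field itself contains a `|`.

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["WORLD CAPITALS", "PETS"],
    "answer": ["THIS CITY ON THE DANUBE IS HUNGARY'S CAPITAL", "A YOUNG DOG"],
    "question": ["BUDAPEST", "PUPPY"],
})

# One combined text field per clue.
toy["agg"] = toy["category"] + "|" + toy["answer"] + "|" + toy["question"]

# A single substring search now covers all three fields at once.
matches = toy[toy["agg"].str.contains("DOG", regex=False)]

# The pieces can be recovered by splitting on the separator.
category, answer, question = matches["agg"].iloc[0].split("|")
print(category, answer, question, sep=" / ")  # PETS / A YOUNG DOG / PUPPY
```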

In [None]:
finals["agg"] = finals["category"] + "|" + finals["answer"] + "|" + finals["question"]
del finals["category"]
del finals["answer"]
del finals["question"]
finals.head()

In [None]:
finals.to_csv("jeopardy_final_agg.csv", index = False)

In [None]:
normals["agg"] = normals["category"] + "|" + normals["answer"] + "|" + normals["question"]
del normals["category"]
del normals["answer"]
del normals["question"]
normals.head()

First iterations of my implementation of the game allowed for the board to have fewer than 25 clues. For example, in the VENICE category in the above output, on the show in which that category was played, only three of the clues were revealed. I figured it would be better (and more inclusive) to give those categories a chance, but it just looks gross, especially when there are only 1 or 2 clues in a category. So, I'm going to assign any clues that belong to an incomplete category a category_id of 0, which I can then exclude from database searches when desired. Notably, this still allows clues from incomplete categories to appear in games in which the user does not opt to keep categories intact.

In [None]:
normals["freq"] = normals.groupby("category_id")["category_id"].transform("count")
normals.loc[normals["freq"] < 5, "category_id"] = 0
del normals["freq"]
normals.head()

Because there are limitations on file size for uploading to the database on Back4App, I'm going to split this set into 4 chunks. Whether the chunks split cleanly between categories is not relevant, since they'll be reaggregated in the database. New imports stack atop old imports, so chunk 1 will be the bottom quarter and chunk 4 the top quarter.

Notably, I could always just use a sort on the category_ids, but I'm not super sure how efficient the database I'm using is!

In [None]:
size = len(normals)
for quarter in range(1, 5):
    chunk = normals.iloc[int((quarter-1)*size/4) : int(quarter*size/4), ]
    chunk.to_csv("jeopardy_normal_agg" + str(5-quarter) + ".csv", index = False)
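
To see that this slicing arithmetic covers every row exactly once even when the row count is not divisible by 4, here is a small sketch. The generator `chunk_bounds` is a hypothetical helper that reproduces the same index and file-suffix calculations without touching any files.

```python
def chunk_bounds(size, parts=4):
    """Yield (file_suffix, start, stop) using the same arithmetic as the loop above."""
    for part in range(1, parts + 1):
        start = int((part - 1) * size / parts)
        stop = int(part * size / parts)
        # The first slice (top of the table) gets the highest suffix.
        yield parts + 1 - part, start, stop

bounds = list(chunk_bounds(10))
print(bounds)  # [(4, 0, 2), (3, 2, 5), (2, 5, 7), (1, 7, 10)]
```

Each slice starts where the previous one stops, so no clue is dropped or duplicated; the chunks simply differ in size by at most one row.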

That does it for the preparatory steps.

## The Web Application
You can view the work-in-progress game [here](http://www.jacobtbigham.com/jeopardy) and most of the source code [here](https://github.com/jacobtbigham/jeopardy/blob/master/website_source_code.html).

One of the toughest parts of the implementation was scaling. Because the text to display constantly changes as the game proceeds (and cannot be known ahead of time), the size of text and other display elements has to be changed continually throughout a game. Indeed, you can find a great number of methods in the JavaScript code that are devoted specifically to resizing.

Another difficult component of the game is answer validation. Because the question and answer data rely on user input from J-Archive, there's no guarantee that all answers will be spelled correctly or formatted uniformly (e.g., if an answer is "seventeen," can we be sure that the database text will be the word "seventeen," or will it be just the numeral "17"?). To deal with some deviations in spelling, both in player input and in the database, I've allowed a tolerance based on edit distance and answer length. I've also removed some common stop-words from answers (like "the," "a," etc.) and accepted answers if only the last name of an individual is input.
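
A minimal Python sketch of this kind of validation is shown below. It is not the app's JavaScript: the stop-word list, the 20% `tolerance`, and the function names are assumptions chosen for illustration.

```python
import re

STOP_WORDS = {"THE", "A", "AN"}  # hypothetical stop-word list

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalize(text):
    # Uppercase, drop punctuation, and remove stop-words.
    words = re.sub(r"[^A-Z0-9 ]", "", text.upper()).split()
    return " ".join(word for word in words if word not in STOP_WORDS)

def is_correct(guess, answer, tolerance=0.2):
    """Accept a guess within `tolerance` * length edits of the answer,
    or of the answer's final word (a stand-in for a last name)."""
    guess, answer = normalize(guess), normalize(answer)
    if not guess or not answer:
        return False
    candidates = [answer]
    if " " in answer:
        candidates.append(answer.split()[-1])
    return any(levenshtein(guess, c) <= int(tolerance * len(c)) for c in candidates)

print(is_correct("Madeline Albright", "MADELEINE ALBRIGHT"))  # True (one letter off)
print(is_correct("albright", "MADELEINE ALBRIGHT"))           # True (last name only)
print(is_correct("the Nile", "NILE"))                         # True (stop-word removed)
print(is_correct("Orban", "ERDOGAN"))                         # False
```

Scaling the allowed distance with answer length keeps short answers strict while forgiving typos in long ones. One trade-off of the last-word shortcut is that it applies to any multi-word answer, not just names, so "York" would pass for "NEW YORK".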

There remains much to improve and develop:
- Build a more robust system for verifying players' answer inputs
- Implement a cookie system that tracks how many times users have played so that different portions of the database are searched (and questions are less likely to repeat)
- Improve cross-browser compatibility and display uniformity
- Implement a high-score system, both for random questions and specific keywords