# Gaining an Edge for Jeopardy Questions

The goal of this project is to identify patterns in past Jeopardy questions that could help a contestant win Jeopardy.

The dataset is called `jeopardy.csv` and can be downloaded __[here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/)__. Each row in the dataset represents a single question on a single episode of Jeopardy.

In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')

print(jeopardy.shape)
jeopardy.head()

(216930, 7)


| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


Here are explanations for each column:

* `Show Number` -- the Jeopardy episode number of the show this question was in.
* `Air Date` -- the date the episode aired.
* `Round` -- the round of Jeopardy that the question was asked in.
* `Category` -- the category of the question.
* `Value` -- the dollar amount awarded for answering the question correctly.
* `Question` -- the text of the question.
* `Answer` -- the text of the answer.

In [2]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [3]:
# Remove the spaces in each item in jeopardy.columns
jeopardy.columns = [x.strip() for x in jeopardy.columns]
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

### Normalizing columns

Before analyzing the Jeopardy questions, one needs to normalize the text columns (the `Question` and `Answer` columns). The idea behind normalization is to make all words uniformly lowercase and to remove any punctuation, so that variants such as `Don't` and `don't` are treated as the same word.

Below is a function that normalizes text by:
* taking in a string
* converting the string to lowercase
* removing all punctuation
* returning the string

Similarly, the `Value` column should be numeric and the `Air Date` column should be a datetime rather than a string, so that both are easier to work with.

In [4]:
import string

In [5]:
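def normalize(s):
    """Return the input as a lowercase string with punctuation removed."""
    s = str(s)
    s = s.lower()
    # Delete every character listed in string.punctuation
    s = s.translate(str.maketrans('', '', string.punctuation))
    return s


def normalize_value(v):
    """Convert a dollar string such as '$200' to an int; return 0 if it isn't a number."""
    v = v.replace("$", "")
    v = v.replace(",", "")
    try:
        v = int(v)
    except ValueError:
        # Non-numeric entries are treated as 0
        v = 0
    return v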

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [7]:
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |


In [8]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

### Answers in questions

In order to figure out whether one should study past questions, general knowledge, or not study at all, it is helpful to know two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

One can answer the first question by seeing how many times words in the answer also occur in the question. One can answer the second question by seeing how often complex words (6 or more characters) reoccur.

The code block below aims to answer the first question.

In [9]:
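def count_matches(row):
    """Return the fraction of answer words that also appear in the question."""
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    # 'the' is very common and carries no information, so drop one occurrence
    if 'the' in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when the answer has no words left
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)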

In [10]:
jeopardy['answer_in_question'].mean()

0.05789123355910071

The mean above implies that, on average, fewer than 6% of the words in an answer also appear in its question. This suggests that one will not reliably be able to hear the question and figure out the answer from it. In other words, it will be necessary to study in order to win Jeopardy.
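To make the `answer_in_question` metric concrete, here is a minimal standalone sketch that applies the same logic to two made-up question and answer pairs. The helper name `answer_fraction_in_question` and the example rows are hypothetical and are not taken from `jeopardy.csv`.

```python
import string

def normalize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return str(text).lower().translate(str.maketrans('', '', string.punctuation))

def answer_fraction_in_question(question, answer):
    """Fraction of answer words (ignoring the first 'the') found in the question."""
    question_words = set(normalize(question).split())
    answer_words = normalize(answer).split()
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(word in question_words for word in answer_words)
    return matches / len(answer_words)

# Made-up rows for illustration only; they are not taken from jeopardy.csv.
examples = [
    ("This New York city is the state capital", "Albany, New York"),
    ("He was the first U.S. president", "George Washington"),
]
scores = [answer_fraction_in_question(q, a) for q, a in examples]
print(scores)
```

In the first pair, two of the three answer words occur in the question, so the score is 2/3. In the second pair no answer word occurs in the question, so the score is 0. The dataset-wide mean is the average of these per-row fractions.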

### Recycled questions

The code block below aims to answer the second question posed above.

In [12]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only words of 6 or more characters
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    # Count the words that already appeared in an earlier question
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.8717149005532044

The mean above suggests that, on average, about 87% of the longer terms in a question had already appeared in an earlier question. This measure only looks at single words of 6 or more characters, not at phrases, so on its own it is weak evidence that questions are repeated. It does suggest that the recycling of questions is worth investigating further.
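One possible way to look at phrases rather than single words is to compare adjacent word pairs (bigrams) instead. The sketch below is only an illustration under that assumption: the function names `bigrams` and `phrase_overlap` and the toy questions are hypothetical, and the input is assumed to be already normalized and sorted by air date.

```python
def bigrams(text):
    """Return the set of adjacent word pairs in an already-normalized string."""
    words = text.split()
    return set(zip(words, words[1:]))

def phrase_overlap(questions):
    """For each question (in chronological order), return the fraction of its
    bigrams that already appeared in an earlier question."""
    seen = set()
    overlaps = []
    for question in questions:
        pairs = bigrams(question)
        if pairs:
            overlaps.append(len(pairs & seen) / len(pairs))
        else:
            overlaps.append(0)
        # Only add the pairs after scoring, so a question never matches itself
        seen |= pairs
    return overlaps

# Made-up, pre-normalized questions for illustration only.
toy_questions = [
    "this state capital is on the hudson river",
    "the hudson river flows past this city",
    "this planet is closest to the sun",
]
overlaps = phrase_overlap(toy_questions)
print(overlaps)
```

Only the second toy question shares phrases with an earlier one (`the hudson` and `hudson river`, two of its six bigrams), so it is the only one with a non-zero overlap.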

### Low value vs high value questions

Suppose one wants to study only the topics that appear in high value questions rather than low value questions. This would help one earn more money on Jeopardy.

One can figure out which terms correspond to high-value questions using a chi-squared test.

* Low value -- Any row where `Value` is `800` or less.
* High value -- Any row where `Value` is greater than `800`.

One can loop through each of the terms from the `terms_used` set above, and:

* Find the number of low value questions that word occurs in.
* Find the number of high value questions that word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions that word occurs in, find expected counts.
* Compute the chi-squared value based on the expected counts and the observed counts for high and low value questions.

One can then find the words with the biggest difference in usage between high and low value questions, by selecting the words with the highest associated chi-squared values.
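The chi-squared value referred to above is the standard Pearson statistic. As a reference sketch, using notation that is introduced here and does not appear in the notebook's code ($O_i$, $E_i$, $N_i$, $N$):

$$
\chi^2 = \sum_{i \in \{\text{high},\,\text{low}\}} \frac{(O_i - E_i)^2}{E_i},
\qquad
E_i = \frac{O_{\text{high}} + O_{\text{low}}}{N}\, N_i
$$

Here $O_i$ is the number of questions in value class $i$ that contain the term, $N_i$ is the total number of questions in class $i$, and $N = N_{\text{high}} + N_{\text{low}}$ is the total number of questions. With two classes the statistic has one degree of freedom.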

The code block below provides a function to calculate the chi-squared value associated with terms. Applying this function to all words would take a very long time, so it will just be applied to a small sample for now.

In [13]:
def determine_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

In [14]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [15]:
from random import choice

In [17]:
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))
    
observed_expected

[(0, 2),
 (1, 0),
 (1, 0),
 (2, 0),
 (0, 1),
 (0, 1),
 (3, 8),
 (0, 1),
 (1, 0),
 (1, 0)]

In [18]:
from scipy.stats import chisquare
import numpy as np

In [19]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.7899529284667026, pvalue=0.3741143592744989),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=5.063592849467617, pvalue=0.02443353405878706),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.005878321230796754, pvalue=0.9388859030670194),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751)]

### Chi-squared results

Only one of the ten sampled terms had a p-value below 0.05 (about 0.024), and that term occurs in just two questions, so the result is not reliable. More generally, nearly all of the observed counts were lower than 5, and the chi-squared test is not dependable at such low frequencies. It would be better to run this test only on terms that have higher frequencies.

### Next Steps

Here are some potential next steps:

* Find a better way to eliminate non-informative words than just removing words that are less than `6` characters long. Some ideas:
    * Manually create a list of words to remove, like `the`, `than`, etc.
    * Find a list of stopwords to remove.
    * Remove words that occur in more than a certain percentage (like `5%`) of questions.
* Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    * Use the `apply` method to make the code that calculates frequencies more efficient.
    * Only select terms that have high frequencies across the dataset, and ignore the others.
* Look more into the `Category` column and see if any interesting analysis can be done with it. Some ideas:
    * See which categories appear the most often.
    * Find the probability of each category appearing in each round.
* Use phrases instead of single words when seeing if there's overlap between questions.
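As a rough sketch of the efficiency ideas above, the per-term counts could be collected in a single pass over the questions instead of one full pass per term, and terms with low expected counts could be skipped. The function names, the `min_expected=5` threshold (a common rule of thumb, assumed here), and the toy rows are all hypothetical. The statistic is computed by hand so that the example has no dependencies.

```python
from collections import Counter

def count_terms_by_value(rows, min_length=6):
    """Count, in one pass, how many high- and low-value questions contain each term.

    `rows` is an iterable of (clean_question, high_value) pairs, where
    high_value is 1 for high-value questions and 0 otherwise.
    """
    high_counts = Counter()
    low_counts = Counter()
    for question, high_value in rows:
        # Use a set so a term is counted at most once per question
        terms = {word for word in question.split() if len(word) >= min_length}
        if high_value == 1:
            high_counts.update(terms)
        else:
            low_counts.update(terms)
    return high_counts, low_counts

def chi_squared_for_terms(high_counts, low_counts, n_high, n_low, min_expected=5):
    """Return {term: chi-squared statistic}, skipping terms whose expected
    count in either class is below `min_expected`."""
    total_rows = n_high + n_low
    results = {}
    for term in set(high_counts) | set(low_counts):
        observed = (high_counts[term], low_counts[term])
        total = sum(observed)
        expected = (total * n_high / total_rows, total * n_low / total_rows)
        if min(expected) < min_expected:
            continue
        results[term] = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return results

# Made-up rows for illustration only: 10 high-value and 10 low-value questions.
toy_rows = (
    [("ancient emperor ruled", 1)] * 10
    + [("ancient village market", 0)] * 9
    + [("ancient village obscure", 0)]
)
high_counts, low_counts = count_terms_by_value(toy_rows)
chi = chi_squared_for_terms(high_counts, low_counts, n_high=10, n_low=10)
print(sorted(chi.items()))
```

In the notebook, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`. In the toy data, `market` and `obscure` are dropped because their expected counts fall below the threshold, while `ancient` appears equally in both classes and therefore gets a statistic of 0.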