# Winning Jeopardy

I will pretend I want to compete on Jeopardy and am looking for any edge I can get to win. I will work with a dataset of Jeopardy questions to find patterns in the questions that could help me. The dataset, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file), contains about 20,000 rows, where each row corresponds to a single question on a single episode of Jeopardy. Here are the explanations for each column:

- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the number of dollars answering the question correctly is worth.
- Question -- the text of the question.
- Answer -- the text of the answer.


In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv', parse_dates = [' Air Date'])
print('\nColumns: ', jeopardy.columns)
jeopardy.head()


Columns:  Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
# Remove all spaces from the column names (most of them start with a space)
jeopardy.columns = jeopardy.columns.str.replace(' ', '')

In [3]:
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [4]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
ShowNumber    19999 non-null int64
AirDate       19999 non-null datetime64[ns]
Round         19999 non-null object
Category      19999 non-null object
Value         19999 non-null object
Question      19999 non-null object
Answer        19999 non-null object
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.1+ MB


I will now normalize the Question and Answer columns by making all text lowercase and removing punctuation.

In [5]:
import re

def normalize_text(text):
    text = text.lower()
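    # Raw strings keep \s from being read as an (invalid) string escape
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)  # drop punctuation
    text = re.sub(r"\s+", " ", text)            # collapse repeated whitespace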
    return text

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

In [6]:
jeopardy[['clean_question', 'clean_answer']].head()

Unnamed: 0,clean_question,clean_answer
0,for the last 8 years of his life galileo was u...,copernicus
1,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,the city of yuma in this state has a record av...,arizona
3,in 1963 live on the art linkletter show this c...,mcdonalds
4,signer of the dec of indep framer of the const...,john adams
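
As a quick sanity check of what the normalization does, here is a minimal self-contained sketch that applies the same two substitutions to a made-up string. The sample text is my own illustration, not a row from the dataset.

```python
import re

def normalize_text(text):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

# Made-up example string (note the double space and the punctuation)
sample = 'In 1963, live on "The Art  Linkletter Show"!'
cleaned = normalize_text(sample)
print(cleaned)
```

Note that apostrophes are removed rather than replaced by a space, which is why McDonald's shows up as mcdonalds in the table above.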


I will also normalize the Value column by removing the $ and converting it to an int. Any value that can't be converted becomes 0.

In [7]:
def normalize_values(val):
    val = re.sub(r"[^A-Za-z0-9\s]", "", val)  # strip the $ and any commas
    try:
        val = int(val)
    except ValueError:  # non-numeric text is treated as 0
        val = 0

    return val

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
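
To make the fallback behavior concrete, here is a small sketch of the same conversion run on a few hand-written strings. The inputs are assumptions for illustration; in particular, I am assuming that non-numeric entries look like the text "None".

```python
import re

def normalize_values(val):
    # Strip punctuation such as "$" and ",", then try to parse an integer.
    val = re.sub(r"[^A-Za-z0-9\s]", "", val)
    try:
        return int(val)
    except ValueError:
        # Anything that is not a plain number falls back to 0
        return 0

# Hypothetical raw values, not taken from the dataset
examples = ["$200", "$2,000", "None"]
parsed = [normalize_values(v) for v in examples]
print(parsed)
```

Because unparseable values become 0, they end up grouped with the lowest-value questions in any later analysis, which is worth keeping in mind.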


In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
- how often the answer is deducible from the question
- how often new questions are repeats of older questions

You can answer the second question by seeing how often complex words (6 or more characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. I'll work on the first question for now, and come back to the second.

So I will now write a function that takes in a row of jeopardy and computes the fraction of words in the answer that also occur in the question.

In [8]:
def count_matches(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    
    if 'the' in split_answer:
        split_answer.remove('the')  # 'the' is commonly found in answers and questions, 
                                    #  but doesn't have any meaningful use in finding the answer
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)
mean_ans_in_q = jeopardy['answer_in_question'].mean()
print(mean_ans_in_q)

0.05898946462474639


On average, only about 6% of the words in an answer also appear in the question. Thus, the answer is usually not deducible from the question, and I'll probably have to study.
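
To show what a single row contributes to that average, here is a sketch of the same matching logic applied to one hand-made row. The row is a plain dict with invented text, standing in for a DataFrame row.

```python
def count_matches(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')

    # Drop one occurrence of 'the', as in the function above
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0

    # Fraction of the remaining answer words that appear in the question
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Invented example row (not from the dataset)
row = {
    'clean_answer': 'the great wall',
    'clean_question': 'this great structure is visible in china',
}
score = count_matches(row)
print(score)
```

Here only "great" out of "great" and "wall" appears in the question, so the row scores one half.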

I will now investigate how often new questions are repeats of older ones. I can't completely answer this, as the dataset only contains about 10% of the total data from Jeopardy, but I can at least investigate it.

To accomplish this, I will:
- sort jeopardy in order of ascending air date
- maintain a set called 'terms_used' that will initially be empty
- iterate through each row of jeopardy
- split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used
    - if it does, increment a counter
    - add each word to terms_used

In [9]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('AirDate')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [j for j in split_question if len(j) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6894031359073245

There is about a 69% overlap between terms in new questions and terms in older questions. This only looks at a small set of questions, and it compares individual words rather than phrases, so on its own it is weak evidence that questions are recycled. It does mean that it's worth looking more into the recycling of questions.
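
To see how the overlap score behaves, here is a sketch of the same loop on three invented questions, written as a standalone function. The function name, the `min_length` parameter and the sample questions are my own, not part of the notebook.

```python
def question_overlap_fractions(questions, min_length=6):
    """For each question, the fraction of its long words already seen earlier."""
    terms_used = set()
    fractions = []
    for question in questions:
        words = [w for w in question.split(" ") if len(w) >= min_length]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        # Questions with no long words get an overlap of 0
        fractions.append(match_count / len(words) if words else 0)
    return fractions

# Invented, already-normalized questions in air-date order
questions = [
    "this mountain is in nepal",
    "the highest mountain in africa",
    "a river in africa",
]
fractions = question_overlap_fractions(questions)
print(fractions)
```

The first question always scores 0 because nothing has been seen yet, and scores tend to rise as `terms_used` grows. Part of the overlap is therefore a side effect of vocabulary accumulating over time rather than of questions being repeated.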

### Low value vs high value questions:
Let's say I only want to study terms that appear in high value questions rather than low value questions. This will help me earn more money. I can figure out which terms correspond to high value questions using a chi-squared test.

First I will categorize the questions by low or high value:
- Low value if value < 800
- High value if value >= 800

Then I can loop through each of the terms in terms_used, and:
- find the number of low value questions the word occurs in
- find the number of high value questions the word occurs in
- find the percentage of questions the word appears in
- find expected counts from the percentage of questions the word occurs in
- compute the chi-squared value based on the expected counts and the observed counts for high and low value questions
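
The steps above can be sketched for a single word in plain Python, without scipy. The function name and all of the counts below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def chi_squared_for_word(high_obs, low_obs, high_total, low_total):
    """Chi-squared statistic for one word, given its observed counts.

    high_obs / low_obs: questions containing the word in each group.
    high_total / low_total: total number of questions in each group.
    Assumes the word occurs at least once, so expected counts are non-zero.
    """
    total_rows = high_total + low_total
    # Proportion of all questions that contain the word
    word_prop = (high_obs + low_obs) / total_rows
    # Expected counts if the word were spread evenly across both groups
    high_exp = word_prop * high_total
    low_exp = word_prop * low_total
    return ((high_obs - high_exp) ** 2 / high_exp
            + (low_obs - low_exp) ** 2 / low_exp)

# Hypothetical numbers: 40 high value and 60 low value questions,
# with the word appearing in 8 high value and 2 low value questions.
stat = chi_squared_for_word(8, 2, 40, 60)
print(stat)
```

With these numbers the expected counts are 4 and 6, so the statistic is 16/4 + 16/6. A larger statistic means the word's split between high and low value questions is further from what the overall split would predict.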

In [10]:
def high_value(row):
    value = 0
    if row['clean_value'] >= 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(high_value, axis=1)

In [11]:
def high_low_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [12]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]
observed_expected = []

for word in comparison_terms:
    observed_expected.append(high_low_count(word))


In [13]:
observed_expected

[(0, 1),
 (2, 0),
 (1, 0),
 (0, 1),
 (2, 2),
 (0, 1),
 (0, 1),
 (1, 0),
 (0, 1),
 (2, 0)]

In [14]:
# Compute expected counts:
from scipy.stats import chisquare
import numpy as np

high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])

chi_squared = []
for i in observed_expected:
    total = sum(i)
    total_prop = total / jeopardy.shape[0]
    high_exp = total_prop * high_value_count
    low_exp = total_prop * low_value_count

    observed = np.array([i[0], i[1]])
    expected = np.array([high_exp, low_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.7721754541426672, pvalue=0.3795448984353682),
 Power_divergenceResult(statistic=2.590084920817076, pvalue=0.10753457004402507),
 Power_divergenceResult(statistic=1.295042460408538, pvalue=0.25512076479610835),
 Power_divergenceResult(statistic=0.7721754541426672, pvalue=0.3795448984353682),
 Power_divergenceResult(statistic=0.06721791455120528, pvalue=0.7954314156295934),
 Power_divergenceResult(statistic=0.7721754541426672, pvalue=0.3795448984353682),
 Power_divergenceResult(statistic=0.7721754541426672, pvalue=0.3795448984353682),
 Power_divergenceResult(statistic=1.295042460408538, pvalue=0.25512076479610835),
 Power_divergenceResult(statistic=0.7721754541426672, pvalue=0.3795448984353682),
 Power_divergenceResult(statistic=2.590084920817076, pvalue=0.10753457004402507)]

None of the terms had a significant difference in usage between high value and low value rows (every p-value is above 0.05). Additionally, every observed count was 2 or lower, and the chi-squared test is not reliable with counts this small. It would be better to run this test with only terms that have higher frequencies.
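
One way to pick higher-frequency terms is to count every term in a single pass and keep only those above a threshold. The sketch below does that on invented data; the function names, the `min_total` threshold and the sample rows are all assumptions, not part of the analysis above.

```python
from collections import Counter

def term_counts(rows, min_length=6):
    """Count, per term, how many high and low value questions contain it."""
    high_counts, low_counts = Counter(), Counter()
    for clean_question, is_high in rows:
        # A set, so each word is counted at most once per question
        words = {w for w in clean_question.split(" ") if len(w) >= min_length}
        target = high_counts if is_high else low_counts
        target.update(words)
    return high_counts, low_counts

def frequent_terms(high_counts, low_counts, min_total=5):
    """Terms that occur in at least min_total questions overall."""
    totals = high_counts + low_counts
    return sorted(term for term, n in totals.items() if n >= min_total)

# Invented (clean_question, high_value) pairs
rows = [
    ("the capital of france", 0),
    ("capital letters", 1),
    ("capital punishment", 1),
]
high_counts, low_counts = term_counts(rows)
terms = frequent_terms(high_counts, low_counts, min_total=3)
print(terms, high_counts["capital"], low_counts["capital"])
```

This also avoids looping over the whole DataFrame once per word, as `high_low_count` does, so it should scale better to testing many terms.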
