# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the dollar amount awarded for answering the question correctly.
- Question -- the text of the question.
- Answer -- the text of the answer.

In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
jeopardy.columns = jeopardy.columns.str.strip()

## Normalizing Text and Columns

In [4]:
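import re

def normalize_text(text):
    # Remove punctuation (anything that is not a letter, digit, or whitespace), then lowercase.
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    text = text.lower()
    return text

def normalize_value(text):
    # Remove punctuation such as "$" and ",", then convert to an integer.
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    try:
        value = int(text)
    except ValueError:
        # Values that are not numeric are treated as 0.
        value = 0
    return value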

In [5]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
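
To see what the two helpers do, here is a small self-contained sketch that applies them to a few strings. The `"$2,000"` and `"None"` inputs are made-up examples rather than values confirmed to be in the dataset.

```python
import re

def normalize_text(text):
    # Drop everything except letters, digits, and whitespace, then lowercase.
    return re.sub(r"[^A-Za-z0-9\s]", "", text).lower()

def normalize_value(text):
    # Drop punctuation such as "$" and ",", then try to parse an integer.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0  # anything non-numeric falls back to 0

print(normalize_text("McDonald's"))  # mcdonalds
print(normalize_value("$2,000"))     # 2000 (hypothetical input)
print(normalize_value("None"))       # 0 (hypothetical input)
```

Note that removing the apostrophe merges "McDonald's" into a single token, which matters later when answers are compared with questions word by word.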

In [6]:
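# Store the converted column; pd.to_datetime returns a new Series and does not modify the DataFrame.
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy['Air Date']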

0       2004-12-31
1       2004-12-31
2       2004-12-31
3       2004-12-31
4       2004-12-31
5       2004-12-31
6       2004-12-31
7       2004-12-31
8       2004-12-31
9       2004-12-31
10      2004-12-31
11      2004-12-31
12      2004-12-31
13      2004-12-31
14      2004-12-31
15      2004-12-31
16      2004-12-31
17      2004-12-31
18      2004-12-31
19      2004-12-31
20      2004-12-31
21      2004-12-31
22      2004-12-31
23      2004-12-31
24      2004-12-31
25      2004-12-31
26      2004-12-31
27      2004-12-31
28      2004-12-31
29      2004-12-31
           ...    
19969   2009-05-14
19970   2009-05-14
19971   2009-05-14
19972   2009-05-14
19973   2009-05-14
19974   2009-05-14
19975   2009-05-14
19976   2009-05-14
19977   2009-05-14
19978   2009-05-14
19979   2009-05-14
19980   2009-05-14
19981   2009-05-14
19982   2009-05-14
19983   2009-05-14
19984   2009-05-14
19985   2009-05-14
19986   2009-05-14
19987   2009-05-14
19988   2000-03-14
19989   2000-03-14
19990   2000

## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

In [7]:
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)


In [8]:
jeopardy['answer_in_question'].mean()

0.06049325706933587
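
A mean of about 0.06 says that, on average, only around 6% of the words in an answer also appear in its question. As a rough illustration of how that per-row score is computed, the sketch below runs the same function on a few hand-written rows. The rows are plain dictionaries made up for this example; the function only indexes by key, so they stand in for DataFrame rows.

```python
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")  # removes only the first "the"
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Hypothetical rows, for illustration only.
half = {"clean_answer": "john adams",
        "clean_question": "john quincy was the son of this president"}
none = {"clean_answer": "arizona",
        "clean_question": "the city of yuma in this state has a record"}
only_the = {"clean_answer": "the", "clean_question": "the article"}

print(count_matches(half))      # 0.5 ("john" matches, "adams" does not)
print(count_matches(none))      # 0.0
print(count_matches(only_the))  # 0 (nothing left after dropping "the")
```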

In [9]:
question_overlap = []
terms_used = set()
jeopardy = jeopardy.sort_values("Air Date")

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    
    for item in split_question:
        if item in terms_used:
            match_count += 1
            
    for item in split_question:
        terms_used.add(item)
        
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
print(jeopardy['question_overlap'].mean())


0.6876260592169802


## Question Overlap

On average, about 69% of the terms in a question (counting only words of six or more characters) have already appeared in an earlier question. This result covers only a small subset of the questions, and it compares single terms rather than phrases, so on its own it is weak evidence that questions are recycled. It does suggest that question recycling is worth investigating further.
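
To make the overlap measure concrete, here is a minimal sketch of the same loop on three made-up questions, assuming they are already in air-date order. The function name `term_overlap` and the sample questions are illustrative and not part of the original notebook.

```python
def term_overlap(questions, min_length=6):
    """Return, per question, the share of long terms already seen earlier."""
    terms_used = set()
    overlaps = []
    for question in questions:
        # Keep only words with at least `min_length` characters.
        terms = [w for w in question.split(" ") if len(w) >= min_length]
        matches = sum(1 for w in terms if w in terms_used)
        # Add the terms only after counting, so a question never matches itself.
        terms_used.update(terms)
        overlaps.append(matches / len(terms) if terms else 0.0)
    return overlaps

# Hypothetical questions, for illustration only.
sample = [
    "the president signed the treaty",
    "another president vetoed another treaty",
    "it is so",
]
print(term_overlap(sample))  # [0.0, 0.4, 0.0]
```

The second question scores 0.4 because two of its five long terms ("president" and "treaty") appeared earlier. Common long words inflate this score just as easily as distinctive ones, which is one reason the overall figure should be read cautiously.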

In [10]:
def high_and_low(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(high_and_low, axis = 1)


In [13]:
def counts(text):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():      
        if text in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

comparison_terms = list(terms_used)[:5]
observed_expected = []

for item in comparison_terms:
    result = counts(item)
    observed_expected.append(result)        
    
observed_expected

[(3, 3), (1, 0), (0, 1), (7, 11), (0, 1)]
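
The `counts` function rescans every row for each term, which is why only five terms are compared here. One possible alternative, sketched below, makes a single pass over the rows and builds a lookup from each term to its high and low counts. The helper name `build_term_counts` and the sample rows are assumptions for illustration; the rows are plain `(clean_question, high_value)` pairs rather than a DataFrame.

```python
from collections import defaultdict

def build_term_counts(rows):
    """Map each term to [high_count, low_count] in one pass over the rows."""
    term_counts = defaultdict(lambda: [0, 0])
    for question, high_value in rows:
        # set() counts a term once per question, matching the
        # `text in split_question` check used by counts().
        for term in set(question.split(" ")):
            if high_value == 1:
                term_counts[term][0] += 1
            else:
                term_counts[term][1] += 1
    return term_counts

# Hypothetical rows, for illustration only.
rows = [
    ("this president signed", 1),
    ("this president president", 0),
    ("another treaty", 0),
]
term_counts = build_term_counts(rows)
print(term_counts["president"])  # [1, 1]
print(term_counts["treaty"])     # [0, 1]
```

With the real data, the pairs could come from `zip(jeopardy['clean_question'], jeopardy['high_value'])`. The resulting dictionary also makes it easy to keep only terms whose total count is high enough for a meaningful test.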

## Applying the Chi-squared Test

In [18]:
high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])

import numpy as np
from scipy.stats import chisquare
chi_squared = []

for item in observed_expected:
    total = sum(item)
    total_prop = total / jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    observed = np.array([item[0], item[1]])
    expected = np.array([expected_high, expected_low])
    
    chi_square = chisquare(observed, expected)
    chi_squared.append(chi_square)

chi_squared

[Power_divergenceResult(statistic=1.3346324449838385, pvalue=0.24798277007881886),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.9188510068856128, pvalue=0.33777684128818963),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

## Chi-squared Results

None of the terms had a significant difference in usage between high-value and low-value rows: every p-value is above 0.05. Additionally, most of the observed frequencies were lower than 5 (only one term, with counts of 7 and 11, appeared more often), which makes the chi-squared test less reliable. It would be better to run this test with only terms that have higher frequencies.
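
To show the arithmetic behind these results, the sketch below computes the statistic and p-value by hand for one term and checks the common rule of thumb that every expected count should be at least 5. The high and low totals are hypothetical, since the notebook does not print the real ones, and `chi_squared_1df` is an illustrative helper rather than a SciPy function. It relies on the fact that with two categories (one degree of freedom) the p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math

def chi_squared_1df(observed, expected):
    # Pearson statistic: sum of (observed - expected)^2 / expected.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Two categories give 1 degree of freedom, where the p-value
    # is erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical totals, for illustration only.
high_value_count = 300
low_value_count = 700
n_rows = high_value_count + low_value_count

observed = (7, 11)  # (high, low) counts for one term
total_prop = sum(observed) / n_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

statistic, p_value = chi_squared_1df(observed, expected)
reliable = min(expected) >= 5  # rule of thumb for expected counts
print(expected, statistic, p_value, reliable)
```

Under these assumed totals the expected counts are about 5.4 and 12.6, so this term just clears the threshold. A term that appears only once cannot clear it, whatever the totals are.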

## Next Steps

Here are some potential next steps:

- Find a better way to eliminate non-informative words than just removing words that are shorter than 6 characters. Some ideas:
    - Manually create a list of words to remove, such as "the" and "than".
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (available [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/)) instead of the subset we used in this project.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
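
As a starting point for the phrase-based idea, here is a minimal sketch that measures overlap on two-word phrases (bigrams) instead of single words. The function names and sample questions are made up for illustration, and the questions are assumed to be normalized and in air-date order.

```python
def bigrams(text):
    """Return the adjacent word pairs in a normalized question."""
    words = text.split()
    return list(zip(words, words[1:]))

def phrase_overlap(questions):
    """Return, per question, the share of bigrams already seen earlier."""
    phrases_used = set()
    overlaps = []
    for question in questions:
        pairs = bigrams(question)
        matches = sum(1 for pair in pairs if pair in phrases_used)
        # Add the pairs only after counting, so a question never matches itself.
        phrases_used.update(pairs)
        overlaps.append(matches / len(pairs) if pairs else 0.0)
    return overlaps

# Hypothetical questions, for illustration only.
sample_questions = [
    "this president signed the treaty",
    "this president vetoed the treaty",
    "a river in egypt",
]
print(phrase_overlap(sample_questions))  # [0.0, 0.5, 0.0]
```

Bigrams are only one way to define a phrase. Longer n-grams would be stricter, and filtering out pairs made entirely of stopwords would likely be needed before the score says much about recycled questions.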