# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Scenario: Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

## Jeopardy Questions

The dataset is named `jeopardy.csv`, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

The columns contained in this dataset are:

- `Show Number` -- the Jeopardy episode number of the show this question was in.
- `Air Date` -- the date the episode aired.
- `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- `Category` -- the category of the question.
- `Value` -- the number of dollars answering the question correctly is worth.
- `Question` -- the text of the question.
- `Answer` -- the text of the answer.

Let's import the data and take a look at the first few rows:

In [13]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

|   | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [14]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Let's clean up the column names by modifying them to a more standardized Python format:

In [15]:
# strip the leading space, replace inner spaces with underscores, and lowercase
jeopardy.columns = jeopardy.columns.str.strip().str.replace(' ', '_').str.lower()

In [16]:
jeopardy.columns

Index(['show_number', 'air_date', 'round', 'category', 'value', 'question',
       'answer'],
      dtype='object')

## Normalizing Text

Before we start analyzing the Jeopardy questions, we need to normalize the text columns (the `question` and `answer` columns). We'll make all words lowercase and remove punctuation, so that, for example, `Don't` and `don't` aren't considered different words when we compare them.

In [19]:
import re

def normalize_text(s):
    s = s.lower() # convert to lowercase
    s = re.sub(r"[^A-Za-z0-9\s]", "", s) # remove punctuation (raw string so \s is passed to the regex as-is)
    return s

In [20]:
jeopardy["clean_question"] = jeopardy["question"].apply(normalize_text)

In [21]:
jeopardy["clean_question"].head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [22]:
jeopardy["clean_answer"] = jeopardy["answer"].apply(normalize_text)

In [23]:
jeopardy["clean_answer"].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
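
One subtlety is worth checking before these columns are split into words: removing punctuation can leave two spaces in a row, and `split(" ")` then produces empty-string tokens. The sketch below is illustrative only. The `tokenize` helper is a hypothetical name that is not part of this notebook, and the sample string is simply one of the category names shown earlier.

```python
import re

def normalize_text(s):
    # Same normalization as above: lowercase, then drop punctuation.
    s = s.lower()
    return re.sub(r"[^A-Za-z0-9\s]", "", s)

def tokenize(s):
    # Hypothetical helper: split() with no argument collapses runs of
    # whitespace, so it never returns empty-string tokens.
    return normalize_text(s).split()

sample = "EPITAPHS & TRIBUTES"
print(normalize_text(sample).split(" "))  # includes an empty-string token
print(tokenize(sample))                   # only the real words
```

If empty tokens appear in both a question and its answer, they would count as a matching "word", so splitting on any whitespace may be the safer choice.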

## Normalizing Other Columns

Now that we've normalized the text columns, there are also some other columns to normalize.

The `value` column should be numeric, so that we can manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

Also, the `air_date` column should be a datetime, not a string, so that we can work with it more easily.

In [24]:
def normalize_values(val):
    val = re.sub(r"[^A-Za-z0-9\s]", "", val) # remove punctuation, including the dollar sign
    try:
        val = int(val)
    except ValueError: # text that is not a number is treated as 0
        val = 0

    return val

In [25]:
jeopardy["clean_value"] = jeopardy["value"].apply(normalize_values)

In [26]:
jeopardy["clean_value"].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64

In [28]:
jeopardy["air_date"] = pd.to_datetime(jeopardy["air_date"])

In [29]:
jeopardy["air_date"].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: air_date, dtype: datetime64[ns]
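
The value conversion can also be written with vectorized pandas operations instead of a Python function applied row by row. This is a minimal sketch on a made-up `sample_values` Series (not the real data), assuming that entries without any digits should become 0, as in `normalize_values` above.

```python
import pandas as pd

# Hypothetical sample; the real column would be jeopardy["value"].
sample_values = pd.Series(["$200", "$1,000", "None"])

# Keep digits only, then convert; strings with no digits become NaN.
digits = sample_values.str.replace(r"[^0-9]", "", regex=True)
clean_values = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

print(clean_values.tolist())
```

The result should match the row-by-row version for values like these, and it avoids calling a Python function once per row.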

## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur.  We can answer the first question by seeing how many times words in the answer also occur in the question.  We'll work on the first question now, and come back to the second.

In [30]:
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    match_count = 0
    
    if "the" in split_answer:
        split_answer.remove("the") # "the" serves no meaningful purpose for us, so we remove it when it's found
    if len(split_answer) == 0:
        return 0 # prevents 0 division later
    
    for word in split_answer:
        if word in split_question:
            match_count += 1
            
    return match_count / len(split_answer)

In [31]:
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [32]:
jeopardy["answer_in_question"].mean()

0.060493257069335914

For each row, we calculated the fraction of the answer's words that also appear in the question. Taking the mean of the entire column, we see that, on average, only about 6% of an answer's words appear in its question.

This would suggest that we won't simply be able to figure out an answer by hearing the question; we'll likely have to study.
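
To make the per-row score concrete, here is a small sketch that applies the same matching logic to a few made-up rows. The rows are plain dictionaries invented for illustration, not entries from the dataset.

```python
def count_matches(row):
    # Same logic as count_matches above, applied to a plain dict.
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Made-up rows for illustration only.
rows = [
    {"clean_question": "this state is home to the city of yuma",
     "clean_answer": "arizona"},
    {"clean_question": "the new york times is based here",
     "clean_answer": "new york city"},
    {"clean_question": "this river flows through egypt",
     "clean_answer": "the nile"},
]

for row in rows:
    print(row["clean_answer"], "->", count_matches(row))
```

Only the second row scores above zero, because two of its three answer words appear in the question. A column mean near 6% therefore means that most rows look like the first and third.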

## Recycled Questions / Words

Let's say  we want to investigate how often new questions are repeats of older ones.  We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate.

To do this, we'll iterate through questions by air date and build a set of words that have been used in prior questions.  As we go, we'll compare each new question's words to the set to see which have been used previously, and use this value as a proxy for re-use of questions.  We'll omit words with fewer than 6 characters in order to filter out words like `the` and `than` that are commonly used, but don't tell us a lot about a question.

In [37]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values(by=["air_date"])

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [word for word in split_question if len(word) > 5] # filter out words with fewer than 6 characters
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
            
    for word in split_question:
        terms_used.add(word)
        
    if len(split_question) > 0:
        match_count /= len(split_question) # convert match_count to a fraction of the # of words in the question
    
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.6877843800700828

On average, about 69% of the longer words (6 or more characters) in a question had already appeared in an earlier question.  This only measures overlap of single words, not of phrases or whole questions, so it is weak evidence on its own; but it does signal that it would be worth investigating the potential reuse of questions further.
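
When reading this number, keep in mind that the set of seen words only grows, so later questions are more likely to overlap than earlier ones. The sketch below runs the same idea on three made-up questions. The `overlap_fractions` helper and the sample strings are hypothetical, and it uses `split()` rather than `split(" ")`.

```python
def overlap_fractions(questions, min_length=6):
    """For each question, in order, return the fraction of its long
    words that already appeared in an earlier question."""
    terms_used = set()
    fractions = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        seen = sum(1 for w in words if w in terms_used)
        fractions.append(seen / len(words) if words else 0)
        terms_used.update(words)  # add only after counting matches
    return fractions

# Made-up, already-normalized questions in air-date order.
sample_questions = [
    "this planet is closest to the sun",
    "galileo observed this planet through a telescope",
    "the largest planet observed by galileo",
]
print(overlap_fractions(sample_questions))
```

The first question always scores 0 because nothing has been seen yet, and the scores rise as the vocabulary builds up, even though none of these questions repeats another.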

## Low-Value vs. High-Value Questions

Let's say we want to focus our studying on questions that, historically, are high-value rather than low-value.  This will help us earn more money when we're on Jeopardy.

We can figure out which terms correspond to high-value questions using a chi-squared test.  We'll first split questions into two categories: low value, where `Value` is 800 or less, and high value, where `Value` is greater than 800.

We can then iterate through our `terms_used` set, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high- and low-value questions by selecting the words with the highest associated chi-squared values. For the sake of time and simplicity, we'll just do it for a small sample for now.

In [48]:
def determine_value(row):
    if row["clean_value"] > 800:
        return 1
    return 0

jeopardy["high_value"] = jeopardy.apply(determine_value, axis = 1)

def low_high_counts(word):
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
                
    return high_count, low_count # note the order: (high, low)

observed_counts = []

comparison_terms = list(terms_used)[:5] # an arbitrary sample of 5 terms; set order is not fixed

for term in comparison_terms:
    observed_counts.append(low_high_counts(term))
    
observed_counts

[(1, 0), (0, 1), (0, 1), (0, 1), (0, 3)]

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [49]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []

for counts in observed_counts:
    total = sum(counts) # total number of questions the term appears in
    total_prop = total / jeopardy.shape[0] # proportion of combined count relative to total # of rows in dataset
    high_val_expected = total_prop * high_value_count
    low_val_expected = total_prop * low_value_count
    
    observed = np.array([counts[0], counts[1]])
    expected = np.array([high_val_expected, low_val_expected])
    
    chi_squared.append(chisquare(observed, expected)) # will append each term's chi-squared statistic and p-value to chi_squared list

chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766901714)]
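
To make the calculation concrete, the sketch below computes the statistic by hand as the sum of (observed - expected)^2 / expected over the two categories. All numbers are invented for illustration and are not taken from this dataset; `chi_squared_statistic` is a hypothetical helper, not a SciPy function.

```python
def chi_squared_statistic(observed, expected):
    # Pearson's chi-squared: sum over categories of (O - E)^2 / E.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical dataset: 1000 questions, 300 high value and 700 low value.
high_value_count, low_value_count = 300, 700
total_rows = high_value_count + low_value_count

# Hypothetical term found in 12 high-value and 8 low-value questions.
observed = (12, 8)
total_prop = sum(observed) / total_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

statistic = chi_squared_statistic(observed, expected)
print(expected, statistic)
```

Here the term would be expected in about 6 high-value and 14 low-value questions if it were used evenly, so seeing it in 12 high-value questions yields a comparatively large statistic.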

## Results & Potential Next Steps

None of the terms had a statistically significant difference in usage between high-value and low-value rows (every p-value is above 0.05).  Also, every term appeared in fewer than 5 questions, and the chi-squared test is unreliable at such low frequencies.  It would be better to run this test only with terms that have higher frequencies.

Some potential next steps for this project would be:

- Finding a better way to eliminate non-informative words than just removing words that are fewer than 6 characters long.
- Performing the chi-squared test across more terms to see what terms have larger differences.
- Looking more into the `category` column to see if any interesting analysis can be done with it.
- Using the whole Jeopardy dataset instead of the subset currently being used.
- Using phrases instead of single words when seeing if there's overlap between questions.  Single words don't capture the whole context of the question well.
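
As a possible starting point for testing more terms, the per-term loop over every row could be replaced with a single pass that counts all terms at once. This is a rough sketch, assuming each question is available as a `(clean_question, high_value)` pair; the `count_terms` helper, the `min_total` threshold, and the sample data are all hypothetical.

```python
from collections import Counter

def count_terms(questions, min_length=6):
    """Count, in one pass, how many high- and low-value questions
    each long word occurs in."""
    high_counts, low_counts = Counter(), Counter()
    for text, high_value in questions:
        # A set ensures each word is counted once per question.
        words = {w for w in text.split() if len(w) >= min_length}
        (high_counts if high_value == 1 else low_counts).update(words)
    return high_counts, low_counts

# Made-up (clean_question, high_value) pairs for illustration.
sample = [
    ("this planet has prominent rings", 1),
    ("the largest planet in the solar system", 0),
    ("this planet is nicknamed the red planet", 0),
]
high_counts, low_counts = count_terms(sample)

# Keep only terms that occur often enough to be worth testing.
min_total = 2
frequent = {
    term: (high_counts[term], low_counts[term])
    for term in set(high_counts) | set(low_counts)
    if high_counts[term] + low_counts[term] >= min_total
}
print(frequent)
```

On real data, `min_total` would need to be much larger than 2 so that the counts are high enough for the chi-squared test to be meaningful.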