# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named `jeopardy.csv` and contains the first 20,000 rows of a larger dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). Here's the beginning of the file:

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

* Show Number -- the Jeopardy episode number of the show this question was in.
* Air Date -- the date the episode aired.
* Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* Category -- the category of the question.
* Value -- the number of dollars answering the question correctly is worth.
* Question -- the text of the question.
* Answer -- the text of the answer.

In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Several column names have leading whitespace (for example, `' Air Date'`), so let's strip it.
We should keep the spaces between words.

In [3]:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Data Normalization

Before we can start doing analysis on the Jeopardy questions, we need to normalize **all of the text columns** (the `Question` and `Answer` columns). The idea is to lowercase words and remove punctuation so that `Don't` and `don't` aren't treated as different words when we compare them.

In [4]:
import string

# Convert to lowercase and remove punctuation
# This seems to be the most efficient way
# for more details, see https://stackoverflow.com/a/266162
def normalize(s: str) -> str:
    return s.lower().translate(str.maketrans('', '', string.punctuation))

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)
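
As a quick sanity check, the same normalization can be tried on a few standalone strings. This is a minimal sketch that needs no dataset; `normalize_text` is a hypothetical stand-in for the `normalize` function above, and the sample strings are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize_text(s: str) -> str:
    """Lowercase the text and strip ASCII punctuation."""
    return s.lower().translate(PUNCT_TABLE)

# Made-up sample strings.
samples = ["Don't", "don't", "McDonald's", "U.S.A."]
for sample in samples:
    print(repr(sample), '->', repr(normalize_text(sample)))
```

Note that `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes would pass through unchanged.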

Two more things:
* **The `Value` column should also be numeric**, to allow you to manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.
* **The `Air Date` column should also be a datetime**, not a string, to enable us to work with it more easily.

In [5]:
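def normalize_val(s: str) -> int:
    # Strip punctuation such as '$' and ',' before converting to an integer.
    val = s.translate(str.maketrans('', '', string.punctuation))
    try:
        val = int(val)
    except ValueError:
        # Non-numeric values fall back to 0.
        val = 0
    return val

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_val)
# pd.to_datetime converts the whole column at once.
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])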
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
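
To see what the value conversion does to individual strings, here is a small standalone sketch. `parse_value` is a hypothetical stand-in that mirrors `normalize_val`, and the sample inputs (including the non-numeric `"None"`) are assumptions chosen for illustration, not rows taken from the dataset.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def parse_value(raw: str) -> int:
    """Strip punctuation such as '$' and ',' and convert to int; fall back to 0."""
    digits = raw.translate(PUNCT_TABLE)
    try:
        return int(digits)
    except ValueError:
        return 0

# Made-up sample inputs.
for raw in ["$200", "$2,000", "None"]:
    print(raw, '->', parse_value(raw))
```

One consequence of the fallback is that a non-numeric value becomes indistinguishable from a real value of 0, so any such rows would be counted as low value questions in the later analysis.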

## Now What?

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

We can answer the first question by seeing how many times words in the answer also occur in the question.
We can answer the second question by seeing how often complex words (6 or more characters) reoccur.

We'll work on the first question now, and come back to the second.

### Tackling The First Question

In [6]:
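def count_matches(row: pd.Series) -> float:
    split_answer   = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    # "the" is too common to be a meaningful match, so drop it from the answer.
    if "the" in split_answer:
        split_answer.remove("the")
    # Avoid dividing by zero when the answer has no words left.
    if len(split_answer) == 0:
        return 0
    match_count = len([item for item in split_answer if item in split_question])
    # Fraction of answer words that also appear in the question.
    return match_count / len(split_answer)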

jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)
jeopardy["answer_in_question"].mean()

0.06035277385469894

On average, only about **6%** of the words in an answer also appear in its question. This suggests that you can rarely deduce the answer from the question alone, so you need to really understand the question to answer it correctly, and studying is important.
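
To make the metric concrete, here is a minimal sketch of the same calculation on two made-up question and answer pairs. `match_fraction` is a hypothetical helper that takes plain strings instead of a DataFrame row; the inputs are assumed to be lowercased and stripped of punctuation already.

```python
def match_fraction(clean_question: str, clean_answer: str) -> float:
    """Fraction of answer words (ignoring one 'the') that also appear in the question."""
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    if "the" in answer_words:
        answer_words.remove("the")  # removes only the first occurrence
    if not answer_words:
        return 0.0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Made-up examples, not taken from the dataset.
print(match_fraction("this state is home to the city of yuma", "arizona"))           # 0.0
print(match_fraction("this new york bridge opened in 1883", "the brooklyn bridge"))  # 0.5
```

In the second example one of the two remaining answer words (`bridge`) appears in the question, which gives 0.5. The dataset-wide mean is the average of these per-question fractions.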

### Onto The 2nd Question, Recycled Questions

Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

To do this, we can:

* Sort jeopardy in order of ascending air date.
* Maintain a set called `terms_used` that will be empty initially.
* Iterate through each row of jeopardy.
* Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
* If it does, increment a counter.
* Add each word to `terms_used`.

This will enable us to check if the terms in questions have been used previously or not. Only looking at words with 6 or more characters enables us to filter out words like `the` and `than`, which are commonly used, but don't tell us a lot about a question.

In [7]:
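question_overlap = []
terms_used = set()
# Note: rows are visited in file order here; to follow the plan above strictly,
# sort first with jeopardy.sort_values("Air Date").
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only words with 6 or more characters.
    split_question = [q for q in split_question if len(q) > 5]
    # Count the words that already appeared in an earlier question.
    match_count = len([word for word in split_question if word in terms_used])
    terms_used.update(split_question)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap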

jeopardy["question_overlap"].mean()

0.6902117143393507

There is almost 70% overlap between terms in new questions and terms in older questions. However, this only covers a small subset of questions, and it compares single terms rather than phrases, so on its own the result is not very conclusive. It does suggest that the recycling of questions is worth a closer look.
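
The procedure can also be traced on a toy example. The sketch below uses three made-up questions with made-up air dates, sorts them by date, and reports for each question the share of its long words that appeared in an earlier question. The function name `overlap_by_question` and the `min_len` parameter are assumptions introduced for illustration.

```python
from datetime import date

# Made-up (air_date, clean_question) pairs, deliberately out of order.
questions = [
    (date(2005, 3, 1), "this president signed the declaration"),
    (date(2004, 1, 1), "this president lived in virginia"),
    (date(2006, 7, 4), "the declaration was signed in this city"),
]

def overlap_by_question(rows, min_len=6):
    """For each question, in air-date order, return the share of long words seen earlier."""
    terms_seen = set()
    overlaps = []
    for _, text in sorted(rows, key=lambda r: r[0]):
        # Keep only words with at least min_len characters.
        words = [w for w in text.split(" ") if len(w) >= min_len]
        seen_before = sum(1 for w in words if w in terms_seen)
        overlaps.append(seen_before / len(words) if words else 0.0)
        terms_seen.update(words)
    return overlaps

print(overlap_by_question(questions))
```

The earliest question has no overlap because nothing has been seen yet, the second reuses one of its three long words, and the third reuses both of its long words. This also shows why the order matters: the overlap of a question depends on which questions came before it.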

In [8]:
# DETOUR (testing add vs. update)
#
# Before moving on, I would like to quickly compare the speed of building
# the set with set.add (one word at a time) and set.update (a whole list).
# It turned out there is no significant difference, but
# we will use update anyway.

import timeit

def test_add():
    terms_used = set()
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        split_question = [q for q in split_question if len(q) > 5]
        match_count=len([word for word in split_question if word in terms_used])
        for word in split_question:
            terms_used.add(word)
    return terms_used

def test_update():
    terms_used = set()
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        split_question = [q for q in split_question if len(q) > 5]
        match_count=len([word for word in split_question if word in terms_used])
        terms_used.update(split_question)
    return terms_used

print('add    :',timeit.Timer('f()', 'from __main__ import test_add as f').timeit(10))
print('update :',timeit.Timer('f()', 'from __main__ import test_update as f').timeit(10))

add    : 18.010880056302994
update : 17.905024907086045


## Evaluating High Value Questions

Let's say we only want to study the kinds of questions that carry high values rather than low values. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

* Low value -- Any row where `Value` is 800 or less.
* High value -- Any row where `Value` is greater than 800.

We'll then be able to loop through each of the terms collected in the previous section, `terms_used`, and:

* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find expected counts.
* Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [9]:
def determine_value(row: pd.Series) -> int:
    # 1 for high value questions (more than $800), 0 otherwise.
    return 1 if row["clean_value"] > 800 else 0

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [10]:
high_valued_df = jeopardy[jeopardy['high_value']==1]
low_valued_df  = jeopardy[jeopardy['high_value']==0]

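def count_usage(term: str) -> tuple:
    # Number of high value and low value questions that contain the term as a whole word.
    high_count     = len([d for d in high_valued_df.clean_question if term in d.split(' ')])
    low_count      = len([d for d in low_valued_df.clean_question  if term in d.split(' ')])
    return high_count, low_count

# Sets are unordered, so this takes an arbitrary sample of 5 terms.
comparison_terms = list(terms_used)[:5]
# Observed (high, low) counts for each sampled term.
observed_expected = [count_usage(term) for term in comparison_terms]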

observed_expected

[(1, 4), (0, 1), (0, 3), (0, 2), (0, 1)]

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [11]:
from scipy.stats import chisquare
import numpy as np

high_value_count = high_valued_df.shape[0]
low_value_count  = low_valued_df.shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.18383953104516373, pvalue=0.6680941623250602),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
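
To make the arithmetic behind these statistics explicit, here is a minimal standard-library sketch of the same calculation for a single term. The function name `chi_squared_two_groups` and the group sizes in the example are hypothetical; the sketch also assumes the term occurs at least once, since otherwise the expected counts would be zero. With two categories there is one degree of freedom, for which the p-value can be written with the complementary error function.

```python
import math

def chi_squared_two_groups(observed_high, observed_low, n_high, n_low):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
    total = observed_high + observed_low
    # Share of all questions that contain the term.
    share = total / (n_high + n_low)
    expected_high = share * n_high
    expected_low = share * n_low
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical group sizes: 30 high value and 70 low value questions.
stat, p = chi_squared_two_groups(1, 4, n_high=30, n_low=70)
print(round(stat, 4), round(p, 4))
```

Here the expected counts are 1.5 and 3.5, so the statistic is 0.25 / 1.5 + 0.25 / 3.5. A small statistic like this means the observed split is close to what the group sizes alone would predict.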

## Conclusion

From the chi-squared results, none of the terms showed a significant difference in usage between high value and low value rows (all p-values are well above 0.05). Additionally, the observed frequencies were all lower than 5, so the chi-squared test isn't very reliable here. It would be better to run this test only on terms that have higher frequencies.

Here are some potential next steps:

* Find a better way to eliminate non-informative words than just removing words that are less than `6` characters long. Some ideas:
    * Manually create a list of words to remove, like `the`, `than`, etc.
    * Find a list of stopwords to remove.
    * Remove words that occur in more than a certain percentage (like 5%) of questions.
* Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    * Use the [`apply`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) method to make the code that calculates frequencies more efficient.
    * Only select terms that have high frequencies across the dataset, and ignore the others.
* Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    * See which categories appear the most often.
    * Find the probability of each category appearing in each round.
* Use the whole Jeopardy dataset ([available here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file)) instead of the subset we used in this project.
* Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
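
As a starting point for the term-filtering ideas above, here is a rough sketch that keeps only terms that are neither too common nor too rare across questions. The function name `informative_terms` and the `min_count` threshold are assumptions made for illustration; `max_share` defaults to the 5% figure suggested in the list.

```python
from collections import Counter

def informative_terms(clean_questions, max_share=0.05, min_count=2):
    """Keep terms that are neither too common nor too rare across questions."""
    doc_freq = Counter()
    for text in clean_questions:
        # Count each term once per question.
        doc_freq.update(set(text.split()))
    n_questions = len(clean_questions)
    return {
        term for term, count in doc_freq.items()
        if count >= min_count and count / n_questions <= max_share
    }

# Made-up questions; max_share is raised because the sample is tiny.
sample = [
    "the capital of france",
    "the capital of peru",
    "the river in france",
    "the largest planet",
]
print(sorted(informative_terms(sample, max_share=0.5)))
```

On this toy sample, `the` is dropped because it occurs in every question, and words that occur only once are dropped because they are too rare to test. On the real data, the resulting set could replace the arbitrary 5-term sample used for the chi-squared test.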