# Winning Jeopardy

## Jeopardy questions

Jeopardy is a popular TV show in the US where participants answer questions to win money. Suppose you want to compete on Jeopardy and are looking for any edge you can get to win. In this project, we work with a dataset of Jeopardy questions to find patterns in the questions that could help you win.

The dataset contains 20000 rows from the beginning of a full dataset of Jeopardy questions ([link](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/)).

In [1]:
import pandas as pd

jeopardy = pd.read_csv("jeopardy.csv")

jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have leading spaces. Remove the spaces from each column name.

In [3]:
new_columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
       'Question', 'Answer']
jeopardy.columns = new_columns
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
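Hard-coding the names works for this file, but the same cleanup can also be done programmatically. The following is a minimal sketch on a small stand-in DataFrame (`df` and its single row are hypothetical, not part of the dataset):

```python
import pandas as pd

# Stand-in frame with the same kind of leading-space headers (hypothetical data)
df = pd.DataFrame(
    [[4680, "2004-12-31", "$200"]],
    columns=["Show Number", " Air Date", " Value"],
)

# Strip leading and trailing whitespace from every column name
df.columns = df.columns.str.strip()

print(list(df.columns))
```

This approach keeps working if columns are added or reordered, since no names are typed by hand.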

Each row in the dataset represents a single question on a single episode of Jeopardy. The columns are:

* **Show Number:** the Jeopardy episode number of the show this question was in.
* **Air Date:** the date the episode aired.
* **Round:** the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* **Category:** the category of the question.
* **Value:** the dollar amount that a correct answer is worth.
* **Question:** the text of the question.
* **Answer:** the text of the answer.

## Normalize text

Before analyzing the Jeopardy questions, normalize the text columns (`Question` and `Answer`): lowercase the words and remove punctuation so that `Don't` and `don't` aren't treated as different words.

In [7]:
import re, string

regex = re.compile('[%s]' % re.escape(string.punctuation))

# function to normalize questions and answers
def normalize(text):
    text_lower = text.lower()
    return regex.sub('', text_lower)

# normalize the Question column
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)

# normalize the Answer column
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)

In [9]:
# view result
print(jeopardy[['Question', 'clean_question']].head(5))
print(jeopardy[['Answer', 'clean_answer']].head(10))

                                            Question  \
0  For the last 8 years of his life, Galileo was ...   
1  No. 2: 1912 Olympian; football star at Carlisl...   
2  The city of Yuma in this state has a record av...   
3  In 1963, live on "The Art Linkletter Show", th...   
4  Signer of the Dec. of Indep., framer of the Co...   

                                      clean_question  
0  for the last 8 years of his life galileo was u...  
1  no 2 1912 olympian football star at carlisle i...  
2  the city of yuma in this state has a record av...  
3  in 1963 live on the art linkletter show this c...  
4  signer of the dec of indep framer of the const...  
           Answer    clean_answer
0      Copernicus      copernicus
1      Jim Thorpe      jim thorpe
2         Arizona         arizona
3      McDonald's       mcdonalds
4      John Adams      john adams
5         the ant         the ant
6  the Appian Way  the appian way
7  Michael Jordan  michael jordan
8      Washington      washington
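To see the normalization rule in isolation, here is a small self-contained sketch that applies the same lowercase-and-strip-punctuation logic to a few sample strings. The name `punct_re` and the sample list are illustrative assumptions. Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes would pass through unchanged.

```python
import re
import string

# Character class matching any single ASCII punctuation character
punct_re = re.compile("[%s]" % re.escape(string.punctuation))

def normalize(text):
    """Lowercase the text and drop ASCII punctuation."""
    return punct_re.sub("", text.lower())

# Illustrative samples, not taken from the dataset
samples = ["Don't", "don't", "McDonald's", "No. 2: 1912 Olympian"]
cleaned = [normalize(s) for s in samples]
print(cleaned)
```

After normalization, `Don't` and `don't` map to the same token, which is what the word-matching steps later in the project rely on.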

## Normalize columns

The `Value` column should be numeric so that it is easier to manipulate. Remove the dollar sign from the beginning of each value and convert the column from text to numeric. The `Air Date` column should also be a datetime, not a string.

In [10]:
# function to convert a dollar string such as "$200" to an integer
def normalize_value(text):
    # remove any punctuation in the string
    clean_text = regex.sub('', text)
    try:
        result = int(clean_text)
    except ValueError:
        # non-numeric values are treated as 0
        result = 0
    return result

# normalize the Value column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

In [11]:
# view result
print(jeopardy[['Value', 'clean_value']].head(5))

  Value  clean_value
0  $200          200
1  $200          200
2  $200          200
3  $200          200
4  $200          200


In [13]:
# convert the Air Date column to a datetime column
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
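The same two conversions can also be written with vectorized pandas calls instead of a row-wise function. This is only a sketch on made-up values: the sample strings (including the comma and the `None` entry) are assumptions about what the raw column might contain.

```python
import pandas as pd

# Hypothetical raw Value strings, including a comma and a non-numeric entry
values = pd.Series(["$200", "$2,000", "None"])

# Drop "$" and ",", then parse; anything unparseable becomes NaN, then 0
clean = (
    pd.to_numeric(values.str.replace(r"[$,]", "", regex=True), errors="coerce")
    .fillna(0)
    .astype(int)
)

# Parse ISO-formatted date strings into datetime values
dates = pd.to_datetime(pd.Series(["2004-12-31", "2005-01-03"]))

print(clean.tolist())
print(dates.dt.year.tolist())
```

As in `normalize_value`, values that cannot be parsed end up as 0, so they fall into the low value group in the later analysis.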

## Answers in questions

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often is the answer deducible from the question?
* How often are new questions repeats of older questions?

Answer the first question by seeing how many times words in the answer also occur in the question. Answer the second question by seeing how often complex words (6 or more characters) reoccur.

In [15]:
# count how many times words in the answer occur in the question
def count_matches(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for a in split_answer:
        if a in split_question:
            match_count += 1
    return match_count/len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)

jeopardy['answer_in_question'].mean()

0.060352773854699004

On average, only about 6% of the words in an answer also appear in its question. Hoping to guess answers from the questions is not a good strategy.
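To make the per-row score concrete, here is a standalone sketch of the same matching logic applied to two made-up rows. It takes plain strings instead of a DataFrame row and uses `split()` rather than `split(' ')`, both of which are simplifications for illustration.

```python
def count_matches(clean_answer, clean_question):
    """Fraction of answer words (ignoring one 'the') that appear in the question."""
    answer_words = clean_answer.split()
    question_words = set(clean_question.split())
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    hits = sum(1 for word in answer_words if word in question_words)
    return hits / len(answer_words)

# Made-up rows for illustration only
partial = count_matches("the appian way", "this way was the first great roman road")
none = count_matches("arizona", "the city of yuma in this state is sunny")
print(partial, none)
```

In the first row one of the two remaining answer words (`way`) occurs in the question, so the score is one half. In the second row nothing matches, so the score is zero. The column mean reported above averages these per-row fractions.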

## Recycled questions

Let's investigate the second question: how often are new questions repeats of older questions?

Check whether the terms in each question have been used previously. Look only at words of 6 or more characters to filter out words like `the` and `than`, which are commonly used.

In [30]:
# count how many long words occur in older questions
question_overlap = []
terms_used = set()

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    # keep only words in split_question that are at least 6 characters long
    split_question = list(filter(lambda x: len(x) > 5, split_question))
    match_count = 0
    for q in split_question:
        if q in terms_used:
            match_count += 1
        terms_used.add(q)
    if len(split_question) > 0:
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6919577992203563

There is about 70% overlap between terms in new questions and terms in old questions. This analysis covers only a small subset of the questions and compares single terms rather than phrases, so the result is not conclusive. It does suggest, however, that the recycling of questions is worth a closer look.
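A toy run of the same overlap loop shows how the score builds up as more terms are seen. The three questions below are made up and assumed to be already normalized and in air-date order.

```python
# Made-up, already-normalized questions in air-date order (hypothetical data)
questions = [
    "this italian astronomer defended copernicus",
    "this astronomer discovered the planet uranus",
    "the italian capital",
]

terms_seen = set()
overlaps = []

for question in questions:
    # Keep only words with at least 6 characters
    long_words = [w for w in question.split() if len(w) >= 6]
    repeated = 0
    for w in long_words:
        # Count the word if it was already seen, then record it
        if w in terms_seen:
            repeated += 1
        terms_seen.add(w)
    overlaps.append(repeated / len(long_words) if long_words else 0)

print(overlaps)
```

The first question always scores 0 because nothing has been seen yet, and later questions score higher as the set of seen terms grows. This is one reason the overall mean depends on how many questions are included.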

## Low value vs high value questions

Studying the topics that appear in high value questions, rather than low value ones, could help you earn more money on Jeopardy.

Let's figure out which terms correspond to high-value questions using a chi-squared test. First, split the questions into two categories: low value (800 or less) and high value (greater than 800). Then, for each word in `terms_used`:

* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find expected counts.
* Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions by selecting the words with the highest associated chi-squared values. Run this on a small sample of words to keep the run time down.
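The steps above can be traced by hand for a single word. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the dataset.

```python
# Hypothetical counts, chosen only to make the arithmetic easy to follow
total_rows = 20000
high_value_rows = 6000      # rows with value > 800
low_value_rows = total_rows - high_value_rows

observed_high = 20          # high value questions containing the word
observed_low = 30           # low value questions containing the word

# Share of all questions that contain the word
word_share = (observed_high + observed_low) / total_rows

# Expected counts if the word were spread evenly across both groups
expected_high = word_share * high_value_rows
expected_low = word_share * low_value_rows

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi_sq = (
    (observed_high - expected_high) ** 2 / expected_high
    + (observed_low - expected_low) ** 2 / expected_low
)
print(expected_high, expected_low, round(chi_sq, 3))
```

With these numbers the expected counts are 15 and 35, and the statistic comes out at roughly 2.38. A larger statistic means the word's usage deviates more from what the group sizes alone would predict.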

In [31]:
# divide questions into two groups: low and high value
def high_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(high_value, axis=1)
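The same flag can be computed without a row-wise `apply`. A minimal sketch, assuming a small made-up series of cleaned values:

```python
import pandas as pd

# Hypothetical cleaned values
clean_value = pd.Series([200, 800, 1000, 2000, 0])

# 1 when the value is strictly greater than 800, otherwise 0
high_value = (clean_value > 800).astype(int)

print(high_value.tolist())
```

Note that a value of exactly 800 lands in the low value group, matching the `> 800` comparison used in `high_value`.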

In [37]:
# helper function to count usage of words
def count_usage(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_count = []
comparison_terms = list(terms_used)[:5]

for term in comparison_terms:
    h_count, l_count = count_usage(term)
    observed_count.append([h_count, l_count])
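Calling `count_usage` once per term rescans every row each time, which is why only a handful of terms are tested here. One possible alternative, sketched below on made-up data, is to scan the rows once and update counters for all terms of interest. The `rows` and `terms` lists are hypothetical stand-ins for the DataFrame and `comparison_terms`.

```python
from collections import Counter

# Made-up (clean_question, high_value) pairs standing in for the DataFrame rows
rows = [
    ("this astronomer discovered uranus", 1),
    ("this italian astronomer defended copernicus", 0),
    ("the italian capital", 0),
]
terms = ["astronomer", "italian", "uranus"]

high_counts = Counter()
low_counts = Counter()

for question, is_high in rows:
    # Use a set so a word repeated within one question is counted once
    words = set(question.split())
    target = high_counts if is_high == 1 else low_counts
    for term in terms:
        if term in words:
            target[term] += 1

# Same [high, low] layout as observed_count
observed = [[high_counts[t], low_counts[t]] for t in terms]
print(observed)
```

This produces the same `[high, low]` pairs while reading each question only once, which should make it practical to test more than five terms.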

## Apply the chi-squared test

In [38]:
from scipy.stats import chisquare
import numpy as np

# number of rows in jeopardy where high_value is 1
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]

# number of rows in jeopardy where high_value is 0
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for obs in observed_count:
    total = obs[0] + obs[1]
    # get proportion across the dataset:
    total_prop = total / jeopardy.shape[0]
    # get the expected term count for high value rows:
    expected_high_count = total_prop * high_value_count
    # get the expected term count for low value rows:
    expected_low_count = total_prop * low_value_count
    # compute the chi-squared value and p-value given the expected and observed counts
    observed = np.array([obs[0], obs[1]])
    expected = np.array([expected_high_count, expected_low_count])
    chi_sq, p_value = chisquare(observed, expected)
    chi_squared.append([chi_sq, p_value])

chi_squared

[[1.607851384507536, 0.2047940943922556],
 [2.487792117195675, 0.11473257634454047],
 [0.401962846126884, 0.5260772985705469],
 [0.02636443308440769, 0.871013484688921],
 [2.487792117195675, 0.11473257634454047]]

None of the terms showed a significant difference in usage between high value and low value rows (all p-values are above 0.05). Additionally, the frequencies were all lower than 5, so the chi-squared test is less reliable here. It would be better to run this test with only terms that have higher frequencies.
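One way to act on this is to drop terms whose expected counts are too small before running the test. The sketch below uses a threshold of 5, a common rule of thumb for the chi-squared test. The group sizes, terms, and counts are all hypothetical.

```python
# Hypothetical group sizes and per-term [high, low] observed counts
high_value_count = 6000
low_value_count = 14000
total_rows = high_value_count + low_value_count

observed_counts = {
    "president": [40, 60],
    "galileo": [1, 2],
    "capital": [12, 38],
}

MIN_EXPECTED = 5  # rule-of-thumb minimum expected count per group

testable_terms = []
for term, (high, low) in observed_counts.items():
    share = (high + low) / total_rows
    expected_high = share * high_value_count
    expected_low = share * low_value_count
    # Keep the term only if both expected counts reach the threshold
    if min(expected_high, expected_low) >= MIN_EXPECTED:
        testable_terms.append(term)

print(testable_terms)
```

Here `galileo` is filtered out because it occurs only three times in total, while the other two terms would go on to the chi-squared test.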