# Winning Jeopardy
***

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, I'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

To decide whether to study past questions, study general knowledge, or not study at all, it helps to figure out two things:
- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

Also, let's say you only want to study questions that pertain to high-value questions instead of low-value questions, since that will help you earn more money when you're on Jeopardy. You can figure out which terms correspond to high-value questions using a chi-squared test.

This project aims to answer those questions using Python, Pandas, Numpy and Scipy.

The dataset is named jeopardy.csv, and contains roughly 20,000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the number of dollars answering the question correctly is worth.
- Question -- the text of the question.
- Answer -- the text of the answer.


In [43]:
import pandas as pd
import numpy as np
import re
from scipy.stats import chisquare

In [19]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


### Clean up Dataset 
***

In [20]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have spaces in front, which need to be removed to avoid confusion later on.

##### Remove the spaces in each item in jeopardy columns

In [21]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
       'Question', 'Answer']


In [22]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
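
Hardcoding the list works for this file. As a minimal sketch of a more general approach, the column names can also be stripped programmatically. The toy DataFrame below is made up for illustration and only mimics the leading-space headers.

```python
import pandas as pd

# Toy frame with the same kind of leading-space headers as jeopardy.csv
df = pd.DataFrame(columns=["Show Number", " Air Date", " Round", " Value"])

# Strip leading/trailing whitespace from every column name
df.columns = df.columns.str.strip()

print(list(df.columns))
```

This avoids retyping the names and keeps working if columns are added or reordered.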

##### Normalize text columns

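I need to lowercase the text and strip apostrophes so "Don't" and "don't" aren't considered to be different words when comparing them. Other punctuation is left in place, as the output below shows, so a token like "life," is still treated as different from "life".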

In [23]:

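def normalize_text(txt):
    # Lowercase so that "Don't" and "don't" compare equal
    txt = txt.lower()
    # Alternative (not applied): strip all punctuation
    # txt = re.sub(r"[^a-z0-9\s]", "", txt)
    # Remove apostrophes only
    txt = re.sub("'", "", txt)
    return txt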

In [24]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_question'].head()

0    for the last 8 years of his life, galileo was ...
1    no. 2: 1912 olympian; football star at carlisl...
2    the city of yuma in this state has a record av...
3    in 1963, live on "the art linkletter show", th...
4    signer of the dec. of indep., framer of the co...
Name: clean_question, dtype: object

In [25]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
jeopardy['clean_answer'].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
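
For comparison, here is a sketch of a stricter variant that drops all punctuation rather than just apostrophes. The name `normalize_text_strict` is hypothetical, and this is not the function used for the results in this notebook.

```python
import re

def normalize_text_strict(txt):
    """Lowercase and drop everything except letters, digits and whitespace."""
    txt = txt.lower()
    # Raw string so \s is read as the regex whitespace class
    return re.sub(r"[^a-z0-9\s]", "", txt)

print(normalize_text_strict("Don't stop, Galileo!"))  # dont stop galileo
```

With this version, "life," and "life" would become the same token, which would likely change the match rates computed later.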

##### Normalize value columns

The Value column should also be numeric, to make it easier to manipulate. I need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

In [26]:
jeopardy['Value'].head()

0    $200
1    $200
2    $200
3    $200
4    $200
Name: Value, dtype: object

In [27]:
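def normalize_value(txt):
    # Strip the dollar sign; str.replace treats "$" literally (no regex needed)
    txt = txt.replace("$", "")
    try:
        txt = int(txt)
    except ValueError:
        # Any text int() cannot parse, including a value that still
        # contains a comma, falls back to 0
        txt = 0
    return txt

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)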

jeopardy['clean_value'].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64
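
One thing to watch: `int()` rejects strings with thousands separators, so a value like "$1,000" would become 0 with the function above. I haven't checked here whether the Value column contains such values, so treat the following as a sketch under that assumption. The sample values are made up.

```python
import pandas as pd

values = pd.Series(["$200", "$1,000", "None"])  # made-up sample values

clean = (
    values.str.replace(r"[$,]", "", regex=True)  # drop "$" and ","
          .pipe(pd.to_numeric, errors="coerce")   # non-numeric text -> NaN
          .fillna(0)                              # same fallback as normalize_value
          .astype(int)
)
print(clean.tolist())  # [200, 1000, 0]
```

If comma-formatted values are present, switching to this would change the high/low counts reported further down.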

##### Normalize date columns

The Air Date column should also be a datetime, not a string, to enable you to work with it more easily.

In [28]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [29]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

### Evaluating Preparation Strategies
***

To decide whether to study past questions, study general knowledge, or not study at all, it helps to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.


##### Investigate how often the answer is deducible from the question


In [30]:
def count_answer_in_question(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for w in split_answer:
        if w in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count_answer_in_question, axis=1)
mean_answer_in_question = jeopardy['answer_in_question'].mean()
print(mean_answer_in_question * 100)
    

4.558914259399283


On average, only about 4.6% of the words in an answer also appear in its question, and the only common word I exclude so far is "the". Relying on words in the question to deduce the answer is therefore not a viable strategy for Jeopardy.
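
The figure above depends on how the text is tokenized. As a rough sketch of a variant, the function below strips punctuation from tokens and skips a small stopword list before computing the same fraction. The name `answer_overlap`, the `STOPWORDS` set and the sample strings are all made up for illustration.

```python
import string

STOPWORDS = {"the", "a", "an", "of", "and"}  # illustrative, not exhaustive

def answer_overlap(question, answer, stopwords=STOPWORDS):
    """Fraction of non-stopword answer tokens that also appear in the question."""
    strip = str.maketrans("", "", string.punctuation)
    q_tokens = set(question.lower().translate(strip).split())
    a_tokens = [t for t in answer.lower().translate(strip).split()
                if t not in stopwords]
    if not a_tokens:
        return 0.0
    return sum(t in q_tokens for t in a_tokens) / len(a_tokens)

# "mexico" matches, "new" does not
print(answer_overlap("This state, Arizona's neighbor, borders Mexico", "New Mexico"))  # 0.5
```

Applied to the full dataset, this might move the average somewhat, but I have not run it here.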

#####  Investigate how often new questions are repeats of older ones.

I can't completely answer this question, because I only have about 10% of the full Jeopardy question dataset, but I can investigate it at least.


In [31]:
question_overlap = []
terms_used = set()
# Not applied: rows are processed in file order, not sorted by air date
# jeopardy = jeopardy.sort_values(by='Air Date', ascending=True)

for i, row in jeopardy.iterrows():
    templist = row['clean_question'].split()
    split_question = [x  for x in templist if len(x) >= 6]
    
    match_count = 0
    for w in split_question:
        if w in terms_used:
            match_count += 1
        terms_used.add(w)
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
question_overlap_mean = jeopardy['question_overlap'].mean()
print(question_overlap_mean)


0.609511735526003


On average, about 61% of the long terms (six or more characters) in a question had already appeared in an earlier row. This result comes from a small subset of the data and compares single terms rather than whole phrases, so it doesn't show that entire questions are reused. It does suggest that the recycling of old questions is worth looking into further.
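
Since "new" and "old" only make sense in air-date order, here is a sketch of the same overlap measure computed after sorting by `Air Date`. The function name `term_overlap` and the toy data are made up, and unlike the loop above, a word repeated within one question is not counted as a match against itself.

```python
import pandas as pd

toy = pd.DataFrame({
    "Air Date": pd.to_datetime(["2005-01-02", "2004-12-31", "2005-01-01"]),
    "clean_question": [
        "galileo observed jupiter",
        "galileo studied planets",
        "jupiter has many moons",
    ],
})

def term_overlap(df, min_len=6):
    """Per-question share of long terms already seen, in air-date order."""
    seen = set()
    overlaps = []
    for question in df.sort_values("Air Date")["clean_question"]:
        words = [w for w in question.split() if len(w) >= min_len]
        matches = sum(w in seen for w in words)
        seen.update(words)  # update after counting, so only earlier questions match
        overlaps.append(matches / len(words) if words else 0)
    return overlaps

print(term_overlap(toy))
```

In the toy data, only the latest question reuses earlier terms ("galileo" and "jupiter" out of three long words).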

#####  Figure out which terms correspond to high-value questions

Let's say you only want to study questions that pertain to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

In [33]:
# define a function to classify clean_value as
# high (1) or low (0) value
def determine_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

def count_usage(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_expected = []
comparison_terms = list(terms_used)[0:5]

for term in comparison_terms:
    observed_expected.append(count_usage(term))
observed_expected

[(1, 0), (0, 3), (0, 1), (1, 0), (0, 2)]
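
`count_usage` walks the whole DataFrame once per term, which gets slow when testing many terms. As a sketch of an alternative, the counts for every term can be collected in a single pass. The toy data and the name `count_usage_fast` are made up for illustration.

```python
from collections import Counter
import pandas as pd

toy = pd.DataFrame({
    "clean_question": [
        "galileo studied planets",
        "planets orbit the sun",
        "galileo was italian",
    ],
    "clean_value": [200, 1000, 2000],
})
# Vectorized version of the high/low flag (high means value > 800)
toy["high_value"] = (toy["clean_value"] > 800).astype(int)

high_counts, low_counts = Counter(), Counter()
for question, is_high in zip(toy["clean_question"], toy["high_value"]):
    target = high_counts if is_high else low_counts
    # set() so a word counts once per question, like `word in split_question`
    target.update(set(question.split()))

def count_usage_fast(word):
    """Return (high_count, low_count) for a term from the precomputed counters."""
    return high_counts[word], low_counts[word]

print(count_usage_fast("galileo"))  # (1, 1)
```

A `Counter` returns 0 for terms it has never seen, so no special case is needed for missing words.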

In [34]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
print(high_value_count)

4972


In [36]:
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
print(low_value_count)

15027


In [45]:
chi_squared = []
for item in observed_expected:
    total = sum(item)

    total_prop = total / jeopardy.shape[0]
    expected_high_value = total_prop * high_value_count
    expected_low_value = total_prop * low_value_count
    
    observed = np.array([item[0], item[1]])
    expected = np.array([expected_high_value, expected_low_value])
    
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=0.9926132960670793, pvalue=0.3191044998242515),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913672)]

None of the five terms shows a statistically significant difference in usage between high-value and low-value rows (all p-values are above 0.05). The observed counts are also all below 5, which is too few for the chi-squared test to be reliable. It would be better to run this test only on terms that have higher frequencies.
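
As a sketch of that follow-up, the helper below skips any term whose expected count in either group falls below a threshold and only then runs the test. The function name `term_chi_squared`, the threshold of 5 and the counts (30, 40) are assumptions for illustration. Only the group sizes 4972 and 15027 come from the output above.

```python
import numpy as np
from scipy.stats import chisquare

def term_chi_squared(high_count, low_count, n_high, n_low, min_expected=5):
    """Chi-squared test for one term, or None if an expected count is too small."""
    total = high_count + low_count
    n_rows = n_high + n_low
    # Expected counts if the term were spread evenly across all rows
    expected = np.array([total * n_high / n_rows, total * n_low / n_rows])
    if expected.min() < min_expected:
        return None
    return chisquare(np.array([high_count, low_count]), expected)

# A term seen once is skipped rather than tested
print(term_chi_squared(1, 0, 4972, 15027))  # None

# Made-up counts for a hypothetical frequent term
result = term_chi_squared(30, 40, 4972, 15027)
print(result.statistic, result.pvalue)
```

Filtering on expected counts rather than observed counts follows the usual rule of thumb for the chi-squared test.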