# Winning Jeopardy

## A project to practice hypothesis testing with Python

We will analyse historical Jeopardy questions to decide which strategy is best for preparing for the game.

## Data exploration and cleaning

In [1]:
import pandas as pd
jeo = pd.read_csv('jeopardy.csv')
print(jeo.head())

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  


In [2]:
print(jeo.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [3]:
jeo.columns = jeo.columns.str.lstrip()
print(jeo.columns)

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


In [4]:
jeo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [5]:
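import re

def normalize(string):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse each run of whitespace into a single space.
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    string = re.sub(r"\s+", " ", string)
    return string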

In [6]:
jeo['clean_question'] = jeo['Question'].apply(normalize)
jeo['clean_question'].head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [7]:
jeo['clean_answer'] = jeo['Answer'].apply(normalize)
jeo['clean_answer'].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
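
As a quick check of what `normalize` does, the sketch below applies the same three steps to a few made-up strings (the inputs are illustrative and are not rows from the dataset). Note that leading and trailing whitespace is collapsed to a single space rather than stripped.

```python
import re

def normalize(string):
    # Same three steps as the notebook's normalize function.
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    string = re.sub(r"\s+", " ", string)
    return string

# Hypothetical inputs chosen to exercise punctuation and whitespace handling.
examples = ["McDonald's", "No. 2: 1912 Olympian;", "  Hello,   World! "]
cleaned = [normalize(s) for s in examples]
print(cleaned)
```

If surrounding spaces matter for a later comparison, calling `.strip()` on the result would be one option.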

In [8]:
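def remove_dollar(string):
    # Keep only the digits, e.g. "$2,000" -> "2000".
    string = re.sub(r"[^0-9]", "", string)
    try:
        string = int(string)
    except ValueError:
        # No digits left (empty string): treat the value as 0.
        string = 0
    return string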

In [9]:
jeo['clean_value'] = jeo['Value'].apply(remove_dollar)
jeo['clean_value'].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64
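
The head of the column only shows `$200` values, so the fallback branch of `remove_dollar` is not visible above. The sketch below runs the same logic on a few hypothetical inputs, including a value with a thousands separator and a string with no digits at all, which is the case the `try`/`except` maps to 0.

```python
import re

def remove_dollar(string):
    # Strip every non-digit character, then convert what is left.
    digits = re.sub(r"[^0-9]", "", string)
    try:
        return int(digits)
    except ValueError:
        # int("") fails, so a value without digits becomes 0.
        return 0

# Hypothetical inputs; "None" stands in for any value without digits.
samples = ["$200", "$2,000", "None"]
converted = [remove_dollar(s) for s in samples]
print(converted)
```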

In [10]:
jeo['Air Date'] = pd.to_datetime(jeo['Air Date'])
jeo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Show Number     19999 non-null  int64         
 1   Air Date        19999 non-null  datetime64[ns]
 2   Round           19999 non-null  object        
 3   Category        19999 non-null  object        
 4   Value           19999 non-null  object        
 5   Question        19999 non-null  object        
 6   Answer          19999 non-null  object        
 7   clean_question  19999 non-null  object        
 8   clean_answer    19999 non-null  object        
 9   clean_value     19999 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


## Analysis

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer can be deduced from the question.

* How often questions are repeated.

We can answer the first question by seeing how many times words in the answer also occur in the question.
We can answer the second question by seeing how often complex words (6 or more characters) reoccur.

In [11]:
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeo["answer_in_question"] = jeo.apply(count_matches, axis=1)

In [12]:
jeo["answer_in_question"].mean()

0.05900196524977763

On average, only about 6% of the words in an answer also appear in its question. This is a small fraction, so we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.
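
To see what the `answer_in_question` score measures, the sketch below applies the same matching rule to three made-up rows, represented as plain dictionaries instead of DataFrame rows. The rows are illustrative assumptions, not entries from `jeopardy.csv`.

```python
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    # Ignore one leading "the", as in the notebook's version.
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    # Fraction of answer words that also appear in the question.
    match_count = sum(1 for item in split_answer if item in split_question)
    return match_count / len(split_answer)

# Hypothetical rows: full match, half match, no match.
toy_rows = [
    {"clean_answer": "the eiffel tower",
     "clean_question": "this paris tower is named after the engineer gustave eiffel"},
    {"clean_answer": "john adams",
     "clean_question": "this john was the second us president"},
    {"clean_answer": "arizona",
     "clean_question": "the city of yuma in this state is very sunny"},
]
scores = [count_matches(r) for r in toy_rows]
print(scores)
```

The score is a share of the answer's words, not of the question's words, so a long question with a one-word answer scores either 0 or 1.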

In [13]:
question_overlap = []
terms_used = set()
jeo = jeo.sort_values('Air Date')

for i, row in jeo.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeo['question_overlap'] = question_overlap
jeo['question_overlap'].mean()

0.6876260592169802

About 70% of the longer terms in a question have already appeared in an earlier question. This only looks at a small set of questions, and it compares single terms rather than phrases, so the figure is only a rough signal. Still, it suggests that the recycling of questions is worth a closer look.

Let's say you only want to study terms that appear in high-value questions rather than low-value questions. This will help you earn more money when you're on Jeopardy.
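
The running-set logic behind `question_overlap` can be sketched on a handful of made-up questions. The function name `overlap_fractions` and the `min_len` parameter are assumptions introduced for this illustration; the threshold of 6 mirrors the `len(q) > 5` filter in the cell above.

```python
def overlap_fractions(questions, min_len=6):
    """For each question, return the share of its long words seen earlier."""
    seen = set()
    fractions = []
    for question in questions:
        words = [w for w in question.split(" ") if len(w) >= min_len]
        matches = sum(1 for w in words if w in seen)
        # Only add the words after counting, so a question never matches itself.
        seen.update(words)
        fractions.append(matches / len(words) if words else 0)
    return fractions

# Hypothetical questions, already normalized and in air-date order.
toy_questions = [
    "ancient egyptian pharaoh",
    "famous egyptian pyramid builder",
    "pharaoh buried near pyramid",
]
fractions = overlap_fractions(toy_questions)
print(fractions)
```

Because the set only grows, later questions tend to score higher than earlier ones, which is worth keeping in mind when reading the mean.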

* Low value: any row where Value is 800 or less.
* High value: any row where Value is greater than 800.
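
The row-wise rule used in this notebook (1 when `clean_value` is greater than 800, otherwise 0) can also be written as a single vectorized comparison. This is a sketch on a toy DataFrame, not a replacement for the notebook's own cells; the values are made up to show that 800 itself counts as low value.

```python
import pandas as pd

# Toy data; 0 stands in for a row whose Value had no digits.
toy = pd.DataFrame({"clean_value": [200, 800, 1000, 2000, 0]})

# Boolean comparison on the whole column, converted to 0/1 integers.
toy["high_value"] = (toy["clean_value"] > 800).astype(int)
print(toy["high_value"].tolist())
```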

In [14]:
def high_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [15]:
jeo['high_value'] = jeo.apply(high_value, axis=1)

In [16]:
def word_scores(word):
    low_count = 0
    high_count = 0
    for i, row in jeo.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [17]:
from random import choice
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

In [18]:
observed_expected = []
for term in comparison_terms:
    observed_expected.append(word_scores(term))
observed_expected

[(1, 0),
 (0, 1),
 (0, 1),
 (0, 3),
 (8, 45),
 (0, 1),
 (0, 1),
 (0, 1),
 (1, 2),
 (1, 0)]
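
`word_scores` walks through every row once per term, which becomes slow as the list of terms grows. A possible alternative, sketched here under the assumption that the rows are available as `(clean_question, high_value)` pairs, is to count all terms in a single pass. The function name `count_terms_by_value` and the toy rows are hypothetical.

```python
from collections import Counter

def count_terms_by_value(rows, terms):
    """Return {term: (high_count, low_count)} using one pass over the rows."""
    terms = set(terms)
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, high_value in rows:
        # A term counts once per row, matching the `word in split_question` check.
        present = terms.intersection(clean_question.split(" "))
        if high_value == 1:
            high_counts.update(present)
        else:
            low_counts.update(present)
    return {t: (high_counts[t], low_counts[t]) for t in terms}

# Hypothetical rows: (normalized question, high_value flag).
toy_rows = [
    ("this ancient empire built pyramids", 1),
    ("pyramids of this ancient land", 0),
    ("this river flows through egypt", 0),
]
counts = count_terms_by_value(toy_rows, ["ancient", "pyramids", "egypt", "missing"])
print(counts)
```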

In [19]:
high_value_count = jeo['high_value'].sum()
low_value_count = len(jeo) - high_value_count
print(high_value_count)
print(low_value_count)

5734
14265


In [20]:
from scipy.stats import chisquare
import numpy as np
chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / len(jeo)
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))
chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766901714),
 Power_divergenceResult(statistic=4.777235073725721, pvalue=0.028838388504224346),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]
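
To make one of these results concrete, the fifth term (8 high-value rows, 45 low-value rows) can be recomputed by hand. The sketch below uses the row counts printed earlier and relies on the fact that, with two categories, the test has one degree of freedom, for which the p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math

total_rows = 19999
high_value_count = 5734
low_value_count = 14265

observed_high, observed_low = 8, 45
total = observed_high + observed_low

# Expected counts if the term were spread evenly across all rows.
expected_high = total / total_rows * high_value_count
expected_low = total / total_rows * low_value_count

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
statistic = ((observed_high - expected_high) ** 2 / expected_high
             + (observed_low - expected_low) ** 2 / expected_low)

# Survival function of the chi-squared distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(statistic / 2))
print(round(statistic, 4), round(p_value, 4))  # 4.7772 0.0288
```

The expected high-value count is about 15, so the term appears in high-value rows less often than its overall frequency would suggest.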

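At the 0.05 level, only one of the ten sampled terms (8 high-value rows against 45 low-value rows, p ≈ 0.029) shows a difference in usage between high value and low value rows, and with ten tests a single result at that level is weak evidence. The other nine terms each appear in fewer than 5 rows, so the chi-squared test isn't reliable for them. It would be better to run this test with only terms that have higher frequencies.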
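
One way to restrict the test to terms with higher frequencies is to skip any term whose expected counts are too small. The sketch below assumes a minimum expected count of 5, which is a common rule of thumb rather than something fixed by this notebook; the function name `chi_squared_if_frequent` and its parameters are hypothetical.

```python
import numpy as np
from scipy.stats import chisquare

def chi_squared_if_frequent(high_obs, low_obs, high_total, low_total, min_expected=5):
    """Run the chi-squared test only when both expected counts are large enough."""
    n_rows = high_total + low_total
    share = (high_obs + low_obs) / n_rows
    expected = np.array([share * high_total, share * low_total])
    if expected.min() < min_expected:
        # Too few occurrences for the test to be trustworthy.
        return None
    return chisquare(np.array([high_obs, low_obs]), expected)

# A few of the (high, low) pairs observed above, with the notebook's row totals.
pairs = [(1, 0), (0, 3), (8, 45), (1, 2)]
results = {pair: chi_squared_if_frequent(*pair, 5734, 14265) for pair in pairs}
for pair, result in results.items():
    print(pair, result)
```

With this threshold only the (8, 45) pair is tested; the rare terms return `None` instead of a p-value that would be hard to interpret.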