<h2>Winning Jeopardy</h2>

In this project we use a dataset of Jeopardy questions to gain some insight into the types of questions asked and what we should focus on to improve our chances of winning.  
The dataset can be obtained from this link:  
https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

In [130]:
import pandas as pd
import numpy as np
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
from matplotlib import pyplot as plt
import random
import re
%matplotlib inline

In [99]:
jeopardy = pd.read_csv('jeopardy.csv', parse_dates=[' Air Date'])

In [100]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [101]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [102]:
jeopardy.rename(columns={' Air Date': 'Air Date', ' Round': 'Round', ' Category': 'Category', ' Value':'Value', ' Question':'Question', ' Answer': 'Answer'}, inplace=True)

In [103]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null datetime64[ns]
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.1+ MB


In [104]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

First, we should normalize the values in the Question and Answer columns.

In [105]:
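def clean_col(col):
    # Lowercase so that matching is case-insensitive
    text = col.lower()
    # Drop everything except letters, digits and whitespace (raw strings avoid escape warnings)
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text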

In [106]:
jeopardy['clean_question'] = jeopardy['Question'].apply(clean_col)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(clean_col)
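
To make the effect of the normalization concrete, here is a small self-contained sketch that applies the same three steps to an illustrative string (the sample text is made up for the example and is not read from the dataset):

```python
import re

def clean_col(text):
    """Lowercase, drop punctuation, collapse whitespace (same steps as above)."""
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

# Illustrative input, not taken from the dataset
sample = 'In 1963, live on "The Art Linkletter Show"'
cleaned = clean_col(sample)
print(cleaned)
```

Commas and quotation marks disappear and everything is lowercased, so later word-by-word comparisons are not thrown off by punctuation or capitalization.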

In [107]:
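def clean_value(col):
    # Strip punctuation such as "$" and "," before converting to an integer
    number = re.sub(r"[^A-Za-z0-9\s]", "", col)
    try:
        number = int(number)
    except ValueError:
        # Non-numeric values fall back to 0
        number = 0
    return number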


In [108]:
jeopardy['clean_value'] = jeopardy['Value'].apply(clean_value)
jeopardy['clean_value'] = jeopardy['clean_value'].astype(int)
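
As a quick sanity check of the value parsing, the sketch below uses a slightly different, digits-only variant of the function. The inputs `"$2,000"` and `"None"` are assumed examples of what the Value column might contain; they do not appear in the output shown above.

```python
import re

def parse_value(raw):
    """Keep only the digits of a dollar string; return 0 when none remain."""
    digits = re.sub(r"[^0-9]", "", raw)
    return int(digits) if digits else 0

# Assumed example inputs (hypothetical, for illustration only)
examples = ["$200", "$2,000", "None"]
parsed = [parse_value(v) for v in examples]
print(parsed)
```

Checking for an empty string avoids the `try`/`except` entirely, at the cost of silently mapping any non-numeric text to 0.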


In this section we answer the question:  
How often can the answer be deduced from the question?

In [109]:
def calc_match(row):
    match_count = 0
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count +=1
    return match_count / len(split_answer)

In [110]:
jeopardy['answer_in_question'] = jeopardy.apply(calc_match, axis=1)
mean = jeopardy['answer_in_question'].mean()
print(mean)

0.05900196524977763


So on average only about 6% of the words in an answer also appear in the question. This is a relatively low share, so we shouldn't count on deducing the answer from the question to win.
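
The per-row score behind that mean can be illustrated on a toy pair of strings. This is a minimal sketch of the same logic as `calc_match`, written as a standalone function; the name `match_fraction` and the example question are made up for illustration.

```python
def match_fraction(clean_question, clean_answer):
    """Fraction of answer words (ignoring one 'the') that occur in the question."""
    answer_words = clean_answer.split()
    question_words = clean_question.split()
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    hits = sum(1 for word in answer_words if word in question_words)
    return hits / len(answer_words)

# Hypothetical, already-normalized question and answer
score = match_fraction("this adams was the second us president", "john adams")
print(score)
```

Here one of the two answer words ("adams") occurs in the question, so the score is one half; averaging such scores over all rows gives the figure reported above.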

<h3> Repeated Questions</h3>

We only have about 10% of the full set of questions, so we cannot get an exact answer, but this sample can still give us a rough estimate.

In [111]:
question_overlap = []
terms_used = set([])

In [112]:
jeopardy.sort_values('Air Date',axis=0, inplace=True)

In [113]:
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count+=1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
print(jeopardy['question_overlap'].mean())

0.6894031359073245


In this sample, on average about 69% of the longer words (six or more characters) in a question had already appeared in an earlier question. This doesn't necessarily mean that questions are recycled, since these are single tokens and not full questions, but it does give us reason to dig deeper into this question.
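
To show how the overlap score behaves, here is a small sketch of the same bookkeeping on three made-up questions, processed in order. The function name `overlap_scores` and the toy questions are assumptions for illustration only.

```python
def overlap_scores(questions, min_len=6):
    """For each question, the share of long words already seen in earlier text."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split(" ") if len(w) >= min_len]
        hits = 0
        for word in words:
            if word in seen:
                hits += 1
            seen.add(word)
        scores.append(hits / len(words) if words else 0)
    return scores

# Hypothetical normalized questions, in air-date order
toy = [
    "this french emperor was exiled to elba",
    "this french painter founded impressionism",
    "the capital of peru",
]
scores = overlap_scores(toy)
print(scores)
```

Only "french" in the second question has been seen before, so it scores one out of four long words, while the first and third questions score zero. A common word shared between unrelated questions is enough to raise the score, which is why a high mean overlap is suggestive rather than conclusive.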

In [114]:
jeopardy['high_value'] = jeopardy.apply(lambda row: 1 if row['clean_value'] > 800 else 0,axis=1)

In [115]:
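def calc_high_low(word):
    # Count how many high-value and low-value questions contain the word
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                # Must be an augmented assignment; a bare `low_count+1` discards the result
                low_count += 1
    return high_count, low_count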

    

In [121]:
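# random.sample needs a sequence; passing a set is no longer supported in recent Python versions
comparison_terms = random.sample(list(terms_used), 10)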
print(comparison_terms)

['underwater', 'camping', 'crooners', 'events', 'mussolini', 'targetblankexhibita', 'charleston', 'spleen', 'needle', 'facing']


In [122]:
observed_expected = []
for word in comparison_terms:
    observed_expected.append(calc_high_low(word))

In [123]:
print(observed_expected)

[(2, 0), (0, 0), (0, 0), (3, 0), (1, 0), (1, 0), (0, 0), (0, 0), (1, 0), (0, 0)]


In [124]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

In [125]:
chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared


invalid value encountered in true_divide



[Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=nan, pvalue=nan),
 Power_divergenceResult(statistic=nan, pvalue=nan),
 Power_divergenceResult(statistic=7.463376351587025, pvalue=0.006296679668748999),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=nan, pvalue=nan),
 Power_divergenceResult(statistic=nan, pvalue=nan),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=nan, pvalue=nan)]

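Some of the p-values are below 5% and would nominally be significant. However, the observed frequencies are far too low for the chi-squared test to be reliable: no term has more than three high-value hits, every low-value count is recorded as 0, and the terms with a total count of 0 produce the nan results (the expected counts are 0, hence the division warning). We therefore cannot treat these results as statistically meaningful.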
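
The expected-count arithmetic used in the loop above can be sketched without SciPy. The example below is a minimal illustration under stated assumptions: the totals (250 high-value and 750 low-value questions) are hypothetical, the helper names are made up, and the p-value formula is specific to one degree of freedom, which is the case for two categories.

```python
import math

def chi_squared_1dof(observed, expected):
    """Pearson chi-squared statistic and p-value for two cells (1 degree of freedom)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the survival function reduces to erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def term_test(high_obs, low_obs, high_total, low_total):
    """Compare a term's high/low split with the overall high/low split."""
    total = high_obs + low_obs
    if total == 0:
        # No occurrences: expected counts would be 0 and the statistic undefined
        return None
    n = high_total + low_total
    expected = (total * high_total / n, total * low_total / n)
    return chi_squared_1dof((high_obs, low_obs), expected)

# Hypothetical counts: a term seen twice, both times in high-value questions
stat, p_value = term_test(2, 0, high_total=250, low_total=750)
print(stat, p_value)
```

Returning `None` for terms with no occurrences avoids the nan results, and the tiny expected counts in the example (0.5 and 1.5) show why a nominally small p-value from so few observations should not be trusted.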

Let's look at the category frequencies and see whether there is a connection between the category and high-value questions. Looking just at the frequencies, the top 5 categories are listed below (History, American History and Before & After are tied at about 0.20% each):  
1. Television  
2. U.S. Geography  
3. Literature  
4. History  
5. American History

In [126]:
jeopardy['Category'].value_counts(normalize=True)*100

TELEVISION                          0.255013
U.S. GEOGRAPHY                      0.250013
LITERATURE                          0.225011
HISTORY                             0.200010
AMERICAN HISTORY                    0.200010
BEFORE & AFTER                      0.200010
AUTHORS                             0.195010
WORD ORIGINS                        0.190010
WORLD CAPITALS                      0.185009
BODIES OF WATER                     0.180009
SPORTS                              0.180009
RHYME TIME                          0.175009
SCIENCE & NATURE                    0.175009
MAGAZINES                           0.175009
SCIENCE                             0.175009
WORLD GEOGRAPHY                     0.165008
WORLD HISTORY                       0.160008
ANNUAL EVENTS                       0.160008
HISTORIC NAMES                      0.160008
BIRDS                               0.155008
IN THE DICTIONARY                   0.155008
FICTIONAL CHARACTERS                0.155008
U.S. PRESI

If we look at the category frequencies split by low-value and high-value questions, we see the following:

In [128]:
# Contingency table: one row per category, one column per high_value flag (0 or 1)
observed = pd.crosstab(jeopardy['Category'], [jeopardy['high_value']])
print(observed)

high_value                    0  1
Category                          
"A" IN SCIENCE                2  3
"A" PLUS                      2  3
"A" SCIENCE CATEGORY          1  3
"A"NCIENT GREEKS              2  3
"AA"                          4  1
"AD"JECTIVES                  4  1
"AE"-NCIENT CROSSWORD CLUES   2  3
"AI"                          2  3
"ANT" INFESTATION             2  3
"AS" YOU LIKE IT              2  3
"AW", SHUCKS                  3  2
"B" IN FASHION                4  1
"B" IN GEOGRAPHY              4  1
"B" MOVIES                    5  0
"B" PREPARED                  4  1
"B-I"                         2  3
"BACK" WORDS                  5  0
"BARN"S                       4  1
"BAT" TOOLS                   4  1
"BB" BOOKS                    2  3
"BEA"S                        5  0
"BLACK" OR "WHITE"            4  1
"BOO"!                        2  3
"BOOK"S                       4  1
"BY" NOW                      4  1
"C" CREATURES                 5  0
"C" PLUS            

In [131]:
chisq_value, pvalue, df, expected = chi2_contingency(observed)
print(pvalue)

1.3164918947704949e-21


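So there seems to be a connection between the category and high-value questions, since the p-value is far below 5%. However, many categories contain only a handful of questions in this sample (about five each in the rows shown above), so the expected counts in the contingency table are very small and the chi-squared approximation is unreliable here. The result should be read with caution.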
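
One way to check how small the expected counts are is to compute them directly from the row and column totals. The sketch below does this for three rows copied from the crosstab output above ("A" IN SCIENCE, "AA" and "B" MOVIES); the helper name `expected_counts` is made up, and the threshold of 5 is the usual rule of thumb rather than anything taken from this notebook.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Rows from the crosstab above, as (low-value, high-value) counts
toy = [[2, 3], [4, 1], [5, 0]]
expected = expected_counts(toy)

# Count cells whose expected value falls below the common rule-of-thumb of 5
small_cells = sum(1 for row in expected for e in row if e < 5)
print(expected, small_cells)
```

In this small table every expected count is below 5. If the same holds for most cells of the full table, grouping rare categories together before testing would be one option to consider.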