# Winning Jeopardy with Data Science #

The purpose of this project is to explore Jeopardy! questions and investigate whether there are patterns that can help a contestant win.  For example, does a question contain a clue to its answer?  Do terms overlap between questions?  Are there some questions that might give a better return on the time invested in learning them?  Read on to find out.  

The project uses a data set from <a href='https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/'>here</a> and includes:
* Data cleaning and other preparatory activities
* Data analysis 


In [10]:
import pandas as pd
import numpy as np
import re
jeopardy = pd.read_csv('jeopardy.csv')

In [11]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [12]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1    Air Date    19999 non-null  object
 2    Round       19999 non-null  object
 3    Category    19999 non-null  object
 4    Value       19999 non-null  object
 5    Question    19999 non-null  object
 6    Answer      19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [13]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [14]:
# Remove spaces
jeopardy.columns = jeopardy.columns.str.strip()

In [15]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [16]:
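# function to normalise questions and answers: lowercase the text and strip punctuation and underscores, so differences that do not add meaning (e.g. capital letters) are removed
def normalise(dirty_str):
    dirty_str = dirty_str.lower()
    # drop everything except word characters and whitespace
    dirty_str = re.sub(r'[^\w\s]', '', dirty_str)
    # \w also matches underscores, so remove them separately
    clean_str = re.sub(r'_', '', dirty_str)
    return clean_str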

In [17]:
# save the cleaned questions and answers in separate columns, keeping the originals in case a comparison is needed for troubleshooting
jeopardy['clean_question'] = jeopardy['Question'].apply(normalise)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalise)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


In [18]:
# function to remove non-numeric currency characters ($ and ,) so that the value can be treated as a number
def normalise_currency(currency):
    currency = re.sub(r'[$,]', '', currency)
    try:
        currency = int(currency)
    except Exception:
        # anything that still cannot be parsed as an integer is counted as 0
        currency = 0
    return currency

In [19]:
# apply the function and spot-check a random sample of values
jeopardy['clean_value'] = jeopardy['Value'].apply(normalise_currency)
jeopardy['clean_value'].sample(5, random_state=1)

11455     500
8110      800
16829     200
6398     1000
1544     2000
Name: clean_value, dtype: int64

In [20]:
# convert string column to date
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [21]:
# function returning the fraction of answer words (ignoring 'the') that also appear in the question, i.e. how much of the answer the question gives away
def answer_in_question(answer, question):
    split_answer = answer.split()
    split_question = question.split()
    match_count = 0
    split_answer = [word for word in split_answer if word != 'the']
    if len(split_answer) == 0:
        return 0
    for w in split_answer:
        if w in split_question:
            match_count += 1
    return match_count/len(split_answer)

In [22]:
# apply the function row by row; the mean is then the average share of answer words that also appear in the question
jeopardy['answer_in_question'] = jeopardy.apply(lambda row : answer_in_question(row['clean_answer'],row['clean_question']), axis = 1)

In [23]:
jeopardy['answer_in_question'].mean()

0.058347444789267004

### Is it possible to see a clue to the answer in the question? ###

Rarely. On average only about 5.8% of an answer's words also appear in its question, so the question seldom gives the answer away.
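
To make the metric concrete, here is a minimal sketch of the same calculation on two made-up, already-normalised clue/answer pairs. The sample strings are illustrative and are not rows from the data set, and the helper name `fraction_of_answer_in_question` is hypothetical; it mirrors `answer_in_question` above.

```python
def fraction_of_answer_in_question(answer, question):
    """Share of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in answer.split() if w != 'the']
    if not answer_words:
        return 0
    question_words = set(question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up examples, not taken from jeopardy.csv
full_hint = fraction_of_answer_in_question('the red sea', 'this sea is named for a red algae bloom')
no_hint = fraction_of_answer_in_question('arizona', 'the city of yuma in this state')
print(full_hint, no_hint)
```

A value of 1 means every answer word is already in the question and 0 means none is, so a mean near 0.058 says most questions share no words with their answers.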

In [24]:
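# check how many long terms (6 or more characters) in each question have already appeared in an earlier question
question_overlap = []
terms_used = []

# note: rows are processed in their current (file) order; sorting only the 'Air Date' column would not reorder them, since column assignment realigns on the index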
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [w for w in split_question if len(w) >=6]
    match_count = 0
    for w in split_question:
        if w in terms_used:
            match_count += 1
        else:
            terms_used.append(w)
    
    if len(split_question) > 0:
        match_count = match_count/len(split_question)
        
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

print(jeopardy['question_overlap'].mean())
jeopardy['question_overlap'].describe()

0.6925935056088502


count    19999.000000
mean         0.692594
std          0.298265
min          0.000000
25%          0.500000
50%          0.750000
75%          1.000000
max          1.000000
Name: question_overlap, dtype: float64

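### Is there a term overlap between questions? ###

Yes. On average about 69% of the long terms (6 or more characters) in a question have already appeared in an earlier row of the data set, so term overlap between questions is substantial.  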
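
The overlap calculation can be sketched on a handful of made-up questions. This version is an illustrative variant rather than the notebook's exact cell: it tracks seen terms in a `set`, which makes the membership test faster than scanning a list. The function name `overlap_ratios` and the sample questions are hypothetical.

```python
def overlap_ratios(questions, min_len=6):
    """For each question, the share of its long terms already seen in earlier questions."""
    seen = set()
    ratios = []
    for question in questions:
        words = [w for w in question.split(' ') if len(w) >= min_len]
        matches = 0
        for w in words:
            if w in seen:
                matches += 1
            else:
                seen.add(w)
        # questions without any long term get an overlap of 0
        ratios.append(matches / len(words) if words else 0)
    return ratios

# Made-up, already-normalised questions
sample = [
    'this country borders france',
    'the capital of this country is madrid',
    'a short one',
]
ratios = overlap_ratios(sample)
print(ratios)
```

Because a term only counts as a match if it was seen earlier, the result depends on the order in which the questions are processed.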



In [25]:
# Classify questions worth more than $800 as high value (1) and all others as low value (0)
def low_high(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value


In [26]:
jeopardy['high_value'] = jeopardy.apply(low_high, axis = 1)

In [27]:
jeopardy['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [28]:
# count how many high-value and low-value questions contain the given word
def high_low_cnt(word):
    low_count = 0
    high_count = 0
    
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count

In [29]:
# select 10 random terms
import random
comparison_terms = []

for i in range(10):
    comparison_terms.append(random.choice(terms_used))

In [32]:
# count the high- and low-value questions that contain each random term
observed_expected = []

for term in comparison_terms:
    observed_expected.append(high_low_cnt(term))

In [31]:
comparison_terms

['guisewite',
 'hostile',
 'psallein',
 'glistening',
 'dreamy',
 'hrefhttpwwwjarchivecommedia20070330j02jpg',
 'references',
 'payments',
 'specified',
 'insert']

In [33]:
# observed (high_count, low_count) for each of the random terms
observed_expected

[(0, 1),
 (1, 1),
 (0, 1),
 (0, 2),
 (0, 2),
 (0, 1),
 (1, 0),
 (1, 0),
 (1, 0),
 (0, 1)]

In [34]:
high_value_count = jeopardy.loc[jeopardy['high_value'] == 1, 'high_value'].count()
low_value_count = jeopardy.loc[jeopardy['high_value'] == 0, 'high_value'].count()

print('high_value_count ->', high_value_count)
print('low_value_count ->', low_value_count)

high_value_count -> 5734
low_value_count -> 14265


## Are some terms more likely to appear in high-value questions? ##

Use a chi-squared test to see whether the way a term is split between low- and high-value questions differs significantly from the split of questions overall.

Null hypothesis: a term is no more likely to appear in high-value questions than in low-value ones, i.e. it doesn't matter which terms we learn, it won't help us.  
Alternative hypothesis: there is a difference, i.e. learning some terms might yield a better return.  

Assume that the threshold for significance is 5% (0.05).
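
As a worked example of the test for a single term, the sketch below uses only the standard library. It takes the high and low question counts from the notebook (5734 and 14265) and evaluates a term seen once, in a low-value question. The helper name `chi_squared_for_term` is hypothetical, and the `erfc` shortcut for the p-value is an assumption that only holds for one degree of freedom, which is the case with two categories.

```python
import math

HIGH_ROWS = 5734     # high-value questions in the data set
LOW_ROWS = 14265     # low-value questions in the data set
TOTAL_ROWS = HIGH_ROWS + LOW_ROWS

def chi_squared_for_term(observed_high, observed_low):
    """Chi-squared statistic and p-value for one term (1 degree of freedom)."""
    total = observed_high + observed_low
    # counts expected if the term were spread like the questions overall
    expected_high = total * HIGH_ROWS / TOTAL_ROWS
    expected_low = total * LOW_ROWS / TOTAL_ROWS
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

statistic, p_value = chi_squared_for_term(0, 1)
print(statistic, p_value)  # roughly 0.402 and 0.526
```

With a single occurrence the expected counts are well below 1, which is why such a term can never come close to the 0.05 threshold.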


In [37]:
# Use chi-squared to see if there is a statistically significant difference between terms found in low- and high-value questions

from scipy.stats import chisquare
import numpy as np

chi_squared = []
for tup in observed_expected:
    # tup is (high_count, low_count) for one term
    total = tup[0] + tup[1]
    # share of all questions that contain the term
    total_prop = total / len(jeopardy)
    # counts expected if the term were spread like the questions overall
    exp_high = total_prop * high_value_count
    exp_low = total_prop * low_value_count
    expected = np.array([exp_high, exp_low])
    observed = np.array([tup[0], tup[1]])
    chi_squared.append(chisquare(observed, expected))

In [38]:
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

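## Conclusions of Chi-Squared Test ##

Taking 0.05 as the threshold for significance, none of the 10 terms has a p-value below it (the smallest is about 0.11), so we cannot reject the null hypothesis: there is no significant difference in how these terms are split between high- and low-value questions.  However, each sampled term appears only once or twice in the data set, which is too few for the chi-squared test to be reliable.  To reach a solid conclusion, this test needs to be re-run with more frequently used terms.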
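
One possible way to pick more frequently used terms is to count how many questions each long term appears in and keep the most common ones instead of sampling at random. The sketch below runs on made-up questions; with the real data the input would presumably be the `clean_question` column. The helper name `most_common_terms` and its parameters are hypothetical.

```python
from collections import Counter

def most_common_terms(questions, n=10, min_len=6):
    """Return the n long terms that occur in the most questions, with their counts."""
    counts = Counter()
    for question in questions:
        # a set, so a term repeated inside one question is counted once
        counts.update({w for w in question.split() if len(w) >= min_len})
    return counts.most_common(n)

# Made-up, already-normalised questions
sample_questions = [
    'this country borders france',
    'the capital of this country is madrid',
    'this country won the world cup in 2010',
]
top_terms = most_common_terms(sample_questions, n=1)
print(top_terms)
```

Terms chosen this way have larger observed and expected counts, which is what the chi-squared test needs to give a meaningful result.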