# Introduction

In this project, I'll work with a dataset of Jeopardy questions and perform a statistical analysis of it, to identify any patterns in the questions that could help a contestant win on a future episode of the show.

In [231]:
import pandas as pd
import numpy as np
import string
from random import choice
from scipy.stats import chisquare

In [232]:
jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [233]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

We can see that some of the column names have a leading space, so we'll go ahead and remove these.

In [234]:
jeopardy.columns = jeopardy.columns.str.lstrip()

jeopardy.columns        

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [235]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


We can see that every column apart from the show number has the object datatype.
To perform any meaningful analysis, we'll need to normalize the Question and Answer columns so that they're uniform (lowercase, with punctuation removed).  We'll create a function that cleans these columns.

# Cleaning the Columns

In [236]:
def string_cleaner(s):
    s = s.lower()
    s = s.translate(str.maketrans('', '', string.punctuation)) #removes all punctuation
    return s
    

In [237]:
jeopardy['clean_question'] = jeopardy['Question'].apply(string_cleaner)

jeopardy['clean_answer'] = jeopardy['Answer'].apply(string_cleaner)
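
As a quick sanity check, the cleaning logic can be tried on a few strings before trusting it on the whole column. The sketch below repeats the same logic as `string_cleaner`; the sample strings are hypothetical, loosely modelled on values in the dataset.

```python
import string

def string_cleaner(s):
    """Lowercase a string and strip ASCII punctuation (same logic as above)."""
    return s.lower().translate(str.maketrans('', '', string.punctuation))

# Hypothetical sample inputs, loosely modelled on values in the dataset
samples = ["McDonald's", "Crate & Barrel", "In the winter of 1971-72, a record 1,122 inches"]
cleaned = [string_cleaner(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

One side effect worth knowing about: removing punctuation outright joins hyphenated numbers such as "1971-72" into a single token and leaves a double space where "&" used to be.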

In [238]:
jeopardy.head(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant,in the title of an aesop fable this insect sha...,the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way,built in 312 bc to link rome the south of ita...,the appian way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan,no 8 30 steals for the birmingham barons 2306 ...,michael jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington,in the winter of 197172 a record 1122 inches o...,washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel,this housewares store was named for the packag...,crate barrel


Now that the Question and Answer columns have been normalized, let's convert the Value column to a numeric column and the Air Date column to a datetime column.

In [239]:
def currency_cleaner(s):
    s = s.translate(str.maketrans('', '', string.punctuation)) #remove all punctuation, e.g. '$' and ','
    try:
        s = int(s)
    except ValueError:
        s = 0 #values that aren't numbers are treated as 0
    return s

In [240]:
jeopardy['clean_value'] = jeopardy['Value'].apply(currency_cleaner)
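
The same conversion could probably also be done without a row-by-row function. The sketch below is an alternative, not the approach used in this notebook: it strips only `$` and `,` and lets `pd.to_numeric` turn anything unparseable into `NaN`, which is then filled with 0. The sample values, including the `'None'` placeholder, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample of raw Value strings; 'None' stands in for a missing value
values = pd.Series(['$200', '$1,000', 'None'])

# Strip the dollar sign and thousands separator, then coerce to numbers
stripped = values.str.replace(r'[$,]', '', regex=True)
clean = pd.to_numeric(stripped, errors='coerce').fillna(0).astype(int)

print(clean.tolist())
```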

In [241]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [242]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [243]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


# Identifying how often answers are used for a question

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to identify two things:

* How often the answer can be deduced from the question

* How often questions are repeated

We'll begin with the first of these by creating a function that calculates the proportion of words in the answer that also occur in the question.

In [244]:
def count_matches(row):
    """Return the fraction of answer words that also appear in the question."""
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if 'the' in split_answer:
        split_answer.remove('the') # drops only the first 'the', which is too common to be a meaningful match
    if len(split_answer) == 0:
        return 0 # to avoid division by 0 later
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [245]:
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)
mean = jeopardy['answer_in_question'].mean()
mean

0.058861482035140716

On average, only about 6% of the words in an answer also appear in its question, which isn't a particularly high proportion.  We can't count on deducing the answer from the question, so studying looks like the better strategy.
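
To see what the score means for individual rows, here is a small worked example. It repeats the logic of `count_matches` and applies it to two cleaned question/answer pairs that appear in the dataset (plain dictionaries stand in for DataFrame rows).

```python
def count_matches(row):
    """Fraction of answer words that also appear in the question (same logic as above)."""
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if 'the' in split_answer:
        split_answer.remove('the')  # removes only the first 'the'
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for item in split_answer if item in split_question)
    return match_count / len(split_answer)

# Plain dicts standing in for DataFrame rows
row_a = {'clean_question': 'hindu hierarchy or a plays actors',
         'clean_answer': 'a caste cast'}
row_b = {'clean_question': 'why april 28th was a bad day for capt bligh',
         'clean_answer': 'the day of the mutiny on the bounty'}

score_a = count_matches(row_a)  # only 'a' matches: 1 of 3 words
score_b = count_matches(row_b)  # only 'day' matches: 1 of 7 remaining words
print(score_a, score_b)
```

In both rows the only match is a common word ("a", "day") rather than the substance of the answer, so even the 6% average likely overstates how much a question gives away.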

# Investigating Repeat Questions

Now we'll investigate how often questions repeat in the dataset.  Keep in mind that this is only about 10% of the whole dataset, so we can't fully answer this question, but we can get a rough idea.

In [246]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('Air Date')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    # keep only words of six or more characters, to remove words like 'the'
    split_question = [word for word in split_question if len(word) > 5]

    match_count = 0

    # count the words that already appeared in an earlier question
    for word in split_question:
        if word in terms_used:
            match_count += 1

    # record this question's words for the rows that follow
    for word in split_question:
        terms_used.add(word)

    if len(split_question) > 0:
        match_count /= len(split_question)

    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap

print(jeopardy['question_overlap'].mean())

0.687124288096678


On average, around 69% of the longer words (six or more characters) in a question had already appeared in an earlier question. This compares single words rather than whole phrases, so it is weaker evidence of repeated questions, but it is still worth investigating.
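
To make the overlap measure concrete, here is a minimal sketch of the same idea on three made-up questions, processed in order. The function name `overlap_scores` and the sample questions are hypothetical and are not part of the dataset.

```python
def overlap_scores(questions, min_len=6):
    """For each question, in order, return the share of its long words already seen earlier."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split(' ') if len(w) >= min_len]
        matches = sum(1 for w in words if w in seen)
        seen.update(words)
        scores.append(matches / len(words) if words else 0.0)
    return scores

# Made-up, already-cleaned questions in air-date order
toy_questions = [
    'this president signed the famous treaty',  # nothing seen yet
    'the treaty ended a famous conflict',       # 'treaty' and 'famous' were seen before
    'who was he',                               # no long words at all
]
scores = overlap_scores(toy_questions)
print(scores)
```

The second question scores 2/3 even though it is a different question from the first, which is why word-level overlap is only a rough indicator of repeated questions.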

# High Value vs Low Value Questions

We'll now look at whether particular terms are associated with high-value or low-value questions, which we can test with a chi-squared test.  We'll begin by splitting the questions into two categories:

* Low value - Any row where Value is 800 or less

* High value - Any row where Value is greater than 800

In [247]:
def question_values(row):
    # 1 for high value questions (over 800), 0 otherwise
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value
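
For reference, the same flag could also be computed for the whole column at once instead of row by row. This is a sketch on a made-up frame, not the cell used in this notebook.

```python
import pandas as pd

# Made-up values, including the boundary case of exactly 800
toy = pd.DataFrame({'clean_value': [200, 800, 1000, 0]})

# The comparison yields booleans; astype(int) turns them into 1/0 flags
toy['high_value'] = (toy['clean_value'] > 800).astype(int)

print(toy['high_value'].tolist())
```

Note that a value of exactly 800 lands in the low value group, matching the `> 800` comparison in `question_values`.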

In [248]:
jeopardy['high_value'] = jeopardy.apply(question_values, axis=1)
jeopardy.head(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0.0,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,0.0,0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0,0.0,0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this ...,the grand canyon,200,0.0,0.5,0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.0,0.0,0
19305,10,1984-09-21,Double Jeopardy!,HOMONYMS,$200,Hindu hierarchy or a play's actors,a caste (cast),hindu hierarchy or a plays actors,a caste cast,200,0.333333,0.0,0
19306,10,1984-09-21,Double Jeopardy!,TV TRIVIA,$200,"Last season, this series mourned the loss of S...",Hill Street Blues,last season this series mourned the loss of sg...,hill street blues,200,0.0,0.0,0
19307,10,1984-09-21,Double Jeopardy!,1789,$400,Why April 28th was a bad day for Capt. Bligh,the day of the mutiny on the Bounty,why april 28th was a bad day for capt bligh,the day of the mutiny on the bounty,400,0.142857,0.0,0
19308,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$400,Seaside resort that has a monopoly on East Coa...,"Atlantic City, New Jersey",seaside resort that has a monopoly on east coa...,atlantic city new jersey,400,0.0,0.0,0
19309,10,1984-09-21,Double Jeopardy!,LITERATURE,$400,"He wrote ""The 3 Musketeers""; his son wrote ""Ca...",(Alexandre) Dumas,he wrote the 3 musketeers his son wrote camille,alexandre dumas,400,0.0,0.0,0


In [249]:
def count_values(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
                
    return high_count, low_count

In [250]:
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)] # random sample, so the terms differ on each run

In [251]:
observed_expected = []

for term in comparison_terms:
    counts = count_values(term)
    observed_expected.append(counts)

In [252]:
observed_expected

[(0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (3, 3),
 (5, 6),
 (1, 0),
 (0, 1),
 (0, 1),
 (0, 1)]

In [253]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

In [254]:
chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_term_count = total_prop * high_value_count
    low_term_count = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_term_count, low_term_count])
    chi_squared.append(chisquare(observed, expected))
    
    

In [255]:
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.3346324449838385, pvalue=0.24798277007881886),
 Power_divergenceResult(statistic=1.5150423082236086, pvalue=0.21837128417807639),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

None of the terms has a p-value below 0.05, so none of the results is statistically significant.  Additionally, most of the term frequencies were lower than 5, and the chi-squared test is not reliable at such low counts.  A better approach would be to run this test only on terms with high frequencies.
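
One way to act on this is to skip any term whose expected counts are too small before running the test. The sketch below assumes the common rule of thumb of at least 5 expected observations per cell; the helper `expected_counts`, the totals and the per-term counts are all made up for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy.stats import chisquare

def expected_counts(high_obs, low_obs, high_total, low_total):
    """Expected (high, low) counts for a term if it were unrelated to question value."""
    share = (high_obs + low_obs) / (high_total + low_total)
    return np.array([share * high_total, share * low_total])

# Hypothetical totals and per-term (high, low) observed counts
high_total, low_total = 300, 700
term_counts = {'rare': (0, 1), 'common': (30, 20), 'typical': (15, 35)}

results = {}
for term, (high_obs, low_obs) in term_counts.items():
    expected = expected_counts(high_obs, low_obs, high_total, low_total)
    if expected.min() < 5:  # too rare for the test to be reliable, so skip it
        continue
    results[term] = chisquare(np.array([high_obs, low_obs]), expected)

for term, result in results.items():
    print(term, result.statistic, result.pvalue)
```

Here "rare" is dropped, "typical" matches its expected split almost exactly, and "common" appears in high value questions far more often than expected. On the real data, the candidate terms would come from `terms_used` and the counts from `count_values`.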