# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions

In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
import re
for  i in jeopardy.columns:
    h=re.sub('^ ','',i)
    jeopardy.rename(columns={i:h},inplace=True)

In [4]:
jeopardy.columns
# removed starting spaces from the columns names

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [5]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


convert the air date column to date time format and value to int 

In [6]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [7]:
jeopardy['Question']

0        For the last 8 years of his life, Galileo was ...
1        No. 2: 1912 Olympian; football star at Carlisl...
2        The city of Yuma in this state has a record av...
3        In 1963, live on "The Art Linkletter Show", th...
4        Signer of the Dec. of Indep., framer of the Co...
                               ...                        
19994    Of 8, 12 or 18, the number of U.S. states that...
19995                        ...& the New Power Generation
19996    In 1589 he was appointed professor of mathemat...
19997    Before the grand jury she said, "I'm really so...
19998    Llamas are the heftiest South American members...
Name: Question, Length: 19999, dtype: object

In [8]:
def normalize_text(u):
    u=u.lower()
    s = re.sub("[^A-Za-z0-9\s]", "", u)
    s = re.sub("\s+", " ", u)
    return s
def normalize_values(text):
    text = re.sub("[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

In [9]:
jeopardy['clean_value']=jeopardy['Value'].apply(normalize_values)
jeopardy['clean_questions']=jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answers']=jeopardy['Answer'].apply(normalize_text)

In [10]:
jeopardy.head()


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_value,clean_questions,clean_answers
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,200,"for the last 8 years of his life, galileo was ...",copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,200,no. 2: 1912 olympian; football star at carlisl...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,200,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,200,"in 1963, live on ""the art linkletter show"", th...",mcdonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,200,"signer of the dec. of indep., framer of the co...",john adams


In [11]:
jeopardy['Value']=jeopardy['clean_value'].astype(int)

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

How often the answer is deducible from the question.
How often new questions are repeats of older questions.
You can answer the second question by seeing how often complex words (> 6 characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second

In [12]:
def count(row):
    split_answer = row['clean_answers'].split(' ')
    split_question = row['clean_questions'].split(' ')
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer)==0:
        return 0
    for i in split_answer:
        if i in split_question:
            match_count+=1
    return (match_count/len(split_answer))
jeopardy['answer_in_question'] = jeopardy.apply(count,axis=1)

In [13]:
jeopardy['answer_in_question'].mean()

0.0455099719687949

In [14]:
jeopardy= jeopardy.sort_values('Air Date')


## Recycled questions

The answer only appears in the question about 4% of the time. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.

In [15]:
question_overlap = []
term_used = set()
for i,row  in jeopardy.iterrows():
    split_question = row['clean_questions'].split(' ')
    for i in split_question:
        if len(i)<6:
            split_question.remove(i)
    match_count=0
    for i in split_question:
        if i in term_used:
            match_count+=1
        else:
            term_used.add(i)
    if len(split_question)>0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
print(jeopardy['question_overlap'].mean())
            

0.731513661529069



## Low value vs high value questions

There is about 70% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it doesn't look at phrases, it looks at single terms. This makes it relatively insignificant, but it does mean that it's worth looking more into the recycling of question

In [16]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_value,clean_questions,clean_answers,answer_in_question,question_overlap
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,0,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,0,"adventurous 26th president, he was 1st to ride...",theodore roosevelt,0.0,0.0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,200,Notorious labor leader missing since '75,Jimmy Hoffa,200,notorious labor leader missing since '75,jimmy hoffa,0.0,0.0
19302,10,1984-09-21,Double Jeopardy!,1789,200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,200,"washington proclaimed nov. 26, 1789 this first...",thanksgiving,0.0,0.0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,200,both ferde grofe' & the colorado river dug thi...,the grand canyon,0.0,0.166667
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,200,"Depending on the book, he could be a ""Jones"", ...",Tom,200,"depending on the book, he could be a ""jones"", ...",tom,0.0,0.25


Let's say you only want to study questions that pertain to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

Low value -- Any row where Value is less than 800.
High value -- Any row where Value is greater than 800

In [17]:
def takes(row):
    if row['clean_value']>800:
        value = 1
    else:
        value = 0
    return value
jeopardy['high_value'] = jeopardy.apply(takes,axis=1)

In [18]:
def word(word):
    low_count = 0
    high_count = 0
    for i , row in jeopardy.iterrows():
        k = row['clean_questions'].split(' ')
        if word in k:
            if row['high_value']==1:
                high_count+=1
            else:
                low_count+=1
    return low_count,high_count

In [19]:
list_term_used = list(term_used)

In [20]:
from random import choice
comparision_terms=[]
for i in range(10):
    comparision_terms.append(choice(list_term_used))
comparision_terms
observed_expected = []
for i in comparision_terms:
    observed_expected.append(word(i))
observed_expected

[(1, 0),
 (0, 1),
 (3, 2),
 (1, 0),
 (13, 8),
 (1, 0),
 (1, 0),
 (1, 1),
 (0, 2),
 (3, 4)]

In [26]:
import numpy as np
from  scipy.stats import chisquare
high_value_count=jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count=jeopardy[jeopardy['high_value']==0].shape[0]
chi_squared=[]
for i in observed_expected:
    total=sum(i)
    total_prop=total/jeopardy.shape[0]
    exp_ter_count_high = total_prop*high_value_count
    exp_ter_count_low = total_prop*low_value_count
    observed = np.array([i[0],i[1]])
    expected = np.array([exp_ter_count_high,exp_ter_count_low])
    chi_squared.append(chisquare(observed,expected))
chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.3995960878537224, pvalue=0.12136658322360773),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=11.341070950389984, pvalue=0.0007581159228083234),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.6887906561130311, pvalue=0.4065760282166111)]


## Chi-squared results

None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.