# Winning at Jeopardy!

In the US, Jeopardy! is a popular game show where contestants answer questions to win cash prizes. It has been going strong for many years and has had a significant influence on popular culture.

Suppose you want to compete on Jeopardy! and are looking for any advantage you can get. In this project, we will use a dataset of Jeopardy! questions to look for patterns in the questions that might help you win.

In [1]:
import pandas as pd
import numpy as np

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')

In [3]:
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
jeopardy.rename(columns=lambda x: x.strip(), inplace=True)


In [6]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [7]:
import re

def normalize(text):
    # Lowercase, strip punctuation, and collapse runs of whitespace.
    text = text.lower()
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)

def normalize_value(value):
    # Remove the dollar sign and commas, then convert to an integer.
    # Values that cannot be parsed as a number become 0.
    value = re.sub(r'[^A-Za-z0-9\s]', '', value)
    try:
        value = int(value)
    except ValueError:
        value = 0
    return value

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
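
As a quick sanity check on the cleaning step, the sketch below applies the same idea to a few made-up strings. The helper names `normalize_text` and `dollars_to_int` and the sample inputs are illustrative assumptions, not part of the dataset or the notebook.

```python
import re

def normalize_text(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def dollars_to_int(value):
    """Turn a string such as '$2,000' into 2000; return 0 if no digits remain."""
    digits = re.sub(r"[^0-9]", "", value)
    return int(digits) if digits else 0

# Hypothetical inputs chosen to exercise punctuation, double spaces, and bad values.
print(normalize_text("McDonald's"))                     # mcdonalds
print(normalize_text("Signer of the  Dec. of Indep."))  # signer of the dec of indep
print(dollars_to_int("$2,000"))                         # 2000
print(dollars_to_int("None"))                           # 0
```

Treating unparseable values as 0 keeps the column numeric, at the cost of mixing "no dollar value" with a genuine zero.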

In [8]:
sum(jeopardy['clean_answer'].isnull())

0

In [9]:
def count_matches(row):
    # Fraction of the answer's words that also appear in the question.
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    # "the" is common and uninformative, so drop its first occurrence.
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)

In [10]:
jeopardy['answer_in_question'].value_counts()

0.000000    17475
0.500000     1448
0.333333      494
0.250000      155
1.000000      124
0.666667      104
0.200000       68
0.166667       27
0.400000       26
0.142857       21
0.750000       17
0.600000        9
0.125000        9
0.285714        7
0.800000        2
0.428571        2
0.181818        2
0.571429        2
0.300000        2
0.111111        2
0.350000        1
0.444444        1
0.875000        1
Name: answer_in_question, dtype: int64

In [11]:
jeopardy['answer_in_question'].mean()

0.05900196524977763
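
To make the metric concrete, here is a minimal sketch of the same calculation on a single made-up question and answer pair. The function name `answer_overlap` and the sample strings are assumptions for illustration only. As a small simplification, this version drops every occurrence of "the" from the answer.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical, already-normalized question and answer.
question = "this new york bridge opened in 1883"
answer = "the brooklyn bridge"

# "bridge" appears in the question, "brooklyn" does not: 1 of 2 words match.
print(answer_overlap(question, answer))  # 0.5
```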

On average, only about 6% of the words in an answer also appear in its question. That share is too small to count on working out the answer from the wording of the question alone, so we will most likely need to study.

## Recycled questions

In [13]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    # Keep only words longer than five characters to skip common filler words.
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    # Count the words that already appeared in an earlier question.
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6877234805400583
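
The running overlap computed above can be illustrated on three made-up questions processed in chronological order. The function `term_overlap_scores` and the sample questions are hypothetical; as in the cell above, only words longer than five characters are counted.

```python
def term_overlap_scores(questions, min_length=6):
    """For each question, the share of its long words seen in earlier questions."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        matches = sum(1 for w in words if w in seen)
        scores.append(matches / len(words) if words else 0)
        # Update the vocabulary only after scoring, so a question never matches itself.
        seen.update(words)
    return scores

# Hypothetical, already-normalized questions in air-date order.
questions = [
    "this planet is closest to the sun",
    "this planet has the largest volcano",
    "the largest planet in the solar system",
]
print(term_overlap_scores(questions))
```

The first question scores 0 because nothing has been seen yet, the second scores 1/3 ("planet" is reused), and the third scores 2/3 ("largest" and "planet" are reused, "system" is new). Later questions tend to score higher simply because the vocabulary of seen terms keeps growing.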

On average, about 69% of the longer terms in a question (words of six or more characters) have already appeared in an earlier question. This analysis covers only a limited subset of questions and compares single terms rather than whole phrases, so on its own the result is relatively weak evidence. Even so, it suggests that question recycling is worth investigating further.

## Low Value vs. High Value Questions