# Gaming Jeopardy!:
## Analyzing Topic Trends from the Famous Quiz Show

The breadth of knowledge needed to become a Jeopardy! champion is astounding and would require years (if not decades) of learning. We've acquired a dataset of 20,000 answers/questions from the show in order to detect any preferences for certain topics. Theoretically, a contestant could use this analysis to direct their studying efforts. 

*Note: In Jeopardy!, contestants are given clues called Answers, for which they must supply the appropriate Question; e.g.,*

*For the Answer: For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory.*

*The contestant would respond with the Question: Who is Copernicus?*

*This naming convention is reversed from most quiz-type games and can be confusing. Even the dataset lists Answers under the 'Question' column and vice versa. For the purposes of working with this dataset, we will define Question as the clue given to respondents and Answer as the response; i.e., the reverse of the traditional Jeopardy! definition.*

## Exploratory Data Analysis

In [66]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [68]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [69]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [70]:
# Remove any leading spaces from column names
jeopardy.columns = jeopardy.columns.str.lstrip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [71]:
# create function to clean/normalize text
import string
def normalize_text(s):
    return s.lower().translate(str.maketrans('', '', string.punctuation))

# Clean 'Question' and 'Answer' columns
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

In [72]:
jeopardy.head(2)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe


In [73]:
# Normalize 'Value'
def normalize_dollar(s):
    if s == 'None':
        return 0
    else:
        str_num = s.translate(str.maketrans('', '', string.punctuation))
        return int(str_num)

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollar)

# Convert 'Air Date' to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [74]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


We've normalized the answer and question columns so we can analyze similarities between answer/question pairs and analyze the set as a whole.
The 'Air Date' column has been converted to datetime type to aid data analyses, and entries in the 'Value' column have been cleaned and converted to integers.
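
As a quick illustration of what the text and value cleaning does, the sketch below restates the two helper functions from the cells above and applies them to a few sample strings. The `'$2,000'` value and the `cleaned_text` / `cleaned_values` names are illustrative assumptions rather than output taken from the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character
STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def normalize_text(s):
    # Lowercase, then drop punctuation
    return s.lower().translate(STRIP_PUNCT)

def normalize_dollar(s):
    # 'None' marks clues without a dollar value; treat them as 0
    if s == 'None':
        return 0
    return int(s.translate(STRIP_PUNCT))

cleaned_text = [normalize_text(s) for s in ["McDonald's", "Jim Thorpe"]]
cleaned_values = [normalize_dollar(s) for s in ['$200', '$2,000', 'None']]
print(cleaned_text)
print(cleaned_values)
```

Note that removing punctuation joins the pieces of a word around apostrophes and hyphens, so "McDonald's" becomes a single token.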

Now that the data is cleaned and normalized, we can start investigating trends.

First, we'll analyze answer/question pairs to see how often text from one appears in the other. A high proportion would suggest a strategic advantage in focusing specifically on the text of the question in order to determine the proper answer.

In [75]:
# Create function to compare question and answer text
def compare_qa(row):
    # Create lists of individual words in answer and question
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    
    # remove 'the' from answer as it is not meaningful
    if 'the' in split_answer:
        split_answer.remove('the')
    
    # return 0 for empty answers to prevent division by 0
    if len(split_answer) == 0:
        return 0
    
    # loop through question looking for instances of answer
    # increment match_count for each find
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    
    # determine hit rate of answer words in question string
    return match_count / len(split_answer)


jeopardy['answer_in_question'] = jeopardy.apply(compare_qa, axis=1)

In [76]:
print('Mean: ', jeopardy['answer_in_question'].mean())
print('Median: ', jeopardy['answer_in_question'].median())
print('Max: ', jeopardy['answer_in_question'].max())

Mean:  0.06035277385469894
Median:  0.0
Max:  1.0


This analysis shows that, in the average question/answer pair, about 6% of the words in the answer also appear in the question text.

In [77]:
print('% Sharing Text: ', len(jeopardy[jeopardy['answer_in_question'] != 0]) / len(jeopardy) * 100)
print('Mean ratio of shared text: ', jeopardy[jeopardy['answer_in_question'] != 0]['answer_in_question'].mean())

% Sharing Text:  13.095654782739135
Mean ratio of shared text:  0.46086106312337705


For a deeper understanding, we can see that about 13% of answers in the dataset share any words with their corresponding questions. For those that do share text, almost **half** of the answer's words appear in the question text on average. That is encouraging to know: more than 10% of answers can be roughly half determined from the question text. Focusing on question text appears to be a viable strategy for improving results.
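
The summary figures above are linked by simple arithmetic: the overall mean equals the share of pairs with any overlap multiplied by the mean overlap among those pairs (0.131 × 0.461 ≈ 0.060). A minimal sketch on a few made-up pairs shows the relationship. The pairs and the `overlap_ratio` helper are hypothetical and only mirror the logic of `compare_qa`.

```python
def overlap_ratio(answer, question):
    """Fraction of answer words (ignoring one 'the') found in the question."""
    answer_words = answer.split(' ')
    question_words = question.split(' ')
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0.0
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# Hypothetical, already-normalized (answer, question) pairs
pairs = [
    ('the red sea', 'moses parted this sea'),
    ('new york', 'the new york yankees play in this city'),
    ('copernicus', 'galileo espoused this mans theory'),
    ('arizona', 'yuma is a city in this state'),
]
ratios = [overlap_ratio(a, q) for a, q in pairs]
nonzero = [r for r in ratios if r != 0]

overall_mean = sum(ratios) / len(ratios)
share_with_overlap = len(nonzero) / len(ratios)
mean_when_overlapping = sum(nonzero) / len(nonzero)

# The overall mean factors into the two more detailed statistics
print(overall_mean, share_with_overlap * mean_when_overlapping)
```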

## Reusable questions

By analyzing word usage across many episodes' worth of questions, we may be able to determine if Jeopardy! reuses questions. Reuse of questions could present several opportunities to gain an edge on the competition. 

Here, we'll determine the overlap of individual words as a preliminary analysis. We'll only include words with length >= 6 to prevent common words from interfering with the data.

In [82]:
question_overlap = []
terms_used = set()
for index, row in jeopardy.iterrows():
    # Split question into list of words
    split_question = row['clean_question'].split(' ')
    
    # keep only words with length >= 6
    split_question = [q for q in split_question if len(q) > 5]
    
    # count word matches to previously used words
    match_count = 0
    for item in split_question:
        if item in terms_used:
            match_count += 1
            
        # adds any words not already in terms_used
        terms_used.add(item)

    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

In [83]:
jeopardy.head(2)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0,0.0


In [84]:
jeopardy['question_overlap'].mean()

0.6919577992203644

There is an average overlap of about 69% between the words (>= 6 characters) in each question and the words used in earlier questions. This is, in itself, insignificant, since we looked only at individual words rather than phrases. However, the data do indicate that Jeopardy! potentially reuses questions or question topics, which could be researched further. If further analysis showed significant overlap of phrases or other measurable topics, studying past questions could be a beneficial strategy.
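
One possible way to move from single words to phrases is to compare adjacent word pairs (bigrams) instead. The sketch below is only an assumption about how such a follow-up could look, not part of the analysis above. The `bigrams` and `phrase_overlap` helpers and the toy questions are hypothetical. Unlike the loop above, it adds a question's phrases to the seen set only after scoring that question.

```python
def bigrams(text):
    # Adjacent word pairs, e.g. 'a b c' -> [('a', 'b'), ('b', 'c')]
    words = text.split()
    return list(zip(words, words[1:]))

def phrase_overlap(questions):
    seen = set()
    ratios = []
    for q in questions:
        grams = bigrams(q)
        if grams:
            hits = sum(1 for g in grams if g in seen)
            ratios.append(hits / len(grams))
        else:
            # Questions with fewer than two words have no bigrams
            ratios.append(0.0)
        seen.update(grams)
    return ratios

# Hypothetical, already-normalized questions in air-date order
toy_questions = [
    'this planet is known as the red planet',
    'the red planet has two moons',
    'x',
]
phrase_ratios = phrase_overlap(toy_questions)
print(phrase_ratios)
```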

## High Value Questions

As Jeopardy! is won by earning the most money, discovering trends in high-value questions could help to focus efforts on the most valuable information.

We will perform a chi-squared analysis of words used in questions to determine whether any word appears significantly more often in high-value or low-value questions than chance would predict. For the purposes of this analysis, we'll define high-value questions as those worth more than $800.

In [91]:
# create column in jeopardy with 1 for high value, 0 for low value
def is_valuable(value):
    if value > 800:
        return 1
    else:
        return 0
jeopardy['high_value'] = jeopardy['clean_value'].apply(is_valuable)

# create function that will check if a word is found in a question
# tracks how many times the word appears in high and low value questions
def match_word(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1           
    return high_count, low_count

observed_expected = []

# convert terms_used set to list
terms_used_list = list(terms_used)

# get first 5 terms to test function
comparison_terms = terms_used_list[:5]
for item in comparison_terms:
    observed_expected.append(match_word(item))

In [90]:
observed_expected

[(0, 1), (0, 1), (0, 1), (3, 13), (0, 1)]
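
Calling `match_word` once per term rescans every row of the dataset, which gets slow when many terms are tested. One possible alternative, sketched below under the assumption that rows are available as `(clean_question, high_value)` pairs, counts every term in a single pass with `collections.Counter`. The `count_terms` helper and the toy rows are hypothetical.

```python
from collections import Counter

def count_terms(rows, min_len=6):
    """Count, per word, how many high- and low-value questions contain it."""
    high = Counter()
    low = Counter()
    for question, is_high in rows:
        # A set counts each word at most once per question, as match_word does
        words = {w for w in question.split(' ') if len(w) >= min_len}
        if is_high:
            high.update(words)
        else:
            low.update(words)
    return high, low

# Hypothetical (clean_question, high_value) rows
toy_rows = [
    ('which planet is called the red planet', 1),
    ('this planet has rings', 0),
    ('largest planet in the solar system', 0),
    ('a famous painter from spain', 1),
]
high_counts, low_counts = count_terms(toy_rows)

# Same (high_count, low_count) shape that match_word returns for one word
planet_counts = (high_counts['planet'], low_counts['planet'])
print(planet_counts)
```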

In [93]:
# Perform chi-squared analysis
import numpy as np
from scipy.stats import chisquare

high_value_count = sum(jeopardy['high_value'])
low_value_count = len(jeopardy) - high_value_count
chi_squared = []

for item in observed_expected:
    # item is (high_count, low_count) for one word
    total = sum(item)
    total_prop = total / len(jeopardy)
    # expected counts if the word were spread in proportion to all questions
    high_expected = total_prop * high_value_count
    low_expected = total_prop * low_value_count

    observed = np.array([item[0], item[1]])
    expected = np.array([high_expected, low_expected])

    chi_squared.append(chisquare(observed, expected))

In [94]:
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.7701156281377792, pvalue=0.3801812842139476),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
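
To make the arithmetic behind these results concrete, the sketch below recomputes the statistic for the word observed 3 times in high-value and 13 times in low-value questions, using only the standard library. The high-value count of 5734 is an assumption: it is not printed anywhere above and was back-calculated from the reported statistics. The `chi_squared_1dof` helper is hypothetical.

```python
import math

def chi_squared_1dof(observed, expected):
    # Pearson statistic: sum of (observed - expected)^2 / expected
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of
    # freedom, which applies to a two-category comparison like this one
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

n_rows = 19999
high_value_count = 5734  # ASSUMPTION: inferred from the printed results
low_value_count = n_rows - high_value_count

observed = (3, 13)  # (high_count, low_count) for one word
total = sum(observed)
# Expected counts if the word followed the overall high/low split
expected = (
    total * high_value_count / n_rows,
    total * low_value_count / n_rows,
)
stat, p_value = chi_squared_1dof(observed, expected)
print(round(stat, 4), round(p_value, 4))
```

With the assumed counts, this should agree with the fourth result in the list above (statistic ≈ 0.770, p ≈ 0.380).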

None of the p values for the 5 words analyzed were signifia