This project draws on statistics to determine patterns in past Jeopardy questions that may give future competitors an edge. 

I'll begin by importing, examining, and cleaning my data set.

In [2]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')

In [3]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


print(jeopardy.columns)


In [4]:
jeopardy.columns = jeopardy.columns.str.strip()

In [5]:
import re

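def normal_q(string):
    #Lowercase the question and remove everything except letters, digits, and whitespace.
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]","",string)
    return string

def normal_a(string):
    #Apply the same normalization to the answer.
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]","",string)
    return string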

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normal_q)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normal_a)
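
To make the effect of the normalization concrete, here is a minimal sketch that applies the same lowercase-and-strip rule to two strings taken from the preview above. The helper name normalize_text is my own and is not part of the notebook.

```python
import re

def normalize_text(text):
    # Lowercase first, then drop every character that is not a
    # letter, digit, or whitespace (same rule as normal_q / normal_a).
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

print(normalize_text("McDonald's"))
print(normalize_text("Signer of the Dec. of Indep."))
```

Punctuation disappears but spaces survive, so the cleaned strings can still be split into words later on.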

In [7]:
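def normal_d(string):
    #Strip the dollar sign and convert the value to an integer.
    #Anything that cannot be converted becomes 0.
    string = string.replace('$',"")
    try:
        string = int(string)
    except Exception:
        string = 0
    return string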
        

In [8]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normal_d)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
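
One thing to watch for: int() cannot parse a string with a thousands separator, so normal_d would turn a value written as "$2,000" into 0. Whether the Value column actually contains such strings is an assumption I have not checked here. A hedged sketch of a variant that also strips commas (parse_value is a hypothetical name, not the function used above):

```python
def parse_value(raw):
    # Hypothetical variant of normal_d: remove "$" and "," before converting.
    # Non-string input (such as a missing value) and non-numeric text fall back to 0.
    if not isinstance(raw, str):
        return 0
    cleaned = raw.replace("$", "").replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return 0

print(parse_value("$200"), parse_value("$2,000"), parse_value("None"))
```

If the data does contain comma-formatted values, swapping this in would change which rows count as high value later in the analysis.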

In [9]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

Now that the data is cleaned, I want to split the answer and question columns to better identify key words. Then I want to determine which words that appear in a question also appear in the answer, which will tell me how many answers are deducible from the question. 

I define the following function that will allow me to do this:

In [10]:
def split_s(row):
    #The function takes in a row from my data set as input, and splits the clean
    #answer and clean question into lists of words.
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    match_count = 0 #match_count is a counter for the answer words that also
    #appear in the question.
    #I'll drop 'the' since it's a generic word.
    if "the" in split_answer:
        split_answer.remove("the")
    #The next conditional statement says that if no words are left in the
    #answer, the output is 0. This also avoids dividing by zero below.
    if len(split_answer) == 0:
        return 0
    #The next step is iterating through the split_answer list, checking whether
    #each word also appears in the question. Every time this is the case, the
    #function adds 1 to match_count.
    for word in split_answer:
        if word in split_question:
            match_count += 1
    #The final output is match_count divided by the number of words in the
    #answer.
    return match_count / len(split_answer)

I apply the split_s function across all rows in my dataset. The output is a series that shows, for each row, the fraction of the answer's words that also appear in the corresponding question.

In [12]:
answer_in_question = jeopardy.apply(split_s,axis=1)
answer_in_question.head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
dtype: float64
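
Since the first five rows all come out as 0.0, a made-up row with a partial match may help show what the ratio means. The sketch below repeats the split_s logic against a plain dictionary so it runs without the data set. The function name and the toy row are illustrative assumptions.

```python
def answer_word_overlap(row):
    # Same steps as split_s, applied to a dict instead of a DataFrame row.
    answer_words = row["clean_answer"].split(" ")
    question_words = row["clean_question"].split(" ")
    if "the" in answer_words:
        answer_words.remove("the")
    if len(answer_words) == 0:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Invented example: one of the two remaining answer words ("river")
# also appears in the question.
toy_row = {
    "clean_question": "this river flows through the grand canyon",
    "clean_answer": "the colorado river",
}
print(answer_word_overlap(toy_row))
```

Here "the" is dropped, leaving "colorado" and "river", and only "river" is found in the question, so the ratio is one half.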

Taking the mean of this series gives the average share of answer words that can be found in the question.

In [15]:
answer_in_question.mean()

0.06049325706933587

These results tell us that, on average, only about 6% of the words in an answer also appear in its question, so the answer can rarely be read off the question. With that in mind, it's best not to rely on the question's wording when working out your answer.

Next, I want to see how often new questions are repeats of older ones. To do so, I loop through each row in my data and determine what share of the key words in each question already appeared in an earlier question.

In [13]:
questions_overlap = []
terms_used = set()

for i, row in jeopardy.iterrows():
    #I start by splitting the clean question into its component words and filtering
    #out words shorter than 6 characters. This lets the loop home in on key terms.
    split_question = row['clean_question'].split(" ")
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0 #A counter for the words in this question that were seen before.
    #The next step is to loop through each word and check whether it is already in
    #the terms_used set. If it is, match_count increases by 1. If not, nothing happens.
    for word in split_question:
        if word in terms_used:
            match_count += 1
    #I'll loop again through the question words and add each word to my terms_used set.
    for word in split_question:
        terms_used.add(word)
    #If the question has any long words, I convert match_count into the ratio of
    #previously seen words to the total number of long words in the question.
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    #The final step is to append each ratio to the questions_overlap list.
    questions_overlap.append(match_count)
        

Each value in the questions_overlap list is the ratio of a question's long words that already appeared in an earlier question to the total number of long words in that question.

In [20]:
questions_overlap[:12]

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0, 0.0, 0.125, 0.0, 0, 0.0]
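
To see how these ratios build up, here is a small sketch of the same loop run on three invented questions. The function name, the min_length parameter, and the toy questions are assumptions for illustration only.

```python
def question_overlaps(clean_questions, min_length=6):
    # For each question, compute the share of its long words that were
    # already seen in an earlier question.
    seen_terms = set()
    overlaps = []
    for question in clean_questions:
        words = [w for w in question.split(" ") if len(w) >= min_length]
        matches = sum(1 for w in words if w in seen_terms)
        # Only add the words after counting, so a question never matches itself.
        seen_terms.update(words)
        overlaps.append(matches / len(words) if words else 0)
    return overlaps

toy_questions = [
    "this planet is closest to the sun",
    "the largest planet in the solar system",
    "a cat",
]
print(question_overlaps(toy_questions))
```

The first question has nothing to match against, the second reuses "planet" out of its three long words, and the third has no long words at all, which is why plain integer zeros show up in the output list.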

In [None]:
jeopardy['question_overlap'] = questions_overlap

In [18]:
jeopardy['question_overlap'].mean()

0.6908737315671962

I added the overlap list as a new column to my data set and took the mean. On average, about 69% of the long words (six or more characters) in a question had already appeared in an earlier question. This measures repeated terms, not repeated questions, so it only suggests that questions are recycled. It's also worth noting that this isn't a wholly representative sample, because it comprises only 10% of all questions asked on Jeopardy.

I now want to determine which questions I need to study to maximize my earnings. I'll start by building a function that flags questions worth more than $800, and use it to create a column of 0s and 1s that marks each row as a low or high value question.

In [22]:
def values(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value


In [25]:
jeopardy['high_value'] = jeopardy.apply(values,axis=1)
jeopardy['high_value'].head()

0    0
1    0
2    0
3    0
4    0
Name: high_value, dtype: int64

Next, I want to see which terms appear disproportionately often in high value questions. A chi-squared test will allow me to check this. As a first step, the following function accepts a word as input and counts how many high value and low value questions contain it:

In [30]:
def counts(word):
    #Both counters start at 0. They track the number of low value and high value
    #questions that contain the word.
    low_count = 0
    high_count = 0
    #Loop through each row in the data set and split the question column into
    #individual words.
    for i, row in jeopardy.iterrows():
        words = row['clean_question'].split(" ")
        #If the input word appears in the question, add 1 to high_count when the
        #row is a high value question and to low_count otherwise.
        if word in words:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    #The returned output is the pair of final counts, high count first.
    return high_count, low_count


Before applying this function, I'll return to my terms_used set and convert it to a list. Remember, this set contains every word of six or more characters that appears in any question. To keep the workload manageable, I selected a sample of five words. Because sets are unordered, these five words are effectively arbitrary.

In [None]:
observed_expected = []
terms_used = list(terms_used)
comparison_terms = terms_used[:5]
comparison_terms

Now that I have my list of terms to compare, I'll loop through each term and apply my counts function to find how many high value and low value questions contain it. I store these results in a new list, observed_expected. Each item in the list is a two-item tuple containing the number of high value questions the word appears in, followed by the number of low value questions it appears in.

In [None]:
for term in comparison_terms:
    val = counts(term)
    observed_expected.append(val)
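
Calling counts once per term walks the whole data set each time, which gets slow as the list of terms grows. As a possible alternative, the sketch below tallies every term in a single pass. It works on plain (question, high_value) pairs, and the function name and toy rows are my own assumptions.

```python
from collections import Counter

def count_terms_by_value(rows, terms):
    # rows: iterable of (clean_question, high_value) pairs.
    # Returns {term: (high_count, low_count)}, counting each question
    # at most once per term, as the counts function does.
    wanted = set(terms)
    high, low = Counter(), Counter()
    for question, high_value in rows:
        present = wanted.intersection(question.split(" "))
        (high if high_value == 1 else low).update(present)
    return {term: (high[term], low[term]) for term in terms}

# Invented rows for illustration.
toy_rows = [
    ("the ancient empire fell", 1),
    ("an ancient city", 0),
    ("ancient ancient ruins", 0),
    ("modern art", 1),
]
print(count_terms_by_value(toy_rows, ["ancient", "modern", "absent"]))
```

With the real data, the rows could be built by pairing the clean_question and high_value columns, for example with zip.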

The final step is to loop through this list to output a new list of chi-squared values.

In [38]:
#First, I count how many high value and low value questions the data set contains
#by taking the number of rows in a filtered version of the data set.
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
chi_squared = []

import numpy as np
from scipy.stats import chisquare

#Next comes the loop.
for item in observed_expected:
    #Add the term's high and low counts and divide by the total number of rows.
    #This is the proportion of all questions that contain the term.
    total = sum(item)
    total_prop = total / jeopardy.shape[0]
    #Multiplying that proportion by the number of high value rows and low value
    #rows gives the count I would expect in each group if the term were spread
    #evenly across both.
    exp_high = total_prop * high_value_count
    exp_low = total_prop * low_value_count
    #The observed array holds the actual number of high and low value questions
    #containing the term; the expected array holds the counts computed above.
    observed = np.array([item[0], item[1]])
    expected = np.array([exp_high, exp_low])
    #Run the chi-squared test for the term and append the result to the list.
    chi_squared.append(chisquare(observed, expected))


In [39]:
chi_squared

[Power_divergenceResult(statistic=4.122707846712507e-05, pvalue=0.9948769527982859),
 Power_divergenceResult(statistic=1.323484394756106, pvalue=0.24996766692297967),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.6765980594008285, pvalue=0.4107606373026974),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953)]
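
To show where a statistic and p-value like these come from, the sketch below works through the same calculation by hand on made-up counts, using only the standard library. With two categories there is one degree of freedom, and in that case the chi-squared tail probability can be written with the complementary error function. The function name and all of the numbers are illustrative assumptions, not values from the data set.

```python
import math

def chi_squared_for_term(high_count, low_count, n_high_rows, n_low_rows):
    # Share of all questions that contain the term.
    total_prop = (high_count + low_count) / (n_high_rows + n_low_rows)
    # Counts expected if the term were spread evenly across both groups.
    expected_high = total_prop * n_high_rows
    expected_low = total_prop * n_low_rows
    # Sum of (observed - expected)^2 / expected over the two categories.
    statistic = ((high_count - expected_high) ** 2 / expected_high
                 + (low_count - expected_low) ** 2 / expected_low)
    # One degree of freedom: upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Made-up counts: a term found in 8 high value and 12 low value questions,
# in a data set with 300 high value and 700 low value rows.
stat, p = chi_squared_for_term(8, 12, 300, 700)
print(round(stat, 4), round(p, 4))
```

In this toy case the expected counts are 6 and 14, so the observed 8 and 12 are only slightly off and the p-value is well above 0.05.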

Each result reports a chi-squared statistic, which measures how far the observed high and low value counts for a word deviate from the counts expected if the word were spread evenly across both groups, and a p-value, the probability of seeing a deviation at least that large by chance alone. For the first word the statistic is close to zero and the p-value is about 0.99, so its observed counts match the expected counts almost exactly: there is no evidence that it favors high or low value questions. The same holds for the other four words, whose p-values (roughly 0.08 to 0.57) are all above the usual 0.05 threshold for statistical significance. Five words are too small a sample to guide studying, so it may be best to compute chi-squared values for more words, or to focus on the words that appear most frequently.