# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, I'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download here: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file. 

In [75]:
import pandas as pd
import re 
import random
from scipy.stats import chisquare
import numpy as np

In [3]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
#Remove spaces in front of columns
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

## Clean and prepare the dataset

In [17]:
#Write a function to normalize questions and answers
def normalizer(words):
    #Raw string so the backslash is kept literally instead of starting an invalid escape sequence
    punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    words = words.lower()
    for ele in words:
        if ele in punc:
            words = words.replace(ele, "")
    return words
        
#Normalize the Question column    
jeopardy['clean_question'] = jeopardy['Question'].apply(normalizer)

#Normalize the Answer column    
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalizer)

In [20]:
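#Write a function to normalize dollar values
def dollar_normalizer(numbers):
    #Raw string so the backslash is kept literally instead of starting an invalid escape sequence
    punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    for ele in numbers:
        if ele in punc:
            numbers = numbers.replace(ele, "")
    try:
        numbers = int(numbers)
    except ValueError: #Anything that isn't a whole number after cleaning counts as 0
        numbers = 0
    return numbers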

#Normalize the Value column
jeopardy['clean_value'] = jeopardy['Value'].apply(dollar_normalizer)

In [22]:
#Use the pandas.to_datetime function to convert the Air Date column to a datetime column.
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
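
The cleaning helpers above strip punctuation one character at a time. As a rough alternative sketch, the same idea can be written with the `re` module that is already imported. The names `normalize_text` and `parse_dollar_value` are hypothetical, and the behavior differs slightly from the originals: the pattern `[^\w\s]` removes every character that is not a letter, digit, underscore, or whitespace (so it keeps underscores but drops symbols such as `+` that the original list leaves in), and the dollar helper keeps any digits it finds rather than returning 0 when letters are mixed in.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop everything except word characters and whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

def parse_dollar_value(value):
    """Turn a string such as '$1,000' into an integer; return 0 if no digits remain."""
    digits = re.sub(r"\D", "", value)
    return int(digits) if digits else 0

print(normalize_text("Mann's Chinese Theatre"))  # manns chinese theatre
print(parse_dollar_value("$1,000"))              # 1000
```

If these were used in the notebook, they could be passed to `apply` in the same way as `normalizer` and `dollar_normalizer`.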

## What to examine

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

### Question 1: How many times words in the answer also occur in the question

In [26]:
def repeat_counter(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if "the" in split_answer: #The is commonly found in answers and questions, but doesn't have any meaningful use in finding the answer.
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    else:
        for item in split_answer:
            if (item in split_question):
                match_count += 1
        return match_count/len(split_answer)
    
#Count how many times terms in clean_answer occur in clean_question
jeopardy["answer_in_question"] = jeopardy.apply(repeat_counter, axis=1)

In [41]:
#Find the mean of the answer_in_question column using the mean method on Series.
print('The mean is {0}%'.format(round(jeopardy["answer_in_question"].mean()*100, 0)))
print('So on average, only about 6% of the words in an answer also appear in the question. This isn\'t a huge number, and means that we probably can\'t just hope that hearing a question will enable us to figure out the answer. We\'ll probably have to study.')

The mean is 6.0%
So on average, only about 6% of the words in an answer also appear in the question. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.


### Question 2: How often new questions are repeats of older questions.

We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

In [48]:
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.000000,0.000000
19305,10,1984-09-21,Double Jeopardy!,HOMONYMS,$200,Hindu hierarchy or a play's actors,a caste (cast),hindu hierarchy or a plays actors,a caste cast,200,0.333333,0.000000
19306,10,1984-09-21,Double Jeopardy!,TV TRIVIA,$200,"Last season, this series mourned the loss of S...",Hill Street Blues,last season this series mourned the loss of sg...,hill street blues,200,0.000000,0.000000
19307,10,1984-09-21,Double Jeopardy!,1789,$400,Why April 28th was a bad day for Capt. Bligh,the day of the mutiny on the Bounty,why april 28th was a bad day for capt bligh,the day of the mutiny on the bounty,400,0.142857,0.000000
19312,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$600,"Since '27, stars have made good impressions at...",Mann's Chinese Theatre,since 27 stars have made good impressions at t...,manns chinese theatre,600,0.000000,0.000000
19299,10,1984-09-21,Jeopardy!,"""B"" MOVIES",$500,Sensitive Mart Crowley treatment of gays march...,The Boys in the Band,sensitive mart crowley treatment of gays march...,the boys in the band,500,0.000000,0.000000
19274,10,1984-09-21,Jeopardy!,GEOGRAPHY,$100,Formerly Formosa,Taiwan,formerly formosa,taiwan,100,0.000000,0.000000
19275,10,1984-09-21,Jeopardy!,DOUBLE TALK,$100,"Not a Hawaiian cow, but a dress worn by Hawaii...",a muumuu,not a hawaiian cow but a dress worn by hawaiia...,a muumuu,100,0.500000,0.090909
19281,10,1984-09-21,Jeopardy!,DOUBLE TALK,$200,Affirmative reply to an admiral's command,aye-aye,affirmative reply to an admirals command,ayeaye,200,0.000000,0.000000
19282,10,1984-09-21,Jeopardy!,"""JACKS"" OF ALL TRADES",$200,"Between him & his wife, they licked the platte...",Jack Spratt,between him his wife they licked the platter ...,jack spratt,200,0.000000,0.000000


In [50]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6871708558735073

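So on average, about 69% of the long words (6 or more characters) in a question have already appeared in an earlier question. This only looks at single words rather than phrases, and at a small subset of the full dataset, so it doesn't show that whole questions are repeated. It does suggest that the recycling of questions is worth a closer look.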

## Studying High Value Questions

Let's say we only want to study terms that show up in high value questions rather than low value questions. This will help you earn more money when you're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

We'll then be able to loop through each of the terms from the last calculation, terms_used, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
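
The steps listed above can be sketched for a single term in plain Python. This is only an illustration under stated assumptions: the function name `chi_squared_for_term` is hypothetical, and the counts in the example are made up rather than taken from the dataset.

```python
def chi_squared_for_term(high_obs, low_obs, high_rows, low_rows):
    """Chi-squared statistic comparing a term's observed counts with its expected counts."""
    total_rows = high_rows + low_rows
    # Share of all questions that contain the term.
    term_share = (high_obs + low_obs) / total_rows
    # Counts we would expect if the term were equally likely in both groups.
    expected_high = term_share * high_rows
    expected_low = term_share * low_rows
    return ((high_obs - expected_high) ** 2 / expected_high
            + (low_obs - expected_low) ** 2 / expected_low)

# Made-up example: 1,000 questions, 300 of them high value.
# The term occurs in 12 high value and 8 low value questions.
stat = chi_squared_for_term(high_obs=12, low_obs=8, high_rows=300, low_rows=700)
print(round(stat, 3))  # 8.571
```

Here the term occurs in 2% of all questions, so the expected counts are 6 (high) and 14 (low). The observed high count of 12 is well above 6, which is what produces the large statistic.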

In [55]:
def sorter(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(sorter, axis=1)

In [59]:
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.000000,0.000000,0
19278,10,1984-09-21,Jeopardy!,"""B"" MOVIES",$100,"In '61 movie, Audrey Hepburn's alternative to ...",Breakfast at Tiffany's,in 61 movie audrey hepburns alternative to bru...,breakfast at tiffanys,100,0.000000,0.000000,0
19279,10,1984-09-21,Jeopardy!,SPORTS,$100,What Gary Player plays professionailly,golf,what gary player plays professionailly,golf,100,0.000000,0.000000,0
19280,10,1984-09-21,Jeopardy!,GEOGRAPHY,$200,Dutch is still an official language in what is...,Dutch Guiana,dutch is still an official language in what is...,dutch guiana,200,0.500000,0.000000,0
19286,10,1984-09-21,Jeopardy!,DOUBLE TALK,$300,Adopted baby of Barney & Betty Rubble,Bamm-Bamm,adopted baby of barney betty rubble,bammbamm,300,0.000000,0.000000,0
19285,10,1984-09-21,Jeopardy!,GEOGRAPHY,$300,"8th most populous country in the world, this ""...",Bangladesh,8th most populous country in the world this be...,bangladesh,300,0.000000,0.000000,0
19324,10,1984-09-21,Double Jeopardy!,TV TRIVIA,$1000,"In court, he'd always make mincemeat of Hamilt...",Perry Mason,in court hed always make mincemeat of hamilton...,perry mason,1000,0.000000,0.000000,1
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.000000,0.000000,0
19308,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$400,Seaside resort that has a monopoly on East Coa...,"Atlantic City, New Jersey",seaside resort that has a monopoly on east coa...,atlantic city new jersey,400,0.000000,0.000000,0
19309,10,1984-09-21,Double Jeopardy!,LITERATURE,$400,"He wrote ""The 3 Musketeers""; his son wrote ""Ca...",(Alexandre) Dumas,he wrote the 3 musketeers his son wrote camille,alexandre dumas,400,0.000000,0.000000,0


In [63]:
#Function that counts how many high value and low value questions a word occurs in
def word_value(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        if word in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

#Randomly pick ten elements of terms_used and append them to a list called comparison_terms.
comparison_terms = []
for i in range(10):
    comparison_terms.append(random.choice(list(terms_used)))
    
observed_expected = []
for term in comparison_terms:
    observed_expected.append(word_value(term))
    
observed_expected

[(1, 0),
 (0, 1),
 (1, 0),
 (0, 2),
 (2, 8),
 (0, 1),
 (0, 1),
 (0, 1),
 (1, 2),
 (7, 9)]

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [73]:
high_value_count = len(jeopardy[jeopardy['high_value']==1]) #Find the number of rows in jeopardy where high_value is 1
low_value_count = len(jeopardy[jeopardy['high_value']==0]) #Find the number of rows in jeopardy where high_value is 0

In [77]:
chi_squared = []
for items in observed_expected: #Loop through each (high_count, low_count) tuple in observed_expected
    total = items[0] + items[1] #Add up both items in the list (high and low counts) to get the total count
    total_prop = total/len(jeopardy) #Divide total by the number of rows in jeopardy to get the proportion across the dataset
    expected_high = total_prop*high_value_count
    expected_low = total_prop*low_value_count
    
    observed = np.array([items[0], items[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.36767906209032747, pvalue=0.5442721040962595),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=1.7788002674291046, pvalue=0.18229671571722328)]

None of the terms had a statistically significant difference in usage between high value and low value rows (every p-value is above 0.05). Additionally, most of the terms appear in fewer than 5 questions, so the expected counts are small and the chi-squared test isn't reliable for them. It would be better to run this test with only terms that have higher frequencies.
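
One way to restrict the test to higher-frequency terms is sketched below. It counts every term in a single pass and keeps only terms whose expected counts reach a minimum. The function names, the `min_expected` parameter, and the toy questions are all hypothetical, and the threshold of 5 is a common rule of thumb rather than something derived from this dataset.

```python
from collections import Counter

def count_terms(questions, min_length=6):
    """questions: iterable of (clean_question, high_value) pairs.
    Returns two Counters with the number of high and low value questions per term."""
    high_counts, low_counts = Counter(), Counter()
    for text, high_value in questions:
        # A set, so a term repeated inside one question is counted once.
        terms = {w for w in text.split() if len(w) >= min_length}
        if high_value == 1:
            high_counts.update(terms)
        else:
            low_counts.update(terms)
    return high_counts, low_counts

def terms_with_enough_data(high_counts, low_counts, high_rows, low_rows, min_expected=5):
    """Keep terms whose expected high and low counts are both at least min_expected."""
    total_rows = high_rows + low_rows
    kept = []
    for term in set(high_counts) | set(low_counts):
        total = high_counts[term] + low_counts[term]
        expected_high = total / total_rows * high_rows
        expected_low = total / total_rows * low_rows
        if min(expected_high, expected_low) >= min_expected:
            kept.append(term)
    return sorted(kept)

# Toy data; min_expected is lowered to 1 only because the sample is tiny.
sample = [
    ("president elected during wartime", 1),
    ("president of france", 0),
    ("capital of france", 0),
    ("famous president", 1),
]
high, low = count_terms(sample)
print(terms_with_enough_data(high, low, high_rows=2, low_rows=2, min_expected=1))
# ['france', 'president']
```

In the notebook, the pairs could come from `zip(jeopardy["clean_question"], jeopardy["high_value"])`, which also avoids rescanning every row once per term.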

## Next Steps

Here are some potential next steps:

- Find a better way to eliminate non-informative words than just removing words that are fewer than 6 characters long. Some ideas:
    - Manually create a list of words to remove, like the, than, etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset instead of the subset we used in this project.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
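
As a rough sketch of the last idea, overlap could be measured on two-word phrases (bigrams) instead of single words. The function names and sample questions are hypothetical. Unlike the single-word loop earlier, this version uses sets, so a phrase repeated within one question counts once.

```python
def bigrams(text):
    """Return the set of adjacent word pairs in a whitespace-separated string."""
    words = text.split()
    return set(zip(words, words[1:]))

def phrase_overlap(questions):
    """For each question, in order, the share of its bigrams seen in earlier questions."""
    seen = set()
    overlaps = []
    for text in questions:
        pairs = bigrams(text)
        if pairs:
            overlaps.append(len(pairs & seen) / len(pairs))
        else:
            overlaps.append(0)
        seen |= pairs
    return overlaps

sample = [
    "this city is the capital of france",
    "the capital of peru",
    "formerly formosa",
]
print(phrase_overlap(sample))  # the second question shares 2 of its 3 bigrams
```

The questions would need to be sorted by air date first, as in the single-word version, so that "earlier" has the intended meaning.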