# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to find patterns in the questions that could help us win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions. Each row represents a single question, with the following columns:

- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the number of dollars answering the question correctly is worth.
- Question -- the text of the question.
- Answer -- the text of the answer.

## Jeopardy Questions

In [1]:
import numpy as np
import pandas as pd


In [2]:
jeopardy = pd.read_csv('jeopardy.csv')

In [3]:
jeopardy.head()


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
# Some of the column names have leading whitespace.
# This can be inconvenient when working with the dataset, so let's remove it.
jeopardy.columns = jeopardy.columns.str.strip()

In [6]:
# Now we can see that the whitespace has been removed.
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Normalizing text

Before we can start analyzing the Jeopardy questions, we need to normalize the text columns (the Question and Answer columns). The idea is to lowercase all words and remove punctuation, so that Don't and don't aren't considered different words when we compare them.

In [7]:
import re

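def normalize_text(s):
    # Lowercase the text so comparisons are case-insensitive.
    s = s.lower()
    # Drop everything except letters, digits, and whitespace.
    # A raw string avoids the invalid-escape warning for \s.
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    return s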

In [8]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

In [9]:
jeopardy[['Question','clean_question']]

Unnamed: 0,Question,clean_question
0,"For the last 8 years of his life, Galileo was ...",for the last 8 years of his life galileo was u...
1,No. 2: 1912 Olympian; football star at Carlisl...,no 2 1912 olympian football star at carlisle i...
2,The city of Yuma in this state has a record av...,the city of yuma in this state has a record av...
3,"In 1963, live on ""The Art Linkletter Show"", th...",in 1963 live on the art linkletter show this c...
4,"Signer of the Dec. of Indep., framer of the Co...",signer of the dec of indep framer of the const...
5,"In the title of an Aesop fable, this insect sh...",in the title of an aesop fable this insect sha...
6,Built in 312 B.C. to link Rome & the South of ...,built in 312 bc to link rome the south of ita...
7,"No. 8: 30 steals for the Birmingham Barons; 2,...",no 8 30 steals for the birmingham barons 2306 ...
8,"In the winter of 1971-72, a record 1,122 inche...",in the winter of 197172 a record 1122 inches o...
9,This housewares store was named for the packag...,this housewares store was named for the packag...
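
As a quick sanity check of the normalization described above, here is a minimal standalone sketch. It re-declares the function so the block runs on its own, and the sample strings are made up for illustration rather than taken from the dataset. It also shows a side effect visible in the output above: removing a character such as & leaves a double space behind, so splitting with split() (no argument) avoids empty tokens that split(" ") would produce.

```python
import re

def normalize_text(s):
    """Lowercase a string and drop everything except letters, digits, and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", s.lower())

# Illustrative inputs (assumptions, not rows from jeopardy.csv).
same = normalize_text("Don't") == normalize_text("don't")
cleaned = normalize_text("Rome & the South")

print(same)                # True
print(cleaned.split(" "))  # ['rome', '', 'the', 'south']
print(cleaned.split())     # ['rome', 'the', 'south']
```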


## Normalizing Columns

The Value column should also be numeric, to allow you to manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, to enable you to work with it more easily.

In [10]:
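def normalize_value(s):
    # Remove the dollar sign, commas, and any other punctuation.
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    # Values that are not purely numeric are treated as 0.
    if s.isnumeric():
        return int(s)
    return 0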

In [11]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

In [12]:
# You can see the data type is now integer
jeopardy['clean_value'].dtype

dtype('int64')
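
To make the conversion concrete, here is a small standalone sketch of the same idea. The helper name parse_dollar_value and the sample strings are assumptions for illustration, and it is a slightly simplified variant that keeps only the digits instead of checking isnumeric().

```python
import re

def parse_dollar_value(s):
    """Return the dollar amount in s as an int, or 0 if s contains no digits."""
    digits = re.sub(r"[^0-9]", "", s)
    return int(digits) if digits else 0

# Hypothetical raw values in the style of the Value column.
samples = ["$200", "$2,000", "None"]
parsed = [parse_dollar_value(s) for s in samples]
print(parsed)  # [200, 2000, 0]
```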

In [13]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

In [14]:
jeopardy.loc[0,'clean_question']

'for the last 8 years of his life galileo was under house arrest for espousing this mans theory'

In [15]:
test = ['a','b','c']
test.remove('a')

In [16]:
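def count_match(row):
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    match_count = 0
    # "the" is so common that it would inflate the match rate.
    # Note that remove() only drops the first occurrence.
    if 'the' in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when nothing is left of the answer.
    if len(split_answer) == 0:
        return 0
    for answer in split_answer:
        if answer in split_question:
            match_count += 1
    # Fraction of answer words that also appear in the question.
    return match_count / len(split_answer)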

In [17]:
jeopardy['answer_in_question'] = jeopardy.apply(count_match, axis = 1)

In [18]:
jeopardy['answer_in_question'].mean()

0.06049325706933587

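From the result above, on average only about 6% of the words in an answer also appear in its question. That is a low rate, so the answer can rarely be deduced from the question alone. Guessing a term from the question is at best a fallback when you have no other clue.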
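
To see what the per-row score means, here is a minimal standalone sketch that works on plain strings instead of DataFrame rows. The function name answer_overlap and the second and third example pairs are made up for illustration. Unlike count_match, this variant removes every "the" and uses split() with no argument.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    question_words = set(clean_question.split())
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# First question is row 0 of the dataset; the other pairs are hypothetical.
galileo_q = ("for the last 8 years of his life galileo was under house arrest "
             "for espousing this mans theory")
none_match = answer_overlap(galileo_q, "copernicus")
half_match = answer_overlap("john hancock signed first", "john adams")
full_match = answer_overlap("this show was hosted by art linkletter",
                            "the art linkletter show")

print(none_match, half_match, full_match)  # 0.0 0.5 1.0
```

A score of 0.0 means no answer word appears in the question, and 1.0 means all of them do, so a mean near 0.06 indicates that most rows score 0.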

## Recycled Questions

Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

In [19]:
question_overlap = []
terms_used = set()

jeopardy.sort_values(by = 'Air Date', inplace=True)

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
jeopardy["question_overlap"].mean()

0.6876260592169802

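The script above checks, for each question in air-date order, whether its terms of 6 or more characters already came up in earlier questions. On average, about 69% of such terms had been used before. This only measures single-word overlap, not repeated phrases or whole questions, so it could be insignificant. Still, if certain topics are recycled from the past, it could be worth looking at previous questions.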
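
The overlap calculation can be sketched on a tiny hand-made list of questions. The function name overlap_scores and the toy questions are assumptions for illustration, and the list is assumed to be sorted from oldest to newest already.

```python
def overlap_scores(questions, min_length=6):
    """For each question, in order, return the share of its long words seen in earlier questions."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        if words:
            scores.append(sum(w in seen for w in words) / len(words))
        else:
            scores.append(0)
        # Only add the words after scoring, so a question never matches itself.
        seen.update(words)
    return scores

# Toy, already-normalized questions (not from the dataset).
toy = [
    "this planet is closest to the sun",
    "the largest planet in our system",
    "galileo observed this planet through a telescope",
]
scores = overlap_scores(toy)
print(scores)
```

The first question always scores 0 because nothing has been seen yet. In the later questions only "planet" is a repeat, so their scores are 1/3 and 1/5.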

## Low Value vs High Value Questions

Let's say we only want to study terms that appear in high value questions rather than low value questions. This will help us earn more money when we're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

- Low value: Any row where Value is 800 or less.
- High value: Any row where Value is greater than 800.

We can loop through each of the terms from terms_used and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi-squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
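
The steps above can be sketched for a single term without pandas or scipy. The function name chi_squared_for_term and all of the counts are made-up assumptions, chosen only to keep the arithmetic easy to follow.

```python
def chi_squared_for_term(high_obs, low_obs, high_total, low_total):
    """Chi-squared statistic comparing a term's observed high/low counts with expected counts."""
    total_rows = high_total + low_total
    # Share of all questions that contain the term.
    term_share = (high_obs + low_obs) / total_rows
    # Expected counts if the term were used in proportion to each group's size.
    high_exp = term_share * high_total
    low_exp = term_share * low_total
    return (high_obs - high_exp) ** 2 / high_exp + (low_obs - low_exp) ** 2 / low_exp

# Hypothetical data: 300 high value and 700 low value questions,
# with the term appearing in 5 high value and 15 low value questions.
stat = chi_squared_for_term(5, 15, 300, 700)
print(round(stat, 4))  # 0.2381
```

Here the expected counts are 6 and 14, so the statistic is 1/6 + 1/14. The sketch assumes the term occurs at least once. Otherwise the expected counts are zero and the division fails.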

In [25]:
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [26]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(1, 0), (0, 1), (2, 1), (0, 1), (1, 6)]

## Applying the Chi-Squared test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [36]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]
chi_squared = []

for obs in observed_expected:
    # obs is (high_count, low_count) for one term.
    total = sum(obs)
    # Share of all questions that contain the term.
    total_prop = total / jeopardy.shape[0]
    # Expected counts if the term were used in proportion to each group's size.
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.7083506539662141, pvalue=0.39999189913636146)]
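
With two categories (high and low) the test has one degree of freedom, and in that case the p-value has a closed form in terms of the complementary error function. The sketch below is a standard-library cross-check of the p-values printed above, not a replacement for scipy. The helper name p_value_one_dof is an assumption.

```python
import math

def p_value_one_dof(statistic):
    """Upper-tail probability of a chi-squared distribution with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))

# Statistics copied from the first two results in the output above.
for stat in (2.487792117195675, 0.401962846126884):
    print(stat, p_value_one_dof(stat))
```

The printed values should agree with the pvalue fields above to within floating-point error.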

## Conclusion

None of the five sampled terms had a statistically significant difference in usage between high value and low value rows: every p-value is above 0.05, and the smallest is about 0.11. Additionally, the frequencies were all lower than 10, so the chi-squared test is not very reliable for these terms. It would be better to run this test with only terms that have higher frequencies.
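
One way to pick higher-frequency terms is to count how many questions each term appears in and keep only those above a threshold. This is a minimal sketch: the function name frequent_terms, the default threshold of 10, and the toy questions are all assumptions, not part of the original analysis.

```python
from collections import Counter

def frequent_terms(clean_questions, min_questions=10, min_length=6):
    """Return long terms that occur in at least min_questions distinct questions."""
    counts = Counter()
    for question in clean_questions:
        # Use a set so a term repeated within one question is counted once.
        counts.update({w for w in question.split() if len(w) >= min_length})
    return sorted(term for term, n in counts.items() if n >= min_questions)

# Toy, already-normalized questions (not from the dataset).
toy_questions = [
    "this planet has rings",
    "the largest planet",
    "a famous telescope",
]
common = frequent_terms(toy_questions, min_questions=2)
print(common)  # ['planet']
```

On the real data, the resulting list could stand in for the first five entries of terms_used as the comparison terms.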