# Leveraging Chi-Squared Tests to Win Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). Here's the beginning of the file:

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number - the Jeopardy episode number
- Air Date - the date the episode aired
- Round - the round of Jeopardy
- Category - the category of the question
- Value - the number of dollars the correct answer is worth
- Question - the text of the question
- Answer - the text of the answer

In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Close inspection shows us that some of those columns have spaces in them. Before we can do anything else, lets remove those spaces and replace the column names with the updated versions.

In [3]:
new_columns = [col_name.strip().replace(' ', '') for col_name in jeopardy.columns]
jeopardy.columns = new_columns

In [4]:
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


## Normalizing Text for Analysis

Before we can start any analysis on our text columns ('Question' and 'Answer'), we first need to normalize all values so that they stick to the same standards. In this context, this will mean changing everything to lower case and removing punctuation.

In [5]:
import re
def normalize_text(text):
    new_string = text.lower() # change all characters to lower case
    new_string = re.sub('[^A-Za-z0-9\s]', '', new_string) # remove any character that is not a white space, letter or number
    new_string = re.sub('\s+', ' ', new_string) # replace any number of spaces with a single space    
    return new_string

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

## Normalizing Numerical Values

After being done with the Question and Answer columns, there are more that need normalizing. In this case, we will be working with the 'AirDate' and 'Value' columns.

The former needs to be turned into datetime format instead of a string, and the latter needs to be clead of dollar signs for proper analyzing.

In [7]:
def normalize_numbers(text):
    new_string = re.sub('[^A-Za-z0-9\s]', '', text) # remove any character that is not a white space, letter or number
    try:
        new_int = int(new_string) # attempt to make the value into an integer
    except Exception:
        new_int = 0 # return a 0 if the conversion fails
    return new_int

In [8]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_numbers)
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])

In [9]:
jeopardy[['clean_question', 'clean_answer','clean_value', 'AirDate']].head(10)

Unnamed: 0,clean_question,clean_answer,clean_value,AirDate
0,for the last 8 years of his life galileo was u...,copernicus,200,2004-12-31
1,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,2004-12-31
2,the city of yuma in this state has a record av...,arizona,200,2004-12-31
3,in 1963 live on the art linkletter show this c...,mcdonalds,200,2004-12-31
4,signer of the dec of indep framer of the const...,john adams,200,2004-12-31
5,in the title of an aesop fable this insect sha...,the ant,200,2004-12-31
6,built in 312 bc to link rome the south of ital...,the appian way,400,2004-12-31
7,no 8 30 steals for the birmingham barons 2306 ...,michael jordan,400,2004-12-31
8,in the winter of 197172 a record 1122 inches o...,washington,400,2004-12-31
9,this housewares store was named for the packag...,crate barrel,400,2004-12-31


## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

- How often the answer can be used for a question.
- How often questions are repeated.

You can answer the second question by seeing how often complex words (> 6 characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. 

We will begin the attempt to answer the first question. This can be done simply by separating each word in the questions and answers into two separate lists, where will count how many times the answer in the latter appears in the question. We will also do some work to remove sections that might cause problems, such as common non-answer words (i.e.: 'the') or rows where we have no answer (which could cause a division by zero error).


In [10]:
def answer_in_question(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the') # remove 'the' since its a common, non-meaningful part of answers
    
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer) # return % of words in the answer that can be found in the question
    

Now that we've made our function, we'll apply it to our answers to create a new 'answer_in_question' column:

In [11]:
jeopardy['answer_in_question'] = jeopardy.apply(answer_in_question, axis = 1)

In [12]:
print(jeopardy['answer_in_question'].mean())

0.05900196524977763


Results suggest that on average, 6% of the words in an answer also appear in the question. This might tell us that there is not much use in simply analyzing the answers and hoping for the best. We might find more luck studying the specific questions and how frequently they appear.

## Recycled Questions

We cannot completely investigate how often questions are repeated, since we only have about 10% of the total questions. However, we can at least dig into it and attempt an approximation. We will try to do this by investigating the re-occurence of complex words. For this exercise, those will be any term that has at least 6 characters.

If we then sort our questions by date in ascending order, and then remove all the non-complex words, we can try to discern a pattern of recycled terms over time.

In [13]:
jeopardy.sort_values('AirDate').reset_index(drop = True, inplace = True)

In [15]:
question_overlap = []
terms_used = set()

for row, values in jeopardy.iterrows():
    split_question = values['clean_question'].split()
    split_question = [term for term in split_question if len(term)>=6]
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1 
        terms_used.add(word)
    if len(split_question)>0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6925960057338647

There is about 70% overlap between new questions and the terms used in past questions. This looks at single terms, not phrases, so it is not very significant. However, it is worth giving it some further attention.

## Low Value vs High Value Questions

Let's say you only want to study questions that pertain to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

- Low value -- Any row where Value is less than 800.
- High value -- Any row where Value is greater than 800.

You'll then be able to loop through each of the terms from the last screen, terms_used, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [16]:
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)
    

In [17]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [18]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(1, 0),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (1, 0),
 (0, 1),
 (1, 3),
 (0, 2),
 (1, 0)]

## Applying the Chi-Squared Test

Now that we have the counts of high and low value apparitions for a few terms, we can calculate the expected counts and use these for the chi-squared test.

In [19]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    expected_high_val = total_prop * high_value_count
    expected_low_val = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([expected_high_val, expected_low_val])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

## Conclusions

None of the terms in this sample showed significant difference in usage between high and low value questions. Also, the frequencies of the terms detract from the validity of a chi-square test, in this case.

For next steps, we could:

- Create a list of terms like 'the' that we can eliminate to improve results
- Perform the test only on terms that happen more often
- Modify the test to include phrases, which might capture context better