# Winning Jeopardy

This project analyzes text to look for strategies that could help win at Jeopardy.

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

In this project, we work with a dataset of Jeopardy questions to look for patterns that could help a contestant win. The dataset can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

## Prepare dataset

In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")

In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
jeopardy.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona


### Remove spaces in front of column names

In [4]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
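
The hard-coded list above works for this file. A more general option, sketched here on a small hypothetical frame (`toy` is not part of the dataset), is to strip whitespace from whatever column names are present:

```python
import pandas as pd

# Hypothetical frame with the same leading-space problem as the raw file
toy = pd.DataFrame({"Show Number": [4680], " Air Date": ["2004-12-31"], " Round": ["Jeopardy!"]})

# Strip leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

This avoids retyping the names and keeps working if the column order changes.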

### Normalize the _Question_ column

* Lowercase words so _Don't_ and _don't_ are treated as the same word.
* Remove punctuation so _don't_ and _dont_ are treated as the same word.


In [5]:
import re

def normalize_text(text):
    """
    A function to normalize questions and answers:
    take in a string;
    convert the string to lowercase;
    remove all punctuation in the string;
    return the string.
    """
    text = text.lower()
    # Raw string so that \s is passed to the regex engine unchanged
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text
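
As a quick check of what the normalization does, the sketch below applies the same two steps to a few sample strings. The function is repeated so the snippet runs on its own, and the sample strings are chosen for illustration only.

```python
import re

def normalize_text(text):
    # Same steps as the function above: lowercase, then drop punctuation
    text = text.lower()
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

# Illustrative inputs; the last one resembles the start of a question in the dataset
samples = ["Don't", "don't", "No. 2: 1912 Olympian;"]
normalized = [normalize_text(s) for s in samples]

print(normalized)  # ['dont', 'dont', 'no 2 1912 olympian']
```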
    

### Normalize the _Value_ column to be numeric

In [6]:
def normalize_values(text):
    """
    A function to normalize the Value column to be numeric:
    remove the dollar sign (and any other punctuation) from each value
    and convert the text to an integer.
    """
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)  # remove any punctuation
    try:
        text = int(text)  # convert string to integer
    except Exception:
        text = 0  # if the conversion fails, assign 0 instead
    return text
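
The same conversion can also be done on the whole column at once. The following is a minimal sketch, assuming the raw values look like the hypothetical samples in `values`; it is an alternative to `normalize_values`, not a replacement used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical raw values: a plain amount, one with a thousands separator, one non-numeric
values = pd.Series(["$200", "$2,000", "None"])

# Keep digits only, then convert; anything left unparseable becomes NaN and is set to 0
digits = values.str.replace(r"[^0-9]", "", regex=True)
clean = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

print(clean.tolist())
```

As in the function above, any value that cannot be parsed ends up as 0.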

In [7]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [8]:
jeopardy.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200


### Convert the _Air Date_ column to datetime

In [9]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [10]:
jeopardy.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200


In [11]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

## Figure out whether to study past questions 

* How often the answer is deducible from the question.

In [12]:
def match(row):
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [13]:
answer_in_question = jeopardy.apply(match, axis=1)

In [14]:
answer_in_question.mean()

0.06049325706933587

On average, only about 6% of the words in an answer also appear in its question. That is a small share, so we probably cannot expect to work out the answer just by listening to the question.
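
To make the metric concrete, here is a small sketch that applies the same logic as `match` to two made-up question/answer pairs. The helper name `match_fraction` and both pairs are assumptions for illustration.

```python
def match_fraction(clean_question, clean_answer):
    # Mirrors match() above, but takes two plain strings instead of a row
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    if "the" in answer_words:
        answer_words.remove("the")  # ignore the first "the", as match() does
    if len(answer_words) == 0:
        return 0
    hits = sum(1 for word in answer_words if word in question_words)
    return hits / len(answer_words)

# Hypothetical pairs: no shared words, then two of four answer words shared
no_overlap = match_fraction("this state is home to the city of yuma", "arizona")
half_overlap = match_fraction("the great wall is found in this country",
                              "the great wall of china")

print(no_overlap, half_overlap)  # 0.0 0.5
```

A mean of about 0.06 over the dataset means that most rows look like the first pair.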

## Investigate how often new questions are repeats of older ones

Here we want to know how often new questions are repeats of old ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.


In [15]:
question_overlap = []
terms_used = set()
# Sort jeopardy by ascending air date
jeopardy = jeopardy.sort_values(by=['Air Date'], ascending=True)

In [16]:
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q) > 5] # remove any word shorter than 6 characters
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0: 
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6876260592169802

The two cells above estimate how often new questions repeat older ones. To do so, we keep only the words of six or more characters in each question (which filters out common words like _then_ and _than_) and compute the fraction of those words that already appeared in earlier questions.

The results show about 69% overlap between terms in new questions and terms in old questions. This calculation covers only a small set of questions, and it looks at single terms rather than phrases. That makes the result only weakly informative, but it does suggest that the recycling of questions is worth a closer look.
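
The overlap calculation can be traced by hand on a tiny example. The sketch below follows the same steps as the loop above; the function name `overlap_scores` and the three questions are made up for illustration.

```python
def overlap_scores(questions, min_length=6):
    # questions must already be in chronological order
    seen = set()
    scores = []
    for question in questions:
        # Keep only words with at least min_length characters
        words = [w for w in question.split(" ") if len(w) >= min_length]
        # Count words that appeared in any earlier question
        matches = sum(1 for w in words if w in seen)
        seen.update(words)
        scores.append(matches / len(words) if words else 0)
    return scores

# Hypothetical, already-normalized questions
toy_questions = [
    "this italian astronomer defended copernicus",
    "copernicus studied in this italian city",
    "a famous astronomer born in poland",
]
scores = overlap_scores(toy_questions)
print(scores)
```

The first question scores 0 because nothing has been seen yet, the second reuses two of its three long words, and the third reuses one of three. This also shows why the score rises over time: the set of seen terms only grows.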

## Figure out high-value questions using a chi-squared test

Use a chi-squared test to find out which terms correspond to high-value questions. First, split the questions into two categories:
* Low value -- any row where Value is 800 or less
* High value -- any row where Value is greater than 800

In [21]:
def determine_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)
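
The same flag can be computed without a row-wise `apply`. This is a minimal sketch on a hypothetical frame with a `clean_value` column:

```python
import pandas as pd

# Hypothetical values, including the 800 boundary and a 0 from a failed conversion
toy = pd.DataFrame({"clean_value": [200, 800, 1000, 0]})

# The comparison yields booleans; astype(int) turns them into 1 (high) and 0 (low)
toy["high_value"] = (toy["clean_value"] > 800).astype(int)

print(toy["high_value"].tolist())  # [0, 0, 1, 0]
```

Note that a value of exactly 800 falls in the low-value group.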

In [28]:
len(terms_used)

24469

In [22]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 1), (1, 0), (6, 7), (1, 0), (1, 0)]

In [29]:
high_value_count = len(jeopardy[jeopardy['high_value']==1])
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]

In [20]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared    

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=1.9428227445585855, pvalue=0.16336237241877416),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

## Chi-squared results
None of the terms showed a significant difference in usage between high-value and low-value rows (all p-values are above 0.05). Additionally, most of the observed frequencies were lower than 5, so the chi-squared test is not very reliable here. It would be better to run this test only on terms that have higher frequencies.
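
One way to restrict the test to more frequent terms is sketched below. It counts every term in a single pass and skips terms whose expected count in either group is below 5. That cutoff is a common rule of thumb and is applied here as an assumption. The frame `toy` is synthetic and only shows the mechanics.

```python
from collections import Counter

import numpy as np
import pandas as pd
from scipy.stats import chisquare

# Synthetic data: 20 high-value and 20 low-value rows
rows = (
    [("this capital city", 1)] * 15
    + [("this capital city", 0)] * 5
    + [("this river flows", 1)] * 1
    + [("this river flows", 0)] * 1
    + [("another clue", 1)] * 4
    + [("another clue", 0)] * 14
)
toy = pd.DataFrame(rows, columns=["clean_question", "high_value"])

# Count each term once per question, separately for high- and low-value rows
high_counts, low_counts = Counter(), Counter()
for question, high in zip(toy["clean_question"], toy["high_value"]):
    target = high_counts if high == 1 else low_counts
    target.update(set(question.split(" ")))

n_rows = len(toy)
n_high = int(toy["high_value"].sum())
n_low = n_rows - n_high

results = {}
for term in set(high_counts) | set(low_counts):
    observed = np.array([high_counts[term], low_counts[term]])
    # Expected counts if the term were spread evenly across both groups
    expected = observed.sum() / n_rows * np.array([n_high, n_low])
    if expected.min() < 5:
        continue  # too rare for the test to be reliable (assumed cutoff)
    results[term] = chisquare(observed, expected)

for term, result in sorted(results.items()):
    print(term, result.statistic, result.pvalue)
```

Counting in one pass also avoids rescanning the whole dataset for every term, which is what makes `count_usage` slow on more than a handful of terms.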