## Looking for an edge in winning Jeopardy show
------------
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named `JEOPARDY_CSV.csv`, and contains over `200000 rows` from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

In [20]:
# Imports
import pandas as pd
import re
from scipy.stats import chisquare
import numpy as np

# Settings

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
# Reading dataset

jeopardy = pd.read_csv("JEOPARDY_CSV.csv")
jeopardy.shape
jeopardy.head(2)
jeopardy.tail(2)
jeopardy.columns

(216930, 7)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
216928,4999,2006-05-11,Double Jeopardy!,QUOTATIONS,$2000,"From Ft. Sill, Okla. he made the plea, Arizona...",Geronimo
216929,4999,2006-05-11,Final Jeopardy!,HISTORIC NAMES,,A silent movie title includes the last name of...,Grigori Alexandrovich Potemkin


Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
# Trimming spaces in column names

jeopardy.columns = [col.strip() for col in jeopardy.columns]
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

----------
Lets normalize the text columns `Question` and `Answer`. The idea is to lowercase all words and remove punctuations. This will avoid different result for the same words.

In [5]:
# imported re module

def norm_text(input_string):
    input_string = str(input_string)
    return re.sub("[^A-Za-z0-9\s]", "", input_string).lower()
    
jeopardy["clean_question"] = jeopardy['Question'].apply(norm_text)
jeopardy["clean_answer"] = jeopardy['Answer'].apply(norm_text)

We will also clean `Value` and `Air Date` columns.
- Removing '$' and ',' from **Value** column and coverting it to int
- Coverting **Air Date** to datetime format

In [6]:
def norm_value(input_string):
    input_string = re.sub("[^A-Za-z0-9\s]", "", input_string)
    try:
        input_string = int(input_string)
    except Exception:
        input_string = 0
    return input_string    

jeopardy["clean_value"] = jeopardy['Value'].apply(norm_value)
jeopardy["Air Date"] = pd.to_datetime(jeopardy['Air Date'])

--------------
In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the first question by seeing how many times words in the answer also occur in the question.

In [7]:
def deducible(row):
    split_answer = row["clean_answer"].split(' ')
    split_question = row["clean_question"].split(' ')
    
    # removing 'the' word if exists from split_answer as it is quite common
    if "the" in split_answer:
        split_answer = [i for i in split_answer if i != 'the']
    if len(split_answer) == 0:
        return 0
    # check for a match word in question
    match_count = 0
    for ans in split_answer:
        if ans in split_question:
            match_count += 1
    
    return match_count/len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(deducible, axis=1)

In [12]:
print("number of times one or more words from answer present in question:", jeopardy[jeopardy['answer_in_question'] != 0].shape[0])
print("mean of 'answer_in_question' column:", jeopardy['answer_in_question'].mean())

number of times one or more words from answer present in question: 27473
mean of 'answer_in_question' column: 0.05879718229728192


## Answer terms in the question
The answer only appears in the question about 6% of the time. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.

--------------------
Now let's investigate how often new questions are repeats of older ones.
- we shall look at words that are atleast 6 characters, which enables us to filter out words like **the** and **than**, which are commonly used, but don't tell you a lot about a question.

In [9]:
# sort df by date
jeopardy = jeopardy.sort_values("Air Date")

question_overlap = []
terms_used = set()
for i,row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) >= 6]
    
    match_count = 0
    for each in split_question:
        if each in terms_used:
            match_count += 1
        terms_used.add(each)
    if len(split_question) > 0:
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap

In [11]:
print("mean of 'question_overlap' column:", jeopardy['question_overlap'].mean())

mean of 'question_overlap' column: 0.8727083792207625


## Question overlap
There is about 87% overlap between terms in new questions and terms in old questions. This makes it relatively significant, that it's worth looking more into the recycling of questions.

--------------------------
Let's say we only want to study questions that pertain to high value questions instead of low value questions. This will help us earn more money when we're on Jeopardy.

Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [13]:
def high_value(value):
    if value > 800:
        return 1
    return 0

jeopardy['high_value'] = jeopardy['clean_value'].apply(high_value)

In [15]:
def compare_word(word):
    low_count = 0
    high_count = 0
    for idx, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            low_count += 1
    return high_count, low_count

observed_expected = []
comparison_terms = list(terms_used)[0:5]

for term in comparison_terms:
    observed_expected.append(compare_word(term))

comparison_terms    
observed_expected

['pistons',
 'hrefhttpwwwjarchivecommedia20040714j18jpg',
 'biochemistry',
 'neoconservatives',
 'fuller']

[(2, 7), (0, 1), (0, 2), (1, 1), (5, 13)]

In [19]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.16455916070771673, pvalue=0.6849932289997671),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.7899529284667026, pvalue=0.3741143592744989),
 Power_divergenceResult(statistic=0.46338644448358013, pvalue=0.49604555208958945),
 Power_divergenceResult(statistic=0.002551837432310809, pvalue=0.9597114268675199)]

## Chi-squared results
None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.