
# Winning Jeopardy!

[Jeopardy!](https://www.jeopardy.com/) is a popular TV trivia game show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say you want to compete on Jeopardy!, and are looking for any edge you can get to win. In this project, we will work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named `jeopardy.csv`, and contains the first 20,000 rows of a full dataset of Jeopardy! questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- `Show Number` - the Jeopardy episode number of the show this question was in.
- `Air Date` - the date the episode aired.
- `Round` - the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- `Category` - the category of the question.
- `Value` - the number of dollars answering the question correctly is worth.
- `Question` - the text of the question.
- `Answer` - the text of the answer.

Let's begin by reading in the data and doing some exploration.

In [43]:
import pandas as pd
import re
import numpy as np
from scipy.stats import chisquare
from random import choice

In [44]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [45]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the columns have spaces in front of their names. Let's fix that.

In [46]:
# Strip leading and trailing whitespace from every column name.
jeopardy = jeopardy.rename(columns=str.strip)

In [47]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Normalizing the columns

Before we can start doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the `Question` and `Answer` columns). The idea is to ensure that we lowercase words and remove punctuation so `Don't` and `don't` are not considered to be different words when you compare them.

In [48]:
def normalize(string):
    string = str(string)
    string = string.lower()
    # Raw string so that \s is passed to the regex engine unchanged.
    string = re.sub(r'[^A-Za-z0-9\s]', '', string)
    return string

In [49]:
jeopardy['clear_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clear_answer'] = jeopardy['Answer'].apply(normalize)
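
To see what the normalization does, here is a small sketch that runs the same logic on a few strings. The sample strings are made up for illustration and are not taken from the cleaned dataset.

```python
import re

def normalize(string):
    # Same logic as the cell above: lowercase, then drop anything that is
    # not a letter, digit, or whitespace.
    string = str(string).lower()
    return re.sub(r'[^A-Za-z0-9\s]', '', string)

# Illustrative inputs only.
examples = ["Don't", "don't", "McDonald's", "Dec. of Indep."]
normalized = [normalize(text) for text in examples]
print(normalized)
# ['dont', 'dont', 'mcdonalds', 'dec of indep']
```

Both spellings of `don't` collapse to `dont`, so they compare as equal later on.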

The `Value` column should also be numeric, to allow us to manipulate it more easily. We need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `Air Date` column should also be a datetime, not a string, to enable us to work with it more easily.

In [50]:
def normalize_value(string):
    string = str(string)
    string = string.replace('$', '')
    try:
        string = int(string)
    except ValueError:
        # Non-numeric entries (for example a missing value) fall back to 0.
        string = 0
    return string

In [51]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
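
As a quick check of how the conversion behaves, the sketch below runs the same logic on a few hand-written inputs. The inputs are hypothetical; in particular, whether the dataset contains values with a thousands separator is an assumption that is not verified here.

```python
def normalize_value(string):
    # Same logic as the cell above: drop the dollar sign, then try to
    # convert to an integer, falling back to 0 when that fails.
    string = str(string).replace('$', '')
    try:
        return int(string)
    except ValueError:
        return 0

# Hypothetical inputs, chosen to exercise each branch.
samples = ['$200', '$1000', 'None', '$2,000']
values = [normalize_value(s) for s in samples]
print(values)
# [200, 1000, 0, 0]
```

Note that a string such as `$2,000` would be converted to 0 because of the comma. If such values exist in the data, removing commas before calling `int` would be one way to keep them.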

In [52]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy['Air Date'].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: Air Date, dtype: datetime64[ns]

## Exploring the questions and answers

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the first question by seeing how many times words in the answer also occur in the question. We can answer the second question by seeing how often complex words (six or more characters) reoccur.

In [53]:
def match_count(row):
    split_answer = row['clear_answer'].split(" ")
    split_question = row['clear_question'].split(" ")
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for x in split_answer:
        if x in split_question:
            match_count += 1
    return match_count/len(split_answer)

In [54]:
jeopardy['answer_in_question'] = jeopardy.apply(match_count, axis=1)
mean = jeopardy['answer_in_question'].mean()
mean

0.060493257069335914

The cells above answer the first question by applying `match_count` to each row of the Jeopardy dataframe. The function counts how many words from the answer (ignoring the word `the`) also appear in the question, then divides by the number of words in the answer. The mean is about 0.06, so on average only about 6% of the words in an answer also appear in its question. We can't count on the question giving the answer away.
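
To make the ratio concrete, here is a minimal sketch that applies the same matching logic to a few hand-written rows. The rows are plain dictionaries invented for illustration, not rows from the dataset.

```python
def match_count(row):
    # Same logic as the cell above, applied to any mapping with
    # 'clear_answer' and 'clear_question' keys.
    split_answer = row['clear_answer'].split(" ")
    split_question = row['clear_question'].split(" ")
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)

# Hypothetical rows for illustration.
rows = [
    {'clear_question': 'this state is home to the city of yuma',
     'clear_answer': 'arizona'},
    {'clear_question': 'john hancock signed it first',
     'clear_answer': 'john adams'},
    {'clear_question': 'the new york yankees play in this city',
     'clear_answer': 'new york city'},
]
ratios = [match_count(row) for row in rows]
print(ratios)
# [0.0, 0.5, 1.0]
```

A ratio of 1.0 means every word of the answer appears in the question, and 0.0 means none do.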

Now, let's focus on the second question.

In [55]:
questions_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values(by=['Air Date'])

for i, row in jeopardy.iterrows():
    split_question = row['clear_question'].split(' ')
    # Keep only longer words (6+ characters) to skip filler such as "the" and "than".
    split_question = [word for word in split_question if len(word) > 5]
    # Named `overlap` so the `match_count` function defined earlier is not overwritten.
    overlap = 0
    for word in split_question:
        if word in terms_used:
            overlap += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        overlap /= len(split_question)
    questions_overlap.append(overlap)

jeopardy['question_overlap'] = questions_overlap
jeopardy['question_overlap'].mean()

0.6876260592169776
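
The following sketch repeats the loop above on three made-up questions, to show how the running set of terms produces a per-question overlap. The function name `question_overlaps` and the sample questions are hypothetical and only serve the illustration.

```python
def question_overlaps(questions, min_length=6):
    # For each question, in order, compute the share of its longer words
    # that already appeared in an earlier question.
    terms_used = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split(' ') if len(w) >= min_length]
        seen = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        overlaps.append(seen / len(words) if words else 0)
    return overlaps

# Hypothetical questions, assumed to be sorted by air date already.
questions = [
    'this president signed the declaration',
    'the declaration was signed in philadelphia',
    'a short one',
]
overlaps = question_overlaps(questions)
print(overlaps)
```

The second question reuses two of its three longer words and so scores about 0.67, even though it is clearly a different question from the first one.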

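On average, about 69% of the longer terms in a question have already appeared in an earlier question. However, this only measures the overlap of single terms, not of whole phrases or questions, so it does not show that the questions themselves are recycled.

Next, let's check whether certain terms are associated with high-value questions. We flag questions worth more than $800 as high value, count how often a random sample of terms appears in high-value and low-value questions, and compare those counts with the expected counts using a chi-squared test.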

In [56]:
def word_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [57]:
jeopardy['high_value'] = jeopardy.apply(word_value, axis=1) 


In [58]:
def word_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clear_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
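
Because `word_count` walks the whole dataframe once per term, checking many terms can be slow. As a rough sketch, the counts for several terms could instead be collected in a single pass. The helper `count_terms` and the toy dataframe below are hypothetical; only the column names `clear_question` and `high_value` come from the notebook.

```python
import pandas as pd

def count_terms(df, terms):
    # Maps each term to [high_count, low_count], filled in one pass over the rows.
    counts = {term: [0, 0] for term in terms}
    for question, high in zip(df['clear_question'], df['high_value']):
        words = set(question.split(' '))
        for term in counts:
            if term in words:
                counts[term][0 if high == 1 else 1] += 1
    return {term: tuple(pair) for term, pair in counts.items()}

# Toy data for illustration only.
toy = pd.DataFrame({
    'clear_question': ['the senate voted', 'senate rules apply',
                       'a famous painter', 'painter and senate'],
    'high_value': [1, 0, 0, 1],
})
counts = count_terms(toy, ['senate', 'painter'])
print(counts)
# {'senate': (2, 1), 'painter': (1, 1)}
```

Each tuple is in the same `(high_count, low_count)` order that `word_count` returns.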

In [59]:
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for x in range(10)]

In [60]:
observed_expected = []

for term in comparison_terms:
    observed_expected.append(word_count(term))

In [61]:
high_value_count = jeopardy['high_value'].value_counts()[1]
low_value_count = jeopardy['high_value'].value_counts()[0]

In [62]:
chi_squared = []

for high_count, low_count in observed_expected:
    total = high_count + low_count
    total_prop = total / jeopardy.shape[0]
    # Expected counts if the term were spread evenly across high- and low-value questions.
    expected_high_count = total_prop * high_value_count
    expected_low_count = total_prop * low_value_count
    observed = np.array([high_count, low_count])
    expected = np.array([expected_high_count, expected_low_count])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.6765980594008285, pvalue=0.4107606373026975),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913673),
 Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913673),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.06325251982741063, pvalue=0.8014271475031749)]
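
None of the p-values above is below 0.05, so none of the sampled terms shows a significant difference in usage between high-value and low-value questions. The repeated statistics also suggest that several sampled terms occur only a handful of times, and the chi-squared approximation is generally considered unreliable when expected counts are small (a common rule of thumb asks for at least 5 per cell). The sketch below illustrates this with made-up numbers: the totals of 5,000 high-value and 15,000 low-value questions and both sets of term counts are hypothetical, not taken from the dataset.

```python
import numpy as np
from scipy.stats import chisquare

def term_chi_squared(high_count, low_count, high_total, low_total):
    # Expected counts assume the term is spread across high- and low-value
    # questions in proportion to how many questions of each kind exist.
    total_rows = high_total + low_total
    term_share = (high_count + low_count) / total_rows
    expected = np.array([term_share * high_total, term_share * low_total])
    observed = np.array([high_count, low_count])
    result = chisquare(observed, expected)
    return result.statistic, result.pvalue, expected

# Hypothetical totals and counts for illustration.
HIGH_TOTAL, LOW_TOTAL = 5000, 15000

# A term seen once, in a high-value question: expected counts are far below 5.
rare_stat, rare_p, rare_expected = term_chi_squared(1, 0, HIGH_TOTAL, LOW_TOTAL)

# A term seen 100 times, 60 of them in high-value questions.
common_stat, common_p, common_expected = term_chi_squared(60, 40, HIGH_TOTAL, LOW_TOTAL)

for name, stat, p, expected in [
    ('rare', rare_stat, rare_p, rare_expected),
    ('common', common_stat, common_p, common_expected),
]:
    reliable = expected.min() >= 5
    print(name, round(stat, 2), 'reliable' if reliable else 'too few observations')
```

In practice this suggests restricting the test to terms that occur often enough for the expected counts to be reasonably large, rather than sampling terms at random.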