# Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. 

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named `jeopardy.csv`, and contains `20000` rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

In [1]:
import pandas as pd
jeopardy = pd.read_csv('Jeopardy.csv')

In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

* `Show Number` -- the Jeopardy episode number of the show this question was in.
* `Air Date` -- the date the episode aired.
* `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* `Category` -- the category of the question.
* `Value` -- the number of dollars answering the question correctly is worth.
* `Question` -- the text of the question.
* `Answer` -- the text of the answer.

In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the columns have spaces, let's remove those spaces.

In [4]:
jeopardy.columns = jeopardy.columns.str.replace(' ', '')

In [5]:
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Normalising Text
Before we can start doing analysis on the Jeopardy questions, we need to normalise all of the text columns (the Question and Answer columns). The idea is to ensure that you lowercase words and remove punctuation so `Don't` and `don't` aren't considered to be different words when you compare them.

In [6]:
import re

def normalise_text(text):
    text = text.lower()
    text = re.sub(r'[^A-Za-z0-9\s]','',text) 
    text = re.sub(r'\s+',' ',text)
    return text

In [7]:
jeopardy['Answer'] = jeopardy['Answer'].astype(str) #change type to string

In [8]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalise_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalise_text)

In [9]:
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


## Normalising Columns

The `Value` column should also be numeric, to allow for manipulate more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `AirDate` column should also be a datetime, not a string, to be able to work with it more easily.

In [10]:
def normalise_value(text):
    text = re.sub(r'[^A-Za-z0-9\s]','',text) 
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

In [11]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalise_text)
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])

In [12]:
jeopardy.dtypes

ShowNumber                 int64
AirDate           datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value               object
dtype: object

In [13]:
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (> 6 characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. 

Let's focus on the first question

In [14]:
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer: #'the' is commonly found in answers and questions, 
        #but doesn't have any meaningful use in finding the answer.
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    
    for answer in split_answer:
        if answer in split_question:
            match_count += 1
    return match_count/len(split_answer)      

In [15]:
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [16]:
#Finding the mean of the answer_in_question 
jeopardy['answer_in_question'].mean()

0.05900196524977763

On average, the answer only makes up for about **6%** of the question. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.

## Recycled Questions

Let's say we want to investigate how often new questions are repeats of older ones. we can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

To do this, we can:

* Sort `jeopardy` in order of ascending air date.
* Maintain a *set* called `terms_used` that will be empty initially.
* Iterate through each row of `jeopardy`.
* Split `clean_question` into words, remove any word shorter than 6 characters, and check if each word occurs in `terms_used`.
    * If it does, increment a counter.
    * Add each word to `terms_used`.

This will enable us to check if the terms in questions have been used previously or not. Only looking at words with **six or more characters** enables us to filter out words like `the` and `than`, which are commonly used, but don't tell  a lot about a question.

In [17]:
question_overlap = []
terms_used = set()
jeopardy_sorted = jeopardy.sort_values(['AirDate'], ascending = True)

In [18]:
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [m for m in split_question if len(m) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count +=1
    for word in split_question:    
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /=len(split_question)
    question_overlap.append(match_count)

In [19]:
jeopardy_sorted['question_overlap'] = question_overlap
jeopardy_sorted['question_overlap'].mean()

0.6908737315671962

There is about **70%** overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it doesn't look at phrases, it looks at single terms. This makes it relatively insignificant, but it does mean that it's worth looking more into the recycling of questions.

## Low value vs high value questions

Let's say we only want to study questions that pertain to high value questions instead of low value questions. This will help us earn more money when you're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

* Low value -- Any row where `Value` is **less** than `800`.
* High value -- Any row where `Value` is **greater** than `800`.

We'll then be able to loop through each of the terms, `terms_used`, and:

* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find expected counts.
* Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [20]:
def determine_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

In [22]:
jeopardy_sorted2 = jeopardy_sorted[jeopardy_sorted['clean_value'] != 'none']

In [23]:
jeopardy_sorted2['clean_value'] = jeopardy_sorted2['clean_value'].astype('int64')
jeopardy_sorted2["high_value"] = jeopardy_sorted2.apply(determine_value, axis = 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  jeopardy_sorted2['clean_value'] = jeopardy_sorted2['clean_value'].astype('int64')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  jeopardy_sorted2["high_value"] = jeopardy_sorted2.apply(determine_value, axis = 1)


In [24]:
jeopardy_sorted2

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.000000,0.000000,0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.000000,0.000000,0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this n...,the grand canyon,200,0.000000,0.000000,0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.000000,0.000000,0
19305,10,1984-09-21,Double Jeopardy!,HOMONYMS,$200,Hindu hierarchy or a play's actors,a caste (cast),hindu hierarchy or a plays actors,a caste cast,200,0.333333,0.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1953,6294,2012-01-19,Double Jeopardy!,WEAPONS OF WORLD WAR II,$800,"Ships in the U.S. Navy's Casablanca class of ""...",aircraft carriers,ships in the us navys casablanca class of esco...,aircraft carriers,800,0.000000,1.000000,0
1954,6294,2012-01-19,Double Jeopardy!,ACTING PRESIDENTS ON TV,$800,Dennis Haysbert & D.B. Woodside as David & Way...,24,dennis haysbert db woodside as david wayne pal...,24,800,0.000000,1.000000,0
1955,6294,2012-01-19,Double Jeopardy!,4 N,$800,"""U"" know it means not deliberate; I'm sorry, t...",unintentional,u know it means not deliberate im sorry that s...,unintentional,800,0.000000,1.000000,0
1945,6294,2012-01-19,Double Jeopardy!,AMERICAN HISTORY,$400,In December 1974 this former New York governor...,Rockefeller,in december 1974 this former new york governor...,rockefeller,400,0.000000,1.000000,0


In [25]:
def count_usage(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy_sorted2.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count +=1
            else:
                low_count +=1
    return high_count, low_count

In [26]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 1),
 (2, 0),
 (0, 2),
 (0, 0),
 (2, 10),
 (3, 10),
 (0, 1),
 (1, 0),
 (0, 1),
 (0, 2)]

## Applying the Chi-Squared Test

In [27]:
from scipy.stats import chisquare
import numpy as np
#The number of rows in jeopardy where high_value is 1
high_value_count = jeopardy_sorted2[jeopardy_sorted2['high_value']== 1].shape[0]
#The number of rows in jeopardy where high_value is 0
low_value_count = jeopardy_sorted2[jeopardy_sorted2['high_value'] == 0].shape[0]

In [28]:
chi_squared = []
total = 0
for obs in observed_expected:
    total = sum(obs)
    total_prop = total/(jeopardy_sorted2.shape[0]) #to get the proportion across the dataset
    expected_high = total_prop * high_value_count # to get the expected term count for high value rows.
    expected_low = total_prop * low_value_count # to get the expected term count for low value rows.
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))

chi_squared

  terms = (f_obs.astype(np.float64) - f_exp)**2 / f_exp


[Power_divergenceResult(statistic=0.41165912843707375, pvalue=0.5211285963246591),
 Power_divergenceResult(statistic=4.858388559469828, pvalue=0.027512021231787174),
 Power_divergenceResult(statistic=0.8233182568741475, pvalue=0.3642117684516033),
 Power_divergenceResult(statistic=nan, pvalue=nan),
 Power_divergenceResult(statistic=0.9068908302205858, pvalue=0.3409407210310256),
 Power_divergenceResult(statistic=0.2329739508708927, pvalue=0.6293274031630456),
 Power_divergenceResult(statistic=0.41165912843707375, pvalue=0.5211285963246591),
 Power_divergenceResult(statistic=2.429194279734914, pvalue=0.11909409782120144),
 Power_divergenceResult(statistic=0.41165912843707375, pvalue=0.5211285963246591),
 Power_divergenceResult(statistic=0.8233182568741475, pvalue=0.3642117684516033)]

## Chi-Squared Test

None of the terms had a significant difference in usage between high value and low value rows. Additionally, the numbers must be large enough. Each entry must be 5 or more, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher numbers.