# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Suppose we want to compete on Jeopardy and are looking for any edge to win. The goal of this project is to find patterns in the questions that could help us win.

To do this, we'll work with a dataset of Jeopardy questions, which can be downloaded [here](<https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file>). The file is named jeopardy.csv and contains 19999 Jeopardy questions.

In [1]:
#Reading the file

In [2]:
import pandas as pd

jeopardydata = pd.read_csv('jeopardy.csv')
jeopardydata.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardydata.shape

(19999, 7)

In [4]:
jeopardydata.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

As we can see above, we have 19999 rows and 7 columns. Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

|Column Name|Description|
|:----|:----|
|Show Number|the Jeopardy episode number of the show this question was in.|
|Air Date|the date the episode aired.|
|Round|the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.|
|Category|the category of the question.|
|Value|the number of dollars answering the question correctly is worth.|
|Question|the text of the question.|
|Answer|the text of the answer.|

The output above also shows that some of the column names have leading spaces, so we need to strip them.

In [5]:
#Removing spaces in front of column names

In [6]:
col = jeopardydata.columns
col = col.str.strip()
jeopardydata.columns = col
jeopardydata.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Before analyzing the Jeopardy questions, we need to normalize the text columns (Question and Answer). The idea is to lowercase every word and remove all punctuation.

In [7]:
#Writing a function to normalize question and answer columns

In [8]:
import string
def normalize_text(text):
    text = text.lower()
    for character in string.punctuation:
        text = text.replace(character,'')
    return text    
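
To see what this normalization does, here is a minimal, self-contained sketch applied to two fragments from the sample rows above. It uses `str.translate` instead of the loop, which is an alternative assumed here for illustration (the name `normalize_text_translate` is hypothetical); both approaches should give the same result for these inputs.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def normalize_text_translate(text):
    # Lowercase first, then drop all punctuation in a single pass.
    return text.lower().translate(PUNCTUATION_TABLE)

print(normalize_text_translate("McDonald's"))
print(normalize_text_translate("Signer of the Dec. of Indep."))
```

The two printed strings should be `mcdonalds` and `signer of the dec of indep`, matching the cleaned columns shown below.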

In [9]:
#Creating a new column for normalized text

In [10]:
jeopardydata['clean_question'] = jeopardydata['Question'].apply(normalize_text)
jeopardydata['clean_answer'] = jeopardydata['Answer'].apply(normalize_text)

In [11]:
jeopardydata['clean_question'][:6]

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
5    in the title of an aesop fable this insect sha...
Name: clean_question, dtype: object

In [12]:
jeopardydata['clean_answer'][:6]

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
5       the ant
Name: clean_answer, dtype: object

With the text columns normalized, two other columns still need cleaning.

The Value column should be numeric so that we can manipulate it more easily. We need to remove the dollar sign (and any other non-digit characters) from each value and convert the column from text to numeric.

The Air Date column should be a datetime rather than a string, which also makes it easier to work with.

In [13]:
#Writing a function to normalize Value column

In [14]:
def normalize_values(text):
    #Keep only the digit characters, dropping '$' and anything else
    clean = ''
    for num in text:
        if num in '0123456789':
            clean += num
    #Values with no digits at all are treated as 0
    if not clean:
        return 0
    return int(clean)

In [15]:
jeopardydata['clean_value'] = jeopardydata['Value'].apply(normalize_values)

In [16]:
jeopardydata['clean_value'][:6]

0    200
1    200
2    200
3    200
4    200
5    200
Name: clean_value, dtype: int64
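
Because the function keeps only digits, it also copes with thousands separators and with entries that contain no digits at all. Below is a minimal sketch of the same idea using a regular expression; the name `normalize_value_regex` and the sample inputs are assumptions for illustration, not values confirmed to be in the dataset.

```python
import re

def normalize_value_regex(text):
    # Delete every character that is not an ASCII digit, e.g. '$' and ','.
    digits = re.sub(r'[^0-9]', '', text)
    # Entries without any digits are mapped to 0, as in the loop version.
    return int(digits) if digits else 0

for raw in ['$200', '$2,000', 'None']:
    print(raw, '->', normalize_value_regex(raw))
```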

In [17]:
#Converting the Air Date column from string datatype to a datetime datatype column.

In [18]:
jeopardydata['Air Date'][:6]

0    2004-12-31
1    2004-12-31
2    2004-12-31
3    2004-12-31
4    2004-12-31
5    2004-12-31
Name: Air Date, dtype: object

In [19]:
jeopardydata['Air Date'] = pd.to_datetime(jeopardydata['Air Date'])

In [20]:
jeopardydata['Air Date'][:6]

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
5   2004-12-31
Name: Air Date, dtype: datetime64[ns]

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

We can answer the first question by seeing how many times words in the answer also occur in the question. We can answer the second question by seeing how often complex words (six or more characters) recur. We'll work on the first question now, and come back to the second later.

In [21]:
def answer_deducible(data):
    split_answer = data['clean_answer'].split(' ')
    split_question = data['clean_question'].split(' ')
    match_count = 0
    #The word 'the' is common in both answers and questions,
    #but it tells us nothing about whether the answer can be deduced.
    if 'the' in split_answer:
        split_answer.remove('the')
    #To prevent a division by zero error later.
    if len(split_answer) == 0:
        return 0
    
    for element in split_answer:
        if element in split_question:
            match_count += 1
    return match_count / len(split_answer)
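
To make the ratio concrete, the sketch below applies the same logic to three clue/answer pairs. The helper `overlap_ratio` and all three pairs are made up for illustration; they are not taken from the dataset.

```python
def overlap_ratio(clean_question, clean_answer):
    # Same steps as answer_deducible, but on two plain strings.
    answer_words = clean_answer.split(' ')
    question_words = clean_question.split(' ')
    if 'the' in answer_words:
        answer_words.remove('the')
    if len(answer_words) == 0:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

examples = [
    ('this city in new york state is its capital', 'albany'),
    ('john quincy was the son of this president', 'john adams'),
    ('new york city is in this state', 'new york'),
]
for question, answer in examples:
    print(answer, overlap_ratio(question, answer))
```

The three pairs score 0.0, 0.5 and 1.0 respectively: none, one, and both of the answer words appear in the clue.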

In [22]:
#Counting how many times terms in clean_answer occur in clean_question

In [23]:
jeopardydata['answer_in_question'] = jeopardydata.apply(answer_deducible, axis=1)

In [24]:
jeopardydata['answer_in_question'].value_counts()

0.000000    17380
0.500000     1450
0.333333      551
0.250000      170
1.000000      122
0.666667      102
0.200000       82
0.166667       28
0.400000       28
0.142857       19
0.750000       18
0.285714       10
0.600000        9
0.125000        9
0.428571        3
0.181818        2
0.800000        2
0.571429        2
0.300000        2
0.111111        2
0.307692        1
0.444444        1
0.222222        1
0.375000        1
0.100000        1
0.153846        1
0.875000        1
0.272727        1
Name: answer_in_question, dtype: int64

In [25]:
jeopardydata['answer_in_question'].mean()

0.060352773854699004

The mean is about 0.06: on average, only around 6% of the words in an answer also appear in its question, and for 17380 of the 19999 questions there is no overlap at all. Hearing the question is therefore unlikely to give the answer away, so this contributes little to our winning strategy.

Now let's investigate the second point mentioned above: how often new questions are repeats of older ones.

We can't completely answer this, because we only have about 10% of the full Jeopardy questions dataset, but we can investigate it at least.

In [26]:
#Measuring how often words of six or more characters recur, as a proxy for repeated questions in Jeopardy

In [27]:
question_overlap = []
terms_used = set()

jeopardydata = jeopardydata.sort_values('Air Date')

for x, row in jeopardydata.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        else:
            terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
jeopardydata['question_overlap'] = question_overlap

jeopardydata['question_overlap'].mean()

0.6889055316620302
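
The effect of the loop is easiest to see on a tiny example. The sketch below runs the same logic over three made-up questions, assumed to be already cleaned and in air-date order; the function name `overlap_scores` is hypothetical.

```python
def overlap_scores(questions, min_length=6):
    terms_seen = set()
    scores = []
    for question in questions:
        # Keep only the "complex" words (six or more characters).
        words = [w for w in question.split(' ') if len(w) >= min_length]
        matches = 0
        for word in words:
            if word in terms_seen:
                matches += 1
            else:
                terms_seen.add(word)
        # Questions with no long words score 0.
        scores.append(matches / len(words) if words else 0)
    return scores

questions = [
    'this planet has famous rings',
    'this famous composer wrote nine symphonies',
    'the largest planet in the solar system',
]
print(overlap_scores(questions))
```

The first question scores 0 because no terms have been seen yet, while each later question scores 1/3 because one of its three long words appeared earlier. Scores tend to rise as the set of seen terms grows, so the mean partly reflects how late in the dataset a question appears.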

This result comes from only about 10% of the full Jeopardy questions dataset, and it counts single words rather than whole phrases, so it is only a rough signal. Still, a mean overlap of about 0.69 suggests that terms recur often, so studying old Jeopardy questions deserves a place in our winning strategy.

Another element of our winning strategy could be to focus on studying topics that come up in high value questions rather than low value ones. This can help us earn more money when we are on Jeopardy.

We can figure out which terms correspond to high value questions using a chi-squared test. First, we need to split the questions into two categories:

* Low value -- Any row where Value is 800 or less.
* High value -- Any row where Value is greater than 800.

Then, for each word in the set `terms_used` from our previous code, we can:

* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find expected counts.
* Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. 
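
For reference, the steps above can be written compactly as follows. The symbols are notation assumed here: $N$ is the total number of questions, $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high and low value questions, and $O_{\text{high}}$ and $O_{\text{low}}$ are the observed counts for one word.

$$
E_{\text{high}} = \frac{O_{\text{high}} + O_{\text{low}}}{N}\, N_{\text{high}}, \qquad
E_{\text{low}} = \frac{O_{\text{high}} + O_{\text{low}}}{N}\, N_{\text{low}}
$$

$$
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}} + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
$$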

For now, we'll run this on only a small sample of five terms from `terms_used` rather than on every term in `jeopardydata`.

In [28]:
#Determining which questions are high and low value

In [29]:
def question_value(data):
    #Return 1 for a high value question (more than $800), otherwise 0
    if data['clean_value'] > 800:
        return 1
    return 0

jeopardydata['high_value'] = jeopardydata.apply(question_value, axis=1)

In [30]:
jeopardydata['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [31]:
#Calculating the observed count

In [32]:
def highlow_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardydata.iterrows():
        if word in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return [high_count, low_count]

observed_expected = []
comparison_terms = list(terms_used)[:5]

for term in comparison_terms:
    observed_expected.append(highlow_count(term))

observed_expected

[[1, 1], [0, 1], [1, 0], [0, 1], [1, 0]]
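
Calling `highlow_count` once per term scans the whole dataset each time, which becomes slow for more than a handful of terms. One possible alternative, sketched below on made-up rows, counts every term in a single pass. The function `count_all_terms` and the sample data are assumptions for illustration, and the sketch keeps only words of six or more characters to mirror `terms_used`.

```python
from collections import defaultdict

# Each row is (clean_question, high_value); the data is made up.
rows = [
    ('this planet has famous rings', 1),
    ('this famous composer wrote nine symphonies', 0),
    ('the largest planet in the solar system', 0),
]

def count_all_terms(rows, min_length=6):
    # Maps each term to [high_count, low_count].
    counts = defaultdict(lambda: [0, 0])
    for question, high_value in rows:
        # set() so a word is counted at most once per question.
        for word in set(question.split(' ')):
            if len(word) < min_length:
                continue
            if high_value == 1:
                counts[word][0] += 1
            else:
                counts[word][1] += 1
    return dict(counts)

counts = count_all_terms(rows)
print(counts['planet'], counts['composer'])
```

Here `planet` appears in one high value and one low value question, while `composer` appears only in a low value one.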

Now that we have found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [33]:
#Calculating the expected count

In [34]:
from scipy.stats import chisquare

high_value_count = jeopardydata[jeopardydata["high_value"] == 1].shape[0]
low_value_count = jeopardydata[jeopardydata["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / len(jeopardydata)
    expected_highvalue = total_prop * high_value_count
    expected_lowvalue = total_prop * low_value_count
    expected = [expected_highvalue, expected_lowvalue]
    chi_squared.append(chisquare(obs, expected))

chi_squared

[Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]
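
As a sanity check, the first result can be reproduced by hand from the counts shown earlier (5734 high value questions, 14265 low value questions, and observed counts of [1, 1]). The sketch below uses only the standard library, and the name `chi_squared_by_hand` is hypothetical. The p-value line relies on the identity that, for one degree of freedom, the chi-squared survival function equals `erfc(sqrt(x / 2))`.

```python
import math

high_value_count = 5734
low_value_count = 14265
total_rows = high_value_count + low_value_count  # 19999

def chi_squared_by_hand(observed):
    # Expected counts assume the word is spread across high and low
    # value questions in proportion to how many of each there are.
    total_prop = sum(observed) / total_rows
    expected = [total_prop * high_value_count, total_prop * low_value_count]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

statistic, p_value = chi_squared_by_hand([1, 1])
print(round(statistic, 4), round(p_value, 4))
```

This should agree with the first result above (a statistic of about 0.4449 and a p-value of about 0.5048). All five p-values are well above 0.05, and the expected counts are far below the common rule of thumb of at least 5 per cell, so these rare terms show no significant difference between high and low value questions. Terms that occur more often would be needed for a meaningful test.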