Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. If you need help at any point, you can consult our solution notebook here.

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download here. Here's the beginning of the file:

In [164]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [165]:
jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number - the Jeopardy episode number
- Air Date - the date the episode aired
- Round - the round of Jeopardy
- Category - the category of the question
- Value - the number of dollars the correct answer is worth
- Question - the text of the question
- Answer - the text of the answer

In [166]:
# Remove the blank spaces from the column names by assigning clean names
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

### Normalizing text

Before you can start doing analysis on the Jeopardy questions, you need to normalize all of the text columns (the Question and Answer columns). We covered normalization before, but the idea is to ensure that you put words in lowercase and remove punctuation so Don't and don't aren't considered to be different words when you compare them.

In [167]:
import re

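def normalize_text(s):
    """Lowercase `s` and drop every character that is not a word character or whitespace."""
    s = s.lower()
    # Raw string so that \w and \s reach the regex engine unchanged
    s = re.sub(r'[^\w\s]', '', s)
    return s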

a = 'Hol^"A'
normalize_text(a)

'hola'

Normalizing the `Question` and `Answer` columns:

In [168]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

### Normalizing columns

Now that you've normalized the text columns, there are also some other columns to normalize.

The `Value` column should be numeric so that you can manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `Air Date` column should also be a datetime, not a string, so that you can work with it more easily.

In [169]:
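def normalize_dollars(s):
    """Strip punctuation such as '$' and return an int; return 0 if the value can't be parsed."""
    try:
        return int(re.sub(r'[^\w\s]', '', s))
    except (TypeError, ValueError):
        # TypeError: the value is not a string; ValueError: the text is not a number
        return 0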

In [170]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollars)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
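
To make the fallback behaviour concrete, here is a small sketch that runs the same cleaning rule on a few inputs. The sample values are made up for illustration (in particular the thousands separator and the two missing-value cases are assumptions, not rows taken from `jeopardy.csv`), and `to_dollars` is a hypothetical stand-in for the function above.

```python
import re

def to_dollars(value):
    """Hypothetical stand-in for normalize_dollars: keep digits, return 0 when unparseable."""
    try:
        return int(re.sub(r'[^\w\s]', '', value))
    except (TypeError, ValueError):
        return 0

# Made-up inputs: a plain value, one with a thousands separator,
# non-numeric text, and a missing value represented as a float NaN
samples = ['$200', '$2,000', 'None', float('nan')]
cleaned = [to_dollars(v) for v in samples]
print(cleaned)
```

Anything that cannot be parsed becomes 0, so those rows will later be counted as low value questions. It may be worth checking how many rows take that path before relying on `clean_value`.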

### Answers in questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

You can answer the second question by seeing **how often complex words (6 or more characters) reoccur**.
You can answer the first question by seeing **how many times words in the answer also occur in the question**. We'll work on the first question and come back to the second.

In [171]:
def count_matches(row):
    """
    Returns the share of the words in 'clean_answer' that also appear in 'clean_question'. The input is a pd.Series (a row in the dataset)
    """
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    
    # Remove 'the', which says nothing about the answer (clean_answer is already lowercase)
    if 'the' in split_answer:
        split_answer.remove('the')
    
    # Edge case
    if len(split_answer) == 0:
        return 0
    
    # Matches
    match_count = len( set(split_answer).intersection(set(split_question)) )
    
    return match_count / len(split_answer)

In [172]:
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis = 1)

In [173]:
jeopardy['answer_in_question'].value_counts(normalize = True) * 100

0.000000    86.879344
0.500000     7.235362
0.333333     2.770139
0.250000     0.845042
1.000000     0.580029
0.666667     0.515026
0.200000     0.435022
0.166667     0.155008
0.142857     0.130007
0.400000     0.120006
0.750000     0.090005
0.125000     0.055003
0.285714     0.035002
0.600000     0.035002
0.800000     0.015001
0.428571     0.010001
0.714286     0.010001
0.875000     0.010001
0.100000     0.010001
0.181818     0.010001
0.222222     0.010001
0.111111     0.010001
0.300000     0.010001
0.571429     0.005000
0.307692     0.005000
0.153846     0.005000
0.375000     0.005000
0.272727     0.005000
Name: answer_in_question, dtype: float64

In some cases, words from the answer also appear in the question. However, this happens in only a minority of rows: about 87% of the questions share no words with their answer. It is probably not worth trying to deduce the answer from the question.
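
To see how a fractional score such as 0.5 arises, here is a hand-checkable run of the same matching logic on a single row. This is only a sketch: `answer_share_in_question` is a hypothetical helper that takes two strings instead of a DataFrame row, and the question wording is invented (only the answer `the grand canyon` is borrowed from the dataset).

```python
def answer_share_in_question(clean_answer, clean_question):
    """Share of answer words (ignoring 'the') that also appear in the question."""
    answer_words = clean_answer.split(' ')
    question_words = clean_question.split(' ')
    if 'the' in answer_words:
        answer_words.remove('the')
    if len(answer_words) == 0:
        return 0
    shared = set(answer_words).intersection(question_words)
    return len(shared) / len(answer_words)

# Invented question text, used only to illustrate the arithmetic
score = answer_share_in_question(
    'the grand canyon',
    'the colorado river dug this grand hole in arizona',
)
print(score)  # 0.5
```

After `the` is dropped, the answer has two words (`grand`, `canyon`), and one of them occurs in the question, which gives 1 / 2.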

### Recycled questions

Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.

To do this, you can:

- Sort `jeopardy` in order of ascending air date.
- Maintain a set called `terms_used` that will be empty initially.
- Iterate through each row of `jeopardy`.
- Split `clean_question` into words, remove any word shorter than `6` characters, and check if each word occurs in `terms_used`.
    - If it does, increment a counter.
    - Add each word to `terms_used`.
     
This allows you to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables you to filter out words like `the` and `than`, which are commonly used, but don't tell you a lot about a question.
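
Before running this on the full dataset, it can help to trace the steps above on a tiny example. The sketch below uses three made-up, already-normalized questions listed in air-date order. The names `questions`, `terms_seen`, and `overlaps` are illustrative only.

```python
# Made-up questions, assumed to be normalized and sorted by air date
questions = [
    'washington proclaimed this holiday',
    'this president followed washington',
    'a short one',
]

terms_seen = set()
overlaps = []
for question in questions:
    # Keep only words with six or more characters
    long_words = [w for w in question.split(' ') if len(w) >= 6]
    repeats = 0
    for word in long_words:
        if word in terms_seen:
            repeats += 1
        terms_seen.add(word)
    # A question with no long words gets an overlap of 0
    overlaps.append(repeats / len(long_words) if long_words else 0)

print(overlaps)
```

Only the second question reuses a long word (`washington`), so its overlap is 1 / 3. The result depends on the order of the rows, which is why the data is sorted by air date first.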

In [174]:
questions_overlap = []
terms_used = set([])

jeopardy.sort_values(by = 'Air Date', inplace = True)

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this ...,the grand canyon,200,0.0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.0


In [175]:
# Iterating over dataframe
for index,series in jeopardy.iterrows():
    # Splitting question
    split_question = series['clean_question'].split(' ')
    split_question = [x for x in split_question if len(x) >= 6]
    
    # Counter of matches for a specific row...
    match_count = 0
    for s in split_question:
        if s in terms_used:
            match_count += 1
        terms_used.add(s)
    
    if len(split_question) > 0:
        questions_overlap.append(match_count / len(split_question))
    else:
        questions_overlap.append(0)

In [176]:
# Assign the list directly: a pd.Series would be realigned on the old, unsorted index
jeopardy['question_overlap'] = questions_overlap
jeopardy['question_overlap'].mean()

0.6894006357823148

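On average, about 69% of the long words in a question have already appeared in an earlier question. Either a lot of questions are being recycled, or the questions keep returning to the same themes. Because this check looks at single words rather than phrases, it can't tell these two explanations apart.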

### Low value vs High value questions

Let's say you only want to study material that shows up in high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

**You can actually figure out which terms correspond to high-value questions using a chi-squared test**. You'll first need to narrow down the questions into two categories:

- **Low value** -- Any row where Value is 800 or less.
- **High value** -- Any row where Value is greater than 800.

You'll then be able to loop through each of the terms from the previous section, `terms_used`, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
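
The steps above can be sketched by hand for a single word. The totals used here (300 high value and 700 low value questions) are hypothetical round numbers chosen to keep the arithmetic easy to follow, not the counts in `jeopardy.csv`, and `chi_squared_for_term` is an illustrative helper rather than a library function.

```python
def chi_squared_for_term(high_obs, low_obs, n_high, n_low):
    """Chi-squared statistic for one word, given observed counts and group sizes."""
    total = high_obs + low_obs
    # Share of all questions that contain the word
    share = total / (n_high + n_low)
    # Counts expected if the word were spread evenly over both groups
    expected_high = share * n_high
    expected_low = share * n_low
    return ((high_obs - expected_high) ** 2 / expected_high
            + (low_obs - expected_low) ** 2 / expected_low)

# Hypothetical word seen in 5 high value and 2 low value questions
chi = chi_squared_for_term(5, 2, n_high=300, n_low=700)
print(round(chi, 4))
```

With these numbers the expected counts are 2.1 and 4.9, so the statistic is 8.41 / 2.1 + 8.41 / 4.9, or roughly 5.72. The further the observed split is from the overall split between the groups, the larger the value.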

In [177]:
# 1 if the question is worth more than 800 dollars, 0 otherwise
jeopardy['high_value'] = (jeopardy['clean_value'] > 800).astype(int)

In [178]:
def word_high_low(word):
    "Returns two numbers: how many high-value and how many low-value questions the word appears in"
    high_count = low_count = 0

    for index,series in jeopardy.iterrows():
        split_question = series['clean_question'].split(' ')
        if word in split_question:
            if series['high_value'] == 1:
                high_count += 1
            else: 
                low_count += 1
    
    return high_count, low_count
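
Scanning the whole dataset once per word is what makes this approach slow. One possible alternative, shown here only as a sketch on made-up data, is to count every word in a single pass and then look the counts up. The `rows` list and the name `word_high_low_fast` are hypothetical.

```python
from collections import Counter

# Made-up (clean_question, high_value) pairs standing in for the DataFrame rows
rows = [
    ('this target was hit by the archer', 1),
    ('the father of the bride', 0),
    ('father and target practice', 1),
]

high_counts = Counter()
low_counts = Counter()
for question, high_value in rows:
    # A set counts each word once per question, like `word in split_question`
    words = set(question.split(' '))
    if high_value == 1:
        high_counts.update(words)
    else:
        low_counts.update(words)

def word_high_low_fast(word):
    """Return (high_count, low_count) for `word` from the precomputed counters."""
    return high_counts[word], low_counts[word]

print(word_high_low_fast('target'), word_high_low_fast('father'))
```

The counters are built once, so each later lookup no longer depends on the number of questions.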

In [179]:
import random

# random.sample draws without replacement, so no term is picked twice
comparison_terms = random.sample(list(terms_used), k = 40)

print(comparison_terms)

observed_expected = list( map(word_high_low, comparison_terms) )

observed_expected

['jakarta', 'tanganyika', 'sylphide', 'hrefhttpwwwjarchivecommedia20001123_dj_19wmvithese', 'intoned', 'consort', 'francesco', 'target', 'appearedin', 'gossip', '365bodypoint', 'junker', 'hrefhttpwwwjarchivecommedia20080313_dj_28jpg', 'anybody', 'brillante', 'restore', 'retsyn', 'currie', 'conestoga', 'josiah', 'raffaele', 'sugarless', 'dammini', 'welcomes', 'portion', 'father', 'leadeth', 'raping', 'grandparents', 'bonneville', 'lifeguard', 'sheryl', 'inmate', 'devourer', 'hrefhttpwwwjarchivecommedia20090105_dj_09jpg', 'flavian', 'highestranking', 'judean', 'hrefhttpwwwjarchivecommedia20091002_j_28jpg', 'vocabulary']


[(1, 1),
 (0, 1),
 (1, 0),
 (0, 1),
 (1, 0),
 (0, 3),
 (0, 1),
 (5, 2),
 (0, 1),
 (0, 2),
 (0, 1),
 (1, 0),
 (1, 0),
 (0, 1),
 (0, 1),
 (0, 4),
 (0, 1),
 (1, 0),
 (1, 1),
 (0, 2),
 (1, 0),
 (0, 1),
 (1, 0),
 (0, 1),
 (2, 8),
 (26, 84),
 (0, 2),
 (0, 1),
 (0, 2),
 (0, 1),
 (0, 2),
 (0, 4),
 (0, 1),
 (0, 1),
 (0, 1),
 (1, 0),
 (0, 1),
 (0, 1),
 (0, 1),
 (1, 1)]

### Applying the chi-squared test

In [180]:
from scipy.stats import chisquare

high_value_counts = jeopardy['high_value'].sum()
low_value_counts = jeopardy.shape[0] - high_value_counts

chi_squared = []
p_values = []

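for term in observed_expected:
    # term holds the observed (high_count, low_count) for one word
    total = term[0] + term[1]
    # Share of all questions that contain the word
    total_prop = total / jeopardy.shape[0]
    # Counts expected if the word were spread evenly over high and low value questions
    expected_high = total_prop * high_value_counts
    expected_low = total_prop * low_value_counts

    chi, p_value = chisquare(list(term), [expected_high, expected_low])
    chi_squared.append(chi)
    p_values.append(p_value)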

In [181]:
chi_squared, p_values, observed_expected, comparison_terms

([0.4448774816612795,
  0.401962846126884,
  2.487792117195675,
  0.401962846126884,
  2.487792117195675,
  1.205888538380652,
  0.401962846126884,
  6.2575220449142,
  0.401962846126884,
  0.803925692253768,
  0.401962846126884,
  2.487792117195675,
  2.487792117195675,
  0.401962846126884,
  0.401962846126884,
  1.607851384507536,
  0.401962846126884,
  2.487792117195675,
  0.4448774816612795,
  0.803925692253768,
  2.487792117195675,
  0.401962846126884,
  2.487792117195675,
  0.401962846126884,
  0.36767906209032747,
  1.3636119408688154,
  0.803925692253768,
  0.401962846126884,
  0.803925692253768,
  0.401962846126884,
  0.803925692253768,
  1.607851384507536,
  0.401962846126884,
  0.401962846126884,
  0.401962846126884,
  2.487792117195675,
  0.401962846126884,
  0.401962846126884,
  0.401962846126884,
  0.4448774816612795],
 [0.5047776487545996,
  0.5260772985705469,
  0.11473257634454047,
  0.5260772985705469,
  0.11473257634454047,
  0.27214791766901714,
  0.5260772985705469

In [183]:
i = 7
chi_squared[i], p_values[i], observed_expected[i], comparison_terms[i]

(6.2575220449142, 0.012366706058156086, (5, 2), 'target')
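
Rather than picking an index by hand, the terms can be ranked by their chi-squared value. The sketch below does this for four of the terms and values printed above; copying only those four is a simplification for illustration, and `ranked` is a name introduced here.

```python
# Four terms and their chi-squared values, copied from the output above
comparison_terms = ['jakarta', 'tanganyika', 'sylphide', 'target']
chi_squared = [0.4448774816612795, 0.401962846126884,
               2.487792117195675, 6.2575220449142]

# Pair each term with its statistic and sort from largest to smallest
ranked = sorted(zip(comparison_terms, chi_squared),
                key=lambda pair: pair[1], reverse=True)

for term, chi in ranked:
    print(f'{term}: {chi:.2f}')
```

`target` comes out on top, but it appears in only 7 questions (5 high value, 2 low value). The chi-squared test is generally considered unreliable when the expected counts are this small, so the result is best read as a hint rather than firm evidence.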