# Preparing for Jeopardy

Jeopardy is an American TV show in which participants answer questions to win money. It has been running for many years. Over multiple rounds, contestants can choose a category, and get a question from that category, where different questions have different dollar values. A more extensive description can be found [here](https://en.wikipedia.org/wiki/Jeopardy!#Gameplay).

Imagine you want to participate in Jeopardy - and win. And you wonder yourself: how do I prepare for this? Is it just a matter of studying a lot? Or possibly, is there something to learn from questions from the past?

A couple of years ago, someone crawled Jeopardy archives and [posted on Reddit](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/) a short article from which one can download a file with no less then 216,930 earlier Jeopardy questions, with answers and other data.

In this notebook, we are going to explore if there is something to learn from Jeopardy history that will help you prepare. The notebook contains the following sections:
1. Initial data exploration
2. Data re-formatting  
2.1 Columns "Question" and "Answer"    
2.2 Column "Value"   
2.3 Column "Air Date"
3. Data analysis  
3.1 Answer included in question  
3.2 Repeated and popular terms  
3.3 Terms used in high-value questions  
4. Conclusions

## 1. Initial data exploration

Let's start with reading in the data (I took the .csv file) and explore it.  

In [1]:
# Import pandas library 
import pandas as pd

# Import the data into a dataframe
jeopardy = pd.read_csv('JEOPARDY_CSV.csv')

# Show some rows
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
# Get column information
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216930 non-null int64
 Air Date      216930 non-null object
 Round         216930 non-null object
 Category      216930 non-null object
 Value         216930 non-null object
 Question      216930 non-null object
 Answer        216928 non-null object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


In [3]:
# Get the number of rows and columns
jeopardy.shape

(216930, 7)

Some column names appear to have leading spaces. That's inconvenient, so let's remove those.

In [4]:
# Remove the leading spaces from the column names
jeopardy.columns = jeopardy.columns.str.strip()

# Check the result
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216930 non-null int64
Air Date       216930 non-null object
Round          216930 non-null object
Category       216930 non-null object
Value          216930 non-null object
Question       216930 non-null object
Answer         216928 non-null object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


In [5]:
# Check from when the questions originate 
jeopardy['Air Date'].value_counts().sort_index()

1984-09-10    48
1984-09-11    50
1984-09-12    51
1984-09-13    53
1984-09-14    54
              ..
2012-01-20    58
2012-01-23    61
2012-01-24    59
2012-01-25    61
2012-01-27    30
Name: Air Date, Length: 3640, dtype: int64

In [6]:
# Show a sample again with better layout

# Avoid truncation
pd.set_option('display.max_colwidth', -1)
# Display with left alignment
jeopardy.head(10).style.set_properties(**{'text-align': 'left'}).set_table_styles([ dict(selector='th', props=[('text-align', 'left')] ) ])

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States",John Adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect shared billing with a grasshopper",the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,"Built in 312 B.C. to link Rome & the South of Italy, it's still in use today",the Appian Way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,306 steals for the Bulls",Michael Jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state",Washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packaging its merchandise came in & was first displayed on,Crate & Barrel


In [7]:
# Summary of the questions that we have
print(jeopardy.shape[0],'questions dating from', jeopardy['Air Date'].min(),'to', jeopardy['Air Date'].max())

216930 questions dating from 1984-09-10 to 2012-01-27


In [8]:
# Check how many questions per round
jeopardy['Round'].value_counts()

Jeopardy!           107384
Double Jeopardy!    105912
Final Jeopardy!     3631  
Tiebreaker          3     
Name: Round, dtype: int64

So we have 216,930 questions from between 1984 and 2012. It looks like the data in the columns is pretty complete (almost no missing values, only 2 answers). Most columns took the format of an `object`.

One somewhat surprising observation are the "Questions" and "Answers". From [a game-play description](https://en.wikipedia.org/wiki/Jeopardy!#Gameplay) I understood that participants do not so much get a "question" that they need to "answer", but rather get an "answer" for which they need to come up with the right "question". The sample above does not really show that. If someone is told "Mc Donald's", I can hardly imagine someone asking which fast food chain served its billionth burger live on The Art Linkletter Show in 1963.  

I am not sure about the cause of this descrepancy between (my understanding of) Jeopardy game-play and the question-and-answer-archive. However, for the purpose of our study, it seems okay to just consider this "questions" with "answers".



## 2. Data re-formatting

For analysis purposes, it is helpful though to reformat and normalize parts of the data. That's what we will do in this section.


### 2.1 Columns "Question" and "Answer"

To be able to correctly count words (in `Question` and `Answer`):
* remove interpunction
* put everything in lowercase

In [9]:
# Import re library to enable reformatting 
import re

# Create a function that takes a string and returns it normalized (no interpunction, all lowercase)
def normalize_string(input):
    replaced_interpunction = re.sub(r'\W', ' ', input).lower()
    removed_spaces = re.sub(' +',' ', replaced_interpunction).strip()
    return removed_spaces

In [10]:
# Test function
normalize_string('Hello!! Do DO dO  16:17 ?two2,and:FOO, bar?')

'hello do do do 16 17 two2 and foo bar'

Looks good. Let's apply this to columns `Question` and `Answer`. We'll add new columns with the result.

In [11]:
# Add 2 columns, with normalized versions of Question and Answer. (Offer a 'string versions' of the objects.)
jeopardy['question_clean'] = jeopardy['Question'].astype('str').apply(normalize_string)
jeopardy['answer_clean'] = jeopardy['Answer'].astype('str').apply(normalize_string)

In [12]:
# Show result on a random sample
jeopardy.sample(10, random_state = 0)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,question_clean,answer_clean
112079,3452,1999-09-14,Double Jeopardy!,SWEET 16,$1000,"This king had a lot taken off the top January 21, 1793",Louis XVI,this king had a lot taken off the top january 21 1793,louis xvi
50465,3967,2001-11-27,Jeopardy!,MISS UNIVERSE,$800,"Crowned in Cyprus in May 2000, Bombay U. grad Lara Dutta represented this country",India,crowned in cyprus in may 2000 bombay u grad lara dutta represented this country,india
71223,5920,2010-05-14,Double Jeopardy!,FAUX,$800,"Nickname of Sam, leader of The Pharaohs, who sang ""Wolly Bully""","""The Sham""",nickname of sam leader of the pharaohs who sang wolly bully,the sham
26234,2916,1997-04-14,Double Jeopardy!,FICTIONAL FEMALES,$800,"Miranda, a young woman, appears in several of her works, including ""Pale Horse, Pale Rider""",Katherine Anne Porter,miranda a young woman appears in several of her works including pale horse pale rider,katherine anne porter
86973,1295,1990-03-30,Jeopardy!,FAMOUS JOES & JOSEPHS,$200,This Delaware senator chairs the Senate Judiciary Committee,Joseph Biden,this delaware senator chairs the senate judiciary committee,joseph biden
127358,5685,2009-05-01,Double Jeopardy!,BROWN,$800,"Between 1960 & 1986, he racked up 44 Top 40 hits, but no No. 1s",James Brown,between 1960 1986 he racked up 44 top 40 hits but no no 1s,james brown
148314,3810,2001-03-09,Jeopardy!,IN A MINUTE,$500,"Under the slogan ""Real Estate for the Real World"", this company claims on average to buy or sell a home every minute",Century 21,under the slogan real estate for the real world this company claims on average to buy or sell a home every minute,century 21
115787,2883,1997-02-26,Double Jeopardy!,BLACK AMERICA,$1000,Ebony & Jet are among the magazines launched by this publisher,John Johnson,ebony jet are among the magazines launched by this publisher,john johnson
118519,3472,1999-10-12,Jeopardy!,A LITTLE DICKENS,$500,It's no mystery why this work was Dickens' last; he didn't live to finish it,The Mystery of Edwin Drood,it s no mystery why this work was dickens last he didn t live to finish it,the mystery of edwin drood
193578,5447,2008-04-22,Jeopardy!,MELROSE PLACE,$400,"Hey, nice to meet <a href=""http://www.j-archive.com/media/2008-04-22_J_08.jpg"" target=""_blank"">this</a> actress who played Jennifer Mancini in 1997; ""Charmed"", I'm sure",Alyssa Milano,hey nice to meet a href http www j archive com media 2008 04 22_j_08 jpg target _blank this a actress who played jennifer mancini in 1997 charmed i m sure,alyssa milano


Looks good. The last one in the table shows there is some messy data, where a hyperlink to a picture was included in the data.

We'll ignore that for now, but let's keep it in mind.

### 2.2 Column "Value"

Next, we'll change column `Value` into a numeric field, to be able to manipulate it easier. In the samples so far we see entries like \\$200 and \\$1,800. Let's first check if there is more.

In [1]:
# Check which different values there are for field Value
jeopardy['Value'].unique()

NameError: name 'jeopardy' is not defined

In [14]:
# Check the amount of 'None' values
len(jeopardy[jeopardy['Value']=='None'])

3634

Not entirely sure what they are, but let's replace all 'None' values with 0. Convert everything else to a number.

In [15]:
# Create a function that takes a string as in Value column and returns a number
def normalize_value(input):
    if input == 'None':
        output = 0
    else:
        keep_numbers = input.replace('$','').replace(',','')
        output = int(keep_numbers)
    #replaced_interpunction = re.sub(r'\W', ' ', input).lower()
    #removed_spaces = re.sub(' +',' ', replaced_interpunction).strip()
    return output

In [16]:
# Test the function
print (normalize_value('None'), normalize_value('$200'), normalize_value('$1,534'), normalize_value('$200')+normalize_value('$1,534'))

0 200 1534 1734


Looks good. Let's apply this to column `Value`. We'll add a new column with the result.

In [17]:
# Add a column, with normalized versions of Value. (Offer a 'string version' of the object.)
jeopardy['value_clean'] = jeopardy['Value'].astype('str').apply(normalize_value)

In [18]:
# Verification 1: check that all are numbers, by summing them
print('Total is:', jeopardy['value_clean'].sum())
# Verification 2: show all values
print(sorted(jeopardy['value_clean'].unique()))

Total is: 160525700
[0, 5, 20, 22, 50, 100, 200, 250, 300, 350, 367, 400, 500, 585, 600, 601, 700, 750, 796, 800, 900, 1000, 1020, 1100, 1111, 1183, 1200, 1203, 1246, 1263, 1300, 1347, 1400, 1407, 1492, 1500, 1512, 1534, 1600, 1700, 1777, 1800, 1801, 1809, 1810, 1900, 2000, 2001, 2021, 2100, 2127, 2200, 2222, 2300, 2344, 2400, 2500, 2547, 2600, 2700, 2746, 2800, 2811, 2900, 2990, 3000, 3100, 3150, 3200, 3201, 3300, 3389, 3400, 3499, 3500, 3599, 3600, 3700, 3800, 3900, 3989, 4000, 4008, 4100, 4200, 4238, 4300, 4400, 4500, 4600, 4637, 4700, 4800, 5000, 5001, 5100, 5200, 5201, 5400, 5401, 5500, 5600, 5700, 5800, 6000, 6100, 6200, 6300, 6400, 6435, 6600, 6700, 6800, 7000, 7200, 7400, 7500, 7600, 7800, 8000, 8200, 8400, 8500, 8600, 8700, 8800, 8917, 9000, 9200, 9500, 9800, 10000, 10400, 10800, 11000, 11200, 11600, 12000, 12400, 13000, 13200, 13800, 14000, 14200, 16400, 18000]


Looks good.

### 2.3 Column "Air Date"

Next, we'll turn `Air Date` into a date field, which is easier to analyze. We'll add a new column.

In [19]:
# Add a column, with the airdate as date-time
jeopardy['date_clean'] = pd.to_datetime(jeopardy['Air Date'], format = '%Y-%M-%d')
# Check result
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 11 columns):
Show Number       216930 non-null int64
Air Date          216930 non-null object
Round             216930 non-null object
Category          216930 non-null object
Value             216930 non-null object
Question          216930 non-null object
Answer            216928 non-null object
question_clean    216930 non-null object
answer_clean      216930 non-null object
value_clean       216930 non-null int64
date_clean        216930 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(8)
memory usage: 18.2+ MB


In [20]:
# Check a sample
jeopardy.head(2)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,question_clean,answer_clean,value_clean,date_clean
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus,for the last 8 years of his life galileo was under house arrest for espousing this man s theory,copernicus,200,2004-01-31 00:12:00
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe,no 2 1912 olympian football star at carlisle indian school 6 mlb seasons with the reds giants braves,jim thorpe,200,2004-01-31 00:12:00


Looks good.

## 3. Data analysis

Now we have a set of data in an easy-to-analyze format we can start analysis.

### 3.1 Answer included in question

What may happen sometimes, is that the question already contains the answers, or parts of it. If that happens a lot, that may help you develop your strategy to win.

We are going to analyze such overlap between questions and answers by calculating (for every question-with-answer) which fraction of words in the answer is also a word in the question. The words `the` and `a` will be excluded: they appear a lot, but are not meaningful for this analysis.


In [21]:
# Create a function that takes a row of the dataframe as an input, and returns
# how many times terms in answer_clean occur in question_clean
def count_overlap(row):
    # Split question and answer in individual words
    split_answer = row['answer_clean'].split()
    split_question = row['question_clean'].split()
    # print(split_question) # commented out after verifying
    
    # Remove all occurrences of 'the' from the question (as this is not meaningful)
    while 'the' in split_question:
        split_question.remove('the')
    # print(split_question) # commented out after verifying
    
    # Do the same for 'a' (added after seeing the result)
    while 'a' in split_question:
        split_question.remove('a')
    # print(split_question) # commented out after verifying
    
    # Count how many words in the answer appear in the question, calculate the fraction
    result = 0
    match_count = 0
    if len(split_answer) > 0:
        for word in split_answer:
            if word in split_question:
                match_count +=1
        # print ('match_count:', match_count) # commented out after verifying
        # print ('answer length:',len(split_answer)) # commented out after verifying
        result = match_count / len(split_answer)
    
    return result       
    

In [22]:
# test an example that contains overlap
test_row1 = jeopardy.iloc[118519]
print(test_row1['question_clean'])
print(test_row1['answer_clean'])
count_overlap(test_row1)

it s no mystery why this work was dickens last he didn t live to finish it
the mystery of edwin drood


0.2

In [23]:
# test an example having 'the' in the question multiple times
# (This test tested whether all instances of 'the' were removed; after commenting out print this is not visible anymore)
test_row2 = jeopardy.iloc[7]
print(test_row2['question_clean'])
print(test_row2['answer_clean'])
count_overlap(test_row2)

no 8 30 steals for the birmingham barons 2 306 steals for the bulls
michael jordan


0.0

In [24]:
# test an example where 'the' is in the overlap  question containing 'the' multiple times
test_row2 = jeopardy.iloc[5]
print(test_row2['question_clean'])
print(test_row2['answer_clean'])
count_overlap(test_row2)

in the title of an aesop fable this insect shared billing with a grasshopper
the ant


0.0

Looks good so far, so let's apply this to the dataframe (add a new column), and then check some more results.


In [25]:
# Add a new column to include the indicator of overlap between question and answer
jeopardy['answer_in_question'] = jeopardy.apply(count_overlap, axis = 1)

In [26]:
# See the outcome: fraction of words in the answer that is also in the question
jeopardy['answer_in_question'].value_counts()

0.000000    195068
0.500000    8179  
0.333333    6437  
0.250000    2435  
1.000000    1237  
0.200000    1151  
0.666667    654   
0.166667    467   
0.400000    350   
0.142857    210   
0.285714    129   
0.125000    118   
0.750000    98    
0.600000    73    
0.111111    51    
0.222222    40    
0.428571    36    
0.375000    28    
0.100000    21    
0.800000    20    
0.571429    14    
0.090909    12    
0.714286    9     
0.181818    9     
0.300000    9     
0.083333    8     
0.153846    7     
0.625000    7     
0.833333    7     
0.272727    5     
0.555556    3     
0.857143    3     
0.444444    3     
0.888889    2     
0.076923    2     
0.363636    2     
0.066667    2     
0.133333    2     
0.545455    2     
0.071429    2     
0.117647    1     
0.052632    1     
0.769231    1     
0.454545    1     
0.583333    1     
0.727273    1     
0.384615    1     
0.875000    1     
0.909091    1     
0.105263    1     
0.818182    1     
0.700000    1     
0.642857    

In [27]:
# Check some samples with an overlap fraction of 0.5
jeopardy[jeopardy['answer_in_question'] ==0.5].head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,question_clean,answer_clean,value_clean,date_clean,answer_in_question
53,4680,2004-12-31,Double Jeopardy!,MUSICAL TRAINS,$2000,"In 1961 James Brown announced ""all aboard"" for this train","""Night Train""",in 1961 james brown announced all aboard for this train,night train,2000,2004-01-31 00:12:00,0.5
68,5957,2010-07-06,Jeopardy!,"GEOGRAPHY ""E""",$600,"This island in the South Pacific is named for the day of its discovery, a religious holiday",Easter Island,this island in the south pacific is named for the day of its discovery a religious holiday,easter island,600,2010-01-06 00:07:00,0.5
80,5957,2010-07-06,Jeopardy!,"GEOGRAPHY ""E""","$2,000",The family history you wrote for school might include entering the U.S. at this island in New York Bay,Ellis Island,the family history you wrote for school might include entering the u s at this island in new york bay,ellis island,2000,2010-01-06 00:07:00,0.5
83,5957,2010-07-06,Jeopardy!,BE FRUITFUL & MULTIPLY,$1000,"2 x 1,035",2070,2 x 1 035,2 070,1000,2010-01-06 00:07:00,0.5
112,5957,2010-07-06,Double Jeopardy!,JUST THE FACTS,$2000,He's the older son of Prince Charles and the late Princess Diana,Prince William,he s the older son of prince charles and the late princess diana,prince william,2000,2010-01-06 00:07:00,0.5


In [28]:
# Check some samples with an overlap fraction of 0.8
jeopardy[jeopardy['answer_in_question'] ==0.8].head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,question_clean,answer_clean,value_clean,date_clean,answer_in_question
8375,4787,2005-05-31,Jeopardy!,TALK LIKE A BRIT,$400,"Of stay in bed, hit someone on the head or rub till it's red, what you do if you cosh",hit someone on the head,of stay in bed hit someone on the head or rub till it s red what you do if you cosh,hit someone on the head,400,2005-01-31 00:05:00,0.8
10975,4362,2003-07-15,Jeopardy!,STUPID ANSWERS,$1000,It's the state song of the state of Maine,"""The State of Maine Song""",it s the state song of the state of maine,the state of maine song,1000,2003-01-15 00:07:00,0.8
15184,3340,1999-02-26,Double Jeopardy!,PUT 'EM IN ORDER,$200,"Oscar Winners ""The English Patient"", ""Unforgiven"", ""Braveheart""","Unforgiven, Braveheart, The English Patient",oscar winners the english patient unforgiven braveheart,unforgiven braveheart the english patient,200,1999-01-26 00:02:00,0.8
19375,6070,2011-01-21,Double Jeopardy!,JOB HUNTING,$1600,"In a 60-year-old man age really takes its toll on the body, no matter which sport he works in",manager (in man age really),in a 60 year old man age really takes its toll on the body no matter which sport he works in,manager in man age really,1600,2011-01-21 00:01:00,0.8
19644,4583,2004-07-07,Jeopardy!,WHAT'S THE NEXT LINE?,$800,"""Yes we have no bananas...""","""...We have no bananas today""",yes we have no bananas,we have no bananas today,800,2004-01-07 00:07:00,0.8


In [29]:
# Check some samples with an overlap fraction of 1.0
jeopardy[jeopardy['answer_in_question'] ==1.0].head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,question_clean,answer_clean,value_clean,date_clean,answer_in_question
266,4931,2006-02-06,Double Jeopardy!,NOT A CURRENT NATIONAL CAPITAL,$400,"Ljubljana, Bratislava, Barcelona",Barcelona,ljubljana bratislava barcelona,barcelona,400,2006-01-06 00:02:00,1.0
272,4931,2006-02-06,Double Jeopardy!,NOT A CURRENT NATIONAL CAPITAL,$800,"Istanbul, Ottawa, Amman",Istanbul,istanbul ottawa amman,istanbul,800,2006-01-06 00:02:00,1.0
278,4931,2006-02-06,Double Jeopardy!,NOT A CURRENT NATIONAL CAPITAL,$1200,"Sofia, Sarajevo, Saigon",Saigon,sofia sarajevo saigon,saigon,1200,2006-01-06 00:02:00,1.0
284,4931,2006-02-06,Double Jeopardy!,NOT A CURRENT NATIONAL CAPITAL,$1600,"Bucharest, Bonn, Bern",Bonn,bucharest bonn bern,bonn,1600,2006-01-06 00:02:00,1.0
290,4931,2006-02-06,Double Jeopardy!,NOT A CURRENT NATIONAL CAPITAL,$2000,"Belize City, Guatemala City, Panama City",Belize City,belize city guatemala city panama city,belize city,2000,2006-01-06 00:02:00,1.0


In [30]:
# Calculate the mean of the overlap fraction
jeopardy['answer_in_question'].mean()

0.042752736697187994

What we can observe:
* for the vast majority of questions (195K out of 216K) there is no overlap at all: the answer does not appear in the question; and on average only 0.04 words of the answer were part of the question
* then, when looking at examples with relatively much overlap, it's not going to be helpful; e.g. multiple choice questions where the answer is indeed part of the question; or the question asks 'which island' and the answer contains 'island'

A strategy where you hope to find answers in the questions themselves is not going to help you whatsoever to win Jeopardy.

## 3.2 Repeated and popular terms

Jeopardy has been running for many, many years. One may wonder till what extent the same questions are repeated.

We are going to analyze, not by finding questions that are literally the same, but by analyzing for all questions which fraction of the words in those questions appeared in questions before as well. We will focus on longer terms (6 characters or more), as those are the termss that typically form the 'heart' of the question. Shorter words (not only 'the' and 'a', but also e.g. 'more' and 'each') that are repeated won't teach us a lot.  

We'll on the fly also simply create an overview of the terms that appear most in the questions, including a count how many times. That could be interesting information as well.

In [31]:
# Iterate over the rows of the jeopardy dataframe, and calculates the fraction of terms (words longer than 5 characters)
# that appeared in earlier questions as well. Store these values in a list.

# On the fly, create a dictionary of all terms and their frequency over all questions.

# Initate a list with the overlap_count, a set with all terms used, and a dictionary with all words used
questions_overlap = []
terms_used = set()
terms_dictionary = {}

# Iterare 
for i, row in jeopardy.iterrows():
    split_question = row['question_clean'].split()
    # print(split_question) # commented out after verifying
    # Only keep the words 6 characters or longer 
    split_question = [word for word in split_question if len(word) >5]
    # print(split_question) # commented out after verifying
    
    match_count = 0
    
    # Check for every term whether used before already. If so: increase counts. If not: add to set and dictionary.
    for word in split_question:
        if word in terms_used:
            match_count += 1
            terms_dictionary[word]+=1
        else:
            terms_dictionary[word] = 1
        terms_used.add(word)
    
    # Calculate fraction 
    if len(split_question)>0:
        fraction_used_before = match_count / len(split_question)
    
    # Append fraction to list
    questions_overlap.append(fraction_used_before)

In [32]:
# Add the calculated fractions (a list with the correct length and in correct sequence) as a new column to the datafame
jeopardy['fraction_overlap_before'] = questions_overlap

In [33]:
# Take a look at the final rows in the dataframe, where one would expect no overlap with earliers questions
jeopardy[['date_clean','question_clean','answer_clean','fraction_overlap_before']].head(5)

Unnamed: 0,date_clean,question_clean,answer_clean,fraction_overlap_before
0,2004-01-31 00:12:00,for the last 8 years of his life galileo was under house arrest for espousing this man s theory,copernicus,0.0
1,2004-01-31 00:12:00,no 2 1912 olympian football star at carlisle indian school 6 mlb seasons with the reds giants braves,jim thorpe,0.0
2,2004-01-31 00:12:00,the city of yuma in this state has a record average of 4 055 hours of sunshine each year,arizona,0.0
3,2004-01-31 00:12:00,in 1963 live on the art linkletter show this company served its billionth burger,mcdonald s,0.0
4,2004-01-31 00:12:00,signer of the dec of indep framer of the constitution of mass second president of the united states,john adams,0.0


In [34]:
# Take a look at the final rows in the dataframe, where one would expect no overlap with earliers questions
jeopardy[['date_clean','question_clean','answer_clean','fraction_overlap_before']].tail(20)

Unnamed: 0,date_clean,question_clean,answer_clean,fraction_overlap_before
216910,2006-01-11 00:05:00,in his prime this athlete said it s hard to be humble when you re as great as i am,muhammad ali,1.0
216911,2006-01-11 00:05:00,it s home to the holmenkollen ski jump,oslo,1.0
216912,2006-01-11 00:05:00,we d like to enlighten you about the musical sidd it s based on this novel,siddhartha,0.5
216913,2006-01-11 00:05:00,he created the musical riddles called the enigma variations,edward elgar,1.0
216914,2006-01-11 00:05:00,one species of this bird breeds in the arctic tundra vacations at the other end of the globe,a tern,1.0
216915,2006-01-11 00:05:00,in his teens he worked in an assistant d a s office later his perry mason character made fools of d a s,erle stanley gardner,1.0
216916,2006-01-11 00:05:00,oscar wilde called this 4 letter word the curse of the drinking classes,work,1.0
216917,2006-01-11 00:05:00,guyanese capital named for a hanoverian monarch,georgetown,1.0
216918,2006-01-11 00:05:00,a naughty 18th c novel originally titled memoirs of a woman of pleasure inspired the 2006 musical named for her,fanny hill,1.0
216919,2006-01-11 00:05:00,if this riddling belgian surrealist painter born 1898 worked for jeopardy he might write this is not a clue,magritte,0.833333


In [35]:
# Take a look at the fraction-overlap in the final 100 rows
print(jeopardy.tail(100)['fraction_overlap_before'].value_counts())

1.000000    80
0.800000    5 
0.833333    3 
0.875000    3 
0.888889    2 
0.400000    1 
0.600000    1 
0.666667    1 
0.500000    1 
0.857143    1 
0.750000    1 
0.000000    1 
Name: fraction_overlap_before, dtype: int64


So it seems that almost all words have been used before.... (For what it's worth.)

In [36]:
# Calculate the mean value of this fraction-overlap 
import numpy as np
np.mean(questions_overlap)

0.9225954554223076

So what we can observe is that most longer words that are used in questions, have been used before. Certainly for the later episodes. That is not so surprising, given that we have more than 20 years worth of questions.

I am not convinced though that this piece of knowledge is going to help a lot.

Possibly a more complex analysis could help:
* rather than looking at individual words, look at combinations of words
* don't look back to all history, but only to e.g. the last one or two years if there is any overlap

These are complex analyses to do so, and I am not really convinced that it will give a lot of insight.

What we can still do is just take some frequently used terms, and look at some questions that include those terms. Do we happen to see any overlap? For this, we can use the dictionary with word counts that we created. (Which is on overview that by itself can be interesting already.)


In [37]:
# Create (and show) a dictionary with those terms that are used more than 1000 times in questions
top_terms = {k:v for (k,v) in terms_dictionary.items() if v > 1000}

for k in sorted(top_terms, key=top_terms.get, reverse=True):
    print(k, top_terms[k])


archive 12979
target 10717
_blank 10649
country 6045
called 5487
president 3294
american 3210
became 3165
played 3014
before 2920
capital 2883
french 2576
famous 2560
island 2534
people 2341
letter 2318
largest 2150
company 2133
author 2074
during 2001
national 1988
british 1976
century 1922
character 1868
little 1831
around 1823
between 1682
series 1636
family 1602
meaning 1571
founded 1546
school 1423
include 1409
million 1377
america 1350
museum 1332
university 1332
number 1331
popular 1322
musical 1322
english 1313
because 1274
second 1268
classic 1268
reports 1268
through 1245
father 1203
person 1177
george 1158
german 1151
general 1130
england 1100
leader 1095
nation 1074
created 1060
italian 1046
william 1034
former 1032


A lot of terms that are used more than 1000 times in questions (over the course of 20 years).

The ones at the very top ('archive', 'target', and '\_blank') are surprising actually.

Let's go have a look at examples with some of some frequently-used terms.



In [38]:
# Function to print examples of a term
# It takes a term and a number n as its inputs
# Then the first n questions containing the term will be printed

def print_n_examples_for_term(term, n):
    printed = 0
    row_index = 0
    while printed < n:
        row = jeopardy.iloc[row_index]
        split_question = row['question_clean'].split()
        split_question = [word for word in split_question if len(word) >5]
        # print(split_question) # commented out after verifying
        if term in split_question:
            print(row['Question'])
            printed +=1
        row_index +=1

In [39]:
# Print for 'galileo' to test (as we know the first question in the dataframe contains Galileo)
print_n_examples_for_term('galileo', 3)

For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory
The 4 largest moons of this planet are called Galilean satellites after Galileo, who saw them in 1610
Galileo was the first person to see the rings around this planet


In [40]:
# Print 5 examples for 'archive'
print_n_examples_for_term('archive', 5)

<a href="http://www.j-archive.com/media/2004-12-31_DJ_23.mp3">Beyond ovoid abandonment, beyond ovoid betrayal... you won't believe the ending when he "Hatches the Egg"</a>
The shorter glass seen <a href="http://www.j-archive.com/media/2004-12-31_DJ_12.jpg" target="_blank">here</a>, or a quaint cocktail made with sugar & bitters
<a href="http://www.j-archive.com/media/2004-12-31_DJ_26.mp3">Ripped from today's headlines, he was a turtle king gone mad; Mack was the one good turtle who'd bring him down</a>
<a href="http://www.j-archive.com/media/2004-12-31_DJ_25.mp3">Somewhere between truth & fiction lies Marco's reality... on Halloween, you won't believe you saw it on this St.</a>
<a href="http://www.j-archive.com/media/2004-12-31_DJ_24.mp3">"500 Hats"... 500 ways to die.  On July 4th, this young boy will defy a king... & become a legend</a>


In [41]:
# Print 5 examples for '_blank'
print_n_examples_for_term('_blank', 5)

The shorter glass seen <a href="http://www.j-archive.com/media/2004-12-31_DJ_12.jpg" target="_blank">here</a>, or a quaint cocktail made with sugar & bitters
Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_26.jpg" target="_blank">this</a> type of mollusk you see
Say <a href="http://www.j-archive.com/media/2010-07-06_DJ_27.jpg" target="_blank">this</a> state that was admitted to the Union in 1859
<a href="http://www.j-archive.com/media/2010-07-06_DJ_14.jpg" target="_blank">This dog breed seen here</a> is a loyal and protective companion
Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_28.jpg" target="_blank">this</a> bug; don't worry, it doesn't breathe fire


In [42]:
# Print 5 examples for '_blank'
print_n_examples_for_term('target', 5)

The shorter glass seen <a href="http://www.j-archive.com/media/2004-12-31_DJ_12.jpg" target="_blank">here</a>, or a quaint cocktail made with sugar & bitters
Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_26.jpg" target="_blank">this</a> type of mollusk you see
Say <a href="http://www.j-archive.com/media/2010-07-06_DJ_27.jpg" target="_blank">this</a> state that was admitted to the Union in 1859
<a href="http://www.j-archive.com/media/2010-07-06_DJ_14.jpg" target="_blank">This dog breed seen here</a> is a loyal and protective companion
Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_28.jpg" target="_blank">this</a> bug; don't worry, it doesn't breathe fire


In [43]:
# Print 10 examples for 'country'
print_n_examples_for_term('country', 10)

Africa's lowest temperature was 11 degrees below zero in 1935 at Ifrane, just south of Fez in this country
Cross-country skiing is sometimes referred to by these 2 letters, the same ones used to denote 90 in Roman numerals
Parts of the Arabian and Libyan deserts are found in this African country
A 7.0 magnitude earthquake in this Caribbean country Jan. 12, 2010 brought a world outpouring of aid
Andy Garcia is a native of this country whose flag is seen here
This Mediterranean country whose flag is seen here is "The Word"
Porfirio Diaz seized power in this country in 1876, ruled for 35 years, fled in 1911 & died in exile
Exiled for manslaughter, Eric the Red was forced to leave this country around 981
Moshoeshoe II was exiled twice before regaining this southern African country's throne in 1995
Under the 1814 Treaty of Kiel, this country gave Norway to Sweden but kept Greenland & other islands


In [44]:
# Print 10 examples for 'president'
print_n_examples_for_term('president', 10)

Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States
His first act after being sworn in as president of the Confederacy was to send a peace commission to Washington, D.C.
In the midst of the Korean War, this South Korean president was elected to his second of 4 terms
Its headquarters compound in Langley, Virginia is named for Former President George Bush
This president's 1972 visit to China inspired an opera that played at the Kennedy Center in 1988
This political satire starred John Travolta as a Southern governor running for president
If a president is impeached, this official presides over the trial in the Senate
On January 20, 1965 he was inaugurated as U.S. vice president
Gerald Ford was the last president born under this "crab"by sign
In 1976 this current president of France founded the Rally for the Republic Party


What we can observe:
* that the most-used terms 'archive', '\_blank' and 'target' are not really terms used in questions. Rather, many questions contain hyperlinks that include these terms
* if we check frequently used terms like 'country' and 'president', we see all different questions.

The value of this 2nd observation must be discounted though. These are the first examples for each row chronologically, and it would of course be better to check for examples that are years apart.

For what we have seen though, this does not help us that much with finding a study strategy for Jeopardy.

## 3.3 Terms used in high-value questions

Let's check if there are any particular terms that appear significantly more in high-value questions.

We will do the following, for multiple popular terms: 
* check how many times this term appears in low-value questions (<= 800) vs high-value questions (>800)
* do a chi-squared test if that ratio is realistic given expectations of overall low-value vs high-value questions

Let me explain in layman's terms how that works. Suppose that 30% of all Jeopardy questions are low-value questions and 70% are high-value questions. Now we are going to check for all questions that contains the term 'president'. Then we have an expectation that also for those, the ratio is 30% / 70%. If you find 29% / 71%: sounds still reasonable. If you find 10% / 90%: that looks suspicious though. With a chi-squared test one can calculate how likely such an observation still is under the hypothesis that the expected distribution is 30% / 70%. If that likelihood is very small (e.g. smaller than 5%) we reject the idea that the outcome is a mere coincidence, and conclude that the term 'president' is truly under- or over-represented (which one depends on the observation) in the high-value questions.

For more background about chi-squared tests, refer to the internet, there are many descriptions. Here is [one example](https://www.ling.upenn.edu/~clight/chisquared.htm).

So the null-hypothesis is "The term <.....> is not over- or under-represented in high-value questions". 

In [45]:
# Get overall numbers of low_value and high_value questions
low_value_max = 800

low_value_count = len(jeopardy[jeopardy['value_clean']<=800])
high_value_count = len(jeopardy[jeopardy['value_clean']> 800])

low_value_fraction = low_value_count / (low_value_count + high_value_count)
high_value_fraction = high_value_count / (low_value_count + high_value_count)

print(low_value_count, high_value_count, low_value_count + high_value_count, low_value_fraction, high_value_fraction)
                               

155508 61422 216930 0.7168579726178952 0.2831420273821048


In [46]:
# Create a function that gets a term and returns how frequently it occurs in low-value and high-value questions

def return_low_high (term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['question_clean'].split()
        split_question = [word for word in split_question if len(word) >5]
        # print(split_question) # commented out after verifying
        if term in split_question:
            # print(split_question, row['value_clean']) # commented out after verifying
            if row['value_clean'] <= 800:
                low_count += 1
            if row['value_clean'] > 800:
                high_count +=1
    return low_count, high_count
    

In [47]:
# Test the function on th term 'galileo'
test_term_1 = 'galileo'

a, b = return_low_high(test_term_1)
print (a,b)


32 7


In [48]:
# Test the function on the term 'president'
test_term_1 = 'president'

a, b = return_low_high(test_term_1)
print (a,b, a+b)

2297 883 3180


For 'president' one would initially expect a total of 3294 as that is the count we had for 'president' in our dictionary. One explanation can be that the term 'president' appeared more than once in some questions. For the dictionary each of them was counted. On this occassion, we count questions that contain the term 'president', regardless how many times the word is in that question.

I did not check whether this explains indeed. Rather, I conclude that the functions seems to work, so let's apply it to a couple of popular terms. 

In [49]:
# Select popular terms that are used in questions a lot
popular_terms = ['country','president','american', 'capital', 'island']

# For those terms, calculate how many times it appears in low-value and high-value questions. Story in a dictionary.
popular_terms_low_high = {}
for term in popular_terms:
    popular_terms_low_high[term] = return_low_high(term)


In [50]:
# Show the result
popular_terms_low_high

{'country': (4332, 1647),
 'president': (2297, 883),
 'american': (2115, 1053),
 'capital': (1988, 797),
 'island': (1665, 770)}

So for each of these terms we can now do a chi-squared test to figure out if these counts are significantly different than what one could expect.

Let's first do a quick round for the term 'country', then do it for all terms in a more structured way.

In [51]:
# Calculated expected values for 'country' and print them
country_expected_low = low_value_fraction * (4332 + 1647)
country_expected_high = high_value_fraction * (4332 + 1647)
print(country_expected_low, country_expected_high)

4286.093818282396 1692.9061817176046


In [52]:
# Import chisquare test from library
from scipy.stats import chisquare

# Execute chisquared test for 'country'
observed = np.array([4332, 1647])
expected = np.array([country_expected_low, country_expected_high])
chisquare_value, pvalue = chisquare(observed, expected) # returns a list

# Print result
print(chisquare_value,pvalue)

1.7365061769211498 0.18758212679269415


Now let's do this for all popular terms that we selected in the same way, and print the results in a readable way.

In [53]:
# For all selected popular terms, perform a chi-squared test and present the results in a readable format
for term in popular_terms:
    observed_low = popular_terms_low_high[term][0]
    observed_high = popular_terms_low_high[term][1]
    observed_total = observed_low + observed_high
    expected_low = low_value_fraction * (observed_total)
    expected_high = high_value_fraction * (observed_total)
    print('Term:', term)
    print('Observed (low/high):', observed_low, observed_high, "{:.1%} {:.1%}".format(observed_low/observed_total, observed_high/observed_total))
    print('Expected (low/high):', round(expected_low,1), round(expected_high,1), "{:.1%} {:.1%}".format(expected_low/observed_total, expected_high/observed_total)) 
    observed = np.array([observed_low, observed_high])
    expected = np.array([expected_low, expected_high])
    chisquare_value, pvalue = chisquare(observed, expected) # returns a list
    print('P-value:', pvalue)
    print('In words: the probability of the null-hypothesis that the term', term, 'is not over- or underrepresented in high-value questions is', pvalue)
    print('\n')
    

    

Term: country
Observed (low/high): 4332 1647 72.5% 27.5%
Expected (low/high): 4286.1 1692.9 71.7% 28.3%
P-value: 0.18758212679269415
In words: the probability of the null-hypothesis that the term country is not over- or underrepresented in high-value questions is 0.18758212679269415


Term: president
Observed (low/high): 2297 883 72.2% 27.8%
Expected (low/high): 2279.6 900.4 71.7% 28.3%
P-value: 0.49362469359700045
In words: the probability of the null-hypothesis that the term president is not over- or underrepresented in high-value questions is 0.49362469359700045


Term: american
Observed (low/high): 2115 1053 66.8% 33.2%
Expected (low/high): 2271.0 897.0 71.7% 28.3%
P-value: 7.641747011862597e-10
In words: the probability of the null-hypothesis that the term american is not over- or underrepresented in high-value questions is 7.641747011862597e-10


Term: capital
Observed (low/high): 1988 797 71.4% 28.6%
Expected (low/high): 1996.4 788.6 71.7% 28.3%
P-value: 0.722302281221693
In wor

Observations:
* For 'american' and 'island' we see a probability of almost zero that the observed numbers are a mere coincidence. Looking at the numbers, one can state that for these terms there is a significant **over**representation in high-value questions
* For 'country', 'president' and 'capital' there is some over-or under-representation as well, but there is not sufficient evidence (likelihood) that that is not by chance.

So we could give some advice now: study things that are 'american' or relate to 'islands'. It is questionable though whether this will really help.

## 4. Conclusions

We started with the question whether from the analysis of questions-and-answers from the past, we can give advice about how to prepare if you are going to participate in Jeopardy. And saw the following:
* Expecting to find the answers of questions within the question themselves is not a viable strategy. If there are words in the question that are also part of the answer at all (which doesn't happen a lot in the first place), that is not going to help.
* Almost all terms in questions (longer than 5 characters) have appeared before in questions. However, there are really a lot of 'popular' terms, and when looking at some examples this did not mean that the questions were repeated. 
* There appear to be popular terms that are (statistically significant) used relatively a lot in high-value questions.

It is very hard to defend though that any of this knowledge is going to help you prepare for Jeopardy a lot.

One could certainly analyze more, e.g.:
* Play with the functions developed in this notebook, e.g. to detect more terms that are overrepresented in high-value questions.
* Do more thorough analysis. E.g. when comparing with history, only look at the past couple of years rather than to more than 20 years. Or look at combinations of terms rather than individual ones in questions.
* Do an analysis using the 'Category' column (that we ignored so far).

That's fun doing! If your goal is to prepare for Jeopardy though, I doubt though whether this is a good investment of your time, given what we saw so far. You may better just study general knowledge instead!