# Jeopardy Questions

In [3]:
# Read dataset into a dataframe and display first five rows
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
# Show details of all columns 
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
# Some column names have leading spaces; strip them, then remove the remaining internal spaces
jeopardy.columns = jeopardy.columns.str.lstrip()
jeopardy.columns = jeopardy.columns.str.replace(' ', '')
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

# Normalizing Text

The goal is to lowercase the text and remove punctuation so that variants such as "Don't" and "don't" are treated as the same word when they are compared.

In [6]:
# Define a function that normalizes text by lowercasing it and removing punctuation
from string import punctuation

def text_normalizer(text):
    lower_text = text.lower()
    no_punct = ""
    for character in lower_text:
        # Keep only the characters that are not punctuation
        if character not in punctuation:
            no_punct = no_punct + character
    return no_punct
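
The same normalization can also be written without an explicit loop. The following is a minimal sketch using `str.translate`; the names `normalize_text_fast` and `_PUNCT_TABLE` are hypothetical and do not appear in the notebook.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', punctuation)

def normalize_text_fast(text):
    """Lowercase text and drop ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(_PUNCT_TABLE)

print(normalize_text_fast("Don't"))           # dont
print(normalize_text_fast("Crate & Barrel"))  # crate  barrel (two spaces remain)
```

Note that removing a standalone symbol such as `&` leaves two consecutive spaces behind, which matters later when the text is split on single spaces.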


In [7]:
# Apply the normalizer function above to normalize texts in 'Question' column
jeopardy['clean_question'] = jeopardy['Question'].apply(text_normalizer)
jeopardy['clean_question']

0        for the last 8 years of his life galileo was u...
1        no 2 1912 olympian football star at carlisle i...
2        the city of yuma in this state has a record av...
3        in 1963 live on the art linkletter show this c...
4        signer of the dec of indep framer of the const...
5        in the title of an aesop fable this insect sha...
6        built in 312 bc to link rome  the south of ita...
7        no 8 30 steals for the birmingham barons 2306 ...
8        in the winter of 197172 a record 1122 inches o...
9        this housewares store was named for the packag...
10                                          and away we go
11       cows regurgitate this from the first stomach t...
12       in 1000 rajaraja i of the cholas battled to ta...
13       no 1 lettered in hoops football  lacrosse at s...
14       on june 28 1994 the natl weather service began...
15       this companys accutron watch introduced in 196...
16       outlaw murdered by a traitor and a coward whos.

In [8]:
# Also apply the normalizer to "Answer" column
jeopardy['clean_answer'] = jeopardy['Answer'].apply(text_normalizer)
jeopardy['clean_answer']

0                                               copernicus
1                                               jim thorpe
2                                                  arizona
3                                                mcdonalds
4                                               john adams
5                                                  the ant
6                                           the appian way
7                                           michael jordan
8                                               washington
9                                            crate  barrel
10                                          jackie gleason
11                                                 the cud
12                                     ceylon or sri lanka
13                                               jim brown
14                                            the uv index
15                                                  bulova
16                                             jesse jam

# Normalizing Columns

Now we need to normalize the "Value" and "Air Date" columns.

In [9]:
def dollar_normalizer(text):
    # Strip punctuation such as '$' and ',' so that '$2,000' becomes '2000'
    no_punct = ""
    for character in text:
        if character not in punctuation:
            no_punct = no_punct + character
    # Text that cannot be parsed as an integer becomes 0
    try:
        int_no_punct = int(no_punct)
    except ValueError:
        int_no_punct = 0
    return int_no_punct
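
The cells in this notebook convert only the "Value" column. For the "Air Date" column, a minimal sketch of the conversion could look like the following, assuming every date string follows the `YYYY-MM-DD` format seen in the first rows. The helper name `parse_air_date` is hypothetical.

```python
from datetime import datetime

def parse_air_date(text):
    """Parse a 'YYYY-MM-DD' string into a datetime (hypothetical helper)."""
    return datetime.strptime(text.strip(), '%Y-%m-%d')

air_date = parse_air_date('2004-12-31')
print(air_date.year, air_date.month, air_date.day)  # 2004 12 31
```

On the dataframe itself, `pd.to_datetime(jeopardy['AirDate'])` would be the usual pandas way to do the same conversion for the whole column.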


In [10]:
# Normalize the "Value" column
jeopardy['clean_value'] = jeopardy['Value'].apply(dollar_normalizer)
jeopardy['clean_value']

0         200
1         200
2         200
3         200
4         200
5         200
6         400
7         400
8         400
9         400
10        400
11        400
12        600
13        600
14        600
15        600
16        600
17        600
18        800
19        800
20        800
21        800
22       2000
23        800
24       1000
25       1000
26       1000
27       1000
28       1000
29        400
         ... 
19969    1200
19970    1200
19971    1500
19972    1200
19973    1200
19974    1200
19975    1600
19976    1600
19977    1600
19978    1600
19979    1600
19980    1600
19981    1200
19982    2000
19983    2000
19984    2000
19985    2000
19986    2000
19987       0
19988     100
19989     100
19990     100
19991     100
19992     100
19993     100
19994     200
19995     200
19996     200
19997     200
19998     200
Name: clean_value, Length: 19999, dtype: int64

# Answers Appearing in Questions

The goal is to work out whether studying past questions, studying general knowledge, or not studying at all is the best way to prepare.

To decide, we need to figure out:
1) How often the answer can be deduced from the question
2) How often new questions are repeats of older questions

We start with the first question by checking how often the words of the answer occur in the question.

In [11]:
# Define a function that calculates the fraction of answer words that also appear in the question
def match_perc(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for answer in split_answer:
        if answer in split_question:
            match_count = match_count + 1
    result = match_count / len(split_answer)
    return result
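
To see what this fraction looks like on a single row, here is a small standalone sketch of the same logic applied to plain strings. The function name `answer_word_fraction` and the sample strings are made up for illustration.

```python
def answer_word_fraction(clean_answer, clean_question):
    """Fraction of answer words (ignoring one 'the') found in the question."""
    answer_words = clean_answer.split(' ')
    question_words = clean_question.split(' ')
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Made-up strings that are already normalized
print(answer_word_fraction('the grand canyon', 'a river dug this canyon'))  # 0.5

# Splitting on a single space keeps empty tokens when two spaces are adjacent
print('crate  barrel'.split(' '))  # ['crate', '', 'barrel']
```

The second print shows a side effect of `split(' ')`: an answer such as `crate  barrel` yields three tokens, one of them empty, which affects the denominator of the fraction. Calling `split()` with no argument would avoid the empty token, at the cost of changing the results shown in this notebook.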
  

In [12]:
# apply the above function to every row in the dataframe
jeopardy['answer_in_question'] = jeopardy.apply(match_perc, axis=1)
jeopardy['answer_in_question']

0        0.000000
1        0.000000
2        0.000000
3        0.000000
4        0.000000
5        0.000000
6        0.000000
7        0.000000
8        0.000000
9        0.333333
10       0.000000
11       0.000000
12       0.000000
13       0.000000
14       0.500000
15       0.000000
16       0.000000
17       0.000000
18       0.000000
19       0.000000
20       0.000000
21       0.000000
22       0.000000
23       0.000000
24       0.500000
25       0.000000
26       0.000000
27       0.000000
28       0.000000
29       0.000000
           ...   
19969    0.000000
19970    0.000000
19971    0.000000
19972    0.000000
19973    0.000000
19974    0.333333
19975    0.000000
19976    0.000000
19977    0.000000
19978    0.000000
19979    0.000000
19980    0.500000
19981    0.500000
19982    0.000000
19983    0.000000
19984    0.000000
19985    0.000000
19986    0.000000
19987    0.000000
19988    0.000000
19989    0.000000
19990    0.000000
19991    0.000000
19992    0.000000
19993    0

In [13]:
jeopardy['answer_in_question'].mean()

0.06035277385469894

From the analysis above, on average only about 6% of the words in an answer also appear in its question, so the answer can rarely be deduced from the question alone.

# Recycled Questions

Now we want to figure out how often new questions are actually repeats of older ones. 

In [14]:
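# Sort the dataframe by air date.
# Note: sort_values returns a new DataFrame; the result is displayed here but not assigned back,
# so jeopardy itself keeps its original row order.
jeopardy.sort_values(by=['AirDate'])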

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.000000
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.000000
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.000000
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this ...,the grand canyon,200,0.000000
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.000000
19305,10,1984-09-21,Double Jeopardy!,HOMONYMS,$200,Hindu hierarchy or a play's actors,a caste (cast),hindu hierarchy or a plays actors,a caste cast,200,0.333333
19306,10,1984-09-21,Double Jeopardy!,TV TRIVIA,$200,"Last season, this series mourned the loss of S...",Hill Street Blues,last season this series mourned the loss of sg...,hill street blues,200,0.000000
19307,10,1984-09-21,Double Jeopardy!,1789,$400,Why April 28th was a bad day for Capt. Bligh,the day of the mutiny on the Bounty,why april 28th was a bad day for capt bligh,the day of the mutiny on the bounty,400,0.142857
19308,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$400,Seaside resort that has a monopoly on East Coa...,"Atlantic City, New Jersey",seaside resort that has a monopoly on east coa...,atlantic city new jersey,400,0.000000
19309,10,1984-09-21,Double Jeopardy!,LITERATURE,$400,"He wrote ""The 3 Musketeers""; his son wrote ""Ca...",(Alexandre) Dumas,he wrote the 3 musketeers his son wrote camille,alexandre dumas,400,0.000000


In [15]:
# Measure how often the words in each question have already appeared in earlier rows
question_overlap = []
terms_used = set()
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    # Keep only words of six or more characters to skip common short words
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6902117143393507

From the analysis above, on average about 69% of the longer terms (six or more characters) in a question had already appeared in an earlier row. This looks only at single words, not at phrases or whole sentences, but it suggests that recycled questions are worth investigating further.
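
The overlap measure is easier to follow on a tiny example. The sketch below applies the same idea to a short list of made-up, already-normalized questions; the function name `overlap_per_question` and the `min_length` parameter are hypothetical.

```python
def overlap_per_question(clean_questions, min_length=6):
    """For each question, the share of its long words seen in earlier questions."""
    terms_seen = set()
    overlaps = []
    for question in clean_questions:
        words = [w for w in question.split(' ') if len(w) >= min_length]
        if words:
            seen_before = sum(1 for w in words if w in terms_seen)
            overlaps.append(seen_before / len(words))
        else:
            # No long words at all: record zero overlap
            overlaps.append(0)
        terms_seen.update(words)
    return overlaps

sample = [
    'which famous painter created guernica',
    'this famous spanish painter was born in malaga',
    'and away we go',
]
print(overlap_per_question(sample))  # [0.0, 0.5, 0]
```

The first question has nothing before it, so its overlap is 0. In the second, two of the four long words (`famous`, `painter`) were already seen, giving 0.5. The result depends on the order of the rows, which is why the order of the dataframe matters for this measure.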

In [16]:
terms_used

{'arranges',
 'freudenberger',
 'antarctic',
 'convict',
 'hrefhttpwwwjarchivecommedia20080408dj18ajpg',
 'seferovic',
 'stempel',
 'defeated',
 'islahiye',
 'pillar',
 'bathtub',
 'machine',
 'inspires',
 'ismaili',
 'gourds',
 'amphibians',
 'oriskany',
 'ashkenazic',
 'marching',
 'eaters',
 'fuller',
 'schedule',
 'abilene',
 'yoshino',
 'lionsgate',
 'targetblanktalksa',
 'krakatoa',
 'referring',
 'hrefhttpwwwjarchivecommedia20060307dj10wmv',
 'pioneering',
 'okanagan',
 'injuring',
 'schizophrenic',
 'surveyor',
 'piperita',
 'beggars',
 'bounce',
 'macnicol',
 'dictates',
 'amorous',
 'reestablished',
 'assures',
 'dominica',
 '188085',
 'hrefhttpwwwjarchivecommedia20091229j04wmvjimmy',
 'gutenberg',
 'kalmar',
 'obyrne',
 'commerce',
 'fredrics',
 'marlow',
 'sauerbraten',
 'kindly',
 'fellowsoldiers',
 'reflects',
 'wholesale',
 'conningham',
 'postmarks',
 'mccurtain',
 'yesterday',
 'relative',
 'durability',
 'swoopes',
 'critics',
 'japana',
 'defile',
 'predetermined',
 

# Low Value vs. High Value Questions

We want to focus on high-value questions, since they translate into more money. To do that, we need to know which words occur in high-value questions and which occur in low-value questions.

In [17]:
# create a function to determine if the question is high value or low value
def value_decider(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value


In [18]:
# Apply the function to all rows in the dataframe to see whether each row represents high value or not
jeopardy['high_value'] = jeopardy.apply(value_decider, axis=1)
jeopardy['high_value'] 

0        0
1        0
2        0
3        0
4        0
5        0
6        0
7        0
8        0
9        0
10       0
11       0
12       0
13       0
14       0
15       0
16       0
17       0
18       0
19       0
20       0
21       0
22       1
23       0
24       1
25       1
26       1
27       1
28       1
29       0
        ..
19969    1
19970    1
19971    1
19972    1
19973    1
19974    1
19975    1
19976    1
19977    1
19978    1
19979    1
19980    1
19981    1
19982    1
19983    1
19984    1
19985    1
19986    1
19987    0
19988    0
19989    0
19990    0
19991    0
19992    0
19993    0
19994    0
19995    0
19996    0
19997    0
19998    0
Name: high_value, Length: 19999, dtype: int64

In [19]:
# Define a function that counts how many high-value and low-value questions contain a given word
def word_value_count(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
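
Because `word_value_count` walks the whole dataframe once per word, it becomes slow when many terms are checked. A possible alternative, sketched here on plain Python data, counts all terms of interest in a single pass. The function name `count_terms_by_value` and the sample rows are hypothetical.

```python
from collections import defaultdict

def count_terms_by_value(rows, terms):
    """Count, per term, the high-value and low-value questions containing it.

    rows: iterable of (clean_question, high_value) pairs
    terms: the words to count
    """
    terms = set(terms)
    counts = defaultdict(lambda: [0, 0])  # term -> [high_count, low_count]
    for clean_question, high_value in rows:
        # Use a set so each question is counted at most once per term
        for word in set(clean_question.split(' ')) & terms:
            if high_value == 1:
                counts[word][0] += 1
            else:
                counts[word][1] += 1
    return {term: tuple(counts[term]) for term in terms}

rows = [
    ('the antarctic explorer', 1),
    ('an antarctic convict', 0),
    ('a convict escaped', 0),
]
result = count_terms_by_value(rows, ['antarctic', 'convict', 'machine'])
print(result['antarctic'], result['convict'], result['machine'])  # (1, 1) (0, 2) (0, 0)
```

With the dataframe from this notebook, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`.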

In [20]:
observed_expected = []
comparison_terms = list(terms_used)[0:5]
comparison_terms

['arranges',
 'freudenberger',
 'antarctic',
 'convict',
 'hrefhttpwwwjarchivecommedia20080408dj18ajpg']

In [21]:
for term in comparison_terms:
    observed_expected.append(word_value_count(term))
observed_expected

[(0, 1), (0, 1), (1, 3), (1, 3), (0, 1)]

# Applying the Chi-squared Test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [22]:
# Find the number of rows of "high value"
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
high_value_count

5734

In [23]:
# Find the number of rows of "low value"
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
low_value_count

14265

In [25]:
import numpy as np
from scipy.stats import chisquare

chi_squared = []
for item in observed_expected:
    # item is (high_count, low_count) for one term
    total = sum(item)
    total_prop = total / jeopardy.shape[0]
    # Expected counts if the term were spread over high- and low-value rows in proportion
    high_count_expected = total_prop * high_value_count
    low_count_expected = total_prop * low_value_count
    observed = np.array([item[0], item[1]])
    expected = np.array([high_count_expected, low_count_expected])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
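
These statistics can be checked by hand. The sketch below recomputes the chi-squared statistic in plain Python for the observed pairs `(0, 1)` and `(1, 3)`, using the row counts reported in this notebook (5734 high-value and 14265 low-value rows out of 19999). The helper names are hypothetical, and only the statistic is computed, not the p-value.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def statistic_for_term(observed, high_rows=5734, low_rows=14265):
    """Chi-squared statistic for one term's (high_count, low_count) pair."""
    total_rows = high_rows + low_rows
    term_total = sum(observed)
    # Expected counts if the term were spread in proportion to the row counts
    expected = (term_total / total_rows * high_rows,
                term_total / total_rows * low_rows)
    return chi_squared_statistic(observed, expected)

print(round(statistic_for_term((0, 1)), 4))  # 0.402
print(round(statistic_for_term((1, 3)), 4))  # 0.0264
```

For a term that appears once, in a low-value question, the expected counts are roughly 0.29 and 0.71, which is why the statistic is so small.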

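From the chi-squared analysis, the statistic ranges from 0.03 to 0.40, with p-values ranging from 0.53 to 0.87. The null hypothesis is that a term is used, in proportion, equally often in high-value and low-value questions. Given the low chi-squared statistics and high p-values, we fail to reject the null hypothesis for any of these five terms. Each of these terms appears in only one to four questions, so the expected counts are very small and the chi-squared test is not reliable at these frequencies; terms with higher counts would give a more meaningful test.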