# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. In this project, I analyze a dataset of past Jeopardy questions and answers from [reddit](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file) and apply a chi-squared test to look for the best way to study for the show.

## Check the data

In [1]:
import pandas as pd

#read dataset
jeopardy = pd.read_csv('jeopardy.csv')

#print out the first 5 rows
jeopardy.head()




Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
#print columns of jeopardy
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [3]:
#remove leading spaces from the column names
jeopardy.columns = jeopardy.columns.str.lstrip()

#check the columns
print(jeopardy.columns)


Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


## Normalizing text

In [4]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [5]:
#function to lowercase a string and replace punctuation with spaces
import re
def normalize_text(strings):
    #raw string so that \W is passed to the regex engine unchanged
    strings = re.sub(r'\W',' ', strings)
    strings = strings.lower()
    return strings


#normalize the question column
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)

#normalize the answer column
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

In [6]:
#check the data
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams
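Because every punctuation character is replaced by a space, the cleaned strings can contain runs of two or more spaces. The short sketch below, which uses a made-up sample string rather than a row from the dataset, shows how `split(' ')` and `split()` treat such runs differently; the variable names are only for illustration.

```python
import re

def normalize_text(text):
    # Same idea as the notebook's function: punctuation -> space, then lowercase.
    return re.sub(r"\W", " ", text).lower()

# Made-up sample, not taken from the dataset.
sample = "Signer of the Dec. of Indep., framer"
cleaned = normalize_text(sample)

# split(' ') keeps an empty string wherever two spaces are adjacent.
tokens_with_gaps = cleaned.split(" ")

# split() with no argument splits on any whitespace run and drops empty tokens.
tokens = cleaned.split()

print(tokens_with_gaps)
print(tokens)
```

Empty-string tokens matter for any later step that compares word lists, since `''` in one list will match `''` in another.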


## Normalizing value


In [7]:
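#function to normalize dollar values
def normalize_value(values):
    #replace non-word characters such as '$' with a space
    values = re.sub(r'\W',' ',values)
    try:
        values = int(values)
    except Exception:
        #anything int() cannot parse becomes 0; a value with a separator,
        #e.g. '$1,000', turns into ' 1 000' and also ends up here
        values = 0

    return values

#create the clean_value column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)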

#convert air date column to datetime type
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
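
One caveat with replacing punctuation by a space: a value written with a thousands separator, such as `$1,000`, becomes ` 1 000`, which `int()` rejects, so it falls back to 0. Whether this matters depends on how the `Value` column is formatted, which the preview above does not show. A minimal sketch of a stricter parser, assuming we simply want every digit in the string; `parse_dollar_value` is a hypothetical name, not part of the notebook:

```python
import re

def parse_dollar_value(value):
    # Hypothetical helper: keep only the digits, so "$1,000" -> 1000.
    digits = re.sub(r"\D", "", str(value))
    # Strings without any digit (for example "None") are mapped to 0.
    return int(digits) if digits else 0

print(parse_dollar_value("$200"))
print(parse_dollar_value("$1,000"))
print(parse_dollar_value("None"))
```

Swapping this in would change `clean_value` for any affected rows, so the numbers reported in the later sections would need to be recomputed.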
    

## Answers in questions

In [8]:
#function to compute the share of answer words that appear in the question
def count_matches(row):
    #split clean answer and clean question into words
    split_answer = row['clean_answer'].split(' ')
    split_questions = row['clean_question'].split(' ')
    
    match_count = 0
    
    #remove the first 'the', since it is too common to be informative
    if 'the' in split_answer:
        split_answer.remove('the')
        
    #return 0 if no answer words are left
    if len(split_answer) == 0 :
        return 0
    
    #count the answer words that also occur in the question
    for word in split_answer:
        if word in split_questions:
            match_count += 1
            
    #divide the match count by the number of answer words
    result = match_count/len(split_answer)
    
    return result

#compute the share of answer words found in each question
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)

#mean share across all questions
jeopardy['answer_in_question'].mean()

0.09565366087691443

On average, only about 10% of the words in an answer also appear in its question (mean 0.096). Hearing the question therefore rarely gives the answer away, so we should think about another strategy for studying.
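
To make the score concrete, here is a small worked example on a made-up question and answer pair. It is a sketch, not the notebook's function: `match_fraction` is a hypothetical name, it tokenizes with `split()` and it drops every `the` rather than only the first one.

```python
def match_fraction(answer, question):
    # Tokenize on any whitespace run; ignore "the" in the answer.
    answer_words = [w for w in answer.split() if w != "the"]
    question_words = set(question.split())
    if not answer_words:
        return 0
    # Share of the remaining answer words that also occur in the question.
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# Made-up example: "new" and "york" appear in the question, "times" does not.
score = match_fraction("the new york times",
                       "this new york paper is known as the gray lady")
print(score)
```

Here two of the three remaining answer words are found in the question, so the score is 2/3. A mean near 0.1 over the whole dataset means that such overlaps are the exception.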

## Recycled questions

In [9]:
#create an empty list for the scores and an empty set for the words seen so far
question_overlap = []
terms_used = set()

#sort the questions by air date
jeopardy = jeopardy.sort_values('Air Date')

#loop over the rows
for i, row in jeopardy.iterrows():
    
    #split the question and keep only words longer than 6 characters
    split_question = row['clean_question'].split(' ')
    split_question = [x for x in split_question if len(x) > 6]
    
    #count the words that already appeared in earlier questions
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
          
    #add every word of this question to terms_used
    for word in split_question:
        terms_used.add(word)
        
    #convert the count to a share; it stays 0 when the question has no long words
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    #append match_count to question_overlap
    question_overlap.append(match_count)
    
#create new column
jeopardy['question_overlap'] = question_overlap

#calculate mean
jeopardy['question_overlap'].mean()

0.6516398560953363

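On average, about 65% of the long words (more than 6 characters) in a question had already appeared in an earlier question. This measures the overlap of single words rather than of whole questions, but it suggests that studying past questions can improve the chance of winning.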
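
The overlap score can be traced by hand on a toy list of questions. The sketch below follows the same steps as the loop above but works on plain strings; `overlap_scores` and the three sample questions are made up for illustration.

```python
def overlap_scores(questions, min_len=7):
    # Words seen in earlier questions (the role of terms_used).
    seen = set()
    scores = []
    for question in questions:
        # Keep only long words, i.e. more than 6 characters.
        words = [w for w in question.split() if len(w) >= min_len]
        hits = sum(1 for w in words if w in seen)
        # Share of long words already seen; 0 when there are no long words.
        scores.append(hits / len(words) if words else 0)
        seen.update(words)
    return scores

# Made-up questions, assumed to be already sorted by air date.
toy_questions = [
    "this british explorer charted australia",
    "the capital city of australia",
    "a famous british painter",
]
scores = overlap_scores(toy_questions)
print(scores)
```

The first question always scores 0 because nothing has been seen yet. The second and third each reuse one of their two long words, so both score 0.5 even though the questions themselves are new.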

## Low value vs high value questions


In [15]:
#function to classify a question as high value (more than $800)
def value_questions(row):
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
   
#apply the function
jeopardy['high_value'] = jeopardy.apply(value_questions, axis=1)

In [17]:
#function to count the high- and low-value questions that contain a word
def high_and_low(word):
    low_count = 0
    high_count = 0
    
    #loop over the rows
    for i, row in jeopardy.iterrows():
        
        #split the question into words
        split_question = row['clean_question'].split(' ')
        
        #count the question once if it contains the word
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
             
    #return the counts as (high, low)
    return high_count, low_count
            

In [22]:
#count high/low occurrences for five words from terms_used (a set, so which five is arbitrary)
observed_expected = []
comparison_terms = list(terms_used)[:5]

for word in comparison_terms:
    comparison = high_and_low(word)
    observed_expected.append(comparison)

print(comparison_terms)
observed_expected

['browning', 'subordinates', 'princeps', 'dissident', 'spirits']


[(0, 1), (0, 1), (0, 1), (1, 0), (0, 7)]
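
Calling `high_and_low` once per word scans the whole dataset each time, which becomes slow when many words are tested. A possible alternative, sketched here on plain tuples instead of the dataframe, is to collect the counts for all words in a single pass. `count_by_value` and the toy rows are hypothetical and only illustrate the idea.

```python
from collections import Counter

def count_by_value(rows, terms):
    # rows: iterable of (clean_question, is_high_value) pairs.
    terms = set(terms)
    high, low = Counter(), Counter()
    for question, is_high in rows:
        # Terms present in this question; each question counts once per term.
        present = terms.intersection(question.split())
        (high if is_high else low).update(present)
    # Same (high_count, low_count) layout as high_and_low returns.
    return {term: (high[term], low[term]) for term in terms}

# Made-up rows for illustration.
toy_rows = [
    ("famous british explorer", True),
    ("british capital city", False),
    ("capital of france", False),
]
counts = count_by_value(toy_rows, ["british", "capital", "zebra"])
print(counts)
```

With the dataframe, the pairs could be built from the `clean_question` and `high_value` columns, for example with `zip`.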

## Applying the chi-squared test

In [31]:
from scipy.stats import chisquare
import numpy as np

#count the rows in the high and low value classes
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

#create empty list
chi_squared = []

#loop over the (high, low) counts in observed_expected
for row in observed_expected:
    total = sum(row)
    
    #share of all questions that contain the word
    total_prop = total / jeopardy.shape[0]
    
    #expected high and low counts if the word were unrelated to value
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    #apply the chi-squared test
    observed = np.array([row[0], row[1]])
    expected = np.array([high_value_exp, low_value_exp])
    
    chi_squared.append(chisquare(observed, expected))

In [32]:
#check the data
chi_squared

[Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=2.3160976908231845, pvalue=0.12804088400667396)]

## Results

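None of the five words shows a statistically significant difference between high- and low-value questions: every p-value is above 0.05. The most likely reason is the low counts, since four of the five words occur only once in the dataset and the chi-squared test is unreliable with so few observations. The next step is to repeat the test using only words with high frequency.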
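
One way to restrict the test to frequent words is to skip any word whose expected count in either class is too small. The sketch below is only an illustration: it uses the common rule of thumb of an expected count of at least 5, computes the two-category chi-squared statistic by hand, and gets the p-value for one degree of freedom from `math.erfc`. The function names, the class sizes and the word counts are all made up.

```python
import math

def chi_squared_2cat(observed, expected):
    # Pearson statistic: sum of (observed - expected)^2 / expected.
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def chi_squared_for_frequent_terms(counts, n_high, n_low, min_expected=5):
    total_rows = n_high + n_low
    results = {}
    for term, (high, low) in counts.items():
        # Share of all questions containing the term, as in the loop above.
        share = (high + low) / total_rows
        expected = (share * n_high, share * n_low)
        # Skip rare terms, where the test is not reliable.
        if min(expected) < min_expected:
            continue
        results[term] = chi_squared_2cat((high, low), expected)
    return results

# Made-up numbers: 250 high-value and 750 low-value questions.
toy_counts = {"rare": (0, 1), "common": (20, 20)}
results = chi_squared_for_frequent_terms(toy_counts, n_high=250, n_low=750)
print(results)
```

In this toy case `rare` is skipped, while `common` has expected counts of 10 and 30 against observed counts of 20 and 20, which gives a statistic of about 13.3 and a p-value well below 0.05.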