# Chi-Squared: Analyzing Jeopardy Data

In this project we use the chi-squared test to analyze data from the game show Jeopardy. The goal is to find out whether some answers are predictable: for example, how often old questions reappear, and whether an answer can be deduced from key words in the question itself.

### Cleaning the data

In [1]:
import pandas as pd
import numpy as np

jeopardy = pd.read_csv("jeopardy.csv")

In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
jeopardy.columns.str.replace(' ', '')

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [5]:
# str.replace(' ', '') also removes the spaces inside the names,
# so strip only the leading whitespace instead
jeopardy.columns = jeopardy.columns.str.strip()

In [6]:
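import re

def normalize(elem):
    # Lowercase the text and replace every non-word character with a space
    elem = elem.lower()
    elem = re.sub(r'\W', ' ', elem)
    return elem

def normalize_values(elem):
    # Remove non-word characters entirely: replacing them with spaces
    # would turn "$1,000" into " 1 000", which int() cannot parse
    elem = re.sub(r'\W', '', elem)
    try:
        elem = int(elem)
    except ValueError:
        # Values that contain no number become 0
        elem = 0
    return elem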
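
As a quick check of the cleaning step, the sketch below applies a variant of the two functions to a few strings. It is an illustration rather than the notebook's exact code: `normalize_value` is a hypothetical helper that strips every non-digit character, so that comma-separated amounts still parse, and the sample inputs are chosen only for demonstration.

```python
import re

def normalize(text):
    """Lowercase the text and replace every non-word character with a space."""
    return re.sub(r'\W', ' ', text.lower())

def normalize_value(text):
    """Hypothetical variant: drop every non-digit character, then parse.

    Returns 0 when no digits remain.
    """
    digits = re.sub(r'\D', '', text)
    return int(digits) if digits else 0

print(normalize("McDonald's"))
print(normalize_value("$200"))
print(normalize_value("$1,000"))
print(normalize_value("None"))
```

Stripping punctuation instead of replacing it with a space matters only for the value column, where a space in the middle of a number makes it unparseable.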

In [7]:
jeopardy2 = jeopardy.copy()

jeopardy2['clean_question'] = jeopardy2['Question'].apply(normalize)
jeopardy2['clean_answer'] = jeopardy2['Answer'].apply(normalize)
jeopardy2['clean_value'] = jeopardy2['Value'].apply(normalize_values)

In [8]:
jeopardy2['Air Date'] = pd.to_datetime(jeopardy2['Air Date'])
jeopardy2.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams,200


In [9]:
jeopardy2.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

### Answers deducible from questions

First we check how often the answer can be deduced from the question. To do this, we compute, for each row, the fraction of the words in the answer that also appear in the question.

In [10]:
def count_matches(elem):
    split_answer = elem['clean_answer'].split()
    split_question = elem['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0: return 0
    for i in split_answer:
        if i in split_question: match_count += 1
    return match_count / len(split_answer)

jeopardy2['answer_in_question'] = jeopardy2.apply(count_matches, axis = 1)

In [11]:
jeopardy2.head(2)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus,200,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe,200,0.0


In [12]:
jeopardy2['answer_in_question'].mean()

0.06294645581984942

The mean is about 0.06: on average, only about 6% of the words in an answer also appear in its question. This is quite low, so we cannot reliably predict answers from key words in the question alone.
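
To make the metric concrete, here is a minimal sketch of the same idea on two made-up question/answer pairs. The function name `fraction_of_answer_in_question` and the sample strings are assumptions for illustration; unlike `count_matches`, this version drops every occurrence of "the", not just the first.

```python
def fraction_of_answer_in_question(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(w in question_words for w in answer_words)
    return matches / len(answer_words)

# Made-up examples, not rows from the dataset
partial = fraction_of_answer_in_question(
    "this tower in paris was completed in 1889", "the eiffel tower")
none = fraction_of_answer_in_question(
    "this astronomer put the sun at the center", "copernicus")
print(partial, none)
```

In the first pair one of the two remaining answer words ("tower") appears in the question, so the score is 0.5; in the second pair nothing matches, which is the typical case in the dataset.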

### Repetition of older questions

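Now we check how often questions are repeated. Instead of comparing full questions, we measure how many of a question's longer words (six or more characters) already appeared in earlier questions.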

In [13]:
jeopardy2 = jeopardy2.sort_values('Air Date', ascending = True)
jeopardy2['question_overlap'] = 99
jeopardy2.head(2)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride...,theodore roosevelt,0,0.0,99
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,99


In [14]:
question_overlap = []
terms_used = set()

for i, row in jeopardy2.iterrows():
    # Keep only words of six or more characters to skip common filler words.
    # Build a new list: removing items from a list while iterating over it
    # skips elements.
    split_question = [word for word in row['clean_question'].split()
                      if len(word) >= 6]
    match_count = 0
    for word in split_question:
        # Count the word if it was already seen, then record it
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)

jeopardy2['question_overlap'] = question_overlap

In [15]:
for i in range(10):
    print(list(terms_used)[i])


ashlyn
eminem
highways
oldest
zodiac
prisoner
waianae
shortest
occult
nights


In [16]:
print(jeopardy2['question_overlap'].mean())
print(len(question_overlap))
print(jeopardy2.shape)
jeopardy2.head(2)

0.7216032437204958
19999
(19999, 12)


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride...,theodore roosevelt,0,0.0,0.0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,0.0


The mean overlap is about 72%, which looks high. However, this measures the overlap of individual words, not of whole questions: a question can reuse many long words without repeating an earlier question. So this figure is only a rough indicator of question repetition.
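
The following sketch shows why word overlap overstates repetition. It applies the same bookkeeping to three made-up questions (not taken from the dataset); `overlap_scores` and its `min_length` parameter are hypothetical names used only here.

```python
def overlap_scores(questions, min_length=6):
    """For each question, the share of its long words already seen earlier."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        matches = 0
        for word in words:
            if word in seen:
                matches += 1
            seen.add(word)
        scores.append(matches / len(words) if words else 0)
    return scores

# Three distinct, made-up questions that share vocabulary
questions = [
    "this president signed the emancipation proclamation",
    "this president appears on the penny",
    "the emancipation proclamation took effect in this year",
]
scores = overlap_scores(questions)
print(scores)
```

The second and third questions score 0.5 and about 0.67 even though neither repeats an earlier question, which is the effect described above.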

In [17]:
jeopardy2.sample(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
13333,5132,2006-12-26,Jeopardy!,"""OO"" 7-LETTER WORDS",$800,It's the scientific study of animals,zoology,it s the scientific study of animals,zoology,800,0.0,1.0
8279,1291,1990-03-26,Double Jeopardy!,THE RENAISSANCE,$400,The ruthless Cesare Borgia was the model for t...,"""The Prince""",the ruthless cesare borgia was the model for t...,the prince,400,0.0,0.0
10275,3205,1998-07-03,Double Jeopardy!,TRAVEL CANADA,$600,An enormous 30'-high nickel overlooks the town...,Ontario,an enormous 30 high nickel overlooks the town...,ontario,600,0.0,0.428571


### Chi-squared analysis

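We now use a chi-squared test to check whether certain terms appear more often in high-value questions (worth more than $800) than in low-value ones.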

In [18]:
# 1 for questions worth more than $800, 0 otherwise
jeopardy2['high_value'] = (jeopardy2['clean_value'] > 800).astype(int)

In [19]:
def count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy2.iterrows():
        sequence = row['clean_question'].split(' ')
        if word in sequence :
            if row['high_value'] == 1:
                high_count += 1
            else :
                low_count += 1
    return high_count, low_count

In [20]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for word in comparison_terms:
    observed_expected.append(count(word))
    
observed_expected

[(1, 2),
 (0, 2),
 (2, 12),
 (0, 1),
 (1, 0),
 (0, 1),
 (0, 2),
 (0, 2),
 (0, 3),
 (0, 1)]

In [21]:
high_value_count = jeopardy2[jeopardy2['high_value'] == 1]['high_value'].count()
low_value_count = jeopardy2[jeopardy2['high_value'] == 0]['high_value'].count()

In [22]:
import numpy as np
from scipy.stats import chisquare

chi_squared = []
for i in observed_expected :
    total = i[0] + i[1]
    total_prop = total / jeopardy2.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    observed = np.array([i[0], i[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))

In [23]:
chi_squared

[Power_divergenceResult(statistic=0.11526980495624546, pvalue=0.7342224981885828),
 Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913673),
 Power_divergenceResult(statistic=0.838195592262166, pvalue=0.3599133427437925),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913673),
 Power_divergenceResult(statistic=0.661742197378053, pvalue=0.4159455550913673),
 Power_divergenceResult(statistic=0.9926132960670793, pvalue=0.31910449982424866),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378)]
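
To see what each result contains, the sketch below works through one two-cell test by hand, using only the standard library. The observed pair (2, 12) is one of the pairs above, but the row totals `n_high=5000` and `n_low=15000` are round numbers assumed for illustration, not the notebook's actual counts, and `chi_squared_two_cells` is a hypothetical helper.

```python
import math

def chi_squared_two_cells(observed_high, observed_low, n_high, n_low):
    """Chi-squared statistic and p-value for one term split into high/low rows."""
    total_rows = n_high + n_low
    # Share of all rows that contain the term
    term_share = (observed_high + observed_low) / total_rows
    # Counts expected if the term were spread evenly over both groups
    expected_high = term_share * n_high
    expected_low = term_share * n_low
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # Two cells give one degree of freedom, for which the chi-squared
    # survival function reduces to erfc(sqrt(statistic / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Assumed totals: 5000 high-value and 15000 low-value rows
statistic, p_value = chi_squared_two_cells(2, 12, n_high=5000, n_low=15000)
print(round(statistic, 4), round(p_value, 4))
```

With these assumed totals the expected counts are 3.5 and 10.5, so the statistic is 2.25/3.5 + 2.25/10.5, roughly 0.857, which is far too small to indicate a real difference.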

None of the p-values is below 0.05, so the results show no significant difference in usage between high-value and low-value questions. Most of the sampled terms also appear fewer than 5 times, and the chi-squared test is unreliable at such low frequencies. It would be better to run this test only on terms with higher frequencies.
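
One possible way to select such terms is sketched below. It keeps only terms whose expected count reaches a common rule-of-thumb minimum of 5 in both cells. The function `frequent_terms`, its parameters, and the toy questions are assumptions for illustration; `high_share` stands for the fraction of rows that are high value.

```python
from collections import Counter

def frequent_terms(clean_questions, high_share, min_expected=5, min_length=6):
    """Terms whose expected count is at least min_expected in both cells."""
    counts = Counter()
    for question in clean_questions:
        # Count each term once per question, as count() tallies rows
        counts.update({w for w in question.split() if len(w) >= min_length})
    # The smaller group determines the smaller expected count
    smaller_share = min(high_share, 1 - high_share)
    return [term for term, n in counts.items()
            if n * smaller_share >= min_expected]

# Toy data: "president" occurs in 30 questions, "zodiac" in only 3
toy_questions = ["this president was first"] * 30 + ["this zodiac sign"] * 3
selected = frequent_terms(toy_questions, high_share=0.25)
print(selected)
```

With a quarter of the rows being high value, "president" has an expected high-value count of 7.5 and is kept, while "zodiac" has 0.75 and is dropped. Sampling `comparison_terms` from such a filtered list would give the test more reliable inputs.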