# Finding ways of winning Jeopardy

In this project I look for patterns in Jeopardy questions that could improve a contestant's chances of winning. I analyze how often the answer can be inferred from the question text, how often terms from earlier questions reappear, and whether particular terms are associated with high-value questions.

In [None]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')

# Most raw column names start with a space, so rename them without it
print(jeopardy.columns)
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
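
Since the only problem with the raw names is the leading space, the rename does not have to retype every column. A minimal sketch, using a plain list as a stand-in for `jeopardy.columns` (the names `raw_columns` and `clean_columns` are only for illustration):

```python
# Column names as printed above, with their leading spaces
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category',
               ' Value', ' Question', ' Answer']

# Strip surrounding whitespace from every name
clean_columns = [name.strip() for name in raw_columns]
print(clean_columns)
```

In pandas the same idea can be written as `jeopardy.columns = jeopardy.columns.str.strip()`, which keeps working if the file gains or loses a column.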


First, let's strip the punctuation from the questions and answers and convert them to lower case.

In [None]:
import string

def normalize_text(text):
  text = text.lower()
  text = text.translate(str.maketrans('','',string.punctuation))
  return text

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
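
To see what the normalization does to a single string, here is a small standalone sketch of the same function applied to text taken from the rows above, plus one hyphenated string.

```python
import string

def normalize_text(text):
  # Lowercase, then delete every character listed in string.punctuation
  text = text.lower()
  return text.translate(str.maketrans('', '', string.punctuation))

print(normalize_text("McDonald's"))                    # mcdonalds
print(normalize_text("Signer of the Dec. of Indep."))  # signer of the dec of indep
print(normalize_text("ALL-TIME"))                      # alltime
```

Two things are worth keeping in mind: `string.punctuation` only covers ASCII punctuation, and deleting a hyphen glues the two halves of a word together instead of splitting them.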


Next, let's convert the 'Value' column to an integer without the dollar sign and the 'Air Date' column to a datetime.

In [None]:
def normalize_values(text):
  text = str(text)
  text = text.translate(str.maketrans('','',string.punctuation))
  try:
    text = int(text)
  except Exception:
    text = 0
  return text

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
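
The rows shown above all have a value of $200, so the fallback branch is not visible in the output. The sketch below runs the same logic on a few inputs; apart from '$200', the inputs are hypothetical and are not taken from the dataset.

```python
import string

def normalize_values(text):
  # Remove '$' and ',' (both are in string.punctuation), then try to parse
  text = str(text).translate(str.maketrans('', '', string.punctuation))
  try:
    return int(text)
  except ValueError:
    # Anything that is not a plain whole number falls back to 0
    return 0

for raw in ['$200', '$2,000', 'None', float('nan')]:
  print(repr(raw), '->', normalize_values(raw))
```

Because unparseable values become 0, any question without a usable value ends up in the low-value group later on.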


In [None]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.dtypes

Unnamed: 0,0
Show Number,int64
Air Date,datetime64[ns]
Round,object
Category,object
Value,object
Question,object
Answer,object
clean_question,object
clean_answer,object
clean_value,int64


To decide whether to study past questions, study general knowledge, or not study at all, it helps to answer two questions:

1. How often can the answer be deduced from the question?
2. How often are questions repeated?

Let's answer the first question by building a function that computes the fraction of terms in clean_answer that also occur in clean_question.

In [None]:
def match_count(row):
  split_answer = row['clean_answer'].split()
  split_question = row['clean_question'].split()
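  # Named `matches` so the counter does not shadow the function name
  matches = 0
  # Drop 'the', which says nothing about the answer
  if 'the' in split_answer:
    split_answer.remove('the')
  # Avoid dividing by zero when nothing is left of the answer
  if len(split_answer) == 0:
    return 0
  for item in split_answer:
    if item in split_question:
      matches += 1
  return matches/len(split_answer)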

jeopardy['answer_in_question'] = jeopardy.apply(match_count, axis=1)
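
To make the score concrete, here is a sketch of the same calculation on three hypothetical rows, written as plain dicts rather than DataFrame rows. The questions and answers are made up for illustration.

```python
def match_count(row):
  split_answer = row['clean_answer'].split()
  split_question = row['clean_question'].split()
  if 'the' in split_answer:
    split_answer.remove('the')
  if len(split_answer) == 0:
    return 0
  matches = sum(1 for item in split_answer if item in split_question)
  return matches / len(split_answer)

# Hypothetical rows: every, half, and none of the answer terms are in the question
rows = [
  {'clean_answer': 'john adams',
   'clean_question': 'john quincy adams was the son of this president'},
  {'clean_answer': 'the white house',
   'clean_question': 'this house has 132 rooms'},
  {'clean_answer': 'arizona',
   'clean_question': 'the city of yuma is in this state'},
]
scores = [match_count(row) for row in rows]
print(scores)  # [1.0, 0.5, 0.0]
```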


In [None]:
jeopardy['answer_in_question'].mean()

0.058861482035140716

On average, only about 6% of the terms in an answer also appear in its question, so we cannot expect that hearing the question will help us determine the answer. We will have to study.

Next, let's determine how often questions are recycled from older ones. We cannot do this accurately because we only have about 10% of the full dataset, but let's investigate it anyway.

We will look at the terms used in each question, and any term of six or more characters will be added to the terms_used set. This allows us to disregard short terms like 'the' or 'than'.

In [None]:
question_overlap = []
terms_used = set()

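# sort_values returns a new DataFrame, so assign it back to keep the rows in air-date order
jeopardy = jeopardy.sort_values('Air Date')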

for i, row in jeopardy.iterrows():
  split_question = row['clean_question'].split(' ')
  split_question = [q for q in split_question if len(q) > 5]
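  # Named `matches` so the match_count function defined earlier is not overwritten
  matches = 0
  for word in split_question:
    # Count the word if an earlier question already used it, then record it
    if word in terms_used:
      matches += 1
    terms_used.add(word)
  if len(split_question) > 0:
    matches /= len(split_question)
  question_overlap.append(matches)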

jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6919577992203644

About 70% of the terms in a question have already appeared in earlier questions. This is not conclusive, because it covers only about 10% of the dataset and compares single terms rather than phrases, but it does suggest that recycled questions are worth a closer look.
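
The running-overlap calculation can be traced by hand on a toy example. The sketch below wraps the same loop in a function (`question_overlaps` and `min_length` are names introduced here, not part of the notebook) and runs it on three made-up questions given in chronological order.

```python
def question_overlaps(questions, min_length=6):
  terms_used = set()
  overlaps = []
  for question in questions:
    # Keep only the longer words, as in the loop above
    words = [w for w in question.split(' ') if len(w) >= min_length]
    seen_before = 0
    for w in words:
      if w in terms_used:
        seen_before += 1
      terms_used.add(w)
    overlaps.append(seen_before / len(words) if words else 0)
  return overlaps

# Hypothetical questions, oldest first
overlaps = question_overlaps([
  'galileo studied jupiter',  # nothing seen yet -> 0.0
  'jupiter orbits slowly',    # 'jupiter' seen before -> 1/3
  'cats nap',                 # no long words -> 0
])
print(overlaps)
```

The first question always scores 0, and the score tends to rise as terms_used grows, so the mean depends on the order in which the rows are processed.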

We can also look at higher-value questions (more than 800 USD), which would let us earn more money, and use the chi-squared test to check whether certain terms appear more often in them. But first, let's flag the questions whose value is higher than 800.

In [None]:
def determine_value(row):
  if row['clean_value'] > 800:
    value = 1
  else:
    value = 0
  return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

In [None]:
def count_usage(term):
  low_count = 0
  high_count = 0
  for i, row in jeopardy.iterrows():
    if term in row['clean_question'].split(' '):
      if row['high_value'] == 1:
        high_count += 1
      else:
        low_count += 1
  return high_count, low_count
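
count_usage walks over every row once per term, which gets slow when many terms are checked. As a hedged alternative (not what this notebook does), the counts for all terms can be collected in a single pass. The function name `count_usage_all` and the sample rows below are hypothetical.

```python
from collections import defaultdict

def count_usage_all(rows):
  """Return {term: [high_count, low_count]} from one pass over the rows."""
  counts = defaultdict(lambda: [0, 0])
  for question, high_value in rows:
    # set() so a term is counted once per question, as in count_usage
    for term in set(question.split(' ')):
      counts[term][0 if high_value == 1 else 1] += 1
  return counts

# Hypothetical (clean_question, high_value) pairs
rows = [
  ('this planet has rings', 1),
  ('this planet is closest to the sun', 0),
  ('this composer wrote nine symphonies', 0),
]
counts = count_usage_all(rows)
print(counts['planet'])    # [1, 1]
print(counts['composer'])  # [0, 1]
```

A lookup such as `counts[term]` then replaces a call to `count_usage(term)`, at the cost of keeping one small list per distinct term in memory.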

Apply count_usage to 10 randomly chosen terms to get the observed counts for high- and low-value questions.

In [None]:
from random import choice

terms_used = list(terms_used)
comparison_terms = [choice(terms_used) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
  observed_expected.append(count_usage(term))

observed_expected

[(0, 1),
 (0, 1),
 (1, 0),
 (1, 1),
 (0, 1),
 (0, 1),
 (5, 2),
 (0, 1),
 (0, 1),
 (2, 12)]
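
Two properties of this sampling step are easy to miss: `choice` draws with replacement, so the same term can be picked more than once, and the selection changes on every run. A minimal sketch of a repeatable draw without duplicates, using a small made-up set in place of the real terms_used:

```python
import random

# Hypothetical stand-in for the terms_used set built earlier
terms_used = {'galileo', 'jupiter', 'olympian', 'football', 'linkletter',
              'signer', 'framer', 'carlisle', 'record', 'company',
              'average', 'athlete'}

# Sort first: the iteration order of a set of strings can change between runs
terms = sorted(terms_used)

rng = random.Random(0)                    # fixed seed so the draw repeats
comparison_terms = rng.sample(terms, 10)  # sample() never picks a term twice
print(comparison_terms)
```

The seed value 0 is arbitrary; any fixed seed makes the selection reproducible.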

Now let's apply the chi-squared test to the observed counts we got above for high- and low-value questions. For each term we compute the expected counts and then use the scipy.stats.chisquare function to get the chi-squared statistic and the p-value.

In [None]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for obs in observed_expected:
  total = sum(obs)
  total_prop = total/jeopardy.shape[0]
  high_value_exp = total_prop * high_value_count
  low_value_exp = total_prop * low_value_count

  observed = np.array([obs[0], obs[1]])
  expected = np.array([high_value_exp, low_value_exp])
  chi_squared.append(chisquare(observed, expected))


In [None]:
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=6.2575220449142, pvalue=0.012366706058156086),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.4167013079324282, pvalue=0.23394711554753567)]
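
The arithmetic behind each result can be reproduced by hand. The sketch below uses only the standard library and relies on the fact that, with two categories, the test has one degree of freedom, for which the p-value equals `erfc(sqrt(statistic / 2))`. The function name and the totals of 300 high-value and 700 low-value rows are assumptions for illustration, not the real counts from the dataset.

```python
import math

def chi_squared_two_groups(high_obs, low_obs, high_total, low_total):
  """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
  # Share of all rows that contain the term
  total_prop = (high_obs + low_obs) / (high_total + low_total)
  # Counts we would expect if the term were spread evenly over both groups
  expected = [total_prop * high_total, total_prop * low_total]
  observed = [high_obs, low_obs]
  statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
  # Survival function of the chi-squared distribution with 1 degree of freedom
  p_value = math.erfc(math.sqrt(statistic / 2))
  return statistic, p_value

# Assumed totals for illustration only
stat, p = chi_squared_two_groups(0, 1, high_total=300, low_total=700)
print(stat, p)
```

For a term seen once, in a low-value question, the expected counts are 0.3 and 0.7, so the statistic is 0.3 + 0.09/0.7, or about 0.43. With expected counts this far below 5, the p-value should be read with caution.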

# Chi-squared results

Looking at the chi-squared results, nine of the ten terms showed no significant difference in usage between high- and low-value rows. One term, with observed counts of 5 high and 2 low, had a p-value of about 0.012, which is below 0.05. However, almost all of the observed frequencies were lower than 5, so the chi-squared test is not reliable here. It would yield better results on terms with higher frequencies.