# Finding Earning Opportunities - Analysing Patterns to Win Jeopardy 

# Introduction 

The purpose of this project is to look for a way to win Jeopardy. Jeopardy is a popular TV show based in the US where players answer questions to win money. We'll be working with a Jeopardy dataset to discover useful patterns in the questions that can help us win. More specifically, we'll try to answer the following questions:

- "How often can an answer be used for a question?"
- "How often are questions repeated?"
- "Are high-value questions more useful than low-value questions?"

Overall, we found little evidence of a statistically significant relationship among the variables we investigated. Here are the answers we found to the questions above:

- On average, only about 5.8% of the words in an answer also appear in its question, so we cannot expect to work out answers from the wording of the question itself.
- About 87% of the longer terms (six or more characters) in a question have already appeared in earlier questions. We're only looking at a small set of questions, and at single terms rather than phrases, but this suggests that repeated questions are worth investigating further.
- For nine of the ten randomly sampled terms, no statistically significant difference was found between high-value and low-value rows; one term did show a significant difference (p ≈ 1.2e-06). For those nine terms at least one expected frequency is lower than 5, so the chi-squared test is not reliable for them.

We therefore concluded that further analysis is needed to find more relevant correlations.

# Reading in the Data

In [20]:
import numpy as np
import pandas as pd
import random
import re
from scipy.stats import chisquare 

In [2]:
jeopardy = pd.read_csv('~/Desktop/my_projects/data/JEOPARDY_CSV.csv')

In [3]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
# Removing unnecessary white spaces
jeopardy.columns = jeopardy.columns.str.strip()

In [6]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [7]:
jeopardy.dtypes

Show Number     int64
Air Date       object
Round          object
Category       object
Value          object
Question       object
Answer         object
dtype: object

# Normalizing Text

In [8]:
# Defining a function that normalizes text
def normalize_text(text):
    text = str(text).lower() # converting the string to lowercase
    normalized_text = re.sub(r"[^\w\s]", "", text) # removing all punctuation: the regex drops every character that is not a word character or whitespace
    return normalized_text

# Applying the function to the Question and Answer columns
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


# Normalizing Columns

In [9]:
# Writing a function that normalizes columns
def normalize_column(string):
    normalized_string = re.sub(r"[^\w\s]", "", str(string)) # removing any punctuation in the string (such as "$" and ",")
    try:
        int_string = int(normalized_string) # converting the string into an integer
    except ValueError:
        int_string = 0 # assigning 0 if the string is not a valid number
    return int_string

# Applying the function to the Value column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_column)

# converting the Air Date column from a string to a datetime column 
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
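
As a quick check of the value-cleaning step, the sketch below applies the same idea to a few hand-written strings. The helper name `parse_value` and the sample inputs are assumptions made for illustration; they are not taken from the dataset.

```python
import re

def parse_value(raw):
    """Strip punctuation from a dollar string and return an int (0 if not numeric)."""
    digits = re.sub(r"[^\w\s]", "", str(raw))  # "$2,000" -> "2000"
    try:
        return int(digits)
    except ValueError:
        return 0  # non-numeric placeholder, e.g. "None"

# Hypothetical inputs covering a plain value, a value with a comma, and a non-number
samples = ["$200", "$2,000", "None"]
parsed = [parse_value(s) for s in samples]
print(parsed)  # [200, 2000, 0]
```

Mapping non-numeric values to 0 means such rows end up in the low-value group later on, which may or may not be what we want.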


# Answers in Questions

To decide what to study for Jeopardy, it would be useful to know how often the answer can be deduced from the question itself. This addresses our first question: "How often can an answer be used for a question?". We can estimate this by measuring how many of the words in each answer also occur in the corresponding question.

In [10]:
# Writing a function that counts the number of word matches between questions and answers
def match_count(row):
    split_answer = row['clean_answer'].split() # turning each answer into a list of words 
    split_question = row['clean_question'].split() # turning each question into a list of words
    matches = 0 # named differently from the function to avoid shadowing it
    if "the" in split_answer:
        split_answer.remove("the") # removing the first 'the' since it doesn't have any meaningful use in finding the answer
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            matches += 1
    return matches / len(split_answer) # share of the answer's words that also appear in the question

# Applying the function to the clean_question and clean_answer columns 
jeopardy['answer_in_question'] = jeopardy.apply(match_count, axis=1)

# Finding the mean of the answer_in_question column
jeopardy['answer_in_question'].mean()

0.05792070323661354

On average, only about 5.8% of the words in an answer also appear in its question. This is too low for us to rely on working out answers from the wording of the question itself. Therefore, it is probably more effective for us to study for Jeopardy.
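
To make the metric concrete, here is a small self-contained sketch on two made-up rows. The clue texts and the helper name `answer_word_fraction` are assumptions for illustration, and unlike the function above it drops every occurrence of "the" rather than only the first. It shows that the figure is a per-row fraction of answer words found in the question, averaged over rows, rather than the share of questions that contain their full answer.

```python
def answer_word_fraction(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    question_words = set(clean_question.split())
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# Made-up (question, answer) pairs, already lowercased and stripped of punctuation
rows = [
    ("this john was the second us president", "john adams"),  # 1 of 2 words match
    ("the city of yuma is in this state", "arizona"),         # 0 of 1 words match
]
fractions = [answer_word_fraction(q, a) for q, a in rows]
mean_fraction = sum(fractions) / len(fractions)
print(fractions)      # [0.5, 0.0]
print(mean_fraction)  # 0.25
```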

# Recycled Questions

Let's try to answer the second question: "How often are questions repeated?". This information might help us win the game. We cannot answer the question completely, since we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

In [11]:
# Checking if the terms in questions have been used previously or not
question_overlap = [] # keeping an initially empty list for the overlap of each question
terms_used = set() # maintaining an initially empty set for the terms used so far
jeopardy = jeopardy.sort_values(by=['Air Date']) # sorting the dataset in order of ascending air date

for i, row in jeopardy.iterrows(): # using iterrows() to loop through each row of jeopardy
    split_question = row['clean_question'].split(" ") # splitting the question into words on single spaces
    split_question = [q for q in split_question if len(q) > 5] # keeping only words of six or more characters, which filters out words like 'the' and 'than'
    overlap_count = 0 # not named match_count, so the function defined earlier is not overwritten
    for word in split_question: # looping through each word in split_question
        if word in terms_used:
            overlap_count += 1 # incrementing the count if the term already occurs in terms_used
    for word in split_question:
        terms_used.add(word) # using the add() method to add each word of split_question to terms_used
    if len(split_question) > 0:
        overlap_count /= len(split_question)
    question_overlap.append(overlap_count) # appending the share of previously seen terms to the question_overlap list
        
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.8721766377741468

On average, about 87% of the terms (of six or more characters) in a question have already appeared in earlier questions. Here, we're only looking at a small set of questions, and we're specifically looking at single terms rather than whole phrases, so this finding is only weakly informative. However, it suggests that repeated questions are worth investigating further.
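
One possible way to look at phrases rather than single terms is to repeat the same overlap calculation on consecutive word pairs. The sketch below is only an illustration under that assumption: the helper names `ngrams` and `phrase_overlap` and the toy questions are hypothetical, and the questions are assumed to be already cleaned and sorted by air date.

```python
def ngrams(words, n):
    """Return the list of consecutive n-word tuples in `words`."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def phrase_overlap(questions, n=2):
    """For each question, the share of its n-grams already seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        grams = ngrams(question.split(), n)
        if grams:
            overlaps.append(sum(g in seen for g in grams) / len(grams))
        else:
            overlaps.append(0.0)  # question too short to contain an n-gram
        seen.update(grams)
    return overlaps

# Made-up questions in chronological order
toy_questions = [
    "this state borders canada",
    "this state borders mexico",
]
toy_overlaps = phrase_overlap(toy_questions)
print(toy_overlaps)  # the second question reuses 2 of its 3 word pairs
```

On the real data this could be called as `phrase_overlap(jeopardy['clean_question'])` after sorting by `Air Date`; a much lower average than the single-term figure would suggest that the 87% mostly reflects common vocabulary.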

# Low Value vs High Value Questions

The third and last question is: "Are high-value questions more useful than low-value questions?". By studying only high-value questions, we might be able to optimise our effort and earn more money when playing Jeopardy. We can find the words with the biggest differences in usage between high-value and low-value questions by selecting the words with the highest associated chi-squared values. However, doing this for all of the words would be extremely time-consuming, so we will perform this analysis only on a small sample for now.

In [12]:
# Narrowing down the questions into two categories
def determine_value(row):
    if row['clean_value'] > 800:
        return 1
    else:
        return 0

# Adding a high_value column to determine which questions are high and low value
jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

In [13]:
# Counting the number of high-value and low-value questions each word occurs in
def value_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows(): # iterating through each row in the dataset
        if word in row['clean_question'].split(" "): # splitting the question into a list of words and checking whether the word is in it
            if row['high_value'] == 1: # checking if the question has a high value
                high_count += 1 # incrementing high_count if the word appears in a high-value question
            else:
                low_count += 1 # otherwise incrementing low_count
    return high_count, low_count # returning the number of high-value and low-value questions that contain the word
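
The function above rescans the whole dataset for every term, which is why testing all words would be slow. A single pass that counts every word at once might look like the sketch below. This is an alternative under stated assumptions, not the approach used in this project: the name `count_terms_by_value` and the toy inputs are hypothetical.

```python
from collections import defaultdict

def count_terms_by_value(questions, high_flags):
    """Map each word to [high_count, low_count] in one pass over the rows."""
    counts = defaultdict(lambda: [0, 0])
    for question, is_high in zip(questions, high_flags):
        for word in set(question.split(" ")):  # set(): count each question once per word
            counts[word][0 if is_high else 1] += 1
    return counts

# Made-up cleaned questions and their high_value flags
toy_questions = ["famous painter of guernica", "painter born in spain", "capital of spain"]
toy_high = [1, 0, 0]
toy_counts = count_terms_by_value(toy_questions, toy_high)
print(toy_counts["painter"])  # [1, 1]
print(toy_counts["spain"])    # [0, 2]
```

With the real data the call would presumably be `count_terms_by_value(jeopardy['clean_question'], jeopardy['high_value'])`, after which looking up any term is a dictionary access.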

In [None]:
# Applying the value_count function to a random selection of words
comparison_terms = random.sample(list(terms_used), 10) # randomly picking ten elements from terms_used (converted to a list, since random.sample no longer accepts sets)
observed_expected = [] # initiating an empty list for the observed (high, low) counts

for word in comparison_terms:
    v = value_count(word) # running the function on the term to get its high and low value counts
    observed_expected.append(v)


In [19]:
observed_expected

[(0, 1),
 (1, 3),
 (2, 9),
 (0, 1),
 (1, 1),
 (1, 0),
 (1, 0),
 (1, 3),
 (0, 1),
 (67, 77)]

# Applying the Chi-squared Test

Now that we've computed the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
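
For reference, the quantities computed for each term can be written out as follows. The symbols are introduced here for illustration: $O_{\text{high}}$ and $O_{\text{low}}$ are the observed counts for the term, $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high-value and low-value rows, and $N = N_{\text{high}} + N_{\text{low}}$ is the total number of rows.

```latex
E_{\text{high}} = (O_{\text{high}} + O_{\text{low}})\,\frac{N_{\text{high}}}{N},
\qquad
E_{\text{low}} = (O_{\text{high}} + O_{\text{low}})\,\frac{N_{\text{low}}}{N},
\qquad
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}}
       + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
```

With two categories, the p-value is taken from the chi-squared distribution with one degree of freedom.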

In [27]:
# Computing the expected counts, the chi-squared value, and the p-value
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
chisquared = []

for obs in observed_expected:
    total = sum(obs) # adding up both items in the list (high and low counts) to get the total count 
    total_prop = total / jeopardy.shape[0] # calculating the proportion of the total count across the dataset 
    high_value_exp = total_prop * high_value_count # computing the expected term count for high value rows
    low_value_exp = total_prop * low_value_count # computing the expected term count for low value rows
    
    observed = np.array([obs[0], obs[1]]) 
    expected = np.array([high_value_exp, low_value_exp])
    chisquared.append(chisquare(observed, expected)) # computing the chi-squared value and p-value given the expected and observed counts
    
chisquared

[Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.021646150708492677, pvalue=0.8830323245068887),
 Power_divergenceResult(statistic=0.5563890274396994, pvalue=0.45571882813430864),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.46338644448358013, pvalue=0.49604555208958945),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=0.021646150708492677, pvalue=0.8830323245068887),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=23.535066715761165, pvalue=1.2265772157679437e-06)]

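As observed above, nine of the ten sampled terms show no statistically significant difference between high-value and low-value rows (all of their p-values are above `0.1`). The last term, with observed counts `(67, 77)`, does show a significant difference (p ≈ 1.2e-06): it appears in high-value questions more often than expected. For the other nine terms, at least one expected frequency is lower than `5`, so the chi-squared test is not reliable for them. It might be better to run this test only with terms that have higher frequencies.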
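
Restricting the test to terms with higher frequencies could be done by checking the expected counts before calling `chisquare`. The sketch below assumes the common rule of thumb that every expected count should be at least 5. The helper names and the row totals `HIGH_ROWS` and `LOW_ROWS` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def expected_counts(observed, high_total, low_total):
    """Expected (high, low) counts if the term were spread evenly across all rows."""
    total_rows = high_total + low_total
    term_total = sum(observed)
    return (term_total * high_total / total_rows,
            term_total * low_total / total_rows)

def has_enough_data(observed, high_total, low_total, min_expected=5):
    """True when both expected counts reach the rule-of-thumb minimum."""
    return min(expected_counts(observed, high_total, low_total)) >= min_expected

# Hypothetical row totals, not the real dataset sizes
HIGH_ROWS, LOW_ROWS = 300, 700
toy_observed = [(0, 1), (2, 9), (67, 77)]
testable = [obs for obs in toy_observed if has_enough_data(obs, HIGH_ROWS, LOW_ROWS)]
print(testable)  # [(67, 77)]
```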

# Conclusion

In this project, we looked for a way to win Jeopardy by analysing a Jeopardy dataset. We tried to discover useful patterns by answering the following questions:

- "How often can an answer be used for a question?"
- "How often are questions repeated?"
- "Are high-value questions more useful to win than low-value questions?"

Overall, we found little evidence of a statistically significant relationship among the variables we investigated. Here are the answers we found to the questions above:

- On average, only about 5.8% of the words in an answer also appear in its question, so we cannot expect to work out answers from the wording of the question itself.
- About 87% of the longer terms (six or more characters) in a question have already appeared in earlier questions. We're only looking at a small set of questions, and at single terms rather than phrases, but this suggests that repeated questions are worth investigating further.
- For nine of the ten randomly sampled terms, no statistically significant difference was found between high-value and low-value rows; one term did show a significant difference (p ≈ 1.2e-06). For those nine terms at least one expected frequency is lower than 5, so the chi-squared test is not reliable for them.

Thus, further analysis is needed to find more significant correlations. For example, we could find a better way to eliminate non-informative words than just removing words that are shorter than `6` characters. This could be achieved by manually creating a list of words to remove (e.g., `the`, `than`, etc.), or by removing words that occur in more than a certain percentage (e.g., `5%`) of questions.
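
The percentage-based idea could be sketched as follows. This is a minimal illustration, assuming cleaned question strings as input; the function name `frequent_words`, the `max_share` parameter, and the toy questions are hypothetical.

```python
from collections import Counter

def frequent_words(questions, max_share=0.05):
    """Words that appear in more than `max_share` of the questions."""
    doc_freq = Counter()
    for question in questions:
        doc_freq.update(set(question.split()))  # count each question once per word
    limit = max_share * len(questions)
    return {word for word, n in doc_freq.items() if n > limit}

# Made-up cleaned questions; a high threshold is used because the sample is tiny
toy_questions = [
    "this river flows through egypt",
    "this author wrote hamlet",
    "this planet has rings",
    "capital of france",
]
stop_words = frequent_words(toy_questions, max_share=0.5)
print(stop_words)  # {'this'}
```

The resulting set could then replace the `len(q) > 5` filter, so that short but informative words are kept while frequent filler words are dropped.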