# Jeopardy game
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. 
Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

 - `Show Number` - the Jeopardy episode number
 - `Air Date` - the date the episode aired
 - `Round` - the round of Jeopardy
 - `Category` - the category of the question
 - `Value` - the number of dollars the correct answer is worth
 - `Question` - the text of the question
 - `Answer` - the text of the answer

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

jeopardy = pd.read_csv('JEOPARDY_CSV.csv')
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1    Air Date    216930 non-null  object
 2    Round       216930 non-null  object
 3    Category    216930 non-null  object
 4    Value       216930 non-null  object
 5    Question    216930 non-null  object
 6    Answer      216928 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


# Data cleaning and formatting

In [3]:
jeopardy = jeopardy.rename(str.lstrip, axis='columns')

Before you can start doing analysis on the Jeopardy questions, you need to normalize all of the text columns (the Question and Answer columns). We covered normalization before, but the idea is to ensure that you put words in lowercase and remove punctuation so Don't and don't aren't considered to be different words when you compare them.

In [4]:
import re
def normalize_text(string):
    """returns a string with no punctuation and in lowercase"""
    string = re.sub('\W', ' ', string)
    string = re.sub('\s+', ' ', string)
    string = string.lower()
    return string


In [5]:
jeopardy['Answer'] = jeopardy['Answer'].fillna('None')

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

Now that you've normalized the text columns, there are also some other columns to normalize.

The `Value` column should be numeric, to allow you to manipulate it easier. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `Air Date` column should also be a datetime, not a string, to enable you to work it easier.

In [7]:
# clean values
jeopardy['Value'] = jeopardy['Value'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).str.replace('None', '0', regex=False)
# convert to integer
jeopardy['Value'] = pd.to_numeric(jeopardy['Value'])

In [8]:
# convert to datetime format
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [9]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   Show Number     216930 non-null  int64         
 1   Air Date        216930 non-null  datetime64[ns]
 2   Round           216930 non-null  object        
 3   Category        216930 non-null  object        
 4   Value           216930 non-null  int64         
 5   Question        216930 non-null  object        
 6   Answer          216930 non-null  object        
 7   clean_question  216930 non-null  object        
 8   clean_answer    216930 non-null  object        
dtypes: datetime64[ns](1), int64(2), object(6)
memory usage: 14.9+ MB


# Explore the questions

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

 - How often the answer can be used for a question.
 - How often questions are repeated.
 
You can answer the second question by seeing how often complex words (> 6 characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question and come back to the second.

In [19]:
def answer_in_question(row):

    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count/len(split_answer)
    

In [20]:
jeopardy['answer_in_question'] = jeopardy.apply(answer_in_question, axis=1)

In [21]:
jeopardy['answer_in_question'].mean()

0.06141460672046272

# Recycled questions
On average, the answer only makes up for about 6% of the question. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.

Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.

In [58]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        split_question = [q for q in split_question if len(q) > 5]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count += 1
        for word in split_question:
            terms_used.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
        question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.8987531410003741

There is a significant overlap in terms across the questions in jeopardy. Does it mean that the questions repeat themselves?

## Low Value vs High Value questions
Let's say you only want to study questions that pertain to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

 - Low value -- Any row where Value is less than 800.
 - High value -- Any row where Value is greater than 800.

In [59]:
def sort_values(row):
    if row['Value'] > 800:
        value=1
    else:
        value=0
    return value

In [60]:
jeopardy['high_value'] = jeopardy.apply(sort_values, axis=1)

In [61]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [62]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(3, 1),
 (0, 5),
 (0, 2),
 (1, 2),
 (10, 28),
 (0, 1),
 (1, 0),
 (2, 3),
 (1, 1),
 (0, 1)]

Now that you've found the observed counts for a few terms, you can compute the expected counts and the chi-squared value.

In [63]:
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for e in observed:
    total = sum(e)
    total_prop = total/jeopardy.shape[0]
    high_value_exp = total_prop*high_value_count
    low_value_exp = total_prop*low_value_count
    obs = np.array([e[0], e[1]])
    exp = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(obs, exp))
print(chi_squared)

[Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751), Power_divergenceResult(statistic=13.80662515952375, pvalue=0.00020262047598010479), Power_divergenceResult(statistic=0.14893637555628556, pvalue=0.699553874986618), Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751), Power_divergenceResult(statistic=5.063592849467617, pvalue=0.02443353405878706), Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695), Power_divergenceResult(statistic=1.042554628893999, pvalue=0.3072280762642473), Power_divergenceResult(statistic=0.07446818777814278, pvalue=0.7849388502668134), Power_divergenceResult(statistic=2.0361210587719096, pvalue=0.15360089742564473), Power_divergenceResult(statistic=0.03723409388907139, pvalue=0.846989214486915)]


That's it for the guided steps! We recommend you explore the data more.

Here are some potential next steps:

Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
Manually create a list of words to remove, like the, than, etc.
Find a list of stopwords to remove.
Remove words that occur in more than a certain percentage (like 5%) of questions.
Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
Use the apply method to make the code that calculates frequencies more efficient.
Only select terms that have high frequencies across the dataset, and ignore the others.
Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
See which categories appear the most often.
Find the probability of each category appearing in each round.
Use the whole Jeopardy dataset (available here) instead of the subset we used in this mission.
Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.