## Project Jeopardy:

This project was supposed to provide a statistical analysis of the most efficient jeopardy questions to study. Unfortunately this Project was a complete disaster. In general the DataQuest projects set problems to solve and expect the student to come up with the solution. In this project the instructions just contained the solution point by point. This means the student was never challenged or forced to come up with their own solution. So unfortunately the the educational benefit was very limited. This project is only on my Github for completeness' sake. There is also the possibility that I will re-visit this project at some point.

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
jeopardy = pd.read_csv('jeopardy.csv')
pd.set_option.max_colwidth = 200

## Data Cleaning

In [2]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


The column names have spaces in them and are capitalized. In order to make them easier to work with. This will be changed

In [3]:
jeopardy.columns = jeopardy.columns.str.replace('\s', '').str.lower()
print(jeopardy.columns)

Index(['shownumber', 'airdate', 'round', 'category', 'value', 'question',
       'answer'],
      dtype='object')


In [4]:
jeopardy.head(3)

Unnamed: 0,shownumber,airdate,round,category,value,question,answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona


Next the `answer` and `question` columns will be normalized. To do this all words will be made lower case and the punctuation will be removed.  The easiest way to achieve this is to use Pandas' .str.replace and .str.lower methods.

In [5]:
jeopardy['clean_question'] = jeopardy['question'].str.lower().str.replace('[^\s\w\d]', '')
jeopardy['clean_answer'] = jeopardy['answer'].str.lower().str.replace('[^\s\w\d]', '')

In [6]:
jeopardy[['clean_question', 'clean_answer']]

Unnamed: 0,clean_question,clean_answer
0,for the last 8 years of his life galileo was u...,copernicus
1,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,the city of yuma in this state has a record av...,arizona
3,in 1963 live on the art linkletter show this c...,mcdonalds
4,signer of the dec of indep framer of the const...,john adams
...,...,...
19994,of 8 12 or 18 the number of us states that tou...,18
19995,the new power generation,prince
19996,in 1589 he was appointed professor of mathemat...,galileo
19997,before the grand jury she said im really sorry...,monica lewinsky


Next the `values` column will be normalized, to be of the int type. NaN values will be replaced with 0

In [11]:
values = jeopardy.value.str.extract('(\d+)', expand = False)
values[values.isna()] = 0
values = values.astype(int)
jeopardy['clean_values'] = values

In [12]:
jeopardy.clean_values.dtype

dtype('int32')

In [15]:
jeopardy.airdate

0        2004-12-31
1        2004-12-31
2        2004-12-31
3        2004-12-31
4        2004-12-31
            ...    
19994    2000-03-14
19995    2000-03-14
19996    2000-03-14
19997    2000-03-14
19998    2000-03-14
Name: airdate, Length: 19999, dtype: object

The `airdate` column is saved as a string. Transforming it into datetime format will make it easier to work with, because the `datetime` module contains many useful methods

In [16]:
jeopardy['airdate_clean'] = pd.to_datetime(jeopardy.airdate)

In [17]:
jeopardy.airdate_clean.dtype

dtype('<M8[ns]')

## Data Analysis

We will now find out if there are questions, that include parts of their answer in the question.

In [18]:
def reoccurence(row):
    '''Calculate the ratio of words (excluding filler words) that occur both in a question and its answer
    Args:
        row (int): index of the row to analyze
    Returns:
        float: ratio of words that occur in both the question and its answer
    '''
    split_answer = row.clean_answer.split()
    split_question = row.clean_question.split()
    match_count = 0
    for item in ['the', 'or', '&', 'a', 'of']:
            if item in split_answer:
                split_answer.remove(item)
    for item in ['the', 'or', '&', 'a', 'of']:
            if item in split_question:
                split_question.remove(item)
    if len(split_answer) == 0:
        return 0
    
    for word in split_answer: 
        if word in split_question:
            match_count += 1
    
    return(match_count / len(split_answer))
jeopardy['answer_in_question'] = jeopardy.apply(reoccurence, axis = 1)
# jeopardy.iloc[8:10].apply(reoccurence, axis = 1)

In [19]:
jeopardy.answer_in_question.mean()*100

4.108361613493449

About 4% of words in the answer appear in the corresponding question

In [20]:
jeopardy.loc[jeopardy.answer_in_question != 0, ['question', 'answer','answer_in_question']]

Unnamed: 0,question,answer,answer_in_question
14,"On June 28, 1994 the nat'l weather service beg...",the UV index,0.500000
24,This Asian political party was founded in 1885...,the Congress Party,0.500000
38,"During the 1954-1955 Sun sessions, Elvis climb...","the ""Mystery Train""",0.500000
53,"In 1961 James Brown announced ""all aboard"" for...","""Night Train""",0.500000
68,This island in the South Pacific is named for ...,Easter Island,0.500000
...,...,...,...
19951,The name of this Jamaican bay is from the Span...,Montego (Bay),0.500000
19963,"African Americans, 13% of the U.S., were nearl...",the First Gulf War,0.333333
19974,"Langdon in ""Angels & Demons"" is looking for <a...",an antimatter bomb,0.333333
19981,In 1899 Secretary of State John Hay proclaimed...,open-door policy,0.500000


This doesn't seem very helpful when studying to win jeopardy.

The next question to answer is how often subjects are re-used. To get an idea of this figure, the ratio of re-used words with at least 6 letters will be calculated:

In [21]:
question_overlap = []
terms_used = set()
jeopardy = jeopardy.sort_values('airdate_clean', ascending = False)
for i, row in jeopardy.iterrows():
    match_count = 0
    split_question = row.clean_question.split()
    split_question = [q for q in split_question if len(q) > 5]

    for item in split_question:
        if item in terms_used:
            match_count += 1
    for item in split_question:
        terms_used.add(item)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap   
terms_list = list(terms_used)
question_overlap = []
terms_used = set()    

In [22]:
print(jeopardy.question_overlap.mean())

0.6945657437055904


About 70% of all words with 6 or more letters are re-used in the jeopardy questions. This implies that studying the subjects in previous questions could be a useful technique when preparing to go on the show.

Splitting up the questions according to difficulty. In order to do this, the questions that are worth at least $800 and those that are worth less will be separated into two different datasets.

In [26]:
jeopardy.loc[jeopardy.clean_values >= 800,'high_value'] = 1
jeopardy.loc[jeopardy.clean_values < 800,'high_value'] = 0

In [27]:
jeopardy[['value', 'high_value']].head()

Unnamed: 0,value,high_value
1931,$600,0.0
1936,$800,1.0
1947,$400,0.0
1946,$400,0.0
1945,$400,0.0


Turning the questions and answers into lists of words:

In [28]:
jeopardy['split_question'] = jeopardy.clean_question.str.split()
jeopardy['split_answer'] = jeopardy.clean_answer.str.split()

In [29]:
print(jeopardy.split_question.head())
print(jeopardy.split_answer.head())

1931    [this, singer, of, jack, diane, had, to, fight...
1936    [this, chilean, city, whose, name, means, vall...
1947    [the, british, a22, mark, iv, tank, carried, a...
1946    [if, you, cant, stand, the, heat, theres, alwa...
1945    [in, december, 1974, this, former, new, york, ...
Name: split_question, dtype: object
1931    [john, mellencamp]
1936          [valparaiso]
1947           [churchill]
1946      [steak, tartare]
1945         [rockefeller]
Name: split_answer, dtype: object


Finally to get an idea over which subjects are best for someone focusing on studying for only the high-value questions, the number of occurrences will be analyzed for each group of questions. This will be done for the first five terms in `term_list` as an example.

In [32]:
comparison_terms = terms_list[:5]
print(comparison_terms)

observed_expected = []
for item in comparison_terms:
    high_count = jeopardy[jeopardy.high_value == 1].clean_question.str.contains(item).sum()
    low_count = jeopardy[jeopardy.high_value == 0].clean_question.str.contains(item).sum()
    observed_expected.append([high_count,low_count])
observed_expected

high_value_count, low_value_count = jeopardy.high_value.value_counts()[[0,1]]
print(high_value_count)
print(low_value_count)

from scipy.stats import chisquare
chi_squared = []
for entry in observed_expected:
    total = sum(entry)
    total_prob = total / jeopardy.shape[0]
    expected_high = total_prob * high_value_count
    expected_low = total_prob * low_value_count
    expected = np.array([expected_high, expected_low])
    observed = np.array(entry)
    chi_squared.append(chisquare(observed,expected))
    
chi_squared

['alpine', 'bloomer', 'prestidigitation', 'marketed', 'obamas']
12047
7952


[Power_divergenceResult(statistic=1.3361664150114714, pvalue=0.24771116174632518),
 Power_divergenceResult(statistic=0.08752306839292578, pvalue=0.7673499986093351),
 Power_divergenceResult(statistic=1.5149647887323945, pvalue=0.2183830639074686),
 Power_divergenceResult(statistic=3.029929577464789, pvalue=0.0817415642907401),
 Power_divergenceResult(statistic=0.6600813480534574, pvalue=0.4165312258269849)]

In total the chi_squared analysis showed that none of the terms seemed particularly useful to study.

# Summary

In this project, past questions used in the show Jeopardy! Were analyzed with the intent of finding statistical anomalies in the data, that could be exploited when studying while preparing to be on the show.

The analysis showed that about 70% of words with six or more letters are re-used in questions. This implies, that studying the subjects of past questions can be a useful tactic when preparing to be on the show.