# Guided Project: Winning Jeopardy
In this project, we look through 20,000 questions from the game show Jeopardy to see whether any patterns could give a contestant an edge over their competitors. The dataset is the first 20,000 rows of the collection posted [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/?st=jjb25rd7&sh=23fa3130).

# Importing data 

In [1]:
import pandas as pd 
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


# Some cleaning
Some of the column names have a leading space, which is annoying to deal with, so it is beneficial to remove it now.

In [2]:
new_columns = []
for x in jeopardy.columns:
    x = x.lower()
    x = x.replace(' ', '')
    new_columns.append(x)
    
jeopardy.columns = new_columns
print(jeopardy.columns)

Index(['shownumber', 'airdate', 'round', 'category', 'value', 'question',
       'answer'],
      dtype='object')
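
The same cleanup can also be written with pandas' vectorized string methods. The sketch below is an alternative, not the approach used in the cell above; `demo` is a made-up frame whose headers only mimic the leading-space problem.

```python
import pandas as pd

# Hypothetical frame whose headers mimic the raw file (note the leading spaces).
demo = pd.DataFrame(columns=["Show Number", " Air Date", " Round", " Value"])

# Lowercase every header and drop all spaces, matching what the loop does.
demo.columns = demo.columns.str.lower().str.replace(" ", "", regex=False)

print(list(demo.columns))  # ['shownumber', 'airdate', 'round', 'value']
```

Passing `regex=False` makes the replacement a plain substring substitution, so the space is not interpreted as a pattern.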


Before analysis, we should normalize the Question and Answer columns. This involves changing all the text to lower case and removing punctuation. Doing so keeps the text consistent, so that words like "Don't" and "don't" are not treated as different.

In [3]:
import re
def normalize_text(x):
    lowercase = x.lower()
    cleantext = re.sub(r"[^A-Za-z0-9\s]", "", lowercase)
    return cleantext

jeopardy['clean_question'] = jeopardy['question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['answer'].apply(normalize_text)
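
To see what the normalization does, here is a small standalone sketch that applies the same lowercase-and-strip-punctuation rule to a few made-up strings (the `samples` list is for illustration only).

```python
import re

def normalize_text(text):
    # Lowercase, then drop every character that is not a letter, digit, or whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", "", text.lower())

samples = ["Don't", "don't", "McDonald's"]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)  # ['dont', 'dont', 'mcdonalds']
```

Note that the apostrophe is removed rather than replaced with a space, so "don't" becomes the single token "dont".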


The Value and Air Date columns are both strings in the data. We should change these to integers and datetimes respectively. The Value column also contains 'None' for rows without a value; this should become 0 so that we can analyze the data more easily.

In [4]:
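def onlyinteger(x):
    # Drop "$" and "," so that a string like "$1,000" becomes "1000".
    nopunct = re.sub(r"[^A-Za-z0-9\s]", "", x)
    try:
        return int(nopunct)
    except ValueError:
        # Non-numeric entries such as 'None' count as 0.
        return 0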
    
jeopardy['clean_value'] = jeopardy['value'].apply(onlyinteger)

In [5]:
jeopardy['airdate'] = pd.to_datetime(jeopardy['airdate'])
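
As an alternative to the row-by-row function, the conversion could be done with vectorized pandas calls. This is only a sketch on made-up values in the style of the Value and Air Date columns, and it assumes missing values are stored as the string 'None'.

```python
import pandas as pd

# Made-up values in the same style as the Value column.
values = pd.Series(["$200", "$1,000", "None"])

# Strip "$" and ",", coerce non-numeric strings to NaN, then fill with 0.
digits = values.str.replace(r"[$,]", "", regex=True)
clean = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)
print(clean.tolist())  # [200, 1000, 0]

# to_datetime parses ISO-style date strings into Timestamp values.
dates = pd.to_datetime(pd.Series(["2004-12-31"]))
print(dates.dt.year.tolist())  # [2004]
```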

# What makes a question
We should look through the questions to see whether there are any patterns, such as whether answers are deducible from the question or whether questions repeat earlier ones. This could help us decide whether to study previous questions, general knowledge, or specific areas.

In [6]:
def jeopardy_ansinq(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    match_count = 0 
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count/len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(jeopardy_ansinq, axis=1)
print(jeopardy['answer_in_question'].mean())

0.06049325706933587
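
To make the metric concrete, the sketch below computes the same kind of fraction on two made-up question/answer pairs. It is a simplified standalone variant: it takes two strings instead of a row, and it drops every "the" from the answer rather than only the first one.

```python
def answer_in_question(clean_question, clean_answer):
    # Fraction of answer words (ignoring "the") that also occur in the question.
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    return sum(w in question_words for w in answer_words) / len(answer_words)

# Made-up pairs for illustration only.
print(answer_in_question("this john was the second us president", "john adams"))  # 0.5
print(answer_in_question("the city of yuma is in this state", "arizona"))  # 0.0
```

A score of 0.5 means half of the answer's words appear in the question; the column mean is the average of these per-row fractions.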


Based on the analysis above, on average only about 6% of the words in an answer also appear in its question. This means we cannot rely on getting the answer from the question, and we have to find another method of studying. Perhaps questions are repeated often enough on Jeopardy that we can study past questions.

In [7]:
question_overlap = []
terms_used = set() 

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [x for x in split_question if len(x) >= 6]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap    
print(jeopardy['question_overlap'].mean())

0.6908737315671962
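
The overlap score can be illustrated on a tiny example. The sketch below wraps the same idea in a hypothetical helper, `overlap_scores`, and runs it on three made-up questions; only words of six or more characters are counted, as in the cell above.

```python
def overlap_scores(clean_questions, min_length=6):
    # For each question, the share of its long words already seen in earlier questions.
    seen = set()
    scores = []
    for question in clean_questions:
        words = [w for w in question.split() if len(w) >= min_length]
        matches = sum(w in seen for w in words)
        scores.append(matches / len(words) if words else 0)
        seen.update(words)
    return scores

# Made-up questions for illustration only.
questions = [
    "this president signed the declaration",
    "the declaration was signed in this city",
    "this planet is closest to the sun",
]
print(overlap_scores(questions))  # [0.0, 1.0, 0.0]
```

The first question always scores 0 because nothing has been seen yet, so the score depends on row order as well as on vocabulary.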


On average, about 70% of the longer terms (six or more characters) in a question have already appeared in an earlier question. However, we should keep in mind that the data we are looking at is just a sample of a bigger dataset. Also, the cell above only looks at single terms, not phrases, so this result is weak evidence on its own. It is still something worth looking into.

In [8]:
def highlow(x):
    value = 0
    if x['clean_value'] > 800:
        value = 1
    return value 
    
jeopardy['high_value'] = jeopardy.apply(highlow, axis=1)

In [9]:
def count_usage(x):
    low_count = 0
    high_count = 0 
    for index, row in jeopardy.iterrows():
        if x in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1 
            else:
                low_count += 1
    return high_count, low_count

observed_expected = []
comparison_terms = list(terms_used)[:5]

for x in comparison_terms:
    result = count_usage(x)
    observed_expected.append(result)
    
print(observed_expected)

[(0, 1), (2, 10), (0, 2), (0, 1), (0, 1)]


In [10]:
import numpy as np
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]
chi_squared = []

for x in observed_expected:
    total = sum(x)
    total_prop = total / len(jeopardy)
    expectedhigh = total_prop * high_value_count
    expectedlow = total_prop * low_value_count
    
    observed = np.array([x[0], x[1]])
    expected = np.array([expectedhigh, expectedlow])
    s, p = chisquare(observed, expected)
    chi_squared.append((s, p))

chi_squared

[(0.401962846126884, 0.5260772985705469),
 (0.8456210901225915, 0.35779406898197064),
 (0.803925692253768, 0.3699222378079571),
 (0.401962846126884, 0.5260772985705469),
 (0.401962846126884, 0.5260772985705469)]

None of the p-values is below 0.05, so we find no significant difference in term usage between high-value and low-value rows. However, the chi-squared test is not reliable here because the frequencies are so low (lower than 5). It would be more informative to run this test on terms with higher frequencies.
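
One way to act on the last point is to skip any term whose expected counts are too small before running the test. The sketch below is a minimal illustration under stated assumptions: `term_chisquare` and its `min_expected=5` threshold are hypothetical names chosen here, and the counts are made up rather than taken from the dataset.

```python
import numpy as np
from scipy.stats import chisquare

def term_chisquare(high_count, low_count, high_rows, low_rows, min_expected=5):
    """Return (statistic, p-value), or None when an expected count is below min_expected."""
    total = high_count + low_count
    share = total / (high_rows + low_rows)
    expected = np.array([share * high_rows, share * low_rows])
    if expected.min() < min_expected:
        return None  # too few occurrences for the test to be trustworthy
    observed = np.array([high_count, low_count])
    return chisquare(observed, expected)

# Made-up counts for illustration only.
rare = term_chisquare(0, 1, high_rows=300, low_rows=700)
common = term_chisquare(40, 60, high_rows=300, low_rows=700)
print(rare)    # None
print(common)  # statistic is 100/21, roughly 4.76
```

For the `common` case the expected counts are 30 and 70, so the statistic is 10²/30 + 10²/70 = 100/21.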