# Winning Jeopardy



## Introduction

### Preliminary

This notebook concludes the **Hypothesis Testing: Fundamentals** course from [dataquest.io](dataquest.io). It is a guided project that applies the techniques and skills learnt during the course. We will work with a data set containing 20,000 questions from the popular TV show **Jeopardy**. The data set is called `jeopardy.csv` and is an extract from an original data set uploaded in a [reddit](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/) post.

#### Blockquotes usage
> I sometimes use blockquotes like this one to quote material provided by Dataquest. For simplicity and clarity, I judged that these passages did not need rewording and could be used as they are.

### Context


>Jeopardy is a popular TV show in the US where participants answer questions to win money.  
>"*Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.*"

The dataset is named `jeopardy.csv` and contains roughly 20,000 rows.  
Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:
- Show Number - the Jeopardy episode number
- Air Date - the date the episode aired
- Round - the round of Jeopardy
- Category - the category of the question
- Value - the number of dollars the correct answer is worth
- Question - the text of the question
- Answer - the text of the answer


## Reading and cleaning the Data

In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")

In [2]:
jeopardy.shape

(19999, 7)

In [3]:
jeopardy.head(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel


In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
# removing the spaces from the columns names
jeopardy.columns=jeopardy.columns.str.replace(" ",'')

In [6]:
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [7]:
jeopardy.dtypes

ShowNumber     int64
AirDate       object
Round         object
Category      object
Value         object
Question      object
Answer        object
dtype: object

As can be seen above, every column except *ShowNumber* (integer) has the object type. This makes sense, as they all contain text. Nevertheless, the *AirDate* column could be converted to a date (or datetime) type. The *Value* column gives the value of a question in the format "\$200", so we could remove the "\$" sign and convert the column to integers.

### Normalizing Text (Question and Answer columns)

In [8]:
import re
def normalize(string):
    string = string.lower()
    string = re.sub(r'[^\w\s]','',string)
    return string
    

In [9]:
jeopardy["clean_question"]=jeopardy["Question"].apply(normalize)
jeopardy["clean_answer"]=jeopardy["Answer"].apply(normalize)
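To make the effect of `normalize` concrete, here is a minimal sketch that applies the same logic to two answers from the table shown earlier. It also illustrates one side effect worth knowing: removing a punctuation mark that sits between two spaces leaves a double space, so splitting on `' '` produces an empty token, while `split()` without arguments does not. The variable names are illustrative.

```python
import re

def normalize(string):
    # Same logic as the notebook's normalize: lowercase, then drop punctuation
    string = string.lower()
    return re.sub(r'[^\w\s]', '', string)

# Answers taken from the head of the table shown earlier
print(normalize("McDonald's"))       # mcdonalds
print(normalize("Crate & Barrel"))   # crate  barrel (two spaces remain)

# Splitting on a single space keeps an empty token; split() drops it
tokens_single_space = normalize("Crate & Barrel").split(' ')
tokens_whitespace = normalize("Crate & Barrel").split()
print(tokens_single_space)           # ['crate', '', 'barrel']
print(tokens_whitespace)             # ['crate', 'barrel']
```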

### Normalizing Columns (Value & AirDate)

As said before, we are going to:
- remove the $ sign from the Value column and convert the string to an integer
- convert the AirDate column from a string to a date so it's easier to manipulate

In [10]:
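def dol_to_int(string):
    # Strip punctuation such as "$" and ",", then convert to an integer.
    # Strings that cannot be parsed as an integer fall back to 0.
    try:
        string = re.sub(r'[^\w\s]','',string)
        return int(string)
    except ValueError:
        return 0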


In [11]:
jeopardy["clean_value"]=jeopardy["Value"].apply(dol_to_int)
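As a quick check of what `dol_to_int` does with different inputs, the sketch below runs the same logic on a few sample strings. The sample values are illustrative, and treating `"None"` as the marker for a missing value is an assumption, not something shown in the output above.

```python
import re

def dol_to_int(string):
    # Same logic as the notebook cell: strip punctuation, then parse as int
    try:
        return int(re.sub(r'[^\w\s]', '', string))
    except ValueError:
        return 0

# Hypothetical sample values; "None" as a missing-value marker is an assumption
samples = ["$200", "$2,000", "None"]
converted = [dol_to_int(s) for s in samples]
print(converted)  # [200, 2000, 0]
```

Note that unparseable values become 0, so they end up in the same group as genuinely low values in any later comparison.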

In [12]:
jeopardy["AirDate"] = pd.to_datetime(jeopardy["AirDate"])

## Answers in questions

In [13]:

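def answer_in_q(row):
    # Fraction of the answer's words that also appear in the question
    split_answer = row["clean_answer"].split(' ')
    split_question = row["clean_question"].split(' ')
    match_count = 0
    # "the" is too common to be informative (only the first occurrence is removed)
    if "the" in split_answer:
        split_answer.remove("the")
    # Avoid dividing by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0
    for e in split_answer:
        if e in split_question:
            match_count +=1
    return match_count / len(split_answer)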

    
    

In [14]:
jeopardy["answer_in_question"]=jeopardy.apply(answer_in_q, axis=1)

In [15]:
jeopardy["answer_in_question"].mean()

0.060493257069335914

So on average, only about 6% of the words in an answer also appear in its question. We cannot rely on the question to give away the answer, so studying general knowledge remains necessary.
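The metric is easier to read on a couple of concrete rows. The sketch below applies the same logic as `answer_in_q` to plain dictionaries; both rows are hypothetical, already-normalized examples and are not taken from `jeopardy.csv`.

```python
def answer_in_q(row):
    # Same logic as the notebook cell, applied to a plain dict instead of a DataFrame row
    split_answer = row["clean_answer"].split(' ')
    split_question = row["clean_question"].split(' ')
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for e in split_answer if e in split_question)
    return match_count / len(split_answer)

# Hypothetical, already-normalized rows (not taken from jeopardy.csv)
no_match = {"clean_answer": "the ant",
            "clean_question": "this insect shares a fable title with a grasshopper"}
half_match = {"clean_answer": "john adams",
              "clean_question": "this adams was the second us president"}

print(answer_in_q(no_match))    # 0.0
print(answer_in_q(half_match))  # 0.5
```

A mean of about 0.06 over the whole data set therefore means that most rows look like the first example.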

## Recycled questions?

In [90]:
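question_overlap = []
terms_used = set()

# Process questions in chronological order so that "earlier" terms are meaningful
jeopardy = jeopardy.sort_values("AirDate")

for i,row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    # Keep only words longer than 5 characters to skip most common short words
    split_question = [e for e in split_question if len(e)>5]
    match_count = 0 
    for word in split_question:
        if word in terms_used:
            match_count += 1
    # This line sits outside the inner loop, so only the last long word
    # of each question is recorded in terms_used
    terms_used.add(word)
    # Turn the count into a proportion of the question's long words
    if len(split_question)>0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap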



In [91]:
print(jeopardy["question_overlap"].mean())

0.4950582436420868


There is about 50% overlap between the terms in a new question and the terms recorded from earlier questions. This analysis covers only a small subset of the questions (about 10% of the full data set) and compares single terms rather than phrases, so the result is only weakly informative. Still, it suggests that question recycling is worth investigating further as part of a strategy to win the game.
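The loop above adds only one word per question to `terms_used`, because `terms_used.add(word)` runs after the inner loop has finished. A variant that records every long word might look like the sketch below. The function name `overlap_scores`, the `min_len` parameter, and the toy questions are illustrative assumptions. On the full data set this variant can only give an equal or higher mean than the one reported above, since its set of known terms is always at least as large.

```python
def overlap_scores(clean_questions, min_len=6):
    """Share of each question's long words already seen in earlier questions."""
    terms_used = set()
    scores = []
    for question in clean_questions:
        # min_len=6 matches the notebook's len(e) > 5 filter
        words = [w for w in question.split() if len(w) >= min_len]
        matches = sum(1 for w in words if w in terms_used)
        scores.append(matches / len(words) if words else 0)
        # Record every long word of this question, not just the last one
        terms_used.update(words)
    return scores

# Hypothetical normalized questions, assumed to be in air-date order
toy_questions = [
    "president washington crossed delaware",
    "washington became president",
    "another question entirely",
]
scores = overlap_scores(toy_questions)
print(scores)
```

In the toy data, two of the three long words in the second question were already seen in the first, so its score is 2/3, while the first and third questions score 0.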

## Low Value vs High Value Questions

In [21]:
def high_or_low(row):
    if row["clean_value"] > 800:
        return 1
    else:
        return 0
    

In [22]:
jeopardy["high_value"] = jeopardy.apply(high_or_low,axis=1)

In [26]:
def high_low_count(word):
    low_count = 0
    high_count = 0
    for i,row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row["high_value"]==1:
                high_count +=1
            else:
                low_count +=1
    return high_count,low_count


In [69]:
import random
# random.sample needs a sequence; passing a set raises TypeError on Python 3.11+
random_sample = random.sample(list(terms_used),10)
comparison_terms = list(random_sample)
observed_expected = []

In [73]:
for e in comparison_terms:
    observed_expected.append(high_low_count(e))
print(observed_expected)
    

[(0, 1), (1, 1), (0, 1), (0, 1), (0, 1), (3, 8), (0, 3), (2, 1), (1, 2), (0, 2)]
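Each call to `high_low_count` scans the whole data frame, so checking many terms becomes slow. A possible alternative, sketched here on toy data, is to count all terms in a single pass with `collections.Counter` and look up individual words afterwards. The function name `count_terms` and the toy rows are hypothetical and not part of the notebook.

```python
from collections import Counter

def count_terms(rows):
    """Count, per word, how many high-value and low-value questions contain it."""
    high_counts, low_counts = Counter(), Counter()
    for clean_question, high_value in rows:
        # set() so a word repeated inside one question is counted once,
        # matching the `word in split_question` check
        words = set(clean_question.split())
        if high_value == 1:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return high_counts, low_counts

# Hypothetical (clean_question, high_value) pairs
toy_rows = [
    ("this planet has rings", 1),
    ("this planet is closest to the sun", 0),
    ("the red planet", 0),
]
high_counts, low_counts = count_terms(toy_rows)
print(high_counts["planet"], low_counts["planet"])  # 1 2
```

With a real data frame, the pairs could be built from the `clean_question` and `high_value` columns.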


### Applying the Chi-squared test

In [79]:
high_value_count = len(jeopardy[jeopardy["high_value"]==1])
low_value_count = len(jeopardy[jeopardy["high_value"]==0])

In [95]:
chi_squared = []
chi_squared_2 =[]

In [97]:
from scipy.stats import chisquare

for counts in observed_expected:
    # counts is a (high_count, low_count) pair of observed frequencies
    total = counts[0] + counts[1]
    # Share of all questions that contain the term
    total_prop = total / len(jeopardy)
    # Expected frequencies if the term were spread evenly over high and low value questions
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    chi_squared.append(chisquare(counts, [expected_high, expected_low]))
    

In [89]:
chi_squared

[Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.010522836989240831, pvalue=0.91829561813933991),
 Power_divergenceResult(statistic=1.2058885383806519, pvalue=0.27214791766901714),
 Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868263753),
 Power_divergenceResult(statistic=0.031881167234403623, pvalue=0.85828871632352932),
 Power_divergenceResult(statistic=0.80392569225376798, pvalue=0.36992223780795708)]
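The first result can be reproduced by hand, which shows what the test computes. The sketch below evaluates the Pearson statistic for the observed pair `(0, 1)` and derives the p-value for one degree of freedom with the standard library only. The totals 5734 and 14265 are not printed in the notebook: they are inferred from the reported statistics and the 19,999 rows, so treat them as an assumption. The function name `chi_squared_1dof` is illustrative.

```python
import math

def chi_squared_1dof(observed, expected):
    """Pearson chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# ASSUMPTION: these totals are inferred from the reported statistics,
# not printed anywhere in the notebook
high_value_count, low_value_count = 5734, 14265
n_rows = high_value_count + low_value_count

observed = (0, 1)  # first entry of observed_expected
total_prop = sum(observed) / n_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

statistic, p_value = chi_squared_1dof(observed, expected)
print(statistic, p_value)  # approximately 0.40196 and 0.52608
```

With a single observation the expected counts are about 0.29 and 0.71, far below the usual rule of thumb of at least 5 per cell.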

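For every sampled word, the p-value is well above the 0.05 threshold, so we fail to reject the null hypothesis: none of these terms appears significantly more often in high value questions than in low value questions.
Note that the observed counts are very small (most terms appear in fewer than five questions), so the chi-squared approximation is unreliable here. Running the test on terms with higher frequencies would give a more trustworthy result.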