# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. 

In this project, we'll work with a dataset of Jeopardy questions to look for patterns that could help a contestant win. The dataset is named jeopardy.csv and contains the first 20000 rows of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

* Show Number - the Jeopardy episode number
* Air Date - the date the episode aired
* Round - the round of Jeopardy
* Category - the category of the question
* Value - the number of dollars the correct answer is worth
* Question - the text of the question
* Answer - the text of the answer

### 1. Import the dataset 

In [4]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [5]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [6]:
# Rename the columns to remove the leading spaces in the names
jeopardy.columns=['Show Number', 'Air Date', 'Round', 
                  'Category', 'Value',
                  'Question', 'Answer']
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
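
Typing the names out works for seven columns. A shorter alternative, sketched below on a small stand-in DataFrame (the `demo` frame is hypothetical, not the real dataset), is to strip the whitespace from every name with `Index.str.strip()`.

```python
import pandas as pd

# Hypothetical stand-in with the same leading-space problem as jeopardy.csv
demo = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round'])

# Strip leading/trailing whitespace from every column name in one step
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

This avoids retyping the names and keeps working if columns are added or reordered.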

### 2. Normalizing Columns

* We need to normalize the text columns. We'll lowercase the words and remove punctuation so that Don't and don't aren't treated as different words when we compare them.

* The Value column should be numeric. We'll remove the dollar sign from the beginning of each value and convert the column from text to numeric.

* The Air Date column should be a datetime, not a string, so that it is easier to work with.


In [7]:
import re

def normalize_text(text):
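    # Replace every non-word character with a space, then lowercase.
    # The raw string avoids an invalid-escape warning for \W.
    normalized_text = re.sub(r'\W', ' ', text).lower()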
    return normalized_text

def normalize_value(text):
    value = re.sub('[^0-9]', '', text)
    try:
        value = int(value)
    except ValueError:
        # No digits left (for example a non-numeric placeholder), so use 0
        value = 0
    return value

In [8]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams,200


In [9]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
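
The same normalization can also be written with pandas' vectorized string methods instead of `apply`. This is only a sketch on a hypothetical two-row frame (`demo`, with made-up values). It assumes every `Value` is either a dollar amount or a non-numeric placeholder, which it maps to 0 as `normalize_value` does.

```python
import pandas as pd

# Made-up rows for illustration; not taken from jeopardy.csv
demo = pd.DataFrame({
    'Question': ["Don't stop!", 'Signer of the Dec. of Indep.'],
    'Value': ['$2,000', 'None'],
})

# Replace every non-word character with a space, then lowercase
demo['clean_question'] = (
    demo['Question'].str.replace(r'\W', ' ', regex=True).str.lower()
)

# Drop everything except digits; empty strings become NaN, then 0
digits = demo['Value'].str.replace(r'[^0-9]', '', regex=True)
demo['clean_value'] = (
    pd.to_numeric(digits, errors='coerce').fillna(0).astype(int)
)

print(demo[['clean_question', 'clean_value']])
```

On a frame of this size either approach is fine; the vectorized form mainly saves a Python-level function call per row.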

### 3. Answer in Questions

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

1) How often the answer can be deduced from the question.
 
 * We can answer this by counting how many words in the answer also occur in the question. We'll work on this first and come back to the second item in the next section.


2) How often questions are repeated.
 
 * We can answer this by seeing how often complex words (6 or more characters) reoccur.

In [10]:
# Fraction of the words in the answer that also appear in the question
def match_counts(row): 
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count =0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for w in split_answer:
        if w in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['Answer Match Count'] = jeopardy.apply(match_counts, axis=1)
jeopardy['Answer Match Count'].mean()

0.06294645581984949

On average, only about 6% of the words in an answer also appear in its question, so the question text alone rarely gives the answer away.

### 4. Recycled Questions

To investigate how often new questions are repeats of older ones, we can do the following:

* Sort jeopardy in order of ascending air date.
* Maintain a set called terms_used that will be empty initially.
* Iterate through each row of jeopardy.
* Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
  * If it does, increment a counter.
  * Add each word to terms_used.

In [11]:
question_overlap =[]
terms_used =set()

jeopardy =jeopardy.sort_values('Air Date')

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [w for w in split_question if len(w) > 5]
    match_count  = 0
    for w in split_question:
        if w in terms_used:
            match_count += 1
    for w in split_question:
        terms_used.add(w)  
    if len(split_question) >0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()     

0.7197989717809739

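On average, about 72% of the long terms in a question have already appeared in an earlier question. This only measures overlap of single words, not of whole questions, but it suggests that the recycling of questions is worth looking into further.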
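
To see how this overlap score behaves, here is a small sketch on three made-up questions. The strings and the `overlap_scores` helper are illustrative and not part of the dataset or the notebook.

```python
def overlap_scores(questions, min_len=6):
    """Fraction of long words in each question already seen in earlier ones."""
    terms_used = set()
    scores = []
    for question in questions:
        # Keep only words with at least min_len characters
        words = [w for w in question.split() if len(w) >= min_len]
        seen = sum(1 for w in words if w in terms_used)
        scores.append(seen / len(words) if words else 0)
        # Record the words only after scoring the current question
        terms_used.update(words)
    return scores

# Made-up, already-normalized questions in air-date order
demo_questions = [
    'this italian astronomer defended copernicus',
    'this astronomer discovered uranus',
    'lake in peru',
]
scores = overlap_scores(demo_questions)
print(scores)
```

The first question always scores 0 because nothing has been seen yet, and a question with no long words also scores 0, so early rows pull the mean down regardless of how much recycling there is.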


### 5. Low Value vs High Value Questions

Suppose we only want to study terms that show up in high value questions rather than low value ones. This would help us earn more money on Jeopardy.

We'll first need to split the questions into two categories:
* Low value -- Any row where Value is 800 or less.
* High value -- Any row where Value is greater than 800.

We'll then be able to loop through each of the terms collected in the previous section, terms_used, and:

* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find expected counts.
* Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.


In [12]:
def is_high_value(row):
    value = 0
    if  row["clean_value"]>800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(is_high_value,axis=1)

In [13]:
def count_usage(word):
    low_count = 0
    high_count =0
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count,low_count

from random import sample
# random.sample needs a sequence; passing a set raises TypeError on Python 3.11+
comparison_terms = sample(list(terms_used), 10)

observed_expected =[]

for term in comparison_terms:
    observed_expected.append(count_usage(term))
    
observed_expected

[(0, 1),
 (1, 0),
 (1, 0),
 (0, 1),
 (0, 3),
 (0, 1),
 (0, 1),
 (1, 0),
 (0, 1),
 (0, 1)]
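
`count_usage` rescans the whole DataFrame for each term, which is fine for 10 terms but slow for many. One possible single-pass alternative, sketched here on made-up rows (the `rows` list and the `count_usage_all` helper are hypothetical), tallies every term of interest while walking the questions once.

```python
from collections import Counter

def count_usage_all(rows, terms):
    """Return {term: (high_count, low_count)} in one pass over rows.

    rows is an iterable of (clean_question, high_value) pairs.
    """
    terms = set(terms)
    high, low = Counter(), Counter()
    for question, high_value in rows:
        # set() so a word repeated within one question is counted once,
        # matching the `word in split_question` check in count_usage
        for word in set(question.split()) & terms:
            if high_value == 1:
                high[word] += 1
            else:
                low[word] += 1
    return {t: (high[t], low[t]) for t in terms}

# Made-up rows for illustration only
rows = [
    ('this astronomer discovered uranus', 1),
    ('this astronomer defended copernicus', 0),
    ('copernicus was born in this country', 0),
]
usage = count_usage_all(rows, ['astronomer', 'copernicus'])
print(usage)
```

With the real data, the pairs could be built with `zip(jeopardy['clean_question'], jeopardy['high_value'])`.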

### 6. Apply the Chi-Squared Test

In [14]:
counts=jeopardy['high_value'].value_counts()
print(counts)
high_value_count = counts[1]
low_value_count = counts[0]

0    14265
1     5734
Name: high_value, dtype: int64


In [15]:
import numpy as np
from scipy.stats import chisquare

chi_squared =[]

for obs in observed_expected:
    total = sum(obs)
    total_prop =  total/jeopardy.shape[0]
    expected_high_value = total_prop * high_value_count
    expected_low_value = total_prop * low_value_count
    observed = np.array([obs[0], obs[1]])
    expected = np.array([expected_high_value,expected_low_value])
    chi_squared.append(chisquare(observed, expected))  # result holds the statistic and the p-value

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
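
As a sanity check, the first result can be reproduced by hand with the standard library alone. This sketch hard-codes the row counts printed earlier and the observed pair `(0, 1)`; the variable names are chosen here for illustration.

```python
import math

high_value_count = 5734    # rows with high_value == 1 (from value_counts)
low_value_count = 14265    # rows with high_value == 0
total_rows = high_value_count + low_value_count

observed_high, observed_low = 0, 1   # a term seen in one low value question

# Expected counts if the term were spread in proportion to the row counts
total_prop = (observed_high + observed_low) / total_rows
expected_high = total_prop * high_value_count
expected_low = total_prop * low_value_count

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
statistic = ((observed_high - expected_high) ** 2 / expected_high
             + (observed_low - expected_low) ** 2 / expected_low)

# Two categories give 1 degree of freedom, where the chi-squared
# survival function reduces to erfc(sqrt(x / 2))
pvalue = math.erfc(math.sqrt(statistic / 2))

print(statistic, pvalue)
```

The statistic and p-value agree with the first `Power_divergenceResult` in the list, which confirms what `chisquare` is computing for each term.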

### 7. Chi-squared results

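All of the p-values are well above 0.05, so none of the sampled terms shows a statistically significant difference in usage between high value and low value rows. Note that each sampled term appears in only a few questions, and the chi-squared test is unreliable at such low frequencies, so it would be better to run it only on terms that occur more often.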