# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money.

In this project, I'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

### Summary of results:
- On average, only about 6% of the words in an answer also appear in the question itself. That share is very low, so it is not reasonable to expect to find the answer in the question.
- On average, about 69% of the terms (words of 6 or more characters) in a question had already appeared in an earlier question. That share is quite high, and it affects the studying strategy for Jeopardy because many of the terms used in the past are likely to appear in future questions.
- None of the sampled terms showed a significant difference in usage between high-value and low-value questions. In addition, all observed frequencies were below 5, so the chi-squared test is not reliable here. It would be better to rerun the test using only terms with higher frequencies.

## Reading in dataset

In [97]:
import pandas as pd

# read in dataset
data = pd.read_csv('jeopardy.csv')

# quick exploration of the dataset
data.shape

(19999, 7)

In [98]:
# first 5 rows
data.head()

|   | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


## Cleaning Data
### Removing unnecessary spaces in columns names

In [99]:
data.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have a leading space, so I will remove it.

In [100]:
data.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
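
Typing the cleaned names by hand works, but it depends on the column order staying the same. A minimal sketch of a more general approach is to strip the whitespace from each name instead. The list below is copied from the output above. With pandas, the same idea is usually written as `data.columns = data.columns.str.strip()`.

```python
# Column names as printed above, including the leading spaces.
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category', ' Value',
               ' Question', ' Answer']

# Strip leading and trailing whitespace from every name.
clean_columns = [name.strip() for name in raw_columns]

print(clean_columns)
```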

### Normalizing text columns
I will normalize the text columns in this section using a helper function.

In [101]:
import re

def normalize_text(text):
    '''
    Convert the string to lowercase, remove all punctuation,
    and collapse runs of whitespace into a single space.
    '''
    text = text.lower()
    # raw strings keep \s from being treated as a string escape
    text = re.sub(r'[^0-9a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text

In [102]:
# apply function to Question column
data['clean_question'] = data['Question'].apply(normalize_text)

# apply function to Answer column
data['clean_answer'] = data['Answer'].apply(normalize_text)
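
To make the effect of the normalization concrete, here is a small sketch that repeats the same logic and applies it to two strings taken from the rows shown earlier (the question text is only the visible fragment, since the output above is truncated).

```python
import re

def normalize_text(text):
    """Lowercase, drop punctuation, collapse whitespace (same logic as above)."""
    text = text.lower()
    text = re.sub(r'[^0-9a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text

fragment = 'In 1963, live on "The Art Linkletter Show"'
clean_fragment = normalize_text(fragment)
clean_answer = normalize_text("McDonald's")

print(clean_fragment)  # in 1963 live on the art linkletter show
print(clean_answer)    # mcdonalds
```

Note that the apostrophe is dropped rather than replaced by a space, so "McDonald's" becomes a single token.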

### Normalizing value columns
I will normalize the Value column in this section using a helper function.

In [103]:
def normalize_value(text):
    '''
    Remove all punctuation from the string and convert it to an integer.
    If the conversion fails, return 0 instead.
    '''
    text = re.sub(r'[^0-9\s]', '', text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

In [104]:
# apply function to Value column
data['clean_value'] = data['Value'].apply(normalize_value)
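
As a quick check of how the conversion behaves, the sketch below repeats the same logic on three inputs. Only `'$200'` comes from the rows shown earlier; `'$2,000'` and `'None'` are hypothetical inputs chosen to show the comma handling and the fallback to 0.

```python
import re

def normalize_value(text):
    """Strip everything except digits and whitespace, then convert to int (0 on failure)."""
    text = re.sub(r'[^0-9\s]', '', text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

print(normalize_value('$200'))    # 200
print(normalize_value('$2,000'))  # 2000 (hypothetical input)
print(normalize_value('None'))    # 0    (hypothetical input with no digits)
```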

### Converting Air Date column to datetime format
I will convert the Air Date column to datetime format.

In [105]:
data['Air Date'] = pd.to_datetime(data['Air Date'])

## Studying past questions

To study past questions, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

The first question can be answered by seeing how many times words in the answer also occur in the question.
The second question can be answered by seeing how often complex words (6 or more characters) reoccur.

I will work on the first question first.

### Deducible questions

In [106]:
def count_answer_in_question(row):
    '''
    Split the answer and the question into words and return the
    proportion of answer words that also appear in the question.
    '''
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the') # 'the' is common but useless
    if len(split_answer) == 0:
        return 0 # to avoid division by zero error later
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [107]:
# apply function to dataset
answer_in_question = data.apply(count_answer_in_question, axis = 1)

# mean proportion of answer words that appear in the question
answer_in_question.mean()

0.05900196524977763

On average, only about 6% of the words in an answer also appear in the question itself. That share is very low, so it is not reasonable to expect to find the answer in the question.
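
To show what the proportion means for a single row, here is a small sketch that reuses the same logic on two hypothetical rows. Plain dictionaries stand in for DataFrame rows, and the cleaned texts are made up for illustration.

```python
def count_answer_in_question(row):
    """Return the share of answer words (ignoring one 'the') found in the question."""
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for item in split_answer if item in split_question)
    return match_count / len(split_answer)

# Hypothetical rows, not taken from the dataset.
partial_match = {'clean_answer': 'the john adams',
                 'clean_question': 'john was the second president'}
no_match = {'clean_answer': 'copernicus',
            'clean_question': 'this astronomer studied the planets'}

print(count_answer_in_question(partial_match))  # 0.5
print(count_answer_in_question(no_match))       # 0.0
```

In the first row, 'the' is dropped and one of the two remaining answer words ('john') appears in the question, which gives 0.5.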

Then I will investigate how often new questions are repeats of older ones. With the existing dataset, it's not possible to answer this question completely, because the dataset contains only 10% of the full Jeopardy question dataset, but I will still investigate it as practice.

A term counts here as a word of 6 or more characters, because short words like 'the' or 'than' are common but carry little information about the question. If a term in a question has already been used in an earlier question, it counts as repeated.

### Repeated questions

In [108]:
# list of overlap proportions, one per question
question_overlap = []

# set of terms seen so far
terms_used = set()

# sort data by ascending air date
data = data.sort_values(['Air Date'])

for _, row in data.iterrows():
    split_question = row['clean_question'].split()
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
        
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
    
# assign question_overlap to data
data['question_overlap'] = question_overlap

# find mean of question overlap column
data['question_overlap'].mean()

0.6876260592169802

On average, about 69% of the terms in a question had already appeared in an earlier question. That share is quite high, and it affects the studying strategy for Jeopardy because many of the terms used in the past are likely to appear in future questions.
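
The overlap calculation is easier to follow on a tiny example. The sketch below applies the same idea to three hypothetical questions that are assumed to be already cleaned and sorted by air date. The function name `overlap_with_past` and the `min_length` parameter are my own, introduced only for illustration.

```python
def overlap_with_past(questions, min_length=6):
    """Return, for each question, the share of its long terms seen in earlier questions."""
    terms_used = set()
    overlaps = []
    for question in questions:
        terms = [word for word in question.split() if len(word) >= min_length]
        # Count matches before adding this question's own terms.
        seen = sum(1 for word in terms if word in terms_used)
        terms_used.update(terms)
        overlaps.append(seen / len(terms) if terms else 0)
    return overlaps

# Hypothetical questions in air-date order.
questions = [
    'this italian astronomer improved the telescope',
    'the telescope was invented in this country',
    'name this one',
]
overlaps = overlap_with_past(questions)
print(overlaps)
```

The first question has four long terms and none seen before, the second shares 'telescope' out of three long terms, and the third has no long terms at all, so it contributes 0. The earliest question in the sorted data can never overlap with anything, which pulls the mean down slightly.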

## High value questions
One possible strategy is to study only the terms that appear in high-value questions rather than low-value ones. In this section, I will use a chi-squared test to check which terms correspond to high-value questions. The questions are split as follows:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

In [109]:
def define_value(row):
    '''
    function to define if that question is high value/low value
    '''
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

# apply function to dataset
data['high_value'] = data.apply(define_value, axis = 1)

In [110]:
def value_counts(word):
    '''
    function to count high value terms and low value terms
    '''
    low_count = 0
    high_count = 0
    for row in data.iterrows():
        row = row[1]
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
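
Calling this function once per term scans the whole dataset each time. As a rough sketch of a faster alternative, the counts for every term can be collected in a single pass with `collections.Counter`. The function name `count_terms_by_value` and the sample rows are hypothetical; with the DataFrame above, the input could be built with `zip(data['clean_question'], data['high_value'])`.

```python
from collections import Counter

def count_terms_by_value(rows):
    """Count, per term, how many high-value and low-value questions contain it."""
    high_counts, low_counts = Counter(), Counter()
    for clean_question, high_value in rows:
        # A set counts each term at most once per question,
        # matching the `word in split_question` check above.
        unique_terms = set(clean_question.split())
        target = high_counts if high_value == 1 else low_counts
        target.update(unique_terms)
    return high_counts, low_counts

# Hypothetical (clean_question, high_value) pairs.
rows = [
    ('the telescope was invented here', 1),
    ('this telescope telescope', 0),
    ('invented by whom', 0),
]
high, low = count_terms_by_value(rows)
print(high['telescope'], low['telescope'])  # 1 1
```

Looking up a term afterwards is a dictionary access, and a term that never occurs simply returns 0.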

In [111]:
from random import sample

# randomly pick 10 terms for comparison
# (random.sample needs a sequence, so the set is converted to a list first)
comparison_terms = sample(list(terms_used), 10)

# list of observed (high_count, low_count) pairs, one per term
observed_expected = []

# count how often each comparison term appears in high- and low-value questions
for term in comparison_terms:
    observed_expected.append(value_counts(term))
    
observed_expected

[(3, 2),
 (0, 1),
 (0, 2),
 (0, 1),
 (0, 1),
 (0, 2),
 (1, 0),
 (1, 1),
 (0, 1),
 (1, 3)]

Now that I've found the observed counts for a few terms, the expected counts and the chi-squared value can be computed.

In [115]:
import numpy as np
from scipy.stats import chisquare

# count the high value rows and the low value rows
high_value_count = len(data[data['high_value'] == 1])
low_value_count = len(data[data['high_value'] == 0])

# empty list for chi squared values
chi_squared = []

for obs_list in observed_expected:
    total = np.sum(obs_list)
    total_prop = total / len(data)
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    observed = np.array([obs_list[0], obs_list[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=2.3995960878537224, pvalue=0.12136658322360773),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921)]

None of the sampled terms showed a significant difference in usage between high-value and low-value rows. In addition, all observed frequencies were below 5, so the chi-squared test is not reliable here. It would be better to rerun the test using only terms with higher frequencies.
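
To make the frequency problem concrete, here is a minimal sketch that computes the same two-bin chi-squared statistic by hand instead of calling `scipy.stats.chisquare`, and also reports whether both expected counts reach 5. The function name, the row counts (300 high, 700 low), and the observed counts are all hypothetical. The p-value uses the closed form for 1 degree of freedom.

```python
import math

def chi_squared_two_bins(observed_high, observed_low, high_rows, low_rows):
    """Return (statistic, p_value, reliable) for a term's high/low counts.

    Assumes the term occurs at least once, so the expected counts are non-zero.
    """
    total_rows = high_rows + low_rows
    total_observed = observed_high + observed_low
    expected_high = total_observed * high_rows / total_rows
    expected_low = total_observed * low_rows / total_rows
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    # Common rule of thumb: every expected count should be at least 5.
    reliable = min(expected_high, expected_low) >= 5
    return statistic, p_value, reliable

# Hypothetical frequent term: 30 high-value and 20 low-value occurrences.
frequent = chi_squared_two_bins(30, 20, high_rows=300, low_rows=700)
# Hypothetical rare term: a single low-value occurrence.
rare = chi_squared_two_bins(0, 1, high_rows=300, low_rows=700)

print(frequent)
print(rare)
```

For the rare term the expected counts are 0.3 and 0.7, far below 5, so the resulting p-value should not be trusted. Filtering on the `reliable` flag before testing is one way to restrict the analysis to terms with higher frequencies.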

## Conclusion
This project produced the following findings:
- On average, only about 6% of the words in an answer also appear in the question itself. That share is very low, so it is not reasonable to expect to find the answer in the question.
- On average, about 69% of the terms (words of 6 or more characters) in a question had already appeared in an earlier question. That share is quite high, and it affects the studying strategy for Jeopardy because many of the terms used in the past are likely to appear in future questions.
- None of the sampled terms showed a significant difference in usage between high-value and low-value questions. In addition, all observed frequencies were below 5, so the chi-squared test is not reliable here. It would be better to rerun the test using only terms with higher frequencies.

## Further Study
Here are some potential next steps:

- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    - Manually create a list of words to remove, like the, than, etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (linked in the introduction) instead of the subset used in this project.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
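
As a starting point for the first idea, here is a rough sketch of removing words that occur in more than a given share of questions. The function name `frequent_terms`, the `max_share` parameter, and the sample questions are hypothetical. The sample uses a threshold of 50% only because it has just four questions; the 5% mentioned above would make more sense on the real data.

```python
from collections import Counter

def frequent_terms(questions, max_share=0.05):
    """Return the set of words that appear in more than max_share of the questions."""
    document_frequency = Counter()
    for question in questions:
        # Count each word once per question.
        document_frequency.update(set(question.split()))
    limit = max_share * len(questions)
    return {term for term, count in document_frequency.items() if count > limit}

# Hypothetical cleaned questions.
questions = [
    'this city is the capital',
    'this river flows north',
    'the capital of this state',
    'a famous painter',
]
common = frequent_terms(questions, max_share=0.5)
filtered = [[word for word in question.split() if word not in common]
            for question in questions]

print(common)       # {'this'}
print(filtered[0])  # ['city', 'is', 'the', 'capital']
```

Unlike the length cutoff, this keeps short but informative words and drops long but very common ones.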