# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

We want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named `jeopardy.csv`, and contains `19999` rows from the beginning of the full dataset of Jeopardy questions, which we can download [here]. Here's the beginning of the file:

![Jeopardy Dataset Preview](images/jeopardy_dataset_preview.png)

As we can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- `Show Number` -- the Jeopardy episode number of the show this question was in.
- `Air Date` -- the date the episode aired.
- `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- `Category` -- the category of the question.
- `Value` -- the number of dollars answering the question correctly is worth.
- `Question` -- the text of the question.
- `Answer` -- the text of the answer.

[here]: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file

In [1]:
# files
!ls

images	jeopardy.csv  project16_winning_jeopardy.ipynb


In [96]:
# import libraries
import numpy as np
import pandas as pd
import string

from scipy.stats import chisquare

## Jeopardy Questions

Let's read the dataset into a DataFrame and explore it.

In [4]:
jeopardy = pd.read_csv('jeopardy.csv')

In [24]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 9 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null object
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
dtypes: int64(1), object(8)
memory usage: 1.4+ MB


In [5]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [6]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [10]:
# Remove spaces in front of Column Names
jeopardy.columns = [
    'Show Number', 'Air Date',
    'Round', 'Category', 
    'Value', 'Question',
    'Answer'
]
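Typing the names out works for seven columns. As an alternative sketch, the leading spaces can be stripped programmatically with the `Index.str.strip` accessor. The small `demo` frame below is a made-up stand-in, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the raw file: note the leading spaces
demo = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round'])

# Strip surrounding whitespace from every column name at once
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

This avoids retyping the names and keeps working if columns are added later.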

In [11]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Normalizing Text

Before we jump into analyzing our dataset, we need to normalize the text columns (`Question` and `Answer`). We'll lowercase each string and remove punctuation so that `Don't` and `don't` aren't treated as different words when we compare them.

In [19]:
# function to normalize questions and answers
def normalize_text(text):
    '''converts a string to lowercase and
    removes all punctuation'''
    text = text.lower()
    
    # drop every character listed in string.punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    
    return text

In [22]:
# Normalize the Question column
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)

In [23]:
# Normalize the Answer column
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
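
As a quick sanity check of this step, here is a sketch of an equivalent normalizer built on `str.translate`, which deletes punctuation in a single pass. The name `normalize_fast` is hypothetical and not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize_fast(text):
    '''lowercase the text and drop ASCII punctuation'''
    return text.lower().translate(PUNCT_TABLE)

# Both spellings collapse to the same token
print(normalize_fast("Don't"), normalize_fast("don't"))
```

Like the function above, this only handles the ASCII characters in `string.punctuation`; typographic quotes or dashes would pass through unchanged.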

In [25]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


## Normalizing Columns

There are some other columns that we need to normalize.

The `Value` column should also be numeric, to allow us to manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `Air Date` column should be a datetime instead of a string so that we can work with it more easily. We can use the [`pandas.to_datetime`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) function.

In [33]:
def normalize_dollar_values(dollar_amount):
    '''removes punctuation and 
    converts string to an integer
    returns 0 if conversion has any error'''

    try:
        # remove all punctuation, e.g. '$2,000' -> '2000'
        dollar_amount = ''.join([char for char in dollar_amount if char not in string.punctuation])
    
        number = int(dollar_amount)
    except (TypeError, ValueError):
        # non-numeric entries such as 'None' fall back to 0
        number = 0
    
    return number

In [36]:
# test all unique values of Value column
for val in jeopardy['Value'].unique():
    print(val,'-->', normalize_dollar_values(val))

$200 --> 200
$400 --> 400
$600 --> 600
$800 --> 800
$2,000 --> 2000
$1000 --> 1000
$1200 --> 1200
$1600 --> 1600
$2000 --> 2000
$3,200 --> 3200
None --> 0
$5,000 --> 5000
$100 --> 100
$300 --> 300
$500 --> 500
$1,000 --> 1000
$1,500 --> 1500
$1,200 --> 1200
$4,800 --> 4800
$1,800 --> 1800
$1,100 --> 1100
$2,200 --> 2200
$3,400 --> 3400
$3,000 --> 3000
$4,000 --> 4000
$1,600 --> 1600
$6,800 --> 6800
$1,900 --> 1900
$3,100 --> 3100
$700 --> 700
$1,400 --> 1400
$2,800 --> 2800
$8,000 --> 8000
$6,000 --> 6000
$2,400 --> 2400
$12,000 --> 12000
$3,800 --> 3800
$2,500 --> 2500
$6,200 --> 6200
$10,000 --> 10000
$7,000 --> 7000
$1,492 --> 1492
$7,400 --> 7400
$1,300 --> 1300
$7,200 --> 7200
$2,600 --> 2600
$3,300 --> 3300
$5,400 --> 5400
$4,500 --> 4500
$2,100 --> 2100
$900 --> 900
$3,600 --> 3600
$2,127 --> 2127
$367 --> 367
$4,400 --> 4400
$3,500 --> 3500
$2,900 --> 2900
$3,900 --> 3900
$4,100 --> 4100
$4,600 --> 4600
$10,800 --> 10800
$2,300 --> 2300
$5,600 --> 5600
$1,111 --> 1111
$8,200 --

In [40]:
# Normalize the 'Value' column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollar_values)

In [41]:
# convert 'Air Date' column to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
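
The same two conversions can also be written with vectorized pandas operations. The following is a sketch on a made-up `demo` frame, assuming the only non-digit characters in valid amounts are `$` and `,`. Unparseable entries such as `None` are coerced to `0`, mirroring the function above.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the dataset
demo = pd.DataFrame({
    'Value': ['$200', '$2,000', 'None'],
    'Air Date': ['2004-12-31', '2005-01-03', '2005-01-04'],
})

# Strip '$' and ',' then convert; invalid strings become NaN, then 0
stripped = demo['Value'].str.replace(r'[$,]', '', regex=True)
demo['clean_value'] = pd.to_numeric(stripped, errors='coerce').fillna(0).astype(int)

# Parse the date strings into datetime values
demo['Air Date'] = pd.to_datetime(demo['Air Date'])

print(demo.dtypes)
```

Mapping missing values to `0` keeps the column as integers, but it also means those rows are later treated as low value questions.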

In [42]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


In [43]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

In [56]:
def count_matches(row):
    match_count = 0
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    
    # remove 'the' - very common word with no meaningful use
    if 'the' in split_answer:
        split_answer.remove('the')
    
    if len(split_answer) == 0:
        return 0
    
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    return match_count / len(split_answer)
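
To make the ratio concrete, here is a sketch of the same calculation on plain strings. The helper `match_fraction` and the sample strings are made up for illustration. Unlike `count_matches`, it drops every `the` in the answer rather than only the first one.

```python
def match_fraction(clean_answer, clean_question):
    '''share of answer words (ignoring "the") found in the question'''
    answer_words = [w for w in clean_answer.split(' ') if w != 'the']
    question_words = clean_question.split(' ')

    if len(answer_words) == 0:
        return 0

    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Both remaining answer words appear in the question
print(match_fraction('the grand canyon', 'this grand landmark is a canyon'))
# Only one of the two answer words appears in the question
print(match_fraction('john adams', 'john hancock signed first'))
```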

In [57]:
answer_in_questions = jeopardy.apply(count_matches, axis=1)

In [58]:
answer_in_questions.head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
dtype: float64

In [60]:
answer_in_questions.unique()

array([0.        , 0.33333333, 0.5       , 0.25      , 0.2       ,
       0.66666667, 1.        , 0.28571429, 0.4       , 0.6       ,
       0.16666667, 0.75      , 0.57142857, 0.14285714, 0.375     ,
       0.30769231, 0.15384615, 0.125     , 0.3       , 0.1       ,
       0.42857143, 0.44444444, 0.11111111, 0.18181818, 0.27272727,
       0.875     , 0.22222222, 0.8       ])

In [59]:
answer_in_questions.mean()

0.06035277385469894

**Observations**:

We observe that, on average, about 6% of the words in an answer also appear in its question.

This is a very small share, so we cannot rely on hearing the question to help us figure out the answer. We will have to take the hard route and study.

## Recycled Questions

Let's investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

To do this, we can:

- Sort `jeopardy` in order of ascending air date.
- Maintain a set called `terms_used` that will be empty initially.
- Iterate through each row of `jeopardy`.
- Split `clean_question` into words, remove any word shorter than `6` characters, and check if each word occurs in `terms_used`.
    - If it does, increment a counter.
    - Add each word to `terms_used`.

In [73]:
question_overlap = []
terms_used = set()

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    
    # remove words with less than 6 characters
    split_question = [word for word in split_question if len(word) >= 6]
    
    match_count = 0
    
    # identify words previously used
    # add new words to terms_used set
    for word in split_question:
        if word in terms_used:
            match_count += 1
            
        terms_used.add(word)
    
    if len(split_question) > 0:
        match_count /= len(split_question)
    
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

In [74]:
jeopardy['question_overlap'].head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: question_overlap, dtype: float64

In [75]:
jeopardy['question_overlap'].mean()

0.6919577992203644

**Observations**:

We observe roughly a 69% overlap between terms in new questions and terms in older questions. This only covers a small set of questions, and it looks at single terms rather than phrases, which makes the result relatively insignificant. Still, it suggests that the recycling of questions is worth investigating further.

## Low Value vs High Value Questions

Let's focus on terms that appear in high value questions rather than low value questions. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

- Low value -- Any row where `Value` is `800` or less.
- High value -- Any row where `Value` is greater than `800`.

We'll then be able to loop through each of the terms collected in the previous section, `terms_used`, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
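
To illustrate the arithmetic behind these steps, here is a sketch for a single term. All of the counts below are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
# Hypothetical group sizes
high_value_count = 6000
low_value_count = 14000
total_rows = high_value_count + low_value_count

# Hypothetical observed counts for one term
observed_high, observed_low = 1, 4

# Share of all questions that contain the term
total_prop = (observed_high + observed_low) / total_rows

# Expected counts if the term were spread evenly across both groups
expected_high = total_prop * high_value_count
expected_low = total_prop * low_value_count

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi_squared = (
    (observed_high - expected_high) ** 2 / expected_high
    + (observed_low - expected_low) ** 2 / expected_low
)

print(expected_high, expected_low, round(chi_squared, 4))
```

A larger statistic means the observed split differs more from what the group sizes alone would predict.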

In [81]:
def update_value(row):
    '''set values greater than 800 to 1
    otherwise set them to 0'''
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    
    return value

In [86]:
# determine which questions are high and low value
jeopardy['high_value'] = jeopardy.apply(update_value, axis=1)

In [87]:
def count_usage(term):
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        
        if term in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return (high_count, low_count)

In [90]:
# take five terms from the terms_used set (set order is arbitrary)
comparison_terms = list(terms_used)[:5]
comparison_terms

['pretender',
 'hrefhttpwwwjarchivecommedia20070330j16jpg',
 'guelphs',
 'babies',
 'doorposts']

In [91]:
observed_expected = []

for term in comparison_terms:
    term_result = count_usage(term)
    observed_expected.append(term_result)

In [92]:
observed_expected

[(1, 1), (1, 0), (1, 0), (1, 4), (0, 1)]

## Applying the Chi-Squared Test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [97]:
high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])

chi_squared = []

for high_count, low_count in observed_expected:
    total = high_count + low_count
    
    total_prop = total / jeopardy.shape[0]
    
    expected_high_value = total_prop * high_value_count
    expected_low_value = total_prop * low_value_count
    
    observed = np.array([high_count, low_count])
    expected = np.array([expected_high_value, expected_low_value])
    
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.18383953104516373, pvalue=0.6680941623250602),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

**Observations**:

None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test only with terms that have higher frequencies.
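
One way to act on this is to keep only the terms whose expected counts reach a minimum threshold before testing them. The sketch below assumes the common rule of thumb of at least 5 per cell. The totals, the `term_counts` dictionary, and the helper `has_enough_data` are all hypothetical.

```python
# Hypothetical group sizes and per-term (high, low) counts
high_value_count = 6000
low_value_count = 14000
total_rows = high_value_count + low_value_count

term_counts = {
    'president': (40, 60),
    'guelphs': (1, 0),
    'capital': (10, 30),
}

MIN_EXPECTED = 5  # rule-of-thumb threshold, an assumption of this sketch

def has_enough_data(high_count, low_count):
    '''True when both expected counts reach MIN_EXPECTED'''
    total_prop = (high_count + low_count) / total_rows
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    return min(expected_high, expected_low) >= MIN_EXPECTED

# Keep only the terms that are frequent enough to test
frequent_terms = [t for t, (h, l) in term_counts.items() if has_enough_data(h, l)]
print(frequent_terms)
```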

## Next Steps

Here are some potential next steps:

- Find a better way to eliminate non-informative words than just removing words that are less than `6` characters long. Some ideas:
    - Manually create a list of words to remove, like `the`, `than`, etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like `5%`) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the [apply] method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the `Category` column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (available [here]) instead of the subset we used in this project.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
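
As one way to address the speed concern above, the sketch below tallies every term in a single pass over the rows instead of rescanning the data once per term. It is an alternative to the suggested `apply` approach, uses plain Python `Counter` objects, and runs on made-up `rows` of `(clean_question, high_value)` pairs.

```python
from collections import Counter

# Hypothetical (clean_question, high_value) pairs
rows = [
    ('galileo observed jupiter', 1),
    ('jupiter has many moons', 0),
    ('galileo was a scientist', 0),
]

high_counts = Counter()
low_counts = Counter()

for question, high_value in rows:
    # set() so a term counts once per question, like `term in split_question`
    for term in set(question.split(' ')):
        if high_value == 1:
            high_counts[term] += 1
        else:
            low_counts[term] += 1

# (high, low) pair for any term, in the order used by count_usage
print(high_counts['galileo'], low_counts['galileo'])
```

Because each row is visited once, the cost no longer grows with the number of terms being compared.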


[apply]: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
[here]: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file