# Winning Jeopardy

Let's imagine that you'd like to compete in the popular US TV show Jeopardy and are looking for any way to win. 

In this project, we'll work with a dataset of Jeopardy questions to find patterns in the questions that could help us win!



In [1]:
import pandas as pd

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')

In [3]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
jeopardy.shape

(19999, 7)

In [5]:
jeopardy.isnull().sum()

Show Number    0
 Air Date      0
 Round         0
 Category      0
 Value         0
 Question      0
 Answer        0
dtype: int64

In [6]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Looks like some of the columns in the dataframe have a leading space. Let's trim those away.

In [7]:
#removing leading whitespace in column names
jeopardy.columns = jeopardy.columns.str.strip()

In [8]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Now that we've cleaned them up, here are explanations of each column:

`Show Number` - the Jeopardy episode number
`Air Date` - the date the episode aired
`Round` - the round of Jeopardy
`Category` - the category of the question
`Value` - the number of dollars the correct answer is worth
`Question` - the text of the question
`Answer` - the text of the answer

There are 19,999 entries in our dataset, without a single null value to account for.

We'll move forward with some deeper level data cleaning and normalize all the text columns (`Question` and `Answer`).

## Data Formatting and Cleaning

### Normalize Text

In [9]:
import re

#function cleans text (lower case, remove punctuation, collapse repeated whitespace)
def normalize_strings(text):
    text = text.lower()
    #raw strings keep \s from being treated as an invalid string escape
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

In [10]:
#assigning cleaned Question and Answer entries to two new columns: clean_questions, clean_answers
jeopardy['clean_questions'] = jeopardy['Question'].apply(normalize_strings)
jeopardy['clean_answers'] = jeopardy['Answer'].apply(normalize_strings)

In [11]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_questions,clean_answers
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
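
As a quick sanity check of the normalization rules, the same three steps can be tried on a couple of made-up strings. This is only a sketch: the sample strings are invented for illustration, and the vectorized `.str` chain is an alternative way of writing the same steps, not the approach used in this notebook.

```python
import re
import pandas as pd

def normalize_strings(text):
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

# Invented sample strings (not taken from the dataset)
sample = pd.Series(["McDonald's", 'In 1963, live on "The Art  Linkletter Show"'])

# Row-by-row, as in the notebook
via_apply = sample.apply(normalize_strings)

# The same three steps expressed with pandas string methods
via_str = (
    sample.str.lower()
    .str.replace(r"[^A-Za-z0-9\s]", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
)

print(via_apply.tolist())
# ['mcdonalds', 'in 1963 live on the art linkletter show']
print(via_apply.equals(via_str))
# True
```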


With the text entries in `Question` and `Answer` taken care of, let's move on to normalizing columns.

### Normalizing Columns

There are a couple of columns that need their datatypes addressed.

To make its entries easier to manipulate, the values in the `Value` column need to be converted from text to numbers, with the dollar sign removed from the beginning of each value.

The `Air Date` column should be a datetime, not a string.

#### `Value` Column

Let's work on defining the function that will take in a value and transform it into a cleaned numeric.

In [12]:
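#function strips punctuation (such as $ and ,) and converts the value to an integer
def normalize_values(text):
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        #anything that is not a number (such as a missing value) becomes 0
        text = 0
    return text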

In [13]:
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
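
To see what the conversion does, here is a minimal sketch that runs the same logic on a few sample strings. The inputs are invented for illustration; in particular, the `"None"` string is an assumption about how a missing value might be stored, not something confirmed from the dataset.

```python
import re

def normalize_values(text):
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

# Invented inputs: a plain value, a value with a thousands separator,
# and a non-numeric placeholder (assumed, not confirmed from the data)
for raw in ["$200", "$2,000", "None"]:
    print(raw, "->", normalize_values(raw))
# $200 -> 200
# $2,000 -> 2000
# None -> 0
```

Note that non-numeric entries silently become 0, so those rows end up grouped with the lowest values in any later analysis.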

In [14]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_questions,clean_answers,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


#### `Air Date` Column

With the help of the `pandas.to_datetime` function, the `Air Date` entries will be converted to datetime objects.

In [15]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'], format='%Y-%m-%d')
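
Converting to datetime matters because it allows chronological sorting and access to date parts. The sketch below illustrates this on two sample dates in the same `YYYY-MM-DD` format; the series is a stand-in, not the real column.

```python
import pandas as pd

# Two sample dates in the same format as the Air Date column
dates = pd.Series(["2004-12-31", "1984-09-21"])
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# Date parts are available through the .dt accessor
print(parsed.dt.year.tolist())
# [2004, 1984]

# Sorting is now chronological rather than alphabetical
print(parsed.sort_values().dt.year.tolist())
# [1984, 2004]
```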

In [16]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number        19999 non-null int64
Air Date           19999 non-null datetime64[ns]
Round              19999 non-null object
Category           19999 non-null object
Value              19999 non-null object
Question           19999 non-null object
Answer             19999 non-null object
clean_questions    19999 non-null object
clean_answers      19999 non-null object
clean_value        19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


## Finding the Edge

In order to figure out whether to study past questions, study general knowledge, or even study at all, it would be helpful to figure out at least a couple of things.

1. How often the answer can be deduced from the question.
2. How often questions are repeated.

We can start to tackle the first question by seeing how many times words in the answer also occur in the question.

### In the Form of a Question

In [17]:
#function takes in df row as series
def count_matches(row):
    split_answer = row['clean_answers'].split()
    split_question = row['clean_questions'].split()
    
    if 'the' in split_answer:
        split_answer.remove('the')
        
    if len(split_answer) == 0:
        return 0
    
    match_count = 0
    
    for i in split_answer:
        if i in split_question:
            match_count +=1
    return match_count/len(split_answer)

In [18]:
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)

In [19]:
jeopardy['answer_in_question'].mean()

0.05900196524977763

So, the above indicates that, on average, only about 6% of the words in an answer also appear in its question. Keying off the question's words to find the answer isn't a very effective strategy.
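
To make the metric concrete, here is a small sketch that applies the same matching logic to a single made-up row. The question text is invented for illustration, and a plain dictionary stands in for a dataframe row.

```python
def count_matches(row):
    split_answer = row["clean_answers"].split()
    split_question = row["clean_questions"].split()

    # Drop one occurrence of "the", as in the notebook's function
    if "the" in split_answer:
        split_answer.remove("the")

    if len(split_answer) == 0:
        return 0

    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Invented row: a dict works because the function only uses key lookups
row = {
    "clean_answers": "the grand canyon",
    "clean_questions": "this grand natural wonder was dug by the colorado river",
}

print(count_matches(row))
# 0.5  ("grand" appears in the question, "canyon" does not)
```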

Let's start considering option 2: How often are questions repeated?

### Questions, Again

A small caveat on this end of our analysis. We only have access to about 10% of the full Jeopardy question dataset. With that in mind, it's still worth investigating if questions are ever repeated.

The methodology for the below cell:

1. Sort `jeopardy` in order of ascending air date.
2. Maintain a set called `terms_used` that will be empty initially.
3. Iterate through each row of `jeopardy`.
4. Split `clean_questions` into words, remove any word shorter than 6 characters, and check if each word occurs in `terms_used`.
    - If it does, increment a counter.
    - Add each word to `terms_used`.

In [20]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values(by=['Air Date'])

for i, row in jeopardy.iterrows():
    split_question = row['clean_questions'].split(' ') #returns list
    split_question = [q for q in split_question if len(q) > 5] #filters list for words > 5 characters
    match_count = 0
    for word in split_question: #check + increment count if the word is in set 
        if word in terms_used:
            match_count +=1
    for word in split_question: #add word to set
        terms_used.add(word)
    if len(split_question) > 0: #avoid division by zero
        match_count = match_count/len(split_question) #ratio of matching words
    question_overlap.append(match_count) #append ratio to list

jeopardy['question_overlap'] = question_overlap #new df column
jeopardy["question_overlap"].mean()    #deliver mean word overlap

0.6876260592169802

As it turns out, there's nearly a 70% overlap between terms used in new questions and terms used in earlier questions! The caveat bears repeating: the questions in our dataset account for only about 10% of all Jeopardy questions. It's also worth noting that this measure only tracks single words being reused, not whole phrases.
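
The overlap calculation is easier to follow on a tiny example. The sketch below runs the same loop on three invented questions, assumed to be already sorted by air date; none of them come from the dataset.

```python
# Invented questions, assumed to be in chronological order
questions = [
    "this president signed the declaration",
    "this famous president lived in virginia",
    "the declaration was signed in philadelphia",
]

terms_used = set()
overlaps = []

for question in questions:
    # Keep only words of 6 or more characters
    words = [w for w in question.split() if len(w) > 5]
    # Count the words already seen in earlier questions
    matches = sum(1 for w in words if w in terms_used)
    terms_used.update(words)
    overlaps.append(matches / len(words) if words else 0)

print([round(o, 2) for o in overlaps])
# [0.0, 0.33, 0.67]
```

The first question can never overlap with anything, and the ratio tends to rise as `terms_used` grows, which is part of why the dataset-wide mean ends up so high.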

## Low Value vs High Value Questions

What if we wanted to narrow the scope of questions based on their value? As in, what if we wanted to focus our research and study on the higher value questions?

If we could figure out which terms correspond to high-value questions, we just might be able to gain an edge. 

We can try to do just that using a chi-squared test. We first split the questions into two categories: 

- Low value: where `Value` is 800 or less
- High value: where `Value` is more than 800

By looping through each of the terms from `terms_used`, we can then find the words with the biggest differences in usage between high and low value questions. 

The steps we could take are:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.
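
The steps above can be sketched on made-up numbers before applying them to the real data. All counts below are hypothetical and chosen only so that the arithmetic is easy to follow; the chi-squared statistic is computed by hand and then compared with `scipy.stats.chisquare`.

```python
from scipy.stats import chisquare

# Hypothetical dataset totals (not the real counts)
high_value_count = 300
low_value_count = 700
total_rows = high_value_count + low_value_count

# Hypothetical observed counts for a single word
observed_high, observed_low = 12, 8

# Share of all questions that contain the word
total_prop = (observed_high + observed_low) / total_rows

# Expected counts if the word were spread evenly across both groups
expected_high = total_prop * high_value_count  # about 6
expected_low = total_prop * low_value_count    # about 14

# Chi-squared statistic: sum of (observed - expected)^2 / expected
manual = (
    (observed_high - expected_high) ** 2 / expected_high
    + (observed_low - expected_low) ** 2 / expected_low
)

result = chisquare([observed_high, observed_low], [expected_high, expected_low])

print(round(manual, 4), round(result.statistic, 4))
# 8.5714 8.5714
```

With these invented counts the word shows up in high value questions twice as often as expected, which is the kind of difference the test is meant to flag.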


In [21]:
def value_marker(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

In [22]:
jeopardy['high_value'] = jeopardy.apply(value_marker, axis=1)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_questions,clean_answers,clean_value,answer_in_question,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0.0,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,0.0,0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0,0.0,0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this n...,the grand canyon,200,0.0,0.5,0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.0,0.0,0


In [23]:
#function takes in a word and
#first checks whether the word is present in each row of jeopardy['clean_questions']
#then checks whether that row is marked as a high or low value question
#returns the number of high value and low value questions the word occurs in
def value_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_questions'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count +=1
            else:
                low_count +=1
    return high_count, low_count

In [42]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(value_count(term))

observed_expected

[(0, 1),
 (1, 0),
 (3, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1)]

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [43]:
from scipy.stats import chisquare
import numpy as np
#applying the chi-squared test
#count: how many rows where high_value is 0/1
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total/jeopardy.shape[0]
    high_value_exp = total_prop*high_value_count
    low_value_exp = total_prop*low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared                       

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=4.198022975221989, pvalue=0.0404711362009595),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

## Chi-Squared Results
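Only one of the ten sampled terms (the third, with observed counts of 3 and 1) had a p-value below 0.05; the rest showed no significant difference in usage between high value and low value rows. Even that one result shouldn't be trusted, because every frequency was lower than 5, and the chi-squared test isn't reliable with counts that small. It would be better to run this test only on terms that have higher frequencies.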

# Conclusion

## Next Steps
- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    - Manually create a list of words to remove.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (say, 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (available here) instead of the subset we used in this lesson.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
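
As one possible way to speed up the frequency counts and keep only frequent terms, the sketch below counts every term in a single pass over the questions instead of rescanning the dataframe once per word. It is an illustration under stated assumptions, not a tested replacement: the `toy` dataframe and the `min_total` threshold are hypothetical, and `collections.Counter` is swapped in for the `apply`-based idea mentioned above.

```python
from collections import Counter
import pandas as pd

# Tiny stand-in for the jeopardy dataframe; rows are invented
toy = pd.DataFrame({
    "clean_questions": [
        "this president signed the declaration",
        "this famous president lived in virginia",
        "the declaration was signed in philadelphia",
        "this president was born in virginia",
    ],
    "high_value": [1, 0, 1, 1],
})

high_counts = Counter()
low_counts = Counter()

for question, is_high in zip(toy["clean_questions"], toy["high_value"]):
    # A set counts each word once per question, matching `word in split_question`
    words = {w for w in question.split() if len(w) > 5}
    if is_high == 1:
        high_counts.update(words)
    else:
        low_counts.update(words)

# Hypothetical threshold: keep terms seen in at least this many questions
min_total = 3
frequent_terms = {
    term: (high_counts[term], low_counts[term])
    for term in set(high_counts) | set(low_counts)
    if high_counts[term] + low_counts[term] >= min_total
}

print(frequent_terms)
# {'president': (2, 1)}
```

Each `(high, low)` pair has the same shape as the tuples returned by `value_count`, so the chi-squared loop could consume them unchanged.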
