# JEOPARDY - EXPLORING DATA FOR THE EDGE TO WIN
## INTRODUCTION
This project focuses on the popular quiz show Jeopardy. The aim is to explore the data and identify trends or insights that would help us prepare more strategically to win the game. For this purpose, we use sample data covering the first 20000 questions of aired episodes, taken from the [full data set](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). We would like to answer the following questions:
* How often is the answer deducible from the question?
* How often are new questions repeats of older questions?

Let's import the data and get familiar with it through some general EDA.

In [1]:
import numpy as np
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell  # Configure Jupyter to display...
InteractiveShell.ast_node_interactivity = "all"             # multiple outputs at the same time.

jeopardy = pd.read_csv('Downloads/jeopardy.csv')
jeopardy.head(5)
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Notice that most of the column labels start with a leading space. We strip it so that the columns are easier to reference during further analysis.

In [2]:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns
jeopardy.head()

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


## Normalise Columns
For the purpose of our inquiry, we will work mainly with the Question, Answer and Value columns, so it is useful to normalise them before the analysis.

In [3]:
import re 

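# Normalise text data: drop punctuation and convert to lowercase
def norm_text(string):
    string = re.sub(r"[^A-Za-z0-9\s]", '', string)
    string = string.lower()
    return string

# Normalise numeric data: strip symbols such as "$" and ",", then convert to int
def norm_numeric(value_string):
    numeric = re.sub(r"[^A-Za-z0-9\s]", "", value_string)
    try:
        numeric = int(numeric)
    except ValueError:  # values that are not numbers are treated as 0
        numeric = 0
    return numeric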

# Apply normalisation
jeopardy['clean_question'] = jeopardy['Question'].apply(norm_text)
jeopardy['clean_answer']  = jeopardy['Answer'].apply(norm_text)    
jeopardy['clean_value'] = jeopardy['Value'].apply(norm_numeric)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

jeopardy.dtypes
jeopardy['clean_value'].head()
jeopardy['clean_question'].head()
jeopardy['clean_answer'].head()

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object

## Analyse Data
With the relevant data normalised, we can now investigate our questions against the historical data.
### How often is the answer deducible from the question?

In [4]:
def find_answer_in_question(input_row):
    '''
    Return the fraction of answer words that also appear in the question.
    '''
    split_answer = input_row['clean_answer'].split(' ')
    split_question = input_row['clean_question'].split(' ')
    match_count = 0
    if 'the' in split_answer:                    # 'the' is present in most answers but not relevant as an answer
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for each_word in split_answer:
        if each_word in split_question:
            match_count += 1
    result = match_count/len(split_answer)
    return result

jeopardy['answer_in_question'] = jeopardy.apply(find_answer_in_question, axis = 1)

jeopardy.head()

jeopardy['answer_in_question'].mean()


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0


0.060493257069335914

The goal was to see how often the answer appears in the question. The mean of the 'answer_in_question' column is about 0.06, meaning that on average only about 6% of the words in an answer also appear in its question, at least in our limited sample. Counting on the answer being hidden in the question, instead of studying, is therefore not an advisable strategy for winning.
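To make the score concrete, here is a minimal, self-contained sketch of the same word-matching logic applied to two made-up clue/answer pairs. The function `answer_overlap` and the sample strings are hypothetical and only mirror `find_answer_in_question`; they are not taken from the dataset. One small difference is assumed: the sketch drops every occurrence of 'the' from the answer rather than only the first.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical, already-normalised examples (not rows from the dataset)
full_match = answer_overlap('the sahara hotel is named for this desert', 'the sahara')
partial_match = answer_overlap('john adams wrote many letters to this woman', 'abigail adams')
print(full_match, partial_match)
```

A score of 1 means every answer word appears in the question, while a score such as 0.5 means only half of them do, so the column mean is an average share of matching words rather than a share of fully guessable questions.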

### How often are new questions repeats of older questions?

In [5]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values(by = 'Air Date')

for idx, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [each for each in split_question if len(each) > 5] # to filter out words like the and than, which are 
                                                                        # commonly used, but don't tell you a lot about a question.
    match_count = 0                                               
    for each_word in split_question:
        if each_word in terms_used:
            match_count += 1
        terms_used.add(each_word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
        
jeopardy['question_overlap'].mean()

0.6894031359073217

There is about a 70% overlap between terms in new questions and terms in older questions, counting only terms of six or more characters. This result has limits: it covers only a small set of questions, it compares single terms rather than phrases, and it does not check whether overlapping questions also share the same answers. The overlap is therefore not conclusive on its own, but it does suggest that the recycling of questions is worth a closer look.
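One way to address the single-term limitation would be to compare two-word phrases (bigrams) instead. The sketch below is an assumption about how that could look, not something that was run on the dataset; `bigrams`, `phrase_overlap` and the sample questions are hypothetical names and data.

```python
def bigrams(text):
    """Return the set of adjacent word pairs in a normalised string."""
    words = text.split()
    return set(zip(words, words[1:]))

def phrase_overlap(questions):
    """For each question, in order, return the share of its bigrams seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        pairs = bigrams(question)
        if pairs:
            overlaps.append(len(pairs & seen) / len(pairs))
        else:
            overlaps.append(0)
        seen |= pairs
    return overlaps

# Hypothetical, already-normalised questions in air-date order
sample_questions = [
    'this state is home to the grand canyon',
    'the grand canyon was carved by this river',
]
overlaps = phrase_overlap(sample_questions)
print(overlaps)
```

On the real data, the input would be the `clean_question` column after sorting by `Air Date`, so that "earlier" keeps the same meaning as in the term-based loop.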

Let us say that we want to study strategically for higher value questions, to optimise earnings on the show. One reasonable approach is to identify terms that occur far more often in high value questions than in low value ones.

In [6]:
# To categorise the value of questions
def high_vs_low(input_data):
    if input_data['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

# Apply value categorisation
jeopardy['high_value'] = jeopardy.apply(high_vs_low, axis = 1)

# To count the frequency of a term in high vs low value questions
def observed_count(word_string):
    high_count = 0
    low_count = 0
    for idx, data in jeopardy.iterrows():
        split_question = data['clean_question'].split(' ')
        if word_string in split_question:
            if data['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return [high_count, low_count]

observed_expected = []
comparison_terms = list(terms_used)[:5]

# Count frequency of first few terms in high vs low value questions
for each in comparison_terms:
    observed_expected.append(observed_count(each))
    
comparison_terms
observed_expected

['hundredone', 'elixir', 'lorado', 'barcelona', 'bronsons']

[[1, 0], [0, 2], [0, 1], [1, 3], [0, 1]]

The frequencies of these terms in both high and low value questions are low, at least for these first few terms.

Let us proceed with a chi-squared test to see whether some terms are more prevalent in high value questions.
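
For reference, the test compares the observed counts of a term with the counts expected if the term were spread across high and low value questions in proportion to their share of the dataset. The symbols are introduced here only for illustration: $O_h$ and $O_l$ are the observed counts of the term in high and low value questions, $N_h$ and $N_l$ are the numbers of high and low value questions, and $N = N_h + N_l$.

$$
E_h = \frac{O_h + O_l}{N}\, N_h, \qquad
E_l = \frac{O_h + O_l}{N}\, N_l, \qquad
\chi^2 = \frac{(O_h - E_h)^2}{E_h} + \frac{(O_l - E_l)^2}{E_l}
$$

With two categories the statistic has one degree of freedom, and a large value (small p-value) indicates that the term is split between the two groups differently from what their sizes alone would suggest.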

In [7]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

### Chi-squared results
As mentioned above, the observed frequencies were mostly lower than 5, so the chi-squared test is not reliable here, and none of the p-values falls below 0.05 in any case. It would be better to run this test only on terms that have higher frequencies.
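
A possible way to do this is to count every term in a single pass and keep only terms whose expected counts in both groups reach a threshold. The sketch below is a minimal illustration on made-up rows; `term_counts`, `chi_squared_1dof`, `min_expected` and the sample data are hypothetical. The statistic is computed by hand, with the p-value for one degree of freedom obtained through `math.erfc`, instead of calling `scipy.stats.chisquare`, purely so that the block stays self-contained.

```python
import math
from collections import Counter

def term_counts(rows):
    """Count, per term, how many high and low value questions contain it.

    rows: iterable of (clean_question, high_value) pairs, high_value being 1 or 0.
    """
    high, low = Counter(), Counter()
    for question, high_value in rows:
        target = high if high_value == 1 else low
        target.update(set(question.split()))  # count each term once per question
    return high, low

def chi_squared_1dof(observed, expected):
    """Chi-squared statistic and p-value for two categories (one degree of freedom)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical sample: 10 high value and 10 low value questions
rows = (
    [('famous opera composer', 1)] * 8
    + [('capital city', 1)] * 2
    + [('famous capital city', 0)] * 10
)
high, low = term_counts(rows)
n_high = sum(1 for _, h in rows if h == 1)
n_low = len(rows) - n_high

min_expected = 5  # assumed threshold, a common rule of thumb for this test
results = {}
for term in set(high) | set(low):
    total = high[term] + low[term]
    expected = [total * n_high / len(rows), total * n_low / len(rows)]
    if min(expected) >= min_expected:
        results[term] = chi_squared_1dof([high[term], low[term]], expected)
print(results)
```

Because the counts are gathered in one pass over the rows, this avoids rescanning the whole table for every term, which is what makes the per-term loop in the notebook slow.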

### Next Steps
* Find a better way to eliminate non-informative words than just removing words that are fewer than 6 characters long. Some ideas:
  * Manually create a list of words to remove, like the, than, etc.
  * Find a list of stopwords to remove.
  * Remove words that occur in more than a certain percentage (like 5%) of questions.
* Perform the chi-squared test across more terms to see which terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
  * Use the apply method to make the code that calculates frequencies more efficient.
  * Only select terms that have high frequencies across the dataset, and ignore the others.
* Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
  * See which categories appear the most often.
  * Find the probability of each category appearing in each round.
* Use the whole Jeopardy dataset (linked in the introduction) instead of the subset we used in this project.
* Use phrases instead of single words when checking for overlap between questions. Single words don't capture the whole context of a question well.
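
As an illustration of the first idea, common words could be dropped based on how many questions they appear in, rather than on their length. The following is a rough sketch on made-up questions; `frequent_terms`, `max_share` and the sample data are hypothetical, and the threshold is raised well above the 5% mentioned in the list only because the sample is tiny.

```python
from collections import Counter

def frequent_terms(questions, max_share=0.05):
    """Return terms that occur in more than max_share of the questions."""
    doc_freq = Counter()
    for question in questions:
        doc_freq.update(set(question.split()))  # one count per question
    limit = max_share * len(questions)
    return {term for term, count in doc_freq.items() if count > limit}

# Hypothetical, already-normalised questions
sample_questions = [
    'the city of yuma in this state',
    'this composer wrote the opera aida',
    'the largest planet in this system',
    'this river flows through cairo',
]
too_common = frequent_terms(sample_questions, max_share=0.5)
filtered = [[w for w in q.split() if w not in too_common] for q in sample_questions]
print(too_common)
print(filtered[0])
```

Unlike the length-based filter, this keeps short but informative words and removes long but ubiquitous ones, at the cost of one extra pass over the questions.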