# Winning Jeopardy

__Context:__Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Official website: https://www.jeopardy.com/ More infos: https://en.wikipedia.org/wiki/Jeopardy!


__Dataset:__ "jeopardy.csv", it contains 20000 rows from the beginning of a full dataset of Jeopardy questions

__Goal:__ Let's say I want to compete on Jeopardy, and I'm looking for any edge I can get to win. In this project. Figure out some patterns in the questions that could help you me.

In [128]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### First exploration of the dataset

In [129]:
jeopardy = pd.read_csv('jeopardy.csv')

In [130]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [131]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Each row in the dataset represents a single question on a single episode of Jeopardy.

- Show Number: the Jeopardy episode number of the show this question was in.
- Air Date: the date the episode aired.
- Round: the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category: the category of the question.
- Value: the number of dollars answering the question correctly is worth.
- Question: the text of the question.
- Answer: the text of the answer

In [132]:
# Remove the spaces in each item in jeopardy.columns.
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [133]:
jeopardy.shape

(19999, 7)

In [134]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


### Normalizing text

Before starting the analysis on the Jeopardy questions, one need to normalize all of the text columns. The idea is to ensure that words are lowercased with punctuation removed.

In [135]:
# import regular expression to manipulate strings
import re

In [136]:
def normalize_text(text):
    # take in a string & return it normalized
    text = text.lower()
    text = re.sub("[^A-Za-z0-9\s]", "", text)  # remove all punctuation in the string
    return text

In [137]:
# Use the Pandas Series.apply method to apply the function to each item
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)

In [138]:
# check result
jeopardy[45:50]

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
45,4680,2004-12-31,Double Jeopardy!,DR. SEUSS AT THE MULTIPLEX,$1600,"<a href=""http://www.j-archive.com/media/2004-1...",Mulberry Street,a hrefhttpwwwjarchivecommedia20041231dj25mp3so...,mulberry street
46,4680,2004-12-31,Double Jeopardy!,AIRLINE TRAVEL,$1600,In 2004 United launched this new service that ...,Ted,in 2004 united launched this new service that ...,ted
47,4680,2004-12-31,Double Jeopardy!,THAT OLD-TIME RELIGION,$1600,With Mary I's accession in 1553 he ran to Gene...,(John) Knox,with mary is accession in 1553 he ran to genev...,john knox
48,4680,2004-12-31,Double Jeopardy!,MUSICAL TRAINS,$1600,"This band's ""Train In Vain"" was a hidden track...",The Clash,this bands train in vain was a hidden track on...,the clash
49,4680,2004-12-31,Double Jeopardy!,"""X""s & ""O""s",$1600,Cross-country skiing is sometimes referred to ...,XC,crosscountry skiing is sometimes referred to b...,xc


### Normalizing other columns

- The Value column should also be numeric (see dollar sign at the beginning)
- The Air Date column should also be a datetime, not a string.

In [139]:
def normalize_values(text):
    # same thing, but return an int after normalization
    text = re.sub("[^A-Za-z0-9\s]", "", text) 
    text = text[1:]   # remove 1st char with $
    text.replace("$", "")
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

In [140]:
# remove dollar sign with pandas within the string , \ to escape $
# jeopardy["clean_value"] = jeopardy["Value"].replace({'\$':''}, regex = True)

# then normalize values
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [141]:
# Use the pandas.to_datetime function to convert the Air Date column to a datetime column
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [142]:
jeopardy.tail()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
19994,3582,2000-03-14,Jeopardy!,U.S. GEOGRAPHY,$200,"Of 8, 12 or 18, the number of U.S. states that...",18,of 8 12 or 18 the number of us states that tou...,18,0
19995,3582,2000-03-14,Jeopardy!,POP MUSIC PAIRINGS,$200,...& the New Power Generation,Prince,the new power generation,prince,0
19996,3582,2000-03-14,Jeopardy!,HISTORIC PEOPLE,$200,In 1589 he was appointed professor of mathemat...,Galileo,in 1589 he was appointed professor of mathemat...,galileo,0
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky,before the grand jury she said im really sorry...,monica lewinsky,0
19998,3582,2000-03-14,Jeopardy!,LLAMA-RAMA,$200,Llamas are the heftiest South American members...,Camels,llamas are the heftiest south american members...,camels,0


In [143]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

### Answers in questions

- 1st question: How often the answer is deducible from the question? 
- 2nd question: How often new questions are repeats of older questions?

##### 1st question: how many times words in the answer also occur in the question.

In [144]:
def count_matches(row):
    # Split the clean_answer/question columns on the space character ()
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    
    # remove "the"
    if "the" in split_answer:
        split_answer.remove("the")
    
    # return 0 if null string
    if len(split_answer) == 0:
        return 0
    
    # count nb of matches
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
            
    return match_count / len(split_answer)

In [145]:
# Count how many times terms in clean_answer occur in clean_question
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
# axis=1 argument to apply the function across each row

# Find the mean of the answer_in_question
jeopardy["answer_in_question"].mean()

0.06049325706933587

The answer only appears in the question about 6% of the time.

##### 2nd question:  How often complex words (> 6 characters) reoccurs
We can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least

In [146]:
question_overlap = []
terms_used = set()

# Use the iterrows Dataframe method to loop through each row of jeopardy.
for i, row in jeopardy.iterrows():
    
        # Split clean_question into words
        split_question = row["clean_question"].split(" ")
        
        # remove any word shorter than 6 characters
        split_question = [q for q in split_question if len(q) > 5]
        
        match_count = 0
        
        # and check if each word occurs in terms_used
        for word in split_question:
            if word in terms_used:
                match_count += 1
        for word in split_question:
            terms_used.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
            
        question_overlap.append(match_count)

In [147]:
jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.6908737315671962

There is about 70% overlap between terms in new questions and terms in old questions. The research has been maded on words (a single term) not on phrases. There could be answers to different questions with the same words but with a different meaning...

### Low value vs high value questions

Choosing high value questions will help us earn more money :)

Figure out which terms correspond to high-value questions using a chi-squared test. First, narrow down the questions into two categories:
- Low value: Any row where Value is less than 800.
- High value: Any row where Value is greater than 800

In [148]:
# function that takes in a row from a Dataframe, and determine its value
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

In [149]:
# Determine which questions are high and low value
jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [150]:
# function that takes in a word, and return its usage in low/high value questions
def count_usage(term):
    low_count, high_count = 0, 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [151]:
# Convert terms_used into a list using the list function, and assign the first 5 element
comparison_terms = list(terms_used)[:5]
observed_expected = []

# Loop through each term in comparison_terms
for term in comparison_terms:
    observed_expected.append(count_usage(term))

# print the results :)
observed_expected

[(0, 1), (0, 5), (0, 3), (0, 1), (0, 1)]

In [152]:
from scipy.stats import chisquare

# Find the number of rows in jeopardy where high_value is 1
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]

# same thing with low value
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []

# Loop through each list in observed_expected
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    
    # Use the scipy.stats.chisquare function to compute the chi-squared value and 
    # p-value given the expected and observed counts
    chi_squared.append(chisquare(observed, expected))

In [153]:
chi_squared

[Power_divergenceResult(statistic=0.00045022511255627816, pvalue=0.9830713497776855),
 Power_divergenceResult(statistic=0.002251125562781391, pvalue=0.9621577453499471),
 Power_divergenceResult(statistic=0.0013506753376688345, pvalue=0.9706831172455431),
 Power_divergenceResult(statistic=0.00045022511255627816, pvalue=0.9830713497776855),
 Power_divergenceResult(statistic=0.00045022511255627816, pvalue=0.9830713497776855)]

### Results

Not finished, results seems to be weird...

### Further potential investigations:

- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    - Manually create a list of words to remove, like the, than, etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (available here) instead of the subset we used in this mission.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.