Winning Jeopardy
====================

A DataQuest Guided Project
---------------------

In this project, I work with a dataset of Jeopardy questions to look for patterns in the questions that could help a contestant win.

### Import and Clean Data

In [169]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States",John Adams


In [170]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [171]:
jeopardy.rename(columns=lambda x: x.strip(), inplace=True)
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [172]:
##https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate
import string

def removePun(x):
    translator = str.maketrans("", "", string.punctuation)
    return x.lower().translate(translator)

#def removePun(text):
#    text = text.lower()
#    text = re.sub("[^A-Za-z0-9\s]", "", text)
#    return text


jeopardy["clean_question"] = jeopardy["Question"].apply(removePun)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(removePun)

import re

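def normalize_values(text):
    # Keep only digits and periods, e.g. "$2,000" becomes "2000".
    text = re.sub(r"[^\d.]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Values with no usable number (such as "None") count as 0.
        text = 0
    return text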

jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

jeopardy.head(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus,for the last 8 years of his life galileo was under house arrest for espousing this mans theory,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe,no 2 1912 olympian football star at carlisle indian school 6 mlb seasons with the reds giants braves,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona,the city of yuma in this state has a record average of 4055 hours of sunshine each year,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger",McDonald's,in 1963 live on the art linkletter show this company served its billionth burger,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States",John Adams,signer of the dec of indep framer of the constitution of mass second president of the united states,john adams,200
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect shared billing with a grasshopper",the ant,in the title of an aesop fable this insect shared billing with a grasshopper,the ant,200
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,"Built in 312 B.C. to link Rome & the South of Italy, it's still in use today",the Appian Way,built in 312 bc to link rome the south of italy its still in use today,the appian way,400
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,306 steals for the Bulls",Michael Jordan,no 8 30 steals for the birmingham barons 2306 steals for the bulls,michael jordan,400
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state",Washington,in the winter of 197172 a record 1122 inches of snow fell at rainier paradise ranger station in this state,washington,400
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packaging its merchandise came in & was first displayed on,Crate & Barrel,this housewares store was named for the packaging its merchandise came in was first displayed on,crate barrel,400


In [173]:
def overlap(row):
    # Drop "the" and empty strings, which carry no information.
    split_question = [w for w in row["clean_question"].split(" ") if w not in ("the", "")]
    split_answer = [w for w in row["clean_answer"].split(" ") if w not in ("the", "")]

    # Guard against answers with no remaining words.
    if len(split_answer) == 0:
        return 0

    # Fraction of answer words that also appear in the question.
    match_count = sum(1 for w in split_answer if w in split_question)
    return match_count / len(split_answer)
                
jeopardy["answer_in_question"] = jeopardy.apply(overlap, axis=1)
mean_answer = jeopardy["answer_in_question"].mean()
mean_answer

0.05820696157463009

On average, about 6% of the words in an answer also appear in its question. Some examples are shown below.

In [174]:
jeopardy[["clean_question","clean_answer","answer_in_question"]][jeopardy["answer_in_question"] > 0 ].head(10)

Unnamed: 0,clean_question,clean_answer,answer_in_question
14,on june 28 1994 the natl weather service began issuing this index that rates the intensity of the suns radiation,the uv index,0.5
24,this asian political party was founded in 1885 with indian national as part of its name,the congress party,0.5
31,it can be a place to leave your puppy when you take a trip or a carrier for him that fits under an airplane seat,a kennel,0.5
38,during the 19541955 sun sessions elvis climbed aboard this train sixteen coaches long,the mystery train,0.5
53,in 1961 james brown announced all aboard for this train,night train,0.5
67,small slender missile thrown at a board in a game,a dart,0.5
68,this island in the south pacific is named for the day of its discovery a religious holiday,easter island,0.5
73,it can be a separating line in your hair or a role in a play,a part,0.5
79,a graphic representation of information,a chart,0.5
80,the family history you wrote for school might include entering the us at this island in new york bay,ellis island,0.5
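
To make the ratio concrete, here is a minimal standalone sketch of the same calculation on a single question/answer pair taken from the table above. The function name `answer_overlap` is my own and is not part of the notebook; it assumes the inputs are already lowercased and stripped of punctuation.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring "the") that also occur in the question."""
    question_words = {w for w in clean_question.split(" ") if w not in ("the", "")}
    answer_words = [w for w in clean_answer.split(" ") if w not in ("the", "")]
    if not answer_words:
        # An answer made up only of "the" or blanks has no overlap by convention.
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)


question = ("this island in the south pacific is named for the day of its "
            "discovery a religious holiday")
ratio = answer_overlap(question, "easter island")
print(ratio)  # 0.5: "island" is in the question, "easter" is not
```

Note that short function words such as "a" also count as matches, which is why answers like "a chart" score 0.5.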


Next, split each question into words and keep only the words longer than five characters.
Then, for each question, compute the fraction of those words that already appeared in an earlier question in the dataframe.
Finally, take the average of that fraction over all questions.

In [177]:
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6902117143393427

On average, about 70% of the terms in a question have already appeared in an earlier question. This only covers a small set of questions, and it compares single terms rather than phrases, so the figure is weak evidence on its own. It does suggest that the recycling of questions is worth a closer look.
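
The running-set idea can be sketched on a toy list of questions. The three questions below are made up for illustration and are not from the dataset; `term_recycling` is a hypothetical helper name, and `min_length=6` mirrors the `len(q) > 5` filter used in the notebook.

```python
def term_recycling(clean_questions, min_length=6):
    """For each question, the share of its long terms already seen in earlier questions."""
    seen_terms = set()
    overlaps = []
    for question in clean_questions:
        terms = [w for w in question.split(" ") if len(w) >= min_length]
        # Count repeats before adding this question's own terms to the set.
        repeated = sum(1 for w in terms if w in seen_terms)
        overlaps.append(repeated / len(terms) if terms else 0)
        seen_terms.update(terms)
    return overlaps


# Hypothetical, already-cleaned questions.
toy_questions = [
    "this french painter founded impressionism",
    "this famous painter lived in giverny",
    "he wrote hamlet",
]
overlaps = term_recycling(toy_questions)
mean_overlap = sum(overlaps) / len(overlaps)
print(overlaps, mean_overlap)
```

Because the first question can never overlap with anything, the mean depends on the order of the rows, and it tends to rise as the set of seen terms grows.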

The next cell computes the average rate of overlap between questions in a random sample of 1,000 questions.

In [162]:
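import numpy as np

# Assumption: the "questions_sets" column is built in a cell not shown here.
# One plausible definition is the set of words in each cleaned question.
jeopardy["questions_sets"] = jeopardy["clean_question"].apply(lambda q: set(q.split()))

jsample = jeopardy.sample(n=1000)

def overlap_q(row):
    # Mean share of this question's words found in each other sampled question.
    listofintersects = []
    q_set = row["questions_sets"]
    q_set_len = len(q_set)
    for i, r in jsample.iterrows():
        if i == row.name or q_set_len == 0:
            # Self-comparisons and empty questions are recorded as zero overlap.
            listofintersects.append(0)
        else:
            new_set_len = len(q_set.intersection(r["questions_sets"]))
            listofintersects.append(new_set_len / q_set_len)

    return np.mean(listofintersects)

jsample["question_overlap2"] = jsample.apply(overlap_q, axis=1)
mean_answer2 = jsample["question_overlap2"].mean()
mean_answer2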
        

0.001634513571151063

## Chi-Square Testing
Below we mark a question as high value if its value exceeds $800. We then use a chi-square test to evaluate whether the terms used in high-value questions differ significantly from those used in low-value questions.

In [178]:
def highorlow(row):
    if row["clean_value"] > 800:
        value = 1
    else: 
        value = 0
    return value

jeopardy["high_value"] = jeopardy.apply(highorlow, axis=1)


In [179]:
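def highvaloccurence(word):
    # Count how many high-value and low-value questions contain the word.
    lowcount = 0
    highcount = 0

    for i, r in jeopardy.iterrows():
        split_question = r["clean_question"].split(" ")
        # Match whole words rather than substrings of the question text.
        if word in split_question:
            if r["high_value"] == 1:
                highcount += 1
            else:
                lowcount += 1
    return highcount, lowcount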
            


Take the first 5 terms from the set built earlier, and count the occurrences of each term in high-value and low-value questions.

In [186]:
terms_used_5 = list(terms_used)[:5]
print(terms_used_5)
observed_expected = []
for word in terms_used_5:
    # Each entry is a (high_count, low_count) tuple of observed counts.
    observed_expected.append(highvaloccurence(word))
    
observed_expected

['perseus', 'protogermanic', 'memphisbased', 'costome', 'deepfried']


[(2, 1), (1, 0), (1, 0), (1, 0), (3, 3)]

In [187]:
high_value_count = jeopardy[jeopardy["high_value"]==1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"]==0].shape[0]
chi_squared = []

Use chi-square testing to compare the observed vs expected counts.

In [191]:
from scipy.stats import chisquare
import numpy as np

for x, y in observed_expected:
    total = x+y
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    observed = np.array([x, y])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared


[Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=1.3346324449838385, pvalue=0.24798277007881886)]

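All p-values are above 0.05, so none of the results is statistically significant: for these five terms, there is no evidence of a difference in usage between high-value and low-value questions. The observed counts are also very small (at most 6 occurrences per term), which makes the chi-square test unreliable here, so terms with more occurrences would be better candidates.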
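
To show where these numbers come from, here is a minimal sketch that computes the same statistic by hand using only the standard library. The function name is hypothetical, and the group sizes (5,734 high-value and 14,265 low-value questions) are assumptions: the notebook never prints them, and I picked them because they are consistent with the statistics shown above. The p-value uses the fact that, with one degree of freedom, the chi-square survival function equals `erfc(sqrt(x / 2))`.

```python
import math


def chi_square_two_groups(high_observed, low_observed, high_total, low_total):
    """Chi-square statistic and p-value for one term split across two groups."""
    total_rows = high_total + low_total
    # Share of all questions that contain the term.
    term_share = (high_observed + low_observed) / total_rows
    # Expected counts if the term were spread evenly across both groups.
    expected = [term_share * high_total, term_share * low_total]
    observed = [high_observed, low_observed]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Two categories give one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Assumed group sizes; not shown in the notebook output.
HIGH_TOTAL = 5734
LOW_TOTAL = 14265

# A term seen once, in a high-value question: the (1, 0) case above.
statistic, p_value = chi_square_two_groups(1, 0, HIGH_TOTAL, LOW_TOTAL)
print(statistic, p_value)
```

Under these assumed totals, the expected counts for a term seen once are roughly 0.29 and 0.71, far below the usual rule of thumb of at least 5 per cell.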