<h1> Winning Jeopardy </h1>

In this project, I will look at data that redditors put together with 200k+ Jeopardy questions.  I will use the data and practice chi-squared analysis to analyze past questions from Jeopardy.  Hopefully this knowledge will point me in the right direction to win in the future!

[Link to the original dataset](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/)

In [20]:
import pandas
import csv

jeopardy = pandas.read_csv("jeopardy.csv")

jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel


<h2> Basic data cleaning </h2>
Some of the column names have spaces before them- I will fix this and assign these back to the columns

In [21]:
import re

def remove_space(a_string):
    return re.sub(' +', '', a_string)

new_cols= [remove_space(col_name) for col_name in jeopardy.columns]
jeopardy.columns= new_cols
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

<h2> Normalizing Columns </h2>
In order to compare words, the text needs to be normalized so that questions that have different punctuation and capitalization are treated the same.

In [22]:
def answer_normal(a_string):
    a_string = re.sub("[^A-Za-z0-9\s]", "", a_string)
    a_string = re.sub("\s+", " ", a_string)
    return a_string.lower()

jeopardy["clean_answer"]= jeopardy["Answer"].apply(answer_normal)
jeopardy["clean_question"]= jeopardy["Question"].apply(answer_normal)
jeopardy["clean_answer"].head(5)

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object

Right now the dollar values in the dataset are strings.  I will normalize them and convert them to integers so that I can analyze the data.

In [23]:
jeopardy.dtypes
def dollar_normal(a_string):
    try:
        a_string = re.sub("[^A-Za-z0-9\s]", "", a_string)
        a_string = re.sub("\s+", " ", a_string)
        return int(a_string)
    
    except Exception:
        return 0
    
jeopardy["clean_value"]= jeopardy["Value"].apply(dollar_normal)
jeopardy["clean_value"].head(5)

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64

Likewise, the "air_date" column also contains strings.  This will make it difficult to have a time series or compare quesions and answers by different dates.  This will be changed to a datetime object so that these dates can be easily compared.

In [24]:
import datetime as dt

def normal_date(a_string):
    return dt.datetime.strptime(a_string, "%Y-%m-%d")

jeopardy["air_date"]= jeopardy["AirDate"].apply(normal_date)
jeopardy["air_date"].head(5)

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: air_date, dtype: datetime64[ns]

<h2> Analyzing questions and answers </h2>

I will analyze word occurance in questions and answers.  First I need to do some data prep (which I will do inside of the function) such as:
- Splitting the clean_answer and clean_question
- Remove words like "the" from the questions and answers- likely won't be meaningful for analysis
- If there was no answer, return 0 instead of an empty list

In [25]:
def row_analysis(row) -> float:
    """
    Returns the number of words that match in the question and answer/
    the total number of words in the answer.
    """
    match_count=0
    split_answer= row["clean_answer"].replace("the", "").split()
    if len(split_answer)==0:
        return 0.0
    split_question= row["clean_question"].split()
    
    #calculating common words
    for word in split_answer:
        if word in split_question:
            match_count+=1
            
    #dividing by total num words
    return match_count/len(split_answer)

In [26]:
jeopardy["answer_in_question"]= jeopardy.apply(row_analysis, axis=1)
#mean of answer_in_question
jeopardy["answer_in_question"].mean()

0.057955758538287654

This mean indicates that only 5.7% of the words of the questions are found in the answer.  I didn't analyze what the words were that were influencing this (could be words like "and" etc).  Because this number is so low, I don't think that it would be a good strategy to analyze common words for studying for Jeopardy.

<h2> Repeat Questions </h2>
It would be helpful to know which themes and words repeat for questions- this will surely help my studying! 

In [27]:
jeopardy= jeopardy.sort_values(by= "air_date")
jeopardy.head(5)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_answer,clean_question,clean_value,air_date,answer_in_question
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,theodore roosevelt,adventurous 26th president he was 1st to ride ...,0,1984-09-21,0.0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,jimmy hoffa,notorious labor leader missing since 75,200,1984-09-21,0.0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,thanksgiving,washington proclaimed nov 26 1789 this first n...,200,1984-09-21,0.0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,the grand canyon,both ferde grofe the colorado river dug this n...,200,1984-09-21,0.0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,tom,depending on the book he could be a jones a sa...,200,1984-09-21,0.0


In [31]:
question_overlap=[]
terms_used= set()


def question_overlap_change(row):
    split_question = row["clean_question"].split(" ")
    more_than_six_letters= [word for word in split_question if len(word)>=6]
    match_count=0.0
    for word in more_than_six_letters:
        if word in terms_used:
            match_count+=1
        terms_used.add(word)
        
    if len(more_than_six_letters) >0:
        match_count= match_count/len(more_than_six_letters)
    question_overlap.append(match_count)
#     row["question_overlap"]=match_count

for index, row in jeopardy.iterrows():
    question_overlap_change(row)
    
jeopardy["question_overlap"]= question_overlap
jeopardy["question_overlap"].mean()


0.6894031359073245

68% of words in older questions are in newer questions.  This may be a good way to study- figure out which questions are repeated and study these themes.