Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

In [3]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
# Let's see what the dataset looks like.
jeopardy.head()
jeopardy.columns
# Some column names have leading spaces; let's remove them.
jeopardy.columns = jeopardy.columns.str.strip()

Before we move forward, we'd like to normalize the text columns, meaning we'd like to lowercase the words and remove punctuation. For this purpose, we will use regular expressions. You can test your regex [here](https://regex101.com/#javascript).

In [4]:
import re

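def normalize_text(text):
    # Lowercase, then drop everything except letters, digits and whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_values(text):
    # Strip punctuation before parsing the value as an integer.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Values that are not numeric become 0.
        text = 0
    return text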

jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

# Also, let's convert the date column to actual datetime values.
jeopardy["Air Date"]=pd.to_datetime(jeopardy["Air Date"])
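
As a quick sanity check of the normalization, here is a minimal sketch that runs the two helpers on made-up strings. The helpers are repeated so the snippet runs on its own, and the sample inputs are assumptions for illustration, not rows from jeopardy.csv.

```python
import re

def normalize_text(text):
    # Lowercase, then drop everything except letters, digits and whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", "", text.lower())

def normalize_values(text):
    # Strip punctuation, then parse; anything non-numeric becomes 0.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

# Made-up inputs, only to show the behaviour.
cleaned = normalize_text("Hello, World! It's 1999.")
parsed = normalize_values("$2,000")
fallback = normalize_values("None")

print(cleaned)   # hello world its 1999
print(parsed)    # 2000
print(fallback)  # 0
```

Note that removing the apostrophe turns "It's" into "its", so contractions collapse into a single token.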

#### How often is the answer deducible from the question?

In [5]:
def match_count(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    matches = 0
    if "the" in split_answer:  # "the" is very common; drop its first occurrence
        split_answer.remove("the")
    if len(split_answer) == 0:  # nothing left to match, so return 0
        return 0
    for item in split_answer:
        if item in split_question:
            matches += 1
    # Fraction of the answer's words that also appear in the question.
    return matches / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(match_count, axis=1)
meanpercent = jeopardy["answer_in_question"].mean() * 100
print("answer is in question for {0:.2f} percent of times".format(meanpercent))

answer is in question for 6.05 percent of times


On average, only about 6% of the words in an answer also appear in its question. So the chance of just showing up and hearing the answer inside the question is slim, and we will probably have to study in order to win.
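
To make the score concrete, here is a minimal sketch of the same calculation on two made-up question/answer pairs. The function name `answer_overlap` and the sample strings are assumptions for illustration; the logic mirrors `match_count` but takes plain strings instead of a DataFrame row.

```python
def answer_overlap(clean_question, clean_answer):
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    # Drop the first "the" from the answer, as in the notebook.
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    hits = sum(1 for word in answer_words if word in question_words)
    # Fraction of answer words that also occur in the question.
    return hits / len(answer_words)

# Made-up examples: one partial match, one with no match at all.
partial = answer_overlap("this river runs through the grand canyon", "the colorado river")
none_found = answer_overlap("who wrote hamlet", "shakespeare")

print(partial)     # 0.5
print(none_found)  # 0.0
```

A score of 0.5 means half of the answer's words were in the question, which is still not enough to guess the answer, so the 6% average is, if anything, generous.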

#### How repetitive are the words in questions?

In [30]:
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ") # split the question on spaces
    split_question = [q for q in split_question if len(q) > 5] # only keep words that have at least 6 characters
    match_count = 0
    for word in split_question: # iterate over every word in the question
        if word in terms_used: # check whether the word was seen in an earlier question
            match_count += 1
    for word in split_question: # add each word of this question to the set of used terms
        terms_used.add(word)
    if len(split_question) > 0: # convert the count into a fraction of the question's words
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.69087373156719623

On average, about 69% of the long words (six or more characters) in a question have already appeared in an earlier question, so it is useful to look at previous questions. Please note that we only compared individual words, not whole questions.
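
The overlap measure depends on the order of the rows, which is easy to see on a toy input. Below is a minimal sketch using three made-up questions; `overlap_fractions` and `min_length` are hypothetical names, and the logic follows the loop above.

```python
def overlap_fractions(clean_questions, min_length=6):
    seen = set()
    fractions = []
    for question in clean_questions:
        # Keep only words with at least `min_length` characters.
        words = [w for w in question.split(" ") if len(w) >= min_length]
        if words:
            fractions.append(sum(w in seen for w in words) / len(words))
        else:
            fractions.append(0)
        # Words become "used" only after the question has been scored.
        seen.update(words)
    return fractions

# Made-up questions, already normalized.
toy_questions = [
    "this planet is closest to the sun",
    "the largest planet in our solar system",
    "a dwarf planet beyond neptune",
]
fractions = overlap_fractions(toy_questions)
print(fractions)
```

The first question always scores 0 because nothing has been seen yet, and each later question is compared only with the rows before it. The notebook uses the file order as is; if that order does not follow the air date, sorting by `Air Date` first would be one way to make "earlier" mean earlier in time.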

Now suppose we only want to study questions that are worth more. To do so, we first need to define high-value and low-value questions:
* low value: \$800 or less
* high value: more than \$800

In [31]:
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)
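
The same flag can also be built without a row-wise `apply`. This is a minimal sketch on a made-up three-row frame (the `toy` DataFrame is an assumption for illustration); it shows that a value of exactly 800 lands in the low-value group.

```python
import pandas as pd

# Made-up values: below, exactly at, and above the 800 threshold.
toy = pd.DataFrame({"clean_value": [200, 800, 1000]})

# Vectorized comparison: True/False per row, then cast to 1/0.
toy["high_value"] = (toy["clean_value"] > 800).astype(int)

print(toy["high_value"].tolist())  # [0, 0, 1]
```

On the full dataset the equivalent line would be `jeopardy["high_value"] = (jeopardy["clean_value"] > 800).astype(int)`.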

In [34]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# terms_used is a set, so this picks five arbitrary terms
comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 2), (0, 1), (1, 0), (0, 6), (1, 0)]
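
`count_usage` scans every row once per term, which gets slow if we want to compare many terms. One possible alternative, sketched below under the assumption that we have `(clean_question, high_value)` pairs, tallies every term in a single pass. `count_all_terms` and the toy rows are hypothetical and only illustrate the idea.

```python
from collections import Counter

def count_all_terms(rows):
    """rows: iterable of (clean_question, high_value) pairs."""
    high_counts, low_counts = Counter(), Counter()
    for question, high_value in rows:
        target = high_counts if high_value == 1 else low_counts
        # set(...) counts each term once per question, like `term in ...split(" ")`.
        target.update(set(question.split(" ")))
    return high_counts, low_counts

# Made-up rows, already normalized.
toy_rows = [
    ("famous painter of sunflowers", 1),
    ("famous river in egypt", 0),
    ("famous famous composer", 0),
]
high_counts, low_counts = count_all_terms(toy_rows)
famous = (high_counts["famous"], low_counts["famous"])
print(famous)  # (1, 2)
```

With the notebook's DataFrame, the rows could come from `zip(jeopardy["clean_question"], jeopardy["high_value"])`, and `(high_counts[term], low_counts[term])` would then match what `count_usage(term)` returns.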

In [39]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

print(low_value_count)

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

14265


[Power_divergenceResult(statistic=0.80392569225376798, pvalue=0.36992223780795708),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.4117770767613038, pvalue=0.12042559006950899),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047)]
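
None of the five p-values is below 0.05, so for these terms we cannot claim a difference in usage between high-value and low-value questions. That is not surprising: each term appears only one to six times, and the chi-squared approximation is unreliable when the expected counts are that small (a common rule of thumb asks for at least 5 per cell).

To see where the numbers come from, here is a minimal sketch of the same calculation done by hand. `chi_square_two_cells` and the toy counts are made up for illustration. With two cells there is 1 degree of freedom, and in that case the p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math

def chi_square_two_cells(obs_high, obs_low, n_high, n_low):
    # Share of all rows that contain the term (assumes the term occurs at least once).
    share = (obs_high + obs_low) / (n_high + n_low)
    # Expected counts if the term were spread evenly over both groups.
    exp_high = share * n_high
    exp_low = share * n_low
    stat = ((obs_high - exp_high) ** 2 / exp_high
            + (obs_low - exp_low) ** 2 / exp_low)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Toy counts: 25 high-value rows, 75 low-value rows, term seen 5 times in each.
stat, p_value = chi_square_two_cells(obs_high=5, obs_low=5, n_high=25, n_low=75)
print(stat, p_value)
```

In this toy case the expected counts are 2.5 and 7.5, so the statistic is 2.5²/2.5 + 2.5²/7.5 = 10/3.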