In this project, we'll work with a [dataset](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/) of Jeopardy questions to look for patterns in the questions that could help a contestant win.

The file is named `jeopardy.csv` and contains the first 20,000 rows of the full dataset of Jeopardy questions.

## Importing libraries and reading the data

In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# Remove spaces from the column names (e.g. ' Air Date' -> 'AirDate')
jeopardy.columns = [column.replace(' ', '') for column in jeopardy.columns]

In [4]:
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Normalizing text 

In [5]:
import string

def normalize_text(s):
    """Lowercase a string and remove all ASCII punctuation."""
    s = s.lower()
    for punctuation in string.punctuation:
        s = s.replace(punctuation, '')
    return s

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)

In [7]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

## Normalizing columns 

In [8]:
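def normalize_dollar(s):
    """Strip punctuation such as '$' and ',' and convert the result to an int.

    Values that cannot be parsed as a number become 0.
    """
    for punctuation in string.punctuation:
        s = s.replace(punctuation, '')
    try:
        i = int(s)
    except ValueError:
        i = 0
    return i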

In [9]:
# Normalize the Value column 
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollar)

In [10]:
# Convert Air date column to datetime column 
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])

## Answers in questions 

In [11]:
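def split(row):
    """Return the fraction of answer words that also appear in the question."""
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    match_count = 0
    # 'the' is too common to be informative (only the first occurrence is removed)
    if "the" in split_answer:
        split_answer.remove("the")
    # Avoid dividing by zero when the answer was only 'the'
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count/len(split_answer)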

In [12]:
jeopardy['answer_in_question'] = jeopardy.apply(split,axis = 1)
jeopardy['answer_in_question'].mean()

0.060352773854699004

The mean is about 0.06: on average, only about 6% of the words in an answer also appear in its question. Hearing the question therefore rarely gives the answer away, so this pattern is not much help.
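To make the metric concrete, here is a minimal sketch of the same word-matching calculation on a few made-up question/answer pairs (the strings are illustrative, not rows from the dataset). The helper name `match_fraction` is hypothetical; it mirrors the logic of `split` but takes plain strings instead of a DataFrame row.

```python
def match_fraction(clean_question, clean_answer):
    """Fraction of answer words that also appear in the question."""
    answer_words = clean_answer.split(' ')
    question_words = clean_question.split(' ')
    if 'the' in answer_words:
        answer_words.remove('the')  # removes the first 'the' only, as in split()
    if len(answer_words) == 0:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Made-up examples, already lowercased and stripped of punctuation
examples = [
    ('this state is home to the city of yuma', 'arizona'),
    ('new york city is in this state', 'new york'),
    ('the president lives in this white building', 'the white house'),
]
for question, answer in examples:
    print(answer, '->', match_fraction(question, answer))
```

A value of 0 means none of the answer's words occur in the question, and 1 means all of them do, so a dataset-wide mean near 0.06 indicates that overlap is rare.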

## Recycled questions 

In [14]:
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
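    # Caution: removing items from a list while iterating over it skips elements,
    # so some words shorter than 6 characters survive this filter. The commented
    # line below filters reliably, but using it would change the mean shown below.
    [split_question.remove(item) for item in split_question if len(item)<6]
    #split_question = [item for item in split_question if len(item)>5]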
    match_count = 0
    for item in split_question:
        if item in terms_used:
            match_count += 1
    for item in split_question:
        terms_used.add(item)
    if len(split_question)> 0:
        match_count/=len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()
     

0.8022646904108177

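The mean of the `question_overlap` column is about 0.80: on average, roughly 80% of the words kept from a question have already appeared in an earlier question. This measures single words rather than whole phrases, and it covers only part of the full dataset, so it does not show that whole questions are reused. Still, it suggests that the recycling of questions is worth investigating further, for example by checking which categories these questions belong to.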
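One detail of the loop is worth checking: it removes short words from `split_question` while iterating over that same list. The sketch below uses a made-up sentence (not from the dataset) to compare that pattern with building a new filtered list, which is what the commented-out line in the cell does.

```python
words = 'the capital of this country is really called canberra'.split(' ')

# Pattern used in the cell: remove short words while iterating over the list
removed_in_place = list(words)
[removed_in_place.remove(item) for item in removed_in_place if len(item) < 6]

# Alternative: build a new list that keeps only words of 6+ characters
filtered = [item for item in words if len(item) > 5]

print(removed_in_place)  # 'this' is still present
print(filtered)          # only words with 6 or more characters
```

Each removal shifts the remaining items one position to the left, so the word that follows a removed word is never examined. The 0.80 figure reported here was produced with the in-place pattern, so switching to the filtered list would require recomputing the mean.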

## Low value vs high value questions

In [15]:
def question_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0 
    return value 
jeopardy['high_value'] = jeopardy.apply(question_value,axis = 1)

In [18]:
def count_value(word):
    low_count = 0
    high_count = 0 
    for _, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_expected=[]
# Sets are unordered, so this slice picks an arbitrary handful of terms
terms_used_list = list(terms_used)
comparison_terms = terms_used_list[1:6]
for term in comparison_terms:
    observed_expected.append(count_value(term))
observed_expected

[(1, 0), (1, 6), (0, 1), (35, 80), (0, 1)]

## Applying the chi-squared test

In [17]:
from scipy.stats import chisquare
import numpy as np
 
high_value_count = jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]
chi_squared=[]
for item in observed_expected:
    total = item[0]+item[1]
    total_prop = total/jeopardy.shape[0]
    expected_high_value_count = total_prop*high_value_count
    expected_low_value_count = total_prop*low_value_count
    
    observed = np.array([item[0], item[1]])
    expected = np.array([expected_high_value_count, expected_low_value_count])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=4.4007463431988825, pvalue=0.035923206140745186),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.7083506539662141, pvalue=0.3999918991363616),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.17484833718921128, pvalue=0.6758383987751316)]

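* Four of the five p-values are greater than 0.05, so for those terms the difference in usage between high-value and low-value questions is not statistically significant.
* The remaining p-value (about 0.036) should be read with caution: terms sampled this way often appear only a handful of times (see the counts above), and the chi-squared test is unreliable when expected counts are that small (a common rule of thumb asks for at least 5 per category).
* The results do not line up one-to-one with the counts listed above: the two identical `(0, 1)` counts would produce identical statistics, yet all five statistics differ. The chi-squared output therefore appears to come from a run with a different sample of terms (the cell numbers `In [17]` and `In [18]` suggest the cells were executed out of order).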
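The expected-count reasoning behind the test can be sketched by hand with the standard library only. This is an illustration under stated assumptions, not the notebook's method: the function name `chi_squared_1df` is hypothetical, and the totals of 5,000 high-value and 15,000 low-value questions are made up for the example (the real totals come from `high_value_count` and `low_value_count`). The p-value uses the identity that, for one degree of freedom, the upper-tail probability of the chi-squared distribution equals `erfc(sqrt(x / 2))`.

```python
import math

def chi_squared_1df(observed_high, observed_low, high_total, low_total):
    """Chi-squared statistic, p-value and smallest expected count for one term."""
    n_rows = high_total + low_total
    term_total = observed_high + observed_low
    term_prop = term_total / n_rows
    # Expected counts if the term were spread evenly across all questions
    expected = [term_prop * high_total, term_prop * low_total]
    observed = [observed_high, observed_low]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Two categories give 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, min(expected)

# Hypothetical totals, chosen only for illustration
HIGH_TOTAL, LOW_TOTAL = 5000, 15000

for counts in [(35, 80), (1, 0)]:
    stat, p, min_expected = chi_squared_1df(counts[0], counts[1], HIGH_TOTAL, LOW_TOTAL)
    reliable = min_expected >= 5  # common rule of thumb
    print(counts, round(stat, 3), round(p, 3), 'reliable' if reliable else 'too few occurrences')
```

Returning the smallest expected count alongside the statistic makes it easy to set aside terms that occur too rarely for the test to be meaningful.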