# Introduction
Jeopardy! is a popular TV quiz show in the US in which participants answer questions to win money. In this project we use data from past episodes to look for patterns that could help a contestant win. The dataset contains 19,999 rows and seven columns.
The columns are as follows:

| Column Name | Description |
| ------------ | ------------ |
| Show Number | Episode number of the show |
| Air Date    | Date the episode aired |
| Round       | Round of the game in which the question was asked |
| Category    | Category of the question |
| Value       | Prize money for the question, in dollars |
| Question    | Text of the question |
| Answer      | Answer to the question |

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# read the DataFrame from jeopardy.csv
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
# list the column names of the jeopardy DataFrame
jeopardy.columns.tolist()

['Show Number',
 ' Air Date',
 ' Round',
 ' Category',
 ' Value',
 ' Question',
 ' Answer']

Most of the column names start with a space, so we strip the surrounding whitespace before using them.

In [5]:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns.tolist()

['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [6]:
# check the data type of each column
jeopardy.dtypes

Show Number     int64
Air Date       object
Round          object
Category       object
Value          object
Question       object
Answer         object
dtype: object

### Normalization of data
Normalizing numeric data is familiar, but what does normalization mean for object (string) columns? Here it means lowercasing each string and removing all punctuation, so that any two strings can be compared word by word.

In [9]:
# normalize a string: lowercase it and strip all punctuation
import string

def normalize(s):
    s = s.lower()
    # keep only the characters that are not ASCII punctuation
    return "".join(ch for ch in s if ch not in string.punctuation)

In [10]:
# apply normalize to the Question and Answer columns
clean_question = jeopardy['Question'].apply(normalize)
clean_answer = jeopardy['Answer'].apply(normalize)
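
To make the effect of this normalization concrete, here is a small standalone sketch. It uses `str.translate` instead of the character-by-character join above, and the name `normalize_text` and the sample strings are illustrative assumptions rather than part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize_text(s):
    """Lowercase s and remove ASCII punctuation (hypothetical helper)."""
    return s.lower().translate(_PUNCT_TABLE)

# Sample strings chosen for illustration only.
samples = ["McDonald's", "Jim Thorpe", 'Live on "The Art Linkletter Show"']
for raw in samples:
    print(repr(raw), "->", repr(normalize_text(raw)))
```

Both approaches should give the same result for plain ASCII text. Neither removes non-ASCII punctuation such as curly quotes.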

In [16]:
# convert a dollar string such as "$200" to an integer; non-numeric values become 0
def dollar(money):
    money = normalize(money)
    try:
        return int(money)
    except ValueError:
        return 0

In [17]:
clean_value = jeopardy['Value'].apply(dollar)

In [18]:
clean_value.head()

0    200
1    200
2    200
3    200
4    200
Name: Value, dtype: int64

In [19]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy['Air Date'].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: Air Date, dtype: datetime64[ns]

In [20]:
jeopardy['clean_answer'] = clean_answer
jeopardy['clean_question'] = clean_question
jeopardy['clean_value'] = clean_value

In [21]:
jeopardy.isnull().sum()

Show Number       0
Air Date          0
Round             0
Category          0
Value             0
Question          0
Answer            0
clean_answer      0
clean_question    0
clean_value       0
dtype: int64

Now let's check the columns of the DataFrame:

In [22]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer', 'clean_answer', 'clean_question', 'clean_value'],
      dtype='object')

In [24]:
# fraction of the answer's words (ignoring "the") that also appear in the question
def answerInQuestion(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    # drop "the", which is too common to be informative
    split_answer = [word for word in split_answer if word != 'the']
    # avoid dividing by zero when the answer was only "the"
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for answer in split_answer:
        if answer in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [25]:
jeopardy['answer_in_question'] = jeopardy.apply(answerInQuestion,axis=1)

In [26]:
jeopardy['answer_in_question'].describe()

count    19999.000000
mean         0.059737
std          0.166078
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: answer_in_question, dtype: float64

## Deduction 1
The mean is about 0.06, so on average only about 6% of the words in an answer also appear in its question. Because this share is so small, the question text alone is rarely enough to find the answer.
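
As a worked illustration of how the per-row fraction is computed, the sketch below applies the same rule to two strings directly. The function name `answer_overlap` is hypothetical, and the first question is a made-up shortened string; the second pair is the Titanic row shown in the sample output further down.

```python
def answer_overlap(clean_answer, clean_question):
    """Fraction of answer words (ignoring 'the') found in the question."""
    answer_words = [w for w in clean_answer.split(" ") if w != "the"]
    question_words = clean_question.split(" ")
    if not answer_words:
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical shortened question: only "a" matches, so 1 of 2 words.
half = answer_overlap("a muumuu", "not a cow but a dress")
# "the" is dropped and "titanic" is absent from the question, so 0 of 1.
none = answer_overlap("the titanic", "unsinkable for most of its maiden voyage in 1912")
print(half, none)
```

Note that a match on a filler word such as "a" counts as much as a match on a meaningful word, which may inflate the mean slightly.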

In [27]:
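# True when every answer word (ignoring "the") appears in the question
def countAnswerInQuestion(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    split_answer = [word for word in split_answer if word != 'the']
    # "<" is a proper-subset test: the question must also contain other words
    return set(split_answer) < set(split_question)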

In [28]:
jeopardy['count_answer_in_question'] = jeopardy.apply(countAnswerInQuestion,axis=1)
jeopardy['count_answer_in_question'].describe()

count     19999
unique        2
top       False
freq      19881
Name: count_answer_in_question, dtype: object

In [29]:
jeopardy['count_answer_in_question'].mean()

0.0059002950147507378

## Deduction 2
The mean is about 0.0059, so in only about 0.59% of rows does every word of the answer appear in the question. This is a very small share, so it is not beneficial to rely on the question text to answer the question.
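
The check behind this figure is a proper-subset comparison between two sets of words, which has a couple of edge cases worth seeing. The sketch below is a standalone illustration with a hypothetical function name and made-up strings; whether these edge cases affect the 0.59% figure is not something the notebook checks.

```python
def all_answer_words_in_question(clean_answer, clean_question):
    """Proper-subset test on word sets, ignoring 'the' in the answer."""
    answer_words = {w for w in clean_answer.split(" ") if w != "the"}
    question_words = set(clean_question.split(" "))
    return answer_words < question_words

# Every answer word appears and the question has extra words: True.
typical = all_answer_words_in_question("apple", "this apple is red")
# Identical word sets are not a *proper* subset: False with "<".
identical = all_answer_words_in_question("red apple", "red apple")
# An answer that is only "the" becomes the empty set, a proper subset
# of any non-empty set: True.
only_the = all_answer_words_in_question("the", "name this article")
print(typical, identical, only_the)
```

Using `<=` instead of `<` would treat identical word sets as a match, and an explicit check for an empty answer set would exclude the last case.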

In [31]:
jeopardy.sort_values('Air Date',inplace=True)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_answer,clean_question,clean_value,answer_in_question,count_answer_in_question
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,theodore roosevelt,adventurous 26th president he was 1st to ride ...,0,0.0,False
19274,10,1984-09-21,Jeopardy!,GEOGRAPHY,$100,Formerly Formosa,Taiwan,taiwan,formerly formosa,100,0.0,False
19275,10,1984-09-21,Jeopardy!,DOUBLE TALK,$100,"Not a Hawaiian cow, but a dress worn by Hawaii...",a muumuu,a muumuu,not a hawaiian cow but a dress worn by hawaiia...,100,0.5,False
19276,10,1984-09-21,Jeopardy!,"""JACKS"" OF ALL TRADES",$100,He celebrated his 39th birthday 41 times,Jack Benny,jack benny,he celebrated his 39th birthday 41 times,100,0.0,False
19277,10,1984-09-21,Jeopardy!,SHIPS,$100,"""Unsinkable"" for most of its maiden voyage in ...",the Titanic,the titanic,unsinkable for most of its maiden voyage in 1912,100,0.0,False


In the table above, the leftmost column holds the original index values, which are no longer in order after sorting. To renumber the rows in the new order, we call reset_index with drop=True so that the old index is discarded rather than kept as a column:

In [35]:
jeopardy = jeopardy.reset_index(drop=True)

In [37]:
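# for each question, in air-date order, compute the share of its long words
# (six or more characters) that have already been seen
question_overlap = list()
terms_used = set()
for i, row in jeopardy.iterrows():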
    split_question = row['clean_question'].split(' ')
    split_question = [word for word in split_question if len(word)>5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)
np.mean(question_overlap)

0.68897934881067435

## Deduction 3
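On average, about 68.9% of the long words (six or more characters) in a question had already appeared in an earlier question. This measures the reuse of individual words rather than of whole questions, but it still suggests that reviewing questions from previous years is good practice before competing.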
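
To show what this overlap score does and does not capture, here is a minimal sketch of the same loop run on three made-up questions. The function name `overlap_per_question`, its `min_len` parameter, and the sample questions are assumptions for illustration only.

```python
def overlap_per_question(clean_questions, min_len=6):
    """Share of each question's long words that were already seen."""
    seen = set()
    overlaps = []
    for question in clean_questions:
        words = [w for w in question.split(" ") if len(w) >= min_len]
        repeated = 0
        for w in words:
            if w in seen:
                repeated += 1
            seen.add(w)
        overlaps.append(repeated / len(words) if words else 0)
    return overlaps

# Hypothetical questions, listed in the order they would have aired.
questions = [
    "this italian astronomer built telescopes",
    "this italian composer wrote operas",
    "another astronomer studied planets",
]
overlaps = overlap_per_question(questions)
print(overlaps)
```

None of these questions repeats an earlier one, yet the second and third still score above zero because they share a single long word with an earlier question. A high average therefore points to recurring vocabulary and topics rather than recycled questions.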

In [38]:
# flag questions worth more than $800 as high value (1), all others as low value (0)
def value_spliter(row):
    if row['clean_value']>800:
        return 1
    return 0

In [39]:
jeopardy['high_value'] = jeopardy.apply(value_spliter,axis=1)

In [43]:
# count how many high-value and low-value questions contain the given word
def high_low(word):
    high_count = 0
    low_count = 0
    for idx,row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value']==1:
                high_count += 1
            else:
                low_count += 1
    return (high_count,low_count)

In [44]:
observed_expected = list()
terms_used = list(terms_used)
comparison_terms = terms_used[:5]

In [45]:
for word in comparison_terms:
    observed_expected.append(high_low(word))
print(observed_expected)

[(1, 0), (2, 0), (0, 1), (0, 1), (1, 1)]


In [46]:
high_value_count = jeopardy['high_value'].sum()
low_value_count = jeopardy.shape[0] - high_value_count

In [50]:
from scipy.stats import chisquare
chi_squared = list()
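for each in observed_expected:
    total = each[0] + each[1]
    # expected counts if the word were split between high- and low-value
    # questions in the same proportion as the whole dataset
    total_prop = total/jeopardy.shape[0]
    high_value_exp = total_prop*high_value_count
    low_value_exp = total_prop*low_value_count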
    obs = np.array([each[0],each[1]])
    exp = np.array([high_value_exp,low_value_exp])
    chi = chisquare(obs,exp)
    chi_squared.append(chi)

In [51]:
for i in chi_squared:
    print(i)

Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047)
Power_divergenceResult(statistic=4.9755842343913503, pvalue=0.025707519787911092)
Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686)
Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686)
Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963)
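
The first result above can be reproduced by hand, which makes the calculation easier to follow. The sketch below assumes 5,734 high-value and 14,265 low-value rows; these counts are not printed in the notebook and are inferred from the statistics above, so treat them as an assumption. With two categories the test has one degree of freedom, and the p-value then has the closed form erfc(sqrt(x/2)), so only the standard library is needed. The function names are hypothetical.

```python
import math

# Assumed counts, inferred from the printed statistics (not shown in the notebook).
HIGH_ROWS = 5734
LOW_ROWS = 14265
TOTAL_ROWS = HIGH_ROWS + LOW_ROWS

def expected_counts(high_obs, low_obs):
    """Expected high/low counts if the word followed the overall split."""
    share = (high_obs + low_obs) / TOTAL_ROWS
    return share * HIGH_ROWS, share * LOW_ROWS

def chi_squared_two_bins(observed, expected):
    """Chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

obs = (1, 0)  # first word: one high-value question, no low-value ones
stat, p_value = chi_squared_two_bins(obs, expected_counts(*obs))
print(round(stat, 4), round(p_value, 4))  # 2.4878 0.1147
```

Each of the five words appears only once or twice, so the expected counts are far below the level at which the chi-squared approximation is usually considered reliable. These p-values are best read as illustrative, and words with more occurrences would give more trustworthy results.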
