# Guided Project: Winning Jeopardy

In [1]:
import pandas as pd
j = pd.read_csv('jeopardy.csv')

In [2]:
j.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
j.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
# Remove the spaces from the column names, e.g. ' Air Date' -> 'AirDate'.
j.columns = j.columns.str.replace(' ', '')

## Normalizing Text Columns

A function to normalize the `Question` and `Answer` columns: it lowercases the text, strips punctuation, and collapses repeated whitespace.

In [5]:
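import re

def norm(st):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space.
    st = st.lower()
    st = re.sub(r'[^A-Za-z0-9\s]', "", st)
    st = re.sub(r'\s+', " ", st)
    return st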

In [6]:
j['clean_question']=j['Question'].apply(norm)
j['clean_answer'] = j['Answer'].apply(norm)
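
As a quick sanity check, the normalization can be tried on a few strings. This is a minimal standalone sketch: it redefines `norm` so the snippet runs on its own, and the sample strings are illustrative (loosely based on the rows shown above) rather than taken from the dataset verbatim.

```python
import re

def norm(st):
    # Same steps as the notebook's norm(): lowercase, drop punctuation,
    # collapse whitespace.
    st = st.lower()
    st = re.sub(r'[^A-Za-z0-9\s]', '', st)
    st = re.sub(r'\s+', ' ', st)
    return st

# Illustrative inputs; the double space in the last one is deliberate.
samples = ["Jim Thorpe", "McDonald's", "No. 2:  1912 Olympian; football star"]
cleaned = [norm(s) for s in samples]
print(cleaned)
```

Note that apostrophes are removed rather than replaced by a space, so `McDonald's` becomes the single token `mcdonalds`.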

A function to clean the `Value` column: it converts strings such as `$200` to integers, and maps values that contain no digits to 0.

In [7]:
import re

def norm_val(val):
    # Strip every non-digit character, e.g. '$200' -> '200'.
    val = re.sub(r'\D+', "", val)
    # Values with no digits at all are treated as 0.
    return int(val) if val else 0

In [8]:
j['clean_value'] = j['Value'].apply(norm_val)
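
The behaviour of the value cleaner can be sketched on a few inputs. The strings below are assumptions chosen for illustration (in particular, `'None'` stands in for any value without digits); the function is redefined so the snippet is self-contained.

```python
import re

def norm_val(val):
    # Keep digits only; an empty result means there was no amount.
    val = re.sub(r'\D+', '', val)
    return int(val) if val else 0

# Hypothetical raw values: a plain amount, one with a thousands
# separator, and one with no digits at all.
examples = ['$200', '$2,000', 'None']
converted = [norm_val(v) for v in examples]
print(converted)
```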

In [9]:
j.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
ShowNumber        19999 non-null int64
AirDate           19999 non-null object
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: int64(2), object(8)
memory usage: 1.5+ MB


In [10]:
j.AirDate = pd.to_datetime(j.AirDate)

In [11]:
j.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
ShowNumber        19999 non-null int64
AirDate           19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


In [12]:
j.head(3)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200


## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

The second question can be answered by seeing how often complex words (6 or more characters) recur. The first can be answered by seeing how often words in the answer also occur in the question.

In [13]:
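def match_ratio(row):
    # Share of answer words that also appear in the question.
    split_answer = row.loc['clean_answer'].split()
    split_question = row.loc['clean_question'].split()
    match_count = 0
    # 'the' is too common to be informative; drop its first occurrence.
    if 'the' in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when nothing is left of the answer.
    if len(split_answer) == 0:
        return 0
    for i in split_answer:
        if i in split_question:
            match_count +=1
    return match_count/len(split_answer)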

In [14]:
j['answer_in_question'] = j.apply(match_ratio, axis = 1)
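
To make the ratio concrete, here is a minimal sketch of the same logic applied to plain strings instead of DataFrame rows. The helper name `match_ratio_text` and the question/answer pairs are hypothetical and exist only for illustration.

```python
def match_ratio_text(clean_question, clean_answer):
    # Fraction of answer words (after dropping the first 'the')
    # that also appear in the question.
    answer_words = clean_answer.split()
    question_words = clean_question.split()
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical, already-normalized question/answer pairs.
no_overlap = match_ratio_text('this state has a record average of sunshine', 'arizona')
half_overlap = match_ratio_text('this adams was the second president', 'john adams')
only_the = match_ratio_text('a very common english article', 'the')
print(no_overlap, half_overlap, only_the)
```

A ratio of 0 means no answer word appears in the question, and 0.5 means half of them do. An answer consisting only of `the` falls through to the zero-length guard.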

In [15]:
mean_answer_in_q = j.answer_in_question.mean()
median_ans_q = j.answer_in_question.median()

In [16]:
mean_answer_in_q

0.05900196524977763

We find that, on average, about 6% of the words in an answer also appear in its question. We should be cautious about drawing conclusions from this mean, because in the large majority of rows (more than 17K out of 20K) the ratio is zero.

In [17]:
median_ans_q

0.0

## Recycled questions

We also want to investigate how often new questions are repeats of older ones. We can't answer this completely, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

To do this, we can:

* Sort `j` in order of ascending air date.
* Maintain a set called `terms_used` that will be empty initially.
* Iterate through each row of `j`.
* Split `clean_question` into words, remove any word shorter than 6 characters, and check if each word occurs in `terms_used`.
    * If it does, increment a counter.
    * Add each word to `terms_used`.
* Divide the counter by the number of remaining words to get the overlap ratio for that question.
    
This will enable us to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables us to filter out words like `the` and `than`, which are commonly used, but don't tell us a lot about a question.

In [18]:
question_overlap = []
terms_used = set()
j.sort_values('AirDate', ascending = True, inplace = True)

for i, row in j.iterrows():
    split_question = row['clean_question'].split()
    split_question = [q for q in split_question if len(q)>5]
    match_count = 0
    for w in split_question:
        if w in terms_used:
            match_count+=1
    for w in split_question:
        terms_used.add(w)
    if len(split_question)>0:
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)
    
j['question_overlap'] = question_overlap

print(j.question_overlap.mean())
    

0.6876260592169802


There is about 70% overlap between terms in new questions and terms in old questions. This only covers a small set of questions, and it looks at single terms rather than phrases. That makes the result relatively insignificant on its own, but it does suggest that the recycling of questions is worth looking into further.
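
A toy run of the same procedure shows why single-term overlap is weak evidence of recycling. This is a sketch on three made-up questions, assumed to be already normalized and sorted by air date; it mirrors the loop above but is not the notebook's code.

```python
# Hypothetical normalized questions, already in air-date order.
questions = [
    'this famous explorer reached america',
    'another famous explorer sailed around africa',
    'this planet is famous for its rings',
]

terms_used = set()
overlaps = []
for q in questions:
    # Keep only words with six or more characters.
    words = [w for w in q.split() if len(w) > 5]
    # Count how many of them appeared in an earlier question.
    seen_before = sum(1 for w in words if w in terms_used)
    terms_used.update(words)
    overlaps.append(seen_before / len(words) if words else 0)

print(overlaps)
```

The first question always scores 0 because nothing has been seen yet. The third question scores 0.5 purely because it shares the generic word `famous` with earlier questions, even though it is about a different topic.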

## Low value vs high value questions


Let's say we only want to study terms that pertain to high-value questions rather than low-value questions.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

* Low value -- Any row where `clean_value` is 800 or less.
* High value -- Any row where `clean_value` is greater than 800.

We'll then be able to loop through each of the terms collected in the previous section, `terms_used`, and:

* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find expected counts.
* Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [19]:
def value(row):
    # 1 for high-value questions (more than $800), 0 otherwise.
    return 1 if row['clean_value'] > 800 else 0

In [20]:
j['high_value'] = j.apply(value, axis =1)

In [21]:
j.high_value.value_counts(dropna=False)

0    14265
1     5734
Name: high_value, dtype: int64

In [22]:
def high_low_counts(word):
    low_c = 0
    high_c = 0
    for i, row in j.iterrows():
        split_clean_q = row['clean_question'].split()
        if word in split_clean_q:
            if row['high_value'] == 1:
                high_c +=1
            else:
                low_c +=1
    return (high_c, low_c)
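
Because `high_low_counts` scans every row for each word, running it over all of `terms_used` is slow. One possible alternative, sketched here on hypothetical data, is to collect the counts for every term in a single pass. The `rows` list below stands in for the `clean_question` and `high_value` columns of `j` and is made up for illustration.

```python
from collections import defaultdict

# Hypothetical (clean_question, high_value) pairs standing in for the rows of j.
rows = [
    ('this famous explorer reached america', 0),
    ('another famous explorer sailed around africa', 1),
    ('this planet is famous for its rings', 0),
]

# term -> [high_count, low_count], filled in one pass over the rows.
counts = defaultdict(lambda: [0, 0])
for clean_question, high_value in rows:
    # set() so a word repeated inside one question is counted once,
    # matching the `word in split_clean_q` test used above.
    for word in set(clean_question.split()):
        if len(word) > 5:
            counts[word][0 if high_value == 1 else 1] += 1

print(counts['famous'], counts['explorer'], counts['planet'])
```

Each entry has the same `(high, low)` meaning as the tuple returned by `high_low_counts`, so the results could be fed into the same chi-squared step.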

In [23]:
from random import sample
# random.sample needs a sequence; passing a set raises TypeError on Python 3.11+.
comparison_terms = sample(list(terms_used), 10)

In [24]:
comparison_terms

['chases',
 'brings',
 'nightingale',
 'transports',
 'bushes',
 'shadowy',
 'humbug',
 '150yearold',
 'gertrudes',
 'harmon']

In [25]:
observed_expected = []
for i in comparison_terms:
    observed_expected.append(high_low_counts(i))
    

In [26]:
observed_expected

[(0, 3),
 (5, 9),
 (0, 2),
 (0, 1),
 (0, 2),
 (0, 1),
 (0, 2),
 (1, 0),
 (0, 1),
 (1, 1)]

## Applying the Chi-Squared Test

In [27]:
high_value_count = j.high_value.value_counts()[1]

In [28]:
low_value_count = j.high_value.value_counts()[0]

In [29]:
from scipy.stats import chisquare as chi
import numpy as np

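chi_squared = []
for i in observed_expected:
    # i is the observed (high, low) count pair for one term.
    total = i[0]+i[1]
    # Share of all questions that contain the term.
    total_prop = total/j.shape[0]
    # Expected counts if the term were spread evenly across both groups.
    expected_high = total_prop*high_value_count
    expected_low = total_prop*low_value_count
    chi_squared.append(chi(i,(expected_high,expected_low)))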

In [30]:
chi_squared

[Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.33955667615496254, pvalue=0.5600852286656143),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996)]
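
To see where these numbers come from, the first result can be recomputed by hand. The sketch below uses only the standard library. It takes the group sizes from the `value_counts()` output and the observed pair `(0, 3)` for the first sampled term, and it relies on the fact that for one degree of freedom the chi-squared survival function equals `erfc(sqrt(x / 2))`. It should agree with the first `Power_divergenceResult` above (statistic of about 1.206, p-value of about 0.272).

```python
from math import erfc, sqrt

# Group sizes from the value_counts() output above.
high_value_count = 5734
low_value_count = 14265
n_rows = high_value_count + low_value_count  # 19999

# Observed (high, low) counts for the first sampled term, 'chases'.
observed = (0, 3)
total = sum(observed)
expected = (total * high_value_count / n_rows,
            total * low_value_count / n_rows)

# Pearson's statistic: sum of (observed - expected)^2 / expected.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Two categories give 1 degree of freedom, for which the chi-squared
# survival function reduces to erfc(sqrt(x / 2)).
p_value = erfc(sqrt(statistic / 2))

print(statistic, p_value)
```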

## Chi-squared Results


None of the terms showed a significant difference in usage between high-value and low-value rows: every p-value is above 0.05. Additionally, nine of the ten sampled terms occur in fewer than 5 questions, so the chi-squared test is not very reliable here. It would be better to run this test only on terms that have higher frequencies.
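
One way to pick suitable terms is to keep only those whose expected counts reach a minimum in both groups. The sketch below assumes the common rule of thumb of at least 5 expected observations per cell. The threshold, the helper `expected_counts`, and the entry `hypothetical_term` are assumptions for illustration; the other two count pairs are taken from the sample above.

```python
# Group sizes from the value_counts() output earlier in the notebook.
high_value_count = 5734
low_value_count = 14265
n_rows = high_value_count + low_value_count

# Observed (high, low) counts: the first two come from the sample above,
# the last entry is a made-up term included only to show a passing case.
observed = {
    'chases': (0, 3),
    'brings': (5, 9),
    'hypothetical_term': (12, 20),
}

def expected_counts(high, low):
    # Expected split of a term's occurrences if usage were independent of value.
    total = high + low
    return (total * high_value_count / n_rows,
            total * low_value_count / n_rows)

# Keep terms whose expected counts are at least 5 in both groups.
usable = [term for term, (high, low) in observed.items()
          if min(expected_counts(high, low)) >= 5]
print(usable)
```

Even `brings`, the most frequent sampled term with 14 occurrences, has an expected high-value count of only about 4, so it does not pass this filter.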