# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. I am going to work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help to win.

The dataset, jeopardy.csv, contains the first 19,999 rows (roughly 10%) of a full dataset of about 200,000 Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the number of dollars answering the question correctly is worth.
- Question -- the text of the question.
- Answer -- the text of the answer.

First, I am going to read the dataset and explore it.

In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have leading spaces, so I am going to remove them:

In [3]:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Let's take a closer look at the data type of each column.

In [4]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


## Normalize the columns

Before starting the analysis, we need to normalize some columns and fix their data types. The **Question** and **Answer** columns should be lowercased and stripped of punctuation, the **Value** column should be numeric, and the **Air Date** column should be a datetime.

First, I am going to write a function that takes a string and returns it in lowercase with the punctuation removed.

In [5]:
import re
def normalize(text):
    text = text.lower()
    #raw string so that \w and \s are not treated as invalid string escapes
    text = re.sub(r'[^\w\s]', '', text)
    return text

#test normalize function
normalize("Hello! How are you?")

'hello how are you'

Let's apply the normalize function to **Question** and **Answer** columns and save the result in **clean_question** and **clean_answer** columns.

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_question'].head(5)

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [7]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)
jeopardy['clean_answer'].head(5)

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object

To normalize the **Value** column, I am going to strip the dollar sign (and any other punctuation), convert the text to an integer, and save the result to a new column called **clean_value**. Values that cannot be converted are set to 0.

In [8]:
def normalize_value(value):
    value = re.sub(r'[^\w\s]', '', value)
    try:
        value_int = int(value)
    except ValueError:
        value_int = 0
    return value_int
# test
normalize_value('$200')

200
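
As a quick check of the fallback branch, here is a minimal standalone sketch of the same idea run on a few more inputs. The inputs `'$2,000'` and `'None'` are illustrative assumptions, not values taken from the output shown above.

```python
import re

def normalize_value(value):
    """Strip punctuation such as '$' and ',' and return an int (0 if not numeric)."""
    digits = re.sub(r"[^\w\s]", "", value)
    try:
        return int(digits)
    except ValueError:
        # Non-numeric text falls back to 0
        return 0

# Illustrative inputs (assumed, not read from jeopardy.csv)
examples = ["$200", "$2,000", "None"]
results = [normalize_value(v) for v in examples]
print(results)
```

Note that mapping unparseable values to 0 puts them in the same bucket as genuinely low values, which matters later when questions are split by value.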

In [9]:
#apply normalize_value function to Value column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

The **Air Date** column should also be a datetime so that it is easier to work with.

In [10]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

Let's check the types of all columns again, especially the new ones.

In [11]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Show Number     19999 non-null  int64         
 1   Air Date        19999 non-null  datetime64[ns]
 2   Round           19999 non-null  object        
 3   Category        19999 non-null  object        
 4   Value           19999 non-null  object        
 5   Question        19999 non-null  object        
 6   Answer          19999 non-null  object        
 7   clean_question  19999 non-null  object        
 8   clean_answer    19999 non-null  object        
 9   clean_value     19999 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


## Study

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

To answer the second question, I need to figure out how often complex words (six or more characters) reoccur. For the first question, I need to see how often words in the answer also occur in the question.

Let's start with the first question. I am going to write a function that calculates, for each row, the fraction of the words in the answer that also appear in the question. Then I am going to apply it to every row and take the mean. The function removes 'the' from the answer words, since it is not an informative word.

In [12]:
def count_matches_ratio(row):
    answer = row['clean_answer']
    question = row['clean_question']
    split_answer = answer.split()
    split_question = question.split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count/len(split_answer)
  
jeopardy['answer_in_question'] = jeopardy.apply(count_matches_ratio, axis = 1)  
jeopardy['answer_in_question'].mean()

0.059001965249777744

On average, about 6% of the words in an answer also appear in its question, so the chance of deducing the answer from the question is quite low.
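
To make the ratio concrete, here is a small standalone sketch applied to a few hand-written pairs. The function name `match_ratio` and the example strings are hypothetical. Unlike the cell above, this version drops every occurrence of 'the' from the answer rather than only the first.

```python
def match_ratio(clean_answer, clean_question):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# One of the two answer words ('adams') appears in the question
half = match_ratio("john adams", "this adams was the second us president")
# No answer word appears in the question
none = match_ratio("arizona", "the city of yuma in this state has a record")
# An answer consisting only of 'the' has nothing left to match
empty = match_ratio("the", "the capital of france")
print(half, none, empty)
```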

## Repeated questions

Let's move on to the second question and investigate how often new questions repeat older ones. I cannot answer this completely, because the dataset includes only about 10% of the full Jeopardy question dataset, but I can still investigate it.

I am going to check if the terms with six or more characters in questions have been used previously or not.

In [13]:
question_overlap = []
terms_used = set()
jeopardy.sort_values('Air Date', inplace = True)
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    split_question = [q for q in split_question if len(q)>= 6]
    match_count = 0
    for term in split_question:
        if term in terms_used:
            match_count += 1
        terms_used.add(term)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()
        

0.6894006357823155

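On average, about 69% of the complex words in a question have already appeared in an earlier question. This measures overlap of single terms rather than whole questions, and the dataset covers only part of the full question set, but it suggests that studying past questions can be really helpful to win.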
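
The overlap measure can be traced on a tiny example. This is a minimal sketch with made-up questions; the function name `overlap_ratios` and the `min_length` parameter are assumptions for illustration, and the logic mirrors the loop above.

```python
def overlap_ratios(questions, min_length=6):
    """For each question (in order), the share of long terms already seen earlier."""
    seen = set()
    ratios = []
    for question in questions:
        terms = [w for w in question.split() if len(w) >= min_length]
        repeated = 0
        for term in terms:
            if term in seen:
                repeated += 1
            seen.add(term)
        ratios.append(repeated / len(terms) if terms else 0)
    return ratios

# Hypothetical cleaned questions, already sorted by air date
toy_questions = [
    "this country borders france",
    "this painter was born in france",
    "this country has a famous painter",
]
ratios = overlap_ratios(toy_questions)
print(ratios)
```

The first question always scores 0 because nothing has been seen yet, and the last one scores 2/3 even though it is not a repeat of either earlier question. A high overlap therefore indicates shared vocabulary, not necessarily recycled questions.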

## Study questions with high value

Let's focus our study on high value questions rather than low value ones, since answering them correctly earns more money.

I can actually figure out which terms correspond to high-value questions using a chi-squared test. I'll first need to narrow down the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

I'll then be able to loop through each of the terms from **terms_used**, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

I can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so I'll just do it for a small sample now.

In [14]:
def categorize_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(categorize_value, axis = 1)

In [15]:
def count_values(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [16]:
#Randomly pick ten distinct elements of terms_used
from random import sample
comparison_terms = sample(list(terms_used), 10)
comparison_terms

['hosted',
 'depths',
 'coleridge',
 'scissorhands',
 'bellringing',
 'sporting',
 'hrefhttpwwwjarchivecommedia20080408_dj_24bjpg',
 'schindler',
 'narcocorridos',
 'confederates']

In [17]:
observed_expected = []
for word in comparison_terms:
    observed_expected.append(count_values(word))
observed_expected

[(3, 14),
 (0, 1),
 (0, 1),
 (0, 2),
 (0, 1),
 (1, 3),
 (1, 0),
 (0, 1),
 (1, 0),
 (0, 2)]

Now that I've found the observed counts for a few terms, I can compute the expected counts and the chi-squared value.

In [18]:
high_value_count = sum(jeopardy['high_value'])
low_value_count = jeopardy[jeopardy['high_value'] == 0]['high_value'].count()
print('high_value_count = {}'.format(high_value_count))
print('low_value_count = {}'.format(low_value_count))

high_value_count = 5734
low_value_count = 14265


In [171]:
import numpy as np
from scipy.stats import chisquare

chi_squared = []
for high_count, low_count in observed_expected:
    total = high_count + low_count
    total_prop = total/jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([high_count, low_count])
    expected = np.array([high_value_exp, low_value_exp])
    
    chi_squared.append(chisquare(observed, expected))
    
chi_squared


[Power_divergenceResult(statistic=1.0102851115076668, pvalue=0.314834544813388),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571)]

None of the p-values above is below 0.05, so for these words there is no statistically significant difference in usage between high value and low value questions. In addition, all but one of the terms occur fewer than 5 times, and the chi-squared test is unreliable at such low frequencies. It would be better to run the test only on terms with higher frequencies.
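
To show what the test computes, the first result above (3 high value and 14 low value occurrences, with 5734 high value and 14265 low value questions overall) can be reproduced by hand. This is a minimal sketch rather than a replacement for `scipy.stats.chisquare`. The function name is hypothetical, and the p-value uses the identity that, for one degree of freedom, the chi-squared tail probability equals `erfc(sqrt(x / 2))`.

```python
import math

def chi_squared_1dof(high_count, low_count, high_total, low_total):
    """Chi-squared statistic and p-value for a two-cell comparison (1 degree of freedom)."""
    total = high_count + low_count
    grand_total = high_total + low_total
    # Expected counts if the word were spread evenly over all questions
    high_exp = total * high_total / grand_total
    low_exp = total * low_total / grand_total
    statistic = ((high_count - high_exp) ** 2 / high_exp
                 + (low_count - low_exp) ** 2 / low_exp)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, high_exp, low_exp

statistic, p_value, high_exp, low_exp = chi_squared_1dof(3, 14, 5734, 14265)
print(statistic, p_value)
print(high_exp, low_exp)
```

The statistic and p-value should agree with the first `Power_divergenceResult` above. The expected high value count comes out just under 5, which illustrates why even the most frequent of the ten sampled terms is a borderline case for this test.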

### Eliminate non-informative words
I am going to remove non-informative words to shrink terms_used, so that the count_values function can be run on more of the data. First, I am going to remove **stopwords**.

#### Remove stopwords
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that search engines are typically programmed to ignore, both when indexing entries and when retrieving them for a search query. Such words say little about a question, and counting them would take up valuable processing time.
Let's remove these words.

In [29]:
len(terms_used)

24470

In [35]:
#requires the NLTK stopwords corpus, e.g. via nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
for word in stop_words:
    if word in terms_used:
        terms_used.remove(word)
len(terms_used)

24453
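
The loop above is equivalent to a set difference. Here is a minimal sketch with a small hand-written stop word set, which is a hypothetical stand-in for the NLTK list.

```python
# Toy stand-ins: a few long terms and a tiny, hand-picked stop word set
toy_terms = {"before", "during", "painter", "country", "through"}
toy_stop_words = {"before", "during", "through", "the", "a"}

# Set difference keeps only the terms that are not stop words
remaining = toy_terms - toy_stop_words
print(sorted(remaining))
```

Only 17 terms are removed in the real run, presumably because terms_used already keeps only words of six or more characters and most English stop words are shorter than that.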

#### Remove hrefhttp
Looking at the words in terms_used, some are fragments of links (they contain hrefhttp), which are not relevant to our question, so I am going to remove them as well.

In [45]:
terms_used_sr = pd.Series(list(terms_used))
terms_used_sr = terms_used_sr[~terms_used_sr.str.contains('hrefhttp')]
len(terms_used_sr)

23250

There are still 23250 words left in terms_used_sr. At this stage, I am going to look at the count_values function and see if I can make it run faster.

### Re-write count_values function 

The count_values function loops over the whole jeopardy dataset row by row. I am going to replace that loop with pandas column operations to make it faster. To make the result easier to read, the new function returns the word as well.

In [225]:
def count_values_faster(word):
    high_count = 0
    low_count = 0
    
    #regex pattern to match the whole word only
    pattern = r"\b{}\b".format(word)
    high_count = jeopardy[(jeopardy['clean_question'].str.contains(pattern, regex = True)) &
                         (jeopardy['high_value'] == 1)]['high_value'].count()
    low_count = jeopardy[(jeopardy['clean_question'].str.contains(pattern, regex = True)) &
                        (jeopardy['high_value'] == 0)]['high_value'].count()
    return word, high_count, low_count

Let's test to make sure that we get the same result as the count_values function.

In [224]:
observed_test = []
for word in comparison_terms:
    observed_test.append(count_values_faster(word))
print(observed_test)

[('hosted', 3, 14), ('depths', 0, 1), ('coleridge', 0, 1), ('scissorhands', 0, 2), ('bellringing', 0, 1), ('sporting', 1, 3), ('hrefhttpwwwjarchivecommedia20080408_dj_24bjpg', 1, 0), ('schindler', 0, 1), ('narcocorridos', 1, 0), ('confederates', 0, 2)]


The test passes: the results are the same, and the new function is more efficient.
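
The `\b` word-boundary anchors in the pattern are what keep the two functions in agreement: without them, a term would also match inside longer words. Here is a small sketch with a made-up sentence. The `re.escape` call is an optional safeguard that is not in the function above, and it is not strictly needed here because the cleaned terms contain only word characters.

```python
import re

# Hypothetical cleaned question text
question = "this comedian hosted the show for the hosts of a ghost story"

# Plain substring search: 'host' is found inside 'hosted', 'hosts' and 'ghost'
substring_hit = "host" in question

# Whole-word search: 'host' never appears as a word of its own
whole_word_hit = re.search(r"\bhost\b", question) is not None

# The full word 'hosted' is still found
hosted_hit = re.search(r"\b{}\b".format(re.escape("hosted")), question) is not None

print(substring_hit, whole_word_hit, hosted_hit)
```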

I am going to apply this new function to all of the remaining terms. It still takes a while to run, but it is more practical than count_values.

In [212]:
#terms_used_sr already holds the filtered terms (stopwords and hrefhttp links removed)
frequencies = terms_used_sr.apply(count_values_faster)
frequencies

0          (fraction, 0, 1)
1          (solution, 0, 4)
2            (beheld, 1, 1)
3            (decide, 1, 2)
4            (moored, 0, 1)
                ...        
24447     (mythology, 4, 6)
24448     (candidely, 0, 1)
24449        (clinic, 4, 8)
24450     (cervantes, 0, 5)
24452    (brazilwood, 0, 1)
Length: 23250, dtype: object
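
An alternative that avoids scanning the dataset once per term is to count every term in a single pass over the questions. The following is a rough sketch on toy rows, not the approach used in this notebook. The function name `count_all_terms`, the `(clean_question, high_value)` row layout and the sample rows are assumptions for illustration.

```python
from collections import Counter

def count_all_terms(rows, min_length=6):
    """Count, in one pass, how many high and low value questions each term occurs in."""
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, high_value in rows:
        # A set counts each term at most once per question
        terms = {w for w in clean_question.split() if len(w) >= min_length}
        target = high_counts if high_value == 1 else low_counts
        target.update(terms)
    return high_counts, low_counts

# Hypothetical rows: (clean_question, high_value)
toy_rows = [
    ("this painter lived in france", 1),
    ("this country borders france", 0),
    ("a famous painter from this country", 0),
]
high_counts, low_counts = count_all_terms(toy_rows)
print(high_counts["painter"], low_counts["painter"])
print(high_counts["country"], low_counts["country"])
```

With this layout the work grows with the total number of words in the dataset rather than with the number of terms times the number of rows.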

### Words with higher frequencies

To make the chi-squared test valid, let's keep only the words with high frequencies and run the chi-squared test on the 1000 most frequent words.

In [213]:
def get_high_frequecies(data, size):
    frequencies = pd.DataFrame(data, 
                               columns = ['word', 'high_value', 'low_value'])
    frequencies['total_value'] = frequencies['high_value'] + frequencies['low_value']
    frequencies.sort_values('total_value', ascending = False, inplace = True)
    return(frequencies.head(size))



high_frequecies = get_high_frequecies(list(frequencies),1000)
high_frequecies

Unnamed: 0,word,high_value,low_value,total_value
13421,called,168,346,514
20594,country,141,332,473
3015,played,77,212,289
11928,became,79,203,282
8318,american,77,174,251
...,...,...,...,...
1399,physics,9,5,14
10955,couple,3,11,14
19034,stopped,1,13,14
7613,reported,2,12,14


In [234]:
def calculate_chi_squared(row):
    chi_squared = []
    total_prop = row['total_value']/jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    observed = np.array([row['high_value'], row['low_value']])
    expected = np.array([high_value_exp, low_value_exp])
    
    chi_value, p_value = chisquare(observed, expected)
    
    chi_squared.append((row['word'], chi_value, p_value, row['high_value'], row['low_value']))
    return chi_squared
                         
    
chi_squared = high_frequecies.apply(calculate_chi_squared, axis = 1)
chi_squared.head(5)

13421    [(called, 4.048305063534577, 0.044215717944225...
20594    [(country, 0.29967829483482744, 0.584084171311...
3015     [(played, 0.5810990283039111, 0.44588185909193...
11928    [(became, 0.05956570730840162, 0.8071836789959...
8318     [(american, 0.4938111242657224, 0.482232156839...
dtype: object

At this stage, I am going to keep the words with p-values below 0.05 to find out which words differ significantly in usage between high value and low value questions. Among those, I am only interested in words that occur more often in high value questions than in low value ones.

In [232]:
chi_squared_df = pd.DataFrame([c[0] for c in chi_squared], 
                              columns = ['word', 'chi_squared', 'p_value', 'high_value', 'low_value'])
chi_squared_df = chi_squared_df.sort_values('p_value')
chi_squared_df = chi_squared_df[(chi_squared_df['p_value'] < 0.05) & 
                                (chi_squared_df['high_value'] > chi_squared_df['low_value']) ]
chi_squared_df

Unnamed: 0,word,chi_squared,p_value,high_value,low_value
176,monitora,45.947439,1.214686e-11,35,13
77,target_blanksarah,24.358972,7.995351e-07,40,33
232,target_blankkelly,20.921282,4.785483e-06,25,16
94,african,17.283572,3.219584e-05,35,33
504,painter,16.941684,3.854581e-05,16,8
162,target_blankjimmy,16.114608,5.962236e-05,28,24
224,target_blankjon,13.979777,0.0001847876,23,19
485,pulitzer,13.429676,0.0002476749,15,9
394,liquid,12.719123,0.0003619354,17,12
457,example,11.99798,0.0005325823,15,10


In [236]:
chi_squared_df.shape[0]

35

## Conclusion
In this project, I used a dataset of Jeopardy questions to look for patterns in the questions that could help to win. After exploring the data, I found that:

- On average, about 6% of the words in an answer also appear in its question, so the chance of deducing the answer from the question is quite low.
- About 69% of the complex words in questions have appeared in earlier questions, so studying past questions can be really helpful to win.

Then I focused on high value questions rather than low value ones, since they earn more money. Using a chi-squared test, I obtained a list of 35 words that occur more often in high value questions than in low value ones, with a statistically significant difference in usage.

A next step could be to find the high value questions that contain these words and recommend them for study.
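
That next step could look roughly like the sketch below. It is only an outline on toy rows: the function name `questions_to_study` and the sample rows are hypothetical, and the study words are three of the terms from the table above.

```python
def questions_to_study(rows, study_words):
    """Return high value questions that contain at least one of the study words."""
    study_words = set(study_words)
    selected = []
    for clean_question, high_value in rows:
        # Keep the question if it is high value and shares a word with the study list
        if high_value == 1 and study_words & set(clean_question.split()):
            selected.append(clean_question)
    return selected

# Hypothetical rows: (clean_question, high_value)
toy_rows = [
    ("this painter won a pulitzer", 1),
    ("this painter lived in france", 0),
    ("this liquid boils at 100 degrees", 1),
    ("this country borders france", 1),
]
selected = questions_to_study(toy_rows, ["painter", "pulitzer", "liquid"])
print(selected)
```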