## Project - Winning Jeopardy

Jeopardy! is a popular American game TV show that debuted in 1964 where contestants receive clues in the form of answers and must phrase their responses in the form of a question.
We want to figure out which strategy has the most substantial chance of success, in other words which patterns we can extract from the data that could help a contestant prepare for Jeopardy. 

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

* How often the answer can be used for a question;
* How often questions are repeated;
* Which terms correspond to high-value questions using a chi-squared test;
* Which domain of questions constitutes the majority for the top categories.

For this project we are going to use a [dataset](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file) that contains 20000 rows from the full dataset of Jeopardy questions. Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

* `Show Number` - the Jeopardy episode number
* `Air Date` - the date the episode aired
* `Round` - the round of Jeopardy
* `Category` - the category of the question
* `Value` - the number of dollars the correct answer is worth
* `Question` - the text of the question
* `Answer` - the text of the answer

![Image](https://images.unsplash.com/photo-1604815887789-c076c46a1110?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1771&q=80)
_Photo by justinphoto on Unsplash_

### Exploring the dataset

In [1]:
import pandas as pd
import re
from random import choice
from scipy.stats import chisquare
import numpy as np

In [2]:
jeopardy=pd.read_csv("C:/Users/Denisa/Desktop/Project Apps/project 16/jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have spaces in front so we will remove the spaces from each item in jeopardy.columns and assign the result back to jeopardy.columns to fix the column names in jeopardy.

In [4]:
jeopardy.columns = jeopardy.columns.str.strip()

In [5]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [6]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


### Normalizing Columns

Before we start doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the `Question` and `Answer` columns). For this we will write a function that:
* Takes in a string.
* Converts the string to lowercase.
* Removes all punctuation in the string.
* Returns the string.<br>

The `Value` column should be numeric, so that we can manipulate it easier. The `Air Date` column should also be converted to datetime. 

In [7]:
import re

def normalize_text(text):
    text = text.lower()
    text = re.sub("[^A-Za-z0-9\s]", "", text)
    text = re.sub("\s+", " ", text)
    return text

def normalize_values(text):
    text = re.sub("[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

In [8]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [9]:
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
...,...,...,...,...,...,...,...,...,...,...
19994,3582,2000-03-14,Jeopardy!,U.S. GEOGRAPHY,$200,"Of 8, 12 or 18, the number of U.S. states that...",18,of 8 12 or 18 the number of us states that tou...,18,200
19995,3582,2000-03-14,Jeopardy!,POP MUSIC PAIRINGS,$200,...& the New Power Generation,Prince,the new power generation,prince,200
19996,3582,2000-03-14,Jeopardy!,HISTORIC PEOPLE,$200,In 1589 he was appointed professor of mathemat...,Galileo,in 1589 he was appointed professor of mathemat...,galileo,200
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky,before the grand jury she said im really sorry...,monica lewinsky,200


In [10]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [11]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

### Answers in Questions

One strategy would be not to spend time studying at all but using hints from the question to formulate an answer. To see if the strategy is viable we are going to write a function to find out how often the answer can be used for a question

In [12]:
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [13]:
jeopardy["answer_in_question"].mean()

0.05900196524977763

An average of 6% means that there are low chances that hearing a question will enable us to determine the answer, so we'll probably have to study.

### Recycled Questions

Next we want to find out how often new questions are repeated of older ones to find out if studying older questions is a good strategy. We will not consider any words shorter than 6 characters (to filter out words like the/than/and/or etc, which are commonly used, but don't tell you a lot about a question).

In [14]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        split_question = [q for q in split_question if len(q) > 5]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count += 1
        for word in split_question:
            terms_used.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
        question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6876260592169802

It looks like on average 68% of terms that are used in older questions occur again, so it might be a good strategy to look at questions that were already used to prepare for the show if we want to participate.

### Low Value vs High Value Questions

To increase our chances to earn more money when on Jeopardy we want to study questions that pertain to high value questions instead of low value questions. 
We will figure out which terms correspond to high-value questions using a chi-squared test. First we need to narrow down the questions into two categories:

* Low value -- Any row where Value is less than 800.
* High value -- Any row where Value is greater than 800.

We can find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [15]:
def value_category(row):
 
    
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
    
jeopardy['high_value'] = jeopardy.apply(value_category, axis=1)

In [16]:
def count_value(word):
    low_count = 0
    high_count = 0
    for row in jeopardy.iterrows():
        row = row[1]
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [17]:
comparison_terms = []
comparison_terms = [choice(list(terms_used)) for i in range(10)] # picks a random smaple of 10 terms with replacement

observed_expected = []
for i in comparison_terms:
    result = count_value(i)
    observed_expected.append(result)
    
print(observed_expected)

[(0, 1), (2, 4), (0, 1), (0, 1), (0, 1), (2, 0), (0, 3), (1, 0), (2, 2), (0, 1)]


In [18]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []
for value in observed_expected:
    total = sum(value)
    total_prop = total / jeopardy.shape[0]
    exp_high = high_value_count * total_prop
    exp_low = low_value_count * total_prop
    
    observed = np.array([value[0], value[1]])
    expected = np.array([exp_low, exp_high])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=4.235420876606389, pvalue=0.03958880694352712),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=7.463376351587025, pvalue=0.006296679668748999),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

In [19]:
results = pd.DataFrame(observed_expected, index = comparison_terms, columns = ["Low value count", "High value count"])
results["Chi"] = chi_squared
results[["Chi Square", "p value"]] = pd.DataFrame(results.Chi.tolist(), index= results.index)
results.drop("Chi", axis=1, inplace = True)

results

Unnamed: 0,Low value count,High value count,Chi Square,p value
saintexupery,0,1,2.487792,0.114733
garfunkel,2,4,4.235421,0.039589
chukutien,0,1,2.487792,0.114733
snells,0,1,2.487792,0.114733
leylandmanaged,0,1,2.487792,0.114733
pencils,2,0,0.803926,0.369922
frontiersman,0,3,7.463376,0.006297
dispensersa,1,0,0.401963,0.526077
dividing,2,2,0.889755,0.345544
kirsch,0,1,2.487792,0.114733


### Chi-Squared Results

The pvalues are all less than the 5% threshold. None of the terms was statistically significant in usage between high value and low value rows. Also the chi-squared test is more valid when the freqeuncies are larger

### Popular Categories per Round

The game of Jeopardy has 4 different rounds and we are interested to find out the most frequent category of questions in each of them.

In [20]:
jeopardy['Round'].value_counts(normalize=True)*100

Jeopardy!           49.507475
Double Jeopardy!    48.812441
Final Jeopardy!      1.675084
Tiebreaker           0.005000
Name: Round, dtype: float64

In [21]:
for i in jeopardy['Round'].unique():
    j_round = jeopardy.groupby(['Round']).get_group(i)
    top_cat = j_round['Category'].value_counts(normalize=True)[0]*100
    top_cat_name = j_round['Category'].value_counts().index[0] 
    
    print(f'''
    {top_cat_name} category has {top_cat:.3}% of the questions in {i} round.
''')


    WORD ORIGINS category has 2.39% of the questions in Final Jeopardy! round.


    LITERATURE category has 0.359% of the questions in Double Jeopardy! round.


    TELEVISION category has 0.353% of the questions in Jeopardy! round.


    CHILD'S PLAY category has 1e+02% of the questions in Tiebreaker round.



Jeopardy! and Double Jeopardy! rounds comprise 98% of the data. However even that we know the top categories of questions for these rounds they constitute only a small percent, so probably the questions come from numerous fields and focusing on a particular category won't be significant

### Conclusion

From the analysis above we concluded that:
* An average of 5.8% of the words that form the answer also occur in the question 
* The mean proportion of terms that occured in past questions and are recycled is 68 %
* According to the chi-squared test, we didn't find statistically significant values of frequencies of words that appear more on high value questions.
* Focusing on a particular domain of questions is not significant since they constitute only a small percentage of the total questions
Therefore, from these strategies the best one would be to study past questions.