# Intro

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to find patterns in the questions that could help us win.

## The Data

The dataset is named `jeopardy.csv`, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). 

We'll import needed modules, read the data in, and have a quick look next.

In [2]:
import pandas as pd

jeopardy = pd.read_csv("jeopardy.csv")

jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
...,...,...,...,...,...,...,...
19994,3582,2000-03-14,Jeopardy!,U.S. GEOGRAPHY,$200,"Of 8, 12 or 18, the number of U.S. states that...",18
19995,3582,2000-03-14,Jeopardy!,POP MUSIC PAIRINGS,$200,...& the New Power Generation,Prince
19996,3582,2000-03-14,Jeopardy!,HISTORIC PEOPLE,$200,In 1589 he was appointed professor of mathemat...,Galileo
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky


We can see that each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

 - `Show Number` -- the Jeopardy episode number of the show this question was in.
 
 - `Air Date` -- the date the episode aired.
 
 - `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
 
 - `Category` -- the category of the question.
 
 - `Value` -- the number of dollars answering the question correctly is worth.
 
 - `Question` -- the text of the question.
 
 - `Answer` -- the text of the answer.

In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

The output shows that several of the column names start with an extra space. We'll remove those:

In [4]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
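A more general alternative, sketched here on a small hypothetical frame rather than the real dataset, is to strip whitespace from every column name with `Index.str.strip`. This avoids retyping the names by hand. The frame `raw` and its reduced set of columns are assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-in for the raw CSV: the names carry a leading space.
raw = pd.DataFrame(columns=["Show Number", " Air Date", " Round", " Value"])

# Strip leading and trailing whitespace from every column name at once.
raw.columns = raw.columns.str.strip()

print(list(raw.columns))  # ['Show Number', 'Air Date', 'Round', 'Value']
```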

### Normalizing Text

We'll now normalize the text by converting it to lowercase and removing punctuation, so that words such as `Don't` and `don't` compare as equal.

Steps:

 - Write a function to normalize questions and answers. It should:
   - Take in a string.
   - Convert the string to lowercase.
   - Remove all punctuation in the string.
   - Return the string.
 - Normalize the `Question` column.
   - Use the Pandas `Series.apply` method to apply the function to each item in the `Question` column.
   - Assign the result to the `clean_question` column.
 - Normalize the `Answer` column.
   - Use the Pandas `Series.apply` method to apply the function to each item in the `Answer` column.
   - Assign the result to the `clean_answer` column.

In [5]:
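import re

def normalize_text(text):
    # Lowercase, then drop everything except letters, digits, and whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_values(text):
    # Strip punctuation such as "$" and "," so "$2,000" becomes "2000".
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Values that are not numeric fall back to 0.
        text = 0
    return text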

In [6]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
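As a quick sanity check, the two functions can be tried on a few strings before applying them to whole columns. The sample inputs below are mostly made up for illustration (only `McDonald's` and `$200` appear in the table above), and the functions are repeated in compact form so the snippet runs on its own.

```python
import re

def normalize_text(text):
    text = text.lower()
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

def normalize_values(text):
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

# Illustrative inputs, not rows read from jeopardy.csv.
samples_text = ["McDonald's", "No. 2: 1912 Olympian;"]
samples_value = ["$200", "$2,000", "n/a"]

cleaned_text = [normalize_text(s) for s in samples_text]
cleaned_value = [normalize_values(s) for s in samples_value]

print(cleaned_text)   # ['mcdonalds', 'no 2 1912 olympian']
print(cleaned_value)  # [200, 2000, 0]
```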

In [7]:
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
...,...,...,...,...,...,...,...,...,...,...
19994,3582,2000-03-14,Jeopardy!,U.S. GEOGRAPHY,$200,"Of 8, 12 or 18, the number of U.S. states that...",18,of 8 12 or 18 the number of us states that tou...,18,200
19995,3582,2000-03-14,Jeopardy!,POP MUSIC PAIRINGS,$200,...& the New Power Generation,Prince,the new power generation,prince,200
19996,3582,2000-03-14,Jeopardy!,HISTORIC PEOPLE,$200,In 1589 he was appointed professor of mathemat...,Galileo,in 1589 he was appointed professor of mathemat...,galileo,200
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky,before the grand jury she said im really sorry...,monica lewinsky,200


### Normalizing other columns

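The `Value` column should be numeric, to allow easier manipulation. We need to remove the dollar sign from the beginning of each value and convert the column from text to numeric. The `normalize_values` function defined above does this, and its result is stored in the `clean_value` column.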

Steps: 
 - Write a function to normalize dollar values. It should:
   - Take in a string.
   - Remove any punctuation in the string.
   - Convert the string to an integer.
   - If the conversion has an error, assign 0 instead.
   - Return the integer.
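The same conversion can also be expressed without a Python-level function. The sketch below is an alternative, not the approach this notebook uses. It assumes the values look like `$200` or `$2,000`, and it treats any string without digits as 0. The sample Series is made up.

```python
import pandas as pd

# Illustrative values; the last one stands for a non-numeric entry.
values = pd.Series(["$200", "$2,000", "None"])

# Keep digits only, then parse; strings left empty become NaN and then 0.
digits = values.str.replace(r"[^0-9]", "", regex=True)
clean_value = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

print(clean_value.tolist())  # [200, 2000, 0]
```

Unlike `normalize_values`, this version discards letters before parsing, so a mixed string such as `abc123` would become 123 rather than 0.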

The `Air Date` column should be converted from a string to a datetime.

In [8]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [10]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

### Answers to Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

 - How often the answer is deducible from the question.
 - How often new questions are repeats of older questions.
 
You can answer the second question by seeing how often complex words (6 or more characters) recur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

Steps:

 - Write a function that takes in a row in `jeopardy`, as a Series. It should:
   - Split the `clean_answer` column around spaces and assign to the variable `split_answer`.
   - Split the `clean_question` column around spaces and assign to the variable `split_question`.
   - Create a variable called `match_count`, and set it to `0`.
   - If `the` is in `split_answer`, remove it using the `remove` method on lists. `The` is commonly found in answers and questions, but doesn't have any meaningful use in finding the answer.
   - If the length of `split_answer` is 0, return 0. This prevents a division by zero error later.
   - Loop through each item in `split_answer`, and see if it occurs in `split_question`. If it does, add 1 to `match_count`.
   - Divide `match_count` by the length of `split_answer`, and return the result.
 - Count how many times terms in `clean_answer` occur in `clean_question`.
   - Use the Pandas `DataFrame.apply` method to apply the function to each row in `jeopardy`.
   - Pass the `axis=1` argument to apply the function across each row.
   - Assign the result to the `answer_in_question` column.
 - Find the mean of the `answer_in_question` column using the `mean` method on Series.

In [13]:
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [14]:
jeopardy["answer_in_question"].mean()

0.060493257069335914
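To make the ratio concrete, here is the same logic applied to two hand-written rows. Plain dicts stand in for DataFrame rows, and the texts are invented for illustration.

```python
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for item in split_answer if item in split_question)
    return match_count / len(split_answer)

# Both remaining answer words appear in the question, so the ratio is 1.0.
full = count_matches({"clean_answer": "the eiffel tower",
                      "clean_question": "this tower was built by gustave eiffel"})

# No answer word appears in the question, so the ratio is 0.0.
none = count_matches({"clean_answer": "arizona",
                      "clean_question": "the city of yuma is in this state"})

print(full, none)  # 1.0 0.0
```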

### Answer terms in the question

On average, only about `6%` of the words in an answer also appear in the question. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.

Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

To do this, we can:

 - Sort jeopardy in order of ascending air date.
 - Maintain a set called `terms_used` that will be empty initially.
 - Iterate through each row of jeopardy.
 - Split `clean_question` into words, remove any word shorter than 6 characters, and check if each word occurs in `terms_used`.
 - If it does, increment a counter.
 - Add each word to `terms_used`.

This will enable us to check if the terms in questions have been used previously or not. Only looking at words of 6 or more characters enables us to filter out words like `the` and `than`, which are commonly used, but don't tell us a lot about a question.
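As a small, self-contained illustration of this procedure, the sketch below runs it on three invented questions that are assumed to be already normalized and sorted by air date. The function name `term_overlap` and its `min_length` parameter are made up for the example.

```python
def term_overlap(questions, min_length=6):
    """Return, for each question, the share of its long words seen earlier."""
    terms_used = set()
    overlaps = []
    for question in questions:
        # Keep only words with at least min_length characters.
        words = [w for w in question.split(" ") if len(w) >= min_length]
        seen = sum(1 for w in words if w in terms_used)
        overlaps.append(seen / len(words) if words else 0)
        # Record the words only after scoring the current question.
        terms_used.update(words)
    return overlaps

toy_questions = [
    "galileo studied planets",
    "planets orbit around the sun",
    "galileo observed planets nightly",
]

print(term_overlap(toy_questions))  # [0.0, 0.5, 0.5]
```

The first question scores 0 because nothing has been seen yet. In the second, `planets` is a repeat and `around` is new. In the third, `galileo` and `planets` are repeats out of four long words.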

In [15]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6876260592169776

On average, about `70%` of the longer terms in a question have already appeared in an earlier question. This only looks at a small subset of the questions, and it compares single terms rather than phrases, so on its own it says little about whether whole questions are repeated. It does suggest that the recycling of questions is worth investigating further.


### Low value vs high value questions

Let's say we only want to study the terms that come up in high value questions rather than low value questions. This will help us earn more money on Jeopardy.

We can figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

 - Low value -- Any row where `clean_value` is 800 or less.
 - High value -- Any row where `clean_value` is greater than 800.

We'll then be able to loop through each of the terms from the previous section, `terms_used`, and:

 - Find the number of low value questions the word occurs in.
 - Find the number of high value questions the word occurs in.
 - Find the percentage of questions the word occurs in.
 - Based on the percentage of questions the word occurs in, find expected counts.
 - Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

In [16]:
# Takes in a row from a DataFrame
def determine_value(row):
    
    # Default to 0 (low value)
    value = 0
    
    # Assign 1 if the clean_value column is greater than 800
    if row["clean_value"] > 800:
        value = 1
    
    return value

# apply the function to each row in jeopardy
jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)
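The same flag can also be computed without a row-wise `apply`. The sketch below is an alternative shown on a made-up frame named `toy`. It compares the whole column at once and converts the boolean result to 0 or 1.

```python
import pandas as pd

# Illustrative values, including the 800 boundary and a 0 fallback value.
toy = pd.DataFrame({"clean_value": [200, 800, 1000, 0]})

# Boolean comparison on the whole column, converted to 0/1.
toy["high_value"] = (toy["clean_value"] > 800).astype(int)

print(toy["high_value"].tolist())  # [0, 0, 1, 0]
```

Note that a value of exactly 800 counts as low value, matching `determine_value`.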

In [17]:

# takes in a word
def count_usage(term):
    
    # set counters to 0
    low_count = 0
    high_count = 0
    
    # Loop through each row in jeopardy with iterrows
    for i, row in jeopardy.iterrows():
        
        # Split the clean_question column on the space character
        # and check whether the term is one of the words
        if term in row["clean_question"].split(" "):
            
            # If the high_value column is 1, add 1 to high_count
            if row["high_value"] == 1:
                high_count += 1
            
            # Otherwise increment low_count
            else:
                low_count += 1
    
    return high_count, low_count

# Convert terms_used into a list using the list function,
# and assign the first 5 elements to comparison_terms
# (sets are unordered, so which 5 terms are picked is arbitrary)
comparison_terms = list(terms_used)[:5]

# Create an empty list called observed_expected
observed_expected = []

# Loop through each term in comparison_terms
for term in comparison_terms:
    
    # Apply the function to each term to get the high value and low value counts,
    # and append the result to observed_expected
    observed_expected.append(count_usage(term))

observed_expected

[(1, 2), (1, 0), (1, 0), (1, 0), (4, 5)]
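`count_usage` rescans every row for each term, which becomes slow once more than a handful of terms are compared. One possible alternative, sketched below on toy data, is to tally all words in a single pass with `collections.Counter`. The name `count_usage_fast` and the sample rows are made up for the example.

```python
from collections import Counter

# Toy (clean_question, high_value) pairs standing in for the DataFrame columns.
rows = [
    ("galileo studied planets", 1),
    ("planets orbit the sun", 0),
    ("galileo taught mathematics", 0),
]

high_counts, low_counts = Counter(), Counter()
for question, high_value in rows:
    # A set counts each word once per question, like `term in list` does.
    words = set(question.split(" "))
    if high_value == 1:
        high_counts.update(words)
    else:
        low_counts.update(words)

def count_usage_fast(term):
    # Counter returns 0 for terms it has never seen.
    return high_counts[term], low_counts[term]

print(count_usage_fast("galileo"))  # (1, 1)
print(count_usage_fast("sun"))      # (0, 1)
```

With the real data, the loop could iterate over `zip(jeopardy["clean_question"], jeopardy["high_value"])` instead of the toy list.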

### Applying the chi-squared test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
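As a worked example with made-up numbers (not taken from the dataset), suppose there are 100 questions, 30 of them high value, and a term appears in 6 high value and 4 low value questions. The expected counts and the statistic, the sum of (observed - expected)^2 / expected, can then be computed by hand:

```python
# All numbers here are hypothetical and chosen to keep the arithmetic simple.
total_rows, high_value_count, low_value_count = 100, 30, 70
observed = [6, 4]  # term occurrences in (high, low) value questions

# Share of all questions that contain the term.
total_prop = sum(observed) / total_rows

# Counts we would expect if the term were spread evenly across both groups.
expected = [total_prop * high_value_count, total_prop * low_value_count]

# Pearson's chi-squared statistic.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(expected)             # [3.0, 7.0]
print(round(statistic, 4))  # 4.2857
```

`scipy.stats.chisquare(observed, expected)` returns this same statistic together with a p-value.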



In [18]:
from scipy.stats import chisquare
import numpy as np

# Find the number of rows in jeopardy where high_value is 1, assign to high_value_count
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]

# Find the number of rows in jeopardy where high_value is 0, and assign to low_value_count
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

# Create an empty list called chi_squared
chi_squared = []

# Loop through each list in observed_expected
for obs in observed_expected:
    
    # Get the total count
    # Add up both items in the list (high and low counts) 
    # assign to total
    total = sum(obs)
    
    # Get the proportion across the dataset
    # Divide total by the number of rows in jeopardy 
    # Assign to total_prop
    total_prop = total / jeopardy.shape[0]
    
    # Get the expected term count for high value rows
    # Multiply total_prop by high_value_count   
    high_value_exp = total_prop * high_value_count
    
    # Get the expected term count for low value rows
    # Multiply total_prop by low_value_count
    low_value_exp = total_prop * low_value_count
    
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    
    # Use scipy.stats.chisquare to compute the chi-squared value 
    # and p-value given the expected and observed counts.
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=1.094860558700322, pvalue=0.2953967699181073)]

### Chi-squared results

None of the terms had a statistically significant difference in usage between high value and low value rows (every p-value is above `0.05`). Additionally, the observed counts were very low, mostly below `5`, so the chi-squared test isn't reliable here. It would be better to run this test only on terms that have higher frequencies.
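One way to pick such terms, sketched below on invented questions, is to count how many questions each long word appears in and keep only the words above a cutoff. `MIN_QUESTIONS` is a hypothetical threshold chosen to suit the toy data; a real run would need a larger value.

```python
from collections import Counter

MIN_QUESTIONS = 2  # hypothetical cutoff for this toy example

toy_questions = [
    "galileo studied planets",
    "planets orbit around the sun",
    "galileo observed planets nightly",
]

# Number of questions each long word (6 or more characters) appears in.
question_counts = Counter()
for question in toy_questions:
    question_counts.update({w for w in question.split(" ") if len(w) > 5})

# Keep only the terms that occur in enough questions to be worth testing.
frequent_terms = sorted(
    term for term, n in question_counts.items() if n >= MIN_QUESTIONS
)

print(frequent_terms)  # ['galileo', 'planets']
```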