# Winning on Jeopardy!
**Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help someone to win.**

The dataset is named jeopardy.csv. It is a subset of 20,000 rows (out of roughly 220,000) taken from the beginning of the full dataset of Jeopardy questions.

Let's begin by reading the file and doing some initial setup for the project.

In [None]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv") # read the dataset.
jeopardy.head() # display the first 5 rows.

In [None]:
print(jeopardy.columns) # print out the columns.

# clear whitespace on column names.
jeopardy.columns = jeopardy.columns.str.strip()

print(jeopardy.columns) # print columns again.

### Normalize the Question/Answer columns
The idea is to lowercase words and remove punctuation so that Don't and don't aren't treated as different words when we compare them.

In [None]:
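import re

def normalize_text(text_str):
    text_str = text_str.lower() # lowercase the string
    # Remove every character that is not a word character or whitespace
    # (i.e. punctuation symbols). str.replace() would treat the pattern
    # as a literal string, so we use re.sub() instead.
    text_str = re.sub(r"[^\w\s]+", '', text_str)
    return text_str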

# Perform normalization.
# Apply the function to each item in the Question column 
# and assign the result to the clean_question column (a new column).
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)

# Apply the function to each item in the Answer column 
# and assign the result to the clean_answer column (a new column).
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)

jeopardy
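
As a quick sanity check of the idea, here is a minimal standalone sketch that strips punctuation with `re.sub`. The helper name `strip_punctuation` and the sample strings are illustrative assumptions, not values taken from the dataset.

```python
import re

def strip_punctuation(text):
    """Lowercase `text` and drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]+", "", text.lower())

# Hypothetical sample strings, chosen only to illustrate the behaviour.
samples = ["Don't", "don't", "Who is J.R.R. Tolkien?"]
cleaned = [strip_punctuation(s) for s in samples]
print(cleaned)  # ['dont', 'dont', 'who is jrr tolkien']
```

After normalization, "Don't" and "don't" collapse to the same token, which is what the later word comparisons rely on.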

### Normalize the Value & Air Date columns.
The Value column should be numeric so that we can manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and the "," thousands separator, then convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, to enable us to work with it more easily.

In [None]:
# Define a second, specialized normalization function.
def normalize_text_2(text_str):
    # Remove the "$" character in front of the value.
    text_str = text_str.replace("$", '')
    
    # Remove the "," character (thousand separator) from the value.
    text_str = text_str.replace(",", '') 
    
    # Convert the string to an integer...
    try:
        to_integer = int(text_str)
    # ... and fall back to 0 if the text is not a valid integer.
    except ValueError:
        to_integer = 0
    
    return to_integer

In [None]:
# Normalize the Value column.
# Apply the function to each item in the Value column & assign
# the result to the "clean_value" column (a new column).
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_text_2)

# Apply the function to convert the Air Date column to a datetime column.
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

jeopardy.dtypes # Display the types to double check.
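
The same cleanup can also be expressed with vectorized pandas operations. The following is a minimal sketch on a tiny made-up frame; the sample values are assumptions for illustration, and `pd.to_numeric(..., errors="coerce")` stands in for the try/except fallback to 0.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from jeopardy.csv.
sample = pd.DataFrame({
    "Value": ["$200", "$1,000", "None"],
    "Air Date": ["2004-12-31", "2010-07-06", "1997-02-11"],
})

# Remove "$" and "," as literal characters (regex=False), then turn
# anything that is still not a number into NaN and replace it with 0.
stripped = (
    sample["Value"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)
sample["clean_value"] = pd.to_numeric(stripped, errors="coerce").fillna(0).astype(int)

# Parse the date strings into datetime values.
sample["Air Date"] = pd.to_datetime(sample["Air Date"])

print(sample["clean_value"].tolist())  # [200, 1000, 0]
print(sample.dtypes)
```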

### Study the data or not?
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

**1. How often the answer is deducible from the question.**

**2. How often new questions are repeats of older questions.**

We can answer the first question by seeing how many times words in the answer also occur in the question.
We can answer the second question by seeing how often complex words (6 or more characters) reoccur.

#### Question #1: How often the answer is deducible from the question.

In [None]:
# Define a function that takes a row from jeopardy and returns the
# share of answer words that also appear in the question.
def count_matches(row):
    # Split the clean_answer column around spaces
    split_answer = row["clean_answer"].split(" ")

    # Split the clean_question column around spaces
    split_question = row["clean_question"].split(" ")
    
    match_count = 0
    
    # Remove the "the" article from the answer.
    split_answer = [elem for elem in split_answer if elem != "the"]

    # If there's no answer at all, return with 0.
    if len(split_answer) == 0:
        return 0
    
    # Loop through each item in split_answer and see if it occurs
    # in split_question. If it is, count it.
    for item in split_answer:
        if item in split_question: 
            match_count += 1       
            
    # Return the proportion of answer words that also occur
    # in the question.
    return match_count / len(split_answer)
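
To make the metric concrete, here is a small standalone sketch of the same calculation applied to two made-up question/answer pairs. The function name `overlap_ratio` and the sample strings are hypothetical and are not taken from the dataset.

```python
def overlap_ratio(clean_question, clean_answer):
    """Share of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in clean_answer.split(" ") if w != "the"]
    if not answer_words:
        return 0
    question_words = clean_question.split(" ")
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# No answer word appears in the question.
no_overlap = overlap_ratio("this british rock band sang hey jude", "the beatles")
# "new" appears in the question, "hampshire" does not.
half_overlap = overlap_ratio("this new england state borders new york", "new hampshire")

print(no_overlap)    # 0.0
print(half_overlap)  # 0.5
```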

In [None]:
# Apply the function to each row of jeopardy and measure how often
# terms in clean_answer occur in clean_question.
# Create a new column with the result.
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [None]:
# Find the mean of the answer_in_question column.
mean_answer_in_q = jeopardy["answer_in_question"].mean()
print(round(mean_answer_in_q * 100, 3))

#### Question #1 Answer.
On average, only **about 4.5%** of the words in an answer also appear in its question. This isn't a large share, so we probably can't count on the question itself giving the answer away.

So, the answer to **"How often the answer is deducible from the question?"** is that it rarely happens. 

#### Conclusion:
We'll probably have to study.

#### Question #2. How often new questions are repeats of older questions?
We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

In [None]:
question_overlap = []
terms_used = set()

# Sort Jeopardy by ascending air date. (OLDER FIRST)
jeopardy = jeopardy.sort_values("Air Date", ascending=True)

# Loop through each row of jeopardy.
for idx, row in jeopardy.iterrows():
    # Split the clean_question column of the row on the space character.
    split_question = row["clean_question"].split(" ")
    
    # Remove any word shorter than 6 characters.
    split_question = [r_words for r_words in split_question
                      if len(r_words) >= 6]

    match_count = 0
    
    # Loop through each word in split_question. If the term occurs
    # already in terms_used, count it. 
    # (That means the term is repeated in the past)
    for word in split_question:
        if word in terms_used:
            match_count += 1
        else:
            # If the word has not been seen before, add it to terms_used.
            terms_used.add(word)
    
    # If any long words remain, divide the number of repeated words
    # by the number of words with 6 or more characters to get the
    # share of this question's terms that were seen before.
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    # Add the mean just counted to the list
    # and repeat the process to the next row of Jeopardy.
    question_overlap.append(match_count)

# Create a new column with the question_overlap list
# and find the mean of it (the mean of all means).   
jeopardy["question_overlap"] = question_overlap
print("Proportional mean:", round(
    jeopardy["question_overlap"].mean() * 100, 2))
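
To see how the running overlap behaves, here is a minimal sketch of the same loop on three made-up questions, listed from oldest to newest. The question texts are assumptions chosen for illustration only.

```python
# Hypothetical, already-normalized questions in chronological order.
questions = [
    "this planet is closest to the sun",
    "mercury is the closest planet to this star",
    "the largest planet in our solar system",
]

seen_terms = set()
overlaps = []

for question in questions:
    # Keep only words with 6 or more characters.
    words = [w for w in question.split(" ") if len(w) >= 6]
    repeated = 0
    for word in words:
        if word in seen_terms:
            repeated += 1
        else:
            seen_terms.add(word)
    # Share of this question's long words that were seen earlier.
    overlaps.append(repeated / len(words) if words else 0)

print([round(o, 2) for o in overlaps])  # [0.0, 0.67, 0.33]
```

The first question can never overlap with anything, and later questions score higher as the set of seen terms grows, which is one reason the overall mean depends on how much of the history is available.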

#### Question #2 Answer.
We have 60% overlap between terms in new questions and terms in old questions. Two caveats apply to this measurement:

* We only have a small part of the Jeopardy dataset (around 10%).
* The test looked at single words (terms), not phrases.

#### Conclusion:

Despite these limitations, it's worth investigating further. I have a feeling that Jeopardy recycles questions over time, perhaps by rephrasing them in one way or another.

## Focus on high value questions.
Let's say you only want to study high-value questions instead of low-value questions. This will help you earn more money when you're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

* Low value -- Any row where Value is 800 or less.
* High value -- Any row where Value is greater than 800.

In [None]:
# Classify a row as high or low value.
# A row is "High" if its value is > $800; otherwise it is "Low".
def quest(row):
    if row["clean_value"] > 800: # if value is greater than 800
        value = 1                # count it as "High" value
    else:
        value = 0                # else, count it as "Low"
    return value

# Apply the function to dataset and get a new column
# with 1's (aka "High") or 0's (aka "Low")
jeopardy["high_value"] = jeopardy.apply(quest, axis=1)
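
The same flag can be computed without a row-wise `apply`. This is a minimal sketch on a made-up column of values; it assumes `clean_value` already holds integers, and it shows that a value of exactly 800 lands in the low group.

```python
import pandas as pd

# Hypothetical values, including the boundary case of exactly 800.
sample = pd.DataFrame({"clean_value": [200, 800, 1000, 0]})

# The comparison yields booleans; astype(int) maps True/False to 1/0.
sample["high_value"] = (sample["clean_value"] > 800).astype(int)

print(sample["high_value"].tolist())  # [0, 0, 1, 0]
```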

In [None]:
# Count how many high-value and low-value questions contain a given word.
def quest_word(word):
    low_count = 0
    high_count = 0
    
    # Loops through each row in Jeopardy
    for idx, row in jeopardy.iterrows():

        # Split the question into words
        sq = row["clean_question"].split(" ")
        
        # If the word passed into the function is contained in the question...
        if word in sq:
            # ... and the row we're looping over is a "High" row
            if row["high_value"] == 1:
                high_count += 1 # count as "High".
            else:
                low_count += 1  # count as "Low".
    # return the counters            
    return high_count, low_count
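
Looping over every row once per term gets slow when many terms are tested. One possible alternative, sketched here on a made-up frame, is to build the per-word counts in a single pass with `collections.Counter`. The names `high_counts`, `low_counts`, and `term_counts` are hypothetical, as are the sample questions.

```python
from collections import Counter

import pandas as pd

# Hypothetical rows standing in for the real jeopardy DataFrame.
sample = pd.DataFrame({
    "clean_question": [
        "this famous composer wrote fidelio",
        "this composer was born in bonn",
        "famous river in egypt",
    ],
    "high_value": [1, 0, 0],
})

high_counts, low_counts = Counter(), Counter()
for question, is_high in zip(sample["clean_question"], sample["high_value"]):
    # A set counts each word once per question, matching `word in sq`.
    unique_words = set(question.split(" "))
    if is_high == 1:
        high_counts.update(unique_words)
    else:
        low_counts.update(unique_words)

def term_counts(word):
    """Return (high_count, low_count) for a word; unseen words give (0, 0)."""
    return high_counts[word], low_counts[word]

print(term_counts("composer"))  # (1, 1)
print(term_counts("egypt"))     # (0, 1)
```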

### Let's do some testing by selecting 10 random elements.

In [None]:
import random

# Sample the terms set: select 10 terms at random.
# random.sample() needs a sequence, so convert the set to a list first.
comparison_terms = random.sample(list(terms_used), k=10)
# Observed (high, low) counts for each sampled term.
observed_expected = []

# Loop through each term and run the function on the term
# to get the high value and low value counts.
for term in comparison_terms:
    result = quest_word(term)
    observed_expected.append(result)

# Print the observed high-value and low-value counts.
print(observed_expected)

#### Compute the expected counts and the chi-squared value.
Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [None]:
from scipy.stats import chisquare
import numpy as np

# Find the number of rows in jeopardy where high_value is 1.
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]

# Find the number of rows in jeopardy where high_value is 0.
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = [] # This list will hold the results.
# Loop through each (high, low) tuple in observed_expected
for obs in observed_expected:
    # Add up both items in the list (high and low counts) 
    # to get the total count.
    total = sum(obs)
    
    # Divide total by the number of rows in jeopardy
    # to get the proportion across the dataset.
    total_prop = total / jeopardy.shape[0]
    
    # Multiply total_prop by high_value_count to get 
    # the expected term count for high value rows.
    high_value_exp = total_prop * high_value_count
    
    # Multiply total_prop by low_value_count to get 
    # the expected term count for low value rows.
    low_value_exp = total_prop * low_value_count
    
    # Compute the chi-squared value and p-value 
    # given the expected and observed counts.
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared
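
To check the arithmetic, here is a worked example with made-up numbers: a hypothetical dataset of 1,000 rows (300 high value, 700 low value) and a term seen 12 times in high-value rows and 8 times in low-value rows. It compares a manual chi-squared calculation with `scipy.stats.chisquare`.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical totals and observed counts (not from the real dataset).
high_value_total, low_value_total = 300, 700
observed = np.array([12, 8])  # (high, low) counts for one term

# Expected counts if the term were spread evenly across all rows.
total_prop = observed.sum() / (high_value_total + low_value_total)
expected = np.array([total_prop * high_value_total,
                     total_prop * low_value_total])  # 6 and 14

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
manual_stat = ((observed - expected) ** 2 / expected).sum()
result = chisquare(observed, expected)

print(round(manual_stat, 3))       # 8.571
print(round(result.statistic, 3))  # 8.571
print(result.pvalue < 0.05)        # True
```

With these invented counts the term shows up in high-value rows twice as often as expected, which is the kind of difference the test is meant to flag.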

### Chi-squared results
None of the terms had a significant difference in usage between high-value and low-value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test is less reliable here. It would be better to run this test only on terms that have higher frequencies.
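
One way to act on this is to keep only terms whose expected counts are large enough before running the test. The sketch below assumes the common rule of thumb of at least 5 expected occurrences in each group; the function name `has_enough_data`, the threshold parameter, and the totals are hypothetical.

```python
def has_enough_data(high_count, low_count, high_total, low_total, min_expected=5):
    """Return True if both expected counts reach `min_expected`."""
    total = high_count + low_count
    total_prop = total / (high_total + low_total)
    expected_high = total_prop * high_total
    expected_low = total_prop * low_total
    return min(expected_high, expected_low) >= min_expected

# Hypothetical dataset totals: 300 high-value rows and 700 low-value rows.
rare_term = has_enough_data(1, 1, high_total=300, low_total=700)
common_term = has_enough_data(12, 8, high_total=300, low_total=700)

print(rare_term)    # False (expected counts are 0.6 and 1.4)
print(common_term)  # True  (expected counts are 6 and 14)
```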