# Winning Jeopardy - An analysis on a sample of questions 

![jeopardy](jeopardy.jpg)

## Introduction:
[Jeopardy](https://www.jeopardy.com/) is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

We'll work with a dataset containing the first 19,999 rows of a larger dataset of Jeopardy questions, which can be downloaded from [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

### Data Dictionary
|Column|Description|
|:-----|:----------|
|Show Number | the Jeopardy episode number|
|Air Date | the date the episode aired|
|Round | the round of Jeopardy|
|Category | the category of the question|
|Value | the number of dollars the correct answer is worth|
|Question | the text of the question|
|Answer | the text of the answer|




## Goal:
We're looking for any way to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

---

In [75]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
import re

from random import choice

from scipy.stats import chisquare



---

## Loading and Exploring Dataset

In [2]:
data = pd.read_csv('jeopardy.csv')

In [3]:
data.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
data.tail()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
19994,3582,2000-03-14,Jeopardy!,U.S. GEOGRAPHY,$200,"Of 8, 12 or 18, the number of U.S. states that...",18
19995,3582,2000-03-14,Jeopardy!,POP MUSIC PAIRINGS,$200,...& the New Power Generation,Prince
19996,3582,2000-03-14,Jeopardy!,HISTORIC PEOPLE,$200,In 1589 he was appointed professor of mathemat...,Galileo
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky
19998,3582,2000-03-14,Jeopardy!,LLAMA-RAMA,$200,Llamas are the heftiest South American members...,Camels


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1    Air Date    19999 non-null  object
 2    Round       19999 non-null  object
 3    Category    19999 non-null  object
 4    Value       19999 non-null  object
 5    Question    19999 non-null  object
 6    Answer      19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


- There are no null values.
- Except for `Show Number`, every column has the `object` dtype.
    - `Value` should really be an integer and `Air Date` a datetime.

In [6]:
data.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

- Most column names have a leading space.

To fix these issues and others, let's move on to data cleaning.

---

## Data Cleaning
- Normalize all of the text columns (the `Question` and `Answer` columns) by
    - Lowercasing all the words
    - Removing all punctuation

- Normalize the `Value` and `Air Date` columns. The former should be numeric and the latter should be datetime.

In [7]:
# Remove leading spaces from column names
data.columns  = data.columns.str.strip()

In [8]:
# Check
data.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [9]:
def normalize_string(string):
    '''
    Normalizes the text by lowercasing it
    and removing punctuation.
    
    Args:
        string: str; text to normalize
        
    Returns:
        str; Normalized text
    '''
    string = string.lower() # Lowercase.
    string = re.sub(r'[^A-Za-z0-9\s]', '', string) # Remove all punctuation (raw string avoids invalid-escape warnings).
    string = re.sub(r'\s+', ' ', string) # Collapse runs of whitespace into a single space.
    
    return string

Let's apply `normalize_string()` to the `Question` and `Answer` columns and store the results in new columns, `clean_question` and `clean_answer`.

In [10]:
# Column Question
data['clean_question'] = data['Question'].apply(normalize_string)

# Column Answer
data['clean_answer'] = data['Answer'].apply(normalize_string)

In [11]:
# Check Question
data[['clean_question','Question']].head()

Unnamed: 0,clean_question,Question
0,for the last 8 years of his life galileo was u...,"For the last 8 years of his life, Galileo was ..."
1,no 2 1912 olympian football star at carlisle i...,No. 2: 1912 Olympian; football star at Carlisl...
2,the city of yuma in this state has a record av...,The city of Yuma in this state has a record av...
3,in 1963 live on the art linkletter show this c...,"In 1963, live on ""The Art Linkletter Show"", th..."
4,signer of the dec of indep framer of the const...,"Signer of the Dec. of Indep., framer of the Co..."


In [12]:
# Check Answer
data[['clean_answer','Answer']].head()

Unnamed: 0,clean_answer,Answer
0,copernicus,Copernicus
1,jim thorpe,Jim Thorpe
2,arizona,Arizona
3,mcdonalds,McDonald's
4,john adams,John Adams
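
As a quick sanity check on the normalization, here is a small standalone sketch. It re-declares `normalize_string()` so that it runs outside the notebook; the first sample string comes from the preview above, and the second is just the visible prefix of one of the truncated questions.

```python
import re

def normalize_string(string):
    """Lowercase the text, strip punctuation, and collapse extra whitespace."""
    string = string.lower()
    string = re.sub(r'[^a-z0-9\s]', '', string)  # keep only letters, digits, whitespace
    string = re.sub(r'\s+', ' ', string)         # collapse runs of whitespace
    return string

print(normalize_string("McDonald's"))                     # mcdonalds
print(normalize_string("Signer of the Dec. of Indep.,"))  # signer of the dec of indep
```

Note that removing the apostrophe merges `McDonald's` into a single token, so possessives and their base words are treated as different words in the later word-matching steps.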


In [13]:
def normalize_value(string):
    '''
    Normalizes the string by removing 
    punctuation, including the USD sign,
    and tries to convert it into an int.
    If the conversion fails, 0 is
    returned instead.
    
    Args:
        string: str; the raw value, e.g. '$200'
        
    Returns:
        int; the numeric value, or 0
    '''
    value = re.sub(r'[^\w\s]', '', string) # Remove punctuation (raw string avoids invalid-escape warnings)
    
    try: 
        value = int(value) # Convert the digits in the string into an integer
    except ValueError: 
        value = 0  # Fall back to 0 when the string is not a number
   
    return value

Let's apply `normalize_value()` to `Value`, store the result in a new column `clean_value`, and convert `Air Date` into datetime.

In [14]:
# Column Value
data['clean_value'] = data['Value'].apply(normalize_value)

# Column Air Date Conversion into datetime 
data['Air Date'] = pd.to_datetime(data['Air Date'])

In [15]:
# Check
data[['Value', 'clean_value']].head()

Unnamed: 0,Value,clean_value
0,$200,200
1,$200,200
2,$200,200
3,$200,200
4,$200,200
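
The preview only shows `$200`, so here is a small standalone sketch of how the conversion behaves on other inputs. The function is re-declared so it runs by itself; `"$2,000"` and `"None"` are hypothetical inputs chosen to exercise the comma and the fallback branch, not values confirmed to be in the dataset.

```python
import re

def normalize_value(string):
    """Strip punctuation (including '$' and ',') and convert to int; fall back to 0."""
    value = re.sub(r'[^\w\s]', '', string)
    try:
        return int(value)
    except ValueError:
        return 0  # non-numeric strings end up here

print(normalize_value("$200"))    # 200
print(normalize_value("$2,000"))  # 2000
print(normalize_value("None"))    # 0
```

One consequence of the fallback is that any row whose value can't be parsed is recorded as 0, so it will count toward averages and toward the low-value group in the analysis that follows.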


---

## Data Analysis

To decide whether to study past questions, study general knowledge, or not study at all, it would help to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

We can answer the first question by seeing how many times words in the answer also occur in the question. We can answer the second by seeing how often complex words (6 or more characters) recur. We'll work on the first question and come back to the second.

### What are the chances that the answer is found in the question itself?

In [16]:
def match_counter(row): 
    '''
    Takes in a row of the DataFrame as a 
    Series and counts how many words 
    of the answer also occur in the 
    question.
    Args:
        row: Series; a row of the DataFrame.
    Returns:
        float; fraction of the answer's words 
        that also occur in the question
    '''
    # Splits the strings
    split_question = row['clean_question'].split()  
    split_answer = row['clean_answer'].split()
    
    # Counter to count co occurences
    match_count = 0
    
    if 'the' in split_answer: # ignore the word 'the' (only its first occurrence is removed)
        split_answer.remove('the')
    
    if len(split_answer) == 0:
        return 0 # prevent division by zero error
    
    # Count the answer words that also appear in the question
    for word in split_answer:
        if word in split_question:
            match_count += 1 # count
            
    return match_count / len(split_answer)

Let's count how many of the terms in `clean_answer` occur in `clean_question` and assign the result for each row to a new column, `answer_in_question`.

In [17]:
# Apply to whole dataframe and pass the axis=1 argument to apply the function across each row.
data['answer_in_question'] = data.apply(match_counter, axis=1)

In [18]:
# Calculate Mean
data['answer_in_question'].mean()

0.059001965249777744

- On average, only about 6% of the words in an answer also appear in its question. We therefore shouldn't expect to find the answer within the words of the question.
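
To make the metric concrete, here is a minimal sketch that runs the same logic on two made-up rows. Plain dicts stand in for DataFrame rows, and both rows are hypothetical examples rather than questions from the dataset.

```python
def match_counter(row):
    """Fraction of answer words (ignoring one 'the') that also appear in the question."""
    split_question = row['clean_question'].split()
    split_answer = row['clean_answer'].split()
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Hypothetical rows; dicts support the same row['column'] lookup as a Series
rows = [
    {'clean_question': 'this city is the capital of france', 'clean_answer': 'paris france'},
    {'clean_question': 'he wrote hamlet', 'clean_answer': 'the bard'},
]
print([match_counter(r) for r in rows])  # [0.5, 0.0]
```

In the first row, one of the two answer words (`france`) appears in the question, giving 0.5. In the second, `the` is dropped and `bard` is absent from the question, giving 0.0.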

### How often are new questions repeats of older ones?
Given that we only have a fraction (about 10%) of the full Jeopardy question dataset, we can't answer this completely. However, we can at least investigate it.

In [19]:
question_overlap = []
terms_used = set()

# Sort the data in ascending order with respect to Air Date
data = data.sort_values(by='Air Date')

for index, row in data.iterrows():
    split_question = row['clean_question'].split(' ')
    
    # keep only words of 6 or more characters; an arbitrary cutoff to filter out words like 'the', 'is' etc.
    split_question = [q for q in split_question if len(q) > 5]
    
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
        
data['question_overlap'] = question_overlap

# Mean of the column
data['question_overlap'].mean()

0.6876260592169776

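- On average, about 69% of the longer words (6 or more characters) in a question have already appeared in an earlier question. This measures the overlap of single words rather than whole phrases, and it covers only about 10% of the full Jeopardy dataset, but it does hint at some question recycling.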
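
The overlap calculation can be easier to follow on a tiny example. The sketch below applies the same idea to three made-up questions; `question_overlaps` is a hypothetical helper name, and the questions are assumed to be already cleaned and sorted by air date.

```python
def question_overlaps(questions, min_len=6):
    """For each question, in order, return the share of its long words seen in earlier questions."""
    terms_used = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split(' ') if len(w) >= min_len]
        seen = sum(1 for w in words if w in terms_used)
        overlaps.append(seen / len(words) if words else 0)
        terms_used.update(words)  # only now do this question's words become "old"
    return overlaps

# Hypothetical questions, oldest first
questions = [
    "galileo studied planets",
    "planets orbit around the sun",
    "galileo observed jupiter",
]
print(question_overlaps(questions))  # roughly [0.0, 0.5, 0.33]
```

The first question always scores 0 because nothing has been seen yet, and scores tend to rise as the vocabulary accumulates. That suggests the mean depends on how many questions are processed, which is one more reason to treat the 69% figure as a rough indication.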

### What are the trends of High-Value questions?

First, let's split the questions into high-value and low-value groups. Then, for each word in the set `terms_used` we built earlier, let's:
- Find the number of low-value questions the word occurs in.
- Find the number of high-value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find the expected counts.
- Compute the chi-squared value based on the expected counts and the observed counts for high- and low-value questions.
- We can then find the words with the biggest differences in usage between high- and low-value questions by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
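
In symbols, the steps above amount to the following. The notation is introduced here for illustration only: a word occurs in $O_{high}$ high-value and $O_{low}$ low-value questions, the two groups contain $N_{high}$ and $N_{low}$ questions, and $N = N_{high} + N_{low}$.

$$
p = \frac{O_{high} + O_{low}}{N}, \qquad
E_{high} = p \, N_{high}, \qquad
E_{low} = p \, N_{low}
$$

$$
\chi^2 = \frac{(O_{high} - E_{high})^2}{E_{high}} + \frac{(O_{low} - E_{low})^2}{E_{low}}
$$

With two groups there is one degree of freedom, and a larger $\chi^2$ means the word's split between high- and low-value questions departs further from what the group sizes alone would predict.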

To look into the questions' values, let's turn to the `clean_value` column.

In [28]:
# Mean of questions' values
data['clean_value'].mean()

748.3362668133407

Let's split the questions into two categories:

- Low value: any row where `clean_value` is 800 or less.
- High value: any row where `clean_value` is greater than 800.

In [45]:
def get_value(row):
    '''
    Takes in a row of the DataFrame as a 
    Series and categorizes the question 
    as either high or low value, with
    the threshold set at 800.
    Args:
        row: Series; a row of the DataFrame
    Returns:
        int; 1 if the value is above 800, else 0
    '''
    if row['clean_value'] > 800:
        return 1
    # else
    return 0
        

In [46]:
# Apply the function 
data['high_value'] = data.apply(get_value, axis=1)

# Check
data['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [69]:
def count_values(word):
    '''
    Counts the number of high- and low-value questions 
    the word occurs in.
    Args:
        word: str; the word to look for
    Returns:
        tuple; (high_count, low_count)
    '''
    low_count = 0
    high_count = 0
    for i, row in data.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

Let's randomly choose some words from the set `terms_used` and see how many high-value and low-value questions each one appears in.

In [70]:
# from random import choice
terms_list = list(terms_used) # build the list once instead of on every iteration

# Let's choose 10 words randomly (the same word may be drawn more than once)
comparison_terms = [choice(terms_list) for _ in range(10)]

comparison_terms

['carassos',
 'abagnale',
 'respect',
 'balances',
 'muchsought',
 '10sup23sup',
 'binghams',
 'longitude',
 'ranklesa',
 'serenading']
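
Because the draw is unseeded, the ten words above will differ from run to run. If a reproducible sample is wanted, one option is sketched below, assuming a fixed seed is acceptable. `sample_terms` and the toy `terms_used` set are hypothetical. The set is sorted first because the iteration order of a set of strings can change between Python runs, and `sample` draws without replacement, unlike repeated calls to `choice`.

```python
import random

def sample_terms(terms, k, seed=42):
    """Draw k distinct terms reproducibly from a set of strings."""
    rng = random.Random(seed)           # private generator, leaves global state alone
    return rng.sample(sorted(terms), k) # sorting gives a stable order to sample from

# Toy stand-in for the notebook's terms_used set
terms_used = {"galileo", "jupiter", "longitude", "respect", "balances", "serenading"}

first = sample_terms(terms_used, 3)
second = sample_terms(terms_used, 3)
print(first == second)  # True
```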

In [72]:
# Save the observed (high_count, low_count) pair for each term into a list
observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_values(term)) # Count high- and low-value questions containing the term

# see the results
observed_expected

[(0, 1),
 (1, 1),
 (2, 6),
 (0, 1),
 (0, 1),
 (1, 1),
 (1, 0),
 (1, 1),
 (1, 0),
 (0, 1)]
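
`count_values()` scans the entire DataFrame once per word, which is why running it for every term would be slow. A possible alternative, sketched here under the assumption that the counts for all words fit in memory, is to build every word's counts in a single pass. `build_term_counts` and the sample rows are hypothetical.

```python
from collections import defaultdict

def build_term_counts(rows):
    """rows: iterable of (clean_question, high_value) pairs.

    Returns a dict mapping word -> [high_count, low_count], counting each
    question at most once per word, as count_values() does.
    """
    counts = defaultdict(lambda: [0, 0])
    for question, high_value in rows:
        for word in set(question.split(' ')):  # set() so a repeated word counts once
            if high_value == 1:
                counts[word][0] += 1
            else:
                counts[word][1] += 1
    return counts

# Hypothetical rows: (clean_question, high_value)
sample_rows = [
    ("galileo studied jupiter", 1),
    ("galileo taught mathematics", 0),
    ("jupiter has many moons", 0),
]
term_counts = build_term_counts(sample_rows)
print(tuple(term_counts["galileo"]))  # (1, 1)
print(tuple(term_counts["moons"]))    # (0, 1)
```

With the notebook's DataFrame, the rows could be supplied as `zip(data['clean_question'], data['high_value'])`, after which looking up any term is a dictionary access.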

## Applying Chi-Squared Test

In [79]:
# Number of rows in each subset, i.e., high-value and low-value questions
high_value_count = data[data['high_value'] == 1].shape[0]
low_value_count = data[data['high_value'] == 0].shape[0]

In [80]:
# from scipy.stats import chisquare
chi_squared = []

for observation in observed_expected:
    total = observation[0] + observation[1] # Summing up the pair
    total_prop = total/data.shape[0] # Dividing variable total by the total rows in entire df
    
    # Expected Term Counts for high-value and low_value counts
    high_value_expected = total_prop * high_value_count
    low_value_expected = total_prop * low_value_count
    
    # Chi-Squared value and p-value given the expected and observed counts.
    observed = np.array([observation[0], observation[1]])
    expected = np.array([high_value_expected, low_value_expected])
    
    # Calculate and append to the list chi_squared
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.05272886616881538, pvalue=0.818381104912348),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

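- None of the calculated p-values is less than 0.05, so there is no significant difference between the observed and expected values for these terms. Note, though, that every sampled term occurs in fewer than 10 questions, so the expected counts are very small. The chi-squared test is generally considered reliable only when expected counts are around 5 or more, so terms with higher frequencies would give a more meaningful test.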
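
To see where these numbers come from, the sketch below recomputes two of the results by hand using only the standard library. The group sizes are taken from the `value_counts()` output above; `chi_squared_1dof` is a hypothetical helper, and it relies on the fact that for one degree of freedom the chi-squared p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math

high_value_count = 5734   # rows with high_value == 1
low_value_count = 14265   # rows with high_value == 0
total_rows = high_value_count + low_value_count

def chi_squared_1dof(observed_high, observed_low):
    """Return (expected_high, expected_low, statistic, p_value) for one term."""
    total_prop = (observed_high + observed_low) / total_rows
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return expected_high, expected_low, statistic, p_value

for observed in [(0, 1), (2, 6)]:
    e_high, e_low, stat, p = chi_squared_1dof(*observed)
    print(observed, round(e_high, 2), round(e_low, 2), round(stat, 4), round(p, 4))
```

The statistics and p-values should agree with the first and third `chisquare` results above. The expected high-value counts come out at roughly 0.29 and 2.29, both well under 5, which illustrates why these particular tests carry little weight.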

## Which categories appear the most often and what is the probability of each category appearing in each round?

Each game of Jeopardy! features three contestants competing in three rounds: 
1. Jeopardy! 
2. Double Jeopardy! and 
3. Final Jeopardy!

In [94]:
data['Category'].value_counts(ascending=False)

TELEVISION             51
U.S. GEOGRAPHY         50
LITERATURE             45
AMERICAN HISTORY       40
HISTORY                40
                       ..
UP YOUR "ALLEY"         1
EXPLORATION             1
ROCK SONGS              1
RIVERS                  1
SPORTS & THE MOVIES     1
Name: Category, Length: 3581, dtype: int64

- There are 3581 distinct categories.

### Most common categories for each round

In [115]:
for each_round in data['Round'].unique():
    print('ROUND:',each_round)
    print(data[data['Round'] == each_round]['Category'].value_counts()[:10])
    print('-'*20)


ROUND: Final Jeopardy!
WORD ORIGINS         8
U.S. PRESIDENTS      5
FAMOUS NAMES         4
AUTHORS              4
SPACE EXPLORATION    3
ASIA                 3
FAMOUS WOMEN         3
U.S. STATES          3
WORLD GEOGRAPHY      3
THE 50 STATES        3
Name: Category, dtype: int64
--------------------
ROUND: Double Jeopardy!
LITERATURE           35
SCIENCE & NATURE     30
ISLANDS              30
BEFORE & AFTER       30
IN THE DICTIONARY    30
U.S. GEOGRAPHY       28
OPERA                25
HISTORIC NAMES       25
WORLD CAPITALS       25
SCIENCE              25
Name: Category, dtype: int64
--------------------
ROUND: Jeopardy!
TELEVISION        35
SPORTS            26
FOOD FACTS        25
RHYME TIME        25
U.S. CITIES       25
BIRDS             23
U.S. GEOGRAPHY    22
COMMON BONDS      20
MUSEUMS           20
BRAND NAMES       20
Name: Category, dtype: int64
--------------------
ROUND: Tiebreaker
CHILD'S PLAY    1
Name: Category, dtype: int64
--------------------


In [124]:
# Lets look into the probabilities 
for each_round in data['Round'].unique():
    print('ROUND:',each_round)
    print(data[data['Round'] == each_round]['Category'].value_counts(normalize=True)[:10])
    print('-'*20)

ROUND: Final Jeopardy!
WORD ORIGINS         0.023881
U.S. PRESIDENTS      0.014925
FAMOUS NAMES         0.011940
AUTHORS              0.011940
SPACE EXPLORATION    0.008955
ASIA                 0.008955
FAMOUS WOMEN         0.008955
U.S. STATES          0.008955
WORLD GEOGRAPHY      0.008955
THE 50 STATES        0.008955
Name: Category, dtype: float64
--------------------
ROUND: Double Jeopardy!
LITERATURE           0.003585
SCIENCE & NATURE     0.003073
ISLANDS              0.003073
BEFORE & AFTER       0.003073
IN THE DICTIONARY    0.003073
U.S. GEOGRAPHY       0.002868
OPERA                0.002561
HISTORIC NAMES       0.002561
WORLD CAPITALS       0.002561
SCIENCE              0.002561
Name: Category, dtype: float64
--------------------
ROUND: Jeopardy!
TELEVISION        0.003535
SPORTS            0.002626
FOOD FACTS        0.002525
RHYME TIME        0.002525
U.S. CITIES       0.002525
BIRDS             0.002323
U.S. GEOGRAPHY    0.002222
COMMON BONDS      0.002020
MUSEUMS         

- In the first round, *Jeopardy!*, the most common categories were Television, Sports, Food Facts, Rhyme Time, U.S. Cities, Birds, U.S. Geography, Common Bonds, Museums and Brand Names. These categories don't seem to require advanced knowledge: they are very general, and surface-level knowledge should be good enough to compete in Round 1.
- In the second round, *Double Jeopardy!*, the categories seem more demanding. Categories like Literature, Science, Islands, U.S. Geography, Historic Names, World Capitals and Opera are not everyone's area of interest. The most common are Literature, Science & Nature, Islands, Before & After and In the Dictionary.
- The final round, *Final Jeopardy!*, consists of more domain-specific categories like Word Origins, Space Exploration, U.S. Presidents, Authors and Asia. The most common are Word Origins, U.S. Presidents, Famous Names and Authors.

However, these probabilities are too low to be useful: the most common category accounts for about 2.4% of the questions in *Final Jeopardy!* and under 0.4% in each of the other two main rounds. With probabilities like these, we can't guess which category our question will come from.
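
For reference, each probability above is simply a category's share of the questions within its round. Here is a minimal sketch of that calculation without pandas; `category_shares` and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def category_shares(rows):
    """rows: iterable of (round_name, category) pairs.

    Returns {round_name: {category: share of that round's questions}}.
    """
    counts = defaultdict(Counter)
    for round_name, category in rows:
        counts[round_name][category] += 1
    return {
        round_name: {cat: n / sum(counter.values()) for cat, n in counter.most_common()}
        for round_name, counter in counts.items()
    }

# Toy rows, not taken from the dataset
rows = [
    ("Jeopardy!", "TELEVISION"), ("Jeopardy!", "TELEVISION"),
    ("Jeopardy!", "SPORTS"), ("Jeopardy!", "BIRDS"),
    ("Final Jeopardy!", "WORD ORIGINS"),
]
shares = category_shares(rows)
print(shares["Jeopardy!"]["TELEVISION"])  # 0.5
```

Because the share is computed per question, a category that fills a whole column of the board contributes several rows at once, so these figures describe questions rather than the chance of a category appearing in a given game.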