## Jeopardy Questions Project

This project focuses on the cleaning, exploration, and analysis of Jeopardy game show questions using Python, pandas and Chi-Square Test. It serves as a practical example of data wrangling techniques applied to real-world unstructured text data.

The goal is to answer questions such as:

- What are the most common question topics in Jeopardy?
- Is there a relationship between the difficulty of a question and its monetary value?
- Are certain categories more frequently featured in higher-value questions?

I will start with using pandas to open the file.

In [16]:
import pandas as pd 

jeopardy = pd.read_csv('JEOPARDY_CSV.CSV')

jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
...,...,...,...,...,...,...,...
216925,4999,2006-05-11,Double Jeopardy!,RIDDLE ME THIS,$2000,This Puccini opera turns on the solution to 3 ...,Turandot
216926,4999,2006-05-11,Double Jeopardy!,"""T"" BIRDS",$2000,In North America this term is properly applied...,a titmouse
216927,4999,2006-05-11,Double Jeopardy!,AUTHORS IN THEIR YOUTH,$2000,"In Penny Lane, where this ""Hellraiser"" grew up...",Clive Barker
216928,4999,2006-05-11,Double Jeopardy!,QUOTATIONS,$2000,"From Ft. Sill, Okla. he made the plea, Arizona...",Geronimo


In [17]:
print(jeopardy[:5])

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  


In [18]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

There are spaces in some of the column names, so I will clean the data first before starting to analyze it.

In [23]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
       'Question', 'Answer']

In [25]:
import csv
import re

def normalize_string(string):
    string = str(string).lower()
    string = re.sub("[^A-Za-z0-9\s]", "", string)
    string = re.sub("\s+", " ", string)
    return string 

def normalize_values(string):
    if pd.isna(string):
        return 0
    string = re.sub("[^A-Za-z0-9\s]", "", string)
    try:
        string = int(string)
    except Exception:
        string = 0
    return string


In [27]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_string)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_string)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [28]:
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
...,...,...,...,...,...,...,...,...,...,...
216925,4999,2006-05-11,Double Jeopardy!,RIDDLE ME THIS,$2000,This Puccini opera turns on the solution to 3 ...,Turandot,this puccini opera turns on the solution to 3 ...,turandot,2000
216926,4999,2006-05-11,Double Jeopardy!,"""T"" BIRDS",$2000,In North America this term is properly applied...,a titmouse,in north america this term is properly applied...,a titmouse,2000
216927,4999,2006-05-11,Double Jeopardy!,AUTHORS IN THEIR YOUTH,$2000,"In Penny Lane, where this ""Hellraiser"" grew up...",Clive Barker,in penny lane where this hellraiser grew up th...,clive barker,2000
216928,4999,2006-05-11,Double Jeopardy!,QUOTATIONS,$2000,"From Ft. Sill, Okla. he made the plea, Arizona...",Geronimo,from ft sill okla he made the plea arizona is ...,geronimo,2000


In [30]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [33]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

In [35]:
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0 
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)

In [36]:
jeopardy['answer_in_question'].mean()

0.05792070323661065

### How This Metric Can Help to Analyze Jeopardy Questions

The **'answer_in_question'** column represents the proportion of words from each answer that also appear in the corresponding question. The calculated mean of approximately **0.058** suggests that, on average, only about **5.8%** of an answer's words are directly found in the question.

This insight reveals that Jeopardy questions rarely include exact words from the answer, which implies that simply memorizing answer keywords may not be the most effective strategy. 

### Recycled Questions 

In [41]:
question_overlap = []
terms_used = set()

jeopardy['Air Date'] = jeopardy['Air Date'].sort_values()

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1 
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap

jeopardy['question_overlap'].mean()

0.8735125558087675

### Question Overlap and Repetition in Jeopardy

I have looked into how often **recycled language** appears across Jeopardy questions **by calculating the proportion of reused terms**. We analyze whether long words (more than 5 letters) from earlier questions appear again in later questions. The resulting mean of approximately **0.87** indicates **a very high overlap**.

This suggests that Jeopardy frequently **reuses key vocabulary or phrasing styles** across different questions, especially for domain-specific or thematic words.

How this can guide studying:

- Focus on common terminology used in prior questions.
- Practice recognizing repeated question patterns.
- Prioritize learning high-frequency long words, especially those tied to specific topics (history, science, literature, etc.).

This insight can significantly improve your strategic preparation by identifying commonly recycled question elements.

### Low Value vs High Value Questions 

In [44]:
def clean_value(row):
    value = 0 
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0 
    return value 

jeopardy['high_value'] = jeopardy.apply(clean_value, axis=1)


In [45]:
def count_value(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [54]:
from random import choice 

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    count = count_value(term)
    observed_expected.append(count)

observed_expected

[(0, 1),
 (0, 2),
 (1, 0),
 (0, 2),
 (0, 1),
 (0, 1),
 (7, 19),
 (0, 2),
 (2, 6),
 (0, 1)]

### Applying the Chi-squared Test

In [48]:
from scipy.stats import chisquare 
import numpy as np 

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for l in observed_expected:
    total = l[0] + l[1]
    total_prop = total/ jeopardy.shape[0]
    high = total_prop*high_value_count
    low = total_prop*low_value_count
    
    observed = np.array([l[0], l[1]])
    expected = np.array([high, low])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.31102499028369535, pvalue=0.5770518904538492),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=2.1740540543895293, pvalue=0.14035579428041794),
 Power_divergenceResult(statistic=5.063592849467617, pvalue=0.02443353405878706),
 Power_divergenceResult(statistic=0.021646150708492677, pvalue=0.8830323245068887),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.46338644448358013, pvalue=0.49604555208958945),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695)]

### Chi-Squared Test Results: Value vs. Feature Relationships

I used the Chi-squared test to evaluate whether certain features are statistically associated with **high-value vs. low-value** Jeopardy questions.

For each feature group in **observed_expected**,

- Calculated the total count and its proportion of the dataset.
- Estimated expected counts for high and low value questions assuming **no relationship**.
- Compared observed vs. expected values using the **Chi-squared test**.

Interpretation:

- Most p-values are well **above 0.05**, indicating no statistically significant relationship between those features and whether a question is high value.
- However, a few p-values (like 0.038 and 0.024) suggest **some weak evidence** that certain features might correlate with high-value questions.

**Conclusion**: While most features do not show a strong link with question value, a few may be worth exploring further to identify patterns that could help in prioritizing high-stakes study topics.