# Jeopardy

## **Introduction**

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years and is a major force in popular culture. We want to compete on Jeopardy and are looking for any edge that could help us win.

## **Dataset**

The dataset is named `JEOPARDY_FULL.csv`, and contains the full dataset of Jeopardy questions from 2004 until 2012, uploaded to the [/r/datasets](https://www.reddit.com/r/datasets) subreddit by Redditor [trexmatt](https://www.reddit.com/user/trexmatt/). Here are explanations of each column:

-   `Show Number` - the Jeopardy episode number
-   `Air Date` - the date the episode aired (in format YYYY-MM-DD)
-   `Round` - the round of Jeopardy (one of "Jeopardy!", "Double Jeopardy!", "Final Jeopardy!" or "Tiebreaker")
-   `Category` - the category of the question
-   `Value` - the number of dollars the correct answer is worth (this is "None" for Final Jeopardy! and Tiebreaker questions)
-   `Question` - the text of the question (sometimes contains hyperlinks and other messy text, such as when there's a picture or video question)
-   `Answer` - the text of the answer


In [1]:
import pandas as pd

In [2]:
path = '../../../../08_Zadania_baza/DataScience/DataQuest/Guided Projects/Beginner/Winning Jeopardy'
jeopardy = pd.read_csv(f"{path}/JEOPARDY_FULL.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [6]:
jeopardy.columns.to_list()

['Show Number',
 ' Air Date',
 ' Round',
 ' Category',
 ' Value',
 ' Question',
 ' Answer']

In [7]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1    Air Date    216930 non-null  object
 2    Round       216930 non-null  object
 3    Category    216930 non-null  object
 4    Value       213296 non-null  object
 5    Question    216930 non-null  object
 6    Answer      216927 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


## **Cleaning dataset**

### 1. Cleaning column names

Some of the column names have a leading space.

In [11]:
jeopardy.rename(columns=lambda x: x.lstrip(), inplace=True)

jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
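
As a small aside, the same cleanup can be done with the vectorized string methods on the column index. The sketch below runs on a made-up frame (`toy` and its contents are hypothetical, not the real dataset) and uses `str.strip()`, which also removes trailing spaces if there happen to be any.

```python
import pandas as pd

# Hypothetical toy frame that mimics the leading spaces in the real header
toy = pd.DataFrame({"Show Number": [1], " Air Date": ["2004-12-31"], " Value": ["$200"]})

# Strip whitespace on both sides of every column name in one vectorized call
toy.columns = toy.columns.str.strip()

cleaned_names = toy.columns.to_list()
print(cleaned_names)
```

Either approach should work here; the `rename` version used above has the advantage of leaving the rest of the frame untouched in a single call.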

### 2. Removing Null Values

Final Jeopardy! and Tiebreaker questions have no dollar value, which explains the 3,634 `NaN` values in the `Value` column.
There are also 3 `NaN` values in the `Answer` column. We can't use rows with a missing answer or a missing value in this project, so we'll drop them.

In [None]:
def check_missing_values(dataset: pd.DataFrame):
    total = dataset.isnull().sum()
    percent = (dataset.isnull().mean() * 100).round(4)  # share of missing values per column, in percent

    total_df = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    print(total_df)

check_missing_values(jeopardy)

             Total  Percent
Show Number      0   0.0000
Air Date         0   0.0000
Round            0   0.0000
Category         0   0.0000
Value         3634   1.6752
Question         0   0.0000
Answer           3   0.0014


In [None]:
jeopardy.dropna(subset=['Answer', 'Value'], inplace=True)
check_missing_values(jeopardy)

             Total  Percent
Show Number      0      0.0
Air Date         0      0.0
Round            0      0.0
Category         0      0.0
Value            0      0.0
Question         0      0.0
Answer           0      0.0
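
To make the effect of `dropna(subset=...)` concrete, here is a minimal sketch on three invented rows (the frame `toy` is hypothetical). It illustrates that a row is removed when either of the listed columns is missing, which means questions without a dollar value are dropped along with the unanswered ones.

```python
import pandas as pd

# Hypothetical rows: one complete, one Final Jeopardy! row without a value, one without an answer
toy = pd.DataFrame({
    "Round": ["Jeopardy!", "Final Jeopardy!", "Jeopardy!"],
    "Value": ["$200", None, "$400"],
    "Answer": ["Copernicus", "Jim Thorpe", None],
})

# A row is dropped if either listed column is missing
kept = toy.dropna(subset=["Answer", "Value"])
print(len(toy), "->", len(kept))
```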


### 3. Normalizing Text Columns

We need to normalize all of the text columns (the `Question` and `Answer` columns). The idea is to ensure that we put words in lowercase and remove punctuation so `Don't` and `don't` aren't considered to be different words when we compare them.

In [19]:
import re

def normalize_text(text: str):
    text = text.lower()
    # Replace everything that is not a letter, digit, or whitespace with ""
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Collapse runs of spaces (or other whitespace, e.g. tabs) into a single space
    text = re.sub(r"\s+", " ", text)
    return text

jeopardy['clean_question'] = jeopardy['Question'].apply(func=normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(func=normalize_text)

jeopardy.sample(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
88481,4272,2003-03-11,Jeopardy!,FIND THE SPOONERISM,$600,Can you give me one of these for our date toni...,rain check,can you give me one of these for our date toni...,rain check
73185,1227,1989-12-26,Jeopardy!,"""BABY"" SONGS",$500,"(AUDIO DAILY DOUBLE): A No. 12 hit in 1976, it...","""Baby, I Love Your Way""",audio daily double a no 12 hit in 1976 it reac...,baby i love your way
174419,5640,2009-02-27,Jeopardy!,"WE'RE ALL ""WAITING""",$1000,Terry McMillan's bestselling 1992 novel about ...,Waiting to Exhale,terry mcmillans bestselling 1992 novel about 4...,waiting to exhale
102035,5357,2007-12-18,Jeopardy!,TOP 40 HITMAKERS,$200,"""Twelve Thirty (Young Girls Are Coming To The ...",The Mamas & The Papas,twelve thirty young girls are coming to the ca...,the mamas the papas
177330,3161,1998-05-04,Jeopardy!,GUINNESS RECORDS,$500,The largest cut one of these green gems is a w...,Emerald,the largest cut one of these green gems is a w...,emerald
99815,4315,2003-05-09,Jeopardy!,POTENT POTABLES,$600,"""It's another"" one of these orange juice cockt...",Tequila Sunrise,its another one of these orange juice cocktail...,tequila sunrise
111366,3538,2000-01-12,Double Jeopardy!,PLAYING DOCTOR,$1000,"Michael Steadman on the show ""thirtysomething""...",Ken Olin,michael steadman on the show thirtysomething h...,ken olin
204928,4553,2004-05-26,Jeopardy!,POP QUIZ,$600,"In the Janis Joplin hit ""Me And Bobby McGee"", ...",freedom,in the janis joplin hit me and bobby mcgee thi...,freedom
103121,5465,2008-05-16,Jeopardy!,POTPOURRI,$600,Title of the chief of the DOJ,the Attorney General,title of the chief of the doj,the attorney general
113995,3941,2001-10-22,Jeopardy!,"AL, HISTORY'S PASSIVE AGGRESSIVE PAL",$100,"In 1871 Chicago, Al promises to retrieve this ...",Mrs. O'Leary,in 1871 chicago al promises to retrieve this w...,mrs oleary
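
One side effect of deleting punctuation outright is that hyphenated tokens get glued together (for example `1270-1285` becomes `12701285`). The sketch below compares the function above with a hypothetical variant, `normalize_text_spaced`, that removes apostrophes but turns other punctuation into spaces. The variant is only an illustration of the trade-off, not what the rest of this notebook uses.

```python
import re

def normalize_text(text: str) -> str:
    # Same steps as the notebook's function: lowercase, drop punctuation, collapse whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text)

def normalize_text_spaced(text: str) -> str:
    # Hypothetical variant: delete apostrophes, turn all other punctuation into spaces
    text = text.lower().replace("'", "")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

example = "Don't forget: king of France 1270-1285"
joined = normalize_text(example)
spaced = normalize_text_spaced(example)
print(joined)  # dont forget king of france 12701285
print(spaced)  # dont forget king of france 1270 1285
```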


### 4. Normalize value column

The `Value` column should be numeric so that it is easier to work with. We'll remove the dollar sign (and any thousands separator) from each value and convert the column from text to a number:

In [20]:
def normalize_number(text: str) -> int:
    text = str(text)
    text = re.sub(r"[^0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

jeopardy['clean_value'] = jeopardy['Value'].apply(func=normalize_number)

jeopardy.sample(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
190300,3838,2001-04-18,Double Jeopardy!,THE NEW YORK TIMES HEADLINES,"$2,000",Month & year of the historic headline seen her...,"October, 1929",month year of the historic headline seen here ...,october 1929,2000
169164,3366,1999-04-05,Double Jeopardy!,ASTROLOGY FOR SKATERS,$600,"If this ""fishy"" sign is yours, your famous fee...",Pisces,if this fishy sign is yours your famous feet m...,pisces,600
205541,6061,2011-01-10,Jeopardy!,LAUNDRY DETERGENT,$1000,"Philip III, king of France 1270-1285, was nick...",Bold,philip iii king of france 12701285 was nicknam...,bold,1000
14217,5591,2008-12-22,Jeopardy!,"HELLO, DALI",$800,"In ""The Persistence of Memory"", Dali shows 4 o...",watches or clocks,in the persistence of memory dali shows 4 of t...,watches or clocks,800
112076,3452,1999-09-14,Double Jeopardy!,BLACK HERITAGE STAMPS,$800,A. Philip Randolph unionized men in this job h...,Sleeping car porters,a philip randolph unionized men in this job he...,sleeping car porters,800
152849,6142,2011-05-03,Double Jeopardy!,"MATH, TEACHERS!",$1200,My 1997 Honda Civic went 403 miles on 13 gallo...,31,my 1997 honda civic went 403 miles on 13 gallo...,31,1200
84421,5422,2008-03-18,Jeopardy!,GRAB BAG,$800,This 2-word phrase comes from a Greek belief t...,swan song,this 2word phrase comes from a greek belief th...,swan song,800
43121,3937,2001-10-16,Double Jeopardy!,A DATE WITH DISASTER,$400,"Seen here, he shot a man in Texas, November 24...",Jack Ruby,seen here he shot a man in texas november 24 1963,jack ruby,400
132182,5608,2009-01-14,Double Jeopardy!,A LITTLE BIT ROCK & ROLL,$800,"""Baby, you're much 2 fast"", sings Prince in th...","""Little Red Corvette""",baby youre much 2 fast sings prince in this hit,little red corvette,800
8308,4946,2006-02-27,Jeopardy!,ANGELS,$200,Arte Moreno bought the baseball team in 2003 &...,Anaheim,arte moreno bought the baseball team in 2003 r...,anaheim,200
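
For comparison, the same conversion can be sketched without a row-by-row `apply`, using pandas string methods and `pd.to_numeric`. The series `values` below is made up to match the shapes seen in the `Value` column. Note one difference from `normalize_number`: with `errors="coerce"`, entries that cannot be parsed become `NaN` rather than `0`.

```python
import pandas as pd

# Hypothetical values in the same shapes seen in the Value column
values = pd.Series(["$200", "$2,000", "$1000"])

# Remove dollar signs and thousands separators, then convert;
# anything that still cannot be parsed would become NaN
clean = pd.to_numeric(values.str.replace(r"[$,]", "", regex=True), errors="coerce")
print(clean.tolist())
```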


### 5. Air Date column to datetime type

The `Air Date` column should be a datetime, not a string. 

In [21]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.sample(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
208760,2991,1997-09-08,Jeopardy!,TRIBES,$100,The Havasupai have been living in a branch of ...,Grand Canyon,the havasupai have been living in a branch of ...,grand canyon,100
169314,5428,2008-03-26,Jeopardy!,HATS,$600,Movie choreographer Mr. Berkeley is on a first...,busby,movie choreographer mr berkeley is on a firstn...,busby,600
184359,4323,2003-05-21,Jeopardy!,SPEAKERS OF THE HOUSE,$400,Speakers from this state have included Joseph ...,Illinois,speakers from this state have included joseph ...,illinois,400
4722,5419,2008-03-13,Double Jeopardy!,MAKING PIANOS AT STEINWAY,$1200,"(<a href=""http://www.j-archive.com/media/2008-...",concert grand,a hrefhttpwwwjarchivecommedia20080313dj28jpg t...,concert grand,1200
1522,5392,2008-02-05,Double Jeopardy!,FUNNY FOR NOTHIN',$400,"On his first night taking over ""The Daily Show...",Jon Stewart,on his first night taking over the daily show ...,jon stewart,400
51934,4895,2005-12-16,Double Jeopardy!,BODIES OF WATER,$1200,The General Rafael Urdaneta Bridge spans the n...,(Lake) Maracaibo,the general rafael urdaneta bridge spans the n...,lake maracaibo,1200
34250,3066,1997-12-22,Jeopardy!,SONGS OF THE '60s,$200,"Song including the lines ""You make my heart si...","""Wild Thing""",song including the lines you make my heart sin...,wild thing,200
20715,2849,1997-01-09,Jeopardy!,NOTABLE RELATIVES,$100,Supermodel & L'Oreal spokeswoman Hunter Reno i...,Janet Reno,supermodel loreal spokeswoman hunter reno is t...,janet reno,100
164946,2856,1997-01-20,Double Jeopardy!,EDUCATORS,$600,"When Tuskeegee opened its doors in 1881, he wa...",Booker T. Washington,when tuskeegee opened its doors in 1881 he was...,booker t washington,600
163430,4538,2004-05-05,Double Jeopardy!,VOCABULARY TEST,$2000,"Spelled differenty, it can be a daisy, or a ba...",flower,spelled differenty it can be a daisy or a baki...,flower,2000
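
A short sketch of what the datetime type buys us: once the column is parsed, date parts and ranges are directly available. The three dates below are copied from sample rows shown in this notebook, and passing an explicit `format` is an optional choice that makes the expected layout clear.

```python
import pandas as pd

# Three air dates taken from sample rows above
dates = pd.to_datetime(pd.Series(["2004-12-31", "1989-12-26", "2011-07-07"]), format="%Y-%m-%d")

# Date parts and min/max work on datetimes without any string handling
years = dates.dt.year.tolist()
span = (dates.min().year, dates.max().year)
print(years, span)
```

Running the same `min()` and `max()` on `jeopardy['Air Date']` would be a quick way to check the period the dataset actually covers.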


## **Analysis**

In [22]:
jeopardy.iloc[:, 7:]

Unnamed: 0,clean_question,clean_answer,clean_value
0,for the last 8 years of his life galileo was u...,copernicus,200
1,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,the city of yuma in this state has a record av...,arizona,200
3,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,signer of the dec of indep framer of the const...,john adams,200
...,...,...,...
216924,in 2006 the cast of this longrunning hit embar...,stomp,2000
216925,this puccini opera turns on the solution to 3 ...,turandot,2000
216926,in north america this term is properly applied...,a titmouse,2000
216927,in penny lane where this hellraiser grew up th...,clive barker,2000


To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer can be deduced from the question
* How often questions are repeated

We can answer the first question by seeing how many times words in the answer also occur in the question. We can answer the second question by seeing how often complex words (6 or more characters) reoccur.

We'll start by tackling the first question.

### 1. How often the answer can be deduced from the question

In [23]:
def check_answer_in_question(row: pd.Series) -> float:
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")

    if "the" in split_answer:
        split_answer.remove("the")

    if len(split_answer) == 0:
        return 0
    
    # how many words are common between both lists 
    match_count =  len(set(split_answer).intersection(split_question))

    return match_count / len(split_answer)

jeopardy['answer_in_question'] = round(jeopardy.apply(check_answer_in_question, axis=1), 2)
jeopardy.sample(15)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
132990,3891,2001-07-02,Jeopardy!,PLANTS & TREES,$500,Each autumn the Japanese celebrate the feast o...,the chrysanthemum,each autumn the japanese celebrate the feast o...,the chrysanthemum,500,0.0
35227,6189,2011-07-07,Double Jeopardy!,THE LIBERTY BELL RANG...,$400,"On July 8, 1776 to announce the first public r...",the Declaration of Independance,on july 8 1776 to announce the first public re...,the declaration of independance,400,0.33
126831,4453,2004-01-07,Double Jeopardy!,LEANN RHYMES?,$2000,"""Borstal Boy"" is the autobiography of this Iri...",Behan,borstal boy is the autobiography of this irish...,behan,2000,0.0
99966,3364,1999-04-01,Double Jeopardy!,SPORTS,$600,The logo of this city's NHL team is a winged c...,Detroit Red Wings,the logo of this citys nhl team is a winged ca...,detroit red wings,600,0.0
38868,3386,1999-05-03,Jeopardy!,PARENTS ARE PEOPLE TOO,$500,This term for a right that divorcing parents m...,Custody,this term for a right that divorcing parents m...,custody,500,0.0
183415,3099,1998-02-05,Jeopardy!,MOVIE VICTIMS,$300,In 1996 he co-starred as the customer victimiz...,Matthew Broderick,in 1996 he costarred as the customer victimize...,matthew broderick,300,0.0
14758,4628,2004-10-20,Double Jeopardy!,"GEOGRAPHY ""B""",$400,"In Europe, the beautiful blue Danube eventuall...",the Black Sea,in europe the beautiful blue danube eventually...,the black sea,400,0.5
130623,4526,2004-04-19,Jeopardy!,MOVIE PHONE,$400,"She played squeaky clean Jan, who shared a par...",Doris Day,she played squeaky clean jan who shared a part...,doris day,400,0.0
69797,5820,2009-12-25,Double Jeopardy!,PRESIDENTS AT REST,$800,"At his Presidential Library in Simi Valley, Ca...",Reagan,at his presidential library in simi valley cal...,reagan,800,0.0
203444,4800,2005-06-17,Jeopardy!,PACKAGING,$600,Whitman's was the 1st company to put a guide o...,(a box of) chocolates,whitmans was the 1st company to put a guide on...,a box of chocolates,600,0.5
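
To see how a single score comes about, here is the same logic as a standalone function applied to two strings. The function name `answer_overlap` and the question texts are invented for illustration; only the scoring steps mirror `check_answer_in_question`.

```python
def answer_overlap(clean_answer: str, clean_question: str) -> float:
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")

    # Ignore one leading/embedded "the", as in the notebook's function
    if "the" in answer_words:
        answer_words.remove("the")
    if len(answer_words) == 0:
        return 0

    # Distinct answer words that also occur in the question
    matches = len(set(answer_words).intersection(question_words))
    return matches / len(answer_words)

# "the" is dropped, then 1 of the 2 remaining words ("sea") is found in the question
half = answer_overlap("the black sea", "the danube empties into this sea")
# No answer word occurs in the question
none = answer_overlap("copernicus", "galileo was under house arrest")
print(half, none)
```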


In [24]:
mean_values = jeopardy['answer_in_question'].mean()
print(f"{mean_values * 100:.3}%")


5.79%


On average, only about 6% of an answer's words also appear in its question, so we can't rely on the question's wording to give the answer away.

### 2. How often questions are repeated

In [25]:
question_overlap = []
terms_used = set()
jeopardy.sort_values(by='Air Date', ascending=True, inplace=True)

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [word for word in split_question if len(word) > 5]

    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1

    for word in split_question:
        terms_used.add(word)

    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(round(match_count, 2))

jeopardy['question_overlap'] = question_overlap

print(f"The average overlap in questions: {jeopardy['question_overlap'].mean() * 100:.3}%")

The average overlap in questions: 87.1%


So about 87% of the terms in newer questions appeared in older questions. It's promising, but since we're looking at single words, not complete phrases, it requires more investigating.
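
The toy run below walks through the same bookkeeping on three invented questions (the helper `overlap_scores` and the texts are hypothetical). It also hints at why the average is so high: the set of seen terms only grows, so later questions are increasingly likely to match something, whether or not the question itself is a repeat.

```python
def overlap_scores(questions, min_len=6):
    seen = set()
    scores = []
    for question in questions:
        # Keep only the longer words, as in the loop above (len(word) > 5)
        words = [w for w in question.split(" ") if len(w) >= min_len]
        matched = sum(1 for w in words if w in seen)
        seen.update(words)
        scores.append(matched / len(words) if words else 0)
    return scores

toy_questions = [
    "this president signed the treaty",    # nothing seen yet
    "the treaty ended a famous conflict",  # "treaty" was seen before
    "a famous president",                  # both long words were seen before
]
scores = overlap_scores(toy_questions)
print([round(s, 2) for s in scores])
```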

### 3. Low values vs High Values

One approach we can investigate is focusing our study on high-value questions. Since our resources are limited, it makes sense to spend them on the questions that will be most lucrative for us.

To learn which terms correspond to high-value questions we can do a Chi-squared test. This will allow us to find the words with the biggest difference in usage between high and low values.

Before running the test, let's narrow down the questions to 2 categories:
* low-value: any row where `clean_value` is `800` or less
* high-value: any row where `clean_value` is greater than `800`

In [26]:
def determine_value(row: pd.Series):
    return 1 if row['clean_value'] > 800 else 0

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

jeopardy.sample(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
165269,3413,1999-06-09,Jeopardy!,PARTY TIME!,$100,Sir John Soane had a 3-day party after buying ...,London,sir john soane had a 3day party after buying s...,london,100,0.0,1.0,0
89693,5390,2008-02-01,Double Jeopardy!,GROUNDHOG DAY,$800,In 1886 the first U.S. Groundhog Day was obser...,Punxsutawney,in 1886 the first us groundhog day was observe...,punxsutawney,800,0.0,1.0,0
189451,4817,2005-07-12,Double Jeopardy!,'80s MOVIE QUOTES,$400,"1980: ""I am serious, and don't call me Shirley""",Airplane!,1980 i am serious and dont call me shirley,airplane,400,0.0,1.0,0
147133,5245,2007-06-01,Double Jeopardy!,'90s TV,"$3,000","This sitcom debuted on Fox in August, 1998, 22...",That '70s Show,this sitcom debuted on fox in august 1998 22 y...,that 70s show,3000,0.0,1.0,1
78752,5926,2010-05-24,Double Jeopardy!,TOOL WORDS & PHRASES,$1600,The members of an organization aside from its ...,rank and file,the members of an organization aside from its ...,rank and file,1600,0.0,1.0,1
36766,4555,2004-05-28,Jeopardy!,"""G"" WHIZ",$600,"Excluding noncopyrighted works, this book firs...",The Guiness Book of World Records,excluding noncopyrighted works this book first...,the guiness book of world records,600,0.4,1.0,0
70420,6026,2010-11-22,Double Jeopardy!,THEY COME IN TWOS,$2000,"Named for their method of secretion, they're t...",endocrine & exocrine,named for their method of secretion theyre the...,endocrine exocrine,2000,0.0,1.0,1
61758,5006,2006-05-22,Double Jeopardy!,BEATLES RHYME TIME,$2000,Fired drummer's football shoe parts,Pete's cleats,fired drummers football shoe parts,petes cleats,2000,0.0,1.0,1
31069,5472,2008-05-27,Double Jeopardy!,GEOLOGY,$400,"One of these struck Boston in 1755, Missouri i...",an earthquake,one of these struck boston in 1755 missouri in...,an earthquake,400,0.0,1.0,0
215341,5135,2006-12-29,Jeopardy!,MARLIN,$800,"In 2006 Sterling Marlin, a veteran racer on th...",NASCAR,in 2006 sterling marlin a veteran racer on thi...,nascar,800,0.0,1.0,0


In [27]:
def count_usage(word: str):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1 
            else:
                low_count += 1
    return high_count, low_count

In [28]:
from random import sample

# Draw 10 distinct terms at random; sample() avoids the duplicates
# that repeated choice() calls could produce
comparison_terms = sample(list(terms_used), 10)
comparison_terms

['hirsute',
 'hammering',
 'hrefhttpwwwjarchivecommedia20040910dj26jpg',
 'berryn',
 'bonets',
 'tenzin',
 'positionsa',
 'brendan',
 'factorys',
 'redbird']

In [29]:
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(3, 4),
 (0, 5),
 (1, 0),
 (0, 1),
 (0, 1),
 (0, 2),
 (0, 1),
 (6, 10),
 (0, 1),
 (1, 4)]
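
Calling `count_usage` once per term walks the whole frame each time. As a sketch of an alternative, the hypothetical helper below counts all terms in a single pass over `(clean_question, high_value)` pairs. The rows are invented, and the helper is meant to return the same `(high_count, low_count)` tuples as `count_usage`.

```python
from collections import Counter

def count_usage_all(rows, terms):
    """rows: iterable of (clean_question, high_value) pairs; terms: words to count."""
    wanted = set(terms)
    high, low = Counter(), Counter()
    for question, high_value in rows:
        # set() so a word repeated within one question is counted once,
        # matching the `word in split_question` check
        for word in set(question.split(" ")) & wanted:
            (high if high_value == 1 else low)[word] += 1
    return [(high[t], low[t]) for t in terms]

toy_rows = [
    ("brendan fraser starred in this film", 1),
    ("this saint brendan sailed west", 0),
    ("hammering is done with this tool", 0),
]
counts = count_usage_all(toy_rows, ["brendan", "hammering", "tenzin"])
print(counts)
```

On the real data the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`.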

### 4. Chi-Squared test

In [30]:
jeopardy['high_value'].value_counts()

high_value
0    151871
1     61422
Name: count, dtype: int64

In [31]:
jeopardy.shape[0]

213293

In [32]:
from scipy.stats import chisquare
import numpy as np

low_value_count = jeopardy['high_value'].value_counts()[0]
high_value_count = jeopardy['high_value'].value_counts()[1]

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared


[Power_divergenceResult(statistic=0.6748876446186486, pvalue=0.4113527238863175),
 Power_divergenceResult(statistic=2.022176715765353, pvalue=0.1550167872840919),
 Power_divergenceResult(statistic=2.4725831135423793, pvalue=0.11584739715125356),
 Power_divergenceResult(statistic=0.4044353431530708, pvalue=0.5248075048535166),
 Power_divergenceResult(statistic=0.4044353431530708, pvalue=0.5248075048535166),
 Power_divergenceResult(statistic=0.8088706863061416, pvalue=0.3684543433835694),
 Power_divergenceResult(statistic=0.4044353431530708, pvalue=0.5248075048535166),
 Power_divergenceResult(statistic=0.5910329001770444, pvalue=0.44201998036337653),
 Power_divergenceResult(statistic=0.4044353431530708, pvalue=0.5248075048535166),
 Power_divergenceResult(statistic=0.18870972079830203, pvalue=0.6639926821692866)]
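
To check what `chisquare` is doing, the statistic for the first term (3 high-value and 4 low-value uses) can be reproduced by hand from the row totals printed above. The helper name `chi_squared_stat` is made up for this sketch. It also returns the expected counts, which are worth a look: a common rule of thumb is that the chi-squared test becomes unreliable when expected counts fall below about 5, as they do for these rare terms, so these results should probably be read with caution.

```python
def chi_squared_stat(high_obs, low_obs, high_total, low_total):
    n_rows = high_total + low_total
    # Share of all rows that contain the term
    share = (high_obs + low_obs) / n_rows
    # Expected counts if the term were spread evenly across high and low rows
    expected = (share * high_total, share * low_total)
    observed = (high_obs, low_obs)
    # chi^2 = sum over cells of (observed - expected)^2 / expected
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

stat, expected = chi_squared_stat(3, 4, high_total=61422, low_total=151871)
print(round(stat, 4), [round(e, 2) for e in expected])
```

The statistic agrees with the first entry of `chi_squared` (about 0.675). None of the ten p-values is below 0.05, so for this random draw there is no evidence that any of the terms is used differently in high-value and low-value questions.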