# Guided Project: Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. 

In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

In [1]:
import pandas as pd
import re
import random
from scipy.stats import chisquare
import numpy as np

jeopardy = pd.read_csv("D:/DataQuest/JEOPARDY_CSV.csv")

In [2]:
jeopardy.columns #Some columns have spaces in front

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
#Strip the leading spaces from every column name
jeopardy.columns = jeopardy.columns.str.strip()

In [4]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [5]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1   Air Date     216930 non-null  object
 2   Round        216930 non-null  object
 3   Category     216930 non-null  object
 4   Value        213296 non-null  object
 5   Question     216930 non-null  object
 6   Answer       216927 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


In [6]:
jeopardy.isnull().sum()

Show Number       0
Air Date          0
Round             0
Category          0
Value          3634
Question          0
Answer            3
dtype: int64

In [7]:
jeopardy.head(2)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe


In [8]:
jeopardy[jeopardy['Answer'].isnull()]

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
94817,4346,2003-06-23,Jeopardy!,"GOING ""N""SANE",$200,"It often precedes ""and void""",
143297,6177,2011-06-21,Double Jeopardy!,NOTHING,$400,"This word for ""nothing"" precedes ""and void"" to...",
178922,4573,2004-06-23,Jeopardy!,MUCH ADO ABOUT NOTHING,$200,Completes the title of the 1939 book by Agatha...,


In [9]:
jeopardy[jeopardy['Value'].isnull()]

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
55,4680,2004-12-31,Final Jeopardy!,THE SOLAR SYSTEM,,Objects that pass closer to the sun than Mercu...,Icarus
116,5957,2010-07-06,Final Jeopardy!,HISTORIC WOMEN,,She was born in Virginia around 1596 & died in...,Pocahontas
174,3751,2000-12-18,Final Jeopardy!,SPORTS LEGENDS,,If Joe DiMaggio's hitting streak had gone one ...,H.J. Heinz (Heinz 57 Varieties)
235,3673,2000-07-19,Final Jeopardy!,THE MAP OF EUROPE,,"Bordering Italy, Austria, Hungary & Croatia, i...",Slovenia
296,4931,2006-02-06,Final Jeopardy!,FAMOUS SHIPS,,"On December 27, 1831 it departed Plymouth, Eng...",the HMS Beagle
...,...,...,...,...,...,...,...
216686,3940,2001-10-19,Final Jeopardy!,MAJOR LEAGUE BASEBALL TEAM NAMES,,This team received its name after an 1890 inci...,Pittsburgh Pirates
216746,6044,2010-12-16,Final Jeopardy!,SKYSCRAPERS,,After a construction boom fueled by oil & gas ...,Moscow
216807,5070,2006-09-29,Final Jeopardy!,NATIONAL CAPITALS,,"This city's website calls it ""the last divided...",Nicosia
216868,5195,2007-03-23,Final Jeopardy!,BESTSELLING AUTHORS,,He had the year's bestselling novel a record 7...,John Grisham


A brief look at the data shows that quite a lot of cleaning is needed before analysis. Let's get to it.

#### Data Cleaning - Normalizing Text

We will normalize the text columns by converting them to lowercase and replacing punctuation with spaces.

In [10]:
#Function to normalize text
def clean_text(sen):
    sen = str(sen)
    sen = sen.lower()
    sen = re.sub(r'\W', ' ',sen)
    return sen

In [11]:
#Applying to Question column
jeopardy['Question'] = jeopardy['Question'].apply(clean_text)
jeopardy['Question']

0         for the last 8 years of his life  galileo was ...
1         no  2  1912 olympian  football star at carlisl...
2         the city of yuma in this state has a record av...
3         in 1963  live on  the art linkletter show   th...
4         signer of the dec  of indep   framer of the co...
                                ...                        
216925    this puccini opera turns on the solution to 3 ...
216926    in north america this term is properly applied...
216927    in penny lane  where this  hellraiser  grew up...
216928    from ft  sill  okla  he made the plea  arizona...
216929    a silent movie title includes the last name of...
Name: Question, Length: 216930, dtype: object

In [12]:
#Applying to Answer column
jeopardy['Answer'] = jeopardy['Answer'].apply(clean_text)
jeopardy['Answer']

0                             copernicus
1                             jim thorpe
2                                arizona
3                             mcdonald s
4                             john adams
                       ...              
216925                          turandot
216926                        a titmouse
216927                      clive barker
216928                          geronimo
216929    grigori alexandrovich potemkin
Name: Answer, Length: 216930, dtype: object
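
To make the behaviour of `clean_text` concrete, here is a small self-contained sketch. It repeats the function from above and adds `clean_text_strict`, a hypothetical variant that is not part of the original notebook. The variant is only an assumption about what a stricter cleaner might do: it treats underscores as punctuation, collapses repeated spaces, and maps missing values to an empty string.

```python
import math
import re

def clean_text(sen):
    # Same logic as the notebook: lowercase, then replace each
    # non-word character with a single space
    sen = str(sen)
    sen = sen.lower()
    sen = re.sub(r'\W', ' ', sen)
    return sen

def clean_text_strict(sen):
    # Hypothetical variant (an assumption, not the notebook's method)
    if sen is None or (isinstance(sen, float) and math.isnan(sen)):
        return ''
    # [\W_]+ also matches underscores and collapses runs into one space
    sen = re.sub(r'[\W_]+', ' ', str(sen).lower())
    return sen.strip()

print(clean_text("McDonald's"))             # mcdonald s
print(clean_text(float('nan')))             # nan
print(clean_text_strict('GOING "N"SANE'))   # going n sane
```

Two side effects of the original function are worth keeping in mind: a missing value is stored as the text `nan` rather than staying missing, and `\W` leaves underscores in place because they count as word characters.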

#### Data Cleaning - Normalizing Numeric Columns

We will clean two columns here.

Value column - remove the $ sign and thousands separators, then convert to numeric.

Air Date - we will change from string to datetime.

In [13]:
#Function to clean $
def clean_dollars(string):
    string = str(string)
    string = re.sub(r'[,$]','',string)
    string = float(string)
    return string

In [14]:
#Apply cleaning
jeopardy['Value'] = jeopardy['Value'].apply(clean_dollars)
jeopardy['Value'] = jeopardy['Value'].fillna(0) #Replace NaN with 0
jeopardy['Value'].isnull().sum()

0
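
The sketch below shows, under the same logic as `clean_dollars` above, why missing values survive the conversion and only disappear at the `fillna(0)` step: `str()` turns a missing value into the text `nan`, and `float('nan')` is a valid float. The sample inputs are illustrative.

```python
import math
import re

def clean_dollars(string):
    # Same logic as the notebook: drop "$" and ",", then convert to float
    string = str(string)
    string = re.sub(r'[,$]', '', string)
    return float(string)

print(clean_dollars("$200"))                     # 200.0
print(clean_dollars("$1,200"))                   # 1200.0
print(math.isnan(clean_dollars(float('nan'))))   # True
```

Because those missing values are then filled with 0, the rows without a Value (all Final Jeopardy! in the rows displayed earlier) are treated as $0 questions in any later analysis that uses this column.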

In [15]:
#Change air date to datetime object
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Show Number  216930 non-null  int64         
 1   Air Date     216930 non-null  datetime64[ns]
 2   Round        216930 non-null  object        
 3   Category     216930 non-null  object        
 4   Value        216930 non-null  float64       
 5   Question     216930 non-null  object        
 6   Answer       216930 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 11.6+ MB


#### Answers in Questions

We want to find out how often the answer can be deduced from the question itself.

We will do this by measuring what fraction of the words in each answer also occur in its question.

In [16]:
def answer(row):
    match_count = 0
    split_answer = row['Answer'].split()
    split_question = row['Question'].split()
    split_answer = [word for word in split_answer if word != 'the']
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [17]:
jeopardy['answer_in_question'] = jeopardy.apply(answer, axis=1)
jeopardy['answer_in_question'].mean()

0.060826271778010685

On average, about 6% of the words in an answer also appear in its question.

That is too low to rely on guessing the answer from the question.
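
As a quick check of how the `answer` function scores a single row, the sketch below applies it to two plain dictionaries standing in for DataFrame rows. The first mirrors the earliest row in the dataset; the second is a hypothetical row made up for illustration.

```python
def answer(row):
    # Same logic as the notebook: share of answer words (ignoring "the")
    # that also appear in the question
    match_count = 0
    split_answer = row['Answer'].split()
    split_question = row['Question'].split()
    split_answer = [word for word in split_answer if word != 'the']
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

real_row = {'Question': 'river mentioned most often in the bible',
            'Answer': 'the jordan'}
# Hypothetical row, already normalized like the notebook's columns
made_up_row = {'Question': 'this river is mentioned most often in the bible',
               'Answer': 'the jordan river'}

print(answer(real_row))     # 0.0
print(answer(made_up_row))  # 0.5
```

In the second row, "the" is dropped, "river" matches and "jordan" does not, so one of two answer words is found in the question.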

#### Recycled Questions

We will look at how often questions reuse terms from earlier questions. To do this, we sort the rows by air date in ascending order and, for each question, look at the words with 6 or more characters. Any word that has already appeared in an earlier question counts toward that question's overlap.

In [18]:
#sort jeopardy by date ascending
jeopardy = jeopardy.sort_values(by='Air Date', ascending=True)
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,answer_in_question
84523,1,1984-09-10,Jeopardy!,LAKES & RIVERS,100.0,river mentioned most often in the bible,the jordan,0.000000
84565,1,1984-09-10,Double Jeopardy!,THE BIBLE,1000.0,according to 1st timothy it is the root of a...,the love of money,0.333333
84566,1,1984-09-10,Double Jeopardy!,'50'S TV,1000.0,name under which experimenter don herbert taug...,mr wizard,0.000000
84567,1,1984-09-10,Double Jeopardy!,NATIONAL LANDMARKS,1000.0,d c building shaken by november 83 bomb blast,the capitol,0.000000
84568,1,1984-09-10,Double Jeopardy!,NOTORIOUS,1000.0,after the deed he leaped to the stage shoutin...,john wilkes booth,0.000000
...,...,...,...,...,...,...,...,...
105947,6300,2012-01-27,Jeopardy!,VISITING THE CITY,800.0,there s a great opera house on bennelong point...,sydney,0.000000
105948,6300,2012-01-27,Jeopardy!,PANTS,1400.0,tight fitting pants patterned after those worn...,toreador pants,0.500000
105949,6300,2012-01-27,Jeopardy!,CHILD ACTORS,800.0,this kid with a familiar last name is seen ...,jaden smith,0.000000
105951,6300,2012-01-27,Jeopardy!,LESSER-KNOWN SCIENTISTS,800.0,joseph lagrange insisted on 10 as the basic un...,the metric system,0.500000


In [19]:
question_overlap = []
terms_used = set()

for index, row in jeopardy.iterrows():
    split_question = row['Question'].split()
    split_question = [word for word in split_question if len(word) >= 6]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy["question_overlap"].mean()

0.8992046127831375

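On average, about 90% of the long words (6 or more characters) in a question have already appeared in an earlier question. This measures the overlap of single words rather than whole questions, so it does not show that 90% of questions are recycled. However, it may warrant further investigation.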
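
To see what this overlap score actually measures, here is a minimal sketch of the same loop run on three made-up questions. The function name `question_overlap_scores` and the toy questions are assumptions for illustration; the matching logic follows the cell above.

```python
def question_overlap_scores(questions, min_len=6):
    # questions must already be in chronological order
    terms_used = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        # Questions with no long words score 0, as in the notebook
        scores.append(match_count / len(words) if words else 0)
    return scores

# Hypothetical, already-normalized questions
toy_questions = [
    'largest planet jupiter',
    'jupiter moons include europa',
    'planet closest',
]
scores = question_overlap_scores(toy_questions)
print(scores)
```

The third toy question scores 0.5 only because it shares the word "planet" with the first one, even though the two questions are unrelated. The set of seen terms keeps growing, so later questions tend to score higher, which is worth remembering when reading the 90% figure.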


#### Low Value vs High Value Questions

We will use an arbitrary cutoff of $800: questions worth more than $800 count as high value, and all others count as low value.

We will loop through terms from the `terms_used` set built in the previous section, and for each term:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi-squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values.

In [20]:
#Create function to sort "Value" column into nominal data (0/1), 1 for high value, 0 for low value
def sort_high_low(num):
    return 1 if num > 800 else 0

#Apply function to dataframe
jeopardy['high_value'] = jeopardy['Value'].apply(sort_high_low)
jeopardy['high_value'].value_counts()

high_value
0    155508
1     61422
Name: count, dtype: int64

In [21]:
#Function to count number of times word appears in high vs low value questions
def value_comp(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_clean = row['Question'].split()
        if word in split_clean:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return int(high_count), int(low_count)

In [22]:
#We will select 10 terms for comparison
comparison_terms = random.sample(list(terms_used), 10)
comparison_terms

['hallucinatory',
 'barkeep',
 '14_j_29a',
 'shortbread',
 'havdalah',
 'laputa',
 'glandarius',
 'unkindest',
 'editor',
 'anthelmintic']
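
The sample above is drawn without a seed, so rerunning the notebook picks different terms. One possible way to make the draw repeatable is sketched below. The helper `sample_terms` and the small `terms_used` set are hypothetical stand-ins, not part of the original notebook.

```python
import random

# Hypothetical stand-in for the notebook's much larger terms_used set
terms_used = {'editor', 'shortbread', 'laputa', 'barkeep', 'unkindest'}

def sample_terms(terms, k, seed=0):
    # Sort first: the iteration order of a set of strings can change
    # between interpreter runs, so seeding alone is not enough
    rng = random.Random(seed)
    return rng.sample(sorted(terms), k)

first = sample_terms(terms_used, 3)
second = sample_terms(terms_used, 3)
print(first == second)  # True
```

Using a dedicated `random.Random` instance keeps the seed local to this draw instead of changing the global random state.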

In [23]:
#We pass 10 values into the value_comp function
observed_expected = []
for word in comparison_terms:
    observed_expected.append(value_comp(word))
print(observed_expected)

[(2, 3), (0, 2), (1, 0), (1, 2), (0, 1), (1, 3), (0, 1), (0, 1), (45, 113), (0, 1)]
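
Calling `value_comp` once per term rescans every row each time, which is why only 10 terms are practical here. A possible alternative, sketched below under the assumption that the questions are already normalized, counts all terms of interest in a single pass. The function name `count_terms_by_value` and the toy rows are hypothetical.

```python
from collections import defaultdict

def count_terms_by_value(rows, terms):
    """rows: iterable of (question_text, high_value) pairs.
    Returns {term: (high_count, low_count)}, counting each question once per term."""
    wanted = set(terms)
    counts = defaultdict(lambda: [0, 0])
    for question, high_value in rows:
        # Only look at the wanted terms that occur in this question
        for word in wanted.intersection(question.split()):
            counts[word][0 if high_value == 1 else 1] += 1
    return {term: tuple(counts[term]) for term in terms}

# Hypothetical rows: (question, high_value flag)
toy_rows = [
    ('the editor of this magazine', 1),
    ('this editor liked shortbread', 0),
    ('shortbread recipe', 0),
]
result = count_terms_by_value(toy_rows, ['editor', 'shortbread', 'laputa'])
print(result)
```

With the DataFrame from this notebook, the rows could be supplied as `zip(jeopardy['Question'], jeopardy['high_value'])`.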


In [24]:
#Applying chi-squared test
high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])
chi_squared = []
for x in observed_expected:
    total = sum(x)
    total_prop = total / len(jeopardy)
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count

    observed = np.array([x[0], x[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.3363947754070794, pvalue=0.5619176551024535),
 Power_divergenceResult(statistic=0.7899529284667026, pvalue=0.3741143592744989),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=0.03723409388907139, pvalue=0.846989214486915),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.021646150708492677, pvalue=0.8830323245068887),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.0021660245672526935, pvalue=0.9628793996783304),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695)]

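None of the terms showed a statistically significant difference in usage between high value and low value rows: every p-value is above 0.05. Most of these terms also occur only a handful of times, and the chi-squared test is unreliable when expected counts are that small, so only the result for `editor` (158 occurrences) carries much weight.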
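
To show where these numbers come from, the sketch below reproduces the calculation by hand for `editor`, using the counts printed earlier (45 high value and 113 low value occurrences; 61,422 high value and 155,508 low value rows). It uses only the standard library and relies on the fact that, with one degree of freedom, the chi-squared p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math

high_value_count = 61422
low_value_count = 155508
total_rows = high_value_count + low_value_count  # 216930

observed_high, observed_low = 45, 113  # counts for "editor"

# Expected counts if the term were spread evenly across all rows
total = observed_high + observed_low
total_prop = total / total_rows
expected_high = total_prop * high_value_count
expected_low = total_prop * low_value_count

# Chi-squared statistic: sum of (observed - expected)^2 / expected
statistic = ((observed_high - expected_high) ** 2 / expected_high
             + (observed_low - expected_low) ** 2 / expected_low)

# Two categories give one degree of freedom
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(expected_high, 2), round(expected_low, 2))
print(statistic, p_value)
```

The expected counts come out at roughly 44.74 and 113.26, almost identical to the observed 45 and 113, which is why the statistic is close to zero and the p-value close to 1.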

#### Next steps

Here are some potential next steps:

- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long.
  - Manually create a list of words to remove, like the, than, etc.
  - Find a list of stopwords to remove.
  - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
  - Use the apply method to make the code that calculates frequencies more efficient.
  - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
  - See which categories appear the most often.
  - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (available here) instead of the subset we used in this lesson.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
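
One of the ideas above, removing words that occur in more than a certain percentage of questions, could be sketched roughly as follows. The function name `frequent_words`, the threshold, and the toy questions are assumptions for illustration only.

```python
from collections import Counter

def frequent_words(questions, max_share=0.05):
    """Return the words that appear in more than max_share of the questions."""
    n_questions = len(questions)
    doc_freq = Counter()
    for question in questions:
        # set() so that each question counts a word at most once
        doc_freq.update(set(question.split()))
    return {word for word, count in doc_freq.items()
            if count / n_questions > max_share}

# Hypothetical, already-normalized questions
toy_questions = [
    'this state borders canada',
    'this river flows north',
    'name this author',
    'the capital of france',
]
common = frequent_words(toy_questions, max_share=0.5)
print(common)  # {'this'}
```

The resulting set could then be excluded when splitting each question, in place of the 6-character length filter.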