## Jeopardy is a popular US TV quiz show in which contestants answer questions to win money.
## You want to compete on Jeopardy, and you're looking for any edge that could help you win. In this project, you'll work with a dataset of Jeopardy questions to look for patterns in the questions that could help. The dataset can be downloaded from https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  2000 non-null   int64 
 1    Air Date    2000 non-null   object
 2    Round       2000 non-null   object
 3    Category    2000 non-null   object
 4    Value       2000 non-null   object
 5    Question    2000 non-null   object
 6    Answer      2000 non-null   object
dtypes: int64(1), object(6)
memory usage: 109.5+ KB


Strip the leading spaces from the column names (the `info()` output above shows names such as ` Air Date`).

In [8]:
jeopardy.columns = jeopardy.columns.str.strip()

In [10]:
jeopardy.Round.value_counts()

Jeopardy!           1003
Double Jeopardy!     964
Final Jeopardy!       33
Name: Round, dtype: int64

## Normalize the text columns (Question and Answer): lowercase them and remove punctuation, so that Don't and don't aren't considered different words when they are compared.

In [36]:
from string import punctuation

def normalize_string(string):
    """Lowercase the text and delete every ASCII punctuation character."""
    # Map each punctuation character to None so that translate() drops it.
    table = str.maketrans(dict.fromkeys(punctuation))
    return string.lower().translate(table)

In [47]:
normalize_string("a b c, d; e ' f' f$dfd")

'a b c d e  f fdfd'

In [37]:
table = str.maketrans(dict.fromkeys('0123456789'))
print(table)
print('123hello.jpg'.translate(table))

{48: None, 49: None, 50: None, 51: None, 52: None, 53: None, 54: None, 55: None, 56: None, 57: None}
hello.jpg


In [40]:
jeopardy['clean_question'] = jeopardy.Question.apply(normalize_string)
jeopardy['clean_answer'] = jeopardy.Answer.apply(normalize_string)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


In [46]:
jeopardy.clean_answer.unique().size

1850

## Normalize dollar values

In [53]:
from string import punctuation
def normalize_number(string):
    table = str.maketrans(dict.fromkeys(punctuation))
    new_string = string.translate(table)
    if new_string.isdigit():
        return int(new_string)
    return 0
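
As a quick sanity check, the sketch below runs `normalize_number` on a few sample strings. The inputs are illustrative assumptions about what the Value column might contain, not rows taken from the dataset.

```python
from string import punctuation

def normalize_number(string):
    # Same logic as the cell above: strip punctuation, then parse the digits.
    table = str.maketrans(dict.fromkeys(punctuation))
    new_string = string.translate(table)
    if new_string.isdigit():
        return int(new_string)
    return 0

# Hypothetical sample inputs (assumed, not taken from the dataset).
samples = ["$200", "$2,000", "None", "", "$1.50"]
results = {s: normalize_number(s) for s in samples}
print(results)
```

Anything that is not purely digits after stripping punctuation becomes 0. The last sample shows a caveat: the decimal point counts as punctuation, so a value written as "$1.50" would be read as 150.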

In [54]:
jeopardy['clean_value'] = jeopardy.Value.apply(normalize_number)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


## The cells above clean the data and convert the column formats.

## To decide whether to study past questions, study general knowledge, or not study at all, check two things:
   * How often the answer is deducible from the question, measured by how often words in the answer also occur in the question.
   * How often new questions repeat older ones, measured by how often complex words (6 or more characters) recur.

In [70]:
def count_match(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for a in split_answer:
        if a in split_question:
            match_count += 1
    return match_count / len(split_answer)
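
To see what `count_match` returns, the sketch below applies the same logic to a few hand-made rows. The rows are hypothetical plain dictionaries, used here in place of DataFrame rows.

```python
def count_match(row):
    # Same logic as the cell above; works on any mapping with these two keys.
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for a in split_answer:
        if a in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Hypothetical rows (assumed for illustration, not taken from the dataset).
rows = [
    {'clean_answer': 'the black death', 'clean_question': 'this black plague swept europe'},
    {'clean_answer': 'emu', 'clean_question': 'this flightless bird lives in australia'},
    {'clean_answer': 'the', 'clean_question': 'the most common english word'},
]
scores = [count_match(r) for r in rows]
print(scores)  # [0.5, 0.0, 0]
```

The score is the fraction of answer words (ignoring one leading "the") that also occur in the question, so it ranges from 0 to 1.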

In [71]:
jeopardy['answer_in_question'] = jeopardy.apply(count_match, axis=1)
jeopardy.sample(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
1595,5795,2009-11-20,Double Jeopardy!,MYTHICAL CREATURES,$1200,"The word ""panic"" comes from the name of a Gree...",a satyr,the word panic comes from the name of a greek ...,a satyr,1200,0.5
1742,4699,2005-01-27,Jeopardy!,TEENS IN HISTORY,$400,She was a teenage farm girl when she beat the ...,Annie Oakley,she was a teenage farm girl when she beat the ...,annie oakley,400,0.0
817,4335,2003-06-06,Double Jeopardy!,"""S""-OTERICA",$800,Slang term for a left-handed boxer or fiddle p...,southpaw,slang term for a lefthanded boxer or fiddle pl...,southpaw,800,0.0
452,6037,2010-12-07,Double Jeopardy!,4 CONSONANTS IN A ROW,$400,If you're vertical but supported by your palms...,a handstand,if youre vertical but supported by your palms ...,a handstand,400,0.0
513,5243,2007-05-30,Double Jeopardy!,DOWN MEXICO WAY,$800,This resort city about 200 miles southwest of ...,Acapulco,this resort city about 200 miles southwest of ...,acapulco,800,0.0


In [73]:
# Mean of the answer_in_question
mean_answer_in_question = jeopardy.answer_in_question.mean()
print(mean_answer_in_question)

0.05665595238095238


* On average, only about 5.7% of the words in an answer also appear in the corresponding question, so trying to deduce the answer from the wording of the question is probably not a useful strategy.
***
## How often new questions are repeats of older ones.

In [86]:
question_overlap = []
terms_used = set()
jeopardy.sort_values(by = ['Air Date'], ascending = True, inplace=True)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
1138,1279,1990-03-08,Jeopardy!,THE MIDDLE AGES,$300,It's estimated this dread 14th century epidemi...,Black Death,its estimated this dread 14th century epidemic...,black death,300,0.0
1153,1279,1990-03-08,Jeopardy!,AUSTRALIA,$500,This flightless bird is featured on Australia'...,Emu,this flightless bird is featured on australias...,emu,500,0.0
1152,1279,1990-03-08,Jeopardy!,JEWELRY,$500,Tahiti & French Polynesia are famous for pearl...,Black,tahiti french polynesia are famous for pearls...,black,500,0.0
1151,1279,1990-03-08,Jeopardy!,MANIAS,$500,"From the Greek for ""great"", it's the delusion ...",Megalomania,from the greek for great its the delusion of w...,megalomania,500,0.0
1150,1279,1990-03-08,Jeopardy!,THE MIDDLE AGES,$500,"This famous ""song"" is a romanticized account o...","""Song Of Roland""",this famous song is a romanticized account of ...,song of roland,500,0.666667


In [88]:
terms_used

set()

In [89]:
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    long_split_question = [question for question in split_question if len(question) >= 6]
    match_count = 0
    for word in long_split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(long_split_question) > 0:
        match_count = match_count / len(long_split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
1138,1279,1990-03-08,Jeopardy!,THE MIDDLE AGES,$300,It's estimated this dread 14th century epidemi...,Black Death,its estimated this dread 14th century epidemic...,black death,300,0.0,0.0
1153,1279,1990-03-08,Jeopardy!,AUSTRALIA,$500,This flightless bird is featured on Australia'...,Emu,this flightless bird is featured on australias...,emu,500,0.0,0.0
1152,1279,1990-03-08,Jeopardy!,JEWELRY,$500,Tahiti & French Polynesia are famous for pearl...,Black,tahiti french polynesia are famous for pearls...,black,500,0.0,0.0
1151,1279,1990-03-08,Jeopardy!,MANIAS,$500,"From the Greek for ""great"", it's the delusion ...",Megalomania,from the greek for great its the delusion of w...,megalomania,500,0.0,0.0
1150,1279,1990-03-08,Jeopardy!,THE MIDDLE AGES,$500,"This famous ""song"" is a romanticized account o...","""Song Of Roland""",this famous song is a romanticized account of ...,song of roland,500,0.666667,0.166667


In [83]:
next(jeopardy.iterrows())

(1138,
 Show Number                                                        1279
 Air Date                                            1990-03-08 00:00:00
 Round                                                         Jeopardy!
 Category                                                THE MIDDLE AGES
 Value                                                              $300
 Question              It's estimated this dread 14th century epidemi...
 Answer                                                      Black Death
 clean_question        its estimated this dread 14th century epidemic...
 clean_answer                                                black death
 clean_value                                                         300
 answer_in_question                                                    0
 Name: 1138, dtype: object)

In [91]:
jeopardy.tail()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
1962,6294,2012-01-19,Double Jeopardy!,AMERICAN HISTORY,$1600,His foes said that in 1877 he agreed to withdr...,(Rutherford B.) Hayes,his foes said that in 1877 he agreed to withdr...,rutherford b hayes,1600,0.0,0.625
1963,6294,2012-01-19,Double Jeopardy!,WHAT'S YOUR BEEF?,$1600,The second word in the French name of this bon...,filet mignon,the second word in the french name of this bon...,filet mignon,1600,0.0,0.5
1937,6294,2012-01-19,Jeopardy!,"THE EVOLUTION OF ""M""USIC",$800,"In the '90s it was ""Enter Sandman"" with this g...",Metallica,in the 90s it was enter sandman with this group,metallica,800,0.0,0.0
1936,6294,2012-01-19,Jeopardy!,INLETS,$800,"This Chilean city whose name means ""valley of ...",Valparaiso,this chilean city whose name means valley of p...,valparaiso,800,0.0,0.75
1953,6294,2012-01-19,Double Jeopardy!,WEAPONS OF WORLD WAR II,$800,"Ships in the U.S. Navy's Casablanca class of ""...",aircraft carriers,ships in the us navys casablanca class of esco...,aircraft carriers,800,0.0,0.6


In [92]:
jeopardy.question_overlap.mean()

0.3892065157065153

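* On average, about 39% of the long words (6 or more characters) in a question have already appeared in an earlier question. This measures the reuse of single words rather than of whole questions, so it only hints that studying past questions may be worthwhile.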
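
The overlap score depends on the order in which questions are processed, which is why the data is sorted by air date first. A minimal sketch on three made-up questions illustrates this; `overlap_scores` is a hypothetical helper that mirrors the loop above.

```python
def overlap_scores(questions, min_len=6):
    """For each question, in order, return the share of its long words seen earlier."""
    terms_used = set()
    scores = []
    for question in questions:
        long_words = [w for w in question.split() if len(w) >= min_len]
        match_count = 0
        for word in long_words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        scores.append(match_count / len(long_words) if long_words else 0)
    return scores

# Hypothetical, already normalized questions in chronological order.
questions = [
    "the ancient egyptian pharaoh",
    "this egyptian pyramid builder",
    "a famous pharaoh of ancient egypt",
]
scores = overlap_scores(questions)
print(scores)
```

The first question scores 0 because nothing has been seen yet, the second reuses one of its three long words, and the third reuses two of three ("egypt" has only five characters, so it is ignored).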

## Study the terms that pertain to high-value questions.
## Figure out which terms correspond to high-value questions using a chi-squared test.
   * Low value -- any row where the value is 800 or less.
   * High value -- any row where the value is greater than 800.

In [93]:
# Flag each question as high value (1) if clean_value > 800, otherwise as low value (0)
def sort_value(row):
    if row['clean_value'] > 800:
        return 1
    return 0

jeopardy['high_value'] = jeopardy.apply(sort_value, axis = 1)
jeopardy.tail()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
1962,6294,2012-01-19,Double Jeopardy!,AMERICAN HISTORY,$1600,His foes said that in 1877 he agreed to withdr...,(Rutherford B.) Hayes,his foes said that in 1877 he agreed to withdr...,rutherford b hayes,1600,0.0,0.625,1
1963,6294,2012-01-19,Double Jeopardy!,WHAT'S YOUR BEEF?,$1600,The second word in the French name of this bon...,filet mignon,the second word in the french name of this bon...,filet mignon,1600,0.0,0.5,1
1937,6294,2012-01-19,Jeopardy!,"THE EVOLUTION OF ""M""USIC",$800,"In the '90s it was ""Enter Sandman"" with this g...",Metallica,in the 90s it was enter sandman with this group,metallica,800,0.0,0.0,0
1936,6294,2012-01-19,Jeopardy!,INLETS,$800,"This Chilean city whose name means ""valley of ...",Valparaiso,this chilean city whose name means valley of p...,valparaiso,800,0.0,0.75,0
1953,6294,2012-01-19,Double Jeopardy!,WEAPONS OF WORLD WAR II,$800,"Ships in the U.S. Navy's Casablanca class of ""...",aircraft carriers,ships in the us navys casablanca class of esco...,aircraft carriers,800,0.0,0.6,0


In [94]:
def count_word(word):
    low_count, high_count = 0, 0
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [104]:
# Randomly pick ten terms from terms_used to test

comparison_terms = np.random.choice(list(terms_used), 10, replace=False)
observed_expected = []

for term in comparison_terms:
    high_value_count, low_value_count = count_word(term)
    observed_expected.append([high_value_count, low_value_count])

In [105]:
observed_expected

[[1, 0],
 [2, 1],
 [0, 1],
 [0, 1],
 [1, 0],
 [0, 4],
 [0, 1],
 [0, 3],
 [0, 1],
 [0, 2]]

In [108]:
high_value_count = jeopardy[jeopardy.high_value == 1].shape[0]
low_value_count = jeopardy[jeopardy.high_value == 0].shape[0]
chi_squared = []

In [113]:
from scipy.stats import chisquare
for l in observed_expected:
    total = sum(l)
    total_prop = total / jeopardy.shape[0]
    expect_count_high_value = total_prop * high_value_count
    expect_count_low_value = total_prop * low_value_count
    chi2, p = chisquare(l, f_exp=[expect_count_high_value, expect_count_low_value])
    chi_squared.append((chi2, p))

In [114]:
chi_squared

[(2.597122302158273, 0.1070579459659198),
 (2.257843586626543, 0.1329390615126475),
 (0.38504155124653744, 0.5349173571192949),
 (0.38504155124653744, 0.5349173571192949),
 (2.597122302158273, 0.1070579459659198),
 (1.5401662049861498, 0.2145930675786351),
 (0.38504155124653744, 0.5349173571192949),
 (1.1551246537396123, 0.28247894395185624),
 (0.38504155124653744, 0.5349173571192949),
 (0.7700831024930749, 0.38019134513275776)]
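
The first result can be reproduced by hand with the standard library alone. This is a sketch under one assumption: the split of 556 high-value and 1444 low-value rows is inferred from the printed statistics, since the notebook never displays those totals. `chi_squared_two_cells` is a hypothetical helper, not part of SciPy.

```python
import math

def chi_squared_two_cells(observed, expected):
    """Pearson chi-squared statistic and p-value for two cells (1 degree of freedom)."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom the survival function is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Assumed totals, inferred from the printed output rather than shown in the notebook.
n_rows, n_high, n_low = 2000, 556, 1444

observed = [1, 0]  # first term above: one high-value question, no low-value ones
total_prop = sum(observed) / n_rows
expected = [total_prop * n_high, total_prop * n_low]

chi2, p_value = chi_squared_two_cells(observed, expected)
print(expected)
print(chi2, p_value)
```

Under that assumption the expected counts are about 0.278 and 0.722, and the statistic and p-value agree with the first tuple printed above.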



* None of the ten sampled terms shows a significant difference between high- and low-value questions: every p-value is above 0.05. The terms also occur so rarely (at most four times each) that the expected counts are very small, which makes the chi-squared test unreliable for them.

Here are some potential next steps:

* Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
   * Manually create a list of words to remove, like the, than, etc.
   * Find a list of stopwords to remove.
   * Remove words that occur in more than a certain percentage (like 5%) of questions.
* Perform the chi-squared test across more terms to see which terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
   * Use the apply method to make the code that calculates frequencies more efficient.
   * Only select terms that have high frequencies across the dataset, and ignore the others.
* Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
   * See which categories appear the most often.
   * Find the probability of each category appearing in each round.
* Use the whole Jeopardy dataset instead of the 2,000-row subset used here.
* Use phrases instead of single words when checking for overlap between questions. Single words don't capture the whole context of the question well.
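
One possible way to speed up the frequency counts is to tally every term in a single pass instead of rescanning all rows for each term, as `count_word` does. The sketch below is an assumption about how that could look, not the notebook's method; `count_terms` and the toy data are hypothetical.

```python
from collections import Counter

def count_terms(questions, high_flags, min_len=6):
    """Count, in one pass, how many high- and low-value questions contain each term."""
    high_counts, low_counts = Counter(), Counter()
    for question, is_high in zip(questions, high_flags):
        # Use a set so each term is counted once per question, as count_word does.
        terms = {w for w in question.split() if len(w) >= min_len}
        (high_counts if is_high else low_counts).update(terms)
    return high_counts, low_counts

# Hypothetical toy data standing in for clean_question and high_value.
questions = [
    "this ancient egyptian pharaoh",
    "this ancient roman emperor",
    "a famous egyptian pyramid",
]
high_flags = [1, 0, 0]

high_counts, low_counts = count_terms(questions, high_flags)
print(high_counts["ancient"], low_counts["ancient"])
print(high_counts["pharaoh"], low_counts["pharaoh"])
```

On the real data, the same function could presumably be called as `count_terms(jeopardy['clean_question'], jeopardy['high_value'])`, after which the counts for any term are dictionary lookups.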
