# Guided Project #14 - Winning Jeopardy
By [Luis Munguia](http://www.linkedin.com/in/luis-munguia) and [Dataquest](http://www.dataquest.io)

In this guided project, I'll work with data from Jeopardy. This is a popular TV show in the US where participants try to answer questions to win money.

My objective is to figure out patterns in the questions to help me win.

This dataset is a subset of about 20,000 rows taken from the full dataset.

The data dictionary is as follows:

* `Show Number` -- the Jeopardy episode number of the show this question was in.
* `Air Date` -- the date the episode aired.
* `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* `Category` -- the category of the question.
* `Value` -- the number of dollars answering the question correctly is worth.
* `Question` -- the text of the question.
* `Answer` -- the text of the answer.

## 1.- Library and Jupyter setup.
Import `pandas` and do exploratory data analysis.

In [1]:
import pandas as pd

jeopardy = pd.read_csv("jeopardy.csv")

In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
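
The leading spaces in the column names can also be stripped programmatically instead of retyping each name. A minimal sketch, assuming a small made-up CSV sample (`raw` below is hypothetical and only mimics the header style of `jeopardy.csv`):

```python
import io

import pandas as pd

# Hypothetical one-row sample whose header has the same leading spaces.
raw = "Show Number, Air Date, Round\n4680, 2004-12-31, Jeopardy!\n"
sample = pd.read_csv(io.StringIO(raw))

# Strip surrounding whitespace from every column name in one step.
sample.columns = sample.columns.str.strip()
print(list(sample.columns))
```

This keeps working if columns are added or reordered, whereas a hand-typed list has to match the file exactly.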


In [4]:
jeopardy.columns = ["Show Number", "Air Date", "Round", "Category", "Value",
                   "Question", "Answer"]

## 2.- Normalizing Text.
Import `re` and normalize columns "Question" and "Answer".

In [5]:
jeopardy["Question"]

0        For the last 8 years of his life, Galileo was ...
1        No. 2: 1912 Olympian; football star at Carlisl...
2        The city of Yuma in this state has a record av...
3        In 1963, live on "The Art Linkletter Show", th...
4        Signer of the Dec. of Indep., framer of the Co...
5        In the title of an Aesop fable, this insect sh...
6        Built in 312 B.C. to link Rome & the South of ...
7        No. 8: 30 steals for the Birmingham Barons; 2,...
8        In the winter of 1971-72, a record 1,122 inche...
9        This housewares store was named for the packag...
10                                        "And away we go"
11       Cows regurgitate this from the first stomach t...
12       In 1000 Rajaraja I of the Cholas battled to ta...
13       No. 1: Lettered in hoops, football & lacrosse ...
14       On June 28, 1994 the nat'l weather service beg...
15       This company's Accutron watch, introduced in 1...
16       Outlaw: "Murdered by a traitor and a coward wh.

In [6]:
import re

def normalize_text(text):
    text = text.lower()
    # Raw string so that \s reaches the regex engine without an invalid-escape warning.
    replacement = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return replacement 

In [7]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...


In [8]:
for i in jeopardy.iloc[0]:
    print(i)

4680
2004-12-31
Jeopardy!
HISTORY
$200
For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory
Copernicus
for the last 8 years of his life galileo was under house arrest for espousing this mans theory


It works!

In [9]:
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)

In [10]:
for i in jeopardy.iloc[3]:
    print(i)

4680
2004-12-31
Jeopardy!
THE COMPANY LINE
$200
In 1963, live on "The Art Linkletter Show", this company served its billionth burger
McDonald's
in 1963 live on the art linkletter show this company served its billionth burger
mcdonalds


## 3.- Normalizing Values and Datetimes.
Normalize columns "Value" and "Air Date".

In [11]:
jeopardy["Value"].value_counts().sort_index(ascending = True)

$1,000      184
$1,020        1
$1,100        6
$1,111        1
$1,200       42
$1,300        6
$1,400       20
$1,492        1
$1,500       50
$1,600       19
$1,700        1
$1,800       22
$1,900        5
$10,000       3
$10,800       1
$100        804
$1000      1796
$12,000       2
$1200      1069
$1600      1027
$2,000      149
$2,021        1
$2,100        2
$2,127        1
$2,200       11
$2,300        1
$2,400        8
$2,500       18
$2,600        3
$2,800        5
           ... 
$4,100        1
$4,400        2
$4,500        1
$4,600        2
$4,700        1
$4,800        2
$400       3892
$5,000       23
$5,200        1
$5,400        1
$5,600        2
$5,800        1
$500        798
$6,000        7
$6,100        1
$6,200        1
$6,800        1
$600       1890
$7,000        7
$7,200        2
$7,400        1
$7,500        1
$700         15
$750          1
$8,000        3
$8,200        1
$800       2980
$9,000        1
$900          6
None        336
Name: Value, Length: 76,

In [12]:
def normalize_value(value):
    value = re.sub("[$,]", "", value)
    try:
        replacement = int(value)
    except Exception:
        replacement = 0
    return replacement 

In [13]:
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_value)

In [14]:
jeopardy["clean_value"].value_counts().sort_index()

0         336
100       804
200      2784
300       764
367         1
400      3892
500       798
600      1890
700        15
750         1
800      2980
900         6
1000     1980
1020        1
1100        6
1111        1
1200     1111
1300        6
1400       20
1492        1
1500       50
1600     1046
1700        1
1800       22
1900        5
2000     1223
2021        1
2100        2
2127        1
2200       11
         ... 
3500        6
3600        8
3800        2
3900        1
4000       32
4100        1
4400        2
4500        1
4600        2
4700        1
4800        2
5000       23
5200        1
5400        1
5600        2
5800        1
6000        7
6100        1
6200        1
6800        1
7000        7
7200        2
7400        1
7500        1
8000        3
8200        1
9000        1
10000       3
10800       1
12000       2
Name: clean_value, Length: 72, dtype: int64
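
To see why the number of distinct values drops from 76 to 72, it may help to run the cleaning step on a few raw strings. The sketch below uses a copy of `normalize_value` (narrowed to catch only `ValueError`, which is my assumption about the failure it is meant to handle) on three sample strings taken from the counts above:

```python
import re


def normalize_value(value):
    # Drop the dollar sign and thousands separators, then convert to int.
    value = re.sub(r"[$,]", "", value)
    try:
        return int(value)
    except ValueError:
        # Non-numeric entries such as "None" fall back to 0.
        return 0


samples = ["$1,000", "$1000", "None"]
cleaned = [normalize_value(v) for v in samples]
print(cleaned)
```

`"$1,000"` and `"$1000"` collapse into the same integer, which is why the 184 and 1796 rows above merge into 1980 rows at `1000`, and the 336 `None` rows end up at `0`.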

In [15]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [16]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

## 4.- Answers in questions.

Answer the following questions:
* How often is the answer deducible from the question?
* How often are new questions repeats of older questions?

In [17]:
def answer_questions(series):
    split_answer = series["clean_answer"].split()
    split_question = series["clean_question"].split()
    match_count = 0
    split_answer = [item for item in split_answer if item != "the"]
    if len(split_answer) == 0:
        return 0
    
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)       

jeopardy["answer_in_question"] = jeopardy.apply(answer_questions, axis = 1)

In [18]:
jeopardy["answer_in_question"].mean()

0.05834744478926688

This is Dataquest's answer:
```python
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
jeopardy["answer_in_question"].mean()
```

Dataquest's version removes only the first `"the"` from each answer, while mine removes every instance. Even so, the two means differ by only about 0.002.

In [19]:
0.060493257069335872 - 0.05834744478926688

0.0021458122800689927
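
The gap comes from answers that contain `"the"` more than once. A small sketch of the two removal strategies, using a made-up answer and question (both strings are hypothetical, not rows from the dataset):

```python
def remove_first_the(words):
    # list.remove() deletes only the first matching element.
    words = list(words)
    if "the" in words:
        words.remove("the")
    return words


def remove_all_the(words):
    # The list comprehension drops every occurrence.
    return [w for w in words if w != "the"]


def match_fraction(answer_words, question_words):
    # Share of answer words that also appear in the question.
    if not answer_words:
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)


answer = "the lord of the rings".split()
question = "this tolkien trilogy begins with the fellowship of the ring".split()

first_only = match_fraction(remove_first_the(answer), question)
all_removed = match_fraction(remove_all_the(answer), question)
print(first_only, all_removed)
```

With only the first `"the"` removed, the remaining `"the"` still counts as a match, so the score is higher. That is consistent with the slightly larger mean from Dataquest's version.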

On average, only about 6% of the words in an answer also appear in its question, which is a good indicator that I would need to study harder to win Jeopardy rather than try to deduce answers from the questions.

## 5.- Recycled Questions.

Answer the remaining question:

* How often are new questions repeats of older questions?

In [20]:
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [item for item in split_question if len(item) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()
    

0.6908737315671962

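On average, about 69% of the words longer than five characters in a question have already appeared in an earlier question. I would need to investigate further, though, since this measures the overlap of single words rather than whole questions and does not necessarily mean that Jeopardy recycles questions.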
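
To make the overlap measure concrete, here is the same loop run on three made-up questions (the strings are hypothetical). It is only a sketch of the idea, using plain lists instead of the DataFrame:

```python
questions = [
    "this planet is known as the red planet",
    "the largest planet in our solar system",
    "this scientist proposed three famous planetary motion laws",
]

terms_used = set()
overlaps = []
for question in questions:
    # Keep only words longer than five characters, as in the cell above.
    words = [w for w in question.split() if len(w) > 5]
    seen_before = sum(1 for w in words if w in terms_used)
    overlaps.append(seen_before / len(words) if words else 0)
    # Only after scoring do the current words join the pool of seen terms.
    terms_used.update(words)

print(overlaps)
```

The second question scores one third purely because it shares the word "planet" with the first, even though the two questions are different. A high average overlap can therefore come from common vocabulary rather than recycled questions.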

## 6.- Low value vs high value questions.

Do the same analysis as before, but narrow questions into two categories:

* Low value -- Any row where `Value` is `800` or less.
* High value -- Any row where `Value` is greater than `800`.

In [21]:
def high_or_low(row):
    if row["clean_value"] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy["high_value"] = jeopardy.apply(high_or_low, axis = 1)

In [22]:
jeopardy["high_value"].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [23]:
def term_counter(words):
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        if words in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_expected = []

comparison_terms = list(terms_used)[:5]

for i in comparison_terms:
    observed_expected.append(term_counter(i))        

In [24]:
observed_expected

[(1, 1), (0, 1), (1, 0), (2, 5), (1, 0)]

Why does this answer change every time I run the code?

The reason is that `comparison_terms` is built by converting the set `terms_used` to a list. Sets are unordered, so the first five elements can differ between runs.

The last time it gave me this:

[(0, 1), (1, 1), (0, 1), (0, 2), (0, 1)]
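
One way to make the selection repeatable is to impose an order before slicing, or to sample with a fixed seed. A minimal sketch, assuming a small hypothetical `terms_used` set in place of the real one:

```python
import random

# Hypothetical stand-in for the real terms_used set.
terms_used = {
    "galileo", "olympian", "arizona", "billionth",
    "constitution", "regurgitate", "weather",
}

# Option 1: sort first, so the same five terms are picked on every run.
first_five_sorted = sorted(terms_used)[:5]

# Option 2: sample from the sorted list with a seeded generator.
rng = random.Random(0)
sampled = rng.sample(sorted(terms_used), 5)

print(first_five_sorted)
```

Sorting before sampling matters: the seed only fixes which positions are drawn, so the underlying sequence has to be in a stable order too.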

## 7.- Chi-squared test

Use `scipy.stats.chisquare` to compute the chi-squared value and p-value.

In [45]:
import numpy as np
from scipy.stats import chisquare

In [41]:
# Count each group explicitly; unpacking value_counts() would depend on its sort order.
low_value_count = (jeopardy["high_value"] == 0).sum()
high_value_count = (jeopardy["high_value"] == 1).sum()
print(low_value_count, high_value_count)

14265 5734


In [49]:
chi_squared = []

for i in observed_expected:
    total = sum(i)
    total_prop = total / jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    observed = np.array([i[0],i[1]])
    expected = np.array([expected_high, expected_low])
    s, p = chisquare(observed, expected)
    chi_squared.append((s,p))   

In [50]:
chi_squared

[(0.4448774816612795, 0.5047776487545996),
 (0.401962846126884, 0.5260772985705469),
 (2.487792117195675, 0.11473257634454047),
 (3.423170782846152e-05, 0.9953317740648371),
 (2.487792117195675, 0.11473257634454047)]

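None of the p-values is below 0.05, so none of these terms shows a statistically significant difference in usage between high- and low-value questions. The observed counts are also very small, which makes the chi-squared test unreliable here. Two terms produced exactly the same result because they have the same observed counts, `(1, 0)`.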
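
The identical results can be checked by recomputing the statistic by hand. This is a sketch rather than a replacement for `scipy.stats.chisquare`: it assumes two categories (one degree of freedom), for which the p-value equals `erfc(sqrt(statistic / 2))`, and it reuses the row counts and observed pairs printed earlier in the notebook.

```python
import math

# Row counts printed earlier in the notebook.
LOW_ROWS = 14265
HIGH_ROWS = 5734
TOTAL_ROWS = LOW_ROWS + HIGH_ROWS


def chi_squared_two_groups(high_observed, low_observed):
    # Expected counts if the term were spread in proportion to group size.
    total_prop = (high_observed + low_observed) / TOTAL_ROWS
    expected_high = total_prop * HIGH_ROWS
    expected_low = total_prop * LOW_ROWS
    statistic = (
        (high_observed - expected_high) ** 2 / expected_high
        + (low_observed - expected_low) ** 2 / expected_low
    )
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


observed_pairs = [(1, 1), (0, 1), (1, 0), (2, 5), (1, 0)]
results = [chi_squared_two_groups(high, low) for high, low in observed_pairs]
for pair, result in zip(observed_pairs, results):
    print(pair, result)
```

The third and fifth terms both have observed counts `(1, 0)`, so they necessarily produce the same statistic and p-value.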

## 8.- Next steps.

* Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long, such as:
    * Manually create a list of words to remove.
    * Find a list of stopwords to remove.
    * Remove words that occur in more than a certain percentage of questions.
* Perform the chi-squared test across more terms to see which terms have larger differences. The code is slow, but here are some ideas:
    * Use the `apply` method to make the code that calculates frequencies more efficient.
    * Only select terms that have high frequencies.
* Look more into the `Category` column and see if any interesting analysis can be done with it:
    * See which categories appear the most often.
    * Find the probability of each category appearing in each round.
* Use the whole Jeopardy dataset instead of the subset used.
* Use phrases instead of single words when seeing if there's overlap between questions.
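
As a starting point for the idea of removing words that occur in too many questions, here is a rough sketch. The function name `frequent_words`, the `max_fraction` threshold and the sample questions are all hypothetical, chosen only for illustration:

```python
from collections import Counter


def frequent_words(questions, max_fraction):
    # Count in how many questions each word appears (once per question).
    doc_freq = Counter()
    for question in questions:
        doc_freq.update(set(question.split()))
    cutoff = max_fraction * len(questions)
    return {word for word, count in doc_freq.items() if count > cutoff}


questions = [
    "this state has a record average of sunshine",
    "this company served its billionth burger",
    "this insect shares billing with a grasshopper",
    "cows regurgitate this from the first stomach",
]

# Words found in more than half of the questions are treated as uninformative.
too_common = frequent_words(questions, 0.5)
filtered = [[w for w in q.split() if w not in too_common] for q in questions]
print(too_common)
```

Unlike the length-based filter, this keeps short but distinctive words and drops long but ubiquitous ones. The threshold would need tuning on the real data.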

To be continued...