# This is Jeopardy!

#### Overview

This project is slightly different than others you have encountered thus far. Instead of a step-by-step tutorial, this project contains a series of open-ended requirements which describe the project you'll be building. There are many possible ways to correctly fulfill all of these requirements, and you should expect to use the internet, Codecademy, and/or other resources when you encounter a problem that you cannot easily solve.

#### Project Goals

You will work to write several functions that investigate a dataset of _Jeopardy!_ questions and answers. Filter the dataset for topics that you're interested in, compute the average difficulty of those questions, and train to become the next Jeopardy champion!

## Prerequisites

In order to complete this project, you should have completed the Pandas lessons in the <a href="https://www.codecademy.com/learn/paths/analyze-data-with-python">Analyze Data with Python Skill Path</a>. You can also find those lessons in the <a href="https://www.codecademy.com/learn/data-processing-pandas">Data Analysis with Pandas course</a> or the <a href="https://www.codecademy.com/learn/paths/data-science/">Data Scientist Career Path</a>.

Finally, the <a href="https://www.codecademy.com/learn/practical-data-cleaning">Practical Data Cleaning</a> course may also be helpful.

## Project Requirements

1. We've provided a CSV file named `jeopardy.csv` containing data about the game show _Jeopardy!_. Load the data into a DataFrame and investigate its contents. Try to print out specific columns.

   Note that in order to make this project as "real-world" as possible, we haven't modified the data at all - we're giving it to you exactly how we found it. As a result, this data isn't as "clean" as the datasets you normally find on Codecademy. More specifically, there's something odd about the column names. After you figure out the problem with the column names, you may want to rename them to make your life easier for the rest of the project.
   
   In order to display the full contents of a column, we've added this line of code for you:
   
   ```py
   pd.set_option('display.max_colwidth', None)
   ```
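One way to spot odd column names is to print them with `repr()`, which makes stray whitespace visible. The sketch below uses a small in-memory CSV as a stand-in for `jeopardy.csv`; the `toy` DataFrame and its contents are made up for illustration.

```python
import io
import pandas as pd

# Toy CSV standing in for jeopardy.csv; note the space after each comma in the header.
raw = io.StringIO("Show Number, Air Date, Round\n4680, 2004-12-31, Jeopardy!\n")
toy = pd.read_csv(raw)

# repr() shows the leading spaces that a plain print hides.
print([repr(name) for name in toy.columns])

# Strip surrounding whitespace from every column name in one step.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

Stripping whitespace is only one option; renaming each column explicitly also works and lets you pick shorter names.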

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None shows full column contents; -1 is deprecated

jeopardy = pd.read_csv("jeopardy.csv")

print(jeopardy.columns)
# Every column name except the first starts with a space.
jeopardy.rename(columns={
    'Show Number': 'show_number',
    ' Air Date': 'air_date',
    ' Round': 'jeo_round',
    ' Category': 'category',
    ' Value': 'value',
    ' Question': 'question',
    ' Answer': 'answer',
}, inplace=True)
print(jeopardy.head())





Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
   show_number    air_date  jeo_round                         category value  \
0  4680         2004-12-31  Jeopardy!  HISTORY                          $200   
1  4680         2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2  4680         2004-12-31  Jeopardy!  EVERYBODY TALKS ABOUT IT...      $200   
3  4680         2004-12-31  Jeopardy!  THE COMPANY LINE                 $200   
4  4680         2004-12-31  Jeopardy!  EPITAPHS & TRIBUTES              $200   

                                                                                                      question  \
0  For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory              
1  No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves   
2  The city of Yuma in this state has a record average 

2. Write a function that filters the dataset for questions that contain all of the words in a list of words. For example, when the list `["King", "England"]` was passed to our function, the function returned a DataFrame of 49 rows. Every row had the strings `"King"` and `"England"` somewhere in its `" Question"`.

   Test your function by printing out the column containing the question of each row of the dataset.

In [2]:
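def word_filter(word_list, fil_category, jeo_df):
    """Return the rows of jeo_df whose fil_category column contains every word in word_list."""
    if len(word_list) > 0:
        # The leading space keeps "king" from matching inside "viking",
        # but it also misses a word at the very start of the text.
        target = ' ' + word_list[-1].lower()
        filtered_df = jeo_df[jeo_df[fil_category].str.lower().str.contains(target, regex=False)]
        # Recurse on a slice instead of calling pop(), so the caller's list is not mutated.
        return word_filter(word_list[:-1], fil_category, filtered_df)
    else:
        return jeo_df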

filtered_questions = word_filter(["England","King"], 'question', jeopardy)
print(filtered_questions.head())

       show_number    air_date         jeo_round               category  \
4953   3003         1997-09-24  Double Jeopardy!  "PH"UN WORDS            
6337   3517         1999-12-14  Double Jeopardy!  Y1K                     
9191   3907         2001-09-04  Double Jeopardy!  WON THE BATTLE          
11710  2903         1997-03-26  Double Jeopardy!  BRITISH MONARCHS        
13454  4726         2005-03-07  Jeopardy!         A NUMBER FROM 1 TO 10   

       value  \
4953   $200    
6337   $800    
9191   $800    
11710  $600    
13454  $1000   

                                                                                                     question  \
4953   Both England's King George V & FDR put their stamp of approval on this "King of Hobbies"                 
6337   In retaliation for Viking raids, this "Unready" king of England attacks Norse areas of the Isle of Man   
9191   This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt                 


3. Test your original function with a few different sets of words to try to find some ways your function breaks. Edit your function so it is more robust.

   For example, think about capitalization. We probably want to find questions that contain the word `"King"` or `"king"`.
   
   You may also want to check to make sure you don't find rows that contain substrings of your given words. For example, our function found a question that didn't contain the word `"king"`, however it did contain the word `"viking"` &mdash; it found the `"king"` inside `"viking"`. Note that this also comes with some drawbacks &mdash; you would no longer find questions that contained words like `"England's"`.
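A possible way to handle both capitalization and substrings is a regular expression with word boundaries. The sketch below is one option under stated assumptions: `filter_whole_words` and the `toy` DataFrame are hypothetical names used for illustration and are not part of the project files.

```python
import re
import pandas as pd

def filter_whole_words(df, column, words):
    """Keep rows where `column` contains every word as a whole word, ignoring case."""
    mask = pd.Series(True, index=df.index)
    for word in words:
        # \b marks a word boundary, so "king" no longer matches inside "viking".
        pattern = r"\b" + re.escape(word) + r"\b"
        mask &= df[column].str.contains(pattern, case=False, regex=True, na=False)
    return df[mask]

toy = pd.DataFrame({"question": [
    "This king of England won at Agincourt",
    "A Viking raid on England",
    "England's King George V collected stamps",
]})
result = filter_whole_words(toy, "question", ["King", "England"])
print(result)
```

Because an apostrophe counts as a word boundary, this version still matches `"England's"`, which a filter based on splitting the question into words would miss.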

4. We may want to eventually compute aggregate statistics, like `.mean()` on the `" Value"` column. But right now, the values in that column are strings. Convert the `" Value"` column to floats. If you'd like to, you can create a new column with float values.

   Now that you can filter the dataset of questions, use your new column that contains the float values of each question to find the "difficulty" of certain topics. For example, what is the average value of questions that contain the word `"King"`?
   
   Make sure to use the dataset that contains the float values as the dataset you use in your filtering function.

In [3]:
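# Strip "$" and "," before converting. Rows without a dollar value (the string 'None',
# or a non-string such as NaN if pandas parsed it as missing) are stored as 0.
jeopardy['value_float'] = jeopardy.value.apply(lambda x: float(x.replace('$', '').replace(',', '')) if isinstance(x, str) and x != 'None' else 0)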
print(jeopardy['value_float'].mean())
print(jeopardy['value_float'].std())

739.9884755451067
639.8226925461519
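The cell above stores questions without a dollar value as 0, and those zeros pull the mean down. An alternative is to store them as `NaN`, which `.mean()` skips by default. This is a minimal sketch on a made-up three-row DataFrame (`toy`), not a result from the real dataset.

```python
import pandas as pd

toy = pd.DataFrame({"value": ["$200", "$1,000", "None"]})

# Remove "$" and ","; anything that is still not a number becomes NaN.
cleaned = toy["value"].str.replace(r"[$,]", "", regex=True)
toy["value_float"] = pd.to_numeric(cleaned, errors="coerce")

print(toy["value_float"].mean())            # NaN rows are skipped
print(toy["value_float"].fillna(0).mean())  # zeros are counted
```

Which convention is appropriate depends on whether a question without a dollar value should count as "worth nothing" or as "unknown".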


5. Write a function that returns the count of unique answers to all of the questions in a dataset. For example, after filtering the entire dataset to only questions containing the word `"King"`, we could then find all of the unique answers to those questions. The answer "Henry VIII" appeared 55 times and was the most common answer.

In [5]:
unique_answers = word_filter(["King"],'question',jeopardy).answer.nunique()
print(unique_answers)

test = jeopardy[(jeopardy.answer == "Henry VIII") & (jeopardy.question.str.contains("King"))]
print(test.answer.count())

1625
2
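Note that `.nunique()` returns the number of distinct answers, while the requirement asks how often each answer appears. A sketch of the latter with `value_counts()` follows; `answer_counts` and the `toy` data are hypothetical and only illustrate the call.

```python
import pandas as pd

def answer_counts(df, column="answer"):
    """Return how often each distinct value appears, most common first."""
    return df[column].value_counts()

toy = pd.DataFrame({"answer": ["Henry VIII", "Richard III", "Henry VIII", "Solomon"]})
counts = answer_counts(toy)
print(counts)
print(counts.idxmax(), counts.max())
```

On the real data, the counts depend on how the filter treats capitalization and partial words, so they may differ from the figure quoted in the requirement.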


6. Explore from here! This is an incredibly rich dataset, and there are so many interesting things to discover. There are a few columns that we haven't even started looking at yet. Here are some ideas on ways to continue working with this data:

 * Investigate the ways in which questions change over time by filtering by the date. How many questions from the 90s use the word `"Computer"` compared to questions from the 2000s?
 * Is there a connection between the round and the category? Are you more likely to find certain categories, like `"Literature"` in Single Jeopardy or Double Jeopardy?
 * Build a system to quiz yourself. Grab random questions, and use the <a href="https://docs.python.org/3/library/functions.html#input">input</a> function to get a response from the user. Check to see if that response was right or wrong.
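The quiz idea in the last bullet could be sketched roughly as follows. The `quiz` function, its `ask` parameter, and the `toy` DataFrame are assumptions made for this example; `ask` defaults to the built-in `input` but can be swapped for any function, which makes the quiz easy to try without typing.

```python
import pandas as pd

def quiz(df, n=1, ask=input, seed=None):
    """Ask n random questions from df and return the number answered correctly."""
    correct = 0
    for _, row in df.sample(n=n, random_state=seed).iterrows():
        response = ask(row["question"] + "\n> ")
        # Compare case-insensitively and ignore surrounding whitespace.
        if response.strip().lower() == str(row["answer"]).strip().lower():
            print("Correct!")
            correct += 1
        else:
            print("Sorry, the answer was: " + str(row["answer"]))
    return correct

toy = pd.DataFrame({
    "question": ["This king of England won the Battle of Agincourt"],
    "answer": ["Henry V"],
})
# Pass a function instead of input to run the quiz non-interactively.
score = quiz(toy, n=1, ask=lambda prompt: "henry v")
print(score)
```

Exact string comparison is strict; a real quiz would probably want a more forgiving check, for example accepting a last name only.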

In [6]:
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])
cutoff = pd.Timestamp('1990-01-01')
ques_before_90 = jeopardy[jeopardy.air_date < cutoff]
ques_after_90 = jeopardy[jeopardy.air_date >= cutoff]

print('The earliest show in the dataset aired on ' + str(ques_before_90.air_date.min()))
print(ques_before_90.question.count())
print(ques_after_90.question.count())

mean_before_90 = ques_before_90.value_float.mean()
mean_after_90 = ques_after_90.value_float.mean()
print('The mean value of the questions has increased over time: before 1990 the mean was {:3.0f}; afterwards it increased by about 300 dollars to {:3.0f}.'.format(mean_before_90, mean_after_90))

jeo_round_types = jeopardy['jeo_round'].unique()

# One sub-DataFrame per round type; these names are reused in the next cell.
df_round_jeo, df_round_jeo2, df_round_jeo3, df_round_jeo4 = [jeopardy[jeopardy.jeo_round == r] for r in jeo_round_types[:4]]
rounds = [df_round_jeo, df_round_jeo2, df_round_jeo3, df_round_jeo4]

for round_type, df_round in zip(jeo_round_types, rounds):
    print('Total number of questions in a {} round of Jeopardy: '.format(round_type) + str(df_round.question.count()))

for round_type, df_round in zip(jeo_round_types, rounds):
    print('Number of unique question categories in a {} round of Jeopardy: '.format(round_type) + str(len(df_round.category.unique())))


The earliest show in the dataset aired on 1984-09-10 00:00:00
8108
208822
The mean value of the questions has increased over time: before 1990 the mean was 464; afterwards it increased by about 300 dollars to 751.
Total number of questions in a Jeopardy! round of Jeopardy: 107384
Total number of questions in a Double Jeopardy! round of Jeopardy: 105912
Total number of questions in a Final Jeopardy! round of Jeopardy: 3631
Total number of questions in a Tiebreaker round of Jeopardy: 3
Number of unique question categories in a Jeopardy! round of Jeopardy: 15155
Number of unique question categories in a Double Jeopardy! round of Jeopardy: 14576
Number of unique question categories in a Final Jeopardy! round of Jeopardy: 1952
Number of unique question categories in a Tiebreaker round of Jeopardy: 3
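The cell above splits the data at 1990. The decade comparison suggested in the first bullet could be sketched as below, using a made-up five-row DataFrame (`toy`); the column names follow the renamed columns used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "air_date": pd.to_datetime(["1994-05-02", "1998-11-17", "2003-01-09", "2006-03-30", "2008-07-04"]),
    "question": [
        "This computer language shares its name with an island",
        "He wrote 'Moby-Dick'",
        "This company makes the Macintosh computer",
        "A Computer virus named for a mythical horse",
        "This river flows through Cairo",
    ],
})

# Floor each year to its decade, e.g. 1998 -> 1990.
toy["decade"] = toy["air_date"].dt.year // 10 * 10

# Count the questions per decade that mention the word, ignoring case.
mentions = toy[toy["question"].str.contains("computer", case=False)]
per_decade = mentions.groupby("decade").size()
print(per_decade)
```

Since the dataset holds far fewer questions from some periods than others, comparing the share of questions per decade is likely more informative than comparing raw counts.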


In [163]:
df_round_jeo_filtered = word_filter(["Literature"], 'category', df_round_jeo)
df_round_jeo2_filtered = word_filter(["Literature"], 'category', df_round_jeo2)
df_round_jeo3_filtered = word_filter(["Literature"], 'category', df_round_jeo3)
df_round_jeo4_filtered = word_filter(["Literature"], 'category', df_round_jeo4)
print('{:2.1f}%'.format(df_round_jeo_filtered.question.count()*100/float(df_round_jeo.question.count())))
print('{:2.1f}%'.format(df_round_jeo2_filtered.question.count()*100/float(df_round_jeo2.question.count())))
print('{:2.1f}%'.format(df_round_jeo3_filtered.question.count()*100/float(df_round_jeo3.question.count())))
print('{:2.1f}%'.format(df_round_jeo4_filtered.question.count()*100/float(df_round_jeo4.question.count())))

0.3%
0.6%
1.9%
0.0%
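Because `word_filter` prepends a space to each word, a category that begins with "Literature" is not counted in the percentages above. A sketch of an alternative that matches the keyword anywhere in the category name and computes the share per round in one step is shown below; the `toy` rows are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "jeo_round": ["Jeopardy!", "Jeopardy!", "Double Jeopardy!", "Double Jeopardy!", "Final Jeopardy!"],
    "category": ["HISTORY", "AMERICAN LITERATURE", "LITERATURE", "SCIENCE", "WORLD LITERATURE"],
})

# Flag categories that mention the keyword anywhere, ignoring case.
toy["is_literature"] = toy["category"].str.contains("literature", case=False)

# The mean of a boolean column is the fraction of True values; scale to percent.
share = toy.groupby("jeo_round")["is_literature"].mean() * 100
print(share.round(1))
```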


## Exploration

We want to create simplified categories to get more robust statistics and understand their distribution.
We will categorize questions according to a reduced set of keywords they may contain. In this way we can collapse the whole range of categories into a few major ones.
Each question is assigned the single category whose keywords it matches most often. If no keyword matches, or if two categories tie for the most matches, the question is assigned to 'other'.

In [134]:
literature_kws = ['book','author','write','novel','poem','literature','published']
sports_kws = ['sport','championship','football','won','baseball','player', 'basketball', 'golf', 'chess', 'volleyball', 'cricket', 'swim']
music_kws = ['singer',' CD', 'song', ' LP', 'concert','music','instrument','Billboard', 'composer', 'orchestra']
geography_kws = ['geography','city','country', 'state', 'capital', 'river', 'mountain','earth','founded','island',' lake ', 'desert']
science_kws = ['science','scientist','chemis', 'physic', 'biolog', 'discover', 'Nobel','experiment','invent','develop']
history_kws = ['year','century','rule','king','empire','tribe','war','battle','ship','president','decade','colony','senator','congress']
other_kws = []

new_categories = {'literature':literature_kws,'sports':sports_kws,'music':music_kws,'geography':geography_kws,'science':science_kws,'history':history_kws,'other':other_kws}

list(new_categories.keys())

['literature', 'sports', 'music', 'geography', 'science', 'history', 'other']

In [135]:
def find_unique_max(score):
    """Return the index of the highest score, or -1 if that score is tied."""
    naive_max = max(score)
    if score.count(naive_max) > 1:
        return -1
    return score.index(naive_max)


def find_category(question_str, category_dict):
    """Return the category whose keywords appear most often in question_str."""
    categories = list(category_dict.keys())
    question_lower = question_str.lower()
    score = [0] * len(categories)
    for j, category in enumerate(categories):
        for word in category_dict[category]:
            # Substring match: 'war' also counts inside 'award'.
            if word.lower() in question_lower:
                score[j] += 1
    index_max = find_unique_max(score)
    # No keyword hit, or a tie for first place: fall back to 'other'.
    if max(score) == 0 or index_max == -1:
        return 'other'
    return categories[index_max]

# test_ques = find_category("Who won the football world country city in 1998?", new_categories)
# print(test_ques)


In [136]:
jeopardy['new_category'] = jeopardy.question.apply(lambda x: find_category(x,new_categories))

In [182]:
print(jeopardy.new_category.unique())

jeopardy_new_cats = jeopardy.groupby(['new_category']).question.count().reset_index()

# print(jeopardy_new_cats.new_category.head())
jeopardy_new_cats['cat_fraction'] = jeopardy_new_cats.question*100/float(jeopardy_new_cats.question.sum())


print(jeopardy_new_cats)


['history' 'sports' 'geography' 'other' 'literature' 'music' 'science']
  new_category  question  cat_fraction
0  geography    23372     10.773982   
1  history      22474     10.360024   
2  literature   8561      3.946434    
3  music        6204      2.859909    
4  other        147919    68.187434   
5  science      3309      1.525377    
6  sports       5091      2.346840    
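The same percentages can also be obtained without the explicit `groupby` and division. A minimal sketch using `value_counts(normalize=True)` on a made-up four-row DataFrame (`toy`):

```python
import pandas as pd

toy = pd.DataFrame({"new_category": ["history", "other", "other", "geography"]})

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
fractions = toy["new_category"].value_counts(normalize=True) * 100
print(fractions)
```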
