# This is Jeopardy!

#### Overview

This project is slightly different than others you have encountered thus far. Instead of a step-by-step tutorial, this project contains a series of open-ended requirements which describe the project you'll be building. There are many possible ways to correctly fulfill all of these requirements, and you should expect to use the internet, Codecademy, and/or other resources when you encounter a problem that you cannot easily solve.

#### Project Goals

You will work to write several functions that investigate a dataset of _Jeopardy!_ questions and answers. Filter the dataset for topics that you're interested in, compute the average difficulty of those questions, and train to become the next Jeopardy champion!

## Prerequisites

In order to complete this project, you should have completed the Pandas lessons in the <a href="https://www.codecademy.com/learn/paths/analyze-data-with-python">Analyze Data with Python Skill Path</a>. You can also find those lessons in the <a href="https://www.codecademy.com/learn/data-processing-pandas">Data Analysis with Pandas course</a> or the <a href="https://www.codecademy.com/learn/paths/data-science/">Data Scientist Career Path</a>.

Finally, the <a href="https://www.codecademy.com/learn/practical-data-cleaning">Practical Data Cleaning</a> course may also be helpful.

## Project Requirements

1. We've provided a CSV file named `jeopardy.csv` containing data about the game show _Jeopardy!_. Load the data into a DataFrame and investigate its contents. Try to print out specific columns.

   Note that in order to make this project as "real-world" as possible, we haven't modified the data at all; we're giving it to you exactly as we found it. As a result, this data isn't as "clean" as the datasets you normally find on Codecademy. More specifically, there's something odd about the column names. After you figure out the problem with the column names, you may want to rename them to make your life easier for the rest of the project.
   
   In order to display the full contents of a column, we've added this line of code for you:
   
   ```py
   pd.set_option('display.max_colwidth', None)
   ```

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
jeopardy = pd.read_csv('jeopardy.csv')
# Strip leading/trailing whitespace from every column name in a single call
jeopardy = jeopardy.rename(columns=str.strip)
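
To see why the column names need cleaning, it can help to print them with `repr()`, which makes stray whitespace visible. The sketch below uses a tiny made-up CSV rather than `jeopardy.csv`; the `Show Number` column and the sample row are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical one-row sample that mimics a header with a space after each comma.
sample_csv = io.StringIO(
    "Show Number, Question, Value, Answer\n"
    "1,Sample question,$200,Sample answer\n"
)
sample = pd.read_csv(sample_csv)

# repr() wraps each name in quotes, so a leading space becomes easy to spot.
raw_columns = [repr(col) for col in sample.columns]
print(raw_columns)

# Strip whitespace from all column names at once.
sample.columns = sample.columns.str.strip()
clean_columns = list(sample.columns)
print(clean_columns)
```

After the strip, columns can be accessed as `sample.Question` or `sample["Value"]` without the leading space.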

2. Write a function that filters the dataset for questions that contain all of the words in a list of words. For example, when the list `["King", "England"]` was passed to our function, the function returned a DataFrame of 49 rows. Every row had the strings `"King"` and `"England"` somewhere in its `" Question"`.

   Test your function by printing out the column containing the question of each row of the dataset.

In [2]:
'''
  Name:         contains_filter_toStr 
  Type:         Void
  Description:  Prints the results of the contains_filter
                  to the console.
  Parameters:   words  => list of words the dataset is filtered by
                numStr => Number of entries that contain the words
                          in the given 'words' list.
                data   => The dataset in which to filter 
  Returns:      None
'''
def contains_filter_toStr(words, numStr, data):
    str_o = ''
    for i in range(len(words)):
        if (i == len(words) - 2):
            str_o += '\'{}\' and '.format(words[i])
        elif (i == len(words) - 1):
            str_o += '\'{}\''.format(words[i])
        else:
            str_o += '\'{}\', '.format(words[i])
    print('Number of Questions that contain the word(s) {}: {}\n'.format(str_o, numStr))
    print('Examples: \n--------\n')
    for index, row in data.head().iterrows():
        print('\t{}\n'.format(row['Question']))
    print('\n')

In [3]:
'''
  Name:         contains_filter
  Type:         Pandas Dataframe
  Description:  Used to filter the jeopardy dataset's 'Question'
                  column by a given list of words (only returns 
                  the entries where the question contains the 
                  words in the given 'words' list).
  Parameters:   data  => the dataset in which to filter
                words => a list of words to filter the dataset
                         by
  Returns:      Filtered dataset where the question in each entry
                  contains all the words in the given 'words' list.
'''
def contains_filter(data, words):
    # Named 'matches_all' rather than 'filter' to avoid shadowing the built-in.
    matches_all = lambda x: all(word.lower() in x.lower() for word in words)
    filtered_data = data[data.Question.apply(matches_all)]
    contains_filter_toStr(words, filtered_data.shape[0], filtered_data)
    return filtered_data
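
As an alternative to `apply` with a lambda, the same case-insensitive substring filter can be sketched with pandas' vectorized string methods. This is only one possible approach; the function name `contains_all_words` and the small `demo` DataFrame are made up for illustration.

```python
import pandas as pd

def contains_all_words(data, words, column="Question"):
    """Keep rows whose `column` contains every word, ignoring case."""
    mask = pd.Series(True, index=data.index)
    for word in words:
        # regex=False treats the word as plain text; na=False treats missing
        # questions as non-matches.
        mask &= data[column].str.contains(word, case=False, regex=False, na=False)
    return data[mask]

# Hypothetical sample data (not taken from jeopardy.csv).
demo = pd.DataFrame({"Question": [
    "This king of England beat the odds",
    "England's capital city",
    "A Viking raid on ENGLAND",
]})
result = contains_all_words(demo, ["King", "England"])
print(result)
```

Like the function above, this is a plain substring match, so the third row is kept because `"Viking"` contains `"king"`.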

In [4]:
import time

words = ['King', 'England']
start_time = time.perf_counter()  # monotonic clock intended for timing code
filtered_data = contains_filter(jeopardy, words)
print('\nRUN-TIME: --- %s SECONDS --- ' % (time.perf_counter() - start_time))

Number of Questions that contain the word(s) 'King' and 'England': 152

Examples: 
--------

	Both England's King George V & FDR put their stamp of approval on this "King of Hobbies"

	In retaliation for Viking raids, this "Unready" king of England attacks Norse areas of the Isle of Man

	This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt

	This Scotsman, the first Stuart king of England, was called "The Wisest Fool in Christendom"

	It's the number that followed the last king of England named William




RUN-TIME: --- 0.3371388912200928 SECONDS --- 


3. Test your original function with a few different sets of words to try to find some ways your function breaks. Edit your function so it is more robust.

   For example, think about capitalization. We probably want to find questions that contain the word `"King"` or `"king"`.
   
   You may also want to make sure you don't match rows where a given word appears only as a substring of another word. For example, our function found a question that didn't contain the word `"king"` but did contain the word `"viking"`: it matched the `"king"` inside `"viking"`. Note that guarding against this comes with a drawback: you would no longer find questions that contain words like `"England's"`.
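
One way to sketch the whole-word check described above is a regular expression with word boundaries (`\b`). This is an alternative technique, not the approach used in the cells that follow; the function name `contains_whole_words` and the `demo` rows are hypothetical.

```python
import re
import pandas as pd

def contains_whole_words(data, words, column="Question"):
    """Keep rows whose `column` contains every word as a whole word."""
    mask = pd.Series(True, index=data.index)
    for word in words:
        # \b matches a word boundary; re.escape protects against any regex
        # metacharacters inside the search word.
        pattern = r"\b" + re.escape(word) + r"\b"
        mask &= data[column].str.contains(pattern, case=False, regex=True, na=False)
    return data[mask]

# Hypothetical sample data (not taken from jeopardy.csv).
demo = pd.DataFrame({"Question": [
    "This king of England beat the odds",
    "A Viking raid on England",
    "England's King George V",
    "The kings of England",
]})
whole = contains_whole_words(demo, ["King", "England"])
print(whole)
```

With `\b`, an apostrophe counts as a boundary, so `"England's"` still matches, while `"Viking"` and the plural `"kings"` do not. Whether plurals should match is a design choice left to you.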

In [5]:
'''
  Name:         consists
  Type:         bool
  Description:  Used to determine if the phrase contains at least
                  one variation of every target word
  Parameters:   phrase => string to check for target word(s)
                target => dictionary mapping each target word to
                          its list of variations
  Returns:      returns TRUE if phrase contains every target word
                  (in any variation) or FALSE otherwise
'''
def consists(phrase, target):
    # Pad with spaces so words at the start or end of the phrase still match
    # the space-delimited variations.
    phrase = ' {} '.format(phrase).lower()
    return all(
        any(variation.lower() in phrase for variation in target[word])
        for word in target
    )

In [6]:
'''
  Name:         gen_variation
  Type:         string
  Description:  Used to apply specific grammar (or punctuation) 
                to a word or input. Thereby generating a 'variation' 
                of that word or input.
  Parameters:   word    => word or input to apply grammar to and 
                           generate variation of.
                grammar => specific grammar (or punctuation) to apply 
                           to word or input
  Returns:      Returns 'variation' of word or input with a specific
                grammar (or punctuation) applied to it.
'''
def gen_variation(word, grammar):
    if (len(grammar) == 2):
        if (grammar[1] == ' '):
            return grammar[0] + word
        else:
            return grammar[0] + word + grammar[1]
    elif (len(grammar) == 1):
        return word + grammar
    else:
        return word

In [7]:
'''
  Name:         gen_allVariations
  Type:         Dictionary
  Description:  Used to generate all the variations of a list of words 
                or inputs by applying all possible combinations of grammar
                (or punctuation) to each word or input in the list.
  Parameters:   words => list of words to generate all variations of
  Returns:      Returns a dictionary, where the keys represent the 
                words we would like to generate all variations of 
                and the values are all the possible variations of
                its respective word or input key.
'''
def gen_allVariations(words):
    word_dict = {}
    grammar_dict = { # grammar dictionary used to represent 
                     # all the possible punctuation marks
         0: '',      1: '\' ',    2: '\'',
         3: '"',     4: '" ',     5: ',',      
         6: '.',     7: '!',      8: '?',      
         9: '""',   10: "''",    11: '( ',    
        12: ')',    13: '()',    14: 's'
    }
    for word in words: # for loop to iterate through all the words in the
                       # 'words' list
        word_dict[word] = [' {} '.format(word)]
        for key1 in grammar_dict: # nested for loop to iterate through all the
                                  # possible punctuation marks and generate all
                                  # the possible variations for each word or 
                                  # input
            for key2 in grammar_dict:
                if ((grammar_dict[key1] == '') and (grammar_dict[key2] == '')):
                    continue
                elif ((grammar_dict[key1] == 's') and (grammar_dict[key2] == 's')):
                    continue
                else:   
                    newWord = gen_variation(word, grammar_dict[key1])
                    newWord = gen_variation(newWord, grammar_dict[key2])
                    word_dict[word].append(' {} '.format(newWord))
    return word_dict

In [8]:
'''
  Name:         contains_filter_improved
  Type:         Pandas DataFrame
  Description:  Modified version of the contains_filter function
                which finds all the questions that contain all the
                words, or any variation of them, in the given
                'words' list. Each variation applies a different
                punctuation to its respective word or input.
                Filtered data does not include entries where the
                words in the 'words' list are only featured as
                substrings of other words (e.g. King vs Viking).
  Parameters:   data  => dataframe to search through
                words => list of words for which to filter the
                         data by
  Returns:      Returns filtered dataframe, where each value in
                the 'Question' column contains all the words,
                or any variation of them, in the given 'words'
                list.
'''
def contains_filter_improved(data, words):
    words_dict = gen_allVariations(words)
    # Cheap substring pre-filter: it shrinks the number of rows the slower
    # 'consists' check has to scan, while still keeping every row that could
    # match a variation of the given words.
    prefilter = lambda x: all(word.lower() in x.lower() for word in words)
    filtered_data = data[data.Question.apply(prefilter)]
    filtered_data = filtered_data[filtered_data.Question.apply(lambda x: consists(x, words_dict))]
    contains_filter_toStr(words, filtered_data.shape[0], filtered_data)
    return filtered_data

In [9]:
filtered_data = contains_filter_improved(jeopardy, words)
words = ['Norse', 'God']
filtered_data = contains_filter_improved(jeopardy, words)

Number of Questions that contain the word(s) 'King' and 'England': 128

Examples: 
--------

	Both England's King George V & FDR put their stamp of approval on this "King of Hobbies"

	In retaliation for Viking raids, this "Unready" king of England attacks Norse areas of the Isle of Man

	This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt

	This Scotsman, the first Stuart king of England, was called "The Wisest Fool in Christendom"

	It's the number that followed the last king of England named William



Number of Questions that contain the word(s) 'Norse' and 'God': 54

Examples: 
--------

	This Norse god known for his great strength was a protector of peasants & farmers

	Named for the top Norse god, this important port is Denmark's third-largest city

	This Norse god led his brothers in an attack on Ymir, the first giant

	Asgard was home to the Norse gods & this most famous palace

	Known as hump day, this day of the week is named for the Nors

4. We may want to eventually compute aggregate statistics, like `.mean()` on the `" Value"` column. But right now, the values in that column are strings. Convert the `" Value"` column to floats. If you'd like to, you can create a new column with float values.

   Now that you can filter the dataset of questions, use your new column that contains the float values of each question to find the "difficulty" of certain topics. For example, what is the average value of questions that contain the word `"King"`?
   
   Make sure to use the dataset that contains the float values as the dataset you use in your filtering function.
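
A minimal sketch of the string-to-float conversion using vectorized string methods and `pd.to_numeric`. The dollar amounts below are made-up sample values, and the assumption that values look like `"$1,000"` is based on the characters the project's own solution strips.

```python
import pandas as pd

# Hypothetical sample of raw values, including one missing entry.
values = pd.Series(["$200", "$1,000", None, "$600"])

# Remove the dollar sign and thousands separator as plain text (regex=False).
cleaned = values.str.replace("$", "", regex=False).str.replace(",", "", regex=False)

# errors="coerce" turns anything that still isn't numeric into NaN.
as_float = pd.to_numeric(cleaned, errors="coerce")

# mean() skips NaN by default.
mean_value = as_float.mean()
print(mean_value)
```

Missing values stay missing rather than raising an error, and they are ignored when the mean is computed.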

In [10]:
'''
  Name:         toFloat
  Type:         float
  Description:  Used to convert the datatype of the
                'Value' column in the jeopardy dataframe
                from string to float
  Parameters:   val => value in 'Value' column to convert to
                       datatype 'float'
  Returns:      value converted to datatype 'float'
'''
def toFloat(val):
    newVal = val
    if (not isinstance(val, float)):
        if (isinstance(val, str)):
            newVal = newVal.strip('$').replace(',', '')
            newVal = float(newVal)
        else:
            newVal = float(newVal)
    return newVal
jeopardy.Value = jeopardy.Value.apply(toFloat)
king_mean = contains_filter_improved(jeopardy, ['King']).Value.mean()
print('Average value for questions containing the word \'King\': {}'.format(round(king_mean, 2)))

Number of Questions that contain the word(s) 'King': 2879

Examples: 
--------

	<a href="http://www.j-archive.com/media/2004-12-31_DJ_26.mp3">Ripped from today's headlines, he was a turtle king gone mad; Mack was the one good turtle who'd bring him down</a>

	Between 1842 & 1885, he repeatedly revised his "Idylls of the King"

	Robin Quivers is the radio consort of this self-proclaimed  "King of All Media"

	A Norman could say, "I'm the king of the motte-and-bailey style of" this

	Examples of this TV format include "Leave It to Beaver" & "The King of Queens"



Average value for questions containing the word 'King': 821.73


5. Write a function that returns the count of unique answers to all of the questions in a dataset. For example, after filtering the entire dataset to only questions containing the word `"King"`, we could then find all of the unique answers to those questions. The answer "Henry VIII" appeared 55 times and was the most common answer.
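
Counting unique answers can also be sketched with `Series.value_counts()`, which returns the counts sorted from most to least common. The answers below are invented sample data, not results from the real dataset.

```python
import pandas as pd

# Hypothetical filtered dataset with a repeated answer.
answers = pd.DataFrame({"Answer": [
    "Henry VIII", "Richard III", "Henry VIII", "Solomon",
]})

# value_counts() returns a Series indexed by answer, sorted by count (descending).
counts = answers["Answer"].value_counts()
top_answer = counts.index[0]
print(counts)
print(top_answer)
```

Because the result is sorted, the first index entry is the most common answer, and `counts.head()` gives a quick look at the top few.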

In [11]:
'''
  Name:         unique_count
  Type:         Pandas Series
  Description:  Groups the given dataset by its 'answers' column and
                counts the number of questions each unique answer
                applies to.
  Parameters:   dataset       => dataset to group
                questions_col => name of the dataset's 'questions' column
                answers_col   => name of the dataset's 'answers' column
  Returns:      Returns a Series indexed by the unique answers in the
                given dataset's 'answers' column, whose values are the
                number of questions each answer applies to.
'''
def unique_count(dataset, questions_col, answers_col):
    count = dataset.groupby(answers_col)[questions_col].count()
    return count

king_data = contains_filter(jeopardy, ['King'])
print('Number of questions where \'Henry VIII\' was the answer: {}'.format(unique_count(king_data, 'Question', 'Answer')['Henry VIII']))

Number of Questions that contain the word(s) 'King': 7409

Examples: 
--------

	Around 100 A.D. Tacitus wrote a book on how this art of persuasive speaking had declined since Cicero

	<a href="http://www.j-archive.com/media/2004-12-31_DJ_26.mp3">Ripped from today's headlines, he was a turtle king gone mad; Mack was the one good turtle who'd bring him down</a>

	<a href="http://www.j-archive.com/media/2004-12-31_DJ_24.mp3">"500 Hats"... 500 ways to die.  On July 4th, this young boy will defy a king... & become a legend</a>

	It's the largest kingdom in the United Kingdom

	In this kid's game, you bounce a small rubber ball while picking up 6-pronged metal objects



Number of questions where 'Henry VIII' was the answer: 55
