# This is Jeopardy!

#### Overview

This project is slightly different than others you have encountered thus far. Instead of a step-by-step tutorial, this project contains a series of open-ended requirements which describe the project you'll be building. There are many possible ways to correctly fulfill all of these requirements, and you should expect to use the internet, Codecademy, and/or other resources when you encounter a problem that you cannot easily solve.

#### Project Goals

You will work to write several functions that investigate a dataset of _Jeopardy!_ questions and answers. Filter the dataset for topics that you're interested in, compute the average difficulty of those questions, and train to become the next Jeopardy champion!

## Prerequisites

In order to complete this project, you should have completed the Pandas lessons in the <a href="https://www.codecademy.com/learn/paths/analyze-data-with-python">Analyze Data with Python Skill Path</a>. You can also find those lessons in the <a href="https://www.codecademy.com/learn/data-processing-pandas">Data Analysis with Pandas course</a> or the <a href="https://www.codecademy.com/learn/paths/data-science/">Data Scientist Career Path</a>.

Finally, the <a href="https://www.codecademy.com/learn/practical-data-cleaning">Practical Data Cleaning</a> course may also be helpful.

## Project Requirements

1. We've provided a csv file containing data about the game show _Jeopardy!_ in a file named `jeopardy.csv`. Load the data into a DataFrame and investigate its contents. Try to print out specific columns.

   Note that in order to make this project as "real-world" as possible, we haven't modified the data at all - we're giving it to you exactly how we found it. As a result, this data isn't as "clean" as the datasets you normally find on Codecademy. More specifically, there's something odd about the column names. After you figure out the problem with the column names, you may want to rename them to make your life easier for the rest of the project.
   
   In order to display the full contents of a column, we've added this line of code for you:
   
   ```py
   pd.set_option('display.max_colwidth', None)
   ```

In [3]:
import pandas as pd
import numpy
pd.set_option('display.max_colwidth', None)

df = pd.read_csv('jeopardy.csv')
print(df)

        Show Number    Air Date             Round  \
0              4680  2004-12-31         Jeopardy!   
1              4680  2004-12-31         Jeopardy!   
2              4680  2004-12-31         Jeopardy!   
3              4680  2004-12-31         Jeopardy!   
4              4680  2004-12-31         Jeopardy!   
...             ...         ...               ...   
216925         4999  2006-05-11  Double Jeopardy!   
216926         4999  2006-05-11  Double Jeopardy!   
216927         4999  2006-05-11  Double Jeopardy!   
216928         4999  2006-05-11  Double Jeopardy!   
216929         4999  2006-05-11   Final Jeopardy!   

                               Category  Value  \
0                               HISTORY   $200   
1       ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2           EVERYBODY TALKS ABOUT IT...   $200   
3                      THE COMPANY LINE   $200   
4                   EPITAPHS & TRIBUTES   $200   
...                                 ...    ...   
216925       

2. Write a function that filters the dataset for questions that contains all of the words in a list of words. For example, when the list `["King", "England"]` was passed to our function, the function returned a DataFrame of 49 rows. Every row had the strings `"King"` and `"England"` somewhere in its `" Question"`.

   Test your function by printing out the column containing the question of each row of the dataset.

In [4]:
df = pd.read_csv('jeopardy.csv')

df.rename(columns={'Show Number':'show_number',' Air Date':'date',' Round':'round',' Category':'category'," Value":'value', ' Question':'question',' Answer':'answers'}, errors= 'raise', inplace = True) 

#print(df.head())
#print(df.columns)

# My starting solution, works but does detect input also contained in other words,
# like 'viking' is detecet when given 'King' as input

def words_scanner(list_of_words, df):
    # The function accepts a list of words then create a new DataFrame starting from original one  with only rows
    # containing the first word of the list, then it iterates through the list to scan the new df with every words
    # The resulting datafram will contain all rows that matching all the words in the list
        new_df = df[df.question.str.contains(list_of_words[0], case=False)].reset_index()
        
        #starting loop at 1 index to skip the first word that has been already used to create the current df
        i = 1
        for i in range(len(list_of_words)):
            new_df2 = new_df[new_df.question.str.contains(list_of_words[i])].reset_index()
            i=+1
        return new_df2



# Notebooke solution, works but does detect input also contained in other words,
# like 'viking' is detecet when given 'King' as input

#putting white-space after desired words seems to partially solve the issue, need to test other combo of words
"""



# Filtering a dataset by a list of words
def filter_data(df, words):
  # Lowercases all words in the list of words as well as the questions. Returns true if all of the words in the list appear in the question.
  filter = lambda x: all('{}'.format(word.lower()) in '{}'.format(x.lower()) for word in words)
  # Applies the lambda function to the Question column and returns the rows where the function returned True
  return df.loc[df["question"].apply(filter)].reset_index()

# Testing the filter function

filtered = filter_data(df, words)
print(filtered["question"])
"""



'\n\n\n\n# Filtering a dataset by a list of words\ndef filter_data(df, words):\n  # Lowercases all words in the list of words as well as the questions. Returns true if all of the words in the list appear in the question.\n  filter = lambda x: all(\'{}\'.format(word.lower()) in \'{}\'.format(x.lower()) for word in words)\n  # Applies the lambda function to the Question column and returns the rows where the function returned True\n  return df.loc[df["question"].apply(filter)].reset_index()\n\n# Testing the filter function\n\nfiltered = filter_data(df, words)\nprint(filtered["question"])\n'

3. Test your original function with a few different sets of words to try to find some ways your function breaks. Edit your function so it is more robust.

   For example, think about capitalization. We probably want to find questions that contain the word `"King"` or `"king"`.
   
   You may also want to check to make sure you don't find rows that contain substrings of your given words. For example, our function found a question that didn't contain the word `"king"`, however it did contain the word `"viking"` &mdash; it found the `"king"` inside `"viking"`. Note that this also comes with some drawbacks &mdash; you would no longer find questions that contained words like `"England's"`.

In [33]:
#1st scan. "King" "England"

words = ['king','England']
new_df = words_scanner(words, df)
print("1st  scan MATCHING ROWS FOUND: ".upper() + str(len(new_df)),'\n' )
print(new_df.head())


#2nd Scan "Queen" "Scotland"
words2 = [' Queen ',' Scotland ']
new_df2 = words_scanner(words2, df)


print("\n2nd SCAN MATCHING ROWS FOUND: ".upper() + str(len(new_df2)),'\n')
print(new_df2.head())



1ST  SCAN MATCHING ROWS FOUND: 152 

   level_0  index               category        date  \
0      161   4953           "PH"UN WORDS  1997-09-24   
1      207   6337                    Y1K  1999-12-14   
2      300   9191         WON THE BATTLE  2001-09-04   
3      391  11710       BRITISH MONARCHS  1997-03-26   
4      436  13454  A NUMBER FROM 1 TO 10  2005-03-07   

                                                                                                 question  \
0                Both England's King George V & FDR put their stamp of approval on this "King of Hobbies"   
1  In retaliation for Viking raids, this "Unready" king of England attacks Norse areas of the Isle of Man   
2                This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt   
3            This Scotsman, the first Stuart king of England, was called "The Wisest Fool in Christendom"   
4                                    It's the number that followed the last king 

In [35]:
mask = new_df['value'].str.contains("$")

new_df.loc[mask, ['question', 'value','answers']]

Unnamed: 0,question,value,answers
0,"Both England's King George V & FDR put their stamp of approval on this ""King of Hobbies""",$200,Philately (stamp collecting)
1,"In retaliation for Viking raids, this ""Unready"" king of England attacks Norse areas of the Isle of Man",$800,Ethelred
2,This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt,$800,Henry V
3,"This Scotsman, the first Stuart king of England, was called ""The Wisest Fool in Christendom""",$600,James I
4,It's the number that followed the last king of England named William,$1000,4
...,...,...,...
147,In 1066 this great-great grandson of Rollo made what some call the last Viking invasion of England,$600,William the Conqueror
148,Dutch-born king who ruled England jointly with Mary II & is a tasty New Zealand fish,"$3,000",William of Orange roughy
149,In 1781 William Herschel discovered Uranus & initially named it after this king of England,$1600,George III
150,"His nickname was ""Bertie"", but he used this name & number when he became king of England in 1901",$1000,Edward VII


4. We may want to eventually compute aggregate statistics, like `.mean()` on the `" Value"` column. But right now, the values in that column are strings. Convert the`" Value"` column to floats. If you'd like to, you can create a new column with float values.

   Now that you can filter the dataset of question, use your new column that contains the float values of each question to find the "difficulty" of certain topics. For example, what is the average value of questions that contain the word `"King"`?
   
   Make sure to use the dataset that contains the float values as the dataset you use in your filtering function.

In [36]:
try:
    new_df['value_float'] = new_df.value.apply(lambda x: x.replace("$","").replace(',','.').replace("None","0"))
    new_df['value_float'] = new_df['value_float'].astype(float)
    
except :
    pass

print("\n\n\nThe median value for questions containing words \"King\" and \"England\" is:  ".upper() + str(round(new_df.value_float.mean(),0)) +"$")
mask = new_df['value'].str.contains("$")

new_df.loc[mask, ['question', 'value_float']]




THE MEDIAN VALUE FOR QUESTIONS CONTAINING WORDS "KING" AND "ENGLAND" IS:  741.0$


Unnamed: 0,question,value_float
0,"Both England's King George V & FDR put their stamp of approval on this ""King of Hobbies""",200.0
1,"In retaliation for Viking raids, this ""Unready"" king of England attacks Norse areas of the Isle of Man",800.0
2,This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt,800.0
3,"This Scotsman, the first Stuart king of England, was called ""The Wisest Fool in Christendom""",600.0
4,It's the number that followed the last king of England named William,1000.0
...,...,...
147,In 1066 this great-great grandson of Rollo made what some call the last Viking invasion of England,600.0
148,Dutch-born king who ruled England jointly with Mary II & is a tasty New Zealand fish,3.0
149,In 1781 William Herschel discovered Uranus & initially named it after this king of England,1600.0
150,"His nickname was ""Bertie"", but he used this name & number when he became king of England in 1901",1000.0


5. Write a function that returns the count of unique answers to all of the questions in a dataset. For example, after filtering the entire dataset to only questions containing the word `"King"`, we could then find all of the unique answers to those questions. The answer "Henry VIII" appeared 55 times and was the most common answer.

In [40]:
# Note the space before king and after , it is to avoid finding words ending with king like "speaking"
words = ['king']
new_df = words_scanner(words, df)
print("Lenght of current df:" + str(len(new_df)))
print(new_df.head(5))

def answers_scanner(df):
    # The function accepts a filtered df from previous step and retuern count of unique answers to selected filter
    unique_list = df['answers'].value_counts()
    return unique_list

    

answers_scanner(new_df)



Lenght of current df:5881
   level_0  index                    category        date  \
0        0     34                 "X"s & "O"s  2004-12-31   
1        1     40  DR. SEUSS AT THE MULTIPLEX  2004-12-31   
2        2     50  DR. SEUSS AT THE MULTIPLEX  2004-12-31   
3        3     56               GEOGRAPHY "E"  2010-07-06   
4        4     72                LET'S BOUNCE  2010-07-06   

                                                                                                                                                                            question  \
0                                                                              Around 100 A.D. Tacitus wrote a book on how this art of persuasive speaking had declined since Cicero   
1  <a href="http://www.j-archive.com/media/2004-12-31_DJ_26.mp3">Ripped from today's headlines, he was a turtle king gone mad; Mack was the one good turtle who'd bring him down</a>   
2                <a href="http://www.j-archive.com/medi

Henry VIII     53
Richard III    31
Solomon        30
David          25
Louis XIV      24
               ..
mockingbird     1
John Major      1
Son of Sam      1
adrift          1
work            1
Name: answers, Length: 4452, dtype: int64

6. Explore from here! This is an incredibly rich dataset, and there are so many interesting things to discover. There are a few columns that we haven't even started looking at yet. Here are some ideas on ways to continue working with this data:

 * Investigate the ways in which questions change over time by filtering by the date. How many questions from the 90s use the word `"Computer"` compared to questions from the 2000s?
 * Is there a connection between the round and the category? Are you more likely to find certain categories, like `"Literature"` in Single Jeopardy or Double Jeopardy?
 * Build a system to quiz yourself. Grab random questions, and use the <a href="https://docs.python.org/3/library/functions.html#input">input</a> function to get a response from the user. Check to see if that response was right or wrong.

In [8]:
# number 1 : Frequency of word "Computer" in questions before and after date "1999-12-31"

questions_before = df[df.date <= "1999-12-31"].reset_index(drop=True)
questions_after = df[df.date > "1999-12-31"].reset_index(drop=True)

#print(questions_before.head(5))

#print(questions_after.head(5))

words = ['computer']
new_df_before = words_scanner(words, questions_before)
print("Lenght of 90s df:" + str(len(new_df_before)))
#print(new_df_before)


new_df_after = words_scanner(words, questions_after)
print("Lenght of 00s df:" + str(len(new_df_after)))
#print(new_df_after)


#print('The word /"computer/" appeared ' + str(len(new_df_before) + ' before 2000s and '+ str(len(new_df_after) + ' after')))
increase_percent = round(((len(new_df_after)-len(new_df_before))/len(new_df_before))*100, 2)
      
print('It is increased by: ' + str(increase_percent) + '%')




Lenght of 90s df:96
Lenght of 00s df:302
It is increased by: 214.58%


In [9]:
# number 2 : Categories percentual in different rounds ("Jeopdardy" or "Double Jeopardy")

#splitting df in 2 subset containing related rows 
questions_jeop = df[df['round'] == "Jeopardy!"].reset_index(drop=True)
questions_djeop = df[df['round'] == "Double Jeopardy!"].reset_index(drop=True)

#printing to check correct execution
#print(questions_djeop.head(5))
#print(questions_jeop.head(5))

jeop = questions_jeop[['category', 'round']]
grouped_jeop = jeop.groupby('category').count()
print('TOP 15 1st Round Category: \n',grouped_jeop.nlargest(15, 'round'))

djeop = questions_djeop[['category', 'round']]
grouped_djeop = djeop.groupby('category').count()
print('\nTOP 15 2nd Round Category: \n',grouped_djeop.nlargest(15, 'round'))
#djeop = questions_djeop[['category', 'question']]

#sorted_djeop = djeop.sort_values(['category','question']).reset_index(drop=True)
#grouped_jeop = sorted_jeop.groupby(['category']).count()

#print(sorted_djeop)


TOP 15 1st Round Category: 
                      round
category                  
POTPOURRI              255
STUPID ANSWERS         255
SPORTS                 253
ANIMALS                233
AMERICAN HISTORY       227
SCIENCE                217
STATE CAPITALS         210
TELEVISION             200
U.S. CITIES            195
BUSINESS & INDUSTRY    185
U.S. GEOGRAPHY         183
COMMON BONDS           180
POP MUSIC              180
TRANSPORTATION         178
PEOPLE                 175

TOP 15 2nd Round Category: 
                          round
category                      
BEFORE & AFTER             450
LITERATURE                 381
SCIENCE                    296
WORLD GEOGRAPHY            254
OPERA                      250
WORLD HISTORY              237
BALLET                     230
COLLEGES & UNIVERSITIES    220
ART                        215
ISLANDS                    215
CLASSICAL MUSIC            213
SHAKESPEARE                211
ART & ARTISTS              209
FICTIONAL CHARACT

In [20]:
# number 3 .  Build a system to test yourself
import random
# creating subset of df with just desired columns
df = df[['category','date','question','value','answers']]

# defining function that will generate random index between 0 and df lenght,  will get related row in existing df
# then it will shows the row containing question details to user
# will ask for "Answer" stored inside answer variable and will compare it with the actual anser
# eventually it will provide the result "CORRECT" or "NAAAH". 
# Do not leave answer input empty or function will crash the program and kernel restart will be required.

def question_picker():
    max_index = len(df)
    index = random.randint(0, max_index)
    question = df[['category','date','question','value','answers']].loc[index]
    print(question)
    answer = input()
    if answer == df.loc[index]['answers']:
        print('CORRECT')
    else:
        print("NAAAAH")
        



question_picker()    

    


category                                                 GEORGE WASHINGTON
date                                                            2010-07-05
question    George's entire tenure as president took place in this century
value                                                                $1200
answers                                             the eighteenth century
Name: 80960, dtype: object
the eighteenth century
CORRECT


category                                                         OTHER CIVIL WARS
date                                                                   2010-11-03
question    This country's civil war, lasting from 1936 to 1939, began in Morocco
value                                                                      $1,200
answers                                                                     Spain
Name: 99690, dtype: object
Spain
CORRECT


## Solution

7. Compare your program to our <a href="https://content.codecademy.com/PRO/independent-practice-projects/jeopardy/jeopardy_solution.zip">sample solution code</a> - remember, that your program might look different from ours (and probably will) and that's okay!

8. Great work! Visit <a href="https://discuss.codecademy.com/t/this-is-jeopardy-challenge-project-python-pandas/462365">our forums</a> to compare your project to our sample solution code. You can also learn how to host your own solution on GitHub so you can share it with other learners! Your solution might look different from ours, and that's okay! There are multiple ways to solve these projects, and you'll learn more by seeing others' code.