# Data Manipulation Challenge Project

In this challenge, I wrote functions to explore a dataset of Jeopardy! questions and answers, following the project instructions.

### 1. Importing the libraries and the dataset

In [36]:
import pandas as pd
import random

In [8]:
df = pd.read_csv('jeopardy.csv')
pd.set_option('display.max_colwidth', None)

### 2. Renaming the columns to valid variable names

In [9]:
df.columns = ['show_number', 'air_date', 'round', 'category', 'value', 'question', 'answer']
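
Typing the seven names by hand works here. As a hedged alternative, the same snake_case names can be derived from the raw headers. The `raw_headers` list below is an assumption about what `jeopardy.csv` contains (the original header names are not shown in this notebook), and `to_snake_case` is a hypothetical helper.

```python
def to_snake_case(header):
    """Strip surrounding whitespace, lowercase, and join words with underscores."""
    return "_".join(header.strip().lower().split())

# Assumed raw headers, for illustration only; check df.columns for the real ones.
raw_headers = ["Show Number", " Air Date", " Round", " Category",
               " Value", " Question", " Answer"]

clean_headers = [to_snake_case(h) for h in raw_headers]
print(clean_headers)
```

With a real dataframe, this would become `df.columns = [to_snake_case(c) for c in df.columns]`.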

### 3. Defining a function to find questions that contain all the words in a list of words

In [55]:
def contains_word(df, words):
    # True for each question that contains every word (case-insensitive substring match)
    contains_all = lambda question: all(word.lower() in question.lower() for word in words)
    return df.question.apply(contains_all)

words = ['King', 'England']
filtered = df[contains_word(df, words)]
print('There are {} questions that contain the word(s): {}'.format(len(filtered), words))

There are 152 questions that contain the word(s): ['King', 'England']
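
Note that `contains_word` matches substrings, so 'King' also matches words such as 'Viking' or 'talking'. A minimal sketch of a whole-word variant, assuming regular-expression word boundaries are an acceptable definition of a "word"; `contains_all_words` is a hypothetical name, not part of the original notebook.

```python
import re

def contains_all_words(text, words):
    """Return True if every word appears in text as a whole word, ignoring case."""
    return all(
        re.search(r"\b" + re.escape(word) + r"\b", text, flags=re.IGNORECASE)
        for word in words
    )

print(contains_all_words("This King of England had six wives", ["King", "England"]))  # True
print(contains_all_words("Vikings raided the coast of England", ["King", "England"]))  # False
```

It could be applied with `df.question.apply(lambda q: contains_all_words(q, words))`; the resulting counts may differ from the substring version.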


### 4. Converting the value column to float

In [16]:
df['float_value'] = df.value.apply(lambda x: float(x[1:].replace(',','')) if x != 'None' else 0)
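
The one-liner above packs several steps together: drop the dollar sign, remove thousands separators, convert to float, and map missing values to 0. The same logic can be sketched as a named function. `parse_value` and its `missing` parameter are hypothetical additions, and the handling of non-string inputs is an assumption about how missing values may be loaded.

```python
def parse_value(raw, missing=0.0):
    """Convert a string such as '$2,000' to 2000.0; return `missing` for absent values."""
    # Non-strings (for example NaN from pandas) and the literal 'None' count as absent.
    if not isinstance(raw, str) or raw.strip() == "None":
        return missing
    return float(raw.strip().lstrip("$").replace(",", ""))

print(parse_value("$200"))    # 200.0
print(parse_value("$2,000"))  # 2000.0
print(parse_value("None"))    # 0.0
```

It would be used as `df['float_value'] = df.value.apply(parse_value)`. Counting missing values as 0 pulls averages down; passing `missing=float('nan')` instead lets `Series.mean()` skip them.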

### 5. Calculating the average difficulty of the dataframe

In [68]:
def mean_difficulty(df):
    return df.float_value.mean()

words = ['King']
df_king = df[contains_word(df, words)]
print('The average difficulty of questions that contain the word "King" is {}'.format(round(mean_difficulty(df_king), 2)))

The average difficulty of questions that contain the word "King" is 771.88


### 6. Finding the most common answer in the dataframe

In [70]:
def most_common_answer(df):
    # value_counts sorts by frequency, so the first entry is the most common answer
    answers = df.answer.value_counts()
    return answers.index[0], answers.iloc[0]

most_common, count_common = most_common_answer(df_king)
print('The most common answer for questions that contain the word "King" is {} and \
it appears {} times.'.format(most_common, count_common))

The most common answer for questions that contain the word "King" is Henry VIII and it appears 55 times.
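
`value_counts` compares exact strings, so variants such as 'Henry VIII' and 'henry viii' would be counted separately. A minimal sketch of a case-insensitive count using the standard library; `most_common_normalized` is a hypothetical helper and the sample list is made up, not taken from the dataset.

```python
from collections import Counter

def most_common_normalized(answers):
    """Return (answer, count) for the most frequent answer, ignoring case and outer spaces."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0]

# Made-up sample, for illustration only.
sample = ["Henry VIII", "henry viii", "Richard III", " Henry VIII "]
print(most_common_normalized(sample))  # ('henry viii', 3)
```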


### 7. Determining how the questions change over time

In [76]:
df['year'] = df.air_date.apply(lambda date: int(date[:4]))  # the first four characters of air_date are the year

In [77]:
def change_over_time(df, decades, words):
    first, second = decades[0], decades[1]
    first_decade = df[(df.year >= first) & (df.year < first+10)]
    second_decade = df[(df.year >= second) & (df.year < second+10)]
    
    words_first_decade = len(first_decade[contains_word(first_decade, words)])
    words_second_decade = len(second_decade[contains_word(second_decade, words)])
    increase = 100 * (words_second_decade - words_first_decade) / words_first_decade
    
    return words_first_decade, words_second_decade, increase

In [78]:
comp_nineties, comp_two_thousands, increase = change_over_time(df, [1990, 2000], ['Computer'])
print('The word "Computer" appeared in {} questions during the 90s and in {} questions in the 2000s. \n\
This represents an increase of {}%.'.format(comp_nineties, comp_two_thousands, round(increase)))

The word "Computer" appeared in 98 questions during the 90s and in 268 questions in the 2000s. 
This represents an increase of 173%.
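
The percentage is computed as 100 * (second - first) / first, which fails with a division by zero when the word never appears in the first decade. A small sketch of the same formula with that case guarded; `percent_change` is a hypothetical helper, and returning `None` for an undefined change is one possible convention.

```python
def percent_change(before, after):
    """Percentage change from before to after, or None when before is 0."""
    if before == 0:
        return None  # the change relative to zero is undefined
    return 100 * (after - before) / before

print(round(percent_change(98, 268)))  # 173
print(percent_change(0, 5))            # None
```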


### 8. Is there a connection between the round and the category?

In [80]:
def round_cat(df, cat):
    return df[df.category == cat.upper()]['round'].value_counts()

In [85]:
print(round_cat(df, 'literature'), '\n')
print('The category "Literature" appears most often in the "Double Jeopardy!" round.')

Double Jeopardy!    381
Jeopardy!           105
Final Jeopardy!      10
Name: round, dtype: int64 

The category "Literature" appears most often in the "Double Jeopardy!" round.
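
Raw counts are easier to compare as shares. The sketch below converts the counts printed above into percentages; it uses a plain dictionary rather than the dataframe so that it runs on its own.

```python
# Counts copied from the output above.
counts = {"Double Jeopardy!": 381, "Jeopardy!": 105, "Final Jeopardy!": 10}

total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items()}

for name, share in shares.items():
    print("{:<18}{:5.1f}%".format(name, share))
```

About 77% of the "Literature" questions fall in the "Double Jeopardy!" round. In pandas the same shares come from `value_counts(normalize=True)`. To argue for a real connection, these shares would still need to be compared with the round distribution of the whole dataset.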


### 9. Answer the question yourself!

In [86]:
def question():
    # randrange excludes len(df), so the index is always a valid row position
    random_index = random.randrange(len(df))
    random_question = df.question[random_index]
    print('QUESTION: {}'.format(random_question))
    answer = input('Your answer: ')
    if answer.lower() == df.answer[random_index].lower():
        print("You're correct!")
    else:
        print('Wrong answer :(')

In [87]:
question()

QUESTION: Covering some 26,000 acres, the largest U.S. municipal park system belongs to this U.S. city
Your answer: a
Wrong answer :(
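
The comparison in `question()` only ignores case, so a guess with a missing article or extra punctuation counts as wrong. A minimal sketch of a more lenient check; `normalize_answer` and `is_correct` are hypothetical helpers, and the example strings are made up rather than taken from the dataset.

```python
import string

def normalize_answer(text):
    """Lowercase, drop punctuation, and remove a leading article."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    if words and words[0] in {"a", "an", "the"}:
        words = words[1:]
    return " ".join(words)

def is_correct(guess, expected):
    """Compare two answers after normalizing both."""
    return normalize_answer(guess) == normalize_answer(expected)

print(is_correct("the beatles", "The Beatles"))         # True
print(is_correct("Beatles!", "The Beatles"))            # True
print(is_correct("The Rolling Stones", "The Beatles"))  # False
```

Inside `question()`, the check would become `if is_correct(answer, df.answer[random_index]):`.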
