# Jeopardy Game

The goal of this project is to investigate a dataset of "Jeopardy" game questions using pandas methods and to write several aggregate functions that surface some insights.

## Tasks

### 1. Investigate and Clean

1.1. First we import pandas and NumPy.

In [1]:
import pandas as pd
import numpy as np 

1.2. After that, we load the data from `jeopardy.csv` into a DataFrame and investigate its contents.

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')

1.3. Let's print the first 5 rows of the dataset.

In [3]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


1.4. It seems the questions are truncated in the output. Let's change the display setting so that full strings are shown; it may come in handy later.

In [14]:
pd.set_option('display.max_colwidth', None)
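
If changing the option for the whole session is not desirable, the same effect can be scoped to a single block with `pd.option_context`. The sketch below is only an illustration: the one-row `demo` frame is made up and is not part of the Jeopardy data.

```python
import pandas as pd

# Hypothetical frame with one long string, used only for illustration
demo = pd.DataFrame({'question': ['x' * 120]})
before = pd.get_option('display.max_colwidth')

# Inside the block, long strings are shown in full
with pd.option_context('display.max_colwidth', None):
    full = repr(demo)

# With a 50-character limit (the pandas default), the string is cut off
with pd.option_context('display.max_colwidth', 50):
    truncated = repr(demo)

# The session-wide setting is unchanged once the blocks exit
print(pd.get_option('display.max_colwidth') == before)
```

This keeps the wide display for the cells that need it and leaves other output compact.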

1.5. Now let's print out the column names.

In [5]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

1.6. The column names contain leading spaces that make them awkward to work with, so let's rename them.

In [6]:
jeopardy.rename(columns={
    'Show Number': 'show_number',
    ' Air Date': 'air_date', 
    ' Round': 'round',
    ' Category': 'category',
    ' Value': 'value',
    ' Question': 'question',
    ' Answer': 'answer'
}, inplace=True)
jeopardy.head(5)

Unnamed: 0,show_number,air_date,round,category,value,question,answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States",John Adams
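
As an alternative to listing every column by hand, the names could be normalized programmatically with the `.str` accessor on the column index. This is a sketch on a made-up empty frame (`raw`) whose headers mimic the ones above. It assumes that stripping whitespace, lower-casing, and replacing inner spaces with underscores is the desired convention.

```python
import pandas as pd

# Hypothetical frame mimicking the raw headers (leading spaces, mixed case)
raw = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round', ' Value'])

# Strip surrounding whitespace, lower-case, then replace inner spaces
raw.columns = (
    raw.columns
    .str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
)

print(list(raw.columns))  # ['show_number', 'air_date', 'round', 'value']
```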


1.7. Let's check columns' data types to understand whether they are acceptable for wrangling.

In [7]:
jeopardy.dtypes

show_number    int64 
air_date       object
round          object
category       object
value          object
question       object
answer         object
dtype: object

It seems the `value` column should definitely be numeric so that we can perform calculations on it.

1.8. Let's check which extraneous symbols it contains so that we can remove them.

In [8]:
jeopardy.value.unique()

array(['$200', '$400', '$600', '$800', '$2,000', '$1000', '$1200',
       '$1600', '$2000', '$3,200', 'None', '$5,000', '$100', '$300',
       '$500', '$1,000', '$1,500', '$1,200', '$4,800', '$1,800', '$1,100',
       '$2,200', '$3,400', '$3,000', '$4,000', '$1,600', '$6,800',
       '$1,900', '$3,100', '$700', '$1,400', '$2,800', '$8,000', '$6,000',
       '$2,400', '$12,000', '$3,800', '$2,500', '$6,200', '$10,000',
       '$7,000', '$1,492', '$7,400', '$1,300', '$7,200', '$2,600',
       '$3,300', '$5,400', '$4,500', '$2,100', '$900', '$3,600', '$2,127',
       '$367', '$4,400', '$3,500', '$2,900', '$3,900', '$4,100', '$4,600',
       '$10,800', '$2,300', '$5,600', '$1,111', '$8,200', '$5,800',
       '$750', '$7,500', '$1,700', '$9,000', '$6,100', '$1,020', '$4,700',
       '$2,021', '$5,200', '$3,389', '$4,200', '$5', '$2,001', '$1,263',
       '$4,637', '$3,201', '$6,600', '$3,700', '$2,990', '$5,500',
       '$14,000', '$2,700', '$6,400', '$350', '$8,600', '$6,300', '$250',
    

1.9. We are going to replace the **"None"** value with `NaN`, which is excluded from calculations, and remove the other unnecessary symbols.

In [9]:
# Replace the string 'None' with NaN in every column
jeopardy[jeopardy.columns] = jeopardy[jeopardy.columns].replace('None', np.nan)

# Strip '$', '%' and ',' and convert the column to a numeric dtype
cut_the_tails = lambda column: pd.to_numeric(column.replace(r'[$%,]', '', regex=True))
jeopardy.value = cut_the_tails(jeopardy.value)
jeopardy.value

0         200.0 
1         200.0 
2         200.0 
3         200.0 
4         200.0 
          ...   
216925    2000.0
216926    2000.0
216927    2000.0
216928    2000.0
216929   NaN    
Name: value, Length: 216930, dtype: float64
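
To see the cleaning step in isolation, here is a minimal sketch on a made-up `sample` Series that covers the formats seen in the `value` column. It uses `errors='coerce'` as an alternative way to turn the string `'None'` into `NaN`. This is an assumption about what is acceptable here, because `coerce` silently converts any other non-numeric string to `NaN` as well.

```python
import pandas as pd

# Made-up sample covering the formats seen in the `value` column
sample = pd.Series(['$200', '$2,000', 'None', '$1,492'])

# Strip '$' and ',' first, then coerce anything non-numeric to NaN
cleaned = pd.to_numeric(
    sample.str.replace(r'[$,]', '', regex=True),
    errors='coerce',
)

print(cleaned.tolist())  # [200.0, 2000.0, nan, 1492.0]
```

The result is `float64` rather than an integer type because `NaN` is a floating-point value.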

1.10. Now it's a numeric type and we can check out some stats of the `value` column. 

In [10]:
jeopardy.value.describe()

count    213296.000000
mean     752.595923   
std      637.855303   
min      5.000000     
25%      400.000000   
50%      600.000000   
75%      1000.000000  
max      18000.000000 
Name: value, dtype: float64

### 2. Functions

2.1. We are going to write a function that filters the dataset for questions that contain all of the words in a given list. For example, when the list `["King", "England"]` is passed to our function, it returns every row that has both "King" and "England" as whole words, ignoring case, somewhere in its `question` column.

In [11]:
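def find_words(dataset, column, word_list):
    # Start with a mask that keeps every row
    rows = pd.Series(True, index=column.index)
    for word in word_list:
        # Keep only rows containing the word as a whole word, ignoring case;
        # na=False treats missing text as "no match"
        rows = rows & column.str.contains(r'\b' + word + r'\b', case=False, na=False)
    return dataset[rows]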

word_list = ['England', 'King']
questions = find_words(jeopardy, jeopardy.question, word_list)
questions

Unnamed: 0,show_number,air_date,round,category,value,question,answer
4953,3003,1997-09-24,Double Jeopardy!,"""PH""UN WORDS",200.0,"Both England's King George V & FDR put their stamp of approval on this ""King of Hobbies""",Philately (stamp collecting)
6337,3517,1999-12-14,Double Jeopardy!,Y1K,800.0,"In retaliation for Viking raids, this ""Unready"" king of England attacks Norse areas of the Isle of Man",Ethelred
9191,3907,2001-09-04,Double Jeopardy!,WON THE BATTLE,800.0,This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt,Henry V
11710,2903,1997-03-26,Double Jeopardy!,BRITISH MONARCHS,600.0,"This Scotsman, the first Stuart king of England, was called ""The Wisest Fool in Christendom""",James I
13454,4726,2005-03-07,Jeopardy!,A NUMBER FROM 1 TO 10,1000.0,It's the number that followed the last king of England named William,4
...,...,...,...,...,...,...,...
201168,3515,1999-12-10,Jeopardy!,BEFORE & AFTER,500.0,Popular Saint-Exupery character waiting around to become king of England,The Little Prince of Wales
204778,5899,2010-04-15,Double Jeopardy!,THE 13 COLONIES' NAME ORIGINS,1200.0,"This southern colony was named for a king of England, the II of that name",Georgia
208742,4863,2005-11-02,Double Jeopardy!,BEFORE & AFTER,3000.0,Dutch-born king who ruled England jointly with Mary II & is a tasty New Zealand fish,William of Orange roughy
213870,5856,2010-02-15,Double Jeopardy!,URANUS,1600.0,In 1781 William Herschel discovered Uranus & initially named it after this king of England,George III
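
The word-boundary matching can be checked on a small made-up frame. The sketch below uses a hypothetical variant, `find_words_escaped`, which takes a column name instead of a Series and passes each word through `re.escape`, so that search terms containing regex metacharacters (for example `"U.S."`) are matched literally. Both the function name and the `toy` data are illustrative only.

```python
import re
import pandas as pd

def find_words_escaped(dataset, column_name, word_list):
    """Hypothetical variant of find_words that escapes regex metacharacters."""
    mask = pd.Series(True, index=dataset.index)
    for word in word_list:
        # \b anchors the match to whole words; re.escape keeps the word literal
        pattern = r'\b' + re.escape(word) + r'\b'
        mask &= dataset[column_name].str.contains(pattern, case=False, na=False)
    return dataset[mask]

# Made-up rows: only the first contains both whole words
toy = pd.DataFrame({'question': [
    'This king of England signed the Magna Carta',
    'Viking raids troubled England',   # "Viking" must not match "king"
    'The King of Pop',
    None,                              # missing text counts as no match
]})

result = find_words_escaped(toy, 'question', ['King', 'England'])
print(result.index.tolist())  # [0]
```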


2.2. Now let's write a function that returns, for each unique answer in a dataset, the number of questions that have that answer.

In [12]:
def find_uniq_answ(dataset):
    return dataset.groupby('answer').answer.count().reset_index(name = 'count')
    
find_uniq_answ(jeopardy)

Unnamed: 0,answer,count
0,Hamlet,1
1,Les Miserables,1
2,Nosferatu,1
3,She Loves You,1
4,Sleepless in Seattle,1
...,...,...
88262,étoufée,2
88263,études,1
88264,été,1
88265,über,1
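
A similar table can be built with `Series.value_counts`, which also sorts the counts in descending order. This is a sketch on a made-up `toy` frame, not a replacement for the function above. The `rename_axis` call is there so that the resulting column is named `answer` regardless of pandas version.

```python
import pandas as pd

# Made-up answers, including a missing value
toy = pd.DataFrame({'answer': ['James I', 'Henry V', 'James I', None]})

counts = (
    toy['answer']
    .value_counts()            # counts non-null values, most frequent first
    .rename_axis('answer')     # name the index so reset_index yields 'answer'
    .reset_index(name='count')
)

print(counts)
```

Like the `groupby` version, this skips missing answers by default.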


2.3. Now we are going to count the unique answers that remain after filtering the dataset by "King" and "England". Let's sort them in descending order to see the most common ones.

In [13]:
find_uniq_answ(questions).sort_values('count', ascending = False).head(25)

Unnamed: 0,answer,count
78,William the Conqueror,5
42,James I,3
66,Richard the Lionhearted,3
57,Oliver Cromwell,3
28,George I,3
39,Henry VIII,3
29,George III,3
31,Georgia,2
65,Richard the Lionheart,2
64,Richard III,2


## Conclusion

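We've investigated and lightly cleaned the "Jeopardy" dataset and written some custom functions that use aggregate methods to find some interesting insights: for example, **"William the Conqueror"** is, apparently, the most popular monarch to ask questions about. Note that spelling variants are counted separately, so "Richard the Lionhearted" (3) and "Richard the Lionheart" (2) would tie with him at 5 if they were combined.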