## Exploring data from Jeopardy

First, load the data and strip the spaces from the column names:

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated
df = pd.read_csv('jeopardy.csv')
df.columns = df.columns.str.replace(' ', '')
print(df.head())


   ShowNumber     AirDate      Round                         Category Value  \
0  4680        2004-12-31  Jeopardy!  HISTORY                          $200   
1  4680        2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2  4680        2004-12-31  Jeopardy!  EVERYBODY TALKS ABOUT IT...      $200   
3  4680        2004-12-31  Jeopardy!  THE COMPANY LINE                 $200   
4  4680        2004-12-31  Jeopardy!  EPITAPHS & TRIBUTES              $200   

                                                                                                      Question  \
0  For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory              
1  No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves   
2  The city of Yuma in this state has a record average of 4,055 hours of sunshine each year                      
3  In 1963, live on "The Art Linkletter Show", this company served it

In [2]:
# Strip the dollar sign and thousands separators, then convert to numbers.
# errors='coerce' turns anything that cannot be parsed into NaN.
df['ValueFloat'] = df['Value'].str.replace(r'[$,]', '', regex=True)
df['ValueFloat'] = pd.to_numeric(df['ValueFloat'], errors='coerce')
print(df['ValueFloat'].head())

0    200.0
1    200.0
2    200.0
3    200.0
4    200.0
Name: ValueFloat, dtype: float64


Filter questions about presidents of the United States of America:

In [29]:
president = df[df['Question'].str.contains('president|President', na=False)]
president = president[president['Question'].str.contains('USA|America', na=False)]
president = president.reset_index(drop=True)
print(president.head())
print(len(president))

   ShowNumber     AirDate             Round                Category   Value  \
0  3447        1999-09-07  Double Jeopardy!  THE CIVIL WAR           $1000    
1  6135        2011-04-22  Double Jeopardy!  IT'S A COUP D'ETAT      $2,000   
2  4935        2006-02-10  Jeopardy!         FAMOUS NAMES            $600     
3  3139        1998-04-02  Double Jeopardy!  PRESIDENTIAL CAMPAIGNS  $800     
4  5760        2009-10-02  Jeopardy!         A WORLD OF MEMOIRS      $600     

                                                                                                                                                                                                                                                                                                                                                Question  \
0  In February 1861 6 Southern states founded the Confederate States of America & elected him president                                                                          
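
The two filters can also be written as boolean masks combined with `&`. The sketch below is a minimal illustration on a toy DataFrame whose rows are invented for this example (they are not taken from jeopardy.csv). It uses `case=False` in place of the `'president|President'` alternation, which is an assumption and would also match other capitalizations. It also shows that the substring pattern `'USA|America'` still matches "Confederate States of America", as in the first row of the output above.

```python
import pandas as pd

# Toy rows invented for illustration; they are not taken from jeopardy.csv.
toy = pd.DataFrame({
    'Question': [
        'This President of the USA bought Louisiana',
        'The Confederate States of America elected him president',
        'He was president of France in 1960',
        None,
    ]
})

# Build each condition as a boolean mask; na=False treats missing text as "no match".
has_president = toy['Question'].str.contains('president', case=False, na=False)
has_country = toy['Question'].str.contains('USA|America', na=False)

# Combine the masks with & instead of filtering twice.
toy_president = toy[has_president & has_country].reset_index(drop=True)
print(toy_president)
```

Whether rows such as the Confederate one belong in the result depends on the question being asked; excluding them would need a stricter pattern or a manual review.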

Compare the average difficulty of questions about American presidents with that of all questions, using the dollar value of a question as a proxy for its difficulty:

In [12]:
difficulty_president = president.ValueFloat.mean()
difficulty = df.ValueFloat.mean()
print(difficulty)
print(difficulty_president)

752.5959230365314
877.3109243697479
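
A mean on its own says little about how many questions it is based on or how skewed the values are. As a minimal sketch, the toy example below (invented values, not from jeopardy.csv) reports the count, mean, and median side by side. Unlike the cell above, it compares the president questions with the *remaining* questions rather than with all questions; the `AboutPresident` column name is hypothetical.

```python
import pandas as pd

# Toy data invented for illustration; the real analysis would use df.
toy = pd.DataFrame({
    'Question': ['president A', 'river B', 'president C', 'city D', 'planet E'],
    'ValueFloat': [800.0, 200.0, 1000.0, 400.0, None],
})

# Hypothetical flag column marking the questions of interest.
toy['AboutPresident'] = toy['Question'].str.contains('president', na=False)

# count ignores missing values, so it shows how many values each mean rests on.
summary = toy.groupby('AboutPresident')['ValueFloat'].agg(['count', 'mean', 'median'])
print(summary)
```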


Check how many answers appear more than once:

In [31]:
duplicates = president.duplicated(subset = 'Answer')
print(duplicates.value_counts())

False    95
True     25
dtype: int64


In [36]:
dups = president.pivot_table(index = ['Answer'], aggfunc ='size')
dups = pd.DataFrame({'Answer':dups.index, 'Occurances':dups.values})
print(dups.head())
print(dups[dups.Occurances > 1])

                Answer  Occurances
0  "Hail to the Chief"  1         
1  (Gilbert) Stuart     1         
2  (Woodrow) Wilson     1         
3  Abraham Lincoln      2         
4  AmeriCorps           1         
                Answer  Occurances
3   Abraham Lincoln     2         
17  Colombia            2         
32  George Washington   4         
36  Honduras            2         
48  Lyndon Johnson      2         
49  Martin Van Buren    2         
51  McKinley            2         
52  Mexico              3         
53  Michael Douglas     2         
58  Nicaragua           2         
65  Richard M. Nixon    2         
66  Ronald Reagan       2         
73  Susan B. Anthony    3         
75  Teddy Roosevelt     4         
76  Theodore Roosevelt  3         
78  Thomas Jefferson    3         
87  William McKinley    2         


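Going by the raw answer strings, the two most common answers are 'George Washington' and 'Teddy Roosevelt', with four occurrences each. However, the same person can appear under several spellings: 'Teddy Roosevelt' (4) and 'Theodore Roosevelt' (3) together make Roosevelt the most common answer with seven occurrences, and 'McKinley' (2) plus 'William McKinley' (2) also add up to four.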
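
Spelling variants of one name can be merged before counting. The following is a minimal sketch, using counts copied from the output above and a hypothetical alias map; which spellings to merge is an assumption made for this example, not part of the original analysis.

```python
import pandas as pd

# Counts copied from the printed output above for five of the answers.
counts = pd.Series({
    'George Washington': 4,
    'Teddy Roosevelt': 4,
    'Theodore Roosevelt': 3,
    'McKinley': 2,
    'William McKinley': 2,
})

# Hypothetical alias map: variant spelling -> canonical spelling.
aliases = {
    'Teddy Roosevelt': 'Theodore Roosevelt',
    'McKinley': 'William McKinley',
}

# Relabel the variants, then add up the counts that now share a label.
canonical = counts.rename(index=aliases)
merged = canonical.groupby(level=0).sum().sort_values(ascending=False)
print(merged)
```

On the full data, the same map could be applied to the `Answer` column with `president['Answer'].replace(aliases)` before building the pivot table.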

Now let's check whether the questions reflect the technological advances of their time, by counting the questions that contain the word "computer" before and after the start of the year 2000:

In [46]:
df.AirDate = pd.to_datetime(df.AirDate)
latest = df.AirDate.max()
earliest = df.AirDate.min()
print(latest)
print(earliest)

2012-01-27 00:00:00
1984-09-10 00:00:00


In [56]:
import datetime

two_thousand = datetime.datetime(2000, 1, 1)

computer = df[df.Question.str.contains('omputer', na=False)]
print(len(computer))

computer_pre2000 = computer[computer.AirDate < two_thousand]
print(len(computer_pre2000))
computer_post2000 = computer[computer.AirDate >= two_thousand]
print(len(computer_post2000))

431
104
327


Approximately 76% of the computer questions (327 of 431) aired in or after the year 2000.
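
Raw counts can mislead if the dataset holds more questions from one period than from the other. A minimal sketch of a normalized comparison follows: it computes the *share* of questions matching a pattern within each period. The helper name `keyword_share` and the toy rows are invented for illustration; on the real data, `frame` would be `df` after the `AirDate` conversion above.

```python
import pandas as pd

def keyword_share(frame, pattern, cutoff):
    """Return the fraction of questions matching `pattern`, before and from `cutoff`."""
    matches = frame['Question'].str.contains(pattern, na=False)
    # Label each row by period; 'from' includes the cutoff date itself.
    period = frame['AirDate'].ge(cutoff).map({False: 'before', True: 'from'})
    # The mean of a boolean Series is the fraction of True values.
    return matches.groupby(period).mean()

# Toy rows invented for illustration.
toy = pd.DataFrame({
    'AirDate': pd.to_datetime([
        '1990-01-01', '1995-06-01', '1998-03-03', '1999-12-31',
        '2000-01-01', '2005-05-05',
    ]),
    'Question': ['a computer', 'a river', 'a poet', 'a king',
                 'Computer chips', 'a lake'],
})

shares = keyword_share(toy, 'omputer', pd.Timestamp(2000, 1, 1))
print(shares)
```

In the toy data, each period contains one matching question, yet the shares differ (1 of 4 versus 1 of 2) because the periods contain different numbers of questions.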

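Now let's check how specific categories are distributed across the rounds, as a rough indicator of the difficulty assigned to them. First, the number of questions per round: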

In [57]:
categories = df.pivot_table(index = ['Round'], aggfunc ='size')
print(categories)

Round
Double Jeopardy!    105912
Final Jeopardy!     3631  
Jeopardy!           107384
Tiebreaker          3     
dtype: int64


In [63]:
economics = df[df.Category == 'ECONOMICS']
economics_categories = economics.pivot_table(index = ['Round'], aggfunc ='size')
print(economics_categories)

Round
Double Jeopardy!    49
Final Jeopardy!     1 
Jeopardy!           15
dtype: int64


For example, the category 'ECONOMICS' appears mostly in the second round, 'Double Jeopardy!' (49 of 65 questions, about 75%), which is considered to be of intermediate difficulty. It appears less often in the first round (15 questions) and only once in the final round.

In [64]:
history = df[df.Category == 'HISTORY']
history_categories = history.pivot_table(index = ['Round'], aggfunc = 'size')
print(history_categories)

Round
Double Jeopardy!    194
Jeopardy!           155
dtype: int64


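The category 'HISTORY' appears more than five times as often as 'ECONOMICS' (349 questions versus 65), but only in the first two rounds, and it is split more evenly between them (155 and 194 questions).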
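
Several categories can be compared in a single table with `pd.crosstab`, and `normalize='index'` turns the counts into per-category shares. The sketch below uses a toy DataFrame with invented rows, so its numbers do not match the real counts above; on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy rows invented for illustration; the shares differ from the real data.
toy = pd.DataFrame({
    'Category': ['HISTORY', 'HISTORY', 'HISTORY',
                 'ECONOMICS', 'ECONOMICS', 'ECONOMICS', 'ECONOMICS',
                 'SCIENCE'],
    'Round': ['Jeopardy!', 'Jeopardy!', 'Double Jeopardy!',
              'Double Jeopardy!', 'Double Jeopardy!', 'Double Jeopardy!', 'Jeopardy!',
              'Jeopardy!'],
})

# Keep only the categories being compared.
selected = toy[toy['Category'].isin(['ECONOMICS', 'HISTORY'])]

# Rows are categories, columns are rounds; each row sums to 1.
shares = pd.crosstab(selected['Category'], selected['Round'], normalize='index')
print(shares)
```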