# Jeopardy Analysis

## by Justin Sierchio

In this analysis, we will be looking at the game-show Jeopardy! Ideally, we would like to be able to answer the following questions:

<ul>
    <li>What topics are most likely covered?</li>
    <li>What is the most common answer on Jeopardy?</li>
    <li>What other conclusions might we be able to draw from this analysis?</li>
</ul>

This data is in .csv file format and is from Kaggle at: https://www.kaggle.com/tunguz/200000-jeopardy-questions/download. More information related to the dataset can be found at: https://www.kaggle.com/tunguz/200000-jeopardy-questions.

## Notebook Initialization

In [1]:
# Import Relevant Libraries
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

print('Initial libraries loaded into workspace!')

Initial libraries loaded into workspace!


In [2]:
# Upload Datasets for Study
df_JEOPARDY = pd.read_csv("JEOPARDY_CSV.csv")

print('Datasets uploaded!')

Datasets uploaded!


In [3]:
# Display 1st 5 rows from Jeopardy dataset
df_JEOPARDY.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Before moving on, let's list how the dataset defines each of its columns (using the Kaggle definitions).

<ul>
    <li> Show Number: the production number for that episode of Jeopardy!</li>
    <li> Air Date: the date the episode aired (YYYY-MM-DD).</li>
    <li> Round: the round of Jeopardy - Jeopardy, Double Jeopardy or Final Jeopardy.</li>
    <li> Category: the question category.</li>
    <li> Value: the dollar value of the question.</li>
    <li> Question: the actual Jeopardy question.</li>
    <li> Answer: the answer to the Jeopardy question.</li>
</ul>

## Data Cleaning

Let's make sure that the data is sufficiently cleaned for analysis. We will begin by looking at some basic statistics.

In [7]:
# What are the data types in the Jeopardy dataset?
df_JEOPARDY.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1    Air Date    216930 non-null  object
 2    Round       216930 non-null  object
 3    Category    216930 non-null  object
 4    Value       216930 non-null  object
 5    Question    216930 non-null  object
 6    Answer      216928 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


So the episode numbers are stored as 64-bit integers and every other column is stored as an object (in practice, text). Let's check for any 'NaN' or 'null' values.

In [9]:
# Check dataset for 'NaN' or 'null' values
df_JEOPARDY.isnull().sum()

Show Number    0
 Air Date      0
 Round         0
 Category      0
 Value         0
 Question      0
 Answer        2
dtype: int64

Only 2 of the 216,930 rows (fewer than 0.001%) have a 'null' value, both in the 'Answer' column, so we can safely remove those rows without adversely affecting the data quality.

In [11]:
# Remove 'NULL' rows from Jeopardy Dataset
df_JEOPARDY = df_JEOPARDY.dropna()
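
To make the effect of this step concrete, here is a minimal sketch on a small stand-in frame. The `toy` DataFrame and its contents are hypothetical and only mimic the shape of the Jeopardy data. Restricting `dropna` to the `Answer` column with `subset` is one possible variation on the cell above, which drops a row if any column is null.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Jeopardy data (not the real dataset)
toy = pd.DataFrame({
    "Question": ["q1", "q2", "q3", "q4"],
    "Answer": ["a1", np.nan, "a3", None],
})

rows_before = len(toy)

# Drop only the rows whose 'Answer' value is missing
toy_clean = toy.dropna(subset=["Answer"])
rows_removed = rows_before - len(toy_clean)

print(f"Removed {rows_removed} of {rows_before} rows")
```

Comparing the row count before and after is a cheap way to confirm that only the expected number of rows was dropped.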

In [12]:
# Confirm that no 'NULL' values remain
df_JEOPARDY.isnull().sum()

Show Number    0
 Air Date      0
 Round         0
 Category      0
 Value         0
 Question      0
 Answer        0
dtype: int64

With the 'null' rows removed, the data appears to be clean enough for further analysis.
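
One optional further pass: the `info()` output above shows leading spaces in several column names (for example ` Air Date`), and `Value` is stored as text such as `$200`. If later steps need tidy names, real dates or numeric values, something like the sketch below might help. The `sample` frame is hypothetical and only shaped like the real data, and the `$1,000` entry is an assumed format, so check the actual contents of `Value` before relying on it.

```python
import pandas as pd

# Hypothetical two-row sample shaped like the Jeopardy data, including the
# leading spaces in the column names seen in the info() output
sample = pd.DataFrame({
    "Show Number": [4680, 4680],
    " Air Date": ["2004-12-31", "2004-12-31"],
    " Value": ["$200", "$1,000"],
})

# Strip stray whitespace from the column names
sample.columns = sample.columns.str.strip()

# Parse the air dates (YYYY-MM-DD) into datetime values
sample["Air Date"] = pd.to_datetime(sample["Air Date"], format="%Y-%m-%d")

# Remove '$' and ',' and convert to numbers; entries that cannot be
# parsed become NaN instead of raising an error
sample["Value"] = pd.to_numeric(
    sample["Value"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

print(sample.dtypes)
```

Using `errors="coerce"` is a deliberate choice here: any entry in `Value` that is not a dollar amount turns into `NaN` and can be inspected afterwards rather than stopping the conversion.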