Basics

Parse, clean, and organize the Jeopardy! question data file to train a Naive Bayesian classifier.

Just as we built a classifier above, your aim here is to make sense of the data presented and create a binary classifier for questions ("high value" and "low value," based on the points available for each). Despite the large number of questions, this is an extraordinarily difficult classification problem. Consider it as a human coder: how often could you tell "easy" questions from "hard" ones? How well you succeed depends largely on your own contextual knowledge; indeed, you might be tempted to classify questions you know the answer to as "easy" and those you do not as "hard." The computer doesn't know the answers to any of these.

For that reason, do not be discouraged if your classifier does not perform well. This constitutes an especially difficult problem for a simple classifier to solve.

Put the script and its output (which may merely report the accuracy of the trial) in your GitHub repository, and share the link and filenames when you start your quiz.

In [154]:
import pandas as pd
import json
from sklearn.naive_bayes import MultinomialNB
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

In [155]:
with open('Desktop/jeopardy.json','r') as jeopardy:
    data = json.load(jeopardy)

In [156]:
data

[{'category': 'HISTORY',
  'air_date': '2004-12-31',
  'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
  'value': '$200',
  'answer': 'Copernicus',
  'round': 'Jeopardy!',
  'show_number': '4680'},
 {'category': "ESPN's TOP 10 ALL-TIME ATHLETES",
  'air_date': '2004-12-31',
  'question': "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'",
  'value': '$200',
  'answer': 'Jim Thorpe',
  'round': 'Jeopardy!',
  'show_number': '4680'},
 {'category': 'EVERYBODY TALKS ABOUT IT...',
  'air_date': '2004-12-31',
  'question': "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'",
  'value': '$200',
  'answer': 'Arizona',
  'round': 'Jeopardy!',
  'show_number': '4680'},
 {'category': 'THE COMPANY LINE',
  'air_date': '2004-12-31',
  'question': '\'In 1963, live on "The Art Linkletter Show", this company served its billionth burger\

In [157]:
jeopardy = pd.DataFrame(data)
jeopardy.head()

Unnamed: 0,category,air_date,question,value,answer,round,show_number
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680


Let's do some exploration...

In [158]:
jeopardy['value'].unique()

array(['$200', '$400', '$600', '$800', '$2,000', '$1000', '$1200',
       '$1600', '$2000', '$3,200', None, '$5,000', '$100', '$300', '$500',
       '$1,000', '$1,500', '$1,200', '$4,800', '$1,800', '$1,100',
       '$2,200', '$3,400', '$3,000', '$4,000', '$1,600', '$6,800',
       '$1,900', '$3,100', '$700', '$1,400', '$2,800', '$8,000', '$6,000',
       '$2,400', '$12,000', '$3,800', '$2,500', '$6,200', '$10,000',
       '$7,000', '$1,492', '$7,400', '$1,300', '$7,200', '$2,600',
       '$3,300', '$5,400', '$4,500', '$2,100', '$900', '$3,600', '$2,127',
       '$367', '$4,400', '$3,500', '$2,900', '$3,900', '$4,100', '$4,600',
       '$10,800', '$2,300', '$5,600', '$1,111', '$8,200', '$5,800',
       '$750', '$7,500', '$1,700', '$9,000', '$6,100', '$1,020', '$4,700',
       '$2,021', '$5,200', '$3,389', '$4,200', '$5', '$2,001', '$1,263',
       '$4,637', '$3,201', '$6,600', '$3,700', '$2,990', '$5,500',
       '$14,000', '$2,700', '$6,400', '$350', '$8,600', '$6,300', '$250',
      

In [159]:
len(jeopardy['value'].unique())

150

In [160]:
jeopardy['category'].unique()

array(['HISTORY', "ESPN's TOP 10 ALL-TIME ATHLETES",
       'EVERYBODY TALKS ABOUT IT...', ..., 'OFF-BROADWAY',
       'RIDDLE ME THIS', 'AUTHORS IN THEIR YOUTH'], dtype=object)

In [161]:
print(jeopardy['air_date'].max())
print(jeopardy['air_date'].min())

2012-01-27
1984-09-10


Now let's try applying a basic naive Bayes model

In [162]:
jeopardy['value'] = jeopardy['value'].str.strip('$')
jeopardy['value'] = pd.to_numeric(jeopardy['value'].str.replace(',', ''))
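
To see what the two cleaning lines above do, here is a small sketch on a made-up Series (the name `raw` and its three entries are assumptions for illustration, not rows checked against the file). It should show the dollar sign and thousands separator being removed and a missing value surviving as NaN rather than raising an error.

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'value' strings, including a missing entry
raw = pd.Series(['$200', '$2,000', None])

# Same two steps as the cell above: drop the '$', drop the ',', then convert
stripped = raw.str.strip('$')
values = pd.to_numeric(stripped.str.replace(',', '', regex=False))

print(values)
```

Because missing entries become NaN, any later range test on the column will simply return False for them.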

In [163]:
print(max(jeopardy['value'].dropna()))
print(min(jeopardy['value'].dropna()))
(max(jeopardy['value'].dropna()) + min(jeopardy['value'].dropna()))/2

18000.0
5.0


9002.5

In [164]:
stop = set(stopwords.words('english') + list(punctuation))

In [170]:
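# Overwrite 'category' with the target label: 'low' for 0 to 800, 'high' for above 800 up to 1801
# Rows with a missing value or a value above 1801 match neither range and keep their original category text
jeopardy.loc[jeopardy['value'].between(0, 800, 'both'), 'category'] = 'low'
jeopardy.loc[jeopardy['value'].between(800, 1801, 'right'), 'category'] = 'high'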
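
The two assignments above write the labels into `category`, so any row that matches neither range (a missing value, or one above 1801) keeps its original category text and becomes an extra class. One possible alternative, sketched here on toy data, is to write the labels into a separate column and drop the rows that received none. The column name `label` and the toy rows are assumptions for illustration; this is not what the notebook ran.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; values are made up
toy = pd.DataFrame({
    'category': ['HISTORY', 'SCIENCE', 'OPERA', 'RIVERS'],
    'value': [200.0, 1000.0, 5000.0, None],
})

# Hypothetical 'label' column, so the original 'category' text is left intact
toy['label'] = None
toy.loc[toy['value'].between(0, 800, inclusive='both'), 'label'] = 'low'
toy.loc[toy['value'].between(800, 1801, inclusive='right'), 'label'] = 'high'

# Keep only rows that received one of the two labels
labelled = toy.dropna(subset=['label'])
print(labelled)
```

With this layout the classifier would only ever see two classes, at the cost of discarding the unlabelled rows.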

In [166]:
# train_test_split returns, in order: X train, X test, y train, y test
train_x, test_x, train_y, test_y = train_test_split(jeopardy['question'], jeopardy['category'], random_state=1)

In [173]:
train_x

80347          'Cereal heiress Marjorie Post's middle name'
211766    'Nicholas Meyer's "The West End Horror" takes ...
154215    'He wrote, "At night we ride through mansions ...
162472    'On January 20, 1981 the U.S. released about $...
7460      'It was nice to see this musical "Looking Swel...
                                ...                        
109259                'Someone who does church work abroad'
50057     'One of Gabriel Garcia Marquez' best-known wor...
5192      'This Neapolitan tenor made his last public ap...
208780    'Signed February 2, 1848, this treaty ended th...
128037    'To end his marriage in 1877, this Russian com...
Name: question, Length: 162697, dtype: object

In [174]:
test_y

180575    low
85360     low
133653    low
4637      low
23868     low
         ... 
192345    low
44587     low
23307     low
194290    low
110226    low
Name: category, Length: 54233, dtype: object

In [175]:
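tfidf_v = TfidfVectorizer(use_idf = True)
# Learn the vocabulary and IDF weights from the training questions only
train_x_tf = tfidf_v.fit_transform(train_x)
# Reuse the fitted vocabulary for the test questions; refitting here would produce mismatched columns
test_x_tf = tfidf_v.transform(test_x)
print(train_x_tf.toarray())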

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
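
The printed matrix is almost entirely zeros, which is expected for TF-IDF on short texts. What matters for the next step is that the training and test matrices share the same columns, which means fitting the vectorizer once and only transforming the test split. A minimal sketch on made-up sentences (`train_docs` and `test_docs` are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the question text
train_docs = ['this river flows through paris', 'this composer wrote the opera']
test_docs = ['this river flows through an opera house']

vec = TfidfVectorizer()
train_tf = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_tf = vec.transform(test_docs)        # reuses them; unseen words are ignored

# Checking shapes is cheaper than densifying a large sparse matrix
print(train_tf.shape, test_tf.shape)
```

Both shapes should report the same number of columns, since the test matrix is built from the training vocabulary.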


Let's try splitting the data differently because that didn't work

In [172]:
print(len(jeopardy[jeopardy['category'] == 'low']))
print(len(jeopardy[jeopardy['category'] == 'high']))

135669
77579
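
The two counts printed above also give a rough yardstick for any accuracy figure: a model that always guesses the larger class would be right on that class's share of the labelled rows. A quick sketch of the arithmetic, using only the two counts from the output above (rows that received neither label are ignored here, which is a simplifying assumption):

```python
# Counts copied from the output above
low_count = 135669
high_count = 77579

# Share of labelled rows belonging to the larger class
majority_share = max(low_count, high_count) / (low_count + high_count)
print(round(majority_share, 3))
```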


In [177]:
train = jeopardy.sample(frac=0.8, random_state=1)
test = jeopardy.drop(train.index)

In [178]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train['question'])
y_train = train['category']

In [179]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

MultinomialNB()

In [180]:
X_test = vectorizer.transform(test['question'])
y_test = test['category']
accuracy = classifier.score(X_test, y_test)
print('Accuracy:', accuracy)

Accuracy: 0.6101737887797907
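
The vectorize-then-fit steps above can also be bundled with `make_pipeline`, which is imported at the top but not used. This is a sketch on made-up questions and labels (`questions`, `labels`, and the test sentence are assumptions for illustration), not a rerun of the notebook, so it reports no accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training examples; labels are arbitrary
questions = [
    'this state is home to the city of yuma',
    'this man proposed the heliocentric theory',
    'this russian composer wrote the nutcracker',
    'this treaty ended the mexican war',
]
labels = ['low', 'low', 'high', 'high']

# The pipeline fits the vectorizer and the classifier together,
# and applies the same fitted vocabulary at prediction time
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(questions, labels)

predictions = model.predict(['this composer wrote an opera'])
print(predictions)
```

Keeping both steps in one object makes it harder to refit the vectorizer on the test data by accident.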


https://towardsdatascience.com/text-classification-using-naive-bayes-theory-a-working-example-2ef4b7eb7d5a