Michael O'Donnell - 02/07/2024

STC 510 - Module Basics Assignment 

The purpose of this program is to find out whether the Naive Bayes algorithm can classify a Jeopardy question as easy or hard with reasonable accuracy. To do so, the program imports a JSON file that contains data from the television show Jeopardy. Once that data set is imported, the program puts the JSON data into a dataframe for manipulation. The program then cleans the value column and converts the values to integers, because it needs to compare numbers, not strings.

Since judging whether a game show question is hard is quite subjective, I Googled "what makes a question in Jeopardy hard." Doing so yielded this Medium article:

https://medium.com/@pollockcolin/jeopardy-question-difficulty-1bba47770bc6

In it, the author discusses Jeopardy and machine learning. From the article I was able to see where people on a Jeopardy site started getting more questions wrong than right. The author shows this in a graph broken down by Jeopardy round and dollar amount. This is where my own bias came in: I set the cutoff at 50% correct and considered any question beyond that point hard. Although this includes my own bias, I believe it avoids the much larger bias I could have introduced by trying to code the questions based on themes or something of that nature.

So my logic for the cutoff was this: a question is considered easy IF the round is 'Jeopardy!' AND the 'value' is 800 or less, OR IF the round is 'Double Jeopardy!' AND the 'value' is 1200 or less. This way, questions from Final Jeopardy or a Tiebreaker are considered hard, since the show aims to make those harder. Following this line of reasoning along with the graphs the author provided, I was able to get the Naive Bayes classifier to match my easy/hard labels, which are based on the author's data, 79% of the time. That is considerably more accurate than the 50/50 chance of guessing whether a question will be easy or hard, so I am content with this.
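
The cutoff rule can be sketched as a small standalone function. This is only an illustration of the logic described above, not the code the program runs: the name `is_easy`, the round string 'Final Jeopardy!' and the handling of a missing value are assumptions made for the example.

```python
def is_easy(round_name, value):
    """Return 1 for an 'easy' question and 0 for a 'hard' one, per the cutoff rule."""
    if value is None:
        # Assumption: a question without a dollar value counts as hard
        return 0
    if round_name == 'Jeopardy!' and value <= 800:
        return 1
    if round_name == 'Double Jeopardy!' and value <= 1200:
        return 1
    # Every other round (or a higher value) is treated as hard
    return 0

labels = [
    is_easy('Jeopardy!', 800),
    is_easy('Jeopardy!', 1000),
    is_easy('Double Jeopardy!', 1200),
    is_easy('Double Jeopardy!', 1600),
    is_easy('Final Jeopardy!', None),  # assumed round name, no dollar value
]
print(labels)  # [1, 0, 1, 0, 0]
```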

I also used this source for importing the JSON file:

https://www.freecodecamp.org/news/loading-a-json-file-in-python-how-to-read-and-parse-json/

--------------------------

In [49]:
#import statements
import json
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import nltk
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import pandas as pd
from string import punctuation
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

In [50]:
#setting the stop words
english_stopwords = set(stopwords.words('english') + list(punctuation) + ['..','...','nbsp','n\'t'])

In [51]:
#initializing the lemmatizer
lemmatizer = WordNetLemmatizer()

In [52]:
#bringing in the jeopardy JSON file
with open('jeopardy.json') as jeopardy_file:
    jeopardy_contents = jeopardy_file.read()

In [53]:
#parsing the jeopardy JSON file so it can be read by Python and put into a dataframe
parsed_jeopardy = json.loads(jeopardy_contents)
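
As a side note, reading and parsing can also be done in one step with `json.load`, which takes a file object directly. The sketch below uses an in-memory file with two made-up records in place of `jeopardy.json`; the sample records are hypothetical, and only the field names `round`, `value` and `question` come from this program.

```python
import io
import json

# Hypothetical two-record sample standing in for jeopardy.json
sample_file = io.StringIO(
    '[{"round": "Jeopardy!", "value": "$200", "question": "Sample clue one"},'
    ' {"round": "Double Jeopardy!", "value": "$1,200", "question": "Sample clue two"}]'
)

# json.load reads and parses a file object in a single call
records = json.load(sample_file)

print(len(records))         # 2
print(records[0]['value'])  # $200
```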

In [54]:
#puts the JSON questions into a dataframe
df = pd.DataFrame(parsed_jeopardy)

In [55]:
#Cleans the value column and sets the values to integers
df['value'] = df['value'].str.replace('$', '', regex=False)
df['value'] = df['value'].str.replace(',', '', regex=False)
df['value'] = df['value'].str.strip()
df['value'] = pd.to_numeric(df['value'])
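
One detail worth keeping in mind when removing the dollar sign: in a regular expression, `$` is an end-of-string anchor rather than a literal character, and whether pandas treats the pattern as a regular expression can depend on the pandas version. The plain-Python sketch below shows the difference on a single string. The helper `parse_value` is hypothetical, and treating a missing value as `None` is an assumption for the example.

```python
import re

def parse_value(raw):
    """Turn a string such as '$1,200' into the integer 1200; pass None through."""
    if raw is None:
        # Assumption: some questions may have no dollar value at all
        return None
    cleaned = raw.replace('$', '').replace(',', '').strip()
    return int(cleaned)

# As a regular expression, '$' only matches the end of the string,
# so the dollar sign itself is left in place
as_regex = re.sub('$', '', '$200')

# A literal replacement removes the character
as_literal = '$200'.replace('$', '')

print(as_regex)                # $200
print(as_literal)              # 200
print(parse_value('$1,200'))   # 1200
```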

In [56]:
#Adds a new column that uses the logic explained above, giving easy questions a 1 and hard questions a 0
easy_first_round = (df['round'] == 'Jeopardy!') & (df['value'] <= 800)
easy_second_round = (df['round'] == 'Double Jeopardy!') & (df['value'] <= 1200)
df['difficulty'] = np.where(easy_first_round | easy_second_round, 1, 0)

In [57]:
#This function takes a cell from a dataframe, makes the text lowercase, tokenizes it,
# removes stop words, lemmatizes the remaining words and joins them back into a single string
def manipulate_text(text):
    text = text.lower()
    toke_text = word_tokenize(text)
    #keep only the tokens that are not stop words or punctuation
    wordlist = [word for word in toke_text if word not in english_stopwords]
    #lemmatize each remaining word
    wordlist2 = [lemmatizer.lemmatize(eachword) for eachword in wordlist]
    #join the words with spaces so the cell holds a plain string
    return ' '.join(wordlist2)
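
To see the cleaning steps on one sentence without needing the NLTK data files, here is a simplified, dependency-free sketch. It is a stand-in, not the function above: a regular expression replaces `word_tokenize`, the stop-word set `SAMPLE_STOPWORDS` is a tiny hypothetical list, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list
SAMPLE_STOPWORDS = {'the', 'of', 'is', 'a', 'this'}

def simple_clean(text):
    """Lowercase, split into word tokens, drop stop words, and rejoin with spaces."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = [token for token in tokens if token not in SAMPLE_STOPWORDS]
    return ' '.join(kept)

cleaned = simple_clean('This city is the capital of France')
print(cleaned)  # city capital france
```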

In [58]:
#This runs the question column that I want to use for training through the above function
df['question'] = df['question'].apply(manipulate_text)

In [59]:
#This is taken from the lecture, but it splits up the question and difficulty into training and testing sets
#so the program can see how accurate the algorithm is at classifying the text
X_train, X_test, Y_train, Y_test = train_test_split(df.question, df.difficulty, random_state=1)

In [60]:
#This vectorizes the question text: the vectorizer is fit on the training set and then applied to the test set
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
X_train_tf = tfidf_vectorizer.fit_transform(X_train)
X_test_tf = tfidf_vectorizer.transform(X_test)

In [61]:
#This trains the Naive Bayes model on the training data and predicts the difficulty of the test questions
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_tf, Y_train)
predictions = naive_bayes.predict(X_test_tf)
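
The vectorizing and training steps can also be chained with scikit-learn's `make_pipeline`, so that one object handles both. The sketch below is an alternative arrangement of the same two components on made-up data, not the code used for the result in this notebook. The questions and labels are invented for illustration, and the variable names `toy_model` and `toy_predictions` are chosen so they do not clash with the notebook's own variables.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up questions and labels (1 = easy, 0 = hard), for illustration only
toy_questions = [
    'capital france',
    'largest planet solar system',
    'author obscure 14th century treatise',
    'element atomic number 76',
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together on the training text
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_questions, toy_labels)

# New text goes through the same fitted vectorizer before it is classified
toy_predictions = toy_model.predict(['capital italy', 'author medieval treatise'])
print(toy_predictions)
```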

In [62]:
print('Accuracy: ', accuracy_score(Y_test, predictions))

Accuracy:  0.7932992827245404
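
The 50/50 comparison assumes easy and hard questions are about equally common. One way to check that assumption is to compare the accuracy above with the accuracy of always guessing the most common label. The sketch below shows that calculation on made-up labels; the function name `majority_baseline_accuracy` and the sample list are hypothetical, and the printed number says nothing about the real data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up labels: 7 easy (1) and 3 hard (0); not the notebook's data
sample_labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
baseline = majority_baseline_accuracy(sample_labels)
print(baseline)  # 0.7
```

In the notebook, the same function could be called as `majority_baseline_accuracy(list(Y_test))` to get the figure for the actual test set.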
