
# SDS 510: Module 5 Project
## Text Analysis Essentials

Jennifer Hernandez

Import the required libraries and load the dataset
* Import the tools needed for cleaning text and training the classification models.
* Load the dataset into pandas so we can work with it.






In [1]:
import pandas as pd
import numpy as np
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_json('jeopardy.json')
df.head()

Unnamed: 0,category,air_date,question,value,answer,round,show_number
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680


Check the value counts to see how many times each dollar amount appears.


In [3]:
print(df['value'].value_counts())

value
$400       42244
$800       31860
$200       30455
$600       20377
$1000      19539
           ...  
$4,238         1
$16,400        1
$1,347         1
$2547          1
$11,200        1
Name: count, Length: 149, dtype: int64


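Clean the text and values before running the models
- convert questions to lowercase
- replace every character that is not a letter (punctuation and digits) with a space
- collapse extra whitespace
- strip `$` and `,` from values and convert them to integers
- drop rows whose value cannot be converted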

In [4]:
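def clean_question(question):
    # Lowercase so matching is case-insensitive
    question = str(question).lower()
    # Replace anything that is not a-z (punctuation, digits) with a space
    question = re.sub(r'[^a-z]', ' ', question)
    # Collapse runs of whitespace and trim the ends
    question = re.sub(r'\s+', ' ', question).strip()
    return question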

df['clean_question'] = df['question'].apply(clean_question)

In [5]:
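def clean_value(value):
    # Missing values stay missing
    if pd.isna(value):
        return None
    # Strip the dollar sign and thousands separator, e.g. '$4,238' -> '4238'
    value = str(value).replace('$','').replace(',','')
    # Anything that is still not purely digits is treated as missing
    return int(value) if value.isdigit() else None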

df['clean_value'] = df['value'].apply(clean_value)
df = df.dropna(subset=['clean_value'])
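
As a quick sanity check, the two cleaning helpers can be tried on a few sample inputs. This is a minimal sketch: the functions are repeated so the snippet runs on its own, and the sample strings are illustrative inputs chosen for this example, not rows verified against the dataset.

```python
import re
import pandas as pd

def clean_question(question):
    question = str(question).lower()
    question = re.sub(r'[^a-z]', ' ', question)
    return re.sub(r'\s+', ' ', question).strip()

def clean_value(value):
    if pd.isna(value):
        return None
    value = str(value).replace('$', '').replace(',', '')
    return int(value) if value.isdigit() else None

# Illustrative inputs (assumed for this sketch)
sample_question = "'In 1963, live on \"The Art Linkletter Show\"'"
cleaned = clean_question(sample_question)
print(cleaned)  # quotes, commas, and the year are all dropped

parsed = [clean_value(v) for v in ['$200', '$4,238', '$2547', 'None', None]]
print(parsed)   # unparseable or missing values come back as None
```

Note that the `[^a-z]` pattern removes digits as well as punctuation, so years such as 1963 do not survive into the cleaned text.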

Calculate the median and add a label for high and low values

* value > 600 (the median): label 1, high value
* value <= 600 (the median): label 0, low value

In [6]:
median_value = df['clean_value'].median()
median_value

600.0

In [7]:
df['high_low'] = (df['clean_value'] > median_value).astype(int)
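
Because the comparison is strict (`>`), rows whose value equals the median are labeled 0. The sketch below shows that behavior on a small set of hypothetical values (not the real data), along with the class balance that results.

```python
import pandas as pd

# Hypothetical toy values, not the real Jeopardy! data
values = pd.Series([200, 400, 600, 600, 800, 1000])

median_value = values.median()
high_low = (values > median_value).astype(int)

print(median_value)
print(high_low.tolist())
# Share of each class; ties at the median all land in class 0
print(high_low.value_counts(normalize=True).sort_index())
```

When many rows sit exactly at the median, the two classes will not be evenly sized, which is worth keeping in mind when reading accuracy scores.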

Split into training and test sets, then vectorize the text
- remove English stop words and include bigrams (two-word phrases) alongside single words

In [8]:
# Default split is 75% train / 25% test; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(df.clean_question, df.high_low, random_state=1)

In [9]:

tfidf_vectorizer = TfidfVectorizer(stop_words='english',ngram_range=(1,2))
X_train_tf = tfidf_vectorizer.fit_transform(X_train)
X_test_tf = tfidf_vectorizer.transform(X_test)


Run different classifiers
* Naive Bayes
* Linear SVM
* Logistic Regression


Naive Bayes


In [23]:
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_tf, y_train)
nb_predictions = naive_bayes.predict(X_test_tf)

nb_accuracy = accuracy_score(y_test, nb_predictions)


Linear SVM

In [24]:
svm = LinearSVC()
svm.fit(X_train_tf, y_train)
svm_predictions = svm.predict(X_test_tf)

svm_accuracy = accuracy_score(y_test, svm_predictions)

Logistic Regression

In [21]:
log_regression = LogisticRegression(max_iter=500)
log_regression.fit(X_train_tf, y_train)
lr_predictions = log_regression.predict(X_test_tf)

lr_accuracy = accuracy_score(y_test, lr_predictions)

Compare the accuracy of each model

In [27]:
print("Naive Bayes Accuracy:", nb_accuracy)
print("Linear SVM Accuracy:", svm_accuracy)
print("Logistic Regression Accuracy:", lr_accuracy)

Naive Bayes Accuracy: 0.5863026029555172
Linear SVM Accuracy: 0.5751444002700472
Logistic Regression Accuracy: 0.5912722226389618
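
Accuracy scores like these are easier to interpret next to a majority-class baseline, which always predicts the most common label. The sketch below shows one way to compute such a baseline with scikit-learn's `DummyClassifier`. The labels are hypothetical (a 60/40 split invented for this example), so the number it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 60% low (0), 40% high (1)
y_train_toy = np.array([0] * 60 + [1] * 40)
y_test_toy = np.array([0] * 12 + [1] * 8)

# The baseline ignores the features, so placeholder columns are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)
baseline_accuracy = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print("Baseline accuracy:", baseline_accuracy)
```

In the notebook itself, the same idea would apply to `y_train` and `y_test`; a model is only adding value to the extent that it beats this figure.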
