# Homework 5 Part 2

#### Authors: John Mazon, LeTicia Cancel, Bharani Nittala

### Assignment: Test/Training Classification Data 

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase)

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

Source of the pre-classified data used in this notebook (the UCI Nursery data set): https://archive.ics.uci.edu/ml/datasets/Nursery

In [77]:
# libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

In [55]:
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/nursery/nursery.data", sep=",", header=None, names=["parents","has_nurs","form","children","housing","finance","social","health","class"])

In [56]:
data.head()

| | parents | has_nurs | form | children | housing | finance | social | health | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | usual | proper | complete | 1 | convenient | convenient | nonprob | recommended | recommend |
| 1 | usual | proper | complete | 1 | convenient | convenient | nonprob | priority | priority |
| 2 | usual | proper | complete | 1 | convenient | convenient | nonprob | not_recom | not_recom |
| 3 | usual | proper | complete | 1 | convenient | convenient | slightly_prob | recommended | recommend |
| 4 | usual | proper | complete | 1 | convenient | convenient | slightly_prob | priority | priority |


Convert the categorical feature values to integer codes. The `class` column is the prediction target and keeps its text labels.

In [57]:
# Ordinal codes for each categorical feature; the target column "class" is left as text
encodings = {
    'parents':  {'usual': 1, 'pretentious': 2, 'great_pret': 3},
    'has_nurs': {'proper': 1, 'less_proper': 2, 'improper': 3, 'critical': 4, 'very_crit': 5},
    'form':     {'complete': 1, 'completed': 2, 'incomplete': 3, 'foster': 4},
    'children': {'more': 4},
    'housing':  {'convenient': 1, 'less_conv': 2, 'critical': 3},
    'finance':  {'convenient': 1, 'inconv': 2},
    'social':   {'nonprob': 1, 'slightly_prob': 2, 'problematic': 3},
    'health':   {'recommended': 1, 'priority': 2, 'not_recom': 3},
}
for column, mapping in encodings.items():
    for label, code in mapping.items():
        # .loc assigns on the frame itself and avoids chained assignment (data.col[mask] = value)
        data.loc[data[column] == label, column] = code
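
An alternative way to express this kind of recoding is a lookup table per column combined with `Series.map`. The sketch below is not the notebook's method; it runs on a small hand-made frame (`toy`, a hypothetical stand-in for the Nursery data) and covers only two of the eight columns, so treat it as an illustration of the pattern rather than a drop-in replacement.

```python
import pandas as pd

# Hypothetical three-row sample that reuses labels from the Nursery data
toy = pd.DataFrame({
    "parents": ["usual", "pretentious", "great_pret"],
    "children": ["1", "3", "more"],
})

# One lookup table per column. A label missing from its table maps to NaN,
# so the astype(int) call below fails loudly instead of hiding a typo.
codes = {
    "parents": {"usual": 1, "pretentious": 2, "great_pret": 3},
    "children": {"1": 1, "2": 2, "3": 3, "more": 4},
}

encoded = toy.copy()
for column, mapping in codes.items():
    encoded[column] = toy[column].map(mapping).astype(int)

print(encoded)
print(encoded.dtypes)
```

Unlike in-place assignment, this version yields true integer columns, which changes what `describe()` reports (numeric summaries instead of `unique`/`top`/`freq`).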

In [58]:
data.shape

(12960, 9)

In [59]:
data.describe()

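| | parents | has_nurs | form | children | housing | finance | social | health | class |
|---|---|---|---|---|---|---|---|---|---|
| count | 12960 | 12960 | 12960 | 12960 | 12960 | 12960 | 12960 | 12960 | 12960 |
| unique | 3 | 5 | 4 | 4 | 3 | 2 | 3 | 3 | 5 |
| top | 3 | 5 | 4 | 1 | 3 | 2 | 3 | 3 | not_recom |
| freq | 4320 | 2592 | 3240 | 3240 | 4320 | 6480 | 4320 | 4320 | 4320 |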


In [60]:
data.groupby('class').size()

class
not_recom     4320
priority      4266
recommend        2
spec_prior    4044
very_recom     328
dtype: int64
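
The counts above are far from balanced, which matters for the splitting and cross-validation steps that follow. As a quick check, the sketch below turns the printed counts into class shares and flags any class with fewer rows than the number of folds. The counts are copied from the output above; `n_splits = 10` mirrors the `StratifiedKFold` setting used later in this notebook.

```python
import pandas as pd

# Class counts copied from the groupby output above
counts = pd.Series(
    {"not_recom": 4320, "priority": 4266, "recommend": 2,
     "spec_prior": 4044, "very_recom": 328},
    name="count",
)

shares = counts / counts.sum()  # fraction of all rows that fall in each class
print(shares.round(4))

# A class with fewer rows than folds cannot be represented in every fold
n_splits = 10
rare = counts[counts < n_splits].index.tolist()
print("classes smaller than n_splits:", rare)
```

Only `recommend`, with 2 rows, falls below the fold count, so scores for that class cannot be estimated reliably from this data.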

In [81]:
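# Separate the eight feature columns from the target column
array = data.values
X = array[:,0:8]
y = array[:,8]
# Hold out 30% of the rows as a test set
X_train, X_test, Y_train, Y_test = train_test_split(X,y, test_size=0.3)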
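
The split above draws different rows on every run because no seed is set. A minimal sketch of a reproducible, class-balanced split follows, using the `random_state` and `stratify` parameters of `train_test_split`. The arrays `X_demo` and `y_demo` are hypothetical toy data, not the Nursery set, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 70 of class 0 and 30 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    random_state=1,   # fixed seed: the same rows are held out on every run
    stratify=y_demo,  # keep the 70/30 class ratio in both parts
)

print(len(X_tr), len(X_te))
print(np.bincount(y_tr), np.bincount(y_te))
```

With stratification, the held-out part keeps the same class proportions as the full data, which makes test metrics easier to compare across runs.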

In [82]:
# Candidate classifiers to compare
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

In [83]:
# Score each model with 10-fold stratified cross-validation on the training split
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

LR: 0.900683 (0.005117)
LDA: 0.898256 (0.008646)
KNN: 0.965168 (0.006119)
CART: 0.994379 (0.002893)
NB: 0.742393 (0.004198)
SVM: 0.980489 (0.004126)
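
Each mean and standard deviation above summarizes ten scores, one per held-out fold. To show what `StratifiedKFold` contributes, the sketch below counts the class labels in each test fold for a small hypothetical label vector (`y_demo`, 80 rows of one class and 20 of another) and a smaller fold count of 5.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 80 rows of class 0 and 20 rows of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))  # feature values do not affect how folds are drawn

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Count how many rows of each class land in every test fold
fold_counts = [
    np.bincount(y_demo[test_idx], minlength=2).tolist()
    for _, test_idx in kfold.split(X_demo, y_demo)
]
print(fold_counts)
```

Every fold carries the same class mix as the full label vector, so no single fold is scored on an unrepresentative sample.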


In [84]:
# create Decision Tree Classifier object
model = DecisionTreeClassifier()

In [85]:
# Train Decision Tree Classifier
model = model.fit(X_train, Y_train)

In [86]:
# Predict the response for the test dataset
y_pred = model.predict(X_test)

In [87]:
# model accuracy
print("Accuracy:", metrics.accuracy_score(Y_test, y_pred))

Accuracy: 0.996141975308642


In [88]:
# evaluate predictions
print(accuracy_score(Y_test, y_pred))
print(confusion_matrix(Y_test, y_pred))
print(classification_report(Y_test, y_pred))

0.996141975308642
[[1261    0    0    0]
 [   0 1279    9    0]
 [   0    4 1224    0]
 [   0    2    0  109]]
              precision    recall  f1-score   support

   not_recom       1.00      1.00      1.00      1261
    priority       1.00      0.99      0.99      1288
  spec_prior       0.99      1.00      0.99      1228
  very_recom       1.00      0.98      0.99       111

    accuracy                           1.00      3888
   macro avg       1.00      0.99      0.99      3888
weighted avg       1.00      1.00      1.00      3888
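
The figures in the report can be recovered directly from the confusion matrix, which follows the scikit-learn convention of true classes in rows and predicted classes in columns. The sketch below copies the matrix from the output above and recomputes accuracy, precision, and recall with NumPy; the label order is taken from the classification report.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
labels = ["not_recom", "priority", "spec_prior", "very_recom"]
cm = np.array([
    [1261,    0,    0,   0],
    [   0, 1279,    9,   0],
    [   0,    4, 1224,   0],
    [   0,    2,    0, 109],
])

correct = np.diag(cm)                 # correctly classified rows per class
accuracy = correct.sum() / cm.sum()   # share of all test rows predicted correctly
recall = correct / cm.sum(axis=1)     # divided by row totals (true class sizes)
precision = correct / cm.sum(axis=0)  # divided by column totals (predicted sizes)

print(f"accuracy: {accuracy:.6f}")
for name, p, r in zip(labels, precision, recall):
    print(f"{name:>10}  precision={p:.2f}  recall={r:.2f}")
```

The row totals equal the `support` column of the report, and all 15 errors involve the `priority` class, either as the true or the predicted label.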



Reference: https://machinelearningmastery.com/machine-learning-in-python-step-by-step/