# Classifying Kyphosis as Absent or Present

By: Matt Purvis

This project trains decision tree and random forest classifiers on the kyphosis dataset to predict whether kyphosis is absent or present, using the labeled `Kyphosis` column as the target.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

In [2]:
df = pd.read_csv('C:\\Users\\v-mpurvis\\OneDrive\\Personal Files\\Python Machine Learning Examples\\DataSets-Modules\\kyphosis.csv')
df

Unnamed: 0,Kyphosis,Age,Number,Start
0,absent,71,3,5
1,absent,158,3,14
2,present,128,4,5
3,absent,2,5,1
4,absent,1,4,15
...,...,...,...,...
76,present,157,3,13
77,absent,26,7,13
78,absent,120,2,13
79,present,42,7,6


# Train Test Split

In [11]:
# Note: X still contains the 'Unnamed: 0' column, which is just the row index saved in the CSV.
X = df.drop('Kyphosis', axis = 1)
y = df['Kyphosis']

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = .3)
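
The split above draws a different random sample on every run, so the scores reported further down will change from run to run. A minimal sketch of a reproducible, class-balanced split follows. The toy arrays `X_demo` and `y_demo` and the seed value 42 are assumptions made for illustration and are not part of the original notebook; `random_state` and `stratify` are standard `train_test_split` parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 20 of them labeled "present".
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array(["present"] * 20 + ["absent"] * 80)

# random_state fixes the shuffle so the split is repeatable;
# stratify keeps the absent/present ratio the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(len(y_te), (y_te == "present").sum())
```

With so few `present` cases, stratifying keeps the test set from ending up with too few positives by chance.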

# Decision Tree

In [12]:
dtree = DecisionTreeClassifier()
dtree.fit(X_train, Y_train)

pred = dtree.predict(X_test)

print(confusion_matrix(Y_test,pred))
print(classification_report(Y_test,pred))

[[15  2]
 [ 5  3]]
              precision    recall  f1-score   support

      absent       0.75      0.88      0.81        17
     present       0.60      0.38      0.46         8

    accuracy                           0.72        25
   macro avg       0.68      0.63      0.64        25
weighted avg       0.70      0.72      0.70        25



Seven of the 25 test observations were misclassified: 5 false negatives and 2 false positives, for an accuracy of 72%. Treating `present` as the positive class, precision is the share of predicted positives that are actually positive (3/5 = 60%). Recall, the true positive rate, is the share of actual positives that the model catches (3/8 ≈ 38%). The F1 score is the harmonic mean of precision and recall, so when the classes are imbalanced, as they are here, it is often a better indicator of model quality than accuracy alone. In this case the F1 score for `present` (0.46) and the accuracy are both weak, which indicates a weak model. Note that this is a very small dataset; more observations could strengthen the model's predictive power.
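
As a check on the arithmetic, the metrics for the `present` class can be recomputed by hand from the confusion matrix printed above. This is only a sketch: the variable names are illustrative, and it assumes scikit-learn's default layout (rows are true labels, columns are predicted labels, ordered absent then present).

```python
# Decision tree confusion matrix copied from the output above.
# Rows: true label, columns: predicted label, order: (absent, present).
cm = [[15, 2],
      [5, 3]]

tn, fp = cm[0]  # true absent: predicted absent, predicted present
fn, tp = cm[1]  # true present: predicted absent, predicted present

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these values agree with the `present` row and the accuracy line of the classification report.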

# Random Forests

In [13]:
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, Y_train)

pred = rfc.predict(X_test)
print(confusion_matrix(Y_test,pred))
print(classification_report(Y_test,pred))

[[17  0]
 [ 5  3]]
              precision    recall  f1-score   support

      absent       0.77      1.00      0.87        17
     present       1.00      0.38      0.55         8

    accuracy                           0.80        25
   macro avg       0.89      0.69      0.71        25
weighted avg       0.85      0.80      0.77        25



The random forest performed somewhat better, but it is still not a strong model. Accuracy was 80% and the weighted-average F1 score was 0.77, although the F1 score for the `present` class was only 0.55. Unlike the decision tree, it produced no false positives, but it made the same 5 false negatives, so recall for `present` stayed at 38%. More observations could help here as well.
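
The report above lists several F1 figures (per class, macro average, weighted average), and it is easy to mix them up. The sketch below recomputes them from the random forest confusion matrix. The helper `class_f1` is a hypothetical name introduced here, and the matrix layout is assumed to be rows for true labels and columns for predicted labels.

```python
# Random forest confusion matrix copied from the output above.
# Rows: true label, columns: predicted label, order: (absent, present).
cm = [[17, 0],
      [5, 3]]
labels = ["absent", "present"]


def class_f1(cm, k):
    """F1 for class k, treating k as the positive class.

    Assumes class k was predicted at least once (true for this matrix).
    """
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum
    actual_k = sum(cm[k])                    # row sum (the support)
    precision = tp / predicted_k
    recall = tp / actual_k
    return 2 * precision * recall / (precision + recall)


f1 = [class_f1(cm, k) for k in range(len(labels))]
support = [sum(row) for row in cm]

# Macro average: plain mean of the per-class scores.
macro_f1 = sum(f1) / len(f1)
# Weighted average: per-class scores weighted by class support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

for name, score in zip(labels, f1):
    print(f"{name}: {score:.2f}")
print(f"macro: {macro_f1:.2f}  weighted: {weighted_f1:.2f}")
```

The weighted average is pulled up by the larger `absent` class, which is why it sits well above the F1 score for `present`.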