In [40]:
from tree.DecisionTree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
import pandas as pd

# Load Dataset
To test the C4.5 decision tree classifier, we use the Iris dataset.

In [41]:
data = pd.read_csv('iris.csv')
data.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


The dataset has 150 instances.
It has 4 features (attributes), all of them continuous, which lets us test the splitting algorithm for continuous (numeric) attributes.
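As a rough illustration of what splitting on a continuous attribute involves, the sketch below follows the textbook C4.5 idea: sort the values, try the midpoint between each pair of adjacent distinct values as a threshold, and keep the one with the highest gain ratio. The helper names `entropy` and `best_threshold` and the six toy samples are assumptions made for this example; the split rule inside `tree.DecisionTree` may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain_ratio) for the best binary split `value <= threshold`.

    Hypothetical helper: a textbook-style search, not the package's own code.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        threshold = (lo + hi) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        w_left, w_right = len(left) / n, len(right) / n
        gain = base - w_left * entropy(left) - w_right * entropy(right)
        split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
        ratio = gain / split_info
        if ratio > best[1]:
            best = (threshold, ratio)
    return best


# Toy data (illustrative values, not read from iris.csv)
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]
threshold, ratio = best_threshold(petal_length, species)
print(threshold, ratio)
```

On this toy data the two classes are perfectly separable, so the search picks the midpoint between the largest setosa value and the smallest versicolor value.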

In [42]:
X = data.drop(columns="species")
y = data["species"]

X.shape, y.shape

((150, 4), (150,))

There are 3 species, representing 3 classes, so we can check that the model handles multiclass classification problems.

In [43]:
y.value_counts()

versicolor    50
setosa        50
virginica     50
Name: species, dtype: int64

# Split Dataset
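First, using sklearn's train_test_split, we split the dataset into two parts: a training set and a testing set. The training set is used to build the model, and the testing set is used to measure its accuracy. With test_size=0.2, 30 of the 150 instances are held out for testing.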

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
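To make the idea of a shuffled hold-out split concrete, here is a minimal standard-library sketch. `simple_split` is a hypothetical helper written for this note; it is not how sklearn implements `train_test_split`, and it will not reproduce the same rows for the same seed.

```python
import random


def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off a test portion.

    Hypothetical helper for illustration only.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# 150 stand-in rows, matching the size of the dataset above
train, test = simple_split(list(range(150)))
print(len(train), len(test))  # 120 30
```

Fixing the seed keeps the split reproducible, which is the role random_state plays in the cell above.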

# Build Model and Fit data
Then, we create a DecisionTreeClassifier object and fit it to the training data.

In [45]:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

We can print the structure of the fitted tree:

In [46]:
clf.print_tree_struct()

+  petal_length
|	+ [<=3.7274999999999996] petal_width
|	|	+ [<=0.3434782608695652] setosa (LEAF)
|	|	+ [>0.3434782608695652] sepal_length
|	|	|	+ [<=5.230769230769232] sepal_width
|	|	|	|	+ [<=2.9375] versicolor (LEAF)
|	|	|	|	+ [>2.9375] setosa (LEAF)
|	|	|	+ [>5.230769230769232] sepal_width
|	|	|	|	+ [<=3.4400000000000004] versicolor (LEAF)
|	|	|	|	+ [>3.4400000000000004] setosa (LEAF)
|	+ [>3.7274999999999996] petal_width
|	|	+ [<=1.704054054054054] versicolor (LEAF)
|	|	+ [>1.704054054054054] virginica (LEAF)
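To show how the printed structure is read, the sketch below re-encodes it by hand and walks it for one sample. The tuple encoding and the `classify` function are assumptions for illustration; they are not the internal representation or API of `tree.DecisionTree`.

```python
# Hypothetical encoding of the printed tree: an internal node is
# (feature, threshold, subtree_if_le, subtree_if_gt); a leaf is a class name.
TREE = (
    "petal_length", 3.7274999999999996,
    ("petal_width", 0.3434782608695652,
        "setosa",
        ("sepal_length", 5.230769230769232,
            ("sepal_width", 2.9375, "versicolor", "setosa"),
            ("sepal_width", 3.4400000000000004, "versicolor", "setosa"))),
    ("petal_width", 1.704054054054054, "versicolor", "virginica"),
)


def classify(node, sample):
    """Follow <= / > branches from the root until a leaf (a string) is reached."""
    while not isinstance(node, str):
        feature, threshold, le_branch, gt_branch = node
        node = le_branch if sample[feature] <= threshold else gt_branch
    return node


# First row shown by data.head()
sample = {"sepal_length": 5.1, "sepal_width": 3.5,
          "petal_length": 1.4, "petal_width": 0.2}
print(classify(TREE, sample))  # setosa
```

The sample takes the `<=` branch at petal_length and again at petal_width, ending at the setosa leaf.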


# Test Model
Next, we predict the labels of the testing set.

In [47]:
y_predict = clf.predict(X_test)

Ideally, predicting on X_test should produce output that matches y_test. Testing accuracy measures how closely the predicted output (y_predict) matches the true output (y_test): it is the fraction of test instances predicted correctly. Using accuracy_score from sklearn, we get 0.967 (96.7%).

In [48]:
accuracy_score(y_test, y_predict)

0.9666666666666667
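For reference, accuracy is just the number of matching labels divided by the number of instances; 0.9667 on 30 test instances corresponds to 29 correct predictions. A minimal sketch of the computation, using a hypothetical `accuracy` helper and made-up toy labels rather than the real predictions:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels for illustration only
toy_true = ["setosa", "versicolor", "virginica", "setosa"]
toy_pred = ["setosa", "versicolor", "virginica", "versicolor"]
print(accuracy(toy_true, toy_pred))  # 0.75
```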

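Using confusion_matrix from sklearn, we can see that only 1 instance is misclassified. Rows are true classes and columns are predicted classes, in sorted label order (setosa, versicolor, virginica), so one setosa instance was predicted as versicolor.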

In [50]:
confusion_matrix(y_test, y_predict)

array([[ 9,  1,  0],
       [ 0,  9,  0],
       [ 0,  0, 11]], dtype=int64)
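The matrix above can also be read per class. The sketch below copies its values by hand and derives precision, recall, and overall accuracy; `per_class_metrics` is a hypothetical helper, and the label order assumes sklearn's default of sorted class names.

```python
# Values copied from the confusion matrix above; rows = true, columns = predicted.
labels = ["setosa", "versicolor", "virginica"]
matrix = [[9, 1, 0],
          [0, 9, 0],
          [0, 0, 11]]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)} computed from a square confusion matrix."""
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        actual = sum(matrix[k])                    # row sum: instances of this class
        predicted = sum(row[k] for row in matrix)  # column sum: predictions of this class
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


metrics = per_class_metrics(matrix, labels)
total = sum(sum(row) for row in matrix)
correct = sum(matrix[k][k] for k in range(len(labels)))
overall_accuracy = correct / total

for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={overall_accuracy:.4f}")
```

The diagonal sums to 29 of 30 instances, which agrees with the accuracy score reported earlier.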