James Chan 2017
At the time this was written (9/12/17), scikit-learn's DecisionTreeClassifier did not support reduced-error post-pruning, which is an effective way to reduce overfitting and potentially improve test accuracy. My code takes advantage of the existing decision tree data structure and modifies it in place once the tree has been built with scikit-learn.
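The data structure in question is the fitted estimator's `tree_` attribute, which stores the tree as parallel arrays indexed by node id; the sentinel value -1 in a child slot marks a leaf. A minimal look (the iris dataset here is just a convenient stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", min_samples_leaf=1).fit(X, y)

tree = clf.tree_
# Each node i is described by parallel arrays; -1 means "no child" (a leaf).
print("node count:", tree.node_count)
print("left children of first nodes:", tree.children_left[:5])
print("right children of first nodes:", tree.children_right[:5])
```

Because these arrays are views into the fitted tree's memory, writing -1 into a node's child slots effectively turns that node into a leaf, which is what makes post-hoc pruning possible.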
The decision trees in the subsequent examples use Gini impurity as the split criterion with a minimum leaf size of 1. The post-pruning algorithm is the reduced-error pruning from Machine Learning (1997) by Tom Mitchell. The algorithm is very simple: as long as removing leaves does not hurt accuracy on held-out data, continue to prune.
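The loop can be sketched as follows. This is a minimal greedy version, not the author's exact code: `reduced_error_prune` is my own name, the held-out set is passed in explicitly, and the in-place edit relies on `tree_.children_left` / `tree_.children_right` being writable views of the fitted tree (a well-known scikit-learn trick, not a documented API).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

TREE_LEAF = -1  # sentinel scikit-learn uses for "no child"


def reduced_error_prune(clf, X_val, y_val):
    """Greedily turn internal nodes into leaves while accuracy on the
    held-out set does not decrease; undo any prune that hurts."""
    tree = clf.tree_
    best = clf.score(X_val, y_val)
    improved = True
    while improved:
        improved = False
        for node in range(tree.node_count):
            left = tree.children_left[node]
            right = tree.children_right[node]
            if left == TREE_LEAF:  # already a leaf, nothing to prune
                continue
            # Tentatively prune: with both child slots set to TREE_LEAF,
            # predict() stops here and uses this node's class counts.
            tree.children_left[node] = TREE_LEAF
            tree.children_right[node] = TREE_LEAF
            score = clf.score(X_val, y_val)
            if score >= best:
                best = score        # keep the prune
                improved = True
            else:                   # prune hurt accuracy: restore children
                tree.children_left[node] = left
                tree.children_right[node] = right
    return best


# Usage sketch on iris:
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)
before = clf.score(X_val, y_val)
after = reduced_error_prune(clf, X_val, y_val)
```

Since a prune is kept only when the held-out score does not drop, the score after pruning can never be worse than before on that set, and the loop terminates because each kept prune permanently removes an internal node.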
Figure 1. Before Post-Pruning vs After Post-Pruning
To accurately assess the effect of pruning, we need two separate test sets: one is evaluated on the unpruned tree, and the other on the post-pruned tree. As mentioned previously, we shuffle our data and average the results of multiple trials in order to get an unbiased assessment of post-pruning results.
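One way to sketch this scheme in code (the 60/20/20 proportions, the random data, and the variable names are my assumptions, not taken from Figure 2): shuffle, carve off a training set, then split the remainder into two disjoint test sets, one per tree.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(200, 4)               # stand-in features
y = rng.randint(0, 2, size=200)    # stand-in labels

# Shuffle and carve off 60% for training ...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, shuffle=True, random_state=0)

# ... then split the remainder into two disjoint 20% test sets:
# one scored against the unpruned tree, one against the pruned tree.
X_test_unpruned, X_test_pruned, y_test_unpruned, y_test_pruned = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)
```

Keeping the two test sets disjoint ensures neither tree's score is influenced by data the other was evaluated on.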
Figure 2. Train Test Split Scheme
The learning curves below compare the out-of-sample accuracy before and after post-pruning. Each training-size interval averages the accuracy over 32 trials with random shuffling.
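The trial-averaging loop behind such a curve might look like this (a sketch, again using iris as a stand-in dataset; the training-size grid is my choice):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_trials = 32
train_sizes = [0.2, 0.4, 0.6, 0.8]

mean_acc = []
for frac in train_sizes:
    scores = []
    for trial in range(n_trials):
        # A fresh shuffle per trial; the trial index seeds the split.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=frac, shuffle=True, random_state=trial)
        clf = DecisionTreeClassifier(criterion="gini").fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    mean_acc.append(np.mean(scores))  # one point on the learning curve
```

The same loop, run once without pruning and once with the pruning step inserted after `fit`, yields the two curves being compared.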
Estimate the age of abalone based on features such as size, sex, and weight.
Source: https://archive.ics.uci.edu/ml/datasets/abalone
The accuracy after post-pruning is about 1.0 to 1.5% higher than a tree that hasn't been pruned.
Estimate the quality of red wine on a scale of 1-10 as assigned by people. Input features include alcohol content, malic acid content, and color intensity.
Source: https://archive.ics.uci.edu/ml/datasets/wine
The accuracy after post-pruning is about 0.0 to 2.0% higher than a tree that hasn't been pruned.
Predict whether a subject has diabetes based on predictors such as blood pressure, BMI, and age.
Source: https://www.kaggle.com/uciml/pima-indians-diabetes-database
The accuracy after post-pruning is about 1.0 to 2.0% higher than a tree that hasn't been pruned.
Predict the species of iris based on features such as sepal length, petal length, and petal width.
Source: https://archive.ics.uci.edu/ml/datasets/iris
There is no visible improvement to the accuracy after post-pruning.
Post-pruning has the desirable property that it very often reduces both the variance and the bias of a decision tree learner. It should be considered whenever a decision tree algorithm is used.