KNN Vs Classification Tree Models

Comparing model performance for predicting forest cover types

Dataset: https://www.kaggle.com/c/forest-cover-type-prediction

Introduction:

In this paper, I build upon my findings from the EDA process. First, I construct two separate models, a KNN model and a classification tree, to predict forest cover type from my dataset. Then I apply control strategies that keep the models from overfitting or becoming too complex. Lastly, I evaluate their performance, review the pros and cons of each model, and choose the most appropriate one for my dataset.

KNN Model Construction:

After loading my dataset into RStudio, I prepared it for modeling. Since KNN models predict based on the distance between observations, the inputs have to be numerical, e.g. elevation, aspect, slope. Therefore, I dropped the identifier and categorical columns (e.g. ID, Wilderness_Area, Soil_Type), keeping the Hillshade variables and Cover_Type, which is my target (what I'm trying to predict).

KNN Model Interpretation:

After standardizing the remaining numerical variables, setting the seed to 1234, and partitioning my data on a 60-40 training-test split, I created my model using K=1, the value that knnCrossVal() suggested as the best one (where K is the number of nearest neighbors that vote on each prediction). After running the model, I obtained a confusion matrix (see KNN Confusion Matrix.png).

From the confusion matrix, we can infer that my model correctly predicted forest cover type 1 386 times, i.e. the prediction matched the actual cover type of the observation in the test split. Similarly, the model correctly predicted forest cover type 7 624 times in the test split.
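The steps described above (standardize, split 60-40 with a fixed seed, predict with K=1, tabulate a confusion matrix) can be sketched as follows. The repository's scripts are written in R; this is only an illustrative plain-Python sketch, not the author's code. The function names and the toy data are hypothetical, and the shuffle will not reproduce R's partition even with the same seed.

```python
import math
import random
import statistics
from collections import Counter

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

def train_test_split(rows, labels, train_fraction=0.6, seed=1234):
    """Shuffle the row indices with a fixed seed, then cut at train_fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([rows[i] for i in train_idx], [labels[i] for i in train_idx],
            [rows[i] for i in test_idx], [labels[i] for i in test_idx])

def knn_predict(train_rows, train_labels, query, k=1):
    """Return the most common label among the k nearest training rows."""
    by_distance = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return statistics.mode(nearest_labels)

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs; pairs with equal labels are correct."""
    return Counter(zip(actual, predicted))

# Hypothetical toy data: (elevation, slope) rows with made-up cover types.
features = [[2600, 5], [2650, 8], [2700, 6], [3300, 20], [3350, 25],
            [3400, 22], [2620, 7], [3380, 24], [2680, 4], [3320, 21]]
cover_types = [1, 1, 1, 7, 7, 7, 1, 7, 1, 7]

scaled = standardize(features)
x_train, y_train, x_test, y_test = train_test_split(scaled, cover_types)
predictions = [knn_predict(x_train, y_train, row, k=1) for row in x_test]
matrix = confusion_matrix(y_test, predictions)

for (actual_label, predicted_label), count in sorted(matrix.items()):
    print(f"actual={actual_label} predicted={predicted_label} count={count}")
```

Entries where the actual and predicted labels agree correspond to the diagonal of the confusion matrix, which is where counts such as 386 and 624 are read from.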

KNN Performance Evaluation:

After running my model, I calculated the accuracy rate - how often my model predicted the cover type correctly. The KNN model achieved a 75% accuracy rate and a 25% error rate. To put this in context, I used a benchmark error rate - the error rate one would get by blindly predicting the cover type without the help of a model - which came to 86%.
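The metrics above can be illustrated with a short Python sketch. It assumes that the benchmark is the error rate of always predicting the most common class, which is one common definition and may not be exactly what the R script computes. The labels below are made up.

```python
from collections import Counter

def accuracy_rate(actual, predicted):
    """Fraction of observations whose predicted label equals the actual one."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

def benchmark_error_rate(actual):
    """Error rate of always predicting the most common class (assumed definition)."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return 1 - most_common_count / len(actual)

# Hypothetical labels: seven cover types, each equally frequent.
actual = [1, 2, 3, 4, 5, 6, 7] * 3
predicted = list(actual)
predicted[0] = 2  # introduce a single wrong prediction

accuracy = accuracy_rate(actual, predicted)
error = 1 - accuracy
benchmark = benchmark_error_rate(actual)
print(f"accuracy={accuracy:.3f} error={error:.3f} benchmark={benchmark:.3f}")
```

Under this assumed definition, seven equally frequent classes give a benchmark error of 6/7, or about 86%, which is consistent with the figure quoted above.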

CART Model Construction:

After loading my dataset into RStudio, I prepared it for modeling. Since classification trees can split on both categorical and numerical variables, the inputs can be of either type, e.g. hillshade, elevation, aspect, slope. Here, the target is again Forest Cover Type.

CART Model Interpretation:

After standardizing the remaining numerical variables, converting the categorical variables into factors, setting the seed to 1234, and partitioning my data on a 60-40 training-test split, I created my model and obtained a confusion matrix (see Classification Tree Confusion Matrix.png).

From the confusion matrix, we can infer that my model correctly predicted forest cover type 1 549 times, i.e. the prediction matched the actual cover type of the observation in the test split. Similarly, the model correctly predicted forest cover type 7 819 times in the test split.

After fitting the model, I plotted the classification tree (see Tree Before Pruning.png).

Using the easyprune function, I pruned the classification tree to remove unnecessary nodes (see Classification Tree after pruning.png).

Classification Tree Performance Evaluation:

After running my model, I calculated the accuracy rate - how often my model predicted the cover type correctly. The CART model achieved roughly a 77% accuracy rate and a 22% error rate. To put this in context, I used a benchmark error rate - the error rate one would get by blindly predicting the cover type without the help of a model - which came to 86%.

The accuracy of the model on the test data is better when the tree is pruned, which means that the pruned tree generalizes better and is the more suitable model. However, other factors can also influence how a decision tree is built, such as class imbalance in the training data. These factors were not accounted for in this demonstration, but it is important to examine them when building a model for real-world use.

Pros and Cons of the KNN and Classification Tree Model:

KNN Model

Pros:

KNN is easy to implement, as it takes only two parameters, i.e. k and a distance function.
Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly, without refitting a model.

Cons:

One needs to do feature scaling (standardization or normalization) before applying the KNN algorithm to any dataset. Otherwise, variables with larger ranges dominate the distance calculation and KNN may generate wrong predictions.
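To see why scaling matters, consider this small Python sketch. The three observations and the ranges used for scaling are hypothetical values chosen for illustration, not figures taken from the dataset.

```python
import math

# Hypothetical observations: (elevation in metres, slope in degrees).
a = (2600.0, 5.0)
b = (2610.0, 40.0)  # similar elevation, very different slope
c = (2700.0, 6.0)   # different elevation, similar slope

# Hypothetical ranges used to bring both variables onto a comparable scale.
ranges = (2000.0, 50.0)

def scale(point):
    """Divide each variable by its assumed range."""
    return tuple(value / r for value, r in zip(point, ranges))

raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)
scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))

# Unscaled, elevation dominates and b looks nearer to a than c does.
print(f"raw:    a-b={raw_ab:.1f}  a-c={raw_ac:.1f}")
# Scaled, the large slope difference counts and c becomes the nearer one.
print(f"scaled: a-b={scaled_ab:.3f}  a-c={scaled_ac:.3f}")
```

The nearest neighbor of `a` flips from `b` to `c` once both variables are on a comparable scale, so a K=1 prediction could change as well.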

Classification Tree Model

Pros:

Trees can be displayed graphically and can be easily interpreted by non-experts.
Decision trees can easily handle qualitative (categorical) features without the need to create dummy variables.

Cons:

As the tree grows in size, it becomes prone to overfitting and requires pruning.
A small change in the data set can change the tree structure substantially, which makes single trees unstable (high-variance) models.

Which model is better?

The KNN model brought the error rate down from the 86% benchmark to 25% (75% accuracy), while the CART model brought it down to 22% (77% accuracy). Thus, the classification tree model is more accurate than the KNN model at predicting the forest cover type on the available test data.
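As a quick arithmetic check, the improvement over the benchmark can be expressed as a relative error reduction. The Python sketch below uses the rounded rates reported in the evaluation sections (86% benchmark error, 25% KNN error, 22% CART error). The helper name is hypothetical.

```python
def relative_error_reduction(benchmark_error, model_error):
    """Share of the benchmark error that the model eliminates."""
    return (benchmark_error - model_error) / benchmark_error

knn_reduction = relative_error_reduction(0.86, 0.25)   # about 0.71
cart_reduction = relative_error_reduction(0.86, 0.22)  # about 0.74

print(f"KNN:  {knn_reduction:.1%} of the benchmark error removed")
print(f"CART: {cart_reduction:.1%} of the benchmark error removed")
```

On these rounded figures, the tree removes a slightly larger share of the benchmark error than KNN does.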

