The aim of this repository is to compare the performance of several learning algorithms tasked with classifying handwritten digits as even or odd. The training and testing sets used are found in the data folder.
See Kaggle for more information on the problem: http://www.kaggle.com/c/digit-recognizer
The following machine learning methods are implemented and compared:
- Tree-based Methods
  - Classification Trees
  - Random Forest
  - Bagging
  - Boosting
- Nearest Neighbors
  - K Nearest Neighbors
- Support Vector Machines
- Neural Networks
  - Artificial Neural Networks (ANN)
  - Convolutional Neural Networks (CNN)
Each technique is evaluated in the following manner:
1. Splitting data into training/testing sets
The provided training set is split into a training set (80%) and a test set (20%). The test set is not used in the training process; it serves only to evaluate the final chosen hypothesis. Where required, a further 20% of the training data is set aside as a validation set.
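As a minimal sketch, this split could look like the following in Python with scikit-learn. The file name data/train.csv and its label-plus-pixels layout are assumptions based on the Kaggle competition format, not confirmed details of this repository:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle-style training data (file name and layout assumed).
data = pd.read_csv("data/train.csv")

# Relabel the digits 0-9 as even (0) / odd (1).
X = data.drop(columns=["label"])
y = data["label"] % 2

# 80/20 split; the test set is held out until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```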
2. Finding the optimal parameters using validation
Some form of validation is used to choose the optimal hyperparameters for each model, e.g. regularization factors, learning rates, or model complexity. Validation most often takes the form of 10-fold cross-validation, the out-of-bag (OOB) error, or a dedicated validation set. For each method, the hypothesis with the lowest validation error E_val(h) is selected and denoted g∗.
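As an illustration of this step, a 10-fold cross-validated grid search for the random forest might look like the sketch below. The hyperparameter grid is purely illustrative, not the values actually searched in this repository:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative hyperparameter grid (assumed values for demonstration).
param_grid = {"n_estimators": [100, 300], "max_depth": [10, 20, None]}

# 10-fold cross-validation over the grid, using the training set only.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=10,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_train, y_train)

g_star = search.best_estimator_  # g*: the hypothesis minimizing E_val(h)
```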
3. Evaluating the best model on the test set
The best hypothesis, g∗, is then evaluated on the test set, producing the out-of-sample error E_out(g∗). This value is what ranks the machine learning methods against each other: the lower E_out(g∗), the better.
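Continuing the sketches above, the final evaluation could then be as simple as:

```python
from sklearn.metrics import accuracy_score

# Estimate E_out(g*) as the misclassification rate on the held-out test set.
y_pred = g_star.predict(X_test)
e_out = 1.0 - accuracy_score(y_test, y_pred)
print(f"E_out(g*) = {e_out:.4f}")
```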