MNIST SVM Tuning

Scikit-learn's support vector machines are computationally slow because their training is not parallelized, so some creativity is required to keep this project brief. To mitigate longer-than-ideal training and tuning times, I will rely on two methods of model optimization: ensemble learning and adaptive hyperparameter tuning.

SVMs, unlike most other sklearn algorithms, train sequentially. On top of this, SVM training time also has a rather unfortunate pseudo-quadratic relationship with dataset size. Piping my model through a BaggingClassifier has a twofold advantage: it splits the dataset across multiple weak-learning classifiers (reducing any single model's complexity), and it enables model training in parallel (not quite one-to-one scaling, but leaps and bounds above a single LinearSVC model).
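The sketch below shows one way to wire this up; the estimator count, sampling fraction, and C value are illustrative placeholders rather than the repository's actual settings (and on scikit-learn versions older than 1.2 the `estimator=` argument is named `base_estimator=`):

```python
# Minimal sketch: bag LinearSVC so each weak learner sees only a slice of
# MNIST, and fit the learners in parallel. Values below are illustrative.
from sklearn.datasets import fetch_openml
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X / 255.0, y, test_size=0.2, random_state=0
)

bagged_svc = BaggingClassifier(
    estimator=LinearSVC(C=1.0, dual=False),  # C is a placeholder, tuned later
    n_estimators=16,       # one weak learner per thread
    max_samples=1 / 16,    # each learner trains on ~1/16 of the training set
    n_jobs=-1,             # fit estimators in parallel
    random_state=0,
)
bagged_svc.fit(X_train, y_train)
print(f"bagged test accuracy: {bagged_svc.score(X_test, y_test):.4f}")
```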

On the tuning front, grid searching would take advantage of as many cores as I throw at it, but I would still waste compute cycles on hyperparameter sets that clearly will not produce an optimal model. Surely there must be some method of hyperparameter tuning that is adaptive in nature, right? Something that looks at prior hyperparameter runs and adjusts its search based on where it estimates the optimal parameters exist...

Enter Bayesian optimization: a method of model optimization built specifically for expensive-to-evaluate functions. As the optimizer iterates through the parameter space passed to it, it uses Bayes' theorem to model the classifier's loss as a globally optimizable objective. Similar to a naive Bayes classifier, a prior probability distribution over the loss function is updated with each hyperparameter evaluation, incrementally building a posterior approximation. An acquisition function then directs sampling toward regions with a better likelihood of improving on that approximation, cutting down on "inefficient" tuning iterations.
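In practice this could look like the hedged sketch below, which assumes scikit-optimize's BayesSearchCV (the project may well use a different Bayesian optimization library) and reuses `bagged_svc` and the training split from the previous snippet; the search bounds and iteration count are illustrative:

```python
# Hedged sketch of adaptive tuning with scikit-optimize's BayesSearchCV.
# Only the base estimator's C is searched here, on a log-uniform scale.
from skopt import BayesSearchCV
from skopt.space import Real

search = BayesSearchCV(
    estimator=bagged_svc,
    # "estimator__C" reaches the LinearSVC inside the BaggingClassifier;
    # older scikit-learn versions would use "base_estimator__C" instead.
    search_spaces={"estimator__C": Real(1e-3, 1e3, prior="log-uniform")},
    n_iter=25,      # far fewer evaluations than an exhaustive grid
    cv=3,
    n_jobs=-1,
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```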

As for the results, hyperparameter tuning on the bagged LinearSVC model took a little over 35 minutes on an 8C/16T CPU (Ryzen 7 3700X). As a benchmark I did attempt to brute-force the search using GridSearchCV and the barebones LinearSVC model (with a hardcoded list of 50 Cs), but ran into non-convergence issues at max_iter values below 100000. Running the grid search again with the increased max_iter drastically increased the per-C compute times, up to around 3.1 hours. Anecdotal evidence suggests a roughly 5.3x reduction in overall compute time while hyperparameter tuning, with the tradeoff being a 1.0% drop in test accuracy with identical parameters / 1.3% drop in test accuracy against a fully converged LinearSVC. The caveat here is that the fully converged model takes over 150 seconds per training iteration, summing to an estimated total of 10 hours for hyperparameter tuning.
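For reference, the brute-force baseline described above would look roughly like this (the specific list of 50 Cs is not given here, so the logspace grid is a hypothetical stand-in):

```python
# Hypothetical reconstruction of the GridSearchCV baseline: a plain LinearSVC,
# 50 candidate C values, and max_iter raised to 100000 to reach convergence.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

param_grid = {"C": np.logspace(-3, 3, 50)}  # stand-in for the 50 hardcoded Cs
grid = GridSearchCV(
    LinearSVC(max_iter=100000),
    param_grid,
    cv=3,
    n_jobs=-1,
)
# grid.fit(X_train, y_train)  # per the timings above, expect this to run for hours
```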
