HLearn is a suite of libraries for interpreting machine learning models according to their algebraic structure. Every structure has associated algorithms useful for learning. When we show that a model is an instance of a particular structure, we get the associated algorithms "for free."
|Structure||What we get|
|Monoid||parallel batch training|
|Abelian group||"untraining" of data points|
|Abelian group||faster cross-validation|
|R-Module||weighted data points|
|Vector space||fractionally weighted data points|
|Functor||fast simple preprocessing of data|
|Monad||fast complex preprocessing of data|
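To make the table concrete, here is a minimal sketch (not HLearn's actual code) of a model that is both a monoid and an abelian group. It assumes a normal distribution summarized by its count, sum, and sum of squares. The names `Moments`, `train1`, `trainMoments`, `inverse`, `untrain`, and `leaveOneFoldOut` are hypothetical and chosen only for illustration.

```haskell
-- Illustrative sketch only: every name below is hypothetical, not HLearn's API.

-- Sufficient statistics of a normal distribution: count, sum, sum of squares.
data Moments = Moments
  { count :: Double
  , sumX  :: Double
  , sumX2 :: Double
  } deriving (Show, Eq)

-- Merging two models adds their statistics componentwise.
instance Semigroup Moments where
  Moments n1 s1 q1 <> Moments n2 s2 q2 = Moments (n1 + n2) (s1 + s2) (q1 + q2)

-- The model trained on no data is the identity element.
instance Monoid Moments where
  mempty = Moments 0 0 0

-- Train on a single data point.
train1 :: Double -> Moments
train1 x = Moments 1 x (x * x)

-- Batch training merges the single-point models.
trainMoments :: [Double] -> Moments
trainMoments = foldMap train1

-- Group inverse: merging a model with its inverse gives mempty.
inverse :: Moments -> Moments
inverse (Moments n s q) = Moments (negate n) (negate s) (negate q)

-- "Untraining": remove previously seen data points from a model.
untrain :: Moments -> [Double] -> Moments
untrain m xs = m <> inverse (trainMoments xs)

-- Cross-validation training sets: train every fold once, then build each
-- "all folds but one" model by subtracting that fold from the total.
leaveOneFoldOut :: [[Double]] -> [Moments]
leaveOneFoldOut folds = [total <> inverse f | f <- foldModels]
  where
    foldModels = map trainMoments folds
    total      = mconcat foldModels
```

Loaded in GHCi, `untrain (trainMoments [1,2,3,4,5,6]) [4,5,6]` should give the same statistics as `trainMoments [1,2,3]`. The cross-validation helper trains each fold only once, which is where the speedup in the table comes from.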
This interpretation of machine learning is somewhat limiting in that not all models have obvious algebraic structure. But many important models do. Currently implemented models include:
Univariate distributions: exponential, log-normal, normal, kernel density estimator, binomial, categorical, geometric, Poisson
Multivariate distributions: normal, categorical, subset of Markov networks
Classifiers: naive Bayes, full Bayes, perceptron, bagging, boosting (sort of)
Univariate regression: exponential, logistic, power law, polynomial
NP-hard approximations: k-centers, bin packing, multiprocessor scheduling
Other: Markov chains, many nearest neighbor based algorithms with cover trees
Note: These models are not included in the latest Hackage releases: decision stumps/trees and k-nearest neighbor (kd-tree based).
Example: normal distribution
Every model in HLearn is trained from a data set using the function train. The type signature specifies which model we're training.
let dataset = [1,2,3,4,5,6]
let dist = train dataset :: Normal Double
We can train in parallel using the higher order function parallel. The GHC runtime automatically takes advantage of multiple cores on your computer. If you have 4 cores, training can run up to roughly 4x faster.
let dist' = parallel train dataset :: Normal Double
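The idea behind parallel training can be sketched without any HLearn code: split the data into chunks, train one model per chunk, and merge the partial models with the monoid operation. The sketch below is an assumption about the general technique, not HLearn's implementation of parallel. The names `chunksOf`, `chunkedTrain`, `CountSum`, and `trainCountSum` are hypothetical, and the toy model only tracks a count and a sum.

```haskell
import Data.Monoid (Sum (..))

-- Illustrative sketch only: these helpers are hypothetical, not HLearn's API.

-- Split a list into pieces of at most n elements.
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n <= 0    = [xs]
  | null xs   = []
  | otherwise = let (h, t) = splitAt n xs in h : chunksOf n t

-- Train one model per chunk, then merge the partial models with mconcat.
-- Each call to trainBatch is independent of the others, so this plain map
-- could be swapped for a parallel map (for example parMap from the
-- parallel package) without changing the result.
chunkedTrain :: Monoid model => Int -> ([a] -> model) -> [a] -> model
chunkedTrain n trainBatch = mconcat . map trainBatch . chunksOf n

-- A toy model: number of points and their sum, enough to recover the mean.
type CountSum = (Sum Int, Sum Double)

trainCountSum :: [Double] -> CountSum
trainCountSum xs = (Sum (length xs), Sum (sum xs))
```

Chunked training agrees with one-shot training only when the training function respects the monoid structure, that is, when training on a concatenated data set equals merging the models trained on its parts. That holds for `trainCountSum` here.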
We can also train in online mode, where you add data points to an already existing model using either the function add1dp or addBatch.
let dist_online1 = add1dp dist 7
let dist_online2 = addBatch dist [7,8,9,10]
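Online training falls out of the monoid structure: adding a point amounts to merging the existing model with a model trained on that point alone. The sketch below assumes a running summary of count, mean, and sum of squared deviations, merged with the pairwise update of Chan et al. It is not how HLearn defines add1dp or addBatch, and the names `Running`, `trainPoint`, `trainBatch`, `addPoint`, `addPoints`, and `variance` are hypothetical.

```haskell
-- Illustrative sketch only: HLearn's own add1dp and addBatch may be
-- defined differently. All names below are hypothetical.

-- Count, running mean, and sum of squared deviations from the mean (M2).
data Running = Running
  { rCount :: Double
  , rMean  :: Double
  , rM2    :: Double
  } deriving Show

-- Merge two summaries with the pairwise update of Chan et al.
instance Semigroup Running where
  a@(Running n1 m1 s1) <> b@(Running n2 m2 s2)
    | n1 == 0   = b
    | n2 == 0   = a
    | otherwise = Running n (m1 + delta * n2 / n)
                            (s1 + s2 + delta * delta * n1 * n2 / n)
    where
      n     = n1 + n2
      delta = m2 - m1

instance Monoid Running where
  mempty = Running 0 0 0

-- A model trained on a single point has no spread yet.
trainPoint :: Double -> Running
trainPoint x = Running 1 x 0

trainBatch :: [Double] -> Running
trainBatch = foldMap trainPoint

-- Online update with one point: merge in a single-point model.
addPoint :: Running -> Double -> Running
addPoint model x = model <> trainPoint x

-- Online update with several points: merge in a batch-trained model.
addPoints :: Running -> [Double] -> Running
addPoints model xs = model <> trainBatch xs

-- Sample variance (n - 1 in the denominator); meaningful when rCount > 1.
variance :: Running -> Double
variance (Running n _ s) = s / (n - 1)
```

Because this merge works on floating-point means, it is associative only up to rounding error, so online and batch results should be compared with a small tolerance rather than exact equality.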
Finally, once we've trained a model, we can do all the normal operations on it that we would expect. One common operation on distributions is evaluating the probability density function. We do this with the pdf function.
pdf dist 10
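For reference, the density being evaluated can be sketched directly from the data. The helpers below are hypothetical and independent of HLearn. They assume the fitted distribution uses the sample mean and the sample variance (n - 1 in the denominator), which may differ from what HLearn's Normal model does internally.

```haskell
-- Illustrative sketch only: hypothetical helpers, not HLearn's pdf.

-- Density of a normal distribution with the given mean and variance.
normalPdf :: Double -> Double -> Double -> Double
normalPdf mean var x =
  exp (negate ((x - mean) ^ (2 :: Int)) / (2 * var)) / sqrt (2 * pi * var)

-- Sample mean and sample variance of a data set.
-- Assumes at least two data points; the n - 1 denominator is an assumption.
meanAndVariance :: [Double] -> (Double, Double)
meanAndVariance xs = (mean, var)
  where
    n    = fromIntegral (length xs)
    mean = sum xs / n
    var  = sum [(x - mean) ^ (2 :: Int) | x <- xs] / (n - 1)

-- Fit a normal distribution to the data, then evaluate its density at a point.
fitPdf :: [Double] -> Double -> Double
fitPdf xs = normalPdf mean var
  where
    (mean, var) = meanAndVariance xs
```

With the data set from the example, `fitPdf [1,2,3,4,5,6] 10` plays the role of `pdf dist 10` under these assumptions.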
For more details on why the Normal distribution has algebraic structure and what we can do with it, see the blog post Normal distributions form a monoid and why machine learning experts should care.
There are three main sources of documentation. First, there are a number of tutorials on my personal blog. These provide the most detail and are geared towards the beginner. They're probably the easiest way to get started. Next, there are two papers about the internals of the HLearn library. They are a good resource for understanding the theory behind why the library works. Finally, there's the Hackage documentation.
Papers:
- TFP13 - HLearn: A Machine Learning Library for Haskell
- ICML13 - Algebraic Classifiers: a generic approach to fast cross-validation, online training, and parallel training
HLearn is under active development. At present, it is primarily a research tool. This means that the interfaces may change significantly in the future (but will definitely follow the PVP). I'm hoping HLearn will eventually become a stable package that will make it easy to incorporate machine learning techniques into Haskell programs.
Current development is focused on two areas. The first is implementing new models and their algebraic structures. Many unimplemented models have "trivial" algebraic structure, but for many successful models it is unknown whether they can have interesting structure. The second area is investigating new structures. Many models have Functor/Applicative/Monoid structure (or in some strict sense almost have these structures), and I'm working on how to exploit these structures.
Any comments / questions / pull requests are greatly appreciated!
The code for the "Faster Cover Trees" paper at ICML2015 is contained within this repo. It shouldn't be considered in a released state yet though. There is still some work left to clean it up and make it easy to use. The work should be finished before the actual conference.