SummerOfCodeIdeas

Yannis Mentekidis edited this page Feb 19, 2017 · 63 revisions

Ideas for Google Summer of Code 2017

This is a list of ideas compiled by mlpack developers; they range from simple code maintenance tasks to difficult machine learning algorithm implementations, which means that there are suitable ideas for a wide range of student abilities and interests. The "necessary knowledge" sections can often be replaced with "willing to learn" for the easier projects, and for some of the more difficult problems, a full understanding of the description statement and some coding knowledge is sufficient.

In addition, these are not all of the possible projects which could be undertaken; new, interesting ideas are also suitable.

At the bottom is a list of projects from previous years that aren't really applicable to this year, or that were already done.

For more information on any of these projects, visit #mlpack in freenode (IRC) or email mlpack@lists.mlpack.org (see also http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack). It is probably worth consulting the mailing list archives first; many of these projects have been discussed extensively in previous years. Also, it is probably not a great idea to contact mentors directly (this is why their emails are not given here); a public discussion on the mailing list is preferred, if you need help or more information.

If you are looking for how to get started with mlpack, see the GSoC page, the get involved page, and the tutorials and try to compile and run simple mlpack programs. Then, you could look at the list of issues on Github and maybe you can find an easy and interesting bug to solve.

For an application template, see the Application Guide.

Reinforcement learning

Back in 2015, DeepMind published their landmark Nature article on training deep neural networks to learn to play Atari games using raw pixels and a score function as inputs. Since then there has been a surge of interest in the capabilities of reinforcement learning using deep neural network policies; and honestly, who hasn't always wanted to play around with RL on their own domain-specific automation tasks? This project revitalizes that desire and combines it with recent developments in reinforcement learning to train neural networks to play some of the unforgettable games. In more detail, this project involves implementing different reinforcement learning methods over the course of the summer. A good project will select one or two algorithms and implement them (with tests and documentation). Note that while in principle the implementation of an agent that can play one of the Atari games is quite simple, this project concentrates more on recent ideas, e.g.:

  • Double DQN: The Double DQN algorithm was released by Google’s DeepMind group back in 2015. Since standard Q-learning is known to overestimate action values under certain conditions, Double DQN adds stability to learning by decoupling action selection from action evaluation in the TD-error target, making the target less prone to "jiggering".

  • Asynchronous Advantage Actor-Critic: The A3C algorithm was also released by Google’s DeepMind group, in 2016, and it made a splash by essentially obsoleting DQN. It was faster, simpler, more robust, and able to achieve much better scores on the standard battery of deep RL tasks.

  • Persistent Advantage Learning DQN: Google’s DeepMind group presented a novel RL exploration bonus based on an adaptation of count-based exploration for high-dimensional spaces. One of the main benefits is that the agent is able to recognize salient events and adjust its behavior efficiently.

The algorithms must be implemented according to mlpack's neural network interface so that they work with various layer and network structures. In addition, this project could possibly contain a research component -- benchmarking runtimes of different algorithms against other existing implementations.

breakout sample image sequence

Note: We already have written code to communicate with the OpenAI Gym toolkit. You can find the code here: https://github.com/zoq/gym_tcp_api

difficulty: 7/10

deliverable: Implemented algorithms and proof (via tests) that the algorithms work with mlpack's code base.

necessary knowledge: a working knowledge of neural networks and reinforcement learning, willingness to dive into some literature on the topic, basic C++

recommendations for preparing an application: To be able to work on this you should be familiar with the source code of mlpack. We suggest that anyone who would like to apply for this idea compile and explore the source code, especially the neural network code and the code used to communicate with the environment. If you have more time, try to review the documents linked below, and in your application provide comments/questions/ideas/tradeoffs/considerations based on your brainstorming.

relevant tickets: none are open at this time

references: Deep learning reading list, Deep learning bibliography, Playing Atari with deep reinforcement learning

potential mentor(s): Marcus Edel

Augmented Recurrent Neural Networks

Recurrent neural networks, such as the Long Short-Term Memory (LSTM) network, have proven to be powerful sequence learning models. However, one limitation of the LSTM architecture is that the number of parameters grows proportionally to the square of the size of the memory, making them unsuitable for problems requiring large amounts of long-term memory. Recent approaches that augment RNNs with external memory can exceed the capabilities of traditional LSTMs and learn algorithmic solutions to complex tasks.

turing machine sample

The animation was created using the Turing Machine Simulator; check it out, it's awesome.

Recently we’ve seen some exciting attempts to augment RNNs with external memory, such as:

  • Neural Turing Machine: The NTM combines an RNN with an external memory bank and two different attention mechanisms, allowing the model to learn many simple algorithms.

  • Neural Programmer: The Neural Programmer is another approach that uses external memory to learn to create programs in order to solve a task, without needing examples of correct programs.

  • Hierarchical Attentive Memory: HAM is somewhat different, as it uses a binary tree with leaves corresponding to memory cells, allowing it to perform memory access in Θ(log n) time.

The underlying mlpack codebase could be extended to provide an efficient implementation for these kinds of problems. A good project will select one or two algorithms and implement them (with tests and documentation) over the course of the summer. Note that this project is more research/algorithm oriented, but definitely requires some implementation to evaluate the ideas. In addition, this project could possibly contain another research component -- benchmarking runtimes of different algorithms against other existing implementations.

difficulty: 8/10

deliverable: Implemented algorithms and proof (via tests) that the algorithms work with mlpack's code base.

necessary knowledge: a working knowledge of neural networks, willingness to dive into some literature on the topic, basic C++

recommendations for preparing an application: To be able to work on this you should be familiar with the source code of mlpack. We suggest that anyone who would like to apply for this idea try to compile and explore the source code, especially the neural network code. If you have more time, try to review the documents linked below, and in your application provide comments/questions/ideas/tradeoffs/considerations based on your brainstorming.

relevant tickets: none are open at this time

references: Deep learning reading list, Deep learning bibliography, Neural Turing Machine, Neural programmer, Neural GPU, Hierarchical Attentive Memory

potential mentor(s): Marcus Edel

Cross-validation and hyper-parameter tuning infrastructure

description: most data science tasks are not as simple as plugging in some data to a learner and taking the result. Typically, learners will have hyperparameters to tune, and users will also want to perform some sort of cross-validation to make sure the technique works well. In the very distant past (~2008-2010), mlpack did have a cross-validation module, but it had to be removed. Now, it would be a good time to re-add this support. For this project, it would be up to you to observe the current mlpack API, devise a design for cross-validation and/or hyper-parameter tuning modules, and then implement them.

Note that most mlpack learners have a very similar API, with the following training function:

template<typename MatType>
void Train(MatType& data,
           arma::Row<size_t>& labels,
           const X param1 = default, const X param2 = default, const X param3 = default, ...);

This makes it easy for us to create a CrossValidation module that can split a training set into some number of folds and evaluate some average measure (accuracy, error, AUC, and others) over all folds. The API for this could look similar to the idea below:

// Run 10-fold cross-validation on the DecisionTree.
CrossValidation<DecisionTree<>, Accuracy> cv(trainingSet, trainingLabels);
const double averageAccuracy = cv.Evaluate(10);

Still, there is extra complexity here---this example glosses over how the user will need to pass non-default training parameters for cross-validation. That is a detail that will need to be worked out in the proposal. For the hyper-parameter tuner, we wish to try a number of different values for the training parameters. I can imagine an API something like this:

/**
 * Assume DecisionTree::Train() takes data, labels, then a size_t and double parameters:
 *   Train(MatType, arma::Row<size_t>, size_t, double).
 */

// The template parameters specify the type of model, and its parameters.
HyperParameterTuner<DecisionTree<>, size_t, double> tuner(trainingData, trainingLabels, testData, testLabels);

// Now we call Optimize() with the ranges to use for each parameter.
tuner.Optimize<GridSearchOptimizer<50 /* 50 bins for each dimension */>>(
    Range<size_t>(1, 50), // range for size_t parameter
    Range<double>(0.1, 0.5)); // range for double parameter

// Get the best decision tree.
DecisionTree<>& best = tuner.BestModel();
// Get the best parameters.
std::tuple<size_t, double> bestParams = tuner.BestParams();

Note that this is just an idea---there are many details left to be figured out, such as the measure to optimize for, how to handle regression models vs. classification models vs. other types of models, and so forth. Depending on the complexity of the proposal, it's possible that a single project might handle only the cross-validation module or the hyper-parameter tuner, instead of both.

difficulty: 6/10

deliverable: working cross-validation and/or hyper-parameter tuning module

necessary knowledge: basic data science concepts, good familiarity with C++ and template metaprogramming (since that will probably be necessary), familiarity with mlpack methods API (see src/mlpack/methods/*/)

relevant ticket(s): none are open at this time

potential mentor(s): Ryan Curtin

recommendations for preparing an application: This project can't really be designed on-the-fly, so a good proposal will have already gone through the existing codebase and identified what parts of the API will need to change (if any), and what the API for the hyper-parameter tuner and cross-validation modules might look like.

Build testing with Docker and VMs

description: mlpack has an automatic build server (Jenkins) running at http://masterblaster.mlpack.org/. This build server is very powerful---it has 72 cores and 256GB RAM. However, it is in need of further configuration! Right now, there is the git commit build and the nightly matrix build, but there are some issues that should be addressed:

  • the build matrix is incomplete---it doesn't test different compilers, different architectures, different Boost versions
  • there is no testing on OS X
  • there is no testing on Windows

The idea behind this project would be to fix these issues. For the OS X and Windows build, a virtual machine could be used---or I can even supply physical systems with network connections that could be used. But the real focus of the project should be the matrix build, which is helpful in diagnosing hard-to-find issues that arise under different configurations. For the matrix build, we can build a number of Docker containers, each with a different set of libraries and compilers, that can be used to build mlpack and run the tests.

The completed project would include a running OS X and Windows build, plus a series of Docker containers used by the matrix build. Ideally, the Dockerfiles should be automatically generated, so that if, e.g., a new Armadillo version comes out, we can easily generate a new Docker container that can be used.

difficulty: 4/10

deliverable: Windows/OS X VMs, plus Docker containers for the matrix build, all integrated into Jenkins builds

necessary knowledge: some Windows/OS X familiarity, some C++ (for debugging compilation errors), Docker and Jenkins familiarity

relevant tickets: none are open at this time

potential mentor(s): Ryan Curtin

Profiling for further optimization

description: mlpack could run even faster if it used profiling information during compilation. This entails adding extra build steps to the build process: first, to run a subset of mlpack programs with profiling information, and second, to rebuild mlpack using that profiling information. The difficulty here is choosing datasets which reflect general characteristics of the datasets which users will run mlpack methods with. If an unusual dataset is chosen, then mlpack will be compiled to run very quickly on that type of dataset -- but it will not run as well on other datasets. Another issue here is that the profiling process should not take extremely long (that is, longer than an hour or so), so we cannot choose a huge variety of large datasets.

deliverable: a 'make profile' build option which performs the profiling as described above

difficulty: 6/10

necessary knowledge: some machine learning familiarity, some CMake scripting

relevant tickets: #48

potential mentor(s): Ryan Curtin

Fast k-centers Algorithm & Implementation

description: The k-centers problem is a fundamental part of many machine learning algorithms (See Cortes & Scott, "Scalable Sparse Approximation of a Sample Mean", 2013 for a recent use of it). The basic problem is: given a set of points and integer k, choose k points that minimize the maximum distance between a point in the set and its nearest neighbor in the set. The ideas underlying the Dual-Tree Boruvka MST algorithm in mlpack could probably be extended to provide an efficient implementation for this problem. This one is more research / algorithm oriented, but would definitely require some implementation to evaluate the idea.

deliverable: working, tested k-centers implementation; paper comparing it to basic Gonzales algorithm and other implementations

difficulty: 7/10, but biased by being really related to Bill's thesis work

necessary knowledge: some understanding of geometry and spatial data structures, willingness to dive into some literature on the topic, basic C++

relevant tickets: none open at this time

potential mentor(s): Ryan Curtin, (possibly) Bill March

Improvement of tree traversers

description: Many of mlpack's machine learning methods are dual-tree algorithms. These dual-tree algorithms are abstracted in such a way that they all use the same tree traversal methods. If these traversers could be improved, this would result in runtime gains for all tree-based mlpack methods. This requires a lot of abstract thought in very weird ways and often debugging and profiling tree traversers is fraught with sadness.

deliverable: demonstrably improved runtime for tree-based methods

difficulty: 9/10

necessary knowledge: very in-depth C++ knowledge and understanding of tree structures; familiarity with machine learning methods is helpful

relevant tickets: #235

potential mentor(s): Ryan Curtin

LMNN with LRSDP implementation

description: mlpack has a working LRSDP (low-rank semidefinite program) implementation, which gives faster solutions to SDPs. This could give good speedups for existing SDP-based machine learning methods, such as LMNN (large margin nearest neighbor). This project should only entail the expression of LMNN as a low-rank SDP and then the implementation should be straightforward because the mlpack LRSDP API is straightforward. The difficulty is still quite high, though, because debugging LRSDPs is complex.
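
As a reminder of the optimization problem involved, the standard LMNN SDP (following Weinberger and Saul) can be written roughly as follows, where $d_M(x, y) = (x - y)^T M (x - y)$, $j \rightsquigarrow i$ ranges over target neighbors of $x_i$, and $l$ ranges over differently-labeled impostors:

```latex
\min_{M \succeq 0} \;
  \sum_{j \rightsquigarrow i} d_M(x_i, x_j) + c \sum_{i, j, l} \xi_{ijl}
\quad \text{s.t.} \quad
  d_M(x_i, x_l) - d_M(x_i, x_j) \ge 1 - \xi_{ijl},
  \qquad \xi_{ijl} \ge 0.
```

The low-rank substitution $M = L^T L$ removes the explicit positive-semidefinite constraint, which is exactly the reformulation mlpack's LRSDP solver expects; expressing the objective and constraints in that form is the main work of this project.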

deliverable: working, tested LMNN implementation with extensive documentation

difficulty: 9/10

necessary knowledge: in-depth C++ knowledge, understanding of semidefinite programs; familiarity with LMNN is helpful

relevant tickets: none open at this time

potential mentor(s): Ryan Curtin

Fixes to MVU and low-rank semidefinite programs

description: This project is not for the faint of heart. For some time now, MVU (maximum variance unfolding), a dimensionality reduction technique, has not been converging even on simple datasets. mlpack's implementation of MVU uses LRSDP (low-rank semidefinite programs); there is not another existing implementation of MVU+LRSDP. Many tears will be shed trying to debug this. A good approach will probably be to compare mlpack MVU results on exceedingly simple problems with other MVU implementation results. The final outcome of this project may not even be a successful converging algorithm but instead more information on what is going wrong.

deliverable: working MVU implementation, or, further details and information on the problem

difficulty: 10/10

necessary knowledge: understanding of convexity and duality, knowledge of semidefinite programs, incredible determination and perseverance

potential mentor(s): Ryan Curtin

Parallel stochastic optimization methods

description: Recently, with the prevalence of multicore machines, there has been a large interest in parallelizing stochastic gradient methods such as SGD and SCD. While in principle these parallel implementations are quite simple, in practice there are lots of subtle details that need to be handled. Examples include, how do I anneal my step size, how much coordination/synchronization should I use, how should I partition my data in memory for good cache locality, do I solve the primal/dual, etc. This project would implement parallel SGD/SCD, and explore some of these design decisions.

deliverable: parallel implementations of stochastic gradient descent and stochastic coordinate descent, including both test cases and possibly an empirical analysis of the convergence behavior on various datasets for the issues discussed above.

difficulty: 5/10

necessary knowledge: C++, experience with writing multithreaded programs, some background in convex optimization would also be helpful

potential mentor(s): Ryan Curtin

Essential Deep Learning Modules

description: In the past years, deep learning has attracted a lot of attention in the machine learning community for its ability to learn features that allow high performance in a variety of tasks. For example, DeepMind has shown that neural networks can learn to play Atari games just by observing large amounts of images, without being trained explicitly on how to play games. This project involves implementing essential building blocks of deep learning algorithms based on the existing neural network codebase. A good project will select some of the architectures and implement them (with tests and documentation) over the course of the summer. These could include Restricted Boltzmann Machines (RBM) and training algorithms, Deep Belief Networks (DBN), Radial Basis Function Networks (RBFN), Bidirectional Recurrent Networks (BRN), and Generative Adversarial Networks (GAN). The architecture should be designed to build a foundation for integrating many more models, including support for other state-of-the-art deep learning techniques. Note that this project aims to revisit some of the traditional models from a more modern perspective.

deliverable: Implemented deep learning modules and proof (via tests) that the code works.

difficulty: 5/10

necessary knowledge: a working knowledge of what neural networks are, willingness to dive into some literature on the topic, basic C++

recommendations for preparing an application: Being familiar with the mlpack codebase, especially with existing neural network code is the first step if you wish to take up this task. We suggest that you build mlpack on your system and explore the functionalities. Take a look at the different layers and basic network structures. When you prepare your application, provide some comments/ideas/tradeoffs/considerations about your decision process, when choosing the models you want to implement over the summer.

relevant tickets: none open at this time

references: Deep learning reading list, Deep learning bibliography

potential mentor(s): Marcus Edel

Alternatives to neighborhood-based collaborative filtering

description: The past two years in Summer of Code have seen the addition of a highly flexible collaborative filtering framework to mlpack. This framework offers numerous different types of matrix factorizations, and is likely the most flexible package for this (with respect to matrix factorizations). However, the implementation uses k-nearest-neighbors to select its recommendations, whereas there are more options. This project entails the investigation of alternatives (for example, weighted nearest neighbors and regression techniques) and the implementation of these alternatives in a flexible manner.

deliverable: Implemented alternatives to k-NN for recommendation selection

difficulty: 7/10

necessary knowledge: experience with C++ and templates, knowledge of recommendation systems, willingness to read the literature

relevant tickets: #406

references: this paper describes an alternative to k-NN recommendation selection: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.4662&rep=rep1&type=pdf

potential mentor(s): Ryan Curtin, Sumedh Ghaisas

Low rank/sparse optimization using Frank-Wolfe

description: The past decade has seen an explosion of work focusing on underdetermined linear inverse problems, where the solution is assumed to have some sparse structure. Compressed sensing, matrix recovery, lasso, etc. all fall under this framework. Many specialized algorithms have been developed for particular problem instances. Recently, researchers have realized that many of these algorithms are actually instances of the same Frank-Wolfe algorithm. The Frank-Wolfe algorithm is an old algorithm for optimizing a general convex function over a compact set. Martin Jaggi, in this awesome paper, showed how algorithms such as orthogonal matching pursuit and various low-rank matrix recovery algorithms are simply instances of Frank-Wolfe, and proved general O(1/k) convergence rates. This is really neat stuff, and it would be great if mlpack provided a solid implementation of this framework.

deliverable: An implementation of Frank-Wolfe for the various atomic norms listed in Jaggi's paper, complete with test cases. The implementation should be general enough that we can recover OMP by simply setting a few template parameters.

difficulty: 7/10

necessary knowledge: Some background in convex optimization would be very useful. While the actual algorithms are very simple, to really understand what is going on (which helps in debugging, for instance) requires some level of familiarity with the math.

relevant tickets: None at this time

potential mentor(s): Ryan Curtin

Better benchmarking

description: Back in 2013, Marcus Edel built a very useful benchmarking system, found at https://github.com/zoq/benchmarks. This has been useful for comparing mlpack's performance to other libraries, and gives us nice graphs like this one:

knn benchmarks

Although we as mlpack developers can use the benchmarking system and it's been very useful for us, it's less useful to the general public unless we compare thoroughly against other current implementations. There are many mlpack methods that have been added since the system was built but are not benchmarked---some examples include the decision tree implementation, the Hoeffding tree implementation, parts of the ANN code, the DBSCAN implementation, and more.

There are a couple of different ideas to this project:

  • Select some mlpack algorithms that aren't currently well benchmarked, and add benchmarking scripts for mlpack's implementations and other implementations. At the end of the summer, we could have several images like the one above that show the speed of mlpack's implementations.

  • Select a single mlpack algorithm and focus on rigorously benchmarking that algorithm, and improving its runtime so that it is the fastest of all the implementations that are being compared.

Even a combination of those two ideas could be useful.

For more information on the benchmarking project, see https://github.com/zoq/benchmarks/wiki/Google-Summer-of-Code-2014-:-Improvement-of-Automatic-Benchmarking-System.

deliverable: benchmarks for mlpack algorithms and other implementations, with finished comparisons of runtimes or other performance measures

difficulty: 5/10

necessary knowledge: must be familiar with the machine learning algorithms that are being benchmarked, in order to be able to produce meaningful and relevant comparisons

relevant tickets: none at this time

potential mentor(s): Ryan Curtin

Profiling for Parallelization

Efficiency and scalability are very important to the users of mlpack.

Some of our algorithms, such as the DTree (methods/det) and LSH (methods/lsh), already employ parallelism through OpenMP to speed things up without complicating the codebase too much. Unfortunately, there are still a lot of algorithms implemented in mlpack that currently only utilize a single core.

For this project, you will explore the mlpack codebase, and identify algorithms that you can parallelize with OpenMP so we can take advantage of the multiple cores most machines have.

There are multiple ways mlpack could employ parallelism. Some of these are:

  • Replacing the embarrassingly parallel sections, for example code that performs the same computation for multiple data points, with parallel code.
  • Substituting the algorithm itself with a parallel version, that works for 1 or more threads (since some users might still want to use the singlethreaded code).
  • Identifying and parallelizing bottlenecks in the core libraries of mlpack - this might offer a significant performance boost for many algorithms; however, it may be difficult to test, and might end up being underwhelming in most situations, since mlpack relies on OpenBLAS (under Armadillo) for all our matrix computations. (OpenBLAS is already multi-threaded.)

Furthermore, another contribution could be identifying thread-unsafe sections of code in the core libraries and modifying them to be thread-safe. This would have a significant impact for future developers, since it would be easier for them to write parallel code (once the sequential version has been thoroughly debugged).

difficulty: 5/10 - 9/10, depending on the chosen approach(es).

deliverable: Implemented parallel versions of the selected algorithms in OpenMP. Correctness tests. Benchmarking of the parallel algorithm versions for comparison with other libraries and past performance. You can take a look at the Better benchmarking project description and https://github.com/zoq/benchmarks for more information on the benchmarking tool mlpack uses.

necessary knowledge: Methods for profiling source code. C++ working knowledge. Parallelization techniques (preferably OpenMP).

relevant tickets: #173

potential mentor(s): Yannis Mentekidis

recommendations for preparing an application: The project description is vague. A good proposal will solidify some of the ideas, and propose both what parts of the code base need changing and what needs to be added to the API - as mentioned, users might not want to employ this feature.

Old ideas for GSoC 2015/2016

Implement tree types

description: mlpack focuses heavily on dual-tree algorithms, which each rely on a type of space tree (data structure). These tree types include kd-trees, ball trees, cover trees, vantage point trees, R trees, R* trees, metric trees, principal axis trees, k-means trees, random projection trees, Bregman ball trees, UB trees, R+ trees, Hilbert trees, X trees, segment trees, interval trees, range trees, and others. However, mlpack only implements kd-trees, cover trees, ball trees, R trees, and R* trees. This project involves implementing other tree types for mlpack. A good project will select a handful of tree types and implement them (with tests and documentation) over the course of the summer. The trees must be implemented according to this example interface so that they work with any type of dual-tree algorithm. In addition, this project could possibly contain a research component -- benchmarking runtimes of dual-tree algorithms when different types of trees are used.

difficulty: 5/10

deliverable: Implemented tree types and proof (via tests) that the tree types work with mlpack's dual-tree algorithms

necessary knowledge: Data structures, C++, knowledge of memory management

relevant tickets: #184, #275, #272, #227, #228, #289

potential mentor(s): Ryan Curtin

why not this year? The past several years have seen the implementation of many types of trees---there are few left. If you are still interested, you are welcome to create an application for this project, but you will have to find some new types of trees that mlpack does not implement. The list above is now inaccurate.

Automatic bindings

description: A better alternative to writing bindings by hand would be to have some script which processed machine learning method information and automatically generated bindings for MATLAB, Python, R, and perhaps other languages. SWIG, unfortunately, is not really an option for this. This project is somewhat exploratory and may not end up producing results if nothing is available, so this project will be research-intensive. In all likelihood the project will involve building a binding generator which can take in mlpack executable code (such as src/mlpack/methods/neighbor_search/allknn_main.cpp) and can produce a binding to another language that gives the same options as the command-line interface.

deliverable: automatic binding generators for mlpack methods which are easy to maintain

difficulty: 6/10

necessary knowledge: a working knowledge of what bindings are and a good ability to do research on software

relevant tickets: #202

why not this year? Automatic bindings are (mostly) implemented now. There is another project for adding more languages to the automatic bindings---check that one out instead.

Approximate Nearest Neighbor Search

description: mlpack provides an extensible, flexible exact nearest neighbor search implementation in the src/mlpack/methods/neighbor_search/ directory. This is a state-of-the-art implementation, with the ability to do dual-tree nearest neighbor search instead of single-tree nearest neighbor search, like other packages (i.e. FLANN, ANN, scikit-learn) do. However, there is currently no support for approximate nearest neighbor search, despite the popularity of approximate nearest neighbor search. This is a relatively simple extension to the existing NeighborSearch code. This project would have two components: the first component would be the implementation of approximate nearest neighbor search and either modification of the existing mlpack_allknn program or implementation of a new approximate nearest neighbor search command-line program. The second component would be working with the benchmarking system (https://github.com/zoq/benchmarks/) in order to produce rigorous comparisons between mlpack's approximate nearest neighbor search implementation and other libraries, such as FLANN, ANN, or LSHKIT (and even mlpack's LSH implementation). Depending on the robustness of the comparisons, this could result in a nice publication exploring the empirical performance of approximate nearest neighbor search schemes as implemented by existing machine learning libraries.

The figure below is some output from the existing benchmarking system on exact nearest neighbor search (from http://mlpack.org/benchmarks.html ); ideally, this project should result in some similar nice figures.

mlpack kNN benchmarks

deliverable: working approximate kNN code with tests, benchmarking scripts, benchmark comparison numbers and figures

difficulty: 6/10

necessary knowledge: C++, Python, background on nearest neighbor search techniques

relevant tickets: (have not filled this out yet)

potential mentor(s): Ryan Curtin, Bill March

why not this year? This was already done by Marcos Pividori as part of GSoC 2016.

Decision trees

description: During Google Summer of Code 2014, Udit Saxena implemented decision stumps as part of his AdaBoost project. More recently, an implementation of streaming decision trees (Hoeffding trees) has been added. What would complement these well would be to extend the implementation of decision stumps to full-blown decision trees (like ID3, CART, or C4.5). This could be done in one of several ways: adapting the existing decision stump code, writing a new class, or adapting the tree construction algorithm in density estimation trees (src/mlpack/methods/det/) to handle arbitrary loss functions. Other ideas might include implementing streaming Mondrian trees (http://papers.nips.cc/paper/5234-mondrian-forests-efficient-online-random-forests.pdf), but the overall goal here is to boost the number of classification techniques that mlpack has available.
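As a rough illustration of what extending a stump to a full tree involves, the recursion just reapplies the stump's best-split search at every node. This is a generic ID3-style sketch in NumPy, not mlpack's decision stump API:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Exhaustively search (dimension, threshold) for maximum information gain."""
    best = (0.0, None, None)  # (gain, dim, threshold)
    for dim in range(X.shape[1]):
        for t in np.unique(X[:, dim])[:-1]:  # exclude max so no empty child
            left, right = y[X[:, dim] <= t], y[X[:, dim] > t]
            gain = entropy(y) - (len(left) * entropy(left)
                                 + len(right) * entropy(right)) / len(y)
            if gain > best[0]:
                best = (gain, dim, t)
    return best

def build(X, y, depth=0, max_depth=5):
    gain, dim, t = best_split(X, y)
    if gain <= 0 or depth == max_depth:
        values, counts = np.unique(y, return_counts=True)
        return ('leaf', values[np.argmax(counts)])  # majority-vote leaf
    mask = X[:, dim] <= t
    return ('split', dim, t,
            build(X[mask], y[mask], depth + 1, max_depth),
            build(X[~mask], y[~mask], depth + 1, max_depth))

def predict(node, x):
    while node[0] == 'split':
        _, dim, t, left, right = node
        node = left if x[dim] <= t else right
    return node[1]
```

The interesting design question for the project is exactly where this recursion should live: in an extended stump class, a new tree class, or the refactored density estimation tree builder with a pluggable loss.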

deliverables: implementation of decision trees or similar algorithms, with tests and benchmarks comparing against other libraries

difficulty: 5/10

relevant tickets: none open at this time

potential mentor(s): Parikshit Ram, Nick Vasiloglou, Ryan Curtin

recommendations for preparing an application: A good candidate should be familiar with all of the decision tree and density estimation tree functionality in mlpack, in order to make a good decision about how it could be refactored or integrated into one class (or if that is even possible). API design is one of the most important things here, so the proposal should definitely have a section focusing on the API of the finished project.

why not this year? Decision trees are now implemented in mlpack.

Neuroevolution algorithms

description: The Nintendo Entertainment System (NES) may not have been the first gaming console most of us ever played, but with its unforgettable library of games it was certainly an iconic part of many of our childhoods. Countless hours were spent in heated Mario Bros. matches, or in battles against Dr. Wily's destructive robots in the Mega Man series. And who could forget their first foray into the magic world of The Legend of Zelda? This project revitalizes this passion and combines it with recent developments in evolutionary algorithms to train artificial neural networks to play some of these unforgettable games. In more detail, this project involves implementing different neuro-evolution algorithms applied to NES games for mlpack. A good project will select one or two algorithms and implement them (with tests and documentation) over the course of the summer, e.g.:

  • CNE: simple weight evolution over a topologically static neural network.
  • CMA-ES: Covariance Matrix Adaptation Evolutionary Strategy.
  • NEAT: evolution of weights and topology from simple initial structures.
  • HyperNEAT: indirect encoding of network weights, using Compositional Pattern Producing Networks.

The neuro-evolution algorithms must be implemented according to mlpack's neural network interface so that they work with various layer and network structures. In addition, this project could possibly contain a research component -- benchmarking runtimes of the different algorithms against other existing implementations.
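To give a flavor of the simplest item above, conventional neuroevolution (CNE) just mutates and selects the flat weight vector of a fixed-topology network. The NumPy sketch below evolves a tiny 2-2-1 network on XOR; the network shape, population size, and mutation scale are arbitrary illustrative choices, and this does not use mlpack's network classes:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(weights, x):
    """Fixed-topology 2-2-1 network; weights is a flat vector of 9 values."""
    w1, b1 = weights[:4].reshape(2, 2), weights[4:6]
    w2, b2 = weights[6:8], weights[8]
    h = np.tanh(x @ w1 + b1)
    return np.tanh(h @ w2 + b2)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])  # XOR, encoded in [-1, 1]

def fitness(weights):
    return -np.mean((forward(weights, X) - y) ** 2)  # higher is better

pop = rng.normal(size=(50, 9))
for generation in range(100):
    scores = np.array([fitness(w) for w in pop])
    elite = pop[np.argsort(scores)[-10:]]           # keep the 10 best (elitism)
    children = elite[rng.integers(0, 10, size=40)]  # clone random elites
    children = children + rng.normal(scale=0.3, size=children.shape)  # mutate
    pop = np.vstack([elite, children])

best = max(pop, key=fitness)
```

With elitism the best fitness is non-decreasing across generations, which is exactly the kind of property the project's tests would want to verify for each implemented algorithm; NEAT and HyperNEAT additionally evolve the topology rather than only the weights.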

Image of emulated Mario Bros

Note: We have set up an NES emulator (FCEUX) system. We have also written some code to communicate remotely with the NES emulator, which allows a user to initiate a connection from their machine to our machine and open a port back to itself over this connection to, e.g., get the current frame as a matrix, get the position, or set input. We can also obtain a system that can be used to train the neural networks. You can find the code used to communicate with the NES emulator here: https://github.com/zoq/nes

difficulty: 7/10

deliverable: Implemented neuro-evolution algorithms and proof (via tests) that the algorithms work with mlpack's neural network structure.

necessary knowledge: a working knowledge of what neural networks are, willingness to dive into some literature on the topic, basic C++

recommendations for preparing an application: To be able to work on this you should be familiar with the source code of mlpack. We suggest that everyone who wishes to apply for this idea compile mlpack and explore the source code, especially the neural network code and the code used to communicate with the emulator. Students should at least go through the existing neural network code to get a good idea of how much work each algorithm would involve. If you have more time, try to review the documents linked below, and in your application provide comments/questions/ideas/tradeoffs/considerations based on your brainstorming.

relevant tickets: #412, #413, #414, #555

references: Deep learning reading list, Deep learning bibliography, HyperNEAT-GGP, Evolving Neural Networks through Augmenting Topologies

potential mentor(s): Marcus Edel

why not this year? This was already done by Bang Liu as part of GSoC 2016. However if you are still interested, you are welcome to create an application for this project since there are still some really interesting ideas that aren't implemented.

Dataset and experimentation tools

description: I would say that 90% of any machine learning problem is getting the data into a workable format. Real-world data is noisy, has missing values, is encoded improperly, is in weird formats, and has all kinds of other problems. Diagnosing and fixing these issues, and then preparing the data for input into a machine learning system, is often the most time-consuming part of the process. Usually, this is done with a conglomeration of Python tools (like pandas for instance), but it would be useful if mlpack users did not need to preprocess their data in other software tools before putting it into mlpack. Thus, there are some useful pieces of functionality that could be added to mlpack:

  • checking a dataset for loading problems and printing errors
  • imputation strategies for missing variables
  • splitting a dataset into a training and test set
  • converting categorical features into binary features (or numeric features)

There should be a C++ API for these tools, but also a command-line program that could be used. Below is an example gif of what the interface might be able to do; this can be used as inspiration for a project proposal.

Example gif of the dataset tools interface
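The functionality listed above maps onto fairly small primitives; a minimal NumPy sketch of three of them (mean imputation, train/test splitting, and one-hot encoding) follows. These are illustrative only and are not the eventual C++ API:

```python
import numpy as np

def impute_mean(X):
    """Replace NaN entries with the mean of their column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X

def train_test_split(X, test_fraction=0.2, seed=0):
    """Shuffle the rows of X and split off a held-out test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * (1 - test_fraction))
    return X[idx[:cut]], X[idx[cut:]]

def one_hot(column):
    """Convert a categorical column into binary indicator features."""
    categories = np.unique(column)
    return (column[:, None] == categories[None, :]).astype(float), categories
```

The C++ versions would presumably operate on Armadillo matrices and be selectable from a single preprocessing command-line program, but the point is that each individual tool is simple; the work is in good API design, error reporting, and documentation.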

deliverable: command-line program for dataset tools and C++ API, plus extensive documentation and tutorials on how to use it

difficulty: 3/10

necessary knowledge: C++, preferably familiarity with data science workflows and tools

relevant tickets: none at this time

potential mentor(s): Ryan Curtin

recommendations for preparing an application: This is an open-ended project and a student should identify in their proposal which problems they intend to solve and clearly illustrate how they intend to solve them. Should this project be completed, it will likely see a high amount of use, so clear and comprehensive documentation is necessary. That means that the project proposal should be equally clear on how each of these goals should be accomplished. There is room for flexibility here, so, if the vision outlined above does not mesh with a vision that you have for this project, that's okay; let's discuss ideas!

why not this year? Keon Kim did this project last year. However, the mlpack_check_data tool was not implemented, so if you have a good idea for how to do that, feel free to bring it up for discussion!

We need to go deeper - GoogLeNet

description: Convolutional neural networks are the state-of-the-art algorithms for image classification. So it's not surprising that almost everyone used some form of convolutional net for the annually held ImageNet competition. In 2012, it was won by DNNResearch using the convolutional neural network approach described in the paper by Krizhevsky et al. Last year the team named GoogLeNet placed first in the classification and detection challenges. The GoogLeNet team used a deep learning architecture that is more convolutional than previous network architectures and that can increase the depth and width of the network without significantly increasing the number of parameters, using a new module called inception. This project would involve implementing the components of the GoogLeNet architecture and afterwards evaluating the network on a smaller subset of the ImageNet data. While in principle the basic implementation of GoogLeNet's main module, the inception layer, is quite simple, there are many fun spinoffs which have to be covered, such as:

  • Addressing the problem of generating possible object locations for use in the object recognition step. The GoogLeNet team used the Selective Search algorithm, which combines the strengths of both exhaustive search and segmentation, in combination with the multi-box approach.
  • Addressing the initialization of the network parameters. Since GoogLeNet contains over 100 individual layers (building blocks), model initialization is tricky: bad initializations would produce no gradient for the learning process.
  • Addressing the special network structure, which isn't linear at all: auxiliary classifiers are connected to the intermediate layers and are also backpropagated through, providing additional regularization.

This project involves some really neat methods, and it would be great if mlpack provided a solid implementation of this architecture including a method to generate possible object locations.
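For intuition, the heart of the inception module is a set of parallel convolution branches whose outputs are concatenated along the channel axis. The NumPy sketch below shows only the 1x1 branches (the 3x3/5x5 convolutions and the pooling path are elided), and is an illustration rather than mlpack's layer API:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x is (in_channels, height, width),
    w is (out_channels, in_channels); mixes channels per pixel."""
    return np.tensordot(w, x, axes=([1], [0]))

def inception(x, branch_weights):
    """Apply each branch's 1x1 convolution and concatenate along channels."""
    return np.concatenate([conv1x1(x, w) for w in branch_weights], axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8, 8))                    # 16 input channels, 8x8 map
branches = [rng.normal(size=(c, 16)) for c in (4, 8, 2)]
out = inception(x, branches)                       # 4 + 8 + 2 = 14 channels
```

The 1x1 "reduction" convolutions shown here are also what keeps the parameter count down in the real module: they shrink the channel dimension before the expensive 3x3 and 5x5 convolutions are applied.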

deliverable: Implemented GoogLeNet architecture including a method to generate possible object locations such as selective search and proof (via tests) that the algorithms work with mlpack's neural network structure.

difficulty: 8/10

necessary knowledge: Some background in deep learning would be very useful. While the actual algorithms are simple, to really understand what is going on requires some level of familiarity with the math. But detailed knowledge of the methods is not necessary - only a willingness to read some documentation and papers.

recommendations for preparing an application: Being familiar with the mlpack codebase is important if you like to take up this idea. We suggest that you build the library and explore the functionality that needs to be touched during the summer. Another step would be to review the documents below, and to provide comments/questions/ideas/tradeoffs/considerations in your application and outline the primary steps you would follow.

relevant tickets: #412, #413, #414, #555

references:

potential mentor(s): Marcus Edel

why not this year? This was already done by Nilay Jain as part of GSoC 2016.

Old ideas for GSoC 2013

Refinement of MATLAB bindings

description: Currently, there are MATLAB bindings contributed by Patrick Mason, but they are not comprehensive; there are some mlpack methods which do not have MATLAB bindings. This project entails developing the rest of the mlpack MATLAB bindings and standardizing them. Detailed knowledge of the machine learning methods is not necessary for this project -- only a willingness to read some documentation. Regardless, someone working on this project can expect to, over the course of the project, become at least somewhat familiar with state-of-the-art machine learning methods and how they work. Each binding consists of a MATLAB script which provides documentation and handles input, and a MEX file which passes the input to mlpack methods.

deliverable: MATLAB bindings for all mlpack methods (about twenty) with standardized APIs

difficulty: 2/10

necessary knowledge: MATLAB familiarity and some C++

relevant tickets: #243, #202

why not this year? Realistically, the automatic bindings project is far more useful and interesting, and reduces the maintenance load of bindings significantly.

Python bindings and R bindings for mlpack

description: Similar to the MATLAB bindings, Python and R bindings would be useful for mlpack users who don't know C++ or would prefer to use Python or R to do large-scale machine learning. This project entails developing bindings for Python and R which have a consistent API and documentation. Detailed knowledge of the machine learning methods is not necessary -- only a willingness to read some documentation. Regardless, someone working on this project can expect to, over the course of the project, become at least somewhat familiar with state-of-the-art machine learning methods and how they work. Because no previous work has been done on these bindings, an investigation into options for bindings and the necessary CMake configuration to build the scripts will also be required.

deliverable: Python bindings or R bindings (or both) for all mlpack methods (about twenty) with standardized APIs

difficulty: 3/10

necessary knowledge: Python knowledge, R knowledge, and some C++; CMake knowledge would be helpful but is not necessary

relevant tickets: #202

why not this year? Like the MATLAB bindings, this project is superseded by automatic bindings.

Automatic benchmarking of mlpack methods

description: For widespread adoption of mlpack to happen, it is very important that relevant and up-to-date benchmarks are available. This project entails writing support scripts which will run mlpack methods on a variety of datasets (from small to large) and produce runtime numbers. The benchmarking scripts will also run the same machine learning methods from other machine learning libraries (see OtherLibraries) and then produce runtime graphs. This can be integrated into Jenkins so that benchmarks are auto-generated nightly, which could be very helpful in informing developers which of their changesets have caused speedups or slowdowns.
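Mechanically, such a system mostly reduces to timing external commands repeatedly and recording the best run per (library, method, dataset) combination. A minimal Python sketch follows; the command line shown is a portable placeholder, not the real executables' flags:

```python
import subprocess
import sys
import time

def time_command(cmd, runs=3):
    """Run cmd several times and return the best wall-clock time in seconds.

    Taking the best of several runs reduces noise from caching and
    system load, which matters when comparing libraries nightly."""
    best = float('inf')
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        best = min(best, time.perf_counter() - start)
    return best

# Placeholder command; real scripts would loop over methods and datasets,
# e.g. mlpack executables vs. equivalent scikit-learn/Weka/Shogun scripts.
results = {
    'noop': time_command([sys.executable, '-c', 'pass']),
}
```

On top of this core, the Jenkins integration is a matter of emitting the timing table in a format the plotting and history-tracking front end can consume, so regressions between changesets are visible per method.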

deliverable: an automated benchmarking system built into Jenkins

difficulty: 7/10

necessary knowledge: bash scripting (or other scripting languages; Python could work), some machine learning familiarity

relevant tickets: #47, #369

why not this year? This project was completed as part of GSoC 2013 by Marcus Edel and improved in 2014 by Anand Soni. It's very cool; see http://www.mlpack.org/benchmark.html .

Packaging in Debian (and Ubuntu)

description: mlpack is packaged in Fedora and RHEL, but currently not in Debian due to the stringent packaging requirements. To complete this project, mlpack would need to be properly packaged for Debian and Ubuntu and submitted in the proper manner. This would involve a package review process and probably would result in the project's undertaker becoming a Debian contributor or Debian developer. Because this process is generally somewhat slow and depends on other people's input, it may be best combined with another simple, low-difficulty project.

deliverable: mlpack in Debian or Ubuntu repositories (or at least an in-motion process)

difficulty: 2/10 but includes a lot of reading documentation

necessary knowledge: Linux familiarity, knowledge of how the open-source community works, and shell scripting

relevant tickets: #327

why not this year? This is far too little work for a good GSoC project. It could be part of another project, though.