Revamp ann modules #831
Conversation
Select<arma::mat, arma::mat>*,
Sequential<arma::mat, arma::mat>*,
VRClassReward<arma::mat, arma::mat>*
>;
I was quite surprised, when I reviewed the visitor paradigm and then the boost::variant sources, to discover that boost::variant and static_visitor are completely implemented at compile time and there is no runtime overhead. It took a while to read the boost::variant code to verify (and I don't completely understand it, only "sort of"), and then taking a look at mlpack programs:
$ nm -g bin/mlpack_knn | c++filt | grep MonoSearchVisitor
$
I'm quite impressed! I didn't think it could be done entirely at compile-time.
Anyway, one drawback of this approach is that it doesn't make it easy for users to add their own layers. One way you could do that is this:
using LayerTypes = boost::variant<
Add<arma::mat, arma::mat>*,
...,
#ifdef EXTRA_LAYER_TYPES
EXTRA_LAYER_TYPES,
#endif
VRClassReward<arma::mat, arma::mat>*
>;
And then a user can add their own layer by adding that to EXTRA_LAYER_TYPES
before including mlpack.
Another possible issue that I have not thought of any solution to is this: if you know the structure of your network at compile time you could specify a layer as, e.g., Linear<arma::mat::fixed<10, 10>, arma::mat::fixed<10, 10>>, but that's not part of the boost::variant object, so you couldn't use it. Maybe in situations like that we can use dynamic dispatch of some sort? I'll keep thinking but I don't know if I will be able to come up with something straightforward---boost::any wouldn't be a good solution because then we lose the static visitor and any call to any layer will have to be a lookup somehow. Either way I think even if we don't have a solution for now, it would be great to use this as-is and try to improve later, because this whole idea is so much nicer for users than before too.
(I haven't reviewed the rest of the code yet---I've primarily focused only on this point so far. I'll add comments about the rest of the code hopefully in the next few days.)
Thanks for taking a closer look at the PR. I think the easiest solution for using your own layer is to inherit from one of the existing layers, e.g. Linear<>:
class OwnLayer : public Linear<>
{
 public:
  OwnLayer();
  void Forward(...);
  void Backward(...);
  void Gradient(...);
};
Afterwards, use the rest of the code as always. It's not ideal, but it works without rebuilding the library. I guess another solution would be to extend boost::variant with something like:
template <typename LocalTypes>
class FFN
{
typename boost::make_variant_over<
typename boost::mpl::joint_view<
LayerTypes,
LocalTypes
>::type
>::type extendedTypes;
...
};
FFN<OwnLayer<> > model;
model.Add<OwnLayer<> >();
The problem I see is that you have to specify the type of the new layer.
Wouldn't you have to make Forward(), Backward(), and Gradient() virtual for inheriting from Layer<> to work?
Anyway, the LocalTypes solution is nicer than what I proposed.
I don't see a way using boost::variant to collect the full list of types used by Add() at compile time. Anything I can think of is pretty ugly. But I don't think it's a huge problem to ask users to use a template parameter if they want to use their own custom layer type---we can just try to provide enough types that it's not a problem. And maybe if we think of something better later, we can refactor. What do you think?
This looks great to me. I think that if we have good documentation and examples for this, then publicize it, we will get a lot of interest---especially if we can show that it outperforms some other toolkits. I hope the comments I've made are useful. For the most part I've just assumed that all the layers are implemented correctly; there are so many, I think it would take me a very long time to read them all. :) I focused primarily on the structure of the code at a high-level and specifically on the FFN and RNN classes, plus the abstractions of the layers themselves.
I have a few more general comments and questions:
- Is the CNN class meant to be replaced by FFN with convolutional layers? I think that's the case, just want to be sure I understand.
- The FFN class mostly holds arma::mat objects internally, but the layers are general enough that they can support other types of input, like arma::Mat<int> or whatever else. Should the types held by FFN instead be a function of the element types held by the layers? It might be nice for that to be the case, especially because it could also solve Tham's question in "Could we change the response type of ann from arma::mat to arma::Row<size_t>?" #541 from many months ago. It would also allow training reduced-precision neural networks, like in https://arxiv.org/pdf/1502.02551.pdf for instance.
- It seems like a lot of the parameters of the FFN and RNN classes are there specifically for use by the optimizer during training. That's fine with me, but I thought that another option might be to create some kind of "optimization helper" class that held the network itself and all of the training parameters. I think that would be a lot of refactoring work though since it would apply to the layers too; maybe it's only something to think about (and maybe it's not a good idea in the long run).
- Since this is such a nice refactoring and has so much potential, I think we can make a big splash if we do some planning on how to release this with maximum impact to the machine learning and data science community. If you're up for that, we can plot out some simple roadmap of things to do before releasing this as part of mlpack 3.0.0.
Overall this is great! I am very excited. Sorry that my review took so long. :(
 *
 * @param layer The Layer to be added to the model.
 */
void Add(LayerTypes layer) { network.push_back(layer); }
It looks like there are two Add() overloads here---one constructs a new layer and adds it to the model, and the other takes a pointer to a layer that is actually owned at a higher scope and puts it in the network without copying it (it just takes the pointer). It seems to me like you could end up in a weird situation if you did something like this...
Linear<> l(10, 10);
// modify the parameters of l somehow
// ...
{
FFN<> f;
f.Add(&l);
} // destructs l
// Try to re-use l in another network.
FFN<> f2;
f2.Add(&l);
// But since l was already destructed, it's not the same!
Do you think it would be better to instead provide a "copy" and "move" version of Add()? Kind of like:
// Copy a layer and add it to the network.
template<typename LayerType>
void Add(const LayerType& l, /* SFINAE to check that LayerType is in LayerTypes */);
// Take ownership of a layer and add it to the network.
template<typename LayerType>
void Add(LayerType&& l, /* SFINAE to check that LayerType is in LayerTypes */);
I don't know if that's the best way to accomplish it. It looks like for most of the layers, we can use the default copy and move constructors/operators (I think).
Good point, using the copy and move constructors/operators is a good idea.
for (size_t i = 0; i < network.size(); ++i)
{
  offset += boost::apply_visitor(WeightSetVisitor(std::move(parameter),
      offset), network[i]);
If I am understanding right, this clever design means that all parameters in the network are stored in contiguous memory. That is nice! It'll provide nice cache benefits and should also accelerate serialization.
Right, the drawback is that it's not easy to change the weight vector size during training, but I couldn't think of any situation where that would be necessary.
    OutputLayerType>::value,
    "The type of outputLayer must be OutputLayerType.");
this->predictors = std::move(predictors);
this->responses = std::move(responses);
I didn't know until I looked it up that calling std::move() on a const reference actually just calls the copy constructor. If I understand correctly, we must hold predictors and responses as members, since the optimizer will be calling Evaluate() and Gradient() repeatedly. Another option is simply to hold a pointer to predictors and responses, then in Train(), just take the address of the given data, and when training is done, set the pointers to NULL.
{
  this->deterministic = deterministic;
  ResetDeterministic();
}
It seems like the two checks above are for safety, since the user might for some reason call Evaluate() or Gradient(). Typically though the optimizer will be calling those two functions, and in those cases I think you have already called ResetParameters() and ResetDeterministic() in Train() before calling the optimizer. I think it would be fine, if you wanted, to remove those two checks from here, and then provide a "user" overload of Evaluate() and Gradient() that doesn't need the parameters parameter (since it's unused anyway), and the documentation could direct users to use that one if they needed. I don't know if it would give much speedup, but if Evaluate() is called for every point during training, it might be noticeable---might be worth testing.
The problem is that some layers act differently in training and testing mode. For example, the Dropout layer randomly sets some values to zero in training mode and uses all values while testing. Since the optimizer evaluates the network after each batch/iteration, we have to set the deterministic parameter accordingly and switch it back for the next iteration.
// Evaluate the gradient for this iteration.
if (shuffle)
function.Gradient(iterate, visitationOrder[currentFunction], gradient);
else
function.Gradient(iterate, currentFunction, gradient);
For the Gradient function, we have to set the deterministic parameter to training.
// Now add that to the overall objective function.
if (shuffle)
overallObjective += function.Evaluate(iterate,
visitationOrder[currentFunction]);
else
overallObjective += function.Evaluate(iterate, currentFunction);
For the Evaluate function, we have to set the deterministic parameter to testing.
Ah, right, then there is no way around it.
Forward(arma::mat(predictors.colptr(i), predictors.n_rows, 1, false, true),
    network);
currentInput = std::move(arma::mat(predictors.colptr(i),
    predictors.n_rows, 1, false, true));
I think you can also get away with currentInput = predictors.unsafe_col(i) here and a few lines later for currentTarget.
You are right; maybe it's sufficient to just save the index. I'll have to check if that works.
 * LoadOutputParameterVisitor restores the output parameter using the given
 * parameter set.
 */
class LoadOutputParameterVisitor : public boost::static_visitor<void>
This file is very long; if you wanted to, I think it would be reasonable to split each visitor into its own file since there are so many of them. Your call, I think it's fine either way.
Let's do this; it would definitely increase readability.
 * mlpack is free software; you may redistribute it and/or modify it under the
 * terms of the 3-clause BSD license. You should have received a copy of the
 * 3-clause BSD license along with mlpack. If not, see
 * http://www.opensource.org/licenses/BSD-3-Clause for more information.
 * Tests for various convolution strategies.
Ah, can we put this comment back up above the license? If we ever need to change the license text, this'll probably mess up the awk scripts I try to make. :)
Sure, let's make it easier for awk.
  module.Backward(std::move(input), std::move(output), std::move(delta));
  BOOST_REQUIRE_EQUAL(arma::accu(output), arma::accu(delta));
  }
}
It seems like a lot of layer types aren't tested; should we add some tests for those too?
Yeah, implementing more tests is definitely on my list; I think every layer is used in at least one test. For example, there is no test for the base layer in ann_layer_test, since each activation function is tested in activation_functions_test, and the mean pooling layer is tested in pooling_test.
      offset), network[i]);

  boost::apply_visitor(resetVisitor, network[i]);
}
Do we also need to clear all of the locally-held training-time parameters like currentInput, currentTarget, etc., here?
Right, in case somebody calls the Gradient function right after loading the model, before calling the Evaluate or Train function.
@@ -47,6 +47,8 @@ class FFTConvolution
 * @param input Input used to perform the convolution.
 * @param filter Filter used to perform the convolution.
 * @param output Output data that contains the results of the convolution.
 * @param dW Stride of filter application in the x direction.
 * @param dH Stride of filter application in the y direction.
Seems like dW and dH aren't parameters to this method; maybe this comment should be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch!
Ah, I meant to add, I think it is fine to merge now too before the merge gets hard(er), and possibly open some issues for any of the relevant comments or anything that needs to be addressed later.
Thanks for taking the time to look over the PR.
I agree, it looks like it's already not a straightforward merge, and this would open the possibility to create some easy issues.
Hmmm, now there is something interesting to think about. I see two ways to go---the first is to modify the optimizer abstraction so it supports a separate test set. The other is to make a "wrapper" optimizer that calls Evaluate() with the test set and Gradient() with the training set. I can see difficulties with each of those ideas. Definitely something to figure out how to do nicely later. :) I don't have any more comments, I think it's ready for merge whenever you address the comments that you wanted to now and we can open more issues to be fixed later. I don't know if anyone else had any comments on the PR.
Thanks, I think it's ready.
Merged, I'll open another issue for the things discussed here.
This pull request revamps the ann modules. It should be a lot easier to create networks especially more complex networks that depend on the concat layer or the recurrent layer.
Changes:
The way parameters of different types are dealt with is greatly simplified: arma::Mat is used for all input and output parameters instead of arma::Cube to represent 3rd-order tensors. This makes it possible to e.g. use the linear layer for convolutional and standard feedforward networks.
The introduced Recurrent module/layer is able to transform each non-recurrent layer/module into a recurrent layer.
In addition to the feedforward, recurrent and convolution network test, each layer is now tested by approximating the Jacobian matrix.
The introduction of the Concat module/layer allows us to split the network into subnetworks, which makes the implementation of e.g. the inception network a lot easier. (I'll open another PR for the inception layer once the PR is merged).
The way the network is created is greatly simplified. Instead of using:
you can create the same network with: