refactoring of mlperceptron #5
Conversation
Thanks heaps, I'll try to do a proper review this week (busy with research, but this might actually come in handy). Results might be bad because there's no regularization. I'd like to include at least L2 regularization.
Not sure what you mean by the L2 regularization; it is included. When you do the review, can you elaborate a bit on how you imagined reusing cost functions from SGD? I left them in the main loop for now.
I thought I'd disabled regularization in the Python wrapper. Let's get back to this when I really have some time to spare. Feel free to hack further in the meantime (in a separate branch). Final remark: I don't think we can actually reuse much code from SGD, because it does OvA training only and I want true multiclass training here. But we can implement the same functions and architecture so we get e.g. hinge loss networks.
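For reference, a minimal sketch of how an L2 penalty typically enters a minibatch weight update (illustrative numpy code only; the function name, alpha and lr are made up for this example, not the PR's actual API):

```python
import numpy as np

def l2_penalized_update(W, grad, lr, alpha):
    """One gradient step with an L2 penalty on the weights.

    The penalty contributes alpha * W to the data gradient, which is the
    usual way L2 regularization appears in an SGD/backprop weight update.
    """
    # gradient of 0.5 * alpha * ||W||^2 is alpha * W
    grad_total = grad + alpha * W
    return W - lr * grad_total

# tiny usage example with random numbers
rng = np.random.RandomState(0)
W = rng.randn(5, 3)
grad = rng.randn(5, 3)
W_new = l2_penalized_update(W, grad, lr=0.01, alpha=1e-4)
```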
@@ -289,38 +446,26 @@ def backprop_sgd(self, X, np.ndarray Y):
    end = n_samples
    start = n_samples - batchsize

    # densify outputs
What is the purpose of this? sparse is true iff X is sparse; Y is always dense.
In the sparse implementation, randomization is over instances in the training set; in the dense version, randomization is over batches. The loops collect the different rows of Y into a dense matrix.
I believe it should be enough to randomize batches, but I did not want to break your functionality.
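For context, a rough sketch of what such a "densify outputs" loop amounts to (illustrative only, not the PR's code): under per-instance randomization, the selected rows of Y are gathered into one contiguous dense minibatch array.

```python
import numpy as np

def make_minibatch(Y, indices):
    """Collect the selected rows of Y into a dense minibatch array.

    Y is assumed to be a dense 2-D output array; indices are the
    (shuffled) sample indices chosen for the current batch.
    """
    batch = np.empty((len(indices), Y.shape[1]), dtype=Y.dtype)
    for i, idx in enumerate(indices):
        batch[i] = Y[idx]          # copy one row per selected instance
    return batch

rng = np.random.RandomState(0)
Y = rng.randn(100, 4)
perm = rng.permutation(100)
Y_batch = make_minibatch(Y, perm[:16])  # first shuffled batch of 16
```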
I see. But that makes the code quite hard to read, so I suggest we pick one strategy. SGDClassifier randomizes samples, so I picked that strategy. I can't immediately see the consequences of shuffling the batches only; wouldn't that take longer to converge if the samples are not pre-shuffled?
It is a bit messy, agreed. I chose randomization over batches because of the speed improvement (only views are required, no looping). It is probably a good idea to pre-shuffle your data before applying this technique, but I do not expect a strong effect on convergence otherwise.
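A rough sketch of the batch-view strategy described here (names are illustrative, not the PR's code): the data is sliced into contiguous batches, and only the order in which the batches are visited is shuffled, so no rows are copied or fancy-indexed.

```python
import numpy as np

def iter_shuffled_batches(X, Y, batchsize, rng):
    """Yield contiguous (view) minibatches in a random order.

    Slicing with start:end returns views of X and Y, so no data is
    copied; only the visiting order of the batches is randomized.
    """
    n_samples = X.shape[0]
    n_batches = n_samples // batchsize
    for b in rng.permutation(n_batches):
        start = b * batchsize
        end = start + batchsize
        yield X[start:end], Y[start:end]

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
Y = rng.randn(1000, 3)
for X_batch, Y_batch in iter_shuffled_batches(X, Y, batchsize=32, rng=rng):
    pass  # one pass over the data, batches visited in random order
```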
Right, but we can't expect users to pre-shuffle their inputs -- not all users are experts (and honestly, I tend to forget that kind of stuff all the time myself).
Since this is a convenience/speed tradeoff... what about adding a shuffle=True parameter to the classifier, so that it shuffles the (dense) data before learning by default?
How much is the speed difference? I'm not too fond of messing with X in-place, nor of copying it.
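One possible middle ground, sketched under the assumption that a single copy at fit time is acceptable (the parameter names here are hypothetical, not the PR's API): shuffling by indexing with a permutation leaves the caller's X untouched, at the cost of copying it once.

```python
import numpy as np

def maybe_shuffle(X, Y, shuffle=True, random_state=0):
    """Return (X, Y) in a shuffled order without modifying the inputs.

    Fancy indexing with a permutation creates copies, so the caller's
    arrays are left untouched; the price is one extra copy of each.
    """
    if not shuffle:
        return X, Y
    rng = np.random.RandomState(random_state)
    perm = rng.permutation(X.shape[0])
    return X[perm], Y[perm]

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
Y = rng.randn(200, 2)
X_s, Y_s = maybe_shuffle(X, Y)   # X and Y themselves are unchanged
```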
@temporaer Can you still run the document classification example? It seems to be eating up all the CPU without doing anything (…).
The problem seems to be the sparse derivative for W0. I didn't see this, since I tested with dense inputs... looking into this.
Ok. I like your way of structuring the code, btw. Sorry that it took me so long to do a review.
The dense case seems broken too: I get a …
I tried to aim for a few goals as discussed in the sklearn pull request:
I'm pretty sure I removed some bugs in the sequence of gradient calculation and weight updates in the process. At the same time, I 'simplified' a few features away while debugging (e.g. the special tanh), but I'll put them in again once we can agree on the structure.
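To illustrate the ordering issue mentioned above with a generic sketch (not the PR's code): in backprop, all layer gradients should be computed from the current weights before any weights are updated, otherwise the lower layer's gradient is taken with respect to already-modified upper-layer weights.

```python
import numpy as np

def backprop_step(X, Y, W0, W1, lr):
    """One correctly ordered backprop step for a tiny tanh/linear MLP.

    All gradients are computed first, from the current W0 and W1, and
    the weights are updated only afterwards.
    """
    # forward pass
    H = np.tanh(X @ W0)            # hidden activations
    out = H @ W1                   # linear output layer
    err = out - Y                  # squared-error gradient at the output

    # backward pass: compute both gradients before touching the weights
    grad_W1 = H.T @ err
    delta_hidden = (err @ W1.T) * (1.0 - H ** 2)   # tanh derivative
    grad_W0 = X.T @ delta_hidden

    # only now apply the updates
    W1 -= lr * grad_W1
    W0 -= lr * grad_W0
    return W0, W1

rng = np.random.RandomState(0)
X, Y = rng.randn(32, 10), rng.randn(32, 3)
W0, W1 = 0.1 * rng.randn(10, 8), 0.1 * rng.randn(8, 3)
W0, W1 = backprop_step(X, Y, W0, W1, lr=0.01)
```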