
refactoring of mlperceptron #5

Closed

Conversation

temporaer

I aimed for a few goals, as discussed in the sklearn pull request:

  • separate concepts (classes) for activation functions and weights
  • separate concepts for weight updates
  • no loss of speed from doing the above
  • potentially allow components to be reused across multiple layers

I'm pretty sure I removed some bugs in the sequence of gradient calculation and weight updates in the process. At the same time, I 'simplified' a few features away while debugging (e.g. the special tanh), but I'll put them back in once we can agree on the structure.
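For illustration, a minimal sketch of what separating activation functions into classes might look like; the class and method names below (`Tanh`, `Logistic`, `activate`, `derivative`) are assumptions for this sketch, not the PR's actual API:

```python
import numpy as np


class Tanh:
    """Hyperbolic tangent activation."""

    def activate(self, z):
        return np.tanh(z)

    def derivative(self, a):
        # derivative written in terms of the activation a = tanh(z)
        return 1.0 - a ** 2


class Logistic:
    """Logistic sigmoid activation."""

    def activate(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def derivative(self, a):
        # derivative written in terms of the activation a = sigma(z)
        return a * (1.0 - a)
```

With this kind of split, a layer calls `activate` in the forward pass and `derivative` during backprop, so the weight-update code never needs to know which nonlinearity is in use.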

@larsmans
Owner

larsmans commented Mar 5, 2013

Thanks heaps, I'll try to do a proper review this week (busy with research, but this might actually come in handy). Results might be bad because there's no regularization. I'd like to include at least L2 regularization.
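For reference, L2 regularization only adds an `alpha * W` term to the gradient of each weight matrix. A minimal sketch, where the names `lr`, `alpha` and `grad_W` are assumptions for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(10, 5)        # a weight matrix
grad_W = rng.randn(10, 5)   # gradient of the data term w.r.t. W
lr, alpha = 0.01, 1e-4      # learning rate and L2 strength

# Penalizing (alpha / 2) * ||W||^2 adds alpha * W to the gradient.
W -= lr * (grad_W + alpha * W)
```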

@temporaer
Author

I'm not sure what you mean regarding L2 regularization; it is included. When you do the review, can you elaborate a bit on how you imagined reusing the cost functions from SGD? I've left them in the main loop for now.

@larsmans
Owner

larsmans commented Mar 5, 2013

I thought I'd disabled regularization in the Python wrapper. Let's get back to this when I really have some time to spare. Feel free to hack further in the meantime (in a separate branch).

Final remark: I don't think we can actually reuse much code from SGD, because it does one-vs-all (OvA) training only and I want true multiclass training here. But we can implement the same loss functions and architecture so that we get e.g. hinge loss networks.
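A sketch of the kind of pluggable loss interface this hints at (class and method names are assumptions, not the PR's API); swapping `SquaredLoss` for `Hinge` would give the hinge-loss network mentioned above:

```python
import numpy as np


class SquaredLoss:
    def loss(self, p, y):
        return 0.5 * np.sum((p - y) ** 2)

    def dloss(self, p, y):
        # gradient of the loss w.r.t. the network output p
        return p - y


class Hinge:
    """Elementwise hinge loss with targets in {-1, +1} per output unit."""

    def loss(self, p, y):
        return np.sum(np.maximum(0.0, 1.0 - p * y))

    def dloss(self, p, y):
        # subgradient: -y where the margin is violated, 0 elsewhere
        return np.where(p * y < 1.0, -y, 0.0)
```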

@@ -289,38 +446,26 @@ def backprop_sgd(self, X, np.ndarray Y):
end = n_samples
start = n_samples - batchsize


# densify outputs
@larsmans (Owner)

What is the purpose of this? sparse is true iff X is sparse; Y is always dense.

@temporaer (Author)

In the sparse implementation, randomization is over instances in the training set; in the dense version, it is over batches. The loops collect the selected rows of Y into a dense matrix.

I believe it should be enough to randomize batches, but I did not want to break your functionality.
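Roughly, the per-instance strategy gathers shuffled rows of Y into a dense float batch, which is what the loops above do. A sketch with assumed names:

```python
import numpy as np

rng = np.random.RandomState(0)
Y = rng.randint(0, 2, size=(1000, 10))   # stand-in for an integer label matrix
n_samples, batchsize = Y.shape[0], 32

# Per-instance randomization: pick batchsize random rows and densify them
# into a float64 batch.
idx = rng.permutation(n_samples)[:batchsize]
Y_batch = np.asarray(Y[idx], dtype=np.float64)
```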

@larsmans (Owner)

I see. But that makes the code quite hard to read, so I suggest we pick one strategy. SGDClassifier randomizes samples, so I picked that strategy. I can't immediately see the consequences of shuffling the batches only; wouldn't that take longer to converge if the samples are not pre-shuffled?

@temporaer (Author)

It is a bit messy, agreed. I chose randomization over batches because of the speed improvement (only views are required, no looping). It is probably a good idea to pre-shuffle your data before applying this technique, but I do not expect a strong effect on convergence otherwise.
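The batch strategy described here amounts to a one-time pre-shuffle followed by contiguous slicing, so each batch is a view rather than a copy. A sketch with assumed names:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
Y = rng.randn(1000, 5)

# One-time pre-shuffle (this step does copy, but only once).
perm = rng.permutation(X.shape[0])
X, Y = X[perm], Y[perm]

# Visit the batches in random order; each contiguous slice is a view, not a copy.
batchsize = 32
starts = rng.permutation(np.arange(0, X.shape[0], batchsize))
for start in starts:
    X_batch = X[start:start + batchsize]
    Y_batch = Y[start:start + batchsize]
```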

@larsmans (Owner)

Right, but we can't expect users to pre-shuffle their inputs -- not all users are experts (and honestly, I tend to forget that kind of stuff all the time myself).

@temporaer (Author)

Since this is a convenience/speed tradeoff... what about adding a shuffle=True parameter to the classifier, so that by default it shuffles the (dense) data before learning?
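One possible shape for such a parameter, using sklearn.utils.shuffle so the caller's arrays are left untouched; this is a sketch under those assumptions, not the PR's implementation:

```python
from sklearn.utils import shuffle as shuffle_arrays


class MLP:
    def __init__(self, shuffle=True, random_state=None):
        self.shuffle = shuffle
        self.random_state = random_state

    def fit(self, X, y):
        if self.shuffle:
            # shuffle_arrays returns shuffled copies; the caller's X and y are untouched
            X, y = shuffle_arrays(X, y, random_state=self.random_state)
        # ... batch-wise training on the (shuffled) data would go here ...
        return self
```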

@larsmans (Owner)

How much is the speed difference? I'm not too fond of messing with X in-place, nor of copying it.

@larsmans
Owner

@temporaer Can you still run the document classification example? It seems to be eating up all the CPU without doing anything (verbose=True, but no output after waiting several minutes).

@temporaer
Author

The problem seems to be the sparse derivative for W0. I didn't see this since I tested with dense inputs... looking into it.
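For what it's worth, the first-layer gradient can be written so it handles both sparse and dense inputs via sklearn's safe_sparse_dot. A sketch with assumed names (`delta_hidden` stands for the backpropagated error at the hidden layer):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.extmath import safe_sparse_dot

rng = np.random.RandomState(0)
X_batch = sp.rand(32, 100, density=0.05, format="csr", random_state=0)
delta_hidden = rng.randn(32, 50)

# Gradient w.r.t. the first-layer weights; works whether X_batch is sparse or dense.
grad_W0 = safe_sparse_dot(X_batch.T, delta_hidden)   # shape (100, 50)
```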

@larsmans
Owner

Ok. I like your way of structuring the code, btw. Sorry that it took me so long to do a review.

@larsmans
Owner

The dense case seems broken too: I get a dtype mismatch because Y_batch must be np.float64, while Y is np.int. I'm writing unit tests.
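A minimal illustration of the mismatch and one obvious fix, casting the integer label matrix up front (an assumed fix for illustration, not necessarily what the PR will do):

```python
import numpy as np

Y = np.array([[0, 1, 0], [1, 0, 0]])   # integer labels, e.g. from LabelBinarizer
Y = Y.astype(np.float64)               # cast once so Y_batch can be float64
```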
