
W1D3_MultiLayerPerceptrons: Sec 2.1 #722

Closed
wizofe opened this issue Jul 13, 2022 · 5 comments

Comments

wizofe commented Jul 13, 2022

drop_last=True is used on the train loader but not on the test loader.

Similarly, shuffle=True is used for training while shuffle=False is used for testing in both W1D3 notebooks. Is this an inconsistency?
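
For reference, the pattern in question looks roughly like this (an illustrative sketch with a stand-in dataset, not the notebook's exact code):

```python
# Illustrative sketch (stand-in random dataset, not the notebook's exact code).
import torch
from torch.utils.data import DataLoader, TensorDataset

train_set = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
test_set = TensorDataset(torch.randn(250, 10), torch.randint(0, 2, (250,)))

train_loader = DataLoader(train_set, batch_size=64, shuffle=True, drop_last=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)  # drop_last left at its default (False)
```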

spirosChv added the W1D3 and Code-improvement labels on Jul 13, 2022
spirosChv (Collaborator) commented

@wizofe, thank you for contributing to our repo. Although I do not remember off the top of my head exactly where this is used, shuffle is set to False during testing because the order of the images does not matter there. During training, however, the order matters because we split the dataset into batches. Does this make sense?

The drop_last argument is unnecessary during testing because we do not care if one batch is smaller, but during training we drop the last batch if it is incomplete (although I am not quite sure it makes much of a difference in practice).
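
To make the drop_last point concrete, here is a small sketch with a toy dataset of 1000 samples (not the course data), showing how it changes the number of batches:

```python
# Toy illustration: with 1000 samples and batch_size=64, the last batch would
# hold 1000 % 64 = 40 samples unless drop_last discards it.
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(1000, 10))
print(len(DataLoader(data, batch_size=64, drop_last=False)))  # 16 batches, last one has 40 samples
print(len(DataLoader(data, batch_size=64, drop_last=True)))   # 15 full batches of 64
```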

spirosChv removed the Code-improvement label on Jul 13, 2022
GaganaB (Collaborator) commented Jul 13, 2022

Hi @wizofe, I agree with Spiros here. I'll elaborate below just to clarify a few things (hopefully).

  • Shuffling: The train and test sets are generated by a probability distribution over the data, called the data-generating process, under the assumption that examples are i.i.d. (independent and identically distributed). We shuffle the data to overcome catastrophic forgetting and to ensure representative samples across the train/validation/test sets.
    Mathematically speaking: if the network has P weights W, the loss L defines a surface in a (P+1)-dimensional space. This arises from the fact that, for any given set of weights W, the loss function can be evaluated on the data X, and that value becomes the elevation of the surface.
    But there is the problem of non-convexity: the surface described above has numerous local minima, so gradient descent algorithms are susceptible to getting "stuck" in one of them while a deeper/lower/better solution may lie nearby. This is likely to occur if X is unchanged over all training iterations, because the surface is fixed for a given X; all of its features are static, including its various minima. To keep gradient descent from getting stuck, we shuffle the training data (see the sketch after this list). And since we follow the i.i.d. principle anyway, shuffling the test set is not necessary.

  • Drop_Last: The drop_last parameter tells the sampler to drop the tail of the data so that it divides evenly into batches (or evenly across replicas in the distributed case). Since we shuffle the train data anyway, we can afford to drop the last non-full batch. We do not have the same luxury with the test set.
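
As a rough sketch of the shuffling point above (toy data, illustrative only): with shuffle=True the composition of each mini-batch is re-drawn every epoch, so the per-batch loss surface the optimizer sees also changes.

```python
# Toy illustration: shuffle=True re-draws the batch composition every epoch.
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.arange(8))
loader = DataLoader(data, batch_size=4, shuffle=True)
for epoch in range(2):
    print([batch[0].tolist() for batch in loader])  # typically a different grouping each epoch
```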

I hope that helps. Feel free to comment below if we can be of further assistance. :)

GaganaB closed this as completed on Jul 13, 2022
wizofe (Author) commented Jul 14, 2022

Hi @spirosChv, thanks for your comment, and @GaganaB for the great insight and explanation. I think I understand the train vs. test case now, although in practice I have previously only seen both sets shuffled. Is there any advantage to that, @GaganaB?

spirosChv (Collaborator) commented

In practice, there is no advantage to shuffling the test set. During the test phase, the inputs pass through a static network, so the order does not matter. During training, however, you want to shuffle to obtain a less biased estimate. Imagine that you have collected some images where the first half is clean and the second half is blurry; if you do not shuffle, your network will never learn that both exist.
On the other hand, when you test, you do not learn anything, so the order plays no role. In the end, you report a loss and an accuracy score as an average across all test inputs, and the order does not affect that average.
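
A tiny check of that last point, with made-up predictions and labels: the averaged accuracy is invariant to the order of the test examples.

```python
# Toy check: average accuracy does not depend on the order of the examples.
import torch

preds = torch.tensor([1, 0, 1, 1, 0, 1])
labels = torch.tensor([1, 0, 0, 1, 0, 1])
acc_in_order = (preds == labels).float().mean()

perm = torch.randperm(len(preds))
acc_shuffled = (preds[perm] == labels[perm]).float().mean()
assert torch.isclose(acc_in_order, acc_shuffled)  # same average either way
```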

PS. I suggest continuing this conversation on Discord to increase visibility among other TAs, students, etc.

Thank you.

wizofe (Author) commented Jul 14, 2022

Thank you both!
