
W1D3_MultiLayerPerceptrons: Sec 2.1 #722

Closed
wizofe opened this issue Jul 13, 2022 · 5 comments

Comments

wizofe commented Jul 13, 2022

drop_last=True is used on the train loader but not on the test loader.

Similarly, shuffle=True is used for training while shuffle=False is used for testing in both W1D3 notebooks. Is this an inconsistency?
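
For reference, the pattern in question looks roughly like this (an illustrative sketch with a stand-in dataset, not the notebook's exact code):

```python
# Illustrative sketch (stand-in random dataset, not the notebook's exact code).
import torch
from torch.utils.data import DataLoader, TensorDataset

train_set = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
test_set = TensorDataset(torch.randn(250, 10), torch.randint(0, 2, (250,)))

train_loader = DataLoader(train_set, batch_size=64, shuffle=True, drop_last=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)  # drop_last left at its default (False)
```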

spirosChv added the W1D3 and Code-improvement labels on Jul 13, 2022
spirosChv (Collaborator) commented

@wizofe, thank you for contributing to our repo. Although I do not remember off the top of my head exactly where this is used, shuffle is set to False during testing because the order of the images does not matter there. During training, however, the order matters because we split the dataset into batches. Does this make sense?

The drop_last argument is unnecessary during testing because we do not care if one batch is smaller, but during training we drop the last batch if it is incomplete (although I am not quite sure it makes much of a difference in practice).
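
To make the drop_last point concrete, here is a small sketch with a toy dataset of 1000 samples (not the course data), showing how it changes the number of batches:

```python
# Toy illustration: with 1000 samples and batch_size=64, the last batch would
# hold 1000 % 64 = 40 samples unless drop_last discards it.
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(1000, 10))
print(len(DataLoader(data, batch_size=64, drop_last=False)))  # 16 batches, last one has 40 samples
print(len(DataLoader(data, batch_size=64, drop_last=True)))   # 15 full batches of 64
```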

spirosChv removed the Code-improvement label on Jul 13, 2022
GaganaB (Collaborator) commented Jul 13, 2022

Hi @wizofe, I agree with Spiros here. I'll elaborate below just to clarify a few things (hopefully).

  • Shuffling: The train and test sets are generated by a probability distribution over the data, called the data-generating process, under the assumption that examples are i.i.d. (independent and identically distributed). We shuffle the data to overcome catastrophic forgetting and to ensure representative samples across the train/validation/test sets.
    Mathematically speaking: if the network has P weights W, the loss L defines a surface in a (P+1)-dimensional space. This arises from the fact that, for any given set of weights W, the loss function can be evaluated on the data X, and that value becomes the elevation of the surface.
    But there is the problem of non-convexity: the surface described above has numerous local minima, so gradient descent algorithms are susceptible to getting "stuck" in one of them while a deeper/lower/better solution may lie nearby. This is likely to occur if X is unchanged over all training iterations, because the surface is fixed for a given X; all of its features are static, including its various minima. To keep gradient descent from getting stuck, we shuffle the training data (see the sketch after this list). And since we follow the i.i.d. principle anyway, shuffling the test set is not necessary.

  • Drop_Last: The drop_last parameter tells the sampler to drop the tail of the data so that it divides evenly into batches (or evenly across replicas in the distributed case). Since we shuffle the train data anyway, we can afford to drop the last non-full batch. We do not have the same luxury with the test set.
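
As a rough sketch of the shuffling point above (toy data, illustrative only): with shuffle=True the composition of each mini-batch is re-drawn every epoch, so the per-batch loss surface the optimizer sees also changes.

```python
# Toy illustration: shuffle=True re-draws the batch composition every epoch.
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.arange(8))
loader = DataLoader(data, batch_size=4, shuffle=True)
for epoch in range(2):
    print([batch[0].tolist() for batch in loader])  # typically a different grouping each epoch
```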

I hope that helps. Feel free to comment below if we can be of further assistance. :)

GaganaB closed this as completed on Jul 13, 2022
wizofe (Author) commented Jul 14, 2022

Hi @spirosChv, thanks for your comment, and @GaganaB for the great insight and explanation. I think I understand the train vs. test case now, although in practice I have previously only seen both sets shuffled. Is there any advantage to that, @GaganaB?

spirosChv (Collaborator) commented

In practice, there is no advantage to shuffling the test set. During the test phase, the inputs pass through a static network, so the order does not matter. During training, however, you want to shuffle to obtain a less biased estimate. Imagine that you have collected some images where the first half is clean and the second half is blurry; if you do not shuffle, your network will never learn that both exist.
On the other hand, when you test, you do not learn anything, so the order plays no role. In the end, you report a loss and an accuracy score as an average across all test inputs, and the order does not affect that average.
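
A tiny check of that last point, with made-up predictions and labels: the averaged accuracy is invariant to the order of the test examples.

```python
# Toy check: average accuracy does not depend on the order of the examples.
import torch

preds = torch.tensor([1, 0, 1, 1, 0, 1])
labels = torch.tensor([1, 0, 0, 1, 0, 1])
acc_in_order = (preds == labels).float().mean()

perm = torch.randperm(len(preds))
acc_shuffled = (preds[perm] == labels[perm]).float().mean()
assert torch.isclose(acc_in_order, acc_shuffled)  # same average either way
```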

PS. I suggest continuing this conversation on Discord to increase visibility among other TAs, students, etc.

Thank you.

wizofe (Author) commented Jul 14, 2022

Thank you both!
