
Lab 10. Back Propagation Implementation. #25

Closed · kkweon opened this issue Mar 1, 2017 · 14 comments

@kkweon (Collaborator) commented Mar 1, 2017

In lab-10-X1-mnist_back_prop.py, back propagation is defined as follows:

# Forward
l1 = tf.add(tf.matmul(X, w1), b1)
a1 = sigma(l1)
l2 = tf.add(tf.matmul(a1, w2), b2)
y_pred = sigma(l2)

diff = (y_pred - Y)

# Back prop (chain rule)
d_l2 = diff * sigma_prime(l2)
d_b2 = d_l2
d_w2 = tf.matmul(tf.transpose(a1), d_l2)
d_a1 = tf.matmul(d_l2, tf.transpose(w2))
d_l1 = d_a1 * sigma_prime(l1)
d_b1 = d_l1
d_w1 = tf.matmul(tf.transpose(X), d_l1)
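
These gradients would then be applied with plain gradient-descent updates; a rough sketch is below (learning_rate is a hypothetical value for this sketch, and the exact update code from the lab file is not quoted here; the bias gradients are averaged over the batch so their shapes match b1 and b2):

learning_rate = 0.5  # hypothetical value for this sketch

# Plain gradient-descent updates built from the hand-computed gradients
step = [
    tf.assign(w2, w2 - learning_rate * d_w2),
    tf.assign(b2, b2 - learning_rate * tf.reduce_mean(d_b2, axis=0)),
    tf.assign(w1, w1 - learning_rate * d_w1),
    tf.assign(b1, b1 - learning_rate * tf.reduce_mean(d_b1, axis=0)),
]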

Problem

This backpropagation is only correct when the loss function is the (halved) squared error:

$$E = \frac{1}{2} \sum (y_{pred} - Y)^2$$

Proof

Current Forward Step:

$$l_1 = X W_1 + b_1, \quad a_1 = \sigma(l_1), \quad l_2 = a_1 W_2 + b_2, \quad y_{pred} = \sigma(l_2)$$

If we assume the loss function above, then by the chain rule

$$\frac{\partial E}{\partial l_2} = \frac{\partial E}{\partial y_{pred}}\,\frac{\partial y_{pred}}{\partial l_2} = (y_{pred} - Y) \odot \sigma'(l_2), \qquad \frac{\partial E}{\partial W_2} = a_1^{T}\,\frac{\partial E}{\partial l_2}$$

which is represented as

d_l2 = diff * sigma_prime(l2)
d_w2 = tf.matmul(tf.transpose(a1), d_l2)

We can continue in the same way for the remaining variables.
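
Spelling these out (they match the d_a1, d_l1, d_w1, and d_b1 lines in the code):

$$\frac{\partial E}{\partial a_1} = \frac{\partial E}{\partial l_2} W_2^{T}, \qquad \frac{\partial E}{\partial l_1} = \frac{\partial E}{\partial a_1} \odot \sigma'(l_1), \qquad \frac{\partial E}{\partial W_1} = X^{T}\,\frac{\partial E}{\partial l_1}, \qquad \frac{\partial E}{\partial b_1} = \frac{\partial E}{\partial l_1}$$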

Conclusion

  1. The current loss is a variant of mean squared error, and this loss is usually used for regression problems.
  2. This is a classification problem (MNIST).
  3. Interestingly, with the current hyperparameters, it works well.
  4. However, if you change the hyperparameters, say by increasing the batch size to something reasonable like 128 or above, it fails to converge because of the wrong loss function (it tries to match all 10 output values at once instead of focusing on the correct label).
  5. I suspect it works because of the characteristics of the MNIST dataset: most of the pixel values are 0, the background of the image.
  6. With the correct cross-entropy loss function, there is no issue with batch size or learning rate (see the sketch after this list).
  7. I suggest that the loss function be clearly defined before going into any backpropagation. This follows Andrej Karpathy's approach as well.
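
For reference, a rough sketch of what the cross-entropy version could look like (not the lab's code; it reuses X, Y, a1, l1, w1, b1, w2, b2, and sigma_prime from the snippet above, swaps the sigmoid output for softmax, and sums the bias gradients over the batch so their shapes match the bias vectors):

# Sketch only: softmax output + cross-entropy loss instead of sigmoid + MSE
l2 = tf.matmul(a1, w2) + b2
y_pred = tf.nn.softmax(l2)
loss = tf.reduce_mean(-tf.reduce_sum(Y * tf.log(y_pred + 1e-8), axis=1))

# Back prop (chain rule); softmax + cross-entropy cancel to (y_pred - Y)
N = tf.cast(tf.shape(X)[0], tf.float32)   # batch size, to match the mean-reduced loss
d_l2 = (y_pred - Y) / N
d_w2 = tf.matmul(tf.transpose(a1), d_l2)
d_b2 = tf.reduce_sum(d_l2, axis=0)
d_a1 = tf.matmul(d_l2, tf.transpose(w2))
d_l1 = d_a1 * sigma_prime(l1)
d_w1 = tf.matmul(tf.transpose(X), d_l1)
d_b1 = tf.reduce_sum(d_l1, axis=0)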
@cynthia (Contributor) commented Mar 1, 2017

MSE will still work, just not very well, as it's less suitable for a multinomial distribution. I'm guessing the choice was made for simplicity, as cross-entropy + softmax would make it harder to follow.

For pedagogical reasons, not introducing extra concepts that are extensions of the base concept being taught (in this case, backprop) is probably better for the audience. (That said, TensorFlow's notation decreases readability significantly with little gain, so there is that...)

@kkweon (Collaborator, Author) commented Mar 1, 2017

I agree it can look tedious, but I thought I'd bring it up for the following reasons:

  • I thought Lab 10-X1 was optional, like an extra challenge. Currently, it can give people the wrong idea that MSE is okay to use for any classification problem.
    • Quora has some great answers for anyone who wants to know why it isn't a good idea.
  • It is still good practice to explicitly write down the loss function we are trying to minimize/maximize.
    • Although TensorFlow and other frameworks do the dirty work of computing gradients on their own, you still have to write a correct loss function. What people actually need to learn is how to define a loss function on their own.
    • So I think it's still better to explicitly define the loss function we are using in every file (even if we stick to MSE and do backprop by hand). It might look more complex, but people will already have written and gotten used to many loss functions before coming to Lab 10.
    • For example, in the code, we can write the following and leave a comment on why this L2 loss is not advised for a classification problem.
# Forward
...
loss = tf.reduce_sum(tf.square(y_pred - Y)) / 2  # equivalently: loss = tf.nn.l2_loss(y_pred - Y)
diff = (y_pred - Y)
...
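
A nice side effect of writing the loss explicitly is that the hand-derived gradients can be sanity-checked against TensorFlow's automatic differentiation, e.g. something along these lines (a rough sketch, reusing the names above and assuming sigma_prime is the exact derivative of sigma):

# Sketch: compare the manual gradients with tf.gradients on the explicit loss
auto_d_w1, auto_d_w2 = tf.gradients(loss, [w1, w2])
w1_diff = tf.reduce_max(tf.abs(auto_d_w1 - d_w1))  # should be close to 0
w2_diff = tf.reduce_max(tf.abs(auto_d_w2 - d_w2))  # should be close to 0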

@hunkim (Owner) commented Mar 1, 2017

@kkweon: "I think it's still better to explicitly define the loss function we are using in every file (even if we stick to MSE and do backprop by hand)"

I think it's a very good idea.

@cynthia (Contributor) commented Mar 2, 2017

I second the point that it would be useful for readers if the method used was noted, probably with an inline comment along the lines of "you wouldn't do this in a real-world environment" about the inadequacy of the pieces used. (And then introduce better tools for this later.)

I understand that TensorFlow is the cool thing to do, but I'm a bit curious if it would have been better to do this in raw numpy for beginners. (TF's lazy evaluation and un-Pythonic notation can be confusing even for seasoned Python programmers.)
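
For comparison, a raw numpy version of the same forward/backward pass might look roughly like this (just to illustrate the notation difference; the function and variable names mirror the TF snippet above, and this is not meant as the final replacement):

import numpy as np

def sigma(x):
    # Sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

def sigma_prime(x):
    # Derivative of the sigmoid
    s = sigma(x)
    return s * (1.0 - s)

def forward_backward(X, Y, w1, b1, w2, b2):
    # Forward
    l1 = X.dot(w1) + b1
    a1 = sigma(l1)
    l2 = a1.dot(w2) + b2
    y_pred = sigma(l2)

    # Back prop (chain rule), same squared-error loss as the TF code
    diff = y_pred - Y
    d_l2 = diff * sigma_prime(l2)
    d_w2 = a1.T.dot(d_l2)
    d_b2 = d_l2.sum(axis=0)
    d_a1 = d_l2.dot(w2.T)
    d_l1 = d_a1 * sigma_prime(l1)
    d_w1 = X.T.dot(d_l1)
    d_b1 = d_l1.sum(axis=0)
    return d_w1, d_b1, d_w2, d_b2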

@hunkim (Owner) commented Mar 2, 2017

Thanks for your discussion. I tried to make a slide to explain forward/backprop (for single values). Could you do a quick sanity check for me?

There are many variables (so there might be some typos), but we will see if I can explain it well in my video/lecture at HKUST.

[slide image: forward/backprop derivation for single values]

@kkweon (Collaborator, Author) commented Mar 2, 2017

It looks good, but some dimensions are wrong, which I suppose you are already aware of.

I think it's also worth mentioning how to do a quick dimension check.
We know that the gradient of W2 (= dE/dW2) must have the same dimensions as W2.

So, we know

  • dE/dW2 must have the shape (hidden_dim, output_dim)
    • assuming a1 * W2 = (N, hidden_dim) * (hidden_dim, output_dim)
  • dE/dsigma_2 has the shape of (N, output_dim)
  • a_1 has the shape of (N, hidden_dim)
  • Therefore, we know a_1 must be transposed such that
    (hidden_dim, N) x (N, output_dim) = (hidden_dim, output_dim)
  • dE/dW2 = t(a_1) * dE/dsigma_2

So as long as we know the gradient of W2 must have the same dimensions as W2, we can just focus on ordinary calculus (without worrying about the matrix shapes).
After that, we can correct the dimensions easily (usually by changing the order of computation or adding a transpose), as in the sketch below.
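
To make the check concrete, here is a tiny standalone sketch (the dimensions 32/256/10 are illustrative, not taken from the lab file):

import tensorflow as tf

# d_w2 = a1^T · d_l2 must come out with the same shape as w2: (hidden_dim, output_dim)
N, hidden_dim, output_dim = 32, 256, 10
a1 = tf.placeholder(tf.float32, [N, hidden_dim])
d_l2 = tf.placeholder(tf.float32, [N, output_dim])
d_w2 = tf.matmul(tf.transpose(a1), d_l2)
print(d_w2.get_shape())  # (256, 10), i.e. (hidden_dim, output_dim)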

@hunkim (Owner) commented Mar 2, 2017

This is a matrix version:

[slide image: matrix version of the forward/backprop derivation]

@hunkim (Owner) commented Mar 2, 2017

@kkweon Thanks for the comments. Figure 1 is for single values, so there is no need to worry about dimensions there; I just wanted to show how forward and back prop work with the simple chain rule.

I added Figure 2 for the matrix case, and I believe the dimensions are all correct. Basically, we can write code directly from these rules. Could you do a quick check? Cheers!

For easy comments, I shared the slides + latex code at https://docs.google.com/presentation/d/1_ZmtfEjLmhbuM_PqbDYMXXLAqeWN0HwuhcSKnUQZ6MM/edit?usp=sharing.

@hunkim (Owner) commented Mar 2, 2017

@cynthia I agree. Using TF to write backprop is not the best idea. However, I don't want to introduce new numpy functions such as np.dot, etc.

Do you think we can simplify this code as much as possible? For example, l1 = tf.add(tf.matmul(X, w1), b1) -> l1 = tf.matmul(X, w1) + b1.

It's just my thought. Feel free to add yours.
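
For example, the forward pass could then read (just a sketch of that suggestion, reusing the names from the lab snippet):

# Forward, with Python operators instead of tf.add
l1 = tf.matmul(X, w1) + b1
a1 = sigma(l1)
l2 = tf.matmul(a1, w2) + b2
y_pred = sigma(l2)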

@kkweon (Collaborator, Author) commented Mar 2, 2017

There were two typos (I left comments in the Google slides). Everything else looks good to me.

@cynthia (Contributor) commented Mar 2, 2017

So, here are my two cents: consistency-wise, I'm not sure I fully agree that TF notation is easier to understand than numpy notation.

As for the slides, content-wise I don't think I can add more than what has been mentioned above, but who is your audience? If you want your audience to be at graduate school (or at least CS undergrad) level, the slides are fine. If you want to make the material accessible to everyone, using mathematical notation is not a great idea. (Even the most "obvious" Greek characters are enough to scare away most programmers.)

@hunkim (Owner) commented Mar 2, 2017

@cynthia I see. Perhaps you could make a simple numpy version of Lab 10-X1? I would really appreciate it.

As for the slides, I guess they are for more advanced students/developers. Certainly, they are not for beginners.

@cynthia (Contributor) commented Mar 3, 2017

Sure, that's probably a separate issue; I'll send in a numpy PR when I have time.

As for the remark about advanced students, I think advanced students deserve better datasets. I'm personally a bit uncomfortable with the data used ([1 2 3] -> [1 2 3]), as it's not the best data for demonstrating the characteristics of the underlying algorithms. Obviously, this is a subjective remark from one person, so feel free to ignore it.

Aside from that nit, LGTM. (The LGTM is not for the slides; I haven't looked at them carefully, so I don't have any remarks.)

@hunkim (Owner) commented Mar 3, 2017

@cynthia "I'll send in a numpy PR when I have time." +1

"data used ([1 2 3] -> [1 2 3]) as it's not the best" , agree. However, I used that in my theory lecture part, so it's hard to change in the lab. When I remake the theory video, I'll change it. Thanks for your comments.
