
Lab 10. Back Propagation Implementation. #25

Closed · kkweon opened this issue Mar 1, 2017 · 14 comments

@kkweon (Collaborator) commented Mar 1, 2017

In lab-10-X1-mnist_back_prop.py, back propagation is defined as follows:

# Forward
l1 = tf.add(tf.matmul(X, w1), b1)
a1 = sigma(l1)
l2 = tf.add(tf.matmul(a1, w2), b2)
y_pred = sigma(l2)

diff = (y_pred - Y)

# Back prop (chain rule)
d_l2 = diff * sigma_prime(l2)
d_b2 = d_l2
d_w2 = tf.matmul(tf.transpose(a1), d_l2)
d_a1 = tf.matmul(d_l2, tf.transpose(w2))
d_l1 = d_a1 * sigma_prime(l1)
d_b1 = d_l1
d_w1 = tf.matmul(tf.transpose(X), d_l1)
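
These gradients would then be applied with plain gradient-descent updates; a rough sketch is below (learning_rate is a hypothetical value for this sketch, and the exact update code from the lab file is not quoted here; the bias gradients are averaged over the batch so their shapes match b1 and b2):

learning_rate = 0.5  # hypothetical value for this sketch

# Plain gradient-descent updates built from the hand-computed gradients
step = [
    tf.assign(w2, w2 - learning_rate * d_w2),
    tf.assign(b2, b2 - learning_rate * tf.reduce_mean(d_b2, axis=0)),
    tf.assign(w1, w1 - learning_rate * d_w1),
    tf.assign(b1, b1 - learning_rate * tf.reduce_mean(d_b1, axis=0)),
]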

Problem

This backpropagation is only correct when the loss function is the (halved) squared error:

$$E = \frac{1}{2} \sum (y_{pred} - Y)^2$$

Proof

Current Forward Step:

$$l_1 = X W_1 + b_1, \quad a_1 = \sigma(l_1), \quad l_2 = a_1 W_2 + b_2, \quad y_{pred} = \sigma(l_2)$$

If we assume the loss function above, then by the chain rule

$$\frac{\partial E}{\partial l_2} = \frac{\partial E}{\partial y_{pred}}\,\frac{\partial y_{pred}}{\partial l_2} = (y_{pred} - Y) \odot \sigma'(l_2), \qquad \frac{\partial E}{\partial W_2} = a_1^{T}\,\frac{\partial E}{\partial l_2}$$

which is represented as

d_l2 = diff * sigma_prime(l2)
d_w2 = tf.matmul(tf.transpose(a1), d_l2)

We can continue in the same way for the remaining variables.
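
Spelling these out (they match the d_a1, d_l1, d_w1, and d_b1 lines in the code):

$$\frac{\partial E}{\partial a_1} = \frac{\partial E}{\partial l_2} W_2^{T}, \qquad \frac{\partial E}{\partial l_1} = \frac{\partial E}{\partial a_1} \odot \sigma'(l_1), \qquad \frac{\partial E}{\partial W_1} = X^{T}\,\frac{\partial E}{\partial l_1}, \qquad \frac{\partial E}{\partial b_1} = \frac{\partial E}{\partial l_1}$$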

Conclusion

  1. The current loss is a variant of mean squared error, and this loss is usually used for regression problems.
  2. This is a classification problem (MNIST).
  3. Interestingly, with the current hyperparameters, it works well.
  4. However, if you change the hyperparameters, say by increasing the batch size to something reasonable like 128 or above, it fails to converge because of the wrong loss function (it tries to match all 10 output values at once instead of focusing on the correct label).
  5. I suspect it works because of the characteristics of the MNIST dataset: most of the pixel values are 0, the background of the image.
  6. With the correct cross-entropy loss function, there is no issue with batch size or learning rate (see the sketch after this list).
  7. I suggest that the loss function be clearly defined before going into any backpropagation. This follows Andrej Karpathy's approach as well.
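
For reference, a rough sketch of what the cross-entropy version could look like (not the lab's code; it reuses X, Y, a1, l1, w1, b1, w2, b2, and sigma_prime from the snippet above, swaps the sigmoid output for softmax, and sums the bias gradients over the batch so their shapes match the bias vectors):

# Sketch only: softmax output + cross-entropy loss instead of sigmoid + MSE
l2 = tf.matmul(a1, w2) + b2
y_pred = tf.nn.softmax(l2)
loss = tf.reduce_mean(-tf.reduce_sum(Y * tf.log(y_pred + 1e-8), axis=1))

# Back prop (chain rule); softmax + cross-entropy cancel to (y_pred - Y)
N = tf.cast(tf.shape(X)[0], tf.float32)   # batch size, to match the mean-reduced loss
d_l2 = (y_pred - Y) / N
d_w2 = tf.matmul(tf.transpose(a1), d_l2)
d_b2 = tf.reduce_sum(d_l2, axis=0)
d_a1 = tf.matmul(d_l2, tf.transpose(w2))
d_l1 = d_a1 * sigma_prime(l1)
d_w1 = tf.matmul(tf.transpose(X), d_l1)
d_b1 = tf.reduce_sum(d_l1, axis=0)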
@cynthia (Contributor) commented Mar 1, 2017

MSE will still work, just not very well, as it's less suitable for a multinomial distribution. I'm guessing the choice was made for simplicity, as cross-entropy + softmax would make it harder to follow.

For pedagogical reasons, not introducing extra concepts that are extensions of the base concept being taught (in this case, backprop) is probably better for the audience. (That said, TensorFlow's notation decreases readability significantly with little gain, so there is that...)

@kkweon (Collaborator, Author) commented Mar 1, 2017

I agree it can look tedious, but I thought I'd bring it up for the following reasons:

  • I thought Lab 10-X1 was optional, like an extra challenge. Currently, it can give people the wrong idea that MSE is okay to use for any classification problem.
    • Quora has some great answers for anyone who wants to know why it isn't a good idea.
  • It is still good practice to explicitly write down the loss function we are trying to minimize/maximize.
    • Although TensorFlow and other frameworks do the dirty work of computing gradients on their own, you still have to write a correct loss function. What people actually need to learn is how to define a loss function on their own.
    • So I think it's still better to explicitly define the loss function we are using in every file (even if we stick to MSE and do backprop by hand). It might look more complex, but people will already have written and gotten used to many loss functions before coming to Lab 10.
    • For example, in the code, we can write the following and leave a comment on why this L2 loss is not advised for a classification problem.
# Forward
...
loss = tf.reduce_sum(tf.square(y_pred - Y)) / 2  # equivalently: loss = tf.nn.l2_loss(y_pred - Y)
diff = (y_pred - Y)
...
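
A nice side effect of writing the loss explicitly is that the hand-derived gradients can be sanity-checked against TensorFlow's automatic differentiation, e.g. something along these lines (a rough sketch, reusing the names above and assuming sigma_prime is the exact derivative of sigma):

# Sketch: compare the manual gradients with tf.gradients on the explicit loss
auto_d_w1, auto_d_w2 = tf.gradients(loss, [w1, w2])
w1_diff = tf.reduce_max(tf.abs(auto_d_w1 - d_w1))  # should be close to 0
w2_diff = tf.reduce_max(tf.abs(auto_d_w2 - d_w2))  # should be close to 0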

@hunkim (Owner) commented Mar 1, 2017

@kkweon: "I think it's still better to explicitly define the loss function we are using in every file (even if we stick to MSE and do backprop by hand)"

I think it's a very good idea.

@cynthia (Contributor) commented Mar 2, 2017

I second the point that it would be useful for readers if the method used was noted, probably with an inline comment along the lines of "you wouldn't do this in a real-world environment" about the inadequacy of the pieces used. (And then introduce better tools for this later.)

I understand that TensorFlow is the cool thing to do, but I'm a bit curious if it would have been better to do this in raw numpy for beginners. (TF's lazy evaluation and un-Pythonic notation can be confusing even for seasoned Python programmers.)
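
For comparison, a raw numpy version of the same forward/backward pass might look roughly like this (just to illustrate the notation difference; the function and variable names mirror the TF snippet above, and this is not meant as the final replacement):

import numpy as np

def sigma(x):
    # Sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

def sigma_prime(x):
    # Derivative of the sigmoid
    s = sigma(x)
    return s * (1.0 - s)

def forward_backward(X, Y, w1, b1, w2, b2):
    # Forward
    l1 = X.dot(w1) + b1
    a1 = sigma(l1)
    l2 = a1.dot(w2) + b2
    y_pred = sigma(l2)

    # Back prop (chain rule), same squared-error loss as the TF code
    diff = y_pred - Y
    d_l2 = diff * sigma_prime(l2)
    d_w2 = a1.T.dot(d_l2)
    d_b2 = d_l2.sum(axis=0)
    d_a1 = d_l2.dot(w2.T)
    d_l1 = d_a1 * sigma_prime(l1)
    d_w1 = X.T.dot(d_l1)
    d_b1 = d_l1.sum(axis=0)
    return d_w1, d_b1, d_w2, d_b2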

@hunkim (Owner) commented Mar 2, 2017

Thanks for your discussion. I tried to make a slide to explain forward/backprop (for single values). Could you do a quick sanity check for me?

There are many variables (so there might be some typos), but we will see if I can explain it well in my video/lecture at HKUST.

[slide image: forward/backprop derivation for single values]

@kkweon (Collaborator, Author) commented Mar 2, 2017

It looks good, but some dimensions are wrong, which I suppose you are already aware of.

I think it's also worth mentioning how to do a quick dimension check.
We know that the gradient of W2 (= dE/dW2) must have the same dimensions as W2.

So, we know

  • dE/dW2 must have the shape (hidden_dim, output_dim)
    • assuming a1 * W2 = (N, hidden_dim) * (hidden_dim, output_dim)
  • dE/dsigma_2 has the shape of (N, output_dim)
  • a_1 has the shape of (N, hidden_dim)
  • Therefore, we know a_1 must be transposed such that
    (hidden_dim, N) x (N, output_dim) = (hidden_dim, output_dim)
  • dE/dW2 = t(a_1) * dE/dsigma_2

So as long as we know the gradient of W2 must have the same dimensions as W2, we can just focus on ordinary calculus (without worrying about the matrix shapes).
After that, we can correct the dimensions easily (usually by changing the order of computation or adding a transpose), as in the sketch below.
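
To make the check concrete, here is a tiny standalone sketch (the dimensions 32/256/10 are illustrative, not taken from the lab file):

import tensorflow as tf

# d_w2 = a1^T · d_l2 must come out with the same shape as w2: (hidden_dim, output_dim)
N, hidden_dim, output_dim = 32, 256, 10
a1 = tf.placeholder(tf.float32, [N, hidden_dim])
d_l2 = tf.placeholder(tf.float32, [N, output_dim])
d_w2 = tf.matmul(tf.transpose(a1), d_l2)
print(d_w2.get_shape())  # (256, 10), i.e. (hidden_dim, output_dim)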

@hunkim (Owner) commented Mar 2, 2017

This is a matrix version:

[slide image: matrix version of the forward/backprop derivation]

@hunkim (Owner) commented Mar 2, 2017

@kkweon Thanks for the comments. Figure 1 is for single values, so there is no need to worry about dimensions there; I just wanted to show how forward and back prop work with the simple chain rule.

I added Figure 2 for the matrix case, and I believe the dimensions are all correct. Basically, we can write code directly from these rules. Could you do a quick check? Cheers!

For easy comments, I shared the slides + latex code at https://docs.google.com/presentation/d/1_ZmtfEjLmhbuM_PqbDYMXXLAqeWN0HwuhcSKnUQZ6MM/edit?usp=sharing.

@hunkim (Owner) commented Mar 2, 2017

@cynthia I agree. Using TF to write backprop is not the best idea. However, I don't want to introduce new numpy functions such as np.dot, etc.

Do you think we can simplify this code as much as possible? For example, l1 = tf.add(tf.matmul(X, w1), b1) -> l1 = tf.matmul(X, w1) + b1.

It's just my thought. Feel free to add yours.
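
For example, the forward pass could then read (just a sketch of that suggestion, reusing the names from the lab snippet):

# Forward, with Python operators instead of tf.add
l1 = tf.matmul(X, w1) + b1
a1 = sigma(l1)
l2 = tf.matmul(a1, w2) + b2
y_pred = sigma(l2)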

@kkweon (Collaborator, Author) commented Mar 2, 2017

There were two typos (I left comments in the Google slides). Everything else looks good to me.

@cynthia (Contributor) commented Mar 2, 2017

So, here are my two cents: consistency-wise, I'm not sure I fully agree that TF notation is easier to understand than numpy notation.

As for the slides, content-wise I don't think I can add more than what has been mentioned above, but who is your audience? If you want your audience to be at graduate school (or at least CS undergrad) level, the slides are fine. If you want to make the material accessible to everyone, using mathematical notation is not a great idea. (Even the most "obvious" Greek characters are enough to scare away most programmers.)

@hunkim (Owner) commented Mar 2, 2017

@cynthia I see. Perhaps you could make a simple numpy version of Lab 10-X1? I would really appreciate it.

As for the slides, I guess they are for more advanced students/developers. Certainly, they are not for beginners.

@cynthia (Contributor) commented Mar 3, 2017

Sure, that's probably a separate issue; I'll send in a numpy PR when I have time.

As for the remark about advanced students, I think advanced students deserve better datasets. I'm personally a bit uncomfortable with the data used ([1 2 3] -> [1 2 3]), as it's not the best data for demonstrating the characteristics of the underlying algorithms. Obviously, this is a subjective remark from one person, so feel free to ignore it.

Aside from that nit, LGTM. (The LGTM is not for the slides; I haven't looked at them carefully, so I don't have any remarks.)

@hunkim (Owner) commented Mar 3, 2017

@cynthia "I'll send in a numpy PR when I have time." +1

"data used ([1 2 3] -> [1 2 3]) as it's not the best" , agree. However, I used that in my theory lecture part, so it's hard to change in the lab. When I remake the theory video, I'll change it. Thanks for your comments.
