Our manual backprop weight average is missing #173

Closed
hunkim opened this issue May 27, 2017 · 12 comments
hunkim (Owner) commented May 27, 2017

For example,

https://github.com/hunkim/DeepLearningZeroToAll/blob/master/lab-09-x-xor-nn-back_prop.py

d_W1 = tf.matmul(tf.transpose(X), d_l1)

X's shape is (?, 2), so X^T's shape is (2, ?), and d_l1's shape is (?, 2). The shape of d_W1 is (2, 2) as it should be, but its values are proportional to the sample size, because the matmul sums over the batch dimension. We need to average these values.
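
A standalone NumPy sketch (not the lab code) of the scaling: doubling the batch doubles every entry of X^T @ d_l1, so without dividing by N the effective step size grows with the sample size.

import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])     # (4, 2), like the XOR inputs
d_l1 = np.ones_like(X)                                      # stand-in upstream gradient, (4, 2)

g_small = X.T @ d_l1                                        # (2, 2) gradient, batch of 4
g_big = np.vstack([X, X]).T @ np.vstack([d_l1, d_l1])       # same data, batch of 8

print(g_big / g_small)                                      # every entry is 2.0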

The current sample size is only 4, so it's OK, but when the sample size is large it does not work.

To reproduce, add this code:

# Double the data 12 times (4 * 2^12 = 16384 samples)
for _ in range(12):
    x_data = np.vstack([x_data, x_data])
    y_data = np.vstack([y_data, y_data])

FYI, d_b values are averaged:
tf.reduce_mean(d_b1, axis=[0])
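
For comparison, reduce_mean over axis 0 is exactly the sum over the batch divided by N, i.e. the same normalization d_W1 is missing. A standalone NumPy sketch, not the lab code:

import numpy as np

d_b1 = np.arange(8.0).reshape(4, 2)       # pretend per-sample bias gradients, N = 4
print(d_b1.mean(axis=0))                  # what tf.reduce_mean(d_b1, axis=[0]) computes
print(d_b1.sum(axis=0) / d_b1.shape[0])   # identical: batch sum divided by N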

kkweon (Collaborator) commented May 27, 2017

We just need to divide by the batch size.

N = tf.cast(tf.shape(X)[0], tf.float32)  # batch size; X's first dimension is None, so take it from tf.shape
d_W1 = tf.matmul(tf.transpose(X), d_l1) / N

hunkim (Owner, Author) commented May 27, 2017

How about something like this? I removed sigma_prime.

W1 = tf.Variable(tf.random_normal([2, 2]), name='weight1')
b1 = tf.Variable(tf.random_normal([2]), name='bias1')
layer1 = tf.sigmoid(tf.matmul(X, W1) + b1)

W2 = tf.Variable(tf.random_normal([2, 1]), name='weight2')
b2 = tf.Variable(tf.random_normal([1]), name='bias2')
Y_pred = tf.sigmoid(tf.matmul(layer1, W2) + b2)

# cost/loss function
cost = -tf.reduce_mean(Y * tf.log(Y_pred) + (1 - Y) *
                       tf.log(1 - Y_pred))

d_Y_pred = (Y_pred - Y) / (Y_pred * (1.0 - Y_pred) + 1e-7)
d_sigma = Y_pred * (1 - Y_pred)

# Layer 2
d_l2 = tf.multiply(d_Y_pred, d_sigma)
d_b2 = d_l2
d_W2 = tf.matmul(tf.transpose(layer1), d_l2)

# Mean
d_b2_mean = tf.reduce_mean(d_b2, axis=[0])
d_W2_mean = d_W2 / tf.cast(tf.shape(layer1)[0], dtype=tf.float32)

# Layer 1
d_o1 = layer1 * (1-layer1)
d_l1 = tf.multiply(tf.matmul(d_l2, tf.transpose(W2)), d_o1)
d_b1 = d_l1
d_W1 = tf.matmul(tf.transpose(X), d_l1)

# Mean
d_W1_mean = d_W1 / tf.cast(tf.shape(X)[0], dtype=tf.float32)
d_b1_mean = tf.reduce_mean(d_b1, axis=[0])

# Weight update
step = [
  tf.assign(W2, W2 - learning_rate * d_W2_mean),
  tf.assign(b2, b2 - learning_rate * d_b2_mean),
  tf.assign(W1, W1 - learning_rate * d_W1_mean),
  tf.assign(b1, b1 - learning_rate * d_b1_mean)
]
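
For reference, the d_Y_pred line is just the derivative of the binary cross-entropy loss with respect to Y_pred (writing Y_pred as $\hat{y}$); the 1e-7 only keeps the denominator away from zero:

$$
\frac{\partial}{\partial \hat{y}}\Bigl[-\bigl(y\log\hat{y} + (1-y)\log(1-\hat{y})\bigr)\Bigr]
= -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}
= \frac{\hat{y}-y}{\hat{y}\,(1-\hat{y})}
$$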

kkweon (Collaborator) commented May 27, 2017

I don't have a machine to run a test right now, but I guess it will work.
However, I'm not sure it's better; it already looks quite complicated to me.

hunkim (Owner, Author) commented May 28, 2017

@kkweon Do you need a machine to run? :-) I think your brain is enough.

The previous code starts with diff = hypothesis - Y, which is hard to understand.
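
One way to read it: with a sigmoid output and cross-entropy loss, the sigmoid derivative cancels against the loss derivative, so the gradient with respect to the pre-sigmoid value z collapses to exactly hypothesis - Y:

$$
\frac{\partial \text{cost}}{\partial z}
= \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}\cdot\hat{y}(1-\hat{y})
= \hat{y}-y,
\qquad \hat{y}=\sigma(z)
$$

But it skips the intermediate steps, which is what makes it hard to follow.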

Let me know if you have any refactoring suggestions.

kkweon (Collaborator) commented May 28, 2017

@hunkim
I personally prefer d_Y_pred * d_sigma over tf.multiply(d_Y_pred, d_sigma), because

  • it reads more naturally
  • it's safer, since every basic operator is overloaded and tested on the tf.Tensor class
  • it's less verbose

As you may remember, when TensorFlow turned 1.0, all the basic operations were renamed, and people who used tf.mul had to manually fix their code to tf.multiply.

hunkim (Owner, Author) commented May 28, 2017

Refactored:

# cost/loss function                                                          
cost = -tf.reduce_mean(Y * tf.log(Y_pred) + (1 - Y) *                         
                       tf.log(1 - Y_pred))                                    
                                                                              
# Loss derivative                                                             
d_Y_pred = (Y_pred - Y) / (Y_pred * (1.0 - Y_pred) + 1e-7)                    
                                                                              
# Layer 2                                                                     
d_sigma2 = Y_pred * (1 - Y_pred)                                              
d_l2 = d_Y_pred * d_sigma2                                                    
d_b2 = d_l2                                                                   
d_W2 = tf.matmul(tf.transpose(layer1), d_l2)                                  
                                                                              
# Mean                                                                        
d_b2_mean = tf.reduce_mean(d_b2, axis=[0])                                    
d_W2_mean = d_W2 / tf.cast(tf.shape(layer1)[0], dtype=tf.float32)             
                                                                              
# Layer 1                                                                     
d_sigma1 = layer1 * (1-layer1)                                                
d_l1 = d_l2 * d_sigma1                                                        
d_b1 = d_l1                                                                   
d_W1 = tf.matmul(tf.transpose(X), d_l1)                                       
                                                                              
# Mean                                                                        
d_W1_mean = d_W1 / tf.cast(tf.shape(X)[0], dtype=tf.float32)                  
d_b1_mean = tf.reduce_mean(d_b1, axis=[0])                                    
                                                                              
# Weight update                                                               
step = [                                                                      
  tf.assign(W2, W2 - learning_rate * d_W2_mean),                              
  tf.assign(b2, b2 - learning_rate * d_b2_mean),                              
  tf.assign(W1, W1 - learning_rate * d_W1_mean),                              
  tf.assign(b1, b1 - learning_rate * d_b1_mean)                               
]                                                                             

Does this make sense?

kkweon (Collaborator) commented May 28, 2017

Looks good. autopep8 will do the rest.

hunkim (Owner, Author) commented May 28, 2017

@kkweon This is the right version (the refactored code above dropped the tf.matmul with tf.transpose(W2) in the layer-1 gradient):

# Network                                                              
#          p1     a1           l1     p2     a2           l2 (y_pred)  
# X -> (*) -> (+) -> (sigmoid) -> (*) -> (+) -> (sigmoid) -> (loss)    
#       ^      ^                   ^      ^                            
#       |      |                   |      |                            
#       W1     b1                  W2     b2                           
                                                                       
# Loss derivative                                                      
d_Y_pred = (Y_pred - Y) / (Y_pred * (1.0 - Y_pred) + 1e-7)             
                                                                       
# Layer 2                                                              
d_sigma2 = Y_pred * (1 - Y_pred)                                       
d_a2 = d_Y_pred * d_sigma2                                             
d_p2 = d_a2                                                            
d_b2 = d_a2                                                            
d_W2 = tf.matmul(tf.transpose(l1), d_p2)                               
                                                                       
# Mean                                                                 
d_b2_mean = tf.reduce_mean(d_b2, axis=[0])                             
d_W2_mean = d_W2 / tf.cast(tf.shape(l1)[0], dtype=tf.float32)          
                                                                       
# Layer 1                                                              
d_l1 = tf.matmul(d_p2, tf.transpose(W2))                               
d_sigma1 = l1 * (1 - l1)                                               
d_a1 = d_l1 * d_sigma1                                                 
d_b1 = d_a1                                                            
d_p1 = d_a1                                                            
d_W1 = tf.matmul(tf.transpose(X), d_a1)                                
                                                                       
# Mean                                                                 
d_W1_mean = d_W1 / tf.cast(tf.shape(X)[0], dtype=tf.float32)           
d_b1_mean = tf.reduce_mean(d_b1, axis=[0])                             
                                                                       
# Weight update                                                        
step = [                                                               
  tf.assign(W2, W2 - learning_rate * d_W2_mean),                       
  tf.assign(b2, b2 - learning_rate * d_b2_mean),                       
  tf.assign(W1, W1 - learning_rate * d_W1_mean),                       
  tf.assign(b1, b1 - learning_rate * d_b1_mean)                        
]                                                                      

Can you run it in your brain?
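
One way to check it outside a brain (a sketch, assuming cost, X, Y, x_data, y_data, and numpy as np are defined as in the lab file): compare the manual gradients against tf.gradients. They should agree up to the 1e-7 guard in d_Y_pred, so the max absolute difference should be tiny (around 1e-6 or smaller).

# Autodiff gradients of the same cost, in the same order as the manual ones
auto_grads = tf.gradients(cost, [W1, b1, W2, b2])
manual_grads = [d_W1_mean, d_b1_mean, d_W2_mean, d_b2_mean]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    auto_vals, manual_vals = sess.run([auto_grads, manual_grads],
                                      feed_dict={X: x_data, Y: y_data})
    for a, m in zip(auto_vals, manual_vals):
        print(np.max(np.abs(a - m)))   # tiny differences only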

kkweon (Collaborator) commented May 28, 2017

Yes, the comment really helped. It looks great.
If I can, anyone should be able to run it in their head. So it's awesome.

hunkim (Owner, Author) commented May 28, 2017

@kkweon Do you like the naming? p is for product and a is for addition.

kkweon (Collaborator) commented May 28, 2017

@hunkim It should be fine with the comment. Honestly, I first thought a was the name for an activation layer, but I was able to figure it out by reading the comment.

hunkim (Owner, Author) commented May 28, 2017

@kkweon I still don't like the names. Let me know if you have any suggestions.

hunkim closed this as completed May 31, 2017