Our manual backprop weight average is missing #173

Closed
hunkim opened this issue May 27, 2017 · 12 comments
hunkim (Owner) commented May 27, 2017

For example,

https://github.com/hunkim/DeepLearningZeroToAll/blob/master/lab-09-x-xor-nn-back_prop.py

d_W1 = tf.matmul(tf.transpose(X), d_l1)

X's shape is (?, 2), so X^T's shape is (2, ?), and d_l1's shape is (?, 2). The shape of d_W1 is (2, 2) as it should be, but its values are proportional to the sample size, because the matmul sums over the batch dimension. We need to average these values.
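
A standalone NumPy sketch (not the lab code) of the scaling: doubling the batch doubles every entry of X^T @ d_l1, so without dividing by N the effective step size grows with the sample size.

import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])     # (4, 2), like the XOR inputs
d_l1 = np.ones_like(X)                                      # stand-in upstream gradient, (4, 2)

g_small = X.T @ d_l1                                        # (2, 2) gradient, batch of 4
g_big = np.vstack([X, X]).T @ np.vstack([d_l1, d_l1])       # same data, batch of 8

print(g_big / g_small)                                      # every entry is 2.0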

The current sample size is only 4, so it's OK, but when the sample size is large it does not work.

To reproduce, add this code:

# Double the data 12 times (4 * 2^12 = 16384 samples)
for _ in range(12):
    x_data = np.vstack([x_data, x_data])
    y_data = np.vstack([y_data, y_data])

FYI, d_b values are averaged:
tf.reduce_mean(d_b1, axis=[0])
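
For comparison, reduce_mean over axis 0 is exactly the sum over the batch divided by N, i.e. the same normalization d_W1 is missing. A standalone NumPy sketch, not the lab code:

import numpy as np

d_b1 = np.arange(8.0).reshape(4, 2)       # pretend per-sample bias gradients, N = 4
print(d_b1.mean(axis=0))                  # what tf.reduce_mean(d_b1, axis=[0]) computes
print(d_b1.sum(axis=0) / d_b1.shape[0])   # identical: batch sum divided by N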

kkweon (Collaborator) commented May 27, 2017

We just need to divide by the batch size.

N = tf.cast(tf.shape(X)[0], tf.float32)  # batch size; X's first dimension is None, so take it from tf.shape
d_W1 = tf.matmul(tf.transpose(X), d_l1) / N

hunkim (Owner, Author) commented May 27, 2017

How about something like this? I removed sigma_prime.

W1 = tf.Variable(tf.random_normal([2, 2]), name='weight1')
b1 = tf.Variable(tf.random_normal([2]), name='bias1')
layer1 = tf.sigmoid(tf.matmul(X, W1) + b1)

W2 = tf.Variable(tf.random_normal([2, 1]), name='weight2')
b2 = tf.Variable(tf.random_normal([1]), name='bias2')
Y_pred = tf.sigmoid(tf.matmul(layer1, W2) + b2)

# cost/loss function
cost = -tf.reduce_mean(Y * tf.log(Y_pred) + (1 - Y) *
                       tf.log(1 - Y_pred))

d_Y_pred = (Y_pred - Y) / (Y_pred * (1.0 - Y_pred) + 1e-7)
d_sigma = Y_pred * (1 - Y_pred)

# Layer 2
d_l2 = tf.multiply(d_Y_pred, d_sigma)
d_b2 = d_l2
d_W2 = tf.matmul(tf.transpose(layer1), d_l2)

# Mean
d_b2_mean = tf.reduce_mean(d_b2, axis=[0])
d_W2_mean = d_W2 / tf.cast(tf.shape(layer1)[0], dtype=tf.float32)

# Layer 1
d_o1 = layer1 * (1-layer1)
d_l1 = tf.multiply(tf.matmul(d_l2, tf.transpose(W2)), d_o1)
d_b1 = d_l1
d_W1 = tf.matmul(tf.transpose(X), d_l1)

# Mean
d_W1_mean = d_W1 / tf.cast(tf.shape(X)[0], dtype=tf.float32)
d_b1_mean = tf.reduce_mean(d_b1, axis=[0])

# Weight update
step = [
  tf.assign(W2, W2 - learning_rate * d_W2_mean),
  tf.assign(b2, b2 - learning_rate * d_b2_mean),
  tf.assign(W1, W1 - learning_rate * d_W1_mean),
  tf.assign(b1, b1 - learning_rate * d_b1_mean)
]
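
For reference, the d_Y_pred line is just the derivative of the binary cross-entropy loss with respect to Y_pred (writing Y_pred as $\hat{y}$); the 1e-7 only keeps the denominator away from zero:

$$
\frac{\partial}{\partial \hat{y}}\Bigl[-\bigl(y\log\hat{y} + (1-y)\log(1-\hat{y})\bigr)\Bigr]
= -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}
= \frac{\hat{y}-y}{\hat{y}\,(1-\hat{y})}
$$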

kkweon (Collaborator) commented May 27, 2017

I don't have a machine to run a test right now, but I guess it will work.
However, I'm not sure it's better; it already looks quite complicated to me.

hunkim (Owner, Author) commented May 28, 2017

@kkweon Do you need a machine to run? :-) I think your brain is enough.

The previous code starts with diff = hypothesis - Y, which is hard to understand.
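
One way to read it: with a sigmoid output and cross-entropy loss, the sigmoid derivative cancels against the loss derivative, so the gradient with respect to the pre-sigmoid value z collapses to exactly hypothesis - Y:

$$
\frac{\partial \text{cost}}{\partial z}
= \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}\cdot\hat{y}(1-\hat{y})
= \hat{y}-y,
\qquad \hat{y}=\sigma(z)
$$

But it skips the intermediate steps, which is what makes it hard to follow.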

Let me know if you have any refactoring suggestions.

kkweon (Collaborator) commented May 28, 2017

@hunkim
I personally prefer d_Y_pred * d_sigma over tf.multiply(d_Y_pred, d_sigma), because

  • it reads more naturally
  • it's safer, since every basic operator is overloaded and tested on the tf.Tensor class
  • it's less verbose

As you may remember, when TensorFlow turned 1.0, all the basic operations were renamed, and people who used tf.mul had to manually fix their code to tf.multiply.

hunkim (Owner, Author) commented May 28, 2017

Refactored:

# cost/loss function                                                          
cost = -tf.reduce_mean(Y * tf.log(Y_pred) + (1 - Y) *                         
                       tf.log(1 - Y_pred))                                    
                                                                              
# Loss derivative                                                             
d_Y_pred = (Y_pred - Y) / (Y_pred * (1.0 - Y_pred) + 1e-7)                    
                                                                              
# Layer 2                                                                     
d_sigma2 = Y_pred * (1 - Y_pred)                                              
d_l2 = d_Y_pred * d_sigma2                                                    
d_b2 = d_l2                                                                   
d_W2 = tf.matmul(tf.transpose(layer1), d_l2)                                  
                                                                              
# Mean                                                                        
d_b2_mean = tf.reduce_mean(d_b2, axis=[0])                                    
d_W2_mean = d_W2 / tf.cast(tf.shape(layer1)[0], dtype=tf.float32)             
                                                                              
# Layer 1                                                                     
d_sigma1 = layer1 * (1-layer1)                                                
d_l1 = d_l2 * d_sigma1                                                        
d_b1 = d_l1                                                                   
d_W1 = tf.matmul(tf.transpose(X), d_l1)                                       
                                                                              
# Mean                                                                        
d_W1_mean = d_W1 / tf.cast(tf.shape(X)[0], dtype=tf.float32)                  
d_b1_mean = tf.reduce_mean(d_b1, axis=[0])                                    
                                                                              
# Weight update                                                               
step = [                                                                      
  tf.assign(W2, W2 - learning_rate * d_W2_mean),                              
  tf.assign(b2, b2 - learning_rate * d_b2_mean),                              
  tf.assign(W1, W1 - learning_rate * d_W1_mean),                              
  tf.assign(b1, b1 - learning_rate * d_b1_mean)                               
]                                                                             

Does this make sense?

kkweon (Collaborator) commented May 28, 2017

Looks good. autopep8 will do the rest.

hunkim (Owner, Author) commented May 28, 2017

@kkweon This is the right version (the refactored code above dropped the tf.matmul with tf.transpose(W2) in the layer-1 gradient):

# Network                                                              
#          p1     a1           l1     p2     a2           l2 (y_pred)  
# X -> (*) -> (+) -> (sigmoid) -> (*) -> (+) -> (sigmoid) -> (loss)    
#       ^      ^                   ^      ^                            
#       |      |                   |      |                            
#       W1     b1                  W2     b2                           
                                                                       
# Loss derivative                                                      
d_Y_pred = (Y_pred - Y) / (Y_pred * (1.0 - Y_pred) + 1e-7)             
                                                                       
# Layer 2                                                              
d_sigma2 = Y_pred * (1 - Y_pred)                                       
d_a2 = d_Y_pred * d_sigma2                                             
d_p2 = d_a2                                                            
d_b2 = d_a2                                                            
d_W2 = tf.matmul(tf.transpose(l1), d_p2)                               
                                                                       
# Mean                                                                 
d_b2_mean = tf.reduce_mean(d_b2, axis=[0])                             
d_W2_mean = d_W2 / tf.cast(tf.shape(l1)[0], dtype=tf.float32)          
                                                                       
# Layer 1                                                              
d_l1 = tf.matmul(d_p2, tf.transpose(W2))                               
d_sigma1 = l1 * (1 - l1)                                               
d_a1 = d_l1 * d_sigma1                                                 
d_b1 = d_a1                                                            
d_p1 = d_a1                                                            
d_W1 = tf.matmul(tf.transpose(X), d_a1)                                
                                                                       
# Mean                                                                 
d_W1_mean = d_W1 / tf.cast(tf.shape(X)[0], dtype=tf.float32)           
d_b1_mean = tf.reduce_mean(d_b1, axis=[0])                             
                                                                       
# Weight update                                                        
step = [                                                               
  tf.assign(W2, W2 - learning_rate * d_W2_mean),                       
  tf.assign(b2, b2 - learning_rate * d_b2_mean),                       
  tf.assign(W1, W1 - learning_rate * d_W1_mean),                       
  tf.assign(b1, b1 - learning_rate * d_b1_mean)                        
]                                                                      

Can you run it in your brain?
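
One way to check it outside a brain (a sketch, assuming cost, X, Y, x_data, y_data, and numpy as np are defined as in the lab file): compare the manual gradients against tf.gradients. They should agree up to the 1e-7 guard in d_Y_pred, so the max absolute difference should be tiny (around 1e-6 or smaller).

# Autodiff gradients of the same cost, in the same order as the manual ones
auto_grads = tf.gradients(cost, [W1, b1, W2, b2])
manual_grads = [d_W1_mean, d_b1_mean, d_W2_mean, d_b2_mean]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    auto_vals, manual_vals = sess.run([auto_grads, manual_grads],
                                      feed_dict={X: x_data, Y: y_data})
    for a, m in zip(auto_vals, manual_vals):
        print(np.max(np.abs(a - m)))   # tiny differences only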

kkweon (Collaborator) commented May 28, 2017

Yes, the comment really helped. It looks great.
If I can, anyone should be able to run it in their head. So it's awesome.

hunkim (Owner, Author) commented May 28, 2017

@kkweon Do you like the naming? p is for product and a is for addition.

kkweon (Collaborator) commented May 28, 2017

@hunkim It should be fine with the comment. Honestly, I first thought a was the name for an activation layer, but I was able to figure it out by reading the comment.

hunkim (Owner, Author) commented May 28, 2017

@kkweon I still don't like the names. Let me know if you have any suggestions.

hunkim closed this as completed May 31, 2017