
bias+relu on padded vertices #17

Closed
maosi-chen opened this issue Oct 26, 2017 · 4 comments

maosi-chen commented Oct 26, 2017

Thank you for sharing your code. I really enjoy it (as well as your paper)!

For the bias + relu step (e.g. b1relu), I'm concerned about the effect of the bias on the padded vertices (fake nodes), especially when there are many consecutive ones. For any layer before the last cnn_graph layer, the effect of the bias may be eliminated to some extent because the next layer's L is padded with 0 at the corresponding positions (although x0 in chebyshev5 is still affected). However, for the last cnn_graph layer, the biases on the padded vertices are kept and fed into the fully connected layer.

My question is: should we mask the padded vertices (i.e. set them to 0) after adding the bias to the output of the convolution and before the ReLU?
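To make the question concrete, here is roughly what I have in mind (a sketch only; the mask tensor and the function name are mine, not from the repo):

import tensorflow as tf

def b1relu_masked(x, b, mask):
    # x:    [N, M, F] convolution output, M includes the padded (fake) vertices
    # b:    [1, 1, F] one bias per output feature, as in b1relu
    # mask: [1, M, 1] with 1.0 on real vertices and 0.0 on fake ones
    y = tf.nn.relu(x + b)
    return y * mask  # reset fake vertices to zero so the bias cannot leak further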

Thanks!
Maosi Chen

@mdeff
Copy link
Owner

mdeff commented Oct 30, 2017

Thanks for your interest.

There should be no fake nodes in the last cnn_graph layer, so they will not impact the fully connected layer.

In the previous cnn_graph layers, fake nodes are disconnected (i.e. not connected to any other node), so they don't influence the graph convolution. Because they are initialized to zero, they will never be chosen by the max pooling, and as such the gradient will never flow to their bias, which will stay at zero as well.
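A toy illustration of that argument (assuming, as with the rescaled Laplacians used here with lmax=2, that a disconnected vertex corresponds to a zero row and column):

import numpy as np

L = np.array([[0.5, 0.2, 0.0],
              [0.2, 0.5, 0.0],
              [0.0, 0.0, 0.0]])  # vertex 2 is a disconnected fake vertex
x0 = np.array([1.0, 2.0, 0.0])   # fake vertex initialized to zero
x1 = L @ x0                      # first Chebyshev term
x2 = 2 * L @ x1 - x0             # second Chebyshev term
print(x1[2], x2[2])              # both 0.0: the fake vertex never receives a value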

I'm not sure I fully answered your question, feel free to follow up otherwise.

@maosi-chen
Copy link
Author

maosi-chen commented Oct 30, 2017

Thank you for addressing my concern.

Maybe the issue is caused by my mistakes in adapting the code to my application. My application is the spatial interpolation of hourly surface ozone concentration (i.e. my graph signal) observed at EPA ground stations. The number of stations (16-38 within a 6-degree box around Oklahoma) varies with the date. For a training sample, the inputs are the location of a target station (any one of them) plus the locations and ozone observations of the remaining stations, and the output is the ozone concentration at the target station. To reduce the computational burden during training, I prepare the Laplacian matrices for each sample before training using your library functions (e.g. graph.distance_sklearn_metrics, graph.adjacency, coarsening.coarsen, graph.laplacian, graph.rescale_L, coarsening.perm_data). Since the number of stations (vertices) varies, I think if I want to process the samples by batch during training, I need to construct the Laplacian matrices with the same sizes at the corresponding coarsening levels across samples. As it turns out, the minimal number of coarsening levels needed to accomplish this is 7 (i.e. 8 Laplacian matrices per sample), and the first layer has 128 nodes (i.e. >90 fake nodes). The attachment (sample1.txt) contains one sample, which includes the raw data (the target and remaining points' positions and surface ozone amounts) and the 8 Laplacian matrices calculated from them. For training, I made the necessary modifications to chebyshev5 to perform the convolution by batch (please see chebyshev5_batch below).
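For concreteness, here is roughly how I prepare one sample (a sketch only; the signatures of the lib functions are assumed from lib/graph.py and lib/coarsening.py and may not match your API exactly):

import numpy as np
from lib import graph, coarsening  # assumed import path for the repo's lib/ folder

def prepare_sample(coords, ozone, levels=7, k=8):
    # coords: [M0, 2] station locations; ozone: [M0] graph signal for one sample
    dist, idx = graph.distance_sklearn_metrics(coords, k=k)        # k-NN distances
    A = graph.adjacency(dist, idx)                                 # sparse weighted adjacency
    graphs, perm = coarsening.coarsen(A, levels=levels, self_connections=False)
    x = coarsening.perm_data(ozone[np.newaxis, :], perm)           # pads fake vertices with 0
    laplacians = [graph.rescale_L(graph.laplacian(A_l, normalized=True), lmax=2)
                  for A_l in graphs]                               # one Laplacian per level
    return x, laplacians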

If the process above is correct, I'm still concerned about the bias on fake nodes.
For cnn_graph layer 0, both x and L0 are correctly padded with zeros on the fake nodes, so there is no problem with the convolution. But I don't think we can guarantee that the results of the convolution (y in chebyshev5_batch) are always greater than 0 on the real nodes. Assume the real nodes get positive values and the fake nodes get zeros in y, and a bias is added to each output feature. If there is at least one real node in a pooling window, the fake nodes in that window will not be chosen; but if a pooling window contains only fake nodes, the bias will survive and be passed to the next cnn_graph layer. In that case, the next layer's x0 will carry the bias value on the fake nodes (where it should be 0); x1 = L1 * x0 is still correct because L1 has 0s on the fake nodes; for x2 = 2 * L1 * x1 - x0, the L1 * x1 part is still correct, but the bias shows up again through the x0 term, and x3, ..., xM inherit the same problem.
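Reusing a toy Laplacian like the one above, but now letting a bias survive a pooling window of fake nodes (not repo code; the numbers are made up):

import numpy as np

L1 = np.array([[0.5, 0.2, 0.0],
               [0.2, 0.5, 0.0],
               [0.0, 0.0, 0.0]])  # vertex 2 is a fake vertex (zero row and column)
b = 0.3                           # a positive bias that survived an all-fake pooling window
x0 = np.array([1.0, 2.0, b])      # next layer's input: the fake vertex carries b instead of 0
x1 = L1 @ x0                      # still clean: the zero column keeps b out of x1
x2 = 2 * L1 @ x1 - x0             # the -x0 term reintroduces the bias
print(x1[2], x2[2])               # 0.0 and -0.3: the bias persists in the higher-order terms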

I'm sorry for the long write-up. Could you help me find where I went wrong regarding the bias? Also, could you elaborate on how to use an attention mechanism to convert variable-length features into fixed-length features before feeding them into the fully connected layers?

Thank you so much!
Maosi

'''
chebyshev5_batch
Purpose:
    perform the graph filtering on the given layer on a batch.
Args:
    x: the batch of inputs for the given layer,
       dense tensor, size: [N, M, Fin].
    L: the batch of sorted Laplacians of the given layer (tf.Tensor),
       dense tensor, size: [N, M, M].
    Fout: the number of output features on the given layer.
    K: the filter size, i.e. the number of hops on the given layer.
    lyr_num: the index of the original Laplacian layer (starting from 0).
Output:
    y: the filtered output from the given layer.
'''
import tensorflow as tf  # TF 1.x API, as in the original repo

def chebyshev5_batch(x, L, Fout, K, lyr_num):
    N, M, Fin = get_shape(x)

    def expand_concat(orig, new):
        new = tf.expand_dims(new, 0)            # 1 x N x M x Fin
        return tf.concat([orig, new], axis=0)   # (shape(orig)[0] + 1) x N x M x Fin

    # Chebyshev recurrence: x_0 = x, x_1 = L x_0, x_k = 2 L x_{k-1} - x_{k-2}
    x0 = x                                      # N x M x Fin
    stk_x = tf.expand_dims(x0, axis=0)          # 1 x N x M x Fin (eventually K x N x M x Fin, if K > 1)

    if K > 1:
        x1 = tf.matmul(L, x0)                   # batched matmul: N x M x Fin
        stk_x = expand_concat(stk_x, x1)
    for kk in range(2, K):
        x2 = 2 * tf.matmul(L, x1) - x0          # N x M x Fin
        stk_x = expand_concat(stk_x, x2)
        x0 = x1
        x1 = x2

    # stk_x has shape K x N x M x Fin; bring it to N x M x Fin x K,
    # then collapse the vertex dimension for the weight multiplication.
    stk_x_transp = tf.transpose(stk_x, perm=[1, 2, 3, 0])
    stk_x_forMul = tf.reshape(stk_x_transp, [-1, Fin * K])   # [N*M, Fin*K]

    # One weight matrix per layer (the equivalent of self._weight_variable).
    W_initial = tf.truncated_normal_initializer(0, 0.1)
    W = tf.get_variable('weights_L_' + str(lyr_num), [Fin * K, Fout], tf.float32,
                        initializer=W_initial)
    tf.summary.histogram(W.op.name, W)

    y = tf.matmul(stk_x_forMul, W)              # [N*M, Fout]
    y = tf.reshape(y, tf.stack([N, M, Fout]))   # [N, M, Fout]
    return y
'''
get_shape:
Purpose:
    Get the tensor's shape (as numbers, tensors, or their combination)
Args:
    tensor: the input tensor
Return:
    dims: the list containing the array of dims (as numbers, tensors, or their combination)
Notes:
    copied from
    https://github.com/vahidk/EffectiveTensorflow#get-shape-
'''
def get_shape(tensor):
  """Returns static shape if available and dynamic shape otherwise."""
  static_shape = tensor.shape.as_list()
  dynamic_shape = tf.unstack(tf.shape(tensor))
  dims = [s[1] if s[0] is None else s[0]
          for s in zip(static_shape, dynamic_shape)]
  return dims
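And this is roughly how I call it for one layer (the shapes and hyperparameters are illustrative only):

x_in = tf.placeholder(tf.float32, [None, 128, 1])          # N x M x Fin, M = 128 padded vertices
L0 = tf.placeholder(tf.float32, [None, 128, 128])          # one rescaled Laplacian per sample
y0 = chebyshev5_batch(x_in, L0, Fout=32, K=5, lyr_num=0)   # N x 128 x 32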

sample1.txt


mdeff commented Nov 16, 2017

I think if I want to process the samples by batch during training, I need to construct the Laplacian matrices with the same sizes at the corresponding coarsening levels across samples.

You don't need the same number of nodes at each level, because it's a convolution. Think of a 2D convolution on an image: whatever the image size, you can convolve it with e.g. a 3x3 filter. The problem arises at the interface with the fully connected layers, which expect a fixed-length representation. You could achieve that by constraining the coarsening to arrive at a fixed-size graph (if your input graphs are of similar size) or by using an attention mechanism.
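For the attention mechanism, one possible form is a soft-attention readout in the spirit of equation 7 of https://arxiv.org/abs/1511.05493 (a sketch under my own naming, not something implemented in this repo):

import tensorflow as tf

def attention_readout(h, mask, Fout):
    # h:    [N, M, F] vertex features after the last graph-conv layer
    # mask: [N, M, 1] with 1.0 on real vertices and 0.0 on padded ones
    # Returns a fixed-length [N, Fout] graph representation, independent of M.
    gate = tf.layers.dense(h, 1, activation=tf.nn.sigmoid)   # how much each vertex contributes
    feat = tf.layers.dense(h, Fout, activation=tf.nn.tanh)   # what each vertex contributes
    return tf.reduce_sum(gate * feat * mask, axis=1)         # weighted sum over vertices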

In the case of a positive bias and a parent node composed only of fake child nodes, the parent node would indeed get the value of the bias instead of 0. And if that value is greater than those of its real neighbors, it will be chosen by the max pooling. As a fix, you can indeed reset the fake nodes to zero. Did you see any effect from doing this?


maosi-chen commented Nov 16, 2017

I haven't tried the no-mask version yet because of the bias effects on the fake nodes. If you think it is worth exploring, I'll give it a try. Please let me know.

Thank you for the suggestion of "constraining the coarsening to arrive at a fixed size graph (if your input graphs are of similar size)". I think that unless we coarsen all the way down to a single node in the last cnn_graph layer, the other cnn_graph layers may still have some surviving fake nodes. We can stop coarsening once we arrive at a fixed-size graph, but the fixed-length vector we feed into the fully connected layer may then contain information from fake nodes (either zeros or the learned bias). I don't think the fully connected layer can handle inputs like this.

About the attention mechanism, I don't quite understand the mechanism you mentioned in your other answer (#5, "equation 7 in https://arxiv.org/abs/1511.05493"). Could you explain it a little more or give an implementation example? The closest thing I have tried is to apply a stacked bidirectional LSTM after the first cnn_graph convolution (without any bias or pooling) to obtain a fixed-length set of hidden states and feed them into the following fully connected layers. It did improve the performance (loss ~ 0.15 compared to loss ~ 0.2 when coarsening down to one node without the LSTM), but it converged much more slowly and was less accurate than the pure stacked bidirectional LSTM + fully connected structure (loss ~ 0.05).

mdeff closed this as completed on Jul 20, 2020.