bias+relu on padded vertices #17
Thanks for your interest. There should be no fake nodes on the last cnn_graph layer, so they will not impact the fully connected layer. In the previous cnn_graph layers, fake nodes are disconnected (i.e. not connected to any other nodes), so they won't influence the graph convolution. Because they are initialized to zero, they will never be chosen by the max pooling, and as such the gradient will never flow to their bias, which will be kept at zero as well. I'm not sure I fully answered your question, feel free to follow up otherwise.
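To illustrate the reasoning above, here is a small numpy sketch (the toy graph, signal values, and recurrence depth are made up purely for illustration): a disconnected, zero-initialized fake node is never reached by the Chebyshev recurrence, so its activation stays at zero and it contributes nothing through the max pooling.

```python
import numpy as np

# Toy example: 3 connected real nodes plus 1 disconnected fake node (index 3).
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 0., 0., 0.]])                       # the fake node has no edges
d = W.sum(0)
d_inv = np.zeros_like(d)
d_inv[d > 0] = 1.0 / np.sqrt(d[d > 0])                 # zero-degree fake node keeps d_inv = 0
L = np.eye(4) - (d_inv[:, None] * W) * d_inv[None, :]  # normalized Laplacian

x = np.array([0.7, -0.2, 1.3, 0.0])                    # graph signal, fake node initialized to 0

# Chebyshev recurrence: T_0 x = x, T_1 x = L x, T_k x = 2 L T_{k-1} x - T_{k-2} x.
x0, x1 = x, L @ x
for _ in range(3):
    x0, x1 = x1, 2.0 * (L @ x1) - x0
assert x1[3] == 0.0   # no edge ever injects value into the disconnected fake node

# After bias + ReLU all activations are >= 0, so a fake node held at 0 can never
# exceed a real sibling during Graclus max pooling; it simply contributes nothing.
```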
Thank you for answering my concern. Maybe the issue is caused by my mistakes in adapting the code to my application. My application is spatial interpolation of hourly surface ozone concentration (i.e. my graph signal) observed at EPA ground stations. The number of stations (16-38 within a 6-degree box around Oklahoma) varies with the date. For a training sample, the inputs are the location of a target station (any one of them) and the remaining stations' locations as well as their ozone observations; the output is the ozone concentration at the target station.

To reduce the computational burden during training, I prepare the Laplacian matrices for each sample before training using your libraries (e.g. graph.distance_sklearn_metrics, graph.adjacency, coarsening.coarsen, graph.laplacian, graph.rescale_L, coarsening.perm_data). Since the number of stations (vertices) varies, I think that if I want to process the samples in batches during training, I need to construct Laplacian matrices of the same size at the corresponding coarsening levels across samples. As it turns out, the minimal number of coarsening levels to accomplish this is 7 (i.e. 8 Laplacian matrices per sample), and the first layer has 128 nodes (i.e. >90 fake nodes). The attachment (sample1.txt) contains one sample, which includes the raw data (target and remaining points' positions and surface ozone amounts) and the 8 Laplacian matrices calculated from them.

During training, I made the necessary modifications to chebyshev5 to conduct the convolution by batch (please see chebyshev5_batch below). If the process above is correct, I'm still concerned about the bias on fake nodes. I'm sorry for writing at such length to describe my issue. Could you help me find where I'm making mistakes here regarding the bias? Besides, could you also elaborate on how to use an attention mechanism to convert variable-length features to fixed-length features before feeding them into fully connected layers? Thank you so much!
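For context, the per-sample preprocessing described above would look roughly like the sketch below. The helper names come from the repository's graph.py and coarsening.py modules as I remember them, so exact argument names may differ; the neighbourhood size k, the number of levels, and the shapes are placeholders for this application.

```python
import numpy as np
from lib import graph, coarsening   # cnn_graph helper modules

def prepare_sample(coords, signal, levels=7, k=4):
    """Build the multi-level Laplacians and the permuted signal for one sample.

    coords : (M, 2) station positions, signal : (M,) ozone observations.
    """
    # k-nearest-neighbour graph on the station coordinates.
    dist, idx = graph.distance_sklearn_metrics(coords, k=k)
    A = graph.adjacency(dist, idx)

    # Multi-level coarsening; perm says how to reorder/pad the input with fake nodes.
    graphs, perm = coarsening.coarsen(A, levels=levels, self_connections=False)

    # One rescaled, normalized Laplacian per level.
    laplacians = [graph.rescale_L(graph.laplacian(g, normalized=True), lmax=2)
                  for g in graphs]

    # Pad the signal with zeros on the fake nodes and reorder it to match level 0.
    x = coarsening.perm_data(signal.reshape(1, -1), perm)
    return laplacians, x
```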
You don't need to have the same number of nodes at each level, because it's a convolution. Think of a 2D convolution on an image: whatever the image size, you can convolve it with e.g. a 3x3 filter. The problem arises at the interface with the fully connected layers, which expect a fixed-length representation. You could achieve that by constraining the coarsening to arrive at a fixed-size graph (if your input graphs are of similar size) or by using an attention mechanism. In the case of a positive bias and a parent node composed only of fake children nodes, the parent node would indeed get the value of the bias instead of 0. And if that value is greater than that of its real neighbors, it will be chosen by the max pooling. As a fix, you can indeed reset fake nodes to zero. Did you see any effect by doing this?
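A minimal sketch of that fix, assuming a per-level mask of real vertices is available (the function and variable names below are mine, not from the repository): zero out the padded vertices again right after the bias and ReLU, so a positive bias can never leak into the pooling.

```python
import tensorflow as tf

def b1relu_masked(x, b, real_mask):
    """Bias + ReLU, then reset the padded (fake) vertices to zero.

    x         : (N, M, F) filtered graph signals for a batch
    b         : (1, 1, F) bias, one per filter
    real_mask : (1, M, 1) float tensor, 1.0 for real vertices, 0.0 for fake ones
    """
    x = tf.nn.relu(x + b)
    return x * real_mask   # fake vertices go back to 0 before max pooling
```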
I haven't tried the no-mask version yet because of the bias effects on the fake nodes. If you think it is worth exploring, I'll give it a try; please let me know.

Thank you for the suggestion of "constraining the coarsening to arrive at a fixed size graph (if your input graphs are of similar size)". I think that before the coarsening reaches a single node in the last cnn_graph layer, any earlier cnn_graph layer may still have some surviving fake nodes. We can stop coarsening once we arrive at a fixed-size graph, but the fixed-length vector we feed into the fully connected layer may then contain some fake-node information (either zero or the learned bias). I don't think the fully connected layer can handle inputs like this.

About the attention mechanism, I don't quite understand the mechanism you mentioned in your other answer (#5, "equation 7 in https://arxiv.org/abs/1511.05493"). Could you explain it a little more or give an implementation example? The closest thing I have tried is to apply a stacked bidirectional LSTM after the first cnn_graph convolution (without any bias or pooling) to fetch a fixed-length set of hidden states and feed them into the following fully connected layers. It did help improve the performance (loss ~0.15 compared to loss ~0.2 for coarsening down to one node without the LSTM), but it converged much more slowly and was less accurate than the pure stacked bidirectional LSTM + fully connected structure (loss ~0.05).
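As a rough illustration of the attention readout mentioned above (equation 7 of https://arxiv.org/abs/1511.05493): a gating network scores every node and the gated node features are summed, giving a fixed-length vector regardless of how many vertices the sample has. The sketch below is written in TF 1.x style to match the era of the repository; the output size, the layer choices, and the real_mask used to exclude fake vertices are my own assumptions, not part of cnn_graph.

```python
import tensorflow as tf

def attention_readout(h, real_mask, out_dim=64):
    """Soft-attention graph readout in the spirit of GGNN eq. 7.

    h         : (N, M, F) node features from the last graph-conv layer
    real_mask : (N, M, 1) float tensor, 1.0 for real vertices, 0.0 for padded ones
    Returns a fixed-length (N, out_dim) graph representation.
    """
    gate = tf.layers.dense(h, out_dim, activation=tf.nn.sigmoid)  # how much node v contributes
    feat = tf.layers.dense(h, out_dim, activation=tf.nn.tanh)     # what node v contributes
    return tf.reduce_sum(gate * feat * real_mask, axis=1)         # sum over vertices -> fixed length
```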
Thank you for sharing your code. I really enjoy it (as well as your paper)!

For the bias + relu step (e.g. b1relu), I'm concerned about the effect of the bias on the padded vertices (fake nodes), especially when there are many consecutive ones. For any layer before the last cnn_graph layer, since the next layer's L is padded with 0 at the corresponding positions, the effect of the bias may be eliminated to some extent (x0 in chebyshev5 is still affected). However, for the last cnn_graph layer the biases on the padded vertices will be kept and fed into the fully connected layer. My question is: should we mask those padded vertices (i.e. set them to 0) after adding the bias to the output of the convolution and before the relu?
Thanks!
Maosi Chen