Support mini-batch training in VAE #3

Closed
wehlutyk opened this issue Mar 29, 2018 · 2 comments
@wehlutyk
Collaborator

Note that a mini-batch with convolutions needs access to all the neighbouring nodes it will include in convolutions, on top of the nodes in the mini-batch for which we compute an update.
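A minimal sketch of what this implies, assuming a scipy sparse adjacency matrix (the function name is illustrative, not the implementation in this repo): for a convolution with n layers, the mini-batch must be extended by its n-hop neighbourhood.

```python
import numpy as np
import scipy.sparse as sp

def batch_with_neighbourhood(adj, batch_idx, n_hops=1):
    """Indices a convolution over `batch_idx` needs: the batch itself
    plus its n_hops-hop neighbourhood (one hop per convolution layer)."""
    adj = sp.csr_matrix(adj)
    needed = np.asarray(batch_idx)
    for _ in range(n_hops):
        # Column indices of the non-zero entries in the selected rows,
        # i.e. the nodes reachable in one hop from the current set.
        needed = np.union1d(needed, adj[needed].nonzero()[1])
    return needed
```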

@wehlutyk
Collaborator Author

The current implementation in #10 mostly works, but with a gotcha concerning the scaling of losses.

In fact, there are three open questions, and one puzzle which is probably related.

Question 1: scaling between neighbours and non-neighbours

When computing the reconstruction loss on the adjacency matrix, we previously used np.sum(adj * np.log(x) + (1 - adj) * np.log(1 - x)) (in fact we used something numerically more stable, with the same result). This is wrong for two reasons:

  • np.sum makes this scale with n_nodes ** 2, so the magnitude compared to the KL loss changes with the size of the network. That's easily fixed by using np.mean instead, and defining how to scale w.r.t. KL loss (see question 2).
  • It gives equal weight to neighbours and non-neighbours in the prediction accuracy. Ideally, we'd like to predict, for each node, its set of neighbours and its set of non-neighbours, without penalising the prediction loss for the fact that there are (usually) many more non-neighbours than neighbours. One solution is to scale the items in the reconstruction loss so that, for a given node, its neighbours contribute 1/2 on average and its non-neighbours contribute 1/2 on average; we then average over all nodes to get a loss that contributes 1 but weighs neighbours and non-neighbours equally (see the sketch after this list). There might be better solutions to this, or better ways of scaling (which is the question).
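A minimal numpy sketch of that reweighting (illustrative, not the exact code in #10; adj is a dense binary adjacency matrix, x the predicted edge probabilities):

```python
import numpy as np

def balanced_reconstruction_loss(adj, x, eps=1e-8):
    """Adjacency cross-entropy where, for each node, neighbours and
    non-neighbours each contribute 1/2 on average, then averaged
    over nodes so the total loss stays O(1) in network size."""
    x = np.clip(x, eps, 1 - eps)         # numerical stability
    log_pos = adj * np.log(x)            # terms for neighbours
    log_neg = (1 - adj) * np.log(1 - x)  # terms for non-neighbours
    n_pos = np.maximum(adj.sum(axis=1), 1)        # neighbours per node
    n_neg = np.maximum((1 - adj).sum(axis=1), 1)  # non-neighbours per node
    # Per node: mean over neighbours and mean over non-neighbours,
    # each weighted 1/2, so both sets contribute equally.
    per_node = (0.5 * log_pos.sum(axis=1) / n_pos
                + 0.5 * log_neg.sum(axis=1) / n_neg)
    return -per_node.mean()              # average over nodes
```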

Question 2: scaling between KL loss and reconstruction loss

After the correction for point 1 of question 1, we are effectively training with a KL loss and a reconstruction loss of roughly the same magnitude (around .5), instead of a reconstruction loss up to 600 times bigger than the KL. It turns out that in this case things fail: training is either very poor, or falls into the puzzle described below, unless we downscale the KL loss by a factor of 30 to 600. That downscaling is also what Kipf et al. seem to do in their implementation (in that file, the norm and pos_weight values take care of the scaling mentioned in point 2 of question 1; sketched below). But if I'm not mistaken, there's no theoretical grounding for it.
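For reference, a sketch of that scaling as I read it in tkipf/gae (the formulas are reconstructed from memory of that implementation, so double-check against the file):

```python
import numpy as np

def kipf_scaling(adj):
    """norm / pos_weight as used in tkipf/gae's weighted cross-entropy."""
    n_total = adj.shape[0] ** 2  # number of entries in the adjacency matrix
    n_edges = adj.sum()          # number of positive entries (neighbours)
    # Upweight the rare positives (point 2 of question 1).
    pos_weight = (n_total - n_edges) / n_edges
    # Overall normalisation of the weighted cross-entropy.
    norm = n_total / (2 * (n_total - n_edges))
    return norm, pos_weight
```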

So, three possibilities:

  • find a mistake in my derivation to conclude that Kipf's implementation is the right one
  • find no mistake, but find a way to justify this downscaling (not only heuristic)
  • find no mistake, but find a way to make the training work without downscaling

Question 3: mini-batch sparsity

In the case of a sparse network, mini-batching with randomly selected nodes will almost always select nodes that are not directly connected, so the adjacency matrix restricted to the mini-batch will almost always be the identity matrix. Training on this is bound to be very bad.

Should we instead select nodes that have a higher chance of being connected? Or maybe we can scale the reconstruction loss so that training works even if most of the sub-adjacency matrices it sees are identity matrices (but seizes the occasion whenever it does see a connection between nodes)? The sketch below gives an idea of how rare in-batch edges are.
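A quick back-of-the-envelope check (hypothetical numbers, for an Erdős–Rényi-like sparse graph):

```python
n_nodes, avg_degree, batch_size = 100_000, 10, 256

# Probability that a given pair of nodes is connected.
p_edge = avg_degree / (n_nodes - 1)
# Expected number of edges among all node pairs in a random mini-batch.
n_pairs = batch_size * (batch_size - 1) / 2
print(n_pairs * p_edge)  # ~3.3 expected edges out of 32,640 pairs
```

So off the diagonal, the sub-adjacency matrix the model sees is almost entirely zeros.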

Puzzle: why does the KL fall to 0

When training in the situation described in question 2 (so equally scaled KL and reconstruction losses), the KL most often drops to almost 0 (1e-7), and the reconstruction loss goes up a little (sic). Everything looks like the optimizer is trying to reduce the KL without regard to its small magnitude compared to the reconstruction loss. What is going on? (Maybe start a new issue for this, but it's also related to question 2 on scaling losses.)

@wehlutyk
Collaborator Author

After the meeting with Marton and @jaklevab.

Question 1

Question 2
This issue was raised in tkipf/gae#8, and indeed the answer there is correct: in my code the reconstruction loss for the adjacency matrix was effectively normalised twice by the number of nodes (by an average over a 2-D tensor), while the KL was normalised only once (by an average over a 1-D tensor). Switching to a single normalisation for the reconstruction loss is exactly equivalent to scaling down the KL by the number of nodes (see the sketch below).
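A minimal sketch of the bookkeeping (illustrative shapes, with constant per-entry losses standing in for the real ones):

```python
import numpy as np

n_nodes = 600
recon_per_entry = np.full((n_nodes, n_nodes), 0.5)  # loss per adjacency entry
kl_per_node = np.full(n_nodes, 0.5)                 # KL term per node

# Buggy version: reconstruction averaged over n_nodes ** 2 entries,
# KL averaged over n_nodes entries -> the reconstruction is
# normalised once more than the KL.
buggy = recon_per_entry.mean() + kl_per_node.mean()

# Fix: normalise the reconstruction only once, per node...
fixed = recon_per_entry.sum(axis=1).mean() + kl_per_node.mean()
# ...or, identically, keep the double normalisation but scale
# the KL down by n_nodes.
rescaled = recon_per_entry.mean() + kl_per_node.mean() / n_nodes
assert np.isclose(fixed / n_nodes, rescaled)
```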

Question 3
Indeed, this is a problem, see #15.

Puzzle
This is solved by the answer to question 2: the value we were minimising was most likely no longer an upper bound on the theoretical objective to be optimised.

wehlutyk added a commit that referenced this issue Jul 10, 2018

Merge pull request #10 from ixxi-dante/issue-3-mini-batching

[WIP] Mini-batch training in VAE. Closes #3.