Support mini-batch training in VAE #3

Closed
wehlutyk opened this issue Mar 29, 2018 · 2 comments
@wehlutyk
Collaborator

Note that a mini-batch with convolutions needs access to all the neighbouring nodes it will include in convolutions, on top of the nodes in the mini-batch for which we compute an update.
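A minimal sketch of what this implies, assuming a scipy sparse adjacency matrix (the function name is illustrative, not the implementation in this repo): for a convolution with n layers, the mini-batch must be extended by its n-hop neighbourhood.

```python
import numpy as np
import scipy.sparse as sp

def batch_with_neighbourhood(adj, batch_idx, n_hops=1):
    """Indices a convolution over `batch_idx` needs: the batch itself
    plus its n_hops-hop neighbourhood (one hop per convolution layer)."""
    adj = sp.csr_matrix(adj)
    needed = np.asarray(batch_idx)
    for _ in range(n_hops):
        # Column indices of the non-zero entries in the selected rows,
        # i.e. the nodes reachable in one hop from the current set.
        needed = np.union1d(needed, adj[needed].nonzero()[1])
    return needed
```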

@wehlutyk
Collaborator Author

The current implementation in #10 mostly works, but with a gotcha concerning the scaling of losses.

In fact, there are three open questions, and one puzzle which is probably related.

Question 1: scaling between neighbours and non-neighbours

When computing the reconstruction loss on the adjacency matrix, we previously used np.sum(adj * np.log(x) + (1 - adj) * np.log(1 - x)) (in fact we used something numerically more stable, with the same result). This is wrong for two reasons:

  • np.sum makes this scale with n_nodes ** 2, so the magnitude compared to the KL loss changes with the size of the network. That's easily fixed by using np.mean instead, and defining how to scale w.r.t. KL loss (see question 2).
  • It gives equal weight to neighbours and non-neighbours in the prediction accuracy. Ideally, we'd like to predict, for each node, its set of neighbours and its set of non-neighbours, without penalising the prediction loss for the fact that there are (usually) many more non-neighbours than neighbours. One solution is to scale the items in the reconstruction loss so that, for a given node, its neighbours contribute 1/2 on average and its non-neighbours contribute 1/2 on average; we then average over all nodes to get a loss that contributes 1 but weighs neighbours and non-neighbours equally (see the sketch after this list). There might be better solutions to this, or better ways of scaling (which is the question).
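A minimal numpy sketch of that reweighting (illustrative, not the exact code in #10; adj is a dense binary adjacency matrix, x the predicted edge probabilities):

```python
import numpy as np

def balanced_reconstruction_loss(adj, x, eps=1e-8):
    """Adjacency cross-entropy where, for each node, neighbours and
    non-neighbours each contribute 1/2 on average, then averaged
    over nodes so the total loss stays O(1) in network size."""
    x = np.clip(x, eps, 1 - eps)         # numerical stability
    log_pos = adj * np.log(x)            # terms for neighbours
    log_neg = (1 - adj) * np.log(1 - x)  # terms for non-neighbours
    n_pos = np.maximum(adj.sum(axis=1), 1)        # neighbours per node
    n_neg = np.maximum((1 - adj).sum(axis=1), 1)  # non-neighbours per node
    # Per node: mean over neighbours and mean over non-neighbours,
    # each weighted 1/2, so both sets contribute equally.
    per_node = (0.5 * log_pos.sum(axis=1) / n_pos
                + 0.5 * log_neg.sum(axis=1) / n_neg)
    return -per_node.mean()              # average over nodes
```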

Question 2: scaling between KL loss and reconstruction loss

After the correction for point 1 of question 1, we are effectively training with a KL loss and a reconstruction loss of roughly the same magnitude (around .5), instead of a reconstruction loss up to 600 times bigger than the KL. It turns out that in this case things fail: training is either very poor, or falls into the puzzle described below, unless we downscale the KL loss by a factor of 30 to 600. That downscaling is also what Kipf et al. seem to do in their implementation (in that file, the norm and pos_weight values take care of the scaling mentioned in point 2 of question 1; sketched below). But if I'm not mistaken, there's no theoretical grounding for it.
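For reference, a sketch of that scaling as I read it in tkipf/gae (the formulas are reconstructed from memory of that implementation, so double-check against the file):

```python
import numpy as np

def kipf_scaling(adj):
    """norm / pos_weight as used in tkipf/gae's weighted cross-entropy."""
    n_total = adj.shape[0] ** 2  # number of entries in the adjacency matrix
    n_edges = adj.sum()          # number of positive entries (neighbours)
    # Upweight the rare positives (point 2 of question 1).
    pos_weight = (n_total - n_edges) / n_edges
    # Overall normalisation of the weighted cross-entropy.
    norm = n_total / (2 * (n_total - n_edges))
    return norm, pos_weight
```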

So, three possibilities:

  • find a mistake in my derivation to conclude that Kipf's implementation is the right one
  • find no mistake, but find a way to justify this downscaling (not only heuristic)
  • find no mistake, but find a way to make the training work without downscaling

Question 3: mini-batch sparsity

In the case of a sparse network, mini-batching with randomly selected nodes will almost always select nodes that are not directly connected, so the adjacency matrix restricted to the mini-batch will almost always be the identity matrix. Training on this is bound to be very bad.

Should we instead select nodes that have a higher chance of being connected? Or maybe we can scale the reconstruction loss so that training works even if most of the sub-adjacency matrices it sees are identity matrices (but seizes the occasion whenever it does see a connection between nodes)? The sketch below gives an idea of how rare in-batch edges are.
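A quick back-of-the-envelope check (hypothetical numbers, for an Erdős–Rényi-like sparse graph):

```python
n_nodes, avg_degree, batch_size = 100_000, 10, 256

# Probability that a given pair of nodes is connected.
p_edge = avg_degree / (n_nodes - 1)
# Expected number of edges among all node pairs in a random mini-batch.
n_pairs = batch_size * (batch_size - 1) / 2
print(n_pairs * p_edge)  # ~3.3 expected edges out of 32,640 pairs
```

So off the diagonal, the sub-adjacency matrix the model sees is almost entirely zeros.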

Puzzle: why does the KL fall to 0

When training in the situation described in question 2 (so equally scaled KL and reconstruction losses), the KL most often drops to almost 0 (1e-7), and the reconstruction loss goes up a little (sic). Everything looks like the optimizer is trying to reduce the KL without regard to its small magnitude compared to the reconstruction loss. What is going on? (Maybe start a new issue for this, but it's also related to question 2 on scaling losses.)

@wehlutyk
Collaborator Author

After the meeting with Marton and @jaklevab.

Question 1

Question 2
This issue was raised in tkipf/gae#8, and indeed the answer there is correct: in my code the reconstruction loss for the adjacency matrix was effectively normalised twice by the number of nodes (by an average over a 2-D tensor), while the KL was normalised only once (by an average over a 1-D tensor). Switching to a single normalisation for the reconstruction loss is exactly equivalent to scaling down the KL by the number of nodes (see the sketch below).
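A minimal sketch of the bookkeeping (illustrative shapes, with constant per-entry losses standing in for the real ones):

```python
import numpy as np

n_nodes = 600
recon_per_entry = np.full((n_nodes, n_nodes), 0.5)  # loss per adjacency entry
kl_per_node = np.full(n_nodes, 0.5)                 # KL term per node

# Buggy version: reconstruction averaged over n_nodes ** 2 entries,
# KL averaged over n_nodes entries -> the reconstruction is
# normalised once more than the KL.
buggy = recon_per_entry.mean() + kl_per_node.mean()

# Fix: normalise the reconstruction only once, per node...
fixed = recon_per_entry.sum(axis=1).mean() + kl_per_node.mean()
# ...or, identically, keep the double normalisation but scale
# the KL down by n_nodes.
rescaled = recon_per_entry.mean() + kl_per_node.mean() / n_nodes
assert np.isclose(fixed / n_nodes, rescaled)
```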

Question 3
Indeed, this is a problem, see #15.

Puzzle
This is solved by the answer to question 2: the value we were minimising was most likely no longer an upper bound on the theoretical objective to be optimised.

wehlutyk added a commit that referenced this issue Jul 10, 2018

Merge pull request #10 from ixxi-dante/issue-3-mini-batching

[WIP] Mini-batch training in VAE. Closes #3.