# Extraction of internal features from the tower representation #34
The default network architecture (ResNet) seems to have an odd bias towards strong activations close to the edge of the board, as can be observed in the diagram below.

This behaviour makes sense: consider a set of convolution weights with a negative mean. Since the input is zero-padded, cells close to the border receive fewer of these (on average negative) contributions than cells in the interior, because part of their receptive field falls on the padding. The border cells therefore end up with relatively stronger activations.
This hypothesis can be confirmed by investigating the final weights, and checking the mean of the convolution layers:

```json
{
"01_upsample/conv_1": {
"mean": -0.0003626140533015132,
"std": 0.00519569544121623
},
"02_residual/conv_1": {
"mean": -0.00011233328405069187,
"std": 0.002601742511615157
},
"02_residual/conv_2": {
"mean": -0.00011419910879340023,
"std": 0.0026016614865511656
},
"03_residual/conv_1": {
"mean": -0.00010304508032277226,
"std": 0.0026021271478384733
},
"03_residual/conv_2": {
"mean": -6.648160342592746e-05,
"std": 0.002603317843750119
},
"04_residual/conv_1": {
"mean": -0.0003849344211630523,
"std": 0.0025755600072443485
},
"04_residual/conv_2": {
"mean": 0.00016120942018460482,
"std": 0.002599172294139862
},
"05_residual/conv_1": {
"mean": 0.00021969537192489952,
"std": 0.002594882855191827
},
"05_residual/conv_2": {
"mean": 0.00045546123874373734,
"std": 0.002564028138294816
},
"06_residual/conv_1": {
"mean": -0.00022857918520458043,
"std": 0.0025941154453903437
},
"06_residual/conv_2": {
"mean": 0.00020613202650565654,
"std": 0.0025959957856684923
},
"07_residual/conv_1": {
"mean": -0.0020295707508921623,
"std": 0.001631724531762302
},
"07_residual/conv_2": {
"mean": -2.099659468512982e-05,
"std": 0.002604082226753235
},
"08_residual/conv_1": {
"mean": -0.0021902318112552166,
"std": 0.001408747280947864
},
"08_residual/conv_2": {
"mean": 7.066841476444097e-07,
"std": 0.002604166977107525
},
"09_residual/conv_1": {
"mean": -0.0021506529301404953,
"std": 0.0014684603083878756
},
"09_residual/conv_2": {
"mean": -2.3711704670859035e-06,
"std": 0.002604165580123663
},
"10_residual/conv_1": {
"mean": -0.0015733279287815094,
"std": 0.002075168304145336
},
"10_residual/conv_2": {
"mean": -0.00012044789764331654,
"std": 0.0026013797614723444
},
"11p_policy/conv_1": {
"mean": -0.007533475290983915,
"std": 0.06204431504011154
},
"11p_policy/linear_1": {
"mean": -0.01642797514796257,
"std": 0.038641564548015594
},
"11v_value/conv_1": {
"mean": -0.0014275580178946257,
"std": 0.08837682008743286
},
"11v_value/linear_1": {
"mean": 0.00015922913735266775,
"std": 0.0526527501642704
},
"11v_value/linear_2": {
"mean": -0.010312804952263832,
"std": 0.06787863373756409
}
}
```

As can be seen, most of the convolution layers have a negative mean.

### Conclusion

The current neural network architecture has a systematic bias towards the edge of the board. This problem is presumably exaggerated by the fact that we clip the activations to the range [0, 6].
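For reference, this is roughly how per-layer weight statistics like the ones above could be collected from a TensorFlow checkpoint. It is only a sketch: the checkpoint path, the variable-name filter, and the use of TensorFlow's checkpoint reader are assumptions, not the project's actual tooling.

```python
import numpy as np
import tensorflow as tf

def weight_statistics(checkpoint_path):
    """Return {variable_name: (mean, std)} for every conv/linear kernel."""
    reader = tf.train.load_checkpoint(checkpoint_path)
    stats = {}

    for name in reader.get_variable_to_shape_map():
        # Assumed naming convention, based on the layer names listed above.
        if 'conv' in name or 'linear' in name:
            values = reader.get_tensor(name)
            stats[name] = (float(np.mean(values)), float(np.std(values)))

    return stats

if __name__ == '__main__':
    # 'path/to/checkpoint' is a hypothetical path.
    for name, (mean, std) in sorted(weight_statistics('path/to/checkpoint').items()):
        print(f'{name}: mean={mean:+.6f} std={std:.6f}')
```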
Some approaches to solving the problem above immediately spring to mind:
Neither of these approaches is supported by cuDNN when running with the tensor layout that we use. Neural style transfer research has encountered similar artifacts, with a similar cause, but reached no conclusion beyond using super-resolution.
Even if it turns out to be tricky to implement said algorithms in cuDNN, we can still train the architectures in TensorFlow and verify whether they result in a lower loss:

As one can observe, they are effectively equivalent, with the difference being well within the margin of error.
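As an illustration of how such a padding alternative could be prototyped in TensorFlow, below is a sketch using reflection padding; whether this matches any of the approaches alluded to above is an assumption.

```python
import tensorflow as tf

def reflect_padded_conv(x, filters=128):
    """3x3 convolution with reflection padding instead of zero padding, so
    border cells never see artificial zeros.  Assumes NHWC layout; the
    filter count is also an assumption."""
    x = tf.pad(x, [[0, 0], [1, 1], [1, 1], [0, 0]], mode='REFLECT')
    return tf.layers.conv2d(x, filters, 3, padding='valid', use_bias=False)
```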
An alternative attack vector is to try and prevent the non-zero mean of the weights, at which point the zero-padding does not matter anymore. The most likely culprit for the negative mean of the weights is the residual connection, which in the AlphaZero architecture looks like this, where `F(x)` is the output of the two convolution layers inside the block:

`y = relu(x + F(x))`
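A minimal sketch of this residual connection, written with TF 1.x-style layers; the filter count, the use of batch normalization, and clipping the activation at 6 are assumptions based on the discussion in this issue.

```python
import tensorflow as tf

def residual_branch(x, filters=128):
    """F(x): the two convolutions (with batch norm) inside a residual block."""
    y = tf.layers.conv2d(x, filters, 3, padding='same', use_bias=False)
    y = tf.layers.batch_normalization(y)
    y = tf.nn.relu6(y)
    y = tf.layers.conv2d(y, filters, 3, padding='same', use_bias=False)
    return tf.layers.batch_normalization(y)

def residual_block(x, filters=128):
    """Standard AlphaZero-style residual connection: y = relu(x + F(x)),
    here with the activation clipped at 6 as described above."""
    return tf.nn.relu6(x + residual_branch(x, filters))
```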
Note that if `F(x)` does not have a zero mean, then the magnitude of the activations keeps growing with every residual block, and the easiest way for the optimizer to keep the activations within the clipping range is to push the convolution weights towards a negative mean.

A solution to this is to instead interpolate between `x` and `F(x)`, so that the magnitude of the output stays comparable to that of the input:

`y = relu(α · F(x) + (1 - α) · x)`
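A sketch of how the interpolated block could look, reusing `residual_branch` from the sketch above; the fixed value `alpha = 0.5` is an assumption.

```python
def interpolated_residual_block(x, filters=128, alpha=0.5):
    """Highway-style variant: y = relu(alpha * F(x) + (1 - alpha) * x).
    Interpolating instead of adding keeps the magnitude of the output
    comparable to the magnitude of the input."""
    y = residual_branch(x, filters)  # F(x) from the sketch above
    return tf.nn.relu6(alpha * y + (1.0 - alpha) * x)
```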
We could also make α a trainable parameter instead of a fixed constant.

### Results

This change seems to have a positive effect on the final loss of the training:
It also seems to improve the actual playing strength of the engine:
When looking at the activations after introducing the α interpolation, the border problem seems to have been largely resolved. But an unintended side-effect of making sure that the magnitude of the activations does not grow is that they no longer come anywhere close to the upper clipping point.

If we want to preserve 99.9% of all values, assuming the values follow a normal distribution with a mean of 0 and a variance of 1, then we should clip the activations at 3.09023. This is a far cry from 6.0, and shows that we are effectively just wasting half of the quantized range. This inefficient use of the quantized range presumably also had negative effects before introducing α, since the initial parts of the architecture were not using the entire range, resulting in a low resolution during the initial parts of the inference.
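The 3.09023 figure is just the 99.9th percentile of a standard normal distribution, which can be checked with a one-liner:

```python
from scipy.stats import norm

# Clipping point that preserves 99.9% of values drawn from N(0, 1).
print(norm.ppf(0.999))  # ≈ 3.0902323
```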
Tweak the neural network architecture based on discoveries made in #34:

- Change the residual blocks to a highway architecture to avoid the magnitude of the activations exploding.
- Clip relu activations at 3.09023 instead of 6.0.
- Reduce batch size from 1024 to 512 (improves performance, and does not seem to have any impact on the final accuracy).
- Disable adversarial training by default.

Also make some quality of life changes to the script:

- Read _big SGF files_ instead of TFRecords. This allows us to simplify and streamline the training pipeline.
- Add a `--tower` option, that allows us to inspect the internal features of the network.
- Add a measurement of the orthogonality of the weights to `--print`.
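The commit message does not spell out how the orthogonality of the weights is measured; one reasonable interpretation, offered only as a sketch, is the deviation of the column-normalised Gram matrix of each kernel from the identity.

```python
import numpy as np

def orthogonality(weights):
    """Deviation of the column-normalised Gram matrix from the identity;
    0.0 means all output kernels are orthogonal to each other.  Whether this
    matches the metric added to `--print` is an assumption."""
    w = weights.reshape(-1, weights.shape[-1])               # [k*k*in, out]
    w = w / (np.linalg.norm(w, axis=0, keepdims=True) + 1e-8)
    gram = w.T @ w                                           # [out, out]
    return np.linalg.norm(gram - np.eye(gram.shape[0]))
```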
We should investigate the internal features of the tower representation. Doing this should provide several benefits.