# Extraction of internal features from the tower representation #34
The default network architecture (ResNet) seems to have an odd bias towards strong activations close to the edge of the board, as can be observed in the diagram below.

This behaviour makes sense: consider a set of convolution weights with a negative mean. Since the input is zero-padded, cells close to the border receive fewer of these (on average negative) contributions than cells in the interior, because part of their receptive field falls on the padding. The border cells therefore end up with relatively stronger activations.
This hypothesis can be confirmed by investigating the final weights, and checking the mean of the convolution layers:

```json
{
"01_upsample/conv_1": {
"mean": -0.0003626140533015132,
"std": 0.00519569544121623
},
"02_residual/conv_1": {
"mean": -0.00011233328405069187,
"std": 0.002601742511615157
},
"02_residual/conv_2": {
"mean": -0.00011419910879340023,
"std": 0.0026016614865511656
},
"03_residual/conv_1": {
"mean": -0.00010304508032277226,
"std": 0.0026021271478384733
},
"03_residual/conv_2": {
"mean": -6.648160342592746e-05,
"std": 0.002603317843750119
},
"04_residual/conv_1": {
"mean": -0.0003849344211630523,
"std": 0.0025755600072443485
},
"04_residual/conv_2": {
"mean": 0.00016120942018460482,
"std": 0.002599172294139862
},
"05_residual/conv_1": {
"mean": 0.00021969537192489952,
"std": 0.002594882855191827
},
"05_residual/conv_2": {
"mean": 0.00045546123874373734,
"std": 0.002564028138294816
},
"06_residual/conv_1": {
"mean": -0.00022857918520458043,
"std": 0.0025941154453903437
},
"06_residual/conv_2": {
"mean": 0.00020613202650565654,
"std": 0.0025959957856684923
},
"07_residual/conv_1": {
"mean": -0.0020295707508921623,
"std": 0.001631724531762302
},
"07_residual/conv_2": {
"mean": -2.099659468512982e-05,
"std": 0.002604082226753235
},
"08_residual/conv_1": {
"mean": -0.0021902318112552166,
"std": 0.001408747280947864
},
"08_residual/conv_2": {
"mean": 7.066841476444097e-07,
"std": 0.002604166977107525
},
"09_residual/conv_1": {
"mean": -0.0021506529301404953,
"std": 0.0014684603083878756
},
"09_residual/conv_2": {
"mean": -2.3711704670859035e-06,
"std": 0.002604165580123663
},
"10_residual/conv_1": {
"mean": -0.0015733279287815094,
"std": 0.002075168304145336
},
"10_residual/conv_2": {
"mean": -0.00012044789764331654,
"std": 0.0026013797614723444
},
"11p_policy/conv_1": {
"mean": -0.007533475290983915,
"std": 0.06204431504011154
},
"11p_policy/linear_1": {
"mean": -0.01642797514796257,
"std": 0.038641564548015594
},
"11v_value/conv_1": {
"mean": -0.0014275580178946257,
"std": 0.08837682008743286
},
"11v_value/linear_1": {
"mean": 0.00015922913735266775,
"std": 0.0526527501642704
},
"11v_value/linear_2": {
"mean": -0.010312804952263832,
"std": 0.06787863373756409
}
}
```

As can be seen, most of the convolution layers have a negative mean.

### Conclusion

The current neural network architecture has a systematic bias towards the edge of the board. This problem is presumably exaggerated by the fact that we clip the activations to the range [0, 6].
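For reference, this is roughly how per-layer weight statistics like the ones above could be collected from a TensorFlow checkpoint. It is only a sketch: the checkpoint path, the variable-name filter, and the use of TensorFlow's checkpoint reader are assumptions, not the project's actual tooling.

```python
import numpy as np
import tensorflow as tf

def weight_statistics(checkpoint_path):
    """Return {variable_name: (mean, std)} for every conv/linear kernel."""
    reader = tf.train.load_checkpoint(checkpoint_path)
    stats = {}

    for name in reader.get_variable_to_shape_map():
        # Assumed naming convention, based on the layer names listed above.
        if 'conv' in name or 'linear' in name:
            values = reader.get_tensor(name)
            stats[name] = (float(np.mean(values)), float(np.std(values)))

    return stats

if __name__ == '__main__':
    # 'path/to/checkpoint' is a hypothetical path.
    for name, (mean, std) in sorted(weight_statistics('path/to/checkpoint').items()):
        print(f'{name}: mean={mean:+.6f} std={std:.6f}')
```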
Some approaches to solving the problem above immediately spring to mind:
Neither of these approaches is supported by cuDNN when running with the tensor layout that we use. Neural style transfer research has encountered similar artifacts, with a similar cause, but reached no conclusion beyond using super-resolution.
Even if it turns out to be tricky to implement said algorithms in cuDNN, we can still train the architectures in TensorFlow and verify whether they result in a lower loss:

As one can observe, they are effectively equivalent, with the difference being well within the margin of error.
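As an illustration of how such a padding alternative could be prototyped in TensorFlow, below is a sketch using reflection padding; whether this matches any of the approaches alluded to above is an assumption.

```python
import tensorflow as tf

def reflect_padded_conv(x, filters=128):
    """3x3 convolution with reflection padding instead of zero padding, so
    border cells never see artificial zeros.  Assumes NHWC layout; the
    filter count is also an assumption."""
    x = tf.pad(x, [[0, 0], [1, 1], [1, 1], [0, 0]], mode='REFLECT')
    return tf.layers.conv2d(x, filters, 3, padding='valid', use_bias=False)
```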
An alternative attack vector is to try and prevent the non-zero mean of the weights, at which point the zero-padding does not matter anymore. The most likely culprit for the negative mean of the weights is the residual connection, which in the AlphaZero architecture looks like this, where `F(x)` is the output of the two convolution layers inside the block:

`y = relu(x + F(x))`
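A minimal sketch of this residual connection, written with TF 1.x-style layers; the filter count, the use of batch normalization, and clipping the activation at 6 are assumptions based on the discussion in this issue.

```python
import tensorflow as tf

def residual_branch(x, filters=128):
    """F(x): the two convolutions (with batch norm) inside a residual block."""
    y = tf.layers.conv2d(x, filters, 3, padding='same', use_bias=False)
    y = tf.layers.batch_normalization(y)
    y = tf.nn.relu6(y)
    y = tf.layers.conv2d(y, filters, 3, padding='same', use_bias=False)
    return tf.layers.batch_normalization(y)

def residual_block(x, filters=128):
    """Standard AlphaZero-style residual connection: y = relu(x + F(x)),
    here with the activation clipped at 6 as described above."""
    return tf.nn.relu6(x + residual_branch(x, filters))
```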
Note that if `F(x)` does not have a zero mean, then the magnitude of the activations keeps growing with every residual block, and the easiest way for the optimizer to keep the activations within the clipping range is to push the convolution weights towards a negative mean.

A solution to this is to instead interpolate between `x` and `F(x)`, so that the magnitude of the output stays comparable to that of the input:

`y = relu(α · F(x) + (1 - α) · x)`
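A sketch of how the interpolated block could look, reusing `residual_branch` from the sketch above; the fixed value `alpha = 0.5` is an assumption.

```python
def interpolated_residual_block(x, filters=128, alpha=0.5):
    """Highway-style variant: y = relu(alpha * F(x) + (1 - alpha) * x).
    Interpolating instead of adding keeps the magnitude of the output
    comparable to the magnitude of the input."""
    y = residual_branch(x, filters)  # F(x) from the sketch above
    return tf.nn.relu6(alpha * y + (1.0 - alpha) * x)
```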
We could also make α a trainable parameter instead of a fixed constant.

### Results

This change seems to have a positive effect on the final loss of the training:
It also seems to improve the actual playing strength of the engine:
When looking at the activations after introducing the α interpolation, the border problem seems to have been largely resolved. But an unintended side-effect of making sure that the magnitude of the activations does not grow is that they no longer come anywhere close to the upper clipping point.

If we want to preserve 99.9% of all values, assuming the values follow a normal distribution with a mean of 0 and a variance of 1, then we should clip the activations at 3.09023. This is a far cry from 6.0, and shows that we are effectively just wasting half of the quantized range. This inefficient use of the quantized range presumably also had negative effects before introducing α, since the initial parts of the architecture were not using the entire range, resulting in a low resolution during the initial parts of the inference.
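The 3.09023 figure is just the 99.9th percentile of a standard normal distribution, which can be checked with a one-liner:

```python
from scipy.stats import norm

# Clipping point that preserves 99.9% of values drawn from N(0, 1).
print(norm.ppf(0.999))  # ≈ 3.0902323
```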
Tweak the neural network architecture based on discoveries made in #34:

- Change the residual blocks to a highway architecture to avoid the magnitude of the activations exploding.
- Clip relu activations at 3.09023 instead of 6.0.
- Reduce batch size from 1024 to 512 (improves performance, and does not seem to have any impact on the final accuracy).
- Disable adversarial training by default.

Also make some quality of life changes to the script:

- Read _big SGF files_ instead of TFRecords. This allows us to simplify and streamline the training pipeline.
- Add a `--tower` option, that allows us to inspect the internal features of the network.
- Add a measurement of the orthogonality of the weights to `--print`.
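The commit message does not spell out how the orthogonality of the weights is measured; one reasonable interpretation, offered only as a sketch, is the deviation of the column-normalised Gram matrix of each kernel from the identity.

```python
import numpy as np

def orthogonality(weights):
    """Deviation of the column-normalised Gram matrix from the identity;
    0.0 means all output kernels are orthogonal to each other.  Whether this
    matches the metric added to `--print` is an assumption."""
    w = weights.reshape(-1, weights.shape[-1])               # [k*k*in, out]
    w = w / (np.linalg.norm(w, axis=0, keepdims=True) + 1e-8)
    gram = w.T @ w                                           # [out, out]
    return np.linalg.norm(gram - np.eye(gram.shape[0]))
```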
We should investigate the internal features of the tower representation. Doing this should provide several benefits.