
Question about heads architecture #2201

Open
Ishinoshita opened this issue Feb 8, 2019 · 13 comments

@Ishinoshita

Not an issue, but rather a question, in fact.

The policy and value heads are two quite shallow networks (1 conv + 1 FC and 1 conv + 2 FC respectively) plugged on top of a tower of 20 or 40 residual blocks. Thus, they rely on shared information to produce two different estimators.

Although interpretability is lacking, that shared information surely contains enough refined data to quickly estimate group status. If a large surrounded group is unconditionally alive, that information matters for the value head. It also matters for the policy head, if only to dismiss any move in that area for both sides (a ko fight on the other side of the board may change that). If the group is only conditionally alive, that area will probably be of interest to both heads. If the game is very close, the status of even a small group of stones might be critical for the value head to estimate the win rate, and for the policy head to suggest the best moves to search. If the game is badly out of balance, that small group may not matter much to the value head, whereas the (raw) policy should still suggest the correct saving or killing moves.

Since the two heads have different goals, would deeper heads help each of them achieve its goal better? E.g. (just to illustrate), a network comprising a 38-residual-block tower followed by two heads, each made of 2 residual blocks followed by the current head architecture, would a priori allow each head to refine the shared information in a way more specific to its own goal (policy or value estimation).

Would that help? Of course, I have no clue about the answer. I'm just surprised never to have seen this hyperparameter discussed anywhere.

Maybe I just didn't look in the right place, or maybe there is a good rationale for dismissing this kind of architecture exploration. Maybe DeepMind investigated it and concluded that the best synergy is obtained when the heads share information (and gradients during backpropagation) for as long as possible...
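
To make the proposal concrete, here is a minimal PyTorch-style sketch of what "deeper heads" could look like: a shared 38-block tower, then 2 private residual blocks per head in front of the usual shallow head layers. This is only an illustration of the idea, not Leela Zero's actual code; the 18 input planes, 256 filters and 19x19 board size are assumptions, and batch norm in the head convolutions is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)

class DeepHeadNet(nn.Module):
    def __init__(self, channels=256, shared_blocks=38, head_blocks=2, board=19):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(18, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.tower = nn.Sequential(*[ResBlock(channels) for _ in range(shared_blocks)])
        # Each head gets its own small residual tower before the usual shallow head.
        self.policy_trunk = nn.Sequential(*[ResBlock(channels) for _ in range(head_blocks)])
        self.value_trunk = nn.Sequential(*[ResBlock(channels) for _ in range(head_blocks)])
        self.policy_conv = nn.Conv2d(channels, 2, 1)
        self.policy_fc = nn.Linear(2 * board * board, board * board + 1)
        self.value_conv = nn.Conv2d(channels, 1, 1)
        self.value_fc1 = nn.Linear(board * board, 256)
        self.value_fc2 = nn.Linear(256, 1)

    def forward(self, x):
        s = self.tower(self.stem(x))                        # shared representation
        p = F.relu(self.policy_conv(self.policy_trunk(s)))
        p = self.policy_fc(p.flatten(1))                    # move logits (incl. pass)
        v = F.relu(self.value_conv(self.value_trunk(s)))
        v = torch.tanh(self.value_fc2(F.relu(self.value_fc1(v.flatten(1)))))
        return p, v
```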

@alreadydone
Contributor

Might be a good idea. But then you would have to decide how many filters to use for the policy branch and how many for the value branch (they don't have to add up to 256). If you used 128 filters for both, that would reduce the computation of those blocks by 2x (less in practice), since 2×128×128 / (256×256) = 1/2.
Note that the original AlphaGo used separate policy and value networks with no sharing at all, which is this idea taken to the extreme.
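
As a quick sanity check of that 1/2 figure (ignoring batch norm, biases and the small 1x1 head convolutions), the cost of a 3x3 convolution scales with C_in × C_out, so the board size cancels out of the ratio; a hypothetical helper:

```python
# Rough multiply-add count of a 3x3 convolution on a 19x19 board.
def conv_cost(c_in, c_out, board=19, kernel=3):
    return c_in * c_out * kernel * kernel * board * board

shared = conv_cost(256, 256)                       # one 256-filter block
split = conv_cost(128, 128) + conv_cost(128, 128)  # two 128-filter branches
print(split / shared)  # -> 0.5
```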

@Ishinoshita
Author

Ishinoshita commented Feb 8, 2019

IIRC, the AGZ paper estimated the benefit of merging the policy and value networks at about +600 Elo, so that definitely argues against early separation. I was just wondering about some additional degrees of freedom between the heads.

Also, I was wondering whether the same information (the output of the residual tower) could feed a third 'action value' head (which might be used to initialize Q?...), but I fail to see how to build a satisfying training target for that head.

Edit: it might be trained against Q(a), where 'a' is a move which has received at least one visit in the search. Its usefulness could indeed be to initialize Q for new nodes. Init-to-draw and init-to-loss are arguably the best simple init values, and init-to-loss seems the best absent any other information; but if one were to redo a search from the same position, the most promising init values would be the Q estimated by the first search. So if an action-value head were trained to predict Q(a), that head could be used to prime Q in the search.
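
A sketch of what that priming could look like at node expansion, assuming a hypothetical action-value prediction is available; init-to-loss is kept as the fallback:

```python
# Hypothetical pseudocode, not Leela Zero's actual tree code.
INIT_TO_LOSS = -1.0  # from the perspective of the side to move

def initial_q(predicted_q=None):
    """Q value a freshly expanded child starts with."""
    if predicted_q is not None:   # action-value head output for this move
        return predicted_q
    return INIT_TO_LOSS           # absent other info, assume the worst
```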

@TFiFiE
Contributor

TFiFiE commented Feb 8, 2019

I don't quite see the point, because the two computations can end up being de facto disentangled from each other to whatever degree the training process deems optimal. Or are you arguing that too much freedom in this regard could actually make for suboptimal training? Doesn't the very success of the zero approach in deep learning belie that notion in the first place?

I do wonder, however, what went into determining each head's architecture, so as to know what the right choice would be for outputting a territory estimate. Should it be like the policy head (without the pass output), with the softmax replaced by an individual squashing of each intersection's value, or should each intersection be computed like the value head?
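
For what it's worth, here is a sketch of the first option only, as I read it: a policy-head-like 1x1 convolution, no pass output, and a per-intersection tanh instead of the softmax. The filter count and the choice of tanh are assumptions for illustration, not something the project does.

```python
import torch
import torch.nn as nn

class TerritoryHead(nn.Module):
    """Policy-head-like territory head: 1x1 conv, FC, per-intersection tanh."""
    def __init__(self, channels=256, board=19):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2, 1)                   # same 1x1 conv shape as the policy head
        self.fc = nn.Linear(2 * board * board, board * board)   # no pass output

    def forward(self, tower_out):
        x = self.conv(tower_out).flatten(1)
        # tanh squashes each intersection to [-1, 1]: -1 opponent's, +1 own territory
        return torch.tanh(self.fc(x))
```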

@Ishinoshita
Author

@TFiFiE I'm not arguing anything, just highlighting the fact that I haven't seen it discussed anywhere whether such shallow heads are optimal. Between AG's deep, fully separated 'heads' and AZ's fully intertwined heads, is the benefit monotonic? That's my only question here.

@alreadydone
Contributor

It could happen that in the last few blocks of a well-trained network some filters are devoted to the policy head and some to the value head, with only weak connections between them, so that when you apply a pruning algorithm to the net you naturally get one whose last few blocks are effectively split into two parts. I don't know of anyone who has tried this; pruning should be able to trade some accuracy for speed on slow devices.

@Ishinoshita
Author

I see your point. Thank you!

@Vandertic

On SAI 7x7 we tried increasing the number of filters in the 1x1 convolutions, but the performance was more or less the same.

@gcp
Member

gcp commented Feb 11, 2019

> Also, I was wondering whether the same information (the output of the residual tower) could feed a third 'action value' head (which might be used to initialize Q?...), but I fail to see how to build a satisfying training target for that head.

I suggested this earlier: #1109

@Ishinoshita
Author

@gcp Thank you for the reminder! I was indeed reinventing the wheel here.
So many threads..., some of which I read and then forgot :-(

@rhalbersma

> Also, I was wondering whether the same information (the output of the residual tower) could feed a third 'action value' head (which might be used to initialize Q?...), but I fail to see how to build a satisfying training target for that head.
>
> I suggested this earlier: #1109

This has been tried for the game of Hex:
https://www.ijcai.org/proceedings/2018/0523.pdf

@alreadydone
Contributor

@rhalbersma See lightvector's analysis of that paper: #1109 (comment)

@Ishinoshita
Author

Ishinoshita commented Feb 12, 2019

Still a naive question: would it be possible to train an action-value head directly against the Q values estimated by the search, by masked training, that is, during learning, backpropagating the gradient only through the moves that have received at least one visit and thus have some Q(s, a) estimate? This would still require adding Q to the training data, while the mask ("don't care") could simply be derived from the locations where the policy target is 0.

With Dirichlet noise forcing visits to suboptimal moves or outright blunders, I would expect that action-value head to see enough examples of non-optimal moves and thus to generalize reasonably well to unseen positions, much as in the SL training of a policy network on a pro-game dataset, where the classifier targets are one-hot vectors.

Edit: possibly, to avoid any tricky issues with a loss function over 3 heads, the action-value head could be trained separately, with its own loss function, without interfering with the residual tower, i.e. it would just treat the output of the tower as an input, while the tower itself would still be trained only on the current V and P head losses, as done today.
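
A sketch of how that masked loss could look, under the assumptions above: q_pred comes from a hypothetical action-value head fed with the tower output, q_target holds the search Q(s, a) for visited moves, and the tower output is detached so no gradient flows back into the tower.

```python
import torch

def masked_q_loss(q_pred, q_target, visit_counts):
    """MSE over visited moves only; unvisited moves contribute no gradient."""
    mask = (visit_counts > 0).float()
    se = (q_pred - q_target) ** 2
    return (se * mask).sum() / mask.sum().clamp(min=1.0)

# Training the head without touching the tower (the "Edit" above):
# q_pred = q_head(tower_out.detach())
# loss = masked_q_loss(q_pred, q_target, visit_counts)
```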

@alreadydone
Contributor

alreadydone commented Feb 21, 2019

[3D scatter plot: one point per input feature, showing its weights into the two 1x1 policy-head filters and the single 1x1 value-head filter]
The first layer of the policy head is a 1x1 convolution with 2 filters, and that of the value head is a 1x1 convolution with 1 filter. I plotted 256 points in 3D space indicating how the 256 features (of the 256x40 net #205) feed into these 3 filters. A lot of features lie close to one of the x, y, z axes, so these 3 filters (and in particular the two heads) seem to be well separated to some degree. Most features concentrate around the origin, but a few contribute significantly to the value head or to the first filter of the policy head; none do for the second filter of the policy head, though.
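
For anyone who wants to reproduce this kind of plot, a sketch along these lines should work, assuming the two 1x1 weight matrices have already been extracted from the weights file into arrays w_pol (2, 256) and w_val (1, 256); the parsing itself is not shown and the random arrays below are only stand-ins.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: replace with the real 1x1 conv weights from the network file.
w_pol = np.random.randn(2, 256) * 0.05   # two policy-head filters
w_val = np.random.randn(1, 256) * 0.05   # one value-head filter

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
# One point per input feature: its weight into each of the three filters.
ax.scatter(w_pol[0], w_pol[1], w_val[0], s=8)
ax.set_xlabel("policy filter 1")
ax.set_ylabel("policy filter 2")
ax.set_zlabel("value filter")
plt.show()
```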
