It may forget pertinent information about positions that it no longer visits. #38
I think that is a possibility. I have a simple hypothesis that
so I feel that gradually increasing sim_per_move and the dataset size is effective.
I think a larger sim_per_move and a larger self-play dataset can't resolve the "no longer visits" problem, because unusual positions can't be selected by self-play MCTS.
@mokemokechicken I asked @gooooloo a similar question in another thread, but what is the default number of games per gradient update in your algorithm? I guess this ratio is important for performance, since it behaves like sims/move, which is undoubtedly important.
I am not sure which concrete number to give, but the resulting speed is as follows.
setting
speed
so
Maybe it means that each position is learned 68 times, regardless of (nb_game_in_file, max_file_num).
Thanks for your answer. In the case of Go with AlphaZero, 700k minibatches (2048 positions each) and 21 million self-play games were performed. Assuming that each game ended with 150 stones (positions) placed, 700k x 2048/(21m x 150)=0.44 [trained position]/[self-play-generated position], which is much less than 68. So I guess you can improve your performance with more self-play per update. Maybe the performance gain from increasing sims/move from 100 to 800 came about because you had a small self-play/training ratio, that is, too little exploration. Since generating more games yields more diverse data than raising sims/move, spending more time on self-play may be more beneficial than more sims/move. But in practice, since your algorithm doesn't allow multi-processing (of multiple games) as done by Akababa, my suggestion may not be useful. It may be useful for @gooooloo, though.
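For anyone who wants to redo the arithmetic in the comment above, here is a minimal Python sketch. The function name `train_to_selfplay_ratio` is made up for illustration, and 150 positions per game is the same assumption the comment makes. Evaluating the expression exactly as written gives roughly 0.455, in line with the 0.44 quoted.

```python
def train_to_selfplay_ratio(num_minibatches, batch_size, num_games, positions_per_game):
    """Trained positions per self-play-generated position (hypothetical helper)."""
    trained_positions = num_minibatches * batch_size
    generated_positions = num_games * positions_per_game
    return trained_positions / generated_positions


# Figures quoted in the comment; 150 positions per game is an assumption.
az_go_ratio = train_to_selfplay_ratio(
    num_minibatches=700_000,
    batch_size=2048,
    num_games=21_000_000,
    positions_per_game=150,
)
print(f"AlphaZero Go training/self-play ratio: {az_go_ratio:.3f}")
```

The same helper can be fed your own pipeline's counts to compare against the 68 mentioned earlier.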
I think so too. So I am planning to implement multiprocess self-play.
I am testing on the feature/multiprocess_selfplay branch. With 16 parallel self-play processes,
so
Cool. So, multi-processing successfully decreased the ratio and achieved 36s per game under 400 sims/move. Now, it suffices to elucidate the trade-off between training/selfplay ratio and sims/move. I'm excited for your subsequent announcements!
I also added a wait to the optimizer to change the ratio. Now,
so
Mine is:
I actually don't understand the number below that @mokemokechicken mentioned:
But if I just use this number, then my self-play speed is 80 positions per second (= 400/5).
Thanks @AranKomat. I didn't see this post until just now...
Yes, I also think so. DeepMind uses 2000+ or 4000+ TPUs for self-play (as Aja Huang says in a post; I just can't remember the link). We can see that self-play performance is important.
Actually, I was getting a smaller selfplay/training ratio when increasing sims/move from 100 to 800. Although I also introduced a multi-process implementation at that time, the overall self-play game speed is a little slower than before. Yet I observe an improvement in the AI's strength.
@gooooloo In AlphaZero, a staggering 5000 TPUs were used, so I totally agree. It's weird but nice that increased sims/move resulted in a smaller ratio. Hopefully, @mokemokechicken and others will observe a similar phenomenon.
Note:
But a Reversi game has at most 60 positions to move in, doesn't it? Even with up to 5 "PASS" moves, it is 65. Then even with game-state flips and rotations, it is at most 260. UPDATE:
The ratio is # of self-play moves / # of trained moves. I increased the # of sims per move, so self-play got slower and the # of self-play moves got smaller. But the training module did not change, so the overall ratio got smaller. Right?
@gooooloo Sorry, I thought you were talking about the training/self-play ratio, but it was the opposite. My mistake. I also agree with you about the number of positions per game.
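The per-game position bound discussed above can be written out as a small Python sketch. All names here are hypothetical. The pass count of 5 and the symmetry factor of 4 are simply the figures used in the 65 and 260 estimate, not proven bounds; note that other comments in this thread count 8 board symmetries.

```python
MAX_STONE_MOVES = 60  # 64 squares minus the 4 discs on the board at the start
MAX_PASSES = 5        # figure used in the comment above, an assumption
SYMMETRY_FACTOR = 4   # factor implied by the 65 -> 260 step, an assumption


def max_positions_per_game(stone_moves=MAX_STONE_MOVES,
                           passes=MAX_PASSES,
                           symmetry_factor=1):
    """Upper bound on training positions produced by one Reversi game."""
    return (stone_moves + passes) * symmetry_factor


without_augmentation = max_positions_per_game()
with_augmentation = max_positions_per_game(symmetry_factor=SYMMETRY_FACTOR)
print(without_augmentation, with_augmentation)
```

Passing `symmetry_factor=8` gives the bound under the full set of rotations and reflections instead.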
@AranKomat I made a calculation mistake. Please see that post again; I have modified it.
@gooooloo Well, that makes sense. But when I said 150 stones for an average Go game, I didn't take the symmetries into account, so for a fair comparison I didn't consider the symmetries of Reversi either, which has the same set of symmetries as Go. Sorry for not being explicit. What we're concerned with is our training/self-play ratio (5.3 after symmetries) versus AZ's training/self-play ratio (about 0.44, or 0.44/8=0.055 after symmetries), so there's still about a 100-fold difference, which is reasonable given the number of GPUs we're using.
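The comparison above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are made up, and 5.3 and 0.44 are the ratios quoted in the thread, not values measured here.

```python
SYMMETRIES = 8  # rotations and reflections of a square board

ours_with_symmetries = 5.3     # ratio quoted in the thread (after symmetries)
az_without_symmetries = 0.44   # AlphaZero ratio quoted in the thread

# Dividing by 8 puts the AlphaZero figure on the same "after symmetries" footing.
az_with_symmetries = az_without_symmetries / SYMMETRIES
gap = ours_with_symmetries / az_with_symmetries

print(f"AZ ratio after symmetries: {az_with_symmetries:.3f}")
print(f"Gap: about {gap:.0f}x")
```

The gap comes out just under 100, which matches the "about 100 times" estimate.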
It is strange that the training/self-play ratio falls below 1. It means that some positions are never used in training.
The ratio of 0.44 was obtained from AlphaZero, where symmetry wasn't exploited. Also, Shogi and Chess cannot exploit symmetries, so they set the self-play vs training ratio of AlphaZero based on the assumption that self-play data isn't necessarily as plentiful as in symmetric games. Without symmetry, the ratio is 0.44, which is closer to 1. The ratio for Shogi and Chess may be even closer to 1. Also, in symmetric games without symmetric data augmentation, the NN quickly learns symmetry, which was demonstrated by AZ being superior to AGZ in Go. Considering the eventual meaninglessness of symmetric data augmentation, the net ratio of @gooooloo becomes 5.3*8=42.4. So, he needs at least 42 times more GPUs for self-play to get to 1.
@AranKomat @mokemokechicken I double-checked my pipeline's performance: it should be 25 processes at 180 seconds per game per process, which gives about 7 seconds per game on average. So my ratio should be about 7.* (= 426/(400/7)), not 5.3.
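Here is a rough Python sketch of the throughput arithmetic in the comment above. The function names are hypothetical. Reading 426 as trained positions per second and 400 as self-play positions per game is my interpretation of the expression 426/(400/7); the thread does not spell these out.

```python
def effective_seconds_per_game(num_processes, seconds_per_game_per_process):
    """Average wall-clock seconds per finished game across parallel workers."""
    return seconds_per_game_per_process / num_processes


def training_to_selfplay_ratio(trained_positions_per_sec,
                               selfplay_positions_per_game,
                               seconds_per_game):
    """Trained positions per self-play position (interpretation is an assumption)."""
    selfplay_positions_per_sec = selfplay_positions_per_game / seconds_per_game
    return trained_positions_per_sec / selfplay_positions_per_sec


sec_per_game = effective_seconds_per_game(25, 180)            # 180 / 25
ratio_rounded = training_to_selfplay_ratio(426, 400, 7)       # as in the comment
ratio_exact = training_to_selfplay_ratio(426, 400, sec_per_game)
print(sec_per_game, round(ratio_rounded, 3), round(ratio_exact, 3))
```

Using the unrounded 7.2 seconds per game nudges the ratio slightly higher than the rounded figure, but both land in the "7.*" range.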
I see that my model is no longer improving.
Moreover, when unusual actions are selected, I found that, as ThomasWAnthony's opinion says, "It may forget pertinent information about positions that it no longer visits."
@mokemokechicken, @gooooloo What do you think?