Question about distributed learning #1
Yes, we are working on the server portion of the distributed system. It is probably possible to get very decent performance with 10-100 people in a few months, or less. The idea is also to start with an even smaller network to see if the system works correctly, and where we end up strength-wise. I will post more info here and on the computer-go mailing list when we have something ready to test.

Aside from the server you need a small self-play GTP client (I have this ready and will upload it to a separate GitHub repo soon) and a script to fetch the network weights from the server and upload the games back. The code in this repo also needs 2 minimal tweaks to include more randomness in the games.

Low dan level should be achievable, I think. You only need a small network for that (~6 residual layers or so). The supervised network that is available as an example probably already reaches ~1 dan level on a GTX 1080, and it was only trained for 2 days on a single GPU. (Of course, I cheated by not bootstrapping from self-play.)

Despite the bad news that replicating the final results will literally take ages, the good news is that initial progress goes very quickly.
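A minimal sketch of what such a fetch/self-play/upload client loop could look like (the server URL, endpoints, and the client flags here are illustrative assumptions, not the actual protocol):

```python
import subprocess
import urllib.request

SERVER = "http://example.invalid"  # hypothetical server address

def run_one_game():
    # Fetch the current best network weights from the server.
    urllib.request.urlretrieve(SERVER + "/best-network", "weights.txt")
    # Play one self-play game with the GTP client (these flags are made up).
    subprocess.run(["./leelaz", "--weights", "weights.txt",
                    "--selfplay", "--out", "training.dat"], check=True)
    # Upload the generated training data back to the server.
    with open("training.dat", "rb") as f:
        req = urllib.request.Request(SERVER + "/submit-game", data=f.read())
        urllib.request.urlopen(req)

if __name__ == "__main__":
    while True:
        run_one_game()
```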
@gcp thank you for the fast answer!
Yes, sure. I wonder if there's some way to do it automatically on GitHub.
You can distribute binaries using GitHub Releases. Not exactly automatic, but I think that's the easiest way. I wonder how many people will be willing to donate compute time compared to something like Fishtest.
Why distribute the training when Google provides access to TPU-based processors through their cloud? You have access to the same hardware that Google used to train AlphaGo... just start a Kickstarter or a Patreon to cover the cost of training...
@gafferongames Cloud TPUs are still in "preview" mode, i.e. they reach out to you if you are to get access, but as far as I know nobody who applied through the public sign-up sheet has actually gotten in yet. Nor have they given any details about Cloud TPU pricing, beyond the 1,000 free ones they're going to be donating to certain academic groups. Unless I've missed some announcement somewhere?
Various reasons:
Small update here: the missing randomization parts are implemented. I managed to build a Windows binary (probably required if we're hoping many people will join!). Still needed: finishing up the self-play tool, and then beta-testing the server setup. I am testing different training configurations with a human database (supervised learning) to get some idea of the parameters (learning rate, decay, value vs policy weighting), the performance on normal GPUs, and to be able to optimize the search parameters (UCT coefficient etc.) in leela-zero.
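For reference, the value-vs-policy weighting mentioned above enters an AlphaGo Zero-style loss roughly as follows (a sketch only; the `value_weight` parameter and these names are illustrative, not leela-zero's actual training code):

```python
import numpy as np

def combined_loss(policy_target, policy_pred, z, v, value_weight=1.0):
    """AlphaGo Zero-style loss: cross-entropy between the search
    probabilities and the policy head, plus a weighted squared error
    on the value head (L2 weight regularization omitted)."""
    policy_loss = -np.sum(policy_target * np.log(policy_pred + 1e-10))
    value_loss = (z - v) ** 2
    return policy_loss + value_weight * value_loss
```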
Could https://github.com/yenw/computer-go-dataset be useful as an additional database? It contains thousands of FineArt/DeepZen/CGI games against pros and top amateurs.
Only for calibrating with a supervised learning dataset. I already have a database of comparable size, but thanks for the link. (AlphaGo Zero's paper used a smaller dataset of only KGS games, so the size shouldn't be an issue.)
I'm jumping in the middle of a conversation here, but would it be feasible to write a BOINC application to do the distribution for us? BOINC is essentially a portal of distributed computation projects and makes it easy to contribute computer cycles to whatever you find personally interesting. It houses projects like SETI@Home, CERN's physics stuff, prime number searches, etc. I think a lot of non-Go players would even be interested in contributing.
Once distributed training is ready to go, I'll donate some time on my Titan X. What might help would be binary releases once in a while: Win64, whatever's the current latest Ubuntu LTS, etc. Static linking whenever possible would simplify things even further. This will attract a wider base of users.
I'll donate time on my GTX1080 for this.
At the start of the learning, the network can't score games correctly so it certainly can't make reasonable resignation decisions. Resigning can only be enabled if training has progressed somewhat.
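As an illustration, the gating could be as simple as the sketch below; the function name and thresholds are made up, not taken from leela-zero:

```python
def allow_resign(games_played, value_estimate,
                 min_games=10000, resign_threshold=0.05):
    """Only allow resignation once training has progressed enough for
    the value head to be trusted, and the estimated winrate is very low."""
    if games_played < min_games:
        return False  # an early network can't score games, so never resign
    return value_estimate < resign_threshold
```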
I will also donate time on my GTX1080. I am a student of Go in Korea; my dream is to become a professional Go player. I am purely curious about the distributed learning system. If distributed learning is possible, there are many students of Go in Korea who would want to participate. Is it possible to train the network on a separate computer, without a real-time network connection? For example, if we run the learning algorithm on a single computer it may build a small network. How could that be made possible?
In theory, yes, as long as you can download a new network periodically and upload the data periodically. Depending on the delay in updating, you'll have a slower learning-feedback loop, though, so it's not ideal. And the client would have to be written to deal with this, which certainly isn't going to be true for the first versions.
What would "adjust" even mean here?
What I meant is merging. The complexity of Go comes from the uncertainty over a large number of moves: we can't evaluate that many moves even though we have learned a lot. But with a well-trained value network the story is different. We would know who is ahead and who is behind without playing out the whole game; we could decide who will win at any time. If we look at local move ranges (1-20, 10-20) it becomes an easy game. That is why I thought we were doing an "adjust": I thought we would merge small networks. I am not familiar with programming, sorry for the amateurish reply.
Sorry, but I completely fail to understand what you are proposing. The whole point is to learn an accurate value network. If we already had one then most of the effort wouldn't be necessary. If the score is very much to one side, then the program can resign and terminate the game early (but there are caveats here - see the original paper which addresses them well). Even so I don't understand what you would want to achieve by only looking at 10 moves from a total game, or how this would help distribution for clients that aren't connected to the internet, or anything really.
There is no "merging" of networks. Every client plays with the full, best-so-far network. The only thing that is "merged" is the training data, which consists of the network inputs, the search outcome, and the game outcome sampled over many games. (The latter is described in the README under "Training data format".)
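In other words, each client contributes records shaped roughly like this (a sketch of the idea only; the authoritative on-disk format is the one documented in the README, and these field names are illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    input_planes: List[float]   # network inputs: board position features
    search_probs: List[float]   # search outcome: visit distribution over moves
    game_outcome: float         # +1 if the side to move won the game, -1 if it lost

# Nothing network-related is merged: samples from many clients' games are
# simply pooled into one dataset that the next network is trained on.
```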
DeepMind claims that they trained for only 3 days on 4 TPUs to reach a level comparable to the old AlphaGo. Nvidia claims their Volta GPUs probably have better performance than the TPU. Can anyone explain the discrepancy - why 1700 years? That is like the difference between a normal PC and a quantum computer.
@wensdong As far as I can tell, this is something that has been misreported by the media: when playing a game (e.g. against AlphaGo Lee), AlphaGo Zero does indeed use a single machine with 4 TPUs, but when generating the self-play games for training, it is practically certain that they use many more machines: generating 29 million self-play games on a single machine would be pretty tough :) As far as I'm aware, there's no data on the hardware used for self-play. Edit: @gcp I was wondering - do you intend to release the raw self-play games that you collect, or just the generated weights?
The idea is that all data stays open and public domain.
@wensdong See the post to computer-go that is linked from the README, where this is all already explained. It is simply not possible to generate the required data with only that hardware, and nowhere did DeepMind claim that. They only talked about the playing machine (when the end result is already available!) and the learning machine (64 GPUs), which is a small part of the required learning pipeline (it doesn't generate the actual games!).
Is the server done yet? If so, or when it's done, I'd gladly let this train on my GTX1080 while I'm sleeping.
Updates:
With (2) now finished, it's time to do the final tweaks and start putting autogtp and leela-zero into a package that interacts with the server.
Updates:
I was looking at the code and I have a question:
Good question! I need a controller program anyway to handle the HTTP networking, unzip etc. and I already had some code to autoplay 2 GTP programs that I adapted. It is true that because the distributed program uses the same network for both players, it could play against itself in a single process. It's probably pretty easy to adapt autogtp to ditch the second process.
@marcocalignano I will test it tomorrow, I think by then I will have some 85k networks to test with it too.
But should I go on with the multi-games option?
Yes, this looks good. Specifying the GPU id explicitly works but requires the user to understand how to launch Leela on the console and find the right numbers (no problem for my use, of course). I think for using this technique in autogtp it should take a simple count of the number of GPUs to use, and then just use --gpu 0, --gpu 1, etc. As said, I will change leela-zero so it numbers the GPUs in order of preference.
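A minimal sketch of that spawning scheme, one worker process per GPU (only the --gpu numbering follows the comment above; the binary name and other flags are placeholders):

```python
import subprocess

def launch_workers(gpu_count):
    """Start one self-play process per GPU, numbered --gpu 0, --gpu 1, ..."""
    procs = []
    for gpu_id in range(gpu_count):
        procs.append(subprocess.Popen(
            ["./leelaz", "--gpu", str(gpu_id), "--weights", "weights.txt"]))
    return procs

# e.g. launch_workers(2) on a dual-GPU machine
```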
Ok, I am ready to test multi-games mode; what do I have to pull to have the latest client?
Game data 852c24aaf81ef708dc773157e298c0a296d6c9e4ab9ba8d9ceddaeb0686ded50 stored in database
Game data 661291685827292130b0d566cd76cb47c65a593fd4c8469bee3e495de565cf06 stored in database
Can you check whether these two games are OK?
@gcp did you check the previous games?
They arrived fine in the db, i.e.:
Ok, I did them with the multi-games version, with all the improvement merges that come in ;) I have to keep rebasing, but I'd like to do a pull request, maybe on the next branch.
Question: can measures be implemented to prevent simple attacks on Leela, like checking some statistical characteristics of the input data per user? My only fear is that later on, or even right now, someone could be spamming the training data with noise.
Currently, since Leela Zero uses two threads, the result isn't deterministic, so confirming results isn't really possible.
You can give each thread its own random engine and deterministically split work, so it's definitely possible. However, you'd need to implement your own distribution functions in order to make it replicable across platforms. It would also require server support, because the server would need to select a random engine initialization sequence for each thread and each run. I have quite a bit of experience in making reproducible pseudorandom results across platforms with multithreaded execution, and if you need I can take a look at the code and see what I can do.
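A minimal sketch of the per-thread-engine idea, assuming the work can be split statically up front (which, as the next reply points out, is exactly what fast parallel tree searchers avoid doing):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def worker(thread_id, run_seed, items):
    # Each thread gets its own engine, seeded from the run seed and its id,
    # and a fixed slice of the work, so a rerun replays identically.
    rng = random.Random(run_seed * 8191 + thread_id)
    return [item + rng.random() for item in items]

def deterministic_run(run_seed, work, threads=2):
    slices = [work[i::threads] for i in range(threads)]  # static round-robin split
    with ThreadPoolExecutor(max_workers=threads) as pool:
        parts = pool.map(worker, range(threads), [run_seed] * threads, slices)
    return [x for part in parts for x in part]
```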
The problem isn't the random number generators; the problem is that the threads are used to feed hardware that is shared with other processes. And the whole setup of fast parallel game tree searchers generally means that "deterministically split work" makes the algorithm run much slower than optimal, because the whole point is to split at the most opportune moment. (Work stealing algorithms tend to rule the roost.) Parallel game tree search is a 35+ year old research topic; the odds of making a breakthrough in a few afternoons aren't that good.
I'm not certain that work stealing algorithms necessarily preclude reproducibility. This might just be something that no one's tried, rather than a breakthrough. You're right that the approach in my previous post is completely wrong (because this is a tree search, whereas my previous work had been on a 2D plane). However, we can shift the problem like this (assuming we need x bits of random numbers maximum per node visit): we don't know which nodes we will visit, so we need a way of numbering nodes. If we come up with one, we can simply bin a bunch of random numbers and pull from the proper area of the array.

There's a rub, though, and that's coming up with a method of numbering nodes such that we don't need to bin gigabytes of RNG, due to how explosively huge the search space is. If it is possible to order nodes by their desired visit order deterministically (i.e., like an A* search), then the scheme will work. Otherwise a small and simple hashing scheme might be doable (think 16-24 bit), but it would probably ultimately screw up the randomness and likely lead to worse training data due to hash collisions reusing RNG data (the birthday paradox says about 1% collisions at 16-bit, that's about 10 per move, and 6e-3% at 24-bit, that's about 10 per game, but this requires generating 64x kB and 16x MB of RNG engine calls per move respectively).

Another possible scheme is some sort of (cryptographic?) hash function that takes a naive numbering scheme (such as breadth-first numbering, which would require GMP and can be expensive) and a single random number from the engine, and uses the hash as the RNG bits (if x is more than the hash length, it's possible to generate more hashes by appending sequential bytes to the end of the message). But again, this has some question marks surrounding just how good the random numbers will be (though this is at least probably fine).

Honestly, I should actually read the paper so I know what the hell I'm talking about, so if I'm that far off base, just tell me to RtFP. And I totally understand if all this complexity isn't really warranted, as the probability of bad actors is rather low.
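A minimal sketch of that last hash-based scheme, assuming a node numbering already exists (function and parameter names are made up for illustration):

```python
import hashlib

def node_random_bits(run_seed, node_number, x_bits=64):
    """Derive reproducible pseudorandom bits for one node visit by hashing
    the run seed together with the node's number; if x_bits exceeds one
    hash output, extend by appending a sequential counter byte."""
    out = b""
    counter = 0
    while len(out) * 8 < x_bits:
        msg = (run_seed.to_bytes(8, "big")
               + node_number.to_bytes(16, "big")
               + bytes([counter]))
        out += hashlib.sha256(msg).digest()
        counter += 1
    return int.from_bytes(out, "big") >> (len(out) * 8 - x_bits)
```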
I don't want to ruin your fun in exploring that topic, but as for the problem at hand, it's just easier to run 2 instances with 1 thread each if we worry about bad actors. The multi-threading is important for playing real games with a clock - where running a 2nd instance does nothing useful - but for just generating the data it is not needed. After removing multi-threading, I suspect making the engine reproducible is not that difficult, although the floating point accuracy of GPUs (how stable are those?) may be an issue. Requiring independently verified results cuts the throughput in half, but it seems to be how BOINC works. I think that may be required if a larger, longer-term run, on the order of the full AlphaGo Zero network size, were done.
It's not just GPUs. CPUs also have floating point reproducibility issues. For instance, the RSQRTPS instruction is only required to give an estimate, and on Haswell CPUs (at least) the lower 12 bits of the mantissa are always all zeros.
Hi, are you still considering launching a BOINC project for Leela?
Hello!
Since it requires a huge amount of computing resources to train the network - is it possible to create some distributed system, where everyone who is willing can join and contribute their machine's resources?
According to their paper, 40 days were needed to overcome the previous version, but for the AlphaGo Lee version 3 days were enough. So it seems less than 150 years is needed :D
If 10-100 people contribute machine resources, it seems possible to make some progress. Donations might also be an option, to rent some powerful GPU-based server (this should be even simpler, since there would be no need to build any distributed system).
Sorry if the question is silly :)
Could you estimate whether it is possible to achieve any tangible progress training this network on a GTX1080? (By tangible I mean reaching at least a low dan level within a month, for example.)
Thank you!