The server downloads are slow #1731
Comments
Hmm, you could try running leela with `-w weightfile --tune-only` and see what happens.
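Something like this, assuming the binary is called leelaz and the weights file is in the current directory (adjust the names to your setup):

```sh
# Run only the OpenCL tuning step against a local weights file,
# skipping any network download; "weights.txt" is a placeholder name.
./leelaz -w weights.txt --tune-only
```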
Maybe it failed to download the 40b network? The match count between 40b and 20b is still 0 after 2 hours.
We could add a BitTorrent feature to accelerate downloading.
We cannot download weights through autogtp automatically. Big problem!
The matches have also slowed down; maybe the weights are too big to download.
Many Google Colab users are having trouble downloading the networks...
It seems something is wrong. Usually there are around 200 clients contributing self-play games, but now there are fewer than 100.
Yes. Something is definitely wrong with the server. Download speed is ridiculously slow. It took me 5 hours to download a 50 MB weight file; in that time I could normally download at least a 20 GB file. Please fix it.
40b matches were only played by 4 clients.
There are only 33 clients now!
The new 20b weight (9.5M+96k) to be tested is also delayed.
It seems to be speeding up somewhat now, and about 14 clients have submitted matches. Could needing to download two different 40-block networks have overloaded the server's download bandwidth, perhaps? @gcp
@roy7 No, that is not the case.
Should the 40b match be stopped so normal matches can start?
I get "Got a new job: match" and then it just hangs. The match is between c910dee and 409bee. EDIT: a while later I get curl error 18. EDIT 2: and now curl error 56.
You can use wget -c to get the 409*gz file and unzip it to the networks directory.
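(For context: curl error 18 means the transfer ended early with a partial file, which is exactly the case where resuming helps.) Roughly like this, with the URL pattern and the hash as placeholders since the real link comes from the networks page:

```sh
# Resume an interrupted download (-c continues a partial file) and
# unpack it into autogtp's networks directory. HASH is a placeholder
# for the full 409... network name listed on the site.
HASH=409bee_replace_with_full_hash
wget -c "http://zero.sjeng.org/networks/${HASH}.gz"
gunzip "${HASH}.gz"
mv "${HASH}" networks/
```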
Please pause new test matches until the server is healthy.
Sorry, I don't know how to pause the test matches. @roy7, we need technical help.
Tried manually downloading files; getting 4 kB/s. Is this a server capacity issue?
Still having problems downloading networks: 4-10 kB/s on Colab, 50 kB/s on my PC. It's definitely something on the server side, because both machines download from other sources at >10 MB/s.
Yes, the server got throttled. @gcp will investigate when he can; he's away for a bit. It looks like it wasn't a problem with these networks per se, but a big spike in traffic earlier this month for some weird reason.
We spent July averaging about 100 GB of traffic per day. Around the 27th, traffic jumped to 400 GB per day. (The network size was increased, but only by 2.3x, so that can't fully explain the surge.) For most of August we had 500 GB of traffic per day (which is still relatively "too high"), until on the 10th it surged to about 2.5 TB per day (40-block networks are "only" twice as big, so that wouldn't explain this either?). When we reached about 8.5 TB of traffic, the connection got throttled. The hosting options narrow down severely (or go up in price quickly) if you require more than 75 TB of traffic per month.

The mismatch between the surges and the network sizes is also a bit worrying. 40-block nets are a bit less than 200 MB apiece. We do about 5 test matches per day. The maximum number of clients per day is typically <1500 (but the majority of them won't see all tests, so say less than half?). So realistically, with 40 blocks I'd expect to see 750 GB to 1.5 TB per day at most, not 2.5 TB. We're still investigating this and looking for solutions.
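For what it's worth, that 750 GB-1.5 TB estimate is easy to reproduce from the figures above (200 MB per network, 5 test matches per day, roughly 750-1500 clients fetching each one; all rough assumptions):

```sh
# Back-of-envelope daily traffic: MB per network * matches/day * clients.
echo "$((200 * 5 * 750 / 1000)) GB/day"    # lower bound, ~750 GB
echo "$((200 * 5 * 1500 / 1000)) GB/day"   # upper bound, ~1500 GB
```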
I just cleaned up some server stuff, so those numbers are probably not very reliable for a few more hours.
When the weights get bigger, the chance of download failure increases. To save the most traffic, I also suggest again moving the self-play sgf.xz elsewhere and splitting it into pieces per weight file, just as the training data does, since people are mostly interested in recent games.
Maybe we need some P2P when downloading the weight file. All clients getting the weight file from the server at the same time might be the bottleneck.
@gcp "Pulling 75 TB of traffic costs real money." Is the distributed effort still realistic cost-wise? We don't want you personally suffering to provide us with so much. Please take care of yourself.
Yes, that is the problem. CDN caches are local to regions, so it may be fast in some places and slow in others.
Yes, we (= me + donations, prize money, etc.) can afford this. But that doesn't mean I'm not trying to be cost-conscious here. My point was more that you can't normally get such bandwidth for free, and that services like Dropbox etc. will cut you off if you pull that kind of traffic. I saw a number somewhere that 350 TB/month equals maxing out a 1 Gbps network interface. So a 100 Mbps server with unlimited traffic is not actually good enough.
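That 350 TB/month figure roughly checks out if you treat 1 Gbps as ~125 MB/s sustained:

```sh
# 125 MB/s * 86400 s/day * 30 days, in decimal TB.
echo "$((125 * 86400 * 30 / 1000000)) TB/month"   # ~324 TB
```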
The server is uncapped again. I'll monitor how efficiently the CDN works and decide what to do about hosting when there's a bit more data on what we require now.
@gcp, we have manually mirrored the latest network files on Google Drive, and almost all the Chinese Google Colab users (hundreds of Google accounts) are using the mirrors.
@gcp, I noticed that you are checking the server traffic flow. I'd like to point out two facts that might cause you to miscount the server traffic volume of these two days. To deal with the very slow download speed, as @liujn2018 said: (i) Google Colab users are downloading the latest network files from Google Drive. Five people are responsible for downloading ELF, the latest passed network, and the current match network, and uploading them to Google Drive. Around 700-1100 clients are downloading network files from Google Drive instead of the LZ website. (ii) Personal computer users download the network files from a local QQ group server that we upload them to. Around 100 clients that I know of are running autogtp this way.

These temporary measures work fine; downloading the 3 network files takes less than 5 seconds within Google's servers, and it saves quite a lot of LZ server traffic. But it takes too much manpower to monitor the LZ website for network updates, download from the LZ website, upload to Google Drive, and update the running code. It would be great if this could be done automatically. We are also worried that Google might block those accounts over the extreme increase in traffic. Thanks.
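A rough sketch of how the mirroring could be automated, assuming the usual best-network endpoint and an rclone remote named "gdrive" that the uploader has already configured; this is just an illustration, not project tooling:

```sh
# Fetch the current best network and push it to a Google Drive mirror.
# "gdrive:lz-mirror" is an assumed rclone remote/path; running this
# from cron (say, hourly) would remove the manual monitoring step.
curl -sL -o best-network.gz http://zero.sjeng.org/best-network
rclone copy best-network.gz gdrive:lz-mirror/
```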
This phenomenon suggests that much of the traffic is not created by autogtp but (perhaps) by direct downloads from links on the homepage. This is not surprising, because people are now aware that Leela Zero can beat professional players. I believe that not only professionals but also more and more amateur Go players are downloading the most recent best network frequently. Assuming my speculation is correct (@gcp can easily verify this by comparing the IP addresses of those uploading games with those downloading weights?), my suggestion is to replace all the links explicitly provided on the homepage with torrent files, so as to offload the traffic created by those who are not contributing computational power to a P2P network. Such people might even enjoy faster download speeds from the P2P network.
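Publishing such a torrent is cheap; for example, with transmission-create (from transmission-cli), using a placeholder tracker URL:

```sh
# Build a torrent for the current best network. The tracker is a
# placeholder; a web seed pointing back at the CDN could serve as a
# fallback for clients with no peers.
transmission-create -o best-network.torrent \
    -t udp://tracker.example.org:6969/announce best-network.gz
```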
Consider keeping only the weights that haven't yet conclusively failed, plus the current best and ELF as a benchmark?
Not possible, since we don't store any personal information.
Didn't know that. :) I think the IP addresses of game uploaders are logged; otherwise we couldn't block malicious uploaders. Anyway, the purpose is just to differentiate self-play contributors from pure downloaders. Maybe we can compare SHA3 hashes of the IP addresses. There must be one way or another to do this without (seriously) compromising privacy.
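For instance (assuming OpenSSL 1.1.1+ for SHA3 support; note that unsalted hashes of IPv4 addresses are trivially brute-forced, so a server-side secret salt would be needed):

```sh
# Pseudonymize an address before comparing uploader and downloader
# lists. SALT is an assumed server-side secret; 203.0.113.7 is a
# documentation/test address, not real data.
SALT="server-side-secret"
printf '%s%s' "$SALT" "203.0.113.7" | openssl dgst -sha3-256
```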
We already do this. Most of the old weights are on the storage server.
We do log (and are allowed to log) IPs for a short period of time, to handle abuse and keep the server functioning. I've asked roy7 if he has some time to dig into what happened before they get deleted.
If this project makes it (even) hard(er) for Go players to use Leela Zero, it will have failed at its purpose.
I currently see a hit rate on the CDN of 85%, and it's still rising. So we should be fine hosting-wise with a regular solution, or maybe even our current one, as traffic dropped by a factor of more than 6. This also means that downloads for most of you who aren't in Europe (where the server is!) should be significantly faster, e.g. on the US West Coast and in China/Japan/Korea. I'm curious how the speed is for people inside China. From what I understand, there is no CDN location inside China (because that requires specific agreements), but the location you're downloading from should still be much closer. I'm not sure where the servers that Colab uses sit, but there's very likely a CDN location close to them, and you probably won't cause any traffic to the main server either.
@gcp, I am using a 100 Mbps connection, which is below the average level in China. The network download speed is about 0.5 to 1 MB/s.
On my PC: [screenshots showing the typed commands and the resulting pages; images not preserved]
@l1t1 I am not having any problems; I am getting ~24 MB/s, which is around my limit.
@gcp You should try OVH servers. They have good peering around the world and come with an unmetered 500 Mbps port with 1 Gbps burst speed. That means 161-322 TB/month. I have an SP-64 in GRA and am currently serving ~85-120 TB/month with it without any issues.
On top of that, you can buy https://www.ovh.com/world/dedicated-servers/bandwidth-upgrade.xml quite cheaply if you need more traffic.
Downloading 3c8f6488 is incredibly slow as of now. Are we back to hitting the throttling?
@ihavnoid Probably just you; it is a nearly instant download for me.
I looked at OVH, but our current provider actually has better offerings if you stay below ~75 TB/month, which we definitely do now.
It's no longer possible to get throttled. In any case, traffic to the real server is minimal. I think this thread can be summarized by this picture: [image not preserved]
The networks are now split between the main server (which is normally fastest, holds the latest networks, and is CDN-cached) and the storage server (which has slow but big disks). If you try to fetch a network from the main server and it's no longer there, you'll be forwarded to the storage server. I should probably try to make that mirroring automatic, so the "index" of networks/ works as expected.
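If you want to see the forwarding in action, something like this should show it (HASH is a placeholder for any network that has rotated off the main server):

```sh
# Fetching an old network from the main server should answer with a
# redirect (HTTP 3xx + Location header) to the storage server.
HASH=replace_with_an_old_network_hash
curl -sI "http://zero.sjeng.org/networks/${HASH}.gz" | grep -i '^location:'
```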
So it's still pretty sad from inside China.
This is now mirrored automatically every day.
@gcp I tried to start self-play on my computer, but it has been stuck at "starting tuning process" for 2 hours. Is there a problem with the LZ server? Is a restart needed?
Clients dropped from 1000 to 45 in the last 2-3 hours.