The server downloads are slow #1731

Closed
alphaladder opened this issue Aug 13, 2018 · 71 comments

Comments

@alphaladder

alphaladder commented Aug 13, 2018

@gcp I tried to start self-play on my computer, but it has been stuck at “starting tuning process” for 2 hrs. Is there a problem with the LZ server? Does it need a restart?

The number of clients has dropped from 1000 to 45 in the last 2-3 hrs.

@roy7
Collaborator

roy7 commented Aug 14, 2018

Hmm you could try running leela with -w weightfile --tune-only and see what happens.

@fell111

fell111 commented Aug 14, 2018

Maybe it failed to download the 40b network? The number of matches between 40b and 20b is still 0 after 2 hours.

@arondes

arondes commented Aug 14, 2018

We could add a BitTorrent feature to accelerate downloading.

@alphaladder
Author

autogtp cannot download the weights automatically. Big problem!

@l1t1

l1t1 commented Aug 14, 2018

The matches have also slowed down; maybe the weights are too big to download.

@diadorak

Many Google Colab users are having trouble downloading the networks...

@pcengine

pcengine commented Aug 14, 2018

It seems something is wrong. Usually there are around 200 clients contributing self-play games, but there are fewer than 100 now.

@Hayden2018

Yes. Something is definitely wrong with the server. Download speed is ridiculously slow. It took me 5 hours to download a 50MB weight file. Meanwhile, I could have downloaded at least a 20GB file in that time. Please fix it.

@l1t1

l1t1 commented Aug 14, 2018

40b matches were only played by 4 clients.

@alphaladder
Author

alphaladder commented Aug 14, 2018

There are only 33 clients now!

@l1t1

l1t1 commented Aug 14, 2018

The new 20b weights (9.5M+96k) to be tested are also delayed.

@roy7
Collaborator

roy7 commented Aug 14, 2018

It seems to be speeding up some now and there are about 14 clients who have submitted matches. Could needing to download two different 40 block networks have overloaded the server's download capacity, perhaps? @gcp

@alphaladder
Author

@roy7 No, that is not the case.
The problem had been there for several hours even before the 40b was queued.

@l1t1

l1t1 commented Aug 14, 2018

Should the 40b match be stopped so that normal matches can start?

@d7urban

d7urban commented Aug 14, 2018

I get "Got a new job: match" and then it just hangs. Match is between c910dee and 409bee.

EDIT: a while later I get a curl error "code 18"

EDIT 2: and now curl error 56

@l1t1

l1t1 commented Aug 14, 2018

You can use wget -c to get the 409*gz file and unzip it into the networks directory.

@l1t1

l1t1 commented Aug 14, 2018

Please pause new test matches until the server is healthy.

@bjiyxo

bjiyxo commented Aug 14, 2018

Sorry, I don't know how to pause the test matches. @roy7 Need technical help.

@ihavnoid
Member

Tried manually downloading files, getting 4kb/sec. Is this a server capacity issue?

@nukee86

nukee86 commented Aug 14, 2018

Still having a problem downloading networks: 4-10kB/s on Colab, 50kB/s on my PC. It's definitely something on the server side, because both download from other sources at >10MB/s.

@roy7
Collaborator

roy7 commented Aug 14, 2018

Yes the server got throttled, @gcp will investigate when he can, he's away for a bit. Looks like it wasn't a problem with these networks per se but a big spike in traffic earlier this month for some weird reason.

@l1t1

l1t1 commented Aug 14, 2018

[image: 2018-08-15_071152]
The number of self-play games is lower than the number of match games in the last 24h. Are the statistics right?

@gcp
Member

gcp commented Aug 14, 2018

We've spent July averaging about 100 GB traffic per day.

Around the 27th, traffic jumped to 400 GB per day. (The network size was increased, but only by 2.3x, so it can't fully explain this surge)

For most of August, we had 500 GB traffic per day (which is still relatively "too high") until on the 10th it surged to about 2.5 TB per day (40 block networks are "only" twice as big so wouldn't explain this either?). When we reached about 8.5 TB traffic the connection got throttled.

The hosting options narrow down severely (or go up in price quickly) if you require more than 75 TB traffic per month. The mismatch between the surges and the network sizes is also a bit worrying.

40 block nets are a bit less than 200M a piece. We do about 5 test matches per day. Maximum amount of clients per day is typically <1500 (but the majority of them won't see all tests, so say less than half?). So realistically with 40 blocks I'd expect to see 750G to 1.5 TB per day max. Not 2.5 TB.

We're still investigating this and looking for solutions.
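
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the rough figures quoted in the comment (200 MB per network as an upper bound, 5 test matches per day, at most 1500 clients); the function name and the decimal unit convention are assumptions made for illustration.

```python
# Back-of-the-envelope check of the expected daily traffic quoted above.
# All inputs are the rough figures from the comment, not measured values.
NET_SIZE_MB = 200      # "a bit less than 200M a piece", taken as an upper bound
TESTS_PER_DAY = 5      # "about 5 test matches per day"
MAX_CLIENTS = 1500     # "typically <1500" clients per day


def expected_daily_tb(fraction_of_clients: float) -> float:
    """Daily traffic in TB if this fraction of clients fetches every test net."""
    total_mb = NET_SIZE_MB * TESTS_PER_DAY * MAX_CLIENTS * fraction_of_clients
    return total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB


if __name__ == "__main__":
    print(f"every client sees every test: {expected_daily_tb(1.0):.2f} TB/day")
    print(f"half of the clients do:       {expected_daily_tb(0.5):.2f} TB/day")
```

With these inputs the range comes out at 0.75 to 1.5 TB per day, which is why the observed 2.5 TB per day looks too high to be explained by autogtp clients alone.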

@gcp
Member

gcp commented Aug 14, 2018

> the number selfplay games is less than match games in last 24h, is it really?

I just cleaned up some server stuff so those numbers are probably not very reliable for a few more hours.

gcp changed the title from "The server has some problem .(technical help)" to "The server downloads are slow" on Aug 14, 2018

@l1t1

l1t1 commented Aug 14, 2018

My current download speed:
[image: 2018-08-15_075930]

As the weights get bigger, the chance of a download failure increases.
Does the curl call in autogtp support resuming from a break-point? If not, all failed clients will try to download again and again, which produces much more traffic.
Can you try to redirect the latest weights to a CDN server?

To save as much traffic as possible, I also suggest again moving the self-play sgf.xz somewhere else and splitting it into pieces per weights file, just as was done for the training data, since people are more interested in recent games.
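
For readers unfamiliar with break-point resume, the idea is the one behind `wget -c`: ask the server only for the bytes that are still missing, using an HTTP `Range` header. The following is a minimal sketch in Python, not a description of how autogtp actually downloads networks. The function names and the `.part` suffix are hypothetical, and it assumes the server honours range requests.

```python
import os
import urllib.request


def resume_headers(partial_path: str) -> dict:
    """Return a Range header asking only for the bytes we do not have yet."""
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    return {"Range": f"bytes={have}-"} if have else {}


def resume_download(url: str, dest: str, chunk_size: int = 1 << 16) -> None:
    """Download url to dest, continuing from an earlier partial dest + '.part'."""
    part = dest + ".part"  # hypothetical naming for the unfinished file
    request = urllib.request.Request(url, headers=resume_headers(part))
    with urllib.request.urlopen(request, timeout=60) as response:
        # 206 Partial Content: the server honoured the range, so append.
        # Anything else (normally 200): the full file is being sent, start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(part, mode) as out:
            while chunk := response.read(chunk_size):
                out.write(chunk)
    # Only rename once the transfer finished without raising.
    os.replace(part, dest)
```

If the connection drops mid-transfer, the exception leaves the `.part` file in place and the next call continues from its current size instead of fetching the whole file again. A real client would also verify the finished file (for example against its hash) before using it.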

@fell111

fell111 commented Aug 15, 2018

Maybe we need to do some P2P when downloading the weight file. All clients getting the weight file from the server at the same time might be the bottleneck.

@PhilipFRipper

@gcp "Pulling 75 TB of traffic costs real money."

Is distributed effort still a cost realistic goal? We don't want you personally suffering to provide us with so much. Please take care of yourself.

@gcp
Member

gcp commented Aug 15, 2018

> I am not an expert on CDNs, but is it that there are some nets that are not cached in the CDN yet?

Yes, that is the problem. CDN caches are local to regions, so it may be fast in some places and slow in others.

@gcp
Member

gcp commented Aug 15, 2018

> Is distributed effort still a cost realistic goal?

Yes, we (=me + donations, prize money, etc) can afford this. But that doesn't mean I'm not trying to be cost-conscious here.

My point was more that you can't normally get such bandwidth for free, and that services like Dropbox etc. will cut you off if you pull that kind of traffic.

I saw a number somewhere that 350 TB/month = maxing out a 1 Gbps network interface. So if you have a 100Mbps server with unlimited traffic, that's not actually good enough.
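
The relation between link speed and monthly traffic is easy to check. The sketch below assumes a link saturated around the clock, decimal units (1 TB = 10^12 bytes) and a 30-day month; the function name is made up for illustration.

```python
def monthly_traffic_tb(link_mbps: float, days: int = 30) -> float:
    """Traffic in TB if a link of link_mbps megabits/s is saturated for days days."""
    bytes_per_second = link_mbps * 1_000_000 / 8  # megabits/s to bytes/s
    return bytes_per_second * 86_400 * days / 1e12


if __name__ == "__main__":
    for mbps in (100, 500, 1000):
        print(f"{mbps:>5} Mbps -> {monthly_traffic_tb(mbps):6.1f} TB per 30 days")
```

Under these assumptions a 1 gigabit-per-second link tops out at about 324 TB per 30 days, in the same ballpark as the 350 TB figure, while a 100 Mbps link carries only about 32 TB, well below the 75 TB per month discussed earlier.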

@gcp
Member

gcp commented Aug 15, 2018

The server is uncapped again.

I'll monitor how efficiently the CDN works and make a decision what to do with hosting when there's a bit more data on what we require now.

@l1t1

l1t1 commented Aug 15, 2018

It looks like the server is healthy now.
Current match: received 85 clients in 100+ games
[image: 2018-08-15_203840]

Last match: received 85 clients in 300+ games
[image: 2018-08-15_204005]

@liujn2018

@gcp, we have manually mirrored the latest network files on Google Drive, and almost all the Chinese Google Colab users (hundreds of Google accounts) are using the mirrors.
Let's see what happens.

@dayun110

dayun110 commented Aug 15, 2018

@gcp, I noticed that you are checking the server traffic. I'd like to point out two things that might distort your count of the server traffic over the last two days. As @liujn2018 said, to deal with the very slow download speed:

(i) Google Colab users are downloading the latest network files from Google Drive. Five people are responsible for downloading the ELF network, the latest passed network, and the current match network, and uploading them to Google Drive. Around 700-1100 clients are downloading network files from Google Drive instead of the LZ website.

(ii) For personal computer users, we upload the network files to a local QQ group server, from which people can download them to their local machines. Around 100 clients that I know of are running autogtp this way.

These temporary measures work fine: downloading 3 network files takes less than 5 seconds within Google's servers, and it saves quite a lot of LZ server traffic. But it takes too much manpower to monitor the LZ website for network updates, download from the LZ website, upload to Google Drive, and update the running code. It would be great if this could be done automatically. Moreover, we are also worried that Google might block accounts with an extreme increase in traffic. Thanks.

@foreverpokemongo

foreverpokemongo commented Aug 15, 2018

> 40 block nets are a bit less than 200M a piece. We do about 5 test matches per day. Maximum amount of clients per day is typically <1500 (but the majority of them won't see all tests, so say less than half?). So realistically with 40 blocks I'd expect to see 750G to 1.5 TB per day max. Not 2.5 TB.

This suggests that much of the traffic is not created by autogtp, but (perhaps) by direct downloads from links on the homepage. This is not surprising because people are now aware that leela-zero can beat pro players. I believe not only professionals but also more and more amateur Go players are downloading the most recent best-network frequently.

Assuming my speculation is correct (@gcp can easily verify this by comparing the IP addresses of those uploading games and those downloading weightings? ), my suggestion is to replace all the links explicitly provided in the homepage with torrent files, so as to offload traffic created by those who are not contributing computational power to P2P network. Maybe such people are able to enjoy faster downloading speed from P2P network.

@l1t1

l1t1 commented Aug 16, 2018

consider only keep weights not finally failed yet and current best and elf as a benchmark?

@MartinDevelopment

MartinDevelopment commented Aug 16, 2018

> can easily verify this by comparing the IP addresses of those uploading games and those downloading weightings?

Not possible since we don't store any personal information.

@foreverpokemongo

> Not possible since we don't store any personal information.

Didn't know that. :)

I think the IP addresses of game-uploaders are logged; otherwise we cannot block malicious uploaders.

Anyway, the purpose is just differentiating self-play contributors from pure-downloaders. Maybe we can compare SHA3 hash of IP addresses. There must be one way or another to do so, without (seriously) compromising privacy.
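
To make the hashing idea concrete, here is a rough sketch of how two address lists could be compared through salted SHA3 digests rather than raw IPs. It does not reflect how the Leela Zero server stores or processes logs; the function names and sample addresses are invented for illustration. The random salt matters because the IPv4 space is small enough that unsalted hashes could be reversed by brute force.

```python
import hashlib
import secrets


def pseudonymize(ip: str, salt: bytes) -> str:
    """Salted SHA3-256 digest of an IP address, as a hex string."""
    return hashlib.sha3_256(salt + ip.encode("ascii")).hexdigest()


def split_downloaders(download_ips, upload_ips, salt=None):
    """Count downloaders that also upload games versus download-only clients."""
    # A fresh random salt per analysis; discarding it afterwards makes the
    # digests useless for identifying anyone later.
    salt = salt or secrets.token_bytes(32)
    uploaders = {pseudonymize(ip, salt) for ip in upload_ips}
    downloaders = {pseudonymize(ip, salt) for ip in download_ips}
    contributors = downloaders & uploaders
    download_only = downloaders - uploaders
    return len(contributors), len(download_only)


if __name__ == "__main__":
    # Documentation-range example addresses, not real clients.
    downloads = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
    uploads = ["192.0.2.1", "198.51.100.7"]
    print(split_downloaders(downloads, uploads))
```

Even then the comparison is only approximate: clients behind a shared NAT or with changing addresses would be miscounted.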

@gcp
Member

gcp commented Aug 16, 2018

> consider only keep weights not finally failed yet and current best and elf as a benchmark?

We already do this. Most of the old weights are on the storage server.

@gcp
Member

gcp commented Aug 16, 2018

> Not possible since we don't store any personal information.

We do log (and are allowed to log) IPs for a short period of time to handle abuse and keep the server functioning. I've asked roy7 if he has some time to dig into what happened before they get deleted.

@gcp
Member

gcp commented Aug 16, 2018

> my suggestion is to replace all the links explicitly provided in the homepage with torrent files, so as to offload traffic created by those who are not contributing computational power to P2P network

If this project makes it (even) hard(er) for Go players to use Leela Zero, it would have failed at its purpose.

@gcp
Member

gcp commented Aug 16, 2018

I currently see a hit rate on the CDN of 85% and it's still rising. So we should be fine hosting-wise with a regular solution, or maybe even our current one, as traffic dropped by a factor of more than 6.

This also means that downloads for most of you who aren't in Europe (where the server is!) should be significantly faster, i.e. West Coast USA and China/Japan/Korea. I'm curious how the speed is for people inside China. From what I understand, there is no CDN location inside China (because this requires specific agreements) but the location you're downloading from should still be much closer to there.

I'm not sure where the servers that Colab uses sit, but for sure it's much more likely there's going to be a CDN location close to them. You also probably won't cause any traffic to the main server.
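
The link between the 85% hit rate and the drop in origin traffic can be sketched as follows. This is a simplified model that assumes every request is cacheable and all downloads are roughly the same size, so only cache misses reach the main server; the function name is illustrative.

```python
def origin_reduction_factor(hit_rate: float) -> float:
    """Factor by which origin traffic shrinks when hit_rate of requests hit the CDN."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    # Only the misses (1 - hit_rate) still reach the origin server.
    return 1.0 / (1.0 - hit_rate)


if __name__ == "__main__":
    for rate in (0.50, 0.85, 0.95):
        print(f"hit rate {rate:.0%}: origin traffic / {origin_reduction_factor(rate):.1f}")
```

At an 85% hit rate the model gives a factor of about 6.7, consistent with the reduction reported above, and each further point of hit rate has a growing effect.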

@dayun110

dayun110 commented Aug 16, 2018

@gcp, I am using a 100Mbps connection, which is below the average level in China. The network download speed is about 0.5M/s to 1M/s.

@l1t1

l1t1 commented Aug 16, 2018

On my PC:

C:\>e:\wget http://zero.sjeng.org/networks/c4d760dadf7979b740b52ff932ba00207adf25d84ae03bb93b00fe642e7d9407.gz
--2018-08-17 07:37:15--  http://zero.sjeng.org/networks/c4d760dadf7979b740b52ff932ba00207adf25d84ae03bb93b00fe642e7d9407.gz
Resolving zero.sjeng.org... 104.27.188.119, 104.27.189.119
Connecting to zero.sjeng.org|104.27.188.119|:80... failed: Connection timed out.
Connecting to zero.sjeng.org|104.27.189.119|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 93589184 (89M) [application/octet-stream]
Saving to: `c4d760dadf7979b740b52ff932ba00207adf25d84ae03bb93b00fe642e7d9407.gz'

 0% [                                       ] 127,150     6.34K/s  eta 4h 11m

If I type http://zero.sjeng.org/networks/ in a browser, it is redirected to https://leela.online-go.com/networks/; the second address is faster.

C:\>e:\wget https://leela.online-go.com/networks/best-network.gz --no-check-certificate
--2018-08-17 07:43:13--  https://leela.online-go.com/networks/best-network.gz
Resolving leela.online-go.com... 104.25.35.20, 104.25.34.20
Connecting to leela.online-go.com|104.25.35.20|:443... connected.
WARNING: cannot verify leela.online-go.com's certificate, issued by `/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO Domain Validation Legacy Server CA 2':
  Unable to locally verify the issuer's authority.
WARNING: certificate common name `ssl380677.cloudflaressl.com' doesn't match requested host name `leela.online-go.com'.
HTTP request sent, awaiting response... 200 OK
Length: 93596453 (89M) [application/octet-stream]
Saving to: `best-network.gz'

 3% [>                                      ] 2,932,474    302K/s  eta 5m 2s

The page https://leela.online-go.com/networks/ is stale, though; its best-network is very old.
[image: 2018-08-17_074841]

C:\>e:wget http://zero.sjeng.org/networks/9c56ae62f1d6c9a1dff58491d19eaed6d4db79dbe9badece60b7241f412ac51e.gz
--2018-08-17 07:50:09--  http://zero.sjeng.org/networks/9c56ae62f1d6c9a1dff58491d19eaed6d4db79dbe9badece60b7241f412ac51e.gz
Resolving zero.sjeng.org... 104.27.189.119, 104.27.188.119
Connecting to zero.sjeng.org|104.27.189.119|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 93594386 (89M) [application/octet-stream]
Saving to: `9c56ae62f1d6c9a1dff58491d19eaed6d4db79dbe9badece60b7241f412ac51e.gz'

 0% [                                       ] 6,120       --.-K/s  eta 34h 6m  ^

@MartinDevelopment

@l1t1 I am not having any problems, I am getting ~24MB/s which is around my limit.

@Ikfes

Ikfes commented Aug 27, 2018

@gcp You should try OVH servers. They have good peering around the world and come with an unmetered 500 Mbps port with 1 Gbps burst speed. This means 161 - 322 TB/month. I have an SP-64 in GRA and I'm currently serving ~85-120 TB/month with it without any issues.

@Ikfes

Ikfes commented Aug 27, 2018

On top of that, you can buy a bandwidth upgrade (https://www.ovh.com/world/dedicated-servers/bandwidth-upgrade.xml) quite cheaply if you need more traffic.

@ihavnoid
Member

Downloading 3c8f6488 is incredibly slow as of now. Are we back to hitting the throttling?

@MartinDevelopment

@ihavnoid Probably just you, it is nearly an instant download for me.

@gcp
Member

gcp commented Sep 3, 2018

> You should try OVH servers.

I looked at OVH but our current provider actually has better offerings if you stay below ~75TB / month, which we definitely do now.

@gcp
Member

gcp commented Sep 3, 2018

> Are we back to hitting the throttling?

It's no longer possible to get throttled. In any case traffic to the real server is minimal.

I think this thread can be summarized by this picture:
https://sjeng.org/dl/traffic.png

@gcp
Member

gcp commented Sep 3, 2018

> and if i type http://zero.sjeng.org/networks/ in browser,it is redirect to https://leela.online-go.com/networks/ , the 2nd address is faster...and the pages https://leela.online-go.com/networks/ is stale, its best-networks is very old

The networks are now split between the main server, which has the latest networks, is normally fastest, and is CDN cached, and the storage server, which has slow but big disks. If you try to fetch a network from the main server but it's no longer there, you'll get forwarded to the storage server.

I should probably try to make that mirroring automatic, so the "index" of networks/ works as expected.

@gcp
Member

gcp commented Sep 3, 2018

> 6.34K/s eta 4h 11m

> The network download speed is about 0.5M/s to 1M/s.

So still pretty sad from inside China.

@gcp
Member

gcp commented Sep 3, 2018

> I should probably try to make that mirroring automatic, so the "index" of networks/ works as expected.

This is now mirrored automatically every day.

@gcp gcp closed this as completed Oct 23, 2018