build.koreader.rocks / ota.koreader.rocks down (Azure bandwidth issues) #10615
I've had some time recently to work on it.
How do you mean precisely? Like the list of apt-installed packages and the Docker images? Wrt the OS we shouldn't need much of anything other than Bash and Docker; all the important things are done in Docker. It might be quicker to talk on Gitter or something, btw, if you're on there?
Oh, that's great. Previously I thought we copied the binaries.
There are a couple of things on there, like […]. Pinging @houqp for the Cloudflare stuff.
I know progress sync and OTA are on the machine, and progress sync has persistent data. Anything else?
The configs and scripts, the signing key; I meant the ops thing quite literally (perhaps including the actual archives for OTA, though those would regenerate in time). Just everything that's in there. What's in the cronfile is potentially also important, though it's probably just to run the cleanup script once or twice a day.
OK, let me first upgrade the machine type; it's currently running an A1_Basic, which will be deprecated in August 2024.

VM | Size | Type | vCPUs | RAM (GiB) | Data disks | Max IOPS | Temp storage (GiB) | Premium disk | Cost/month

Likely the B1ms VM type is the best approach. The only thing that matters is the temporary storage, 40 GB vs 4 GB; I'm not sure if that's sufficient. Or, if 1 GB of RAM is enough, we can try B1s and save more of the budget for bandwidth; Azure charges bandwidth separately, which is what ran my subscription out of budget.

It seems I've lost access to the VM. Previously it used a username & password, but now it accepts only an SSH key pair. Azure is very bad at handling keys; replacing the key would break access for you guys, I believe.

```
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDfnbW0jcCTFJknPks6Lir9ZZfiX8By62414r0bvN4ciIQWleU147Ma4ZBrR5E7GV8IyX4zbLmldI00uKbFb9q2IpHN7ebNmKfIOnDnTFOuLMPjsAUHCl13yIr0yLlEWILu1Tni7w3oeNXGy7WK2oDJ5DgwASvoCrQ5GKg3SNnmBJBk/1EMWVQHZ+SralMx5Udz80ij1YutWV0S8kJH3YgHXE1G4SVmTq9oC7riMI5l1QWJgaKynrY2D171VRhIbafqLkR7SmcN1Vw23fnAEbIga94SBcXyl9tVG2r5rYUSGyHkvO+rjc+XHf701AudeG/+LQB2Uf5t90y8e1oV5IREeVz2BIYomQBNQjTyS+EoOB+ai/ponXwaeoVUxTdYRWQTzVtZ0ewPoLTheCK8qEDwm04al4xjWgfCAXzd+5ZKpS2NcgLi+RDL3bN5uExMPevOtkVhhr1BrdJFBz7LeB1X7cx34I1GxhuLfGMoh0oVYhmv8dC2RIoNCngRzFVTKq0= hzj-jie@hzj-jie-x1
```
By the way, changing the machine type ("size" in Azure's terminology) would preserve the data but restart the machine. Should we set up the services in systemd first?
I'm thinking of going by Docker container name, something like this?
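(For anyone reading along, a minimal sketch of what such a unit could look like; the container name `koreader-build` is a made-up placeholder, and the actual files ended up in koreader/koreader-misc#45.)

```ini
# /etc/systemd/system/docker-koreader-build.service (illustrative sketch)
# Wraps an already-created container so systemd can start/stop it by name.
[Unit]
Description=KOReader build container
Requires=docker.service
After=docker.service

[Service]
# `docker start -a` attaches to the container so systemd tracks its lifetime.
ExecStart=/usr/bin/docker start -a koreader-build
ExecStop=/usr/bin/docker stop koreader-build
Restart=on-failure

[Install]
WantedBy=multi-user.target
```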
What does temporary storage mean exactly? Over in /mnt? That's not really used atm; actually, I'm not sure I even knew it existed. I figure we can easily make do with 4 GB for some temp extraction/file manipulation, provided the actually important permanent storage is at least some 30 GB, as it is now.
I don't know, but I can take a look once I can log in to the machine.
I've just added your key to your own authorized_keys (hoping you remember the spelling of your username :)
I cleaned up some of my bad usernames. But if I unfortunately broke your authentication, let me know.
So the machine has two disks. The root is sda1, persistent AFAICT, and it's ~30 GB; sdb is mounted at /mnt, temporary, and it's ~40 GB. So I think that's what "Temp storage" means, but we never use it.
sda1, in contrast, is 98% full:
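(The actual output was attached in the comment; these are just the stock commands to confirm that layout.)

```sh
lsblk -o NAME,SIZE,MOUNTPOINT   # should show sda1 (~30G, /) and sdb1 (~40G, /mnt)
df -h / /mnt                    # usage per filesystem; / was at 98% here
```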
Pretty much everything is in /ops, or more specifically /ops/prod/build/download/stable. Are we serving only the latest build from there via OTA? If so, we can delete the old stable versions. For the server upgrade, it's lucky that do-release-upgrade exists on the VM, so we can use it.
Not as a matter of course; people depend on those, and we purposefully keep a limited number of old ones around. But in order to free up 10+ GB for a release upgrade, go right ahead. :-)
Same story for the older nightlies.
PS: I'm sure you're aware, but don't forget to run the upgrade in tmux in case of connectivity issues. :-)
I will try :)
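(A minimal sketch of that workflow, nothing specific to this server, assuming the usual Ubuntu tooling:)

```sh
tmux new -s upgrade        # the session survives SSH disconnects
sudo do-release-upgrade    # interactive Ubuntu release upgrade
# if the connection drops, log back in and reattach:
tmux attach -t upgrade
```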
Added and enabled services: […]
Anyone want to confirm whether I've done it right? If everything goes well, we may want to add these files to the repo.
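(As a sketch, reusing the hypothetical unit name from above, the usual sequence to register and verify such units would be:)

```sh
sudo systemctl daemon-reload                              # pick up new unit files
sudo systemctl enable --now docker-koreader-build.service # start now and at boot
systemctl status docker-koreader-build.service            # confirm it's active
systemctl list-unit-files | grep koreader                 # confirm enabled at boot
```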
Luckily you don't seem to have interfered with my release proceedings. 😅 I'll check in a few minutes.
Oh, I will definitely let you guys know before proceeding with updates or restarts.
I adjusted the nightswatcher startup script, which had […]
I created koreader/koreader-misc#45 to add the definition files to koreader-misc. Meanwhile, when would be a good time for me to try a restart and ensure the services come back up afterwards?
It doesn't really matter if the nightlies are missing for a day, but the ideal time would be sometime after 7 UTC, so the artifacts for the day are already there. Unless you mean wrt the current Android toolchain update (e.g., #10679), in which case it may be better to wait until that's sorted out.
Oh, maybe put another way: would you please share the commands you use to start the services? I do see five Docker containers now. My concern is that if the services don't come up as expected (I don't think that will happen, though), I can manually start the containers to avoid breaking things.
By the way, memory-wise, 1 GB would be a little restrictive, if I read /proc/meminfo correctly: […]
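(The actual numbers were in the comment; the standard ways to check are:)

```sh
free -h                                        # totals plus available memory
grep -E 'MemTotal|MemAvailable' /proc/meminfo  # raw figures in kB
```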
They're scripts in ops (under build, nginx and sync).
I resized the machine to 1 vCPU + 1 GB RAM, and it seems to be working. It likely buys us 3-5 more days per month. Also, the public address is publicly accessible; I'm not sure if anyone crawls the site directly through the public IP.
I do see requests like […] and […]. Both should be disallowed.
I can see why you'd say that about some bot (though it should either get it from the CF cache, or be the first and hence put it in the cache), but the first is simply Opera, so I'm not sure what you're getting at? For reference, mine looks like this: […]
Oh, I got them from the nginx log: […] rather than from regular browsers.
A request like that will happen all the time, for every caching edge node. There's no such thing as regular browsers just accessing us directly; CF is always in between.
Oh, do you mean requests like […] are indeed passing through CF? I made a mistake here: the hosts are not resolved to the public IP address of the VM. But as long as it returned 200, the request was served through the Azure network and got charged. For example, the following APK was accessed 8 times, with some requests hours apart, and CF should have cached it: […]
Or the tar.gz files in dict/: […]
I haven't seen cache-related headers in the nginx configuration, […] so are you sure CF would cache the files in that case?
Yes. It's impossible for it to be any other way.
"Hours apart" combined with different edge nodes means there's no particular reason it would be cached. They might drop it sooner than whatever you set or might expect if it hasn't been accessed in a while. Something like CF helps the most when there are a ton of requests coming in a short span to the same edge node. But that's why we have a 50-60% cache hit rate rather than over 90% (on average; this morning at 7 it was actually 96.54%).
Yes, you can easily verify for yourself with curl -I or -v. Most importantly, you can see CF is aware of […]
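(As an illustration of that check, against an arbitrary URL on the site: Cloudflare adds a cf-cache-status header whose value, HIT, MISS, EXPIRED, or DYNAMIC, tells you whether that edge served it from cache.)

```sh
# HEAD request; the CF-specific headers show cache behavior per edge node
curl -sI https://ota.koreader.rocks/ | grep -iE 'cf-cache-status|cf-ray|age'
```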
Indeed, it might help a few percentage points to explicitly add some caching headers for at least a week (except for some files that should only be cached a few hours, or maybe a day tops). But ultimately the bandwidth looks like this: […] Those are quite comfortable numbers. If I find the time later I'll set up my VPS that I'm not really using, because Azure is just… weirdly unattractive, but first it's probably better to identify where the data leak is coming from. In any case it can't be from build.koreader.rocks or ota.koreader.rocks, since as stated those pass completely through CF and are tracked by CF. :-/
OK, my previous company used some edge networking service a while ago which transferred data across their edge servers, so static requests would never reach the origin again after some 5 minutes, as I recall. But anyway, glad to know CF does not need any headers to function. (I remember we discussed this before :)
The long and short of it is that it's all proxied, and I can't imagine them somehow missing tens of gigabytes of data traffic. 47 (total) - 26 (cached) = 21 GB of traffic from build. and ota. over the past week. Which, it seems to me, really only leaves the thing I blanked out here, or something internal to Azure. Although technically this is possible: […]
With some file: […]
Of course that's not quite as it should be, so we'll have to double-check the traffic numbers from the nginx logs just in case. https://developers.cloudflare.com/fundamentals/setup/allow-cloudflare-ip-addresses/
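(Following that link, a minimal nginx sketch of letting only Cloudflare's published ranges reach the origin; the two ranges shown are real but the list is abbreviated here, and the same effect could also be achieved with an Azure network security group.)

```nginx
# Goes inside the server block; full, current lists: https://www.cloudflare.com/ips/
allow 173.245.48.0/20;     # Cloudflare IPv4 range (first of ~15)
allow 103.21.244.0/22;     # another Cloudflare IPv4 range
# ... remaining IPv4 and IPv6 ranges from the published list ...
deny all;                  # everything else, including direct hits on the VM IP
```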
I've added these caching headers: […]
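(The actual directives were shown in the comment; purely as a hypothetical illustration, per-type expiry in nginx usually looks something like this, following the week/day split described earlier.)

```nginx
# Sketch for the server block: long cache for immutable release artifacts,
# short for pages/indexes that change daily.
location ~* \.(zip|apk|tar\.gz)$ {
    expires 7d;
    add_header Cache-Control "public";
}
location / {
    expires 1h;
}
```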
Of course, as stated, this will only potentially affect the 22 GB of uncached traffic (mainly depending on whether CF merely checks Last-Modified or simply redownloads the entire thing, because why not), not whatever traffic is actually being problematic.
That's really a lot. Can CF handle 206 correctly? By the way, the name asdfsdfa is cool 👍
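(One way to probe that through the proxy, with a placeholder for an actual artifact path: a 206 response with a Content-Range header means ranged requests are honored end to end.)

```sh
# ask for the first KiB only; expect a 206 status plus a Content-Range header
curl -sI -H 'Range: bytes=0-1023' "https://ota.koreader.rocks/$SOME_FILE" | head -n 6
```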
I think I see what you're saying. […]
2, 3 ⇒ CF has downloaded some 50–70 GB more from the server than it actually served to end users, explaining the missing data, and using more data than it ever saved in the process. This is probably fairly testable, though I don't have the time right this moment.
To whom it may concern, the OTA server is down again. The latest APK on a mirror site is from 01/19/24.
It was brought back up yesterday.
My sync randomly stopped working a few days ago across all devices. Is this a global thing?
ota.koreader.rocks has been down since at least yesterday ~08:00 UTC. I can contribute financially or with my time; I have Azure and Linux server experience if you guys need help.
Thanks for the offer :) The server will be back online on the 26th. AFAIK it gets killed when the network bandwidth reaches some limit. I think @Frenzie and @Hzj-jie are the people involved with the server management. Not sure what can be done without increasing quotas (I'm happily unaware :p)
I find the VPS very bad compared to all others I have access to. It's slow and SSH connections are unreliable; from e.g. AWS EC2 I know that's definitely not an issue of being in North America. These issues might be a worthy trade-off for a lot of bandwidth, yet the bandwidth is apparently also more than an order of magnitude less than anything else. I am nonetheless grateful for its existence. But as you know, these things largely work on annoyance thresholds, and it never quite managed to make enough impact, although it's come close. I definitely wouldn't want to lower the annoyance threshold by rewarding Azure for their terrible product.
btw, doesn't the GitHub pipeline match the build requirements? Just curious.
No idea, but let's focus on solving the bandwidth issue here. Feel free to open another ticket if you want to build artifacts from GitHub. I think GitLab is better. Not all the eggs in the same basket :)
Currently it's a 1st-gen VM on Azure; the VM type has been deprecated and will be removed around 2024.
Meanwhile it's running Ubuntu 16, which should be upgraded.