
Questions regarding automatic size checker #1385

Open
JLO64 opened this issue Jan 6, 2024 · 5 comments

Comments

@JLO64
Contributor

JLO64 commented Jan 6, 2024

So, as mentioned at the end of #1366, the script that automatically checks website size no longer works due to its reliance on GTMetrix. I've been working on updating it to use Cloudflare APIs instead, and it's working pretty well now (I replaced all the GTMetrix calls with Cloudflare API calls and improved the error handling). Before I PR my changes, though, I'd like to get clarification on a few things.

Broken URLs

A decent number of URLs on the list are currently broken and can't be accessed by either Cloudflare or a web browser. With the error handling I currently have in place, their sizes are set to "1000" so that they don't show up on the website. However, this adds a bunch of empty entries to the site list, which will cause unnecessary processing whenever the script is run in the future.

I could have these entries deleted automatically, but I don't like the idea of removing other people's entries like that. Maybe it would be better for them to be manually removed instead?
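To make the "keep but hide" approach concrete, here's a small sketch of what I mean. The helper name and the entry keys are placeholders, not the actual sites.yml schema:

```python
# Sketch: flag unreachable sites instead of deleting them outright.
# "mark_unreachable" and the "unreachable" key are placeholder names.

def mark_unreachable(entry: dict) -> dict:
    """Push the size past the 512 KB cutoff so the site is hidden from
    the list, but keep the entry so a human can review or remove it."""
    entry = dict(entry)
    entry["size"] = 1000          # over the 512 KB threshold, so filtered out
    entry["unreachable"] = True   # explicit flag for manual review
    return entry

site = {"domain": "example.com", "size": 120}
print(mark_unreachable(site))
# {'domain': 'example.com', 'size': 1000, 'unreachable': True}
```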

Cloudflare Authorization

Just as a heads up, the Cloudflare API isn't open; you need to register for it. Registering for an account is easy and straightforward. However, to access the accountId this script needs, you have to tie a domain to that account. I did so with mine, and while it prompts you to change your domain's DNS settings, you can just skip past that.

Thankfully, I haven't yet run into any rate limits for the API. At one point I ran just over 100 consecutive URLs through it and had no issues.

.env file?

Speaking of authorization, I currently have the API token and account ID stored the same way the GTMetrix credentials were (in a separate file called myauth.py). I think I understand why this method was used, but I'd argue that storing them in a .env file is better since it makes the script easier to run via GitHub Actions or something similar.

That said, I'm willing to stick to the current method for my PR.
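For illustration, reading the credentials from the environment would look something like this. The variable names CF_API_TOKEN and CF_ACCOUNT_ID are just placeholders I made up, not settled names:

```python
# Sketch: read the Cloudflare credentials from the environment instead of
# importing them from a myauth.py module. Works the same whether the
# variables come from a .env file, the shell, or GitHub Actions secrets.
import os

def load_credentials() -> tuple:
    """Return (token, account_id) from the environment, or fail loudly."""
    token = os.environ.get("CF_API_TOKEN")
    account_id = os.environ.get("CF_ACCOUNT_ID")
    if not token or not account_id:
        raise RuntimeError("CF_API_TOKEN and CF_ACCOUNT_ID must be set")
    return token, account_id
```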

Markdown file output

I'm currently planning on submitting my PR with two files changed: site_size_rechecker.py and sites.yml. A good chunk of the entries are outdated, and given how long it takes to run the whole list (a minimum of 20 seconds per entry), I'm gonna leave the script running on my server overnight before I PR.

Is it alright if I modify the script to output the markdown table to a file? Since I'm gonna be redoing all the entries, it's gonna be a truly massive list that I doubt will fit in a PR description, so I'd rather link to a file.
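Roughly what I have in mind, as a sketch (the column names, output filename, and row shape are placeholders):

```python
# Sketch: write the recheck results as a markdown table to a file instead
# of printing them into the PR description.

def to_markdown_table(rows: list) -> str:
    """Render a list of {domain, size} dicts as a markdown table."""
    lines = ["| Domain | Size (KB) |", "| --- | --- |"]
    lines += [f"| {r['domain']} | {r['size']} |" for r in rows]
    return "\n".join(lines)

results = [{"domain": "example.com", "size": 312}]
with open("recheck_results.md", "w") as f:
    f.write(to_markdown_table(results) + "\n")
```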

Documentation

docs_site_size_rechecker.md is a really handy file that documents how to use site_size_rechecker.py. That said, it requires a significant rewrite due to the changes I've made. Would you mind if I PR my changes first and then update this file?

@kevquirk
Owner

kevquirk commented Jan 6, 2024

@JLO64 firstly, thanks for all the great work on this - yay for open source!

So, a few comments from me, and I'll also let @garritfra opine.

Broken sites

We definitely need to do something with these. We currently have the site checker, which spits out a list of any status that isn't 200, and then we manually go through the list. We haven't run it for a while, which is probably why you're seeing lots of broken sites.

Having something automated to clean up would be better for obvious reasons, but I think we need some kind of safety net so that valid sites aren't accidentally deleted. Maybe a CRON job that runs regularly, and if a site fails for, say, 7 days, it gets deleted. I'm just spit-balling ideas here, as this is all way out of my skill set.

If that can't be done easily, we will continue with the manual script and cleanup.
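The "delete only after a grace period" safety net could look something like this sketch. The last_passed field and the 7-day window are assumptions, not decided details:

```python
# Sketch: only delete an entry once it has been failing for longer than a
# grace period, so a single bad check can't remove a valid site.
from datetime import date, timedelta

GRACE = timedelta(days=7)  # assumed grace period, open to discussion

def should_delete(entry: dict, today: date) -> bool:
    """True if the site last passed a check more than GRACE days ago."""
    last_passed = date.fromisoformat(entry["last_passed"])
    return today - last_passed > GRACE

print(should_delete({"last_passed": "2024-01-01"}, date(2024, 1, 10)))  # True
```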

Cloudflare auth and .env file

I can set up an account so it's tied to this project. I wonder if, instead of a .env file, we could use GitHub secrets? That way we can work it all into GH Actions, as you mentioned.

Markdown file output

I think that's fine - whatever makes it easiest to parse is good with me.

Thanks again,

Kev

@garritfra
Collaborator

No objections!

An idea regarding the automatic size check: Renovate, the dependabot competitor, opens an issue in each repository it's active in and lists all dependencies that are out of date. Maybe this approach works here as well?

We could have an issue that regularly gets updated by an action, listing all domains that could be removed, alongside how long each has been down or above 512 kB. To avoid having to maintain a backend, the body of the issue should be in machine-readable form.

If we want, we can even add checkmarks next to the domains to schedule removal, or remove any domains that were unhealthy for n days (or n iterations of the check).

These are just my unfiltered thoughts, I'm totally open for remarks or alternative solutions.
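One possible shape for the machine-readable issue body, purely as an illustration (JSON and the field names are arbitrary choices here, not a proposed format):

```python
# Sketch: render a human-readable checklist plus an embedded
# machine-readable block, and parse the data back out on the next run.
import json
import re

def render_issue_body(unhealthy: list) -> str:
    """Checklist for humans, fenced JSON block for the action."""
    checklist = "\n".join(
        f"- [ ] {d['domain']} ({d['days_down']} days unhealthy)" for d in unhealthy
    )
    return f"{checklist}\n\n```json\n{json.dumps(unhealthy)}\n```"

def parse_issue_body(body: str) -> list:
    """Recover the structured data from a previously rendered body."""
    match = re.search(r"```json\n(.*?)\n```", body, re.S)
    return json.loads(match.group(1)) if match else []
```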

@JLO64
Contributor Author

JLO64 commented Jan 8, 2024

Sorry for not responding sooner! I've had a hectic weekend and haven't been able to touch VS Code or GitHub, but I'll definitely be able to submit a PR based on the above comments soon.

I wish I could contribute more on the GitHub Actions side of things, but I've never used it before now. I'm gonna try giving myself a crash course on how that all works this week, but for now I'll abstain from commenting on that stuff. That said, switching to GitHub Secrets seems like a good idea!

Regarding broken links and websites larger than 512KB, maybe new variables should be added to the yaml entries: last_checked and last_passed? (EDIT: I went to sleep late last night, so I didn't realize that we already have a variable tracking the date last checked, lol.) These could additionally be used to filter links on the website by adding a Liquid if statement to index.md that checks that these two dates are the same. This could really help the QoL of someone just browsing the list.
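The actual check would live in Liquid in index.md, but the filtering logic I mean is just this (sketched in Python; the field names are the proposed ones above, not anything that exists yet):

```python
# Sketch: only show sites whose most recent check also passed, i.e. the
# last_checked and last_passed dates match.

def visible(entry: dict) -> bool:
    return entry.get("last_checked") == entry.get("last_passed")

sites = [
    {"domain": "a.com", "last_checked": "2024-01-08", "last_passed": "2024-01-08"},
    {"domain": "b.com", "last_checked": "2024-01-08", "last_passed": "2023-11-02"},
]
print([s["domain"] for s in sites if visible(s)])  # ['a.com']
```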

@JLO64
Contributor Author

JLO64 commented Jan 11, 2024

Sorry for the lack of updates, I got really sick on Monday and am still recovering.

Quick update on the script. I tried running it overnight, but it errored out after ~250 entries. I'm not quite certain why that happened, since it ran properly when I restarted it from the entry it failed at. Additionally, I think there's some form of rate limiting going on.

[screenshot: results table showing clusters of failed scans]

This is a pattern throughout the table: roughly every 60 entries, the script fails to scan about 10 entries in a row. At most, the script queries the API 2 times per 20 seconds, which is well within the limits of both the overall API and that particular endpoint.

Additionally, a bunch of sites are failing seemingly at random despite being accessible via a browser and Cloudflare. The frequency is low, roughly 1 in 40 entries.

Thankfully, I can have the script retest these entries by sorting sites.yml by last_passed instead of last_checked. I'll do that once I'm done with the entire list of sites.
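The retest ordering is just a sort, something like this (field name as proposed above; entries that never passed sort first):

```python
# Sketch: order entries so the ones that failed longest ago are retested
# first. Entries with no last_passed value sort to the front.

def retest_order(entries: list) -> list:
    return sorted(entries, key=lambda e: e.get("last_passed", ""))

queue = [
    {"domain": "a.com", "last_passed": "2024-01-08"},
    {"domain": "b.com", "last_passed": "2023-12-01"},
]
print([s["domain"] for s in retest_order(queue)])  # ['b.com', 'a.com']
```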

Long term I don't think this will be an issue if the script is run periodically via GitHub Actions for just a couple of sites at a time, but this is going to be a problem if we ever have to rerun the entire list again. For now I'll PR the script as it is. I haven't changed the API token stuff, documentation, or comments. I'll do that in later PRs.

@JLO64 JLO64 mentioned this issue Jan 11, 2024
@kevquirk
Owner

This looks fantastic!
