Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: scrape new expanded_assets page #83

Merged
merged 1 commit into from Sep 13, 2022

Conversation

gebner
Copy link
Member

@gebner gebner commented Sep 12, 2022

No description provided.

@gebner
Copy link
Member Author

gebner commented Sep 12, 2022

I'm really not sure that this is the best approach, but for now it fixes the issue: https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/binary.20package.20was.20not.20provided.20for.20'linux'

@lovettchris
Copy link
Contributor

lovettchris commented Sep 12, 2022

A quick fix is needed and what you did here works for now, and I tested these changes on my windows machine and they work fine, but long term we should chat about switching to the gh CLI instead which is guaranteed to work regardless of web page content. Like this:

@collares
Copy link

Elan switched from using the GitHub API to scraping the web page in 2018 due to the API rate limit: dab74012. I'm not speaking for the devs here, but not forcing the user to create a GH API token (and still staying under the API request limit) is an important factor to have in mind.

@lovettchris
Copy link
Contributor

That looks like a different thing going through a web interface at https://api.github.com with a elan useragent string, but I'm talking about forking out to "gh.exe" command line interface, which is acting on the user's behalf and should not run into rate limits unless an individual user is going nuts with downloads.

@Kha Kha merged commit 0e2b6aa into leanprover:master Sep 13, 2022
@gebner gebner mentioned this pull request Sep 13, 2022
13 tasks
@Kha
Copy link
Member

Kha commented Sep 18, 2022

@lovettchris Surely gh is using the GH API as well...?

@lovettchris
Copy link
Contributor

lovettchris commented Sep 19, 2022

The code is here: https://github.com/cli/cli but what kind of throttling issues did you run into before? Were you hitting it during CI jobs because every CI job shared the same "api_key: $GH_TOKEN" ? I have never seen any throttling issues with gh in a CI job (like creating a release with softprops/action-gh-release@v1 which uses the gh rest api) have you? When I saw that elan was previously registering itself as the user agent, I wonder if github was globally rate limiting elan, perhaps also if elan was not using the user's Authorization header when running on the user's machine, as per https://docs.github.com/en/rest/rate-limit.

@Kha
Copy link
Member

Kha commented Sep 19, 2022

@collares already explained why we can't use Authorization: we don't want to force the user to create an API token, or to even have a GH account in the first place. And the unauthenticated default is not enough for elan.

For unauthenticated requests, the rate limit allows for up to 60 requests per hour. Unauthenticated requests are associated with the originating IP address, and not the person making requests.

@lovettchris
Copy link
Contributor

lovettchris commented Sep 19, 2022

I see, well I doubt a normal user that has no github account would need to use elan more than 60 times per hour. So then perhaps it's just a matter of ensuring the CI build uses an authenticated header using the $GITHUB_TOKEN than comes with the CI environment. Then for power users that are running boat loads of tests locally that are calling elan all day long, perhaps we could discover those users are authenticated using gh auth status and use their token.

@Kha
Copy link
Member

Kha commented Sep 19, 2022

A normal university for sure could need to use elan more than 60 times per hour from the same IP though

@lovettchris
Copy link
Contributor

lovettchris commented Sep 19, 2022

Ok, you are talking about a "university" it sounds like you are saying the rate limit applies to 'elan' and not the user. Perhaps that is the problem we need to fix here, but I guess the only way to fix it is through user authentication with github which you are explicitly trying to avoid? I'm not sure why we need to avoid that... but I notice this tidbit on the rate-limits, I wonder if we can tap into this:

GitHub Apps that are installed on an organization within an enterprise on GitHub.com are subject to a limit of 15,000 requests per hour per organization that has installed the app.

Could it be as simple as me getting you a token for elan since I'm in a github enterprise organization? It says "The request is from a GitHub App that's owned by a GitHub Enterprise Cloud organization." -- so I wonder how they verify ownership?

@collares
Copy link

@lovettchris This is not a reply to your previous message, but just a random comment: You're coming from Lean 4, which is currently aimed at programmers, so I think you (understandably) have a programmer-oriented perspective. I'd guess that a plurality of Lean users today are not programmers and probably will never have the need for a GitHub account. Even if that's not the case, I think people like that (e.g., a group of first-year math undergrads installing Lean for their "Introduction to Proof" class, or someone who just completed the Natural Number Game and wants to explore further) will still be an important user demographic for the foreseeable future.

@lovettchris
Copy link
Contributor

lovettchris commented Sep 19, 2022

Sure, I definitely have a CS lens on all this, for example, I was thinking about how every CS class these days publishes the class assignments in a github repo... (or gitlab or something similar). I was also thinking about how even web designers who have no CS background are using git to share their designs, same with documentation writers, game designers. I've even worked with mechanical engineers in the Chicago water department who use github for all kinds of things... Perhaps we are bending over backwards just a little too far. That's all. Happiness is known when to say "no", and perhaps flakey HTML scaping hits that limit :-)

@tydeu
Copy link
Member

tydeu commented Sep 28, 2022

@lovettchris I think it is important to consider different perspectives here. You seem to have encountered Git in a lot of contexts outside developing CS projects (which is quite cool 😄). I, despite having been through almost 9 years of CS at a reasonably well ranked university, have not (none of my CS classes even ever used it -- though it was discussion topic among students and there are other classes in the department which do). Also, even as a longtime Git/GitHub user myself, I did not have gh (or any form of GitHub automation) until I needed to start automating releases for Lake cloud builds. Thus, I think it is likely unreasonable to expect virtually all users of Lean (as elan is the primary way of obtaining Lean) to have knowledge of or use Git or GitHub or even more rarely the GitHub CLI.

As a side note, the one time I did encounter a class using version control outside CS was a professor in mathematics who was using Subversion, not Git. 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants