Speed up the code #15

Merged: 8 commits into prakaa:master on Jun 5, 2024

Conversation

@mdavis-xyz (Contributor) commented May 14, 2024

  • Add a cache decorator, to avoid re-downloading all the HTML file-list pages for each CSV file (roughly an 11x speedup when downloading many files). (Avoid re-scraping the file-list pages #13)
  • Use a requests.Session to re-use the same TLS + HTTP handshake/connection across requests (roughly a 2x speedup).
  • Swap HTTP for HTTPS. (Change base url from HTTP to HTTPS #14)
  • Add a sleep with exponential backoff to the download retry logic, instead of retrying immediately. This gives AEMO some breathing room, to let them autoscale their EC2 cluster. Note that requests can throw exceptions before returning an HTTP status (e.g. if your wifi cuts out), and those are not caught by the existing logic. Also, the actual zip file download is not retried. Perhaps it would be better to just call resp.raise_for_status() and configure retries within the requests.Session, like the opennem example I've posted below. (See the sketch after this list for how the cache, session and backoff fit together.)
  • I refactored the header handling. I added the session for speed, but once we're using a requests Session you can define the headers once and they'll be re-used automatically. (Also, the @cache decorator can't handle mutable dicts as arguments.) I'm a bit perplexed by all the header code that was there; I've never needed to manually specify any header when web-scraping nemweb. What's the purpose of the headers? (I think requests adds most of them for you automatically, e.g. Host.) I left the Referer header in, although I don't think it's needed.
  • I updated poetry.lock because the CI and local tests were failing if I didn't.
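
For illustration, here's a minimal sketch of how the cache decorator, the shared session, and the exponential backoff fit together. This is not the actual PR code; the function names, the retry count and the Referer value are placeholders.

import time
from functools import cache

import requests

# One module-level session: the TLS/HTTP connection and the default headers
# are re-used for every request made through it.
session = requests.Session()
session.headers.update({"Referer": "https://www.nemweb.com.au/"})  # placeholder value


def _get_with_backoff(url: str, retries: int = 5) -> requests.Response:
    """GET a URL, sleeping with exponential backoff on 429/5xx responses."""
    for attempt in range(retries):
        resp = session.get(url)
        if resp.status_code != 429 and resp.status_code < 500:
            resp.raise_for_status()  # raise immediately on any other 4xx
            return resp
        time.sleep(2 ** attempt)  # 1, 2, 4, 8, ... seconds between attempts
    resp.raise_for_status()
    return resp


@cache  # repeat calls with the same URL re-use the downloaded result
def get_file_listing_html(url: str) -> str:
    """Download a file-list page at most once per process."""
    return _get_with_backoff(url).text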

One more thing we could do (which I haven't done yet) is to not re-download a zip if we've already downloaded it. I see the library deletes zips when decompressing; I think it should leave them there. Then, before downloading, check whether there's an existing file (and perhaps check that its size matches expectations). If so, skip the download. That's faster for the user and reduces server costs for AEMO. Since the folder is called a "cache", I think re-use is also the more intuitive behaviour for the user to expect. A rough sketch of the check is below.
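
Something along these lines could work (a hypothetical helper, not part of this PR; it assumes the server reports a Content-Length header for the zips):

from pathlib import Path

import requests


def needs_download(url: str, local_path: Path) -> bool:
    """Return True if the file is missing locally or its size differs from the server's."""
    if not local_path.exists():
        return True
    head = requests.head(url)
    head.raise_for_status()
    expected_size = int(head.headers.get("Content-Length", -1))
    # If the size is unknown or doesn't match the local copy, download again.
    return expected_size != local_path.stat().st_size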

Another thing we could do to speed up the table size estimation is to avoid doing a HEAD request per file to get file sizes, and instead grab them from the <div>s in the listing HTML (probably a ~20x speedup).
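
Very roughly, something like this (hypothetical; the exact markup of the listing pages is an assumption, so the selector would need adjusting to the real HTML):

from bs4 import BeautifulSoup


def sizes_from_listing(listing_html: str) -> dict[str, str]:
    """Map each file name on a listing page to the size text shown beside it."""
    soup = BeautifulSoup(listing_html, "html.parser")
    sizes = {}
    for link in soup.find_all("a"):
        # Assumes the size is rendered in a <div> following each file link.
        size_div = link.find_next("div")
        if size_div is not None:
            sizes[link.get_text(strip=True)] = size_div.get_text(strip=True)
    return sizes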

@mdavis-xyz (Contributor, Author) commented May 14, 2024

BTW, here's how opennem handles retries (via requests.Session retry configuration):

https://github.com/opennem/opennem/blob/4a16a99744a4582080b4711a105c6f070b875960/opennem/utils/http.py#L106-L124


# Quoted from opennem/utils/http.py; DEFAULT_RETRIES, USER_AGENT and
# TimeoutHTTPAdapter are defined elsewhere in that module.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_strategy = Retry(
    total=DEFAULT_RETRIES,
    backoff_factor=2,
    status_forcelist=[403, 429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"],
)

# This will retry on 403's as well
retry_strategy_on_permission_denied = Retry(
    total=DEFAULT_RETRIES,
    backoff_factor=2,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"],
)

http = requests.Session()
http.headers.update({"User-Agent": USER_AGENT})

adapter_timeout = TimeoutHTTPAdapter()
http.mount("https://", adapter_timeout)
http.mount("http://", adapter_timeout)

adapter_retry = HTTPAdapter(max_retries=retry_strategy)
http.mount("https://", adapter_retry)
http.mount("http://", adapter_retry)

@prakaa (Owner) commented May 18, 2024 via email

@codecov-commenter commented Jun 5, 2024

Codecov Report

Attention: Patch coverage is 85.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.81%. Comparing base (bbcedc2) to head (70831fe).

Files | Patch % | Lines
mms_monthly_cli/mms_monthly.py | 85.00% | 3 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##           master      #15      +/-   ##
==========================================
- Coverage   81.98%   81.81%   -0.17%     
==========================================
  Files           2        2              
  Lines         161      165       +4     
  Branches       26       28       +2     
==========================================
+ Hits          132      135       +3     
- Misses         26       27       +1     
  Partials        3        3              
Flag | Coverage Δ
unittests | 81.81% <85.00%> (-0.17%) ⬇️

Flags with carried forward coverage won't be shown.

@prakaa (Owner) commented Jun 5, 2024

Hi @mdavis-xyz, can you please sync your fork? I've updated the workflow so CI doesn't fail for codecov.

Looks like local tests are passing.

As for changes:

  1. Thanks for changing to HTTPS.
  2. Good to know re requests.Session vs requests.get. Not super familiar with REST.
  3. The @cache decorator looks really useful and is something I wasn't aware of. Thanks!
  4. Fine to remove the other header stuff. I suspect it was a relic of debugging why requests weren't getting served. Your additions around sleep time look good.
  5. My memory of scraping the HTML file sizes is that they weren't accurate, but I can't remember by how much they were off.
  6. This package is a lightweight adaptation of NEMSEER code. I implemented a file/data check in NEMSEER, so portions of that code could be reused here. I don't think zips should be retained - some data files are very large and some users (myself included) are hard-drive limited. Keeping bid zip archives that are ~20GB plus the ~50GB CSVs is not ideal. I think NEMSEER used this: https://github.com/UNSW-CEEM/NEMSEER/blob/9ac64fa7a3bce6cde65da14f955ec5ac3343c31c/src/nemseer/query.py#L307

I will have to get around to adding some of these improvements to NEMSEER.

Thanks very much for your contribution! I'll add you to the acknowledgements.

Abi

@prakaa merged commit 8e4fb70 into prakaa:master on Jun 5, 2024
5 of 14 checks passed
@prakaa (Owner) commented Jun 5, 2024

Closes #13 #14
