Speed up the code #15

Merged: 8 commits into prakaa:master on Jun 5, 2024

Conversation

@mdavis-xyz (Contributor) commented May 14, 2024

  • Add a cache decorator, to avoid re-downloading all the HTML file-list pages for each CSV file (roughly an 11x speedup when downloading many files). (Avoid re-scraping the file-list pages #13)
  • Use a requests.Session to re-use the same TLS + HTTP handshake/connection across requests (roughly a 2x speedup).
  • Swap HTTP for HTTPS. (Change base url from HTTP to HTTPS #14)
  • Add a sleep with exponential backoff to the download retry logic, instead of retrying immediately. This gives AEMO some breathing room, to let them autoscale their EC2 cluster. Note that requests can throw exceptions before returning an HTTP status (e.g. if your wifi cuts out), and those are not caught by the existing logic. Also, the actual zip file download is not retried. Perhaps it would be better to just call resp.raise_for_status() and configure retries within the requests.Session, like the opennem example I've posted below. (See the sketch after this list for how the cache, session and backoff fit together.)
  • I refactored the header handling. I added the session for speed, but once we're using a requests Session you can define the headers once and they'll be re-used automatically. (Also, the @cache decorator can't handle mutable dicts as arguments.) I'm a bit perplexed by all the header code that was there; I've never needed to manually specify any header when web-scraping nemweb. What's the purpose of the headers? (I think requests adds most of them for you automatically, e.g. Host.) I left the Referer header in, although I don't think it's needed.
  • I updated poetry.lock because the CI and local tests were failing if I didn't.
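
For illustration, here's a minimal sketch of how the cache decorator, the shared session, and the exponential backoff fit together. This is not the actual PR code; the function names, the retry count and the Referer value are placeholders.

import time
from functools import cache

import requests

# One module-level session: the TLS/HTTP connection and the default headers
# are re-used for every request made through it.
session = requests.Session()
session.headers.update({"Referer": "https://www.nemweb.com.au/"})  # placeholder value


def _get_with_backoff(url: str, retries: int = 5) -> requests.Response:
    """GET a URL, sleeping with exponential backoff on 429/5xx responses."""
    for attempt in range(retries):
        resp = session.get(url)
        if resp.status_code != 429 and resp.status_code < 500:
            resp.raise_for_status()  # raise immediately on any other 4xx
            return resp
        time.sleep(2 ** attempt)  # 1, 2, 4, 8, ... seconds between attempts
    resp.raise_for_status()
    return resp


@cache  # repeat calls with the same URL re-use the downloaded result
def get_file_listing_html(url: str) -> str:
    """Download a file-list page at most once per process."""
    return _get_with_backoff(url).text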

One more thing we could do (which I haven't done yet) is to not re-download a zip if we've already downloaded it. I see the library deletes zips when decompressing; I think it should leave them there. Then, before downloading, check whether there's an existing file (and perhaps check that its size matches expectations). If so, skip the download. That's faster for the user and reduces server costs for AEMO. Since the folder is called a "cache", I think re-use is also the more intuitive behaviour for the user to expect. A rough sketch of the check is below.
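
Something along these lines could work (a hypothetical helper, not part of this PR; it assumes the server reports a Content-Length header for the zips):

from pathlib import Path

import requests


def needs_download(url: str, local_path: Path) -> bool:
    """Return True if the file is missing locally or its size differs from the server's."""
    if not local_path.exists():
        return True
    head = requests.head(url)
    head.raise_for_status()
    expected_size = int(head.headers.get("Content-Length", -1))
    # If the size is unknown or doesn't match the local copy, download again.
    return expected_size != local_path.stat().st_size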

Another thing we could do to speed up the table size estimation is to avoid doing a HEAD request per file to get file sizes, and instead grab them from the <div>s in the listing HTML (probably a ~20x speedup).
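
Very roughly, something like this (hypothetical; the exact markup of the listing pages is an assumption, so the selector would need adjusting to the real HTML):

from bs4 import BeautifulSoup


def sizes_from_listing(listing_html: str) -> dict[str, str]:
    """Map each file name on a listing page to the size text shown beside it."""
    soup = BeautifulSoup(listing_html, "html.parser")
    sizes = {}
    for link in soup.find_all("a"):
        # Assumes the size is rendered in a <div> following each file link.
        size_div = link.find_next("div")
        if size_div is not None:
            sizes[link.get_text(strip=True)] = size_div.get_text(strip=True)
    return sizes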

@mdavis-xyz (Contributor, Author) commented May 14, 2024

BTW, here's how opennem handles retries (via requests.Session retry configuration):

https://github.com/opennem/opennem/blob/4a16a99744a4582080b4711a105c6f070b875960/opennem/utils/http.py#L106-L124


# Quoted from opennem/utils/http.py; DEFAULT_RETRIES, USER_AGENT and
# TimeoutHTTPAdapter are defined elsewhere in that module.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_strategy = Retry(
    total=DEFAULT_RETRIES,
    backoff_factor=2,
    status_forcelist=[403, 429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"],
)

# This will retry on 403's as well
retry_strategy_on_permission_denied = Retry(
    total=DEFAULT_RETRIES,
    backoff_factor=2,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"],
)

http = requests.Session()
http.headers.update({"User-Agent": USER_AGENT})

adapter_timeout = TimeoutHTTPAdapter()
http.mount("https://", adapter_timeout)
http.mount("http://", adapter_timeout)

adapter_retry = HTTPAdapter(max_retries=retry_strategy)
http.mount("https://", adapter_retry)
http.mount("http://", adapter_retry)

@prakaa (Owner) commented May 18, 2024 via email

@codecov-commenter commented Jun 5, 2024

Codecov Report

Attention: Patch coverage is 85.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.81%. Comparing base (bbcedc2) to head (70831fe).

Files | Patch % | Lines
mms_monthly_cli/mms_monthly.py | 85.00% | 3 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##           master      #15      +/-   ##
==========================================
- Coverage   81.98%   81.81%   -0.17%     
==========================================
  Files           2        2              
  Lines         161      165       +4     
  Branches       26       28       +2     
==========================================
+ Hits          132      135       +3     
- Misses         26       27       +1     
  Partials        3        3              
Flag | Coverage Δ
unittests | 81.81% <85.00%> (-0.17%) ⬇️

Flags with carried forward coverage won't be shown.

@prakaa (Owner) commented Jun 5, 2024

Hi @mdavis-xyz, can you please sync your fork? I've updated the workflow so CI doesn't fail for codecov.

Looks like local tests are passing.

As for changes:

  1. Thanks for changing to HTTPS.
  2. Good to know re requests.Session vs requests.get. Not super familiar with REST.
  3. The @cache decorator looks really useful and is something I wasn't aware of. Thanks!
  4. Fine to remove the other header stuff. I suspect it was a relic of debugging why requests weren't getting served. Your additions around sleep time look good.
  5. My memory of scraping the HTML file sizes is that they weren't accurate, but I can't remember by how much they were off.
  6. This package is a lightweight adaptation of NEMSEER code. I implemented a file/data check in NEMSEER, so portions of that code could be reused here. I don't think zips should be retained - some data files are very large and some users (myself included) are hard-drive limited. Keeping bid zip archives that are ~20GB plus the ~50GB CSVs is not ideal. I think NEMSEER used this: https://github.com/UNSW-CEEM/NEMSEER/blob/9ac64fa7a3bce6cde65da14f955ec5ac3343c31c/src/nemseer/query.py#L307

I will have to get around to adding some of these improvements to NEMSEER.

Thanks very much for your contribution! I'll add you to the acknowledgements.

Abi

@prakaa merged commit 8e4fb70 into prakaa:master on Jun 5, 2024
5 of 14 checks passed
@prakaa (Owner) commented Jun 5, 2024

Closes #13 #14
