
CI dependency wheel caching #6287

Merged: 2 commits merged into master from ci-caching on Aug 7, 2020
Conversation

LysandreJik (Member)
In the past week there has been a drastic increase in CircleCI test failures due to mismatching hashes for large dependencies (Torch and TensorFlow); example here:

ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    tensorflow from https://files.pythonhosted.org/packages/97/ae/0b08f53498417914f2274cc3b5576d2b83179b0cbb209457d0fde0152174/tensorflow-2.3.0-cp36-cp36m-manylinux2010_x86_64.whl#sha256=5c9f9a36d5b4d0ceb67b985486fe4cc6999a96e2bf89f3ba82ffd8317e5efadd (from transformers==3.0.2):
        Expected sha256 5c9f9a36d5b4d0ceb67b985486fe4cc6999a96e2bf89f3ba82ffd8317e5efadd
             Got        2b6dbd560e2f78ccad4c831d170a8612fb592a562a128c05aa6d64f84baa17e5

The issue stems from incomplete downloads when CircleCI fetches the wheels to install the dependencies. The download is halted, leaving an incomplete file, which results in a different hash.
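To illustrate the failure mode, here is a minimal sketch of the integrity check pip performs in hash-checking mode: the downloaded bytes are hashed with SHA-256 and compared against the digest pinned in the requirements file, so a truncated download necessarily mismatches (the function name and byte strings are illustrative, not pip's internals):

```python
import hashlib

def verify_download(data: bytes, expected_sha256: str) -> bool:
    """Return True if the downloaded bytes match the pinned SHA-256 digest,
    mirroring the check pip performs against a requirements file."""
    return hashlib.sha256(data).hexdigest() == expected_sha256

wheel_bytes = b"pretend-wheel-contents"
pinned = hashlib.sha256(wheel_bytes).hexdigest()

# A complete download reproduces the pinned hash...
assert verify_download(wheel_bytes, pinned)
# ...while a halted, truncated download yields a different digest,
# which is exactly the "THESE PACKAGES DO NOT MATCH THE HASHES" error above.
assert not verify_download(wheel_bytes[:10], pinned)
```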

With this PR, the CircleCI ~/.cache/pip directory, which contains the downloaded wheels, is cached between runs, meaning the files won't be re-downloaded as long as the latest version is available in the cache. A cache is created for each workflow.

Unfortunately, CircleCI does not provide a way to update or delete an existing cache. If a new version of either Torch or TensorFlow is released, the cache won't be updated until we update setup.py or the cache version, or until the cache expires 15 days after its creation.
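As a sketch of how such caching is typically wired up in .circleci/config.yml (this is not the actual diff of this PR; the key name, version prefix, and install extras are illustrative), a cache key that embeds a checksum of setup.py invalidates itself whenever the pinned dependencies change:

```yaml
# Illustrative sketch only, not the PR's actual configuration.
# The "v0.3" prefix plays the role of the manual cache version mentioned
# above: bumping it, or changing setup.py, produces a new cache key.
- restore_cache:
    keys:
      - v0.3-dependencies-{{ checksum "setup.py" }}
- run: pip install --user .[testing]
- save_cache:
    key: v0.3-dependencies-{{ checksum "setup.py" }}
    paths:
      - ~/.cache/pip
```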

If the cache works well enough, I'll include ~/.cache/torch in the cache, so that all the downloads done during the tests are saved in the cache as well. This should prevent other flaky tests from happening due to missing models on the S3.

Commits:
- Remove cache dir, re-trigger cache
- Only pip archives
- Not sudo when pip
- Remove no-cache-dir instruction
- Remove last sudo occurrences
- v0.3
codecov bot commented Aug 6, 2020

Codecov Report

Merging #6287 into master will increase coverage by 0.77%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #6287      +/-   ##
==========================================
+ Coverage   79.68%   80.45%   +0.77%     
==========================================
  Files         146      146              
  Lines       26595    26595              
==========================================
+ Hits        21192    21397     +205     
+ Misses       5403     5198     -205     
Impacted Files                             | Coverage Δ
src/transformers/file_utils.py             | 80.05% <0.00%> (-0.26%) ⬇️
src/transformers/tokenization_dpr.py       | 57.65% <0.00%> (+4.50%) ⬆️
src/transformers/generation_tf_utils.py    | 86.46% <0.00%> (+5.76%) ⬆️
src/transformers/modeling_tf_gpt2.py       | 95.31% <0.00%> (+23.43%) ⬆️
src/transformers/modeling_tf_flaubert.py   | 88.19% <0.00%> (+63.97%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d89acd0...f3776aa. Read the comment docs.

@sshleifer (Contributor) left a comment:

Is there any notion of routines/functions in circleci? Might make it easier to reuse some of this logic without typos. Otherwise this looks great and thanks for the informative PR description!
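For reference, CircleCI 2.1 does support reusable "commands" that can factor out repeated step logic; a hypothetical sketch (command and job names, keys, and extras are illustrative, not taken from this repo's config):

```yaml
version: 2.1
commands:
  setup_with_cache:
    description: "Restore the pip wheel cache, install, then save it back"
    steps:
      - checkout
      - restore_cache:
          keys:
            - v0.3-dependencies-{{ checksum "setup.py" }}
      - run: pip install --user .[testing]
      - save_cache:
          key: v0.3-dependencies-{{ checksum "setup.py" }}
          paths:
            - ~/.cache/pip

jobs:
  run_tests:
    docker:
      - image: circleci/python:3.6
    steps:
      - setup_with_cache
      - run: python -m pytest ./tests/
```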

@sgugger (Collaborator) left a comment:

Thanks a lot for digging into this, those errors are so annoying!

julien-c (Member) commented Aug 6, 2020

> If the cache works well enough, I'll include ~/.cache/torch in the cache, so that all the downloads done during the tests are saved in the cache as well. This should prevent other flaky tests from happening due to missing models on the S3.
If the model is missing on S3, it should be reported as a failure, shouldn't it?

I wouldn't include ~/.cache/torch in the cache because layering multiple caches (files are already behind a CDN) tends to lead to hard to debug bugs

@LysandreJik LysandreJik merged commit 80a0676 into master Aug 7, 2020
@LysandreJik LysandreJik deleted the ci-caching branch August 7, 2020 06:49
LysandreJik (Member, Author)
@julien-c These flaky failing tests claim the model is missing on S3, while it isn't: the model is available, but CircleCI's connection issues surface as that error. I'll link such a failing test here when one occurs.

@LysandreJik LysandreJik mentioned this pull request Aug 10, 2020
5 participants