
[CI] clear ~/.cache/torch_extensions between builds #14520

Merged
merged 1 commit into huggingface:master on Nov 25, 2021

Conversation

@stas00 (Contributor) commented Nov 25, 2021

This PR addresses CI failures with pt-nightly: https://github.com/huggingface/transformers/runs/4280926354?check_suite_focus=true

~/.cache/torch_extensions/ currently uses a single hardcoded path for all custom CUDA extension builds, so when an extension is built with pt-1.8 and then loaded under pt-nightly (pt-1.11-to-be), the following happens:

ImportError: /github/home/.cache/torch_extensions/py38_cu111/cpu_adam/cpu_adam.so: undefined symbol: curandCreateGenerator

pt-1.10 improved the situation by adding a prefix (~/.cache/torch_extensions/py38_cu113), so builds are no longer shared across different CUDA and Python versions, but it missed the crucial pt version in that prefix. I reported the issue here:
pytorch/pytorch#68905

And of course, ideally all builds would be installed into the virtual Python environment rather than into a globally shared directory.
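As a sketch of that per-environment approach (not what this PR does), PyTorch's documented TORCH_EXTENSIONS_DIR environment variable can redirect the JIT-build cache into the active virtualenv; the exact path below is illustrative:

```shell
# Sketch: scope JIT-built extensions to the active virtualenv via the
# documented TORCH_EXTENSIONS_DIR override (the fallback path is illustrative).
export TORCH_EXTENSIONS_DIR="${VIRTUAL_ENV:-$HOME/.venv}/torch_extensions"
mkdir -p "$TORCH_EXTENSIONS_DIR"
```

With this set, each virtualenv keeps its own builds, so switching torch versions in separate environments no longer collides.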

This PR tries to address the issue by wiping out ~/.cache/torch_extensions/ completely when CI starts.
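The cleanup can be sketched as a single CI step (the TORCH_EXTENSIONS_DIR fallback handling is an illustrative addition, not part of this PR):

```shell
# Sketch of the CI cleanup step: remove the shared extension cache so
# DeepSpeed's custom CUDA ops are rebuilt against the torch version under test.
CACHE_DIR="${TORCH_EXTENSIONS_DIR:-$HOME/.cache/torch_extensions}"
rm -rf "$CACHE_DIR"
```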

This of course means DeepSpeed will rebuild the extensions on every CI run, but that's actually a good thing, because then we really test the right version. The rebuild is fast, so it shouldn't add much overhead.

@LysandreJik

-pip install .[testing,deepspeed,fairscale]
 pip install git+https://github.com/microsoft/DeepSpeed
+rm -rf ~/.cache/torch_extensions/ # shared between conflicting builds
+pip install .[testing,fairscale]
@stas00 (Contributor, Author) commented:

I also removed deepspeed here, since it's immediately re-installed on the following line.

@stas00 changed the title [CI] clear ~/.cache/torch_extensions between builds Nov 25, 2021
@LysandreJik (Member) left a comment:

Great, thank you for taking care of that, @stas00. Looks good to me, let's monitor the nightly scheduled runs.

@LysandreJik LysandreJik merged commit d1fd64e into huggingface:master Nov 25, 2021
@stas00 stas00 deleted the shared-torch-extensions branch November 25, 2021 17:46
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
@ydshieh ydshieh mentioned this pull request Jun 27, 2023