Script deletes files and subfolders on my machine #77

Closed
BirgerMoell opened this issue Dec 6, 2022 · 4 comments

Comments

@BirgerMoell

I tried running the following script and it deletes all the files and folders in my current working directory.

https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#python-script

There is a --overwrite_output_dir flag, but I'm guessing the intended behaviour is to delete a folder inside the current working directory, not all the files in the working directory itself.

There should probably be a way to rewrite this, since deleting folders and subfolders on someone's computer is dangerous, and I trusted the code and let it run on my computer.

@sanchit-gandhi
Contributor

Hey @BirgerMoell, sorry to hear that you lost some files 😥

This behaviour looks to be a combination of a couple of things:

  1. The output_dir is set to ./ (i.e. the current working directory from which run.sh is launched). The README instructions suggest that you run the script from within a new model repository cloned from the Hugging Face Hub (i.e. set your current working dir to a cloned model repo). This means that the only files the script can touch are those within the model repository, and any files outside the repo are safe! The Trainer is sandboxed in this regard: it only modifies files under the output_dir you specify. If you don't want Trainer to touch files in your current working dir, you need to run the script from a different dir or set output_dir accordingly.
  2. This is the intended behaviour when overwrite_output_dir is set (see docs). I'm afraid we can't change this as it's required for overwriting checkpoints and flushing dirs.

So you can either run the script from within a new model repo or set --overwrite_output_dir=False in the arguments to run.sh. We're trying to give instructions that can be used by as many participants as possible. They've been stress tested with all of the steps run sequentially, but the behaviour may well differ when only some of the steps are run or they are run in a different order. Sorry to hear that you lost files. If this was a model repo cloned from the HF Hub, could you revert to the last commit?
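For anyone who wants a guard before launching training, here is a minimal sketch of a pre-flight check. It is not part of the event scripts or of Trainer. The helper name `is_risky_output_dir` and the example directory `whisper-small-hi` are made up for illustration. The only assumption is the one discussed above: an output_dir that resolves to the working directory (or to one of its parents) puts everything in that directory at risk if the directory is wiped.

```python
from pathlib import Path
from typing import Optional


def is_risky_output_dir(output_dir: str, cwd: Optional[str] = None) -> bool:
    """Return True if wiping output_dir would also wipe the working directory.

    Hypothetical helper, not a Trainer API. output_dir is interpreted relative
    to cwd (defaults to the current working directory), the same way "./" is.
    """
    base = Path(cwd).resolve() if cwd is not None else Path.cwd().resolve()
    target = (base / output_dir).resolve()
    # Risky when the target is the working directory itself or encloses it.
    return target == base or target in base.parents


if __name__ == "__main__":
    # "whisper-small-hi" is only an example directory name.
    for candidate in ("./", "..", "./whisper-small-hi"):
        print(candidate, "risky" if is_risky_output_dir(candidate) else "ok")
```

A check like this could run at the top of a training script and abort before any Trainer is constructed. It only compares paths and never deletes anything itself.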

@BirgerMoell
Author

I got most of the files back through git, but it seemed a bit unexpected, so I just wanted to report it to make sure it behaved as intended.

@sanchit-gandhi
Contributor

For sure! Thanks for reporting it! I'll have a look to see if there's a way we can avoid this in the future (potentially being more explicit in the text?)

@sanchit-gandhi
Contributor

(It will only do a full wipe if the output dir is not a HF repo:
https://github.com/huggingface/transformers/blob/6a707cf586f37865d1a9fac02c56c43cf8dcf979/src/transformers/trainer.py#L3326)
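To make that condition concrete, here is a simplified model of the behaviour described in this comment. It is not a copy of the linked trainer.py code. The function name `prepare_output_dir`, its return strings, and the use of a `.git` folder as the test for "is a repo" are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def prepare_output_dir(output_dir: str, overwrite_output_dir: bool) -> str:
    """Illustrative model only: wipe output_dir unless it already is a repo.

    Assumption: a directory counts as a repo if it contains a ".git" folder.
    """
    path = Path(output_dir)
    if (path / ".git").exists():
        # Already a repository: reuse it, nothing is deleted.
        return "reused"
    if path.exists() and any(path.iterdir()):
        if not overwrite_output_dir:
            raise OSError(f"{output_dir} is not empty and is not a repo")
        # Not a repo and overwrite requested: the whole directory is removed.
        shutil.rmtree(path)
        path.mkdir(parents=True)
        return "wiped"
    path.mkdir(parents=True, exist_ok=True)
    return "created"
```

Under this model, running with output_dir set to ./ from a directory that is not a cloned model repo falls into the "wiped" branch, which matches what was reported in this issue.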


3 participants