
Improve temporary data management using temp directories #133

Open · wants to merge 21 commits into master

Conversation

@lfoppiano (Collaborator) commented Mar 16, 2022

This PR should solve #124 and #126:

  1. In the wrappers, the tmp_directory (usually tmp/model-architecture) is deleted at the beginning.
  2. The wrappers get an additional parameter for the temporary directory, which is passed to the trainer, and all the data is saved there.
  3. save() copies the data to the output location, which can be a) specified by the user with the --output parameter (at the discretion of the application) or b) the default location under data/XYZ/models.

The eval_nfold of sequence labelling evaluates the models and uses self.model to hold the best one. That model is then the one copied to the output directory when model.save(...) is called.
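
To make the flow concrete, here is a minimal sketch of the behaviour described in the three points above; the names (train_and_save, model.train, output_path) are illustrative placeholders, not the actual DeLFT wrapper API:

import os
import shutil

def train_and_save(model, model_name, tmp_directory="tmp/model-architecture",
                   output_directory=None):
    # (1) the temporary directory is deleted at the beginning
    if os.path.exists(tmp_directory):
        shutil.rmtree(tmp_directory)
    os.makedirs(tmp_directory)

    # (2) the trainer receives the temporary directory and writes all
    #     intermediate data (checkpoints, best weights, ...) there
    model.train(output_path=tmp_directory)

    # (3) save() copies the result to the output: either the directory given
    #     by the user (--output) or the default location under data/XYZ/models
    destination = output_directory or os.path.join("data", model_name, "models")
    shutil.copytree(tmp_directory, destination, dirs_exist_ok=True)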

@lfoppiano lfoppiano changed the title Bugfix/directory management Add temporary directory Mar 16, 2022
@lfoppiano lfoppiano changed the title Add temporary directory Improve temporary data management using temp directories Mar 30, 2022
@lfoppiano lfoppiano marked this pull request as ready for review May 19, 2022 05:29
@lfoppiano lfoppiano requested a review from kermitt2 May 19, 2022 05:30
@lfoppiano (Collaborator, Author) commented:
Hopefully I did not break anything.

@lfoppiano lfoppiano added the enhancement label May 19, 2022
@kermitt2 (Owner) commented:

Hi Luca !

Shouldn't the tmp directory be defined in resource-registry.json, since it is a typical library-level resource (like the default download path)?
self.registry is already available in the two wrappers, which would avoid adding more parameters to the existing methods.

Using only one tmp path in resource-registry.json would also be clearer, because (if I am not wrong) there are two different default tmp paths in the current version ('data/tmp' and 'data/models/sequenceLabelling/', but not 'data/models/textClassification/'), which is a bit confusing.

We could probably also use the tmp path as the "download" path for simplification, i.e. replace "embedding-download-path": "data/download" with "tmp-path": "data/tmp"?

@lfoppiano (Collaborator, Author) commented May 25, 2022

> Shouldn't the tmp directory be defined in resource-registry.json, since it is a typical library-level resource (like the default download path)? self.registry is already available in the two wrappers, which would avoid adding more parameters to the existing methods.
> Using only one tmp path in resource-registry.json would also be clearer, because (if I am not wrong) there are two different default tmp paths in the current version ('data/tmp' and 'data/models/sequenceLabelling/', but not 'data/models/textClassification/'), which is a bit confusing.

Good point. I've modified this by:

  • adding temp-path to resource-registry.json
  • setting the default tmp_path to data/tmp in the Trainer; the default tmp_path is now data/tmp everywhere, configured via a constant

> We could probably also use the tmp path as the "download" path for simplification, i.e. replace "embedding-download-path": "data/download" with "tmp-path": "data/tmp"?

I'm not sure about this; I kept them separated for the moment. I think the download path has a specific cleaning policy (everything can be cleaned at any time), while I'm not sure what the policy should be for the temp directory. The current implementation removes tmp_path/model before using it.
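
For illustration, a rough sketch of how the registry-based default could work; the constant name DEFAULT_TMP_PATH and the helper functions are hypothetical and only meant to show the idea (registry key temp-path, default data/tmp, deletion of tmp_path/model before use):

import json
import os
import shutil

DEFAULT_TMP_PATH = "data/tmp"  # hypothetical constant holding the default

def resolve_tmp_path(registry_file="resource-registry.json"):
    # read temp-path from the resource registry, falling back to the default
    with open(registry_file) as f:
        registry = json.load(f)
    return registry.get("temp-path", DEFAULT_TMP_PATH)

def prepare_tmp_directory(model_name, registry_file="resource-registry.json"):
    # the current implementation removes tmp_path/model before using it
    tmp_path = os.path.join(resolve_tmp_path(registry_file), model_name)
    if os.path.exists(tmp_path):
        shutil.rmtree(tmp_path)
    os.makedirs(tmp_path)
    return tmp_path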

@kermitt2 (Owner) left a comment:

This is the review for the part of the PR that adds an output path to the application scripts for the models, instead of only using the default one (except grobidTagger.py, where it was already there).

if output_directory:
    model.save(output_directory)
else:
    model.save()
@kermitt2 (Owner) commented:

Maybe simpler in one line:

model.save(dir_path=output_directory)

(same applies to all application scripts)

@lfoppiano (Collaborator, Author) commented:

I updated all the application scripts; however, I was reluctant to remove output_directory=None and change the signature of the various application methods.
I added a check on dir_path in save() to default it to the default directory when dir_path is None. Maybe you have a better solution than this.
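
Roughly what the check looks like (a sketch only; ModelStub and DEFAULT_MODEL_DIR are hypothetical names to show the dir_path fallback, not the actual DeLFT class or constant):

import os

DEFAULT_MODEL_DIR = "data/models"  # hypothetical default output location

class ModelStub:
    """Stand-in for the model wrapper, only to show the dir_path check in save()."""

    def save(self, dir_path=None):
        # fall back to the default directory when no output directory is given
        if dir_path is None:
            dir_path = DEFAULT_MODEL_DIR
        os.makedirs(dir_path, exist_ok=True)
        # ... copy the trained model files from the temporary directory here ...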

delft/applications/citationClassifier.py (outdated, resolved)
@kermitt2 (Owner) commented:

The PR is doing two different things:

  • It adds an output path parameter to the application scripts indicating where to save the models, instead of only using the default one (except grobidTagger.py, where it was already there).

  • It adds a tmp path for "pre-saving" the trained model resources prior to the actual saving. This modifies the current logic for saving trained models, in particular for n-fold training and evaluation and when a transformer layer is used.

The second one is much more complex and really needs tests, so I am doing two separate reviews to help me :)

@kermitt2 kermitt2 self-requested a review May 28, 2022 13:11
Labels: enhancement · Projects: none yet · 2 participants