
Improve temporary data management using temp directories #133

Open · wants to merge 21 commits into master

Conversation

@lfoppiano (Collaborator) commented Mar 16, 2022

This PR should solve #124 and #126:

  1. In the wrappers, the tmp_directory (usually tmp/model-architecture) is deleted at the beginning.
  2. The wrappers get an additional parameter for the temporary directory, which is passed to the trainer, and all the data is saved there.
  3. save() copies the data to the output location, which can be a) specified by the user with the --output parameter (at the discretion of the application) or b) the default location under data/XYZ/models.

The eval_nfold of sequence labelling evaluates the models and uses self.model to hold the best one. That model is then the one copied to the output directory when model.save(...) is called.
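
To make the flow concrete, here is a minimal sketch of the behaviour described in the three points above; the names (train_and_save, model.train, output_path) are illustrative placeholders, not the actual DeLFT wrapper API:

import os
import shutil

def train_and_save(model, model_name, tmp_directory="tmp/model-architecture",
                   output_directory=None):
    # (1) the temporary directory is deleted at the beginning
    if os.path.exists(tmp_directory):
        shutil.rmtree(tmp_directory)
    os.makedirs(tmp_directory)

    # (2) the trainer receives the temporary directory and writes all
    #     intermediate data (checkpoints, best weights, ...) there
    model.train(output_path=tmp_directory)

    # (3) save() copies the result to the output: either the directory given
    #     by the user (--output) or the default location under data/XYZ/models
    destination = output_directory or os.path.join("data", model_name, "models")
    shutil.copytree(tmp_directory, destination, dirs_exist_ok=True)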

@lfoppiano lfoppiano changed the title Bugfix/directory management Add temporary directory Mar 16, 2022
@lfoppiano lfoppiano changed the title Add temporary directory Improve temporary data management using temp directories Mar 30, 2022
@lfoppiano lfoppiano marked this pull request as ready for review May 19, 2022 05:29
@lfoppiano lfoppiano requested a review from kermitt2 May 19, 2022 05:30
@lfoppiano (Collaborator, Author) commented:
Hopefully I did not break anything.

@lfoppiano lfoppiano added the enhancement label May 19, 2022
@kermitt2 (Owner) commented:

Hi Luca !

Shouldn't the tmp directory be defined in resource-registry.json, since it is a typical library-level resource (like the default download path)?
self.registry is already available in the two wrappers, which would avoid adding more parameters to the existing methods.

Using only one tmp path in resource-registry.json would also be clearer, because (if I am not wrong) there are two different default tmp paths in the current version ('data/tmp' and 'data/models/sequenceLabelling/', but not 'data/models/textClassification/'), which is a bit confusing.

We could probably also use the tmp path as the "download" path for simplification, i.e. replace "embedding-download-path": "data/download" with "tmp-path": "data/tmp"?

@lfoppiano (Collaborator, Author) commented May 25, 2022

> Shouldn't the tmp directory be defined in resource-registry.json, since it is a typical library-level resource (like the default download path)? self.registry is already available in the two wrappers, which would avoid adding more parameters to the existing methods.
> Using only one tmp path in resource-registry.json would also be clearer, because (if I am not wrong) there are two different default tmp paths in the current version ('data/tmp' and 'data/models/sequenceLabelling/', but not 'data/models/textClassification/'), which is a bit confusing.

Good point. I've modified this by:

  • adding temp-path to resource-registry.json
  • setting the default tmp_path to data/tmp in the Trainer; the default tmp_path is now data/tmp everywhere, configured via a constant

> We could probably also use the tmp path as the "download" path for simplification, i.e. replace "embedding-download-path": "data/download" with "tmp-path": "data/tmp"?

I'm not sure about this; I kept them separated for the moment. I think the download path has a specific cleaning policy (everything can be cleaned at any time), while I'm not sure what the policy should be for the temp directory. The current implementation removes tmp_path/model before using it.
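
For illustration, a rough sketch of how the registry-based default could work; the constant name DEFAULT_TMP_PATH and the helper functions are hypothetical and only meant to show the idea (registry key temp-path, default data/tmp, deletion of tmp_path/model before use):

import json
import os
import shutil

DEFAULT_TMP_PATH = "data/tmp"  # hypothetical constant holding the default

def resolve_tmp_path(registry_file="resource-registry.json"):
    # read temp-path from the resource registry, falling back to the default
    with open(registry_file) as f:
        registry = json.load(f)
    return registry.get("temp-path", DEFAULT_TMP_PATH)

def prepare_tmp_directory(model_name, registry_file="resource-registry.json"):
    # the current implementation removes tmp_path/model before using it
    tmp_path = os.path.join(resolve_tmp_path(registry_file), model_name)
    if os.path.exists(tmp_path):
        shutil.rmtree(tmp_path)
    os.makedirs(tmp_path)
    return tmp_path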

@kermitt2 (Owner) left a comment:

This is the review for the part of the PR that adds an output path to the application scripts for the models, instead of only using the default one (except grobidTagger.py, where it was already there).

if output_directory:
    model.save(output_directory)
else:
    model.save()
@kermitt2 (Owner) commented:

Maybe simpler in one line:

model.save(dir_path=output_directory)

(same applies to all application scripts)

@lfoppiano (Collaborator, Author) commented:

I updated all the application scripts; however, I was reluctant to remove output_directory=None and change the signature of the various application methods.
I added a check on dir_path in save() to default it to the default directory when dir_path is None. Maybe you have a better solution than this.
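
Roughly what the check looks like (a sketch only; ModelStub and DEFAULT_MODEL_DIR are hypothetical names to show the dir_path fallback, not the actual DeLFT class or constant):

import os

DEFAULT_MODEL_DIR = "data/models"  # hypothetical default output location

class ModelStub:
    """Stand-in for the model wrapper, only to show the dir_path check in save()."""

    def save(self, dir_path=None):
        # fall back to the default directory when no output directory is given
        if dir_path is None:
            dir_path = DEFAULT_MODEL_DIR
        os.makedirs(dir_path, exist_ok=True)
        # ... copy the trained model files from the temporary directory here ...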

delft/applications/citationClassifier.py (outdated, resolved)
@kermitt2 (Owner) commented:

The PR is doing two different things:

  • It adds an output path parameter to the application scripts indicating where to save the models, instead of only using the default one (except grobidTagger.py, where it was already there).

  • It adds a tmp path for "pre-saving" the trained model resources prior to the actual saving. This modifies the current logic for saving trained models, in particular for n-fold training and evaluation and when a transformer layer is used.

The second one is much more complex and really needs tests, so I am doing two separate reviews to help me :)

@kermitt2 kermitt2 self-requested a review May 28, 2022 13:11
Labels: enhancement · Projects: none yet · 2 participants