
More robustness around the way s3 target sync works. #1205

Closed
scott2b opened this issue Jul 22, 2021 · 3 comments
Labels
backend export feature Feature request storages External / Cloud storage connections

Comments


scott2b commented Jul 22, 2021

Lack of information and error reporting in the s3 target sync makes it difficult to troubleshoot s3 target configs.

The current check for the target prefix verifies that the prefix path exists and fails if it does not. This is not entirely helpful, for a couple of reasons:

  1. In an s3 context, paths are not really directory structures, so the idea of a path needing to exist is not intuitive. While it is possible to create a key that ends with a slash using the AWS CLI, that is not a common operation in my experience. Rather, one expects simply to be able to write a key regardless of whether the full path already exists. That said, perhaps you consider this a feature, i.e. it prevents users from accidentally mistyping an intended path name.
  2. The check does not actually verify that the location is writeable, so an invalid configuration can still validate.
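To illustrate point 1: a prefix in s3 "exists" only in the sense that some object key happens to start with it; there is no directory object unless one is explicitly created as a zero-byte key ending in a slash. A toy sketch of those semantics in plain Python (no real s3 calls; the key names here are invented):

```python
# Toy model of s3 key semantics: a bucket is just a flat list of keys.
# (Illustrative only; these key names are made up.)
keys = [
    "path/to/target/1.json",
    "path/to/target/2.json",
]

def prefix_has_objects(keys, prefix):
    """A prefix 'exists' only insofar as some key starts with it."""
    return any(k.startswith(prefix) for k in keys)

def prefix_has_placeholder(keys, prefix):
    """Unless a zero-byte 'directory' key was created explicitly."""
    return prefix in keys

print(prefix_has_objects(keys, "path/to/target/"))     # True
print(prefix_has_objects(keys, "path/to/other/"))      # False
print(prefix_has_placeholder(keys, "path/to/target/")) # False: no slash-terminated key
```

So "does the path exist" is really "did anyone ever write a key under (or exactly at) this prefix", which is why a freshly chosen target prefix fails the check.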

Some ideas for improvement:

  1. The requirement that a path already exist seems like an arbitrary design decision. I don't have a strong opinion to offer either way, but it would be helpful to clarify this in the docs and document how to create these paths (via the console or the CLI, for example).
  2. Check whether the specified location is writeable. I'm not sure boto offers a way to do this other than simply writing a test file, which complicates things a bit: the test file then needs to be deleted, requiring further permissions. Ideally boto would have a way simply to check the permissions, and maybe it does?
  3. Report s3 write errors in the UI when annotations are submitted. Again, perhaps this is a design decision; maybe you wouldn't want to bother annotators with information that is administrative in nature. But I am inclined to think it is better all around not only to report the error but also to fail the annotation, since the expected behavior will not be fulfilled.
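On idea 2, I'm not aware of a boto call that checks write permission directly either, so the pragmatic approach is the probe write described above: put a small test object under the prefix and delete it again. A sketch, assuming a boto3-shaped client (the function and probe key name are my own; a real client would come from boto3.client('s3'), and real code would catch botocore.exceptions.ClientError rather than bare Exception):

```python
def check_prefix_writable(s3_client, bucket, prefix):
    """Probe write access by writing and deleting a zero-byte test object.

    s3_client is anything exposing put_object/delete_object in the boto3
    shape, e.g. boto3.client('s3'). The probe key name is arbitrary.
    """
    probe_key = prefix.rstrip("/") + "/.write-test"
    try:
        s3_client.put_object(Bucket=bucket, Key=probe_key, Body=b"")
        s3_client.delete_object(Bucket=bucket, Key=probe_key)
        return True
    except Exception:  # real code: botocore.exceptions.ClientError
        return False
```

One caveat with this approach: a policy could grant PutObject but not DeleteObject, in which case the prefix is writeable but the probe object is left behind.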

Finally, somewhat peripheral, but something that would also go a long way toward mitigating target config errors would be a target "sync" option that writes all existing annotations out to the currently configured target. Currently only new annotations are written out, so a bad config cannot be corrected without losing the existing annotations to history.

@makseq makseq added the feature Feature request label Jul 22, 2021

scott2b commented Jul 23, 2021

Regarding the lack of a "sync" option for s3 targets, here are a couple of further tips, gathered from a Slack conversation, that help to mitigate this issue.

  • Be sure to use the "future" setting for s3 sync. This ensures that the export format from Label Studio matches the s3 target data format. Do this by setting the env variable FUTURE_SAVE_TASK_TO_STORAGE=1.

  • Given the setting above, it is now relatively trivial to do a "catch-up" export from Label Studio, which can be extracted and copied up to s3, effectively doing what we would expect a "sync" operation to do. To do this:

  1. Export annotations via the "Export" button in the upper right of a Label Studio project.

  2. Extract each annotation into a file named according to its ID, e.g. (Python):

import json
import os

# Make sure the output directory exists before writing into it
os.makedirs('labels', exist_ok=True)

with open('export-file.json') as infile:
    data = json.load(infile)

# Write each exported annotation to a file named by its ID
for item in data:
    with open(f'labels/{item["id"]}', 'w') as outfile:
        json.dump(item, outfile)
  3. Copy these files up to s3:

aws s3 cp --recursive labels/ s3://mybucket/path/to/target/
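The copy step could equally be done from the same Python script instead of the CLI. A sketch assuming a boto3-shaped client (the function is my own; a real client would be boto3.client('s3'), and the bucket/prefix are placeholders):

```python
from pathlib import Path

def upload_labels(s3_client, directory, bucket, prefix):
    """Upload every file in `directory` under the given s3 key prefix.

    s3_client is a boto3-shaped client, e.g. boto3.client('s3').
    Returns the list of keys written, in sorted filename order.
    """
    uploaded = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            key = prefix.rstrip("/") + "/" + path.name
            s3_client.upload_file(str(path), bucket, key)
            uploaded.append(key)
    return uploaded
```

Using boto3 here also surfaces per-file errors as exceptions, which makes a failed copy easier to spot than scanning CLI output.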


smoreface commented Sep 24, 2021

Target sync was added in the 1.3.0 release:

After setting up target storage and performing annotations, manually sync annotations using the Sync button for the configured target storage. Annotations are still stored in the Label Studio database, and the target storage receives a JSON export of each annotation. Annotations are sent to target storage as a one-way export. You can also export or sync using the API.

On later syncs, duplicates are overwritten by annotation ID.

https://labelstud.io/guide/storage.html

@makseq makseq added backend export storages External / Cloud storage connections labels Oct 12, 2021
@smoreface

Documentation around the permissions needed and target sync has been added, and error messages are improved in 1.3.0. Please open a new issue referencing this one if you still have problems!
