
More robustness around the way s3 target sync works. #1205

Closed
scott2b opened this issue Jul 22, 2021 · 3 comments
Labels
backend export feature Feature request storages External / Cloud storage connections

Comments


scott2b commented Jul 22, 2021

Lack of information and error reporting in the s3 target sync makes it difficult to troubleshoot s3 target configs.

The current check for the target prefix verifies that the prefix path exists and fails if it does not. This is not entirely helpful, for a couple of reasons:

  1. In an s3 context, paths are not really directory structures, so the idea of a path needing to exist is not intuitive. While it is possible to create a key that ends with a slash using the AWS CLI, that is not a common operation in my experience. Rather, one expects simply to be able to write a key regardless of whether the full path already exists. That said, perhaps you consider this a feature, i.e. it prevents users from accidentally mistyping an intended path name.
  2. The check does not actually verify that the location is writeable, so an invalid configuration can still validate.
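To illustrate point 1: a prefix in s3 "exists" only in the sense that some object key happens to start with it; there is no directory object unless one is explicitly created as a zero-byte key ending in a slash. A toy sketch of those semantics in plain Python (no real s3 calls; the key names here are invented):

```python
# Toy model of s3 key semantics: a bucket is just a flat list of keys.
# (Illustrative only; these key names are made up.)
keys = [
    "path/to/target/1.json",
    "path/to/target/2.json",
]

def prefix_has_objects(keys, prefix):
    """A prefix 'exists' only insofar as some key starts with it."""
    return any(k.startswith(prefix) for k in keys)

def prefix_has_placeholder(keys, prefix):
    """Unless a zero-byte 'directory' key was created explicitly."""
    return prefix in keys

print(prefix_has_objects(keys, "path/to/target/"))     # True
print(prefix_has_objects(keys, "path/to/other/"))      # False
print(prefix_has_placeholder(keys, "path/to/target/")) # False: no slash-terminated key
```

So "does the path exist" is really "did anyone ever write a key under (or exactly at) this prefix", which is why a freshly chosen target prefix fails the check.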

Some ideas for improvement:

  1. The requirement that a path already exist seems like an arbitrary design decision. I don't have a strong opinion to offer either way, but it would be helpful to clarify this in the docs and document how to create these paths (via the console or the CLI, for example).
  2. Check whether the specified location is writeable. I'm not sure boto offers a way to do this other than simply writing a test file, which complicates things a bit: the test file then needs to be deleted, requiring further permissions. Ideally boto would have a way simply to check the permissions, and maybe it does?
  3. Report s3 write errors in the UI when annotations are submitted. Again, perhaps this is a design decision; maybe you wouldn't want to bother annotators with information that is administrative in nature. But I am inclined to think it is better all around not only to report the error but also to fail the annotation, since the expected behavior will not be fulfilled.
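On idea 2, I'm not aware of a boto call that checks write permission directly either, so the pragmatic approach is the probe write described above: put a small test object under the prefix and delete it again. A sketch, assuming a boto3-shaped client (the function and probe key name are my own; a real client would come from boto3.client('s3'), and real code would catch botocore.exceptions.ClientError rather than bare Exception):

```python
def check_prefix_writable(s3_client, bucket, prefix):
    """Probe write access by writing and deleting a zero-byte test object.

    s3_client is anything exposing put_object/delete_object in the boto3
    shape, e.g. boto3.client('s3'). The probe key name is arbitrary.
    """
    probe_key = prefix.rstrip("/") + "/.write-test"
    try:
        s3_client.put_object(Bucket=bucket, Key=probe_key, Body=b"")
        s3_client.delete_object(Bucket=bucket, Key=probe_key)
        return True
    except Exception:  # real code: botocore.exceptions.ClientError
        return False
```

One caveat with this approach: a policy could grant PutObject but not DeleteObject, in which case the prefix is writeable but the probe object is left behind.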

Finally, somewhat peripheral, but something that would also go a long way toward mitigating target config errors would be a target "sync" option that writes all existing annotations out to the currently configured target. Currently only new annotations are written out, so a bad config cannot be corrected without losing the existing annotations to history.

@makseq makseq added the feature Feature request label Jul 22, 2021

scott2b commented Jul 23, 2021

Regarding the lack of a "sync" option for s3 targets, here are a couple of further tips, gathered from a Slack conversation, that help to mitigate this issue.

  • Be sure to use the "future" setting for s3 sync. This ensures that the export format from Label Studio matches the s3 target data format. Do this by setting the env variable FUTURE_SAVE_TASK_TO_STORAGE=1.

  • Given the setting above, it is now relatively trivial to do a "catch-up" export from Label Studio, which can be extracted and copied up to s3, effectively doing what we would expect a "sync" operation to do. To do this:

  1. Export annotations via the "Export" button in the upper right of a Label Studio project.

  2. Extract each annotation into a file named according to its ID, e.g. (Python):

import json
import os

# Make sure the output directory exists before writing into it
os.makedirs('labels', exist_ok=True)

with open('export-file.json') as infile:
    data = json.load(infile)

# Write each exported annotation to a file named by its ID
for item in data:
    with open(f'labels/{item["id"]}', 'w') as outfile:
        json.dump(item, outfile)
  3. Copy these files up to s3:

aws s3 cp --recursive labels/ s3://mybucket/path/to/target/
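The copy step could equally be done from the same Python script instead of the CLI. A sketch assuming a boto3-shaped client (the function is my own; a real client would be boto3.client('s3'), and the bucket/prefix are placeholders):

```python
from pathlib import Path

def upload_labels(s3_client, directory, bucket, prefix):
    """Upload every file in `directory` under the given s3 key prefix.

    s3_client is a boto3-shaped client, e.g. boto3.client('s3').
    Returns the list of keys written, in sorted filename order.
    """
    uploaded = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            key = prefix.rstrip("/") + "/" + path.name
            s3_client.upload_file(str(path), bucket, key)
            uploaded.append(key)
    return uploaded
```

Using boto3 here also surfaces per-file errors as exceptions, which makes a failed copy easier to spot than scanning CLI output.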


smoreface commented Sep 24, 2021

Target sync was added in the 1.3.0 release:

After setting up target storage and performing annotations, manually sync annotations using the Sync button for the configured target storage. Annotations are still stored in the Label Studio database, and the target storage receives a JSON export of each annotation. Annotations are sent to target storage as a one-way export. You can also export or sync using the API.

On later syncs, duplicates are overwritten by annotation ID.

https://labelstud.io/guide/storage.html

@makseq makseq added backend export storages External / Cloud storage connections labels Oct 12, 2021
@smoreface

Documentation around the permissions needed and target sync has been added, and error messages are improved in 1.3.0. Please open a new issue referencing this one if you still have problems!
