Lack of information and error reporting in the s3 target sync makes it difficult to troubleshoot s3 target configs.
The current validation of the target prefix checks whether the prefix path exists and fails if it does not. This is not entirely helpful, for a couple of reasons:
In an s3 context, paths are not really directory structures, so the idea of a path needing to exist is unintuitive. While it is possible to create a key that ends with a slash using the AWS CLI, it is not a common operation in my experience. Rather, one expects to simply be able to write a key regardless of whether the full path already exists. That said, perhaps you consider this to be a feature, i.e. it prevents users from accidentally mistyping an intended path name.
The check does not actually verify that the location is writable, so an invalid configuration can still pass validation.
Some ideas for improvement:
Requiring a path to already exist seems like an arbitrary design decision. I don't have a strong opinion to offer either way in this matter, but it might be helpful to clarify this in the docs and document how to create these paths (via the console or the CLI, for example).
Check whether the specified location is writable. I'm not entirely sure there is a way to do this in boto other than simply writing a test file, which complicates things a bit: creating a test file then requires further permissions to delete it. Ideally boto would have some way to simply check the permissions, and maybe it does?
Report s3 write errors in the UI when annotations are submitted. Again, perhaps this is a design decision: maybe you wouldn't want to bother annotators with information that is more administrative in nature. But I'm inclined to think it is better all around not only to report the error but also to fail the annotation, since the expected behavior will not be fulfilled.
Finally, somewhat peripheral, but something that would also go a long way toward mitigating target config errors would be to have a target "sync" option that would write all existing annotations out to the currently configured target. Currently only new annotations are written out, meaning a bad config cannot be corrected without losing the existing annotations to history.
Regarding the lack of a "sync" option for s3 targets, here are a couple of further tips gathered from a Slack conversation that help mitigate this issue.
Be sure to use the "future" setting for s3 sync. This ensures that the export format for exports from Label Studio matches the s3 target data format. Do this by setting the environment variable FUTURE_SAVE_TASK_TO_STORAGE=1.
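For example, when launching Label Studio from a shell (the variable name is as above; how you launch Label Studio will vary with your deployment):

```shell
# Make the export format match the s3 target layout (per the tip above).
# Set this in the environment Label Studio starts from, e.g. before
# `label-studio start`, or in a docker-compose "environment:" section.
export FUTURE_SAVE_TASK_TO_STORAGE=1
```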
With the setting above in place, it is relatively trivial to do a "catch-up" export from Label Studio that can be extracted and copied up to s3, effectively doing what we would expect a "sync" operation to do. To do this:
Export annotations via the "Export" button in the upper right of a Label Studio project.
Extract each annotation into a file named according to its ID, e.g. (Python):

import json

# Assumes the `labels/` output directory already exists.
with open('export-file.json') as infile:
    data = json.load(infile)

for item in data:
    # One file per annotation, named by its ID.
    with open(f'labels/{item["id"]}', 'w') as outfile:
        json.dump(item, outfile)
After setting up target storage and performing annotations, manually sync annotations using the Sync button for the configured target storage. Annotations are still stored in the Label Studio database, and the target storage receives a JSON export of each annotation. Annotations are sent to target storage as a one-way export. You can also export or sync using the API.
On later syncs, duplicates are overwritten by annotation ID.
Added documentation around the permissions needed and target sync, and improved error messages, in 1.3.0. Please open a new issue referencing this one if you still have issues!