MIME types unfortunately don't officially support tar formats. Croissant uses "application/x-tar", since the "x-" prefix is reserved for experimental types. MIME types do support compression methods such as gzip and zlib, yielding "application/gzip" and "application/zlib", respectively. Because tar archives are routinely combined with compression (e.g., .tar.gz), the combination commonly causes confusion.
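As a point of comparison (not something Croissant does today), Python's standard `mimetypes` module sidesteps the problem by reporting the archive type and the compression as two separate values. A minimal sketch, where the file name `dataset.tar.gz` is only an illustration:

```python
import mimetypes

# A fresh MimeTypes instance starts from Python's built-in tables.
guesser = mimetypes.MimeTypes()

# guess_type returns a (type, encoding) pair: the compression suffix is
# stripped first and reported as the encoding, then the remaining
# extension is mapped to a MIME type.
mime_type, encoding = guesser.guess_type("dataset.tar.gz")
print(mime_type, encoding)
```

With the default tables this reports "application/x-tar" as the type and "gzip" as the encoding, so the two layers stay distinct rather than being folded into a single label.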
Croissant should provide a recommendation for what to do in such cases. In the loader, ".tar.gz" is implicitly converted to ".tar". Meanwhile, the editor classifies ".tar.gz" files as ".gz" (based on extensions and file headers). Thus, the same file gets a different type depending on the implementation. For example, a user may see the following error when using a ".tar.gz" file in the editor, even though the same extension is already used in the flores-200 dataset:
NotImplementedError: File type FileType(name='GZIP', encoding_format='application/gzip', extensions=['gz']) is not supported. Please, open an issue on GitHub: https://github.com/mlcommons/croissant/issues/new
This can be fixed by formalizing one approach. Using "application/x-tar+gzip" or similar (the spec doc uses "application/x-gzip") would prevent confusion, especially in cases where FileObjects may be confused with FileSets. Otherwise, Croissant can stick to a single type (e.g., "application/gzip") and rely on implicit behavior to attempt a recovery for common but unspecified cases (e.g., .tar.gz).
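To make the first option concrete, here is a rough sketch of extension matching that checks compound suffixes before simple ones, so that the loader and the editor could agree on one answer. The table `ENCODING_FORMATS` and the function `guess_encoding_format` are hypothetical names for illustration, not part of Croissant, and "application/x-tar+gzip" is only the label proposed above, not a registered MIME type:

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping from file suffix to encoding format.
# "application/x-tar+gzip" is the label proposed in this issue (an assumption).
ENCODING_FORMATS = {
    ".tar.gz": "application/x-tar+gzip",
    ".tar": "application/x-tar",
    ".gz": "application/gzip",
}


def guess_encoding_format(filename: str) -> Optional[str]:
    """Return the encoding format for the longest matching suffix, if any."""
    name = PurePosixPath(filename).name.lower()
    # Try longer suffixes first so ".tar.gz" is matched before ".gz".
    for suffix in sorted(ENCODING_FORMATS, key=len, reverse=True):
        if name.endswith(suffix):
            return ENCODING_FORMATS[suffix]
    return None
```

Under this scheme a plain gzip file such as "data.csv.gz" still resolves to "application/gzip", while "archive.tar.gz" gets its own distinct type instead of depending on which tool inspects it.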