Standardize MIME Types for Compressed tar Files #547

Open
mkuchnik opened this issue Feb 21, 2024 · 0 comments
Labels
invalid This doesn't seem right

Comments

Contributor

mkuchnik commented Feb 21, 2024

Unfortunately, there is no officially registered MIME type for tar archives. Croissant uses "application/x-tar", since "x-" is reserved for experimental types. MIME types do support compression methods, such as gzip and zlib, yielding "application/gzip" and "application/zlib", respectively. Because tar archives are commonly combined with compression (e.g., .tar.gz), it is unclear which type such a file should carry, and this often causes confusion.

Croissant should provide a recommendation for what to do in such cases. In the loader, there is an implicit conversion from ".tar.gz" to ".tar" here. Meanwhile, the editor classifies ".tar.gz" files as ".gz" (using extensions and file headers). Thus, the same file gets a different type depending on the implementation. For example, a user may see the following error when using a ".tar.gz" file in the editor, even though the same extension is already used in the flores-200 dataset:

NotImplementedError: File type FileType(name='GZIP', encoding_format='application/gzip', extensions=['gz']) is not supported. Please, open an issue on GitHub: https://github.com/mlcommons/croissant/issues/new
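
To illustrate how a file header check could tell a gzip-compressed tar archive apart from a plain gzip file, here is a minimal sketch using only the Python standard library. The function name `sniff_gzip_payload` and its return labels are made up for this example; this is not the code the Croissant loader or editor uses, and it assumes the input is a complete, uncorrupted file held in memory.

```python
import io
import tarfile

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def sniff_gzip_payload(data: bytes) -> str:
    """Return 'tar+gzip', 'gzip', or 'other' for the given file contents.

    Hypothetical helper for illustration only.
    """
    if not data.startswith(GZIP_MAGIC):
        return "other"
    try:
        # Mode "r:gz" decompresses the stream and then parses a tar header;
        # tarfile raises ReadError (a TarError) if no valid header is found.
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz"):
            return "tar+gzip"
    except tarfile.TarError:
        return "gzip"
```

A check like this looks at the content rather than the extension, so it would treat a ".tar.gz" file and a ".tgz" file the same way.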

This inconsistency can be fixed if the right approach is formalized. Using "application/x-tar+gzip" or similar (the spec doc is using "application/x-gzip") would remove the ambiguity, especially when FileObjects may be confused with FileSets. Otherwise, Croissant can stick to one (e.g., "application/gzip") and use implicit behavior to attempt a recovery for common but unspecified cases (e.g., .tar.gz).
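
The second option (keep a single declared type and recover implicitly) could look roughly like the sketch below. Everything here is an assumption for illustration: the function name `effective_encoding_format` is hypothetical, "application/x-tar+gzip" is only the label floated above and not a registered media type, and the input is expected to be a plain filename rather than a URL with a query string.

```python
from pathlib import PurePosixPath

# Composite label suggested in this issue; NOT a registered media type.
TAR_GZIP = "application/x-tar+gzip"

# Declared formats that only say "this is gzip" and nothing about the payload.
GZIP_FORMATS = ("application/gzip", "application/x-gzip")


def effective_encoding_format(filename: str, declared: str) -> str:
    """Refine a declared gzip format using the filename's extensions.

    Hypothetical helper: returns TAR_GZIP for *.tar.gz and *.tgz files that
    were declared as gzip, and the declared format unchanged otherwise.
    """
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    looks_like_tar_gz = suffixes[-2:] == [".tar", ".gz"] or suffixes[-1:] == [".tgz"]
    if declared in GZIP_FORMATS and looks_like_tar_gz:
        return TAR_GZIP
    return declared
```

With a rule like this, both the loader and the editor would resolve "example.tar.gz" declared as "application/gzip" to the same composite type, while a plain "notes.txt.gz" would stay "application/gzip".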
