Document to compress data files before uploading #5687

Closed
albertvillanova opened this issue Mar 30, 2023 · 3 comments · Fixed by #5691
Labels
documentation Improvements or additions to documentation

Comments

@albertvillanova
Member

In our docs on how to Share a dataset to the Hub, we tell users to upload their data files directly, such as CSV, JSON, JSON-Lines, text, ... However, these extensions are not tracked by Git LFS by default, as they are not in the .gitattributes file. Therefore, if the files are too large, Git will fail to commit/upload them.

I think that for those file extensions (.csv, .json, .jsonl, .txt), we should instead recommend compressing the data files (using ZIP, for example) before uploading them to the Hub.

  • Compressed files are tracked by Git LFS in our default .gitattributes file
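
To illustrate the point about tracking, here is a minimal Python sketch that checks whether a file name matches any pattern marked `filter=lfs` in a `.gitattributes` file. The helper names (`lfs_patterns`, `is_lfs_tracked`) and the two-line `SAMPLE` content are hypothetical and only for illustration; `fnmatch` is just an approximation of Git's own pattern matching, so treat the result as a hint rather than a guarantee.

```python
from fnmatch import fnmatch


def lfs_patterns(gitattributes_text):
    """Return the path patterns that carry a `filter=lfs` attribute."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


def is_lfs_tracked(filename, patterns):
    """Approximate check: does the file name match any LFS pattern?"""
    return any(fnmatch(filename, p) for p in patterns)


# Hypothetical excerpt; read the real .gitattributes from your repository.
SAMPLE = """\
*.zip filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
"""

patterns = lfs_patterns(SAMPLE)
print(is_lfs_tracked("train.csv", patterns))     # raw CSV: no pattern matches
print(is_lfs_tracked("train.csv.gz", patterns))  # compressed: matches *.gz
```

With this sample, a raw `train.csv` is not matched while `train.csv.gz` is, which is the situation described above.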

What do you think?
CC: @stevhliu

See related issue:

@albertvillanova albertvillanova added the documentation Improvements or additions to documentation label Mar 30, 2023
@stevhliu
Member

Great idea!

Should we also take this opportunity to include some audio/image file formats? Currently, it still reads as very text-heavy. Something like:

We support many text, audio, and image data extensions such as .zip, .rar, .mp3, and .jpg among many others. For data extensions like .csv, .json, .jsonl, and .txt, we recommend compressing them before uploading to the Hub. These file extensions are not tracked by Git LFS by default, and if the files are too large, they will not be committed and uploaded. Take a look at the .gitattributes file in your repository for a complete list of supported file extensions.

@albertvillanova
Member Author

albertvillanova commented Mar 31, 2023

Hi @stevhliu, thanks for your suggestion.

I agree it is a good opportunity to mention that audio/image file formats are also supported.

Nit:
I would not mention .zip, .rar after "text, audio, and image data extensions". Those are "compression" extensions and not "text, audio, and image data extensions".

What about something similar to:

We support many text, audio, and image data extensions such as .csv, .mp3, and .jpg among many others. For text data extensions like .csv, .json, .jsonl, and .txt, we recommend compressing them before uploading to the Hub (for example, to a .zip or .gz file).

Note that text file extensions are not tracked by Git LFS by default, and if the files are too large, they will not be committed and uploaded. Take a look at the .gitattributes file in your repository for a complete list of the file extensions tracked by default.

Note that for compression I have mentioned:

  • gz, to compress individual files
  • zip, to compress and archive multiple files; zip is preferred over tar because it supports streaming out of the box
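
As a rough sketch of the two options above, using only the Python standard library: `gzip_file` compresses a single file to `<name>.gz`, and `zip_files` bundles several files into one ZIP archive. Both function names and the file names in the usage comment are hypothetical, chosen only for illustration.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def gzip_file(src):
    """Compress one file to `<name>.gz` next to the original."""
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        # Stream the bytes so large files are not loaded into memory.
        shutil.copyfileobj(f_in, f_out)
    return dst


def zip_files(paths, archive):
    """Compress and archive several files into a single ZIP file."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            p = Path(p)
            # Store each file under its base name, without directories.
            zf.write(p, arcname=p.name)
    return Path(archive)


# Example usage (hypothetical file names):
# gzip_file("train.csv")                          # creates train.csv.gz
# zip_files(["train.csv", "test.csv"], "data.zip")
```

The compressed file, rather than the raw text file, would then be the one committed and uploaded.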

@stevhliu
Member

Perfect, thanks for making the distinction between compression and data extensions!
