Parse safetensors metadata #1855

Merged
merged 13 commits into from
Nov 24, 2023

Wauplin
Contributor

@Wauplin Wauplin commented Nov 22, 2023

Related to #1832 cc @LysandreJik @Narsil

This PR adds 2 methods:

  • parse_safetensors_file_metadata => parses a single safetensors file on the Hub. This is the "real" method that does the parsing.
  • get_safetensors_metadata => takes a repo and parses all of its safetensors files (following the model.safetensors.index.json file if the repo is sharded). This one is geared towards the transformers layout (i.e. opinionated).

I more or less followed the TypeScript implementation, with similar error handling.
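
For readers unfamiliar with sharded repos, here is a minimal sketch of how the shard list can be derived from a model.safetensors.index.json file. It is not the code from this PR: the helper name `list_shard_filenames` and the example index content are made up for illustration, and it only assumes the usual index layout where `weight_map` maps each tensor name to the shard file that stores it.

```python
import json
from typing import List


def list_shard_filenames(index_json: str) -> List[str]:
    """Return the sorted, de-duplicated shard filenames referenced by an index file.

    Hypothetical helper, for illustration only.
    """
    index = json.loads(index_json)
    weight_map = index["weight_map"]  # tensor name -> shard filename
    # Many tensors live in the same shard, so de-duplicate before returning.
    return sorted(set(weight_map.values()))


# Made-up index content for a two-shard repo (not taken from a real model).
example_index = json.dumps(
    {
        "metadata": {"total_size": 1234},
        "weight_map": {
            "embed.weight": "model-00001-of-00002.safetensors",
            "layer.0.weight": "model-00001-of-00002.safetensors",
            "lm_head.weight": "model-00002-of-00002.safetensors",
        },
    }
)

shards = list_shard_filenames(example_index)
print(shards)
```

Each filename in `shards` would then be parsed on its own, which is the role of the single-file method described above.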

@Wauplin Wauplin marked this pull request as ready for review November 23, 2023 12:08

To parse metadata from a single safetensors file, use [`parse_safetensors_file_metadata`].

For more details regarding the safetensors format, check out https://huggingface.co/docs/safetensors/index#format.
Member

you'll be able to hyper-link to this implem from the safetensors doc too BTW (like i had done for the JS implem)

Member

more generally let's always make sure we cross-link stuff as much as possible
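
As a pointer for the cross-linking discussed here, the safetensors format referenced in the docstring can be sketched in a few lines. A file starts with an 8-byte little-endian unsigned integer N, followed by N bytes of UTF-8 JSON (the header), followed by the raw tensor data. The snippet below is an illustrative sketch that works on in-memory bytes, not the implementation from this PR; `build_safetensors_bytes` and `parse_header` are hypothetical helper names.

```python
import json
import struct


def build_safetensors_bytes(header: dict, data: bytes) -> bytes:
    """Assemble a safetensors-like blob: 8-byte length + JSON header + tensor data."""
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


def parse_header(blob: bytes) -> dict:
    """Read the header length from the first 8 bytes, then decode the JSON header."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8 : 8 + header_len].decode("utf-8"))


# Made-up header: one 2x2 float32 tensor (16 bytes) plus free-form metadata.
header = {
    "__metadata__": {"format": "pt"},
    "weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]},
}
blob = build_safetensors_bytes(header, b"\x00" * 16)

parsed = parse_header(blob)
# "__metadata__" is a reserved key; every other key describes a tensor.
tensors = {name: info for name, info in parsed.items() if name != "__metadata__"}
print(tensors)
```

Because the header sits at the very start of the file, it can be read without downloading the tensor data.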

Member

@LysandreJik LysandreJik left a comment

Thanks @Wauplin ! API looks great and works great.

src/huggingface_hub/hf_api.py Outdated
)
_headers = self._build_hf_headers(token=token)

# 1. Fetch first 100kb
Member

Is this true? If I'm not mistaken @Narsil had told me 1MB but I didn't try it firsthand

Contributor Author

@Wauplin Wauplin Nov 24, 2023

Narsil mentioned 100kb to me as a good starting point.

Out of curiosity, I ran a quick empirical study. I parsed 2700 files from the top 1000 models tagged as safetensors-compatible on the Hub, sorted by downloads. Out of these 2700 files:

  • the maximum metadata header is 365kb
  • 3.2% have a metadata header >= 100kb
  • 4.1% have a metadata header >= 75kb
  • 7.5% have a metadata header >= 50kb
  • 18% have a metadata header >= 25kb

Given these numbers, 100kb looks like a good threshold: about 97% of the sampled files have a header smaller than that. We could even lower it to 75kb but it's not worth it.
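
To make the trade-off concrete, here is a rough sketch of the "fetch the first chunk, then fetch the rest only if needed" strategy. It is not the code from this PR: `read_header`, `make_blob` and `make_fetcher` are hypothetical names, the fetcher slices in-memory bytes instead of issuing HTTP range requests, and `FIRST_CHUNK = 100_000` is an assumed reading of "100kb".

```python
import json
import struct
from typing import Callable, Tuple

FIRST_CHUNK = 100_000  # assumed size of the first fetch, in bytes


def read_header(fetch_range: Callable[[int, int], bytes]) -> Tuple[dict, int]:
    """Return (header, number_of_requests).

    fetch_range(start, end) must return bytes[start:end] (end exclusive).
    """
    first = fetch_range(0, FIRST_CHUNK)
    requests = 1
    (header_len,) = struct.unpack("<Q", first[:8])
    end = 8 + header_len
    if end <= len(first):
        # Common case: the whole header fits in the first chunk.
        raw = first[8:end]
    else:
        # Rare case: fetch only the missing part of the header.
        raw = first[8:] + fetch_range(len(first), end)
        requests += 1
    return json.loads(raw.decode("utf-8")), requests


def make_blob(header: dict, data_size: int) -> bytes:
    """Build an in-memory safetensors-like blob (test helper, hypothetical)."""
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + b"\x00" * data_size


def make_fetcher(blob: bytes) -> Callable[[int, int], bytes]:
    """Simulate range requests by slicing bytes held in memory."""
    return lambda start, end: blob[start:end]


# Synthetic headers: one tiny, one with 5000 tensor entries (well above 100kb).
small = make_blob({"w": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}}, 4)
big_header = {
    f"t{i}": {"dtype": "F32", "shape": [1], "data_offsets": [i * 4, (i + 1) * 4]}
    for i in range(5000)
}
big = make_blob(big_header, 5000 * 4)

small_header, small_requests = read_header(make_fetcher(small))
large_header, large_requests = read_header(make_fetcher(big))
print(small_requests, large_requests)
```

With this scheme, a larger first chunk means fewer files need a second request but more bytes are downloaded for every file, which is the balance the numbers above are meant to inform.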

@Wauplin
Contributor Author

Wauplin commented Nov 24, 2023

Thanks for the reviews @julien-c and @LysandreJik!

I'll keep in mind to update the hub docs once this is released (#1855 (comment))

@Wauplin Wauplin merged commit 8f7e04e into main Nov 24, 2023
14 of 16 checks passed
@Wauplin Wauplin deleted the 1832-parse-safetensors-data branch November 24, 2023 15:06