Add defstruct #105

Merged
jcrist merged 2 commits into master from defstruct on May 10, 2022

Owner

@jcrist jcrist commented May 9, 2022

Adds msgspec.defstruct, a function for dynamically defining new struct
types. This is helpful for situations where types aren't known until
runtime, but you still want to provide type validation when encoding/decoding.

jcrist added 2 commits May 9, 2022 17:30
Adds a method for dynamically defining new struct types. This is helpful
for situations where types aren't known until runtime, but you still
want to provide type validation when encoding/decoding.
Recent mypy upgrade broke the CI setup.
@jcrist jcrist merged commit d85078c into master May 10, 2022
@jcrist jcrist deleted the defstruct branch May 10, 2022 03:01
Owner Author

jcrist commented May 10, 2022

One use of this is for dynamically defining a type used exclusively for extracting a few known fields from a larger structure. In code where the necessary fields are static, a classic struct definition would suffice. But for code where the fields aren't known until runtime, msgspec.defstruct becomes necessary.

For example, here's a small script that parses and queries the current repodata.json file for conda-forge. A struct type is defined at runtime to parse only the fields required for the query, avoiding allocating extra data that is never used.

from operator import attrgetter
import msgspec


def top10_packages(sort_field):
    # Dynamically define a new type with only the required fields
    Package = msgspec.defstruct("Package", ["name", sort_field])
    RepoData = msgspec.defstruct("RepoData", [("packages", dict[str, Package])])

    # Load and parse the data into this new type
    with open("current_repodata.json", "rb") as f:
        repo_data = msgspec.json.decode(f.read(), type=RepoData)

    # Sort by the designated field
    packages = list(repo_data.packages.values())
    getter = attrgetter(sort_field)
    packages.sort(key=getter, reverse=True)

    # Return the results
    return [(p.name, getter(p)) for p in packages[:10]]


for name, size in top10_packages("size"):
    print(f"- {name}: {size / (2 ** 20):.2f} MiB")

Results:

$ python example.py
- spacy-model-en_core_web_lg: 630.93 MiB
- spacy-model-en_vectors_web_lg: 584.45 MiB
- geant4-data-ndl: 572.80 MiB
- proj-data: 565.82 MiB
- spacy-model-en_core_web_trf: 442.50 MiB
- nltk_data: 428.19 MiB
- geant4-data-emlow: 295.86 MiB
- scitime: 287.50 MiB
- pyspark: 267.96 MiB
- cartopy_offlinedata: 216.34 MiB
