Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-reproducible GUID in compressed XML files #110

Closed
rclobus opened this issue Dec 18, 2021 · 6 comments · Fixed by #112
Closed

Non-reproducible GUID in compressed XML files #110

rclobus opened this issue Dec 18, 2021 · 6 comments · Fixed by #112

Comments

@rclobus
Copy link

rclobus commented Dec 18, 2021

Hello Richard Hughes,

While working on the “reproducible builds” effort 1, I have noticed that the content of the GUID field of the converted, compressed XML files cannot be recreated in any way.

The function xb_builder_import_node (which is called by as_cache_components_to_internal_xb in appstream) uses the address of a pointer for the creation of the GUID field 2.
See the currents tests 3 for Debian, and the investigations I previously did 4.

Steps to reproduce:

  • Run appstreamcli refresh-cache
  • Notice that the content of /var/cache/app-info/cache/C-os-catalog.xb changes only in the GUID field (using xb-tool)
  • If the file C-local-metainfo.xb is deleted before running appstreamcli refresh-cache, it will also differ only in the GUID field.

The tool xb-tool does not call xb_builder_import_node, so I cannot give an example that only uses libxmlb.

My goal:

  • Create a live image that can be reproduced bit-by-bit.

Proposal:

  • For a drop-in-replacement of the GUID field (128 bits) I see several easy alternatives:
    1. Zero-out the field and not use it (which is probably not desired)
    1. Calculate the MD5 checksum (which is also 128 bits) of the compressed bitstream before writing to disk
    1. Generate a truly random UUID, with uuid_generate_random from libuuid (which I can de-randomize using my LD_PRELOAD-hack)
  • Alternatively (but that requires more bits), something similar to GNU_BUILD_ID can be used
  • If you are interested, I can provide you with a patch.

With kind regards,
Roland Clobus

@ximion
Copy link
Collaborator

ximion commented Dec 18, 2021

If libxmlb allowed me to reset the guid, appstream could also compute its own guid (possibly based on the files that went into the xmlb creation and their ctimes). At the moment, AS only adds its version number to the guid.
Alternatively, option 3 (completely random ID) makes sense to me to add if libxmlb xb_builder_import_node was called, rather than taking a pointer address.

@hughsie
Copy link
Owner

hughsie commented Dec 18, 2021

I'm pretty sure libxmlb always uses the file mtime when building the silo, but if libappstream is using xb_builder_import_node() then there's not much we can do with that API. Maybe adding the pointer is always harmful? I think we should do something to indicate the builder has been changed. Perhaps we want a xb_builder_import_node_full that specifies the GUID string to use when adding nodes manually? Maybe we stop doing the pointer thing and rely on libappstream using xb_builder_append_guid correctly?

@hughsie
Copy link
Owner

hughsie commented Dec 18, 2021

@rclobus -- if you did a PR to remove the %p action for _add_node I'd merge it.

@ximion
Copy link
Collaborator

ximion commented Dec 18, 2021

I think we should do something to indicate the builder has been changed.

You could mix in a random GUID once, once a n XbBuilderNode has been added manually. The data AppStream adds doesn't really have a canonical origin XML file anymore, as it has been preprocessed (and some cruft has been removed, icons resolved, XML normalized to the latest specification version, etc.).

Maybe we stop doing the pointer thing and rely on libappstream using xb_builder_append_guid correctly?

That would be fine with me, I can vouch for AppStream doing the right thing - but it may make the API less foolproof.
IMHO the pointer thing is a pretty neat hack, but having explicit control over this would be much nicer... Also for performance, just setting the GUID once all the nodes have been added may be useful.

rclobus added a commit to rclobus/libxmlb that referenced this issue Dec 19, 2021
Instead of using the address of a pointer, use a deterministic hash
value based on the content of the node.

Fixes hughsie#110
@rclobus
Copy link
Author

rclobus commented Dec 19, 2021

I've prepared a pull request. By using xb_builder_node_export it will not be the fastest code, but it will guarantee unique and reproducible hash values based on the (full) content of the node.

hughsie added a commit that referenced this issue Dec 20, 2021
Do not use the address of a pointer as this breaks reproducible builds.

Fixes #110

Based on a patch by Roland Clobus <rclobus@rclobus.nl>, many thanks.
@hughsie
Copy link
Owner

hughsie commented Dec 20, 2021

Maybe we let the caller make that decision? That seems simplest to me: #112

hughsie added a commit that referenced this issue Dec 20, 2021
Do not use the address of a pointer as this breaks reproducible builds.

Fixes #110

Based on a patch by Roland Clobus <rclobus@rclobus.nl>, many thanks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants