Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort list of files #1804

Closed
wants to merge 1 commit into from

Conversation

@bmwiedemann
Copy link

commented Jul 15, 2019

Sort list of source files
so that fast_data_types.so builds in a reproducible way
in spite of indeterministic filesystem readdir order
and random set iteration order.

See https://reproducible-builds.org/ for why this is good.

Note: I tested that it builds (on a clean VM) with 0.14.2, but it might need more testing than that.

This PR was done while working on reproducible builds for openSUSE.

Sort list of files
so that fast_data_types.so builds in a reproducible way
in spite of indeterministic filesystem readdir order
and random set iteration order.

See https://reproducible-builds.org/ for why this is good.

@bmwiedemann bmwiedemann force-pushed the bmwiedemann:sort branch from 40d3581 to 41ebc19 Jul 15, 2019

@bmwiedemann

This comment has been minimized.

Copy link
Author

commented Jul 16, 2019

Unfortunately commit 97c2b7c did not fix (all) the ordering issues.
Even on a perfectly deterministic filesystem there are ordering issues, so there are components other than the filesystem - e.g. random set/dict order.

@kovidgoyal

This comment has been minimized.

Copy link
Owner

commented Jul 16, 2019

set/dict is not random as of python3.7, they are in insertion order. All
the inputs to compile_c_extension are deterministic, as far as I can
tell. compile_c_extension is called from three places

  1. compile_kittens() where my commit made it deterministic
  2. build() where find_files() is already deterministic, because the
    files are sorted on mtime
  3. compile_glfw() where sources comes from a JSON file, so are
    deterministic.

The only possible source of randomness is (2) if all the files have
identical mtimes, which can be easily remedied by subsorting on file
name.

@bmwiedemann

This comment has been minimized.

Copy link
Author

commented Jul 16, 2019

in https://build.opensuse.org/package/view_file/openSUSE:Factory/kitty/kitty.spec Line 66
we do sed on all files before the build (in random order). So mtimes are rather nondeterministic.

Apart from that, applying patches also screws up mtimes, so for reproducibility it would be good to replace the mtime-sort with a filename-sort in find_c_files

kovidgoyal added a commit that referenced this pull request Jul 16, 2019

Dont sort on mtime
Files are built in order by size anyway and mtime sort makes link order
mtime dependent which breaks reproducible builds on openSUSE as they
modify mtimes randomnly. See #1804
@kovidgoyal

This comment has been minimized.

Copy link
Owner

commented Jul 16, 2019

Sure, done

@Luflosi

This comment has been minimized.

Copy link
Contributor

commented Jul 16, 2019

Does kitty build in a reproducible way now? According to https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/kitty.html, the latest version available in Debian does not because of a bug sphinx used to have.

@bmwiedemann

This comment has been minimized.

Copy link
Author

commented Jul 17, 2019

At least the filesystem-related influences are completely gone now in our build with master.

There is still some parallelism-related issue with .pyc file creation, but those have a poor track record already and it should probably be solved with an upstream python patch. https://bugzilla.opensuse.org/show_bug.cgi?id=1049186 already added one ordering patch in openSUSE, but that was not enough here.
I think one underlying problem is that .pyc files contain too much of the internal python state and that was not designed to be reproducible, yet.

Setting num_workers = 1 in compile_python gets rid of the parallelism-related issue, but that might be a trade-off you dont want to make.

@bmwiedemann bmwiedemann deleted the bmwiedemann:sort branch Jul 17, 2019

@kovidgoyal

This comment has been minimized.

Copy link
Owner

commented Jul 17, 2019

Odd, pyc files are built via py_compile() in compile_python()
num_workers in parallel_run should not affect that.

@bmwiedemann

This comment has been minimized.

Copy link
Author

commented Jul 17, 2019

Yes, it is a python thing. something like reference counters depend on the order of processing. While we sorted filesystem order of py_compile, the parallelism here introduces random processing order again.

The other annoying aspect (does not matter here) is that .pyc files differ between architectures, so even python packages that do not include .so files are somewhat architecture-dependent.

@kovidgoyal

This comment has been minimized.

Copy link
Owner

commented Jul 17, 2019

Hmm this should really be fixed in python.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.