gen_binary_files adds non-determinism to pinyin_index.bin #162

Open
bmwiedemann opened this issue Oct 10, 2023 · 5 comments

@bmwiedemann

While working on reproducible builds for openSUSE, I found that our libpinyin package contains
pinyin_index.bin and addon_pinyin_index.bin files that vary from build to build unless ASLR is disabled.

Steps to reproduce:

cd ~/rpmbuild/BUILD/libpinyin-2.8.1/data && for i in 1 2 ; do
   rm -f addon_pinyin_index.bin && ../utils/storage/gen_binary_files --table-dir ../data && md5sum addon_pinyin_index.bin ;
done

Actual result:
The two runs produce different checksums; the output contains roughly 16 bits of randomness.

Expected result:
The output should be fully deterministic.

Here are two sets of samples:
https://rb.zq1.de/temp/pinyin_index.tar.xz

@epico
Member

epico commented Oct 13, 2023

Do you know of any good tools for comparing binary files?

Thanks!

@bmwiedemann
Author

bmwiedemann commented Oct 13, 2023

I use filterdiff as
filterdiff "hexdump -C" $file1 $file2
or
filterdiff "od -tx1 -Ax" $file1 $file2
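
If no diff tool is at hand, the same comparison can be sketched in a few lines of C++. This is only an illustration, not part of libpinyin: the function name `differing_offsets` is made up here, and it assumes both files have already been read into byte vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from libpinyin): list every byte offset at which
// two buffers differ. Offsets that exist in only one buffer count as differences.
std::vector<std::size_t> differing_offsets(const std::vector<unsigned char>& a,
                                           const std::vector<unsigned char>& b) {
    std::vector<std::size_t> offsets;
    const std::size_t common = std::min(a.size(), b.size());
    const std::size_t longest = std::max(a.size(), b.size());

    // Compare the part both buffers share.
    for (std::size_t i = 0; i < common; ++i) {
        if (a[i] != b[i]) {
            offsets.push_back(i);
        }
    }
    // The tail of the longer buffer has no counterpart in the shorter one.
    for (std::size_t i = common; i < longest; ++i) {
        offsets.push_back(i);
    }

    // Sanity check: there can never be more differences than bytes.
    assert(offsets.size() <= longest);
    return offsets;
}
```

Checking whether the reported offsets cluster at a fixed stride would be one way to tell whether the varying bytes sit at the same position in each record.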

@epico
Member

epico commented Oct 19, 2023

I think libpinyin uses Kyoto Cabinet now.

I guess the difference between pinyin_index.bin and phrase_index.bin is:

  • pinyin_index.bin is aligned to 2 bytes
  • phrase_index.bin is aligned to 4 bytes

BTW, does Kyoto Cabinet support reproducible builds?

@bmwiedemann
Author

That 2-byte alignment could introduce 2 bytes of padding that remain uninitialized.
Do you know where the structs that get written into pinyin_index.bin are allocated?
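
To illustrate the suspicion, here is a minimal sketch of how padding can leak into a file and how to avoid it. The `IndexItem` struct is hypothetical and is not libpinyin's actual record layout; the 2-byte gap after `key` is what common ABIs produce, but it is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical record, NOT the real libpinyin layout.
struct IndexItem {
    std::uint16_t key;    // 2 bytes
    // On common ABIs the compiler inserts 2 padding bytes here
    // so that `token` starts on a 4-byte boundary.
    std::uint32_t token;  // 4 bytes
};

// Build an item whose storage was pre-filled with `junk`, to imitate
// leftover stack or heap contents ending up in the padding bytes.
IndexItem make_item_with_junk(unsigned char junk, std::uint16_t key,
                              std::uint32_t token) {
    IndexItem item;
    std::memset(&item, junk, sizeof item);
    item.key = key;
    item.token = token;
    return item;
}

// Risky: copies the whole object, padding included. Whatever happened
// to be in the padding bytes ends up in the output.
std::vector<unsigned char> serialize_raw(const IndexItem& item) {
    std::vector<unsigned char> out(sizeof(IndexItem));
    std::memcpy(out.data(), &item, sizeof(IndexItem));
    return out;
}

// Deterministic: start from a zeroed buffer and copy only the named
// fields to their offsets, so padding bytes are always zero.
std::vector<unsigned char> serialize_deterministic(const IndexItem& item) {
    std::vector<unsigned char> out(sizeof(IndexItem), 0);
    std::memcpy(out.data() + offsetof(IndexItem, key), &item.key,
                sizeof item.key);
    std::memcpy(out.data() + offsetof(IndexItem, token), &item.token,
                sizeof item.token);
    return out;
}

// Two items with equal fields but different junk must serialize identically.
void check_deterministic() {
    const IndexItem a = make_item_with_junk(0xAA, 7, 42);
    const IndexItem b = make_item_with_junk(0x55, 7, 42);
    assert(serialize_deterministic(a) == serialize_deterministic(b));
}
```

Whether `serialize_raw` actually yields different bytes for `a` and `b` depends on the compiler and ABI, which is consistent with output that only varies when ASLR is enabled. Zeroing the struct right after allocation (for example with `memset`) would be another way to get the same effect, if the real code writes whole structs.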

@epico
Member

epico commented Dec 20, 2023

Maybe the code is in the ChewingLargeTable2::add_index_internal method of the chewing_large_table2_kyotodb.cpp file.
