Add Binary Value of each Mnemonic and its Wordlist Index Value #132

Closed
hatgit opened this Issue Nov 19, 2017 · 12 comments

Contributor

hatgit commented Nov 19, 2017

A proposal to add two functions to further improve the usefulness of the standalone HTML tool https://github.com/iancoleman/bip39/blob/master/bip39-standalone.html

(not sure if this has already been considered, sharing just in case)

Point 1) Return word list index values for each word in mnemonic sentence (ms)

If the option could be added to return the number from the wordlist index - for each mnemonic word returned, it could be useful for users across other wordlist-languages and/or in the event that the wordlist order was changed at the application source.

For example, using this 24 ms:

lunch anger issue giggle scout cloth once marriage busy save notice farm syrup rally garment tennis price rather unusual brother whisper issue orphan toe

the associated index number for each word would follow (commas are just for ease of reading here and could be spaces):

1003,30,951,790,1745,412,1229,1147,252,1564,1192,710,1754,1476,800,1841,1290,1458,1918,241,2025,951,1238,1770
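The lookup in Point 1 can be sketched as follows, assuming the wordlist is available as a list in its canonical order. The function name `mnemonic_to_indices` and the three-word toy list are hypothetical and only for illustration; a real run would load the full 2048-word list.

```python
def mnemonic_to_indices(mnemonic, wordlist):
    """Return the zero-based wordlist index of each word in the mnemonic."""
    position = {word: i for i, word in enumerate(wordlist)}
    return [position[word] for word in mnemonic.split()]


# Toy three-word list for illustration only; the real list has 2048 entries.
toy_wordlist = ["abandon", "ability", "able"]
print(mnemonic_to_indices("able abandon ability", toy_wordlist))  # [2, 0, 1]
```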

Point 2) Return the binary string that was encoded to each word and/or the bip39 seed in binary format.

This could be added as an optional checkbox that, when ticked, also displays the 11-bit binary number for each word, including the hashed checksum (or shows it in the existing section where users can supply their own entropy). In the current version of the HTML tool, users can check the box to supply their own source of entropy, but a user who wants to see the entropy generated by the tool itself has no way to do so. Exposing it could be useful for backing up, verifying, or cross-referencing the entropy in other applications.

For example, if the tool generated the following 24-word MS:

wink fashion differ love acid stool spy rich copy horn goose curious input act athlete rare quiz school crucial amateur trend valve basic army

The user could opt to see the associated 264-bit string including the concatenated checksum:

10101010101001010000100000111110101111011111011110110010101010010111111110111111011110000100010001000101010001001000101101001010101010100101010000001000100100110011001100111100001111001000101111000010100101011101010100101010101101010101010010101010000111100c1b002d


The index numbers for each word proposed in point 1 above, and the full entropy string proposed in point 2, could complement existing backup recovery sentences in a number of ways, for users who find such data advantageous despite the obvious trade-off between security risk and convenience.

Owner

iancoleman commented Nov 20, 2017

Thanks for these interesting feature suggestions. Overall I disagree with adding these features but remain open to being convinced otherwise.

> it could be useful for users across other wordlist-languages and/or in the event that the wordlist order was changed at the application source

Currently entering a wordlist and changing languages in the tool does use a direct index lookup. In the tests (see this test) the word 'abandon' (first in English list) is converted to 'abaco' (first in Italian list) when Italian is selected. So the tool does act as a direct conversion tool between languages in this way.
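The index-based conversion described above can be sketched like this. It is not the tool's actual code; `convert_language` is a hypothetical name, and the one-word lists contain only the first entry of each language as mentioned in the comment.

```python
def convert_language(mnemonic, source_wordlist, target_wordlist):
    """Swap each word for the word at the same index in the target list."""
    return " ".join(
        target_wordlist[source_wordlist.index(word)] for word in mnemonic.split()
    )


# Only the first entry of each list is shown; real lists have 2048 words.
english = ["abandon"]
italian = ["abaco"]
print(convert_language("abandon abandon", english, italian))  # abaco abaco
```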

I don't think adding the list of indexes in the interface is useful because the mnemonic is what is recorded, not the indexes. That's the point of mnemonics! So if a user were to record their indexes for safekeeping bip39 has basically failed at the intended purpose.

> Return the binary string that was encoded to each word and/or the bip39 seed in binary format.

Is this the same as the 'BIP39 Seed' field just below the optional passphrase field?

Representing the mnemonic phrase in any form other than the words seems overall detrimental since it encourages poor user behaviour and does not facilitate any additional use (at least not that I can see, feel free to point out any additional uses).

> complement existing backup recovery sentences in a number of ways

Please outline the ways it complements existing backups because currently I'm not convinced this is in the interest of users of this tool. Mnemonics are the secret and are the backup, not other numbers or encodings despite many such forms being possible.

Owner

iancoleman commented Nov 21, 2017

I think perhaps point 1 "Return word list index values for each word in mnemonic sentence" could be easily achieved by adding another 'language' to the list called 'numbers' which shows the index numbers instead of the word. This is a pretty elegant way to achieve that goal without too many negative side effects.

Contributor

hatgit commented Nov 21, 2017

Thanks for the reply and interesting feedback! I much appreciate the dialogue and the willingness to explore how adding a few functions to the existing tool could broaden its range of uses, depending on the evolving needs of its users.

> Currently entering a wordlist and changing languages in the tool does use a direct index lookup. In the tests (see this test) the word 'abandon' (first in English list) is converted to 'abaco' (first in Italian list) when Italian is selected. So the tool does act as a direct conversion tool between languages in this way.

Regarding Point 1 (Index values as numbers)

Noted regarding this point, thanks for the education regarding the language differences. The only question I have left that could be worth exploring is whether foreign-language apps could detect other languages more easily by letting users input the index number instead of the matching foreign word (i.e. if an Italian crypto coin wallet allowed importing English words - such as “abandon” in your example, would it be easier for them to request the word or the matching index number “0”?). It certainly wouldn’t be easier for users unless they also had the index value from the wordlist, and since the app/tool does use the index lookup feature already in the backend maybe bringing this to the front-end for users could make sense at some point in the future, not sure. (P.S. I just saw your latest comment addressing how this could be added in a subtle manner; that sounds like a potentially viable option worth considering!)

> I don't think adding the list of indexes in the interface is useful because the mnemonic is what is recorded, not the indexes. That's the point of mnemonics! So if a user were to record their indexes for safekeeping bip39 has basically failed at the intended purpose.

I totally agree that the point of mnemonics is indeed the convenience they provide, with phrases that are easy to read/write/relay versus having to do that with the underlying initial seed (if the tool didn’t exist). I’d like to highlight that the context of this suggestion was to complement that same purpose served by BIP39, and it was not meant by any means as a replacement. Even though individual users can decide how to use their data, I think it could empower them if other formats of the same data are available (not just for creative purposes such as obscuring their mnemonic, but also for verification or redundancy, perhaps).

Regarding Point 2 (showing the filtered entropy for a mnemonic generated by the BIP39 tool)

> Return the binary string that was encoded to each word and/or the bip39 seed in binary format.

> Is this the same as the 'BIP39 Seed' field just below the optional passphrase field?

I am not sure. I'd say no unless the BIP39 seed can also be used to derive the raw binary equivalent (and/or its subsequently filtered entropy)? Below is a good argument for why showing the filtered entropy for a Mnemonic Sentence (MS) generated by the BIP39 tool could be useful:

For example, using the same entropy and MS from point 2 above, given the case where “Supply my own source of entropy” is selected in the current version of the BIP39 tool, and the user enters into the “Entropy” field the following 264-bit string (256-bit entropy plus 8-bit hashed checksum):

10101010101001010000100000111110101111011111011110110010101010010111111110111111011110000100010001000101010001001000101101001010101010100101010000001000100100110011001100111100001111001000101111000010100101011101010100101010101101010101010010101010000111100c1b002d

The above string is duplicated correctly in the “filtered entropy” section as well as “entropy section” and the resulting BIP39 mnemonic is shown correctly:

wink fashion differ love acid stool spy rich copy horn goose curious input act athlete rare quiz school crucial amateur trend valve basic army

> Representing the mnemonic phrase in any form other than the words seems overall detrimental since it encourages poor user behaviour and does not facilitate any additional use (at least not that I can see, feel free to point out any additional uses).

I would counter by pointing out that the tool already displays other forms of the mnemonic (although in non-mnemonic form), such as the "filtered entropy" shown when a user supplies their own source of entropy, as these are the mathematical counterpart of the resulting mnemonic (provided they are used correctly in terms of calculating the checksum, as the tool warns). Likewise, for a mnemonic generated by the BIP39 tool, showing the matching filtered entropy would just be the inverse of that function and could serve more advanced users. I agree that putting it front and center alongside the generated MS words wouldn’t be ideal, which is why a proposed solution for where it could best fit follows below.

> complement existing backup recovery sentences in a number of ways
>
> Please outline the ways it complements existing backups because currently I'm not convinced this is in the interest of users of this tool. Mnemonics are the secret and are the backup, not other numbers or encodings despite many such forms being possible.

Considering that the tool currently shows the “filtered entropy” value for a given generated mnemonic in cases when a user supplies their own source of entropy, I don’t think it would be unreasonable to also show the values for the filtered entropy in cases when the entropy/mnemonic was generated by the BIP39 tool as well (and not just for those importing their own entropy).

Basically, you’d just be enabling the inverse of a function already enabled in the tool and thus making the feature more versatile for those who want it - in cases when they are not supplying their own source of entropy. (i.e. when relying on the pseudo-random number generator used by the BIP39 standalone tool to generate the initial entropy and the resulting filtered entropy that includes the last word as the checksum.)

Point 2 Example

For example, if a user generated the following mnemonic as noted in point 1 using the tool:

lunch anger issue giggle scout cloth once marriage busy save notice farm syrup rally garment tennis price rather unusual brother whisper issue orphan toe

There is no way - at least that I am aware of - for them to obtain the filtered entropy from the tool even though you offer this option to users who supply their own entropy in the advanced feature section.

Perhaps it could be conveniently housed there in the advanced section?

I think this subtle change - if warranted - could be of value for those who already rely on the PRNG used in the tool to generate the MS, in addition to those who supply their own entropy source and can already access it.

Owner

iancoleman commented Nov 22, 2017

> would it be easier for them to request the word or the matching index number “0”?

The word. Because a) users will / should only have the word not the index, and b) the specific word does matter.

The addresses change if the language is changed because the derived seed depends on the words.

Quoting BIP39 - From mnemonic to seed

> To create a binary seed from the mnemonic, we use the PBKDF2 function with a mnemonic sentence

So this also means the prior idea I had to show indexes/numbers as a 'language' instead of words won't work.
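To illustrate why the words themselves matter, here is a minimal sketch of the BIP39 seed derivation (PBKDF2 with HMAC-SHA512, 2048 iterations, salt "mnemonic" plus the passphrase). The function name `mnemonic_to_seed` is an assumption of this sketch, and the single-word inputs are not valid mnemonics; they only show that the same index in two languages yields different seeds.

```python
import hashlib
import unicodedata


def mnemonic_to_seed(mnemonic, passphrase=""):
    """Derive the 64-byte BIP39 seed from the mnemonic words themselves."""
    password = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)


# Index 0 in English and in Italian: same index, different words, different seeds.
english_seed = mnemonic_to_seed("abandon")
italian_seed = mnemonic_to_seed("abaco")
print(english_seed == italian_seed)  # False
```

Because the password fed to PBKDF2 is the text of the sentence, a list of index numbers would derive yet another seed, which matches the conclusion above.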

> since the app/tool does use the index lookup feature already in the backend

This tool (as far as I know) is the only tool that does this. More to the point, no other tool / wallet does conversion between languages so this is not even a consideration for them. This doesn't negate the point to show indexes, but it's something to consider.

> individual users can decide how to use their data

Great point, I completely agree with you on this.

> I'd say no [the BIP39 Seed is not what is being requested] unless the BIP39 seed can also be used to derive the raw binary equivalent

Right, the seed can't be reversed back into the mnemonic, so you're looking for some new information to be displayed. I understand now.

> the tool already displays other forms of the mnemonic

True. The tool is basically just a lot of different encodings and conversions of a single root piece of entropy.

> There is no way - at least that I am aware of - for them to obtain the filtered entropy from the tool even though you offer this option to users who supply their own entropy in the advanced feature section.

You've provided a fairly nice suggestion - populate the entropy field with the value from the PRNG if 'generate' is clicked. Even though the user doesn't see it, they can choose to see it if they want to by revealing the hidden entropy section. Good idea!

I think adding the word indexes as an extra field in the entropy section will also work nicely.

Thanks for working through this and providing useful examples and justifications. It's people like yourself that make this tool useful.


Proposed changes to implement for this issue:

  • Change text 'Supply my own source of entropy' to 'Show entropy details'
  • Populate the Entropy field when 'generate' is clicked (probably hex-encoded)
  • Add display of word indexes to the entropy section (probably below the 'Raw Binary' option)
  • Move the Mnemonic Length select field to be below the Entropy input, so all user inputs for entropy are grouped together
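As a rough sketch of the second item, generated entropy could be hex-encoded for display like this. Python's `secrets` module stands in for whatever generator the tool uses, and `generate_entropy_hex` is a hypothetical name.

```python
import secrets


def generate_entropy_hex(bits=256):
    """Draw entropy from the OS CSPRNG and return it hex-encoded."""
    if bits % 32 != 0:
        raise ValueError("entropy length must be a multiple of 32 bits")
    return secrets.token_bytes(bits // 8).hex()


entropy_hex = generate_entropy_hex()
print(len(entropy_hex))  # 64 hex characters for 256 bits
```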

iancoleman added a commit that referenced this issue Nov 22, 2017

Contributor

hatgit commented Nov 25, 2017

Thanks for the positive feedback, and glad to see the implementation moving along! I have three questions that came up during some manual tests. I was trying to map the entropy+checksum pasted above in the point 2 example to the mnemonic and was having difficulty finding the direct link between the entropy:

image

and the 24-word mnemonic sentence, in terms of how the 11-bit groups (264 bits / 24 words = 11 bits per word; see the snapshot above) are used to encode the index value of each word as per BIP39, for the mnemonic used in our example:

wink fashion differ love acid stool spy rich copy horn goose curious input act athlete rare quiz school crucial amateur trend valve basic army

Question 1: Not sure if this is entirely the bitarray.js function that encodes the 11-bit strings as 32-bit words or if it is affected by other data. For example, do you have any handy method, such as pointing out which binary string from the 264-bit array above in point 2 corresponds to the word "wink" from the first word in the mnemonic sentence?

screen shot 2017-11-24 at 7 37 09 pm

Raw Entropy showing as hex value if entropy contains any hex values (should it show as "mixed" input instead?)

I also noticed that in the BIP39 tool the software was interpreting the inputted entropy as hex values (under "entropy type") and therefore stretching the "raw binary" shown out by four times with extra padding showing each bit as a 4-bit string (i.e. 1 as 0001). (Should this instead be mixed entropy and only the hex values should be converted to binary instead of treating it all as hex? perhaps not relevant but thought to ask in case)

Question 2: More importantly, should the first 8 bits from the hashed value from the SHA2-hashed checksum "0c1b002d" not be in hex format? Because changing it to binary "00001100" (the first 8 bits of 00001100000110110000000000101101, converted from hex) makes the 264-bit string hex-free:

101010101010010100001000001111101011110111110111101100101010100101111111101111110111100001000100010001010100010010001011010010101010101001010100000010001001001100110011001111000011110010001011110000101001010111010101001010101011010101010100101010100001111000001100

but results in a different mnemonic:

custom close toddler rival item wisdom can opinion there music rail priority amazing blind usual giant joy smooth oxygen undo fatigue pond immense add

Question 3: While both could be "correct" mnemonics in a different context, which do you think is correct here for the purpose of the point 2 example and having the proper checksum value from the original 256-bit starting point for each one? I know this is handled by the code, but I thought that mapping it manually could also help users (including myself) better understand it. I appreciate any light you can shed on this!

Owner

iancoleman commented Nov 28, 2017

I'm just going to gather some facts together before diving into this...

The entropy in question is

10101010101001010000100000111110101111011111011110110010101010010111111110111111011110000100010001000101010001001000101101001010101010100101010000001000100100110011001100111100001111001000101111000010100101011101010100101010101101010101010010101010000111100c1b002d

Where does this entropy initially come from? Is it an example from a webpage? I'm just curious because I never could make this example entropy match the example mnemonic. Can you please outline the steps to go from the supplied entropy to the mnemonic (using any tool, not necessarily this one)?

As you stated, this is a mix of binary and hex which requires some disclaimer from me:

The tool does not work with mixed entropy because of the reason given at the top comment of entropy.js: "Automatically uses lowest entropy to avoid issues such as interpretting 0101 as hexadecimal which would be 16 bits when really it's only 4 bits of binary entropy."

I don't consider using mixed entropy as a reasonable interface since it will always involve some 'magical' degree of interpretation.
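The ambiguity is easy to see with a small sketch. `bits_of_entropy` is a hypothetical helper, not a function from entropy.js; it only counts how many bits a string would represent under a given interpretation.

```python
def bits_of_entropy(text, base):
    """Bits of entropy if every character of `text` is a digit in `base`."""
    bits_per_char = {2: 1, 16: 4}[base]
    int(text, base)  # raises ValueError if a character is not valid in this base
    return len(text) * bits_per_char


print(bits_of_entropy("0101", 2))       # 4
print(bits_of_entropy("0101", 16))      # 16
print(bits_of_entropy("0c1b002d", 16))  # 32
```

The same four characters count as 4 or 16 bits depending on the chosen base, and a string that mixes both forms has no single consistent reading.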

To directly address the questions:

> do you have any handy method, such as pointing out which binary string from the 264-bit array above in point 2 corresponds to the word "wink"

In the jsbip39 library being used by this tool, the toMnemonic method illustrates what's happening.

The words come directly from the entropy. So the first word is simply the first 11 bits of the entropy. Those 11 bits are converted to a number between 0 and 2047, and used to look up the word at that index. It's a direct map from entropy to words. The only exception is the checksum bits which are appended to the raw entropy, so it affects the final word in the mnemonic.

The first 11 bits of the example entropy are "10101010101" which is index 1365 which is line 1366 and is word 'primary' - so I don't know why the first word of the example mnemonic is 'wink'.
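The mapping from bits to indexes described above can be checked with a short sketch. `bits_to_indices` is a hypothetical helper, not the jsbip39 code; the input is the first 22 bits of the example entropy.

```python
def bits_to_indices(bits):
    """Split a bit string into 11-bit groups and return each as an integer."""
    if len(bits) % 11 != 0:
        raise ValueError("bit string length must be a multiple of 11")
    return [int(bits[i:i + 11], 2) for i in range(0, len(bits), 11)]


# First 22 bits of the example entropy from this thread.
print(bits_to_indices("1010101010100101000010"))  # [1365, 322]
```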

> should the first 8 bits from the hashed value from the SHA2-hashed checksum "0c1b002d" not be in hex format?

The checksum is added automatically to the entropy by the jsbip39 library. See jsbip39.js#L88.

Entropy entered into this tool is pure entropy, not bip39-checksummed-entropy. Do not enter the checksum. If you want to do that manually for some other reason, use the same encoding for the checksum as the entropy (hex or binary or whatever).

It's also worth pointing out that each hex character represents 4 bits, so the checksum "0c1b002d" is 32 bits, not 8.

Another point is that the checksum is not always 8 bits. The amount of checksum to add to the original entropy depends on how much entropy there is in the first place. I recommend reading the section of BIP39 titled Generating the mnemonic for more info.
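For reference, the checksum rule from that section of BIP39 (take the first ENT/32 bits of the SHA-256 hash of the entropy bytes) can be sketched as below. The function names are assumptions of this sketch and it is not the jsbip39 implementation.

```python
import hashlib


def checksum_bits(entropy):
    """Return the BIP39 checksum of `entropy` (bytes) as a bit string."""
    ent = len(entropy) * 8
    if ent % 32 != 0:
        raise ValueError("entropy length must be a multiple of 32 bits")
    cs = ent // 32  # checksum length in bits
    digest = hashlib.sha256(entropy).digest()
    digest_bits = "".join(f"{byte:08b}" for byte in digest)
    return digest_bits[:cs]


def word_count(ent):
    """Number of mnemonic words for `ent` bits of entropy."""
    return (ent + ent // 32) // 11


# Columns: entropy bits, checksum bits, mnemonic words.
for ent in (128, 160, 192, 224, 256):
    print(ent, ent // 32, word_count(ent))
```

So 128 bits of entropy get a 4-bit checksum and 12 words, while 256 bits get an 8-bit checksum and 24 words.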

> which do you think is correct here for the purpose of the point 2 example

Neither. Please explain the original mapping used between entropy and mnemonic, because I could never get it to match. Until it's clearer about how to investigate the original it's hard to look much further into it.

Contributor

hatgit commented Nov 30, 2017

Okay, great feedback here and thanks again for your time and patience. Bottom-line, it looks like when I was including the checksum into the raw entropy field - that was causing the incorrect mnemonic starting with “wink” which couldn’t be further duplicated since the tool computes the checksum on its own as you stated.

Resolved:

I was eventually able to correctly duplicate the mnemonic that you mentioned should start with “primary” from that same initial entropy in two different ways (one correct, and one incorrect worth exploring perhaps), first using your suggestion of not including the checksum:

(1st way) Initial 256-bit ENT

1010101010100101000010000011111010111101111101111011001010101001011111111011111101111000010001000100010101000100100010110100101010101010010101000000100010010011001100110011110000111100100010111100001010010101110101010010101010110101010101001010101000011110

Resulting 24-word MS:

primary choose autumn know kite feed year upper dust clay carpet next pipe affair error guide develop fun pistol prevent prize prevent position silly

image

I believe this is the correct mnemonic and how the tool should be used. The final word “silly” carries the checksum: it maps to the binary number 11001000101 (decimal 1605 in the wordlist), whose right-most 8 bits, 01000101, are the checksum. If so, I believe revealing the checksum value could also be useful (please see further below), because it would otherwise be unclear on the front-end how that happens.

Proposed enhancements:

Therefore, now that this has been resolved, if you agree with the above (?), I would suggest the following tiny additions be made that could help users:

  • Warn not to include any checksum into the raw entropy field (and maybe state that this will automatically be computed by the standalone tool).
  • Reveal the resulting checksum in binary format so that a user could retain that along with the initial ENT that is being used, if needed, and so they can reverse calculate it if needed to confirm the tool mapped it correctly.

Even though the checksum values will still be the values we were referencing earlier (which simply didn't need to be pasted into the raw entropy field) it would help bring full circle the functionality of this enhancement by revealing those values either as:

8-digit hex CS of the above ENT:

0c1b002d

or Converted to 32-bits:

00001100000110110000000000101101

Or shown as some other value related to the last word?

Overall, I think both of these pointers (especially revealing the checksum value) would help achieve fully what we set out to do with the last commits you made from this thread so far and for manual verification of the resulting outputs.

Additional feedback from testing potentially worth exploring if relevant below:

(2nd way) Initial ENT +32bit binary checksum (288-bits)

Although this is incorrect since the checksum shouldn’t be entered, it turns out that if we use the previously mentioned 8-digit hex checksum converted to 32-bit binary as the checksum, the resulting 288-bit string maps to the same first 23 words of the mnemonic, but with what appears to be an error/bug: 27 words are returned instead of 24, and the 24th word is “screen” instead of the correct checksum word “silly”.

primary choose autumn know kite feed year upper dust clay carpet next pipe affair error guide develop fun pistol prevent prize prevent position screen brand accident legal
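The count of 27 words is what the checksum rule gives when it is applied to a 288-bit input. This is only a worked calculation, assuming the raw-entropy mode applies the same ENT/32 rule to longer inputs:

```latex
\begin{aligned}
ENT &= 288 \\
CS &= ENT / 32 = 9 \\
MS &= (ENT + CS) / 11 = 297 / 11 = 27
\end{aligned}
```

Under that assumption, the 24th word takes its last 8 bits from the pasted checksum bits rather than from a checksum computed over the original 256 bits, which would explain why it differs from “silly”.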

image

Obviously, that is not in line with what BIP39 specifies, as it is beyond the limit. As you also pointed out, the checksum is proportional to the entropy size, where ENT/32 = CS length in bits; because the initial entropy in question is 256 bits, its checksum shouldn’t be longer than 8 bits. That is why I tried to use only the first 8 bits in earlier testing (00001100), but that produced a different mnemonic (I had not realized that I didn’t have to enter the checksum value at all, since the tool calculates it). So for those mistakenly trying a 288-bit string as in the above example, below are some thoughts that came to mind:

Takeaway optional suggestion:

  • Perhaps it is worth exploring whether in an exact scenario like this where 23 of the 24 words are correct (after omitting the last three) that some error message is triggered in case a user otherwise thinks they have the right one and just need to drop the last three words? Because otherwise limiting the max number of bits that can be entered would prevent longer mnemonics which could be commonplace in the future.
  • Restrict the option on mnemonic length only to the drop-down value "Use Raw Entropy (3 words per 32 bits)" when a user selects "supply own entropy/show entropy"

This suggestion is less important, but I thought it could be of value for those who attempt to supply their own entropy with the checksum included as a binary value. The tool might not be able to detect that mistake, and the final checksum word will be wrong (as in the above example).

Contributor

hatgit commented Dec 10, 2017

Hi Ian, Nice work with the updated version Release v0.3.1, looks great:

image

Not sure if you've had time to look at the additional suggestions in the last comment above. I look forward to your comments if you have time, thank you.

Owner

iancoleman commented Dec 11, 2017

Yes I have and agree they should be added. Will do so when I have time. Thanks for the added suggestions.

Repository owner deleted a comment from calaiou Dec 13, 2017

Repository owner deleted a comment from calaiou Dec 13, 2017

Contributor

hatgit commented Dec 13, 2017

Ian, my pleasure, thank you! Was also thinking if it would be feasible to be able to import/paste a mnemonic into the tool at some stage (into the BIP39 Mnemonic field) in the future in order to reveal its raw/filtered entropy? Comparable to how wallets allow an import for recovery, or would the mnemonic alone not be sufficient to calculate the original ENT (even if the xpub and xprv were accessible using the mnemonic?) Thanks again.
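On the question of whether the mnemonic alone is enough: since the words are a direct encoding of the entropy plus checksum, the encoding itself is reversible without any xpub or xprv. A minimal sketch, working from word indexes so that no wordlist needs to be embedded (`indices_to_entropy` is a hypothetical name, not a function of the tool):

```python
import hashlib


def indices_to_entropy(indices):
    """Recover the entropy bytes from word indices and verify the checksum."""
    if len(indices) not in (12, 15, 18, 21, 24):
        raise ValueError("unsupported mnemonic length")
    bits = "".join(f"{index:011b}" for index in indices)
    cs = len(bits) // 33  # CS = ENT / 32 and ENT + CS = len(bits)
    ent_bits, checksum = bits[:-cs], bits[-cs:]
    entropy = int(ent_bits, 2).to_bytes(len(ent_bits) // 8, "big")
    digest = hashlib.sha256(entropy).digest()
    expected = "".join(f"{byte:08b}" for byte in digest)[:cs]
    if checksum != expected:
        raise ValueError("checksum mismatch")
    return entropy


# Indices of the 12-word mnemonic "abandon" x 11 followed by "about" (0 and 3).
print(indices_to_entropy([0] * 11 + [3]).hex())
```

This prints 32 zero hex digits, the all-zero 128-bit entropy behind that well-known test mnemonic.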

nekrozon commented Jan 29, 2018

This thread has been super helpful .. thanks to both Ian and hatgit!

I second the request to add the checksum value when providing manual entropy.

I would also take this a step further...because it’s good to see the steps involved :

  • show the sha256 hash of the manual entropy
  • show the hex bits (in hex) used for checksum calculation
  • Show the binary conversion of these bits
  • Show the padding added to make the last checksum word 11 bits

EDIT: MISTAKE I MADE (SOLVED): Hash tool I was using was treating the binary as a string, and not a binary number.

Site I used to generate CORRECT hash of a binary number: https://cryptii.com/hash-function
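The difference between the two hashing approaches can be reproduced with a short sketch (the function names are hypothetical; the all-zero input is only a placeholder):

```python
import hashlib


def sha256_of_bits(bits):
    """Hash the bytes that the bit string represents (what BIP39 needs)."""
    raw = int(bits, 2).to_bytes(len(bits) // 8, "big")
    return hashlib.sha256(raw).hexdigest()


def sha256_of_text(bits):
    """Hash the ASCII characters '0' and '1' themselves (the mistake)."""
    return hashlib.sha256(bits.encode("ascii")).hexdigest()


bits = "0" * 128
print(sha256_of_bits(bits) == sha256_of_text(bits))  # False
```

A tool that hashes the typed characters is hashing 128 bytes of text here, not the 16 bytes of entropy, so its checksum will not match the one the mnemonic uses.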

Owner

iancoleman commented Mar 12, 2018

v0.3.4 has some updates for this issue

  • d6cade8 - Add spaces every 11 bits to raw binary of entropy
  • 09d6329 - Show the checksum value in the entropy details
  • 548d949 - Warn that entropy values should exclude checksum
  • f8ca25c - Add spacing every 11 bits to the checksum

I didn't add the full sha256(entropy) value to the ui, just the checksum. This is a cleaner interface and I think works well because the entropy binary is now grouped into 11 bits, and the last part of the entropy is very clearly not 11 bits. The checksum binary directly follows that and is of a length that 'fills in' the 'missing' entropy binary.
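The grouping described here can be sketched as follows. `group_bits` is a hypothetical helper rather than the tool's code, and the all-zero string is a placeholder for real entropy.

```python
def group_bits(bits, size=11):
    """Insert a space every `size` bits; the last group may be shorter."""
    return " ".join(bits[i:i + size] for i in range(0, len(bits), size))


entropy_bits = "0" * 256        # placeholder 256-bit entropy
groups = group_bits(entropy_bits).split(" ")
missing = 11 - len(groups[-1])  # bits the checksum has to fill in
print(len(groups), len(groups[-1]), missing)  # 24 3 8
```

For 256 bits the last group holds only 3 bits, and the 8-bit checksum supplies exactly the 8 bits needed to complete the 24th word.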

Feedback welcome but I now consider this feature to be complete.
