Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

The data of EFF's long wordlist should live in a different file #2

Open
gonzalo-bulnes opened this issue Oct 7, 2018 · 10 comments
Open
Labels
good first issue Good for newcomers

Comments

@gonzalo-bulnes
Copy link
Owner

gonzalo-bulnes commented Oct 7, 2018

Edit: If you are looking to make your first (or second, or third!) open-source contribution, welcome! You can read the original comments in chronological order for context, or jump to the issue re-definition below! 馃檪


As it is currently, that long piece of data (7776 numbers -dice throws- and 7776 words) is bloating the dice/wordlist/mod.rs file.

Github does a pretty good job, only skipping syntactic coloration for those lines, but otherwise editing the file isn't pleasant. Also, the data is unlikely to change while the logic certainly will.

I made an attempt to move the throws and words definitions to a different file, but Rust didn't like the variable assignments outside a function. I then tried encapsulating them in two functions, but got a bit lots with the ownership details as that implied extra borrowing...

I certainly need to understand better the ownership model, but I'm also sure there must be a pattern for this use case. Do you know of one? Please let me know! 馃檪

@gonzalo-bulnes gonzalo-bulnes added help wanted Extra attention is needed help me learn Rust! I'm not sure how to do this, but would love to learn from you labels Oct 7, 2018
@sebglazebrook
Copy link

Things that I think you could try is this space are:

  • load data from files
  • use structs to house the data and have function implemented on those structs to wrap/use that data.
  • use constants for data and then it can live outside the function

@gonzalo-bulnes
Copy link
Owner Author

Hi @sebglazebrook!

Thank for the pointers!
I think the constants are what I was trying to achieve, and thought that my variable being immutable, then they would be constant... I'll search specifically for constants : )

@gonzalo-bulnes
Copy link
Owner Author

gonzalo-bulnes commented Oct 10, 2018

Yep, there are constants in Rust, I forgot that paragraph it seems, but that makes sense to me. (:bulb: thanks @sebglazebrook)

While loading the data from file would allow to swap easily the wordlists, I don't think there is any need for that currently. (I personally like the long worldlist from the EFF.)

Using a struct could be a next step. For now I think constants are going to fulfill the purpose of making the code more readable.

I'll reword this issue now that the way to go is clear. 馃檪


The data of EFF's long wordlist should live in a different file

As it is currently, that long piece of data (7776 numbers -dice throws- and 7776 words) is bloating the dice/wordlist/mod.rs file.

Github does a pretty good job, only skipping syntactic coloration for those lines, but otherwise editing the file isn't pleasant. Also, the data is unlikely to change while the logic certainly will.

Objective: move data to a different file, in the form of constants so that it doesn't get mixed with the code that is likely to be improved (and change) over time.

Challenge: as explained in the link above, constants require a type annotation... what is their type?

Bonus points: I think that probably the only constant that needs to be public is the eff_long_wordlist itself, keeping the other ones private would be nice.

This is a refactoring, so there is no need to add new tests, the tests that are already written are sufficient. The goal is to make them green once the move is over 鉁旓笍

Of course, I'm happy to discuss alternatives or help at any stage 馃檪 I am learning Rust as well!

@gonzalo-bulnes gonzalo-bulnes added good first issue Good for newcomers and removed help me learn Rust! I'm not sure how to do this, but would love to learn from you help wanted Extra attention is needed labels Oct 10, 2018
@Bassetts
Copy link
Contributor

@gonzalo-bulnes Personally I think that loading the words from a file is the best approach here.

  1. As you said it will mean that different word lists could be used
  2. I would consider the list to be an external depedency. By loading from a file it would mean that if the list were to ever be updated then updating the list for the program is as simply as replacing the existing file with the new one
  3. It would mean the application could, in future, ship with multiple word lists. Users could select between them by passing an argument
  4. It means your users are not trusting you to have recreated the list properly. They can take the list that ships with the application and verify it is the same as the one provided by EFF themselves. Admittedly this one only benefits the truly paranoid, where is my tin foil hat?
  5. It will reduce the size of the produced binary and will likely mean that a release compresses better when creating an archive. I am no expert but I believe a text file has more potential for compression than a binary

Just my two cents 馃槃

@gonzalo-bulnes
Copy link
Owner Author

gonzalo-bulnes commented Oct 13, 2018

I think that your considerations raise good questions @Bassetts. Before going into the details, my current thinking is that without discarding it in the future, loading the wordlist from a file is not what I would start with. I'll do my best to lay out what I have in mind 馃檪

It is probably useful to start by stating that by principle I try to do as little as is necessary at once, keeping in mind that in software we can always do more in a week if we think it's useful, and that an iterative approach, in my experience, often leads to find other useful things to do, with a different perspective, once the problem at hand is solved.

Also, at this moment, I am as interested in training myself to building the right thing as in learning Rust (i.e. what building the thing right means in Rust). 馃檪 I don't mind doing in 3 small PRs what could be done in one if there was time pressure.

So, this is what I have currently in mind:

  1. Iterative approach: if someone is interested in using a different wordlist in the future, I would consider it then. Playing devil's advocate: there is not evidence of that at this point. (I am the only user I know of.)
  2. Nature of wordlists: Yes, like you, I do consider the wordlist as an external dependency; with the following characteristics:
    1. by its nature, it is very unlikely to be updated
    2. there aren't many reasons for someone to use multiple wordlists simultaneously
  3. Trust / security [1]:
    1. I agree that reviewablity is key, with the following caveat:
    2. I don't think reviewability of the wordlist taken in isolation brings much trust value, and I would consider the reviewabilty of the wordlist and its surrounding code at once as a metric
      1. as authors unworth of trust: both code and wordlist must be reviewed before using the tool, and I think simpler code surrouding a constant will be simpler, overall, than more complex code around a text file... YMMV?
      2. as trustworthy authors that can misspell a wordlist: reviewing a 7000+ lines of text is unreliable enough in itself to call for an automated test to catch misspellings : )
  4. Usability / Value proposition for this tool / Why is this tool for? From an usability perspective, the practical use cases I can think of [2] would favour:
    1. a simple command (no arguments) over a flexible one (wordlist as argument for example), mainly because of (1.ii)
    2. an atomic / standalone binary over a binary with attached files
    3. a binary as small as possible (I completely agree with you on that)
    4. a program that starts as fast as possible (the task in not important enough to wait IMHO). If embedding the wordlist meant faster user experience (I don't know if that is true), then I'd probably favour that because of (1) and (2)

I am no expert but I believe a text file has more potential for compression than a binary

I'm no expert either, that statement seems intuitively correct to me... I would also check (gut felling) what happens during the compilation steps. I feel that this question can be an interesting pretext to learn things about Rust internals if that's something we / someone wants to do.

With all that in mind, I'd like to reword / refine the objectives for this step from:

Github does a pretty good job, only skipping syntactic coloration for those lines, but otherwise editing the file isn't pleasant. Also, the data is unlikely to change while the logic certainly will.

Objective: move data to a different file, in the form of constants so that it doesn't get mixed with the code that is likely to be improved (and change) over time.

...to:

Github does a pretty good job, only skipping syntactic coloration for those lines, but otherwise editing the file isn't pleasant. Also, the data is unlikely to change while the logic certainly will.

Objective: move data to a different file, in the form of constants so that it doesn't get mixed with the code that is likely to be improved (and change) over time, so that:

  • the dice/wordlist/mod.rs is easier to work with (more colors, less scrolling)...
  • ...without becoming more complex to review (one step at a time)
  • it is clearer that the EFF long wordlist is an external dependency and shouldn't be modified

[1] The trust / security dicussion is an interesting one in itself I believe, and I'm tempted to open a separate issue on that specific topic not to sidetrack this one too much. (Done, see #6 )
[2] Not excluding that there could be other use cases than the ones that I can think of, see (1)

@gonzalo-bulnes
Copy link
Owner Author

@Bassetts Althought I would be reluctant to use it here for the reasons above, you might be interested in rust-embed for the purpose of working with separate source files while still shipping a single binary. And it is a macro, which could also be interesting. : )

@Bassetts
Copy link
Contributor

@gonzalo-bulnes I was looking into the options for loading a file into the binary at compile time the other day and there is the include_str! macro.

I don't think it is particularly suitable for this application as you would then just have one huge static string embedded into the binary.

I also came across support for code generation by using a build script. That option is very interesting to me as you could have a build step that takes the word list file and generates code for you e.g. making each line into a variable, or putting all the lines into a HashMap.

Doing code generation would mean your word list text file can just be committed to source control, which would give you all the benefits that come with that such as;

  • changes being easier to see using git/github
  • being able to verify the file by doing a straight diff between the file in your repo and the one downloaded directly from the "upstream" (in this case EFF)

All that would then need to be done in terms of an audit is to verify the code generation code doesn't modify the text when generating the code, which I think should be fairly simple.

This also has the benefit that if someone wanted to provide a binary with an alternate word list then they could fork your repo, replace the word list with their own file (as long as it is in the expected format), and then just build. The code generation will take care of the rest.

For the reasons you laid out above this probably isn't the approach to take right now, but I thought it was worth highlighting as I found it very interesting, and we are both learning Rust 馃 here. I may fire up a branch at some point to play with the code generation as it seems very interesting.

@gonzalo-bulnes
Copy link
Owner Author

Code generation using a build script is very interesting indeed! Thanks for the pointer @Bassetts!

@gonzalo-bulnes
Copy link
Owner Author

If you skipped the context above, here is the description of the issue 馃槈


The data of EFF's long wordlist should live in a different file

As it is currently, that long piece of data (7776 numbers -dice throws- and 7776 words) is bloating the dice/wordlist/mod.rs file.

Github does a pretty good job, only skipping syntactic coloration for those lines, but otherwise editing the file isn't pleasant. Also, the data is unlikely to change while the logic certainly will.

Objective: move data to a different file, in the form of constants so that it doesn't get mixed with the code that is likely to be improved (and change) over time, so that:

  • the dice/wordlist/mod.rs is easier to work with (more colors, less scrolling)...
  • ...without becoming more complex to review (one step at a time)
  • it is clearer that the EFF long wordlist is an external dependency and shouldn't be modified

Challenge: as explained in the link above, constants require a type annotation... what is their type?

Bonus points: I think that probably the only constant that needs to be public is the eff_long_wordlist itself, keeping the other ones private would be nice.

This is a refactoring, so there is no need to add new tests, the tests that are already written are sufficient. The goal is to make them green once the move is over 鉁旓笍

Of course, I'm happy to help at any stage 馃檪 I am learning Rust as well!

@oknono
Copy link

oknono commented Nov 3, 2018

I would like to pick this issue up please :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

4 participants