generators from corpora #30

Closed
jorinvo opened this issue May 23, 2017 · 9 comments

@jorinvo
Contributor

jorinvo commented May 23, 2017

dictionary could return a random word from an English dictionary (or a top 1000 words list).
Along with this there could be dictionary.noun, dictionary.verb, dictionary.adjective.

@lucapette
Owner

Oh, I really like the idea of having the dotted generators! I'm wondering how to source them, though. We'll need to do a little research to find a dictionary. Then we can embed it in the binary with go-bindata; I've used it often and it's nice. Would you like to give it a try, @jorinvo? If so, I'll assign the issue to you. No strings attached, though! I'll have some time later this week or next week, and I plan to add more generators.

@jorinvo
Contributor Author

jorinvo commented May 23, 2017

This looks interesting: https://github.com/dariusk/corpora
Public domain.
I guess we could include more categories than just the ones mentioned above.
A dictionary generator without a sub-category is not that useful, I think. And a category like noun or verb is not much different from existing generators like color, country, product.category.
Maybe we can add them top-level instead?

There are many other useful categories. Some examples:
animal, car, industry, newspaper, company, zodiac, TV show, food, ricer, ocean, city, author, name, mood, occupation, drug, music genre, instrument, Greek god, object, plant, religion, element, planet, programming language, emoji, ...

Since the list of generators will probably grow no matter how many we decide to add now,
I think it would be good to decide how to organize this in the code and in the command-line help.

If we figure this out, I can give it a try.

@lucapette
Owner

I think I'd split the problem into parts (the usual "divide et impera"):

  • We figure out how to import/embed data (and from where)
  • We figure out how to organize the naming of the generators
  • We start adding new generators.
  • We refactor existing generators like color and country so that they are sourced by corpora

I'd probably do the first two steps in one PR and then open issues and add new generators as we go. I like the idea of adding as many generators as possible, provided people use them. Otherwise it feels like "over-engineering" (maybe not a great term, though).

I like corpora:

  • It's public domain
  • It's stored on GitHub in JSON format, which helps us automate the process
  • it has a nice organization of the data

I'd be fine following the lead of corpora about how to organize the naming of the generators and go for something like:

$ fakedata dict.animals # generates a random animal
$ fakedata dict.animals.cats # generates a random cat breed

This would be fun to implement too (nice side-effect :)). I'm a bit unsure about the top-level name dict. The only alternatives I can suggest are:

  • words
  • things
  • data

none of them makes me happy though as they don't seem very fitting so, as always, feedback is more than welcome!

Does this organization work? I feel like it's "good enough" to get us started.
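The dotted naming proposed above could be resolved with a small category tree. The following is only a minimal sketch under stated assumptions: the node type, the lookup and pick helpers, and the sample words are all hypothetical and are not taken from fakedata or from corpora.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// node is one level of a hypothetical category tree. It may hold its own
// values and any number of named sub-categories.
type node struct {
	values   []string
	children map[string]*node
}

// lookup walks a dotted name such as "dict.animals.cats" down the tree and
// returns the matching node, or false if any segment is unknown.
func lookup(root *node, name string) (*node, bool) {
	cur := root
	for _, part := range strings.Split(name, ".") {
		next, ok := cur.children[part]
		if !ok {
			return nil, false
		}
		cur = next
	}
	return cur, true
}

// pick returns a random value stored directly on n, or "" if n has none.
func pick(n *node) string {
	if len(n.values) == 0 {
		return ""
	}
	return n.values[rand.Intn(len(n.values))]
}

func main() {
	// Sample data is illustrative only (assumption, not copied from corpora).
	root := &node{children: map[string]*node{
		"dict": {children: map[string]*node{
			"animals": {
				values: []string{"cat", "dog", "owl"},
				children: map[string]*node{
					"cats": {values: []string{"Siamese", "Maine Coon"}},
				},
			},
		}},
	}}

	for _, name := range []string{"dict.animals", "dict.animals.cats"} {
		if n, ok := lookup(root, name); ok {
			fmt.Println(name, "->", pick(n))
		}
	}
}
```

A tree like this mirrors a directory layout directly, at the cost of one extra lookup step per name segment.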

@gnanet

gnanet commented May 23, 2017

Can I ask for loadable dictionaries, so that it is possible to use dictionaries other than English?

@lucapette
Owner

@gnanet sure you can! But I believe it makes sense to discuss it in a separate issue. The feature may behave similarly, but it needs a different user interface.

@jorinvo
Contributor Author

jorinvo commented May 24, 2017

We figure out how to import/embed data (and from where)

dariusk/corpora seems like a good place to start.
I looked into go-bindata again. It seems to have been unmaintained for two years already.
Maybe we can just convert the JSON to Go code manually once.
I think binary size, memory use, and execution time are not an issue even if we keep the data as code in our program. As far as I know, the system only loads the code once you access it, and the data takes little space in the binary.
This way we also have more control over which parts to keep and how to organize things.
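Keeping the data as plain Go code could look roughly like the sketch below. The variable name Animal, the Random helper, and the sample words are assumptions made for illustration; they are not the project's actual code.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Animal is a hypothetical word list converted by hand from a JSON file.
// Because it is ordinary Go source, it is compiled into the binary and
// needs no extra embedding tool.
var Animal = []string{
	"aardvark",
	"badger",
	"crane",
	"dolphin",
}

// Random returns a random element of list, or "" if the list is empty.
func Random(list []string) string {
	if len(list) == 0 {
		return ""
	}
	return list[rand.Intn(len(list))]
}

func main() {
	fmt.Println(Random(Animal))
}
```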

We figure out how to organize the naming of the generators

As you pointed out, the organization in that repo looks pretty good already.
We could use that as a guideline.
However, I would completely remove the parent namespace (dict/dictionary).
I think that animal should be at the same level as color is currently.
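A flat namespace, with every generator at the same level, could be sketched as a single map keyed by generator name. The generator type, the registry variable, and the sample entries are hypothetical. The sorted listing shows one possible way to keep the command-line help readable as the list grows.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// generator is a hypothetical description of one top-level generator.
type generator struct {
	desc string
	list []string
}

// registry keeps every generator at the same level, with no parent
// namespace such as dict. Entries are illustrative only.
var registry = map[string]generator{
	"animal": {desc: "random animal", list: []string{"cat", "dog"}},
	"color":  {desc: "random color", list: []string{"red", "green"}},
}

// names returns all generator names in sorted order, for example to print
// them in help output.
func names() []string {
	out := make([]string, 0, len(registry))
	for name := range registry {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

// generate returns a random value for the named generator. The second
// result is false if the name is unknown or the list is empty.
func generate(name string) (string, bool) {
	g, ok := registry[name]
	if !ok || len(g.list) == 0 {
		return "", false
	}
	return g.list[rand.Intn(len(g.list))], true
}

func main() {
	for _, name := range names() {
		v, _ := generate(name)
		fmt.Printf("%-8s %-15s e.g. %s\n", name, registry[name].desc, v)
	}
}
```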

We start adding new generators.

I think we should add only the ones that appear useful to us.
This list is too long, and for many entries I don't see any use case:
https://gist.github.com/jorinvo/768c2c1051ab87a378faa47ffaae8066

We refactor existing generators like color and country so that they are sourced by corpora

If we import the code manually we can stick to the existing data. We could still extend it.

I would split up the current dict.go file, though. We could create a package pkg/data and add one .go file for each generator.

What do you think @lucapette ?

@lucapette
Owner

I know go-bindata isn't maintained anymore but, to be fair, I've been using it a lot and had no real issues with it. Still, your suggestion of just keeping everything as Go code is very appealing to me: the project remains go-gettable, and that's pretty nice. So I'd say we proceed as you say.

I buy your point of skipping the parent namespace. And thank you very much for the suggestion, sometimes the simplest solution is hard to see!

I would still suggest we automate the process of importing data from corpora; we could add a make update-data script. Not a hard requirement, though. It's more like a "very nice to have".

I love the idea of getting a pkg/data. It makes a lot of sense. I propose I'll start the work on it myself as soon as possible (maybe today!) because of #33 so we make it easier to pick this issue for whoever feels comfortable doing it.

About adding new generators, I agree we shouldn't import everything so I say we decide what's worth importing upfront in the context of this issue.

To wrap up,

  • I move the current dict.go to multiple files under pkg/data/
  • We come up with a list of data to import from corpora
  • We import what we need, either manually or with a script (I expressed my preference already :))
  • 🎉

@jorinvo what do you think?

@jorinvo
Contributor Author

jorinvo commented May 24, 2017

Sounds perfect!
You can do the reorganizing first.
Afterwards, we can look into adding new content.

This was referenced May 24, 2017
@lucapette added this to the v1.0.0 milestone on May 30, 2017
@jorinvo changed the title from "dictionary generator" to "generators from corpora" on Jun 3, 2017
@lucapette
Owner

As this was done in #47 (thanks @jorinvo !!!) I'm closing this one. I'll create a new issue so we can discuss (and collect some feedback) about what to import
