generators from corpora #30

jorinvo · 2017-05-23T13:14:17Z

dictionary could return a random word from an English dictionary (or a top 1000 words list).
Along with this there could be dictionary.noun, dictionary.verb, dictionary.adjective.

lucapette · 2017-05-23T14:32:54Z

Oh I really like the idea of having the dotted generators! I'm wondering how to source them though. We'll need to do a little research to find a dictionary. Then we can embed it in the binary with go-bindata; I've used it often and it's nice. Would you like to give it a try @jorinvo? If so, I'll assign the issue to you. No strings attached though! I'll have some time later this week or next week, and I'm planning to add more generators.

jorinvo · 2017-05-23T15:08:35Z

This looks interesting: https://github.com/dariusk/corpora
Public domain.
I guess we could include more categories than just the above mentioned.
dictionary without sub-category is not that useful, I think. And a category like noun, verb is not much different than existing generators like color, country, product.category.
Maybe we can add them top-level instead?

There are many other useful categories. Some examples:
animal, car, industry, newspaper, company, zodiac, TV show, food, river, ocean, city, author, name, mood, occupation, drug, music genre, instrument, Greek god, object, plant, religion, element, planet, programming language, emoji, ...

Since the list of generators will probably grow no matter how many we decide to add now,
I think it would be good to think about how to organize this in the code and in the command line help.

If we figure this out, I can give it a try.

lucapette · 2017-05-23T15:32:12Z

I think I'd split the problem into parts (the usual "divide et impera"):

We figure out how to import/embed data (and from where)
We figure out how to organize the naming of the generators
We start adding new generators.
We refactor existing generators like color and country so that they are sourced by corpora

Possibly I'd do the first two steps in a PR and then just open issues and add new generators as we go. I like the idea of adding as many generators as possible, provided people use them. Otherwise it feels like "over-engineering" (maybe not a great term though).

I like corpora:

It's public domain
It's stored on GitHub in JSON format, which helps us automate the process
It has a nice organization of the data

I'd be fine following the lead of corpora about how to organize the naming of the generators and go for something like:

$ fakedata dict.animals # generates random animal
$
$ fakedata dict.animals.cats # generates random cat breed

Which would be fun to implement too (nice side-effect :)). I'm a bit unsure about the top name dict. I feel I can suggest only:

words
things
data

none of them makes me happy though as they don't seem very fitting so, as always, feedback is more than welcome!

Does this organization work? I feel like it's "good enough" to get us started
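
For illustration only, the dotted naming could be resolved by walking a small tree one segment at a time. This is a rough sketch, not how fakedata does it: the node type, the corpus variable, the lookup and generate functions, and the sample values are all made up for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// node is a hypothetical tree entry: it may hold values of its own
// and/or named sub-categories.
type node struct {
	values   []string
	children map[string]*node
}

// corpus is tiny illustrative sample data, not real corpora content.
var corpus = &node{children: map[string]*node{
	"dict": {children: map[string]*node{
		"animals": {
			values: []string{"cat", "dog", "owl"},
			children: map[string]*node{
				"cats": {values: []string{"Siamese", "Maine Coon"}},
			},
		},
	}},
}}

// lookup walks the tree one dot-separated segment at a time.
func lookup(root *node, name string) (*node, error) {
	cur := root
	for _, part := range strings.Split(name, ".") {
		next, ok := cur.children[part]
		if !ok {
			return nil, fmt.Errorf("unknown generator %q (at %q)", name, part)
		}
		cur = next
	}
	if len(cur.values) == 0 {
		return nil, fmt.Errorf("generator %q has no values", name)
	}
	return cur, nil
}

// generate returns one random value for the given dotted name.
func generate(root *node, name string) (string, error) {
	n, err := lookup(root, name)
	if err != nil {
		return "", err
	}
	return n.values[rand.Intn(len(n.values))], nil
}

func main() {
	for _, name := range []string{"dict.animals", "dict.animals.cats"} {
		v, err := generate(corpus, name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(name, "->", v)
	}
}
```

With a tree like this, dropping or renaming the top-level dict segment would only change the sample data, not the lookup.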

gnanet · 2017-05-23T22:58:03Z

Can I ask for loadable dictionaries, so that it's possible to use dictionaries in languages other than English?

lucapette · 2017-05-24T05:37:14Z

@gnanet sure you can! But I believe it makes sense to discuss it in a separate issue. I think the feature may behave similarly, but it needs a different user interface.

jorinvo · 2017-05-24T09:16:49Z

We figure out how to import/embed data (and from where)

dariusk/corpora seems like a good place to start.
I was looking into go-bindata again. It seems to have been unmaintained for two years already.
Maybe we can just convert the JSON to Go code manually once.
I think binary size, memory size, execution time are not an issue even if we have it as code in our program. As far as I know, the system only loads the code once you access it. And in the binary the data takes little space.
This way we also have more control over which parts to keep and how to organize things.

We figure out how to organize the naming of the generators

As you pointed out, the organization in that repo looks pretty good already.
We could use that as a guideline.
However, I would completely remove the parent namespace (dict/dictionary).
I think that animal should be at the same level as color is currently.

We start adding new generators.

I think we should add only the ones that appear useful to us.
This list is too long and for many I don't see any use case:
https://gist.github.com/jorinvo/768c2c1051ab87a378faa47ffaae8066

We refactor existing generators like color and country so that they are sourced by corpora

If we import the code manually we can stick to the existing data. We could still extend it.

I would separate the current dict.go file though. We could create a package pkg/data and add one .go file for each generator.
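
To make the idea concrete, here is a rough sketch of how top-level generators backed by plain Go slices might look. Everything in it is an assumption for illustration: the variable names, the generators map, the pick and names helpers, and the sample values. In the proposed layout each slice would sit in its own file under pkg/data; they are shown in one file here so the example runs on its own.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// In the proposed layout each of these would live in its own file,
// e.g. pkg/data/animal.go and pkg/data/color.go (file names are hypothetical).
// The values are illustrative sample data only.
var Animal = []string{"cat", "dog", "owl"}
var Color = []string{"red", "green", "blue"}

// generators maps a top-level generator name to its data,
// with no parent namespace such as dict.
var generators = map[string][]string{
	"animal": Animal,
	"color":  Color,
}

// pick returns a random entry for the named generator.
func pick(name string) (string, bool) {
	values, ok := generators[name]
	if !ok || len(values) == 0 {
		return "", false
	}
	return values[rand.Intn(len(values))], true
}

// names lists the generator names in sorted order, e.g. for command line help.
func names() []string {
	out := make([]string, 0, len(generators))
	for name := range generators {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

func main() {
	for _, name := range names() {
		v, _ := pick(name)
		fmt.Printf("%s: %s\n", name, v)
	}
}
```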

What do you think @lucapette ?

lucapette · 2017-05-24T13:21:22Z

I know go-bindata isn't maintained anymore but, to be fair, I've been using it a lot and had no real issues with it. But your suggestion of just keeping everything as Go code is very appealing to me: the project remains go-gettable, and that's pretty nice. So I'd say we proceed as you say.

I buy your point of skipping the parent namespace. And thank you very much for the suggestion, sometimes the simplest solution is hard to see!

I would still suggest we automate the process of importing data from corpora; we could add a make update-data script. Not a hard requirement though. It's more like a "very nice to have".
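
As a rough sketch of what such an import step could do (not the importer that eventually landed), the snippet below turns a JSON document into a Go source file declaring a []string. It assumes a corpora file is a JSON object with the list stored under a named key; the render function, its parameters, and the sample input are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"go/format"
	"strings"
)

// render converts a JSON document holding a list of strings under key into
// Go source declaring a []string variable named varName in package pkg.
func render(raw []byte, key, pkg, varName string) ([]byte, error) {
	var doc map[string]json.RawMessage
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, fmt.Errorf("parsing JSON: %w", err)
	}
	list, ok := doc[key]
	if !ok {
		return nil, fmt.Errorf("key %q not found", key)
	}
	var values []string
	if err := json.Unmarshal(list, &values); err != nil {
		return nil, fmt.Errorf("key %q is not a list of strings: %w", key, err)
	}

	var b strings.Builder
	fmt.Fprintf(&b, "// Code generated from corpora data. DO NOT EDIT.\n\n")
	fmt.Fprintf(&b, "package %s\n\n", pkg)
	fmt.Fprintf(&b, "// %s holds the %q entries.\n", varName, key)
	fmt.Fprintf(&b, "var %s = []string{\n", varName)
	for _, v := range values {
		fmt.Fprintf(&b, "%q,\n", v) // %q emits a safely quoted Go string literal
	}
	b.WriteString("}\n")

	// Run the result through gofmt so it matches hand-written code.
	return format.Source([]byte(b.String()))
}

func main() {
	// Sample input made up for this example.
	sample := []byte(`{"description": "sample", "animals": ["cat", "dog"]}`)
	src, err := render(sample, "animals", "data", "Animal")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(string(src))
}
```

The generated file is ordinary Go code, so the project would stay go-gettable and the JSON would only be needed when refreshing the data.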

I love the idea of getting a pkg/data. It makes a lot of sense. I propose to start the work on it myself as soon as possible (maybe today!) because of #33, so we make it easier to pick up this issue for whoever feels comfortable doing it.

About adding new generators, I agree we shouldn't import everything so I say we decide what's worth importing upfront in the context of this issue.

To wrap up,

I move the current dict.go to multiple files under pkg/data/
We come up with a list of data to import from corpora
We import what we need, either manually or with a script (I expressed my preference already :))
🎉

@jorinvo what do you think?

jorinvo · 2017-05-24T13:43:31Z

Sounds perfect!
You can do the reorganizing thing first.
Afterwards, we can look into adding new content.

lucapette · 2017-06-25T21:23:36Z

As this was done in #47 (thanks @jorinvo !!!) I'm closing this one. I'll create a new issue so we can discuss (and collect some feedback) about what to import

lucapette added the generators label May 23, 2017

This was referenced May 24, 2017

Load a list of words from a file #34

Closed

More top-level domain generators #33

Closed

This was referenced May 24, 2017

Introduce pkg/data #35

Merged

More TLDs #36

Merged

v1.0.0 #44

Closed

lucapette added this to the v1.0.0 milestone May 30, 2017

jorinvo changed the title ~~dictionary generator~~ generators from corpora Jun 3, 2017

jorinvo mentioned this issue Jun 3, 2017

Add corpora importer and first corpora generators #47

Merged

lucapette closed this as completed Jun 25, 2017
