Starting rough: a versioned JSON dump #4

Open
andru opened this issue Jan 14, 2016 · 21 comments

Comments

@andru
Contributor

andru commented Jan 14, 2016

@simonv3 What are your thoughts on getting this repo rolling as a JSON file of crops?

I think you already did a bunch of data scraping from openly licensed sets for OpenFarm, and I've done the same for Hortomatic. I think we should get something rough rolling with this for now...

Proposal:

  • We dump our sources in this repo: open data sets, website scrapings, etc
  • Work on some simple command line scripts that combine those data sources and spit out a single JSON file per crop[1]
  • Work on some simple command line build tools to take the source files and spit out a big JSON crop dump
  • For now, edits to the database can be made by editing the individual source files and rebuilding
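A rough sketch of the build step in the proposal above, assuming one JSON file per crop sitting in a single source directory. The function name `build_dump`, the file layout, and the file-name-as-ID convention are illustrative assumptions, not anything agreed in this thread.

```python
import json
import tempfile
from pathlib import Path


def build_dump(source_dir: Path, output_file: Path) -> dict:
    """Combine every per-crop JSON file in source_dir into one dump file."""
    crops = {}
    # Sort the paths so the dump is stable across runs and diffs stay readable.
    for path in sorted(source_dir.glob("*.json")):
        crop = json.loads(path.read_text(encoding="utf-8"))
        # Assumption: the file name (without extension) is the crop's ID.
        crops[path.stem] = crop
    output_file.write_text(
        json.dumps(crops, indent=2, sort_keys=True), encoding="utf-8"
    )
    return crops


# Small demonstration with a throwaway directory and a made-up crop file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "crops"
    src.mkdir()
    (src / "solanum_lycopersicum.json").write_text(
        '{"label": "Tomato"}', encoding="utf-8"
    )
    demo_dump = build_dump(src, Path(tmp) / "crops.json")
    print(sorted(demo_dump))
```

Editing a source file and re-running the script regenerates the whole dump, which matches the "edit the sources, then rebuild" workflow in the last bullet.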

What do you think? Could this model work for OpenFarm for the time being?

@mstenta you mentioned you've already got some crop data going for FarmOS, could this model work for you?

1: By which I mean a taxonomic ID of some kind, not common names... species, variety, cultivar... there will be duplicates because horticultural naming is a mess, but it should get us close to something unique

@simonv3
Member

simonv3 commented Jan 14, 2016

The main issue I can think of is that we'd be splitting our dataset at this moment, and it would evolve separately on OF until this one becomes more usable as an endpoint. However, I don't think that we get enough edits at the moment for that to be a real concern.

We could also build some scripts that check a variety of sources and aggregate that data, then let humans check merge conflicts?
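One way such an aggregation script could work, as a sketch only: take the same crop's record from each source, keep the fields the sources agree on, and set aside the fields where they disagree for a human to resolve. The function name `merge_sources`, the source names, and the sample values are all made up for illustration.

```python
def merge_sources(sources):
    """Merge one crop's records from several sources.

    Returns (merged, conflicts): fields with a single agreed value go into
    merged; fields where sources disagree go into conflicts, keyed by source.
    """
    merged = {}     # field -> (first source that supplied it, value)
    conflicts = {}  # field -> {source name: value}
    for source_name, record in sources.items():
        for field, value in record.items():
            if field in conflicts:
                # Already disputed: just record this source's value too.
                conflicts[field][source_name] = value
            elif field not in merged:
                merged[field] = (source_name, value)
            elif merged[field][1] != value:
                # Disagreement: move the field out of merged for human review.
                first_source, first_value = merged.pop(field)
                conflicts[field] = {first_source: first_value, source_name: value}
    return {field: value for field, (_, value) in merged.items()}, conflicts


# Hypothetical records for the same crop from two sources.
demo_merged, demo_conflicts = merge_sources({
    "openfarm": {"label": "Tomato", "days_to_maturity": 60},
    "hortomatic": {"label": "Tomato", "days_to_maturity": 75},
})
print(demo_merged)
print(demo_conflicts)
```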

@mstenta

mstenta commented Jan 14, 2016

@andru I like it! Getting started with at least a sketch is the best first step. And it will help to identify where the commonalities are.

I agree that each crop/variety should be a separate file. See my comment here: #2 (comment)

I would also suggest that we consider YAML instead of JSON. The Drupal community recently chose YAML over JSON for all of its configuration management. I'm not familiar with all of the reasons (I'm sure there are lots of comparisons out there) - but one that stood out to me is that YAML can have comments embedded in it. I do love comments. :-)

@mstenta

mstenta commented Jan 14, 2016

It looks like there are options for converting YAML to JSON in JavaScript, as well. So I don't think it would be an impediment to JS-only apps. What do you think?

https://nodeca.github.io/js-yaml/

@andru
Contributor Author

andru commented Jan 14, 2016

I think YAML could be a good fit, since we're talking about hand-editing files for now. Fewer braces, but whitespace-sensitive markup can be confusing for some people too. I don't have a clear preference. Whether we go with YAML or JSON for the source files, we should look into options for validation - maybe there's a GitHub pull request integration that can handle it?

@simonv3 is there a way you could keep a log of changes made to the data at OpenFarm which, depending on the quantity, could be manually applied to the repo or scripted?

From the perspective of Hortomatic, in the short term I'll be using the data as read-only.

@simonv3
Member

simonv3 commented Jan 15, 2016

Making crops in OF read-only for now is an option, but I'd have to discuss that with the other people working on the project. Thinking about it - the main editing that's been happening on crops is actually linking to Wikipedia and uploading images, none of which is really "crop" information.

I'm personally cool with YAML.

@pmackay

pmackay commented Jan 15, 2016

I'm curious, what's the goal? What will the list of crops be used for?

@simonv3
Member

simonv3 commented Jan 15, 2016

@pmackay The goal is to give the services that use crop data a consistent crop knowledge base. For example, FarmOS, OpenFarm and Hortomatic would all be able to draw from the same "crop" data set.

There's this issue, which attempts to answer those questions: #2

@mstenta

mstenta commented Jan 17, 2016

Hey everyone! I sketched up two quick proof-of-concept repositories, to demonstrate sort of what I'm thinking. It's not meant to be a "final solution" - I just find it easier to get my ideas out in code sometimes. And maybe it can provide a starting point for further conversation.

The two repositories are:

https://github.com/farmOS/CropDB-Spec
https://github.com/farmOS/CropDB-Base

CropDB-Spec serves as a place to define the data specification. It basically just has two files: cropdb.schema.yml, which defines the basic schema of a crop YAML file; and db/example.yml which is an example crop YAML file that contains comments about each field/value.

CropDB-Base serves as an example of an actual crop collection that implements the spec. I just added a single crop file called "tomato.yml" as an example, but we could start building out more if you like this approach.

The way I see "crop collections" is: perhaps we can provide a "base" collection that contains very general information about a set of very common crops. But other people could create their own sets for more specific ones - i.e. seed producers could create sets that have files for each of their available varieties/cultivars. And they could use the "base" set as a starting point - utilizing the "inherits" field I proposed.

So for example, Johnny's Seeds sells a Tomato variety called "Big Beef" (http://www.johnnyseeds.com/p-7958-big-beef.aspx). In their data set, they could create a file called tomato.big_beef.yml (or something like that) and specify in there that it "inherits" from the base tomato.yml file. But they could also include a line in that file that overrides the "days_to_maturity" and set it to 70, because it's different from the default 60 defined in tomato.yml.
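The override behaviour described above can be sketched in a few lines. This is only an illustration of the idea, not code from either repository: the crop files are stood in for by plain dicts, `resolve` is a hypothetical helper name, and "inherits" points at a crop id here for readability, whereas the spec sketch uses a UUID.

```python
def resolve(crop_id, collection, _seen=None):
    """Return a crop's data with inherited values filled in from its ancestors."""
    seen = _seen or set()
    if crop_id in seen:
        raise ValueError(f"inheritance cycle at {crop_id!r}")
    seen = seen | {crop_id}
    crop = collection[crop_id]
    parent_id = crop.get("inherits")
    # Resolve the parent first, then let this crop's own values override it.
    base = resolve(parent_id, collection, seen) if parent_id else {}
    return {**base, **crop.get("data", {})}


# Stand-ins for tomato.yml and tomato.big_beef.yml, already parsed into dicts.
collection = {
    "tomato": {
        "label": "Tomato",
        "data": {"days_to_maturity": 60, "frost_tolerance": "not tolerant"},
    },
    "tomato.big_beef": {
        "label": "Big Beef",
        "inherits": "tomato",
        "data": {"days_to_maturity": 70},
    },
}

big_beef = resolve("tomato.big_beef", collection)
print(big_beef)  # {'days_to_maturity': 70, 'frost_tolerance': 'not tolerant'}
```

The variety file only needs to list what differs from the base; everything else falls through from tomato.yml.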

Again, this is all just a sketch - meant to convey some possible ideas and get your feedback. I haven't implemented any actual code to use these files, nor do I have much experience with YAML - so there may be things wrong - but hopefully it at least makes sense from a conceptual point of view.

What do you think?

@mstenta

mstenta commented Jan 17, 2016

If anyone wants commit access to those repos, let me know! Feel free to bang on it, propose changes, etc.

Or, if it's completely different from what you're thinking - we can throw them out completely - but this is roughly what I am going to need in farmOS. :-)

@simonv3
Member

simonv3 commented Jan 17, 2016

I want to put a link to the datapackages set of tools here: https://www.npmjs.com/search?q=datapackage

http://dataprotocols.org/data-packages/

Your spec and implementation files reminded me of it @mstenta, and there's a group of well defined tools for this already - it's probably worth just reading up on them and seeing what they do.

@pmackay

pmackay commented Jan 17, 2016

@mstenta would it be possible to start by capturing the models and properties you need, separately from the data format? (I wrote a bit more here: #2 (comment).)

@mstenta

mstenta commented Jan 17, 2016

Thanks @simonv3 ! That looks like a good guide to follow and learn from! I'll spend some time familiarizing myself with it.

The format and structure I used for the YAML was loosely based on the format Drupal 8 is using for configuration storage. I'm sure there's some overlap in the concepts so it would be helpful to identify those.

@pmackay - Definitely! I agree starting a wiki to sketch out the properties is a good next step. So far, in the YAML sketch I made, the "crop" model looks something like this:

id: tomato
uuid: [uuid]
inherits: [uuid]
label: 'Tomato'
data:
    days to maturity: 60
    frost tolerance: not tolerant

Just a start... I'm starting to compile a list of other data properties that I plan to use. Should we start a wiki to compile them?

@pmackay

pmackay commented Jan 17, 2016

Want to fill out more info here https://github.com/openfarmcc/Crops/wiki/Crop-data-needs?

@mstenta

mstenta commented Jan 17, 2016

And just to be clear: my current use-case is specifically to build a set of files that can be imported into farmOS. Within farmOS, users will be able to plan out their plantings via a "Planting Wizard", which will use the data in these files to auto-generate tasks with specific dates. The "frost tolerance" and "days to maturity" that I included in the schema are both useful for that specific purpose.
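To make the use-case concrete, here is a minimal sketch of how those two fields could drive task dates. It is not farmOS code: the function name `plan_planting`, the rule that a frost-intolerant crop waits for the last frost date, and the sample dates are all assumptions for illustration.

```python
from datetime import date, timedelta


def plan_planting(crop, desired_planting, last_frost):
    """Return (task, date) pairs for one planting of the given crop."""
    data = crop["data"]
    planting = desired_planting
    # Assumed rule: a crop that is not frost tolerant waits for the last frost.
    if data.get("frost tolerance") == "not tolerant" and planting < last_frost:
        planting = last_frost
    harvest = planting + timedelta(days=data["days to maturity"])
    return [("plant", planting), ("harvest", harvest)]


# The crop model from the YAML sketch, as a dict.
tomato = {
    "id": "tomato",
    "label": "Tomato",
    "data": {"days to maturity": 60, "frost tolerance": "not tolerant"},
}

# Hypothetical dates: the grower wants to plant on April 1, last frost is May 15.
tasks = plan_planting(tomato, date(2016, 4, 1), date(2016, 5, 15))
for name, when in tasks:
    print(name, when.isoformat())
```

With these inputs the planting is pushed back to 2016-05-15 and the harvest lands 60 days later, on 2016-07-14.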

@andru would be able to use these files for Hortomatic, as well. And OpenFarm.cc could use them as a basis upon which guides could be built. It would also help to accomplish your goal in #1 I think.

@mstenta

mstenta commented Jan 17, 2016

Great! Thanks @pmackay - I will start adding more to that...

@mstenta

mstenta commented Jan 17, 2016

@simonv3 - I really like how the datapackages format is put together. That would mean that the crop sets would be CSV files, too - which is good - lots of things can read CSV. :-)

Do you know if it can handle other formats too? Is YAML out? I don't really have strong opinions on the format at this point - just curious what the options are.

Question: is it limited to flat single-row data? In other words: if we discovered that we needed to represent nested objects somehow, or many-to-one relationships, do you know if that's possible with datapackages?

I don't know if that will be necessary - I suppose we'll see what comes together in https://github.com/openfarmcc/Crops/wiki/Crop-data-needs

@roryaronson
Member

Great conversation all!

"And just to be clear: my current use-case is specifically to build a set of files that can be imported into farmOS. Within farmOS, users will be able to plan out their plantings via a "Planting Wizard", which will use the data in these files to auto-generate tasks with specific dates."

^ This is pretty much what I need for FarmBot 👍 Though we were hoping to use OpenFarm Guides as the main source of data.

@roryaronson
Member

@andru
Contributor Author

andru commented Jan 18, 2016

Great to see the ball rolling!

"That would mean that the crop sets would be CSV files, too - which is good - lots of things can read CSV. :-)"

I worry that CSV would tie us to a schema. The way I see things we need to define a core schema while not imposing a limit on additional data fields.
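As a sketch of what "core schema plus open-ended extras" could look like in practice: validate only the fields the core schema names and leave everything else alone. The `CORE_FIELDS` table and the `validate` helper are hypothetical, and the choice of "id" and "label" as core fields is just borrowed from the YAML sketch earlier in the thread.

```python
# Hypothetical core schema: field name -> required Python type.
CORE_FIELDS = {"id": str, "label": str}


def validate(crop):
    """Return a list of problems with the core fields; extra fields are ignored."""
    errors = []
    for name, expected in CORE_FIELDS.items():
        if name not in crop:
            errors.append(f"missing core field: {name}")
        elif not isinstance(crop[name], expected):
            errors.append(f"{name} should be {expected.__name__}")
    return errors


# A record with a made-up extra field that the core schema knows nothing about.
tomato = {
    "id": "tomato",
    "label": "Tomato",
    "companion_plants": ["basil"],
}

print(validate(tomato))     # []
print(validate({"id": 7}))  # ['id should be str', 'missing core field: label']
```

A nested or free-form record like this is easy to carry in YAML or JSON; a CSV file would need a column for every extra field anyone ever adds.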

@roryaronson That scientific crop traits spreadsheet is great - what's the source ontology?

@mstenta I think the discussion over a common schema could use its own issue, so I've started it off with my thoughts over at #5

@roryaronson
Member

@andru I don't remember anymore 'cause I made that list like a year ago. It's from a lot of sources cobbled together. I think I just googled "plant traits list" and copy-pasted from like 100 places haha

@mundotazo

I second using YAML. It's readable.

The USDA has CSV files for plants.
http://plants.usda.gov/java/

If the seed varieties could be cross referenced with seed vendors it would be really helpful.
http://www.organicseedfinder.org/


6 participants