Support import/export of Data Packages #16

Open
danfowler opened this issue Jul 27, 2016 · 4 comments

@danfowler

Breve allows for the assignment of data type per column and immediate validation against those types. This is excellent! However, once the dataset has been cleaned, the only output seems to be the cleaned CSV. I believe this tool would be even more useful if the type information created through Breve were recorded using JSON Table Schema and the data exported as a Tabular Data Package. Likewise, on import, the type information could be automatically set using validation rules expressed via the Data Package format.

A Data Package provides a minimal "container" for transporting any kind of data. It is designed for extension to allow publishers to add additional constraints on the format and type of data and metadata.

Concretely, you can create a Data Package by placing a specially formatted file, datapackage.json, in the directory containing the files that comprise your dataset. Given a dataset called dataset.csv that looks like this:

a,b,c
1,2,3
4,5,6

A very simple example of a datapackage.json that would accompany the unaltered CSV would look like this:

{
  "name": "my-first-dataset",
  "title": "My First Dataset",
  "resources": [
    {
      "path": "dataset.csv",
      "format": "csv",
      "schema": {
        "fields": [
          {
            "name": "a",
            "type": "integer"
          },
          {
            "name": "b",
            "type": "integer"
          },
          {
            "name": "c",
            "type": "integer"
          }
        ]
      }
    }
  ]
}
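As a rough illustration of what an export step could do, the sketch below derives that descriptor from the CSV text using only the Python standard library. The helper names `infer_type` and `build_descriptor` are hypothetical, and the inference rule (integer if every value parses as an integer, otherwise string) is a deliberate simplification; a tool like Breve would use the types the user actually assigned.

```python
import csv
import io
import json


def infer_type(values):
    """Return "integer" if every value parses as an int, else "string".

    Hypothetical helper: a stand-in for the per-column types a user
    would assign interactively.
    """
    try:
        for value in values:
            int(value)
    except ValueError:
        return "string"
    return "integer"


def build_descriptor(csv_text, path, name, title):
    """Build a datapackage.json-style dict for a single CSV resource."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    fields = [
        {"name": column, "type": infer_type([row[i] for row in body])}
        for i, column in enumerate(header)
    ]
    return {
        "name": name,
        "title": title,
        "resources": [
            {"path": path, "format": "csv", "schema": {"fields": fields}}
        ],
    }


CSV_TEXT = "a,b,c\n1,2,3\n4,5,6\n"
descriptor = build_descriptor(
    CSV_TEXT, "dataset.csv", "my-first-dataset", "My First Dataset"
)
print(json.dumps(descriptor, indent=2))
```

Writing the result of `json.dumps` to a file named datapackage.json next to dataset.csv would complete the package.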

The data types you support would all be expressible via the JSON Table Schema language using a combination of type, format, and constraints per field:

http://specs.frictionlessdata.io/json-table-schema/#field-descriptors
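To make the combination of type, format, and constraints concrete, here is a minimal sketch of two field descriptors and a toy checker for single cell values. The `check_value` function is hypothetical and covers only the cases shown; the URI test (scheme plus network location) is an assumption chosen for brevity and is stricter than what a full implementation would accept.

```python
from urllib.parse import urlparse

# Two illustrative field descriptors: a required integer and a URI string.
FIELDS = [
    {"name": "a", "type": "integer", "constraints": {"required": True}},
    {"name": "homepage", "type": "string", "format": "uri"},
]


def check_value(field, value):
    """Return a list of problems found for one cell (empty list if none).

    Hypothetical checker, not part of any Frictionless Data library.
    """
    problems = []
    required = field.get("constraints", {}).get("required", False)
    if value == "":
        if required:
            problems.append("required value is missing")
        return problems
    if field["type"] == "integer":
        try:
            int(value)
        except ValueError:
            problems.append("not an integer")
    elif field["type"] == "string" and field.get("format") == "uri":
        parts = urlparse(value)
        # Simplifying assumption: treat only scheme://host/... as a URI.
        if not (parts.scheme and parts.netloc):
            problems.append("not a URI")
    return problems


for field, value in [
    (FIELDS[0], "42"),
    (FIELDS[0], ""),
    (FIELDS[1], "http://frictionlessdata.io/about/"),
    (FIELDS[1], "not a uri"),
]:
    print(field["name"], repr(value), check_value(field, value))
```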

We're building an ecosystem of tools and integrations that allow the reading of Data Packages in tools already in use today: http://frictionlessdata.io/about/ . We can definitely assist in supporting this integration.

@esjewett
Member

esjewett commented Jul 27, 2016

I think we'd certainly be interested in adding support for these formats to Palladio, and we have looked at them before, but there are a few questions. It's worth noting that Breve uses Palladio's data processing engine internally and doesn't expose all of Palladio's capabilities. If you can provide some insight into these issues, that would be helpful:

  1. Is there a single-file format available? The ability for users to port data using a single file is fairly integral to the approach of Palladio and Breve, not least because the ability to download the file from the browser client-side is limited to one file at a time.
  2. How much flexibility is there in the types? Here are some examples of where the JSON Table Schema types and the Palladio types don't agree, as far as I recall:
    1. No URL/URI type in JSON Table Schema
    2. Palladio supports heterogeneous data columns (e.g. a column with both 1234-12-12 YYYY-MM-DD and 1234 YYYY formats) and we hope to support even more in the future. The goal is to be able to support different types of fuzzy dates and durations.
    3. Breve supports ordinal/nominal indicators and such a concept doesn't seem to exist in JSON Table Schema.
  3. The Palladio format stores multiple tables in a single data file along with information about the user-defined mapping/join between the tables. Would this be possible with the Data Packages format?
  4. Related to 2 above, the Palladio format stores further user-supplied information about dimensions. Does the format allow storing this type of additional information? A couple of examples:
    1. If the user has selected the dimension as displayed or not.
    2. If the dimension uses multi-value delimiters internally

I don't mean to imply that the Palladio format is some sort of superior format. It is really just based on Palladio's internal representation. If we could switch to a format based on Data Packages, that would probably be ideal, but as a research project we also need to maintain the ability to be flexible and expressive in areas where Data Packages has made the sort of decisions to limit expression that make perfect sense in a standard format.

Thanks!

@danfowler
Author

Hi Ethan, thanks for your quick, thoughtful, and thorough response!

  1. Data Packages can support a single-file use case similar to Palladio's export format. A given "resource" in the resources array (equivalent to a "file" in the Palladio export JSON files array) in a Data Package can have one of "url", "path", or "data"; the "data" attribute can be used to store in-line data in a JSON array exactly equivalent to the "data" attribute in Palladio's export format.

    http://specs.frictionlessdata.io/data-packages/#inline-data

  2. Types

    1. Each field type in JSON Table Schema has a set of format options. For strings, this does, in fact, include a uri format: http://specs.frictionlessdata.io/json-table-schema/#string

    2. For a field type of date (or time or datetime), there actually is a format option "any" which is specified like so:

      any: Any parsable representation of the type. The implementing library can attempt to parse the datetime via a range of strategies. An example is dateutil.parser.parse from the python-dateutils library.

      I'm not sure if this fully supports your use case, so let me know.

    3. No explicit support for "ordinal" or "nominal" indicators on a field.

  3. The Tabular Data Package format supports multiple tables and user-defined relations between them. See this section for details:

    http://specs.frictionlessdata.io/json-table-schema/#foreign-keys

    Example: countries-and-currencies/datapackage.json

  4. The Data Package format does allow for extra fields.

    NOTE: A Data Package author MAY add any number of additional fields beyond those listed in the specification here.

    http://specs.frictionlessdata.io/data-packages/#optional-fields
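Taken together, points 1, 3, and 4 suggest that a single-file export could look roughly like the sketch below: two tables stored in-line via "data", a foreign key relating them, and extra properties for Palladio-style dimension settings. This is an illustration under stated assumptions, not a validated descriptor. The rows are made up, the property names `palladioDisplayed` and `palladioDelimiter` are hypothetical, it assumes extra properties are tolerated on field descriptors as well as at package level, and the exact shape of "foreignKeys" should be checked against the linked spec.

```python
import json

package = {
    "name": "single-file-sketch",
    "resources": [
        {
            "name": "people",
            # In-line rows instead of "path" or "url" (made-up sample data).
            "data": [
                {"id": 1, "name": "Ada", "city_id": 10},
                {"id": 2, "name": "Grace", "city_id": 20},
            ],
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    # Hypothetical extra properties for dimension settings.
                    {"name": "name", "type": "string",
                     "palladioDisplayed": True, "palladioDelimiter": ";"},
                    {"name": "city_id", "type": "integer"},
                ],
                "foreignKeys": [
                    {"fields": "city_id",
                     "reference": {"resource": "cities", "fields": "id"}}
                ],
            },
        },
        {
            "name": "cities",
            "data": [{"id": 10, "label": "London"}, {"id": 20, "label": "New York"}],
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "label", "type": "string"},
                ]
            },
        },
    ],
}


def unresolved_foreign_keys(pkg):
    """Return (resource, value) pairs whose foreign key has no match."""
    by_name = {res["name"]: res for res in pkg["resources"]}
    missing = []
    for res in pkg["resources"]:
        for fk in res["schema"].get("foreignKeys", []):
            target = by_name[fk["reference"]["resource"]]
            known = {row[fk["reference"]["fields"]] for row in target["data"]}
            for row in res["data"]:
                if row[fk["fields"]] not in known:
                    missing.append((res["name"], row[fk["fields"]]))
    return missing


# One JSON string is all a browser-side download would need to carry.
single_file = json.dumps(package, indent=2)
print(len(single_file), "characters;", unresolved_foreign_keys(package))
```

Because everything lives in one JSON document, this fits the one-file-at-a-time constraint of client-side downloads, and a reader that ignores the extra properties still sees an ordinary set of tables.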

cc'ing @rgrp @pwalsh as they might have further thoughts on the above

@esjewett
Member

Hi Dan,

This is really encouraging. Given all this, I think it may be possible to simply move Palladio's data format to Data Packages, which would solve this problem for Palladio as well as Breve. It's going to be a process, but I'll probably start prototyping in a branch of the Palladio repo soon.

Thanks,
Ethan

@danfowler
Author

Hi @esjewett just flagging that I am available for any questions, both here and in our Frictionless Data chat: https://gitter.im/frictionlessdata/chat 👍
