Package API #1

Open
16 tasks
ijlyttle opened this issue Mar 23, 2019 · 5 comments


ijlyttle commented Mar 23, 2019

I think this package will revolve around an S3 class, tbl_stw, that will extend tibble's tbl_df.

There are a few things I have in mind for a tbl_stw to do:

Reading/writing flat-files:

  • stw_write(x, name, path) writes out a csv file and a yaml file
  • stw_read(name, path) builds a tbl_stw from a csv file and a yaml file
  • stw_colspec(file) given a yaml-file, generate a readr::read_csv() column-specification
  • stw_get_function_read(file) given a yaml-file, return a function (based on readr::read_csv()) that will read a csv file and return a tbl_df
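
A minimal sketch of the column-specification idea, assuming the yaml file has already been parsed into a data dictionary with `name` and `type` columns. The helper name `stw_colspec_compact()` and the type-to-code table are assumptions for illustration, not part of the proposal. It builds the compact string form that `readr::read_csv()` accepts for `col_types`; that form is matched by position and carries no factor levels or time zones, so a fuller version would need to handle those separately.

```r
# Hypothetical helper: map dictionary types to readr's compact col_types codes.
# Assumes the dictionary rows are in the same order as the csv columns.
stw_colspec_compact <- function(dictionary) {
  codes <- c(
    logical = "l", integer = "i", double = "d", Date = "D",
    POSIXct = "T", character = "c", factor = "f"
  )
  unknown <- setdiff(dictionary$type, names(codes))
  if (length(unknown) > 0) {
    stop("Unsupported type(s): ", paste(unknown, collapse = ", "))
  }
  paste(codes[dictionary$type], collapse = "")
}

# Illustrative dictionary, not taken from a real metadata file
dictionary <- data.frame(
  name = c("carat", "cut", "price"),
  type = c("double", "factor", "integer"),
  stringsAsFactors = FALSE
)
spec <- stw_colspec_compact(dictionary)
# spec could then be used as readr::read_csv(file, col_types = spec)
```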

Perhaps the metadata "wants" its own (perhaps internal) class: stw_meta...

Reading/writing package-data:

  • stw_use_data(x) wraps usethis::use_data() to publish data to a package, also writes out the data-documentation
  • stw_read_data(data, package) parses the documentation for a package-dataset into a tbl_stw
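
One way the "writes out the data-documentation" step might look, sketched as a function that turns metadata into roxygen comment lines. The name `stw_roxygen_lines()`, its arguments, and the example values are all hypothetical; only the roxygen layout (`@format`, `\describe{}`, `\item{}{}`, `@source`) follows the usual convention for documenting package datasets.

```r
# Hypothetical helper: build roxygen lines for a dataset from its metadata.
stw_roxygen_lines <- function(name, title, description, source, dictionary) {
  items <- sprintf(
    "#'   \\item{%s}{%s}", dictionary$name, dictionary$description
  )
  c(
    paste("#'", title),
    "#'",
    paste("#'", description),
    "#'",
    sprintf("#' @format A data frame with %d variables:", nrow(dictionary)),
    "#' \\describe{",
    items,
    "#' }",
    paste("#' @source", source),
    sprintf("\"%s\"", name)
  )
}

# Illustrative values only
dictionary <- data.frame(
  name = c("carat", "price"),
  description = c("weight of the diamond", "price in US dollars"),
  stringsAsFactors = FALSE
)
doc_lines <- stw_roxygen_lines(
  name = "diamonds",
  title = "Example dataset",
  description = "Illustrative description.",
  source = "Illustrative source.",
  dictionary = dictionary
)
# writeLines(doc_lines, "R/data.R") would write them next to the package data
```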

Accessing attributes:

  • stw_get_dict(x) returns a tibble for the data-dictionary, containing variables: name, type, description
  • stw_get_title(x) returns the title
  • stw_get_synopsis(x) returns the synopsis
  • stw_get_source(x) returns the source
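
A rough sketch of `stw_get_dict()`, assuming the dictionary is stored as a `description` attribute on each column. To stay dependency-free it returns a plain data frame rather than a tibble, and the rule used to derive `type` is an assumption.

```r
# Sketch: collect name, type and description for each column.
stw_get_dict <- function(x) {
  type_one <- function(col) {
    # classed vectors (factor, Date, POSIXct) report their class,
    # bare vectors report their storage type ("double", "integer", ...)
    if (is.object(col)) class(col)[1] else typeof(col)
  }
  describe_one <- function(col) {
    d <- attr(col, "description", exact = TRUE)
    if (is.null(d)) NA_character_ else d
  }
  data.frame(
    name = names(x),
    type = unname(vapply(x, type_one, character(1))),
    description = unname(vapply(x, describe_one, character(1))),
    stringsAsFactors = FALSE
  )
}

# Illustrative data; only one column has been described
x <- data.frame(
  carat = c(0.23, 0.21),
  colour = c("E", "E"),
  stringsAsFactors = FALSE
)
attr(x$carat, "description") <- "weight of the diamond"
dict <- stw_get_dict(x)
```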

Helpers:

  • stw_format_gt() helper function that returns a gt format (?)

Building a tbl_stw:

  • stw(.data) constructor
  • stw_title(x, title) used to provide a title for a dataset
  • stw_source(x, source) used to note the source of the dataset
  • stw_describe(x, ...) used to provide a longer description of a dataset, or to build a data-dictionary. Has syntax like dplyr::mutate(), but used to provide a character description of a column or columns. Maybe named arguments apply to the variables in the data frame, unnamed apply to the dataset itself.
  • stw_validate() used to make sure that all the columns are described.
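
The builder functions above could be sketched roughly as follows. This is an assumption-laden illustration, not an implementation: it layers the class on a base data frame instead of tbl_df to avoid dependencies, and it takes the "named arguments describe columns, unnamed arguments describe the dataset" reading of `stw_describe()`.

```r
# Constructor: add the tbl_stw class on top of the existing classes
stw <- function(.data) {
  stopifnot(is.data.frame(.data))
  class(.data) <- c("tbl_stw", setdiff(class(.data), "tbl_stw"))
  .data
}

stw_title <- function(x, title) {
  attr(x, "title") <- title
  x
}

stw_source <- function(x, source) {
  attr(x, "source") <- source
  x
}

# Named arguments describe columns; unnamed arguments describe the dataset
stw_describe <- function(x, ...) {
  dots <- list(...)
  nms <- names(dots)
  if (is.null(nms)) nms <- rep("", length(dots))
  for (i in seq_along(dots)) {
    if (nms[i] == "") {
      attr(x, "description") <- dots[[i]]
    } else {
      stopifnot(nms[i] %in% names(x))
      attr(x[[nms[i]]], "description") <- dots[[i]]
    }
  }
  x
}

# Error if any column lacks a description
stw_validate <- function(x) {
  described <- vapply(
    x,
    function(col) !is.null(attr(col, "description", exact = TRUE)),
    logical(1)
  )
  if (!all(described)) {
    stop("Undescribed column(s): ", paste(names(x)[!described], collapse = ", "))
  }
  invisible(x)
}

# Illustrative usage
x <- stw(data.frame(carat = c(0.23, 0.21), price = c(326L, 326L)))
x <- stw_title(x, "Example diamonds")
x <- stw_describe(x, "Two rows of illustrative data",
                  carat = "weight", price = "price in USD")
stw_validate(x)
```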

ijlyttle commented Mar 23, 2019

The next thing to consider is the structure of the tbl_stw class itself.

I think tbl_stw should be a tbl_df with additional attributes:

  • title: character title of the dataset, i.e. title found in package data
  • description: character description of the dataset
  • source: character source of the dataset

The data dictionary will use a description attribute attached to each of the columns in the data frame.

Restriction on types

I think there needs to be a restriction on the types possible for the columns. This limitation is imposed by the flat-files themselves, given that we will want to write, then read, without any loss. The allowed types would be:

  • logical
  • integer
  • double
  • Date
  • POSIXct, using timezone as metadata
  • character
  • factor, using levels as metadata
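
A small sketch of how the type restriction might be checked. `stw_check_types()` is a hypothetical name, and the check relies on `class()`, which reports doubles as "numeric".

```r
# Hypothetical check: return the names of columns whose type is not supported.
stw_check_types <- function(x) {
  # class() reports bare doubles as "numeric"
  allowed <- c(
    "logical", "integer", "numeric", "Date",
    "POSIXct", "character", "factor"
  )
  ok <- vapply(x, function(col) any(class(col) %in% allowed), logical(1))
  names(x)[!ok]
}

# Illustrative data with one unsupported list-column
x <- data.frame(
  id = 1:2,
  when = as.Date(c("2019-03-23", "2019-03-24"))
)
x$nested <- list(1, "a")
unsupported <- stw_check_types(x)
```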

ijlyttle changed the title from "New S3 class: tbl_stw" to "Package API" on Mar 24, 2019

ijlyttle commented Mar 24, 2019

We can also think about the structure of the stw_meta class.

In practice, a stw_meta object would be either extracted from a tbl_stw, or built from reading a metadata file (yaml).

It will have a:

  • name: character name of the dataset, e.g. "diamonds"
  • title: character
  • description: character
  • source: character
  • dictionary: tbl_df with one observation for each variable in the tbl_stw and variables:
    • name: character, name of the variable, e.g. "colour"
    • type: character, e.g. "character"
    • description: character
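
A sketch of what a stw_meta object could look like as a classed list. The constructor name `new_stw_meta()` and its checks are assumptions, and the example values are illustrative; the dictionary is a plain data frame here rather than a tbl_df.

```r
# Hypothetical low-level constructor for the stw_meta class
new_stw_meta <- function(name, title, description, source, dictionary) {
  stopifnot(
    is.character(name), length(name) == 1,
    is.data.frame(dictionary),
    all(c("name", "type", "description") %in% names(dictionary))
  )
  structure(
    list(
      name = name,
      title = title,
      description = description,
      source = source,
      dictionary = dictionary
    ),
    class = "stw_meta"
  )
}

# Illustrative values only
meta <- new_stw_meta(
  name = "diamonds",
  title = "Example title",
  description = "Example description",
  source = "Example source",
  dictionary = data.frame(
    name = "colour",
    type = "character",
    description = "diamond colour",
    stringsAsFactors = FALSE
  )
)
```

Because the object is a nested list, it should map fairly directly onto a yaml document with the same field names.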


iqis commented Apr 13, 2019

Incidentally, a few years back I implemented a naive solution to this: either a dcf file (remember these?) or the data frame's attributes documented the data's origin, owner, field names, dimensions, and a copy of the processing code that transformed the original data into its current state. I believe I can dig that out and take a look, as my 2c.

I wonder what would be an appropriate metadata schema to implement here for datasets? Schema.org quickly comes to mind. A quick search also gives me DataCite, which appears to be a newer initiative. There is also one from the Federal CIO Council.

@ijlyttle
Collaborator Author

Hi @iqis

Thanks for your interest! Looking forward to seeing what you have done already.

I had a quick look at the links you provided - I saw ways to describe entire datasets, but I did not see (maybe because I looked too quickly) ways to describe variables within a dataset.


iqis commented Apr 13, 2019

@ijlyttle,
I apologize for not being thorough with my thought. It's true that none of these already provide ways to fully describe individual variables, and that can be the work to be done.

Apparently, it is very hard to get people to agree on things. I just think it is better to adopt a prevailing schema, use the necessary attributes, and extend it, rather than inventing something entirely new.

What are your thoughts on this?
