Package API #1

ijlyttle · 2019-03-23T20:49:59Z

I think this package will revolve around an S3 class, tbl_stw, that will extend tibble's tbl_df.

There are a few things I have in mind for a tbl_stw to do:

Reading/writing flat-files:

stw_write(x, name, path) writes out a csv file and a yaml file
stw_read(name, path) builds a tbl_stw from a csv file and a yaml file
stw_colspec(file) given a yaml-file, generate a readr::read_csv() column-specification
stw_get_function_read(file) given a yaml-file, return a function (based on readr::read_csv()) that will read a csv file and return a tbl_df
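As a rough sketch of the idea behind stw_colspec(), the types recorded in the metadata could be translated into the compact column-type string that readr::read_csv() accepts for its col_types argument. The helper name stw_colspec_string() and the lookup table are assumptions made for illustration; the type names are the ones proposed later in this thread. A compact string is positional, and factor levels or time zones cannot be expressed this way, so a real implementation would more likely build a readr::cols() specification.

```r
# Hypothetical lookup: dictionary type name -> readr compact column code
stw_type_codes <- c(
  logical = "l", integer = "i", double = "d",
  Date = "D", POSIXct = "T", character = "c", factor = "f"
)

# Hypothetical helper: build a compact col_types string from a vector of
# type names, given in the same order as the columns of the csv file
stw_colspec_string <- function(types) {
  codes <- stw_type_codes[types]
  unknown <- types[is.na(codes)]
  if (length(unknown) > 0) {
    stop("Unsupported type(s): ", paste(unique(unknown), collapse = ", "))
  }
  paste(codes, collapse = "")
}

spec <- stw_colspec_string(c("double", "factor", "character", "Date"))
print(spec) # "dfcD"
```

The resulting string could then be passed along as readr::read_csv(file, col_types = spec).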

Perhaps the metadata "wants" its own (perhaps internal) class: stw_meta...

Reading/writing package-data:

stw_use_data(x) wraps usethis::use_data() to publish data to a package, also writes out the data-documentation
stw_read_data(data, package) parses the documentation for a package-dataset into a tbl_stw

Accessing attributes:

stw_get_dict(x) returns a tibble for the data-dictionary, containing variables: name, type, description
stw_get_title(x) returns the title
stw_get_synopsis(x) returns the synopsis
stw_get_source(x) returns the source

Helpers:

stw_format_gt() helper function that returns a gt format (?)

Building a tbl_stw:

stw(.data) constructor
stw_title(x, title) used to provide a title for a dataset
stw_source(x, source) used to note the source of the dataset
stw_describe(x, ...) used to provide a longer description of a dataset, or to build a data-dictionary. Has syntax like dplyr::mutate(), but used to provide a character description of a column or columns. Maybe named arguments apply to the variables in the data frame, unnamed apply to the dataset itself.
stw_validate() used to make sure that all the columns are described.
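To make the builder functions above a little more concrete, here is a minimal sketch of how they might fit together. It is not a settled design: a plain data.frame stands in for tbl_df to keep the example dependency-free, the attribute names (title, source, description) are assumptions, and stw_describe() takes the "named arguments describe columns, unnamed arguments describe the dataset" reading of the proposal.

```r
# Sketch only: a plain data.frame stands in for tibble's tbl_df
stw <- function(.data) {
  stopifnot(is.data.frame(.data))
  class(.data) <- unique(c("tbl_stw", class(.data)))
  .data
}

stw_title <- function(x, title) {
  attr(x, "title") <- title
  x
}

stw_source <- function(x, source) {
  attr(x, "source") <- source
  x
}

# Named arguments describe columns; an unnamed argument describes the
# dataset itself (if several are unnamed, the last one wins)
stw_describe <- function(x, ...) {
  dots <- list(...)
  nms <- names(dots)
  if (is.null(nms)) nms <- rep("", length(dots))
  for (i in seq_along(dots)) {
    if (nms[i] == "") {
      attr(x, "description") <- dots[[i]]
    } else {
      stopifnot(nms[i] %in% names(x))
      attr(x[[nms[i]]], "description") <- dots[[i]]
    }
  }
  x
}

# Fail if any column lacks a description
stw_validate <- function(x) {
  described <- vapply(
    x, function(col) !is.null(attr(col, "description")), logical(1)
  )
  if (!all(described)) {
    stop("Undescribed column(s): ",
         paste(names(x)[!described], collapse = ", "))
  }
  invisible(x)
}

# Usage with made-up data
toy <- stw(data.frame(height = c(150, 160), weight = c(50, 60)))
toy <- stw_title(toy, "Toy measurements")
toy <- stw_source(toy, "Invented for this example")
toy <- stw_describe(
  toy,
  "Two made-up rows for illustration.",
  height = "Height in cm",
  weight = "Weight in kg"
)
stw_validate(toy)
```

Each function returns the modified object, so the calls could also be chained with a pipe.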

ijlyttle · 2019-03-23T22:01:34Z

The next thing to consider is the structure of the tbl_stw class itself.

I think tbl_stw should be a tbl_df with additional attributes:

title: character title of the dataset, i.e. title found in package data
description: character description of the dataset
source: character source of the dataset

The data dictionary will use a description attribute attached to each of the columns in the data frame.
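For illustration, a data dictionary could be assembled from those per-column description attributes roughly as follows. This is a sketch under assumptions: a plain data.frame stands in for tbl_df, the example data and descriptions are made up, and reporting the S3 class for classed columns and the storage type otherwise is just one possible rule for the type column.

```r
# Hypothetical accessor: one row per column with name, type, description
stw_get_dict <- function(x) {
  col_type <- function(col) {
    # S3 class for classed columns (factor, Date, POSIXct),
    # storage type otherwise (logical, integer, double, character)
    if (is.object(col)) class(col)[1] else typeof(col)
  }
  col_desc <- function(col) {
    d <- attr(col, "description")
    if (is.null(d)) NA_character_ else d
  }
  data.frame(
    name = names(x),
    type = vapply(x, col_type, character(1), USE.NAMES = FALSE),
    description = vapply(x, col_desc, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Made-up data with dataset-level and column-level attributes
gems <- data.frame(
  carat = c(0.23, 0.21),
  cut = factor(c("Ideal", "Premium")),
  stringsAsFactors = FALSE
)
attr(gems, "title") <- "Toy gem data"
attr(gems$carat, "description") <- "Weight of the gem"
attr(gems$cut, "description") <- "Quality of the cut"

dict <- stw_get_dict(gems)
print(dict)
```

Columns without a description come back as NA, which is what a function like stw_validate() could check for.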

Restriction on types

I think the types allowed for the columns need to be restricted. The restriction is imposed by the flat files themselves, given that we will want to write, then read, without any loss.

logical
integer
double
Date
POSIXct, using timezone as metadata
character
factor, using levels as metadata
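One way such a restriction might be enforced is with a small table of predicates, one per allowed type, checked against every column. The names stw_allowed, stw_col_type() and stw_check_types() are hypothetical, and a plain data.frame is used in place of tbl_df.

```r
# Hypothetical table of allowed types; each entry tests a single column.
# !is.object() keeps classed vectors (Date and POSIXct are stored as
# doubles, factors as integers) from matching the bare storage types.
stw_allowed <- list(
  logical   = function(col) is.logical(col) && !is.object(col),
  integer   = function(col) is.integer(col) && !is.object(col),
  double    = function(col) is.double(col) && !is.object(col),
  Date      = function(col) inherits(col, "Date"),
  POSIXct   = function(col) inherits(col, "POSIXct"),
  character = function(col) is.character(col) && !is.object(col),
  factor    = function(col) is.factor(col)
)

# Name of the first allowed type that matches, or NA if none does
stw_col_type <- function(col) {
  for (type in names(stw_allowed)) {
    if (stw_allowed[[type]](col)) return(type)
  }
  NA_character_
}

# Return the type of every column, or fail on unsupported columns
stw_check_types <- function(x) {
  types <- vapply(x, stw_col_type, character(1))
  bad <- names(types)[is.na(types)]
  if (length(bad) > 0) {
    stop("Unsupported column type(s) in: ", paste(bad, collapse = ", "))
  }
  types
}

ok <- data.frame(
  id = 1:2,
  day = as.Date(c("2019-03-23", "2019-03-24")),
  when = as.POSIXct(c("2019-03-23 20:49:59", "2019-03-24 13:02:38"), tz = "UTC"),
  grade = factor(c("a", "b")),
  stringsAsFactors = FALSE
)
types <- stw_check_types(ok)
print(types)

# A list column cannot survive a csv round trip, so it is rejected
bad <- ok
bad$nested <- list(1, "a")
result <- try(stw_check_types(bad), silent = TRUE)
```

The time zone of a POSIXct column and the levels of a factor would still need to be recorded separately as metadata, as noted in the list above.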

ijlyttle · 2019-03-24T13:02:38Z

We can also think about the structure of the stw_meta class.

In practice, an stw_meta object would be either extracted from a tbl_stw or built by reading a metadata file (yaml).

It will have a:

name: character name of the dataset, e.g. "diamonds"
title: character
description: character
source: character
dictionary: tbl_df with one observation for each variable in the tbl_stw and variables:
- name: character, name of the variable, e.g. "colour"
- type: character, e.g. "character"
- description: character
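A low-level constructor for this structure might look something like the sketch below. The name new_stw_meta() and the specific checks are assumptions, the example values are made up, and a plain data.frame stands in for the tbl_df dictionary.

```r
# Hypothetical constructor: validates the fields listed above and returns
# a list with class "stw_meta"
new_stw_meta <- function(name, title, description, source, dictionary) {
  is_string <- function(x) is.character(x) && length(x) == 1
  stopifnot(
    is_string(name),
    is_string(title),
    is_string(description),
    is_string(source),
    is.data.frame(dictionary),
    identical(names(dictionary), c("name", "type", "description")),
    all(vapply(dictionary, is.character, logical(1)))
  )
  structure(
    list(
      name = name,
      title = title,
      description = description,
      source = source,
      dictionary = dictionary
    ),
    class = "stw_meta"
  )
}

# Made-up metadata for illustration
meta <- new_stw_meta(
  name = "gems",
  title = "Toy gem data",
  description = "Two made-up rows for illustration.",
  source = "Invented for this example",
  dictionary = data.frame(
    name = c("carat", "colour"),
    type = c("double", "character"),
    description = c("Weight of the gem", "Colour grade"),
    stringsAsFactors = FALSE
  )
)
str(meta)
```

Because the object is an ordinary list, it should map naturally onto a yaml file with the same field names.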

iqis · 2019-04-13T19:20:37Z

Incidentally, a few years back I implemented a naive solution to this: either a dcf file (remember these?) or the data frame's attributes documents the data's origin, owner, field names, dimensions, and a copy of the processing code that transformed the original data into its current state. I believe I can dig that out and take a look, as my 2c.

I wonder what would be an appropriate metadata schema to implement here for datasets? Schema.org quickly comes to mind. A quick search also gives me DataCite, which appears to be a newer initiative. There is also one from the Federal CIO Council.

ijlyttle · 2019-04-13T19:42:40Z

Hi @iqis

Thanks for your interest! Looking forward to seeing what you have done already.

I had a quick look at the links you provided - I saw ways to describe entire datasets, but I did not see (maybe because I looked too quickly) ways to describe variables within a dataset.

iqis · 2019-04-13T20:28:41Z

@ijlyttle,
I apologize for not being thorough with my thought. It's true that none of these already provide ways to fully describe individual variables, and that can be the work to be done.

Apparently, it is very hard to get people to agree on things. I just think it is better to adopt a prevailing schema, use the necessary attributes, and extend it, rather than inventing something entirely new.

What are your thoughts on this?

ijlyttle changed the title ~~New S3 class: tbl_stw~~ Package API Mar 24, 2019

ijlyttle mentioned this issue Apr 4, 2019

R Package that documents data files and sensitive data (PPI) within the file uncoast-unconf/uu-2019#19

Open
