Sub-classes of StatisticalDataset #15

Closed
FranckCo opened this issue Jun 6, 2019 · 25 comments

@FranckCo
Member

FranckCo commented Jun 6, 2019

Decided during May 7 meeting: define what sub-classes of StatisticalDataset we want.

Example candidate: TimeSeries.

Linked to issue #6.

@JALinnerud
Collaborator

Event History?

@JALinnerud
Collaborator

Checking GSIM v1.2 Dataset
Definition: An organized collection of data.
Explanatory text: Examples of Data Sets could be observation registers, time series, longitudinal data, survey data, rectangular data sets, event-history data, tables, data tables, cubes, registers, hypercubes, and matrixes. A broader term for Data Set could be data. A narrower term for Data Set could be data element, data record, cell, field.

@abrycsaba
Collaborator

There was an EU VIP project named ADMIN. Here is what I found there on data classification:
https://ec.europa.eu/eurostat/cros/content/statistical-data_en

In my opinion, we should consider only those statistical data classifications (e.g. by level of aggregation or stage of processing) that can later be used as parameters by IT applications handling those kinds of data. Such a parameter can trigger different kinds of tasks for different kinds of data sets in an IT application.

@JALinnerud
Collaborator

I remember that pre-GSIM we had classifications, but while creating GSIM it was pointed out that our classifications are stricter than other classifications (their elements are mutually exclusive and complete), so we were persuaded to call the GSIM information object Statistical Classification.
I am not convinced that our datasets are any different from anyone else's datasets. I do not see an advantage in creating a specialisation. Our GSIM Datasets do inherit from Identifiable Artefact, so maybe that is an essential difference?
Maybe we could just use themes or domains to say that our datasets are within statistics? The same could be done for almost all GSIM information objects, so that we do not need to put 'Statistical' in front of them all. Or maybe we could use a namespace gsim: ?

@zoltanvereczkei
Collaborator

What do our ModernStats models say about statistical dataset (or dataset in general)?

GSBPM

Does not mention data sets. Basically it doesn’t have to. It’s a process model.

GSIM

• Data Set: An organized collection of data. Examples of Data Sets could be observation registers, time series, longitudinal data, survey data, rectangular data sets, event-history data, tables, data tables, cubes, registers, hypercubes, and matrixes. A broader term for Data Set could be data. A narrower term for Data Set could be data element, data record, cell, and field.
• Unit Data Set: A collection of data that conforms to a known structure and describes aspects of one or more Units. Example: A synthetic unit record file is a collection of artificially constructed Unit Data Records, combined in a file to create a Unit Data Set. Synonyms: Micro data, unit data, synthetic unit record file
• Dimensional Data Set: A collection of dimensional data that conforms to a known structure.
• Information Set: Organized collections of statistical content. Statistical organizations collect, process, analyse and disseminate Information Sets, which contain data (Data Sets), referential metadata (Referential Metadata Sets), or potentially other types of statistical content, which could be included in additional types of Information Set.

GAMSO

Does not mention datasets. It doesn’t have to as data sets are information objects.

CSDA

Although we do not take CSDA into consideration, the document contains a classification of data sets, which is as follows:

• Explorative: Data that is obtained from outside sources, is usually “sampled” and is used to assess the nature, structure and quality (usability) of that data source. After the exploration, this data in most cases loses its value.
• Organizational: The true (data) assets of the organization, that are to be treated as such and must be protected and shared where possible. An important sub-type of “Organizational” is the Master Data such as statistical registers, back-bones of populations, collections of statistical units. For instance: Company register, People Register, Buildings register.
• Temporary, local: Data that is produced as an intermediate product in a statistical process and has no real value outside that process. This data usually loses its value after the process (cycle) is completed, but may have value for the next cycle as a reference. May be persisted within the process space.


Conclusion and suggestion:

Only classifications that are relevant from an information management point of view should be taken on board, meaning that the resulting kinds of datasets have to be handled differently (other processes, other methods, etc.). The sub-classes should also have a statistical perspective, to make it easy for the user of the ontology (a statistician) to understand the sub-classes we define.

GSIM provides one kind of classification (according to structure) for data sets, which is
• Unit data set
• Dimensional data set

This also corresponds to the breakdown into microdata and tabular (aggregated) data. We think that this breakdown provided by GSIM is a good basis. If we need a further breakdown, we first need to agree on its purpose.

• We can differentiate unit data sets by source (data collection, data transmission, other / unimode, multimode, etc.) or by the phases in which the dataset is made available (corresponding to GSBPM Phases IV, V, VI, VII).
• We can classify dimensional data sets by the type of data included (Nominal, Ordinal, Discrete, Continuous), by domain, or by sensitivity (SDC perspective).

Maybe we can move forward with the two sub-classes currently defined by GSIM. To be honest, this is more a feeling than a well-founded opinion…

@FlavioRizzolo
Collaborator

Datasets can be classified along so many dimensions that it's really hard to come up with a comprehensive list. Summing up the postings above, some of them are:

  • scope: explorative, organizational, local (as per CSDA)
  • domain: social, economics, etc.
  • granularity: micro (unit), aggregate (dimensional)
  • sensitivity: privacy/confidentiality related (public vs. confidential? some scale from 1 to n?)

I'd add also

  • status: preliminary, edited/imputed, final, revised, etc.

I'm not sure how to classify nominal, ordinal, discrete, continuous because I don't know what the first two mean, and time series, longitudinal, event history either... Types of data?

@ChLaaboudi

The code lists used in DCAT are available in EU Vocabularies. Relevant for us:

  • Theme (13 themes used for classifying datasets in EU and European open data portals)
  • Access right (Sensitivity)

A code list CL_CONF_Status is available in the SDMX Global Registry.

@JALinnerud
Collaborator

Sometimes I struggle to find the human readable content under EU Vocabularies.
A hint is to click on the blue button Browse content on the right hand side after you have chosen your vocabulary.
For dataset type that takes you to https://op.europa.eu/en/web/eu-vocabularies/concept-scheme/-/resource?uri=http://publications.europa.eu/resource/authority/dataset-type
For access rights that takes you to https://op.europa.eu/en/web/eu-vocabularies/concept-scheme/-/resource?uri=http://publications.europa.eu/resource/authority/access-right

@JALinnerud
Collaborator

ONS has a data sensitivity model and a content sensitivity model. In Statistics Norway we have adopted 4 levels of Privacy from the ONS model: private, confidential, commercial, open. ONS also had a Sensitivity Assessment tool that we have translated to Norwegian and are implementing in our organisation as part of our GDPR compliance.

@dgillman4909

These are all interesting comments. I think Flavio is on the right track. He comments there are many dimensions by which to create subtypes of datasets. I propose we follow the definition of dataset from GSIM - organized collection of data - and build subtypes based on organization. Other criteria for identifying subtypes are not germane from that point of view, and I comment below on why I think we should not base subtypes on them.

In DDI-CDI, 4 basic structural types of organizing data sets have been defined: rectangular, event history, key-value pair, and dimensional. Several of the types could be used to structure the same data. There is not a canonical structure in all cases, though some data is much more amenable to one structure over the others.

The types are defined roughly as follows:

  1. rectangular (or wide) - rows are units and columns are variables
  2. event history (or tall or long) - rows are based on the value for each variable, one unit at a time - and this could be visualized as rows are variables and columns are units
  3. dimensional - a pre-defined set of cells defined by the combination of categories, one from each of a set of dimensions (category sets), used to handle the value of some measure (variable) restricted to the cell
  4. key-value - a set of values, each associated with some key

Dimensional data are usually associated with aggregates. Key-value data are often taken from scraping the web. Event-history data are used to describe events over some time period.
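
To make the point that several of these structures can carry the same data, here is a minimal Python sketch. The toy records, the function names (`to_long`, `to_key_value`, `to_dimensional`) and the `unit/variable` key convention are assumptions made for illustration only; none of them come from DDI-CDI.

```python
from itertools import product

# Hypothetical toy data in rectangular (wide) form:
# rows are units, columns are variables.
rectangular = [
    {"unit": "p1", "sex": "F", "age": 34},
    {"unit": "p2", "sex": "M", "age": 51},
]

def to_long(rows, id_key="unit"):
    """Rectangular -> long: one row per (unit, variable, value)."""
    return [
        (row[id_key], var, value)
        for row in rows
        for var, value in row.items()
        if var != id_key
    ]

def to_key_value(rows, id_key="unit"):
    """Rectangular -> key-value: each value is stored under a composite key."""
    return {f"{unit}/{var}": value for unit, var, value in to_long(rows, id_key)}

def to_dimensional(rows, dimensions):
    """Rectangular -> dimensional.

    `dimensions` maps each dimension name to its category set. The cells are
    pre-defined as every combination of categories; the measure stored in a
    cell is the count of units falling into it.
    """
    names = list(dimensions)
    cells = {combo: 0 for combo in product(*(dimensions[n] for n in names))}
    for row in rows:
        cells[tuple(row[n] for n in names)] += 1
    return cells

long_rows = to_long(rectangular)
key_value = to_key_value(rectangular)
cube = to_dimensional(rectangular, {"sex": ["F", "M"]})
```

Note that the dimensional form here is an aggregate (a count per cell), so it cannot be turned back into the unit records, whereas the long and key-value forms keep all the information of the rectangular one.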

The nominal, ordinal, interval, ratio are not used to differentiate datasets. Rather, they are families of datatypes used to describe variables. Nominal data are those conforming to a finite set of categories with no other conditions (sex categories). Ordinal data are those conforming to an ordered finite set of categories, but the difference between adjacent categories is not necessarily uniform (Likert scale measures of satisfaction). Interval data are numeric with no zero (absence of quantity) defined (Celsius temperature). Ratio data are numeric with a defined zero (Kelvin temperature). These apply to any kind of statistical data.
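
As a rough illustration of why these levels belong to variables rather than to datasets, the sketch below attaches a level to each variable and derives which comparisons are meaningful for it. The `Level` enum, the operation names and the example variables are assumptions chosen for this sketch.

```python
from enum import IntEnum

class Level(IntEnum):
    """Families of datatypes, ordered from least to most structure."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4

# Operation that becomes meaningful at each level; levels are cumulative.
_NEW_OPERATION = {
    Level.NOMINAL: "equality",
    Level.ORDINAL: "ordering",
    Level.INTERVAL: "difference",
    Level.RATIO: "ratio",
}

def allowed_operations(level):
    """Return every operation meaningful at `level` and below."""
    return {_NEW_OPERATION[lvl] for lvl in Level if lvl <= level}

# Hypothetical variables: the level is a property of each variable,
# so a single dataset can mix all four.
variables = {
    "sex": Level.NOMINAL,
    "satisfaction": Level.ORDINAL,
    "temp_celsius": Level.INTERVAL,
    "temp_kelvin": Level.RATIO,
}
```

For instance, "twice as hot" is only meaningful for `temp_kelvin`, because `allowed_operations` includes `"ratio"` only at the ratio level.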

The distinction between aggregate and unit data is based on the definition of the variables in the dataset. A dataset can contain both unit and aggregate data.

Access restrictions on data (e.g., public, restricted, private) are assigned by the business and can change over the life-cycle of the dataset.

The domain for a dataset is defined by the subject field that data apply to. However, some datasets are merged from others, so a merged set can have the combination of the subject fields of its constituents. There seems to be no restriction on the number of subject fields.

Mode of transmission is not definitional for a dataset, as a single dataset can be obtained multiple ways. The phases of GSBPM may not be useful, as a single dataset can pass through a phase without change. Further, the phases impose a usage criterion (data for collection; data for editing; etc.) that seems arbitrary and would be useless in another domain (outside statistics).

Similarly, the explorative, temporary, and organizational categorization is based on intent, rather than the data per se. Plus, the categorization could change without any change to the data. If we change the organizational structure described above (rectangular, etc.), then we should call that a new dataset.

@FranckCo
Member Author

Decided at the May 25 meeting:

  • create four sub-classes of coos:StatisticalDataset corresponding to the types listed by Dan above, say coos:RectangularDataset, coos:EventHistoryDataset, coos:DimensionalDataset and coos:KeyValueDataset.
  • all other categorizations should be rendered by properties with enumerated ranges (concept schemes)
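
A possible reading of this decision, sketched in Python as plain triples rendered in a Turtle-like form. The class names are the ones listed above; everything under the `ex:` prefix (the `accessLevel` property and the `AccessLevelScheme` concept scheme) is hypothetical and only illustrates the "property with an enumerated range" pattern. The access categories are the examples given earlier in the thread (public, restricted, private). No namespace URI is declared, since the thread does not give one.

```python
# Sub-classes named in the decision above.
SUBCLASSES = [
    "RectangularDataset",
    "EventHistoryDataset",
    "DimensionalDataset",
    "KeyValueDataset",
]

# Hypothetical property -> (concept scheme, concepts), for illustration only.
ENUMERATED_PROPERTIES = {
    "accessLevel": ("AccessLevelScheme", ["public", "restricted", "private"]),
}

def subclass_triples():
    """One rdfs:subClassOf triple per structural sub-class."""
    return [
        (f"coos:{name}", "rdfs:subClassOf", "coos:StatisticalDataset")
        for name in SUBCLASSES
    ]

def scheme_triples():
    """A property on the dataset class plus a SKOS concept scheme for its values."""
    triples = []
    for prop, (scheme, concepts) in ENUMERATED_PROPERTIES.items():
        triples.append((f"ex:{prop}", "rdf:type", "owl:ObjectProperty"))
        triples.append((f"ex:{prop}", "rdfs:domain", "coos:StatisticalDataset"))
        triples.append((f"ex:{scheme}", "rdf:type", "skos:ConceptScheme"))
        for concept in concepts:
            triples.append((f"ex:{concept}", "rdf:type", "skos:Concept"))
            triples.append((f"ex:{concept}", "skos:inScheme", f"ex:{scheme}"))
    return triples

def to_turtle(triples):
    """Render triples one per line (prefix declarations omitted)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

print(to_turtle(subclass_triples() + scheme_triples()))
```

The design point is that the structural type is fixed by the class, while a categorization such as access level can change over the life-cycle of a dataset without changing its class.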

@FranckCo
Member Author

Remaining questions:

  • should we create a Metadataset sub-class of coos:StatisticalDataset?
  • Where would graph data (RDF, property graph) go in the typology chosen?

@flo7894
Collaborator

flo7894 commented May 27, 2021

Current GSIM model distinguishes between Data Set and Referential Metadata Set, both being sub-classes of Information Set. A metadataset sub-class of coos:StatisticalDataset might not be fully consistent with GSIM then.

@flo7894
Collaborator

flo7894 commented May 27, 2021

Considering RDF data as triples, I think they would be best rendered by the long format (event history). The subject of the triple would be the identifier component, the predicate would be the variable descriptor component, and the object would be the variable value component.
If needed, the named graph of the triple could be stored as an attribute component.
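
The mapping just described can be sketched as follows. The `LongRecord` type, the `quads_to_long` function and the `ex:` identifiers are assumptions made for this illustration; they are not part of DDI-CDI or of any RDF library.

```python
from typing import NamedTuple, Optional

class LongRecord(NamedTuple):
    identifier: str               # subject of the triple
    variable: str                 # predicate (variable descriptor)
    value: str                    # object (variable value)
    graph: Optional[str] = None   # named graph, kept as an attribute

def quads_to_long(statements):
    """Map triples (or quads, when a named graph is present) to long records."""
    records = []
    for statement in statements:
        subject, predicate, obj = statement[:3]
        graph = statement[3] if len(statement) > 3 else None
        records.append(LongRecord(subject, predicate, obj, graph))
    return records

# Hypothetical statements: two about ex:p1 in a named graph, one plain triple.
statements = [
    ("ex:p1", "ex:age", "34", "ex:survey2021"),
    ("ex:p1", "ex:sex", "F", "ex:survey2021"),
    ("ex:p2", "ex:age", "51"),
]
records = quads_to_long(statements)
```

As in any long structure, the same identifier (`ex:p1`) appears on several records, one per variable.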

In the documents currently available for the DDI-CDI review, the classes are defined as follows (see Part_2_DDI-CDI_Detailed_Model_PR_1.pdf):

  • Wide Data: Traditional rectangular unit record data sets. Each record has a unit identifier and a set of measures for the same unit.
  • Long Data: Each record has a unit identifier and a set of measures but there may be multiple records for any given unit. The structure is used for many different data types, for example event data and spell data.
  • Multi-Dimensional Data: Data in which observations are identified using a set of dimensions. Examples are multi-dimensional cubes and time series. (Note that support is provided for time-series-specific constructs to support some legacy systems which are not based around the manipulation of multi-dimensional data “cubes”.)
  • Key-Value Data: A set of measures, each paired with an identifier, suited to describing NoSQL and Big Data systems.

Do we agree to use those as definitions for the coos classes?

I don't know if the class names WideDataSet and LongDataSet will stay the same in the final DDI-CDI specification or if they will be replaced by RectangularDataset and EventHistoryDataset. Either way, shouldn't we name the coos classes the same way DDI-CDI does?

@FlavioRizzolo
Collaborator

Note that in the DDI-CDI model the actual name of the third type of data is Dimensional. We use Multi-Dimensional only informally in the documentation because people relate to that.

How are we introducing these concepts in the ontology? As data or as datasets?

In either case, I think the definitions need some work; for the most part they look like explanatory text to me. Perhaps we could start them all with something like "organized collection of data in which..." and then provide the characterization. That would be in line with the way they are all defined in GSIM and DDI-CDI.

That brings us to the question, I think, of what to do with the definitions when the classes already exist and are defined in one of the base models we are integrating. Do we rewrite them here or use them as-is from the source?

@dgillman4909

dgillman4909 commented May 27, 2021 via email

@egreising

I apologize if this comment comes late, but I have two points after carefully reading all the comments:

  1. I don't think that the four sub-classes of coos:StatisticalDataset corresponding to the types listed by Dan above (coos:RectangularDataset, coos:EventHistoryDataset, coos:DimensionalDataset and coos:KeyValueDataset) will be enough. What happens with other formats that can be found in the statistical domain, like "Transposed" or "Unstructured", or maybe "Blockchain" in the future? Is it possible to add sub-classes?
  2. I think coos:StatisticalMetadataset should be a class at the same level as coos:StatisticalDataset.

@dgillman4909

dgillman4909 commented Jun 3, 2021 via email

@egreising

Dan,

Thank you for your reply. Transposed is an old format that certain statistical data warehouses used to implement to make I/O very efficient by minimizing data transfer. The data is stored in multiple binary files, one for each variable, with all the values in the same order. The nth values of all the files together compose a unit record. The structure can be complemented with a B-tree for indexing each unit. An example of a statistical product using such a format is REDATAM.
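
A minimal sketch of the transposed layout, with in-memory lists standing in for the per-variable binary files. The variable names, the values and the helper names are made up for the example, and the B-tree index is left out; this is not how REDATAM itself is implemented.

```python
# One "file" per variable; the values of all units are stored in the same order.
columns = {
    "age": [34, 51, 28],
    "sex": ["F", "M", "F"],
}

def read_variable(columns, variable):
    """Reading one variable touches only that variable's storage."""
    return columns[variable]

def unit_record(columns, n):
    """The nth value of every variable, taken together, is the nth unit record."""
    return {variable: values[n] for variable, values in columns.items()}

ages = read_variable(columns, "age")
second_unit = unit_record(columns, 1)
```

The I/O gain comes from `read_variable`: a tabulation over one or two variables never has to load the others, while rebuilding a full unit record requires one access per variable.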

I don't know enough about "blockchain", just that it is a back-linked list of "transactions" with a header containing metadata. I don't know if it will ever be used for statistical processes, but it could be. As far as I know, there are many ways of implementing it, using databases or even flat files, which, in terms of storage, makes it no different from the "Rectangular" or "Dimensional" types. In my understanding, it is not the data storage format that differentiates the types, but the way the information is organized, and blockchain is different from a rectangular or dimensional dataset.

Regarding StatisticalMetadataSet, I don't fully understand your point on "there is no structure implied by simply being metadata".
If you use a DDI-C template for reference metadata, these metadata sets have a structure. Similarly, when you exchange reference metadata in SDMX there is always an MSD that defines the structure of the metadata set. And what is more important for me is that these structures are different from and independent of the data structures. That's why I think that StatisticalMetadataSet is a different class.

Best,
Edgardo

@dgillman4909

dgillman4909 commented Jun 4, 2021 via email

@FlavioRizzolo
Collaborator

Another side of this discussion is to look carefully at the DDI-CDI and GSIM models in order to do a mapping. CDI Wide Data Structure is not the same as GSIM Unit Data Structure: the former is uniform, in the sense that each row has the same components/columns, whereas the latter is heterogeneous, in the sense that each row might be associated with a different logical record. Just something to keep in mind.
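
To make the uniform/heterogeneous distinction concrete, here is a small sketch under the reading given in this comment. The rows, the `record_type` field and both helper functions are hypothetical; neither model prescribes this representation.

```python
def is_uniform(rows):
    """True when every row carries exactly the same components (columns)."""
    return len({frozenset(row) for row in rows}) <= 1

def group_by_layout(rows):
    """Split rows into groups sharing the same set of components."""
    groups = {}
    for row in rows:
        groups.setdefault(frozenset(row), []).append(row)
    return list(groups.values())

# Uniform: the same columns on every row.
wide_rows = [
    {"unit": "p1", "age": 34, "sex": "F"},
    {"unit": "p2", "age": 51, "sex": "M"},
]

# Heterogeneous: each row follows the layout of its own logical record.
mixed_rows = [
    {"record_type": "household", "unit": "h1", "size": 2},
    {"record_type": "person", "unit": "p1", "age": 34},
]
```

One consequence for a mapping: a heterogeneous file like `mixed_rows` would first have to be split per logical record (here with `group_by_layout`) before each part could be described as a uniform wide structure.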

@FranckCo
Member Author

FranckCo commented Jul 4, 2021

An ad hoc meeting was held on June 30th; Florian updated the ontology accordingly (see commit 217590e):

  • EventHistoryDataset renamed TransposedDataset
  • GraphDataset added (definition is missing)
  • no specific subtype for metadata: a property could be used
  • translations are missing
  • remaining question on mapping StatisticalDataset to GSIM (Information Set)

@FranckCo
Member Author

FranckCo commented Jul 5, 2021

Regarding the property indicating that a dataset contains metadata, we could have a simple boolean "isMetadata" property or an object property like "metadataFor" whose range could be the union of prov:Entity and prov:Activity (for process metadata). For metadata not attached to a particular process or entity (e.g. a statistical classification), the value of "metadataFor" could just be the "Official Statistics" individual.

@FranckCo
Member Author

FranckCo commented Jul 5, 2021

Current state of things regarding the "products" domain:
[Image: coos-prod-ds]

@FranckCo FranckCo assigned FranckCo and unassigned flo7894 Jul 26, 2021
@FranckCo
Member Author

Actually, limiting the range of the property might prevent some use cases (e.g. metadata on a prov:SoftwareAgent), so it is preferable to leave the range open.

9 participants