Skip to content

Deep Data Structures Manipulation Design

Ilya Sher edited this page Oct 18, 2021 · 26 revisions

Deep Data Structures Manipulation Design - WIP

As data-focused programming language, Next Generation Shell should have deep data structures manipulaton as part of the standard library.

NGS GitHub Issue #437: Design Deep Data Structures Operations

Requirements

  • Minimal syntax changes (implemented pattern -> action)
  • Customizable for user-defined types (postponed)
    • Should work with tree-like data structure where "pre" and "post" visit are alternatives (processing the node or the children first)
  • Should work with flat data structures as well (Arr, Hash) (implemented)
  • Integrate well with existing facilities (implemented)
  • support transformation (map-like behavior) (implemented)
  • support filtering (filter-like behavior) (implemented)
  • provide context to deep_... functions callbacks (path, etc), not only current value (implemented)
  • support passing arbitrary values from the callback to other callbacks invocations (todo)
  • support in-place modifications? (maybe later)
  • maybe: support for data which is not present yet - when needed additional data will be lazily fetched (think AWS results pagination, where you maybe find what you are looking for before fetching all pages) (maybe later)
  • maybe: support implicit loops (not sure what's the use case) (maybe later)
  • maybe: support of pub/sub (maybe somewhat like Go channels and select) (maybe later)

Sample use cases

  • Remove all Hashes that match {'url': ...} which are in an array which resides under extension key in any Hash in the data. (implemented: my_data.tr({"url": AnyOf("http://hl7.org/fhir/StructureDefinition/maxValue", "http://hl7.org/fhir/StructureDefinition/minValue")} -> skip(Y)))
  • Remove all empty arrays/hashes/strings (or strings containing only spaces) from a data structure (implemented: my_data.tr([] -> skip(Y)))
  • Parse JSON files into NGS objects (more thought needed regarding schema and serialization/deserialization in general) (maybe later)
  • Prepend/append element to an array (implemented: tr([1,2,[3,4]], Arr -> { ["start"] + descend(B) + ["end"] } ))
  • "Deep uniq" - remove duplicate primitive values anywhere, remove empty Hashes or Arrays. See https://github.com/ngs-lang/nsd/blob/5c6cd31930cabe28b20bea6fb13e0631e655e2f5/data-manipulation/deep-uniq.ngs (maybe later)

Prior Art

In no particular order:

Design (WIP!)

Overview

  • New type - TrContext - transformation context (implemented)
  • PatternAction objects are restricted to (Any -> Fun) type in this design, at least for now. Actions are called if patterns match.
    • Note that this allows defining functions like for example skip(Any, TrContext) and then using them like Something -> skip. (don't know how to solve yet. dangerous because later skip(?, pattern:Any) can be defined and might potentially take precedence)
  • New method tr(data:Any, transformation:Any)

Patterns

UPM patterns will be used. UPM patterns need to be updated to return MatchResult (containing both indexed and named matches)

TBD: Whether UPM functionality will be extended with something specific for tr().

UPM should have a way to match on the context.

Actions

Action callbacks (in pattern -> action) will be called with the following arguments:

  • Current matched data
  • Context object with:
    • [N] - Positional matches (up until now)
    • .something - Named matches (up until now), can not start with _.
    • ._path - an array which represents the path of current focus relative to object root. [] would signify the root. (implemented)
    • ._root - the root object with which tr() (or =~) was invoked. (implemented)
    • ._cur - current item being processed (implemented, used by descend(TrContext), not sure should be "public API")
    • ._transformation - currently active transformation rules (implemented)
    • .del() method to delete currently matching data: bad_data_pattern -> { B.del() } (maybe later)
    • .replace() ... bad_data_pattern -> { B.replace("(warning, bad data: $A)") } (maybe later)
    • .return(v) - return v as top level value of the tr() operation (not done yet, open issue: how to set a default if .return() is not called)
    • .descend() - continue recursion into current focus object. (implemented) TBD: pass rules/mode? Maybe .tr()?
    • .skip() - exclude current focus from the new structure (implemented)

An action callback can either

  • Return a value to be used as replacement in the newly generated data structure or
  • Call one of the methods on the context object (to affect original data structure or do other, TBD, operations)

Code Examples that Should Work

# Removing specific extension from FHIR Questionnaire (implemented)
questionnaire.tr({'url': 'my_extension'} -> skip(Y))

# Removing extension can lead to empty array, which is invalid in FHIR

questionnaire.tr({[] -> skip(Y)})

# Is our extension used?
questionnaire.tr({'url': 'my_extension'} -> {B.return(true)})

# maybe # our_ext_used = questionnaire =~ Deep({'url': 'my_extension'})

Open Issues / To Think

  • What would tr(data:Any) mean? (Without any transformations passed). Currently nothing. Maybe: cleanup all empty arrays/hashes/strings.
  • Think about peg/leg style parsing, where
    • The focus is moving inside a string
    • Patterns which are defined later are referenced from earlier rules
  • Augment UPM with =~~ and !~~ for recursive search? Probably not but provide Deep(pattern) facility.
  • Start deprecating the older ~ match?
  • Base the solution exclusively on Universal Pattern Matching?
    • Probably not, sounds like filter()-like and map()-like functionality should be in anyway.
  • A function to get to element at given path? Example: data.at([0, "a", 10, "b"])
    • How to behave for missing path? Exception?
    • Would it work with patterns for path and handle multiple paths? (Example: data.at([Int, {"x": 1}]))
  • Visit the node first or children first?
    • Implement as visit_node() and visit_children()?
    • visit_node() would be a no-op for non-tree containers such as Arr?
  • "view" design that would allow modifying or operating on previously selected paths/items. -- supported by specifying the rules that only work with "interesting" objects + default rules to recursively apply rules.
  • Context object (to pass to callbacks to each/visit/etc) with
    • Current path
    • Most recent container, index/key, and value
  • Should callbacks be multimethods? (Considering that currently there is no syntax for expressing unnamed multimethods so specifying multimethod inline is not straightforward).
  • Naming: deep_... vs ..._deep

Information

  • Searches:
    • nested data structures
    • deep data structures
    • graph data structures (less applicable)