Support one-to-many binding between a logical dataset and the physical table, view, or query #61

redblackcoder · 2026-02-09T18:31:02Z

Problem

The current spec models a dataset (a logical dataset schema) and its underlying physical source as a tight 1:1 coupling. This forces authors to duplicate datasets and semantic models for each variety of underlying physical dataset, which adds governance overhead to keep the semantic models and metrics correct and in sync.

Consider a dataset with high-cardinality fields that supports in-depth analysis in pipelines like Spark, and a pre-aggregated dataset derived from it, such as a cube. Both share the same logical semantic model and can answer the same questions through metric definitions. With the currently proposed 1:1 mapping, however, they would require two completely independent semantic models.

Proposal

Extract the binding between a logical dataset and its physical counterpart into a separate structure. This allows the same semantic model to be reused and layered over multiple physical datasets, avoiding duplication and providing a single high-quality semantic model that works across varied systems.

# Note, the addition of bindings
semantic_model:
  - name: string
    description: string
    ai_context: string
    datasets: []
    bindings: []
    relationships: []
    metrics: []
    custom_extensions:
      - vendor_name: string  # Must be one of the values from 'vendors' enum above
        data: string

# Note, no source field in the dataset
datasets:
  - name: string
    primary_key: []  # Array of column names (single or composite)
    unique_keys:
      - []  # Array of column names (single or composite)
    description: string
    ai_context: string
    fields: []
    custom_extensions:
      - vendor_name: string
        data: string

bindings:
  # Required: The unique name for the binding
  - name: string

    # Required: A list of bindings between the logical fields and the physical fields
    field_bindings: []

    # Optional: Vendor-specific custom extensions
    custom_extensions:
      - vendor_name: string
        data: string

field_bindings:
  # Required: Unique identifier for the field binding
  - name: string

    # Required: The physical table, view or query
    # Same value as the source in the datasets from the original spec, i.e. it can be either a table, a view or a query.
    source: string

    # Required: The physical columns
    # An array of strings to represent the source part of the binding.
    # Supports specifying multiple together to avoid repeating the binding from the same source in multiple field bindings.
    # The number of values in the source_columns should match the number of values in the target_columns. They are
    # bound in order.
    source_columns: []

    # Required: The logical dataset
    # A valid binding references one of the datasets from the semantic model
    target: string

    # Required: The logical columns from the target dataset
    # An array of strings to represent the target part of the binding.
    target_columns: []

    # Optional: Vendor specific custom extensions
    custom_extensions:
      - vendor_name: string
        data: string
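
To make the binding structure concrete, here is a minimal Python sketch, assuming the proposed schema has been loaded into plain dictionaries. The dataset name, sources, column names, and the helpers `validate_field_binding` and `column_map` are all hypothetical and not part of the spec. The sketch only checks the two rules stated in the comments above: `source_columns` and `target_columns` must have the same length and are bound in order, and `target` must reference a dataset in the semantic model.

```python
# Hypothetical logical dataset; field names are illustrative only.
datasets = {
    "orders": {"fields": ["order_id", "customer_id", "region", "amount"]},
}

# Two field bindings layer the same logical dataset over different physical sources.
field_bindings = [
    {
        "name": "orders_lake",
        "source": "lake.raw_orders",
        "source_columns": ["id", "cust_id", "region_cd", "amt"],
        "target": "orders",
        "target_columns": ["order_id", "customer_id", "region", "amount"],
    },
    {
        "name": "orders_warehouse",
        "source": "warehouse.orders_copy",
        "source_columns": ["order_id", "customer_id", "region", "total_amount"],
        "target": "orders",
        "target_columns": ["order_id", "customer_id", "region", "amount"],
    },
]


def validate_field_binding(fb, datasets):
    """Return a list of problems found in one field binding (empty if valid)."""
    errors = []
    if len(fb["source_columns"]) != len(fb["target_columns"]):
        errors.append(f"{fb['name']}: source_columns and target_columns differ in length")
    dataset = datasets.get(fb["target"])
    if dataset is None:
        errors.append(f"{fb['name']}: unknown target dataset {fb['target']!r}")
    else:
        for col in fb["target_columns"]:
            if col not in dataset["fields"]:
                errors.append(f"{fb['name']}: {col!r} is not a field of {fb['target']!r}")
    return errors


def column_map(fb):
    """Pair each logical column with its physical column, in order."""
    return dict(zip(fb["target_columns"], fb["source_columns"]))


for fb in field_bindings:
    print(fb["name"], validate_field_binding(fb, datasets), column_map(fb))
```

Both bindings target the same `orders` dataset, so a metric defined once against the logical fields could be resolved to either physical source through its column map.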

Partial Bindings

A binding, as defined in the modified spec, need not be complete, i.e. it need not bind every logical field in the semantic model. This gives a flexible model in which the semantic model is complete but the underlying physical entities may be incomplete, through partial bindings.

A semantic model can be described as a whole, without requiring every aspect of it to be present in one physical system. At runtime, the appropriate bindings can be chosen dynamically based on which aspects of the model are in use.

One example where such loose coupling is useful is the set of dimensions a metric can be sliced and diced by. In a big data pipeline, supporting a high-cardinality dimension is fine. In an OLAP dataset that needs low-latency queries for interactive visualizations, however, such dimensions would be unavailable.
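
One way the runtime selection described above might work is sketched below in Python. This is an illustration under stated assumptions, not something the proposal defines: the binding names, the covered fields, the preference order, and the `choose_binding` helper are all hypothetical. Each binding is reduced to the set of logical fields it covers, and the first preferred binding that covers every requested field is chosen.

```python
# Hypothetical bindings, reduced to the logical columns each one covers.
bindings = {
    "spark_raw": {"order_id", "customer_id", "region", "amount"},
    "olap_cube": {"region", "amount"},  # partial: no high-cardinality fields
}

# Assumed preference order: try the low-latency source first.
preference = ["olap_cube", "spark_raw"]


def choose_binding(requested_fields, bindings, preference):
    """Return the first preferred binding that covers every requested field."""
    needed = set(requested_fields)
    for name in preference:
        if needed <= bindings[name]:
            return name
    raise LookupError(f"no binding covers {sorted(needed)}")


print(choose_binding(["region", "amount"], bindings, preference))       # olap_cube
print(choose_binding(["customer_id", "amount"], bindings, preference))  # spark_raw
```

A query that touches only low-cardinality fields is served by the partial OLAP binding, while one that needs `customer_id` falls back to the complete binding.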

willpugh · 2026-02-09T20:20:13Z

One question I have here is how you think about the differences between things like your cube and the bronze layer. You can't just arbitrarily swap these out, because the cube will not be able to handle the same questions as the raw data does.

For example, a cube will only be able to answer the pre-aggregates that are built for it. The unaggregated table can answer any aggregations. So, I don't think we really want the same logical table for both, do we?

I had imagined that we would actually want different logical tables for each that describe what they can do. For example, for a pre-aggregated table, I imagined that you would have two different logical tables linked through some grain-based description. I think the modeling for those aggregation tables is going to be more intricate. This is because it is not just a field mapping, but an operational mapping.

e.g. the "mapping" from the bronze table to the aggregation table is more:
SUM(tbl_a.field_a) @ LOD (DIMENSION_1) => agg_tbl_a.sum_field_a

For a cube it would be a little more intricate, because the LOD is more flexible. However, I think the same physics exists.
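
The operational mapping above can be illustrated with a small Python sketch. The rows, dimension values, and the `sum_at_lod` helper are hypothetical; only the shape of the mapping comes from the comment. It shows that the aggregate table is derived by an operation, and that it can answer questions only at its own grain or coarser.

```python
from collections import defaultdict

# Hypothetical unaggregated rows of tbl_a: (dimension_1, dimension_2, field_a).
tbl_a = [
    ("east", "web", 10),
    ("east", "store", 5),
    ("west", "web", 7),
]


def sum_at_lod(rows, dim_index, value_index):
    """SUM(value) grouped by a single dimension, i.e. at that level of detail."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dim_index]] += row[value_index]
    return dict(totals)


# The operational mapping:
# SUM(tbl_a.field_a) @ LOD (DIMENSION_1) => agg_tbl_a.sum_field_a
agg_tbl_a = sum_at_lod(tbl_a, dim_index=0, value_index=2)
print(agg_tbl_a)  # {'east': 15, 'west': 7}

# The raw table can also be aggregated by dimension_2 ...
print(sum_at_lod(tbl_a, dim_index=1, value_index=2))  # {'web': 17, 'store': 5}

# ... but agg_tbl_a no longer carries dimension_2, so it cannot answer that.
# A coarser roll-up (the grand total) is still derivable because SUM is additive.
print(sum(agg_tbl_a.values()))  # 22
```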

1 reply

redblackcoder Feb 10, 2026
Author

Data sources with different aggregation levels are probably the more complex example, but there are many other cases where one-to-many binding with loose semantics helps keep the number of logical models sane. Having a completely separate logical model for each variant would dilute the semantic model, which should be defined once so that it builds trust by giving the same answer to the same question.

Let's take a few simpler examples to see how this can help.

Data Sharing - The same data is shared across multiple systems, either through copy or through zero-copy. Binding the same logical semantic model to each of them, without having to redefine it, provides a single view and understanding of the metric, irrespective of the underlying physical dataset being used.
Hot, Warm, and Cold Layers - The same data can be spread across various systems to provide an efficient and cost-effective way to consume it. Recent data may live in the hot layer, which provides fast ingestion and high query throughput with low latency, while old data moves to the cold layer for infrequent, high-latency access. The layers can still share the same aggregation level, so even without accounting for differences in aggregation levels, the semantic model can be pointed at either layer to get consistent answers.
Additional Dimensions - The core dataset could have a limited number of dimensions that are mostly used with a measure, with additional dimensional data living in a separate dataset that can be joined in to provide additional fields for the semantic model to use. Keeping the logical model completely separate across these various combinations for the same metric would make governing and managing them a nightmare.
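
The hot, warm, and cold scenario above can be sketched in Python as follows. Everything here is an assumption made for illustration: the layer names, sources, date boundaries, and the `layers_for_range` helper are hypothetical, and the proposal does not define how a runtime would route queries. Each layer is treated as a binding of the same logical dataset that covers a half-open date interval.

```python
from datetime import date

# Hypothetical layers holding the same logical dataset at the same aggregation level.
# Each layer covers the half-open date interval [start, end).
layers = [
    {"binding": "orders_cold", "source": "archive.orders",
     "start": date(2020, 1, 1), "end": date(2026, 1, 1)},
    {"binding": "orders_hot", "source": "realtime.orders",
     "start": date(2026, 1, 1), "end": date(2027, 1, 1)},
]


def layers_for_range(query_start, query_end, layers):
    """Return the bindings whose interval overlaps [query_start, query_end)."""
    return [
        layer["binding"]
        for layer in layers
        if layer["start"] < query_end and query_start < layer["end"]
    ]


# A query over recent data only needs the hot layer.
print(layers_for_range(date(2026, 2, 1), date(2026, 2, 8), layers))
# A query that spans the boundary needs both layers.
print(layers_for_range(date(2025, 12, 1), date(2026, 1, 15), layers))
```

Because both layers bind to the same logical dataset, the metric definition stays the same regardless of which layer or combination of layers serves the query.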

In the more complex scenario, where the level of aggregation differs across the underlying physical datasets, the lowest granularity at which metrics can be consumed from the semantic model will change dynamically based on the binding in use.

willpugh · 2026-02-09T20:22:39Z

In addition to my comment on mapping from unaggregated to aggregated data, I think it is also worth thinking through what we could do by just adding general composability.

The loose coupling you are looking for could come from base models that create the core logical mapping, with more specialized models pulling them in and extending them with more specialized functions.

1 reply

redblackcoder Feb 10, 2026
Author

Bindings are a way to provide this composability and extensibility for the models.

khush-bhatia · 2026-02-10T18:44:11Z

Maintainer

Hey @redblackcoder, we have spun off a composability working group to address this. Are you interested in joining it?

I suggest starting by joining the Slack workspace for OSI https://join.slack.com/t/opensemanticx/shared_invite/zt-3pq1j0lid-tQBbEvAngAvz0I0vZm~HJw and pinging [@dianne Wood]

1 reply

redblackcoder Feb 10, 2026
Author

Hi Khushboo
Thanks for the invite, I'll join it.
