Introduce Schema Inference to simplify graph definition #6

@metalshanked

Description

Is your feature request related to a problem? Please describe.

Currently, defining a graph schema with client.set_schema() is a manual and verbose process. The user must construct a large, deeply nested dictionary that explicitly maps every vertex, edge, attribute, and ID from the source tables.

The current process has several drawbacks:

  1. High Initial Friction: It requires developers to manually inspect their source data schemas (e.g., in Databricks or using another tool) and then meticulously transcribe every table name, column name, and data type into the PuppyGraph JSON format.
  2. Error-Prone: This manual transcription is highly susceptible to typos in field names (from_field), attribute names, or data types, which can lead to frustrating debugging sessions.
  3. Difficult to Maintain: If a column is added or renamed in the source Delta table, the developer must find and manually update the corresponding entry in the large schema dictionary. This brittleness can make schema evolution challenging.
  4. Cognitive Overhead: The current approach forces the user to focus on low-level mapping details rather than the high-level conceptual model of their graph (i.e., "this table is a node, this other table defines the relationship between them").

While querying itself is "Zero-ETL," the initial setup still feels like a manual data-mapping task.

Describe the solution you'd like

I propose the introduction of a schema inference mechanism. Since the client is already configured with connection details to a data catalog (e.g., a Unity Catalog), it should be able to use that connection to automatically inspect the schemas of the underlying tables.

This could be exposed through a more intuitive, high-level API, such as a SchemaBuilder class. This would allow users to define their graph conceptually, while the builder handles the low-level details of attribute mapping.
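To make the idea concrete, here is a minimal sketch of how attribute inference could work once the client can read table metadata from the catalog (e.g., via a `DESCRIBE TABLE` call or an `information_schema.columns` query). The `TYPE_MAP` contents, the `infer_attributes` helper, and the column-tuple input format are all illustrative assumptions, not existing client API:

```python
# Hypothetical sketch: derive PuppyGraph attribute mappings from column
# metadata fetched from the connected catalog. The type map and helper
# name are assumptions for illustration only.
TYPE_MAP = {
    "string": "String",
    "int": "Integer",
    "bigint": "Long",
    "double": "Double",
    "boolean": "Boolean",
}

def infer_attributes(columns, id_column):
    """columns: list of (name, sql_type) pairs, e.g. parsed from
    `DESCRIBE TABLE` output or information_schema.columns."""
    return [
        {
            "name": name,
            "from_field": name,
            "type": TYPE_MAP.get(sql_type.lower(), "String"),
        }
        for name, sql_type in columns
        if name != id_column  # the ID column is mapped separately
    ]

# Example: columns of the `movies` table as reported by the catalog.
attrs = infer_attributes(
    [("movie_id", "string"), ("title", "string"), ("release_year", "int")],
    id_column="movie_id",
)
```

This removes the transcription step entirely: the user names the ID column, and every other column becomes an attribute with its type translated automatically.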

Example of the Proposed API

Here's a comparison of the current approach versus how it could look with a SchemaBuilder.

Current Approach (Manual & Verbose):

```python
# User has to write this entire dictionary by hand
client.set_schema({
    "catalogs": [...],  # a lot of boilerplate
    "vertices": [
        {
            "table_source": {"catalog_name": "imdb_catalog", "schema_name": "public", "table_name": "movies"},
            "label": "Movie",
            "attributes": [
                {"name": "title", "from_field": "title", "type": "String"},
                {"name": "release_year", "from_field": "release_year", "type": "Integer"},
            ],
            "id": [{"name": "movie_id", "from_field": "movie_id", "type": "String"}]
        },
        # ... and so on for Actors ...
    ],
    "edges": [
        {
            "table_source": {"catalog_name": "imdb_catalog", "schema_name": "public", "table_name": "acted_in"},
            "label": "ACTED_IN",
            "from_label": "Actor",
            "to_label": "Movie",
            "from_id": [{"name": "actor_id", "from_field": "actor_id", "type": "String"}],
            "to_id": [{"name": "movie_id", "from_field": "movie_id", "type": "String"}],
            # ... etc ...
        }
    ]
})
```
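Proposed Approach (Conceptual & Concise):

By contrast, the SchemaBuilder API could look roughly like the sketch below. Every name here (SchemaBuilder, add_vertex, add_edge, build) is a proposal, not existing client API; the stub implementation only shows how conceptual calls could expand into the verbose dictionary above, with attribute inference left as a placeholder:

```python
# Hypothetical SchemaBuilder sketch. In a real implementation,
# add_vertex/add_edge would inspect the table in the connected
# catalog and infer attributes and types automatically.
class SchemaBuilder:
    def __init__(self, catalog_name, schema_name):
        self.catalog = catalog_name
        self.schema = schema_name
        self.vertices = []
        self.edges = []

    def _source(self, table):
        return {"catalog_name": self.catalog,
                "schema_name": self.schema,
                "table_name": table}

    def add_vertex(self, label, table, id_column):
        self.vertices.append({
            "table_source": self._source(table),
            "label": label,
            # attributes would be inferred from the table schema here
            "id": [{"name": id_column, "from_field": id_column,
                    "type": "String"}],
        })
        return self

    def add_edge(self, label, table, from_label, to_label, from_id, to_id):
        self.edges.append({
            "table_source": self._source(table),
            "label": label,
            "from_label": from_label,
            "to_label": to_label,
            "from_id": [{"name": from_id, "from_field": from_id,
                         "type": "String"}],
            "to_id": [{"name": to_id, "from_field": to_id,
                       "type": "String"}],
        })
        return self

    def build(self):
        return {"vertices": self.vertices, "edges": self.edges}


# The user states the conceptual model; the builder fills in the mapping.
schema = (
    SchemaBuilder("imdb_catalog", "public")
    .add_vertex("Movie", table="movies", id_column="movie_id")
    .add_vertex("Actor", table="actors", id_column="actor_id")
    .add_edge("ACTED_IN", table="acted_in",
              from_label="Actor", to_label="Movie",
              from_id="actor_id", to_id="movie_id")
    .build()
)
```

The result has the same shape as the hand-written dictionary, but the user only states labels, tables, and key columns; everything else (column lists, types, boilerplate) is the builder's job.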
