Skip to content
This repository has been archived by the owner on Mar 30, 2021. It is now read-only.

Defining a Star Schema

hbutani edited this page Nov 12, 2016 · 3 revisions

Define a Star Schema using the create|alter star schema command. The Star Schemas is defined on the Fact Table. Star Schemas will be the basis for defining OLAP Indexes and other Metadata. Internally the Star Schema specification is stored as Table Properties on the Fact Table.

For example for the TPCH schema the create command is:

create star schema on lineitembase
as many_to_one join of lineitembase with orders on l_orderkey = o_orderkey
   many_to_one join of lineitembase with partsupp on
          l_partkey = ps_partkey and l_suppkey = ps_suppke
   many_to_one join of partsupp with part on ps_partkey = p_partkey
   many_to_one join of partsupp with supplier on ps_suppkey = s_suppkey
   many_to_one join of orders with customer on o_custkey = c_custkey
   many_to_one join of customer with custnation on c_nationkey = cn_nationkey
   many_to_one join of custnation with custregion on cn_regionkey = cr_regionkey
   many_to_one join of supplier with suppnation on s_nationkey = sn_nationkey
   many_to_one join of suppnation with suppregion on sn_regionkey = sr_regionkey

You create the Star Schema on the Fact Table and you specify how the dimension tables are related(primary key - foreign key relationships). Each relation definition specifies the tables involved(leftTable, rightTable) the type of relation(1-1 or n-1) and the joining condition(specified as a conjunction of equality predicates)

The '''Star Schema'''s we support have the following constraints:

  • We only support '''one-one''' or '''many-one''' relations between entities.
  • A table can be related to the '''Fact Table''' via only 1 unique Path.
  • The ''column names'' across the Star Schema must be unique. So 2 tables in the Star Schema cannot have columns with the same name.

The first 2 points are an issue only in the most involved star schema models; for e.g. we show how tpch can be modeled below. The 3rd restriction is an implementation issue: when performing QueryPlan rewrites we don't have access to the table an Attribute belongs to, for now we get around this issue by forcing column names to be unique across the Star Schema. So for the Tpch Model we model the Star Schema as:

FactTable = LineItem
StarRelations: [
  LineItem - n:1 - Order => [[li_orderkey],[o_orderkey]]
  LineItem - n:1 - PartSupp => [[li_partkey, li_suppkey],[ps_partkey, ps_suppkey]]
  Order - n: 1 - Customer => [[o_custkey], [c_custkey]]
  PartSupp - n:1 - Part => [[ps_partkey], [p_partkey]]
  PartSupp - n:1 - Supplier => [[ps_suppkey], [s_suppkey]]
  Customer - n:1 - CustNation => [[c_nationkey], [cn_nationkey]]
  CustNation - n:1 - CustRegion => [[cn_regionkey], [cr_regionkey]]
  Supplier - n:1 - SupptNation => [[s_nationkey], [sn_nationkey]]
  SuppNation - n:1 - SuppRegion => [[sn_regionkey], [sr_regionkey]]
]

Because of our restrictions we have had to model the Nation table as separate CustNation and SuppNation tables. Similar separation has to be done for CustRegion and SuppRegion. Having to setup separate entities for Supplier and Customer Nation is not atypical when directly writing SQLs; these would be views on the same Nation Dimension table. Currently we are being more restrictive than this, we require the 2 views to be tables in the Metastore(this is because during Plan rewrite we loose the Table association in __ Catalyst Attributereferences__. But note, this doesn't require the data to be copied, both tables can point to the same underlying data in the storage layer.

We have to rename the column names in the 2 Nation(and region) tables. This is so that we can infer the Attribute to Tables(in the Star Schema) associations in a Query Plan.

Specifying the Star Schema on a Druid DataSource backed DataFrame

For backward compatibility, we will continue to support specifying the Star Schema for a Druid DataSource backed DataFrame, this is specified as an option in the DataSource definition. The option name is starSchema. The option must be specified as json string, for example:

For example for the TPCH schema the definition is:

{
  "factTable": "lineitem",
  "relations": [
    {
      "leftTable": "lineitem",
      "rightTable": "orders",
      "relationType": "n-1",
      "joinCondition": [
        {
          "leftAttribute": "l_orderkey",
          "rightAttribute": "o_orderkey"
        }
      ]
    },
    {
      "leftTable": "lineitem",
      "rightTable": "partsupp",
      "relationType": "n-1",
      "joinCondition": [
        {
          "leftAttribute": "l_partkey",
          "rightAttribute": "ps_partkey"
        },
        {
          "leftAttribute": "l_suppkey",
          "rightAttribute": "ps_suppkey"
        }
      ]
    },
    {
      "leftTable": "partsupp",
      "rightTable": "part",
      "relationType": "n-1",
      "joinCondition": [
        {
          "leftAttribute": "ps_partkey",
          "rightAttribute": "p_partkey"
        }
      ]
    },
    {
      "leftTable": "partsupp",
      "rightTable": "supplier",
      "relationType": "n-1",
      "joinCondition": [
        {
          "leftAttribute": "ps_suppkey",
          "rightAttribute": "s_suppkey"
        }
      ]
    },
    {
      "leftTable": "orders",
      "rightTable": "customer",
      "relationType": "n-1",
      "joinCondition": [
        {
          "leftAttribute": "o_custkey",
          "rightAttribute": "c_custkey"
        }
      ]
    },
    {
      "leftTable": "customer",
      "rightTable": "custnation",
      "relationType": "n-1",
      "joinCondition": [
        {
          "leftAttribute": "c_nationkey",
          "rightAttribute": "cn_nationkey"
        }
      ]
    },
    {
      "leftTable": "custnation",
      "rightTable": "custregion",
      "relationType": "n-1",
      "joinCondition": [
        {
          "leftAttribute": "cn_regionkey",
          "rightAttribute": "cr_regionkey"
        }
      ]
    },
    {
      "leftTable": "supplier",
      "rightTable": "suppnation",
      "relationType": "n-1",
      "joinCondition": [
        {
          "leftAttribute": "s_nationkey",
          "rightAttribute": "sn_nationkey"
        }
      ]
    },
    {
      "leftTable": "suppnation",
      "rightTable": "suppregion",
      "relationType": "n-1",
      "joinCondition": [
        {
          "leftAttribute": "sn_regionkey",
          "rightAttribute": "sr_regionkey"
        }
      ]
    }
  ]
}
Clone this wiki locally