
Data Warehouse standard Code Branch 69

Wade edited this page Oct 10, 2020 · 8 revisions

Introduction
This standard documentation is written by Wade Song.
The goal is to define a standard that covers about 80% of DWH demands, enough to build a professional DWH.
There are many possible standards for building a DWH; I picked one of them and gave it the code name 69.
There is rarely a single correct answer; we often have many options, and this documentation is the one I picked from many possible standards. I do not mean that someone who does not follow this standard is unprofessional. This is simply what I picked.
So if you are building a new DWH, or evaluating a vendor, this may serve as one of your references.

Out of scope
Real-time DWH and reporting.
The choice between SQL Server and Oracle.

Layer design
A DWH should follow a layered design, with at least 3 layers.
Stage layer

  • An intermediate layer; data first arrives here and is also archived here.

Core layer

  • The real warehouse, where data is well organized.

Mart layer

  • Closest to the reporting layer, for presentation.
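
The three layers above can be sketched as separate databases. This is only an illustration; the database names STAGE, CORE and MART follow the naming convention used later in this document, and how you physically separate the layers is a design choice:

```sql
-- Illustrative sketch: one database per layer.
CREATE DATABASE STAGE;  -- landing and archive (PSA)
CREATE DATABASE CORE;   -- the real warehouse (Data Vault)
CREATE DATABASE MART;   -- presentation: dimensional and flat tables
```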

Infrastructure
To ensure data security, the network should have at least 3 subnets.
Database subnet: only allows database connections from the application subnet and the admin subnet, and remote access from the admin subnet.
Application subnet: connects to the database subnet, and only allows remote access from the admin subnet.
Admin subnet: has remote access to the database subnet and the application subnet, and connects to the database subnet for DB connections.

How data is organized
In Stage layer:

  • Data is loaded into this layer with a one-to-one mapping from the source.
  • To store historical data, the PSA (persistent staging area) approach should be applied.
  • Each source has its own schema.

In Core layer:

  • Data is organized as a Data Vault (refer to the Data Vault standard).
  • It is split into an operational data vault (schema DV) and a business data vault (schema BDV).
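
A minimal Data Vault sketch, following the HUB_/SAT_ naming used later in this document. The entity (customer) and all column names are hypothetical; real hash-key and load-metadata conventions come from the Data Vault standard referred to above:

```sql
-- Hypothetical example: one hub with one satellite in the DV schema.
CREATE TABLE DV.HUB_CUSTOMER (
    CUSTOMER_HK   CHAR(32)    NOT NULL PRIMARY KEY,  -- hash key
    CUSTOMER_BK   VARCHAR(50) NOT NULL,              -- business key
    LOAD_DTS      DATETIME2   NOT NULL,
    RECORD_SOURCE VARCHAR(50) NOT NULL
);

CREATE TABLE DV.SAT_CUSTOMER (
    CUSTOMER_HK   CHAR(32)     NOT NULL,
    LOAD_DTS      DATETIME2    NOT NULL,
    CUSTOMER_NAME VARCHAR(100) NULL,      -- descriptive attribute
    HASH_DIFF     CHAR(32)     NOT NULL,  -- change detection
    PRIMARY KEY (CUSTOMER_HK, LOAD_DTS)
);
```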

In Mart layer:

  • For cube-related use, data is organized in a dimensional model, i.e. fact and dimension tables.
  • For normal reporting, data is organized in flat tables.
  • For data mining, data is organized in flat tables.
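
A dimensional-model sketch for the cube case, following the FACT_/DIM_ naming used later in this document. The sales fact and customer dimension are hypothetical examples, not part of the standard itself:

```sql
-- Hypothetical star schema: one fact table with one dimension.
CREATE TABLE DIM.DIM_CUSTOMER (
    CUSTOMER_SK   INT IDENTITY PRIMARY KEY,  -- surrogate key
    CUSTOMER_BK   VARCHAR(50)  NOT NULL,     -- business key
    CUSTOMER_NAME VARCHAR(100) NULL
);

CREATE TABLE DIM.FACT_SALES (
    DATE_SK     INT NOT NULL,
    CUSTOMER_SK INT NOT NULL REFERENCES DIM.DIM_CUSTOMER (CUSTOMER_SK),
    AMOUNT      DECIMAL(18, 2) NOT NULL
);
```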

Data flow
Data is transferred from the Stage layer to the Core layer, and finally to the Mart layer; skipping a layer is not allowed.
Logic code lives in the target layer:

  • USP (user stored procedure) code from Stage to Core should be in the Core layer.
  • USP code from Core to Mart should be in the Mart layer.
  • PSA code stays in the Stage layer, as it goes from Stage to Stage.

If a calculation only uses a table itself, it should go from DV to BDV.
If a calculation spans several tables, it should go from DV or BDV to MART.
Data transfer should be done with USPs.
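
A sketch of a transfer USP, placed in the target layer per the rule above: this Core-to-Mart procedure lives in the MART database. All object names (the flat report table, the customer hub and satellite) are illustrative:

```sql
-- Hypothetical USP: loads a flat Mart table from Core.
-- It lives in MART, the target layer of this transfer.
CREATE PROCEDURE FLAT.USP_LOAD_CUSTOMER_REPORT
AS
BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE FLAT.CUSTOMER_REPORT;

    INSERT INTO FLAT.CUSTOMER_REPORT (CUSTOMER_BK, CUSTOMER_NAME)
    SELECT h.CUSTOMER_BK, s.CUSTOMER_NAME
    FROM CORE.DV.HUB_CUSTOMER AS h
    JOIN CORE.DV.SAT_CUSTOMER AS s
      ON s.CUSTOMER_HK = h.CUSTOMER_HK;
END;
```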

Reporting
For management reports, such as dashboards, BI tools like Power BI and Tableau are suggested.
For operational reports, normal reporting tools are suggested, such as SQL Server SSRS.

How data lands in the DWH
This part only provides suggestions, as data sources vary widely.
For all the suggestions below, incremental data is better than full data.
Data can be loaded into the Stage layer or any intermediate database.
Alternatively, the source database can be connected directly.
Or a file database, such as Access.
Or flat files, such as CSV.
What is not suggested:
Source database backups.
Excel, especially with manual input.
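
For the flat-file case, a landing load can be sketched as below. The source short name (ERP), table, file path and file layout are all hypothetical; the point is that an incremental extract (e.g. only changed rows since the last load) is preferred over a full dump:

```sql
-- Hypothetical example: landing an incremental CSV extract
-- into a stage table named per the convention below.
BULK INSERT ERP.CUSTOMER
FROM 'D:\landing\erp\customer_20201010.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
```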

Naming convention
Stage:
STAGE.[SOURCE SHORTNAME].[SOURCE TABLENAME]
STAGE.[SOURCE SHORTNAME].[SOURCE TABLENAME]PSA
STAGE.[SOURCE SHORTNAME].V_[SOURCE TABLENAME]
STAGE.[SOURCE SHORTNAME].USP_[SOURCE TABLENAME]
Core:
CORE.DV.HUB_[NAME]
CORE.DV.SAT_[NAME]
CORE.DV.LINK_[NAME]
CORE.BDV.SAT_[NAME]
Mart:
MART.FLAT.[NAME]
MART.DIM.FACT_[NAME]
MART.DIM.DIM_[NAME]

Tables
STAGE:
The Stage layer should follow the PSA approach, which means:
- One landing table, with the same data structure as the source table.
- An INC table, derived from the landing table, which stores changed data.
- A HIS table, which stores history.
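
A sketch of the PSA table set for one source table. The source (ERP) and its columns are hypothetical, and the INC table follows the same structure as the landing table:

```sql
-- Hypothetical PSA sketch: landing table mirrors the source;
-- the HIS table archives every loaded version with a load timestamp.
CREATE TABLE ERP.CUSTOMER (          -- landing, 1:1 with source
    CUSTOMER_ID   INT          NOT NULL,
    CUSTOMER_NAME VARCHAR(100) NULL
);

CREATE TABLE ERP.CUSTOMER_HIS (      -- history (PSA archive)
    CUSTOMER_ID   INT          NOT NULL,
    CUSTOMER_NAME VARCHAR(100) NULL,
    LOAD_DTS      DATETIME2    NOT NULL,
    PRIMARY KEY (CUSTOMER_ID, LOAD_DTS)
);
```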
Core:

Log function
The log function should contain the following basic information:

  • Log time
  • Source (where the log entry was triggered from, for example the name of the USP).
  • Rows impacted (if available).
  • Log type (I = information, E = error).
  • Log content (for example, "xxx process started", "xxx process ended", or any error message).

For data transfer USP code, the log should be triggered:

  • At the beginning of the USP, with the information "USP process started".
  • At the end of the USP, with the information "USP process ended", plus rows impacted, if available.
  • On error, in the try-catch block.
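
The logging pattern above can be sketched as a log table plus a transfer USP that writes to it at start, at end, and on error. The log table name, its location (dbo) and the USP are hypothetical; only the logged fields follow the list above:

```sql
-- Hypothetical log table covering the basic information listed above.
CREATE TABLE dbo.ETL_LOG (
    LOG_TIME    DATETIME2     NOT NULL DEFAULT SYSDATETIME(),
    LOG_SOURCE  VARCHAR(128)  NOT NULL,   -- e.g. name of the USP
    ROWS_IMPACT INT           NULL,       -- rows impacted, if available
    LOG_TYPE    CHAR(1)       NOT NULL,   -- I = information, E = error
    LOG_CONTENT VARCHAR(4000) NOT NULL
);
GO

-- Hypothetical transfer USP showing the three trigger points.
CREATE PROCEDURE DV.USP_LOAD_HUB_CUSTOMER
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @rows INT = 0;

    BEGIN TRY
        INSERT INTO dbo.ETL_LOG (LOG_SOURCE, LOG_TYPE, LOG_CONTENT)
        VALUES ('USP_LOAD_HUB_CUSTOMER', 'I', 'USP process started');

        -- ... transfer logic here; afterwards: SET @rows = @@ROWCOUNT;

        INSERT INTO dbo.ETL_LOG (LOG_SOURCE, ROWS_IMPACT, LOG_TYPE, LOG_CONTENT)
        VALUES ('USP_LOAD_HUB_CUSTOMER', @rows, 'I', 'USP process ended');
    END TRY
    BEGIN CATCH
        INSERT INTO dbo.ETL_LOG (LOG_SOURCE, LOG_TYPE, LOG_CONTENT)
        VALUES ('USP_LOAD_HUB_CUSTOMER', 'E', ERROR_MESSAGE());
    END CATCH;
END;
```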