In this section we'll use dbt (data build tool) to model our data that we've extracted and loaded into our data warehouse. We'll then create dimension and fact tables in accordance with Kimball modelling principles.
dbt is a powerful tool that allows anyone that knows SQL to transform data whilst following software engineering best practices such as modularity, portability, CI/CD, and documentation. It simply sits on top of our data warehouse and helps us to transform the raw data into something useful for business users.
Each 'model' is simply a .sql file containing SELECT statements only, therefore it abstracts away a lot of the complexity of DDL/DML.
- Firstly, create a new service account for dbt with the BigQuery Admin role and generate a key
- create a new dbt account
- create a new dbt cloud project, selecting BigQuery as your data warehouse connection type
- next, upload your service account key, which will pre-populate most fields
- now you'll have the option of using a dbt-managed repository, or adding your own
Now you should be set-up to start developing in dbt cloud.
To begin, initialise your project and create a new branch. Now we'll create our first model.
- On the right hand side you'll see the
File explorer
. Create a new folder within themodels
folder. - within the new folder, create a new file
schema.yml
- enter the details for your BigQuery table - the one which the pipeline exported to
version: 2
sources:
- name: staging_scrobble
database: de-zoomcamp-test
schema: scrobble_data
tables:
(you should see 'Generate model' here)
- name: raw_scrobble_data
...
- Note the
Generate model
above. Click this to generate a template for the model creation.- You can also define your schema datatypes within
schema.yml
before generating
- You can also define your schema datatypes within
models:
- name: stg_staging_scrobble_data
description: ""
columns:
- name: timestamp
data_type: timestamp
description: ""
tests:
- not_null:
severity: warn
- name: artist
data_type: string
description: ""
- name: artist_mbid
data_type: string
description: ""
...
You may have noticed that in the raw table, the tags are all combined into one string field, separated by commas. In this form the field is a bit messy and it's hard to extract anything meaningful from it if it was put in a dashboard. The way that last.fm tagging works, the first tag is the most popular genre that users have voted for. For sake of simplicity, I'll create a macro that can be used in model creation to simply extract the first tag.
Macros are similar to functions that are written in jinja, a templating language similar to python.
Creating a macro is easy - we'll create a new one in the macros
folder e.g. your_macro.sql
{#
This macro splits the tags
and takes the first one as the main tag
#}
{% macro get_first_tag(tag) -%}
SPLIT(tag, ',')[OFFSET(0)] AS tag
{%- endmacro %}
Now simply call it in the generated code
with
source as (
select * from {{ source('staging_scrobble', 'raw_scrobble_data') }}
),
renamed as (
select
cast(timestamp as timestamp) AS timestamp,
artist,
artist_mbid,
track,
track_mbid,
album,
album_mbid,
{{ get_first_tag("tag") }}
from source
)
select * from renamed
Now we can run dbt build
- dbt will pickup all the models and tests as defined in the current project and execute them in order.
We can now see the new staging table created in BigQuery.
As you're building your models, dbt will also record the lineage of your models. This is useful to understand how your tables were built from its respective data sources.
We can aggregate some of the playing history to create a basic fact table containing some useful information for the end-user. In this very simple example, I'll generate one for the total listens by artist and include some supplementary information such as genre. This won't appear in the final data model, but serves as an example of what you could do with a more complicated dataset with more potential measures.
This time we'll materialise it as a table, rather than a view:
{{
config(
materialized='table'
)
}}
select
count(*) as total_listens,
artist,
tag as genre
from dbt_jmurray.stg_staging_scrobble__raw_scrobble_data
group by 2, 3
We can create a lot of complicated transformations and thankfully dbt will abstract a lot of the complexity away, making it syntax-compliant for the destination database regardless of how it's written within dbt.
We can also introduce testing in the modelling process. Because I want to ensure that all timestamps are valid, I'll introduce a test to check that there are no nulls in that field.
Under our schema.yml
file, add a test by:
...
models:
- name: stg_staging_scrobble_data
description: ""
columns:
- name: timestamp
data_type: timestamp
description: ""
tests:
- not_null:
severity: warn
Finally, dbt has a built-in way to generate documentation. Simply run dbt docs generate
and navigate to the documentation tab beside the version control section.
Here you can view a lot of useful information, including the SQL that generated the table, location of source data, macros used, descriptions (if provided) and so on.
Once we're satisfied with development we can open a request to merge into the main branch. These models can then run in the production environment using the main branch. Here's how we'd schedule a production run:
- Within dbt, we can create a new environment 'PROD', seting the deployment type to
Production
- within the Production environment, create a new job under
Deploy > Jobs
- Here we can create a job to update the production tables and also a
CI/CD
job- once the jobs run, the changes should be visible within our BigQuery tables
In my data model, the dimension tables could be supplemented with further information to build a more comprehensive and insightful database. The MusicBrainz API could be used to retrieve dimensional data based on the MBIDs, such as band members or country of origin in the artist table, additional writing credits on the tracks or album table, and instruments used on the track table.
Because the final 'fact' table doesn't contain any measures, it's actually a factless fact table. The table itself stores a record of scrobbles and dates, so it allows us aggregate and answer questions such as "how many tracks were listened to on a certain date?" or "how many progressive metal or electronic tracks were listened to?".