Data Platform as a Service for SMB E-commerce, using dlt and a modern data stack.
This project uses uv for package and environment management. Install the necessary dependencies, including the etl, dbt, and dev extras:
```
uv pip install -e .[etl,dbt,dev]
```

Credentials for data sources are stored in `.dlt/secrets.toml`. Make sure this file is never committed to version control; the project's `.gitignore` is already configured to ignore the `.dlt/` directory.
For dbt, the project uses a flexible profile system (dbt_profiles/profiles.yml) that allows switching between local DuckDB for development and BigQuery for production. The dbt project settings are in dbt_project/dbt_project.yml.
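As an illustration, a dual-target profile generally takes this shape (a sketch only — the profile name, dataset, and file paths below are assumptions; the project's actual values live in `dbt_profiles/profiles.yml`):

```yaml
# Illustrative sketch of a dual-target dbt profile, not the actual file
smb_dataplatform:
  target: dev                # default target for local development
  outputs:
    dev:
      type: duckdb
      path: ../duckdb_files/dbt_metrics.duckdb
    prod:
      type: bigquery
      method: oauth
      project: smb-dataplatform
      dataset: analytics     # assumed dataset name
      location: us-central1
```

Switching targets is then a matter of passing `--target prod` to dbt (or using the corresponding poe task).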
The project follows a modular structure to support scalable data pipelines and easy client deployments:
- `pipelines/`: Contains all `dlt` ingestion pipelines. Each data source (e.g., `shopify`, `facebook_ads`, `tiktok_ads`) is a sub-directory with its own `_pipeline.py` script.
- `pipelines/mock_data/`: Houses faker scripts for generating local DuckDB data for rapid testing and demos.
- `dbt_project/`: Contains the `dbt` models for data transformation, organized into staging and marts layers.
- `terraform/`: Stores Terraform configurations for deploying the entire data platform infrastructure to Google Cloud Platform (GCP).
- `duckdb_files/`: Local DuckDB databases generated during local development and testing.
- `tests/`: Unit and integration tests for pipelines and other components.
- `scripts/`: Utility scripts (e.g., for populating mock data).
- `workflows/`: Google Cloud Workflow definitions for orchestrating jobs.
This section covers how raw data is brought into the platform. All dlt ingestion pipelines are standardized and can load data to either local DuckDB for development or BigQuery for production.
A single script, pipelines/run_pipeline.py, acts as a centralized entry point to execute any ingestion pipeline.
Usage:
```
uv run python -m pipelines.run_pipeline <pipeline_name> --destination <destination_type>
```

- `<pipeline_name>`: `shopify`, `facebook_ads`, or `tiktok_ads`
- `<destination_type>`: `duckdb` (for local development/testing) or `bigquery` (for GCP deployment)
Examples:
- Run Shopify pipeline to local DuckDB:
uv run python -m pipelines.run_pipeline shopify --destination duckdb
- Run Facebook Ads pipeline to GCP BigQuery:
uv run python -m pipelines.run_pipeline facebook_ads --destination bigquery
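Conceptually, the runner maps a pipeline name and a destination to the right pipeline. A minimal sketch of that dispatch logic (hypothetical — the real `pipelines/run_pipeline.py` wires these names to actual `dlt` pipelines and parses CLI flags):

```python
# Hypothetical sketch of a centralized runner's dispatch logic.
# The callables below are stand-ins for the real dlt pipeline entry points.
PIPELINES = {
    "shopify": lambda dest: f"shopify -> {dest}",
    "facebook_ads": lambda dest: f"facebook_ads -> {dest}",
    "tiktok_ads": lambda dest: f"tiktok_ads -> {dest}",
}

VALID_DESTINATIONS = ("duckdb", "bigquery")


def run(pipeline_name, destination):
    """Validate both arguments, then dispatch to the selected pipeline."""
    if pipeline_name not in PIPELINES:
        raise ValueError(f"unknown pipeline: {pipeline_name}")
    if destination not in VALID_DESTINATIONS:
        raise ValueError(f"unknown destination: {destination}")
    return PIPELINES[pipeline_name](destination)


print(run("shopify", "duckdb"))  # shopify -> duckdb
```

Centralizing validation here means an unknown pipeline or destination fails fast with a clear error rather than deep inside a `dlt` run.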
For rapid local testing and building client demos with DuckDB, you can leverage the faker scripts located in pipelines/mock_data/. The standardized dlt pipelines (e.g., facebook_ads, tiktok_ads) will use these when --destination duckdb is specified. The faker library is now included in the etl dependencies to ensure mock data generation works in all environments.
- Populate additional mock Shopify sales data (if using local DuckDB):
uv run python scripts/populate_shopify_sales.py
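For illustration, a mock generator in the spirit of `pipelines/mock_data/` can be sketched with the stdlib `random` module (the real scripts use `faker`, and the field names below are illustrative, not the project's schema; the 50%-of-net-sales COGS value mirrors the mocking convention used in the dbt models):

```python
# Illustrative mock-order generator (stdlib-only sketch; the project's real
# generators in pipelines/mock_data/ use faker, and these field names are
# hypothetical).
import random
from datetime import date, timedelta


def generate_mock_orders(n, seed=42):
    """Return n deterministic fake orders for local DuckDB testing."""
    rng = random.Random(seed)
    start = date(2024, 1, 1)
    orders = []
    for i in range(n):
        net_sales = round(rng.uniform(20.0, 300.0), 2)
        orders.append({
            "order_id": i + 1,
            "order_date": (start + timedelta(days=rng.randint(0, 364))).isoformat(),
            "net_sales": net_sales,
            "cogs": round(net_sales * 0.5, 2),  # COGS mocked at 50% of net sales
        })
    return orders


print(len(generate_mock_orders(5)))  # 5
```

Seeding the generator keeps demo data reproducible across runs, which makes dbt test failures easier to diagnose locally.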
The project uses dbt to transform raw data into structured, analyzable formats, organized into silver (staging) and gold (marts) layers.
The dbt project is configured to dynamically switch between local DuckDB and cloud BigQuery data sources based on the active dbt profile (target).
- When running with the `dev` target (DuckDB), `dbt` will look for data in the local `duckdb_files/` directory.
- When running with the `prod` target (BigQuery), `dbt` will connect to the `smb-dataplatform` GCP project and the relevant BigQuery datasets (e.g., `shopify_data_raw`, `facebook_ads_data`, `tiktok_ads_data`).
This ensures seamless transition between local development and cloud deployment without manual configuration changes within dbt models or source definitions.
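A common pattern for this is to key the source `database`/`schema` off `target.name` in the sources YAML. Shown as a sketch only — the database and schema values below are illustrative assumptions, not the project's actual source definitions:

```yaml
# Illustrative sketch -- not the project's actual sources file
sources:
  - name: shopify
    database: "{{ 'smb-dataplatform' if target.name == 'prod' else 'shopify' }}"
    schema: "{{ 'shopify_data_raw' if target.name == 'prod' else 'main' }}"
    tables:
      - name: orders
      - name: customers
```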
- Silver Layer (`models/staging/`): Contains cleaned and standardized views of the raw data. These models serve as direct interfaces to the bronze layer (raw `dlt` output).
- Gold Layer (`models/marts/`): Houses aggregated, business-ready tables for reporting and analytics. This layer includes intermediate aggregates (`models/marts/intermediate/`) and final Key Performance Indicator (KPI) reports.
The gold layer provides the following key e-commerce metrics:
- Contribution Margin: `Net Sales - COGS - Shipping - Transaction Fees - Total Ad Spend`
- Marketing Efficiency Ratio (MER):
  - `mer_total_paid_ads`: `Net Sales / (Facebook Spend + Instagram Spend + TikTok Spend)`
  - `mer_facebook`: `Net Sales / Facebook Spend`
  - `mer_instagram`: `Net Sales / Instagram Spend`
- New Customer Cost Per Acquisition (ncCPA): `Total Ad Spend / Count(First-Time Orders)`
- Net Profit on Ad Spend (NPOAS): `Contribution Margin / Total Ad Spend`
- LTV:CAC Ratio: `(Avg Order Value * Purchase Frequency) / ncCPA`
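A worked example of these formulas, with made-up inputs (numbers are illustrative only):

```python
# Worked example of the gold-layer KPI formulas; all inputs are invented.
net_sales = 10_000.0
cogs = 5_000.0            # mocked at 50% of net sales
shipping = 500.0
transaction_fees = 300.0
fb_spend, ig_spend, tt_spend = 800.0, 400.0, 300.0
total_ad_spend = fb_spend + ig_spend + tt_spend        # 1500.0
first_time_orders = 30
avg_order_value, purchase_freq = 250.0, 2.0

contribution_margin = (net_sales - cogs - shipping
                       - transaction_fees - total_ad_spend)  # 2700.0
mer_total_paid_ads = net_sales / total_ad_spend              # ~6.67
mer_facebook = net_sales / fb_spend                          # 12.5
nc_cpa = total_ad_spend / first_time_orders                  # 50.0
npoas = contribution_margin / total_ad_spend                 # 1.8
ltv_cac = (avg_order_value * purchase_freq) / nc_cpa         # 10.0

print(contribution_margin, nc_cpa, ltv_cac)  # 2700.0 50.0 10.0
```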
- COGS Mocking: Since raw COGS data is not available, a mock value of 50% of `net_sales` is applied in the `stg_orders` model.
- Ad Platform Granularity: Meta (Facebook) ad spend is broken down by `publisher_platform` (Facebook vs. Instagram) in the `int_daily_ad_spend` model, enabling granular performance comparison.
- Safe Division: A `safe_divide` macro is implemented to prevent division-by-zero errors in metric calculations.
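The macro itself is SQL/Jinja inside the dbt project; its behavior can be illustrated with a Python analog:

```python
def safe_divide(numerator, denominator, default=None):
    """Python analog of the safe_divide dbt macro: return `default`
    (None here, mirroring SQL NULL) when the denominator is zero."""
    if denominator == 0:
        return default
    return numerator / denominator


print(safe_divide(10, 4))  # 2.5
print(safe_divide(10, 0))  # None
```

This matters for metrics like MER or ncCPA, where a day with zero ad spend or zero first-time orders would otherwise crash the model.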
Use the following poe commands to interact with the dbt project:
- Install dbt packages: `uv run poe dbt-deps`
- Run all models and tests (recommended for full pipeline execution): `uv run poe dbt-build`
- Run dbt models only: `uv run poe dbt-run`
- Run dbt tests only: `uv run poe dbt-test`
- Debug the dbt connection: `uv run poe dbt-debug`
- Clean the dbt `target` and packages directories: `uv run poe dbt-clean`
The dbt project includes comprehensive data quality tests:
- Generic Tests: Applied in `schema.yml` files to enforce constraints like `unique`, `not_null`, `not_negative`, and `accepted_values` (e.g., ensuring `publisher_platform` is either 'facebook' or 'instagram').
- Singular Tests: Custom SQL tests (e.g., `assert_contribution_margin_logic.sql`) that validate specific business rules and data behaviors.
This section outlines the refined process for deploying the entire data platform for a new client using Terraform and orchestrating it with Cloud Workflows.
Before running any local commands, you must perform the following setup steps manually in the Google Cloud Console and the client's respective data source platforms (e.g., Shopify, Facebook Ads, TikTok Ads).
- Create a new Google Cloud project for the client (e.g., `your-client-gcp-project-id`) or select an existing one.
- Ensure you have the Owner or Editor role on the project.
- Make sure billing is enabled for the project.
- Authenticate your `gcloud` CLI and set the project:

  ```
  gcloud auth login
  gcloud config set project your-gcp-project-id
  ```

  Replace `your-gcp-project-id` with the actual project ID you created/selected.
For each data source, you will need to set up API access and obtain credentials.
- Shopify Custom App Setup:
  - Log in to your client's Shopify store admin interface.
  - Navigate to Apps -> Develop apps for your store.
  - Create a new custom app (e.g., `Data Platform Integration`).
  - In the app's settings, configure the API scopes. You will need at least `read_products`, `read_orders`, and `read_customers` for the ingestion pipeline.
  - Go to the API credentials tab and note down the API key and API secret key.
- Facebook Ads API Setup:
  - Create a Facebook App and configure access to the Ads API.
  - Obtain an access token (usually a long-lived user access token or system user access token).
  - Note down the ad account ID(s).
- TikTok Ads API Setup:
  - Create a TikTok for Developers app and configure access to the Ads API.
  - Obtain an access token and the advertiser ID(s).
You must create secrets in Google Secret Manager to securely store the API credentials for each data source. The names of these secrets (e.g., shop_url, client_id, client_secret for Shopify; facebook_access_token, facebook_ad_account_id for Facebook Ads; tiktok_access_token, tiktok_advertiser_id for TikTok Ads) are the default values expected by the Terraform configuration and DLT.
Crucially, the secret values must be the raw text, without any surrounding quotes. If you include quotes, the pipeline may fail with parsing errors.
Run the following gcloud commands to create the secrets, replacing the placeholder values with your client's actual information:
```
# Shopify credentials
echo "your-client-shop-name.myshopify.com" | gcloud secrets create shop_url --data-file=- --project=your-gcp-project-id
echo "your-shopify-app-api-key" | gcloud secrets create client_id --data-file=- --project=your-gcp-project-id
echo "your-shopify-app-api-secret-key" | gcloud secrets create client_secret --data-file=- --project=your-gcp-project-id

# Facebook Ads credentials (example; adjust secret names as per the dlt configuration)
echo "your-facebook-long-lived-access-token" | gcloud secrets create facebook_access_token --data-file=- --project=your-gcp-project-id
echo "your-facebook-ad-account-id" | gcloud secrets create facebook_ad_account_id --data-file=- --project=your-gcp-project-id

# TikTok Ads credentials (example; adjust secret names as per the dlt configuration)
echo "your-tiktok-access-token" | gcloud secrets create tiktok_access_token --data-file=- --project=your-gcp-project-id
echo "your-tiktok-advertiser-id" | gcloud secrets create tiktok_advertiser_id --data-file=- --project=your-gcp-project-id
```

Remember to replace `your-gcp-project-id` with your project ID in these commands.
Once the manual prerequisites are met, you can deploy the entire infrastructure from your local machine using Terraform and Poe.
- Navigate to the `terraform/` directory in your project.
- Create a copy of the example variables file:

  ```
  cp terraform.tfvars.example terraform.tfvars
  ```

- Open `terraform.tfvars` and fill in the values specific to your client's deployment:

  ```
  # terraform/terraform.tfvars
  gcp_project_id = "your-gcp-project-id"        # e.g., "my-client-project-123"
  gcp_region     = "us-central1"                # Or your desired GCP region
  client_name    = "your-unique-client-prefix"  # e.g., "acme-corp" - used for naming resources

  # Optional: Override default secret names if they differ in Secret Manager
  # shopify_shop_url_secret_name      = "custom-shop-url-secret"
  # shopify_client_id_secret_name     = "custom-client-id-secret"
  # shopify_client_secret_secret_name = "custom-client-secret-secret"
  ```

  Ensure `gcp_project_id` matches the project ID set in `gcloud config`, and that `gcp_region` is consistent across all deployments.
The Docker images for the ingestion and transformation services encapsulate all necessary code and dependencies.
Important Considerations for Python Packages & Docker:
- The `pipelines/` directory is treated as the top-level Python package for all ingestion logic.
- The `faker` library, used by the mock data generators, is included in the `etl` dependencies to ensure it's available in the Docker image.
- The Docker builds ensure that the Python environment within the container (specifically the `PYTHONPATH`) is correctly configured to discover all `pipelines` modules.
You must be authenticated with `gcloud` and have Docker configured to use Artifact Registry for this step:

```
gcloud auth configure-docker ${gcp_region}-docker.pkg.dev
```

Replace `${gcp_region}` with the region you specified in `terraform.tfvars`.
Now, run the following poe commands from the root of the project:
```
# Build, tag, and push the ingestion image
uv run poe build-push-ingestion

# Build, tag, and push the transformation image
uv run poe build-push-transformation
```

The Terraform apply step will create all the GCP resources (Artifact Registry, Service Account, Cloud Run jobs, Workflow, Scheduler, IAM bindings for secrets, etc.) as defined in your Terraform files.
Important: Cloud Run Job Memory for dbt Transformations:
If you encounter Out-of-memory errors (exit code 137) during the transformation job, it indicates that dbt requires more memory for its compilation and execution tasks. You will need to adjust the memory limit in terraform/main.tf for the transformation_runner job. Locate the resources block within the containers definition and increase the memory value. A good starting point is 2Gi or 4Gi.
Example snippet for transformation_runner in terraform/main.tf (ensure correct nesting):
```hcl
resource "google_cloud_run_v2_job" "transformation_runner" {
  # ...
  template {
    template {
      # ...
      containers {
        # ...
        resources {
          limits = {
            memory = "2Gi" # Increase this value if OOM errors occur
          }
        }
      }
    }
  }
}
```

From the root of the project:
```
# Initialize Terraform (only needed once per project setup)
uv run poe tf-init

# Plan and apply the changes
uv run poe tf-apply
```

Terraform will read the variables from your `terraform.tfvars` file and configure the resources accordingly. Review the plan carefully before approving the apply.
If a Cloud Run job fails or a Workflow execution errors out, the most important step is to check the logs.
- For Cloud Run Jobs: Navigate to the Google Cloud Console -> Cloud Run -> Jobs. Select the job (e.g., `your-unique-client-prefix-ingestion-runner`), go to the Executions tab, and click on a failed execution to view its logs.
- For Cloud Workflows: Navigate to the Google Cloud Console -> Workflows. Select the workflow (e.g., `your-unique-client-prefix-main-workflow`), go to the Executions tab, and click on a failed execution to view its logs and identify the failing step. The workflow is configured with robust `try/except` error handling to provide clearer insight into job failures.
Once deployed, you can trigger the entire data pipeline immediately via the Cloud Workflow:
```
gcloud workflows execute your-unique-client-prefix-main-workflow --location=your-gcp-region
```

Replace the placeholders with your configured values.
The pipeline is scheduled to run daily via a Cloud Scheduler job (e.g., your-unique-client-prefix-workflow-scheduler). By default, this job is created in a paused state to prevent unexpected charges or executions during initial setup.
To activate or pause the daily pipeline runs:
- Open the `terraform/main.tf` file.
- Find the `google_cloud_scheduler_job.workflow_scheduler` resource.
- Set the `paused` argument to `false` (to activate daily runs) or `true` (to pause them).
- Apply the change: `uv run poe tf-apply`
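The change amounts to flipping a single argument (a sketch; surrounding arguments omitted — verify the attribute against the actual resource in `terraform/main.tf`):

```hcl
resource "google_cloud_scheduler_job" "workflow_scheduler" {
  # ... existing arguments unchanged ...
  paused = false  # set to true to pause the daily runs
}
```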
To completely remove all deployed cloud resources for a client and stop all associated costs, you can run terraform destroy.
From the root of the project:
```
uv run poe tf-destroy  # Note: this task should be added to pyproject.toml
```

This project uses Evidence.dev for data visualization and reporting. The Evidence project is located in the `/viz` directory and connects directly to the DuckDB database generated by the dbt models.
To start the local development server for Evidence:
- Navigate to the `viz` directory: `cd viz`
- Install NPM dependencies (only required once): `npm install`
- Run the dev server: `npm run dev`

The dashboard will open in your browser, typically at `http://localhost:3000`.
The frontend includes two pre-built dashboards:
- / (Daily Health Check): The main dashboard inspired by the "Health Check" design. It features interactive metric cards with conditional coloring based on goal achievement, a date range filter, and a trend chart with a built-in metric selector.
- /monthly-recap: A page that provides a high-level overview of the same KPIs aggregated by month, allowing for broader trend analysis.
For local development and rapid iteration, you can run ingestion pipelines and dbt transformations against local DuckDB files.
- Run ingestion to DuckDB: Use the unified pipeline runner with `--destination duckdb`:

  ```
  uv run python -m pipelines.run_pipeline shopify --destination duckdb
  uv run python -m pipelines.run_pipeline facebook_ads --destination duckdb
  uv run python -m pipelines.run_pipeline tiktok_ads --destination duckdb
  ```

  This will create or update `.duckdb` files in the `duckdb_files/` directory.

- Run dbt against DuckDB: Use the default `dev` dbt target, which is configured for DuckDB:

  ```
  uv run poe dbt-build
  ```

  This will materialize your dbt models into `duckdb_files/dbt_metrics.duckdb`.
- Unit & integration tests: `uv run poe test`
- Linting & formatting: `uv run poe lint`
- All checks: `uv run poe check`
- Transformation Layer: Standardized and deployed the dbt transformation layer to GCP Cloud Run. Fixed various errors related to permissions, memory, and dbt-BigQuery type compatibility.
- Ingestion Layer: Standardized `dlt` ingestion pipelines for Shopify, Facebook Ads, and TikTok Ads, enabling flexible deployment to both local DuckDB and BigQuery. A unified pipeline runner has been implemented.
- dbt Configuration: Updated `dbt` source configurations to dynamically switch between DuckDB and BigQuery based on the active `dbt` target.
- Terraform: Initial Terraform configuration is set up for deploying GCP resources, with client-specific parameterization handled via `terraform.tfvars`.
- Cloud Workflows Orchestration: Implemented robust error handling for Cloud Run job failures within the main workflow, ensuring better visibility and stability.
- Docker Builds: Ensured correct Python package discovery (`PYTHONPATH`) and dependency management (`faker` included in the `etl` dependencies) within the Docker images for seamless GCP deployment.