
Build an ETL pipeline using SSIS

This project provides a starting point for building an ETL pipeline using SQL Server Integration Services (SSIS) in Visual Studio 2019.

Goal

The primary goal of this project is to provide a basic solution for anyone building a new ETL pipeline with SSIS.

Prerequisites

Overview

The dataset is a sample dataset for a bicycle company. The objective is to design and implement an ETL solution that supports the following reports: sales by employee, sales by region, number of orders by employee, number of orders by region, sales by product subcategory, and sales by product and region.
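For example, once the warehouse is in place, a sales-by-employee report can be produced with a simple aggregate query. The sketch below is only an illustration; the table and column names (FactSalesOrder, DimEmployee, and their keys) are assumptions, not the schema this project prescribes.

    -- Illustrative only: sales per employee from a hypothetical star schema.
    -- FactSalesOrder, DimEmployee and the column names are assumed, not the
    -- actual tables designed later in this project.
    SELECT
        e.EmployeeKey,
        e.FullName,
        SUM(f.LineTotal)               AS TotalSales,
        COUNT(DISTINCT f.SalesOrderID) AS OrderCount
    FROM dbo.FactSalesOrder AS f
    JOIN dbo.DimEmployee    AS e ON e.EmployeeKey = f.EmployeeKey
    GROUP BY e.EmployeeKey, e.FullName
    ORDER BY TotalSales DESC;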

Overview - ETL Architecture Design

First-time ETL

Build a staging area, design a data warehouse architecture, and create an SSIS package to extract, transform, and load data from the source into the data warehouse.

Overview - First-time ETL process

  1. Import the source data into SQL Server using SQL Server Management Studio (SSMS)
  2. Design a data warehouse architecture that serves the reports described in the Overview section above

  3. Create the data warehouse database in SQL Server (Scripts)
  4. Create the staging area database in SQL Server (Scripts)
  5. Extract data from the source into the staging area

  6. Transform the data in the staging area into dimension and fact tables and save them back to the staging area

  7. Load dimension tables level 1 from the staging area into the data warehouse

  8. Load dimension tables level 2 from the staging area into the data warehouse

  9. Load dimension tables level 3 from the staging area into the data warehouse

  10. Load fact SalesOrder tables from the staging area into the data warehouse

  11. Load fact Product tables from the staging area into the data warehouse

  12. Execute the TRUNCATE TABLE command on the dimension and fact tables in the staging area to prepare for the subsequent ETL processes (a SQL sketch of steps 3, 4, and 12 follows this list)
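The SQL behind steps 3, 4, and 12 is a handful of short statements. The sketch below uses assumed names (DW_Bicycle, Staging_Bicycle, DimProduct, FactSalesOrder); substitute the names you choose in your own design. Steps 5 to 11 are implemented as SSIS data flows rather than raw SQL.

    -- Steps 3 and 4: create the two databases (names are assumptions).
    CREATE DATABASE DW_Bicycle;       -- data warehouse
    CREATE DATABASE Staging_Bicycle;  -- staging area
    GO

    -- Step 12: empty the staging dimension and fact tables after they have
    -- been loaded into the warehouse, so the next run starts from a clean slate.
    USE Staging_Bicycle;
    TRUNCATE TABLE dbo.DimProduct;
    TRUNCATE TABLE dbo.FactSalesOrder;
    GO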

Next-time ETL

On subsequent runs, extract only the newly added, updated, or deleted data from the source and run the ETL process on that subset.

Solution Idea

  • New and updated records: use Slowly Changing Dimension and Lookup Transformation to compare the source data with the data previously saved to the staging area
    Tip: refer to the articles below to understand how to use Slowly Changing Dimension and Lookup Transformation
    Slowly Changing Dimension with Lookup and Conditional Split
    An Alternative to the OLE DB Command to update Slowly Changing Dimensions
  • Deleted records: when the staging area and data warehouse are first designed, every table includes an "IsDeleted" column with a default value of 0. Before extracting data from the source, update "IsDeleted" to 1 for all existing rows. During extraction, set "IsDeleted" back to 0 for rows whose IDs match between the source data and the data previously loaded into the staging area. The rows still flagged with "IsDeleted" = 1 then represent the records that were deleted from the source (see the sketch below).
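    A minimal T-SQL sketch of this soft-delete idea is shown below; the schema names (src for the source copy, stg for the staging area), the Person table, and the BusinessEntityID key are assumptions used only to illustrate the pattern.

        -- 1) Before extraction: provisionally mark every previously loaded row as deleted.
        UPDATE stg.Person
        SET IsDeleted = 1;

        -- 2) During extraction: rows whose IDs still exist in the source are not
        --    deleted, so flip their flag back to 0.
        UPDATE s
        SET s.IsDeleted = 0
        FROM stg.Person AS s
        JOIN src.Person AS p ON p.BusinessEntityID = s.BusinessEntityID;

        -- Rows still flagged with IsDeleted = 1 correspond to records that were
        -- deleted from the source.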

    Overview - Next-time ETL process

Tasks

  1. Update the "IsDeleted" column to 1 for the corresponding rows in the staging area before extracting data from the source.

  2. Extract the newly added, updated, or deleted data from the source into the staging area

    Using Slowly Changing Dimension for tables with a small amount of data

    Retrieve data from the source and split it into two streams. The first stream uses the Slowly Changing Dimension (SCD) transformation to add new records (those whose IDs do not yet exist in the staging area) and to update existing records that share an ID but differ in the remaining fields. The second stream updates the IsDeleted column to 0 for records whose source IDs match the IDs already present in the staging area.


    Using Lookup Transformation for tables with a large amount of data

    Retrieve the data from the source and divide it into two streams. The first stream uses a Lookup on the ID column to find records whose IDs do not yet exist in the staging area; these new records are inserted. For records with matching IDs, an additional Lookup detects changes in the remaining fields and applies the updates.
    The second stream extracts only the ID column from the source and stores it in the 'Update_IsDelete_nameTable' table in the staging area. This table is then used for a bulk update of the IsDeleted column in the corresponding tables.

  3. Use an 'Execute SQL Task' block to update the IsDeleted column to 0 in the corresponding tables (Person, OrderHeader, and OrderDetail) wherever the record IDs match the IDs captured in the 'Update_IsDelete_nameTable' table during the previous task. Once that update is complete, add another 'Execute SQL Task' block to truncate 'Update_IsDelete_nameTable' so it is ready for reuse in the next ETL run (see the SQL sketch after this task list).

  4. Delete the records whose IsDeleted value equals 1 from each respective table in the staging area

  5. Transform the data in the staging area into dimension and fact tables and save them back to the staging area

  6. Update the "IsDeleted" column to 1 for the corresponding rows in both the Dimension and Fact tables in the data warehouse, following the same concept as in the staging area

  7. Load the newly added, updated, or deleted data of the dimension and fact tables from the staging area into the data warehouse using Slowly Changing Dimension and Lookup Transformation, and update the "IsDeleted" values to 0 for rows whose IDs match between the staging area and the previously loaded data warehouse, following the same concept as in steps 2 and 3.
    Note! We do not use the "hard delete" technique from step 4 here; only the "soft delete" technique is applied in the data warehouse

    Load dimension tables level 1


    Load dimension tables level 2


    Load dimension tables level 3


    Load fact SalesOrder tables


    Load fact Product tables


    Update the IsDeleted column with a value of 0 in the corresponding tables and truncate Update_IsDelete_nameTable in the data warehouse after the update is complete

  8. Execute the TRUNCATE TABLE command on the dimension and fact tables in the staging area to prepare for subsequent ETL runs, following the same concept as step 12 of the first-time ETL
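The 'Execute SQL Task' blocks in tasks 3, 4, and 7 boil down to a few statements like the ones sketched below. The OrderHeader table, the SalesOrderID key, the stg schema name, and the Update_IsDelete_OrderHeader helper (an assumed instantiation of the 'Update_IsDelete_nameTable' pattern from task 2) are illustrations only, not the project's actual object names.

    -- Task 3 (staging): reset IsDeleted for rows whose IDs were re-extracted
    -- from the source, then empty the helper table for the next run.
    UPDATE t
    SET t.IsDeleted = 0
    FROM stg.OrderHeader AS t
    JOIN stg.Update_IsDelete_OrderHeader AS u ON u.SalesOrderID = t.SalesOrderID;

    TRUNCATE TABLE stg.Update_IsDelete_OrderHeader;

    -- Task 4 (staging only): hard-delete the rows flagged as removed.
    DELETE FROM stg.OrderHeader
    WHERE IsDeleted = 1;

    -- Task 7 (data warehouse): the same IsDeleted update and helper-table
    -- truncation are applied, but no DELETE is issued; rows remain as soft deletes.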

Note: all of the SSIS code is located in the Package-SSIS section.
If you have any questions about the code or encounter any issues while creating your ETL flow, please do not hesitate to contact me via email. Thank you.
