
Build an ETL pipeline using SSIS

This project provides a starting point for building an ETL pipeline using SQL Server Integration Services (SSIS) in Visual Studio 2019.

Goal

The primary goal of this project is to provide a basic solution for anyone building a new ETL pipeline with SSIS.

Prerequisites

Overview

The dataset is a sample dataset for a bicycle company. The objective is to design and implement an ETL solution that supports the following reports: sales by employee, sales by region, number of orders by employee, number of orders by region, sales by product subcategory, and sales by product and region.
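For example, once the warehouse is in place, a sales-by-employee report can be produced with a simple aggregate query. The sketch below is only an illustration; the table and column names (FactSalesOrder, DimEmployee, and their keys) are assumptions, not the schema this project prescribes.

    -- Illustrative only: sales per employee from a hypothetical star schema.
    -- FactSalesOrder, DimEmployee and the column names are assumed, not the
    -- actual tables designed later in this project.
    SELECT
        e.EmployeeKey,
        e.FullName,
        SUM(f.LineTotal)               AS TotalSales,
        COUNT(DISTINCT f.SalesOrderID) AS OrderCount
    FROM dbo.FactSalesOrder AS f
    JOIN dbo.DimEmployee    AS e ON e.EmployeeKey = f.EmployeeKey
    GROUP BY e.EmployeeKey, e.FullName
    ORDER BY TotalSales DESC;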

Overview - ETL Architecture Design

First-time ETL

Build a staging area, design a data warehouse architecture, and create an SSIS package to extract, transform, and load data from the source into the data warehouse.

Overview - First-time ETL process

  1. Import the source data into SQL Server using SQL Server Management Studio (SSMS)
  2. Design a data warehouse architecture that serves the reports described in the Overview section above

  3. Create the data warehouse database in SQL Server (Scripts)
  4. Create the staging area database in SQL Server (Scripts)
  5. Extract data from the source into the staging area

  6. Transform the data in the staging area into dimension and fact tables and save them back to the staging area

  7. Load dimension tables level 1 from the staging area into the data warehouse

  8. Load dimension tables level 2 from the staging area into the data warehouse

  9. Load dimension tables level 3 from the staging area into the data warehouse

  10. Load fact SalesOrder tables from the staging area into the data warehouse

  11. Load fact Product tables from the staging area into the data warehouse

  12. Execute the TRUNCATE TABLE command on the dimension and fact tables in the staging area to prepare for the subsequent ETL processes (a SQL sketch of steps 3, 4, and 12 follows this list)
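The SQL behind steps 3, 4, and 12 is a handful of short statements. The sketch below uses assumed names (DW_Bicycle, Staging_Bicycle, DimProduct, FactSalesOrder); substitute the names you choose in your own design. Steps 5 to 11 are implemented as SSIS data flows rather than raw SQL.

    -- Steps 3 and 4: create the two databases (names are assumptions).
    CREATE DATABASE DW_Bicycle;       -- data warehouse
    CREATE DATABASE Staging_Bicycle;  -- staging area
    GO

    -- Step 12: empty the staging dimension and fact tables after they have
    -- been loaded into the warehouse, so the next run starts from a clean slate.
    USE Staging_Bicycle;
    TRUNCATE TABLE dbo.DimProduct;
    TRUNCATE TABLE dbo.FactSalesOrder;
    GO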

Next-time ETL

On subsequent runs, extract only the newly added, updated, or deleted data from the source and run the ETL process on that subset.

Solution Idea

  • New and updated records: use Slowly Changing Dimension and Lookup Transformation to compare the source data with the data previously saved to the staging area
    Tip: refer to the articles below to understand how to use Slowly Changing Dimension and Lookup Transformation
    Slowly Changing Dimension with Lookup and Conditional Split
    An Alternative to the OLE DB Command to update Slowly Changing Dimensions
  • Deleted records: when the staging area and data warehouse are first designed, every table includes an "IsDeleted" column with a default value of 0. Before extracting data from the source, update "IsDeleted" to 1 for all existing rows. During extraction, set "IsDeleted" back to 0 for rows whose IDs match between the source data and the data previously loaded into the staging area. The rows still flagged with "IsDeleted" = 1 then represent the records that were deleted from the source (see the sketch below).
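    A minimal T-SQL sketch of this soft-delete idea is shown below; the schema names (src for the source copy, stg for the staging area), the Person table, and the BusinessEntityID key are assumptions used only to illustrate the pattern.

        -- 1) Before extraction: provisionally mark every previously loaded row as deleted.
        UPDATE stg.Person
        SET IsDeleted = 1;

        -- 2) During extraction: rows whose IDs still exist in the source are not
        --    deleted, so flip their flag back to 0.
        UPDATE s
        SET s.IsDeleted = 0
        FROM stg.Person AS s
        JOIN src.Person AS p ON p.BusinessEntityID = s.BusinessEntityID;

        -- Rows still flagged with IsDeleted = 1 correspond to records that were
        -- deleted from the source.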

    Overview - Next-time ETL process

Tasks

  1. Update the "IsDeleted" column to 1 for the corresponding rows in the staging area before extracting data from the source.

  2. Extract the newly added, updated, or deleted data from the source into the staging area

    Using Slowly Changing Dimension for tables with a small amount of data

    Retrieve data from the source and split it into two streams. The first stream uses the Slowly Changing Dimension (SCD) transformation to add new records (those whose IDs do not yet exist in the staging area) and to update existing records that share an ID but differ in the remaining fields. The second stream updates the IsDeleted column to 0 for records whose source IDs match the IDs already present in the staging area.


    Using Lookup Transformation for tables with a large amount of data

    Retrieve the data from the source and divide it into two streams. The first stream uses a Lookup on the ID column to find records whose IDs do not yet exist in the staging area; these new records are inserted. For records with matching IDs, an additional Lookup detects changes in the remaining fields and applies the updates.
    The second stream extracts only the ID column from the source and stores it in the 'Update_IsDelete_nameTable' table in the staging area. This table is then used for a bulk update of the IsDeleted column in the corresponding tables.

  3. Use an 'Execute SQL Task' block to update the IsDeleted column to 0 in the corresponding tables (Person, OrderHeader, and OrderDetail) wherever the record IDs match the IDs captured in the 'Update_IsDelete_nameTable' table during the previous task. Once that update is complete, add another 'Execute SQL Task' block to truncate 'Update_IsDelete_nameTable' so it is ready for reuse in the next ETL run (see the SQL sketch after this task list).

  4. Delete the records whose IsDeleted value equals 1 from each respective table in the staging area

  5. Transform the data in the staging area into dimension and fact tables and save them back to the staging area

  6. Update the "IsDeleted" column to 1 for the corresponding rows in both the Dimension and Fact tables in the data warehouse, following the same concept as in the staging area

  7. Load the newly added, updated, or deleted data of the dimension and fact tables from the staging area into the data warehouse using Slowly Changing Dimension and Lookup Transformation, and update the "IsDeleted" values to 0 for rows whose IDs match between the staging area and the previously loaded data warehouse, following the same concept as in steps 2 and 3.
    Note! We do not use the "hard delete" technique from step 4 here; only the "soft delete" technique is applied in the data warehouse

    Load dimension tables level 1


    Load dimension tables level 2


    Load dimension tables level 3


    Load fact SalesOrder tables


    Load fact Product tables


    Update the IsDeleted column with a value of 0 in the corresponding tables and truncate Update_IsDelete_nameTable in the data warehouse after the update is complete

  8. Execute the TRUNCATE TABLE command on the dimension and fact tables in the staging area to prepare for subsequent ETL runs, following the same concept as step 12 of the first-time ETL
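The 'Execute SQL Task' blocks in tasks 3, 4, and 7 boil down to a few statements like the ones sketched below. The OrderHeader table, the SalesOrderID key, the stg schema name, and the Update_IsDelete_OrderHeader helper (an assumed instantiation of the 'Update_IsDelete_nameTable' pattern from task 2) are illustrations only, not the project's actual object names.

    -- Task 3 (staging): reset IsDeleted for rows whose IDs were re-extracted
    -- from the source, then empty the helper table for the next run.
    UPDATE t
    SET t.IsDeleted = 0
    FROM stg.OrderHeader AS t
    JOIN stg.Update_IsDelete_OrderHeader AS u ON u.SalesOrderID = t.SalesOrderID;

    TRUNCATE TABLE stg.Update_IsDelete_OrderHeader;

    -- Task 4 (staging only): hard-delete the rows flagged as removed.
    DELETE FROM stg.OrderHeader
    WHERE IsDeleted = 1;

    -- Task 7 (data warehouse): the same IsDeleted update and helper-table
    -- truncation are applied, but no DELETE is issued; rows remain as soft deletes.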

Note: all of the SSIS code is located in the Package-SSIS section.
If you have any questions about the code or encounter any issues while creating your ETL flow, please do not hesitate to contact me via email. Thank you.
