# CREATING AN ETL PIPELINE WITH SSIS

## Introduction

In this notebook, you will learn how to create an ETL pipeline with SSIS (SQL Server Integration Services). ETL stands for Extract, Transform, and Load. We will Extract our data from a flat file source, Transform it, and Load it into a relational database. Our flat file source will be a CSV file (part of the GreenConstruct Sales Database; you can download it from https://raw.githubusercontent.com/mdagteki/etl-projects/main/Project_ETL_SSIS_Package/GreenConstruct_Sales_Dataset_Long.csv).



Here is the structure of the CSV file:

```python
import pandas as pd

data = pd.read_csv('GreenConstruct_Sales_Dataset_Long.csv')
data.head(10)
```

| Unnamed: 0 | Month | Region | ProductGroup | Sales |
|---|---|---|---|---|
| 0 | 2010-01-01 | Central | Accessories | 375.0 |
| 1 | 2010-01-01 | Central | Construction Parts | 72360.93 |
| 2 | 2010-01-01 | Central | Pipes | 14013.93 |
| 3 | 2010-01-01 | Central | Tubes | 2476.4 |
| 4 | 2010-01-01 | East | Accessories | 40755.19 |
| 5 | 2010-01-01 | East | Construction Parts | 71147.43 |
| 6 | 2010-01-01 | East | Pipes | 100408.35 |
| 7 | 2010-01-01 | East | Tubes | 23224.78 |
| 8 | 2010-01-01 | North | Accessories | 8024.25 |
| 9 | 2010-01-01 | North | Construction Parts | 117884.41 |


Besides the row index in the first column, our data contains 4 columns (Month, Region, ProductGroup, and Sales) and 960 rows. The sales data is in long format and contains sales totals for each month from 2010 to 2013.
We are going to EXTRACT this data into our SSIS project, TRANSFORM it by splitting it into separate tables by Region and by Product Group, and then LOAD those tables into our relational database.
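The row count can be checked with simple arithmetic. The sketch below assumes one row per month, region, and product group combination, and takes the counts of 5 regions and 4 product groups from the destinations used later in this notebook:

```python
# Sanity check of the row count, assuming one row per
# (month, region, product group) combination.
months = 4 * 12        # 2010 through 2013, inclusive
regions = 5            # assumed from the 5 Region destinations used later
product_groups = 4     # assumed from the 4 Product Group destinations used later

expected_rows = months * regions * product_groups
print(expected_rows)   # 960
```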

## Create an ETL pipeline
### EXTRACT data from flat file

First, we create a Data Flow Task in our Control Flow and name it "Load CSV File to Source and distribute data according to region and product categories". Then we create a Flat File connection to our CSV file and name it LoadCsvFromSource. Inside the Data Flow Task, we add a Flat File Source and use our LoadCsvFromSource connection to Extract the data into our pipeline.

![image.png](P_003.png)

![image.png](P_007.png)

![image.png](P_008.png)
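For readers who think in code, the Extract step can be pictured roughly as follows. This is only a Python analogy for what the Flat File Source does, not how SSIS is implemented. The sample rows are copied from the preview above, the empty first header is an assumption based on pandas labelling it "Unnamed: 0", and `extract` is a hypothetical helper name:

```python
import csv
import io

# A few rows copied from the preview above. The empty first header is an
# assumption: pandas labels a column without a name as "Unnamed: 0".
SAMPLE_CSV = """\
,Month,Region,ProductGroup,Sales
0,2010-01-01,Central,Accessories,375.0
1,2010-01-01,Central,Construction Parts,72360.93
4,2010-01-01,East,Accessories,40755.19
8,2010-01-01,North,Accessories,8024.25
"""


def extract(text_stream):
    """Yield one dict per CSV row, dropping the index and parsing Sales."""
    for row in csv.DictReader(text_stream):
        yield {
            "Month": row["Month"],
            "Region": row["Region"],
            "ProductGroup": row["ProductGroup"],
            "Sales": float(row["Sales"]),
        }


rows = list(extract(io.StringIO(SAMPLE_CSV)))
print(len(rows), rows[0])
```

To run it against the real file, `open('GreenConstruct_Sales_Dataset_Long.csv', newline='')` could be passed in place of the `io.StringIO` object.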

### TRANSFORM data

Now we need to TRANSFORM our data into two groups of tables: one table per Region and one table per Product Group. To do that, we use a Multicast Transformation to send a copy of every row to two branches of the data flow, which we name "Distribute Data According to Regions" and "Distribute Data According to Product Groups". Each branch is then transformed separately.

![image.png](P_009.png)
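Conceptually, a Multicast delivers every input row, unchanged, to each of its outputs. A minimal Python analogy, where `multicast` and the branch lists are hypothetical names used only for illustration:

```python
def multicast(rows, *branches):
    """Send every row, unchanged, to every branch."""
    for row in rows:
        for branch in branches:
            branch.append(row)


# Two rows taken from the preview of the CSV file.
rows = [
    {"Region": "Central", "ProductGroup": "Pipes", "Sales": 14013.93},
    {"Region": "East", "ProductGroup": "Tubes", "Sales": 23224.78},
]

region_branch, product_branch = [], []
multicast(rows, region_branch, product_branch)
print(len(region_branch), len(product_branch))  # 2 2
```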

For each branch, we create a Conditional Split Transformation that routes each row to a separate output according to the conditions we define.

![image.png](P_010.png)

![image.png](P_011.png)
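A Conditional Split evaluates its conditions in order and sends each row to the first output whose condition is true; rows that match nothing go to a default output. The sketch below mimics that behaviour in Python. The function name, the output names, and the three regions (the only ones visible in the preview) are illustrative assumptions, not the exact configuration of the package:

```python
def conditional_split(rows, conditions, default="Default"):
    """Route each row to the first output whose condition matches."""
    outputs = {name: [] for name, _ in conditions}
    outputs[default] = []
    for row in rows:
        for name, predicate in conditions:
            if predicate(row):
                outputs[name].append(row)
                break
        else:
            # No condition matched, so the row goes to the default output.
            outputs[default].append(row)
    return outputs


# Hypothetical conditions for the Region branch.
region_conditions = [
    ("Central", lambda r: r["Region"] == "Central"),
    ("East", lambda r: r["Region"] == "East"),
    ("North", lambda r: r["Region"] == "North"),
]

rows = [
    {"Region": "Central", "ProductGroup": "Accessories", "Sales": 375.0},
    {"Region": "Central", "ProductGroup": "Pipes", "Sales": 14013.93},
    {"Region": "East", "ProductGroup": "Tubes", "Sales": 23224.78},
    {"Region": "North", "ProductGroup": "Accessories", "Sales": 8024.25},
]

by_region = conditional_split(rows, region_conditions)
print({name: len(out) for name, out in by_region.items()})
# {'Central': 2, 'East': 1, 'North': 1, 'Default': 0}
```

The Product Group branch would work the same way, with conditions that test the `ProductGroup` column instead.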


### LOAD data into relational database

Now we are going to LOAD our data into our relational database. To do that, we first need tables in our SQL Server database. Luckily, the SSIS designer in Visual Studio lets us create them easily while we add each destination. First, we add an ADO NET Destination to each of our Conditional Split Transformation outputs: 5 for Regions and 4 for Product Groups. For each destination, we create a table in SQL Server by clicking the New button in the ADO NET Destination Editor and correcting the name of the table.

![image.png](P_012.png)

We repeat this for all 9 of our destinations.
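The effect of each destination is "create a table, then insert the rows from one output". The sketch below illustrates that idea with Python's built-in `sqlite3` module as a stand-in for SQL Server, so it runs without a server. The table names, column types, and the `load` helper are assumptions made for the example, not what the designer generates:

```python
import sqlite3


def load(connection, table_name, rows):
    """Create the table if needed and insert all rows into it."""
    # The table name is interpolated directly; that is acceptable here only
    # because the names come from our own code, never from user input.
    connection.execute(
        f'CREATE TABLE IF NOT EXISTS "{table_name}" '
        "(Month TEXT, Region TEXT, ProductGroup TEXT, Sales REAL)"
    )
    connection.executemany(
        f'INSERT INTO "{table_name}" '
        "VALUES (:Month, :Region, :ProductGroup, :Sales)",
        rows,
    )
    connection.commit()


# Hypothetical outputs of a split by Region (rows from the preview).
outputs = {
    "Sales_Central": [
        {"Month": "2010-01-01", "Region": "Central",
         "ProductGroup": "Accessories", "Sales": 375.0},
        {"Month": "2010-01-01", "Region": "Central",
         "ProductGroup": "Pipes", "Sales": 14013.93},
    ],
    "Sales_East": [
        {"Month": "2010-01-01", "Region": "East",
         "ProductGroup": "Tubes", "Sales": 23224.78},
    ],
}

conn = sqlite3.connect(":memory:")
for table_name, table_rows in outputs.items():
    load(conn, table_name, table_rows)

central_count = conn.execute('SELECT COUNT(*) FROM "Sales_Central"').fetchone()[0]
east_count = conn.execute('SELECT COUNT(*) FROM "Sales_East"').fetchone()[0]
print(central_count, east_count)  # 2 1
```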

Our Data Flow Task will look like this:

![image.png](P_002.png)


### Executing the ETL pipeline SSIS package
Now our package is ready to be executed. We can either use the green Start button in the top menu, or right-click our Control Flow object and execute it.

![image.png](P_013.png)

![image.png](P_014.png)

After execution, if everything ran correctly, we should see green check marks on all of our data flow objects.

![image.png](P_001.png)

We can now also see our newly created tables in SQL Server, along with the data loaded into our relational database.

![image.png](P_004.png)

![image.png](P_005.png)

![image.png](P_006.png)



## Conclusion

We have successfully created an ETL pipeline using an SSIS package. As you have seen, it is not a complicated process at all.