# ETL demo

The core idea of ETL (extract, transform, load) is to operate on data.
You therefore need to understand the data and the relationships within it,
and then use scripts to organize it inside the database or to trigger the next step of a process.

It is a powerful way to make multiple data sources work together and to produce the insights or applications we want.

We will use SQL Server, Microsoft's relational database, as the example for this ETL demo.
The goal is to load data from a CSV file into the database.
(This way you also get to see relational database solutions from different vendors; SQL Server and PostgreSQL are two of the most widely used relational databases.)


## Connect to the SQL Server
First, as with the PostgreSQL connection process, you need a driver to connect to the database from Python.
You have two options here:

- pyodbc
- pymssql

Either one works; we pick pymssql and continue.
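
For comparison, here is a minimal sketch of what the pyodbc route could look like. It assumes the "ODBC Driver 18 for SQL Server" driver is installed and that the demo server uses a self-signed certificate; the helper names `build_odbc_connection_string` and `connect_with_pyodbc` are hypothetical and not part of this demo.

```python
def build_odbc_connection_string(server, database, user, password,
                                 driver="ODBC Driver 18 for SQL Server"):
    """Assemble an ODBC connection string for SQL Server."""
    parts = {
        "DRIVER": "{" + driver + "}",
        "SERVER": server,
        "DATABASE": database,
        "UID": user,
        "PWD": password,
        # Assumption: a local demo server with a self-signed certificate.
        "TrustServerCertificate": "yes",
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


def connect_with_pyodbc(server, database, user, password):
    """Open a connection through pyodbc using the string built above."""
    # Imported lazily so the string helper works without pyodbc installed.
    import pyodbc
    return pyodbc.connect(
        build_odbc_connection_string(server, database, user, password)
    )


print(build_odbc_connection_string("sqlserver", "AdventureworksDWDemo", "sa", "***"))
```

This sketch does not quote values that contain `;` or braces, so a password with those characters would need extra handling.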

In [1]:
import pymssql

This line imports the pymssql module, which provides a Python interface to Microsoft SQL Server.

In [2]:
# Define your connection parameters
server_name = 'sqlserver'
database_name = 'AdventureworksDWDemo'
username = 'sa'
password = 'YourStrongPassw0rd'

These lines define the connection parameters for accessing the SQL Server database: the server name (`server_name`), the name of the database (`database_name`), the username (`username`), and the password (`password`).
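
Hardcoding credentials is fine for a local demo, but outside of one you may prefer to read them from the environment. The following is a minimal sketch under that assumption; the variable names `MSSQL_SERVER`, `MSSQL_DATABASE`, `MSSQL_USER` and `MSSQL_PASSWORD` and the helper `load_connection_settings` are hypothetical.

```python
import os


def load_connection_settings(env=os.environ):
    """Read connection parameters from environment variables.

    The demo values are used as fallbacks, except for the password.
    """
    return {
        "server_name": env.get("MSSQL_SERVER", "sqlserver"),
        "database_name": env.get("MSSQL_DATABASE", "AdventureworksDWDemo"),
        "username": env.get("MSSQL_USER", "sa"),
        # No fallback here: a missing password raises KeyError
        # instead of silently using a hardcoded value.
        "password": env["MSSQL_PASSWORD"],
    }


# Example with an explicit mapping instead of the real environment
settings = load_connection_settings({"MSSQL_PASSWORD": "YourStrongPassw0rd"})
print(settings["server_name"], settings["database_name"], settings["username"])
```

The resulting dictionary could then feed the same `pymssql.connect(...)` call used in this notebook.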

In [3]:
conn = pymssql.connect(server_name, username, password, database_name)

This line establishes a connection to the SQL Server database using the parameters defined earlier. It creates a connection object (conn) using the pymssql.connect() function.


In [4]:
conn

<pymssql._pymssql.Connection at 0x7ffffc171f80>

In [5]:
cursor = conn.cursor()

This line creates a cursor object (cursor) from the connection object. The cursor is used to execute SQL queries and fetch results.

In [6]:
cursor

<pymssql._pymssql.Cursor at 0x7ffffc117940>

In [7]:
script_parts = [
    "USE AdventureworksDWDemo",
    "CREATE TABLE DimCustomer (CustomerID int PRIMARY KEY IDENTITY, CustomerAltID varchar(50) NOT NULL, CustomerName varchar(256), Gender varchar(20))",
    "CREATE TABLE DimProduct (ProductKey int PRIMARY KEY IDENTITY, ProductAltKey varchar(10) NOT NULL, ProductName varchar(100), ProductActualCost money, ProductSalesCost money)",
    '''
      CREATE TABLE DimStores
    (
        StoreID int PRIMARY KEY IDENTITY,
        StoreAltID varchar(10) NOT NULL,
        StoreName varchar(100),
        StoreLocation varchar(100),
        City varchar(100),
        State varchar(100),
        Country varchar(100)
    )
    ''',
    '''
    CREATE TABLE DimSalesPerson
    (
        SalesPersonID int PRIMARY KEY IDENTITY,
        SalesPersonAltID varchar(10) NOT NULL,
        SalesPersonName varchar(100),
        StoreID int,
        City varchar(100),
        State varchar(100),
        Country varchar(100)
    )
    ''',
    '''
    CREATE TABLE FactProductSales
    (
        TransactionId bigint PRIMARY KEY IDENTITY,
        SalesInvoiceNumber int NOT NULL,
        StoreID int NOT NULL,
        CustomerID int NOT NULL,
        ProductID int NOT NULL,
        SalesPersonID int NOT NULL,
        Quantity float,
        SalesTotalCost money,
        ProductActualCost money,
        Deviation float
    )
    ''',
    '''
    ALTER TABLE FactProductSales ADD CONSTRAINT FK_StoreID FOREIGN KEY (StoreID) REFERENCES DimStores(StoreID)
    ''',
    '''
    ALTER TABLE FactProductSales ADD CONSTRAINT FK_CustomerID FOREIGN KEY (CustomerID) REFERENCES DimCustomer(CustomerID)
    ''',
    '''
    ALTER TABLE FactProductSales ADD CONSTRAINT FK_ProductKey FOREIGN KEY (ProductID) REFERENCES DimProduct(ProductKey)
    ''',
    '''
    ALTER TABLE FactProductSales ADD CONSTRAINT FK_SalesPersonID FOREIGN KEY (SalesPersonID) REFERENCES DimSalesPerson(SalesPersonID)
    '''
]

This variable defines a list called script_parts, which contains SQL statements to create tables and define relationships between them.

In [8]:
for part in script_parts:
    try:
        cursor.execute(part)
        conn.commit()  # Commit changes for DDL statements
    except Exception as e:
        print(f"Error executing SQL script: {e}")
        break  # Stop execution on error

# Close the connection
cursor.close()
conn.close()

Error executing SQL script: (2714, b"There is already an object named 'DimCustomer' in the database.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n")


This loop iterates over each SQL statement in script_parts. It tries to execute each statement using cursor.execute(part) and, on success, commits the change to the database using conn.commit(). If an error occurs during execution, it prints an error message and breaks out of the loop. In the run shown above, the first CREATE TABLE failed because DimCustomer already existed from an earlier run.


The last two lines of the cell close the cursor and connection objects to release database resources once the script has finished executing.
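
The "There is already an object named 'DimCustomer'" error appears whenever the cell is run a second time. One way to make the script rerunnable is to guard each CREATE TABLE with an existence check. The sketch below assumes trusted, hardcoded table names (it does no escaping); `create_table_if_missing` is a hypothetical helper, not part of pymssql.

```python
def create_table_if_missing(table_name, column_definitions):
    """Return a T-SQL statement that creates the table only if it is absent."""
    # OBJECT_ID(..., N'U') returns NULL when no user table has that name.
    return (
        f"IF OBJECT_ID(N'{table_name}', N'U') IS NULL\n"
        f"CREATE TABLE {table_name} ({column_definitions})"
    )


dim_customer_ddl = create_table_if_missing(
    "DimCustomer",
    "CustomerID int PRIMARY KEY IDENTITY, CustomerAltID varchar(50) NOT NULL, "
    "CustomerName varchar(256), Gender varchar(20)",
)
print(dim_customer_ddl)
```

The generated string can replace the matching entry in script_parts. The ALTER TABLE statements would need a similar guard, for example a check against `sys.foreign_keys`, which is not shown here.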

So far, we have created the tables and added the constraints. Next, we will load data into the tables.

There are multiple ways to do so; pick whichever best suits your project.

- Construct the SQL scripts and execute them as above
- Use pandas (which generates and runs the SQL for you)

We will demo the pandas approach.
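
For comparison, the first option (building the SQL yourself) could be sketched as follows. It uses only the standard library to turn a header-less CSV into a parameterised INSERT plus row tuples. The helper `build_insert` is hypothetical, and the in-memory sample stands in for the real file.

```python
import csv
import io


def build_insert(table, columns, csv_file):
    """Return a parameterised INSERT and the rows read from a header-less CSV."""
    # pymssql uses %s placeholders; the driver substitutes the values safely.
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    # Skip empty lines, keep every other record as a tuple of strings.
    rows = [tuple(record) for record in csv.reader(csv_file) if record]
    return sql, rows


sample = io.StringIO("IMI-001,Henry Ford,M\nIMI-002,Bill Gates,M\n")
sql, rows = build_insert(
    "DimCustomer", ["CustomerAltID", "CustomerName", "Gender"], sample
)
print(sql)
print(rows)

# With an open pymssql connection, this could then be executed as:
#   cursor.executemany(sql, rows)
#   conn.commit()
```

The identity column CustomerID is left out on purpose, since SQL Server generates it.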

In [9]:
import pandas as pd

In [10]:
from sqlalchemy import create_engine

These lines import the necessary modules: pandas for data manipulation and create_engine from sqlalchemy for database interaction.

In [11]:
connection_string = f"mssql+pymssql://{username}:{password}@{server_name}/{database_name}"
engine = create_engine(connection_string)

This section sets up a connection string using the connection parameters (username, password, server_name, database_name) and creates an SQLAlchemy engine using create_engine().
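
The f-string works for the demo password, but a password containing characters such as `@` or `/` would break the URL. A minimal sketch of one way to guard against that is to percent-encode the credentials with the standard library; `build_connection_url` is a hypothetical helper and the password below is made up.

```python
from urllib.parse import quote_plus


def build_connection_url(username, password, server_name, database_name):
    """Build an SQLAlchemy URL for pymssql with percent-encoded credentials."""
    return (
        f"mssql+pymssql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{server_name}/{database_name}"
    )


print(build_connection_url("sa", "p@ss/word", "sqlserver", "AdventureworksDWDemo"))
```

The returned string can be passed to create_engine() in place of connection_string.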

In [12]:
from pathlib import Path
# Data path: the parent of the current directory, then data/AdventureWorkDWDemo
script_path = Path.cwd().parent
data_path = script_path / "data" / "AdventureWorkDWDemo"

dim_customer_csv = data_path / "DimCustomer.csv"

These lines use pathlib to construct the path to the CSV file (DimCustomer.csv) containing the data to be imported into the database.

In [13]:
customer_df = pd.read_csv(dim_customer_csv, header=None)

In [14]:
customer_df

,0,1,2
0,IMI-001,Henry Ford,M
1,IMI-002,Bill Gates,M
2,IMI-003,Muskan Shaikh,F
3,IMI-004,Richard Thrubin,M
4,IMI-005,Emma Wattson,F


In [15]:
customer_df.columns = ["CustomerAltID", "CustomerName", "Gender"]

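This section reads the CSV file into a pandas DataFrame (customer_df). Since the CSV file has no header row, header=None is specified, so pandas labels the columns 0, 1 and 2. The next cell then assigns column names that match the DimCustomer table; CustomerID is omitted because SQL Server generates it. Passing the same list as names=... to read_csv would do both in one step.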

In [16]:
customer_df.to_sql("DimCustomer", con=engine, if_exists="append", index=False)

5

This line uses the to_sql() method to write the DataFrame (customer_df) to the SQL database table named "DimCustomer". It specifies that if the table already exists, the data should be appended (if_exists="append"), and that the DataFrame index should not be written as a column (index=False). The 5 shown above is the number of rows written.
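
Because if_exists="append" adds the rows again every time the cell is rerun, repeated runs create duplicates, which would explain why the query output further down lists each customer three times. A minimal sketch of one guard is to drop rows whose business key is already in the table before appending. `rows_not_yet_loaded` is a hypothetical helper, and the "IMI-006" row is made-up sample data.

```python
def rows_not_yet_loaded(new_rows, existing_keys, key_index=0):
    """Keep only rows whose business key is not already in the table."""
    seen = set(existing_keys)
    fresh = []
    for row in new_rows:
        key = row[key_index]
        if key not in seen:
            fresh.append(row)
            seen.add(key)  # also drops duplicates within the new batch
    return fresh


# existing_keys would come from: SELECT CustomerAltID FROM DimCustomer
existing_keys = ["IMI-001", "IMI-002"]
incoming = [
    ("IMI-001", "Henry Ford", "M"),
    ("IMI-006", "Hypothetical Customer", "F"),
]
print(rows_not_yet_loaded(incoming, existing_keys))

# A pandas equivalent for the DataFrame used here would be along the lines of:
#   customer_df[~customer_df["CustomerAltID"].isin(existing_keys)]
```

A unique constraint on CustomerAltID would be a stricter alternative, at the cost of the whole append failing when a duplicate is present.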

In [17]:
# Confirm the load by running a query

In [18]:
conn = pymssql.connect(server_name, username, password, database_name)
cursor = conn.cursor()

# These lines establish a new connection to the SQL Server database using pymssql and create a cursor object for executing SQL queries.

cursor.execute("SELECT * FROM DimCustomer")

# Fetch all rows
# This section executes a SQL query to select all rows from the "DimCustomer" table and fetches the result rows.
rows = cursor.fetchall()

if rows:
    # Print each row
    for row in rows:
        print(row)

# Close the cursor and connection to release database resources
cursor.close()
conn.close()

(1, 'IMI-001', 'Henry Ford', 'M')
(2, 'IMI-002', 'Bill Gates', 'M')
(3, 'IMI-003', 'Muskan Shaikh', 'F')
(4, 'IMI-004', 'Richard Thrubin', 'M')
(5, 'IMI-005', 'Emma Wattson', 'F')
(1002, 'IMI-001', 'Henry Ford', 'M')
(1003, 'IMI-002', 'Bill Gates', 'M')
(1004, 'IMI-003', 'Muskan Shaikh', 'F')
(1005, 'IMI-004', 'Richard Thrubin', 'M')
(1006, 'IMI-005', 'Emma Wattson', 'F')
(2002, 'IMI-001', 'Henry Ford', 'M')
(2003, 'IMI-002', 'Bill Gates', 'M')
(2004, 'IMI-003', 'Muskan Shaikh', 'F')
(2005, 'IMI-004', 'Richard Thrubin', 'M')
(2006, 'IMI-005', 'Emma Wattson', 'F')


You can load the remaining CSV files under the AdventureWorkDWDemo folder in the same way, and then use this database as the source for the cube creation.
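
Loading the remaining files could be sketched as a small loop. The column lists below mirror the CREATE TABLE statements from earlier, minus the IDENTITY keys. Only DimCustomer.csv is confirmed by this notebook; that the other files are header-less, named after their tables, and use this column order is an assumption. `build_load_plan` and `load_all` are hypothetical helpers.

```python
from pathlib import Path

# Dimensions first, fact table last, so foreign keys can be satisfied.
# Assumption: each CSV is header-less and named "<TableName>.csv".
TABLE_COLUMNS = {
    "DimCustomer": ["CustomerAltID", "CustomerName", "Gender"],
    "DimProduct": ["ProductAltKey", "ProductName",
                   "ProductActualCost", "ProductSalesCost"],
    "DimStores": ["StoreAltID", "StoreName", "StoreLocation",
                  "City", "State", "Country"],
    "DimSalesPerson": ["SalesPersonAltID", "SalesPersonName", "StoreID",
                       "City", "State", "Country"],
    "FactProductSales": ["SalesInvoiceNumber", "StoreID", "CustomerID",
                         "ProductID", "SalesPersonID", "Quantity",
                         "SalesTotalCost", "ProductActualCost", "Deviation"],
}


def build_load_plan(data_path):
    """Return (table, csv_path, columns) tuples in load order."""
    data_path = Path(data_path)
    return [
        (table, data_path / f"{table}.csv", columns)
        for table, columns in TABLE_COLUMNS.items()
    ]


def load_all(engine, data_path):
    """Append every CSV in the plan to its table via pandas."""
    import pandas as pd  # imported lazily; only needed for the actual load
    for table, csv_path, columns in build_load_plan(data_path):
        df = pd.read_csv(csv_path, header=None, names=columns)
        df.to_sql(table, con=engine, if_exists="append", index=False)


for table, csv_path, columns in build_load_plan("data/AdventureWorkDWDemo"):
    print(table, csv_path.name, len(columns))
```

Calling `load_all(engine, data_path)` with the engine and data_path defined above would run the load, provided the files match these assumptions.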

Overall, this script reads data from a CSV file into a Pandas DataFrame, writes the DataFrame to a SQL Server database table, and then verifies the data insertion by querying and printing the contents of the table.