# <center> Data Warehouse Project </center>
<center> DLBDSEBI02 - Business Intelligence Project </center>
<center> IU International University of Applied Sciences </center>

# Aim
We aim to provide an optimal Business Intelligence (BI) architecture by developing a Data Warehouse (DWH) system. This involves identifying source systems, such as operational and departmental systems, to provide data for the DWH. A suitable DWH architecture will be proposed by evaluating different variants and discussing their pros and cons. Additionally, one key performance indicator (KPI) will be selected, and the required Extract, Transform, Load (ETL) process will be explained. The goal is to enhance transparency, implement KPIs, and reduce manual data consolidation across departments.

# Table of contents
1. __Introduction__
2. __The Data Warehouse Architecture__
3. __Visualizing the Data__
4. __Summary__

We start the project by loading the required libraries.

In [2]:
# Importing libraries
import pandas as pd
import psycopg2
import warnings

# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

### 1. Introduction

In this notebook, we illustrate how the Data Warehouse delivers structured, cleaned, and non-redundant datasets for Business Intelligence solutions. As an example, we show how the Sales Data Mart is conceptualized and populated, and how the Sales KPI is derived from it.


### 2. The Data Warehouse Architecture
A Data Warehouse typically consists of four __LAYERS__, similar to the following:

![Local Image](./figures/Figure%201.png)

For this project, we decided to implement the following architecture in the cloud, using AWS services:

![Local Image](./figures/Figure%202.png)

#### 2.1. Data Marts Architecture 

As explained in the analysis of the __Project Report__, we agreed on the Data Mart Bus architecture. In other words, each department gets its own data mart, built with the same policies and technologies as the others, so that the marts can later be combined into one large and coherent Data Warehouse. As an example, we present the data mart of the Sales Department, which uses the following __STAR SCHEMA__.

![Local Image](./figures/Figure%203.png)
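As a rough illustration of the star schema, the sketch below creates the fact table and two of its dimensions in an in-memory SQLite database. The table and column names follow the CSV files loaded later in this notebook. The column types, the constraints, the reduction to two dimensions, and the use of SQLite instead of PostgreSQL are assumptions made for the example; the actual definitions are in __main_db.sql__. The helper `create_star_schema` is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; types and constraints are illustrative.
DDL = """
CREATE TABLE my_dim_date (
    date_id INTEGER PRIMARY KEY,
    date    TEXT NOT NULL,
    year    INTEGER NOT NULL,
    quarter INTEGER NOT NULL
);
CREATE TABLE my_dim_products (
    product_id          INTEGER PRIMARY KEY,
    product_description TEXT NOT NULL,
    unit_price          REAL NOT NULL
);
CREATE TABLE my_fact_transactions (
    transaction_id     INTEGER PRIMARY KEY,
    quantity           INTEGER NOT NULL,
    transaction_amount REAL NOT NULL,
    date_id            INTEGER NOT NULL REFERENCES my_dim_date (date_id),
    product_id         INTEGER NOT NULL REFERENCES my_dim_products (product_id)
);
"""


def create_star_schema(conn):
    """Create a reduced star schema and return the names of the created tables."""
    conn.executescript(DDL)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
tables = create_star_schema(conn)
print(tables)

# Each foreign key of the fact table points at exactly one dimension table
foreign_keys = conn.execute("PRAGMA foreign_key_list(my_fact_transactions)").fetchall()
print(len(foreign_keys))
```

The fact table holds only measures and foreign keys, while the descriptive attributes stay in the dimensions, which is what gives the schema its star shape.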

#### 2.2. The Dimension and Fact Tables

The aforementioned database, schema, and tables were created with the script __main_db.sql__. After populating these tables with data in the appropriate format, we can use the following script to generate a full dataset from the table __my_fact_transactions__.

In [3]:
# Connect to the database
connection = psycopg2.connect(dbname="fedor_warehouse",
                              user="postgres",
                              password="postgres",
                              host="localhost",
                              port="5433" # Default is 5432
                              )

cursor = connection.cursor() 

# Let's get the version of the database
cursor.execute("SELECT version();")
    
# Fetch the response
db_version = cursor.fetchone() 
    
# Print the response
print(f"Connected to {db_version[0]}")

Connected to PostgreSQL 16.0, compiled by Visual C++ build 1935, 64-bit
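The connection cell above hardcodes its credentials. One possible alternative, sketched here, reads them from the standard libpq environment variables (`PGDATABASE`, `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`). The helper `connection_params` is hypothetical, and its fallback values simply mirror the ones used above.

```python
import os


def connection_params(env=None):
    """Build keyword arguments for psycopg2.connect from environment variables."""
    env = os.environ if env is None else env
    return {
        "dbname": env.get("PGDATABASE", "fedor_warehouse"),
        "user": env.get("PGUSER", "postgres"),
        "password": env.get("PGPASSWORD", ""),
        "host": env.get("PGHOST", "localhost"),
        "port": env.get("PGPORT", "5432"),
    }


# A plain dict stands in for the real environment in this demonstration
params = connection_params({"PGPORT": "5433", "PGPASSWORD": "secret"})

# Print everything except the password
print({key: value for key, value in params.items() if key != "password"})
```

With the variables set in the shell, the connection could then be opened with `psycopg2.connect(**connection_params())`.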


Next, we retrieve the data from the data mart. Either of the following can handle this step:
* __SQL__, by joining the tables on their respective IDs and then retrieving the result.
* __Python__, by combining the data retrieved from the individual tables into one large dataset.

Here, I chose to do it with the Pandas library in Python.
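For comparison, the SQL route could look roughly like the sketch below. It joins the fact table to a single dimension on the shared ID. The rows are invented toy data, only a few columns are kept, and an in-memory SQLite database stands in for the PostgreSQL warehouse; all of these are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy tables reusing table and column names from the Sales Data Mart
conn.executescript("""
CREATE TABLE my_dim_customer (
    customer_id        INTEGER PRIMARY KEY,
    customer_last_name TEXT
);
CREATE TABLE my_fact_transactions (
    transaction_id     INTEGER PRIMARY KEY,
    transaction_amount REAL,
    customer_id        INTEGER REFERENCES my_dim_customer (customer_id)
);
INSERT INTO my_dim_customer VALUES (1, 'Brown'), (2, 'Taylor');
INSERT INTO my_fact_transactions VALUES (10, 1.95, 1), (11, 14.7, 2), (12, 2.1, 1);
""")

# Join the fact table to the dimension on the shared key
query = """
SELECT f.transaction_id, f.transaction_amount, c.customer_last_name
FROM my_fact_transactions AS f
JOIN my_dim_customer AS c ON c.customer_id = f.customer_id
ORDER BY f.transaction_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Doing the join in the database avoids pulling every table into memory first, whereas the Pandas route keeps all steps visible inside the notebook.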

In [4]:
# Load the addresses data 
addresses = pd.read_csv('./data/my_dim_address.csv')

# Print its info
addresses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   address_id  300 non-null    int64  
 1   zip_code    300 non-null    int64  
 2   street      300 non-null    object 
 3   city        300 non-null    object 
 4   country     300 non-null    object 
 5   latitude    300 non-null    float64
 6   longitude   300 non-null    float64
dtypes: float64(2), int64(2), object(3)
memory usage: 16.5+ KB


In [5]:
# Load the dates' table
dates = pd.read_csv('./data/my_dim_date.csv')

# Print its info
dates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date_id       546 non-null    int64 
 1   date          546 non-null    object
 2   year          546 non-null    int64 
 3   quarter       546 non-null    int64 
 4   quarter_name  546 non-null    object
 5   month         546 non-null    int64 
 6   month_name    546 non-null    object
 7   day           546 non-null    int64 
 8   weekday       546 non-null    int64 
 9   weekday_name  546 non-null    object
dtypes: int64(6), object(4)
memory usage: 42.8+ KB
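The `date` column is loaded with dtype `object`, that is, as plain strings. If real date arithmetic is needed, it could be converted as in the sketch below. The day/month/year format is an assumption inferred from the sample rows shown later in this notebook (for example `17/04/2023` with month name April), and the three rows here are toy data.

```python
import pandas as pd

# Toy rows in the same day/month/year layout as the date column
dates_demo = pd.DataFrame({
    "date_id": [1, 2, 3],
    "date": ["17/04/2023", "29/09/2022", "03/10/2022"],
})

# An explicit format avoids any day/month ambiguity during parsing
dates_demo["date"] = pd.to_datetime(dates_demo["date"], format="%d/%m/%Y")

print(dates_demo.dtypes)
print(dates_demo["date"].dt.month.tolist())
```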


In [6]:
# Load the invoices' data
invoices = pd.read_csv('./data/my_dim_invoice.csv')

# Print its info
invoices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5001 entries, 0 to 5000
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   invoice_id      5001 non-null   int64  
 1   invoice_amount  5001 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 78.3 KB


In [7]:
# Loading the customers data
customers = pd.read_csv('./data/my_dim_customer.csv')

# Print its info
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   customer_id          300 non-null    int64 
 1   customer_first_name  300 non-null    object
 2   customer_last_name   300 non-null    object
 3   customer_mail        300 non-null    object
dtypes: int64(1), object(3)
memory usage: 9.5+ KB


In [8]:
# Load the products' data
products = pd.read_csv('./data/my_dim_products.csv')

# Print its info
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   product_id           3000 non-null   int64  
 1   product_description  3000 non-null   object 
 2   unit_price           3000 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 70.4+ KB


In [9]:
# Load the transaction's data
transactions = pd.read_csv('./data/my_fact_transactions.csv')

# Print its info
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 8 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   transaction_id      500000 non-null  int64  
 1   quantity            500000 non-null  int64  
 2   transaction_amount  500000 non-null  float64
 3   date_id             500000 non-null  int64  
 4   product_id          500000 non-null  int64  
 5   address_id          500000 non-null  int64  
 6   customer_id         500000 non-null  int64  
 7   invoice_id          500000 non-null  int64  
dtypes: float64(1), int64(7)
memory usage: 30.5 MB


The tables loaded above are the dimensions and the fact table of the __Sales__ Data Mart. Next, we join the fact table __Transactions__ with its dimensions.

In [10]:
# Merge the fact table with dimension tables based on foreign keys
merged_df = pd.merge(transactions, dates, on='date_id', how='inner')
merged_df = pd.merge(merged_df, addresses, on='address_id', how='inner')
merged_df = pd.merge(merged_df, customers, on='customer_id', how='inner')
merged_df = pd.merge(merged_df, products, on='product_id', how='inner')
merged_df = pd.merge(merged_df, invoices, on='invoice_id', how='inner')

# Drop the ID columns
df = merged_df.drop(columns=['date_id', 'address_id', 'customer_id', 'product_id', 'invoice_id'])

# Print the first few rows
df.head(10)

| | transaction_id | quantity | transaction_amount | date | year | quarter | quarter_name | month | month_name | day | ... | city | country | latitude | longitude | customer_first_name | customer_last_name | customer_mail | product_description | unit_price | invoice_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1.95 | 17/04/2023 | 2023 | 2 | Q2 | 4 | April | 17 | ... | Stuttgart | Germany | 48.7775 | 9.18 | Benjamin | Brown | Benjamin.Brown@mail.com | PACK 3 BOXES CHRISTMAS PANNETONE | 1.95 | 1558.84 |
| 1 | 117047 | 7 | 14.7 | 29/09/2022 | 2022 | 3 | Q3 | 9 | September | 29 | ... | Basel | Switzerland | 47.5606 | 7.5906 | Sophia | Taylor | Sophia.Taylor@mail.com | GROW YOUR OWN BASIL IN ENAMEL MUG | 2.1 | 1558.84 |
| 2 | 490593 | 1 | 2.1 | 03/10/2022 | 2022 | 4 | Q4 | 10 | October | 3 | ... | Cologne | Germany | 50.9422 | 6.9578 | John | Turner | John.Turner@mail.com | GROW YOUR OWN BASIL IN ENAMEL MUG | 2.1 | 1558.84 |
| 3 | 89403 | 4 | 50.0 | 18/02/2023 | 2023 | 1 | Q1 | 2 | February | 18 | ... | Frankfurt | Germany | 50.1106 | 8.6822 | Alexander | Turner | Alexander.Turner@mail.com | LANDMARK FRAME NOTTING HILL | 12.5 | 1558.84 |
| 4 | 475625 | 8 | 33.04 | 14/05/2022 | 2022 | 2 | Q2 | 5 | May | 14 | ... | Cologne | Germany | 50.9422 | 6.9578 | Sebastian | Anderson | Sebastian.Anderson@mail.com | FAIRY TALE COTTAGE NIGHT LIGHT | 4.13 | 1558.84 |
| 5 | 25297 | 3 | 4.98 | 01/06/2023 | 2023 | 2 | Q2 | 6 | June | 1 | ... | Glasgow | United Kingdom | 55.8611 | -4.25 | Sophia | Green | Sophia.Green@mail.com | PARTY PIZZA DISH BLUE POLKADOT | 1.66 | 1558.84 |
| 6 | 163338 | 2 | 1.7 | 18/01/2022 | 2022 | 1 | Q1 | 1 | January | 18 | ... | Amsterdam | Netherlands | 52.3728 | 4.8936 | Olivia | Thompson | Olivia.Thompson@mail.com | LETTER L BLING KEY RING | 0.85 | 1558.84 |
| 7 | 424778 | 9 | 71.55 | 13/07/2022 | 2022 | 3 | Q3 | 7 | July | 13 | ... | Edinburgh | United Kingdom | 55.9533 | -3.1892 | Abigail | Martin | Abigail.Martin@mail.com | ICE CREAM DESIGN GARDEN PARASOL | 7.95 | 1558.84 |
| 8 | 480774 | 5 | 39.75 | 11/03/2022 | 2022 | 1 | Q1 | 3 | March | 11 | ... | Stuttgart | Germany | 48.7775 | 9.18 | John | Green | John.Green@mail.com | ICE CREAM DESIGN GARDEN PARASOL | 7.95 | 1558.84 |
| 9 | 141095 | 9 | 11.25 | 13/04/2022 | 2022 | 2 | Q2 | 4 | April | 13 | ... | Liverpool | United Kingdom | 53.4075 | -2.9919 | Elijah | Campbell | Elijah.Campbell@mail.com | CRAZY DAISY HEART DECORATION | 1.25 | 1558.84 |
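An inner join silently drops fact rows whose key has no match in a dimension, and it multiplies rows if a dimension key is duplicated. The sketch below shows one way to guard against both, using the `validate` parameter of `pd.merge` and a row-count comparison. The two small frames are invented toy data that only reuse column names from the data mart.

```python
import pandas as pd

# Toy fact and dimension tables
fact = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "quantity": [1, 7, 4],
    "product_id": [100, 101, 101],
})
dim = pd.DataFrame({
    "product_id": [100, 101],
    "unit_price": [1.95, 2.10],
})

# validate="many_to_one" raises pandas.errors.MergeError if product_id
# appears more than once in the dimension table
merged = pd.merge(fact, dim, on="product_id", how="inner", validate="many_to_one")

# If the row count shrinks, some fact rows had no matching dimension row
if len(merged) != len(fact):
    raise ValueError("some fact rows had no matching dimension row")

print(merged)
```

The same two checks could be applied to each of the five merges above, one dimension at a time.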


__Important__: The data loaded above is saved in the _data_ folder of this project. The full merged dataset isn't included in that folder because of its large size. However, you can always rebuild it with the previous script.
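Since file size is the concern, one option is to write the dataset as a gzip-compressed CSV, which Pandas can read back directly. The sketch below uses a two-row toy frame and a temporary directory; the file name `sales_full.csv.gz` is a hypothetical choice.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the full merged dataset
demo = pd.DataFrame({
    "transaction_id": [1, 2],
    "transaction_amount": [1.95, 14.7],
})

# Hypothetical output location; a temporary directory keeps the sketch self-contained
out_path = os.path.join(tempfile.mkdtemp(), "sales_full.csv.gz")

# Write a gzip-compressed CSV without the index column
demo.to_csv(out_path, index=False, compression="gzip")

# read_csv infers the compression from the ".gz" suffix
restored = pd.read_csv(out_path)
print(restored.shape)
```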

In [11]:
# Generate the full dataset
# NOTE: Uncomment the following line and specify the path you want
# df.to_csv("Your preferred PATH.csv", index=False) 

# Summarize the merged dataset (info() prints directly and returns None)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 24 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   transaction_id       500000 non-null  int64  
 1   quantity             500000 non-null  int64  
 2   transaction_amount   500000 non-null  float64
 3   date                 500000 non-null  object 
 4   year                 500000 non-null  int64  
 5   quarter              500000 non-null  int64  
 6   quarter_name         500000 non-null  object 
 7   month                500000 non-null  int64  
 8   month_name           500000 non-null  object 
 9   day                  500000 non-null  int64  
 10  weekday              500000 non-null  int64  
 11  weekday_name         500000 non-null  object 
 12  zip_code             500000 non-null  int64  
 13  street               500000 non-null  object 
 14  city                 500000 non-null  object 
 15  country          

### 3. Visualizing the Data
At this stage, the dataset is ready for building our __Interactive Dashboard__, and a wide range of tools is available. I chose IBM Cognos Analytics for its simplicity and smoothness. With the data from the previous phase, the following dashboard was obtained:

![Local Image](./figures/Figure%204.png)
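Dashboard figures of this kind rest on aggregations of the merged dataset. As a minimal sketch, assuming the Sales KPI is the total sales amount per year and quarter (this notebook does not spell out the KPI formula, so that choice is an assumption), the aggregation could be computed in Pandas as follows. The rows are toy data that reuse column names from the merged dataset.

```python
import pandas as pd

# Toy rows with the same column names as the merged dataset
sales = pd.DataFrame({
    "year": [2022, 2022, 2022, 2023],
    "quarter_name": ["Q3", "Q3", "Q4", "Q1"],
    "country": ["Switzerland", "Germany", "Germany", "Germany"],
    "transaction_amount": [14.5, 33.0, 2.5, 50.0],
})

# Assumed KPI: total sales amount per year and quarter
kpi = (
    sales.groupby(["year", "quarter_name"], as_index=False)["transaction_amount"]
    .sum()
    .rename(columns={"transaction_amount": "total_sales"})
)

print(kpi)
```

Adding `country` to the grouping keys would break the same KPI down by market.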

### 4. Summary

Data Warehouses and Business Intelligence are crucial not only for large companies but also, and especially, for small and mid-size organizations and startups. They provide a complete overview of the health of the business and, more importantly, keep its data, its goldmine, safe.

## About the Author
<a href="https://www.linkedin.com/in/ab0858s/">Abdelali BARIR</a> is a veteran of the Moroccan Royal Armed Forces and a self-taught Python programmer. He is currently enrolled in the B.Sc. Data Science program at IU International University of Applied Sciences.

## Change Log
| Date         | Version   | Changed By       | Change Description        |
|--------------|-----------|------------------|---------------------------|
| 2024-09-16   | 1.01      | Abdelali Barir   | Modified markdown         |