This repository contains all of the database code - postgresql - and configuration files (bash scripts/dockerfiles) as well as specific configuration settings for networking and parameter tuning. Docker images are created from each of the components in the cluster to form the database cluster that airflow relies upon for the ETL pipeline.
E-R diagram constructed in 3rd normalised form representing all the entity relationships and multiplicities.
![Screenshot 2023-08-02 at 13 06 34](https://private-user-images.githubusercontent.com/75015699/257807033-9a386528-19c0-4aa4-997c-9235614ffac9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg5MTgxNTAsIm5iZiI6MTcxODkxNzg1MCwicGF0aCI6Ii83NTAxNTY5OS8yNTc4MDcwMzMtOWEzODY1MjgtMTljMC00YWE0LTk5N2MtOTIzNTYxNGZmYWM5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjIwVDIxMTA1MFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTY4NGYwZGNjMWNhZWEyYWNjNTJlOTI3YzE3Nzc4MmY0YTU1OTdiYTI5ZGJiNDJjNTcwYTdhZjEzYzY2Y2JhNjgmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.fFhtQ7D-sn7iij1N6e_meXj-_rF2kC6jKl9jNTW7P0I)
A database was chosen for implementation over an operational data store, because the data collected for this project has been sourced from a single API and lacks heterogeneity of data sources.
An overview of the final entity descriptions for those names represented in the diagram are displayed below:
![Screenshot 2023-08-02 at 13 29 09](https://private-user-images.githubusercontent.com/75015699/257812478-e1595501-ee7b-43bf-a0c9-1440803fda6e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg5MTgxNTAsIm5iZiI6MTcxODkxNzg1MCwicGF0aCI6Ii83NTAxNTY5OS8yNTc4MTI0NzgtZTE1OTU1MDEtZWU3Yi00M2JmLWEwYzktMTQ0MDgwM2ZkYTZlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjIwVDIxMTA1MFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTUxNWVlNGIzZTkxNWU3MTdlYjVlMDkzZmRjOWRmZTYyM2MwYWRmZTJiMWMwOTNhZWNmMjQ4YzY0YzFjYTRkMDUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.xyimnSHtThGQ7FaSvuxc4G99wszv3V1VGicyTHLqzPk)
Entity names shown above were categorised by brainstorming all the potential occurrences of things that would be expected to be stored within the database system. Doing so, reoccurring entities revealed themselves, and any potential aliases were documented to group those aliases with one general term.
![Screenshot 2023-08-02 at 13 27 33](https://private-user-images.githubusercontent.com/75015699/257812116-06dd101d-b129-4ad0-a51b-41787f6676ee.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg5MTgxNTAsIm5iZiI6MTcxODkxNzg1MCwicGF0aCI6Ii83NTAxNTY5OS8yNTc4MTIxMTYtMDZkZDEwMWQtYjEyOS00YWQwLWE1MWItNDE3ODdmNjY3NmVlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjIwVDIxMTA1MFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTBhZjc2NTE1MTE5MGIwMDRiNTU5Y2M1YzU3OGE0NTljYzQ1MGZjY2ExYWMxZTNhODJhMTU1NGQ4YmQ5ZGMxYWImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.OmXc_G6I8MAc-CEEHlAcuHD38b7PHgfnHkIU8fzYBbI)
In addition to this, the process for identifying the relations between entities was to give relationship names to the entity descriptions that connected them to one another. Multiplicities of the entity occurrences involved in the relationships were then identified via close referral with FIA regulations to accurately represent the domain.
As displayed within the metadata descriptions, the primary and foreign keys of each entity were also identified at this point, and the resulting full description of the final entity relationships and multiplicities can be seen below in a few different examples:
Requirements | Description | Rationale |
---|---|---|
Proxy Server | An authentication method shall be provided for use of PgBouncer to authenticate users against those allowed into the database system. | This is so that the database is not exposed to malicious insiders or threats from hackers. |
Database - Privileges | The system shall limit the CRUD functionality of the users on the database based on designated privilege levels. | This is to ensure the integrity of the database and prevent unauthorised users from accessing the database and corrupted the contents, hence destroying the system. |
Data constraints – Airflow webserver | Airflow webserver shall make use of DAG serialisation to store DAGs in the metadata database. | This is to stop the webserver from processing DAGs, and prevent airflow being the point of leakage of data, as by default the webserver UI is accessible from any external network. |
Network Security – Airflow Images | Airflow components should run on security proven container images verified by Docker Trust Registry. | Provides security against man in the middle attacks. |
Access Management - Database | Access to the database shall be authenticated via PgBouncer authentication method as an unprivileged user. | This way existing connections are reused and the computational resource to fire up new connections each time is avoided. |
Metadata | Metadata shall be stored for both the DBMS and the data warehouse. | So that prior and intermediary states of data can be logged before, during and after any data transformations. |
DBMS Backup, Replicability and Failover | The system may run a standby node for the database to ensure high availability of the DBMS in case of system failures or crashes. | To prevent loss of data and metadata. |
Choice of Database - Metadata | The system shall use PostgreSQL as the backend database used to store the metadata of the system. | Relational database which has plenty of open-source support and is necessary to model entity-relationships in the metadata model in preparation for the data warehouse. Here the metadata model, job templates for python DAG creation and synchronisation as well as PostgreSQL jobs via templates will be stored. |
Docker Desktop for Mac is recommended for install in order to have access to the dockerd and in order to be able to use docker-compose.
- su-exec
- postgresql
- docker-compose
- docker
You can then clone and pull the repo and install the environment following these steps:
- Open terminal app in desktop
- Change the current working directory to the location which you want to directory to be cloned to.
- Use the git clone command, and the URL type you require (This example uses HTTPS).
git clone https://github.com/nbdevs/postgres-db-cluster.git
- Once you click enter you should see the following to confirm success.
$ git clone https://github.com/nbdevs/postgres-db-cluster.git > Cloning into `Project-Folder`... > remote: Counting objects: 10, done. > remote: Compressing objects: 100% (8/8), done. > remove: Total 10 (delta 1), reused 10 (delta 1) > Unpacking objects: 100% (10/10), done.
Ongoing - minor structural changes expected due to a few pending feature additions.
Copyright © 2022 Nicholas Bojor.
The code in this repository is licensed under the MIT license.