Skip to content

nbdevs/postgres-db-cluster

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Database backend for the apache-airflow ETL pipeline

This repository contains all of the database code - postgresql - and configuration files (bash scripts/dockerfiles) as well as specific configuration settings for networking and parameter tuning. Docker images are created from each of the components in the cluster to form the database cluster that airflow relies upon for the ETL pipeline.

Summary of Database System

E-R diagram constructed in 3rd normalised form representing all the entity relationships and multiplicities.

Screenshot 2023-08-02 at 13 06 34

A database was chosen for implementation over an operational data store, because the data collected for this project has been sourced from a single API and lacks heterogeneity of data sources.

An overview of the final entity descriptions for those names represented in the diagram are displayed below:

Screenshot 2023-08-02 at 13 29 09

Entity names shown above were categorised by brainstorming all the potential occurrences of things that would be expected to be stored within the database system. Doing so, reoccurring entities revealed themselves, and any potential aliases were documented to group those aliases with one general term.

Screenshot 2023-08-02 at 13 27 33

In addition to this, the process for identifying the relations between entities was to give relationship names to the entity descriptions that connected them to one another. Multiplicities of the entity occurrences involved in the relationships were then identified via close referral with FIA regulations to accurately represent the domain.

As displayed within the metadata descriptions, the primary and foreign keys of each entity were also identified at this point, and the resulting full description of the final entity relationships and multiplicities can be seen below in a few different examples:

Driver Championship  Grid  Pit  Driver Champ Stand 

Requirements related to Database System

Requirements Description Rationale
Proxy Server An authentication method shall be provided for use of PgBouncer to authenticate users against those allowed into the database system. This is so that the database is not exposed to malicious insiders or threats from hackers.
Database - Privileges The system shall limit the CRUD functionality of the users on the database based on designated privilege levels. This is to ensure the integrity of the database and prevent unauthorised users from accessing the database and corrupted the contents, hence destroying the system.
Data constraints – Airflow webserver Airflow webserver shall make use of DAG serialisation to store DAGs in the metadata database. This is to stop the webserver from processing DAGs, and prevent airflow being the point of leakage of data, as by default the webserver UI is accessible from any external network.
Network Security – Airflow Images Airflow components should run on security proven container images verified by Docker Trust Registry. Provides security against man in the middle attacks.
Access Management - Database Access to the database shall be authenticated via PgBouncer authentication method as an unprivileged user. This way existing connections are reused and the computational resource to fire up new connections each time is avoided.
Metadata Metadata shall be stored for both the DBMS and the data warehouse. So that prior and intermediary states of data can be logged before, during and after any data transformations.
DBMS Backup, Replicability and Failover The system may run a standby node for the database to ensure high availability of the DBMS in case of system failures or crashes. To prevent loss of data and metadata.
Choice of Database - Metadata The system shall use PostgreSQL as the backend database used to store the metadata of the system. Relational database which has plenty of open-source support and is necessary to model entity-relationships in the metadata model in preparation for the data warehouse. Here the metadata model, job templates for python DAG creation and synchronisation as well as PostgreSQL jobs via templates will be stored.

Software Dependencies/Installation

Docker Desktop for Mac is recommended for install in order to have access to the dockerd and in order to be able to use docker-compose.

  • su-exec
  • postgresql
  • docker-compose
  • docker

You can then clone and pull the repo and install the environment following these steps:

  1. Open terminal app in desktop
  2. Change the current working directory to the location which you want to directory to be cloned to.
  3. Use the git clone command, and the URL type you require (This example uses HTTPS).
    git clone https://github.com/nbdevs/postgres-db-cluster.git
  4. Once you click enter you should see the following to confirm success.
    $ git clone https://github.com/nbdevs/postgres-db-cluster.git
    > Cloning into `Project-Folder`...
    > remote: Counting objects: 10, done.
    > remote: Compressing objects: 100% (8/8), done.
    > remove: Total 10 (delta 1), reused 10 (delta 1)
    > Unpacking objects: 100% (10/10), done.

Project status

Ongoing - minor structural changes expected due to a few pending feature additions.

License

Copyright © 2022 Nicholas Bojor.

The code in this repository is licensed under the MIT license.

Releases

No releases published

Packages

No packages published