Data Engineering Lab

1. Setup

To work through this lab you will need to install the following tools on your workstation.

1.A PostgreSQL

Download it from the official website. It includes both the database (which we will use to store and process data) and pgAdmin, an environment to query and manage PostgreSQL databases. During installation, make a note of two things you will need to connect to the database afterwards:

1. Port: By default it should be 5432. If it is different, make a note of it.
2. Superuser (postgres) password: To keep it simple I set it to "postgres", but if you choose a different one, don't forget it!
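As a rough sketch of how these two settings are used later, the snippet below assembles a PostgreSQL connection URL of the kind that SQLAlchemy's `create_engine` accepts. The function name `build_postgres_url` is hypothetical, and the host (`localhost`) and database name (`postgres`) are assumptions about a default local installation; adjust them to match your setup.

```python
from urllib.parse import quote_plus


def build_postgres_url(password, port=5432, user="postgres",
                       host="localhost", database="postgres"):
    """Assemble a postgresql:// connection URL from the installation settings.

    Host and database defaults are assumptions about a local default install.
    """
    # quote_plus escapes characters such as "@" or "/" that would otherwise
    # break the URL if they appear in the user name or password.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


url = build_postgres_url(password="postgres", port=5432)
print(url)  # postgresql://postgres:postgres@localhost:5432/postgres
```

If you picked a different port or password during installation, pass those values instead of the defaults shown here.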

1.B Python 3 libraries

We need the following libraries:

Jupyter: Needed to work with Jupyter notebooks.
Luigi: This library allows us to build and schedule data pipelines.

⚠️ To access all of Luigi's features you need a UNIX-like machine (Linux, macOS). In this lab we will just run the pipelines in local-scheduler mode.

Pandas: One of the best-known Python libraries for data manipulation.
SQLAlchemy: Library used as an interface to interact with different databases from Python.
Psycopg: This module serves as a connector to PostgreSQL.

If you have pip installed, you can install these libraries using the requirements file in this repository as follows:

pip install -r requirements.txt
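After installing, a quick way to confirm that everything is in place is to check that each library can be found by Python. This is a minimal sketch: the helper `missing_modules` is hypothetical, and the import names in `LAB_MODULES` are assumptions (in particular `psycopg2`; if the requirements file installs Psycopg 3, the import name would be `psycopg`).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Import names assumed for the libraries listed above.
LAB_MODULES = ["jupyter", "luigi", "pandas", "sqlalchemy", "psycopg2"]

missing = missing_modules(LAB_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All lab libraries are importable.")
```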

2. Learn SQL

In this lab we use some basic SQL (Structured Query Language). This is the main tool to interact with the vast majority of databases. Each database system has its own SQL "dialect", but there is a common core shared by all of them.
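To give a flavor of that common core, here is a small sketch that creates a table, inserts a few rows, and runs a filtered, grouped query. It uses Python's built-in `sqlite3` module with an in-memory database instead of PostgreSQL so that it runs without a server. The table name, column names, and values are made up for illustration.

```python
import sqlite3

# In-memory SQLite database: a stand-in for PostgreSQL so the example
# runs without a server.
conn = sqlite3.connect(":memory:")

# Hypothetical table and values, for illustration only.
conn.execute("CREATE TABLE survey (region TEXT, indicator TEXT, value REAL)")
conn.executemany(
    "INSERT INTO survey (region, indicator, value) VALUES (?, ?, ?)",
    [
        ("north", "mask", 80.0),
        ("north", "covid", 2.0),
        ("south", "mask", 60.0),
        ("south", "covid", 4.0),
    ],
)

# Filter rows, aggregate per indicator, and sort the result.
rows = conn.execute(
    """
    SELECT indicator, AVG(value) AS avg_value
    FROM survey
    WHERE value > 1
    GROUP BY indicator
    ORDER BY indicator
    """
).fetchall()

print(rows)  # [('covid', 3.0), ('mask', 70.0)]
conn.close()
```

The `SELECT ... WHERE ... GROUP BY ... ORDER BY` statement is the portable part. Details such as the `?` placeholder are driver-specific (psycopg2, for example, uses `%s`).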

If you are not yet familiar with SQL, don't worry: it is among the easiest languages to learn. You can get a grasp of the basics from any of the resources below:

3. Repository content

Files you will find in this repository:

data_engineering_lab.ipynb: A Jupyter notebook with some examples of data manipulation that serves as an initial setup.
lab_params.py: Several parameters we will use during the lab (database settings, file paths).
lab_utils.py: A collection of functions to be reused in the lab, mostly to interact with the database (create tables, load data, run queries, retrieve data...).
pipelines (folder): You can find here some examples of data pipelines using Luigi.
- rpl_covid_survey.py: Downloads daily reports from an API.
- covid_survey_covid_mask.py: Joins two reports for different indicators (covid and mask) into a single table.
- covid_survey_covid_mask_2.py: Similar to the previous pipeline, but using a table schema that is more scalable.
- covid_survey_json.py: Yet another iteration on covid_survey_covid_mask, now fully scalable so that it can handle a dynamic list of rpl_covid_XXX reports as input.
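The README does not describe the schemas these pipelines use, so the following is only one possible reading of what "more scalable" could mean. A common approach is to store one row per (date, indicator) pair instead of one column per indicator, so that adding a new report requires no schema change. The function `to_long_format`, the field names `date` and `value`, and the sample data are all hypothetical.

```python
def to_long_format(reports):
    """Stack per-indicator reports into (date, indicator, value) rows.

    reports maps an indicator name to a list of {"date": ..., "value": ...}
    records. These field names are hypothetical.
    """
    rows = []
    for indicator, records in reports.items():
        for record in records:
            rows.append((record["date"], indicator, record["value"]))
    # Sort by date, then indicator, so the output order is deterministic.
    return sorted(rows)


# Made-up sample data: any number of indicators can be passed in.
reports = {
    "covid": [{"date": "2021-01-01", "value": 2.0}],
    "mask": [{"date": "2021-01-01", "value": 80.0}],
}
long_rows = to_long_format(reports)
print(long_rows)
# [('2021-01-01', 'covid', 2.0), ('2021-01-01', 'mask', 80.0)]
```

Because the indicator is a value in a column rather than part of the table structure, the same code handles two reports or twenty.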
