
🍺 A data engineering project showcasing an ELT pipeline using modern technologies such as Delta-rs and Apache Airflow.


Rick and Morty ELT Pipeline

About • Installation • Dashboard • ELT Diagram • Airflow Graph • Improvements

About

In this data engineering project, my main goal was to build a data pipeline to extract, transform, and load data from the Rick and Morty API. I aimed to create an organized and efficient flow of data using a combination of tools and technologies.

To start, I utilized the Python requests package to fetch data from the Rick and Morty API. This marked the beginning of the data extraction process, where I collected information about characters, episodes, and locations.
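For illustration, here is a minimal sketch of that kind of paginated extraction; the function name and layout are my own, but the response shape (a `results` array plus an `info.next` link) matches the Rick and Morty API.

```python
import requests

def fetch_all(resource: str) -> list[dict]:
    """Walk the paginated Rick and Morty API and collect every record."""
    url = f"https://rickandmortyapi.com/api/{resource}"
    records = []
    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        payload = response.json()
        records.extend(payload["results"])
        url = payload["info"]["next"]  # None on the last page, which ends the loop
    return records

characters = fetch_all("character")
episodes = fetch_all("episode")
locations = fetch_all("location")
```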

For data storage, I took a modern approach by implementing a Delta Lake using the delta-rs package along with the Medallion architecture. Delta-rs let me manage structured and semi-structured data effectively, with ACID transactions and versioning, without relying on Apache Spark. The result was a lightweight yet powerful storage solution for anything short of big data.
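As a sketch of what landing a bronze-layer table with delta-rs looks like (the table paths here are illustrative, not necessarily the repo's layout):

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Land the raw API payload in the bronze layer as a Delta table.
bronze_df = pd.DataFrame(characters)  # `characters` from the extraction step above
write_deltalake("data/bronze/characters", bronze_df, mode="overwrite")

# Reading it back later needs no Spark cluster, just delta-rs.
dt = DeltaTable("data/bronze/characters")
df = dt.to_pandas()
print(dt.version(), len(df))
```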

In the transformation phase, I turned to the pandas library to shape the extracted data into a more manageable format through a series of cleaning, filtering, and structuring operations on DataFrames.
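Something along these lines, for example; the exact column handling is an assumption, though the nested `origin`/`location` fields and the `created` timestamp do come from the API:

```python
import pandas as pd

# Flatten the nested `origin` and `location` fields and tidy up types —
# the kind of silver-layer shaping done with pandas here.
silver_df = pd.json_normalize(characters, sep="_")
silver_df = (
    silver_df
    .rename(columns={"origin_name": "origin", "location_name": "location"})
    .astype({"id": "int64"})
    .assign(created=lambda d: pd.to_datetime(d["created"]))
    .drop(columns=["origin_url", "location_url"], errors="ignore")
)
```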

Apache Airflow orchestrated the entire process, letting me schedule and automate the extraction, transformation, and loading steps.
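A rough sketch of what that orchestration could look like with Airflow's TaskFlow API; the DAG name, schedule, and task split are assumptions for illustration, not the repo's exact DAG definition.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def rick_and_morty_elt():
    @task
    def extract(resource: str) -> str:
        # Pull the paginated API resource and land it in the bronze layer
        # (see the extraction sketch above; omitted here for brevity).
        return resource

    @task
    def transform_and_load(resource: str) -> None:
        # Read bronze, reshape with pandas, write silver/gold Delta tables.
        print(f"transforming {resource}")

    for resource in ("character", "episode", "location"):
        transform_and_load(extract(resource))

rick_and_morty_elt()
```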

Using Docker, I containerized the project. This step allowed me to encapsulate the entire workflow and its dependencies, making it easily portable across different environments. Containerization ensured consistency and eliminated potential compatibility issues, making deployment a breeze.

In essence, this project showcases my ability to seamlessly gather, store, transform, and orchestrate data using a well-chosen set of tools. From extracting data through APIs to utilizing advanced storage techniques, employing data transformation libraries, orchestrating tasks with Airflow, and finally containerizing the project, every step reflects a strategic approach to building a robust and efficient data pipeline, even if only on a small scale.

ELT Diagram

Airflow Graph

Key Technologies:

  • Python: Core language for extraction and transformation
  • Apache Airflow: Workflow orchestration and scheduling
  • Delta-rs: Data storage layer with ACID transactions and versioning
  • Docker: Containerization platform for easy deployment and reproducibility
  • Tableau: Visual analytics platform

Application services at runtime:

  • One Airflow worker
  • One Airflow scheduler
  • One Airflow triggerer
  • One Airflow webserver
  • Redis
  • Postgres

Installation

  1. Download Docker Desktop and start Docker
  2. Clone the repo

git clone https://github.com/jgrove90/rick-and-morty-deltalake.git

  3. Run start.sh

sh start.sh

  4. Access application services via the web browser:

    • Airflow UI - http://localhost:8080/

  5. Run teardown.sh to remove the application from your system, including Docker images

sh teardown.sh

Improvements

Data validation could be performed prior to loading the Delta tables at each layer of the pipeline. The source data was very clean and only needed simple transformations, but as a proof of concept I might return to this project and add a validation framework such as Great Expectations, Soda Core, or Pandera.

I'd probably go with Soda Core or Pandera, as they are lightweight frameworks compared to Great Expectations.
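For example, a Pandera schema guarding the characters table before a silver-layer load might look like this; the column names follow the API, but the specific checks are illustrative:

```python
import pandera as pa

# Hypothetical checks on the characters table before loading the silver layer.
characters_schema = pa.DataFrameSchema(
    {
        "id": pa.Column(int, pa.Check.gt(0), unique=True),
        "name": pa.Column(str, pa.Check.str_length(min_value=1)),
        "status": pa.Column(str, pa.Check.isin(["Alive", "Dead", "unknown"])),
        "species": pa.Column(str),
    },
    coerce=True,
)

# Raises a SchemaError if any check fails; `silver_df` is the transformed frame.
validated_df = characters_schema.validate(silver_df)
```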

Finally, a more in-depth statistical analysis could be performed using:

  • Jupyter Lab
  • Dashboards (might revisit this and make it more visually appealing)