
A complete ELT infrastructure built from broad-scale parallel web scrapers in the cloud, a data lake, a custom ELT pipeline, and a data warehouse.


Custom ELT Pipeline

Many jobs on LinkedIn that are tagged as entry-level roles ask for several years of experience (YoE). Sorting through these is inefficient, and the problem is exacerbated when hundreds of new entry-level-tagged jobs are posted every day; reviewing all of them manually becomes infeasible.

This project began as a way to automate this.

Overview

Please press the play button on the top right of the flowchart below to start the animation.

[Animated flowchart of the pipeline architecture]

Click here to view the live dashboard.

Technical Description

1. Scraping:

  • A Scrapy spider recursively crawls LinkedIn, collecting job postings and uploading the data to an S3 bucket; a minimal sketch of this step follows.
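
The repo's spider isn't reproduced here, but a minimal sketch of the approach could look like the following. The spider name, start URL, CSS selectors, bucket name, and the boto3 upload helper are illustrative assumptions, not the project's actual code.

# Illustrative sketch only; selectors, field names, and the bucket are assumptions.
import json
import boto3
import scrapy


class JobSpider(scrapy.Spider):
    name = "linkedin_jobs"  # hypothetical spider name
    # Hypothetical listing URL; the real start URLs and query parameters may differ.
    start_urls = ["https://www.linkedin.com/jobs/search/?keywords=data%20engineer&f_E=2"]

    def parse(self, response):
        # Extract one item per job card on the listing page.
        for card in response.css("div.base-card"):
            yield {
                "title": card.css("h3.base-search-card__title::text").get(default="").strip(),
                "company": card.css("h4.base-search-card__subtitle a::text").get(default="").strip(),
                "location": card.css("span.job-search-card__location::text").get(default="").strip(),
                "url": card.css("a.base-card__full-link::attr(href)").get(),
            }
        # Recurse: follow pagination until no "next" link remains.
        next_page = response.css("a[aria-label='Next']::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


def upload_to_s3(items, bucket="job-postings-lake", key="raw/jobs.json"):
    """Dump scraped items as JSON into the S3 data lake (bucket/key are placeholders)."""
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=json.dumps(items))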

2. ELT:

  • Data from S3 is cleaned and loaded into a temporary staging table; during this step the data is cast to appropriate data types before being copied into a single wide master table in Snowflake.
  • SQL transformations are then executed on the master table to create fact and dimension tables, as shown below (a code sketch follows the diagram):

[Diagram: fact and dimension tables derived from the master table]
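
As a hedged sketch (not the repo's actual scripts), the load-and-transform step could be driven from Python with the snowflake-connector-python package; the credentials, stage, and table/column names below are placeholder assumptions.

# Sketch of the load/transform flow; credentials, stage, and table/column
# names are placeholder assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="COMPUTE_WH", database="JOBS_DB", schema="PUBLIC",
)
cur = conn.cursor()

# 1. Load raw S3 files into a temporary staging table, casting types on the way in.
cur.execute("""
    COPY INTO STAGING_JOBS (title, company, location, years_of_experience, posted_at)
    FROM (
        SELECT $1:title::STRING, $1:company::STRING, $1:location::STRING,
               $1:years_of_experience::INTEGER, $1:posted_at::TIMESTAMP_NTZ
        FROM @s3_jobs_stage
    )
    FILE_FORMAT = (TYPE = 'JSON')
""")

# 2. Copy the cleaned, typed rows into the wide master table.
cur.execute("INSERT INTO MASTER_JOBS SELECT * FROM STAGING_JOBS")

# 3. Derive dimension and fact tables from the master table.
cur.execute("""
    CREATE OR REPLACE TABLE DIM_COMPANY AS
    SELECT DISTINCT company AS company_name, location FROM MASTER_JOBS
""")
cur.execute("""
    CREATE OR REPLACE TABLE FACT_POSTINGS AS
    SELECT title, company AS company_name, years_of_experience, posted_at
    FROM MASTER_JOBS
""")
cur.close()
conn.close()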

3. Visualization:

  • Finally, a dashboard queries these tables to visualize and present the organized data (an illustrative sketch follows).
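
The README doesn't name the dashboard tool, so purely as an illustration, here is how a Python dashboard (assuming Streamlit, plus the placeholder schema from the sketch above) might query one of the fact tables.

# Illustrative only; the actual dashboard tool, schema, and query may differ.
import pandas as pd
import snowflake.connector
import streamlit as st

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="COMPUTE_WH", database="JOBS_DB", schema="PUBLIC",
)
cur = conn.cursor()
cur.execute(
    "SELECT years_of_experience, COUNT(*) AS postings "
    "FROM FACT_POSTINGS GROUP BY years_of_experience ORDER BY 1"
)
df = pd.DataFrame(cur.fetchall(), columns=["years_of_experience", "postings"])

st.title("Entry-level postings vs. required experience")
st.bar_chart(df.set_index("years_of_experience")["postings"])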

4. Orchestration:

  • Scraping, staging, and ELT are automated and scheduled to run every hour using an Airflow DAG; a condensed sketch follows.
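
A condensed sketch of what an hourly DAG for this could look like, using Airflow's TaskFlow API; the DAG id and task bodies are placeholders standing in for the repo's actual operators.

# Illustrative DAG skeleton; the real DAG's tasks and dependencies may differ.
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="elt_job_pipeline",      # placeholder id
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",             # hourly cadence, as described above (Airflow >= 2.4)
    catchup=False,
)
def elt_job_pipeline():
    @task
    def scrape():
        """Placeholder: run the Scrapy spider and land raw JSON in S3."""

    @task
    def stage():
        """Placeholder: COPY the S3 files into the staging and master tables."""

    @task
    def transform():
        """Placeholder: run the SQL that builds the fact and dimension tables."""

    # Run the steps strictly in sequence each hour.
    scrape() >> stage() >> transform()


elt_job_pipeline()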

Deploying The Pipeline Locally

  1. Install the Astro CLI with Homebrew by running:
brew install astro
  2. Clone the repo, then run:
cd ELT-Data-Pipeline && pip3 install -r requirements.txt
  3. Run the following to start Airflow on localhost:
astro dev init && astro dev start

Known Issues

  • Some transformations are currently done with pandas. The equivalent SQL transformations are under development.
  • Some transformation scripts are not idempotent, so backfilling currently creates duplicate records; one possible mitigation is sketched below.
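
One possible way to close the idempotency gap (a hedged sketch, not the project's planned fix) is to key the master-table load on a natural identifier such as the posting URL, so that re-running a backfill upserts instead of appending.

# Hedged sketch; table, key, and column names are placeholder assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="COMPUTE_WH", database="JOBS_DB", schema="PUBLIC",
)
# MERGE keyed on the posting URL: matched rows are updated, new rows inserted,
# so replaying an hour's load does not create duplicate records.
conn.cursor().execute("""
    MERGE INTO MASTER_JOBS AS t
    USING STAGING_JOBS AS s ON t.url = s.url
    WHEN MATCHED THEN UPDATE SET
        t.title = s.title, t.company = s.company,
        t.location = s.location, t.posted_at = s.posted_at
    WHEN NOT MATCHED THEN INSERT (url, title, company, location, posted_at)
        VALUES (s.url, s.title, s.company, s.location, s.posted_at)
""")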
