
A complete ELT infrastructure built from broad-scale parallel web scrapers in the cloud, a data lake, a custom ELT pipeline, and a data warehouse.


Custom ELT Pipeline

Many jobs on LinkedIn that are tagged as entry-level roles ask for several years of experience (YoE). Sorting through these is inefficient, and the problem is exacerbated when hundreds of new entry-level-tagged jobs are posted every day; reviewing all of them manually becomes infeasible.

This project began as a way to automate this.

Overview

Please press the play button on the top right of the flowchart below to start the animation.

[Animated flowchart of the pipeline architecture]

Click here to view the live dashboard.

Technical Description

1. Scraping:

  • A Scrapy spider recursively crawls LinkedIn, collecting job postings and uploading the data to an S3 bucket; a minimal sketch of this step follows.
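
The repo's spider isn't reproduced here, but a minimal sketch of the approach could look like the following. The spider name, start URL, CSS selectors, bucket name, and the boto3 upload helper are illustrative assumptions, not the project's actual code.

# Illustrative sketch only; selectors, field names, and the bucket are assumptions.
import json
import boto3
import scrapy


class JobSpider(scrapy.Spider):
    name = "linkedin_jobs"  # hypothetical spider name
    # Hypothetical listing URL; the real start URLs and query parameters may differ.
    start_urls = ["https://www.linkedin.com/jobs/search/?keywords=data%20engineer&f_E=2"]

    def parse(self, response):
        # Extract one item per job card on the listing page.
        for card in response.css("div.base-card"):
            yield {
                "title": card.css("h3.base-search-card__title::text").get(default="").strip(),
                "company": card.css("h4.base-search-card__subtitle a::text").get(default="").strip(),
                "location": card.css("span.job-search-card__location::text").get(default="").strip(),
                "url": card.css("a.base-card__full-link::attr(href)").get(),
            }
        # Recurse: follow pagination until no "next" link remains.
        next_page = response.css("a[aria-label='Next']::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


def upload_to_s3(items, bucket="job-postings-lake", key="raw/jobs.json"):
    """Dump scraped items as JSON into the S3 data lake (bucket/key are placeholders)."""
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=json.dumps(items))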

2. ELT:

  • Data from S3 is cleaned and loaded into a temporary staging table; during this step the data is cast to appropriate data types before being copied into a single wide master table in Snowflake.
  • SQL transformations are then executed on the master table to create fact and dimension tables, as shown below (a code sketch follows the diagram):

[Diagram: fact and dimension tables derived from the master table]
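
As a hedged sketch (not the repo's actual scripts), the load-and-transform step could be driven from Python with the snowflake-connector-python package; the credentials, stage, and table/column names below are placeholder assumptions.

# Sketch of the load/transform flow; credentials, stage, and table/column
# names are placeholder assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="COMPUTE_WH", database="JOBS_DB", schema="PUBLIC",
)
cur = conn.cursor()

# 1. Load raw S3 files into a temporary staging table, casting types on the way in.
cur.execute("""
    COPY INTO STAGING_JOBS (title, company, location, years_of_experience, posted_at)
    FROM (
        SELECT $1:title::STRING, $1:company::STRING, $1:location::STRING,
               $1:years_of_experience::INTEGER, $1:posted_at::TIMESTAMP_NTZ
        FROM @s3_jobs_stage
    )
    FILE_FORMAT = (TYPE = 'JSON')
""")

# 2. Copy the cleaned, typed rows into the wide master table.
cur.execute("INSERT INTO MASTER_JOBS SELECT * FROM STAGING_JOBS")

# 3. Derive dimension and fact tables from the master table.
cur.execute("""
    CREATE OR REPLACE TABLE DIM_COMPANY AS
    SELECT DISTINCT company AS company_name, location FROM MASTER_JOBS
""")
cur.execute("""
    CREATE OR REPLACE TABLE FACT_POSTINGS AS
    SELECT title, company AS company_name, years_of_experience, posted_at
    FROM MASTER_JOBS
""")
cur.close()
conn.close()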

3. Visualization:

  • Finally, a dashboard queries these tables to visualize and present the organized data (an illustrative sketch follows).
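
The README doesn't name the dashboard tool, so purely as an illustration, here is how a Python dashboard (assuming Streamlit, plus the placeholder schema from the sketch above) might query one of the fact tables.

# Illustrative only; the actual dashboard tool, schema, and query may differ.
import pandas as pd
import snowflake.connector
import streamlit as st

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="COMPUTE_WH", database="JOBS_DB", schema="PUBLIC",
)
cur = conn.cursor()
cur.execute(
    "SELECT years_of_experience, COUNT(*) AS postings "
    "FROM FACT_POSTINGS GROUP BY years_of_experience ORDER BY 1"
)
df = pd.DataFrame(cur.fetchall(), columns=["years_of_experience", "postings"])

st.title("Entry-level postings vs. required experience")
st.bar_chart(df.set_index("years_of_experience")["postings"])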

4. Orchestration:

  • Scraping, staging, and ELT are automated and scheduled to run every hour using an Airflow DAG; a condensed sketch follows.
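
A condensed sketch of what an hourly DAG for this could look like, using Airflow's TaskFlow API; the DAG id and task bodies are placeholders standing in for the repo's actual operators.

# Illustrative DAG skeleton; the real DAG's tasks and dependencies may differ.
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="elt_job_pipeline",      # placeholder id
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",             # hourly cadence, as described above (Airflow >= 2.4)
    catchup=False,
)
def elt_job_pipeline():
    @task
    def scrape():
        """Placeholder: run the Scrapy spider and land raw JSON in S3."""

    @task
    def stage():
        """Placeholder: COPY the S3 files into the staging and master tables."""

    @task
    def transform():
        """Placeholder: run the SQL that builds the fact and dimension tables."""

    # Run the steps strictly in sequence each hour.
    scrape() >> stage() >> transform()


elt_job_pipeline()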

Deploying The Pipeline Locally

  1. Install the Astro CLI with Homebrew by running:
brew install astro
  2. Clone the repo, then run:
cd ELT-Data-Pipeline && pip3 install -r requirements.txt
  3. Run the following to start Airflow on localhost:
astro dev init && astro dev start

Known Issues

  • Some transformations are currently done with pandas. The equivalent SQL transformations are under development.
  • Some transformation scripts are not idempotent, so backfilling currently creates duplicate records; one possible mitigation is sketched below.
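
One possible way to close the idempotency gap (a hedged sketch, not the project's planned fix) is to key the master-table load on a natural identifier such as the posting URL, so that re-running a backfill upserts instead of appending.

# Hedged sketch; table, key, and column names are placeholder assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="COMPUTE_WH", database="JOBS_DB", schema="PUBLIC",
)
# MERGE keyed on the posting URL: matched rows are updated, new rows inserted,
# so replaying an hour's load does not create duplicate records.
conn.cursor().execute("""
    MERGE INTO MASTER_JOBS AS t
    USING STAGING_JOBS AS s ON t.url = s.url
    WHEN MATCHED THEN UPDATE SET
        t.title = s.title, t.company = s.company,
        t.location = s.location, t.posted_at = s.posted_at
    WHEN NOT MATCHED THEN INSERT (url, title, company, location, posted_at)
        VALUES (s.url, s.title, s.company, s.location, s.posted_at)
""")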
