ETL Project (Batch/Local Edition)

Note: This is a work in progress.

Architecture diagram

Extraction

Raw data is extracted from the sources and saved as JSON files before any further processing. This ensures the original data remains available if we later want to run additional analysis or need to recover from data loss in a later stage. Two categories of data are collected.
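A minimal sketch of the "save raw data first" step, using only the standard library. The function name `save_raw`, the `raw` output directory, and the timestamped file naming are illustrative assumptions, not the project's actual layout.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_raw(records, source, out_dir="raw"):
    """Write one extraction batch to a timestamped JSON file and return its path.

    `source` is a short label for the data source; the naming scheme
    here is hypothetical.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    # A UTC timestamp in the file name keeps batches from different runs apart.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = out_path / f"{source}_{stamp}.json"
    with target.open("w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return target
```

Writing the untouched API response to disk before any transformation means a later stage can be re-run from the file instead of calling the source again.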

Extracting Data from Job Postings

I extracted data from job listing websites: Adzuna and Remotive through their REST API endpoints, and Stack Overflow Jobs through its RSS feed.
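As an illustration of the RSS side, the sketch below parses a feed with the standard library's `xml.etree.ElementTree`. It assumes a plain RSS 2.0 layout (`item` elements with `title`, `link`, and `pubDate`); the sample feed and the function name `parse_rss_items` are made up for the example, and the project itself may use a different parser.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


# Hypothetical feed content, standing in for a downloaded RSS response.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example jobs feed</title>
  <item>
    <title>Data Engineer</title>
    <link>https://example.com/jobs/1</link>
    <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

jobs = parse_rss_items(SAMPLE_FEED)
```

Keeping the parser separate from the download step makes it possible to re-parse feeds that were already saved as raw files.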

Extracting Data from GitHub Trends

GitHub trending repository data is scraped using the Python requests library together with BeautifulSoup.

Transformation

Pre-processing

The extracted data is first pre-processed. Since the websites use different field names in their API responses, I map every record onto a common set of fields with a consistent type and format.
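One way to picture this step is a per-source field map. The sketch below is an assumption about how such a mapping could look: the source labels (`source_a`, `source_b`), the raw field names, and the common schema are all hypothetical and do not reflect the real API responses.

```python
# Hypothetical raw-field -> common-field mappings, one per source.
FIELD_MAPS = {
    "source_a": {"title": "title", "company_name": "company", "url": "url"},
    "source_b": {"job_title": "title", "employer": "company", "redirect_url": "url"},
}

# The shared schema every record is reduced to (illustrative only).
COMMON_FIELDS = ("title", "company", "url")


def normalize(record, source):
    """Map one raw record onto the common schema, filling gaps with None."""
    mapping = FIELD_MAPS[source]
    row = {field: None for field in COMMON_FIELDS}
    for raw_key, common_key in mapping.items():
        if raw_key in record:
            row[common_key] = record[raw_key]
    # Keep track of where the record came from.
    row["source"] = source
    return row
```

Fields that are not in the map are dropped, and missing fields become `None`, so every row has the same keys regardless of its source.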
