# Module 2: Fundamentals of Data Engineering

## Sprint 3: Working with Data Pipelines & Apache Airflow

## Part 5: Job Application System

## About this Part

Congratulations! You've reached the final Part of this Sprint.
This Part serves as an integrative experience, allowing you to apply the knowledge and skills acquired in this and previous Sprints.

As the culmination of this Sprint, you'll construct a system that routinely retrieves job listing data and stores it in a Relational Database Management System (RDBMS).

Sprint projects often necessitate the use of skills, tools, and techniques not explicitly covered during the Sprint.
This is intentional, as true expertise stems from the ability to recognize the skills needed to solve a given problem and to acquire these skills as necessary.

Remember, perfection isn't expected at this stage.
You will continuously hone your skills and have ample opportunities to apply them in future projects.
For now, focus on leveraging what you've learned and giving it your best effort!

## Context

With all the new data engineering skills that you have gained, you've decided that it is about time to start looking for a job.
As a first step, you want to do an analysis of the job market to evaluate the demand.
To gather the data for this analysis, you decided to aggregate job ads from multiple sources like [Remotive](https://remotive.com), [We Work Remotely](https://weworkremotely.com), etc.

You decided to have (at least) this information in your database:
- Job title
- Name of the company
- Link to the job ad
- Job type (full-time, part-time, contract, etc.)
- Region (anywhere in the World, Europe, etc.)
- Salary (in EUR)
- Timestamp, when the job ad was posted
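
One way to pin these fields down before designing the table is a small record type. The sketch below is only an illustration: the class name `JobAd`, the field names, and the choice to make the salary optional (many ads do not list one) are assumptions, not part of the assignment.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class JobAd:
    """One normalized job ad, regardless of which source it came from."""

    title: str
    company: str
    url: str
    job_type: str                # e.g. "full-time", "part-time", "contract"
    region: str                  # e.g. "Anywhere in the World", "Europe"
    salary_eur: Optional[float]  # None when the ad does not state a salary
    posted_at: datetime          # stored as timezone-aware UTC (an assumption)


# Hypothetical example record; the values are made up for illustration.
example = JobAd(
    title="Senior Data Engineer",
    company="Example Corp",
    url="https://example.com/jobs/123",
    job_type="full-time",
    region="Europe",
    salary_eur=None,
    posted_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
```

Agreeing on a single shape like this early makes it easier to map every source onto the same database columns.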

You also decided to calculate these metrics in your database:
- The number of new job ads that contain "data engineer" in the title (per day).
- The number of new job ads that contain "data engineer" in the title and are remote friendly (per day).
- The maximum, minimum, average, and standard deviation of salaries of job ads that contain "data engineer" in the title (total and per day).
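
As a rough sketch of the first and third metrics, the example below uses an in-memory SQLite database as a stand-in for your RDBMS. The table name `job_ads`, its columns, and the sample rows are all hypothetical. SQLite has no built-in standard deviation aggregate, so the salary statistics are computed in Python here; in PostgreSQL, for instance, `stddev_samp` could do this inside the query instead. The remote-friendly metric would follow the same pattern once you decide how "remote friendly" is represented in your data.

```python
import sqlite3
import statistics

# In-memory SQLite database used purely as a stand-in for the real RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE job_ads (
        title      TEXT NOT NULL,
        region     TEXT,
        salary_eur REAL,
        posted_at  TEXT NOT NULL  -- ISO 8601 timestamp
    )
    """
)

# Made-up sample rows for illustration only.
rows = [
    ("Data Engineer", "Anywhere in the World", 60000.0, "2024-01-15T09:00:00"),
    ("Senior Data Engineer", "Europe", 80000.0, "2024-01-15T12:00:00"),
    ("Backend Developer", "Europe", 70000.0, "2024-01-15T13:00:00"),
    ("Data Engineer", "Europe", None, "2024-01-16T08:00:00"),
]
conn.executemany("INSERT INTO job_ads VALUES (?, ?, ?, ?)", rows)

# Number of new ads per day whose title contains "data engineer".
daily_counts = conn.execute(
    """
    SELECT date(posted_at) AS day, COUNT(*) AS new_ads
    FROM job_ads
    WHERE lower(title) LIKE '%data engineer%'
    GROUP BY day
    ORDER BY day
    """
).fetchall()

# Overall salary statistics for the same ads, skipping ads without a salary.
salaries = [
    row[0]
    for row in conn.execute(
        """
        SELECT salary_eur
        FROM job_ads
        WHERE lower(title) LIKE '%data engineer%'
          AND salary_eur IS NOT NULL
        """
    )
]
salary_stats = {
    "min": min(salaries),
    "max": max(salaries),
    "avg": statistics.mean(salaries),
    "stddev": statistics.stdev(salaries),  # sample standard deviation
}
```

Note that `statistics.stdev` needs at least two values, so a real implementation would have to handle days with zero or one salaried ad.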

Note: You are not limited to using APIs - you can also scrape the data from websites or ingest files. You are also not limited to just three sources - use as many as you want.

## Objectives for this Part

- Practice ingesting data from multiple sources.
- Practice orchestrating jobs using Apache Airflow.
- Practice setting up and managing an RDBMS.
- Practice building custom Apache Airflow operators.

## Requirements

- Your solution should encompass the functionality outlined in the Context section.
- You must ingest data from at least three different data sources.
- Your solution must be modular (e.g. each data source should be ingested using a separate Airflow Task).
- Offer insights on how your analysis could be improved.
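
To illustrate what "modular" could look like, here is a minimal plain-Python sketch in which each source has its own normalization function that maps a raw payload onto a shared record shape. The source names (`source_a`, `source_b`) and the payload keys are hypothetical and do not describe any real API. Airflow is deliberately left out so the sketch runs on its own; in your solution, each per-source function could be wrapped in its own Airflow Task.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def normalize_source_a(payload: dict) -> Record:
    """Map a raw ad from hypothetical source A onto the shared shape."""
    return {
        "title": payload["title"],
        "company": payload["company_name"],
        "url": payload["url"],
    }


def normalize_source_b(payload: dict) -> Record:
    """Map a raw ad from hypothetical source B onto the shared shape."""
    return {
        "title": payload["position"],
        "company": payload["employer"],
        "url": payload["link"],
    }


# One entry per source; adding a source means adding one function and one entry.
NORMALIZERS: Dict[str, Callable[[dict], Record]] = {
    "source_a": normalize_source_a,
    "source_b": normalize_source_b,
}


def ingest(source: str, payloads: List[dict]) -> List[Record]:
    """Normalize all raw payloads from a single source."""
    normalize = NORMALIZERS[source]
    return [normalize(payload) for payload in payloads]


# Made-up payloads showing that both sources end up in the same shape.
records = ingest(
    "source_a",
    [{"title": "Data Engineer", "company_name": "Example Corp", "url": "https://example.com/1"}],
) + ingest(
    "source_b",
    [{"position": "Data Engineer", "employer": "Sample Ltd", "link": "https://example.com/2"}],
)
```

Keeping the source-specific logic isolated like this means a failure or schema change in one source does not require touching the others.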

## Bonus

- Implement custom Airflow sensors. It might require some creative thinking about where to use them, but you are determined to do it even if this means that you are overengineering things.
- You want to show this project in interviews as part of your portfolio. Hence, you decided to set up your solution for production. Implement logging, regular backups, and error handling. In addition, suggest an approach to monitoring the system and, if possible, implement it.

## Evaluation Criteria

- Adherence to the requirements. How well did you meet the requirements?
- Code quality. Was your code well-structured? Did you use the appropriate levels of abstraction? Did you remove commented-out and unused code?
- System design. Did your solution use suitable technologies, tools, software architecture, and algorithms?
- Presentation quality. How comprehensive is your presentation, and how well are you able to explain your solution to the target audience?
- Conceptual understanding. How well do you know the concepts covered in this and previous Sprints?

## Correction

During your project correction, you should present it as if talking to a potential employer for whom you are showcasing your solution.  
You can assume that they will have decent software and data engineering skills - they will understand technical jargon and are expected to notice and question things that could have been done better or ask about the choices you've made.
While they are not familiar with the problem, the project purpose is easy to understand - your best bet is to focus your presentation on technological and design choices.

During the presentation, you might be asked questions to test your understanding of the covered topics, such as:

- How do multiple processes communicate with each other in Python?
- What is the GIL? Explain why it is important to release the GIL for a Python thread.
- What do we mean when we say that an object is awaitable in Python? Which objects are awaitable and which are not?
- What is `hasattr` in Python? How does it work?
- What is the purpose of the `object.__dict__` construct in Python?

IMPORTANT: during the correction, you will also be asked to solve an exercise using Python.


## General Correction Guidelines

For an in-depth explanation about how corrections work at Turing College, please read [this doc](https://turingcollege.atlassian.net/wiki/spaces/DLG/pages/537395951/Peer+expert+reviews+corrections).
