This project automates the workflow for loading, cleaning, and uploading airline reviews data for visualization purposes. It leverages Apache Airflow for orchestration, PostgreSQL for data storage, and Elasticsearch for indexing and search functionalities.
- Docker and Docker Compose
- Apache Airflow
- PostgreSQL
- Elasticsearch
- Airflow Setup: Ensure Airflow is installed and configured according to the provided airflow.yaml configuration.
- Database Setup: Use PostgreSQL for data storage. Credentials and configurations can be adjusted in the Airflow DAG file and airflow.yaml.
- Elasticsearch Setup: Ensure Elasticsearch is running for indexing cleaned data. Adjust the connection details in the Airflow DAG script if necessary.
The dataset used in this project is the Airline Reviews Dataset, available on Kaggle.
The project's workflow consists of the following steps:
- Load CSV to PostgreSQL: A CSV file containing raw airline reviews data is loaded into a PostgreSQL database.
- Fetch Data: Data is fetched from PostgreSQL for processing.
- Data Preprocessing: The data undergoes cleaning and preprocessing to make it suitable for analysis and visualization.
- Upload to Elasticsearch: The cleaned data is uploaded to Elasticsearch, where it is indexed and made available for visualization.
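The preprocessing step above could look roughly like the following. This is a minimal sketch using only the standard library, not the cleaning logic of the actual DAG: the column names, the sample rows, and the specific rules (normalize headers, trim values, drop rows with missing values, drop exact duplicates) are assumptions chosen for illustration.

```python
import csv
import io


def clean_column_name(name):
    """Trim a header, lowercase it, and join its words with underscores."""
    return "_".join(name.strip().lower().split())


def clean_rows(rows):
    """Normalize headers, trim values, and drop incomplete or duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        normalized = {
            clean_column_name(key): (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore overflow fields from malformed lines
        }
        if any(value == "" for value in normalized.values()):
            continue  # drop rows with a missing value
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint in seen:
            continue  # drop exact duplicates
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned


# Hypothetical sample data; the real dataset's columns may differ.
raw_csv = io.StringIO(
    "Airline Name,Overall Rating,Review \n"
    "Example Air, 7 ,Good flight\n"
    "Example Air, 7 ,Good flight\n"
    "Sample Airways,,No rating given\n"
)
cleaned = clean_rows(csv.DictReader(raw_csv))
print(cleaned)
```

Of the three sample rows, only one survives: the second is an exact duplicate and the third has an empty rating.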
This workflow is automated using Apache Airflow, with each step represented as a task in the DAG named P2M3_Panji_DAG_hck.
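As an illustration of what the upload task might prepare, the sketch below formats cleaned records for the Elasticsearch bulk API, which expects newline-delimited JSON: one action line followed by one document line per record, with a trailing newline. The index name airline_reviews, the helper name, and the sample documents are hypothetical; the actual DAG may index documents differently, for example one at a time through a client library.

```python
import json


def build_bulk_payload(index_name, documents):
    """Build an NDJSON body for the Elasticsearch bulk API."""
    lines = []
    for doc_id, doc in enumerate(documents, start=1):
        # Action line: index this document under a sequential ID.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(doc_id)}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical cleaned records.
documents = [
    {"airline_name": "Example Air", "overall_rating": "7"},
    {"airline_name": "Sample Airways", "overall_rating": "4"},
]
payload = build_bulk_payload("airline_reviews", documents)
print(payload)
```

Using sequential IDs means that re-running the upload overwrites existing documents instead of duplicating them, at the cost of assuming a stable row order.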
To run the workflow:
- Start your Airflow environment.
- Navigate to the Airflow web interface.
- Trigger the P2M3_Panji_DAG_hck DAG.
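The DAG can also be triggered without the web interface. The sketch below builds a request for the stable REST API of Airflow 2.x, assuming the API is enabled with basic authentication. The base URL, username, and password are placeholders, not values taken from this project, and other Airflow versions may expose a different API path.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, username, password):
    """Build a POST request that asks Airflow 2.x to start a new DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"conf": {}}).encode(),  # empty run configuration
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder connection details; adjust them to your environment.
request = build_trigger_request(
    "http://localhost:8080", "P2M3_Panji_DAG_hck", "airflow", "airflow"
)
print(request.get_method(), request.full_url)

# To actually send the request against a running Airflow instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

The request is only built here, not sent, so the snippet runs without an Airflow instance. The DAG must be unpaused for the triggered run to execute.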
Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes.