
HLTV News ETL with Apache Airflow and Spark 🧭

📚 | Introduction

  • ETL (Extract, Transform, Load) is the data-warehousing process of pulling data out of source systems, transforming it into a more usable shape, and loading it into a data warehouse.
  • In this project, we extract news data from the HLTV website, transform it into an analysis-ready format, and load it into AWS S3.
  • We use Apache Airflow to schedule the ETL process and Apache Spark to transform the data.

Disclaimer

  • This project is for educational purposes only.
  • The data extracted from the website is the property of HLTV.
  • The data is not used for any commercial purposes.

🚀 | DAG

  • A DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
  • A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code.
  • The DAG is used by Apache Airflow to schedule the ETL tasks and monitor them.
  • The DAG in this project is scheduled to run once a day at 00:00 UTC; it can be configured to run at any interval.
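
For orientation, a daily 00:00 UTC schedule might be declared like this. The dag_id, default_args, and file layout here are illustrative assumptions, not the repository's actual code:

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="hltv_news_etl",          # assumed name
    default_args=default_args,
    schedule_interval="0 0 * * *",   # every day at 00:00 UTC
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    ...  # the six tasks described below are defined here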

DAG structure for this project:

[Figure: DAG task graph]

  • The DAG consists of 6 tasks (a rough wiring sketch follows this list):
    • extract_hltv_news: Extracts the news data from the HLTV website.
    • check_downloaded_file: Checks that the data file was downloaded.
    • run_transform: Runs the transformation script on the downloaded data file.
    • spark_analysis: Runs Spark jobs on the transformed data for analysis and stores the resulting CSV file in AWS S3.
    • clear_temp_files: Clears the temporary files created during the ETL process.
    • send_email: Sends an email to the user with the DAG run status.
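
A minimal sketch of how these six tasks might be wired together. The choice of operators, the placeholder callable, and the email address are assumptions for illustration, not the repository's actual implementation:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator

def _placeholder():
    """Stands in for each step's real logic, which lives in the repository."""

with DAG(dag_id="hltv_news_etl", schedule_interval="0 0 * * *",
         start_date=datetime(2023, 1, 1), catchup=False) as dag:
    extract_hltv_news = PythonOperator(task_id="extract_hltv_news", python_callable=_placeholder)
    check_downloaded_file = PythonOperator(task_id="check_downloaded_file", python_callable=_placeholder)
    run_transform = PythonOperator(task_id="run_transform", python_callable=_placeholder)
    spark_analysis = PythonOperator(task_id="spark_analysis", python_callable=_placeholder)
    clear_temp_files = PythonOperator(task_id="clear_temp_files", python_callable=_placeholder)
    send_email = EmailOperator(
        task_id="send_email",
        to="user@example.com",  # placeholder address
        subject="HLTV News ETL: DAG run status",
        html_content="The daily HLTV news DAG run has finished.",
    )

    # Linear chain, matching the task order in the list above.
    (extract_hltv_news >> check_downloaded_file >> run_transform
        >> spark_analysis >> clear_temp_files >> send_email)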

🌐 | Setup

  • I am running the project on an AWS EC2 t3.medium instance with Ubuntu 22.04 LTS.
  • Install Apache Airflow & Apache Spark on the EC2 instance.
  • Python version used: 3.10.12 | Java version used: openjdk 11.0.20.1
  • Install the necessary dependencies for the project (both pip and npm).
  • Start Airflow and Spark services using the following commands:
[Airflow]
$ airflow standalone

[Spark]
$ export SPARK_HOME=/path/to/your/spark/directory
$ $SPARK_HOME/sbin/start-master.sh
$ $SPARK_HOME/sbin/start-worker.sh spark://<HOST_IP>:7077
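
Once the master and worker are up, a job can be submitted to the standalone cluster along these lines; the script path and the hadoop-aws package version are illustrative assumptions:

[Submit the analysis job to the cluster]
$ $SPARK_HOME/bin/spark-submit \
    --master spark://<HOST_IP>:7077 \
    --packages org.apache.hadoop:hadoop-aws:3.3.4 \
    /path/to/spark_analysis.py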

💻 | Analysis

  • The analysis is done using Spark SQL.
  • The visualizations are done using Matplotlib and Seaborn.
  • Here is some of the analysis done on the data:
[Charts: Avg Comments by Country, Max Comments by Country, Descriptive Analysis, Total Articles by Country]
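
To illustrate the kind of Spark SQL behind these charts, queries along the following lines would produce such aggregations. The input path and the country/comments column names are assumptions about the transformed schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hltv_news_analysis").getOrCreate()

# Assumed shape: one row per article, with `country` and `comments` columns.
df = spark.read.csv("/tmp/hltv_news_transformed.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("news")

avg_comments = spark.sql(
    "SELECT country, AVG(comments) AS avg_comments "
    "FROM news GROUP BY country ORDER BY avg_comments DESC"
)
total_articles = spark.sql(
    "SELECT country, COUNT(*) AS total_articles "
    "FROM news GROUP BY country ORDER BY total_articles DESC"
)

# Results can be written as CSV (locally or to an s3a:// path) and then
# plotted with Matplotlib/Seaborn.
avg_comments.write.mode("overwrite").csv("/tmp/analysis/avg_comments", header=True)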

🍻 | Contributing

Contributions, issues and feature requests are welcome.
Feel free to check the issues page if you want to contribute.


🧑🏽 | Author

Kaustav Mukhopadhyay


🙌 | Show your support

Drop a ⭐️ if this project helped you!


📝 | License

Copyright © 2023 Kaustav Mukhopadhyay.
This project is MIT licensed.

