News ETL CAPSTONE project for Data Engineering Bootcamp


Project Name: NEWS ETL Project

Team Name: GROUP 6

Document Version 1

Date: May 02, 2023

Revision History

Version   Authors                          Date           Description
1.0       Vasanth Nair, Daniel Marinescu   May 02, 2023

Index

Table of Contents
  1. About The Project
  2. Goals
  3. Project Context
  4. Architecture
  5. Getting Started
  6. Usage
  7. Roadmap
  8. Contributing
  9. License
  10. Contact
  11. Acknowledgments

About The Project

The objective of this project is to construct a scalable and efficient ETL pipeline that extracts news data from the Mediastack API in an incremental manner. The project requires a solution capable of handling voluminous data while guaranteeing its accuracy and integrity. To achieve this, we use a Kafka producer running as an AWS ECS service to read data from the Mediastack API and push it to a Kafka topic hosted in Confluent Cloud. This topic is then consumed by a Spark Streaming Kafka consumer that loads the data into delta tables in Databricks, completing the Extract and Load steps of ELT. Subsequently, the raw data is transformed through the medallion architecture layers of Bronze, Silver, and Gold, with data quality during the transformation steps ensured by the Great Expectations library. Data modeling techniques such as dimensional modeling and one big table (OBT) are applied to the transformed data, and PowerBI is used as the semantic layer to expose the transformed data in Databricks. The entire solution is hosted in the cloud (AWS, Confluent Cloud, and Databricks), providing scalability, robustness, and reliability.

(back to top)

Built With

The team used a variety of tools in this project, including Databricks, Kafka, Python, Git, Docker, AWS, Confluent Cloud, Great Expectations, PowerBI, and Visual Studio Code.

  • Python -- Python was used for developing custom scripts to perform data transformations and manipulation. Its powerful libraries, such as Pandas and NumPy, were utilized for data manipulation.

  • Git -- Git was used for version control to manage the codebase and collaborate with other team members.

  • AWS -- AWS was used as the cloud platform to host the applications and to provide storage and other supporting services for data hosting.

  • Databricks -- Databricks was used to read from Kafka and to perform transformations as the data moves through the bronze, silver, and gold layers. Great Expectations tests were also run within this framework.

  • PowerBI -- PowerBI was used for the semantic, reporting layer of the project to illustrate data visualization and present metrics.

  • Apache Kafka -- Kafka was used for the extract and load stages of the ETL pipeline, moving Mediastack data into a delta table in Databricks.

  • Docker -- Docker was used to create isolated environments for development, testing, and production, allowing for easy and efficient deployment of the applications.

  • Visual Studio Code -- Visual Studio Code was used as the integrated development environment (IDE) to write and debug code, as well as to collaborate with other team members.

(back to top)

Goals

This live and incremental data pipeline solution enables news data to be processed and delivered to consumers as quickly as possible. By utilizing real-time data processing, breaking news can be continuously ingested and transformed, ensuring that the latest developments are always available to news consumers with minimal delay. This pipeline solution also allows for the seamless addition of new data sources, ensuring that the system is scalable and can handle large volumes of news data. The ultimate objective is to create a reliable and high-performing news processing system that empowers consumers to stay informed and make knowledgeable decisions based on the most up-to-date news available.

(back to top)

Project Context

News travels fast. We would like to create a news/article ETL pipeline that provides real-time information to consumers about breaking world events and other areas of news interest. The consumers of the data would be average daily news readers.

(back to top)

Architecture

[Architecture diagram]

ETL Pipeline Steps

  1. An AWS EC2 Instance boots up and downloads the news_etl pipeline's Kafka Producer app docker image from AWS ECR.
  2. The AWS EC2 instance runs this Docker image in a container.
  3. The Docker container reads the .env file from an AWS S3 bucket.
  4. The contents that are read are set as environment variables, making them available to the ETL program at runtime.
  5. At the scheduled time, the ECS cron job kicks in and starts the ETL pipeline.
  6. The Python Kafka Producer news_etl pipeline makes a REST API call to the Mediastack API to get breaking news data.
  7. Response data from the API call is posted to a Kafka topic hosted in Confluent Cloud, which is later consumed by a streaming Databricks Kafka Consumer, transformed, and enriched with source information in a Databricks workflow.
  8. In Databricks, the mediastack_headlines landing table is read and the delta load workflow executes as follows:
  9. A delta article table is generated (containing only new article records, selected as articles not yet present in the existing bronze table).
  10. A deduped list of sources from the delta article table is used to hit the sources API and extract the delta_sources table.
  11. Using the existing bronze_articles, only new sources are added to bronze_sources, including, where applicable, any new source-specific fields from delta_sources.
  12. The updated bronze_articles and bronze_sources tables are enriched into silver_articles and silver_sources by adding country names and language names, parsing dates, aggregating article counts by source, and adding useful boolean flags (e.g., has_shock_value).
  13. Both silver tables are transformed into gold tables for articles and sources; this step is mainly column renaming.
  14. An OBT (one big table) is formed from the single gold fact table (articles) and the three gold dimension tables (sources, countries, languages); a rough PySpark sketch of steps 12-14 appears after this list.
  15. At each transition between bronze -> silver -> gold -> OBT, QC tests are performed via Great Expectations to ensure outputs remain as expected throughout the data flow.
  16. The final transformed and enriched gold dataset and OBT table are brought into a PowerBI dashboard, where the data is dimensionally modeled and exposed as a semantic layer to illustrate the latest available headlines by news category, along with basic statistics and trends on the full available news data.
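As a rough illustration of steps 12-14, the PySpark sketch below shows one possible silver enrichment and the final OBT join, assuming it runs in a Databricks notebook (where spark is already provided). The column names (published_at, title, source, country_code, language_code) and the obt_news table name are illustrative, not the project's actual schema.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()  # already provided in Databricks

    bronze_articles = spark.table("bronze_articles")

    # Bronze -> Silver: parse dates and add a boolean flag (columns are illustrative).
    silver_articles = (
        bronze_articles
        .withColumn("published_date", F.to_date("published_at"))
        .withColumn("has_shock_value", F.col("title").rlike("(?i)breaking|shock"))
    )
    silver_articles.write.mode("overwrite").saveAsTable("silver_articles")

    # Silver -> Gold: mainly renaming.
    gold_articles = silver_articles.withColumnRenamed("source", "source_key")
    gold_articles.write.mode("overwrite").saveAsTable("gold_articles")

    # Gold -> OBT: join the articles fact to the sources/countries/languages dims.
    obt = (
        spark.table("gold_articles")
        .join(spark.table("gold_sources"), "source_key", "left")
        .join(spark.table("gold_countries"), "country_code", "left")
        .join(spark.table("gold_languages"), "language_code", "left")
    )
    obt.write.mode("overwrite").saveAsTable("obt_news")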

Getting Started

Prerequisites

This project requires the following software, packages, and tools.

  1. Python 3.8.16 to write the NEWS_ETL ETL pipeline code
  2. Docker to containerize the NEWS_ETL Kafka Producer portion of the ETL pipeline application
  3. AWS to host KAFKA Producer Portion of the NEWS_ETL ETL Pipeline application
  4. Confluent Cloud acts as Kafka Broker hosting the topic
  5. Databricks Streaming Job acts as a Kafka Consumer reading from Kafka topic to write to delta table
  6. Databricks Workflow to transform raw mediastack data landed in delta tables to Bronze, Silver and Gold Layers
  7. PowerBI acts as Semantic Layer to expose the business metrics based on transformed data.

Installation

Below are the installation steps for setting up the news_etl ETL app.

  1. Get a Paid API Key at https://mediastack.com/product

  2. Clone the repo

    git clone https://github.com/mddan/news_etl.git
  3. Install packages

    pip install -r requirements.txt
  4. Create a set_python_path.sh / set_python_path.bat file in the src/ folder with the following contents

    Linux / Mac

    #!/bin/bash
    
    export PYTHONPATH=`pwd`
    

    Windows

    set PYTHONPATH=%cd%
    
  5. Create a config.sh / config.bat file in the src/ folder with the following content

    Linux / Mac

    export KAFKA_BOOTSTRAP_SERVERS=<YOUR_KAFKA_BOOTSTRAP_SERVER>
    export KAFKA_SASL_USERNAME=<YOUR_KAFKA_USERNAME>
    export KAFKA_SASL_PASSWORD=<YOUR_KAFKA_PASSWORD>
    export KAFKA_TOPIC=<YOUR_KAFKA_TOPIC>
    export MEDIASTACK_ACCESS_KEY=<YOUR_MEDIASTACK_API_ACCESS_KEY>
    

    Windows

    SET KAFKA_BOOTSTRAP_SERVERS=<YOUR_KAFKA_BOOTSTRAP_SERVER>
    SET KAFKA_SASL_USERNAME=<YOUR_KAFKA_USERNAME>
    SET KAFKA_SASL_PASSWORD=<YOUR_KAFKA_PASSWORD>
    SET KAFKA_TOPIC=<YOUR_KAFKA_TOPIC>
    SET MEDIASTACK_ACCESS_KEY=<YOUR_MEDIASTACK_API_ACCESS_KEY>
    
  6. Create a .env file in the root project folder with the contents below

KAFKA_BOOTSTRAP_SERVERS=<YOUR_KAFKA_BOOTSTRAP_SERVER>
KAFKA_SASL_USERNAME=<YOUR_KAFKA_USERNAME>
KAFKA_SASL_PASSWORD=<YOUR_KAFKA_PASSWORD>
KAFKA_TOPIC=<YOUR_KAFKA_TOPIC>
MEDIASTACK_ACCESS_KEY=<YOUR_MEDIASTACK_API_ACCESS_KEY>
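For reference, a minimal way to pick these values up at runtime is shown below, assuming the python-dotenv package is installed; the actual pipeline may load its configuration differently (for example via config.sh or the S3-hosted .env described later).

    import os

    from dotenv import load_dotenv

    # Read KEY=VALUE pairs from .env in the current working directory into the
    # process environment (existing variables are not overwritten).
    load_dotenv()

    kafka_bootstrap = os.environ["KAFKA_BOOTSTRAP_SERVERS"]
    kafka_topic = os.environ["KAFKA_TOPIC"]
    mediastack_key = os.environ["MEDIASTACK_ACCESS_KEY"]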

Running Locally and in a Docker Container

Steps

  1. cd into the src/ folder
  2. Run . ./set_python_path.sh (Linux/Mac) or set_python_path.bat (Windows) to set PYTHONPATH
  3. Run . ./config.sh (Linux/Mac) or config.bat (Windows) to set the additional environment variables needed to connect to the Mediastack API and the Kafka topic
  4. Make sure you are back in the src/ folder
  5. Run python news_etl/producer/mediastack_kafka_producer.py to run the Kafka Producer code locally (a minimal illustrative sketch of this producer appears after these steps)
  6. Alternatively, instead of running steps 3 through 5, you can run the Kafka Producer pipeline in a Docker container as follows.
  7. From the root folder containing the Dockerfile, run docker build -t news_etl:1.0 . to create a Docker image for the News ETL pipeline's Kafka Producer component
  8. Run a container from that image with docker run --env-file=.env news_etl:1.0 to see the Kafka Producer part of the ETL pipeline in action
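For orientation, the snippet below is a minimal, illustrative sketch of what a Mediastack-to-Kafka producer such as mediastack_kafka_producer.py might do; it is not the project's actual module. The endpoint URL, request parameters, and message key are assumptions, while the environment variable names match those in the config/.env files above.

    import json
    import os

    import requests
    from confluent_kafka import Producer

    # Confluent Cloud connection settings come from environment variables.
    producer = Producer({
        "bootstrap.servers": os.environ["KAFKA_BOOTSTRAP_SERVERS"],
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": os.environ["KAFKA_SASL_USERNAME"],
        "sasl.password": os.environ["KAFKA_SASL_PASSWORD"],
    })

    # Pull a batch of headlines from Mediastack (endpoint/params are illustrative).
    response = requests.get(
        "http://api.mediastack.com/v1/news",
        params={"access_key": os.environ["MEDIASTACK_ACCESS_KEY"], "limit": 100},
        timeout=30,
    )
    response.raise_for_status()

    # Publish each article as a JSON message, keyed by URL (assumed key choice).
    for article in response.json().get("data", []):
        producer.produce(
            os.environ["KAFKA_TOPIC"],
            key=article.get("url"),
            value=json.dumps(article),
        )
    producer.flush()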

Running in AWS Cloud - Setup

  1. Create IAM roles as shown in the IAM Roles screenshot below.
  2. Upload the .env file containing the Mediastack API key and Kafka connection details to an AWS S3 bucket (a hedged sketch of reading this file from S3 appears after this list).
  3. Create the Dockerfile, build the image, and upload the Docker image to AWS ECR.
  4. Create a cron schedule in AWS ECS to run the Kafka Producer pipeline on a recurring schedule.
  5. The scheduled task queries the Mediastack API for the latest news and pushes it to the Kafka topic hosted in Confluent Cloud.
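The sketch below shows one hypothetical way the container could fetch the .env file from S3 at startup (steps 3-4 of the pipeline description) and export its contents as environment variables. The bucket and key names are placeholders, not the project's actual values.

    import os

    import boto3

    # Download the .env object from S3 (bucket/key are placeholders).
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-news-etl-config", Key="news_etl/.env")

    # Parse simple KEY=VALUE lines and export them for the ETL process.
    for line in obj["Body"].read().decode("utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()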

Running in Databricks Workflow - Setup

  1. A Databricks Streaming Kafka Consumer reads the latest offsets from the Kafka topic and writes to the raw_landing delta table (a rough sketch of this streaming read appears after this list).
  2. A Databricks workflow transforms the data and enriches source information across the medallion architecture layers of Bronze, Silver, and Gold.
  3. The transformed data is modeled into fact and dimension tables and tested for data quality using Great Expectations.
  4. The dimensional data is exposed as a semantic layer using PowerBI.
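Below is a rough sketch of step 1, under the assumption that it runs in a Databricks notebook (where spark is already provided) and authenticates to Confluent Cloud with SASL/PLAIN; the checkpoint path and column selection are illustrative.

    import os

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # already provided in Databricks

    # SASL/PLAIN credentials for Confluent Cloud, taken from the environment.
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{os.environ["KAFKA_SASL_USERNAME"]}" '
        f'password="{os.environ["KAFKA_SASL_PASSWORD"]}";'
    )

    raw_stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", os.environ["KAFKA_BOOTSTRAP_SERVERS"])
        .option("subscribe", os.environ["KAFKA_TOPIC"])
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.sasl.jaas.config", jaas)
        .option("startingOffsets", "latest")
        .load()
        .selectExpr("CAST(value AS STRING) AS json_payload", "timestamp")
    )

    # Append the raw messages to the landing Delta table; the checkpoint path
    # is a placeholder.
    query = (
        raw_stream.writeStream
        .format("delta")
        .option("checkpointLocation", "/tmp/checkpoints/raw_landing")
        .toTable("raw_landing")
    )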

Screenshots of Components Used

IAM Roles Used


Env File in S3 Bucket


ECR hosting News ETL Kafka Producer Docker Image


Scheduled Task in ECS


Screenshots of Raw Mediastack Datasets landed in Databricks Delta Table


Screenshot of Databricks Delta Load Workflow


Screenshot of PowerBI Headlines Dashboard


(back to top)

Usage

An ETL data pipeline solution is essential for collecting, transforming, and loading data from various sources into a centralized repository. The pipeline benefits consumers of news in several ways. It centralizes the data for easy accessibility, standardizes and ensures data quality, consistency, and accuracy, and automates the process of data transformation and scaling up as the data volume grows. Consequently, data consumers can quickly access high-quality, consistent, and easily accessible data for making informed decisions.

(back to top)

Roadmap

  • Data extraction and loading:
    • Set up API for data extraction
    • Retrieve the News Data from the Mediastack API using a suitable extraction method (API calls)
    • Set up Kafka Producer and Consumer to incrementally extract and load data to Databricks Delta tables
  • Data transformation:
    • Clean the raw data to ensure it is in the desired format (e.g., removing duplicates, handling missing values, etc.).
    • Use the following transformation techniques: renaming columns, joining, grouping, typecasting, data filtering, sorting, and aggregating
    • Transform the data into a structured format (e.g., converting to a tabular form or creating a data model).
    • Expose this dimensionally modeled data as a semantic layer in PowerBI
  • Create a data Pipeline
    • Build a Docker image using a Dockerfile
    • Test that the Docker container is running locally
  • Incremental extraction and loading:
    • The Kafka Producer regularly extracts newly available news data from the API and updates the Kafka topic with the latest information.
    • Ensure that the Kafka Consumer (the Databricks Streaming App) always reads from the latest checkpoints and lands the latest data in the Databricks delta table
  • Implement Great Expectations tests
    • Write Great Expectations tests for the data transformation layer in the Databricks workflow (an illustrative check appears after this roadmap).
  • Cloud Hosting :
    • Host the Kafka Producer on AWS and the Kafka topic in Confluent Cloud
    • Use AWS services (e.g., EC2, S3, ECR, ECS) to ensure the robustness and reliability of the Kafka Producer pipeline.
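As an illustration of the kind of check intended above, the sketch below validates a silver table with the legacy Great Expectations dataset API; the column names and expectations are assumptions, not the project's actual suite, and spark is assumed to be the Databricks-provided session.

    from great_expectations.dataset import SparkDFDataset
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # already provided in Databricks

    # Wrap the silver table so expectations can be declared against it
    # (column names are illustrative).
    checked = SparkDFDataset(spark.table("silver_articles"))

    checked.expect_column_values_to_not_be_null("title")
    checked.expect_column_values_to_not_be_null("published_date")
    checked.expect_column_values_to_be_unique("url")

    # Fail the workflow task if any expectation is not met.
    results = checked.validate()
    if not results.success:
        raise ValueError(f"Great Expectations validation failed: {results}")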

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

Vasanth Nair - @Linkedin

Daniel Marinescu - @Linkedin

Project Link: https://github.com/mddan/news_etl

(back to top)

Acknowledgments

Use this space to list resources you find helpful and would like to give credit to.

(back to top)
