DATE: May 02, 2023

Version | Authors | Date | Description |
---|---|---|---|
1.0 | Vasanth Nair, Daniel Marinescu | May 02, 2023 | |
The objective of this project is to construct a scalable, efficient ETL pipeline that extracts news data from the Mediastack API incrementally. The project requires a solution capable of handling voluminous data while guaranteeing data accuracy and integrity. To achieve this, a Kafka producer runs as an AWS ECS service, reading data from the Mediastack API and pushing it to a Kafka topic hosted in Confluent Cloud. This topic is consumed by a Spark Streaming Kafka consumer that loads the data into Delta tables in Databricks, completing the Extract and Load steps of ELT. The raw data is then transformed through the medallion architecture steps of Bronze, Silver, and Gold, with data quality during each transformation step ensured by the Great Expectations library. Data modeling techniques such as dimensional modeling and one big table (OBT) are applied to the transformed data, and PowerBI serves as the semantic layer exposing the transformed data in Databricks. The entire solution is hosted in the cloud (AWS, Confluent Cloud, and Databricks), providing scalability, robustness, and reliability.
The team used a variety of tools in this project, including Databricks, Kafka, Python, Git, Docker, AWS, Confluent Cloud, Great Expectations, PowerBI, and Visual Studio.
- **Python** was used for developing custom scripts to perform data transformations and manipulation. Its powerful libraries, such as Pandas and NumPy, were utilized for data manipulation.
- **Git** was used for version control to manage the codebase and collaborate with other team members.
- **AWS** was used as the cloud platform to host the applications, store data, and leverage various services for data hosting.
- **Databricks** was used to read from Kafka and perform transformations as the data moves from the bronze to the silver to the gold layer. Great Expectations tests were also run within this framework.
- **PowerBI** was used for the semantic and reporting layer of the project, illustrating data visualizations and presenting metrics.
- **Kafka** was used for the extract and load steps of the pipeline, moving Mediastack data into Delta tables in Databricks.
- **Docker** was used to create isolated environments for development, testing, and production, allowing for easy and efficient deployment of the applications.
- **Visual Studio** was used as the integrated development environment (IDE) to write and debug code, as well as to collaborate with other team members.
This live and incremental data pipeline solution enables news data to be processed and delivered to consumers as quickly as possible. By utilizing real-time data processing, breaking news can be continuously ingested and transformed, ensuring that the latest developments are always available to news consumers with minimal delay. This pipeline solution also allows for the seamless addition of new data sources, ensuring that the system is scalable and can handle large volumes of news data. The ultimate objective is to create a reliable and high-performing news processing system that empowers consumers to stay informed and make knowledgeable decisions based on the most up-to-date news available.
News travels fast. We would like to create a news/article ETL pipeline that provides real-time information to consumers about breaking world events and other areas of news interest. The consumers of the data would be average daily news consumers.
- An AWS EC2 instance boots up and downloads the news_etl pipeline's Kafka Producer app Docker image from AWS ECR.
- The EC2 instance runs this image in a Docker container.
- The Docker container reads an ENV file from an AWS S3 bucket.
- It sets the file's contents as environment variables, making them available to the ETL program at runtime.
- At the scheduled time, the ECS cron job kicks in and starts the ETL pipeline.
- The Python Kafka Producer news_etl pipeline makes a REST API call to the Mediastack API to get breaking news data.
- The response data from the API call is posted to a Kafka topic hosted in Confluent Cloud, later consumed by a streaming Databricks Kafka consumer, then transformed and enriched with source information in a Databricks workflow.
- In Databricks, the mediastack_headlines landing table is read and the delta-load workflow executes as follows:
- A delta article table is generated, containing only new article records (selected by excluding articles already present in the existing bronze table).
- A deduplicated list of sources from the delta article table is used to hit the sources API and extract the delta_sources table.
- Using the existing bronze_articles, only new sources are added to bronze_sources, including, where applicable, any new source-specific fields from delta_sources.
- The updated bronze_articles and bronze_sources tables are enriched into silver_articles and silver_sources by adding country names and language names, parsing dates, aggregating article counts by source, and adding useful booleans (e.g., has_shock_value).
- Both silver tables are transformed into gold tables for articles and sources; this step is mainly column renaming.
- An OBT (one big table) is formed from the single gold fact table (articles) and the three gold dimension tables (sources, countries, languages).
- At each transition between bronze -> silver -> gold -> OBT, QC tests are performed via Great Expectations to ensure outputs remain consistent throughout the data flow.
- The final transformed and enriched gold dataset and OBT are brought into a Power BI dashboard, where the data is dimensionally modeled and exposed as a semantic layer to show the latest available headlines by news category, along with basic statistics and trends on the full news dataset.
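In Databricks, the delta-selection step above (keeping only articles not yet in the bronze table) would typically be an anti-join between Delta tables. The following is a minimal pure-Python sketch of the same logic; the `url` key used for uniqueness is an assumption, and the real pipeline may key on a different column:

```python
def select_new_articles(landing_articles, bronze_keys):
    """Return only the articles whose key is not already in the bronze table.

    landing_articles: list of dicts from the mediastack_headlines landing table.
    bronze_keys: set of keys already present in bronze_articles.
    NOTE: 'url' is a hypothetical unique key chosen for illustration.
    """
    return [a for a in landing_articles if a.get("url") not in bronze_keys]


landing = [
    {"url": "https://example.com/a", "title": "Article A"},
    {"url": "https://example.com/b", "title": "Article B"},
]
existing = {"https://example.com/a"}  # keys already in bronze_articles

# Only Article B is new, so only it lands in the delta article table.
delta = select_new_articles(landing, existing)
```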
This project requires the following software, packages, and tools:
- Python 3.8.16 to write the NEWS_ETL pipeline code
- Docker to containerize the NEWS_ETL Kafka Producer portion of the ETL pipeline application
- AWS to host the Kafka Producer portion of the NEWS_ETL pipeline application
- Confluent Cloud to act as the Kafka broker hosting the topic
- A Databricks Streaming Job to act as a Kafka consumer, reading from the Kafka topic and writing to a Delta table
- A Databricks Workflow to transform the raw Mediastack data landed in Delta tables through the Bronze, Silver, and Gold layers
- PowerBI to act as the semantic layer exposing business metrics based on the transformed data
Below are the installation steps for setting up the news_etl ETL app.
- Get a paid API key at https://mediastack.com/product
- Clone the repo:
  ```sh
  git clone https://github.com/mddan/news_etl.git
  ```
- Install packages:
  ```sh
  pip install -r requirements.txt
  ```
- Create a `set_python_path.sh` / `set_python_path.bat` file in the `src/` folder with the following contents:

  Linux / Mac:
  ```sh
  #!/bin/bash
  export PYTHONPATH=`pwd`
  ```
  Windows:
  ```bat
  set PYTHONPATH=%cd%
  ```
- Create a `config.sh` / `config.bat` file in the `src/` folder with the following content:

  Linux / Mac:
  ```sh
  export KAFKA_BOOTSTRAP_SERVERS=<YOUR_KAFKA_BOOTSTRAP_SERVER>
  export KAFKA_SASL_USERNAME=<YOUR_KAFKA_USERNAME>
  export KAFKA_SASL_PASSWORD=<YOUR_KAFKA_PASSWORD>
  export KAFKA_TOPIC=<YOUR_KAFKA_TOPIC>
  export MEDIASTACK_ACCESS_KEY=<YOUR_MEDIASTACK_API_ACCESS_KEY>
  ```
  Windows:
  ```bat
  SET KAFKA_BOOTSTRAP_SERVERS=<YOUR_KAFKA_BOOTSTRAP_SERVER>
  SET KAFKA_SASL_USERNAME=<YOUR_KAFKA_USERNAME>
  SET KAFKA_SASL_PASSWORD=<YOUR_KAFKA_PASSWORD>
  SET KAFKA_TOPIC=<YOUR_KAFKA_TOPIC>
  SET MEDIASTACK_ACCESS_KEY=<YOUR_MEDIASTACK_API_ACCESS_KEY>
  ```
- Create a `.env` file in the root project folder with the below contents:
  ```
  KAFKA_BOOTSTRAP_SERVERS=<YOUR_KAFKA_BOOTSTRAP_SERVER>
  KAFKA_SASL_USERNAME=<YOUR_KAFKA_USERNAME>
  KAFKA_SASL_PASSWORD=<YOUR_KAFKA_PASSWORD>
  KAFKA_TOPIC=<YOUR_KAFKA_TOPIC>
  MEDIASTACK_ACCESS_KEY=<YOUR_MEDIASTACK_API_ACCESS_KEY>
  ```
- `cd` into the `src/` folder
- Run `. ./set_python_path.sh` (Linux/Mac) or `set_python_path.bat` (Windows) to set `PYTHONPATH`
- Run `config.sh` / `config.bat` to set the additional environment variables needed to connect to the Mediastack API and the Kafka topic
- From the `src/` folder, run `python news_etl/producer/mediastack_kafka_producer.py` to run the Kafka Producer code locally.
- Alternatively, instead of running the preceding steps, you can run the Kafka producer pipeline in a Docker container as follows:
  - From the root folder containing the `Dockerfile`, run `docker build -t news_etl:1.0 .` to create a Docker image for the News ETL pipeline's Kafka Producer component.
  - Run a container using the image with `docker run --env-file=.env news_etl:1.0` to see the Kafka Producer part of the ETL pipeline in action.
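For reference, a Dockerfile for a containerized producer like this might look as follows. This is a hedged sketch, not the repo's actual Dockerfile: the base image, paths, and entrypoint script location are assumptions based on the Python version and file layout mentioned in this README.

```
# Hypothetical Dockerfile sketch -- base image and paths are assumptions.
FROM python:3.8-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Mirror the set_python_path.sh step inside the container.
ENV PYTHONPATH=/app/src

CMD ["python", "src/news_etl/producer/mediastack_kafka_producer.py"]
```

Environment variables (Kafka credentials, Mediastack key) are injected at run time via `--env-file=.env`, as shown above, rather than baked into the image.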
- Create IAM roles as shown in the image.
- Upload the `.env` file containing the Mediastack API key and Kafka connection details to an AWS S3 bucket.
- Create a Dockerfile and upload the Docker image to AWS ECR.
- Create a cron schedule in AWS ECS to run the Kafka producer pipeline on a recurring schedule.
- Query the Mediastack API for the latest news and push it to a Kafka topic hosted in Confluent Cloud.
- A Databricks Streaming Kafka Consumer reads the latest offsets from the Kafka topic and writes to the raw landing Delta table.
- A Databricks workflow transforms the data and enriches it with source information through the medallion architecture layers of Bronze, Silver, and Gold.
- The transformed data is modeled into fact and dimension tables and tested for data quality using Great Expectations.
- The dimensional data is exposed as a semantic layer using PowerBI.
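The bronze-to-silver enrichment runs as Spark transformations in the Databricks workflow; the sketch below shows the per-record logic in plain Python. The lookup tables, field names, and the has_shock_value keyword heuristic are all illustrative assumptions, not the pipeline's actual implementation:

```python
from datetime import datetime

# Illustrative lookup tables -- the real pipeline would join dimension tables.
COUNTRY_NAMES = {"us": "United States", "gb": "United Kingdom"}
LANGUAGE_NAMES = {"en": "English", "fr": "French"}
SHOCK_WORDS = {"breaking", "shock", "crisis"}  # hypothetical heuristic


def enrich_article(raw):
    """Bronze -> silver enrichment for one article record:
    country/language name lookup, date parsing, and a boolean flag."""
    return {
        **raw,
        "country_name": COUNTRY_NAMES.get(raw["country"], "Unknown"),
        "language_name": LANGUAGE_NAMES.get(raw["language"], "Unknown"),
        "published_date": datetime.fromisoformat(raw["published_at"]).date().isoformat(),
        "has_shock_value": any(w in raw["title"].lower() for w in SHOCK_WORDS),
    }


silver = enrich_article({
    "title": "Breaking: markets move",
    "country": "us",
    "language": "en",
    "published_at": "2023-05-02T10:15:00",
})
```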
An ETL data pipeline solution is essential for collecting, transforming, and loading data from various sources into a centralized repository. The pipeline benefits consumers of news in several ways. It centralizes the data for easy accessibility, standardizes and ensures data quality, consistency, and accuracy, and automates the process of data transformation and scaling up as the data volume grows. Consequently, data consumers can quickly access high-quality, consistent, and easily accessible data for making informed decisions.
- Data extraction and loading:
  - Set up API access for data extraction
  - Retrieve the news data from the Mediastack API using a suitable extraction method (API calls)
  - Set up the Kafka Producer and Consumer to incrementally extract and load data into Databricks Delta tables
- Data transformation:
  - Clean the raw data to ensure it is in the desired format (e.g., removing duplicates, handling missing values, etc.)
  - Use the following transformation techniques: renaming columns, joining, grouping, typecasting, filtering, sorting, and aggregating
  - Transform the data into a structured format (e.g., converting to a tabular form or creating a data model)
  - Expose this dimensionally modeled data as a semantic layer to PowerBI
- Create a data pipeline:
  - Build a Docker image using a Dockerfile
  - Test that the Docker container runs locally
- Incremental extraction and loading:
  - The Kafka producer regularly extracts newly available news data from the API and updates the Kafka topic with the latest information
  - Ensure that the Kafka Consumer (the Databricks Streaming app) always reads from the latest checkpoints and lands the latest data in the Databricks Delta table
- Implement Great Expectations tests:
  - Write Great Expectations tests for the data transformation layer in the Databricks workflow
- Cloud hosting:
  - Host the Kafka Producer on AWS and the Kafka topic in Confluent Cloud
  - Use AWS services (e.g., EC2, S3, ECR, ECS, etc.) to ensure the robustness and reliability of the Kafka Producer pipeline
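Great Expectations expresses such checks as named expectations (for example, `expect_column_values_to_not_be_null`). The plain-Python sketch below mimics two such checks on a small batch purely to illustrate the idea; it is not the Great Expectations API, and the column names are illustrative:

```python
def expect_no_nulls(records, column):
    """Mimics expect_column_values_to_not_be_null: every record must
    have a non-null value in the given column."""
    failures = [r for r in records if r.get(column) is None]
    return {"success": not failures, "unexpected_count": len(failures)}


def expect_unique(records, column):
    """Mimics expect_column_values_to_be_unique: no duplicate values
    in the given column."""
    values = [r[column] for r in records]
    return {"success": len(values) == len(set(values))}


# A tiny batch standing in for a bronze/silver table.
batch = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
]
null_check = expect_no_nulls(batch, "title")
unique_check = expect_unique(batch, "url")
```

In the actual workflow these checks would run against the Delta tables at each bronze -> silver -> gold -> OBT transition, failing the job when an expectation is not met.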
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See `LICENSE.txt` for more information.
Vasanth Nair - @Linkedin
Daniel Marinescu - @Linkedin
Project Link: https://github.com/mddan/news_etl