DATE: May 02, 2023

Version | Authors | Date | Description |
---|---|---|---|
1.0 | Vasanth Nair, Daniel Marinescu | May 02, 2023 | |
The objective of this project is to construct a scalable, efficient ETL pipeline that extracts news data from the Mediastack API incrementally. The project requires a solution capable of handling voluminous data while guaranteeing data accuracy and integrity. To achieve this, a Kafka producer runs as an AWS ECS service, reading data from the Mediastack API and pushing it to a Kafka topic hosted in Confluent Cloud. This topic is consumed by a Spark Streaming Kafka consumer that loads the data into Delta tables in Databricks, completing the Extract and Load steps of ELT. The raw data is then transformed through the medallion architecture steps of Bronze, Silver, and Gold, with data quality during each transformation step ensured by the Great Expectations library. Data modeling techniques such as dimensional modeling and one big table (OBT) are applied to the transformed data, and PowerBI serves as the semantic layer exposing the transformed data in Databricks. The entire solution is hosted in the cloud (AWS, Confluent Cloud, and Databricks), providing scalability, robustness, and reliability.
The team used a variety of tools in this project, including Databricks, Kafka, Python, Git, Docker, AWS, Confluent Cloud, Great Expectations, PowerBI, and Visual Studio.
- **Python** was used for developing custom scripts to perform data transformations and manipulation. Its powerful libraries, such as Pandas and NumPy, were utilized for data manipulation.
- **Git** was used for version control to manage the codebase and collaborate with other team members.
- **AWS** was used as the cloud platform to host the applications, store data, and leverage various services for data hosting.
- **Databricks** was used to read from Kafka and perform transformations as the data moves from the bronze to the silver to the gold layer. Great Expectations tests were also run within this framework.
- **PowerBI** was used for the semantic and reporting layer of the project, illustrating data visualizations and presenting metrics.
- **Kafka** was used for the extract and load steps of the pipeline, moving Mediastack data into Delta tables in Databricks.
- **Docker** was used to create isolated environments for development, testing, and production, allowing for easy and efficient deployment of the applications.
- **Visual Studio** was used as the integrated development environment (IDE) to write and debug code, as well as to collaborate with other team members.
This live and incremental data pipeline solution enables news data to be processed and delivered to consumers as quickly as possible. By utilizing real-time data processing, breaking news can be continuously ingested and transformed, ensuring that the latest developments are always available to news consumers with minimal delay. This pipeline solution also allows for the seamless addition of new data sources, ensuring that the system is scalable and can handle large volumes of news data. The ultimate objective is to create a reliable and high-performing news processing system that empowers consumers to stay informed and make knowledgeable decisions based on the most up-to-date news available.
News travels fast. We would like to create a news/article ETL pipeline that provides real-time information to consumers about breaking world events and other areas of news interest. The consumers of the data would be average daily news consumers.
- An AWS EC2 instance boots up and downloads the news_etl pipeline's Kafka Producer app Docker image from AWS ECR.
- The EC2 instance runs this image in a Docker container.
- The Docker container reads an ENV file from an AWS S3 bucket.
- It sets the file's contents as environment variables, making them available to the ETL program at runtime.
- At the scheduled time, the ECS cron job kicks in and starts the ETL pipeline.
- The Python Kafka Producer news_etl pipeline makes a REST API call to the Mediastack API to get breaking news data.
- The response data from the API call is posted to a Kafka topic hosted in Confluent Cloud, later consumed by a streaming Databricks Kafka consumer, then transformed and enriched with source information in a Databricks workflow.
- In Databricks, the mediastack_headlines landing table is read and the delta-load workflow executes as follows:
- A delta article table is generated, containing only new article records (selected by excluding articles already present in the existing bronze table).
- A deduplicated list of sources from the delta article table is used to hit the sources API and extract the delta_sources table.
- Using the existing bronze_articles, only new sources are added to bronze_sources, including, where applicable, any new source-specific fields from delta_sources.
- The updated bronze_articles and bronze_sources tables are enriched into silver_articles and silver_sources by adding country names and language names, parsing dates, aggregating article counts by source, and adding useful booleans (e.g., has_shock_value).
- Both silver tables are transformed into gold tables for articles and sources; this step is mainly column renaming.
- An OBT (one big table) is formed from the single gold fact table (articles) and the three gold dimension tables (sources, countries, languages).
- At each transition between bronze -> silver -> gold -> OBT, QC tests are performed via Great Expectations to ensure outputs remain consistent throughout the data flow.
- The final transformed and enriched gold dataset and OBT are brought into a Power BI dashboard, where the data is dimensionally modeled and exposed as a semantic layer to show the latest available headlines by news category, along with basic statistics and trends on the full news dataset.
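In Databricks, the delta-selection step above (keeping only articles not yet in the bronze table) would typically be an anti-join between Delta tables. The following is a minimal pure-Python sketch of the same logic; the `url` key used for uniqueness is an assumption, and the real pipeline may key on a different column:

```python
def select_new_articles(landing_articles, bronze_keys):
    """Return only the articles whose key is not already in the bronze table.

    landing_articles: list of dicts from the mediastack_headlines landing table.
    bronze_keys: set of keys already present in bronze_articles.
    NOTE: 'url' is a hypothetical unique key chosen for illustration.
    """
    return [a for a in landing_articles if a.get("url") not in bronze_keys]


landing = [
    {"url": "https://example.com/a", "title": "Article A"},
    {"url": "https://example.com/b", "title": "Article B"},
]
existing = {"https://example.com/a"}  # keys already in bronze_articles

# Only Article B is new, so only it lands in the delta article table.
delta = select_new_articles(landing, existing)
```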
This project requires the following software, packages, and tools:
- Python 3.8.16 to write the NEWS_ETL pipeline code
- Docker to containerize the NEWS_ETL Kafka Producer portion of the ETL pipeline application
- AWS to host the Kafka Producer portion of the NEWS_ETL pipeline application
- Confluent Cloud to act as the Kafka broker hosting the topic
- A Databricks Streaming Job to act as a Kafka consumer, reading from the Kafka topic and writing to a Delta table
- A Databricks Workflow to transform the raw Mediastack data landed in Delta tables through the Bronze, Silver, and Gold layers
- PowerBI to act as the semantic layer exposing business metrics based on the transformed data
Below are the installation steps for setting up the news_etl ETL app.
- Get a paid API key at https://mediastack.com/product
- Clone the repo:
  ```sh
  git clone https://github.com/mddan/news_etl.git
  ```
- Install packages:
  ```sh
  pip install -r requirements.txt
  ```
- Create a `set_python_path.sh` / `set_python_path.bat` file in the `src/` folder with the following contents:

  Linux / Mac:
  ```sh
  #!/bin/bash
  export PYTHONPATH=`pwd`
  ```
  Windows:
  ```bat
  set PYTHONPATH=%cd%
  ```
- Create a `config.sh` / `config.bat` file in the `src/` folder with the following content:

  Linux / Mac:
  ```sh
  export KAFKA_BOOTSTRAP_SERVERS=<YOUR_KAFKA_BOOTSTRAP_SERVER>
  export KAFKA_SASL_USERNAME=<YOUR_KAFKA_USERNAME>
  export KAFKA_SASL_PASSWORD=<YOUR_KAFKA_PASSWORD>
  export KAFKA_TOPIC=<YOUR_KAFKA_TOPIC>
  export MEDIASTACK_ACCESS_KEY=<YOUR_MEDIASTACK_API_ACCESS_KEY>
  ```
  Windows:
  ```bat
  SET KAFKA_BOOTSTRAP_SERVERS=<YOUR_KAFKA_BOOTSTRAP_SERVER>
  SET KAFKA_SASL_USERNAME=<YOUR_KAFKA_USERNAME>
  SET KAFKA_SASL_PASSWORD=<YOUR_KAFKA_PASSWORD>
  SET KAFKA_TOPIC=<YOUR_KAFKA_TOPIC>
  SET MEDIASTACK_ACCESS_KEY=<YOUR_MEDIASTACK_API_ACCESS_KEY>
  ```
- Create a `.env` file in the root project folder with the below contents:
  ```
  KAFKA_BOOTSTRAP_SERVERS=<YOUR_KAFKA_BOOTSTRAP_SERVER>
  KAFKA_SASL_USERNAME=<YOUR_KAFKA_USERNAME>
  KAFKA_SASL_PASSWORD=<YOUR_KAFKA_PASSWORD>
  KAFKA_TOPIC=<YOUR_KAFKA_TOPIC>
  MEDIASTACK_ACCESS_KEY=<YOUR_MEDIASTACK_API_ACCESS_KEY>
  ```
- `cd` into the `src/` folder
- Run `. ./set_python_path.sh` (Linux/Mac) or `set_python_path.bat` (Windows) to set `PYTHONPATH`
- Run `config.sh` / `config.bat` to set the additional environment variables needed to connect to the Mediastack API and the Kafka topic
- From the `src/` folder, run `python news_etl/producer/mediastack_kafka_producer.py` to run the Kafka Producer code locally.
- Alternatively, instead of running the preceding steps, you can run the Kafka producer pipeline in a Docker container as follows:
  - From the root folder containing the `Dockerfile`, run `docker build -t news_etl:1.0 .` to create a Docker image for the News ETL pipeline's Kafka Producer component.
  - Run a container using the image with `docker run --env-file=.env news_etl:1.0` to see the Kafka Producer part of the ETL pipeline in action.
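For reference, a Dockerfile for a containerized producer like this might look as follows. This is a hedged sketch, not the repo's actual Dockerfile: the base image, paths, and entrypoint script location are assumptions based on the Python version and file layout mentioned in this README.

```
# Hypothetical Dockerfile sketch -- base image and paths are assumptions.
FROM python:3.8-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Mirror the set_python_path.sh step inside the container.
ENV PYTHONPATH=/app/src

CMD ["python", "src/news_etl/producer/mediastack_kafka_producer.py"]
```

Environment variables (Kafka credentials, Mediastack key) are injected at run time via `--env-file=.env`, as shown above, rather than baked into the image.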
- Create IAM roles as shown in the image.
- Upload the `.env` file containing the Mediastack API key and Kafka connection details to an AWS S3 bucket.
- Create a Dockerfile and upload the Docker image to AWS ECR.
- Create a cron schedule in AWS ECS to run the Kafka producer pipeline on a recurring schedule.
- Query the Mediastack API for the latest news and push it to a Kafka topic hosted in Confluent Cloud.
- A Databricks Streaming Kafka Consumer reads the latest offsets from the Kafka topic and writes to the raw landing Delta table.
- A Databricks workflow transforms the data and enriches it with source information through the medallion architecture layers of Bronze, Silver, and Gold.
- The transformed data is modeled into fact and dimension tables and tested for data quality using Great Expectations.
- The dimensional data is exposed as a semantic layer using PowerBI.
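The bronze-to-silver enrichment runs as Spark transformations in the Databricks workflow; the sketch below shows the per-record logic in plain Python. The lookup tables, field names, and the has_shock_value keyword heuristic are all illustrative assumptions, not the pipeline's actual implementation:

```python
from datetime import datetime

# Illustrative lookup tables -- the real pipeline would join dimension tables.
COUNTRY_NAMES = {"us": "United States", "gb": "United Kingdom"}
LANGUAGE_NAMES = {"en": "English", "fr": "French"}
SHOCK_WORDS = {"breaking", "shock", "crisis"}  # hypothetical heuristic


def enrich_article(raw):
    """Bronze -> silver enrichment for one article record:
    country/language name lookup, date parsing, and a boolean flag."""
    return {
        **raw,
        "country_name": COUNTRY_NAMES.get(raw["country"], "Unknown"),
        "language_name": LANGUAGE_NAMES.get(raw["language"], "Unknown"),
        "published_date": datetime.fromisoformat(raw["published_at"]).date().isoformat(),
        "has_shock_value": any(w in raw["title"].lower() for w in SHOCK_WORDS),
    }


silver = enrich_article({
    "title": "Breaking: markets move",
    "country": "us",
    "language": "en",
    "published_at": "2023-05-02T10:15:00",
})
```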
An ETL data pipeline solution is essential for collecting, transforming, and loading data from various sources into a centralized repository. The pipeline benefits consumers of news in several ways. It centralizes the data for easy accessibility, standardizes and ensures data quality, consistency, and accuracy, and automates the process of data transformation and scaling up as the data volume grows. Consequently, data consumers can quickly access high-quality, consistent, and easily accessible data for making informed decisions.
- Data extraction and loading:
  - Set up API access for data extraction
  - Retrieve the news data from the Mediastack API using a suitable extraction method (API calls)
  - Set up the Kafka Producer and Consumer to incrementally extract and load data into Databricks Delta tables
- Data transformation:
  - Clean the raw data to ensure it is in the desired format (e.g., removing duplicates, handling missing values, etc.)
  - Use the following transformation techniques: renaming columns, joining, grouping, typecasting, filtering, sorting, and aggregating
  - Transform the data into a structured format (e.g., converting to a tabular form or creating a data model)
  - Expose this dimensionally modeled data as a semantic layer to PowerBI
- Create a data pipeline:
  - Build a Docker image using a Dockerfile
  - Test that the Docker container runs locally
- Incremental extraction and loading:
  - The Kafka producer regularly extracts newly available news data from the API and updates the Kafka topic with the latest information
  - Ensure that the Kafka Consumer (the Databricks Streaming app) always reads from the latest checkpoints and lands the latest data in the Databricks Delta table
- Implement Great Expectations tests:
  - Write Great Expectations tests for the data transformation layer in the Databricks workflow
- Cloud hosting:
  - Host the Kafka Producer on AWS and the Kafka topic in Confluent Cloud
  - Use AWS services (e.g., EC2, S3, ECR, ECS, etc.) to ensure the robustness and reliability of the Kafka Producer pipeline
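Great Expectations expresses such checks as named expectations (for example, `expect_column_values_to_not_be_null`). The plain-Python sketch below mimics two such checks on a small batch purely to illustrate the idea; it is not the Great Expectations API, and the column names are illustrative:

```python
def expect_no_nulls(records, column):
    """Mimics expect_column_values_to_not_be_null: every record must
    have a non-null value in the given column."""
    failures = [r for r in records if r.get(column) is None]
    return {"success": not failures, "unexpected_count": len(failures)}


def expect_unique(records, column):
    """Mimics expect_column_values_to_be_unique: no duplicate values
    in the given column."""
    values = [r[column] for r in records]
    return {"success": len(values) == len(set(values))}


# A tiny batch standing in for a bronze/silver table.
batch = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
]
null_check = expect_no_nulls(batch, "title")
unique_check = expect_unique(batch, "url")
```

In the actual workflow these checks would run against the Delta tables at each bronze -> silver -> gold -> OBT transition, failing the job when an expectation is not met.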
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See `LICENSE.txt` for more information.
Vasanth Nair - @Linkedin
Daniel Marinescu - @Linkedin
Project Link: https://github.com/mddan/news_etl