The Real-time Wikipedia Data Streaming Project is designed to capture and process real-time data from the Wikipedia Data Stream. It uses AWS Kinesis, a powerful data streaming platform, to ingest, process, and distribute Wikipedia data. This README provides an overview of the project, its components, and instructions for setting it up.
Keywords: Python, AWS Kinesis, AWS Firehose, AWS Lambda, Snowflake, AWS S3, AWS Athena
The project consists of the following components:
- AWS Kinesis Data Stream: Ingests data from the Wikipedia Data Stream.
- AWS Kinesis Data Firehose: Delivers data from the Kinesis stream into AWS S3 (see the sketch after this list).
- AWS S3: Stores raw and processed data.
- AWS Athena: Allows SQL-based querying of data stored in S3.
- AWS Glue Crawler: Crawls the data in S3 and creates the table schema used by Athena.
- Snowflake Data Warehouse: Stores structured and transformed data for analytics.
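
As a rough illustration of how the Firehose leg of the pipeline can be wired up, here is a minimal boto3 sketch that creates a delivery stream reading from the Kinesis stream and writing raw records to S3. All resource names and ARNs below are placeholders, not part of this repository.

```python
import boto3

# Placeholder ARNs -- substitute your own stream, bucket, and IAM role.
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/wikipedia-stream"
BUCKET_ARN = "arn:aws:s3:::wikipedia-raw-data"
ROLE_ARN = "arn:aws:iam::123456789012:role/firehose-delivery-role"

firehose = boto3.client("firehose")

# Create a delivery stream that uses the Kinesis stream as its source and
# batches records into the S3 bucket under the "raw/" prefix.
firehose.create_delivery_stream(
    DeliveryStreamName="wikipedia-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": STREAM_ARN,
        "RoleARN": ROLE_ARN,
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": ROLE_ARN,
        "BucketARN": BUCKET_ARN,
        "Prefix": "raw/",
    },
)
```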
Before you begin, ensure you have the following prerequisites in place:
- AWS account with appropriate permissions.
- AWS CLI installed and configured.
- Python 3.x installed (for Lambda function development).
- Snowflake account and credentials (if using Snowflake).
- A basic understanding of AWS Kinesis and AWS Lambda.
- Set up an AWS Kinesis Data Stream and configure the necessary AWS resources (IAM roles, etc.) for your project (a minimal boto3 sketch follows this list).
- Configure your AWS credentials using the AWS CLI:
  `aws configure`
- Set up AWS Athena and a Snowflake account if you are using these services.
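
For the Kinesis setup step above, the stream can be created from Python with boto3. This is only a sketch under assumed names; the stream name and on-demand capacity mode are illustrative choices, not requirements of the project.

```python
import boto3

kinesis = boto3.client("kinesis")

# "wikipedia-stream" is a placeholder name; on-demand mode avoids manual shard sizing.
kinesis.create_stream(
    StreamName="wikipedia-stream",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Block until the stream is ACTIVE before attaching Firehose or a producer.
kinesis.get_waiter("stream_exists").wait(StreamName="wikipedia-stream")
```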
Use the provided producer script or an alternative method to ingest data from the Wikipedia Data Stream into your Kinesis Data Stream.
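
If you need a starting point for the producer, the sketch below reads the public Wikimedia recent-change event stream over server-sent events and forwards each event to Kinesis. The Kinesis stream name is a placeholder, and the field used as the partition key is an assumption about the payload.

```python
import json

import boto3
import requests

# Wikimedia's public server-sent-events feed of recent changes; adjust the URL
# if your project consumes a different stream.
WIKI_STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
KINESIS_STREAM = "wikipedia-stream"  # placeholder name

kinesis = boto3.client("kinesis")

response = requests.get(WIKI_STREAM_URL, stream=True)
for line in response.iter_lines():
    # SSE payload lines start with "data:"; everything else is framing.
    if not line or not line.startswith(b"data:"):
        continue
    event = json.loads(line[len(b"data:"):])
    # Partition by wiki so events for the same wiki stay ordered on one shard.
    kinesis.put_record(
        StreamName=KINESIS_STREAM,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event.get("wiki", "unknown"),
    )
```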
Deploy your Lambda functions to AWS, specifying the Kinesis stream as the trigger.
Configure the Lambda functions to transform and store data in AWS S3 and, if applicable, Snowflake.
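
A Lambda function triggered by the Kinesis stream typically decodes the base64-encoded records and writes the transformed output to S3. The following is a minimal handler sketch; the bucket name, key layout, and the fields kept in the transform are assumptions, not the project's actual logic.

```python
import base64
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "wikipedia-processed-data"  # placeholder bucket name

def lambda_handler(event, context):
    """Triggered by the Kinesis stream; writes one JSON object per record to S3."""
    for record in event["Records"]:
        # Kinesis delivers each record payload base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Keep only the fields needed downstream (an assumed, minimal transform).
        doc = {
            "title": payload.get("title"),
            "user": payload.get("user"),
            "wiki": payload.get("wiki"),
            "timestamp": payload.get("timestamp"),
            "type": payload.get("type"),
        }

        key = f"processed/{record['kinesis']['sequenceNumber']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(doc).encode("utf-8"))

    return {"records_processed": len(event["Records"])}
```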
Use AWS Athena to query data stored in S3 for ad-hoc analysis and reporting.
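
Queries can be run against the S3 data either from the Athena console or programmatically. The snippet below starts a query through boto3; the database, table, and results-bucket names are placeholders that should match whatever your Glue crawler created.

```python
import boto3

athena = boto3.client("athena")

# Database, table, and output bucket are assumptions -- use the names produced
# by your Glue crawler and your own results bucket.
response = athena.start_query_execution(
    QueryString="""
        SELECT wiki, COUNT(*) AS edits
        FROM wikipedia_events
        GROUP BY wiki
        ORDER BY edits DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "wikipedia_db"},
    ResultConfiguration={"OutputLocation": "s3://wikipedia-athena-results/"},
)
print(response["QueryExecutionId"])
```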
This project is open-source and distributed under the MIT License, allowing for collaboration and adaptation according to your requirements.
In summary, this project showcases how to use AWS Kinesis for real-time data ingestion and processing, making it a valuable resource for anyone interested in building similar data streaming solutions.