Building a real-time data streaming application with Apache Kafka

Project Description

Real-time processing of Meetup RSVP data from https://www.meetup.com/ using Apache Kafka, Zookeeper, and PySpark. The application analyzes the live RSVP stream from meetup.com to surface real-time insights such as trending topics and active cities, along with other business insights related to Meetup RSVPs. The data processing scripts are written in Python.

Getting this streaming data into Apache Spark Streaming is the first step toward performing analytics, recommendations, or visualizations on the data.

Technologies Used

Spark 3.1.2
Kafka 2.8.0
PySpark 2.4.8
Python 3.6
Data Feeds: kafka-python 2.0.2
ETL: Spark DataFrame, Spark Structured Streaming
Visualization: matplotlib 3.4.3
Git/GitHub

The Kafka Python API (kafka-python) is used to interact with the Kafka cluster. PySpark is used to write the Spark streaming jobs.

Features

List of features ready and TODOs for future development

1. Which US cities are currently active in scheduling Meetup events?
2. What are the trending topics in US Meetup events?
3. How many Big Data Meetup events are scheduled in each country?
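
The three questions above reduce to simple grouped counts. The following is a minimal plain-Python sketch of that logic, not the project's Spark code. It assumes each RSVP record carries a `group` object with `group_city`, `group_country`, and `group_topics` fields; those field names, the sample records, and all function names are assumptions for illustration. It counts RSVPs rather than distinct events, as a rough proxy for activity.

```python
from collections import Counter

# Hypothetical sample records; the field names are an assumed RSVP layout.
SAMPLE_RSVPS = [
    {"group": {"group_city": "Austin", "group_country": "us",
               "group_topics": [{"topic_name": "Big Data"},
                                {"topic_name": "Python"}]}},
    {"group": {"group_city": "Austin", "group_country": "us",
               "group_topics": [{"topic_name": "Python"}]}},
    {"group": {"group_city": "London", "group_country": "gb",
               "group_topics": [{"topic_name": "Big Data"}]}},
]


def active_cities(rsvps, country="us"):
    """Feature 1: count RSVPs per city within one country."""
    return Counter(
        r["group"].get("group_city")
        for r in rsvps
        if r["group"].get("group_country") == country
    )


def trending_topics(rsvps, country="us"):
    """Feature 2: count topic mentions across RSVPs within one country."""
    counts = Counter()
    for r in rsvps:
        group = r["group"]
        if group.get("group_country") == country:
            counts.update(t["topic_name"] for t in group.get("group_topics", []))
    return counts


def big_data_by_country(rsvps, keyword="big data"):
    """Feature 3: count RSVPs per country whose group lists a matching topic."""
    counts = Counter()
    for r in rsvps:
        group = r["group"]
        topics = [t["topic_name"].lower() for t in group.get("group_topics", [])]
        if any(keyword in topic for topic in topics):
            counts[group.get("group_country")] += 1
    return counts
```

In a streaming job the same groupings would be expressed as DataFrame aggregations; the plain-Python version is only meant to make the intended counts concrete.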

Getting Started

Assuming appropriate versions of Kafka and Spark are installed, the following commands run the application.

The Spark Streaming integration used here targets Kafka 0.10.0.0 and above.

Start Zookeeper, which Kafka relies on for coordination. Run the command from the Kafka root directory:

.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties

Start the Kafka server. Additional servers can be added as required.

.\bin\windows\kafka-server-start.bat .\config\server.properties

Start Producer.py to read data from the Meetup stream and store it in the 'meetup' Kafka topic.
Start the Consumer notebook to consume the processed stream from Spark Streaming.
Submit the Spark job <spark_file>.py to read the data from Kafka into Spark Streaming.
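
As a rough illustration of the producer step, here is a minimal sketch using kafka-python. It is not the project's Producer.py. The function names, the `stream_url` parameter (a placeholder for the RSVP feed address, which is not given here), and the broker address `localhost:9092` are all assumptions.

```python
import json

TOPIC = "meetup"  # topic name from the steps above


def encode_rsvp(line):
    """Validate one JSON line and re-encode it as UTF-8 bytes.

    Returns None for blank, undecodable, or malformed lines.
    """
    try:
        text = line.decode("utf-8") if isinstance(line, bytes) else line
        record = json.loads(text)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return None
    return json.dumps(record).encode("utf-8")


def publish(lines, producer, topic=TOPIC):
    """Send every valid line to the topic and return how many were sent."""
    sent = 0
    for line in lines:
        payload = encode_rsvp(line)
        if payload is not None:
            producer.send(topic, value=payload)
            sent += 1
    producer.flush()
    return sent


def main(stream_url, bootstrap_servers="localhost:9092"):
    """Hypothetical entry point; both arguments are assumptions."""
    from urllib.request import urlopen
    from kafka import KafkaProducer  # provided by the kafka-python package

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    with urlopen(stream_url) as response:
        # The HTTP response is iterated line by line as bytes.
        publish(response, producer)
```

Keeping `publish` independent of the concrete producer object makes it possible to exercise the encoding logic without a running broker.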

Spark depends on an external package for Kafka integration:

bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 spark_meetup.py localhost:2181 meetup
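
For orientation, a Structured Streaming job that reads the 'meetup' topic could look roughly like the sketch below. It is not the project's spark_meetup.py. The schema fields, the function names, and the broker address `localhost:9092` are assumptions; the address the project's own script expects on the command line may differ. The Kafka source itself requires the `kafka.bootstrap.servers` option and either `subscribe` or a similar topic selector.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source; values here are illustrative."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def run(bootstrap_servers="localhost:9092", topic="meetup"):
    """Count RSVPs per US city and print running totals to the console."""
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    # Assumed subset of the RSVP JSON layout; adjust to the real payload.
    schema = StructType([
        StructField("group", StructType([
            StructField("group_city", StringType()),
            StructField("group_country", StringType()),
        ])),
    ])

    spark = SparkSession.builder.appName("meetup-rsvp-sketch").getOrCreate()

    raw = (spark.readStream.format("kafka")
           .options(**kafka_source_options(bootstrap_servers, topic))
           .load())

    # Kafka delivers the value as binary, so cast to string before parsing.
    rsvps = raw.select(
        from_json(col("value").cast("string"), schema).alias("rsvp"))

    city_counts = (rsvps
                   .where(col("rsvp.group.group_country") == "us")
                   .groupBy(col("rsvp.group.group_city").alias("city"))
                   .count())

    query = (city_counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()
```

Calling `run()` from a script submitted with the `--packages` option shown above would start the query; the console sink is used here only because it needs no further setup.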

Start the <consumer_file>.ipynb notebook to visualize the data.
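
A notebook cell for the visualization step might follow the pattern below. This is a sketch under several assumptions: that the Spark job writes its results back to a Kafka topic, that each message is a JSON object of the form `{"city": ..., "count": ...}`, and that the broker listens on `localhost:9092`. The topic name passed to `consume` and all function names are hypothetical.

```python
import json


def update_counts(counts, message_value):
    """Merge one processed message into the running totals.

    Assumes the payload is JSON like {"city": "Austin", "count": 12}.
    """
    record = json.loads(message_value)
    counts[record["city"]] = record["count"]
    return counts


def top_n(counts, n=10):
    """Return the n largest (key, count) pairs, largest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


def plot_top(counts, n=10):
    """Draw a horizontal bar chart of the top n entries."""
    import matplotlib.pyplot as plt

    top = top_n(counts, n)
    labels = [key for key, _ in top]
    values = [value for _, value in top]
    fig, ax = plt.subplots()
    ax.barh(labels, values)
    ax.invert_yaxis()  # largest bar at the top
    ax.set_xlabel("RSVP count")
    plt.show()


def consume(topic, bootstrap_servers="localhost:9092", max_messages=100):
    """Read up to max_messages from the topic and return the latest counts."""
    from kafka import KafkaConsumer  # provided by the kafka-python package

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset="earliest",
        consumer_timeout_ms=10000,  # stop iterating after 10 s of silence
    )
    counts = {}
    for i, message in enumerate(consumer):
        update_counts(counts, message.value)
        if i + 1 >= max_messages:
            break
    consumer.close()
    return counts
```

A typical cell would then call `plot_top(consume("<output_topic>"))`, with the placeholder replaced by whatever topic the Spark job writes to.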

License

This project uses the following license: Apache License 2.0

imyusufansari/Spark-Streaming-with-Kafka