# Data Pipeline with Docker

## Table of Contents

- [About](#about)
- [Data pipeline](#data-pipeline)
- [How to replicate the project](#how-to-replicate-the-project)
- [Demo video](#demo-video)
- [Acknowledgement](#acknowledgement)
- [Connect with me](#connect-with-me)

## About

This project is the first assignment from the Big Data for Engineering class. It uses Docker to deploy an end-to-end data pipeline on your local machine: containerized Kafka for data streaming, Cassandra as the NoSQL database, and Jupyter Lab and the Dash framework for data analysis and visualization. There are three pipelines, using data from the Twitter and OpenWeatherMap APIs, the Faker API, and PokeAPI.

## Data pipeline

Kafka producers and consumers stream data from the source APIs:

*(image: Kafka)*
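
For a concrete sense of what a producer does, here is a minimal sketch (not the repository's actual code) of an OpenWeatherMap-style producer. It assumes the `kafka-python` and `requests` packages, a broker reachable at `kafka:9092` on the `kafka-network`, a topic named `weather`, an `OPENWEATHERMAP_API_KEY` environment variable, and an example city; adjust all of these to match the real setup:

```python
# Minimal producer sketch (assumptions: kafka-python and requests installed,
# broker at "kafka:9092", topic "weather", OPENWEATHERMAP_API_KEY set).
import json
import os
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

API_KEY = os.environ["OPENWEATHERMAP_API_KEY"]
URL = "https://api.openweathermap.org/data/2.5/weather"

while True:
    # Fetch the current weather and publish it to the Kafka topic.
    resp = requests.get(URL, params={"q": "Ho Chi Minh City", "appid": API_KEY})
    producer.send("weather", resp.json())
    producer.flush()
    time.sleep(60)  # poll once a minute
```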

Data is then stored in the Cassandra database:

*(image: Cassandra)*
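
On the consuming side, a sketch of what a consumer that writes into Cassandra might look like. The keyspace `kafkapipeline` and table `weatherreport` match the cqlsh example later in this README, but the column names here are placeholders, and `kafka-python` plus `cassandra-driver` are assumed:

```python
# Minimal consumer sketch (assumptions: kafka-python and cassandra-driver
# installed, Cassandra reachable at host "cassandra", keyspace "kafkapipeline",
# and a "weatherreport" table with the columns used below -- adjust to the
# actual schema).
import json

from cassandra.cluster import Cluster
from kafka import KafkaConsumer

session = Cluster(["cassandra"]).connect("kafkapipeline")

consumer = KafkaConsumer(
    "weather",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

insert = session.prepare(
    "INSERT INTO weatherreport (location, temperature) VALUES (?, ?)"
)

for message in consumer:
    record = message.value
    # Write each streamed record into Cassandra.
    session.execute(insert, (record["name"], record["main"]["temp"]))
```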

Jupyter Lab (or Dash) is then used to query the database and visualize the data:

*(image: Jupyter)*
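
As an illustration of the notebook side, a minimal query-and-plot sketch, assuming `cassandra-driver`, `pandas`, and `matplotlib` are available in the Jupyter Lab container and that a `temperature` column exists (adjust to the actual schema):

```python
# Query-and-plot sketch for a Jupyter notebook (assumed packages and schema).
import matplotlib.pyplot as plt
import pandas as pd
from cassandra.cluster import Cluster

session = Cluster(["cassandra"]).connect("kafkapipeline")

# Pull the whole table into a DataFrame for exploration.
rows = session.execute("SELECT * FROM weatherreport")
df = pd.DataFrame(list(rows))

df["temperature"].plot(kind="line", title="Temperature readings")
plt.show()
```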

All Docker containers used in the pipeline:

*(image: Docker)*

## How to replicate the project

The containers for every component mentioned above can be found in the `src` folder. All images have been pre-built; however, if you want to replicate the pipeline, you can rebuild the images and bring everything up again by following the guide below.

### Create Docker networks

```bash
docker network create kafka-network
docker network create cassandra-network
```

### Start up Cassandra

```bash
docker-compose -f cassandra/docker-compose.yml up -d --build
```

### Start up Kafka

```bash
docker-compose -f kafka/docker-compose.yml up -d
```

### Start up the producers

- OpenWeatherMap: `docker-compose -f owm-producer/docker-compose.yml up -d --build`
- Twitter producer: `docker-compose -f twitter-producer/docker-compose.yml up --build`
- FakerAPI: `docker-compose -f faker-producer/docker-compose.yml up -d --build`
- Pokemon producer: `docker-compose -f pokemon-producer/docker-compose.yml up -d --build`

### Start all 4 consumers

```bash
docker-compose -f consumers/docker-compose.yml up --build
```

### Check data in Cassandra DB

- Open a shell in the Cassandra container: `docker exec -it cassandra bash`
- Query the data from the 4 tables:

```
$ cqlsh --cqlversion=3.4.4 127.0.0.1   # make sure you use the correct cqlversion
cqlsh> use kafkapipeline;              # keyspace name
cqlsh:kafkapipeline> select * from twitterdata;
cqlsh:kafkapipeline> select * from weatherreport;
cqlsh:kafkapipeline> select * from fakerdata;
```

### Data visualization

- With Jupyter Notebook: `docker-compose -f data-vis/docker-compose.yml up -d --build`
- With Dash: `docker-compose -f dashboard/docker-compose.yml up -d` (see the app sketch below)
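
For reference, a minimal Dash app along these lines might look like the sketch below. It is not the repository's actual dashboard code; it assumes `dash`, `plotly`, `pandas`, and `cassandra-driver`, plus the same keyspace and (hypothetical) `temperature` column as the notebook example, and the default port 8050:

```python
# Minimal Dash app sketch (assumed packages, schema, and port).
import pandas as pd
import plotly.express as px
from cassandra.cluster import Cluster
from dash import Dash, dcc, html

# Load the current contents of the table into a DataFrame at startup.
session = Cluster(["cassandra"]).connect("kafkapipeline")
df = pd.DataFrame(list(session.execute("SELECT * FROM weatherreport")))

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Weather dashboard"),
    dcc.Graph(figure=px.line(df, y="temperature", title="Temperature readings")),
])

if __name__ == "__main__":
    app.run_server(host="0.0.0.0", port=8050, debug=True)
```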

## Demo video

Watch it on YouTube!

## Acknowledgement

Based on: https://github.com/salcaino/sfucmpt733/tree/main/foobar-kafka and https://github.com/vnyennhi/docker-kafka-cassandra

## Connect with me

If you find this project useful, you can let me know. I would love to hear about it! 🔥
