This project builds a real-time data pipeline for e-commerce data. Events are streamed through Kafka, processed with PySpark, and stored in MongoDB, and a Streamlit dashboard visualizes the results. The pipeline enables real-time analytics on the e-commerce website: it makes user behavior and trends easier to understand, which supports faster decision making and a better customer experience.

Technologies used:
- Docker
- Streamlit
- Kafka
- Spark
- MongoDB
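
The first stage of the pipeline publishes events to Kafka. The sketch below shows one way that stage could look, assuming events start as CSV rows (the repository ships a dataset.csv) and are sent as JSON. It is not the project's Producer.py: the topic name `ecommerce-events`, the broker address `localhost:9092`, the sample column names, and the use of the `kafka-python` client are all assumptions made for illustration.

```python
import csv
import io
import json
import time


def rows_to_messages(csv_file):
    """Yield one UTF-8 encoded JSON message per CSV row."""
    for row in csv.DictReader(csv_file):
        yield json.dumps(row).encode("utf-8")


def publish(csv_path, topic="ecommerce-events", servers="localhost:9092", delay=0.0):
    """Send every row of csv_path to a Kafka topic (needs a running broker).

    The topic and broker address are assumed values, not taken from the project.
    """
    # Imported here so the serialization helper works without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=servers)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for message in rows_to_messages(f):
            producer.send(topic, value=message)
            time.sleep(delay)  # optional pause to imitate a live event stream
    producer.flush()  # block until buffered messages are delivered
    producer.close()


# Hypothetical two-row sample standing in for the real dataset.
sample = io.StringIO("user_id,country,product\n1,VN,laptop\n2,US,phone\n")
messages = list(rows_to_messages(sample))
```

Keeping serialization in `rows_to_messages` separate from `publish` means the message format can be checked without a broker running. Calling `publish("dataset.csv")` would only work once Kafka is up and the topic exists.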

Project structure:

Dashboard:
- dashboard.py

Data-Generator: generates data from dataset.csv
- AddUser.py
- dataset.csv
- Producer.py

database: created when the MongoDB server starts

ETD_E_VENV: virtual environment

kafka:
- kafka-setup.sh: creates the Kafka topic

Stream:
- Aggregate_Country.py
- Aggregate_Gender.py
- Aggregate_Product.py
- jars: JAR files used to submit the Spark applications

docker-compose.yaml.py

LICENSE

.gitignore

activate.sh

readme.md

requirements.txt
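
The names of the scripts under Stream suggest grouped aggregations by country, gender, and product. The sketch below illustrates that idea in two forms: a plain-Python counter that shows the logic, and a PySpark Structured Streaming function that would compute the same per-country count from a Kafka topic. Neither is the project's actual code. The field names, the topic `ecommerce-events`, the broker address, and the console sink (used here in place of MongoDB) are assumptions, and the Spark function also needs the Kafka connector JAR on the classpath.

```python
from collections import Counter


def count_by(events, key):
    """Count events per value of `key`, skipping events that lack it."""
    return Counter(e[key] for e in events if key in e)


def start_country_stream(servers="localhost:9092", topic="ecommerce-events"):
    """Start a streaming per-country count (needs Spark and a Kafka broker).

    Topic, broker address, and schema are assumed values for illustration.
    """
    # Imported here so count_by stays usable without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("aggregate-country-sketch").getOrCreate()
    schema = StructType([StructField("country", StringType())])

    # Kafka delivers bytes, so cast the value to a string and parse it as JSON.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", servers)
        .option("subscribe", topic)
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.country")
    )
    counts = events.groupBy("country").count()

    # "complete" mode re-emits the full aggregate table on every trigger.
    return counts.writeStream.outputMode("complete").format("console").start()


# Hypothetical events standing in for messages read from Kafka.
sample_events = [
    {"country": "VN", "gender": "F"},
    {"country": "US", "gender": "M"},
    {"country": "VN", "gender": "M"},
]
country_counts = count_by(sample_events, "country")
```

Passing `"gender"` as the key to `count_by` gives the gender breakdown from the same events, and a `product` field would work the same way if the events carried one.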

To run the application, first install the dependencies:
pip install -r requirements.txt

Second, create the Kafka topic:
cd kafka
bash kafka-setup.sh

Finally, return to the project root and run the project:
cd ..
bash activate.sh