This is an example project for Hadoop, Spark, Hive, and Superset.
The project runs locally on the docker-compose.yml stack provided by bigdata-stack.
1. Run bigdata-stack
git clone git@github.com:loinguyen3108/bigdata-stack.git
cd bigdata-stack
docker compose up -d
2. Spark Standalone
Set up a Spark Standalone cluster by following the official Spark documentation.
3. Dataset
The data is downloaded from the PostgreSQL Sample Database and restored into PostgreSQL.
4. Environment
export JDBC_URL=...
export JDBC_USER=...
export JDBC_PASSWORD=...
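The ingestion job is expected to read these three variables when connecting to PostgreSQL. A minimal sketch of how they could be turned into Spark JDBC options (the function name and the driver class are illustrative, not taken from the project's source):

```python
import os

def jdbc_options(table_name: str) -> dict:
    """Build JDBC options from the environment variables exported in step 4.

    The option keys ("url", "user", "password", "dbtable", "driver") match
    Spark's JDBC data source; the PostgreSQL driver class is an assumption.
    """
    return {
        "url": os.environ["JDBC_URL"],
        "user": os.environ["JDBC_USER"],
        "password": os.environ["JDBC_PASSWORD"],
        "dbtable": table_name,
        "driver": "org.postgresql.Driver",
    }
```

A dict like this can be passed to `spark.read.format("jdbc").options(**opts).load()`.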
5. Build dependencies
./build_dependencies.sh
6. Update local packages
./update_local_packages.sh
7. Args help
cd manager
python ingestion.py -h
python transform.py -h
cd ..
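The `-h` output lists the flags shown in step 8. A hedged sketch of an argument parser matching those flags (the help strings and required-ness are illustrative; only the flag names come from the README):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the spark-submit invocations in step 8.
    parser = argparse.ArgumentParser(
        description="Ingest a table from Postgres into the data lake."
    )
    parser.add_argument("--exec-date", required=True,
                        help="Execution date in YYYY:MM:DD format")
    parser.add_argument("--table-name", required=True,
                        help="Source table to ingest")
    parser.add_argument("--p-key", required=True,
                        help="Primary-key column of the table")
    parser.add_argument("--loading-type", required=True,
                        help="Loading strategy for the table")
    return parser
```

Dashes in flag names become underscores on the parsed namespace, e.g. `args.exec_date`.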
8. Run
# ingest data from postgres to datalake
spark-submit --py-files packages.zip manager/ingestion.py --exec-date YYYY:MM:DD --table-name <table_name> --p-key <key name> --loading-type <type>
# transform data from datalake to hive
spark-submit --py-files packages.zip manager/transform.py --exec-date YYYY:MM:DD
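Note that both invocations take the execution date with colons (`YYYY:MM:DD`) rather than the more common dashes. A small sketch of parsing that format, assuming the scripts convert it to a `date` internally (the function name is illustrative):

```python
from datetime import date, datetime

def parse_exec_date(value: str) -> date:
    # The README's invocations use a colon-separated date: YYYY:MM:DD.
    return datetime.strptime(value, "%Y:%m:%d").date()
```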
The pipeline writes to two layers:
- Data Lake (raw data ingested from Postgres)
- Hive (tables transformed from the data lake)
This software is licensed under the Apache License. © Loi Nguyen.