This project uses PySpark to ETL the WideWorldImporters (WWI) data from the source system into Hive. The data is then analyzed with Superset.
The project runs locally, based on the docker-compose.yml in bigdata-stack.
1. Run bigdata-stack
git clone git@github.com:loinguyen3108/bigdata-stack.git
cd bigdata-stack
docker compose up -d
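Before configuring Superset, you can confirm that all containers started:
# optional: verify that all services are up
docker compose ps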
# setup superset
# 1. Setup your local admin account
docker exec -it superset superset fab create-admin --username admin --firstname Superset --lastname Admin --email admin@superset.com --password admin
# 2. Migrate local DB to latest
docker exec -it superset superset db upgrade
# 3. Load Examples
docker exec -it superset superset load_examples
# 4. Setup roles
docker exec -it superset superset init
Log in and take a look -- navigate to http://localhost:8080/login/ -- username/password: admin/admin
2. Spark Standalone
Set up Spark Standalone by following the Spark documentation.
3. Dataset
Set up the data from the WWI Dataset.
4. Environment
Export the JDBC connection settings for the source database:
export JDBC_URL=...
export JDBC_USER=...
export JDBC_PASSWORD=...
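For example, pointing the jobs at a local Postgres instance holding the WWI database (the host, database name, and credentials below are placeholder assumptions; substitute your own):
# hypothetical values for a local WWI Postgres source
export JDBC_URL=jdbc:postgresql://localhost:5432/wideworldimporters
export JDBC_USER=postgres
export JDBC_PASSWORD=postgres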
5. Build dependencies
./build_dependencies.sh
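build_dependencies.sh is expected to bundle the jobs' Python dependencies into the packages.zip archive passed to spark-submit in step 8. A minimal sketch of such a script, assuming dependencies are pinned in a requirements.txt:
# sketch: install pinned dependencies and zip them for --py-files
pip install -r requirements.txt --target ./deps
cd deps && zip -r ../packages.zip . && cd ..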
6. Update local packages
./update_local_packages.sh
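Presumably this refreshes the project's own modules inside packages.zip so that spark-submit ships the latest local code; a sketch under that assumption (the jobs/ package name is hypothetical):
# sketch: re-add the local project package to the archive
zip -ru packages.zip jobs/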
7. Args help
cd manager
python ingestion.py -h
python transform.py -h
cd ..
8. Run
# ingest data from postgres to datalake
spark-submit --py-files packages.zip manager/ingestion.py --table_name <table_name>
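For instance, to pull a single WWI table into the data lake (the table name is illustrative; pass any table present in your source database):
# e.g. ingest the Sales.Orders table (illustrative name)
spark-submit --py-files packages.zip manager/ingestion.py --table_name sales.orders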
# transform data from datalake to hive
# Init dim_date
spark-submit --py-files packages.zip manager/transform.py --init --exec-date YYYY:MM:DD
# Transform
spark-submit --py-files packages.zip manager/transform.py
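Putting it together, a first run would initialize dim_date once and then execute the regular transform (the date is illustrative and follows the --exec-date format shown above):
# one-time dim_date initialization (illustrative date)
spark-submit --py-files packages.zip manager/transform.py --init --exec-date 2013:01:01
# regular transform from the data lake into Hive
spark-submit --py-files packages.zip manager/transform.py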
This software is licensed under the Apache License. © Loi Nguyen.