loinguyen3108/Wide-World-Importers-Spark

Repository files navigation

Welcome to my Wide-World-Importers-Spark project

This project uses PySpark to ETL Wide World Importers (WWI) data from CSV files in the source system into Hive. The data is then analyzed with Superset.


🚩 Table of Contents

🎨 Stack

The project runs locally, based on the docker-compose.yml in bigdata-stack.

⚙️ Setup

1. Run bigdata-stack

git clone git@github.com:loinguyen3108/bigdata-stack.git

cd bigdata-stack

docker compose up -d

# setup superset
# 1. Setup your local admin account

docker exec -it superset superset fab create-admin --username admin --firstname Superset --lastname Admin --email admin@superset.com --password admin

# 2. Migrate the local DB to the latest version

docker exec -it superset superset db upgrade

# 3. Load examples

docker exec -it superset superset load_examples

# 4. Set up roles

docker exec -it superset superset init

Log in and take a look -- navigate to http://localhost:8080/login/ -- u/p: [admin/admin]

2. Spark Standalone
Set up a standalone Spark cluster as described in the Spark documentation.

3. Dataset
Set up the data from the WWI Dataset.

4. Environment

export JDBC_URL=...
export JDBC_USER=...
export JDBC_PASSWORD=...
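As a minimal sketch (not the project's actual `ingestion.py`), the three environment variables above are the options a PySpark JDBC read needs. The helper below just collects them; the `jdbc_options` name and the per-run table argument are illustrative:

```python
import os

def jdbc_options(table_name: str) -> dict:
    """Collect the JDBC connection options an ingestion job would pass
    to Spark's JDBC reader. The env var names match the exports above;
    the table name is supplied per run (hypothetical helper)."""
    return {
        "url": os.environ["JDBC_URL"],            # e.g. jdbc:postgresql://host:5432/wwi
        "user": os.environ["JDBC_USER"],
        "password": os.environ["JDBC_PASSWORD"],
        "dbtable": table_name,                    # source table to ingest
    }
```

In a Spark job these options would be consumed as `spark.read.format("jdbc").options(**jdbc_options(table_name)).load()`.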

5. Build dependencies

./build_dependencies.sh

6. Update local packages

./update_local_packages.sh

7. Args help

cd manager
python ingestion.py -h
python transform.py -h
cd ..

8. Run

# Ingest data from Postgres to the data lake
spark-submit --py-files packages.zip manager/ingestion.py --table_name <table_name>

# Transform data from the data lake to Hive
# Initialize dim_date
spark-submit --py-files packages.zip manager/transform.py --init --exec-date YYYY:MM:DD

# Transform
spark-submit --py-files packages.zip manager/transform.py
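To illustrate what the `--init` step above does conceptually, here is a sketch of generating date-dimension rows starting from the `--exec-date` value. The schema and function name are assumptions for illustration; the real `transform.py` may differ:

```python
from datetime import date, timedelta

def dim_date_rows(exec_date: date, days: int = 3):
    """Generate one dim_date row per day starting at exec_date
    (hypothetical schema, sketching what a dim_date init produces)."""
    rows = []
    for i in range(days):
        d = exec_date + timedelta(days=i)
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),  # surrogate key, e.g. 20240101
            "full_date": d.isoformat(),
            "year": d.year,
            "month": d.month,
            "day": d.day,
            "day_of_week": d.isoweekday(),          # 1 = Monday
        })
    return rows
```

In a Spark job, rows like these would be turned into a DataFrame with `spark.createDataFrame(rows)` and written to the Hive dim_date table.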

WWI Star Schema

(Diagram: WWI star schema)

📜 License

This software is licensed under the Apache License © Loi Nguyen.
