Typically, Spark runs on YARN, which is inconvenient when we need finer control over executor placement (for example, running on a single machine with a specific number of executors under an exact configuration). Standalone mode suits these use cases better.
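For illustration, a standalone master lets a job request an exact executor layout through standard spark-submit flags; the master hostname, resource sizes, and example jar path below are placeholders rather than spark-kit defaults:

# Hypothetical submission against a standalone master: exactly 4 executors
# (8 total cores / 2 cores each), 4 GiB of heap per executor.
$SPARK_HOME/bin/spark-submit \
  --master spark://<master-host>:7077 \
  --total-executor-cores 8 \
  --executor-cores 2 \
  --executor-memory 4g \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 100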
- To use spark-kit:
git clone https://github.com/stevenybw/spark-kit
cd spark-kit
source manage-standalone.sh
- Get the official Spark release
wget https://www.apache.org/dyn/closer.lua/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
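One way to unpack the release and record its location (SPARK_HOME as the variable name is an assumption; the check_environment step below reports what the scripts actually expect):

# Unpack the downloaded archive and export where it lives.
tar xzf spark-2.4.4-bin-hadoop2.7.tgz
export SPARK_HOME=$PWD/spark-2.4.4-bin-hadoop2.7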
- Check the environment and follow the directions
check_environment
- Adjust the parameters in
manage-standalone.sh
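The authoritative parameter names and formats are the ones inside manage-standalone.sh; the lines below only illustrate the kind of values to review (hostnames and paths are placeholders):

# Illustrative only -- consult manage-standalone.sh for the real variable names and formats.
SLAVES_HOSTLIST="node01 node02 node03"          # hosts that should join the cluster (format assumed)
SPARK_HOME="$HOME/spark-2.4.4-bin-hadoop2.7"    # hypothetical: where the unpacked release lives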
- Establish a Spark standalone cluster with all the nodes in ${SLAVES_HOSTLIST}
reset_environment $DIST
- Establish a Spark standalone cluster on a single node (the first node in ${SLAVES_HOSTLIST})
reset_environment $LOCAL
- Establish a Spark standalone cluster on a single node (the node currently running the script)
reset_environment_locally $LOCAL
- Check the web UI of the Spark standalone master (the cluster's resource manager)
show_master_webui
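The standalone master's web UI listens on port 8080 by default; a quick reachability check from a terminal (the hostname is a placeholder):

# Replace <master-host> with the node that runs the standalone master.
curl -sf http://<master-host>:8080 >/dev/null && echo "master web UI is up"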
- Show the command to launch a Spark shell (its argument must match the one used to set up the environment; the distributed setup is assumed here)
show_spark_shell_command $DIST
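The printed command is the authoritative one; against a standalone cluster it typically has the following shape (the master host and resource flags here are placeholders):

# Typical spark-shell invocation for a standalone master (7077 is the default master port).
$SPARK_HOME/bin/spark-shell \
  --master spark://<master-host>:7077 \
  --total-executor-cores 8 \
  --executor-memory 4g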
- Or launch a Spark shell directly
enter_spark_shell $DIST
- View the session web UI of the Spark job at port 4040
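To have something to look at in that UI, any small job will do. A minimal sketch that pipes a one-liner into the shell (the master URL is a placeholder; the Thread.sleep keeps the driver, and therefore the port-4040 UI, alive for a minute after the job finishes):

# Run a trivial count so a completed job shows up at http://<driver-host>:4040.
echo 'sc.parallelize(1 to 1000000).map(_ + 1).count(); Thread.sleep(60000)' | \
  $SPARK_HOME/bin/spark-shell --master spark://<master-host>:7077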