diff --git a/integrations/spark/README.md b/integrations/spark/README.md
new file mode 100644
index 0000000..322f835
--- /dev/null
+++ b/integrations/spark/README.md
@@ -0,0 +1,31 @@
+# Spark integration
+
+You can integrate sidekick with Apache Spark by adding [sidekick_service_init.sh](./sidekick_service_init.sh) as an init script in your Spark clusters. The init script should be configured to run on all Spark nodes.
+
+Briefly, the init script does the following:
+ - Installs sidekick on the Spark node
+ - Configures the S3 endpoint (for specific buckets) to point to sidekick
+ - Sets up a systemd service to run sidekick as a daemon
+
+## Configuration
+
+To get started, download the sample [init script](./sidekick_service_init.sh) and make the following changes.
+
+1. Add the bucket endpoints and regions that will be accessed via sidekick to this section of the init script:
+
+```bash
+cat > /databricks/driver/conf/sidekick-spark-conf.conf <<EOL
+[driver] {
+  "spark.hadoop.fs.s3a.bucket.<bucket-name-1>.endpoint" = "http://localhost:7075"
+  "spark.hadoop.fs.s3a.bucket.<bucket-name-1>.endpoint.region" = <region>
+  "spark.hadoop.fs.s3a.bucket.<bucket-name-2>.endpoint" = "http://localhost:7075"
+  "spark.hadoop.fs.s3a.bucket.<bucket-name-2>.endpoint.region" = <region>
+}
+EOL
+```
+
+2. Set the cloud platform environment variable by adding this line to the [sidekick service init script](./sidekick_service_init.sh):
+
+```bash
+export SIDEKICK_APP_CLOUDPLATFORM=<cloud-platform>
+```
diff --git a/integrations/spark/sidekick_service_init.sh b/integrations/spark/sidekick_service_init.sh
new file mode 100644
index 0000000..7beb689
--- /dev/null
+++ b/integrations/spark/sidekick_service_init.sh
@@ -0,0 +1,50 @@
+#!/bin/bash
+set -ex
+
+# Check if the sidekick binary is present; if not, download it
+SIDEKICK_BIN=/usr/bin/sidekick
+if [ -f "$SIDEKICK_BIN" ]; then
+  echo "$SIDEKICK_BIN already installed."
+else
+  wget https://github.com/project-n-oss/sidekick/releases/latest/download/sidekick-linux-amd64.tar.gz
+  tar -xzvf sidekick-linux-amd64.tar.gz -C /usr/bin
+fi
+chmod +x $SIDEKICK_BIN
+$SIDEKICK_BIN --help > /dev/null
+
+# Point the S3A endpoint for each bucket at the local sidekick proxy
+cat > /opt/spark/conf/style-path-spark-conf.conf <<EOL
+[driver] {
+  "spark.hadoop.fs.s3a.bucket.<bucket-name-1>.endpoint" = "http://localhost:7075"
+  "spark.hadoop.fs.s3a.bucket.<bucket-name-1>.endpoint.region" = <region>
+  "spark.hadoop.fs.s3a.bucket.<bucket-name-2>.endpoint" = "http://localhost:7075"
+  "spark.hadoop.fs.s3a.bucket.<bucket-name-2>.endpoint.region" = <region>
+}
+EOL
+
+# Add any Spark or env config here:
+# --------------------------------------------------
+
+# --------------------------------------------------
+
+export SIDEKICK_APP_CLOUDPLATFORM="<cloud-platform>"
+
+# Create the service file for the sidekick process
+SERVICE_FILE="/etc/systemd/system/sidekick.service"
+touch $SERVICE_FILE
+
+cat > $SERVICE_FILE << EOF
+[Unit]
+Description=Sidekick service file
+[Service]
+Environment=SIDEKICK_APP_CLOUDPLATFORM=$SIDEKICK_APP_CLOUDPLATFORM
+ExecStart=$SIDEKICK_BIN serve -p 7075
+Restart=always
+[Install]
+WantedBy=multi-user.target
+EOF
+
+systemctl daemon-reload
+systemctl enable sidekick
+systemctl start sidekick
+systemctl status sidekick
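The repeated per-bucket stanza in the init script is easy to mistype by hand. As a sketch (the bucket names, region, and output filename below are illustrative assumptions, not values from the init script), the stanza can be generated with a small loop:

```shell
#!/bin/bash
# Sketch: generate the [driver] stanza for a list of buckets.
# BUCKETS, REGION, and the output path are illustrative assumptions.
BUCKETS="analytics-raw analytics-curated"
REGION="us-east-1"

{
  echo "[driver] {"
  for b in $BUCKETS; do
    # Route this bucket's S3A traffic through the local sidekick proxy
    echo "  \"spark.hadoop.fs.s3a.bucket.${b}.endpoint\" = \"http://localhost:7075\""
    echo "  \"spark.hadoop.fs.s3a.bucket.${b}.endpoint.region\" = ${REGION}"
  done
  echo "}"
} > sidekick-spark-conf.conf

cat sidekick-spark-conf.conf
```

Each bucket listed in `BUCKETS` gets an `.endpoint` line pointing at sidekick on port 7075 and a matching `.endpoint.region` line, mirroring the pairs shown in the init script above.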