docs: add instructions for generic spark (#103)

project-n-oss · May 24, 2024 · 6a2ed90
1 parent efa7cc5
commit 6a2ed90
Showing 2 changed files with 83 additions and 0 deletions.
diff --git a/integrations/spark/README.md b/integrations/spark/README.md
@@ -0,0 +1,33 @@
+# Spark integration
+
+You can integrate sidekick with Apache Spark by adding [sidekick_service_init.sh](./sidekick_service_init.sh) as an [init script]() on your Spark clusters. Configure this init script to run on all Spark nodes.
+
+Briefly, the init script does the following:
+- Installs sidekick on the Spark node
+- Configures the S3 endpoint (for specific buckets) to point to sidekick
+- Sets up a systemd service to run sidekick as a daemon
+
+## Configuration
+
+To get started, download the sample [init script]() and make the following changes.
+
+1. Add the endpoint and region of each bucket that will be accessed via sidekick
+   to this section of the init script.
+
+```bash
+cat > /opt/spark/conf/style-path-spark-conf.conf <<EOL
+[driver] {
+  "spark.hadoop.fs.s3a.bucket.<MY_BUCKET1>.endpoint" = "http://localhost:7075"
+  "spark.hadoop.fs.s3a.bucket.<MY_BUCKET1>.endpoint.region" = <AWS_REGION_OF_BUCKET1>
+  "spark.hadoop.fs.s3a.bucket.<MY_BUCKET2>.endpoint" = "http://localhost:7075"
+  "spark.hadoop.fs.s3a.bucket.<MY_BUCKET2>.endpoint.region" = <AWS_REGION_OF_BUCKET2>
+}
+EOL
+```
+
+2. Define the environment variables by adding these lines to the
+   [sidekick service init script](./sidekick_service_init.sh):
+
+```bash
+export SIDEKICK_APP_CLOUDPLATFORM="<AWS|GCP>"
+```
diff --git a/integrations/spark/sidekick_service_init.sh b/integrations/spark/sidekick_service_init.sh
@@ -0,0 +1,50 @@
+#!/bin/bash
+set -ex
+
+# Check if sidekick bin is present, if not download it
+SIDEKICK_BIN=/usr/bin/sidekick
+if [ -f "$SIDEKICK_BIN" ]; then
+    echo "$SIDEKICK_BIN already installed."
+else 
+    wget https://github.com/project-n-oss/sidekick/releases/latest/download/sidekick-linux-amd64.tar.gz
+    tar -xzvf sidekick-linux-amd64.tar.gz -C /usr/bin
+fi
+chmod +x "$SIDEKICK_BIN"
+"$SIDEKICK_BIN" --help > /dev/null
+
+cat > /opt/spark/conf/style-path-spark-conf.conf <<EOL
+[driver] {
+  "spark.hadoop.fs.s3a.path.style.access" = "true"
+  "spark.hadoop.fs.s3a.bucket.<MY_BUCKET1>.endpoint" = "http://localhost:7075"
+  "spark.hadoop.fs.s3a.bucket.<MY_BUCKET1>.endpoint.region" = <AWS_REGION_OF_BUCKET1>
+  "spark.hadoop.fs.s3a.bucket.<MY_BUCKET2>.endpoint" = "http://localhost:7075"
+  "spark.hadoop.fs.s3a.bucket.<MY_BUCKET2>.endpoint.region" = <AWS_REGION_OF_BUCKET2>
+}
+EOL
+
+# Add any spark or env config here:
+# --------------------------------------------------
+
+# --------------------------------------------------
+
+export SIDEKICK_APP_CLOUDPLATFORM="<AWS|GCP>"
+
+# Create service file for the sidekick process
+SERVICE_FILE="/etc/systemd/system/sidekick.service"
+touch "$SERVICE_FILE"
+
+cat > "$SERVICE_FILE" << EOF
+[Unit]
+Description=Sidekick service file
+[Service]
+Environment=SIDEKICK_APP_CLOUDPLATFORM=$SIDEKICK_APP_CLOUDPLATFORM
+ExecStart=$SIDEKICK_BIN serve -p 7075
+Restart=always
+[Install]
+WantedBy=multi-user.target
+EOF
+
+systemctl daemon-reload
+systemctl enable sidekick
+systemctl start sidekick
+systemctl status sidekick
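
The per-bucket entries in the Spark conf heredoc above have to be edited by hand for each bucket. As a rough sketch of how that step could be scripted instead, the helper below prints the same block from a list of `bucket:region` pairs. It is not part of this commit: the function names `emit_bucket_conf` and `render_spark_conf` are hypothetical, and quoting the region value is an assumption (the script leaves the region placeholder unquoted).

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the commit: renders the Spark conf block
# that the init script writes, from "bucket:region" arguments.

# Port matches `serve -p 7075` in the service file above.
SIDEKICK_ENDPOINT="http://localhost:7075"

# Print the endpoint and region entries for each "bucket:region" pair.
emit_bucket_conf() {
  local pair bucket region
  for pair in "$@"; do
    bucket="${pair%%:*}"   # text before the first colon
    region="${pair#*:}"    # text after the first colon
    printf '  "spark.hadoop.fs.s3a.bucket.%s.endpoint" = "%s"\n' "$bucket" "$SIDEKICK_ENDPOINT"
    # Assumption: the region is written as a quoted string.
    printf '  "spark.hadoop.fs.s3a.bucket.%s.endpoint.region" = "%s"\n' "$bucket" "$region"
  done
}

# Print the whole [driver] block, including the path-style setting from the script.
render_spark_conf() {
  printf '[driver] {\n'
  printf '  "spark.hadoop.fs.s3a.path.style.access" = "true"\n'
  emit_bucket_conf "$@"
  printf '}\n'
}
```

With this in place, the heredoc could be replaced by something like `render_spark_conf my-bucket:us-east-1 other-bucket:eu-west-1 > /opt/spark/conf/style-path-spark-conf.conf`, where the bucket names and regions are placeholders for your own values.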