docs: add instructions for generic spark (#103)
cv-projectn committed May 24, 2024
1 parent efa7cc5 commit 6a2ed90
Showing 2 changed files with 83 additions and 0 deletions.
33 changes: 33 additions & 0 deletions integrations/spark/README.md
@@ -0,0 +1,33 @@
# Spark integration

You can integrate sidekick with Apache Spark by adding [sidekick_service_init.sh](./sidekick_service_init.sh) as an [init script]() on your Spark clusters. Configure the init script to run on every Spark node.

Briefly, the init script does the following:
- Installs sidekick on the Spark node
- Configures the S3 endpoint (for specific buckets) to point to sidekick
- Sets up a systemd service to run sidekick as a daemon
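
The three effects listed above can be spot-checked on a node once the init script has run. The following is a minimal sketch, not part of sidekick: `check_sidekick_node` is a hypothetical helper, and its default paths are assumed to be the ones used in `sidekick_service_init.sh`.

```bash
#!/bin/bash
# Sketch only: check_sidekick_node is a hypothetical helper, not part of sidekick.
# Default paths are the ones used in sidekick_service_init.sh; pass others if yours differ.
check_sidekick_node() {
  local bin="${1:-/usr/bin/sidekick}"
  local conf="${2:-/opt/spark/conf/style-path-spark-conf.conf}"
  local unit="${3:-/etc/systemd/system/sidekick.service}"
  local status=0

  # 1. sidekick is installed and executable
  if [ -x "$bin" ]; then
    echo "ok: $bin is executable"
  else
    echo "missing: $bin"
    status=1
  fi

  # 2. at least one per-bucket endpoint override is configured
  if [ -f "$conf" ] && grep -q 'fs\.s3a\.bucket\..*\.endpoint' "$conf"; then
    echo "ok: $conf has a bucket endpoint override"
  else
    echo "missing: bucket endpoint override in $conf"
    status=1
  fi

  # 3. the systemd unit file has been written
  if [ -f "$unit" ]; then
    echo "ok: $unit exists"
  else
    echo "missing: $unit"
    status=1
  fi

  return "$status"
}
```

Calling `check_sidekick_node` with no arguments checks the default paths and returns non-zero if any check fails. It only inspects files; `systemctl is-active sidekick` reports whether the daemon is actually running.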

## Configuration

To get started, download the sample [init script]() and make the following changes.

1. Add the endpoint and region of each bucket that will be accessed through sidekick to this section of the init script:

```bash
cat > /opt/spark/conf/style-path-spark-conf.conf <<EOL
[driver] {
"spark.hadoop.fs.s3a.bucket.<MY_BUCKET1>.endpoint" = "http://localhost:7075"
"spark.hadoop.fs.s3a.bucket.<MY_BUCKET1>.endpoint.region" = <AWS_REGION_OF_BUCKET1>
"spark.hadoop.fs.s3a.bucket.<MY_BUCKET2>.endpoint" = "http://localhost:7075"
"spark.hadoop.fs.s3a.bucket.<MY_BUCKET2>.endpoint.region" = <AWS_REGION_OF_BUCKET2>
}
EOL
```

2. Define the environment variables by adding these lines to the [sidekick service init script](./sidekick_service_init.sh):

```bash
export SIDEKICK_APP_CLOUDPLATFORM="<AWS|GCP>"
```
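
The two changes above can also be generated and checked by a small script rather than edited by hand. This is a hedged sketch: `sidekick_bucket_conf` and `validate_cloud_platform` are hypothetical helper names, the bucket names and regions are made up, and writing the region as a quoted string and accepting only the uppercase values `AWS` and `GCP` are assumptions.

```bash
#!/bin/bash
# Sketch only: both functions are hypothetical helpers, not part of sidekick.

# Print the per-bucket S3A overrides for each "bucket:region" argument.
# $1 is the sidekick endpoint; the remaining arguments are bucket:region pairs.
sidekick_bucket_conf() {
  local endpoint="$1"
  shift
  local pair bucket region
  for pair in "$@"; do
    bucket="${pair%%:*}"   # text before the first colon
    region="${pair#*:}"    # text after the first colon
    printf '"spark.hadoop.fs.s3a.bucket.%s.endpoint" = "%s"\n' "$bucket" "$endpoint"
    # Assumption: the region value is written as a quoted string.
    printf '"spark.hadoop.fs.s3a.bucket.%s.endpoint.region" = "%s"\n' "$bucket" "$region"
  done
}

# Succeed only for the two platform values named in this README.
validate_cloud_platform() {
  case "$1" in
    AWS|GCP) return 0 ;;
    *)
      echo "SIDEKICK_APP_CLOUDPLATFORM must be AWS or GCP, got: '$1'" >&2
      return 1
      ;;
  esac
}

# Example with made-up bucket names and regions.
sidekick_bucket_conf "http://localhost:7075" "my-bucket-1:us-east-1" "my-bucket-2:us-west-2"

SIDEKICK_APP_CLOUDPLATFORM="AWS"
validate_cloud_platform "$SIDEKICK_APP_CLOUDPLATFORM" && export SIDEKICK_APP_CLOUDPLATFORM
```

The output of `sidekick_bucket_conf` is meant to be pasted between the `[driver] {` and `}` lines of the config section. Validating the platform value first catches a leftover `<AWS|GCP>` placeholder before it reaches the service file.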
50 changes: 50 additions & 0 deletions integrations/spark/sidekick_service_init.sh
@@ -0,0 +1,50 @@
#!/bin/bash
set -ex

# Download the sidekick binary if it is not already installed
SIDEKICK_BIN=/usr/bin/sidekick
if [ -f "$SIDEKICK_BIN" ]; then
    echo "$SIDEKICK_BIN already installed."
else
    wget https://github.com/project-n-oss/sidekick/releases/latest/download/sidekick-linux-amd64.tar.gz
    tar -xzvf sidekick-linux-amd64.tar.gz -C /usr/bin
fi
chmod +x "$SIDEKICK_BIN"
# Smoke test: with `set -e`, the script stops here if the binary cannot run
"$SIDEKICK_BIN" --help > /dev/null

cat > /opt/spark/conf/style-path-spark-conf.conf <<EOL
[driver] {
"spark.hadoop.fs.s3a.path.style.access" = "true"
"spark.hadoop.fs.s3a.bucket.<MY_BUCKET1>.endpoint" = "http://localhost:7075"
"spark.hadoop.fs.s3a.bucket.<MY_BUCKET1>.endpoint.region" = <AWS_REGION_OF_BUCKET1>
"spark.hadoop.fs.s3a.bucket.<MY_BUCKET2>.endpoint" = "http://localhost:7075"
"spark.hadoop.fs.s3a.bucket.<MY_BUCKET2>.endpoint.region" = <AWS_REGION_OF_BUCKET2>
}
EOL

# Add any spark or env config here:
# --------------------------------------------------

# --------------------------------------------------

export SIDEKICK_APP_CLOUDPLATFORM="<AWS|GCP>"

# Create service file for the sidekick process
SERVICE_FILE="/etc/systemd/system/sidekick.service"
touch $SERVICE_FILE

cat > $SERVICE_FILE << EOF
[Unit]
Description=Sidekick service file
[Service]
Environment=SIDEKICK_APP_CLOUDPLATFORM=$SIDEKICK_APP_CLOUDPLATFORM
ExecStart=$SIDEKICK_BIN serve -p 7075
Restart=always
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable sidekick
systemctl start sidekick
systemctl status sidekick --no-pager
