# Big Data or graph frameworks on PBS

## Introduction

This project contains tooling and documentation for launching [Spark](http://spark.apache.org), [Flink](https://flink.apache.org/), or [Dask](https://distributed.readthedocs.io/) on a [PBS](http://pbspro.org/) cluster that has access to a shared file system.

The mechanism is always the same:

  1. `qsub` a PBS job that reserves several chunks using the `select` option
  2. Use the `pbsdsh` command to start a scheduler and workers on the reserved chunks
  3. Either wait for a `qdel` or submit a given application to the newly started cluster.
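The pattern above can be sketched as a single PBS job script. This is only an illustrative sketch using Dask, not the provided tooling: the chunk count, resource sizes, the scheduler-file path, and the application name `my_app.py` are all assumptions made for the example.

```shell
#!/bin/bash
#PBS -N dask-cluster-sketch
#PBS -l select=4:ncpus=4:mem=8G
#PBS -l walltime=01:00:00

# Hypothetical sketch of the qsub + pbsdsh pattern; the real launch
# scripts live in the dask/ directory of this repository.
SCHEDULER_FILE=$PBS_O_WORKDIR/scheduler.json

# Step 2a: start the Dask scheduler on the first reserved chunk (vnode 0).
pbsdsh -n 0 -- dask-scheduler --scheduler-file "$SCHEDULER_FILE" &
sleep 10  # crude wait for the scheduler file to appear on the shared FS

# Step 2b: start one worker per reserved chunk (vnode 0 also gets a worker
# here; the provided tooling may partition chunks differently).
pbsdsh -- dask-worker --scheduler-file "$SCHEDULER_FILE" &

# Step 3: run the application against the newly started cluster; when it
# finishes, the job ends and PBS tears everything down.
python my_app.py --scheduler-file "$SCHEDULER_FILE"
```

Using a scheduler file on the shared file system avoids having to discover the scheduler's hostname and port by hand.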

See how it works and how to use the provided tooling in each framework directory: `dask/`, `flink/`, and `spark/`.

## Quick example

These tools are designed to be easy to use. Once downloaded onto your cluster, you can start a Spark application as easily as this:

```bash
#!/bin/bash
#PBS -N spark-cluster-path
#PBS -l select=9:ncpus=4:mem=20G
#PBS -l walltime=01:00:00

# Qsub template for CNES HAL
# Scheduler: PBS

# Environment
export JAVA_HOME=/work/logiciels/rhall/jdk/1.8.0_112
export SPARK_HOME=/work/logiciels/rhall/spark/2.2.1
export PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH

$PBS_O_WORKDIR/pbs-launch-spark -n 4 -m "18000M" $SPARK_HOME/examples/src/main/python/wordcount.py $SPARK_HOME/conf/
```
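Assuming the script above is saved as `submit-spark.sh` (a filename chosen here purely for illustration), it is submitted and monitored like any other PBS job:

```shell
# Submit the job; qsub prints the job id of the newly created cluster job.
qsub submit-spark.sh

# Monitor your jobs until the cluster job reaches the running (R) state.
qstat -u "$USER"
```

Standard PBS output and error files for the job land in the submission directory once the job completes.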

## Contact

This project has been tested on the HPC cluster of CNES (Centre National d'Études Spatiales, the French Space Agency). Feel free to open an issue to ask for a correction or for help; we will be glad to assist.
