From 51423fd9accf945fab3857b67e13541f4a3dc591 Mon Sep 17 00:00:00 2001 From: Robin Andersson Date: Mon, 1 Sep 2025 15:57:06 +0200 Subject: [PATCH] [HWORKS-2322] Add defaultArgs field to job docs and clarify nested fields --- .github/workflows/mkdocs-release.yml | 4 ++ .../user_guides/projects/jobs/notebook_job.md | 36 +++++++------ docs/user_guides/projects/jobs/pyspark_job.md | 48 ++++++++++-------- docs/user_guides/projects/jobs/python_job.md | 33 ++++++------ docs/user_guides/projects/jobs/ray_job.md | 27 +++++++++- docs/user_guides/projects/jobs/spark_job.md | 50 +++++++++++-------- 6 files changed, 126 insertions(+), 72 deletions(-) diff --git a/.github/workflows/mkdocs-release.yml b/.github/workflows/mkdocs-release.yml index b116f3d54..6ac4b568e 100644 --- a/.github/workflows/mkdocs-release.yml +++ b/.github/workflows/mkdocs-release.yml @@ -4,6 +4,10 @@ on: push: branches: [branch-*\.*] +concurrency: + group: ${{ github.workflow }} + cancel-in-progress: false + jobs: publish-release: runs-on: ubuntu-latest diff --git a/docs/user_guides/projects/jobs/notebook_job.md b/docs/user_guides/projects/jobs/notebook_job.md index 7c724bcb7..02b9dc73f 100644 --- a/docs/user_guides/projects/jobs/notebook_job.md +++ b/docs/user_guides/projects/jobs/notebook_job.md @@ -6,13 +6,18 @@ description: Documentation on how to configure and execute a Jupyter Notebook jo ## Introduction +This guide describes how to configure a job to execute a Jupyter Notebook (.ipynb) and visualize the evaluated notebook. + All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service: - Python - Apache Spark +- Ray Launching a job of any type is very similar process, what mostly differs between job types is -the various configuration parameters each job type comes with. After following this guide you will be able to create a Jupyter Notebook job. +the various configuration parameters each job type comes with. Hopsworks support scheduling jobs to run on a regular basis, +e.g backfilling a Feature Group by running your feature engineering pipeline nightly. Scheduling can be done both through the UI and the python API, +checkout [our Scheduling guide](schedule_job.md). ## UI @@ -167,19 +172,22 @@ execution = job.run(args='-p a 2 -p b 5', await_termination=True) ``` ## Configuration -The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")` - -| Field | Type | Description | Default | -|-------------------------|----------------|------------------------------------------------------|--------------------------| -| `type` | string | Type of the job configuration | `"pythonJobConfiguration"` | -| `appPath` | string | Project path to notebook (e.g `Resources/foo.ipynb`) | `null` | -| `environmentName` | string | Name of the python environment | `"pandas-training-pipeline"` | -| `resourceConfig.cores` | number (float) | Number of CPU cores to be allocated | `1.0` | -| `resourceConfig.memory` | number (int) | Number of MBs to be allocated | `2048` | -| `resourceConfig.gpus` | number (int) | Number of GPUs to be allocated | `0` | -| `logRedirection` | boolean | Whether logs are redirected | `true` | -| `jobType` | string | Type of job | `"PYTHON"` | -| `files` | string | HDFS path(s) to files to be provided to the Notebook Job. Multiple files can be included in a single string, separated by commas.
Example: `"hdfs:///Project//Resources/file1.py,hdfs:///Project//Resources/file2.txt"` | `null` | +The following table describes the job configuration parameters for a PYTHON job. + +`conf = jobs_api.get_configuration("PYTHON")` + +| Field | Type | Description | Default | +|-------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------| +| `conf['type']` | string | Type of the job configuration | `"pythonJobConfiguration"` | +| `conf['appPath']` | string | Project relative path to notebook (e.g., `Resources/foo.ipynb`) | `null` | +| `conf['defaultArgs']` | string | Arguments to pass to the notebook.
Will be overridden if arguments are passed explicitly via `Job.run(args="...")`.
Must conform to Papermill format `-p arg1 val1` | `null` | +| `conf['environmentName']` | string | Name of the project Python environment to use | `"pandas-training-pipeline"` | +| `conf['resourceConfig']['cores']` | float | Number of CPU cores to be allocated | `1.0` | +| `conf['resourceConfig']['memory']` | int | Number of MBs to be allocated | `2048` | +| `conf['resourceConfig']['gpus']` | int | Number of GPUs to be allocated | `0` | +| `conf['logRedirection']` | boolean | Whether logs are redirected | `true` | +| `conf['jobType']` | string | Type of job | `"PYTHON"` | +| `conf['files']` | string | Comma-separated string of HDFS path(s) to files to be made available to the application. Example: `hdfs:///Project//Resources/file1.py,...` | `null` | ## Accessing project data diff --git a/docs/user_guides/projects/jobs/pyspark_job.md b/docs/user_guides/projects/jobs/pyspark_job.md index e329312f3..ecad2e07f 100644 --- a/docs/user_guides/projects/jobs/pyspark_job.md +++ b/docs/user_guides/projects/jobs/pyspark_job.md @@ -6,10 +6,13 @@ description: Documentation on how to configure and execute a PySpark job on Hops ## Introduction +This guide will describe how to configure a job to execute a pyspark script inside the cluster. + All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service: - Python - Apache Spark +- Ray Launching a job of any type is very similar process, what mostly differs between job types is the various configuration parameters each job type comes with. Hopsworks clusters support scheduling to run jobs on a regular basis, @@ -212,27 +215,30 @@ print(f_err.read()) ``` ## Configuration -The following table describes the JSON payload returned by `jobs_api.get_configuration("PYSPARK")` - -| Field | Type | Description | Default | -| ------------------------------------------ | -------------- |-----------------------------------------------------| -------------------------- | -| `type` | string | Type of the job configuration | `"sparkJobConfiguration"` | -| `appPath` | string | Project path to script (e.g `Resources/foo.py`) | `null` | -| `environmentName` | string | Name of the project spark environment | `"spark-feature-pipeline"` | -| `spark.driver.cores` | number (float) | Number of CPU cores allocated for the driver | `1.0` | -| `spark.driver.memory` | number (int) | Memory allocated for the driver (in MB) | `2048` | -| `spark.executor.instances` | number (int) | Number of executor instances | `1` | -| `spark.executor.cores` | number (float) | Number of CPU cores per executor | `1.0` | -| `spark.executor.memory` | number (int) | Memory allocated per executor (in MB) | `4096` | -| `spark.dynamicAllocation.enabled` | boolean | Enable dynamic allocation of executors | `true` | -| `spark.dynamicAllocation.minExecutors` | number (int) | Minimum number of executors with dynamic allocation | `1` | -| `spark.dynamicAllocation.maxExecutors` | number (int) | Maximum number of executors with dynamic allocation | `2` | -| `spark.dynamicAllocation.initialExecutors` | number (int) | Initial number of executors with dynamic allocation | `1` | -| `spark.blacklist.enabled` | boolean | Whether executor/node blacklisting is enabled | `false` | -| `files` | string | HDFS path(s) to files to be provided to the Spark application. Multiple files can be included in a single string, separated by commas.
Example: `"hdfs:///Project//Resources/file1.py,hdfs:///Project//Resources/file2.txt"` | `null` | -| `pyFiles` | string | HDFS path(s) to Python files to be provided to the Spark application. These will be added to the `PYTHONPATH` so they can be imported as modules. Multiple files can be included in a single string, separated by commas.
Example: `"hdfs:///Project//Resources/module1.py,hdfs:///Project//Resources/module2.py"` | `null` | -| `jars` | string | HDFS path(s) to JAR files to be provided to the Spark application. These will be added to the classpath. Multiple files can be included in a single string, separated by commas.
Example: `"hdfs:///Project//Resources/lib1.jar,hdfs:///Project//Resources/lib2.jar"` | `null` | -| `archives` | string | HDFS path(s) to archive files to be provided to the Spark application. Multiple files can be included in a single string, separated by commas.
Example: `"hdfs:///Project//Resources/archive1.zip,hdfs:///Project//Resources/archive2.tar.gz"` | `null` | +The following table describes the job configuration parameters for a PYSPARK job. + +`conf = jobs_api.get_configuration("PYSPARK")` + +| Field | Type | Description | Default | +|----------------------------------------------------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------| +| `conf['type']` | string | Type of the job configuration | `"sparkJobConfiguration"` | +| `conf['appPath']` | string | Project path to spark program (e.g `Resources/foo.py`) | `null` | +| `conf['defaultArgs']` | string | Arguments to pass to the program. Will be overridden if arguments are passed explicitly via `Job.run(args="...")` | `null` | +| `conf['environmentName']` | string | Name of the project spark environment to use | `"spark-feature-pipeline"` | +| `conf['spark.driver.cores']` | float | Number of CPU cores allocated for the driver | `1.0` | +| `conf['spark.driver.memory']` | int | Memory allocated for the driver (in MB) | `2048` | +| `conf['spark.executor.instances']` | int | Number of executor instances | `1` | +| `conf['spark.executor.cores']` | float | Number of CPU cores per executor | `1.0` | +| `conf['spark.executor.memory']` | int | Memory allocated per executor (in MB) | `4096` | +| `conf['spark.dynamicAllocation.enabled']` | boolean | Enable dynamic allocation of executors | `true` | +| `conf['spark.dynamicAllocation.minExecutors']` | int | Minimum number of executors with dynamic allocation | `1` | +| `conf['spark.dynamicAllocation.maxExecutors']` | int | Maximum number of executors with dynamic allocation | `2` | +| `conf['spark.dynamicAllocation.initialExecutors']` | int | Initial number of executors with dynamic allocation | `1` | +| `conf['spark.blacklist.enabled']` | boolean | Whether executor/node blacklisting is enabled | `false` | +| `conf['files']` | string | Comma-separated string of HDFS path(s) to files to be made available to the application. Example: `hdfs:///Project//Resources/file1.py,...` | `null` | +| `conf['pyFiles']` | string | Comma-separated string of HDFS path(s) to python modules to be made available to the application. Example: `hdfs:///Project//Resources/file1.py,...` | `null` | +| `conf['jars']` | string | Comma-separated string of HDFS path(s) to jars to be included in CLASSPATH. Example: `hdfs:///Project//Resources/app.jar,...` | `null` | +| `conf['archives']` | string | Comma-separated string of HDFS path(s) to archives to be made available to the application. Example: `hdfs:///Project//Resources/archive.zip,...` | `null` | ## Accessing project data diff --git a/docs/user_guides/projects/jobs/python_job.md b/docs/user_guides/projects/jobs/python_job.md index 0fa2a9e9f..2e73b8395 100644 --- a/docs/user_guides/projects/jobs/python_job.md +++ b/docs/user_guides/projects/jobs/python_job.md @@ -6,10 +6,13 @@ description: Documentation on how to configure and execute a Python job on Hopsw ## Introduction +This guide will describe how to configure a job to execute a python script inside the cluster. + All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service: - Python - Apache Spark +- Ray Launching a job of any type is very similar process, what mostly differs between job types is the various configuration parameters each job type comes with. 
Hopsworks support scheduling jobs to run on a regular basis, @@ -165,20 +168,22 @@ print(f_err.read()) ``` ## Configuration -The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")` - -| Field | Type | Description | Default | -|-------------------------|----------------|-------------------------------------------------|--------------------------| -| `type` | string | Type of the job configuration | `"pythonJobConfiguration"` | -| `appPath` | string | Project path to script (e.g `Resources/foo.py`) | `null` | -| `environmentName` | string | Name of the project python environment | `"pandas-training-pipeline"` | -| `resourceConfig.cores` | number (float) | Number of CPU cores to be allocated | `1.0` | -| `resourceConfig.memory` | number (int) | Number of MBs to be allocated | `2048` | -| `resourceConfig.gpus` | number (int) | Number of GPUs to be allocated | `0` | -| `logRedirection` | boolean | Whether logs are redirected | `true` | -| `jobType` | string | Type of job | `"PYTHON"` | -| `files` | string | HDFS path(s) to files to be provided to the Python Job. Multiple files can be included in a single string, separated by commas.
Example: `"hdfs:///Project//Resources/file1.py,hdfs:///Project//Resources/file2.txt"` | `null` | - +The following table describes the job configuration parameters for a PYTHON job. + +`conf = jobs_api.get_configuration("PYTHON")` + +| Field | Type | Description | Default | +|-------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------|---------| +| `conf['type']` | string | Type of the job configuration | `"pythonJobConfiguration"` | +| `conf['appPath']` | string | Project relative path to script (e.g., `Resources/foo.py`) | `null` | +| `conf['defaultArgs']` | string | Arguments to pass to the script. Will be overridden if arguments are passed explicitly via `Job.run(args="...")` | `null` | +| `conf['environmentName']` | string | Name of the project Python environment to use | `"pandas-training-pipeline"` | +| `conf['resourceConfig']['cores']` | float | Number of CPU cores to be allocated | `1.0` | +| `conf['resourceConfig']['memory']` | int | Number of MBs to be allocated | `2048` | +| `conf['resourceConfig']['gpus']` | int | Number of GPUs to be allocated | `0` | +| `conf['logRedirection']` | boolean | Whether logs are redirected | `true` | +| `conf['jobType']` | string | Type of job | `"PYTHON"` | +| `conf['files']` | string | Comma-separated string of HDFS path(s) to files to be made available to the application. Example: `hdfs:///Project//Resources/file1.py,...` | `null` | ## Accessing project data !!! notice "Recommended approach if `/hopsfs` is mounted" diff --git a/docs/user_guides/projects/jobs/ray_job.md b/docs/user_guides/projects/jobs/ray_job.md index 1b79a6f49..34cd42bf3 100644 --- a/docs/user_guides/projects/jobs/ray_job.md +++ b/docs/user_guides/projects/jobs/ray_job.md @@ -6,11 +6,13 @@ description: Documentation on how to configure and execute a Ray job on Hopswork ## Introduction +This guide will describe how to configure a job to execute a ray program inside the cluster. + All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service: - Python - Apache Spark -- Ray +- Ray Launching a job of any type is very similar process, what mostly differs between job types is the various configuration parameters each job type comes with. Hopsworks support scheduling to run jobs on a regular basis, @@ -203,6 +205,29 @@ print(f_err.read()) ``` +## Configuration +The following table describes the job configuration parameters for a RAY job. + +`conf = jobs_api.get_configuration("RAY")` + +| Field | Type | Description | Default | +|----------------------------------------|-------|------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------| +| `conf['type']` | string | Type of the job configuration | `"rayJobConfiguration"` | +| `conf['appPath']` | string | Project relative path to script (e.g., `Resources/foo.py`) | `null` | +| `conf['defaultArgs']` | string | Arguments to pass to the script. 
Will be overridden if arguments are passed explicitly via `Job.run(args="...")` | `null` | +| `conf['environmentName']` | string | Name of the project Python environment to use | `"pandas-training-pipeline"` | +| `conf['driverCores']` | float | Number of CPU cores to be allocated for the Ray head process | `1.0` | +| `conf['driverMemory']` | int | Number of MBs to be allocated for the Ray head process | `2048` | +| `conf['driverGpus']` | int | Number of GPUs to be allocated for the Ray head process | `0` | +| `conf['workerCores']` | float | Number of CPU cores to be allocated for each Ray worker process | `1.0` | +| `conf['workerMemory']` | int | Number of MBs to be allocated for each Ray worker process | `4096` | +| `conf['workerGpus']` | int | Number of GPUs to be allocated for each Ray worker process | `0` | +| `conf['workerMinInstances']` | int | Minimum number of Ray workers | `1` | +| `conf['workerMaxInstances']` | int | Maximum number of Ray workers | `1` | +| `conf['jobType']` | string | Type of job | `"RAY"` | +| `conf['files']` | string | Comma-separated string of HDFS path(s) to files to be made available to the application. Example: `hdfs:///Project//Resources/file1.py,...` | `null` | + + ## Accessing project data The project datasets are mounted under `/home/yarnapp/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv` in your script. diff --git a/docs/user_guides/projects/jobs/spark_job.md b/docs/user_guides/projects/jobs/spark_job.md index 6345d5a65..a7c092126 100644 --- a/docs/user_guides/projects/jobs/spark_job.md +++ b/docs/user_guides/projects/jobs/spark_job.md @@ -6,10 +6,13 @@ description: Documentation on how to configure and execute a Spark (Scala) job o ## Introduction +This guide will describe how to configure a job to execute a spark program inside the cluster. + All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service: - Python - Apache Spark +- Ray Launching a job of any type is very similar process, what mostly differs between job types is the various configuration parameters each job type comes with. 
Hopsworks support scheduling to run jobs on a regular basis, @@ -213,28 +216,31 @@ print(f_err.read()) ``` ## Configuration -The following table describes the JSON payload returned by `jobs_api.get_configuration("SPARK")` - -| Field | Type | Description | Default | -|--------------------------------------------| -------------- |---------------------------------------------------------| -------------------------- | -| `type` | string | Type of the job configuration | `"sparkJobConfiguration"` | -| `appPath` | string | Project path to spark program (e.g `Resources/foo.jar`) | `null` | -| `mainClass` | string | Name of the main class to run (e.g `org.company.Main`) | `null` | -| `environmentName` | string | Name of the project spark environment | `"spark-feature-pipeline"` | -| `spark.driver.cores` | number (float) | Number of CPU cores allocated for the driver | `1.0` | -| `spark.driver.memory` | number (int) | Memory allocated for the driver (in MB) | `2048` | -| `spark.executor.instances` | number (int) | Number of executor instances | `1` | -| `spark.executor.cores` | number (float) | Number of CPU cores per executor | `1.0` | -| `spark.executor.memory` | number (int) | Memory allocated per executor (in MB) | `4096` | -| `spark.dynamicAllocation.enabled` | boolean | Enable dynamic allocation of executors | `true` | -| `spark.dynamicAllocation.minExecutors` | number (int) | Minimum number of executors with dynamic allocation | `1` | -| `spark.dynamicAllocation.maxExecutors` | number (int) | Maximum number of executors with dynamic allocation | `2` | -| `spark.dynamicAllocation.initialExecutors` | number (int) | Initial number of executors with dynamic allocation | `1` | -| `spark.blacklist.enabled` | boolean | Whether executor/node blacklisting is enabled | `false` -| `files` | string | HDFS path(s) to files to be provided to the Spark application. Multiple files can be included in a single string, separated by commas.
Example: `"hdfs:///Project//Resources/file1.py,hdfs:///Project//Resources/file2.txt"` | `null` | -| `pyFiles` | string | HDFS path(s) to Python files to be provided to the Spark application. These will be added to the `PYTHONPATH` so they can be imported as modules. Multiple files can be included in a single string, separated by commas.
Example: `"hdfs:///Project//Resources/module1.py,hdfs:///Project//Resources/module2.py"` | `null` | -| `jars` | string | HDFS path(s) to JAR files to be provided to the Spark application. These will be added to the classpath. Multiple files can be included in a single string, separated by commas.
Example: `"hdfs:///Project//Resources/lib1.jar,hdfs:///Project//Resources/lib2.jar"` | `null` | -| `archives` | string | HDFS path(s) to archive files to be provided to the Spark application. Multiple files can be included in a single string, separated by commas.
Example: `"hdfs:///Project//Resources/archive1.zip,hdfs:///Project//Resources/archive2.tar.gz"` | `null` | +The following table describes the job configuration parameters for a SPARK job. + +`conf = jobs_api.get_configuration("SPARK")` + +| Field | Type | Description | Default | +|----------------------------------------------------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------| +| `conf['type']` | string | Type of the job configuration | `"sparkJobConfiguration"` | +| `conf['appPath']` | string | Project path to spark program (e.g., `Resources/foo.jar`) | `null` | +| `conf['mainClass']` | string | Name of the main class to run (e.g., `org.company.Main`) | `null` | +| `conf['defaultArgs']` | string | Arguments to pass to the program. Will be overridden if arguments are passed explicitly via `Job.run(args="...")` | `null` | +| `conf['environmentName']` | string | Name of the project spark environment to use | `"spark-feature-pipeline"` | +| `conf['spark.driver.cores']` | float | Number of CPU cores allocated for the driver | `1.0` | +| `conf['spark.driver.memory']` | int | Memory allocated for the driver (in MB) | `2048` | +| `conf['spark.executor.instances']` | int | Number of executor instances | `1` | +| `conf['spark.executor.cores']` | float | Number of CPU cores per executor | `1.0` | +| `conf['spark.executor.memory']` | int | Memory allocated per executor (in MB) | `4096` | +| `conf['spark.dynamicAllocation.enabled']` | boolean | Enable dynamic allocation of executors | `true` | +| `conf['spark.dynamicAllocation.minExecutors']` | int | Minimum number of executors with dynamic allocation | `1` | +| `conf['spark.dynamicAllocation.maxExecutors']` | int | Maximum number of executors with dynamic allocation | `2` | +| `conf['spark.dynamicAllocation.initialExecutors']` | int | Initial number of executors with dynamic allocation | `1` | +| `conf['spark.blacklist.enabled']` | boolean | Whether executor/node blacklisting is enabled | `false` | +| `conf['files']` | string | Comma-separated string of HDFS path(s) to files to be made available to the application. Example: `hdfs:///Project//Resources/file1.py,...` | `null` | +| `conf['pyFiles']` | string | Comma-separated string of HDFS path(s) to Python modules to be made available to the application. Example: `hdfs:///Project//Resources/file1.py,...` | `null` | +| `conf['jars']` | string | Comma-separated string of HDFS path(s) to jars to be included in CLASSPATH. Example: `hdfs:///Project//Resources/app.jar,...` | `null` | +| `conf['archives']` | string | Comma-separated string of HDFS path(s) to archives to be made available to the application. Example: `hdfs:///Project//Resources/archive.zip,...` | `null` | ## Accessing project data