Parameters

Steve Burnett edited this page May 14, 2024 · 72 revisions

PBench uses JSON files to define each stage of a benchmark. These stage files contain JSON format parameters.

Use the JSON parameters defined here to write stage files. For more information about stage files, see Creating a Stage File.

abort_on_error

Format

"abort_on_error": Boolean

Definition

Set abort_on_error to true to abort all running and future stages of the benchmark when an error occurs. Any external processes started by shell_scripts that are still running are also aborted.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.
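As a minimal sketch of the inheritance rule, assume a parent stage sets abort_on_error to true. A hypothetical child stage file like the following would clear the inherited value; the query text is only a placeholder.

```json
{
  "description": "Hypothetical child stage: unsets the abort_on_error value inherited from its parent.",
  "abort_on_error": null,
  "queries": [
    "select 1"
  ]
}
```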

Example:

"abort_on_error": true

catalog

Format

"catalog": "catalog-name"

Definition

Set the catalog for queries in queries and query_files.

catalog and schema cannot be set to null.

This parameter and its value are inherited by child stages.

New values for catalog, schema, session_params, and timezone assigned in a stage are not applied to the Presto client unless a stage also sets start_on_new_client = true.
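A sketch of a stage that changes the catalog and schema, assuming its parent stage used different values. The catalog and schema names are borrowed from the examples on this page, and start_on_new_client is set so that the new values are applied to the Presto client.

```json
{
  "description": "Hypothetical stage: changes catalog and schema, and starts a new client so the change is applied.",
  "catalog": "iceberg",
  "schema": "sf1",
  "start_on_new_client": true,
  "queries": [
    "select 'query 1'"
  ]
}
```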

Example:

"catalog": "iceberg"

cold_runs

Format

"cold_runs": integer

Definition

The number of cold runs to perform to populate the cache. The default is 1.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

Example:

"cold_runs": 1

description

Format

"description": "Description of the JSON file."

Definition

JSON does not support comments, so comments in PBench are formatted as data pairs. For more information, see Comments Inside JSON - Commenting in a JSON File.

Begin every stage JSON file with a description.

PBench ignores description: it is not read, processed, or output in any way by PBench.

Example:

"description": "Specifies the catalog and the schema for TPC-DS Iceberg scale factor 1 TB partitioned."

expected_row_counts

Format

"expected_row_counts": {
  "file1": [
    1,
    1
  ],
  "file2": [
    1,
    1
  ]
}

Definition

A map from [catalog.schema] to arrays of integers that are expected row counts for the queries that are run under different schemas.

The key of this map can be one of the following:

  • [schema] - match the schema name regardless of the catalog they are under

  • [catalog.schema] - match both catalog and schema

  • [regular expression] - used to match [catalog.schema]

List the expected row counts for queries first, then list the expected row counts for the queries in each query file listed in query_files.

Example:

"expected_row_counts": {
  "tpcds_sf10000_": [
    100,
    100
  ],
  "tpcds_sf1000_": [
    100,
    100
  ]
}

Use regular expressions to match multiple [catalog.schema] pairs. In this example, .*\\.tpcds_sf10000 matches hive.tpcds_sf10000 and iceberg.tpcds_sf10000.

"expected_row_counts": {
  ".*\\.tpcds_sf10000": [
    100,
    100
  ],
  "tpcds_sf1000_": [
    100,
    100
  ]
}
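The ordering rule for row counts can be sketched as follows. This assumes a stage with two inline queries and one hypothetical query file, queries/query_01.sql, which is assumed to contain two queries. The counts for the query file are made-up values.

```json
{
  "description": "Hypothetical stage: the first two counts belong to queries, the last two to queries/query_01.sql.",
  "catalog": "hive",
  "schema": "tpcds_sf10000",
  "queries": [
    "select 'query 1'",
    "select 'query 2'"
  ],
  "query_files": [
    "queries/query_01.sql"
  ],
  "expected_row_counts": {
    "tpcds_sf10000": [
      1,
      1,
      100,
      100
    ]
  }
}
```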

next

Format

"next": [
  "stage_2.json",
  "stage_3.json"
]

Definition

Specifies one or more child stages of the current stage. Child stages start after the parent stage finishes, and in parallel with each other. Child stages inherit some parameters from the parent stage if those parameters are not explicitly set in the child stage.

Example:

"next": [
  "stage_2.json",
  "stage_3.json"
]
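For context, a complete parent stage file might look like the sketch below. The parameter values are illustrative; catalog, schema, cold_runs, and warm_runs are all documented on this page as inherited, so the two child stages would pick them up unless they set their own.

```json
{
  "description": "Hypothetical parent stage: sets values that stage_2.json and stage_3.json inherit unless they set their own.",
  "catalog": "iceberg",
  "schema": "sf1",
  "cold_runs": 1,
  "warm_runs": 2,
  "next": [
    "stage_2.json",
    "stage_3.json"
  ]
}
```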

queries

Format

"queries": [
  "query_string"
]

Definition

Run the SQL query in query_string. If a query is long or complex, or there are several queries, consider saving the queries in a SQL file to be run using query_files.

Do not end the SQL query in query_string with a semi-colon.

SQL queries in queries are executed first, then SQL queries in files listed in query_files are read and executed, then external commands in shell_scripts are run.

Example:

"queries": [
  "select 'query 1'"
]
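The execution order described above can be illustrated with a stage that uses all three parameters. The file name and shell command are hypothetical.

```json
{
  "description": "Hypothetical stage: runs the inline query, then queries/query_01.sql, then the shell command.",
  "queries": [
    "select 'query 1'"
  ],
  "query_files": [
    "queries/query_01.sql"
  ],
  "shell_scripts": [
    "echo \"queries finished\""
  ]
}
```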

query_files

Format

"query_files": [
  "file1",
  "file2"
]

Definition

One or more files containing SQL queries.

SQL queries in queries are executed first, then SQL queries in files listed in query_files are read and executed, then external commands in shell_scripts are run.

A relative file path in the query_files array is evaluated based on the location of the stage JSON file.

Example:

"query_files": [
  "queries/query_01.sql",
  "queries/query_02.sql"
]

random_execution

Format

"random_execution": Boolean

Definition

When random_execution is set to false, PBench runs the queries in queries and query_files sequentially.

When random_execution is set to true, PBench selects from queries and query_files at random until the duration or number of selections set in randomly_execute_until is reached.

Each query file counts as 1 regardless of the number of queries in that query file. For example, a stage has:

  • 3 queries in queries
  • 2 query files in query_files, with 3 queries in each file

random_execution selects from 5 (3 queries + 2 query files), not 9 (3 queries + 3 queries in one file + 3 queries in the other file).

If a query file is selected, all of the queries in the file are executed, and the file counts as 1 selection towards the integer specified in randomly_execute_until.

Expected row counts are ignored when random_execution is set to true.

The default value of random_execution is false.
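A sketch of the counting example above as a stage file, assuming each of the two hypothetical query files contains three queries. PBench would pick from 5 items and stop after 10 selections.

```json
{
  "description": "Hypothetical stage: 5 selectable items (3 queries + 2 query files), run at random until 10 selections are made.",
  "queries": [
    "select 'query 1'",
    "select 'query 2'",
    "select 'query 3'"
  ],
  "query_files": [
    "queries/query_01.sql",
    "queries/query_02.sql"
  ],
  "random_execution": true,
  "randomly_execute_until": "10"
}
```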

Example:

"random_execution": true

randomly_execute_until

Format

"randomly_execute_until": "duration"

"randomly_execute_until": "integer"

Definition

Specify either

  • a duration like 15m, 1h, or 5d
  • an integer as the number of queries

to randomly run SQL queries.

Example:

"randomly_execute_until": "15m"

"randomly_execute_until": "700"

save_column_metadata

Format

"save_column_metadata": Boolean

Definition

Save the query's column metadata, taken from the columns field of Presto's query API response, to a JSON file.

Column metadata is saved once for a query on its first run, regardless of the number of cold_runs and warm_runs.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

The file name format uses the naming process as described in PBench Output File Name Format.

Example:

"save_column_metadata": true

save_json

Format

"save_json": Boolean

Definition

Set save_json to true to save a successful query's JSON after the query is executed. The file name is [query_name].json. For example, ds_power_query_59.json. This file is valuable when debugging a problem with a run of PBench.

A failed query also saves the error information for the query in a file named [query_name].error.json.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

Example:

"save_json": true

save_output

Format

"save_output": Boolean

Definition

Set save_output to true to save the query result to files in raw form.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

The file name format uses the naming process as described in PBench Output File Name Format.
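The three save parameters can be combined in one stage. The following is a sketch only; the query is a placeholder, and the files each parameter produces are as described in the respective sections.

```json
{
  "description": "Hypothetical stage: saves raw query output, query JSON, and column metadata for debugging.",
  "save_output": true,
  "save_json": true,
  "save_column_metadata": true,
  "queries": [
    "select 'query 1'"
  ]
}
```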

Example:

"save_output": true

schema

Format

"schema": "schema-name"

Definition

Set the schema for queries in queries and query_files.

catalog and schema cannot be set to null.

This parameter and its value are inherited by child stages.

New values for catalog, schema, session_params, and timezone assigned in a stage are not applied to the Presto client unless a stage also sets start_on_new_client = true.

Example:

"schema": "sf1"

session_params

Format

"session_params": {
  "session-property-name": "session-property-value"
}

Definition

Session properties passed to Presto.

This parameter and its value are inherited by child stages.

Set a session parameter to null to unset the value inherited from a parent stage.

New values for catalog, schema, session_params, and timezone assigned in a stage are not applied to the Presto client unless a stage also sets start_on_new_client = true.

Example:

"session_params": {
  "iceberg.hive_statistics_merge_strategy": "USE_NULLS_FRACTION_AND_NDV",
  "hive.pushdown_filter_enabled": false
}
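To unset a single inherited session property, a child stage can set just that property to null. The sketch below assumes the parent stage set hive.pushdown_filter_enabled; start_on_new_client is included so that the change is applied to the Presto client.

```json
{
  "description": "Hypothetical child stage: unsets one inherited session property and starts a new client so the change is applied.",
  "session_params": {
    "hive.pushdown_filter_enabled": null
  },
  "start_on_new_client": true
}
```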

shell_scripts

Format

"shell_scripts": [
  "shell_command"
]

Definition

Run a shell script after executing all SQL queries in queries and query_files.

SQL queries in queries are executed first, then SQL queries in files listed in query_files are read and executed, then external commands in shell_scripts are run.

Example:

"shell_scripts": [
  "echo \"this is a script\"",
  "python3 test_script.py",
  "ls -l"
]

start_on_new_client

Format

"start_on_new_client": Boolean

Definition

Set start_on_new_client to true to make this stage create a new client to execute itself. Each client has its own set of client information, tags, session properties, user credentials, and other parameters.

Example:

"start_on_new_client": true

timezone

Format

"timezone": "timezone_string"

Definition

The value of timezone_string can be any value in the Time Zone ID column of Time Zone ID.

The default value of timezone is the user's local timezone.

New values for catalog, schema, session_params, and timezone assigned in a stage are not applied to the Presto client unless a stage also sets start_on_new_client = true.

This parameter and its value are inherited by child stages.

Example:

"timezone": "America/Los_Angeles"

warm_runs

Format

"warm_runs": integer

Definition

The number of query runs to perform after the cold runs have completed. The default value is 0.

This parameter and its value are inherited by child stages. Set the parameter to null in a stage to unset the value inherited from a parent stage.

Example:

"warm_runs": 2