Tools - Add runner for sys info and update docs #532

Merged
merged 17 commits into from
Jun 29, 2023
32 changes: 32 additions & 0 deletions docs/cli.md
@@ -165,6 +165,26 @@ Execute GPT2 model benchmark in default configuration:
sb exec --config-override superbench.enable="['gpt2_models']"
```

### `sb node info`
Get system info on the local node.

```bash title="SB CLI"
sb node info [--output-dir]
```

#### Optional arguments

| Name | Default | Description |
|----------------|---------|-----------------------------------------------------------------------------|
| `--output-dir` | `None`  | Path to output directory; `outputs/{datetime}` is used if not specified.    |

#### Examples

Get system info on the local node and save it into the `outputs` dir:
```bash title="SB CLI"
sb node info --output-dir outputs
```
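
Once the command finishes, the result can be inspected programmatically. The following is a minimal sketch, assuming the output file is named `sys_info.json` (the name used by the handler in this PR) and contains a JSON object. The helper names `load_sys_info` and `top_level_sections` are hypothetical and not part of SuperBench.

```python
import json
from pathlib import Path


def load_sys_info(output_dir):
    """Load the JSON written by `sb node info --output-dir <output_dir>`.

    Assumption: the file is named sys_info.json, as in the handler of this PR.
    """
    path = Path(output_dir) / 'sys_info.json'
    with open(path) as f:
        return json.load(f)


def top_level_sections(info):
    """Return the sorted top-level keys of the loaded system info."""
    return sorted(info.keys())


# Usage (after running `sb node info --output-dir outputs`):
#   info = load_sys_info('outputs')
#   print(top_level_sections(info))
```

The sketch only lists the top-level keys because the exact schema of the collected info is not shown here.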

### `sb result diagnosis`

Filter the defective machines automatically from benchmarking results according to rules defined in rule file.
@@ -284,6 +304,7 @@ sb run [--config-file]
[--docker-image]
[--docker-password]
[--docker-username]
[--get-info]
[--host-file]
[--host-list]
[--host-password]
@@ -302,6 +323,7 @@ sb run [--config-file]
| `--docker-image` `-i` | `superbench/superbench` | Docker image URI. |
| `--docker-password` | `None` | Docker registry password if authentication is needed. |
| `--docker-username` | `None` | Docker registry username if authentication is needed. |
| `--get-info` `-g` | `False` | Collect node system info. |
| `--host-file` `-f` | `None` | Path to Ansible inventory host file. |
| `--host-list` `-l` | `None` | Comma separated host list. |
| `--host-password` | `None` | Host password or key passphrase if needed. |
@@ -335,6 +357,16 @@ sb run --no-docker --host-list localhost --config-override \
superbench.enable=kernel-launch superbench.env.SB_MICRO_PATH=/path/to/superbenchmark
```

Collect system info from all nodes in `./host.ini` in a distributed manner, without running benchmarks:
```bash title="SB CLI"
sb run --get-info --host-file ./host.ini -C superbench.enable=none
```

Collect system info from all nodes in `./host.ini` in a distributed manner, while also running benchmarks:
```bash title="SB CLI"
sb run --get-info --host-file ./host.ini
```

### `sb version`

Print the current SuperBench CLI version.
30 changes: 29 additions & 1 deletion docs/user-tutorial/system-config.md
@@ -4,6 +4,8 @@ id: system-config

# System Config Info

This tool automatically collects system information on the tested GPU nodes, covering the following hardware categories:

- [System](#system)
- [Memory](#memory)
- [CPU](#cpu)
@@ -12,7 +14,33 @@ id: system-config
- [Accelerator](#accelerator)
- [PCIe](#pcie)

## Parameter amd Details
## Usage

### Usage on local machine

1. [Install SuperBench](../getting-started/installation.mdx) on the local machine with root privileges.

2. Collect the system info by running `sb node info --output-dir ${output_dir}` with root privileges.

3. After the command finishes, the system info of the local node is saved as the JSON file `sys_info.json` under `${output_dir}`.

### Usage on multiple remote machines

1. [Install SuperBench](../getting-started/installation.mdx) on the local machine.

2. [Deploy SuperBench](../getting-started/run-superbench.md#deploy) onto the remote machines.

3. Prepare the host file of the tested GPU nodes using [Ansible Inventory](../getting-started/configuration.md#ansible-inventory) on the local machine.

4. Once SuperBench is installed and the host file is ready, collect the system info automatically with the `sb run --get-info` command. The full command reference is in [SuperBench CLI](../cli.md).

```
sb run --get-info -f host.ini --output-dir ${output_dir} -C superbench.enable=none
```

5. After the command finishes, the system info of each node is saved as the JSON file `sys_info.json` under `${output_dir}/nodes/${node_name}`.
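
The per-node files can then be gathered into a single structure for comparison. This is a minimal sketch, assuming the `${output_dir}/nodes/${node_name}` layout described above and the file name `sys_info.json` written by `sb node info` in this PR. The helpers `collect_node_infos` and `missing_nodes` are hypothetical names, not SuperBench APIs.

```python
import json
from pathlib import Path


def collect_node_infos(output_dir, filename='sys_info.json'):
    """Map node name -> parsed system info for every node directory holding the file.

    Assumptions: results live under <output_dir>/nodes/<node_name>/ and the file
    name matches the one written by `sb node info` (pass `filename` if it differs).
    """
    infos = {}
    nodes_dir = Path(output_dir) / 'nodes'
    for path in sorted(nodes_dir.glob('*/' + filename)):
        with open(path) as f:
            infos[path.parent.name] = json.load(f)
    return infos


def missing_nodes(expected_nodes, infos):
    """Return the expected node names that produced no system info file."""
    return sorted(set(expected_nodes) - set(infos))


# Usage:
#   infos = collect_node_infos('outputs/2023-06-29')  # hypothetical output dir
#   print(missing_nodes(['node0', 'node1'], infos))
```

Checking for missing nodes is useful because a node that failed during collection leaves no file behind.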

## Parameter and Details

### System

3 changes: 3 additions & 0 deletions superbench/cli/_commands.py
@@ -67,6 +67,9 @@ def load_arguments(self, command):
                nargs='+',
                help='Extra arguments to override config_file.'
            )
            ac.argument(
                'get_info', options_list=('--get-info', '-g'), action='store_true', help='Collect node system info.'
            )

        with ArgumentsContext(self, 'benchmark') as ac:
            ac.argument('name', options_list=('--name', '-n'), type=str, help='Benchmark name or regular expression.')
8 changes: 7 additions & 1 deletion superbench/cli/_handler.py
@@ -275,7 +275,8 @@ def run_command_handler(
output_dir=None,
private_key=None,
config_file=None,
config_override=None
config_override=None,
get_info=False,
):
"""Run the SuperBench benchmarks distributedly.

@@ -295,6 +296,7 @@ def run_command_handler(
config_file (str, optional): Path to SuperBench config file. Defaults to None.
config_override (str, optional): Extra arguments to override config_file,
following [Hydra syntax](https://hydra.cc/docs/advanced/override_grammar/basic). Defaults to None.
get_info (bool, optional): Collect node system info. Defaults to False.

Raises:
CLIError: If input arguments are invalid.
@@ -316,6 +318,10 @@ def run_command_handler(
)

    runner = SuperBenchRunner(sb_config, docker_config, ansible_config, sb_output_dir)

    runner.run()
    if get_info:
        runner.run_sys_info()

    if runner.get_failure_count() != 0:
        sys.exit(runner.get_failure_count())
4 changes: 4 additions & 0 deletions superbench/cli/_help.py
@@ -63,6 +63,10 @@
text: >
{cli_name} run --no-docker --host-list localhost
--config-override superbench.enable=kernel-launch superbench.env.SB_MICRO_PATH=/path/to/superbenchmark
- name: Collect system info on all nodes in ./host.ini without running benchmarks
text: {cli_name} run --get-info --host-file ./host.ini -C superbench.enable=none
- name: Collect system info on all nodes in ./host.ini while running benchmarks
text: {cli_name} run --get-info --host-file ./host.ini
""".format(cli_name=CLI_NAME)

helps['benchmark'] = """
13 changes: 12 additions & 1 deletion superbench/cli/_node_handler.py
@@ -3,17 +3,28 @@

"""SuperBench CLI node subgroup command handler."""

from pathlib import Path
import json

from superbench.tools import SystemInfo
from superbench.common.utils import create_sb_output_dir


def info_command_handler():
def info_command_handler(output_dir=None):
    """Get node hardware info.

    Args:
        output_dir (str): Output directory.

    Returns:
        dict: node info.
    """
    try:
        info = SystemInfo().get_all()
        output_dir = create_sb_output_dir(output_dir)
        output_dir_path = Path(output_dir)
        with open(output_dir_path / 'sys_info.json', 'w') as f:
            json.dump(info, f)
    except Exception as ex:
        raise RuntimeError('Failed to get node info.') from ex
    return info
18 changes: 18 additions & 0 deletions superbench/runner/runner.py
@@ -199,6 +199,24 @@ def deploy(self): # pragma: no cover
)
self._ansible_client.run(self._ansible_client.get_playbook_config('deploy.yaml', extravars=extravars))

    def run_sys_info(self):
        """Run the system info on all nodes."""
        self.check_env()

        logger.info('Runner is going to get node system info.')

        fcmd = "docker exec sb-workspace bash -c '{command}'"
        if self._docker_config.skip:
            fcmd = "bash -c 'cd $SB_WORKSPACE && {command}'"
        ansible_runner_config = self._ansible_client.get_shell_config(
            fcmd.format(command='sb node info --output-dir {output_dir}'.format(output_dir=self._sb_output_dir))
        )
        ansible_rc = self._ansible_client.run(ansible_runner_config, sudo=(not self._docker_config.skip))

        if ansible_rc != 0:
            self.cleanup()
        self.fetch_results()
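
The command string that `run_sys_info` hands to Ansible depends on whether Docker is skipped. The sketch below isolates that formatting step so the two resulting strings are easy to see. `build_sys_info_command` is a hypothetical standalone helper written for illustration; it is not part of the runner.

```python
def build_sys_info_command(output_dir, skip_docker):
    """Compose the shell command string, mirroring the formatting in run_sys_info.

    Hypothetical helper for illustration only; the runner builds the string inline.
    """
    # Default: execute inside the sb-workspace container.
    fcmd = "docker exec sb-workspace bash -c '{command}'"
    if skip_docker:
        # Docker skipped: execute directly on the host from $SB_WORKSPACE.
        fcmd = "bash -c 'cd $SB_WORKSPACE && {command}'"
    return fcmd.format(command='sb node info --output-dir {output_dir}'.format(output_dir=output_dir))


# Usage:
#   build_sys_info_command('outputs/run1', skip_docker=False)
#   build_sys_info_command('outputs/run1', skip_docker=True)
```

Because the inner command is wrapped in single quotes, an output directory containing a single quote would likely break the quoting; this sketch does not attempt to escape such paths.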

def check_env(self): # pragma: no cover
"""Check SuperBench environment."""
logger.info('Checking SuperBench environment.')