Merge pull request #18 from hyoklee/main
Use CTest for multi-node testing. (#17)
hyoklee committed Sep 21, 2022
2 parents 1ced9d2 + 526c3cf commit 00da48d
Showing 34 changed files with 606 additions and 29 deletions.
6 changes: 6 additions & 0 deletions CMakeLists.txt
@@ -7,6 +7,12 @@ set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "MinSizeR

include(${CMAKE_CURRENT_LIST_DIR}/cmake/FindLIBFABRIC.cmake)

enable_testing()
include(CTest)
include(CTestConfig.cmake)
add_subdirectory(hlog)
add_subdirectory(transfer)
add_subdirectory(scripts)



4 changes: 4 additions & 0 deletions CTestConfig.cmake
@@ -0,0 +1,4 @@
set(CTEST_PROJECT_NAME "fabtsuite")
set(CTEST_NIGHTLY_START_TIME "00:00:00 CST")
set(SLURM FALSE)
set(PBS FALSE)
1 change: 1 addition & 0 deletions README.md
@@ -1,3 +1,4 @@
![logo](fabtsuite_logo.png)
# fabtsuite: a libfabric test suite
[![check spelling](https://github.com/mercury-hpc/fabtsuite/actions/workflows/spell.yml/badge.svg)](https://github.com/mercury-hpc/fabtsuite/actions/workflows/spell.yml)
[![cmake fabtsuite](https://github.com/mercury-hpc/fabtsuite/actions/workflows/cmake.yml/badge.svg)](https://github.com/mercury-hpc/fabtsuite/actions/workflows/cmake.yml)
16 changes: 14 additions & 2 deletions doc/building_cmake.md
@@ -12,15 +12,27 @@ See [building.md](building.md) for the manual installation of libfabric.

mkdir build
cd build
cmake ..
cmake ..
make
make DESTDIR=/tmp/ install

Use `-DCMAKE_INSTALL_PREFIX=/my/local` to change the installation
prefix from `/usr/local/` to something else (e.g., `/my/local`).

## Test

Let's assume that everything is installed under `/tmp/usr/local/bin`
and you have *write* permission on your current working directory.

export PATH=/tmp/usr/local/bin:$PATH
fabtrun localhost
fabtrun

## CTest on HPC Systems

Set either `SLURM` or `PBS` to `TRUE` in [CTestConfig.cmake](../CTestConfig.cmake)
to run the tests on clusters.

Then, repeat the steps in **Build** and run `make test` after `make`.
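
The flag change can also be scripted. The following is a minimal sketch, assuming `CTestConfig.cmake` contains the lines `set(SLURM FALSE)` and `set(PBS FALSE)` exactly as added in this commit. The helper name `enable_scheduler` is hypothetical and not part of fabtsuite.

```shell
# enable_scheduler NAME FILE: rewrite "set(NAME FALSE)" to "set(NAME TRUE)"
# in FILE, where NAME is SLURM or PBS. Hypothetical helper for illustration.
enable_scheduler() {
    name=$1
    file=$2
    case "$name" in
        SLURM|PBS) ;;
        *) echo "usage: enable_scheduler SLURM|PBS FILE" >&2; return 2 ;;
    esac
    tmp=$(mktemp) || return 1
    # Portable stand-in for "sed -i": edit into a temporary file, then copy
    # the result back so the original file keeps its permissions.
    sed "s/^set($name FALSE)/set($name TRUE)/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, `enable_scheduler PBS CTestConfig.cmake` run from the source directory would be followed by `cmake ..`, `make`, and `make test` in the build directory.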



2 changes: 1 addition & 1 deletion doc/building_spack.md
@@ -6,4 +6,4 @@ Then, run the following commands to install and test.

spack install fabtsuite ^libfabric fabrics=rxm,tcp,udp,rxd
spack load fabtsuite
fabtrun localhost
fabtrun
41 changes: 41 additions & 0 deletions doc/dev_guide.md
@@ -0,0 +1,41 @@
# Developer's Guide

## Naming Conventions

There are 6 abbreviations (a.) for testing features:

| Feature | a. |
|----------------|----|
| FI_WAIT_FD | w |
| fi_cancel() | c |
| cross-job-comm | x |
| multi-thread | t |
| vectored-IO | v |
| MPI Interop. | m |

All multi-node scripts start with `fabt` and have a file extension such as `.sh`.

## Debugging with hlog


## Single-Node Test

[test/test.sh](../test/test.sh) is used to check whether the programs run
correctly on the local host.

## Multi-Node Test

The programs do not report elapsed time themselves, so wrapper shell scripts run them under `time -p`.
`nohup` is necessary for the server, which the job scripts start in the background.

## Adding a New CTest

### Local
1. Write a script that runs `fabtget` and `fabtput`.
2. Add the script to `transfer/CMakeTests.cmake`.
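
Step 1 above could look like the following sketch. It assumes, based on `scripts/fabtget.sh` and `scripts/fabtput.sh`, that `fabtget -a FILE` writes the server address to `FILE` and that `fabtput` takes that address as its argument. The names `run_pair`, `FABTGET`, and `FABTPUT` and the 30-second wait are hypothetical choices for illustration.

```shell
# run_pair ADDR_FILE: start the server, wait for its address file, then run
# the client. FABTGET and FABTPUT default to the installed programs and can
# be overridden, for example to try the script without a fabric.
run_pair() {
    addr_file=$1
    rm -f "$addr_file"
    "${FABTGET:-fabtget}" -a "$addr_file" &
    server_pid=$!
    # Wait up to about 30 seconds for the server to publish its address.
    tries=0
    while [ ! -s "$addr_file" ] && [ "$tries" -lt 30 ]; do
        sleep 1
        tries=$((tries + 1))
    done
    if [ ! -s "$addr_file" ]; then
        kill "$server_pid" 2>/dev/null
        return 1
    fi
    # The address is left unquoted, as in scripts/fabtput.sh.
    "${FABTPUT:-fabtput}" $(cat "$addr_file")
    put_status=$?
    wait "$server_pid"
    get_status=$?
    # Succeed only if both the client and the server exited with status 0.
    [ "$put_status" -eq 0 ] && [ "$get_status" -eq 0 ]
}
```

A test script would end with a call such as `run_pair "$HOME/fabtget_a.txt"`, so that its exit status tells CTest whether the pair succeeded.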

### Multi-node
1. Write a job script that runs `fabtget` and `fabtput` on different nodes.
2. Add the script to either `transfer/CMakeTests_s.cmake` or
`transfer/CMakeTests_p.cmake`, depending on whether it is a SLURM or PBS job.

31 changes: 31 additions & 0 deletions doc/faq.md
@@ -0,0 +1,31 @@
# FAQ

* GitHub Action fails with `Error: Process completed with exit code 145.` Why?

We don't know the reason yet. However, you can try to run the failed job
again and it will pass eventually.

* I installed fabtsuite using Spack but I get the `available libfabric version
< 1.13` error when I run programs.

Please try updating `LD_LIBRARY_PATH` and `PATH` as follows.
```
export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH
export PATH=$PREFIX/bin:$PATH
```
`PREFIX` is the directory where Spack installed the libfabric and fabtsuite packages.
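
When these exports are added to a startup file, repeated sourcing keeps lengthening both variables. The sketch below shows one way to avoid that. The helper `prepend_dir` and the default `/my/local` are assumptions for illustration, so replace `PREFIX` with the actual Spack install location.

```shell
# prepend_dir VAR DIR: put DIR at the front of the colon-separated variable
# named VAR unless it is already listed. Hypothetical helper for illustration.
prepend_dir() {
    _var=$1
    _dir=$2
    eval "_cur=\${$_var:-}"
    case ":$_cur:" in
        *":$_dir:"*) ;;   # already present, nothing to do
        *)
            # ${_cur:+:$_cur} avoids a trailing ':' when VAR was empty.
            eval "$_var=\$_dir\${_cur:+:\$_cur}"
            export "$_var"
            ;;
    esac
}

# PREFIX is an assumption; use the directory reported by your Spack setup.
PREFIX=${PREFIX:-/my/local}
prepend_dir LD_LIBRARY_PATH "$PREFIX/lib"
prepend_dir PATH "$PREFIX/bin"
```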

* What is the default timeout value for CTest?

It is 1500 seconds (= 25 minutes).
If a test fails due to timeout, you'll get an output like below:

```
4/8 Test #4: fi_cancel ........................ Passed 554.06 sec
Start 5: cross-job-comm
5/8 Test #5: cross-job-comm ...................***Timeout 1500.12 sec
Start 6: multi-thread
6/8 Test #6: multi-thread .....................***Timeout 1500.10 sec
Start 7: vectored-IO
```
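
To pick the timed-out tests out of a long log, a small filter can help. This is a sketch that assumes the CTest output format shown above and test names without spaces. The function name `list_timeouts` is hypothetical.

```shell
# list_timeouts: read CTest output on stdin and print the name of every
# test whose result line contains "***Timeout".
list_timeouts() {
    awk '/\*\*\*Timeout/ {
        sub(/^.*Test +#[0-9]+: +/, "")   # drop the "5/8 Test #5: " prefix
        print $1                         # the test name is the first field left
    }'
}
```

For example, `make test | list_timeouts` would print one test name per line. If the tests need more time, CTest's `--timeout` option sets a different limit in seconds.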

4 changes: 2 additions & 2 deletions doc/mockup.md
@@ -1,4 +1,4 @@
Mockup `scripts/fabtrune` output and some implementation notes.
Mockup `scripts/fabtrun` output and some implementation notes.

```
fabtget parameter set duration (s) duration/default (%) result
@@ -34,7 +34,7 @@ key:
11 tests, 7 succeeded, 4 failed
```

Example Bourne shell script to print the fabtput results:
Example Bourne shell script to print the `fabtput` results:

```
printf "%-23s %16s %-24s %s\\n" "fabtput parameter set" "duration (s)" "duration/default (%)" "result"
57 changes: 37 additions & 20 deletions doc/tests.md
@@ -1,46 +1,63 @@
# Running the tests

To start a test sequence, start the test script,
`scripts/fabtrun <hostname>`, replacing `<hostname>` either with the
name of the test host (e.g., `hostname` output) or with `localhost`.
`scripts/fabtrun`.
The script will run silently for a few minutes and then print a report
like this one:

```
fabtget parameter set duration (s) duration/default (%) result
phase get, testing parameter set default
phase get, testing parameter set cancel
phase get, testing parameter set cacheless
phase get, testing parameter set reregister
phase get, testing parameter set cacheless,reregister
phase get, testing parameter set wait
phase put, testing parameter set default
phase put, testing parameter set cancel
phase put, testing parameter set cacheless
phase put, testing parameter set reregister
phase put, testing parameter set cacheless,reregister
phase put, testing parameter set wait
phase put, testing parameter set contiguous
phase put, testing parameter set contiguous,reregister
phase put, testing parameter set contiguous,reregister,cacheless
get parameter set duration (s) duration/default (%) result
--------------------------------------------------------------------------
default 5.58 - ok
cancel 3.01 53 ok
cacheless 5.88 105 ok
reregister 6.40 114 ok
cacheless reregister 5.15 92 ok
fabtput parameter set duration (s) duration/default (%) result
default 6.19 - ok
cancel 2.02 32 ok
cacheless 6.29 101 ok
reregister 5.25 84 ok
cacheless reregister 6.04 97 ok
wait 10.94 176 ok
put parameter set duration (s) duration/default (%) result
--------------------------------------------------------------------------
default 5.27 - ok
cancel 3.00 56 ok
cacheless 5.65 107 ok
reregister 5.37 101 ok
cacheless reregister 5.24 99 ok
contiguous 9.18 174 ok
contiguous reregister 8.76 166 ok
contiguous reregister cacheless 8.96 170 ok
default 4.89 - ok
cancel 2.02 41 ok
cacheless 5.15 105 ok
reregister 5.21 106 ok
cacheless reregister 5.13 104 ok
wait 7.74 158 ok
contiguous 7.65 156 ok
contiguous reregister 8.46 173 ok
contiguous reregister cacheless 8.29 169 ok
key:
parameters:
default: register each RDMA buffer once, use scatter-gather RDMA
default: register each RDMA buffer once, use scatter-gather RDMA
cancel: -c, send SIGINT to cancel after 3 seconds
cacheless: env FI_MR_CACHE_MAX_SIZE=0, disable memory-registration cache
contiguous: -g, RDMA conti(g)uous bytes, no scatter-gather
reregister: -r, deregister/(r)eregister each RDMA buffer before reuse
wait: -w, wait for I/O using epoll_pwait(2) instead of fi_poll(3)
duration: elapsed real time in seconds
duration/default: elapsed real time as a percentage of the duration
measured with the default parameter set
13 tests, 13 succeeded, 0 failed
15 tests, 15 succeeded, 0 failed
```
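
The `duration/default` column can be reproduced from the two durations. The sketch below assumes the percentage is truncated rather than rounded, which matches the rows shown above (for example, 5.21 s against a default of 4.89 s is reported as 106) but is not confirmed by the `fabtrun` source here. The function name `percent_of_default` is hypothetical.

```shell
# percent_of_default DURATION DEFAULT: print DURATION as a whole-number
# percentage of DEFAULT, truncating the fractional part.
percent_of_default() {
    awk -v d="$1" -v base="$2" 'BEGIN { printf "%d\n", 100 * d / base }'
}
```

For example, `percent_of_default 2.02 4.89` corresponds to the `cancel` row of the put phase.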

Look at the summary result in the last line for a
Binary file added fabtsuite_logo.png
9 changes: 8 additions & 1 deletion scripts/fabtget.sh
@@ -1,6 +1,9 @@
#!/bin/bash

PREFIX=/lus/grand/projects/radix-io
export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH
export PATH=$PREFIX/bin:$PATH


# Write fabtget address to a file.
# On Polaris, fabtget can't write to Lustre file system.
@@ -10,7 +13,11 @@ FILE=$HOME/fabtget_a.txt
HOST=`cat /proc/sys/kernel/hostname`

echo "Running fabtget."
{ time -p $PREFIX/bin/fabtget -a $FILE; } &> $PREFIX/$HOST.txt
if [ -z "$1" ] ; then
{ time -p $PREFIX/bin/fabtget -a $FILE; } &> $PREFIX/$HOST.txt
else
{ time -p $PREFIX/bin/fabtget $1 -a $FILE; } &> $PREFIX/$HOST.txt
fi



8 changes: 7 additions & 1 deletion scripts/fabtput.sh
@@ -1,6 +1,8 @@
#!/bin/bash

PREFIX=/lus/grand/projects/radix-io
export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH
export PATH=$PREFIX/bin:$PATH

# On Polaris, fabtget can't write to Lustre file system.
FILE=$HOME/fabtget_a.txt
@@ -12,4 +14,8 @@ HOST=`cat /proc/sys/kernel/hostname`
sleep 2
echo "$FILE exists. Running fabtput."
cat $FILE
{ time -p $PREFIX/bin/fabtput `cat $FILE`; } &> $PREFIX/$HOST.txt
if [ -z "$1" ] ; then
{ time -p $PREFIX/bin/fabtput `cat $FILE`; } &> $PREFIX/$HOST.txt
else
{ time -p $PREFIX/bin/fabtput $1 `cat $FILE`; } &> $PREFIX/$HOST.txt
fi
2 changes: 1 addition & 1 deletion scripts/fabtrun
@@ -18,7 +18,7 @@
# each client & server instance what the directory is called using an
# environment variable or command-line parameter.
#
# MN: Currently, fabtrun performs runs every test step in two test
# MN: Currently, fabtrun runs every test step in two test
# phases (one phase for get, one phase for put). Then it produces
# a report. In one flexible approach to multi-node testing, each
# client-mode/server-mode instance will run only a single step; then the
2 changes: 1 addition & 1 deletion scripts/fabtrun.qsub
@@ -8,7 +8,7 @@
#PBS -l place=scatter
#PBS -l walltime=10:00
#PBS -q debug
#PBS -A radix-io
#PBS -A CSC250STDM12

PREFIX=/lus/grand/projects/radix-io

1 change: 1 addition & 0 deletions test/README.md
@@ -0,0 +1 @@
This directory has files for CTest.
37 changes: 37 additions & 0 deletions test/cancel.qsub
@@ -0,0 +1,37 @@
#!/bin/bash
##
## Usage: qsub cancel.qsub
## Author: Hyokyung Lee (hyoklee@hdfgroup.org)
## Last Update: 2022-09-14
##
#PBS -l select=2:system=polaris
#PBS -l place=scatter
#PBS -l walltime=10:00
#PBS -q debug
#PBS -A CSC250STDM12

# Set the libfabric library location.
PREFIX=/lus/grand/projects/radix-io
export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH

# Set the current working directory.
WORKDIR=$PBS_O_WORKDIR

# Get all node names first.
mpiexec -n 1 -ppn 1 cat $PBS_NODEFILE >& $WORKDIR/nodes.txt

# Run 1 server and (select - 1) client(s).
# The debug queue has only 2 nodes.
# Therefore, this script will run only 1 client.
# The first node in nodes.txt will be the server.
# The rest will be clients.
j=0
for i in `cat $WORKDIR/nodes.txt`; do
if [[ "$j" -gt 0 ]]; then
mpiexec -host $i -n 1 -ppn 1 $WORKDIR/tput.sh -c
else
mpiexec -host $i -n 1 -ppn 1 nohup $WORKDIR/tget.sh -c > fabtget.out 2> fabtget.err < /dev/null &
fi
((j++))
done
echo $?
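
The server/client split described in the comments of this job script can be checked without launching anything. The following dry-run sketch assumes a node file with one host name per line, like `nodes.txt` above. The function name `assign_roles` is hypothetical.

```shell
# assign_roles NODEFILE: print "server <node>" for the first node listed in
# NODEFILE and "client <node>" for every later one, skipping blank lines.
assign_roles() {
    j=0
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        if [ "$j" -eq 0 ]; then
            echo "server $node"
        else
            echo "client $node"
        fi
        j=$((j + 1))
    done < "$1"
}
```

Running `assign_roles "$WORKDIR/nodes.txt"` inside a job would show which node gets `tget.sh` and which nodes get `tput.sh`.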
24 changes: 24 additions & 0 deletions test/cancel.slurm
@@ -0,0 +1,24 @@
#!/bin/bash
##
## Usage: sbatch cancel.slurm
## Author: Hyokyung Lee (hyoklee@hdfgroup.org)
## Last Update: 2022-09-14
##
#SBATCH -A CSC332_crusher
#SBATCH -J cancel
#SBATCH -o %x-%j.out
#SBATCH -t 00:00:20
#SBATCH -N 2
srun -N1 -n1 ./tget.sh -c &
srun -N1 -n1 ./tput.sh -c &
sleep 20

a=$(grep Result cancel-*.out | wc -l)
if [ "$a" -eq "0" ]; then
exit 1
fi

b=$(grep error cancel-*.out | wc -l)
exit $b


37 changes: 37 additions & 0 deletions test/cross.qsub
@@ -0,0 +1,37 @@
#!/bin/bash
##
## Usage: qsub cross.qsub
## Author: Hyokyung Lee (hyoklee@hdfgroup.org)
## Last Update: 2022-09-19
##
#PBS -l select=3:system=polaris
#PBS -l place=scatter
#PBS -l walltime=10:00
#PBS -q debug-scaling
#PBS -A CSC250STDM12

# Set the libfabric library location.
PREFIX=/lus/grand/projects/radix-io
export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH

# Set the current working directory.
WORKDIR=$PBS_O_WORKDIR

# Get all node names first.
mpiexec -n 1 -ppn 1 cat $PBS_NODEFILE >& $WORKDIR/nodes.txt

# Run 1 server and (select - 1) client(s).
# This job selects 3 nodes (select=3).
# Therefore, this script will run 2 clients.
# The first node in nodes.txt will be the server.
# The rest will be clients.
j=0
for i in `cat $WORKDIR/nodes.txt`; do
if [[ "$j" -gt 0 ]]; then
mpiexec -host $i -n 1 -ppn 1 $WORKDIR/tput.sh -n 4 -k 2 > $WORKDIR/cross_p_$j.out 2> $WORKDIR/cross_p_$j.err
else
mpiexec -host $i -n 1 -ppn 1 nohup $WORKDIR/tget.sh -n 4 > $WORKDIR/cross_g.out 2> $WORKDIR/cross_g.err < /dev/null &
fi
((j++))
done
echo $?