When I run the training benchmark, I get this warning:
xxxx-xx-xx xx:xx:xx|WARNING: MPI cluster info collection failed: collect_cluster_info() missing 1 required positional argument: 'results_dir'
"collect_cluster_info" requires results_dir, but the call site in base.py does not pass it.
In mlpstorage_py/cluster_collector.py, lines 1577~1588:
    def collect_cluster_info(
        hosts: List[str],
        mpi_bin: str,
        logger,
        results_dir: str,
        allow_run_as_root: bool = False,
        timeout_seconds: int = 60,
        fallback_to_local: bool = True,
        shared_staging_dir: Optional[str] = None,
        shared_tmp_dir: Optional[str] = None,  # deprecated, see note below
        ssh_username: Optional[str] = None,
    ) -> Dict[str, Any]:
In mlpstorage_py/benchmarks/base.py, lines 447~453:
    collected_data = collect_cluster_info(
        hosts=self.args.hosts,
        mpi_bin=mpi_bin,
        logger=self.logger,
        allow_run_as_root=allow_run_as_root,
        timeout_seconds=timeout,
        fallback_to_local=True
    )
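Here is a minimal sketch that reproduces the same TypeError outside the project. It uses a stub with the signature quoted above; the stub body, the host, the "mpirun" value, and the "/tmp/results" path are placeholders I made up for illustration, not values from mlpstorage_py. I have not verified which attribute base.py should actually pass as results_dir.

```python
from typing import Any, Dict, List, Optional


def collect_cluster_info(
    hosts: List[str],
    mpi_bin: str,
    logger,
    results_dir: str,
    allow_run_as_root: bool = False,
    timeout_seconds: int = 60,
    fallback_to_local: bool = True,
    shared_staging_dir: Optional[str] = None,
    shared_tmp_dir: Optional[str] = None,
    ssh_username: Optional[str] = None,
) -> Dict[str, Any]:
    # Stub body (assumption): the real function collects data over MPI.
    return {"hosts": hosts, "results_dir": results_dir}


def call_like_base_py() -> str:
    """Call the stub with the same keywords as base.py; return the error text."""
    try:
        collect_cluster_info(
            hosts=["127.0.0.1"],   # placeholder host
            mpi_bin="mpirun",      # placeholder binary name
            logger=None,
            allow_run_as_root=False,
            timeout_seconds=60,
            fallback_to_local=True,
        )
    except TypeError as exc:
        return str(exc)
    return ""


def call_with_results_dir() -> Dict[str, Any]:
    """Same call, with the missing keyword supplied."""
    return collect_cluster_info(
        hosts=["127.0.0.1"],
        mpi_bin="mpirun",
        logger=None,
        results_dir="/tmp/results",  # placeholder path
        allow_run_as_root=False,
        timeout_seconds=60,
        fallback_to_local=True,
    )


if __name__ == "__main__":
    print(call_like_base_py())
    print(call_with_results_dir())
```

The first call fails with the same "missing 1 required positional argument" message as in the warning; the second call goes through once results_dir is passed.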
I think the failed MPI cluster info collection is also what makes reportgen fail afterwards:
Issues: [INVALID] None: Check check_num_files_train failed with error: 'NoneType' object has no attribute 'total_memory_bytes'
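A sketch of how I suspect the two errors are connected, assuming the check reads total_memory_bytes from a cluster info object that is None when collection fails. ClusterInfo, check_unguarded, and check_guarded are hypothetical names for illustration, not the actual reportgen code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterInfo:
    # Hypothetical stand-in for whatever object holds the collected data.
    total_memory_bytes: int


def check_unguarded(cluster_info: Optional[ClusterInfo]) -> bool:
    # Raises AttributeError when cluster_info is None.
    return cluster_info.total_memory_bytes > 0


def check_guarded(cluster_info: Optional[ClusterInfo]) -> str:
    # Report the missing data explicitly instead of crashing on None.
    if cluster_info is None:
        return "cluster info missing"
    return "ok" if cluster_info.total_memory_bytes > 0 else "invalid"


if __name__ == "__main__":
    try:
        check_unguarded(None)
    except AttributeError as exc:
        print(exc)
    print(check_guarded(None))
```

If this is what happens, fixing the missing results_dir argument should also remove the reportgen error, and a None check would make the report message clearer when collection fails for other reasons.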