
collect_cluster_info() missing 1 required positional argument: 'results_dir' #363

@momonara

When I run the training step, I get the following warning:
xxxx-xx-xx xx:xx:xx|WARNING: MPI cluster info collection failed: collect_cluster_info() missing 1 required positional argument: 'results_dir'

collect_cluster_info() requires a results_dir argument, but the call site does not pass one.

The definition, in mlpstorage_py/cluster_collector.py, lines 1577-1588:

def collect_cluster_info(
    hosts: List[str], 
    mpi_bin: str,
    logger,
    results_dir: str,
    allow_run_as_root: bool = False,
    timeout_seconds: int = 60,
    fallback_to_local: bool = True,
    shared_staging_dir: Optional[str] = None,
    shared_tmp_dir: Optional[str] = None,  # deprecated, see note below
    ssh_username: Optional[str] = None,
) -> Dict[str, Any]:

The call, in mlpstorage_py/benchmarks/base.py, lines 447-453:

collected_data = collect_cluster_info(
    hosts=self.args.hosts,
    mpi_bin=mpi_bin,
    logger=self.logger,
    allow_run_as_root=allow_run_as_root,
    timeout_seconds=timeout,
    fallback_to_local=True
)
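
To make the mismatch concrete, here is a minimal standalone sketch. It copies the signature quoted above, but the function body, the host list, the `mpirun` value, and the `/tmp/results` path are placeholders I made up for illustration. I have not checked which attribute in base.py actually holds the results directory, so the real fix may need a different value there.

```python
import logging
from typing import Any, Dict, List, Optional


def collect_cluster_info(
    hosts: List[str],
    mpi_bin: str,
    logger,
    results_dir: str,
    allow_run_as_root: bool = False,
    timeout_seconds: int = 60,
    fallback_to_local: bool = True,
    shared_staging_dir: Optional[str] = None,
    shared_tmp_dir: Optional[str] = None,
    ssh_username: Optional[str] = None,
) -> Dict[str, Any]:
    # Stand-in body (assumption): the real function gathers data over MPI.
    return {"hosts": hosts, "results_dir": results_dir}


logger = logging.getLogger("repro")

# Same keyword arguments as the call in base.py, with placeholder values.
common_kwargs = dict(
    hosts=["127.0.0.1"],
    mpi_bin="mpirun",
    logger=logger,
    allow_run_as_root=False,
    timeout_seconds=60,
    fallback_to_local=True,
)

# Calling without results_dir reproduces the TypeError from the warning.
try:
    collect_cluster_info(**common_kwargs)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Passing results_dir (placeholder path) lets the call go through.
fixed = collect_cluster_info(results_dir="/tmp/results", **common_kwargs)
```

Since results_dir has no default and sits before the optional parameters, every caller has to supply it. Either the call in base.py passes it, or the signature gives it a default.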

I think this failure also breaks reportgen: because the cluster info is never collected, it is None when the checks run, and reportgen reports:
Issues: [INVALID] None: Check check_num_files_train failed with error: 'NoneType' object has no attribute 'total_memory_bytes'
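
Separately from the missing argument, the check could report missing cluster info explicitly instead of failing with an AttributeError. The sketch below is only an illustration: `ClusterInfo`, the function signature, and the size rule are hypothetical stand-ins, not the project's actual classes or validation logic. Only the attribute name total_memory_bytes comes from the error message.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ClusterInfo:
    # Hypothetical container; only the attribute name is taken from the error.
    total_memory_bytes: int


def check_num_files_train(
    cluster_info: Optional[ClusterInfo],
    num_files: int,
    bytes_per_file: int,
) -> Tuple[bool, str]:
    """Return (is_valid, message) without raising when cluster info is missing."""
    if cluster_info is None:
        # Guard: say what is missing instead of failing on attribute access.
        return False, "cluster info was not collected; cannot validate num_files_train"

    dataset_bytes = num_files * bytes_per_file
    # Placeholder rule (assumption): dataset must be at least as large as memory.
    if dataset_bytes < cluster_info.total_memory_bytes:
        return False, "dataset is smaller than total cluster memory"
    return True, "ok"


# Missing cluster info yields a readable message rather than an exception.
missing_result = check_num_files_train(None, num_files=100, bytes_per_file=1024)

# With cluster info present, the placeholder rule is evaluated normally.
present_result = check_num_files_train(
    ClusterInfo(total_memory_bytes=1024), num_files=100, bytes_per_file=1024
)
```

With a guard like this, the report would name the real cause (cluster info was never collected) rather than the secondary NoneType error.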
