mpijob using hostnetwork error #219

Open · suluner opened this issue Apr 14, 2020 · 2 comments

suluner commented Apr 14, 2020

Hi all,

I want to use host networking when submitting an MPIJob, to improve training performance. The YAML file is as below:

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  annotations:
    monitoring.netease.com/enable-grafana-dashboard: "true"
  generateName: test-mpijob
  generation: 2
  labels:
    fairing-deployer: mpijob
    fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
    kubeflow.netease.com/userid: huting3
  namespace: ai-test
spec:
  activeDeadlineSeconds: 3600
  backoffLimit: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            monitoring.netease.com/enable-grafana-dashboard: "true"
            sidecar.istio.io/inject: "false"
          creationTimestamp: null
          labels:
            fairing-deployer: mpijob
            fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
            kubeflow.netease.com/userid: huting3
          name: fairing-deployer
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /app/boot.py
            env:
            - name: FAIRING_RUNTIME
              value: "1"
            image: hub-inner.cn-east-p1.netease.com/deeplearning/fairing-job:8AB586D0
            name: mpi
            resources:
              limits:
                memory: 998579896320m
              requests:
                cpu: "1"
            securityContext:
              runAsUser: 0
            volumeMounts:
            - mountPath: /data
              name: fairing-volume-data-huting3
            workingDir: /app/
          imagePullSecrets:
          - name: hubinnercneastp1neteasecomdeeplearningstaffk8sai01serviceneteasecom
          restartPolicy: Never
          volumes:
          - name: fairing-volume-data-huting3
            persistentVolumeClaim:
              claimName: data-huting3
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            monitoring.netease.com/enable-grafana-dashboard: "true"
            sidecar.istio.io/inject: "false"
          labels:
            fairing-deployer: mpijob
            fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
            kubeflow.netease.com/userid: huting3
          name: fairing-deployer
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - env:
            - name: FAIRING_RUNTIME
              value: "1"
            image: hub-inner.cn-east-p1.netease.com/deeplearning/fairing-job:8AB586D0
            name: mpi
            resources:
              limits:
                memory: 6002216796160m
                nvidia.com/gpu: "1"
              requests:
                cpu: "4"
            securityContext:
              runAsUser: 0
            volumeMounts:
            - mountPath: /data
              name: fairing-volume-data-huting3
            workingDir: /app/
          restartPolicy: Never
          volumes:
          - name: fairing-volume-data-huting3
            persistentVolumeClaim:
              claimName: data-huting3
  slotsPerWorker: 1

But I got the error below:

--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: pri2-ainode28
  PID:        30
--------------------------------------------------------------------------

What is the problem? Any hints would be appreciated.

carmark commented Apr 15, 2020

@suluner Could you please attach more logs? Also, what is your network environment: RoCE, InfiniBand, or something else? It would help if you could provide the output of ip a.

suluner commented Apr 16, 2020

@carmark Thanks for your reply. My network environment is plain Ethernet. When I add the following parameters to the mpirun command, it works well:

"-mca",
"btl_tcp_if_include",
"bond0.1200"

bond0.1200 is the network interface on the host machine.
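
For reference, this is roughly how the Launcher command in my MPIJob looks after adding those parameters (just a sketch of the changed part; bond0.1200 is the interface reported by ip a on my hosts, so substitute whatever interface your nodes actually use):

          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            # ... -bind-to/-map-by/-x flags unchanged ...
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            # new: restrict the TCP BTL to the host interface exposed by hostNetwork
            - -mca
            - btl_tcp_if_include
            - bond0.1200
            - python
            - /app/boot.py

My understanding is that with hostNetwork the launcher and workers see all of the host's interfaces, and Open MPI was trying to connect over one that is not routable between the nodes, which produced the warning above; pinning the TCP BTL to bond0.1200 avoids that.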
