mpijob using hostnetwork error #219

Open · suluner opened this issue Apr 14, 2020 · 2 comments

suluner commented Apr 14, 2020

Hi all,

I want to use host networking when submitting an MPIJob, to improve training performance. The YAML file is as below:

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  annotations:
    monitoring.netease.com/enable-grafana-dashboard: "true"
  generateName: test-mpijob
  generation: 2
  labels:
    fairing-deployer: mpijob
    fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
    kubeflow.netease.com/userid: huting3
  namespace: ai-test
spec:
  activeDeadlineSeconds: 3600
  backoffLimit: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            monitoring.netease.com/enable-grafana-dashboard: "true"
            sidecar.istio.io/inject: "false"
          creationTimestamp: null
          labels:
            fairing-deployer: mpijob
            fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
            kubeflow.netease.com/userid: huting3
          name: fairing-deployer
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /app/boot.py
            env:
            - name: FAIRING_RUNTIME
              value: "1"
            image: hub-inner.cn-east-p1.netease.com/deeplearning/fairing-job:8AB586D0
            name: mpi
            resources:
              limits:
                memory: 998579896320m
              requests:
                cpu: "1"
            securityContext:
              runAsUser: 0
            volumeMounts:
            - mountPath: /data
              name: fairing-volume-data-huting3
            workingDir: /app/
          imagePullSecrets:
          - name: hubinnercneastp1neteasecomdeeplearningstaffk8sai01serviceneteasecom
          restartPolicy: Never
          volumes:
          - name: fairing-volume-data-huting3
            persistentVolumeClaim:
              claimName: data-huting3
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            monitoring.netease.com/enable-grafana-dashboard: "true"
            sidecar.istio.io/inject: "false"
          labels:
            fairing-deployer: mpijob
            fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
            kubeflow.netease.com/userid: huting3
          name: fairing-deployer
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - env:
            - name: FAIRING_RUNTIME
              value: "1"
            image: hub-inner.cn-east-p1.netease.com/deeplearning/fairing-job:8AB586D0
            name: mpi
            resources:
              limits:
                memory: 6002216796160m
                nvidia.com/gpu: "1"
              requests:
                cpu: "4"
            securityContext:
              runAsUser: 0
            volumeMounts:
            - mountPath: /data
              name: fairing-volume-data-huting3
            workingDir: /app/
          restartPolicy: Never
          volumes:
          - name: fairing-volume-data-huting3
            persistentVolumeClaim:
              claimName: data-huting3
  slotsPerWorker: 1

But I got the error below:

--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: pri2-ainode28
  PID:        30
--------------------------------------------------------------------------

What is the problem? Any hints would be appreciated.

carmark commented Apr 15, 2020

@suluner Could you please attach more logs? Also, what is your network environment: RoCE, InfiniBand, or something else? It would help if you could provide the output of ip a.

suluner commented Apr 16, 2020

@carmark Thanks for your reply. My network environment is plain Ethernet. When I add the following parameters to the mpirun command, it works well:

"-mca",
"btl_tcp_if_include",
"bond0.1200"

bond0.1200 is the network interface on the host machine.
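
For reference, this is roughly how the Launcher command in my MPIJob looks after adding those parameters (just a sketch of the changed part; bond0.1200 is the interface reported by ip a on my hosts, so substitute whatever interface your nodes actually use):

          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            # ... -bind-to/-map-by/-x flags unchanged ...
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            # new: restrict the TCP BTL to the host interface exposed by hostNetwork
            - -mca
            - btl_tcp_if_include
            - bond0.1200
            - python
            - /app/boot.py

My understanding is that with hostNetwork the launcher and workers see all of the host's interfaces, and Open MPI was trying to connect over one that is not routable between the nodes, which produced the warning above; pinning the TCP BTL to bond0.1200 avoids that.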
