This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Having trouble with NNI and frameworkcontroller on k8s again #75

@juniroc

Description

Hi, I ran into a new problem while running NNI with frameworkcontroller on k8s, and I opened an issue at the link below,

but it hasn't gotten an answer for a long time.
Can anyone help with it?

Thanks!

microsoft/nni#4588 (comment)


Details

Describe the issue:
When I ran NNI with frameworkcontroller on k8s, I used the YAML files below.

  • I used NFS storage.

The NNI config,
config_framework.yml:

authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: .
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 3
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: 192.168.1.106
    # Your NFS server export path, like /var/nfs/nni
    path: /home/mj_lee/mount
  serviceAccountName: frameworkcontroller

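For reference, a server-side NFS export matching the config above might look like the following. This is only a sketch: the export options are an assumption on my part, not something taken from this issue.

```
# /etc/exports on 192.168.1.106 (hypothetical; the options are assumptions)
/home/mj_lee/mount *(rw,sync,no_subtree_check,no_root_squash)
```

After editing, `exportfs -ra` on the server reloads the export table, and `showmount -e 192.168.1.106` from a worker node confirms the path is actually exported.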
and for the frameworkcontroller StatefulSet,
frameworkcontroller-with-default-config.yaml:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller
  namespace: default
spec:
  serviceName: frameworkcontroller
  selector:
    matchLabels:
      app: frameworkcontroller
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller
    spec:
      # Using the ServiceAccount with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller
      containers:
      - name: frameworkcontroller
        image: frameworkcontroller/frameworkcontroller
        # Using k8s inClusterConfig, so usually, no need to specify
        # KUBE_APISERVER_ADDRESS or KUBECONFIG
        env:
        #- name: KUBE_APISERVER_ADDRESS
        #  value: {http[s]://host:port}
          - name: KUBECONFIG
            value: ~/.kube/config
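One side note on the StatefulSet above: an env value is passed to the container verbatim, so the `~` in `~/.kube/config` is not expanded, and that path will not exist inside the container anyway. Since the comment says inClusterConfig is used, the variable can likely be dropped; if a kubeconfig were really needed, it would have to be an absolute path that exists in the container. A sketch (the commented path is a hypothetical example, not from this issue):

```yaml
        # If in-cluster config is used (the usual case), no env is needed:
        env: []
        # Otherwise, point KUBECONFIG at an absolute in-container path,
        # e.g. a file mounted from a Secret:
        # env:
        #   - name: KUBECONFIG
        #     value: /etc/kubeconfig/config
```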

and ran the command below to create the k8s StatefulSet:

kubectl apply -f frameworkcontroller-with-default-config.yaml

after which frameworkcontroller-0 reached the Running state.

and ran the nnictl command:

nnictl create --config config_framework.yml

A new experiment worker pod was then created,
but it failed to run.

When I checked the logs with kubectl logs nniexp~, the container had exited with an error (the same "Can't open /tmp/mount/nni/r2ys5f9a/run.sh" message shown in the pod description below).

So I checked the NFS mount directory: there is no nni directory, only an envs directory and a run.sh file.

I think it should create nni/<experiment_id>/run.sh in the mount folder.
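The expectation above can be sketched as a quick check of the paths involved. All names below are taken from this issue's config and pod spec; the script only illustrates where run.sh should end up, it is not something NNI provides.

```shell
NFS_EXPORT=/home/mj_lee/mount   # nfs.path in config_framework.yml
POD_MOUNT=/tmp/mount            # where nni-vol is mounted in the pod
EXPERIMENT_ID=r2ys5f9a          # from the failing pod's command line

# The pod tries to run this file...
expected_in_pod="$POD_MOUNT/nni/$EXPERIMENT_ID/run.sh"
# ...which should correspond to this file on the NFS export:
expected_on_nfs="$NFS_EXPORT/nni/$EXPERIMENT_ID/run.sh"

echo "$expected_in_pod"   # → /tmp/mount/nni/r2ys5f9a/run.sh
echo "$expected_on_nfs"   # → /home/mj_lee/mount/nni/r2ys5f9a/run.sh
```

If `ls` of the second path on the NFS server shows nothing while the pod's command references the first, the script was either never written or was written to a different export.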

Here is the kubectl describe output for the nniexp-worker-0 pod:

Name:         nniexpr2ys5f9aenvzchoa-worker-0
Namespace:    default
Priority:     0
Node:         zerooneai-p210908-4/192.168.1.104
Start Time:   Fri, 25 Feb 2022 14:33:07 +0900
Labels:       FC_FRAMEWORK_NAME=nniexpr2ys5f9aenvzchoa
              FC_TASKROLE_NAME=worker
              FC_TASK_INDEX=0
Annotations:  FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
              FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
              FC_FRAMEWORK_ATTEMPT_ID: 0
              FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
              FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
              FC_FRAMEWORK_NAMESPACE: default
              FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
              FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
              FC_TASKROLE_NAME: worker
              FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
              FC_TASK_ATTEMPT_ID: 0
              FC_TASK_INDEX: 0
              FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
              cni.projectcalico.org/podIP: 10.0.243.33/32
              cni.projectcalico.org/podIPs: 10.0.243.33/32
Status:       Running
IP:           10.0.243.33
IPs:
  IP:           10.0.243.33
Controlled By:  ConfigMap/nniexpr2ys5f9aenvzchoa-attempt
Init Containers:
  frameworkbarrier:
    Container ID:   docker://b05885b647cdb41dba4587f6f93eeb5bd19a390641687012bc017d73cc21aa79
    Image:          frameworkcontroller/frameworkbarrier
    Image ID:       docker-pullable://frameworkcontroller/frameworkbarrier@sha256:9d95e31152460e3cc5c7ad2b09738c1fdb540ff7a50abc72b2f8f9d0badb87da
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 25 Feb 2022 14:33:12 +0900
      Finished:     Fri, 25 Feb 2022 14:33:22 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexpr2ys5f9aenvzchoa
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexpr2ys5f9aenvzchoa-attempt
      FC_POD_NAME:                        nniexpr2ys5f9aenvzchoa-worker-0
      FC_FRAMEWORK_UID:                   2c55ec33-69b8-43a4-a643-84ff4e0604b2
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_CONFIGMAP_UID:                   0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_TASKROLE_UID:                    751bf95b-c6bd-4dd0-aafe-e160f9c10220
      FC_TASK_UID:                        27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Containers:
  framework:
    Container ID:  docker://dc9ff6579c67e8bc394c734e8add70fbdb581d014541044d1877a9e5d888f828
    Image:         msranni/nni:latest
    Image ID:      docker-pullable://msranni/nni@sha256:8985fb134204ef523e113ac4a572ae7460cd246a5ff471df413f7d17dd917cd1
    Port:          4000/TCP
    Host Port:     0/TCP
    Command:
      sh
      /tmp/mount/nni/r2ys5f9a/run.sh
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   sh: 0: Can't open /tmp/mount/nni/r2ys5f9a/run.sh

      Exit Code:    127
      Started:      Fri, 25 Feb 2022 14:36:43 +0900
      Finished:     Fri, 25 Feb 2022 14:36:43 +0900
    Ready:          False
    Restart Count:  5
    Limits:
      cpu:     1
      memory:  8Gi
    Requests:
      cpu:     1
      memory:  8Gi
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexpr2ys5f9aenvzchoa
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexpr2ys5f9aenvzchoa-attempt
      FC_POD_NAME:                        nniexpr2ys5f9aenvzchoa-worker-0
      FC_FRAMEWORK_UID:                   2c55ec33-69b8-43a4-a643-84ff4e0604b2
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_CONFIGMAP_UID:                   0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_TASKROLE_UID:                    751bf95b-c6bd-4dd0-aafe-e160f9c10220
      FC_TASK_UID:                        27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /tmp/mount from nni-vol (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  nni-vol:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    192.168.1.106
    Path:      /home/zerooneai/mj_lee/mount
    ReadOnly:  false
  frameworkbarrier-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  frameworkcontroller-token-7sw6q:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  frameworkcontroller-token-7sw6q
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  6m19s                 default-scheduler  Successfully assigned default/nniexpr2ys5f9aenvzchoa-worker-0 to zerooneai-p210908-4
  Normal   Pulling    6m18s                 kubelet            Pulling image "frameworkcontroller/frameworkbarrier"
  Normal   Pulled     6m15s                 kubelet            Successfully pulled image "frameworkcontroller/frameworkbarrier" in 3.364620261s
  Normal   Created    6m14s                 kubelet            Created container frameworkbarrier
  Normal   Started    6m14s                 kubelet            Started container frameworkbarrier
  Normal   Pulled     6m1s                  kubelet            Successfully pulled image "msranni/nni:latest" in 2.375328373s
  Normal   Pulled     5m56s                 kubelet            Successfully pulled image "msranni/nni:latest" in 4.709013579s
  Normal   Pulled     5m36s                 kubelet            Successfully pulled image "msranni/nni:latest" in 2.373976028s
  Normal   Pulling    5m9s (x4 over 6m4s)   kubelet            Pulling image "msranni/nni:latest"
  Normal   Created    5m7s (x4 over 6m1s)   kubelet            Created container framework
  Normal   Pulled     5m7s                  kubelet            Successfully pulled image "msranni/nni:latest" in 2.484752039s
  Normal   Started    5m6s (x4 over 6m1s)   kubelet            Started container framework
  Warning  BackOff    71s (x22 over 5m54s)  kubelet            Back-off restarting failed container

Please let me know how to solve this problem. Thanks!
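One detail in the outputs above may be worth double-checking: the nfs.path in config_framework.yml and the Path of the nni-vol volume in the kubectl describe output are not the same string. A trivial comparison, with both values copied verbatim from this issue:

```shell
config_path="/home/mj_lee/mount"               # nfs.path in config_framework.yml
pod_volume_path="/home/zerooneai/mj_lee/mount" # Path under Volumes/nni-vol above

if [ "$config_path" = "$pod_volume_path" ]; then
  echo "paths match"
else
  # The pod mounts a different export path than the one in the config,
  # which could explain run.sh being absent from the mounted directory.
  echo "paths differ"   # prints "paths differ" for the values above
fi
```

If the two really do point at different exports (rather than one being a stale paste), aligning them would be the first thing to try.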

Environment:

  • NNI version: 2.6
  • Training service (local|remote|pai|aml|etc): frameworkcontroller
  • Client OS: ubuntu 18.04
  • Server OS (for remote mode only):
  • Python version: 3.6.9
  • PyTorch/TensorFlow version: 1.10.1+cu102
