
Having trouble nni with frameworkcontroller on k8s #4588

Closed
juniroc opened this issue Feb 25, 2022 · 7 comments

juniroc commented Feb 25, 2022

Describe the issue:
When I tried NNI with frameworkcontroller on k8s, I used the following YAML files.

  • I tried NFS

For the NNI config, config_framework.yml:

authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: .
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 3
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: 192.168.1.106
    # Your NFS server export path, like /var/nfs/nni
    path: /home/mj_lee/mount
  serviceAccountName: frameworkcontroller
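
(As a sanity check, the NFS export referenced in frameworkcontrollerConfig can be verified from the machine running nnictl; a minimal sketch, assuming nfs-common is installed and /mnt/nni-test is a scratch mount point:)

# list the exports advertised by the NFS server from the config above
showmount -e 192.168.1.106
# try mounting the export used by frameworkcontrollerConfig (hypothetical mount point)
sudo mkdir -p /mnt/nni-test
sudo mount -t nfs 192.168.1.106:/home/mj_lee/mount /mnt/nni-test
ls /mnt/nni-test
sudo umount /mnt/nni-test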

And for the FrameworkController StatefulSet, frameworkcontroller-with-default-config.yaml:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller
  namespace: default
spec:
  serviceName: frameworkcontroller
  selector:
    matchLabels:
      app: frameworkcontroller
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller
    spec:
      # Using the ServiceAccount with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller
      containers:
      - name: frameworkcontroller
        image: frameworkcontroller/frameworkcontroller
        # Using k8s inClusterConfig, so usually, no need to specify
        # KUBE_APISERVER_ADDRESS or KUBECONFIG
        env:
        #- name: KUBE_APISERVER_ADDRESS
        #  value: {http[s]://host:port}
          - name: KUBECONFIG
            value: ~/.kube/config
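
(Side note, not from the original report: both this manifest and the NNI config reference the frameworkcontroller ServiceAccount, so it has to exist with the granted cluster permissions before applying the manifest; a quick check:)

kubectl get serviceaccount frameworkcontroller -n default
kubectl get clusterrolebinding | grep frameworkcontroller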

and ran the command below to create the k8s StatefulSet:

kubectl apply -f frameworkcontroller-with-default-config.yaml

Then frameworkcontroller-0 went to Running:

(screenshot: frameworkcontroller-0 pod in Running state)
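
(Roughly the same check from the command line, using the app=frameworkcontroller label from the manifest above; the sample output line is illustrative:)

kubectl get pods -n default -l app=frameworkcontroller
# NAME                    READY   STATUS    RESTARTS   AGE
# frameworkcontroller-0   1/1     Running   0          1m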

Then I ran the nnictl command:

nnictl create --config config_framework.yml

Then a new experiment worker pod was created, but it failed to run:

(screenshot: the failed worker pod)

When I checked the logs with kubectl logs nniexp~:

(screenshot of the pod logs)
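
(For reference, the full commands look roughly like this; the pod and container names are taken from the describe output further below:)

kubectl get pods | grep nniexp
kubectl logs nniexpr2ys5f9aenvzchoa-worker-0 -c framework
# the init container's logs, if needed:
kubectl logs nniexpr2ys5f9aenvzchoa-worker-0 -c frameworkbarrier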

So I checked the NFS mount directory: there is no nni directory, but it does contain an envs directory and a run.sh file.

(screenshot of the NFS mount directory listing)

I think it should create nni/experiment_id/run.sh in the mount folder.

Here is the describe output of the nniexp-worker-0 pod:

Name:         nniexpr2ys5f9aenvzchoa-worker-0
Namespace:    default
Priority:     0
Node:         zerooneai-p210908-4/192.168.1.104
Start Time:   Fri, 25 Feb 2022 14:33:07 +0900
Labels:       FC_FRAMEWORK_NAME=nniexpr2ys5f9aenvzchoa
              FC_TASKROLE_NAME=worker
              FC_TASK_INDEX=0
Annotations:  FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
              FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
              FC_FRAMEWORK_ATTEMPT_ID: 0
              FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
              FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
              FC_FRAMEWORK_NAMESPACE: default
              FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
              FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
              FC_TASKROLE_NAME: worker
              FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
              FC_TASK_ATTEMPT_ID: 0
              FC_TASK_INDEX: 0
              FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
              cni.projectcalico.org/podIP: 10.0.243.33/32
              cni.projectcalico.org/podIPs: 10.0.243.33/32
Status:       Running
IP:           10.0.243.33
IPs:
  IP:           10.0.243.33
Controlled By:  ConfigMap/nniexpr2ys5f9aenvzchoa-attempt
Init Containers:
  frameworkbarrier:
    Container ID:   docker://b05885b647cdb41dba4587f6f93eeb5bd19a390641687012bc017d73cc21aa79
    Image:          frameworkcontroller/frameworkbarrier
    Image ID:       docker-pullable://frameworkcontroller/frameworkbarrier@sha256:9d95e31152460e3cc5c7ad2b09738c1fdb540ff7a50abc72b2f8f9d0badb87da
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 25 Feb 2022 14:33:12 +0900
      Finished:     Fri, 25 Feb 2022 14:33:22 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexpr2ys5f9aenvzchoa
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexpr2ys5f9aenvzchoa-attempt
      FC_POD_NAME:                        nniexpr2ys5f9aenvzchoa-worker-0
      FC_FRAMEWORK_UID:                   2c55ec33-69b8-43a4-a643-84ff4e0604b2
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_CONFIGMAP_UID:                   0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_TASKROLE_UID:                    751bf95b-c6bd-4dd0-aafe-e160f9c10220
      FC_TASK_UID:                        27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Containers:
  framework:
    Container ID:  docker://dc9ff6579c67e8bc394c734e8add70fbdb581d014541044d1877a9e5d888f828
    Image:         msranni/nni:latest
    Image ID:      docker-pullable://msranni/nni@sha256:8985fb134204ef523e113ac4a572ae7460cd246a5ff471df413f7d17dd917cd1
    Port:          4000/TCP
    Host Port:     0/TCP
    Command:
      sh
      /tmp/mount/nni/r2ys5f9a/run.sh
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   sh: 0: Can't open /tmp/mount/nni/r2ys5f9a/run.sh

      Exit Code:    127
      Started:      Fri, 25 Feb 2022 14:36:43 +0900
      Finished:     Fri, 25 Feb 2022 14:36:43 +0900
    Ready:          False
    Restart Count:  5
    Limits:
      cpu:     1
      memory:  8Gi
    Requests:
      cpu:     1
      memory:  8Gi
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexpr2ys5f9aenvzchoa
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexpr2ys5f9aenvzchoa-attempt
      FC_POD_NAME:                        nniexpr2ys5f9aenvzchoa-worker-0
      FC_FRAMEWORK_UID:                   2c55ec33-69b8-43a4-a643-84ff4e0604b2
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_CONFIGMAP_UID:                   0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_TASKROLE_UID:                    751bf95b-c6bd-4dd0-aafe-e160f9c10220
      FC_TASK_UID:                        27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /tmp/mount from nni-vol (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  nni-vol:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    192.168.1.106
    Path:      /home/zerooneai/mj_lee/mount
    ReadOnly:  false
  frameworkbarrier-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  frameworkcontroller-token-7sw6q:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  frameworkcontroller-token-7sw6q
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  6m19s                 default-scheduler  Successfully assigned default/nniexpr2ys5f9aenvzchoa-worker-0 to zerooneai-p210908-4
  Normal   Pulling    6m18s                 kubelet            Pulling image "frameworkcontroller/frameworkbarrier"
  Normal   Pulled     6m15s                 kubelet            Successfully pulled image "frameworkcontroller/frameworkbarrier" in 3.364620261s
  Normal   Created    6m14s                 kubelet            Created container frameworkbarrier
  Normal   Started    6m14s                 kubelet            Started container frameworkbarrier
  Normal   Pulled     6m1s                  kubelet            Successfully pulled image "msranni/nni:latest" in 2.375328373s
  Normal   Pulled     5m56s                 kubelet            Successfully pulled image "msranni/nni:latest" in 4.709013579s
  Normal   Pulled     5m36s                 kubelet            Successfully pulled image "msranni/nni:latest" in 2.373976028s
  Normal   Pulling    5m9s (x4 over 6m4s)   kubelet            Pulling image "msranni/nni:latest"
  Normal   Created    5m7s (x4 over 6m1s)   kubelet            Created container framework
  Normal   Pulled     5m7s                  kubelet            Successfully pulled image "msranni/nni:latest" in 2.484752039s
  Normal   Started    5m6s (x4 over 6m1s)   kubelet            Started container framework
  Warning  BackOff    71s (x22 over 5m54s)  kubelet            Back-off restarting failed container

Please let me know how to solve this problem. Thanks!

Environment:

  • NNI version: 2.6
  • Training service (local|remote|pai|aml|etc): frameworkcontroller
  • Client OS: ubuntu 18.04
  • Server OS (for remote mode only):
  • Python version: 3.6.9
  • PyTorch/TensorFlow version: 1.10.1+cu102
@hviet2603

Hi, do we have a solution for this?

@amznero
Contributor

amznero commented Jul 28, 2022

I faced a similar issue (in the Kubeflow training service) and fixed it by patching trialDispatcher.ts, kubeflowEnvironmentService.ts, and kubernetesEnvironmentService.ts.

Suppose the experiment ID is ABCDE. The reason this happens is that the entry point (run.sh) and envs (which contain the execution environment) are uploaded to the root path of the NFS server, but the trial container's start command is

sh /tmp/mount/nni/ABCDE/run.sh && ...

So it raises a "can't open /tmp/mount/nni/ABCDE/run.sh" error.

P.S. In the container, the NFS path is mounted at /tmp/mount.
P.P.S. When the trial concurrency is larger than 1, run.sh will be overwritten by the other environments.
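
As a manual stopgap (not the real fix, just a sketch assuming the experiment ID is ABCDE and the NFS export root is the path from frameworkcontrollerConfig), the misplaced files can be moved to where the start command expects them:

# run on the NFS server, or on any machine with the export mounted
cd /home/mj_lee/mount          # NFS export root from the config above
EXP_ID=ABCDE                   # replace with the real experiment ID
mkdir -p nni/$EXP_ID
mv run.sh envs nni/$EXP_ID/    # move install_nni.sh too, if it is present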


I'll create a PR ASAP to fix this issue.


Related Issues: microsoft/frameworkcontroller#75, #4874, #5026.

@hviet2603

Thanks, I figured that out too. So did you also modify the TS files and build NNI again from source? I did not know about the overwriting; the overwriting of run.sh will not be a problem, right? Because, as I remember, it always creates a new env folder for each trial and runs the code there?

@amznero
Contributor

amznero commented Aug 3, 2022


Sorry for the late reply.

So did you also modify the TS files and build NNI again from source?

Yes, I modified the TS files and built the NNI wheel from source.

it always creates a new env folder for each trial and runs the code there

No, I think it only creates trialConcurrency envs in total and assigns each trial to a free env (like round-robin); maybe that's why it's called reusable?

For the overwriting problem, every env will actually run the same (the latest generated) run.sh script, and https://github.com/microsoft/nni/blob/v2.8/nni/tools/trial_tool/trial_runner.py#L164 will use the directory name as the runner ID, which eventually raises an error.

An example of run.sh:

cd /tmp/mount/nni/5nfd2kzc && mkdir -p envs/ZKtWr && cd envs/ZKtWr && sh ../install_nni.sh && python3 -m nni.tools.trial_tool.trial_runner 1>/tmp/mount/nni/5nfd2kzc/envs/ZKtWr/trialrunner_stdout 2>/tmp/mount/nni/5nfd2kzc/envs/ZKtWr/trialrunner_stderr

Every env will use ZKtWr as the runner_id.
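
To make the overwrite concrete: with trialConcurrency=2, the second environment rewrites the same file (the env ID AAAAA below is made up; ZKtWr is from the example above):

# env 1 generates /tmp/mount/nni/5nfd2kzc/run.sh:
#   cd /tmp/mount/nni/5nfd2kzc && mkdir -p envs/AAAAA && cd envs/AAAAA && ...
# env 2 then overwrites the same run.sh:
#   cd /tmp/mount/nni/5nfd2kzc && mkdir -p envs/ZKtWr && cd envs/ZKtWr && ...
# both pods execute the latest version, so both runners end up in envs/ZKtWr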

@hviet2603

@amznero At first, I moved the run.sh file to the correct experiment folder, but then the trials didn't seem to run concurrently, and yes, as you say, I also found out that the last config is applied to every worker (environment).
So my solution was to create a different run.sh file for each env, e.g. run_zktwr.sh, and also adjust the start command of each worker accordingly. This seems to make the concurrency work, but the trials then take more time than when there is only 1 worker. Is that also the case for you? If not, could I have your source code? Thank you in advance.
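
(A sketch of what that per-env workaround could look like; the file names and the second env ID are hypothetical, only ZKtWr and the experiment ID come from the example above:)

# on the NFS export, one start script per environment instead of a shared run.sh:
#   nni/5nfd2kzc/run_zktwr.sh  -> ... mkdir -p envs/ZKtWr && cd envs/ZKtWr && ...
#   nni/5nfd2kzc/run_abcde.sh  -> ... mkdir -p envs/abcde && cd envs/abcde && ...
# and each worker's container command is changed accordingly, e.g.
#   sh /tmp/mount/nni/5nfd2kzc/run_zktwr.sh
# instead of the shared
#   sh /tmp/mount/nni/5nfd2kzc/run.sh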

@amznero
Contributor

amznero commented Aug 3, 2022

@vincenthp2603

but the trials then take more time than when there's only 1 worker.

Does "time" mean training duration? If so, this scenario didn't happen to me, and I don't think the concurrency will affect the training duration.

You can freeze the random seeds (NumPy, torch, cuda, cudnn, et al.) and set worker=1 to record an experiment baseline (batch size, epochs, model parameters, training duration). Then use concurrent mode to train the model and compare it with the baseline. Maybe the training duration is related to model complexity or the training strategy (like a genetic algorithm)?


You can see my changes here: #5045.

@liuzhe-lz
Contributor

NNI v2.9 has been released.
