Having trouble nni with frameworkcontroller on k8s again #75
Description
Hi, I ran into a new problem while running NNI with frameworkcontroller on k8s, and I created an issue about it at the link below, but it hasn't received an answer for a long time.
Can anyone help with it?
Thanks!
Details
Describe the issue:
When I tried NNI with frameworkcontroller on k8s, I used the YAML files below (I used NFS for storage).
NNI config, config_framework.yml:
```yaml
authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: .
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 3
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: 192.168.1.106
    # Your NFS server export path, like /var/nfs/nni
    path: /home/mj_lee/mount
  serviceAccountName: frameworkcontroller
```
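Before starting the experiment, it may also be worth confirming that the export in the config above is reachable and writable from the machine running nnictl. A rough check, assuming the server and path from the config (the scratch mount point /mnt/nfs-check is arbitrary, and this obviously only works where that NFS server is reachable):

```shell
# List the exports advertised by the NFS server from the config above
showmount -e 192.168.1.106

# Mount the export at a scratch location and verify write access
sudo mkdir -p /mnt/nfs-check
sudo mount -t nfs 192.168.1.106:/home/mj_lee/mount /mnt/nfs-check
touch /mnt/nfs-check/.write-check && echo "export is writable"
sudo umount /mnt/nfs-check
```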
and for the frameworkcontroller StatefulSet, frameworkcontroller-with-default-config.yaml:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller
  namespace: default
spec:
  serviceName: frameworkcontroller
  selector:
    matchLabels:
      app: frameworkcontroller
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller
    spec:
      # Using the ServiceAccount with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller
      containers:
      - name: frameworkcontroller
        image: frameworkcontroller/frameworkcontroller
        # Using k8s inClusterConfig, so usually, no need to specify
        # KUBE_APISERVER_ADDRESS or KUBECONFIG
        env:
        #- name: KUBE_APISERVER_ADDRESS
        #  value: {http[s]://host:port}
        - name: KUBECONFIG
          value: ~/.kube/config
```
I applied the StatefulSet with:
kubectl apply -f frameworkcontroller-with-default-config.yaml
and frameworkcontroller-0 came up Running. Then I created the experiment:
nnictl create --config config_framework.yml
A new experiment worker pod was created, but it failed to run. I checked its logs with kubectl logs nniexp~ and then looked at the NFS mount directory: there is no nni directory there, only an envs directory and a run.sh file. I think NNI should have created nni/experiment_id/run.sh in the mount folder.
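To make the expectation concrete, the layout the trial container is looking for can be sketched like this. This is illustrative only: it recreates the directory structure locally under a temp directory standing in for the NFS root, using the experiment id r2ys5f9a that appears in the pod spec below.

```python
import os
import tempfile

# Illustrative sketch of the layout NNI's frameworkcontroller mode is
# expected to write on the NFS share: <nfs_root>/nni/<experiment_id>/run.sh,
# which the trial pod then sees as /tmp/mount/nni/<experiment_id>/run.sh.
nfs_root = tempfile.mkdtemp()   # stands in for /home/mj_lee/mount
experiment_id = "r2ys5f9a"      # from the failing pod's command below

run_sh = os.path.join(nfs_root, "nni", experiment_id, "run.sh")
os.makedirs(os.path.dirname(run_sh), exist_ok=True)
with open(run_sh, "w") as f:
    f.write("#!/bin/sh\npython3 mnist.py\n")

# In the failing pod, the equivalent of this check is what errors out
# with "Can't open /tmp/mount/nni/r2ys5f9a/run.sh".
print(os.path.exists(run_sh))
```

In my case the share contains only envs/ and a top-level run.sh, so the path the container's sh command points at does not exist.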
Here is the kubectl describe output for the nniexp-worker-0 pod:
Name: nniexpr2ys5f9aenvzchoa-worker-0
Namespace: default
Priority: 0
Node: zerooneai-p210908-4/192.168.1.104
Start Time: Fri, 25 Feb 2022 14:33:07 +0900
Labels: FC_FRAMEWORK_NAME=nniexpr2ys5f9aenvzchoa
FC_TASKROLE_NAME=worker
FC_TASK_INDEX=0
Annotations: FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
FC_TASKROLE_NAME: worker
FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
FC_TASK_ATTEMPT_ID: 0
FC_TASK_INDEX: 0
FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
cni.projectcalico.org/podIP: 10.0.243.33/32
cni.projectcalico.org/podIPs: 10.0.243.33/32
Status: Running
IP: 10.0.243.33
IPs:
IP: 10.0.243.33
Controlled By: ConfigMap/nniexpr2ys5f9aenvzchoa-attempt
Init Containers:
frameworkbarrier:
Container ID: docker://b05885b647cdb41dba4587f6f93eeb5bd19a390641687012bc017d73cc21aa79
Image: frameworkcontroller/frameworkbarrier
Image ID: docker-pullable://frameworkcontroller/frameworkbarrier@sha256:9d95e31152460e3cc5c7ad2b09738c1fdb540ff7a50abc72b2f8f9d0badb87da
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 25 Feb 2022 14:33:12 +0900
Finished: Fri, 25 Feb 2022 14:33:22 +0900
Ready: True
Restart Count: 0
Environment:
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
FC_TASKROLE_NAME: worker
FC_TASK_INDEX: 0
FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
FC_TASK_ATTEMPT_ID: 0
FC_POD_UID: (v1:metadata.uid)
FC_TASK_ATTEMPT_INSTANCE_UID: 0_$(FC_POD_UID)
Mounts:
/mnt/frameworkbarrier from frameworkbarrier-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Containers:
framework:
Container ID: docker://dc9ff6579c67e8bc394c734e8add70fbdb581d014541044d1877a9e5d888f828
Image: msranni/nni:latest
Image ID: docker-pullable://msranni/nni@sha256:8985fb134204ef523e113ac4a572ae7460cd246a5ff471df413f7d17dd917cd1
Port: 4000/TCP
Host Port: 0/TCP
Command:
sh
/tmp/mount/nni/r2ys5f9a/run.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Message: sh: 0: Can't open /tmp/mount/nni/r2ys5f9a/run.sh
Exit Code: 127
Started: Fri, 25 Feb 2022 14:36:43 +0900
Finished: Fri, 25 Feb 2022 14:36:43 +0900
Ready: False
Restart Count: 5
Limits:
cpu: 1
memory: 8Gi
Requests:
cpu: 1
memory: 8Gi
Environment:
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
FC_TASKROLE_NAME: worker
FC_TASK_INDEX: 0
FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
FC_TASK_ATTEMPT_ID: 0
FC_POD_UID: (v1:metadata.uid)
FC_TASK_ATTEMPT_INSTANCE_UID: 0_$(FC_POD_UID)
Mounts:
/mnt/frameworkbarrier from frameworkbarrier-volume (rw)
/tmp/mount from nni-vol (rw)
/var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nni-vol:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 192.168.1.106
Path: /home/zerooneai/mj_lee/mount
ReadOnly: false
frameworkbarrier-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
frameworkcontroller-token-7sw6q:
Type: Secret (a volume populated by a Secret)
SecretName: frameworkcontroller-token-7sw6q
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m19s default-scheduler Successfully assigned default/nniexpr2ys5f9aenvzchoa-worker-0 to zerooneai-p210908-4
Normal Pulling 6m18s kubelet Pulling image "frameworkcontroller/frameworkbarrier"
Normal Pulled 6m15s kubelet Successfully pulled image "frameworkcontroller/frameworkbarrier" in 3.364620261s
Normal Created 6m14s kubelet Created container frameworkbarrier
Normal Started 6m14s kubelet Started container frameworkbarrier
Normal Pulled 6m1s kubelet Successfully pulled image "msranni/nni:latest" in 2.375328373s
Normal Pulled 5m56s kubelet Successfully pulled image "msranni/nni:latest" in 4.709013579s
Normal Pulled 5m36s kubelet Successfully pulled image "msranni/nni:latest" in 2.373976028s
Normal Pulling 5m9s (x4 over 6m4s) kubelet Pulling image "msranni/nni:latest"
Normal Created 5m7s (x4 over 6m1s) kubelet Created container framework
Normal Pulled 5m7s kubelet Successfully pulled image "msranni/nni:latest" in 2.484752039s
Normal Started 5m6s (x4 over 6m1s) kubelet Started container framework
Warning BackOff 71s (x22 over 5m54s) kubelet Back-off restarting failed container
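A few commands that may help narrow this down (pod name taken from the describe output above; kubectl exec only works in the short window while the framework container is up between restarts). One detail that may be worth double-checking: the volume Path in the describe output is /home/zerooneai/mj_lee/mount, while config_framework.yml says /home/mj_lee/mount.

```shell
# What the pod actually sees under the mount point...
kubectl exec nniexpr2ys5f9aenvzchoa-worker-0 -c framework -- ls -R /tmp/mount

# ...compared with what is on the NFS server side
ls -R /home/mj_lee/mount

# Logs of the failing framework container
kubectl logs nniexpr2ys5f9aenvzchoa-worker-0 -c framework
```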
Please let me know how to solve this problem. Thanks!
Environment:
- NNI version: 2.6
- Training service (local|remote|pai|aml|etc): frameworkcontroller
- Client OS: ubuntu 18.04
- Server OS (for remote mode only):
- Python version: 3.6.9
- PyTorch/TensorFlow version: 1.10.1+cu102



