Example test on yarn #3435

Closed
2 tasks
qiuxin2012 opened this issue Nov 9, 2021 · 29 comments

Comments

@qiuxin2012
Contributor

qiuxin2012 commented Nov 9, 2021

  1. Change the example code to add a deploymode option.
    Reference: https://github.com/intel-analytics/BigDL/blob/branch-2.0/python/dllib/src/bigdl/dllib/models/inception/inception.py, https://github.com/intel-analytics/BigDL/blob/branch-2.0/python/dllib/src/bigdl/dllib/models/lenet/lenet5.py
  2. Add the test to python/dllib/src/bigdl/dllib/examples/run-example-tests-yarn-integration.sh
  3. Run the Jenkins job: http://10.112.231.51:18888/view/ZOO-PR/job/ZOO-PR-Python-integration-test/
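
Step 1 could look roughly like the sketch below. It is an illustration only, not the code from the referenced examples: the `--deploy-mode` flag name, its choices, the `build_context_kwargs` helper and the resource numbers are assumptions made here. The returned keywords are meant to be passed on to a call such as `init_orca_context`, which is deliberately not imported so that the snippet runs without a cluster; check which keywords your version accepts.

```python
import argparse


def parse_args(argv=None):
    """Parse example arguments; --deploy-mode is a hypothetical flag name."""
    parser = argparse.ArgumentParser(description="Example with a deploy-mode option")
    parser.add_argument(
        "--deploy-mode",
        dest="deploy_mode",
        default="local",
        choices=["local", "yarn-client", "yarn-cluster"],
        help="Where to run the example: local, yarn-client or yarn-cluster",
    )
    return parser.parse_args(argv)


def build_context_kwargs(deploy_mode):
    """Map the deploy mode to keyword arguments for the context initializer."""
    kwargs = {"cluster_mode": deploy_mode}
    if deploy_mode.startswith("yarn"):
        # Illustrative resource settings only; tune them for the real cluster.
        kwargs.update({"num_nodes": 2, "cores": 2})
    return kwargs
```

An example script would call `parse_args()` once at start-up and hand `build_context_kwargs(args.deploy_mode)` to its context initializer, so the same file can be run in all three modes by the test script.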

TODO: Move run-example-tests-yarn-integration.sh to python/dllib/examples/. (xin)

  • dllib examples
  • orca examples

dllib examples, use init_nncontext

| Module | Example | Added | Client Mode | Cluster Mode |
| --- | --- | --- | --- | --- |
| autograd | custom.py | Y | Succeed | Succeed |
| autograd | customloss.py | Y | Succeed | Succeed |
| nnframes | imageInference | Y | Succeed | Succeed |
| nnframes | imageTransferLearning | Y | Succeed | Succeed |

orca examples, use init_orca_context

| Module | Example | Added | Client Mode | Cluster Mode |
| --- | --- | --- | --- | --- |
| automl | autoestimator/autoestimator_pytorch.py | Y | Succeed | Succeed |
| automl | autoxgboost/AutoXGBoostClassifier.py (intel-analytics/analytics-zoo#27) | Y | Succeed | Succeed |
| automl | autoxgboost/AutoXGBoostRegressor.py (intel-analytics/analytics-zoo#27) | Y | Succeed | Succeed |
| data | spark_pandas.py | Y | Succeed | Succeed |
| bigdl | learn/bigdl/attention/transformer.py | Y | Succeed | Failed |
| bigdl | learn/bigdl/imageInference/imageInference.py | Y | Succeed | Failed |
| horovod | learn/horovod/pytorch_estimator.py | Y | Succeed | Succeed |
| horovod | simple_horovod_pytorch.py | Y | Succeed | |
| mxnet | learn/mxnet/lenet_mnist.py | Y | Succeed | |
| openvino | learn/openvino/predict.py | Not Added | | |
| pytorch | learn/pytorch/cifar10/cifar10.py | Y | Succeed | Failed |
| pytorch | learn/pytorch/fashion_mnist/fashion_mnist.py | Y | Succeed | Failed |
| pytorch | learn/pytorch/super_resolution/super_resolution.py | Y | Succeed | Failed |
| tf | learn/tf/basic_text_classification/basic_text_classification.py | Y | Succeed | Failed |
| tf | learn/tf/image_segmentation/image_segmentation.py | Y | Succeed | Failed |
| tf | learn/tf/inception/inception.py | Y | Succeed | Failed |
| tf | learn/tf/transfer_learning/transfer_learning.py | Y | Failed | Failed |
| tf2 | learn/tf2/resnet/resnet-50-imagenet.py | Y | Failed | Failed |
| tf2 | learn/tf2/yolov3/yoloV3.py | Y | Succeed | Succeed |
| ray_on_spark | ray_on_spark/parameter_server/async_parameter_server.py | Y | Succeed | Failed |
| ray_on_spark | ray_on_spark/parameter_server/sync_parameter_server.py | Y | Succeed | Succeed |
| ray_on_spark | ray_on_spark/rl_pong/rl_pong.py | Y | Succeed | Succeed |
| ray_on_spark | ray_on_spark/rllib/multiagent_two_trainers.py | Y | Succeed | Succeed |
| tfpark | tfpark/estimator/estimator_dataset.py | Y | Succeed | Failed |
| tfpark | tfpark/estimator/estimator_inception.py | Y | Succeed | Failed |
| tfpark | tfpark/estimator/pre-made-estimator.py | Not Added | | |
| tfpark | tfpark/gan/gan_train_and_evaluate.py | Y | Failed | |
| tfpark | tfpark/keras/keras_dataset.py | Y | Succeed | Failed |
| tfpark | tfpark/keras/keras_ndarray.py | Y | Succeed | Failed |
| tfpark | tfpark/tf_optimizer/evaluate.py | Y | Succeed | Failed |
| tfpark | tfpark/tf_optimizer/train.py | Y | Succeed | Failed |
| torchmodel | torchmodel/train/imagenet/main.py | Y | Succeed | Failed |
| torchmodel | torchmodel/train/mnist/main.py | Y | Succeed | Failed |
| torchmodel | torchmodel/train/resnet_finetune/resnet_finetune.py | Y | Succeed | Failed |

@sgwhat
Contributor

sgwhat commented Nov 16, 2021

@sgwhat
Contributor

sgwhat commented Nov 17, 2021

Tfpark gan

Yarn-Client Exception

Traceback (most recent call last):
  File "/opt/work/jenkins/workspace/ZOO-PR-Python-integration-test/python/orca/example/tfpark/gan/gan_train_and_evaluate.py", line 107, in <module>
    opt.train(input_fn, MaxIteration(1000))
  File "/opt/work/conda/envs/py37-test/lib/python3.7/site-packages/bigdl/orca/tfpark/gan/gan_estimator.py", line 175, in train
    opt.optimize(end_trigger)
  File "/opt/work/conda/envs/py37-test/lib/python3.7/site-packages/bigdl/orca/tfpark/tf_optimizer.py", line 781, in optimize
    checkpoint_trigger=checkpoint_trigger)
  File "/opt/work/conda/envs/py37-test/lib/python3.7/site-packages/bigdl/dllib/estimator/estimator.py", line 167, in train_minibatch
    validation_method)
  File "/opt/work/conda/envs/py37-test/lib/python3.7/site-packages/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
    raise e
  File "/opt/work/conda/envs/py37-test/lib/python3.7/site-packages/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
    java_result = api(*args)
  File "/opt/work/conda/envs/py37-test/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/work/conda/envs/py37-test/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o76.estimatorTrainMiniBatch.
: java.lang.NullPointerException
	at com.intel.analytics.bigdl.dllib.optim.AbstractOptimizer.clearState(AbstractOptimizer.scala:241)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer.clearState(DistriOptimizer.scala:757)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1175)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1345)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1015)
	at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:190)
	at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:117)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

@pinggao187
Contributor

successfully installed decorator-5.1.0 tensorflow-datasets-2.0.0 tensorflow-gan-2.0.0 tensorflow-hub-0.12.0 tensorflow-probability-0.7.0 graphviz-0.8.4 idna-2.6 mxnet-cu91-1.2.1.post1 numpy-1.19.2 requests-2.18.4 urllib3-1.22

The Jenkins node OS is Ubuntu 16.04, so OpenVINO cannot be installed.

@shanyu-sys
Contributor

The automl tests have been added in PR intel-analytics/analytics-zoo#409 and PR intel-analytics/analytics-zoo#401.

@ManfeiBai

ManfeiBai commented Nov 19, 2021

image_segmentation.py

Yarn-Client Exception

FileNotFoundError: [Errno 2] No such file or directory: 'hdfs://172.168.2.151:9000/carvana/train.zip'

Yarn-Cluster Exception

  File "/dir3/yarn/nm_0/usercache/root/appcache/application_1635129125298_1235/container_1635129125298_1235_02_000001/python_env/lib/python3.7/zipfile.py", line 1240, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: 'hdfs://172.168.2.151:9000/carvana/train.zip'
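
Both failures come from handing an `hdfs://` URI to Python's local file APIs (the Yarn-cluster traceback shows `zipfile` calling `io.open`), which only understand local filesystem paths. A minimal sketch of a guard that detects this case before the archive is opened; the helper names are hypothetical, and copying the file out of HDFS is left to whatever tool the cluster provides:

```python
from urllib.parse import urlparse


def is_uri(path):
    """Return True if path carries a URI scheme such as hdfs://."""
    scheme = urlparse(path).scheme
    # Single-letter schemes are treated as Windows drive letters (e.g. C:\data).
    return len(scheme) > 1


def check_local_archive(path):
    """Fail early, with a clear message, if a URI reaches local file APIs."""
    if is_uri(path):
        raise ValueError(
            "%s is a URI, not a local path; copy the file to local disk "
            "before opening it with zipfile or open()" % path
        )
    return path
```

Calling `check_local_archive` on the dataset argument before `zipfile.ZipFile(...)` would turn the `FileNotFoundError` into a message that names the actual problem.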

@ManfeiBai

ManfeiBai commented Nov 19, 2021

transfer_learning.py

Yarn-Client Exception

Traceback (most recent call last):
  File "/opt/work/jenkins/workspace/ZOO-PR-Python-integration-test/python/orca/example/learn/tf/transfer_learning/transfer_learning.py", line 111, in <module>
    builder = tfds.ImageFolder(base_dir)
AttributeError: module 'tensorflow_datasets' has no attribute 'ImageFolder'

Yarn-Cluster Exception

Downloading data from https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip
2021-11-22 21:42:45 ERROR ApplicationMaster:91 - Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
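
The Yarn-client failure suggests that the installed `tensorflow_datasets` does not provide `ImageFolder` (the package list earlier in this thread shows tensorflow-datasets-2.0.0). A small sketch of a pre-flight check that reports this up front; `require_attr` is a hypothetical helper, not part of the example or of any library:

```python
def require_attr(module, attr_name):
    """Return module.attr_name, or raise a descriptive ImportError if it is missing."""
    if not hasattr(module, attr_name):
        name = getattr(module, "__name__", repr(module))
        version = getattr(module, "__version__", "unknown")
        raise ImportError(
            "%s (version %s) has no attribute %r; "
            "a different package version may be required"
            % (name, version, attr_name)
        )
    return getattr(module, attr_name)
```

In the example this would be used as `require_attr(tfds, "ImageFolder")` before the dataset builder is created, so the test log states the package version alongside the missing attribute.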

@ManfeiBai

basic_text_classification.py

Yarn-Cluster Exception

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz

@sgwhat
Contributor

sgwhat commented Nov 22, 2021

basic_text_classification.py

Yarn-Cluster Exception

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz

Downloading does not work in Yarn-cluster mode.
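
One way to avoid the download at run time is to stage the dataset in advance and have the example read the staged copy. A sketch under that assumption; the `resolve_dataset` helper is hypothetical and not part of the example being tested:

```python
import os


def resolve_dataset(local_path, allow_download=False):
    """Return a pre-staged dataset path, or None when a download is permitted.

    Raises FileNotFoundError when the file is missing and downloading is
    disabled, e.g. on a cluster whose nodes cannot reach the internet.
    """
    if os.path.isfile(local_path):
        return local_path
    if allow_download:
        # The caller falls back to its normal download routine.
        return None
    raise FileNotFoundError(
        "dataset not found at %s and downloading is disabled" % local_path
    )
```

With `allow_download=False` in Yarn-cluster runs, a missing dataset fails immediately instead of waiting on a network request from inside the container.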

@ManfeiBai

inception.py

Yarn-Cluster Exception

Traceback (most recent call last):
  File "inception.py", line 28, in <module>
    from inception_preprocessing import preprocess_for_train, \
ModuleNotFoundError: No module named 'inception_preprocessing'
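
`inception_preprocessing` appears to be a local helper module imported by `inception.py`. In Yarn-cluster mode the driver runs inside a container, so a plausible cause is that only the main script was shipped. A sketch that collects the sibling modules into the comma-separated value expected by spark-submit's `--py-files` option; the helper names are hypothetical, and whether the test script submits jobs this way is an assumption:

```python
import os


def sibling_py_files(main_script):
    """List .py files next to main_script, excluding the script itself."""
    script = os.path.abspath(main_script)
    folder = os.path.dirname(script)
    return [
        os.path.join(folder, name)
        for name in sorted(os.listdir(folder))
        if name.endswith(".py") and os.path.join(folder, name) != script
    ]


def py_files_argument(main_script):
    """Build the comma-separated value for spark-submit's --py-files option."""
    return ",".join(sibling_py_files(main_script))
```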

@sgwhat
Contributor

sgwhat commented Nov 23, 2021

Need to check in new clusters. https://github.com/intel-analytics/arda-docker/issues/511

@ManfeiBai

resnet-50-imagenet.py

Yarn-Cluster Exception

ModuleNotFoundError: No module named 'ray._private'
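
A missing `ray._private` may point at the Python environment shipped to the container rather than at the example itself (for instance, a `ray` installation that lacks that submodule; this is a guess, not something the log confirms). A sketch of a check that can be run inside the packed environment to list which required modules cannot be located; `missing_modules` is a hypothetical helper:

```python
import importlib.util


def missing_modules(names):
    """Return the subset of dotted module names that cannot be located."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package is absent or has no usable spec.
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

Running `missing_modules(["ray", "ray._private"])` in both the client environment and the environment archive sent to YARN would show whether the two differ.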

@sgwhat
Contributor

sgwhat commented Nov 23, 2021

The Orca example tests have been run in PR #3546.

@piaolaidelangman
Contributor

piaolaidelangman commented Nov 29, 2021

fashion_mnist.py

jep error

Yarn-Cluster Exception

creating: createZooKerasAccuracy
creating: createEstimator
creating: createEveryEpoch
creating: createMaxEpoch
2021-11-27 05:17:21 INFO  DistriOptimizer$:824 - caching training rdd ...
2021-11-27 05:17:50 INFO  DistriOptimizer$:650 - Cache thread models...
2021-11-27 05:17:51 INFO  DistriOptimizer$:652 - Cache thread models... done
2021-11-27 05:17:51 INFO  DistriOptimizer$:162 - Count dataset
2021-11-27 05:18:04 INFO  DistriOptimizer$:166 - Count dataset complete. Time elapsed: 12.86615251s
2021-11-27 05:18:16 WARN  DistriOptimizer$:168 - If the dataset is built directly from RDD[Minibatch], the data in each minibatch is fixed, and a single minibatch is randomly selected in each partition. If the dataset is transformed from RDD[Sample], each minibatch will be constructed on the fly from random samples, which is better for convergence.
2021-11-27 05:18:16 INFO  DistriOptimizer$:174 - config  {
	computeThresholdbatchSize: 100
	maxDropPercentage: 0.0
	warmupIterationNum: 200
	isLayerwiseScaled: false
	dropPercentage: 0.0
 }
2021-11-27 05:18:16 INFO  DistriOptimizer$:178 - Shuffle data
2021-11-27 05:18:16 INFO  DistriOptimizer$:181 - Shuffle data complete. Takes 2.89354E-4s
2021-11-27 05:18:17 ERROR DistriOptimizer$:935 - Error: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethod(KerasUtils.scala:302)
	at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethodWithEv(KerasUtils.scala:329)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalOptimizerUtil$.optimizeModels(Topology.scala:710)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:910)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1123)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:793)
	at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:190)
	at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:117)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: jep.JepException: jep.JepException: <class 'ImportError'>: /dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:108)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.optimType$lzycompute(TorchOptim.scala:55)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.optimType(TorchOptim.scala:38)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.updateHyperParameter(TorchOptim.scala:166)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$optimize$5.apply(DistriOptimizer.scala:420)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$optimize$5.apply(DistriOptimizer.scala:419)
	at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:419)
	... 23 more
Caused by: jep.JepException: <class 'ImportError'>: /dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at /dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/subprocess.<module>(subprocess.py:152)
	at /dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/platform.<module>(platform.py:116)
	at /dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/site-packages/torch/__init__.<module>(__init__.py:14)
	at <string>.<module>(<string>:2)
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply$mcV$sp(PythonInterpreter.scala:106)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

Traceback (most recent call last):
  File "fashion_mnist.py", line 196, in <module>
    main()
  File "fashion_mnist.py", line 170, in main
    checkpoint_trigger=EveryEpoch())
  File "/dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/site-packages/bigdl/orca/learn/pytorch/pytorch_spark_estimator.py", line 170, in fit
    checkpoint_trigger, val_fset, self.metrics)
  File "/dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/estimator/estimator.py", line 167, in train_minibatch
    validation_method)
  File "/dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
    raise e
  File "/dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
    java_result = api(*args)
  File "/dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o76.estimatorTrainMiniBatch.
: jep.JepException: jep.JepException: <class 'ImportError'>: /dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:108)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.optimType$lzycompute(TorchOptim.scala:55)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.optimType(TorchOptim.scala:38)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.updateHyperParameter(TorchOptim.scala:166)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$optimize$5.apply(DistriOptimizer.scala:420)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$optimize$5.apply(DistriOptimizer.scala:419)
	at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:419)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethod(KerasUtils.scala:302)
	at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethodWithEv(KerasUtils.scala:329)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalOptimizerUtil$.optimizeModels(Topology.scala:710)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:910)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1123)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:793)
	at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:190)
	at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:117)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: jep.JepException: <class 'ImportError'>: /dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at /dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/subprocess.<module>(subprocess.py:152)
	at /dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/platform.<module>(platform.py:116)
	at /dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1545/container_1635129125298_1545_02_000001/python_env/lib/python3.7/site-packages/torch/__init__.<module>(__init__.py:14)
	at <string>.<module>(<string>:2)
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply$mcV$sp(PythonInterpreter.scala:106)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

Stopping orca context
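
An `undefined symbol: _Py_write_noraise` while importing `_posixsubprocess` can mean that the extension module was built for a different CPython build than the interpreter library actually loaded in the process (here, the one embedded through jep). That diagnosis is an assumption and is not confirmed in this thread. A sketch that collects the interpreter details worth comparing between the submitting machine and the packed `python_env`; the helper name is hypothetical:

```python
import sys
import sysconfig


def interpreter_fingerprint():
    """Collect details that identify the running CPython build."""
    return {
        "version": "%d.%d.%d" % sys.version_info[:3],
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Python library name recorded at build time (may be None).
        "ldlibrary": sysconfig.get_config_var("LDLIBRARY"),
        # Suffix of compiled extension modules for this build.
        "ext_suffix": sysconfig.get_config_var("EXT_SUFFIX"),
    }
```

If the patch version or library name differs between the two environments, rebuilding the archive from the same interpreter that jep was installed against is a reasonable first thing to try.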

@piaolaidelangman
Contributor

piaolaidelangman commented Nov 29, 2021

super_resolution.py

jep error

Yarn-Cluster Exception

2021-11-27 07:18:23 INFO  DistriOptimizer$:162 - Count dataset
2021-11-27 07:18:24 INFO  DistriOptimizer$:166 - Count dataset complete. Time elapsed: 0.778880255s
2021-11-27 07:18:25 WARN  DistriOptimizer$:168 - If the dataset is built directly from RDD[Minibatch], the data in each minibatch is fixed, and a single minibatch is randomly selected in each partition. If the dataset is transformed from RDD[Sample], each minibatch will be constructed on the fly from random samples, which is better for convergence.
2021-11-27 07:18:25 INFO  DistriOptimizer$:174 - config  {
	computeThresholdbatchSize: 100
	maxDropPercentage: 0.0
	warmupIterationNum: 200
	isLayerwiseScaled: false
	dropPercentage: 0.0
 }
2021-11-27 07:18:25 INFO  DistriOptimizer$:178 - Shuffle data
2021-11-27 07:18:25 INFO  DistriOptimizer$:181 - Shuffle data complete. Takes 2.8297E-4s
2021-11-27 07:18:28 ERROR DistriOptimizer$:935 - Error: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethod(KerasUtils.scala:302)
	at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethodWithEv(KerasUtils.scala:329)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalOptimizerUtil$.optimizeModels(Topology.scala:710)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:910)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1123)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:793)
	at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:190)
	at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:117)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: jep.JepException: jep.JepException: <class 'ImportError'>: /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1551/container_1635129125298_1551_01_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:108)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.optimType$lzycompute(TorchOptim.scala:55)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.optimType(TorchOptim.scala:38)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.updateHyperParameter(TorchOptim.scala:166)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$optimize$5.apply(DistriOptimizer.scala:420)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$optimize$5.apply(DistriOptimizer.scala:419)
	at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:419)
	... 23 more
Caused by: jep.JepException: <class 'ImportError'>: /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1551/container_1635129125298_1551_01_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1551/container_1635129125298_1551_01_000001/python_env/lib/python3.7/subprocess.<module>(subprocess.py:152)
	at /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1551/container_1635129125298_1551_01_000001/python_env/lib/python3.7/platform.<module>(platform.py:116)
	at /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1551/container_1635129125298_1551_01_000001/python_env/lib/python3.7/site-packages/torch/__init__.<module>(__init__.py:14)
	at <string>.<module>(<string>:2)
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply$mcV$sp(PythonInterpreter.scala:106)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

2021-11-27 07:18:28 INFO  DistriOptimizer$:949 - Retrying 1 times
Traceback (most recent call last):
  File "super_resolution.py", line 277, in <module>
    checkpoint_trigger=EveryEpoch())
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1551/container_1635129125298_1551_01_000001/python_env/lib/python3.7/site-packages/bigdl/orca/learn/pytorch/pytorch_spark_estimator.py", line 170, in fit
    checkpoint_trigger, val_fset, self.metrics)
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1551/container_1635129125298_1551_01_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/estimator/estimator.py", line 167, in train_minibatch
    validation_method)
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1551/container_1635129125298_1551_01_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
    raise e
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1551/container_1635129125298_1551_01_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
    java_result = api(*args)
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1551/container_1635129125298_1551_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1551/container_1635129125298_1551_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o76.estimatorTrainMiniBatch.
: java.lang.NullPointerException
	at com.intel.analytics.bigdl.dllib.optim.AbstractOptimizer.clearState(AbstractOptimizer.scala:241)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer.clearState(DistriOptimizer.scala:757)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:953)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1123)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:793)
	at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:190)
	at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:117)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Stopping orca context
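
In this log the exception that reaches Python is the `NullPointerException` from `clearState`, reported after the "Retrying 1 times" entry, while the earlier ERROR entry carries the jep `ImportError`. When scanning long logs like this, a small script can pull out the `Caused by:` lines so the first failure is not overlooked. A sketch, with a hypothetical helper name:

```python
import re

CAUSED_BY = re.compile(r"^\s*Caused by: (.+)$", re.MULTILINE)


def caused_by_lines(log_text):
    """Return the distinct 'Caused by:' messages in order of first appearance."""
    seen = []
    for message in CAUSED_BY.findall(log_text):
        message = message.strip()
        if message not in seen:
            seen.append(message)
    return seen
```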

@piaolaidelangman
Contributor

piaolaidelangman commented Nov 29, 2021

mnist.py

jep error

Yarn-Cluster Exception

Traceback (most recent call last):
  File "main.py", line 122, in <module>
    main()
  File "main.py", line 118, in main
    validation_method=[Accuracy()])
  File "/dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1547/container_1635129125298_1547_01_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/estimator/estimator.py", line 167, in train_minibatch
    validation_method)
  File "/dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1547/container_1635129125298_1547_01_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
    raise e
  File "/dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1547/container_1635129125298_1547_01_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
    java_result = api(*args)
  File "/dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1547/container_1635129125298_1547_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1547/container_1635129125298_1547_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o76.estimatorTrainMiniBatch.
: jep.JepException: jep.JepException: <class 'ImportError'>: /dir2/yarn/nm_0/usercache/root/appcache/application_1635129125298_1547/container_1635129125298_1547_01_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)

@piaolaidelangman

piaolaidelangman commented Nov 29, 2021

resnet_finetune.py

jep error

Yarn-Cluster Exception

2021-11-29 04:42:48 INFO  DistriOptimizer$:473 - [Epoch 1 2048/2034][Iteration 128][Wall Clock 160.670773111s] Epoch finished. Wall clock time is 160816.21661 ms
2021-11-29 04:42:48 INFO  DistriOptimizer$:112 - [Epoch 1 2048/2034][Iteration 128][Wall Clock 160.670773111s] Validate model...
2021-11-29 04:43:00 INFO  DistriOptimizer$:178 - [Epoch 1 2048/2034][Iteration 128][Wall Clock 160.670773111s] validate model throughput is 15.346056 records/second
2021-11-29 04:43:00 INFO  DistriOptimizer$:181 - [Epoch 1 2048/2034][Iteration 128][Wall Clock 160.670773111s] Top1Accuracy is Accuracy(correct: 178, count: 188, accuracy: 0.9468085106382979)
creating: createToTuple
creating: createChainedPreprocessing
Traceback (most recent call last):
  File "resnet_finetune.py", line 112, in <module>
    predictionDF = catdogModel.transform(validationDF) \
  File "/dir1/yarn/nm_0/usercache/root/appcache/application_1635129125298_1632/container_1635129125298_1632_01_000001/pyspark.zip/pyspark/ml/base.py", line 173, in transform
  File "/dir1/yarn/nm_0/usercache/root/appcache/application_1635129125298_1632/container_1635129125298_1632_01_000001/pyspark.zip/pyspark/ml/wrapper.py", line 312, in _transform
  File "/dir1/yarn/nm_0/usercache/root/appcache/application_1635129125298_1632/container_1635129125298_1632_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/dir1/yarn/nm_0/usercache/root/appcache/application_1635129125298_1632/container_1635129125298_1632_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/dir1/yarn/nm_0/usercache/root/appcache/application_1635129125298_1632/container_1635129125298_1632_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o618.transform.
: jep.JepException: jep.JepException: <class 'ImportError'>: /dir1/yarn/nm_0/usercache/root/appcache/application_1635129125298_1632/container_1635129125298_1632_01_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:108)
	at com.intel.analytics.bigdl.orca.net.TorchModel.load$lzycompute(TorchModel.scala:70)
	at com.intel.analytics.bigdl.orca.net.TorchModel.load(TorchModel.scala:50)
	at com.intel.analytics.bigdl.orca.net.TorchModel.evaluate(TorchModel.scala:201)
	at com.intel.analytics.bigdl.orca.net.TorchModel.evaluate(TorchModel.scala:34)
	at com.intel.analytics.bigdl.dllib.nnframes.NNModel.internalTransform(NNEstimator.scala:718)
	at org.apache.spark.ml.DLTransformerBase.transform(DLTransformerBase.scala:37)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: jep.JepException: <class 'ImportError'>: /dir1/yarn/nm_0/usercache/root/appcache/application_1635129125298_1632/container_1635129125298_1632_01_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at /dir1/yarn/nm_0/usercache/root/appcache/application_1635129125298_1632/container_1635129125298_1632_01_000001/python_env/lib/python3.7/subprocess.<module>(subprocess.py:152)
	at /dir1/yarn/nm_0/usercache/root/appcache/application_1635129125298_1632/container_1635129125298_1632_01_000001/python_env/lib/python3.7/platform.<module>(platform.py:116)
	at /dir1/yarn/nm_0/usercache/root/appcache/application_1635129125298_1632/container_1635129125298_1632_01_000001/python_env/lib/python3.7/site-packages/torch/__init__.<module>(__init__.py:14)
	at <string>.<module>(<string>:2)
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply$mcV$sp(PythonInterpreter.scala:106)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

Stopping orca context
2021-11-29 04:43:01 ERROR ApplicationMaster:70 - User application exited with status 1

@piaolaidelangman

cifar10.py

jep error

Yarn-Cluster Error

2021-11-29 08:59:09 INFO  DistriOptimizer$:178 - Shuffle data
2021-11-29 08:59:09 INFO  DistriOptimizer$:181 - Shuffle data complete. Takes 2.65277E-4s
2021-11-29 08:59:09 ERROR DistriOptimizer$:935 - Error: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethod(KerasUtils.scala:302)
	at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethodWithEv(KerasUtils.scala:329)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalOptimizerUtil$.optimizeModels(Topology.scala:710)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:910)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1123)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:793)
	at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:190)
	at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:117)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: jep.JepException: jep.JepException: <class 'ImportError'>: /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:108)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.optimType$lzycompute(TorchOptim.scala:55)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.optimType(TorchOptim.scala:38)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.updateHyperParameter(TorchOptim.scala:166)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$optimize$5.apply(DistriOptimizer.scala:420)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$optimize$5.apply(DistriOptimizer.scala:419)
	at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:419)
	... 23 more
Caused by: jep.JepException: <class 'ImportError'>: /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/subprocess.<module>(subprocess.py:152)
	at /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/platform.<module>(platform.py:116)
	at /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/site-packages/torch/__init__.<module>(__init__.py:14)
	at <string>.<module>(<string>:2)
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply$mcV$sp(PythonInterpreter.scala:106)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

Traceback (most recent call last):
  File "cifar10.py", line 162, in <module>
    checkpoint_trigger=EveryEpoch())
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/site-packages/bigdl/orca/learn/pytorch/pytorch_spark_estimator.py", line 170, in fit
    checkpoint_trigger, val_fset, self.metrics)
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/estimator/estimator.py", line 167, in train_minibatch
    validation_method)
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
    raise e
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/site-packages/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
    java_result = api(*args)
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o76.estimatorTrainMiniBatch.
: jep.JepException: jep.JepException: <class 'ImportError'>: /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:108)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.optimType$lzycompute(TorchOptim.scala:55)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.optimType(TorchOptim.scala:38)
	at com.intel.analytics.bigdl.orca.net.TorchOptim.updateHyperParameter(TorchOptim.scala:166)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$optimize$5.apply(DistriOptimizer.scala:420)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$optimize$5.apply(DistriOptimizer.scala:419)
	at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:419)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethod(KerasUtils.scala:302)
	at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethodWithEv(KerasUtils.scala:329)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalOptimizerUtil$.optimizeModels(Topology.scala:710)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:910)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1123)
	at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:793)
	at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:190)
	at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:117)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: jep.JepException: <class 'ImportError'>: /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/lib-dynload/_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise
	at /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/subprocess.<module>(subprocess.py:152)
	at /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/platform.<module>(platform.py:116)
	at /dir5/yarn/nm_0/usercache/root/appcache/application_1635129125298_1644/container_1635129125298_1644_02_000001/python_env/lib/python3.7/site-packages/torch/__init__.<module>(__init__.py:14)
	at <string>.<module>(<string>:2)
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply$mcV$sp(PythonInterpreter.scala:106)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$1.apply(PythonInterpreter.scala:105)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

Stopping orca context
2021-11-29 08:59:10 ERROR ApplicationMaster:70 - User application exited with status 1
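
The three yarn-cluster logs above (mnist.py, resnet_finetune.py, cifar10.py) all end in the same ImportError: `_posixsubprocess.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _Py_write_noraise`. When triaging more logs like these, it may help to pull out the library and the missing symbol automatically so failures can be grouped. The following is only a rough sketch: `find_undefined_symbol` and its regex are hypothetical helpers written for illustration and are not part of BigDL, jep, or py4j.

```python
import re
from typing import Optional, Tuple

# Matches the dynamic-loader message embedded in the JepException lines:
#   <path to shared object>: undefined symbol: <symbol name>
_UNDEFINED_SYMBOL = re.compile(
    r"(?P<lib>\S+\.so(?:\.\d+)*): undefined symbol: (?P<symbol>\w+)"
)


def find_undefined_symbol(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (shared-object file name, symbol) for the first match, else None.

    Hypothetical helper for log triage; assumes the loader message format
    seen in the logs above.
    """
    match = _UNDEFINED_SYMBOL.search(log_text)
    if match is None:
        return None
    # Keep only the file name: the YARN container prefix differs per run.
    lib_name = match.group("lib").rsplit("/", 1)[-1]
    return lib_name, match.group("symbol")
```

Grouping logs by the returned pair would show whether a new failure is the same jep problem or something different, without comparing container paths by hand.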

@piaolaidelangman

The jep error is now tracked in a new issue.

@piaolaidelangman

The Transfer_learning.py error is reported in this issue.
The Yolov3.py error is reported in this issue.
