SDK: wait_for_job reports typeError #1445

Closed
Jeffwan opened this issue Oct 9, 2021 · 8 comments

@Jeffwan
Member

Jeffwan commented Oct 9, 2021

I followed the instructions in https://github.com/kubeflow/training-operator/blob/master/sdk/python/examples/kubeflow-tfjob-sdk.ipynb, and when I ran tfjob_client.wait_for_job('mnist', namespace=namespace, watch=True) I got the following error. The rest of the steps work well. I think it's probably a Python and dependency compatibility issue.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/tx/s3hg89kj56xcyhdhw_1ykwy00000gq/T/ipykernel_99459/444279540.py in <module>
----> 1 tfjob_client.wait_for_job('mnist', namespace=namespace, watch=True)

/usr/local/lib/python3.9/site-packages/kubeflow/training/api/tf_job_client.py in wait_for_job(self, name, namespace, timeout_seconds, polling_interval, watch, status_callback)
    240 
    241         if watch:
--> 242             tfjob_watch(
    243                 name=name,
    244                 namespace=namespace,

/usr/local/lib/python3.9/site-packages/retrying.py in wrapped_f(*args, **kw)
     47             @six.wraps(f)
     48             def wrapped_f(*args, **kw):
---> 49                 return Retrying(*dargs, **dkw).call(f, *args, **kw)
     50 
     51             return wrapped_f

/usr/local/lib/python3.9/site-packages/retrying.py in call(self, fn, *args, **kwargs)
    210                 if not self._wrap_exception and attempt.has_exception:
    211                     # get() on an attempt with an exception should cause it to be raised, but raise just in case
--> 212                     raise attempt.get()
    213                 else:
    214                     raise RetryError(attempt)

/usr/local/lib/python3.9/site-packages/retrying.py in get(self, wrap_exception)
    245                 raise RetryError(self)
    246             else:
--> 247                 six.reraise(self.value[0], self.value[1], self.value[2])
    248         else:
    249             return self.value

/usr/local/lib/python3.9/site-packages/six.py in reraise(tp, value, tb)
    717             if value.__traceback__ is not tb:
    718                 raise value.with_traceback(tb)
--> 719             raise value
    720         finally:
    721             value = None

/usr/local/lib/python3.9/site-packages/retrying.py in call(self, fn, *args, **kwargs)
    198         while True:
    199             try:
--> 200                 attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
    201             except:
    202                 tb = sys.exc_info()

/usr/local/lib/python3.9/site-packages/kubeflow/training/api/tf_job_watch.py in watch(name, namespace, timeout_seconds)
     54             update_time = last_condition.get('lastTransitionTime', '')
     55 
---> 56             tbl(tfjob_name, status, update_time)
     57 
     58             if name == tfjob_name:

/usr/local/lib/python3.9/site-packages/table_logger/table_logger.py in __call__(self, *args)
    202 
    203         line = self.format_row(*row_cells)
--> 204         self.print_line(line)
    205 
    206     def format_row(self, *args):

/usr/local/lib/python3.9/site-packages/table_logger/table_logger.py in print_line(self, text)
    306 
    307     def print_line(self, text):
--> 308         self.file.write(text.encode(self.encoding))
    309         self.file.write(b'\n')
    310         self.file.flush()

/usr/local/lib/python3.9/site-packages/ipykernel/iostream.py in write(self, string)
    499 
    500         if not isinstance(string, str):
--> 501             raise TypeError(
    502                 f"write() argument must be str, not {type(string)}"
    503             )

TypeError: write() argument must be str, not <class 'bytes'>


/cc @alembiewski @terrytangyuan

Env:

  • Docker Desktop on Mac
  • SDK in master branch
@Jeffwan
Member Author

Jeffwan commented Oct 9, 2021

Looking at the call chain:


/usr/local/lib/python3.9/site-packages/retrying.py in wrapped_f(*args, **kw)
     47             @six.wraps(f)
     48             def wrapped_f(*args, **kw):
---> 49                 return Retrying(*dargs, **dkw).call(f, *args, **kw)
     50 
     51             return wrapped_f
➜  ~ python3 -m pip show retrying
Name: retrying
Version: 1.3.3
Summary: Retrying
Home-page: https://github.com/rholder/retrying
Author: Ray Holder
Author-email:
License: Apache 2.0
Location: /usr/local/lib/python3.9/site-packages
Requires: six
Required-by: kubeflow-training

We have a recent change, #1439, but I doubt it's related. I did a freeze and found retrying==1.3.3, but it seems existing dependencies install it implicitly.

Any clues?
/cc @alembiewski @terrytangyuan

@alembiewski
Member

alembiewski commented Oct 9, 2021

I think it might be an ipykernel and table-logger compatibility issue. But it is definitely not related to the retrying package: the watch method is annotated to retry on error, which is why we see retrying in the stack trace. FTR, I was able to run the notebook in question end-to-end without any issues in the following env:

kubeflow@tensorflow-0:~$ jupyter --version
jupyter core     : 4.7.1
jupyter-notebook : 6.4.2
qtconsole        : not installed
ipython          : 7.26.0
ipykernel        : 5.5.5
jupyter client   : 6.1.12
jupyter lab      : 3.0.16
nbconvert        : 6.0.7
ipywidgets       : 7.6.3
nbformat         : 5.0.8
traitlets        : 4.3.3
kubeflow@tensorflow-0:~$ python --version
Python 3.7.6
kubeflow@tensorflow-0:~$ pip show retrying
Name: retrying
Version: 1.3.3
Summary: Retrying
Home-page: https://github.com/rholder/retrying
Author: Ray Holder
Author-email: UNKNOWN
License: Apache 2.0
Location: /opt/conda/lib/python3.7/site-packages
Requires: six
Required-by: kubeflow-training

Can you share the output of jupyter --version?

@Jeffwan
Member Author

Jeffwan commented Oct 10, 2021

@alembiewski Yeah, I think it's probably something else. My python3 version is higher; I can switch to 3.7 and try again.

Here's my env.

✗ jupyter --version
jupyter core     : 4.7.1
jupyter-notebook : 6.4.3
qtconsole        : 5.1.1
ipython          : 7.27.0
ipykernel        : 6.2.0
jupyter client   : 7.0.1
jupyter lab      : not installed
nbconvert        : 6.1.0
ipywidgets       : 7.6.3
nbformat         : 5.1.3
traitlets        : 5.0.5


@alembiewski
Member

alembiewski commented Oct 10, 2021

Hey @Jeffwan, I dug a bit deeper into this issue, and it seems that the table-logger library used in the watch method of tf_job_watch.py is not compatible with ipykernel 6.0.0 and up. So most likely, if you downgrade ipykernel to 5.x, it will work in your env. Sharing some findings below.

Here is the commit that changed the logic of the write method: ipython/ipykernel@1a50cda. It now expects a string argument and raises an error otherwise. And if we look at the code from table-logger, we can clearly see it writes bytes, not a string.
I'm not sure what the best way to proceed is, but it seems that table-logger is no longer maintained (the last commit was in Aug 2019), so we might consider dropping it and looking for alternatives.
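To make the mismatch concrete, here is a minimal sketch that does not touch ipykernel or table-logger at all. `StrOnlyStream` is a hypothetical stand-in that mimics the str-only check visible in the traceback above, and `BytesToTextAdapter` is a hypothetical wrapper that decodes bytes before forwarding them. Whether table-logger can be pointed at such a wrapper is an assumption that has not been verified here.

```python
import io


class StrOnlyStream:
    """Hypothetical stand-in for a stream that rejects bytes,
    mirroring the check shown in the traceback."""

    def __init__(self):
        self._buf = io.StringIO()

    def write(self, string):
        if not isinstance(string, str):
            raise TypeError(
                f"write() argument must be str, not {type(string)}"
            )
        return self._buf.write(string)

    def flush(self):
        pass

    def getvalue(self):
        return self._buf.getvalue()


class BytesToTextAdapter:
    """Hypothetical wrapper: decodes bytes, then forwards to a text stream."""

    def __init__(self, stream, encoding="utf-8"):
        self._stream = stream
        self._encoding = encoding

    def write(self, data):
        # Writers that call write(text.encode(...)) hand us bytes;
        # decode them so a str-only stream accepts the call.
        if isinstance(data, bytes):
            data = data.decode(self._encoding)
        return self._stream.write(data)

    def flush(self):
        self._stream.flush()
```

Writing `b"..."` straight to `StrOnlyStream` raises the same kind of TypeError as in the report, while going through `BytesToTextAdapter` succeeds.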

Do you think it's a blocker for releasing the SDK to PyPI? It looks like the tfjob_client.wait_for_job method provides a watch flag that can be used to disable log printing via table-logger, and thus bypass this issue.
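A minimal sketch of that workaround, assuming (as suggested in this thread, not verified against every SDK version) that only the watch=True branch of wait_for_job prints through table-logger. The helper name `wait_without_watch` is hypothetical; the wait_for_job keyword arguments are the ones shown in the traceback.

```python
def wait_without_watch(tfjob_client, name, namespace):
    """Wait for a TFJob with watch=False so the table-logger path is skipped."""
    # Assumption from this thread: the watch=False branch does not call
    # table-logger, so the bytes-vs-str TypeError is never reached.
    return tfjob_client.wait_for_job(name, namespace=namespace, watch=False)


# In the notebook from this issue, where tfjob_client and namespace
# are already defined:
# wait_without_watch(tfjob_client, 'mnist', namespace)
```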

@Jeffwan
Member Author

Jeffwan commented Oct 11, 2021

@alembiewski I think this should not be a blocker. I checked the Kubeflow notebook image and it still uses a lower version.

https://github.com/kubeflow/kubeflow/blob/19140259241a505d0262a392f691d47616d45fe5/components/example-notebook-servers/jupyter/requirements.txt#L3

Really appreciate your findings and we should be good to go. @terrytangyuan @kubeflow/wg-training-leads WDYT?

> It looks like tfjob_client.wait_for_job method provides a special watch flag that can be used to disable logs printing using table-logger, and thus bypass this issue

Sounds like a plan. Could you help create an issue to track it? I think this can be done later.

@terrytangyuan
Member

Yes, agreed, it's not a blocker. We can probably fix this by adding logic around imports for different versions of the problematic module.
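One way such version-dependent logic could look, as a rough sketch only. All function names here are hypothetical, and the "6.0.0 and up" cutoff is taken from the findings earlier in this thread rather than from the SDK. It uses importlib.metadata from the standard library (Python 3.8+).

```python
from importlib import metadata


def stream_requires_str(version):
    """True for ipykernel 6.x and newer, the range reported as
    incompatible with table-logger in this thread."""
    major = int(version.split(".")[0])
    return major >= 6


def installed_ipykernel_version():
    """Return the installed ipykernel version string, or None if absent."""
    try:
        return metadata.version("ipykernel")
    except metadata.PackageNotFoundError:
        return None


def should_use_table_logger():
    """Decide whether printing through table-logger is likely to be safe.

    Simplification: this only checks what is installed, not whether the
    code is actually running inside an ipykernel session.
    """
    version = installed_ipykernel_version()
    return version is None or not stream_requires_str(version)
```

A caller could fall back to plain print() when should_use_table_logger() returns False.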

@johnugeorge
Member

Agree

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
@stale stale bot closed this as completed Apr 17, 2022