Remove unintended session initialization on TF backend #8377

Merged
merged 3 commits into keras-team:master from datumbox:tf_init_and_gpus on Nov 5, 2017

Conversation

6 participants
@datumbox
Contributor

datumbox commented Nov 3, 2017

This PR tries to address issues #8353, #8311, and #8374. Unfortunately, they affect one another.

A recent PR, #8021, added a check at the beginning of tensorflow_backend.py to determine the number of available GPUs. Unfortunately, the tensorflow.python.client.device_lib.list_local_devices() method seems to create a new TF Session and register all the available GPUs on the system. To resolve this, I replace all the device_lib.list_local_devices() calls with K.get_session().list_devices().
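For illustration, a minimal sketch of the replacement, assuming the TF backend (the gpu_names filter is just an example of how the result might be consumed):

# Before: spins up an extra TF session and registers every GPU as a side effect.
# from tensorflow.python.client import device_lib
# devices = device_lib.list_local_devices()

# After: reuse the session that Keras already manages.
from keras import backend as K
devices = K.get_session().list_devices()
gpu_names = [d.name for d in devices if 'gpu' in d.name.lower()]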

Moreover, in commit 9166733 (line 174), if candidate_vars is empty we call session.run() with an empty list, which causes a RuntimeError. This is addressed by checking whether the list is empty first.
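A minimal sketch of that guard, written as a standalone helper for illustration (the helper name and surrounding structure are assumptions, not the exact merged diff):

import tensorflow as tf

def _initialize_variables(session):
    # Collect variables that Keras has not marked as initialized yet.
    variables = tf.global_variables()
    candidate_vars = [v for v in variables
                      if not getattr(v, '_keras_initialized', False)]
    if not candidate_vars:
        # Nothing to do: calling session.run() with an empty fetch list is
        # what raised the error described above.
        return
    is_initialized = session.run(
        [tf.is_variable_initialized(v) for v in candidate_vars])
    uninitialized_vars = []
    for flag, v in zip(is_initialized, candidate_vars):
        if not flag:
            uninitialized_vars.append(v)
        v._keras_initialized = True
    if uninitialized_vars:
        session.run(tf.variables_initializer(uninitialized_vars))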

NOTE: The unit tests passed on my machine, but I've been having issues with timeouts on Travis today. Please check the code; don't merge directly. I'm happy to update the PR based on feedback if necessary.

datumbox added some commits Nov 3, 2017

@ahundt

Contributor

ahundt commented Nov 3, 2017

These look like nice changes.

@datumbox

Contributor

datumbox commented Nov 3, 2017

@fchollet the tests still timeout. Let me know what you think.

@fchollet
Collaborator

fchollet left a comment

Looks good to me; one question



 def _normalize_device_name(name):
-    name = name.lower().replace('device:', '')
+    name = '/' + name.lower().split('device:')[1]


@fchollet

fchollet Nov 3, 2017

Collaborator

Why would this line change at all?

Note that tests for this util are not run on Travis, so please run them manually on a multi-GPU machine.


@datumbox

datumbox Nov 3, 2017

Contributor

The output of the two functions is different for some reason, so I need to treat it differently to get the same names out.


@fchollet

fchollet Nov 3, 2017

Collaborator

Why is it different, and what does it look like in both cases?


@datumbox

datumbox Nov 4, 2017

Contributor

I have no idea why it looks different; that's on the TensorFlow side. Here is how the names look in each case:

>>> from tensorflow.python.client import device_lib
>>> l1=[d.name for d in device_lib.list_local_devices()]
>>> l1
[u'/cpu:0']
>>> from keras import backend as K
>>> l2=[d.name for d in K.get_session().list_devices()]
>>> l2
['/job:localhost/replica:0/task:0/device:CPU:0']

That's why I split on "device:", lowercase the result, and prepend a slash, so the names look the same, which is required by the GPU utils code.
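For reference, a minimal sketch of that normalization (the sample names mirror the shell output above; the exact function body is illustrative):

def _normalize_device_name(name):
    # '/job:localhost/replica:0/task:0/device:CPU:0' -> '/cpu:0'
    return '/' + name.lower().split('device:')[1]

assert _normalize_device_name('/job:localhost/replica:0/task:0/device:CPU:0') == '/cpu:0'
assert _normalize_device_name('/job:localhost/replica:0/task:0/device:GPU:1') == '/gpu:1'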


@datumbox

datumbox Nov 4, 2017

Contributor

BTW, the device_lib.list_local_devices method is extremely problematic. In earlier versions of TF it actually registered devices that were not whitelisted in CUDA_VISIBLE_DEVICES. The K.get_session().list_devices() method does the trick just fine; I've been using it in large GPU clusters (Keras + Spark) for distributed prediction and have never had an issue.

@datumbox

Contributor

datumbox commented Nov 3, 2017

@fchollet I would still be more comfortable if the tests actually passed on Travis. I'll check again on a different machine, but it might be worth rerunning them on Travis tomorrow in case there is an issue on their end today.

@fchollet

Collaborator

fchollet commented Nov 3, 2017

I'll rerun and see what happens. There is indeed a chance that this PR is causing the tests to stall on Travis.

@fchollet

Collaborator

fchollet commented Nov 3, 2017

Regardless of Travis status, please run the tests for multi_gpu_model manually on a multi-GPU machine since these aren't run on Travis at all.

@fchollet

Collaborator

fchollet commented Nov 3, 2017

The tests on Travis are reliably stalling, but it's impossible at this point to tell whether it's a Travis issue or an issue with the PR. We'll know soon.

@datumbox

Contributor

datumbox commented Nov 3, 2017

Let's agree not to merge until the tests on Travis are fixed. It's far better to close the PR than to merge something that "looks correct" but renders the tests on Travis unusable.

I just got home; I'll test the codebase with a profiler on a different machine. I'm more than happy to close the PR, send a new one, or let someone else have a go. At this point, consider that this PR sketches a potential solution, but not necessarily the best one. :)

@ozabluda

Contributor

ozabluda commented Nov 4, 2017

The Travis tests got far enough to give meaningful info. For both Python 2 and Python 3, only the TF backend stalls, in
tests/keras/utils/data_utils_test.py::test_finite_generator_enqueuer_processes

@datumbox

Contributor

datumbox commented Nov 4, 2017

@ozabluda Thanks for looking into this! Can you provide some more info on that? I wrote that test for the previous release and there is nothing backend-specific in it. Also, in all my runs the test seems to pass fine.

@datumbox

Contributor

datumbox commented Nov 4, 2017

Updates:

It's the test_inceptionresnetv2_notop test
Let's look at the log of https://travis-ci.org/datumbox/keras/jobs/297166691

  • Line 1288: This is the last time gw0 reports a PASS.
  • Line 1289: gw0 seems to start test_inceptionresnetv2_notop, but it never finishes.
  • Line 1290+: Only gw1 runs the tests; gw0 is stuck, and gw1 eventually runs out of tasks. Travis times out 10 minutes after gw1 reports its last pass.

Running it on its own works fine
I tried running the code of test_inceptionresnetv2_notop in a Python shell:
https://pastebin.com/4UqmQJs4

On a small VM with 1 CPU it takes 218 seconds, but it does finish. So it seems that test_inceptionresnetv2_notop is what stalls on Travis.

Reverting to the previous get_session() method
I started a temp branch on top of the PR where I revert commit 9166733:
datumbox@6af3b2c

Running the test in the Python shell finishes in 105 seconds; nevertheless, the tests on Travis are still stalling. Edit: this was a single measurement.

My guess is that the three reported issues are connected. By removing the GPU count from the backend, the session is no longer initialized on import but rather the first time we call get_session(). This might have all sorts of side effects. I'm running out of ideas here...

@datumbox datumbox force-pushed the datumbox:tf_init_and_gpus branch from ff3bac9 to 8201921 Nov 4, 2017

@datumbox

Contributor

datumbox commented Nov 4, 2017

I patched the damn thing...

Apparently, requesting the available devices is VERY costly. This is why @TimZaman, in his original PR, uses a cache variable that is populated only once. In hindsight, I should have understood this earlier... This is the reason the unit tests were stalling. Anyway, the only thing that needed to change from the original PR is where we actually populate the variable, and how.

The initialization now happens the first time _get_available_gpus() is called, and we use get_session().list_devices() to populate the cache rather than device_lib.list_local_devices(). I considered putting the initialization in get_session(), but there is one test that calls _get_available_gpus() directly without starting a session.
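For reference, a minimal sketch of the cached, lazy lookup described above (the cache name _LOCAL_DEVICES is illustrative, and get_session refers to the TF backend's own function):

_LOCAL_DEVICES = None  # module-level cache, populated once on first use

def _get_available_gpus():
    """Return the names of the available GPU devices, querying the session only once."""
    global _LOCAL_DEVICES
    if _LOCAL_DEVICES is None:
        _LOCAL_DEVICES = [x.name for x in get_session().list_devices()]
    return [x for x in _LOCAL_DEVICES if 'device:gpu' in x.lower()]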

@fchollet I also ran multi_gpu_model on AWS on a p2.8xlarge and it works fine. Here is proof: https://pastebin.com/7EE3Psqa

The Travis tests pass and the PR is complete from my side, so if there are no objections please merge. Also, if possible, I would suggest creating a hotfix release for this. It would be particularly useful for those who do distributed prediction using Keras + Spark.

@TimZaman

Contributor

TimZaman commented Nov 5, 2017

Reintroducing the lazy device loading and swapping out device_lib.list_local_devices() for get_session().list_devices() LGTM!

Not sure how the other code you added regarding candidate_vars relates to the above; if it doesn't, I'd always suggest keeping the PR atomic so it can merge fast.

@alsrgv

Contributor

alsrgv commented Nov 5, 2017

I just tested this fix with Horovod and it works well. Looking forward to merging and a new release.

@fchollet
Collaborator

fchollet left a comment

LGTM, thanks!

@fchollet fchollet merged commit fabb40b into keras-team:master Nov 5, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@fchollet

Collaborator

fchollet commented Nov 5, 2017

In all likelihood we will release 2.1.0 within a week. It turned out there were several bugs in 2.0.9.

datumbox referenced this pull request Nov 24, 2017

fix list_devices() function is not available issue (#8567)
* fix list_devices() function is not available issue

When using TensorFlow as the backend and the TensorFlow version is below r1.3, the list_devices() function is not available on the Session instance. This can cause issues, e.g. the keras.utils.multi_gpu_model(model, gpus) function cannot work with TensorFlow below r1.3. This change is a hack to fix that issue. (A sketch of the idea follows this list.)

* Update tensorflow_backend.py

fix pep8 check failure

* Update tensorflow_backend.py

Follow fchollet's review comment: use `if not`.
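A minimal sketch of one way such a compatibility shim could look (the helper name is hypothetical; this illustrates the idea and is not necessarily the exact code merged in #8567):

from tensorflow.python.client import device_lib

def _list_device_names(session):
    # Session.list_devices() only exists in TF >= r1.3; fall back to
    # device_lib for older versions, accepting its side effects.
    if hasattr(session, 'list_devices'):
        return [d.name for d in session.list_devices()]
    return [d.name for d in device_lib.list_local_devices()]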