UMAP hangs when training in a multiprocessing.Process #707

Closed

theahura opened this issue Jun 24, 2021 · 5 comments

theahura commented Jun 24, 2021

Hey Leland,

Thanks for the great library.

I've got a strange error: it looks like UMAP training completely hangs when it is run inside a multiprocessing.Process. Minimal example on Python 3.8.5:

import umap
import multiprocessing
import numpy as np
import sys
import time


def train_model(q=None):
  embeddings = np.random.rand(100, 512)
  reducer = umap.UMAP()
  print("Got reducer, about to start training")
  sys.stdout.flush()
  if not q:
    return reducer.fit_transform(embeddings)
  print("outputting to q")
  q.put(reducer.fit_transform(embeddings))
  print("output to q")


# Baseline: run training directly in the main process. This works.
start = time.time()
model_output = train_model()
print('normal took: ', time.time() - start)
print('got: ', model_output)

# Same training inside a multiprocessing.Process: this never returns.
start = time.time()
q = multiprocessing.Queue()
p = multiprocessing.Process(target=train_model, args=(q,), daemon=True)
p.start()
model_output = q.get()
print('multi took: ', time.time() - start)
print('got: ', model_output)

This results in the following output:

(env) amol@amol-small:~/code/soot/api-server/src$ python umap_multiprocessing_test.py
2021-06-24 16:09:46.233186: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-24 16:09:46.233212: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Got reducer, about to start training
normal took:  7.140857934951782
got:  [[ 5.585276  10.613853 ]
 [ 3.6862304  8.075892 ]
 [ 4.7457848  8.287621 ]
 [ 3.1373663  9.443794 ]
 [ 3.3923576  8.651798 ]
 [ 5.8636594 10.131909 ]
 [ 3.6680114 11.535476 ]
 [ 1.924135   9.987121 ]
 [ 4.9095764  8.643579 ]
 ...
 [ 4.6614685  9.943193 ]
 [ 3.5867712 10.872507 ]
 [ 4.8476524 10.628259 ]]
Got reducer, about to start training
outputting to q

after which I have to Ctrl-C because nothing happens.

Any ideas what is going on?

lmcinnes (Owner) commented

This may be an interaction between numba and multiprocessing. That's beyond my expertise, unfortunately; I don't see why it should be a problem, though. This is definitely challenging to debug. Sorry I can't be more help.

theahura (Author) commented

I opened a Stack Overflow post here: https://stackoverflow.com/questions/68131348/training-a-python-umap-model-hangs-in-a-multiprocessing-process

Hoping that brings more insight. I'll post updates as I debug.

theahura (Author) commented Jun 25, 2021

OK, so after many print statements, it looks like the hang starts here:
https://github.com/lmcinnes/umap/blob/master/umap/utils.py#L32

# (In umap/utils.py this function is decorated with @numba.njit(parallel=True),
# which is what makes numba.prange run in parallel.)
def fast_knn_indices(X, n_neighbors):
    knn_indices = np.empty((X.shape[0], n_neighbors), dtype=np.int32)
    # Never gets into the loop.
    for row in numba.prange(X.shape[0]):
        # v = np.argsort(X[row])  # Need to call argsort this way for numba
        v = X[row].argsort(kind="quicksort")
        v = v[:n_neighbors]
        knn_indices[row] = v
    return knn_indices

The shape of X at this point is (100, 100). It works fine outside of the Process; no such luck inside the Process.
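
To rule umap itself out, here is a rough numba-only version of the same pattern. This is my own reduction for illustration (row_argsort is just a made-up name, not anything in umap), but it mirrors the argsort-in-prange loop above:

import multiprocessing

import numba
import numpy as np


@numba.njit(parallel=True)
def row_argsort(X):
    # Same shape of work as fast_knn_indices: one argsort per row inside numba.prange.
    out = np.empty(X.shape, dtype=np.int64)
    for i in numba.prange(X.shape[0]):
        out[i] = np.argsort(X[i])
    return out


def worker(q):
    q.put(row_argsort(np.random.rand(100, 100)))


if __name__ == '__main__':
    row_argsort(np.random.rand(100, 100))  # run once in the parent, as in the minimal example above
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,), daemon=True)
    p.start()
    print(q.get().shape)  # hangs here too if it is the same numba/multiprocessing interaction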

Interestingly, I discovered that in the minimal example above, if I don't run the 'normal' version first, the 'multiprocess' version works fine. That's... really unintuitive. Numba might be maintaining state somewhere? Unfortunately this doesn't help me much, because I still get hangs in the multiprocessing version. More digging required...
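
If numba really is holding on to threading state set up in the parent, one thing worth trying (just a sketch, I haven't verified it against my real pipeline) is the 'spawn' start method, so the child gets a fresh interpreter instead of inheriting the parent's numba/OpenMP state via fork:

import multiprocessing

import numpy as np
import umap


def train_model(q):
    reducer = umap.UMAP()
    q.put(reducer.fit_transform(np.random.rand(100, 512)))


if __name__ == '__main__':
    # 'spawn' launches a fresh interpreter for the child, so nothing that
    # numba/OpenMP initialised in the parent is carried over by fork().
    ctx = multiprocessing.get_context('spawn')
    q = ctx.Queue()
    p = ctx.Process(target=train_model, args=(q,), daemon=True)
    p.start()
    print(q.get().shape)
    p.join()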

theahura (Author) commented

Like many umap/numba problems, switching to a different numba threading backend fixed it. I was previously using the 'workqueue' layer, which would just hang. Switching to 'omp' surfaced an actual error:

Terminating: fork() called from a process already using GNU OpenMP, this is unsafe.

Switching to 'tbb' seemed to work with the minimal example above, though I had a fair bit of trouble getting tbb to actually load (see numba/numba#7148).
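
For reference, here's roughly how to select the threading layer; either route should work as far as I can tell, and 'tbb' additionally requires the tbb package to be installed (which is what the numba issue above was about):

import os

# Route 1: environment variable, set before numba/umap are imported.
os.environ['NUMBA_THREADING_LAYER'] = 'tbb'   # or 'omp' / 'workqueue'

import numba
import numpy as np
import umap

# Route 2: numba config, also set before the first parallel function compiles.
numba.config.THREADING_LAYER = 'tbb'

reducer = umap.UMAP()
embedding = reducer.fit_transform(np.random.rand(100, 512))

# After a parallel function has run, numba reports which layer it actually used.
print(numba.threading_layer())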

I'll close this out, but this was definitely a weird interaction between numba and some other multiprocessing machinery. It seems brittle, but I'm not really sure what's to be done about it 🤔

lmcinnes (Owner) commented

Glad you found a solution, but it definitely seems brittle. In general the tbb backend seems to fix most of these problems, but sadly it is often not the default for users.
