UMAP hangs when training in a multiprocessing.Process #707

Closed

theahura opened this issue Jun 24, 2021 · 5 comments

theahura commented Jun 24, 2021

Hey Leland,

Thanks for the great library.

I've got a strange error: it looks like UMAP training completely hangs when it is run inside a multiprocessing.Process. Minimal example on Python 3.8.5:

import umap
import multiprocessing
import numpy as np
import sys
import time


def train_model(q=None):
  embeddings = np.random.rand(100, 512)
  reducer = umap.UMAP()
  print("Got reducer, about to start training")
  sys.stdout.flush()
  if not q:
    return reducer.fit_transform(embeddings)
  print("outputting to q")
  q.put(reducer.fit_transform(embeddings))
  print("output to q")


# Baseline: run training directly in the main process. This works.
start = time.time()
model_output = train_model()
print('normal took: ', time.time() - start)
print('got: ', model_output)

# Same training inside a multiprocessing.Process: this never returns.
start = time.time()
q = multiprocessing.Queue()
p = multiprocessing.Process(target=train_model, args=(q,), daemon=True)
p.start()
model_output = q.get()
print('multi took: ', time.time() - start)
print('got: ', model_output)

This results in the following output:

(env) amol@amol-small:~/code/soot/api-server/src$ python umap_multiprocessing_test.py
2021-06-24 16:09:46.233186: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-24 16:09:46.233212: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Got reducer, about to start training
normal took:  7.140857934951782
got:  [[ 5.585276  10.613853 ]
 [ 3.6862304  8.075892 ]
 [ 4.7457848  8.287621 ]
 [ 3.1373663  9.443794 ]
 [ 3.3923576  8.651798 ]
 [ 5.8636594 10.131909 ]
 [ 3.6680114 11.535476 ]
 [ 1.924135   9.987121 ]
 [ 4.9095764  8.643579 ]
 ...
 [ 4.6614685  9.943193 ]
 [ 3.5867712 10.872507 ]
 [ 4.8476524 10.628259 ]]
Got reducer, about to start training
outputting to q

after which I have to Ctrl-C because nothing happens.

Any ideas what is going on?

lmcinnes (Owner) commented

This may be an interaction between numba and multiprocessing. That's beyond my expertise, unfortunately; I don't see why it should be a problem, though. This is definitely challenging to debug. Sorry I can't be more help.

theahura (Author) commented

I opened a Stack Overflow post here: https://stackoverflow.com/questions/68131348/training-a-python-umap-model-hangs-in-a-multiprocessing-process

Hoping that brings more insight. I'll post updates as I debug.

theahura (Author) commented Jun 25, 2021

OK, so after many print statements, it looks like the hang starts here:
https://github.com/lmcinnes/umap/blob/master/umap/utils.py#L32

# (In umap/utils.py this function is decorated with @numba.njit(parallel=True),
# which is what makes numba.prange run in parallel.)
def fast_knn_indices(X, n_neighbors):
    knn_indices = np.empty((X.shape[0], n_neighbors), dtype=np.int32)
    # Never gets into the loop.
    for row in numba.prange(X.shape[0]):
        # v = np.argsort(X[row])  # Need to call argsort this way for numba
        v = X[row].argsort(kind="quicksort")
        v = v[:n_neighbors]
        knn_indices[row] = v
    return knn_indices

The shape of X at this point is (100, 100). It works fine outside of the Process; no such luck inside the Process.
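
To rule umap itself out, here is a rough numba-only version of the same pattern. This is my own reduction for illustration (row_argsort is just a made-up name, not anything in umap), but it mirrors the argsort-in-prange loop above:

import multiprocessing

import numba
import numpy as np


@numba.njit(parallel=True)
def row_argsort(X):
    # Same shape of work as fast_knn_indices: one argsort per row inside numba.prange.
    out = np.empty(X.shape, dtype=np.int64)
    for i in numba.prange(X.shape[0]):
        out[i] = np.argsort(X[i])
    return out


def worker(q):
    q.put(row_argsort(np.random.rand(100, 100)))


if __name__ == '__main__':
    row_argsort(np.random.rand(100, 100))  # run once in the parent, as in the minimal example above
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,), daemon=True)
    p.start()
    print(q.get().shape)  # hangs here too if it is the same numba/multiprocessing interaction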

Interestingly, I discovered that in the minimal example above, if I don't run the 'normal' version first, the 'multiprocess' version works fine. That's... really unintuitive. Numba might be maintaining state somewhere? Unfortunately this doesn't help me much, because I still get hangs in the multiprocessing version. More digging required...
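
If numba really is holding on to threading state set up in the parent, one thing worth trying (just a sketch, I haven't verified it against my real pipeline) is the 'spawn' start method, so the child gets a fresh interpreter instead of inheriting the parent's numba/OpenMP state via fork:

import multiprocessing

import numpy as np
import umap


def train_model(q):
    reducer = umap.UMAP()
    q.put(reducer.fit_transform(np.random.rand(100, 512)))


if __name__ == '__main__':
    # 'spawn' launches a fresh interpreter for the child, so nothing that
    # numba/OpenMP initialised in the parent is carried over by fork().
    ctx = multiprocessing.get_context('spawn')
    q = ctx.Queue()
    p = ctx.Process(target=train_model, args=(q,), daemon=True)
    p.start()
    print(q.get().shape)
    p.join()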

theahura (Author) commented

Like many umap/numba problems, switching to a different numba threading backend fixed it. I was previously using the 'workqueue' layer, which would just hang. Switching to 'omp' surfaced an actual error:

Terminating: fork() called from a process already using GNU OpenMP, this is unsafe.

Switching to 'tbb' seemed to work with the minimal example above, though I had a fair bit of trouble getting tbb to actually load (see numba/numba#7148).
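
For reference, here's roughly how to select the threading layer; either route should work as far as I can tell, and 'tbb' additionally requires the tbb package to be installed (which is what the numba issue above was about):

import os

# Route 1: environment variable, set before numba/umap are imported.
os.environ['NUMBA_THREADING_LAYER'] = 'tbb'   # or 'omp' / 'workqueue'

import numba
import numpy as np
import umap

# Route 2: numba config, also set before the first parallel function compiles.
numba.config.THREADING_LAYER = 'tbb'

reducer = umap.UMAP()
embedding = reducer.fit_transform(np.random.rand(100, 512))

# After a parallel function has run, numba reports which layer it actually used.
print(numba.threading_layer())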

I'll close this out, but this was definitely a weird interaction between numba and some other multiprocessing machinery. It seems brittle, but I'm not really sure what's to be done about it 🤔

lmcinnes (Owner) commented

Glad you found a solution, but it definitely seems brittle. In general the tbb backend seems to fix most of these problems, but sadly it is often not the default for users.
