ValueError: cannot assign slice from input of different size #1008

Open
Jiawei-Xing opened this issue May 13, 2023 · 8 comments

@Jiawei-Xing commented May 13, 2023

Hi, I want to use UMAP on a large distance matrix (369911 × 369911). I followed the first example of "UMAP on sparse data" from the tutorial (I've tried both LIL and CSR sparse matrices). The code worked well on a smaller sample dataset but failed on my large matrix. The sparse matrix is ~9 GB, and I was running it on an HPC node with 10 CPUs (~30 GB of memory). The low_memory option was set to True.
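
Roughly what I am running (simplified; the loading step and file name are illustrative, the rest follows the tutorial example):

```python
import scipy.sparse
import umap

# my 369911 x 369911 pairwise distance matrix, ~9 GB in CSR format
matrix = scipy.sparse.load_npz("distances.npz")  # illustrative file name

reducer = umap.UMAP(low_memory=True)
mapper = reducer.fit(matrix)
```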

Traceback (most recent call last):
  File "umap.py", line 51, in <module>
    mapper = reducer.fit(matrix)
  File "/home/xing.232/.local/lib/python3.7/site-packages/umap/umap_.py", line 2526, in fit
    verbose=self.verbose,
  File "/home/xing.232/.local/lib/python3.7/site-packages/umap/umap_.py", line 340, in nearest_neighbors
    compressed=False,
  File "/home/xing.232/.local/lib/python3.7/site-packages/pynndescent/pynndescent_.py", line 804, in __init__
    leaf_array = rptree_leaf_array(self._rp_forest)
  File "/home/xing.232/.local/lib/python3.7/site-packages/pynndescent/rp_trees.py", line 1097, in rptree_leaf_array
    return np.vstack(rptree_leaf_array_parallel(rp_forest))
  File "/home/xing.232/.local/lib/python3.7/site-packages/pynndescent/rp_trees.py", line 1090, in rptree_leaf_array_parallel
    joblib.delayed(get_leaves_from_tree)(rp_tree) for rp_tree in rp_forest
  File "/home/xing.232/.local/lib/python3.7/site-packages/joblib/parallel.py", line 1098, in __call__
    self.retrieve()
  File "/home/xing.232/.local/lib/python3.7/site-packages/joblib/parallel.py", line 975, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/usr/local/anaconda3-2020.02/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/usr/local/anaconda3-2020.02/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/xing.232/.local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 620, in __call__
    return self.func(*args, **kwargs)
  File "/home/xing.232/.local/lib/python3.7/site-packages/joblib/parallel.py", line 289, in __call__
    for func, args, kwargs in self.items]
  File "/home/xing.232/.local/lib/python3.7/site-packages/joblib/parallel.py", line 289, in <listcomp>
    for func, args, kwargs in self.items]
ValueError: cannot assign slice from input of different size
@lmcinnes (Owner)

I believe this is an issue related to compilation caching in pynndescent. If you reinstall pynndescent, preferably directly from GitHub, it should resolve the issue.
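
For example, something like `pip install --force-reinstall git+https://github.com/lmcinnes/pynndescent.git` should pull the current master.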

To be clear, however, if pynndescent is running then it is computing nearest neighbors of vectors, so it is treating your distance matrix as a large set of sparse vectors. That probably isn't what you want, so I would check that this is actually what you intended.

@Jiawei-Xing (Author)

Thank you for your quick response! You are right, I should use metric="precomputed" to fit the distance matrix.
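
For reference, something like this (a minimal sketch; `distance_matrix` stands in for my actual matrix):

```python
import umap

# interpret the input as pairwise distances rather than feature vectors
reducer = umap.UMAP(metric="precomputed")
mapper = reducer.fit(distance_matrix)  # distance_matrix: my precomputed distances
```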

A related question about the matrix input: my original similarity matrix is sparse, but when I convert it to distances (1 - similarity), most elements become 1. This causes memory issues because the matrix is no longer sparse. Is it possible to fit the model with a similarity matrix, or is there another way to overcome this?
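
To illustrate the problem with toy numbers (sizes and density are illustrative):

```python
import numpy as np
import scipy.sparse

# a toy sparse similarity matrix with ~1% of entries stored
sim = scipy.sparse.random(1000, 1000, density=0.01, format="csr")

# converting to distance turns every implicit zero into a 1,
# so the result is effectively dense
dist = 1.0 - sim.toarray()

print(sim.nnz)                 # roughly 10,000 stored entries
print(np.count_nonzero(dist))  # roughly 990,000 nonzero entries
```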

@ReaganGen

Hello. I also have the same error. Strangely, UMAP works well on some of my datasets but returns this error message on other datasets with the same format. I also tried reinstalling pynndescent directly from GitHub, but the same error still exists. Could anyone help?
[Screenshot, 2023-05-21: the same ValueError traceback]

@ntmaier commented May 23, 2023

Hello,

I have a similar issue to the one above. I also reinstalled pynndescent directly from the GitHub master. Note: the script works on smaller files. Right now I am running a relatively simple workflow, which breaks on larger files (~44,000 × 4,000):

import umap

fit = umap.UMAP(n_neighbors=20, n_components=2, min_dist=0.1)
umap_spectrogram = fit.fit_transform(spectrograms_)

ValueError Traceback (most recent call last)
Cell In[6], line 7
4 # these settings seem to work pretty good n_neighbors = 30, n_components=3, min_dist=0.5
5 fit = umap.UMAP(n_neighbors=20,n_components=2,min_dist=0.1)
----> 7 umap_spectrogram = fit.fit_transform(spectrograms_)

File c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\umap\umap_.py:2772, in UMAP.fit_transform(self, X, y)
2742 def fit_transform(self, X, y=None):
2743 """Fit X into an embedded space and return that transformed
2744 output.
2745
(...)
2770 Local radii of data points in the embedding (log-transformed).
2771 """
-> 2772 self.fit(X, y)
2773 if self.transform_mode == "embedding":
2774 if self.output_dens:

File c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\umap\umap_.py:2516, in UMAP.fit(self, X, y)
2510 nn_metric = self._input_distance_func
2511 if self.knn_dists is None:
2512 (
2513 self._knn_indices,
2514 self._knn_dists,
2515 self._knn_search_index,
-> 2516 ) = nearest_neighbors(
2517 X[index],
2518 self._n_neighbors,
2519 nn_metric,
2520 self._metric_kwds,
2521 self.angular_rp_forest,
2522 random_state,
2523 self.low_memory,
2524 use_pynndescent=True,
2525 n_jobs=self.n_jobs,
2526 verbose=self.verbose,
2527 )
2528 else:
2529 self._knn_indices = self.knn_indices

File c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\umap\umap_.py:328, in nearest_neighbors(X, n_neighbors, metric, metric_kwds, angular, random_state, low_memory, use_pynndescent, n_jobs, verbose)
325 n_trees = min(64, 5 + int(round((X.shape[0]) ** 0.5 / 20.0)))
326 n_iters = max(5, int(round(np.log2(X.shape[0]))))
--> 328 knn_search_index = NNDescent(
329 X,
330 n_neighbors=n_neighbors,
331 metric=metric,
332 metric_kwds=metric_kwds,
333 random_state=random_state,
334 n_trees=n_trees,
335 n_iters=n_iters,
336 max_candidates=60,
337 low_memory=low_memory,
338 n_jobs=n_jobs,
339 verbose=verbose,
340 compressed=False,
341 )
342 knn_indices, knn_dists = knn_search_index.neighbor_graph
344 if verbose:

File c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\pynndescent\pynndescent_.py:804, in NNDescent.__init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_prob, n_search_trees, tree_init, init_graph, init_dist, random_state, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, parallel_batch_queries, verbose)
793 print(ts(), "Building RP forest with", str(n_trees), "trees")
794 self._rp_forest = make_forest(
795 data,
796 n_neighbors,
(...)
802 self._angular_trees,
803 )
--> 804 leaf_array = rptree_leaf_array(self._rp_forest)
805 else:
806 self._rp_forest = None

File c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\pynndescent\rp_trees.py:1097, in rptree_leaf_array(rp_forest)
1095 def rptree_leaf_array(rp_forest):
1096 if len(rp_forest) > 0:
-> 1097 return np.vstack(rptree_leaf_array_parallel(rp_forest))
1098 else:
1099 return np.array([[-1]])

File c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\pynndescent\rp_trees.py:1089, in rptree_leaf_array_parallel(rp_forest)
1088 def rptree_leaf_array_parallel(rp_forest):
-> 1089 result = joblib.Parallel(n_jobs=-1, require="sharedmem")(
1090 joblib.delayed(get_leaves_from_tree)(rp_tree) for rp_tree in rp_forest
1091 )
1092 return result

File c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\joblib\parallel.py:1098, in Parallel.__call__(self, iterable)
1095 self._iterating = False
1097 with self._backend.retrieval_context():
-> 1098 self.retrieve()
1099 # Make sure that we get a last message telling us we are done
1100 elapsed_time = time.time() - self._start_time

File c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\joblib\parallel.py:975, in Parallel.retrieve(self)
973 try:
974 if getattr(self._backend, 'supports_timeout', False):
--> 975 self._output.extend(job.get(timeout=self.timeout))
976 else:
977 self._output.extend(job.get())

File c:\ProgramData\Anaconda\envs\seis\Lib\multiprocessing\pool.py:774, in ApplyResult.get(self, timeout)
772 return self._value
773 else:
--> 774 raise self._value

File c:\ProgramData\Anaconda\envs\seis\Lib\multiprocessing\pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
123 job, i, func, args, kwds = task
124 try:
--> 125 result = (True, func(*args, **kwds))
126 except Exception as e:
127 if wrap_exception and func is not _helper_reraises_exception:

File c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\joblib\_parallel_backends.py:620, in SafeFunction.__call__(self, *args, **kwargs)
618 def __call__(self, *args, **kwargs):
619 try:
--> 620 return self.func(*args, **kwargs)
621 except KeyboardInterrupt as e:
622 # We capture the KeyboardInterrupt and reraise it as
623 # something different, as multiprocessing does not
624 # interrupt processing for a KeyboardInterrupt
625 raise WorkerInterrupt() from e

File c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\joblib\parallel.py:288, in BatchedCalls.__call__(self)
284 def __call__(self):
285 # Set the default nested backend to self._backend but do not set the
286 # change the default number of processes to -1
287 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288 return [func(*args, **kwargs)
289 for func, args, kwargs in self.items]

File c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\joblib\parallel.py:288, in <listcomp>(.0)
284 def __call__(self):
285 # Set the default nested backend to self._backend but do not set the
286 # change the default number of processes to -1
287 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288 return [func(*args, **kwargs)
289 for func, args, kwargs in self.items]

ValueError: cannot assign slice from input of different size

@liufeifan

I also have the same ValueError problem as mentioned above.

@ogreyesp commented Jun 25, 2023

Hi,

This problem is happening very frequently. I have a dataset on which UMAP works well. However, when I tried to build the UMAP embedding inside 10-fold cross-validation, the error appeared in some of the folds; see the sketch below.
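
A simplified sketch of the setup (data shape and parameters are illustrative):

```python
import numpy as np
import umap
from sklearn.model_selection import KFold

X = np.random.rand(500, 50)  # placeholder for the real dataset

# the error appears only on some of the ten folds
for fold, (train_idx, _) in enumerate(KFold(n_splits=10).split(X)):
    reducer = umap.UMAP(n_neighbors=15, n_components=2)
    embedding = reducer.fit_transform(X[train_idx])
    print(f"fold {fold}: embedding shape {embedding.shape}")
```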

Please advise.

@carluqcor

Hi @ogreyesp, as @lmcinnes said, it seems to be a pynndescent issue. Using pynndescent 0.5.8 works perfectly for me.
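
For example, pinning the version with `pip install pynndescent==0.5.8` was enough in my case.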

@liufeifan commented Jul 20, 2023

> I also have the same ValueError problem as mentioned above.

I solved the problem by going back to an older version, 0.5.0.
