UMAP Roadmap #15

Open
9 of 21 tasks
lmcinnes opened this issue Nov 13, 2017 · 36 comments

Comments

@lmcinnes
Owner

lmcinnes commented Nov 13, 2017

A rough roadmap of things to be done for UMAP. Some of these tasks are easy, some are hard, and some require deeper knowledge of UMAP. Short and medium term tasks should be approachable for many people. Reply to this issue if you are interested in taking up any of them.

Short term items

  • Support for sparse matrix input
  • Add random seed as a user option
  • Support for cosine distance RP-trees
  • Allow non-RP-tree initialisation of NN-descent
  • Better document (via docstrings) all the support functions
  • "Custom" initialisation with a predefined positioning.

Medium term items

  • Generate notebook for basic usage demonstration
  • Generate notebook explaining parameter options and their effects
  • Set up CI and build a basic test suite
  • Start building basic documentation and integrate with readthedocs

Longer term items

  • Generate notebook for "How UMAP works"
  • Add code (and devise API(?)) for UMAP on general pandas dataframes
  • Add support for semi-supervised dimension reduction via UMAP
  • UMAP as a generative model (code + demo)
  • UMAP for text data (similar to word2vec)
  • A transform function for new, previously unseen data (see issue #40, "UMAP as a dimensionality reduction (umap.transform())")
  • Model persistence for UMAP models

No priority

  • GPU support for UMAP
  • Conda-forge UMAP package
  • Improve numba usage (better numba expertise required)
  • Concurrency via Dask for multicore and distributed support
@Fil
Contributor

Fil commented Nov 13, 2017

Here's the smallest notebook I could think of for basic usage demonstration
https://nbviewer.jupyter.org/gist/Fil/cce232583907035b65686cdec7d4cc92
[image: umap-rbg-demo]

@lmcinnes
Owner Author

Thanks! I was hoping to have some further description in Markdown in the notebook, but this is an excellent beginning.

@KeithTheEE

Would you mind if we pulled direct quotes from your README for the notebook ('basic usage demonstration' and 'explaining parameter options and their effects')?

I'm also currently wrapping the two ideas as one notebook, with a basic usage section at the top, and more in-depth information after that. Thoughts?

@lmcinnes
Owner Author

Go ahead and pull whatever you need. It's helpful if you can explore the parameter effects in a little detail. Kyle McDonald had some nice min_dist comparisons here https://twitter.com/kcimc/status/930180473262919685 . Exploring some of the other effects similarly as well (metric, n_components, spread, n_neighbors) would be beneficial. But certainly any contributions are welcome.

@Fil
Contributor

Fil commented Nov 13, 2017

Here's another version, exploring some of the parameters
https://nbviewer.jupyter.org/gist/Fil/5c48475e88a0e1a8f56eaadaebff0544

@lmcinnes
Owner Author

I love the metric exploration! The custom metrics nicely show off what can be done and the effects they have (e.g. the pure red metric gives a clearly linear embedding).

@lmcinnes
Owner Author

@Fil I have added in a version of your parameter exploration notebook (with some minor changes and added text commentary and explanation) in the notebooks directory. Have a look and let me know if it looks okay to you. I really appreciate your work on this, so let me know how you would like to be acknowledged within the notebook.

@Fil
Contributor

Fil commented Nov 17, 2017

Wow! I wanted to do the hue and HSL metrics, but didn't think they would turn out that splendid. Thank you! For the credit you should remove "excellent", and can add "for visionscarto.net" after my name for affiliation.

I'm preparing another example, will follow up when it's ready :)

lmcinnes added a commit that referenced this issue Nov 19, 2017
lmcinnes added a commit that referenced this issue Nov 19, 2017
lmcinnes added a commit that referenced this issue Nov 19, 2017
lmcinnes added a commit that referenced this issue Nov 20, 2017
@KeithTheEE

So I had started on a different notebook approach, and decided to see it through to an alpha version. It uses scikit-learn's digits data, so it at least offers a different perspective. Like I said, it's an alpha/early draft version. There are plenty of places where I got bored of writing and went back to coding, but I'm going to return to them soon. I also blinked and the documentation/code changed, so I'll have to update that too. Here it is:
https://nbviewer.jupyter.org/github/CrakeNotSnowman/umapNotebooks/blob/master/UMAP%20Usage.ipynb

That said, I really like the notebook that @Fil came up with and @lmcinnes improved on; I think it offers a better intro to UMAP.

@lmcinnes
Owner Author

@CrakeNotSnowman That looks great! To be honest more intros are good, especially if they come from different perspectives, as this one does. There are some really interesting results in there.

Sorry about the code and documentation changes; I'm a tinkerer and I can't help it.

I definitely look forward to seeing this with any further expository writing.

@loretoparisi

What about UMAP for text data (similar to word2vec)?

@lmcinnes
Owner Author

I have a colleague who is working on that -- there's some underlying theory to be worked through, but I believe the core ideas are now all in place. The essence of the idea is this: word2vec can be viewed as (in the limit) a matrix factorization problem, which is to say similar to PCA. It should be possible to use manifold learning like UMAP to do the embedding rather than something linear like PCA. Ideally this should capture word similarity better, at the cost that word algebra will no longer work.

The details are in what data to embed (something based on a word-word co-occurrence matrix), how to measure distance (negative log likelihoods under a suitable model), and how to interpret the theory around all of that. Progress is being made, but it may be a little while before anything releasable happens.
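
To make the ingredients above a little more concrete, here is a minimal sketch in plain Python. It is not the approach being developed: the window size, the add-one smoothed multinomial model, and the helper names `cooccurrence_counts` and `neg_log_likelihood` are all assumptions chosen for illustration. It only shows what a word-word co-occurrence table and a negative log likelihood score could look like on a toy corpus.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context) pair occurs within `window` tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def neg_log_likelihood(counts, word, context, smoothing=1.0):
    """-log P(context | word) under a smoothed multinomial model.

    The model is an assumption made for this sketch, not the one from the thread.
    """
    vocab_size = len(counts)
    total = sum(counts[word].values())
    prob = (counts[word][context] + smoothing) / (total + smoothing * vocab_size)
    return -math.log(prob)


corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased a dog",
]
counts = cooccurrence_counts(corpus)

# Pairs that co-occur get a smaller score than pairs that never do.
seen = neg_log_likelihood(counts, "cat", "sat")
unseen = neg_log_likelihood(counts, "cat", "rug")
```

A table of such scores could, in principle, be handed to an embedding method that accepts precomputed dissimilarities; whether that behaves well is exactly the open theory question mentioned above.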

@gokceneraslan
Contributor

gokceneraslan commented Feb 22, 2018

Are you also planning to explore other exact and approximate k-nn graph methods? nmslib is a super fast parallelized implementation with a plethora of knn methods.

@lmcinnes
Owner Author

I'm forgoing exact knn-graph methods as most are too slow on high dimensional data. I agree that nmslib is impressive but for this project I was hoping to keep the dependencies relatively self-contained. Right now I'm using my own python based implementation of NN-descent (for which kgraph is the reference implementation). The advantages of NN-descent are that it is non-metric space based (just like nmslib), and can be used for direct approximate knn-graph construction rather than building an index and then querying.
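
For readers unfamiliar with NN-descent, the core idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the implementation used in UMAP or kgraph: it omits reverse neighbours, candidate sampling, and RP-tree initialisation, and the function names are made up for the example. It shows how a k-neighbour graph is built directly, starting from random neighbours and repeatedly testing neighbours of neighbours, using nothing but a dissimilarity function.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance; any dissimilarity function could be swapped in."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_knn_graph(points, k, rng):
    """Give every point k random neighbours, stored as sorted (distance, index) pairs."""
    n = len(points)
    graph = []
    for i in range(n):
        others = rng.sample([j for j in range(n) if j != i], k)
        graph.append(sorted((sq_dist(points[i], points[j]), j) for j in others))
    return graph


def nn_descent(points, graph, max_iters=10):
    """Refine a neighbour graph by testing neighbours of neighbours as candidates."""
    graph = [list(nbrs) for nbrs in graph]  # leave the caller's graph untouched
    for _ in range(max_iters):
        updates = 0
        for i in range(len(points)):
            current = {j for _, j in graph[i]}
            candidates = {c for _, j in graph[i] for _, c in graph[j]}
            for c in candidates - current - {i}:
                d = sq_dist(points[i], points[c])
                if d < graph[i][-1][0]:  # closer than the current worst neighbour
                    graph[i][-1] = (d, c)
                    graph[i].sort()
                    updates += 1
        if updates == 0:  # converged: no neighbour list changed
            break
    return graph


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
initial = random_knn_graph(points, k=5, rng=rng)
refined = nn_descent(points, initial)
```

Because a neighbour is only ever replaced by a closer one, the summed neighbour distance of `refined` can never exceed that of `initial`; how close it gets to the exact graph depends on the data and the number of iterations.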

If someone else wanted to build an optimized UMAP on top of nmslib I would certainly be interested to see it -- it would likely outperform this version due to the parallelism (presuming a suitably parallelised version of the SGD for layout was paired with it).

@ghannum

ghannum commented Apr 10, 2018

It would be useful to have a way to save the UMAP model to a file for transforming future data into the same space. What would it take to get save/load functions?

@lmcinnes
Owner Author

@ghannum I admit that I had been hoping that the standard methods for model persistence in sklearn (pickling etc.) would handle this -- is that not working with UMAP, or are you looking for something a little different than what it would provide? This isn't really my area of expertise, so you'll have to excuse my lack of knowledge here.

@ghannum

ghannum commented Apr 10, 2018

@lmcinnes I tried to pickle the model file, but pickling only works for data objects - not classes. I believe the correct approach would be to write one function which puts all of the relevant model data into a list and pickles the list. Then write a load function which loads the pickled data and constructs the model object.

@lmcinnes
Owner Author

@ghannum Okay, thanks, I'll try to look into this at some point. At the very least I'll add it to the roadmap.

@david4096

@lmcinnes very interested in that feature!

@josephcourtney
Contributor

Unless I am not understanding something, pickling seems to work fine, at least on the current main branch. Here is a simple example that shows pickling and unpickling of a trained model, even with a custom metric. Note: if you unpickle a model with a custom metric, that metric must already be defined in that same file; the pickle only contains a reference to the metric function.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import umap
import pickle


digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    stratify=digits.target,
    random_state=42
)


def mydist(x, y):
    return np.max(np.abs(x - y))


trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric=mydist
).fit(X_train)
plt.scatter(trans.embedding_[:, 0], trans.embedding_[:, 1], s=5, c=y_train, cmap='Spectral')
plt.title('Embedding of the training set by UMAP', fontsize=24)
plt.show()
plt.close()


with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)

with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)


test_embedding = trans.transform(X_test)
plt.scatter(test_embedding[:, 0], test_embedding[:, 1], s=5, c=y_test, cmap='Spectral')
plt.title('Embedding of the test set by UMAP', fontsize=24)
plt.show()
plt.close()
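
The note about custom metrics can be checked with the standard library alone. The sketch below uses a pure-Python stand-in for `mydist` (an assumption, so that it runs without numpy or umap) and shows that pickling a function records only where to find it, not its code.

```python
import pickle


def mydist(x, y):
    """Chebyshev distance, mirroring the custom metric in the example above."""
    return max(abs(a - b) for a, b in zip(x, y))


# The payload holds the module name and the function name, nothing more.
payload = pickle.dumps(mydist)

# Loading looks the name up again in that module, so the function must
# already be defined (or importable) wherever the pickle is read back.
restored = pickle.loads(payload)
```

Here `restored` is the very same function object as `mydist`, which is why unpickling in a process where `mydist` is not defined fails.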

@lmcinnes
Owner Author

lmcinnes commented Jul 20, 2018 via email

@bccho

bccho commented Aug 29, 2018

@josephcourtney's example fails when the training data is larger.

No error:

X = np.random.randn(4000, 48)
trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric="euclidean"
).fit(X)
with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)

Error:

X = np.random.randn(5000, 48)
trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric="euclidean"
).fit(X)
with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)

Traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-140-1981a7cd4080> in <module>()
      2     pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
      3 with open('trans.pkl', 'rb') as f:
----> 4     trans = pickle.load(f)
      5

/usr/lib/python2.7/pickle.pyc in load(file)
   1382
   1383 def load(file):
-> 1384     return Unpickler(file).load()
   1385
   1386 def loads(str):

/usr/lib/python2.7/pickle.pyc in load(self)
    862             while 1:
    863                 key = read(1)
--> 864                 dispatch[key](self)
    865         except _Stop, stopinst:
    866             return stopinst.value

/usr/lib/python2.7/pickle.pyc in load_newobj(self)
   1087         args = self.stack.pop()
   1088         cls = self.stack[-1]
-> 1089         obj = cls.__new__(cls, *args)
   1090         self.stack[-1] = obj
   1091     dispatch[NEWOBJ] = load_newobj

[path omitted]/lib/python2.7/site-packages/funcsigs/__init__.py in __new__(self, *args, **kwargs)
    199     def __new__(self, *args, **kwargs):
    200         obj = int.__new__(self, *args)
--> 201         obj._name = kwargs['name']
    202         return obj
    203

KeyError: 'name'

@lmcinnes
Owner Author

@bccho : That's a little disconcerting. It seems to be some sort of issue with pickle storing certain objects. At 4096 there is a switch in how knn computation is handled, so that may be responsible, but it is entirely unclear to me where in the whole process this is going astray. It must be in some subobjects of the basic UMAP class, so is likely an issue for those objects in general (scipy sparse matrices perhaps?). I'm away for a few days but I'll try to look into it when I get back. If there is any chance you can switch to python3 that will resolve the issue, but I understand that that is not always an option.

@bccho

bccho commented Aug 29, 2018

Unfortunately I have no control over moving to python 3 (as much as I would like to), but for a workaround, I can try saving individual subobjects to files and re-loading them.
Can you indicate what subobjects and parameters are required for transform to work correctly?

EDIT: After iterating through individual attributes from dir(trans), it looks like _random_init, _search, and _tree_init are the culprits. They are all instances of @numba.njit called on nested functions, but using dill didn't resolve the problem, and it seems they are necessary for transform.
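
The attribute-by-attribute search described above can be sketched roughly as follows. This is an illustration rather than the code that was actually run: `ToyModel` is a hypothetical stand-in for a fitted model, `unpicklable_attributes` is a made-up helper name, and it walks `vars(obj)` instead of `dir(obj)`. Each attribute is round-tripped, since the failure in this thread appeared on load rather than on dump.

```python
import pickle


def unpicklable_attributes(obj):
    """Return {name: error} for instance attributes that fail a pickle round trip."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.loads(pickle.dumps(value, pickle.HIGHEST_PROTOCOL))
        except Exception as exc:  # pickle raises several different exception types
            failures[name] = exc
    return failures


class ToyModel:
    """Hypothetical stand-in for a fitted estimator; not the UMAP class."""

    def __init__(self, scale):
        self.scale = scale
        self.embedding_ = [[0.0, 1.0], [2.0, 3.0]]
        # A nested function, similar in spirit to a jitted closure.
        self._search = lambda x: x * scale


model = ToyModel(2.0)
bad = unpicklable_attributes(model)
print(sorted(bad))  # prints ['_search']
```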

EDIT: Here is a functioning workaround for Python 2:

import pickle

def save_umap(model):
    # Drop the numba-jitted closures, which do not survive a pickle round trip
    # under Python 2.7. Note that this modifies `model` in place.
    for attr in ["_tree_init", "_search", "_random_init"]:
        if hasattr(model, attr):
            delattr(model, attr)
    return pickle.dumps(model, pickle.HIGHEST_PROTOCOL)

def load_umap(s):
    from umap.nndescent import make_initialisations, make_initialized_nnd_search
    model = pickle.loads(s)
    # Rebuild the dropped closures from the stored distance function and arguments.
    model._random_init, model._tree_init = make_initialisations(
        model._distance_func, model._dist_args
    )
    model._search = make_initialized_nnd_search(
        model._distance_func, model._dist_args
    )
    return model

import numpy as np
X = np.random.randn(5000, 16)
X_new = np.random.randn(100, 16)

from umap import UMAP
um = UMAP()
um.fit(X)
emb = um.transform(X_new)

pkl = save_umap(um)
um_new = load_umap(pkl) # no error!

emb_new = um_new.transform(X_new)

@lmcinnes
Owner Author

lmcinnes commented Sep 1, 2018 via email

@lmcinnes
Owner Author

lmcinnes commented Sep 4, 2018

Thanks for finding a workaround! It looks like it was the numba-jitted functions that were not pickling properly, at least under 2.7. I'll have to see if I can figure out a more permanent solution.

@bccho

bccho commented Sep 8, 2018

I think that was the problem too. You could probably put the save_umap and load_umap code in as part of __getstate__ and __setstate__.
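
As a rough illustration of that suggestion, here is a minimal sketch of the pattern using only the standard library. `ToyModel` and `make_search` are hypothetical names standing in for the model and for the helpers that rebuild the jitted functions; nothing here reflects UMAP's actual internals.

```python
import pickle


def make_search(scale):
    """Hypothetical factory standing in for the helpers that rebuild jitted functions."""
    def search(x):
        return x * scale
    return search


class ToyModel:
    """Hypothetical estimator; only illustrates the pattern."""

    _REBUILT = ("_search",)  # attributes dropped on save and recreated on load

    def __init__(self, scale):
        self.scale = scale
        self._search = make_search(scale)

    def __getstate__(self):
        # Copy the instance dict so the live object keeps its attributes.
        state = self.__dict__.copy()
        for name in self._REBUILT:
            state.pop(name, None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Recreate the dropped closure from parameters that did pickle.
        self._search = make_search(self.scale)


model = ToyModel(3)
clone = pickle.loads(pickle.dumps(model, pickle.HIGHEST_PROTOCOL))
```

One difference from a standalone save function that deletes attributes: because `__getstate__` works on a copy, the original `model` remains usable after it has been pickled.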

@lmcinnes
Owner Author

lmcinnes commented Sep 8, 2018 via email

@stefan-jansen

I was able to persist umap objects using the pickle extension dill under Python 3.6.

@profwacko

@bccho thanks for this fix, been running into the problem with larger training sets with joblib and pickle for the past week. Needed to use python 2.7 specifically. Hopefully this functionality gets added into UMAP soon.

Same error as above:

[path omitted]/lib/python2.7/site-packages/funcsigs/__init__.pyc in __new__(self, *args, **kwargs)
    199     def __new__(self, *args, **kwargs):
    200         obj = int.__new__(self, *args)
--> 201         obj._name = kwargs['name']
    202         return obj
    203 

KeyError: 'name'

@sleighsoft sleighsoft pinned this issue Oct 6, 2019
@nawafmo

nawafmo commented Dec 24, 2019

I am thinking about using UMAP as a feature extraction method for an IDS project.

Is it a good idea? Has anybody done this before?

@lmcinnes
Owner Author

It is worth trying, but a lot will depend on the nature of your data. I have seen UMAP used for IDS projects, though usually more as part of an exploratory tool rather than a production pipeline.

@nawafmo

nawafmo commented Dec 25, 2019

Can you share some of these projects with me?

@lmcinnes
Owner Author

Unfortunately I can't share details. Sorry.

@loretoparisi

@lmcinnes my two cents is that the issue with UMAP is the use case. I see that a lot of people do not know what the advantage of using UMAP over t-SNE / PCA is...

@lefnire

lefnire commented Aug 2, 2020

Couldn't get the pickle.dumps/loads workaround to work (python3.8).

    man = self._unserialize_umap(man)
  File "/app/jwtauthtest/autoencoder.py", line 223, in _unserialize_umap
    umap = pickle.loads(s)
  File "/usr/local/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1028, in __setstate__
    self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
  File "/usr/local/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1028, in <listcomp>
    self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
  File "/usr/local/lib/python3.8/site-packages/pynndescent/rp_trees.py", line 1178, in renumbaify_tree
    hyperplanes.extend(tree.hyperplanes)
  File "/usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py", line 366, in extend
    return _extend(self, iterable)
  File "/usr/local/lib/python3.8/site-packages/numba/core/dispatcher.py", line 415, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/usr/local/lib/python3.8/site-packages/numba/core/dispatcher.py", line 358, in error_rewrite
    reraise(type(e), e, None)
  File "/usr/local/lib/python3.8/site-packages/numba/core/utils.py", line 80, in reraise
    raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_extend at 0x7feb5058daf0>) found for signature:

 >>> impl_extend(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C)))

There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'impl_extend': File: numba/typed/listobject.py: Line 1027.
    With argument(s): '(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C)))':
   Rejected as the implementation raised a specific error:
     TypingError: Failed in nopython mode pipeline (step: nopython frontend)
   - Resolution failure for literal arguments:
   No implementation of function Function(<function impl_append at 0x7feb5058d280>) found for signature:

    >>> impl_append(ListType[array(float64, 2d, C)], array(float32, 1d, C))

   There are 2 candidate implementations:
     - Of which 2 did not match due to:
     Overload in function 'impl_append': File: numba/typed/listobject.py: Line 589.
       With argument(s): '(ListType[array(float64, 2d, C)], array(float32, 1d, C))':
      Rejected as the implementation raised a specific error:
        LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)


      File "../usr/local/lib/python3.8/site-packages/numba/typed/listobject.py", line 597:
          def impl(l, item):
              casteditem = _cast(item, itemty)
              ^

      During: lowering "$8call_function.3 = call $2load_global.0(item, $6load_deref.2, func=$2load_global.0, args=[Var(item, listobject.py:597), Var($6load_deref.2, listobject.py:597)], kws=(), vararg=None)" at /usr/local/lib/python3.8/site-packages/numba/typed/listobject.py (597)
     raised from /usr/local/lib/python3.8/site-packages/numba/core/utils.py:81

   - Resolution failure for non-literal arguments:
   None

   During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'append') for ListType[array(float64, 2d, C)])
   During: typing of call at /usr/local/lib/python3.8/site-packages/numba/typed/listobject.py (1051)


   File "../usr/local/lib/python3.8/site-packages/numba/typed/listobject.py", line 1051:
               def impl(l, iterable):
                   <source elided>
                   for i in iterable:
                       l.append(i)
                       ^

  raised from /usr/local/lib/python3.8/site-packages/numba/core/typeinfer.py:994

- Resolution failure for non-literal arguments:
None

During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'extend') for ListType[array(float64, 2d, C)])
During: typing of call at /usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py (101)


File "../usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py", line 101:
def _extend(l, iterable):
    return l.extend(iterable)

Also tried dill.dump/load, same error (maybe I need to dump/load_session? Not sure how that might interfere with the rest of the environment, as this is shared with server code). I'll shelve umap for my project & subscribe here in case the roadmap sees some love.
