
Does not work well on trained doc2vec model #5

Open · gclen opened this issue Sep 10, 2017 · 7 comments
Labels: Good Reads (issues that discuss important topics regarding UMAP, that provide useful code or nice visualizations)

gclen (Contributor) commented Sep 10, 2017

I trained a doc2vec model on the large movie review dataset and then tried to use UMAP to reduce the dimensions of the resulting document vectors. I had hoped that it would be possible to separate the documents by sentiment (positive and negative), but unfortunately the embedding is one big blob. A notebook can be seen here and the rest of the files for training the doc2vec model are in that repository as well.
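
For context, a minimal sketch of the pipeline being described, with a toy corpus standing in for the movie review data (the corpus, `vector_size`, and `epochs` here are illustrative assumptions, not the settings from the linked notebook):

```python
import umap
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus standing in for the movie review data (an assumption for this sketch).
texts = [f"review {i} was {'great fun' if i % 2 else 'a dull mess'}" for i in range(200)]
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

# Illustrative hyperparameters, not the notebook's values.
model = Doc2Vec(tagged, vector_size=50, epochs=20, min_count=1)
vectors = [model.dv[i] for i in range(len(tagged))]  # model.docvecs[i] on gensim < 4.0

embedding = umap.UMAP().fit_transform(vectors)  # 2-D coordinates, one row per document
```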

lmcinnes (Owner) commented

That definitely looks underwhelming. How do t-SNE or PCA compare? There may simply be less structure in the data than one might like. More likely, however, those two outliers are somehow messing everything up. I'll see if I can find some time to look into exactly what is going on internally. I am fairly busy at the moment with other projects, so I can't promise anything immediate. Sorry.

lmcinnes (Owner) commented

If you have some time, the relevant thing to do is run the internals yourself step by step and see where things are getting swamped. In particular, if you can build the fuzzy simplicial set and look at the result (a sparse matrix), I suspect the distribution of its non-zero entries will be suspicious (or at least the distribution of their logs, since they are probably power-law distributed). You should pay special attention to the rows (and columns) associated with the two points that end up at the extremes.

Alternatively, look at what happens if you don't use spectral initialisation. A sketch of both checks follows.
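
A sketch of both checks, assuming the umap-learn API (recent versions expose the fuzzy simplicial set on a fitted model as the sparse `graph_` attribute) and the `vectors` array from the sketch above; picking the outliers as the two points farthest from the embedding centroid is my own heuristic, not something from the thread:

```python
import numpy as np
import umap

reducer = umap.UMAP().fit(vectors)
graph = reducer.graph_.tocsr()  # fuzzy simplicial set as a sparse matrix

# Distribution of the logs of the non-zero membership strengths.
log_weights = np.log(graph.data)
print(np.percentile(log_weights, [0, 25, 50, 75, 100]))

# Heuristic: take the two points that land farthest from the embedding centroid.
emb = reducer.embedding_
outliers = np.argsort(np.linalg.norm(emb - emb.mean(axis=0), axis=1))[-2:]

for i in outliers:
    row = graph.getrow(i)
    print(i, row.nnz, np.sort(np.log(row.data))[-5:])  # five largest log-entries

# The alternative check: skip spectral initialisation.
embedding_random = umap.UMAP(init='random').fit_transform(vectors)
```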

gclen (Contributor, Author) commented Sep 24, 2017

I took a look at the things you suggested. Using a random initialisation, the result still looks underwhelming, but there are no huge outliers. PCA gives slightly better separation, but it is still not great (though I haven't tuned any parameters).

I constructed the fuzzy simplicial set and, as you suspected, the distribution of the logs of the non-zero entries is suspicious. To compare the outlying rows to "normal" rows, I computed and sorted the log distributions for the outlying rows and for 10 rows selected at random. The largest values in the outlying rows were much bigger than the largest values in the other rows. I'm not sure what this means, but it's something. The updated notebook is located here. Let me know if you have any ideas for further tests.
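
As a sketch, that comparison might look like this (reusing `graph` and `outliers` from the earlier sketch; the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
random_rows = rng.choice(graph.shape[0], size=10, replace=False)

def sorted_log_row(i):
    """Sorted logs of the non-zero entries in row i of the fuzzy simplicial set."""
    return np.sort(np.log(graph.getrow(i).data))

for i in outliers:
    print("outlier", i, "largest log-entries:", sorted_log_row(i)[-3:])
for i in random_rows:
    print("random ", i, "largest log-entries:", sorted_log_row(i)[-3:])
```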

lmcinnes (Owner) commented

Hi Graham,

Sorry for the very long delay in getting back to you on this. I got rather invested in building the new version of UMAP (which I was hoping would fix some of these issues) and then this fell off my radar for a while. The new UMAP, using numba, is now in place, and I think it does fix some of your issues, though not all. I believe some of the remaining apparent issues can be corrected by more careful plotting. The end result is that I still don't believe we get the separation you want, but it looks less bad along the way. In particular, the default UMAP on your data gives this:

[image: default UMAP embedding of the doc2vec vectors]

This is, admittedly, somewhat underwhelming. If we turn down n_neighbors to 5 and set min_dist to 0.0 we get the following (which shows more structure, but certainly doesn't separate your classes):

[image: UMAP embedding with n_neighbors=5, min_dist=0.0]

On the other hand, if we plot the PCA result in the same way we get this:

[image: PCA projection plotted the same way]

I think in your original iteration the apparent separation was partly due to plotting artifacts, combined with the fact that the light blue class appears to have slightly larger variance (ultimately they look like two overlaid Gaussian blobs).

Finally, the new version of UMAP does support cosine distance, which makes more sense for doc2vec vectors, so we can at least compute with that. The result is the following:

[image: UMAP embedding with metric='cosine']

Still no notable separation of the classes, but given the PCA result and these results, I am not sure there is actually good separation to be found in 2D. I know that's not an ideal answer, or even what you were looking for, but hopefully it helps somewhat.
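
For reference, the three configurations shown above correspond to these calls (a sketch, assuming the umap-learn API and the `vectors` array from the earlier sketches):

```python
import umap

default_emb = umap.UMAP().fit_transform(vectors)                       # default settings
local_emb = umap.UMAP(n_neighbors=5, min_dist=0.0).fit_transform(vectors)  # more local structure
cosine_emb = umap.UMAP(metric='cosine').fit_transform(vectors)         # cosine distance
```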

sleighsoft added the Good Reads label on Sep 15, 2019
lmcinnes pushed a commit that referenced this issue Sep 23, 2019

vb690 commented Sep 13, 2021

I also played around a bit with language models and UMAP, obtaining somewhat more satisfying results: here and here.

lmcinnes (Owner) commented

Those are some nice results, @vb690; would you mind if I referenced them in the example uses section of the documentation?


vb690 commented Sep 15, 2021

Hi @lmcinnes, sure thing!
