
Does not work well on trained doc2vec model #5

Open · gclen opened this issue Sep 10, 2017 · 7 comments
Labels: Good Reads (issues that discuss important topics regarding UMAP, that provide useful code or nice visualizations)

gclen (Contributor) commented Sep 10, 2017

I trained a doc2vec model on the large movie review dataset and then tried to use UMAP to reduce the dimensions of the resulting document vectors. I had hoped that it would be possible to separate the documents by sentiment (positive and negative), but unfortunately the embedding is one big blob. A notebook can be seen here and the rest of the files for training the doc2vec model are in that repository as well.
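
For context, a minimal sketch of the pipeline being described, with a toy corpus standing in for the movie review data (the corpus, `vector_size`, and `epochs` here are illustrative assumptions, not the settings from the linked notebook):

```python
import umap
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus standing in for the movie review data (an assumption for this sketch).
texts = [f"review {i} was {'great fun' if i % 2 else 'a dull mess'}" for i in range(200)]
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

# Illustrative hyperparameters, not the notebook's values.
model = Doc2Vec(tagged, vector_size=50, epochs=20, min_count=1)
vectors = [model.dv[i] for i in range(len(tagged))]  # model.docvecs[i] on gensim < 4.0

embedding = umap.UMAP().fit_transform(vectors)  # 2-D coordinates, one row per document
```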

lmcinnes (Owner) commented

That definitely looks underwhelming. How do t-SNE or PCA compare? There may simply be less structure in the data than one might like. More likely, however, those two outliers are somehow messing everything up. I'll see if I can find some time to look into exactly what is going on internally. I am fairly busy at the moment with other projects, so I can't promise anything immediate. Sorry.

lmcinnes (Owner) commented

If you have some time, the relevant thing to do is run the internals yourself step by step and see where things are getting swamped. In particular, if you can build the fuzzy simplicial set and look at the result (a sparse matrix), I suspect the distribution of its non-zero entries will be suspicious (or at least the distribution of their logs, since they are probably power-law distributed). You should pay special attention to the rows (and columns) associated with the two points that end up at the extremes.

Alternatively, look at what happens if you don't use spectral initialisation. A sketch of both checks follows.
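
A sketch of both checks, assuming the umap-learn API (recent versions expose the fuzzy simplicial set on a fitted model as the sparse `graph_` attribute) and the `vectors` array from the sketch above; picking the outliers as the two points farthest from the embedding centroid is my own heuristic, not something from the thread:

```python
import numpy as np
import umap

reducer = umap.UMAP().fit(vectors)
graph = reducer.graph_.tocsr()  # fuzzy simplicial set as a sparse matrix

# Distribution of the logs of the non-zero membership strengths.
log_weights = np.log(graph.data)
print(np.percentile(log_weights, [0, 25, 50, 75, 100]))

# Heuristic: take the two points that land farthest from the embedding centroid.
emb = reducer.embedding_
outliers = np.argsort(np.linalg.norm(emb - emb.mean(axis=0), axis=1))[-2:]

for i in outliers:
    row = graph.getrow(i)
    print(i, row.nnz, np.sort(np.log(row.data))[-5:])  # five largest log-entries

# The alternative check: skip spectral initialisation.
embedding_random = umap.UMAP(init='random').fit_transform(vectors)
```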

gclen (Contributor, Author) commented Sep 24, 2017

I took a look at the things you suggested. Using a random initialisation, the result still looks underwhelming, but there are no huge outliers. PCA gives slightly better separation, but it is still not great (though I haven't tuned any parameters).

I constructed the fuzzy simplicial set and, as you suspected, the distribution of the logs of the non-zero entries is suspicious. To compare the outlying rows to "normal" rows, I computed and sorted the log distributions for the outlying rows and for 10 rows selected at random. The largest values in the outlying rows were much bigger than the largest values in the other rows. I'm not sure what this means, but it's something. The updated notebook is located here. Let me know if you have any ideas for further tests.
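
As a sketch, that comparison might look like this (reusing `graph` and `outliers` from the earlier sketch; the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
random_rows = rng.choice(graph.shape[0], size=10, replace=False)

def sorted_log_row(i):
    """Sorted logs of the non-zero entries in row i of the fuzzy simplicial set."""
    return np.sort(np.log(graph.getrow(i).data))

for i in outliers:
    print("outlier", i, "largest log-entries:", sorted_log_row(i)[-3:])
for i in random_rows:
    print("random ", i, "largest log-entries:", sorted_log_row(i)[-3:])
```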

lmcinnes (Owner) commented

Hi Graham,

Sorry for the very long delay in getting back to you on this. I got rather invested in building the new version of UMAP (which I was hoping would fix some of these issues) and then this fell off my radar for a while. The new UMAP, using numba, is now in place, and I think it does fix some of your issues, though not all. I believe some of the remaining apparent issues can be corrected by more careful plotting. The end result is that I still don't believe we get the separation you want, but it looks less bad along the way. In particular, the default UMAP on your data gives this:

[image: default UMAP embedding of the doc2vec vectors]

This is, admittedly, somewhat underwhelming. If we turn down n_neighbors to 5 and set min_dist to 0.0 we get the following (which shows more structure, but certainly doesn't separate your classes):

[image: UMAP embedding with n_neighbors=5, min_dist=0.0]

On the other hand, if we plot the PCA result in the same way we get this:

[image: PCA projection plotted the same way]

I think in your original iteration the apparent separation was partly due to plotting artifacts, combined with the fact that the light blue class appears to have slightly larger variance (ultimately they look like two overlaid Gaussian blobs).

Finally, the new version of UMAP does support cosine distance, which makes more sense for doc2vec vectors, so we can at least compute with that. The result is the following:

[image: UMAP embedding with metric='cosine']

Still no notable separation of the classes, but given the PCA result and these results, I am not sure there is actually good separation to be found in 2D. I know that's not an ideal answer, or even what you were looking for, but hopefully it helps somewhat.
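
For reference, the three configurations shown above correspond to these calls (a sketch, assuming the umap-learn API and the `vectors` array from the earlier sketches):

```python
import umap

default_emb = umap.UMAP().fit_transform(vectors)                       # default settings
local_emb = umap.UMAP(n_neighbors=5, min_dist=0.0).fit_transform(vectors)  # more local structure
cosine_emb = umap.UMAP(metric='cosine').fit_transform(vectors)         # cosine distance
```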

sleighsoft added the Good Reads label on Sep 15, 2019
lmcinnes pushed a commit that referenced this issue Sep 23, 2019

vb690 commented Sep 13, 2021

I also played around a bit with language models and UMAP, obtaining somewhat more satisfying results: here and here.

lmcinnes (Owner) commented

Those are some nice results, @vb690; would you mind if I referenced them in the example uses section of the documentation?


vb690 commented Sep 15, 2021

Hi @lmcinnes, sure thing!
