
Word Embedding Visualizer #1419

Closed
aneesh-joshi opened this issue Jun 16, 2017 · 21 comments
Labels
difficulty easy Easy issue: required small fix documentation Current issue related to documentation feature Issue described a new feature

Comments

@aneesh-joshi
Contributor

Currently, there is no direct way to visualise word embeddings made by the gensim word2vec model.
I have made a visualiser using matplotlib.
It uses Incremental PCA to reduce the vectors to a manageable number of dimensions, then uses t-SNE to bring them down to 2 dimensions.

Note: I have taken some of my code from https://github.com/jeffThompson/Word2VecAndTsne
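A minimal sketch of that reduction pipeline, assuming scikit-learn (the function name, the `use_pca` flag, and the array shapes are illustrative, not the actual script):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.manifold import TSNE

def reduce_to_2d(vectors, use_pca=True, pca_dims=50):
    """Reduce word vectors to 2-D, optionally via Incremental PCA first."""
    if use_pca and vectors.shape[1] > pca_dims:
        # first cut the dimensionality down so t-SNE has less work to do
        vectors = IncrementalPCA(n_components=pca_dims).fit_transform(vectors)
    # t-SNE performs the final reduction to two dimensions
    tsne = TSNE(n_components=2, perplexity=5, random_state=0)
    return tsne.fit_transform(vectors)

# e.g. 100 stand-in "word vectors" of dimensionality 200
emb = np.random.rand(100, 200)
coords = reduce_to_2d(emb)  # one (x, y) pair per word, ready for matplotlib
```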

This will be my first contribution (if at all), so please guide me.

(Also, I am not sure if this is the right place to submit a feature request, so please don't mind)

@gojomo
Collaborator

gojomo commented Jun 16, 2017

Word2vec visualizations are very useful; this seems like a very good contribution idea!

It might work best as a demo notebook, or as extensions to the existing word2vec notebooks – though of course if a few general-usefulness methods support the visualizations, they could become improvements to existing gensim classes/modules. (For example, perhaps new utility functions on the KeyedVectors class, the model for sets-of-vectors keyed-by-strings... typically keyed-by-words.)

Also, is reduction by PCA before applying tSNE really necessary? (I could be wrong but thought tSNE could, on its own, do all necessary dim-reduction on word2vec-like hundreds-of-dimensions spaces.)

@aneesh-joshi
Contributor Author

> is reduction by PCA before applying tSNE really necessary?

tSNE can do the work, but it tends to be pretty slow. I have set a flag so that the user can decide whether they want to use PCA first; if it is not set, tSNE is used throughout.

Plotting the visualisation in Jupyter notebooks takes away the ability to scroll, pan and zoom.
Unless there is a workaround for that, it might have to be implemented as a script.

My script only requires the path to the Word2Vec model; it does the rest. Although, to avoid holding too much in active memory, I save the reduced 2D vectors to a CSV file.

Could you suggest where I should implement my code and submit a PR?

@anotherbugmaster
Contributor

@aneesh-joshi

You can probably use plotly for zooming and panning.

@aneesh-joshi
Contributor Author

@anotherbugmaster
Thanks, I will try it.
Will it allow zooming, panning, etc. from within a notebook?

@anotherbugmaster
Contributor

Yes, the interface is the same as in the web version.

@aneesh-joshi
Contributor Author

@gojomo
So where exactly do you want me to implement it?
In a demo notebook, appended to an existing notebook, or in the KeyedVectors utility?

@parulsethi
Contributor

parulsethi commented Jun 18, 2017

Sorry for commenting late, but you can directly visualize the word embeddings using the TensorBoard projector.

Save the gensim model embeddings using model.wv.save_word2vec_format("filename") and use this script to convert the saved embedding file to the tsv format that TensorBoard requires. See the usage instructions in the script's docstring.
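The conversion itself is straightforward; a stdlib-only sketch of the idea (not the actual gensim script, and the sample data is made up), splitting a word2vec-format text file into the vectors and metadata TSVs that TensorBoard's projector expects:

```python
# word2vec text format: a "count dims" header, then "word v1 v2 ..." per line
w2v_text = """3 4
king 0.1 0.2 0.3 0.4
queen 0.2 0.3 0.4 0.5
man 0.9 0.8 0.7 0.6
"""

def word2vec_to_tsv(src):
    """Split word2vec text into (vectors tsv, metadata tsv) strings."""
    vectors, metadata = [], []
    for line in src.strip().splitlines()[1:]:  # skip the header line
        word, *vals = line.split()
        metadata.append(word)                  # one word per metadata row
        vectors.append("\t".join(vals))        # tab-separated components
    return "\n".join(vectors), "\n".join(metadata)

vec_tsv, meta_tsv = word2vec_to_tsv(w2v_text)
# write vec_tsv to vectors.tsv and meta_tsv to metadata.tsv,
# then load both into the TensorBoard projector
```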

@aneesh-joshi
Contributor Author

Ah, there goes my first contribution.
I had just added it to the word2vec ipynb.

(two screenshots of the notebook visualization, dated 2017-06-18, attached)

@aneesh-joshi
Contributor Author

@parulsethi So I'll close this issue?

@aneesh-joshi
Contributor Author

Also,
please review my PR
#1426

@parulsethi
Contributor

I think it might be good to make a small mention of tensorboard for visualizing gensim word embeddings, in word2vec.ipynb. @menshikh-iv wdyt?

@aneesh-joshi
Contributor Author

In my opinion, the tensorboard visualisations aren't very intuitive.
They are just dots on the screen that need hovering to reveal the words.
Recognising clusters, etc. becomes difficult.

@aneesh-joshi
Contributor Author

(screenshot attached)

@menshikh-iv
Contributor

@parulsethi I think several visualizations in the word2vec notebook would not be superfluous.
@aneesh-joshi feel free to add some visualization to the notebook.

@aneesh-joshi
Contributor Author

@menshikh-iv made the PR #1440

@menshikh-iv menshikh-iv added documentation Current issue related to documentation feature Issue described a new feature difficulty easy Easy issue: required small fix labels Oct 2, 2017
@halflings

@aneesh-joshi I think a lot of work has gone into the tensorflow embedding visualizer:

If you click on "A", it will show a label of your choosing (you can pass strings, or any other value) and show it instead of the dot. This is useful when you have a small number of words, and want to visualize them directly instead of hovering.

It also has search, conditional coloring of datapoints (e.g. you can pass a label for sentiment score and visualize that as a color, or a label for each class of words, etc.), lets you interactively visualize the nearest neighbors of each dot, lets you run PCA and t-SNE, and lets you "isolate" certain words you want to focus on and only show those (which re-runs PCA or t-SNE on those points alone, showing more contrast between them).

I suspect all these features would be very hard to replicate, and as such I would really like to see this integrated with gensim rather than trying to reinvent the wheel.

@aneesh-joshi
Contributor Author

@halflings
I agree with you.
My thought process was, especially considering it's in a notebook, that somebody new should be able to visualise the vectors they've made, and the tensorboard visualizations didn't feel so intuitive.

However, I wasn't aware of the full extent of the tensorboard features. These would indeed be hard to replicate. Not to mention the additional dependency on the plotly package that my version introduces.

Do you suggest we make something like:
model.plot()

which brings up the tensorboard visualisations?

@menshikh-iv
Contributor

menshikh-iv commented Dec 21, 2017

@halflings

@menshikh-iv : Thanks, did not know about the word2vec2tensor script.
I guess it makes sense to also have a more lightweight way to visualize things directly from a notebook.

I can try reaching out to the people working on the embeddings projector; there might be a possibility of running it directly from a notebook, but that should not stop you from doing a smaller version using plotly!

@menshikh-iv
Contributor

@halflings an "embedded" tensorboard probably would not look good in this case (too little space in a notebook).

IMO, "embedding" isn't needed for this case, but I could be wrong.

@menshikh-iv
Contributor

Resolved in #1800
