We provide here a low-level toolbox to manipulate word2vec vector spaces.
The idea is to offer an environment to study how word2vec captures the semantic relations between words; the project simply aims to explore possibilities.
The project is organised as a Python package following this architecture:
├── data: contains the data (models, dataset, word2vec voc file)
├── notebook: experiments and studies
│ ├── classification: detailed results for the classification proof of concept
│ └── dataExploration: several studies about word2vec vector space
├── thirdparty: external tools
└── toolbox: the actual toolbox
Feel free to have a look at the guided tour; it presents all the possibilities offered by the toolbox.
Side note: sorry for the non-PEP 8 style, I was not yet a real pythonic guy at the time :)
We can separate the project into 4 axes, each of which has a folder in the notebooks:
The repo comes with a minimal environment to run the unit tests and notebooks.
The complete data are available here:
You'll need them if you want to reproduce the experiments.
If you want to use the trained models, be aware that they have all been trained on the Wikipedia corpus, so make sure you use the corresponding word2vec voc file.
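For illustration, here is a minimal sketch of how such a model could be loaded and queried with gensim; the file name below is hypothetical, the actual model and voc files are the ones bundled in the data folder.

```python
# Minimal sketch, assuming gensim and a hypothetical model file name under data/.
from gensim.models import KeyedVectors

# The real artefacts live in data/; adjust the path to the bundled model.
vectors = KeyedVectors.load_word2vec_format("data/wikipedia_model.bin", binary=True)

# Any queried word must belong to the vocabulary of the corresponding voc file,
# i.e. the one built from the same Wikipedia corpus.
print(vectors.most_similar("computer", topn=5))
```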
We used 2 corpora here, hence 2 vector spaces.
Both are skip-gram models with the following respective parameters:
First vector space:
- size: 200
- window: 5
- sample: 1e-4
- negative: 5
- hs: 0

Second vector space:
- size: 300
- window: 10
- sample: 1e-4
- negative: 5
- hs: 0
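As an illustration, the first configuration maps onto gensim's Word2Vec API roughly as follows (gensim >= 4 renames the original `size` parameter to `vector_size`; the corpus variable is a placeholder, not the actual Wikipedia preprocessing):

```python
# Sketch of the first skip-gram configuration with gensim (assumed setup, gensim >= 4).
from gensim.models import Word2Vec

# Placeholder corpus: in practice this would be the tokenised Wikipedia dump.
sentences = [["example", "tokenised", "sentence"], ["another", "one"]]

model = Word2Vec(
    sentences,
    vector_size=200,  # 'size' in the original word2vec parameters
    window=5,
    sample=1e-4,
    negative=5,
    hs=0,
    sg=1,             # skip-gram
)
```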
For classification tasks we used WordNet (antonyms, taxonomy) and manual annotation (domain) as ground truth.
All of it has been extracted and stored in files.
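As a hedged sketch, antonym pairs such as the ones used here could be pulled out of WordNet with NLTK along the following lines; the actual extraction scripts and file formats are the ones shipped with the repo.

```python
# Sketch of antonym-pair extraction from WordNet via NLTK (requires nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

antonym_pairs = set()
for synset in wn.all_synsets():
    for lemma in synset.lemmas():
        for antonym in lemma.antonyms():
            # Sort the pair so (good, bad) and (bad, good) are stored only once.
            antonym_pairs.add(tuple(sorted((lemma.name(), antonym.name()))))

print(len(antonym_pairs), "antonym pairs extracted")
```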
A lot of models have been trained; only the best-performing ones and the global summary log have been bundled in the repo.
Despite the simplicity of the algorithm that produces it, getting a clear understanding of word2vec is a challenge in itself.
We will try to share a vision of it here, not to be taken for granted.
The generated vector space is a discretisation of semantics: just as a .JPG file attempts to describe a continuous plane with a fixed resolution, word2vec tries to capture the infinite nuances of word meaning with a finite number of dimensions.
The main difference is that a low-resolution picture is still understandable from a human point of view, whereas we are biased when trying to understand and see the 'big picture' of a low-resolution semantics.
We suspect that human-level, understandable semantics can be approximated by linear combinations of these lower-resolution dimensions tied together (maybe a notebook to come to elaborate on this).
Besides this, as shown in the notebooks, a Cartesian view of the vector space may not be the best fit to understand how the dimensions are tied together.
Here are some interesting results we found and some word2vec model limitations we highlighted:
By exploring the data, we shifted from a Cartesian to a polar view of the vector space; it turns out that we can separate a concept into 2 parts:
- The semantic direction - angle
- How far to go in this direction - norm
This approach seems to provide good results when considering the taxonomic relations between concepts: while the angle gives the nature of the concept, the norm specifies how specialised or precise one needs to be to fully describe it.
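To make the idea concrete, here is a small NumPy sketch of this polar decomposition; the vectors are random placeholders, and reading the norm as a degree of specialisation is the hypothesis described above, not an established property of word2vec.

```python
# Polar view of word vectors: direction (angle) vs. norm. Vectors are placeholders.
import numpy as np

def to_polar(vector):
    """Return (unit direction, norm) for a word vector."""
    norm = np.linalg.norm(vector)
    return vector / norm, norm

def angle_between(v1, v2):
    """Angle in radians between two vectors, i.e. how far apart their semantic directions are."""
    d1, _ = to_polar(v1)
    d2, _ = to_polar(v2)
    return np.arccos(np.clip(np.dot(d1, d2), -1.0, 1.0))

# Hypothetical vectors standing in for a generic concept and a more specialised one.
generic = np.random.rand(200)
specialised = generic + 0.3 * np.random.rand(200)  # similar direction, larger norm

print("angle:", angle_between(generic, specialised))                   # nature of the concept
print("norms:", np.linalg.norm(generic), np.linalg.norm(specialised))  # degree of specialisation
```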
Our classification tasks highlighted a problem: due to the nature of the data and the way it is learned, a human expert is not always able to judge the quality of a prediction.
Indeed, different human beings have different understandings of a concept and therefore different ground truths.
Can a single human decide, with a yes/no answer, whether 'unambitious' as the opposite of 'intelligent' is a false positive? Even a group of humans with different backgrounds would be unlikely to agree.
We also observed that, in some cases, the predicted results challenge the edges of the ground truth.
Another limitation of word2vec is a single word being used in different contexts: for example, 'skate' is both a fish and a vehicle.
Therefore, the semantic position of such a concept is a compromise between its senses; this may be resolved with enough dimensions.
It raises the question of words being single dots used to describe a continuous semantic space, also at a human level, and of the need to create new words (or abstract ones) or to study etymologies.
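A quick, hedged way to observe this compromise is to inspect the nearest neighbours of a polysemous word such as 'skate'; the model path below is hypothetical and the word has to belong to the training vocabulary.

```python
# Sketch: neighbours of a polysemous word; the single vector should mix both senses.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("data/wikipedia_model.bin", binary=True)  # hypothetical path

if "skate" in vectors:
    for word, score in vectors.most_similar("skate", topn=10):
        print(f"{word}: {score:.3f}")
    # One would expect a mix of fish-related and skating-related neighbours,
    # reflecting the compromise position of the single vector.
```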
This amazing trip into word2vec space was a great experience:
- We trained several really accurate classifiers with good F1 scores despite using simple models (see the sketch after this list):
- Antonyms: 92.1%
- Taxonomy: 84.3%
- Domain: > 95% for 2 domains
- We got a better understanding of the word2vec space organisation and its limitations.
- But more importantly, we scratched the surface of numerous applications challenging human annotations.
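For context, here is a hedged sketch of what such a simple classifier could look like: a logistic regression over word-pair features. The features, data and split are illustrative placeholders, not the repo's actual pipeline.

```python
# Illustrative 'simple model' for a pair-classification task (e.g. antonyms).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder features: each pair represented by the difference of its two word vectors.
X = rng.normal(size=(1000, 200))   # stand-in for vec(w1) - vec(w2)
y = rng.integers(0, 2, size=1000)  # 1 = positive pair (e.g. antonyms), 0 = random pair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```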
This project was partially carried out during my research semester at the NTU NLP research lab;
therefore, I would like to thank:
- Pr Kim Jung Jae - my advisor
- Luu Anh Tuan - my colleague
- Maciej Baranski - my flatmate =)
for all the helpful discussions we shared on this topic.