We provide here a low-level toolbox to manipulate word2vec vector spaces.
The idea is to offer an environment to study how word2vec captures the semantic relations between words; the project simply aims to explore possibilities.
The project is organised as a Python package following this architecture:
├── data: contains the data (models, dataset, word2vec voc file)
├── notebook: experiments and studies
│ ├── classification: detailed results for the classification proof of concept
│ └── dataExploration: several studies about word2vec vector space
├── thirdparty: external tools
└── toolbox: the actual toolbox
Feel free to have a look at the guided tour; it presents all the possibilities offered by the toolbox.
Side note: sorry for the non-PEP 8 style, I was not yet a real pythonic guy at the time :)
We can separate the project into 4 axes, each of which has a folder in the notebooks:
The repo comes with a minimal environment to run the unit tests and notebooks.
The complete data are available here:
You'll need them if you want to reproduce the experiments.
If you want to use the trained models, be aware that they have all been trained on the Wikipedia corpus, so make sure you use the corresponding word2vec voc file.
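For illustration, here is a minimal sketch of how such a model could be loaded and queried with gensim; the file name below is hypothetical, the actual model and voc files are the ones bundled in the data folder.

```python
# Minimal sketch, assuming gensim and a hypothetical model file name under data/.
from gensim.models import KeyedVectors

# The real artefacts live in data/; adjust the path to the bundled model.
vectors = KeyedVectors.load_word2vec_format("data/wikipedia_model.bin", binary=True)

# Any queried word must belong to the vocabulary of the corresponding voc file,
# i.e. the one built from the same Wikipedia corpus.
print(vectors.most_similar("computer", topn=5))
```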
We used 2 corpora here, hence 2 vector spaces.
Both are skip-gram models with the following respective parameters:
First vector space:
- size: 200
- window: 5
- sample: 1e-4
- negative: 5
- hs: 0

Second vector space:
- size: 300
- window: 10
- sample: 1e-4
- negative: 5
- hs: 0
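As an illustration, the first configuration maps onto gensim's Word2Vec API roughly as follows (gensim >= 4 renames the original `size` parameter to `vector_size`; the corpus variable is a placeholder, not the actual Wikipedia preprocessing):

```python
# Sketch of the first skip-gram configuration with gensim (assumed setup, gensim >= 4).
from gensim.models import Word2Vec

# Placeholder corpus: in practice this would be the tokenised Wikipedia dump.
sentences = [["example", "tokenised", "sentence"], ["another", "one"]]

model = Word2Vec(
    sentences,
    vector_size=200,  # 'size' in the original word2vec parameters
    window=5,
    sample=1e-4,
    negative=5,
    hs=0,
    sg=1,             # skip-gram
)
```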
For classification tasks we used WordNet (antonyms, taxonomy) and manual annotation (domain) as ground truth.
All of it has been extracted and stored in files.
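As a hedged sketch, antonym pairs such as the ones used here could be pulled out of WordNet with NLTK along the following lines; the actual extraction scripts and file formats are the ones shipped with the repo.

```python
# Sketch of antonym-pair extraction from WordNet via NLTK (requires nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

antonym_pairs = set()
for synset in wn.all_synsets():
    for lemma in synset.lemmas():
        for antonym in lemma.antonyms():
            # Sort the pair so (good, bad) and (bad, good) are stored only once.
            antonym_pairs.add(tuple(sorted((lemma.name(), antonym.name()))))

print(len(antonym_pairs), "antonym pairs extracted")
```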
A lot of models have been trained; only the best-performing ones and the global summary log have been bundled in the repo.
Despite the simplicity of the algorithm that produces it, getting a clear understanding of word2vec is a challenge in itself.
We will try to share a vision of it here, not to be taken for granted.
The generated vector space is a discretisation of semantics: just as a .JPG file attempts to describe a continuous plane with a fixed resolution, word2vec tries to capture the infinite nuances of word meaning with a finite number of dimensions.
The main difference is that a low-resolution picture is still understandable from a human point of view, whereas we are biased when trying to understand and see the 'big picture' of a low-resolution semantics.
We suspect that human-level, understandable semantics can be approximated by linear combinations of these lower-resolution dimensions tied together (maybe a notebook to come to elaborate on this).
Besides this, as shown in the notebooks, a Cartesian view of the vector space may not be the best fit to understand how the dimensions are tied together.
Here are some interesting results we found and some word2vec model limitations we highlighted:
By exploring the data, we shifted from a Cartesian to a polar view of the vector space; it turns out that we can separate a concept into 2 parts:
- The semantic direction - angle
- How far to go in this direction - norm
This approach seems to provide good results when considering the taxonomic relations between concepts: while the angle gives the nature of the concept, the norm specifies how specialised or precise one needs to be to fully describe it.
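To make the idea concrete, here is a small NumPy sketch of this polar decomposition; the vectors are random placeholders, and reading the norm as a degree of specialisation is the hypothesis described above, not an established property of word2vec.

```python
# Polar view of word vectors: direction (angle) vs. norm. Vectors are placeholders.
import numpy as np

def to_polar(vector):
    """Return (unit direction, norm) for a word vector."""
    norm = np.linalg.norm(vector)
    return vector / norm, norm

def angle_between(v1, v2):
    """Angle in radians between two vectors, i.e. how far apart their semantic directions are."""
    d1, _ = to_polar(v1)
    d2, _ = to_polar(v2)
    return np.arccos(np.clip(np.dot(d1, d2), -1.0, 1.0))

# Hypothetical vectors standing in for a generic concept and a more specialised one.
generic = np.random.rand(200)
specialised = generic + 0.3 * np.random.rand(200)  # similar direction, larger norm

print("angle:", angle_between(generic, specialised))                   # nature of the concept
print("norms:", np.linalg.norm(generic), np.linalg.norm(specialised))  # degree of specialisation
```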
Our classification tasks highlighted a problem: due to the nature of the data and the way it is learned, a human expert is not always able to judge the quality of a prediction.
Indeed, different human beings have different understandings of a concept and therefore different ground truths.
Can a single human decide, with a yes/no answer, whether 'unambitious' as the opposite of 'intelligent' is a false positive? Even a group of humans with different backgrounds would be unlikely to agree.
We also observed that, in some cases, the predicted results challenge the edges of the ground truth.
Another limitation of word2vec is a single word being used in different contexts: for example, 'skate' is both a fish and a vehicle.
Therefore, the semantic position of such a concept is a compromise between its senses; this may be resolved with enough dimensions.
It raises the question of words being single dots used to describe a continuous semantic space, also at a human level, and of the need to create new words (or abstract ones) or to study etymologies.
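A quick, hedged way to observe this compromise is to inspect the nearest neighbours of a polysemous word such as 'skate'; the model path below is hypothetical and the word has to belong to the training vocabulary.

```python
# Sketch: neighbours of a polysemous word; the single vector should mix both senses.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("data/wikipedia_model.bin", binary=True)  # hypothetical path

if "skate" in vectors:
    for word, score in vectors.most_similar("skate", topn=10):
        print(f"{word}: {score:.3f}")
    # One would expect a mix of fish-related and skating-related neighbours,
    # reflecting the compromise position of the single vector.
```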
This amazing trip into word2vec space was a great experience:
- We trained several really accurate classifiers with good F1 scores despite using simple models (see the sketch after this list):
- Antonyms: 92.1%
- Taxonomy: 84.3%
- Domain: > 95% for 2 domains
- We got a better understanding of the word2vec space organisation and its limitations.
- But more importantly, we scratched the surface of numerous applications challenging human annotations.
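For context, here is a hedged sketch of what such a simple classifier could look like: a logistic regression over word-pair features. The features, data and split are illustrative placeholders, not the repo's actual pipeline.

```python
# Illustrative 'simple model' for a pair-classification task (e.g. antonyms).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder features: each pair represented by the difference of its two word vectors.
X = rng.normal(size=(1000, 200))   # stand-in for vec(w1) - vec(w2)
y = rng.integers(0, 2, size=1000)  # 1 = positive pair (e.g. antonyms), 0 = random pair

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```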
This project was partially carried out during my research semester at the NTU NLP research lab;
therefore, I would like to thank:
- Pr Kim Jung Jae - my advisor
- Luu Anh Tuan - my colleague
- Maciej Baranski - my flatmate =)
for all the helpful discussions we shared on this topic.