I think word embeddings are pretty cool
They can feed downstream neural networks, or support a close reading of a particular corpus. While their results should always be taken with a grain of salt, they can reveal noteworthy insights.
So, I decided to take my favorite collection of short stories and novels, the Sherlock Holmes collection, and generate word embeddings with Google's word2vec model, using gensim.
There are two pre-trained word embedding files (output/*.bin), containing vectors trained on this corpus with the Continuous Bag of Words (cbow) method and with the Skip-gram (skip) method, respectively. My personal insights from the corpus are in the output/*.txt files. The source text for these vectors is the USA version.
As mentioned before, this generator uses gensim to create the word embeddings. It also uses nltk to parse the sentences into tokens.
pip install gensim
pip install nltk
If you want to make your own vectors, you will need to pass in 3 arguments. These are:
- A source file (e.g. data/curated.txt)
- An output name for your vector file (e.g. skip_vectors.bin). This needs to have a .bin extension.
- An embedding type: either cbow (Continuous Bag of Words) or skip (Skip-gram)
Then, run the command as follows:
python embeddings.py <source_file> <output_name.bin> <cbow or skip>
For example:
python embeddings.py "data/curated.txt" "skip_vectors.bin" "skip"
More food for thought
Skip-gram methodology: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
More on Gensim: https://radimrehurek.com/gensim/models/word2vec.html , https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
Cosine Similarity: https://stackoverflow.com/questions/15173225/how-to-calculate-cosine-similarity-given-2-sentence-strings-python
This project was made using Python 3.5.4