# Applying Word-Embeddings


There are several options for working with word embeddings:
1. Pre-trained word embeddings can be downloaded from the web. These embeddings differ in
    * the training method, e.g. Skipgram, CBOW, GloVe, fastText
    * the hyperparameters applied for the selected method, e.g. context length
    * the corpus used for training
2. With packages such as [gensim](https://radimrehurek.com/gensim/), word embeddings can easily be trained from an arbitrary collection of texts.
3. Training of a word embedding can be integrated into an end-to-end neural network for a specific application. For example, if a deep neural network is to be trained for document classification, its first layer can be defined such that it learns a task-specific word embedding from the document-classification training data.

In this notebook, options 1 and 2 are demonstrated. Option 3 is applied in a later lecture.

## Apply Pre-Trained Word-Embeddings
### FastText


The [FastText project](https://fasttext.cc) provides word-embeddings for 157 different languages, trained on [Common Crawl](https://commoncrawl.org/) and [Wikipedia](https://www.wikipedia.org/). These word embeddings can easily be downloaded and imported into Python. The `KeyedVectors`-class of [gensim](https://radimrehurek.com/gensim/) can be applied for the import. This class also provides many useful tools, e.g. an index to quickly find the vector of an arbitrary word, or functions to calculate similarities between word-vectors. Some of these tools are demonstrated below.

After downloading word embeddings from [FastText](https://fasttext.cc/docs/en/english-vectors.html) they can be imported into a `KeyedVectors`-object from gensim as follows:

In [1]:
from gensim.models import KeyedVectors
import numpy as np
import warnings

In [2]:
warnings.filterwarnings("ignore")

In [3]:
# Load the downloaded vectors (word2vec text format); adjust the path to your local copy of the file
en_model = KeyedVectors.load_word2vec_format('/Users/johannes/DataSets/Gensim/FastText/fasttextEnglish300.vec')
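
The `.vec` files use the plain-text word2vec format: a header line with the number of vectors and their dimension, followed by one line per word containing the token and its components. The following minimal sketch parses a tiny in-memory example of that layout. The function name `read_vec` and the three sample vectors are made up for illustration; this is not how gensim implements `load_word2vec_format`.

```python
import io

def read_vec(handle):
    """Parse word2vec text format: header 'count dim', then 'word v1 ... v_dim' per line."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError("header count does not match the number of vectors read")
    return vectors, dim

# Hypothetical toy data, not taken from the real FastText file
sample = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "and 0.0 -0.1 0.5 0.2\n"
    "wood 0.7 0.1 -0.3 0.9\n"
)
toy_vectors, toy_dim = read_vec(sample)
print(toy_dim, list(toy_vectors))
```

Reading the real file this way would work in principle, but `KeyedVectors` additionally builds the index and the similarity tools used in the remainder of this section.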

The number of vectors and their length can be accessed as follows:

In [13]:
# Printing out number of tokens available
print("Number of Tokens: {}".format(en_model.vectors.shape[0]))

# Printing out the dimension of a word vector 
print("Dimension of a word vector: {}".format(en_model.vectors.shape[1]))

Number of Tokens: 999994
Dimension of a word vector: 300


The first 20 words in the index:

In [71]:
en_model.index2word[:20]  # gensim >= 4.0: en_model.index_to_key[:20]

[',',
 'the',
 '.',
 'and',
 'of',
 'to',
 'in',
 'a',
 '"',
 ':',
 ')',
 'that',
 '(',
 'is',
 'for',
 'on',
 '*',
 'with',
 'as',
 'it']

The first 10 components of the word-vector for *evening*:

In [66]:
en_model["evening"][:10]

array([-0.0219,  0.0138, -0.0924, -0.0028, -0.0823, -0.1428,  0.0269,
       -0.0193,  0.0447,  0.0336], dtype=float32)

The first 10 components of the word-vector for *morning*:

In [67]:
en_model["morning"][:10]

array([-0.0025,  0.0429, -0.1727,  0.0185, -0.0414, -0.1486,  0.0326,
       -0.0501,  0.1374, -0.1151], dtype=float32)

The similarity between *evening* and *morning*:

In [83]:
similarity = en_model.similarity('morning', 'evening')
similarity

0.8645973
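
The value returned by `similarity` is the cosine similarity of the two word vectors, i.e. their dot product divided by the product of their Euclidean norms. A minimal sketch in plain Python, with made-up 3-dimensional vectors standing in for the real 300-dimensional ones (the helper name `cosine_similarity` is an assumption of this sketch):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not the real embeddings
morning = [0.2, 0.9, 0.1]
evening = [0.3, 0.8, 0.2]
print(round(cosine_similarity(morning, evening), 4))
```

The result lies in the range from -1 to 1, where 1 means that both vectors point in the same direction.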

The 20 words most similar to *wood*:

In [73]:
en_model.most_similar("wood",topn=20)

[('timber', 0.7636732459068298),
 ('lumber', 0.7316348552703857),
 ('kiln-dried', 0.7024550437927246),
 ('wooden', 0.6998946666717529),
 ('oak', 0.674289345741272),
 ('plywood', 0.6731638312339783),
 ('hardwood', 0.6648495197296143),
 ('woods', 0.6632275581359863),
 ('pine', 0.654842734336853),
 ('straight-grained', 0.6503476500511169),
 ('wood-based', 0.6416549682617188),
 ('firewood', 0.6402209997177124),
 ('iroko', 0.6389516592025757),
 ('metal', 0.6362859606742859),
 ('timbers', 0.6347957849502563),
 ('quartersawn', 0.6330605149269104),
 ('Wood', 0.6307631731033325),
 ('forest', 0.6296596527099609),
 ('end-grain', 0.6279916763305664),
 ('furniture', 0.6257956624031067)]
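
Conceptually, `most_similar` ranks all other words in the vocabulary by their cosine similarity to the query word and returns the best `topn` of them. The brute-force sketch below illustrates this on a hypothetical four-word vocabulary with invented vectors; gensim itself performs the ranking with vectorized matrix operations over the full vocabulary.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(vocab, query, topn=2):
    """Return the topn (word, similarity) pairs, excluding the query word itself."""
    scores = [(word, cosine(vocab[query], vec))
              for word, vec in vocab.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vocabulary
toy_vocab = {
    "wood":   [0.9, 0.1, 0.0],
    "timber": [0.8, 0.2, 0.1],
    "metal":  [0.5, 0.5, 0.2],
    "music":  [0.0, 0.1, 0.9],
}
ranking = most_similar(toy_vocab, "wood", topn=2)
for word, score in ranking:
    print("{}: {:.4f}".format(word, score))
```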

### GloVe
As described [before](05representations.md), GloVe constitutes another method for calculating word embeddings. Pre-trained GloVe vectors can be downloaded from
[GloVe](https://nlp.stanford.edu/projects/glove/) and imported into Python. However, gensim already provides a downloader for several word-embeddings, including GloVe embeddings of different vector lengths, trained on different corpora.

The corpora and embeddings, which are available via the gensim downloader, can be queried as follows:

In [84]:
import gensim.downloader as api

In [93]:
api.info(name_only=True)

{'corpora': ['semeval-2016-2017-task3-subtaskBC',
  'semeval-2016-2017-task3-subtaskA-unannotated',
  'patent-2017',
  'quora-duplicate-questions',
  'wiki-english-20171001',
  'text8',
  'fake-news',
  '20-newsgroups',
  '__testing_matrix-synopsis',
  '__testing_multipart-matrix-synopsis'],
 'models': ['fasttext-wiki-news-subwords-300',
  'conceptnet-numberbatch-17-06-300',
  'word2vec-ruscorpora-300',
  'word2vec-google-news-300',
  'glove-wiki-gigaword-50',
  'glove-wiki-gigaword-100',
  'glove-wiki-gigaword-200',
  'glove-wiki-gigaword-300',
  'glove-twitter-25',
  'glove-twitter-50',
  'glove-twitter-100',
  'glove-twitter-200',
  '__testing_word2vec-matrix-synopsis']}

We select the GloVe word-embeddings `glove-wiki-gigaword-100` for download: 

In [92]:
word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data

In [94]:
type(word_vectors)

gensim.models.keyedvectors.Word2VecKeyedVectors

As the previous output shows, the downloaded data is available as a `KeyedVectors`-object. Hence the same methods can be applied as for the FastText word embeddings in the previous section. Below, we apply not only the methods used above, but also some new ones.

Word analogy questions like *man is to king as woman is to ?* can be solved as in the code cell below:

In [95]:
result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

queen: 0.7699
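
The idea behind such an analogy query is vector arithmetic: the unit-normalized vectors of the `positive` words are added, those of the `negative` words are subtracted, and the word whose vector is closest (by cosine similarity) to the result is returned, excluding the query words themselves. The sketch below reproduces this idea on a hypothetical 2-dimensional vocabulary; the vectors are invented so that one axis loosely encodes "royalty" and the other "gender", and the function name `analogy` is an assumption of this sketch.

```python
import math

def unit(v):
    """Scale v to Euclidean length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vocab, positive, negative):
    dim = len(next(iter(vocab.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vocab[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vocab[word]))]
    target = unit(target)
    exclude = set(positive) | set(negative)
    # Cosine similarity equals the dot product because both vectors have length 1
    scores = {word: sum(t * x for t, x in zip(target, unit(vec)))
              for word, vec in vocab.items() if word not in exclude}
    return max(scores.items(), key=lambda pair: pair[1])

# Hypothetical toy vocabulary
toy_vocab = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.7, 0.1],
}
best_word, best_score = analogy(toy_vocab, positive=["woman", "king"], negative=["man"])
print("{}: {:.4f}".format(best_word, best_score))
```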


Outliers within sets of words can be determined as follows:

In [96]:
print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))

cereal
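
A common way to find such an outlier, and in essence what `doesnt_match` does, is to average the unit-normalized vectors of all given words and to return the word whose vector has the lowest cosine similarity to this mean. A minimal sketch with invented 2-dimensional vectors (the toy values are chosen such that the three meal words point in a similar direction):

```python
import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match(vocab, words):
    """Return the word that is least similar to the mean of all word vectors."""
    vectors = [unit(vocab[word]) for word in words]
    mean = unit([sum(column) / len(vectors) for column in zip(*vectors)])
    similarities = [sum(m * x for m, x in zip(mean, vec)) for vec in vectors]
    return words[similarities.index(min(similarities))]

# Hypothetical toy vocabulary
toy_vocab = {
    "breakfast": [0.9, 0.1],
    "lunch":     [0.8, 0.2],
    "dinner":    [0.85, 0.15],
    "cereal":    [0.2, 0.9],
}
outlier = doesnt_match(toy_vocab, "breakfast cereal dinner lunch".split())
print(outlier)
```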


Similarity between a pair of words:

In [98]:
similarity = word_vectors.similarity('woman', 'man')
print(similarity)

0.8323494


Most similar words to *cat*:

In [106]:
word_vectors.most_similar("cat",topn=20)

[('dog', 0.8798074722290039),
 ('rabbit', 0.7424427270889282),
 ('cats', 0.7323004007339478),
 ('monkey', 0.7288710474967957),
 ('pet', 0.7190139293670654),
 ('dogs', 0.7163873314857483),
 ('mouse', 0.6915251016616821),
 ('puppy', 0.6800068616867065),
 ('rat', 0.6641027331352234),
 ('spider', 0.6501134634017944),
 ('elephant', 0.6372530460357666),
 ('boy', 0.6266894340515137),
 ('bird', 0.6266419887542725),
 ('baby', 0.6257247924804688),
 ('pig', 0.6254673004150391),
 ('horse', 0.6251551508903503),
 ('snake', 0.6227242350578308),
 ('animal', 0.6200780272483826),
 ('dragon', 0.6187658309936523),
 ('duck', 0.6158087253570557)]

Similarity between sets of words:

In [108]:
sim = word_vectors.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
print("{:.4f}".format(sim))

0.7067
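
`n_similarity` compares two sets of words by first averaging the vectors within each set and then computing the cosine similarity between the two mean vectors. A plain-Python sketch of this computation, using a hypothetical vocabulary with invented 3-dimensional vectors:

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def n_similarity(vocab, words_a, words_b):
    mean_a = mean_vector([vocab[word] for word in words_a])
    mean_b = mean_vector([vocab[word] for word in words_b])
    return cosine(mean_a, mean_b)

# Hypothetical toy vocabulary
toy_vocab = {
    "sushi":      [0.9, 0.2, 0.1],
    "shop":       [0.1, 0.8, 0.2],
    "japanese":   [0.8, 0.1, 0.3],
    "restaurant": [0.3, 0.9, 0.1],
}
set_sim = n_similarity(toy_vocab, ["sushi", "shop"], ["japanese", "restaurant"])
print("{:.4f}".format(set_sim))
```

Because the vectors are averaged, the order of the words within each set does not influence the result.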


In [109]:
vector = word_vectors['computer']  # numpy vector of a word
print(vector.shape)
print(vector[:10])

(100,)
[-0.16298   0.30141   0.57978   0.066548  0.45835  -0.15329   0.43258
 -0.89215   0.57747   0.36375 ]


In [110]:
np.sqrt(np.sum(np.square(vector)))  # Euclidean (L2) norm of the raw vector

6.529161

In [111]:
vector = word_vectors.word_vec('office', use_norm=True)  # unit-normalized vector; gensim >= 4.0: word_vectors.get_vector('office', norm=True)
print(vector.shape)
print(vector[:10])

(100,)
[-0.01455544 -0.13056442  0.06381373 -0.00747831  0.10621653  0.02454428
 -0.08777763  0.1584893   0.0725054   0.08593655]


In [112]:
np.sqrt(np.sum(np.square(vector)))  # the normalized vector has length 1

1.0
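
The last cells contrast a raw vector, whose Euclidean norm is arbitrary, with a unit-normalized vector, whose norm is 1. Normalization divides every component by the norm of the vector. One reason this is useful: for unit-normalized vectors the plain dot product already equals the cosine similarity. A small sketch with two invented 2-dimensional vectors (the helper names `l2_norm`, `unit` and `dot` are assumptions of this sketch):

```python
import math

def l2_norm(v):
    """Euclidean length of v, same computation as np.sqrt(np.sum(np.square(v)))."""
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    """Divide each component by the norm, so that the result has length 1."""
    norm = l2_norm(v)
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy vectors
a = [3.0, 4.0]
b = [4.0, 3.0]

cos_raw = dot(a, b) / (l2_norm(a) * l2_norm(b))  # cosine from the raw vectors
cos_unit = dot(unit(a), unit(b))                 # dot product of the normalized vectors

print(l2_norm(a), l2_norm(unit(a)))
print(cos_raw, cos_unit)
```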