# Word Embeddings       (∩^o^)⊃━☆ﾟ.*･｡ﾟ

1. [What are word embeddings?](#what)
2. [How are they trained?](#how)
3. [How did we use them for Beat classification?](#beat)
4. [Resources](#more)

***

### 1. What are word embeddings? <a id='what'></a>

- a word embedding represents a word as a vector
- the vector carries the meaning and context of the word
- similar words are closer together in the vector space

In [59]:
import numpy as np

In [1]:
# using pretrained spacy word embeddings 
import spacy
nlp = spacy.load('en_core_web_lg')

Examples:

In [88]:
# a made-up word without meaning
string_1 = 'drigakulope'
embedding_1 = nlp(string_1)  # a Doc containing a single token

print(embedding_1[0].is_oov)  # oov: out of vocabulary
print('size of vector: ', embedding_1.vector.size)
print(embedding_1.vector)  # all zeros, because the word has no vector

True
size of vector:  300
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [87]:
# word with meaning
string_2 = 'health'
embedding_2 = nlp(string_2)
print(embedding_2[0].is_oov)
print(embedding_2.vector)

False
[-3.2881e-01  2.1108e-01  4.3552e-02  1.3979e-01 -5.2884e-01 -5.1644e-02
 -3.3082e-01 -1.2381e-01 -2.7482e-02  3.2725e+00 -9.5697e-01 -1.8551e-01
 -1.3150e-01  1.9451e-01 -1.7047e-01  2.6562e-01  2.2098e-01  1.7486e+00
 -6.4204e-01 -7.7755e-02 -2.4530e-02  6.3740e-01 -3.0614e-02 -6.2041e-01
  2.3933e-01 -3.6727e-01  6.0124e-02  2.3843e-01  2.6267e-01 -3.7098e-01
 -6.0564e-01 -6.9927e-02  3.3869e-01 -4.2650e-01  3.4224e-01  1.3294e-02
  1.0972e-01  6.0027e-02 -4.1023e-01  4.4138e-01  2.9172e-01  3.3538e-01
 -5.1540e-01 -2.9832e-01 -2.3043e-01  1.5725e-01 -4.3485e-01  2.1103e-01
  2.1040e-01 -8.0232e-02 -7.9495e-01 -2.6638e-01  6.9827e-01 -3.8734e-01
 -8.7617e-02  9.2663e-02  2.2422e-01  6.1503e-01  1.1925e-01 -5.3161e-01
 -4.1816e-02 -3.0765e-01 -1.6384e-01 -1.2057e-01  5.3617e-01  2.0648e-01
  3.2788e-01 -1.9545e-02 -2.7603e-01  3.0034e-01 -6.0207e-01 -2.6588e-01
  2.0489e-01 -1.0422e-01  7.3177e-01  2.5777e-01  1.0356e-01 -2.9657e-01
  1.5593e-01 -1.1592e-01  3.4370e-02  1.9574e

Mindmap for the word **Health**:

![title](images/mindmap.jpg)

In [90]:
# similarity score between a selection of words
token_1 = nlp('diet')
words = 'exercise food vitamins backyard'
tokens = nlp(words)
print('------------similarity scores----------')
for token in tokens:
    print(token_1.text, 'to',token.text, ':', token_1.similarity(token))

------------similarity scores----------
diet to exercise : 0.582544108078496
diet to food : 0.6076305053237465
diet to vitamins : 0.6295147982810888
diet to backyard : 0.1942055804386137


In [None]:
# what is the similarity score?
# the cosine similarity between the two vectors
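
The cosine similarity behind the scores above can be sketched in a few lines of plain Python. The three vectors are made-up toy values, not real spaCy embeddings, and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # a zero vector (e.g. an out-of-vocabulary word) has no direction
        return 0.0
    return dot / (norm_a * norm_b)

# toy 3-dimensional vectors, invented for illustration only
diet = [1.0, 2.0, 0.0]
food = [2.0, 4.0, 0.0]      # points in the same direction as diet
backyard = [0.0, 0.0, 3.0]  # orthogonal to diet

print(cosine_similarity(diet, food))      # close to 1.0
print(cosine_similarity(diet, backyard))  # 0.0
```

The score depends only on the direction of the vectors, not on their length, which is why `food` scores close to 1.0 even though it is twice as long as `diet`.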

In [64]:
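# most similar words (closest vectors)
def most_similar(word, topn=5):
    word = nlp.vocab[str(word)]  # look up the lexeme for the query word
    # candidates: same casing as the query, reasonably frequent, non-zero vector
    queries = [
        w for w in word.vocab
        if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]

    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    # take topn+1 because the query word itself usually ranks first and is dropped;
    # if the query is not among the candidates, topn+1 neighbours are returned
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn + 1] if w.lower_ != word.lower_]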

most_similar("dog", topn=3)

[('dogs', 0.8835931), ('puppy', 0.85852146), ('pet', 0.8057451)]

In [65]:
most_similar('health')

[('healthcare', 0.78880006),
 ('care', 0.74709594),
 ('wellness', 0.73053104),
 ('medical', 0.7086058),
 ('nutrition', 0.6945323)]

In [66]:
most_similar('diet')

[('diets', 0.89323026),
 ('dieting', 0.8071348),
 ('dietary', 0.7490052),
 ('calorie', 0.739019),
 ('foods', 0.73535097)]

In [68]:
most_similar('coronavirus') 
# NB: the embeddings were trained on a news and media corpus from 2013
# the word coronavirus is already in the corpus, but it is not a common word

[('influenza', 0.5106981),
 ('rabies', 0.45984036),
 ('virus', 0.44592503),
 ('hepatitis', 0.44005278),
 ('pathogen', 0.43760213),
 ('ebola', 0.43437657)]

In [70]:
most_similar('flu')

[('influenza', 0.8240887),
 ('swine', 0.7921757),
 ('pandemic', 0.76574874),
 ('vaccine', 0.7106953),
 ('outbreak', 0.6994005)]

In [82]:
doc = nlp("Apple and banana are similar. Pasta and hippo aren't.")

apple = doc[0]
banana = doc[2]
pasta = doc[6]
hippo = doc[8]

print("apple <-> banana", apple.similarity(banana))
print("pasta <-> hippo", pasta.similarity(hippo))

apple <-> banana 0.5831845
pasta <-> hippo 0.079349115


Word embeddings visualized

<img src="images/embeddings.png" width="700">

- similar words are close to each other in the vector space
- PCA reduces the 300 dimensions to 3 here for visualisation purposes
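
The reduction from 300 to 3 dimensions could look roughly like the sketch below. It is not necessarily how the figure above was produced: `pca_reduce` is a hypothetical helper, PCA is done via NumPy's SVD, and the input is random data standing in for real word vectors.

```python
import numpy as np

def pca_reduce(vectors, n_components=3):
    """Project row vectors onto their first n_components principal axes."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # rows of vt are the principal axes, ordered by explained variance
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(20, 300))  # stand-in for 20 word vectors
points_3d = pca_reduce(fake_embeddings, n_components=3)
print(points_3d.shape)  # (20, 3)
```

With real embeddings, each row of `points_3d` would be the 3D plotting position of one word.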

We can use word vectors to find similar words and words used in the same context, <br> we can compare the meaning of single words, sentences or full documents, <br> and we can even perform simple math operations, working with the relationship between words.

In [None]:
# king - man + woman = queen  This describes a gender relationship.
# Another example is: paris – france + germany = berlin. In this case, the vector difference between paris and france captures the concept of capital city.
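
The king/queen arithmetic can be illustrated with hand-made 2-dimensional vectors. The values are invented so that the first dimension roughly means "royal" and the second "male vs. female"; real embeddings are learned and only approximately behave this way. `analogy` is a hypothetical helper, not a spaCy function.

```python
import math

# toy vectors, invented for illustration: [royalty, gender]
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # the three input words are excluded from the candidates
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

Excluding the input words matters with real embeddings too, because the nearest neighbour of the result vector is often one of the inputs.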

***

### 2. How are they trained? <a id='how'></a>

**Word2Vec**: two-layer neural networks that are trained to reconstruct linguistic contexts of words

Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (**CBOW**) or continuous **skip-gram**.
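
The difference between the two architectures is easiest to see in the training examples they are built from. The sketch below only generates those examples and does not train a network; `skipgram_pairs` and `cbow_examples` are hypothetical helper names, and real Word2Vec implementations add refinements such as subsampling and negative sampling that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each centre word is used to predict each word in its window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words together predict the centre word."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples

sentence = "a healthy diet needs vitamins".split()

print(skipgram_pairs(sentence, window=1)[:3])
# [('a', 'healthy'), ('healthy', 'a'), ('healthy', 'diet')]
print(cbow_examples(sentence, window=1)[2])
# (['healthy', 'needs'], 'diet')
```

Skip-gram produces one (input, target) pair per context word, while CBOW produces one example per position with the whole context as input.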

In [None]:
# explanation

In [None]:
# example

In [None]:
# train our own model with gensim Word2Vec: easy to do, with great built-in features


***

### 3. How did we use word embeddings for beat classification? <a id='beat'></a>

***

### 4. More information and resources <a id='more'></a>

spaCy: https://spacy.io/usage/spacy-101 <br> spaCy is ...

Gensim, Word2Vec ...

StarSpace: https://github.com/facebookresearch/StarSpace <br> StarSpace is a general-purpose neural model for efficient learning of entity embeddings for solving a wide variety of problems.

***

米＾－＾米 