<a href="https://colab.research.google.com/github/kai-lim/NLP_course/blob/main/D3_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Word embeddings
*(Credit: Leon Derczynski, IT University of Copenhagen)*

Let's train some embeddings, and then use them to see which words are close to each other.
We'll use the gensim package's word2vec implementation, and two nltk corpora (Brown and movie_reviews). We also need to download punkt - an nltk tokeniser used by the movie_reviews corpus.

In [None]:
from gensim.models import Word2Vec
from nltk.corpus import brown, movie_reviews

import nltk
nltk.download('brown')
nltk.download('movie_reviews')
nltk.download('punkt')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Let's generate word vectors over the Brown corpus text. We will have 20 dimensions, using a window of three context words on each side of the target word (e.g. c1, c2, c3, w, c4, c5, c6). Note that gensim trains a CBOW model by default; pass sg=1 if you want skip-grams. This might be a little slow (maybe 1-2 minutes).

In [None]:
# for the Brown corpus
# (gensim 4.0 and later call the `size` argument `vector_size`)
b = Word2Vec(brown.sents(), size=20, window=3, min_count=3)
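To make the `window` and `min_count` arguments concrete, here is a minimal sketch in plain Python. The helper names `context_pairs` and `frequent_words` are made up for this illustration and are not part of gensim. The sketch also ignores the fact that word2vec randomly shrinks the window for each target word during training.

```python
from collections import Counter

def context_pairs(tokens, window):
    """Return (target, context) pairs using up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

def frequent_words(sentences, min_count):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}

toy_sentence = ["the", "cat", "sat", "on", "the", "mat"]
sat_contexts = [c for t, c in context_pairs(toy_sentence, window=2) if t == "sat"]
print(sat_contexts)  # ['the', 'cat', 'on', 'the']

toy_corpus = [toy_sentence, ["the", "dog", "sat"]]
print(sorted(frequent_words(toy_corpus, min_count=2)))  # ['sat', 'the']
```

A larger window pulls in more distant context words, and a higher `min_count` drops more rare words from the vocabulary before training.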

Now we have the vectors, we can see how good they are by measuring which words are similar to each other.

In [None]:
b.wv.most_similar('company', topn=5)

[('pool', 0.9683407545089722),
 ('planter', 0.9549104571342468),
 ('church', 0.9533883333206177),
 ('paper', 0.9525842666625977),
 ('valley', 0.9507160782814026)]
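The scores in these lists are cosine similarities. As a rough illustration of what the ranking involves, here is a small numpy sketch. The three-dimensional vectors are invented for the example, and `nearest` is a hypothetical helper, not a gensim function.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(query, vectors, topn=2):
    """Rank every other word in `vectors` by cosine similarity to `query`."""
    scores = [(w, cosine_similarity(vectors[query], v))
              for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors, for illustration only.
toy_vectors = {
    "company": np.array([1.0, 0.2, 0.0]),
    "firm": np.array([0.9, 0.3, 0.1]),
    "pool": np.array([0.0, 1.0, 0.5]),
}
ranked = nearest("company", toy_vectors)
print(ranked)
```

In the Brown output above, unrelated words all score above 0.95, which suggests the small model has not spread the words out much in its 20 dimensions.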

In [None]:
# for the Brown corpus again, with more dimensions and a wider window
b2 = Word2Vec(brown.sents(), size=30, window=5, min_count=3)

In [None]:
b2.wv.most_similar('company', topn=5)

[('soul', 0.9609832763671875),
 ('paper', 0.9599371552467346),
 ('pool', 0.959900438785553),
 ('driver', 0.9578371644020081),
 ('tragedy', 0.954524576663971)]

Not great, eh? Try altering the window and the dimension size, to see if you get better results.

Try it with the movie reviews corpus too!

In [None]:
# for the movie review corpus
mr = Word2Vec(movie_reviews.sents(), size=20, window=5, min_count=3)

In [None]:
mr.wv.most_similar('love', topn=5)

[('urinate', 0.7589472532272339),
 ('empathize', 0.7365001440048218),
 ('lei', 0.7324034571647644),
 ('learn', 0.7205872535705566),
 (';', 0.6854209303855896)]

We can also do some arithmetic with the words. The classic result is king - man + woman ≈ queen. Let's start with a similar analogy, biggest - big + small, which should ideally land near smallest.

In [None]:
b.wv.most_similar(positive=['biggest', 'small'], negative=['big'], topn=5)

[('military', 0.969031572341919),
 ("other's", 0.9493482112884521),
 ('levels', 0.9446753263473511),
 ('tax', 0.9438850283622742),
 ('lower', 0.9429642558097839)]
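To see what the arithmetic is meant to do when the vectors are well behaved, here is a sketch on made-up two-dimensional vectors. It adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity. This is roughly how I understand `most_similar` to work; the `analogy` helper and the toy vectors are assumptions for illustration only.

```python
import numpy as np

def unit(v):
    """Scale v to length 1."""
    return v / np.linalg.norm(v)

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words themselves
    scores = [(w, float(np.dot(unit(v), query)))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors: axis 0 is roughly "size", axis 1 is roughly "superlative".
toy = {
    "big": np.array([1.0, 0.0]),
    "biggest": np.array([1.0, 1.0]),
    "small": np.array([-1.0, 0.0]),
    "smallest": np.array([-1.0, 1.0]),
    "tax": np.array([0.2, -1.0]),
}
result = analogy(toy, positive=["biggest", "small"], negative=["big"])
print(result[0][0])  # smallest
```

The analogy only works here because the toy vectors encode size and superlative as clean, separate directions. The small Brown model has not learned anything that tidy.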

Not a perfect result with our small Brown model! Why don't we try loading embeddings trained on a bigger dataset, with a bigger vocabulary? This should give better results. You'll need the GloVe embeddings for this.

We will download this from a GitHub repository. If you are running this on your own local computer (rather than Colaboratory) you can download from www.derczynski.com/glove.twitter.27B.25d.txt.bz2 to your machine. In this case, there is no need to run the git clone line in the next cell, but you still need its import. Then just replace the file name in the cell after next with the path to your downloaded file.

In [None]:
!git clone --quiet https://github.com/KCL-Health-NLP/nlp_examples.git  
from gensim.models.keyedvectors import KeyedVectors
print("Done copying files")

Done copying files


Now let's load the model file. This might take a few minutes. If you are using a copy on your own local machine, change the file path below to that of your file.

In [None]:
glove = KeyedVectors.load_word2vec_format("nlp_examples/representation/glove.twitter.27B.25d.txt.bz2", binary=False)
print("Done loading")

Done loading
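For reference, the word2vec text format that `load_word2vec_format` reads with `binary=False` is simple: a header line with the vocabulary size and the dimension, then one line per word with the word followed by its values. The sketch below parses a tiny in-memory example. The function `read_word2vec_text` is a made-up helper, and I am assuming the file in the repository carries such a header line, since GloVe's own release files do not.

```python
import io
import numpy as np

def read_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then one word per line."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            raise ValueError("expected %d values for %r" % (dim, word))
        vectors[word] = np.array([float(x) for x in values], dtype=np.float32)
    if len(vectors) != count:
        raise ValueError("header promised %d words, found %d" % (count, len(vectors)))
    return vectors

# A two-word, three-dimensional toy file held in memory.
toy_file = io.StringIO("2 3\nred 0.1 0.2 0.3\ngreen 0.4 0.5 0.6\n")
toy_vectors = read_word2vec_text(toy_file)
print(toy_vectors["green"])
```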


Now, try the above again. Can you find any cool word combinations? What differences are there in the datasets?

Here are some ideas to try; substitute your own words into these.

In [None]:
glove.most_similar('meat', topn=5)

[('bread', 0.9616427421569824),
 ('corn', 0.9524653553962708),
 ('egg', 0.9472206234931946),
 ('fish', 0.9398375153541565),
 ('soup', 0.9275275468826294)]

In [None]:
glove.most_similar(positive=['biggest', 'small'], negative=['big'], topn=5)

[('average', 0.8820492625236511),
 ('human', 0.8792450428009033),
 ('persons', 0.877970814704895),
 ('smallest', 0.8638321757316589),
 ('potential', 0.8624012470245361)]

In [None]:
glove.most_similar(positive=['woman', 'king'], negative=['man'])

[('meets', 0.8841923475265503),
 ('prince', 0.832163393497467),
 ('queen', 0.8257461190223694),
 ('’s', 0.8174097537994385),
 ('crow', 0.8134994506835938),
 ('hunter', 0.8131038546562195),
 ('father', 0.811583399772644),
 ('soldier', 0.8111359477043152),
 ('mercy', 0.8082392811775208),
 ('hero', 0.8082262873649597)]

In [None]:
glove.similarity('car', 'bike')

0.77646494

In [None]:
glove.similarity('car', 'purple')

0.6448954

In [None]:
glove.similarity('red', 'purple')

0.86647636

In [None]:
glove.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'

In [None]:
glove.doesnt_match("red green horse blue".split())

'horse'
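One plausible way to pick the odd one out, and as far as I know close to what `doesnt_match` does, is to average the unit vectors of all the words and return the word least similar to that average. Here is a sketch with invented two-dimensional vectors; `odd_one_out` is a hypothetical helper.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = np.vstack([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = units.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    similarities = units @ mean  # cosine similarity of each word to the mean
    return words[int(np.argmin(similarities))]

# Made-up toy vectors: the colours point one way, the animal another.
toy = {
    "red": np.array([1.0, 0.1]),
    "green": np.array([0.9, 0.2]),
    "blue": np.array([1.0, 0.0]),
    "horse": np.array([0.1, 1.0]),
}
odd = odd_one_out(["red", "green", "horse", "blue"], toy)
print(odd)  # horse
```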

What about ambiguous words? Can you think of any and try them? Past suggestions have been cancer, bank and play. Can you find any others, and explain what is going on? How does the embedding deal with ambiguity? What factors influence this?

In [None]:
glove.most_similar('word')

[('kind', 0.9247308373451233),
 ('even', 0.8885588645935059),
 ('is', 0.87043297290802),
 ('over', 0.8653795123100281),
 ('blind', 0.8579867482185364),
 ('exactly', 0.8571450710296631),
 ('its', 0.8565132021903992),
 ('had', 0.8532353639602661),
 ('true', 0.852778434753418),
 ('exact', 0.8523545265197754)]
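One way to think about the ambiguity question: these embeddings store a single vector per word form, so every sense of a word has to share it. A common simplification, and only an assumption here, is to picture that vector as a blend of the senses, weighted by how often each occurs in the training text. The sketch below uses invented two-dimensional "sense" directions for a word like bank.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sense directions, for illustration only.
river = np.array([1.0, 0.0])
money = np.array([0.0, 1.0])

def blended(weight_money):
    """Single vector for 'bank' if a share `weight_money` of its uses are financial."""
    return weight_money * money + (1.0 - weight_money) * river

mostly_money = blended(0.9)   # e.g. a corpus dominated by finance talk
mostly_river = blended(0.2)   # e.g. a corpus dominated by geography

print(cosine(mostly_money, money), cosine(mostly_money, river))
print(cosine(mostly_river, money), cosine(mostly_river, river))
```

Under this picture, the neighbours you see for an ambiguous word depend heavily on which sense dominates the training data, which is one reason results can differ between Brown, movie reviews and Twitter.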

What do these embeddings look like? We will display embeddings for four words: two colour adjectives, and two action verbs. Each column is the embedding for one word, printed to two decimal places using Python string formatting. Can you spot any similarities and differences?

In [None]:
print("   red      green             walk    run\n")
for i in range(len(glove['red'])):
  print("%8.2f%8.2f          %8.2f%8.2f" % (glove['red'][i], glove['green'][i], glove['walk'][i], glove['run'][i]))

   red      green             walk    run

   -0.27   -0.68             -1.41   -0.03
   -0.73   -0.91              0.28    0.49
    0.55    0.21              0.52    0.17
   -0.30   -0.12              0.23   -0.21
    0.29   -0.22             -0.85   -0.32
    0.80    0.70              0.55    0.46
    0.63    0.75              1.26    1.15
    0.64   -0.25             -0.66   -0.32
   -0.11    0.66              0.53    0.32
   -0.32   -0.19              0.44    0.13
    1.02    0.95             -0.60    0.01
   -0.62   -0.33              0.21    0.17
   -4.04   -4.08             -4.41   -4.05
   -0.31   -0.73              0.28    0.27
   -0.36   -0.11             -0.20    0.35
   -0.30   -0.97              0.63   -0.03
    0.48    0.49              0.18    0.00
   -0.32   -0.28             -0.67   -0.82
    0.39    0.39             -0.10   -0.24
    0.82    0.99             -1.07   -0.69
   -0.84   -0.83              0.02   -0.30
    0.04   -0.05              0.17    0.30
    0.73   

How do we use these embeddings in NLP? The usual way is to replace each occurrence of a word with its embedding, which then stands in for the word. The example below displays what we would pass to our algorithm for a sentence. We show one line for each word, with each value formatted to two decimal places again. The word is displayed at the start of the line for convenience only - this would not be passed to our algorithm.

In [None]:
sentence = ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]
# look up the 25-dimensional vector for each word
embeddings = [glove[word] for word in sentence]

# one line per word: the word (for our convenience only), then its vector
for word, vector in zip(sentence, embeddings):
  print(word.ljust(10), ''.join("{:6.2f}".format(x) for x in vector))
  

the         -0.01  0.02  0.21  0.17 -0.44 -0.15  1.84 -0.16  0.18 -0.32  0.07  0.52 -6.34  0.48  0.14 -0.49  0.39 -0.00 -0.10  0.21 -0.86  0.17  0.19 -0.84 -0.31
quick       -0.06  1.14 -0.49  0.09  0.57  0.64  0.96 -0.88  0.29 -0.31 -0.26  0.72 -3.54 -0.52  0.40  0.13  0.45 -1.03 -0.58 -0.79 -0.30 -0.47  0.98  0.29 -0.36
brown       -0.55 -0.93  0.71  0.31 -0.15  0.39  0.46  0.08  0.33 -1.07  0.72  0.14 -3.99 -1.25 -0.13 -0.57 -0.32 -0.36 -0.41  0.78  0.21  0.93  0.37 -0.44 -0.09
fox          0.32 -0.05  0.90 -0.50  0.14 -0.48  0.40  0.51  0.32 -0.81  0.42 -0.49 -3.01 -0.40  0.57 -0.45 -0.60  0.28 -0.07  0.20  0.22  0.10 -0.15 -1.32  0.11
jumped      -1.01  0.55  1.84 -0.24 -0.56 -0.08  0.32 -1.00 -0.37  0.67  0.45  1.24 -2.95  0.23  0.16  0.76  0.95 -0.04  0.09 -0.87  0.85  0.56  2.15 -0.91 -0.27
over         0.55  0.34 -0.05  0.03 -0.59  0.23  0.10 -0.60 -1.03  0.67  0.19  1.27 -5.16  0.18  0.38  0.74  0.38  0.52 -0.79 -0.45 -0.78  0.23  0.19 -0.47  0.39
the         -0.01  0.02  0.2