# Part 2: Word Embeddings

1.   We do not need to train the embeddings ourselves.
2.   We can simply load a model that someone else has already trained.


In [1]:
# Execute this code block to install dependencies when running on colab
try:
    import torch
except:
    from os.path import exists
    from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
    platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
    cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
    accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

    !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision

try: 
    import torchbearer
except:
    !pip install torchbearer
    
try:
    import torchtext
except:
    !pip install torchtext
    
try:
    import spacy
except:
    !pip install spacy

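try:
    # spaCy v3 deprecated the 'en' shortcut; use the full pipeline package name.
    spacy.load('en_core_web_sm')
except:
    !python -m spacy download en_core_web_sm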


Word embeddings transform a one-hot encoded vector (a vector that is 0 in all elements except one, which is 1) into a much lower-dimensional vector of real numbers. The one-hot encoded vector is a *sparse vector*, whilst the real-valued vector is a *dense vector*.

The key concept in these word embeddings is that words that appear in similar _contexts_ appear nearby in the vector space, i.e. the Euclidean distance between these two word vectors is small. By context here, we mean the surrounding words. For example in the sentences "I purchased some items at the shop" and "I purchased some items at the store" the words 'shop' and 'store' appear in the same context and thus should be close together in vector space.
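
As a rough illustration of what "close together in vector space" means, the sketch below measures the Euclidean distance between three made-up 3-dimensional vectors. The vector values and variable names are assumptions chosen purely for illustration; they are not real GloVe values.

```python
import torch

# Hypothetical toy vectors (real GloVe vectors have 50 to 300 dimensions).
shop = torch.tensor([0.9, 0.1, 0.3])
store = torch.tensor([0.8, 0.2, 0.3])
river = torch.tensor([-0.5, 0.9, -0.7])

# torch.dist computes the p-norm of the difference; the default p=2 is the Euclidean distance.
d_shop_store = torch.dist(shop, store).item()
d_shop_river = torch.dist(shop, river).item()

# The same distance written out by hand: square root of the summed squared differences.
manual = torch.sqrt(((shop - store) ** 2).sum()).item()

print(f'shop-store: {d_shop_store:.4f}, shop-river: {d_shop_river:.4f}')
```

With these toy values, 'shop' ends up much nearer to 'store' than to 'river', which is the behaviour we hope real embeddings show for words used in similar contexts.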

We'll talk about some of the well-known algorithms for learning embeddings in the lectures, but you might have already heard of a popular model called *word2vec*, which was first published in a rejected ICLR submission (it has some pretty damning reviews, but also has thousands of citations!). In this lab we'll use pre-trained *GloVe* vectors. *GloVe* is a different algorithm for computing word vectors, although the outcome is similar to *word2vec*. These pre-trained embeddings have been trained on a gigantic corpus. We can use these pre-trained vectors within any of our models, with the idea that as they have already learned the context of each word they will give us a better starting point for our word vectors. This usually leads to faster training time and/or improved accuracy.

In PyTorch, we use word vectors with the `nn.Embedding` layer, which takes a _**[sentence length, batch size]**_ tensor and transforms it into a _**[sentence length, batch size, embedding dimensions]**_ tensor. `nn.Embedding` layers can be trained from scratch, or they can be initialised (and optionally fixed) with pre-trained embedding data. The key thing to remember about an `nn.Embedding` is that it does not need to explicitly use a one-hot vector representation at any point; it just maps an index to a vector. This is important because it implies massive computational savings. More concretely, an embedding is essentially a linear map in which the weight matrix of the linear layer is multiplied by a one-hot sparse vector to produce a lower-dimensional (dense) output. This is exactly equivalent to just selecting the column of the weight matrix corresponding to the index represented by the sparse vector (in `nn.Embedding`, whose weight is stored as _[vocabulary size, embedding dimensions]_, this is a row of the weight).
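
The equivalence between an index lookup and a one-hot matrix product can be checked directly. The following is a minimal sketch with toy sizes (a vocabulary of 10 and 4 embedding dimensions are assumptions made for illustration); it also shows `nn.Embedding.from_pretrained`, which is one way to initialise and fix a layer from an existing tensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, emb_dim = 10, 4  # toy sizes, chosen only for illustration
embedding = nn.Embedding(vocab_size, emb_dim)

# A [sentence length, batch size] tensor of word indices.
indices = torch.tensor([[1, 2], [3, 4], [5, 6]])

# Direct lookup: [sentence length, batch size, embedding dimensions].
looked_up = embedding(indices)

# The same result via explicit one-hot vectors and a matrix product.
one_hot = F.one_hot(indices, num_classes=vocab_size).float()  # [3, 2, 10]
via_matmul = one_hot @ embedding.weight                       # [3, 2, 4]

print(looked_up.shape)
print(torch.allclose(looked_up, via_matmul))

# Initialising from an existing tensor and freezing it (no gradient updates).
pretrained = torch.randn(vocab_size, emb_dim)  # stand-in for e.g. glove.vectors
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
row_three = frozen(torch.tensor(3))
```

The lookup never builds the `[3, 2, 10]` one-hot tensor, which is where the savings come from when the vocabulary has hundreds of thousands of entries.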

In this part of the lab we won't be training any models; instead we'll be looking at the word embeddings and investigating a few interesting things we can do with them.

## Loading the GloVe vectors

First, we'll load the pre-trained GloVe vectors. The `name` argument specifies what the vectors have been trained on; here `6B` means a corpus of 6 billion words (tokens). The `dim` argument specifies the dimensionality of the word vectors. GloVe vectors are available in 50, 100, 200 and 300 dimensions. There are also `42B` and `840B` GloVe vectors, however they are only available at 300 dimensions. The first time you run this it will take time as the vectors need to be downloaded:

In [2]:
import torchtext.vocab

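# This only loads vectors that have already been trained; the loader lives in torchtext's vocab module.
glove = torchtext.vocab.GloVe(name='6B', dim=100)
# `dim` chooses the embedding (hidden) size, i.e. the mapping V -> hidden (sizes such as 100 / 256 / 300 / 512 are typical in general).
print(f'There are {len(glove.itos)} words in the vocabulary')
# The length of `itos` is the size of the dictionary (vocabulary).
# As shown below, 400000 words are available to us.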

.vector_cache/glove.6B.zip: 862MB [02:39, 5.40MB/s]                           
100%|█████████▉| 399999/400000 [00:19<00:00, 20515.34it/s]


There are 400000 words in the vocabulary


As shown above, there are 400,000 unique words in the GloVe vocabulary. These are the most common words found in the corpus the vectors were trained on. **In this set of GloVe vectors, every single word is lower-case only.**

`glove.vectors` is the actual tensor containing the values of the embeddings.

In [3]:
glove.vectors.shape # The loaded tensor is 400000 x 100.

torch.Size([400000, 100])

We can see what word is associated with each row by checking the `itos` (int to string) list. 

The output below shows that row 0 is the vector associated with the word 'the', row 1 with ',' (comma), row 2 with '.' (period), etc.

In [6]:
glove.itos[:100] # Print the first 100 entries. GloVe works like a dictionary: it holds an embedding for each word.

['the',
 ',',
 '.',
 'of',
 'to',
 'and',
 'in',
 'a',
 '"',
 "'s",
 'for',
 '-',
 'that',
 'on',
 'is',
 'was',
 'said',
 'with',
 'he',
 'as',
 'it',
 'by',
 'at',
 '(',
 ')',
 'from',
 'his',
 "''",
 '``',
 'an',
 'be',
 'has',
 'are',
 'have',
 'but',
 'were',
 'not',
 'this',
 'who',
 'they',
 'had',
 'i',
 'which',
 'will',
 'their',
 ':',
 'or',
 'its',
 'one',
 'after',
 'new',
 'been',
 'also',
 'we',
 'would',
 'two',
 'more',
 "'",
 'first',
 'about',
 'up',
 'when',
 'year',
 'there',
 'all',
 '--',
 'out',
 'she',
 'other',
 'people',
 "n't",
 'her',
 'percent',
 'than',
 'over',
 'into',
 'last',
 'some',
 'government',
 'time',
 '$',
 'you',
 'years',
 'if',
 'no',
 'world',
 'can',
 'three',
 'do',
 ';',
 'president',
 'only',
 'state',
 'million',
 'could',
 'us',
 'most',
 '_',
 'against',
 'u.s.']

We can also use the `stoi` (string to int) dictionary, in which we input a word and receive the associated integer/index. If you try to get the index of a word that is not in the vocabulary, you receive a `KeyError`.

In [10]:
glove.stoi['the'] # At which index is the word stored? A KeyError means the word is not in the vocabulary.
# Words are stored in order of frequency, with the most frequent first.

0
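
Because `stoi` is a plain dictionary, a missing word can be detected before indexing. The sketch below uses a tiny stand-in dictionary rather than the real `glove.stoi`, and `lookup_index` is a hypothetical helper name introduced only for this example. It also lower-cases the query, since this set of vectors contains lower-case words only.

```python
# Toy stand-in for `glove.stoi` (hypothetical contents).
stoi = {'the': 0, ',': 1, '.': 2}

def lookup_index(stoi, word):
    """Return the index of `word` (lower-cased), or None if it is not in the vocabulary."""
    return stoi.get(word.lower())

print('The' in stoi)               # the raw, capitalised form is not a key
print(lookup_index(stoi, 'The'))   # found after lower-casing
print(lookup_index(stoi, 'zebra')) # not in this toy vocabulary

# Plain indexing raises KeyError for a missing word.
try:
    stoi['zebra']
    raised = False
except KeyError:
    raised = True
```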

We can get the vector of a word by first getting the integer associated with it and then indexing into the word embedding tensor with that index.

In [14]:
# Our goal is to fetch a word's vector from its index. The embedding tensor is 400000 x 100,
# and the word 'the' has been assigned index 0, so its embedding is row 0 of that tensor.
# A computer does not understand words, so each word is represented by numbers;
# this is what makes arithmetic such as man - woman possible.
print(glove.stoi['the'])
glove.vectors[glove.stoi['the']]

0


tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459,  0.8278,  0.27

We'll be doing this a lot. __Use the following block to create a function that takes in word embeddings and a word and returns the associated vector.__ You should throw an error if the word doesn't exist in the vocabulary:

In [18]:
def get_vector(embeddings, word):
    # YOUR CODE HERE
    # Look up the word's index, then return the matching row of the embedding tensor.
    # `stoi` is a dict, so a word outside the vocabulary raises a KeyError.
    return embeddings.vectors[embeddings.stoi[word]]

As before, we use a word to get the associated vector.

In [19]:
get_vector(glove, 'the')

tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459,  0.8278,  0.27

## Similar Contexts

Now to start looking at the context of different words. 

If we want to find the words similar to a certain input word, we first find the vector of this input word, then we scan through our vocabulary finding any vectors similar to this input word vector.

The function below returns the closest 10 words to an input word vector:

In [21]:
import torch

# Compute the distance between the given vector and every word in the vocabulary,
# then return the n words with the smallest distance.
# The distance used is the Euclidean distance (torch.dist with its default p=2).
def closest_words(embeddings, vector, n=10):
    distances = [(w, torch.dist(vector, get_vector(embeddings, w)).item()) for w in embeddings.itos]
    return sorted(distances, key = lambda w: w[1])[:n]
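
The list comprehension above calls `torch.dist` once per vocabulary entry, which is slow for 400,000 words. A vectorised variant is sketched below; `closest_words_fast` is a hypothetical name, and the toy `SimpleNamespace` object only mimics the `itos`, `stoi` and `vectors` attributes used in this notebook. It should give the same ranking as `closest_words`, up to floating-point rounding.

```python
from types import SimpleNamespace

import torch

def closest_words_fast(embeddings, vector, n=10):
    # Euclidean distance from `vector` to every row at once: shape [vocab_size].
    distances = (embeddings.vectors - vector).norm(dim=1)
    # The n smallest distances, returned in ascending order.
    values, indices = torch.topk(distances, k=n, largest=False)
    return [(embeddings.itos[i], d) for i, d in zip(indices.tolist(), values.tolist())]

# Toy stand-in for the GloVe object (made-up 2-D vectors).
toy = SimpleNamespace(
    itos=['shop', 'store', 'river'],
    vectors=torch.tensor([[1.0, 0.0], [0.9, 0.1], [-1.0, 2.0]]),
)
toy.stoi = {w: i for i, w in enumerate(toy.itos)}

result = closest_words_fast(toy, toy.vectors[toy.stoi['shop']], n=2)
print(result)
```

With the real vectors the call would look like `closest_words_fast(glove, get_vector(glove, 'korea'))`.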

Let's try it out with 'korea'. The closest word is the word 'korea' itself (not very interesting); however, all of the words are related in some way. Pyongyang is the capital of North Korea, DPRK is the official name of North Korea, etc.

Interestingly, we also get 'Japan' and 'China', which implies that Korea, Japan and China are frequently talked about together in similar contexts. This makes sense as they are geographically situated near each other.

In [22]:
closest_words(glove, get_vector(glove, 'korea')) # The 10 words closest to 'korea' in GloVe are shown below.

[('korea', 0.0),
 ('pyongyang', 3.9039547443389893),
 ('korean', 4.068886756896973),
 ('dprk', 4.2631049156188965),
 ('seoul', 4.340494632720947),
 ('japan', 4.551243305206299),
 ('koreans', 4.615607738494873),
 ('south', 4.65822696685791),
 ('china', 4.8395185470581055),
 ('north', 4.986356735229492)]

Looking at another country, India, we also get nearby countries: Thailand, Malaysia and Sri Lanka (as two separate words). Australia is relatively close to India (geographically), but Thailand and Malaysia are closer. So why is Australia closer to India in vector space? A plausible explanation is that India and Australia appear together in the context of [cricket](https://en.wikipedia.org/wiki/Cricket) matches.

In [24]:
closest_words(glove, get_vector(glove, 'india')) # The number after each word is its distance; the closer the word, the smaller the value.

[('india', 0.0),
 ('pakistan', 3.695482015609741),
 ('indian', 4.114313125610352),
 ('delhi', 4.155976295471191),
 ('bangladesh', 4.261017799377441),
 ('lanka', 4.43584680557251),
 ('sri', 4.515717506408691),
 ('australia', 4.806082248687744),
 ('thailand', 4.994781494140625),
 ('malaysia', 5.009334087371826)]

We'll also create another function that nicely prints the tuples returned by our `closest_words` function.

In [26]:
def print_tuples(tuples):
    for w, d in tuples:
        print(f'({d:02.04f}) {w}') # Pretty-print each (word, distance) tuple.

Using the `print_tuples` function, use the code block below to print out the 10 neighbours of 'jaguar':

In [32]:
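# YOUR CODE HERE
# Here the word 'ai' is queried instead of 'jaguar'. We wanted neighbours related to 'ai',
# but, as shown below, unrelated words can come out. This is because the data GloVe was
# trained on is general-purpose and includes conversational text.
print_tuples(closest_words(glove, get_vector(glove, 'ai')))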

(0.0000) ai
(4.5332) hey
(4.5842) ok
(4.6785) fukuhara
(4.8145) fortunately
(4.8299) 'cause
(4.8935) yeah
(4.9061) hi
(4.9083) luckily
(4.9333) …


__Use the following block to explain the results.__ (hint: use Google if you don't know what any of the terms are!)

YOUR ANSWER HERE

## Analogies

Another property of word embeddings is that we can apply standard arithmetic operations. This can give interesting results.

We'll show an example of this first, and then explain it:

In [33]:
def analogy(embeddings, word1, word2, word3, n=5):
    # Words are represented as vectors, so addition and subtraction should be meaningful.
    # E.g. king - man + woman should land near queen. The resulting vector will almost never
    # exactly equal any of the 400000 stored vectors, so we pick the words whose vectors
    # have the smallest distance to it.
    target = get_vector(embeddings, word2) - get_vector(embeddings, word1) + get_vector(embeddings, word3)

    # Fetch n+3 candidates so that n remain after the three input words are filtered out.
    candidate_words = closest_words(embeddings, target, n + 3)
    candidate_words = [x for x in candidate_words if x[0] not in [word1, word2, word3]][:n]

    print(f'{word1} is to {word2} as {word3} is to...')

    return candidate_words

In [39]:
print_tuples(analogy(glove, 'man', 'king', 'woman', n = 10))
print_tuples(analogy(glove, 'seoul', 'korea', 'india', n = 10))
# It does fairly well, but there are limits to getting exactly the answer we want.

man is to king as woman is to...
(4.0811) queen
(4.6429) monarch
(4.9055) throne
(4.9216) elizabeth
(4.9811) prince
(4.9857) daughter
(5.0641) mother
(5.0775) cousin
(5.0787) princess
(5.1283) widow
seoul is to korea as india is to...
(5.8343) pakistan
(6.2924) lanka
(6.5571) australia
(6.5643) bangladesh
(6.5883) africa
(6.6894) sri
(6.7463) indonesia
(6.7763) indian
(6.9396) japan
(6.9865) zealand


This is the canonical example which shows off this property of word embeddings. So why does it work? Why does the vector of 'woman' added to the vector of 'king' minus the vector of 'man' give us 'queen'?

If we think about it, the vector calculated from 'king' minus 'man' gives us a "royalty vector". This is the vector associated with traveling from a man to his royal counterpart, a king. If we add this "royalty vector" to 'woman', this should travel to her royal equivalent, which is a queen!
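
The "royalty vector" idea can be made concrete with hand-made 2-dimensional vectors in which the first coordinate stands for gender and the second for royalty. These values are assumptions invented for illustration; real embeddings do not have such cleanly interpretable axes, so the real result is only approximately 'queen'.

```python
import torch

# Hand-made toy vectors: [gender, royalty]. Purely illustrative, not GloVe values.
toy = {
    'man':   torch.tensor([1.0, 0.0]),
    'woman': torch.tensor([-1.0, 0.0]),
    'king':  torch.tensor([1.0, 1.0]),
    'queen': torch.tensor([-1.0, 1.0]),
}

# The direction that takes 'man' to 'king'.
royalty = toy['king'] - toy['man']

# Apply the same direction to 'woman' and find the nearest toy word.
result = toy['woman'] + royalty
nearest = min(toy, key=lambda w: torch.dist(result, toy[w]).item())

print(royalty, nearest)
```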

We can do this with other analogies too. For example, this gets an "acting career vector":

In [40]:
print_tuples(analogy(glove, 'man', 'actor', 'woman'))

man is to actor as woman is to...
(2.8133) actress
(5.0039) comedian
(5.1399) actresses
(5.2773) starred
(5.3085) screenwriter


__Use the following block to compute a 'capital city vector' that predicts the capital of England based on the capital and name of another country__:

In [45]:
# YOUR CODE HERE
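# Note: this call maps a capital to its country (seoul -> korea, so london -> ?),
# which is the reverse of the direction asked for above.
print_tuples(analogy(glove, 'seoul', 'korea', 'london')) # This came out around 2016-17.
# The embeddings are not trained separately here; they are trained underneath an LSTM or another model.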

seoul is to korea as london is to...
(4.0185) britain
(4.7469) england
(4.7797) europe
(4.8555) united
(4.9640) australia


__Use the following block to compute a 'musical genre vector' that predicts the genre of music played by Eminem based on another musician/band and their genre__:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()