# Word embeddings modelling

In [1]:
import nltk

## Language models

Language models assign **probability values** to sequences of words. In essence, they are trying to 
“fill in the blank” based on context. 

Given a sentence “The hand-held gaming device is powered by small solar /---/ ", a language model may complete this sentence by saying that the word **"panels"** would fill the gap 80% of the time and the word **"batteries"** 20% of the time.



Two types:

- **statistical** (N-grams, Hidden Markov Models, linguistic rules)
- **neural** (neural networks)

## Word vectors/embeddings

When we want to process natural language and mine it for useful information using machine learning techniques we have to map textual data to some numerical representation (this process is called **vectorisation** as we create vectors of numeric values). Word vectors are often referred to as **"word embeddings"**. 

Do you remember what a **vector** is?

- Geometry: **an object with magnitude and direction**

- Computer science: **one-dimensional array** (e.g. $$[.44, .26, .07, -.89, -.15].$$ )

![Picture title](image-20201006-124756.png)
Image source and refresher on vectors: https://www.mathsisfun.com/algebra/vectors-cross-product.html

## Data representation

How do we vectorise words? There are quite a few ways to represent text data as numbers, depending on what information we want them to contain.

Types of data representation
- one-hot (presence/absence)
- frequency-based (occurrence frequency)
- distributed (read on!)

### One-hot encoding

**One-hot encoding** (“1-of-N” encoding) is the simplest method. Each resulting embedding (vector) consists of a single 1 and zeros everywhere else.
This encoding method marks a particular vector **index** with a value of true (1) if the token occurs in a document and false (0) if it does not. 
In other words, each element of a one-hot encoded vector reflects either the presence or absence of the token in the analysed text.




In [2]:
import numpy as np
sentence1 = "We value talking to a human being at the other end of a conversation".lower().split()
sentence2 = "Trump is being given a steroid that is usually used for severe cases of covid-19".lower().split()

vocab = set(sentence1+sentence2)
vocab = sorted(vocab)
print ("vocabulary (two sentences combined): ", vocab)

# encode each word in the sentence by its index (position) in the vocabulary
integer_encoded = []
for i in sentence1:
    print (np.array(vocab)==i)
    v = np.where( np.array(vocab) == i)[0][0]
    print ('v: ', v)
    integer_encoded.append(v)
print ("sentence 1 encoded: ",integer_encoded)

integer_encoded = []
for i in sentence2:
    v = np.where( np.array(vocab) == i)[0][0]
    integer_encoded.append(v)
print ("sentence 2 encoded: ",integer_encoded)

def get_vec(len_vocab,word):
    # one-hot vector: all zeros except a 1 at the word's index in the (global) vocab
    empty_vector = [0] * len_vocab
    find = np.where(np.array(vocab) == word)[0][0]
    empty_vector[find] = 1
    return empty_vector

def get_matrix(vocab, sentence):
    mat = []
    len_vocab = len(vocab)
    for i in sentence:
        vec = get_vec(len_vocab,i)
        mat.append(vec)
        
    return np.asarray(mat)

print ("MATRIX Sentence 1 :")
print (get_matrix(vocab, sentence1))   
print ("MATRIX Sentence 2 :")
print (get_matrix(vocab, sentence2))   

vocabulary (two sentences combined):  ['a', 'at', 'being', 'cases', 'conversation', 'covid-19', 'end', 'for', 'given', 'human', 'is', 'of', 'other', 'severe', 'steroid', 'talking', 'that', 'the', 'to', 'trump', 'used', 'usually', 'value', 'we']
[False False False False False False False False False False False False
 False False False False False False False False False False False  True]
v:  23
[False False False False False False False False False False False False
 False False False False False False False False False False  True False]
v:  22
[False False False False False False False False False False False False
 False False False  True False False False False False False False False]
v:  15
[False False False False False False False False False False False False
 False False False False False False  True False False False False False]
v:  18
[ True False False False False False False False False False False False
 False False False False False False False False False False False

**CODEIT** Write a code snippet to extract the one-hot matrix representation of the following three sentences:

1. NLP is now the most popular subfield of machine learning.
2. My washing machine is not working properly now.
3. Analysis of language using artificial intelligence methods have risen dramatically.



In [3]:
##Insert your code here
sentence_1 = "NLP is now the most popular subfield of machine learning .".lower().split()
sentence_2 = "My washing machine is not working properly now .".lower().split()
sentence_3 = "Analysis of language using artificial intelligence methods have risen dramatically .".lower().split()

import numpy as np

vocab = set(sentence_1+sentence_2+sentence_3)
vocab = sorted(vocab)
print ("vocabulary (three sentences combined): ", vocab)
#Insert your code here

vocabulary (three sentences combined):  ['.', 'analysis', 'artificial', 'dramatically', 'have', 'intelligence', 'is', 'language', 'learning', 'machine', 'methods', 'most', 'my', 'nlp', 'not', 'now', 'of', 'popular', 'properly', 'risen', 'subfield', 'the', 'using', 'washing', 'working']


**OBSERVE AND REFLECT:** Using the examples above, explain why the one-hot vector representation is **not** the best method for analysing semantic similarity.



### Write your answer here ###

---

What are the **main problems with this one-hot representation**?


- **sparsity and size**: the representation size grows with the vocabulary. Imagine a corpus with a 300,000-word vocabulary: each word vector will have 300,000 dimensions, all but one of them zero (computationally expensive!).
- **each vector is equally distant from every other vector** (the vectors do not reflect the words' positions in relation to each other)
- **no contextual/semantic information** is embedded, so the vectors are not suitable for NLP tasks like POS tagging, named-entity recognition etc.
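
The second problem is easy to check numerically. The sketch below uses a made-up four-word vocabulary (`toy_vocab` is only an illustration, not part of the exercises) and shows that every pair of different one-hot vectors is exactly the same distance apart:

```python
import numpy as np

# Toy vocabulary (hypothetical, for illustration only)
toy_vocab = ["cat", "dog", "king", "queen"]
one_hot = np.eye(len(toy_vocab))  # row i is the one-hot vector of toy_vocab[i]

# Euclidean distance between every pair of distinct words
for i in range(len(toy_vocab)):
    for j in range(i + 1, len(toy_vocab)):
        dist = np.linalg.norm(one_hot[i] - one_hot[j])
        print(toy_vocab[i], toy_vocab[j], round(float(dist), 3))

# Dot products between distinct words are all 0, so cosine similarity is 0 too
gram = one_hot @ one_hot.T
print(gram)
```

Every distance comes out as the square root of 2, so "cat" is no closer to "dog" than it is to "king".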




### Distributed representation

An alternative is called **distributed representation**. 

Please read the following article UNTIL (and including) Figure 3 (up until "While this shape example is oversimplified, it serves as a great high-level, abstract introduction to distributed representations")
to get familiar with this concept: https://www.oreilly.com/content/how-neural-networks-learn-distributed-representations/



## Training a simple neural language model

1. represent words with **one-hot vectors**
2. encode input words (create **word embeddings**):
- take the one-hot vector representing the input word
- multiply it by a matrix of size (N,200), where N is the vocabulary size and 200 is the vector size (number of dimensions), which is chosen **arbitrarily**.
This multiplication results in a vector of size 200 (the word embedding). 

<img  src="http://mccormickml.com/assets/word2vec/matrix_mult_w_one_hot.png"/>

3. Now we have a representation of the input word. 
We multiply it by a matrix of size (200,N) (the **output embedding**).  
As a result, we get a vector of size N, which we then pass through the **softmax function**.
Softmax normalises the values of the vector into a probability distribution (each value is between 0 and 1, and they sum to 1). 
This decoding step takes a word representation and returns a distribution which represents the model’s predictions of the next word. 
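
The three steps above can be sketched in a few lines of NumPy. This is only a minimal illustration with randomly initialised (untrained) matrices. The vocabulary size `N = 6` and the names `W_in`, `W_out` and `softmax` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 200                      # vocabulary size (made up) and embedding size

W_in = rng.normal(scale=0.1, size=(N, D))    # input embedding matrix, shape (N, 200)
W_out = rng.normal(scale=0.1, size=(D, N))   # output embedding matrix, shape (200, N)

def softmax(z):
    z = z - z.max()                # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.zeros(N)
x[2] = 1.0                         # step 1: one-hot vector for the word at index 2

embedding = x @ W_in               # step 2: shape (200,), identical to row 2 of W_in
scores = embedding @ W_out         # step 3: shape (N,)
probs = softmax(scores)            # predicted distribution over the next word

print(embedding.shape, probs.shape, probs.sum())
```

Note that multiplying a one-hot vector by the matrix simply selects one row of it, which is why the rows of the input matrix can be read as the word embeddings.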



<center><img src = "http://mccormickml.com/assets/word2vec/output_weights_function.png"></center>





**QUICK Softmax refresher**: https://victorzhou.com/blog/softmax/

**Data needed for training**: pairs of input and target output words

**Data generation**: take every pair of neighboring words from the text and use the first one as the input word and the second one as the target output word. 
Example: “The cat is on the mat”.

**Word pairs for training**: (The, cat), (cat, is), (is, on), (on, the), (the, mat).
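
As a quick sketch, the pairs listed above can be generated by zipping the token list with a copy of itself shifted by one position (simple whitespace tokenisation is assumed here):

```python
text = "The cat is on the mat"
tokens = text.split()

# Pair every word with its right-hand neighbour: (input word, target word)
pairs = list(zip(tokens[:-1], tokens[1:]))
print(pairs)
# [('The', 'cat'), ('cat', 'is'), ('is', 'on'), ('on', 'the'), ('the', 'mat')]
```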



**Training process**: the model is updated with gradient descent. A loss function measures the distance between the output distribution predicted by the model and the target distribution for each pair of training words. 
The target distribution for each pair is a one-hot vector representing the target word.
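
The text above does not name a specific loss. A common choice for comparing a softmax output with a one-hot target is cross-entropy, which is what this sketch assumes. The numbers in `predicted` are invented for illustration.

```python
import numpy as np

# Hypothetical model output over a 4-word vocabulary (already passed through softmax)
predicted = np.array([0.1, 0.7, 0.15, 0.05])
target = np.array([0.0, 1.0, 0.0, 0.0])      # one-hot vector of the target word

# Cross-entropy: -sum(target * log(predicted)). With a one-hot target this is
# simply -log of the probability the model assigned to the correct word.
loss = -np.sum(target * np.log(predicted))

# For softmax followed by cross-entropy, the gradient of the loss with respect
# to the pre-softmax scores is (predicted - target)
grad_scores = predicted - target

print(loss, grad_scores)
```

Gradient descent then nudges the weights in the direction that lowers this loss, so the probability of the correct word goes up.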



Please check this page for more info on the algorithm architecture: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

**Model performance evaluation**: Let's talk about PERPLEXITY again :) 

<img  src="https://miro.medium.com/max/616/1*vV0XMYe69LPMlH3fFouDtw.png"/>
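
As a reminder of how the measure is computed, here is a minimal sketch. The probabilities in `word_probs` are invented and stand for the probability a model gave to each word that actually came next in a test text.

```python
import math

# Hypothetical probabilities the model assigned to each actual next word
word_probs = [0.2, 0.5, 0.1, 0.4]

# Perplexity = exp of the average negative log-probability per word
avg_neg_log_prob = -sum(math.log(p) for p in word_probs) / len(word_probs)
perplexity = math.exp(avg_neg_log_prob)
print(perplexity)
```

Lower is better: a model that spreads its probability uniformly over N words has a perplexity of exactly N.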

## How to improve the performance of a simple model?

Can you think what the biggest problem of this simple model is?

To predict the next word in the sentence, it only uses ONE preceding word. In real life, we consider much more context when reading and understanding a text. 
A model that could be taught to "remember" more than one preceding word would make better predictions!

**Example:** what words follow the word "eat"? 

We can answer “cookies”, “nuts” or “eucalyptus”, and the model could also assign these words a high probability of being the target. However, if we knew that the actual word sequence was “Koalas eat”, would it change our opinion about the most probable answer?


![ChessUrl](https://media.giphy.com/media/eDUHhtooZxyhi/giphy.gif)

## WORD EMBEDDINGS

Words get their embeddings from the other words they tend to appear next to. The mechanics are as follows:

1. We get a lot of text data (say, all Wikipedia articles).


2. We have a window (say, of three words) that we slide across all of that text.


3. The sliding window generates training samples for our model.
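
The steps above can be sketched as follows. The helper `window_samples` is a hypothetical name, and pairing the centre word of each window with its neighbours is one possible way to build samples (the skip-gram style), assumed here for illustration.

```python
def window_samples(tokens, window_size=3):
    """Slide a window over the tokens and pair its centre word with its neighbours."""
    half = window_size // 2
    samples = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - half)                 # clip the window at the start of the text
        hi = min(len(tokens), i + half + 1)   # and at the end
        for j in range(lo, hi):
            if j != i:
                samples.append((centre, tokens[j]))
    return samples

tokens = "the cat is on the mat".split()
samples = window_samples(tokens)
for sample in samples:
    print(sample)
```

With a window of three words, each word is paired with the word before it and the word after it, so this short sentence already yields ten training samples.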


### Word2vec

A method of creating word embeddings

http://jalammar.github.io/illustrated-word2vec/


### GloVe (Global Vectors)

Disadvantage of skip-gram models: they do not operate directly on the co-occurrence statistics.
They scan context windows across the entire corpus and fail to take advantage of the vast amount of repetition in the data.


**GloVe (Global Vectors)** is a **count-based model**. It learns word embeddings by dimensionality reduction of a **co-occurrence counts matrix**.

1. Build a co-occurrence matrix (each row = how often a word occurs with every other word within some defined context size in a large corpus).

2. Factorise this matrix (=> a lower-dimensional matrix: rows = word vectors).
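
The two steps can be illustrated on a tiny made-up corpus. Please treat this as a sketch of the "count, then factorise" idea only: it uses a plain truncated SVD as a stand-in, whereas the real GloVe algorithm fits the vectors with a weighted least-squares objective on the logarithm of the counts. The names `toy_corpus`, `toy_vocab` and `index` are invented for the example.

```python
import numpy as np

toy_corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
toy_vocab = sorted({w for sent in toy_corpus for w in sent})
index = {w: i for i, w in enumerate(toy_vocab)}

# Step 1: count how often two words occur within 1 position of each other
cooc = np.zeros((len(toy_vocab), len(toy_vocab)))
for sent in toy_corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                cooc[index[w], index[sent[j]]] += 1

# Step 2: factorise the matrix and keep only the top k dimensions
k = 2
U, S, Vt = np.linalg.svd(cooc)
word_vectors = U[:, :k] * S[:k]     # one k-dimensional row per word

for w in toy_vocab:
    print(w, word_vectors[index[w]].round(2))
```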



### Problems with word2vec and GloVe 

They create one vector for different meanings of a polysemous word (and about 40% of English words are polysemous!).

Example: any occurrence of the word "bank" (river bank or financial institution) will be mapped to the same vector.

Words exist in context and their meanings are defined by the contextual use. Would it not be beneficial to learn representations that reflect this?

### BERT (Bidirectional Encoder Representations from Transformers)

The release of the BERT model was described as marking the beginning of a new era in NLP. **Bidirectional Encoder Representations from Transformers (BERT)** is a language model that looks both to the left and to the right of a word to pre-train representations.

![BertUrl](https://media.giphy.com/media/umMYB9u0rpJyE/giphy.gif)


Key technical innovation:
- applying bidirectional training of the Transformer, a popular attention model, to language modelling
- deeper understanding of a word's context
- reads the entire sequence of words at once => learns the context of a word based on all of its surroundings 



### fastText (by Facebook Research)

- represents each word as a bag of character n-grams.
Example: "artificial" with n=3: <ar, art, rti, tif, ifi, fic, ici, cia, ial, al> (the angle brackets mark the beginning and end of the word).   
- captures the meaning of shorter words, suffixes and prefixes
- works well with rare words
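
To see where the n-grams in the example come from, here is a small sketch. The helper `char_ngrams` is a hypothetical name and not part of the fastText library, which in practice typically combines several n-gram lengths rather than a single n.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("artificial"))
# ['<ar', 'art', 'rti', 'tif', 'ifi', 'fic', 'ici', 'cia', 'ial', 'al>']
```

Because a rare or unseen word still shares many n-grams with known words, a vector can be put together for it from the vectors of its n-grams.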

## Wait, there is more!

If you want to learn about the most recent models please check out the following links:

- Word2vec: http://jalammar.github.io/illustrated-word2vec/
- Transformers: http://jalammar.github.io/illustrated-transformer/
- GPT2: http://jalammar.github.io/illustrated-gpt2/  & https://openai.com/blog/gpt-2-1-5b-release/ (the model was initially not released to the public out of fear that it would be used to spread fake news, spam, and disinformation)
- GPT3 (2020): can generate computer code, prose and poetry; has been called "amazing", "spooky", "humbling", and "more than a little terrifying". 
- GPT3 use examples: https://gpt3examples.com/#examples


# Let's see how word2vec models work

Choose a word embedding in a language of your preference and download it: http://vectors.nlpl.eu/repository/#


**Gensim** is a Python library that implements various natural language processing methods and algorithms.

In [4]:
!pip install gensim




In the following code we read the Brown corpus from the nltk library and train (build) a Word2Vec language model using the Gensim library.

**Note:** Running the following code takes a few minutes

In [15]:
import gensim
import logging
from nltk.corpus import brown 


nltk.download('brown')
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = brown.sents()
model = gensim.models.Word2Vec(sentences, min_count=1)

model.save('brown_model.bin')

[nltk_data] Downloading package brown to /home/jovyan/nltk_data...
[nltk_data]   Package brown is already up-to-date!
2021-01-15 13:12:52,896 : INFO : collecting all words and their counts
2021-01-15 13:12:52,897 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-01-15 13:12:53,640 : INFO : PROGRESS: at sentence #10000, processed 219770 words, keeping 23488 word types
2021-01-15 13:12:54,306 : INFO : PROGRESS: at sentence #20000, processed 430477 words, keeping 34367 word types
2021-01-15 13:12:55,003 : INFO : PROGRESS: at sentence #30000, processed 669056 words, keeping 42365 word types
2021-01-15 13:12:55,669 : INFO : PROGRESS: at sentence #40000, processed 888291 words, keeping 49136 word types
2021-01-15 13:12:56,197 : INFO : PROGRESS: at sentence #50000, processed 1039920 words, keeping 53024 word types
2021-01-15 13:12:56,601 : INFO : collected 56057 word types from a corpus of 1161192 raw words and 57340 sentences
2021-01-15 13:12:56,602 : INFO : Loa

Using the following code you can access vectors of words in your gensim model.

In [6]:
import numpy as np
import nltk

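# Access vectors for specific words with a keyed lookup
# (the vectors of a trained Word2Vec model live in its `wv` attribute):
vector = model.wv['year']
print(vector)
# see the shape of the vector: (100,) with gensim's default vector size,
# (300,) for the pretrained 300-dimensional models used later
print(vector.shape)
# Processing sentences is not as simple as with spaCy:
vectors = [model.wv[x] for x in "This is some text I am processing with text analysis library".split(' ')]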


[-0.13499422 -0.04724487  0.8355935  -0.51010275  1.0083615  -0.3086034
 -0.26302773  0.39535046 -0.79079735 -0.43304572 -0.07567624  0.3719021
 -0.20299797  0.48016202 -0.14941478  0.22955637 -0.23768008  1.6194006
 -0.32277167 -0.4755897   0.05924155 -2.0401626  -0.07549319 -0.2911573
 -0.4468673   0.05049513  0.64557964 -0.6613545   0.02351663 -0.01314109
 -0.9225175   0.5825291  -0.01304154  0.37491497 -0.37053376 -0.20888218
 -1.0596713   1.5670098   0.57272583  1.3895929   0.55373794 -0.2274162
 -1.5938821  -1.081989    0.6925696  -1.307452   -0.745071   -0.31454763
 -0.18321386  1.8907186   0.4405438   1.5854336  -1.405698   -0.40413663
 -0.45831028 -1.3334285  -1.1032896  -0.65287316 -0.23668368 -1.529644
  1.1334416   0.9414207   0.9909252   0.85138464 -0.32880306  1.0097126
 -1.6619712  -0.68290323  0.902218    0.84549284  0.84821767 -0.5903684
  0.369186   -0.43195915 -0.23427604  0.66739225 -1.231144    0.1372639
 -0.02913455 -0.6316493   0.94689274  1.3507402  -0.50835323 

## Using a pretrained Word Embedding model

In the above we learnt how to use the Gensim library to train a language model from text.

In this section we focus on using language models that have already been built and trained on huge amounts of data, such as the whole of Wikipedia.

Download an English word2vec model trained on Wikipedia from [this link](http://vectors.nlpl.eu/repository/20/3.zip) (596 MB file)

 <p style="color:red"> IMPORTANT NOTE: do not run the following cell if you haven't downloaded a word embedding model</p>


 If you didn't download a model you can continue with the current small model.


The following cell loads an already trained (pretrained) word2vec model using the gensim library.

In [7]:
from gensim.models import KeyedVectors

# Put the path of your downloaded language model here
address_of_your_model = "model.bin"
# Load vectors directly from the file
model = KeyedVectors.load_word2vec_format(address_of_your_model, binary=True)

2021-01-15 13:11:32,760 : INFO : loading projection weights from model.bin
2021-01-15 13:11:35,430 : INFO : loaded (261794, 300) matrix from model.bin


The word2vec class in the gensim library has a function for identifying the most similar or dissimilar words to a word in its vocabulary.

Try it by running the following code cell:

 <p style="color:red">If you didn't download and load the pretrained model, you can still run the following code. However, it is very likely that your model won't work well or won't know some words, since it has been trained on a very small corpus.</p>

In [None]:
# Download and decompress the pretrained GoogleNews vectors (about 1.5 GB)
!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
!gunzip GoogleNews-vectors-negative300.bin.gz
from gensim.models import KeyedVectors
# Put the path of your downloaded language model here
address_of_your_model = "GoogleNews-vectors-negative300.bin"
# Load vectors directly from the file
model = KeyedVectors.load_word2vec_format(address_of_your_model, binary=True)

--2021-01-15 13:16:20--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.103.46
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.103.46|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz.4’

ews-vectors-negativ  19%[==>                 ] 298.44M  74.8MB/s    eta 19s    

In [8]:
model.similar_by_word('music')


2021-01-15 13:11:35,453 : INFO : precomputing L2-norms of word weight vectors


[('musician', 0.7037041187286377),
 ('hip-hop', 0.7010228633880615),
 ('soundtrack', 0.7001615762710571),
 ('classical', 0.680194079875946),
 ('orchestral', 0.6711907386779785),
 ('melody', 0.6686183214187622),
 ('song', 0.6630804538726807),
 ('lyric', 0.6580933928489685),
 ('choral', 0.657744824886322),
 ('Music', 0.6570932865142822)]
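
The scores next to each word are cosine similarities: the cosine of the angle between two word vectors, ranging from -1 to 1. A minimal sketch of the calculation, using three small made-up vectors instead of real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # points in the same direction as a
c = np.array([-3.0, 0.0, 1.0])   # orthogonal to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0.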

**CODE IT** Using the function `similarity` from the gensim library, print the similarity measures of the following two pairs of words according to your model. 

Cat and Dog 

Cat and King

In [9]:
x = 'Cat'
y = 'Dog'
z = 'King'
print(model.similarity(x,y))


0.52323097


### Analogies:

The Gensim library provides functionality for getting analogies from word2vec models.


The `king - man + woman = queen` example is a very typical illustration of how word embeddings capture semantic dimensions.

Imagine that one dimension of a 300-dimensional embedding stores the concept of royalty in the word `King`, and another dimension stores `gender`.

What would happen if we subtracted the vector of `Man` from `King` (getting a vector which keeps the `Royalty` but removes the `masculinity` from gender) and then added `Woman` to the result? We would expect to get Queen (royalty + feminine), which is what actually happens in word embeddings trained on huge amounts of text.
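
The thought experiment can be played through with hand-made vectors. The two dimensions and all values in `toy` are invented for illustration. Real embeddings do not have such cleanly labelled dimensions.

```python
import numpy as np

# Hand-made 2-dimensional vectors: [royalty, gender] (gender: +1 masculine, -1 feminine)
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

# king - man + woman: keep the royalty, swap the gender
result = toy["king"] - toy["man"] + toy["woman"]

# Find the word whose vector lies closest to the result
closest = min(toy, key=lambda w: np.linalg.norm(toy[w] - result))
print(result, closest)
```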


In the following we see how we can use the analogy functionality in the gensim library.


<img src="https://cdn-images-1.medium.com/max/600/1*LdviucnshWgIIcQvhTTF-g.png" >

The following code performs the above vector calculation: King - Man + Woman = Queen

In [10]:
# `model` is a KeyedVectors object here, so most_similar is called on it directly
model.most_similar(positive=["king", "woman"], negative=["man"], topn=3)


[('monarch', 0.6530377864837646),
 ('queen', 0.5418059825897217),
 ('princess', 0.5147514343261719)]

**CODEIT** Using the above example, write a line of code that outputs the capital of Belgium, given the capital of France. 

or in other words:   **France** to **Paris** is **Belgium** to ...

**NOTE:** if you could not load the language model file, **write the code as you think it's correct** and get the answer using this demo: https://rare-technologies.com/word2vec-tutorial/#app



In [11]:
#insert your code here

model.most_similar(positive=["Paris", "Belgium"], negative=["France"], topn=1)



[('Brussels', 0.6305320858955383)]

**CODEIT**    Using the same code try: **Man** is to **Actor** as **Woman** is to ...




In [12]:
#insert your code here
model.most_similar(positive=["actor", "woman"], negative=["man"], topn=10)

  


[('actress', 0.7306582927703857),
 ('oscar-winner', 0.5388064980506897),
 ('co-star', 0.531090497970581),
 ('Sorvino', 0.5285298824310303),
 ('comedienne', 0.5283129215240479),
 ('Sarandon', 0.5210732221603394),
 ('Blanchett', 0.5175133943557739),
 ('screenwriter', 0.5155196189880371),
 ('Streep', 0.5138664841651917),
 ('Actress', 0.5117000341415405)]

**CODEIT**    Using the same code try: **go** is to **going** as **come** is to ...




In [13]:
model.most_similar(positive=["going", "come"], negative=["go"], topn=10)


[('COMING', 0.35924386978149414),
 ('ENJOYING', 0.35013359785079956),
 ('SPOTS', 0.34187790751457214),
 ('BROUGHT', 0.33890360593795776),
 ('SURPRISES', 0.3362162709236145),
 ('bc-onbusiness-column-bo', 0.3350003957748413),
 ('UPON', 0.3324851095676422),
 ('ANTICIPATION', 0.3311701714992523),
 ('flurry', 0.3310207724571228),
 ('bring', 0.32899945974349976)]

**OBSERVE AND REFLECT:** If you succeeded in running the above code, you can see that a word2vec model has embedded in it some knowledge about the language (tenses) and some knowledge about the world (capitals of countries), acquired by observing the contexts of words in huge amounts of text.  

Why do you think a language model might behave like the following?

` model.most_similar(positive=["doctor", "woman"], negative=["man"], topn=1) = 'nurse' `

Read about [Bias In Language Models](https://towardsdatascience.com/bias-in-natural-language-processing-nlp-a-dangerous-but-fixable-problem-7d01a12cf0f7)






We are planning a separate session on ethics of NLP - stay tuned!

The following function in the Gensim library finds the word in a list that is the most dissimilar to the others:

In [14]:
print(model.doesnt_match(["France","Germany","Britain","cheese"]))

print(model.doesnt_match(["year","book","month","day"]))

cheese
book


## Visualization


In order to visualize word embeddings in vector space, we need to use a dimensionality reduction method.

The Embedding Projector visualizes word2vec and any other uploaded word embedding model:
https://projector.tensorflow.org
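
One widely used dimensionality reduction method is PCA (principal component analysis), which is also one of the options the projector offers. The sketch below shows the idea with NumPy only. The random `vectors` array is a stand-in for real word vectors and is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 300))       # stand-in for 10 word vectors of size 300

centred = vectors - vectors.mean(axis=0)   # PCA works on zero-mean data
U, S, Vt = np.linalg.svd(centred, full_matrices=False)
coords_2d = centred @ Vt[:2].T             # project onto the top two principal components

print(coords_2d.shape)                     # one (x, y) point per word
```

Each row of `coords_2d` can then be drawn as a point on a scatter plot and labelled with its word.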

**Exercise** Load a word2vec model, look for a word you find interesting, find the 10 words closest to it, isolate them and upload a screenshot in the next cell. The following cell contains an example for the word `watergate` and its top 10 closest words.
(You should upload your own image.) 

<img src = "embedding_watergate.png">

**IF YOU FANCY** Download one of the gensim models from this repository in your preferred language and run the functions of the gensim Word2Vec model class on some samples.
http://vectors.nlpl.eu/repository/