# Introduction and Word Vectors

- Two thoughts from Prof. Christopher Manning about language:
    - Language is a highly evolved system of communication, yet it is very **uncertain**.
        - Even so, humans share enough agreed meaning to communicate remarkably well.
        - We are internally and subconsciously doing some kind of probabilistic inference to determine meaning, not just for conveying information but also for social functions.
    - For artificial intelligence to reach a very sophisticated level, it needs to capture human knowledge, which is predominantly conveyed through human language.
        - Human language is our networking language, through which we collectively form a huge network of individuals.
        - Language made human beings nearly invincible: it let them work collectively as a group or team. That is how they not only survived in a world of more powerful animals but thrived.
        - The invention of writing allowed this knowledge to be shared across space and across time, not just verbally.
        - Writing is a very recent phenomenon (~5000 years old) on the scale of evolution, but it made humans far more powerful.
        - We compress knowledge efficiently and convey a view of the world in very few bits of information. For example, "I went to the zoo and saw an elephant" lets the reader construct a whole visual scene that would take a few megabytes to store as images on a computer, yet it was communicated in very few words.

### How do we represent the meaning of the words?
- Linguists use something called denotational semantics to think about meaning.
    - Linguists think of meaning as what things represent.
    - $$\text{signifier (symbol)} \leftrightarrow \text{signified (idea or thing)}$$
    - The word "chair" denotes all the things that are chairs.
    - The word "running" denotes the set of actions that make up that activity, i.e. 🏃‍♂️

### How do we have reasonable meaning in a computer?
- Common solution: use something like `WordNet`, a thesaurus containing words and their relationships, organized as synonym sets (synsets) and hypernyms ("is a" relationships).

In [7]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ramand\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

- Synonyms of "good"

In [10]:
from nltk.corpus import wordnet as wn
poses = {'n' : 'noun', 'v' : 'verb', 'a' : 'adjective', 's' : 'adjective(s)', 'r' : 'adverb' }
for synset in wn.synsets("good"):
    print(f'{poses[synset.pos()]} : {", ".join([l.name() for l in synset.lemmas()])}')

noun : good
noun : good, goodness
noun : good, goodness
noun : commodity, trade_good, good
adjective : good
adjective(s) : full, good
adjective : good
adjective(s) : estimable, good, honorable, respectable
adjective(s) : beneficial, good
adjective(s) : good
adjective(s) : good, just, upright
adjective(s) : adept, expert, good, practiced, proficient, skillful, skilful
adjective(s) : good
adjective(s) : dear, good, near
adjective(s) : dependable, good, safe, secure
adjective(s) : good, right, ripe
adjective(s) : good, well
adjective(s) : effective, good, in_effect, in_force
adjective(s) : good
adjective(s) : good, serious
adjective(s) : good, sound
adjective(s) : good, salutary
adjective(s) : good, honest
adjective(s) : good, undecomposed, unspoiled, unspoilt
adjective(s) : good
adverb : well, good
adverb : thoroughly, soundly, good


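- Hypernyms of "tiger". Note that `tiger.n.01` is the "fierce or audacious person" sense, which is why the chain below passes through `person.n.01`; the animal sense is `tiger.n.02`.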

In [14]:
tiger = wn.synset("Tiger.n.01")
hyper = lambda t: t.hypernyms()
list(tiger.closure(hyper))

[Synset('person.n.01'),
 Synset('causal_agent.n.01'),
 Synset('organism.n.01'),
 Synset('physical_entity.n.01'),
 Synset('living_thing.n.01'),
 Synset('entity.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01')]

- WordNet lists the various uses of "good" in English. Many of the distinctions are so fine-grained that humans can barely tell them apart.
- It misses nuance, e.g. "expert" is listed as a synonym of "good", which is only correct in some contexts.
- It misses new meanings of words, e.g. "wicked good". It is built by human labor and is impossible to keep up to date.
- It cannot give us an accurate measure of word similarity, i.e. a score of how similar a pair of words is.
- It is very subjective.


### Localist Representation
This is tangential to the current discussion.
- A representation in which each entity is represented independently, with its own dedicated unit or dimension.
- It can only describe a number of distinct objects that is linear in the number of dimensions.
- This representation does not capture any relationship between the entities.
- One-hot encoding is a localist representation.
- If we treat each word as a symbol, and English is estimated to have 13 million words, then each word becomes a 13-million-dimensional vector with a single 1 and 0 everywhere else.
- In neuroscience, the localist view theorizes that each neuron (or localist unit) stands for a single concept on a stand-alone basis, carrying its own "meaning and representation".
- This is the opposite of a distributed representation.


### Representing words as discrete symbols.
- **Pre-2012**
    - Words as discrete symbols in lexicon (vocabulary)
    - "hotel, conference, motel" a localist representation
    - One Hot Encoding is used to represent the words as vector
        - motel = [00000100]
        - hotel =  [00000010]
    - These vectors become huge because languages have a lot of words.
    - In a language like English, the vocabulary is almost unbounded because of derivational morphology.
        - New words are created by adding affixes to existing words.
        - e.g. paternal --> paternalistic --> paternalistically
        - This can expand the vocabulary of a language many times over.
    - This takes a lot of computational power, since a word vector can have dimension 500,000 or more.
    - A bigger problem is that we are often interested in the relationships between words and their meanings.
        - If I search for "Seattle motels", I would probably also like results for "Seattle hotels".
        - However, representing words as symbols keeps these words orthogonal in the space (see the one-hot vectors above).
        - There is no natural notion of similarity between one-hot vectors.
        - Word-similarity tables can address this, but they explode the size of the problem: we keep a similarity score for each pair of words, so a vocabulary of 500,000 words leads to a table of 500,000 × 500,000 = 250 billion cells.
- Instead, how about **encoding "similarity" in the vectors themselves?**
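The orthogonality problem described above can be seen in a few lines of NumPy. This is a minimal sketch: the four-word vocabulary and the helper `one_hot` are made up for illustration and are not part of the original notes.

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies have hundreds of thousands of entries.
vocab = ["hotel", "conference", "motel", "seattle"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for `word` over the toy vocabulary."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

hotel, motel = one_hot("hotel"), one_hot("motel")

# Two different one-hot vectors are orthogonal: their dot product is 0,
# so the encoding carries no information about how related the words are.
print(hotel @ motel)   # 0
print(hotel @ hotel)   # 1

# Number of cells in a full pairwise similarity table for 500,000 words.
print(500_000 ** 2)    # 250000000000
```

Whatever two distinct words are chosen, the dot product is 0, which is why a separate similarity table (or a different representation) is needed.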

### Distributional Semantics
- Linguistic meaning of a word: a word's meaning is given by the words that frequently appear close to it.
- When a word *w* appears in a text, its **context** is the set of words that appear near *w* within a fixed-size window.
- "You shall know a word by the company it keeps" - J. R. Firth 1957
- ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fdailygrind%2FlUwgP_m7kQ.png?alt=media&token=933154b6-2990-4d0a-aa17-323b7ddabf13)
- The meaning of "banking" is represented by the collection of all the words that appear around it.


### Distributed Representation

- This idea is the opposite of a localist representation such as one-hot encoding.
- In one-hot encoding, each word's vector is independent of every other word's and is very large (English has an estimated 13 million words): all 0s except for a single 1 that identifies the word.
    - motel = \[00000100...00\]
    - hotel  = \[00000010...00\]
- In a distributed representation, each word is represented as a dense vector, chosen so that it is similar to the vectors of words that appear in similar contexts.
- In other words, words that appear in similar contexts live close together in the vector space, e.g. "motel" and "hotel" are related words and will live close to each other.
- The dimension of these vectors is very small compared to the vocabulary size, e.g. 50, 100, 200, ... up to 4000, as compared to 13 million English words.
- We use this smaller vector space to encode the relationship between words.
$$ \text{banking} = \begin{pmatrix} 0.102 \\ 0.432 \\0.445\\ 0.001\\0.034 \end{pmatrix}$$
- These word vectors are sometimes called word embeddings or word representations.

- When this vector space of some larger dimension (e.g. 100) is projected down to a 2-D plane, a lot of information is lost, but related words still tend to appear together.
    - A cluster of countries in one such projected plane
    - ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fdailygrind%2F6feMskbybU.png?alt=media&token=7147604c-ac91-4612-bdc3-c1917e6c9db1)
- ### Word2Vec 
    - By Tomas Mikolov
    - **Idea:**
        - We have a sufficiently large corpus of text.
        - Initially each word in a fixed vocabulary is represented by a vector with arbitrary (random) values.
        - We then iterate through the corpus; for each position *t* in the text:
            - We have the center word *c*
            - We also have context words *o* ("outside" words surrounding *c*)
            - Use the similarity of the word vectors for *c* and *o* to calculate the probability of *o* given *c* (or vice versa)
                - ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fdailygrind%2FFvXzkI7qYs.png?alt=media&token=40ab1db5-7de2-4de2-add1-88d8c9e8fe71)
            - Keep adjusting the word vectors to **maximize this probability.** (covered below)
            - This is repeated for each position.
                - ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fdailygrind%2FoQ83JB0O6A.png?alt=media&token=aa676c96-1731-4295-bd5c-034b8816a42b)
- **Types of word2vec algorithms**
    - Skip-gram (SG)
        - Predict the context ("outside") words, independent of their position, given the center word.
    - Continuous Bag of Words (CBOW)
        - Predict the center word from a bag-of-words context.
- **Skip-gram Prediction**
    - For every position $t = 1, \dots, T$, predict the surrounding context words in a window of radius $m$ around the word at that position.
    - **Objective**: maximize the probability of the context words given the current center word.
    - The quantity we actually minimize, the negative log-likelihood $J(\theta)$ defined below, is also called the loss or cost function.
    - $\text{Likelihood} = L(\theta) = \displaystyle\prod_{t=1}^{T}\prod_{-m \leq j \leq m \atop j \neq 0} P(w_{t+j} | w_t; \theta)$
    - What $L(\theta)$ says is this: we have a large corpus of text, we go through each position in it, and for each position we take a window of size *2m* around it (*m* words before, *m* words after). The model then gives us a probability distribution over the words that can appear in the context of the center word.
    - Context words are denoted by *o* ("outside" words) and the center word by *c*.
    - We would like to set the parameters of the model so that the probabilities of the words that actually appear in the context of the center word are as high as possible. $\theta$ denotes the parameters of our model, i.e. the vector representations of the words.
    - The likelihood then measures how good a job our model does at predicting the words around every word.
    - Rewriting the objective in a more convenient form (the average negative log-likelihood):
    - $J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \displaystyle\sum_{t=1}^{T}\sum_{-m \leq j \leq m \atop j \neq 0} \log P(w_{t+j} | w_t; \theta)$
    - Minimizing the objective function $\leftrightarrow$ Maximizing predictive accuracy
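The double product over positions $t$ and offsets $j$ can be made concrete by enumerating the (center, outside) pairs it ranges over. This is a minimal sketch: the function name `skipgram_pairs` and the example sentence are illustrative, and skipping offsets that fall off either end of the text is an assumption, since the formula itself does not say how boundaries are handled.

```python
def skipgram_pairs(tokens, m):
    """Yield (center, outside) pairs for every position t and offset j
    with -m <= j <= m and j != 0, skipping offsets outside the text."""
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue  # no pair for the center itself or beyond the text edges
            yield center, tokens[t + j]

# Hypothetical example text, tokenized by whitespace.
tokens = "problems turning into banking crises".split()
pairs = list(skipgram_pairs(tokens, m=1))
for center, outside in pairs:
    print(center, "->", outside)
```

Each printed pair corresponds to one factor $P(w_{t+j} | w_t; \theta)$ in the likelihood. With five tokens and $m = 1$ there are 8 such pairs.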

- **How do we calculate the probability?**
    - Each word is represented by two vectors, $V$ and $U$.
    - $V$ represents the word when it is the center word and $U$ represents it when it is a context word.
    - $U^{T}V$ measures how similar the center word is to a context word. It is a simple dot product, and it is larger when $U$ and $V$ are similar.
    - If we compute $U_w^TV$ for every word $w = 1, \dots, W$ in the vocabulary, we are working out how similar each word is to the center word $V$.
    - **Softmax form**
        - The dot products we obtain can be any real number, say -304.
        - Softmax is the standard way to convert these numbers to a probability distribution.
        - First we exponentiate them, which makes every value positive. This amplifies the largest values but still assigns some weight to the smaller ones.
        - Then we normalize: we divide each exponentiated value by the sum of all of them, so that the results sum to 1.
    - Finally, the ratio gives the probability that the word at index *o* in the vocabulary appears in the context of the current center word (the word at index *c* in the vocabulary).
        - $P(o|c) = \displaystyle\frac{\exp(U_{o}^T V_{c})}{\sum_{w=1}^W \exp(U_w^TV_{c})}$
            - where *o* is the outside (or output) word index, *c* is the center word index, $W$ is the vocabulary size, and $V_{c}$ and $U_{o}$ are the "center" and "outside" vectors for indices *c* and *o*.
            - Note the difference between $P(w_{t+j} | w_t; \theta)$ and $P(o|c)$: *c* and *o* are indices into the vocabulary (e.g. word 218 and word 3023 in the vocabulary), while *t* and *t+j* are positions in the text (e.g. word 342 and word 346 in the corpus).
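The softmax probability $P(o|c)$ can be sketched in NumPy as follows. The toy sizes, the random vectors, and the function name `prob_outside_given_center` are assumptions for illustration; subtracting the maximum score before exponentiating is a common numerical-stability trick that is not mentioned in the notes and does not change the result.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                    # toy sizes, chosen only for illustration
U = rng.normal(size=(vocab_size, dim))    # row w holds the outside vector U_w
V = rng.normal(size=(vocab_size, dim))    # row w holds the center vector V_w

def prob_outside_given_center(U, V, c):
    """Return the vector of P(o | c) for every outside word index o."""
    scores = U @ V[c]                     # dot products U_w^T V_c for all w
    scores = scores - scores.max()        # shift for numerical stability; ratios are unchanged
    exp_scores = np.exp(scores)           # exponentiate: every value becomes positive
    return exp_scores / exp_scores.sum()  # normalize so the values sum to 1

p = prob_outside_given_center(U, V, c=2)
print(p)
print(p.sum())
```

The entry `p[o]` is the model's probability that word `o` appears in the context of center word `c`.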
- **How it all fits together?**
    - As discussed above, each word has two vectors: $V_c$ represents the word as a center word, while $U_o$ represents it as an "outside" (context) word.
    - Initially these vectors are assigned random values. Our goal is to find the right values for them, i.e. to train the model and compute the best possible values. These vectors are the parameters of the model.
    - We often collect all the parameters of a model into one long vector $\theta$.
        - $\theta = \begin{bmatrix} V_{\text{aardvark}} \\ V_{\text{a}} \\ \vdots \\ V_{\text{Zyzzyva}} \\ U_{\text{aardvark}} \\ U_{\text{a}} \\ \vdots \\ U_{\text{Zyzzyva}} \end{bmatrix} \in \mathbb{R}^{2dV}$
        - Each vector has dimension $d$ and there are $2V$ vectors, where $V$ here denotes the vocabulary size (not a center vector).
    - Essentially, we want to minimize the overall loss, i.e. the negative log-likelihood we saw above. At the minimum of the loss we have the best possible value of $\theta$.
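The layout of $\theta$ can be illustrated by stacking two small matrices into one long vector. This is only a sketch: the sizes, the small random initialization scale, and the variable names `center_vecs` and `outside_vecs` are assumptions, not values from the notes.

```python
import numpy as np

vocab_size, dim = 5, 3                    # toy values standing in for V and d
rng = np.random.default_rng(0)

# One center vector and one outside vector per word, randomly initialized.
center_vecs = rng.normal(scale=0.01, size=(vocab_size, dim))
outside_vecs = rng.normal(scale=0.01, size=(vocab_size, dim))

# theta stacks all center vectors followed by all outside vectors.
theta = np.concatenate([center_vecs.ravel(), outside_vecs.ravel()])
print(theta.shape)                        # (30,), i.e. 2 * d * V entries

# The individual matrices can be recovered by slicing and reshaping.
center_back = theta[: vocab_size * dim].reshape(vocab_size, dim)
outside_back = theta[vocab_size * dim :].reshape(vocab_size, dim)
```

Keeping the two matrices separate is usually more convenient in code; the single vector $\theta$ is mainly a notational device for writing the update rule.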

- **How do we minimize the negative log likelihood and how does it give us the best** $\theta$ **values?**
    - The objective function can be rewritten as:
        - $J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \displaystyle\sum_{t=1}^{T}\sum_{-m \leq j \leq m \atop j \neq 0} \log\frac{\exp(U_o^TV_c)}{\sum_{w=1}^v \exp(U_w^TV_c)}$, where $c = w_t$, $o = w_{t+j}$, and $v$ is the vocabulary size (written $W$ above).
    - We minimize it using its gradient $\nabla_\theta J(\theta)$.
    - Since $\theta$ is built of $V_c$ and $U_o$, we will essentially calculate $\frac{\partial J(\theta)}{\partial V_c}$ and $\frac{\partial J(\theta)}{\partial U_o}$
    - Let's first focus on $\frac{\partial J(\theta)}{\partial V_c}$ 
    - Ignoring the outer summations and the constant factor $-\frac{1}{T}$, we need the derivative of the log-probability of a single (center, outside) pair:
        - $\frac{\partial}{\partial V_c} \log P(o|c) = \frac{\partial}{\partial V_c} \log\frac{\exp(U_o^TV_c)}{\sum_{w=1}^v \exp(U_w^TV_c)}$
        - $= \frac{\partial}{\partial V_c} \log (\exp(U_o^TV_c)) - \frac{\partial}{\partial V_c} \log \sum_{w=1}^v \exp(U_w^TV_c)$
        - $= t_1 - t_2$
    - Let's tackle the first term first.
        - $t_1 = \frac{\partial}{\partial V_c} \log(\exp(U_o^TV_c)) = \frac{\partial U_o^TV_c}{\partial V_c} = U_o$
    - Let's tackle the second term, using the chain rule.
        - $t_2 = \frac{\partial}{\partial V_c} \log \sum_{w=1}^v \exp(U_w^TV_c)$
        - $t_2 = \frac{1}{\sum_{w=1}^v \exp(U_w^TV_c)} \frac{\partial}{\partial V_c}\sum_{x=1}^v \exp(U_x^TV_c)$
        - Notice the change of summation index from $w$ to $x$, to avoid confusing this summation with the one in the denominator.
        - Move the derivative inside the summation.
        - $t_2 = \frac{1}{\sum_{w=1}^v \exp(U_w^TV_c)} \sum_{x=1}^v \frac{\partial}{\partial V_c}\exp(U_x^TV_c)$
        - $t_2 = \frac{1}{\sum_{w=1}^v \exp(U_w^TV_c)} \sum_{x=1}^v \exp(U_x^TV_c)U_x$
        - Move the denominator inside the summation too.
        - $t_2 = \sum_{x=1}^v \frac{\exp(U_x^TV_c)}{\sum_{w=1}^v \exp(U_w^TV_c)}U_x$
        - The fraction is exactly the softmax probability $P(x|c) = \displaystyle\frac{\exp(U_{x}^T V_{c})}{\sum_{w=1}^v \exp(U_w^TV_{c})}$
        - Therefore $t_2 = \sum_{x=1}^v P(x|c)U_x$
    - Going back to the original expression:
        - $\frac{\partial}{\partial V_c} \log P(o|c) = t_1 - t_2 = U_o - \sum_{x=1}^v P(x|c)U_x$
        - Because $J(\theta)$ is the negative average log-likelihood, this pair contributes $-\frac{1}{T}(t_1 - t_2)$ to the gradient of the loss. Dropping the factor $\frac{1}{T}$: $\frac{\partial J(\theta)}{\partial V_c} = \sum_{x=1}^v P(x|c)U_x - U_o$
    - Notice how elegant this result is. $t_1$ is the observed outside vector, and $t_2$ has the form of an expectation: we compute the probability of each word in the vocabulary appearing in the context of the center word and multiply it by that word's current outside vector. In other words, $t_2$ is the average over all possible outside vectors, weighted by the likelihood of their occurrence.
    - The gradient is the difference between the expected and the observed outside vector; it vanishes when the model's expectation matches what is observed.
    - A similar calculation for the outside vectors gives:
        - $\frac{\partial}{\partial U_o} \log P(o|c) = V_c - P(o|c)V_c = (1 - P(o|c))V_c$
        - $\frac{\partial}{\partial U_w} \log P(o|c) = -P(w|c)V_c$ for every other word $w \neq o$
        - As before, the corresponding gradients of $J(\theta)$ have the opposite sign.
    - This way we have calculated the gradient. Subtracting a small multiple of the gradient from the parameters moves them towards a minimum of $J(\theta)$.
    - The parameters are updated as follows.
        - $\theta_{j}^{new} = \theta_{j}^{old} - \alpha\frac{\partial}{\partial\theta_{j}^{old}} J(\theta)$
        - In vector form: $\theta^{new} = \theta^{old} - \alpha\frac{\partial}{\partial\theta^{old}} J(\theta)$
        - or $\theta^{new} = \theta^{old} - \alpha\nabla_\theta J(\theta)$
        - $\alpha$ is our step size (learning rate), which determines how far we move in each step.
    - This is essentially the [[Gradient Descent]] algorithm, which we will come back to separately.
    - In practice we use [[Stochastic Gradient Descent]], which avoids an exceedingly expensive computation per update and often gives better results.
    - Instead of computing the gradient over the entire corpus before updating $\theta$, we update the parameters after each window $t$ (i.e. for each individual center word).
        - $\theta^{new} = \theta^{old} - \alpha\nabla_\theta J_t(\theta)$
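The gradient and the update rule can be checked numerically on a single (center, outside) pair. The sketch below uses the per-pair loss $-\log P(o|c)$, so its gradients are the negatives of the derivatives of $\log P(o|c)$: $\sum_x P(x|c)U_x - U_o$ with respect to $V_c$, and $P(x|c)V_c$ (minus $V_c$ for $x = o$) with respect to each $U_x$. All sizes, indices, the step size, and the helper names are assumptions for illustration, not the implementation from the linked notebook.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def loss_and_grads(U, V, c, o):
    """Loss -log P(o|c) for one (center, outside) pair, with its gradients."""
    p = softmax(U @ V[c])              # P(x|c) for every word x
    loss = -np.log(p[o])
    grad_vc = U.T @ p - U[o]           # sum_x P(x|c) U_x - U_o
    grad_U = np.outer(p, V[c])         # row x: P(x|c) V_c
    grad_U[o] -= V[c]                  # observed outside word gets an extra -V_c
    return loss, grad_vc, grad_U

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                 # toy sizes
U = rng.normal(size=(vocab_size, dim))
V = rng.normal(size=(vocab_size, dim))
c, o = 2, 5                            # hypothetical center / outside indices

# Compare the analytic gradient w.r.t. V_c with central finite differences.
_, grad_vc, _ = loss_and_grads(U, V, c, o)
eps = 1e-6
numeric = np.zeros(dim)
for k in range(dim):
    V_plus, V_minus = V.copy(), V.copy()
    V_plus[c, k] += eps
    V_minus[c, k] -= eps
    numeric[k] = (loss_and_grads(U, V_plus, c, o)[0]
                  - loss_and_grads(U, V_minus, c, o)[0]) / (2 * eps)
max_error = np.abs(numeric - grad_vc).max()
print("max gradient error:", max_error)

# A few gradient steps on this single pair: theta_new = theta_old - alpha * grad.
alpha = 0.05
losses = []
for _ in range(100):
    loss, grad_vc, grad_U = loss_and_grads(U, V, c, o)
    losses.append(loss)
    V[c] -= alpha * grad_vc
    U -= alpha * grad_U
print("loss before:", losses[0], "loss after:", losses[-1])
```

A real training loop would sample a different (center, outside) pair or window at every step, which is the stochastic part of SGD; repeating one pair here only shows that the update direction reduces the loss.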

## Implementing Word2Vec from scratch using numpy

Let's implement Word2Vec from scratch. I have implemented it in another notebook [here](Word2Vec_from_scratch.ipynb).