# Tags  

vector space models, representations, embeddings



# Citation  

GloVe: Global Vectors for Word Representation
Pennington, Socher, Manning 2014

# Significance

A new technique for learning vector representations of words (embeddings) that produces a word-vector space with meaningful sub-structure, 
i.e. it does well on word analogy tasks such as king - queen = man - woman, while also leveraging global corpus statistics, which local-window models like skip-gram word2vec do not use directly.
State-of-the-art results on word analogy data sets, plus gains on downstream tasks like NER.

# Context and summary  

Previously existing techniques:  
Two main classes of techniques - 


## Global matrix factorization techniques like LSA and other techniques  
1) Advantages and disadvantages: They leverage distributional information in the corpus, as they primarily work off normalized word counts. These work well for distance-based metrics, but not so well for word analogy tasks (king - queen = man - woman)  
2) Start with a term-document matrix or word-word matrix of event counts, then apply transformations and dimensionality reduction to avoid a disproportionate impact from uninformative counts, such as co-occurrences of "the" with other words. PMI and similar techniques are used for this  

    

## Local context window methods, such as Mikolov's skip gram techniques which do well on analogy tasks, but do not leverage distributional information in corpus  
1) Bengio (2003) uses a simple neural architecture, Mikolov (2013a) uses skip-gram and CBOW (single-layer neural architectures), Levy (2014) proposes explicit word embeddings based on the PPMI metric, and Mnih and Kavukcuoglu (2013) propose vector log-bilinear models (vLBL and ivLBL)  
2) CBOW and vLBL predict a word given its context, while skip-gram and ivLBL predict the context given a word  
3) These methods do not operate directly on co-occurrence counts, thus failing to take advantage of repetition in the data (?)  
4) Also, in these methods, computational cost scales with corpus size |C|, which makes computation much more expensive for large corpora

# Glove  - Deriving the Cost Function

1) Let X be the word-word matrix of co-occurrence counts, X<sub>ij</sub> be the number of times word j appears in the context of word i, and X<sub>i</sub> = $\sum_k{X_{ik}}$ be the number of times any word appears in the context of word i (the row sum of counts).
Let $p_{j|i} = X_{ij}/X_{i}$ be the conditional probability of word j appearing in the context of word i   

2) The goal, of course, is to get vector representations for every word from the co-occurrence matrix X. Instead of using traditional normalization and dimensionality reduction techniques, such as length normalization of counts followed by LSA, GloVe defines an objective function based on the criteria below and minimizes it; this yields vectors for every word in the X matrix

3) Specifically, while defining the objective function, we use the observation that, instead of the conditional probabilities themselves, the *ratio* of conditional probabilities in the presence of a third word k is better able to distinguish relevant from irrelevant words, and also to discriminate between two relevant words.   

4) Example - In the table pasted below (Table 1 of the paper), suppose we want to get representations of word i = ice and word j = steam.  
a) For a word k related to ice but not to steam, $p_{k|ice}/p_{k|steam}$ will be very high  
b) For a word k related to steam but not to ice, $p_{k|ice}/p_{k|steam}$ will be very small  
c) For a word k which is equally related or unrelated to both ice and steam, $p_{k|ice}/p_{k|steam}$ ~ 1

![how_ratio_of_prob_is_better_than_raw_prob](glove_pic1.png "Image Credit Table 1 in paper")       


(Table 1 from paper)
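The ratio idea from step 4 can be sketched on a toy example. The counts below are made up for illustration (they are not the values from the paper's table), and the helper names `conditional_prob` and `prob_ratio` are hypothetical.

```python
# Hypothetical co-occurrence counts X[i][k]: how often context word k
# appears near target word i. These numbers are invented for illustration.
X = {
    "ice":   {"solid": 80, "gas": 5,  "water": 300, "fashion": 2},
    "steam": {"solid": 4,  "gas": 70, "water": 280, "fashion": 2},
}

def conditional_prob(counts, i, k):
    """p_{k|i} = X_ik / X_i, where X_i is the row sum for word i."""
    row_total = sum(counts[i].values())
    return counts[i][k] / row_total

def prob_ratio(counts, i, j, k):
    """Ratio p_{k|i} / p_{k|j}, used to tell words i and j apart via probe word k."""
    return conditional_prob(counts, i, k) / conditional_prob(counts, j, k)

ratios = {k: prob_ratio(X, "ice", "steam", k) for k in X["ice"]}
for k, r in ratios.items():
    print(f"{k:8s} p(k|ice)/p(k|steam) = {r:.2f}")
```

With these toy counts the ratio is large for "solid", small for "gas", and close to 1 for "water" and "fashion", which is the qualitative pattern the three cases in step 4 describe.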


5) In the table above, we see that $p_{k|ice}/p_{k|steam}$ is very high if k is the word "solid"; this is because "solid" in the context of ice is much more frequent than "solid" in the context of steam. Similarly, $p_{k|ice}/p_{k|steam}$ is very low if k is the word "gas", as "gas" appears much more frequently in the context of steam than in the context of ice

6) Using this observation, we thus want a function F of word vectors such that
$F(w_{i},w_{j},w_{\tilde{k}}) = \frac{p_{k|i}}{p_{k|j}}$  

In other words, given word vectors $w_{i},w_{j}$ for words i, j and context vector $w_{\tilde{k}}$ for word k, F applied to these vectors should reproduce the ratio of conditional probabilities observed in the data; the vectors are then learned so that this holds

7) Since the word vector space is expected to be linear, and the RHS is a ratio of conditional probabilities, whose linear counterpart is a difference between word vectors, we change the equation above to $F(w_{i}-w_{j},w_{\tilde{k}}) = \frac{p_{k|i}}{p_{k|j}}$  

8) In addition, note that the LHS arguments are vectors while the RHS is a scalar. F could be some complex function which converts vectors to scalars, but this would obfuscate the linear structure we want. To keep things simple, make the argument of F a scalar by taking a dot product:  $F( (w_{i}-w_{j})^{T}w_{\tilde{k}}) = \frac{p_{k|i}}{p_{k|j}}$    
Note that the argument is now just the unnormalized cosine similarity between w<sub>i</sub>-w<sub>j</sub> and w<sub>k</sub>


 

9) For a word-word co-occurrence matrix, the distinction between word and context is arbitrary. This means that w can be exchanged with $\tilde{w}$, and X with X<sup>T</sup>, without changing the model. This requires F to be a homomorphism from the additive group of real numbers to the multiplicative group of positive real numbers. See [here](https://towardsdatascience.com/emnlp-what-is-glove-part-iii-c6090bed114) for more on homomorphisms. This implies that  $F( (w_{i}-w_{j})^{T}w_{\tilde{k}})  = F( (w_{i})^{T}w_{\tilde{k}})/F( (w_{j})^{T}w_{\tilde{k}})$. The exponential function satisfies this equality, so we take F = exp
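As a quick check of that last claim, substituting F = exp into the relation in step 9 and using $e^{a-b} = e^{a}/e^{b}$ gives

$$
\exp\big((w_{i}-w_{j})^{T}w_{\tilde{k}}\big) = \exp\big(w_{i}^{T}w_{\tilde{k}} - w_{j}^{T}w_{\tilde{k}}\big) = \frac{\exp\big(w_{i}^{T}w_{\tilde{k}}\big)}{\exp\big(w_{j}^{T}w_{\tilde{k}}\big)}
$$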



10) From 8 and 9, matching numerators, this implies that $F( (w_{i})^{T}w_{\tilde{k}}) = p_{k|i} = X_{i\tilde{k}}/X_{i}$

11) Taking ln on both sides, and knowing that F is the exponential function, $(w_{i})^{T}w_{\tilde{k}} = ln(p_{k|i}) = ln(X_{i\tilde{k}}) - ln(X_{i})$

12) $ln(X_{i})$ is independent of k, so it can be absorbed into a word-specific bias term b<sub>i</sub>, which we learn along with the word vectors rather than computing from the corpus. Why (?)

$(w_{i})^{T}w_{\tilde{k}} = ln(X_{i\tilde{k}}) - b_{i}$

13) Add a term $b_{\tilde{k}}$ to restore symmetry to get 

$(w_{i})^{T}w_{\tilde{k}} + b_{i} + b_{\tilde{k}} = ln(X_{i\tilde{k}})$


14) In the equation above, we want to learn all w<sub>i</sub>, $w_{\tilde{k}}$, b<sub>i</sub> and $b_{\tilde{k}}$; $X_{ik}$ is known, obtained as counts from the corpus.

15) Here's a different way to get the equation above.
Suppose we want the learnt word and context vectors $w_{i}$ and $w_{\tilde{k}}$ to satisfy the relation

$(w_{i})^{T}w_{\tilde{k}} = ln(p_{\tilde{k}|i}) = ln(X_{i\tilde{k}}/X_{i}) = ln(X_{i\tilde{k}}) - ln(X_{i})$


Similarly, flipping i and $\tilde{k}$,


$(w_{\tilde{k}})^{T}w_{i} = ln(p_{i|\tilde{k}}) = ln(X_{i\tilde{k}}/X_{\tilde{k}}) = ln(X_{i\tilde{k}}) - ln(X_{\tilde{k}})$


Adding both equations above and dividing by 2,

$(w_i)^{T}w_{\tilde{k}} = ln(X_{i\tilde{k}}) - 0.5 ln(X_{i}) - 0.5 ln(X_{\tilde{k}})$  =>

$(w_{i})^{T}w_{\tilde{k}} + b_{i} + b_{\tilde{k}} = ln(X_{i\tilde{k}})$

with $b_{i} = 0.5 ln(X_{i})$ and $b_{\tilde{k}} = 0.5 ln(X_{\tilde{k}})$.


16) Now we go ahead and set the cost function  
$J =  \sum_{i,k}((w_{i})^{T}w_{\tilde{k}} + b_{i}  + b_{\tilde{k}} - ln(X_{ik}))^{2}$   

we want to minimize this to find the word vectors and biases

17) However, the equation above has problems: it is not defined when $X_{ik}$ = 0, and it weights all co-occurrences equally, including rare ones. So add a weighting term f(X<sub>ik</sub>) to the equation above - 


$J =  \sum_{i,k}f(X_{ik})((w_{i})^{T}w_{\tilde{k}} + b_{i}  + b_{\tilde{k}} - ln(X_{ik}))^{2}   $   



f(x) should satisfy the following properties. f(0) = 0; in fact, viewed as a continuous function, f(x) should approach 0 as x approaches 0 fast enough that $\lim_{x \to 0} f(x) ln^{2}(x)$ is finite

f(x) should be non-decreasing, so that rare co-occurrences are not overweighted. 

f(x) should be relatively small for large values of x, so that frequent co-occurrences are not overweighted; it should tend to plateau. 



The choice of f by the authors, which satisfies the above 3 conditions, was $f(x) = (x/x_{max})^{\alpha}$ if x < $x_{max}$, 1 otherwise; $\alpha$ was chosen to be 3/4 heuristically




![figure from paper on how f(x) varies with x](glove_pic2.png "Image Credit Figure 1 in paper")  


(Figure 1 from paper)
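The weighted cost function derived in this section can be sketched in plain Python as below. This is a minimal illustration, not the authors' implementation: the names `weight`, `dot` and `glove_loss` are hypothetical, the toy counts and vectors are invented, and `x_max = 100` is an assumed default since these notes do not state a value.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): (x / x_max)^alpha below x_max, 1 otherwise.

    x_max=100 is an assumed default; alpha=3/4 is the value given in the notes.
    """
    return (x / x_max) ** alpha if x < x_max else 1.0

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted least-squares cost J, summed over the stored counts.

    X is a dict {(i, k): count}; W and W_ctx hold word and context vectors,
    b and b_ctx the corresponding biases.
    """
    total = 0.0
    for (i, k), x_ik in X.items():
        if x_ik <= 0:
            continue  # f(0) = 0, so zero counts contribute nothing
        err = dot(W[i], W_ctx[k]) + b[i] + b_ctx[k] - math.log(x_ik)
        total += weight(x_ik) * err * err
    return total

# Hypothetical toy problem: 2 words, 2-dimensional vectors, made-up counts
X = {(0, 1): 3.0, (1, 0): 3.0, (0, 0): 150.0}
W = [[0.1, 0.2], [0.3, -0.1]]
W_ctx = [[0.2, 0.0], [-0.1, 0.4]]
b = [0.0, 0.0]
b_ctx = [0.0, 0.0]
J = glove_loss(X, W, W_ctx, b, b_ctx)
print(J)
```

Storing X as a dictionary of nonzero entries means the sum never touches the zero counts, which sidesteps the undefined ln(0) term.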



# Relation to other models like W2VEC

In this section, the authors cast the word2vec objective functions in a form similar to GloVe to compare and contrast

1) A skip-gram or ivLBL model tries to predict the context given word i. Assuming a softmax, the probability of word j appearing in the context of word i is

$Q_{j|i} = \frac{exp(w_i^{T}w_{j})}{\sum_{k}{exp(w_i^{T}w_{k})}}$

2) We would like to minimize  $J = -\sum_{i,j(i)} ln(Q_{j|i})$  
where i spans across all words, j across all words in context for each i

3) 2 is equivalent to $J = -\sum_{i,j}X_{ij}ln(Q_{j|i})$, grouping together terms with the same i and j

4) Note that $P_{j|i} = X_{ij}/X_{i}$   =>  $J = -\sum_{i}X_{i}\sum_{j}{P_{j|i}ln(Q_{j|i})}$


5) 4 can be written as $J = \sum_{i}X_{i}H(P_{i},Q_{i})$ where H is the cross entropy between distributions $P_{i}$ and $Q_{i}$

6) 5 has some problems: Q (the model output) needs to be properly normalized, and cross entropy H is just one of many distance measures between two distributions, with the disadvantage that it gives too much weight to rare events for distributions with long tails. Also, computing the softmax over the entire vocabulary is computationally expensive

7) Therefore use a different distance metric $J = \sum_{i,j}X_{i}(\hat{P}_{j|i} - \hat{Q}_{j|i})^{2}$    where $\hat{P}_{j|i}$ and $\hat{Q}_{j|i}$ are unnormalized, $\hat{P}_{j|i} = X_{ij}$ and  $\hat{Q}_{j|i} = exp(w_{i}^{T}w_{j})$

8) 7 has a problem: there is NO normalization, so if counts are large, this can blow up. To avoid this, minimize the squared difference between logs

Instead of $J = \sum_{i,j}X_{i}(\hat{P}_{j|i} - \hat{Q}_{j|i})^{2}$ ,

use 

$J = \sum_{i,j}X_{i}(ln(\hat{P}_{j|i}) - ln(\hat{Q}_{j|i}))^{2}$ 

which means

$J = \sum_{i,j}X_{i}(w_{i}^{T}w_{j}-ln(X_{ij}))^{2}$   which is analogous to the GloVe cost function, absent the bias terms (which we can include) and with the weight $X_{i}$ in place of the more general $f(X_{ij})$
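The rewriting in steps 2 to 5 can be checked numerically. The sketch below uses invented counts and invented scores standing in for $w_{i}^{T}w_{j}$ (both are assumptions for illustration), and confirms that the per-occurrence sum, the count-grouped sum, and the cross-entropy form give the same value.

```python
import math

# Hypothetical toy setup: 3 words, made-up counts X[i][j] and
# made-up scores[i][j] standing in for w_i^T w_j.
X = [[0, 4, 1], [4, 0, 2], [1, 2, 0]]
scores = [[0.1, 0.9, -0.3], [0.9, 0.2, 0.4], [-0.3, 0.4, 0.0]]
n = len(X)

def softmax_row(row):
    """Softmax over one row of scores, giving Q_{j|i} for a fixed i."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

Q = [softmax_row(r) for r in scores]

# Step 2: one term per (word, context) occurrence in the corpus
pairs = [(i, j) for i in range(n) for j in range(n) for _ in range(X[i][j])]
J_pairs = -sum(math.log(Q[i][j]) for i, j in pairs)

# Step 3: group identical (i, j) terms using the counts X_ij
J_counts = -sum(X[i][j] * math.log(Q[i][j]) for i in range(n) for j in range(n))

def cross_entropy(p, q):
    """H(p, q) = -sum_j p_j ln q_j, skipping terms with p_j = 0."""
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q) if pj > 0)

# Steps 4-5: weighted sum of cross entropies, one per word i
J_entropy = 0.0
for i in range(n):
    X_i = sum(X[i])
    P_i = [x / X_i for x in X[i]]
    J_entropy += X_i * cross_entropy(P_i, Q[i])

print(J_pairs, J_counts, J_entropy)
```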

# Computational cost of Glove

1) Repeating the equation above, cost $J = \sum_{i,j}f(X_{ij})((w_{i}^{T}w_{j}) + b_{i} + b_{j} - ln(X_{ij}))^{2}$

2) Computational cost depends on the number of nonzero elements in the word-word matrix X, which is of dimension V*V, where V is the vocab size. Therefore, in the worst case of a completely dense matrix, cost is O(V<sup>2</sup>)

3) Assume that $X_{ij}$ in the presence of sparsity can be modeled as a power law function of the frequency rank of the word pair $r_{ij}$, i.e. $X_{ij} = \frac{k}{r_{ij}^{\alpha}}$ 

4) Number of words in corpus C: |C| ~ $\sum_{ij}X_{ij}$  =  $\sum_{r=1}^{|X|}\frac{k}{r^{\alpha}}$ = $kH_{|X|,\alpha}$ where $H_{n,m} = \sum_{r=1}^{n}\frac{1}{r^{m}}$ is a generalized harmonic number 

5) Looking at the power law equation, i.e. $X_{ij} = \frac{k}{r_{ij}^{\alpha}}$, in the limiting case of $X_{ij}$ = 1 (at least 1 count), $r_{ij} = k^{1/\alpha}$. Therefore, if we rank the nonzero $X_{ij}$, we get all ranks from 1 to $k^{1/\alpha}$

which means the number of non-zero terms in X is |X| = $k^{1/\alpha}$

6) Therefore, substituting k = $|X|^{\alpha}$ in the equation |C| ~ $kH_{|X|,\alpha}$, we get |C| ~ $|X|^{\alpha}H_{|X|,\alpha}$

7) Using properties of generalized harmonic number,

![generalized_harmonic_number](glove_pic3.png "Image Credit Equation 20 in paper")     where $\zeta$ is the Riemann zeta function


(Equation 20 from the paper)

8) This gives  ![order_glove](glove_pic4.png "Image Credit Equation 21 in paper")     

(Equation 21 from paper)

9) This simplifies for large |X| to |X| = O(|C|) if $\alpha$ < 1, and $O(|C|^{1/\alpha})$ if $\alpha$ > 1

10) In practice, the authors observe that $\alpha$ = 1.25, which means that $|X| = O(|C|^{0.8})$; this scales better than the window-based word2vec methods, whose cost is O(|C|)
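The scaling claim can be probed numerically using only the relation |C| ~ $|X|^{\alpha}H_{|X|,\alpha}$ from step 6. The sketch below is a rough sanity check under that power-law assumption, with arbitrarily chosen matrix sizes and hypothetical function names; it estimates the exponent relating |X| to |C| from a log-log slope.

```python
import math

def generalized_harmonic(n, m):
    """H_{n,m} = sum_{r=1}^{n} 1 / r^m, computed by direct summation."""
    return sum(r ** -m for r in range(1, n + 1))

def corpus_size(num_nonzero, alpha):
    """|C| ~ |X|^alpha * H_{|X|, alpha}, the relation from step 6."""
    return num_nonzero ** alpha * generalized_harmonic(num_nonzero, alpha)

alpha = 1.25
sizes = [10**4, 10**5, 10**6]  # arbitrary values of |X| for illustration
corpora = [corpus_size(n, alpha) for n in sizes]

# Slope of log|X| against log|C| between the two largest sizes;
# for alpha > 1 this should approach 1/alpha as |X| grows.
slope = (math.log(sizes[-1]) - math.log(sizes[-2])) / (
    math.log(corpora[-1]) - math.log(corpora[-2])
)
print(f"empirical exponent ~ {slope:.3f}, predicted 1/alpha = {1 / alpha:.3f}")
```

The estimate sits slightly below 1/$\alpha$ at finite sizes because the harmonic number is still growing toward its limit.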

# Experiments

## Evaluation

1) Evaluated on the word analogy task of Mikolov, word similarity tasks (Luong 2013), and the CoNLL-2003 NER benchmark

2) Word analogy - given 3 words a, b, c such that a:b::c:?, we find d as the word in the vocabulary whose vector representation is closest (by cosine similarity) to $w_{b} + w_{c} - w_{a}$

3) NER task - Used the CoNLL benchmark, classifying each word into 4 entity types - organization, location, person, misc. Model used was - ~437K discrete features from the Stanford NER model + 50-dimensional vectors for each word in a 5-word context => trained a CRF model
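The analogy lookup in item 2 can be sketched as follows. The 2-dimensional vectors are invented for illustration, the function names are hypothetical, and excluding the three query words from the candidates is an assumption of this sketch rather than something these notes state.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Return the word d maximizing cosine(w_d, w_b - w_a + w_c).

    The query words a, b, c are excluded from the candidates (an assumption).
    """
    target = [wb - wa + wc for wa, wb, wc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy embedding, not learned from any corpus
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [-1.0, 2.0],
}
answer = solve_analogy(toy, "man", "woman", "king")
print(answer)  # prints "queen" for these toy vectors
```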

## Corpora used to train

1) 5 data sets - a 2010 Wikipedia dump with 1 billion tokens, a 2014 Wikipedia dump with 1.6 billion tokens, Gigaword 5 with 4.3 billion tokens, Gigaword 5 + Wikipedia 2014 with ~6 billion tokens, and 42 billion tokens from crawling the web


2) Each corpus tokenized and lower-cased using the Stanford tokenizer, a vocabulary of the 400K most frequent words built, and a matrix X of word-word co-occurrence counts constructed

3) Different window sizes for context, and whether to distinguish between left and right context, are tried. A 1/d scaling function is used within the context window, so a word pair d words apart contributes 1/d to the count

4) Models trained using AdaGrad, with an initial learning rate of 0.05
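The co-occurrence counting with 1/d scaling described in this list can be sketched as below. This is an illustrative reading of the description, not the authors' code: the function name `cooccurrence_counts`, its parameters, and the example sentence are assumptions, and whitespace splitting stands in for the Stanford tokenizer.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2, symmetric=True):
    """Accumulate X[(word, context)], where a pair d tokens apart adds 1/d.

    With symmetric=True both left and right contexts are counted;
    with symmetric=False only the left context of each word is counted.
    """
    X = defaultdict(float)
    for pos, word in enumerate(tokens):
        for d in range(1, window + 1):
            left = pos - d
            if left < 0:
                break  # ran off the start of the token list
            context = tokens[left]
            X[(word, context)] += 1.0 / d
            if symmetric:
                # the same pair seen from the other word's point of view
                X[(context, word)] += 1.0 / d
    return dict(X)

# Made-up example sentence, lower-cased and split on whitespace
tokens = "The ice is solid and the steam is gas".lower().split()
X_sym = cooccurrence_counts(tokens, window=2)
X_left = cooccurrence_counts(tokens, window=2, symmetric=False)
print(X_sym[("ice", "solid")], X_sym[("ice", "the")])
```

Keeping only nonzero entries in a dictionary mirrors the sparsity argument made in the computational cost section.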

# Results

![table 2 results](glove_pic5.png "Table. 2 results")  

Results on word analogy task (Table 2 from the paper)

SVD is obtained from the matrix truncated to co-occurrences with the top 10k most frequent words, SVD-S is the SVD of sqrt(X truncated), and SVD-L is the SVD of log(1+X truncated)

![table 4 results](glove_pic6.png "Table. 4 results")  

Results on NER task (Table 4 from the paper)

# Vector length and context size 

![figure 2 results](glove_pic7.png "Figure 2 results")  

Performance on word analogy task (Figure 2 from the paper)