# Word Embeddings

Word embeddings are vector representations of words in which words with similar meanings or contexts lie close together in the vector space.
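As a rough illustration of what "closer in the vector space" means, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors and the `cosine_similarity` helper are invented for this example and do not come from any trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional embeddings, made up for illustration only.
toy_vectors = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.75, 0.20]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

sim_related = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_unrelated = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

print(f"king vs queen: {sim_related:.3f}")
print(f"king vs apple: {sim_unrelated:.3f}")
```

With real embeddings the vectors typically have 50 to 300 dimensions, but the comparison works the same way.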

Usage: sentiment analysis, document clustering, question answering, paraphrase detection.

Common embedding models:

1. Word2Vec
2. GloVe
3. BERT
4. USE (Universal Sentence Encoder)

## Word2Vec

1. Word2Vec is a two-layer feedforward neural network (a shallow neural network).
2. There are two architectures in Word2Vec:
    1. Continuous Bag of Words (CBOW)
    2. Skip-Gram
    
Word2Vec has an input layer of vocabulary size, followed by a hidden layer whose size equals the embedding dimension (for example, 300), and then an output layer of vocabulary size. Using this single-hidden-layer network, we either predict the context words from a given word (skip-gram) or predict a word from its context words (continuous bag of words).
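A minimal sketch of the forward pass through such a network, using NumPy with deliberately tiny sizes. The names `W_in`, `W_out`, and `forward` are assumptions made for this illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10   # illustrative; a real vocabulary is far larger
embed_dim = 4     # illustrative; the text above uses 300

# Input-to-hidden weights: row i is the embedding of word i.
W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
# Hidden-to-output weights: one column of scores per vocabulary word.
W_out = rng.normal(scale=0.1, size=(embed_dim, vocab_size))

def forward(word_id):
    """Return softmax probabilities over the vocabulary for one input word."""
    one_hot = np.zeros(vocab_size)
    one_hot[word_id] = 1.0
    hidden = one_hot @ W_in                      # linear hidden layer, no activation
    scores = hidden @ W_out
    exp_scores = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

probs = forward(3)
print(probs.shape, probs.sum())
```

Multiplying a one-hot vector by `W_in` simply selects one row, which is why the rows of the input weight matrix can be read off as word vectors after training.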

Con:

The Word2Vec model only considers the local (nearby) context words of the target word when learning its vector representation.


### Skip-Gram

1. The network is trained using word pairs generated from the corpus.

2. For each word in a sentence, we generate word pairs by looking at a 
fixed number of words before and after the current word (the input word). 
This fixed number of words is known as the window size, for example 2, 3, or 4.

3. A window size of 3 means that we look at 3 words before and 3 words after the current word wt. 
These are the context words wt-3, wt-2, wt-1, wt+1, wt+2, wt+3.

4. The word pairs generated are (wt,wt-3), (wt,wt-2), (wt,wt-1), (wt,wt+1), (wt,wt+2), (wt,wt+3). 
These word pairs are used to train the neural network. That is, wt is given as input to the model and the expected output is one of the 
context words.

5. Assume that the vocabulary size of the corpus is 10,000. This means that there are 10,000 unique words in the corpus.
    The input to the network is a one-hot encoded vector representing the input word. 
    The network has one hidden layer with 300 neurons. The output layer has 10,000 neurons, 
    one for each word, with a softmax activation. No activation is used in the hidden layer.
    Because of the softmax, the model outputs a probability for each of the 10,000 words. 
    Each probability is the probability that the word at that index is a nearby (context) word for the input word.
    Intuitively, words that occur near the input word many times in the corpus will have a larger probability than others.

6. The hidden layer is associated with a matrix of size (10,000 x 300), one 300-dimensional vector for each word in the vocabulary. 
    After backpropagation, the values of this matrix are our word vectors. The prediction task is therefore only a proxy ("fake") task: the output layer is discarded after training.
    We just wanted to learn representations of the words.
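The pair generation described in steps 2 to 4 can be sketched in a few lines of plain Python. The `skipgram_pairs` helper and the example sentence are hypothetical names chosen for this illustration. Near the sentence boundaries the window is simply truncated.

```python
def skipgram_pairs(tokens, window_size):
    """Return (input_word, context_word) pairs for every position in tokens."""
    pairs = []
    for t, center in enumerate(tokens):
        # Clip the window so it never runs past either end of the sentence.
        start = max(0, t - window_size)
        end = min(len(tokens), t + window_size + 1)
        for j in range(start, end):
            if j != t:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window_size=2)

print(len(pairs))
print(pairs[:3])
```

Each pair becomes one training example: the first element is fed in as a one-hot vector and the second is the word the softmax output should assign high probability to.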

## Subsampling
Some words occur very frequently but contribute little to the meaning of a sentence. Words like "the", "and", and "a" appear in many context windows and therefore in many word pairs. Subsampling addresses this by randomly dropping occurrences of high-frequency words from the corpus: each word is assigned a probability that determines whether an occurrence is kept or dropped.
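A minimal sketch of subsampling follows. The notes above do not give a formula, so this assumes the one from the original word2vec paper, where a word with relative frequency f is discarded with probability 1 - sqrt(t / f) for a threshold t. The helper names, the toy corpus, and the large threshold of 0.05 are illustrative assumptions (the paper suggests a threshold around 1e-5 for real corpora).

```python
import math
import random
from collections import Counter

def keep_probabilities(tokens, threshold):
    """Probability of keeping each word: sqrt(threshold / frequency), capped at 1."""
    counts = Counter(tokens)
    total = len(tokens)
    probs = {}
    for word, count in counts.items():
        freq = count / total
        probs[word] = min(1.0, math.sqrt(threshold / freq))
    return probs

def subsample(tokens, threshold, seed=0):
    """Randomly drop occurrences of frequent words according to keep_probabilities."""
    rng = random.Random(seed)
    probs = keep_probabilities(tokens, threshold)
    return [w for w in tokens if rng.random() < probs[w]]

# Toy corpus: "the" dominates, "glove" is rare.
tokens = ["the"] * 80 + ["embedding"] * 15 + ["glove"] * 5
probs = keep_probabilities(tokens, threshold=0.05)
kept = subsample(tokens, threshold=0.05)

print(probs)
print(len(tokens), "->", len(kept))
```

Rare words whose frequency is at or below the threshold are always kept, while very frequent words lose most of their occurrences.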

## Negative Sampling


## CBOW

The continuous bag-of-words (CBOW) model predicts the current (target) word from its context words.
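To contrast with skip-gram, here is a small sketch of how CBOW training examples could be built: the context words are grouped together as the input and the word in the middle is the prediction target. The `cbow_examples` helper is a hypothetical name used only for this illustration.

```python
def cbow_examples(tokens, window_size):
    """Return (context_words, target_word) examples for every position in tokens."""
    examples = []
    for t, target in enumerate(tokens):
        start = max(0, t - window_size)
        end = min(len(tokens), t + window_size + 1)
        context = [tokens[j] for j in range(start, end) if j != t]
        if context:  # skip one-word sentences that have no context
            examples.append((context, target))
    return examples

sentence = "the quick brown fox jumps".split()
examples = cbow_examples(sentence, window_size=2)

for context, target in examples:
    print(context, "->", target)
```

Skip-gram would turn the same window into several (word, context word) pairs, whereas CBOW produces one example per position with all context words combined.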

# GloVe (Global Vectors) (statistical approach)

1. GloVe is a count-based model that works on a word-word co-occurrence matrix (also called a count matrix) generated from the corpus. It does not use a neural network.
2. GloVe considers global context words when generating the vector representation of a given word, by building the word-word co-occurrence matrix.
3. The co-occurrence matrix contains the words of the vocabulary in both its rows and its columns. It has size V x V, where V is the number of words in the vocabulary.
4. The co-occurrence matrix is symmetric (when the context window is symmetric).
5. Context means the neighbouring words.
6. GloVe learns by constructing a co-occurrence matrix (words x context) that counts how frequently a word appears in a context. Since this matrix is gigantic, it is factorized to obtain a lower-dimensional representation.
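The co-occurrence matrix from the points above can be sketched with plain counts over a symmetric window. This covers only the counting step, not the factorization GloVe performs afterwards. The `cooccurrence_matrix` helper and the three-sentence corpus are assumptions made for illustration.

```python
import numpy as np

def cooccurrence_matrix(sentences, window_size):
    """Count how often each pair of words appears within window_size of each other."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for sent in sentences:
        for t, word in enumerate(sent):
            start = max(0, t - window_size)
            end = min(len(sent), t + window_size + 1)
            for j in range(start, end):
                if j != t:
                    counts[index[word], index[sent[j]]] += 1
    return vocab, counts

corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]
vocab, X = cooccurrence_matrix(corpus, window_size=1)

print(vocab)
print(X)
```

Because the window looks the same distance left and right, every count for (a, b) is mirrored at (b, a), which is why the resulting V x V matrix is symmetric.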


## Loading GloVe

In [None]:
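# Load the pretrained GloVe vectors and build an embedding matrix for the tokenizer vocabulary
import numpy as np

def load_glove_vectors(tokenizer, max_features):
    embeddings_index = dict()
    embed_size = 50  # must match the dimension of the GloVe file below

    # Each line holds a word followed by its vector components
    with open("glove.6B.50d.txt", encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    word_index = tokenizer.word_index
    # +1 assumes the tokenizer indices start at 1 (as with the Keras Tokenizer), so row 0 stays unused
    nb_words = min(max_features, len(word_index) + 1)
    embedding_matrix = np.zeros((nb_words, embed_size))
    for word, i in word_index.items():
        if i >= nb_words:
            continue
        # Words missing from GloVe keep their all-zero row
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    print("Loaded %s word vectors" % len(embeddings_index))

    return embedding_matrix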

## Loading Word2Vec

In [None]:
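import numpy as np
from gensim.models import KeyedVectors

filepath = "./GoogleNews-vectors-negative300.bin"

def load_word2vec_embeddings(filepath, tokenizer, max_features, embedding_size):
    # Load the pretrained vectors stored in the binary word2vec format
    model = KeyedVectors.load_word2vec_format(filepath, binary=True)

    # KeyedVectors exposes the raw embedding array as `vectors`
    emb_mean, emb_std = model.vectors.mean(), model.vectors.std()

    word_index = tokenizer.word_index
    # +1 assumes the tokenizer indices start at 1 (as with the Keras Tokenizer), so row 0 stays unused
    nb_words = min(max_features, len(word_index) + 1)
    # Random initialization so that words missing from the model get values on a similar scale
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embedding_size))
    for word, i in word_index.items():
        if i >= nb_words:
            continue
        try:
            embedding_matrix[i] = model[word]
        except KeyError:
            # Word not in the pretrained vocabulary: keep the random row
            continue
    return embedding_matrix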

References:

1. https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/
2. https://www.mygreatlearning.com/blog/word-embedding/
3. https://machinelearninginterview.com/topics/natural-language-processing/what-is-the-difference-between-word2vec-and-glove/
4. https://deeplearning.lipingyang.org/wp-content/uploads/2017/12/How-is-GloVe-different-from-word2vec_-Quora.pdf

