# Skip-Gram Model

- From [v1] Lecture 44

## Model Architecture

![Skip_Gram_Model_2](images/Skip_Gram_Model_2.jpg)

## Neural Network Architecture - A Sample

![Skip_Gram_Neural_Network_Architecture__A_Sample](images/Skip_Gram_Neural_Network_Architecture__A_Sample.jpg)

## Python Implementation for Word Embedding

- The lecturer gave only pieces of the code; we need to integrate them into a complete program and run it to see how it works
- The corpus used contains about 100 documents
  - This is enough to learn word vectors that reflect the right context

### Initialization

```python
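import numpy as np
from nltk.corpus import PlaintextCorpusReader

# The functions below are methods of the model class (class definition not shown)
def setup_corpus(self, corpus_dir='/home/ramaseshan/Dropbox/NLPClass/2019/SmallCorpus/'):
    # Read every file in the corpus directory as plain text
    self.corpus = PlaintextCorpusReader(corpus_dir, '.*')

def init_model_parameters(self, context_window_size=5, word_embedding_size=70, epochs=400, eta=0.01):
    self.context_window_size = context_window_size
    self.word_embedding_size = word_embedding_size
    self.epochs = epochs
    self.eta = eta  # learning rate

def initialize_weights(self):
    # Input-to-hidden weights W, shape (vocabulary_size, word_embedding_size)
    self.embedding_weights = np.random.uniform(-0.9, 0.9, (self.vocabulary_size, self.word_embedding_size))
    # Hidden-to-output weights W', shape (word_embedding_size, vocabulary_size)
    self.context_weights = np.random.uniform(-0.9, 0.9, (self.word_embedding_size, self.vocabulary_size))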
```

### Forward Pass

- Forward Pass
  - $\displaystyle \large H = W^TX$
    - $H$ is the hidden layer
  - $\displaystyle \large U = W^{'T}H = W^{'T} \cdot W^TX$
    - $U$ is the output layer

```python
def forward_pass(self, X):
    H = np.dot(self.embedding_weights.T, X)
    U = np.dot(self.context_weights.T, H)
    y_hat = self.softmax(U)
    return y_hat, H, U
```
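`forward_pass` calls `self.softmax`, which the lecture snippets do not show. A minimal sketch of what such a helper might look like is given below as a standalone function; the max-subtraction step is an assumption added for numerical stability, not something taken from the lecture code.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = u - np.max(u)        # subtracting the max avoids overflow in exp
    exp_u = np.exp(shifted)
    return exp_u / np.sum(exp_u)   # normalise so the outputs sum to 1

# Illustrative scores for a 3-word vocabulary
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())
```

Subtracting a constant from every score leaves the softmax output unchanged, so the result matches the plain $\exp(u_j) / \sum_{j'} \exp(u_{j'})$ definition.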

### Back Propagation

- Back Propagation
  - $\displaystyle \large w_{ij}^{'(new)} = w_{ij}^{'(old)} - \eta \cdot e_j \cdot h_i$ or
  - $\displaystyle \large \mathbf{v}_{w_j}^{'(new)} = \mathbf{v}_{w_j}^{'(old)} - \eta \cdot e_j \cdot \mathbf{h} \quad \text{for } j = 1, 2, 3, \ldots, V$
  - $\displaystyle \large \frac{\partial{E}}{\partial{w_{ki}}} = \frac{\partial{E}}{\partial{h_{i}}} \cdot \frac{\partial{h_i}}{\partial{w_{ki}}} = EH_i \cdot x_k$
    - where $EH_i = \sum_{j=1}^{V} e_j \cdot w_{ij}^{'}$, which the code computes as `np.dot(self.context_weights, E.T)`

```python
def back_propagation(self, X, H, E):
    # Gradient for the hidden-to-output weights W': outer product of H and the error E
    delta_context_weights = np.outer(H, E)
    # Gradient for the input-to-hidden weights W: X times the error propagated back to the hidden layer
    delta_embedding_weights = np.outer(X, np.dot(self.context_weights, E.T))

    # Change the weights using the back propagation values
    self.context_weights = self.context_weights - (self.eta * delta_context_weights)
    self.embedding_weights = self.embedding_weights - (self.eta * delta_embedding_weights)
```

### Training

- Training
  - $\displaystyle \large E = -{\mathbf{v}_{w_O}^{'}}^T \cdot \mathbf{h} + \log \sum_{j^{'}=1}^{V} \exp({\mathbf{v}_{w_{j^{'}}}^{'}}^T \cdot \mathbf{h})$
- Training Guidelines
  - The error-versus-epoch curve should be smooth; otherwise something is wrong in the
    - Learning parameters
  - Start with a small number of epochs to check the curve
  - Don't start with a huge vocabulary such as 1 million words; start with, say, 10 words and 5 epochs
  - Make sure the program is correct
  - Make sure the error is slowly coming down; then move to a bigger corpus and a bigger vocabulary, and increase the number of epochs
    - While doing this, keep adjusting the learning parameter ($\eta$) until you find a suitable value
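The loss under Training is written for a single output word. As a hedged extension, assume a sample has $C$ context words with output indices $j_1^*, \ldots, j_C^*$ and scores $u_j = {\mathbf{v}_{w_j}^{'}}^T \cdot \mathbf{h}$ (the symbols $C$, $j_c^*$ and $u_j$ are introduced here for illustration). Summing the single-word loss over the $C$ context words, which all share the same hidden vector $\mathbf{h}$, gives:

$$
E = \sum_{c=1}^{C} \left( -u_{j_c^*} + \log \sum_{j^{'}=1}^{V} \exp(u_{j^{'}}) \right) = -\sum_{c=1}^{C} u_{j_c^*} + C \cdot \log \sum_{j^{'}=1}^{V} \exp(u_{j^{'}})
$$

This appears to be the quantity that the error line in `train` evaluates: the sum of `U` at the context-word indices, negated, plus `len(context_words)` times the log of the summed exponentials.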

```python
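def train(self):
    for i in range(self.epochs):
        # Accumulate the error over all samples of this epoch
        # (assumes self.error is a dict or a pre-allocated array indexed by epoch)
        self.error[i] = 0.0
        # Each sample is (one-hot target word, list of one-hot context words)
        for target_word, context_words in self.training_samples:
            y_hat, H, U = self.forward_pass(target_word)

            # Sum the prediction errors over all the context words
            EI = np.sum([np.subtract(y_hat, word) for word in context_words], axis=0)

            # Do back propagation to adjust the weights
            self.back_propagation(target_word, H, EI)

            # Add this sample's error (one-hot context words are assumed to be Python lists)
            self.error[i] += -np.sum([U[word.index(1)] for word in context_words]) \
                + len(context_words) * np.log(np.sum(np.exp(U)))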
```

### Word Vector for _deep_ and similar words

![Python_Impl_Word_Embedding_5](images/Python_Impl_Word_Embedding_5.jpg)

## Source Preparation for Training

![Python_Impl_Source_Preparation_For_Training](images/Python_Impl_Source_Preparation_For_Training.jpg)
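The `train` method iterates over `self.training_samples` as pairs of a one-hot target word and a list of one-hot context words. The sketch below shows one way such samples might be generated from a token list; the function names, the window size of 2, and the example sentence are assumptions for illustration, not the lecturer's code.

```python
def one_hot(index, size):
    """Return a one-hot Python list with a 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec

def build_training_samples(tokens, window=2):
    """Pair every token with the tokens at most `window` positions away."""
    vocab = sorted(set(tokens))
    word_to_index = {word: i for i, word in enumerate(vocab)}
    samples = []
    for pos, word in enumerate(tokens):
        target = one_hot(word_to_index[word], len(vocab))
        start = max(0, pos - window)
        stop = min(len(tokens), pos + window + 1)
        # Context = neighbours inside the window, excluding the target itself
        context = [one_hot(word_to_index[tokens[p]], len(vocab))
                   for p in range(start, stop) if p != pos]
        samples.append((target, context))
    return samples, word_to_index

# Hypothetical example sentence
tokens = "wish you many more happy returns of the day".split()
samples, word_to_index = build_training_samples(tokens, window=2)
print(len(samples), len(samples[0][1]), len(samples[4][1]))
```

Plain lists are used for the one-hot vectors so that `word.index(1)` in the error computation works unchanged.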

### Complexity in Number of Neurons vs Number of Words

- The input to Skip-Gram is a one-hot vector of vocabulary size, and the output layer has the same number of neurons
  - This huge size has an impact on every layer
    - Input to hidden layer: say 1 million to 300
    - Hidden to output layer: 300 to 1 million
    - A softmax calculation over 1 million outputs
- We need to reduce this complexity
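To make the scale concrete, the short calculation below counts the weights for the sizes quoted above (a vocabulary of 1 million and a hidden layer of 300); the variable names are illustrative.

```python
vocabulary_size = 1_000_000   # one-hot input size and output layer size
hidden_size = 300             # word embedding size

input_to_hidden = vocabulary_size * hidden_size    # weights in W
hidden_to_output = hidden_size * vocabulary_size   # weights in W'
total_weights = input_to_hidden + hidden_to_output

print(f"{total_weights:,} weights")  # 600,000,000 weights
```

Every one of these weights is touched by a full update, on top of a softmax normalisation over all 1 million outputs.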

## Reducing Complexity (Optimization)

- We need to start reducing the complexity from the training samples
  - Some bi-grams won't add any value in learning context
    - E.g., $\texttt{(of, the)}$, $\texttt{(returns, the)}$ won't contribute much in finding the relationship between the words $\texttt{happy}$ and $\texttt{returns}$
  - The same holds for some tri-grams
  - Normally bi-grams and tri-grams are not used in training; instead we go with a __*5-word window*__

### Sub-Sampling

- The following kinds of words are removed from the training samples, and the resulting sub-samples are used for training
  - Word pairs that do not give much information
  - Word pairs that appear in switched order
  - Less frequent words

- The words $\texttt{of}$ and $\texttt{the}$ in the pairs $\texttt{(of, happy)}$, $\texttt{(returns, the)}$ do __*not give much information*__ about the words happy and returns, respectively. Similarly, some pairs __*reappear*__ with the __*order of the words switched*__
  - The pair $\texttt{(of, the)}$ does not add any valuable information
  - The pairs $\texttt{(wish, you)}$, $\texttt{(you, wish)}$ can be reduced to one pair
- Some words could also be __*randomly removed from the training samples based on their frequencies*__
  - Suppose a word appears only once in the corpus; it won't give any information to the model
    - This word can be removed from the training samples
- __*Infrequent words appearing as context words*__ _could be discarded as they_ __*may not provide contextual information to the central word*__
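The three pruning rules above can be sketched as a single filter over (target, context) pairs. This is only an illustration: the function name, the hand-picked set of uninformative words, and the `min_count` threshold are assumptions, and frequencies are counted over the pairs rather than over the corpus to keep the example short.

```python
from collections import Counter

def prune_pairs(pairs, uninformative, min_count=2):
    """Drop uninformative, rare, and switched-order duplicate word pairs."""
    counts = Counter(word for pair in pairs for word in pair)
    seen = set()
    kept = []
    for target, context in pairs:
        # Rule 1: pairs containing words such as "of" or "the" carry little information
        if target in uninformative or context in uninformative:
            continue
        # Rule 3: words seen too rarely give the model little to learn from
        if counts[target] < min_count or counts[context] < min_count:
            continue
        # Rule 2: (wish, you) and (you, wish) count as the same pair
        key = frozenset((target, context))
        if key in seen:
            continue
        seen.add(key)
        kept.append((target, context))
    return kept

pairs = [("wish", "you"), ("you", "wish"), ("of", "happy"), ("returns", "the"),
         ("happy", "returns"), ("returns", "happy"), ("rare", "wish")]
kept = prune_pairs(pairs, uninformative={"of", "the"}, min_count=2)
print(kept)  # [('wish', 'you'), ('happy', 'returns')]
```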

#### Sub-Sampling in Google's Word2Vec.c

- Here is the code for sub-sampling used by [word2vec.c](https://github.com/tmikolov/word2vec/blob/master/word2vec.c) that randomly removes a word from the sample

```c
        if (word == 0) break;
        // The subsampling randomly discards frequent words while keeping the ranking same
        if (sample > 0) {
          real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn;
          next_random = next_random * (unsigned long long)25214903917 + 11;
          if (ran < (next_random & 0xFFFF) / (real)65536) continue;
        }
```

- Let $\displaystyle \large f(w) = \frac{vocab[word].cn}{sample \times train \_ words}$ and $\displaystyle \large ran = (\sqrt{f(w)} + 1) \times \frac{1}{f(w)}$
  - where
    - ${vocab[word].cn}$ is the count of the word
    - ${train \_ words}$ is the total number of training words
    - ${sample}$ is the sub-sampling threshold
  - $ran$ acts as the probability of keeping the word; it is compared against a generated random value $random$ in $[0, 1)$
- If $ran \lt random$, discard the word (the `continue` in the code skips it); else keep it
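A Python sketch of the same rule is given below. It mirrors the `word2vec.c` snippet above, where a word is skipped when `ran` is smaller than the uniform draw; the function names and the use of Python's `random` module in place of the C linear congruential generator are assumptions for illustration.

```python
import math
import random

def keep_probability(word_count, total_words, sample=1e-3):
    """Value of `ran` in word2vec.c; values >= 1 mean the word is always kept."""
    z = word_count / (sample * total_words)
    return (math.sqrt(z) + 1) / z

def subsample(tokens, counts, sample=1e-3, rng=None):
    """Randomly discard frequent words; keep a token when ran >= uniform draw."""
    rng = rng or random.Random(0)
    total_words = sum(counts.values())
    return [t for t in tokens
            if keep_probability(counts[t], total_words, sample) >= rng.random()]

# A word making up 10% of a 1M-word corpus is kept about 11% of the time,
# while a word making up 0.1% of the corpus is always kept (ran = 2).
frequent = keep_probability(100_000, 1_000_000)
moderate = keep_probability(1_000, 1_000_000)
print(frequent, moderate)
```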

### Negative Sampling

- __*Negative-Sampling:*__
  - Another mechanism to minimize the computation
  - The size of the network is proportional to the size of the vocabulary $\mathbf{V}$. For every training cycle of input, every weight in the network needs to be updated
  - For every training cycle, the Softmax function computes the sum over all the output neuron values
  - The cost of updating all the weights in the fully connected network is very high
  - Is it possible to change only a small percentage of the weights?
- __*Intuition behind negative sampling*__
  - In the Skip-Gram model, the learning happens between the context words and the input word
    - The error that occurs has no impact on the other words in the training set
    - Other than the input and context words, the rest of the words don't contribute anything to the learning in that iteration
  - Can we curtail the weight updates by choosing only a certain number of weights to update?
    - Why update all the weights in every iteration when most of them make no impact on the learning?
      - Computation of weights in each iteration is costly
    - Which are the weights that we need not bother about?
- __*How Negative Sampling is performed*__
  - Select a small number of _negative_ words
    - Negative words are words that are not the context words
    - __*These words are chosen by some mechanism*__
  - While updating the weights, the target output for these samples is zero, while the positive sample(s) retain their value
  - During backpropagation, only the weights related to the negative and positive words are changed; the rest remain untouched for the current update
  - This drastically reduces the computation
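The update described above can be sketched as follows. This is a minimal illustration, assuming a sigmoid score per selected word in place of the full softmax and a context weight matrix of shape (embedding size, vocabulary size) as in `initialize_weights`; the function names and the toy sizes are hypothetical, and the negative indices are assumed to be distinct.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(h, context_weights, positive_index, negative_indices, eta=0.01):
    """Update only the output columns of one positive and a few negative words."""
    indices = np.array([positive_index] + list(negative_indices))
    labels = np.zeros(len(indices))
    labels[0] = 1.0                                # positive word -> 1, negatives -> 0
    selected = context_weights[:, indices]         # shape (N, K + 1)
    errors = sigmoid(h @ selected) - labels        # prediction error per selected word
    grad_h = selected @ errors                     # error propagated back to the hidden layer
    context_weights[:, indices] -= eta * np.outer(h, errors)  # in-place, other columns untouched
    return grad_h

rng = np.random.default_rng(0)
W_out = rng.uniform(-0.9, 0.9, (4, 10))            # toy sizes: N = 4, V = 10
before = W_out.copy()
h = rng.uniform(-0.9, 0.9, 4)
grad_h = negative_sampling_step(h, W_out, positive_index=2, negative_indices=[5, 7], eta=0.1)
changed_columns = np.where(np.any(W_out != before, axis=0))[0].tolist()
print(changed_columns)
```

Only 3 of the 10 output columns are modified here; with a vocabulary of 1 million and a handful of negative words, the saving per update is correspondingly larger.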

#### Selecting a Negative Sample

- Several mechanisms are available for selecting negative words
- Below is one such mechanism
  1. Pick the most frequently occurring words as negative words
    - Given the probability of a given word in the corpus $\displaystyle \large P(w_i) = \frac{f(w_i)}{\sum_{{j=0}}^{n} f(w_j)}$
    - Apply a smoothing mechanism (raising the frequency to the power $\frac{3}{4}$), which boosts the probability of less frequent words and reduces the probability of highly frequent words in the corpus
  2. Once we have the probability of the words, find the number of occurrences of each word in the entire corpus (say, using a frequency table)
    - Using the frequency table, find out how many times each word should be repeated in a table
      - E.g., create a unigram table of the size of the vocabulary
      - Pick up the most frequently appearing words and create another unigram table
      - Use the probability (above formula) to pick the words to use as negative sample words
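The selection mechanism can be sketched without materialising a unigram table by sampling directly from the smoothed distribution. The function names, the toy counts, and the rejection loop that skips the context words are assumptions for illustration; the loop presumes the vocabulary has at least `k` eligible words.

```python
import numpy as np

def negative_sampling_distribution(counts, power=0.75):
    """Raise word frequencies to the 3/4 power and normalise to probabilities."""
    words = list(counts)
    weights = np.array([counts[w] for w in words], dtype=float) ** power
    return words, weights / weights.sum()

def draw_negative_words(words, probabilities, exclude, k, rng):
    """Draw k distinct negative words that are not in `exclude`."""
    negatives = []
    while len(negatives) < k:
        candidate = words[rng.choice(len(words), p=probabilities)]
        if candidate not in exclude and candidate not in negatives:
            negatives.append(candidate)
    return negatives

counts = {"the": 50, "of": 30, "happy": 5, "returns": 3, "wish": 2}
words, probabilities = negative_sampling_distribution(counts)
rng = np.random.default_rng(0)
negatives = draw_negative_words(words, probabilities, exclude={"happy", "returns"}, k=2, rng=rng)
print(dict(zip(words, probabilities.round(3))), negatives)
```

Compared with the raw relative frequencies, the smoothed probability of "the" goes down and that of "wish" goes up, which is the boosting effect described in step 1.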

![Word2Vec_Selectin_A_Negative_Sample](images/Word2Vec_Selectin_A_Negative_Sample.jpg)

### Trouble with the size of the network

- All weights ($output \rightarrow hidden$) and ($hidden \rightarrow input$) are adjusted for each training sample so that the prediction cycle minimizes the loss function
- This amounts to updating all the weights in the neural network
  - That is several hundred million weights ($1M \times 300$ per weight matrix) for a network with $|V| = 1M$ input neurons and a hidden layer of size 300
- In addition, we should consider the several million training sample pairs

### Softmax

- As part of reducing the complexity of Word2Vec learning, optimization can be applied to Softmax as well

![Word2Vec_Softmax](images/Word2Vec_Softmax.jpg)