# Skip-Gram Model

- From [v1] Lecture 44

## Model Architecture

![Skip_Gram_Model_2](images/Skip_Gram_Model_2.jpg)

## Neural Network Architecture - A Sample

![Skip_Gram_Neural_Network_Architecture__A_Sample](images/Skip_Gram_Neural_Network_Architecture__A_Sample.jpg)

## Python Implementation for Word Embedding

- The lecturer gave only pieces of the code; they need to be integrated and completed before the program can be run to see how it works
- The corpus used contains about 100 documents
  - This is enough to identify all the word vectors in the right context

### Initialization

```python
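import numpy as np
from nltk.corpus import PlaintextCorpusReader

# The functions below are methods of the model class (class definition not shown).
def setup_corpus(self, corpus_dir='/home/ramaseshan/Dropbox/NLPClass/2019/SmallCorpus/'):
    # Read every file in the directory as plain text
    self.corpus = PlaintextCorpusReader(corpus_dir, '.*')

def init_model_parameters(self, context_window_size=5, word_embedding_size=70, epochs=400, eta=0.01):
    self.context_window_size = context_window_size
    self.word_embedding_size = word_embedding_size
    self.epochs = epochs
    self.eta = eta  # learning rate

def initialize_weights(self):
    # W: vocabulary_size x word_embedding_size (input-to-hidden weights)
    self.embedding_weights = np.random.uniform(-0.9, 0.9, (self.vocabulary_size, self.word_embedding_size))
    # W': word_embedding_size x vocabulary_size (hidden-to-output weights)
    self.context_weights = np.random.uniform(-0.9, 0.9, (self.word_embedding_size, self.vocabulary_size))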
```

### Forward Pass

- Forward Pass
  - $\displaystyle \large H = W^TX$
    - $H$ is the hidden layer
  - $\displaystyle \large U = W^{'T}H = W^{'T} \cdot W^TX$
    - $U$ is the output layer

```python
def forward_pass(self, X):
    H = np.dot(self.embedding_weights.T, X)
    U = np.dot(self.context_weights.T, H)
    y_hat = self.softmax(U)
    return y_hat, H, U
```
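
The `forward_pass` method calls `self.softmax`, which these notes do not show. The following is a minimal sketch of a numerically stable softmax, assuming `U` is a 1-D score vector. The function body, the toy sizes (4 words, 3 dimensions), and the variable names are assumptions for illustration, not the lecturer's code.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over a 1-D score vector (assumed shape: (V,))."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    shifted = u - np.max(u)
    exp_u = np.exp(shifted)
    return exp_u / np.sum(exp_u)

# Toy check with a hypothetical vocabulary of 4 words and embedding size 3.
rng = np.random.default_rng(0)
W = rng.uniform(-0.9, 0.9, (4, 3))        # embedding weights, V x N
W_prime = rng.uniform(-0.9, 0.9, (3, 4))  # context weights, N x V
x = np.array([0.0, 1.0, 0.0, 0.0])        # one-hot vector for word index 1

h = W.T @ x              # hidden layer: selects row 1 of W
u = W_prime.T @ h        # one score per vocabulary word
y_hat = softmax(u)       # probabilities over the vocabulary
```

Because `x` is one-hot, the product `W.T @ x` simply copies one row of `W`, which is why that row can be read as the word's embedding.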

### Back Propagation

- Back Propagation
  - $\displaystyle \large w_{ij}^{'(new)} = w_{ij}^{'(old)} - \eta \, e_j \cdot h_i$ or
  - $\displaystyle \large \mathbf{v}_{w_j}^{'(new)} = \mathbf{v}_{w_j}^{'(old)} - \eta \, e_j \cdot \mathbf{h} \quad \text{for } j = 1, 2, 3, \ldots, V$
  - $\displaystyle \large \frac{\partial E}{\partial w_{ki}} = \frac{\partial E}{\partial h_i} \cdot \frac{\partial h_i}{\partial w_{ki}} = EH_i \cdot x_k$
    - $e_j = y_j - t_j$ is the prediction error for output word $j$, and $EH_i = \sum_{j=1}^{V} e_j \cdot w_{ij}^{'}$

```python
def back_propagation(self, X, H, E):
    # Gradient for the context weights W': dE/dw'_ij = e_j * h_i
    delta_context_weights = np.outer(H, E)
    # Gradient for the embedding weights W: dE/dw_ki = EH_i * x_k, with EH = W' . E
    delta_embedding_weights = np.outer(X, np.dot(self.context_weights, E.T))

    # Change the weights using the back propagation values
    self.context_weights = self.context_weights - (self.eta * delta_context_weights)
    self.embedding_weights = self.embedding_weights - (self.eta * delta_embedding_weights)
```
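
One way to confirm that the two `np.outer` expressions match the update rules is a finite-difference gradient check on a toy network. The sketch below assumes a single context word, a hypothetical vocabulary of 5 words, and an embedding size of 3. The helper names (`softmax`, `loss`) and all values are illustrative assumptions.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / e.sum()

def loss(W, W_prime, x, target_idx):
    # E = -u_target + log(sum_j exp(u_j)) for a single context word
    u = W_prime.T @ (W.T @ x)
    return -u[target_idx] + np.log(np.sum(np.exp(u)))

rng = np.random.default_rng(1)
V, N = 5, 3                               # hypothetical toy sizes
W = rng.uniform(-0.9, 0.9, (V, N))
W_prime = rng.uniform(-0.9, 0.9, (N, V))
x = np.eye(V)[2]                          # one-hot input word (index 2)
target_idx = 4                            # index of the context word

h = W.T @ x
y_hat = softmax(W_prime.T @ h)
e = y_hat - np.eye(V)[target_idx]         # prediction error e_j

grad_W_prime = np.outer(h, e)             # dE/dw'_ij = e_j * h_i
grad_W = np.outer(x, W_prime @ e)         # dE/dw_ki = EH_i * x_k

# Finite-difference estimate for one entry of each matrix
eps = 1e-6
base = loss(W, W_prime, x, target_idx)

W_prime_shift = W_prime.copy()
W_prime_shift[1, 3] += eps
numeric_prime = (loss(W, W_prime_shift, x, target_idx) - base) / eps

W_shift = W.copy()
W_shift[2, 0] += eps
numeric_W = (loss(W_shift, W_prime, x, target_idx) - base) / eps
```

If the analytic and numeric values disagree by more than a small tolerance, the backward pass is the first place to look before tuning the learning rate. Only row 2 of `grad_W` is non-zero here, because the one-hot input touches a single row of the embedding matrix.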

### Training

- Training
  - $\displaystyle \large E = -{\mathbf{v}_{w_O}^{'}}^T \cdot \mathbf{h} + \log \sum_{j^{'}=1}^{V} \exp({\mathbf{v}_{w_{j^{'}}}^{'}}^T \cdot \mathbf{h})$
- Training Guidelines
  - The error-versus-epoch curve should be smooth; otherwise something is wrong with
    - the learning parameters
  - Start with a small number of epochs to check the curve
  - Don't start with a huge vocabulary such as 1 million words; start with, say, 10 words and 5 epochs
  - Make sure the program is correct
  - Once the error is coming down steadily, move to a bigger corpus and a bigger vocabulary, and increase the number of epochs
    - While doing this, keep adjusting the learning parameter to find the right value

```python
def train(self):
    self.error = np.zeros(self.epochs)  # one accumulated error value per epoch
    for i in range(0, self.epochs):
        # Iterate over the samples directly; wrapping them in np.array can fail on
        # recent NumPy versions when samples have different numbers of context words
        for target_word, context_words in self.training_samples:
            y_hat, H, U = self.forward_pass(target_word)

            # Sum the prediction errors over all the context words
            EI = np.sum([np.subtract(y_hat, word) for word in context_words], axis=0)

            # Do back propagation to adjust weights
            self.back_propagation(target_word, H, EI)

            # Accumulate the error over all samples in this epoch
            self.error[i] += -np.sum([U[np.argmax(word)]
                                      for word in context_words]) + \
                len(context_words) * np.log(np.sum(np.exp(U)))
```
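
Following the guideline of starting small, the pieces can be tried end to end on a tiny vocabulary before moving to the real corpus. The sketch below is an assumption-laden toy version, not the lecturer's program: the sentence, the sizes (`N`, `window`), the learning rate, and all variable names are hypothetical, and the loop is written as plain functions rather than class methods.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / e.sum()

# Hypothetical toy corpus; the lecture corpus is not reproduced here.
tokens = "deep learning models learn deep representations from data".split()
vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}
V, N, window, eta, epochs = len(vocab), 5, 2, 0.05, 50

# Build (target one-hot, list of context one-hots) pairs
eye = np.eye(V)
samples = []
for pos, word in enumerate(tokens):
    lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
    context = [eye[index[tokens[j]]] for j in range(lo, hi) if j != pos]
    samples.append((eye[index[word]], context))

rng = np.random.default_rng(42)
W = rng.uniform(-0.9, 0.9, (V, N))        # embedding weights
W_prime = rng.uniform(-0.9, 0.9, (N, V))  # context weights
errors = np.zeros(epochs)

for epoch in range(epochs):
    for x, context in samples:
        h = W.T @ x
        u = W_prime.T @ h
        y_hat = softmax(u)
        e = np.sum([y_hat - c for c in context], axis=0)

        # Accumulate the loss computed with the pre-update weights
        errors[epoch] += (-sum(u[np.argmax(c)] for c in context)
                          + len(context) * np.log(np.sum(np.exp(u))))

        # Compute both gradients before changing either matrix
        grad_W_prime = np.outer(h, e)
        grad_W = np.outer(x, W_prime @ e)
        W_prime -= eta * grad_W_prime
        W -= eta * grad_W

print(f"error after epoch 1: {errors[0]:.3f}, after epoch {epochs}: {errors[-1]:.3f}")
```

Plotting `errors` against the epoch number gives the error-versus-epoch curve described in the guidelines; a jagged or rising curve on a corpus this small usually points to a bug or a learning rate that is too large.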

### Word Vector for _deep_ and similar words

![Python_Impl_Word_Embedding_5](images/Python_Impl_Word_Embedding_5.jpg)
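
Similar words are typically found by comparing embedding rows with cosine similarity. The notes do not state which measure the lecture used, so the sketch below is an assumption. The function name `most_similar`, the four-word vocabulary, and the 2-D vectors are hypothetical and hand-picked so that the ranking is easy to verify by eye.

```python
import numpy as np

def most_similar(word, vocab, embeddings, top_n=3):
    """Rank the other vocabulary words by cosine similarity to `word`."""
    idx = vocab.index(word)
    norms = np.linalg.norm(embeddings, axis=1)
    sims = embeddings @ embeddings[idx] / (norms * norms[idx])
    order = np.argsort(-sims)  # highest similarity first
    return [(vocab[i], float(sims[i])) for i in order if i != idx][:top_n]

# Hypothetical 2-D embeddings; a trained model would supply the real rows of W.
vocab = ["deep", "neural", "network", "banana"]
embeddings = np.array([
    [1.0, 0.0],    # deep
    [0.9, 0.1],    # neural
    [0.5, 0.5],    # network
    [-1.0, 0.2],   # banana
])

ranked = most_similar("deep", vocab, embeddings)
print(ranked)
```

With a trained model, `embeddings` would be the learned `embedding_weights` matrix and `vocab` the word list in the same row order.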

## Source Preparation for Training

![Python_Impl_Source_Preparation_For_Training](images/Python_Impl_Source_Preparation_For_Training.jpg)

### Complexity in Number of Neurons vs Number of Words

- The input to Skip-Gram is a one-hot vector, and the output layer has the same number of neurons as the vocabulary has words
  - This huge size has an impact on every layer
    - Input to hidden layer: say, 1 million to 300
    - Hidden to output layer: 300 to 1 million
    - A softmax computed over 1 million outputs
- We need to reduce this complexity
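
To put numbers on the layer sizes above, the weight count can be worked out directly. This is a back-of-the-envelope sketch using the example figures from the notes; the 4-bytes-per-weight figure assumes 32-bit floats, which the notes do not specify.

```python
vocabulary_size = 1_000_000   # example figure from the notes
embedding_size = 300          # example figure from the notes

input_to_hidden = vocabulary_size * embedding_size    # weights in W
hidden_to_output = embedding_size * vocabulary_size   # weights in W'
total_weights = input_to_hidden + hidden_to_output

# Assumption: 32-bit floats, i.e. 4 bytes per weight.
memory_gb = total_weights * 4 / 1e9

print(f"weights: {total_weights:,}")              # weights: 600,000,000
print(f"memory at float32: {memory_gb:.1f} GB")   # memory at float32: 2.4 GB
```

On top of the storage cost, every training sample requires a softmax over all 1 million output scores, which is the part of the computation that most needs reducing.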