# One Word Learning Architecture

- From [v1] Lecture 44
  - Finding the bi-gram
    - Given a word, we want the network to learn the next word
  - Simple architecture to understand the Word Embedding concept

## Source Preparation for Training

- Let's take a window size of $n=5$ and assume the given corpus has already been preprocessed (removing letters, symbols, numbers, ...)
- Assume start $<S>$ and end $</S>$ symbols are present in the corpus
- Starting from the first word (assuming two start symbols are added), build the bigrams
- Repetitions in the bigram model do not matter
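
The pairing step above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function name `build_bigrams` and the `num_start` parameter are hypothetical, and repeated pairs are simply kept as separate training examples.

```python
def build_bigrams(tokens, num_start=1, start="<S>", end="</S>"):
    """Return (current_word, next_word) training pairs for one sentence."""
    # Assumption: the sentence is padded with num_start start symbols and one end symbol.
    padded = [start] * num_start + list(tokens) + [end]
    # Pair every token with its successor.
    return list(zip(padded, padded[1:]))


pairs = build_bigrams(["the", "cat", "sat"])
print(pairs)
# [('<S>', 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', '</S>')]
```

Passing `num_start=2` pads with two start symbols, as assumed in the notes.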

![Source_Preparation_For_Training](images/Source_Preparation_For_Training.jpg)

## One Word Learning

- Let $V$ be the size of the input (input vector size)
  - $x = (x_1, x_2, \ldots, x_V)$
  - Example: If we have 1 million words in the vocabulary, then $V = \text{1 million}$, i.e., the number of elements in the input layer is 1 million
  - Input will be __*BoW - Bag of Words*__
- The hidden layer size will be smaller than the input layer size
  - Usually it goes up to $300$ or $500$ units
- The size of the output layer will be the same as that of the input layer
  - So if we have 1 million elements in the input layer, the size of the output layer will also be 1 million
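
To get a feel for these sizes, the number of weights can be counted directly. The sketch below assumes both layers are fully connected and ignores bias terms; `parameter_count` is a hypothetical helper name.

```python
def parameter_count(vocab_size, hidden_size):
    """Count the weights in a V -> N -> V network (biases ignored)."""
    input_to_hidden = vocab_size * hidden_size    # V x N matrix
    hidden_to_output = hidden_size * vocab_size   # N x V matrix
    return input_to_hidden + hidden_to_output


print(parameter_count(1_000_000, 300))  # 600000000
```

With a 1 million word vocabulary and 300 hidden units, that is 600 million weights, which is why the hidden layer is kept small relative to $V$.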

![One_Word_Learning](images/One_Word_Learning.jpg)

## Input Layer

- Input will be __*BoW - Bag of Words*__ and the representation will be __*OHV - One Hot Vector*__
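
As a small illustration of the one-hot representation, the sketch below encodes a word against a toy vocabulary. The vocabulary and the helper name `one_hot` are assumptions made for the example.

```python
def one_hot(word, vocab):
    """Return a list of len(vocab) with a single 1 at the index of `word`."""
    index = vocab.index(word)  # raises ValueError for out-of-vocabulary words
    return [1 if i == index else 0 for i in range(len(vocab))]


vocab = ["<S>", "the", "cat", "sat", "</S>"]
print(one_hot("cat", vocab))  # [0, 0, 1, 0, 0]
```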

![One_Word_Learning_Input_Layer](images/One_Word_Learning_Input_Layer.jpg)

## Hidden Layer

- It is a fully connected layer
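
A rough sketch of what "fully connected" means here, assuming a linear hidden layer with no bias and no activation (a common choice for word-embedding models, but an assumption about this lecture). `hidden_layer` and the weight values are hypothetical.

```python
def hidden_layer(x, W):
    """Compute h_j = sum_i x_i * W[i][j] for a V-element input and a V x N matrix."""
    V, N = len(W), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(V)) for j in range(N)]


W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]   # V = 3 words, N = 2 hidden units
x = [0, 1, 0]      # one-hot vector for the word at index 1
h = hidden_layer(x, W)
print(h)           # [0.3, 0.4]
```

Because the input is one-hot, the hidden vector is simply the row of `W` for the input word, and that row is what serves as the word's embedding.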

![One_Word_Learning_Hidden_Layer](images/One_Word_Learning_Hidden_Layer.jpg)

## Output Layer

![One_Word_Learning_Output_Layer](images/One_Word_Learning_Output_Layer.jpg)

## Update Weights - Hidden Output Layers

![Update_Weights__Hidden_Output_Layers](images/Update_Weights__Hidden_Output_Layers.jpg)

### Cross Entropy Loss Function

- Why Cross Entropy?
  - $\log p(x)$ is well scaled
  - Selection of step size is easier
  - With $p(x)$, repeated multiplication may yield values near zero, causing _underflow_
  - For better optimization, $\log p(x)$ is used (multiplication $\rightarrow$ addition)
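
The underflow point can be seen with a quick experiment. The probabilities below are made-up values chosen only to show the effect.

```python
import math

probs = [1e-5] * 100   # 100 hypothetical per-word probabilities

# Multiplying them directly: the true value 1e-500 is too small for a float.
product = 1.0
for p in probs:
    product *= p

# Adding their logs instead stays comfortably within range.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # roughly -1151.29
```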

- What is Cross Entropy? From <https://datascience.stackexchange.com/questions/20296/cross-entropy-loss-explanation>
  - The cross-entropy formula takes two distributions, $p(x)$, the true distribution, and $q(x)$, the estimated distribution, defined over the discrete variable $x$, and is given by
    - $\Large H(p,q) = -\sum_{x} p(x) \log q(x)$

- From <https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html>

```python
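from math import log

def CrossEntropy(yHat, y):
    # Binary cross-entropy for one predicted probability yHat in (0, 1)
    # and one true label y in {0, 1}.
    if y == 1:
        return -log(yHat)
    else:
        return -log(1 - yHat)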
```
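
The function above covers the two-class case. For the word-prediction setting, where the true distribution is a one-hot vector over the vocabulary, the general formula $H(p,q)$ can be sketched as follows. The name `cross_entropy` and the sample values are assumptions for illustration.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x); terms with p(x) = 0 are skipped."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)


p = [0, 1, 0]          # true distribution: one-hot on the actual next word
q = [0.2, 0.7, 0.1]    # estimated distribution, e.g. a softmax output
loss = cross_entropy(p, q)
print(loss)            # equals -log(0.7), about 0.357
```

With a one-hot $p$, the sum collapses to $-\log q$ of the correct word, so the loss only depends on the probability assigned to the true next word.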

![Cross_Entropy_Log_Loss_Function_Graph](images/Cross_Entropy_Log_Loss_Function_Graph.jpg)

## Update Weights (HO) - Minimization of E

![Update_Weights_(HO)_Minimization_Of_E](images/Update_Weights_(HO)_Minimization_Of_E.jpg)

## Update Input to Hidden Weights

![Update_Input_to_Hidden_Weights](images/Update_Input_to_Hidden_Weights.jpg)

## Some Insights on Output-Hidden-Input Layer Weight Updates

![Some_Insights_on_Output_Hidden_Input_Layer_Weight_Updates](images/Some_Insights_on_Output_Hidden_Input_Layer_Weight_Updates.jpg)

## Matrix Operations

### How Softmax Is Calculated

- See [v1] Lecture 42
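
For reference, a minimal softmax sketch in plain Python. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and is an addition here, not something taken from the lecture.

```python
import math


def softmax(scores):
    """Map raw output scores to probabilities that sum to 1."""
    m = max(scores)                             # shift scores so the largest is 0
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs)   # approximately [0.090, 0.245, 0.665]
```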

![Softmax_Calculation](images/Softmax_Calculation.jpg)

### How Differences Are Calculated

- See [v1] Lecture 43
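
Assuming "differences" refers to the prediction error at the output layer, i.e. the softmax output minus the one-hot target (which is the gradient of the cross-entropy loss with respect to the output scores), it can be sketched as below. `output_error` is a hypothetical name and the numbers are illustrative.

```python
def output_error(probs, target_index):
    """Return e_k = y_k - t_k, where t is one-hot at target_index."""
    return [p - (1.0 if k == target_index else 0.0) for k, p in enumerate(probs)]


errors = output_error([0.2, 0.7, 0.1], 1)
print(errors)
```

The error is negative for the correct word (its probability should go up) and positive for every other word (their probabilities should go down).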

![How_Differences_Calculated](images/How_Differences_Calculated.jpg)