# Word Embedding using Word2Vec

- It was developed by researchers at Google (around 2013-2014)

## Goal

- Process each word in a vocabulary to obtain a numeric representation of that word
  - Instead of a _One Hot Vector_, represent each word as a fixed-size vector with, for example, 100, 200, or 300 elements
- Reflect _Semantic Similarities_, _Syntactic Similarities_, or both, between the words the vectors represent
- Map each of several words to its respective vector and output a single merged vector that is a combination of those vectors
  - Words that are similar can be merged and represented by one vector
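
The contrast between a one-hot vector and a dense fixed-size vector can be sketched as follows. This is a minimal illustration, not a trained model: the toy vocabulary, the random embedding table, and the helper names (`one_hot`, `embed`, `merge`, `cosine_similarity`) are all assumptions made for the example, and averaging is only one possible way to merge vectors.

```python
import numpy as np

# Toy vocabulary (hypothetical, for illustration only).
vocab = ["wish", "you", "a", "happy", "new", "year"]
word_to_id = {w: i for i, w in enumerate(vocab)}


def one_hot(word):
    """Return a vocabulary-sized vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec


# Randomly initialised embedding table: one dense row per word.
# A trained Word2Vec model would learn these values; here they are placeholders.
rng = np.random.default_rng(0)
embedding_dim = 100
embeddings = rng.normal(size=(len(vocab), embedding_dim))


def embed(word):
    """Look up the dense fixed-size vector of a word."""
    return embeddings[word_to_id[word]]


def cosine_similarity(a, b):
    """Similarity measure commonly used to compare word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def merge(words):
    """Combine several word vectors into one by averaging (an assumed choice)."""
    return np.mean([embed(w) for w in words], axis=0)


happy_one_hot = one_hot("happy")   # length = vocabulary size
happy_dense = embed("happy")       # length = embedding_dim
merged = merge(["happy", "new", "year"])
print(happy_one_hot.shape, happy_dense.shape, merged.shape)
```

Two different one-hot vectors always have a dot product of zero, so they cannot express similarity; dense vectors can, which is why cosine similarity is computed on them.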

## Context Words and Central Word

- In a Probabilistic Language Model
  - Conditional probability is used to identify/predict the next word
  - The word to be predicted is the last word of a sequence; the words before it are its context
  - So, when the context words are given, the next word is predicted
    - Example: $\text{"How are you"}$
      - $\text{"How are"}$ are the given context words in the case of the Language Model
      - $\text{"you"}$ is the next word that we want to predict
- In CBOW Model
  - Central Word is surrounded by context words
  - Given the context words, we want to identify/predict the _Central Word_
    - Example: $\text{"more happy returns of the day"}$, let's consider a window size of $5$
      - $\text{"more happy ___ of the"}$ are the given context words in the case of the CBOW Model
      - $\text{"returns"}$ is the central word that we want to identify/predict
- In Skip-Gram Model
  - Given the central word, identify the surrounding words
  - Example: $\text{"more happy returns of the day"}$, let's consider a window size of $5$
    - $\text{"returns"}$ is the given central word
    - $\text{"more happy of the"}$ are the surrounding context words that we need to predict for the given central word

- ![Context_Words_and_Central_Word](images/Context_Words_and_Central_Word.jpg)
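
The difference between the CBOW and Skip-Gram setups can be sketched by generating training examples from the sentence used above. This is an illustrative sketch: the function names are hypothetical, whitespace tokenisation is assumed, and the window size is assumed to be odd so the central word sits in the middle.

```python
def context_window(tokens, center, window_size):
    """Return the context words around tokens[center] for an odd window size."""
    half = (window_size - 1) // 2
    left = tokens[max(0, center - half):center]
    right = tokens[center + 1:center + 1 + half]
    return left + right


def cbow_examples(tokens, window_size):
    """(context words, central word) pairs: predict the centre from its context."""
    return [(context_window(tokens, i, window_size), tokens[i])
            for i in range(len(tokens))]


def skip_gram_examples(tokens, window_size):
    """(central word, context word) pairs: predict each context word from the centre."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window_size)]


tokens = "more happy returns of the day".split()
center = tokens.index("returns")

cbow_context = context_window(tokens, center, window_size=5)
print(cbow_context)  # ['more', 'happy', 'of', 'the']

sg_pairs = [pair for pair in skip_gram_examples(tokens, 5) if pair[0] == "returns"]
print(sg_pairs)
```

CBOW produces one example per position (many inputs, one target), while Skip-Gram produces one example per central-context pair (one input, one target at a time).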

## CBOW Model

- CBOW Neural Network Architecture
  - The input layer has $n-1$ words, where $n$ is the window size
    - For window size $n=5$: $w_{t-2},w_{t-1},w_{t+1},w_{t+2}$
  - The hidden neuron computes a linear sum of the incoming weighted inputs
  - Finally, the output layer predicts the central word
    - $w_{t}$
    - $Softmax$ is used to find the most probable central word
- Input is a __*One Hot Vector*__
  - So we cannot feed all the words together in one shot to the input layer
  - We feed one word at a time
     - Example: "Wish", "you", "a", "happy", "year" as context words
       - Each word is fed to the input layer one at a time
- Perform a Linear Summation
  - Over the input and its weights
- Maximize the probability of a word based on the word co-occurrences within a distance $n$
- Input size and output size should match
  - If the input vector size is 100 (a vocabulary of 100 words), the output vector size should also be 100
  - $Softmax$ probability is estimated over those 100 words, indicating which one is most probable as the central word
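
In symbols, the linear summation and the softmax step can be written as below for a window size of $5$. The notation is an assumption introduced here for illustration: $x_{t+j}$ is the one-hot vector of a context word, $W_{\text{in}}$ and $W_{\text{out}}$ are the input and output weight matrices, $h$ is the hidden vector, $u$ is the vector of output scores, and $V$ is the vocabulary size.

$$h = \sum_{j \in \{-2,-1,1,2\}} W_{\text{in}}^{\top} x_{t+j}, \qquad u = W_{\text{out}}^{\top} h, \qquad P(w_t = i \mid \text{context}) = \frac{\exp(u_i)}{\sum_{k=1}^{V} \exp(u_k)}$$

Some formulations average the context vectors instead of summing them; the sum is used here to match the description above.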

![CBOW_NN_Architecture](images/CBOW_NN_Architecture.jpg)
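
A forward pass through this architecture can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the weights are random and untrained, the names `W_in`, `W_out`, and `cbow_forward` are hypothetical, the embedding size of 8 is arbitrary, and a full softmax over the toy vocabulary is used.

```python
import numpy as np

# Toy vocabulary (hypothetical); input and output sizes both equal len(vocab).
vocab = ["wish", "you", "a", "happy", "new", "year"]
word_to_id = {w: i for i, w in enumerate(vocab)}
vocab_size, embedding_dim = len(vocab), 8

# Untrained, randomly initialised weights (placeholders for learned values).
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
W_out = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))


def one_hot(word):
    vec = np.zeros(vocab_size)
    vec[word_to_id[word]] = 1.0
    return vec


def softmax(scores):
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def cbow_forward(context_words):
    """Return a probability for every vocabulary word being the central word."""
    hidden = np.zeros(embedding_dim)
    # Feed the context words one at a time and linearly sum their projections.
    for word in context_words:
        hidden += one_hot(word) @ W_in
    scores = hidden @ W_out  # one score per vocabulary word
    return softmax(scores)


probs = cbow_forward(["wish", "you", "a", "happy", "year"])
predicted = vocab[int(np.argmax(probs))]
print(probs.shape, probs.sum(), predicted)
```

Because the weights are untrained, the predicted word is arbitrary; training would adjust `W_in` and `W_out` so that the probability of the true central word ("new" in this example) increases.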

## Skip-Gram Model

- Skip-Gram Neural Network Architecture

- It is similar to CBOW
  - Input is a __*One Hot Vector*__
  - Output predicts one word at a time based on the given central word
- Example
  - SG uses the central word "new" and predicts the context words "Wish", "you", "a", "happy", "year"

![Skip_Gram_Model_NN_Architecture.jpg](images/Skip_Gram_Model_NN_Architecture.jpg)
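
The Skip-Gram direction can be sketched the same way: one central word goes in, and the same output distribution is scored against each context word in turn. This is an illustrative sketch with untrained random weights; the names `W_in`, `W_out`, `skip_gram_forward`, and `skip_gram_loss` are hypothetical, and a full softmax is assumed here for simplicity.

```python
import numpy as np

# Toy vocabulary (hypothetical), matching the example sentence above.
vocab = ["wish", "you", "a", "happy", "new", "year"]
word_to_id = {w: i for i, w in enumerate(vocab)}
vocab_size, embedding_dim = len(vocab), 8

# Untrained, randomly initialised weights (placeholders for learned values).
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
W_out = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))


def one_hot(word):
    vec = np.zeros(vocab_size)
    vec[word_to_id[word]] = 1.0
    return vec


def softmax(scores):
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def skip_gram_forward(central_word):
    """Return a probability for every vocabulary word being a context word."""
    # Multiplying a one-hot vector by W_in selects the row of the central word.
    hidden = one_hot(central_word) @ W_in
    return softmax(hidden @ W_out)


def skip_gram_loss(central_word, context_words):
    """Sum of negative log-probabilities, one term per context word."""
    probs = skip_gram_forward(central_word)
    return float(-sum(np.log(probs[word_to_id[w]]) for w in context_words))


probs = skip_gram_forward("new")
loss = skip_gram_loss("new", ["wish", "you", "a", "happy", "year"])
print(probs.shape, loss)
```

Training would lower this loss by raising the probability assigned to each true context word of "new".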