# Word Embedding using Word2Vec

- It was developed at Google (around 2013-2014)

## Study Links

- From [27]
  - Good overview of all models, more details on
    - Context Words
    - CBOW
    - Skip-Gram
    - Negative Sampling
- From [v4]
  - More intuitive explanation on
    - Limitations of One-Hot-Vector
    - Idea behind the Word Embedding

## Distributional Similarity based representations

- From [v4] Lecture 2
  - You can get a lot of value by representing a word by means of its neighbors
    - "You shall know a word by the company it keeps" J.R.Firth 1957:11
  - This is one of the most successful ideas of modern statistical NLP
  - ![Distributional_Similarity_Based_Representations](images/Distributional_Similarity_Based_Representations.jpg)

## Word Meaning is Defined in Terms of Vectors

- From [v4] Lecture 2
  - We build a dense vector for each word type, chosen so that it is good at predicting other words appearing in the context
    - ... those words are also being represented by vectors ... it all gets a bit recursive
    - Similarity between words is measured using metrics such as the dot product of their vectors
  - ![Word_Meaning_As_Vector](images/Word_Meaning_As_Vector.jpg)
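
As a minimal sketch of the similarity idea above, the snippet below compares words by the dot product and by cosine similarity (the dot product divided by the vector lengths). The three-dimensional vectors are hypothetical, hand-picked values for illustration only; they are not output from a trained Word2Vec model.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product normalised by the vector lengths; ranges from -1 to 1."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical 3-dimensional embeddings (assumption, not trained values).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])

# With these toy values, "king" is closer to "queen" than to "apple".
print(sim_royal > sim_fruit)  # True
```

Cosine similarity is often preferred over the raw dot product because it ignores vector length and compares direction only.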

## Main Idea of Word2Vec

- From [v4] Lecture 2
  - __*Predict between every word and its context words!*__
  - Two Algorithms
    - Continuous Bag of Words (CBOW)
    - Skip-Grams (SG)
  - Two (moderately efficient) training methods
    - Hierarchical softmax
    - Negative Sampling

## Goal

- Process each word in a Vocabulary of words to obtain a respective numeric representation of each word in Vocabulary
  - Instead of a _One Hot Vector_, represent each word as a fixed-size vector with, for example, 100, 200 or 300 elements
- Reflect _Semantic Similarities_, _Syntactic Similarities_, or both, between the words the vectors represent
- Map each of several words to its respective vector and output a single merged vector that is a combination of those vectors
  - Multiple similar words can be merged and represented by one vector
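
To make the contrast with the _One Hot Vector_ concrete, here is a small sketch assuming a toy six-word vocabulary (the vocabulary and the helper name `one_hot` are illustrative assumptions). A one-hot vector is as long as the vocabulary, and any two different words have a dot product of zero, so it carries no similarity information.

```python
def one_hot(word, vocabulary):
    """Return a vector of len(vocabulary) with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

# Toy vocabulary (assumption, for illustration only).
vocabulary = ["wish", "you", "a", "happy", "new", "year"]

happy = one_hot("happy", vocabulary)
year = one_hot("year", vocabulary)

# Two different one-hot vectors never share a non-zero position,
# so their dot product is always 0.
overlap = sum(a * b for a, b in zip(happy, year))

print(happy)    # [0, 0, 0, 1, 0, 0]
print(overlap)  # 0
```

A dense embedding keeps a fixed size (say 100 to 300 elements) however large the vocabulary grows, and its dot products can be non-zero for related words.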

## [Context Words](https://cs224d.stanford.edu/lecture_notes/notes1.pdf) and Central Word

- In a Probabilistic Language Model
  - Conditional probability is used to identify/predict the next word
  - The word to be predicted is the one that follows the given context words
  - So, when the context words are given, the next word is predicted
    - Example: $\text{"How are you"}$
      - $\text{"How are"}$ are the given context words in the case of a Language Model
      - $\text{"you"}$ is the next word that we want to predict
- In CBOW Model
  - Central Word is surrounded by context words
  - Given the context words, we want to identify/predict the _Central Word_
    - Example: $\text{"more happy returns of the day"}$, let's consider a window size of $5$
      - $\text{"more happy ___ of the"}$ are the given context words in the case of the CBOW Model
      - $\text{"returns"}$ is the central word that we want to identify/predict
- In Skip-Gram Model
  - Given the central word, identify the surrounding words
  - Example: $\text{"more happy returns of the day"}$, let's consider a window size of $5$
    - $\text{"returns"}$ is the given central word
    - $\text{"more happy of the"}$ are the surrounding context words that we need to predict for the given central word

- ![Context_Words_and_Central_Word](images/Context_Words_and_Central_Word.jpg)

- From [27] Notes 1
  - __Context of a Word__
    - The context of a word is the set of $C$ surrounding words on each side.
    - For instance, for $C = 2$, the context of the word $\text{"fox"}$ in the sentence $\text{"The quick brown fox jumped over the lazy dog"}$ is $\{\text{"quick"}, \text{"brown"}, \text{"jumped"}, \text{"over"}\}$
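
The definition of context can be sketched as a small helper. The function name `context_of` is an assumption made for this example, and tokenisation is a plain whitespace split.

```python
def context_of(tokens, position, C):
    """Return up to C words on each side of tokens[position]."""
    left = tokens[max(0, position - C):position]
    right = tokens[position + 1:position + 1 + C]
    return left + right

sentence = "The quick brown fox jumped over the lazy dog".split()
context = context_of(sentence, sentence.index("fox"), C=2)

print(context)  # ['quick', 'brown', 'jumped', 'over']
```

Near the start or end of a sentence the context is simply shorter, because there are fewer than $C$ words on one side.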

## CBOW Model

- _Refer to [25] for more details on the architecture of both models_

- CBOW Neural Network Architecture
  - The input layer takes $n-1$ words, where $n$ is the window size
    - For window size $n=5$: $w_{t-2},w_{t-1},w_{t+1},w_{t+2}$
  - The neuron computes a linear sum of all the incoming inputs multiplied by their weights
  - From [v4] Lecture 2, __*Window size $n$ is one of the Hyper Parameters for this model*__
  - Finally we have output, which predicts the central word
    - $w_{t}$
    - $Softmax$ is used to find the most probable central word
- Input is a __*One Hot Vector*__
  - So we cannot feed all the words together in one shot to the input layer
  - We feed one word at a time
     - Example: $\text{"Wish"}$, $\text{"you"}$, $\text{"a"}$, $\text{"happy"}$, $\text{"year"}$ as context words
       - Each word is fed to the input layer one at a time
- Perform a Linear Summation
  - Over the input and its weights
- Maximize the probability of a word based on the word co-occurrences within a distance $n$
- Input size and Output size should match (both equal the vocabulary size, since the input is a One Hot Vector)
  - If the input vector size is 100, the output vector size should also be 100
  - $Softmax$ probability is estimated over those 100 words, indicating which one is most probable as the central word

![CBOW_NN_Architecture](images/CBOW_NN_Architecture.jpg)
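
The forward pass described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the weights are random and untrained, the embedding size `N = 3` is a toy value, the context embeddings are averaged (a common variant of the linear sum), and the names `cbow_forward`, `W_in` and `W_out` are chosen for this example. Multiplying a one-hot vector by `W_in` selects one row, so the code indexes the row directly.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, W_in, W_out):
    """Average the context embeddings, then score every vocabulary word."""
    dim = len(W_in[0])
    hidden = [sum(W_in[i][d] for i in context_ids) / len(context_ids)
              for d in range(dim)]
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    return softmax(scores)

vocabulary = ["wish", "you", "a", "happy", "new", "year"]
V, N = len(vocabulary), 3  # vocabulary size and embedding size (toy values)

random.seed(0)
# Untrained, randomly initialised weights (assumption): both are V x N.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]

context_words = ["wish", "you", "a", "happy", "year"]
context_ids = [vocabulary.index(w) for w in context_words]
probabilities = cbow_forward(context_ids, W_in, W_out)

# One probability per vocabulary word; training would push up the entry for "new".
print(len(probabilities))
```

The output has one entry per vocabulary word, which is why the input and output sizes match.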

### Hyper Parameters

- From [v4] Lecture 2
  - Words as Vectors
  - Window Size $n$

## Skip-Gram Model

- Skip Gram Neural Network Architecture

- It is similar to CBOW
  - Input is a __*One Hot Vector*__
  - Output predicts one word at a time based on the given central word
- Example
  - SG uses the central word $\text{"new"}$ and predicts the context words $\text{"Wish"}$, $\text{"you"}$, $\text{"a"}$, $\text{"happy"}$, $\text{"year"}$
- From [v4] Lecture 2, __*Window size $n$ is one of the Hyper Parameters for this model*__

![Skip_Gram_Model_NN_Architecture.jpg](images/Skip_Gram_Model_NN_Architecture.jpg)
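
As a rough sketch of how Skip-Gram training examples can be formed, the snippet below pairs every central word with each word inside its window. The helper name `skip_gram_pairs` is an assumption for this example, and `C = 2` words on each side corresponds to a window size of $5$.

```python
def skip_gram_pairs(tokens, C):
    """Return (central_word, context_word) pairs for every position."""
    pairs = []
    for position, central in enumerate(tokens):
        start = max(0, position - C)
        end = min(len(tokens), position + C + 1)
        for neighbor in range(start, end):
            if neighbor != position:  # skip the central word itself
                pairs.append((central, tokens[neighbor]))
    return pairs

tokens = "more happy returns of the day".split()
pairs = skip_gram_pairs(tokens, C=2)

returns_pairs = [pair for pair in pairs if pair[0] == "returns"]
print(returns_pairs)
# [('returns', 'more'), ('returns', 'happy'), ('returns', 'of'), ('returns', 'the')]
```

Each pair is one training example: the central word is the input and the context word is the target, so the model predicts one context word at a time.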

### Hyper Parameters

- From [v4] Lecture 2
  - Words as Vectors
  - Window Size $n$