##### What are word embeddings and why do we need them?

__Word embeddings__ build on a simple idea: words that appear in similar contexts tend to have similar meanings. This idea comes from linguistics and is called _The Distributional Hypothesis_.

__Core limitations of classical methods (N-grams, BoW, TF-IDF)__:
- treat words as independent symbols

- rely on counts

- live in high-dimensional sparse space

- have no notion of meaning

So for classical NLP: fraud ≠ scam ≠ cheating

They are:

- different columns

- unrelated vectors

- orthogonal dimensions

But humans know they are related.
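
To make "orthogonal dimensions" concrete, here is a minimal sketch that assumes a toy vocabulary containing only the three words above. The helper names `one_hot` and `dot` are illustrative, not from any library.

```python
# Toy vocabulary: in a bag-of-words model each word gets its own column.
vocabulary = ["fraud", "scam", "cheating"]

def one_hot(word, vocab):
    """Return a vector with a 1 in the word's column and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

fraud = one_hot("fraud", vocabulary)
scam = one_hot("scam", vocabulary)

print(fraud)             # [1, 0, 0]
print(scam)              # [0, 1, 0]
print(dot(fraud, scam))  # 0 -> the vectors share nothing
```

A dot product of 0 means the representation carries no hint that the two words are related, no matter how often they are used in the same way.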


##### Intuition:

Examples:
- He deposited money _in_ the bank.
- She withdrew cash _from_ the bank.

vs
- He sat _by_ the river bank.
- The bank was flooded

The word _bank_ appears in different contexts → different meanings.
Embeddings try to encode context information into vectors.

##### So a word embedding, conceptually, is:
- a dense vector
- low-dimensional (typically up to about 300 dimensions)
- learned from data
- where distance and direction carry meaning.

So, instead of this (BoW):

        [0, 0, 1, 0, 0, 0, 0, ...]  # sparse


We get this:

        [0.12, -0.44, 0.87, ...]   # dense


Each number has no standalone meaning; meaning comes from geometry.

__Important note:__ Embeddings do not store meaning. They store usage patterns; meaning emerges from usage.

So embeddings:
- capture bias
- reflect the data distribution
- can drift over time

---

##### Prerequisites before Word2Vec:

__Artificial Neural Network (ANN)__

An ANN is a function approximator inspired by biological neurons.


Basic structure

- Input layer: receives features

- Hidden layers: apply weighted sums + nonlinear activations

- Output layer: produces predictions

A neuron computes $y = f(\mathbf{w} \cdot \mathbf{x} + b)$, where $f$ is an activation function (e.g., ReLU, sigmoid).

_Purpose:_ learn a mapping from inputs to outputs by adjusting weights.
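
The neuron formula $y = f(\mathbf{w} \cdot \mathbf{x} + b)$ can be sketched in a few lines. The inputs, weights, and bias below are made-up values chosen so the arithmetic is easy to follow; they are not from any trained model.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Pass positive values through, clip negatives to 0."""
    return max(0.0, z)

def neuron(x, w, b, activation):
    """Weighted sum of the inputs plus bias, followed by an activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

x = [1.0, 2.0]     # input features (illustrative)
w = [0.5, -0.25]   # weights (illustrative)
b = 0.0            # bias (illustrative)

# z = 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0
print(neuron(x, w, b, sigmoid))  # 0.5
print(neuron(x, w, b, relu))     # 0.0
```

Training a network means adjusting `w` and `b` for every neuron so that the final outputs get closer to the expected answers.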

Think of an ANN as a machine that learns from examples.

- You give it input (data)

- It makes a guess

- It slowly learns how to make better guesses

Inside, it has many small units ("neurons") that:

- Look at the input

- Decide what is important

- Pass the result forward

Over time, the network learns which inputs matter more than others.

__Loss Function__

The loss function measures how wrong the model's predictions are.

- Scalar value

- Lower loss = better performance

- Guides learning

Examples

- _Mean Squared Error (MSE): regression_
- _Cross-Entropy Loss: classification; penalizes confident wrong predictions_

The loss function is simply a score of how wrong the guess was.

- If the guess is good → low loss

- If the guess is bad → high loss

Example:

- True answer: 10

- Model guesses: 50 → big loss

- Model guesses: 11 → small loss

The model uses this score to know how much it needs to improve.
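
The example above can be checked with a small sketch. It assumes squared error as the loss for the regression case, and uses the negative log of the probability given to the correct class as a simplified single-example cross-entropy. The function names are illustrative.

```python
import math

def squared_error(y_true, y_pred):
    """Squared difference between the true value and the guess."""
    return (y_true - y_pred) ** 2

def mean_squared_error(y_true, y_pred):
    """Average squared error over several predictions."""
    errors = [squared_error(t, p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

def cross_entropy(prob_of_correct_class):
    """Loss for one example: -log of the probability assigned to the true class."""
    return -math.log(prob_of_correct_class)

print(squared_error(10, 50))                   # 1600 -> big loss
print(squared_error(10, 11))                   # 1    -> small loss
print(mean_squared_error([10, 10], [50, 11]))  # 800.5

# A confident wrong prediction (true class given probability 0.01)
# is penalized far more than a confident correct one (0.9).
print(cross_entropy(0.01) > cross_entropy(0.9))  # True
```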


__Optimizer__

The optimizer defines how weights are updated to minimize the loss.

Core idea:

$$
w \leftarrow w - \eta \nabla L
$$

where $\eta$ is the learning rate.

Common optimizers

- SGD: simple, stable, slower convergence

- Momentum: accelerates SGD using past gradients

- Adam: adaptive learning rates; fast and widely used

Optimizer choice affects:

- Training speed

- Stability

- Final model quality

The optimizer is the method the model uses to improve itself.

After seeing how wrong it was:

- The optimizer changes the model slightly

- The goal is to reduce future mistakes

Think of it as:

- Loss function: "How bad was I?"

- Optimizer: "How should I adjust next time?"

Different optimizers adjust more cautiously or more aggressively.
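
As a rough illustration of the update rule $w \leftarrow w - \eta \nabla L$, the sketch below runs plain gradient descent on a made-up one-parameter loss $L(w) = (w - 3)^2$, whose minimum is at $w = 3$. The loss, the starting point, and the two learning rates are assumptions chosen for the demo, not values from any real model.

```python
def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def gradient_descent(learning_rate, steps, w=0.0):
    """Repeatedly apply w <- w - learning_rate * gradient(w)."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

cautious = gradient_descent(learning_rate=0.01, steps=50)
aggressive = gradient_descent(learning_rate=0.1, steps=50)

print(cautious)    # still well short of 3.0 after 50 steps
print(aggressive)  # very close to 3.0 after 50 steps
```

With the same number of steps, the larger learning rate gets much closer to the minimum. On harder loss surfaces a step that is too large can overshoot, which is the trade-off that optimizers such as Momentum and Adam try to manage.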

---

##### Word2Vec

It is an NLP technique that uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can:
- detect synonymous words
- suggest additional words for a partial sentence.

As the name implies, it represents each distinct word with a particular list of numbers called a vector.

__It is a method to learn word embeddings from text.__

- Word2Vec = a way to train word vectors such that words with similar usage patterns have similar vectors.

The output is:

 - a dense vector per word

 - typically 50–300 dimensions

 - learned automatically from text


##### The Problem Word2Vec Solves:

- Classical NLP sees this:

        - fraud ≠ scam ≠ cheating


- Word2Vec tries to make this true in vector space:

        - cosine_similarity(fraud, scam) → high
        - cosine_similarity(fraud, banana) → low


It does this by learning from context, not frequency.


__Important :__ A word is defined by the words that surround it. This is the distributional hypothesis in action.

Example:

"free prize win now"


The word prize often appears near:

- free
- win
- claim
- money

So its vector should be close to those words.
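
To see what "appears near" means in practice, the sketch below counts the words that fall within a small window around _prize_. The three-sentence corpus is invented purely for illustration, and Word2Vec does not store counts like these; this only shows the kind of context signal it learns from.

```python
from collections import Counter

# Hypothetical toy corpus, invented for this illustration.
corpus = [
    "free prize win now",
    "claim your free prize",
    "win money claim prize now",
]

def context_counts(target, sentences, window=2):
    """Count the words that occur within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

neighbours = context_counts("prize", corpus)
print(neighbours["free"])    # 2
print(neighbours["banana"])  # 0 (never seen near "prize")
```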

##### Cosine Similarity

In NLP, words and documents become vectors.

We need a way to answer:
- "How similar are these two vectors?"

But similarity should mean:

- meaning similarity, not size similarity
- independent of how long a document is

This is why cosine similarity exists. It calculates how aligned two vectors are in direction.
- same direction → similar meaning
- opposite direction → opposite meaning
- perpendicular → unrelated

__Cosine Similarity Formula:__ Cosine similarity is the cosine of the angle $\theta$ between the two vectors.


$$
\cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{||\vec{A}|| \, ||\vec{B}||}
$$

        · = dot product

        ||A|| = magnitude (length) of vector A

Worked example:

        A = [1, 0, 1]
        B = [1, 0, 1]

$$
\vec{A} \cdot \vec{B} = (a_1 \times b_1) + (a_2 \times b_2) + (a_3 \times b_3)
$$

$$
\vec{A} \cdot \vec{B} = (1 \times 1) + (0 \times 0) + (1 \times 1)
$$

$$
= 1 + 0 + 1
$$

$$
= 2
$$

$$
||\vec{A}|| = \sqrt{a_1^2 + a_2^2 + a_3^2}
$$

$$
||\vec{A}|| = \sqrt{1^2 + 0^2 + 1^2}
$$

$$
= \sqrt{1 + 0 + 1}
$$

$$
= \sqrt{2}
$$

$$
||\vec{B}|| = \sqrt{1^2 + 0^2 + 1^2}
$$

$$
= \sqrt{2}
$$

$$
||\vec{A}|| \times ||\vec{B}|| = \sqrt{2} \times \sqrt{2}
$$

$$
= 2
$$

$$
\cos(\theta) = \frac{2}{2}
$$

$$
= 1
$$

$$
\boxed{\cos(\theta) = 1}
$$








__Distance between the vectors__:

$$
\boxed {1 - \cos(\theta)}
$$

_So the distance is 0. This means the vectors point in the same direction, i.e. they are as similar as possible._
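
The hand calculation can be reproduced with a short sketch using only the standard library. The function names are illustrative, and the code assumes neither vector is all zeros (otherwise the division would fail).

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Magnitude (length) of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes non-zero vectors)."""
    return dot(a, b) / (norm(a) * norm(b))

def cosine_distance(a, b):
    """1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

A = [1, 0, 1]
B = [1, 0, 1]

# Rounded because sqrt(2) * sqrt(2) is not exactly 2 in floating point.
print(round(cosine_similarity(A, B), 6))  # 1.0
print(round(cosine_distance(A, B), 6))    # 0.0

print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (perpendicular)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```

Scaling a vector does not change the result: `cosine_similarity([1, 0, 1], [5, 0, 5])` is also 1 up to rounding, which is why the measure is independent of document length.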

##### CBOW (Continuous Bag of Words): one of the two Word2Vec architectures



CBOW predicts a target word given its surrounding context words.

Example

Sentence:

        the cat sat on the mat


With window size = 2 (two words on each side of the target) and target = sat:

- Input (context words): ["the", "cat", "on", "the"]

- Output (target word): "sat"

Order of context words does not matter → "bag of words".
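
Building the (context, target) training pairs can be sketched as follows. The function name `cbow_pairs` and its `half_window` parameter (the number of words taken on each side of the target) are assumptions made for this illustration. At sentence boundaries the context is simply truncated.

```python
def cbow_pairs(tokens, half_window):
    """Return a list of (context_words, target_word) pairs.

    half_window is the number of words taken on each side of the target.
    Near the start or end of the sentence the context is shorter.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        pairs.append((left + right, target))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = cbow_pairs(tokens, half_window=2)
print(pairs[2])  # (['the', 'cat', 'on', 'the'], 'sat')

# Keeping only targets with a full context of 4 words reproduces a
# 5-word sliding window (2 left + center + 2 right).
sentence = "Mayank is on his way to learn NLP".split()
full_windows = [(c, t) for c, t in cbow_pairs(sentence, 2) if len(c) == 4]
print([t for _, t in full_windows])  # ['on', 'his', 'way', 'to']
```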

__Important__:
- __Window size__ plays a crucial role in identifying the input and output data. In the walkthrough below, the window size counts the whole window (context words plus the center word), so it is good practice to make it an _odd_ number.
    - It tells us how many words to pick at a time, starting from the beginning of the sentence. With a window of 5 words, the center word becomes the output.
    - The remaining words give us the context before and after the center word.
    - __NOTE__: _Window size only decides which words are used as input. It does not define the neural network size._
    
    For example, in _Mayank is on his way to learn NLP_, if I select a window size of 5 starting from the beginning:

        - Input1 = [Mayank, is, his, way]
        - Output1 = [on]
    - Moving the window one word to the right in the same example
        - Input2 = [is,on,way,to]
        - Output2 = [his]
    - Moving again
        - Input3 =[on,his,to,learn]
        - Output3 = [way]
    - Moving again
        - Input4 = [his, way, learn, NLP]
        - Output4 = [to]    
    
    Now we will train our model with this data. To do that, we need to convert the words into vectors. We can use methods like one-hot encoding (OHE).

    - Let's perform OHE

    Vocabulary = [Mayank, is, on, his, way, to, learn, NLP]

    OHE:
    - Mayank [1,0,0,0,0,0,0,0]
    - is [0,1,0,0,0,0,0,0]
    - on [0,0,1,0,0,0,0,0]
    - his [0,0,0,1,0,0,0,0]
    - way [0,0,0,0,1,0,0,0]
    - to [0,0,0,0,0,1,0,0]
    - learn [0,0,0,0,0,0,1,0]
    - NLP [0,0,0,0,0,0,0,1]

    So, each word is represented as an 8-dimensional one-hot vector.

    Let's train the model now. 
    
    __Let's choose a target word.__
    - 'his' is the target word.
    - since window size is 5, left : _is,on_ , right : _way,to_

    __Input Layer__
    The input is multiple one-hot vectors, one for each context word.
    - In our case, context words are: _[is,on,way,to]_
    - Their one hot vectors:
        - is [0, 1, 0, 0, 0, 0, 0, 0]
        - on [0, 0, 1, 0, 0, 0, 0, 0]
        - way [0, 0, 0, 0, 1, 0, 0, 0]
        - to [0, 0, 0, 0, 0, 1, 0, 0]

    __Hidden Layer (Embedding Layer)__

    This is a shared representation for all the words and exists once for the entire model. More dimensions = more capacity to store semantic information.

    Between input and hidden layer:
    - Weight matrix size = vocab size * embedding dimension
    - For our example, let us take embedding dimension as 4.
    - So __W = 8 x 4__

    __So what is an embedding dimension?__
    The number of numbers used to represent the meaning of a word.

            With OHE, is = [0, 1, 0, 0, 0, 0, 0, 0].
            With embedding dimension 4, is = [0.12, 0.15, 0.66, 0.88]

    This means each word gets mapped to a 4D dense vector.

    How context words are combined:

    - Each one-hot vector selects its row from W

    - This gives 4 embedding vectors

    - CBOW averages them

    Mathematically (conceptually):

        hidden_vector =    (embedding(is) + embedding(on) + embedding(way) + embedding(to)) / 4
    
    _This averaged vector is the hidden layer output._


    __Output Layer__
    The output layer tries to predict the target word ("his").
    - It tries to answer: given this context representation, which word best fits in the center?
    - It must assign a score or probability to every word in the vocabulary.
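
The three layers described above can be put together in a minimal forward-pass sketch. Several things here are assumptions for illustration: the weights are random (an untrained model), the second weight matrix `W_out` of size 4 x 8 and the use of softmax to turn scores into probabilities are not spelled out in the walkthrough, and all helper names are made up. No training happens in this sketch.

```python
import math
import random

vocabulary = ["Mayank", "is", "on", "his", "way", "to", "learn", "NLP"]
vocab_size = len(vocabulary)  # 8
embedding_dim = 4

rng = random.Random(0)  # fixed seed so the run is repeatable

# Randomly initialised weights; training would adjust these.
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(embedding_dim)]
        for _ in range(vocab_size)]      # 8 x 4, one row per word
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(vocab_size)]
         for _ in range(embedding_dim)]  # 4 x 8 (assumed output weights)

def one_hot(word):
    """One-hot vector for a word in the vocabulary."""
    return [1 if w == word else 0 for w in vocabulary]

def vec_mat(v, m):
    """Multiply a row vector v (length n) by a matrix m (n x k)."""
    return [sum(v[i] * m[i][j] for i in range(len(v)))
            for j in range(len(m[0]))]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

context = ["is", "on", "way", "to"]

# Input -> hidden: a one-hot vector times W_in just selects that word's row.
embeddings = [vec_mat(one_hot(word), W_in) for word in context]

# Hidden layer: element-wise average of the four context embeddings.
hidden = [sum(column) / len(context) for column in zip(*embeddings)]

# Output layer: one score per vocabulary word, then probabilities.
scores = vec_mat(hidden, W_out)
probabilities = softmax(scores)

for word, p in zip(vocabulary, probabilities):
    print(f"{word:>6}: {p:.3f}")
```

Because the weights are random, the probability assigned to "his" is arbitrary at this point. Training compares these probabilities with the true target using a loss function and lets the optimizer adjust `W_in` and `W_out`; the rows of `W_in` are what end up being used as the word embeddings.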







