##### What are word embeddings and why do we need them?

__Word embeddings__ are numeric vector representations of words, built on the idea that words appearing in similar contexts tend to have similar meanings. This idea comes from linguistics and is called _the Distributional Hypothesis_.

__Core Limitations of Classical Methods (N-grams, BoW, TF-IDF)__:
- treat words as independent symbols

- rely on counts

- live in high-dimensional sparse space

- have no notion of meaning

So for classical NLP: fraud ≠ scam ≠ cheating

They are:

- different columns

- unrelated vectors

- orthogonal dimensions

But humans know they are related.
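To make the orthogonality concrete, here is a minimal sketch that assumes a toy four-word vocabulary. The vocabulary and the helper names `one_hot` and `dot` are made up for illustration and do not come from any library. Every pair of distinct one-hot vectors has a dot product of 0, so this representation cannot express that fraud and scam are related.

```python
# Toy vocabulary, assumed purely for illustration.
vocab = ["fraud", "scam", "cheating", "banana"]

def one_hot(word, vocab):
    """Return a one-hot vector: 1 in the word's own column, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

fraud = one_hot("fraud", vocab)
scam = one_hot("scam", vocab)
banana = one_hot("banana", vocab)

print(fraud)               # [1, 0, 0, 0]
print(dot(fraud, scam))    # 0 -> orthogonal, "unrelated"
print(dot(fraud, banana))  # 0 -> exactly as unrelated as scam
print(dot(fraud, fraud))   # 1 -> a word only matches itself
```

In this setup, scam is no closer to fraud than banana is, which is exactly the limitation described above.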


##### Intuition:

Examples:
- He deposited money _in_ the bank.
- She withdrew cash _from_ the bank.

vs
- He sat _by_ the river bank.
- The bank was flooded.

The word bank appears in different contexts → different meanings.
Embeddings try to encode context information into vectors.

##### So a word embedding, conceptually, is:
- a dense vector
- low-dimensional (typically 50–300 dimensions)
- learned from data
- where distance and direction carry meaning.

So, instead of this (BoW):

        [0, 0, 1, 0, 0, 0, 0, ...]  # sparse


We get this:

        [0.12, -0.44, 0.87, ...]   # dense


Each number has no standalone meaning; meaning comes from geometry, that is, from the distances and directions between vectors.

__Important note:__ Embeddings do not store meaning. They store usage patterns; meaning emerges from usage.

So embeddings:
- capture biases present in the training text
- reflect the data distribution they were trained on
- can drift over time as word usage changes

##### Word2Vec

It is an NLP technique that uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can:
- detect synonymous words
- suggest additional words for a partial sentence.

As the name implies, it represents each distinct word with a particular list of numbers called a vector.

__It is a method to learn word embeddings from text.__

- Word2Vec = a way to train word vectors such that words with similar usage patterns have similar vectors.

The output is:

 - a dense vector per word

 - typically 50–300 dimensions

 - learned automatically from text


##### The Problem Word2Vec Solves:

- Classical NLP sees this:

        fraud ≠ scam ≠ cheating

- Word2Vec tries to make this true in vector space:

        cosine_similarity(fraud, scam) → high
        cosine_similarity(fraud, banana) → low


It does this by learning from context, not frequency.
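The target behaviour can be sketched with a few hand-picked vectors. The three-dimensional values below are invented for illustration and are not the output of a trained Word2Vec model; real vectors are learned from a corpus and have many more dimensions. The `cosine_similarity` helper is a plain-Python assumption, not a library call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy vectors, NOT learned from data.
toy_vectors = {
    "fraud": [0.9, 0.8, 0.1],
    "scam": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_scam = cosine_similarity(toy_vectors["fraud"], toy_vectors["scam"])
sim_banana = cosine_similarity(toy_vectors["fraud"], toy_vectors["banana"])

print(round(sim_scam, 2))     # 0.99 -> high
print(round(sim_banana, 2))   # 0.16 -> low
print(sim_scam > sim_banana)  # True
```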


__Important:__ A word is defined by the words that surround it. This is the distributional hypothesis in action.

Example:

"free prize win now"


The word prize often appears near:

- free
- win
- claim
- money

So its vector should end up close to the vectors of those words.
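The idea of "surrounding words" can be sketched as a sliding context window. Word2Vec's skip-gram variant trains on (center, context) pairs of this kind. The helper name `context_pairs` and the window size of 2 are assumptions chosen for illustration, and this sketch only builds the pairs; it does not train any vectors.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "free prize win now".split()
pairs = context_pairs(tokens, window=2)

# Context words observed around "prize" in this one sentence.
prize_context = [ctx for center, ctx in pairs if center == "prize"]
print(prize_context)  # ['free', 'win', 'now']
```

Over a large corpus, words that keep showing up in each other's windows are the ones whose vectors get pulled together during training.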

##### Cosine Similarity

In NLP, words and documents become vectors.

We need a way to answer:
- “How similar are these two vectors?”

But similarity should mean:

- meaning similarity, not size similarity
- independent of how long a document is

This is why cosine similarity exists. It calculates how aligned two vectors are in direction.
- same direction → similar meaning
- opposite direction → strongly dissimilar (cosine of −1; not necessarily an antonym)
- perpendicular → unrelated
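A short sketch of these properties, using small made-up count vectors. The helpers `cosine_similarity` and `euclidean_distance` are plain-Python assumptions written for this example. Doubling every count in a document changes its length but not its direction, so the cosine stays at about 1 while the Euclidean distance grows.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

doc = [1, 2, 0]
longer_doc = [2, 4, 0]  # same word proportions, twice the counts

print(round(cosine_similarity(doc, longer_doc), 6))   # 1.0   -> same direction
print(round(euclidean_distance(doc, longer_doc), 3))  # 2.236 -> size still matters here

print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 -> opposite direction
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  -> perpendicular
```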

__Cosine Similarity Formula:__ It is the cosine of the angle between the two vectors.


$$
\cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{||\vec{A}|| \, ||\vec{B}||}
$$




        · = dot product

        ||A|| = magnitude (length) of vector A

Worked example with two identical vectors:

        A = [1, 0, 1]
        B = [1, 0, 1]

$$
\vec{A} \cdot \vec{B} = (a_1 \times b_1) + (a_2 \times b_2) + (a_3 \times b_3)
$$

$$
\vec{A} \cdot \vec{B} = (1 \times 1) + (0 \times 0) + (1 \times 1)
$$

$$
= 1 + 0 + 1
$$

$$
= 2
$$

$$
||\vec{A}|| = \sqrt{a_1^2 + a_2^2 + a_3^2}
$$

$$
||\vec{A}|| = \sqrt{1^2 + 0^2 + 1^2}
$$

$$
= \sqrt{1 + 0 + 1}
$$

$$
= \sqrt{2}
$$

$$
||\vec{B}|| = \sqrt{1^2 + 0^2 + 1^2}
$$

$$
= \sqrt{2}
$$

$$
||\vec{A}|| \times ||\vec{B}|| = \sqrt{2} \times \sqrt{2}
$$

$$
= 2
$$

$$
\cos(\theta) = \frac{2}{2}
$$

$$
= 1
$$

$$
\boxed{\cos(\theta) = 1}
$$
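The same calculation can be checked step by step in plain Python. This is a sketch that mirrors the derivation above; the variable names are chosen here for readability. Floating-point arithmetic makes `sqrt(2) * sqrt(2)` come out a hair above 2, so the result is rounded before printing.

```python
import math

A = [1, 0, 1]
B = [1, 0, 1]

# Dot product: (1*1) + (0*0) + (1*1)
dot = sum(a * b for a, b in zip(A, B))

# Magnitudes: sqrt(1^2 + 0^2 + 1^2) for each vector
norm_a = math.sqrt(sum(a * a for a in A))
norm_b = math.sqrt(sum(b * b for b in B))

cos_theta = dot / (norm_a * norm_b)

print(dot)                  # 2
print(round(norm_a, 4))     # 1.4142
print(round(cos_theta, 6))  # 1.0
```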








__Cosine distance between the vectors__:

$$
\boxed {1 - \cos(\theta)}
$$

_So the cosine distance is 1 − 1 = 0. This means the vectors point in exactly the same direction (here they are identical)._
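As a sketch, cosine distance can be wrapped in a small helper. The function names below are assumptions for illustration. Since cosine similarity ranges from −1 to 1, this distance ranges from 0 (same direction) to 2 (opposite direction).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for same direction, 2 for opposite."""
    return 1 - cosine_similarity(a, b)

same = cosine_distance([1, 0, 1], [1, 0, 1])
perpendicular = cosine_distance([1, 0, 0], [0, 1, 0])
opposite = cosine_distance([1, 0, 1], [-1, 0, -1])

# Rounded to hide tiny floating-point error.
print(round(same, 6))           # 0.0
print(round(perpendicular, 6))  # 1.0
print(round(opposite, 6))       # 2.0
```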