# Word2Vec

## One-Hot Encoding
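One-Hot Encoding is a method for converting categorical variables into a numerical format, making it easier for machine learning algorithms to process the data. In categorical variables, each value represents a distinct category, and there is no inherent order or numerical relationship between them. Each category is mapped to a binary vector whose length equals the number of categories, with a single 1 at that category's index and 0 everywhere else.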
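
As a minimal sketch of the idea, the snippet below encodes words from a tiny, hypothetical three-word vocabulary. The vocabulary, the `word_to_index` mapping, and the `one_hot` helper are illustrative assumptions rather than part of any library.

```python
import numpy as np

# Hypothetical toy vocabulary; the index assigned to each word is arbitrary.
vocab = ["cat", "dog", "fish"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("dog"))  # [0 1 0]
```

The vector length equals the vocabulary size, which is why this representation becomes very wide for realistic vocabularies.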

## N-gram
Given a sentence containing m words (w₁, w₂, ..., wₘ), the probability of this sentence is calculated as:
$$P(sen = (w_1, w_2, \ldots, w_m)) = P(w_1)P(w_2 | w_1)P(w_3 | w_2, w_1) \ldots P(w_m | w_{m-1}, \ldots, w_1)$$
Typically, language models aim to maximize the conditional probability P(wₜ | w₁, w₂, ..., wₜ₋₁). In practice, however, the current word depends mostly on the few words immediately before it rather than on the entire history, so an N-gram model conditions each word only on its n - 1 preceding words (usually n <= 5).
Therefore, the conditional probability above can be approximated as:
$$P(w_t | w_1, w_2, \dots, w_{t-1}) \approx P(w_t | w_{t-1}, w_{t-2}, \dots, w_{t-(n-1)})$$
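
To make the approximation concrete, here is a small sketch of a bigram model (n = 2) estimated by counting. The corpus, the `<s>` start marker, and the helper names are assumptions made for illustration; no smoothing is applied, so an unseen context would raise a division-by-zero error.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized; <s> marks the sentence start.
corpus = [
    ["<s>", "i", "love", "nlp"],
    ["<s>", "i", "love", "cats"],
    ["<s>", "i", "like", "nlp"],
]

context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    context_counts.update(sentence[:-1])               # words that have a successor
    bigram_counts.update(zip(sentence, sentence[1:]))  # adjacent word pairs

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev), without smoothing."""
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_prob(sentence):
    """Chain-rule product under the bigram approximation."""
    prob = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("i", "love"))                     # 2 of the 3 "i" contexts
print(sentence_prob(["<s>", "i", "love", "nlp"]))   # 1 * 2/3 * 1/2
```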

However, the N-gram model still has its limitations:
* First, the parameter space grows exponentially with N (on the order of V^N for a vocabulary of size V), so the model cannot effectively handle longer contextual dependencies (when N > 3).
* Second, it fails to capture the intrinsic relationships between words.
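
As a rough illustration of the first limitation, the sketch below counts how many distinct n-grams are possible for a given vocabulary. The vocabulary size of 10,000 is an assumed figure chosen only for illustration.

```python
vocab_size = 10_000  # assumed vocabulary size, for illustration only

# An n-gram table needs (in the worst case) one entry per sequence of n words.
table_sizes = {n: vocab_size ** n for n in range(1, 6)}

for n, size in table_sizes.items():
    print(f"n={n}: {size:.1e} possible n-grams")
```

Most of these sequences never occur in any real corpus, so their probabilities cannot be estimated reliably from counts.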

## What is Word Embedding?
**Word embedding** is a technique in Natural Language Processing (NLP) that represents words (or phrases, sentences) as **low-dimensional, dense vectors** in a continuous vector space. These vectors capture semantic and syntactic relationships between words, allowing machines to process language more effectively. 

## Why Use Word Embeddings?
Traditional methods (e.g., one-hot encoding) suffer from:
* High dimensionality (e.g., 10,000D for a 10K-word vocabulary).
* No semantic meaning (all word vectors are orthogonal).

Word embeddings address both problems by:
* Using far fewer dimensions (e.g., 300D).
* Preserving semantic relationships, so that similar words receive similar vectors.
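
The contrast can be sketched with cosine similarity. The dense vectors below are hand-made, hypothetical values (not trained embeddings), chosen only to show what "similar words receive similar vectors" looks like numerically.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors for a 4-word vocabulary: every pair of distinct words is orthogonal.
one_hot = np.eye(4)
cat_oh, dog_oh, car_oh = one_hot[0], one_hot[1], one_hot[2]
print(cosine(cat_oh, dog_oh))  # 0.0
print(cosine(cat_oh, car_oh))  # 0.0

# Hand-made 3-D "embeddings" (hypothetical values, not trained).
dense = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))  # True
```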

## Popular Word Embedding Models
1. **Word2Vec**
* Uses **CBOW** (predicts a word from context) or **Skip-gram** (predicts context from a word).
* Trained on large text corpora.
2. **GloVe** (Global Vectors)
* Combines **global co-occurrence statistics** with local context.
3. ....
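
To give a sense of what "co-occurrence statistics" means, the following sketch counts how often pairs of words appear within a fixed window of each other. It shows only the counting step on a hypothetical toy corpus; it is not GloVe's training objective, and the `window` value is an arbitrary assumption.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
corpus = [["i", "love", "nlp"], ["i", "love", "cats"]]
window = 1  # count neighbors at most 1 position away

cooccurrence = Counter()
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooccurrence[(word, sentence[j])] += 1

print(cooccurrence[("i", "love")])    # 2
print(cooccurrence[("love", "nlp")])  # 1
```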

## CBOW (Continuous Bag of Words)
**CBOW** is one of the two main architectures in **Word2Vec** (the other being **Skip-gram**). It predicts a **target word** based on its **context words** (surrounding words in a fixed window).
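
The following sketch shows how (context, target) training examples could be extracted from a sentence with a fixed window. The `cbow_pairs` helper and the window size of 2 are assumptions for illustration and do not reproduce how any particular library samples its windows.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

tokens = ["I", "love", "natural", "language", "processing"]
pairs = list(cbow_pairs(tokens))
for context, target in pairs:
    print(context, "->", target)
```

Each position in the sentence yields exactly one training example, with up to `2 * window` context words.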

## Training Steps of the CBOW Model
The basic training steps of the **CBOW (Continuous Bag of Words)** model include:
1. **Input Representation**
* Represent the **context words** as **one-hot vectors**, where the vocabulary size is **V**, and the number of context words is **C**.
2. **Projection to Hidden Layer**
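* Multiply each context word's **one-hot vector** by the **input-to-hidden weight matrix W** (of size **V × N**, where N is the embedding dimension). Because a one-hot vector contains a single 1, this product simply selects the corresponding row of W, which is that word's embedding.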
3. **Hidden Layer Vector Calculation**
* Sum all the resulting vectors from the previous step and take their **average** to produce the **hidden layer vector** (also N-dimensional).
4. **Output Layer Projection**
* Multiply the hidden layer vector by the **hidden-to-output weight matrix W'** (of size **N × V**) to generate a V-dimensional score vector.
5. **Prediction via Softmax**
* Apply the **softmax activation** to the score vector to obtain a **probability distribution** over the vocabulary.
* The word with the **highest probability** is selected as the predicted **target word**.
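
The five steps above can be sketched as a single forward pass in NumPy. This is a minimal illustration under stated assumptions: the sizes `V` and `N`, the random weights, and the context indices are all made up, and no training (loss, gradients, weight updates) is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

V, N = 6, 4                       # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, N))       # input-to-hidden weights, V x N
W_out = rng.normal(size=(N, V))   # hidden-to-output weights W', N x V

context_ids = [0, 1, 3, 4]        # indices of the C context words (hypothetical)

# Steps 1-2: one-hot vectors times W; each product selects one row of W.
one_hot = np.eye(V)[context_ids]  # shape (C, V)
projected = one_hot @ W           # shape (C, N)

# Step 3: average the projected vectors to get the hidden layer.
hidden = projected.mean(axis=0)   # shape (N,)

# Step 4: one score per vocabulary word.
scores = hidden @ W_out           # shape (V,)

# Step 5: softmax (shifted by the max for numerical stability).
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()

predicted_id = int(np.argmax(probs))
print(probs.round(3), predicted_id)
```

Because the weights here are random, the predicted word is meaningless; training adjusts W and W' so that the true target word receives a high probability. Practical implementations usually avoid the full softmax over V words (gensim, for example, uses negative sampling by default).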

In [1]:
from gensim.models import Word2Vec

sentences = [["I", "love", "natural", "language", "processing"], 
             ["CBOW", "is", "a", "Word2Vec", "model"]]

# Train CBOW (sg=0 means CBOW; sg=1 means Skip-gram)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0, workers=4)

# Get word vector for "language"
vector = model.wv["language"]
print(vector)

[ 8.13227147e-03 -4.45733406e-03 -1.06835726e-03  1.00636482e-03
 -1.91113955e-04  1.14817743e-03  6.11386076e-03 -2.02715401e-05
 -3.24596534e-03 -1.51072862e-03  5.89729892e-03  1.51410222e-03
 -7.24261976e-04  9.33324732e-03 -4.92128357e-03 -8.38409644e-04
  9.17541143e-03  6.74942741e-03  1.50285603e-03 -8.88256077e-03
  1.14874600e-03 -2.28825561e-03  9.36823711e-03  1.20992784e-03
  1.49006362e-03  2.40640994e-03 -1.83600665e-03 -4.99963388e-03
  2.32429506e-04 -2.01418041e-03  6.60093315e-03  8.94012302e-03
 -6.74754381e-04  2.97701475e-03 -6.10765442e-03  1.69932481e-03
 -6.92623248e-03 -8.69402662e-03 -5.90020278e-03 -8.95647518e-03
  7.27759488e-03 -5.77203138e-03  8.27635173e-03 -7.24354526e-03
  3.42167495e-03  9.67499893e-03 -7.78544787e-03 -9.94505733e-03
 -4.32914635e-03 -2.68313056e-03 -2.71289347e-04 -8.83155130e-03
 -8.61755759e-03  2.80021061e-03 -8.20640661e-03 -9.06933658e-03
 -2.34046578e-03 -8.63180775e-03 -7.05664977e-03 -8.40115082e-03
 -3.01328895e-04 -4.56429

In [2]:
# Find similar words
similar_words = model.wv.most_similar("natural", topn=2)
print(similar_words)

[('model', 0.09291722625494003), ('language', 0.004842500668019056)]


## Skip-gram
Skip-gram is one of the two main architectures in **Word2Vec** (the other being CBOW). Unlike **CBOW**, which predicts a target word from its context, Skip-gram **does the opposite**: it takes a **center word** as input and predicts its surrounding **context words**. This approach excels at capturing **semantic relationships** and works particularly well for **rare words** and complex linguistic patterns.
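
As a sketch of what "predicting context from a center word" means for the training data, the snippet below enumerates (center, context) pairs within a fixed window. The `skipgram_pairs` helper and the window size of 2 are illustrative assumptions, not a library function.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["I", "love", "natural", "language", "processing"]
pairs = skipgram_pairs(tokens)
print(pairs[:3])   # [('I', 'love'), ('I', 'natural'), ('love', 'I')]
print(len(pairs))  # 14
```

Every center word produces several training pairs, one per context word, which is one reason Skip-gram tends to train more slowly than CBOW on the same text.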

In [3]:
from gensim.models import Word2Vec

sentences = [
    ["I", "love", "natural", "language", "processing"],
    ["Skip-gram", "is", "powerful", "for", "word", "embeddings"]
]

# Train Skip-gram (sg=1)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, workers=4)

# Get word vector
vector = model.wv["language"]
print(vector)

[ 8.1681199e-03 -4.4430327e-03  8.9854337e-03  8.2536647e-03
 -4.4352221e-03  3.0310510e-04  4.2744912e-03 -3.9263200e-03
 -5.5599655e-03 -6.5123225e-03 -6.7073823e-04 -2.9592158e-04
  4.4630850e-03 -2.4740540e-03 -1.7260908e-04  2.4618758e-03
  4.8675989e-03 -3.0808449e-05 -6.3394094e-03 -9.2608072e-03
  2.6657581e-05  6.6618943e-03  1.4660227e-03 -8.9665223e-03
 -7.9386048e-03  6.5519023e-03 -3.7856805e-03  6.2549924e-03
 -6.6810320e-03  8.4796622e-03 -6.5163244e-03  3.2880199e-03
 -1.0569858e-03 -6.7875278e-03 -3.2875966e-03 -1.1614120e-03
 -5.4709399e-03 -1.2113475e-03 -7.5633135e-03  2.6466595e-03
  9.0701487e-03 -2.3772502e-03 -9.7651005e-04  3.5135616e-03
  8.6650876e-03 -5.9218528e-03 -6.8875779e-03 -2.9329848e-03
  9.1476962e-03  8.6626766e-04 -8.6784009e-03 -1.4469790e-03
  9.4794659e-03 -7.5494875e-03 -5.3580985e-03  9.3165627e-03
 -8.9737261e-03  3.8259076e-03  6.6544057e-04  6.6607012e-03
  8.3127534e-03 -2.8507852e-03 -3.9923131e-03  8.8979173e-03
  2.0896459e-03  6.24894

In [4]:
# Find similar words
print(model.wv.most_similar("natural", topn=2))
# Note: with a corpus this small, the similarity scores are close to noise;
# meaningful neighbors require far more training text.

[('for', 0.19912061095237732), ('powerful', 0.07497558742761612)]


## Summary
### Key Innovations
1. Dense Vector Representations
* Transforms words from sparse one-hot encodings (e.g., 50,000D) to compact 100-300D continuous vectors.
2. Semantic Arithmetic
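* Enables vector operations that reflect linguistic relationships. For example, with embeddings trained on a large corpus, vector("king") - vector("man") + vector("woman") lands close to vector("queen").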
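
The arithmetic can be illustrated with hand-made vectors. The values below are hypothetical and chosen so that the analogy works exactly; real embeddings only satisfy such relationships approximately, and only after training on a large corpus.

```python
import numpy as np

# Hand-made 2-D vectors (hypothetical, not trained):
# axis 0 loosely encodes "royalty", axis 1 encodes gender (+1 male, -1 female).
vectors = {
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.1, 1.0]),
    "woman":  np.array([0.1, -1.0]),
    "prince": np.array([0.9, 0.9]),
    "apple":  np.array([-1.0, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

With a gensim model, the analogous query is `model.wv.most_similar(positive=["king", "woman"], negative=["man"])`, assuming those words are in the model's vocabulary.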

### Architecture Comparison
| Feature  | CBOW                               | Skip-gram                       |
|----------|------------------------------------|---------------------------------|
| Approach | Predicts center from context       | Predicts context from center    |
| Speed    | Faster (better for frequent words) | Slower (excels with rare words) |
| Strength | Syntactic patterns                 | Semantic relationships          |

In [5]:
from gensim.models import Word2Vec

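# Pre-tokenized Chinese sentences:
# "自然 语言 处理" = "natural language processing"; "深度学习 改变 世界" = "deep learning changes the world"
sentences = [["自然", "语言", "处理"], ["深度学习", "改变", "世界"]]

# Train model (sg is not set, so gensim uses its default sg=0, i.e. CBOW)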
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Get word vector
vector = model.wv["自然"]

# Similarity
print(model.wv.similarity("自然", "语言"))
print(model.wv.most_similar("深度学习", topn=2))

0.13887984
[('语言', 0.17018885910511017), ('自然', 0.06408978998661041)]
