
# Word2Vec: A Comprehensive Overview

This notebook gives an overview of Word2Vec: its history, mathematical foundation, a Python implementation with a visualization of the learned embeddings, and its advantages and disadvantages. It closes with a brief discussion of the model's impact.



## History of Word2Vec

Word2Vec was introduced by Tomas Mikolov and colleagues at Google in 2013 in the papers "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality." The model revolutionized natural language processing (NLP) by providing a method to represent words as continuous vectors in a high-dimensional space, capturing semantic relationships between words based on their context in large text corpora. Word2Vec laid the foundation fo...



## Mathematical Foundation of Word2Vec

### Skip-Gram Model

One of the two architectures in Word2Vec is the Skip-Gram model, which predicts the context words given a target word. Given a sequence of training words \( w_1, w_2, \dots, w_T \), the Skip-Gram model aims to maximize the following average log probability:

\[
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j} | w_t)
\]

Where \( c \) is the context window size, \( w_t \) is the target word, and \( w_{t+j} \) are the context words.
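As a minimal sketch of what the double sum ranges over, the snippet below lists the (target, context) pairs for a short token sequence. The function name `skipgram_pairs` and the example sentence are illustrative assumptions, and positions that fall outside the sequence are simply skipped.

```python
def skipgram_pairs(tokens, c):
    """Return (target, context) pairs for every position t and offset j with 0 < |j| <= c."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-c, c + 1):
            # Skip the target itself and offsets that leave the sequence.
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((target, tokens[t + j]))
    return pairs


tokens = ["the", "king", "rules", "the", "kingdom"]
pairs = skipgram_pairs(tokens, c=1)
print(pairs)
```

Each pair contributes one \( \log p(w_{t+j} | w_t) \) term to the objective, so a larger window \( c \) produces more training pairs per target word.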

### CBOW Model

The Continuous Bag of Words (CBOW) model, the second architecture in Word2Vec, works in the opposite manner: it predicts the target word given the context words. The objective function for CBOW is to maximize the probability of the target word given the context:

\[
\frac{1}{T} \sum_{t=1}^{T} \log p(w_t | w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c})
\]
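For comparison with Skip-Gram, here is a small sketch of how CBOW training examples could be assembled: each example pairs the surrounding words with the word in the middle. The helper name `cbow_examples` and the sample sentence are assumptions made for illustration only.

```python
def cbow_examples(tokens, c):
    """Return (context_words, target) examples; the context never contains position t itself."""
    examples = []
    for t, target in enumerate(tokens):
        left = tokens[max(0, t - c):t]       # up to c words before the target
        right = tokens[t + 1:t + 1 + c]      # up to c words after the target
        examples.append((left + right, target))
    return examples


tokens = ["the", "king", "rules", "the", "kingdom"]
examples = cbow_examples(tokens, c=2)
for context, target in examples:
    print(context, "->", target)
```

Near the edges of the sequence the context is shorter than \( 2c \) words, which this sketch handles by truncating the window.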

### Softmax and Negative Sampling

The output of the Skip-Gram or CBOW model is typically passed through a softmax function to calculate the probability distribution over the vocabulary:

\[
p(w_O | w_I) = \frac{\exp(v_{w_O}^\top v_{w_I})}{\sum_{w=1}^{|V|} \exp(v_w^\top v_{w_I})}
\]

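Where \( v_{w_I} \) is the input vector of the input word, \( v_{w_O} \) is the output vector of the output word, and \( |V| \) is the size of the vocabulary. In the original formulation the input and output vectors come from two separate embedding matrices (the output vectors are often written \( v' \)), so each word has two representations during training.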
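The softmax above can be sketched in a few lines of NumPy. This example assumes separate input and output matrices, here called `W_in` and `W_out`, filled with random values; the names and toy sizes are not from any library and serve only to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                        # toy sizes, chosen only for illustration
W_in = rng.normal(size=(vocab_size, dim))     # input vectors, one row per word
W_out = rng.normal(size=(vocab_size, dim))    # output vectors, one row per word


def softmax_probs(input_idx):
    """Return p(w | w_I) for every word w in the vocabulary."""
    scores = W_out @ W_in[input_idx]          # one dot product per vocabulary word
    scores = scores - scores.max()            # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


probs = softmax_probs(input_idx=2)
print(probs, probs.sum())
```

The denominator requires a dot product with every row of `W_out`, which is why the full softmax becomes expensive as the vocabulary grows.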

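Computing the full softmax is expensive because the denominator sums over the entire vocabulary. To train efficiently with large vocabularies, Word2Vec uses techniques such as Hierarchical Softmax and Negative Sampling. Negative Sampling replaces the softmax with a set of binary classification problems: for each observed (input, output) pair, only the vectors of the true output word and of \( k \) sampled negative words are updated, rather than those of the entire vocabulary. For a single pair, the objective to maximize is: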

\[
\log \sigma(v_{w_O}^\top v_{w_I}) + \sum_{i=1}^k \mathbb{E}_{w_i \sim P_n(w)}[\log \sigma(-v_{w_i}^\top v_{w_I})]
\]

Where \( \sigma \) is the sigmoid function, \( k \) is the number of negative samples, and \( P_n(w) \) is the noise distribution.
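The following sketch evaluates this objective for one (input, output) pair, approximating the expectation with a single draw of \( k \) negatives. The word frequencies are made up, and the noise distribution is taken to be the unigram distribution raised to the 3/4 power, the choice reported in the original paper; `W_in`, `W_out`, and the function names are illustrative assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def neg_sampling_objective(v_in, v_out, v_neg):
    """log sigma(v_out . v_in) + sum_i log sigma(-v_neg[i] . v_in)."""
    positive = np.log(sigmoid(v_out @ v_in))
    negative = np.log(sigmoid(-(v_neg @ v_in))).sum()
    return positive + negative


rng = np.random.default_rng(0)
counts = np.array([50, 30, 10, 5, 3, 2], dtype=float)   # made-up word frequencies
noise = counts ** 0.75                                   # unigram^(3/4) noise distribution
noise /= noise.sum()

dim, k = 4, 3
W_in = rng.normal(size=(len(counts), dim))
W_out = rng.normal(size=(len(counts), dim))

input_idx, output_idx = 0, 1
# Real implementations usually resample if a negative equals the true output word;
# this sketch does not.
neg_idx = rng.choice(len(counts), size=k, p=noise)
value = neg_sampling_objective(W_in[input_idx], W_out[output_idx], W_out[neg_idx])
print(value)
```

Only \( k + 1 \) rows of `W_out` are touched per training pair, independent of the vocabulary size.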

### Training

Word2Vec is trained using stochastic gradient descent (SGD) or its variants. Training adjusts the word vectors so that words appearing in similar contexts end up with similar vector representations. A well-known illustration is vector arithmetic: the vector for "king" minus the vector for "man" plus the vector for "woman" lies close to the vector for "queen", that is, "king" - "man" + "woman" ≈ "queen".
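To make the update concrete, here is a hedged sketch of plain SGD on the negative-sampling loss for a single training pair. The gradients are derived by hand from the negated objective; the learning rate, vector sizes, and function names are arbitrary choices for illustration, not values used by any particular implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def loss(v, u_pos, u_neg):
    """Negative of the negative-sampling objective for one (input, output) pair."""
    return -(np.log(sigmoid(u_pos @ v)) + np.log(sigmoid(-(u_neg @ v))).sum())


def sgd_step(v, u_pos, u_neg, lr):
    """One gradient-descent step on the input vector, the output vector, and the negatives."""
    g_pos = sigmoid(u_pos @ v) - 1.0          # scalar error for the true output word
    g_neg = sigmoid(u_neg @ v)                # shape (k,), errors for the negatives
    grad_v = g_pos * u_pos + g_neg @ u_neg
    grad_u_pos = g_pos * v
    grad_u_neg = np.outer(g_neg, v)
    return v - lr * grad_v, u_pos - lr * grad_u_pos, u_neg - lr * grad_u_neg


rng = np.random.default_rng(0)
v = rng.normal(scale=0.1, size=4)             # input vector of the target word
u_pos = rng.normal(scale=0.1, size=4)         # output vector of the observed context word
u_neg = rng.normal(scale=0.1, size=(3, 4))    # output vectors of 3 negative samples

before = loss(v, u_pos, u_neg)
for _ in range(100):
    v, u_pos, u_neg = sgd_step(v, u_pos, u_neg, lr=0.1)
after = loss(v, u_pos, u_neg)
print(before, after)
```

Repeated steps pull the input vector toward the observed context word's output vector and push it away from the negatives, which is the mechanism behind words in similar contexts ending up with similar vectors.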



## Implementation in Python

We'll implement a basic version of Word2Vec using the Gensim library. This implementation will demonstrate the key concepts of Word2Vec, including training a Skip-Gram model on a sample corpus.


```python
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Sample corpus: each sentence is a list of tokens
sentences = [
    ["king", "queen", "man", "woman"],
    ["king", "man", "kingdom"],
    ["queen", "woman", "monarchy"],
    ["man", "woman", "child"],
    ["woman", "queen", "lady"],
    ["man", "king", "lord"]
]

# Train a Skip-Gram model (sg=1); min_count=1 keeps every word in this tiny corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Print word vectors
word_vectors = model.wv
print("Vector for 'king':", word_vectors['king'])
print("Vector for 'queen':", word_vectors['queen'])

# Project the 50-dimensional vectors to 2D with PCA for plotting
words = list(word_vectors.index_to_key)
vectors = word_vectors[words]  # 2D array with one row per word

pca = PCA(n_components=2)
result = pca.fit_transform(vectors)

plt.figure(figsize=(8, 6))
plt.scatter(result[:, 0], result[:, 1])

# Label each point with its word
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))

plt.title('Word2Vec Word Embeddings')
plt.show()
```



## Pros and Cons of Word2Vec

### Advantages
- **Efficient Learning**: The shallow architecture, combined with Negative Sampling or Hierarchical Softmax, trains quickly on large corpora while still capturing semantic relationships between words.
- **Low Dimensionality**: The word vectors are dense and typically have a few hundred dimensions, far fewer than the vocabulary-sized one-hot representations they replace, which keeps downstream computation cheap.
- **Versatility**: Word2Vec can be used for various tasks such as finding word similarities, clustering words, and improving the performance of machine learning models in NLP.

### Disadvantages
- **Context Independence**: Word2Vec assigns each word a single vector regardless of the sentence it appears in, so the different senses of a polysemous word (for example, "bank" as a riverbank or as a financial institution) are merged into one representation.
- **Memory Usage**: The model stores a vector for every word in the vocabulary, so memory grows linearly with both vocabulary size and vector dimension, which becomes significant for large vocabularies.
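A back-of-the-envelope estimate makes the memory point concrete. This sketch assumes 32-bit floats and two weight matrices (input and output vectors) held in memory during training; the vocabulary size and dimension are illustrative figures, and the helper name `embedding_memory_bytes` is hypothetical.

```python
def embedding_memory_bytes(vocab_size, vector_size, bytes_per_float=4, n_matrices=2):
    """Rough size of the embedding matrices, ignoring vocabulary and bookkeeping overhead."""
    return vocab_size * vector_size * bytes_per_float * n_matrices


estimate = embedding_memory_bytes(vocab_size=3_000_000, vector_size=300)
print(f"{estimate / 1e9:.1f} GB")  # prints "7.2 GB"
```

Raising `min_count` to drop rare words or reducing the vector dimension shrinks this footprint proportionally.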



## Conclusion

Word2Vec marked a significant milestone in the field of natural language processing by providing a method to efficiently represent words as vectors in a continuous space. These word vectors capture semantic relationships between words, enabling various downstream tasks in NLP. While Word2Vec has some limitations, such as context independence, its impact on the field is undeniable, and it continues to be widely used in many applications.
