# Word2Vec with Negative Sampling Implementation

In this Jupyter notebook, we will implement the Word2Vec model using **negative sampling**. Word2Vec is a popular algorithm for learning word representations (embeddings) from large text corpora. Negative sampling makes training efficient by replacing the full softmax over the vocabulary with a handful of binary classification tasks: telling observed word pairs apart from randomly sampled ones.

We will go through the process of:

1. Preparing the text data for training.
2. Implementing the negative sampling objective function.
3. Training the Word2Vec model.
4. Evaluating the learned word embeddings.

Let's get started with building the model!


Training Word2Vec with negative sampling means maximizing the likelihood of the observed context words while minimizing the likelihood of randomly sampled negative words. The objective function is built up step by step in the section that follows.

### Objective Function for Word2Vec with Negative Sampling

Given a target word $w_t$ and a context word $w_c$, a logistic regression model scores the pair: it outputs the probability that $(w_t, w_c)$ is a valid target-context pair, that is, one actually observed in the corpus. Training maximizes this probability for observed pairs.
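
The target-context pairs mentioned above are usually extracted from tokenized text with a sliding window. The following is a minimal sketch of that step, not part of the original notebook: the function name `skipgram_pairs` and the `window` parameter are hypothetical, and the window size of 1 is chosen only to keep the output short.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours.

    `window` is the number of positions considered on each side of the
    target (an illustrative assumption, not a value from the notebook).
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

Each printed pair is one positive example $(w_t, w_c)$ for the objective described next.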

1. **Positive Pairs (True Context)**
   The probability that $(w_t, w_c)$ is a valid pair, written with a label $D = 1$, is:

   $$
   P(D = 1 \mid w_t, w_c) = \sigma(v_{w_c}^T v_{w_t})
   $$

   Where:
   - $\sigma(x)$ is the sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$,
   - $v_{w_c}$ is the vector representation of the context word $w_c$,
   - $v_{w_t}$ is the vector representation of the target word $w_t$.

2. **Negative Sampling**
   Negative sampling introduces random negative samples $w_n$ to train the model to distinguish valid word pairs from random noise. For each positive word pair $(w_t, w_c)$, we draw $k$ negative samples $w_n$. For each of them, the objective includes the probability that $(w_t, w_n)$ is *not* a valid pair ($D = 0$):

   $$
   P(D = 0 \mid w_t, w_n) = 1 - \sigma(v_{w_n}^T v_{w_t}) = \sigma(-v_{w_n}^T v_{w_t})
   $$

   The negative sign comes from the identity $1 - \sigma(x) = \sigma(-x)$. Maximizing this term pushes the model to reduce the similarity between negative samples and the target word.



3. **Final Objective Function**
   The final objective function to maximize is:

   $$
   J(w_t, w_c) = \log \sigma(v_{w_c}^T v_{w_t}) + \sum_{n=1}^k \mathbb{E}_{w_n \sim P(w)} \left[ \log \sigma(-v_{w_n}^T v_{w_t}) \right]
   $$

   Where:
   - The first term corresponds to the positive sample (context word),
   - The second term sums over the $k$ negative samples, where each negative sample $w_n$ is drawn from a noise distribution $P(w)$. This is often the unigram distribution $p(w)$ raised to a power, $P(w) = \frac{p(w)^\alpha}{\sum_{w'} p(w')^\alpha}$, with $\alpha = 3/4$ in the original Word2Vec paper.
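
The smoothed unigram distribution can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the helper names `noise_distribution` and `sample_negatives` are hypothetical, `alpha=0.75` is the commonly used exponent, and skipping words listed in `exclude` is a simplification chosen here rather than a rule taken from the notebook.

```python
import random
from collections import Counter


def noise_distribution(tokens, alpha=0.75):
    """Map each word to p(w)^alpha / sum_w' p(w')^alpha."""
    counts = Counter(tokens)
    total = sum(counts.values())
    weights = {w: (c / total) ** alpha for w, c in counts.items()}
    norm = sum(weights.values())
    return {w: wt / norm for w, wt in weights.items()}


def sample_negatives(dist, k, exclude, rng):
    """Draw k negative words from `dist`, skipping any word in `exclude`.

    Assumes at least one word in `dist` is not excluded.
    """
    words = list(dist)
    probs = [dist[w] for w in words]
    negatives = []
    while len(negatives) < k:
        candidate = rng.choices(words, weights=probs, k=1)[0]
        if candidate not in exclude:
            negatives.append(candidate)
    return negatives


tokens = "the cat sat on the mat the end".split()
dist = noise_distribution(tokens)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
negatives = sample_negatives(dist, k=3, exclude={"cat", "sat"}, rng=rng)
print(dist)
print(negatives)
```

Raising the frequencies to a power below 1 flattens the distribution: here "the" has a raw frequency of 3/8 but a smaller probability under `dist`, so rarer words are sampled as negatives somewhat more often than their raw counts would suggest.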

By maximizing this objective, the model learns to increase the similarity between the target word vector $v_{w_t}$ and the context word vectors $v_{w_c}$, while decreasing the similarity with negative samples.
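
To make the objective concrete, here is a minimal pure-Python sketch that evaluates $J$ for one positive pair and performs a single gradient-ascent step. It is a sketch under assumptions, not the notebook's training loop: the expectation is replaced by a fixed list of already-sampled negative vectors, the 3-dimensional vectors and the learning rate `lr=0.1` are made up for illustration, and the function names are hypothetical. The gradients use $\frac{d}{dx}\log\sigma(x) = 1 - \sigma(x)$ and $\frac{d}{dx}\log\sigma(-x) = -\sigma(x)$.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(v_t, v_c, v_negs):
    """J = log sigma(v_c . v_t) + sum over negatives of log sigma(-v_n . v_t)."""
    j = math.log(sigmoid(dot(v_c, v_t)))
    for v_n in v_negs:
        j += math.log(sigmoid(-dot(v_n, v_t)))
    return j


def ascent_step(v_t, v_c, v_negs, lr=0.1):
    """One gradient-ascent step on J; returns updated copies of all vectors."""
    g_pos = 1.0 - sigmoid(dot(v_c, v_t))      # dJ / d(v_c . v_t)
    grad_t = [g_pos * c for c in v_c]
    new_c = [c + lr * g_pos * t for c, t in zip(v_c, v_t)]
    new_negs = []
    for v_n in v_negs:
        g_neg = -sigmoid(dot(v_n, v_t))       # dJ / d(v_n . v_t)
        grad_t = [g + g_neg * n for g, n in zip(grad_t, v_n)]
        new_negs.append([n + lr * g_neg * t for n, t in zip(v_n, v_t)])
    new_t = [t + lr * g for t, g in zip(v_t, grad_t)]
    return new_t, new_c, new_negs


# Hypothetical embeddings, chosen only for illustration.
v_t = [0.5, -0.2, 0.1]
v_c = [0.4, -0.1, 0.3]
v_negs = [[0.3, 0.2, -0.1], [-0.2, 0.4, 0.2]]

before = objective(v_t, v_c, v_negs)
v_t, v_c, v_negs = ascent_step(v_t, v_c, v_negs)
after = objective(v_t, v_c, v_negs)
print(f"J before: {before:.4f}, J after: {after:.4f}")
```

The objective is a sum of log-probabilities, so it is always negative; a successful step moves it closer to zero.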

# Importing Pretrained Word2Vec Using Gensim

The Gensim library is a popular Python package for natural language processing tasks, particularly for working with word embeddings such as Word2Vec. Gensim provides a straightforward way to load pretrained Word2Vec models, including Google's pretrained Word2Vec model or others in the `.bin` or `.txt` format.
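
For orientation, the plain-text variant of the word2vec format starts with a header line holding the vocabulary size and the vector dimension, followed by one line per word with its space-separated values. The sketch below parses that layout by hand to show what a `.txt` file contains. It is an illustration only, not Gensim's loader: the function name `read_word2vec_text` and the two sample vectors are made up.

```python
import io


def read_word2vec_text(handle):
    """Parse the plain-text word2vec layout into a {word: [floats]} dict."""
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match file contents")
    return vectors


# A tiny in-memory "file": 2 words, 3 dimensions each.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")
vectors = read_word2vec_text(sample)
print(vectors["king"])
```

The binary `.bin` variant stores the same header and words but writes the vector values as raw floats, which is why the loader needs to be told which variant it is reading.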

Here’s a step-by-step guide to import a pretrained Word2Vec model:

## Steps to Import Pretrained Word2Vec

1. **Install Gensim**  
   If you haven't installed Gensim, you can install it using pip:
   ```bash
   pip install gensim
   ```

2. **Download a Pretrained Word2Vec Model**  
   Commonly used pretrained models include:
   - Google's pretrained Word2Vec model: [Google News vectors](https://code.google.com/archive/p/word2vec/)
   - The Google News vectors on Hugging Face: https://huggingface.co/fse/word2vec-google-news-300
   - Other embeddings such as FastText, GloVe, or models trained on specific datasets.

3. **Load the Pretrained Model**  
   Use the `KeyedVectors` class from Gensim to load the pretrained model. If the file is in the binary word2vec format, set `binary=True`. For the text format, use `binary=False`, which is the default.

   ```python
   from gensim.models import KeyedVectors

   # Path to the pretrained model
   model_path = "path/to/pretrained/word2vec.bin"

   # Load the model
   word2vec_model = KeyedVectors.load_word2vec_format(model_path, binary=True)
   ```

4. **Using the Loaded Model**  
   Once loaded, you can use the model to:
   - Retrieve the vector representation of a word (a word missing from the vocabulary raises `KeyError`):
     ```python
     vector = word2vec_model["example"]
     print(vector)
     ```
   - Find most similar words:
     ```python
     similar_words = word2vec_model.most_similar("king", topn=5)
     print(similar_words)
     ```
   - Compute similarity between words:
     ```python
     similarity = word2vec_model.similarity("king", "queen")
     print(similarity)
     ```
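
The value returned by `similarity` is the cosine similarity between the two word vectors. The sketch below computes the same quantity by hand in pure Python to show what the number means. The 3-dimensional vectors are made up for illustration; real pretrained vectors typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, not taken from any pretrained model.
king = [0.5, 0.7, 0.1]
queen = [0.45, 0.75, 0.2]
apple = [-0.6, 0.1, 0.8]

sim_king_queen = cosine_similarity(king, queen)
sim_king_apple = cosine_similarity(king, apple)
print(f"king/queen: {sim_king_queen:.3f}")
print(f"king/apple: {sim_king_apple:.3f}")
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are close to orthogonal, and negative values mean they point in opposing directions.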
