# Word2Vec with Negative Sampling Implementation

In this Jupyter notebook, we implement the Word2Vec model with **negative sampling**. Word2Vec is a popular algorithm for learning word representations (embeddings) from large text corpora. Negative sampling makes training efficient by replacing the full softmax over the vocabulary with a handful of binary classification problems: telling real (target, context) pairs apart from randomly sampled ones.

We will go through the process of:

1. Preparing the text data for training.
2. Implementing the negative sampling objective function.
3. Training the Word2Vec model.
4. Evaluating the learned word embeddings.

Let's get started with building the model!


In [1]:
!pip install datasets


Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl 

In [2]:
import datasets

train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])



README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [3]:
# Keep only the first two reviews so the demo runs quickly
train_text = train_data[:2]["text"]

In [4]:
train_tokens=[ token for text in train_text for token in text.lower().split()]
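
Splitting on whitespace leaves HTML line breaks and punctuation attached to tokens (the sampled words further down include `'men.<br'` and `'/>i'`). The following is a minimal sketch of a slightly stricter tokenizer; `simple_tokenize` is a hypothetical helper, not part of this notebook, and the input string is a made-up example in the style of the reviews.

```python
import re

def simple_tokenize(text):
    """Lowercase, drop HTML line breaks, keep words with inner apostrophes or hyphens."""
    text = re.sub(r"<br\s*/?>", " ", text.lower())
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text)

# Made-up input, used only to illustrate the behaviour.
sample_tokens = simple_tokenize("I rented I AM CURIOUS-YELLOW.<br /><br />It's fine.")
print(sample_tokens)
```

Swapping this in for `text.lower().split()` would change the vocabulary size reported below, so the rest of the notebook keeps the whitespace split.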

In [13]:
from collections import Counter
vocab=Counter(train_tokens)
w2i={k:i for i, (k,v) in enumerate(vocab.items())}
i2w={v:k for   (k,v) in  w2i.items()}

In [7]:
print(len(w2i))

309


To train Word2Vec with negative sampling, we maximize the likelihood of the observed context words while minimizing the likelihood of randomly sampled negative words. The objective is built up in three steps below.

### Objective Function for Word2Vec with Negative Sampling

Given a target word $w_t$ and a context word $w_c$, the model acts as a logistic regression classifier over word pairs: it outputs the probability that $(w_t, w_c)$ is a genuine target-context pair, and training pushes that probability up for observed pairs.

1. **Positive Pairs (True Context)**
   The probability of a valid context word $w_c$ given the target word $w_t$ is:

   $$
   P(w_c | w_t) = \sigma(v_{w_c}^T v_{w_t})
   $$

   Where:
   - $\sigma(x)$ is the sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$,
   - $v_{w_c}$ is the vector representation of the context word $w_c$,
   - $v_{w_t}$ is the vector representation of the target word $w_t$.
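
As a quick illustration of the formula above, here is a framework-free sketch that scores one pair. The two 3-dimensional vectors are hypothetical values picked by hand; real embeddings are learned and much larger.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vectors, chosen only for illustration.
v_target = [0.5, -1.0, 2.0]
v_context = [1.0, 0.0, 1.0]

score = dot(v_context, v_target)   # 0.5*1.0 + (-1.0)*0.0 + 2.0*1.0 = 2.5
p_positive = sigmoid(score)        # probability that the pair is genuine
print(score, round(p_positive, 4))
```

A larger dot product means the two vectors point in a similar direction, which the sigmoid maps to a probability closer to 1.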



In [8]:
import torch
import torch.nn as nn

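# Two embedding tables, 300 dimensions each: u_c is looked up for the central (target) word,
# u_o for the outside (context) word and for the negative samples.
u_o = nn.Embedding(len(w2i), 300)
u_c = nn.Embedding(len(w2i), 300)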

2. **Negative Sampling**
   Negative sampling introduces random negative samples $w_n$ so that the model learns to distinguish valid word pairs from random noise. For each positive word pair $(w_t, w_c)$, we sample $k$ negative words $w_n$. For each of them, the objective includes the probability that $(w_t, w_n)$ is *not* a genuine pair, written here as:

   $$
   P(w_n | w_t) = \sigma(-v_{w_n}^T v_{w_t})
   $$

   Because $\sigma(-x) = 1 - \sigma(x)$, maximizing this term pushes the similarity between each negative sample and the target word down.
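
The role of the negative sign can be checked numerically. The sketch below, using a few arbitrary example scores, confirms that $\sigma(-x)$ is the complement of $\sigma(x)$, so the negative-sample term is large exactly when the pair's score is low.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Arbitrary example scores (dot products), not taken from the notebook.
for score in (-2.0, 0.0, 2.0):
    p_real = sigmoid(score)     # probability the pair is genuine
    p_noise = sigmoid(-score)   # probability the pair is noise
    # sigma(-x) equals 1 - sigma(x), up to floating-point error
    assert abs(p_noise - (1.0 - p_real)) < 1e-12
    print(f"score={score:+.1f}  P(real)={p_real:.3f}  P(noise)={p_noise:.3f}")
```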



In [18]:
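import nltk
import numpy as np

window_size = 3  # words per sliding window: the central word plus one neighbour on each side
num_sample = 5   # negative samples drawn per positive pair
for text in train_text:
    tokens = text.lower().split()
    for window in nltk.ngrams(tokens, window_size):
        center = window_size // 2
        central_word = window[center]
        # Every other word in the window is a context word.
        context_words = [window[i] for i in range(window_size) if i != center]
        # Uniform noise distribution with the context words masked out.
        # The central word is not masked, so it can occasionally appear among the negatives.
        prob = np.ones(len(w2i))
        for word in context_words:
            prob[w2i[word]] = 0
        for context_word in context_words:
            print('Positive Sample ', (context_word, central_word))
            negative_idxs = np.random.choice(len(w2i), size=num_sample, replace=False, p=prob / np.sum(prob))
            print('Negative sample :', [i2w[idx] for idx in negative_idxs])
    break  # demonstrate on the first review only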

Positive Sample  ('i', 'rented')
Negative sample : ['is', 'shock', 'less', 'porno.', "i've"]
Positive Sample  ('i', 'rented')
Negative sample : ['men.<br', '/>i', 'obvious', 'pile.', 'purposes']
Positive Sample  ('rented', 'i')
Negative sample : ['first', 'appears', 'then', 'come', 'u.s.']
Positive Sample  ('am', 'i')
Negative sample : ['life.', 'explicit', 'first', "it's", 'hardly']
Positive Sample  ('i', 'am')
Negative sample : ['in', 'even', 'anything', 'really,', 'attentions']
Positive Sample  ('curious-yellow', 'am')
Negative sample : ["don't", 'mentally', 'porn', 'am', 'reality']
Positive Sample  ('am', 'curious-yellow')
Negative sample : ['clitoris', 'culturally', 'asking', 'nude,', 'erotica.']
Positive Sample  ('from', 'curious-yellow')
Negative sample : ['treated', 'his', 'for', 'brown', 'making']
Positive Sample  ('curious-yellow', 'from')
Negative sample : ['from', 'wants', 'purposes', 'around', "we're"]
Positive Sample  ('my', 'from')
Negative sample : ['see', 'double', 'th

In [23]:
# For a positive pair
torch.sigmoid(u_c(torch.tensor(w2i[central_word])).dot(u_o(torch.tensor(w2i[context_word]))))

tensor(1., grad_fn=<SigmoidBackward0>)

In [25]:
# For a negative pair
torch.sigmoid(-u_c(torch.tensor(w2i[central_word])).dot(u_o(torch.tensor(negative_idxs[0]))))


tensor(0.9540, grad_fn=<SigmoidBackward0>)

3. **Final Objective Function**
   The final objective function to maximize is:

   $$
   J(w_t, w_c) = \log \sigma(v_{w_c}^T v_{w_t}) + \sum_{n=1}^k \mathbb{E}_{w_n \sim P(w)} \left[ \log \sigma(-v_{w_n}^T v_{w_t}) \right]
   $$

   Where:
   - The first term corresponds to the positive sample (context word),
   - The second term sums over the $k$ negative samples, where each negative sample $w_n$ is drawn from a noise distribution $P(w)$, often a unigram distribution raised to a power: $P(w) = \frac{p(w)^\alpha}{\sum_{w'} p(w')^\alpha}$, with $\alpha = 3/4$ in the original word2vec implementation. The cells in this notebook sample uniformly from the vocabulary instead, which keeps the code short.

By maximizing this objective, the model learns to increase the similarity between the target word vector $v_{w_t}$ and the context word vectors $v_{w_c}$, while decreasing the similarity with negative samples.
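
The smoothed unigram distribution $P(w)$ can be sketched in a few lines. This is an illustration under assumptions: `noise_distribution` is a hypothetical helper, the toy corpus is made up, and raw counts are used in place of $p(w)$ because the normalization makes the two equivalent.

```python
import random
from collections import Counter

def noise_distribution(counts, alpha=0.75):
    """Return {word: P(w)} with P(w) proportional to count(w) ** alpha."""
    weights = {w: c ** alpha for w, c in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

# Toy corpus, made up for illustration.
toy_tokens = "the cat sat on the mat the end".split()
counts = Counter(toy_tokens)
noise = noise_distribution(counts)

# Draw k = 5 negatives (with replacement) from the noise distribution.
rng = random.Random(0)
words = list(noise)
negatives = rng.choices(words, weights=[noise[w] for w in words], k=5)
print(noise)
print(negatives)
```

With $\alpha < 1$, frequent words such as "the" receive a smaller share than their raw frequency, and rare words a larger one, so negatives are not dominated by the most common tokens.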

In [26]:
import torch.optim as optim
optimizer=optim.Adam([u_o.weight, u_c.weight])
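
Optimizers in `torch.optim` minimize the scalar on which `backward()` is called, whereas $J$ is meant to be maximized, so the quantity to minimize is $-J$. The framework-free sketch below evaluates $J$ and the corresponding loss for one pair; the vectors are hypothetical, `sgns_objective` is an illustrative helper, and the expectation over $P(w)$ is approximated by the sampled negatives.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_objective(v_target, v_context, v_negatives):
    """J = log sigma(v_c . v_t) + sum over negatives of log sigma(-v_n . v_t)."""
    j = log_sigmoid(dot(v_context, v_target))
    for v_n in v_negatives:
        j += log_sigmoid(-dot(v_n, v_target))
    return j

# Hypothetical 2-dimensional vectors, chosen only for illustration.
v_t = [1.0, 2.0]
v_c = [1.0, 1.0]                      # aligned with the target
negs = [[-1.0, 0.0], [0.0, -1.0]]     # pointing away from the target

j_good = sgns_objective(v_t, v_c, negs)
loss = -j_good                        # what an optimizer would minimize

# Swapping roles (context points away, negatives align) gives a lower objective.
j_bad = sgns_objective(v_t, [-1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]])
print(j_good, loss, j_bad)
```

The objective is always negative (it is a sum of log-probabilities), so the loss is positive and approaches zero as the model separates real pairs from noise.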

In [28]:
from tqdm import tqdm
import torch.nn.functional as F

window_size = 3
num_sample = 5
for text in train_text:
    tokens = text.lower().split()
    for window in tqdm(nltk.ngrams(tokens, window_size)):
        center = window_size // 2
        central_word = window[center]
        context_words = [window[i] for i in range(window_size) if i != center]

        # Uniform noise distribution with the context words masked out.
        prob = np.ones(len(w2i))
        for word in context_words:
            prob[w2i[word]] = 0
        for context_word in context_words:
            optimizer.zero_grad()
            v_t = u_c(torch.tensor(w2i[central_word]))
            # Positive term: log sigma(v_c . v_t); logsigmoid avoids log(0) when the sigmoid saturates.
            objective = F.logsigmoid(v_t.dot(u_o(torch.tensor(w2i[context_word]))))
            negative_idxs = np.random.choice(len(w2i), size=num_sample, replace=False, p=prob / np.sum(prob))
            for negative_idx in negative_idxs:
                # Negative term: log sigma(-v_n . v_t)
                objective = objective + F.logsigmoid(-v_t.dot(u_o(torch.tensor(negative_idx))))
            # The optimizer minimizes, so the loss is the negated objective J.
            loss = -objective
            loss.backward()
            optimizer.step()


286it [00:01, 176.26it/s]
212it [00:01, 183.31it/s]


In [29]:
# Average the two embedding tables to get a single vector per word
v = (u_o.weight.detach().numpy() + u_c.weight.detach().numpy()) / 2

In [30]:
v.shape

(309, 300)
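
One common way to evaluate learned embeddings is to look at a word's nearest neighbours by cosine similarity. The sketch below uses a small dictionary of made-up vectors so that it runs on its own; `cosine` and `nearest` are hypothetical helpers. For the matrix `v` above, the row `v[w2i[word]]` would play the role of each vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, vectors, topn=2):
    """Return the topn (word, similarity) pairs closest to `word`, excluding itself."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "good": [1.0, 0.9, 0.0],
    "great": [0.9, 1.0, 0.1],
    "bad": [-1.0, -0.8, 0.0],
    "film": [0.0, 0.1, 1.0],
}

neighbours = nearest("good", toy_vectors)
print(neighbours)
```

With only two reviews of training data, the neighbours of the notebook's own embeddings are unlikely to be meaningful; the same check becomes informative on a larger corpus.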

# Importing Pretrained Word2Vec Using Gensim

The Gensim library is a popular Python package for natural language processing tasks, particularly for working with word embeddings such as Word2Vec. Gensim provides a straightforward way to load pretrained Word2Vec models, including Google's pretrained Word2Vec model or others in the `.bin` or `.txt` format.

Here’s a step-by-step guide to import a pretrained Word2Vec model:

## Steps to Import Pretrained Word2Vec

1. **Install Gensim**  
   If you haven't installed Gensim, you can install it using pip:
   ```bash
   pip install gensim
   ```

2. **Download a Pretrained Word2Vec Model**  
   Commonly used pretrained models include:
   - Google's pretrained Word2Vec model: [Google News vectors](https://code.google.com/archive/p/word2vec/)
   - A Hugging Face mirror: [fse/word2vec-google-news-300](https://huggingface.co/fse/word2vec-google-news-300)
   - Other embeddings such as FastText, GloVe, or models trained on specific datasets.

In [32]:
!gdown --id 0B7XkCwpI5KDYNlNUTTlSS21pQmM

Downloading...
From (original): https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
From (redirected): https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&confirm=t&uuid=77b49503-6403-4287-90f7-7098200cb12c
To: /content/GoogleNews-vectors-negative300.bin.gz
100% 1.65G/1.65G [00:38<00:00, 43.1MB/s]


3. **Load the Pretrained Model**  
   Use the `KeyedVectors` class from Gensim to load the pretrained vectors. If the file is in binary format, set `binary=True`; for the text format, leave the default `binary=False`.

   ```python
   from gensim.models import KeyedVectors

   # Path to the pretrained model
   model_path = "path/to/pretrained/word2vec.bin"

   # Load the model
   word2vec_model = KeyedVectors.load_word2vec_format(model_path, binary=True)
   ```

In [33]:
from gensim.models import KeyedVectors

# Path to the pretrained model
model_path = "GoogleNews-vectors-negative300.bin.gz"

# Load the model
word2vec_model = KeyedVectors.load_word2vec_format(model_path, binary=True)

4. **Using the Loaded Model**  
   Once loaded, you can use the model to:
   - Retrieve vector representation of words:
     ```python
     vector = word2vec_model["example"]
     print(vector)
     ```
   - Find most similar words:
     ```python
     similar_words = word2vec_model.most_similar("king", topn=5)
     print(similar_words)
     ```
   - Compute similarity between words:
     ```python
     similarity = word2vec_model.similarity("king", "queen")
     print(similarity)
     ```


In [35]:
vector = word2vec_model["example"]
print(vector.shape)

(300,)


In [36]:
similarity = word2vec_model.similarity("king", "queen")
print(similarity)

0.6510956


In [38]:
word2vec_model["France"]

array([ 4.85839844e-02,  7.86132812e-02,  3.24218750e-01,  3.49121094e-02,
        7.71484375e-02,  3.54003906e-02, -1.25976562e-01, -3.86718750e-01,
       -1.31835938e-01,  2.91748047e-02, -1.44531250e-01, -1.42578125e-01,
        1.79687500e-01, -2.75390625e-01, -1.65039062e-01,  9.32617188e-02,
        1.17187500e-01,  1.82617188e-01,  6.10351562e-02,  1.14257812e-01,
        1.82617188e-01, -1.16699219e-01, -3.24707031e-02, -7.56835938e-02,
        9.64355469e-03,  8.59375000e-02, -2.85156250e-01, -2.55859375e-01,
        3.01513672e-02,  2.16796875e-01, -1.00097656e-01,  2.85644531e-02,
       -2.81250000e-01, -8.39843750e-02, -2.02636719e-02, -1.96289062e-01,
       -4.78515625e-02,  7.12890625e-02, -1.42578125e-01, -1.13525391e-02,
        1.16210938e-01,  7.22656250e-02,  1.47460938e-01,  1.50390625e-01,
        1.40625000e-01,  2.47070312e-01, -1.69921875e-01,  7.76367188e-02,
       -5.44433594e-02,  1.66992188e-01, -1.45507812e-01,  2.12402344e-02,
       -7.51953125e-02,  

In [41]:
word2vec_model["frame"]

array([-2.57568359e-02,  3.57421875e-01, -3.08837891e-02,  8.25195312e-02,
        9.96093750e-02, -1.64062500e-01,  4.80468750e-01, -5.51757812e-02,
        2.27539062e-01, -1.03149414e-02,  4.12597656e-02,  6.25000000e-02,
       -1.75781250e-01,  1.90734863e-03, -2.05078125e-02, -1.70898438e-01,
        5.98144531e-03,  1.89453125e-01, -1.51977539e-02, -1.90429688e-01,
       -9.81445312e-02, -2.95410156e-02, -1.00097656e-01, -1.18408203e-02,
        4.08935547e-03, -5.00488281e-03, -3.02734375e-01,  1.38671875e-01,
        1.27929688e-01,  3.12500000e-02, -1.25000000e-01,  1.52343750e-01,
       -9.70458984e-03,  6.00585938e-02,  2.31933594e-02, -2.96875000e-01,
       -1.79687500e-01,  3.73535156e-02, -1.15234375e-01,  2.77099609e-02,
        2.96875000e-01,  8.39843750e-02,  1.65039062e-01,  6.93359375e-02,
        2.03125000e-01,  3.11279297e-02, -1.05957031e-01,  1.99218750e-01,
        1.96289062e-01,  1.64062500e-01,  3.68652344e-02, -2.22656250e-01,
       -8.34960938e-02, -

In [37]:
similar_words = word2vec_model.most_similar("king", topn=5)
print(similar_words)

[('kings', 0.7138045430183411), ('queen', 0.6510956883430481), ('monarch', 0.6413194537162781), ('crown_prince', 0.6204220056533813), ('prince', 0.6159993410110474)]
