


Word embedding is an important concept in Natural Language Processing (NLP) and a significant breakthrough in deep learning that solved many problems. In this article, we’ll look at what pre-trained word embeddings are and how to use them.

Table of Contents
- Word Embeddings
- Challenges in building word embedding from scratch
- Pre Trained Word Embeddings
- Word2Vec
- GloVe
- BERT Embeddings


## Word Embeddings

Word embedding is an approach in Natural Language Processing where raw text is converted into numeric vectors. Because deep learning models only accept numerical input, this step is essential for processing raw text. Each word is represented by a real-valued vector, typically with tens to hundreds of dimensions, and a good embedding captures both the semantic meaning and the context of the word.

There are several ways to turn text into vectors, such as BOW (Bag of Words), TF-IDF, GloVe, and BERT embeddings. The earlier, count-based methods (BOW and TF-IDF) only convert words into numbers without capturing semantic relationships or context. More recent models such as BERT, a pre-trained model, produce embeddings that capture the context of a word as well as its semantic relationships to the other words in the sentence.
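To make the limitation of the count-based methods concrete, here is a minimal bag-of-words sketch in plain Python. The two example sentences and the helper name `bag_of_words` are assumptions made for illustration, not part of any library.

```python
from collections import Counter

def bag_of_words(sentences):
    """Return the sorted vocabulary and one count vector per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({tok for sent in tokenized for tok in sent})
    vectors = []
    for sent in tokenized:
        counts = Counter(sent)
        # One column per vocabulary word, holding how often it occurs
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

# Hypothetical example sentences
sentences = ["the king rules the land", "the queen rules the land"]
vocab, vectors = bag_of_words(sentences)
print(vocab)
for sentence, vector in zip(sentences, vectors):
    print(vector, "<-", sentence)
```

In this representation "king" and "queen" simply occupy two unrelated columns, so nothing in the vectors says that the two words are related. That is the gap that learned embeddings try to close.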

## Challenges in building word embedding from scratch

Training word embeddings from scratch is possible, but it is challenging because of the large number of trainable parameters and the sparsity of training data. These models need to be trained on a large corpus with a rich vocabulary, and the large number of parameters makes training slow. As a result, it’s quite hard for an individual to train a good word embedding model.

## Pre Trained Word Embeddings

The solution to the above problem is to use pre-trained word embeddings. Pre-trained word embeddings are trained on large datasets and capture both the syntactic and the semantic meaning of words. This is a form of transfer learning: you take a model that was trained on a large dataset and reuse it for your own, similar task.

Pre-trained word embeddings fall into two broad classes: word-level and character-level. We’ll look at two word-level embeddings, Word2Vec and GloVe, and at how they can be used to generate embeddings.

## Word2Vec

Word2Vec is one of the most popular pre-trained word embeddings and was developed by Google. The pre-trained model used here was trained on the Google News dataset, which is an extensive corpus. As the name suggests, it represents each word as a vector, i.e. a list of real numbers (300 of them in this model). The vectors are computed so that they reflect the semantic relations between words.

A popular illustration of these semantic relations is the king-queen example:

```
King - Man + Woman ~ Queen
```

Further reading:
- https://jalammar.github.io/illustrated-word2vec/ (the best tutorial on word2vec)
- https://projector.tensorflow.org/
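The arithmetic in the king-queen example can be sketched with a few hand-made vectors. The three-dimensional vectors below are hypothetical and were chosen by hand (roughly "royalty", "male", "female"); real Word2Vec vectors are learned and have hundreds of dimensions. The helper name `analogy` is also an assumption for this sketch.

```python
import numpy as np

# Hypothetical toy vectors: (royalty, male, female), chosen by hand
toy_vectors = {
    "king":   np.array([0.9, 0.9, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.9]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "prince": np.array([0.8, 0.9, 0.1]),
    "apple":  np.array([0.0, 0.2, 0.2]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c], excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

best = analogy("king", "man", "woman", toy_vectors)
print(best)
```

With the real pre-trained model, gensim offers the same kind of query through `most_similar(positive=["king", "woman"], negative=["man"])` on a `KeyedVectors` object.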

Word2Vec is a shallow feed-forward neural network that comes in two variants: Continuous Bag-of-Words (CBOW) and Skip-gram. The CBOW model predicts the target word from its neighbouring words, whereas the skip-gram model predicts the neighbouring words from the target word. The two are mirror images of each other.

First, the size of the context window is defined. The context window is a sliding window that moves through the whole text one word at a time. Its size is the number of words taken on each side of the focus word. For example, if the context window size is set to 2, the window includes the 2 words to the left and the 2 words to the right of the focus word.

The focus word is the target word for which we want to create the embedding (vector representation); it sits in the middle of the context window. The neighbouring words are the words that appear in the context window around it, and they capture the context in which the focus word is used. Let’s understand this with the help of an example.

Suppose we have the sentence “He poured himself a cup of coffee”. The target word here is “himself” and the context window size is 2.

### Continuous Bag-Of-Words

input = [“He”, “poured”, “a”, “cup”]

output = [“himself”]

### Skip-gram model

input = [“himself”]

output = [“He”, “poured”, “a”, “cup”]
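The input/output pairs above can be generated with a few lines of Python. This is only a sketch of how the training examples are formed, not of the network itself; the helper names `context_window`, `cbow_example` and `skipgram_examples` are assumptions for illustration.

```python
def context_window(tokens, index, window):
    """Return the words within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_example(tokens, index, window):
    """CBOW: the context words are the input, the focus word is the output."""
    return context_window(tokens, index, window), tokens[index]

def skipgram_examples(tokens, index, window):
    """Skip-gram: the focus word is the input, each context word is an output."""
    return [(tokens[index], ctx) for ctx in context_window(tokens, index, window)]

tokens = "He poured himself a cup of coffee".split()
focus_index = tokens.index("himself")

cbow_input, cbow_output = cbow_example(tokens, focus_index, window=2)
print(cbow_input, "->", cbow_output)

skipgram_pairs = skipgram_examples(tokens, focus_index, window=2)
for pair in skipgram_pairs:
    print(pair)
```

During training the window slides over every position in the corpus, so each word takes its turn as the focus word.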

To generate word embeddings with the pre-trained Word2Vec model, first download the model bin file (the download link is given in the code comment below). Then import the necessary libraries, such as gensim, which is used to initialise the pre-trained model from the bin file.

In [1]:
!pip install gensim
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
#import gensim library
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
#replace with the path where you have downloaded your model.
# https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
pretrained_model_path = '/content/drive/MyDrive/tutorial_data/GoogleNews-vectors-negative300.bin.gz'
#initialise the pre trained model using load_word2vec_format from gensim module.
word_vectors = KeyedVectors.load_word2vec_format(pretrained_model_path, binary=True)

# Calculate cosine similarity between word pairs
word1 = "early"
word2 = "seats"
#calculate the similarity
similarity1 = word_vectors.similarity(word1, word2)
#print final value
print(similarity1)

word3 = "king"
word4 = "man"
#calculate the similarity
similarity2 = word_vectors.similarity(word3, word4)
#print final value
print(similarity2)

0.03583806
0.22942673


The above code initialises the Word2Vec model using the gensim library and calculates the cosine similarity between word pairs. Cosine similarity ranges from -1 to 1. As you can see, the second value is considerably larger than the first, which means that “king” and “man” are more similar to each other than “early” and “seats” are.
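For reference, cosine similarity can be written as follows, where $\mathbf{u}$ and $\mathbf{v}$ stand for the two word vectors and $d$ runs over their dimensions (these symbols are introduced here for illustration):

$$\text{cosine}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} = \frac{\sum_{d} u_d v_d}{\sqrt{\sum_{d} u_d^2} \, \sqrt{\sum_{d} v_d^2}}$$

A value of 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. The length of the vectors does not affect the result.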

We can also find the words that are most similar to a given word:

In [4]:
# Find the words most similar to "King" (the vocabulary is case-sensitive, so "King" and "king" are different entries)
king = word_vectors.most_similar('King')
print(f'Top 10 most similar words to "King" are : {king}')


Top 10 most similar words to "King" are : [('Jackson', 0.5326348543167114), ('Prince', 0.5306329727172852), ('Tupou_V.', 0.5292826294898987), ('KIng', 0.5227501392364502), ('e_mail_robert.king_@', 0.5173623561859131), ('king', 0.5158917903900146), ('Queen', 0.5157250165939331), ('Geoffrey_Rush_Exit', 0.49920955300331116), ('prosecutor_Dan_Satterberg', 0.49850785732269287), ('NECN_Alison', 0.49128594994544983)]


## GloVe

GloVe, which stands for Global Vectors for Word Representation, was developed at Stanford. It is a popular word embedding model built on the idea of deriving the relationships between words from corpus statistics. It is a count-based model that uses a co-occurrence matrix. A co-occurrence matrix records how often pairs of words occur together across the whole corpus: each entry is the number of times one word appears in the context of another.

GloVe learns a vector space in which the distance between words is linked to their semantic similarity. It combines properties of global matrix factorisation and local context window methods. The model is trained on the global word-word co-occurrence counts from a corpus, and the resulting representations show linear substructures in the vector space.
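For readers who want to see what "training on co-occurrence counts" means, the weighted least-squares objective from the original GloVe formulation can be written as below. Here $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $V$ is the vocabulary size, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that limits the influence of very frequent pairs. This is given as a reference sketch rather than a full derivation.

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

In other words, the dot product of two word vectors is trained to approximate the logarithm of how often the two words co-occur.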

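GloVe starts from the co-occurrence probabilities of word pairs. Let $X_{ik}$ be the number of times word $k$ occurs in the context of word $i$, and let $X_{i} = \sum_{k} X_{ik}$ be the total number of co-occurrences for word $i$. The co-occurrence probability divides the count by that total:

$$P_{ik} = \frac{X_{ik}}{X_{i}}$$

The model is then built around ratios of these probabilities. For two words $i$ and $j$ and a probe word $k$, the word vectors are required to reproduce the ratio:

$$F(w_{i}, w_{j}, w_{k}) = \frac {P_{ik}}{P_{jk}}$$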

For example, the co-occurrence probability of “cat” and “mouse” is calculated as:

Co-occurrence Probability(“cat”, “mouse”) = Count(“cat” and “mouse”) / Total Co-occurrences(“cat”)

In a toy corpus where “cat” co-occurs once with “chases” and once with “mouse”:

Count("cat" and "mouse") = 1

Total Co-occurrences("cat") = 2 (with "chases" and "mouse")

So, Co-occurrence Probability("cat", "mouse") = 1 / 2 = 0.5
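The same calculation can be reproduced in code. The sketch below assumes a toy corpus consisting of the single sentence "cat chases mouse" and a context window of 2, which is one corpus consistent with the counts above; the helper names are hypothetical.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window):
    """Count how often each word appears within `window` positions of another word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def cooccurrence_probability(counts, word, context):
    """P(context | word) = count(word, context) / total co-occurrences of word."""
    total = sum(counts[word].values())
    return counts[word][context] / total if total else 0.0

corpus = ["cat chases mouse"]  # assumed toy corpus
counts = cooccurrence_counts(corpus, window=2)
p_cat_mouse = cooccurrence_probability(counts, "cat", "mouse")
print(p_cat_mouse)
```

On a real corpus the counts matrix is very large and sparse, which is why GloVe is trained on the non-zero entries only.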


### GloVe Model Building

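First, download the GloVe 6B embeddings from the official GloVe project page. Then unzip the archive and place the file where your code can read it (the example below reads it from Google Drive, so adjust `file_path` to your own location). The 6B model comes in several dimensionalities; we’ll be using glove.6B.50d.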

In [None]:
!pip install numpy scipy

In [58]:
import numpy as np
from scipy.spatial.distance import cosine

# Load GloVe embeddings
def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

def find_top_n_nearest_words(target_word, embeddings, n=5):
    """
    Find the top N nearest words to the target word based on cosine similarity.

    Args:
        target_word (str): The target word.
        embeddings (dict): A dictionary of word embeddings (word -> vector).
        n (int): Number of nearest words to return.

    Returns:
        list: A list of tuples (word, similarity) for the top N nearest words.
    """
    if target_word not in embeddings:
        print(f"'{target_word}' not found in the vocabulary.")
        return None

    target_vector = embeddings[target_word]
    similarities = []

    for word, vector in embeddings.items():
        if word == target_word:
            continue  # Skip the target word itself
        similarity = 1 - cosine(target_vector, vector)  # Cosine similarity
        similarities.append((word, similarity))

    # Sort by similarity in descending order
    similarities.sort(key=lambda x: x[1], reverse=True)

    # Return the top N nearest words
    return similarities[:n]

# Calculate cosine similarity
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

# Normalize vectors (optional but recommended)
def normalize_vectors(embeddings):
    for word in embeddings:
        embeddings[word] /= np.linalg.norm(embeddings[word])
    return embeddings


# Example usage
file_path = '/content/drive/MyDrive/tutorial_data/glove.6B.50d.txt'
embeddings = load_glove_embeddings(file_path)
embeddings = normalize_vectors(embeddings)

In [59]:
target_word = "king"
top_n_words = find_top_n_nearest_words(target_word, embeddings, n=10)

if top_n_words:
    print(f"Top {len(top_n_words)} nearest words to '{target_word}':")
    for word, similarity in top_n_words:
        print(f"{word}: {similarity:.4f}")

Top 10 nearest words to 'king':
prince: 0.8236
queen: 0.7839
ii: 0.7746
emperor: 0.7736
son: 0.7667
uncle: 0.7627
kingdom: 0.7542
throne: 0.7540
brother: 0.7492
ruler: 0.7434
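The loop in `find_top_n_nearest_words` computes one similarity at a time, which is slow for a vocabulary of this size. A vectorised variant could look like the sketch below. It uses a small hypothetical dictionary of toy vectors so that it runs on its own; the function name `top_n_nearest` is an assumption, and the same function should also accept the `embeddings` dictionary loaded above.

```python
import numpy as np

def top_n_nearest(target_word, embeddings, n=5):
    """Vectorised nearest-neighbour search using cosine similarity."""
    if target_word not in embeddings:
        return None
    words = list(embeddings)
    matrix = np.stack([embeddings[w] for w in words]).astype("float32")
    # Normalise rows so that a dot product equals the cosine similarity
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    target = matrix[words.index(target_word)]
    scores = matrix @ target
    order = np.argsort(-scores)  # indices sorted by descending similarity
    results = [(words[i], float(scores[i])) for i in order if words[i] != target_word]
    return results[:n]

# Hypothetical toy embeddings, chosen by hand
toy = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "apple": np.array([0.1, 0.0, 0.9]),
}
nearest = top_n_nearest("king", toy, n=2)
print(nearest)
```

A single matrix-vector product replaces the Python loop, at the cost of holding the whole embedding matrix in memory.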


### Exercise
Calculate the cosine similarity for the word pairs from the previous examples.

## BERT Embeddings

Another important pre-trained model is BERT (Bidirectional Encoder Representations from Transformers), a transformer-based model developed by Google. It can be used to extract high-quality language features from raw text, or it can be fine-tuned on your own data to perform specific tasks.

BERT’s architecture consists only of transformer encoders. Its input is a sequence of tokens, and each token is represented as the sum of three embeddings: a token embedding, a segment embedding, and a position embedding. The main idea of pre-training is to mask a few words in a sentence and task the model with predicting the masked words from the surrounding context.
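The masking step can be illustrated with a simplified sketch in plain Python. It only shows how masked inputs and their labels are formed. The function name `mask_tokens` and the masking probability used in the example are assumptions; the actual BERT pre-training masks about 15% of the tokens and sometimes substitutes a random or unchanged token instead of `[MASK]`, which is left out here.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and remember the originals."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}  # position -> original token the model has to predict
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue  # special tokens are never masked
        if rng.random() < mask_prob:
            masked[i] = "[MASK]"
            labels[i] = token
    return masked, labels

tokens = "[CLS] the king and queen ruled the kingdom [SEP]".split()
# A higher probability than usual so that this short sentence is likely to get a mask
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

The model sees `masked` as input and is trained to recover the tokens stored in `labels`, using the words on both sides of each masked position.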

BERT

First, install the transformers library; we’ll be using PyTorch and transformers for this implementation.

In [None]:
!pip install transformers

In [55]:
import torch
from transformers import BertTokenizer, BertModel
from scipy.spatial.distance import cosine
import numpy as np

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

# Function to get BERT embeddings for a word in a sentence
def get_bert_embeddings(text, target_word):
    # Tokenize input text
    marked_text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
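    # Single sentence: every token belongs to segment 0
    segments_ids = [0] * len(tokenized_text)

    # Convert to PyTorch tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensor = torch.tensor([segments_ids])

    # Get embeddings from BERT
    with torch.no_grad():
        # Pass the segment ids by keyword: the second positional argument of the model is attention_mask
        outputs = model(tokens_tensor, token_type_ids=segments_tensor)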
        hidden_states = outputs.hidden_states  # Tuple of all hidden states

    # Extract embeddings for the target word
    # Use the last 4 layers (common practice for BERT embeddings)
    token_embeddings = torch.stack(hidden_states[-4:], dim=0)
    token_embeddings = torch.mean(token_embeddings, dim=0)  # Average over layers
    token_embeddings = torch.squeeze(token_embeddings, dim=0)  # Remove batch dimension

    # Find the index of the target word
    try:
        target_index = tokenized_text.index(target_word)
    except ValueError:
        print(f"'{target_word}' not found in the tokenized text.")
        return None

    # Get the embedding for the target word
    target_embedding = token_embeddings[target_index].numpy()
    return target_embedding

# Function to compute cosine similarity
def cosine_similarity(vec1, vec2):
    return 1 - cosine(vec1, vec2)

# Example usage
text = "The king and queen ruled the kingdom."
word1 = "king"
word2 = "queen"

# Get embeddings for "king" and "queen"
embedding_king = get_bert_embeddings(text, word1)
embedding_queen = get_bert_embeddings(text, word2)

# Compute cosine similarity
similarity = cosine_similarity(embedding_king, embedding_queen)
print(f"Cosine similarity between '{word1}' and '{word2}': {similarity:.4f}")

Cosine similarity between 'king' and 'queen': 0.7545


In [56]:
# Get embeddings for "king" and "queen" from two separate but parallel sentences
embedding_king = get_bert_embeddings("The king ruled the kingdom.", "king")
embedding_queen = get_bert_embeddings("The queen ruled the kingdom.", "queen")

# Compute cosine similarity
similarity = 1 - cosine(embedding_king, embedding_queen)
print(f"Cosine similarity between 'king' and 'queen': {similarity:.4f}")

Cosine similarity between 'king' and 'queen': 0.8322


In [57]:
# Example usage
text1 = "The king ruled the kingdom."
text2 = "I ate an apple for breakfast."

# Get embeddings for "king" and "apple"
embedding_king = get_bert_embeddings(text1, "king")
embedding_apple = get_bert_embeddings(text2, "apple")

# Compute cosine similarity
similarity = cosine_similarity(embedding_king, embedding_apple)
print(f"Cosine similarity between 'king' and 'apple': {similarity:.4f}")

Cosine similarity between 'king' and 'apple': 0.3491
