# **Understanding Word Embeddings**

**PyTorch:**

* Advantages: Flexible and can be customized for different tasks. Can be trained on domain-specific data, making the embeddings more relevant to the task at hand.

* Limitations: Requires a large amount of training data and computation time. More complex to implement than using pre-trained embeddings like GloVe.

**FastText:**

* Advantages: Handles out-of-vocabulary words by breaking them into known subwords. Understands word morphology (e.g., "runner" and "running" are related due to the common subwords).

* Limitations: Training is slower compared to Word2Vec. Requires more memory to store embeddings due to subword modeling.

**Word2Vec:**

* Advantages: Captures both syntactic and semantic word relationships.

* Limitations: Cannot handle out-of-vocabulary words. Does not understand word morphology. Must be trained, which requires time and data.

**GloVe:**

* Advantages: Captures global information about words in a corpus. Pre-trained, so it’s fast and does not require training from scratch. Provides high-quality embeddings.

* Limitations: Cannot handle out-of-vocabulary words (new or rare words not seen during training). Does not understand word morphology (e.g., "running" and "runner" are treated as separate words).
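
The "global information" that GloVe builds on is word co-occurrence statistics gathered over the whole corpus. The sketch below shows only that counting step, on a made-up sentence; `cooccurrence_counts` is a hypothetical helper, and the actual GloVe pipeline goes on to weight these counts and fit vectors to them, which is not shown here.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered (word, neighbour) pair occurs within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbour
                counts[(word, tokens[j])] += 1
    return counts

# Illustrative tokens only (not the notebook's corpus)
tokens = "the sport has grown and the sport is complex".split()
counts = cooccurrence_counts(tokens, window=1)

print(counts[("the", "sport")])    # "sport" directly follows "the" twice
print(counts[("has", "complex")])  # never adjacent, so the Counter returns 0
```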

Input: https://f1miamigp.com/news/press-release/beginners-guide-formula-1/

# **PyTorch**

Reference: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#an-example-n-gram-language-modeling

### **N-Gram Language Modeling**

An N-gram language model predicts a word given the previous N - 1 words. A 3-gram (trigram) model, for instance, looks at the previous two words in the sentence to predict the next word.

For example:

Text: "Amy played tennis at the country club today."
If the model sees "Amy played", it might predict "tennis" as the next word because it has learned from training data that this is a common sequence of words.
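
As a minimal sketch of how such training pairs can be built, the snippet below slides a two-word window over the example sentence. `make_ngrams` is a hypothetical helper; it keeps the context in reading order, whereas the notebook cell that follows lists the nearest word first.

```python
def make_ngrams(tokens, context_size=2):
    """Return (context, target) pairs; context is the `context_size` words before the target."""
    return [
        (tokens[i - context_size:i], tokens[i])
        for i in range(context_size, len(tokens))
    ]

tokens = "Amy played tennis at the country club today.".split()
pairs = make_ngrams(tokens)

for context, target in pairs[:3]:
    print(context, "->", target)
# ['Amy', 'played'] -> tennis
# ['played', 'tennis'] -> at
# ['tennis', 'at'] -> the
```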

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Define hyperparameters - trigram model
CONTEXT_SIZE = 2  # 2 words to predict third word
EMBEDDING_DIM = 10

# Input text
text = """Formula 1 is a complex sport. The technical and sporting
regulations extend to over 200 pages and there is a supplementary financial set
of rules for the teams and a drivers sporting code.
Here is a brief guide to explain the third-highest watched global sport,
behind the Olympics and FIFA World Cup. The FIA Formula 1 World Championship
started in 1950. With a series of races held in different locations around
the globe, the aim back then is the same as today. The winner is the first to
cross the finish line and points are awarded based on top ten positions — the
one with the most at the end of the year is crowned World Champion. In 1950
there were seven championship rounds, but as the sport has grown, next year
will feature 24 races. Starting in March and ending in November, the
championship competes across the world with races in Europe, Asia,
the Americas, the Middle East and Australia on a mixture of permanent,
street or hybrid tracks. Four of those venues were on the original schedule
73 years ago: Silverstone, Monaco, Spa and Monza.""".split()

# Build a vocabulary
vocab = set(text)
word_to_ix = {word: i for i, word in enumerate(vocab)}

# Create training data: ([context words], target word)
# For example: (['sporting', 'drivers'], 'code'); the context lists the nearest word first
ngrams = [
    ([text[i - j - 1] for j in range(CONTEXT_SIZE)], text[i])
    for i in range(CONTEXT_SIZE, len(text))
]

# Print first 5 ngrams
print(ngrams[:5])

# Define the model
class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

# Initialize the model
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Train the model
losses = []
for epoch in range(10):
    total_loss = 0
    for context, target in ngrams:
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 1: Zero the gradients
        model.zero_grad()

        # Step 2: Forward pass
        log_probs = model(context_idxs)

        # Step 3: Compute loss
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 4: Backward pass and optimization
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    losses.append(total_loss)
print(losses)

# Get the embedding for a specific word (e.g., 'FIA')
print(model.embeddings.weight[word_to_ix['FIA']])

[(['1', 'Formula'], 'is'), (['is', '1'], 'a'), (['a', 'is'], 'complex'), (['complex', 'a'], 'sport.'), (['sport.', 'complex'], 'The')]
[898.6751246452332, 892.7027866840363, 886.8457572460175, 881.1056125164032, 875.4853818416595, 869.9877555370331, 864.6145172119141, 859.3705060482025, 854.2571942806244, 849.2739715576172]
tensor([ 0.1269,  0.5635,  1.9361, -2.8904, -0.4996,  0.2090, -0.1735,  0.3241,
        -1.0291,  0.0933], grad_fn=<SelectBackward0>)


**Approach:**

* Each word in the vocabulary is initialized with a randomly generated dense vector (embedding).
* To predict the next word, the embedding vectors of the context words are concatenated and passed through two linear layers, with a ReLU activation function applied in between. This produces a score vector with one score (logit) for each word in the vocabulary.
* The score vector is then passed through a log-softmax, which converts the raw scores into log-probabilities; exponentiating them gives probabilities that sum to 1. The word with the highest probability is predicted as the next word in the sequence.

During training, the model adjusts the embeddings and weights of the linear layers to improve its ability to predict the next word across different contexts.
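
The last step, from scores to a prediction and a loss, can be sketched in plain Python. The vocabulary and score values below are made up for illustration; the computation mirrors what `F.log_softmax` followed by `nn.NLLLoss` does for a single example.

```python
import math

def log_softmax(scores):
    """Convert raw scores (logits) into log-probabilities in a numerically stable way."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

# Hypothetical 4-word vocabulary and model scores (assumptions for illustration)
vocab = ["tennis", "golf", "chess", "club"]
scores = [2.0, 1.0, 0.1, -1.0]

log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

# The prediction is the word with the highest probability
predicted = vocab[probs.index(max(probs))]

# Negative log-likelihood of the true next word: small when its probability is high
target = "tennis"
nll = -log_probs[vocab.index(target)]

print(predicted, round(sum(probs), 6), nll)
```

Training lowers this loss by nudging both the embeddings and the linear-layer weights so that the true next word receives a higher score.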






### **Computing Word Embeddings: Continuous Bag-of-Words**

A Continuous Bag-of-Words (CBOW) model predicts a word based on the surrounding context words. Instead of focusing only on the previous words, CBOW considers a few words both before and after the target word to make a prediction.

For example:
Text: "Amy played tennis at the country club today." If the model sees the context words "played" and "at", it might predict "tennis" as the target word because it has learned from the data that these words commonly appear around it.
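
A minimal sketch of building CBOW training pairs for that sentence, assuming one context word on each side (`window=1`); `make_cbow_pairs` is a hypothetical helper, not part of any library.

```python
def make_cbow_pairs(tokens, window=1):
    """Return (context, target) pairs using `window` words on each side of the target."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "Amy played tennis at the country club today.".split()
pairs = make_cbow_pairs(tokens, window=1)

print(pairs[1])  # (['played', 'at'], 'tennis')
```

Words too close to either end of the text are skipped as targets because they lack a full context on one side.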

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Define hyperparameters - CBOW model
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
EMBEDDING_DIM = 10

# Input text (same as N-grams example)
text = """Formula 1 is a complex sport. The technical and sporting
regulations extend to over 200 pages and there is a supplementary financial set
of rules for the teams and a drivers sporting code.
Here is a brief guide to explain the third-highest watched global sport,
behind the Olympics and FIFA World Cup. The FIA Formula 1 World Championship
started in 1950. With a series of races held in different locations around
the globe, the aim back then is the same as today. The winner is the first to
cross the finish line and points are awarded based on top ten positions — the
one with the most at the end of the year is crowned World Champion. In 1950
there were seven championship rounds, but as the sport has grown, next year
will feature 24 races. Starting in March and ending in November, the
championship competes across the world with races in Europe, Asia,
the Americas, the Middle East and Australia on a mixture of permanent,
street or hybrid tracks. Four of those venues were on the original schedule
73 years ago: Silverstone, Monaco, Spa and Monza.""".split()

# Build a vocabulary
vocab = set(text)
word_to_ix = {word: i for i, word in enumerate(vocab)}

# Create training data: ([context words], target word)
# For example: (['1', 'Formula', 'a', 'complex'], 'is'); left context (nearest first), then right context
data = [
    ([text[i - j - 1] for j in range(CONTEXT_SIZE)] +
     [text[i + j + 1] for j in range(CONTEXT_SIZE)], text[i])
    for i in range(CONTEXT_SIZE, len(text) - CONTEXT_SIZE)
]

# Print first 5 context-target pairs
print(data[:5])

# Define the model
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * 2 * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))  # Flatten the embedding
        out = F.relu(self.linear1(embeds))  # Pass through first linear layer with ReLU
        out = self.linear2(out)  # Pass through second linear layer
        log_probs = F.log_softmax(out, dim=1)  # Apply log softmax to get log-probabilities
        return log_probs

# Initialize the model
model = CBOW(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Train the model
losses = []
for epoch in range(10):
    total_loss = 0
    for context, target in data:
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 1: Zero the gradients
        model.zero_grad()

        # Step 2: Forward pass
        log_probs = model(context_idxs)

        # Step 3: Compute loss
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 4: Backward pass and optimization
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    losses.append(total_loss)
print(losses)

# Get the embedding for a specific word (e.g., 'FIA')
print(model.embeddings.weight[word_to_ix['FIA']])

[(['1', 'Formula', 'a', 'complex'], 'is'), (['is', '1', 'complex', 'sport.'], 'a'), (['a', 'is', 'sport.', 'The'], 'complex'), (['complex', 'a', 'The', 'technical'], 'sport.'), (['sport.', 'complex', 'technical', 'and'], 'The')]
[890.5913977622986, 884.0580186843872, 877.6146855354309, 871.2510859966278, 864.9642112255096, 858.7463171482086, 852.5973641872406, 846.5178642272949, 840.5101511478424, 834.5711290836334]
tensor([-0.9175, -0.8865, -0.8957,  0.7067,  0.5507, -0.9028, -0.5304, -1.0181,
        -0.3795,  0.7387], grad_fn=<SelectBackward0>)


**Approach:**

* Each word in the vocabulary is initialized with a randomly generated dense vector (embedding).
* To predict the target word, the embedding vectors of the context words (a fixed number of words before and after the target) are combined into a single context representation. The code above concatenates them with `.view((1, -1))`; the original CBOW formulation sums or averages them instead. The combined context embedding is passed through two linear layers, with a ReLU activation function applied in between, producing a score vector with one score for each word in the vocabulary.
* The score vector is then passed through a log-softmax, which converts the raw scores (logits) into log-probabilities. The word with the highest probability is predicted as the target word.

During training, the model adjusts both the embeddings of the words and the weights of the linear layers to improve its ability to predict the target word given the context across different sentences.
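
There are two common ways to combine context embeddings, and they differ in the size of the vector handed to the first linear layer. The `CBOW` class in the cell above flattens the context embeddings with `.view((1, -1))`, which is concatenation; averaging is the alternative. The sketch below contrasts the two on made-up 3-dimensional vectors.

```python
# Hypothetical 3-dimensional embeddings for four context words (illustrative values)
context_vectors = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [1.0, 1.0, 2.0],
]

# Concatenation: keeps word order, input size grows with the number of context words
concatenated = [x for vec in context_vectors for x in vec]

# Averaging: order-independent ("bag of words"), input size stays at the embedding dimension
averaged = [sum(col) / len(context_vectors) for col in zip(*context_vectors)]

print(len(concatenated))  # 12 = 4 context words * 3 dimensions
print(averaged)           # [1.0, 1.0, 1.0]
```

With concatenation the first linear layer needs `context_size * 2 * embedding_dim` inputs, as in the class above; with averaging it would need only `embedding_dim`.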

# **FastText**

Reference: https://radimrehurek.com/gensim/models/fasttext.html

* FastText breaks down each word into smaller pieces called character n-grams (short sequences of characters). For example, the word "playing" might be broken into parts like "pla", "lay", "ayi", "ing". FastText then learns vector representations for these subwords. So, the word "playing" is represented not only by its own vector but also by the combined vectors of these smaller subword pieces.

* When the model encounters a word it has not seen before, it can still generate an embedding from the word's subwords. Example: the model has never seen the word "unplayable", but it has seen pieces of it (such as "un", "play", "able") inside other words. FastText combines the vectors of these subwords into an approximate vector for "unplayable", which gives the model an idea of its meaning based on similar words.

* FastText’s use of subwords lets it capture the morphological structure of words. Example: if the model has been trained on the word "run", it can still generate meaningful vectors for "running" or "runner", because they share character n-grams with "run" and end in the familiar suffixes "-ing" and "-er".
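
The subword idea can be sketched in a few lines of plain Python. `char_ngrams` is a hypothetical helper that wraps the word in `<` and `>` boundary markers and extracts every n-gram between `min_n` and `max_n` characters (3 and 6 are gensim's defaults). This is only an illustration of the splitting step; the library itself additionally hashes n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("playing", min_n=3, max_n=3)
print(trigrams)
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']

# Subwords shared by a seen word and an unseen word
shared = set(char_ngrams("unplayable")) & set(char_ngrams("playing"))
print(sorted(shared))
# ['lay', 'pla', 'play']
```

The shared n-grams are what allow a vector for an unseen word to be assembled from pieces learned on other words.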

In [None]:
from gensim.models import FastText
from gensim.utils import tokenize

# Input text
text = """Formula 1 is a complex sport. The technical and sporting
regulations extend to over 200 pages and there is a supplementary financial set
of rules for the teams and a drivers sporting code.
Here is a brief guide to explain the third-highest watched global sport,
behind the Olympics and FIFA World Cup. The FIA Formula 1 World Championship
started in 1950. With a series of races held in different locations around
the globe, the aim back then is the same as today. The winner is the first to
cross the finish line and points are awarded based on top ten positions — the
one with the most at the end of the year is crowned World Champion. In 1950
there were seven championship rounds, but as the sport has grown, next year
will feature 24 races. Starting in March and ending in November, the
championship competes across the world with races in Europe, Asia,
the Americas, the Middle East and Australia on a mixture of permanent,
street or hybrid tracks. Four of those venues were on the original schedule
73 years ago: Silverstone, Monaco, Spa and Monza."""

# Tokenize the text
sentences = [list(tokenize(sentence)) for sentence in text.split(".")]

# sg ({0, 1}, optional): training algorithm, 1 for skip-gram; otherwise CBOW

# Skip-gram (sg=1)
skipgram_model = FastText(sentences=sentences, vector_size=10, window=2, min_count=1, sg=1, epochs=10)

# Get vector embedding for a word (e.g., 'Formula')
word_embedding_skipgram = skipgram_model.wv['Formula']
print("Word embedding for 'Formula' using Skip-gram:", word_embedding_skipgram)

# Most similar words to 'Formula'
similar_words_skipgram = skipgram_model.wv.most_similar('Formula')
print("Most similar words to 'Formula' using Skip-gram:", similar_words_skipgram)

# CBOW (sg=0)
cbow_model = FastText(sentences=sentences, vector_size=10, window=2, min_count=1, sg=0, epochs=10)

# Get vector embedding for a word (e.g., 'Formula')
word_embedding_cbow = cbow_model.wv['Formula']
print("Word embedding for 'Formula' using CBOW:", word_embedding_cbow)

# Most similar words to 'Formula'
similar_words_cbow = cbow_model.wv.most_similar('Formula')
print("Most similar words to 'Formula' using CBOW:", similar_words_cbow)


Word embedding for 'Formula' using Skip-gram: [-0.00464474 -0.0172566  -0.0186835  -0.01091931  0.01194055 -0.00579774
 -0.00181203  0.01378844 -0.00608867  0.01172597]
Most similar words to 'Formula' using Skip-gram: [('supplementary', 0.779981791973114), ('venues', 0.7072525024414062), ('complex', 0.6128764748573303), ('started', 0.6115627884864807), ('there', 0.5901592373847961), ('grown', 0.5643189549446106), ('Monaco', 0.533416211605072), ('aim', 0.5325751304626465), ('rounds', 0.5311640501022339), ('FIA', 0.5202236175537109)]
Word embedding for 'Formula' using CBOW: [-0.00472969 -0.01732939 -0.0185458  -0.01086534  0.01204792 -0.00560015
 -0.0018083   0.01361873 -0.00599586  0.01183709]
Most similar words to 'Formula' using CBOW: [('supplementary', 0.7853191494941711), ('venues', 0.706602931022644), ('complex', 0.6094363927841187), ('started', 0.6031332015991211), ('there', 0.5749540328979492), ('grown', 0.5597285628318787), ('aim', 0.5265063643455505), ('Monaco', 0.5247567296028

# **Word2Vec**

Reference: https://radimrehurek.com/gensim/models/word2vec.html

Word2Vec is a model that learns word embeddings by predicting a word based on its surrounding context (CBOW) or predicting surrounding words given a target word (Skip-gram). Unlike FastText, Word2Vec does not break words into subwords; instead, it treats each word as a whole unit and learns a vector representation for each individual word based on its occurrence and co-occurrence with other words in the training data. As a consequence, Word2Vec:
* cannot handle words it has never seen during training.
* cannot capture the morphological structure of words.
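
The first limitation follows directly from the whole-word lookup. The sketch below uses a plain dictionary as a hypothetical stand-in for a trained vocabulary (the words and values are made up). In gensim, the equivalent guard is a membership test such as `word in model.wv` before indexing, since indexing an unseen word raises a `KeyError`.

```python
# Hypothetical whole-word lookup table standing in for a trained Word2Vec vocabulary
word_vectors = {
    "run": [0.2, -0.1, 0.4],
    "race": [0.3, 0.0, 0.5],
}

def lookup(word, table):
    """Return the stored vector, or None when the word was never seen in training."""
    return table.get(word)

print(lookup("run", word_vectors))      # the stored vector
print(lookup("running", word_vectors))  # None: no entry, and no subwords to fall back on
```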

In [None]:
from gensim.models import Word2Vec
from gensim.utils import tokenize

# Input text
text = """Formula 1 is a complex sport. The technical and sporting
regulations extend to over 200 pages and there is a supplementary financial set
of rules for the teams and a drivers sporting code.
Here is a brief guide to explain the third-highest watched global sport,
behind the Olympics and FIFA World Cup. The FIA Formula 1 World Championship
started in 1950. With a series of races held in different locations around
the globe, the aim back then is the same as today. The winner is the first to
cross the finish line and points are awarded based on top ten positions — the
one with the most at the end of the year is crowned World Champion. In 1950
there were seven championship rounds, but as the sport has grown, next year
will feature 24 races. Starting in March and ending in November, the
championship competes across the world with races in Europe, Asia,
the Americas, the Middle East and Australia on a mixture of permanent,
street or hybrid tracks. Four of those venues were on the original schedule
73 years ago: Silverstone, Monaco, Spa and Monza."""

# Tokenize the text
sentences = [list(tokenize(sentence)) for sentence in text.split(".")]

# Skip-gram (sg=1)
skipgram_model_w2v = Word2Vec(sentences=sentences, vector_size=10, window=2, min_count=1, sg=1, epochs=10)

# Get vector embedding for a word (e.g., 'Formula')
word_embedding_skipgram_w2v = skipgram_model_w2v.wv['Formula']
print("Word embedding for 'Formula' using Skip-gram (Word2Vec):", word_embedding_skipgram_w2v)

# Most similar words to 'Formula'
similar_words_skipgram_w2v = skipgram_model_w2v.wv.most_similar('Formula')
print("Most similar words to 'Formula' using Skip-gram (Word2Vec):", similar_words_skipgram_w2v)

# CBOW (sg=0)
cbow_model_w2v = Word2Vec(sentences=sentences, vector_size=10, window=2, min_count=1, sg=0, epochs=10)

# Get vector embedding for a word (e.g., 'Formula')
word_embedding_cbow_w2v = cbow_model_w2v.wv['Formula']
print("Word embedding for 'Formula' using CBOW (Word2Vec):", word_embedding_cbow_w2v)

# Most similar words to 'Formula'
similar_words_cbow_w2v = cbow_model_w2v.wv.most_similar('Formula')
print("Most similar words to 'Formula' using CBOW (Word2Vec):", similar_words_cbow_w2v)

Word embedding for 'Formula' using Skip-gram (Word2Vec): [-0.08530151  0.03223372 -0.04496393 -0.05081124  0.03526921  0.05353492
  0.07798788 -0.05680488  0.0733971   0.06545072]
Most similar words to 'Formula' using Skip-gram (Word2Vec): [('Asia', 0.6907055974006653), ('in', 0.577402651309967), ('over', 0.5724160671234131), ('feature', 0.445841521024704), ('around', 0.444007933139801), ('FIA', 0.43559950590133667), ('there', 0.40092626214027405), ('are', 0.39971092343330383), ('held', 0.3876173198223114), ('Four', 0.3679717481136322)]
Word embedding for 'Formula' using CBOW (Word2Vec): [-0.08518743  0.03203779 -0.04565658 -0.05091595  0.03566262  0.05355391
  0.07801172 -0.05726781  0.07358135  0.06572264]
Most similar words to 'Formula' using CBOW (Word2Vec): [('Asia', 0.6972048878669739), ('in', 0.584405243396759), ('over', 0.5777981281280518), ('around', 0.4524732530117035), ('feature', 0.45227470993995667), ('FIA', 0.4404560625553131), ('there', 0.4188275933265686), ('are', 0.416
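
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure, using made-up vectors rather than the trained ones:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction, 0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # u scaled by 2
orthogonal = [3.0, 0.0, -1.0]     # dot product with u is 0

print(cosine_similarity(u, same_direction))
print(cosine_similarity(u, orthogonal))
```

Because the measure ignores vector length, it compares direction only. On a corpus as small as this one, the resulting rankings probably reflect initialization and chance co-occurrences more than meaning.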

# **GloVe**

Reference: https://www.geeksforgeeks.org/pre-trained-word-embedding-using-glove-in-nlp-models/

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2024-09-30 17:25:05--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-09-30 17:25:06--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-09-30 17:25:06--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip.1’



In [None]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load GloVe embeddings
def load_glove_embeddings(filepath, embedding_dim=50):
    embeddings_index = {}
    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            embeddings_index[word] = np.array(vector, dtype=np.float32)
    return embeddings_index

# Input text
text = """Formula 1 is a complex sport. The technical and sporting
regulations extend to over 200 pages and there is a supplementary financial set
of rules for the teams and a drivers sporting code.
Here is a brief guide to explain the third-highest watched global sport,
behind the Olympics and FIFA World Cup. The FIA Formula 1 World Championship
started in 1950. With a series of races held in different locations around
the globe, the aim back then is the same as today. The winner is the first to
cross the finish line and points are awarded based on top ten positions — the
one with the most at the end of the year is crowned World Champion. In 1950
there were seven championship rounds, but as the sport has grown, next year
will feature 24 races. Starting in March and ending in November, the
championship competes across the world with races in Europe, Asia,
the Americas, the Middle East and Australia on a mixture of permanent,
street or hybrid tracks. Four of those venues were on the original schedule
73 years ago: Silverstone, Monaco, Spa and Monza."""

# Tokenizer setup
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])

# Convert text to sequences (list of word indices)
sequences = tokenizer.texts_to_sequences([text])
word_index = tokenizer.word_index

# Pad sequences to ensure uniform length
padded_sequences = pad_sequences(sequences, padding='post')

print("Word Index: ", word_index)
print("Padded Sequences: ", padded_sequences)

# Load GloVe embeddings (50-dimensional vectors)
glove_embeddings = load_glove_embeddings('glove.6B.50d.txt', embedding_dim=50)

# Get the embedding for a word (e.g., 'Formula')
def get_glove_embedding(word, embeddings):
    return embeddings.get(word, np.zeros(50))

# Example word embedding lookup for 'Formula'
word_embedding_glove = get_glove_embedding('Formula', glove_embeddings)
print("Word embedding for 'Formula' using GloVe:", word_embedding_glove)

# Create an embedding matrix for the tokenizer's vocabulary
def create_embedding_matrix(word_index, embeddings, embedding_dim=50):
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    for word, i in word_index.items():
        embedding_vector = embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

# Create the embedding matrix using GloVe vectors
embedding_matrix = create_embedding_matrix(word_index, glove_embeddings, embedding_dim=50)

print("Embedding Matrix Shape: ", embedding_matrix.shape)

# Example: Get the dense vector for the first word in the vocabulary
first_word = list(word_index.keys())[0]
first_word_embedding = embedding_matrix[word_index[first_word]]
print(f"Dense vector for the word '{first_word}':", first_word_embedding)

* Word Index: This dictionary maps each word to its unique integer index. The Keras `Tokenizer` lowercases the text by default and starts indexing at 1.
* Padded Sequences: These are sequences of word indices from the Word Index, where each number corresponds to a word in the original text. Padding makes all sequences the same length by appending zeros to the shorter ones (`padding='post'`); with a single input text, as here, nothing is actually padded.
* The lookup for 'Formula' returns a zero vector. The glove.6B vectors are uncased, so their vocabulary is lowercase and the capitalized key 'Formula' is not found; looking up 'formula' would return a real vector.
* The shape of the embedding matrix, (120, 50), means the matrix has 120 rows of 50 dimensions each: one row per vocabulary word plus row 0, which is reserved for padding and stays all zeros.
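
One way to avoid the zero vector for capitalized words is to lowercase the key before the lookup. The sketch below assumes a tiny made-up table in place of the parsed `glove.6B.50d.txt` file; `get_embedding` and the 3-dimensional values are hypothetical. It also shows why the embedding matrix has one more row than the vocabulary has words.

```python
# Hypothetical miniature stand-in for the parsed GloVe table (keys are lowercase)
embeddings = {
    "formula": [0.1, 0.2, 0.3],
    "sport": [0.4, 0.5, 0.6],
}
EMBEDDING_DIM = 3

def get_embedding(word, table, dim=EMBEDDING_DIM):
    """Look up `word` after lowercasing; fall back to a zero vector if it is absent."""
    return table.get(word.lower(), [0.0] * dim)

print(get_embedding("Formula", embeddings))  # found via the lowercase key

# Word indices start at 1, as with the Keras tokenizer; row 0 stays zero for padding
word_index = {"formula": 1, "sport": 2, "monza": 3}
matrix = [[0.0] * EMBEDDING_DIM for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    matrix[i] = get_embedding(word, embeddings)

print(len(matrix))  # 4 rows: 3 words + 1 padding row
print(matrix[3])    # "monza" is missing from the toy table, so its row stays zero
```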