## ***NLP Practical: Word Embeddings (Word2Vec, GloVe, FastText, BERT)***

***Name: Prexit Joshi***

***Roll No.: 118***



This notebook demonstrates how to implement and use various word embedding techniques to represent text as numerical vectors. We will explore:

- Word2Vec
- GloVe
- FastText
- BERT

Word embeddings are a cornerstone of modern NLP, as they capture the semantic meaning and relationships between words. We will use Python's gensim and transformers libraries to implement these techniques.

In [7]:
# Install required libraries
!pip install -q gensim transformers torch

# Import libraries
import gensim
import gensim.downloader as api
from transformers import BertTokenizer, BertModel
import torch

# Example corpus (raw sentences; lowercased and tokenized below)
corpus = [
    "Natural Language Processing is a part of Artificial Intelligence",
    "TF and TF-IDF are important techniques for feature extraction",
    "Feature extraction helps in text mining and information retrieval",
    "TF-IDF reduces the weight of common words in the corpus"
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

## ***1. Word2Vec***

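Word2Vec is a predictive model that learns word vectors from local context. Its CBOW variant predicts a word from its context, while the Skip-gram variant used below (`sg=1`) predicts the context words from a given word. It captures semantic relationships well when trained on a large corpus, but it cannot create vectors for out-of-vocabulary (OOV) words.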
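To make the Skip-gram objective concrete, the sketch below lists the (center, context) training pairs that a fixed window produces for one sentence. The helper name `skipgram_pairs` and the example sentence are illustrative assumptions, not part of gensim; real implementations add details such as subsampling and negative sampling that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context positions lie within `window` steps of the center word.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical example sentence, already lowercased and split.
sentence = "feature extraction helps in text mining".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs[:4]:
    print(center, "->", context)
```

Each pair is one prediction task: given the center word, the model is trained to assign a high score to the context word.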

In [8]:
# Train a Word2Vec (Skip-gram) model
word2vec_model = gensim.models.Word2Vec(sentences=tokenized_corpus, vector_size=50, window=5, min_count=1, sg=1)

# Get vector for a word
print("Word2Vec vector for 'feature':\n", word2vec_model.wv['feature'][:5])

# Find similar words
print("\nWords similar to 'extraction':\n", word2vec_model.wv.most_similar('extraction', topn=2))

Word2Vec vector for 'feature':
 [-0.01724207  0.00733798  0.0103818   0.0115101   0.01493315]

Words similar to 'extraction':
 [('mining', 0.23062904179096222), ('language', 0.22057761251926422)]
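The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that measure in plain Python (the function name `cosine_similarity` and the toy vectors are illustrative, not gensim's internals):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: same direction, orthogonal, and opposite.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Values near 1 indicate vectors pointing the same way, values near 0 indicate unrelated directions. The low scores above (around 0.2) are likely a consequence of training on only four sentences.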


## ***2. GloVe***

GloVe (Global Vectors) is a count-based model that learns vectors from a global word-word co-occurrence matrix. Because building that matrix requires a large corpus, pre-trained vectors are typically used, as they are here.
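The co-occurrence matrix can be illustrated with a small counting sketch. This shows only the raw counting step under simplifying assumptions: the helper `cooccurrence_counts` and the toy sentences are hypothetical, and the original GloVe additionally down-weights distant pairs and fits vectors to the logarithm of the counts.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Hypothetical toy corpus, already lowercased and split.
toy_sentences = [
    "feature extraction helps text mining".split(),
    "feature extraction uses tf-idf".split(),
]
counts = cooccurrence_counts(toy_sentences, window=1)
print(counts[("feature", "extraction")])
print(counts[("feature", "mining")])
```

Pairs that appear together across many sentences accumulate large counts, and those global statistics are what GloVe's vectors are fitted to.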

In [9]:
# Load a pre-trained GloVe model
glove_model = api.load("glove-wiki-gigaword-50")

# Get vector for a word
print("GloVe vector for 'computer':\n", glove_model['computer'][:5])

# Find similar words
print("\nWords similar to 'nlp':\n", glove_model.most_similar('nlp', topn=2))

GloVe vector for 'computer':
 [ 0.079084 -0.81504   1.7901    0.91653   0.10797 ]

Words similar to 'nlp':
 [('hagelin', 0.7022942304611206), ('.760', 0.6916053891181946)]


## ***3. FastText***

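FastText extends Word2Vec by representing each word as a bag of character n-grams. A word's vector is built from the vectors of its n-grams, which lets the model generate vectors for out-of-vocabulary (OOV) words that share n-grams with the training vocabulary.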
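The sketch below shows which character n-grams an unseen word shares with a seen one. The helper `char_ngrams` is illustrative rather than gensim's implementation; it assumes the word is wrapped in `<` and `>` boundary markers and uses n-gram lengths 3 to 6 as defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the boundary-marked word."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# 'extraction' is in the corpus; 'extractions' is not.
seen = set(char_ngrams("extraction", 3, 4))
unseen = set(char_ngrams("extractions", 3, 4))
shared = seen & unseen
print(f"{len(shared)} of {len(unseen)} n-grams of 'extractions' also occur in 'extraction'")
```

Because most n-grams of 'extractions' already occur in 'extraction', the model can assemble a vector for the unseen word from n-gram vectors it learned during training.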

In [10]:
# Train a FastText model
fasttext_model = gensim.models.FastText(sentences=tokenized_corpus, vector_size=50, window=5, min_count=1, sg=1)

# Get vector for an OOV word ('extractions' is not in the corpus)
print("FastText vector for OOV word 'extractions':\n", fasttext_model.wv['extractions'][:5])

FastText vector for OOV word 'extractions':
 [-3.3362366e-03 -1.6197474e-03  9.3988667e-05  2.1759269e-04
 -1.3983970e-03]


## ***4. BERT***

BERT is a transformer-based model that generates contextual embeddings. The vector for a word changes based on its surrounding sentence, allowing it to capture context and nuance.

In [11]:
# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sentences with the same word in different contexts
sentence1 = "Feature extraction is a technique."
sentence2 = "This car has a new safety feature."

# Get contextual embedding for 'feature' in sentence 1
inputs1 = tokenizer(sentence1, return_tensors='pt')
outputs1 = model(**inputs1)
feature_vector1 = outputs1.last_hidden_state[0, 1, :] # 'feature' is the 2nd token

# Get contextual embedding for 'feature' in sentence 2
inputs2 = tokenizer(sentence2, return_tensors='pt')
outputs2 = model(**inputs2)
feature_vector2 = outputs2.last_hidden_state[0, 7, :] # 'feature' is the 8th token: [CLS] this car has a new safety feature

# Compare the two vectors
similarity = torch.nn.functional.cosine_similarity(feature_vector1, feature_vector2, dim=0)
print(f"Similarity between the two 'feature' vectors: {similarity.item():.4f}")
print("(A value less than 1.0 shows the vectors differ)")

The previously recorded output (0.2005) was computed with index 6, which is the 'safety' token rather than 'feature'. Rerun the cell to obtain the similarity between the two 'feature' vectors.
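Hard-coded token positions are easy to get wrong because the tokenizer adds a `[CLS]` token at the start and may split words into sub-word pieces. A safer pattern is to look the position up in the token list. The sketch below uses a hypothetical helper, `find_token_index`, and token lists written out by hand as the output assumed for `bert-base-uncased` on the two sentences above.

```python
def find_token_index(tokens, word):
    """Return the position of the first token equal to `word`."""
    try:
        return tokens.index(word)
    except ValueError:
        raise ValueError(f"{word!r} not found in {tokens}") from None


# Assumed tokenizer output for the two sentences, including special tokens.
tokens1 = ["[CLS]", "feature", "extraction", "is", "a", "technique", ".", "[SEP]"]
tokens2 = ["[CLS]", "this", "car", "has", "a", "new", "safety", "feature", ".", "[SEP]"]

print(find_token_index(tokens1, "feature"))
print(find_token_index(tokens2, "feature"))
```

In the notebook, the token list can be obtained with `tokenizer.convert_ids_to_tokens(inputs2["input_ids"][0].tolist())`, and the resulting index can then be used to select the row of `last_hidden_state`.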


## ***Conclusion***

Word2Vec, GloVe, and FastText produce static embeddings (one vector per word, regardless of context), with FastText's key advantage being its ability to handle OOV words through character n-grams. In contrast, BERT produces contextual embeddings: the vector for a word depends on the sentence it appears in. This generally helps on tasks that require distinguishing between senses of a word, at a higher computational cost.