<a href="https://colab.research.google.com/github/ovieimara/ITNPAI1/blob/master/AI1_practical_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook explores several embedding models. There are many models, and choosing the right one for a task can be difficult. In this practical you will explore static embedding models (Word2Vec, FastText) as well as a dynamic, contextual embedding model (BERT).

In [1]:
%%capture
!pip install gensim transformers torch numpy scikit-learn sentence-transformers

##Downloading Word2Vec may take 10 minutes. Take this time to review today's lecture on embeddings so that everything is clear for this session, in particular 1) the concept of semantic analogy and 2) semantic similarity with cosine distance
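While the download runs, here is a minimal sketch of cosine similarity computed by hand with NumPy. The vectors are made up for illustration and `cosine_sim` is a name introduced here; it is not part of the notebook's later cells.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 2-D vectors (invented, not real embeddings)
a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, different length
c = [0.0, 3.0]   # orthogonal to a

print(cosine_sim(a, b))        # 1.0: direction matters, length does not
print(cosine_sim(a, c))        # 0.0: orthogonal vectors
print(1.0 - cosine_sim(a, c))  # cosine distance = 1 - cosine similarity
```

The rest of the notebook uses `cosine_similarity` from scikit-learn, which computes the same quantity for batches of vectors.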

#Part I: Word Analogy with Embeddings

In [2]:
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
import gensim.downloader as api

# Download and load the pre-trained Google News Word2Vec model (300-dimensional vectors)
word2vec_model = api.load('word2vec-google-news-300')



Let's define word analogy based on existing Word2Vec functions and test it first with the "textbook" analogy you have also seen in the lecture.

In [3]:
def word_analogy(model, positive, negative, topn=5):
    """
    Finds words that fit the analogy: positive1 - negative1 + positive2 = ?
    :param model: Word2Vec model
    :param positive: List of positive words (e.g., ['king', 'woman'])
    :param negative: List of negative words (e.g., ['man'])
    :param topn: Number of closest results to return
    """
    return model.most_similar(positive=positive, negative=negative, topn=topn)

In [6]:
# Example analogy: king - man + woman = queen
result = word_analogy(word2vec_model, positive=['king', 'woman'], negative=['man'])
print("Word Analogy Results:", result)

Word Analogy Results: [('queen', 0.7118193507194519), ('monarch', 0.6189674139022827), ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321839332581)]
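
To see roughly what happens behind `most_similar`, here is a sketch on a toy vocabulary. It assumes the usual recipe: add the unit-length positive vectors, subtract the negative ones, then rank the remaining words by cosine similarity, leaving out the input words. The 3-D vectors and the names `toy_vectors`, `unit` and `toy_analogy` are invented for illustration; this is not gensim's actual implementation.

```python
import numpy as np

# Toy 3-D vectors invented for illustration; the real model uses 300 dimensions.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def toy_analogy(vectors, positive, negative, topn=3):
    # Build the query vector: sum of positives minus sum of negatives
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    # The input words are excluded, otherwise they would usually rank first
    exclude = set(positive) | set(negative)
    scores = [(w, float(np.dot(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(toy_analogy(toy_vectors, positive=["king", "woman"], negative=["man"]))
```

With these toy vectors 'queen' ranks above 'apple'. Excluding the input words matters: without it, 'king' itself would often be the nearest neighbour of king - man + woman.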


Let's define more analogies. The format is:

*   positive = two words for instance 'king' and 'woman'
*   negative = one word, which is associated with the first positive word and is "subtracted" so that it can be replaced by the second positive word. For instance 'man' is associated with 'king' and is replaced by 'woman', with an expected result of 'queen'
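
It is easy to put the words in the wrong slots. A small helper can build the arguments from the usual reading "a is to b as c is to ?", which corresponds to b - a + c. `analogy_args` is a hypothetical name introduced here for illustration.

```python
def analogy_args(a, b, c):
    """'a is to b as c is to ?'  ->  keyword arguments for b - a + c."""
    return {"positive": [b, c], "negative": [a]}

# man is to king as woman is to ?
print(analogy_args("man", "king", "woman"))
# {'positive': ['king', 'woman'], 'negative': ['man']}

# france is to paris as italy is to ?
print(analogy_args("france", "paris", "italy"))
# {'positive': ['paris', 'italy'], 'negative': ['france']}
```

The result can be passed straight to the function defined above, for example `word_analogy(word2vec_model, **analogy_args("man", "king", "woman"))`.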



In [4]:
analogies = [
    (['king', 'woman'], ['man']),  # king - man + woman = queen
    (['paris', 'italy'], ['france']),  # paris - france + italy = rome
    (['walking', 'swam'], ['walked']),  # walking - walked + swam = swimming
    (['big', 'bigger'], ['small']),  # big - small + bigger (for 'smaller', use small - big + bigger)
    (['doctor', 'woman'], ['man']),  # doctor - man + woman = nurse (semantic bias example)
]

In [5]:
for positive, negative in analogies:
    result = word_analogy(word2vec_model, positive, negative)
    print(f"{positive} - {negative} = {result[0][0]} (Score: {result[0][1]:.4f})")

['king', 'woman'] - ['man'] = queen (Score: 0.7118)
['paris', 'italy'] - ['france'] = lohan (Score: 0.5070)
['walking', 'swam'] - ['walked'] = swimming (Score: 0.7449)
['big', 'bigger'] - ['small'] = biggest (Score: 0.5590)
['doctor', 'woman'] - ['man'] = gynecologist (Score: 0.7094)


Let's try with another set. Record the cases where the analogy works and those where it doesn't. When it doesn't work, it may not be because the analogy is wrong but because of limitations of the model (Word2Vec is rather old now, and is a static model). Note also that this Word2Vec vocabulary is case-sensitive, so 'paris' and 'Paris' are different entries.
You can also try with your own examples!
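
To record systematically whether an analogy works, you can check where the expected word appears in the returned list rather than only looking at the top result. A minimal sketch follows; `analogy_hit` is a hypothetical helper, and `recorded` is the king/queen result printed earlier in this notebook.

```python
def analogy_hit(results, expected, topn=5):
    """Return the 1-based rank of `expected` in `results`, or None if it is absent."""
    for rank, (word, _score) in enumerate(results[:topn], start=1):
        # Compare case-insensitively, since the model returns entries such as 'France'
        if word.lower() == expected.lower():
            return rank
    return None

# Output of the king - man + woman example above
recorded = [('queen', 0.7118193507194519), ('monarch', 0.6189674139022827),
            ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072),
            ('prince', 0.5377321839332581)]

print(analogy_hit(recorded, "queen"))    # 1
print(analogy_hit(recorded, "empress"))  # None
```

A rank of 2 or 3 is often still a reasonable answer, so it is worth noting the rank and not just hit or miss.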

In [7]:
analogies = [
    (['king', 'woman'], ['man']),       # king - man + woman = queen
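    (['paris', 'france'], ['london']),  # paris + france - london (intended: france - paris + london = england)
    (['japan', 'sushi'], ['italy']),    # japan + sushi - italy (intended: sushi - japan + italy = pizza)
    (['fast', 'faster'], ['slow']),     # fast + faster - slow (intended: slow - fast + faster = slower)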
]

In [8]:
for positive, negative in analogies:
    result = word_analogy(word2vec_model, positive, negative)
    print(f"{positive} - {negative} = {result[0][0]} (Score: {result[0][1]:.4f})")

['king', 'woman'] - ['man'] = queen (Score: 0.7118)
['paris', 'france'] - ['london'] = France (Score: 0.5404)
['japan', 'sushi'] - ['italy'] = Sushi (Score: 0.5640)
['fast', 'faster'] - ['slow'] = quicker (Score: 0.5878)


##Let's try with a more recent embedding model, FastText. It is still a static embedding model, so it doesn't account for the different meanings of a word based on context. But it handles out-of-vocabulary words better than Word2Vec and captures morphological nuances (e.g., prefixes and suffixes).
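
FastText captures morphology by representing a word through its character n-grams, with `<` and `>` marking the word boundaries. The sketch below only lists the n-grams; it assumes the commonly cited default range of 3 to 6 characters, and `char_ngrams` is a name introduced here, not a FastText function.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Related word forms share many n-grams, which is how morphology is captured
shared = set(char_ngrams("walking")) & set(char_ngrams("walked"))
print(sorted(shared))
```

Keep in mind that the subword information is only available in the .bin model. The .vec file loaded below contains one vector per whole word, so a word missing from it still raises a `KeyError`.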

In [9]:
%%capture
!pip install fasttext

We'll be downloading the .vec version of FastText (you'll use the .bin version later).
*   Text-based: The .vec file is a plain text file where each line represents a word and its corresponding vector.
*   Human-readable: You can open .vec files in a text editor and read the word embeddings directly. Each line contains the word followed by the vector values (usually space-separated).
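
To make the layout concrete, here is a sketch that parses a tiny made-up file in the same text format: a header line with the number of words and the vector dimension, then one word per line followed by its values. `read_vec` and the sample content are invented for illustration; for the real file, use the gensim loader shown below.

```python
import io
import numpy as np

# A tiny invented file in the .vec layout: "<count> <dim>" header, then one word per line
toy_vec_file = io.StringIO(
    "2 3\n"
    "king 0.1 0.2 0.3\n"
    "queen 0.4 0.5 0.6\n"
)

def read_vec(handle):
    """Parse a word2vec-style text file into a {word: vector} dictionary."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    assert len(vectors) == count, "header count does not match the number of rows"
    return vectors, dim

vectors, dim = read_vec(toy_vec_file)
print(dim, list(vectors))   # 3 ['king', 'queen']
print(vectors["queen"])     # [0.4 0.5 0.6]
```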




In [10]:
# Takes 2-3 minutes
# Download the English FastText model
# There is one embedding per language
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
!gunzip cc.en.300.vec.gz

--2025-01-27 16:41:55--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.189.96, 3.163.189.51, 3.163.189.14, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.189.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1325960915 (1.2G) [binary/octet-stream]
Saving to: ‘cc.en.300.vec.gz’


2025-01-27 16:42:08 (101 MB/s) - ‘cc.en.300.vec.gz’ saved [1325960915/1325960915]



##Below we load the vectors in the word2vec format so that the same functions for computing analogies between words are available.
##This takes another 10 minutes... so maybe take this time to collect i) more examples of analogies, and ii) examples of word pairs where the words have different degrees of similarity.

In [11]:
from gensim.models import KeyedVectors

# Load the pre-trained FastText embeddings
fasttext_model_path = "/content/cc.en.300.vec"  # Update with your file
fasttext_model = KeyedVectors.load_word2vec_format(fasttext_model_path, binary=False)   #for analogy tasks

print(f"Loaded {len(fasttext_model)} word vectors.")

Loaded 2000000 word vectors.


##Now we are defining an analogy function just as before but with FastText

In [12]:
# Solve analogies using FastText
def solve_analogy(model, positive, negative):
    try:
        result = model.most_similar(positive=positive, negative=negative, topn=1)
        return result
    except KeyError as e:
        return f"Word not in vocabulary: {e}"

##... and testing again

In [13]:
# Example analogies
analogies = [
    (['king', 'woman'], ['man']),
    (['paris', 'italy'], ['france']),
    (['walking', 'swam'], ['walked']),
    (['japan', 'sushi'], ['italy']),  # japan + sushi - italy (for 'pizza', use sushi - japan + italy)
    (['doctor', 'woman'], ['man']),
]

In [14]:
for positive, negative in analogies:
    result = solve_analogy(fasttext_model, positive, negative)
    print(f"{positive} - {negative} = {result}")

['king', 'woman'] - ['man'] = [('queen', 0.7554903030395508)]
['paris', 'italy'] - ['france'] = [('rome', 0.6593306064605713)]
['walking', 'swam'] - ['walked'] = [('swimming', 0.7572133541107178)]
['japan', 'sushi'] - ['italy'] = [('sashimi', 0.6613127589225769)]
['doctor', 'woman'] - ['man'] = [('gynecologist', 0.7013276815414429)]


###Look at which analogies have improved and which have not. Also try with your own examples.

#Now let's use a contextual embedding, BERT.
###We no longer have the Word2Vec format that makes it possible to define an analogy function, so we will measure the similarity between the expected results and the analogy equation of the type (king + woman - man). The higher the similarity the closer we are to the results. You can compare to previous wrong results by measuring the distance to the wrong results obtained e.g. with Word2Vec.

In [15]:
# IGNORE the warning about the HF_TOKEN as we're using public models
from transformers import AutoModel, AutoTokenizer
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Load BERT model and tokenizer
model_name = "bert-base-uncased"
bert_model = AutoModel.from_pretrained(model_name)
bert_tokenizer = AutoTokenizer.from_pretrained(model_name)

# Embed words
def embed_word(word):
    inputs = bert_tokenizer(word, return_tensors="pt")
    outputs = bert_model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).detach().numpy()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [16]:
# Perform analogy: doctor - man + woman
doctor = embed_word("doctor")
man = embed_word("man")
woman = embed_word("woman")
nurse = embed_word("nurse")

result_vector = doctor - man + woman

# Compare to nurse
similarity = cosine_similarity(result_vector, nurse)
print(f"Similarity to 'nurse': {similarity[0][0]:.4f}")

Similarity to 'nurse': 0.8630


In [18]:
# Analogy vector as coded: italy - sushi + japan (the Japan : sushi :: Italy : pizza analogy would be sushi - japan + italy)
japan = embed_word("Japan")
sushi = embed_word("sushi")
pizza = embed_word("pizza")
italy = embed_word("Italy")

result_vector = italy - sushi + japan

# Compare to pizza
similarity = cosine_similarity(result_vector, pizza)
print(f"Similarity to 'pizza': {similarity[0][0]:.4f}")

Similarity to 'pizza': 0.5570


#Part II: Word similarity with embeddings

In [19]:
word_pairs = [
    ("king", "queen"),
    ("cat", "dog"),
    ("car", "bicycle"),
    ("apple", "banana"),
    ("teacher", "school"),
    ("war", "peace"),  # antonyms
    ("happy", "joyful"),
    ("table", "blue"), # unrelated
]

##Word2Vec similarity uses the already loaded version of Word2Vec

In [20]:
def word2vec_similarity(word1, word2):
    try:
        vec1 = word2vec_model[word1]
        vec2 = word2vec_model[word2]
        return cosine_similarity([vec1], [vec2])[0][0]
    except KeyError:
        return None  # Handle OOV words

# Compute similarities
for word1, word2 in word_pairs:
    similarity = word2vec_similarity(word1, word2)
    print(f"Word2Vec Similarity ({word1}, {word2}): {similarity}")

Word2Vec Similarity (king, queen): 0.6510956883430481
Word2Vec Similarity (cat, dog): 0.760945737361908
Word2Vec Similarity (car, bicycle): 0.5364484190940857
Word2Vec Similarity (apple, banana): 0.5318406224250793
Word2Vec Similarity (teacher, school): 0.63824063539505
Word2Vec Similarity (war, peace): 0.3407759368419647
Word2Vec Similarity (happy, joyful): 0.4238196909427643
Word2Vec Similarity (table, blue): 0.030079539865255356


In [21]:
# IGNORE the warning about the HF_TOKEN as we're using public models
# Reloading with a slight variant
from transformers import AutoModel, AutoTokenizer
import torch

# Load BERT model and tokenizer
bert_model = AutoModel.from_pretrained('bert-base-uncased')
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def bert_embedding(word):
    tokens = bert_tokenizer(word, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        output = bert_model(**tokens)
    # Average the token embeddings
    return output.last_hidden_state.mean(dim=1).squeeze().numpy()
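
The "slight variant" matters for single words. With the special tokens included, the average is taken over [CLS], the word pieces and [SEP]; with `add_special_tokens=False`, only the word pieces are averaged. The sketch below illustrates only the averaging step, on made-up numbers. In the real model the hidden states themselves also change when the special tokens are absent.

```python
import numpy as np

# Invented hidden states for the tokens [CLS], "king", [SEP]; shape (tokens, hidden_size)
hidden = np.array([
    [1.0, 1.0],   # [CLS]
    [0.0, 4.0],   # "king"
    [2.0, 1.0],   # [SEP]
])

# Mean over every row, special tokens included
mean_with_special = hidden.mean(axis=0)

# Mean over the word-piece rows only (first and last rows dropped)
mean_without_special = hidden[1:-1].mean(axis=0)

print(mean_with_special)     # [1. 2.]
print(mean_without_special)  # [0. 4.]
```

For a one-word input, two of the three averaged rows are special tokens, which is one plausible reason why single-word similarities can look alike across very different words.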

###Let's compute BERT similarity. Compare to the Word2Vec scores.
###You can try more word pairs of your own

In [22]:
def bert_similarity(word1, word2):
    vec1 = bert_embedding(word1)
    vec2 = bert_embedding(word2)
    return cosine_similarity([vec1], [vec2])[0][0]

# Compute similarities
for word1, word2 in word_pairs:
    similarity = bert_similarity(word1, word2)
    print(f"BERT Similarity ({word1}, {word2}): {similarity}")

BERT Similarity (king, queen): 0.6094793081283569
BERT Similarity (cat, dog): 0.6345011591911316
BERT Similarity (car, bicycle): 0.2727319300174713
BERT Similarity (apple, banana): 0.20649036765098572
BERT Similarity (teacher, school): 0.39560166001319885
BERT Similarity (war, peace): 0.6956070065498352
BERT Similarity (happy, joyful): 0.32036682963371277
BERT Similarity (table, blue): 0.3751840591430664


##Let's move from words to sentences and compute sentence similarity

In [23]:
sentence_pairs = [
    ("The king rules the kingdom.", "The queen governs the empire."),
    ("The cat is on the mat.", "The dog is in the yard."),
    ("I love playing football.", "Soccer is my favorite sport."),
    ("War brings destruction.", "Peace leads to harmony."),
    ("The teacher is in the classroom.", "The student is in the library."),
    ("I love programming.", "I enjoy coding."),
    ("The weather is nice today.", "It's sunny outside."),
    ("Apples are delicious.", "I like bananas.")
]

###Word2Vec first

In [25]:
def sentence_embedding_word2vec(sentence):
    words = sentence.lower().split()
    vectors = [word2vec_model[word] for word in words if word in word2vec_model]
    return sum(vectors) / len(vectors) if vectors else None

def word2vec_sentence_similarity(sent1, sent2):
    vec1 = sentence_embedding_word2vec(sent1)
    vec2 = sentence_embedding_word2vec(sent2)
    if vec1 is not None and vec2 is not None:
        return cosine_similarity([vec1], [vec2])[0][0]
    return None

# Compute similarities
for sent1, sent2 in sentence_pairs:
    similarity = word2vec_sentence_similarity(sent1, sent2)
    print(f"Word2Vec Sentence Similarity:\n{sent1}\n{sent2}\nScore: {similarity}\n")

Word2Vec Sentence Similarity:
The king rules the kingdom.
The queen governs the empire.
Score: 0.6699811220169067

Word2Vec Sentence Similarity:
The cat is on the mat.
The dog is in the yard.
Score: 0.8818545341491699

Word2Vec Sentence Similarity:
I love playing football.
Soccer is my favorite sport.
Score: 0.5485471487045288

Word2Vec Sentence Similarity:
War brings destruction.
Peace leads to harmony.
Score: 0.3857032060623169

Word2Vec Sentence Similarity:
The teacher is in the classroom.
The student is in the library.
Score: 0.8816166520118713

Word2Vec Sentence Similarity:
I love programming.
I enjoy coding.
Score: 0.7459654211997986

Word2Vec Sentence Similarity:
The weather is nice today.
It's sunny outside.
Score: 0.5262449979782104

Word2Vec Sentence Similarity:
Apples are delicious.
I like bananas.
Score: 0.1177532821893692
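
One thing to keep in mind when analysing these scores: `sentence.lower().split()` leaves punctuation attached, so the last token of "The king rules the kingdom." is 'kingdom.' and it is skipped if that exact string is not in the vocabulary. The sketch below shows the effect on a toy vocabulary. `toy_model`, `tokens_stripped` and `average_vector` are invented for illustration, and whether a given token exists in the real Word2Vec vocabulary is not checked here.

```python
import string
import numpy as np

# Toy vocabulary with invented 2-D vectors
toy_model = {
    "king": np.array([1.0, 0.0]),
    "kingdom": np.array([0.0, 1.0]),
}

def tokens_whitespace(sentence):
    """Tokenise as in the notebook: lowercase and split on whitespace."""
    return sentence.lower().split()

def tokens_stripped(sentence):
    """Same, but strip leading and trailing punctuation from each token."""
    stripped = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [w for w in stripped if w]

def average_vector(tokens, model):
    """Average the vectors of the tokens found in the model."""
    found = [model[t] for t in tokens if t in model]
    return np.mean(found, axis=0) if found else None

s = "The king rules the kingdom."
print(tokens_whitespace(s)[-1])                         # kingdom.
print(average_vector(tokens_whitespace(s), toy_model))  # [1. 0.]
print(average_vector(tokens_stripped(s), toy_model))    # [0.5 0.5]
```

Because every in-vocabulary word gets equal weight, frequent words such as "the" and "is" can also dominate the average for short sentences.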



###Then BERT

In [26]:
def sentence_embedding_bert(sentence):
    tokens = bert_tokenizer(sentence, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        output = bert_model(**tokens)
    # Average over all token embeddings
    return output.last_hidden_state.mean(dim=1).squeeze().numpy()

def bert_sentence_similarity(sent1, sent2):
    vec1 = sentence_embedding_bert(sent1)
    vec2 = sentence_embedding_bert(sent2)
    return cosine_similarity([vec1], [vec2])[0][0]

# Compute similarities
for sent1, sent2 in sentence_pairs:
    similarity = bert_sentence_similarity(sent1, sent2)
    print(f"BERT Sentence Similarity:\n{sent1}\n{sent2}\nScore: {similarity}\n")

BERT Sentence Similarity:
The king rules the kingdom.
The queen governs the empire.
Score: 0.9137302041053772

BERT Sentence Similarity:
The cat is on the mat.
The dog is in the yard.
Score: 0.8842370510101318

BERT Sentence Similarity:
I love playing football.
Soccer is my favorite sport.
Score: 0.8345466256141663

BERT Sentence Similarity:
War brings destruction.
Peace leads to harmony.
Score: 0.8601904511451721

BERT Sentence Similarity:
The teacher is in the classroom.
The student is in the library.
Score: 0.6967145800590515

BERT Sentence Similarity:
I love programming.
I enjoy coding.
Score: 0.8875049352645874

BERT Sentence Similarity:
The weather is nice today.
It's sunny outside.
Score: 0.8350158333778381

BERT Sentence Similarity:
Apples are delicious.
I like bananas.
Score: 0.8236267566680908



##Now analyse the results. Try with different sentence pairs, possibly longer ones.

### So BERT is contextual, while Word2Vec and FastText focus on isolated words
### Now, let's try SBERT which is optimised for sentence comparison.

In [27]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the SBERT model
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')  # Or another SBERT model of your choice

# Function to compute SBERT embeddings for a sentence
def sentence_embedding_sbert(sentence):
    return sbert_model.encode(sentence, convert_to_numpy=True)

# Function to compute cosine similarity between two sentences using SBERT
def sbert_sentence_similarity(sent1, sent2):
    vec1 = sentence_embedding_sbert(sent1)
    vec2 = sentence_embedding_sbert(sent2)
    return cosine_similarity([vec1], [vec2])[0][0]


In [28]:
for sent1, sent2 in sentence_pairs:
    similarity = sbert_sentence_similarity(sent1, sent2)
    print(f"SBERT Sentence Similarity:\n{sent1}\n{sent2}\nScore: {similarity}\n")

SBERT Sentence Similarity:
The king rules the kingdom.
The queen governs the empire.
Score: 0.5167741775512695

SBERT Sentence Similarity:
The cat is on the mat.
The dog is in the yard.
Score: 0.241413414478302

SBERT Sentence Similarity:
I love playing football.
Soccer is my favorite sport.
Score: 0.6574643850326538

SBERT Sentence Similarity:
War brings destruction.
Peace leads to harmony.
Score: 0.4820183515548706

SBERT Sentence Similarity:
The teacher is in the classroom.
The student is in the library.
Score: 0.517518162727356

SBERT Sentence Similarity:
I love programming.
I enjoy coding.
Score: 0.8287293314933777

SBERT Sentence Similarity:
The weather is nice today.
It's sunny outside.
Score: 0.602494478225708

SBERT Sentence Similarity:
Apples are delicious.
I like bananas.
Score: 0.45837509632110596



## Analyse the results. Now try with longer sentences to see if SBERT does better and in which case.
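
One simple starting point for the analysis is to compare how widely each model's scores spread across the eight sentence pairs. The sketch below uses the scores printed above, rounded to four decimals; `recorded_scores` and `spread` are names introduced here for illustration.

```python
# Scores copied from the outputs above (rounded), in the order of sentence_pairs
recorded_scores = {
    "Word2Vec": [0.6700, 0.8819, 0.5485, 0.3857, 0.8816, 0.7460, 0.5262, 0.1178],
    "BERT":     [0.9137, 0.8842, 0.8345, 0.8602, 0.6967, 0.8875, 0.8350, 0.8236],
    "SBERT":    [0.5168, 0.2414, 0.6575, 0.4820, 0.5175, 0.8287, 0.6025, 0.4584],
}

def spread(scores):
    """Difference between the highest and the lowest score."""
    return max(scores) - min(scores)

for name, scores in recorded_scores.items():
    print(f"{name}: min={min(scores):.4f} max={max(scores):.4f} spread={spread(scores):.4f}")
```

The mean-pooled BERT scores sit in a much narrower band than the others, so a raw BERT score of 0.8 says less about similarity than the same number from SBERT. Checking which pair each model ranks highest is a useful next step.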

In [None]:
sentence_pairs = [
    ("Hadoop efficiently processes large datasets by distributing tasks across multiple nodes, ensuring scalability and speed for big data analysis.", "The queen governs the empire."),
    # ("The cat is on the mat.", "The dog is in the yard."),
    # ("I love playing football.", "Soccer is my favorite sport."),
    # ("War brings destruction.", "Peace leads to harmony."),
    # ("The teacher is in the classroom.", "The student is in the library."),
    # ("I love programming.", "I enjoy coding."),
    # ("The weather is nice today.", "It's sunny outside."),
    # ("Apples are delicious.", "I like bananas.")
]