# **Fun with Word Embeddings**
#### **Author: Partha Seetala**

**Video Tutorial: https://www.youtube.com/watch?v=8jqqE8XG5T0**


# **Download pre-built Word Embeddings**



In [None]:
# Install dependencies (gensim and numpy)
!pip install gensim -U
!pip install numpy==1.25
!pip install keras_preprocessing

Collecting keras_preprocessing
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl.metadata (1.9 kB)
Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
Installing collected packages: keras_preprocessing
Successfully installed keras_preprocessing-1.1.2


In [None]:
w2vec = None
glove = None
embeddings = None

In [None]:
import numpy as np
import gensim.downloader as api

if w2vec is None:
    # Download the 300 dimension Word2Vec Embeddings built from Google News
    print("Downloading Word2Vec embeddings that were built from Google News")
    w2vec = api.load("word2vec-google-news-300")

if glove is None:
    # Download the 300-dimensional GloVe embeddings (trained on Wikipedia 2014 + Gigaword 5)
    print("Downloading GloVe embeddings that were built from Wikipedia")
    glove = api.load("glove-wiki-gigaword-300")

Downloading GloVe embeddings that were built from Wikipedia


In [None]:
# select which embedding we'll use
embeddings = glove

# **FUN 1**: Finding how similar two words are

In [None]:
word_pairs = [
    ('apple', 'orange'),
    ('apple', 'microsoft'),
    ('google', 'microsoft'),
    ('man', 'father'),
    ('woman', 'father'),
    ('woman', 'mother'),
    ('dog', 'wolf'),
    ('cat', 'kitten'),
    ('cat', 'lion'),
    ('cat', 'dog'),
    ('cat', 'mouse'),
    ('cat', 'pizza'),
    ('apple', 'mango'),
    ('mango', 'orange'),
    ('lemon', 'orange'),
    ('lemon', 'lime'),
    ('einstein', 'grape'),
]

for pair in word_pairs:
    similarity = embeddings.similarity(pair[0], pair[1])
    print("Similarity between {:>10s} and {:10s} = {:3.2f}%".format(pair[0], pair[1], similarity*100))



Similarity between      apple and orange     = 32.06%
Similarity between      apple and microsoft  = 56.64%
Similarity between     google and microsoft  = 63.93%
Similarity between        man and father     = 54.52%
Similarity between      woman and father     = 45.41%
Similarity between      woman and mother     = 68.99%
Similarity between        dog and wolf       = 44.64%
Similarity between        cat and kitten     = 43.05%
Similarity between        cat and lion       = 38.01%
Similarity between        cat and dog        = 68.17%
Similarity between        cat and mouse      = 45.38%
Similarity between        cat and pizza      = 14.63%
Similarity between      apple and mango      = 40.26%
Similarity between      mango and orange     = 35.46%
Similarity between      lemon and orange     = 47.24%
Similarity between      lemon and lime       = 71.73%
Similarity between   einstein and grape      = 4.35%
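
The percentages above come from `embeddings.similarity`, which gensim documents as the cosine similarity between the two word vectors. As a rough, dependency-free sketch of that calculation, the cell below uses made-up 3-dimensional vectors (the `toy` dictionary and the helper name `cosine_similarity_sketch` are assumptions for illustration, not the real 300-dimensional GloVe values).

```python
import math

def cosine_similarity_sketch(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", invented for illustration only
toy = {
    "lemon": [0.9, 0.1, 0.0],
    "lime":  [0.8, 0.2, 0.1],
    "pizza": [0.0, 0.3, 0.9],
}

# Vectors pointing in similar directions score close to 1,
# nearly orthogonal vectors score close to 0
print("lemon vs lime :", cosine_similarity_sketch(toy["lemon"], toy["lime"]))
print("lemon vs pizza:", cosine_similarity_sketch(toy["lemon"], toy["pizza"]))
```

Because the score depends only on direction and not on vector length, a frequent word and a rare word can still be rated as similar.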


# **FUN 2**: Finding a word that doesn't belong

In [None]:
words = [
    ['artificial', 'intelligence', 'machine', 'banana'],
    ['computer', 'laptop', 'book', 'server'],
    ['cat', 'dog', 'lion', 'mouse'],
    ['cat', 'dog', 'lion', 'kitten'],
    ['rat', 'elephant', 'dinosaurs', 'giraffe'],
    ['elephant', 'lion', 'giraffe', 'fish'],
    ['elephant', 'lion', 'giraffe', 'fish', 'computer', 'laptop', 'book', 'water'],
    ['gandhi', 'hitler', 'mussolini']
]

# Given a group of words, find the one that doesn't belong
for group in words:
    odd_word_out = embeddings.doesnt_match(group)
    print("{}  ->  {}".format(group, odd_word_out))


['artificial', 'intelligence', 'machine', 'banana']  ->  banana
['computer', 'laptop', 'book', 'server']  ->  book
['cat', 'dog', 'lion', 'mouse']  ->  mouse
['cat', 'dog', 'lion', 'kitten']  ->  lion
['rat', 'elephant', 'dinosaurs', 'giraffe']  ->  rat
['elephant', 'lion', 'giraffe', 'fish']  ->  fish
['elephant', 'lion', 'giraffe', 'fish', 'computer', 'laptop', 'book', 'water']  ->  laptop
['gandhi', 'hitler', 'mussolini']  ->  gandhi
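
One plausible way to pick the odd word out, and to my understanding roughly what `doesnt_match` does, is to average the unit-length vectors of the group and report the word least similar to that average. The sketch below assumes this approach and uses invented 2-dimensional vectors; `unit`, `odd_one_out`, and `toy` are hypothetical names, not gensim API.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors_by_word):
    """Return the word whose vector is least similar to the group's mean direction."""
    words = list(vectors_by_word)
    units = [unit(vectors_by_word[w]) for w in words]
    dim = len(units[0])
    # Mean of the unit vectors, re-normalized so the dot product is a cosine
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in zip(words, units)}
    return min(scores, key=scores.get)

# Hypothetical 2-dimensional vectors, invented for illustration only
toy = {
    "cat":    [0.90, 0.10],
    "dog":    [0.80, 0.20],
    "lion":   [0.85, 0.05],
    "banana": [0.10, 0.90],
}

print(odd_one_out(toy))
```

This also suggests why the long mixed list above gives a surprising answer: with several unrelated clusters in one group, the mean direction is not close to any of them, so the "least similar" word is somewhat arbitrary.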


# **FUN 3**: Finding other words similar to a given word

In [None]:
words = ["human", "computer", "intelligence", "pizza"]

for word in words:
    similar_words = embeddings.most_similar(word, topn=10)
    print("\n", word)
    for sim in similar_words:
        print(" - {} {:3.2f}%".format(sim[0], sim[1]*100))


 human
 - rights 66.78%
 - beings 62.11%
 - animal 58.78%
 - humans 56.96%
 - humanity 50.16%
 - animals 49.98%
 - environmental 47.70%
 - abuses 47.38%
 - democracy 46.17%
 - scientists 45.76%

 computer
 - computers 82.48%
 - software 73.34%
 - pc 62.40%
 - technology 61.99%
 - computing 61.79%
 - laptop 59.56%
 - internet 58.58%
 - ibm 58.25%
 - systems 57.45%
 - hardware 57.29%

 intelligence
 - cia 65.24%
 - information 55.40%
 - security 54.01%
 - counterterrorism 53.87%
 - operatives 53.31%
 - fbi 53.30%
 - military 53.12%
 - secret 52.89%
 - spy 52.25%
 - agency 50.99%

 pizza
 - pizzas 65.33%
 - taco 61.55%
 - pepperoni 60.20%
 - sandwiches 58.44%
 - restaurant 56.44%
 - sandwich 55.62%
 - burgers 53.97%
 - pasta 53.56%
 - hamburgers 53.32%
 - burger 53.00%
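
Conceptually, a nearest-neighbor lookup like `most_similar(word, topn=10)` ranks every other vocabulary word by cosine similarity to the query and keeps the top few. The sketch below illustrates that idea with a brute-force scan over a tiny invented vocabulary; `nearest_neighbors` and `toy_vocab` are hypothetical names, and the real library is far more efficient than this loop.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_neighbors(vocab, query, topn=2):
    """Rank all words except the query by cosine similarity to it."""
    scored = [(word, cosine(vocab[query], vec))
              for word, vec in vocab.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical 3-dimensional vectors, invented for illustration only
toy_vocab = {
    "computer": [0.90, 0.10, 0.00],
    "software": [0.80, 0.20, 0.10],
    "laptop":   [0.85, 0.05, 0.10],
    "pizza":    [0.00, 0.20, 0.90],
}

for word, score in nearest_neighbors(toy_vocab, "computer"):
    print(" - {} {:3.2f}%".format(word, score * 100))
```

The neighbors reflect how words are used in the training text, which is why "intelligence" above lands near "cia" and "spy" rather than near "smart".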


# **FUN 4**: Adding and subtracting Word Embeddings

In [None]:
# We'll do A - B + C
combinations = [
    ('king', 'man', 'woman'),    # king - man + woman => queen
    ('paris', 'france', 'germany'),
    ('brother', 'man', 'woman'),
    ('walking', 'walk', 'swim'),
    ('doctor', 'medicine', 'car'),
    ('driver', 'car', 'plane'),
    ('husband', 'man', 'woman'),
    ('apple', 'fruit', 'vegetable'),
    ('germany', 'hitler', 'mussolini')
]

for a, b, c in combinations:
    result = embeddings.most_similar(positive=[a, c], negative=[b], topn=1)
    print("{:>10s} - {:10s} + {:10s} = {}".format(a, b, c, result[0][0]))


      king - man        + woman      = queen
     paris - france     + germany    = berlin
   brother - man        + woman      = daughter
   walking - walk       + swim       = swimming
    doctor - medicine   + car        = driver
    driver - car        + plane      = flight
   husband - man        + woman      = wife
     apple - fruit      + vegetable  = macintosh
   germany - hitler     + mussolini  = italy
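
The `A - B + C` pattern can be sketched by hand: add the unit vectors of the positive words, subtract those of the negative words, then look for the closest remaining word. The cell below is a minimal illustration under that assumption, using invented 2-dimensional vectors where the first axis loosely stands for "royalty" and the second for "gender"; `analogy` and `toy_vocab` are hypothetical names.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy(vocab, a, b, c):
    """Return the word closest to unit(a) - unit(b) + unit(c), skipping the inputs."""
    ua, ub, uc = unit(vocab[a]), unit(vocab[b]), unit(vocab[c])
    target = [x - y + z for x, y, z in zip(ua, ub, uc)]
    # The input words are excluded, otherwise 'king' itself could win
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

# Hypothetical 2-dimensional vectors, invented for illustration only
toy_vocab = {
    "king":   [0.9,  0.8],
    "queen":  [0.9, -0.8],
    "man":    [0.1,  0.9],
    "woman":  [0.1, -0.9],
    "prince": [0.7,  0.7],
}

print("king - man + woman =", analogy(toy_vocab, "king", "man", "woman"))
```

Excluding the input words matters in practice: the combined vector usually stays very close to one of its inputs, so without the exclusion many analogies would simply return `a` or `c`.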


In [None]:
# List of complex analogies to solve (with multiple additions and subtractions)
complex_analogies = [
    (['france', 'germany', 'italy'], ['paris', 'berlin']),
    (['brazil', 'yen', 'rupee'], ['japan', 'india']),
    (['walking', 'run', 'swimming'], ['walk', 'swim']),
    (['brother', 'queen'], ['king', 'sister']),
    (['husband', 'woman'], ['man']),
    (['woman', 'husband'], ['man']),
    (['cars', 'wings', 'sky'], ['wheels', 'road'])
]

# Solve each analogy
for positive, negative in complex_analogies:
    result = embeddings.most_similar(positive=positive, negative=negative, topn=2)
    print("({}) - ({}) => ".format(", ".join(positive), ", ".join(negative)), end="")
    for sim in result:
        print("{} {:3.2f}%".format(sim[0], sim[1]*100), end=", ")
    print("\n")

(france, germany, italy) - (paris, berlin) => spain 53.19%, slovakia 49.63%, 

(brazil, yen, rupee) - (japan, india) => peso 61.34%, franc 53.54%, 

(walking, run, swimming) - (walk, swim) => running 55.06%, ran 43.81%, 

(brother, queen) - (king, sister) => honours 29.17%, godoy 28.37%, 

(husband, woman) - (man) => wife 77.33%, mother 71.96%, 

(woman, husband) - (man) => wife 77.33%, mother 71.96%, 

(cars, wings, sky) - (wheels, road) => skies 44.92%, blue 42.24%, 



# **FUN 5**: Finding similar sentences using just Word Embeddings

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import re

# Function to compute sentence embedding by averaging word embeddings
def get_sentence_embedding(embeddings, sentence):
    # Simple preprocessing: lowercase, strip punctuation, and split into words
    words = re.sub(r'[^\w\s]', '', sentence.lower()).split()

    # Get embeddings for words in the sentence that exist in the model
    word_embeddings = [embeddings[word] for word in words if word in embeddings]

    # If no word is in the vocabulary, fall back to a zero vector instead of NaN
    if not word_embeddings:
        return np.zeros(embeddings.vector_size)

    # Average the word embeddings to get the sentence embedding
    return np.mean(word_embeddings, axis=0)

# Compute the matrix of pairwise cosine similarities between all sentences
def calculate_similarities_among_sentences(embeddings, sentences):
    sentence_embeddings = np.array([get_sentence_embedding(embeddings, sentence) for sentence in sentences])
    similarity_matrix = cosine_similarity(sentence_embeddings)
    return similarity_matrix

# Print similar sentences
def show_similar_sentences(similarity_matrix, sentences, threshold=0.75):
    for i in range(len(sentences)):
        similarities = []
        for j in range(len(sentences)):
            if i == j:
                continue
            sim = similarity_matrix[i][j]
            if sim < threshold:
                continue
            similarities.append((sentences[j], sim))

        similarities.sort(key=lambda x: x[1], reverse=True)

        print("\n", sentences[i])
        for sim in similarities:
            print(" - {} {:3.2f}%".format(sim[0], sim[1]*100))
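
To see what the averaging step does in isolation, here is a small dependency-free sketch with an invented 2-dimensional vocabulary. The names `tokenize`, `average_vector`, and `toy_vocab` are hypothetical, and returning a zero vector for a sentence with no known words is an assumption of this sketch.

```python
import re

def tokenize(sentence):
    """Lowercase, strip punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", sentence.lower()).split()

def average_vector(vocab, sentence, dim):
    """Average the vectors of the in-vocabulary words of a sentence."""
    vectors = [vocab[w] for w in tokenize(sentence) if w in vocab]
    if not vectors:
        # Assumption: represent a sentence with no known words as all zeros
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 2-dimensional vectors: axis 0 ~ "sports", axis 1 ~ "politics"
toy_vocab = {
    "game":  [1.0, 0.0],
    "team":  [0.8, 0.2],
    "vote":  [0.0, 1.0],
    "party": [0.2, 0.8],
}

# Words missing from the vocabulary ("the", "won", "lost") are simply skipped
print(average_vector(toy_vocab, "The team won the game!", dim=2))
print(average_vector(toy_vocab, "The party lost the vote.", dim=2))
```

Averaging discards word order, so "dog bites man" and "man bites dog" get identical sentence vectors; it works here mainly because the example sentences differ strongly in topic.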

In [None]:
sentences = [
    "The quarterback threw a perfect pass to win the game in overtime.",  # Sports
    "Artificial neural networks are accelerating breakthroughs in pattern recognition.",  # Tech
    "The senator proposed a bill to reform healthcare funding nationwide.",  # Politics
    "Philosophers debate whether free will exists or if fate governs our actions.",  # Philosophy
    "A new surgical technique reduced recovery time for knee replacements.",  # Medicine
    "Quantum encryption could soon make data breaches a thing of the past.",  # Tech
    "The marathon runner trained for months to qualify for the Olympics.",  # Sports
    "Existentialism suggests that meaning in life is created, not discovered.",  # Philosophy
    "The prime minister faced criticism for delaying the climate change vote.",  # Politics
    "Gene editing tools like CRISPR are revolutionizing treatment for genetic disorders.",  # Medicine
    "The basketball team’s defense strategy turned the match in their favor.",  # Sports
    "Tech startups are racing to develop sustainable battery alternatives.",  # Tech
    "The ethics of artificial intelligence challenge our notions of responsibility.",  # Philosophy
    "Voters turned out in record numbers to support the new tax initiative.",  # Politics
    "A groundbreaking study linked diet to improved mental health outcomes.",  # Medicine
    "The tennis champion’s backhand was unbeatable during the final set.",  # Sports
    "Blockchain technology is reshaping how we verify digital transactions.",  # Tech
    "Stoicism teaches us to focus only on what we can control.",  # Philosophy
    "The opposition party rallied against cuts to public education spending.",  # Politics
    "Antiviral drugs are being tested to shorten the duration of flu symptoms."  # Medicine
]


similarities = calculate_similarities_among_sentences(embeddings, sentences)
show_similar_sentences(similarities, sentences, threshold=0.75)


 The quarterback threw a perfect pass to win the game in overtime.
 - The basketball team’s defense strategy turned the match in their favor. 80.72%
 - The tennis champion’s backhand was unbeatable during the final set. 76.10%

 Artificial neural networks are accelerating breakthroughs in pattern recognition.

 The senator proposed a bill to reform healthcare funding nationwide.
 - The opposition party rallied against cuts to public education spending. 83.87%
 - Voters turned out in record numbers to support the new tax initiative. 81.71%
 - The prime minister faced criticism for delaying the climate change vote. 75.26%

 Philosophers debate whether free will exists or if fate governs our actions.
 - Stoicism teaches us to focus only on what we can control. 84.05%
 - Existentialism suggests that meaning in life is created, not discovered. 78.33%
 - The ethics of artificial intelligence challenge our notions of responsibility. 76.70%
 - Quantum encryption could soon make data breaches 