<h4>Word Embeddings</h4>

Word embeddings represent each word as a vector (a list of numbers) that a computer can compare and compute with.

The key idea is simple: words with similar meanings should be represented by similar vectors.

Think of it like a map where words are placed as points - words that mean similar things (like "cat" and "dog") end up close to each other, while unrelated words (like "cat" and "algebra") are far apart.
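
To make "close" and "far apart" concrete, here is a minimal sketch that measures closeness with cosine similarity, a common choice for word vectors. The three-dimensional vectors in `toy_vectors` are made up for illustration; real embeddings are learned from text and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, invented for this example only
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "algebra": [0.1, 0.0, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_algebra = cosine_similarity(toy_vectors["cat"], toy_vectors["algebra"])

print(f"cat vs dog:     {sim_cat_dog:.3f}")
print(f"cat vs algebra: {sim_cat_algebra:.3f}")
```

With these toy values, "cat" and "dog" score much higher than "cat" and "algebra", which is the pattern the map analogy describes.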

These systems learn from large amounts of text by tracking which words appear near each other and in which contexts. For example, "king" and "queen" tend to appear in similar sentences, so the model learns that they are related.

The examples below use GloVe (Global Vectors for Word Representation). GloVe first builds a large table counting how often each pair of words appears together in the training text, then fits word vectors whose dot products approximate the logarithm of those counts.
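
The co-occurrence table that GloVe starts from can be sketched in a few lines. This is a simplified illustration, not GloVe's actual pipeline: the function name `cooccurrence_counts`, the window size, and the two-sentence corpus are assumptions made for the example, and GloVe's distance weighting and the vector-fitting step are left out.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look only at neighbours to the right so each pair is counted once
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1
    return counts

# Tiny made-up corpus, for illustration only
toy_corpus = [
    "the king rules the land",
    "the queen rules the land",
]

counts = cooccurrence_counts(toy_corpus)
for pair, count in counts.most_common():
    print(pair, count)
```

Note that "king" and "queen" never appear together here, yet both co-occur with "rules" and "the". Shared neighbours like these are what let a model place the two words close to each other.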


In [1]:
# Import the gensim library to work with pre-trained word vectors
# gensim.downloader (imported as "api") downloads and loads pre-trained models
# KeyedVectors is the class that stores the word vectors and answers similarity queries

from gensim.models import KeyedVectors
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import gensim.downloader as api

In [2]:
# List all available models from the gensim downloader

print("Available models:\n")
for model_name in api.info()['models'].keys():
    print(model_name)

Available models:

fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50
glove-twitter-100
glove-twitter-200
__testing_word2vec-matrix-synopsis


<h4>GloVe embeddings</h4>

In [3]:
# Load pre-trained GloVe vectors (50 dimensions, trained on Twitter data)

model = api.load("glove-twitter-50")

In [4]:
# Display the vector for the word "abuse"
print("\nVector for 'abuse':\n", model['abuse'])


Vector for 'abuse':
 [ 0.60233   0.76993  -0.89615  -0.15807  -0.21837  -0.009611  0.31435
 -0.7212    0.18842  -0.20418   0.5714   -0.68694  -3.2285   -0.1543
  0.57321   0.66187  -0.67019  -1.0092    0.033162 -0.23652  -0.18133
  0.24384   0.40323  -0.10941  -0.17483   0.21203   0.53417   0.69763
 -0.61553   0.53514   0.10736  -0.56608  -0.12903   0.035331  0.19674
 -0.36697  -1.2568    0.085889 -0.59497  -2.2233    0.57036  -1.3173
  0.25104  -0.24124   0.47565   0.17862   0.11298   0.36407  -0.027958
 -0.74695 ]


In [5]:
# Look up words associated with domestic violence (physical) and list the most similar words in the model

similar_words = model.most_similar(positive=["abuse", "domestic", "assault"], topn=20)

for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

arrest: 0.8591
rape: 0.8546
murder: 0.8462
alleged: 0.8449
harassment: 0.8404
charges: 0.8384
appeal: 0.8315
violence: 0.8252
terrorism: 0.8232
cruelty: 0.8197
investigation: 0.8146
rights: 0.8108
torture: 0.7982
terrorist: 0.7972
victim: 0.7938
threatening: 0.7902
execution: 0.7893
weapons: 0.7868
crime: 0.7867
military: 0.7846


In [6]:
# Look up words associated with domestic violence (non-physical) and list the most similar words in the model

similar_words = model.most_similar(positive=["emotional", "abuse", "psychological"], topn=20)

for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

physical: 0.8272
critical: 0.8230
illness: 0.8194
harm: 0.8027
depression: 0.7956
causes: 0.7914
issues: 0.7846
violent: 0.7822
torture: 0.7819
lack: 0.7803
anxiety: 0.7740
substance: 0.7738
anger: 0.7725
significant: 0.7699
human: 0.7689
causing: 0.7679
suffering: 0.7676
danger: 0.7647
common: 0.7609
situation: 0.7600


In [7]:
# Analogy-style query: find words close to "rape" and "assault" but far from "theft"

analogous = model.most_similar(positive=["rape", "assault"], negative=["theft"], topn=20)
for word, similarity in analogous:
    print(f"{word}: {similarity:.4f}")

arrest: 0.7410
abortion: 0.7244
harassment: 0.7242
murder: 0.7111
appeal: 0.6977
blasphemy: 0.6950
abuse: 0.6909
rapist: 0.6902
punishment: 0.6898
victim: 0.6792
judge: 0.6678
terrorist: 0.6666
violence: 0.6662
innocent: 0.6600
marriage: 0.6595
laws: 0.6594
sandusky: 0.6593
alleged: 0.6560
anti-gay: 0.6558
defend: 0.6548
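
The `most_similar` queries above combine several words into one query. Roughly, the positive words' vectors are added, the negative words' vectors are subtracted, and every other word in the vocabulary is ranked by cosine similarity to the result. The sketch below reimplements that idea in plain Python as an approximation of the behaviour, not gensim's actual code. `most_similar_sketch` and the three-dimensional `toy_vectors` are hypothetical names and values chosen for illustration.

```python
import math

def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def most_similar_sketch(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to the combined query direction."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    # Add unit vectors of positive words, subtract those of negative words
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)

    excluded = set(positive) | set(negative)  # query words are never returned
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 3-d vectors (axes: royalty, male, female), for illustration only
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

ranked = most_similar_sketch(toy_vectors, positive=["king", "woman"], negative=["man"])
for word, score in ranked:
    print(f"{word}: {score:.4f}")
```

On these toy vectors, "king" + "woman" - "man" ranks "queen" first. On real embeddings the results are noisier, as the output above shows: the nearest words reflect whatever contexts the training text happened to contain.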


ELMo (Embeddings from Language Models) is a later, more advanced approach: instead of giving each word one fixed vector, it produces a different vector depending on how the word is used in each sentence.
The main breakthrough of these models was showing that computers can learn meaningful relationships between words just by reading large amounts of text, without anyone explicitly teaching them what the words mean.
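
The difference between a fixed and a context-dependent representation can be illustrated with a deliberately crude toy. This is not how ELMo works internally (ELMo uses a neural language model over the whole sentence). The helper `toy_contextual_vector`, the blending rule, and the two-dimensional vectors are all assumptions invented for this sketch.

```python
# Hypothetical 2-d static vectors, made up for illustration only
static_vectors = {
    "battery": [0.5, 0.5],
    "assault": [1.0, 0.0],
    "phone": [0.0, 1.0],
}

def static_vector(tokens, index):
    """Static lookup: the sentence is ignored, the word alone decides."""
    return static_vectors[tokens[index]]

def toy_contextual_vector(tokens, index, mix=0.5):
    """Blend a word's static vector with the mean of its known neighbours."""
    own = static_vectors[tokens[index]]
    neighbours = [
        static_vectors[token]
        for i, token in enumerate(tokens)
        if i != index and token in static_vectors
    ]
    if not neighbours:
        return own
    mean = [sum(column) / len(neighbours) for column in zip(*neighbours)]
    return [(1 - mix) * o + mix * m for o, m in zip(own, mean)]

legal = "assault and battery".split()
device = "phone battery".split()

static_legal = static_vector(legal, 2)
static_device = static_vector(device, 1)
context_legal = toy_contextual_vector(legal, 2)
context_device = toy_contextual_vector(device, 1)

print("static:    ", static_legal, static_device)
print("contextual:", context_legal, context_device)
```

The static lookup returns the same vector for "battery" in both sentences, while the context-blended version moves it toward "assault" in one case and toward "phone" in the other.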

In [None]:
'''
# ELMo requires TensorFlow and TensorFlow Hub
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained ELMo model from TensorFlow Hub
print("Loading ELMo model...")
elmo = hub.load("https://tfhub.dev/google/elmo/3")

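def get_elmo_embeddings(sentences):
    """
    Get ELMo embeddings for a list of sentences.
    Returns the "default" output: one 1024-dimensional vector per sentence,
    the mean of the contextual word vectors over all tokens in the sentence.
    """
    # The TF Hub ELMo module is in TF1 format, so under TensorFlow 2 it is
    # called through its "default" signature with a string tensor of sentences
    embeddings = elmo.signatures["default"](tf.constant(sentences))
    
    # The signature returns several outputs; use the sentence-level "default" one
    return embeddings["default"].numpy()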

def find_similar_context_words(target_sentences, candidate_sentences, top_n=5):
    """
    For each target sentence, rank the candidate sentences by the cosine
    similarity of their sentence-level ELMo embeddings and keep the top_n.
    """
    # Get embeddings for all sentences
    all_sentences = target_sentences + candidate_sentences
    embeddings = get_elmo_embeddings(all_sentences)
    
    # Split embeddings
    target_embeddings = embeddings[:len(target_sentences)]
    candidate_embeddings = embeddings[len(target_sentences):]
    
    # Calculate similarities
    similarities = cosine_similarity(target_embeddings, candidate_embeddings)
    
    # Find most similar for each target
    results = []
    for i, target_sentence in enumerate(target_sentences):
        # Get similarity scores for this target
        scores = similarities[i]
        
        # Get top similar candidates
        top_indices = np.argsort(scores)[::-1][:top_n]
        
        similar_contexts = []
        for idx in top_indices:
            similar_contexts.append({
                'sentence': candidate_sentences[idx],
                'similarity': scores[idx]
            })
        
        results.append({
            'target': target_sentence,
            'similar': similar_contexts
        })
    
    return results

# Example usage, mirroring the GloVe queries above
print("\n" + "="*60)
print("ELMo Context-Dependent Word Embeddings Demo")
print("="*60)

# Define sentences containing words related to domestic violence (physical)
physical_violence_sentences = [
    "The abuse was physical and left visible marks",
    "Domestic violence includes physical assault and battery",
    "Physical abuse can cause serious injury"
]

# Define candidate sentences for comparison
candidate_sentences = [
    "The abuse of power was evident in the political scandal",
    "Drug abuse is a serious health problem",
    "Emotional abuse can be just as damaging as physical violence",
    "The assault was reported to the police immediately",
    "Domestic disputes often escalate to violence",
    "Physical therapy helped with the injury recovery",
    "The battery in my phone needs charging",
    "Mental health support is crucial for abuse survivors",
    "Violence in movies has increased over the years",
    "The physical examination revealed no injuries"
]

print("\nFinding sentences with similar contexts to physical violence...")
results = find_similar_context_words(physical_violence_sentences, candidate_sentences, top_n=3)

for result in results:
    print(f"\nTarget: '{result['target']}'")
    print("Most similar contexts:")
    for i, similar in enumerate(result['similar'], 1):
        print(f"  {i}. '{similar['sentence']}' (similarity: {similar['similarity']:.4f})")

print("\n" + "="*60)
print("Demonstrating Context Sensitivity")
print("="*60)

# Compare sentences that use "abuse" in different contexts
# (these are sentence-level embeddings, so scores reflect the whole sentence, not the word alone)
context_examples = [
    "The financial abuse involved stealing money",
    "The child abuse case was heartbreaking",
    "The substance abuse program helped many people",
    "The verbal abuse continued for years"
]

print("\nGetting embeddings for different contexts of 'abuse':")
embeddings = get_elmo_embeddings(context_examples)

# Calculate pairwise similarities between the sentence embeddings
context_similarities = cosine_similarity(embeddings)

print("\nSimilarity matrix between different contexts:")
print("Sentences:")
for i, sentence in enumerate(context_examples):
    print(f"{i+1}. {sentence}")

print("\nSimilarity Matrix:")
print("     ", end="")
for i in range(len(context_examples)):
    print(f"  {i+1:>6}", end="")
print()

for i in range(len(context_examples)):
    print(f"{i+1:>2}. ", end="")
    for j in range(len(context_examples)):
        print(f"  {context_similarities[i][j]:>6.3f}", end="")
    print()

print("\n" + "="*60)
print("Word Analogy with Context")
print("="*60)
'''