# GloVe: Global Vectors for Word Representation

In this notebook, we will explore **GloVe**, a technique for generating word embeddings. GloVe builds a co-occurrence matrix to capture word relationships and trains embeddings so that the dot product of two word vectors approximates the log of how often those words co-occur.

---

## Objectives
By the end of this notebook, you will:
1. Understand how a co-occurrence matrix is created and used to capture contextual information.
2. Learn how to train word embeddings using GloVe.
3. Implement and use GloVe embeddings to find similar words.

---

## What is GloVe?
GloVe (Global Vectors for Word Representation) is a method for learning word embeddings based on the co-occurrence of words in a corpus. Unlike Word2Vec, which uses local context windows, GloVe incorporates global statistics of word co-occurrence.

### Key Features:
- Uses a co-occurrence matrix to capture the number of times words appear together in a context window.
- Learns word embeddings by factorizing this matrix and minimizing the difference between observed and predicted log co-occurrence counts.
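
For reference, the objective from the original GloVe paper (Pennington, Socher, and Manning, 2014) can be written as follows. It is taken from the paper, not from the code in this notebook. Here $V$ is the vocabulary size, $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, and $b_i$, $\tilde{b}_j$ are bias terms. The paper reports using $x_{\max} = 100$ and $\alpha = 3/4$.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

Because $f(0) = 0$, pairs that never co-occur contribute nothing to the sum, which is why training only needs to visit the non-zero entries of the matrix.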

---

## Why Use GloVe?
Word embeddings generated by GloVe have several advantages:
1. Capture semantic and syntactic relationships between words.
2. Perform well on word similarity and analogy tasks.
3. Provide dense, fixed-size vectors that can be used as input features in downstream NLP applications.


# Preliminaries: Libraries and Data Preparation

We will:
1. Use `numpy` and `scipy.sparse` for matrix operations.
2. Define a small corpus of sentences for training.
3. Create a co-occurrence matrix to capture word relationships in the corpus.

In [12]:
import numpy as np
from scipy import sparse

# Our sample corpus
corpus = [
    ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'],
    ['the', 'five', 'boxing', 'wizards', 'jump', 'quickly'],
    ['pack', 'my', 'box', 'with', 'five', 'dozen', 'liquor', 'jugs'],
    ['she', 'sells', 'seashells', 'by', 'the', 'seashore'],
    ['how', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if', 'a', 'woodchuck', 'could', 'chuck', 'wood'],
    ['peter', 'piper', 'picked', 'a', 'peck', 'of', 'pickled', 'peppers'],
    ['a', 'peck', 'of', 'pickled', 'peppers', 'peter', 'piper', 'picked'],
    ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question'],
    ['all', 'the', 'world', 'is', 'a', 'stage', 'and', 'all', 'the', 'men', 'and', 'women', 'merely', 'players'],
    ['friends', 'romans', 'countrymen', 'lend', 'me', 'your', 'ears'],
    ['romeo', 'romeo', 'wherefore', 'art', 'thou', 'romeo'],
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'chased', 'the', 'cat'],
    ['the', 'mat', 'was', 'on', 'the', 'floor'],
    ['the', 'sun', 'shines', 'bright', 'in', 'the', 'blue', 'sky'],
    ['rain', 'drops', 'keep', 'falling', 'on', 'my', 'head'],
    ['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit'],
    ['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times'],
    ['call', 'me', 'ishmael'],
    ['it', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife'],
    ['happy', 'families', 'are', 'all', 'alike', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own', 'way'],
]


# Step 1: Create the Co-occurrence Matrix

The co-occurrence matrix represents the number of times words appear together within a context window. Each row and column corresponds to a word in the vocabulary.

### Parameters:
- **Window Size**: Determines how many words to the left and right are considered as context.
- **Vocab**: The set of unique words in the corpus.
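
To make the windowing concrete, here is a minimal sketch that counts co-occurring pairs for a single sentence using plain Python. The helper name `window_pairs` and the variables `toy_sentence` and `toy_counts` are illustrative and not part of the notebook's code; the counting logic is meant to mirror a symmetric window of size 2.

```python
from collections import Counter

def window_pairs(sentence, window_size=2):
    """Count (word, context_word) pairs within `window_size` positions of each word."""
    counts = Counter()
    for i, word in enumerate(sentence):
        start = max(0, i - window_size)
        end = min(len(sentence), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                counts[(word, sentence[j])] += 1
    return counts

toy_sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat']
toy_counts = window_pairs(toy_sentence, window_size=2)

print(toy_counts[('sat', 'the')])  # 'the' appears twice within 2 words of 'sat'
print(toy_counts[('the', 'cat')])  # only the first 'the' is within 2 words of 'cat'
```

With a symmetric window, the count for `(a, b)` equals the count for `(b, a)`, so the resulting matrix is symmetric.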


In [13]:
def create_co_occurrence_matrix(corpus, window_size=2):
    # Create vocabulary and mappings
    vocab = sorted(set(word for sentence in corpus for word in sentence))
    word_to_id = {word: i for i, word in enumerate(vocab)}

    # Initialize the co-occurrence matrix
    co_occurrence = sparse.lil_matrix((len(vocab), len(vocab)))

    # Populate the matrix
    for sentence in corpus:
        for i, word in enumerate(sentence):
            left_context = max(0, i - window_size)
            right_context = min(len(sentence), i + window_size + 1)
            for j in range(left_context, right_context):
                if i != j:
                    co_occurrence[word_to_id[word], word_to_id[sentence[j]]] += 1

    return co_occurrence, word_to_id

# Create co-occurrence matrix
co_occurrence, word_to_id = create_co_occurrence_matrix(corpus)

print("Vocabulary:", list(word_to_id.keys()))
print("Co-occurrence matrix shape:", co_occurrence.shape)


Vocabulary: ['a', 'acknowledged', 'alike', 'all', 'and', 'are', 'art', 'be', 'best', 'blue', 'box', 'boxing', 'bright', 'brown', 'by', 'call', 'cat', 'chased', 'chuck', 'could', 'countrymen', 'dog', 'dozen', 'drops', 'ears', 'every', 'falling', 'families', 'family', 'five', 'floor', 'fortune', 'fox', 'friends', 'good', 'ground', 'happy', 'head', 'hobbit', 'hole', 'how', 'if', 'in', 'is', 'ishmael', 'it', 'its', 'jugs', 'jump', 'jumps', 'keep', 'lazy', 'lend', 'liquor', 'lived', 'man', 'mat', 'me', 'men', 'merely', 'much', 'must', 'my', 'not', 'of', 'on', 'or', 'over', 'own', 'pack', 'peck', 'peppers', 'peter', 'picked', 'pickled', 'piper', 'players', 'possession', 'question', 'quick', 'quickly', 'rain', 'romans', 'romeo', 'sat', 'seashells', 'seashore', 'sells', 'she', 'shines', 'single', 'sky', 'stage', 'sun', 'that', 'the', 'there', 'thou', 'times', 'to', 'truth', 'unhappy', 'universally', 'want', 'was', 'way', 'wherefore', 'wife', 'with', 'wizards', 'women', 'wood', 'woodchuck', 'wo

# Step 2: Train the GloVe Model

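We will train word embeddings by minimizing the squared difference between the log of each observed co-occurrence count and the dot product of the corresponding word and context vectors. The implementation below is a simplified version of GloVe: it omits the bias terms and the count-based weighting function of the full objective and uses plain gradient descent updates.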

### Parameters:
- **Vector Size**: Dimensionality of the word embeddings.
- **Iterations**: Number of training iterations.
- **Learning Rate**: Step size for updates during training.
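
As a point of comparison with the simplified training loop in this notebook, the sketch below shows the weighting function and the per-pair weighted loss described in the original GloVe paper. The function names `glove_weight` and `weighted_pair_loss` are hypothetical, and the defaults `x_max=100` and `alpha=0.75` are the values reported in that paper, not settings used elsewhere in this notebook.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight f(x): grows as (x / x_max) ** alpha and is capped at 1 for x >= x_max."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def weighted_pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted squared error for one (word, context) pair with count x_ij > 0."""
    prediction = np.dot(w_i, w_j) + b_i + b_j
    return float(glove_weight(x_ij, x_max, alpha) * (prediction - np.log(x_ij)) ** 2)

# Rare pairs get a small weight, frequent pairs are capped at 1
print(glove_weight([1, 10, 100, 1000]))
```

The weighting keeps very frequent pairs from dominating the loss while still giving rare pairs a small, non-zero influence.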


In [14]:
def train_glove(co_occurrence, word_to_id, vector_size=50, iterations=50, learning_rate=0.05):
    vocab_size = len(word_to_id)
    X = co_occurrence.tocsr()

    # Apply logarithm to non-zero elements only
    X_log = X.copy()
    X_log.data = np.log(X_log.data)

    # Initialize word vectors
    W = np.random.randn(vocab_size, vector_size) / np.sqrt(vector_size)
    W_context = np.random.randn(vocab_size, vector_size) / np.sqrt(vector_size)

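    for _ in range(iterations):
        for i in range(vocab_size):
            J = X[i].nonzero()[1]  # Column indices of the words that co-occur with word i

            # Prediction error for every co-occurring word j: w_i . w_context_j - log(X_ij)
            error = W[i].dot(W_context[J].T) - X_log[i, J].toarray().ravel()

            # Compute both gradients before updating, so each uses the current parameters
            grad_word = error.dot(W_context[J])
            grad_context = np.outer(error, W[i])

            W[i] -= learning_rate * grad_word
            W_context[J] -= learning_rate * grad_context

    # Final word vectors are the sum of W and W_context
    return W + W_context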


# Train the model
word_vectors = train_glove(co_occurrence, word_to_id, vector_size=20, iterations=100)
print("Word vectors trained successfully!")



Word vectors trained successfully!


# Step 3: Find Similar Words

Using the trained embeddings, we can find words that are most similar to a given word by computing cosine similarity.

### Example:
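A query for "quick" can only return words that appear in the training corpus, such as "quickly" or "brown". Synonyms like "fast" or "speedy" would only show up if the corpus contained them.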
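
Cosine similarity measures the angle between two vectors and ignores their lengths. The following minimal sketch illustrates this on hand-picked 2-D vectors; the helper `cosine_similarity` and the vectors `a`, `b`, `c`, and `d` are illustrative only and are not trained embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both must be non-zero vectors)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # orthogonal to a
d = np.array([-1.0, 0.0])  # opposite direction to a

print(cosine_similarity(a, b))  # same direction
print(cosine_similarity(a, c))  # orthogonal
print(cosine_similarity(a, d))  # opposite direction
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and values near -1 indicate opposite directions.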


In [27]:
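def find_similar_words(word, word_to_id, word_vectors, top_n=3):
    if word not in word_to_id:
        return "Word not in vocabulary"

    word_id = word_to_id[word]
    word_vector = word_vectors[word_id]

    # Cosine similarity between the query vector and every word vector
    similarities = word_vectors.dot(word_vector) / (np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(word_vector))

    # Rank from most to least similar, excluding the query word itself
    ranked = [int(i) for i in np.argsort(similarities)[::-1] if i != word_id]
    most_similar = ranked[:top_n]

    # Invert the mapping once instead of searching it for every result
    id_to_word = {i: w for w, i in word_to_id.items()}
    return [(id_to_word[i], float(similarities[i])) for i in most_similar]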

# Example: Find similar words

print("\nWords similar to 'quick':")
print(find_similar_words('quick', word_to_id, word_vectors))



Words similar to 'quick':
[('women', 0.44342990320616027), ('dozen', 0.4300459737639469), ('quickly', 0.39493677590245657)]


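Keep in mind that these results come from the very small corpus defined earlier, so the similarities are noisy and mostly reflect which words happen to share a sentence. The vectors are also randomly initialized, so the exact neighbors and scores will vary from run to run. Training on a much larger corpus, such as a publicly available text dataset, would give far more meaningful neighbors.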

# Exercises: Hands-on Practice

1. **Experiment with Window Size**:
   Modify the `window_size` parameter when creating the co-occurrence matrix. Observe how it affects the embeddings.

2. **Custom Corpus**:
   Train GloVe on a different corpus, such as your own set of sentences or a publicly available dataset.

3. **Explore Similar Words**:
   Extend the `find_similar_words` function to list the top 10 most similar words for multiple queries.

4. **Hyperparameter Tuning**:
   Experiment with different values for `vector_size`, `iterations`, and `learning_rate`. Observe the impact on the quality of word embeddings.
