# Global Vectors for Word Representation

In Natural Language Processing (NLP), GloVe (Global Vectors for Word Representation) is a word embedding method that represents each word as a dense, real-valued vector. The goal of word embeddings is to capture semantic relationships between words, so that words with similar meanings end up with nearby vectors.

GloVe was introduced in a 2014 paper by Pennington, Socher, and Manning. It builds on the idea that words that appear in similar contexts tend to have similar meanings. The algorithm learns dense word vectors by fitting them to global word-word co-occurrence statistics collected from a corpus, which can be viewed as a weighted factorization of the log co-occurrence matrix.
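
For reference, the weighted least-squares objective from the GloVe paper can be written as below. Here $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

The paper reports using $x_{\max} = 100$ and $\alpha = 3/4$; pairs with $X_{ij} = 0$ contribute nothing because $f(0) = 0$.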

Here's a high-level overview of how GloVe works:

1. **Co-occurrence matrix**: Build a word-by-word matrix over the vocabulary. The cell at row i and column j counts how often word j appears within a fixed-size context window around word i, accumulated over the whole corpus.
2. **Weighted factorization**: Learn a word vector and a context vector for every word, plus bias terms, so that the dot product of the word vector for word i and the context vector for word j approximates the logarithm of their co-occurrence count. The fit minimizes a weighted least-squares loss that down-weights rare pairs; the original paper optimizes it with AdaGrad, a variant of stochastic gradient descent.
3. **Vector representation**: After training, every word has both a word vector and a context vector. The paper uses the sum of the two as the final embedding for that word.
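
The steps above can be sketched on a toy corpus. This is a minimal illustration rather than the reference implementation: the function names `build_cooccurrence` and `glove_loss`, the two-sentence corpus, and the 5-dimensional random vectors are all assumptions made for the example. It counts windowed co-occurrences (weighting each pair by the inverse of its distance, as the paper describes) and evaluates the weighted squared error for randomly initialized vectors, without running any optimization.

```python
from collections import defaultdict

import numpy as np


def build_cooccurrence(sentences, window=2):
    """Count word-word co-occurrences within a symmetric context window."""
    vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}
    counts = defaultdict(float)
    for sent in sentences:
        ids = [vocab[w] for w in sent]
        for pos, word_id in enumerate(ids):
            lo = max(0, pos - window)
            hi = min(len(ids), pos + window + 1)
            for ctx in range(lo, hi):
                if ctx != pos:
                    # A pair that is d words apart contributes 1/d to the count.
                    counts[(word_id, ids[ctx])] += 1.0 / abs(ctx - pos)
    return vocab, counts


def glove_loss(counts, W, W_ctx, b, b_ctx, x_max=100.0, alpha=0.75):
    """Weighted squared error between (w_i . w~_j + b_i + b~_j) and log X_ij."""
    loss = 0.0
    for (i, j), x in counts.items():
        weight = min(1.0, (x / x_max) ** alpha)  # caps the influence of frequent pairs
        err = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x)
        loss += weight * err ** 2
    return loss


# Hypothetical toy corpus, already tokenized
sentences = [["ice", "is", "cold"], ["steam", "is", "hot"]]
vocab, counts = build_cooccurrence(sentences, window=2)

# Randomly initialized parameters; a real run would update these by gradient descent
rng = np.random.default_rng(0)
dim = 5
W = rng.normal(scale=0.1, size=(len(vocab), dim))
W_ctx = rng.normal(scale=0.1, size=(len(vocab), dim))
b = np.zeros(len(vocab))
b_ctx = np.zeros(len(vocab))

print("nonzero pairs:", len(counts))
print("initial loss:", glove_loss(counts, W, W_ctx, b, b_ctx))
```

Only nonzero co-occurrence entries are stored and visited, which is why the cost of training depends on the number of distinct co-occurring pairs rather than on the full vocabulary-squared matrix.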

The resulting vector representations of words are dense, real-valued vectors that capture the semantic meaning and context of each word. These vectors can be used for a variety of NLP tasks, such as:

* Text classification
* Sentiment analysis
* Information retrieval
* Language modeling
* Machine translation

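GloVe has several practical strengths:

* **Scalability**: Co-occurrence counts are collected once, and training then iterates over the nonzero entries of the matrix rather than over the raw corpus, so GloVe can handle large vocabularies and large datasets.
* **Flexibility**: Pre-trained GloVe vectors can be reused as input features for a wide range of NLP tasks.
* **Interpretability**: Word vectors can be projected to two dimensions for visualization, and a word's nearest neighbors in the vector space can be inspected directly.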

However, GloVe also has some limitations, such as:

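* **Computational complexity**: Building and storing the co-occurrence matrix for a large corpus and vocabulary can require substantial memory, and fitting the vectors to it can be computationally expensive.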
* **Hyperparameter tuning**: The performance of GloVe depends on the choice of hyperparameters, such as the dimensionality of the vector space and the learning rate.

Overall, GloVe is a powerful tool for representing words in a high-dimensional space, and it has been widely used in many NLP applications.

In [4]:
import numpy as np

# Resources:
# https://www.kaggle.com/datasets/sawarn69/glove6b100dtxt

# Load the pre-trained GloVe vectors into a dict mapping word -> vector
glove_model = {}

# Alternative location (e.g. on Colab): '/content/glove.6B.100d.txt'
with open('../data/glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # Each line holds a word followed by its vector components
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        glove_model[word] = vector

# Define two words to compare
word1 = 'hello'
word2 = 'world'

# Get the vector representations of the words
# (dict.get returns None if a word is not in the vocabulary)
vector1 = glove_model.get(word1)
vector2 = glove_model.get(word2)


print(f'Word embedding for {word1}: {vector1}')



In [None]:

# Calculate the cosine similarity between the two words
similarity = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(f"Similarity between '{word1}' and '{word2}': {similarity:.4f}")
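
The similarity cell above assumes both words are in the vocabulary. A slightly more defensive variant is sketched below; the helper names `cosine_similarity` and `most_similar` and the three-dimensional toy vectors are hypothetical stand-ins, not part of the GloVe files. With the real data, the dictionary loaded from `glove.6B.100d.txt` would take the place of `toy_embeddings`.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0


def most_similar(word, embeddings, top_n=3):
    """Rank all other words by cosine similarity to `word`."""
    if word not in embeddings:
        raise KeyError(f"'{word}' is not in the embedding vocabulary")
    query = embeddings[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical 3-dimensional vectors standing in for real GloVe rows
toy_embeddings = {
    "king": np.array([0.9, 0.8, 0.1], dtype="float32"),
    "queen": np.array([0.85, 0.9, 0.15], dtype="float32"),
    "apple": np.array([0.1, 0.2, 0.95], dtype="float32"),
}

for neighbor, score in most_similar("king", toy_embeddings, top_n=2):
    print(f"{neighbor}: {score:.4f}")
```

A linear scan like this is fine for a handful of queries; for repeated lookups over a 400K-word vocabulary, stacking the vectors into one normalized matrix and using a single matrix-vector product would be faster.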

## Train a Classifier with Feature Vectors

In [26]:
import os

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from datasets import load_dataset
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Step 1: Load GloVe embeddings
def load_glove_embeddings(glove_file, word2vec_file):
    # Convert GloVe to Word2Vec format only if the converted file does not exist yet.
    # Note: glove2word2vec is deprecated in gensim 4.x, where
    # KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)
    # reads the GloVe text file directly.
    if not os.path.exists(word2vec_file):
        glove2word2vec(glove_file, word2vec_file)
    return KeyedVectors.load_word2vec_format(word2vec_file, binary=False)


## Load pre-trained embeddings


In [27]:

# GloVe files (use smaller dimensions for faster processing)
glove_file = "/content/glove.6B.100d.txt"
word2vec_file = "/content/glove.6B.100d.word2vec.txt"

glove_model = load_glove_embeddings(glove_file, word2vec_file)





## Step 2: Prepare dataset (using a Hugging Face dataset for demonstration)
We'll use the AG News dataset as an example.

In [28]:
dataset = load_dataset("ag_news")
train_data = dataset["train"]
test_data = dataset["test"]


## Step 3: Text preprocessing and feature extraction


In [29]:

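def text_to_glove_vector(text, glove_model, embedding_dim=100):
    """
    Converts a text document to a feature vector by averaging its word embeddings.
    Words missing from the GloVe vocabulary are skipped; if none are found,
    a zero vector of length embedding_dim is returned.
    """
    # Whitespace tokenization keeps case and attached punctuation, so tokens such as
    # "Stocks" or "rise!" will not match the lowercase glove.6B vocabulary.
    words = text.split()
    word_vectors = [glove_model[word] for word in words if word in glove_model]
    if len(word_vectors) == 0:
        return np.zeros(embedding_dim)
    return np.mean(word_vectors, axis=0)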
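
Since the glove.6B vectors are uncased, lowercasing and stripping punctuation before lookup usually lets more tokens match the vocabulary. The sketch below shows one way to do that; `tokenize`, `average_embedding`, the regular expression, and the two-word toy dictionary are assumptions for illustration, and the regex is a deliberate simplification of real tokenization (it splits contractions and hyphenated words, for instance).

```python
import re

import numpy as np

# Runs of lowercase letters and digits; everything else acts as a separator
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return TOKEN_PATTERN.findall(text.lower())


def average_embedding(text, embeddings, embedding_dim):
    """Average the vectors of in-vocabulary tokens; zeros if none are found."""
    vectors = [embeddings[tok] for tok in tokenize(text) if tok in embeddings]
    if not vectors:
        return np.zeros(embedding_dim)
    return np.mean(vectors, axis=0)


# Hypothetical 2-dimensional vectors standing in for real GloVe rows
toy_embeddings = {
    "stocks": np.array([1.0, 0.0]),
    "rise": np.array([0.0, 1.0]),
}

print(tokenize("Stocks rise!"))                                            # ['stocks', 'rise']
print(average_embedding("Stocks rise!", toy_embeddings, embedding_dim=2))  # [0.5 0.5]
```

With plain `text.split()`, neither "Stocks" nor "rise!" would be found in this toy vocabulary and the result would be the zero vector. Note that changing the tokenization changes the features, so any reported metrics would need to be recomputed.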


## Create feature matrices for train and test sets


In [30]:
X_train = np.array([text_to_glove_vector(text, glove_model) for text in train_data["text"]])
X_test = np.array([text_to_glove_vector(text, glove_model) for text in test_data["text"]])

# Use integer labels for Naive Bayes
y_train = np.array(train_data["label"])
y_test = np.array(test_data["label"])



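## Step 4: Train a Naive Bayes classifier
Note: MultinomialNB expects non-negative, count-like features, whereas averaged embeddings are continuous and can be negative.
We'll scale and round them to integers, then shift them so that all values are non-negative.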

In [31]:

X_train_discrete = np.round(X_train * 10).astype(int)
X_test_discrete = np.round(X_test * 10).astype(int)
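
To make the effect of scaling, rounding, and shifting concrete, here is a small worked example on made-up numbers. The helper name `discretize_for_multinomial_nb` and the input arrays are hypothetical; the offset is taken from the training data only, and test values that fall below it are clipped to zero so the result stays non-negative.

```python
import numpy as np


def discretize_for_multinomial_nb(X_train, X_test, scale=10):
    """Scale and round features to integers, then shift them to be non-negative."""
    X_train_int = np.round(X_train * scale).astype(int)
    X_test_int = np.round(X_test * scale).astype(int)
    # Use the training minimum only, so no information leaks from the test set
    offset = X_train_int.min()
    X_train_int = X_train_int - offset
    # Test values below the training minimum would be negative; clip them to 0
    X_test_int = np.clip(X_test_int - offset, 0, None)
    return X_train_int, X_test_int


# Made-up feature values for illustration
X_tr = np.array([[-0.31, 0.12], [0.04, -0.18]])
X_te = np.array([[-0.52, 0.26]])

train_bins, test_bins = discretize_for_multinomial_nb(X_tr, X_te)
print(train_bins)  # [[0 4] [3 1]]
print(test_bins)   # [[0 6]]
```

With a scale of 10, values are binned to one decimal place, so nearby embedding values collapse into the same integer. A larger scale keeps more detail at the cost of larger pseudo-counts.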


In [32]:

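# Train Naive Bayes model
nb_classifier = MultinomialNB()
# MultinomialNB requires non-negative features: shift by the training minimum
shift = -X_train_discrete.min()
X_train_discrete += shift
# Clip so that test values below the training minimum do not become negative
X_test_discrete = np.clip(X_test_discrete + shift, 0, None)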

nb_classifier.fit(X_train_discrete, y_train)

# Step 5: Evaluate the classifier
y_pred = nb_classifier.predict(X_test_discrete)

print("Classification Report:")
print(classification_report(y_test, y_pred))

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.79      0.76      1900
           1       0.82      0.85      0.83      1900
           2       0.77      0.67      0.72      1900
           3       0.71      0.72      0.72      1900

    accuracy                           0.76      7600
   macro avg       0.76      0.76      0.76      7600
weighted avg       0.76      0.76      0.76      7600

Accuracy: 0.76
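
Because averaged embeddings are continuous and signed, a Gaussian Naive Bayes model can consume them directly, without the rounding and shifting step. The sketch below shows the calls on synthetic data only; the two-class Gaussian features are an assumption made so the example is self-contained, and the printed number says nothing about performance on AG News.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)
n_per_class, dim = 200, 100

# Synthetic stand-in for averaged embeddings: two classes with different means
X = np.vstack([
    rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, dim)),
    rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim)),
])
y = np.repeat([0, 1], n_per_class)

# Shuffle, then hold out 20% of the rows for evaluation
order = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = order[:split], order[split:]

# GaussianNB models each feature per class as a normal distribution,
# so negative and fractional values are fine
gnb = GaussianNB()
gnb.fit(X[train_idx], y[train_idx])

demo_accuracy = accuracy_score(y[test_idx], gnb.predict(X[test_idx]))
print(f"Accuracy on synthetic data: {demo_accuracy:.2f}")
```

With the notebook's real features, the same `fit` and `predict` calls would take `X_train` and `X_test` in place of the synthetic arrays.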
