<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Transformer/2_word embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embedding and Representation

In [None]:
# %pip install numpy==1.24.3 # colab
# %pip install gensim # colab
from gensim.models import Word2Vec
import gensim.downloader as api
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from transformers import AutoTokenizer, AutoModel
import torch

### 1. One-Hot Encoding


**Objective**: Represent each word as a binary vector where only one position (corresponding to the word's index) is set to 1, while the rest are set to 0.

**Explanation**: One-hot encoding is a straightforward way to represent words, but it has two limitations. The vectors are sparse (all zeros except one entry) and grow with the vocabulary size, and they capture no semantic relationship between words: every pair of distinct words is equally dissimilar, so 'cat' is no closer to 'dog' than it is to 'car'.
    

In [20]:
# Sample vocabulary and sentences
vocabulary = ["hello", "world", "nlp", "tutorial"]
word_to_index = {word: idx for idx, word in enumerate(vocabulary)}

# One-hot encode a sample word
def one_hot_encode(word, vocab_size=len(vocabulary)):
    vector = np.zeros(vocab_size)
    vector[word_to_index[word]] = 1
    return vector

# Example usage
for word in vocabulary:
    print(f"One-hot encoding for '{word}':", one_hot_encode(word))
    

One-hot encoding for 'hello': [1. 0. 0. 0.]
One-hot encoding for 'world': [0. 1. 0. 0.]
One-hot encoding for 'nlp': [0. 0. 1. 0.]
One-hot encoding for 'tutorial': [0. 0. 0. 1.]
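
The limitation described in the explanation can be checked directly: any two distinct one-hot vectors are orthogonal, so their cosine similarity is 0 regardless of meaning. The following is a minimal sketch in plain Python; `toy_vocab`, `one_hot`, and `cosine_similarity` are illustrative names chosen to mirror the cat/dog/car example and are not part of the notebook's `vocabulary`.

```python
import math

# Hypothetical vocabulary, used only for this illustration
toy_vocab = ["cat", "dog", "car"]
toy_index = {word: idx for idx, word in enumerate(toy_vocab)}

def one_hot(word):
    """Return a one-hot list with 1.0 at the word's index."""
    vec = [0.0] * len(toy_vocab)
    vec[toy_index[word]] = 1.0
    return vec

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_cat_dog = cosine_similarity(one_hot("cat"), one_hot("dog"))
sim_cat_car = cosine_similarity(one_hot("cat"), one_hot("car"))
sim_cat_cat = cosine_similarity(one_hot("cat"), one_hot("cat"))

# Distinct words always score 0.0; only a word compared with itself scores 1.0
print(sim_cat_dog, sim_cat_car, sim_cat_cat)
```

Because every off-diagonal similarity is zero, a model that consumes one-hot vectors has to learn all relationships between words from scratch.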


### 2. Frequency-Based Embeddings: Bag of Words (BoW) and TF-IDF


**Objective**: Represent documents as vectors based on the frequency of each word in the document.

**Explanation**: 
- **Bag of Words (BoW)**: Represents each document as a vector of term counts, ignoring word order. While simple, BoW produces vectors as long as the vocabulary and discards context.
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: Reweights BoW counts so that a term scores high when it is frequent within a document but rare across the collection. This reduces the influence of words that appear in most documents and highlights distinctive terms, which is useful in document classification tasks.
    

In [21]:

# Sample sentences
sentences = ["hello world", "hello nlp", "nlp tutorial"]

# Bag of Words
vectorizer = CountVectorizer()
bow_vectors = vectorizer.fit_transform(sentences).toarray()
print("Bag of Words Vectors:", bow_vectors)
print("Vocabulary:", vectorizer.get_feature_names_out())

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(sentences).toarray()
print("TF-IDF Vectors:", tfidf_vectors)
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
    

Bag of Words Vectors: [[1 0 0 1]
 [1 1 0 0]
 [0 1 1 0]]
Vocabulary: ['hello' 'nlp' 'tutorial' 'world']
TF-IDF Vectors: [[0.60534851 0.         0.         0.79596054]
 [0.70710678 0.70710678 0.         0.        ]
 [0.         0.60534851 0.79596054 0.        ]]
Vocabulary: ['hello' 'nlp' 'tutorial' 'world']
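
The TF-IDF values printed above can be reproduced by hand, which shows where the weights come from. This sketch assumes scikit-learn's default `TfidfVectorizer` settings: a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The helper `tfidf_row` and the variable names are illustrative only.

```python
import math

docs = ["hello world", "hello nlp", "nlp tutorial"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc_tokens):
    """Term count times idf for each vocabulary term, then L2-normalized."""
    raw = [doc_tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf_manual = [tfidf_row(doc) for doc in tokenized]
print(vocab)
for row in tfidf_manual:
    print([round(v, 8) for v in row])
```

In the first document, 'world' appears in only one of the three sentences while 'hello' appears in two, so 'world' receives the larger weight even though both terms occur once.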


### 3. Distributed Embeddings: Word2Vec


**Objective**: Capture word meanings based on surrounding context using large text corpora, making similar words closer in vector space.

**Explanation**: Word2Vec creates word embeddings by training on contexts within sentences. It produces dense, low-dimensional vectors where words with similar meanings have similar embeddings. This is achieved through two training methods:
- **Skip-gram**: Predicts surrounding words given a target word, capturing word associations.
- **CBOW (Continuous Bag of Words)**: Predicts a target word from its surrounding context. It typically trains faster than skip-gram but tends to represent rare words less accurately.
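
The difference between the two objectives is easiest to see in the training examples each one builds from a sentence. The sketch below, in plain Python, lists them for a fixed window size. The helper names and the toy sentence are hypothetical, and real Word2Vec implementations add details such as negative sampling that are omitted here.

```python
def skipgram_pairs(sentence, window=2):
    """Return (target, context) pairs: the target predicts each neighbor."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

def cbow_examples(sentence, window=2):
    """Return (context_words, target) examples: the neighbors predict the target."""
    examples = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples

# Hypothetical sentence, used only for this illustration
toy_sentence = ["nlp", "tutorial", "for", "beginners"]
sg_pairs = skipgram_pairs(toy_sentence, window=1)
cbow = cbow_examples(toy_sentence, window=1)
print(sg_pairs)
print(cbow)
```

Skip-gram produces one training example per (target, neighbor) pair, whereas CBOW produces one example per position with all neighbors combined, which is one reason CBOW is usually the cheaper of the two to train.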
    

In [22]:
# Sample corpus
corpus = [["hello", "world"], ["hello", "nlp"], ["nlp", "tutorial"]]

# Train Word2Vec model (sg=1 selects skip-gram; sg=0 would select CBOW)
model = Word2Vec(sentences=corpus, vector_size=5, window=2, min_count=1, sg=1)

# Get vector for a word
# print("Word2Vec embedding for 'hello':", model.wv["hello"])
for word in vocabulary:
    print(f"Word2Vec embedding for '{word}':", model.wv[word])

Word2Vec embedding for 'hello': [-0.14233617  0.12917745  0.17945977 -0.10030856 -0.07526743]
Word2Vec embedding for 'world': [-0.03632035  0.0575316   0.01983747 -0.1657043  -0.18897636]
Word2Vec embedding for 'nlp': [-0.01072454  0.00472863  0.10206699  0.18018547 -0.186059  ]
Word2Vec embedding for 'tutorial': [ 0.1476101  -0.03066943 -0.09073226  0.13108103 -0.09720321]


### 4. GloVe (Global Vectors for Word Representation)


**Objective**: Capture word semantics by modeling global word co-occurrence statistics.

**Explanation**: Unlike Word2Vec, which learns from local contexts, GloVe (Global Vectors) uses global word co-occurrence statistics across a corpus to capture meaning. It results in embeddings where semantic relationships are preserved, allowing vector arithmetic to reveal analogies (e.g., "king" - "man" + "woman" ≈ "queen"). GloVe embeddings are typically pre-trained on large corpora and can be used as-is without further training on smaller datasets.
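
To make "global co-occurrence statistics" concrete, the sketch below counts how often each pair of words appears within a fixed window across an entire toy corpus. This is only the counting step that GloVe starts from, not the GloVe training procedure, which goes on to fit word vectors to these counts. The function name, the window size, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_counts(corpus, window=2):
    """Count symmetric co-occurrences of word pairs within `window` positions."""
    counts = Counter()
    for sentence in corpus:
        for i, word in enumerate(sentence):
            hi = min(len(sentence), i + window + 1)
            for j in range(i + 1, hi):
                # Record the pair in both directions so the table is symmetric
                counts[(word, sentence[j])] += 1
                counts[(sentence[j], word)] += 1
    return counts

# Hypothetical corpus, used only for this illustration
toy_corpus = [
    ["hello", "world"],
    ["hello", "nlp"],
    ["nlp", "tutorial"],
    ["hello", "world"],
]
counts = cooccurrence_counts(toy_corpus)
print(counts[("hello", "world")])     # pair seen in two sentences
print(counts[("hello", "tutorial")])  # pair never seen together
```

Unlike the per-window updates of Word2Vec, these counts are aggregated over the whole corpus before any vectors are learned.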
    

In [23]:
# Load GloVe embeddings (this may take a while)
glove_vectors = api.load("glove-wiki-gigaword-50")

# Get vector for a word
# print("GloVe embedding for 'hello':", glove_vectors["hello"])
for word in vocabulary:
    print(f"GloVe embedding for '{word}':", glove_vectors[word])
    

GloVe embedding for 'hello': [-0.38497   0.80092   0.064106 -0.28355  -0.026759 -0.34532  -0.64253
 -0.11729  -0.33257   0.55243  -0.087813  0.9035    0.47102   0.56657
  0.6985   -0.35229  -0.86542   0.90573   0.03576  -0.071705 -0.12327
  0.54923   0.47005   0.35572   1.2611   -0.67581  -0.94983   0.68666
  0.3871   -1.3492    0.63512   0.46416  -0.48814   0.83827  -0.9246
 -0.33722   0.53741  -1.0616   -0.081403 -0.67111   0.30923  -0.3923
 -0.55002  -0.68827   0.58049  -0.11626   0.013139 -0.57654   0.048833
  0.67204 ]
GloVe embedding for 'world': [-0.41486   0.71848  -0.3045    0.87445   0.22441  -0.56488  -0.37566
 -0.44801   0.61347  -0.11359   0.74556  -0.10598  -1.1882    0.50974
  1.3511    0.069851  0.73314   0.26773  -1.1787   -0.148     0.039853
  0.033107 -0.27406   0.25125   0.41507  -1.6188   -0.81778  -0.73892
 -0.28997   0.57277   3.4719    0.73817  -0.044495 -0.15119  -0.93503
 -0.13152  -0.28562   0.76327  -0.83332  -0.6793   -0.39099  -0.64466
  1.0044   -0.2051  

### 5. Contextual Embeddings (Overview)


**Objective**: Generate word embeddings that change based on context, which is important for capturing the nuances in sentences.

**Explanation**: Traditional embeddings like Word2Vec and GloVe assign each word a single vector, but contextual embeddings, used in models like BERT, generate different vectors for the same word based on surrounding words. This is particularly useful for disambiguating words with multiple meanings (e.g., "bank" as a financial institution vs. "bank" of a river). Contextual embeddings allow for richer representations, enhancing performance in tasks like sentiment analysis, translation, and question answering.
    

In [24]:
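# Initialize model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Use a separate name so the Word2Vec `model` defined earlier is not overwritten
bert_model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize and get embeddings
input_text = "NLP is fascinating!"
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():  # gradients are not needed for feature extraction
    outputs = bert_model(**inputs)

# Extract contextual embeddings for each token
# Shape is (batch_size, num_tokens, hidden_size); num_tokens includes [CLS] and [SEP]
embeddings = outputs.last_hidden_state
print("Contextual embeddings shape:", embeddings.shape)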
    

Contextual embeddings shape: torch.Size([1, 7, 768])
