<a href="https://colab.research.google.com/github/juacardonahe/Curso_NLP/blob/main/1_FundamentosNLP/1.3_WordEmbeddings/1_3_1_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://raw.githubusercontent.com/juacardonahe/Curso_NLP/refs/heads/main/data/UnFieldB.png" width="40%">

# **Natural Language Processing (NLP)**
### Departamento de Ingeniería Eléctrica, Electrónica y Computación
#### Universidad Nacional de Colombia - Sede Manizales

#### Created by: Juan José Cardona H.
#### Reviewed by: Diego A. Perez

#**1.3 - Feature Extraction (Embeddings)**
Traditional machine learning methods typically rely on hand-engineered features to represent documents within a corpus. Common approaches include Bag-of-Words, TF-IDF, and manually crafted features such as document length, word sentiment, or metadata (e.g., tags or associated scores). Modern techniques, however, leverage learned representations through methods like Word2Vec, GloVe, or end-to-end feature learning via neural networks.

##**1.3.1 - Conventional Models**

###**Bag of Words (BoW)**
A representation of text data where each document is represented as a bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity.

The Bag of Words model is a commonly used technique in NLP where each word in the text is represented as a feature, and its frequency is used as the feature value.

**Importance:** Simplifies text processing and is used in various NLP applications such as text classification and information retrieval.

**Limitation:** Ignores word order and context (so “dog bites man” and “man bites dog” look identical), creates huge, sparse vectors as your features grow with every unique word, can’t capture semantic similarity (treating “happy” and “joyful” as unrelated), is sensitive to typos and rare terms that bloat the feature space, and weights every word occurrence equally, even though some words carry more meaning than others.
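The word-order limitation can be checked directly. The following is a minimal sketch in plain Python, not the scikit-learn implementation: the helper `bag_of_words` is a hypothetical name introduced here, and it assumes simple lowercase whitespace tokenization.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and one count vector per document."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One entry per vocabulary word; Counter returns 0 for absent words
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["dog bites man", "man bites dog"])
print(vocabulary)                # ['bites', 'dog', 'man']
print(vectors)                   # [[1, 1, 1], [1, 1, 1]]
print(vectors[0] == vectors[1])  # True
```

Both sentences map to the same vector even though they describe opposite events, because only the counts survive.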

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "Apple develops innovative smartphones and computers",
    "Google focuses on search engines and cloud services",
    "Microsoft builds software and cloud computing platforms"
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the model and transform the documents into a Bag of Words
bow_matrix = vectorizer.fit_transform(documents)

# Get the feature names (unique words in the corpus)
feature_names = vectorizer.get_feature_names_out()

# Convert the Bag of Words matrix into an array
bow_array = bow_matrix.toarray()

# Display the Bag of Words
print("Feature Names (Words):", feature_names)
print("\nBag of Words Representation:")
print(bow_array)

Feature Names (Words): ['and' 'apple' 'builds' 'cloud' 'computers' 'computing' 'develops'
 'engines' 'focuses' 'google' 'innovative' 'microsoft' 'on' 'platforms'
 'search' 'services' 'smartphones' 'software']

Bag of Words Representation:
[[1 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0]
 [1 0 0 1 0 0 0 1 1 1 0 0 1 0 1 1 0 0]
 [1 0 1 1 0 1 0 0 0 0 0 1 0 1 0 0 0 1]]


###**Term Frequency-Inverse Document Frequency (TF-IDF)**
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used to evaluate the importance of a word in a document relative to a collection of documents (or corpus). The basic idea is that if a word appears frequently in a document but not in many other documents, it should be given more importance.

The TF-IDF score for a term *t* in a document *d* within a corpus *D* is calculated as the product of two metrics: **Term Frequency (TF)** and **Inverse Document Frequency (IDF)**.

**Importance:** Helps in identifying important words in documents and is commonly used in information retrieval and text mining.

**Limitations:** Ignores word order and context, so it can’t distinguish nuances in phrasing; it assumes each document is independent and static, making it ill‑suited for streaming or evolving corpora; very common but meaningful phrases can still get low scores if they appear in many documents; it’s sensitive to noise (e.g., typos or OCR errors) that skew term counts; and it produces sparse, high‑dimensional vectors that can be inefficient and may require dimensionality reduction or feature selection to work well in downstream models.

####**Term Frequency (TF)**
Measures how frequently a term appears in a document. It is often normalized by the total number of terms in the document to prevent bias toward longer documents.

$$
TF(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

####**Inverse Document Frequency (IDF)**
Inverse Document Frequency (IDF) measures how important a term is in the entire corpus. It decreases the weight of terms that appear in many documents and increases the weight of terms that appear in fewer documents.

$$
IDF(t, D) = \log\left( \frac{\text{Total number of documents in corpus } D}{\text{Number of documents where term } t \text{ appears} + 1} \right)
$$

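**Note:** The “+1” is added to the denominator to prevent division by zero in case the term doesn’t appear in any document. Library implementations often use a different variant: scikit-learn's `TfidfVectorizer`, with its default settings, uses raw term counts for TF, a smoothed IDF of the form $\ln\frac{1 + N}{1 + df(t)} + 1$ (where $N$ is the number of documents and $df(t)$ the number of documents containing $t$), and then L2-normalizes each document vector, so its output does not match the formulas in this section exactly.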

####**Combining TF and IDF: TF-IDF**
The TF-IDF score is calculated by multiplying the TF value with the IDF value for a term *t* in a document *d*:

$$
TF\text{-}IDF(t,d,D) = TF(t,d) \times IDF(t,D)
$$
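The three formulas above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the toy corpus and the function names `tf`, `idf`, and `tf_idf` are assumptions introduced here, and the code follows the textbook definitions in this section rather than any library's variant.

```python
import math

# Hypothetical, already-tokenized toy corpus
corpus = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "rain", "poured"],
]

def tf(term, doc):
    # Term count normalized by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Number of documents containing the term, with +1 in the denominator
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (df + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_fox = tf_idf("fox", corpus[0], corpus)
score_the = tf_idf("the", corpus[0], corpus)
print(f"fox: {score_fox:.4f}")
print(f"the: {score_the:.4f}")
```

With this variant, a term that occurs in every document (here "the") gets a negative IDF, since the ratio inside the logarithm drops below 1, while a term unique to one document ("fox") gets a positive score.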

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data (documents)
documents = [
    "A quick brown fox jumps over the lazy dog.",
    "Bright sunshine streamed through the window.",
    "The rain poured all night long."
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the model and transform the documents into TF-IDF representation
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the feature names (unique words in the corpus)
feature_names = vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix into an array
tfidf_array = tfidf_matrix.toarray()

# Display the TF-IDF matrix
print("Feature Names (Words):", feature_names)
print("\nTF-IDF Matrix:")
print(tfidf_array)

Feature Names (Words): ['all' 'bright' 'brown' 'dog' 'fox' 'jumps' 'lazy' 'long' 'night' 'over'
 'poured' 'quick' 'rain' 'streamed' 'sunshine' 'the' 'through' 'window']

TF-IDF Matrix:
[[0.         0.         0.36888498 0.36888498 0.36888498 0.36888498
  0.36888498 0.         0.         0.36888498 0.         0.36888498
  0.         0.         0.         0.21786941 0.         0.        ]
 [0.         0.43238509 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.43238509 0.43238509 0.2553736  0.43238509 0.43238509]
 [0.43238509 0.         0.         0.         0.         0.
  0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.         0.         0.2553736  0.         0.        ]]
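The values in the matrix above do not follow the textbook formulas directly, because `TfidfVectorizer` defaults to a smoothed IDF and L2 normalization. As a rough check, the first row can be reproduced by hand. The sketch below assumes those defaults (`smooth_idf=True`, `norm='l2'`) and the default token pattern, which drops the single-character token "a"; the helper name `smooth_idf` is introduced here for illustration.

```python
import math

def smooth_idf(df, n_docs):
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 3
# Tokens kept from "A quick brown fox jumps over the lazy dog."
counts = {t: 1 for t in
          ["quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]}
# Every term is unique to document 1, except "the", which is in all three
doc_freq = {t: 1 for t in counts}
doc_freq["the"] = 3

# Raw TF-IDF weights: term count times smoothed IDF
raw = {t: counts[t] * smooth_idf(doc_freq[t], n_docs) for t in counts}
# L2-normalize so the document vector has unit length
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {t: v / norm for t, v in raw.items()}

print(f"fox: {weights['fox']:.4f}")
print(f"the: {weights['the']:.4f}")
```

The two printed weights should agree with the entries for 'fox' and 'the' in the first row of the matrix (about 0.3689 and 0.2179).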


##**1.3.2 - Distributional Models**

###**Word Embeddings**
Word embeddings are dense vector representations of words that capture their meanings, syntactic properties, and relationships with other words. Unlike traditional methods like Bag of Words or TF-IDF, which treat words as discrete entities, word embeddings map words into a continuous vector space where semantically similar words are located near each other.

**Importance:** Captures semantic relationships between words and is used in various NLP models and applications.

**Limitation:** Word embeddings like Word2Vec or GloVe capture semantics but still have drawbacks: they assign a single static vector per word, so they can’t handle polysemy (“bank” as river‑edge vs. financial institution); they ignore word order and sentence context; they require lots of training data to learn good representations; pre‑trained embeddings may not cover domain‑specific or rare vocabulary; and the resulting vectors can encode unwanted biases present in the training text.
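"Located near each other" is usually measured with cosine similarity between vectors. The sketch below shows the computation on made-up numbers: the three 3-dimensional vectors are hypothetical values chosen for illustration, not taken from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration only
toy_vectors = {
    "happy":  [0.90, 0.80, 0.10],
    "joyful": [0.85, 0.75, 0.20],
    "bank":   [0.10, 0.20, 0.95],
}

sim_close = cosine_similarity(toy_vectors["happy"], toy_vectors["joyful"])
sim_far = cosine_similarity(toy_vectors["happy"], toy_vectors["bank"])
print(f"happy vs joyful: {sim_close:.3f}")
print(f"happy vs bank:   {sim_far:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score much lower. Real embeddings have tens to hundreds of dimensions, but the computation is the same.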

####**Word2Vec**
Word2Vec is a neural network-based technique that learns dense vector representations (embeddings) for words by predicting a word from its context (or vice versa). It comes in two main flavors: CBOW (Continuous Bag-of-Words), which predicts a target word given surrounding words, and Skip-Gram, which predicts surrounding words from a target word. Through training on large text corpora, Word2Vec produces vectors where semantically similar words end up close together in the embedding space.
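The difference between the two flavors is easiest to see in the training examples each one derives from a sentence. The following sketch only generates those examples and does not train a network; the function names `skipgram_pairs` and `cbow_examples` and the sample sentence are assumptions introduced for illustration.

```python
def skipgram_pairs(tokens, window):
    """Skip-Gram: one (target, context_word) pair per neighbor in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["google", "invests", "in", "cloud", "technology"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print(sg[:3])  # [('google', 'invests'), ('invests', 'google'), ('invests', 'in')]
print(cb[2])   # (['invests', 'cloud'], 'in')
```

Skip-Gram turns each position into several prediction tasks, one per neighbor, whereas CBOW merges the neighbors into a single input for one prediction.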

In [21]:
!pip install --upgrade gensim


In [4]:
from gensim import models
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

# Download NLTK data for tokenization
nltk.download('punkt')
nltk.download('punkt_tab')

# Sample text data
sentences = [
    "Apple develops innovative smartphones and computers",
    "Google focuses on search engines and cloud services",
    "Microsoft builds software and cloud computing platforms",
    "Apple and Google compete in the smartphone market",
    "Microsoft and Google invest in cloud technology"
]

# Tokenize sentences into words
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train Word2Vec model
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,  # Size of word embeddings
    window=5,         # Context window size
    min_count=1,      # Keep every word (only words with frequency < 1 are ignored)
    workers=2,        # Number of CPU cores to use
    sg=0              # Use CBOW (0) instead of Skip-gram (1)
)

# Save the model (optional)
model.save("word2vec_tech.model")

# Display vocabulary
print("Vocabulary:", list(model.wv.index_to_key))

# Example: Get word embedding for 'cloud'
print("\nWord embedding for 'cloud':")
print(model.wv['cloud'][:10], "... (first 10 dimensions)")

# Example: Find similar words to 'cloud'
print("\nWords similar to 'cloud':")
similar_words = model.wv.most_similar('cloud', topn=3)
for word, score in similar_words:
    print(f"{word}: {score:.4f}")

# Example: Compute similarity between two words
similarity = model.wv.similarity('google', 'microsoft')
print("\nSimilarity between 'google' and 'microsoft':", round(similarity, 4))

Vocabulary: ['and', 'google', 'cloud', 'apple', 'in', 'microsoft', 'engines', 'develops', 'innovative', 'smartphones', 'computers', 'focuses', 'on', 'search', 'technology', 'invest', 'builds', 'software', 'computing', 'platforms', 'compete', 'the', 'smartphone', 'market', 'services']

Word embedding for 'cloud':
[ 9.7127428e-05  3.1004562e-03 -6.8210070e-03 -1.3901597e-03
  7.6748789e-03  7.3423479e-03 -3.6663914e-03  2.6678147e-03
 -8.3423387e-03  6.1972556e-03] ... (first 10 dimensions)

Words similar to 'cloud':
innovative: 0.1990
focuses: 0.1742
in: 0.1709

Similarity between 'google' and 'microsoft': 0.0094

**Note:** With only five short sentences, the model has far too little data to learn meaningful vectors, so the neighbors and similarity scores above are essentially noise. Useful Word2Vec embeddings require training on a large corpus.




####**GloVe**
GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm that builds word embeddings from a global word-word co-occurrence matrix. It counts how often words appear together across a corpus to capture global statistics, then fits word vectors whose dot products (plus bias terms) approximate the logarithm of those co-occurrence counts. The result is dense vectors where words with similar meanings or usage patterns lie close together.
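The first stage, collecting co-occurrence counts, can be sketched as follows. This is a simplified illustration rather than the GloVe reference implementation: it counts every pair inside the window equally, whereas GloVe typically down-weights more distant pairs. The function name `cooccurrence_counts` and the toy sentences are assumptions.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Symmetric word-word co-occurrence counts within a fixed window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right and record both directions
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

# Hypothetical, already-tokenized toy corpus
sentences = [
    ["google", "builds", "cloud", "services"],
    ["microsoft", "builds", "cloud", "platforms"],
]
counts = cooccurrence_counts(sentences, window=2)

print(counts[("builds", "cloud")])      # 2
print(counts[("google", "cloud")])      # 1
print(counts[("google", "platforms")])  # 0
```

GloVe then learns vectors from these counts, so that words with similar co-occurrence profiles (here "google" and "microsoft") end up with similar vectors.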

In [5]:
import gensim.downloader as api
import numpy as np

# Load pre-trained GloVe model (50-dimensional vectors from Wikipedia 2014 + Gigaword 5)
glove_model = api.load("glove-wiki-gigaword-50")

# Sample words to explore
words = ["apple", "google", "microsoft", "cloud"]

# Display word vectors for sample words
print("Word Vectors (first 10 dimensions):")
for word in words:
    if word in glove_model:
        print(f"{word}: {glove_model[word][:10]} ...")
    else:
        print(f"{word}: Not in vocabulary")

# Find words similar to 'cloud'
print("\nWords similar to 'cloud':")
similar_words = glove_model.most_similar('cloud', topn=3)
for word, score in similar_words:
    print(f"{word}: {score:.4f}")

# Compute similarity between 'google' and 'microsoft'
similarity = glove_model.similarity('google', 'microsoft')
print("\nSimilarity between 'google' and 'microsoft':", round(similarity, 4))

# Example: Word analogy (google - apple + microsoft)
print("\nAnalogy: google - apple + microsoft")
result = glove_model.most_similar(positive=['google', 'microsoft'], negative=['apple'], topn=1)
print(f"Result: {result[0][0]} (score: {result[0][1]:.4f})")

Word Vectors (first 10 dimensions):
apple: [ 0.52042  -0.8314    0.49961   1.2893    0.1151    0.057521 -1.3753
 -0.97313   0.18346   0.47672 ] ...
google: [ 0.969    -0.61799   1.6561    1.4079   -0.063774 -0.45934  -1.0961
 -1.2889    0.51762   0.48351 ] ...
microsoft: [ 0.58446 -1.4571   1.1942   1.7112  -0.29475 -0.4456  -0.71201 -1.2774
  0.13483  0.94727] ...
cloud: [ 0.54407    0.9233     0.50644    0.46454   -0.62015   -0.35166
 -0.066969  -0.32835    0.67205    0.0049138] ...

Words similar to 'cloud':
clouds: 0.8214
dust: 0.7634
horizon: 0.7332

Similarity between 'google' and 'microsoft': 0.8451

Analogy: google - apple + microsoft
Result: yahoo (score: 0.7533)


##**1.3.3 - Pre-Trained Models (BERT)**

Unlike traditional word embeddings (like Word2Vec), which provide a single vector for each word regardless of context, contextual embeddings produce different vectors for a word depending on the surrounding words. These embeddings are generated using models like BERT (Bidirectional Encoder Representations from Transformers), which take into account the entire sentence when computing word representations.

**Importance:** Provides more accurate word representations by considering the surrounding context.

**Limitations:** BERT is computationally heavy and slow to train or run inference (large model size and attention cost that grows quadratically with sequence length); it can only handle inputs of up to 512 tokens, so it struggles with long documents; it uses a fixed masking objective that doesn’t directly model generation or sequential dependencies; it requires substantial fine-tuning data for downstream tasks; and, even though it’s “contextual”, it still can’t incorporate real-time or world knowledge after pretraining and may perpetuate biases present in its training corpus.
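One common workaround for the 512-token limit is to split a long document into overlapping windows and encode each window separately. The sketch below shows only the windowing step on a list of token IDs; the function name `chunk_tokens`, the stride of 128, and the integer stand-ins for token IDs are all assumptions for illustration, and how the per-window embeddings are combined afterwards is left open.

```python
def chunk_tokens(token_ids, max_len=512, stride=128, n_special=2):
    """Split token IDs into overlapping chunks that fit the model's input.

    n_special reserves room for the [CLS] and [SEP] tokens added later;
    stride is the number of tokens shared by consecutive chunks.
    """
    body = max_len - n_special
    if not 0 <= stride < body:
        raise ValueError("stride must be smaller than the usable chunk length")
    step = body - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + body])
        if start + body >= len(token_ids):
            break
        start += step
    return chunks

# Hypothetical document of 1200 token IDs (plain integers as stand-ins)
token_ids = list(range(1200))
chunks = chunk_tokens(token_ids)

print(len(chunks))               # 3
print([len(c) for c in chunks])  # [510, 510, 436]
```

The overlap gives tokens near a chunk boundary some context from both sides, at the cost of encoding those tokens twice.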

In [6]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample text
text = "Apple develops innovative smartphones and Google focuses on cloud services."

# Tokenize input text
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)

# Get BERT embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Extract [CLS] token embedding (first token, often used as a sentence representation)
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Extract contextual embeddings for all tokens, including the special [CLS] and [SEP] tokens
word_embeddings = outputs.last_hidden_state[0]

# Print results
print("Input text:", text)
print("\n[CLS] token embedding shape:", cls_embedding.shape)
print("[CLS] embedding (first 10 dimensions):", cls_embedding[0, :10].numpy())
print("\nNumber of tokens:", word_embeddings.shape[0])
# Note: index 0 of word_embeddings is the [CLS] token itself, so this prints the same vector as above
print("Embedding for first token (first 10 dimensions):", word_embeddings[0, :10].numpy())

# Example: Cosine similarity between [CLS] embedding and first token embedding.
# Both refer to position 0, so the result is trivially 1.0; use word_embeddings[1:2]
# to compare [CLS] with the first word piece instead.
cosine_sim = torch.cosine_similarity(cls_embedding, word_embeddings[0:1], dim=1)
print("\nCosine similarity between [CLS] and first token:", round(cosine_sim.item(), 4))

Input text: Apple develops innovative smartphones and Google focuses on cloud services.

[CLS] token embedding shape: torch.Size([1, 768])
[CLS] embedding (first 10 dimensions): [-0.73555124 -0.2897195  -0.27042347 -0.12382691 -0.2886039   0.2151361
  0.24391133  0.9900225  -0.3722815  -0.15357335]

Number of tokens: 14
Embedding for first token (first 10 dimensions): [-0.73555124 -0.2897195  -0.27042347 -0.12382691 -0.2886039   0.2151361
  0.24391133  0.9900225  -0.3722815  -0.15357335]

Cosine similarity between [CLS] and first token: 1.0
