# What is Embedding?

Embedding converts categories into continuous vectors.

Embedding, mainly used in deep learning, maps each category (for example, a word or an item ID) to a dense vector of continuous numbers. These vectors capture semantic relationships between categories, so related categories end up close together, and they use far fewer dimensions than a one-hot encoding of the same categories. Pre-trained embeddings such as Word2Vec, GloVe, and FastText can be reused for tasks in natural language processing (NLP) and recommendation systems, while trainable embeddings are learned during model training for a specific task.

Advantages of embeddings include their compactness and their ability to capture semantic relationships, both of which can improve model performance. However, embeddings cost more to compute than simple encodings such as one-hot, because the vectors have to be learned, and their individual dimensions are harder to interpret.

The choice between encoding and embedding depends on the dataset size, number of categories, and specific problem requirements. One-hot encoding is suitable for small datasets with few categories, while embeddings are preferred for large datasets, especially in NLP tasks and recommendation systems.
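To make the contrast concrete, here is a minimal sketch in plain Python. The category names and the numbers in `embedding_table` are made up for illustration; in a real model the table would be learned during training. It shows that a one-hot vector grows with the number of categories, while an embedding has a fixed, smaller size, and that multiplying a one-hot vector by the table simply selects one row.

```python
# Toy comparison of one-hot encoding and an embedding lookup (illustrative values only).
categories = ["red", "green", "blue", "yellow"]
index = {name: i for i, name in enumerate(categories)}


def one_hot(name):
    """Return a vector with one slot per category: 1.0 at the category's index, 0.0 elsewhere."""
    vec = [0.0] * len(categories)
    vec[index[name]] = 1.0
    return vec


# Hypothetical 2-dimensional embedding table, one row per category.
# In a real model these numbers would be learned, not written by hand.
embedding_table = [
    [0.9, 0.1],  # red
    [0.1, 0.8],  # green
    [0.2, 0.3],  # blue
    [0.8, 0.5],  # yellow
]


def embed(name):
    """Look up the dense vector for a category."""
    return embedding_table[index[name]]


def one_hot_times_table(name):
    """Multiply the one-hot vector by the table; this picks out the same row as embed()."""
    oh = one_hot(name)
    dims = len(embedding_table[0])
    return [
        sum(oh[i] * embedding_table[i][d] for i in range(len(categories)))
        for d in range(dims)
    ]


print(one_hot("blue"))  # length equals the number of categories
print(embed("blue"))    # length equals the chosen embedding size
```

With four categories the difference is small, but with a vocabulary of tens of thousands of words the one-hot vector has tens of thousands of entries, while the embedding size stays whatever you chose.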

In [1]:
# The Gensim library provides tools and algorithms for topic modeling, 
# document similarity analysis, and other natural language processing (NLP) tasks.
!pip install gensim

# The name Gensim is short for "generate similar".


In [2]:
# Importing necessary libraries
import numpy as np
from gensim.models import Word2Vec

# Sample data
sentences = [['I', 'love', 'machine', 'learning'],
             ['machine', 'learning', 'is', 'awesome'],
             ['deep', 'learning', 'is', 'interesting']]

# Training Word2Vec model
model = Word2Vec(sentences, vector_size=5, window=3, min_count=1, sg=1)

# Getting embedding for a word
word_embedding = model.wv['machine']
print("Embedding for 'machine':", word_embedding)

# Getting embedding for a sentence
sentence_embedding = np.mean([model.wv[word] for word in sentences[0]], axis=0)
print("Embedding for sentence 'I love machine learning':", sentence_embedding)


Embedding for 'machine': [ 0.1476101  -0.03066943 -0.09073226  0.13108103 -0.09720321]
Embedding for sentence 'I love machine learning': [-0.01302523  0.02925177  0.0208698   0.0414466  -0.10625307]
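Once words and sentences are vectors, they are usually compared with cosine similarity. The sketch below is a plain-Python version that reuses the two vectors printed above; the function name `cosine_similarity` is chosen here for illustration. With a three-sentence corpus the resulting number carries little meaning, so treat it only as a demonstration of the calculation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors copied from the output printed above.
machine_vec = [0.1476101, -0.03066943, -0.09073226, 0.13108103, -0.09720321]
sentence_vec = [-0.01302523, 0.02925177, 0.0208698, 0.0414466, -0.10625307]

# Values near 1 mean the vectors point the same way, values near -1 mean opposite directions.
print(cosine_similarity(machine_vec, sentence_vec))
```

For two words in the trained vocabulary, Gensim offers the same measure directly through `model.wv.similarity(word1, word2)`.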


In [3]:
# In the example above the words were already tokenized, which will not always be the case.
# For raw text you can use a tokenizer library such as NLTK (Natural Language Toolkit).
!pip install nltk


In [5]:
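# 'punkt' is a pre-trained sentence tokenizer model that NLTK's tokenizers rely on.
# It is not included when you install NLTK, so you need to download it separately
# with the NLTK downloader.
# Note: newer NLTK releases may ask for the 'punkt_tab' resource instead. If word_tokenize
# raises a LookupError, download the resource named in that error message.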

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mohanasudhangandhi/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Sample text corpus
text_corpus = "Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence."

# Tokenizing text
tokens = word_tokenize(text_corpus.lower())  # Convert to lowercase for consistency

# Training Word2Vec model
# vector_size - the dimensionality of each word vector.

# window - the maximum distance between the target word and a context word,
# i.e. how many words before and after the target are considered during training.
# In this example, window=5 means up to 5 words on each side of the target.

# min_count - words that occur fewer than min_count times are dropped from the vocabulary.
# In this example, min_count=1 keeps every word that appears at least once in the corpus.

# sg - the training algorithm: sg=0 for Continuous Bag of Words (CBOW), sg=1 for Skip-gram.
# In this example, sg=1 selects Skip-gram.
model = Word2Vec([tokens], vector_size=10, window=5, min_count=1, sg=1)

# Accessing word embeddings
word_embeddings = model.wv

# Displaying word embeddings for individual words
print("Word Embeddings:")
for word in tokens:
    print(f"{word}: {word_embeddings[word]}")

Word Embeddings:
machine: [ 0.00278317  0.04964359  0.07698309 -0.01144223  0.04323421 -0.05814379
 -0.00804191  0.08100051 -0.02360065 -0.09663455]
learning: [-0.037098   -0.08745642  0.05437467  0.06509756 -0.0078755  -0.06709856
 -0.07085925 -0.0249706   0.05143254 -0.03665238]
is: [-0.00547545  0.00242737  0.05115977  0.09006381 -0.09300154 -0.07114856
  0.06470766  0.0896343  -0.05010227 -0.03771948]
the: [-0.08531428  0.03201014 -0.04636526 -0.0510409   0.03596911  0.05383289
  0.07779722 -0.05787721  0.07406821  0.06628729]
scientific: [ 0.07817571 -0.09510187 -0.00205531  0.03469197 -0.00938972  0.08381772
  0.09010784  0.06536506 -0.00711621  0.07710405]
study: [ 0.07913878 -0.07002877 -0.09149235 -0.00370559 -0.0311619   0.07894032
  0.05949505 -0.01549919  0.01477904  0.01776185]
of: [ 0.07299155  0.05040807  0.06833922  0.00699987  0.06393988 -0.03357086
 -0.00889996  0.05728829 -0.0760235  -0.03941352]
algorithms: [ 0.02343239 -0.04517407  0.08408428 -0.09853405  0.0676281
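The effect of `min_count` can be pictured as a simple frequency filter over the tokens. The sketch below is a simplified stand-in for that step, not Gensim's actual implementation; the helper `build_vocab` and the sample sentence are made up for illustration.

```python
from collections import Counter


def build_vocab(tokens, min_count):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}


# Hypothetical sample text: 'learning', 'is' and 'fun' occur twice, every other word once.
tokens = "machine learning is fun and deep learning is fun too".split()

print(sorted(build_vocab(tokens, min_count=1)))  # every distinct word is kept
print(sorted(build_vocab(tokens, min_count=2)))  # only words seen at least twice remain
```

On a tiny corpus like the ones in this notebook, `min_count=1` is needed because almost every word appears only once; on large corpora a higher value is a common way to drop rare, noisy words.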

# Skip-gram (sg) vs. CBOW

Imagine you're trying to learn about a topic by asking your friends questions. The Skip-gram model operates similarly to this scenario.

In the Skip-gram model:

- You are like the target word you're trying to learn more about.
- Your friends' responses are like the context words surrounding the target word.
- By asking your friends different questions (providing different target words), you can learn about the relationships between your target word and the words that often appear around it.

So, in a sense, the Skip-gram model learns by predicting the context words given a target word. It tries to understand the meaning of a word by observing the words that tend to appear nearby in a sentence or text corpus.

In contrast, the Continuous Bag of Words (CBOW) model works in the opposite way. Instead of predicting context words given a target word, it predicts a target word given a set of context words. You can think of CBOW as trying to guess the missing word in a sentence given the words surrounding it.

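Overall, Skip-gram learns by using a target word to predict its context, while CBOW learns by using the context to predict the target word. Both approaches have their advantages: Skip-gram is often reported to handle rare words and small corpora better, while CBOW is usually faster to train, so the better choice depends on the text data and the task.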
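The difference in prediction direction is easiest to see in the training examples each model is built from. The following sketch generates (input, output) pairs for both; the helper names are made up for illustration, and real Word2Vec training adds further details (such as negative sampling and subsampling of frequent words) that are left out here.

```python
def context_window(tokens, i, window):
    """Words within `window` positions to the left and right of position i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_pairs(tokens, window):
    """Skip-gram: the input is the target word, the output is one context word."""
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in context_window(tokens, i, window)
    ]


def cbow_pairs(tokens, window):
    """CBOW: the input is the list of context words, the output is the target word."""
    return [
        (context_window(tokens, i, window), tokens[i])
        for i in range(len(tokens))
    ]


sentence = ["deep", "learning", "is", "interesting"]

print("Skip-gram pairs (target -> context):")
for pair in skipgram_pairs(sentence, window=1):
    print(pair)

print("CBOW pairs (context -> target):")
for pair in cbow_pairs(sentence, window=1):
    print(pair)
```

Skip-gram produces one training example per (target, context word) combination, while CBOW produces one example per position in the text.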