### Word embedding
Word embedding, sometimes described as turning a "word to vector", is a technique in Natural Language Processing (NLP) that maps words into a continuous vector space where semantically similar words are located close to each other. Several algorithms learn such word vectors, including:

1. Word2Vec: CBOW and Skip-Gram
2. GloVe: global word co-occurrence statistics
3. fastText: extends Word2Vec by considering subword information
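
To make "located close to each other" concrete: closeness between word vectors is commonly measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors that are invented purely for illustration (they are not learned by any model), and the helper name `cosine_similarity` is an assumption of this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only (not learned embeddings).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

# Related words should score higher than unrelated ones.
print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

A score near 1 means the vectors point in almost the same direction, while a score near 0 means they are nearly orthogonal.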

We will look at the most popular of these, Word2Vec. Word2Vec has two main architectures:

1. <b>Continuous Bag-of-Words (CBOW):</b> Predicts the target word from its surrounding context words. The input layer holds the context words and the output layer holds the target (current) word. The size of the hidden layer is the number of dimensions in which we want to represent each word.

2. <b>Skip-Gram:</b> Predicts the context words within a given window from the target word. The input layer holds the target (current) word and the output layer holds the context words. The size of the hidden layer is again the number of dimensions in which we want to represent each word.
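
The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The following is a simplified sketch, not how any library implements it: the helper names are hypothetical, and real Word2Vec training adds details such as randomly shrinking the window and subsampling frequent words, which are omitted here.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """CBOW: one (context_words, target_word) example per position."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window):
    """Skip-Gram: one (target_word, context_word) pair per context word."""
    return [(tokens[i], context_word)
            for i in range(len(tokens))
            for context_word in context_window(tokens, i, window)]

sentence = ['this', 'is', 'a', 'sample', 'sentence']
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)

print(cbow[2])       # (['is', 'sample'], 'a')
print(skipgram[:3])  # [('this', 'is'), ('is', 'this'), ('is', 'a')]
```

CBOW produces one example per word with all context words as input, while Skip-Gram produces one example per (target, context) pair, so it generates more training examples from the same text.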

Let's load the required libraries to implement this. I created a new local environment (.venv_gensim) because of version conflicts with numpy.

In [1]:
from gensim.models import Word2Vec
import gensim

In [3]:
# Toy corpus: a list of already-tokenized sentences
# (each sentence is a list of word tokens)
sentences = [
    ['this', 'is', 'a', 'sample', 'sentence'],
    ['another', 'sentence', 'for', 'word', 'embeddings'],
    ['word', 'embeddings', 'are', 'powerful']
]


In [4]:

# Initialize the Word2Vec model
# 'sentences': The corpus to train on.
# 'vector_size': Dimension of the word vectors.
# 'window': Maximum distance between the current and predicted word within a sentence.
# 'min_count': Ignores all words with total frequency lower than this.
# 'sg': 0 for CBOW, 1 for Skip-gram.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# You can also call train() explicitly, for example when data is loaded incrementally
# model.train(sentences, total_examples=len(sentences), epochs=10)

Now let's access the vector for a specific word.


In [5]:
word_vector = model.wv["word"]
print(f"Vector for 'word': {word_vector}")

Vector for 'word': [-8.6196875e-03  3.6657380e-03  5.1898835e-03  5.7419385e-03
  7.4669183e-03 -6.1676754e-03  1.1056137e-03  6.0472824e-03
 -2.8400505e-03 -6.1735227e-03 -4.1022300e-04 -8.3689485e-03
 -5.6000124e-03  7.1045388e-03  3.3525396e-03  7.2256695e-03
  6.8002474e-03  7.5307419e-03 -3.7891543e-03 -5.6180597e-04
  2.3483764e-03 -4.5190323e-03  8.3887316e-03 -9.8581640e-03
  6.7646410e-03  2.9144168e-03 -4.9328315e-03  4.3981876e-03
 -1.7395747e-03  6.7113843e-03  9.9648498e-03 -4.3624435e-03
 -5.9933780e-04 -5.6956373e-03  3.8508223e-03  2.7866268e-03
  6.8910765e-03  6.1010956e-03  9.5384968e-03  9.2734173e-03
  7.8980681e-03 -6.9895042e-03 -9.1558648e-03 -3.5575271e-04
 -3.0998408e-03  7.8943167e-03  5.9385742e-03 -1.5456629e-03
  1.5109634e-03  1.7900408e-03  7.8175711e-03 -9.5101865e-03
 -2.0553112e-04  3.4691966e-03 -9.3897223e-04  8.3817719e-03
  9.0107834e-03  6.5365066e-03 -7.1162102e-04  7.7104042e-03
 -8.5343346e-03  3.2071066e-03 -4.6379971e-03 -5.0889552e-03
  3.5

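Let's find the words most similar to a given word. With a corpus of only three sentences, the similarity scores are mostly close to zero and not meaningful; useful neighbors require a much larger corpus.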

In [6]:
similar_words = model.wv.most_similar('embeddings')
print(f"Words similar to 'embeddings': {similar_words}")

Words similar to 'embeddings': [('is', 0.21617142856121063), ('sample', 0.09291722625494003), ('this', 0.06285078823566437), ('a', 0.027057476341724396), ('another', 0.016134677454829216), ('word', -0.01083916611969471), ('are', -0.027750369161367416), ('sentence', -0.05234673246741295), ('for', -0.059876296669244766), ('powerful', -0.111670583486557)]


Save and load the model.

In [None]:
# model.save("word2vec.model")
# loaded_model = Word2Vec.load("word2vec.model")

https://medium.com/@dilip.voleti/classification-using-word2vec-b1d79d375381