### Word embedding
Word embedding, also loosely referred to as "word to vector", is a technique in Natural Language Processing (NLP) that maps words into a continuous vector space in which semantically similar words lie close to each other. Several algorithms produce such word vectors, including:

1. Word2Vec: CBOW and Skip-Gram
2. GloVe: learns from global word co-occurrence statistics
3. fastText: extends Word2Vec by considering subword information

We will look at Word2Vec, the most popular of these. Word2Vec has two main architectures:

1. <b>Continuous Bag-of-Words (CBOW):</b> Predicts the target word from its context words. The input layer holds the context words and the output layer holds the current (target) word. The hidden layer has as many units as the number of dimensions we want in the vector that represents the current word at the output layer.

2. <b>Skip-Gram:</b> Predicts the context words within a specific window, given the target word. The input layer holds the current word and the output layer holds the context words. The hidden layer has as many units as the number of dimensions we want in the vector that represents the current word at the input layer.
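To make the difference between the two architectures concrete, here is a minimal sketch of the training examples each one would see for a single sentence. The helper names `cbow_pairs` and `skipgram_pairs` are hypothetical and exist only for this illustration; real implementations add refinements on top (negative sampling, for example), so treat this as a picture of the idea rather than of what gensim does internally.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs: many context words -> one target."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs: one target -> each context word."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical toy sentence, already tokenized
tokens = ["nlp", "is", "related", "to", "linguistics"]
cbow = cbow_pairs(tokens, window=1)
skip = skipgram_pairs(tokens, window=1)

print(cbow[1])   # (['nlp', 'related'], 'is')
print(skip[:3])  # [('nlp', 'is'), ('is', 'nlp'), ('is', 'related')]
```

CBOW produces one example per position, with all neighbours bundled together, while Skip-Gram produces one example per (target, neighbour) pair.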

Let's load the required libraries to implement this. I created a new local environment (.venv_gensim) because of version conflicts with numpy.

In [26]:
from gensim.models import Word2Vec
import gensim

In [27]:
s = "Natural language processing is the processing of natural language information by a computer. The study of NLP, a subfield of computer science, is generally associated with artificial intelligence. NLP is related to information retrieval, knowledge representation, computational linguistics, and more broadly with linguistics. NLP is used by many applications that use language, such as text translation, voice recognition, text summarization and chatbots. You may have used some of these applications yourself, such as voice-operated GPS systems, digital assistants, speech-to-text software and customer service bots. NLP also helps businesses improve their efficiency, productivity and performance by simplifying complex tasks that involve language."


In [28]:
# Tokenized corpus: one list of lowercase tokens per sentence of `s`
# (punctuation removed, hyphenated words kept as single tokens)
sentences = [['natural', 'language', 'processing', 'is', 'the', 'processing', 'of', 'natural', 'language', 'information', 'by', 'a', 'computer'],
             ['the', 'study', 'of', 'nlp', 'a', 'subfield', 'of', 'computer', 'science', 'is', 'generally', 'associated', 'with', 'artificial', 'intelligence'],
             ['nlp', 'is', 'related', 'to', 'information', 'retrieval', 'knowledge', 'representation', 'computational', 'linguistics', 'and', 'more', 'broadly', 'with', 'linguistics'],
             ['nlp', 'is', 'used', 'by', 'many', 'applications', 'that', 'use', 'language', 'such', 'as', 'text', 'translation', 'voice', 'recognition', 'text', 'summarization', 'and', 'chatbots'],
             ['you', 'may', 'have', 'used', 'some', 'of', 'these', 'applications', 'yourself', 'such', 'as', 'voice-operated', 'gps', 'systems', 'digital', 'assistants', 'speech-to-text', 'software', 'and', 'customer', 'service', 'bots'],
             ['nlp', 'also', 'helps', 'businesses', 'improve', 'their', 'efficiency', 'productivity', 'and', 'performance', 'by', 'simplifying', 'complex', 'tasks', 'that', 'involve', 'language']
             ]
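The list above is written out by hand. As a rough sketch, a similar list of tokenized sentences could be produced from raw text with the standard `re` module. The function name `tokenize_sentences` and the two regular expressions are assumptions made for this example; the sentence split is naive and would break on abbreviations such as "e.g.".

```python
import re


def tokenize_sentences(text):
    """Split text into sentences, then each sentence into lowercase tokens."""
    # Naive sentence split: sentence-ending punctuation followed by whitespace or end of text
    raw_sentences = re.split(r"[.!?]+(?:\s+|$)", text)
    tokenized = []
    for sentence in raw_sentences:
        # Keep letters/digits, and keep hyphenated words such as "speech-to-text" whole
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())
        if tokens:
            tokenized.append(tokens)
    return tokenized


sample = ("NLP is used by many applications. "
          "You may have used voice-operated GPS systems, and speech-to-text software.")
sample_sentences = tokenize_sentences(sample)
print(sample_sentences)
```

For anything beyond a toy corpus, a dedicated tokenizer is usually a better choice than hand-written regular expressions.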


In [29]:

# Initialize the Word2Vec model
# 'sentences': The corpus to train on.
# 'vector_size': Dimension of the word vectors.
# 'window': Maximum distance between the current and predicted word within a sentence.
# 'min_count': Ignores all words with total frequency lower than this.
# 'sg': 0 for CBOW, 1 for Skip-gram.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# The model can also be trained explicitly, e.g. when the data is loaded incrementally
# model.train(sentences, total_examples=len(sentences), epochs=10)
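To see what `min_count` means in practice, here is a small pure-Python sketch of frequency-based vocabulary filtering. The function `filter_vocab` and the toy sentences are hypothetical and only illustrate the idea described in the comment above; this is not gensim's own vocabulary code.

```python
from collections import Counter


def filter_vocab(sentences, min_count=1):
    """Count every token and keep only those seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: freq for word, freq in counts.items() if freq >= min_count}


# Hypothetical toy corpus
toy_sentences = [
    ["nlp", "is", "used", "by", "many", "applications"],
    ["nlp", "is", "related", "to", "linguistics"],
    ["nlp", "also", "helps", "businesses"],
]

vocab_all = filter_vocab(toy_sentences, min_count=1)
vocab_frequent = filter_vocab(toy_sentences, min_count=2)

print(len(vocab_all))                   # 12
print(sorted(vocab_frequent.items()))   # [('is', 2), ('nlp', 3)]
```

With a corpus as small as ours, almost every word appears only once, which is why `min_count=1` is used here; a higher value would leave very few words to train on.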

Now let's access the vector for a specific word.


In [30]:
word_vector = model.wv["language"]
print(f"Vector for 'language': {word_vector}")

Vector for 'language': [-5.4003386e-04  2.4686972e-04  5.1032561e-03  9.0068132e-03
 -9.2969034e-03 -7.1429289e-03  6.4567956e-03  9.0040723e-03
 -5.0308416e-03 -3.7975647e-03  7.3822839e-03 -1.5640019e-03
 -4.5311842e-03  6.5645222e-03 -4.8464844e-03 -1.8124769e-03
  2.8850122e-03  9.9007005e-04 -8.2795341e-03 -9.4753997e-03
  7.3291594e-03  5.0678090e-03  6.7572161e-03  7.6526409e-04
  6.3389027e-03 -3.4001863e-03 -9.7024976e-04  5.7662958e-03
 -7.5205415e-03 -3.9463118e-03 -7.5025591e-03 -9.3113509e-04
  9.5465071e-03 -7.3318179e-03 -2.3344674e-03 -1.9194310e-03
  8.0814678e-03 -5.9280121e-03  3.4620538e-05 -4.7587682e-03
 -9.5906006e-03  4.9988711e-03 -8.7502012e-03 -4.3935957e-03
 -3.0840820e-05 -2.9497084e-04 -7.6713203e-03  9.5948167e-03
  4.9820058e-03  9.2406673e-03 -8.1495466e-03  4.4870791e-03
 -4.1473308e-03  8.2275335e-04  8.5188570e-03 -4.4735475e-03
  4.5282380e-03 -6.7842570e-03 -3.5477793e-03  9.4023868e-03
 -1.5790074e-03  3.2821283e-04 -4.1276081e-03 -7.6685771e-03
 -1.5

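Let's find the most similar words. With a corpus this small, the similarity scores are low and the neighbours are not semantically meaningful.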

In [37]:
word = "language"
similar_words = model.wv.most_similar(word)
print(f"Words similar to  {word}:\n {similar_words}")

Words similar to  language:
 [('the', 0.21926084160804749), ('applications', 0.21701940894126892), ('many', 0.20485423505306244), ('bots', 0.19553223252296448), ('you', 0.17234398424625397), ('complex', 0.17025862634181976), ('generally', 0.1519763022661209), ('study', 0.14263521134853363), ('voice', 0.14103567600250244), ('systems', 0.14074334502220154)]
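The scores in this output are cosine similarities between word vectors. As a minimal sketch of the ranking idea, the snippet below does the same thing in plain Python on made-up 2-dimensional vectors. The names `cosine_similarity`, `rank_similar` and `toy_vectors` are hypothetical, and the vectors are invented for illustration; they are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented toy vectors, chosen so the ranking is easy to follow
toy_vectors = {
    "language": [1.0, 0.0],
    "linguistics": [0.9, 0.1],
    "computer": [0.0, 1.0],
    "gps": [-1.0, 0.0],
}

ranking = rank_similar("language", toy_vectors)
print([w for w, _ in ranking])  # ['linguistics', 'computer', 'gps']
```

A vector pointing in nearly the same direction scores close to 1, an orthogonal one scores 0, and an opposite one scores -1.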


Let's save the model.

In [32]:
import os

In [33]:
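# Make sure the target directory exists before saving
os.makedirs(os.path.join(os.getcwd(), "models"), exist_ok=True)
model.save(os.path.join(os.getcwd(), "models", "word2vec.model"))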

Let's load the saved model and use it to find similar words.

In [35]:
loaded_model = Word2Vec.load(os.path.join(os.getcwd(), "models","word2vec.model"))

In [36]:
word = "language"
similar_words = loaded_model.wv.most_similar(word)
print(f"Words similar to  {word}:\n {similar_words}")

Words similar to  language:
 [('the', 0.21926084160804749), ('applications', 0.21701940894126892), ('many', 0.20485423505306244), ('bots', 0.19553223252296448), ('you', 0.17234398424625397), ('complex', 0.17025862634181976), ('generally', 0.1519763022661209), ('study', 0.14263521134853363), ('voice', 0.14103567600250244), ('systems', 0.14074334502220154)]
