# Word2Vec
Word2Vec is a popular technique in natural language processing (NLP) used to convert words into dense vector representations, also known as word embeddings. These word embeddings capture semantic relationships between words, making them valuable in various NLP tasks like sentiment analysis, text classification, machine translation, and more. There are two main architectures for Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram.

### Continuous Bag of Words (CBOW):
In the CBOW architecture, the model predicts the current (target) word from its context. The "context" is the set of surrounding words within a fixed window size. The model takes the context words as an unordered bag, so their order is ignored, and predicts the target word.
CBOW trains faster than Skip-gram and is generally reported to do slightly better on frequent words.

Here's how CBOW works:

Input Layer: It consists of the context words encoded as one-hot vectors.

Projection Layer: It maps one-hot encoded vectors to distributed representation vectors (word embeddings). This layer is often referred to as the embedding layer.

Hidden Layer: Word2Vec has no separate non-linear hidden layer. The embeddings of the context words are simply averaged (or summed) into a single vector.

Output Layer: It produces a probability distribution over the vocabulary. The target word is predicted based on this distribution.
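The layers above can be sketched as a single forward pass. This is a minimal illustration in plain Python, not Gensim's implementation: the names `cbow_forward`, `W_in`, and `W_out` are made up for the example, the weights are random and untrained, and a full softmax is used for clarity, whereas practical Word2Vec training relies on hierarchical softmax or negative sampling. Multiplying a one-hot vector by the embedding matrix is the same as selecting one row, so the sketch uses direct row lookups.

```python
import math
import random

def cbow_forward(context_ids, W_in, W_out):
    """Return a probability distribution over the vocabulary for one CBOW example."""
    dim = len(W_in[0])
    # Projection: look up each context word's embedding and average them.
    hidden = [sum(W_in[i][d] for i in context_ids) / len(context_ids)
              for d in range(dim)]
    # Output: one score per vocabulary word, followed by a softmax.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
vocab = ["the", "cat", "sits", "on", "mat"]
dim = 3  # toy embedding size (assumption for the example)
W_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

context = [vocab.index(w) for w in ["the", "cat", "on", "the"]]
probs = cbow_forward(context, W_in, W_out)
print(dict(zip(vocab, (round(p, 3) for p in probs))))
```

With untrained weights the distribution is close to uniform; training adjusts `W_in` and `W_out` so that the true target word receives a higher probability.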

For example, consider the sentence "The cat sits on the mat." If we set a window size of 2, and we want to predict the target word "sits", the context words would be "The", "cat", "on", and "the". The CBOW model would try to predict "sits" given these context words.
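The windowing in this example can be reproduced with a few lines of Python. The helper name `cbow_pairs` is an assumption for illustration, and the sentence is split by hand rather than with a tokenizer.

```python
def cbow_pairs(tokens, window):
    """Yield (context_words, target_word) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        yield left + right, target

tokens = ["The", "cat", "sits", "on", "the", "mat"]
pairs = list(cbow_pairs(tokens, window=2))
for context, target in pairs:
    print(context, "->", target)
```

Words near the start or end of the sentence get fewer context words, because the window is cut off at the sentence boundary.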

### Skip-gram:
Skip-gram is the reverse of CBOW: it predicts the surrounding context words given a target word. The model learns to maximize the probability of the context words given the target word. Skip-gram is slower to train than CBOW, but it is generally reported to work well even with smaller amounts of training data, to represent rare words better, and to be more effective at capturing semantic relationships between words.

Input Layer: It consists of the target word encoded as a one-hot vector.

Projection Layer: Similar to CBOW, it maps one-hot encoded vectors to distributed representation vectors (word embeddings).

Hidden Layer: Word2Vec applies no non-linearity here. The target word's embedding is passed directly to the output layer.

Output Layer: Produces a probability distribution over the vocabulary. Each output neuron corresponds to a word in the vocabulary, and the model aims to predict the probability of each word being a context word given the target word.
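Maximizing the probability of context words given the target word can be written as an objective function. The notation below follows the original Word2Vec formulation and does not appear elsewhere in this notebook: $T$ is the number of words in the corpus, $c$ is the window size, $V$ is the vocabulary size, and $v_w$ and $v'_w$ are the input and output vectors of word $w$.

```latex
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```

The denominator sums over the whole vocabulary, which is expensive for large $V$; this is why practical implementations approximate it with hierarchical softmax or negative sampling.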

Using the same example sentence, if the target word is "sits", Skip-gram would try to predict "The", "cat", "on", and "the" as context words.
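For Skip-gram, each target word produces one training pair per context word. A small sketch of that pairing, with the hypothetical helper name `skipgram_pairs`:

```python
def skipgram_pairs(tokens, window):
    """Yield (target_word, context_word) pairs for every position in tokens."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

tokens = ["The", "cat", "sits", "on", "the", "mat"]
pairs = list(skipgram_pairs(tokens, window=2))
sits_pairs = [pair for pair in pairs if pair[0] == "sits"]
print(sits_pairs)
# [('sits', 'The'), ('sits', 'cat'), ('sits', 'on'), ('sits', 'the')]
```

Compared with CBOW, the same sentence yields more training examples (18 pairs here instead of 6), which is one reason Skip-gram is slower to train.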

### Advantages:

Semantic Similarity: Word embeddings capture semantic relationships, allowing for measuring similarity between words.

Dimensionality Reduction: Compared with one-hot vectors, whose length equals the vocabulary size, word embeddings use far fewer dimensions (typically a few hundred) while preserving important semantic information.

Efficiency: Once trained, a Word2Vec model returns the embedding of any word in its vocabulary through a simple table lookup, so it remains fast even for large vocabularies.
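The semantic similarity mentioned in the list above is usually measured with cosine similarity, which is also what Gensim's `similarity` method computes. A minimal sketch in plain Python, using short made-up vectors rather than real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (range -1 to 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
print(same_direction, orthogonal, opposite)
```

Because the vectors are normalized by their lengths, only their direction matters, not their magnitude.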

### Disadvantages:

Requires Large Corpus: Word2Vec requires a large corpus of text data for training to learn meaningful embeddings.

Out-of-vocabulary Words: Words not seen during training will not have embeddings, causing issues for rare or misspelled words.

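Contextual Information: Word2Vec assigns each word a single static vector regardless of the sentence it appears in, so different senses of a word such as "bank" share one embedding. This can limit its effectiveness in tasks where context plays a crucial role.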

Fixed Context Window: The choice of context window size can impact the quality of embeddings, and there's no universal best window size for all tasks.
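For the out-of-vocabulary limitation listed above, one common workaround is to check for the word before looking it up and fall back to a default vector. The sketch below uses a plain dictionary of made-up vectors as a stand-in for a trained model; the helper name `lookup` and the zero-vector fallback are assumptions for illustration. Gensim's `KeyedVectors` also supports the `in` operator and raises a `KeyError` for unseen words, so the same pattern should carry over.

```python
def lookup(embeddings, word, fallback=None):
    """Return the vector for `word`, or `fallback` if the word is out of vocabulary."""
    if word in embeddings:
        return embeddings[word]
    return fallback

# Hypothetical toy embeddings standing in for a trained model.
embeddings = {"fox": [0.1, 0.3], "dog": [0.2, 0.1]}
zero_vector = [0.0, 0.0]

known = lookup(embeddings, "fox", zero_vector)
unknown = lookup(embeddings, "foxx", zero_vector)  # misspelled, not in the vocabulary
print(known, unknown)
```

A zero vector is only one possible fallback; skipping the word or using subword-based models are alternatives, depending on the task.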

### Usage:
Word2Vec is used in various NLP tasks such as sentiment analysis, document clustering, machine translation, and information retrieval. It enables tasks like synonym detection, analogy completion, and word similarity calculations.
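Analogy completion works by vector arithmetic: the answer to "man is to king as woman is to ?" is the word whose vector lies closest to `king - man + woman`. The sketch below shows the mechanics on hand-made two-dimensional vectors. These values are invented so that the arithmetic is easy to follow; they are not real Word2Vec embeddings, and the helper names `cosine` and `analogy` are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(embeddings, a, b, c):
    """Return the word closest to b - a + c, excluding the three query words."""
    query = [vb - va + vc
             for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))

# Hand-made toy vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "female".
toy = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}
result = analogy(toy, "man", "king", "woman")
print(result)  # queen
```

The query words themselves are excluded from the candidates, since otherwise one of them is often the nearest neighbour of the query vector.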

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import Word2Vec

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
text = "The quick brown fox jumps over the lazy dog."

In [None]:
# Tokenize the text using NLTK
tokens = word_tokenize(text.lower())

# Remove stopwords using NLTK
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]

In [None]:
# Train Word2Vec models using Gensim (one sentence only demonstrates the API; real training needs a large corpus)
# vector_size=100: each word is represented as a 100-dimensional vector
# window=5: maximum distance between the target word and a context word
# min_count=1: keep words that occur only once
model = Word2Vec([filtered_tokens], vector_size=100, window=5, min_count=1, sg=1)   # Skip-gram (sg=1)
model1 = Word2Vec([filtered_tokens], vector_size=100, window=5, min_count=1, sg=0)  # CBOW (sg=0)

In [None]:
# Get word embeddings
word_embeddings = model.wv
word_embeddings1 = model1.wv

In [None]:
# Access the embedding of 'fox' from the first model (sg=1, Skip-gram)
print(word_embeddings['fox'])

[-0.00713902  0.00124103 -0.00717672 -0.00224462  0.0037193   0.00583312
  0.00119818  0.00210273 -0.00411039  0.00722533 -0.00630704  0.00464722
 -0.00821997  0.00203647 -0.00497705 -0.00424769 -0.00310898  0.00565521
  0.0057984  -0.00497465  0.00077333 -0.00849578  0.00780981  0.00925729
 -0.00274233  0.00080022  0.00074665  0.00547788 -0.00860608  0.00058446
  0.00686942  0.00223159  0.00112468 -0.00932216  0.00848237 -0.00626413
 -0.00299237  0.00349379 -0.00077263  0.00141129  0.00178199 -0.0068289
 -0.00972481  0.00904058  0.00619805 -0.00691293  0.00340348  0.00020606
  0.00475375 -0.00711994  0.00402695  0.00434743  0.00995737 -0.00447374
 -0.00138926 -0.00731732 -0.00969783 -0.00908026 -0.00102275 -0.00650329
  0.00484973 -0.00616403  0.00251919  0.00073944 -0.00339215 -0.00097922
  0.00997913  0.00914589 -0.00446183  0.00908303 -0.00564176  0.00593092
 -0.00309722  0.00343175  0.00301723  0.00690046 -0.00237388  0.00877504
  0.00758943 -0.00954765 -0.00800821 -0.0076379   0.

In [None]:
# Access the embedding of 'fox' from the second model (sg=0, CBOW)
print(word_embeddings1['fox'])

[-0.00713902  0.00124103 -0.00717672 -0.00224462  0.0037193   0.00583312
  0.00119818  0.00210273 -0.00411039  0.00722533 -0.00630704  0.00464722
 -0.00821997  0.00203647 -0.00497705 -0.00424769 -0.00310898  0.00565521
  0.0057984  -0.00497465  0.00077333 -0.00849578  0.00780981  0.00925729
 -0.00274233  0.00080022  0.00074665  0.00547788 -0.00860608  0.00058446
  0.00686942  0.00223159  0.00112468 -0.00932216  0.00848237 -0.00626413
 -0.00299237  0.00349379 -0.00077263  0.00141129  0.00178199 -0.0068289
 -0.00972481  0.00904058  0.00619805 -0.00691293  0.00340348  0.00020606
  0.00475375 -0.00711994  0.00402695  0.00434743  0.00995737 -0.00447374
 -0.00138926 -0.00731732 -0.00969783 -0.00908026 -0.00102275 -0.00650329
  0.00484973 -0.00616403  0.00251919  0.00073944 -0.00339215 -0.00097922
  0.00997913  0.00914589 -0.00446183  0.00908303 -0.00564176  0.00593092
 -0.00309722  0.00343175  0.00301723  0.00690046 -0.00237388  0.00877504
  0.00758943 -0.00954765 -0.00800821 -0.0076379   0.

In [None]:
# Compute the cosine similarity between two words
similarity_score = word_embeddings.similarity('fox', 'dog')
print("Similarity between 'fox' and 'dog':", similarity_score)

Similarity between 'fox' and 'dog': 0.0045030187


### Pretrained Model

In [None]:
from gensim.models import KeyedVectors
import gensim.downloader as api

In [None]:
# Load the pretrained Google News Word2Vec vectors (300 dimensions); the first call downloads a large file (over 1 GB)
word2vec_model = api.load('word2vec-google-news-300')

In [None]:
# Get embedding for a word
word_embedding = word2vec_model['word']
print(word_embedding)
print(word_embedding.shape)

[ 3.59375000e-01  4.15039062e-02  9.03320312e-02  5.46875000e-02
 -1.47460938e-01  4.76074219e-02 -8.49609375e-02 -2.04101562e-01
  3.10546875e-01 -1.05590820e-02 -6.15234375e-02 -1.55273438e-01
 -1.52343750e-01  8.54492188e-02 -2.70996094e-02  3.84765625e-01
  4.78515625e-02  2.58789062e-02  4.49218750e-02 -2.79296875e-01
  9.09423828e-03  4.08203125e-01  2.40234375e-01 -3.06640625e-01
 -1.80664062e-01  4.73632812e-02 -2.63671875e-01  9.08203125e-02
  1.37695312e-01 -7.20977783e-04  2.67333984e-02  1.92382812e-01
 -2.29492188e-02  9.70458984e-03 -7.37304688e-02  4.29687500e-01
 -7.93457031e-03  1.06445312e-01  2.80761719e-02 -2.29492188e-01
 -1.91650391e-02 -2.36816406e-02  3.51562500e-02  1.71875000e-01
 -1.12304688e-01  6.25000000e-02 -1.69921875e-01  1.29882812e-01
 -1.54296875e-01  1.58203125e-01 -7.76367188e-02  1.78710938e-01
 -1.72851562e-01  9.96093750e-02  3.94531250e-01  6.44531250e-02
 -6.83593750e-02 -3.18359375e-01  5.95703125e-02 -1.02539062e-02
  9.37500000e-02  8.25195

In [None]:
# Get the most similar words; each entry is a (word, cosine similarity) tuple
similar_words = word2vec_model.most_similar('king')
for pair in similar_words:
    print(pair)

('kings', 0.7138045430183411)
('queen', 0.6510956883430481)
('monarch', 0.6413194537162781)
('crown_prince', 0.6204220056533813)
('prince', 0.6159993410110474)
('sultan', 0.5864824056625366)
('ruler', 0.5797567367553711)
('princes', 0.5646552443504333)
('Prince_Paras', 0.5432944297790527)
('throne', 0.5422105193138123)
