
In [1]:
import warnings
warnings.filterwarnings('ignore')

# Word Embeddings
## Word2Vec

In [4]:
# First, you'll need to install gensim
!pip install gensim

# Import the necessary modules

from gensim.test.utils import common_texts

from gensim.models import Word2Vec

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [None]:
print(common_texts) #Sample Data

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]


Word2Vec accepts several parameters that affect both training speed and quality.

One of them prunes the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there is not enough data to train meaningfully on those words, so it is best to ignore them:

model = Word2Vec(sentences, min_count=10)  # default value is 5

A reasonable value for min_count is between 0 and 100, depending on the size of your dataset.
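To make the effect of min_count concrete, here is a minimal pure-Python sketch of frequency-based pruning on the sample corpus printed above. The helper `prune_vocab` is a hypothetical name used only for illustration; it is not part of Gensim and does not reproduce Gensim's internal vocabulary code. It only assumes that words occurring fewer than min_count times are dropped.

```python
from collections import Counter

# Same toy corpus as common_texts, copied from the output above.
corpus = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]


def prune_vocab(sentences, min_count):
    """Keep only words seen at least min_count times (illustrative helper, not a Gensim function)."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


kept = prune_vocab(corpus, min_count=3)
print(sorted(kept.items()))
# [('graph', 3), ('system', 4), ('trees', 3), ('user', 3)]
```

On a corpus this small, even min_count=3 removes most of the vocabulary, which is why the training cell in this notebook uses min_count=1.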

Another parameter is the size of the NN layers, which corresponds to the “degrees” of freedom the training algorithm has:

model = Word2Vec(sentences, vector_size=200)  # default value is 100

Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

Other hyper-parameters:

*   window: Maximum distance between the target word and the context words used to predict it (window=window_size).

*   sample: The threshold for configuring which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).

*   workers: Use this many worker threads to train the model (faster training on multicore machines).

*   sg: Training algorithm: skip-gram if sg=1, otherwise CBOW.

*   epochs: Number of iterations (epochs) over the corpus (named iter in Gensim versions before 4.0).
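The context window is easier to picture with a small example. The sketch below enumerates the (target, context) pairs that a fixed window would produce for one sentence of the sample corpus. `context_pairs` is a hypothetical helper written for this illustration; Gensim's actual training loop is implemented differently and may also sample a smaller effective window for each target word.

```python
def context_pairs(sentence, window):
    """Yield (target, context) pairs for words at most `window` positions apart."""
    for i, target in enumerate(sentence):
        start = max(0, i - window)                  # clip at the start of the sentence
        end = min(len(sentence), i + window + 1)    # clip at the end of the sentence
        for j in range(start, end):
            if j != i:                              # a word is not its own context
                yield target, sentence[j]


sentence = ['survey', 'user', 'computer', 'system', 'response', 'time']
pairs = list(context_pairs(sentence, window=1))
print(len(pairs))   # 10
print(pairs[:3])    # [('survey', 'user'), ('user', 'survey'), ('user', 'computer')]
```

A larger window adds pairs between words that are further apart, which tends to capture broader topical context at the cost of more training work per sentence.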


In [5]:
model = Word2Vec(sentences=common_texts, vector_size=10, window=5, min_count=1, workers=4)
# vector_size=10 sets the length of each embedding vector
model.save("word2vec.model")

If you save the model you can continue training it later:

In [6]:
# load the saved model
model = Word2Vec.load("word2vec.model")
# model.train([["hello", "world"]], total_examples=1, epochs=1)

The trained word vectors are stored in a KeyedVectors instance, as model.wv:

In [7]:
# Get the embeddings for the word 'human'
embedding = model.wv['human']

print(embedding)
print(len(embedding))

[-0.00410223 -0.08368949 -0.05600012  0.07104538  0.0335254   0.0722567
  0.06800248  0.07530741 -0.03789154 -0.00561806]
10


In [8]:
# Get the most similar words (those with the most similar embeddings)
similar_words = model.wv.most_similar('human', topn=3)  # topn=3 returns the three most similar words
print(similar_words)

[('graph', 0.3586882948875427), ('system', 0.22743132710456848), ('time', 0.1153423935174942)]


In [9]:
# Store just the words + their trained embeddings.
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors")

In [10]:
# Load back with memory-mapping = read-only, shared across processes.
from gensim.models import KeyedVectors
wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')
wv['computer']  # Get numpy vector embedding for 'computer'

array([ 0.0163195 ,  0.00189972,  0.03474648,  0.00217841,  0.09621626,
        0.05062076, -0.08919986, -0.0704361 ,  0.00901718,  0.06394394],
      dtype=float32)
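The scores returned by most_similar earlier are cosine similarities between embedding vectors. As a minimal sketch of that computation, the snippet below applies the cosine formula to the 'human' and 'computer' vectors printed in this notebook, using only the standard library. The values are copied from the outputs above, so the result is only as precise as the printed digits, and vectors from a fresh training run will differ.

```python
import math

# Vectors copied from the notebook outputs above (vector_size=10).
human = [-0.00410223, -0.08368949, -0.05600012, 0.07104538, 0.0335254,
         0.0722567, 0.06800248, 0.07530741, -0.03789154, -0.00561806]
computer = [0.0163195, 0.00189972, 0.03474648, 0.00217841, 0.09621626,
            0.05062076, -0.08919986, -0.0704361, 0.00901718, 0.06394394]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim = cosine(human, computer)
print(sim)
```

With the loaded model, `wv.similarity('human', 'computer')` should report approximately the same number.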

### Refer to the link below for more details:
https://radimrehurek.com/gensim/models/word2vec.html

# Gensim provides several pre-trained models in the gensim-data repository

In [11]:
import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [12]:
# Download the "glove-twitter-25" embeddings
# Pre-trained glove vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased.
glove_vectors = gensim.downloader.load('glove-twitter-25')
glove_vectors



<gensim.models.keyedvectors.KeyedVectors at 0x7c0a7039ca10>

In [13]:
# Use the downloaded vectors as usual:
glove_vectors.most_similar('twitter')

[('facebook', 0.948005199432373),
 ('tweet', 0.9403423070907593),
 ('fb', 0.9342358708381653),
 ('instagram', 0.9104824066162109),
 ('chat', 0.8964964747428894),
 ('hashtag', 0.8885937333106995),
 ('tweets', 0.8878158330917358),
 ('tl', 0.8778461217880249),
 ('link', 0.8778210878372192),
 ('internet', 0.8753897547721863)]

# Document/Sentence Embeddings
Paragraph, Sentence, and Document embeddings

## Doc2vec

In [15]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Define your sentences (example)
sentences = ["this is the first sentence", "this is the second sentence", "yet another sentence", "one more sentence", "and the final sentence"]

# Tag the sentences for training
tagged_data = [TaggedDocument(words=sentence.split(), tags=[str(i)]) for i, sentence in enumerate(sentences)]

# Train the model
model = Doc2Vec(tagged_data, vector_size=10, window=2, min_count=1, workers=4)

# Get the embeddings for the sentences
sentence_vectors = [model.infer_vector(sentence.split()) for sentence in sentences]
# infer_vector expects a list of tokens (for example, the output of nltk.word_tokenize())

print("Sentence Embeddings:")
print(sentence_vectors) #Embeddings of the sentences

import numpy as np
print("\nShape:")
print(np.array(sentence_vectors).shape)

Sentence Embeddings:
[array([ 0.04198869,  0.01253986, -0.04709351, -0.03361204,  0.03727632,
        0.0360917 , -0.04976037, -0.01924757,  0.00466516, -0.04645238],
      dtype=float32), array([ 0.03795011,  0.04658079,  0.00688064,  0.02874659, -0.01405975,
        0.03965318,  0.04835472, -0.03572603, -0.01147622,  0.00996721],
      dtype=float32), array([-0.03657199, -0.04226479, -0.00826058, -0.04235109,  0.04691093,
       -0.03137154,  0.01618799,  0.03856817,  0.04027783,  0.01325311],
      dtype=float32), array([ 0.00235717, -0.01192741,  0.04444199, -0.02232105, -0.01424887,
       -0.02587563, -0.02123747, -0.00106649, -0.00531701, -0.01841368],
      dtype=float32), array([ 0.02992814,  0.02777706, -0.00993294, -0.03367668, -0.03010589,
       -0.001131  , -0.04741756,  0.01215397,  0.03172598, -0.04326031],
      dtype=float32)]

Shape:
(5, 10)


In [16]:
sentence_vectors[0] #the first embedding

array([ 0.04198869,  0.01253986, -0.04709351, -0.03361204,  0.03727632,
        0.0360917 , -0.04976037, -0.01924757,  0.00466516, -0.04645238],
      dtype=float32)

In [17]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(sentence_vectors[1].reshape(1,-1),sentence_vectors[2].reshape(1,-1))[0][0]
#Cosine similarity between embeddings

np.float32(-0.6857255)

In [18]:
# Find the similarity between all the sentences
similarity = cosine_similarity(sentence_vectors)
similarity

array([[ 1.        , -0.03847909, -0.13069807, -0.12699436,  0.5823946 ],
       [-0.03847909,  0.99999994, -0.68572557, -0.4150059 , -0.18280153],
       [-0.13069807, -0.68572557,  1.        ,  0.04168017, -0.16783181],
       [-0.12699436, -0.4150059 ,  0.04168017,  0.99999994,  0.3402756 ],
       [ 0.5823946 , -0.18280153, -0.16783181,  0.3402756 ,  1.0000001 ]],
      dtype=float32)
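Each row of this matrix holds the similarities between one sentence and all the others, so the nearest neighbour of a sentence is the largest off-diagonal entry in its row. The sketch below reads that off for every sentence, using the matrix values copied from the output above. `nearest_neighbours` is a hypothetical helper for illustration. Because infer_vector is randomized and the corpus is tiny, a fresh run will likely give different numbers, and the pairings here should not be read as meaningful.

```python
# Similarity matrix copied from the output above.
similarity = [
    [1.0, -0.03847909, -0.13069807, -0.12699436, 0.5823946],
    [-0.03847909, 0.99999994, -0.68572557, -0.4150059, -0.18280153],
    [-0.13069807, -0.68572557, 1.0, 0.04168017, -0.16783181],
    [-0.12699436, -0.4150059, 0.04168017, 0.99999994, 0.3402756],
    [0.5823946, -0.18280153, -0.16783181, 0.3402756, 1.0000001],
]
sentences = ["this is the first sentence", "this is the second sentence",
             "yet another sentence", "one more sentence", "and the final sentence"]


def nearest_neighbours(matrix):
    """For each row, return the column index of the largest off-diagonal value."""
    nearest = []
    for i, row in enumerate(matrix):
        candidates = [j for j in range(len(row)) if j != i]  # skip the sentence itself
        nearest.append(max(candidates, key=lambda j: row[j]))
    return nearest


nearest = nearest_neighbours(similarity)
for i, j in enumerate(nearest):
    print(f"{sentences[i]!r} -> {sentences[j]!r} ({similarity[i][j]:.3f})")
```

For these values the nearest-neighbour indices are [4, 0, 3, 4, 0].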

In [19]:
# Find the sentence most similar to the first sentence (index 0)
ind = 0  # Index of the query sentence
best_sim = -1  # Highest cosine similarity found so far
s_ind = None  # Index of the most similar sentence
print("Input Sentence -->", sentences[ind])
for i in range(len(sentence_vectors)):
    if i != ind:
        # Compute the similarity once and reuse it
        sim = cosine_similarity(sentence_vectors[i].reshape(1, -1), sentence_vectors[ind].reshape(1, -1))[0][0]
        if sim > best_sim:
            best_sim = sim
            s_ind = i

print("Most Similar Sentence -->", sentences[s_ind])
print("Cosine Similarity:", best_sim)

Input Sentence --> this is the first sentence
Most Similar Sentence --> and the final sentence
Cosine Similarity: 0.58239454


#### More about Doc2vec here:
https://radimrehurek.com/gensim/models/doc2vec.html