Program 2
Program on word-to-vector embeddings and cosine similarity


Cosine similarity is a metric used to measure how similar two vectors are in a multi-dimensional space. It is particularly popular in natural language processing and information retrieval for comparing the similarity of documents or words based on their vector representations.
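For two vectors a and b, cosine similarity is the dot product divided by the product of the vector lengths: cos(a, b) = (a . b) / (|a| * |b|). The result lies between -1 (opposite directions) and 1 (same direction), with 0 for perpendicular vectors. The following is a minimal sketch of that formula in plain Python; the helper name cosine_sim and the toy vectors are illustrative assumptions, not part of the program below, which uses scikit-learn's cosine_similarity instead.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length sequences of numbers (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))      # a . b
    norm_a = math.sqrt(sum(x * x for x in a))   # |a|
    norm_b = math.sqrt(sum(y * y for y in b))   # |b|
    return dot / (norm_a * norm_b)

# Toy vectors chosen only to show the three reference cases.
same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])    # close to 1
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])        # close to 0
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])        # close to -1
print(same_direction, orthogonal, opposite)
```

Because only the direction of the vectors matters, scaling a vector by a positive constant leaves its cosine similarity with any other vector unchanged. The sketch does not guard against zero-length vectors, which would cause a division by zero.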


In [7]:
import pandas as pd
import spacy  # natural language processing (NLP) library
from sklearn.metrics.pairwise import cosine_similarity

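# Load spaCy's small English pipeline, "en_core_web_sm".
nlp=spacy.load('en_core_web_sm')
# Note: the small pipeline ships without static word vectors, so .vector falls back
# to spaCy's context-sensitive tensors (96 values here). For true pre-trained word
# vectors, use a larger pipeline such as "en_core_web_md" or "en_core_web_lg".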

terms=['I','like','apples','oranges','pears']

# Run each term through nlp() and read the resulting Doc's .vector attribute.
# .tolist() converts the NumPy array to a plain Python list.
vectors=[
    nlp(term).vector.tolist() for term in terms
]

# Look up the position of 'apples' in terms, take the matching vector,
# and wrap it in a named pandas Series.
x=pd.Series(vectors[terms.index('apples')]).rename('apples')
print("word vector for apples:\n ",x)  # print the Series holding the vector for 'apples'

# Build a DataFrame of pairwise cosine similarities, labelled by term on both
# axes and rounded to 3 decimal places.
abc=pd.DataFrame(
    cosine_similarity(vectors),
    index=terms,
    columns=terms
).round(3)

# Print the cosine similarity matrix stored in abc.
# Each cell shows how similar the row term is to the column term, based on their vectors.
print("\ncosine similarity matrix :\n",abc)



word vector for apples:
  0    -1.244198
1     0.849777
2    -0.847986
3     1.536878
4     1.451894
        ...   
91    0.708207
92    1.898031
93   -0.110315
94   -0.203278
95    0.668344
Name: apples, Length: 96, dtype: float64

cosine similarity matrix :
              I   like  apples  oranges  pears
I        1.000  0.163   0.376    0.168 -0.088
like     0.163  1.000   0.073   -0.006  0.217
apples   0.376  0.073   1.000    0.750  0.289
oranges  0.168 -0.006   0.750    1.000  0.584
pears   -0.088  0.217   0.289    0.584  1.000


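The printed Series is the 96-component vector for "apples": the value at index 0 is approximately -1.244198, and the value at index 1 is approximately 0.849777. Taken together, these values give the position of "apples" in a high-dimensional space. Individual components are generally not interpretable on their own; the meaning is carried by the vector as a whole, so words used in similar ways tend to point in similar directions. Word vectors capture semantic information in this numerical form, which lets algorithms compare and work with word meanings.

In the cosine similarity matrix, every diagonal entry is 1.000 because each vector is compared with itself, and the matrix is symmetric because the similarity of two terms does not depend on their order. Among distinct terms, "apples" and "oranges" score highest (0.750), followed by "oranges" and "pears" (0.584), while "like" and "oranges" are close to unrelated (-0.006).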
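To make the similarity matrix easier to read, the entries above the diagonal can be ranked from most to least similar. The sketch below is one way to do that in plain Python; the matrix values are copied by hand from the printed output above, and the names matrix, pairs and ranked_pairs are illustrative assumptions rather than part of the original program.

```python
from itertools import combinations

terms = ['I', 'like', 'apples', 'oranges', 'pears']

# Values copied from the cosine similarity matrix printed above.
matrix = [
    [1.000, 0.163, 0.376, 0.168, -0.088],
    [0.163, 1.000, 0.073, -0.006, 0.217],
    [0.376, 0.073, 1.000, 0.750, 0.289],
    [0.168, -0.006, 0.750, 1.000, 0.584],
    [-0.088, 0.217, 0.289, 0.584, 1.000],
]

# The matrix is symmetric, so each unordered pair (i < j) is visited once
# and the diagonal (a term compared with itself) is skipped.
pairs = [
    (terms[i], terms[j], matrix[i][j])
    for i, j in combinations(range(len(terms)), 2)
]

# Sort by similarity score, highest first.
ranked_pairs = sorted(pairs, key=lambda p: p[2], reverse=True)

for first, second, score in ranked_pairs[:3]:
    print(f"{first:8s}{second:8s}{score:6.3f}")
```

With five terms there are ten distinct pairs. In the original program the same ranking could be taken from the abc DataFrame directly; the hand-copied list is used here only so the sketch runs without loading the spaCy model.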