<a href="https://colab.research.google.com/github/niksom406/Learning_NLP/blob/main/Word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2Vec

**Cell 1: Install gensim**

This cell installs the `gensim` library, which provides tools for topic modeling and word embeddings.

In [2]:
%pip install gensim



**Cell 2: Import gensim**

This cell imports the `gensim` library.

In [3]:
import gensim

**Cell 3: Import specific modules from gensim**

This cell imports `Word2Vec` and `KeyedVectors` from the `gensim.models` module. `Word2Vec` is an algorithm for learning word embeddings, and `KeyedVectors` is a structure to store and query word vectors.

In [4]:
from gensim.models import Word2Vec, KeyedVectors

**Cell 4: Load a pre-trained Word2Vec model**

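This cell uses `gensim.downloader` to download and load Word2Vec vectors pre-trained on the Google News dataset. `api.load` returns a `KeyedVectors` object holding 300-dimensional vectors for about 3 million words and phrases, so the first call downloads a large file (roughly 1.6 GB) and caches it locally. The cell then looks up the vector for the word 'king'.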

In [5]:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

vec_king = wv['king']
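
Looking up a word that is not in the model's vocabulary with `wv[word]` raises a `KeyError`, so it can help to check membership first. The sketch below is a minimal illustration: the helper name `vector_or_none` is hypothetical, and a plain dictionary stands in for `wv` so that the example runs without the large download.

```python
def vector_or_none(vectors, word):
    """Return the vector for `word`, or None if it is out of vocabulary.

    `vectors` can be any object that supports `in` and `[]`,
    such as the `wv` object loaded above or a plain dict.
    """
    return vectors[word] if word in vectors else None


# Stand-in for `wv` (made-up values, not real embeddings).
toy_vectors = {"king": [0.1, 0.2], "queen": [0.1, 0.3]}

print(vector_or_none(toy_vectors, "king"))   # [0.1, 0.2]
print(vector_or_none(toy_vectors, "kingg"))  # None
```

With the real model, the same call would be `vector_or_none(wv, 'king')`.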



**Cell 5: Get the vector for 'human'**

This cell retrieves and displays the word vector for the word 'human' from the loaded Word2Vec model.

In [6]:
wv['human']

array([ 5.59082031e-02,  9.22851562e-02,  1.07910156e-01,  2.83203125e-01,
       -2.43164062e-01,  1.90429688e-02,  4.08203125e-01, -3.17382812e-02,
       -4.78515625e-02,  6.34765625e-02, -9.32617188e-02, -4.46777344e-02,
       -2.41210938e-01, -1.58203125e-01, -5.83496094e-02,  2.51953125e-01,
       -3.24707031e-02,  1.00097656e-01, -4.56542969e-02,  1.35742188e-01,
       -2.07031250e-01, -3.73046875e-01,  4.39453125e-02,  4.24804688e-02,
        6.93359375e-02, -2.42187500e-01, -2.75390625e-01,  1.95312500e-01,
        2.26562500e-01, -1.90429688e-01, -2.35351562e-01, -5.56640625e-02,
       -1.25000000e-01, -8.78906250e-02, -2.33398438e-01,  9.61914062e-02,
       -4.83398438e-02,  4.54101562e-02,  9.81445312e-02,  5.76171875e-02,
       -4.17480469e-02,  2.02148438e-01, -9.03320312e-02,  2.75390625e-01,
       -6.34765625e-02,  4.93164062e-02,  2.92968750e-02,  2.57812500e-01,
        1.32812500e-01,  7.42187500e-02,  6.64062500e-02, -1.37695312e-01,
       -1.73828125e-01,  

**Cell 6: Check the shape of the 'king' vector**

This cell displays the shape of the vector for the word 'king'. The result, `(300,)`, confirms that it is a one-dimensional array with 300 components.

In [7]:
vec_king.shape

(300,)

**Cell 7: Display the 'king' vector**

This cell displays the actual values of the 300-dimensional word vector for the word 'king'.

In [8]:
vec_king

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

**Cell 8: Find words most similar to 'football'**

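This cell uses the `most_similar` method to find the ten words in the vocabulary whose vectors have the highest cosine similarity to the vector for 'football'. Several results are misspellings or case variants of the query (for example 'fooball' and 'Football'), because the vocabulary is case-sensitive and keeps tokens as they appeared in the training corpus.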

In [9]:
wv.most_similar('football')

[('soccer', 0.731354832649231),
 ('fooball', 0.7139959335327148),
 ('Football', 0.7124834060668945),
 ('basketball', 0.668246865272522),
 ('footbal', 0.6649289727210999),
 ('athletics', 0.6265192627906799),
 ('gridiron', 0.6191604733467102),
 ('baseball', 0.6162001490592957),
 ('footballl', 0.6069177985191345),
 ('sports', 0.5927178859710693)]
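
The output above mixes genuinely related words with spelling variants of the query. One possible way to filter the variants out is to compare each result to the query as a string. This is only a sketch: the helper `drop_spelling_variants` and the `0.85` threshold are assumptions made for illustration, not part of gensim.

```python
from difflib import SequenceMatcher

# Output of wv.most_similar('football'), copied from the cell above.
neighbours = [
    ("soccer", 0.731354832649231),
    ("fooball", 0.7139959335327148),
    ("Football", 0.7124834060668945),
    ("basketball", 0.668246865272522),
    ("footbal", 0.6649289727210999),
    ("athletics", 0.6265192627906799),
    ("gridiron", 0.6191604733467102),
    ("baseball", 0.6162001490592957),
    ("footballl", 0.6069177985191345),
    ("sports", 0.5927178859710693),
]


def drop_spelling_variants(query, results, threshold=0.85):
    """Keep only results whose spelling is clearly different from the query."""
    kept = []
    for word, score in results:
        # Case-insensitive string similarity in [0, 1]; 1.0 means identical.
        ratio = SequenceMatcher(None, query.lower(), word.lower()).ratio()
        if ratio < threshold:
            kept.append((word, score))
    return kept


filtered = drop_spelling_variants("football", neighbours)
print([word for word, _ in filtered])
# ['soccer', 'basketball', 'athletics', 'gridiron', 'baseball', 'sports']
```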

**Cell 9: Find words most similar to 'happy'**

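This cell finds and displays the words most similar to 'happy'. Note that 'disappointed' appears in the list: words with opposite meanings often occur in similar contexts, so their vectors can end up close together.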

In [10]:
wv.most_similar('happy')

[('glad', 0.7408890724182129),
 ('pleased', 0.6632170677185059),
 ('ecstatic', 0.6626912355422974),
 ('overjoyed', 0.6599286794662476),
 ('thrilled', 0.6514049172401428),
 ('satisfied', 0.6437949538230896),
 ('proud', 0.636042058467865),
 ('delighted', 0.627237856388092),
 ('disappointed', 0.6269949674606323),
 ('excited', 0.6247665286064148)]

**Cell 10: Calculate similarity between 'hockey' and 'sports'**

This cell calculates the cosine similarity between the word vectors for 'hockey' and 'sports'. The score ranges from -1 to 1, and higher values indicate that the model saw the two words in more similar contexts.

In [11]:

wv.similarity("hockey", "sports")

0.53541523
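
The score returned by `similarity` is the cosine of the angle between the two vectors. A minimal sketch of that calculation follows, using short made-up vectors rather than the real 300-dimensional ones. The helper name `cosine_similarity` is illustrative and not part of gensim.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

With the loaded model, `cosine_similarity(wv['hockey'], wv['sports'])` should agree with `wv.similarity` up to floating-point rounding.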

**Cell 11: Perform vector arithmetic**

This cell performs vector arithmetic: it subtracts the vector for 'man' from the vector for 'king' and adds the vector for 'woman'. This operation is often used to demonstrate how word embeddings capture relationships between words (e.g., king - man + woman should be close to queen).

In [12]:
vec = wv['king'] - wv['man'] + wv['woman']

**Cell 12: Display the resulting vector**

This cell displays the values of the resulting vector from the arithmetic operation in the previous cell.

In [13]:

vec


array([ 4.29687500e-02, -1.78222656e-01, -1.29089355e-01,  1.15234375e-01,
        2.68554688e-03, -1.02294922e-01,  1.95800781e-01, -1.79504395e-01,
        1.95312500e-02,  4.09919739e-01, -3.68164062e-01, -3.96484375e-01,
       -1.56738281e-01,  1.46484375e-03, -9.30175781e-02, -1.16455078e-01,
       -5.51757812e-02, -1.07574463e-01,  7.91015625e-02,  1.98974609e-01,
        2.38525391e-01,  6.34002686e-02, -2.17285156e-02,  0.00000000e+00,
        4.72412109e-02, -2.17773438e-01, -3.44726562e-01,  6.37207031e-02,
        3.16406250e-01, -1.97631836e-01,  8.59375000e-02, -8.11767578e-02,
       -3.71093750e-02,  3.15551758e-01, -3.41796875e-01, -4.68750000e-02,
        9.76562500e-02,  8.39843750e-02, -9.71679688e-02,  5.17578125e-02,
       -5.00488281e-02, -2.20947266e-01,  2.29492188e-01,  1.26403809e-01,
        2.49023438e-01,  2.09960938e-02, -1.09863281e-01,  5.81054688e-02,
       -3.35693359e-02,  1.29577637e-01,  2.41699219e-02,  3.48129272e-02,
       -2.60009766e-01,  

**Cell 13: Find words most similar to the resulting vector**

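This cell finds and displays the words most similar to the vector calculated in Cell 11. Because the query is a raw vector rather than a word, `most_similar` does not exclude the input words, so 'king' itself ranks first. 'queen' comes second, which demonstrates the learned analogy.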

In [14]:

wv.most_similar([vec])

[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.645466148853302),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676352500916),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376775860786438),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]
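
In the output above, 'king' ranks first because the result vector stays close to its largest input. The usual remedy is to leave the input words out of the ranking. The sketch below shows the effect on made-up 3-dimensional vectors; the values and the helper `rank_by_cosine` are assumptions chosen for illustration, not data from the model.

```python
import numpy as np

# Made-up toy vectors, chosen only to reproduce the effect seen above.
toy = {
    "king":  np.array([0.9,  0.2, 0.1]),
    "queen": np.array([0.6, -0.2, 0.5]),
    "man":   np.array([0.1,  0.2, 0.0]),
    "woman": np.array([0.1, -0.2, 0.0]),
    "apple": np.array([-0.5, 0.0, 0.8]),
}


def rank_by_cosine(query, vectors, exclude=()):
    """Return (word, cosine similarity) pairs, best first, skipping `exclude`."""
    scores = []
    for word, vec in vectors.items():
        if word in exclude:
            continue
        cos = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scores.append((word, float(cos)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


target = toy["king"] - toy["man"] + toy["woman"]

# Without exclusion, the nearest word is the input 'king'.
print(rank_by_cosine(target, toy)[0][0])  # king

# Excluding the three input words lets 'queen' surface.
print(rank_by_cosine(target, toy, exclude={"king", "man", "woman"})[0][0])  # queen
```

With the real model, gensim's `most_similar` accepts `positive` and `negative` word lists and leaves those input words out of its results, for example `wv.most_similar(positive=['king', 'woman'], negative=['man'])`.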