
<div style="display: flex; justify-content: flex-start; align-items: center;">
   <a href="https://colab.research.google.com/github/msfasha/307307-BI-Methods-Generative-AI/blob/main/20251/Module%204%20-%20Introduction%20to%20Words%20Embeddings/2-Word%20Embeddings%20Examples.ipynb" target="_parent"><img 
   src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

## Word Embeddings Using Gensim Library

First, we need to install the Gensim library:

In [None]:
%pip install gensim

Next, we will download the `word2vec-google-news-300` model. <br>
The `word2vec-google-news-300` model is a **pre-trained Word2Vec model** created by Google.  
It has the following characteristics:

- **Training data**: Trained on approximately **100 billion words** from the **Google News** dataset.
- **Vector size**: Each word is represented as a **300-dimensional** vector.
- **Vocabulary size**: It contains **about 3 million unique words and phrases**.
- **Training method**: It uses the **skip-gram** Word2Vec architecture to predict context words given a target word.
- **Size**: The download is **heavy**, about 1.5 GB, so it's good to know the best way to handle it.

This model captures a wide range of **semantic** and **syntactic** relationships between words.  
Because it is trained on a large and diverse corpus, it is widely used for many natural language processing (NLP) tasks where high-quality word embeddings are required.

In [None]:
import gensim.downloader as api

# Download the pre-trained Word2Vec model (first call only; later calls reuse the cached copy) and load it
word2vec_model = api.load("word2vec-google-news-300")
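
Because the model is large, loading every vector may be more than a small machine can comfortably hold in memory. One possible way to handle this, sketched below, is to ask the downloader for the file path only and then read just the first part of the vocabulary. The helper name `load_google_news_vectors` and the default `limit` of 200,000 are assumptions made for this illustration; they are not part of Gensim.

```python
def load_google_news_vectors(limit=200_000):
    """Load only the first `limit` vectors of the model to reduce memory use.

    Hypothetical helper for illustration; `limit=200_000` is an arbitrary choice.
    """
    # Imported inside the function so that defining the helper does not
    # trigger any download or require the model to be present.
    import gensim.downloader as api
    from gensim.models import KeyedVectors

    # return_path=True downloads the file if needed and returns its location
    # on disk instead of loading the vectors into memory.
    model_path = api.load("word2vec-google-news-300", return_path=True)

    # The file is in the word2vec binary format. `limit` reads only the first
    # N entries, which are typically the most frequent words.
    return KeyedVectors.load_word2vec_format(model_path, binary=True, limit=limit)
```

Calling `small_model = load_google_news_vectors()` should return an object that supports the same queries used in the rest of this notebook (`most_similar`, `doesnt_match`, `similarity`), but rare words beyond the limit will be missing from its vocabulary.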


Find words similar to 'computer'

In [None]:
similar_words = word2vec_model.most_similar('computer', topn=5)
print("Words similar to 'computer':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

Example of a word analogy: king - man + woman = ?

In [None]:
analogy_result = word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print("\nResult of analogy (king - man + woman):")
for word, similarity in analogy_result:
    print(f"{word}: {similarity:.4f}")
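
To see what the analogy query is doing, here is a minimal sketch of the same idea on made-up 3-dimensional vectors. The vectors and the helper `solve_analogy` are hypothetical and exist only for this illustration. Gensim's own `most_similar` works on length-normalized vectors and averages them, so this is a simplified version of the idea rather than a copy of the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors; the dimensions loosely mean (royalty, male, female).
toy_vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "car":   [0.05, 0.2, 0.2],
}

def solve_analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]

    # The query words themselves are excluded from the candidates.
    excluded = set(positive) | set(negative)
    scores = {w: cosine(target, vec) for w, vec in vectors.items() if w not in excluded}
    return max(scores, key=scores.get)

best_match = solve_analogy(["king", "woman"], ["man"], toy_vectors)
print("king - man + woman is closest to:", best_match)
```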

Find the odd one out in a list of words

In [None]:
odd_one_out = word2vec_model.doesnt_match(["breakfast", "lunch", "dinner", "car"])
print("\nOdd one out in ['breakfast', 'lunch', 'dinner', 'car']:", odd_one_out)
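
The idea behind this query can be sketched on toy data: normalize each word vector, average them, and pick the word whose vector is least similar to that average. The vectors and the function `odd_one_out_sketch` below are hypothetical and meant only to illustrate the approach; the details of Gensim's `doesnt_match` may differ.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out_sketch(words, vectors):
    """Return the word whose unit vector is least similar to the group's mean vector."""
    normalized = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(normalized.values())))

    # Mean of the unit vectors, re-normalized so the dot product is a cosine.
    mean = unit([sum(vec[i] for vec in normalized.values()) / len(words) for i in range(dims)])

    scores = {w: sum(a * b for a, b in zip(vec, mean)) for w, vec in normalized.items()}
    return min(scores, key=scores.get)

# Hypothetical vectors, made up for illustration only.
toy_vectors = {
    "breakfast": [0.9, 0.8, 0.1],
    "lunch":     [0.8, 0.9, 0.1],
    "dinner":    [0.85, 0.85, 0.2],
    "car":       [0.1, 0.2, 0.9],
}

result = odd_one_out_sketch(["breakfast", "lunch", "dinner", "car"], toy_vectors)
print("Odd one out:", result)
```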

Compute the similarity between two words (using the cosine similarity measure)

In [None]:
similarity_score = word2vec_model.similarity('coffee', 'tea')
print(f"\nSimilarity between 'coffee' and 'tea': {similarity_score:.4f}")
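
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the angle between the vectors and not on their magnitudes. The sketch below computes it by hand on made-up 3-dimensional vectors; the values are hypothetical and are not taken from the Word2Vec model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, made up for illustration only.
coffee = [0.9, 0.8, 0.1]
tea = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(f"coffee vs tea: {cosine_similarity(coffee, tea):.4f}")
print(f"coffee vs car: {cosine_similarity(coffee, car):.4f}")
```

With these toy values, the two beverage vectors point in nearly the same direction and score much higher than the pair `coffee` and `car`.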