
<div style="display: flex; justify-content: flex-start; align-items: center;">
   <a href="https://colab.research.google.com/github/msfasha/307307-BI-Methods-Generative-AI/blob/main/20243/Part%202%20-%20Introduction%20to%20NNs%20and%20Word%20Embeddings/Introduction%20to%20Neural%20Networks%20and%20Word%20Embeddings-Python.ipynb" target="_parent"><img 
   src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

## Word Embeddings

### Building Word Embeddings from Scratch
We can build word embeddings from scratch by training a Word2Vec model on our own corpus with the gensim library.

First, we need to install the gensim library.

In [None]:
# Uncomment the line below to install gensim on a local machine
#%pip install gensim

> If you run into compatibility issues in Google Colab, run the cell below instead.

In [None]:
# Work around gensim version conflicts on Google Colab by reinstalling pinned versions
%pip uninstall -y gensim
%pip install --force-reinstall "scipy<1.11" "gensim==4.3.2"

We also need to install nltk to tokenize our text.

In [None]:
%pip install nltk

import nltk
nltk.download('punkt_tab')

Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\me\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Sample corpus
sentences = [
    "Large language models are transforming business applications",
    "Natural language processing helps computers understand human language",
    "Word embeddings capture semantic relationships between words",
    "Neural networks learn distributed representations of words",
    "Businesses use language models for various applications",
    "Customer service can be improved with language technology",
    "Modern language models require significant computing resources",
    "Language models can generate human-like text for businesses"
]

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train Word2Vec model
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,    # Embedding dimension
    window=5,           # Context window size
    min_count=1,        # Minimum word frequency
    workers=4           # Number of threads
)

# Save the model
model.save("word2vec.model")
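
The `min_count=1` setting matters for a corpus this small. The sketch below counts token frequencies in the same eight sentences to show which words would survive different thresholds. It assumes plain whitespace splitting as a stand-in for `word_tokenize`, which happens to give the same tokens here because the sentences contain no punctuation. `surviving_vocab` is a hypothetical helper, not part of gensim.

```python
from collections import Counter

sentences = [
    "Large language models are transforming business applications",
    "Natural language processing helps computers understand human language",
    "Word embeddings capture semantic relationships between words",
    "Neural networks learn distributed representations of words",
    "Businesses use language models for various applications",
    "Customer service can be improved with language technology",
    "Modern language models require significant computing resources",
    "Language models can generate human-like text for businesses",
]

# Assumption: whitespace splitting stands in for nltk's word_tokenize
tokens = [tok for s in sentences for tok in s.lower().split()]
counts = Counter(tokens)

def surviving_vocab(counts, min_count):
    """Return the words whose frequency is at least min_count (hypothetical helper)."""
    return sorted(w for w, c in counts.items() if c >= min_count)

print("Most frequent tokens:", counts.most_common(3))
print("Vocabulary size with min_count=1:", len(surviving_vocab(counts, 1)))
print("Vocabulary with min_count=5:", surviving_vocab(counts, 5))
```

With gensim's default of `min_count=5`, only "language" would remain in the vocabulary, which is why the training call lowers the threshold to 1.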



#### Display vector for a specific word

In [None]:
word_vector = model.wv["business"]
print(f"\nVector for 'business' (first 10 dimensions):\n{word_vector[:10]}") # Print first 10 dimensions


Vector for 'business' (first 10 dimensions):
[ 0.00816812 -0.00444303  0.00898543  0.00825366 -0.00443522  0.00030311
  0.00427449 -0.00392632 -0.00555997 -0.00651232]


#### Find the most similar words to "language"

In [19]:
similar_words = model.wv.most_similar("language", topn=5)
print("Words most similar to 'language':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

Words most similar to 'language':
natural: 0.2196
between: 0.2167
resources: 0.1955
distributed: 0.1696
significant: 0.1522
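
The scores printed by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that measure, the pure-Python function below computes it for small hand-made vectors. The vectors `a`, `b`, `c`, and `d` are hypothetical and chosen only to make the results easy to check.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors for illustration
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a
c = [-3.0, 0.0, 1.0]    # orthogonal to a
d = [-1.0, -2.0, -3.0]  # opposite direction to a

print(f"a vs b: {cosine_similarity(a, b):.4f}")
print(f"a vs c: {cosine_similarity(a, c):.4f}")
print(f"a vs d: {cosine_similarity(a, d):.4f}")
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions. The scores of about 0.2 above are therefore weak: with only eight training sentences, the vectors have barely moved from their random initialization.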


---

#### Using Pre-trained Embeddings with Gensim

For this example, we will use `word2vec-google-news-300`, a **pre-trained Word2Vec model** created by Google and available through gensim's downloader.  
It has the following characteristics:

- **Training data**: Trained on approximately **100 billion words** from the **Google News** dataset.
- **Vector size**: Each word is represented as a **300-dimensional** vector.
- **Vocabulary size**: It contains **about 3 million unique words and phrases**.
- **Training method**: It uses the **skip-gram** Word2Vec architecture to predict context words given a target word.
- **Size**: The download is large, about 1.5 GB. `api.load` caches the file locally after the first download, so later calls load it from disk instead of downloading it again.

This model captures a wide range of **semantic** and **syntactic** relationships between words.  
Because it is trained on a large and diverse corpus, it is widely used for many natural language processing (NLP) tasks where high-quality word embeddings are required.
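
To make the skip-gram idea concrete, the sketch below generates the (target, context) training pairs that the architecture learns from. It is a simplified illustration: `skipgram_pairs` is a hypothetical helper, and it uses a fixed window, whereas real Word2Vec implementations also randomly shrink the window and subsample frequent words.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "word embeddings capture semantic relationships".split()
pairs = skipgram_pairs(tokens, window=1)

for target, context in pairs:
    print(f"target={target!r} -> context={context!r}")
```

Each pair becomes one prediction task: given the target word, predict the context word. Note that gensim's `Word2Vec` class defaults to the CBOW architecture (`sg=0`), which predicts the target from its context; passing `sg=1` selects skip-gram.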

In [22]:
import gensim.downloader as api

# Load the pre-trained Word2Vec model
word2vec_model = api.load("word2vec-google-news-300")

# Find words similar to 'computer'
similar_words = word2vec_model.most_similar('computer', topn=5)
print("Words similar to 'computer':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

# Example of a word analogy: king - man + woman = ?
analogy_result = word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print("\nResult of analogy (king - man + woman):")
for word, similarity in analogy_result:
    print(f"{word}: {similarity:.4f}")

# Find odd one out
odd_one_out = word2vec_model.doesnt_match(["breakfast", "lunch", "dinner", "car"])
print("\nOdd one out in ['breakfast', 'lunch', 'dinner', 'car']:", odd_one_out)

# Compute similarity between two words
similarity_score = word2vec_model.similarity('coffee', 'tea')
print(f"\nSimilarity between 'coffee' and 'tea': {similarity_score:.4f}")

Words similar to 'computer':
computers: 0.7979
laptop: 0.6640
laptop_computer: 0.6549
Computer: 0.6473
com_puter: 0.6082

Result of analogy (king - man + woman):
queen: 0.7118

Odd one out in ['breakfast', 'lunch', 'dinner', 'car']: car

Similarity between 'coffee' and 'tea': 0.5635
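
The analogy query can be understood as vector arithmetic followed by a nearest-neighbour search. The sketch below reproduces the idea on a toy vocabulary: it adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity to the result. The 2-dimensional vectors are hypothetical (axis 0 loosely stands for "royalty", axis 1 for "gender"), and `analogy` is an illustrative helper that only approximates what `most_similar` does internally.

```python
import math

# Hypothetical toy vectors: axis 0 ~ royalty, axis 1 ~ gender
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy(positive, negative, vectors):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the inputs."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    excluded = set(positive) | set(negative)
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in excluded]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = analogy(positive=["woman", "king"], negative=["man"], vectors=vectors)
for word, score in ranking:
    print(f"{word}: {score:.4f}")
```

Excluding the input words matters: without that step, the nearest vector to the query is often one of the query words themselves rather than the intended answer.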
