# Word2Vec: Word Embeddings for Natural Language Processing

In this notebook, we will explore **Word2Vec**, a powerful technique for learning word embeddings. Word embeddings are dense vector representations of words that capture their semantic meaning and relationships.

## Objectives
By the end of this notebook, you will:
1. Understand the basics of Word2Vec and its two main training approaches: CBOW and Skip-gram.
2. Learn how to train a Word2Vec model using the `gensim` library.
3. Perform tasks like finding similar words and solving word analogies.
4. Save and load the trained Word2Vec model.

---

## What is Word2Vec?
Word2Vec is a technique introduced by researchers at Google in 2013 for learning word embeddings from a large corpus of text. Each word is mapped to a dense vector (typically a few hundred dimensions), and words that occur in similar contexts end up close to each other in that vector space.

### Training Approaches:
1. **CBOW (Continuous Bag of Words)**:
   - Predict a target word from its surrounding context words.
   - Faster but less sensitive to rare words.
2. **Skip-gram**:
   - Predict context words given a target word.
   - Slower but better for representing rare words.
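
The difference between the two approaches is easiest to see in the training examples each one builds from a sentence. The sketch below is a simplified illustration in plain Python: the helper names (`context_windows`, `cbow_examples`, `skipgram_examples`) are hypothetical, and it leaves out details of real implementations such as subsampling of frequent words and negative sampling.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in a sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    # CBOW: the full context is the input, the target word is the label.
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: one (target, context_word) pair per context word.
    return [
        (target, context_word)
        for target, context in context_windows(tokens, window)
        for context_word in context
    ]


tokens = ["the", "jury", "said", "friday"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

Skip-gram produces one training pair per (target, context word) combination, so each occurrence of a rare word contributes several updates. This is one intuition for why it tends to represent rare words better, at the cost of more training work.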


# Preliminaries: Libraries and Data Preparation

We will use the following libraries and data:
1. **`gensim`**: For training and using the Word2Vec model.
2. **`nltk`**: To download and access the training data.
3. **`nltk.corpus.brown`**: The Brown Corpus reader, which provides a collection of American English text from a wide variety of genres.

The Brown Corpus is pre-tokenized: `brown.sents()` returns each sentence as a list of token strings, which is the input format `Word2Vec` expects.



In [None]:
import gensim
from gensim.models import Word2Vec
import nltk
from nltk.corpus import brown

# Download the Brown Corpus
nltk.download('brown')

# Load the sentences from the Brown Corpus
sentences = brown.sents()

In [2]:
# Print an example sentence from the corpus
print(sentences[:1])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']]
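
The sample sentence shows that the tokens keep their original capitalization and that punctuation marks are tokens too. Word2Vec treats `The` and `the` as different words, so an optional normalization step can shrink the vocabulary. The sketch below is one possible approach, not part of the original notebook; `normalize_sentences` is a hypothetical helper name.

```python
def normalize_sentences(sentences):
    """Lowercase tokens and drop tokens that contain no letters or digits."""
    cleaned = []
    for sentence in sentences:
        tokens = [t.lower() for t in sentence if any(ch.isalnum() for ch in t)]
        if tokens:  # skip sentences that become empty
            cleaned.append(tokens)
    return cleaned


sample = [["The", "Fulton", "County", "Grand", "Jury", "said", "``", "no", "evidence", "''", "."]]
print(normalize_sentences(sample))
```

To try this in the notebook, pass `normalize_sentences(brown.sents())` to `Word2Vec` in place of the raw sentences and query with lowercase words.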


# Step 1: Train a Word2Vec Model

We will use the `Word2Vec` class from `gensim` to train a model on the Brown Corpus.

### Key Parameters:
- **`sentences`**: Input sentences for training.
- **`vector_size`**: Dimensionality of the word vectors (default: 100).
- **`window`**: Context window size (default: 5).
- **`min_count`**: Ignores words that appear fewer than this number of times (default: 5).
- **`workers`**: Number of worker threads used for training (default: 3).
- **`sg`**: Training algorithm (0 for CBOW, 1 for Skip-gram).

We will train the model using the CBOW approach (`sg=0`).
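
To get a feel for what `min_count` does before training, the following sketch counts tokens and keeps only those that meet the threshold. It is a simplified stand-in for the vocabulary-building step, not gensim's actual implementation; `build_vocab` and the toy sentences are made up for illustration.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Return {token: count} for tokens appearing at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

print(len(build_vocab(toy_sentences, min_count=1)))   # every distinct token is kept
print(sorted(build_vocab(toy_sentences, min_count=2)))  # only tokens seen at least twice
```

With `min_count=1`, as used in this notebook, every token in the corpus gets a vector, including words that occur only once and therefore have very little training signal.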


In [3]:
# Train the Word2Vec model
model = Word2Vec(
    sentences=sentences,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4,
    sg=0  # 0 for CBOW, 1 for Skip-gram
)

print("Model training completed.")
print("\nVocabulary size:", len(model.wv.key_to_index))


Model training completed.

Vocabulary size: 56057
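
Once a model is trained, it can be written to disk and reloaded later instead of being retrained. The sketch below wraps gensim's `model.save(...)` and `Word2Vec.load(...)` in two small helpers. The helper names and the file name `brown_word2vec.model` are assumptions chosen for illustration.

```python
def save_model(model, path):
    """Persist a trained gensim model and return the path that was used."""
    path = str(path)
    model.save(path)
    return path


def load_model(path):
    """Reload a model previously written with save_model."""
    # Imported here so the helpers can be defined even before gensim is imported.
    from gensim.models import Word2Vec
    return Word2Vec.load(str(path))


# Usage in this notebook, after the training cell has run:
# saved_path = save_model(model, "brown_word2vec.model")
# reloaded = load_model(saved_path)
# print(len(reloaded.wv.key_to_index))
```

If the reload worked, the vocabulary size printed by the last line should match the one reported after training.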


# Step 2: Find Similar Words

Using the trained Word2Vec model, we can find words that are semantically similar to a given word based on cosine similarity in the embedding space.

### Example:
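With embeddings trained on a large corpus, querying for words similar to "king" typically returns words such as "queen", "prince", and "monarch", each with a similarity score. On a small corpus such as Brown, the neighbors are usually noisier.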
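
The similarity score is the cosine of the angle between two word vectors: 1 means the vectors point in the same direction, 0 means they are orthogonal. A minimal sketch of the computation in plain Python, using short made-up vectors rather than real embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

Because the score depends only on direction, not length, two vectors of very different magnitude can still have a similarity close to 1.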


In [4]:
# Function to find similar words
def find_similar_words(word, topn=5):
    try:
        similar_words = model.wv.most_similar(word, topn=topn)
        print(f"Words most similar to '{word}':")
        for w, score in similar_words:
            print(f"  {w}: {score:.4f}")
    except KeyError:
        print(f"'{word}' not in vocabulary")

find_similar_words("king")
find_similar_words("queen")


Words most similar to 'king':
  Yankee: 0.9662
  Model: 0.9661
  former: 0.9657
  Prince: 0.9651
  mood: 0.9650
Words most similar to 'queen':
  Book: 0.9511
  captain: 0.9480
  clerk: 0.9469
  minister: 0.9464
  Lord: 0.9459


Let's try another example:

In [11]:
find_similar_words("doctor")
find_similar_words("science")

Words most similar to 'doctor':
  boy: 0.9541
  patient: 0.9418
  President: 0.9378
  conversation: 0.9293
  letter: 0.9228
Words most similar to 'science':
  distinction: 0.9684
  philosophy: 0.9649
  adolescent: 0.9562
  style: 0.9531
  inherent: 0.9517


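The results are mixed. Some neighbors are plausibly related (`Prince` for "king", `patient` for "doctor", `philosophy` for "science"), but many are not, and all scores fall in a narrow band between roughly 0.92 and 0.97. This is expected: the Brown Corpus contains only about one million words, `min_count=1` keeps words that were seen a single time, and the tokens are case-sensitive, so the embeddings are weakly trained. A larger corpus, a higher `min_count`, or more training epochs typically produces cleaner neighbors.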
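
Word analogies, listed in the objectives, use the same similarity machinery. To answer "man is to woman as king is to ?", compute `vec(woman) - vec(man) + vec(king)` and look for the nearest remaining word. The sketch below shows the idea with hand-made 2-D vectors; the values are hypothetical and chosen so the arithmetic is easy to follow, not learned embeddings.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def solve_analogy(vectors, a, b, c, topn=1):
    """Rank words by similarity to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:topn]


# Toy vectors: first axis loosely "royalty", second axis loosely "gender".
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [-1.0, 2.0],
}

# man : woman :: king : ?
print(solve_analogy(toy_vectors, "man", "woman", "king"))
```

With the trained model, the corresponding gensim call is `model.wv.most_similar(positive=["woman", "king"], negative=["man"], topn=5)`. Given how noisy the neighbors above are, a model trained on the Brown Corpus alone should not be expected to solve this analogy reliably.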


# Exercises: Hands-on Practice

1. **Train with Skip-gram**:
   Modify the training code to use Skip-gram (`sg=1`) instead of CBOW. Compare the results for word similarity and analogy tasks.

2. **Experiment with Parameters**:
   Change the `window` size or `min_count` parameter in the training process. Observe how these changes affect the vocabulary and results.

3. **Out-of-Vocabulary Words**:
   Test the model with words not present in the Brown Corpus. How does the model handle such cases?

4. **Custom Corpus**:
   Train the Word2Vec model on a custom text corpus, such as a collection of articles or books.

