# 📘 Introduction to Word Embeddings with Word2vec

Welcome to your 2-hour journey into the fascinating world of Natural Language Processing (NLP)! Today, we'll explore one of the most important concepts that revolutionized how machines understand human language: **Word Embeddings**.

### 🎯 Learning Objectives

By the end of this session, you will be able to:

1.  **Understand** what word embeddings are and why they are useful.
2.  **Explain** the core idea behind Word2vec and its two main architectures: CBOW and Skip-gram.
3.  **See** how word embeddings capture word meanings and relationships.
4.  **Apply** these concepts through simple, hands-on practice tasks.

## Topic 1: What are Word Embeddings?

Imagine you had to represent every word in the dictionary with numbers. An easy way would be to give each word its own position in a long list, then represent the word as a vector of zeros with a single 1 at that position. This is called **One-Hot Encoding**.

However, this method has two big problems:
1.  **It's Inefficient:** If you have 10,000 words, each word's representation is a list with 9,999 zeros and only one '1'. That's a lot of wasted space!
2.  **It's Not Smart:** The vectors for "cat" and "dog" are no more related than the vectors for "cat" and "car". It doesn't capture any meaning.

**Word Embeddings solve this!** They represent words as dense, multi-dimensional vectors (like `[0.2, -0.4, 0.7, ...]`). The magic is that words with similar meanings are placed close to each other in this vector space.
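To put rough numbers on the space argument, here is a small sketch that counts how many values each representation needs for the 10,000-word vocabulary mentioned above. The embedding size of 300 is an assumption chosen for illustration; real models use a range of sizes.

```python
vocab_size = 10_000     # vocabulary size from the example above
embedding_dim = 300     # assumed embedding size, for illustration only

# One-hot: each word needs a vector as long as the whole vocabulary
one_hot_numbers = vocab_size * vocab_size

# Dense embeddings: each word needs only `embedding_dim` numbers
dense_numbers = vocab_size * embedding_dim

print(f"One-hot table:   {one_hot_numbers:,} numbers")
print(f"Embedding table: {dense_numbers:,} numbers")
print(f"One-hot is about {one_hot_numbers // dense_numbers}x larger")
```

Under these assumptions the dense table is dozens of times smaller, and the gap grows as the vocabulary grows.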

In [1]:
# Example: Let's visualize the problem with One-Hot Encoding

# Our small vocabulary
vocabulary = ["cat", "dog", "car", "house"]
print(f"Our vocabulary is: {vocabulary}\n")

# A simple function to create a one-hot vector
def one_hot_encode(word, vocab):
    # Create a vector of zeros with the length of the vocabulary
    vector = [0] * len(vocab)
    # Find the index of our word
    try:
        index = vocab.index(word)
        # Place a '1' at that index
        vector[index] = 1
        return vector
    except ValueError:
        return "Word not in vocabulary"

# Let's see the vectors
cat_vector = one_hot_encode("cat", vocabulary)
dog_vector = one_hot_encode("dog", vocabulary)
car_vector = one_hot_encode("car", vocabulary)

print(f"'cat' vector: {cat_vector}")
print(f"'dog' vector: {dog_vector}")
print(f"'car' vector: {car_vector}")

# Notice how the vectors don't show any relationship between 'cat' and 'dog'.

Our vocabulary is: ['cat', 'dog', 'car', 'house']

'cat' vector: [1, 0, 0, 0]
'dog' vector: [0, 1, 0, 0]
'car' vector: [0, 0, 1, 0]
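One way to see the "not smart" problem in numbers: the dot product between any two different one-hot vectors is always 0, so every pair of words looks equally unrelated. The sketch below rebuilds the vectors inline so it can run on its own; the `dot` helper is just an illustrative name.

```python
vocabulary = ["cat", "dog", "car", "house"]

# Build a one-hot vector for every word in the vocabulary
one_hot = {
    word: [1 if i == index else 0 for i in range(len(vocabulary))]
    for index, word in enumerate(vocabulary)
}

def dot(vec1, vec2):
    # Sum of element-wise products
    return sum(a * b for a, b in zip(vec1, vec2))

print(f"cat . dog = {dot(one_hot['cat'], one_hot['dog'])}")
print(f"cat . car = {dot(one_hot['cat'], one_hot['car'])}")
print(f"cat . cat = {dot(one_hot['cat'], one_hot['cat'])}")
```

A word only "matches" itself. Nothing in the encoding says that "cat" is closer to "dog" than to "car".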


### 🧠 Practice Task 1

Using the `one_hot_encode` function from the cell above, create a one-hot vector for the word `"house"` from our vocabulary. What do you expect the output to be? Write your code in the cell below!

In [2]:
# Your code here!
vocabulary = ["cat", "dog", "car", "house"]

def one_hot_encode(word, vocab):
    vector = [0] * len(vocab)
    try:
        index = vocab.index(word)
        vector[index] = 1
        return vector
    except ValueError:
        return "Word not in vocabulary"

# Create the vector for 'house'
house_vector = one_hot_encode("house", vocabulary) # Your turn to complete this line!
print(f"'house' vector: {house_vector}")

'house' vector: [0, 0, 0, 1]


## Topic 2: Introducing Word2vec 💡

**Word2vec** is a family of models, introduced in 2013 by Tomas Mikolov and colleagues at Google, that learns word embeddings from raw text. Its foundation is a simple but profound idea from linguistics called the **Distributional Hypothesis**, famously summed up by the linguist J.R. Firth:

> "You shall know a word by the company it keeps."

In simple terms, Word2vec reads a large amount of text (a **corpus**) and learns that words appearing in similar contexts (that is, surrounded by the same words) tend to have similar meanings, so it gives them similar vectors. For example, across many sentences like "My dog loves to play fetch" and "The cat is sleeping on the mat", it will notice that 'dog' and 'cat' often appear near words like 'pet', 'food', and 'play', so it will place their vectors close together.
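To get a feel for "the company a word keeps", the sketch below simply collects the neighbors of each word in a tiny, made-up corpus. This is only a counting illustration of the hypothesis, not how Word2vec is trained (Word2vec learns by prediction, as the next topic shows). The corpus and the window size are assumptions for this example.

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus (illustrative only)
corpus = [
    "my dog loves to play fetch",
    "my cat loves to play outside",
    "the car needs new tires",
]
window = 2

# For every word, count the words that appear within `window` positions of it
contexts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:
                contexts[word][words[j]] += 1

shared_dog_cat = set(contexts["dog"]) & set(contexts["cat"])
shared_dog_car = set(contexts["dog"]) & set(contexts["car"])

print(f"Contexts shared by 'dog' and 'cat': {sorted(shared_dog_cat)}")
print(f"Contexts shared by 'dog' and 'car': {sorted(shared_dog_car)}")
```

In this toy corpus 'dog' and 'cat' keep the same company, while 'dog' and 'car' share no neighbors at all.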

### The Famous Analogy: King - Man + Woman = Queen

The vectors created by Word2vec are so powerful they can even capture relationships. The most famous example is:

`vector('King') - vector('Man') + vector('Woman')` results in a vector that is very close to `vector('Queen')`!

This suggests the model has picked up on regularities like gender and royalty, encoded as directions in the vector space, just by reading text!
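The arithmetic can be tried out on toy data. The vectors below are hand-made 2-D values (roughly "royalty" and "gender"), not trained embeddings, so they only illustrate the mechanics. Closeness is measured with cosine similarity, which Topic 4 explains. Skipping the three query words when searching for the best match is a common convention, adopted here as an assumption.

```python
import numpy as np

# Hand-made 2-D vectors: [royalty, gender]. These are NOT trained embeddings.
toy_vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.7, 0.1]),
}

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# king - man + woman
result = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

# Compare the result with every word except the three query words
query_words = {"king", "man", "woman"}
scores = {
    word: cosine_similarity(result, vec)
    for word, vec in toy_vectors.items()
    if word not in query_words
}
best_match = max(scores, key=scores.get)

for word, score in scores.items():
    print(f"Similarity to '{word}': {score:.4f}")
print(f"Closest word: {best_match}")
```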

## Topic 3: The Two Flavors of Word2vec - CBOW & Skip-gram

Word2vec isn't one single algorithm; it's a family of two model architectures. Let's imagine we have the sentence: `"The quick brown fox jumps over the lazy dog"` and we're looking at the word `"fox"` with a **window size** of 2. This means we consider 2 words before and 2 words after our target word.

- **Target Word:** `fox`
- **Context Words:** `quick`, `brown` (before), `jumps`, `over` (after)

---

### 📄 1. Continuous Bag-of-Words (CBOW)

The CBOW model learns by doing the following:
- **Goal:** Predict the target word from its context words.
- **Analogy:** It's like a fill-in-the-blanks puzzle.

**How it works for our example:**
1.  **Input:** The context words `[quick, brown, jumps, over]`.
2.  **Task:** Predict the word `_____` in the middle.
3.  **Output:** The model should predict `fox`.

During training, the model adjusts its vectors so it gets better at this prediction. CBOW is fast and works well for words that appear often.
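The fill-in-the-blanks setup above can be sketched as a small function that builds one CBOW training sample. The name `generate_cbow_sample` is hypothetical and used only for this illustration; like the rest of this notebook, it splits on whitespace and uses the first occurrence of the target word.

```python
def generate_cbow_sample(sentence, target_word, window_size):
    words = sentence.split()
    target_index = words.index(target_word)

    # Collect the words inside the window, skipping the target itself
    start = max(0, target_index - window_size)
    end = min(len(words), target_index + window_size + 1)
    context_words = [words[i] for i in range(start, end) if i != target_index]

    # CBOW: many context words in, one target word out
    return context_words, target_word

sentence = "The quick brown fox jumps over the lazy dog"
context, predicted = generate_cbow_sample(sentence, "fox", 2)

print(f"Input (context words): {context}")
print(f"Output (target word):  {predicted}")
```

Note that the whole context forms a single input, which is one intuition for why CBOW needs fewer training steps per window than Skip-gram.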

### 📄 2. Continuous Skip-gram

The Skip-gram model does the exact opposite:
- **Goal:** Predict the context words from the target word.
- **Analogy:** Given one word, guess its neighbors.

**How it works for our example:**
1.  **Input:** The target word `fox`.
2.  **Task:** Predict the words that are likely to be its neighbors.
3.  **Output:** The model should predict `quick`, `brown`, `jumps`, and `over`.

Skip-gram is slower than CBOW but is excellent at learning good representations for rare words.

In [5]:
# Example: Generating training samples for Skip-gram
# Let's write a simple function to see what the training data looks like.

def generate_skipgram_pairs(sentence, target_word, window_size):
    words = sentence.split()
    target_index = words.index(target_word)
    
    pairs = []
    # Iterate through the window around the target word
    for i in range(max(0, target_index - window_size), min(len(words), target_index + window_size + 1)):
        # Make sure we don't pair the word with itself
        if i != target_index:
            context_word = words[i]
            pairs.append((target_word, context_word))
    return pairs

sentence = "The quick brown fox jumps over the lazy dog"
target = "fox"
window = 2

training_pairs = generate_skipgram_pairs(sentence, target, window)
print(f"Sentence: '{sentence}'")
print(f"Target Word: '{target}' with window size {window}\n")
print("Generated Skip-gram pairs (input, output):")
for pair in training_pairs:
    print(pair)

print("\n🧪 Try changing the target word or the window size and see what happens!")

Sentence: 'The quick brown fox jumps over the lazy dog'
Target Word: 'fox' with window size 2

Generated Skip-gram pairs (input, output):
('fox', 'quick')
('fox', 'brown')
('fox', 'jumps')
('fox', 'over')

🧪 Try changing the target word or the window size and see what happens!
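The example above looks at a single target word. In actual training, every word in the text takes a turn as the target. Here is a minimal sketch of that sliding process; the function name `generate_all_skipgram_pairs` and the short sentence are assumptions made for this illustration.

```python
def generate_all_skipgram_pairs(sentence, window_size):
    words = sentence.split()
    pairs = []
    # Let every position in the sentence act as the target once
    for target_index, target_word in enumerate(words):
        start = max(0, target_index - window_size)
        end = min(len(words), target_index + window_size + 1)
        for i in range(start, end):
            if i != target_index:
                pairs.append((target_word, words[i]))
    return pairs

all_pairs = generate_all_skipgram_pairs("The quick brown fox", 1)

print(f"Number of pairs: {len(all_pairs)}")
for pair in all_pairs:
    print(pair)
```

Even a four-word sentence with a window of 1 yields six pairs, which hints at why Skip-gram training takes longer on a large corpus.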


### 🧠 Practice Task 2

You are given the sentence `"Natural language processing is fun"` and a window size of 1. 

What would be the training samples generated for the **Skip-gram model** with the target word `"processing"`? 

Use the code cell below to find out!

In [6]:
# Your code here!
def generate_skipgram_pairs(sentence, target_word, window_size):
    words = sentence.split()
    target_index = words.index(target_word)
    pairs = []
    for i in range(max(0, target_index - window_size), min(len(words), target_index + window_size + 1)):
        if i != target_index:
            context_word = words[i]
            pairs.append((target_word, context_word))
    return pairs

# Define your new sentence and parameters
my_sentence = "Natural language processing is fun"
my_target = "processing"
my_window = 1

# Generate the pairs
my_pairs = generate_skipgram_pairs(my_sentence, my_target, my_window)
print(f"The generated pairs are: {my_pairs}")

The generated pairs are: [('processing', 'language'), ('processing', 'is')]


✅ **Well done!** You've just seen how Word2vec models turn sentences into training data.

## Topic 4: Measuring Similarity

How do we know if two word vectors are "close" to each other in the vector space? We use a metric called **Cosine Similarity**.

Imagine two vectors as arrows starting from the same point. 
- If the arrows point in the exact same direction, their similarity is **1**.
- If they are perpendicular (90 degrees apart), their similarity is **0**.
- If they point in opposite directions, their similarity is **-1**.

Cosine similarity measures the angle between the vectors, not their length. This is perfect for word embeddings, as it tells us about orientation (meaning) rather than magnitude (which can be related to word frequency).

In [7]:
# Example: Calculating Cosine Similarity
# We'll use the numpy library for this. It's a fundamental library for numerical operations in Python.

import numpy as np

# Let's create some simple, fake word vectors
cat_vec = np.array([1, 2, 3])
dog_vec = np.array([2, 3, 4])  # Should be similar to cat
car_vec = np.array([-1, -2, 5]) # Should be different from cat

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

similarity_cat_dog = cosine_similarity(cat_vec, dog_vec)
similarity_cat_car = cosine_similarity(cat_vec, car_vec)

print(f"Similarity between 'cat' and 'dog': {similarity_cat_dog:.4f}") # The .4f formats the number nicely
print(f"Similarity between 'cat' and 'car': {similarity_cat_car:.4f}")

# As expected, the similarity score for cat/dog is much higher (closer to 1)!

Similarity between 'cat' and 'dog': 0.9926
Similarity between 'cat' and 'car': 0.4880
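The three reference cases described earlier (same direction, perpendicular, opposite) can be checked directly. This sketch redefines `cosine_similarity` so it runs on its own, and uses simple made-up 2-D vectors. Notice that the vectors have different lengths, yet the scores depend only on direction.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

base = np.array([1.0, 0.0])
same_direction = np.array([5.0, 0.0])   # same direction, 5 times longer
perpendicular = np.array([0.0, 2.0])    # 90 degrees away
opposite = np.array([-3.0, 0.0])        # points the other way

sim_same = cosine_similarity(base, same_direction)
sim_perpendicular = cosine_similarity(base, perpendicular)
sim_opposite = cosine_similarity(base, opposite)

print(f"Same direction: {sim_same:.1f}")
print(f"Perpendicular:  {sim_perpendicular:.1f}")
print(f"Opposite:       {sim_opposite:.1f}")
```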


### 🧠 Practice Task 3

You have three new word vectors:
- `fruit` = `[3, 4, 0]`
- `apple` = `[4, 5, 1]`
- `book` = `[-2, 1, 5]`

Which pair do you think will be more similar: (`fruit`, `apple`) or (`fruit`, `book`)?

Use the code cell below to define these vectors and calculate their similarities to check your hypothesis.

In [8]:
# Your code here!
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

# Define the vectors
fruit_vec = np.array([3, 4, 0])
apple_vec = np.array([4, 5, 1])
book_vec = np.array([-2, 1, 5])

# Calculate similarity for the first pair
sim_fruit_apple = cosine_similarity(fruit_vec, apple_vec)

# Calculate similarity for the second pair
sim_fruit_book = cosine_similarity(fruit_vec, book_vec)

print(f"Similarity between 'fruit' and 'apple': {sim_fruit_apple:.4f}")
print(f"Similarity between 'fruit' and 'book': {sim_fruit_book:.4f}")

Similarity between 'fruit' and 'apple': 0.9875
Similarity between 'fruit' and 'book': -0.0730


## Final Revision Assignment

Congratulations on making it this far! It's time to put everything you've learned together. These tasks are designed for you to practice at home and solidify your understanding.

--- 
**Task 1 (MCQ):** Which statement best describes the primary objective of the Continuous Bag-of-Words (CBOW) model?

A) To predict the context words given a target word.
B) To predict a target word from a bag of its context words.
C) To count the co-occurrence of words in a corpus.
D) To represent words as character n-grams.

*Hint: Think "fill-in-the-blanks".*

---
**Task 2 (MCQ):** In the context of Word2vec, what is the main advantage of the Skip-gram architecture over CBOW?

A) It is computationally faster to train.
B) It handles rare words more effectively.
C) It performs better for representing frequent words.
D) It uses a global co-occurrence matrix.

*Hint: Which model pays more attention to each specific word?*

---
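**Task 3 (Short Question):** In your own words, briefly explain why an optimization like **Negative Sampling** (updating the model on the true context word plus a handful of randomly chosen "negative" words, rather than on every word in the vocabulary) is needed when training Word2vec. Why can't we just score every word in a large vocabulary at each training step?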

*Hint: Think about how many words are in a real-world dictionary.*

---
**Task 4 (Problem-Solving):** The vector equation `vector('King') - vector('Man') + vector('Woman') ≈ vector('Queen')` is a powerful demonstration of what Word2vec learns. What kind of relationship is being captured here? Can you think of another example, like `Paris - France + Germany`? What would you expect the result to be?

---
**Task 5 (Case Study):** A startup wants to build a recommendation engine for news articles. They need to represent articles in a way that allows them to find similar articles. How could they use word embeddings for this task? Would you recommend CBOW or Skip-gram, and why?

*Hint: Think about the goal. Do they need speed for massive amounts of text, or do they care more about capturing the meaning of specific, important (and possibly rare) keywords in the articles?*

---
**Task 6 (Code Challenge):** Look back at our `cosine_similarity` function. Imagine you have real word vectors for `python` (a programming language), `java` (another language), and `snake` (an animal). Which pair do you think will have the highest similarity? Which will have the lowest? Write some code to prove it!

```python
import numpy as np

# Pre-defined, simplified vectors for this exercise
python_vec = np.array([0.8, 0.2, -0.5])
java_vec = np.array([0.7, 0.1, -0.4])
snake_vec = np.array([-0.1, 0.9, 0.3])

# Your code here to calculate and print the similarities for:
# 1. (python, java)
# 2. (python, snake)
# 3. (java, snake)
```

## 🎯 Summary & Further Learning

You've done an amazing job today! Let's quickly recap the key takeaways.

| Concept             | Description                                                                                             |
|---------------------|---------------------------------------------------------------------------------------------------------|
| **Word Embedding**  | A dense vector representation of a word that captures its semantic and syntactic meaning.               |
| **Word2vec**        | A predictive model that uses a shallow neural network to learn word embeddings from a large text corpus.    |
| **CBOW**            | Predicts the target word from its context words. Fast and efficient for frequent words.                   |
| **Skip-gram**       | Predicts context words from a target word. Slower but better for rare words.                            |
| **Core Idea**       | Words appearing in similar contexts will have similar vector representations (Distributional Hypothesis). |
| **Cosine Similarity**| A metric to measure the similarity (angle) between two word vectors.                                    |

### 🔗 Related Study Resources

To continue your learning journey, check out these fantastic resources:

- **[The Illustrated Word2vec by Jay Alammar](https://jalammar.github.io/illustrated-word2vec/)**: An excellent and intuitive visual explanation of Word2vec's mechanics.
- **[TensorFlow Word Embeddings Tutorial](https://www.tensorflow.org/text/guide/word_embeddings)**: A practical guide to creating and visualizing word embeddings.
- **[Stanford's CS224N: NLP with Deep Learning](http://web.stanford.edu/class/cs224n/)**: A comprehensive university course covering word vectors in depth.
- **[Gensim Library Documentation](https://radimrehurek.com/gensim/models/word2vec.html)**: A popular Python library for implementing Word2vec.