# 📘 Introduction to Word Embeddings for AI Beginners

### Welcome to Your 2-Hour Journey into Word Embeddings!

Over the next two hours, we will explore one of the most important concepts in modern Artificial Intelligence and Natural Language Processing (NLP). Get ready to learn how we teach computers to understand human language!

**🎯 Learning Objectives:**
1.  Understand what word embeddings are and why they are essential for AI.
2.  Learn about different types of embeddings, from classic to state-of-the-art.
3.  See how vector math can capture word meanings (e.g., king - man + woman ≈ queen).
4.  Write basic Python code using the Keras `Embedding` layer.
5.  Know when to use pre-trained embeddings versus training your own.

## Topic 1: What Are Word Embeddings?

Word embeddings are numerical representations of words. Think of them as coordinates for each word in a multi-dimensional 'meaning space'. 

This technique allows words with similar meanings to have similar vector representations (similar coordinates). This is how we get computers to understand that 'cat' is more similar to 'dog' than it is to 'car'.
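
To make "similar coordinates" concrete, here is a minimal sketch using cosine similarity, a common way to compare two vectors. The three-dimensional vectors are invented for illustration (real embeddings are learned from text and are much longer), and the helper name `cosine_similarity` is our own.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between a and b (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made 3-dimensional vectors, invented purely for illustration.
# Real embeddings are learned from text and usually have far more dimensions.
toy_vectors = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

With these made-up numbers, 'cat' scores much closer to 'dog' than to 'car', which is the behavior we hope learned embeddings will show.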

The core idea comes from the **distributional hypothesis**, summed up by the linguist J.R. Firth: *"You shall know a word by the company it keeps."* This means words that appear in similar contexts (surrounded by similar words) are likely to have similar meanings.

## Topic 2: Why Not Just Use Numbers? (Old Methods)

Before we had smart embeddings, we had simpler methods. One of the most basic is **One-Hot Encoding**.

Imagine our entire vocabulary is just five words: `['the', 'cat', 'sat', 'on', 'mat']`.

To represent the word 'cat', we create a vector that is all zeros, except for a '1' at the position for 'cat'.

- `the`: `[1, 0, 0, 0, 0]`
- `cat`: `[0, 1, 0, 0, 0]`
- `sat`: `[0, 0, 1, 0, 0]`

This seems simple, but it has big problems:

👎 **High Dimensionality**: If you have a vocabulary of 50,000 words, each vector has 50,000 dimensions! Storing and computing with vectors that large is expensive.

👎 **Sparsity**: Each vector holds a single 1 and is otherwise all zeros, so almost all of that space carries no information.

👎 **No Semantic Relationship**: Every pair of different one-hot vectors is equally far apart, so the vectors for 'cat' and 'dog' are no more similar than the vectors for 'cat' and 'car'. The model can't see any relationship between words.
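
The problems above can be seen in a few lines of NumPy. This is a small sketch using the five-word vocabulary from this section; the helper name `one_hot` is our own, not a library function.

```python
import numpy as np

vocabulary = ["the", "cat", "sat", "on", "mat"]

# Map each word to its position in the vocabulary.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's position."""
    vector = np.zeros(len(vocabulary), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

print("cat:", one_hot("cat"))
print("mat:", one_hot("mat"))

# Any two different words give a dot product of 0, so no pair looks
# more related than any other pair.
print("cat . sat =", np.dot(one_hot("cat"), one_hot("sat")))
print("cat . mat =", np.dot(one_hot("cat"), one_hot("mat")))
```

Because every dot product between different words is zero, a model looking only at these vectors has no way to tell that some words are closer in meaning than others.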

### 🧠 Practice Task

Imagine our vocabulary is `['apple', 'banana', 'fruit', 'car']`.

What would the one-hot encoded vector for the word **'banana'** look like? Write it down.

## Topic 3: Static Embeddings (Word2Vec, GloVe, FastText)

To solve the problems of one-hot encoding, researchers created **static word embeddings**. These models learn a single, fixed vector for each word from a huge amount of text.

#### Word2Vec (Google, 2013)
A groundbreaking model that learns word associations. It has two main 'flavors':
- **CBOW (Continuous Bag-of-Words):** Predicts a target word from its context (e.g., given `['The', 'cat', 'on', 'the', 'mat']`, predict `sat`). It trains quickly and works well for common words.
- **Skip-Gram:** Predicts the context words from a target word (e.g., given `sat`, predict `['The', 'cat', 'on', 'the', 'mat']`). It trains more slowly but tends to represent rare words better.
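
The difference between the two flavors is easiest to see in the training examples each one generates. The sketch below only builds those examples; it does not train a model. The window size of 2 (two words on each side) is an assumption made for this illustration, and `context_windows` is a hypothetical helper name.

```python
def context_windows(tokens, window_size=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        context = tokens[start:i] + tokens[i + 1:i + 1 + window_size]
        yield target, context

sentence = ["the", "cat", "sat", "on", "the", "mat"]

# CBOW: the context words are the input, the target word is the label.
cbow_examples = [(context, target) for target, context in context_windows(sentence)]

# Skip-Gram: the target word is the input, each context word is a label.
skipgram_examples = [
    (target, context_word)
    for target, context in context_windows(sentence)
    for context_word in context
]

print("CBOW example for 'sat':", cbow_examples[2])
print("Skip-Gram examples for 'sat':", [p for p in skipgram_examples if p[0] == "sat"])
```

Notice that Skip-Gram produces several small examples per word where CBOW produces one, which is one intuition for why it is slower to train.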

#### GloVe (Stanford, 2014)
GloVe stands for **Global Vectors**. It learns by looking at how often words appear together across an entire corpus, giving it a more 'global' understanding of language.
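
As a rough illustration of "how often words appear together", the sketch below counts co-occurrences in a tiny made-up corpus. It covers only the counting step: GloVe itself goes on to fit word vectors to these statistics, which is not shown here. The window size of 2 and the function name `cooccurrence_counts` are assumptions for this example.

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

def cooccurrence_counts(sentences, window_size=2):
    """Count how often each ordered (word, neighbor) pair occurs within the window."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window_size)
            neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window_size]
            for neighbor in neighbors:
                counts[(word, neighbor)] += 1
    return counts

counts = cooccurrence_counts(corpus)
print("sat/on :", counts[("sat", "on")])
print("cat/sat:", counts[("cat", "sat")])
print("cat/rug:", counts[("cat", "rug")])
```

Pairs that show up together across many sentences get high counts, and pairs that never meet get zero. Those corpus-wide totals are the 'global' signal the name refers to.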

#### FastText (Facebook, 2016)
FastText's superpower is handling **out-of-vocabulary (OOV) words**. It breaks each word into overlapping character sequences (character n-grams) and learns a vector for each one. So, even if it has never seen the word `embedding`, it has probably seen pieces such as `embed` and `ing` inside other words, and it can combine their vectors into a reasonable vector for the new word. This is great for languages with many word forms.
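
Here is a minimal sketch of how a word can be split into character n-grams. The `<` and `>` boundary markers follow the convention described for FastText; the length range of 3 to 6 is an assumption for this example, and `char_ngrams` is our own helper, not part of any library.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the word, with < and > as boundary markers."""
    marked = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            ngrams.append(marked[start:start + n])
    return ngrams

ngrams = char_ngrams("embedding")
print(len(ngrams), "n-grams, for example:", ngrams[:5])
print("'embed' is one of them:", "embed" in ngrams)
print("'ing>' is one of them:", "ing>" in ngrams)
```

A model that has learned vectors for pieces like these can build a vector for an unseen word by combining the vectors of its n-grams.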

### 💡 The Magic of Semantic Arithmetic

One of the coolest discoveries about word embeddings is that they capture relationships, which you can explore with simple math!

The most famous example is:

`vector('king') - vector('man') + vector('woman') ≈ vector('queen')`

This suggests that the vectors encode something like gender and royalty as directions in the space: the offset from 'man' to 'king' is roughly the same as the offset from 'woman' to 'queen'. In practice the match is approximate: 'queen' is the nearest word to the result, not an exact hit.
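
The arithmetic can be tried out on a toy example. The two-dimensional vectors below are hand-made so that one axis loosely means "royalty" and the other "gender"; learned embeddings do not come with neatly labeled axes like this. The input words are excluded from the search, a common convention when evaluating analogies.

```python
import numpy as np

# Hand-made 2-dimensional vectors: [royalty, gender]. Invented for illustration.
toy_vectors = {
    "king":   np.array([0.9,  0.8]),
    "queen":  np.array([0.9, -0.8]),
    "man":    np.array([0.1,  0.8]),
    "woman":  np.array([0.1, -0.8]),
    "prince": np.array([0.7,  0.7]),
    "apple":  np.array([-0.5, 0.0]),
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_word(vector, exclude):
    """Return the word whose vector is most similar to `vector`, skipping `exclude`."""
    candidates = {w: v for w, v in toy_vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))

result = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
print("king - man + woman =", result)
print("Nearest word:", nearest_word(result, exclude={"king", "man", "woman"}))
```

With real embeddings the result rarely lands exactly on a word's vector, which is why the comparison is done by nearest neighbor.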

### 🧠 Practice Task

Using the same logic, what do you think the result of the following vector math would be?

`vector('Paris') - vector('France') + vector('Germany') ≈ ?`

Think about the relationship between the first two words and apply it to the third.

## Topic 4: Context is King! (Contextual Embeddings)

Static embeddings have a major weakness: each word gets exactly one vector, no matter how many meanings it has. But what about the word **'bank'**?

- I went to the **bank** to deposit money.
- We had a picnic on the river **bank**.

The meaning is completely different! **Contextual embeddings** solve this by generating a different vector for a word each time it appears, based on the surrounding sentence.

- **ELMo (2018):** One of the first models to do this effectively.
- **BERT (2018):** A revolutionary model from Google that reads the *entire sentence at once* (it's bidirectional) to create incredibly rich, context-aware embeddings.
- **GPT (2018 onward):** A family of models from OpenAI, including the models behind ChatGPT. GPT reads text left to right and uses the preceding context to generate human-like text.
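
To see the difference between "one vector per word" and "one vector per occurrence", here is a deliberately simplified toy. It is **not** how ELMo, BERT, or GPT work (those use deep neural networks); it just blends a word's vector with the average of its neighbors so that the output depends on the sentence. All vectors and function names are invented for this illustration.

```python
import numpy as np

# Made-up 2-dimensional static vectors: [finance, nature].
static_vectors = {
    "bank":    np.array([0.5, 0.5]),   # one vector, blending both senses
    "deposit": np.array([0.9, 0.0]),
    "money":   np.array([1.0, 0.1]),
    "picnic":  np.array([0.1, 0.8]),
    "river":   np.array([0.0, 1.0]),
}

def static_embedding(word, sentence):
    """A static model ignores the sentence entirely."""
    return static_vectors[word]

def toy_contextual_embedding(word, sentence):
    """Blend the word's vector with the mean of the other known words in the sentence."""
    neighbors = [static_vectors[w] for w in sentence if w != word and w in static_vectors]
    if not neighbors:
        return static_vectors[word]
    return 0.5 * static_vectors[word] + 0.5 * np.mean(neighbors, axis=0)

money_sentence = "i went to the bank to deposit money".split()
river_sentence = "we had a picnic on the river bank".split()

print("static, money sentence:    ", static_embedding("bank", money_sentence))
print("static, river sentence:    ", static_embedding("bank", river_sentence))
print("contextual, money sentence:", toy_contextual_embedding("bank", money_sentence))
print("contextual, river sentence:", toy_contextual_embedding("bank", river_sentence))
```

The static lookup returns the same vector both times, while the toy contextual version leans toward "finance" in one sentence and "nature" in the other.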

### 🧠 Practice Task

Write down two sentences where the word **'right'** has completely different meanings. This shows why context is so important!

## Topic 5: Using Embeddings in Code (Keras)

Let's see how we can use embeddings in a real deep learning model using the Keras library.

The `Embedding` layer in Keras works like a lookup table: it maps integer indices (each representing a word) to dense vectors. We can either train these vectors from scratch on our own data or load pre-trained ones.
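
The mapping from integer indices to dense vectors can be pictured as selecting rows from a matrix. The NumPy sketch below imitates that idea outside of Keras; the sizes are tiny, the values are random, and nothing here is trained.

```python
import numpy as np

vocab_size = 6      # number of distinct word indices
embedding_dim = 4   # length of each word vector

# A randomly initialized table with one row per word index.
# During training, a framework would adjust these values.
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 integer-encoded sentences, each 3 tokens long.
batch = np.array([
    [1, 2, 3],
    [4, 2, 0],
])

# The "embedding lookup" is just row selection by index.
embedded = embedding_table[batch]

print("Input shape: ", batch.shape)
print("Output shape:", embedded.shape)
```

Every integer is replaced by its row of the table, so a batch of shape `(2, 3)` becomes `(2, 3, 4)`, and the same index always yields the same vector.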

### Example: Training an Embedding Layer from Scratch

This is useful when you have a lot of specific data for your task (e.g., medical records, legal documents).
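
The model below expects each sentence as a fixed-length list of integers. As a rough sketch of that preparation step in plain Python: the helper names are our own, and choices such as padding at the end and skipping unknown words are assumptions made for this example.

```python
def build_word_index(sentences):
    """Assign each distinct word an integer, starting at 1 (0 is kept for padding)."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def encode_and_pad(sentence, word_index, max_length, pad_value=0):
    """Convert a sentence to a fixed-length list of integers."""
    # Unknown words are simply skipped in this sketch.
    ids = [word_index[w] for w in sentence.lower().split() if w in word_index]
    ids = ids[:max_length]                               # truncate long sentences
    return ids + [pad_value] * (max_length - len(ids))   # pad short sentences at the end

sentences = ["The cat sat on the mat", "The dog barked"]
word_index = build_word_index(sentences)
encoded = [encode_and_pad(s, word_index, max_length=5) for s in sentences]

print(word_index)
print(encoded)
```

Here the first sentence is truncated to `[1, 2, 3, 4, 1]` and the second is padded to `[1, 6, 7, 0, 0]`, so both have the same length.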

In [None]:
# We need to import the necessary tools from TensorFlow and Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense
import numpy as np

# --- Model Parameters ---
# Imagine we have 10,000 unique words in our vocabulary
vocab_size = 10000

# We will represent each word with a 128-dimensional vector
embedding_dim = 128

# Each input sentence will be padded/truncated to 50 words
max_length = 50

# --- Building the Model ---
model = Sequential()

# 0. The Input
# Declaring the input shape up front lets model.summary() report output shapes
# and parameter counts. (The older `input_length` argument of Embedding is
# deprecated in recent Keras versions.)
model.add(Input(shape=(max_length,)))

# 1. The Embedding Layer
# This layer will learn a 'lookup table' of size (10000 x 128)
# It takes integer-encoded sentences of length 50 as input
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))

# 2. The Flatten Layer
# This layer converts the 3D tensor from the embedding layer into a 2D tensor
# Shape changes from (None, 50, 128) to (None, 50 * 128)
model.add(Flatten())

# 3. The Output Layer
# A standard Dense layer for a binary classification task (e.g., positive/negative sentiment)
model.add(Dense(1, activation='sigmoid'))

# Compile the model and print its summary
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

### 🧪 Practice Task

In the code cell above, change the `output_dim` of the `Embedding` layer from `128` to `64` and re-run the cell.

Look at the model summary. How does the number of parameters in the embedding layer change? Why?

## Topic 6: Using Pre-trained GloVe Embeddings

Training embeddings from scratch requires a lot of data. A more common approach is **Transfer Learning**: using embeddings that have already been trained on a massive dataset (like all of Wikipedia!).

Here's how you would conceptually load pre-trained GloVe embeddings and 'freeze' them in your model, so they don't change during training.

*(Note: The following code is for demonstration. You would need to download the GloVe file to run it fully.)*
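
For reference, the downloadable GloVe files are plain text, with one word per line followed by its vector values separated by spaces. The sketch below parses that format from an in-memory stand-in so it runs without a download; the sample values are made up, and `parse_glove` is a hypothetical helper name.

```python
import io
import numpy as np

def parse_glove(lines):
    """Build a {word: vector} dictionary from lines of 'word v1 v2 ... vN'."""
    embeddings_index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        embeddings_index[word] = np.asarray(values, dtype="float32")
    return embeddings_index

# Stand-in for a downloaded file: three words with made-up 4-dimensional vectors.
fake_glove_file = io.StringIO(
    "the 0.1 0.2 0.3 0.4\n"
    "cat 0.5 0.6 0.7 0.8\n"
    "dog 0.9 1.0 1.1 1.2\n"
)

embeddings_index = parse_glove(fake_glove_file)
print(f"Loaded {len(embeddings_index)} word vectors.")
print("cat ->", embeddings_index["cat"])
```

With a real download, you would pass an opened text file to the same function instead of the `io.StringIO` object.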

In [None]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
import tensorflow as tf # Often needed for initializers

# --- Step 1: Pretend we loaded a GloVe file ---
# In a real scenario, you'd parse the GloVe file here.
# Let's create a fake dictionary for demonstration.
embeddings_index = {
    'the': np.random.rand(100), 
    'a': np.random.rand(100),
    'cat': np.random.rand(100),
    'dog': np.random.rand(100)
}
print(f"Loaded {len(embeddings_index)} pre-trained word vectors.")

# --- Step 2: Create an embedding matrix for our vocabulary ---
# Let's say our project's vocabulary is this:
word_index = {'the': 1, 'cat': 2, 'dog': 3, 'house': 4} # Note: 'house' is not in GloVe
num_tokens = len(word_index) + 1 # +1 for the padding token
embedding_dim = 100 # Must match the GloVe file dimension

# Create a matrix of zeros
embedding_matrix = np.zeros((num_tokens, embedding_dim))

# Fill the matrix with the GloVe vectors
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words found in the embedding index will be non-zero.
        embedding_matrix[i] = embedding_vector

print("\nCreated our project's embedding matrix.")
print("Shape of matrix:", embedding_matrix.shape)

# --- Step 3: Create the Keras Embedding layer ---
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # IMPORTANT: Freeze the pre-trained weights
)

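# --- Step 4: Build your model ---
# Declare the input shape up front so that model.summary() can report
# output shapes and parameter counts.
max_length = 50  # Each input sentence is padded/truncated to 50 word indices
model = Sequential([
    tf.keras.Input(shape=(max_length,)),
    embedding_layer,
    Flatten(),
    Dense(1, activation='sigmoid')
])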
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
print("\nFinal Model Summary:")
model.summary()

✅ **Well done!** You've now seen the two main ways to use embeddings in a neural network.

## 🏆 Final Revision Assignment

Time to test your knowledge! Try to answer these questions to solidify what you've learned. This is great for home practice.

#### Question 1 (MCQ)

What is the primary advantage of FastText over Word2Vec and GloVe?

A) It is faster to train.

B) It can handle out-of-vocabulary words.

C) It produces lower-dimensional vectors.

D) It uses a Transformer architecture.

#### Question 2 (MCQ)

Which of the following models generates contextualized word embeddings?

A) Word2Vec

B) GloVe

C) BERT

D) Skip-Gram

#### Question 3 (Short Answer)

Explain the difference between the CBOW and Skip-Gram architectures in Word2Vec. Which one is generally better for representing rare words?

#### Question 4 (Short Answer)

Why is one-hot encoding not ideal for representing words in NLP tasks? Mention at least two reasons.

#### Question 5 (Problem-Solving)

You are given the following word vectors: `v_apple`, `v_fruit`, `v_car`, `v_vehicle`. How would you expect the **cosine similarity** (a measure of how similar vectors are) to compare between the pair `(v_apple, v_fruit)` and the pair `(v_apple, v_car)`? Justify your answer.

#### Question 6 (Case Study)

A company wants to build a sentiment analysis model for customer reviews of their new electronic product. They have a relatively small dataset of 5,000 reviews. Would you recommend they train their own word embeddings from scratch or use pre-trained embeddings like GloVe? Explain the trade-offs of each approach in this scenario.

### 🎉 Congratulations!

You have completed the introduction to word embeddings. This is a foundational skill in NLP and will help you understand many advanced AI models. Keep experimenting!