# 📘 Introduction to Gensim and Custom Word Embeddings

Welcome to this 2-hour interactive session on Natural Language Processing (NLP)! Today, we'll dive into the fascinating world of word embeddings using one of the most popular Python libraries: **Gensim**.

### 🎯 Learning Objectives

By the end of this session, you will be able to:
1.  Understand what Gensim and Word Embeddings are.
2.  Recognize why custom-trained embeddings are powerful.
3.  Prepare a text corpus for model training.
4.  Train your own `Word2Vec` model from scratch.
5.  Train your own `FastText` model and understand its advantages.
6.  Explore the results to find word similarities and solve analogies.

### What is Gensim?

**Gensim** (which stands for "Generate Similar") is a fantastic open-source Python library for NLP. It's famous for being highly efficient, especially with large text collections, because it can process text in a streaming fashion (meaning it doesn't need to load everything into memory at once!).
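
To make the streaming idea concrete, here is a minimal sketch of a corpus that is read from disk one line at a time instead of being held in a list. The `LineCorpus` class and the temporary file are hypothetical names made up for this illustration; Gensim ships its own helper for the same job, `gensim.models.word2vec.LineSentence`. The sketch assumes one sentence per line and that the object can be iterated more than once, since training makes several passes over the data.

```python
import os
import tempfile


class LineCorpus:
    """Hypothetical helper: yields one list of lowercase words per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Write a tiny two-line file so the sketch can run on its own
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("The king is a strong ruler\nThe queen is a wise leader\n")
    corpus_path = tmp.name

streamed = LineCorpus(corpus_path)
first_pass = list(streamed)
second_pass = list(streamed)  # works because __iter__ re-opens the file
os.remove(corpus_path)

print(first_pass)
```

Only one line is in memory at a time while iterating, which is why this pattern scales to files far larger than RAM.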

### What are Word Embeddings?

Imagine you could represent words as numbers in a way that captures their meaning and relationships. That's exactly what **word embeddings** do! They are dense vector representations of words in a multi-dimensional space. The core idea is simple: **words that appear in similar contexts tend to have similar meanings.** For example, the vectors for "king" and "queen" would be much closer to each other than the vectors for "king" and "apple".

## Topic 1: Corpus Preparation

The quality of our word embeddings depends heavily on the quality of our input text, or **corpus**. Gensim models expect the corpus as a sequence of tokenized sentences. The simplest form is a **list of lists of strings**, where each inner list holds the tokens of one sentence.

In [1]:
# This is the format Gensim expects:
# A list, where each item is another list containing the words of a sentence.

corpus = [
    ['this', 'is', 'the', 'first', 'sentence'],
    ['this', 'document', 'is', 'the', 'second', 'sentence'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document']
]

# Let's print it to see!
print(corpus)

[['this', 'is', 'the', 'first', 'sentence'], ['this', 'document', 'is', 'the', 'second', 'sentence'], ['and', 'this', 'is', 'the', 'third', 'one'], ['is', 'this', 'the', 'first', 'document']]


To get our text into this format, we usually perform a few preprocessing steps:
1.  **Tokenization**: Splitting sentences into individual words (tokens).
2.  **Lowercasing**: Converting all text to lowercase (e.g., treating "Apple" and "apple" as the same).
3.  **Removing Punctuation/Numbers**: Getting rid of characters that usually carry little semantic meaning for the task.
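
The three steps above can be sketched with Python's built-in `re` module. This is one simple way to do it, not the only one: the `preprocess` function name and the sample sentence are made up for illustration, and the pattern assumes plain English text with no accented letters.

```python
import re


def preprocess(text):
    """Lowercase the text, then keep only runs of the letters a-z as tokens."""
    # Lowercasing first means [a-z]+ also matches words that were capitalized.
    # Digits and punctuation never match, so they are dropped.
    return re.findall(r"[a-z]+", text.lower())


raw = "In 2024, Apple released 3 new products!"
tokens = preprocess(raw)
print(tokens)
```

A pattern this strict also splits contractions such as "don't" into two tokens, so real projects often use a more forgiving tokenizer.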

Let's prepare a small sample corpus for our models.

In [2]:
# Our sample corpus for today's session
# Notice it's already tokenized and lowercased for us!
sentences = [
    ['the', 'king', 'is', 'a', 'strong', 'ruler'],
    ['the', 'queen', 'is', 'a', 'wise', 'leader'],
    ['the', 'prince', 'is', 'a', 'young', 'man'],
    ['the', 'princess', 'is', 'a', 'young', 'woman'],
    ['a', 'man', 'can', 'be', 'a', 'king'],
    ['a', 'woman', 'can', 'be', 'a', 'queen'],
    ['royalty', 'includes', 'the', 'king', 'and', 'queen']
]

print("Corpus is ready!")

Corpus is ready!


### 🎯 Practice Task 1: Prepare Your Own Corpus

You are given a list of raw sentences. Your task is to turn them into the `list of lists` format that Gensim needs. You'll need to lowercase each sentence and then split it into words.

In [None]:
raw_sentences = [
    "The weather is sunny and bright.",
    "I love a bright sunny day."
]

processed_corpus = []
# Your code here! Loop through raw_sentences.
# For each sentence, convert it to lowercase and split it into words.
# Then, append the list of words to processed_corpus.

# for sentence in raw_sentences:
#     lower_sentence = ...
#     words = ...
#     processed_corpus.append(words)

# print(processed_corpus)
# Expected output: [['the', 'weather', 'is', 'sunny', 'and', 'bright.'], ['i', 'love', 'a', 'bright', 'sunny', 'day.']]
# Note: 'bright.' and 'day.' keep their periods because split() only breaks on whitespace.

## Topic 2: Training a Word2Vec Model

`Word2Vec` is a famous model developed at Google that learns word embeddings. It has two main architectures:

-   **CBOW (Continuous Bag of Words)**: Predicts a target word from its context words. It's fast and works well for frequent words.
-   **Skip-Gram**: Predicts the context words from a target word. It's slower but great for rare words and often produces higher-quality embeddings.
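
The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The sketch below is a simplified illustration, not Gensim's implementation: the function names `skipgram_pairs` and `cbow_examples` are made up here, and real training also applies tricks such as random window shrinking and negative sampling.

```python
def context_indices(position, length, window):
    """Indices of the neighbours within `window` words of `position`."""
    start = max(0, position - window)
    stop = min(length, position + window + 1)
    return [j for j in range(start, stop) if j != position]


def skipgram_pairs(tokens, window):
    """Skip-Gram view: each target word predicts every one of its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in context_indices(i, len(tokens), window):
            pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """CBOW view: all neighbours together predict the target word."""
    examples = []
    for i, target in enumerate(tokens):
        context = [tokens[j] for j in context_indices(i, len(tokens), window)]
        examples.append((context, target))
    return examples


sentence = ['the', 'king', 'is', 'wise']
sg_pairs = skipgram_pairs(sentence, window=1)
cbow = cbow_examples(sentence, window=1)

print("Skip-Gram:", sg_pairs)
print("CBOW:     ", cbow)
```

Skip-Gram produces one training pair per neighbour, while CBOW produces one example per word. That gives rare words more individual updates under Skip-Gram, at the cost of more work per sentence.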

Let's train our first `Word2Vec` model using the **Skip-Gram** architecture (`sg=1`).

In [3]:
import gensim
import logging

# This helps us see the training progress in the notebook output
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

print("--- Training Word2Vec Model ---")

# Let's define the model with its key parameters
word2vec_model = gensim.models.Word2Vec(
    sentences=sentences,  # Our prepared corpus
    vector_size=100,      # Dimensionality of the word vectors (how many numbers per word)
    window=5,             # Max distance between current and predicted word within a sentence
    min_count=1,          # Ignores all words with total frequency lower than this
    sg=1,                 # Training algorithm: 1 for Skip-Gram; 0 for CBOW
    workers=4             # Use 4 CPU threads to speed up training
)

print("\n✅ Word2Vec model training complete!")

2025-11-13 11:32:42,794 : INFO : collecting all words and their counts
2025-11-13 11:32:42,796 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-11-13 11:32:42,797 : INFO : collected 19 word types from a corpus of 42 raw words and 7 sentences
2025-11-13 11:32:42,799 : INFO : Creating a fresh vocabulary
2025-11-13 11:32:42,801 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 19 unique words (100.00% of original 19, drops 0)', 'datetime': '2025-11-13T11:32:42.801188', 'gensim': '4.3.3', 'python': '3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-11-10.0.26100-SP0', 'event': 'prepare_vocab'}
2025-11-13 11:32:42,802 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 42 word corpus (100.00% of original 42, drops 0)', 'datetime': '2025-11-13T11:32:42.802197', 'gensim': '4.3.3', 'python': '3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:

--- Training Word2Vec Model ---

✅ Word2Vec model training complete!


💡 **A Note on Parameters**
-   `vector_size`: A common range is 100-300. More dimensions can capture more info but need more data.
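-   `window`: A larger window lets each word learn from neighbours that are further away, which tends to capture broader, topic-level context.
-   `min_count`: Words that occur fewer than `min_count` times are dropped, which helps filter out rare words or typos. We use `1` here only because our corpus is tiny.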
-   `sg=1`: We chose Skip-Gram. Try changing it to `sg=0` later to see how it affects the results!
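
To get a feel for what `min_count` does, the sketch below counts word frequencies in our sample corpus and shows which words would survive different thresholds. It only mimics the filtering rule described above (keep a word if its count is at least `min_count`); the helper name `surviving_vocab` is made up for illustration.

```python
from collections import Counter

sentences = [
    ['the', 'king', 'is', 'a', 'strong', 'ruler'],
    ['the', 'queen', 'is', 'a', 'wise', 'leader'],
    ['the', 'prince', 'is', 'a', 'young', 'man'],
    ['the', 'princess', 'is', 'a', 'young', 'woman'],
    ['a', 'man', 'can', 'be', 'a', 'king'],
    ['a', 'woman', 'can', 'be', 'a', 'queen'],
    ['royalty', 'includes', 'the', 'king', 'and', 'queen'],
]

# Count how often each word appears across the whole corpus
counts = Counter(word for sentence in sentences for word in sentence)


def surviving_vocab(word_counts, min_count):
    """Words whose total frequency is at least min_count."""
    return {word for word, freq in word_counts.items() if freq >= min_count}


for threshold in (1, 2, 3):
    kept = surviving_vocab(counts, threshold)
    print(f"min_count={threshold}: {len(kept)} words kept -> {sorted(kept)}")
```

With a threshold of 1 all 19 word types are kept, which matches the `retains 19 unique words` line in the training log above.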

## Topic 3: Exploring the Word2Vec Model

Now for the fun part! Let's see what our model has learned. We can access all the word vectors and query them through the `model.wv` object (which stands for 'word vectors').

In [4]:
# Get the KeyedVectors instance from our trained model
wv = word2vec_model.wv

# Let's find the words most similar to 'king'
# topn=3 means we want the top 3 results
print("Most similar to 'king':", wv.most_similar('king', topn=3))

Most similar to 'king': [('wise', 0.2528882920742035), ('is', 0.17026387155056), ('and', 0.15015792846679688)]


In [5]:
# We can also check the cosine similarity between two words.
# A value closer to 1.0 means they are very similar.
similarity_score = wv.similarity('king', 'queen')
print(f"Similarity between 'king' and 'queen': {similarity_score:.4f}")

Similarity between 'king' and 'queen': -0.0445
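
Cosine similarity measures the angle between two vectors and ignores their lengths. Here is a minimal pure-Python sketch of the formula, using short hand-made vectors rather than real embeddings; the function name `cosine_similarity` is our own.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made 2-D vectors, chosen only to show the range of possible scores
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to  1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # exactly   0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1

print(round(same_direction, 4), round(orthogonal, 4), round(opposite, 4))
```

Read this way, the score of -0.0445 above says the two vectors point in almost unrelated directions, which is not surprising for a model trained on only seven sentences.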


In [6]:
# Now for the classic word analogy: king - man + woman = ?
# A model trained on a large corpus typically predicts 'queen'.
# Our corpus has only 7 sentences, so the result here may be noisy.
# `positive` words are added, `negative` words are subtracted.
analogy_result = wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

print("Analogy 'king - man + woman':", analogy_result)

Analogy 'king - man + woman': [('princess', 0.22007820010185242)]
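
The arithmetic behind an analogy query can be shown with a toy example. The 2-D vectors below are invented by hand so that the first axis loosely means "royalty" and the second "gender"; they are not output from our model. The `analogy` function is a simplified sketch: Gensim's `most_similar` additionally normalizes the vectors to unit length before combining them.

```python
import math

# Hypothetical hand-picked vectors: axis 0 ~ royalty, axis 1 ~ gender
toy_vectors = {
    'king':  [1.0,  1.0],
    'queen': [1.0, -1.0],
    'man':   [0.1,  1.0],
    'woman': [0.1, -1.0],
    'apple': [-0.5, 0.2],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the nearest word."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # Exclude the input words themselves, as most_similar does
    candidates = {
        word: cosine(query, vec)
        for word, vec in vectors.items()
        if word not in positive and word not in negative
    }
    return max(candidates, key=candidates.get)


best = analogy(positive=['woman', 'king'], negative=['man'], vectors=toy_vectors)
print("king - man + woman ->", best)
```

With clean vectors like these the query lands on `queen`. Our trained model returned `princess` because its vectors are still close to their random starting values.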


### 🎯 Practice Task 2: Explore Relationships

Now it's your turn! Use the `wv` object to:
1.  Find the top 3 most similar words to `'woman'`.
2.  Calculate the similarity between `'prince'` and `'princess'`.

In [None]:
# 1. Find the top 3 most similar words to 'woman'
# print("Most similar to 'woman':", ...)

# 2. Calculate the similarity between 'prince' and 'princess'
# print("Similarity between 'prince' and 'princess':", ...)

## Topic 4: Training a FastText Model

`FastText`, developed by Facebook AI Research, is a powerful extension of `Word2Vec`. Its key innovation is that it learns vectors for **character n-grams** (sub-parts of words) in addition to whole words.

For example, the word `apple` is first wrapped in the boundary markers `<` and `>`, then (with n=3) broken down into `<ap`, `app`, `ppl`, `ple`, and `le>`. The final vector for `apple` combines these n-gram vectors with a vector for the whole word.
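
The breakdown of a word into character n-grams can be reproduced in a few lines. This is a sketch of the idea only; the function name `char_ngrams` is made up, and it does not call into Gensim.

```python
def char_ngrams(word, n):
    """All character n-grams of length n, after adding < and > boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


trigrams = char_ngrams('apple', 3)
print(trigrams)
```

The boundary markers let the model tell a prefix such as `<ap` apart from the same letters appearing in the middle of another word.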

This gives it a superpower: **it can create vectors for words it has never seen before (Out-of-Vocabulary or OOV words)!** This is extremely useful for real-world text which often contains typos, slang, or new words.

In [7]:
print("\n--- Training FastText Model ---")

# The parameters are very similar to Word2Vec
fasttext_model = gensim.models.FastText(
    sentences=sentences,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,
    workers=4,
    min_n=3,          # Minimum length of char n-grams
    max_n=6           # Maximum length of char n-grams
)

print("\n✅ FastText model training complete!")

2025-11-13 11:39:18,498 : INFO : collecting all words and their counts
2025-11-13 11:39:18,501 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-11-13 11:39:18,503 : INFO : collected 19 word types from a corpus of 42 raw words and 7 sentences
2025-11-13 11:39:18,504 : INFO : Creating a fresh vocabulary
2025-11-13 11:39:18,507 : INFO : FastText lifecycle event {'msg': 'effective_min_count=1 retains 19 unique words (100.00% of original 19, drops 0)', 'datetime': '2025-11-13T11:39:18.507165', 'gensim': '4.3.3', 'python': '3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-11-10.0.26100-SP0', 'event': 'prepare_vocab'}
2025-11-13 11:39:18,510 : INFO : FastText lifecycle event {'msg': 'effective_min_count=1 leaves 42 word corpus (100.00% of original 42, drops 0)', 'datetime': '2025-11-13T11:39:18.510226', 'gensim': '4.3.3', 'python': '3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:


--- Training FastText Model ---


2025-11-13 11:39:20,608 : INFO : FastText lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-11-13T11:39:20.608505', 'gensim': '4.3.3', 'python': '3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-11-10.0.26100-SP0', 'event': 'build_vocab'}
2025-11-13 11:39:20,612 : INFO : FastText lifecycle event {'msg': 'training model with 4 workers on 19 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-11-13T11:39:20.612440', 'gensim': '4.3.3', 'python': '3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-11-10.0.26100-SP0', 'event': 'train'}
2025-11-13 11:39:20,628 : INFO : EPOCH 0: training on 42 raw words (7 effective words) took 0.0s, 663 effective words/s
2025-11-13 11:39:20,636 : INFO : EPOCH 1: training on 42 raw words (5 effective words) took 0.0s, 3862 effective wor


✅ FastText model training complete!


## Topic 5: Handling OOV Words with FastText

Let's test FastText's superpower. The word `'royal'` does not appear in our training sentences, but the word `'royalty'` does. Because they share n-grams (like `roy`), FastText can still build a vector for `'royal'` from those shared pieces. Word2Vec only stores vectors for whole words it saw during training, so looking up `'royal'` raises a `KeyError`.
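
To see how much `'royal'` and `'royalty'` really have in common, we can list the n-grams they share for the `min_n=3` and `max_n=6` settings used above. This is an illustrative sketch with a made-up helper name, `ngram_set`. Gensim's actual implementation also hashes n-grams into a fixed number of buckets, which this sketch leaves out.

```python
def ngram_set(word, min_n, max_n):
    """Set of all character n-grams with lengths from min_n to max_n."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


royal = ngram_set('royal', 3, 6)
royalty = ngram_set('royalty', 3, 6)
shared = royal & royalty

print(f"{len(shared)} shared n-grams:", sorted(shared))
```

Every shared n-gram was trained as part of `'royalty'`, so the vector FastText builds for `'royal'` is based on learned pieces rather than on nothing at all.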

In [8]:
# Get the KeyedVectors for the FastText model
ft_wv = fasttext_model.wv

# Demonstrate handling of an Out-of-Vocabulary (OOV) word
print("FastText can create a vector for the OOV word 'royal'!")
print(ft_wv['royal'])

# Now, let's see what happens when we try the same with Word2Vec
print("\n--- Trying to access OOV word 'royal' with Word2Vec ---")
try:
    print(wv['royal'])
except KeyError as e:
    print("Word2Vec error as expected:", e)

FastText can create a vector for the OOV word 'royal'!
[-1.5122148e-03  1.1532352e-03  1.9209286e-03 -8.8359398e-04
  3.3625741e-03 -2.1618661e-03  7.6207158e-04  4.1295314e-04
 -2.7965769e-04 -1.8673305e-03  1.7786868e-03 -2.0222842e-04
 -9.1285910e-04  6.8314903e-04  8.3535007e-04 -8.3769549e-04
 -2.3814852e-03 -2.2419088e-03  1.6209234e-03  5.5180426e-04
 -8.8045344e-04 -7.6456659e-04  8.1196550e-04 -2.7520659e-03
  5.6462327e-04  1.6299620e-03 -1.3160080e-03  2.1835254e-03
 -2.8695953e-03 -1.7305206e-03  8.6225721e-04 -6.0667080e-04
  6.9958577e-04 -2.3803189e-03 -4.8611444e-04 -6.2479317e-04
 -1.3061651e-03 -5.1116204e-04 -6.6606834e-04 -1.0390515e-03
  2.5509144e-03  2.2024622e-03 -6.8543415e-04 -5.3005147e-04
  2.5311424e-03 -4.4328329e-04 -1.7440926e-03  3.3177424e-03
  2.2625123e-04  2.1600495e-03 -5.1944656e-04 -9.9181675e-04
 -2.5285182e-03 -6.9755915e-04 -6.9205300e-05  3.9778976e-04
 -4.1564279e-05  7.9426903e-04  2.7825802e-03 -1.1663277e-03
  5.5489287e-04 -1.1981005e-03

### 🎯 Practice Task 3: Test Another OOV Word

Try to get a vector for another OOV word, such as `'kingdom'`. Does FastText handle it? What about Word2Vec?

In [None]:
# Your code here!
# Try to access the vector for 'kingdom' using ft_wv (FastText)

# print(ft_wv['kingdom'])

## Final Revision Assignment

Congratulations on making it to the end of the session! Now it's time to combine everything you've learned. We will use a new corpus of customer reviews to build and test our models.

**Your goal:** Analyze customer reviews to find similar concepts.

In [None]:
reviews_raw = [
    "The customer service was excellent and friendly.",
    "I was not happy with the product quality.",
    "The delivery was slow but the service was helpful.",
    "Excellent product and quick delivery."
]

### Task 1: Preprocess the Reviews

Complete the code below to tokenize and lowercase the `reviews_raw` list.

In [None]:
reviews_corpus = []
for review in reviews_raw:
    # 1. Lowercase the review
    # 2. Split it into words (tokens)
    # 3. Append the list of words to reviews_corpus
    pass

# print(reviews_corpus)

### Task 2: Train a Word2Vec Model

Train a `Word2Vec` model on the `reviews_corpus`.
-   `vector_size` = 50
-   `window` = 3
-   `sg` = 1 (Skip-Gram)
-   `min_count` = 1

In [None]:
# Your code here to train a Word2Vec model
# review_w2v_model = ...


### Task 3: Find Similar Words

Using your new Word2Vec model, find the top 3 words most similar to `"service"`.

In [None]:
# Your code here
# review_wv = review_w2v_model.wv
# print(review_wv.most_similar('service', topn=3))

### Task 4: Train a FastText Model

Now, train a `FastText` model on the same `reviews_corpus`. Use the same parameters as the Word2Vec model.

In [None]:
# Your code here to train a FastText model
# review_ft_model = ...


### Task 5: Compare OOV Handling

The word `"friendliness"` is not in our corpus, but `"friendly"` is. Try to get the vector for `"friendliness"` using both your Word2Vec and FastText models. What happens?

In [None]:
# Get the vectors from the FastText model
# review_ft_wv = review_ft_model.wv
# print("FastText vector for 'friendliness':", review_ft_wv['friendliness'])

# Try to get the vectors from the Word2Vec model
# try:
#     print(review_wv['friendliness'])
# except KeyError as e:
#     print("Word2Vec error:", e)

### Task 6: Conceptual Question

A hospital wants to build a search engine for its patient discharge summaries, which contain a lot of specific medical terms. Would you recommend using a generic pre-trained model (like one from Google News) or training a custom **FastText** model on the hospital's data?

**Double-click here to write your answer:**

*Your answer here. Justify your choice!*

## 🎉 You've completed the session!

Well done! You now have hands-on experience training and using custom word embedding models with Gensim. This is a fundamental skill in modern NLP and opens the door to building powerful, domain-aware applications.