# Embedding Clustering and Vectorization Workshop

This notebook demonstrates the implementation of Word2Vec and GloVe embedding models on a real-world text corpus relevant to our final project.

## Team Members
- Kapil
- Parag
- Preetpal

## 🔄 NLP Preprocessing Pipeline
Steps:
1. Document collection
2. Tokenization
3. Lowercasing
4. Stopword removal
5. Lemmatization

In [3]:
import nltk
nltk.download('punkt')  # This line ensures punkt is available
nltk.download('stopwords')  # Already done, but safe to call again


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# Step 1: Load the file
with open("./Data/Corpus_Titanic.txt", "r", encoding="utf-8") as file:
    raw_corpus = file.read()

# Step 2: Import libraries
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Step 3: Download and set up NLTK resources
nltk_data_dir = "nltk_data"  # local folder
nltk.download('punkt', download_dir=nltk_data_dir)
nltk.download('stopwords', download_dir=nltk_data_dir)

# Add to NLTK path explicitly
nltk.data.path.append(nltk_data_dir)

# Step 4: Preprocessing function
def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # remove punctuation/numbers
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stopwords.words('english')]
    return tokens

# Step 5: Apply preprocessing
tokens = preprocess(raw_corpus)

# Step 6: Print output
print(tokens[:50])  # show first 50 tokens


[nltk_data] Downloading package punkt to nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


['titanic', 'massive', 'passenger', 'liner', 'sank', 'north', 'atlantic', 'ocean', 'hitting', 'iceberg', 'people', 'died', 'one', 'deadliest', 'commercial', 'peacetime', 'maritime', 'disasters', 'modern', 'history', 'titanic', 'considered', 'unsinkable', 'nature', 'proved', 'otherwise', 'century', 'later', 'submersible', 'named', 'titan', 'imploded', 'descending', 'titanic', 'wreck', 'site', 'titan', 'privatelyoperated', 'deepsea', 'vehicle', 'designed', 'underwater', 'exploration', 'tragically', 'five', 'people', 'onboard', 'killed', 'titans', 'disappearance']


## 🔤 Word2Vec Implementation (Predictive Model)

In [11]:
# Step 1: Load the corpus
with open("Data/Corpus_Titanic.txt", "r", encoding="utf-8") as file:
    raw_corpus = file.read()

# Step 2: Import required libraries
import re
import nltk
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from gensim.models import Word2Vec

# Step 3: Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Step 4: Preprocess function (sentence and word level)
def preprocess_sentences(text):
    stop_words = set(stopwords.words('english'))
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # remove punctuation and digits
    sentences = sent_tokenize(text)
    tokenized = [
        [word for word in word_tokenize(sent) if word not in stop_words]
        for sent in sentences
    ]
    return tokenized

# Step 5: Tokenize the corpus
tokenized_docs = preprocess_sentences(raw_corpus)

# Step 6: Train Word2Vec model
w2v_model = Word2Vec(sentences=tokenized_docs, vector_size=50, window=3, min_count=1, workers=4)
w2v_model.save("word2vec.model")

# Step 7: Get similar words to "titanic"
try:
    similar_words = w2v_model.wv.most_similar('titanic', topn=10)
except KeyError:
    similar_words = []

# Step 8: Display results in grid format using pandas
if similar_words:
    df_similar = pd.DataFrame(similar_words, columns=['Word', 'Similarity Score'])
    print("🔍 Top 10 words similar to 'titanic':")
    display(df_similar)
else:
    print("⚠️ The word 'titanic' was not found in the vocabulary.")


🔍 Top 10 words similar to 'titanic':


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Word,Similarity Score
0,sea,0.271427
1,descending,0.271214
2,raised,0.269587
3,peacetime,0.254257
4,passenger,0.241415
5,ended,0.23966
6,investigators,0.224481
7,site,0.211033
8,private,0.196525
9,killed,0.186529


## 📊 GloVe Implementation (Count-based Model)
Note: We'll use the `glove_python` package.

In [18]:
import gensim.downloader as api

# Load pretrained GloVe 50-dimensional embeddings (takes a while first time)
glove_vectors = api.load("glove-wiki-gigaword-50")

# Find similar words to 'titanic'
similar_words = glove_vectors.most_similar('titanic', topn=10)

print("Similar words to 'titanic' using pretrained GloVe:")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")


Similar words to 'titanic' using pretrained GloVe:
odyssey: 0.6530
phantom: 0.6510
doomed: 0.6414
r.m.s.: 0.6303
cinderella: 0.6263
voyager: 0.6227
wreck: 0.6045
ghost: 0.5991
horror: 0.5960
tragedy: 0.5954


In [21]:
!pip install glove-python


Defaulting to user installation because normal site-packages is not writeable
Collecting glove-python
  Using cached glove_python-0.1.0.tar.gz (263 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: glove-python
  Building wheel for glove-python (setup.py): started
  Building wheel for glove-python (setup.py): finished with status 'error'
  Running setup.py clean for glove-python
Failed to build glove-python


  DEPRECATION: Building 'glove-python' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'glove-python'. Discussion can be found at https://github.com/pypa/pip/issues/6334
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [68 lines of output]
      !!
      
              ********************************************************************************
              Please remove any references to `setuptools.command.test` in all supported versions of the affected package.
      
              This deprecation is overdue, please update your project and remove deprecated
              calls to avoid build errors in the future.

In [None]:
    from glove import Corpus, Glove

    # Step 1: Build the corpus
    corpus = Corpus()
    corpus.fit(tokenized_docs, window=3)

    # Step 2: Train the GloVe model
    glove = Glove(no_components=50, learning_rate=0.05)
    glove.fit(corpus.matrix, epochs=10, no_threads=4, verbose=True)
    glove.add_dictionary(corpus.dictionary)

    # Step 3: Get similar words to "titanic"
    similar_glove_words = glove.most_similar('titanic', number=10)

    # Step 4: Display results in a table
    import pandas as pd

    df_glove = pd.DataFrame(similar_glove_words, columns=["Word", "Similarity Score"])
    print("🔍 Top 10 words similar to 'titanic' using GloVe:")
    display(df_glove)


IndentationError: unexpected indent (1924384277.py, line 3)

## 📈 Talking Points: Word2Vec vs. GloVe

| Feature | Word2Vec | GloVe |
|--------|----------|--------|
| Model Type | Predictive | Count-based |
| Training Speed | Slower due to context prediction | Faster with precomputed co-occurrence matrix |
| Handles Rare Words | Less effective | Better representation of rare words |
| Semantic Accuracy | High | Moderate |
| Use Case Fit | Better for dynamic language (e.g., user reviews) | Better for static corpora like product catalogs |

## ✅ Summary
- Implemented both predictive and count-based embedding models
- Cleaned and tokenized real-world documents from our Grocery Buddies project
- Compared GloVe vs Word2Vec in practical use cases
- All members contributed equally to this collaborative peer-reviewed notebook