# Embedding Clustering and Vectorization Workshop

This notebook trains a Word2Vec model on a real-world text corpus relevant to our final project and compares its nearest-neighbour results with pretrained GloVe embeddings.

## Team Members
- Kapil
- Parag
- Preetpal

## 🔄 NLP Preprocessing Pipeline
Steps (in the order the code below applies them):
1. Document collection
2. Lowercasing
3. Removal of punctuation and digits
4. Tokenization
5. Stopword removal
6. Lemmatization

## 1.1 Document Collection


In [None]:
# Read the corpus about the Titanic and the Titan submersible
with open("./data/Corpus_Titanic.txt", "r", encoding="utf-8") as file:
    raw_corpus = file.read()


## 1.2 Preprocessing (Tokenization, Lowercase, Stopword Removal, Lemmatization)


In [29]:
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

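# Setup: fetch the tokenizer model, stopword list, and WordNet data
# (newer NLTK releases may also require nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build the stopword set and the lemmatizer once instead of once per token
STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess(text):
    text = text.lower()                                   # lowercase
    text = re.sub(r'[^a-z\s]', '', text)                  # keep only letters and whitespace
    tokens = word_tokenize(text)                          # split into word tokens
    tokens = [w for w in tokens if w not in STOP_WORDS]   # drop English stopwords
    tokens = [lemmatizer.lemmatize(w) for w in tokens]    # lemmatize (as nouns by default)
    return tokens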

tokens = preprocess(raw_corpus)
print(tokens[:50])


['titanic', 'massive', 'passenger', 'liner', 'sank', 'north', 'atlantic', 'ocean', 'hitting', 'iceberg', 'people', 'died', 'one', 'deadliest', 'commercial', 'peacetime', 'maritime', 'disaster', 'modern', 'history', 'titanic', 'considered', 'unsinkable', 'nature', 'proved', 'otherwise', 'century', 'later', 'submersible', 'named', 'titan', 'imploded', 'descending', 'titanic', 'wreck', 'site', 'titan', 'privatelyoperated', 'deepsea', 'vehicle', 'designed', 'underwater', 'exploration', 'tragically', 'five', 'people', 'onboard', 'killed', 'titan', 'disappearance']
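
The output above contains merged tokens such as `privatelyoperated` and `deepsea`, because the regex deletes hyphens instead of separating the two halves. Below is a minimal sketch of one alternative, assuming hyphenated compounds should become separate words: substitute a space rather than an empty string. The helper name `strip_non_letters` and the sample sentence are illustrative and not part of the notebook.

```python
import re

def strip_non_letters(text: str, replacement: str = " ") -> str:
    """Lowercase the text and replace every character that is not a-z or whitespace."""
    return re.sub(r"[^a-z\s]", replacement, text.lower())

sample = "Titan was a privately-operated, deep-sea vehicle."

# Deleting the characters (the notebook's current behaviour) glues the halves together.
merged = strip_non_letters(sample, replacement="").split()
print(merged)

# Substituting a space keeps the halves as separate tokens.
separated = strip_non_letters(sample).split()
print(separated)
```

Whether splitting is preferable depends on the task: it keeps `deep` and `sea` in the vocabulary, but loses the compound as a single unit.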




# Step 2: Word2Vec Embedding (Predictive Model)

## 2.1 Prepare Sentences for Word2Vec

In [30]:
from gensim.models import Word2Vec

# The corpus is one flat token list, so split it into fixed 30-token chunks
# that stand in for sentences during training
sentences = [tokens[i:i+30] for i in range(0, len(tokens), 30)]

# Train a skip-gram model (sg=1); words seen fewer than 2 times are dropped (min_count=2)
w2v_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=2, sg=1, workers=4)
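
To make the `sg=1` and `window` settings concrete, here is a minimal sketch of how skip-gram turns one chunk into (center, context) training pairs. It is a simplification under stated assumptions: the real gensim trainer also samples a smaller effective window per word and uses negative sampling, both omitted here. The function name `skipgram_pairs` is hypothetical.

```python
from typing import Iterator, List, Tuple

def skipgram_pairs(sentence: List[str], window: int) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for every word within `window` positions of the center."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                yield center, sentence[j]

chunk = ["titanic", "massive", "passenger", "liner", "sank"]
pairs = list(skipgram_pairs(chunk, window=2))
print(len(pairs))
print(pairs[:2])
```

Because pairs are only formed inside a chunk, words on either side of a 30-token boundary never appear in each other's context.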


## 2.2 Get Similar Words to "titanic"



In [31]:
import pandas as pd

# Ten nearest neighbours of "titanic", ranked by cosine similarity
w2v_similar = w2v_model.wv.most_similar("titanic", topn=10)
df_w2v = pd.DataFrame(w2v_similar, columns=["Word", "Similarity Score"])
display(df_w2v)


| | Word | Similarity Score |
| --- | --- | --- |
| 0 | maritime | 0.216139 |
| 1 | massive | 0.093101 |
| 2 | later | 0.092917 |
| 3 | ocean | 0.079591 |
| 4 | people | 0.062851 |
| 5 | century | 0.027057 |
| 6 | deepsea | 0.016135 |
| 7 | titan | -0.010832 |
| 8 | human | -0.027654 |
| 9 | disaster | -0.052347 |
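
The similarity scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows the calculation in plain Python on made-up vectors (they are not taken from the trained model), so the range of the score is easier to interpret.

```python
import math
from typing import Sequence

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

parallel = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])          # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])           # opposite directions
print(parallel, orthogonal, opposite)
```

The three cases come out close to 1, 0, and -1 respectively. Scores near 0, like most of those in the Word2Vec results above, mean the vectors point in nearly unrelated directions.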


## GloVe Implementation (Count-based Model)
Note: The GloVe vectors used in Step 3 are pretrained and loaded through `gensim.downloader`, so the `glove_python` package is not required.



# Step 3: GloVe Embedding (Count-Based Model)

## 3.1 Load Pretrained GloVe

In [33]:
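import gensim.downloader as api

# Downloads the vectors on first use, then loads them from the local gensim-data cache.
# 50-, 200- and 300-dimensional variants are also available.
glove_model = api.load("glove-wiki-gigaword-100")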


## 3.2 Get Similar Words to "titanic"

In [34]:
glove_similar = glove_model.most_similar("titanic", topn=10)
df_glove = pd.DataFrame(glove_similar, columns=["Word", "Similarity Score"])
display(df_glove)


| | Word | Similarity Score |
| --- | --- | --- |
| 0 | sinking | 0.560998 |
| 1 | dicaprio | 0.558514 |
| 2 | rms | 0.554885 |
| 3 | voyage | 0.53636 |
| 4 | sunk | 0.530033 |
| 5 | epic | 0.516684 |
| 6 | starship | 0.514416 |
| 7 | winslet | 0.513265 |
| 8 | r.m.s. | 0.507997 |
| 9 | iceberg | 0.504808 |
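
One simple way to quantify how differently the two models see "titanic" is to measure the overlap between their neighbour lists. The sketch below copies the ten words from each result table in this notebook and computes their Jaccard overlap; the helper name `jaccard` is illustrative.

```python
# Neighbour lists copied from the Word2Vec and GloVe result tables in this notebook.
w2v_neighbours = ["maritime", "massive", "later", "ocean", "people",
                  "century", "deepsea", "titan", "human", "disaster"]
glove_neighbours = ["sinking", "dicaprio", "rms", "voyage", "sunk",
                    "epic", "starship", "winslet", "r.m.s.", "iceberg"]

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0 = disjoint, 1 = identical)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

shared = sorted(set(w2v_neighbours) & set(glove_neighbours))
overlap = jaccard(w2v_neighbours, glove_neighbours)
print("shared neighbours:", shared)
print("Jaccard overlap:", overlap)
```

The two lists share no words, so the overlap is 0: the corpus-trained model and the pretrained model surface entirely different associations for the same query.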


## Talking Points: Word2Vec vs. GloVe


| **Topic** | **Word2Vec** | **GloVe** |
| --- | --- | --- |
| **Model Type** | Predictive (skip-gram, `sg=1`) | Count-based (fit to global co-occurrence statistics) |
| **Training Source** | Trained on our custom Titanic corpus | Pre-trained on Wikipedia + Gigaword |
| **Vocabulary Coverage** | Limited to corpus words that occur at least twice (`min_count=2`) | Very broad general-English vocabulary |
| **"titanic" Similar Words Output** | Corpus terms such as "maritime", "massive", and "ocean", but with low scores (0.22 at most), so the neighbours are only weakly separated | General-usage associations covering both the ship ("sinking", "rms", "iceberg") and the film ("dicaprio", "winslet"), with scores between 0.50 and 0.56 |
| **Accuracy** | Reflects only our corpus; vectors for infrequent words are unreliable when the corpus is small | Captures general word meaning well, but not corpus-specific usage |
| **Training Time** | Grows with corpus size and number of epochs | None at use time (pre-trained); only a one-time download |
| **Customization** | Fully customizable (training data and hyperparameters) | Loaded as fixed vectors; not retrained in this notebook |
| **Memory Use** | Small for a small corpus | Larger: 400,000 words × 100 dimensions (download on the order of 130 MB) |
| **Use Case Fit** | Domain-specific tasks with enough in-domain text | General NLP tasks |
| **Similarity Method** | Cosine similarity on trained vectors | Cosine similarity on pre-trained vectors |
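
The "count-based" label in the table refers to the statistics GloVe is fitted to: how often each pair of words appears near each other across the whole corpus. The sketch below shows only that counting stage, assuming a symmetric window and the 1/distance weighting used by the reference GloVe implementation; fitting vectors to the counts is omitted. The function name `cooccurrence_counts` and the toy token list are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def cooccurrence_counts(tokens: List[str], window: int) -> Dict[Tuple[str, str], float]:
    """Accumulate symmetric co-occurrence weights; a pair `d` positions apart adds 1/d."""
    counts: Dict[Tuple[str, str], float] = defaultdict(float)
    for i, word in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i + distance
            if j >= len(tokens):
                break
            weight = 1.0 / distance
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight  # keep the table symmetric
    return dict(counts)

toy_tokens = ["titanic", "sank", "north", "atlantic", "titanic", "wreck"]
counts = cooccurrence_counts(toy_tokens, window=2)
print(counts[("titanic", "sank")])    # adjacent once
print(counts[("titanic", "north")])   # two positions apart, seen twice
print(counts[("sank", "atlantic")])   # two positions apart, seen once
```

Unlike skip-gram, which updates vectors one local window at a time, this table is built over the whole corpus before any vectors are learned.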


## Summary
- Implemented both predictive and count-based embedding models
- Cleaned and tokenized a real-world Titanic/Titan text corpus for our final project
- Compared GloVe and Word2Vec on the nearest neighbours of "titanic"
- All members contributed equally to this collaborative peer-reviewed notebook