# Text 4: Word2Vec
**Internet Analytics - Lab 4**

---

**Group:** *J*

**Names:**

* *Kenza Driss*
* *Maximilien Hoffbeck*
* *Jaeyi Jeong*
* *Yoojin Kim*

---

#### Instructions

*This is a template for part 4 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import re
import numpy as np
from scipy.sparse import csr_matrix
from collections import defaultdict
import json
from utils import *
import gensim
from sklearn.cluster import KMeans


courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Redo pre-processing

In [2]:
import json
import re
import pickle

# Load course names and descriptions (one JSON object per line)
course_names = []
course_descriptions = []

with open("data/courses.txt", "r") as f:
    for line in f:
        data = json.loads(line)
        course_names.append(data["name"])
        course_descriptions.append(data["description"])

# Load the stopword list
with open("data/stopwords.pkl", "rb") as f:
    stopwords = pickle.load(f)

# Tokenize each description into alphanumeric tokens and drop stopwords.
# Case is preserved, so the resulting vocabulary is case-sensitive.
tokenized_courses = []
vocab = set()

for desc in course_descriptions:
    tokens = re.findall(r"\b[A-Za-z0-9]+\b", desc)
    tokens = [t for t in tokens if t not in stopwords]
    tokenized_courses.append(tokens)
    vocab.update(tokens)

print(f" {len(course_descriptions)} course descriptions processed.")
print(f" Vocabulary size (case-sensitive): {len(vocab)}")


 854 course descriptions processed.
 Vocabulary size (case-sensitive): 19081


## Exercise 4.12 : Clustering word vectors

In [3]:
from gensim.models import KeyedVectors

# Path to the pretrained Word2Vec model (text format)
w2v_path = "/home/ix/ix-data/model.txt"

w2v = KeyedVectors.load_word2vec_format(w2v_path, binary=False)

# Keep only the vocabulary words that have a vector in the pretrained model
words_in_model = [w for w in vocab if w in w2v]

print(f" {len(words_in_model)} words in your vocabulary are in the pretrained Word2Vec model.")


 13300 words in your vocabulary are in the pretrained Word2Vec model.


In [4]:
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Stack the Word2Vec vectors of all vocabulary words found in the model
X = np.array([w2v[word] for word in words_in_model])

# L2-normalize each row so that Euclidean distance reflects cosine similarity
X_normalized = normalize(X)

# Cluster the normalized vectors into k groups
k = 20
kmeans = KMeans(n_clusters=k, random_state=42, n_init="auto")
kmeans.fit(X_normalized)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# For each cluster, print the 10 words most similar (cosine) to its centroid
for i in range(k):
    print(f"\n Cluster {i}")
    cluster_indices = np.where(labels == i)[0]
    cluster_words = [words_in_model[j] for j in cluster_indices]
    cluster_vectors = X_normalized[cluster_indices]

    sims = cosine_similarity(cluster_vectors, centroids[i].reshape(1, -1)).flatten()
    top_indices = sims.argsort()[-10:][::-1]

    for idx in top_indices:
        print(f"  {cluster_words[idx]}")



 Cluster 0
  thin
  casing
  cylindrical
  sliding
  wires
  vertical
  stacked
  horizontal
  thick
  mesh

 Cluster 1
  notion
  context
  concepts
  notions
  normative
  epistemologies
  interpretation
  understanding
  implications
  assumptions

 Cluster 2
  photodetectors
  photocurrent
  nonlinearities
  photoluminescence
  photomultipliers
  photoemission
  dielectric
  excitation
  linewidth
  electromigration

 Cluster 3
  Biochemistry
  Microbiology
  Neuroscience
  Physics
  Biomedical
  Chemistry
  Psychology
  Neurophysiology
  Biophysics
  Applied

 Cluster 4
  quadratic
  ODEs
  eigenvalues
  summability
  invariant
  discretized
  finite
  equations
  stationarity
  integrability

 Cluster 5
  investments
  investment
  enterprises
  suppliers
  investing
  procurement
  healthcare
  outsourcing
  pricing
  enterprise

 Cluster 6
  texts
  illustrations
  essays
  illustrative
  illustrating
  inspired
  references
  illustrated
  compositions
  formes

 Cluster 7
  

#### 1. Top-10 words per cluster

See output above. We clustered the Word2Vec vectors of the **13,300 vocabulary words present in the pretrained model** into **k = 20** clusters using KMeans. The vectors were L2-normalized first: for unit vectors, $\|a - b\|^2 = 2 - 2\cos(a, b)$, so the Euclidean distance used by KMeans is a monotone function of cosine similarity. For each cluster, we printed the 10 words most similar to the centroid.
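
The link between normalization and cosine similarity mentioned above can be checked numerically. The snippet below is a minimal sketch on random stand-in vectors (an assumption for illustration; it does not use the pretrained model), showing that on unit vectors the squared Euclidean distance equals 2 minus twice the cosine similarity, and that both measures rank neighbours identically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for word vectors: 5 random 50-d vectors (NOT the real model)
X = rng.normal(size=(5, 50))
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

a, b = X_unit[0], X_unit[1]
cos_sim = float(a @ b)                 # cosine similarity of two unit vectors
sq_dist = float(np.sum((a - b) ** 2))  # squared Euclidean distance

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(sq_dist, 2 - 2 * cos_sim))

# Ranking the other vectors by ascending distance or by descending cosine
# similarity gives the same order
query, others = X_unit[0], X_unit[1:]
by_dist = np.argsort(np.linalg.norm(others - query, axis=1))
by_cos = np.argsort(-(others @ query))
print(np.array_equal(by_dist, by_cos))
```

Note that KMeans centroids are generally not unit-length, so this equivalence is exact for word-to-word comparisons and only approximate for the cluster assignment itself.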

#### 2. Cluster types and interpretation

Here are 10 example clusters and their interpreted theme:

- **Cluster 0 – Mechanical structures / physical components**: `thin`, `casing`, `cylindrical`, `sliding`, `mesh`...
- **Cluster 1 – Abstract reasoning / epistemology**: `notion`, `context`, `assumptions`, `epistemologies`, `interpretation`...
- **Cluster 2 – Photonics / optoelectronics**: `photodetectors`, `photocurrent`, `photoluminescence`, `excitation`, `dielectric`...
- **Cluster 3 – Scientific disciplines**: `Biochemistry`, `Neuroscience`, `Psychology`, `Physics`, `Biomedical`...
- **Cluster 4 – Applied mathematics / equations**: `ODEs`, `quadratic`, `eigenvalues`, `stationarity`, `discretized`...
- **Cluster 5 – Economics / supply chain**: `investments`, `suppliers`, `procurement`, `enterprise`, `pricing`...
- **Cluster 6 – Visual communication / illustration**: `illustrations`, `essays`, `compositions`, `references`, `formes`...
- **Cluster 7 – Human cognition / understanding**: `knowing`, `thinking`, `comprehend`, `mind`, `empathy`...
- **Cluster 8 – Environmental science / ecosystems**: `wetlands`, `vegetation`, `groundwater`, `basin`, `biotopes`...
- **Cluster 11 – Cell biology / neuroscience**: `apoptosis`, `neuronal`, `proteins`, `neuroprotection`, `inhibition`...

Some clusters (e.g. Cluster 9 or 19) are ambiguous or noisy, including numbers or surnames, but most are semantically coherent.

#### 3. Comparison to LSI and LDA

- **LSI** and **LDA** group terms based on their usage patterns across documents (topics as co-occurrence patterns).
- **KMeans on Word2Vec** groups terms based on **semantic similarity in pretrained embeddings** (from Wikipedia), regardless of our corpus.

Compared to LSI/LDA:
- These clusters are **sharper and cleaner** (e.g. Cluster 2 is clearly photonics-related).
- But they lack **document-level context**: the clustering is based on terms only.
- However, many LSI/LDA topics align closely with the clusters, showing that Word2Vec embeddings are **consistent with latent topics** discovered from the corpus.


## Exercise 4.13 : Document similarity search

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
import numpy as np
import re

# Rebuild each course as a space-joined string for the vectorizer
corpus_joined = [" ".join(doc) for doc in tokenized_courses]

# Fit TF-IDF restricted to words that have a Word2Vec vector; keep the IDF weights
vectorizer = TfidfVectorizer(vocabulary=words_in_model, lowercase=False)
tfidf_matrix = vectorizer.fit_transform(corpus_joined)
idf_dict = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

def document_vector(doc_tokens):
    """Return the mean of the IDF-weighted Word2Vec vectors of the known tokens."""
    vecs = []
    for word in doc_tokens:
        if word in w2v and word in idf_dict:
            weight = idf_dict[word]
            vecs.append(weight * w2v[word])
    # Repeated tokens are added once per occurrence, which accounts for term frequency
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

# Embed every course and L2-normalize the result
doc_vectors = np.array([document_vector(doc) for doc in tokenized_courses])
doc_vectors = normalize(doc_vectors)

def search(query, top_k=5):
    """Print the top_k courses with the highest cosine similarity to the query."""
    # Tokenize the query like the corpus and keep only words we can embed
    tokens = re.findall(r"\b[A-Za-z0-9]+\b", query)
    tokens = [t for t in tokens if t in w2v and t in idf_dict]

    if not tokens:
        print(f"No valid words found in query '{query}'")
        return

    query_vec = document_vector(tokens).reshape(1, -1)
    query_vec = normalize(query_vec)

    # Rank courses by cosine similarity, highest first
    sims = cosine_similarity(doc_vectors, query_vec).flatten()
    top_indices = sims.argsort()[-top_k:][::-1]

    print(f"\nQuery: '{query}'\n")
    for i in top_indices:
        print(f"Score: {sims[i]:.4f} | {course_names[i]}: {course_descriptions[i][:80]}...")

search("Markov chains")
search("Facebook")



Query: 'Markov chains'

Score: 0.6352 | Applied stochastic processes: This course introduces the theory of stochastic processes including Markov chain...
Score: 0.5740 | Applied probability & stochastic processes: This course focuses on dynamic models of random phenomena, and in particular, th...
Score: 0.5652 | Markov chains and algorithmic applications: The study of random walks finds many applications in computer science and commun...
Score: 0.5611 | Statistical Sequence Processing: This course discusses advanced methods extensively used for the processing, pred...
Score: 0.5343 | Polymer chemistry and macromolecular engineering: Know modern methods of polymer synthesis. Understand how parameters, which deter...

Query: 'Facebook'

Score: 0.6106 | Computational Social Media: The course integrates concepts from media studies, machine learning, multimedia ...
Score: 0.4408 | Computer networks: This course provides an introduction to computer networks. It describes the prin...
Score: 


We implemented a search function where each course is represented as a **TF-IDF–weighted average of Word2Vec word vectors**.  
Queries are processed in the same way, and we use **cosine similarity** to compare them to each course.
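
To make the representation concrete, here is a minimal, self-contained sketch of the same idea on toy data. The 3-d embeddings, the IDF weights, and the names `toy_w2v`, `toy_idf` and `toy_document_vector` are all invented for illustration; they are not taken from the pretrained model or from our corpus.

```python
import numpy as np

# Hypothetical 3-d embeddings and IDF weights, invented for illustration only
toy_w2v = {
    "markov": np.array([1.0, 0.0, 0.0]),
    "chains": np.array([0.8, 0.2, 0.0]),
    "polymer": np.array([0.0, 1.0, 0.0]),
    "the": np.array([0.3, 0.3, 0.3]),
}
toy_idf = {"markov": 3.0, "chains": 2.0, "polymer": 3.0, "the": 0.1}

def toy_document_vector(tokens):
    """IDF-weighted mean of the vectors of known tokens, L2-normalized."""
    vecs = [toy_idf[t] * toy_w2v[t] for t in tokens if t in toy_w2v and t in toy_idf]
    if not vecs:
        return np.zeros(3)
    v = np.mean(vecs, axis=0)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Two toy "course descriptions" after tokenization
docs = {
    "stochastic": ["the", "markov", "chains", "markov"],
    "chemistry": ["the", "polymer", "chains"],
}

# Both vectors are unit-length, so the dot product is the cosine similarity
query_vec = toy_document_vector(["markov", "chains"])
scores = {name: float(toy_document_vector(toks) @ query_vec) for name, toks in docs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setting the "chemistry" document still receives a non-zero score because it shares the word "chains" with the query, which is the same effect that lets an unrelated course enter a top-5 list.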

#### Results for "Markov chains":

The four top-ranked courses are highly relevant:
- **Applied stochastic processes**
- **Applied probability & stochastic processes**
- **Markov chains and algorithmic applications**
- **Statistical Sequence Processing**

All four focus on stochastic processes, random walks, or statistical modeling, confirming that the model retrieves semantically appropriate results even when the exact query words are not repeated. The fifth result, **Polymer chemistry and macromolecular engineering**, is off-topic; it is presumably matched through the chemical sense of "chains".

#### Results for "Facebook":

The highest-ranked courses include:
- **Computational Social Media**
- **Computer networks**
- **Privacy Protection**
- **Internet analytics**
- **Media security**

These courses deal with social networks, online communication, privacy, and media, all thematically related to Facebook. Even though the word “Facebook” might not appear in most descriptions, the system generalizes well through word embeddings.

#### Comparison with Vector Space Model and LSI:

Compared to the **vector space model** (TF-IDF only), this method captures **semantic similarity** rather than relying on exact keyword matches. A TF-IDF-only model would miss several relevant courses due to vocabulary mismatch.
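
The vocabulary-mismatch problem of a TF-IDF-only model can be illustrated with a small sketch. The two mini-descriptions below are invented for the example (they are not course data): a query word that was never seen when fitting the vectorizer maps to an all-zero vector, so every document gets a similarity of 0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two invented mini-descriptions (NOT taken from the course data)
toy_docs = [
    "social media platforms and online communities",
    "thermodynamics of heat engines",
]
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(toy_docs)

# "facebook" was never seen during fit, so its TF-IDF row is all zeros
query_matrix = vectorizer.transform(["facebook"])
tfidf_scores = cosine_similarity(query_matrix, doc_matrix).flatten()

print(query_matrix.nnz)  # number of non-zero entries in the query vector
print(tfidf_scores)      # similarity of the query to each document
```

With word embeddings, the same query would still get a usable vector as long as the word exists in the pretrained model.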

Compared to **LSI**, Word2Vec leverages pretrained semantic knowledge at the word level (from Wikipedia), allowing more fine-grained matching. LSI derives latent topics from term co-occurrence within our own documents, while Word2Vec contributes word-level similarity learned from a much larger external corpus, which gives a more **locally precise** match between individual terms.

Overall, Word2Vec + TF-IDF offers **more robust, flexible, and meaningful document matching**, especially for short or expressive queries.


## Exercise 4.14: Document similarity search with outside terms

In [7]:
print("MySpace" in vocab)
print("Orkut" in vocab)
print("coronavirus" in vocab)


False
False
False


In [14]:
def search_generalized(query, top_k=5):
    """Search with any word known to Word2Vec, even if absent from the corpus."""
    tokens = re.findall(r"\b[A-Za-z0-9]+\b", query)
    tokens = [t for t in tokens if t in w2v]

    if not tokens:
        print(f"No recognized words in query '{query}' (not in Word2Vec vocabulary).")
        return

    # Unweighted mean: out-of-corpus words have no IDF weight
    vecs = [w2v[word] for word in tokens]
    query_vec = np.mean(vecs, axis=0).reshape(1, -1)
    query_vec = normalize(query_vec)

    # Compare against the IDF-weighted course vectors built in Exercise 4.13
    sims = cosine_similarity(doc_vectors, query_vec).flatten()
    top_indices = sims.argsort()[-top_k:][::-1]

    print(f"\nQuery: '{query}' (words not in corpus)\n")
    for i in top_indices:
        print(f"Score: {sims[i]:.4f} | {course_names[i]}: {course_descriptions[i][:80]}...")

search_generalized("MySpace Orkut")
search_generalized("coronavirus")



Query: 'MySpace Orkut' (words not in corpus)

Score: 0.5942 | Computational Social Media: The course integrates concepts from media studies, machine learning, multimedia ...
Score: 0.4841 | Computer networks: This course provides an introduction to computer networks. It describes the prin...
Score: 0.4520 | Mobile networks: This course provides a detailed description of the organization and operating pr...
Score: 0.4479 | Privacy Protection: Main threats against privacy, description of protection techniques and of their ...
Score: 0.4454 | Internet analytics: Internet analytics is the collection, modeling, and analysis of user data in lar...

Query: 'coronavirus' (words not in corpus)

Score: 0.5193 | Infection biology: Infectious diseases (ID) are still a major problem to human health. But how do p...
Score: 0.5086 | Biotechnology lab (for CGC): Students apply basic techniques in molecular biology to clone a cDNA of interest...
Score: 0.5067 | Practical - Lemaitre Lab: Drosophila imm



We tested the ability of our Word2Vec-based search system to generalize to queries containing words that do not appear in our dataset. Thanks to the pretrained embeddings, which capture semantic information from large corpora such as Wikipedia, the model can still retrieve relevant documents by leveraging similarity in meaning, even when the exact query terms are unseen in the corpus.
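
The mechanism can be sketched on toy data. Everything below is hypothetical: the 2-d embeddings and the helper `unit_mean` are invented for illustration. The word "orkut" appears in no toy document, yet it has an embedding, so it can still be ranked against the documents.

```python
import numpy as np

# Hypothetical pretrained embeddings (invented 2-d values). "orkut" is in the
# embedding table but appears in no toy document.
toy_embeddings = {
    "social": np.array([0.9, 0.1]),
    "network": np.array([0.8, 0.3]),
    "protein": np.array([0.1, 0.9]),
    "orkut": np.array([0.85, 0.2]),
}
toy_corpus = {
    "Social media course": ["social", "network"],
    "Biology course": ["protein"],
}

def unit_mean(tokens):
    """Unit-length mean embedding; assumes at least one token is in the table."""
    v = np.mean([toy_embeddings[t] for t in tokens if t in toy_embeddings], axis=0)
    return v / np.linalg.norm(v)

# Confirm that the query word is absent from the corpus vocabulary
corpus_vocab = {t for toks in toy_corpus.values() for t in toks}
query_tokens = ["orkut"]
unseen = [t for t in query_tokens if t not in corpus_vocab]

# Rank documents by cosine similarity to the query, highest first
query_vec = unit_mean(query_tokens)
ranking = sorted(toy_corpus, key=lambda name: -float(unit_mean(toy_corpus[name]) @ query_vec))
print(unseen, ranking)
```

The ranking depends entirely on where the pretrained model places the unseen word, so the quality of the results is only as good as those embeddings.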

#### Query: "MySpace Orkut"

Neither "MySpace" nor "Orkut" appears in any EPFL course description. However, the top retrieved courses included:
- Computational Social Media
- Computer networks
- Mobile networks
- Privacy Protection
- Internet analytics

These courses are thematically related to online platforms, digital communication, and social media, demonstrating that the system can generalize well to related topics using only pretrained word semantics. The results are also comparable to what we obtained for the query "Facebook" in Exercise 4.13, confirming the consistency of semantic matching.

#### Query: "coronavirus"

The term "coronavirus" is not present in the course corpus. Nonetheless, the system retrieved:
- Infection biology
- Biotechnology lab (for CGC)
- Practical - Lemaitre Lab
- Biomedical Research
- Protein Engineering

These courses are directly relevant to biology, immunology, and public health, and would be suitable for someone interested in the scientific context of a pandemic. This shows that the model can link the term "coronavirus" to courses covering related biological and health-related content, a task that traditional TF-IDF or LSI models would fail at due to vocabulary mismatch.

#### Conclusion

This experiment highlights the advantage of using pretrained word embeddings: they provide strong generalization capabilities by capturing the semantic relationships between words beyond the training corpus. Unlike LSI or traditional vector space models, this allows our system to handle queries that include terms entirely missing from the document set while still producing meaningful and accurate results.
