# **Build Embeddings**

In [1]:
import pandas as pd, spacy, re, unicodedata, ftfy, pathlib, gensim
from gensim.models import Word2Vec
import numpy as np
from tqdm import tqdm

spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## **NLP** 
### _(Normalization + Tokenization + Stopword removal + Lemmatization)_

- **Using spaCy instead of re.findall.**

Regex is fast & simple, but it can split improperly on apostrophes/hyphens and misses non-ASCII tokens. spaCy’s tokenizer is more robust for messy, user-written queries.

- **Using `token.is_stop` from spaCy instead of `nltk.corpus.stopwords`.**

NLTK's English stopword list has about 180 words, while spaCy's built-in English list has over 300, including more pronouns, auxiliaries, and connectives. The bigger list removes more “noise” words such as ‘whereas’, ‘nevertheless’, and ‘thereby’ that rarely help similarity search.
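
To make the tokenization point concrete, the sketch below shows how a naive ASCII-only regex behaves on a messy query. The sample sentence and both patterns are illustrative assumptions, not taken from the FAQ data.

```python
import re

# Hypothetical user query with an apostrophe, a hyphen, and a non-ASCII letter
query = "I don't know how to e-mail the café"

# Naive ASCII-only pattern: splits on the apostrophe and hyphen, drops the "é"
naive_tokens = re.findall(r"[a-z]+", query.lower())
print(naive_tokens)

# A Unicode-aware pattern that also keeps inner apostrophes and hyphens
patched_tokens = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", query.lower())
print(patched_tokens)
```

The regex can be patched, but every new edge case needs another rule. A tokenizer such as spaCy's ships with those rules already, which is why the notebook relies on it.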

In [5]:
# Load the model and disable the components we don't need: ner (entity recognition) and parser (dependency parsing).
# The tagger stays enabled because the lemmatizer relies on its part-of-speech tags.
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

def preprocess(text: str) -> list[str]:
    text = ftfy.fix_text(text)                    # Repair mis-encoded characters (“â€™” → “’”)
    text = unicodedata.normalize("NFKC", text)    # Normalize Unicode (NFKC: compatibility composition)
    text = re.sub(r"[\u2010-\u2014]", "-", text)  # Map Unicode hyphens and dashes to an ASCII "-"
    doc = nlp(text.lower())                       # Tokenize with spaCy
    return [t.lemma_ for t in doc if t.is_alpha and not t.is_stop]  # Keep alphabetic, non-stopword lemmas

# Build the corpus
# Load FAQ data
faq_df = pd.read_csv("../data/processed/faqs.csv")
faq_df["text"] = faq_df["question"].astype(str)

# Load resource data
res_path = "../data/processed/student_resources_index.csv"

try:
    res_df = pd.read_csv(res_path)
    if all(col in res_df.columns for col in ["title", "description"]):
        res_df["text"] = res_df["title"].astype(str) + " " + res_df["description"].astype(str)
    else:
        print("Columns 'title' and 'description' not found in resources CSV.")
        res_df = pd.DataFrame(columns=["text"])
except FileNotFoundError:
    print(f"File not found: {res_path}")
    res_df = pd.DataFrame(columns=["text"])


# Combine both into one corpus
combined_texts = pd.concat([faq_df["text"], res_df["text"]], ignore_index=True)

# Preprocess each entry
sentences = combined_texts.apply(preprocess).tolist()

print(f"Total documents processed: {len(sentences)}")

Total documents processed: 456


In the previous code block, we used `spaCy` to normalize and tokenize the FAQ questions and the resource entries:

- Applied `ftfy` to repair mis-encoded characters.
- Normalized Unicode characters with `unicodedata` (NFKC) and mapped Unicode dashes to a plain hyphen.
- Tokenized each text using `spaCy`'s English model (`en_core_web_sm`).
- Filtered out non-alphabetic tokens and common stopwords.
- Extracted the lemma of each remaining token to reduce words to their base form (e.g., "studying" → "study").

This resulted in a cleaned, tokenized version of each text, ready for training Word2Vec and for lookup in GloVe.
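
As a rough illustration of the normalization steps alone, here is a minimal sketch that uses only the standard library. It leaves out `ftfy` and spaCy on purpose, and the helper name `normalize_text` and the sample string are assumptions made for this example.

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Hypothetical helper: Unicode normalization, dash cleanup, lowercasing."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature becomes "fi"
    text = unicodedata.normalize("NFKC", text)
    # Map Unicode hyphens and dashes (U+2010 to U+2014) to an ASCII hyphen
    text = re.sub(r"[\u2010-\u2014]", "-", text)
    return text.lower()

# Sample with a non-breaking hyphen, an "fi" ligature, and an en dash
sample = "Part\u2011time \ufb01nancial aid \u2013 OSAP"
print(normalize_text(sample))
```

After this step the text contains only plain hyphens and standard letters, so the tokenizer sees a consistent input regardless of how the original was typed or encoded.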

## **Word2Vec**
### _Vectorization_

Converts words into numerical vectors that a model can compare or classify. Words used in similar contexts should end up with nearby vectors:

- "osap" and "bursary" → close embeddings.
- "vmock" and "resume" → close.
- "fees" and "library" → far apart.
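
"Close" and "far" are usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration (the real embeddings in this notebook have 100 dimensions), and the `cosine_similarity` helper is a name introduced here.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors, not taken from any trained model
toy = {
    "osap": np.array([0.9, 0.1, 0.0]),
    "bursary": np.array([0.8, 0.2, 0.1]),
    "library": np.array([0.0, 0.1, 0.9]),
}

sim_close = cosine_similarity(toy["osap"], toy["bursary"])
sim_far = cosine_similarity(toy["osap"], toy["library"])
print(f"osap vs bursary: {sim_close:.3f}")
print(f"osap vs library: {sim_far:.3f}")
```

With a trained gensim model, the equivalent check would be `w2v.wv.similarity("osap", "bursary")`, assuming both words are in the vocabulary.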

In [6]:
w2v = Word2Vec(
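        sentences=sentences,  # Token lists produced by preprocess()
        vector_size=100,      # Each word vector has 100 dimensions
        window=5,             # Context: up to 5 words before and after the target word
        min_count=1,          # Keep every word, even those that appear only once
        workers=4,            # Training threads (fully reproducible runs need workers=1)
        sg=1,                 # 1 = skip-gram, 0 = CBOW
        epochs=50, seed=42)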
w2v.save("../models/embeddings/word2vec_faqs.bin")
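
To give a feel for what `window` and `sg=1` mean, the sketch below lists the (center, context) pairs a skip-gram model would learn from for one short sentence. The `skipgram_pairs` helper and the sample tokens are assumptions for illustration. The real training loop in gensim is more involved (for example, it uses negative sampling), so treat this as a simplified picture.

```python
def skipgram_pairs(tokens: list[str], window: int) -> list[tuple[str, str]]:
    """Pair every token with each neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical preprocessed sentence
tokens = ["apply", "osap", "bursary", "deadline"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Words that share many context words across the corpus are pushed toward similar vectors, which is how "osap" and "bursary" can end up close together.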


## **GloVe Embeddings**

In addition to training our own Word2Vec model on the FAQ and resource corpus, we also load pre-trained **GloVe embeddings (Global Vectors for Word Representation)**.

The `glove.6B` vectors were trained on Wikipedia 2014 and Gigaword 5 (about 6 billion tokens) and capture rich semantic relationships between words (e.g., "resume" and "job" are close; "fee" and "tuition" are related).

By using GloVe:
- We can represent **words that don't appear often (or at all)** in our training corpus.
- We benefit from **external general knowledge**, which helps improve chatbot responses.
- It allows us to **compare** performance between our custom Word2Vec and industry-standard embeddings.

We use the 100-dimensional `glove.6B.100d.txt` file.
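
The GloVe file is plain text: each line holds a word followed by its vector values, separated by spaces, and there is no header line stating the vocabulary size and dimension. The sketch below parses two made-up lines in that layout, shortened to 4 dimensions. The `read_glove` helper is a hypothetical name, not part of gensim.

```python
import io
import numpy as np

# Two made-up lines in GloVe's text layout (real lines have 100 values)
fake_glove = io.StringIO(
    "resume 0.1 0.2 0.3 0.4\n"
    "job 0.2 0.1 0.4 0.3\n"
)

def read_glove(handle) -> dict[str, np.ndarray]:
    """Parse 'word v1 v2 ...' lines into a word -> vector dictionary."""
    vectors = {}
    for line in handle:
        word, *values = line.rstrip().split(" ")
        vectors[word] = np.array([float(v) for v in values], dtype=np.float32)
    return vectors

vectors = read_glove(fake_glove)
print(sorted(vectors), vectors["resume"].shape)
```

A dictionary like this is enough for lookups, but loading through gensim's `KeyedVectors` also gives similarity queries and a single interface shared with the Word2Vec model.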


### Setting up the GloVe model

In [7]:
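# Path where GloVe will be stored
glove_path = pathlib.Path("../models/embeddings/glove.6B.100d.txt")

# Download and unzip if not already present
if not glove_path.exists():
    import zipfile, requests
    zip_url = "http://nlp.stanford.edu/data/glove.6B.zip"
    zip_path = glove_path.parent / "glove.6B.zip"
    glove_path.parent.mkdir(parents=True, exist_ok=True)  # Make sure the target folder exists

    print("Downloading GloVe...")
    # Stream to disk in chunks: the archive is large, so avoid holding it all in memory
    with requests.get(zip_url, stream=True, timeout=60) as resp:
        resp.raise_for_status()  # Fail early on HTTP errors instead of saving an error page
        with open(zip_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    print("Unzipping...")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(glove_path.parent)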



In [8]:
# Load GloVe embeddings (no_header=True because GloVe files lack the word2vec header line)
glove = gensim.models.KeyedVectors.load_word2vec_format(glove_path, binary=False, no_header=True)

## **Vectorize the Whole Corpus**


In [9]:

# Sentence vector: average the word vectors of all in-vocabulary tokens (works with GloVe or Word2Vec)
def sent_vector(sent, model, dim=100):
    tokens = preprocess(sent)
    vecs = [model[t] for t in tokens if t in model]  # Look up each token; skip out-of-vocabulary words
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
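
The averaging logic can be checked in isolation with a toy lookup table. In the sketch below, `toy_model` is a made-up 2-dimensional dictionary standing in for GloVe or Word2Vec, and `mean_vector` mirrors the pooling step without the spaCy preprocessing.

```python
import numpy as np

# Hypothetical 2-dimensional lookup table, not real embeddings
toy_model = {
    "tuition": np.array([1.0, 0.0]),
    "fee": np.array([0.0, 1.0]),
}

def mean_vector(tokens, model, dim):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    vecs = [model[t] for t in tokens if t in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

known = mean_vector(["tuition", "fee", "xyzzy"], toy_model, dim=2)  # "xyzzy" is skipped
unknown = mean_vector(["xyzzy"], toy_model, dim=2)                  # nothing to average
print(known, unknown)
```

Note that the zero-vector fallback has no direction, so cosine similarity against it is undefined. Any downstream search needs to treat such rows as a special case.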


In [10]:
# Vectorize all questions/resources with GloVe
tqdm.pandas(desc="Vectorizing with GloVe") # Progress bar
vectors_glove = combined_texts.progress_apply(lambda s: sent_vector(s, glove, dim=100))

# Vectorize all questions/resources with Word2Vec
w2v = Word2Vec.load("../models/embeddings/word2vec_faqs.bin")
tqdm.pandas(desc="Vectorizing with Word2Vec")
vectors_w2v = combined_texts.progress_apply(lambda s: sent_vector(s, w2v.wv, dim=100))

# Save as pickle for future use
pd.DataFrame({"vec_glove": vectors_glove}).to_pickle("../data/processed/glove_vectors.pkl") #Saving vectors
print("GloVe embeddings saved.")

pd.DataFrame({"vec_w2v": vectors_w2v}).to_pickle("../data/processed/word2vec_vectors.pkl")
print("✅ Word2Vec embeddings saved.")



Vectorizing with GloVe: 100%|██████████| 456/456 [00:01<00:00, 245.95it/s]
Vectorizing with Word2Vec: 100%|██████████| 456/456 [00:01<00:00, 251.39it/s]


GloVe embeddings saved.
✅ Word2Vec embeddings saved.
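
Once the vectors are saved, a typical next step is to stack them into a matrix and rank the rows by cosine similarity to a query vector. The sketch below is one possible way to do that, with a made-up 2-dimensional matrix in place of the saved vectors. The `top_k` helper is an assumption introduced here, not part of the notebook.

```python
import numpy as np

def top_k(query_vec: np.ndarray, matrix: np.ndarray, k: int = 3) -> list[int]:
    """Return the indices of the k rows most cosine-similar to query_vec."""
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    # Zero vectors (texts with no known tokens) get similarity 0 instead of NaN
    sims = np.divide(matrix @ query_vec, denom,
                     out=np.zeros(len(matrix)), where=denom > 0)
    return np.argsort(-sims)[:k].tolist()

# Made-up stand-in for the stacked sentence vectors; row 2 is an all-zero vector
matrix = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 0.0], [0.0, 1.0]])
best = top_k(np.array([1.0, 0.1]), matrix, k=2)
print(best)
```

With the real data, the matrix could be built from the pickled column, for instance with `np.vstack(pd.read_pickle(path)["vec_glove"].tolist())`, where `path` points to the file saved above.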
