After cleaning, normalizing, and labeling, the data is ready for training the embedding models.

In [2]:
# Train Word2Vec & FastText
from gensim.models import Word2Vec, FastText
import pandas as pd
from pathlib import Path

# The cleaned, normalized, labeled data files that are processed in "data_prcocessing.ipynb" 
files = [
    "data/clean/labeled-sentiment_2col.xlsx",
    "data/clean/test__1__2col.xlsx",
    "data/clean/train__3__2col.xlsx",
    "data/clean/train-00000-of-00001_2col.xlsx",
    "data/clean/merged_dataset_CSV__1__2col.xlsx",
]

sentences = []
for f in files:
    df = pd.read_excel(f, usecols=["cleaned_text"])
    sentences.extend(df["cleaned_text"].astype(str).str.split().tolist())

Path("embeddings").mkdir(exist_ok=True)  # Create a folder named embeddings

w2v = Word2Vec(sentences=sentences, vector_size=300, window=5,
               min_count=3, sg=1, negative=10, epochs=10, seed=42)  # Train Word2Vec
w2v.save("embeddings/word2vec.model")

print("Saved Word2Vec model")

Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'


Saved Word2Vec model


In [3]:
ft = FastText(sentences=sentences, vector_size=300, window=5,
              min_count=3, sg=1, min_n=3, max_n=6, epochs=10, seed=42)  # Train FastText
ft.save("embeddings/fasttext.model")

print("Saved FastText model")

Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'


Saved FastText model


Next, we compare **Word2Vec** and **FastText** on three metrics:  
- Coverage  
- Synonym/antonym similarity  
- Nearest-neighbor quality

In [5]:
# Compare Word2Vec vs FastText
from gensim.models import Word2Vec, FastText
import numpy as np

w2v = Word2Vec.load("embeddings/word2vec.model")
ft = FastText.load("embeddings/fasttext.model")

seed_words = ["yaxşı","pis","çox","bahalı","ucuz","mükəmməl","dəhşət",
              "<PRICE>","<RATING_POS>"]

syn_pairs = [("yaxşı","əla"), ("bahalı","qiymətli"), ("ucuz","sərfəli")]
ant_pairs = [("yaxşı","pis"), ("bahalı","ucuz")]

def read_tokens(f):
    df = pd.read_excel(f, usecols=["cleaned_text"])
    return [t for row in df["cleaned_text"].astype(str) for t in row.split()]

def lexical_coverage(model, tokens):
    vocab = model.wv.key_to_index
    return sum(1 for t in tokens if t in vocab) / max(1, len(tokens))

print("== Lexical coverage (per dataset) ==")
for f in files:
    toks = read_tokens(f)
    cov_w2v = lexical_coverage(w2v, toks)
    cov_ftv = lexical_coverage(ft, toks)
    print(f"{f}: W2V={cov_w2v:.3f}, FT(vocab)={cov_ftv:.3f}")


== Lexical coverage (per dataset) ==
data/clean/labeled-sentiment_2col.xlsx: W2V=0.932, FT(vocab)=0.932
data/clean/test__1__2col.xlsx: W2V=0.987, FT(vocab)=0.987
data/clean/train__3__2col.xlsx: W2V=0.990, FT(vocab)=0.990
data/clean/train-00000-of-00001_2col.xlsx: W2V=0.943, FT(vocab)=0.943
data/clean/merged_dataset_CSV__1__2col.xlsx: W2V=0.949, FT(vocab)=0.949


**Lexical coverage** measures the fraction of tokens in a dataset that are present in the model's vocabulary.  
It indicates how well the model knows the words of the corpus.  

 $ \text{Coverage} = \frac{\text{Number of tokens in vocabulary}}{\text{Total number of tokens in the dataset}} $.


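Comparing the **lexical coverage** of both models on each dataset, the values range from 0.932 to 0.990, which is good. The two models score identically on every dataset, which is expected: both were trained on the same sentences with `min_count=3`, so they share the same vocabulary. Note that this metric only checks the stored vocabulary; FastText can still compose a vector for an out-of-vocabulary word from its character n-grams, which this number does not reflect.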
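As a small worked example of the coverage formula, the sketch below applies it to a toy vocabulary and token list. The names `toy_vocab` and `toy_tokens` are hypothetical and the values are not taken from the datasets above; the function mirrors the logic of `lexical_coverage` but takes a plain set instead of a trained model.

```python
def lexical_coverage(vocab, tokens):
    """Fraction of token occurrences that are present in vocab."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in vocab) / len(tokens)

# Hypothetical toy data, for illustration only.
toy_vocab = {"yaxşı", "pis", "çox", "ucuz"}
toy_tokens = ["çox", "yaxşı", "məhsul", "çox", "ucuz"]

coverage = lexical_coverage(toy_vocab, toy_tokens)
print(f"coverage = {coverage:.2f}")  # 4 of the 5 tokens are in the vocabulary
```

Because the count is over token occurrences, a frequent word such as `çox` contributes once per occurrence, so a few rare unknown words lower the score only slightly.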

In [8]:
def cos(a,b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_sim(model, pairs):  # compute similarity for each pair
    vals = []
    for a,b in pairs:
        try:
            vals.append(model.wv.similarity(a,b))
        except KeyError:
            pass
    return sum(vals)/len(vals) if vals else float('nan')

syn_w2v = pair_sim(w2v, syn_pairs)
syn_ft = pair_sim(ft, syn_pairs)
ant_w2v = pair_sim(w2v, ant_pairs)
ant_ft = pair_sim(ft, ant_pairs)

print("\n== Similarity ==")
print(f"Synonyms: W2V={syn_w2v:.3f}, FT={syn_ft:.3f}")
print(f"Antonyms: W2V={ant_w2v:.3f}, FT={ant_ft:.3f}")
print(f"Separation: W2V={syn_w2v - ant_w2v:.3f}, FT={syn_ft - ant_ft:.3f}")

def neighbors(model, word, k=5):
    try:
        return [w for w,_ in model.wv.most_similar(word, topn=k)]
    except KeyError:
        return []

print("\n== Nearest Neighbors ==")
# seed_words is defined in the comparison cell above.
for w in seed_words:
    print(f"  W2V NN for '{w}':", neighbors(w2v, w))
    print(f"  FT NN for '{w}':", neighbors(ft, w))



== Similarity ==
Synonyms: W2V=0.355, FT=0.442
Antonyms: W2V=0.349, FT=0.436
Separation: W2V=0.007, FT=0.005

== Nearest Neighbors ==
  W2V NN for 'yaxşı': ['<RATING_POS>', 'iyi', 'yaxshi', 'yaxsı', 'awsome']
  FT NN for 'yaxşı': ['yaxşıı', 'yaxşıkı', 'yaxşıca', 'yaxş', 'yaxşıya']
  W2V NN for 'pis': ['vərdişlərə', 'lire', 'kardeşi', 'xalçalardan', 'günd']
  FT NN for 'pis': ['piis', 'pi', 'pisdii', 'pixlr', 'pisi']
  W2V NN for 'çox': ['bəyənilsin', 'çoox', 'çöx', 'gözəldir', 'əladir']
  FT NN for 'çox': ['çoxçox', 'çoxx', 'çoxh', 'ço', 'çoh']
  W2V NN for 'bahalı': ['yaxtaları', 'metallarla', 'villaları', 'radiusda', 'portretlerinə']
  FT NN for 'bahalı': ['bahalıı', 'bahalısı', 'bahalıq', 'baharlı', 'pahalı']
  W2V NN for 'ucuz': ['düzəltdirilib', 'baha', 'qiymete', 'keyfiyetli', 'sududu']
  FT NN for 'ucuz': ['ucuzu', 'ucuza', 'ucuzdu', 'ucuzluğa', 'ucuzdur']
  W2V NN for 'mükəmməl': ['möhtəşəmm', 'kəliməylə', 'yaradilanlarin', 'mukəmməl', 'möhdəşəm']
  FT NN for 'mükəmməl': ['mük

#### Synonym/Antonym Similarity
Similarity in word embeddings measures how close two vectors are, which serves as a proxy for closeness in meaning.  
It is computed as the cosine similarity:

$ \text{cosine\_similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} $.

Values range from -1 to +1:
- +1 means the vectors point in the same direction (very similar).
- 0 means the vectors are orthogonal (unrelated).
- -1 means the vectors point in opposite directions.
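The three reference values can be checked on small hand-made vectors. This is a minimal sketch in plain Python; the vectors are hypothetical two-dimensional examples, not embeddings from either model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v = [1.0, 2.0]
same_direction = cosine_similarity(v, [2.0, 4.0])    # scaled copy of v, close to +1
orthogonal = cosine_similarity(v, [-2.0, 1.0])       # dot product is 0, so 0
opposite = cosine_similarity(v, [-1.0, -2.0])        # negated v, close to -1

print(same_direction, orthogonal, opposite)
```

Cosine similarity ignores vector length: `[2.0, 4.0]` is twice as long as `v` but still scores approximately +1.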

**Separation** measures how well the model distinguishes synonyms from antonyms:
$ \text{Separation} = \text{mean(similarity of synonyms)} - \text{mean(similarity of antonyms)} $.

If the separation is large, the model clearly treats synonyms as more similar than antonyms.  
If it is small, the model does not distinguish them well.
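To make the arithmetic concrete, here is a short sketch that computes separation from per-pair similarity scores. The scores in `syn_sims` and `ant_sims` are hypothetical values chosen for illustration, not outputs of the models above.

```python
def mean(values):
    return sum(values) / len(values)

def separation(syn_sims, ant_sims):
    """Mean synonym similarity minus mean antonym similarity."""
    return mean(syn_sims) - mean(ant_sims)

# Hypothetical per-pair cosine similarities.
syn_sims = [0.70, 0.60, 0.50]  # mean 0.60
ant_sims = [0.30, 0.20]        # mean 0.25

sep = separation(syn_sims, ant_sims)
print(f"Separation = {sep:.2f}")  # 0.60 - 0.25
```

A value near zero means synonym and antonym pairs receive about the same similarity on average, while a negative value would mean antonyms score higher than synonyms.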

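In the output, **FastText** gives higher similarity scores for both synonym and antonym pairs (0.442 and 0.436, versus 0.355 and 0.349 for **Word2Vec**). Higher antonym similarity is not an improvement, though: separation is close to zero for both models (0.007 for Word2Vec, 0.005 for FastText), meaning neither one separates synonyms from antonyms. This is common for embeddings trained on co-occurrence, because antonyms tend to appear in the same contexts.

**NOTE:** A limited corpus size or insufficient domain balance can cause poor results. The averages here are also based on only three synonym pairs and two antonym pairs, so they are rough estimates.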


#### Nearest-Neighbor Quality
The **nearest neighbors (NN)** of a word are the vocabulary words whose vectors are closest to that word's vector, ranked by cosine similarity.
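The ranking step can be sketched without a trained model. The example below is a simplified stand-in for what `most_similar` does, assuming a brute-force scan over a tiny dictionary of vectors; `toy_vectors` and its values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_neighbors(vectors, word, k=2):
    """Return the k words most similar to `word`, excluding the word itself."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return [other for other, _ in scored[:k]]

# Hypothetical 2-dimensional "embeddings", for illustration only.
toy_vectors = {
    "ucuz": [0.9, 0.1],
    "sərfəli": [0.8, 0.2],
    "bahalı": [0.1, 0.9],
    "qiymətli": [0.2, 0.8],
}

neighbors_of_ucuz = nearest_neighbors(toy_vectors, "ucuz", k=2)
print(neighbors_of_ucuz)
```

Here `sərfəli` ranks first because its vector points in almost the same direction as that of `ucuz`.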

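In the output, `FastText` gives more consistent results than `Word2Vec`: its neighbors are mostly spelling variants and inflected forms of the seed word (for example `ucuzu`, `ucuza`, `ucuzdur` for `ucuz`), while several `Word2Vec` lists (for example those for `pis` and `bahalı`) consist mostly of unrelated words. Keep in mind that the FastText neighbors are close in form rather than distinct synonyms, which reflects its character n-gram model.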
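The tendency of FastText to return look-alike words follows from how it builds word vectors out of character n-grams. The sketch below is a simplified illustration, assuming the `min_n=3`, `max_n=6` settings used in training and the `<`/`>` word-boundary markers; the helper `char_ngrams` is hypothetical, and the real model additionally hashes n-grams into buckets and keeps a vector for the whole word.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Set of character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

base = char_ngrams("ucuz")
inflected = char_ngrams("ucuza")
shared = base & inflected

print(sorted(shared))
print(f"{len(shared)} of {len(base)} n-grams of 'ucuz' also occur in 'ucuza'")
```

Words that share many n-grams share many of the components that are combined into their vectors, which is consistent with inflections and misspellings dominating the FastText neighbor lists.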