The `re` module lets us perform pattern matching with regular expressions.  
`re.compile()` converts a pattern string into a **compiled regex object**, which has methods such as `.search()`, `.findall()`, and `.match()`.  



In [2]:
import re  #regular expression

# re.I -> flag for case-insensitive matching ("trend" == "Trend")
# (alternative1|alternative2|...) matches any one of the alternatives
NEWS_HINTS = re.compile(r"\b(apa|trend|azertac|reuters|bloomberg|dha|aa)\b", re.I)
SOCIAL_HINTS = re.compile(r"\b(rt)\b|@|#|(?:😂|😍|😊|👍|👎|😡|🙂)")
REV_HINTS = re.compile(r"\b(azn|manat|qiymət|aldım|ulduz|çox yaxşı|çox pis)\b", re.I)

def detect_domain(text: str) -> str:
    s = text.lower()
    if NEWS_HINTS.search(s):
        return "news"
    if SOCIAL_HINTS.search(s):
        return "social"
    if REV_HINTS.search(s):
        return "reviews"
    return "general"
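
Because `detect_domain` returns on the first pattern that matches, a text containing both a news hint and a social hint is labeled `news`. The sketch below illustrates that precedence with shortened, hypothetical hint lists (`NEWS`, `SOCIAL`, `REVIEWS`, and `detect` are stand-ins for illustration, not the patterns used above):

```python
import re

# Shortened stand-in patterns (assumptions for illustration only).
NEWS = re.compile(r"\b(reuters|trend)\b", re.I)
SOCIAL = re.compile(r"\brt\b|@|#")
REVIEWS = re.compile(r"\b(manat|ulduz)\b", re.I)

def detect(text: str) -> str:
    """Return the first domain whose pattern matches, else 'general'."""
    s = text.lower()
    for name, pattern in (("news", NEWS), ("social", SOCIAL), ("reviews", REVIEWS)):
        if pattern.search(s):
            return name
    return "general"

examples = [
    "Reuters: #iqtisadiyyat xəbərləri",  # news hint and hashtag: news is checked first
    "@dost bu gün hava necədir",         # only a social hint
    "20 manat verdim, 5 ulduz",          # only review hints
    "Sabah görüşərik",                   # no hint at all
]
labels = [detect(t) for t in examples]
print(labels)
```

Reordering the tuple inside `detect` would change which label wins for mixed texts, so the order is part of the rule set.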


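So there are four possible domains: `news`, `social`, `reviews`, and `general`. The patterns are checked in that order and the first match wins; `general` is the fallback when nothing matches.  
In the following cell, we detect **review-specific expressions** in Azerbaijani text (e.g. 20 manat, 5 ulduz) and replace them with placeholder tokens, so that models such as Word2Vec and FastText can later learn meaningful patterns from them.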

In [4]:
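# --- Domain-specific normalization (reviews) ---
# NOTE: PRICE_RE and STARS_RE expect digits. normalize_text_az() (defined below)
# replaces digits with <NUM> by default, so these two patterns only fire on text
# that still contains its numbers.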

PRICE_RE = re.compile(r"\b\d+\s*(azn|manat)\b", re.I)
STARS_RE = re.compile(r"\b([1-5])\s*ulduz\b", re.I)
POS_RATE = re.compile(r"\bçox yaxşı\b")
NEG_RATE = re.compile(r"\bçox pis\b")

def domain_specific_normalize(cleaned: str, domain: str) -> str:
    if domain == "reviews":
        s = PRICE_RE.sub(" <PRICE> ", cleaned)
        s = STARS_RE.sub(lambda m: f" <STARS_{m.group(1)}> ", s)
        s = POS_RATE.sub(" <RATING_POS> ", s)
        s = NEG_RATE.sub(" <RATING_NEG> ", s)
        return " ".join(s.split())
    return cleaned
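
As a minimal sketch of what these substitutions produce, the snippet below applies the same kind of patterns to a lowercased review that still contains its digits. `PRICE`, `STARS`, and `tag_review` are illustrative names, and the input string is a made-up example rather than the output of any cleaning step:

```python
import re

PRICE = re.compile(r"\b\d+\s*(?:azn|manat)\b", re.I)
STARS = re.compile(r"\b([1-5])\s*ulduz\b", re.I)

def tag_review(text: str) -> str:
    """Replace prices and star ratings with placeholder tokens."""
    s = PRICE.sub(" <PRICE> ", text)
    s = STARS.sub(lambda m: f" <STARS_{m.group(1)}> ", s)
    return " ".join(s.split())  # collapse the extra spaces added around tags

tagged = tag_review("telefonu 450 manat aldım, 5 ulduz verirəm")
print(tagged)
```

The star count is kept inside the token (`<STARS_5>`), while the price amount is discarded on purpose: only the fact that a price was mentioned is preserved.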


A domain tag is a prefix added to each text line to help models distinguish between different types of text.  
(The models can be `Word2Vec`, `FastText`, or `BERT`.)

In [6]:
# Domain tag token for the corpus, e.g. "domreviews " + text
def add_domain_tag(line: str, domain: str) -> str:
    return f"dom{domain} " + line

The following cell handles:
- **Encoding problems**  (l’humanitÃ© → l'humanité)
- **URLs**  (https://... → `URL`)
- **Repeated punctuation**  (????? → ?)
- **Emojis**  (😊 → `EMO_POS`)
- **Informal writing**  (slm → salam)

In [8]:
# -*- coding: utf-8 -*-
import html, unicodedata
import pandas as pd
from pathlib import Path

# ftfy ("fixes text for you") repairs encoding problems: "l’humanitÃ©" --> "l'humanité"
try:
    from ftfy import fix_text
except ImportError:
    # Fall back to a no-op if ftfy is not installed
    def fix_text(s): return s

# Azerbaijani-aware lowercase
def lower_az(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = unicodedata.normalize("NFC", s)
    s = s.replace("I", "ı").replace("İ", "i") #Azerbaijani casing rules
    s = s.lower().replace("i̇", "i")
    return s


# These define patterns to detect unwanted elements
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(https?://\S+|www\.\S+)", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\-\s\(\)]{6,}\d")
USER_RE = re.compile(r"@\w+")
MULTI_PUNCT = re.compile(r"([!?.,;:])\1{1,}")
MULTI_SPACE = re.compile(r"\s+")
REPEAT_CHARS = re.compile(r"(.)\1{2,}", flags=re.UNICODE) # cooool -> cool

# TOKEN_RE defines what a valid token looks like
TOKEN_RE = re.compile(
    r"[A-Za-zƏəĞğIıİiÖöÜüÇçŞşXxQq]+(?:'[A-Za-zƏəĞğIıİiÖöÜüÇçŞşXxQq]+)?"
    r"|<NUM>|URL|EMAIL|PHONE|USER|EMO_(?:POS|NEG)"
)

# Before tokenization, we replace emojis with one of two tags.
# This preserves the sentiment signal even though punctuation is stripped later.
EMO_MAP = {
    "🙂": "EMO_POS", "😀": "EMO_POS", "😍": "EMO_POS", "😊": "EMO_POS" ,"👍": "EMO_POS",
    "☹": "EMO_NEG", "🙁": "EMO_NEG", "😠": "EMO_NEG", "😡": "EMO_NEG", "👎": "EMO_NEG"
}

# slang map to standardize common informal forms
SLANG_MAP = {"slm": "salam", "tmm": "tamam", "sagol": "sağol", "cox": "çox", "yaxsi": "yaxşı"}
NEGATORS = {"yox", "deyil", "heç", "qətiyyən", "yoxdur"}
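
The reason `lower_az` replaces `I` and `İ` before calling `str.lower()` is that Python's default lowercasing follows language-neutral Unicode rules, which do not match Azerbaijani casing. A small sketch of the difference (`lower_az_sketch` is a hypothetical stand-in that mirrors the function above):

```python
import unicodedata

def lower_az_sketch(s: str) -> str:
    """Lowercase with Azerbaijani dotted/dotless i handling."""
    s = unicodedata.normalize("NFC", s)
    s = s.replace("I", "ı").replace("İ", "i")  # I -> ı, İ -> i
    return s.lower().replace("i\u0307", "i")   # drop a stray combining dot

word = "IŞIQ"                  # "light" in Azerbaijani
print(word.lower())            # default rules turn I into dotted i
print(lower_az_sketch(word))   # Azerbaijani rules keep the dotless ı
print(len("İ".lower()))        # default lowercasing of İ yields two code points
```

Without this step, the same word could end up as two different vocabulary entries depending on how it was capitalized in the source text.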

In the code snippet below, we clean and tokenize the text using what we defined above (the regex patterns and the `lower_az()` function).

In [10]:
def normalize_text_az(s: str, numbers_to_token=True, keep_sentence_punct=False) -> str:
    if not isinstance(s, str):
        return ""

    # emoji map
    for emo, tag in EMO_MAP.items():
        s = s.replace(emo, f" {tag} ")  # convert emojis to EMO_* tags

    s = fix_text(s)  # clean encoding problems
    s = html.unescape(s)  # decode HTML entities
    s = HTML_TAG_RE.sub(" ", s)  # strip any <tag> markup
    s = URL_RE.sub(" URL ", s)  # replace links, e-mail addresses and phone numbers with placeholders
    s = EMAIL_RE.sub(" EMAIL ", s)
    s = PHONE_RE.sub(" PHONE ", s)

    # Remove the # hashtag symbol but keep the inner text
    # if the inner text is written camelCase, insert space
    s = re.sub(r"#([A-Za-z0-9_]+)", lambda m: " " +
               re.sub('([a-z])([A-Z])', r'\1 \2', m.group(1)) + " ", s)

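    # convert @name to USER
    s = USER_RE.sub(" USER ", s)
    # Azerbaijani-aware lowercasing. NOTE: this also lowercases the placeholders
    # inserted above (e.g. URL -> url, EMO_POS -> emo_pos).
    s = lower_az(s)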

    s = MULTI_PUNCT.sub(r"\1", s)

    if numbers_to_token:
        s = re.sub(r"\d+", " <NUM> ", s)

    if keep_sentence_punct:
        s = re.sub(r"[^\w\s<>'əğıöşüçƏĞIİÖŞÜÇxqXQ.!?]", " ", s)
    else:
        s = re.sub(r"[^\w\s<>'əğıöşüçƏĞIİÖŞÜÇxqXQ]", " ", s)

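    s = MULTI_SPACE.sub(" ", s).strip()  # collapse runs of whitespace into one space
    toks = TOKEN_RE.findall(s)  # keep only substrings that look like valid tokens

    norm = []  # normalized output tokens
    mark_neg = 0  # how many upcoming tokens still get the _NEG suffix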
    for t in toks:
        t = REPEAT_CHARS.sub(r"\1\1", t)
        t = SLANG_MAP.get(t, t)

        if t in NEGATORS:
            norm.append(t)
            mark_neg = 3
            continue

        if mark_neg > 0 and t not in {"URL", "EMAIL", "PHONE", "USER"}:
            norm.append(t + "_NEG")
            mark_neg -= 1 
        else:
            norm.append(t)

    # Remove single-character tokens except "o" and "e"
    norm = [t for t in norm if not (len(t) == 1 and t not in {"o", "e"})]
    return " ".join(norm).strip()
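
The negation handling at the end of `normalize_text_az` can be isolated as a small sketch: after a negator, up to three following tokens receive a `_NEG` suffix. The names `mark_negation` and `window` are illustrative, the negator set is copied from `NEGATORS`, and the special case for placeholder tokens is left out for brevity:

```python
NEGATORS = {"yox", "deyil", "heç", "qətiyyən", "yoxdur"}

def mark_negation(tokens, window=3):
    """Append _NEG to up to `window` tokens that follow a negator."""
    out, remaining = [], 0
    for tok in tokens:
        if tok in NEGATORS:
            out.append(tok)
            remaining = window  # (re)start the window at every negator
        elif remaining > 0:
            out.append(tok + "_NEG")
            remaining -= 1
        else:
            out.append(tok)
    return out

marked = mark_negation("bu film heç yaxşı alınmayıb amma musiqi əla".split())
print(marked)
```

Marked tokens become separate vocabulary entries (`yaxşı` and `yaxşı_NEG`), which lets an embedding model learn different vectors for negated and non-negated uses.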


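This function standardizes labels from different datasets into a uniform numeric sentiment value for training: 1.0 for positive, 0.5 for neutral, and 0.0 for negative. Labels it cannot interpret are mapped to `None`.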

In [12]:
def map_sentiment_value(v, scheme: str):  # v : raw sentiment label
    if scheme == "binary":
        try:
            return 1.0 if int(v) == 1 else 0.0
        except Exception:
            return None

    s = str(v).strip().lower() 
    if s in {"pos","positive","1","müsbət","good","pozitiv"}:
        return 1.0
    if s in {"neu","neutral","2","neytral"}:
        return 0.5
    if s in {"neg","negative","0","mənfi","bad","neqativ"}:
        return 0.0
    return None
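
To illustrate the three-class case, here is a hypothetical lookup-table variant of the same idea. `TRI_MAP` and `to_score` are illustrative names, and only a subset of the label spellings above is included:

```python
# Hypothetical lookup-table variant of the three-class mapping above.
TRI_MAP = {}
for value, names in ((1.0, ("pos", "positive", "1", "müsbət")),
                     (0.5, ("neu", "neutral", "2", "neytral")),
                     (0.0, ("neg", "negative", "0", "mənfi"))):
    for name in names:
        TRI_MAP[name] = value

def to_score(label):
    """Return 1.0 / 0.5 / 0.0 for a known label, or None if it is unrecognized."""
    return TRI_MAP.get(str(label).strip().lower())

raw_labels = ["Positive", " neg ", 2, "müsbət", "unknown"]
scores = [to_score(v) for v in raw_labels]
print(scores)
```

Returning `None` for unknown labels makes it easy to drop those rows afterwards with `dropna`, which is what the pipeline below does.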


##### The following code snippet combines the steps above.
- `Reading` pd.read_excel()
- `Cleaning` dropna(), drop_duplicates()
- `Normalization` normalize_text_az()
- `Domain Detection` detect_domain()
- `Label Mapping` map_sentiment_value()
- `Exporting` out_df.to_excel()


In [14]:
def process_file(in_path, text_col, label_col, scheme, out_two_col_path, remove_stopwords=False):
    df = pd.read_excel(in_path)

    for c in ["Unnamed: 0", "index"]:  # Remove useless columns
        if c in df.columns:
            df = df.drop(columns=[c])

    assert text_col in df.columns and label_col in df.columns  # Check needed columns exist

    df = df.dropna(subset=[text_col])  # Remove null cells
    df = df[df[text_col].astype(str).str.strip().str.len() > 0]
    df = df.drop_duplicates(subset=[text_col])  # Remove duplicates

    # Normalize the text with normalize_text_az()
    df["cleaned_text"] = df[text_col].astype(str).apply(normalize_text_az)
    # Detect the domain on the raw text with detect_domain()
    df["domain"] = df[text_col].astype(str).apply(detect_domain)
    # Apply domain_specific_normalize() (only changes rows in the "reviews" domain)
    df["cleaned_text"] = df.apply(
        lambda r: domain_specific_normalize(r["cleaned_text"], r["domain"]),
        axis=1
    )

    # Stopwords carry little information for model learning.
    # They are also very frequent, which increases memory use.
    if remove_stopwords:
        sw = set(["və","ilə","amma","ancaq","lakin","ya","həm","ki","bu","bir",
                  "o","biz","siz","mən","sən","orada","burada","bütün",
                  "hər","artıq","çox","az","ən","də","da","üçün"])
        for keep in ["deyil","yox","heç","qətiyyən","yoxdur"]:
            sw.discard(keep)

        df["cleaned_text"] = df["cleaned_text"].apply(
            lambda s: " ".join([t for t in s.split() if t not in sw])
        )

    # Map raw labels to numeric sentiment values with map_sentiment_value()
    df["sentiment_value"] = df[label_col].apply(lambda v: map_sentiment_value(v, scheme))
    df = df.dropna(subset=["sentiment_value"])  # Drop null sentiment values
    df["sentiment_value"] = df["sentiment_value"].astype(float)

    # Keep two columns: "cleaned_text" and "sentiment_value".
    # These files are needed later for the corpus file and the embedding step.
    out_df = df[["cleaned_text", "sentiment_value"]].reset_index(drop=True)
    Path(out_two_col_path).parent.mkdir(parents=True, exist_ok=True)
    out_df.to_excel(out_two_col_path, index=False)
    print(f"Saved: {out_two_col_path} (rows={len(out_df)})")


##### The following function merges all Excel datasets into a single txt file where each line:
- contains one sentence
- starts with a domain tag
- is lowercased, punctuation-free, and ready for Word2Vec/FastText training

In [16]:
def build_corpus_txt(input_files, text_cols, out_txt="corpus_all.txt"):
    lines = []
    for (f, text_col) in zip(input_files, text_cols):
        df = pd.read_excel(f)
        for raw in df[text_col].dropna().astype(str):
            dom = detect_domain(raw)
            s = normalize_text_az(raw, keep_sentence_punct=True)
            parts = re.split(r"[.!?]+", s)
            for p in parts:
                p = p.strip()
                if not p:
                    continue
                p = re.sub(r"[^\w\səğıöşüçƏĞIİÖŞÜÇxqXQ]", " ", p)
                p = " ".join(p.split()).lower()
                if p:
                    lines.append(f"dom{dom} " + p)

    with open(out_txt, "w", encoding="utf-8") as w:
        for ln in lines:
            w.write(ln + "\n")
    print(f"Wrote {out_txt} with {len(lines)} lines")
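
To make the line format concrete, here is a minimal sketch of the split-and-tag step applied to an already normalized string. `to_corpus_lines` is a hypothetical helper, and the domain label is passed in directly rather than detected:

```python
import re

def to_corpus_lines(normalized: str, domain: str) -> list:
    """Split on sentence punctuation and prefix each sentence with a domain tag."""
    lines = []
    for part in re.split(r"[.!?]+", normalized):
        part = re.sub(r"[^\w\s]", " ", part)   # drop remaining punctuation
        part = " ".join(part.split()).lower()  # collapse whitespace
        if part:
            lines.append(f"dom{domain} {part}")
    return lines

corpus_lines = to_corpus_lines("çox bəyəndim! qiyməti də sərfəlidir. təşəkkürlər", "reviews")
print(corpus_lines)
```

Each returned string corresponds to one line of the output file, with the domain tag acting as an ordinary first token of the sentence.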


Execution

In [18]:
if __name__ == "__main__":
    CFG = [
        ("labeled-sentiment.xlsx", "text", "sentiment", "tri"),
        ("test__1_.xlsx", "text", "label", "binary"),
        ("train__3_.xlsx", "text", "label", "binary"),
        ("train-00000-of-00001.xlsx", "text", "labels", "tri"),
        ("merged_dataset_CSV__1_.xlsx", "text", "labels", "binary"),
    ]

    for fname, tcol, lcol, scheme in CFG:
        out = f"{Path(fname).stem}_2col.xlsx"
        process_file(fname, tcol, lcol, scheme, out, remove_stopwords=False)

    build_corpus_txt([c[0] for c in CFG], [c[1] for c in CFG], out_txt="corpus_all.txt")


Saved: labeled-sentiment_2col.xlsx (rows=2955)
Saved: test__1__2col.xlsx (rows=4198)
Saved: train__3__2col.xlsx (rows=19557)
Saved: train-00000-of-00001_2col.xlsx (rows=41756)
Saved: merged_dataset_CSV__1__2col.xlsx (rows=55662)
Wrote corpus_all.txt with 124353 lines


After cleaning, normalizing, and labeling, the data is ready for training the embedding models.

In [None]:
# Train Word2Vec & FastText
from gensim.models import Word2Vec, FastText
import pandas as pd
from pathlib import Path

# The cleaned, normalized, and labeled files produced by "data_prcocessing.ipynb"
files = [
    "labeled-sentiment_2col.xlsx",
    "test__1__2col.xlsx",
    "train__3__2col.xlsx",
    "train-00000-of-00001_2col.xlsx",
    "merged_dataset_CSV__1__2col.xlsx",
]

sentences = []
for f in files:
    df = pd.read_excel(f, usecols=["cleaned_text"])
    sentences.extend(df["cleaned_text"].astype(str).str.split().tolist())

Path("embeddings").mkdir(exist_ok=True)  # Create a folder named embeddings

w2v = Word2Vec(sentences=sentences, vector_size=300, window=5,
               min_count=3, sg=1, negative=10, epochs=10)  # Train Word2Vec
w2v.save("embeddings/word2vec.model")

print("Saved Word2Vec model")

Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'


Saved Word2Vec model


In [None]:
ft = FastText(sentences=sentences, vector_size=300, window=5,
              min_count=3, sg=1, min_n=3, max_n=6, epochs=10)  # Train FastText
ft.save("embeddings/fasttext.model")

print("Saved FastText model")

Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'


Saved FastText model


Now, we will compare **Word2Vec** and **FastText** using the following metrics:  
- Coverage  
- Synonym/antonym similarity  
- Nearest-neighbor quality

In [None]:
# Compare Word2Vec vs FastText
from gensim.models import Word2Vec, FastText
import numpy as np

w2v = Word2Vec.load("embeddings/word2vec.model")
ft = FastText.load("embeddings/fasttext.model")

seed_words = ["yaxşı","pis","çox","bahalı","ucuz","mükəmməl","dəhşət",
              "<PRICE>","<RATING_POS>"]

syn_pairs = [("yaxşı","əla"), ("bahalı","qiymətli"), ("ucuz","sərfəli")]
ant_pairs = [("yaxşı","pis"), ("bahalı","ucuz")]

def read_tokens(f):
    df = pd.read_excel(f, usecols=["cleaned_text"])
    return [t for row in df["cleaned_text"].astype(str) for t in row.split()]

def lexical_coverage(model, tokens):
    vocab = model.wv.key_to_index
    return sum(1 for t in tokens if t in vocab) / max(1, len(tokens))

print("== Lexical coverage (per dataset) ==")
for f in files:
    toks = read_tokens(f)
    cov_w2v = lexical_coverage(w2v, toks)
    cov_ftv = lexical_coverage(ft, toks)
    print(f"{f}: W2V={cov_w2v:.3f}, FT(vocab)={cov_ftv:.3f}")


== Lexical coverage (per dataset) ==
labeled-sentiment_2col.xlsx: W2V=0.932, FT(vocab)=0.932
test__1__2col.xlsx: W2V=0.987, FT(vocab)=0.987
train__3__2col.xlsx: W2V=0.990, FT(vocab)=0.990
train-00000-of-00001_2col.xlsx: W2V=0.943, FT(vocab)=0.943
merged_dataset_CSV__1__2col.xlsx: W2V=0.949, FT(vocab)=0.949


**Lexical coverage** measures the share of tokens in a dataset that are included in the model's vocabulary.  
In other words, it tells us how well the model knows the words of the corpus.  

 $ \text{Coverage} = \frac{\text{Number of tokens in vocabulary}}{\text{Total number of tokens in the dataset}} $.


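When we compare the **lexical coverage** of both models for each dataset, the values are close to 100% (0.932 to 0.990), which is good. The two models have identical coverage because they were trained on the same sentences with the same `min_count=3`, so their vocabularies are the same. Note that this metric only counts whole-word vocabulary hits; FastText can additionally build vectors for out-of-vocabulary words from character n-grams.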
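
As a worked example of the coverage formula, the sketch below uses a toy vocabulary and token stream (made-up values, not taken from the datasets):

```python
def lexical_coverage(vocab, tokens):
    """Fraction of token occurrences that are present in the vocabulary."""
    return sum(1 for t in tokens if t in vocab) / max(1, len(tokens))

# Toy vocabulary and token stream (made-up values for illustration).
toy_vocab = {"çox", "yaxşı", "film", "pis"}
toy_tokens = ["çox", "yaxşı", "film", "idi", "çox", "bəyəndim", "pis", "deyil"]

coverage = lexical_coverage(toy_vocab, toy_tokens)
print(f"coverage = {coverage:.3f}")  # 5 of 8 tokens are in the vocabulary
```

Coverage is counted per token occurrence, so a frequent in-vocabulary word such as `çox` raises the score every time it appears.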

In [None]:
def cos(a,b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_sim(model, pairs):  # compute similarity for each pair
    vals = []
    for a,b in pairs:
        try:
            vals.append(model.wv.similarity(a,b))
        except KeyError:
            pass
    return sum(vals)/len(vals) if vals else float('nan')

syn_w2v = pair_sim(w2v, syn_pairs)
syn_ft = pair_sim(ft, syn_pairs)
ant_w2v = pair_sim(w2v, ant_pairs)
ant_ft = pair_sim(ft, ant_pairs)

print("\n== Similarity ==")
print(f"Synonyms: W2V={syn_w2v:.3f}, FT={syn_ft:.3f}")
print(f"Antonyms: W2V={ant_w2v:.3f}, FT={ant_ft:.3f}")
print(f"Separation: W2V={syn_w2v - ant_w2v:.3f}, FT={syn_ft - ant_ft:.3f}")

def neighbors(model, word, k=5):
    try:
        return [w for w,_ in model.wv.most_similar(word, topn=k)]
    except KeyError:
        return []

print("\n== Nearest Neighbors ==")
# seed_words (defined in the previous cell):
#   ["yaxşı","pis","çox","bahalı","ucuz","mükəmməl","dəhşət","<PRICE>","<RATING_POS>"]
for w in seed_words:
    print(f"  W2V NN for '{w}':", neighbors(w2v, w))
    print(f"  FT NN for '{w}':", neighbors(ft, w))



== Similarity ==
Synonyms: W2V=0.356, FT=0.435
Antonyms: W2V=0.343, FT=0.435
Separation: W2V=0.013, FT=0.001

== Nearest Neighbors ==
  W2V NN for 'yaxşı': ['iyi', '<RATING_POS>', 'yaxshi', 'yaxwi', 'yaxsı']
  FT NN for 'yaxşı': ['yaxşıı', 'yaxşıkı', 'yaxşıca', 'yaxş', 'yaxşıya']
  W2V NN for 'pis': ['günd', '<RATING_NEG>', 'vərdişlərə', 'bugunki', 'sport']
  FT NN for 'pis': ['piis', 'pi', 'pisdii', 'pisi', 'pisə']
  W2V NN for 'çox': ['çöx', 'çoox', 'bəyənilsin', 'gözəldir', 'cooxx']
  FT NN for 'çox': ['çoxçox', 'çoxx', 'çoxh', 'ço', 'çoh']
  W2V NN for 'bahalı': ['yaxtaları', 'metallarla', 'radiusda', 'qabardılır', 'portretlerinə']
  FT NN for 'bahalı': ['bahalıı', 'bahalısı', 'bahalıq', 'baharlı', 'bahalığı']
  W2V NN for 'ucuz': ['şeytanbazardan', 'düzəltdirilib', 'sududu', 'qiymete', 'sorbasi']
  FT NN for 'ucuz': ['ucuzu', 'ucuza', 'ucuzdu', 'ucuzluğa', 'ucuzdur']
  W2V NN for 'mükəmməl': ['möhtəşəmm', 'kəliməylə', 'mukəmməl', 'möhdəşəm', 'bayıldım']
  FT NN for 'mükəmməl': ['

#### Synonym/Antonym Similarity
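Similarity in word embeddings measures how close two vectors are, which serves as a proxy for closeness in meaning.  
Mathematically, it is computed as cosine similarity:

$ \text{cosine\_similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \lVert b \rVert} $

Values range from -1 to +1:
- +1 means the vectors point in the same direction (very similar).
- 0 means the vectors are orthogonal (unrelated).
- -1 means the vectors point in opposite directions.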

**Separation** measures how well the model distinguishes between similar and opposite words.
$ \text{Separation} = \text{mean(similarity of synonyms)} - \text{mean(similarity of antonyms)} $.

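If the separation is large, the model places synonyms clearly closer together than antonyms.  
If it is small, the model is not good at telling them apart.

In the output, **FastText** gives higher similarity than Word2Vec for the synonym pairs (0.435 vs. 0.356), but also for the antonym pairs (0.435 vs. 0.343), which is not an advantage. The separation is close to zero for both models (0.013 for Word2Vec, 0.001 for FastText), meaning that neither model reliably separates synonyms from antonyms. These averages are based on at most three synonym pairs and two antonym pairs, so they should be read as a rough indication only.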

**NOTE:** Limited corpus size or insufficient domain balance can cause bad results.
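
The two formulas in this section can be traced by hand on toy data. The sketch below uses made-up 2-D vectors (the trained models use 300 dimensions) and plain Python instead of `numpy`; `cosine`, `mean`, and `vec` are illustrative names:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean(values):
    return sum(values) / len(values)

# Made-up vectors: "əla" points roughly the same way as "yaxşı", "pis" is orthogonal.
vec = {
    "yaxşı": (1.0, 0.0),
    "əla": (0.8, 0.6),
    "pis": (0.0, 1.0),
}
syn = mean([cosine(vec["yaxşı"], vec["əla"])])
ant = mean([cosine(vec["yaxşı"], vec["pis"])])
separation = syn - ant
print(f"syn={syn:.3f} ant={ant:.3f} separation={separation:.3f}")
```

In this toy setup the separation is large because the antonym was placed orthogonally on purpose; embeddings trained on co-occurrence often place antonyms close together, since they appear in similar contexts.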


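#### Nearest-neighbor quality
The **nearest neighbors (NN)** of a word are the words whose vectors are closest to its vector, ranked by `Cosine Similarity`.

In the output, the `FastText` neighbors are almost all spelling or inflectional variants of the query word (e.g. 'ucuzu', 'ucuza', 'ucuzdur' for 'ucuz'), so they are consistently related to it. The `Word2Vec` neighbors are related for frequent words ('yaxşı' → 'iyi', '<RATING_POS>', 'yaxshi') but look unrelated for 'bahalı' and 'ucuz'. On this corpus, `FastText` therefore gives better results than `Word2Vec`.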
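
FastText represents a word through its character n-grams (the `min_n=3`, `max_n=6` settings used in training above), which is one plausible reason its neighbors share surface forms with the query word. The sketch below extracts such n-grams in plain Python; it is an illustration of the idea, not gensim's internal implementation, and `char_ngrams` is a hypothetical helper:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in FastText-style boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

a = set(char_ngrams("ucuz"))
b = set(char_ngrams("ucuza"))
shared = a & b
print(sorted(shared))
print(f"{len(shared)} of {len(a)} n-grams of 'ucuz' also occur in 'ucuza'")
```

Words that share many n-grams get similar vectors almost by construction, which helps with misspellings and inflected forms but does not by itself guarantee semantic similarity.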