# fastText probing notebook

This notebook operationalises the experiments sketched in `reports/fasttext_limitations_and_kazakh.md`. It evaluates pretrained fastText vectors on non-English, out-of-distribution data (specifically the Kazakh hate speech dataset stored at `data/kazakh_hate_speech_fasttext.csv`) to test their suitability as baseline features. It provides a reproducible pipeline for:

1. Loading the Kazakh hate speech dataset (with optional multilingual Wikipedia fallback).
2. Wiring in pretrained fastText vectors (e.g., the Kazakh `cc.kk.300.bin` model) to create sentence-level embeddings.
3. Training a simple logistic regression classifier on averaged fastText vectors.
4. Reporting held-out performance and surfacing misclassified examples for error analysis.

> **Environment note:** The report mentions that package/model downloads were blocked in the grading environment. The notebook therefore detects whether fastText vectors are present locally and explains how to add them if they are missing.

In [None]:
from pathlib import Path
import importlib.util
import json
import random
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional, Sequence

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

FASTTEXT_AVAILABLE = importlib.util.find_spec('fasttext') is not None
DATASETS_AVAILABLE = importlib.util.find_spec('datasets') is not None


## Configure data and vectors

Set the paths for the hate speech dataset and pretrained fastText vectors. The defaults expect the Kazakh hate speech CSV shipped with this repository and a local copy of the Kazakh vectors. To probe another language (e.g., Yoruba), point the paths at that language's dataset and vectors.
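
The Common Crawl vectors follow a predictable naming scheme, so switching language mostly means changing the language code. The following is a minimal sketch, assuming the `cc.<code>.300.bin` pattern seen in the Kazakh file name and download URL used in this notebook also holds for the language you want. The helper name `vector_locations` and the `vectors/` default directory are illustrative and not part of the repository.

```python
from pathlib import Path
from typing import Tuple

# Base URL taken from the Kazakh download link used in this notebook.
FASTTEXT_CRAWL_BASE_URL = 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl'


def vector_locations(lang_code: str, vectors_dir: Path = Path('vectors')) -> Tuple[Path, str]:
    """Return (local path, download URL) for a language's Common Crawl vectors.

    Hypothetical helper: assumes the `cc.<code>.300.bin` naming pattern holds
    for the requested language code.
    """
    filename = f'cc.{lang_code}.300.bin'
    return vectors_dir / filename, f'{FASTTEXT_CRAWL_BASE_URL}/{filename}.gz'


kk_path, kk_url = vector_locations('kk')  # Kazakh
yo_path, yo_url = vector_locations('yo')  # Yoruba
print(kk_path.name, kk_url)
print(yo_path.name, yo_url)
```

The returned path and URL could then be assigned to `FASTTEXT_VECTOR_PATH` and `FASTTEXT_DOWNLOAD_URL` respectively.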

In [17]:
PROJECT_ROOT = Path.cwd()
if (PROJECT_ROOT / 'data').is_dir():
    DATA_ROOT = PROJECT_ROOT / 'data'
elif (PROJECT_ROOT.parent / 'data').is_dir():
    DATA_ROOT = PROJECT_ROOT.parent / 'data'
else:
    DATA_ROOT = Path('data')

LANGUAGES = ['kazakh stanza', 'yoruba heuristic', 'english heuristic']  # used for optional Wikipedia fallback
KAZAKH_HATE_SPEECH_PATH = DATA_ROOT / 'kazakh_hate_speech_fasttext.csv'
FASTTEXT_VECTOR_PATH = PROJECT_ROOT / 'vectors/cc.kk.300.bin'  # update if you store vectors elsewhere

print(f'fastText installed: {FASTTEXT_AVAILABLE}')
print(f'Vector file present: {FASTTEXT_VECTOR_PATH.exists()} ({FASTTEXT_VECTOR_PATH})')
print(f'Hate speech dataset present: {KAZAKH_HATE_SPEECH_PATH.exists()} ({KAZAKH_HATE_SPEECH_PATH})')
print(f'Data root resolved to: {DATA_ROOT} (exists: {DATA_ROOT.exists()})')
if not FASTTEXT_AVAILABLE:
    print('Set AUTO_INSTALL_FASTTEXT=True to let the notebook try installing the package via pip.')
if not FASTTEXT_VECTOR_PATH.exists():
    print('Set AUTO_DOWNLOAD_VECTORS=True to fetch the Kazakh fastText vectors automatically (large download).')


fastText installed: True
Vector file present: True (C:\Users\Maxim\vectors\cc.kk.300.bin)
Data root resolved to: C:\Users\Maxim\data (exists: True)


## Kazakh hate speech configuration

Toggle the switches below to prioritise the Kazakh hate speech corpus over Wikipedia snippets. The CSV lives in `data/kazakh_hate_speech_fasttext.csv` and provides non-Wikipedia, non-English content for probing pretrained vectors.

In [None]:
USE_KAZAKH_HATE_SPEECH = True  # enable to prioritise the hate speech corpus over Wikipedia snippets
MAX_WIKIPEDIA_SENTENCES_PER_LANGUAGE = 2000  # fallback sample size per language when using Wikipedia data

AUTO_INSTALL_FASTTEXT = True  # toggle to attempt `pip install fasttext-wheel` (set False to skip installs)
AUTO_DOWNLOAD_VECTORS = True  # toggle to download cc.kk.300.bin (~1.2GB); set False if you already have the file or are offline
FASTTEXT_INSTALL_PACKAGE = 'fasttext-wheel'
FASTTEXT_DOWNLOAD_URL = 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.kk.300.bin.gz'

def ensure_fasttext_installed(auto_install: bool = False) -> bool:
    global FASTTEXT_AVAILABLE
    if FASTTEXT_AVAILABLE:
        return True
    if not auto_install:
        print(
            'fastText Python bindings are not installed. Install via `pip install fasttext-wheel` '
            'or set AUTO_INSTALL_FASTTEXT=True to let the notebook attempt installation.'
        )
        return False
    try:
        import subprocess
        import sys

        subprocess.check_call([sys.executable, '-m', 'pip', 'install', FASTTEXT_INSTALL_PACKAGE])
        FASTTEXT_AVAILABLE = importlib.util.find_spec('fasttext') is not None
    except Exception as exc:  # noqa: BLE001
        print(f'Automatic installation failed: {exc}')
        FASTTEXT_AVAILABLE = False
    return FASTTEXT_AVAILABLE

def download_fasttext_vectors(target_path: Path, url: str) -> bool:
    import gzip
    import shutil
    import urllib.request

    target_path.parent.mkdir(parents=True, exist_ok=True)
    gz_path = target_path.with_suffix(target_path.suffix + '.gz')
    try:
        with urllib.request.urlopen(url) as resp, open(gz_path, 'wb') as download_out:
            shutil.copyfileobj(resp, download_out)
        with gzip.open(gz_path, 'rb') as src, open(target_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)
        return True
    except Exception as exc:  # noqa: BLE001
        print(f'Downloading fastText vectors failed: {exc}')
        return False
    finally:
        if gz_path.exists():
            try:
                gz_path.unlink()
            except OSError:
                pass

def ensure_vector_file(target_path: Path, url: str, auto_download: bool = False) -> bool:
    if target_path.exists():
        return True
    if not auto_download:
        print(
            f'Vector file not found at {target_path}. '
            'Set AUTO_DOWNLOAD_VECTORS=True to download automatically or place it manually.'
        )
        return False
    print(f'Downloading fastText vectors from {url} (this is ~1.2GB)...')
    return download_fasttext_vectors(target_path, url)



## Data loading helpers

The helpers below mirror the logic used in the baseline scripts (`scripts/evaluate_language_id_baselines.py`) but trim it down for quick experimentation inside the notebook.

In [19]:
@dataclass
class SentenceExample:
    text: str
    label: str


def iter_conllu_sentences(path: Path) -> Iterable[str]:
    buffer: List[str] = []
    for line in path.read_text(encoding='utf8').splitlines():
        if line.startswith('# text = '):
            buffer.append(line[len('# text = ') :])
        elif line.startswith('#'):
            continue
        elif not line.strip():
            if buffer:
                yield ' '.join(buffer).strip()
                buffer = []
        else:
            continue
    if buffer:
        yield ' '.join(buffer).strip()


def load_multilingual_dataset(
    data_root: Path,
    languages: Optional[Sequence[str]] = None,
    max_sentences_per_language: Optional[int] = None,
    seed: int = 13,
) -> pd.DataFrame:
    rng = random.Random(seed)
    examples: List[SentenceExample] = []
    language_dirs = sorted([p for p in data_root.iterdir() if p.is_dir()])
    for lang_dir in language_dirs:
        language = lang_dir.name
        if languages and language not in languages:
            continue
        sentences: List[str] = []
        for conllu_file in sorted(lang_dir.glob('*.conllu')):
            sentences.extend(iter_conllu_sentences(conllu_file))
        if max_sentences_per_language is not None:
            rng.shuffle(sentences)
            sentences = sentences[:max_sentences_per_language]
        examples.extend(SentenceExample(text=s, label=language) for s in sentences)
    rng.shuffle(examples)
    return pd.DataFrame([vars(example) for example in examples], columns=['text', 'label'])


def load_kazakh_hate_speech_dataset(path: Path) -> pd.DataFrame:
    if not path.exists():
        print(f'Kazakh hate speech file not found at {path}. Place the CSV to prioritise this corpus.')
        return pd.DataFrame(columns=['text', 'label'])
    df = pd.read_csv(path)
    missing = {'text', 'label'} - set(df.columns)
    if missing:
        print(
            f'Expected columns `text` and `label` were not found in {path}. '
            f'Missing: {sorted(missing)}'
        )
        return pd.DataFrame(columns=['text', 'label'])
    df = df[['text', 'label']].dropna()
    print(f'Loaded {len(df):,} Kazakh hate speech records from {path}.')
    return df


def preview_class_balance(df: pd.DataFrame) -> pd.Series:
    counts = Counter(df['label'])
    return pd.Series(counts).sort_values(ascending=False)


kazakh_hate_speech_df = load_kazakh_hate_speech_dataset(KAZAKH_HATE_SPEECH_PATH) if USE_KAZAKH_HATE_SPEECH else pd.DataFrame()

if USE_KAZAKH_HATE_SPEECH and not kazakh_hate_speech_df.empty:
    dataset = kazakh_hate_speech_df
    print('Using Kazakh hate speech corpus (non-Wikipedia) for probing fastText embeddings.')
else:
    # Load the Wikipedia corpus only when the fallback is needed, so a missing
    # corpus cannot block runs that use the hate speech CSV.
    print('Falling back to Wikipedia-derived multilingual dataset.')
    wikipedia_dataset = load_multilingual_dataset(
        DATA_ROOT, languages=LANGUAGES, max_sentences_per_language=MAX_WIKIPEDIA_SENTENCES_PER_LANGUAGE
    )
    if wikipedia_dataset.empty:
        raise RuntimeError(
            'No sentences were loaded from DATA_ROOT. Ensure the corpus is available and LANGUAGES is set correctly.'
        )
    print('Wikipedia-derived sample:')
    print(wikipedia_dataset.head())
    print('Class distribution:', preview_class_balance(wikipedia_dataset))
    dataset = wikipedia_dataset
print('Active dataset label balance:')
print(preview_class_balance(dataset))


                                                text              label
0                           Itokasi Àwọn ástẹ́rọ́ìdì   yoruba heuristic
1  Focusing on oneself is not listening, reading,...  english heuristic
2                                                ...  english heuristic
3  Төрт мезгіл тамақтану анағұрлым тиімді болып т...      kazakh stanza
4  Жамбыл гидромелиоративтік-құрылыс институтын (...      kazakh stanza
Class distribution: yoruba heuristic     2000
english heuristic    2000
kazakh stanza        2000
dtype: int64


The cell above uses the Kazakh hate speech corpus when it is available, so that the evaluation targets non-Wikipedia, non-English data. If the CSV is missing, empty, or lacks the `text` and `label` columns, the notebook falls back to the Wikipedia-derived multilingual dataset.
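
The loader treats a CSV as misformatted when either the `text` or the `label` column is absent. As a rough pre-flight check, the same test can be run without pandas. This is a dependency-free sketch: the function name `missing_required_columns`, the sample sentence, and the label value `0` are made up for illustration and say nothing about the real file's contents.

```python
import csv
import io
from typing import List, TextIO

REQUIRED_COLUMNS = {'text', 'label'}  # the columns the notebook's loader looks for


def missing_required_columns(handle: TextIO) -> List[str]:
    """Return the required column names that are absent from a CSV header."""
    reader = csv.DictReader(handle)
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)


# Hypothetical in-memory files standing in for the real CSV.
well_formed = io.StringIO('text,label\nмысал мәтін,0\n')
misformatted = io.StringIO('sentence,label\nмысал мәтін,0\n')

print(missing_required_columns(well_formed))   # []
print(missing_required_columns(misformatted))  # ['text']
```

For a file on disk, the same function could be called on `open(path, encoding='utf8', newline='')`.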

## Load fastText vectors

The cell below loads a local binary `.bin` file with subword vectors. If the file is missing or the `fasttext` package is unavailable, the notebook surfaces clear guidance on how to proceed.

In [20]:

fasttext_model = None
fasttext_dim = None

fasttext_ready = ensure_fasttext_installed(AUTO_INSTALL_FASTTEXT)
vector_ready = ensure_vector_file(FASTTEXT_VECTOR_PATH, FASTTEXT_DOWNLOAD_URL, AUTO_DOWNLOAD_VECTORS)

if fasttext_ready and vector_ready:
    import fasttext  # type: ignore

    fasttext_model = fasttext.load_model(str(FASTTEXT_VECTOR_PATH))
    fasttext_dim = fasttext_model.get_dimension()
    print(f'Loaded fastText model with {fasttext_dim} dimensions from {FASTTEXT_VECTOR_PATH}')
else:
    guidance = []
    if not fasttext_ready:
        guidance.append(
            '- fastText Python bindings are missing. Set AUTO_INSTALL_FASTTEXT=True or install manually via `pip install fasttext-wheel`.'
        )
    if not vector_ready:
        guidance.append(
            f'- fastText vector binary not found at {FASTTEXT_VECTOR_PATH}. Set AUTO_DOWNLOAD_VECTORS=True to fetch it or place it manually.'
        )
    raise RuntimeError(
        'fastText setup is incomplete; please address the items below before rerunning:\n' + '\n'.join(guidance)
    )


Loaded fastText model with 300 dimensions from C:\Users\Maxim\vectors\cc.kk.300.bin




## Feature construction and model training

Sentences are tokenised on whitespace, and the resulting token vectors are averaged into a single sentence embedding. This mirrors the lightweight fastText probing described in the report (averaging static vectors before a linear classifier).
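
Two properties of mean pooling are worth keeping in mind when reading the results: word order is discarded, and empty input maps to the zero vector. The sketch below illustrates both with a toy model. `ToyModel` is a hypothetical stand-in that only imitates the `get_word_vector` method, and `mean_vector` is an illustrative name for the same averaging logic used in the cell that follows.

```python
import numpy as np


class ToyModel:
    """Hypothetical stand-in exposing a fastText-style `get_word_vector` method."""

    def __init__(self, table, dim):
        self.table = table
        self.dim = dim

    def get_word_vector(self, token):
        # A real fastText .bin model composes vectors for unseen tokens from
        # subword n-grams; this toy simply returns zeros for them.
        return self.table.get(token, np.zeros(self.dim, dtype=np.float32))


def mean_vector(text, model, dim):
    """Average the vectors of whitespace-separated tokens."""
    tokens = text.strip().split()
    if not tokens:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(np.stack([model.get_word_vector(t) for t in tokens]), axis=0)


toy = ToyModel(
    {
        'a': np.array([1.0, 0.0], dtype=np.float32),
        'b': np.array([0.0, 1.0], dtype=np.float32),
    },
    dim=2,
)

print(mean_vector('a b', toy, 2))  # [0.5 0.5]
print(mean_vector('b a', toy, 2))  # [0.5 0.5], word order is lost
print(mean_vector('   ', toy, 2))  # [0. 0.], empty input
```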

In [21]:
def sentence_to_vector(text: str, model, dim: int) -> np.ndarray:
    tokens = text.strip().split()
    if not tokens:
        return np.zeros(dim, dtype=np.float32)
    vectors = [model.get_word_vector(tok) for tok in tokens]
    return np.mean(np.stack(vectors, axis=0), axis=0)


def build_embedding_matrix(texts: Sequence[str], model, dim: int) -> np.ndarray:
    return np.vstack([sentence_to_vector(text, model, dim) for text in texts])


if fasttext_model is None:
    raise RuntimeError(
        'A fastText model is required to continue. Resolve the setup issues above (installation or vector download) and rerun the loader cell.'
    )

X = build_embedding_matrix(dataset['text'], fasttext_model, fasttext_dim)
y = dataset['label'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

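# `multi_class` is omitted: 'auto' is already the default, and the argument is deprecated in recent scikit-learn releases.
classifier = LogisticRegression(max_iter=1000, n_jobs=-1)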
classifier.fit(X_train, y_train)




## Evaluation and error inspection

Accuracy and macro-averaged precision/recall/F1 provide a quick snapshot of how well fastText embeddings separate the classes of the active dataset (in the output shown below, the languages of the Wikipedia-derived fallback). Misclassifications are listed to support the qualitative inspection suggested in the report.
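
To make the macro average concrete, here is a small worked example in plain Python. It is a sketch for intuition only, not a replacement for `classification_report`; the function names and the toy labels `kk` and `yo` are illustrative.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1} from true-positive, false-positive and false-negative counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label gains a false positive
            fn[g] += 1  # the gold label gains a false negative
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


gold = ['kk', 'kk', 'kk', 'kk', 'yo', 'yo']
pred = ['kk', 'kk', 'kk', 'yo', 'yo', 'kk']

print(per_class_f1(gold, pred))  # {'kk': 0.75, 'yo': 0.5}
print(macro_f1(gold, pred))      # 0.625
```

Accuracy on this toy example is 4/6, while the macro F1 of 0.625 is pulled down by the smaller class, because every class counts equally regardless of its support.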

In [22]:
y_pred = classifier.predict(X_test)
print(f'Test accuracy: {accuracy_score(y_test, y_pred):.4f}\n')
print(classification_report(y_test, y_pred))

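# Note: errors are collected over the full dataset (train and test examples),
# not only the held-out split used for the metrics above.
errors = []
for text, gold, pred in zip(dataset['text'], y, classifier.predict(X)):
    if gold != pred:
        errors.append({'text': text, 'gold': gold, 'pred': pred})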

error_df = pd.DataFrame(errors)
print('Sample misclassifications (head):')
print(error_df.head())


Test accuracy: 0.9483

                   precision    recall  f1-score   support

english heuristic       0.91      0.95      0.93       400
    kazakh stanza       0.98      0.99      0.98       400
 yoruba heuristic       0.96      0.91      0.93       400

         accuracy                           0.95      1200
        macro avg       0.95      0.95      0.95      1200
     weighted avg       0.95      0.95      0.95      1200

Sample misclassifications (head):
                                                text               gold  \
0                                                ...  english heuristic   
1                                    Аузы- қосжақты.      kazakh stanza   
2                                      Heerlen, N.V.  english heuristic   
3  Durkheim, Marx, and the German theorist Max We...   yoruba heuristic   
4  Get to the fucking point." Brooks came into hi...   yoruba heuristic   

                pred  
0      kazakh stanza  
1   yoruba heuristic  
2   yor

## Next steps

* Swap in domain-specific corpora (e.g., Yoruba social media posts) by replacing the `DATA_ROOT` path or loading an external dataframe.
* Point `FASTTEXT_VECTOR_PATH` at the matching pretrained vectors (such as `cc.kk.300.bin` for Kazakh or `cc.yo.300.bin` for Yoruba) to mirror the report's planned experiments.
* Extend the analysis by saving confusion matrices or integrating alternative feature baselines (character n-grams) to quantify the gap between static embeddings and more robust representations.
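
As a starting point for the last item, a character n-gram baseline with a confusion matrix could look roughly like the sketch below. It uses scikit-learn's `TfidfVectorizer` with the `char_wb` analyzer. The toy sentences and the `en`/`kk` labels are placeholders; in the notebook the texts and labels would come from splitting `dataset['text']` and `dataset['label']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); replace with real train/test texts and labels.
train_texts = ['the cat sat', 'a dog ran', 'мысық отыр', 'ит жүгірді']
train_labels = ['en', 'en', 'kk', 'kk']
test_texts = ['the dog sat', 'мысық жүгірді']
test_labels = ['en', 'kk']

# Character 1- to 3-grams within word boundaries, fed to a linear classifier.
ngram_baseline = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
ngram_baseline.fit(train_texts, train_labels)

labels = sorted(set(train_labels))
matrix = confusion_matrix(test_labels, ngram_baseline.predict(test_texts), labels=labels)
print(labels)
print(matrix)  # rows are gold labels, columns are predicted labels
```

Running the same split through both this pipeline and the averaged-embedding classifier gives directly comparable confusion matrices, which could then be written to disk, for example with `numpy.savetxt`.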