# fastText probing notebook

This notebook operationalises the experiments sketched in `reports/fasttext_limitations_and_kazakh.md`. It provides a reproducible pipeline for:

1. Loading multilingual Wikipedia snippets from the repository's `data/` directory.
2. Wiring in pretrained fastText vectors (e.g., the Kazakh `cc.kk.300.bin` model) to create sentence-level embeddings.
3. Training a simple logistic regression classifier on averaged fastText vectors.
4. Reporting held-out performance and surfacing misclassified examples for error analysis.

> **Environment note:** The report mentions that package and model downloads were blocked in the grading environment. The notebook therefore checks whether the `fasttext` package and the vector file are available locally, and explains how to add them if either is missing.

In [None]:
from pathlib import Path
import importlib.util
import json
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional, Sequence

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

FASTTEXT_AVAILABLE = importlib.util.find_spec('fasttext') is not None


## Configure data and vectors

Set the languages to evaluate and the location of your pretrained fastText vectors. The defaults work with the repository's Wikipedia-derived dataset and expect a local copy of the Kazakh vectors. To probe with another language's vectors (e.g., Yoruba), point `FASTTEXT_VECTOR_PATH` at that model instead.

In [None]:
DATA_ROOT = Path('data')
LANGUAGES = ['kazakh', 'yoruba', 'english']  # adjust to probe different subsets
FASTTEXT_VECTOR_PATH = Path('vectors/cc.kk.300.bin')  # update if you store vectors elsewhere

print(f'fastText installed: {FASTTEXT_AVAILABLE}')
print(f'Vector file present: {FASTTEXT_VECTOR_PATH.exists()} ({FASTTEXT_VECTOR_PATH})')


## Data loading helpers

The helpers below mirror the logic used in the baseline scripts (`scripts/evaluate_language_id_baselines.py`) but trim it down for quick experimentation inside the notebook.
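
As a minimal illustration of the input these helpers expect, the sketch below pulls sentence text out of an in-memory CoNLL-U fragment by reading its `# text = ` comment lines. The sample content and the `extract_texts` name are made up for this example; the notebook's own `iter_conllu_sentences` reads from files and buffers lines until a blank separator, which this simplified version does not reproduce.

```python
# Hypothetical two-sentence CoNLL-U fragment (illustrative content only).
SAMPLE = (
    '# sent_id = demo-1\n'
    '# text = Hello world.\n'
    '1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n'
    '2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_\n'
    '\n'
    '# sent_id = demo-2\n'
    '# text = Сәлем әлем.\n'
    '1\tСәлем\tсәлем\tINTJ\t_\t_\t0\troot\t_\t_\n'
    '\n'
)

TEXT_PREFIX = '# text = '


def extract_texts(conllu: str) -> list:
    """Return the raw sentence strings stored in '# text = ' comment lines."""
    return [
        line[len(TEXT_PREFIX):].strip()
        for line in conllu.splitlines()
        if line.startswith(TEXT_PREFIX)
    ]


sentences = extract_texts(SAMPLE)
print(sentences)
```

Token rows and other comment lines (such as `# sent_id`) are ignored, so only the untokenised sentence strings reach the classifier.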

In [None]:
@dataclass
class SentenceExample:
    text: str
    label: str


def iter_conllu_sentences(path: Path) -> Iterable[str]:
    buffer: List[str] = []
    for line in path.read_text(encoding='utf8').splitlines():
        if line.startswith('# text = '):
            buffer.append(line[len('# text = ') :])
        elif line.startswith('#'):
            continue
        elif not line.strip():
            if buffer:
                yield ' '.join(buffer).strip()
                buffer = []
    if buffer:
        yield ' '.join(buffer).strip()


def load_multilingual_dataset(
    data_root: Path,
    languages: Optional[Sequence[str]] = None,
    max_sentences_per_language: Optional[int] = None,
) -> pd.DataFrame:
    examples: List[SentenceExample] = []
    language_dirs = sorted([p for p in data_root.iterdir() if p.is_dir()])
    for lang_dir in language_dirs:
        if languages and lang_dir.name not in languages:
            continue
        sentences: List[str] = []
        for conllu_file in sorted(lang_dir.glob('*.conllu')):
            sentences.extend(iter_conllu_sentences(conllu_file))
        if max_sentences_per_language is not None:
            sentences = sentences[:max_sentences_per_language]
        examples.extend(SentenceExample(text=s, label=lang_dir.name) for s in sentences)
    rng = np.random.default_rng(13)
    rng.shuffle(examples)
    return pd.DataFrame(examples)


def preview_class_balance(df: pd.DataFrame) -> pd.Series:
    counts = Counter(df['label'])
    return pd.Series(counts).sort_values(ascending=False)


dataset = load_multilingual_dataset(DATA_ROOT, languages=LANGUAGES, max_sentences_per_language=2000)
print(dataset.head())
print('\nClass distribution:\n', preview_class_balance(dataset))


## Load fastText vectors

The cell below loads a local binary `.bin` file with subword vectors. If the file is missing or the `fasttext` package is unavailable, the notebook surfaces clear guidance on how to proceed.
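
For readers who need to obtain the vectors first, here is a small sketch of a helper that reports where a model file is expected and where it can usually be downloaded. The `describe_vectors` name and the `vectors/` default are assumptions for this example, and the URL pattern reflects the public Common Crawl fastText releases as I understand them; verify it before relying on it.

```python
from pathlib import Path

# Assumed base URL for the Common Crawl fastText releases (cc.<lang>.300.bin.gz).
BASE_URL = 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl'


def describe_vectors(lang_code: str, vector_dir: Path = Path('vectors')) -> dict:
    """Report the expected local path, its presence, and a likely download URL."""
    filename = f'cc.{lang_code}.300.bin'
    local_path = vector_dir / filename
    return {
        'local_path': local_path,
        'present': local_path.exists(),
        'download_url': f'{BASE_URL}/{filename}.gz',
    }


info = describe_vectors('kk')
if not info['present']:
    print(
        f"Download {info['download_url']}, decompress it, "
        f"and place the result at {info['local_path']}"
    )
```

Where network access is allowed, the `fasttext.util.download_model` helper in the Python bindings offers a similar shortcut.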

In [None]:
fasttext_model = None
fasttext_dim = None

if FASTTEXT_AVAILABLE and FASTTEXT_VECTOR_PATH.exists():
    import fasttext  # type: ignore

    fasttext_model = fasttext.load_model(str(FASTTEXT_VECTOR_PATH))
    fasttext_dim = fasttext_model.get_dimension()
    print(f'Loaded fastText model with {fasttext_dim} dimensions from {FASTTEXT_VECTOR_PATH}')
elif not FASTTEXT_AVAILABLE:
    print(
        'fastText Python bindings are not installed. Install via `pip install fasttext` ' 
        'and download the desired `.bin` vectors (e.g., cc.kk.300.bin) before rerunning.'
    )
else:
    print(
        f'Vector file not found at {FASTTEXT_VECTOR_PATH}. Place the pretrained fastText binary ' 
        'there or update FASTTEXT_VECTOR_PATH to point at your local copy.'
    )


## Feature construction and model training

Each sentence is tokenised on whitespace, and its token vectors are averaged into a single embedding. This mirrors the lightweight fastText probing described in the report (averaging static vectors before a linear classifier).
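
To make the averaging step concrete without needing a real model, the sketch below uses a hypothetical `ToyVectorModel` that exposes the same `get_word_vector` method as a fastText model. The class, its two-dimensional vectors, and the `average_tokens` name are invented for illustration only.

```python
import numpy as np


class ToyVectorModel:
    """Hypothetical stand-in for a fastText model, backed by a small lookup table."""

    def __init__(self, table: dict, dim: int):
        self.table = table
        self.dim = dim

    def get_word_vector(self, token: str) -> np.ndarray:
        # Unknown tokens map to zeros here; a real fastText .bin model would
        # instead compose a vector from character n-grams.
        return self.table.get(token, np.zeros(self.dim, dtype=np.float32))


def average_tokens(text: str, model, dim: int) -> np.ndarray:
    """Split on whitespace and return the mean of the token vectors."""
    tokens = text.strip().split()
    if not tokens:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(np.stack([model.get_word_vector(t) for t in tokens]), axis=0)


toy_model = ToyVectorModel(
    {
        'salem': np.array([1.0, 0.0], dtype=np.float32),
        'alem': np.array([0.0, 1.0], dtype=np.float32),
    },
    dim=2,
)
sentence_vector = average_tokens('salem alem', toy_model, 2)
print(sentence_vector)
```

Because the result is a plain mean, word order is lost and every token contributes equally, which is one reason the approach serves as a lightweight probe rather than a strong sentence encoder.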

In [None]:
def sentence_to_vector(text: str, model, dim: int) -> np.ndarray:
    tokens = text.strip().split()
    if not tokens:
        return np.zeros(dim, dtype=np.float32)
    vectors = [model.get_word_vector(tok) for tok in tokens]
    return np.mean(np.stack(vectors, axis=0), axis=0)


def build_embedding_matrix(texts: Sequence[str], model, dim: int) -> np.ndarray:
    return np.vstack([sentence_to_vector(text, model, dim) for text in texts])


if fasttext_model is None:
    raise RuntimeError(
        'A fastText model is required to continue. Please ensure FASTTEXT_VECTOR_PATH points ' 
        'to a valid .bin file and that the `fasttext` package is installed.'
    )

X = build_embedding_matrix(dataset['text'], fasttext_model, fasttext_dim)
y = dataset['label'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

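# multi_class is omitted: 'auto' was already the default, and the parameter is
# deprecated in recent scikit-learn releases.
classifier = LogisticRegression(max_iter=1000, n_jobs=-1)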
classifier.fit(X_train, y_train)


## Evaluation and error inspection

Accuracy and macro-averaged precision/recall/F1 provide a quick snapshot of how well fastText embeddings separate the selected languages. Misclassifications are shown to facilitate the qualitative inspection suggested in the report.
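
As a worked illustration of macro averaging, the sketch below computes the metrics and a confusion matrix on six made-up labels. The gold and predicted values are toy data chosen for the example, not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy gold and predicted labels (illustrative only).
gold = ['kazakh', 'kazakh', 'yoruba', 'yoruba', 'english', 'english']
pred = ['kazakh', 'yoruba', 'yoruba', 'yoruba', 'english', 'english']
labels = ['english', 'kazakh', 'yoruba']

# Macro averaging computes each metric per class, then takes the unweighted mean,
# so a small class counts as much as a large one.
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=labels, average='macro', zero_division=0
)

# Rows are gold labels and columns are predictions, both in `labels` order.
matrix = confusion_matrix(gold, pred, labels=labels)

print(matrix)
print(f'macro P={precision:.3f} R={recall:.3f} F1={f1:.3f}')
```

The single Kazakh sentence predicted as Yoruba lowers Kazakh recall and Yoruba precision, and the off-diagonal cell of the matrix shows which pair of languages was confused.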

In [None]:
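y_pred = classifier.predict(X_test)
print(f'Test accuracy: {accuracy_score(y_test, y_pred):.4f}\n')
print(classification_report(y_test, y_pred))

# Repeat the split on the raw texts with the same seed and stratification so that
# text_test lines up row-for-row with X_test; errors are then held-out only.
_, text_test = train_test_split(
    dataset['text'].to_numpy(), test_size=0.2, random_state=42, stratify=y
)
errors = [
    {'text': text, 'gold': gold, 'pred': pred}
    for text, gold, pred in zip(text_test, y_test, y_pred)
    if gold != pred
]

error_df = pd.DataFrame(errors, columns=['text', 'gold', 'pred'])
print(f'Held-out misclassifications: {len(error_df)} (showing head)')
print(error_df.head())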


## Next steps

* Swap in domain-specific corpora (e.g., Yoruba social media posts) by replacing the `DATA_ROOT` path or loading an external dataframe.
* Point `FASTTEXT_VECTOR_PATH` at the matching pretrained vectors (such as `cc.kk.300.bin` for Kazakh or `cc.yo.300.bin` for Yoruba) to mirror the report's planned experiments.
* Extend the analysis by saving confusion matrices or integrating alternative feature baselines (character n-grams) to quantify the gap between static embeddings and more robust representations.
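
The character n-gram baseline mentioned in the last bullet could be sketched as follows. This is one possible setup using TF-IDF features and logistic regression, not necessarily what `scripts/evaluate_language_id_baselines.py` does; the `build_char_ngram_baseline` name, the n-gram range, and the toy sentences are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_char_ngram_baseline(ngram_range=(1, 3)):
    """TF-IDF over character n-grams within word boundaries, then a linear classifier."""
    return make_pipeline(
        TfidfVectorizer(analyzer='char_wb', ngram_range=ngram_range),
        LogisticRegression(max_iter=1000),
    )


# Toy training sentences (illustrative only; use dataset['text'] in practice).
toy_texts = [
    'The library opens at nine in the morning.',
    'She wrote a letter to her friend.',
    'Мен кітап оқып отырмын.',
    'Алматы үлкен қала.',
]
toy_labels = ['english', 'english', 'kazakh', 'kazakh']

baseline = build_char_ngram_baseline()
baseline.fit(toy_texts, toy_labels)
prediction = baseline.predict(['Бұл жаңа сөйлем.'])
print(prediction)
```

Fitting the same pipeline on the notebook's train split and scoring it on the held-out split would give a number directly comparable with the averaged fastText result.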