# Language Confidence Score (Word-Frequency Matching)

Goal: given only `word_counts` (a dict mapping word -> count) and a list of the language's most frequent words (word -> frequency), compute a score for how well the text matches the language.

Data required for the experiment (5 texts):
- `wiki_long` (from the wiki, 5000+ words)
- `wiki_short_bad` (from the wiki, 20+ words, matching the wiki's language as poorly as possible)
- `ext_<lang>` (a longer non-wiki text for each of the 3 languages)

Required languages: 3 (the language of the chosen wiki + 2 others).

Source of the most-frequent-word lists: `wordfreq` (at least 1000 words per language).

In [2]:
# __future__ imports must be the first statement in the cell.
from __future__ import annotations

import json
import math
import os
import sys
from collections import Counter
from pathlib import Path

import pandas as pd

print("cwd:", os.getcwd())
print("python:", sys.executable)

# Find the project root: the nearest ancestor directory that contains the wiki_scraper/ package.
_cwd = Path.cwd().resolve()
root = None
for parent in [_cwd, *_cwd.parents]:
    if (parent / "wiki_scraper").is_dir() and (parent / "wiki_scraper" / "__init__.py").exists():
        root = parent
        break
if root is None:
    raise RuntimeError("Could not find project root containing wiki_scraper/")

print("root:", root)
print("has wiki_scraper dir:", (root / "wiki_scraper").is_dir())

# Make the project root importable.
if str(root) not in sys.path:
    sys.path.insert(0, str(root))

print("sys.path[0:3]:", sys.path[:3])

# Required by counts_from_text_file() in the data-loading cell.
from wiki_scraper.words import tokenize_words


cwd: /home/piotr/Dokumenty/python/wck-backup/notebooks
python: /home/piotr/Dokumenty/python/wck-backup/venv/bin/python3
root: /home/piotr/Dokumenty/python/wck-backup
has wiki_scraper dir: True
sys.path[0:3]: ['/home/piotr/Dokumenty/python/wck-backup', '/home/piotr/Dokumenty/python/wck-backup/notebooks', '/home/piotr/Dokumenty/python/wck-backup/notebooks']


## Data configuration

Set the data paths below.

How to create the files:
- For the wiki: run `python3 wiki_scraper.py --count-words "Title"`, then copy `word-counts.json` to `data/wiki_long.json` or `data/wiki_short_bad.json`, respectively (delete `word-counts.json` after each run so that counts do not accumulate across texts).
- For external texts: paste long texts into `data/ext_en.txt`, `data/ext_pl.txt`, `data/ext_es.txt` (or files for other languages).
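The copy-then-delete step for the wiki files can be done in one move. The sketch below is only an illustration: `stash_word_counts` is a hypothetical helper, and it assumes the scraper writes `word-counts.json` into the current working directory, which the notebook does not state.

```python
import shutil
from pathlib import Path


def stash_word_counts(dest: Path, src: Path = Path("word-counts.json")) -> Path:
    """Move the scraper output to `dest` (hypothetical helper).

    Moving instead of copying removes the source file, so the next
    --count-words run cannot add its counts to the previous text's counts.
    The default `src` location is an assumption.
    """
    if not src.exists():
        raise FileNotFoundError(f"Missing {src}; run the scraper with --count-words first.")
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return dest
```

After scraping the long article, `stash_word_counts(Path("data/wiki_long.json"))` would leave the working directory clean for the `wiki_short_bad` run.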

In [None]:
LANGUAGES = [
    # Wiki language (Bulbapedia):
    'en',
    # Two other languages for comparison:
    'pl',
    'es',
]

K_VALUES = [3, 10, 100, 1000]

DATA_DIR = Path('data')
PATH_WIKI_LONG = DATA_DIR / 'wiki_long.json'
PATH_WIKI_SHORT_BAD = DATA_DIR / 'wiki_short_bad.json'

# External texts (one per language).
# If one is missing, change its path here or add the file.
PATH_EXT = {
    'en': DATA_DIR / 'ext_en.txt',
    'pl': DATA_DIR / 'ext_pl.txt',
    'es': DATA_DIR / 'ext_es.txt',
}


## Loading the data

In [None]:
def load_word_counts_json(path: Path) -> dict[str, int]:
    if not path.exists():
        raise FileNotFoundError(f'Missing file: {path}. Create it from wiki_scraper.py --count-words.')
    data = json.loads(path.read_text(encoding='utf-8'))
    if not isinstance(data, dict):
        raise ValueError(f'Invalid JSON object in {path}')
    out: dict[str, int] = {}
    for k, v in data.items():
        if isinstance(k, str) and isinstance(v, int):
            out[k] = v
    return out

def counts_from_text_file(path: Path) -> dict[str, int]:
    if not path.exists():
        raise FileNotFoundError(f'Missing file: {path}. Provide a longer external text.')
    text = path.read_text(encoding='utf-8', errors='replace')
    return dict(Counter(tokenize_words(text)))

def total_words(counts: dict[str, int]) -> int:
    return int(sum(counts.values()))

wiki_long = load_word_counts_json(PATH_WIKI_LONG)
wiki_short_bad = load_word_counts_json(PATH_WIKI_SHORT_BAD)

ext_counts: dict[str, dict[str, int]] = {}
for lang in LANGUAGES:
    if lang not in PATH_EXT:
        continue
    ext_counts[lang] = counts_from_text_file(PATH_EXT[lang])

datasets: dict[str, dict[str, int]] = {
    'wiki_long': wiki_long,
    'wiki_short_bad': wiki_short_bad,
}
for lang, wc in ext_counts.items():
    datasets[f'ext_{lang}'] = wc

pd.DataFrame(
    [{'dataset': name, 'total_words': total_words(wc), 'unique_words': len(wc)} for name, wc in datasets.items()]
).sort_values('dataset')


Check the requirements:
- `wiki_long` should have 5000+ words
- `wiki_short_bad` should have 20+ words

In [None]:
assert total_words(datasets['wiki_long']) >= 5000, 'wiki_long must be 5000+ words'
assert total_words(datasets['wiki_short_bad']) >= 20, 'wiki_short_bad must be 20+ words'
'OK'


## Language data (most frequent words + frequencies)

`wordfreq` provides a word list (`top_n_list`) and a frequency function (`word_frequency`).

Requirement: at least 1000 most frequent words for each language.

In [None]:
from wordfreq import top_n_list, word_frequency

def get_language_frequency_list(language_code: str, n: int) -> list[tuple[str, float]]:
    words = top_n_list(language_code, n)
    # word_frequency gives a non-negative frequency estimate
    pairs = [(w, float(word_frequency(w, language_code))) for w in words]
    # Ensure sorted (defensive):
    pairs.sort(key=lambda x: x[1], reverse=True)
    return pairs

LANG_LIST_SIZE = 5000
language_lists: dict[str, list[tuple[str, float]]] = {
    lang: get_language_frequency_list(lang, LANG_LIST_SIZE) for lang in LANGUAGES
}
{lang: len(lst) for lang, lst in language_lists.items()}


## The `lang_confidence_score(word_counts, language_words_with_frequency)` function

Proposal: cosine similarity between:
- the distribution of words in the text (restricted to words from the language list)
- the distribution of the language's words (top-k)

Interpretation: the more of the text's words coincide with the language's most frequent words, and the more similar their proportions are, the higher the score.

Use of `k`: in the experiments we pass `language_words_with_frequency[:k]`.
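A toy calculation may make the proposal more concrete. The sketch below is not the notebook's function: `cosine_over_vocab` is a hypothetical, condensed variant that skips the two normalization steps (cosine similarity does not depend on vector length), and the frequencies are made up rather than taken from `wordfreq`.

```python
import math


def cosine_over_vocab(word_counts, lang_freqs):
    """Cosine similarity between text counts and language frequencies.

    Both vectors are indexed by the language vocab only, so words that
    are not on the language list do not influence the result.
    """
    dot = sum(word_counts.get(w, 0) * f for w, f in lang_freqs)
    text_norm = math.sqrt(sum(word_counts.get(w, 0) ** 2 for w, _ in lang_freqs))
    lang_norm = math.sqrt(sum(f * f for _, f in lang_freqs))
    if text_norm == 0 or lang_norm == 0:
        return 0.0
    return dot / (text_norm * lang_norm)


# Made-up frequencies, for illustration only.
toy_lang = [("the", 0.05), ("of", 0.03), ("and", 0.02)]

same_proportions = {"the": 50, "of": 30, "and": 20}
skewed = {"the": 1, "of": 1, "and": 40}
mostly_foreign = {"the": 5, "of": 3, "and": 2, "kot": 500, "pies": 400}

score_same = cosine_over_vocab(same_proportions, toy_lang)
score_skewed = cosine_over_vocab(skewed, toy_lang)
score_foreign = cosine_over_vocab(mostly_foreign, toy_lang)
print(round(score_same, 3), round(score_skewed, 3), round(score_foreign, 3))
```

`same_proportions` scores close to 1 and `skewed` scores much lower, as the interpretation above suggests. `mostly_foreign` also scores close to 1, because the out-of-vocab words are ignored; a coverage measure such as the share of tokens found in the top-k list is a useful companion to the score for that reason.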

In [None]:
def lang_confidence_score(
    word_counts: dict[str, int],
    language_words_with_frequency: list[tuple[str, float]],
) -> float:
    """Return a confidence score that the text matches the language.

    Expected input: `language_words_with_frequency` is already truncated to top-k.
    Score: cosine similarity between normalized text distribution (restricted to the top-k language vocab)
    and normalized language distribution over the same vocab.
    """

    if not language_words_with_frequency:
        return 0.0

    vocab = [w for (w, _) in language_words_with_frequency if w]
    lang_freq = {w: float(f) for (w, f) in language_words_with_frequency if w}

    lang_sum = sum(lang_freq.values())
    if lang_sum <= 0:
        return 0.0

    # Normalize language vector
    lang_vec = {w: lang_freq[w] / lang_sum for w in vocab}

    # Normalize text vector over vocab
    text_sum = 0
    text_raw: dict[str, float] = {}
    for w in vocab:
        c = int(word_counts.get(w, 0))
        if c > 0:
            text_raw[w] = float(c)
            text_sum += c

    if text_sum <= 0:
        return 0.0

    text_vec = {w: c / text_sum for w, c in text_raw.items()}

    dot = 0.0
    a2 = 0.0
    b2 = 0.0
    for w in vocab:
        a = text_vec.get(w, 0.0)
        b = lang_vec.get(w, 0.0)
        dot += a * b
        a2 += a * a
        b2 += b * b

    denom = math.sqrt(a2) * math.sqrt(b2)
    if denom <= 0:
        return 0.0
    return dot / denom


## Experiment: k = 3, 10, 100, 1000

For each k, each language, and each dataset, we compute the score.

In [None]:
rows = []
for k in K_VALUES:
    for lang in LANGUAGES:
        topk = language_lists[lang][:k]
        for dataset_name, wc in datasets.items():
            score = lang_confidence_score(wc, topk)
            rows.append({
                'k': k,
                'language': lang,
                'dataset': dataset_name,
                'score': score,
            })

results = pd.DataFrame(rows)
results.sort_values(['dataset', 'k', 'language']).head(30)


In [None]:
pivot = results.pivot_table(index=['dataset', 'k'], columns='language', values='score')
pivot


## Plots

For each dataset: score vs k (log scale), with a separate line for each language.

In [None]:
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.dpi'] = 130

for dataset_name in sorted(datasets.keys()):
    sub = results[results['dataset'] == dataset_name].copy()
    fig, ax = plt.subplots(figsize=(10, 4.5))
    for lang in LANGUAGES:
        s = sub[sub['language'] == lang].sort_values('k')
        ax.plot(s['k'], s['score'], marker='o', label=lang)
    ax.set_xscale('log')
    ax.set_title(f'lang_confidence_score vs k ({dataset_name})')
    ax.set_xlabel('k (log scale)')
    ax.set_ylabel('score')
    ax.grid(alpha=0.25)
    ax.legend()
    plt.show()


## Ranking summary

For each dataset and k, we show the language with the highest score.

In [None]:
def best_language_table(results: pd.DataFrame) -> pd.DataFrame:
    out_rows = []
    for (dataset_name, k), group in results.groupby(['dataset', 'k']):
        group = group.sort_values('score', ascending=False)
        out_rows.append({
            'dataset': dataset_name,
            'k': int(k),
            'best_language': group.iloc[0]['language'],
            'best_score': float(group.iloc[0]['score']),
            'second_language': group.iloc[1]['language'] if len(group) > 1 else None,
            'second_score': float(group.iloc[1]['score']) if len(group) > 1 else None,
        })
    return pd.DataFrame(out_rows).sort_values(['dataset', 'k'])

best_tbl = best_language_table(results)
best_tbl


## Additional metrics: coverage (overlap)

We measure what fraction of the text's word occurrences falls within the top-k words of a given language.
This helps when interpreting results for heavily inflected languages (e.g. Polish), where many word forms may not make it into the top-k.
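The effect of inflection on coverage can be seen on a tiny example. The sketch below uses a hypothetical helper `token_overlap` and made-up word lists; the two "top-k" sets are illustrative assumptions, not real `wordfreq` output.

```python
def token_overlap(word_counts, vocab):
    """Share of word occurrences (weighted by count) that are in `vocab`."""
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    return sum(c for w, c in word_counts.items() if w in vocab) / total


# Made-up "top-k" lists that contain only one form of each noun.
toy_top_pl = {"kot", "i"}
toy_top_en = {"cat", "and"}

# The Polish text spreads the same noun over several case forms.
text_pl = {"kot": 2, "kota": 3, "kotem": 1, "i": 4}
text_en = {"cat": 5, "cats": 1, "and": 4}

overlap_pl = token_overlap(text_pl, toy_top_pl)
overlap_en = token_overlap(text_en, toy_top_en)
print(overlap_pl, overlap_en)  # 0.6 0.9
```

In this toy case both texts have 10 tokens about the same subject, yet the Polish one covers less of its list because `kota` and `kotem` are separate entries that did not make the cut.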

In [None]:
def overlap_ratio(word_counts: dict[str, int], vocab: set[str]) -> float:
    total = sum(word_counts.values())
    if total <= 0:
        return 0.0
    in_vocab = 0
    for w, c in word_counts.items():
        if w in vocab:
            in_vocab += int(c)
    return in_vocab / total

overlap_rows = []
for k in K_VALUES:
    for lang in LANGUAGES:
        vocab = {w for (w, _) in language_lists[lang][:k]}
        for dataset_name, wc in datasets.items():
            overlap_rows.append({
                'k': k,
                'language': lang,
                'dataset': dataset_name,
                'overlap_ratio': overlap_ratio(wc, vocab),
            })

overlap_df = pd.DataFrame(overlap_rows)
overlap_df.pivot_table(index=['dataset', 'k'], columns='language', values='overlap_ratio')


## Description of results (to be filled in)

Write your own description below after running the notebook on real data.

Questions from the assignment:
- Did the choice of languages matter much?
- Do the values (e.g. overlap/score) reveal word inflection in a language (e.g. Polish)?
- Was it hard to find a wiki article that minimizes the score for the wiki's language? Is this specific to the wiki?

Hint: use the `pivot`, `best_tbl`, and `overlap_df` tables.