# Language Confidence Score

This notebook implements a simple method for scoring how well a text's word-frequency profile matches a given language's common-word frequency list.

Inputs:
- `word_counts`: dict mapping word -> count (as produced by `--count-words`)
- `language_words_with_frequency`: list of `(word, frequency)` sorted by descending frequency (at least 1000 entries, since the experiment uses k up to 1000)

The datasets below are synthetic and built inside the notebook itself. The language frequency lists come from the `wordfreq` package (`wordfreq.top_n_list` and `wordfreq.word_frequency`), which must be installed for the experiment cells to run.

In [None]:
from __future__ import annotations

from collections import Counter
import math

import pandas as pd

from wiki_scraper.words import tokenize_words


In [None]:
def lang_confidence_score(
    word_counts: dict[str, int],
    language_words_with_frequency: list[tuple[str, float]],
    *,
    k: int,
) -> float:
    """Score how well `word_counts` matches a language frequency list.

    Approach (pragmatic and easy to explain):
    - Take top-k language words.
    - Compute normalized frequencies for the text restricted to those words.
    - Compute normalized frequencies for the language restricted to those words.
    - Return cosine similarity between the two vectors.

    Notes:
    - The score is in [0, 1] because both vectors are non-negative.
    - Words outside the top-k language list (names, jargon) are ignored entirely.
    - Returns 0.0 if the text shares no words with the top-k list.
    """

    if k <= 0:
        raise ValueError("k must be > 0")

    topk = language_words_with_frequency[:k]
    if not topk:
        return 0.0

    lang_map = {w: float(f) for (w, f) in topk if w}
    # Normalize language freqs over the top-k list.
    lang_sum = sum(lang_map.values())
    if lang_sum <= 0:
        return 0.0
    lang_norm = {w: f / lang_sum for (w, f) in lang_map.items()}

    # Build text frequencies over the same vocabulary.
    text_sum = 0
    text_map: dict[str, float] = {}
    for w in lang_map.keys():
        c = int(word_counts.get(w, 0))
        if c > 0:
            text_map[w] = float(c)
            text_sum += c

    if text_sum <= 0:
        return 0.0

    text_norm = {w: c / text_sum for (w, c) in text_map.items()}

    # Cosine similarity.
    dot = 0.0
    a2 = 0.0
    b2 = 0.0
    for w in lang_map.keys():
        a = text_norm.get(w, 0.0)
        b = lang_norm.get(w, 0.0)
        dot += a * b
        a2 += a * a
        b2 += b * b

    denom = math.sqrt(a2) * math.sqrt(b2)
    if denom <= 0:
        return 0.0
    return dot / denom
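
The approach described in the docstring can be checked by hand on a toy vocabulary. The sketch below is a compact restatement, not the notebook's function: `cosine_topk` is a hypothetical name, the frequencies are made up, and the normalization step is skipped because cosine similarity does not change when a vector is multiplied by a positive constant.

```python
import math


def cosine_topk(word_counts, lang_freqs, k):
    """Cosine similarity between text counts and language frequencies over the top-k words."""
    top = lang_freqs[:k]
    lang_vec = [float(f) for _, f in top]
    text_vec = [float(word_counts.get(w, 0)) for w, _ in top]
    dot = sum(a * b for a, b in zip(text_vec, lang_vec))
    norm = math.sqrt(sum(a * a for a in text_vec)) * math.sqrt(sum(b * b for b in lang_vec))
    return dot / norm if norm > 0 else 0.0


# Made-up frequencies, for illustration only.
toy_lang = [("the", 0.05), ("of", 0.03), ("and", 0.02)]

proportional = {"the": 50, "of": 30, "and": 20}  # same shape as the language profile
skewed = {"and": 10, "pikachu": 90}              # "pikachu" is outside the top-k list
disjoint = {"pikachu": 5, "meowth": 3}           # no overlap with the top-k list

score_proportional = cosine_topk(proportional, toy_lang, k=3)  # close to 1.0
score_skewed = cosine_topk(skewed, toy_lang, k=3)              # strictly between 0 and 1
score_disjoint = cosine_topk(disjoint, toy_lang, k=3)          # exactly 0.0
```

Note that `skewed` still gets a non-zero score even though 90% of its tokens are out of vocabulary: words outside the top-k list do not lower the score, they are simply not counted.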


In [None]:
def get_language_frequency_list(language_code: str, n: int) -> list[tuple[str, float]]:
    """Return a list of (word, frequency) sorted by frequency desc.

    Uses wordfreq if available.
    """

    try:
        from wordfreq import top_n_list, word_frequency
    except ImportError as exc:
        raise RuntimeError(
            "wordfreq not installed. Install it (pip install wordfreq) to run this cell."
        ) from exc

    words = top_n_list(language_code, n)
    return [(w, float(word_frequency(w, language_code))) for w in words]
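
If `wordfreq` is unavailable, a frequency list in the same `(word, frequency)` shape can be derived from any reference word counts. A minimal sketch, where `frequency_list_from_counts` is a hypothetical helper that is part of neither this project nor `wordfreq`, and the reference sentence is a toy input:

```python
from collections import Counter


def frequency_list_from_counts(counts, n):
    """Turn raw counts into [(word, relative_frequency)], most frequent first."""
    total = sum(counts.values())
    if total <= 0:
        return []
    # most_common sorts by count, descending; ties keep first-seen order.
    return [(w, c / total) for w, c in Counter(counts).most_common(n)]


# Toy reference corpus; a real list would need far more text.
reference = Counter("the cat and the dog and the bird".split())
toy_list = frequency_list_from_counts(reference, n=2)
```

A list built this way only reflects the reference corpus, so scores computed against it are not comparable with scores computed against `wordfreq` lists.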


In [None]:
def counts_from_text(text: str) -> dict[str, int]:
    """Tokenize `text` with the project's tokenizer and count each token."""
    words = tokenize_words(text)
    return dict(Counter(words))
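
`counts_from_text` depends on `tokenize_words` from the project package. For running the notebook without that package, a stand-in tokenizer could look like the sketch below. `simple_tokenize` and `counts_from_text_fallback` are hypothetical names, and the real `tokenize_words` may handle case, hyphens, or digits differently.

```python
import re
from collections import Counter

# \w matches Unicode letters, digits and underscore;
# excluding \d and _ leaves runs of letters only.
_WORD_RE = re.compile(r"[^\W\d_]+")


def simple_tokenize(text):
    """Lowercase the text and return its runs of letters."""
    return _WORD_RE.findall(text.lower())


def counts_from_text_fallback(text):
    """Same output shape as counts_from_text: dict mapping word -> count."""
    return dict(Counter(simple_tokenize(text)))


sample_counts = counts_from_text_fallback("To jest tekst. To nie jest wiki!")
```

Lowercasing matters here because frequency lists are typically lowercase, so a capitalized sentence-initial word would otherwise miss its match.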


## Datasets

Below we create 5 datasets. Replace these with real outputs from your project as needed (for example by loading JSON files produced by `--count-words`).

Practical note: the assignment asks for one long wiki article (5000+ words). If you do not have one saved offline, you can still validate the scoring pipeline on synthetic or external texts, then swap in real word-count dictionaries later.
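
Swapping in real data could be as small as the loader sketched below. It assumes that `--count-words` writes a flat JSON object mapping each word to its count; that file layout is an assumption, and `load_word_counts` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_word_counts(path):
    """Load a {word: count} JSON file into a dict[str, int]."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(raw, dict):
        raise ValueError(f"expected a JSON object of word counts, got {type(raw).__name__}")
    # Coerce counts to int and drop entries with a non-positive count.
    return {str(w): int(c) for w, c in raw.items() if int(c) > 0}


# Example use once a real file exists (path is illustrative):
# datasets['wiki_long_en'] = load_word_counts('word-counts.json')
```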

In [None]:
en_text = (
    "Team Rocket is a villainous team that tries to steal rare creatures and cause trouble. "
    "They often fail, but they continue to plan and return."
)
pl_text = (
    "To jest przykladowy tekst po polsku. Zawiera slowa czeste i rzadkie, aby zasymulowac rozklad. "
    "To nie jest tekst z wiki, ale nadaje sie do testow."
)
es_text = (
    "Este es un texto de ejemplo en espanol. Contiene palabras comunes y algunas menos comunes para pruebas. "
    "No es un texto de wiki, pero sirve para comparar."
)

# Synthetic long and short samples
wiki_long_en = counts_from_text((en_text + ' ') * 400)  # about 10,000 words (25 x 400), above the 5000-word target
# Mostly proper nouns, intended to score low for every language.
wiki_short_bad = counts_from_text("Bulbasaur Pikachu Sevii Islands Rocket-dan James Jessie Meowth " * 5)
ext_en = counts_from_text("The quick brown fox jumps over the lazy dog. " * 300)
ext_pl = counts_from_text((pl_text + ' ') * 300)
ext_es = counts_from_text((es_text + ' ') * 300)

datasets = {
    'wiki_long_en': wiki_long_en,
    'wiki_short_bad': wiki_short_bad,
    'ext_en': ext_en,
    'ext_pl': ext_pl,
    'ext_es': ext_es,
}
# Sanity check: total word count per dataset.
{name: sum(d.values()) for name, d in datasets.items()}

## Experiment

Compute the score for every combination of dataset, language (`en`, `pl`, `es`), and k in {3, 10, 100, 1000}.

In [None]:
languages = ['en', 'pl', 'es']
ks = [3, 10, 100, 1000]

language_lists: dict[str, list[tuple[str, float]]] = {}
for lang in languages:
    language_lists[lang] = get_language_frequency_list(lang, 5000)

rows = []
for k in ks:
    for lang in languages:
        lang_list = language_lists[lang]
        for dataset_name, wc in datasets.items():
            score = lang_confidence_score(wc, lang_list, k=k)
            rows.append({'k': k, 'language': lang, 'dataset': dataset_name, 'score': score})

df = pd.DataFrame(rows)
df.sort_values(['k', 'dataset', 'language']).head(20)

In [None]:
pivot = df.pivot_table(index=['dataset', 'k'], columns='language', values='score')
pivot
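
To read the table as a language guess, one option is to pick the highest-scoring language for each (dataset, k) pair. The sketch below works on a list of dicts shaped like `rows` from the experiment cell. `best_language_per_run` is a hypothetical helper, and the scores in `toy_rows` are made up purely to show the shape of the result.

```python
def best_language_per_run(rows):
    """Map (dataset, k) -> (language, score) for the highest score in each group."""
    best = {}
    for r in rows:
        key = (r["dataset"], r["k"])
        # Keep the first language seen on ties.
        if key not in best or r["score"] > best[key][1]:
            best[key] = (r["language"], r["score"])
    return best


# Made-up scores, not experiment results.
toy_rows = [
    {"k": 10, "language": "en", "dataset": "ext_en", "score": 0.8},
    {"k": 10, "language": "pl", "dataset": "ext_en", "score": 0.1},
    {"k": 10, "language": "es", "dataset": "ext_en", "score": 0.2},
]
toy_best = best_language_per_run(toy_rows)
```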

In [None]:
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.figsize'] = (12, 6)

for dataset_name in sorted(datasets.keys()):
    sub = df[df['dataset'] == dataset_name]
    fig, ax = plt.subplots()
    for lang in languages:
        s = sub[sub['language'] == lang].sort_values('k')
        ax.plot(s['k'], s['score'], marker='o', label=lang)
    ax.set_xscale('log')
    ax.set_title(f'lang_confidence_score vs k ({dataset_name})')
    ax.set_xlabel('k (log scale)')
    ax.set_ylabel('score')
    ax.grid(alpha=0.25)
    ax.legend()
    plt.show()

## Notes / Discussion (template)

Replace this section with your own empirical observations after you swap in real wiki-derived `word-counts.json` files and real external texts.

Questions to address:
- Did the choice of languages matter?
- Can you see evidence of inflection (many word forms) in the word-frequency overlap?
- Was it hard to find an article that minimized the score for the wiki language?
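
For the inflection question, one diagnostic that may help is top-k coverage: the share of a text's tokens that appear in the top-k language list. The intuition, which is an assumption to verify against your own data, is that a heavily inflected language spreads occurrences over more word forms, so coverage at a given k may be lower. `topk_coverage` is a hypothetical helper and the numbers below are toy values.

```python
def topk_coverage(word_counts, lang_freqs, k):
    """Share of the text's tokens that are among the top-k language words."""
    total = sum(word_counts.values())
    if total <= 0:
        return 0.0
    vocab = {w for w, _ in lang_freqs[:k]}
    covered = sum(c for w, c in word_counts.items() if w in vocab)
    return covered / total


# Made-up frequencies and counts, for illustration only.
toy_lang = [("the", 0.05), ("of", 0.03), ("and", 0.02)]
toy_counts = {"the": 6, "and": 2, "pikachu": 2}
coverage = topk_coverage(toy_counts, toy_lang, k=3)  # 8 of 10 tokens are covered
```

Coverage complements the cosine score: the score ignores out-of-vocabulary tokens, while coverage measures how many of them there are.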