## 1. Environment Setup
This cell imports the libraries used for data handling (`json`, `pandas`), visualization (`matplotlib`), and language statistics (`wordfreq`). The `wordfreq` library supplies reference word frequencies for each language, which serve as the baseline that each text is compared against.

In [None]:
import json
import matplotlib.pyplot as plt
import pandas as pd
from wordfreq import top_n_list, word_frequency

print("Libraries loaded successfully.")

## 2. Language Confidence Score
This section contains the core logic for language detection.
The score is computed from the overlap between the word counts of a given text and the top $k$ most frequent words of a target language. Each shared word contributes its relative frequency in the text multiplied by its frequency in the language, so the score rises when the text makes heavy use of words that are common in the target language (e.g., "the", "and" in English).
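
Written as a formula, the score described above can be sketched as follows. The symbols are notation introduced here for illustration, not names used in the code: $V_k(L)$ is the set of the top $k$ words of language $L$, $c_T(w)$ is the count of word $w$ in text $T$, $N_T$ is the total number of words in $T$, and $f_L(w)$ is the reference frequency of $w$ in $L$.

$$
S_k(T, L) = \sum_{w \in V_k(L)} \frac{c_T(w)}{N_T} \, f_L(w)
$$

Words of the text that fall outside $V_k(L)$ contribute nothing, and an empty text ($N_T = 0$) is assigned a score of $0$.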

In [None]:
def lang_confidence_score(word_counts, language_words_with_frequency):
    """
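    Calculates a confidence score indicating how well the word_counts match the language profile.

    word_counts: dict mapping each word in the text to its number of occurrences.
    language_words_with_frequency: dict mapping each top-k word of the language
        to its frequency in that language.
    Returns 0.0 for an empty text; otherwise the sum, over shared words, of
    (relative frequency in the text) * (frequency in the language).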
    """
    score = 0.0
    total_text_words = sum(word_counts.values())
    
    if total_text_words == 0:
        return 0.0

    # Normalize text counts to frequencies (0.0 to 1.0)
    text_freqs = {w: c / total_text_words for w, c in word_counts.items()}
    
    # Calculate score based on overlap
    for word, lang_freq in language_words_with_frequency.items():
        if word in text_freqs:
            # We multiply by the frequency in the text to weight it by usage
            score += text_freqs[word] * lang_freq

    return score
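
To make the weighting concrete, here is a small worked example. It is only a sketch: the two language profiles contain made-up frequencies rather than real `wordfreq` values, and `toy_score` is a hypothetical restatement of the scoring rule so that the cell runs on its own.

```python
# Toy illustration with made-up numbers (NOT real wordfreq values).
def toy_score(word_counts, profile):
    """Restates the weighting of lang_confidence_score so this cell is self-contained."""
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    # Relative frequency in the text times frequency in the language profile.
    return sum((count / total) * profile[word]
               for word, count in word_counts.items() if word in profile)

text_counts = {"the": 4, "cat": 1, "and": 3, "sat": 2}  # 10 words in total
toy_en = {"the": 0.05, "and": 0.03}  # hypothetical English profile
toy_de = {"der": 0.04, "und": 0.03}  # hypothetical German profile

en_score = toy_score(text_counts, toy_en)  # 0.4 * 0.05 + 0.3 * 0.03
de_score = toy_score(text_counts, toy_de)  # no shared words, so nothing is added
print(en_score, de_score)
```

The score is not a probability. It is bounded by the largest frequency in the profile, and profiles of different languages sum to different totals, so the ranking of languages for one text is likely more informative than the absolute values.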

## 3. Data Loading Utility
A simple helper function that loads a JSON file from disk. It catches `FileNotFoundError`, prints a warning, and returns an empty dictionary, so the analysis can proceed even if one file is missing or misnamed. A dataset loaded this way scores 0.0 for every language.

In [None]:
def load_json(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            return json.load(f)
    except FileNotFoundError:
        print(f"Warning: {filename} not found.")
        return {}

print("Loader function defined.")

In [None]:
# Load the five test files (each is expected to map word -> count)
datasets = {
    "Wiki Long (En)": load_json('data_wiki_long.json'),
    "Wiki Short (En)": load_json('data_wiki_short.json'),
    "External (En)":  load_json('data_ext_en.json'),
    "External (Pl)":  load_json('data_ext_pl.json'),
    "External (De)":  load_json('data_ext_de.json')
}

print("Datasets loaded.")
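
The scoring function calls `.values()` and `.items()` on each dataset, so the JSON files are assumed here to hold a flat word-to-count mapping. The sketch below shows one way such a file could be produced from raw text. The helper `count_words` and its simple letters-only tokenizer are hypothetical and may differ from how the actual files were built.

```python
import json
import re
from collections import Counter

def count_words(text):
    """Lowercase the text and count runs of letters (a deliberately simple tokenizer)."""
    # [^\W\d_]+ matches one or more Unicode letters, so "ł" or "ü" are kept.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)

sample = "The cat and the dog. The END!"
counts = count_words(sample)

# A Counter is a dict subclass, so it serializes as a plain JSON object.
serialized = json.dumps(counts, ensure_ascii=False)
restored = json.loads(serialized)
print(restored)
```

Lowercasing matters in this sketch because the comparison is an exact dictionary lookup: a key such as "The" would not match a lowercase entry in a language profile.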

In [None]:
# Configuration
k_values = [3, 10, 100, 1000]
languages = ['en', 'pl', 'de']  # English, Polish, German

results = []

for k in k_values:
    # 1. Prepare Language Data for this K
    lang_profiles = {}
    for lang in languages:
        # Get top K words
        top_words = top_n_list(lang, k)
        # Get their frequencies
        lang_profiles[lang] = {w: word_frequency(w, lang) for w in top_words}
        
    # 2. Score each dataset against each language
    for data_name, word_counts in datasets.items():
        for lang in languages:
            score = lang_confidence_score(word_counts, lang_profiles[lang])
            
            results.append({
                'k': k,
                'Dataset': data_name,
                'Language': lang,
                'Score': score
            })

df_results = pd.DataFrame(results)
print(df_results.head(10))
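
Once the scores are collected, a natural next step is to read off the highest-scoring language for each dataset and each $k$. The sketch below does this in plain Python on rows shaped like the entries of `results`. The function `best_language_per_run` and the demo rows are hypothetical, and the scores are placeholders rather than measured values.

```python
def best_language_per_run(rows):
    """Return {(k, dataset): language} for the top score, or None if every score is 0."""
    best = {}
    for row in rows:
        key = (row["k"], row["Dataset"])
        # Keep the first row seen for a key unless a strictly higher score appears.
        if key not in best or row["Score"] > best[key]["Score"]:
            best[key] = row
    return {
        key: (row["Language"] if row["Score"] > 0 else None)
        for key, row in best.items()
    }

# Placeholder rows in the same shape as the entries appended to `results`.
demo_rows = [
    {"k": 10, "Dataset": "Demo A", "Language": "en", "Score": 0.020},
    {"k": 10, "Dataset": "Demo A", "Language": "pl", "Score": 0.001},
    {"k": 10, "Dataset": "Demo A", "Language": "de", "Score": 0.004},
    {"k": 10, "Dataset": "Demo B", "Language": "en", "Score": 0.0},
    {"k": 10, "Dataset": "Demo B", "Language": "pl", "Score": 0.0},
    {"k": 10, "Dataset": "Demo B", "Language": "de", "Score": 0.0},
]
winners = best_language_per_run(demo_rows)
print(winners)
```

Returning `None` when all scores are zero avoids reporting a language for a dataset that failed to load or that shares no words with any profile, which can happen for very small $k$.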

In [None]:
# Pivot data for plotting
# We create a separate chart for each K value
for k in k_values:
    subset = df_results[df_results['k'] == k]
    pivot = subset.pivot(index='Dataset', columns='Language', values='Score')
    
    pivot.plot(kind='bar', figsize=(10, 6))
    plt.title(f'Language Confidence Score (top k={k} words)')
    plt.ylabel('Score')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()