# Explore Multilinguality in the Impresso collection

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/explore-vis/explore_langident-ImpressoMD.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What is this notebook about?
This notebook analyzes language usage within the Impresso historical newspaper collection, with a focus on computing language entropy by year and newspaper. Language entropy serves as an indicator of language diversity within each newspaper's yearly content, where higher entropy implies a more balanced distribution of languages.

Based on language identification data from Impresso's November 2024 data release, we determine the language distribution of each newspaper per year and compute entropy values for visualization and analysis.

## What will you learn in this notebook?
By the end of this notebook, you will be able to:

 - Calculate and interpret language entropy per newspaper across years in the Impresso collection.
 - Visualize the language diversity and entropy values by year, observing trends and patterns across newspapers.
 - Generate bar and line charts using Plotly to analyze language diversity within historical newspapers.

## Loading Data and Computing Entropy
The initial code loads JSON data containing language frequency distributions per newspaper per year. Each language frequency distribution is processed to calculate Shannon entropy, providing an entropy score to quantify the spread of languages.
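
To make the entropy score concrete, here is a minimal sketch on made-up counts. The toy distributions and the helper name `shannon_entropy` are illustrative assumptions and not part of the Impresso data. It applies the usual formula $H = -\sum_i p_i \log_2 p_i$ over the identified languages and, like the notebook's own function, ignores the `"None"` bucket.

```python
import math

def shannon_entropy(lang_fd):
    """Shannon entropy in bits of a language count dict, ignoring the "None" bucket."""
    counts = [c for lang, c in lang_fd.items() if lang != "None" and c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        p = c / total
        entropy -= p * math.log2(p)
    return entropy

# Hypothetical newspaper-years (counts are invented for illustration)
monolingual = {"de": 1000}
skewed = {"de": 900, "fr": 100}
balanced = {"de": 500, "fr": 500}
trilingual = {"de": 400, "fr": 400, "lb": 400, "None": 50}

for name, fd in [("monolingual", monolingual), ("skewed", skewed),
                 ("balanced", balanced), ("trilingual", trilingual)]:
    print(f"{name:12s} {shannon_entropy(fd):.3f}")
```

A single language gives 0 bits, two equally frequent languages give 1 bit, and three equally frequent languages give log2(3), roughly 1.585 bits. The skewed 90/10 case lands well below 1 bit, which is why entropy reads as a measure of balance rather than a simple count of languages.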


In [2]:
import json
import math
import pandas as pd
from smart_open import open

# Load the language identification statistics from the remote JSON file (smart_open's open() can read URLs)
with open('https://os.zhdk.cloud.switch.ch/42-processed-data-final/langident/langident_v1-4-4/langident_v0-0-2.json', 'r') as file:
    data = json.load(file)

# Function to calculate Shannon entropy
def compute_entropy(lang_fd):
    total_items = sum(count for lang, count in lang_fd.items() if lang != "None")
    if total_items == 0:
        return 0
    entropy = 0
    for lang, count in lang_fd.items():
        if lang == "None" or count == 0:
            continue
        p = count / total_items
        entropy -= p * math.log2(p)
    return round(entropy,3)

# Initialize a list to store results
entropy_results = []

# Iterate through the media statistics in the JSON data
for media in data['media_list']:
    for stats in media['media_statistics']:
        if stats['granularity'] == 'year':
            # Get the language distribution (lang_fd)
            lang_fd = stats['nps_stats']['lang_fd']
            # Compute the entropy
            entropy = compute_entropy(lang_fd)
            # Append the result: newspaper-year (element) and the entropy value
            entropy_results.append({
                'newspaper_year': stats['element'],
                'entropy': entropy
            })


In [3]:
# Convert the results into a DataFrame for better readability
df_entropy = pd.DataFrame(entropy_results)


# Sort the DataFrame by entropy in descending order
df_entropy_sorted = df_entropy.sort_values('entropy', ascending=False,ignore_index=True)

# Display the sorted DataFrame
df_entropy_sorted

| | newspaper_year | entropy |
|---|---|---|
| 0 | onsjongen-1946 | 1.574 |
| 1 | dunioun-1947 | 1.544 |
| 2 | onsjongen-1947 | 1.543 |
| 3 | dunioun-1948 | 1.528 |
| 4 | luxwort-1945 | 1.524 |
| ... | ... | ... |
| 4263 | SMZ-1913 | 0.000 |
| 4264 | SMZ-1917 | 0.000 |
| 4265 | SMZ-1985 | 0.000 |
| 4266 | SMZ-2000 | 0.000 |


## Visualization of Average Entropy by Newspaper
This section visualizes the average language entropy per newspaper, aggregated across all years. Newspapers are sorted by descending entropy to highlight those with higher language diversity. The visualization includes standard deviation as an error bar to show variation in entropy across years.
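
As a rough illustration of what the per-newspaper aggregation does, the following plain-Python sketch groups made-up yearly entropy values by newspaper and computes their mean and sample standard deviation. The IDs `paperA` and `paperB` and all values are hypothetical. It mirrors the default behaviour of pandas `std` (sample standard deviation, `ddof=1`), which yields NaN for a newspaper that has only one year of data.

```python
import math
import statistics
from collections import defaultdict

# Hypothetical newspaper-year entropies (IDs and values invented for illustration)
toy_entropy = {
    "paperA-1900": 0.2,
    "paperA-1901": 0.4,
    "paperA-1902": 0.6,
    "paperB-1950": 1.1,
}

# Group yearly entropy values by newspaper ID (the part before the final hyphen)
by_newspaper = defaultdict(list)
for newspaper_year, entropy in toy_entropy.items():
    newspaper, _year = newspaper_year.rsplit("-", 1)
    by_newspaper[newspaper].append(entropy)

summary = {}
for newspaper, values in by_newspaper.items():
    mean = statistics.mean(values)
    # The sample standard deviation needs at least two years; use NaN otherwise
    std = statistics.stdev(values) if len(values) > 1 else math.nan
    summary[newspaper] = (mean, std)

# Print newspapers by decreasing mean entropy
for newspaper, (mean, std) in sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{newspaper}: mean={mean:.3f}, std={std:.3f}")
```

A large standard deviation signals that a newspaper's language mix changed noticeably over the years, whereas a missing one only means there were too few years to estimate the spread.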

In [4]:

import plotly.express as px

# Extract the newspaper ID (everything before the final '-YYYY' suffix)
df_entropy_sorted['newspaper'] = df_entropy_sorted['newspaper_year'].str.rsplit('-', n=1).str[0]

# Calculate average entropy and standard deviation by newspaper
avg_entropy = df_entropy_sorted.groupby('newspaper')['entropy'].agg(['mean', 'std'])

# Sort by decreasing average entropy
avg_entropy_sorted = avg_entropy.sort_values('mean', ascending=False)

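# Create a bar chart of the mean entropy, with the standard deviation across years as error bars
# (a newspaper with a single year has no standard deviation, so its std is NaN)
fig = px.bar(
    avg_entropy_sorted,
    x=avg_entropy_sorted.index,
    y='mean',
    error_y='std',
    title='Average Yearly Entropy by Newspaper (Sorted by Decreasing Mean) with Standard Deviation'
)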

fig.update_layout(xaxis_title='Newspaper', yaxis_title='Average Entropy')

fig.show()



## Language Distribution per Newspaper over Time
This interactive section enables the exploration of language distributions over time for each newspaper. Using a dropdown menu, you can select specific newspapers, view language distribution trends by year, and examine how language usage shifts over time.
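
A dropdown of this kind boils down to one list of booleans per newspaper, telling Plotly's `update` button which traces to show. The sketch below illustrates that idea on plain strings, assuming trace names follow a `label (newspaper)` pattern. The newspaper IDs `GAZ` and `GAZETTE` and the helper `visibility_mask` are hypothetical. It uses an exact suffix match, which is one way to keep an ID that happens to be contained in another ID from selecting the wrong traces.

```python
# Hypothetical trace names in the "<label> (<newspaper>)" pattern
trace_names = [
    "de (GAZ)", "fr (GAZ)", "Entropy (GAZ)",
    "fr (GAZETTE)", "Entropy (GAZETTE)",
]

def visibility_mask(newspaper, names):
    """One boolean per trace: True only for traces of the selected newspaper."""
    suffix = f"({newspaper})"
    return [name.endswith(suffix) for name in names]

masks = {paper: visibility_mask(paper, trace_names) for paper in ["GAZ", "GAZETTE"]}
print(masks["GAZ"])      # [True, True, True, False, False]
print(masks["GAZETTE"])  # [False, False, False, True, True]

# For comparison: a plain substring test also selects the GAZETTE traces for "GAZ"
substring_mask = ["GAZ" in name for name in trace_names]
print(substring_mask)    # [True, True, True, True, True]
```

Each mask has exactly one entry per trace, in the order the traces were added to the figure, which is the shape the `"visible"` argument of an `update` button expects.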

Each language's share of a newspaper's yearly content is plotted as a percentage on the left y-axis, while entropy is plotted on a secondary y-axis on the right. Together they show both which languages prevail and how diverse the mix is over the years.
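
The percentage computation can be sketched on a single made-up year. The counts below are hypothetical, and `"None"` is assumed to mark content items without an identified language, which are left out of the denominator.

```python
# Hypothetical yearly language counts; "None" stands for items without an identified language
lang_fd = {"de": 150, "fr": 45, "lb": 5, "None": 20}

# Keep only identified languages and use their sum as the denominator
identified = {lang: n for lang, n in lang_fd.items() if lang != "None"}
total_items = sum(identified.values())  # 200; the 20 "None" items are left out

percentages = {lang: 100 * n / total_items for lang, n in identified.items()}
print(percentages)  # {'de': 75.0, 'fr': 22.5, 'lb': 2.5}
```

Because the `"None"` bucket is excluded, the percentages of the identified languages always add up to 100 for a given year.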

In [5]:
import plotly.graph_objects as go

# This cell reuses `data`, `compute_entropy`, and the json, math, and pandas imports from the cells above

# Initialize a list to store results
data_for_plot = []

# Track every language that occurs in at least one newspaper-year
all_languages = set()

# Iterate through the media statistics in the JSON data
for media in data['media_list']:
    for stats in media['media_statistics']:
        if stats['granularity'] == 'year':
            # Split on the last hyphen so that newspaper IDs containing a hyphen stay intact
            newspaper, year_str = stats['element'].rsplit('-', 1)
            year = int(year_str)
            lang_fd = stats['nps_stats']['lang_fd']

            # Calculate the total number of content items (excluding "None")
            total_items = sum(count for lang, count in lang_fd.items() if lang != "None")

            # Skip years where no content items exist
            if total_items == 0:
                continue

            # Calculate relative percentage for each language
            lang_percentages = {
                lang: (count / total_items) * 100
                for lang, count in lang_fd.items()
                if lang != "None"
            }

            # Track all languages that appear at least once in any year
            all_languages.update(lang_percentages.keys())

            # Store the absolute counts as custom data for the tooltip
            lang_absolute = {
                lang: count
                for lang, count in lang_fd.items()
                if lang != "None"
            }

            # Calculate entropy
            entropy = compute_entropy(lang_fd)

            # Prepare data for plotting
            lang_percentages['year'] = year
            lang_percentages['newspaper'] = newspaper
            lang_percentages['entropy'] = entropy
            lang_percentages['customdata'] = lang_absolute  # Store absolute counts for hover
            lang_percentages['total_items'] = total_items  # Add total items for the selected newspaper
            data_for_plot.append(lang_percentages)

# Create a DataFrame for plotting
df_plot = pd.DataFrame(data_for_plot)

# Sorting by year (to ensure x-axis is in ascending order)
df_plot = df_plot.sort_values(by='year')

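# Ensure every language has a value for every newspaper-year: languages missing
# from a given lang_fd come out of the DataFrame constructor as NaN, so set them to 0%
language_columns = sorted(all_languages)
df_plot[language_columns] = df_plot[language_columns].fillna(0.0)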

# Calculate average entropy per newspaper and sort by it (in descending order)
df_avg_entropy = df_plot.groupby('newspaper')['entropy'].mean().sort_values(ascending=False).reset_index()
newspaper_options = df_avg_entropy['newspaper'].tolist()

# Create traces for each newspaper, but make them invisible initially
fig = go.Figure()

# Create separate traces for each newspaper and make them invisible
for newspaper in newspaper_options:
    df_selected = df_plot[df_plot['newspaper'] == newspaper]

    # Filter to only include languages with non-zero values in the selected newspaper
    languages = [col for col in df_selected.columns if col not in ['year', 'newspaper', 'entropy', 'total_items', 'customdata']]
    for language in languages:
        # Check if the language has any non-zero values in the selected newspaper
        if df_selected[language].sum() > 0:
            # Add trace for each language
            fig.add_trace(go.Scatter(
                x=df_selected['year'],
                y=df_selected[language],
                mode='lines+markers',
                name=f'{language} ({newspaper})',
                visible=False,  # Initially invisible
                customdata=df_selected['customdata'].apply(lambda counts: counts.get(language, 0)),  # Absolute counts for hover (0 if the language is absent that year)
                hovertemplate=f'{language}: %{{y:.2f}}%<br>Articles: %{{customdata}}'  # Show absolute article count in hover
            ))

    # Add entropy trace with a dashed line (for the selected newspaper only)
    fig.add_trace(go.Scatter(
        x=df_selected['year'],
        y=df_selected['entropy'],
        mode='lines+markers',
        name=f'Entropy ({newspaper})',
        visible=False,  # Initially invisible
        line=dict(dash='dash', color='black'),
        yaxis='y2',
        customdata=df_selected['total_items'],  # Total number of articles for entropy (newspaper-specific)
        hovertemplate=f'Entropy: %{{y:.2f}}<br>Articles: %{{customdata}}'  # Add total articles for the selected newspaper
    ))

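# Make the first newspaper's traces visible by default.
# Match the exact "(newspaper)" suffix so that an ID contained in another ID does not match.
for trace in fig.data:
    if trace.name.endswith(f'({newspaper_options[0]})'):
        trace.visible = True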

# Create dropdown menu options
dropdown_buttons = []

# Find unique years and corresponding total items for the selected newspaper only
def get_newspaper_year_data(df_selected):
    return df_selected[['year', 'total_items']].drop_duplicates().sort_values(by='year')

for newspaper in newspaper_options:
    df_selected = df_plot[df_plot['newspaper'] == newspaper]
    # Get unique years and total items for the selected newspaper
    unique_years = get_newspaper_year_data(df_selected)
    tick_labels = [f'{year}  (#{int(total)})' for year, total in zip(unique_years['year'], unique_years['total_items'])]

    # Define the width based on the number of years (40 pixels per year, at least 800 pixels)
    fig_width = max(40 * len(unique_years['year']), 800)

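    # Add a button that updates the visible traces, the title, the x-axis ticks, and the figure width
    dropdown_buttons.append(dict(
        label=newspaper,
        method="update",
        args=[
            {
                # Exact "(newspaper)" suffix match, so an ID contained in another ID does not match
                "visible": [trace.name.endswith(f'({newspaper})') for trace in fig.data],
            },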
            {
                "title": f"Relative Language Percentage and Entropy for {newspaper}",
                "xaxis": dict(
                    tickmode='array',
                    tickvals=unique_years['year'],  # Update x-axis ticks dynamically based on newspaper
                    ticktext=tick_labels,  # Update x-axis labels dynamically
                    tickangle=-270  # Keep the labels vertical
                ),
                "width": fig_width  # Dynamically set the width based on the number of years
            }
        ]
    ))

# Find unique years and corresponding total items for the first selected newspaper
unique_years = get_newspaper_year_data(df_plot[df_plot['newspaper'] == newspaper_options[0]])

# Add total article counts below the years for x-axis tick labels
tick_labels = [f'{year}  (#{int(total)})' for year, total in zip(unique_years['year'], unique_years['total_items'])]

# Define the initial figure width for the first selected newspaper (same rule as in the dropdown: 40 pixels per year, at least 800 pixels)
fig_width = max(40 * len(unique_years['year']), 800)


# Update layout to include dropdown, secondary y-axis, and set full year ticks for x-axis
fig.update_layout(
    title=f"Relative Language Percentage and Entropy for {newspaper_options[0]}",
    width=fig_width,  # Dynamically set the initial width based on the first selected newspaper
    xaxis_title='Year',
    yaxis_title='Percentage (%)',
    yaxis_range=[-2, 102],  # Percentage axis from 0 to 100%, padded so markers at the limits are not clipped
    xaxis=dict(
        tickmode='array',
        tickvals=unique_years['year'],  # Ensure unique years as ticks on the x-axis for the selected newspaper
        ticktext=tick_labels,  # Show both year and total articles (newspaper-specific) on the x-axis
        tickangle=-270  # Set labels to vertical mode
    ),
    yaxis2=dict(
        title='Entropy',
        overlaying='y',
        side='right',
        range=[-0.04, 2.04],  # Y-axis for entropy from 0 to 2
        tickvals=[0, 0.4, 0.8, 1.2, 1.6, 2],  # Align with percentage axis steps (0, 20%, ..., 100%)
    ),
    updatemenus=[{
        "buttons": dropdown_buttons,
        "direction": "down",
        "showactive": True,
        "x": 0.1,
        "y": 1.15
    }]
)

# Show the plot
fig.show()
