# Parliamentary Speech Topic Modeling Pipeline

This notebook implements a comprehensive topic modeling pipeline for parliamentary speech data from three countries (Great Britain, Austria, and Croatia). The pipeline uses BERTopic with Gaussian Mixture Model clustering and OpenAI GPT for topic classification into 23 predefined policy categories.

## Pipeline Overview:
1. **Data Loading & Configuration** - Load preprocessed datasets and define constants
2. **Core Classes & Functions** - Define GMM clustering and topic modeling functions  
3. **OpenAI Classification** - Classify discovered topics into policy categories
4. **Execution** - Run the complete pipeline on all datasets
5. **Results & Analysis** - Save results and generate comprehensive summary

In [1]:
# === IMPORTS AND SETUP ===
import pandas as pd
import numpy as np
import time

# NLP and Topic Modeling
import nltk
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer
from sklearn.mixture import GaussianMixture
from bertopic import BERTopic
from umap import UMAP
from nltk.corpus import stopwords

# OpenAI for topic classification
from openai import OpenAI
from dotenv import load_dotenv

# Analysis and Visualization
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import matplotlib.pyplot as plt

# Setup
nltk.download('stopwords', quiet=True)
load_dotenv()
pd.options.display.max_columns = None

print("📦 All imports loaded successfully")

📦 All imports loaded successfully


# Data Loading

Loading the preprocessed parliamentary speech datasets with embeddings from all three countries.

In [2]:
# === DATA LOADING ===
print("📂 Loading datasets...")

# Load the preprocessed data with embeddings
AT_combined = pd.read_pickle(r"data folder\AT\AT_final.pkl")
AT_combined.drop(columns=['Segment_ID'], inplace=True, errors='ignore')

HR_combined = pd.read_pickle(r"data folder\HR\HR_final.pkl")
HR_combined.drop(columns=['Segment_ID'], inplace=True, errors='ignore')

GB = pd.read_pickle(r"data folder\GB\GB_final.pkl")

print(f"✅ Loaded datasets:")
print(f"   • AT (Austrian Parliament): {AT_combined.shape}")
print(f"   • HR (Croatian Parliament): {HR_combined.shape}")
print(f"   • GB (British Parliament): {GB.shape}")

📂 Loading datasets...
✅ Loaded datasets:
   • AT (Austrian Parliament): (231759, 32)
   • HR (Croatian Parliament): (504338, 32)
   • GB (British Parliament): (670912, 29)
✅ Loaded datasets:
   • AT (Austrian Parliament): (231759, 32)
   • HR (Croatian Parliament): (504338, 32)
   • GB (British Parliament): (670912, 29)


# Configuration & Constants

Setting up the policy category framework and language-specific stopwords for topic modeling.

## Policy Categories
The analysis uses 23 predefined policy categories based on the Policy Agendas Project framework, covering all major areas of government policy from Education to Defense to Environmental issues.

## Stopwords
Custom stopwords are defined for each language to filter out parliamentary procedure terms, titles, and common political vocabulary that doesn't indicate policy content.

In [3]:
# === CONFIGURATION AND CONSTANTS ===

# Target policy categories (22 categories + Mix)
LABEL_DICT = {
    "Education": "Issues related to educational policies, primary and secondary schools, student loans and education finance, the regulation of colleges and universities, school reforms, teachers, vocational training, evening schools, safety in schools, efforts to improve educational standards, and issues related to libraries, dictionaries, teaching material, research in education",
    "Technology": "Issues related to science and technology transfer and international science cooperation, research policy, government space programs and space exploration, telephones and telecommunication regulation, broadcast media (television, radio, newspapers, films), weather forecasting, geological surveys, computer industry, cyber security.",
    "Health": "Issues related to health care, health care reforms, health insurance, drug industry, medical facilities, medical workers, disease prevention, treatment, and health promotion, drug and alcohol abuse, mental health, research in medicine, medical liability and unfair medical practices.",
    "Environment": "Issues related to environmental policy, drinking water safety, all kinds of pollution (air, noise, soil), waste disposal, recycling, climate change, outdoor environmental hazards (e.g., asbestos), species and forest protection, marine and freshwater environment, hunting, regulation of laboratory or performance animals, land and water resource conservation, research in environmental technology.",
    "Housing": "Issues related to housing, urban affairs and community development, housing market, property tax, spatial planning, rural development, location permits, construction inspection, illegal construction, industrial and commercial building issues, national housing policy, housing for low-income individuals, rental housing, housing for the elderly, e.g., nursing homes, housing for the homeless and efforts to reduce homelessness, research related to housing, construction inspection, illegal construction, industrial and commercial building issues, national housing policy, housing for low-income individuals, rental housing, housing for the elderly, e.g., nursing homes, housing for the homeless and efforts to reduce homelessness, research related to housing.",
    "Labor": "Issues related to labor, employment, employment programs, employee benefits, pensions and retirement accounts, minimum wage, labor law, job training, labor unions, worker safety and protection, youth employment and seasonal workers.",
    "Defense": "Issues related to defense policy, military intelligence, espionage, weapons, military personnel, reserve forces, military buildings, military courts, nuclear weapons, civil defense, including firefighters and mountain rescue services, homeland security, military aid or arms sales to other countries, prisoners of war and collateral damage to civilian populations, military nuclear and hazardous waste disposal and military environmental compliance, defense alliances and agreements, direct foreign military operations, claims against military, defense research.",
    "Government Operations": "Issues related to general government operations, the work of multiple departments, public employees, postal services, nominations and appointments, national mints, medals, and commemorative coins, management of government property, government procurement and contractors, public scandal and impeachment, claims against the government, the state inspectorate and audit, anti-corruption policies, regulation of political campaigns, political advertising and voter registration, census and statistics collection by government; issues related to local government, capital city and municipalities, including decentralization; issues related to national holidays.",
    "Social Welfare": "Issues related to social welfare policy, the Ministry of Social Affairs, social services, poverty assistance for low-income families and for the elderly, parental leave and child care, assistance for people with physical or mental disabilities, including early retirement pension, discounts on public services, volunteer associations (e.g., Red Cross), charities, and youth organizations.",
    "Macroeconomics": "Issues related to domestic macroeconomic policy, such as the state and prospect of the national economy, economic policy,inflation, interest rates, monetary policy, cost of living, unemployment rate, national budget, public debt, price control, tax enforcement, industrial revitalization and growth.",
    "Domestic Commerce": "Issues related to banking, finance and internal commerce, including stock exchange, investments, consumer finance, mortgages, credit cards, insurance availability and cost, accounting regulation, personal, commercial, and municipal bankruptcies, programs to promote small businesses, copyrights and patents, intellectual property, natural disaster preparedness and relief, consumer safety; regulation and promotion of tourism, sports, gambling, and personal fitness; domestic commerce research.",
    "Civil Rights": "Issues related to civil rights and minority rights, discrimination towards races, gender, sexual orientation, handicap, and other minorities, voting rights, freedom of speech, religious freedoms, privacy rights, protection of personal data, abortion rights, anti-government activity groups (e.g., local insurgency groups), religion and the Church.",
    "International Affairs": "Issues related to international affairs, foreign policy and relations to other countries, issues related to the Ministry of Foreign Affairs, foreign aid, international agreements (such as Kyoto agreement on the environment, the Schengen agreement), international organizations (including United Nations, UNESCO, International Olympic Committee, International Criminal Court), NGOs, issues related to diplomacy, embassies, citizens abroad; issues related to border control; issues related to international finance, including the World Bank and International Monetary Fund, the financial situation of the EU; issues related to a foreign country that do not impact the home country; issues related to human rights in other countries, international terrorism.",
    "Transportation": "Issues related to mass transportation construction and regulation, bus transport, regulation related to motor vehicles, road construction, maintenance and safety, parking facilities, traffic accidents statistics, air travel, rail travel, rail freight, maritime transportation, inland waterways and channels, transportation research and development.",
    "Immigration": "Issues related to immigration, refugees, and citizenship, integration issues, regulation of residence permits, asylum applications; criminal offences and diseases caused by immigration.",
    "Law and Crime": "Issues related to the control, prevention, and impact of crime; all law enforcement agencies, including border and customs, police, court system, prison system; terrorism, white collar crime, counterfeiting and fraud, cyber-crime, drug trafficking, domestic violence, child welfare, family law, juvenile crime.",
    "Agriculture": " Issues related to agriculture policy, fishing, agricultural foreign trade, food marketing, subsidies to farmers, food inspection and safety, animal and crop disease, pest control and pesticide regulation, welfare for animals in farms, pets, veterinary medicine, agricultural research.",
    "Foreign Trade": "Issues related to foreign trade, trade negotiations, free trade agreements, import regulation, export promotion and regulation, subsidies, private business investment and corporate development, competitiveness, exchange rates, the strength of national currency in comparison to other currencies, foreign investment and sales of companies abroad.",
    "Culture": "Issues related to cultural policies, Ministry of Culture, public spending on culture, cultural employees, issues related to support of theatres and artists; allocation of funds from the national lottery, issues related to cultural heritage.",
    "Public Lands": "Issues related to national parks, memorials, historic sites, and protected areas, including the management and staffing of cultural sites; museums; use of public lands and forests, establishment and management of harbors and marinas; issues related to flood control, forest fires, livestock grazing.",
    "Energy": "Issues related to energy policy, electricity, regulation of electrical utilities, nuclear energy and disposal of nuclear waste, natural gas and oil, drilling, oil spills, oil and gas prices, heat supply, shortages and gasoline regulation, coal production, alternative and renewable energy, energy conservation and energy efficiency, energy research.",
    "Other": "Other topics not mentioning policy agendas, including the procedures of parliamentary meetings, e.g., points of order, voting procedures, meeting logistics; interpersonal speech, e.g., greetings, personal stories, tributes, interjections, arguments between the members; rhetorical speech, e.g., jokes, literary references.",
    "Mix": "Use this category when the topic clearly spans multiple policy areas or when there is significant uncertainty about which single category best fits the topic. This is for topics that genuinely combine elements from 2-3 different categories in a meaningful way, making it difficult to assign to just one category with high confidence."
}

# Language-specific stopwords
ENGLISH_CUSTOM_STOPWORDS = [
    'mr', 'mrs', 'ms', 'dr', 'madam', 'honorable', 'honourable', 'member', 'members', 'vp', 'sp', 'fp', 'ae', 'po'
    'minister', 'speaker', 'deputy', 'president', 'chairman', 'chair', 'schilling', 'my', 'lords', 'lord', 'bzs', 'prll', 'bz'
    'secretary', 'lord', 'gp', 'lady', 'question', 'order', 'point', 'debate', 'motion', 'amendment', 'backbench', 'week',
    'congratulations', 'congratulate', 'thanks', 'thank', 'say', 'one', 'want', 'know', 'think', 'noble', 'opg',
    'believe', 'see', 'go', 'come', 'give', 'take', 'people', 'federal', 'government', 'austria', 'baroness',
    'austrian', 'committee', 'call', 'said', 'already', 'please', 'request', 'proceed', 'reading', 'prime',
    'course', 'welcome', 'council', 'open', 'written', 'contain', 'items', 'item', 'yes', 'no', 
    'following', 'next', 'speech', 'year', 'years', 'state', 'also', 'would', 'like', 'may', 'must', 
    'upon', 'indeed', 'session', 'meeting', 'report', 'commission', 'behalf', 'gentleman', 'gentlemen', 
    'ladies', 'applause', 'group', 'colleague', 'colleagues', 'issue', 'issues', 'chancellor', 'court', 
    'ask', 'answer', 'reply', 'regard', 'regarding', 'regards', 'respect', 'respectfully', 'sign', 
    'shall', 'procedure', 'declare', 'hear', 'minutes', 'speaking', 'close', 'abg', 'mag', 'orf', 'wait'
]

GERMAN_CUSTOM_STOPWORDS = [
    'der', 'die', 'das', 'und', 'in', 'zu', 'den', 'mit', 'von', 'für', 'bb', 'bz', 'bzs', 'prll',
    'auf', 'ist', 'im', 'sich', 'eine', 'sie', 'dem', 'nicht', 'ein', 'als',
    'auch', 'es', 'an', 'werden', 'aus', 'er', 'hat', 'dass', 'wir', 'ich',
    'haben', 'sind', 'kann', 'sehr', 'meine', 'muss', 'doch', 'wenn', 'sein',
    'dann', 'weil', 'bei', 'nach', 'so', 'oder', 'aber', 'vor', 'über', 'noch',
    'nur', 'wie', 'war', 'waren', 'wird', 'wurde', 'wurden', 'ihr', 'ihre',
    'ihren', 'seiner', 'seine', 'seinem', 'seinen', 'dieser', 'diese', 'dieses',
    'durch', 'ohne', 'gegen', 'unter', 'zwischen', 'während', 'bis', 'seit',
    'danke', 'bitte', 'gern', 'abgeordnete', 'abgeordneten', 'bundesregierung',
    'bundeskanzler', 'nationalrat', 'bundesrat', 'parlament', 'fraktion',
    'ausschuss', 'sitzung', 'präsident', 'vizepräsident', 'minister',
    'staatssekretär', 'klubobmann', 'antrag', 'anfrage', 'interpellation',
    'dringliche', 'aktuelle', 'stunde', 'debatte', 'abstimmung', 'beschluss',
    'gesetz', 'novelle', 'verordnung', 'regierungsvorlage', 'initiativantrag',
    'danke', 'dankeschön', 'geschätzte', 'kolleginnen', 'kollegen', 'hohes'
]

CROATIAN_CUSTOM_STOPWORDS = [
    'a', 'ako', 'ali', 'bi', 'bih', 'bila', 'bili', 'bilo', 'bio', 'bismo', 
    'biste', 'biti', 'bumo', 'da', 'do', 'duž', 'ga', 'hoće', 'hoćemo', 
    'hoćete', 'hoćeš', 'hoću', 'i', 'iako', 'ih', 'ili', 'iz', 'ja', 'je', 
    'jedna', 'jedne', 'jedno', 'jer', 'jesam', 'jesi', 'jesmo', 'jest', 
    'jeste', 'jesu', 'jim', 'joj', 'još', 'ju', 'kada', 'kako', 'kao', 
    'koja', 'koje', 'koji', 'kojima', 'koju', 'kroz', 'li', 'me', 'mene', 
    'meni', 'mi', 'mimo', 'moj', 'moja', 'moje', 'mu', 'na', 'nad', 'nakon', 
    'nam', 'nama', 'nas', 'naš', 'naša', 'naše', 'našeg', 'ne', 'nego', 
    'neka', 'neki', 'nekog', 'neku', 'nema', 'netko', 'neće', 'nećemo', 
    'nećete', 'nećeš', 'neću', 'nešto', 'ni', 'nije', 'nikoga', 'nikoje', 
    'nikoju', 'nisam', 'nisi', 'nismo', 'niste', 'nisu', 'njega', 'njegov', 
    'njegova', 'njegovo', 'njemu', 'njezin', 'njezina', 'njezino', 'njih', 
    'njihov', 'njihova', 'njihovo', 'njim', 'njima', 'njoj', 'nju', 'no', 
    'o', 'od', 'odmah', 'on', 'ona', 'oni', 'ono', 'ova', 'pa', 'pak', 
    'po', 'pod', 'pored', 'prije', 's', 'sa', 'sam', 'samo', 'se', 'sebe', 
    'sebi', 'si', 'smo', 'ste', 'su', 'sve', 'svi', 'svog', 'svoj', 'svoja', 
    'svoje', 'svom', 'ta', 'tada', 'taj', 'tako', 'te', 'tebe', 'tebi', 
    'ti', 'to', 'toj', 'tome', 'tu', 'tvoj', 'tvoja', 'tvoje', 'u', 'uz' 
    'vam', 'vama', 'vas', 'vaš', 'vaša', 'vaše', 'već', 'vi', 'vrlo', 'za', 
    'zar', 'će', 'ćemo', 'ćete', 'ćeš', 'ću', 'što', 'zastupnik', 'zastupnica', 
    'zastupnici', 'hvala', 'sabor', 'hrvatska', 'vlada', 'molim', 'gospodin', 
    'gospođa', 'premijer', 'predsjednik', 'predsjednica', 'ministar', 'ministrica',
    'državni', 'tajnik', 'tajnica', 'odbor', 'sjednica', 'rasprava', 'prijedlog', 
    'zakon', 'odluka', 'glasovanje', 'amandman', 'interpelacija', 'pitanje', 
    'odgovor', 'klupski', 'obnašatelj', 'dužnosti', 'potpredsjednik', 
    'potpredsjednica', 'kolegice', 'kolege', 'dame', 'gospodo', 'poštovani', 'poštovana'
]

# Combine with NLTK stopwords
ALL_ENGLISH_STOPWORDS = list(set(list(ENGLISH_STOP_WORDS) + ENGLISH_CUSTOM_STOPWORDS))
ALL_GERMAN_STOPWORDS = list(set(stopwords.words('german') + GERMAN_CUSTOM_STOPWORDS))
ALL_CROATIAN_STOPWORDS = list(set(CROATIAN_CUSTOM_STOPWORDS))

STOPWORDS_MAP = {
    "english": ALL_ENGLISH_STOPWORDS,
    "german": ALL_GERMAN_STOPWORDS,
    "croatian": ALL_CROATIAN_STOPWORDS
}

print(f"🎯 Configuration loaded:")
print(f"   • Target categories: {len(LABEL_DICT)} policy areas")
print(f"   • Stopwords: English({len(ALL_ENGLISH_STOPWORDS)}), German({len(ALL_GERMAN_STOPWORDS)}), Croatian({len(ALL_CROATIAN_STOPWORDS)})")

🎯 Configuration loaded:
   • Target categories: 23 policy areas
   • Stopwords: English(428), German(276), Croatian(218)


# Core Classes & Functions

## GMM Clustering
Custom Gaussian Mixture Model clustering class that replaces HDBSCAN in BERTopic. GMM provides more control over the number of clusters and works well with high-dimensional embedding spaces.

## Topic Modeling Functions
Core functions for data preparation, BERTopic model creation, and topic modeling execution. These functions handle the complete pipeline from raw text to discovered topics.

## Cluster Optimization Functions
Functions to systematically test different cluster numbers and find optimal configurations using multiple clustering quality metrics.

In [4]:
# === GAUSSIAN MIXTURE MODEL CLUSTERING CLASS ===

class GMMClustering:
    """Custom Gaussian Mixture Model clustering for BERTopic to replace HDBSCAN"""
    
    def __init__(self, n_components=200, covariance_type='tied', random_state=42, **kwargs):
        """
        Initialize GMM clustering.
        
        Args:
            n_components: Number of clusters to create
            covariance_type: 'tied' shares covariance across clusters, 'full' allows different shapes
            random_state: For reproducibility
        """
        self.n_components = n_components
        self.covariance_type = covariance_type
        self.random_state = random_state
        self.kwargs = kwargs
        self.model = None
        self.labels_ = None  # Required by BERTopic
    
    def fit(self, X, y=None):
        """Fit GMM model and store cluster labels."""
        self.model = GaussianMixture(
            n_components=self.n_components,
            covariance_type=self.covariance_type,
            random_state=self.random_state,
            **self.kwargs
        )
        self.model.fit(X)
        self.labels_ = self.model.predict(X)
        return self
    
    def fit_predict(self, X):
        """Fit GMM model and predict cluster labels."""
        self.fit(X)
        return self.labels_

# === CORE TOPIC MODELING FUNCTIONS ===

def prepare_segment_data(df, segment_id_col, text_col, embedding_col):
    """
    Prepare segment-level data for topic modeling by grouping sentences into segments.
    
    Returns:
        documents: List of concatenated text per segment
        embeddings: Array of segment embeddings  
        segment_ids: List of segment identifiers
    """
    grouped_data = df.groupby(segment_id_col).agg({
        text_col: ' '.join,
        embedding_col: 'first'
    }).reset_index()
    
    documents = grouped_data[text_col].tolist()
    embeddings = np.array(grouped_data[embedding_col].tolist())
    segment_ids = grouped_data[segment_id_col].tolist()
    
    return documents, embeddings, segment_ids

def create_bertopic_model(language, n_clusters, min_topic_size=3):
    """
    Create a configured BERTopic model with optimized parameters.
    
    Args:
        language: Language for stopwords ('english', 'german', 'croatian')
        n_clusters: Number of clusters for GMM
        min_topic_size: Minimum size for a topic to be kept
    """
    # CountVectorizer with language-specific stopwords
    vectorizer_model = CountVectorizer(
        stop_words=STOPWORDS_MAP.get(language, ALL_ENGLISH_STOPWORDS),
        ngram_range=(1, 2),  # Include unigrams and bigrams
        min_df=5,            # Ignore terms appearing in fewer than 5 documents
        max_df=0.9,          # Ignore terms appearing in more than 90% of documents
        max_features=20000,  # Maximum vocabulary size
        lowercase=True,
        strip_accents='unicode',
    )
    
    # UMAP for dimensionality reduction
    umap_model = UMAP(
        n_neighbors=15,      # Balance between local and global structure
        n_components=10,     # Dimensions for clustering
        min_dist=0.05,       # Tight clusters
        metric='cosine',     # Good for text embeddings
        random_state=42,
        low_memory=True
    )
    
    # GMM clustering instead of HDBSCAN
    gmm_model = GMMClustering(
        n_components=n_clusters,
        covariance_type='tied',  # Efficient for many clusters
        random_state=42,
        max_iter=300,
        init_params='kmeans',
        reg_covar=1e-5,     # Regularization to prevent singularities
        tol=1e-3
    )
    
    # BERTopic model
    topic_model = BERTopic(
        top_n_words=20,              # Extract top 20 words per topic
        vectorizer_model=vectorizer_model,
        umap_model=umap_model,
        hdbscan_model=gmm_model,     # Using GMM instead of HDBSCAN
        verbose=True,
        calculate_probabilities=False,  # Faster without probabilities
        embedding_model=None,           # Use pre-computed embeddings
        min_topic_size=min_topic_size
    )
    
    return topic_model

def run_topic_modeling(df, dataset_name, language, text_col, segment_id_col, 
                      embedding_col, n_clusters=200, min_topic_size=3):
    """
    Run complete topic modeling pipeline for a dataset.
    
    Returns:
        df_result: Original dataframe with topic assignments
        topic_model: Fitted BERTopic model
        topic_info: DataFrame with topic information
        segment_ids: List of segment IDs used
        topics: List of topic assignments
    """
    print(f"\n🔍 Running topic modeling for {dataset_name} ({language})")
    print(f"   Clusters: {n_clusters}, Min topic size: {min_topic_size}")
    
    # Prepare segment-level data
    documents, embeddings, segment_ids = prepare_segment_data(
        df, segment_id_col, text_col, embedding_col
    )
    print(f"📊 Prepared {len(documents)} segments for modeling")
    
    # Create and fit BERTopic model
    topic_model = create_bertopic_model(language, n_clusters, min_topic_size)
    
    print("🤖 Fitting BERTopic model with GMM clustering...")
    embeddings = embeddings.astype(np.float32)
    topics, _ = topic_model.fit_transform(documents, embeddings)
    
    # Get topic information
    topic_info = topic_model.get_topic_info()
    
    # Create segment-to-topic mapping
    segment_topics = pd.DataFrame({
        segment_id_col: segment_ids,
        f'Segment_Topic_{dataset_name}_{language}': topics
    })
    
    # Merge back to original dataframe
    df_result = df.merge(segment_topics, on=segment_id_col, how='left')
    
    print(f"✅ Discovered {len(set(topics))} topics from {len(segment_ids)} segments")
    return df_result, topic_model, topic_info, segment_ids, topics

# === CLUSTER OPTIMIZATION FUNCTIONS ===

def comprehensive_cluster_optimization(embeddings, n_clusters_range, covariance_type='tied', dataset_name="Dataset"):
    """
    Test different cluster numbers and evaluate clustering quality using multiple metrics.
    
    Args:
        embeddings: UMAP-reduced embeddings for clustering
        n_clusters_range: List of cluster numbers to test
        covariance_type: GMM covariance type
        dataset_name: Name for reporting
        
    Returns:
        results_df: DataFrame with all results
        valid_results: DataFrame with valid results only
    """
    print(f"🔬 Testing {len(n_clusters_range)} cluster configurations for {dataset_name}...")
    print(f"   Range: {min(n_clusters_range)} to {max(n_clusters_range)} clusters")
    
    results = []
    
    for i, n_clusters in enumerate(n_clusters_range):
        print(f"\n🔍 Progress: {i+1}/{len(n_clusters_range)} - Testing {n_clusters} clusters...")
        
        try:
            # Fit GMM
            gmm = GaussianMixture(
                n_components=n_clusters, 
                covariance_type=covariance_type, 
                random_state=42,
                max_iter=300,
                reg_covar=1e-5,
                tol=1e-3
            )
            labels = gmm.fit_predict(embeddings)
            
            # Calculate clustering quality metrics
            silhouette = silhouette_score(embeddings, labels)
            calinski_harabasz = calinski_harabasz_score(embeddings, labels)
            davies_bouldin = davies_bouldin_score(embeddings, labels)
            
            # Additional metrics
            n_unique_clusters = len(set(labels))
            cluster_sizes = pd.Series(labels).value_counts()
            min_cluster_size = cluster_sizes.min()
            max_cluster_size = cluster_sizes.max()
            mean_cluster_size = cluster_sizes.mean()
            std_cluster_size = cluster_sizes.std()
            
            # Model selection criteria
            aic = gmm.aic(embeddings)
            bic = gmm.bic(embeddings)
            
            results.append({
                "dataset": dataset_name,
                "n_clusters": n_clusters,
                "silhouette": silhouette,
                "calinski_harabasz": calinski_harabasz,
                "davies_bouldin": davies_bouldin,
                "n_unique_clusters": n_unique_clusters,
                "min_cluster_size": min_cluster_size,
                "max_cluster_size": max_cluster_size,
                "mean_cluster_size": mean_cluster_size,
                "std_cluster_size": std_cluster_size,
                "aic": aic,
                "bic": bic,
                "log_likelihood": gmm.score(embeddings)
            })
            
            print(f"   ✅ Silhouette: {silhouette:.4f}, C-H: {calinski_harabasz:.1f}, D-B: {davies_bouldin:.4f}")
            
        except Exception as e:
            print(f"   ❌ Failed with {n_clusters} clusters: {str(e)}")
            results.append({
                "dataset": dataset_name,
                "n_clusters": n_clusters,
                **{k: np.nan for k in ["silhouette", "calinski_harabasz", "davies_bouldin", 
                                      "n_unique_clusters", "min_cluster_size", "max_cluster_size",
                                      "mean_cluster_size", "std_cluster_size", "aic", "bic", "log_likelihood"]}
            })
    
    # Convert to DataFrame and remove failed results
    results_df = pd.DataFrame(results)
    valid_results = results_df.dropna()
    
    if len(valid_results) == 0:
        print("❌ No valid results found!")
        return results_df, results_df
    
    # Calculate composite score (normalized metrics)
    valid_results = valid_results.copy()
    
    # Normalize metrics (0-1 scale, higher is better)
    valid_results['silhouette_norm'] = (valid_results['silhouette'] - valid_results['silhouette'].min()) / (valid_results['silhouette'].max() - valid_results['silhouette'].min())
    valid_results['calinski_norm'] = (valid_results['calinski_harabasz'] - valid_results['calinski_harabasz'].min()) / (valid_results['calinski_harabasz'].max() - valid_results['calinski_harabasz'].min())
    valid_results['davies_norm'] = 1 - ((valid_results['davies_bouldin'] - valid_results['davies_bouldin'].min()) / (valid_results['davies_bouldin'].max() - valid_results['davies_bouldin'].min()))  # Invert since lower is better
    valid_results['aic_norm'] = 1 - ((valid_results['aic'] - valid_results['aic'].min()) / (valid_results['aic'].max() - valid_results['aic'].min()))  # Invert since lower is better
    valid_results['bic_norm'] = 1 - ((valid_results['bic'] - valid_results['bic'].min()) / (valid_results['bic'].max() - valid_results['bic'].min()))  # Invert since lower is better
    
    # Composite score with slight bias towards higher cluster numbers (better for topic modeling)
    valid_results['cluster_preference'] = (valid_results['n_clusters'] / valid_results['n_clusters'].max()) * 0.1
    valid_results['composite_score'] = (
        valid_results['silhouette_norm'] * 0.25 +
        valid_results['calinski_norm'] * 0.25 +
        valid_results['davies_norm'] * 0.2 +
        valid_results['aic_norm'] * 0.1 +
        valid_results['bic_norm'] * 0.1 +
        valid_results['cluster_preference'] * 0.1
    )
    
    # Find best results
    best_composite = valid_results.loc[valid_results['composite_score'].idxmax()]
    
    print("\n📈 OPTIMIZATION RESULTS:")
    print("="*50)
    print(f"🥇 Best Overall (Composite Score): {best_composite['n_clusters']} clusters")
    print(f"   Composite Score: {best_composite['composite_score']:.4f}")
    print(f"   Silhouette: {best_composite['silhouette']:.4f}")
    print(f"   Calinski-Harabasz: {best_composite['calinski_harabasz']:.1f}")
    print(f"   Davies-Bouldin: {best_composite['davies_bouldin']:.4f}")
    print(f"   Mean cluster size: {best_composite['mean_cluster_size']:.1f}")
    
    # Top 5 recommendations
    print(f"\n🏅 TOP 5 RECOMMENDATIONS:")
    print("="*30)
    top_5 = valid_results.nlargest(5, 'composite_score')[['n_clusters', 'composite_score', 'silhouette', 'calinski_harabasz', 'davies_bouldin', 'mean_cluster_size']]
    for i, (idx, row) in enumerate(top_5.iterrows(), 1):
        print(f"#{i}: {int(row['n_clusters'])} clusters (Score: {row['composite_score']:.4f}, Avg size: {row['mean_cluster_size']:.1f})")
    
    return results_df, valid_results

def prepare_dataset_embeddings(df, segment_id_col, embedding_col, dataset_name):
    """Prepare segment-level embeddings for cluster optimization."""
    print(f"🔄 Preparing {dataset_name} embeddings for optimization...")
    
    # Group by segments to get segment-level embeddings
    grouped = df.groupby(segment_id_col).agg({
        'Text': ' '.join,
        embedding_col: 'first'
    }).reset_index()
    
    # Get embeddings and apply UMAP transformation
    embeddings = np.array(grouped[embedding_col].tolist())
    
    # Apply same UMAP transformation as used in BERTopic
    umap_model = UMAP(
        n_neighbors=15, 
        n_components=10, 
        min_dist=0.05, 
        metric='cosine', 
        random_state=42, 
        low_memory=True
    )
    umap_embeddings = umap_model.fit_transform(embeddings)
    
    print(f"   📊 {dataset_name}: {umap_embeddings.shape[0]} segments, {umap_embeddings.shape[1]} UMAP dimensions")
    return umap_embeddings

print("🔧 Core topic modeling and optimization functions defined")

🔧 Core topic modeling and optimization functions defined


# Cluster Optimization

## Finding Optimal Cluster Numbers

Before running the full topic modeling pipeline, we need to determine the optimal number of clusters for each dataset. This section systematically tests different cluster numbers using multiple quality metrics.

## Optimization Process
1. **Test Range**: Test cluster numbers from 100 to 300 with step size 20
2. **Quality Metrics**: Evaluate using Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index, AIC, and BIC
3. **Composite Score**: Combine metrics with slight preference for higher cluster numbers (better topic granularity)
4. **Dataset-Specific**: Find optimal numbers for each country/language combination

In [5]:
# === CLUSTER OPTIMIZATION EXECUTION ===

print("🚀 Starting cluster optimization for all datasets...")
print("This will take some time as we test multiple configurations...")

# Define cluster range to test
cluster_range = list(range(100, 301, 20))  # [100, 120, 140, ..., 300]
print(f"🎯 Testing cluster range: {cluster_range}")

# Store optimization results
optimization_results = {}
optimal_clusters = {}

# === GB Dataset Optimization ===
print("\n" + "="*60)
print("🇬🇧 OPTIMIZING BRITISH PARLIAMENT CLUSTERS")
print("="*60)

gb_embeddings = prepare_dataset_embeddings(GB, 'Segment_ID', 'segment_embeddings_english', 'GB_English')
gb_results_df, gb_valid_results = comprehensive_cluster_optimization(
    gb_embeddings, cluster_range, covariance_type='tied', dataset_name="GB_English"
)

# Get optimal cluster number
gb_optimal = gb_valid_results.loc[gb_valid_results['composite_score'].idxmax(), 'n_clusters']
optimal_clusters['GB'] = int(gb_optimal)
optimization_results['GB_English'] = gb_valid_results

print(f"🎯 GB Optimal clusters: {optimal_clusters['GB']}")

# === AT Dataset Optimization (English) ===
print("\n" + "="*60)
print("🇦🇹 OPTIMIZING AUSTRIAN PARLIAMENT CLUSTERS - ENGLISH")
print("="*60)

at_en_embeddings = prepare_dataset_embeddings(AT_combined, 'Segment_ID_english', 'segment_embeddings_english', 'AT_English')
at_en_results_df, at_en_valid_results = comprehensive_cluster_optimization(
    at_en_embeddings, cluster_range, covariance_type='tied', dataset_name="AT_English"
)

# === AT Dataset Optimization (German) ===
print("\n" + "="*60)
print("🇦🇹 OPTIMIZING AUSTRIAN PARLIAMENT CLUSTERS - GERMAN")
print("="*60)

at_de_embeddings = prepare_dataset_embeddings(AT_combined, 'Segment_ID_english', 'segment_embeddings_native_language', 'AT_German')
at_de_results_df, at_de_valid_results = comprehensive_cluster_optimization(
    at_de_embeddings, cluster_range, covariance_type='tied', dataset_name="AT_German"
)

# Find compromise for AT (same cluster number for both languages)
print(f"\n🤝 Finding compromise for AT language pair...")
at_compromise_scores = []

for n_clusters in at_en_valid_results['n_clusters'].unique():
    if n_clusters in at_de_valid_results['n_clusters'].values:
        en_score = at_en_valid_results[at_en_valid_results['n_clusters'] == n_clusters]['composite_score'].iloc[0]
        de_score = at_de_valid_results[at_de_valid_results['n_clusters'] == n_clusters]['composite_score'].iloc[0]
        avg_score = (en_score + de_score) / 2
        at_compromise_scores.append({'n_clusters': n_clusters, 'avg_score': avg_score})

at_compromise_df = pd.DataFrame(at_compromise_scores)
at_optimal = at_compromise_df.loc[at_compromise_df['avg_score'].idxmax(), 'n_clusters']
optimal_clusters['AT'] = int(at_optimal)
optimization_results['AT_English'] = at_en_valid_results
optimization_results['AT_German'] = at_de_valid_results

print(f"🎯 AT Optimal clusters (compromise): {optimal_clusters['AT']}")

# === HR Dataset Optimization (English) ===
print("\n" + "="*60)
print("🇭🇷 OPTIMIZING CROATIAN PARLIAMENT CLUSTERS - ENGLISH")
print("="*60)

hr_en_embeddings = prepare_dataset_embeddings(HR_combined, 'Segment_ID_english', 'segment_embeddings_english', 'HR_English')
hr_en_results_df, hr_en_valid_results = comprehensive_cluster_optimization(
    hr_en_embeddings, cluster_range, covariance_type='tied', dataset_name="HR_English"
)

# === HR Dataset Optimization (Croatian) ===
print("\n" + "="*60)
print("🇭🇷 OPTIMIZING CROATIAN PARLIAMENT CLUSTERS - CROATIAN")
print("="*60)

hr_hr_embeddings = prepare_dataset_embeddings(HR_combined, 'Segment_ID_english', 'segment_embeddings_native_language', 'HR_Croatian')
hr_hr_results_df, hr_hr_valid_results = comprehensive_cluster_optimization(
    hr_hr_embeddings, cluster_range, covariance_type='tied', dataset_name="HR_Croatian"
)

# Find compromise for HR (same cluster number for both languages)
print(f"\n🤝 Finding compromise for HR language pair...")
hr_compromise_scores = []

for n_clusters in hr_en_valid_results['n_clusters'].unique():
    if n_clusters in hr_hr_valid_results['n_clusters'].values:
        en_score = hr_en_valid_results[hr_en_valid_results['n_clusters'] == n_clusters]['composite_score'].iloc[0]
        hr_score = hr_hr_valid_results[hr_hr_valid_results['n_clusters'] == n_clusters]['composite_score'].iloc[0]
        avg_score = (en_score + hr_score) / 2
        hr_compromise_scores.append({'n_clusters': n_clusters, 'avg_score': avg_score})

hr_compromise_df = pd.DataFrame(hr_compromise_scores)
hr_optimal = hr_compromise_df.loc[hr_compromise_df['avg_score'].idxmax(), 'n_clusters']
optimal_clusters['HR'] = int(hr_optimal)
optimization_results['HR_English'] = hr_en_valid_results
optimization_results['HR_Croatian'] = hr_hr_valid_results

print(f"🎯 HR Optimal clusters (compromise): {optimal_clusters['HR']}")

# === FINAL OPTIMIZATION SUMMARY ===
print(f"\n🎉 CLUSTER OPTIMIZATION COMPLETED!")
print("="*50)
print(f"📊 OPTIMAL CLUSTER NUMBERS:")
for country, clusters in optimal_clusters.items():
    print(f"   • {country}: {clusters} clusters")

# Save optimization results
all_optimization_results = pd.concat([
    results.assign(Country=country.split('_')[0], Language=country.split('_')[1]) 
    for country, results in optimization_results.items()
], ignore_index=True)

all_optimization_results.to_csv(r"data folder\cluster_optimization_results.csv", index=False)

# Save final recommendations
recommendations_df = pd.DataFrame([
    {'Country': country, 'Optimal_Clusters': clusters, 'Optimization_Method': 'Composite Score' if country == 'GB' else 'Language Pair Compromise'}
    for country, clusters in optimal_clusters.items()
])
recommendations_df.to_csv(r"data folder\optimal_cluster_recommendations.csv", index=False)

print(f"\n💾 Optimization results saved:")
print(f"   • All results: data folder\\cluster_optimization_results.csv")
print(f"   • Recommendations: data folder\\optimal_cluster_recommendations.csv")

print(f"\n🚀 Ready to run topic modeling with optimized cluster numbers!")

🚀 Starting cluster optimization for all datasets...
This will take some time as we test multiple configurations...
🎯 Testing cluster range: [100, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300]

🇬🇧 OPTIMIZING BRITISH PARLIAMENT CLUSTERS
🔄 Preparing GB_English embeddings for optimization...
   📊 GB_English: 33381 segments, 10 UMAP dimensions
🔬 Testing 11 cluster configurations for GB_English...
   Range: 100 to 300 clusters

🔍 Progress: 1/11 - Testing 100 clusters...
   📊 GB_English: 33381 segments, 10 UMAP dimensions
🔬 Testing 11 cluster configurations for GB_English...
   Range: 100 to 300 clusters

🔍 Progress: 1/11 - Testing 100 clusters...
   ✅ Silhouette: 0.3919, C-H: 15122.6, D-B: 0.9433

🔍 Progress: 2/11 - Testing 120 clusters...
   ✅ Silhouette: 0.3919, C-H: 15122.6, D-B: 0.9433

🔍 Progress: 2/11 - Testing 120 clusters...
   ✅ Silhouette: 0.3978, C-H: 14935.9, D-B: 0.9393

🔍 Progress: 3/11 - Testing 140 clusters...
   ✅ Silhouette: 0.3978, C-H: 14935.9, D-B: 0.9393

🔍 Progress

# OpenAI Topic Classification

In [6]:
# === OPENAI CLASSIFICATION FUNCTIONS ===

def classify_topic_with_openai(topic_words, parliament_context, topic_id=-1):
    """
    Classify a single topic using OpenAI GPT-4 with enhanced Public Lands detection and parliamentary context.
    
    Args:
        topic_words: List of representative words for the topic
        parliament_context: Dict with parliament information (country, language, dataset_name)
        topic_id: Topic identifier for error reporting
    
    Returns:
        category: Classified policy category
    """
    keywords_str = ', '.join(topic_words)
    categories_detailed = '\n'.join([f"• {cat}: {desc}" for cat, desc in LABEL_DICT.items()])
    
    # Enhanced Public Lands keywords detection
    public_lands_keywords = {
        'english': ['park', 'parks', 'national', 'memorial', 'memorials', 'historic', 'heritage', 'site', 'sites', 
                   'protected', 'conservation', 'preserve', 'museum', 'museums', 'forest', 'forests', 'wildlife',
                   'habitat', 'sanctuary', 'monument', 'monuments', 'cultural', 'archaeological', 'landmark',
                   'harbor', 'harbors', 'harbour', 'harbours', 'marina', 'marinas', 'flood', 'grazing',
                   'livestock', 'fire', 'fires', 'forestry', 'wilderness', 'recreational', 'trail', 'trails'],
        'german': ['park', 'parks', 'national', 'denkmal', 'denkmäler', 'historisch', 'erbe', 'standort', 
                  'geschützt', 'naturschutz', 'museum', 'museen', 'wald', 'wälder', 'wildtiere',
                  'habitat', 'denkmal', 'denkmäler', 'kulturell', 'archäologisch', 'wahrzeichen',
                  'hafen', 'häfen', 'marina', 'hochwasser', 'beweidung', 'vieh', 'feuer', 'brände',
                  'forstwirtschaft', 'wildnis', 'erholung', 'wanderweg', 'wanderwege'],
        'croatian': ['park', 'parkovi', 'nacionalni', 'spomenik', 'spomenici', 'povijesni', 'baština', 
                    'lokacija', 'zaštićen', 'očuvanje', 'muzej', 'muzeji', 'šuma', 'šume', 'divlje životinje',
                    'stanište', 'svetište', 'spomenik', 'spomenici', 'kulturni', 'arheološki', 
                    'luka', 'luke', 'marina', 'poplava', 'ispaša', 'stoka', 'požar', 'požari',
                    'šumarstvo', 'divljina', 'rekreacija', 'staza', 'staze']
    }
    
    # Check for Public Lands keywords
    language = parliament_context['language'].lower()
    relevant_keywords = public_lands_keywords.get(language, public_lands_keywords['english'])
    
    # Count Public Lands keyword matches (case-insensitive)
    keyword_matches = sum(1 for word in topic_words if any(keyword.lower() in word.lower() for keyword in relevant_keywords))
    
    # Create parliament context description
    country_info = {
        'GB': 'British Parliament (House of Commons/Lords) - may include UK-specific terms like HS2 (High Speed 2 railway), NHS (National Health Service), etc.',
        'AT': 'Austrian Parliament (Nationalrat/Bundesrat) - may include Austrian-specific terms and German language political terminology',
        'HR': 'Croatian Parliament (Sabor) - may include Croatian-specific terms and post-Yugoslav political context'
    }
    
    parliament_desc = country_info.get(parliament_context['country'], f"{parliament_context['country']} Parliament")
    
    prompt = f"""Analyze these parliamentary debate keywords and classify into ONE policy category.

PARLIAMENT CONTEXT: {parliament_desc}
LANGUAGE: {parliament_context['language']}
KEYWORDS: {keywords_str}

CATEGORIES:
{categories_detailed}

CRITICAL CLASSIFICATION INSTRUCTIONS:
1. Think step-by-step about what policy domain these keywords represent
2. Consider the parliament context - country-specific abbreviations, institutions, and terminology
3. Look for domain-specific terminology and policy-relevant terms
4. ONLY classify into a specific policy category if you are ABSOLUTELY CERTAIN it fits
5. If there is ANY doubt, ambiguity, or the keywords could fit multiple domains, use "Other"
6. If keywords are purely procedural, parliamentary process, or non-policy content, use "Other"
7. Be extremely conservative - it's better to use "Other" than to misclassify
8. Use "Mix" only when keywords clearly and explicitly span multiple policy domains

CONFIDENCE THRESHOLD: Only assign a specific policy category if you are 95%+ certain based on clear, unambiguous policy-specific terminology.

Format:
REASONING: [your detailed step-by-step analysis including parliament context considerations]
CATEGORY: [exact category name - default to "Other" unless absolutely certain]"""

    try:
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are an expert parliamentary policy classifier with knowledge of different parliamentary systems. Be conservative in your classifications - only assign specific policy categories when you are absolutely certain. Consider country-specific context and terminology. Default to 'Other' when in doubt."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.01,  # Very low temperature for consistent classification
            max_tokens=400
        )
        
        response_text = response.choices[0].message.content.strip()
        
        # Parse response
        category = "Other"  # Default to Other for safety
        
        for line in response_text.split('\n'):
            if line.startswith('CATEGORY:'):
                category = line.split(':', 1)[1].strip().replace('"', '').replace("'", "")
                break
        
        # Validate category exists in our label dictionary
        if category not in LABEL_DICT.keys():
            print(f"⚠️ Warning: Invalid category '{category}' for topic {topic_id}. Using 'Other'")
            category = "Other"
        
        return category
        
    except Exception as e:
        print(f"❌ Error classifying topic {topic_id}: {str(e)}")
        return "Other"

def classify_all_topics(topic_model, topic_info, parliament_context):
    """
    Classify all discovered topics using OpenAI with rate limiting and parliament context.
    
    Args:
        topic_model: Fitted BERTopic model
        topic_info: DataFrame with topic information
        parliament_context: Dict with parliament information (country, language, dataset_name)
        
    Returns:
        topic_info: Enhanced topic info with classifications
        topic_categories: Dict mapping topic IDs to categories
    """
    print(f"🤖 Classifying topics with OpenAI GPT-4 for {parliament_context['country']} Parliament ({parliament_context['language']})...")
    
    topic_categories = {}
    
    for idx, row in topic_info.iterrows():
        topic_id = row['Topic']
        # Get top words for this topic
        topic_words = [word for word, _ in topic_model.get_topic(topic_id)]
        
        # Classify with OpenAI including parliament context
        category = classify_topic_with_openai(topic_words, parliament_context, topic_id)
        topic_categories[topic_id] = category
        
        print(f"   Topic {topic_id}: → {category}")
        time.sleep(0.3)  # Rate limiting to avoid API limits
    
    # Add classifications to topic_info DataFrame
    topic_info = topic_info.copy()
    topic_info['Category'] = topic_info['Topic'].map(topic_categories)
    
    # Show category distribution
    cat_dist = pd.Series(list(topic_categories.values())).value_counts()
    print(f"\n📊 Topic Classification Results:")
    for category, count in cat_dist.items():
        print(f"   • {category}: {count} topics")
    
    return topic_info, topic_categories

def apply_classifications_to_data(df, segment_ids, topics, topic_categories, 
                                 dataset_name, language, segment_id_col):
    """
    Apply topic classifications back to the original dataframe.
    
    Args:
        df: Original dataframe
        segment_ids: List of segment IDs used in topic modeling
        topics: List of topic assignments from BERTopic
        topic_categories: Dict mapping topics to policy categories
        dataset_name: Dataset identifier
        language: Language identifier
        segment_id_col: Column name for segment IDs
        
    Returns:
        df_result: DataFrame with topic categories added
    """
    # Map topic IDs to categories
    segment_categories = [topic_categories.get(t, "Other") for t in topics]
    
    # Create segment mapping DataFrame
    segment_topics = pd.DataFrame({
        segment_id_col: segment_ids,
        f'Segment_Category_{dataset_name}_{language}': segment_categories
    })
    
    # Merge classifications back to original data
    df_result = df.merge(segment_topics, on=segment_id_col, how='left')
    
    # Show category distribution
    category_dist = pd.Series(segment_categories).value_counts()
    print(f"\n📈 Final Category Distribution:")
    for category, count in category_dist.items():
        print(f"   • {category}: {count} segments")
    
    return df_result

print("🔮 OpenAI classification functions defined")

🔮 OpenAI classification functions defined


In [7]:
# === EXECUTE TOPIC MODELING FOR ALL DATASETS ===

print("🚀 Starting comprehensive topic modeling pipeline...")
print(f"📊 Using dynamically optimized cluster numbers: {optimal_clusters}")

# Store all results
all_results = {}

# === GB Dataset (English) ===
print("\n" + "="*60)
print("🇬🇧 BRITISH PARLIAMENT")
print("="*60)

gb_data, gb_model, gb_topics, gb_segment_ids, gb_topics_list = run_topic_modeling(
    GB, "GB", "english", "Text", "Segment_ID", "segment_embeddings_english",
    n_clusters=optimal_clusters['GB'], min_topic_size=3
)

# Create parliament context for classification
gb_context = {
    'country': 'GB',
    'language': 'english',
    'dataset_name': 'GB_English'
}

gb_classified_topics, gb_categories = classify_all_topics(gb_model, gb_topics, gb_context)
gb_final = apply_classifications_to_data(
    gb_data, gb_segment_ids, gb_topics_list, gb_categories,
    "GB", "english", "Segment_ID"
)

all_results['GB_English'] = {
    'data': gb_final,
    'model': gb_model,
    'topics': gb_classified_topics,
    'categories': gb_categories
}

# === AT Dataset (English) ===
print("\n" + "="*60)
print("🇦🇹 AUSTRIAN PARLIAMENT - ENGLISH")
print("="*60)

at_en_data, at_en_model, at_en_topics, at_en_segment_ids, at_en_topics_list = run_topic_modeling(
    AT_combined, "AT", "english", "Text", "Segment_ID_english", "segment_embeddings_english",
    n_clusters=optimal_clusters['AT'], min_topic_size=3
)

# Create parliament context for classification
at_en_context = {
    'country': 'AT',
    'language': 'english', 
    'dataset_name': 'AT_English'
}

at_en_classified_topics, at_en_categories = classify_all_topics(at_en_model, at_en_topics, at_en_context)
at_en_final = apply_classifications_to_data(
    at_en_data, at_en_segment_ids, at_en_topics_list, at_en_categories,
    "AT", "english", "Segment_ID_english"
)

all_results['AT_English'] = {
    'data': at_en_final,
    'model': at_en_model,
    'topics': at_en_classified_topics,
    'categories': at_en_categories
}

# === AT Dataset (German) ===
print("\n" + "="*60)
print("🇦🇹 AUSTRIAN PARLIAMENT - GERMAN")
print("="*60)

at_de_data, at_de_model, at_de_topics, at_de_segment_ids, at_de_topics_list = run_topic_modeling(
    at_en_final, "AT", "german", "Text_native_language", "Segment_ID_english", "segment_embeddings_native_language",
    n_clusters=optimal_clusters['AT'], min_topic_size=3
)

# Create parliament context for classification
at_de_context = {
    'country': 'AT',
    'language': 'german',
    'dataset_name': 'AT_German'
}

at_de_classified_topics, at_de_categories = classify_all_topics(at_de_model, at_de_topics, at_de_context)
at_de_final = apply_classifications_to_data(
    at_de_data, at_de_segment_ids, at_de_topics_list, at_de_categories,
    "AT", "german", "Segment_ID_english"
)

all_results['AT_German'] = {
    'data': at_de_final,
    'model': at_de_model,
    'topics': at_de_classified_topics,
    'categories': at_de_categories
}

# === HR Dataset (English) ===
print("\n" + "="*60)
print("🇭🇷 CROATIAN PARLIAMENT - ENGLISH")
print("="*60)

hr_en_data, hr_en_model, hr_en_topics, hr_en_segment_ids, hr_en_topics_list = run_topic_modeling(
    HR_combined, "HR", "english", "Text", "Segment_ID_english", "segment_embeddings_english",
    n_clusters=optimal_clusters['HR'], min_topic_size=3
)

# Create parliament context for classification
hr_en_context = {
    'country': 'HR',
    'language': 'english',
    'dataset_name': 'HR_English'
}

hr_en_classified_topics, hr_en_categories = classify_all_topics(hr_en_model, hr_en_topics, hr_en_context)
hr_en_final = apply_classifications_to_data(
    hr_en_data, hr_en_segment_ids, hr_en_topics_list, hr_en_categories,
    "HR", "english", "Segment_ID_english"
)

all_results['HR_English'] = {
    'data': hr_en_final,
    'model': hr_en_model,
    'topics': hr_en_classified_topics,
    'categories': hr_en_categories
}

# === HR Dataset (Croatian) ===
print("\n" + "="*60)
print("🇭🇷 CROATIAN PARLIAMENT - CROATIAN")
print("="*60)

hr_hr_data, hr_hr_model, hr_hr_topics, hr_hr_segment_ids, hr_hr_topics_list = run_topic_modeling(
    hr_en_final, "HR", "croatian", "Text_native_language", "Segment_ID_english", "segment_embeddings_native_language",
    n_clusters=optimal_clusters['HR'], min_topic_size=3
)

# Create parliament context for classification
hr_hr_context = {
    'country': 'HR',
    'language': 'croatian',
    'dataset_name': 'HR_Croatian'
}

hr_hr_classified_topics, hr_hr_categories = classify_all_topics(hr_hr_model, hr_hr_topics, hr_hr_context)
hr_hr_final = apply_classifications_to_data(
    hr_hr_data, hr_hr_segment_ids, hr_hr_topics_list, hr_hr_categories,
    "HR", "croatian", "Segment_ID_english"
)

all_results['HR_Croatian'] = {
    'data': hr_hr_final,
    'model': hr_hr_model,
    'topics': hr_hr_classified_topics,
    'categories': hr_hr_categories
}

print("\n✅ All topic modeling and classification completed!")
print(f"📊 Processed {len(all_results)} datasets with optimized topic classifications")
print(f"🎯 Used cluster numbers: {optimal_clusters}")

🚀 Starting comprehensive topic modeling pipeline...
📊 Using dynamically optimized cluster numbers: {'GB': 160, 'AT': 160, 'HR': 280}

🇬🇧 BRITISH PARLIAMENT

🔍 Running topic modeling for GB (english)
   Clusters: 160, Min topic size: 3


2025-10-09 22:17:08,559 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


📊 Prepared 33381 segments for modeling
🤖 Fitting BERTopic model with GMM clustering...


2025-10-09 22:17:45,581 - BERTopic - Dimensionality - Completed ✓
2025-10-09 22:17:45,584 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-10-09 22:17:45,584 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-10-09 22:18:04,758 - BERTopic - Cluster - Completed ✓
2025-10-09 22:18:04,758 - BERTopic - Cluster - Completed ✓
2025-10-09 22:18:04,773 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-10-09 22:18:04,773 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-10-09 22:21:58,121 - BERTopic - Representation - Completed ✓
2025-10-09 22:21:58,121 - BERTopic - Representation - Completed ✓


✅ Discovered 160 topics from 33381 segments
🤖 Classifying topics with OpenAI GPT-4 for GB Parliament (english)...
   Topic 0: → Other
   Topic 0: → Other
   Topic 1: → Health
   Topic 1: → Health
   Topic 2: → Foreign Trade
   Topic 2: → Foreign Trade
   Topic 3: → Macroeconomics
   Topic 3: → Macroeconomics
   Topic 4: → Foreign Trade
   Topic 4: → Foreign Trade
   Topic 5: → Health
   Topic 5: → Health
   Topic 6: → Transportation
   Topic 6: → Transportation
   Topic 7: → Health
   Topic 7: → Health
   Topic 8: → Education
   Topic 8: → Education
   Topic 9: → Macroeconomics
   Topic 9: → Macroeconomics
   Topic 10: → Law and Crime
   Topic 10: → Law and Crime
   Topic 11: → Other
   Topic 11: → Other
   Topic 12: → Environment
   Topic 12: → Environment
   Topic 13: → Defense
   Topic 13: → Defense
   Topic 14: → Other
   Topic 14: → Other
   Topic 15: → International Affairs
   Topic 15: → International Affairs
   Topic 16: → Transportation
   Topic 16: → Transportation
   Topic 1

2025-10-09 22:35:53,980 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


📊 Prepared 12529 segments for modeling
🤖 Fitting BERTopic model with GMM clustering...


2025-10-09 22:36:07,094 - BERTopic - Dimensionality - Completed ✓
2025-10-09 22:36:07,096 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-10-09 22:36:07,096 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-10-09 22:36:16,221 - BERTopic - Cluster - Completed ✓
2025-10-09 22:36:16,225 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-10-09 22:36:16,221 - BERTopic - Cluster - Completed ✓
2025-10-09 22:36:16,225 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-10-09 22:38:10,018 - BERTopic - Representation - Completed ✓
2025-10-09 22:38:10,018 - BERTopic - Representation - Completed ✓


✅ Discovered 160 topics from 12529 segments
🤖 Classifying topics with OpenAI GPT-4 for AT Parliament (english)...
   Topic 0: → Education
   Topic 0: → Education
   Topic 1: → Defense
   Topic 1: → Defense
   Topic 2: → Culture
   Topic 2: → Culture
   Topic 3: → Agriculture
   Topic 3: → Agriculture
   Topic 4: → Macroeconomics
   Topic 4: → Macroeconomics
   Topic 5: → Health
   Topic 5: → Health
   Topic 6: → Labor
   Topic 6: → Labor
   Topic 7: → International Affairs
   Topic 7: → International Affairs
   Topic 8: → Education
   Topic 8: → Education
   Topic 9: → Law and Crime
   Topic 9: → Law and Crime
   Topic 10: → Environment
   Topic 10: → Environment
   Topic 11: → Macroeconomics
   Topic 11: → Macroeconomics
   Topic 12: → Domestic Commerce
   Topic 12: → Domestic Commerce
   Topic 13: → Social Welfare
   Topic 13: → Social Welfare
   Topic 14: → Health
   Topic 14: → Health
   Topic 15: → Immigration
   Topic 15: → Immigration
   Topic 16: → Transportation
   Topic 16: →

2025-10-09 22:51:45,151 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


📊 Prepared 12529 segments for modeling
🤖 Fitting BERTopic model with GMM clustering...


2025-10-09 22:51:57,617 - BERTopic - Dimensionality - Completed ✓
2025-10-09 22:51:57,618 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-10-09 22:51:57,618 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-10-09 22:52:02,262 - BERTopic - Cluster - Completed ✓
2025-10-09 22:52:02,270 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-10-09 22:52:02,262 - BERTopic - Cluster - Completed ✓
2025-10-09 22:52:02,270 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-10-09 22:54:42,761 - BERTopic - Representation - Completed ✓
2025-10-09 22:54:42,761 - BERTopic - Representation - Completed ✓


✅ Discovered 160 topics from 12529 segments
🤖 Classifying topics with OpenAI GPT-4 for AT Parliament (german)...
🤖 Classifying topics with OpenAI GPT-4 for AT Parliament (german)...
   Topic 0: → Other
   Topic 0: → Other
   Topic 1: → Education
   Topic 1: → Education
   Topic 2: → Transportation
   Topic 2: → Transportation
   Topic 3: → Social Welfare
   Topic 3: → Social Welfare
   Topic 4: → Agriculture
   Topic 4: → Agriculture
   Topic 5: → Education
   Topic 5: → Education
   Topic 6: → Macroeconomics
   Topic 6: → Macroeconomics
   Topic 7: → Labor
   Topic 7: → Labor
   Topic 8: → Health
   Topic 8: → Health
   Topic 9: → Macroeconomics
   Topic 9: → Macroeconomics
   Topic 10: → Health
   Topic 10: → Health
   Topic 11: → Environment
   Topic 11: → Environment
   Topic 12: → Education
   Topic 12: → Education
   Topic 13: → Immigration
   Topic 13: → Immigration
   Topic 14: → Transportation
   Topic 14: → Transportation
   Topic 15: → Other
   Topic 15: → Other
   Topic 16:

2025-10-09 23:09:33,643 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


📊 Prepared 25115 segments for modeling
🤖 Fitting BERTopic model with GMM clustering...


2025-10-09 23:10:00,536 - BERTopic - Dimensionality - Completed ✓
2025-10-09 23:10:00,539 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-10-09 23:10:00,539 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-10-09 23:10:33,247 - BERTopic - Cluster - Completed ✓
2025-10-09 23:10:33,256 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-10-09 23:10:33,247 - BERTopic - Cluster - Completed ✓
2025-10-09 23:10:33,256 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-10-09 23:13:20,395 - BERTopic - Representation - Completed ✓
2025-10-09 23:13:20,395 - BERTopic - Representation - Completed ✓


✅ Discovered 280 topics from 25115 segments
🤖 Classifying topics with OpenAI GPT-4 for HR Parliament (english)...
   Topic 0: → Other
   Topic 0: → Other
   Topic 1: → Defense
   Topic 1: → Defense
   Topic 2: → Labor
   Topic 2: → Labor
   Topic 3: → Macroeconomics
   Topic 3: → Macroeconomics
   Topic 4: → Agriculture
   Topic 4: → Agriculture
   Topic 5: → Health
   Topic 5: → Health
   Topic 6: → Social Welfare
   Topic 6: → Social Welfare
   Topic 7: → Civil Rights
   Topic 7: → Civil Rights
   Topic 8: → International Affairs
   Topic 8: → International Affairs
   Topic 9: → Other
   Topic 9: → Other
   Topic 10: → Education
   Topic 10: → Education
   Topic 11: → Macroeconomics
   Topic 11: → Macroeconomics
   Topic 12: → Law and Crime
   Topic 12: → Law and Crime
   Topic 13: → Domestic Commerce
   Topic 13: → Domestic Commerce
   Topic 14: → Macroeconomics
   Topic 14: → Macroeconomics
   Topic 15: → Agriculture
   Topic 15: → Agriculture
   Topic 16: → Health
   Topic 16: → H

2025-10-09 23:37:12,756 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


📊 Prepared 25115 segments for modeling
🤖 Fitting BERTopic model with GMM clustering...


2025-10-09 23:37:39,698 - BERTopic - Dimensionality - Completed ✓
2025-10-09 23:37:39,699 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-10-09 23:37:39,699 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-10-09 23:38:01,922 - BERTopic - Cluster - Completed ✓
2025-10-09 23:38:01,929 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-10-09 23:38:01,922 - BERTopic - Cluster - Completed ✓
2025-10-09 23:38:01,929 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-10-09 23:42:13,619 - BERTopic - Representation - Completed ✓
2025-10-09 23:42:13,619 - BERTopic - Representation - Completed ✓


✅ Discovered 280 topics from 25115 segments
🤖 Classifying topics with OpenAI GPT-4 for HR Parliament (croatian)...
   Topic 0: → Labor
   Topic 0: → Labor
   Topic 1: → Health
   Topic 1: → Health
   Topic 2: → Agriculture
   Topic 2: → Agriculture
   Topic 3: → Social Welfare
   Topic 3: → Social Welfare
   Topic 4: → Health
   Topic 4: → Health
   Topic 5: → Macroeconomics
   Topic 5: → Macroeconomics
   Topic 6: → Other
   Topic 6: → Other
   Topic 7: → Domestic Commerce
   Topic 7: → Domestic Commerce
   Topic 8: → Other
   Topic 8: → Other
   Topic 9: → Other
   Topic 9: → Other
   Topic 10: → International Affairs
   Topic 10: → International Affairs
   Topic 11: → Other
   Topic 11: → Other
   Topic 12: → Other
   Topic 12: → Other
   Topic 13: → Macroeconomics
   Topic 13: → Macroeconomics
   Topic 14: → Domestic Commerce
   Topic 14: → Domestic Commerce
   Topic 15: → Other
   Topic 15: → Other
   Topic 16: → Other
   Topic 16: → Other
   Topic 17: → Macroeconomics
   Topic 17

In [8]:
# === SAVE RESULTS AND GENERATE COMPREHENSIVE ANALYSIS ===

print("💾 Saving all results and generating comprehensive analysis...")

# === COMBINE MULTILINGUAL RESULTS FOR AT AND HR ===

# For AT: Combine English and German results
at_combined_final = all_results['AT_English']['data'].copy()

# Add German topic classifications to the English dataframe
german_topic_col = 'Segment_Category_AT_german'
if german_topic_col in all_results['AT_German']['data'].columns:
    # Merge German topic column
    german_topics = all_results['AT_German']['data'][['Segment_ID_english', german_topic_col]]
    at_combined_final = at_combined_final.merge(german_topics, on='Segment_ID_english', how='left')
    print("✅ Combined AT English and German topic classifications")

# For HR: Combine English and Croatian results
hr_combined_final = all_results['HR_English']['data'].copy()

# Add Croatian topic classifications to the English dataframe
croatian_topic_col = 'Segment_Category_HR_croatian'
if croatian_topic_col in all_results['HR_Croatian']['data'].columns:
    # Merge Croatian topic column
    croatian_topics = all_results['HR_Croatian']['data'][['Segment_ID_english', croatian_topic_col]]
    hr_combined_final = hr_combined_final.merge(croatian_topics, on='Segment_ID_english', how='left')
    print("✅ Combined HR English and Croatian topic classifications")

# === SAVE COMBINED RESULTS ===

# Save GB (single language)
gb_path = f"data folder\\GB\\GB_final_with_topics.pkl"
pd.to_pickle(all_results['GB_English']['data'], gb_path)
print(f"✅ Saved GB → {gb_path}")

# Save AT combined (English + German topics)
at_path = f"data folder\\AT\\AT_final_with_topics_combined.pkl"
pd.to_pickle(at_combined_final, at_path)
print(f"✅ Saved AT combined → {at_path}")

# Save HR combined (English + Croatian topics)
hr_path = f"data folder\\HR\\HR_final_with_topics_combined.pkl"
pd.to_pickle(hr_combined_final, hr_path)
print(f"✅ Saved HR combined → {hr_path}")

# === SAVE TOPIC INFORMATION AND MODELS ===

# Save individual topic information and models for each language
for dataset_name, results in all_results.items():
    country = dataset_name.split('_')[0]
    language = dataset_name.split('_')[1].lower()
    
    # Save classified topic information
    topic_path = f"data folder\\{country}\\{country}_topic_info_classified_{language}.pkl"
    pd.to_pickle(results['topics'], topic_path)
    
    # Save topic model for future use
    model_path = f"data folder\\{country}\\{country}_topic_model_final_{language}.pkl"
    pd.to_pickle(results['model'], model_path)
    
    print(f"✅ Saved {dataset_name} topic info and model")

# === COMPREHENSIVE ANALYSIS ===

# Initialize analysis variables
summary_data = []
total_topics = 0
category_distribution = {}

print(f"\n📊 GENERATING COMPREHENSIVE ANALYSIS")
print("="*50)

# Analyze each dataset
for dataset_name, results in all_results.items():
    country = dataset_name.split('_')[0]
    language = dataset_name.split('_')[1]
    
    topics_count = len(results['categories'])
    total_topics += topics_count
    
    # Category distribution for this dataset
    cat_dist = pd.Series(list(results['categories'].values())).value_counts()
    for cat, count in cat_dist.items():
        category_distribution[cat] = category_distribution.get(cat, 0) + count
    
    # Dataset summary
    summary_data.append({
        'Dataset': dataset_name,
        'Country': country,
        'Language': language,
        'Total_Topics': topics_count,
        'Clusters_Used': optimal_clusters[country],
        'Top_Category': cat_dist.index[0] if len(cat_dist) > 0 else 'N/A',
        'Top_Category_Count': cat_dist.iloc[0] if len(cat_dist) > 0 else 0
    })
    
    print(f"\n{dataset_name}:")
    print(f"   • Topics discovered: {topics_count}")
    print(f"   • Top category: {cat_dist.index[0] if len(cat_dist) > 0 else 'N/A'} ({cat_dist.iloc[0] if len(cat_dist) > 0 else 0} topics)")

# Save comprehensive summary
summary_df = pd.DataFrame(summary_data)
summary_df.to_csv(r"data folder\final_topic_modeling_summary.csv", index=False)

# Save category distribution across all datasets
category_dist_df = pd.DataFrame([
    {
        'Category': cat, 
        'Total_Topics': count, 
        'Percentage': round((count/total_topics)*100, 2),
        'Rank': rank
    }
    for rank, (cat, count) in enumerate(sorted(category_distribution.items(), key=lambda x: x[1], reverse=True), 1)
])
category_dist_df.to_csv(r"data folder\category_distribution_all_datasets.csv", index=False)

# === FINAL SUMMARY ===
print(f"\n🎉 TOPIC MODELING PIPELINE COMPLETED!")
print("="*60)
print(f"✅ Processed: {len(all_results)} datasets from 3 countries")
print(f"✅ Discovered: {total_topics} topics across all datasets")
print(f"✅ Classified: All topics into {len(LABEL_DICT)} policy categories")
print(f"✅ Optimized clusters: {optimal_clusters}")

print(f"\n📋 SAVED FILES:")
print(f"   • GB: Single dataframe with English topics")
print(f"   • AT: Combined dataframe with English + German topic columns")
print(f"   • HR: Combined dataframe with English + Croatian topic columns")
print(f"   • Topic models and info: {len(all_results)} files (separate by language)")
print(f"   • Summary statistics: 2 CSV files")
print(f"   • Optimization results: 2 CSV files")

print(f"\n📈 POLICY CATEGORIES (across all datasets):")
for i, (cat, count) in enumerate(sorted(category_distribution.items(), key=lambda x: x[1], reverse=True), 1):
    percentage = (count / total_topics) * 100
    print(f"   {i:2d}. {cat}: {count} topics ({percentage:.1f}%)")

print(f"\n💾 All results saved to data folder/")
print(f"\n🔬 Ready for downstream analysis:")
print(f"   • Policy attention analysis by country and time")
print(f"   • Cross-language topic comparison") 
print(f"   • Parliamentary agenda setting studies")
print(f"   • Multilingual policy classification validation")

💾 Saving all results and generating comprehensive analysis...

✅ Combined AT English and German topic classifications
✅ Combined AT English and German topic classifications
✅ Combined HR English and Croatian topic classifications
✅ Combined HR English and Croatian topic classifications
✅ Saved GB → data folder\GB\GB_final_with_topics.pkl
✅ Saved GB → data folder\GB\GB_final_with_topics.pkl
✅ Saved AT combined → data folder\AT\AT_final_with_topics_combined.pkl
✅ Saved AT combined → data folder\AT\AT_final_with_topics_combined.pkl
✅ Saved HR combined → data folder\HR\HR_final_with_topics_combined.pkl
✅ Saved HR combined → data folder\HR\HR_final_with_topics_combined.pkl
✅ Saved GB_English topic info and model
✅ Saved GB_English topic info and model
✅ Saved AT_English topic info and model
✅ Saved AT_English topic info and model
✅ Saved AT_German topic info and model
✅ Saved AT_German topic info and model
✅ Saved HR_English topic info and model
✅ Saved HR_English topic info and model
✅ Sav

In [9]:
print(f"\n📈 POLICY CATEGORIES (across all datasets):")
for i, (cat, count) in enumerate(sorted(category_distribution.items(), key=lambda x: x[1], reverse=True), 1):
    percentage = (count / total_topics) * 100
    print(f"   {i:2d}. {cat}: {count} topics ({percentage:.1f}%)")


📈 POLICY CATEGORIES (across all datasets):
    1. Other: 286 topics (27.5%)
    2. Macroeconomics: 81 topics (7.8%)
    3. Health: 66 topics (6.3%)
    4. Domestic Commerce: 58 topics (5.6%)
    5. Social Welfare: 52 topics (5.0%)
    6. Law and Crime: 51 topics (4.9%)
    7. Labor: 50 topics (4.8%)
    8. Civil Rights: 42 topics (4.0%)
    9. Agriculture: 42 topics (4.0%)
   10. Education: 39 topics (3.8%)
   11. Transportation: 37 topics (3.6%)
   12. Defense: 36 topics (3.5%)
   13. Environment: 32 topics (3.1%)
   14. International Affairs: 31 topics (3.0%)
   15. Government Operations: 26 topics (2.5%)
   16. Energy: 25 topics (2.4%)
   17. Housing: 23 topics (2.2%)
   18. Immigration: 17 topics (1.6%)
   19. Technology: 13 topics (1.2%)
   20. Mix: 13 topics (1.2%)
   21. Culture: 10 topics (1.0%)
   22. Foreign Trade: 9 topics (0.9%)
   23. Public Lands: 1 topics (0.1%)
