# Parliamentary Speech Topic Modeling

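Applies BERTopic with Gaussian mixture model (GMM) clustering to discover topics, then uses an OpenAI chat model (`gpt-4o-mini` in the code below) to name each topic and assign it to one of 23 categories (21 policy areas plus "Other" and "Mix").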

**Input**: Processed data from data_preprocessing.ipynb  
**Output**: Same dataframes with added topic classification columns  
**Method**: Segment-level topic modeling with GMM + OpenAI classification
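
Because topics are assigned per segment while the output dataframes keep one row per speech, each segment's label has to be copied onto every speech row in that segment. A minimal sketch of that propagation with plain dictionaries follows; the toy rows, the `propagate` helper, and the label values are hypothetical, and the notebook itself does this with pandas further down.

```python
# Hypothetical toy data: each speech row carries the ID of the segment it belongs to.
speech_rows = [
    {"speech_id": 1, "segment_id": "s1"},
    {"speech_id": 2, "segment_id": "s1"},
    {"speech_id": 3, "segment_id": "s2"},
    {"speech_id": 4, "segment_id": "s3"},  # segment that never received a topic
]

# One category per segment, as a segment-level model would produce (values invented).
segment_category = {"s1": "Health", "s2": "Energy"}


def propagate(rows, segment_labels, default="Other"):
    """Copy each segment's label onto every speech row in that segment."""
    return [
        {**row, "cap_category": segment_labels.get(row["segment_id"], default)}
        for row in rows
    ]


labeled = propagate(speech_rows, segment_category)
for row in labeled:
    print(row["speech_id"], row["segment_id"], row["cap_category"])
```

Rows whose segment has no label fall back to "Other", mirroring the default used in the pipeline code.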

## Setup & Configuration

In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.mixture import GaussianMixture
from bertopic import BERTopic
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from openai import OpenAI
from dotenv import load_dotenv
from tqdm import tqdm

load_dotenv()
pd.options.display.max_columns = None

# Policy categories for classification (full CAP descriptions)
POLICY_CATEGORIES = {
    "Education": "Issues related to educational policies, primary and secondary schools, student loans and education finance, the regulation of colleges and universities, school reforms, teachers, vocational training, evening schools, safety in schools, efforts to improve educational standards, and issues related to libraries, dictionaries, teaching material, research in education",
    "Technology": "Issues related to science and technology transfer and international science cooperation, research policy, government space programs and space exploration, telephones and telecommunication regulation, broadcast media (television, radio, newspapers, films), weather forecasting, geological surveys, computer industry, cyber security",
    "Health": "Issues related to health care, health care reforms, health insurance, drug industry, medical facilities, medical workers, disease prevention, treatment, and health promotion, drug and alcohol abuse, mental health, research in medicine, medical liability and unfair medical practices",
    "Environment": "Issues related to environmental policy, drinking water safety, all kinds of pollution (air, noise, soil), waste disposal, recycling, climate change, outdoor environmental hazards (e.g., asbestos), species and forest protection, marine and freshwater environment, hunting, regulation of laboratory or performance animals, land and water resource conservation, research in environmental technology",
    "Housing": "Issues related to housing, urban affairs and community development, housing market, property tax, spatial planning, rural development, location permits, construction inspection, illegal construction, industrial and commercial building issues, national housing policy, housing for low-income individuals, rental housing, housing for the elderly, e.g., nursing homes, housing for the homeless and efforts to reduce homelessness, research related to housing",
    "Labor": "Issues related to labor, employment, employment programs, employee benefits, pensions and retirement accounts, minimum wage, labor law, job training, labor unions, worker safety and protection, youth employment and seasonal workers",
    "Defense": "Issues related to defense policy, military intelligence, espionage, weapons, military personnel, reserve forces, military buildings, military courts, nuclear weapons, civil defense, including firefighters and mountain rescue services, homeland security, military aid or arms sales to other countries, prisoners of war and collateral damage to civilian populations, military nuclear and hazardous waste disposal and military environmental compliance, defense alliances and agreements, direct foreign military operations, claims against military, defense research",
    "Government Operations": "Issues related to general government operations, the work of multiple departments, public employees, postal services, nominations and appointments, national mints, medals, and commemorative coins, management of government property, government procurement and contractors, public scandal and impeachment, claims against the government, the state inspectorate and audit, anti-corruption policies, regulation of political campaigns, political advertising and voter registration, census and statistics collection by government; issues related to local government, capital city and municipalities, including decentralization; issues related to national holidays",
    "Social Welfare": "Issues related to social welfare policy, the Ministry of Social Affairs, social services, poverty assistance for low-income families and for the elderly, parental leave and child care, assistance for people with physical or mental disabilities, including early retirement pension, discounts on public services, volunteer associations (e.g., Red Cross), charities, and youth organizations",
    "Macroeconomics": "Issues related to domestic macroeconomic policy, such as the state and prospect of the national economy, economic policy, inflation, interest rates, monetary policy, cost of living, unemployment rate, national budget, public debt, price control, tax enforcement, industrial revitalization and growth",
    "Domestic Commerce": "Issues related to banking, finance and internal commerce, including stock exchange, investments, consumer finance, mortgages, credit cards, insurance availability and cost, accounting regulation, personal, commercial, and municipal bankruptcies, programs to promote small businesses, copyrights and patents, intellectual property, natural disaster preparedness and relief, consumer safety; regulation and promotion of tourism, sports, gambling, and personal fitness; domestic commerce research",
    "Civil Rights": "Issues related to civil rights and minority rights, discrimination towards races, gender, sexual orientation, handicap, and other minorities, voting rights, freedom of speech, religious freedoms, privacy rights, protection of personal data, abortion rights, anti-government activity groups (e.g., local insurgency groups), religion and the Church",
    "International Affairs": "Issues related to international affairs, foreign policy and relations to other countries, issues related to the Ministry of Foreign Affairs, foreign aid, international agreements (such as Kyoto agreement on the environment, the Schengen agreement), international organizations (including United Nations, UNESCO, International Olympic Committee, International Criminal Court), NGOs, issues related to diplomacy, embassies, citizens abroad; issues related to border control; issues related to international finance, including the World Bank and International Monetary Fund, the financial situation of the EU; issues related to a foreign country that do not impact the home country; issues related to human rights in other countries, international terrorism",
    "Transportation": "Issues related to mass transportation construction and regulation, bus transport, regulation related to motor vehicles, road construction, maintenance and safety, parking facilities, traffic accidents statistics, air travel, rail travel, rail freight, maritime transportation, inland waterways and channels, transportation research and development",
    "Immigration": "Issues related to immigration, refugees, and citizenship, integration issues, regulation of residence permits, asylum applications; criminal offences and diseases caused by immigration",
    "Law and Crime": "Issues related to the control, prevention, and impact of crime; all law enforcement agencies, including border and customs, police, court system, prison system; terrorism, white collar crime, counterfeiting and fraud, cyber-crime, drug trafficking, domestic violence, child welfare, family law, juvenile crime",
    "Agriculture": "Issues related to agriculture policy, fishing, agricultural foreign trade, food marketing, subsidies to farmers, food inspection and safety, animal and crop disease, pest control and pesticide regulation, welfare for animals in farms, pets, veterinary medicine, agricultural research",
    "Foreign Trade": "Issues related to foreign trade, trade negotiations, free trade agreements, import regulation, export promotion and regulation, subsidies, private business investment and corporate development, competitiveness, exchange rates, the strength of national currency in comparison to other currencies, foreign investment and sales of companies abroad",
    "Culture": "Issues related to cultural policies, Ministry of Culture, public spending on culture, cultural employees, issues related to support of theatres and artists; allocation of funds from the national lottery, issues related to cultural heritage",
    "Public Lands": "Issues related to national parks, memorials, historic sites, and protected areas, including the management and staffing of cultural sites; museums; use of public lands and forests, establishment and management of harbors and marinas; issues related to flood control, forest fires, livestock grazing",
    "Energy": "Issues related to energy policy, electricity, regulation of electrical utilities, nuclear energy and disposal of nuclear waste, natural gas and oil, drilling, oil spills, oil and gas prices, heat supply, shortages and gasoline regulation, coal production, alternative and renewable energy, energy conservation and energy efficiency, energy research",
    "Other": "Other topics not mentioning policy agendas, including the procedures of parliamentary meetings, e.g., points of order, voting procedures, meeting logistics; interpersonal speech, e.g., greetings, personal stories, tributes, interjections, arguments between the members; rhetorical speech, e.g., jokes, literary references",
    "Mix": "Use this category when the topic clearly spans multiple policy areas or when there is significant uncertainty about which single category best fits the topic. This is for topics that genuinely combine elements from 2-3 different categories in a meaningful way, making it difficult to assign to just one category with high confidence"
}

# Language-specific stopwords (comprehensive lists)
ENGLISH_STOPWORDS = [
    'mr', 'mrs', 'ms', 'dr', 'madam', 'honorable', 'honourable', 'member', 'members', 'vp', 'sp', 'fp', 'ae', 'po',
    'minister', 'speaker', 'deputy', 'president', 'chairman', 'chair', 'schilling', 'my', 'lords', 'lord', 'bzs', 'prll', 'bz',
    'secretary', 'lord', 'gp', 'lady', 'question', 'order', 'point', 'debate', 'motion', 'amendment', 'backbench', 'week',
    'congratulations', 'congratulate', 'thanks', 'thank', 'say', 'one', 'want', 'know', 'think', 'noble', 'opg',
    'believe', 'see', 'go', 'come', 'give', 'take', 'people', 'federal', 'government', 'austria', 'baroness',
    'austrian', 'committee', 'call', 'said', 'already', 'please', 'request', 'proceed', 'reading', 'prime',
    'course', 'welcome', 'council', 'open', 'written', 'contain', 'items', 'item', 'yes', 'no',
    'following', 'next', 'speech', 'year', 'years', 'state', 'also', 'would', 'like', 'may', 'must',
    'upon', 'indeed', 'session', 'meeting', 'report', 'commission', 'behalf', 'gentleman', 'gentlemen',
    'ladies', 'applause', 'group', 'colleague', 'colleagues', 'issue', 'issues', 'chancellor', 'court',
    'ask', 'answer', 'reply', 'regard', 'regarding', 'regards', 'respect', 'respectfully', 'sign',
    'shall', 'procedure', 'declare', 'hear', 'minutes', 'speaking', 'close', 'abg', 'mag', 'orf', 'wait'
]

GERMAN_STOPWORDS = [
    'der', 'die', 'das', 'und', 'in', 'zu', 'den', 'mit', 'von', 'für', 'bb', 'bz', 'bzs', 'prll',
    'auf', 'ist', 'im', 'sich', 'eine', 'sie', 'dem', 'nicht', 'ein', 'als',
    'auch', 'es', 'an', 'werden', 'aus', 'er', 'hat', 'dass', 'wir', 'ich',
    'haben', 'sind', 'kann', 'sehr', 'meine', 'muss', 'doch', 'wenn', 'sein',
    'dann', 'weil', 'bei', 'nach', 'so', 'oder', 'aber', 'vor', 'über', 'noch',
    'nur', 'wie', 'war', 'waren', 'wird', 'wurde', 'wurden', 'ihr', 'ihre',
    'ihren', 'seiner', 'seine', 'seinem', 'seinen', 'dieser', 'diese', 'dieses',
    'durch', 'ohne', 'gegen', 'unter', 'zwischen', 'während', 'bis', 'seit',
    'danke', 'bitte', 'gern', 'abgeordnete', 'abgeordneten', 'bundesregierung',
    'bundeskanzler', 'nationalrat', 'bundesrat', 'parlament', 'fraktion',
    'ausschuss', 'sitzung', 'präsident', 'vizepräsident', 'minister',
    'staatssekretär', 'klubobmann', 'antrag', 'anfrage', 'interpellation',
    'dringliche', 'aktuelle', 'stunde', 'debatte', 'abstimmung', 'beschluss',
    'gesetz', 'novelle', 'verordnung', 'regierungsvorlage', 'initiativantrag',
    'danke', 'dankeschön', 'geschätzte', 'kolleginnen', 'kollegen', 'hohes'
]

CROATIAN_STOPWORDS = [
    'a', 'ako', 'ali', 'bi', 'bih', 'bila', 'bili', 'bilo', 'bio', 'bismo',
    'biste', 'biti', 'bumo', 'da', 'do', 'duž', 'ga', 'hoće', 'hoćemo',
    'hoćete', 'hoćeš', 'hoću', 'i', 'iako', 'ih', 'ili', 'iz', 'ja', 'je',
    'jedna', 'jedne', 'jedno', 'jer', 'jesam', 'jesi', 'jesmo', 'jest',
    'jeste', 'jesu', 'jim', 'joj', 'još', 'ju', 'kada', 'kako', 'kao',
    'koja', 'koje', 'koji', 'kojima', 'koju', 'kroz', 'li', 'me', 'mene',
    'meni', 'mi', 'mimo', 'moj', 'moja', 'moje', 'mu', 'na', 'nad', 'nakon',
    'nam', 'nama', 'nas', 'naš', 'naša', 'naše', 'našeg', 'ne', 'nego',
    'neka', 'neki', 'nekog', 'neku', 'nema', 'netko', 'neće', 'nećemo',
    'nećete', 'nećeš', 'neću', 'nešto', 'ni', 'nije', 'nikoga', 'nikoje',
    'nikoju', 'nisam', 'nisi', 'nismo', 'niste', 'nisu', 'njega', 'njegov',
    'njegova', 'njegovo', 'njemu', 'njezin', 'njezina', 'njezino', 'njih',
    'njihov', 'njihova', 'njihovo', 'njim', 'njima', 'njoj', 'nju', 'no',
    'o', 'od', 'odmah', 'on', 'ona', 'oni', 'ono', 'ova', 'pa', 'pak',
    'po', 'pod', 'pored', 'prije', 's', 'sa', 'sam', 'samo', 'se', 'sebe',
    'sebi', 'si', 'smo', 'ste', 'su', 'sve', 'svi', 'svog', 'svoj', 'svoja',
    'svoje', 'svom', 'ta', 'tada', 'taj', 'tako', 'te', 'tebe', 'tebi',
    'ti', 'to', 'toj', 'tome', 'tu', 'tvoj', 'tvoja', 'tvoje', 'u', 'uz',
    'vam', 'vama', 'vas', 'vaš', 'vaša', 'vaše', 'već', 'vi', 'vrlo', 'za',
    'zar', 'će', 'ćemo', 'ćete', 'ćeš', 'ću', 'što', 'zastupnik', 'zastupnica',
    'zastupnici', 'hvala', 'sabor', 'hrvatska', 'vlada', 'molim', 'gospodin',
    'gospođa', 'premijer', 'predsjednik', 'predsjednica', 'ministar', 'ministrica',
    'državni', 'tajnik', 'tajnica', 'odbor', 'sjednica', 'rasprava', 'prijedlog',
    'zakon', 'odluka', 'glasovanje', 'amandman', 'interpelacija', 'pitanje',
    'odgovor', 'klupski', 'obnašatelj', 'dužnosti', 'potpredsjednik',
    'potpredsjednica', 'kolegice', 'kolege', 'dame', 'gospodo', 'poštovani', 'poštovana'
]

STOPWORDS = {
    "english": ENGLISH_STOPWORDS,
    "german": GERMAN_STOPWORDS,
    "croatian": CROATIAN_STOPWORDS
}

print("✅ Configuration loaded")
print(f"   Policy categories: {len(POLICY_CATEGORIES)}")
print(f"   Stopwords: EN={len(ENGLISH_STOPWORDS)}, DE={len(GERMAN_STOPWORDS)}, HR={len(CROATIAN_STOPWORDS)}")

✅ Configuration loaded
   Policy categories: 23
   Stopwords: EN=130, DE=116, HR=219
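
The stopword lists above repeat a few entries (for example `'lord'` in the English list and `'danke'` in the German list), so the printed counts slightly overstate the number of distinct stopwords. This should not change which tokens are filtered, since a repeated stopword removes the same token. If distinct counts are wanted, an order-preserving deduplication could look like the sketch below; `dedupe_preserving_order` is a hypothetical helper, not part of the notebook.

```python
def dedupe_preserving_order(words):
    """Drop repeated entries while keeping first-seen order."""
    return list(dict.fromkeys(words))


# Small excerpt in the style of the lists above; 'lord' appears twice.
sample = ['lords', 'lord', 'bzs', 'secretary', 'lord', 'gp']
unique = dedupe_preserving_order(sample)

print(f"entries={len(sample)}, distinct={len(unique)}")
print(unique)
```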


## Load Data

Load the processed dataframes from data_preprocessing.ipynb.
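
The pipeline further down reads specific text, segment-ID, and embedding columns from each dataframe. A quick check that those columns exist before starting a long run can save time. The sketch below works on any list of column names; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, while the column names themselves are the ones passed to the pipeline in this notebook.

```python
# Columns the pipeline reads, grouped by text variant.
REQUIRED_COLUMNS = {
    "english": ["Text_English", "Segment_ID_English", "Segment_Embeddings_English"],
    "native": ["Text_Native", "Segment_ID_Native", "Segment_Embeddings_Native"],
}


def missing_columns(available, variants):
    """Return the required columns for the given variants that are absent."""
    present = set(available)
    return [
        col
        for variant in variants
        for col in REQUIRED_COLUMNS[variant]
        if col not in present
    ]


# Toy column list resembling an English-only dataset.
english_only = ["Text_English", "Segment_ID_English", "Segment_Embeddings_English"]
print(missing_columns(english_only, ["english"]))
print(missing_columns(english_only, ["english", "native"]))
```

With the real data this would be called as, for instance, `missing_columns(AT.columns, ["english", "native"])` right after loading.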

In [2]:
import os

# Path where data_preprocessing.ipynb saves processed data
BASE_DATA_DIR = r"data folder"

# Load processed datasets
AT = pd.read_pickle(os.path.join(BASE_DATA_DIR, "AT/AT_speeches_processed.pkl"))
HR = pd.read_pickle(os.path.join(BASE_DATA_DIR, "HR/HR_speeches_processed.pkl"))
GB = pd.read_pickle(os.path.join(BASE_DATA_DIR, "GB/GB_speeches_processed.pkl"))

print(f"✅ Loaded from: {BASE_DATA_DIR}")
print(f"   AT={AT.shape}, HR={HR.shape}, GB={GB.shape}")

✅ Loaded from: data folder
   AT=(231759, 32), HR=(504338, 32), GB=(670912, 29)


## Topic Modeling Functions

In [3]:
from sklearn.metrics import silhouette_score

class GMMClustering:
    """GMM clustering for BERTopic"""
    def __init__(self, n_components=200, random_state=42):
        self.n_components = n_components
        self.random_state = random_state
        self.labels_ = None
    
    def fit(self, X, y=None):
        model = GaussianMixture(n_components=self.n_components, random_state=self.random_state,
                               covariance_type='tied', max_iter=300)
        model.fit(X)
        self.labels_ = model.predict(X)
        return self
    
    def fit_predict(self, X):
        self.fit(X)
        return self.labels_


def create_topic_model(language, n_clusters=200):
    """Create BERTopic model with GMM clustering"""
    vectorizer = CountVectorizer(
        stop_words=STOPWORDS.get(language, STOPWORDS['english']),
        ngram_range=(1, 2), min_df=5, max_df=0.9, max_features=20000
    )
    
    umap_model = UMAP(n_neighbors=15, n_components=10, min_dist=0.05, 
                     metric='cosine', random_state=42)
    
    gmm_model = GMMClustering(n_components=n_clusters)
    
    return BERTopic(
        vectorizer_model=vectorizer,
        umap_model=umap_model,
        hdbscan_model=gmm_model,
        embedding_model=None,
        top_n_words=15,
        verbose=True
    )


def prepare_segments(df, segment_col, text_col, embedding_col):
    """Group speeches into segments"""
    grouped = df.groupby(segment_col).agg({
        text_col: ' '.join,
        embedding_col: 'first'
    }).reset_index()
    
    return (grouped[text_col].tolist(), 
            np.array(grouped[embedding_col].tolist()),
            grouped[segment_col].tolist())


def classify_with_gpt(topic_words, country, language, max_retries=3):
    """Classify a topic with an OpenAI chat model (gpt-4o-mini); returns the topic name and CAP category"""
    categories_str = '\n'.join([f"• {cat}: {desc}" for cat, desc in POLICY_CATEGORIES.items()])
    
    prompt = f"""Analyze these parliamentary keywords and provide TWO outputs IN ENGLISH:

Country: {country} Parliament
Source Language: {language}
Keywords: {', '.join(topic_words)}

IMPORTANT: Regardless of the source language, provide your response entirely in English.

TASK 1 - Topic Name: Create a short, descriptive name IN ENGLISH (2-4 words) that captures what this topic is about discussed in the parliamentary meeting.
TASK 2 - CAP Classification: Classify into ONE of these policy categories:

{categories_str}

Instructions:
- Always respond in English, even if keywords are in German, Croatian, or other languages
- For Topic Name: Be specific and descriptive (e.g., "Healthcare Reform", "Military Defense Budget")
- For CAP Classification: Choose the most specific policy category
- Use "Other" for procedural/non-policy content
- Use "Mix" only if clearly spanning multiple domains
- Be conservative: default to "Other" if uncertain

Format your response EXACTLY as:
TOPIC: [your English topic name]
CATEGORY: [exact category name from list]"""
    
    for attempt in range(max_retries):
        try:
            client = OpenAI()
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a parliamentary policy classifier. Always respond in English."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.01,
                max_tokens=300
            )
            
            text = response.choices[0].message.content.strip()
            topic_name = "Unknown"
            category = "Other"
            
            for line in text.split('\n'):
                if line.startswith('TOPIC:'):
                    topic_name = line.split(':', 1)[1].strip().replace('"', '').replace("'", "")
                elif line.startswith('CATEGORY:'):
                    cat = line.split(':', 1)[1].strip().replace('"', '').replace("'", "")
                    if cat in POLICY_CATEGORIES:
                        category = cat
            
            return topic_name, category
        except Exception as e:
            if "insufficient_quota" in str(e) or "429" in str(e):
                print(f"⚠️ API quota exceeded for topic with keywords: {', '.join(topic_words[:5])}...")
                return "Unknown", "Other"
            elif attempt < max_retries - 1:
                print(f"⚠️ Retry {attempt + 1}/{max_retries} for error: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                print(f"❌ Error after {max_retries} attempts: {e}")
                return "Unknown", "Other"
    
    return "Unknown", "Other"


def optimize_cluster_size(embeddings, cluster_range=None, sample_size=5000):
    """Find optimal number of clusters using Silhouette Score"""
    if cluster_range is None:
        cluster_range = list(range(150, 251, 5)) 
    
    print(f"\nOptimizing cluster size (testing: {len(cluster_range)} values from {min(cluster_range)} to {max(cluster_range)})...")
    
    # Sample for faster optimization (seeded so repeated runs draw the same sample)
    if len(embeddings) > sample_size:
        rng = np.random.default_rng(42)
        indices = rng.choice(len(embeddings), sample_size, replace=False)
        X_sample = embeddings[indices]
    else:
        X_sample = embeddings
    
    # Reduce dimensionality first
    umap_model = UMAP(n_neighbors=15, n_components=10, min_dist=0.05, 
                     metric='cosine', random_state=42)
    X_reduced = umap_model.fit_transform(X_sample.astype(np.float32))
    
    scores = {}
    for n_clusters in cluster_range:
        print(f"  Testing n={n_clusters}...", end=" ")
        gmm = GaussianMixture(n_components=n_clusters, random_state=42,
                             covariance_type='tied', max_iter=300)
        labels = gmm.fit_predict(X_reduced)
        # Fix the subsample used for scoring so values are comparable across n_clusters
        score = silhouette_score(X_reduced, labels, sample_size=min(2000, len(X_reduced)),
                                 random_state=42)
        scores[n_clusters] = score
        print(f"silhouette={score:.4f}")
    
    optimal_n = max(scores, key=scores.get)
    print(f"\n✅ Optimal clusters: {optimal_n} (score={scores[optimal_n]:.4f})")
    return optimal_n


def run_topic_pipeline(df, country, language, text_col, segment_col, embedding_col, n_clusters=None):
    """Complete topic modeling pipeline with optional optimization"""
    print(f"\n{'='*60}")
    print(f"{country} Parliament - {language}")
    print(f"{'='*60}")
    
    # Prepare data
    documents, embeddings, segment_ids = prepare_segments(df, segment_col, text_col, embedding_col)
    print(f"Processing {len(documents)} segments...")
    
    # Optimize cluster size if not provided
    if n_clusters is None:
        n_clusters = optimize_cluster_size(embeddings)
    else:
        print(f"Using fixed cluster size: {n_clusters}")
    
    # Fit model
    topic_model = create_topic_model(language, n_clusters)
    topics, _ = topic_model.fit_transform(documents, embeddings.astype(np.float32))
    topic_info = topic_model.get_topic_info()
    
    print(f"Discovered {len(set(topics))} topics")
    
    # Classify topics
    print("Classifying with GPT-4...")
    topic_metadata = {}
    for idx, row in tqdm(topic_info.iterrows(), total=len(topic_info)):
        topic_id = row['Topic']
        words = [w for w, _ in topic_model.get_topic(topic_id)]
        topic_name, category = classify_with_gpt(words, country, language)
        topic_metadata[topic_id] = {
            'keywords': ', '.join(words[:15]),  # Top 15 n-grams (unigrams + bigrams)
            'topic_name': topic_name,
            'cap_category': category
        }
        time.sleep(0.3)
    
    # Create segment-to-metadata mapping
    segment_topic_map = {}
    for seg_id, topic_id in zip(segment_ids, topics):
        meta = topic_metadata.get(topic_id, {'keywords': '', 'topic_name': 'Unknown', 'cap_category': 'Other'})
        segment_topic_map[seg_id] = meta
    
    # Map metadata to each row in the dataframe via segment_col
    df_result = df.copy()
    df_result[f'Topic_Keywords_{country}_{language}'] = df_result[segment_col].map(
        lambda x: segment_topic_map.get(x, {}).get('keywords', '')
    )
    df_result[f'Topic_Name_{country}_{language}'] = df_result[segment_col].map(
        lambda x: segment_topic_map.get(x, {}).get('topic_name', 'Unknown')
    )
    df_result[f'CAP_Category_{country}_{language}'] = df_result[segment_col].map(
        lambda x: segment_topic_map.get(x, {}).get('cap_category', 'Other')
    )
    
    # Show distribution
    cat_dist = df_result[f'CAP_Category_{country}_{language}'].value_counts()
    print(f"\nTop CAP categories (by speech rows):")
    for cat, count in cat_dist.head(5).items():
        print(f"  {cat}: {count}")
    
    return df_result, topic_metadata

print("✅ Functions defined")

✅ Functions defined
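
The `TOPIC:` / `CATEGORY:` response format requested in `classify_with_gpt` can be parsed and tested without calling the API. The sketch below is a slightly more tolerant variant of the parsing loop above (it ignores leading whitespace and label case, and strips only surrounding quotes); `parse_classification` is a hypothetical helper, and the sample responses are invented.

```python
def parse_classification(text, valid_categories, default_category="Other"):
    """Extract (topic_name, category) from a 'TOPIC: ... / CATEGORY: ...' reply.

    Unknown or missing categories fall back to default_category.
    """
    topic_name, category = "Unknown", default_category
    for raw_line in text.splitlines():
        key, sep, value = raw_line.strip().partition(":")
        if not sep:
            continue  # line has no 'label: value' structure
        key = key.strip().upper()
        value = value.strip().strip("\"'")
        if key == "TOPIC" and value:
            topic_name = value
        elif key == "CATEGORY" and value in valid_categories:
            category = value
    return topic_name, category


categories = {"Health", "Transportation", "Other", "Mix"}

well_formed = 'TOPIC: "Healthcare Reform"\nCATEGORY: Health'
off_list = "  topic: Rail Freight\ncategory: Railways"

print(parse_classification(well_formed, categories))
print(parse_classification(off_list, categories))
print(parse_classification("no structure here", categories))
```

Keeping the parser separate from the API call makes it easy to check how malformed replies are handled before spending tokens on a full run.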


## Apply Topic Modeling

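Run topic modeling for each dataset. For GB the cluster count is optimized directly on the English embeddings; for AT and HR it is optimized on the English embeddings and the same value is reused for the native-language run.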
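
`optimize_cluster_size` keeps the candidate with the highest silhouette score. The sketch below applies that selection rule to a handful of the scores reported for the GB run and also reports the gap to the runner-up, which gives a rough sense of how decisive the choice is. `pick_optimal` is a hypothetical helper, and only a subset of the tested values is included.

```python
# Subset of the silhouette scores printed for the GB run (n_clusters -> score).
gb_scores = {
    150: 0.3275,
    175: 0.3349,
    195: 0.3368,
    200: 0.3350,
    215: 0.3426,
    250: 0.3251,
}


def pick_optimal(scores):
    """Return the best-scoring cluster count and its margin over the runner-up."""
    best = max(scores, key=scores.get)
    ranked = sorted(scores.values(), reverse=True)
    gap = ranked[0] - ranked[1] if len(ranked) > 1 else float("nan")
    return best, gap


best_n, gap = pick_optimal(gb_scores)
print(f"best n={best_n}, gap to runner-up={gap:.4f}")
```

When the gap is small, the selected value may be sensitive to the random subsample used for scoring, so it can be worth rerunning the search with a different sample before fixing the cluster count.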

In [4]:
# GB (English only) - with optimization
print("\n" + "="*60)
print("STAGE 1: Great Britain (English)")
print("="*60)
GB_final, gb_cats = run_topic_pipeline(
    GB, 'GB', 'english', 'Text_English', 'Segment_ID_English',
    'Segment_Embeddings_English', n_clusters=None  # Auto-optimize
)

# AT (English + German) - optimize on English, reuse for German
print("\n" + "="*60)
print("STAGE 2: Austria (English + German)")
print("="*60)

# First, optimize on English
print("Step 1: Optimize cluster size on English embeddings...")
AT_documents_en, AT_embeddings_en, AT_segment_ids_en = prepare_segments(
    AT, 'Segment_ID_English', 'Text_English', 'Segment_Embeddings_English'
)
optimal_clusters_AT = optimize_cluster_size(AT_embeddings_en)
print(f"✅ Will use {optimal_clusters_AT} clusters for both English and German")

# Run topic modeling with fixed cluster size
AT_temp, at_en_cats = run_topic_pipeline(
    AT, 'AT', 'english', 'Text_English', 'Segment_ID_English',
    'Segment_Embeddings_English', n_clusters=optimal_clusters_AT  # Use optimized value
)

AT_final, at_de_cats = run_topic_pipeline(
    AT_temp, 'AT', 'german', 'Text_Native', 'Segment_ID_Native',
    'Segment_Embeddings_Native', n_clusters=optimal_clusters_AT  # Reuse same value
)

# HR (English + Croatian) - optimize on English, reuse for Croatian
print("\n" + "="*60)
print("STAGE 3: Croatia (English + Croatian)")
print("="*60)

# First, optimize on English
print("Step 1: Optimize cluster size on English embeddings...")
HR_documents_en, HR_embeddings_en, HR_segment_ids_en = prepare_segments(
    HR, 'Segment_ID_English', 'Text_English', 'Segment_Embeddings_English'
)
optimal_clusters_HR = optimize_cluster_size(HR_embeddings_en)
print(f"✅ Will use {optimal_clusters_HR} clusters for both English and Croatian")

# Run topic modeling with fixed cluster size
HR_temp, hr_en_cats = run_topic_pipeline(
    HR, 'HR', 'english', 'Text_English', 'Segment_ID_English',
    'Segment_Embeddings_English', n_clusters=optimal_clusters_HR  # Use optimized value
)

HR_final, hr_hr_cats = run_topic_pipeline(
    HR_temp, 'HR', 'croatian', 'Text_Native', 'Segment_ID_Native',
    'Segment_Embeddings_Native', n_clusters=optimal_clusters_HR  # Reuse same value
)

print("\n" + "="*60)
print("✅ Topic modeling complete for all datasets")
print("="*60)
print(f"   GB (English): Auto-optimized clusters")
print(f"   AT (English + German): {optimal_clusters_AT} clusters (optimized on English)")
print(f"   HR (English + Croatian): {optimal_clusters_HR} clusters (optimized on English)")


STAGE 1: Great Britain (English)

GB Parliament - english


Processing 37605 segments...

Optimizing cluster size (testing: 21 values from 150 to 250)...
  Testing n=150... silhouette=0.3275
  Testing n=155... silhouette=0.3255
  Testing n=160... silhouette=0.3310
  Testing n=165... silhouette=0.3239
  Testing n=170... silhouette=0.3255
  Testing n=175... silhouette=0.3349
  Testing n=180... silhouette=0.3328
  Testing n=185... silhouette=0.3325
  Testing n=190... silhouette=0.3298
  Testing n=195... silhouette=0.3368
  Testing n=200... silhouette=0.3350
  Testing n=205... silhouette=0.3272
  Testing n=210... silhouette=0.3251
  Testing n=215... silhouette=0.3426
  Testing n=220... silhouette=0.3280
  Testing n=225... silhouette=0.3272
  Testing n=230... silhouette=0.3299
  Testing n=235... silhouette=0.3308
  Testing n=240... silhouette=0.3278
  Testing n=245... silhouette=0.3222
  Testing n=250... silhouette=0.3251

✅ Optimal clusters: 215 (score=0.3426)

2025-12-02 18:57:44,101 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


2025-12-02 18:59:10,919 - BERTopic - Dimensionality - Completed ✓
2025-12-02 18:59:10,924 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-02 19:00:26,375 - BERTopic - Cluster - Completed ✓
2025-12-02 19:00:26,411 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-02 19:07:42,624 - BERTopic - Representation - Completed ✓


Discovered 215 topics
Classifying with GPT-4...


100%|██████████| 215/215 [06:56<00:00,  1.94s/it]



Top CAP categories (by speech rows):
  International Affairs: 98108
  Health: 82113
  Law and Crime: 51314
  Other: 41959
  Social Welfare: 38314

STAGE 2: Austria (English + German)
Step 1: Optimize cluster size on English embeddings...

Optimizing cluster size (testing: 21 values from 150 to 250)...
  Testing n=150... silhouette=0.4174
  Testing n=155... silhouette=0.4288
  Testing n=160... silhouette=0.4181
  Testing n=165... silhouette=0.4360
  Testing n=170... silhouette=0.4378
  Testing n=175... silhouette=0.4260
  Testing n=180... silhouette=0.4369
  Testing n=185... silhouette=0.4234
  Testing n=190... silhouette=0.4164
  Testing n=195... silhouette=0.4194
  Testing n=200... silhouette=0.4149
  Testing n=205... silhouette=0.4227
  Testing n=210... silhouette=0.4122
  Testing n=215... silhouette=0.4057
  Testing n=220... silhouette=0.4152
  Testing n=225... silhouette=0.4074
  Testing n=230... silhouette=0.3990
  Testing n=235... silhouette=0.4128
  Testing n=240... silhouette=

2025-12-02 19:22:20,501 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


Processing 25568 segments...
Using fixed cluster size: 170


2025-12-02 19:23:24,424 - BERTopic - Dimensionality - Completed ✓
2025-12-02 19:23:24,428 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-02 19:23:53,828 - BERTopic - Cluster - Completed ✓
2025-12-02 19:23:53,844 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-02 19:27:34,312 - BERTopic - Representation - Completed ✓


Discovered 170 topics
Classifying with GPT-4...


100%|██████████| 170/170 [05:24<00:00,  1.91s/it]



Top CAP categories (by speech rows):
  Macroeconomics: 34423
  Law and Crime: 22789
  Other: 21273
  Education: 16702
  Health: 14472

AT Parliament - german


2025-12-02 19:36:12,760 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


Processing 23752 segments...
Using fixed cluster size: 170


2025-12-02 19:37:02,091 - BERTopic - Dimensionality - Completed ✓
2025-12-02 19:37:02,094 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-02 19:37:42,980 - BERTopic - Cluster - Completed ✓
2025-12-02 19:37:42,986 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-02 19:42:24,579 - BERTopic - Representation - Completed ✓


Discovered 170 topics
Classifying with GPT-4...


100%|██████████| 170/170 [05:09<00:00,  1.82s/it]



Top CAP categories (by speech rows):
  Other: 44147
  Macroeconomics: 27512
  Civil Rights: 22285
  Education: 15550
  Health: 12843

STAGE 3: Croatia (English + Croatian)
Step 1: Optimize cluster size on English embeddings...

Optimizing cluster size (testing: 21 values from 150 to 250)...
  Testing n=150... silhouette=0.3132
  Testing n=155... silhouette=0.3009
  Testing n=160... silhouette=0.3055
  Testing n=165... silhouette=0.3070
  Testing n=170... silhouette=0.3073
  Testing n=175... silhouette=0.3111
  Testing n=180... silhouette=0.2983
  Testing n=185... silhouette=0.3119
  Testing n=190... silhouette=0.2927
  Testing n=195... silhouette=0.3104
  Testing n=200... silhouette=0.3057
  Testing n=205... silhouette=0.3068
  Testing n=210... silhouette=0.2951
  Testing n=215... silhouette=0.3072
  Testing n=220... silhouette=0.3039
  Testing n=225... silhouette=0.2896
  Testing n=230... silhouette=0.2814
  Testing n=235... silhouette=0.2909
  Testing n=240... silhouette=0.2845
  Te

2025-12-02 19:52:19,261 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


Processing 32749 segments...
Using fixed cluster size: 150


2025-12-02 19:53:30,712 - BERTopic - Dimensionality - Completed ✓
2025-12-02 19:53:30,716 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-02 19:54:35,795 - BERTopic - Cluster - Completed ✓
2025-12-02 19:54:35,813 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-02 19:59:52,731 - BERTopic - Representation - Completed ✓


Discovered 150 topics
Classifying with GPT-4...


100%|██████████| 150/150 [04:32<00:00,  1.82s/it]



Top CAP categories (by speech rows):
  Government Operations: 79711
  Macroeconomics: 77083
  Health: 34483
  Law and Crime: 32903
  Other: 32836

HR Parliament - croatian


2025-12-02 20:08:48,093 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


Processing 39889 segments...
Using fixed cluster size: 150


2025-12-02 20:10:12,778 - BERTopic - Dimensionality - Completed ✓
2025-12-02 20:10:12,782 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-02 20:11:43,265 - BERTopic - Cluster - Completed ✓
2025-12-02 20:11:43,278 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-02 20:18:23,494 - BERTopic - Representation - Completed ✓


Discovered 150 topics
Classifying with GPT-4...


100%|██████████| 150/150 [04:30<00:00,  1.80s/it]



Top CAP categories (by speech rows):
  Macroeconomics: 88863
  Government Operations: 80921
  Other: 52987
  Social Welfare: 32571
  Domestic Commerce: 31502

✅ Topic modeling complete for all datasets
   GB (English): Auto-optimized clusters
   AT (English + German): 170 clusters (optimized on English)
   HR (English + Croatian): 150 clusters (optimized on English)
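
The cluster-size search logged above (candidate values scored by silhouette, best value reused for the second language) can be sketched roughly as follows. This is a minimal illustration, not the notebook's own routine: the helper name `pick_cluster_size`, the synthetic data, and the choice to score on the same array that is clustered are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def pick_cluster_size(embeddings, candidates, random_state=42):
    """Hypothetical helper: fit one GMM per candidate size and keep the best silhouette."""
    scores = {}
    for n in candidates:
        gmm = GaussianMixture(n_components=n, random_state=random_state)
        labels = gmm.fit_predict(embeddings)
        # silhouette_score is undefined when only one cluster is populated
        if len(np.unique(labels)) < 2:
            continue
        scores[n] = silhouette_score(embeddings, labels)
    best_n = max(scores, key=scores.get)
    return best_n, scores


# Synthetic stand-in for reduced embeddings (assumption: 300 points, 5 dimensions)
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
best_n, scores = pick_cluster_size(X, candidates=range(2, 8))

for n, s in scores.items():
    print(f"  Testing n={n}... silhouette={s:.4f}")
print("Selected cluster size:", best_n)
```

In the log above the silhouette values for neighbouring sizes differ only in the second decimal place, so the selected size is sensitive to the random seed; fixing `random_state` keeps the choice reproducible.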


## Save Topic Metadata

Save topic metadata as CSV files for easy inspection.

In [5]:
# Save topic metadata as CSV for easy inspection
def save_topic_metadata(topic_dict, filename):
    """Save topic metadata to CSV"""
    rows = []
    for topic_id, meta in topic_dict.items():
        rows.append({
            'Topic_ID': topic_id,
            'Keywords': meta['keywords'],
            'Topic_Name': meta['topic_name'],
            'CAP_Category': meta['cap_category']
        })
    pd.DataFrame(rows).to_csv(filename, index=False)

save_topic_metadata(gb_cats, os.path.join(BASE_DATA_DIR, "GB/GB_topic_metadata.csv"))
save_topic_metadata(at_en_cats, os.path.join(BASE_DATA_DIR, "AT/AT_english_topic_metadata.csv"))
save_topic_metadata(at_de_cats, os.path.join(BASE_DATA_DIR, "AT/AT_german_topic_metadata.csv"))
save_topic_metadata(hr_en_cats, os.path.join(BASE_DATA_DIR, "HR/HR_english_topic_metadata.csv"))
save_topic_metadata(hr_hr_cats, os.path.join(BASE_DATA_DIR, "HR/HR_croatian_topic_metadata.csv"))

print("✅ Topic metadata CSV files saved to:", BASE_DATA_DIR)
print("\n📄 Files created:")
print("  GB: GB_topic_metadata.csv")
print("  AT: AT_english_topic_metadata.csv, AT_german_topic_metadata.csv")
print("  HR: HR_english_topic_metadata.csv, HR_croatian_topic_metadata.csv")

✅ Topic metadata CSV files saved to: data folder

📄 Files created:
  GB: GB_topic_metadata.csv
  AT: AT_english_topic_metadata.csv, AT_german_topic_metadata.csv
  HR: HR_english_topic_metadata.csv, HR_croatian_topic_metadata.csv


## Summary

View topic distribution across all datasets.

In [6]:
# Print per-country topic statistics
print("📊 Dataset Statistics After Topic Modeling")
print("="*60)

print("\n🇬🇧 Great Britain (GB):")
print(f"   Total speeches: {len(GB_final):,}")
print(f"   Unique topics: {GB_final['Topic_Name_GB_english'].nunique()}")
print(f"   CAP categories: {GB_final['CAP_Category_GB_english'].nunique()}")

print("\n🇦🇹 Austria (AT):")
print(f"   Total speeches: {len(AT_final):,}")
print(f"   English - Unique topics: {AT_final['Topic_Name_AT_english'].nunique()}")
print(f"   German - Unique topics: {AT_final['Topic_Name_AT_german'].nunique()}")

print("\n🇭🇷 Croatia (HR):")
print(f"   Total speeches: {len(HR_final):,}")
print(f"   English - Unique topics: {HR_final['Topic_Name_HR_english'].nunique()}")
print(f"   Croatian - Unique topics: {HR_final['Topic_Name_HR_croatian'].nunique()}")

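# Pool CAP categories across all five model runs for the overall distribution.
# AT and HR contribute one entry per language version, so each of their speeches is counted twice.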
all_categories = []
all_categories.extend(GB_final['CAP_Category_GB_english'].dropna().tolist())
all_categories.extend(AT_final['CAP_Category_AT_english'].dropna().tolist())
all_categories.extend(AT_final['CAP_Category_AT_german'].dropna().tolist())
all_categories.extend(HR_final['CAP_Category_HR_english'].dropna().tolist())
all_categories.extend(HR_final['CAP_Category_HR_croatian'].dropna().tolist())

cat_dist = pd.Series(all_categories).value_counts()

print("\n📊 Overall CAP Category Distribution (All Speeches)")
print("="*60)
for i, (cat, count) in enumerate(cat_dist.head(10).items(), 1):
    pct = count / len(all_categories) * 100
    print(f"{i:2d}. {cat}: {count:,} ({pct:.1f}%)")

print(f"\nTotal classified speeches: {len(all_categories):,}")

print("\n✅ Topic modeling pipeline complete!")
print("✅ All dataframes now contain topic keywords, names, and CAP categories for each speech")

📊 Dataset Statistics After Topic Modeling

🇬🇧 Great Britain (GB):
   Total speeches: 670,912
   Unique topics: 207
   CAP categories: 21

🇦🇹 Austria (AT):
   Total speeches: 231,759
   English - Unique topics: 154
   German - Unique topics: 142

🇭🇷 Croatia (HR):
   Total speeches: 504,338
   English - Unique topics: 144
   Croatian - Unique topics: 138

📊 Overall CAP Category Distribution (All Speeches)
 1. Macroeconomics: 261,850 (12.2%)
 2. Government Operations: 211,776 (9.9%)
 3. Other: 193,202 (9.0%)
 4. Health: 164,049 (7.7%)
 5. International Affairs: 147,677 (6.9%)
 6. Law and Crime: 138,300 (6.5%)
 7. Social Welfare: 112,648 (5.3%)
 8. Civil Rights: 105,654 (4.9%)
 9. Labor: 99,555 (4.6%)
10. Education: 98,583 (4.6%)

Total classified speeches: 2,143,106

✅ Topic modeling pipeline complete!
✅ All dataframes now contain topic keywords, names, and CAP categories for each speech
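
The pooled total in the output above is larger than the number of unique speeches because Austrian and Croatian speeches are classified once per language version. A quick check of the arithmetic, using only the figures printed above (the dictionary names are illustrative):

```python
# Speech counts reported in the summary output
speeches = {"GB": 670_912, "AT": 231_759, "HR": 504_338}
# Language versions classified per country (GB has English only)
runs_per_country = {"GB": 1, "AT": 2, "HR": 2}

pooled = sum(speeches[c] * runs_per_country[c] for c in speeches)
unique = sum(speeches.values())
macro_share = 261_850 / pooled * 100

print(f"Pooled classifications: {pooled:,}")       # 2,143,106
print(f"Unique speeches:        {unique:,}")       # 1,407,009
print(f"Macroeconomics share:   {macro_share:.1f}%")  # 12.2%
```

The percentages in the distribution are therefore shares of classifications, not shares of unique speeches.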


## Consensus Topic Determination

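Determine a consensus CAP category for each speech from the per-language predictions. Regular speeches are marked `Mix` when the two languages disagree; chairperson speeches require unanimous agreement and otherwise fall back to `Other`.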

In [7]:
def determine_consensus(row, topic_cols, is_chairperson_col='Speaker_role'):
    """Determine a consensus CAP category from one or more per-language predictions"""
    # Single topic column - no consensus needed
    if len(topic_cols) == 1:
        return row[topic_cols[0]] if topic_cols[0] in row.index else 'Other'

    # Collect predictions, ignoring missing values and the catch-all 'Other'
    topics = [row[col] for col in topic_cols if col in row.index]
    topics = [t for t in topics if pd.notna(t) and t != 'Other']

    # No usable prediction
    if not topics:
        return 'Other'

    # Chairperson requires unanimous agreement: every column present and all equal
    if row.get(is_chairperson_col) == 'Chairperson':
        unanimous = len(topics) == len(topic_cols) and len(set(topics)) == 1
        return topics[0] if unanimous else 'Other'

    # Regular speakers
    topic_counts = pd.Series(topics).value_counts()

    # For 2 columns: both agree or mark as Mix
    if len(topic_cols) == 2:
        return topic_counts.idxmax() if topic_counts.iloc[0] == 2 else 'Mix'

    # Fallback for other cases: at least two predictions must match
    return topic_counts.idxmax() if topic_counts.iloc[0] > 1 else 'Mix'

# Apply consensus for each dataset
HR_final['topic_consensus'] = HR_final.apply(lambda r: determine_consensus(
    r, ['CAP_Category_HR_english', 'CAP_Category_HR_croatian']), axis=1)

AT_final['topic_consensus'] = AT_final.apply(lambda r: determine_consensus(
    r, ['CAP_Category_AT_english', 'CAP_Category_AT_german']), axis=1)

GB_final['topic_consensus'] = GB_final.apply(lambda r: determine_consensus(
    r, ['CAP_Category_GB_english']), axis=1)

print("✅ Consensus topics determined")
print(f"   GB unique consensus topics: {GB_final['topic_consensus'].nunique()}")
print(f"   AT unique consensus topics: {AT_final['topic_consensus'].nunique()}")
print(f"   HR unique consensus topics: {HR_final['topic_consensus'].nunique()}")

✅ Consensus topics determined
   GB unique consensus topics: 21
   AT unique consensus topics: 22
   HR unique consensus topics: 21
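
To make the two-column rule for regular speakers concrete, here is a simplified, standard-library sketch. It deliberately omits the chairperson branch and the pandas row handling, and the function name `two_way_consensus` is hypothetical.

```python
from collections import Counter


def two_way_consensus(cat_a, cat_b):
    """Simplified two-column consensus for a regular (non-chairperson) speaker."""
    # 'Other' and missing predictions do not count as votes
    votes = [c for c in (cat_a, cat_b) if c is not None and c != "Other"]
    if not votes:
        return "Other"
    category, count = Counter(votes).most_common(1)[0]
    # Both language versions must agree; anything else is a mixed result
    return category if count == 2 else "Mix"


examples = [
    ("Health", "Health"),     # agreement
    ("Health", "Education"),  # disagreement
    ("Health", "Other"),      # only one usable vote
    ("Other", "Other"),       # no usable vote
]
results = [two_way_consensus(a, b) for a, b in examples]
for (a, b), r in zip(examples, results):
    print(f"{a} + {b} -> {r}")
```

Note that a speech classified as a substantive category in one language and `Other` in the other also ends up as `Mix`, since a single vote cannot reach the agreement threshold of two.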


## Merge with Human Labels

Load and merge human-labeled test sets for GB and HR.

In [8]:
# Load human labels
hr_labels = pd.read_json(os.path.join(BASE_DATA_DIR, "HR/ParlaCAP-test-hr.jsonl"), lines=True)
gb_labels = pd.read_json(os.path.join(BASE_DATA_DIR, "GB/ParlaCAP-test-en.jsonl"), lines=True)

# Merge with HR
HR_final = HR_final.merge(hr_labels[['id', 'labels']], left_on='ID', right_on='id', how='left')
HR_final.rename(columns={'labels': 'True_label'}, inplace=True)
HR_final.drop(columns=['id'], inplace=True)

# Merge with GB
GB_final = GB_final.merge(gb_labels[['id', 'labels']], left_on='ID', right_on='id', how='left')
GB_final.rename(columns={'labels': 'True_label'}, inplace=True)
GB_final.drop(columns=['id'], inplace=True)

print("✅ Human labels merged")
print(f"   GB speeches with labels: {GB_final['True_label'].notna().sum():,}")
print(f"   HR speeches with labels: {HR_final['True_label'].notna().sum():,}")

✅ Human labels merged
   GB speeches with labels: 876
   HR speeches with labels: 869
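
A left merge silently duplicates speech rows if the label file ever contains the same `id` twice. One possible guard, sketched here on toy data (the IDs and categories are invented for illustration), is pandas' `validate` argument, which raises a `MergeError` when the right-hand keys are not unique:

```python
import pandas as pd

# Toy stand-ins for a *_final dataframe and a human-label file
speeches = pd.DataFrame({
    "ID": ["s1", "s2", "s3"],
    "topic_consensus": ["Health", "Mix", "Education"],
})
labels = pd.DataFrame({"id": ["s1", "s3"], "labels": ["Health", "Labor"]})

merged = (
    speeches
    .merge(labels[["id", "labels"]], left_on="ID", right_on="id",
           how="left", validate="many_to_one")  # right-hand keys must be unique
    .rename(columns={"labels": "True_label"})
    .drop(columns=["id"])
)

labelled = int(merged["True_label"].notna().sum())
print(merged)
print("Rows before/after merge:", len(speeches), len(merged))
print("Rows with a human label:", labelled)
```

Unlabelled speeches keep `NaN` in `True_label`, which is why the counts above are computed with `notna()`.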


## Load and Merge LIWC Data

Add LIWC-22 linguistic features to each dataset.

In [9]:
# Load LIWC results
AT_LIWC = pd.read_csv(os.path.join(BASE_DATA_DIR, "AT/AT_LIWC_results.csv"))
HR_LIWC = pd.read_csv(os.path.join(BASE_DATA_DIR, "HR/HR_LIWC_results.csv"))
GB_LIWC = pd.read_csv(os.path.join(BASE_DATA_DIR, "GB/GB_LIWC_results.csv"))

# Merge with existing data
AT_final = AT_final.merge(AT_LIWC, on='ID', how='inner')
HR_final = HR_final.merge(HR_LIWC, on='ID', how='inner')
GB_final = GB_final.merge(GB_LIWC, on='ID', how='inner')

# Add country identifier
AT_final['Country'] = 'Austria'
HR_final['Country'] = 'Croatia'
GB_final['Country'] = 'Great Britain'

# Process dates
for df in [AT_final, HR_final, GB_final]:
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    df['Year'] = df['Date'].dt.year

# Calculate speaker age
for df in [AT_final, HR_final, GB_final]:
    df['Speaker_birth'] = pd.to_numeric(df['Speaker_birth'], errors='coerce')
    df['Speaker_age'] = df['Year'] - df['Speaker_birth']

print("✅ LIWC data merged")
print(f"   AT: {len(AT_final):,} speeches with LIWC features")
print(f"   HR: {len(HR_final):,} speeches with LIWC features")
print(f"   GB: {len(GB_final):,} speeches with LIWC features")

✅ LIWC data merged
   AT: 231,759 speeches with LIWC features
   HR: 504,338 speeches with LIWC features
   GB: 670,912 speeches with LIWC features
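
Because both the date and the birth year are parsed with `errors='coerce'`, malformed values become missing rather than raising, and the derived age is missing for those rows. A small sketch on invented values shows the effect (the bad entries `"not a date"` and `"-"` are assumptions about what malformed input might look like):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2019-03-14", "not a date", "2021-11-02"],
    "Speaker_birth": ["1965", "1970", "-"],
})

# Unparseable dates become NaT, unparseable birth years become NaN
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
df["Year"] = df["Date"].dt.year
df["Speaker_birth"] = pd.to_numeric(df["Speaker_birth"], errors="coerce")
df["Speaker_age"] = df["Year"] - df["Speaker_birth"]

missing_age = int(df["Speaker_age"].isna().sum())
print(df)
print("Rows without an age:", missing_age)
```

The age is a difference of calendar years, so it can be off by one for speakers whose birthday falls after the speech date.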


## Save Final Datasets

Save complete datasets with all features: topics, consensus, labels, and LIWC.

In [10]:
# Cleanup: Remove unnecessary columns before saving
columns_to_remove = ['Segment']  

for df in [GB_final, AT_final, HR_final]:
    # errors='ignore' skips any listed column that is not present
    df.drop(columns=columns_to_remove, errors='ignore', inplace=True)

# Save final datasets ready for visualization
GB_final.to_pickle(os.path.join(BASE_DATA_DIR, "GB/GB_final.pkl"))
AT_final.to_pickle(os.path.join(BASE_DATA_DIR, "AT/AT_final.pkl"))
HR_final.to_pickle(os.path.join(BASE_DATA_DIR, "HR/HR_final.pkl"))

print("\n Final datasets saved!")
print("\n Final Dataset Summary:")
print(f"   GB: {GB_final.shape[0]:,} speeches × {GB_final.shape[1]} columns")
print(f"   AT: {AT_final.shape[0]:,} speeches × {AT_final.shape[1]} columns")
print(f"   HR: {HR_final.shape[0]:,} speeches × {HR_final.shape[1]} columns")


 Final datasets saved!

 Final Dataset Summary:
   GB: 670,912 speeches × 155 columns
   AT: 231,759 speeches × 160 columns
   HR: 504,338 speeches × 161 columns
