# Slovak Parliamentary Narrative Analysis

This notebook analyzes the effectiveness of embedding models and LLMs on Slovak parliamentary transcripts (2010–2023). The goal is to expand a small set of seed statements into a comprehensive set of statements covering the semantic space of the corpus, with minimal overlap.

## Methodology

1. Load and preprocess the corpus.
2. Compute embeddings and similarity scores.
3. Sample and extract topic-relevant statements.
4. Categorize statements using LLMs.
5. Iterate until semantic coverage is achieved.

*For details on the corpus structure and evaluation, see the README.*

In [7]:
from typing import Annotated, List, Dict, Literal
from operator import add
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from typing_extensions import TypedDict
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from dotenv import load_dotenv
import os
from openai import OpenAI
from docx import Document
import random
import re, unicodedata, json
from collections import defaultdict


In [2]:
load_dotenv(r"..\keys.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

In [4]:
client = OpenAI(api_key=openai_api_key)

In [9]:
model = "gpt-5"

In [10]:
def call_openai(role_message,user_message, model=model,reasoning_effort="minimal"):
    response = client.chat.completions.create(
        model=model,
        messages=[
        {"role": "system", "content": role_message},
        {"role": "user", "content": user_message}],
        reasoning_effort=reasoning_effort,
    )
    return response.choices[0].message.content

In [None]:
df = pd.read_parquet(r"..\data\df_to_app_with_openAI_S_L_voyage_gdoogle_mistral_embeddings_obdobie_8_with_narratives.parquet", engine='fastparquet')

This is valid for openai embeddings based on the distribution of similarity scores from cosine_treshold_final.ipynb

In [28]:
Edge_low, Edge_high = 0.45, 0.60
similarity_threshold = 0.5


In [12]:
def get_embeddings(input_text):
    response = client.embeddings.create(
        input=[input_text],  # Ensure input is a list
        model="text-embedding-3-large"
    )
    embedding = np.array(response.data[0].embedding)
    return embedding


def get_combined_narrative_embedding(narrative_list):
    narrative_list = [str(item) for item in narrative_list]
    combined_narrative = " ".join(narrative_list)
    combined_embedding = get_embeddings(combined_narrative)
    return combined_embedding


def calculate_similarity(df, embedding_col, ref_embedding, similarity_col):
    """Calculate cosine similarity between each row embedding and ref_embedding.

    Assumes embeddings already have identical dimensionality. If a row embedding
    mismatches in length, it is skipped (NaN) instead of alignment.
    """
    ref = np.array(ref_embedding).flatten()
    ref_len = ref.shape[0]
    sims = []
    for emb in df[embedding_col]:
        try:
            v = np.array(emb).flatten()
            if v.shape[0] != ref_len:
                sims.append(np.nan)
                continue
            sims.append(float(cosine_similarity(v.reshape(1, -1), ref.reshape(1, -1))[0][0]))
        except Exception:
            sims.append(np.nan)
    df[similarity_col] = sims
    return df

In [13]:
def _normalize_slug(name: str) -> str:
    name = ''.join(ch for ch in unicodedata.normalize('NFD', name) if unicodedata.category(ch) != 'Mn')
    name = re.sub(r'[^a-z0-9\s-]', '', name.lower()).strip()
    name = re.sub(r'\s+', '-', name)
    return name[:80] or 'uncategorized'

def parse_extracted_block(block: str) -> list[str]:
    if not block:
        return []
    lines=[]
    for raw in block.split('\n'):
        ln = raw.strip()
        if not ln:
            continue
        # remove leading bullets / dashes / numbering
        ln = re.sub(r'^[-•\d\.\)\(]+\s*', '', ln).strip()
        if len(ln) < 5:
            continue
        lines.append(ln)
    dedup=[]
    seen=set()
    for l in lines:
        if l not in seen:
            dedup.append(l); seen.add(l)
    return dedup

def build_category_embeddings(categories_structured: list[dict]):
    cat_embs = {}
    for cat in categories_structured or []:
        slug = _normalize_slug(cat.get('name',''))
        items = cat.get('items', [])
        if not items:
            continue
        cat_embs[slug] = get_combined_narrative_embedding(items)
    return cat_embs

In [14]:
anti_vaccine_queries = [
    "Očkovanie nesmie byt v ziadnom pripade povinne.",
    "Nikto nevie ake bude mat ockovanie nasledky do buducnosti. Vakcina nebola vobec riadne odskusana. Je to neodskusana latka",
    "Očkovanie je hlavne velky biznis. My sa tu ockujeme neznamou latkou a farmaceuticke firmy budu mat obrovske zisky",
    "To ockovanie vobec nefunguje. Ak sa zaockujete stale mozete dostat Covid"
]

pro_russian_queries = [
    "Nerobme z Ruska nášho nepriatela Rusko nie je náš nepriateľ.",
    "Rusko vyprokovalo rozširovanie do NATO. Rusko nikdy nedovolí,  aby Ukrajina bola v NATO",
    "Dodavanim zbrani ten konflit len predlžujme.Ak budeme posielat zbrane tak ten konflikt predlžime a viac ľudí bude umierať",
    "Všetci len strašite Ruskom, že Rusko je zle. Ale čo Irak a Juhoslavia čo bombardovali Američania a USA. Prečo ste boli vtedy ticho? ",
    "Sankciami poškudzujeme len seba. Rusko sankciami vobec netrpi. Bez ruskeho plynu to pomrzneme "
]

anti_gender_narrative = [
    "Gender  a rodová ideologia nás ohrozuje",
    "My tu nechceme mať 72 pohlaví",
    "Pohlavie je sociálny konštrukt podľa gender a rodovej ideologie ",
    "Rodovú a gender ideológia  tvrdí, že nezáleží na vrodenom biologickom pohlaví.",
    "Rodová a gender ideológia  tvrdí, každý má mať možnosť vybrať si, ci je mužom alebo ženou, alebo niečo medzi tým",
    "Presadzovať rodovú a gender spolu s LGBTI na školach by malo byt zakazané"
]
ekonomicky_liberal = ["Dane by mali byť, čo najmenšie.", "Podnikatelia zamestnávajú ľudí a tvoria hodnoty",
                      "Potrebujeme, čo najštihlejší štát, či menej úradnikov a úradov tým lepšie",
                      "Mame pomaly najvyššie odvody v Europskej Unie", "Musíme podporovať domácich malých a stredných podnikateľov", "Podnikateľov zatažujeme stále väčším počtom reguláccií", "Deficit verejných financií musí byť čo najnižší", "Nezadlžujte viac už Slovensko", "Daňovo odvodové zaťaženi je na Slovensku na neznesitelné"
] 
anti_smer = ["SMER je ovladany oligarchami", "SMER je mafia, ktorá okrada Slovensko", "Financne skupiny okradaju Slovennsko", "Mafia je na policii, Mafia je na sudoch. SMER nechal vyrast mafiu", "Korupcia a klientelizmus je najvačši problem na Slovensku, a to hlavne vdaka SMER-SD", 
                                 "Zlodejstva SMERU su najvacsim problemom"]


utencenci_narrative = [
    "Utecenci su hrozba.",
    "Moslimská ucelená komunita nedonesenie nič dobré.",
    "Europa nedokáže zvladnuť toľko utečencov.",
    "Slovensko nedokáze zvládnut tolko utečencov.",
    "Pustit niekoho cez hranice bez registracie je nebezpecne.",
    "My neviem, ci ti ludia nie su teroristi, s utecencami pride krimininalita a znasilnenia.",
    "Kvoty na utecentov su absolutny nezmysel."
]       

solarne_panely = ["solarne panely su buducnost", "fotovoltalika ma velky potencial"]

novinari_negativne= ["novinari píšu len za peniaze", "progresivny novináry nikdy nebudú o konzervatívcoch písať pekne", "Novinári píšu bez akejkoľvek objektivity","Novinari píšu častokrat o niečo o čom nevedia" ]

odbory = ["odborari netvoria ziadne hodnoty, len strajkuju", "Odborari su komunisticky vymysel", "Odbori maju v nasom zakoniku prace prilis velky vplyv"]

minimalna_mzda_za = ["Minimalna mzda je dolezity nastroj na zlepsie zivotnje urovne", "Zvysovanie minimalnej mzdy je dolezite pre zlepsovanie zivotnje urovne", ]
minimalna_mzda_proti = ["Minimalna mzda umelo stanovuje cenu prace", "Minimalna mzda skodi nizkoprijmovym skupinam, lebo odrazda ostatnych abyu zamestnali"]
pomahanie_dochodcom = ["Dochodcovia cely zivot pracovali a teraz by im ako stat mali pomoct", "Dochodcovia si zasluzia aspon nejake socialne istoty"] 

In [15]:
narratives_dict = {
    "vaccine_similarity": anti_vaccine_queries,
    "russian_similarity": pro_russian_queries,
    "gender_similarity": anti_gender_narrative,
    "ekonom_similarity":ekonomicky_liberal,
    "smer_similarity":anti_smer,
    "utecenci_similarity":utencenci_narrative,
    "solarne_panely_similarity":solarne_panely,
    "novinari_similarity":novinari_negativne,
    "odbory_similarity":odbory,
    "minimalna_mzda_za_similarity":minimalna_mzda_za,
    "minimalna_mzda_proti_similarity":minimalna_mzda_proti,
    "dochodcovia_similarity":pomahanie_dochodcom }
    


In [16]:
def compute_multiple_similarities(
    df,
    embedding_col,
    narratives_dict):
    """
    For each entry in 'narratives_dict' (a dict of {similarity_col: text_list}),
    compute a combined embedding and then calculate cosine similarities
    against df[embedding_col].

    Args:
        df (pd.DataFrame): DataFrame containing existing embeddings in 'embedding_col'.
        embedding_col (str): Column name where each row's embedding (list/array) is stored.
        narratives_dict (dict): A dict where key is the new similarity column name,
            and value is a list of texts that should be combined and embedded.
        model (str): Name of the model to use for embedding.

    Returns:
        pd.DataFrame: The updated DataFrame with new similarity columns appended.
    """
    for similarity_col, text_list in narratives_dict.items():
        # 2A) Get combined embedding for the entire list of texts
        ref_emb = get_combined_narrative_embedding(text_list)
        # 2B) Calculate similarities for the entire DataFrame
        df = calculate_similarity(df, embedding_col, ref_emb, similarity_col)
    return df

In [17]:
class NarativeaAnalytics(TypedDict):
    df: pd.DataFrame
    sample_df: pd.DataFrame
    narratives: Annotated[list[str], add]            
    extracted_narratives: list[str]                  
    narratives_categories: list[str]                
    categories_structured: list[dict] | None
    final_analysis: list[dict] | str
    category_stats: pd.DataFrame | None
    topic: str
    stance: str
    similarity_col: str
    embedding_col: str
    similarity_threshold: float
    iteration: int
    max_iterations: int
    min_iterations: int
    candidate_count: int
    min_candidate_rows: int
    last_category_count: int
    new_categories_added: int
    new_category_slugs: list[str]           


In [18]:
def router(state: NarativeaAnalytics) -> str:
    iteration = state.get("iteration", 0)
    max_iter = state.get("max_iterations", 5)
    min_iter = state.get("min_iterations", 2)
    new_cats = state.get("new_categories_added", 0)
    # Stop if reached max
    if iteration >= max_iter:
        print(f"[Router] stop: iteration {iteration} >= max {max_iter}")
        return "final_controller"
    # Stop if after min iterations and no new categories
    if iteration >= min_iter and new_cats == 0:
        print(f"[Router] stop: no new categories at iter {iteration}")
        return "final_controller"
    return "sample_new_speeches"


In [29]:
def sample_new_speeches(state: NarativeaAnalytics):
    df = state['df'].copy()
    embedding_col = state['embedding_col']
    iteration = state.get('iteration', 0)
    similarity_col = state['similarity_col']
    base_threshold = float(state.get('similarity_threshold', 0.5))
    categories_structured = state.get('categories_structured') or []
    max_per_category = 10
    edge_low, edge_high = Edge_low, Edge_high

    if 'used_for_extraction' not in df.columns:
        df['used_for_extraction'] = False

    if iteration == 0 or not categories_structured:
        seed_narratives = state.get('narratives', []) or []
        if not seed_narratives:
            print(f"[Iter {iteration}] No seed narratives available.")
            return {'sample_df': pd.DataFrame(), 'df': df, 'iteration': iteration + 1}
        ref_emb = get_combined_narrative_embedding(seed_narratives)
        df = calculate_similarity(df, embedding_col, ref_emb, similarity_col)
        candidate_df = df[(df[similarity_col] >= base_threshold) & (~df['used_for_extraction'])]
        if candidate_df.empty:
            print(f"[Iter {iteration}] centroid sampling: 0 candidates >= {base_threshold}")
            return {'sample_df': pd.DataFrame(), 'df': df, 'iteration': iteration + 1}
        target = min(15, len(candidate_df))
        group_cols = [c for c in ['obdobie','klub'] if c in candidate_df.columns]
        if group_cols and len(candidate_df) > target:
            work = candidate_df.copy()
            for gc in group_cols:
                work[gc] = work[gc].fillna('__MISSING__')
            sizes = work.groupby(group_cols).size()
            proportions = (sizes / sizes.sum()) * target
            alloc = proportions.astype(int)
            remainder = target - alloc.sum()
            if remainder > 0:
                frac = (proportions - alloc).sort_values(ascending=False)
                for idx in frac.index[:remainder]:
                    alloc.loc[idx] += 1
            parts=[]
            g = work.groupby(group_cols)
            for key, need in alloc.items():
                if need <= 0: 
                    continue
                subset = g.get_group(key)
                parts.append(subset.sample(min(int(need), len(subset)), random_state=42))
            sample_df = pd.concat(parts) if parts else candidate_df.sample(target, random_state=42)
        else:
            sample_df = candidate_df.sample(target, random_state=42)
        df.loc[sample_df.index,'used_for_extraction'] = True
        print(f"[Iter {iteration}] centroid candidates={len(candidate_df)} sampled={len(sample_df)} thr={base_threshold}")
        return {'sample_df': sample_df.reset_index(drop=True),
                'df': df,
                'iteration': iteration + 1}

    # Edge-band sampling restricted ONLY to newly added categories
    new_slugs = state.get('new_category_slugs') or []   # <--- fetch newly added category slugs
    cat_embs = build_category_embeddings(categories_structured)
    if new_slugs:
        cat_embs = {k: v for k, v in cat_embs.items() if k in new_slugs}
        if not cat_embs:
            print(f"[Iter {iteration}] no new categories to sample (new slugs set empty after filter)")
            return {'sample_df': pd.DataFrame(), 'df': df, 'iteration': iteration + 1}
    else:
        # No newly added categories => skip sampling (we only want new ones)
        print(f"[Iter {iteration}] no newly added category slugs -> skipping edge sampling")
        return {'sample_df': pd.DataFrame(), 'df': df, 'iteration': iteration + 1}

    all_samples=[]
    for slug, emb in cat_embs.items():
        sim_col = f"cat_{slug}_sim_tmp"
        df = calculate_similarity(df, embedding_col, emb, sim_col)
        mask = (df[sim_col].between(edge_low, edge_high)) & (~df['used_for_extraction'])
        cand = df[mask]
        if cand.empty:
            continue
        take = min(max_per_category, len(cand))
        picked = cand.sample(take, random_state=42).copy()
        picked['__source_category'] = slug
        all_samples.append(picked)

    if not all_samples:
        print(f"[Iter {iteration}] no edge samples for new categories (band {edge_low}-{edge_high})")
        return {'sample_df': pd.DataFrame(), 'df': df, 'iteration': iteration + 1}

    sample_df = pd.concat(all_samples)
    sample_df = sample_df[~sample_df.index.duplicated(keep='first')]
    df.loc[sample_df.index,'used_for_extraction'] = True
    print(f"[Iter {iteration}] edge sampled total={len(sample_df)} new_categories={len(cat_embs)}")
    return {'sample_df': sample_df.reset_index(drop=True),
            'df': df,
            'iteration': iteration + 1}

In [20]:
def extract_narratives(state: NarativeaAnalytics):
    sample_df = state["sample_df"]
    narratives = state.get("narratives", [])              
    topic = state["topic"]
    stance = state["stance"]
    iteration = state.get("iteration", 0)

    if sample_df is None or sample_df.empty:
        print(f"[Iter {iteration}] extract_narratives: empty sample")
        return {"extracted_narratives": [], "narratives": narratives}

    text = "\n\n".join(sample_df["truncated_prepis"].dropna().astype(str))

    # ORIGINAL, UNCHANGED PROMPT TEXT:
    system_msg = f"Si asistent na detekciu naratívov k téme '{topic}'. Texty sú v slovenskom jazyku."
    user_msg = (f"""Nižšie sú texty. Sú to prepisy vystupenia poslancov Národnej rady Slovenskej republiky. Tvoja úloha 
                je identifikovať a extrahovať a skopirovať relevantné výroky z textu na tému '{topic}' s týmto postojom '{stance}'. 
                Tu sú už existujúce naratívy a výroky, týmto spôsobom očakávam extraciu z textu:\n.
                {narratives}

                Formát odpovede sú len a výlučne len skopirované výroky týkajúce sa {topic} s postojom {stance}.\n
              
             

                Texty:\n
                {text}
                """)

    content = call_openai(
        role_message=system_msg,
        user_message=user_msg,
        model=model,
        reasoning_effort="low"
    )

    # Post-processing (added, prompt intact)
    raw_items = parse_extracted_block(content)
    # keep only NEW
    new_items = [itm for itm in raw_items if itm not in narratives]
    updated = narratives + new_items
    print(f"[Iter {iteration}] extract_narratives: new={len(new_items)} total={len(updated)}")
    return {"extracted_narratives": new_items, "narratives": updated}

In [21]:
def categorize_narratives(state: NarativeaAnalytics):
    import json, re
    try:
        from pydantic import BaseModel, Field, ValidationError
    except ImportError:
        BaseModel = object
        ValidationError = Exception
        def Field(*a, **k): return None

    narratives = state.get("narratives", []) or []
    existing_struct = state.get("categories_structured") or []
    prev_slugs = {_normalize_slug(c.get('name','')) for c in existing_struct}
    topic = state["topic"]; stance = state["stance"]
    iteration = state.get("iteration", 0)

    if not narratives:
        print(f"[Iter {iteration}] categorize_narratives: no narratives")
        return {
            "categories_structured": existing_struct,
            "narratives_categories": state.get("narratives_categories", []),
            "last_category_count": len(prev_slugs),
            "new_categories_added": 0,
            "new_category_slugs": []
        }

    class Category(BaseModel):
        name: str
        label: str
        rationale: str
        items: list[str]
    class CategoryResponse(BaseModel):
        categories: list[Category]

    enumerated = "\n".join(f"{i+1}. {n}" for i,n in enumerate(narratives))
    system_msg = ("Si analytický expert na zhlukovanie politických naratívov v slovenčine.")
    user_msg = f"""TÉMA: {topic}
POSTOJ: {stance}
VÝROKY:
{enumerated}
POŽADOVANÝ JSON:
{{
  "categories": [
    {{
      "name": "kratky-identifikator-bez-diacritiky",
      "label": "Čitateľný názov",
      "rationale": "Jedna veta prečo spolu",
      "items": ["Presný pôvodný výrok 1","Presný pôvodný výrok 2"]
    }}
  ]
}}
PRAVIDLÁ:
1. Každý výrok v presne jednej kategórii.
2. 3-15 kategórií.
3. name lowercase bez diakritiky, hyphen separated.
4. Iba čistý JSON.
ODPOVEĎ:
"""
    raw = call_openai(system_msg, user_msg, model=model, reasoning_effort="low")

    def attempt(txt: str):
        try: return json.loads(txt)
        except:
            m = re.search(r"\{[\s\S]*\}", txt)
            if m:
                try: return json.loads(m.group(0))
                except: return None
            return None

    parsed = attempt(raw)
    new_struct=[]
    if isinstance(parsed, dict) and 'categories' in parsed:
        try:
            validated = CategoryResponse(**parsed)
            new_struct = [c.model_dump() for c in validated.categories]
        except ValidationError as ve:
            print(f"[Iter {iteration}] validation error: {ve.errors()[:1]}")
    else:
        print(f"[Iter {iteration}] parse failed len={len(raw)}")

    merged = {_normalize_slug(c.get('name','')): c for c in existing_struct}
    for cat in new_struct:
        slug = _normalize_slug(cat.get('name',''))
        if slug in merged:
            old_items = merged[slug].get('items', [])
            seen=set(old_items)
            for it in cat.get('items', []):
                if it not in seen:
                    old_items.append(it); seen.add(it)
            merged[slug]['items'] = old_items
            # keep longer rationale
            if len(cat.get('rationale','')) > len(merged[slug].get('rationale','')):
                merged[slug]['rationale'] = cat.get('rationale','')
        else:
            merged[slug] = cat

    merged_struct = list(merged.values())
    new_slugs = [slug for slug in merged if slug not in prev_slugs]   # <--- identify new category slugs
    new_count = len(new_slugs)

    flat=[]
    for c in merged_struct:
        slug=_normalize_slug(c.get('name',''))
        for it in c.get('items', []):
            flat.append(f"{slug} | {it}")
    # dedup
    dedup=[]; seen=set()
    for line in flat:
        if line not in seen:
            seen.add(line); dedup.append(line)

    print(f"[Iter {iteration}] categorize_narratives: prev={len(prev_slugs)} now={len(merged)} new={new_count} narratives={len(narratives)}")
    if new_slugs:
        print(f"[Iter {iteration}] new category slugs: {new_slugs}")
    return {
        "categories_structured": merged_struct,
        "narratives_categories": dedup,
        "last_category_count": len(merged),
        "new_categories_added": new_count,
        "new_category_slugs": new_slugs
    }

In [22]:
def final_controller(state: NarativeaAnalytics):
    categories = state.get("categories_structured", [])
    topic = state["topic"]
    stance = state["stance"]
    iteration = state.get("iteration", 0)

    if not categories:
        print(f"[Final Controller] No categories to process")
        return {"final_analysis": "No categories found for final analysis."}

    system_msg = ("Si analytický expert na sumarizáciu politických naratívov v slovenčine")
    user_msg = f""" K dispozícii máš text s výrokmi poslancov NRSR na tému '{topic}' s týmto postojom '{stance}'. 
Tieto výroky pochádzajú z prepisov vystupení poslancov v Národnej rady Slovenskej republiky. Výroky týkajuce sa {topic} a {stance} boli hladané v prepisoch a skopirované to materiálu. 
Neskôr boli jednotlivé výroky kategorizované.

Inštrukcie:
1. Celý materál si prečitaš
2. Výroky sú doslovné citácie, a niekedy sa dostali do textu aj prepisom, ktorý už nesúvisí s témou '{topic}' s týmto postojom '{stance}'. 
   Teda ten, kto to prepisoval spravil chybu a namiesto len výroku prepisal celé vystúpenie, alebo jeho časť. Tvoja úloha je identifikovať a ponechať len výrok.
3. Ak je kategórií viac ako 10, tak vytvoríš nové podkategórie. Semanticky zgrupuješ podobné kategórie do jedného celku aj s ich výrokmi
4. Štruktúra odpovede je taká istá ako na vstupe. Len je tam menej kategórií nakoľko si ich zgrupoval aj s výrokmi. Výroky ponecháš v pôvodnom znení, ktoré si krátil o nesúvisiaci text.
5. Formát JSON s polami: {{"categories": [{{"name": "slug", "label": "názov", "rationale": "zdôvodnenie", "items": ["výrok1", "výrok2"]}}]}}

VSTUPNÉ KATEGÓRIE:
{categories}
"""

    content = call_openai(system_msg, user_msg, model="gpt-5", reasoning_effort="high")

    # Parse the LLM response to extract structured categories
    import json, re
    try:
        from pydantic import BaseModel, Field, ValidationError
    except ImportError:
        BaseModel = object
        ValidationError = Exception

    class Category(BaseModel):
        name: str
        label: str  
        rationale: str
        items: list[str]
    class CategoryResponse(BaseModel):
        categories: list[Category]

    def attempt_parse(txt: str):
        try: 
            return json.loads(txt)
        except:
            m = re.search(r"\{[\s\S]*\}", txt)
            if m:
                try: 
                    return json.loads(m.group(0))
                except: 
                    return None
            return None

    parsed = attempt_parse(content)
    
    if isinstance(parsed, dict) and 'categories' in parsed:
        try:
            validated = CategoryResponse(**parsed)
            final_struct = [c.model_dump() for c in validated.categories]
            print(f"[Final Controller] Successfully processed {len(categories)} -> {len(final_struct)} final categories")
            
            # Return ONLY the final analysis as structured data
            return {"final_analysis": final_struct}
            
        except ValidationError as ve:
            print(f"[Final Controller] Validation error: {ve.errors()[:1]}")
            # Fallback: return original structure  
            return {"final_analysis": categories}
    else:
        print(f"[Final Controller] Parse failed, returning original structure")
        return {"final_analysis": categories}


In [23]:

def save_to_docx(narratives, filename):
    doc = Document()
    doc.add_heading('Rozsirene_narrativy', level=1)
    
    for narrative in narratives:
        if narrative is not None:
            doc.add_paragraph(str(narrative))
    
    doc.save(filename)
    print(f"Document saved as {filename}")


In [30]:

# Rebuild graph without analyze_categories
graph = StateGraph(NarativeaAnalytics)
graph.add_node("sample_new_speeches", sample_new_speeches)
graph.add_node("extract_narratives", extract_narratives)
graph.add_node("categorize_narratives", categorize_narratives)
graph.add_node("final_controller", final_controller)

graph.add_edge(START, "sample_new_speeches")
graph.add_edge("sample_new_speeches", "extract_narratives")
graph.add_edge("extract_narratives", "categorize_narratives")

graph.add_conditional_edges(
    "categorize_narratives",
    router,
    {
        "sample_new_speeches": "sample_new_speeches",
        "final_controller": "final_controller"
    }
) 


graph.add_edge("final_controller", END)
graph = graph.compile()

In [31]:
def agent_pipeline(df,
                   seed_narratives,
                   topic,
                   stance,
                   similarity_col,
                   similarity_threshold: float = 0.50,
                   min_iterations: int = 2,
                   max_iterations: int = 3,
                   embedding_col: str | None = None,
                   doc_export: bool = True):
    if embedding_col is None:
        for cand in ["openAI_embedding_small","openAI_embedding_3076","voyage-3-large_embeddings","mistral_embedings"]:
            if cand in df.columns:
                embedding_col = cand; break
    if embedding_col is None:
        raise ValueError("Embedding column not found.")

    if similarity_col not in df.columns:
        df[similarity_col] = 0.0

    init_state: NarativeaAnalytics = NarativeaAnalytics(
        df=df,
        sample_df=pd.DataFrame(),
        narratives=seed_narratives,
        extracted_narratives=[],
        narratives_categories=[],
        categories_structured=[],
        final_analysis="",
        category_stats=None,
        topic=topic,
        stance=stance,
        similarity_col=similarity_col,
        embedding_col=embedding_col,
        similarity_threshold=similarity_threshold,
        iteration=0,
        max_iterations=max_iterations,
        min_iterations=min_iterations,
        candidate_count=0,
        min_candidate_rows=0,
        last_category_count=0,
        new_categories_added=0
    )

    state = graph.invoke(init_state)

    if doc_export:
        save_to_docx(state.get("narratives", []), rf"..\data\politico_agent_texts\{topic}_narratives_{model}.docx")
        save_to_docx(state.get("narratives_categories", []), rf"..\data\politico_agent_texts\{topic}_narratives_categories_{model}.docx")
        if state.get("category_stats") is not None and not state["category_stats"].empty:
            state["category_stats"].to_excel(rf"..\data\politico_agent_texts\category_stats_{topic}.xlsx", index=False)

    print(f"Finished iterations={state.get('iteration')} categories={state.get('last_category_count')} narratives={len(state.get('narratives',[]))}")
    return state

In [32]:
state_russian = agent_pipeline(df, pro_russian_queries, "Ruska zahranicna politika, vztahy s Ruskom, Ruske ekonomicke zaujmu", "pro-Rusky, podporujuci rusko", similarity_col ="russian_similarity", embedding_col="openAI_embedding_3076" )

[Iter 0] centroid candidates=177 sampled=15 thr=0.5
[Iter 1] extract_narratives: new=14 total=19
[Iter 1] categorize_narratives: prev=0 now=8 new=8 narratives=24
[Iter 1] new category slugs: ['rusko-neohrozuje', 'nato-rozsirovanie-ukrajina', 'proti-dodavkam-zbrani-a-eskalacii', 'whataboutizmus-usa-nato', 'sankcie-skodia-slovensku', 'rusko-neporazitelne-jadrova-mocnost', 'zbrojne-rozpocty-porovnanie', 'dobre-vztahy-s-ruskom']
[Iter 1] edge sampled total=73 new_categories=8
[Iter 2] extract_narratives: new=19 total=43
[Iter 2] categorize_narratives: prev=8 now=20 new=12 narratives=67
[Iter 2] new category slugs: ['rusko-nas-nepriatel-nie', 'nato-ukrajina-rozsirovanie', 'proti-dodavkam-zbrani-za-mier', 'whataboutizmus-usa', 'sankcie-a-energie-skodia-nam', 'rusko-neporazitelne-a-uzemia-udrzi', 'rusko-neni-hrozba', 'slovensko-samoviina-za-zhorsenie-vztahov', 'kritika-nato-a-zapadu', 'krym-je-opravneny-pripad', 'vdaka-sssr-a-rusom-za-oslobodenie', 'obhajoba-krokov-sssr']
[Iter 2] edge sample

In [33]:
doc = state_russian["final_analysis"]

In [34]:
save_to_docx(state_russian["final_analysis"], rf"..\data\politico_agent_texts\final_analysis_{model}_russian.docx")


Document saved as ..\data\politico_agent_texts\final_analysis_gpt-5_russian.docx


.....................................................................................................................................................................................