Szukam chƒôtnych do obliczenia metodƒÖ Kontka liczby outlier√≥w w innych wyborach (np. 2023, 2019, 2015, 2010, 2007) oraz do obliczenia odchyle≈Ñ przestrzennych dla 2025 i poprzednich wybor√≥w w sensowniejszy spos√≥b. My≈õla≈Çem o czym≈õ takim: trzeba (1) zrobiƒá geokodowanie obwod√≥w (trochƒô mamy zrobione, ale tylko trochƒô) i (2) policzyƒá dla ka≈ºdego obwodu, czy jest on outlierem od klastra $k$ okolicznych obwod√≥w, gdzie $k$ jest do ustalenia (mo≈ºe 20?).

1) load and clean data (2015, 2020, 2025)
2) cluster by Kontek, but should easily be replaced
3) use Kontek methods
4) obliczyƒá odchylenia przestrzenne

In [15]:
import pandas as pd
import re
import os
from collections import defaultdict
import numpy as np

In [16]:
POLAND_RAW_DATA = os.path.join(os.getcwd(), "data", "raw", "poland")
POLAND_RAW_DATA

'/Users/gignac/Desktop/Projects/electoral-anomalies/data/raw/poland'

### PRESIDENTIAL

In [17]:
BASE_COLUMNS_MAP = {
    # 2025
    "Nr komisji": "polling_station_id",
    "Teryt Gminy": "teryt_gmina",
    # "Teryt Powiatu": "teryt_powiat", 
    "Liczba wyborc√≥w uprawnionych do g≈Çosowania (umieszczonych w spisie, z uwzglƒôdnieniem dodatkowych formularzy) w chwili zako≈Ñczenia g≈Çosowania": "eligible_voters",
    "Liczba kart wyjƒôtych z urny": "ballots_cast",
    # round 2
    "Liczba g≈Ços√≥w wa≈ºnych oddanych ≈ÇƒÖcznie na wszystkich kandydat√≥w (z kart wa≈ºnych)": "valid_votes",
    # round 1
    "Liczba g≈Ços√≥w wa≈ºnych oddanych ≈ÇƒÖcznie na obu kandydat√≥w (z kart wa≈ºnych)": "valid_votes",
    # 2020
    "Numer obwodu": "polling_station_id",
    "Kod TERYT": "teryt_gmina",
    "Liczba wyborc√≥w uprawnionych do g≈Çosowania": "eligible_voters",
    "Liczba kart wyjƒôtych z urny": "ballots_cast",
    "Liczba g≈Ços√≥w wa≈ºnych oddanych ≈ÇƒÖcznie na wszystkich kandydat√≥w": "valid_votes",
    # 2015
    "Liczba g≈Ços√≥w wa≈ºnych": "valid_votes",
}

In [18]:
CANDIDATE_SURNAMES_BY_YEAR = {
    "2015": [
        "braun", "duda", "jarubas", "komorowski", "korwin-mikke",
        "kowalski", "kukiz", "og√≥rek", "palikot", "tanajno", "wilk"
    ],
    "2020": [
        "biedro≈Ñ", "bosak", "duda", "ho≈Çownia", "jakubiak",
        "kosiniak-kamysz", "piotrowski", "tanajno", "trzaskowski",
        "witkowski", "≈º√≥≈Çtek"
    ],
    "2025": [
        "bartoszewicz", "biejat", "braun", "ho≈Çownia", "jakubiak",
        "maciak", "mentzen", "nawrocki", "senyszyn", "stanowski",
        "trzaskowski", "woch", "zandberg"
    ]
}


def get_candidate_rename_map(df, year):
    surnames = CANDIDATE_SURNAMES_BY_YEAR.get(str(year), [])
    rename_map = {
        col: surname
        for col in df.columns
        for surname in surnames
        if surname in col.lower()
    }
    return rename_map

In [19]:
def extract_postal_code(address):
    match = re.search(r"\b\d{2}-\d{3}\b", str(address))
    return match.group(0) if match else None

In [None]:
def load_presidential_data(year, round, ext="csv"):
    file_path = os.path.join(POLAND_RAW_DATA, f"{year}_presidential", f"round{round}.{ext}")
    if ext == "csv":
        df = pd.read_csv(file_path, sep=";", encoding="utf-8")
    elif ext == "xls":
        df = pd.read_excel(file_path, dtype={"TERYT gminy": str})
    else:
        raise NotImplementedError(f"Cannot load file with {ext} ext!")
    # Normalize column names: replace non-breaking spaces, strip whitespace
    df.columns = df.columns.str.replace("\xa0", " ", regex=False).str.strip()
    return df

In [271]:
def process_presidential_df(df, year, final_cols=[]):
    # Rename columns - to have same naming convention for all files
    df = df.rename(columns=BASE_COLUMNS_MAP)

    # Shorten candidate columns - risky if two candidates with the same surname

    candidate_rename_map = get_candidate_rename_map(df, year=year)
    df.rename(columns=candidate_rename_map, inplace=True)
    
    df = df.rename(columns=candidate_rename_map)
    # Assuming, if postal code column exits, no need for extraction
    if "postal_code" not in df.columns:
        # Get postal code from address ("siedziba")
        df["postal_code"] = df["Siedziba"].apply(extract_postal_code)
    final_cols = (
        list(set(BASE_COLUMNS_MAP.values()))  # ensure uniqueness only here
        + list(candidate_rename_map.values())  # may have overlaps, OK
        + ["postal_code"] + final_cols
    )
    df = df[final_cols]
    return df

In [22]:
def get_presidential_df(year, round, ext="csv"):
    df = load_presidential_data(year, round, ext)
    return process_presidential_df

In [135]:
year = "2025"
df_2025_r1 = process_presidential_df(load_presidential_data(year, "1"), year)
df_2025_r2 = process_presidential_df(load_presidential_data(year, "2"), year)
year = "2020"
df_2020_r1 = process_presidential_df(load_presidential_data(year, "1"), year)
df_2020_r2 = process_presidential_df(load_presidential_data(year, "2"), year)
year = "2015"
ext = "xls"
df_2015_r1 = process_presidential_df(load_presidential_data(year, "1", ext), year)

# In 2015 round 2 data there is no postal code, so we need to join it to round 1
df_2015_r2 = load_presidential_data(year, "2", ext)
df_2015_r2 = df_2015_r2.merge(
    df_2015_r1[["teryt_gmina", "postal_code"]],
    left_on="Teryt Gminy",
    right_on="teryt_gmina",
    how="left"
)
df_2015_r2 = process_presidential_df(df_2015_r2, year)

  df = pd.read_csv(file_path, sep=";", encoding="utf-8")
  df = pd.read_csv(file_path, sep=";", encoding="utf-8")


### 2.2. Grupowanie geograficzne

W drugim etapie komisje wyborcze zosta≈Çy posortowane wed≈Çug kod√≥w pocztowych, a nastƒôpnie
pogrupowane w kolejne bloki komisji znajdujƒÖcych siƒô w bezpo≈õrednim sƒÖsiedztwie. Grupy tworzono
w taki spos√≥b, aby ‚Äî w miarƒô mo≈ºliwo≈õci ‚Äî ka≈ºda zawiera≈Ça od 10 do 16 komisji, ≈ÇƒÖczƒÖc ze sobƒÖ
sƒÖsiednie obszary kod√≥w pocztowych majƒÖce wsp√≥lny prefiks (np. ‚Äû30‚Äù, ‚Äû301‚Äù, ‚Äû3011‚Äù). Celem by≈Ço
maksymalne zwiƒôkszenie sp√≥jno≈õci przestrzennej przy zachowaniu porƒôcznej wielko≈õci grupy, bez
konieczno≈õci ograniczania jej do jednego kodu pocztowego. Zastosowano nastƒôpujƒÖcƒÖ procedurƒô:

  1. PoczƒÖtkowe grupowanie oparto na pierwszych dw√≥ch cyfrach kodu pocztowego (np. ‚Äû30‚Äù dla
obszaru Krakowa).
  2. Je≈ºeli powsta≈Ça grupa zawiera≈Ça od 10 do 16 komisji, zosta≈Ça zaakceptowana bez zmian.
  3. Grupy liczƒÖce mniej ni≈º 10 komisji od≈Ço≈ºono do p√≥≈∫niejszego ≈ÇƒÖczenia.
  4. Grupy przekraczajƒÖce 16 komisji dzielono rekurencyjnie, dodajƒÖc kolejne cyfry kodu pocztowego (np. z ‚Äû30‚Äù ‚Üí ‚Äû301‚Äù ‚Üí ‚Äû3011‚Äù i dalej, a≈º do pe≈Çnych piƒôciu cyfr).
  5. Pozosta≈Çe ma≈Çe grupy ≈ÇƒÖczono z najbli≈ºszymi sƒÖsiadami majƒÖcymi ten sam prefiks, przy czym priorytetem by≈Ça ciƒÖg≈Ço≈õƒá przestrzenna i zr√≥wnowa≈ºona liczebno≈õƒá grup.
  
W odr√≥≈ºnieniu od wcze≈õniejszego podej≈õcia, kt√≥re dopuszcza≈Ço grupy o wielko≈õci 10‚Äì25 komisji,
niniejsze badanie przyjƒô≈Ço wƒô≈ºszy zakres docelowy: od 10 do 16 komisji na grupƒô. Decyzja ta
wynika≈Ça z przeglƒÖdu empirycznego, kt√≥ry wykaza≈Ç, ≈ºe wiƒôksze grupy ‚Äî mimo wydajno≈õci
statystycznej ‚Äî czasami ≈ÇƒÖczy≈Çy odleg≈Çe geograficznie obszary o niejednorodnych wzorcach
g≈Çosowania.

W wyniku zastosowania nowych ogranicze≈Ñ utworzono 2‚ÄØ208 grup, z kt√≥rych ka≈ºda odzwierciedla≈Ça
wzglƒôdnie jednorodnƒÖ lokalnƒÖ dynamikƒô wyborczƒÖ. Dla potwierdzenia ich sp√≥jno≈õci terytorialnej
przeprowadzono test zgodno≈õci kod√≥w pocztowych w ramach ka≈ºdej grupy.

Wiƒôkszo≈õƒá grup spe≈Çni≈Ça za≈Ço≈ºony docelowy rozmiar: 1‚ÄØ386 grup (62,8%) zawiera≈Ço od 10 do 16
komisji, a 2‚ÄØ017 grup (91,3%) zawiera≈Ço od 6 do 30 komisji. Wiƒôksze grupy zazwyczaj odpowiada≈Çy obszarom miejskim ‚Äî na przyk≈Çad takim jak Toru≈Ñ czy W≈Çoc≈Çawek, gdzie pojedynczy kod pocztowy
obejmowa≈Ç ca≈Çe miasto. W takich przypadkach wiƒôksza liczba komisji nie zaburza≈Ça sp√≥jno≈õci
przestrzennej, a wrƒôcz zwiƒôksza≈Ça wiarygodno≈õƒá statystycznƒÖ poprzez powiƒôkszenie pr√≥bki lokalnej.

Grupy mniejsze ni≈º docelowy zakres obejmowa≈Çy komisje, kt√≥re ‚Äî z powodu izolacji geograficznej
‚Äî nie mog≈Çy zostaƒá sensownie po≈ÇƒÖczone z innymi. Choƒá pr√≥bki mniejsze ni≈º 10 jednostek sƒÖ zwykle
uznawane za majƒÖce ograniczonƒÖ moc statystycznƒÖ, zastosowanie metody MAD ‚Äî znanej ze swojej
odporno≈õci na ma≈Çe pr√≥by ‚Äî w znacznym stopniu niweluje to ograniczenie.

------------

Nie jestem w stanie odtworzyƒá grupowania geograficznego 1:1.

JednƒÖ z opcji jest grupowanie losowe: buckety 10 - 16, dokladnie 2208 grup

In [24]:
def add_random_buckets(df, n, k_min, k_max, random_state=None):
    """
    Partition a DataFrame into n random groups (buckets), each with at least k_min and at most k_max rows.
    Adds a 'bucket' column to the DataFrame indicating the group assignment.

    Args:
        df (pd.DataFrame): The DataFrame to partition.
        n (int): Number of buckets (groups) to create.
        k_min (int): Minimum number of rows per bucket.
        k_max (int): Maximum number of rows per bucket.
        random_state (int, optional): Seed for reproducibility. Default is None.

    Returns:
        pd.DataFrame: A new DataFrame with a 'bucket' column indicating the group assignment.

    Raises:
        ValueError: If the constraints cannot be satisfied with the given n, k_min, and k_max.
    """
    N = len(df)
    if N < n * k_min or N > n * k_max:
        raise ValueError("Constraints cannot be satisfied with given n, k_min, and k_max.")
    
    # Start with k_min in each bucket
    sizes = np.full(n, k_min)
    remaining = N - sizes.sum()
    increments = np.zeros(n, dtype=int)
    
    rng = np.random.default_rng(random_state)
    while remaining > 0:
        idxs = np.where(sizes + increments < k_max)[0]
        idx = rng.choice(idxs)
        increments[idx] += 1
        remaining -= 1
    
    final_sizes = sizes + increments
    
    # Shuffle the DataFrame
    shuffled_df = df.sample(frac=1, random_state=random_state).reset_index(drop=True)
    
    # Assign bucket numbers
    bucket_labels = np.repeat(np.arange(n), final_sizes)
    shuffled_df['bucket'] = bucket_labels
    
    # Restore original order if needed (optional)
    # shuffled_df = shuffled_df.sort_index()
    
    return shuffled_df

In [25]:
df = add_random_buckets(df_2025_r2, n=2208, k_min=10, k_max=16)

In [26]:
def print_bucket_stats(df, bucket_col='bucket', range1=(10, 16), range2=(6, 30)):
    """
    Prints statistics about bucket sizes in the DataFrame.
    Shows total number of buckets and how many fall into two specified ranges.

    Args:
        df (pd.DataFrame): DataFrame containing the bucket column.
        bucket_col (str): Name of the column with bucket assignments.
        range1 (tuple): First (min, max) range for bucket size.
        range2 (tuple): Second (min, max) range for bucket size.
    """
    bucket_sizes = df[bucket_col].value_counts()
    total_buckets = bucket_sizes.count()
    buckets_in_range1 = bucket_sizes[(bucket_sizes >= range1[0]) & (bucket_sizes <= range1[1])].count()
    buckets_in_range2 = bucket_sizes[(bucket_sizes >= range2[0]) & (bucket_sizes <= range2[1])].count()
    print(f"Total buckets: {total_buckets}")
    print(f"Buckets with {range1[0]}‚Äì{range1[1]} items: {buckets_in_range1} ({buckets_in_range1 / total_buckets:.1%})")
    print(f"Buckets with {range2[0]}‚Äì{range2[1]} items: {buckets_in_range2} ({buckets_in_range2 / total_buckets:.1%})")


In [27]:
print_bucket_stats(df)

Total buckets: 2208
Buckets with 10‚Äì16 items: 2208 (100.0%)
Buckets with 6‚Äì30 items: 2208 (100.0%)


DrugƒÖ moje "klastrowanie"

In [None]:
def add_janiszewski_postal_buckets(df, min_bucket_size=10, max_bucket_size=16, postal_code_col='postal_code'):
    """
    Assigns spatially-coherent buckets to a DataFrame based on postal code prefixes, with constraints on bucket size.
    The function mimics the logic from generate_buckets_3.ipynb.

    Args:
        df (pd.DataFrame): Input DataFrame. Must contain a column with postal codes.
        min_bucket_size (int): Minimum number of rows per bucket.
        max_bucket_size (int): Maximum number of rows per bucket.
        postal_code_col (str): Name of the column containing postal codes (e.g., 'postal_code').

    Returns:
        pd.DataFrame: DataFrame with a new 'bucket' column indicating group assignment.
    """
    df = df.copy()
    # Clean postal codes: remove dash, ensure string
    df['postal_clean'] = df[postal_code_col].astype(str).str.replace('-', '')
    
    # Compute postal code prefixes
    df['postal_2'] = df['postal_clean'].str[:2]
    df['postal_3'] = df['postal_clean'].str[:3]
    df['postal_4'] = df['postal_clean'].str[:4]
    df['postal_5'] = df['postal_clean']
    
    # Precompute value counts for each prefix
    df['postal_3_value_count'] = df['postal_3'].map(df['postal_3'].value_counts())
    df['postal_4_value_count'] = df['postal_4'].map(df['postal_4'].value_counts())
    
    # Step 1: Assign buckets by 3-digit prefix if group size is valid
    postal_3_counts = df['postal_3'].value_counts()
    valid_postals_3 = postal_3_counts[(postal_3_counts >= min_bucket_size) & (postal_3_counts <= max_bucket_size)].index
    df['bucket'] = df['postal_3'].where(df['postal_3'].isin(valid_postals_3))
    
    # Step 2: Assign by 4-digit prefix for unassigned rows
    postal_4_counts = df['postal_4'].value_counts()
    valid_postals_4 = postal_4_counts[(postal_4_counts >= min_bucket_size) & (postal_4_counts <= max_bucket_size)].index
    mask_4 = df['bucket'].isna() & df['postal_4'].isin(valid_postals_4)
    df.loc[mask_4, 'bucket'] = df.loc[mask_4, 'postal_4']
    
    # Step 3: Assign by 5-digit (full) postal code for unassigned rows
    postal_5_counts = df['postal_5'].value_counts()
    valid_postals_5 = postal_5_counts[(postal_5_counts >= min_bucket_size) & (postal_5_counts <= max_bucket_size)].index
    mask_5 = df['bucket'].isna() & df['postal_5'].isin(valid_postals_5)
    df.loc[mask_5, 'bucket'] = df.loc[mask_5, 'postal_5']
    
    # Step 4: For oversized 5-digit groups, assign if all are ungrouped
    mask_postal_5 = (
        df['bucket'].isna() &
        ((df['postal_3_value_count'] > max_bucket_size) |
         (df['postal_4_value_count'] > max_bucket_size))
    )
    postal_5_counts = df.loc[mask_postal_5, 'postal_5'].value_counts()
    for postal_code, count in postal_5_counts.items():
        if count > max_bucket_size:
            same_postal_mask = (df['postal_5'] == postal_code) & df['bucket'].isna()
            if same_postal_mask.sum() == count:
                df.loc[same_postal_mask, 'bucket'] = postal_code
    
    # Step 5: For remaining unassigned, chunk by sorted postal code within 3-digit prefix
    leftovers = df[df['bucket'].isna()].copy()
    for prefix, group in leftovers.groupby('postal_3'):
        group_sorted = group.sort_values(by='postal_clean')
        i = 0
        while i < len(group_sorted):
            chunk = group_sorted.iloc[i:i+max_bucket_size]
            if len(chunk) >= min_bucket_size:
                bucket_id = f"{prefix}_L{i//max_bucket_size}"
                df.loc[chunk.index, 'bucket'] = bucket_id
                i += len(chunk)
            else:
                break  # remaining rows too small to form a valid group
    
    # Step 6: Try to merge final leftovers into existing buckets, else assign fallback
    final_leftovers = df[df['bucket'].isna()].copy()
    existing_buckets = df.dropna(subset=['bucket']).copy()
    existing_buckets['bucket_size'] = existing_buckets.groupby('bucket')['bucket'].transform('count')
    for prefix, group in final_leftovers.groupby('postal_3'):
        group_sorted = group.sort_values(by='postal_clean')
        existing_in_prefix = existing_buckets[
            existing_buckets['postal_3'] == prefix
        ].groupby('bucket').first()
        for idx, row in group_sorted.iterrows():
            # Try to append to a not-too-large existing group
            assigned = False
            for bucket_id, b_row in existing_in_prefix.iterrows():
                current_size = df[df['bucket'] == bucket_id].shape[0]
                if current_size < max_bucket_size:
                    df.at[idx, 'bucket'] = bucket_id
                    assigned = True
                    break
            # If no spot found, assign a new fallback bucket
            if pd.isna(df.at[idx, 'bucket']):
                fallback_id = f"{prefix}_F{idx}"
                df.at[idx, 'bucket'] = fallback_id
    
    if df['bucket'].isna().sum() > 0:
        # Find all existing buckets and their sizes
        bucket_sizes = df['bucket'].value_counts()
        smallest_buckets = bucket_sizes.nsmallest(10).index  # or all buckets
        for idx in df[df['bucket'].isna()].index:
            # Assign to the bucket with the smallest current size
            target_bucket = df['bucket'].value_counts().idxmin()
            df.at[idx, 'bucket'] = target_bucket


    # Optionally, remove buckets with fewer than 3 members
    vc = df['bucket'].value_counts()
    df = df[df['bucket'].isin(vc[vc >= 3].index)]

    # Remove all helper columns before returning
    helper_cols = [
        'postal_clean', 'postal_2', 'postal_3', 'postal_4', 'postal_5',
        'postal_3_value_count', 'postal_4_value_count'
    ]
    df = df.drop(columns=[col for col in helper_cols if col in df.columns])
    
    return df


In [None]:
df = add_janiszewski_postal_buckets(df_2025_r2, min_bucket_size=10, max_bucket_size=16)


In [30]:
print_bucket_stats(df)

Total buckets: 1809
Buckets with 10‚Äì16 items: 1462 (80.8%)
Buckets with 6‚Äì30 items: 1719 (95.0%)


TrzeciƒÖ klastrowanie Jakuba Bialka
https://github.com/rabitwhte/analiza_kontka_reprodukcja/blob/main/Reprodukcja_wynikow_Kontek_Bialek.ipynb

In [35]:
def add_bialek_postal_buckets(df: pd.DataFrame,
                  postal_code_col: str = "postal_code",
                  min_size: int = 10,
                  max_size: int = 16) -> pd.DataFrame:
    """
    Returns a copy of df with a new column `group_id` containing the group number
    such that each group has min_size <= n <= max_size (as much as possible).
    Groups are formed by recursively splitting by postal code prefix, then merging small groups.

    Args:
        df (pd.DataFrame): Input DataFrame with a postal code column.
        code_col (str): Name of the column with postal codes (should be int or str, 5 digits).
        min_size (int): Minimum group size.
        max_size (int): Maximum group size.

    Returns:
        pd.DataFrame: Copy of df with a new 'group_id' column.
    """
    df['postal_clean'] = df[postal_code_col].astype(str).str.replace('-', '')
    work = df.copy()
    work["_code_str"] = work['postal_clean'].astype(str).str.zfill(5)

    accepted = {}                # {prefix: list[index]}
    small     = {}               # prefixy < min_size (to be merged later)

    # 1. Recursive splitting of large groups
    def split(prefix: str, idxs: list[int], depth: int):
        size = len(idxs)
        # a) Acceptable group
        if min_size <= size <= max_size or depth == 5:
            accepted[prefix] = idxs
            return
        # b) Too small ‚Äì save for later merging
        if size < min_size:
            small[prefix] = idxs
            return
        # c) Too large ‚Äì split deeper
        next_depth = depth + 1
        sub_prefixes = work.loc[idxs, "_code_str"].str[:next_depth]
        for sub_pref, sub_idxs in work.loc[idxs].groupby(sub_prefixes).groups.items():
            split(sub_pref, list(sub_idxs), next_depth)

    # Start with 2-digit prefixes
    for pref2, grp in work.groupby(work["_code_str"].str[:2]).groups.items():
        split(pref2, list(grp), depth=2)

    # 2. Merge small groups within the same 2-digit prefix
    buckets = defaultdict(list)          # {pref2: [(pref, idxs), ...]}
    for p, idxs in small.items():
        buckets[p[:2]].append((p, idxs))

    for pref2, lst in buckets.items():
        # Sort by prefix value for spatial proximity
        lst.sort(key=lambda x: int(x[0]))
        buf_idx, buf_pref = [], []
        for p, idxs in lst:
            buf_idx.extend(idxs)
            buf_pref.append(p)
            if len(buf_idx) >= min_size:
                # If > max_size, split into chunks
                while len(buf_idx) > max_size:
                    accepted[f"{pref2}_{len(accepted)}"] = buf_idx[:max_size]
                    buf_idx = buf_idx[max_size:]
                accepted["+".join(buf_pref)] = buf_idx
                buf_idx, buf_pref = [], []
        # Remainder < min_size ‚Äì append to last group for this prefix
        if buf_idx:
            last_keys = [k for k in accepted.keys() if k.startswith(pref2)]
            if last_keys:
                last_key = last_keys[-1]
                accepted[last_key].extend(buf_idx)
            else:
                # If no group exists, create a new one
                accepted[f"{pref2}_rem"] = buf_idx

    # 3. Assign group numbers
    idx2gid = {}
    for gid, (_, lst) in enumerate(accepted.items()):
        for i in lst:
            idx2gid[i] = gid
    work["bucket"] = work.index.map(idx2gid)

    # 4. Remove groups with fewer than 4 members
    group_sizes = work.groupby('bucket').transform('count')['postal_clean']
    work = work[group_sizes > 3].copy()

    return work.drop(columns="_code_str")

In [86]:
df = add_bialek_postal_buckets(df_2025_r2, min_size=10, max_size=16)

In [88]:
df.head()

Unnamed: 0,valid_votes,polling_station_id,eligible_voters,ballots_cast,teryt_gmina,nawrocki,trzaskowski,postal_code,postal_clean,bucket
0,1178.0,1,1678.0,1187.0,20101.0,582.0,596.0,59-700,59700,1232
1,903.0,2,1269.0,911.0,20101.0,386.0,517.0,59-700,59700,1232
2,993.0,3,1358.0,996.0,20101.0,437.0,556.0,59-700,59700,1232
3,975.0,4,1354.0,984.0,20101.0,408.0,567.0,59-700,59700,1232
4,850.0,5,1195.0,856.0,20101.0,332.0,518.0,59-700,59700,1232


In [89]:
print_bucket_stats(df)

Total buckets: 2304
Buckets with 10‚Äì16 items: 1264 (54.9%)
Buckets with 6‚Äì30 items: 1917 (83.2%)


### IMPLEMENTACJA METOD ZAPROPONOWANYCH PRZEZ DR KONTKA

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5296441

2.3. Wykrywanie warto≈õci odstajƒÖcych

G≈Ç√≥wna innowacja analityczna niniejszego badania polega na oszacowaniu potencjalnego wp≈Çywu
anormalnych komisji wyborczych na poziomie og√≥lnokrajowym. Aby to osiƒÖgnƒÖƒá, w pierwszej
kolejno≈õci zidentyfikowano warto≈õci odstajƒÖce w czterech kategoriach nieprawid≈Çowo≈õci:

In [None]:
## let's do it with clustering proposed by Jakub Bialek

# df = add_bialek_postal_buckets(df, min_size=10, max_size=16)

In [136]:
keep_columns = ["teryt_gmina", "polling_station_id", "trzaskowski", "nawrocki", "postal_code"]

# ROUND 1 
# Step 1: Keep only selected columns
df_2025_r1 = df_2025_r1[keep_columns]
df_2025_r1 = df_2025_r1.dropna(subset=["teryt_gmina"])
# Step 2: Convert teryt_gmina to integer
df_2025_r1["teryt_gmina"] = df_2025_r1["teryt_gmina"].astype(int)

# ROUND 2
# Step 1: Keep only selected columns
df_2025_r2 = df_2025_r2[keep_columns]
df_2025_r2 = df_2025_r2.dropna(subset=["teryt_gmina"])
# Step 2: Convert teryt_gmina to integer
df_2025_r2["teryt_gmina"] = df_2025_r2["teryt_gmina"].astype(int)


In [137]:
# Join both rounds

# Step 1: Rename candidate columns in each round for clarity
df_2025_r1 = df_2025_r1.rename(columns={
    "trzaskowski": "trzaskowski_r1",
    "nawrocki": "nawrocki_r1"
})

df_2025_r2 = df_2025_r2.rename(columns={
    "trzaskowski": "trzaskowski_r2",
    "nawrocki": "nawrocki_r2"
})

# Step 2: Merge on teryt_gmina and polling_station_id
df_2025 = pd.merge(
    df_2025_r1,
    df_2025_r2,
    on=["teryt_gmina", "polling_station_id", "postal_code"],
    how="inner"  # or "outer"/"left" depending on need
)

In [138]:
# Add buckets
df_2025 = add_bialek_postal_buckets(df_2025)


In [139]:
df_2025.head()

Unnamed: 0,teryt_gmina,polling_station_id,trzaskowski_r1,nawrocki_r1,postal_code,trzaskowski_r2,nawrocki_r2,postal_clean,bucket
0,20101,1,361.0,287.0,59-700,596.0,582.0,59700,1567
1,20101,2,381.0,228.0,59-700,517.0,386.0,59700,1567
2,20101,3,356.0,241.0,59-700,556.0,437.0,59700,1567
3,20101,4,390.0,217.0,59-700,567.0,408.0,59700,1567
4,20101,5,343.0,202.0,59-700,518.0,332.0,59700,1567


In [111]:
cand_A = "trzaskowski"
cand_B = "nawrocki"

### 1. Nadmierne poparcie dla Karola Nawrockiego (wzglƒôdem mediany w ramach lokalnej grupy)

Za artyku≈Çem: w ramach ka≈ºdej grupy obliczono mediany oraz odchylenia bezwzglƒôdne od mediany (MAD), dla ka≈ºdej komisji obliczono wska≈∫nik odchylenia od mediany

![X minus median over MAD](./images/X_minus_median_over_MAD.png)

gdzie:

X - wynik w drugiej turze kandydata

mediana - mediana wynik√≥w kandydata w drugiej turze

MAD - odchelenie bezwzglƒôdne mediany kandydata w drugiej turze

z artykulu:

Dla ka≈ºdej grupy komisji oraz dla dw√≥ch pierwszych kategorii nieprawid≈Çowo≈õci:

  ‚Ä¢ obliczono mediany oraz odchylenia bezwzglƒôdne od mediany (MAD);

  ‚Ä¢ dla ka≈ºdej komisji obliczono tzw. wsp√≥≈Çczynnik odchylenia odpornego (robust deviation score, oznaczony jako k_needed), wyra≈ºajƒÖcy skalƒô odchylenia wyniku od mediany grupy w jednostkach MAD, wed≈Çug wzoru (powyzej)

Komisjƒô oznaczano jako odstajƒÖcƒÖ (outlier), je≈õli spe≈Çniony by≈Ç warunek: ùëòùëõùëíùëíùëëùëíùëë > ùëò gdzie k to
warto≈õƒá progowa przyjƒôta w analizie, szczeg√≥≈Çowo opisana w sekcji 2.4.

In [218]:
df = df_2025.copy()

In [219]:
# mediana w grupie
df[cand_A + '_median_r2'] = df.groupby('bucket')[cand_A + '_r2'].transform('median')
df[cand_B + '_median_r2'] = df.groupby('bucket')[cand_B + '_r2'].transform('median')

In [220]:
# MAD w grupie
from scipy.stats import median_abs_deviation
def mad(series):
    """Median Absolute Deviation"""
    return median_abs_deviation(series)

df[cand_A + "_MAD_r2"] = df.groupby("bucket")[cand_A + '_r2'].transform(mad)
df[cand_B + "_MAD_r2"] = df.groupby("bucket")[cand_B + '_r2'].transform(mad)

In [221]:
df[cand_A +'_k_score_1'] = (df[cand_A + '_r2'] - df[cand_A + '_median_r2'])/df[cand_A + '_MAD_r2']
df[cand_B +'_k_score_1'] = (df[cand_B + '_r2'] - df[cand_B + '_median_r2'])/df[cand_B + '_MAD_r2']

2.4 Przeliczenie wynik√≥w

Aby uwzglƒôdniƒá niepewno≈õƒá i wra≈ºliwo≈õƒá zastosowanego podej≈õcia, obliczenia przeprowadzono dla
trzech r√≥≈ºnych prog√≥w detekcji warto≈õci odstajƒÖcych: k > 2.0, k > 2.5 oraz k > 3.0, gdzie k oznacza
liczbƒô jednostek odchylenia bezwzglƒôdnego od mediany (MAD) wzglƒôdem mediany w grupie
lokalnej. Wy≈ºsze warto≈õci k wyodrƒôbniajƒÖ jedynie najbardziej skrajne przypadki, zapewniajƒÖc tym
samym konserwatywnƒÖ estymacjƒô potencjalnego wp≈Çywu. Jednocze≈õnie jednak ograniczajƒÖ zdolno≈õƒá
metody do wychwytywania mniejszych, lecz wciƒÖ≈º istotnych odchyle≈Ñ.

In [204]:
for k in [2.0, 2.5, 3.0]:
    count_A = sum(df[cand_A + '_k_score_1'] > k)
    count_B = sum(df[cand_B + '_k_score_1'] > k)
    print(f'k > {k}')
    print(f'{cand_A}: {count_A}')
    print(f'{cand_B}: {count_B}')
    print('---')

k > 2.0
trzaskowski: 4551
nawrocki: 3750
---
k > 2.5
trzaskowski: 3535
nawrocki: 2750
---
k > 3.0
trzaskowski: 2794
nawrocki: 2015
---


Wyniki:

Dla k=2, takich komisji, w kt√≥rych "za du≈ºe" poparcie ma Nawrocki jest 3762, a Trzaskowski 4554.

In [222]:
# Let's set k = 2
k = 2

In [223]:
# zapisz anomalie na korzy≈õƒá kandydat√≥w do p√≥≈∫niejszego sumowania
df[cand_A + "_anomaly_1"] = df[kandydat_A +'_k_score_1'] > k
df[cand_B + "_anomaly_1"] = df[kandydat_B +'_k_score_1'] > k

In [224]:
# df.head()

df = df.drop(columns=[
    # "trzaskowski_median_r2",
    # "nawrocki_median_r2",
    "trzaskowski_MAD_r2",
    "nawrocki_MAD_r2",
    "trzaskowski_k_score_1",
    "nawrocki_k_score_1"
])

### 2. Nadmierny wzglƒôdny wzrost poparcia dla Karola Nawrockiego miƒôdzy pierwszƒÖ a drugƒÖ turƒÖ, w por√≥wnaniu do odpowiedniego wzrostu poparcia dla Rafa≈Ça Trzaskowskiego w tej samej grupie lokalnej;

za, JB: 

Nie podano wprost jak to by≈Ço obliczone wiƒôc kolejno:

  1. Dla danego kandydata obliczam wzglƒôdny wzrost miƒôdzy pierwszƒÖ a drugƒÖ turƒÖ (dzielƒÖc wynik z drugiej przez wynik z pierwszej)
  2. Nastƒôpnie odnoszƒô go do wzrostu drugiego kandydata - liczƒô r√≥≈ºnicƒô miƒôdzy wzglƒôdnymi wzrostami.
  3. Dalej tak jak w pierwszym typie anomalii - dla tych r√≥≈ºnic liczƒô medianƒô grupy, MAD grupy oraz odchylenie k w komisji.

In [225]:
# wzgledny wzrost miƒôdzy pierwszƒÖ a drugƒÖ turƒÖ
df[cand_A + '_increase'] = df[cand_A + '_r2']/df[cand_A + '_r1']
df[cand_B + '_increase'] = df[cand_B + '_r2']/df[cand_B + '_r1']

In [226]:
# roznica wzglednego wzrostu miedzy kandydatami
df['relative_increase_diff_' + cand_A] =  df[cand_A + '_increase']  - df[cand_B + '_increase'] # wzrost A w por√≥wnaniu do B
df['relative_increase_diff_' + cand_B] =  df[cand_B + '_increase']  - df[cand_A + '_increase'] # wzrost B w por√≥wnaniu do A

In [227]:
# mediana i mad r√≥≈ºnicy wzglƒôdnego wzrostu poparcia
df['relative_increase_diff_' + cand_A +'_median'] = df.groupby('bucket')['relative_increase_diff_' + cand_A].transform('median')
df['relative_increase_diff_' + cand_A + '_MAD'] = df.groupby('bucket')['relative_increase_diff_' + cand_A].transform(mad)

In [228]:
# mediana i mad r√≥≈ºnicy wzglƒôdnego wzrostu poparcia
df['relative_increase_diff_' + cand_B +'_median'] = df.groupby('bucket')['relative_increase_diff_' + cand_B].transform('median')
df['relative_increase_diff_' + cand_B + '_MAD'] = df.groupby('bucket')['relative_increase_diff_' + cand_B].transform(mad)

In [229]:
df[cand_A +'_k_score_2'] = (df['relative_increase_diff_' + cand_A] - df['relative_increase_diff_' + cand_A +'_median'])/df['relative_increase_diff_' + cand_A + '_MAD']


In [230]:
df[cand_B +'_k_score_2'] = (df['relative_increase_diff_' + cand_B] - df['relative_increase_diff_' + cand_B +'_median'])/df['relative_increase_diff_' + cand_B + '_MAD']

In [231]:
for k in [2.0, 2.5, 3.0]:
    count_A = sum(df[cand_A + '_k_score_2'] > k)
    count_B = sum(df[cand_B + '_k_score_2'] > k)
    print(f'k > {k}')
    print(f'{cand_A}: {count_A}')
    print(f'{cand_B}: {count_B}')
    print('---')

k > 2.0
trzaskowski: 3552
nawrocki: 3127
---
k > 2.5
trzaskowski: 2666
nawrocki: 2229
---
k > 3.0
trzaskowski: 2106
nawrocki: 1669
---


In [232]:
# zapisz anomalie na korzy≈õƒá kandydat√≥w do p√≥≈∫niejszego sumowania
df[cand_A + "_anomaly_2"] = df[kandydat_A +'_k_score_2'] > k
df[cand_B + "_anomaly_2"] = df[kandydat_B +'_k_score_2'] > k

In [233]:
df = df.drop(columns=[
    "trzaskowski_increase",
    "nawrocki_increase",
    "relative_increase_diff_trzaskowski",
    "relative_increase_diff_nawrocki",
    "relative_increase_diff_trzaskowski_median",
    "relative_increase_diff_trzaskowski_MAD",
    "relative_increase_diff_nawrocki_median",
    "relative_increase_diff_nawrocki_MAD",
    "trzaskowski_k_score_2",
    "nawrocki_k_score_2"
])

### 3. Komisje, w kt√≥rych Nawrocki uzyska≈Ç wiƒôcej g≈Ços√≥w ni≈º Trzaskowski w drugiej turze, mimo ≈ºe mediana wynik√≥w w grupie wskazywa≈Ça na przewagƒô Trzaskowskiego;

  1. Sprawdzamy, w kt√≥rych grupach dany kandydat mia≈Ç wiƒôkszƒÖ medianƒô
  2. Sumujemy komisje, w kt√≥rych wygra≈Ç kandydat A mimo, ≈ºe wiƒôkszƒÖ medianƒô mia≈Ç kandydat B i na odwr√≥t.

In [234]:
# wieksza mediana w grupie
df['higher_median_' + cand_A] = (df[cand_A + '_median_r2'] >  df[cand_B + '_median_r2']).astype(bool)
df['higher_median_' + cand_B] = (df[cand_B + '_median_r2'] >  df[cand_A + '_median_r2']).astype(bool)

In [236]:
# na korzy≈õƒá kandydat A, czyli wiƒôkszƒÖ medianƒô mia≈Ç B, a wiƒôcej g≈Ços√≥w dosta≈Ç A.
cand_A, sum(df['higher_median_' + cand_B] & (df[cand_A + '_r2'] > df[cand_B + '_r2']))

('trzaskowski', 2608)

In [238]:
cand_B, sum(df['higher_median_' + cand_A] & (df[cand_B + '_r2'] > df[cand_A + '_r2']))

('nawrocki', 1843)

**WYNIKI**:

W grupach, w kt√≥rych wiƒôkszƒÖ medianƒô mia≈Ç Nawrocki, by≈Ço 2608 komisji, w kt√≥rych wy≈ºszy wynik uzyska≈Ç Trzaskowski.

W grupach, w kt√≥rych wiƒôkszƒÖ medianƒô mia≈Ç Trzaskowski, by≈Ço 1843 komisji, w kt√≥rych wy≈ºszy wyniki uzyska≈Ç Nawrocki.

Przyk≈Çadowo:

W komisji 13 gdzie w drugiej turze g≈Çosowa≈Ço.. 13 os√≥b, Trzaskowski uzyska≈Ç wiƒôkszy wynik (8 do 5), mimo ≈ºe w grupie obejmujƒÖcej kod pocztowy 59-730 wiƒôkszƒÖ medianƒô mia≈Ç Nawrocki (344 vs 158).

In [239]:
# anomalie na korzysc
df[cand_A + '_anomaly_3'] = df['higher_median_' + cand_B] & (df[cand_A + '_r2'] > df[cand_B + '_r2']) 
df[cand_B + '_anomaly_3'] = df['higher_median_' + cand_A] & (df[cand_B + '_r2'] > df[cand_A + '_r2']) 

In [241]:
df = df.drop(columns=[
    "trzaskowski_median_r2",
    "nawrocki_median_r2",
    "higher_median_trzaskowski",
    "higher_median_nawrocki"	
])

**WƒÖtpliwo≈õci w tej metodologii**

Jak pisze Piotr Szulc:

https://danetyka.com/kontek-analiza-bledow/

Jedna z cech, jakie bada autor, jest nazwana ‚Äúflip‚Äù i nie ma nic wsp√≥lnego z wy≈ºej podanƒÖ standaryzacjƒÖ i progami. Autor za anomaliƒô uznaje ka≈ºdy przypadek, w kt√≥rym ‚ÄúNawrocki wygrywa lokalnie, mimo ≈ºe mediana wynik√≥w w grupie wskazuje przewagƒô Trzaskowskiego‚Äù. Za≈Ç√≥≈ºmy, ≈ºe procenty poparcia dla Nawrockiego w danej grupie wynoszƒÖ: 45, 46, 47, 48, 49, 51, 52, 53, 54.

In [None]:
# Dane: 9 komisji ‚Äì Trzaskowski ma wy≈ºszƒÖ medianƒô, ale Nawrocki wygrywa w 4 komisjach
dummy_df = pd.DataFrame({
    'okrƒôg': ['A'] * 9,
    'trzaskowski': [55, 54, 53, 52, 51, 49, 47, 46, 45],
    'nawrocki':    [45, 46, 47, 48, 49, 51, 52, 53, 54],
})

Mediana wynosi 49%, wiƒôc ‚Äúmediana wynik√≥w w grupie wskazuje przewagƒô Trzaskowskiego‚Äù:

In [328]:
# Obliczenie median
trzaskowski_median = dummy_df['trzaskowski'].median()  # 51.0
nawrocki_median = dummy_df['nawrocki'].median()        # 49.0

# ale zeby byƒá sp√≥jnym z poprzedniƒÖ implementacjƒÖ:

# mediana w grupie
dummy_df[cand_A + '_median'] = dummy_df.groupby('okrƒôg')[cand_A].transform('median')
dummy_df[cand_B + '_median'] = dummy_df.groupby('okrƒôg')[cand_B].transform('median')

dummy_df['higher_median_' + cand_A] = (dummy_df[cand_A + '_median'] >  dummy_df[cand_B + '_median']).astype(bool)
dummy_df['higher_median_' + cand_B] = (dummy_df[cand_B + '_median'] >  dummy_df[cand_A + '_median']).astype(bool)

a zatem te cztery komisje, w kt√≥rych Nawrockich otrzyma≈Ç ponad 50% to anomalie, co oczywi≈õcie nie ma ≈ºadnego sensu. Ta cecha jest odpowiedzialna za ponad po≈Çowƒô (!) wskaza≈Ñ.

In [330]:
# na korzy≈õƒá kandydat B, czyli wiƒôkszƒÖ medianƒô mia≈Ç A, a wiƒôcej g≈Ços√≥w dosta≈Ç B.
cand_B, sum(dummy_df['higher_median_' + cand_A] & (dummy_df[cand_B] > dummy_df[cand_A]))

('nawrocki', 4)

In [332]:
# Flip: Nawrocki wygrywa, mimo ≈ºe Trzaskowski mia≈Ç wy≈ºszƒÖ medianƒô w grupie
dummy_df['flip_' + cand_B] = dummy_df['higher_median_' + cand_A] & (dummy_df[cand_B] > dummy_df[cand_A])

# Wy≈õwietlenie flip√≥w
print(dummy_df[[cand_A, cand_B, 'flip_' + cand_B]])
print(f"\nLiczba 'anomalii' wed≈Çug flip: {dummy_df['flip_' + cand_B].sum()} z {len(dummy_df)}")

   trzaskowski  nawrocki  flip_nawrocki
0           55        45          False
1           54        46          False
2           53        47          False
3           52        48          False
4           51        49          False
5           49        51           True
6           47        52           True
7           46        53           True
8           45        54           True

Liczba 'anomalii' wed≈Çug flip: 4 z 9


Na obronƒô dra Kontka muszƒô przyznaƒá, ≈ºe te ‚Äûanomalie‚Äù nie zosta≈Çyby uwzglƒôdnione w jego w≈Ça≈õciwej analizie, poniewa≈º zastosowany przez niego pr√≥g istotno≈õci wynosi

_k = 2_. 

W naszym przyk≈Çadzie, mimo ≈ºe wystƒôpujƒÖ przypadki tzw. ‚Äûflip√≥w‚Äù (czyli lokalnej wygranej kandydata z ni≈ºszƒÖ medianƒÖ), ≈ºaden z nich nie osiƒÖga warto≈õci 

_k_score_1 > 2_. 

Oznacza to, ≈ºe r√≥≈ºnice te nie zosta≈Çyby uznane za statystycznie istotne odchylenia w jego modelu i nie trafi≈Çyby na listƒô ‚Äûanomalii‚Äù.

In [333]:
# === Obliczanie MAD w grupie ===
def mad(series):
    return median_abs_deviation(series, scale='normal')  # sp√≥jne z klasycznƒÖ definicjƒÖ

dummy_df[cand_A + '_MAD'] = dummy_df.groupby('okrƒôg')[cand_A].transform(mad)
dummy_df[cand_B + '_MAD'] = dummy_df.groupby('okrƒôg')[cand_B].transform(mad)

# Obliczanie k_score_1
dummy_df[cand_A + '_k_score_1'] = (dummy_df[cand_A] - dummy_df[cand_A + '_median']) / dummy_df[cand_A + '_MAD']
dummy_df[cand_B + '_k_score_1'] = (dummy_df[cand_B] - dummy_df[cand_B + '_median']) / dummy_df[cand_B + '_MAD']

# === Analiza k_score_1 wzglƒôdem prog√≥w ===
for k in [2.0, 2.5, 3.0]:
    count_A = (dummy_df[cand_A + '_k_score_1'] > k).sum()
    count_B = (dummy_df[cand_B + '_k_score_1'] > k).sum()
    print(f'\nk > {k}')
    print(f'{cand_A}: {count_A}')
    print(f'{cand_B}: {count_B}')
    print('---')


k > 2.0
trzaskowski: 0
nawrocki: 0
---

k > 2.5
trzaskowski: 0
nawrocki: 0
---

k > 3.0
trzaskowski: 0
nawrocki: 0
---


### 4. Kandydat otrzyma≈Ç mniej g≈Ços√≥w w drugiej turze ni≈º w pierwszej

In [244]:
cand_A, sum(df[cand_A + '_r2']<df[cand_A + '_r1'])

('trzaskowski', 128)

W 128 komisjach Trzaskowski uzyska≈Ç mniej g≈Ços√≥w w drugiej turze ni≈º w pierwszej.

In [246]:
cand_B, sum(df[cand_B + '_r2']<df[cand_B + '_r1'])

('nawrocki', 112)

W 112 komisjach Nawrocki uzyska≈Ç mniej g≈Ços√≥w w drugiej turze ni≈º w pierwszej.


Przyk≈Çadowe anomalie na korzy≈õƒá Trzaskowskiego:

In [266]:
df[df[cand_B + '_r2'] < df[cand_B + '_r1']].sort_values(by=cand_B + '_r1', ascending=False).head()

Unnamed: 0,teryt_gmina,polling_station_id,trzaskowski_r1,nawrocki_r1,postal_code,trzaskowski_r2,nawrocki_r2,postal_clean,bucket,trzaskowski_anomaly_1,nawrocki_anomaly_1,trzaskowski_anomaly_2,nawrocki_anomaly_2,trzaskowski_anomaly_3,nawrocki_anomaly_3,trzaskowski_anomaly_4,nawrocki_anomaly_4,trzaskowski_sum_anomalies,nawrocki_sum_anomalies
12616,140706,1,105.0,285.0,26-910,467.0,193.0,26910,630,True,False,True,False,True,False,True,False,4,0
25866,261207,4,143.0,224.0,28-200,360.0,209.0,28200,662,True,False,True,False,True,False,True,False,4,0
5098,60903,4,89.0,174.0,23-100,260.0,163.0,23100,522,True,False,True,False,True,False,True,False,4,0
24825,260101,34,172.0,129.0,28-100,148.0,111.0,28100,657,False,False,False,False,True,False,True,True,2,1
2372,40102,9,120.0,105.0,87-720,164.0,85.0,87720,2197,False,False,True,False,True,False,True,False,3,0


In [250]:
# Anomalie na korzysc
df[cand_A + '_anomaly_4'] = df[cand_B + '_r2']<df[cand_B + '_r1']
df[cand_B + '_anomaly_4'] = df[cand_A + '_r2']<df[cand_A + '_r1']

To sƒÖ rzeczywi≈õcie bardzo podejrzane przypadki i o takich przypadkach powinni≈õmy alarmowaƒá w pierwszej kolejno≈õci. Po pierwsze, ju na etapie wprowadzania do systemu, a po drugie do ewentualnej kontroli i ponownego liczenia glos√≥w

## Sumowanie anomalii

### Na korzy≈õƒá Trzaskowskiego

In [253]:
df[cand_A + '_sum_anomalies'] = df[[
    cand_A + '_anomaly_1', 
    cand_A + '_anomaly_2',
    cand_A + '_anomaly_3',
    cand_A + '_anomaly_4']].sum(axis=1)

In [None]:
for number_of_anomalies in [1,2,3,4]:
    print(f"{number_of_anomalies} anomalies:")
    print(cand_A, sum(df[cand_A + '_sum_anomalies']>=number_of_anomalies))

1 anomalies:
trzaskowski 8161
2 anomalies:
trzaskowski 1179
3 anomalies:
trzaskowski 34
4 anomalies:
trzaskowski 3


In [259]:
# Komisje z wszystkimi czterma anomaliami
df[df[cand_A + '_sum_anomalies']>=4]

Unnamed: 0,teryt_gmina,polling_station_id,trzaskowski_r1,nawrocki_r1,postal_code,trzaskowski_r2,nawrocki_r2,postal_clean,bucket,trzaskowski_anomaly_1,nawrocki_anomaly_1,trzaskowski_anomaly_2,nawrocki_anomaly_2,trzaskowski_anomaly_3,nawrocki_anomaly_3,trzaskowski_anomaly_4,nawrocki_anomaly_4,trzaskowski_sum_anomalies
5098,60903,4,89.0,174.0,23-100,260.0,163.0,23100,522,True,False,True,False,True,False,True,False,4
12616,140706,1,105.0,285.0,26-910,467.0,193.0,26910,630,True,False,True,False,True,False,True,False,4
25866,261207,4,143.0,224.0,28-200,360.0,209.0,28200,662,True,False,True,False,True,False,True,False,4


### Na korzy≈õƒá Nawrockiego

In [261]:
df[cand_B + '_sum_anomalies'] = df[[
    cand_B + '_anomaly_1', 
    cand_B + '_anomaly_2',
    cand_B + '_anomaly_3',
    cand_B + '_anomaly_4']].sum(axis=1)

In [264]:
for number_of_anomalies in [1,2,3,4]:
    print(f"{number_of_anomalies} anomalies:")
    print(cand_B, sum(df[cand_B + '_sum_anomalies']>=number_of_anomalies))

1 anomalies:
nawrocki 6871
2 anomalies:
nawrocki 483
3 anomalies:
nawrocki 34
4 anomalies:
nawrocki 2


In [263]:
# Komisje z 4 anomaliami, "widaƒá Krak√≥w"
df[df[cand_B + '_sum_anomalies']>=4]

Unnamed: 0,teryt_gmina,polling_station_id,trzaskowski_r1,nawrocki_r1,postal_code,trzaskowski_r2,nawrocki_r2,postal_clean,bucket,trzaskowski_anomaly_1,nawrocki_anomaly_1,trzaskowski_anomaly_2,nawrocki_anomaly_2,trzaskowski_anomaly_3,nawrocki_anomaly_3,trzaskowski_anomaly_4,nawrocki_anomaly_4,trzaskowski_sum_anomalies,nawrocki_sum_anomalies
11610,126101,95,550.0,218.0,31-346,540.0,1132.0,31346,691,False,True,False,True,False,True,False,True,0,4
17032,161105,9,311.0,107.0,47-100,223.0,416.0,47100,1408,False,True,False,True,False,True,False,True,0,4


### PONOWNIE POLICZONE G≈ÅOSY


https://polskieradio24.pl/artykul/3543223,jakie-sa-wyniki-w-komisjach-w-ktorych-ponownie-przeliczono-glosy-sprawdzilismy

In [273]:
df_vote_recount = load_presidential_data("2025", "2")
df_vote_recount = process_presidential_df(df_vote_recount, "2025", final_cols=["Gmina", "Wojew√≥dztwo", "Siedziba"])

In [284]:
# Example: values from recount
target_nawrocki = 1132
target_trzaskowski = 540

# Find records that match these values exactly
matching_stations = df_vote_recount[
    (df_vote_recount["nawrocki"] == target_nawrocki) &
    (df_vote_recount["trzaskowski"] == target_trzaskowski)
]

print("Matching polling stations after recount:")
print(matching_stations.T)

Matching polling stations after recount:
                                                                11610
valid_votes                                                    1672.0
polling_station_id                                                 95
eligible_voters                                                1980.0
ballots_cast                                                   1684.0
teryt_gmina                                                  126101.0
nawrocki                                                       1132.0
trzaskowski                                                     540.0
postal_code                                                    31-346
Gmina                                                       m. Krak√≥w
Wojew√≥dztwo                                               ma≈Çopolskie
Siedziba            Zesp√≥≈Ç Szkolno-Przedszkolny Nr 14, ul. Stawowa...


In [299]:
# --- Step 1: Create the recount dataset with both old and new values ---
recounts = [
    {"polling_station_id": 95,  "valid_votes": 1672, "old_nawrocki": 1132, "old_trzaskowski": 540,  "new_nawrocki": 540,  "new_trzaskowski": 1132},
    {"polling_station_id": 3,   "valid_votes": 1015, "old_nawrocki": 637,  "old_trzaskowski": 378,  "new_nawrocki": 377,  "new_trzaskowski": 638},
    {"polling_station_id": 13,  "valid_votes": 974,  "old_nawrocki": 611,  "old_trzaskowski": 363,  "new_nawrocki": 364,  "new_trzaskowski": 611},
    {"polling_station_id": 9,   "valid_votes": 639,  "old_nawrocki": 416,  "old_trzaskowski": 223,  "new_nawrocki": 223,  "new_trzaskowski": 416},
    {"polling_station_id": 25,  "valid_votes": 828,  "old_nawrocki": 504,  "old_trzaskowski": 324,  "new_nawrocki": 324,  "new_trzaskowski": 504},
    {"polling_station_id": 17,  "valid_votes": 931,  "old_nawrocki": 585,  "old_trzaskowski": 346,  "new_nawrocki": 344,  "new_trzaskowski": 585},
    {"polling_station_id": 30,  "valid_votes": 959,  "old_nawrocki": 610,  "old_trzaskowski": 349,  "new_nawrocki": 450,  "new_trzaskowski": 509},
    {"polling_station_id": 61,  "valid_votes": 1819, "old_nawrocki": 1048, "old_trzaskowski": 771,  "new_nawrocki": 771,  "new_trzaskowski": 1049},
    {"polling_station_id": 10,  "valid_votes": 330,  "old_nawrocki": 217,  "old_trzaskowski": 113,  "new_nawrocki": 317,  "new_trzaskowski": 363},
    {"polling_station_id": 53,  "valid_votes": 1458, "old_nawrocki": 628,  "old_trzaskowski": 830,  "new_nawrocki": 627,  "new_trzaskowski": 828},
    {"polling_station_id": 35,  "valid_votes": 928,  "old_nawrocki": 581,  "old_trzaskowski": 347,  "new_nawrocki": 347,  "new_trzaskowski": 581},
    {"polling_station_id": 6,   "valid_votes": 706,  "old_nawrocki": 368,  "old_trzaskowski": 338,  "new_nawrocki": 278,  "new_trzaskowski": 428},
    {"polling_station_id": 4,   "valid_votes": 797,  "old_nawrocki": 466,  "old_trzaskowski": 331,  "new_nawrocki": 331,  "new_trzaskowski": 466},
    {"polling_station_id": 4,   "valid_votes": 569,  "old_nawrocki": 209,  "old_trzaskowski": 360,  "new_nawrocki": 360,  "new_trzaskowski": 209},  # Stasz√≥w
    {"polling_station_id": 1,   "valid_votes": 660,  "old_nawrocki": 193,  "old_trzaskowski": 467,  "new_nawrocki": 468,  "new_trzaskowski": 192},  # Magnuszew
    {"polling_station_id": 113, "valid_votes": 1910, "old_nawrocki": 136,  "old_trzaskowski": 1774, "new_nawrocki": 296,  "new_trzaskowski": 1611},
    {"polling_station_id": 20,  "valid_votes": 1225, "old_nawrocki": 543,  "old_trzaskowski": 682,  "new_nawrocki": 542,  "new_trzaskowski": 683},
]

recount_df = pd.DataFrame(recounts)

# --- Step 2: Merge on 4 fields for exact match ---
df_affected_polling_stations = df_vote_recount.merge(
    recount_df,
    how="inner",
    left_on=["polling_station_id", "valid_votes", "nawrocki", "trzaskowski"],
    right_on=["polling_station_id", "valid_votes", "old_nawrocki", "old_trzaskowski"]
)

# --- Step 3: Output ---
# print("‚úÖ Matches with recount corrections:")
df_affected_polling_stations[["teryt_gmina", "polling_station_id", "valid_votes", "nawrocki", "new_nawrocki",
               "trzaskowski", "new_trzaskowski"]].head(17)




Unnamed: 0,teryt_gmina,polling_station_id,valid_votes,nawrocki,new_nawrocki,trzaskowski,new_trzaskowski
0,20701.0,6,706.0,368.0,278,338.0,428
1,41804.0,4,797.0,466.0,331,331.0,466
2,46201.0,25,828.0,504.0,324,324.0,504
3,121611.0,10,330.0,217.0,317,113.0,363
4,126101.0,95,1672.0,1132.0,540,540.0,1132
5,140706.0,1,660.0,193.0,468,467.0,192
6,141201.0,13,974.0,611.0,364,363.0,611
7,146505.0,113,1910.0,136.0,296,1774.0,1611
8,160803.0,3,1015.0,637.0,377,378.0,638
9,161105.0,9,639.0,416.0,223,223.0,416


In [301]:
# Podsumowanie weryfikacji wynik√≥w wybor√≥w 17 komisji

len(df_affected_polling_stations)

17