# RESEARCH: Polling districts geolocation

**Author**: gwiazdan  
**Created**: 07-02-2026  
**Version**: 1.0  

## Overview

This notebook geocodes Polish polling district addresses using the GUGiK (Główny Urząd Geodezji i Kartografii) API. The goal is to obtain coordinates for all polling stations to enable spatial aggregations.

## Methodology

### Data Source
- Dataset: Polish polling districts (`obwody_glosowania_utf8.csv`) from the National Electoral Commission (PKW)
- Contains ~32,000 polling stations across Poland

In [1]:
import pandas as pd
from tqdm import tqdm
import requests

In [2]:
df = pd.read_csv('obwody_glosowania_utf8.csv', sep=';')

In [3]:
print(df.columns)
print(df.shape)

Index(['TERYT gminy', 'Gmina', 'Powiat', 'Województwo', 'Numer', 'Mieszkańcy',
       'Wyborcy', 'Wysłane pakiety wyborcze', 'Siedziba', 'Miejscowość',
       'Ulica', 'Numer posesji', 'Numer lokalu', 'Kod pocztowy', 'Poczta',
       'Typ obwodu', 'Przystosowany dla niepełnosprawnych', 'Typ obszaru',
       'Pełna siedziba', 'Opis granic'],
      dtype='str')
(32143, 20)


In [None]:
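# Drop rows that have no municipality (TERYT) code
df = df[df['TERYT gminy'].notna()]

# Keep only the columns needed for geocoding; .copy() makes df0 an independent
# frame, so the assignments below do not touch df
df0 = df[['TERYT gminy', 'Mieszkańcy', 'Gmina', 'Wyborcy', 'Powiat', 'Województwo', 'Miejscowość', 'Siedziba', 'Ulica', 'Numer posesji', 'Kod pocztowy']].copy()
# Note: the column was parsed as float, so this yields strings such as '20101.0'
df0['TERYT gminy'] = df0['TERYT gminy'].astype('str')
print(df0.head())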

  TERYT gminy  Mieszkańcy           Gmina  Wyborcy         Powiat  \
0     20101.0      2115.0  m. Bolesławiec   1637.0  bolesławiecki   
1     20101.0      1634.0  m. Bolesławiec   1286.0  bolesławiecki   
2     20101.0      1708.0  m. Bolesławiec   1356.0  bolesławiecki   
3     20101.0      1730.0  m. Bolesławiec   1377.0  bolesławiecki   
4     20101.0      1491.0  m. Bolesławiec   1208.0  bolesławiecki   

    Województwo  Miejscowość                Siedziba            Ulica  \
0  dolnośląskie  Bolesławiec  Szkoła Podstawowa Nr 3   ul. Ceramiczna   
1  dolnośląskie  Bolesławiec  Szkoła Podstawowa Nr 3   ul. Ceramiczna   
2  dolnośląskie  Bolesławiec  Szkoła Podstawowa Nr 5  ul. Dolne Młyny   
3  dolnośląskie  Bolesławiec  Szkoła Podstawowa Nr 5  ul. Dolne Młyny   
4  dolnośląskie  Bolesławiec  Szkoła Podstawowa Nr 5  ul. Dolne Młyny   

  Numer posesji Kod pocztowy  
0             5       59-700  
1             5       59-700  
2            60       59-700  
3            60       

In [None]:
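# Endpoint of the GUGiK address geocoding service
url = "https://services.gugik.gov.pl/uug"

# Strip street-type abbreviations so that only the street name is sent to the geocoder
for p in ['ul. ', 'pl. ', 'al. ', 'os. ']:
    df0['Ulica'] = df0['Ulica'].str.removeprefix(p)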

In [6]:
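def find_address(address):
    """Query the geocoder for one address string.

    Returns a dict with "success": True plus x, y on a match,
    otherwise {"success": False}.
    """
    params = {
        "request": "GetAddress",
        "address": address,
    }
    try:
        req = requests.get(url, params=params, timeout=10)
        req.raise_for_status()
        data = req.json()

        if data.get("found objects", 0) > 0 and data.get("results"):
            # Take the first result returned by the service
            result = data["results"]["1"]

            return {
                "success": True,
                "x": float(result.get("x")),
                "y": float(result.get("y")),
                # The query string that matched, not the address reported by the API
                "matched_address": address,
            }
    except Exception:
        # Network errors, invalid JSON and missing fields all count as "not found"
        pass

    return {"success": False}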

### Geocoding Strategy

The geocoding process employs a fallback strategy that tries progressively less specific address formats to maximize the success rate:

1. **Full address** (locality + street + building_number)
2. **Full address with number + a variant** - If the exact building number is not found, try appending 'a' to handle cases where the data may be incomplete
3. **Locality + Street**
4. **Locality only** (for rural areas)
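
The four steps above can also be expressed as an ordered list of candidate queries that are tried until one succeeds. The following is a minimal sketch, not the notebook's actual loop: `candidate_addresses` and `geocode_with_fallback` are hypothetical helper names, and `lookup` stands for any function with the same return shape as `find_address` (a dict with a `"success"` key).

```python
from typing import Callable, Iterator, Optional, Tuple


def candidate_addresses(
    locality: str, street: str = "", number: str = ""
) -> Iterator[Tuple[str, str]]:
    """Yield (level, query) pairs from most to least specific."""
    # Keep only the part before "/" in compound numbers such as "5/7"
    number = number.split("/")[0].strip()
    if street and number:
        yield "full", f"{locality}, {street} {number}"
        yield "full_a", f"{locality}, {street} {number}a"
    if street:
        yield "street", f"{locality}, {street}"
    yield "locality", locality


def geocode_with_fallback(
    locality: str, street: str, number: str, lookup: Callable[[str], dict]
) -> Optional[dict]:
    """Return the first successful lookup, tagged with the level that matched."""
    for level, query in candidate_addresses(locality, street, number):
        result = lookup(query)
        if result.get("success"):
            return {**result, "level": level}
    return None  # no level matched
```

Passing the lookup function as an argument keeps the fallback order testable without network access, for example `geocode_with_fallback("Bolesławiec", "Ceramiczna", "5", find_address)`. Recording the level alongside the coordinates makes it possible to tell building-level matches from locality-level ones later.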

In [7]:
df_sample = df0.copy()

# Replace missing street names and building numbers with empty strings
df_sample['Ulica'] = df_sample['Ulica'].fillna('')
df_sample["Numer posesji"] = df_sample["Numer posesji"].fillna("")

# Temporary key for spotting districts that share the same venue address
df_sample["full_address_temp"] = (
    df_sample["Miejscowość"].astype(str)
    + "_"
    + df_sample["Ulica"].astype(str)
    + "_"
    + df_sample["Numer posesji"].astype(str)
)

# Keep one row per unique address, then drop the helper column
df_sample = df_sample.drop_duplicates(subset=["full_address_temp"], keep="first")
df_sample = df_sample.drop(columns=["full_address_temp"])
df_sample = df_sample.reset_index(drop=True)

# Fixed-seed random sample of 150 addresses for a trial run
df_sample = df_sample.sample(n=150, random_state=42).reset_index(drop=True)

results = []

for idx, row in tqdm(
    df_sample.iterrows(), total=len(df_sample), desc="Searching for coordinates..."
):
    locality = row["Miejscowość"]
    street = row["Ulica"]
    # Missing numbers were replaced with "" above; for compound numbers
    # such as "5/7" keep only the part before the slash
    number = str(row["Numer posesji"]).split("/")[0].strip()

    geocoded = False
    result_data = None

    if street != "" and number != "":
        full_address = f"{locality}, {street} {number}"
        result_data = find_address(full_address)
        if result_data["success"]:
            geocoded = True

    if not geocoded and street != "" and number != "":
        number_with_a = f"{number}a"
        address_with_a = f"{locality}, {street} {number_with_a}"
        result_data = find_address(address_with_a)
        if result_data["success"]:
            geocoded = True

    if not geocoded and street != "":
        street_only = f"{locality}, {street}"
        result_data = find_address(street_only)
        if result_data["success"]:
            geocoded = True

    if not geocoded:
        city_only = f"{locality}"
        result_data = find_address(city_only)
        if result_data["success"]:
            geocoded = True

    if geocoded and result_data:
        results.append(
            {
                "original_address": f"{locality}, {street} {number}".strip(),
                "matched_address": result_data.get("matched_address"),
                "address": result_data.get("matched_address"),
                "x": result_data.get("x"),
                "y": result_data.get("y"),
                "geocoded": True,
            }
        )
    else:
        results.append(
            {
                "original_address": f"{locality}, {street} {number}".strip(),
                "matched_address": None,
                "address": None,
                "x": None,
                "y": None,
                "geocoded": False,
            }
        )

df_results = pd.DataFrame(results)

Searching for coordinates...: 100%|██████████| 150/150 [01:08<00:00,  2.20it/s]


In [8]:
total = len(df_results)
geocoded = df_results['geocoded'].sum()
percentage = (geocoded / total) * 100

print("Geocoding Statistics:")
print(f"Total addresses: {total}")
print(f"Successfully geocoded: {geocoded}")
print(f"Failed: {total - geocoded}")
print(f"Success rate: {percentage:.2f}%")

Geocoding Statistics:
Total addresses: 150
Successfully geocoded: 150
Failed: 0
Success rate: 100.00%


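On this 150-address sample, the fallback chain returned coordinates for every address, although a match at the street or locality level is less precise than a building-level match. The chain handles:
- Urban addresses with full street information
- Rural addresses without street names
- Building-number variants (compound numbers such as `5/7`, or a missing `a` suffix)
- Missing building numbers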
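
To see how many matches came from each fallback level, the `original_address` and `matched_address` columns of `df_results` can be compared. The sketch below is one possible way to do that; `match_level` is a hypothetical helper, the level names are arbitrary, and the `demo` rows are illustrative values rather than real results.

```python
import pandas as pd


def match_level(original, matched):
    """Classify a result row by comparing the query that matched with the original."""
    if matched is None or pd.isna(matched):
        return "failed"
    if matched == original:
        return "exact"            # the address as given was found
    if matched == f"{original}a":
        return "number_variant"   # found only after appending "a"
    if "," not in matched:
        return "locality_only"    # only the locality name matched
    return "street_only"          # locality + street, building number dropped


# Illustrative rows only, shaped like df_results
demo = pd.DataFrame(
    {
        "original_address": [
            "Bolesławiec, Ceramiczna 5",
            "Bolesławiec, Dolne Młyny 60",
            "Kraków, Rynek 1",
            "Wola,",
            "Nowa Wieś, Polna 2",
        ],
        "matched_address": [
            "Bolesławiec, Ceramiczna 5",
            "Bolesławiec, Dolne Młyny 60a",
            "Kraków, Rynek",
            "Wola",
            None,
        ],
    }
)
demo["level"] = [
    match_level(o, m)
    for o, m in zip(demo["original_address"], demo["matched_address"])
]
print(demo["level"].value_counts())
```

Applied to `df_results`, the same call would show what share of the 150 matches is at building level. The comparison assumes locality names contain no comma.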

The next step is to apply this strategy to the full dataset.
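
The trial run took about 68 seconds for 150 addresses, so roughly 32,000 rows at the same pace would take around four hours. Because several districts can share one venue, looking up each distinct address string only once should reduce that. The following is a minimal sketch under those assumptions: `geocode_unique` is a hypothetical helper, `lookup` is any function shaped like `find_address`, and the `delay_s` pause is an assumed courtesy setting, not a documented requirement of the service.

```python
import time
from typing import Callable, Dict, Iterable


def geocode_unique(
    addresses: Iterable[str],
    lookup: Callable[[str], dict],
    delay_s: float = 0.0,
) -> Dict[str, dict]:
    """Look up each distinct address once and return {address: result}."""
    cache: Dict[str, dict] = {}
    for address in addresses:
        if address in cache:
            continue  # already resolved, skip the network call
        cache[address] = lookup(address)
        if delay_s:
            time.sleep(delay_s)  # optional pause between requests
    return cache
```

The returned dictionary can then be joined back onto the full table, for example by mapping a column of query strings through it with `Series.map`.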