# **Data Generation Pipeline**

This notebook executes the full data generation pipeline that builds a structured, speaker-linked dataset from raw plenary session documents of the German Bundestag. It downloads, parses, normalizes, and enriches metadata from various sources including XML, JSON, and official politician registries. The pipeline extracts both speeches and audience contributions, links them to politicians and parties, and outputs clean, well-structured Pickle and Excel datasets. These outputs form the foundation for all subsequent NLP and machine learning tasks in this project.
The pipeline logic and structure are based on the [open-discourse project](https://github.com/open-discourse/open-discourse/tree/main), which we discovered during initial research for the project.


---


## **Resulting File Structure:**

```
team-16/
├── dataGeneration/
│   ├── dataGeneratorPipeline.ipynb  # YOU ARE HERE!
│   ├── paths.py
│   ├── dataGenerator_Clean_Text.py
│   ├── dataGenerator_Extract_Contributions.py
│   ├── dataGenerator_Match_Names.py
│   └── bundestagsapi/
data/
├── dataGeneration/
│   ├── rawData/
│   │   ├── electoralTerms/
│   │   │   └── electoralTerms.csv
│   │   ├── politiciansRawData/
│   │   │   ├── MDB_STAMMDATEN.XML
│   │   │   ├── MDB_STAMMDATEN.DTD
│   │   │   └── mgs.pkl
│   │   ├── rawData19json/
│   │   │   ├── protokoll_<Nr.>.json
│   │   │   └── …
│   │   ├── rawData20json/
│   │   │   ├── protokoll_<Nr.>.json
│   │   │   └── …
│   │   ├── rawData19pdf/
│   │   ├── rawData20pdf/
│   │   ├── rawData19xml/
│   │   │   ├── 19001.xml
│   │   │   └── …
│   │   ├── rawData20xml/
│   │   │   ├── 20001.xml
│   │   │   └── …
│   │   │
│   ├── dataStage02/
│   │   ├── data19xmlSplit/
│   │   │   ├── 19001/
│   │   │   │   ├── appendix.xml
│   │   │   │   ├── meta_data.xml
│   │   │   │   ├── toc.xml
│   │   │   │   └── session_content.xml
│   │   │   ├── 19002/ …
│   │   │   └── …
│   │   ├── data20xmlSplit/
│   │   │   ├── 20001/
│   │   │   │   ├── appendix.xml
│   │   │   │   ├── meta_data.xml
│   │   │   │   ├── toc.xml
│   │   │   │   └── session_content.xml
│   │   │   ├── 20002/ …
│   │   │   └── …
│   │   ├── dataFactionsStage02/
│   │   │   └── factions.pkl
│   │   ├── dataPoliticiansStage02/
│   │   │   └── mps.pkl
│   │   │
│   ├── dataStage03/
│   │   ├── dataFactionsStage03/
│   │   │   └── factionsAbbreviations.pkl
│   │   ├── dataPoliticiansStage03/
│   │   │   ├── mpsFactions.pkl
│   │   │   ├── politicians.csv
│   │   │   └── speaker_faction_lookup.csv
│   │   │
│   ├── dataStage04/
│   │   ├── contributionsExtended/
│   │   │   ├── electoral_term_19/
│   │   │   │   ├── 19001.pkl
│   │   │   │   └── …
│   │   │   ├── electoral_term_20/
│   │   │   │   ├── 20001.pkl
│   │   │   │   └── …
│   │   ├── contributionsSimplified/
│   │   │   ├── contributions_simplified_19.pkl
│   │   │   ├── contributions_simplified_20.pkl
│   │   │   └── contributions_simplified_19_20.pkl
│   │   ├── speechContent/
│   │   │   ├── electoral_term_19/
│   │   │   │   └── speech_content.pkl
│   │   │   ├── electoral_term_20/
│   │   │   │   └── speech_content.pkl
│   │   │
│   ├── dataStage05/
│   │   ├── contributionsExtendedStage05/
│   │   │   ├── electoral_term_19/
│   │   │   │   ├── 19001.pkl
│   │   │   │   └── …
│   │   │   ├── electoral_term_20/
│   │   │   │   ├── 20001.pkl
│   │   │   │   └── …
│   │   │
│   ├── dataStage06/
│   │   ├── contributionsExtendedStage06/
│   │   │   ├── electoral_term_19/
│   │   │   │   ├── 19001.pkl
│   │   │   │   └── …
│   │   │   ├── electoral_term_20/
│   │   │   │   ├── 20001.pkl
│   │   │   │   └── …
│   │
├── dataFinalStage/
│   ├── contributionsExtendedFinalStage/
│   │   ├── contributions_extended_19_20.pkl
│   │   ├── contributions_extended_19.pkl
│   │   └── contributions_extended_20.pkl
│   ├── contributionsSimplifiedFinalStage/
│   │   ├── contributions_simplified_19.pkl
│   │   ├── contributions_simplified_20.pkl
│   │   └── contributions_simplified_19_20.pkl
│   ├── speechContentFinalStage/
│   │   ├── speech_content_19_20.pkl
│   │   ├── speech_content_19.pkl
│   │   └── speech_content_20.pkl
│   └── factionsAbbreviations.pkl
├── dataExcel/
│   ├── finalStage/
│   │   ├── contributions_extended_19_20_finalStage.xlsx
│   │   ├── contributions_extended_19_finalStage.xlsx
│   │   ├── contributions_extended_20_finalStage.xlsx
│   │   ├── contributions_simplified_19_20_finalStage.xlsx
│   │   ├── contributions_simplified_19_finalStage.xlsx
│   │   ├── contributions_simplified_20_finalStage.xlsx
│   │   ├── speech_content_19_20_finalStage.xlsx
│   │   ├── speech_content_19_finalStage.xlsx
│   │   ├── speech_content_20_finalStage.xlsx
│   │   └── factionsAbbreviations.xlsx
│   ├── dataGeneration/
│   │   ├── mgs_wiki_rawData.xlsx
│   │   ├── mps_stage02.xlsx
│   │   ├── mpsFactions_stage03.xlsx
│   │   ├── politicians_stage03.xlsx
│   │   ├── speech_content_19_stage04.xlsx
│   │   ├── speech_content_20_stage04.xlsx
│   │   └── factions_stage02.xlsx
```


---


# **0 Data Generation Pipeline Setup**

## **0.1 Setup Environment for Project**

In [None]:
# create a python environment 'nlp_project_environment' that already has all necessary packages installed with miniconda:
!conda env create -f ../nlp_project_env_setup.yml
# alternatively, run "conda env create -f nlp_project_env_setup_mac.yml" in terminal

In [None]:
# install all necessary packages into the selected Python interpreter:
!pip install -r ../requirements.txt

## **0.2 Imports for Pipeline**

In [3]:
# imports
import os
import json
import glob
from pathlib import Path
import xml.etree.ElementTree as et
import regex
import requests
import pandas as pd
import numpy as np
from tqdm import tqdm
from bs4 import BeautifulSoup
from datetime import datetime
import time
import zipfile
import io

# helper functions and constants
from dataGeneration.extract_contributions import extract
from dataGeneration.clean_text import clean_name_headers
from dataGeneration.match_names import insert_politician_id_into_contributions_extended
import paths as PATHS

## **0.3 Read Json Files from API into Python Process**

**The following structure must exist in the data base directory for the pipeline to work:**
```
data/
├── dataGeneration/
│   ├── rawData/
│   │   ├── rawData19json/
│   │   │   ├── protokoll_<Nr.>.json
│   │   │   └── …
│   │   ├── rawData20json/
│   │   │   ├── protokoll_<Nr.>.json
│   │   │   └── …
```
It can be generated via [LoadProtocols.java](../bundestagsapi/src/main/java/LoadProtocols.java) in the bundestagsapi.
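
Before running the cells below, it can help to confirm that the JSON files are actually in place. The following is a minimal sketch, not part of the pipeline: the helper name `count_protocol_json` and its `base_dir` argument are hypothetical, and the notebook itself resolves these locations through the constants in `paths.py`.

```python
from pathlib import Path


def count_protocol_json(base_dir, terms=(19, 20)):
    """Count protokoll_*.json files per electoral term below base_dir.

    Hypothetical helper: it assumes the layout shown above, i.e.
    <base_dir>/dataGeneration/rawData/rawData<term>json/.
    """
    raw_dir = Path(base_dir) / "dataGeneration" / "rawData"
    counts = {}
    for term in terms:
        json_dir = raw_dir / f"rawData{term}json"
        # A missing directory is reported as zero files instead of raising.
        if json_dir.is_dir():
            counts[term] = len(list(json_dir.glob("protokoll_*.json")))
        else:
            counts[term] = 0
    return counts
```

A term that maps to `0` means the directory is missing or empty, in which case the loading cell below would fail when it accesses the first protocol.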

In [3]:
# Read all JSON-Files from Data20 and Data19
files20 = glob.glob(str(PATHS.RAW_JSON_20 / "*.json"))
protokolle20 = []
files19 = glob.glob(str(PATHS.RAW_JSON_19 / "*.json"))
protokolle19 = []

for file in files20:
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
        protokolle20.append(data)
for file in files19:
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
        protokolle19.append(data)

print(f"{len(protokolle20)} Dateien der 20. WP geladen.")
first20 = protokolle20[0]
print(first20.keys())

print(f"{len(protokolle19)} Dateien der 19. WP geladen.")
first19 = protokolle19[0]
print(first19.keys())

258 Dateien der 20. WP geladen.
dict_keys(['id', 'dokumentart', 'typ', 'dokumentnummer', 'wahlperiode', 'herausgeber', 'datum', 'aktualisiert', 'titel', 'fundstelle', 'pdf_hash', 'vorgangsbezug', 'vorgangsbezug_anzahl', 'text'])
291 Dateien der 19. WP geladen.
dict_keys(['id', 'dokumentart', 'typ', 'dokumentnummer', 'wahlperiode', 'herausgeber', 'datum', 'aktualisiert', 'titel', 'fundstelle', 'pdf_hash', 'vorgangsbezug', 'vorgangsbezug_anzahl', 'text'])


# **1 Download raw data**

## **1.1 Download Plenary Protocols (.pdf & .xml) for Electoral Terms 19 and 20**
- Fetches the PDF and XML files of all plenary protocols for the 19th and 20th electoral periods.
- The files are retrieved by reading the corresponding .json files (previously downloaded via the bundestagsapi) and following the `pdf_url` and `xml_url` links stored under the `fundstelle` key of each file.
- Downloaded files are stored separately by format and electoral period.
- This module ensures that all raw protocol documents are available locally in structured form. These files serve as the primary source for downstream parsing and processing steps.

### **Input:**
```
rawData/
├── rawData19json/
│   ├── protokoll_<Nr.>.json
│   └── …
├── rawData20json/
│   ├── protokoll_<Nr.>.json
│   └── …
```

### **Output:**
```
rawData/
├── rawData19pdf/
├── rawData20pdf/
├── rawData19xml/
│   ├── 19001.xml
│   └── …
├── rawData20xml/
│   ├── 20001.xml
│   └── …
```

In [4]:
def download_documents(json_dir, target_dir, url_key, label):
    """
    Downloads documents (PDF or XML) referenced in JSON files and saves them to the target directory.

    :param json_dir (Path): Path to the directory containing the .json files.
    :param target_dir (Path): Path to the directory where the downloaded files should be stored.
    :param url_key (str): The key in the JSON structure pointing to the desired URL (e.g., "pdf_url", "xml_url").
    :param label (str): A label to show in tqdm progress bar.
    """
    json_files = glob.glob(os.path.join(json_dir, "*.json"))
    os.makedirs(target_dir, exist_ok=True)
    for file in tqdm(json_files, desc=f"Downloading {label}"):
        with open(file, "r", encoding="utf-8") as f:
            data = json.load(f)
            file_url = data.get("fundstelle", {}).get(url_key)
            if file_url:
                filename = os.path.basename(file_url).split("#")[0]
                file_path = os.path.join(target_dir, filename)
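                if not os.path.exists(file_path):
                    r = requests.get(file_url)
                    # Only write successful responses so that an HTTP error page is never saved as a document.
                    if r.ok:
                        with open(file_path, "wb") as out:
                            out.write(r.content)
                    else:
                        tqdm.write(f"Skipping {file_url}: HTTP {r.status_code}")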

# Downloads using centralized PATHS
download_documents(PATHS.RAW_JSON_19, PATHS.RAW_PDF_19, "pdf_url", "PDFs for 19th term")
download_documents(PATHS.RAW_JSON_19, PATHS.RAW_XML_19, "xml_url", "XMLs for 19th term")
download_documents(PATHS.RAW_JSON_20, PATHS.RAW_PDF_20, "pdf_url", "PDFs for 20th term")
download_documents(PATHS.RAW_JSON_20, PATHS.RAW_XML_20, "xml_url", "XMLs for 20th term")

Downloading PDFs for 19th term: 100%|██████████| 291/291 [01:18<00:00,  3.71it/s]
Downloading XMLs for 19th term: 100%|██████████| 291/291 [01:11<00:00,  4.09it/s]
Downloading PDFs for 20th term: 100%|██████████| 258/258 [01:26<00:00,  3.00it/s]
Downloading XMLs for 20th term: 100%|██████████| 258/258 [01:16<00:00,  3.39it/s]
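
The lookup inside `download_documents` can be tried without any network access. The sketch below isolates that step: it reads the URL from the `fundstelle` entry and derives the local file name by dropping any `#fragment`. The function name `resolve_download` and the example record (including its `example.org` URLs) are made up for illustration; only the `fundstelle`, `pdf_url` and `xml_url` keys come from the notebook.

```python
import os


def resolve_download(record, url_key):
    """Return (url, filename) for a protocol record, or None if the URL is absent."""
    url = record.get("fundstelle", {}).get(url_key)
    if not url:
        return None
    # Same rule as in download_documents: last path component, without "#..." suffix.
    filename = os.path.basename(url).split("#")[0]
    return url, filename


# Made-up record in the shape of the API JSON; the URLs are placeholders.
sample_record = {
    "fundstelle": {
        "pdf_url": "https://example.org/docs/19001.pdf#P.1",
        "xml_url": "https://example.org/docs/19001.xml",
    }
}

pdf_target = resolve_download(sample_record, "pdf_url")
xml_target = resolve_download(sample_record, "xml_url")
```

Records without the requested key yield `None`, which corresponds to the files that `download_documents` silently skips.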


## **1.2 Download Metadata of All Members of the Bundestag (MdB)**
Downloads and extracts a ZIP archive containing structured information (as an XML file) on all members of the German Bundestag from the 1st to the 20th electoral term. The archive is provided by the Bundestag at the following URL:
    https://www.bundestag.de/resource/blob/472878/7d4d417dbb7f7bd44508b3dc5de08ae2/MdB-Stammdaten-data.zip

### **Input:**
```
None.
```

### **Output:**
```
rawData/
├── politiciansRawData/
│   ├── MDB_STAMMDATEN.XML
│   └── MDB_STAMMDATEN.DTD
```

In [5]:
# output directory
RAW_XML = PATHS.RAW_POLITICIANS
RAW_XML.mkdir(parents=True, exist_ok=True)
#Download MDB Stammdaten.
mp_base_data_link = "https://www.bundestag.de/resource/blob/472878/7d4d417dbb7f7bd44508b3dc5de08ae2/MdB-Stammdaten-data.zip"  # noqa: E501

print("Download & unzip 'MP_BASE_DATA'...", end="", flush=True)

try:
    r = requests.get(mp_base_data_link)
    r.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(r.content)) as z:
        z.extractall(RAW_XML)
    print("Done.")
except requests.exceptions.RequestException as e:
    print("\n❌ Download failed:", e)
except zipfile.BadZipFile:
    print("\n❌ The downloaded file is not a valid ZIP archive.")


Download & unzip 'MP_BASE_DATA'...Done.


## **1.3 Split Plenary Protocol XML Files into Structural Components**

This step processes the raw plenary protocol XML files and splits each one into four separate XML documents:
- toc.xml: Table of contents (XML-tag: vorspann)
- session_content.xml: Speech content (XML-tag: sitzungsverlauf)
- appendix.xml: Appendices such as voting results or exhibits (XML-tag: anlagen)
- meta_data.xml: Metadata on speakers (XML-tag: rednerliste)
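
The split described above can be illustrated on an in-memory document. This is a minimal sketch under stated assumptions: the root tag `protokoll` and the sample content are invented, and `split_protocol` is a hypothetical helper. Unlike the cell below, it skips sections that are missing instead of relying on the surrounding `try`/`except`.

```python
import xml.etree.ElementTree as et

# Output file name -> XML tag, taken from the list above.
SECTION_TAGS = {
    "toc.xml": "vorspann",
    "session_content.xml": "sitzungsverlauf",
    "appendix.xml": "anlagen",
    "meta_data.xml": "rednerliste",
}


def split_protocol(xml_text):
    """Return {file_name: serialized section} for every section present."""
    root = et.fromstring(xml_text)
    parts = {}
    for file_name, tag in SECTION_TAGS.items():
        section = root.find(tag)  # direct child of the root, or None
        if section is not None:
            parts[file_name] = et.tostring(section, encoding="unicode")
    return parts


# Invented sample with only two of the four sections.
sample_xml = (
    "<protokoll>"
    "<vorspann>TOC</vorspann>"
    "<sitzungsverlauf><rede>Text</rede></sitzungsverlauf>"
    "</protokoll>"
)
parts = split_protocol(sample_xml)
```

Here `parts` contains entries for `toc.xml` and `session_content.xml` only, because the sample has no `anlagen` or `rednerliste` element.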

### **Input:**
```
rawData/
├── rawData19xml/
│   ├── 19001.xml
│   └── …
├── rawData20xml/
│   ├── 20001.xml
│   └── …
```

### **Output:**
```
dataStage02/
├── data19xmlSplit/
│   ├── 19001/
│   │   ├── appendix.xml
│   │   ├── meta_data.xml
│   │   ├── toc.xml
│   │   └── session_content.xml
│   ├── 19002/ …
│   └── …
├── data20xmlSplit/
│   ├── 20001/
│   │   ├── appendix.xml
│   │   ├── meta_data.xml
│   │   ├── toc.xml
│   │   └── session_content.xml
│   ├── 20002/ …
│   └── …
```

In [6]:
# Input directories for the 19th and 20th electoral terms
input_dirs = {
    19: PATHS.RAW_XML_19,
    20: PATHS.RAW_XML_20,
}

# Output directories for the 19th and 20th electoral terms
output_dirs = {
    19: PATHS.XML_SPLIT_19,
    20: PATHS.XML_SPLIT_20,
}

# Pass through every electoral period
for term_number in [19, 20]:
    input_dir = input_dirs[term_number]
    output_dir = output_dirs[term_number]

    for xml_file_path in tqdm(sorted(input_dir.glob("*.xml")), desc=f"Parsing term {term_number}..."):
        try:
            # read data
            tree = et.parse(xml_file_path)
            root = tree.getroot()
            toc = et.ElementTree(root.find("vorspann"))
            session_content = et.ElementTree(root.find("sitzungsverlauf"))
            appendix = et.ElementTree(root.find("anlagen"))
            meta_data = et.ElementTree(root.find("rednerliste"))

            # using document-numbers to make folder structure
            doc_number = regex.search(r"\d+", xml_file_path.stem).group()
            save_path = output_dir / doc_number
            save_path.mkdir(parents=True, exist_ok=True)


            # save to xmls
            toc.write(save_path / "toc.xml", encoding="UTF-8", xml_declaration=True)
            session_content.write(
                save_path / "session_content.xml",
                encoding="UTF-8",
                xml_declaration=True,
            )
            appendix.write(
                save_path / "appendix.xml",
                encoding="UTF-8",
                xml_declaration=True,
            )
            meta_data.write(
                save_path / "meta_data.xml",
                encoding="UTF-8",
                xml_declaration=True,
            )

        except Exception as e:
            print(f"Error parsing {xml_file_path}: {e}")

Parsing term 19...: 100%|██████████| 239/239 [00:04<00:00, 51.04it/s]
Parsing term 20...: 100%|██████████| 214/214 [00:04<00:00, 50.46it/s]


## **1.4 Extract MP Metadata from MP-Base-Data**

Extracts biographical and institutional metadata for all members of the Bundestag from `MDB_STAMMDATEN.XML`.

- **Each MP may get multiple entries depending on:**
    - Name changes over time
    - Multiple institutional affiliations (e.g., Bundestag and government office)
    - Participation across multiple electoral terms
- **The extracted data includes:**
    - Biographical information: name, gender, profession, birth/death details
    - Metadata: academic titles, aristocratic prefixes
    - Electoral data: constituency, institution type (e.g. “Regierungsmitglied”)


### **Input:**
```
rawData/
├── politiciansRawData/
│   └── MDB_STAMMDATEN.XML
```

### **Output:**
```
dataStage02/
├── dataPoliticiansStage02/
│   └── mps.pkl
dataExcel/
└── mps_stage02.xlsx
```


**Columns (mps.pkl):**
| Column name       | Description                                                   |
|------------------|---------------------------------------------------------------|
| `ui`              | Unique politician identifier                                  |
| `electoral_term`  | Electoral term in which the entry applies                     |
| `first_name`      | First name(s) of the MP                                        |
| `last_name`       | Last name of the MP                                           |
| `birth_place`     | Place of birth                                                |
| `birth_country`   | Country of birth                                              |
| `birth_date`      | Date of birth                                                 |
| `death_date`      | Date of death (or -1 if not applicable)                       |
| `gender`          | Gender                                                        |
| `profession`      | Profession                                                    |
| `constituency`    | Electoral district (constituency)                             |
| `aristocracy`     | Aristocratic title (e.g., Freiherr)                           |
| `academic_title`  | Academic title (e.g., Dr., Prof.)                             |
| `institution_type`| Type of institution (e.g., Fraktion/Gruppe, Regierungsmitglied)|
| `institution_name`| Full name of the institution affiliation                      |
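
To see why one MP can produce several rows, the sketch below runs the same nested iteration (names × electoral terms × institutions) on a single made-up record. The element names are the ones read by the extraction cell; the person, the values and the reduced set of columns are invented for illustration.

```python
import xml.etree.ElementTree as et

# Invented record: two name entries, one institution in term 19, two in term 20.
SAMPLE_MDB = """
<MDB>
  <ID>1</ID>
  <NAMEN>
    <NAME><VORNAME>Erika</VORNAME><NACHNAME>Muster</NACHNAME></NAME>
    <NAME><VORNAME>Erika</VORNAME><NACHNAME>Beispiel</NACHNAME></NAME>
  </NAMEN>
  <WAHLPERIODEN>
    <WAHLPERIODE>
      <WP>19</WP>
      <INSTITUTIONEN>
        <INSTITUTION><INSART_LANG>Fraktion/Gruppe</INSART_LANG><INS_LANG>Fraktion A</INS_LANG></INSTITUTION>
      </INSTITUTIONEN>
    </WAHLPERIODE>
    <WAHLPERIODE>
      <WP>20</WP>
      <INSTITUTIONEN>
        <INSTITUTION><INSART_LANG>Fraktion/Gruppe</INSART_LANG><INS_LANG>Fraktion A</INS_LANG></INSTITUTION>
        <INSTITUTION><INSART_LANG>Regierungsmitglied</INSART_LANG><INS_LANG>Ministerium B</INS_LANG></INSTITUTION>
      </INSTITUTIONEN>
    </WAHLPERIODE>
  </WAHLPERIODEN>
</MDB>
""".strip()

mdb = et.fromstring(SAMPLE_MDB)

# One row per (name entry, electoral term, institution) combination.
rows = [
    (mdb.findtext("ID"), term.findtext("WP"), name.findtext("NACHNAME"), inst.findtext("INS_LANG"))
    for name in mdb.findall("./NAMEN/NAME")
    for term in mdb.findall("./WAHLPERIODEN/WAHLPERIODE")
    for inst in term.findall("./INSTITUTIONEN/INSTITUTION")
]
```

With two name entries and three institution entries across both terms, `rows` holds six tuples for a single `ui`. Downstream steps therefore cannot assume that `ui` is unique in `mps.pkl`.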

In [7]:
# Input path for raw XML data (file name as extracted from the ZIP archive; case matters on Linux)
MP_BASE_DATA = PATHS.RAW_POLITICIANS / "MDB_STAMMDATEN.XML"

# Output directory for Stage 02
POLITICIANS_STAGE_01 = PATHS.STAGE02 / "dataPoliticiansStage02"
POLITICIANS_STAGE_01.mkdir(parents=True, exist_ok=True)
save_path = PATHS.POLITICIANS_STAGE02  # = mps.pkl

print("Process mps...", end="", flush=True)

# read data
tree = et.parse(MP_BASE_DATA)
root = tree.getroot()

# placeholder for final dataframe
mps = {
    "ui": [],
    "electoral_term": [],
    "first_name": [],
    "last_name": [],
    "birth_place": [],
    "birth_country": [],
    "birth_date": [],
    "death_date": [],
    "gender": [],
    "profession": [],
    "constituency": [],
    "aristocracy": [],
    "academic_title": [],
    "institution_type": [],
    "institution_name": [],
}

last_names_to_revisit = []
i = 0

# Iterate over all MDBs (Mitglieder des Bundestages) in XML File.
for mdb in tqdm(tree.iter("MDB"), desc="Verarbeite MdBs"):
    ui = mdb.findtext("ID")

    # These entries exist only once for every politician.
    if mdb.findtext("BIOGRAFISCHE_ANGABEN/GEBURTSDATUM") == "":
        raise ValueError("Politician has to be born at some point.")
    else:
        birth_date = str(mdb.findtext("BIOGRAFISCHE_ANGABEN/GEBURTSDATUM"))

    birth_place = mdb.findtext("BIOGRAFISCHE_ANGABEN/GEBURTSORT")
    birth_country = mdb.findtext("BIOGRAFISCHE_ANGABEN/GEBURTSLAND")
    if birth_country == "":
        birth_country = "Deutschland"

    if mdb.findtext("BIOGRAFISCHE_ANGABEN/STERBEDATUM") == "":
        death_date = -1
    else:
        death_date = str(mdb.findtext("BIOGRAFISCHE_ANGABEN/STERBEDATUM"))

    gender = mdb.findtext("BIOGRAFISCHE_ANGABEN/GESCHLECHT")
    profession = mdb.findtext("BIOGRAFISCHE_ANGABEN/BERUF")

    # Iterate over all name entries for the politician ID, e.g. necessary if the
    # name has changed due to marriage or the gain/loss of titles like "Dr.",
    # or if the location suffix changed in another period
    # ("" -> "Bremerhaven").
    for name in mdb.findall("./NAMEN/NAME"):
        first_name = name.findtext("VORNAME")
        last_name = name.findtext("NACHNAME")
        constituency = name.findtext("ORTSZUSATZ")
        aristocracy = name.findtext("ADEL")
        academic_title = name.findtext("AKAD_TITEL")

        # Hardcode Schmidt (Weilburg). Note: This makes 4 entries for
        # Frank Schmidt!!
        if regex.search(r"\(Weilburg\)", last_name):
            last_name = last_name.replace(" (Weilburg)", "")
            constituency = "(Weilburg)"

        # Iterate over the electoral terms in which the politician was a member
        # of the Bundestag.
        for electoral_term in mdb.findall("./WAHLPERIODEN/WAHLPERIODE"):
            electoral_term_number = electoral_term.findtext("WP")

            # Iterate over faction membership in each parliament period, e.g.
            # multiple entries exist if faction was changed within period.
            for institution in electoral_term.findall("./INSTITUTIONEN/INSTITUTION"):
                institution_name = institution.findtext("INS_LANG")
                institution_type = institution.findtext("INSART_LANG")

                mps["ui"].append(ui)
                mps["electoral_term"].append(electoral_term_number)
                mps["first_name"].append(first_name)
                mps["last_name"].append(last_name)
                mps["birth_place"].append(birth_place)
                mps["birth_country"].append(birth_country)
                mps["birth_date"].append(birth_date)
                mps["death_date"].append(death_date)
                mps["gender"].append(gender)
                mps["profession"].append(profession)
                mps["constituency"].append(constituency)
                mps["aristocracy"].append(aristocracy)
                mps["academic_title"].append(academic_title)

                mps["institution_type"].append(institution_type)
                mps["institution_name"].append(institution_name)

# Postprocessing
mps = pd.DataFrame(mps)
mps["constituency"] = mps["constituency"].str.replace("[)(]", "", regex=True)
mps = mps.astype(dtype={"ui": "int64", "birth_date": "str", "death_date": "str"})

# Save as Pickle
mps.to_pickle(save_path)
print("Done.")

# Save as Excel
PATHS.DATA_EXCEL_DIR.mkdir(parents=True, exist_ok=True)
mps.to_excel(PATHS.EXCEL_MPS_STAGE02, index=False)

Process mps...

Verarbeite MdBs: 4609it [00:00, 39307.90it/s]


Done.


## **1.5 Generate Electoral Terms Reference Table**

Creates a CSV reference table listing all electoral terms from 1949 (1st term) to the current term (20th) and assigns each term:
- a unique ID (1-based),
- a start and end date (in seconds since the Unix epoch).

The table can later be used to match speech or politician timestamps to the correct legislative period.
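
One way such a lookup might work is sketched below. It is only an illustration: `find_term` is a hypothetical helper, the two rows are copied from the term list in this section and carry their actual term numbers as `id`, and treating both boundaries as inclusive is an assumption rather than something the pipeline prescribes.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def to_seconds(date_string):
    """Convert YYYY-MM-DD to seconds since the Unix epoch, as in the reference table."""
    return (datetime.strptime(date_string, "%Y-%m-%d") - EPOCH).total_seconds()


# Two rows in the shape of electoralTerms.csv (start_date, end_date, id).
terms = [
    {"start_date": to_seconds("2017-10-24"), "end_date": to_seconds("2021-10-26"), "id": 19},
    {"start_date": to_seconds("2021-10-27"), "end_date": to_seconds("2025-10-29"), "id": 20},
]


def find_term(date_string, terms):
    """Return the id of the term whose interval contains the date, else None."""
    timestamp = to_seconds(date_string)
    for term in terms:
        # Assumption: start and end dates are both inclusive.
        if term["start_date"] <= timestamp <= term["end_date"]:
            return term["id"]
    return None


term_of_speech = find_term("2019-05-01", terms)
```

Because consecutive terms in the table start one day after the previous term ends, inclusive boundaries never assign a date to two terms.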


### **Input:**
```
None.
```

### **Output:**
```
rawData/
├── electoralTerms/
│   └── electoralTerms.csv
```


**Columns (electoralTerms.csv):**
| Column name | Description                              |
|-------------|------------------------------------------|
| `start_date`| Start of the electoral term (in seconds) |
| `end_date`  | End of the electoral term (in seconds)   |
| `id`        | Unique ID for the electoral term         |

In [8]:
# Output directory from centralized PATHS
ELECTORAL_TERMS_DIR = PATHS.RAW_ELECTORAL_TERMS.parent
ELECTORAL_TERMS_DIR.mkdir(parents=True, exist_ok=True)

electoral_terms = [
    { "start_date": "1949-09-07", "end_date": "1953-10-05" },
    { "start_date": "1953-10-06", "end_date": "1957-10-14" },
    { "start_date": "1957-10-15", "end_date": "1961-10-16" },
    { "start_date": "1961-10-17", "end_date": "1965-10-18" },
    { "start_date": "1965-10-19", "end_date": "1969-10-19" },
    { "start_date": "1969-10-20", "end_date": "1972-12-12" },
    { "start_date": "1972-12-13", "end_date": "1976-12-13" },
    { "start_date": "1976-12-14", "end_date": "1980-11-03" },
    { "start_date": "1980-11-04", "end_date": "1983-03-28" },
    { "start_date": "1983-03-29", "end_date": "1987-02-17" },
    { "start_date": "1987-02-18", "end_date": "1990-12-19" },
    { "start_date": "1990-12-20", "end_date": "1994-11-09" },
    { "start_date": "1994-11-10", "end_date": "1998-10-25" },
    { "start_date": "1998-10-26", "end_date": "2002-10-16" },
    { "start_date": "2002-10-17", "end_date": "2005-10-17" },
    { "start_date": "2005-10-18", "end_date": "2009-10-26" },
    { "start_date": "2009-10-27", "end_date": "2013-10-21" },
    { "start_date": "2013-10-22", "end_date": "2017-10-23" },
    { "start_date": "2017-10-24", "end_date": "2021-10-26" },
    { "start_date": "2021-10-27", "end_date": "2025-10-29" },
]

def string_to_seconds(date_string, ref_date = datetime(year=1970, month=1, day=1)):
    date = datetime.strptime(date_string, "%Y-%m-%d")
    return (date - ref_date).total_seconds()

# convert dates to total seconds and add 1-based id to each term
electoral_terms = [
    {key: string_to_seconds(date_string) for key, date_string in term.items()} | {"id": idx + 1}
    for idx, term in enumerate(electoral_terms)
]

# Save to CSV
pd.DataFrame(electoral_terms).to_csv(PATHS.RAW_ELECTORAL_TERMS, index=False)
print(f"Saved to {PATHS.RAW_ELECTORAL_TERMS} — Done.")

 saved to rawData/electoralTerms/electoralTerms.csv Done.
