# Municipal Population Dataset (1996–2024): Cleaning and Standardization
## *Preprocessing workflow for INE municipal time-series data* 🧹

### 🗂️ Import and read the files
INE CSV downloads use ";" as the field separator, "." as the thousands separator, and "," as the decimal separator. The read options below account for this layout, and the same settings generally work for files downloaded from other Spanish public institutions, which tend to follow the same format.

In [1]:
# Import Pandas
import pandas as pd

# File path
fp_padron_historico = "00_raw_padron_1996_2024.csv"

# Read / import file
padron_hist = pd.read_csv(
    fp_padron_historico,
    # INE downloads: CSV uses ";" as separator
    sep=";",
    # The INE uses "." as the thousands separator, so we define it here
    # to allow pandas to automatically parse numeric columns correctly.
    thousands=".",
    # "," is used as the decimal separator
    decimal=",",
    # Set the data type of each column in Python
    dtype={
        "Municipios": str,
        "Sexo": str,
        "Periodo": int,
        "Total": float
    },
    # low_memory=False ensures pandas reads whole columns to infer types reliably
    low_memory=False,
    # encoding="latin1" because INE CSV files typically use this encoding
    # and this avoids issues with accents/ñ during import
    encoding="latin1"
)
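
To illustrate what the `sep`, `thousands`, and `decimal` options do, here is a minimal sketch that parses a two-row in-memory sample instead of the real file. The sample text is an assumption about how the raw values are written (for example `1.064` for the 1064.0 shown for Zurgena further down); it is not copied from the INE download.

```python
import io
import pandas as pd

# Hypothetical raw text in the INE layout: ";" between fields,
# "." as thousands separator (assumed formatting of the raw values)
raw_sample = (
    "Municipios;Sexo;Periodo;Total\n"
    "44001 Ababuj;Total;2024;74\n"
    "04103 Zurgena;Mujeres;2000;1.064\n"
)

sample = pd.read_csv(
    io.StringIO(raw_sample),
    sep=";",
    thousands=".",
    decimal=",",
    dtype={"Municipios": str},
)

# "1.064" is read as one thousand sixty-four, not as the decimal 1.064
print(sample["Total"].tolist())
```

Without `thousands="."`, the same text would be interpreted as the decimal number 1.064, which is why the option matters for population counts.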

### Preview

In [2]:
padron_hist.head()

Unnamed: 0,ï»¿Municipios,Sexo,Periodo,Total
0,44001 Ababuj,Total,2024,74.0
1,44001 Ababuj,Total,2023,70.0
2,44001 Ababuj,Total,2022,72.0
3,44001 Ababuj,Total,2021,76.0
4,44001 Ababuj,Total,2020,77.0


In [3]:
padron_hist.tail()

Unnamed: 0,ï»¿Municipios,Sexo,Periodo,Total
708001,04103 Zurgena,Mujeres,2000,1064.0
708002,04103 Zurgena,Mujeres,1999,1068.0
708003,04103 Zurgena,Mujeres,1998,1069.0
708004,04103 Zurgena,Mujeres,1997,
708005,04103 Zurgena,Mujeres,1996,1075.0


### 🔧 Rename Columns

In [4]:
# Create a dictionary with the new names
# (the first header carries a stray "ï»¿" prefix left over from the import)
new_names = {"ï»¿Municipios": "Municipios",
             "Sexo": "Cat",
             "Periodo": "Year",
             "Total": "Pop"}

# Rename the columns
padron_hist = padron_hist.rename(columns=new_names)

# See Result
padron_hist.head()

Unnamed: 0,Municipios,Cat,Year,Pop
0,44001 Ababuj,Total,2024,74.0
1,44001 Ababuj,Total,2023,70.0
2,44001 Ababuj,Total,2022,72.0
3,44001 Ababuj,Total,2021,76.0
4,44001 Ababuj,Total,2020,77.0
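
The odd prefix in front of `Municipios` that is renamed above is what the three bytes of a UTF-8 byte-order mark (BOM) look like when decoded as latin1, which suggests the file may actually be UTF-8 with a BOM. Below is a minimal sketch of how one could check this before choosing an encoding. `guess_csv_encoding` is a hypothetical helper, not part of pandas or of the original workflow.

```python
import codecs
import os
import tempfile

def guess_csv_encoding(path, fallback="latin1"):
    """Return "utf-8-sig" if the file starts with a UTF-8 BOM, else `fallback`."""
    with open(path, "rb") as fh:
        first_bytes = fh.read(len(codecs.BOM_UTF8))
    return "utf-8-sig" if first_bytes == codecs.BOM_UTF8 else fallback

# The same header bytes decoded in two ways
header_bytes = codecs.BOM_UTF8 + "Municipios".encode("utf-8")
as_latin1 = header_bytes.decode("latin1")       # BOM shows up as three stray characters
as_utf8_sig = header_bytes.decode("utf-8-sig")  # BOM is stripped

# Try the helper on a small temporary file that starts with a BOM
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(header_bytes)
    tmp_path = tmp.name
detected = guess_csv_encoding(tmp_path)
os.remove(tmp_path)

print(repr(as_latin1), repr(as_utf8_sig), detected)
```

If the helper returned `"utf-8-sig"` for the raw file, passing that value as `encoding` to `pd.read_csv` would be expected to give a clean `Municipios` header and correctly accented names, making the rename of the first column unnecessary.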


### 🔧 Split one column into two

In [5]:
# .str.split(" ", n=1) → splits only at the first space
# expand=True → returns the pieces as separate columns
padron_hist[["Mun_Code", "Mun"]] = padron_hist["Municipios"].str.split(" ", n=1, expand=True)

# See Result
padron_hist.head()

Unnamed: 0,Municipios,Cat,Year,Pop,Mun_Code,Mun
0,44001 Ababuj,Total,2024,74.0,44001,Ababuj
1,44001 Ababuj,Total,2023,70.0,44001,Ababuj
2,44001 Ababuj,Total,2022,72.0,44001,Ababuj
3,44001 Ababuj,Total,2021,76.0,44001,Ababuj
4,44001 Ababuj,Total,2020,77.0,44001,Ababuj
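
Because `n=1` limits the split to the first space, names made of several words stay in one piece. The sketch below checks this on a small sample Series and also tests that every code consists of exactly five digits; the five-digit pattern is an assumption based on the codes shown in this notebook.

```python
import pandas as pd

labels = pd.Series(["44001 Ababuj", "10903 Alagón del Río", "04103 Zurgena"])

# Split only at the first space: code on the left, full name on the right
parts = labels.str.split(" ", n=1, expand=True)
parts.columns = ["Mun_Code", "Mun"]

# Assumed format check: every code is exactly five digits
codes_ok = parts["Mun_Code"].str.fullmatch(r"\d{5}").all()

print(parts)
print(codes_ok)
```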


### 🔧 Column Rearrangement

In [6]:
# Delete Municipios column
padron_hist = padron_hist.drop(columns=["Municipios"])

# Rearrange column order
padron_hist = padron_hist[["Mun_Code", "Mun", "Cat", "Year", "Pop"]]

# See result
padron_hist.tail()

Unnamed: 0,Mun_Code,Mun,Cat,Year,Pop
708001,04103,Zurgena,Mujeres,2000,1064.0
708002,04103,Zurgena,Mujeres,1999,1068.0
708003,04103,Zurgena,Mujeres,1998,1069.0
708004,04103,Zurgena,Mujeres,1997,
708005,04103,Zurgena,Mujeres,1996,1075.0


### 🚫 Removing the Empty Census Year (1997)
The official INE records contain no population data for 1997, as no municipal register (Padrón) figures were published for that year.
Keeping this empty year would introduce unnecessary NaN values, distort percentage-change calculations for the intervals on either side of it, and clutter the dataset without adding information.

For clarity and analytical consistency, all rows for 1997 are removed from the cleaned dataset.
The raw file is not modified, so the original record is preserved.

In [7]:
padron_hist = padron_hist[padron_hist["Year"] != 1997]

#See result
padron_hist.tail()

Unnamed: 0,Mun_Code,Mun,Cat,Year,Pop
708000,04103,Zurgena,Mujeres,2001,1062.0
708001,04103,Zurgena,Mujeres,2000,1064.0
708002,04103,Zurgena,Mujeres,1999,1068.0
708003,04103,Zurgena,Mujeres,1998,1069.0
708005,04103,Zurgena,Mujeres,1996,1075.0
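
Before dropping a year, it may be worth confirming that it really is empty. Here is a minimal sketch on toy data (the values are illustrative, not INE figures) that lists the years in which every `Pop` value is missing and then filters them out:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Mun_Code": ["00001", "00001", "00001", "00002", "00002", "00002"],
    "Year": [1996, 1997, 1998, 1996, 1997, 1998],
    "Pop": [100.0, np.nan, 102.0, np.nan, np.nan, 55.0],
})

# True for a year only if every row of that year has a missing Pop
all_missing = toy["Pop"].isna().groupby(toy["Year"]).all()
empty_years = all_missing[all_missing].index.tolist()

# Keep only the rows from years that contain at least some data
toy_clean = toy[~toy["Year"].isin(empty_years)]

print(empty_years)
```

In this toy example 1996 also has a missing value, but it is not flagged because another municipality has data for that year.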


### 🔍 Data Format & Assertions

In [8]:
# Check Data types
padron_hist.dtypes

Mun_Code     object
Mun          object
Cat          object
Year          int64
Pop         float64
dtype: object

In [9]:
# Transform selected columns into appropriate data types
padron_hist = padron_hist.astype({
    # keep as string to preserve leading zeros
    "Mun_Code": "string",
    # municipalities names as strings
    "Mun":      "string",
    # category (Total, Hombres, Mujeres)
    "Cat":      "string",   
    # years stored as integers
    "Year":     "int64",    
    # Pop stays as float as defined earlier (float64)
})

# Check Data types
padron_hist.dtypes

Mun_Code    string[python]
Mun         string[python]
Cat         string[python]
Year                 int64
Pop                float64
dtype: object

In [10]:
# Assert that all population values are non-negative
assert (padron_hist["Pop"] >= 0).all(), "Population values must be >= 0"

AssertionError: Population values must be >= 0
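
An assertion like the one above can fail even when no value is negative, because comparing `NaN` with `>= 0` evaluates to `False`. Below is a minimal sketch of a check that sets missing values aside and tests only the observed ones (toy values, not INE figures):

```python
import numpy as np
import pandas as pd

pop = pd.Series([74.0, np.nan, 1064.0])

# NaN >= 0 is False, so the plain check fails on missing values
plain_check = bool((pop >= 0).all())

# Drop missing values first so only observed populations are tested
observed = pop.dropna()
assert (observed >= 0).all(), "Population values must be >= 0"

print(plain_check)
```

Separating the two concerns this way keeps the sign check meaningful, while missing values are inspected on their own in the next cell.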

In [11]:
# Check rows with NaN
padron_hist[padron_hist["Pop"].isna()]

Unnamed: 0,Mun_Code,Mun,Cat,Year,Pop
10716,10903,AlagÃ³n del RÃ­o,Total,2009,
10717,10903,AlagÃ³n del RÃ­o,Total,2008,
10718,10903,AlagÃ³n del RÃ­o,Total,2007,
10719,10903,AlagÃ³n del RÃ­o,Total,2006,
10720,10903,AlagÃ³n del RÃ­o,Total,2005,
...,...,...,...,...,...
707826,48916,Usansolo,Mujeres,2001,
707827,48916,Usansolo,Mujeres,2000,
707828,48916,Usansolo,Mujeres,1999,
707829,48916,Usansolo,Mujeres,1998,
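
To see how these gaps are distributed, the missing rows can be counted per year and per municipality. Here is a sketch on toy data with the same column names (the values are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Mun_Code": ["00001", "00001", "00002", "00002", "00003", "00003"],
    "Year": [2008, 2009, 2008, 2009, 2008, 2009],
    "Pop": [np.nan, np.nan, 500.0, 510.0, np.nan, 80.0],
})

missing = toy[toy["Pop"].isna()]

# Number of missing rows in each year
nan_by_year = missing.groupby("Year").size()

# Number of missing rows for each municipality code
nan_by_mun = missing.groupby("Mun_Code").size()

print(nan_by_year)
print(nan_by_mun)
```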


### 🔍 Missing Population Values: Diagnostic Summary
A number of rows in the dataset contain missing population values (NaN). These gaps are not limited to a single year and appear across multiple municipalities and categories, so they most likely reflect gaps in the original INE records (for example, municipalities with no figures for their earliest years) rather than errors introduced during processing. They do not prevent the computation of population change rates, as long as both ends of a given interval have valid data, so they are retained for now. In later stages, these NaN entries can be handled explicitly, ignored, or imputed, depending on the requirements of each analysis. For the moment, the dataset is complete enough to proceed with the population variation calculations.
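
As a rough illustration of why the remaining `NaN` values need not block the next stage: a change rate computed from consecutive observations is simply `NaN` wherever one end of the interval is missing, and valid elsewhere. The sketch uses toy values and a hypothetical `Pct_Change` column; it is not the calculation used in the follow-up notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Mun_Code": ["00001"] * 4 + ["00002"] * 4,
    "Cat": ["Total"] * 8,
    "Year": [2021, 2022, 2023, 2024] * 2,
    "Pop": [100.0, 110.0, 121.0, 121.0, np.nan, np.nan, 50.0, 55.0],
})

# Order rows chronologically within each municipality and category
toy = toy.sort_values(["Mun_Code", "Cat", "Year"])

# Population of the previous available year in the same group
prev_pop = toy.groupby(["Mun_Code", "Cat"])["Pop"].shift(1)

# Percentage change; NaN when either end of the interval is missing
toy["Pct_Change"] = (toy["Pop"] - prev_pop) / prev_pop * 100

print(toy)
```

Note that `shift(1)` pairs each row with the previous row of its group, so after removing 1997 the 1996 to 1998 interval would span two years and may need separate treatment.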

### 🗂️ Export rearranged clean file

In [12]:
output_padron_hist_clean = "01_padron_clean_1996_2024.csv"

padron_hist.to_csv(
    output_padron_hist_clean,
    sep=",",
    index=False,        # do not include the index column
    encoding="latin1"   # recommended if I plan to use the file in QGIS
)
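
One thing to watch when reloading the exported file: unless the code column is read as text, pandas will infer integers and codes such as `04103` lose their leading zero. Below is a minimal round-trip sketch using an in-memory buffer instead of the real output file:

```python
import io
import pandas as pd

clean = pd.DataFrame({
    "Mun_Code": ["04103", "44001"],
    "Mun": ["Zurgena", "Ababuj"],
    "Cat": ["Mujeres", "Total"],
    "Year": [2000, 2024],
    "Pop": [1064.0, 74.0],
})

# Write to a text buffer the same way as the export above (comma-separated, no index)
buffer = io.StringIO()
clean.to_csv(buffer, sep=",", index=False)

# Default type inference turns the codes into integers
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Declaring the column as string keeps the leading zero
buffer.seek(0)
preserved = pd.read_csv(buffer, dtype={"Mun_Code": "string"})

print(inferred["Mun_Code"].tolist())
print(preserved["Mun_Code"].tolist())
```

The same applies in GIS tools that guess field types on import, so it can help to check the type assigned to `Mun_Code` before joining on it.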

## 📝 Conclusion
The historical municipal population dataset from the INE has been cleaned, standardized, and exported in a structured format suitable for analytical workflows. The preprocessing steps cover encoding, numeric conversion, and the extraction of the key geographic identifiers (INE municipality codes and municipality names). As a result, the dataset is ready for systematic analysis without the formatting issues typically present in raw administrative files.

#### ➡️ Next Steps
The next notebook will focus on the analytical stage: computing population change across multiple temporal intervals, exploring annual and multi-year trends, and preparing the resulting indicators for potential spatial visualization in GIS environments. This will establish the basis for a reproducible pipeline linking raw statistical data with territorial analysis tools.