# The Explainer Notebook

<h1 align="center">20 Years of Change: Voting Trends in Copenhagen’s National Elections</h1>

<div align="center">

Jasmin Thari (s204155), Johanne Franck () & Smilla Due ()

</div>


### Structure of this Notebook

This notebook consists of eight parts: [Motivation](#1), [Basic Statistics](#2), [Data Analysis](#3), [Genre](#4), [Visualizations](#5), [Discussion](#6), [Contributions](#7), and [References](#8).

In the first part, **Motivation**, we describe the project's goals and provide an overview of the dataset used.

In the second part, **Basic Statistics**, we give an overview of the dataset and its collection process. This section also includes a detailed explanation of the data cleaning procedures. Additionally, we present key dataset statistics and plots from our exploratory data analysis.

In the third part, **Data Analysis**, we describe our analytical approach and the insights derived from the data.

In the fourth part, **Genre**, we explain the genre chosen for our data story and outline the specific elements included in our data story.

In the fifth part, **Visualizations**, we present the visualizations created to support our data story. We explain the purpose of each visualization and how it contributes to the overall narrative.

In the sixth part, **Discussion**, we discuss the implications of our findings and how they relate to the broader context of voting trends in Copenhagen’s national elections. We also discuss the limitations of our analysis.

Finally, in the **Contributions** section, we detail the roles and responsibilities of each group member in the project.


### Table of Contents
1. [Motivation](#1)  
2. [Basic Statistics](#2)  
3. [Data Analysis](#3)  
4. [Genre](#4)  
5. [Visualizations](#5)  
6. [Discussion](#6)  
7. [Contributions](#7)  
8. [References](#8)

---


<a id="1"></a>
## 1.  Motivation

<!-- What is your dataset?
Why did you choose this/these particular dataset(s)?
What was your goal for the end user's experience? -->

### 1.1 Motivation and goal


### 1.2 Data

The data used in this study was obtained from **[Den Danske Valgdatabase](https://valgdatabase.dst.dk/)**, powered by *Danmarks Statistik*. This comprehensive database contains election data from Denmark spanning back to 1979. It includes data from various types of elections including:

- National parliamentary elections (*folketingsvalg*)  
- European Parliament elections (*europaparlamentsvalg*)  
- Referendums (*folkeafstemninger*)  
- Municipal and regional elections (*kommunalvalg* and *regionsrådsvalg*)  
- Parliamentary elections in Greenland and the Faroe Islands  

For the purpose of this project, we have chosen to focus specifically on **national elections in the Copenhagen constituency** over the past two decades. This covers six elections, held in **2001, 2005, 2007, 2011, 2015, and 2019**.

The extracted election data includes:

- Total number of votes cast  
- Number of votes per political party across constituencies  

We filtered the dataset to include only the local constituencies that belong to the **Copenhagen multi-member constituency (Københavns Storkreds)**. These are:

1. Østerbrokredsen  
2. Sundbyvesterkredsen  
3. Indre Bykredsen  
4. Sundbyøsterkredsen  
5. Nørrebrokredsen  
6. Bispebjergkredsen  
7. Brønshøjkredsen  
8. Valbykredsen  
9. Vesterbrokredsen  
10. Falkonerkredsen  
11. Slotskredsen  
12. Tårnbykredsen  

In addition to election data, we also extracted **demographic and socioeconomic data** about the population of Copenhagen. This supplementary data helps us better understand voting patterns in the context of population characteristics. The additional data categories include:

- **Demographic**: Gender, age, and ethnicity  
- **Socioeconomic**: Income, education level, and employment status  
- **Housing**: Type of housing, housing size, and housing prices  

In total, this resulted in **six `.csv` files**, which will be imported, cleaned, and explained in detail in the following sections.


**Libraries**

The libraries used for this project are presented and imported below.

In [None]:
import os
import json
from functools import partial

import pandas as pd
import numpy as np
import geopandas as gpd

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import to_hex
import matplotlib.patches as mpatches


from shapely.geometry import shape
import folium
from folium import FeatureGroup
from folium.plugins import TimestampedGeoJson


import ipywidgets as widgets
from IPython.display import display, clear_output

# Pandas settings
pd.set_option('future.no_silent_downcasting', True)

<a id="2"></a>
## 2.  Basic Statistics

This section is divided into two parts: **2.1 Data Cleaning** and **2.2 Explorative Analysis**. In the first part, we load the data, clean it, and explain the cleaning process. In the second part, we present some basic statistics and visualizations of the data.

### 2.1 Data Cleaning

In this section, we load and clean the data. Overall, we work with three different datasets:

1. **Geographical data**: Contains geographic information at the constituency level. We use this dataset to filter the other datasets so that only constituencies within the Copenhagen area are included. 

2. **Election data**: Contains the number of votes received by each political party in each constituency, for each election year.

3. **Population data**: Contains information about the population at the constituency level. This data is used to analyze voting patterns in relation to various population characteristics. To make the analysis more manageable, we divide this dataset into the following subsets:
   - **Demographic data**: Includes information such as *gender*, *age*, and *citizenship*.
   - **Origin**: Includes data on *immigrants and descendants*. 
   - **Socioeconomic data**: Covers *employment status* and *branch of work*.
   - **Education data**: Covers *educational level*.
   - **Income support**: Includes data on *income support*.
   - **Income data**: Contains information on *income*.
   - **Ownership**: Covers types of *ownership of housing*. 
   - **Housing type**: Specifies the *type of housing* people live in.
   - **Housing size**: Contains data on *the size of the dwelling*.

Both the election data and population data are initially in **wide format**, meaning each row represents a constituency and each column represents a variable. We convert the datasets into **long format** using the `melt` function. In long format, each row represents a single observation (e.g., a specific value for a variable in a constituency and year), making it easier to filter, group, and analyze the data.
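
As a minimal sketch of the wide-to-long conversion described above, the toy table below is reshaped with `melt`. The column names and counts are invented for illustration and do not come from the real files.

```python
import pandas as pd

# Toy wide-format table: one row per constituency, one column per variable.
# Column names and values are invented for illustration.
wide_df = pd.DataFrame({
    "KredsNr": [1, 2],
    "FV2015-X": [10, 20],
    "FV2019-X": [11, 21],
})

# melt keeps the id column and stacks the remaining columns into rows
long_df = wide_df.melt(id_vars=["KredsNr"], var_name="RawColumn", value_name="Count")

print(long_df)
```

Each of the two value columns contributes one row per constituency, so the long table has four rows and can be filtered or grouped on `RawColumn` directly.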

##### 2.1.1 Geography data
The first dataset we will import is the **geography data**. This dataset contains information about the geographical areas in Copenhagen, including the names of the constituencies and their corresponding codes. This dataset is essential for linking the election data and population data to the geographical areas in Copenhagen.

The first step is to filter the dataset to only include the constituencies in the Copenhagen constituency. Next, we create a mapping between the constituency names and their corresponding codes. This mapping will be used throughout the notebook. 

In [68]:
# Load the data
df_geo = pd.read_csv('Data/raw/Geografi.csv', sep=';', decimal=',', na_values='-')
df_geo.columns = df_geo.columns.str.replace(' ', '', regex=True)

# Filter the data to only keep Copenhagen Constituencies 
df_geo_cph = df_geo.query("Storkredsnavn=='Københavns Storkreds'").drop_duplicates()
df_geo_cph['KredsNr'] = df_geo_cph['KredsNr'].astype(int)

# Keep only the necessary columns
df_geo_cph = df_geo_cph[['KredsNr', 'Kredsnavn','KommuneNr','Kommunenavn']]

# Replace Utterslev with Bispebjerg
df_geo_cph['Kredsnavn'] = df_geo_cph['Kredsnavn'].replace({'6. Utterslev':'6. Bispebjerg'})

# Drop duplicates
df_geo_cph = df_geo_cph.drop_duplicates(subset=['KredsNr', 'Kredsnavn','KommuneNr','Kommunenavn'])

print("Geography data shape:", df_geo_cph.shape)

Geography data shape: (13, 4)


In [70]:
# Split the 'Kredsnavn' column into two parts: ID and Name
constituency_split = df_geo_cph['Kredsnavn'].drop_duplicates().str.split('.', n=1, expand=True)
constituency_split.columns = ['ConstituencyID', 'ConstituencyName']

# Convert ID column to integers and strip whitespace from the names
constituency_split['ConstituencyID'] = constituency_split['ConstituencyID'].astype(int)
constituency_split['ConstituencyName'] = constituency_split['ConstituencyName'].str.strip()

# Create a mapping from ID to Name
constituency_id_to_name = dict(zip(constituency_split['ConstituencyID'], constituency_split['ConstituencyName']))

# Create a reverse mapping from Name to ID
constituency_name_to_id = {name: id_ for id_, name in constituency_id_to_name.items()}
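
The sketch below shows how such a mapping might be used to attach readable names to a table keyed by `KredsNr`. The `Kredsnavn` values and the vote counts are illustrative assumptions; the real values come from `Geografi.csv`.

```python
import pandas as pd

# Illustrative subset of 'Kredsnavn' values (assumed format: "<id>. <name>")
kredsnavn = pd.Series(["1. Østerbro", "5. Nørrebro", "6. Bispebjerg"])

# Same split-and-strip steps as in the cell above
parts = kredsnavn.str.split(".", n=1, expand=True)
id_to_name = dict(zip(parts[0].astype(int), parts[1].str.strip()))

# Hypothetical table keyed by KredsNr, with made-up vote counts
votes = pd.DataFrame({"KredsNr": [5, 1, 6], "Votes": [100, 200, 300]})
votes["ConstituencyName"] = votes["KredsNr"].map(id_to_name)

print(votes)
```

Using `map` with a dictionary leaves `NaN` for any `KredsNr` that is missing from the mapping, which makes unexpected codes easy to spot.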

##### 2.1.2 Election data
The second dataset we will import is the **election data**. This dataset is structured in wide format, where each row corresponds to a constituency and each column corresponds to a political party in a given election year, resulting in 172 columns. Because this format is inconvenient to filter and group, we reshape the dataset to long format.

First, we will filter the data to only include the constituencies in the Copenhagen constituency using the defined map. Next, we will reshape the dataset to long format, where each row corresponds to a constituency-year-party combination.

In [71]:
# Load the election data
df_election = pd.read_csv('Data/raw/Valgdata.csv', sep=';' , decimal=',', na_values='-')
df_election.columns = df_election.columns.str.replace(' ', '', regex=True)

# Filter the election data to only keep the relevant constituencies
df_elec_cph = df_election[df_election['KredsNr'].isin([str(k) for k in constituency_id_to_name.keys()])].copy()
df_elec_cph['KredsNr'] = df_elec_cph['KredsNr'].astype(int)

print("Election data shape:", df_elec_cph.shape)

Election data shape: (12, 172)


In [72]:
# Select vote cols (start with "FV")
vote_columns = [col for col in df_elec_cph.columns if col.startswith("FV")]

# Melt the dataframe to long format
df_elec_cph_long = df_elec_cph.melt(
    id_vars=['KredsNr'],
    value_vars=vote_columns,
    var_name='YearParty',
    value_name='Votes')

# Split the 'YearParty' column into 'Year' and 'Party'
df_elec_cph_long[['Year', 'Partyname']] = df_elec_cph_long['YearParty'].str.extract(r'FV(\d{4})-(.+)')

# Drop columns and reorder
df_elec_cph_long = df_elec_cph_long.drop(columns='YearParty')
df_elec_cph_long = df_elec_cph_long[['KredsNr', 'Year', 'Partyname', 'Votes']]

# Replace NaN values in Votes with 0
df_elec_cph_long['Votes'] = df_elec_cph_long['Votes'].fillna(0)

# Convert 'Year' and KredsNr to integer
df_elec_cph_long['Year'] = df_elec_cph_long['Year'].astype(int)
df_elec_cph_long['KredsNr'] = df_elec_cph_long['KredsNr'].astype(int)

print("Election data long format shape:", df_elec_cph_long.shape)

Election data long format shape: (2016, 4)
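
To illustrate how the pattern `FV(\d{4})-(.+)` separates year and party, the sketch below applies it with the standard `re` module. The two `FV` column names are assumptions built from the party labels used in this notebook.

```python
import re

# Same pattern as in the cell above: a four-digit year after "FV", then the label
pattern = re.compile(r"FV(\d{4})-(.+)")

# Assumed column names; "KredsNr" is included to show a non-matching column
columns = [
    "FV2019-A.Socialdemokratiet",
    "FV2001-V.Venstre,DanmarksLiberaleParti",
    "KredsNr",
]

results = {}
for col in columns:
    m = pattern.search(col)
    # Non-matching columns yield None, just as str.extract yields NaN
    results[col] = (int(m.group(1)), m.group(2)) if m else None

print(results)
```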


The `Partyname` column contains every party that stood in the elections. It also contains rows for the total number of votes cast, blank votes, invalid votes, and eligible voters. For the main part of the analysis, we are interested in the political parties, which we also want to divide into *left* and *right* wing. We therefore create a new column, `Wing`, that indicates whether a party belongs to the left or the right; rows that are not parties get an empty string.

In [74]:
# All parties in the dataset
parties = ['A.Socialdemokratiet','B.DetRadikaleVenstre', 'C.DetKonservativeFolkeparti','D.Centrum-Demokraterne', 'F.SF-SocialistiskFolkeparti',
           'I.LiberalAlliance', 'K.Kristendemokraterne', 'O.DanskFolkeparti','M.Minoritetspartiet', 'V.Venstre,DanmarksLiberaleParti',
           'Y.NyAlliance', 'Ø.Enhedslisten-DeRød-Grønne','Q.FrieGrønne,DanmarksNyeVenstrefløjsparti','Å.Alternativet',   'P.StramKurs', 
           'Æ.Danmarksdemokraterne-IngerStøjberg', 'E.KlausRiskærPedersen']

wing = {
    "left": ['A.Socialdemokratiet',
             'F.SF-SocialistiskFolkeparti',
             'Ø.Enhedslisten-DeRød-Grønne',
             'Q.FrieGrønne,DanmarksNyeVenstrefløjsparti',
             'Å.Alternativet',
             'B.DetRadikaleVenstre',
             'D.Centrum-Demokraterne',
             'M.Minoritetspartiet'],
    "right": ['C.DetKonservativeFolkeparti',
              'V.Venstre,DanmarksLiberaleParti',
              'I.LiberalAlliance',
              'O.DanskFolkeparti',
              'Æ.Danmarksdemokraterne-IngerStøjberg',
              'P.StramKurs',
              'K.Kristendemokraterne',
              'Y.NyAlliance',
              'E.KlausRiskærPedersen']}

In [75]:
df_elec_cph_long['Wing'] = df_elec_cph_long['Partyname'].apply(
    lambda x: 'left' if x in wing['left'] else ('right' if x in wing['right'] else ''))
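
One possible way to use the `Wing` column is to compute each wing's share of the party votes per election year. The sketch below assumes a shortened wing lookup, made-up vote counts, and a hypothetical non-party label `BlankVotes`.

```python
import pandas as pd

# Shortened wing lookup; the full lists are defined in the cell above
wing = {
    "left": ["A.Socialdemokratiet", "Ø.Enhedslisten-DeRød-Grønne"],
    "right": ["V.Venstre,DanmarksLiberaleParti", "O.DanskFolkeparti"],
}
# Invert to party -> wing so that Series.map can be used
party_to_wing = {party: side for side, members in wing.items() for party in members}

# Made-up vote counts; "BlankVotes" is a hypothetical non-party label
toy = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2019, 2019, 2019],
    "Partyname": ["A.Socialdemokratiet", "V.Venstre,DanmarksLiberaleParti", "BlankVotes",
                  "A.Socialdemokratiet", "O.DanskFolkeparti", "BlankVotes"],
    "Votes": [60, 40, 5, 70, 30, 4],
})
toy["Wing"] = toy["Partyname"].map(party_to_wing).fillna("")

# Keep only party rows, then compute each wing's share of party votes per year
party_votes = toy[toy["Wing"] != ""]
by_wing = party_votes.groupby(["Year", "Wing"], as_index=False)["Votes"].sum()
by_wing["Share"] = by_wing["Votes"] / by_wing.groupby("Year")["Votes"].transform("sum")

print(by_wing)
```

The dictionary lookup gives the same labels as the `apply` call above. Excluding rows with an empty `Wing` keeps totals and blank votes out of the denominator.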

##### 2.1.3 Population Data
The third dataset we will import is the **population data**. This dataset contains information about the population in Denmark. The first step is to filter the dataset to only include the constituencies in the Copenhagen constituency using the defined map. 

As the data is structured in wide format with 11374 columns, we split it into sub-dataframes, each corresponding to a specific characteristic, and convert each of them to long format.

In [76]:
# Load the population data
df_population = pd.read_csv("Data/raw/Befolkning.csv", sep=';', low_memory=False, decimal=',', na_values='-')
df_population.columns = df_population.columns.str.replace(' ', '', regex=True)

# Filter the population data to only keep the relevant constituencies
df_population_cph = df_population[df_population['KredsNr'].isin([str(k) for k in constituency_id_to_name.keys()])].copy()
df_population_cph['KredsNr'] = df_population_cph['KredsNr'].astype(int)

print(f"Population data shape: {df_population_cph.shape}")

Population data shape: (12, 11374)


The dataset contains several columns that consist entirely of `NaN` values. These columns will therefore be dropped.

In addition, some rows contain missing values. Upon inspection, these `NaN` entries appear to correspond to cases where the count is effectively zero—indicating that no individuals in that particular category were present in the given constituency. 

For example, a column like  
`FV2015-Antalpersoneropgjortefterstatsborgerskabkønogaldersgrupper_Kvinder10-14år_09.Nordamerika`  
represents the number of girls aged 10–14 with North American citizenship in a specific constituency. If the value is missing (`NaN`), it likely means that no such individuals were registered in the data. In such cases, we will treat `NaN` as `0`.



In [77]:
# Drop columns with all NaN values
df_population_cph = df_population_cph.dropna(axis=1, how='all')
print(f"Population data shape after dropping empty columns: {df_population_cph.shape}")

Population data shape after dropping empty columns: (12, 5824)


In [78]:
# Replace NaN values with 0 & convert to numeric
df_population_cph.fillna(0, inplace=True) 
df_population_cph = df_population_cph.apply(pd.to_numeric, errors='coerce')
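
The order of the two cleaning steps matters, as the toy example below tries to show. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy table: one partly missing count column and one completely empty column
toy = pd.DataFrame({
    "KredsNr": [1, 2, 3],
    "CountA": [5.0, np.nan, 2.0],
    "CountB": [np.nan, np.nan, np.nan],
})

# Step 1: drop columns that contain no values at all
toy = toy.dropna(axis=1, how="all")

# Step 2: treat the remaining gaps as zero counts
toy = toy.fillna(0)

print(toy)
```

If the gaps were filled first, the empty column would survive as a column of zeros and could no longer be told apart from a genuine zero count.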

**Demographics**

In [None]:
# Extract the demographic columns
demographics_cols = [
    col for col in df_population_cph.columns
    if col.startswith("FV") and "Antalpersoneropgjortefter" in col]

# Melt the DataFrame
df_demo_long = df_population_cph.melt(
    id_vars=['Gruppe', 'KredsNr'],
    value_vars=demographics_cols,
    var_name='RawColumn',
    value_name='Count')

# Extract the fields using regex: FV<year>-Antalpersoner..._<Gender><Age>_<CitizenshipCode>.<CitizenshipName>
df_demo_long[['Year', 'GenderAge', 'Citizenship']] = df_demo_long['RawColumn'].str.extract(
    r'FV(\d{4})-Antalpersoner.*?_(\w+\d+-?\d*år)_(?:\d+\.)?(.+)$')

# Separate Gender and Age
df_demo_long[['Gender', 'Age']] = df_demo_long['GenderAge'].str.extract(r'(\D+)(\d+-?\d*år)')

# Reorder and clean
df_demo_long = df_demo_long.drop(columns=['RawColumn', 'GenderAge'])
df_demo_long = df_demo_long[['Gruppe', 'KredsNr', 'Year', 'Gender', 'Age', 'Citizenship', 'Count']]

print(f"Population demographics data shape: {df_demo_long.shape}")

Population demographics data shape: (27648, 7)
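
To make the two extraction steps concrete, the sketch below runs the same patterns on the example column name quoted earlier, using the standard `re` module instead of `str.extract`.

```python
import re

# Column name quoted in the text above
col = ("FV2015-Antalpersoneropgjortefterstatsborgerskabkønogaldersgrupper"
       "_Kvinder10-14år_09.Nordamerika")

# Patterns copied from the cell above
outer = re.compile(r"FV(\d{4})-Antalpersoner.*?_(\w+\d+-?\d*år)_(?:\d+\.)?(.+)$")
inner = re.compile(r"(\D+)(\d+-?\d*år)")

# First pass: year, combined gender/age token, citizenship
year, gender_age, citizenship = outer.search(col).groups()

# Second pass: split the combined token into gender and age group
gender, age = inner.search(gender_age).groups()

print(year, gender, age, citizenship)
```

The numeric prefix `09.` is consumed by the optional non-capturing group, so only the citizenship name is kept.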


Since there are repeated or inconsistent values in the Citizenship column, we define a mapping to standardize and convert them into a more readable and consistent format. The mapping is as follows:

In [82]:
citizenship_map_en = {
    'Danmark': 'Denmark',
    'Nordiskelande': 'Nordic countries',
    'Tyrkiet': 'Turkey',
    'TidligereJugoslavien': 'Former Yugoslavia',
    'GamleEU-lande': 'Old EU countries',
    'ØvrigegamleEU-lande': 'Old EU countries', 
    'NyeEU-lande': 'New EU countries',
    'ØvrigeEuropa': 'Other European countries',
    'Afrika': 'Africa',
    'Nordamerika': 'North America',
    'Syd-ogMellemamerika': 'South and Central America',
    'Syd-ogMellemam.': 'South and Central America',
    'AsienogOceanien': 'Asia and Oceania',
    'Asienogoceanien': 'Asia and Oceania',
    'Uoplyst': 'Unspecified/Stateless',
    'Uoplyst/statsløse': 'Unspecified/Stateless',
}

df_demo_long['Citizenship'] = df_demo_long['Citizenship'].replace(citizenship_map_en)
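
Because `replace` leaves any label without a mapping entry unchanged, a quick check for unmapped labels can be useful. The sketch below uses a shortened mapping and a made-up label column; `Ukendt` is a hypothetical label that is not in the mapping.

```python
import pandas as pd

# Shortened version of the mapping defined above
mapping = {"Danmark": "Denmark", "Afrika": "Africa", "Uoplyst": "Unspecified/Stateless"}

# Made-up label column; "Ukendt" is a hypothetical label missing from the mapping
labels = pd.Series(["Danmark", "Afrika", "Ukendt", "Uoplyst"])

translated = labels.replace(mapping)

# Labels that stay untranslated because the mapping has no entry for them
unmapped = sorted(set(labels.dropna()) - set(mapping))

print(translated.tolist())
print(unmapped)
```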

**Origin**

In [86]:
# Filter relevant columns
origin_cols = [
    col for col in df_population_cph.columns
    if "Indvandrereogefterkommerefordeltefteroprindelsesland" in col]

# Melt to long format
df_origin_long = df_population_cph.melt(
    id_vars=['Gruppe', 'KredsNr'],
    value_vars=origin_cols,
    var_name='RawColumn',
    value_name='Count')

# Extract year, gender, age, origin
df_origin_long[['Year', 'GenderAge', 'Origin']] = df_origin_long['RawColumn'].str.extract(
    r'FV(\d{4})-Indvandrereogefterkommerefordeltefteroprindelsesland_(\D+\d+-?\d*år)_(?:\d+\.)?(.+)$')

# Split gender and age
df_origin_long[['Gender', 'Age']] = df_origin_long['GenderAge'].str.extract(r'(\D+)(\d+-?\d*år)')

# Clean up
df_origin_long = df_origin_long.drop(columns=['RawColumn', 'GenderAge'])
df_origin_long['Age'] = df_origin_long['Age'].str.replace('år', '', regex=False)

# Apply the citizenship mapping
df_origin_long['Origin'] = df_origin_long['Origin'].replace(citizenship_map_en)

print(f"Population origin data shape: {df_origin_long.shape}")

Population origin data shape: (25296, 7)


**Socio-economic**

In [89]:
# Extract the socioeconomic columns
socioeconomic_cols = [
    col for col in df_population_cph.columns
    if col.startswith("FV") and "Socio-økonomiskstatusogbrancherfordeltpåafstemningsområder" in col]

# Melt to long format
df_socio_long = df_population_cph.melt(
    id_vars=['Gruppe', 'KredsNr'],
    value_vars=socioeconomic_cols,
    var_name='RawColumn',
    value_name='Count')

# Extract Year, Employment Group, and Industry
df_socio_long[['Year', 'EmploymentGroup', 'Industry']] = df_socio_long['RawColumn'].str.extract(
    r'FV(\d{4})-Socio-økonomiskstatusogbrancherfordeltpåafstemningsområder_\d+\.(.+?)_(.+)')

# Clean up columns
df_socio_long = df_socio_long.drop(columns=['RawColumn'])
df_socio_long = df_socio_long[['Gruppe', 'KredsNr', 'Year', 'EmploymentGroup', 'Industry', 'Count']]

print(f"Socioeconomic data shape: {df_socio_long.shape}")

Socioeconomic data shape: (5760, 6)


**Education**

In [92]:
# Extract the education columns
educ_columns = [
    col for col in df_population_cph.columns
    if col.startswith("FV") and "Højstfuldførteerhvervsuddannelseogaldersgrupper" in col]

# Melt the DataFrame
df_educ_long = df_population_cph.melt(
    id_vars=['Gruppe', 'KredsNr'],
    value_vars=educ_columns,
    var_name='RawColumn',
    value_name='Count')

# Extract the fields using regex: FV<year>-Højstfuldførteer..._<Age>_<EducationLevel>
df_educ_long[['Year', 'Age', 'EducationLevel']] = df_educ_long['RawColumn'].str.extract(
    r'FV(\d{4})-Højstfuldførteerhvervsuddannelseogaldersgrupper_(\d{1,3}-?\d*år)(?:_(?:\d+\.)?(.+))?$')

# Remove 'år' from age
df_educ_long['Age'] = df_educ_long['Age'].str.replace('år', '', regex=False)

# Reorder and clean
df_educ_long = df_educ_long.drop(columns=['RawColumn'])
df_educ_long = df_educ_long[['Gruppe', 'KredsNr', 'Year', 'Age', 'EducationLevel', 'Count']]

# Drop NaN values
df_educ_long = df_educ_long.dropna(subset=['Year', 'Age', 'EducationLevel'])

print(f"Education data shape: {df_educ_long.shape}")

Education data shape: (6624, 6)


**Type of income support**

In [93]:
# Extract the support columns
support_cols = [
    col for col in df_population_cph.columns
    if col.startswith("FV") and "Personerefterforsørgelsestype" in col]

# Melt the DataFrame
df_support_long = df_population_cph.melt(
    id_vars=['Gruppe', 'KredsNr'],
    value_vars=support_cols,
    var_name='RawColumn',
    value_name='Count')

# Extract Year and SupportType
df_support_long[['Year', 'SupportType']] = df_support_long['RawColumn'].str.extract(
    r'FV(\d{4})-Personerefterforsørgelsestype_\d+\.(.+)')

# Clean up
df_support_long = df_support_long.drop(columns='RawColumn')
df_support_long = df_support_long[['Gruppe', 'KredsNr', 'Year', 'SupportType', 'Count']]

print(f"Support data shape: {df_support_long.shape}")

Support data shape: (576, 5)


**Income**

In [None]:
# Filter relevant income columns
income_cols = [
    col for col in df_population_cph.columns
    if col.startswith("FV") and "Husstandsindkomsterfordeltpåafstemningsområder" in col]

# Melt
df_income_long = df_population_cph.melt(
    id_vars=['Gruppe', 'KredsNr'],
    value_vars=income_cols,
    var_name='RawColumn',
    value_name='Value')

# Extract Year + IncomeMetric
df_income_long[['Year', 'IncomeMetric']] = df_income_long['RawColumn'].str.extract(
    r'FV(\d{4})-Husstandsindkomsterfordeltpåafstemningsområder_(.+)')

# Clean up
df_income_long = df_income_long.drop(columns='RawColumn')
df_income_long = df_income_long[['Gruppe', 'KredsNr', 'Year', 'IncomeMetric', 'Value']]

print(f"Income data shape: {df_income_long.shape}")

Income data shape: (816, 5)


**Ownership type**

In [98]:
# Identify relevant ownership (Ejerforhold) columns
ownership_cols = [
    col for col in df_population_cph.columns
    if col.startswith("FV") and "Ejerforhold" in col]

# Melt the DataFrame
df_ownership_type_long = df_population_cph.melt(
    id_vars=['Gruppe', 'KredsNr'],
    value_vars=ownership_cols,
    var_name='RawColumn',
    value_name='Count')

# Extract: year, aggregate level (boliger/personer), ownership type
df_ownership_type_long[['Year', 'AggregateLevel', 'OwnershipType']] = df_ownership_type_long['RawColumn'].str.extract(
    r'FV(\d{4})-Ejerforhold_Antal_(boliger|personer)_\d+\.(.+)')

df_ownership_type_long['AggregateLevel'] = df_ownership_type_long['AggregateLevel'].map({'boliger': 'Units', 'personer': 'Residents'})

# Reorder columns
df_ownership_type_long = df_ownership_type_long[['Gruppe', 'KredsNr', 'Year', 'AggregateLevel', 'OwnershipType', 'Count']]

# Drop rows with NaN values
df_ownership_type_long = df_ownership_type_long.dropna(subset=['Year', 'AggregateLevel', 'OwnershipType'])

print(f"Ownership data shape: {df_ownership_type_long.shape}")


Ownership data shape: (432, 6)


**Housing type**

In [102]:
# Identify relevant housing type (Boligtype) columns
housing_type_cols = [
    col for col in df_population_cph.columns
    if col.startswith("FV") and "Boligtype" in col]

# Melt the DataFrame
df_housing_type_long = df_population_cph.melt(
    id_vars=['Gruppe', 'KredsNr'],
    value_vars=housing_type_cols,
    var_name='RawColumn',
    value_name='Count')

# Extract: year, aggregate level (boliger/personer), housing type
df_housing_type_long[['Year', 'AggregateLevel', 'HousingType']] = df_housing_type_long['RawColumn'].str.extract(
    r'FV(\d{4})-Boligtype_Antal_(boliger|personer)_\d+\.(.+)')

df_housing_type_long['AggregateLevel'] = df_housing_type_long['AggregateLevel'].map({'boliger': 'Units', 'personer': 'Residents'})

# Reorder columns 
df_housing_type_long = df_housing_type_long[['Gruppe', 'KredsNr', 'Year', 'AggregateLevel', 'HousingType', 'Count']]

# Drop rows with NaN values
df_housing_type_long = df_housing_type_long.dropna(subset=['Year', 'AggregateLevel', 'HousingType'])

print(f"Housing data shape: {df_housing_type_long.shape}")


Housing data shape: (432, 6)


**Housing size**

In [105]:
hoursing_size_columns = [
    col for col in df_population_cph.columns
    if col.startswith("FV") and "boligstørrelse" in col]

# Melt to long format
df_house_size_long = df_population_cph.melt(
    id_vars=['Gruppe', 'KredsNr'],
    value_vars=hoursing_size_columns,
    var_name='RawColumn',
    value_name='Count')

# Extract year, aggregate level, and size category
df_house_size_long[['Year', 'AggregateLevel', 'HourseSize']] = df_house_size_long['RawColumn'].str.extract(
    r'FV(\d{4})-Boligerogpersonerefterboligstørrelse_\d+\.(?:Antal)?(boliger|personer)_(.+)')

df_house_size_long['AggregateLevel'] = df_house_size_long['AggregateLevel'].map({'boliger': 'Units', 'personer': 'Residents'})

# Reorder columns
df_house_size_long = df_house_size_long[['Gruppe', 'KredsNr', 'Year', 'AggregateLevel', 'HourseSize', 'Count']]

# Drop rows with NaN values
df_house_size_long = df_house_size_long.dropna(subset=['Year', 'AggregateLevel', 'HourseSize'])

print(f"House size data shape: {df_house_size_long.shape}")

House size data shape: (936, 6)
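
The population subsets above all follow the same pattern: select columns by keyword, melt, and parse the column name with a regular expression. As a sketch of how this could be factored out, the hypothetical helper `melt_and_extract` below is not part of the original notebook. The toy column names are invented to match the housing type pattern.

```python
import pandas as pd

def melt_and_extract(df, keyword, pattern, names, id_vars=("KredsNr",), value_name="Count"):
    """Melt all FV columns containing `keyword` and parse them with `pattern`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    cols = [c for c in df.columns if c.startswith("FV") and keyword in c]
    out = df.melt(id_vars=list(id_vars), value_vars=cols,
                  var_name="RawColumn", value_name=value_name)
    # One new column per capture group in the pattern
    out[names] = out["RawColumn"].str.extract(pattern)
    # Drop rows whose column name did not match the pattern
    out = out.drop(columns="RawColumn").dropna(subset=names)
    return out[list(id_vars) + names + [value_name]]

# Toy frame with invented column names and counts
toy = pd.DataFrame({
    "KredsNr": [1, 2],
    "FV2015-Boligtype_Antal_boliger_1.Parcelhuse": [10, 20],
    "FV2019-Boligtype_Antal_boliger_1.Parcelhuse": [12, 22],
})

housing = melt_and_extract(
    toy, "Boligtype",
    r"FV(\d{4})-Boligtype_Antal_(boliger|personer)_\d+\.(.+)",
    ["Year", "AggregateLevel", "HousingType"])

print(housing)
```

The extracted `Year` values are still strings at this point and would need an explicit `astype(int)` before any numeric comparison.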


### 2.2 Explorative Analysis