# Population Data Preparation - France population by communes

**Project: Analysis of cultural accessibility and territorial inequalities in France**

### Research Questions:
1. Geographic Distribution: How is cultural supply distributed across French territories?
2. Urban vs Rural: Are there significant disparities between urban and rural areas?
3. Socio-Economic Correlation: Is there a relationship between territorial wealth and cultural supply?
4. Typological Diversity: What types of cultural venues exist and how are they distributed?
5. PACA Regional Focus: How does PACA compare to national averages?
6. Music Infrastructure: How is public music infrastructure distributed across French departments?

---

## Dataset Information

**Source:** INSEE (Institut National de la Statistique et des Études Économiques)

**Name:** Populations légales - Population by commune

**Origin:** Official French population census data

**Year:** 2021 (most recent official data)

**Last Update:** Published in December 2023

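**Content:** Population counts for all French communes (~35,000 communes). The first sheet of the workbook, which is the one loaded in this notebook, contains regional totals (17 regions).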

**File Location:** `data/raw/demographics/ensemble.xlsx` (download from INSEE)

**Alternative:** Can also be saved as CSV: `data/raw/demographics/population_communes.csv`

**Download Link:** https://www.insee.fr/fr/statistiques/fichier/7739582/ensemble.xlsx

**Purpose:** This dataset is used to calculate cultural density (venues per 100,000 inhabitants).
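
As a minimal sketch of the density calculation, assuming a venue count per territory is already available: the column name `venue_count`, the helper `cultural_density`, and the figures in the example frame are hypothetical and only illustrate the formula (venues × 100,000 / population).

```python
import pandas as pd

def cultural_density(venues: pd.Series, population: pd.Series) -> pd.Series:
    """Venues per 100,000 inhabitants; NaN where population is missing or not positive."""
    safe_population = population.where(population > 0)  # non-positive values become NaN
    return venues * 100_000 / safe_population

# Hypothetical figures, for illustration only (not taken from the INSEE file)
example = pd.DataFrame({
    "region_name": ["Region A", "Region B", "Region C"],
    "venue_count": [50, 120, 10],
    "population_municipale": [1_000_000, 4_000_000, 0],
})
example["venues_per_100k"] = cultural_density(
    example["venue_count"], example["population_municipale"]
)
print(example)
```

Masking non-positive populations keeps a missing or zero population from producing an infinite density; whether such rows should be dropped instead is a project decision.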



## Import Libraries

In [15]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns


## Load Population Data

**Options for loading:**
1. From Excel (original format)
2. From CSV (if you converted it)

Choose ONE option below:

In [16]:
# OPTION 1: Load from Excel (original format)
df_population = pd.read_excel('../data/raw/ensemble.xlsx')

# OPTION 2: Load from CSV (if converted)
# df_population = pd.read_csv('../data/raw/ensemble.csv', encoding='utf-8')

print(f" {len(df_population):,} rows")
print(f"Columns: {len(df_population.columns)}")

 24 rows
Columns: 7


  warn("Workbook contains no default style, apply openpyxl's default")


## Explore the Data

In [17]:
df_population.head()

|   | Populations légales des régions en vigueur au 1er janvier 2024 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|---|---|---|---|---|---|---|---|
| 0 | Mise à jour : décembre 2023 | | | | | | |
| 1 | en habitant | | | | | | |
| 2 | Champ : France métropolitaine, départements d’... | | | | | | |
| 3 | Date de référence statistique : 1er janvier 2021 | | | | | | |
| 4 | Source : | Insee, Recensement de la population 2021 | | | | | |


In [18]:
# Column names
for i, col in enumerate(df_population.columns, 1):
    print(f"{i:2d}. {col}")

 1. Populations légales des régions en vigueur au 1er janvier 2024
 2. Unnamed: 1
 3. Unnamed: 2
 4. Unnamed: 3
 5. Unnamed: 4
 6. Unnamed: 5
 7. Unnamed: 6


In [19]:
# Dataset info
df_population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 7 columns):
 #   Column                                                          Non-Null Count  Dtype 
---  ------                                                          --------------  ----- 
 0   Populations légales des régions en vigueur au 1er janvier 2024  23 non-null     object
 1   Unnamed: 1                                                      19 non-null     object
 2   Unnamed: 2                                                      18 non-null     object
 3   Unnamed: 3                                                      16 non-null     object
 4   Unnamed: 4                                                      18 non-null     object
 5   Unnamed: 5                                                      18 non-null     object
 6   Unnamed: 6                                                      18 non-null     object
dtypes: object(7)
memory usage: 1.4+ KB


In [20]:
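import openpyxl  # Excel engine used by pd.read_excel for .xlsx files

# Preview the raw sheet without a header to locate the real column names
preview = pd.read_excel("../data/raw/ensemble.xlsx", header=None, nrows=20)
preview  # the column names are on row 7 (0-indexed)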



|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | Populations légales des régions en vigueur au ... | | | | | | |
| 1 | Mise à jour : décembre 2023 | | | | | | |
| 2 | en habitant | | | | | | |
| 3 | Champ : France métropolitaine, départements d’... | | | | | | |
| 4 | Date de référence statistique : 1er janvier 2021 | | | | | | |
| 5 | Source : | Insee, Recensement de la population 2021 | | | | | |
| 6 | | | | | | | |
| 7 | Code région | Nom de la région | Nombre d'arrondissements | Nombre de cantons | Nombre de communes | Population municipale | Population totale |
| 8 | 84 | Auvergne-Rhône-Alpes | 39 | 241 | 4028 | 8114361 | 8284162 |
| 9 | 27 | Bourgogne-Franche-Comté | 24 | 152 | 3699 | 2800194 | 2872386 |
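
Instead of reading the header position off the preview by eye, it can be located programmatically. The sketch below assumes the header is the first row in which every column is filled, which holds for the preview above but is not guaranteed for other INSEE files. `find_header_row` is a hypothetical helper, and `preview_demo` is a three-column stand-in for the real preview.

```python
import pandas as pd

def find_header_row(preview: pd.DataFrame, min_filled: int) -> int:
    """Return the index label of the first row with at least `min_filled` non-null cells."""
    filled = preview.notna().sum(axis=1)       # non-null cells per row
    candidates = filled[filled >= min_filled]
    if candidates.empty:
        raise ValueError("no row looks like a header")
    return int(candidates.index[0])

# Miniature stand-in for the preview (3 columns instead of 7)
preview_demo = pd.DataFrame([
    ["Populations légales ...", None, None],
    ["Source :", "Insee, ...", None],
    [None, None, None],
    ["Code région", "Nom de la région", "Population totale"],
    [84, "Auvergne-Rhône-Alpes", 8284162],
])

header_row = find_header_row(preview_demo, min_filled=preview_demo.shape[1])
print(header_row)
```

With a preview read using `header=None`, the returned label is also the row position, so it can be passed to `pd.read_excel(..., header=header_row)`, which reads the same table as `skiprows=header_row, header=0`.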


In [21]:
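# Reload the sheet, skipping the 7 metadata rows above the real header
df_population = pd.read_excel(
    "../data/raw/ensemble.xlsx",
    skiprows=7,  # rows 0-6: title, update date, unit, scope, reference date, source, blank
    header=0     # the first remaining row holds the column names
)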

df_population.head(2)




|   | Code région | Nom de la région | Nombre d'arrondissements | Nombre de cantons | Nombre de communes | Population municipale | Population totale |
|---|---|---|---|---|---|---|---|
| 0 | 84 | Auvergne-Rhône-Alpes | 39 | 241.0 | 4028 | 8114361 | 8284162 |
| 1 | 27 | Bourgogne-Franche-Comté | 24 | 152.0 | 3699 | 2800194 | 2872386 |


In [22]:
# Statistical summary
df_population.describe()

|   | Code région | Nombre d'arrondissements | Nombre de cantons | Nombre de communes | Population municipale | Population totale |
|---|---|---|---|---|---|---|
| count | 17.0 | 17.0 | 15.0 | 17.0 | 17.0 | 17.0 |
| mean | 41.352941 | 19.588235 | 136.0 | 2054.588235 | 3965180.0 | 4036504.0 |
| std | 33.151058 | 13.370017 | 77.642036 | 1816.691672 | 3256699.0 | 3299959.0 |
| min | 1.0 | 2.0 | 21.0 | 22.0 | 286618.0 | 288739.0 |
| 25% | 11.0 | 5.0 | 102.0 | 360.0 | 871157.0 | 880875.0 |
| 50% | 32.0 | 18.0 | 131.0 | 1268.0 | 3394567.0 | 3482543.0 |
| 75% | 75.0 | 26.0 | 177.5 | 3787.0 | 5995292.0 | 6085665.0 |
| max | 94.0 | 41.0 | 258.0 | 5119.0 | 12317280.0 | 12427980.0 |


## Clean the data

In [13]:
print(df_population.columns.tolist())


['Code région', 'Nom de la région', "Nombre d'arrondissements", 'Nombre de cantons', 'Nombre de communes', 'Population municipale', 'Population totale']


In [14]:
# Select the columns to keep (all seven are retained here)
columns_to_keep = [
    "Code région",
    "Nom de la région",
    "Nombre d'arrondissements",
    "Nombre de cantons",
    "Nombre de communes",
    "Population municipale",
    "Population totale"
]

columns_to_keep = [c for c in columns_to_keep if c in df_population.columns]
df_pop_clean = df_population[columns_to_keep].copy()

# Rename columns to SQL-friendly snake_case names
df_pop_clean = df_pop_clean.rename(columns={
    "Code région": "region_code",
    "Nom de la région": "region_name",
    "Nombre d'arrondissements": "nb_arrondissements",
    "Nombre de cantons": "nb_cantons",
    "Nombre de communes": "nb_communes",
    "Population municipale": "population_municipale",
    "Population totale": "population_totale",
})

print(f"{len(df_pop_clean.columns)} columns")
print(f"Rows: {len(df_pop_clean):,}")
df_pop_clean.head()



7 columns
Rows: 17


|   | region_code | region_name | nb_arrondissements | nb_cantons | nb_communes | population_municipale | population_totale |
|---|---|---|---|---|---|---|---|
| 0 | 84 | Auvergne-Rhône-Alpes | 39 | 241.0 | 4028 | 8114361 | 8284162 |
| 1 | 27 | Bourgogne-Franche-Comté | 24 | 152.0 | 3699 | 2800194 | 2872386 |
| 2 | 53 | Bretagne | 15 | 102.0 | 1207 | 3394567 | 3482543 |
| 3 | 24 | Centre-Val de Loire | 20 | 102.0 | 1757 | 2573303 | 2630743 |
| 4 | 94 | Corse | 5 | 26.0 | 360 | 347597 | 352559 |
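
The statistical summary reports a minimum `Code région` of 1, which suggests the codes were read as integers. INSEE region codes are two-character identifiers, so a join against a dataset that stores them as text (for example `"01"`) would miss the single-digit codes. The sketch below shows one way to normalise them; whether the other datasets in this project store codes as strings is an assumption.

```python
import pandas as pd

# Sample codes: 84 and 27 appear in the table above; 1 and 6 stand in for single-digit codes
codes = pd.Series([84, 27, 1, 6], name="region_code")

# Convert to text and left-pad with zeros to the two-character form
codes_padded = codes.astype(str).str.zfill(2)
print(codes_padded.tolist())
```

Applied to the cleaned frame, this would read `df_pop_clean["region_code"] = df_pop_clean["region_code"].astype(str).str.zfill(2)`.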


In [None]:
# Preparation for SQL: convert the population columns to integers.
# Values that cannot be parsed become NaN, then 0, so a 0 here means "missing".

df_pop_clean["population_municipale"] = (
    pd.to_numeric(df_pop_clean["population_municipale"], errors="coerce")
    .fillna(0)
    .astype(int)
)

df_pop_clean["population_totale"] = (
    pd.to_numeric(df_pop_clean["population_totale"], errors="coerce")
    .fillna(0)
    .astype(int)
)

df_pop_clean.head()
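
The summary above shows only 15 non-null values out of 17 for `Nombre de cantons`, which is why that column is displayed as floats (`241.0`). As an alternative to filling gaps with 0, pandas' nullable `Int64` dtype keeps missing values as `<NA>` while storing the rest as integers. This is a sketch of that option on sample values, not the approach used in the cell above.

```python
import pandas as pd

# Sample values: two counts from the table above plus one missing entry
raw = pd.Series([241.0, 152.0, None], name="nb_cantons")

# Nullable integer dtype: whole-number floats become ints, NaN becomes <NA>
as_nullable_int = pd.to_numeric(raw, errors="coerce").astype("Int64")
print(as_nullable_int)
```

Keeping `<NA>` rather than 0 avoids treating a missing count as a real zero, and `to_csv` writes it as an empty field, which SQL bulk-import tools commonly map to NULL (depending on their settings).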



In [None]:
# Check for missing values
print("Missing values:")
print(df_pop_clean.isnull().sum())
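
Beyond missing values, a few consistency checks can be run before export. The sketch below assumes the renamed columns from the cleaning step. `validate_population` is a hypothetical helper, and the rule that `population_totale` should not be lower than `population_municipale` follows from the INSEE definition of total population (municipal population plus the population counted separately).

```python
import pandas as pd

def validate_population(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if df["region_code"].duplicated().any():
        problems.append("duplicate region_code values")
    if (df["population_municipale"] <= 0).any():
        problems.append("non-positive population_municipale")
    if (df["population_totale"] < df["population_municipale"]).any():
        problems.append("population_totale below population_municipale")
    return problems

# Two rows taken from the cleaned table above
good = pd.DataFrame({
    "region_code": [84, 27],
    "population_municipale": [8114361, 2800194],
    "population_totale": [8284162, 2872386],
})
# Same rows with a zero population, as fillna(0) would produce for a missing value
bad = good.assign(population_municipale=[8114361, 0])

print(validate_population(good))
print(validate_population(bad))
```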

## Export to CSV

In [None]:
import os

# Use the same '../data' root as the input file so the output lands in the project's data folder
os.makedirs('../data/processed', exist_ok=True)

output_file = '../data/processed/population_for_sql.csv'
df_pop_clean.to_csv(output_file, index=False, encoding='utf-8')

print(f"Data saved to: {output_file}")
print(f"   Rows: {len(df_pop_clean):,}")
print(f"   Columns: {len(df_pop_clean.columns)}")
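
A quick round-trip check can confirm that the exported CSV reads back with the same shape and totals. The sketch below uses an in-memory buffer and two rows from the cleaned table so that it runs without touching the file system; for the real file, `pd.read_csv(output_file)` would replace the buffer.

```python
import io
import pandas as pd

# Two rows from the cleaned table above
df_demo = pd.DataFrame({
    "region_code": [84, 27],
    "region_name": ["Auvergne-Rhône-Alpes", "Bourgogne-Franche-Comté"],
    "population_totale": [8284162, 2872386],
})

# Write to an in-memory CSV, then read it back
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)

# Compare shape, column order, and the population total
round_trip_ok = bool(
    df_back.shape == df_demo.shape
    and list(df_back.columns) == list(df_demo.columns)
    and df_back["population_totale"].sum() == df_demo["population_totale"].sum()
)
print(round_trip_ok)
```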