# Population Data Preparation - French Population by Commune

**Project: Analysis of cultural accessibility and territorial inequalities in France**

### Research Questions:
1. Geographic Distribution: How is cultural supply distributed across French territories?
2. Urban vs Rural: Are there significant disparities between urban and rural areas?
3. Socio-Economic Correlation: Is there a relationship between territorial wealth and cultural supply?
4. Typological Diversity: What types of cultural venues exist and how are they distributed?
5. PACA Regional Focus: How does Provence-Alpes-Côte d'Azur (PACA) compare to national averages?

---

## Dataset Information

**Source:** INSEE (Institut National de la Statistique et des Études Économiques)

**Name:** Populations légales - Population by commune

**Origin:** Official French population census data

**Year:** 2021 (most recent official data)

**Last Update:** Published in 2023

**Content:** Population counts for all French communes (~35,000 communes)

**File Location:** `data/raw/demographics/ensemble.xlsx` (download from INSEE)

**Alternative:** Can also be saved as CSV: `data/raw/demographics/population_communes.csv`

**Download Link:** https://www.insee.fr/fr/statistiques/fichier/7739582/ensemble.xlsx

**Purpose:** Used to calculate cultural density (venues per 100,000 inhabitants)
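
As a minimal sketch of the density metric named above, the cell below computes venues per 100,000 inhabitants. The territory names, venue counts, populations, and the helper `venues_per_100k` are all hypothetical and only illustrate the arithmetic; they are not taken from INSEE or from any venue dataset.

```python
import pandas as pd

# Hypothetical figures for illustration only (not real INSEE or venue data)
example = pd.DataFrame({
    "territory": ["A", "B"],
    "venues": [50, 12],
    "population": [1_000_000, 80_000],
})

def venues_per_100k(venues, population):
    """Return the number of venues per 100,000 inhabitants."""
    return venues * 100_000 / population

example["cultural_density"] = venues_per_100k(example["venues"], example["population"])
print(example)
```

Territory B has fewer venues than A but a higher density, which is why the population denominator matters when comparing territories.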



## Import Libraries

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns


## Load Population Data

**Options for loading:**
1. From Excel (original format)
2. From CSV (if you converted it)

Choose ONE option below:

In [2]:
# OPTION 1: Load from Excel (original format)
df_population = pd.read_excel('data/raw/ensemble.xlsx')

# OPTION 2: Load from CSV (if converted)
# df_population = pd.read_csv('data/raw/ensemble.csv', encoding='utf-8')

print(f"Rows: {len(df_population):,}")
print(f"Columns: {len(df_population.columns)}")

FileNotFoundError: [Errno 2] No such file or directory: 'data/raw/ensemble.xlsx'
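
The error above means the file was not found at the path used in the cell, relative to the current working directory. The Dataset Information section lists a different folder (`data/raw/demographics/`), so one possible way to handle this is to check a few candidate locations before loading. The helper `find_first_existing` and the `CANDIDATES` list are hypothetical; which path is correct depends on where the file was actually saved.

```python
from pathlib import Path

def find_first_existing(candidates):
    """Return the first candidate that is an existing file, else raise FileNotFoundError."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"None of these files exist (cwd={Path.cwd()}): {tried}")

# Assumed candidates: the path used in the cell above and the one
# listed under "Dataset Information". Adjust to your own layout.
CANDIDATES = [
    "data/raw/ensemble.xlsx",
    "data/raw/demographics/ensemble.xlsx",
]

try:
    excel_path = find_first_existing(CANDIDATES)
    print(f"Found: {excel_path}")
except FileNotFoundError as err:
    excel_path = None
    print(err)
```

If `excel_path` is not `None`, it can be passed to `pd.read_excel` in place of the hard-coded string.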

## Explore the Data

In [None]:
df_population.head()

In [None]:
# Column names
for i, col in enumerate(df_population.columns, 1):
    print(f"{i:2d}. {col}")

In [None]:
# Dataset info
df_population.info()

In [None]:
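import openpyxl  # Excel engine that pandas uses to read .xlsx files

# Preview the raw sheet without a header to locate the real header row
preview = pd.read_excel("data/raw/ensemble.xlsx", header=None, nrows=20)
preview

# The column names are on row 7 of this preview (0-based index)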

In [None]:
# Reload, skipping the 7 rows above the header so that row 7 becomes the column names
df_population = pd.read_excel(
    "data/raw/ensemble.xlsx",
    skiprows=7,
    header=0
)

df_population.head(2)
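
The header position above was found by eye in the preview. As a rough sketch, the same lookup could be done programmatically by searching a `header=None` preview for a known column label. The helper `find_header_row` and the tiny `demo` frame are hypothetical stand-ins, not the INSEE file.

```python
import pandas as pd

def find_header_row(preview, expected_label):
    """Return the index of the first row containing `expected_label`, or None."""
    for idx, row in preview.iterrows():
        if (row.astype(str).str.strip() == expected_label).any():
            return idx
    return None

# Made-up stand-in for a header=None preview of a sheet with title rows on top
demo = pd.DataFrame([
    ["Title line", None],
    [None, None],
    ["Code région", "Population totale"],
    ["11", 100],
])

header_row = find_header_row(demo, "Code région")
print(header_row)
```

The returned index can then be passed as `header=header_row` to `pd.read_excel`, which treats it as a 0-based row position.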


In [None]:
# Statistical summary
df_population.describe()

## Clean the data

In [None]:
print(df_population.columns.tolist())


In [None]:
# Select only the necessary columns
# Note: these names look like region-level fields; pd.read_excel reads only the first sheet by default
columns_to_keep = [
    "Code région",
    "Nom de la région",
    "Nombre d'arrondissements",
    "Nombre de cantons",
    "Nombre de communes",
    "Population municipale",
    "Population totale"
]

# Keep only the columns that actually exist in the loaded sheet
columns_to_keep = [c for c in columns_to_keep if c in df_population.columns]
df_pop_clean = df_population[columns_to_keep].copy()

# Rename columns to SQL-friendly snake_case names
df_pop_clean = df_pop_clean.rename(columns={
    "Code région": "region_code",
    "Nom de la région": "region_name",
    "Nombre d'arrondissements": "nb_arrondissements",
    "Nombre de cantons": "nb_cantons",
    "Nombre de communes": "nb_communes",
    "Population municipale": "population_municipale",
    "Population totale": "population_totale",
})

print(f"Columns: {len(df_pop_clean.columns)}")
print(f"Rows: {len(df_pop_clean):,}")
df_pop_clean.head()
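
The selection cell above filters the wanted columns against the ones present, so a misspelled or absent column is dropped without any message. A small sketch of one way to surface that follows; `select_columns` and the `demo` frame are hypothetical and use made-up values.

```python
import pandas as pd

def select_columns(df, wanted):
    """Return (copy of df restricted to present columns, list of wanted columns that are absent)."""
    present = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    return df[present].copy(), missing

# Made-up frame that lacks one of the requested columns
demo = pd.DataFrame({"Code région": ["11"], "Population totale": [100]})

subset, missing = select_columns(
    demo, ["Code région", "Population municipale", "Population totale"]
)
print("Missing columns:", missing)
```

Printing the missing list makes it easier to notice when the loaded sheet does not have the expected layout.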



In [None]:
# Preparation for SQL

# Convert population columns to integers.
# Values that cannot be parsed as numbers are coerced to NaN, then set to 0.
for col in ["population_municipale", "population_totale"]:
    df_pop_clean[col] = (
        pd.to_numeric(df_pop_clean[col], errors="coerce")
        .fillna(0)
        .astype(int)
    )

df_pop_clean.head()



In [None]:
# Check for missing values
print("Missing values:")
print(df_pop_clean.isnull().sum())
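
Because the population columns were filled with 0 during conversion, the check above cannot report missing values for them. As a hedged alternative, the sketch below counts the values lost to coercion and shows pandas' nullable `Int64` dtype, which keeps missing entries as `<NA>` instead of 0. The `raw` series is made-up sample data.

```python
import pandas as pd

# Made-up raw values, including one that cannot be parsed as a number
raw = pd.Series(["1200", "n.d.", "350"], name="population_totale")

parsed = pd.to_numeric(raw, errors="coerce")  # "n.d." becomes NaN
n_unparsed = int(parsed.isna().sum())          # how many values failed to parse

as_zero = parsed.fillna(0).astype(int)         # same approach as the cleaning cell: NaN -> 0
as_nullable = parsed.astype("Int64")           # alternative: keep missing values as <NA>

print("Unparsed values:", n_unparsed)
print(as_zero.tolist())
print(as_nullable.isna().tolist())
```

Whether 0 or `<NA>` is preferable depends on how the SQL side treats these columns; a population of 0 would distort any per-inhabitant ratio.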

## Export to CSV

In [None]:
import os

# Create the output folder if it does not exist yet
os.makedirs('data/processed', exist_ok=True)

output_file = 'data/processed/population_for_sql.csv'
df_pop_clean.to_csv(output_file, index=False, encoding='utf-8')

print(f"Data saved to: {output_file}")
print(f"   Rows: {len(df_pop_clean):,}")
print(f"   Columns: {len(df_pop_clean.columns)}")
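
When the exported CSV is read back, pandas infers column types, so a code stored as text with a leading zero (such as `"01"`) would come back as the integer 1. The sketch below illustrates this with a made-up sample written to an in-memory buffer rather than the real file; whether the actual `region_code` values have leading zeros is an assumption to verify against the data.

```python
import io
import pandas as pd

# Made-up sample in the same shape as the exported file (values are illustrative)
sample = pd.DataFrame({
    "region_code": ["01", "11"],
    "region_name": ["Region A", "Region B"],
    "population_totale": [1000, 2000],
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default parsing infers integers, so "01" is read back as 1
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Forcing the code column to str keeps the leading zero
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"region_code": str})

print(default_read["region_code"].tolist())
print(typed_read["region_code"].tolist())
```

The same `dtype` argument applies when reading `data/processed/population_for_sql.csv` before loading it into a database.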