# Communes data preparation - French geographic reference

**Project: Analysis of cultural accessibility and territorial inequalities in France**

### Research Questions:
1. Geographic Distribution: How is cultural supply distributed across French territories?
2. Urban vs Rural: Are there significant disparities between urban and rural areas?
3. Socio-Economic Correlation: Is there a relationship between territorial wealth and cultural supply?
4. Typological Diversity: What types of cultural venues exist and how are they distributed?
5. PACA Regional Focus: How does PACA compare to national averages?

---

## Dataset Information

**Source:** API Découpage Administratif (French Government)

**Name:** Communes - French Administrative Divisions

**Origin:** Official government API providing up-to-date administrative data

**Data year:** 2025 (continuously updated)

**Last Update:** The API serves current data; the local JSON file is a snapshot taken at download time

**Content:** All French communes (~35,000) with codes, names, coordinates, population

**File Location:** `data/raw/geography/communes.json`

**Download:** Already downloaded via API (or use script below)

**API URL:** https://geo.api.gouv.fr/communes?fields=nom,code,codesPostaux,codeDepartement,codeRegion,population&format=json&geometry=centre

**Purpose:** Reference table for commune names, codes, and basic info

**Key columns:**
- `code`: INSEE commune code (5 characters; Corsican codes start with `2A` or `2B`)
- `nom`: Commune name
- `codesPostaux`: Postal code(s)
- `codeDepartement`: Department code
- `codeRegion`: Region code
- `population`: Population estimate
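- `centre.coordinates`: `[longitude, latitude]` of the commune centre (GeoJSON order); returned only when `centre` is included in `fields`, so it is absent from the 6-column file loaded below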
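
The `centre` field is a GeoJSON point, so its coordinates are ordered longitude first. A minimal sketch of flattening it into two numeric columns, assuming `centre` has been added to the `fields` parameter. The sample record, its name, code and coordinates are illustrative placeholders, not values taken from the API:

```python
import pandas as pd

# Hypothetical record shaped like an API response that includes `centre`.
sample = [
    {
        "nom": "Exemple-sur-Mer",  # made-up commune name
        "code": "99001",           # made-up code
        "centre": {"type": "Point", "coordinates": [5.25, 43.5]},
    }
]

# json_normalize flattens nested dicts into dotted column names.
df = pd.json_normalize(sample)

# GeoJSON order is [longitude, latitude].
df["longitude"] = df["centre.coordinates"].apply(lambda c: c[0])
df["latitude"] = df["centre.coordinates"].apply(lambda c: c[1])
df = df.drop(columns=["centre.type", "centre.coordinates"])

print(df)
```

Keeping longitude and latitude as separate float columns makes the table easier to load into SQL than a nested list.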

In [31]:
# Data manipulation
import pandas as pd
import numpy as np
import json                             # JSON: JavaScript Object Notation

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

In [32]:
#import requests
#import os

#url = "https://geo.api.gouv.fr/communes?fields=nom,code,codesPostaux,codeDepartement,codeRegion,population&format=json&geometry=centre"

#print("Downloading communes from API...")
#response = requests.get(url)

# notes : https://ec.europa.eu/eurostat/fr/web/user-guides/data-browser/api-data-access/api-faq/examples
#         https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
#         json.load() ; with open() ; pd.DataFrame()
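
The commented-out cell above stops after the request and never writes the file. One possible way to complete it is sketched here, assuming the `requests` package is installed. The helper name `download_communes` and its `session` and `timeout` parameters are hypothetical additions for illustration:

```python
import json
from pathlib import Path

API_URL = (
    "https://geo.api.gouv.fr/communes"
    "?fields=nom,code,codesPostaux,codeDepartement,codeRegion,population"
    "&format=json&geometry=centre"
)


def download_communes(dest, url=API_URL, session=None, timeout=60):
    """Fetch the communes list and write it to `dest` as UTF-8 JSON."""
    if session is None:
        import requests  # third-party; imported lazily so a stub can replace it
        session = requests
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    records = response.json()

    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with dest.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented commune names readable in the file
        json.dump(records, f, ensure_ascii=False)
    return len(records)


# Usage (performs a network call, so it is left commented out):
# n = download_communes("data/raw/geography/communes.json")
# print(f"{n:,} communes saved")
```

Passing a `session` object makes it possible to reuse a configured `requests.Session` or to substitute a stub when testing offline.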


## Load Communes Data from a JSON file

In [33]:

# Load JSON file
with open('data/raw/geography/communes.json', 'r', encoding='utf-8') as f:
    communes_data = json.load(f)

# Convert to DataFrame
df_communes = pd.DataFrame(communes_data)
print(f"{len(df_communes):,} communes")
print(f"Columns: {len(df_communes.columns)}")

34,969 communes
Columns: 6


## Explore the Data

In [34]:

df_communes.head()

                       nom   code  codesPostaux  codeDepartement  codeRegion  population
0  L'Abergement-Clémenciat  01001       [01400]               01          84       860.0
1    L'Abergement-de-Varey  01002       [01640]               01          84       270.0
2        Ambérieu-en-Bugey  01004       [01500]               01          84     15934.0
3      Ambérieux-en-Dombes  01005       [01330]               01          84      1906.0
4                  Ambléon  01006       [01300]               01          84       115.0


In [35]:
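# A bare expression is only displayed when it is the last line of a cell, so print it
print(df_communes.shape)
print(df_communes.columns.tolist())

(34969, 6)
['nom', 'code', 'codesPostaux', 'codeDepartement', 'codeRegion', 'population']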


In [36]:
# Dataset info
df_communes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34969 entries, 0 to 34968
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   nom              34969 non-null  object 
 1   code             34969 non-null  object 
 2   codesPostaux     34969 non-null  object 
 3   codeDepartement  34969 non-null  object 
 4   codeRegion       34969 non-null  object 
 5   population       34963 non-null  float64
dtypes: float64(1), object(5)
memory usage: 1.6+ MB


## CSV preparation

In [14]:
# Handle postal codes (can be a list)
if 'codesPostaux' in df_communes.columns:
    # Take first postal code if multiple
    df_communes['postal_code'] = df_communes['codesPostaux'].apply(
        lambda x: x[0] if isinstance(x, list) and len(x) > 0 else None
    )
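
Taking only the first postal code discards the others for communes that have several. If that information is needed later, a separate link table is one option. This is a minimal sketch on a stand-in frame; the second row is a made-up commune with two postal codes, and the name `postal_link` is hypothetical:

```python
import pandas as pd

# Small stand-in for df_communes; the second row is invented for illustration.
df = pd.DataFrame({
    "code": ["01001", "99001"],
    "codesPostaux": [["01400"], ["99100", "99200"]],
})

# One row per (commune, postal code) pair; an empty list would yield NaN.
postal_link = (
    df[["code", "codesPostaux"]]
    .explode("codesPostaux")
    .rename(columns={"code": "commune_code", "codesPostaux": "postal_code"})
    .reset_index(drop=True)
)

print(postal_link)
```

Such a table can be loaded into SQL as a many-to-one relation alongside the main communes table.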

In [15]:
# Select columns for SQL
df_communes_clean = df_communes[[
    'code',
    'nom',
    'postal_code',
    'codeDepartement',
    'codeRegion',
    'population'
]].copy()

print(f"{len(df_communes_clean.columns)} columns")

6 columns


In [24]:
# Rename
df_communes_clean.columns = [
    'commune_code',
    'commune_name',
    'postal_code',
    'dept_code',
    'region_code',
    'population']

print(df_communes_clean.columns.tolist())

['commune_code', 'commune_name', 'postal_code', 'dept_code', 'region_code', 'population']


In [25]:
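# Enforce data types: codes as zero-padded strings, population as integer
df_communes_clean['commune_code'] = df_communes_clean['commune_code'].astype(str).str.zfill(5)
# Caution: astype(str) turns a missing postal code (None) into the text 'None'
df_communes_clean['postal_code'] = df_communes_clean['postal_code'].astype(str).str.zfill(5)
# The 6 communes without a population figure (see info() above) are set to 0
df_communes_clean['population'] = pd.to_numeric(df_communes_clean['population'], errors='coerce').fillna(0).astype(int)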

df_communes_clean.head()

   commune_code             commune_name  postal_code  dept_code  region_code  population
0         01001  L'Abergement-Clémenciat        01400         01           84         860
1         01002    L'Abergement-de-Varey        01640         01           84         270
2         01004        Ambérieu-en-Bugey        01500         01           84       15934
3         01005      Ambérieux-en-Dombes        01330         01           84        1906
4         01006                  Ambléon        01300         01           84         115
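
The conversion above replaces missing values with `'None'` strings and zeros, which hides them from later checks. An alternative is sketched below using pandas nullable dtypes, which keep missing entries as `<NA>`. This is an illustration on a made-up two-row frame, not the notebook's actual method:

```python
import pandas as pd

# Stand-in frame; the second row is invented and has two missing values.
df = pd.DataFrame({
    "commune_code": ["01001", "99001"],
    "postal_code": ["01400", None],
    "population": [860.0, None],
})

# The nullable "string" dtype keeps missing values as <NA>, not the text "None".
df["postal_code"] = df["postal_code"].astype("string").str.zfill(5)

# The nullable "Int64" dtype stores integers while still allowing <NA>.
df["population"] = pd.to_numeric(df["population"], errors="coerce").astype("Int64")

print(df.isna().sum())
```

With this approach the CSV would contain empty fields for unknown values, which most SQL loaders can import as `NULL`.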


In [26]:
# Check missing values (population shows 0 because missing figures were replaced with 0 in the previous step)
print("Missing values:")
print(df_communes_clean.isnull().sum())

Missing values:
commune_code    0
commune_name    0
postal_code     0
dept_code       0
region_code     0
population      0
dtype: int64
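
A null count does not catch malformed or duplicated codes. The sketch below adds a format and uniqueness check, assuming commune codes are two digits (or `2A`/`2B` for Corsica) followed by three digits. The sample values are a stand-in for `df_communes_clean['commune_code']`, and the last one is deliberately malformed:

```python
import pandas as pd

# Stand-in series; "1001" has lost its leading zero on purpose.
codes = pd.Series(["01001", "2A004", "97411", "1001"], name="commune_code")

# Two digits (or 2A/2B for Corsica) followed by three digits.
pattern = r"(?:\d{2}|2[AB])\d{3}"
is_valid = codes.str.fullmatch(pattern)

print(codes[~is_valid].tolist())  # codes that need attention
print(codes.is_unique)            # a primary key must not repeat
```

Running the same two checks on the real column before export would confirm that `commune_code` can serve as a primary key in SQL.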


In [27]:
# Communes by department
print("Number of communes by department")
communes_per_dept = df_communes_clean['dept_code'].value_counts().sort_values(ascending=False)

print("\nTop 10 departments with most communes:")
print(communes_per_dept.head(10))


Number of communes by department

Top 10 departments with most communes:
dept_code
62    887
02    797
80    771
57    725
76    707
21    698
60    680
59    647
51    610
54    591
Name: count, dtype: int64


In [28]:
# Communes by region
print("Number of communes by region")
communes_per_region = df_communes_clean['region_code'].value_counts().sort_values(ascending=False)
print(communes_per_region)

Number of communes by region
region_code
44     5115
76     4446
75     4293
84     4025
32     3782
27     3685
28     2644
24     1754
11     1266
52     1228
53     1202
93      946
94      360
987      48
02       34
988      33
01       32
04       24
03       22
06       17
984       5
986       3
975       2
977       1
978       1
989       1
Name: count, dtype: int64


In [29]:
# PACA communes
# PACA: Provence-Alpes-Côte d'Azur
# PACA region code is '93'

paca_communes = df_communes_clean[df_communes_clean['region_code'] == '93']

print("PACA REGION")
print(f"Total communes in PACA: {len(paca_communes):,}")
print(f"Total population: {paca_communes['population'].sum():,}")

print("\nCommunes per PACA department:")
paca_by_dept = paca_communes['dept_code'].value_counts().sort_index()
print(paca_by_dept)

# Department names
dept_names = {
    '04': 'Alpes-de-Haute-Provence',
    '05': 'Hautes-Alpes',
    '06': 'Alpes-Maritimes',
    '13': 'Bouches-du-Rhône',
    '83': 'Var',
    '84': 'Vaucluse'
}

print("\nWith names:")
for code, count in paca_by_dept.items():
    print(f"{code} - {dept_names.get(code, 'Unknown'):30s}: {count:3d} communes")

PACA REGION
Total communes in PACA: 946
Total population: 5,218,960

Communes per PACA department:
dept_code
04    198
05    162
06    163
13    119
83    153
84    151
Name: count, dtype: int64

With names:
04 - Alpes-de-Haute-Provence       : 198 communes
05 - Hautes-Alpes                  : 162 communes
06 - Alpes-Maritimes               : 163 communes
13 - Bouches-du-Rhône              : 119 communes
83 - Var                           : 153 communes
84 - Vaucluse                      : 151 communes
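
To relate the PACA figures to the national picture, per-region aggregates can be computed in one pass. This is a minimal sketch on a toy frame: the region codes are real, the populations are made up, and the column names of `by_region` are hypothetical:

```python
import pandas as pd

# Toy stand-in for df_communes_clean (invented populations).
df = pd.DataFrame({
    "region_code": ["93", "93", "84", "84", "11"],
    "population": [1000, 3000, 500, 1500, 4000],
})

# Named aggregation: one output column per keyword.
by_region = df.groupby("region_code")["population"].agg(
    communes="count", population="sum", mean_population="mean"
)
by_region["population_share_pct"] = (
    100 * by_region["population"] / by_region["population"].sum()
)

national_mean = df["population"].mean()
paca = by_region.loc["93"]

print(by_region)
print(f"PACA mean commune population: {paca['mean_population']:.0f} "
      f"(national: {national_mean:.0f})")
```

The same pattern applied to `df_communes_clean` would give each region's commune count, population and share of the national total.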


In [30]:
# Largest communes by population

largest = df_communes_clean.nlargest(20, 'population')[['commune_name', 'dept_code', 'population']]
print(largest.to_string(index=False))

 commune_name dept_code  population
        Paris        75     2103778
    Marseille        13      886040
         Lyon        69      519127
     Toulouse        31      514819
         Nice        06      357737
       Nantes        44      327734
  Montpellier        34      310240
   Strasbourg        67      293771
     Bordeaux        33      267991
        Lille        59      238246
       Rennes        35      230890
       Toulon        83      179116
        Reims        51      177674
Saint-Étienne        42      173136
     Le Havre        76      166687
 Villeurbanne        69      163684
        Dijon        21      161830
       Angers        49      159022
     Grenoble        38      156140
  Saint-Denis       974      155634


## Files for SQL

In [23]:
# Save to CSV
import os
os.makedirs('data/processed', exist_ok=True)

output_file = 'data/processed/communes_for_sql.csv'
df_communes_clean.to_csv(output_file, index=False, encoding='utf-8')

print(f"   Rows: {len(df_communes_clean):,}")
print(f"   Columns: {len(df_communes_clean.columns)}")


   Rows: 34,969
   Columns: 6
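
When the exported CSV is read back with pandas, default type inference parses the codes as integers and drops their leading zeros. A small sketch of the difference, using an in-memory CSV with the same column layout as `communes_for_sql.csv`:

```python
import io
import pandas as pd

# Header plus one row in the same layout as communes_for_sql.csv.
csv_text = (
    "commune_code,commune_name,postal_code,dept_code,region_code,population\n"
    "01001,L'Abergement-Clémenciat,01400,01,84,860\n"
)

code_columns = ["commune_code", "postal_code", "dept_code", "region_code"]

# Default inference turns "01001" into the integer 1001.
naive = pd.read_csv(io.StringIO(csv_text))

# Forcing str on the code columns keeps them intact.
safe = pd.read_csv(io.StringIO(csv_text), dtype={c: str for c in code_columns})

print(naive.loc[0, "commune_code"])  # 1001
print(safe.loc[0, "commune_code"])   # 01001
```

For the same reason, the code columns are best declared as text (for example `CHAR(5)` for `commune_code`) rather than as integers on the SQL side.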
