# Communes data preparation - French geographic reference

**Project: Analysis of cultural accessibility and territorial inequalities in France**

### Research Questions:
1. Geographic Distribution: How is cultural supply distributed across French territories?
2. Urban vs Rural: Are there significant disparities between urban and rural areas?
3. Socio-Economic Correlation: Is there a relationship between territorial wealth and cultural supply?
4. Typological Diversity: What types of cultural venues exist and how are they distributed?
5. PACA Regional Focus: How does PACA compare to national averages?
6. Music Infrastructure: How is public music infrastructure distributed across French departments?

---

## Dataset Information

**Source:** API Découpage Administratif (French Government)

**Name:** Communes - French Administrative Divisions

**Origin:** Official government API providing up-to-date administrative data

**Date:** 2025 (continuously updated)

**Last Update:** Real-time (API is always current)

**Content:** All French communes (~35,000) with codes, names, coordinates, population

**File Location:** `data/raw/geography/communes.json`

**Download:** Already downloaded via API (or use script below)

**API URL:** https://geo.api.gouv.fr/communes?fields=nom,code,codesPostaux,codeDepartement,codeRegion,population&format=json&geometry=centre

**Purpose:** Reference table for commune names, codes, and basic info

**Key columns:**
- `code`: INSEE commune code (5 characters; Corsican codes start with `2A` or `2B`, so the code must be stored as text)
- `nom`: Commune name
- `codesPostaux`: Postal code(s)
- `codeDepartement`: Department code
- `codeRegion`: Region code
- `population`: Population estimate
- `centre.coordinates`: Coordinates of the commune center, in GeoJSON order `[longitude, latitude]`
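
For orientation, a single record might look like the dictionary below. This is only a sketch: the population and coordinate values are illustrative rather than copied from the API, and it assumes the `centre` field is delivered as a GeoJSON point. `pd.json_normalize` flattens the nested field into `centre.type` and `centre.coordinates` columns; the names `sample_records` and `df_sample` are hypothetical.

```python
import pandas as pd

# Illustrative record (values are assumptions, not copied from the API)
sample_records = [
    {
        "nom": "Marseille",
        "code": "13055",
        "codesPostaux": ["13001", "13002"],
        "codeDepartement": "13",
        "codeRegion": "93",
        "population": 870000,
        "centre": {"type": "Point", "coordinates": [5.38, 43.30]},
    }
]

# Flatten nested dictionaries into dotted column names
df_sample = pd.json_normalize(sample_records)

# GeoJSON points are ordered [longitude, latitude]
df_sample["longitude"] = df_sample["centre.coordinates"].str[0]
df_sample["latitude"] = df_sample["centre.coordinates"].str[1]

print(df_sample[["code", "nom", "longitude", "latitude"]])
```

Splitting the coordinates into two scalar columns early tends to make later joins and CSV exports simpler than carrying a list column around.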

In [1]:
# Data manipulation
import pandas as pd
import numpy as np
import json                             # JSON: JavaScript Object Notation

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
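# Optional: download the file from the API (uncomment to run)
#import requests
#import os
#
#url = "https://geo.api.gouv.fr/communes?fields=nom,code,codesPostaux,codeDepartement,codeRegion,population&format=json&geometry=centre"
#
#print("Downloading communes from API...")
#response = requests.get(url, timeout=60)
#response.raise_for_status()             # stop here if the request failed
#
#os.makedirs('data/raw/geography', exist_ok=True)
#with open('data/raw/geography/communes.json', 'w', encoding='utf-8') as f:
#    json.dump(response.json(), f, ensure_ascii=False)

# Notes: https://ec.europa.eu/eurostat/fr/web/user-guides/data-browser/api-data-access/api-faq/examples
#        https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
#        json.load() ; with open() ; pd.DataFrame()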


## Load Communes Data from a JSON file

In [3]:

# Load JSON file
with open('data/raw/geography/communes.json', 'r', encoding='utf-8') as f:
    communes_data = json.load(f)

# Convert to DataFrame
df_communes = pd.DataFrame(communes_data)
print(f"{len(df_communes):,} communes")
print(f"Columns: {len(df_communes.columns)}")

FileNotFoundError: [Errno 2] No such file or directory: 'data/raw/geography/communes.json'

## Explore the Data

In [None]:

df_communes.head()

In [None]:
# Only the last expression of a cell is displayed, so print both values
print(df_communes.shape)
print(df_communes.columns.tolist())

In [None]:
# Dataset info
df_communes.info()

## CSV preparation

In [None]:
# Handle postal codes (can be a list)
if 'codesPostaux' in df_communes.columns:
    # Take first postal code if multiple
    df_communes['postal_code'] = df_communes['codesPostaux'].apply(
        lambda x: x[0] if isinstance(x, list) and len(x) > 0 else None
    )
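
Keeping only the first postal code is a simplification, since a commune can have several. The sketch below restates the same rule as a named function and, as an alternative, builds a separate commune-to-postal-code table with `DataFrame.explode` so that no code is dropped. The toy data, `first_postal_code`, `df_toy` and `df_postal` are hypothetical names used for illustration only.

```python
import pandas as pd


def first_postal_code(codes):
    """Return the first postal code of a list, or None if there is none."""
    if isinstance(codes, list) and len(codes) > 0:
        return codes[0]
    return None


# Toy data (hypothetical commune codes)
df_toy = pd.DataFrame({
    "code": ["00001", "00002", "00003"],
    "codesPostaux": [["13001", "13002"], ["04000"], []],
})

# Same rule as the lambda above: keep the first code, else None
df_toy["postal_code"] = df_toy["codesPostaux"].apply(first_postal_code)

# Alternative: one row per (commune, postal code) pair, nothing is dropped
df_postal = (
    df_toy[["code", "codesPostaux"]]
    .explode("codesPostaux")               # empty lists become NaN
    .dropna(subset=["codesPostaux"])
    .rename(columns={"codesPostaux": "postal_code"})
    .reset_index(drop=True)
)

print(df_toy[["code", "postal_code"]])
print(df_postal)
```

The exploded table suits a relational schema where postal codes live in their own table keyed by commune code.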

In [None]:
# Select columns for SQL
df_communes_clean = df_communes[[
    'code',
    'nom',
    'postal_code',
    'codeDepartement',
    'codeRegion',
    'population'
]].copy()

print(f"{len(df_communes_clean.columns)} columns")

In [None]:
# Rename
df_communes_clean.columns = [
    'commune_code',
    'commune_name',
    'postal_code',
    'dept_code',
    'region_code',
    'population']

print(df_communes_clean.columns.tolist())

In [None]:
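# Enforce data types: codes are zero-padded strings, population is an integer
df_communes_clean['commune_code'] = df_communes_clean['commune_code'].astype(str).str.zfill(5)
# Use the nullable string dtype so a missing postal code stays missing
# (astype(str) would turn None into the text 'None', then '0None' after zfill)
df_communes_clean['postal_code'] = df_communes_clean['postal_code'].astype('string').str.zfill(5)
# Note: a missing or non-numeric population is stored as 0
df_communes_clean['population'] = pd.to_numeric(df_communes_clean['population'], errors='coerce').fillna(0).astype(int)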

df_communes_clean.head()
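
A quick format check can catch codes that lost a leading zero or were corrupted during conversion. The sketch below assumes the usual shape of INSEE commune codes (five digits, or `2A`/`2B` followed by three digits for Corsica); the sample values and the names `INSEE_PATTERN`, `codes` and `invalid_codes` are hypothetical.

```python
import pandas as pd

# Assumed format: 5 digits, or 2A/2B followed by 3 digits (Corsica)
INSEE_PATTERN = r"(?:\d{2}|2[AB])\d{3}"

# Sample values: three well-formed codes and two malformed ones
codes = pd.Series(["01001", "2A004", "97105", "1001", "None"])

# fullmatch requires the whole string to match the pattern
is_valid = codes.str.fullmatch(INSEE_PATTERN)
invalid_codes = codes[~is_valid]

print(f"{len(invalid_codes)} invalid code(s): {invalid_codes.tolist()}")
```

On the real table, the same check would be applied to `df_communes_clean['commune_code']`.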

In [None]:
# Check missing values
print("Missing values:")
print(df_communes_clean.isnull().sum())

In [None]:
# Communes by department
print("Number of communes by department")
communes_per_dept = df_communes_clean['dept_code'].value_counts().sort_values(ascending=False)

print("\nTop 10 departments with most communes:")
print(communes_per_dept.head(10))


In [None]:
# Communes by region
print("Number of communes by region")
communes_per_region = df_communes_clean['region_code'].value_counts().sort_values(ascending=False)
print(communes_per_region)

In [None]:
# PACA communes
# PACA: Provence-Alpes-Côte d'Azur
# The PACA region code is '93'

paca_communes = df_communes_clean[df_communes_clean['region_code'] == '93']

print("PACA REGION")
print(f"Total communes in PACA: {len(paca_communes):,}")
print(f"Total population: {paca_communes['population'].sum():,}")

print("\nCommunes per PACA department:")
paca_by_dept = paca_communes['dept_code'].value_counts().sort_index()
print(paca_by_dept)

# Department names
dept_names = {
    '04': 'Alpes-de-Haute-Provence',
    '05': 'Hautes-Alpes',
    '06': 'Alpes-Maritimes',
    '13': 'Bouches-du-Rhône',
    '83': 'Var',
    '84': 'Vaucluse'
}

print("\nWith names:")
for code, count in paca_by_dept.items():
    print(f"{code} - {dept_names.get(code, 'Unknown'):30s}: {count:3d} communes")
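
The per-department counts can also be produced as a single table with `groupby`, which adds the population total and the department name in one step. This is a sketch on toy data: the population figures are made up, and `df_toy` and `summary` are hypothetical names standing in for the real tables.

```python
import pandas as pd

# Toy data standing in for paca_communes (population figures are made up)
df_toy = pd.DataFrame({
    "dept_code": ["04", "04", "13", "83"],
    "population": [100, 200, 1000, 500],
})
dept_names = {
    "04": "Alpes-de-Haute-Provence",
    "13": "Bouches-du-Rhône",
    "83": "Var",
}

# One row per department: number of communes and total population
summary = (
    df_toy.groupby("dept_code")
    .agg(communes=("population", "size"), population=("population", "sum"))
    .reset_index()
)

# Attach readable names; unmapped codes are labelled 'Unknown'
summary["dept_name"] = summary["dept_code"].map(dept_names).fillna("Unknown")

print(summary.to_string(index=False))
```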

In [None]:
# Largest communes by population

largest = df_communes_clean.nlargest(20, 'population')[['commune_name', 'dept_code', 'population']]
print(largest.to_string(index=False))

## Files for SQL

In [None]:
# Save to CSV
import os
os.makedirs('data/processed', exist_ok=True)

output_file = 'data/processed/communes_for_sql.csv'
df_communes_clean.to_csv(output_file, index=False, encoding='utf-8')

print(f"Saved: {output_file}")
print(f"   Rows: {len(df_communes_clean):,}")
print(f"   Columns: {len(df_communes_clean.columns)}")
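
When the CSV is read back with pandas, code columns need an explicit string type; otherwise they are parsed as integers and the leading zeros disappear. The sketch below shows the difference on a small in-memory file. The sample rows are illustrative, and `df_out`, `df_inferred` and `df_typed` are hypothetical names.

```python
import io

import pandas as pd

# Small stand-in for the exported table (population values are illustrative)
df_out = pd.DataFrame({
    "commune_code": ["01001", "13055"],
    "dept_code": ["01", "13"],
    "population": [800, 870000],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

# Default parsing infers integers, so '01001' loses its leading zero
buffer.seek(0)
df_inferred = pd.read_csv(buffer)

# Declaring the code columns as strings keeps them intact
buffer.seek(0)
df_typed = pd.read_csv(buffer, dtype={"commune_code": str, "dept_code": str})

print(df_inferred["commune_code"].tolist())
print(df_typed["commune_code"].tolist())
```

The same caution applies on the SQL side, where code columns are generally better declared as text than as integers.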
