# Web scraping (Wikipedia) — Department surface area (km²)

We needed one source scraped from an actual HTML web page, not only APIs or CSV files, to enrich the analysis with a variable that affects real-world access.

We use BeautifulSoup, the library studied during the bootcamp, to build the scraper in Python.

**Method (BeautifulSoup):** the script sends an HTTP GET request to the French Wikipedia page listing French departments, parses the HTML with BeautifulSoup, locates the main wikitable, and extracts the department code and surface area (km²), plus population and density, which serve only to cross-check the area. Because Wikipedia tables often use multi-row headers and merged cells (rowspan/colspan), the scraper expands merged cells into a flat grid to keep columns aligned, then cleans the values (removes footnote markers, converts French decimal commas, strips non-numeric characters). The output is saved as a reproducible CSV (departments_area_wiki.csv) and the raw HTML snapshot is archived for traceability.

**Why we collect this data:** department surface area enables an additional indicator—venues per 1,000 km²—which complements the classic venues per 100k inhabitants metric. This helps identify “territorial deserts” where the issue is not only population size but also geography and travel distance, improving the interpretability of cultural access inequalities.




**Goal**: build a reproducible scraped dataset (dept_code, area_km2) and use it to compute an area-based infrastructure density (venues per 1,000 km²) that complements per-capita density.
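
As a rough illustration of the metric, the sketch below computes venues per 1,000 km² for one department. The area is the scraped value that appears later in this notebook for department 02. The venue count and the helper name `venues_per_1000_km2` are hypothetical placeholders, not project results.

```python
def venues_per_1000_km2(venues: int, area_km2: float) -> float:
    """Area-based density: number of venues per 1,000 km²."""
    return venues / area_km2 * 1000


# area_km2 is the scraped value for department 02; the venue count is a
# hypothetical placeholder, not a project result.
density = venues_per_1000_km2(venues=150, area_km2=7361.7)
print(f"{density:.1f} venues per 1,000 km²")
```

Two departments with the same number of venues can therefore rank very differently once area is taken into account, which is the point of pairing this metric with venues per 100k inhabitants.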


In [2]:
import re
from datetime import datetime
from pathlib import Path

import pandas as pd
import requests
from bs4 import BeautifulSoup


In [3]:
# Source (HTML page)
URL = "https://fr.wikipedia.org/wiki/Liste_des_départements_français"
HEADERS = {"User-Agent": "Mozilla/5.0 (RNCP project)"}
TIMEOUT = 30

# Repo paths 
ROOT = Path.cwd()
if not (ROOT / "data").exists() and (ROOT.parent / "data").exists():
    ROOT = ROOT.parent

RAW_DIR = ROOT / "data" / "raw" / "web_scraping"
PROC_DIR = ROOT / "data" / "processed" / "web_scraping"

# Create directories if they don't exist
RAW_DIR.mkdir(parents=True, exist_ok=True)
PROC_DIR.mkdir(parents=True, exist_ok=True)


In [4]:
# Download + archive HTML (SOURCE)

resp = requests.get(URL, headers=HEADERS, timeout=TIMEOUT)
resp.raise_for_status()

html_path = RAW_DIR / "wiki_departements.html"
html_path.write_text(resp.text, encoding="utf-8")
print("Saved HTML snapshot:", html_path)

soup = BeautifulSoup(resp.text, "html.parser")

Saved HTML snapshot: C:\Users\mmouw\Bureau\ironhack\cultural-infrastructure-index-fr\data\raw\web_scraping\wiki_departements.html


In [5]:
#  Locate the main table 

table = soup.select_one("table.wikitable.sortable") or soup.select_one("table.wikitable")
if table is None:
    raise RuntimeError("Could not find table.")  # the page structure may have changed
print("Table found.")

Table found.


In [12]:
#  Data cleaning - helper functions

def clean_text(x: str) -> str:
    if x is None:
        return ""
    x = x.replace("\xa0", " ") # non breaking spaces
    x = re.sub(r"\[\d+\]", "", x)  # remove footnote markers, like [1]
    return x.strip()


# Parse French number format (comma as decimal separator)
def parse_float_fr(text: str):
    s = clean_text(text)
    # keep digits, separators, spaces
    s = re.sub(r"[^\d,\. ]", "", s).replace(" ", "")
    s = s.replace(",", ".")   # French comma -> English decimal
    if s == "":
        return None
    try:
        return float(s)
    except ValueError:  # e.g. several separators left in the string
        return None

# Parse French integer (spaces as thousands separator)
def parse_int_fr(text: str):
    s = clean_text(text)
    s = re.sub(r"[^\d ]", "", s).replace(" ", "")
    if s == "":
        return None
    try:
        return int(s)
    except ValueError:
        return None


#Normalize department code (01, 02, etc...) :
def normalize_dept(code: str) -> str:
   
    code = str(code).strip().upper()
    if code.isdigit() and len(code) == 1:
        return code.zfill(2)  # 1 → 01
    return code






In [13]:

# Parse the HTML table into a grid, expanding rowspan/colspan cells

def table_to_grid(table_tag):
    grid = []
    rowspan = {}  # Track cells that span multiple rows

    for tr in table_tag.find_all("tr"):
        row = []
        col = 0

        # fill rowspan cells from previous rows
        while col in rowspan:
            remaining, value = rowspan[col]
            row.append(value)
            remaining -= 1
            if remaining <= 0:
                del rowspan[col]
            else:
                rowspan[col] = (remaining, value)
            col += 1

        ## Process current row cells:
        for cell in tr.find_all(["th", "td"]):
            # skip columns occupied by rowspan
            while col in rowspan:
                remaining, value = rowspan[col]
                row.append(value)
                remaining -= 1
                if remaining <= 0:
                    del rowspan[col]
                else:
                    rowspan[col] = (remaining, value)
                col += 1


            # Extract cell text
            text = clean_text(cell.get_text(" ", strip=True))
            rs = int(cell.get("rowspan", 1))
            cs = int(cell.get("colspan", 1))

            # Add cell(s) to row :
            for i in range(cs):
                row.append(text)
                if rs > 1:
                    rowspan[col + i] = (rs - 1, text)

            col += cs


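        # NOTE: rowspan cells located after the last cell of this row are not
        # filled here; they stay pending and are emitted in the next row.
        if row:
            grid.append(row)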

    return grid


# Parse the table
grid = table_to_grid(table)
print("Parsed rows:", len(grid))

Parsed rows: 108
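
The grid printed further down shows a side effect of the function above: header cells with `rowspan` that sit to the right of the last cell of a row are never copied into that row, so they spill into the next one (row 2 has 6 cells, row 3 has 22). One possible fix, sketched here under assumptions, is to flush pending spans at the end of each row. `rows_to_grid` and its tuple input are hypothetical, chosen so the logic runs without a network call. This is not the notebook's function.

```python
def rows_to_grid(rows):
    """Expand rowspan/colspan into a grid.

    `rows` is a list of table rows; each row is a list of
    (text, rowspan, colspan) tuples in document order.
    With BeautifulSoup, a row could be built from each <tr> as:
    [(c.get_text(" ", strip=True), int(c.get("rowspan", 1)),
      int(c.get("colspan", 1))) for c in tr.find_all(["th", "td"])]
    """
    grid = []
    pending = {}  # column index -> (rows still to fill, text)

    def fill_pending(row, col):
        # Copy down every pending rowspan cell found at `col` and onward.
        while col in pending:
            remaining, value = pending[col]
            row.append(value)
            if remaining > 1:
                pending[col] = (remaining - 1, value)
            else:
                del pending[col]
            col += 1
        return col

    for cells in rows:
        row, col = [], 0
        for text, rs, cs in cells:
            col = fill_pending(row, col)
            for i in range(cs):
                row.append(text)
                if rs > 1:
                    pending[col + i] = (rs - 1, text)
            col += cs
        # Flush spans located after the last cell of this row.
        fill_pending(row, col)
        if row:
            grid.append(row)
    return grid


# Toy table: "A" and "C" span two rows, the second row has a single cell.
toy = [
    [("A", 2, 1), ("B", 1, 1), ("C", 2, 1)],
    [("b2", 1, 1)],
    [("x", 1, 1), ("y", 1, 1), ("z", 1, 1)],
]
for line in rows_to_grid(toy):
    print(line)
```

Without the final `fill_pending` call, the second row would stop after `b2` and the pending `C` would be inserted into the third row, which is the same misalignment seen for the first department row in this notebook.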


In [17]:

# Show first 10 rows to find header


print("First 10 rows of grid:")
for i, r in enumerate(grid[:10]):
    print(f"\nRow {i}:")
    r_low = [clean_text(c).lower() for c in r]
    print(f"  Length: {len(r)}")
    print(f"  Content: {r_low[:15]}")  # First 15 cells
    
    # Check what we have
    has_code = "code" in r_low
    has_area = any("superficie" in c for c in r_low)
    has_pop = any("population" in c for c in r_low)
    has_dens = any("densité" in c or "densite" in c for c in r_low)
    
    print(f"  Has code: {has_code}, area: {has_area}, pop: {has_pop}, dens: {has_dens}")


# The Wikipedia header is split across rows: area, population and density are in row 1, 'code' is in row 2

First 10 rows of grid:

Row 0:
  Length: 14
  Content: ['', '', '', '', 'subdivisions départementales', 'subdivisions départementales', 'subdivisions départementales', 'subdivisions départementales', 'subdivisions départementales', 'région administrative [ 7 ]', 'caractéristiques', 'caractéristiques', 'caractéristiques', 'caractéristiques']
  Has code: False, area: False, pop: False, dens: False

Row 1:
  Length: 14
  Content: ['', '', '', '', 'arrondissements [ 7 ]', 'arrondissements [ 7 ]', 'communes [ 7 ] (nb)', 'cantons [ 7 ] (nb)', 'circ. législ. (nb)', 'région administrative [ 7 ]', 'superficie cadastrale (en km 2 )', 'population (année)', 'densité (en hab./km 2 )', 'situation']
  Has code: False, area: True, pop: True, dens: True

Row 2:
  Length: 6
  Content: ['code', 'nom', 'chef-lieu', 'date création', 'nb', 'noms']
  Has code: True, area: False, pop: False, dens: False

Row 3:
  Length: 22
  Content: ['01', 'ain (m.s.)', 'bourg-en-bresse (préf. dép.)', '1790', '4', 'belley b

In [22]:
# The header spans several rows, so locate each column index separately


code_idx = area_idx = pop_idx = dens_idx = None

# Row 2 has "code" column
row2 = grid[2]
row2_low = [clean_text(c).lower() for c in row2]

if "code" in row2_low:
    code_idx = row2_low.index("code")
    print(f"Found 'code' at index {code_idx}")

# Row 1 has area, population, density
row1 = grid[1]
row1_low = [clean_text(c).lower() for c in row1]

for i, c in enumerate(row1_low):
    if "superficie" in c:
        area_idx = i
        print(f"Found 'superficie' at index {i}")
    if "population" in c:
        pop_idx = i
        print(f"Found 'population' at index {i}")
    if "densité" in c or "densite" in c:
        dens_idx = i
        print(f"Found 'densité' at index {i}")

# Validation

if None in (code_idx, area_idx, pop_idx, dens_idx):
    raise RuntimeError(f"Missing columns: code={code_idx}, area={area_idx}, pop={pop_idx}, dens={dens_idx}")

print(f"\nColumns :")
print(f"   Code index: {code_idx}")
print(f"   Area index: {area_idx}")
print(f"   Pop index:  {pop_idx}")
print(f"   Dens index: {dens_idx}")


Found 'code' at index 0
Found 'superficie' at index 10
Found 'population' at index 11
Found 'densité' at index 12

Columns :
   Code index: 0
   Area index: 10
   Pop index:  11
   Dens index: 12


In [29]:

# Verification: show raw values for the first 10 data rows


print("Raw data from grid:")
for i in range(3, 13):  # Skip header rows (0,1,2), show rows 3-12
    if i >= len(grid):
        break
    
    r = grid[i]
    print(f"\nRow {i} (length={len(r)}):")
    print(f"  [0] Code: '{r[0] if len(r) > 0 else 'N/A'}'")
    print(f"  [10] Area: '{r[10] if len(r) > 10 else 'N/A'}'")
    print(f"  [11] Pop:  '{r[11] if len(r) > 11 else 'N/A'}'")
    print(f"  [12] Dens: '{r[12] if len(r) > 12 else 'N/A'}'")
    print(f"  Full row: {r[:15]}")  # First 15 cells


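# Row 3 (department 01) is misaligned: leftover header cells occupy the
# area/population/density columns. Clean values start at row 4.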

Raw data from grid:

Row 3 (length=22):
  [0] Code: '01'
  [10] Area: 'Superficie cadastrale (en km 2 )'
  [11] Pop:  'Population (année)'
  [12] Dens: 'Densité (en hab./km 2 )'
  Full row: ['01', 'Ain (m.s.)', 'Bourg-en-Bresse (préf. dép.)', '1790', '4', 'Belley Bourg-en-Bresse Gex Nantua', 'Communes [ 7 ] (nb)', 'Cantons [ 7 ] (nb)', 'Circ. législ. (nb)', 'Région administrative [ 7 ]', 'Superficie cadastrale (en km 2 )', 'Population (année)', 'Densité (en hab./km 2 )', 'Situation', '392']

Row 4 (length=14):
  [0] Code: '02'
  [10] Area: '7 361,7 [ r 2 ]'
  [11] Pop:  '523 342 ( 2023 )'
  [12] Dens: '71,1'
  Full row: ['02', 'Aisne (f.s.)', 'Laon (préf. dép.)', '1790', '5', 'Château-Thierry Laon Saint-Quentin Soissons Vervins', '798', '21', '5', 'Hauts-de-France', '7 361,7 [ r 2 ]', '523 342 ( 2023 )', '71,1', '']

Row 5 (length=14):
  [0] Code: '03'
  [10] Area: '7 340,1 [ r 3 ]'
  [11] Pop:  '333 298 ( 2023 )'
  [12] Dens: '45,4'
  Full row: ['03', 'Allier (m.s.)', 'Moulins (préf. 

In [37]:
# FIXED HELPER FUNCTIONS (SIMPLE & ROBUST)
# ════════════════════════════════════════════════════════════

def clean_text(x: str) -> str:
    """Remove non-breaking spaces and footnote markers"""
    if x is None:
        return ""
    x = x.replace("\xa0", " ")  # Non-breaking spaces
    x = re.sub(r"\[\d+\]", "", x)  # Remove footnotes like [1], [2]
    return x.strip()

def parse_float_fr(text: str):
    """Parse French number format (comma as decimal separator)"""
    s = clean_text(text)
    # Keep only the text before a parenthesis
    s = s.split('(')[0]
    # Remove footnote markers like [ r 2 ]
    s = re.sub(r"\[[^\]]*\]", "", s)
    # Keep only digits, dots, commas, spaces
    s = re.sub(r"[^\d,\. ]", "", s).replace(" ", "")
    s = s.replace(",", ".")  # French comma → English decimal
    if s == "":
        return None
    try:
        return float(s)
    except ValueError:  # e.g. several separators left in the string
        return None

def parse_int_fr(text: str):
    """Parse French integer (with spaces as thousands separator)"""
    s = clean_text(text)
    # Keep only the text before a parenthesis
    s = s.split('(')[0]  # '523 342 ( 2023 )' → '523 342 '
    # Remove all non-digits (including spaces)
    s = re.sub(r"[^\d]", "", s)  # '523 342 ' → '523342'
    if s == "":
        return None
    try:
        return int(s)
    except ValueError:
        return None

def normalize_dept(code: str) -> str:
    """Normalize department code (01, 02, etc.)"""
    code = str(code).strip().upper()
    if code.isdigit() and len(code) == 1:
        return code.zfill(2)  # 1 → 01
    return code
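
A quick sanity check of these cleaning rules on strings copied from the raw output above. To keep the snippet runnable on its own, it uses a condensed stand-in, `parse_fr`, which is a hypothetical helper that applies the same steps as `parse_float_fr` and `parse_int_fr`. It is a sketch, not a replacement for them.

```python
import re


def parse_fr(text, as_int=False):
    """Condensed stand-in for parse_float_fr / parse_int_fr (hypothetical)."""
    s = text.replace("\xa0", " ").split("(")[0]       # drop "( 2023 )" suffixes
    s = re.sub(r"\[[^\]]*\]", "", s)                  # drop notes such as "[ r 2 ]"
    s = re.sub(r"[^\d,.]", "", s).replace(",", ".")   # keep digits, comma -> dot
    if not s:
        return None
    try:
        return int(s) if as_int else float(s)
    except ValueError:
        return None


samples = {
    "7 361,7 [ r 2 ]": parse_fr("7 361,7 [ r 2 ]"),
    "523 342 ( 2023 )": parse_fr("523 342 ( 2023 )", as_int=True),
    "71,1": parse_fr("71,1"),
    "Population (année)": parse_fr("Population (année)", as_int=True),
}
for raw, value in samples.items():
    print(f"{raw!r} -> {value!r}")
```

The last sample is a header string like the ones that leaked into the first department row. It parses to `None`, so any row containing it is skipped by the extraction step.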

In [38]:
#  Extract dept_code + area_km2


# Pattern for department codes: two digits (e.g. 01-95), 2A/2B, or three digits (overseas, e.g. 971-976)
dept_re = re.compile(r"^(?:\d{2}|2A|2B|\d{3})$")

rows = []

for i, r in enumerate(grid):
    # Skip first 3 rows (headers)
    if i < 3:
        continue

    # Skip rows that are too short
    if len(r) <= max(code_idx, area_idx, pop_idx, dens_idx):
        continue

    # Extract department code
    dept_code = normalize_dept(clean_text(r[code_idx]).upper())
    
    # Skip if not a valid department code
    if not dept_re.fullmatch(dept_code):
        continue

    # Parse numeric values
    area_raw = parse_float_fr(r[area_idx])
    pop_raw  = parse_int_fr(r[pop_idx])
    dens_raw = parse_float_fr(r[dens_idx])


    # Skip rows where any value failed to parse (this also drops misaligned rows)
    if area_raw is None or pop_raw is None or dens_raw is None:
        continue
  
    # Add to rows 
    rows.append((dept_code, area_raw, pop_raw, dens_raw))




In [39]:
# Create DataFrame
df = pd.DataFrame(rows, columns=["dept_code", "area_km2_raw", "population_raw", "density_raw"])

print(f"Extracted {len(df)} department records")
print(f"\nFirst 10 departments:")
print(df.head(10))


print(f"\nLast 5 departments:")
print(df.tail(5))

Extracted 100 department records

First 10 departments:
  dept_code  area_km2_raw  population_raw  density_raw
0        02        7361.7          523342         71.1
1        03        7340.1          333298         45.4
2        04        6925.2          168054         24.3
3        05        5548.7          143467         25.9
4        06        4298.6         1128418        262.5
5        07        5528.6          334231         60.5
6        08        5229.4          265893         50.8
7        09        4889.9          155722         31.8
8        10        6004.2          310447         51.7
9        11        6139.0          379648         61.8

Last 5 departments:
   dept_code  area_km2_raw  population_raw  density_raw
95       971        1628.4          384160        235.9
96       972        1128.0          360630        319.7
97       973       83533.9          293996          3.5
98       974        2503.7          889679        355.3
99       976         376.0          25
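
The preview above starts at 02, which suggests that department 01 was dropped when its misaligned row failed to parse. A small completeness check can surface this kind of silent loss. The sketch below assumes the usual list of codes (01 to 95 without 20, plus 2A, 2B and five overseas departments). `expected_dept_codes` and `missing_codes` are hypothetical helpers, and the codes actually used in the Wikipedia table may differ.

```python
def expected_dept_codes():
    """Codes assumed here: 01-95 without 20, plus 2A/2B and five overseas codes."""
    metro = [f"{n:02d}" for n in range(1, 96) if n != 20]
    return set(metro) | {"2A", "2B"} | {"971", "972", "973", "974", "976"}


def missing_codes(found):
    """Return the expected codes that are absent from `found`, sorted."""
    return sorted(expected_dept_codes() - set(found))


# Toy input: pretend the scraper returned every code except 01.
found = expected_dept_codes() - {"01"}
print(len(expected_dept_codes()), missing_codes(found))
```

In the notebook, the call would be `missing_codes(df["dept_code"])`.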

### Data cleaning

In [40]:



# Remove duplicates (keep median value)
df = df.groupby("dept_code", as_index=False).median(numeric_only=True)

# Calculate area from population/density (for validation)
df["area_from_pop_density"] = df["population_raw"] / df["density_raw"]

# Ratio of scraped area to derived area (close to 1 when consistent)
df["ratio_raw_vs_calc"] = df["area_km2_raw"] / df["area_from_pop_density"]

# Flag inconsistent rows (ratio < 0.5 or > 2.0); NaN ratios are not flagged
ratio = df["ratio_raw_vs_calc"]
df["flag_inconsistent"] = (ratio < 0.5) | (ratio > 2.0)

# Final area: use raw UNLESS inconsistent
df["area_km2"] = df["area_km2_raw"]
mask_fix = df["flag_inconsistent"] & df["area_from_pop_density"].notna()
df.loc[mask_fix, "area_km2"] = df.loc[mask_fix, "area_from_pop_density"]


print(f"   Unique departments: {df['dept_code'].nunique()}")
print(f"   Corrected rows: {int(mask_fix.sum())}")
print(f"   Area min: {df['area_km2'].min():.0f} km²")
print(f"   Area max: {df['area_km2'].max():.0f} km²")

# Show flagged departments
if mask_fix.sum() > 0:
    print(f"\n Corrected departments:")
    print(df[mask_fix][['dept_code', 'area_km2_raw', 'area_km2']])

   Unique departments: 100
   Corrected rows: 0
   Area min: 105 km²
   Area max: 83534 km²
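
The check relies on density = population / area, so population / density should land close to the scraped area. Below is a worked example with the values shown earlier for department 02. `is_consistent` is a hypothetical helper that mirrors the 0.5 to 2.0 bounds used in the cell above.

```python
def is_consistent(area_km2, population, density, low=0.5, high=2.0):
    """True when scraped area and population/density agree within [low, high]."""
    ratio = area_km2 / (population / density)
    return low <= ratio <= high


# Department 02, values from the extracted preview
area, pop, dens = 7361.7, 523342, 71.1
derived = pop / dens
print(round(derived, 1), round(area / derived, 4), is_consistent(area, pop, dens))
```

The bounds are deliberately wide: the published density is rounded and may refer to a different population year, so only gross parsing errors (for example a lost decimal separator) should trigger a correction.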


In [42]:
#  Save processed dataset 
out_csv = PROC_DIR / "departments_area_wiki.csv"
df.to_csv(out_csv, index=False, encoding="utf-8")
print("Saved:", out_csv)


Saved: C:\Users\mmouw\Bureau\ironhack\cultural-infrastructure-index-fr\data\processed\web_scraping\departments_area_wiki.csv
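
When the CSV is read back, pandas infers column types. A code column that contains only digits is read as integers and loses its leading zeros. The full file also contains 2A and 2B, so the column would probably stay text, but passing `dtype` explicitly removes the dependence on that. The sketch below uses an in-memory stand-in with two rows from the preview above instead of the real file.

```python
import io

import pandas as pd

# Stand-in for departments_area_wiki.csv (two rows taken from the preview)
csv_text = "dept_code,area_km2\n02,7361.7\n03,7340.1\n"

naive = pd.read_csv(io.StringIO(csv_text))
safe = pd.read_csv(io.StringIO(csv_text), dtype={"dept_code": str})

print(naive["dept_code"].tolist())  # codes parsed as integers
print(safe["dept_code"].tolist())   # codes kept as text
```

Keeping `dept_code` as text matters for later joins with other datasets keyed on the same codes.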
