# SMAC Data preparation - French cultural venues references

**Project: Analysis of cultural accessibility and territorial inequalities in France**

### Research Questions:
1. Geographic Distribution: How is cultural supply distributed across French territories?
2. Urban vs Rural: Are there significant disparities between urban and rural areas?
3. Socio-Economic Correlation: Is there a relationship between territorial wealth and cultural supply?
4. Typological Diversity: What types of cultural venues exist and how are they distributed?
5. PACA Regional Focus: How does PACA compare to national averages?
6. How is public music infrastructure distributed across French departments?

## Dataset Information: SMAC Official Label (Quality Indicator)

**Source:** French Ministry of Culture - Web Scraping  

**Name:** SMAC (Scènes de Musiques Actuelles) Official Label List  

**Origin:** Ministry of Culture official web publication  

**Year:** LAst update : 17 décembre 2018

**URL:** https://www.culture.gouv.fr/Thematiques/Musique/Organismes/Scenes-de-musiques-actuelles-Smac 

**Content:** Official list of SMAC-labelled venues (contemporary music venues with official Ministry label)  

**Estimated Count:** ~100 SMAC venues nationwide  

**Collection Method:** Web scraping (BeautifulSoup)  

**File Locations:**  
- Raw: `data/raw/music_ref/smac_page.html`  
- Processed: `data/processed/music_ref/ref_smac.csv`
-   
**Format:** HTML → CSV

**Purpose:** Official quality indicator for music infrastructure - identifies venues with Ministry recognition

**Q6 Usage:** Cross-reference with BASILIC music venues to calculate SMAC recognition rate and identify officially labeled infrastructure

## What is a SMAC?

SMAC (Scène de musiques actuelles) is an official label awarded by the French Ministry of Culture to organisations that run a regular, professionally hosted live programme of contemporary music.
It covers a wide spectrum of genres, including rap, electronic, pop, rock, jazz, traditional music and world music.
The scheme was created in 1996 as the successor to the earlier “Cafés-musiques” programme, marking a step up in public policy support for the sector.

A SMAC is expected to deliver an artistic and cultural project of public interest, anchored in its territory.
Its central objective is to foster, support and promote musical creation, with a strong focus on developing and emerging artists through regular concerts.
SMACs also provide artist development resources such as rehearsal opportunities, creative residencies, and tailored training or advisory support.
They work with both professional and amateur musicians and encourage the development and cross-over of artistic practices.

Cultural outreach is a core deliverable: SMACs run local cultural action programmes and aim to involve residents in the venue’s artistic project.
The network is distributed across mainland France and overseas territories, in both urban and rural contexts.
The Ministry currently reports 99 SMAC-labelled structures.

## SMAC dataset
We built a SMAC reference dataset to measure “official recognition” in our music infrastructure analysis and to enrich the project with a reliable external source.

**Method (web service / API extraction):** we automatically collect records from the Ministry of Culture’s open-data portal (Opendatasoft) using its REST API (/api/explore/v2.1/.../records). We load all pages of results (pagination), normalize the JSON payload into a tabular format, and filter rows where the field structure equals “SMAC” (official “Scènes de musiques actuelles”). We then standardize a minimal schema (name, city, postal code, region, coordinates) and derive dept_code from the postal code to enable aggregation at department level. The final output is stored as ref_smac.csv and the raw extraction is archived for traceability and reproducibility.

**Why this approach:** BASILIC provides broad coverage of cultural venues, but it does not explicitly encode the official SMAC label in a stable, auditable way. A dedicated SMAC reference list gives a clean, authoritative indicator of “recognized music infrastructure” and avoids unreliable keyword-only tagging.

**How it will be used:** the SMAC reference dataset supports Research Question 6 by computing, per department:  the number of departments hosting at least one SMAC, the distribution of SMAC counts, and an approximate “SMAC share” compared to keyword-based music venues. 

It also feeds the project API endpoint (music infrastructure stats) and allows consistent reporting and visualization (maps/rankings) across departments.

In [4]:
#from __future__ import annotations

import json
from pathlib import Path
from datetime import datetime

import pandas as pd
import requests


In [5]:
DATASET_ID = "liste-des-structures-du-spectacle-vivant-subventionnees-par-le-ministere"
BASE_URL = f"https://data.culture.gouv.fr/api/explore/v2.1/catalog/datasets/{DATASET_ID}/records"

HEADERS = {"User-Agent": "Mozilla/5.0 (RNCP project; contact: github repo)"}

def get_repo_root() -> Path:
    # Works whether you run from repo root or from /notebooks
    root = Path.cwd()
    if not (root / "data").exists() and (root.parent / "data").exists():
        root = root.parent
    return root

ROOT = get_repo_root()
RAW_DIR = ROOT / "data" / "raw" / "music_ref"
PROC_DIR = ROOT / "data" / "processed" / "music_ref"
RAW_DIR.mkdir(parents=True, exist_ok=True)
PROC_DIR.mkdir(parents=True, exist_ok=True)




In [6]:
def fetch_all_records(session: requests.Session, where: str | None = None, limit: int = 100) -> list[dict]:
    """Fetch all records with pagination from Opendatasoft Explore API v2.1."""
    out = []
    offset = 0

    while True:
        params = {"limit": limit, "offset": offset}
        if where:
            params["where"] = where

        r = session.get(BASE_URL, params=params, headers=HEADERS, timeout=30)
        r.raise_for_status()
        payload = r.json()

        batch = payload.get("results", [])
        if not batch:
            break

        out.extend(batch)
        offset += limit

    return out


In [7]:
def normalize_postal_code(cp: str | float | None) -> str | None:
    if cp is None or (isinstance(cp, float) and pd.isna(cp)):
        return None
    s = str(cp).strip()
    m = pd.Series([s]).astype(str).str.extract(r"(\d{5})", expand=False).iloc[0]
    return m if isinstance(m, str) else None

def cp_to_dept(cp5: str | None) -> str | None:
    """France dept code from postal code.
    - DOM: 971..989 => first 3 digits
    - Corsica: 200xx/201xx => 2A, 202xx-206xx => 2B
    - Metropolitan: first 2 digits
    """
    if not cp5:
        return None
    if cp5.startswith(("97","98")):
        return cp5[:3]
    if cp5.startswith("20"):
        first3 = int(cp5[:3])
        return "2A" if first3 <= 201 else "2B"
    return cp5[:2]


In [8]:
# Download all records (light dataset) ; then filter SMAC with Pandas
with requests.Session() as s:
    records = fetch_all_records(s, where=None, limit=100)
print("Records fetched:", len(records))

Records fetched: 373


In [9]:


# Save a raw sample for auditability (RNCP traceability)
sample_path = RAW_DIR / "smac_api_response_sample.json"
sample_path.write_text(json.dumps(records[:5], ensure_ascii=False, indent=2), encoding="utf-8")
print("Saved sample JSON:", sample_path)


Saved sample JSON: C:\Users\mmouw\Bureau\ironhack\cultural-infrastructure-index-fr\data\raw\music_ref\smac_api_response_sample.json


In [10]:
df = pd.json_normalize(records)
print("Raw columns:", df.columns.tolist())
print(df.head(3).T)


Raw columns: ['structure', 'nom1', 'nom2', 'adresse1', 'adresse2', 'cp', 'ville', 'longitude', 'latitude', 'coordonnees_geoloc', 'coordonnees_ban', 'region', 'coordonnees_finales.lon', 'coordonnees_finales.lat', 'coordonnees_finales']
                                                     0  \
structure                              Scène Nationale   
nom1                     L’Hexagone  - Scène Nationale   
nom2                                              None   
adresse1                        24, rue des Aiguinards   
adresse2                                          None   
cp                                             38240.0   
ville                                           MEYLAN   
longitude                                     5.764283   
latitude                                     45.205983   
coordonnees_geoloc                  45.205983,5.764283   
coordonnees_ban                    45.206601, 5.763817   
region                            Auvergne-Rhône-Alpes   
coordonnees

In [11]:
# Clean / standardize columns
df.columns = [c.strip().lower() for c in df.columns]

# Identify the column that contains the label/type (usually 'structure')
# If the dataset changes, this auto-detect helps.
structure_col = None
for c in df.columns:
    if "structure" in c:
        structure_col = c
        break

if structure_col is None:
    raise RuntimeError("Cannot find a 'structure' column in the dataset. Check df.columns above.")

# Keep only SMAC + SMAC in labellisation
target_labels = {"SMAC", "SMAC en cours de labellisation"}
df_smac = df[df[structure_col].astype(str).str.strip().isin(target_labels)].copy()

print("SMAC rows:", len(df_smac))
print("Label values:", df_smac[structure_col].value_counts(dropna=False).head(10))


SMAC rows: 97
Label values: structure
SMAC                              80
SMAC en cours de labellisation    17
Name: count, dtype: int64


In [12]:
# Build the clean reference table

# Common field names in this dataset (can evolve).
# We keep a robust mapping: pick the first existing column in each group.
def pick_col(candidates: list[str]) -> str | None:
    for c in candidates:
        if c in df_smac.columns:
            return c
    return None

col_name_1 = pick_col(["nom1", "nom", "nom_du_lieu", "denomination"])
col_name_2 = pick_col(["nom2", "nom_structure", "operateur", "gestionnaire"])
col_cp     = pick_col(["cp", "code_postal", "postal_code"])
col_city   = pick_col(["ville", "commune", "city"])
col_region = pick_col(["region", "nom_region", "region_name"])
col_lat    = pick_col(["latitude", "lat", "geo_point_2d.lat"])
col_lon    = pick_col(["longitude", "lon", "geo_point_2d.lon"])

out = pd.DataFrame({
    "smac_status": df_smac[structure_col].astype(str).str.strip(),
    "smac_name": df_smac[col_name_1] if col_name_1 else None,
    "operator_name": df_smac[col_name_2] if col_name_2 else None,
    "cp": df_smac[col_cp] if col_cp else None,
    "city": df_smac[col_city] if col_city else None,
    "region": df_smac[col_region] if col_region else None,
    "latitude": df_smac[col_lat] if col_lat else None,
    "longitude": df_smac[col_lon] if col_lon else None,
})

out["cp"] = out["cp"].apply(normalize_postal_code)
out["dept_code"] = out["cp"].apply(cp_to_dept)

# Minimal cleanup
out["smac_name"] = out["smac_name"].astype(str).str.strip().replace({"nan": None})
out["operator_name"] = out["operator_name"].astype(str).str.strip().replace({"nan": None})
out["city"] = out["city"].astype(str).str.strip().replace({"nan": None})
out["region"] = out["region"].astype(str).str.strip().replace({"nan": None})

# Drop rows with no name or no dept_code (should be very rare)
out = out.dropna(subset=["smac_name", "dept_code"]).drop_duplicates()

print("Final rows:", len(out))
print(out.head(10))


Final rows: 91
                       smac_status  \
1                             SMAC   
3                             SMAC   
4                             SMAC   
12                            SMAC   
13                            SMAC   
17                            SMAC   
23                            SMAC   
29  SMAC en cours de labellisation   
34                            SMAC   
35  SMAC en cours de labellisation   

                                            smac_name  \
1                                              Le 106   
3                                        Le Pannonica   
4                                             La Clef   
12                              La Coopérative de Mai   
13                              Les Passagers du Zinc   
17                        Des lendemains qui chantent   
23                                     Le Noumatrouff   
29                                          Le Tétris   
34  SMAC Aire Urbaine Montbéliard Belfort : Le Mol...

In [13]:
# Export
out_csv = PROC_DIR / "ref_smac.csv"
out.to_csv(out_csv, index=False, encoding="utf-8")
print("Saved:", out_csv)
print("Columns:", out.columns.tolist())


Saved: C:\Users\mmouw\Bureau\ironhack\cultural-infrastructure-index-fr\data\processed\music_ref\ref_smac.csv
Columns: ['smac_status', 'smac_name', 'operator_name', 'cp', 'city', 'region', 'latitude', 'longitude', 'dept_code']
