# SMAC Data preparation - French cultural venues references

**Project: Analysis of cultural accessibility and territorial inequalities in France**

### Research Questions:
1. Geographic Distribution: How is cultural supply distributed across French territories?
2. Urban vs Rural: Are there significant disparities between urban and rural areas?
3. Socio-Economic Correlation: Is there a relationship between territorial wealth and cultural supply?
4. Typological Diversity: What types of cultural venues exist and how are they distributed?
5. PACA Regional Focus: How does PACA compare to national averages?
6. How is public music infrastructure distributed across French departments?

## Dataset Information: SMAC Official Label (Quality Indicator)

**Source:** French Ministry of Culture - Web Scraping  

**Name:** SMAC (Scènes de Musiques Actuelles) Official Label List  

**Origin:** Ministry of Culture official web publication  

**Year:** Last update: 17 December 2018

**URL:** https://www.culture.gouv.fr/Thematiques/Musique/Organismes/Scenes-de-musiques-actuelles-Smac 

**Content:** Official list of SMAC-labelled venues (contemporary music venues with official Ministry label)  

**Estimated Count:** ~100 SMAC venues nationwide  

**Collection Method:** Web scraping of the dataset page (BeautifulSoup), then record retrieval through the data.culture.gouv.fr API  

**File Locations:**  
- Raw: `data/raw/music_ref/smac_dataset_page.html`, `data/raw/music_ref/smac_records.jsonl`  
- Processed: `data/processed/music_ref/ref_smac.csv`

**Format:** HTML / JSON → CSV

**Purpose:** Official quality indicator for music infrastructure - identifies venues with Ministry recognition

**Q6 Usage:** Cross-reference with BASILIC music venues to calculate SMAC recognition rate and identify officially labeled infrastructure
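
The cross-reference described for Q6 could be sketched as follows. This is a minimal illustration on toy data, not the project's actual join: the `venue_id` and `departement` columns are hypothetical, and the real BASILIC and SMAC tables may share no common identifier, in which case matching on name and postal code would be needed instead.

```python
import pandas as pd

# Toy stand-ins for the two tables; column names are assumptions for illustration.
venues = pd.DataFrame({
    "venue_id": [1, 2, 3, 4, 5],
    "departement": ["13", "13", "13", "75", "75"],
})
smac = pd.DataFrame({"venue_id": [2, 5]})

# Left join keeps every music venue; `_merge` records whether a SMAC row matched.
merged = venues.merge(smac, on="venue_id", how="left", indicator=True)
merged["is_smac"] = merged["_merge"] == "both"

# Recognition rate = share of venues in each department that carry the label.
recognition_rate = merged.groupby("departement")["is_smac"].mean()
print(recognition_rate)
```

With this toy input, one venue out of three is labelled in department 13 and one out of two in department 75.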

## What is a SMAC?

SMAC (Scène de musiques actuelles) is an official label awarded by the French Ministry of Culture to organisations that run a regular, professionally hosted live programme of contemporary music.
It covers a wide spectrum of genres, including rap, electronic, pop, rock, jazz, traditional music and world music.
The scheme was created in 1996 as the successor to the earlier “Cafés-musiques” programme, marking a step up in public policy support for the sector.

A SMAC is expected to deliver an artistic and cultural project of public interest, anchored in its territory.
Its central objective is to foster, support and promote musical creation, with a strong focus on developing and emerging artists through regular concerts.
SMACs also provide artist development resources such as rehearsal opportunities, creative residencies, and tailored training or advisory support.
They work with both professional and amateur musicians and encourage the development and cross-over of artistic practices.

Cultural outreach is a core deliverable: SMACs run local cultural action programmes and aim to involve residents in the venue’s artistic project.
The network is distributed across mainland France and overseas territories, in both urban and rural contexts.
The Ministry currently reports 99 SMAC-labelled structures.

In [21]:
import json  # needed by json.dumps when writing the raw JSONL file
import time
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from pathlib import Path
from datetime import datetime

In [22]:
# Scraping the dataset page with Beautiful Soup


url = "https://data.culture.gouv.fr/explore/dataset/liste-des-structures-du-spectacle-vivant-subventionnees-par-le-ministere/"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()


soup = BeautifulSoup(response.content, 'html.parser')

print("HTTP:", response.status_code)
print("Page title:", soup.title.get_text(strip=True) if soup.title else "No <title> found")
print("len(html) =", len(response.text))

HTTP: 200
Page title: Structures de la création artistique — Ministère de la Culture
len(html) = 321807
contains dataset id: True


In [23]:

DATASET_PAGE_URL = "https://data.culture.gouv.fr/explore/dataset/liste-des-structures-du-spectacle-vivant-subventionnees-par-le-ministere/"
DATASET_ID = "liste-des-structures-du-spectacle-vivant-subventionnees-par-le-ministere"
HEADERS = {"User-Agent": "Mozilla/5.0"}

def iter_strings(obj):
    """Recursively yield every string value nested in dicts and lists."""
    if isinstance(obj, str):
        yield obj
    elif isinstance(obj, dict):
        for v in obj.values():
            yield from iter_strings(v)
    elif isinstance(obj, list):
        for it in obj:
            yield from iter_strings(it)

def is_smac_record(rec: dict) -> bool:
    """Return True if any string field of the record contains 'smac' (case-insensitive)."""
    hay = " | ".join(s.lower() for s in iter_strings(rec))
    return "smac" in hay

def fetch_all_records(session: requests.Session, dataset_id: str, limit: int = 100) -> list[dict]:
    """Page through the records endpoint with limit/offset until an empty batch is returned."""
    base_url = f"https://data.culture.gouv.fr/api/explore/v2.1/catalog/datasets/{dataset_id}/records"
    out = []
    offset = 0
    while True:
        params = {"limit": limit, "offset": offset}
        r = session.get(base_url, params=params, timeout=30)
        r.raise_for_status()
        payload = r.json()
        batch = payload.get("results", [])
        # An empty page means every record has been read
        if not batch:
            break
        out.extend(batch)
        offset += limit
    return out

def get_repo_root():
    root = Path.cwd()
    if not (root / "data").exists() and (root.parent / "data").exists():
        root = root.parent
    return root

def main():
    ROOT = get_repo_root()
    RAW_DIR = ROOT / "data" / "raw" / "music_ref"
    PROC_DIR = ROOT / "data" / "processed" / "music_ref"
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    PROC_DIR.mkdir(parents=True, exist_ok=True)

    extracted_at = datetime.now().strftime("%Y-%m-%d")

    with requests.Session() as s:
        resp = s.get(DATASET_PAGE_URL, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        (RAW_DIR / "smac_dataset_page.html").write_text(resp.text, encoding="utf-8")

        soup = BeautifulSoup(resp.text, "html.parser")
        print("Page title:", soup.title.get_text(strip=True) if soup.title else "no-title")
        print("Saved HTML:", RAW_DIR / "smac_dataset_page.html")

       
        records = fetch_all_records(s, DATASET_ID, limit=100)
        print("Records fetched:", len(records))

        smac = [r for r in records if is_smac_record(r)]
        print("SMAC records:", len(smac))

        # raw json
        raw_jsonl = RAW_DIR / "smac_records.jsonl"
        with raw_jsonl.open("w", encoding="utf-8") as f:
            for rec in smac:
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")
        print("Saved raw JSONL:", raw_jsonl)

        # Table
        df = pd.json_normalize(smac)

        wanted = [
            "nom", "nom_structure", "nom_du_lieu", "denomination", "appellation",
            "adresse", "adresse_complete", "code_postal", "cp", "commune", "ville", "departement", "region",
            "site", "site_web", "url", "telephone", "email",
            "latitude", "longitude", "geo_point_2d.lat", "geo_point_2d.lon",
            "type", "categorie", "label", "libelle",
        ]
        cols = [c for c in wanted if c in df.columns]
        if cols:
            df = df[cols]

        out_csv = PROC_DIR / "ref_smac.csv"
        df.to_csv(out_csv, index=False, encoding="utf-8")
        print("Saved:", out_csv)
        print("Columns:", list(df.columns)[:30])

main()

Page title: Structures de la création artistique — Ministère de la Culture
Saved HTML: C:\Users\mmouw\Bureau\ironhack\cultural-infrastructure-index-fr\data\raw\music_ref\smac_dataset_page.html
Records fetched: 373
SMAC records: 97
Saved raw JSONL: C:\Users\mmouw\Bureau\ironhack\cultural-infrastructure-index-fr\data\raw\music_ref\smac_records.jsonl
Saved: C:\Users\mmouw\Bureau\ironhack\cultural-infrastructure-index-fr\data\processed\music_ref\ref_smac.csv
Columns: ['cp', 'ville', 'region', 'latitude', 'longitude']
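
The filter above keeps any record in which the substring `smac` appears in any string field, which could also match unrelated words. A stricter variant is sketched below as an option, not as the notebook's method: `is_smac_record_strict` and the toy records are hypothetical, and neither version catches a record that only spells out "Scène de musiques actuelles".

```python
import re

def iter_strings(obj):
    """Recursively yield every string value nested in dicts and lists."""
    if isinstance(obj, str):
        yield obj
    elif isinstance(obj, dict):
        for v in obj.values():
            yield from iter_strings(v)
    elif isinstance(obj, list):
        for it in obj:
            yield from iter_strings(it)

# Match "smac" only as a whole word, in any letter case.
SMAC_WORD = re.compile(r"\bsmac\b", re.IGNORECASE)

def is_smac_record_strict(rec: dict) -> bool:
    return any(SMAC_WORD.search(s) for s in iter_strings(rec))

def is_smac_record_substring(rec: dict) -> bool:
    # Same logic as the notebook's filter, repeated here for comparison.
    return "smac" in " | ".join(s.lower() for s in iter_strings(rec))

# Invented records, for illustration only.
labelled = {"label": ["SMAC"], "ville": "Marseille"}
lookalike = {"nom": "Smackdown Club"}
other = {"label": ["Scène nationale"]}

for rec in (labelled, lookalike, other):
    print(is_smac_record_substring(rec), is_smac_record_strict(rec))
```

Comparing the two counts on the real records would show whether the substring match admits any false positives.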


In [24]:
# Explore the page structure with BeautifulSoup

# Page title
print(f"Page title: {soup.title.get_text(strip=True)}")

# Titles (h1, h2, h3)
print("\nTitles:")
for heading in soup.find_all(['h1', 'h2', 'h3']):
    print(f"  {heading.name}: {heading.get_text(strip=True)[:80]}")

# tables
tables = soup.find_all('table')
print(f"\nTables: {len(tables)}")

# Lists (ul, ol)
lists = soup.find_all(['ul', 'ol'])
print(f"Lists: {len(lists)}")

# First paragraph
first_p = soup.find('p')
if first_p:
    print("\nFirst paragraph:")
    print(f"  {first_p.get_text(strip=True)[:200]}...")

Page title: Structures de la création artistique — Ministère de la Culture

Titles:
  h2: {{ ctx.nhits | number }}record
  h2: Active filters
  h2: Filters
  h1: Structures de la création artistique
  h3: Dataset schema

Tables: 0
Lists: 3

First paragraph:
  Liste et localisation des structures de la création artistique subventionnées par le ministère de la Culture....
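
The `{{ ctx.nhits | number }}` heading and the absence of any `<table>` in the output above suggest that the page fills in its records client-side, which would explain why the records are read from the API rather than from the HTML. A small check for such unrendered template placeholders might look like this; the helper name `unrendered_placeholders` is hypothetical, and the heading list is copied from the output above.

```python
import re

# Double-brace expressions left in the static HTML usually mean a script renders them later.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")

def unrendered_placeholders(texts):
    """Return the strings that still contain a {{ ... }} template expression."""
    return [t for t in texts if PLACEHOLDER.search(t)]

headings = [
    "{{ ctx.nhits | number }}record",
    "Active filters",
    "Filters",
    "Structures de la création artistique",
    "Dataset schema",
]

flagged = unrendered_placeholders(headings)
print(flagged)
```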


In [25]:
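def get_dataset_id(html: str) -> str:
    """Return the text of the first <code> element that looks like a dataset slug."""
    soup = BeautifulSoup(html, "html.parser")
    for code in soup.select("code"):
        txt = code.get_text(strip=True)
        # Dataset identifiers are lowercase slugs: letters, digits and hyphens, 10+ characters
        if re.fullmatch(r"[a-z0-9\-]{10,}", txt):
            return txt
    raise RuntimeError("No dataset identifier found in a <code> element")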
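
To see what the slug pattern in `get_dataset_id` accepts, it can be tried on a few strings. The candidates below are invented for illustration, apart from the dataset identifier itself. The pattern accepts any lowercase slug of ten or more characters, so the function relies on the dataset identifier being the first such `<code>` element on the page.

```python
import re

# Same pattern as in get_dataset_id: lowercase letters, digits, hyphens, 10+ characters.
SLUG = re.compile(r"[a-z0-9\-]{10,}")

candidates = [
    "liste-des-structures-du-spectacle-vivant-subventionnees-par-le-ministere",
    "geo_point_2d",                # rejected: contains underscores
    "short-id",                    # rejected: fewer than 10 characters
    "Not A Slug",                  # rejected: uppercase letters and spaces
    "some-other-long-identifier",  # accepted, although it is not a dataset ID
]

matches = [c for c in candidates if SLUG.fullmatch(c)]
print(matches)
```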

In [26]:
def iter_strings(obj):
    
    if isinstance(obj, str):
        yield obj
    elif isinstance(obj, dict):
        for v in obj.values():
            yield from iter_strings(v)
    elif isinstance(obj, list):
        for it in obj:
            yield from iter_strings(it)

In [27]:
def is_smac_record(rec: dict) -> bool:
    hay = " | ".join(s.lower() for s in iter_strings(rec))
    return "smac" in hay


In [28]:

def main():
    with requests.Session() as s:
        # Fetch the dataset page and read the dataset ID from its HTML (BeautifulSoup)
        resp = s.get(DATASET_PAGE_URL, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        dataset_id = get_dataset_id(resp.text)
        print("Dataset ID:", dataset_id)

        # Retrieve every record through the API
        records = fetch_all_records(s, dataset_id, limit=100)
        print("Records fetched:", len(records))

        # Keep only SMAC records
        smac = [r for r in records if is_smac_record(r)]
        print("SMAC records:", len(smac))

        # Flatten the records into a table
        df = pd.json_normalize(smac)

        wanted = [
            # names
            "nom", "nom_structure", "nom_du_lieu", "denomination", "appellation",
            # location
            "adresse", "adresse_complete", "code_postal", "cp", "commune", "ville", "departement", "region",
            # contact
            "site", "site_web", "url", "telephone", "email",
            # geolocation
            "latitude", "longitude", "geo_point_2d.lat", "geo_point_2d.lon",
            # typology
            "type", "categorie", "label", "libelle",
        ]
        cols = [c for c in wanted if c in df.columns]
        if cols:
            df = df[cols]

        df.to_csv("smac.csv", index=False, encoding="utf-8")
        print("Columns of export -> smac.csv :", list(df.columns)[:30])

if __name__ == "__main__":
    main()


Dataset ID: liste-des-structures-du-spectacle-vivant-subventionnees-par-le-ministere
Records fetched: 373
SMAC records: 97
Columns of export -> smac.csv : ['cp', 'ville', 'region', 'latitude', 'longitude']
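
The offset-based pagination used by `fetch_all_records` can be exercised offline. The sketch below replaces the HTTP call with a fake page function, so `fetch_all` and `make_fake_api` are illustrative stand-ins rather than part of the notebook. It reproduces the 373-record case shown in the output above.

```python
def fetch_all(fetch_page, limit=100):
    """Collect pages until the source returns an empty batch."""
    out = []
    offset = 0
    while True:
        batch = fetch_page(limit, offset)
        if not batch:
            break
        out.extend(batch)
        offset += limit
    return out

def make_fake_api(total):
    """Build a page function over `total` dummy records and log the offsets requested."""
    data = list(range(total))
    calls = []

    def fetch_page(limit, offset):
        calls.append(offset)
        return data[offset:offset + limit]

    return fetch_page, calls

fetch_page, calls = make_fake_api(373)
records = fetch_all(fetch_page, limit=100)
print(len(records), calls)  # 373 [0, 100, 200, 300, 400]
```

The loop only stops after one extra request that returns nothing (offset 400 here), which is the price of not relying on a total count in the response.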
