
# 05 · Comparison: **Statista** vs. **Sample** (2025)

This notebook compares the **Statista time series (2021–2024)** with the **2025 sample**:

**A. Level comparison**  
- Statista **2024** (harmonized to the canonical category list, KANON) vs. sample **2025** (share of respondents per category).  
- Output: rank correlation, deviations in percentage points, dumbbell plot.

**B. Dynamics comparison**  
- Statista **Δ 2021→2024** in **pp** vs. the **net shift** in the 2025 sample (= "more often" minus "less often").  
- Output: sign agreement, correlation, scatter plot, joint ranking.

> **Note (methodology):** The Statista values are **shares of online purchases**; the sample uses the **share of respondents** who named a category (currently / more often / less often). The levels are therefore *not directly* comparable; we compare **structures/ranks** and **directions of change**.
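
The two measures used in part B can be sketched as follows. This is a minimal illustration with made-up category names and shares (`cat_a`, `cat_b`, `cat_c` are assumptions, not project data); it only shows how the net shift and the sign agreement are computed.

```python
import numpy as np
import pandas as pd

# Hypothetical shares in % for three made-up categories (not project data)
more = pd.Series({"cat_a": 30.0, "cat_b": 10.0, "cat_c": 20.0})     # "more often"
less = pd.Series({"cat_a": 10.0, "cat_b": 25.0, "cat_c": 20.0})     # "less often"
stat_delta = pd.Series({"cat_a": 5.0, "cat_b": 3.0, "cat_c": 0.0})  # Statista change in pp

# Net shift: positive when more respondents answer "more often" than "less often"
net = more - less

# Sign agreement: share of categories where both series point in the same direction
agreement = (np.sign(stat_delta) == np.sign(net)).mean()

print(net.to_dict())
print(f"sign agreement: {agreement:.2f}")
```

Note that `np.sign` returns 0 for a value of exactly 0, so a category with no change in one series only counts as agreeing if the other series is also exactly 0.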


In [29]:

# 05_vergleich · Cell 1: imports, paths, config
from __future__ import annotations

import json
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from scipy.stats import spearmanr

# Project paths: when run from notebooks/, the project root is one level up
NB_DIR  = Path.cwd().resolve()
BASE    = NB_DIR.parents[0] if NB_DIR.name.lower() == "notebooks" else NB_DIR
DATA    = BASE / "data"
RAW     = DATA / "raw"
OUT     = DATA / "processed"
REPORTS = BASE / "reports"
FIG     = REPORTS / "figures"
for p in (OUT, FIG):
    p.mkdir(parents=True, exist_ok=True)

CONFIG = json.loads((OUT / "project_config.json").read_text(encoding="utf-8"))
KANON  = CONFIG["kanon"]

print("BASE  :", BASE)
print("OUT   :", OUT)
print("FIG   :", FIG)


BASE  : D:\Q3_2025\data-analytics\project
OUT   : D:\Q3_2025\data-analytics\project\data\processed
FIG   : D:\Q3_2025\data-analytics\project\reports\figures


In [30]:

# 05_vergleich · Cell 2: helpers & plot style
def read_any(path_csv: Path, path_xlsx: Path | None = None, sheet: int | str = 0) -> pd.DataFrame:
    '''Read the CSV if it exists, otherwise the XLSX (first sheet or `sheet`).

    Note: Cell 3 redefines read_any with a single-path signature, which replaces this version.
    '''
    if path_csv and path_csv.exists():
        return pd.read_csv(path_csv)
    if path_xlsx and path_xlsx.exists():
        return pd.read_excel(path_xlsx, sheet_name=sheet)
    raise FileNotFoundError(f"File missing: {path_csv if path_csv else path_xlsx}")

plt.rcParams.update({
    "figure.dpi": 120,
    "savefig.dpi": 300,
    "font.size": 11,
    "axes.titlesize": 12,
    "axes.labelsize": 11,
    "axes.grid": True,
    "grid.alpha": 0.2,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "figure.autolayout": False,
})

def percent_axis(ax, axis="x", decimals=0, limit=(0, 100), pad_pct=0):
    """Format one axis as percent (values are already on a 0-100 scale) and set its limits."""
    lo, hi = limit
    hi_padded = hi + float(pad_pct)
    fmt = mtick.PercentFormatter(xmax=100, decimals=decimals)
    if axis == "x":
        ax.set_xlim(lo, hi_padded)
        ax.xaxis.set_major_formatter(fmt)
    else:
        ax.set_ylim(lo, hi_padded)
        ax.yaxis.set_major_formatter(fmt)

def label_hbars_right(ax, bars, decimals=1, dx=1.0):
    """Write each horizontal bar's value to the right of the bar."""
    for rect in bars:
        v = rect.get_width()
        ax.text(v + dx, rect.get_y() + rect.get_height()/2,
                f"{v:.{decimals}f} %", va="center", ha="left", fontsize=9, clip_on=False)

def save_fig(path: Path):
    """Save the current figure with a white background, then close it."""
    plt.savefig(path, bbox_inches="tight", pad_inches=0.25, facecolor="white")
    plt.close()


In [31]:
# 05_vergleich · Cell 3 (robust): load data

from pathlib import Path
import pandas as pd
import numpy as np

def read_any(path: Path | str, sheet: str | int | None = None) -> pd.DataFrame:
    """Read CSV or Excel; for CSV, try utf-8-sig, then cp1252, then latin1."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(p)
    if p.suffix.lower() in {".xlsx", ".xls"}:
        # sheet_name=None would return a dict of all sheets, so default to the first sheet
        return pd.read_excel(p, sheet_name=0 if sheet is None else sheet)
    # CSV: try encodings in order; latin1 accepts any byte sequence, so it comes last
    for enc in ("utf-8-sig", "cp1252", "latin1"):
        try:
            return pd.read_csv(p, encoding=enc)
        except UnicodeDecodeError:
            continue
    return pd.read_csv(p, encoding="latin1")

def first_existing(candidates: list[Path]) -> Path | None:
    """Return the first path in candidates that exists, else None."""
    for c in candidates:
        if c.exists():
            return c
    return None

# --- Statista
STAT_WIDE_CSV  = OUT / "statista_harmonisiert_2021_2022_2024.csv"
STAT_LONG_CSV  = OUT / "statista_long_2021_2022_2024.csv"
STAT_DELTA_CSV = OUT / "statista_delta_2021_2022_2024.csv"  # optional

stat_pivot = read_any(STAT_WIDE_CSV, None).set_index("Kategorie").reindex(KANON).fillna(0.0)
stat_all   = read_any(STAT_LONG_CSV, None)

# --- Sample (2025 level)
SAMPLE_WIDE = first_existing([
    OUT / "umfrage_2025_wide.xlsx",
    OUT / "umfrage_2025_wide.csv",
])
SAMPLE_LONG = first_existing([
    OUT / "umfrage_2025_long.xlsx",
    OUT / "umfrage_2025_long.csv",
])
SAMPLE_RPT  = first_existing([
    OUT / "umfrage_2025_reporting_table.xlsx",
    OUT / "umfrage_2025_reporting_table.csv",
])

assert SAMPLE_WIDE and SAMPLE_LONG, "umfrage_2025_wide/long are missing in data/processed/"
sample_wide = read_any(SAMPLE_WIDE).set_index("Kategorie").reindex(KANON).fillna(0.0)
sample_long = read_any(SAMPLE_LONG)
sample_rpt  = read_any(SAMPLE_RPT).set_index("Kategorie").reindex(KANON) if SAMPLE_RPT else None

# --- Changes (more often / less often)
# Try several file names and extensions; if necessary, rebuild from change_long
MORE_PATH = first_existing([
    OUT / "umfrage_2025_more_often_wide.xlsx",
    OUT / "umfrage_2025_more_often_wide.csv",
    OUT / "umfrage_2025_more_often.xlsx",
    OUT / "umfrage_2025_more_often.csv",
])
LESS_PATH = first_existing([
    OUT / "umfrage_2025_less_often_wide.xlsx",
    OUT / "umfrage_2025_less_often_wide.csv",
    OUT / "umfrage_2025_less_often.xlsx",
    OUT / "umfrage_2025_less_often.csv",
])
CHANGE_LONG_PATH = first_existing([
    OUT / "umfrage_2025_change_long.xlsx",
    OUT / "umfrage_2025_change_long.csv",
])

def build_from_change_long(p: Path) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Rebuild the more/less tables from the long-format change export."""
    ch = read_any(p)
    # Expected columns: Kategorie | Antwort (Häufiger/Seltener) | share_%
    # Normalize the answer labels so matching ignores case and surrounding whitespace
    ch["Antwort_norm"] = ch["Antwort"].astype(str).str.strip().str.lower()
    piv = ch.pivot_table(index="Kategorie", columns="Antwort_norm",
                         values="share_%", aggfunc="sum").fillna(0.0)
    more = piv.filter(regex="häu|haeu|more", axis=1).copy()
    less = piv.filter(regex="sel|less", axis=1).copy()
    if more.shape[1] == 0 and less.shape[1] == 0:
        raise ValueError("CHANGE_LONG found, but no recognizable columns for more often / less often.")
    # Name the value column 2025 (int), which is what the later cells check for
    more = more.sum(axis=1).to_frame(2025)
    less = less.sum(axis=1).to_frame(2025)
    more = more.reindex(KANON).fillna(0.0)
    less = less.reindex(KANON).fillna(0.0)
    return more, less

if MORE_PATH and LESS_PATH:
    more_wide = read_any(MORE_PATH).set_index("Kategorie").reindex(KANON).fillna(0.0)
    less_wide = read_any(LESS_PATH).set_index("Kategorie").reindex(KANON).fillna(0.0)
elif CHANGE_LONG_PATH:
    more_wide, less_wide = build_from_change_long(CHANGE_LONG_PATH)
else:
    # Last resort: cannot be derived meaningfully from sample_long, so fail with a clear message
    raise FileNotFoundError(
        "Neither 'umfrage_2025_more_often_wide.*' / '...less_often_wide.*' "
        "nor 'umfrage_2025_change_long.*' was found. "
        "Please place the survey change exports in data/processed/."
    )

# --- Logging
print("Loaded: stat_pivot cols →", list(stat_pivot.columns))
print("Loaded: sample_wide cols →", list(sample_wide.columns))
print("Loaded: more_wide cols →", list(more_wide.columns), "| rows:", len(more_wide))
print("Loaded: less_wide cols →", list(less_wide.columns), "| rows:", len(less_wide))


Loaded: stat_pivot cols → ['2021', '2022', '2024']
Loaded: sample_wide cols → ['2025']
Loaded: more_wide cols → ['share_%'] | rows: 7
Loaded: less_wide cols → ['share_%'] | rows: 7
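
The encoding fallback in `read_any` relies on a wrong codec failing loudly with `UnicodeDecodeError`. A minimal sketch of the same idea using only the standard library; the helper name `decode_with_fallback` and the sample string are illustrative assumptions, not part of the notebook:

```python
def decode_with_fallback(raw: bytes,
                         encodings=("utf-8-sig", "cp1252", "latin1")) -> tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes raw without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Not reached while latin1 is in the list: latin1 maps every byte to a character
    raise ValueError("no encoding matched")

# "Möbel" as written by a cp1252 editor: the byte 0xF6 is not valid UTF-8
raw_cp1252 = "Möbel".encode("cp1252")
text, used = decode_with_fallback(raw_cp1252)
print(text, used)
```

The order matters: UTF-8 bytes usually also decode under cp1252 without an error but produce mojibake (for example `MÃ¶bel`), so the strict UTF-8 attempt has to come first.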


In [32]:

# 05_vergleich · Cell 4: comparison A (level 2024 vs 2025)

def _to_year_cols(cols):
    out = []
    for c in cols:
        s = str(c).strip()
        if s.endswith(".0") and s[:-2].isdigit(): s = s[:-2]
        out.append(int(s) if s.isdigit() else s)
    return out

stat_pivot.columns = _to_year_cols(stat_pivot.columns)
sample_wide.columns = _to_year_cols(sample_wide.columns)

stat_2024 = stat_pivot.get(2024)
sample_2025 = sample_wide.get(2025)
assert stat_2024 is not None and sample_2025 is not None, "Column 2024 (Statista) or 2025 (sample) is missing."

comp_A = pd.DataFrame({
    "Statista_2024_%": stat_2024.reindex(KANON).astype(float),
    "Stichprobe_2025_%": sample_2025.reindex(KANON).astype(float),
})
comp_A["Diff_2025-2024_pp"] = comp_A["Stichprobe_2025_%"] - comp_A["Statista_2024_%"]

# Rank correlation (spearmanr ranks internally, so the explicit .rank() is redundant but harmless)
rho, pval = spearmanr(comp_A["Statista_2024_%"].rank(), comp_A["Stichprobe_2025_%"].rank())
print(f"Spearman rank correlation (level 2024 vs 2025): rho={rho:.3f}, p={pval:.4f}")

display(comp_A.sort_values("Stichprobe_2025_%", ascending=False).style.format("{:.1f}"))

COMP_A_CSV = OUT / "vergleich_A_niveau_statista2024_vs_sample2025.csv"
comp_A.to_csv(COMP_A_CSV, encoding="utf-8")
print("Exported →", COMP_A_CSV)


Spearman rank correlation (level 2024 vs 2025): rho=0.857, p=0.0137


Kategorie,Statista_2024_%,Stichprobe_2025_%,Diff_2025-2024_pp
Kleidung / Schuhe,106.0,69.2,-36.8
Hobby- & Freizeitartikel,57.0,53.8,-3.2
Bücher / Medien / Software,25.0,46.2,21.2
"Elektronik (z. B. Smartphones, Haushaltsgeräte)",38.0,43.6,5.6
Medikamente / Drogerieartikel,51.0,28.2,-22.8
Lebensmittel / Getränke,16.0,15.4,-0.6
Möbel / Wohnaccessoires,15.0,0.0,-15.0


Exported → D:\Q3_2025\data-analytics\project\data\processed\vergleich_A_niveau_statista2024_vs_sample2025.csv
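
The reported rho can be checked by hand. With seven values and no ties, Spearman's rho reduces to 1 − 6·Σd² / (n·(n² − 1)), where d is the rank difference per category. The sketch below uses the values from the table above in the same row order; the helper `ranks` is an illustrative assumption and only valid without ties.

```python
# Values copied from the comparison table, same row order in both lists
statista_2024 = [106.0, 57.0, 25.0, 38.0, 51.0, 16.0, 15.0]
sample_2025   = [69.2, 53.8, 46.2, 43.6, 28.2, 15.4, 0.0]

def ranks(values):
    """Rank 1 = smallest value; only valid when there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

r1, r2 = ranks(statista_2024), ranks(sample_2025)
n = len(r1)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))   # sum of squared rank differences
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rho, 3))  # 8 0.857
```

Only two categories swap places: Bücher / Medien / Software and Medikamente / Drogerieartikel exchange ranks 3 and 5, which accounts for the whole Σd² = 8.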


In [33]:

# 05_vergleich · Cell 5: dumbbell plot (2024 vs 2025)
order = comp_A.sort_values("Stichprobe_2025_%", ascending=False).index
x1 = comp_A.loc[order, "Statista_2024_%"].values
x2 = comp_A.loc[order, "Stichprobe_2025_%"].values
y = np.arange(len(order))

fig, ax = plt.subplots(figsize=(12, 6), layout="constrained")
# One grey connector per category; the endpoints get one color per source so the legend can tell them apart
for yi, a, b in zip(y, x1, x2):
    ax.plot([a, b], [yi, yi], color="#999", linewidth=1.5, zorder=1)
ax.scatter(x1, y, s=50, zorder=2, label="Statista 2024")
ax.scatter(x2, y, s=50, zorder=2, label="Sample 2025")
ax.legend(loc="lower right")
ax.set_yticks(y)
ax.set_yticklabels(order)
ax.set_title("Level comparison: Statista 2024 vs. sample 2025 (dumbbell)")
ax.set_xlabel("Share in %")
percent_axis(ax, axis="x", decimals=0, limit=(0, max(100, float(max(x1.max(), x2.max()))*1.15)))
plt.figtext(0.01, -0.04, "Note: different populations. The comparison serves to check structure and ranks.", ha="left", va="top", fontsize=9)
save_fig(FIG / "05_niveau_dumbbell_2024_vs_2025.png")
print("Figure saved →", FIG / "05_niveau_dumbbell_2024_vs_2025.png")


Figure saved → D:\Q3_2025\data-analytics\project\reports\figures\05_niveau_dumbbell_2024_vs_2025.png


In [34]:
# 05_vergleich · Cell 6 (fault-tolerant): comparison B · dynamics

from pathlib import Path
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def _read_any(path: Path | str, sheet: str | int | None = None) -> pd.DataFrame:
    """CSV/Excel reader, mirroring read_any from Cell 3 so this cell can run on its own."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(p)
    if p.suffix.lower() in {".xlsx", ".xls"}:
        # sheet_name=None would return a dict of all sheets, so default to the first sheet
        return pd.read_excel(p, sheet_name=0 if sheet is None else sheet)
    for enc in ("utf-8-sig", "cp1252", "latin1"):
        try:
            return pd.read_csv(p, encoding=enc)
        except UnicodeDecodeError:
            continue
    return pd.read_csv(p, encoding="latin1")

def _first_existing(cands: list[Path]) -> Path | None:
    for c in cands:
        if c.exists():
            return c
    return None

def _build_from_change_long() -> tuple[pd.DataFrame | None, pd.DataFrame | None]:
    """Rebuild more/less from umfrage_2025_change_long.* (CSV/XLSX)."""
    ch_path = _first_existing([
        OUT / "umfrage_2025_change_long.xlsx",
        OUT / "umfrage_2025_change_long.csv",
    ])
    if not ch_path:
        return None, None
    ch = _read_any(ch_path)
    # Expected columns: Kategorie | Antwort | share_%
    ch["Antwort_norm"] = ch["Antwort"].astype(str).str.strip().str.lower()
    piv = ch.pivot_table(index="Kategorie", columns="Antwort_norm",
                         values="share_%", aggfunc="sum").fillna(0.0)
    more_cols = piv.filter(regex="häu|haeu|more", axis=1)
    less_cols = piv.filter(regex="sel|less", axis=1)
    # Check the matched columns before summing; the summed frames are non-empty whenever piv has rows
    if more_cols.shape[1] == 0 and less_cols.shape[1] == 0:
        return None, None
    # Name the value column 2025 (int) to match the checks further down in this cell
    more = more_cols.sum(axis=1).to_frame(2025).reindex(KANON).fillna(0.0)
    less = less_cols.sum(axis=1).to_frame(2025).reindex(KANON).fillna(0.0)
    return more, less

# --- Statista Δ 2021→2024 (pp)
DELTA_PATH = OUT / "statista_delta_2021_2022_2024.csv"
if DELTA_PATH.exists():
    stat_delta = pd.read_csv(DELTA_PATH).set_index("Kategorie")
    d_stat = (stat_delta["Δ 2021→2024"]
              if "Δ 2021→2024" in stat_delta.columns
              else (stat_pivot[2024] - stat_pivot[2021]))
else:
    d_stat = stat_pivot[2024] - stat_pivot[2021]

# Coerce to a float Series aligned to KANON and give it a stable name
d_stat = pd.Series(d_stat, index=d_stat.index).reindex(KANON).astype(float)
d_stat = d_stat.rename("Δ_Statista_2021→2024_pp")

# --- Attempt: load more/less …
more_wide_ok = "more_wide" in globals() and isinstance(more_wide, pd.DataFrame) and 2025 in getattr(more_wide, "columns", [])
less_wide_ok = "less_wide" in globals() and isinstance(less_wide, pd.DataFrame) and 2025 in getattr(less_wide, "columns", [])

if not (more_wide_ok and less_wide_ok):
    MORE_PATH = _first_existing([
        OUT / "umfrage_2025_more_often_wide.xlsx",
        OUT / "umfrage_2025_more_often_wide.csv",
        OUT / "umfrage_2025_more_often.xlsx",
        OUT / "umfrage_2025_more_often.csv",
    ])
    LESS_PATH = _first_existing([
        OUT / "umfrage_2025_less_often_wide.xlsx",
        OUT / "umfrage_2025_less_often_wide.csv",
        OUT / "umfrage_2025_less_often.xlsx",
        OUT / "umfrage_2025_less_often.csv",
    ])
    if MORE_PATH and LESS_PATH:
        more_wide = _read_any(MORE_PATH).set_index("Kategorie").reindex(KANON).fillna(0.0)
        less_wide = _read_any(LESS_PATH).set_index("Kategorie").reindex(KANON).fillna(0.0)
        more_wide_ok = 2025 in more_wide.columns
        less_wide_ok = 2025 in less_wide.columns
    else:
        # …otherwise rebuild from change_long (if present)
        more_wide, less_wide = _build_from_change_long()
        more_wide_ok = isinstance(more_wide, pd.DataFrame) and 2025 in getattr(more_wide, "columns", [])
        less_wide_ok = isinstance(less_wide, pd.DataFrame) and 2025 in getattr(less_wide, "columns", [])

# --- Case A: everything available → full dynamics comparison
if more_wide_ok and less_wide_ok:
    more = more_wide[2025].reindex(KANON).astype(float)
    less = less_wide[2025].reindex(KANON).astype(float)
    d_sample = (more - less).rename("Netto_Stichprobe_2025_pp")

    comp_B = pd.concat([d_stat, d_sample], axis=1).dropna()

    # Sign agreement & correlation
    # np.sign returns 0 for a zero change, so a 0.0 delta only matches a net shift of exactly 0
    sign_stat = np.sign(comp_B["Δ_Statista_2021→2024_pp"])
    sign_smpl = np.sign(comp_B["Netto_Stichprobe_2025_pp"])
    agreement = (sign_stat == sign_smpl).mean()
    rhoB, pB = spearmanr(comp_B["Δ_Statista_2021→2024_pp"],
                          comp_B["Netto_Stichprobe_2025_pp"])

    print(f"Sign agreement: {agreement*100:.1f} %")
    print(f"Spearman (Δ vs. net): rho={rhoB:.3f}, p={pB:.4f}")
    display(comp_B.sort_values("Δ_Statista_2021→2024_pp", ascending=False)
            .style.format("{:+.1f}"))

    COMP_B_CSV = OUT / "vergleich_B_dynamik_statista_vs_sample.csv"
    comp_B.to_csv(COMP_B_CSV, encoding="utf-8")
    print("Exported →", COMP_B_CSV)

# --- Case B: no usable survey change data → Statista Δ only (no abort)
else:
    print("NOTE: No survey change data (more often / less often) with a 2025 column found.",
          "Dynamics comparison is skipped; only the Statista Δ is exported.", sep="\n")
    comp_B_stat_only = d_stat.to_frame()
    COMP_B_ONLY = OUT / "vergleich_B_statista_delta_only.csv"
    comp_B_stat_only.to_csv(COMP_B_ONLY, encoding="utf-8")
    display(comp_B_stat_only.sort_values("Δ_Statista_2021→2024_pp", ascending=False)
            .style.format("{:+.1f}"))
    print("Exported →", COMP_B_ONLY)


NOTE: No survey change data (more often / less often) with a 2025 column found.
Dynamics comparison is skipped; only the Statista Δ is exported.


Kategorie,Δ_Statista_2021→2024_pp
Kleidung / Schuhe,49.0
Bücher / Medien / Software,25.0
"Elektronik (z. B. Smartphones, Haushaltsgeräte)",0.0
Lebensmittel / Getränke,-17.0
Möbel / Wohnaccessoires,-23.0
Hobby- & Freizeitartikel,-26.0
Medikamente / Drogerieartikel,-27.0


Exported → D:\Q3_2025\data-analytics\project\data\processed\vergleich_B_statista_delta_only.csv
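
The cell ends up in the Statista-only branch although Cell 3 did load `more_wide` and `less_wide`: according to the Cell 3 log their only column is named `share_%`, while the checks in this cell look for the integer column `2025`. One possible way to bridge that gap is sketched below. The helper name `as_year_frame` is hypothetical, and the sketch assumes each export carries exactly one value column that refers to 2025.

```python
import pandas as pd

def as_year_frame(df: pd.DataFrame, year: int = 2025) -> pd.DataFrame:
    """Return a copy of df whose value column is named `year` (int). Hypothetical helper."""
    def to_year(c):
        # Turn year-like labels such as "2025" or "2025.0" into ints, leave others untouched
        s = str(c).strip()
        if s.endswith(".0"):
            s = s[:-2]
        return int(s) if s.isdigit() else c

    out = df.copy()
    out.columns = [to_year(c) for c in out.columns]
    if year in out.columns:
        return out
    # Assumption: a single remaining column (e.g. "share_%") holds the values for `year`
    if out.shape[1] == 1:
        return out.rename(columns={out.columns[0]: year})
    raise ValueError(f"Cannot identify the {year} column among {list(out.columns)}")

# Made-up frame mirroring the logged layout: index "Kategorie", one column "share_%"
demo = pd.DataFrame({"share_%": [12.5, 7.5]},
                    index=pd.Index(["cat_a", "cat_b"], name="Kategorie"))
fixed = as_year_frame(demo)
print(list(fixed.columns))  # [2025]
```

Applying such a step to `more_wide` and `less_wide` before the `2025 in ... columns` checks would let Case A run. Whether `share_%` in those exports really refers to the 2025 survey should be confirmed against the notebook that writes them.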


In [35]:
# 05_vergleich · Cell 7 (robust): scatter Δ vs. net shift

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# --- Helper: define save_fig if it does not exist yet ---
if "save_fig" not in globals():
    def save_fig(path):
        plt.savefig(path, bbox_inches="tight", pad_inches=0.25, facecolor="white")
        plt.close()

# --- Rebuild comp_B if needed ---
need_rebuild = ("comp_B" not in globals()) or (not isinstance(comp_B, pd.DataFrame)) or comp_B.empty
if need_rebuild:
    # Determine the Statista Δ robustly
    def _to_int_years(cols):
        out = []
        for c in cols:
            s = str(c).strip()
            if s.endswith(".0") and s[:-2].isdigit():
                s = s[:-2]
            out.append(int(s) if s.isdigit() else s)
        return out

    stat_pivot.columns = _to_int_years(stat_pivot.columns)
    assert 2021 in stat_pivot.columns and 2024 in stat_pivot.columns, "stat_pivot needs columns 2021 & 2024"

    d_stat = (stat_pivot[2024] - stat_pivot[2021]).reindex(KANON).astype(float)
    d_stat = d_stat.rename("Δ_Statista_2021→2024_pp")

    # Sample net shift only if more/less are available
    has_change = (
        "more_wide" in globals() and isinstance(more_wide, pd.DataFrame) and 2025 in getattr(more_wide, "columns", []) and
        "less_wide" in globals() and isinstance(less_wide, pd.DataFrame) and 2025 in getattr(less_wide, "columns", [])
    )
    if has_change:
        more = more_wide[2025].reindex(KANON).astype(float)
        less = less_wide[2025].reindex(KANON).astype(float)
        d_sample = (more - less).rename("Netto_Stichprobe_2025_pp")
    else:
        # No hard abort: placeholder Series of NaNs (dropna below then leaves comp_B empty)
        d_sample = pd.Series(np.nan, index=KANON, name="Netto_Stichprobe_2025_pp")

    comp_B = pd.concat([d_stat, d_sample], axis=1).dropna()

# --- Plot or notice ---
fig_path = FIG / "05_scatter_delta_vs_netto.png"
if comp_B.empty:
    # Placeholder figure with a clear message
    fig, ax = plt.subplots(figsize=(8, 4), constrained_layout=True)
    ax.axis("off")
    ax.text(0.02, 0.7, "No scatter plot possible", fontsize=14, weight="bold")
    ax.text(
        0.02, 0.45,
        "Survey change data (more often / less often) are missing\n"
        "→ Place in data/processed/, e.g.:\n"
        "   • umfrage_2025_more_often_wide.(xlsx|csv)\n"
        "   • umfrage_2025_less_often_wide.(xlsx|csv)\n"
        "   OR • umfrage_2025_change_long.(xlsx|csv)",
        fontsize=10, va="top"
    )
    save_fig(fig_path)
    print("Notice figure saved →", fig_path)
else:
    x = comp_B["Δ_Statista_2021→2024_pp"]
    y = comp_B["Netto_Stichprobe_2025_pp"]

    fig, ax = plt.subplots(figsize=(7.5, 7), constrained_layout=True)
    ax.scatter(x, y, s=60, zorder=3)

    # One label per point
    for cat, xv, yv in zip(comp_B.index, x.values, y.values):
        ax.text(xv, yv, "  " + cat, va="center", fontsize=9)

    # Axes, reference lines, title
    ax.axhline(0, color="#999", linewidth=1, zorder=1)
    ax.axvline(0, color="#999", linewidth=1, zorder=1)
    ax.set_xlabel("Statista Δ 2021→2024 (pp)")
    ax.set_ylabel("Sample net 2025 (more often minus less often, pp)")
    ax.set_title("Dynamics comparison: Statista Δ vs. sample net shift")

    # Optional: regression line (only with at least 2 points)
    if len(comp_B) >= 2 and np.isfinite(x).all() and np.isfinite(y).all():
        m, b = np.polyfit(x.values, y.values, 1)
        xs = np.linspace(x.min(), x.max(), 100)
        ax.plot(xs, m*xs + b, linewidth=1, alpha=0.8)

    save_fig(fig_path)
    print("Figure saved →", fig_path)


Notice figure saved → D:\Q3_2025\data-analytics\project\reports\figures\05_scatter_delta_vs_netto.png


In [36]:

# 05_vergleich · Cell 8: joint ranking
# Note: if comp_B is empty (no survey change data), rank_df and the exported CSV are empty as well
rank_df = pd.DataFrame({
    "Δ_Statista_pp": comp_B["Δ_Statista_2021→2024_pp"],
    "Rank_Statista": comp_B["Δ_Statista_2021→2024_pp"].rank(ascending=False),
    "Netto_Stichprobe_pp": comp_B["Netto_Stichprobe_2025_pp"],
    "Rank_Stichprobe": comp_B["Netto_Stichprobe_2025_pp"].rank(ascending=False),
})
rank_df["Rank_Diff"] = (rank_df["Rank_Stichprobe"] - rank_df["Rank_Statista"]).round(1)
display(rank_df.sort_values("Rank_Statista"))

RANK_CSV = OUT / "vergleich_rankings_delta_netto.csv"
rank_df.to_csv(RANK_CSV, encoding="utf-8")
print("Exported →", RANK_CSV)


Unnamed: 0,Δ_Statista_pp,Rank_Statista,Netto_Stichprobe_pp,Rank_Stichprobe,Rank_Diff


Exported → D:\Q3_2025\data-analytics\project\data\processed\vergleich_rankings_delta_netto.csv


In [37]:

# 05_vergleich · Cell 9: summary of findings
lines = []

rho, pval = spearmanr(comp_A["Statista_2024_%"].rank(), comp_A["Stichprobe_2025_%"].rank())
lines.append(f"Level comparison 2024 vs 2025: Spearman rho={rho:.3f} (p={pval:.4f}).")
top_over = comp_A["Diff_2025-2024_pp"].nlargest(2)
top_under = comp_A["Diff_2025-2024_pp"].nsmallest(2)
lines.append("Sample clearly above Statista (top 2): " + ", ".join([f"{k} ({v:+.1f} pp)" for k,v in top_over.items()]))
lines.append("Sample clearly below Statista (top 2): " + ", ".join([f"{k} ({v:+.1f} pp)" for k,v in top_under.items()]))

# With an empty comp_B, both the mean of the empty comparison and spearmanr yield nan
sign_agree = (np.sign(comp_B["Δ_Statista_2021→2024_pp"]) == np.sign(comp_B["Netto_Stichprobe_2025_pp"])).mean()
rhoB, pB = spearmanr(comp_B["Δ_Statista_2021→2024_pp"], comp_B["Netto_Stichprobe_2025_pp"])
lines.append(f"Dynamics: sign agreement = {sign_agree*100:.1f} %, Spearman rho={rhoB:.3f} (p={pB:.4f}).")

SUMMARY_TXT = OUT / "05_compare_summary.txt"
with open(SUMMARY_TXT, "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print("\n".join(lines))
print("Summary exported →", SUMMARY_TXT)


Level comparison 2024 vs 2025: Spearman rho=0.857 (p=0.0137).
Sample clearly above Statista (top 2): Bücher / Medien / Software (+21.2 pp), Elektronik (z. B. Smartphones, Haushaltsgeräte) (+5.6 pp)
Sample clearly below Statista (top 2): Kleidung / Schuhe (-36.8 pp), Medikamente / Drogerieartikel (-22.8 pp)
Dynamics: sign agreement = nan %, Spearman rho=nan (p=nan).
Summary exported → D:\Q3_2025\data-analytics\project\data\processed\05_compare_summary.txt
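
The `nan` values in the dynamics line come from an empty `comp_B`. A guarded variant could write a plain notice instead. The following is a sketch under assumptions: the function name `dynamics_line` is hypothetical, and it omits the p-value by computing Spearman's rho as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

def dynamics_line(comp_b: pd.DataFrame,
                  col_stat: str = "Δ_Statista_2021→2024_pp",
                  col_smpl: str = "Netto_Stichprobe_2025_pp") -> str:
    """Build the dynamics sentence, or a plain notice when there is nothing to compare."""
    if comp_b.empty:
        return "Dynamics: not evaluated (no survey change data available)."
    agree = (np.sign(comp_b[col_stat]) == np.sign(comp_b[col_smpl])).mean()
    # The Pearson correlation of the ranks equals Spearman's rho
    rho = comp_b[col_stat].rank().corr(comp_b[col_smpl].rank())
    return f"Dynamics: sign agreement = {agree*100:.1f} %, Spearman rho={rho:.3f}."

# Empty frame with the same columns as comp_B in this run
empty = pd.DataFrame(columns=["Δ_Statista_2021→2024_pp", "Netto_Stichprobe_2025_pp"])
print(dynamics_line(empty))
```

With such a guard, the summary file states explicitly that part B was skipped rather than reporting `nan` as if it were a result.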
