# Exploratory Data Analysis (EDA): Injection Molding Dataset

**Goal:** This notebook walks **step by step** through an EDA of an injection molding dataset (synthetic but realistic).
We examine data quality, distributions, correlations, temporal patterns (drift), and categorical effects (shift, material batch).
At the end, we distill the **take-home findings**.

*Created:* 2025-10-02 18:39


In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Render settings
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

# IMPORTANT: use matplotlib only; no seaborn, one chart per figure, no explicit colors


## 1) Load data & overview

In [None]:
# Adjust the path if needed
path = "../injection_moulding_synthetic/eda_injection_molding_synthetic.csv"
df = pd.read_csv(path, parse_dates=["timestamp"])

print(df.shape)
df.head(10)

### Data dictionary (excerpt)
- **timestamp**: timestamp of the production cycle
- **machine_id**, **product_code**, **cavity_id**, **operator_id**, **shift**, **material_batch**
- **ambient_temp_C**: ambient temperature (°C)
- **melt_temp_C**: melt temperature (°C)
- **injection_pressure_bar**, **hold_pressure_bar**
- **cooling_time_s**, **cycle_time_s**
- **material_moisture_pct**: material moisture (fraction)
- **screw_speed_rpm**
- **vibration_rms_g**
- **energy_kwh**
- **dimensional_deviation_mm**: dimensional deviation (mm, signed)
- **scrap**: 0/1 scrap indicator
- **scrap_type**: defect type (None, Burr, ShortShot, SinkMark, Warpage)
- **defects_count**: number of detected defects (> 0 if scrap == 1, otherwise 0)
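
The rules stated in the data dictionary can be turned into a quick plausibility check. The following is a minimal sketch, not part of the original notebook: `check_dictionary_rules` and the tiny `demo` frame are hypothetical, and only the `scrap` and `defects_count` rules listed above are covered.

```python
import pandas as pd

def check_dictionary_rules(frame: pd.DataFrame) -> pd.Series:
    """Count rows that violate the rules stated in the data dictionary."""
    scrap = frame["scrap"]
    violations = {
        # scrap must be a 0/1 indicator
        "scrap_not_binary": (~scrap.isin([0, 1])).sum(),
        # scrapped parts should carry at least one defect
        "scrap_without_defects": ((scrap == 1) & (frame["defects_count"] <= 0)).sum(),
        # good parts should carry no defects
        "defects_without_scrap": ((scrap == 0) & (frame["defects_count"] != 0)).sum(),
    }
    return pd.Series(violations, dtype="int64")

# Hypothetical demo frame: the last two rows break the rules on purpose
demo = pd.DataFrame({
    "scrap":         [0, 1, 1, 0],
    "defects_count": [0, 2, 0, 3],
})
report = check_dictionary_rules(demo)
print(report)
```

On the real data, one would call `check_dictionary_rules(df)` after loading; any non-zero count points to rows worth inspecting before further analysis.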


## 2) Data quality: missing values, plausibility, outliers

In [None]:
# Missing values
na_rate = df.isna().mean().sort_values(ascending=False)
print("Missing-value rates (top 12):")
display(na_rate.head(12))

# Basic statistics
desc = df.describe().T
display(desc)


In [None]:
# Outlier check (example: cycle time)
plt.figure()
df["cycle_time_s"].plot(kind="box")
plt.title("Boxplot: cycle_time_s")
plt.ylabel("seconds")
plt.show()
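
The box plot shows outliers visually; to count or list them, the same 1.5 × IQR rule that matplotlib uses for its whiskers by default can be applied directly. A minimal sketch, assuming Tukey's rule is an acceptable outlier definition here; `iqr_outliers` and the `demo` series are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # NaN values compare as False and are therefore never flagged
    return (series < lower) | (series > upper)

# Hypothetical cycle times in seconds with one obvious outlier
demo = pd.Series([20.0, 21.0, 22.0, 23.0, 24.0, 60.0], name="cycle_time_s")
mask = iqr_outliers(demo)
print(f"{mask.sum()} outlier(s):", demo[mask].tolist())
```

Applied to the notebook's data, this would read `df[iqr_outliers(df["cycle_time_s"])]` to pull the flagged cycles for inspection.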


## 3) Univariate distributions

In [None]:
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

for col in ["ambient_temp_C","melt_temp_C","injection_pressure_bar","cooling_time_s","cycle_time_s","vibration_rms_g"]:
    plt.figure()
    df[col].dropna().plot(kind="hist", bins=30)
    plt.title(f"Histogram: {col}")
    plt.xlabel(col)
    plt.ylabel("count")
    plt.show()


## 4) Categorical effects on scrap

In [None]:
def rate_by(col, top=10):
    rates = df.groupby(col)["scrap"].mean().sort_values(ascending=False)
    print(f"Scrap Rate by {col} (Top {top}):")
    display((rates*100).round(2).head(top))

for col in ["machine_id","shift","material_batch","product_code","cavity_id","operator_id"]:
    rate_by(col, top=10)
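
A ranking by mean scrap rate can be dominated by groups with very few parts. A minimal sketch, not part of the original notebook, that reports the group size and a Wilson score interval next to each rate; `rate_with_ci`, the 95% level (`z=1.96`), and the `demo` frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def rate_with_ci(frame: pd.DataFrame, col: str, z: float = 1.96) -> pd.DataFrame:
    """Scrap rate per group with group size and Wilson score interval."""
    grp = frame.groupby(col)["scrap"].agg(n="count", k="sum")
    n = grp["n"]
    p = grp["k"] / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    out = pd.DataFrame({
        "n": n,
        "rate": p,
        "ci_low": center - half,
        "ci_high": center + half,
    })
    return out.sort_values("rate", ascending=False)

# Hypothetical data: shift B has the higher rate but only four parts
demo = pd.DataFrame({
    "shift": ["A"] * 100 + ["B"] * 4,
    "scrap": [1] * 10 + [0] * 90 + [1] + [0] * 3,
})
table = rate_with_ci(demo, "shift")
print(table.round(3))
```

In this toy example shift B ranks first, yet its interval is several times wider than that of shift A, which is a hint to treat the ranking with caution.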


## 5) Temporal patterns: rolling scrap rate & drift

In [None]:
# Rolling scrap rate (12-hour window); lowercase "12h" avoids the deprecation warning for "12H" in recent pandas
ts = (df.sort_values("timestamp")
        .set_index("timestamp")["scrap"]
        .rolling("12h").mean())

plt.figure()
ts.plot()
plt.title("Rolling scrap rate (12H window)")
plt.xlabel("time")
plt.ylabel("scrap rate")
plt.show()

# Drift examples: cycle_time_s, vibration_rms_g, dimensional_deviation_mm (rolling mean)
df_sorted = df.sort_values("timestamp").set_index("timestamp")
for col in ["cycle_time_s","vibration_rms_g","dimensional_deviation_mm"]:
    plt.figure()
    df_sorted[col].rolling("12h").mean().plot()
    plt.title(f"Rolling mean (12H): {col}")
    plt.xlabel("time")
    plt.ylabel(col)
    plt.show()
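
The rolling means show drift visually. To put a number on it, one option is to fit a straight line over elapsed time and read off the slope. This is a minimal sketch under the assumption that a linear trend is a reasonable first approximation; `drift_slope_per_day` and the `demo` series are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drift_slope_per_day(series: pd.Series) -> float:
    """Least-squares slope of a time-indexed series, in units per day."""
    s = series.dropna().sort_index()
    # Elapsed time since the first observation, expressed in days
    elapsed_days = np.asarray((s.index - s.index[0]).total_seconds()) / 86400.0
    slope, _intercept = np.polyfit(elapsed_days, s.to_numpy(dtype=float), 1)
    return float(slope)

# Hypothetical hourly series that rises by exactly 0.5 units per day
stamps = pd.Timestamp("2025-01-01") + pd.to_timedelta(np.arange(48), unit="h")
demo = pd.Series(10.0 + 0.5 * np.arange(48) / 24.0, index=stamps)
slope = drift_slope_per_day(demo)
print(f"slope: {slope:.3f} per day")
```

On the notebook's data this could be applied to each rolling mean, for example `drift_slope_per_day(df_sorted["cycle_time_s"].rolling("12h").mean())`.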


## 6) Correlations (numeric)

In [None]:
num = df.select_dtypes(include=[np.number])
corr = num.corr()

plt.figure()
plt.imshow(corr, interpolation="nearest")
plt.title("Correlation heatmap (numeric features)")
plt.xticks(range(corr.shape[1]), corr.columns, rotation=90)
plt.yticks(range(corr.shape[1]), corr.columns)
plt.colorbar()
plt.tight_layout()
plt.show()

# Top correlations with 'scrap' (sorted by absolute value)
scrap_corr = corr["scrap"].drop("scrap").abs().sort_values(ascending=False)
print("Top correlations with 'scrap':")
display(scrap_corr.head(10))
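
Two caveats apply to this ranking. Taking `.abs()` discards the sign, and `defects_count` is tied to `scrap` by definition (see the data dictionary), so it will likely top the list without carrying any insight. A minimal sketch, not part of the original notebook, that ranks by absolute value while keeping the sign and leaves such columns out; `ranked_corr` and the `demo` frame are hypothetical.

```python
import pandas as pd

def ranked_corr(frame: pd.DataFrame, target: str = "scrap",
                exclude: tuple = ("defects_count",)) -> pd.Series:
    """Signed correlations with `target`, ordered by absolute value."""
    num = frame.select_dtypes(include="number").drop(columns=list(exclude), errors="ignore")
    corr = num.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Hypothetical data: x rises with scrap, y falls with scrap
demo = pd.DataFrame({
    "scrap":         [0, 0, 1, 1],
    "defects_count": [0, 0, 2, 1],
    "x":             [1.0, 2.0, 3.0, 4.0],
    "y":             [9.0, 8.0, 8.5, 2.0],
})
result = ranked_corr(demo)
print(result.round(3))
```

Keeping the sign makes it visible at a glance whether a higher value of a feature goes along with more or less scrap.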


## 7) Environmental conditions: temperature windows

In [None]:
bins = [0,18,22,26,100]
labels = ["<18","18-22","22-26",">26"]
tmp = df.assign(ambient_bin=pd.cut(df["ambient_temp_C"], bins=bins, labels=labels, include_lowest=True))

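# observed=False keeps empty bins in the result and avoids the pandas FutureWarning for categorical groupers
scrap_by_bin = tmp.groupby("ambient_bin", observed=False)["scrap"].mean()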
print("Scrap Rate by ambient temperature bins (%):")
display((scrap_by_bin*100).round(2))

plt.figure()
(scrap_by_bin*100).plot(kind="bar")
plt.title("Scrap rate by ambient temperature bin (%)")
plt.xlabel("ambient_temp_C bin")
plt.ylabel("scrap rate (%)")
plt.show()


## 8) Multivariate quick check

In [None]:
plt.figure()
plt.scatter(df["melt_temp_C"], df["injection_pressure_bar"], s=8, c=df["scrap"])
plt.title("Melt temp vs. Injection pressure (colored by scrap)")
plt.xlabel("melt_temp_C")
plt.ylabel("injection_pressure_bar")
plt.show()


## 9) Batch effect: is there a problematic material batch?

In [None]:
batch_rates = df.groupby("material_batch")["scrap"].mean().sort_values(ascending=False)
display((batch_rates*100).round(2))

# Visualize (top 8 batches)
plt.figure()
(batch_rates.head(8)*100).plot(kind="bar")
plt.title("Top material batches by scrap rate (%)")
plt.ylabel("scrap rate (%)")
plt.xlabel("material_batch")
plt.show()


## 10) Shift effects

In [None]:
shift_rates = df.groupby("shift")["scrap"].mean().sort_values(ascending=False)
display((shift_rates*100).round(2))

plt.figure()
(shift_rates*100).plot(kind="bar")
plt.title("Scrap rate by shift (%)")
plt.ylabel("scrap rate (%)")
plt.xlabel("shift")
plt.show()


## 11) Cycle time ↔ energy ↔ quality (trade-offs)

In [None]:
print(df[["cycle_time_s","energy_kwh","scrap"]].corr())

plt.figure()
plt.scatter(df["cycle_time_s"], df["energy_kwh"], s=8)
plt.title("Cycle time vs Energy")
plt.xlabel("cycle_time_s")
plt.ylabel("energy_kwh")
plt.show()


## 12) Take-home findings (automatically summarized)

In [None]:
summary_lines = []

overall = df["scrap"].mean()*100
summary_lines.append(f"- **Overall scrap rate:** {overall:.2f}%")

# Problem batch
batch_rates_pct = (df.groupby("material_batch")["scrap"].mean()*100).sort_values(ascending=False)
top_batch, top_rate = batch_rates_pct.index[0], batch_rates_pct.iloc[0]
median_rate = batch_rates_pct.median()
if top_rate > median_rate + 2.0:
    summary_lines.append(f"- **Conspicuous material batch:** {top_batch} at {top_rate:.1f}% (median {median_rate:.1f}%) → inspect/quarantine the batch.")

# Shift effect
shift_rates_pct = (df.groupby("shift")["scrap"].mean()*100).sort_values(ascending=False)
worst_shift, worst_rate = shift_rates_pct.index[0], shift_rates_pct.iloc[0]
best_shift, best_rate = shift_rates_pct.index[-1], shift_rates_pct.iloc[-1]
if worst_rate > best_rate + 1.0:
    diff = worst_rate - best_rate
    summary_lines.append(f"- **Shift effect:** {worst_shift} is worse than {best_shift} (+{diff:.1f} pp) → review training/checklists/support.")

# Ambient bin effect
bins = [0,18,22,26,100]
labels = ["<18","18-22","22-26",">26"]
tmp = df.assign(ambient_bin=pd.cut(df["ambient_temp_C"], bins=bins, labels=labels, include_lowest=True))
bin_rates = (tmp.groupby("ambient_bin", observed=False)["scrap"].mean()*100)
if (bin_rates.max() - bin_rates.min()) > 1.0:
    worst_bin = bin_rates.idxmax()  # bin with the highest scrap rate (not necessarily the warmest)
    best_bin = bin_rates.idxmin()   # bin with the lowest scrap rate
    summary_lines.append(f"- **Temperature window:** differences between bins (max at **{worst_bin}**, min at **{best_bin}**) → control climate/oven more tightly.")

# Drift indicators: compare the mean of the first vs. the last 20% of the rolling mean
def drift_flag(series, label):
    s = series.sort_index()
    n = len(s)
    early = s.iloc[:max(5, n//5)].mean()
    late = s.iloc[-max(5, n//5):].mean()
    # Relative increase of more than 5%; abs() keeps the test meaningful for signed
    # quantities such as dimensional_deviation_mm, where early may be negative
    if late - early > 0.05 * abs(early):
        summary_lines.append(f"- **Drift:** {label} rose from {early:.3f} to {late:.3f} → check tooling/wear.")

df_sorted = df.sort_values("timestamp").set_index("timestamp")
for col in ["cycle_time_s","vibration_rms_g","dimensional_deviation_mm"]:
    drift_flag(df_sorted[col].rolling("12h").mean().dropna(), col)

print("\n".join(summary_lines))


### Interpretation & next steps
- **EDA is not modeling**: it produces **hypotheses** and **prioritization hints**.
- Follow up: **inspect the material batch**, examine **night-shift procedures**, check the **tool condition**.
- For validation: A/B changes (e.g., a different batch, checklists on the night shift), **control charts**, and **more advanced models** (logit/tree), **but only after a clean EDA**.
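
As one possible form of the control charts mentioned above, a p-chart tracks the scrap proportion per period against 3-sigma limits around the overall rate. The following is a minimal sketch, assuming daily subgroups and the usual binomial limits; `p_chart_limits` and the `demo` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def p_chart_limits(frame: pd.DataFrame, freq: str = "D") -> pd.DataFrame:
    """Scrap proportion per period with 3-sigma p-chart control limits."""
    per_period = (frame.set_index("timestamp")["scrap"]
                       .resample(freq).agg(["sum", "count"]))
    per_period = per_period[per_period["count"] > 0]  # skip periods without production
    p_bar = per_period["sum"].sum() / per_period["count"].sum()  # center line
    sigma = np.sqrt(p_bar * (1 - p_bar) / per_period["count"])   # depends on subgroup size
    out = pd.DataFrame({
        "p": per_period["sum"] / per_period["count"],
        "lcl": (p_bar - 3 * sigma).clip(lower=0),
        "ucl": (p_bar + 3 * sigma).clip(upper=1),
    })
    out["out_of_control"] = (out["p"] > out["ucl"]) | (out["p"] < out["lcl"])
    return out

# Hypothetical data: 100 parts per day, the third day has far more scrap
stamps = pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03"]).repeat(100)
scrap = ([1] * 5 + [0] * 95) * 2 + [1] * 30 + [0] * 70
demo = pd.DataFrame({"timestamp": stamps, "scrap": scrap})
chart = p_chart_limits(demo)
print(chart.round(3))
```

In the toy data only the third day falls outside the limits. On real data, the limits would ideally be estimated from a period known to be stable rather than from the full history.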
