# AI4I 2020 – EDA & Models in Colab

This notebook walks through a compact pipeline for an industry-oriented dataset (predictive maintenance).

**Contents:** descriptive statistics, EDA, logistic regression, random forest, k-means, small MLP.

**Source (UCI):** `ai4i2020.csv` (CC BY 4.0)

Direct URL: `https://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv`


In [None]:
# %%capture
# Optional: install helper packages (usually not needed in Colab)
# !pip install -q ucimlrepo


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import (classification_report, confusion_matrix, roc_auc_score,
                             RocCurveDisplay, PrecisionRecallDisplay)
import time
import os


## Load the data

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv"
df = pd.read_csv(url)
df.head()

In [None]:
df.shape, df.dtypes

## Harmonize column names

In [None]:
df.columns = (df.columns
              .str.strip()
              .str.replace(r"\s+", "_", regex=True)
              .str.replace(r"[\[\]\(\)]", "", regex=True)
              .str.replace("%", "pct")
              .str.lower())
df.head()
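
To make the effect of the renaming chain visible without downloading the file, the following sketch applies the same steps to a handful of header names. The names in `raw_columns` are what the CSV header is expected to look like (an assumption, not read from the file), and `harmonize` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Header names as they are expected in ai4i2020.csv (assumption, not read from the file)
raw_columns = pd.Index([
    "UDI", "Product ID", "Type", "Air temperature [K]",
    "Rotational speed [rpm]", "Machine failure",
])

def harmonize(columns: pd.Index) -> pd.Index:
    """Apply the same normalization steps as the cell above (hypothetical helper)."""
    return (columns
            .str.strip()                                   # trim surrounding whitespace
            .str.replace(r"\s+", "_", regex=True)          # whitespace runs -> underscore
            .str.replace(r"[\[\]\(\)]", "", regex=True)    # drop brackets and parentheses
            .str.replace("%", "pct", regex=False)          # literal percent sign
            .str.lower())

clean_columns = harmonize(raw_columns).tolist()
print(clean_columns)
```

With these inputs, `Air temperature [K]` becomes `air_temperature_k` and `Machine failure` becomes `machine_failure`, which is the target name used further down.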

## Descriptive statistics & EDA

In [None]:
df.describe(include='all')

In [None]:
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
for col in numeric_cols:
    plt.figure()
    df[col].hist(bins=30)
    plt.title(f'Histogram: {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

## Target variable & features

In [None]:
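target_col = 'machine_failure'
y = df[target_col].astype(int)
# Identifiers carry no signal and would explode into thousands of dummy columns
id_cols = ['udi', 'product_id']
# Failure-mode flags are recorded together with the target and would leak the label
leak_cols = ['twf', 'hdf', 'pwf', 'osf', 'rnf']
X = df.drop(columns=[target_col] + id_cols + leak_cols, errors='ignore')
X = pd.get_dummies(X, drop_first=True, dtype=float)
X.head(), y.value_counts(normalize=True)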
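
The class shares printed above matter for how the later metrics are read. As a rough illustration, the sketch below uses synthetic labels with 3% positives (an assumed rate, not taken from the dataset) and a majority-class baseline: accuracy looks excellent while recall on the failure class is zero. All `*_demo` names are placeholders.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)

# Synthetic stand-in labels: 30 failures among 1000 rows (assumed 3% rate)
y_demo = np.zeros(1000, dtype=int)
y_demo[:30] = 1
rng.shuffle(y_demo)
X_demo = rng.normal(size=(1000, 3))  # features are irrelevant for this baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, pred)
baseline_recall = recall_score(y_demo, pred, zero_division=0)
print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
```

This is why the cells below report ROC-AUC and per-class precision and recall rather than accuracy alone.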

## Split & scaling

In [None]:
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)
# The feature matrix is dense, so the scaler can center as well as rescale
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s   = scaler.transform(X_val)
X_test_s  = scaler.transform(X_test)
X_train.shape, X_val.shape, X_test.shape
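
The two nested splits give a 60/20/20 partition: 20% is held out for test, and 25% of the remaining 80% (0.25 × 0.8 = 0.2) becomes validation. The sketch below checks this arithmetic and the effect of `stratify` on a synthetic label vector; the data is made up for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 5% positives
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = np.arange(1000).reshape(-1, 1)

# Same two-stage split as in the cell above
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, stratify=y_tv, random_state=42
)

sizes = (len(X_tr), len(X_va), len(X_te))
rates = tuple(float(part.mean()) for part in (y_tr, y_va, y_te))
print("sizes:", sizes)
print("positive rates:", rates)
```

Because both splits are stratified, the positive rate stays close to the overall rate in every partition.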

## Logistic regression

In [None]:
t0 = time.time()
lr = LogisticRegression(max_iter=200)
lr.fit(X_train_s, y_train)
t1 = time.time()
val_pred = lr.predict(X_val_s)
val_proba = lr.predict_proba(X_val_s)[:, 1]
print(classification_report(y_val, val_pred, digits=3))
print('Val ROC-AUC (LR):', roc_auc_score(y_val, val_proba))
print(f'Training time: {t1 - t0:.2f}s')
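
With a rare positive class, an unweighted logistic regression can score well on ROC-AUC while missing most failures at the default threshold. One option, sketched here on synthetic data from `make_classification` (not the AI4I data), is `class_weight="balanced"`. The pipeline and the comparison loop are illustrative assumptions, not part of the notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.95, 0.05], random_state=42,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Compare validation recall on the positive class with and without class weighting
recalls = {}
for label, class_weight in [("plain", None), ("balanced", "balanced")]:
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=200, class_weight=class_weight),
    )
    model.fit(X_tr, y_tr)
    recalls[label] = recall_score(y_va, model.predict(X_va))

print(recalls)
```

Class weighting usually trades precision for recall, so whether it helps depends on the relative cost of missed failures and false alarms.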

## Random Forest

In [None]:
rf = RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
val_pred = rf.predict(X_val)
val_proba = rf.predict_proba(X_val)[:, 1]
print(classification_report(y_val, val_pred, digits=3))
print('Val ROC-AUC (RF):', roc_auc_score(y_val, val_proba))
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)[:15]
importances.plot(kind='bar')
plt.title('Top-15 Feature Importances (RF)')
plt.tight_layout()
plt.show()
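
Impurity-based importances from `feature_importances_` are computed on the training data and can favor high-cardinality features. Permutation importance on held-out data is a common cross-check. The sketch below runs it on synthetic data; the feature names `f0`..`f5` and the `*_demo` objects are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with placeholder feature names
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Mean drop in validation ROC-AUC when each feature is shuffled
result = permutation_importance(
    rf_demo, X_va, y_va, scoring="roc_auc", n_repeats=5, random_state=42
)
perm_importances = (
    pd.Series(result.importances_mean, index=feature_names)
    .sort_values(ascending=False)
)
print(perm_importances)
```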

## k-means

In [None]:
k = 4
km = KMeans(n_clusters=k, n_init=10, random_state=42)
km.fit(X_train_s)
val_clusters = km.predict(X_val_s)
cluster_df = pd.DataFrame({'cluster': val_clusters, 'y': y_val.values})
cluster_df.groupby('cluster')['y'].agg(['count','mean']).rename(columns={'mean':'failure_rate'})
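
The choice `k = 4` above is fixed by hand. One way to sanity-check such a choice is to compare silhouette scores over a range of `k`. The sketch below does this on synthetic blobs rather than the notebook's features, so the scores say nothing about AI4I itself.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four generated blobs
X_demo, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=42)
X_demo_s = StandardScaler().fit_transform(X_demo)

# Silhouette score per candidate k (higher means better-separated clusters)
scores = {}
for k_candidate in range(2, 8):
    labels = KMeans(n_clusters=k_candidate, n_init=10, random_state=42).fit_predict(X_demo_s)
    scores[k_candidate] = silhouette_score(X_demo_s, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```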

## Small MLP

In [None]:
# max_iter raised from 50: so few epochs typically stop before convergence (ConvergenceWarning)
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), activation='relu', max_iter=300, random_state=42)
mlp.fit(X_train_s, y_train)
val_pred = mlp.predict(X_val_s)
val_proba = mlp.predict_proba(X_val_s)[:, 1]
print(classification_report(y_val, val_pred, digits=3))
print('Val ROC-AUC (MLP):', roc_auc_score(y_val, val_proba))
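
Rather than guessing `max_iter`, `MLPClassifier` can hold out part of the training data and stop once the validation score stops improving. The sketch below shows that option on synthetic data. The settings (`validation_fraction=0.1`, `n_iter_no_change=10`) are illustrative choices, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=6, n_informative=4, n_redundant=0, random_state=42
)

# Scaling and the network in one pipeline; early stopping uses an internal 10% hold-out
mlp_demo = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(32, 32), max_iter=300,
        early_stopping=True, validation_fraction=0.1, n_iter_no_change=10,
        random_state=42,
    ),
)
mlp_demo.fit(X_demo, y_demo)

n_epochs = mlp_demo[-1].n_iter_  # epochs actually run before stopping
print("epochs run:", n_epochs)
```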

## Final evaluation on the test set

In [None]:
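# Chosen by hand; use whichever model had the best validation metrics above
best_model = rf
# The random forest was fitted on unscaled features, so it is scored on X_test (not X_test_s)
test_proba = best_model.predict_proba(X_test)[:, 1]
test_pred = (test_proba >= 0.5).astype(int)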
print(classification_report(y_test, test_pred, digits=3))
print('Test ROC-AUC:', roc_auc_score(y_test, test_proba))
RocCurveDisplay.from_predictions(y_test, test_proba)
plt.title('ROC – Test')
plt.show()
PrecisionRecallDisplay.from_predictions(y_test, test_proba)
plt.title('Precision-Recall – Test')
plt.show()
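
The evaluation above uses a fixed 0.5 threshold. For a rare failure class, a threshold chosen on the validation split often gives a better precision-recall balance. The sketch below picks the F1-maximizing threshold on validation data and applies it once to the test data. It runs on synthetic data, and F1 as the selection criterion is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.95, 0.05], random_state=42,
)
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, stratify=y_tv, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Choose the threshold on validation data only
va_proba = model.predict_proba(X_va)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_va, va_proba)
# precision and recall have one more entry than thresholds; drop the final point
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
best_threshold = float(thresholds[np.argmax(f1)])

# Apply the frozen threshold to the test split
te_pred = (model.predict_proba(X_te)[:, 1] >= best_threshold).astype(int)
test_f1 = f1_score(y_te, te_pred)
print(f"threshold={best_threshold:.3f}, test F1={test_f1:.3f}")
```

Selecting the threshold on validation data keeps the test split untouched until the final report.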


_Automatically generated • As of: 2025-09-26 17:57 UTC_