# Exam Period 3 prep: anomaly detection template, **unsupervised only (no labels)** (toy data + problem + model answer + practical commentary)

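This notebook targets problems where **labels are missing (or cannot be trusted)**,  
so anomalies must be found with unsupervised learning alone and turned into an **operating policy (Top-N/threshold)**.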

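## Key points
- Preprocessing (categorical/numeric)
- Run at least 2 of these 3 unsupervised models:
  - IsolationForest
  - PCA/SVD reconstruction error
  - (Optional) AutoEncoder (if time permits)
- Normalize scores to the 0 to 1 range → policy (Top 5%, Top 1%, etc.)
- Print a report and, if possible, check the "suspect features" (simple statistics/group comparison)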


## 0) Imports & reproducibility

In [10]:
import random, warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest

warnings.filterwarnings("ignore")
SEED=42
random.seed(SEED)
np.random.seed(SEED)


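## 1) Toy data generation (unlabeled version)

- In the actual exam, replace this with `pd.read_csv`
- Here the data is generated **without labels**, and
  a subtle anomaly signal is planted in a small subset of rows (a ground-truth label `y_hidden` is created, but it is not used for evaluation)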
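
For the real exam data, the generator below can be swapped for a CSV load. A minimal sketch, assuming a hypothetical `CSV_PATH` and that categorical columns can be told apart by dtype; neither assumption comes from this notebook, and integer-coded ID columns such as `line_id` have to be listed by hand through the hypothetical `force_cat` argument:

```python
import io
import pandas as pd

def load_and_split_columns(path_or_buffer, force_cat=()):
    """Read a CSV and split its columns into categorical and numeric lists.

    force_cat: columns to treat as categorical even if they are stored as
    numbers (hypothetical helper argument, e.g. integer-coded line IDs).
    """
    df = pd.read_csv(path_or_buffer)
    cat_cols = [c for c in df.columns
                if c in force_cat or not pd.api.types.is_numeric_dtype(df[c])]
    num_cols = [c for c in df.columns if c not in cat_cols]
    return df, cat_cols, num_cols

# Stand-in for CSV_PATH so the sketch runs without a file on disk.
demo_csv = io.StringIO("line_id,shift,temp_mean\n1,day,70.5\n2,night,\n")
df_demo, cat_demo, num_demo = load_and_split_columns(demo_csv, force_cat=("line_id",))
print("categorical:", cat_demo)
print("numeric    :", num_demo)
```

In the exam, `demo_csv` would be replaced by the actual file path, and the two returned lists would take the place of `cat_cols` and `num_cols` in the EDA section.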


In [11]:
def make_toy_unlabeled(n=25000, seed=42):
    rng = np.random.default_rng(seed)
    line_id = rng.integers(1, 7, size=n).astype(str)
    shift = rng.choice(["day","swing","night"], size=n, p=[0.5,0.3,0.2])
    supplier = rng.choice(["A","B","C","D","E"], size=n, p=[0.25,0.25,0.2,0.2,0.1])
    material_grade = rng.choice(["G1","G2","G3"], size=n, p=[0.55,0.35,0.10])

    temp_mean = rng.normal(70, 5, size=n) + (line_id.astype(int)-3)*0.7
    pressure_std = np.abs(rng.normal(1.1, 0.35, size=n))
    vibration_max = np.abs(rng.normal(3.0, 1.0, size=n)) + (shift=="night")*0.6
    humidity = np.clip(rng.normal(45, 10, size=n), 10, 90)
    cycle_time = rng.normal(120, 15, size=n) + (material_grade=="G3")*7
    error_cnt = rng.poisson(1.3, size=n) + (supplier=="E")*1

    df = pd.DataFrame({
        "line_id": line_id,
        "shift": shift,
        "supplier": supplier,
        "material_grade": material_grade,
        "temp_mean": temp_mean,
        "pressure_std": pressure_std,
        "vibration_max": vibration_max,
        "humidity": humidity,
        "cycle_time": cycle_time,
        "error_cnt": error_cnt
    })

    # Missing values (about 2% per column)
    for col in ["temp_mean","pressure_std","humidity"]:
        m = rng.random(n) < 0.02
        df.loc[m, col] = np.nan

    # Hidden anomaly label y_hidden, derived from a risk score over the features
    risk = (
        0.9*np.maximum(0, vibration_max-3.7)
        + 1.3*np.maximum(0, error_cnt-2)
        + 0.04*np.maximum(0, cycle_time-132)
        + (supplier=="E")*0.8
        + (material_grade=="G3")*0.9
        + ((supplier=="C") & (shift=="night"))*0.7
        + rng.normal(0, 0.3, size=n)
    )
    y_hidden = (risk >= np.quantile(risk, 0.98)).astype(int)  # top 2% by risk; unknown in a real setting

    # Inject explicit outliers (about 0.5% of rows)
    o = rng.random(n) < 0.005
    df.loc[o, "vibration_max"] *= 5
    df.loc[o, "error_cnt"] += 15

    return df, y_hidden

df, y_hidden = make_toy_unlabeled()
df.head()


| | line_id | shift | supplier | material_grade | temp_mean | pressure_std | vibration_max | humidity | cycle_time | error_cnt |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | swing | B | G2 | 72.201846 | 1.611765 | 3.355289 | 37.261277 | 129.428981 | 1 |
| 1 | 5 | day | C | G3 | 63.364482 | 0.855629 | 2.114496 | 38.691957 | 114.966993 | 0 |
| 2 | 4 | day | B | G1 | 71.581965 | 0.867188 | 2.301938 | NaN | 79.426107 | 0 |
| 3 | 3 | night | C | G2 | 79.701367 | 0.797913 | 4.521331 | 44.411507 | 123.758695 | 1 |
| 4 | 3 | swing | C | G1 | 72.990901 | 1.714678 | 3.042883 | 54.778454 | 108.307165 | 3 |


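## 2) (Exam-style) Problem requirements (unsupervised)
1) Data exploration (missing values/distributions)
2) Preprocessing
3) Unsupervised anomaly scores (at least 2 techniques)
4) Propose a policy (Top 5%, etc.)
5) Print a report
6) (Optional) Compare the characteristics of anomalous samples (group statistics)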

## 3) EDA

In [12]:
cat_cols=["line_id","shift","supplier","material_grade"]
num_cols=[c for c in df.columns if c not in cat_cols]

print("Shape:", df.shape)
print("\nMissing ratio top:")
print(df.isna().mean().sort_values(ascending=False).head(10))

# Numeric summary
print("\nNumeric summary:")
display(df[num_cols].describe().T)

# Example of a categorical distribution
print("\nSupplier counts:")
print(df["supplier"].value_counts().head(10))


Shape: (25000, 10)

Missing ratio top:
humidity          0.01988
pressure_std      0.01896
temp_mean         0.01836
line_id           0.00000
shift             0.00000
material_grade    0.00000
supplier          0.00000
vibration_max     0.00000
cycle_time        0.00000
error_cnt         0.00000
dtype: float64

Numeric summary:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| temp_mean | 24541.0 | 70.280052 | 5.166099 | 52.280269 | 66.784176 | 70.242681 | 73.722981 | 94.633729 |
| pressure_std | 24526.0 | 1.101311 | 0.349623 | 0.001982 | 0.866103 | 1.099872 | 1.336641 | 2.704774 |
| vibration_max | 25000.0 | 3.172772 | 1.371398 | 0.004359 | 2.422261 | 3.12162 | 3.822588 | 28.32371 |
| humidity | 24503.0 | 45.141932 | 9.987611 | 10.0 | 38.351283 | 45.088172 | 51.932782 | 90.0 |
| cycle_time | 25000.0 | 120.658174 | 15.04739 | 46.076423 | 110.544637 | 120.768092 | 130.783157 | 181.344204 |
| error_cnt | 25000.0 | 1.4628 | 1.526614 | 0.0 | 1.0 | 1.0 | 2.0 | 21.0 |



Supplier counts:
supplier
B    6233
A    6189
D    5070
C    4996
E    2512
Name: count, dtype: int64
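
Besides the summary table, the share of values outside the 1.5×IQR fences gives a quick per-column view of where extreme values sit. A minimal sketch on made-up data; the helper name `iqr_outlier_ratio` and the 1.5 multiplier are assumptions for illustration, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def iqr_outlier_ratio(frame, k=1.5):
    """Share of non-missing values outside [Q1 - k*IQR, Q3 + k*IQR], per column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons with NaN evaluate to False, so missing cells are never counted.
    outside = frame.lt(q1 - k * iqr, axis=1) | frame.gt(q3 + k * iqr, axis=1)
    return outside.sum() / frame.notna().sum()

rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=1000), "b": rng.normal(size=1000)})
demo.loc[:9, "b"] = 50.0  # plant 10 extreme values in column b
ratios = iqr_outlier_ratio(demo)
print(ratios.sort_values(ascending=False))
```

On the notebook's data this would be called as `iqr_outlier_ratio(df[num_cols])`; columns with a high ratio are natural candidates for the "suspect feature" check later on.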


## 4) Preprocessing (categorical/numeric)

In [13]:
USE_ROBUST=True
scaler = RobustScaler() if USE_ROBUST else StandardScaler()

cat_pipe = Pipeline([
    ("imp", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])
num_pipe = Pipeline([
    ("imp", SimpleImputer(strategy="median")),
    ("sc", scaler)
])
preprocess = ColumnTransformer([
    ("cat", cat_pipe, cat_cols),
    ("num", num_pipe, num_cols)
])

X_enc = preprocess.fit_transform(df[cat_cols+num_cols])
print("Encoded shape:", X_enc.shape)


Encoded shape: (25000, 23)


## 5) Unsupervised model 1: IsolationForest → anomaly score

In [14]:
iso = IsolationForest(
    n_estimators=300,
    contamination="auto",
    random_state=SEED,
    n_jobs=-1
)

DENSE_FOR_ISO=False
X_iso = X_enc.toarray() if (DENSE_FOR_ISO and hasattr(X_enc,"toarray")) else X_enc
iso.fit(X_iso)

iso_score = -iso.score_samples(X_iso)  # higher = more anomalous
print("iso score:", np.min(iso_score), np.mean(iso_score), np.max(iso_score))


iso score: 0.4367090437569677 0.4973533212246616 0.6273583798571569
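
`score_samples` returns higher values for more normal points, which is why the cell above negates it. A minimal sketch on synthetic 2-D data (the dense cluster and the five far-away points are made up for illustration) to confirm the orientation of the negated score:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 2))   # dense cluster around the origin
far = rng.normal(8.0, 0.5, size=(5, 2))        # a few far-away points
X_demo = np.vstack([normal, far])

iso_demo = IsolationForest(n_estimators=200, random_state=0).fit(X_demo)
demo_score = -iso_demo.score_samples(X_demo)   # negate so that higher = more anomalous

print("mean score, cluster   :", demo_score[:500].mean())
print("mean score, far points:", demo_score[500:].mean())
```

The far-away points should receive the higher mean score. If the ordering looks reversed on real data, the sign of the score is the first thing to check.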


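## 6) Unsupervised model 2: SVD (PCA) reconstruction error → anomaly score
`TruncatedSVD` runs directly on sparse matrices because it does not center the data; for the same reason, the result only approximates PCA unless the features are already centered.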

In [15]:
svd_dim=15
svd = TruncatedSVD(n_components=svd_dim, random_state=SEED)
Z = svd.fit_transform(X_enc)
X_recon = Z @ svd.components_

X_dense = X_enc.toarray() if hasattr(X_enc,"toarray") else np.asarray(X_enc)
svd_score = np.mean((X_dense - X_recon)**2, axis=1)

print("svd recon err:", np.min(svd_score), np.mean(svd_score), np.max(svd_score))


svd recon err: 0.0008634993483269636 0.031057689078461154 0.14176862896026773
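
The optional third model from the key points (an AutoEncoder) is not implemented in this notebook. One lightweight stand-in, when no deep learning framework is at hand, is scikit-learn's `MLPRegressor` trained to reproduce its own input through a narrow hidden layer. The sketch below uses made-up data; the layer sizes and iteration count are arbitrary assumptions, not tuned values:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Normal rows lie near a 2-D subspace of an 8-D space; 10 planted rows break that structure.
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 8))
X_demo = latent @ mixing + rng.normal(0, 0.05, size=(1000, 8))
X_demo[:10] = rng.normal(0, 3.0, size=(10, 8))

# Bottleneck of 2 units forces the network to compress the input.
ae = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation="tanh",
                  max_iter=300, random_state=0)
ae.fit(X_demo, X_demo)  # target = input, i.e. used as an autoencoder

# Per-row mean squared reconstruction error, same definition as svd_score above.
ae_score = np.mean((X_demo - ae.predict(X_demo)) ** 2, axis=1)
print("mean error, planted rows:", ae_score[:10].mean())
print("mean error, other rows  :", ae_score[10:].mean())
```

With the notebook's data, the dense encoded matrix (`X_dense`) would take the place of `X_demo`, and `ae_score` could be min-max scaled and added to the ensemble like the other two scores.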


## 7) Score combination/normalization → operating policy (Top N%)

In [16]:
def minmax(x):
    x=np.asarray(x)
    return (x - x.min())/(x.max()-x.min()+1e-12)

# Combine (tune the weights to the situation)
ens = 0.5*minmax(iso_score) + 0.5*minmax(svd_score)
score = minmax(ens)

# Policy: Top 5%
TOP_PCT=5
th = np.percentile(score, 100-TOP_PCT)
print("Top %d%% threshold:"%TOP_PCT, th)

is_anom = (score >= th).astype(int)
print("Alarm rate:", is_anom.mean())


Top 5% threshold: 0.5996900034050453
Alarm rate: 0.05
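
Min-max scaling is driven by the single largest value, so one extreme score can compress all the others toward 0. When that is a concern, rank-based normalization is a common alternative. The sketch below is a swapped-in variant, not what the cell above does, and the helper names `rank01` and `top_pct_flags` are hypothetical:

```python
import numpy as np

def rank01(x):
    """Map scores to (0, 1] by rank; ties are broken by position."""
    x = np.asarray(x)
    ranks = np.argsort(np.argsort(x, kind="stable"), kind="stable")  # 0 .. n-1
    return (ranks + 1) / len(x)

def top_pct_flags(score, top_pct):
    """Flag the highest-scoring top_pct percent of rows (at least one row)."""
    score = np.asarray(score)
    k = max(1, int(round(len(score) * top_pct / 100)))
    flags = np.zeros(len(score), dtype=int)
    flags[np.argsort(score)[-k:]] = 1
    return flags

a = np.array([0.1, 0.2, 0.3, 0.4, 100.0])  # one extreme value would dominate min-max
b = np.array([5.0, 1.0, 4.0, 2.0, 3.0])
ens_demo = 0.5 * rank01(a) + 0.5 * rank01(b)
flags_demo = top_pct_flags(ens_demo, top_pct=20)
print(ens_demo)
print(flags_demo)
```

Because `top_pct_flags` selects exactly `k` rows, the alarm count stays fixed even when scores are tied at the cut-off, whereas a `score >= th` rule can flag slightly more rows in that case.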


## 8) Report (Top-N)

In [17]:
out = df.copy()
out["score"]=score
out["is_anomaly"]=is_anom

cols=["line_id","shift","supplier","material_grade","score","is_anomaly"]
report = out[out["is_anomaly"]==1][cols].sort_values("score", ascending=False).head(30)
report


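| | line_id | shift | supplier | material_grade | score | is_anomaly |
|---|---|---|---|---|---|---|
| 5243 | 6 | night | E | G3 | 1.0 | 1 |
| 4713 | 3 | night | E | G3 | 0.989757 | 1 |
| 5139 | 6 | night | E | G3 | 0.989023 | 1 |
| 7511 | 3 | night | E | G3 | 0.979396 | 1 |
| 12019 | 6 | night | E | G3 | 0.967587 | 1 |
| 22570 | 5 | night | E | G3 | 0.962688 | 1 |
| 7351 | 6 | night | E | G3 | 0.95552 | 1 |
| 7976 | 1 | night | E | G3 | 0.949649 | 1 |
| 13478 | 3 | night | E | G3 | 0.946711 | 1 |
| 12424 | 3 | night | E | G3 | 0.943448 | 1 |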


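## 9) (Optional) Anomalous samples vs. all samples (simple statistics)
Even without labels, showing what is different about the flagged samples can earn extra credit.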

In [18]:
num_compare = pd.DataFrame({
    "overall_mean": out[num_cols].mean(numeric_only=True),
    "anom_mean": out.loc[out["is_anomaly"]==1, num_cols].mean(numeric_only=True)
})
num_compare["diff"] = num_compare["anom_mean"] - num_compare["overall_mean"]
num_compare.sort_values("diff", ascending=False).head(20)


| | overall_mean | anom_mean | diff |
|---|---|---|---|
| cycle_time | 120.658174 | 124.026307 | 3.368133 |
| error_cnt | 1.4628 | 1.9616 | 0.4988 |
| vibration_max | 3.172772 | 3.519831 | 0.347059 |
| humidity | 45.141932 | 45.474004 | 0.332072 |
| temp_mean | 70.280052 | 70.544021 | 0.263969 |
| pressure_std | 1.101311 | 1.09705 | -0.004261 |
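
The same comparison can be made for categorical columns by setting each category's share among flagged rows against its overall share. A minimal sketch on made-up data; `category_lift` is a hypothetical helper, and a lift well above 1 only marks a category as over-represented among alarms, not as a cause:

```python
import pandas as pd

def category_lift(frame, col, flag_col="is_anomaly"):
    """Share of each category overall vs. among flagged rows, plus their ratio."""
    overall = frame[col].value_counts(normalize=True)
    flagged = frame.loc[frame[flag_col] == 1, col].value_counts(normalize=True)
    # Categories that never appear among flagged rows get a share of 0.
    res = pd.DataFrame({"overall_share": overall, "anom_share": flagged}).fillna(0.0)
    res["lift"] = res["anom_share"] / res["overall_share"]
    return res.sort_values("lift", ascending=False)

demo = pd.DataFrame({
    "supplier":   ["A", "A", "A", "A", "B", "B", "E", "E", "E", "E"],
    "is_anomaly": [0,   0,   0,   0,   0,   1,   1,   1,   1,   0],
})
lift_demo = category_lift(demo, "supplier")
print(lift_demo)
```

On the notebook's output this would be called as `category_lift(out, "supplier")`, and likewise for the other columns in `cat_cols`.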


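## 10) Exam-room copy-paste checklist (unsupervised only)
- Load the data from `CSV_PATH`, then edit only `cat_cols` and `num_cols`
- Run two models (iso + svd) and combine their scores
- Fixing the policy at Top-N (e.g., 5%) is the safest choice: without labels a score threshold cannot be validated, while Top-N keeps the alarm volume predictable
- For the report, print the top N rows sorted by score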
