Day 1, Session 4: MLlib Pipeline Structure
- Estimator vs. Transformer, Pipeline, fit/transform
- StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
- Deliverable: common pipeline template code

In [None]:
import os
import sys
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.sql import SparkSession

IN_COLAB = "google.colab" in sys.modules
BASE = "/content" if IN_COLAB else os.getcwd()
CSV_PATH = os.path.join(BASE, "TestData", "Social_Network_Ads.csv")

spark = SparkSession.builder.appName("Day1_MLlib_Pipeline").getOrCreate()

In [None]:
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(CSV_PATH)
df.limit(3).show()

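Transformer/Estimator by stage
- StringIndexer: categorical value -> numeric index (Estimator; fit determines the vocabulary)
- OneHotEncoder: index -> sparse vector (Estimator in Spark 3.x; fit determines the number of categories)
- VectorAssembler: multiple columns -> single feature vector (a Transformer, not an Estimator)
- StandardScaler: scales each feature to unit standard deviation (Estimator; fit determines mean/std). Centering to mean 0 additionally requires withMean=True, which is off by default.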
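
The first two stages can be mimicked in plain Python to make the fit/transform split concrete. This is a toy sketch, not Spark's implementation: `fit_string_indexer`, `index_value`, and `one_hot` are hypothetical helper names. It assumes StringIndexer's default frequency-descending ordering, `handleInvalid="keep"` (an unseen label gets the extra index `len(vocab)`), and OneHotEncoder's default `dropLast=True`.

```python
from collections import Counter

def fit_string_indexer(values):
    """'fit' step: learn the label -> index vocabulary from the data."""
    counts = Counter(values)
    # Most frequent label gets index 0; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def index_value(vocab, value):
    """'transform' step: an unseen label maps to the extra index len(vocab),
    mirroring handleInvalid="keep"."""
    return vocab.get(value, len(vocab))

def one_hot(index, size, drop_last=True):
    """Encode an index as a 0/1 list. With drop_last=True the list has
    size - 1 entries, so the last category becomes all zeros."""
    length = size - 1 if drop_last else size
    return [1.0 if i == index else 0.0 for i in range(length)]

# Hypothetical sample values for the Gender column.
genders = ["Male", "Female", "Female", "Male", "Female"]
vocab = fit_string_indexer(genders)
indices = [index_value(vocab, g) for g in genders]
print(vocab)
print(indices)
print(one_hot(indices[0], size=len(vocab) + 1))
```

The vocabulary is fixed at fit time, which is why the same fitted model has to be reused on new data: refitting on a different dataset could assign different indices to the same labels.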

In [None]:
indexer = StringIndexer(inputCol="Gender", outputCol="Gender_idx").setHandleInvalid("keep")
encoder = OneHotEncoder(inputCols=["Gender_idx"], outputCols=["Gender_ohe"])
assembler = VectorAssembler(
    inputCols=["Age", "EstimatedSalary", "Gender_ohe"],
    outputCol="features"
)
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
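
What the scaler computes can be worked through by hand. The sketch below is plain Python, not Spark's code, and `fit_scaler` and `scale` are hypothetical names. It assumes the two `StandardScaler` parameters `withStd` (default True) and `withMean` (default False), and the corrected sample standard deviation (dividing by n - 1).

```python
import math

def fit_scaler(column):
    """'fit' step: learn the mean and the sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((x - mean) ** 2 for x in column) / (n - 1)
    return mean, math.sqrt(variance)

def scale(column, mean, std, with_mean=False, with_std=True):
    """'transform' step: optionally center, optionally divide by std.
    Assumes std > 0 (zero-variance features are not handled here)."""
    shift = mean if with_mean else 0.0
    factor = std if with_std else 1.0
    return [(x - shift) / factor for x in column]

# Hypothetical sample values for the Age column.
ages = [20.0, 30.0, 40.0]
mean, std = fit_scaler(ages)
print(scale(ages, mean, std))                  # defaults: scaled, not centered
print(scale(ages, mean, std, with_mean=True))  # centered and scaled
```

With the defaults the values are only divided by the standard deviation, so they keep a non-zero mean. Passing `withMean=True` to the real `StandardScaler` centers them as well, at the cost of producing dense vectors.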

Define the stage order with Pipeline, then call fit (a single call trains every stage)

In [None]:
pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler])
model = pipeline.fit(df)
transformed = model.transform(df)
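
How a single `fit` call trains every stage can be sketched without Spark. In this toy model all class and function names are hypothetical: an estimator exposes `fit(data)` and returns a fitted transformer, a transformer exposes `transform(data)`, and the pipeline walks the stages in order, fitting each estimator on the output of the stages before it. It is meant to mirror the documented behavior of `Pipeline.fit`, as an illustration only.

```python
class AddOne:
    """Transformer: needs no training."""
    def transform(self, data):
        return [x + 1 for x in data]

class Centered:
    """Fitted model produced by CenterEstimator."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]

class CenterEstimator:
    """Estimator: learns the mean and returns a transformer."""
    def fit(self, data):
        return Centered(sum(data) / len(data))

def fit_pipeline(stages, data):
    """Fit estimators in order; each one sees the already-transformed data."""
    fitted = []
    for stage in stages:
        model = stage.fit(data) if hasattr(stage, "fit") else stage
        fitted.append(model)
        data = model.transform(data)
    return fitted

def transform_pipeline(fitted, data):
    """Apply every fitted stage in order."""
    for model in fitted:
        data = model.transform(data)
    return data

train = [1.0, 2.0, 3.0]
fitted = fit_pipeline([AddOne(), CenterEstimator()], train)
print(fitted[1].mean)
print(transform_pipeline(fitted, train))
```

The learned mean is computed on the output of `AddOne`, not on the raw input. The same ordering applies in the notebook: the scaler is fit on the assembled `features` column, which only exists after the indexer, encoder, and assembler have run.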

In [None]:
transformed.select("Age", "EstimatedSalary", "Gender", "scaled_features", "Purchased").limit(5).show(truncate=False)

Common pipeline template (deliverable): extends to any number of categorical columns

In [None]:
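# Template: build the same four-step pipeline for arbitrary column lists.
# build_pipeline is a helper name introduced here for the template.
def build_pipeline(categorical_cols, numeric_cols):
    # One indexer/encoder pair per categorical column.
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx").setHandleInvalid("keep") for c in categorical_cols]
    encoders = [OneHotEncoder(inputCols=[c + "_idx"], outputCols=[c + "_ohe"]) for c in categorical_cols]
    # Numeric columns and one-hot vectors are combined into a single feature vector.
    assembler = VectorAssembler(inputCols=list(numeric_cols) + [c + "_ohe" for c in categorical_cols], outputCol="features")
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
    return Pipeline(stages=indexers + encoders + [assembler, scaler])

template_pipeline = build_pipeline(["Gender"], ["Age", "EstimatedSalary"])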

spark.stop()