Day 1, Periods 7-8: Experiment Tracking with MLflow
- Create runs, log params and metrics, save model artifacts, compare experiments
- Deliverables: MLflow run capture, experiment comparison table (templates/mlflow_run_capture_template.md)

In [None]:
import os
import sys
from pathlib import Path

import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql import SparkSession

# Resolve paths relative to /content on Colab, or the working directory elsewhere
IN_COLAB = "google.colab" in sys.modules
BASE = "/content" if IN_COLAB else os.getcwd()
CSV_PATH = os.path.join(BASE, "TestData", "Social_Network_Ads.csv")
MLFLOW_DIR = os.path.join(BASE, "mlruns")
SEED = 42

# Path.as_uri() yields a valid file:// URI on Windows as well as POSIX
mlflow.set_tracking_uri(Path(MLFLOW_DIR).resolve().as_uri())
mlflow.set_experiment("ncs_spark_day1")

spark = SparkSession.builder.appName("Day1_MLflow").getOrCreate()

In [None]:
# Load the CSV, letting Spark infer column types from the data
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(CSV_PATH)
)
# Preprocessing: index and one-hot encode Gender, assemble the features, standardize
indexer = StringIndexer(inputCol="Gender", outputCol="Gender_idx").setHandleInvalid("keep")
encoder = OneHotEncoder(inputCols=["Gender_idx"], outputCols=["Gender_ohe"])
assembler = VectorAssembler(inputCols=["Age", "EstimatedSalary", "Gender_ohe"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler])
# The pipeline is fit on the full dataset before splitting, so the scaler's statistics include test rows
df_ready = pipeline.fit(df).transform(df)
data = df_ready.select("scaled_features", "Purchased").withColumnRenamed("scaled_features", "features")
train_data, test_data = data.randomSplit([0.8, 0.2], seed=SEED)

Run 1: Baseline LR

In [None]:
with mlflow.start_run(run_name="baseline_lr"):
    # Fit with default hyperparameters and score on the held-out split
    lr = LogisticRegression(featuresCol="features", labelCol="Purchased")
    model = lr.fit(train_data)
    preds = model.transform(test_data)
    evaluator = BinaryClassificationEvaluator(
        labelCol="Purchased", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
    )
    auc = evaluator.evaluate(preds)
    # Record the configuration, the test metric, and the fitted model as an artifact
    mlflow.log_param("model", "LogisticRegression")
    mlflow.log_param("regParam", str(lr.getRegParam()))
    mlflow.log_metric("test_auc", auc)
    mlflow.spark.log_model(model, "model")

Run 2: Tuned LR (CrossValidator best model)

In [None]:
lr = LogisticRegression(featuresCol="features", labelCol="Purchased")
evaluator = BinaryClassificationEvaluator(
    labelCol="Purchased", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
)
# 2 x 2 grid over regularization strength and L1/L2 mix, scored by 3-fold cross-validation
param_grid = (
    ParamGridBuilder()
    .addGrid(lr.regParam, [0.01, 0.1])
    .addGrid(lr.elasticNetParam, [0.0, 0.5])
    .build()
)
cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3, seed=SEED)

with mlflow.start_run(run_name="tuned_lr"):
    cv_model = cv.fit(train_data)
    best = cv_model.bestModel
    # CrossValidatorModel.transform delegates to bestModel
    preds = cv_model.transform(test_data)
    auc_tuned = evaluator.evaluate(preds)
    # Log the hyperparameters actually selected by cross-validation
    mlflow.log_param("model", "LogisticRegression")
    mlflow.log_param("regParam", str(best.getRegParam()))
    mlflow.log_param("elasticNetParam", str(best.getElasticNetParam()))
    mlflow.log_metric("test_auc", auc_tuned)
    mlflow.spark.log_model(best, "model")
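
`cv_model.bestModel` is the model refit on the full training set using the parameter map with the best fold-averaged evaluator score. To see how the other grid points compared, the scores in `cv_model.avgMetrics` can be paired with the parameter maps, which are kept in the same order. The following is a minimal sketch, not part of the original notebook: `rank_param_maps` is a hypothetical helper name, and it assumes a metric where larger is better, as is the case for areaUnderROC.

```python
from typing import Any, Dict, List, Mapping, Sequence, Tuple


def rank_param_maps(
    param_maps: Sequence[Mapping[Any, Any]],
    avg_metrics: Sequence[float],
) -> List[Tuple[Dict[str, Any], float]]:
    """Pair each parameter map with its fold-averaged metric, best first."""
    ranked = []
    for param_map, metric in zip(param_maps, avg_metrics):
        # Spark Param objects expose .name; fall back to str() for plain keys.
        named = {getattr(p, "name", str(p)): v for p, v in param_map.items()}
        ranked.append((named, metric))
    # Assumption: larger is better, which holds for areaUnderROC.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# In the notebook, after cv.fit(...) has produced cv_model:
# for params, score in rank_param_maps(param_grid, cv_model.avgMetrics):
#     print(params, round(score, 4))
```

Printing the ranked list next to the `tuned_lr` run makes it easier to judge whether the selected combination won clearly or only by a small margin.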

Experiment comparison: launch the MLflow UI with the mlruns folder as its tracking URI and review the run list. Record each run's run_id and metrics in the deliverable template.
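
The same comparison can be produced in code instead of through the UI. The sketch below is one possible approach, under stated assumptions: `fetch_run_rows` and `comparison_table` are hypothetical helper names, the column list is inferred from the params and metrics logged in the two runs, and the layout of the actual template file is not shown here, so the Markdown table is only illustrative. `mlflow.search_runs` returns a pandas DataFrame whose columns carry the prefixes `params.`, `metrics.`, and `tags.`.

```python
from typing import Any, Dict, List, Sequence


def fetch_run_rows(experiment_name: str) -> List[Dict[str, Any]]:
    """Return one dict per run in the experiment, highest test_auc first."""
    # Imported lazily so the formatting helper below works without MLflow.
    import mlflow

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=["metrics.test_auc DESC"],
    )
    return runs.to_dict("records")


def comparison_table(rows: Sequence[Dict[str, Any]], columns: Sequence[str]) -> str:
    """Render the selected columns of each run as a Markdown table."""

    def cell(value: Any) -> str:
        # None or NaN appears when a run never logged that param or metric.
        if value is None or value != value:
            return "-"
        if isinstance(value, float):
            return f"{value:.4f}"
        return str(value)

    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(cell(row.get(c)) for c in columns) + " |" for row in rows]
    return "\n".join([header, divider, *body])


# Assumed column selection, based on what the two runs log.
COLUMNS = [
    "run_id",
    "tags.mlflow.runName",
    "params.regParam",
    "params.elasticNetParam",
    "metrics.test_auc",
]

# In the notebook, after both runs have finished:
# print(comparison_table(fetch_run_rows("ncs_spark_day1"), COLUMNS))
```

The baseline run does not log `elasticNetParam`, so that cell is rendered as `-` rather than left blank.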

In [None]:
print("MLflow runs saved under:", MLFLOW_DIR)
print("To view locally: mlflow ui --backend-store-uri", MLFLOW_DIR)

spark.stop()