Day 2, Periods 3-4: Regression Analysis (LR → RF → GBT)
- Linear Regression baseline
- Improve on it with RandomForest / GBT Regressor
- Evaluation metrics: RMSE, MAE, R²
- Deliverable: comparison table of baseline vs. improved models

In [None]:
import os
import sys
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression, RandomForestRegressor, GBTRegressor
from pyspark.sql import SparkSession

IN_COLAB = "google.colab" in sys.modules
BASE = "/content" if IN_COLAB else os.getcwd()
SEED = 42

spark = SparkSession.builder.appName("Day2_Regression_LR_RF_GBT").getOrCreate()

## 1. Load the Data: California Housing

- California housing price dataset (provided by sklearn)
- 8 features; PRICE is the target (median house value, in units of $100,000)

In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

housing = fetch_california_housing()
pdf = pd.DataFrame(housing.data, columns=housing.feature_names)
pdf["PRICE"] = housing.target
spark_df = spark.createDataFrame(pdf)

In [None]:
# Inspect the data
spark_df.limit(5).toPandas()

In [None]:
# Check the schema
spark_df.printSchema()

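## 2. Preprocessing Pipeline

- VectorAssembler: combines all feature columns into a single vector column
- StandardScaler: standardizes each feature to unit standard deviation; Spark centers to mean 0 only when `withMean=True`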
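
As a rough illustration of what standardization does to a single column, here is a minimal pure-Python sketch. The `standardize` helper and the toy values are hypothetical and not part of the notebook. It assumes the sample standard deviation (n - 1 denominator) and maps a constant column to zeros; both choices are assumptions made for this illustration.

```python
import statistics


def standardize(column, with_mean=True, with_std=True):
    """Standardize one numeric column (hypothetical helper for illustration)."""
    # Centering is optional, mirroring a withMean-style switch.
    mean = statistics.fmean(column) if with_mean else 0.0
    # Sample standard deviation (n - 1 denominator); an assumption of this sketch.
    std = statistics.stdev(column) if with_std else 1.0
    if std == 0.0:
        # Constant column: return zeros instead of dividing by zero (assumption).
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


values = [2.0, 4.0, 6.0, 8.0]  # toy data, not taken from the dataset

scaled = standardize(values)                        # centered and scaled
scaled_only = standardize(values, with_mean=False)  # scaled, but not centered

print("centered + scaled:", [round(v, 4) for v in scaled])
print("scaled only:      ", [round(v, 4) for v in scaled_only])
```

With scaling alone the column has unit standard deviation but keeps a nonzero mean, which is why the centering switch matters if "mean 0" is the goal.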

In [None]:
# Select the numeric feature columns (everything except PRICE)
feature_cols = [c for c in spark_df.columns if c != "PRICE"]

print(f"Feature columns ({len(feature_cols)}): {feature_cols}")

In [None]:
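# Build the pipeline
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
# withMean=True centers each feature; Spark's default only scales to unit standard deviation
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
prep = Pipeline(stages=[assembler, scaler])

# Apply preprocessing
# Note: the scaler is fit on the full dataset here for simplicity; to avoid leaking
# test-set statistics, fit it on the training split only.
df_ready = prep.fit(spark_df).transform(spark_df)
data = df_ready.select("scaled_features", "PRICE").withColumnRenamed("scaled_features", "features")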

In [None]:
# Inspect the transformed data
data.limit(3).toPandas()

## 3. Train/Test Split

In [None]:
train_data, test_data = data.randomSplit([0.8, 0.2], seed=SEED)

print(f"Train size: {train_data.count()}")
print(f"Test size: {test_data.count()}")

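## 4. Baseline: Linear Regression

### Regression Evaluation Metrics
- **RMSE** (Root Mean Squared Error): square root of the mean squared difference between predicted and actual values (lower is better)
- **MAE** (Mean Absolute Error): mean absolute difference between predicted and actual values (lower is better)
- **R²** (R-squared): coefficient of determination, the share of target variance the model explains (closer to 1 is better; at most 1, and it can be negative on test data when the model does worse than predicting the mean)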
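
To make the three definitions concrete, the following sketch computes them by hand in plain Python. The `regression_metrics` helper and the toy numbers are hypothetical and exist only for illustration; the notebook itself relies on `RegressionEvaluator`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length lists (hypothetical helper)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)                # residual sum of squares
    rmse = math.sqrt(ss_res / n)                       # root of the mean squared error
    mae = sum(abs(e) for e in errors) / n              # mean absolute error

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true) # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                         # 1 - SS_res / SS_tot
    return rmse, mae, r2


# Toy values for illustration only; they are not results from this notebook.
y_true = [1.0, 2.0, 3.0, 4.0]
y_close = [1.0, 2.0, 3.0, 2.0]     # one prediction is off by 2
y_reversed = [4.0, 3.0, 2.0, 1.0]  # worse than always predicting the mean

rmse, mae, r2 = regression_metrics(y_true, y_close)
print(f"close:    RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")

rmse_bad, mae_bad, r2_bad = regression_metrics(y_true, y_reversed)
print(f"reversed: RMSE={rmse_bad:.4f} MAE={mae_bad:.4f} R2={r2_bad:.4f}")
```

The second case yields a negative R², since its squared error exceeds that of a constant prediction at the mean. A single large error also pulls RMSE above MAE, because squaring weights large errors more heavily.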

In [None]:
# Train the Linear Regression model
lr = LinearRegression(featuresCol="features", labelCol="PRICE")
lr_model = lr.fit(train_data)
lr_preds = lr_model.transform(test_data)

In [None]:
# Inspect the predictions
lr_preds.select("features", "PRICE", "prediction").limit(5).toPandas()

In [None]:
# Compute the evaluation metrics
rmse_eval = RegressionEvaluator(labelCol="PRICE", predictionCol="prediction", metricName="rmse")
mae_eval = RegressionEvaluator(labelCol="PRICE", predictionCol="prediction", metricName="mae")
r2_eval = RegressionEvaluator(labelCol="PRICE", predictionCol="prediction", metricName="r2")

lr_rmse = rmse_eval.evaluate(lr_preds)
lr_mae = mae_eval.evaluate(lr_preds)
lr_r2 = r2_eval.evaluate(lr_preds)

In [None]:
print("=" * 60)
print("Linear Regression (Baseline) - Test Metrics:")
print("=" * 60)
print(f"  RMSE: {lr_rmse:.4f}")
print(f"  MAE:  {lr_mae:.4f}")
print(f"  R²:   {lr_r2:.4f}")
print("=" * 60)

## 5. Improved Model 1: Random Forest Regressor

In [None]:
rf = RandomForestRegressor(featuresCol="features", labelCol="PRICE", seed=SEED)
rf_model = rf.fit(train_data)
rf_preds = rf_model.transform(test_data)

In [None]:
rf_rmse = rmse_eval.evaluate(rf_preds)
rf_mae = mae_eval.evaluate(rf_preds)
rf_r2 = r2_eval.evaluate(rf_preds)

In [None]:
print("=" * 60)
print("Random Forest Regressor - Test Metrics:")
print("=" * 60)
print(f"  RMSE: {rf_rmse:.4f}")
print(f"  MAE:  {rf_mae:.4f}")
print(f"  R²:   {rf_r2:.4f}")
print("=" * 60)

In [None]:
# Feature Importance
print("\nRandom Forest Feature Importances:")
for feature, importance in zip(feature_cols, rf_model.featureImportances.toArray()):
    print(f"  {feature:20s}: {importance:.4f}")

## 6. Improved Model 2: Gradient Boosted Trees Regressor

In [None]:
gbt = GBTRegressor(featuresCol="features", labelCol="PRICE", seed=SEED)
gbt_model = gbt.fit(train_data)
gbt_preds = gbt_model.transform(test_data)

In [None]:
gbt_rmse = rmse_eval.evaluate(gbt_preds)
gbt_mae = mae_eval.evaluate(gbt_preds)
gbt_r2 = r2_eval.evaluate(gbt_preds)

In [None]:
print("=" * 60)
print("Gradient Boosted Trees Regressor - Test Metrics:")
print("=" * 60)
print(f"  RMSE: {gbt_rmse:.4f}")
print(f"  MAE:  {gbt_mae:.4f}")
print(f"  R²:   {gbt_r2:.4f}")
print("=" * 60)

In [None]:
# Feature Importance
print("\nGBT Feature Importances:")
for feature, importance in zip(feature_cols, gbt_model.featureImportances.toArray()):
    print(f"  {feature:20s}: {importance:.4f}")

## 7. Final Comparison (Deliverable)

In [None]:
import matplotlib.pyplot as plt

# Summarize the results
results = {
    "Model": ["Linear Regression", "Random Forest", "GBT"],
    "RMSE": [lr_rmse, rf_rmse, gbt_rmse],
    "MAE": [lr_mae, rf_mae, gbt_mae],
    "R²": [lr_r2, rf_r2, gbt_r2]
}

results_df = pd.DataFrame(results)
print("\n" + "=" * 60)
print("Model performance comparison (record in the deliverable)")
print("=" * 60)
print(results_df.to_string(index=False))
print("=" * 60)

In [None]:
# Visualization: RMSE comparison
plt.figure(figsize=(10, 6))
plt.bar(results["Model"], results["RMSE"], color=['blue', 'green', 'orange'], alpha=0.7)
plt.title('RMSE Comparison (Lower is Better)', fontsize=14)
plt.xlabel('Model', fontsize=12)
plt.ylabel('RMSE', fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Visualization: R² comparison
plt.figure(figsize=(10, 6))
plt.bar(results["Model"], results["R²"], color=['blue', 'green', 'orange'], alpha=0.7)
plt.title('R² Comparison (Higher is Better)', fontsize=14)
plt.xlabel('Model', fontsize=12)
plt.ylabel('R²', fontsize=12)
plt.ylim(0, 1)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

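## Summary

**What we covered:**
1. Defining the regression problem and preparing the data
2. Preprocessing pipeline (VectorAssembler + StandardScaler)
3. Linear Regression baseline (RMSE, MAE, R²)
4. Improving with Random Forest Regressor (Feature Importance)
5. Improving with GBT Regressor
6. Comparing and visualizing the performance of the three models

**Deliverables:**
- Model performance comparison table (RMSE, MAE, R²)
- Feature Importance (RF, GBT)
- Rationale for selecting the best model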
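
One possible way to back the model-selection rationale with numbers is sketched below. The helpers `pick_best` and `relative_improvement`, the model names, and every metric value are hypothetical placeholders, not results from this notebook; substitute the rows of the actual comparison table.

```python
def pick_best(rows, metric="rmse", lower_is_better=True):
    """Return the row with the best value for the given metric (hypothetical helper)."""
    chooser = min if lower_is_better else max
    return chooser(rows, key=lambda row: row[metric])


def relative_improvement(baseline, candidate):
    """Percentage by which an error metric dropped relative to the baseline."""
    return (baseline - candidate) / baseline * 100.0


# Placeholder values for illustration only; NOT results from this notebook.
rows = [
    {"model": "ModelA", "rmse": 0.80, "mae": 0.60, "r2": 0.50},  # treated as the baseline
    {"model": "ModelB", "rmse": 0.60, "mae": 0.42, "r2": 0.72},
    {"model": "ModelC", "rmse": 0.64, "mae": 0.45, "r2": 0.68},
]

baseline = rows[0]
best = pick_best(rows, metric="rmse")                          # lowest RMSE wins
best_r2 = pick_best(rows, metric="r2", lower_is_better=False)  # highest R² wins
gain = relative_improvement(baseline["rmse"], best["rmse"])

print(f"Best by RMSE: {best['model']} ({gain:.1f}% lower RMSE than {baseline['model']})")
print(f"Best by R2:   {best_r2['model']}")
```

Checking that different metrics agree on the winner, as done here with RMSE and R², makes the rationale easier to defend than citing a single number.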

In [None]:
print("\n=== CSV format (for the deliverable) ===")
print("model,rmse,mae,r2")
print(f"LinearRegression,{lr_rmse:.4f},{lr_mae:.4f},{lr_r2:.4f}")
print(f"RandomForestRegressor,{rf_rmse:.4f},{rf_mae:.4f},{rf_r2:.4f}")
print(f"GBTRegressor,{gbt_rmse:.4f},{gbt_mae:.4f},{gbt_r2:.4f}")

In [None]:
spark.stop()