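#### Using the California_housing dataset, develop a regression model that predicts house prices, following the requirements below.

- Apply all of the regression models
- Find the optimal hyperparameters for each model using GridSearchCV
- Using MSE as the evaluation metric, apply the best-performing model and its parameters and print the evaluation results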

In [8]:
# Decision Tree v
# Random Forest v
# Logistic Regression v
# KNN Algorithm
# Support Vector Machines ( SVC - Linear ) v
# KNeighborsClassifier
# Gradient Boosting Regression v
# LightGBM v
# Ridge Regression v
# Lasso Regression v
# Elastic Net Regression v

In [10]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score

# Load the data
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the pipeline
pipeline = Pipeline([
    ('svd', TruncatedSVD(n_components=10)),
    ('logreg', LogisticRegression(max_iter=1000))
])

# Set up the hyperparameter grid
param_grid = {
    'svd__n_components': [2, 5, 10],
    'logreg__C': [0.001, 0.01, 0.1, 1, 10, 100]
}

# Hyperparameter tuning with GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best Parameters:", grid_search.best_params_)

# Define a custom evaluation function
def evaluate_model(model, X_test, y_test):
    # Predict
    y_pred = model.predict(X_test)

    # Accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Build the classification report
    report = classification_report(y_test, y_pred)

    # Compute the ROC AUC score
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test, y_pred_proba)

    # Print the results
    print("Accuracy:", accuracy)
    print("Classification Report:\n", report)
    print("ROC AUC Score:", roc_auc)

# Evaluate the best model on the test data
evaluate_model(grid_search.best_estimator_, X_test, y_test)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best Parameters: {'logreg__C': 10, 'svd__n_components': 10}
Accuracy: 0.956140350877193
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.91      0.94        43
           1       0.95      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

ROC AUC Score: 0.9950867998689813
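
The convergence warnings printed by this cell suggest scaling the data, and `StandardScaler` is imported in the cell but never used. The following is a minimal sketch, assuming a scaling step is simply placed in front of `TruncatedSVD` in the same pipeline. The step name `scaler`, the fixed `C=1.0`, and `random_state=0` are illustrative choices, not values taken from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: standardize the features first so the solver sees
# comparably scaled inputs; the other steps mirror the cell above.
scaled_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svd', TruncatedSVD(n_components=10, random_state=0)),
    ('logreg', LogisticRegression(C=1.0, max_iter=1000)),
])
scaled_pipeline.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_test, scaled_pipeline.predict(X_test))
print("Accuracy with scaling:", scaled_accuracy)
```

Because the scaler lives inside the pipeline, passing `scaled_pipeline` to `GridSearchCV` would refit it on each training fold rather than on the full training set.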




In [21]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np

data = fetch_california_housing()

# Load and split the data
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [16]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
import numpy as np

def get_model_cv_prediction():
    # Define the model names and model classes
    models = {
        'dt': DecisionTreeRegressor,
        'rf': RandomForestRegressor,
        'gb': GradientBoostingRegressor,
        'lr': LinearRegression,
        'svr': SVR,
        'xgb': XGBRegressor,
        'lgb': LGBMRegressor,
        'rr': Ridge,
        'lasso': Lasso,
        'elasticnet': ElasticNet
    }

    # Set the model hyperparameters
    params = {
        'dt': {'random_state': 0, 'max_depth': 4},
        'rf': {'random_state': 0, 'n_estimators': 1000},
        'gb': {'random_state': 0, 'n_estimators': 1000},
        'lr': {},
        'svr': {'kernel': 'linear', 'C': 1.0},
        'xgb': {'n_estimators': 1000},
        'lgb': {'n_estimators': 1000, 'verbose': -1},
        'rr': {'alpha': 0.1},
        'lasso': {'alpha': 0.1},
        'elasticnet': {'alpha': 0.1, 'l1_ratio': 0.5}
    }

    # Create the model objects
    model_objects = {}
    for model_name, model_class in models.items():
        model_objects[model_name] = model_class(**params[model_name])

    return model_objects

def evaluate_models(X_data, y_target):
    model_dict = get_model_cv_prediction()

    for name, model in model_dict.items():
        # Evaluate the model with cross-validation
        neg_mse_scores = cross_val_score(model, X_data, y_target, scoring='neg_mean_squared_error', cv=5)
        rmse_scores = np.sqrt(-neg_mse_scores)
        avg_rmse = np.mean(rmse_scores)

        # Print the results
        print(f'{name} model 5 folds, individual Negative MSE scores: {np.round(neg_mse_scores, 3)}')
        print(f'{name} model 5 folds, individual RMSE scores : {np.round(rmse_scores, 3)}')
        print(f'{name} model 5 folds, mean RMSE : {avg_rmse:.3f}\n')

evaluate_models(X_train_scaled, y_train)


dt model 5 folds, individual Negative MSE scores: [-0.591 -0.539 -0.561 -0.574 -0.59 ]
dt model 5 folds, individual RMSE scores : [0.769 0.734 0.749 0.758 0.768]
dt model 5 folds, mean RMSE : 0.756

rf model 5 folds, individual Negative MSE scores: [-0.26  -0.264 -0.254 -0.252 -0.262]
rf model 5 folds, individual RMSE scores : [0.51  0.514 0.504 0.502 0.512]
rf model 5 folds, mean RMSE : 0.508

gb model 5 folds, individual Negative MSE scores: [-0.224 -0.224 -0.231 -0.221 -0.236]
gb model 5 folds, individual RMSE scores : [0.473 0.473 0.48  0.47  0.486]
gb model 5 folds, mean RMSE : 0.477

lr model 5 folds, individual Negative MSE scores: [-0.52  -0.502 -0.521 -0.508 -0.546]
lr model 5 folds, individual RMSE scores : [0.721 0.709 0.721 0.713 0.739]
lr model 5 folds, mean RMSE : 0.721

svr model 5 folds, individual Negative MSE scores: [-0.54  -0.528 -6.042 -0.924 -0.808]
svr model 5 folds, individual RMSE scores : [0.735 0.726 2.458 0.961 0.899]
svr model 5 folds, mean RMSE : 1.156

xgb model 5 folds, individual Negative MSE scores: [-0.23  -0.241 -0.23  -0.22  -0.228]
xgb model 5

- GradientBoosting, XGB, and LGBM have the lowest RMSE values relative to the other models, so they are presumed to be the better-performing candidates.
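
The comparison in the note can be reproduced from the printed fold scores. The sketch below copies the rounded negative MSE values from the cross-validation output above, converts them to RMSE, and sorts the models; `lgb` is left out because its output is cut off, and the variable names are illustrative.

```python
import numpy as np

# Negative MSE per fold, copied from the printed output above
# (rounded to three decimals; lgb is omitted because its output is cut off).
neg_mse_by_model = {
    'dt':  [-0.591, -0.539, -0.561, -0.574, -0.590],
    'rf':  [-0.260, -0.264, -0.254, -0.252, -0.262],
    'gb':  [-0.224, -0.224, -0.231, -0.221, -0.236],
    'lr':  [-0.520, -0.502, -0.521, -0.508, -0.546],
    'svr': [-0.540, -0.528, -6.042, -0.924, -0.808],
    'xgb': [-0.230, -0.241, -0.230, -0.220, -0.228],
}

# scikit-learn negates MSE so that larger is better; flip the sign
# before taking the square root to obtain the RMSE of each fold.
mean_rmse = {
    name: float(np.mean(np.sqrt(-np.array(scores))))
    for name, scores in neg_mse_by_model.items()
}

# Lower RMSE is better, so sort in ascending order.
ranking = sorted(mean_rmse, key=mean_rmse.get)
for name in ranking:
    print(f'{name}: mean RMSE = {mean_rmse[name]:.3f}')
```

The ranking also makes the SVR result easier to read: one fold with a negative MSE of -6.042 pulls its mean RMSE above 1, even though its other folds are closer to the linear models.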



In [None]:
gb, xgb, lgb = GradientBoostingRegressor(), XGBRegressor(), LGBMRegressor()
gb.fit(X_train_scaled, y_train)
xgb.fit(X_train_scaled, y_train)
lgb.fit(X_train_scaled, y_train)


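# Apply GridSearchCV to Lasso Regression

# alpha : 0.001 to 100, in steps of 10x
# fit_intercept : False or True / default: True / whether the model includes an intercept
# precompute : precompute X^T X (the Gram matrix) to speed up fitting / not suitable for large datasets / default: False
# max_iter : maximum number of iterations / default: 1000
# positive : restrict the regression coefficients to be positive / default: False
# random_state

# Hyperparameter tuning with GridSearchCV
lasso_param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(Lasso(), lasso_param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train_scaled, y_train)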

In [24]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import numpy as np

# Set up the hyperparameter grid for each model
param_grids = {
     'gb': {
        'n_estimators': [100, 200, 500],
        'learning_rate': [0.01, 0.1, 1],
        'max_depth': [3, 5, 7]
    },
    'xgb': {
        'n_estimators': [100, 200, 500],
        'learning_rate': [0.01, 0.1, 1],
        'max_depth': [3, 5, 7]
    },
    'lgb': {
        'n_estimators': [100, 200, 500, 1000],
        'learning_rate': [0.01, 0.1, 1],
        'max_depth': [7, 10]
    }
}

# Create the model objects
models = {
    'gb': GradientBoostingRegressor(),
    'xgb': XGBRegressor(),
    'lgb': LGBMRegressor()
}

# Define a custom evaluation function
def evaluate_model(y_pred, y_test):
    # Compute RMSE
    rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

    # Print the result
    print("RMSE:", rmse)

# Tune hyperparameters and evaluate each model
for model_name, model in models.items():
    print(f"Evaluating {model_name}...")

    grid_search = GridSearchCV(model, param_grids[model_name], cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
    grid_search.fit(X_train, y_train)

    # Print the best hyperparameters
    print(f"Best Parameters for {model_name}:", grid_search.best_params_)

    # Evaluate the best model on the test data
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)

    evaluate_model(y_pred, y_test)


Evaluating gb...
Best Parameters for gb: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 500}
RMSE: 0.4559462197976954
Evaluating xgb...
Best Parameters for xgb: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 500}
RMSE: 0.45547070284167057
Evaluating lgb...
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001395 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 16512, number of used features: 8
[LightGBM] [Info] Start training from score 2.071947
Best Parameters for lgb: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 500}
RMSE: 0.43522972934099036


In [27]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import numpy as np

# Set up the hyperparameter grid for each model
param_grids = {
#     'gb': {
#        'n_estimators': [100, 200, 500],
#        'learning_rate': [0.01, 0.1, 1],
#        'max_depth': [3, 5, 7]
#    },
#    'xgb': {
#        'n_estimators': [100, 200, 500],
#        'learning_rate': [0.01, 0.1, 1],
#        'max_depth': [3, 5, 7]
#    },
    'lgb': {
        'n_estimators': [200, 500, 1000],
        'learning_rate': [0.01, 0.1, 1],
        'max_depth': [7, 10],
        'min_child_samples': [20, 50, 100],
        'min_child_weight': [0.1, 1, 10],
        'min_gain_to_split': [0.0, 0.1, 0.2]
    }
}

# Create the model objects
models = {
#    'gb': GradientBoostingRegressor(),
#    'xgb': XGBRegressor(),
    'lgb': LGBMRegressor()
}

# Define a custom evaluation function
def evaluate_model(y_pred, y_test):
    # Compute RMSE
    rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

    # Print the result
    print("RMSE:", rmse)

# Tune hyperparameters and evaluate each model
for model_name, model in models.items():
    print(f"Evaluating {model_name}...")

    grid_search = GridSearchCV(model, param_grids[model_name], cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
    grid_search.fit(X_train, y_train)

    # Print the best hyperparameters
    print(f"Best Parameters for {model_name}:", grid_search.best_params_)

    # Evaluate the best model on the test data
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)

    evaluate_model(y_pred, y_test)


Evaluating lgb...
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002651 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 16512, number of used features: 8
[LightGBM] [Info] Start training from score 2.071947
Best Parameters for lgb: {'learning_rate': 0.1, 'max_depth': 10, 'min_child_samples': 20, 'min_child_weight': 0.1, 'min_gain_to_split': 0.0, 'n_estimators': 500}
RMSE: 0.4378896185939582
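
The extended LightGBM grid did not improve on the first search (test RMSE of about 0.438 versus 0.435), and it is considerably more expensive to run. As a rough illustration, the sketch below counts how many candidates and cross-validation fits each grid implies; `count_fits` is a hypothetical helper written for this note, not a scikit-learn function.

```python
from math import prod

# Grids copied from the two LightGBM searches above.
first_lgb_grid = {
    'n_estimators': [100, 200, 500, 1000],
    'learning_rate': [0.01, 0.1, 1],
    'max_depth': [7, 10],
}
extended_lgb_grid = {
    'n_estimators': [200, 500, 1000],
    'learning_rate': [0.01, 0.1, 1],
    'max_depth': [7, 10],
    'min_child_samples': [20, 50, 100],
    'min_child_weight': [0.1, 1, 10],
    'min_gain_to_split': [0.0, 0.1, 0.2],
}

def count_fits(grid, cv=5):
    """Return (candidates, cross-validation fits) for an exhaustive grid.

    The final refit on the full training set is not counted.
    """
    n_candidates = prod(len(values) for values in grid.values())
    return n_candidates, n_candidates * cv

first_counts = count_fits(first_lgb_grid)
extended_counts = count_fits(extended_lgb_grid)
print("First grid:", first_counts)
print("Extended grid:", extended_counts)
```

With `cv=5`, the first grid amounts to 24 candidates and 120 fits, while the extended grid amounts to 486 candidates and 2430 fits.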


- GradientBoosting : 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 500 → RMSE 0.456
- XGB : 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 500 → RMSE 0.455
- LGBM : 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 500 → RMSE 0.435 (the extended grid chose 'max_depth': 10 and scored 0.438)
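
The assignment asks for the best model and parameters by MSE, followed by a final evaluation. One possible way to automate that step is sketched below: each model is tuned with `GridSearchCV`, the winner is chosen by cross-validated MSE, and the test set is used only once at the end. This is an assumption-laden illustration rather than the notebook's own procedure: it uses synthetic data from `make_regression` so that it runs without a download, and the three small grids are made up for brevity.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic stand-in data; in the notebook this would be
# the California housing train/test split.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Deliberately tiny, illustrative grids (not the grids used above).
candidates = {
    'ridge': (Ridge(), {'alpha': [0.1, 1.0, 10.0]}),
    'dt': (DecisionTreeRegressor(random_state=0), {'max_depth': [3, 5]}),
    'gb': (GradientBoostingRegressor(random_state=0, n_estimators=50),
           {'learning_rate': [0.1, 0.3], 'max_depth': [2, 3]}),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=3, scoring='neg_mean_squared_error')
    search.fit(X_train, y_train)
    # best_score_ is the mean negative MSE across folds; negate it to get MSE.
    results[name] = (-search.best_score_, search.best_params_, search.best_estimator_)

# Choose the model with the lowest cross-validated MSE, then use the
# test set once for the final report.
best_name = min(results, key=lambda n: results[n][0])
best_cv_mse, best_params, best_model = results[best_name]
test_mse = mean_squared_error(y_test, best_model.predict(X_test))

print('Best model:', best_name, best_params)
print('CV MSE:', best_cv_mse)
print('Test MSE:', test_mse, '| Test RMSE:', np.sqrt(test_mse))
```

Selecting on the cross-validated score rather than on the test RMSE keeps the test set untouched until the final report, which is a design choice of this sketch; the cells above compare models by their test RMSE instead.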