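#### Boston Housing Price Prediction Model
 - Dataset: boston.csv
 - Learning method: supervised learning >> regression
 - Features / independent variables: 13
 - Target / dependent variable: 1

[1] Data preparation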

In [201]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split

In [202]:
# Load the data
BostonDF = pd.read_csv(r'C:\Users\KDP-17\EX_PANDAS6\MachineLearning\data\boston.csv')
BostonDF.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [203]:
# Check basic information about the data
BostonDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB


[2] Preprocessing
 - [2-1] Data cleaning

In [204]:
### Check for missing values, duplicates, outliers, ...; extract the unique values of each column to spot anomalous data
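
The checks named in the comment above could look like the following minimal sketch. It runs on a small made-up DataFrame (`df` and its values are illustrative, not taken from boston.csv), and the 1.5 x IQR rule for outliers is one common convention assumed here, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for BostonDF
df = pd.DataFrame({
    'RM':   [6.5, 6.4, 7.1, 6.4, np.nan, 6.9],
    'CHAS': [0, 0, 1, 0, 0, 0],
    'TAX':  [296, 242, 242, 242, 222, 5000],
})

# Missing values per column
missing = df.isna().sum()

# Number of rows that exactly repeat an earlier row
n_duplicates = int(df.duplicated().sum())

# Unique values per column (a very low count hints at a categorical feature)
n_unique = df.nunique()

# Outlier candidates in one column using the 1.5 * IQR rule (assumed convention)
q1, q3 = df['TAX'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['TAX'] < q1 - 1.5 * iqr) | (df['TAX'] > q3 + 1.5 * iqr)]

print(missing, n_duplicates, n_unique, outliers, sep='\n\n')
```

On the real data the same calls would be applied to `BostonDF`; whether flagged rows should be dropped or kept is a separate modeling decision.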

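- [2-2] Standardization & normalization ===> the effect on performance differs from case to case
  * Models that assume normally distributed data ==> StandardScaler, log transform, ...
  * Reducing differences between feature value ranges ==> feature scaling, MinMaxScaler, RobustScaler, ...
  * Categorical features ==> numeric encoding with OneHotEncoder, OrdinalEncoder
  * String targets ==> integer label encoding with LabelEncoder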
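
To make the differences between the scalers in the list above concrete, here is a small sketch on a made-up single feature with one extreme value (the array `X` and the string labels are illustrative assumptions). It also shows `LabelEncoder` turning a string target into integers.

```python
import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                   RobustScaler, LabelEncoder)

# Hypothetical single feature with one extreme value
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance
X_mm = MinMaxScaler().fit_transform(X)     # rescaled to the range [0, 1]
X_rob = RobustScaler().fit_transform(X)    # centered on the median, scaled by the IQR

print('standard:', X_std.ravel().round(2))
print('min-max :', X_mm.ravel().round(2))
print('robust  :', X_rob.ravel().round(2))

# String target -> integer labels (classes are numbered in sorted order)
labels = LabelEncoder().fit_transform(['low', 'high', 'mid', 'high'])
print('labels  :', labels)
```

With min-max scaling the extreme value squeezes the other points toward 0, while the robust scaler keeps them spread around the median; that is the usual reason to prefer `RobustScaler` when outliers are present.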

[2-3] Separate the features and the target

In [205]:
featureDF = BostonDF.iloc[:, :-1]
targetSR = BostonDF['MEDV']

In [206]:
print(f'featureDF : {featureDF.shape}, targetSR : {targetSR.shape}')

featureDF : (506, 13), targetSR : (506,)


[3] Training preparation  
[3-1] Split into training and test datasets

In [207]:
X_train, X_test, y_tarin, y_test = train_test_split(featureDF,targetSR,random_state=10) 

In [208]:
print(f'X_train : {X_train.shape}, y_train : {y_tarin.shape}')
print(f'X_test : {X_test.shape}, y_test : {y_test.shape}')

X_train : (379, 13), y_train : (379,)
X_test : (127, 13), y_test : (127,)


[3-2] Create the scaler from the training dataset

In [209]:
### The numeric features have widely different ranges, so apply scaling (StandardScaler)
stan_sca = StandardScaler()

In [210]:
stan_sca.fit(X_train)

In [211]:
X_train_scaled = stan_sca.transform(X_train)
X_test_scaled = stan_sca.transform(X_test)
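
Fitting the scaler on the training split only, then reusing it on the test split, keeps test-set statistics out of training. The sketch below illustrates this on synthetic data (the array `X` is made up so the example runs without boston.csv).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical features

X_tr, X_te = train_test_split(X, random_state=10)

scaler = StandardScaler().fit(X_tr)    # mean and std come from the training split only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(3))  # zero up to floating-point error
print(X_te_scaled.mean(axis=0).round(3))  # near zero, but generally not exactly zero
```

The test split ending up with a mean slightly off zero is expected: it is scaled with parameters it did not contribute to, which is the same situation unseen data will be in.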

[4] Train with cross-validation

In [212]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import Ridge

In [213]:
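### Hyperparameters control model performance ==> tuning
# Changing alpha is a hyperparameter change
alpha_value=[0.,1.0,10,100]

# Create one model instance per alpha value
for value in alpha_value:
    ridge_model = Ridge(alpha=value,max_iter=3) # default alpha: 1.0

    # Run training
    # - cv : 3 folds
    # - scoring : 'neg_mean_squared_error', 'r2'
    # - return_train_score : also report scores on the training folds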
    result = cross_validate(ridge_model, X_train_scaled, y_tarin, cv=3,
                            scoring=['neg_mean_squared_error','r2'],
                            return_train_score=True,
                            return_estimator=True)
    resultDF=pd.DataFrame(result)[['test_r2','train_r2','estimator']]
    print(f'[ridge(alpha={value})]')
    print(resultDF, end='\n\n')

[ridge(alpha=0.0)]
    test_r2  train_r2                     estimator
0  0.747022  0.755720  Ridge(alpha=0.0, max_iter=3)
1  0.756482  0.740082  Ridge(alpha=0.0, max_iter=3)
2  0.680801  0.786156  Ridge(alpha=0.0, max_iter=3)

[ridge(alpha=1.0)]
    test_r2  train_r2          estimator
0  0.748283  0.755663  Ridge(max_iter=3)
1  0.756292  0.740039  Ridge(max_iter=3)
2  0.680991  0.786097  Ridge(max_iter=3)

[ridge(alpha=10)]
    test_r2  train_r2                    estimator
0  0.753103  0.752474  Ridge(alpha=10, max_iter=3)
1  0.755100  0.737457  Ridge(alpha=10, max_iter=3)
2  0.677471  0.783225  Ridge(alpha=10, max_iter=3)

[ridge(alpha=100)]
    test_r2  train_r2                     estimator
0  0.724036  0.708269  Ridge(alpha=100, max_iter=3)
1  0.725993  0.686628  Ridge(alpha=100, max_iter=3)
2  0.627335  0.744452  Ridge(alpha=100, max_iter=3)
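
Per-fold tables like the ones above are hard to compare across alpha values; averaging over the folds gives one row per alpha. The following is a sketch on synthetic data from `make_regression` (an assumption made so the example runs without boston.csv, so its numbers will not match the output above).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the scaled Boston training data
X, y = make_regression(n_samples=300, n_features=13, noise=20.0, random_state=10)

rows = []
for alpha in [0.0, 1.0, 10, 100]:
    cv = cross_validate(Ridge(alpha=alpha), X, y, cv=3,
                        scoring=['neg_mean_squared_error', 'r2'],
                        return_train_score=True)
    rows.append({
        'alpha': alpha,
        'mean_test_r2': cv['test_r2'].mean(),
        'mean_train_r2': cv['train_r2'].mean(),
        # scikit-learn negates MSE so that higher is better; flip the sign back
        'mean_test_mse': -cv['test_neg_mean_squared_error'].mean(),
    })

summaryDF = pd.DataFrame(rows).set_index('alpha')
print(summaryDF)
```

A large gap between `mean_train_r2` and `mean_test_r2` suggests overfitting, while both dropping together as alpha grows suggests the penalty is too strong.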



- Perform hyperparameter tuning and cross-validation at the same time

In [214]:
from sklearn.model_selection import GridSearchCV

In [215]:
# Set the hyperparameter values for Ridge
params = {'alpha':[0.,0.1,0.5,1.0],
          'max_iter':[3,5]}

# ==> 0.,3 => model
# ==> 0.,5 => model
# That is, each alpha value in turn is paired with max_iter 3 and 5, giving 8 models in total
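
The 8 combinations can be listed explicitly with `ParameterGrid`, which expands a parameter dictionary into the individual candidates. As far as I know this is also the order in which `GridSearchCV` numbers its candidates, so the printed index should correspond to the row index in `cv_results_`.

```python
from sklearn.model_selection import ParameterGrid

params = {'alpha': [0., 0.1, 0.5, 1.0],
          'max_iter': [3, 5]}

# Expand the grid: keys are sorted, and the last key (max_iter) varies fastest
candidates = list(ParameterGrid(params))

for idx, combo in enumerate(candidates):
    print(idx, combo)

print('total candidates:', len(candidates))
```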

In [216]:
# Create the model instance
rModel = Ridge()

# Create the GridSearchCV instance
serchCV = GridSearchCV(rModel, params, cv=3,
                       verbose=True,
                       return_train_score=True)
serchCV

In [217]:
# Run training
serchCV.fit(X_train_scaled,y_tarin)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


In [218]:
# Check the best parameters after fit()
serchCV.best_params_

{'alpha': 1.0, 'max_iter': 3}

In [219]:
serchCV.best_index_

6

In [220]:
resultDF = pd.DataFrame(serchCV.cv_results_)

In [222]:
bestmodel = serchCV.best_estimator_
bestmodel
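
A natural next step is to look at the ranking stored in `cv_results_` and to score the best estimator on the held-out test split. The sketch below does this end to end on synthetic data from `make_regression` (an assumption so it runs without boston.csv; the variable names and the alpha grid are illustrative, not the notebook's).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston features and target
X, y = make_regression(n_samples=300, n_features=13, noise=20.0, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=10)

# Scale with statistics from the training split only
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

search = GridSearchCV(Ridge(), {'alpha': [0.1, 0.5, 1.0, 10.0]}, cv=3,
                      return_train_score=True)
search.fit(X_tr_s, y_tr)

# One row per candidate, best-ranked first
cols = ['params', 'mean_train_score', 'mean_test_score', 'rank_test_score']
print(pd.DataFrame(search.cv_results_)[cols].sort_values('rank_test_score'))

# With the default refit=True, best_estimator_ is already refit on all of X_tr_s
best = search.best_estimator_
print('held-out R^2:', best.score(X_te_s, y_te))
```

The cross-validated score picks the hyperparameters, and the test split is used only once at the end, so the final number is not biased by the tuning itself.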