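#### Boston House Price Prediction Model
- Dataset : boston.csv
- Goal : predict Boston house prices
- Learning method : supervised learning - regression
- Features (independent variables) : 13
- Target : 1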


[1] Import modules

In [60]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split

[2] Prepare the data

In [61]:
file_path = '../data/boston.csv'

In [62]:
dataDF = pd.read_csv(file_path)
dataDF.head(2)

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO      B  LSTAT  MEDV
0  0.00632  18.0   2.31     0  0.538  6.575  65.2    4.09    1  296.0     15.3  396.9   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0     17.8  396.9   9.14  21.6


In [63]:
dataDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


[3] Preprocessing

[3-1] Data cleaning

- Check for abnormal data by inspecting missing values, duplicates, outliers, and the unique values of each column
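
The checks listed above can be sketched as follows. This is only a minimal sketch on a small made-up DataFrame: `toyDF` and its values are hypothetical and are not taken from boston.csv, and the 1.5 × IQR rule used for outliers is one common convention that is assumed here for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for dataDF (not the real boston.csv values)
toyDF = pd.DataFrame({
    'RM':   [6.5, 6.4, 6.4, None, 6.1, 6.3],
    'CHAS': [0, 0, 0, 1, 0, 1],
    'MEDV': [24.0, 21.6, 21.6, 33.4, 22.0, 500.0],
})

# 1) Missing values per column
missing_counts = toyDF.isna().sum()

# 2) Fully duplicated rows
duplicate_count = int(toyDF.duplicated().sum())

# 3) Outliers in one column using the 1.5 * IQR rule (an assumed convention)
q1, q3 = toyDF['MEDV'].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (toyDF['MEDV'] < q1 - 1.5 * iqr) | (toyDF['MEDV'] > q3 + 1.5 * iqr)

# 4) Number of unique values per column (NaN is not counted)
unique_counts = toyDF.nunique()

print(missing_counts, duplicate_count, toyDF[outlier_mask], unique_counts, sep='\n\n')
```

On the real dataset the same calls would be applied to `dataDF`; `dataDF.info()` above already shows 506 non-null values in every column, so the missing-value check should come back empty there.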

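[3-2] Standardization & normalization   ===> whether this improves performance depends on the dataset.
    * Models that assume normally distributed data ==> StandardScaler, MinMaxScaler, log transform
    * Reducing the difference in value ranges between features ==> feature scaling: MinMaxScaler, RobustScaler
    * Categorical features ==> numeric encoding: OneHotEncoder, OrdinalEncoder
    * String targets ==> integer label encoding: LabelEncoder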
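
To make the options above concrete, here is a small sketch that applies the three scalers to one made-up numeric column containing an outlier, and the encoders to made-up categorical values. All data here is hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, RobustScaler,
                                   OneHotEncoder, LabelEncoder)

# Hypothetical numeric feature with one large outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(values)   # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(values)       # rescaled to the range [0, 1]
robust = RobustScaler().fit_transform(values)       # centered on the median, scaled by the IQR

# Hypothetical categorical feature and string target
colors = np.array([['red'], ['blue'], ['red']])
onehot = OneHotEncoder().fit_transform(colors).toarray()   # one column per category
labels = LabelEncoder().fit_transform(['cat', 'dog', 'cat'])  # strings -> integers

print(standard.ravel(), minmax.ravel(), robust.ravel(), onehot, labels, sep='\n')
```

With an outlier present, MinMaxScaler squeezes the ordinary values close to 0, while RobustScaler keeps them spread around 0 because the median and IQR are barely affected by the extreme value.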

[3-3] Separate features and target

In [64]:
featureDF=dataDF.iloc[:, :-1]
targetSR=dataDF['MEDV']

In [65]:
print(f'featureDF : {featureDF.shape} targetSR: {targetSR.shape}')

featureDF : (506, 13) targetSR: (506,)


[4] Training preparation

[4-1] Split into training/test datasets

In [66]:
X_train, X_test, y_train, y_test = train_test_split(featureDF, targetSR, random_state=10)

In [67]:
print(f'X_train : {X_train.shape} y_train : {y_train.shape}')
print(f'X_test : {X_test.shape} y_test : {y_test.shape}')

X_train : (379, 13) y_train : (379,)
X_test : (127, 13) y_test : (127,)
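
The 379 / 127 split above follows from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows go to the test set. A minimal sketch on a zero-filled array with the same shape as the data (the array itself is a stand-in, not the real features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as featureDF / targetSR
X = np.zeros((506, 13))
y = np.zeros(506)

# No test_size given: scikit-learn falls back to 0.25
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=10)
print(X_tr.shape, X_te.shape)

# The same split sizes with test_size written out explicitly
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.25, random_state=10)
print(X_tr2.shape, X_te2.shape)
```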


[4-2] Fit the scaler on the training dataset

In [68]:
### - Numeric feature value ranges differ widely ==> apply scaling
ssScaler = StandardScaler()

ssScaler.fit(X_train)


In [69]:
X_train_scaled=ssScaler.transform(X_train)
X_test_scaled = ssScaler.transform(X_test)
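
Fitting the scaler on `X_train` only keeps the test set out of the scaling statistics. When the scaled training data is later cross-validated, though, each validation fold has already contributed to the scaler's mean and variance. One way to avoid that, shown here as a sketch under assumptions rather than as the notebook's method, is to chain the scaler and the model in a pipeline so the scaler is re-fit inside every fold. The data below is synthetic (`make_regression`), not boston.csv.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the Boston features (assumption)
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

# Scaler and model chained: the scaler is re-fit on the training part of each fold
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv_result = cross_validate(pipe, X_train, y_train, cv=3, scoring='r2',
                           return_train_score=True)
print(cv_result['test_score'])

# Final fit on all training data, then a single evaluation on the held-out test set
pipe.fit(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)
print(test_r2)
```

For a small, well-behaved dataset the difference from scaling up front is usually minor, but the pipeline form makes the separation explicit.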

[5] Training ==> run with cross-validation

In [70]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import Ridge, Lasso

In [71]:
### Control (i.e., tune) the hyper-parameter that drives model performance
alpha_values=[0., 1.,  10, 100]

for value in alpha_values:
    # Create the model instance
    ridge_model = Ridge(alpha=value)    # default is 1.0; alpha is the hyper-parameter here
    # Run training
    # - cv : 3 folds
    # - scoring : 'neg_mean_squared_error', 'r2'
    # - return_train_score : also report scores on the training folds
    result=cross_validate(ridge_model, X_train_scaled, y_train, cv=3, 
                        scoring=['neg_mean_squared_error', 'r2'],return_train_score=True,
                        return_estimator=True)

    resultDF=pd.DataFrame(result)[['test_r2', 'train_r2']]
    resultDF['diff'] = abs(resultDF['test_r2']-resultDF['train_r2'])
    # Index of the fold with the smallest train/test gap
    # (sort_values()[0] would return the diff value at label 0, not an index)
    best_idx=resultDF['diff'].idxmin()

    print(result['estimator'][best_idx].coef_)
    print(f'[Ridge(alpha={value})]')
    print(resultDF, end='\n\n')

[-1.41407793  1.56590993  0.15536906  0.65522098 -2.36200159  2.31948624
  0.1173831  -3.59071105  2.71475429 -2.33252925 -1.88390034  1.04036915
 -3.50250877]
[Ridge(alpha=0.0)]
    test_r2  train_r2      diff
0  0.747022  0.755720  0.008699
1  0.756482  0.740082  0.016400
2  0.680801  0.786156  0.105355

[-1.39035961  1.53043843  0.11109741  0.6621853  -2.29024619  2.34249774
  0.10030677 -3.52062389  2.57481444 -2.20749462 -1.86406784  1.03607796
 -3.48102887]
[Ridge(alpha=1.0)]
    test_r2  train_r2      diff
0  0.748283  0.755663  0.007380
1  0.756292  0.740039  0.016253
2  0.680991  0.786097  0.105106

[-1.23221033  1.29302258 -0.12737786  0.70280521 -1.80949922  2.48028701
 -0.00860666 -2.99831755  1.75466332 -1.51704375 -1.73434856  1.00368486
 -3.30809117]
[Ridge(alpha=10)]
    test_r2  train_r2      diff
0  0.753103  0.752474  0.000629
1  0.755100  0.737457  0.017643
2  0.677471  0.783225  0.105755

[-0.78141029  0.70910255 -0.46407849  0.72503917 -0.69294458  2.41757287
 -0.241

In [72]:
### Control (i.e., tune) the hyper-parameter that drives model performance
alpha_values=[0., 1.,  10, 100]

for value in alpha_values:
    # Create the model instance
    # max_iter=3 caps the solver iterations; alpha is the hyper-parameter here
    # (the output matches the previous cell, which suggests max_iter had no effect with the default solver)
    ridge_model = Ridge(alpha=value, max_iter=3)
    # Run training
    # - cv : 3 folds
    # - scoring : 'neg_mean_squared_error', 'r2'
    # - return_train_score : also report scores on the training folds
    result=cross_validate(ridge_model, X_train_scaled, y_train, cv=3, 
                        scoring=['neg_mean_squared_error', 'r2'],return_train_score=True,
                        return_estimator=True)

    resultDF=pd.DataFrame(result)[['test_r2', 'train_r2']]
    resultDF['diff'] = abs(resultDF['test_r2']-resultDF['train_r2'])
    # Index of the fold with the smallest train/test gap
    best_idx=resultDF['diff'].idxmin()

    print(result['estimator'][best_idx].coef_)
    print(f'[Ridge(alpha={value})]')
    print(resultDF, end='\n\n')

[-1.41407793  1.56590993  0.15536906  0.65522098 -2.36200159  2.31948624
  0.1173831  -3.59071105  2.71475429 -2.33252925 -1.88390034  1.04036915
 -3.50250877]
[Ridge(alpha=0.0)]
    test_r2  train_r2      diff
0  0.747022  0.755720  0.008699
1  0.756482  0.740082  0.016400
2  0.680801  0.786156  0.105355

[-1.39035961  1.53043843  0.11109741  0.6621853  -2.29024619  2.34249774
  0.10030677 -3.52062389  2.57481444 -2.20749462 -1.86406784  1.03607796
 -3.48102887]
[Ridge(alpha=1.0)]
    test_r2  train_r2      diff
0  0.748283  0.755663  0.007380
1  0.756292  0.740039  0.016253
2  0.680991  0.786097  0.105106

[-1.23221033  1.29302258 -0.12737786  0.70280521 -1.80949922  2.48028701
 -0.00860666 -2.99831755  1.75466332 -1.51704375 -1.73434856  1.00368486
 -3.30809117]
[Ridge(alpha=10)]
    test_r2  train_r2      diff
0  0.753103  0.752474  0.000629
1  0.755100  0.737457  0.017643
2  0.677471  0.783225  0.105755

[-0.78141029  0.70910255 -0.46407849  0.72503917 -0.69294458  2.41757287
 -0.241

In [73]:
resultDF

    test_r2  train_r2      diff
0  0.724036  0.708269  0.015767
1  0.725993  0.686628  0.039365
2  0.627335  0.744452  0.117117
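
Reading per-fold tables one alpha at a time gets harder as the number of candidates grows. One option, sketched here on synthetic data rather than on boston.csv, is to average the fold scores and collect one summary row per alpha. The names `rows` and `summaryDF` are made up for this example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training data (assumption)
X, y = make_regression(n_samples=379, n_features=13, noise=10.0, random_state=10)
X_scaled = StandardScaler().fit_transform(X)

rows = []
for alpha in [0.0, 1.0, 10, 100]:
    result = cross_validate(Ridge(alpha=alpha), X_scaled, y, cv=3,
                            scoring='r2', return_train_score=True)
    # With a single scoring string the keys are 'test_score' / 'train_score'
    rows.append({'alpha': alpha,
                 'mean_test_r2': result['test_score'].mean(),
                 'mean_train_r2': result['train_score'].mean()})

summaryDF = pd.DataFrame(rows)
summaryDF['gap'] = (summaryDF['mean_train_r2'] - summaryDF['mean_test_r2']).abs()
print(summaryDF)
```

A small gap alone does not make a candidate the best one; it is usually read together with the mean test score.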


In [74]:
### Control (i.e., tune) the hyper-parameter that drives model performance
alpha_values=[0., 1.,  10, 100]

for value in alpha_values:
    # Create the model instance
    ridge_model = Ridge(alpha=value)    # default is 1.0; alpha is the hyper-parameter here
    # Run training
    # - cv : 3 folds
    # - scoring : 'neg_mean_squared_error', 'r2'
    # - return_train_score : also report scores on the training folds
    result=cross_validate(ridge_model, X_train_scaled, y_train, cv=3, 
                        scoring=['neg_mean_squared_error', 'r2'],return_train_score=True,
                        return_estimator=True)

    resultDF=pd.DataFrame(result)[['test_r2', 'train_r2']]
    resultDF['diff'] = abs(resultDF['test_r2']-resultDF['train_r2'])
    # Index of the fold with the smallest train/test gap
    best_idx=resultDF['diff'].idxmin()

    print(result['estimator'][best_idx].coef_)
    print(f'[Ridge(alpha={value})]')
    print(resultDF, end='\n\n')

[-1.41407793  1.56590993  0.15536906  0.65522098 -2.36200159  2.31948624
  0.1173831  -3.59071105  2.71475429 -2.33252925 -1.88390034  1.04036915
 -3.50250877]
[Ridge(alpha=0.0)]
    test_r2  train_r2      diff
0  0.747022  0.755720  0.008699
1  0.756482  0.740082  0.016400
2  0.680801  0.786156  0.105355

[-1.39035961  1.53043843  0.11109741  0.6621853  -2.29024619  2.34249774
  0.10030677 -3.52062389  2.57481444 -2.20749462 -1.86406784  1.03607796
 -3.48102887]
[Ridge(alpha=1.0)]
    test_r2  train_r2      diff
0  0.748283  0.755663  0.007380
1  0.756292  0.740039  0.016253
2  0.680991  0.786097  0.105106

[-1.23221033  1.29302258 -0.12737786  0.70280521 -1.80949922  2.48028701
 -0.00860666 -2.99831755  1.75466332 -1.51704375 -1.73434856  1.00368486
 -3.30809117]
[Ridge(alpha=10)]
    test_r2  train_r2      diff
0  0.753103  0.752474  0.000629
1  0.755100  0.737457  0.017643
2  0.677471  0.783225  0.105755

[-0.78141029  0.70910255 -0.46407849  0.72503917 -0.69294458  2.41757287
 -0.241

In [None]:
# Lasso
### Control (i.e., tune) the hyper-parameter that drives model performance
alpha_values=[0., 1.,  10, 100]

for value in alpha_values:
    # Create the model instance
    # NOTE: scikit-learn advises against alpha=0 with Lasso (that case is plain least squares)
    lasso_model = Lasso(alpha=value)    # default is 1.0; alpha is the hyper-parameter here
    # Run training
    # - cv : 3 folds
    # - scoring : 'neg_mean_squared_error', 'r2'
    # - return_train_score : also report scores on the training folds
    result=cross_validate(lasso_model, X_train_scaled, y_train, cv=3, 
                        scoring=['neg_mean_squared_error', 'r2'],return_train_score=True,
                        return_estimator=True)

    resultDF=pd.DataFrame(result)[['test_r2', 'train_r2']]
    resultDF['diff'] = abs(resultDF['test_r2']-resultDF['train_r2'])
    # Index of the fold with the smallest train/test gap
    best_idx=resultDF['diff'].idxmin()

    print(result['estimator'][best_idx].coef_)
    print(f'[Lasso(alpha={value})]')
    print(resultDF, end='\n\n')
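
The practical difference between the two penalties shows up in the coefficients: Ridge shrinks them toward zero, while Lasso can set some of them exactly to zero. The sketch below illustrates this on synthetic data in which only 5 of 13 features carry signal (an assumption made for the example; this is not boston.csv), using positive alpha values only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 13 features, only 5 of which are informative
X, y = make_regression(n_samples=300, n_features=13, n_informative=5,
                       noise=5.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

for alpha in [0.1, 1.0, 10.0]:
    ridge = Ridge(alpha=alpha).fit(X_scaled, y)
    lasso = Lasso(alpha=alpha).fit(X_scaled, y)
    # Count coefficients that are exactly zero under each penalty
    ridge_zeros = int(np.sum(ridge.coef_ == 0))
    lasso_zeros = int(np.sum(lasso.coef_ == 0))
    print(f'alpha={alpha}: Ridge zero coefs={ridge_zeros}, Lasso zero coefs={lasso_zeros}')
```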

- Run hyper-parameter tuning and cross-validation at the same time

In [76]:
from sklearn.model_selection import GridSearchCV

In [78]:
# Set the hyper-parameter values for Ridge
params={'alpha':[0,0.1,0.5,1.0],
        'max_iter':[3, 5]}

# ==> 0., 3 => Model
# ==> 0., 5 => Model
# . . . . . .
# ==> 1.0, 5 => Model
# 8 models (parameter combinations) are created in total
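
The count of 8 is the product of the list lengths (4 alpha values × 2 max_iter values). As a quick check, scikit-learn's `ParameterGrid` can enumerate the combinations of a grid; the variable name `grid` is chosen just for this sketch.

```python
from sklearn.model_selection import ParameterGrid

params = {'alpha': [0, 0.1, 0.5, 1.0],
          'max_iter': [3, 5]}

# Expand the grid into a list of concrete parameter dictionaries
grid = list(ParameterGrid(params))
print(len(grid))          # 4 alpha values x 2 max_iter values
for combo in grid:
    print(combo)
```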

In [80]:
# Create the model instance
rModel=Ridge()

# Create the GridSearchCV instance
serchCV=GridSearchCV(rModel, params, cv=3, verbose=True, return_train_score=True)

In [81]:
# Run training
serchCV.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits
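
After fitting, the search object exposes the selected parameters, the per-candidate scores, and a model refit on the full training set. The sketch below shows how these could be inspected. It uses synthetic data and its own variable names (`search`, `cvDF`), so the numbers it prints are not the results for boston.csv.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), prepared the same way as in the notebook
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

params = {'alpha': [0, 0.1, 0.5, 1.0], 'max_iter': [3, 5]}
search = GridSearchCV(Ridge(), params, cv=3, return_train_score=True)
search.fit(X_train_scaled, y_train)

print(search.best_params_)      # best hyper-parameter combination
print(search.best_score_)       # mean cross-validated score (R2 for a regressor by default)

# One row per candidate combination
cvDF = pd.DataFrame(search.cv_results_)
print(cvDF[['params', 'mean_test_score', 'mean_train_score', 'rank_test_score']])

# best_estimator_ is refit on the whole training set (refit=True by default)
test_r2 = search.best_estimator_.score(X_test_scaled, y_test)
print(test_r2)
```

The held-out test set is used only once, at the very end, so it stays an honest estimate of how the selected model generalizes.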
