# Multiple Linear Regression Analysis
* Loading the data (Boston housing price dataset)

In [21]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [22]:
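# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version
boston = datasets.load_boston()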
print(boston.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename', 'data_module'])



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [23]:
X_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])  # input data (13 features)
y_df = pd.DataFrame(boston['target'], columns=['Target'])  # output data (house price)

In [24]:
X_df

index,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48


In [25]:
y_df

index,Target
0,24.0
1,21.6
2,34.7
3,33.4
4,36.2
...,...
501,22.4
502,20.6
503,23.9
504,22.0


* Splitting into training and test data (80% training, 20% test)
  * Training data: the data used to fit the multiple linear regression model
  * Test data: the data used to evaluate the fitted model

In [26]:
# No random_state is set, so the split (and every number below) changes from run to run
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, shuffle=True)
print('Shapes of training X and y: ', X_train.shape, y_train.shape)
print('Shapes of test X and y: ', X_test.shape, y_test.shape)

Shapes of training X and y:  (404, 13) (404, 1)
Shapes of test X and y:  (102, 13) (102, 1)


The rows of X_train and y_train stay in the same order: train_test_split applies one shuffled row order to both.
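A minimal sketch of this alignment, using a small made-up DataFrame rather than the Boston data. The column names `x1`, `x2`, the toy target, and `random_state=0` are assumptions chosen for illustration. Because `train_test_split` shuffles one set of row positions and applies it to every array it receives, the index labels of X and y match after the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows, 2 features, target = x1 + x2
X_toy = pd.DataFrame({"x1": np.arange(10.0), "x2": np.arange(10.0) * 2})
y_toy = pd.DataFrame({"Target": X_toy["x1"] + X_toy["x2"]})

# Fixing random_state makes the shuffle reproducible across runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=True, random_state=0
)

# The same shuffled row order is applied to X and y
print(list(X_tr.index) == list(y_tr.index))
print(X_tr.shape, X_te.shape)
```

Passing a fixed `random_state` in the same way to the split on the Boston data would make the reported coefficients and MSE values repeatable.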

* Fitting the multiple linear regression model to the training data

In [27]:
model = LinearRegression()
model.fit(X_train, y_train)

# y_train is a one-column DataFrame, so coef_ has shape (1, 13); take row 0
print("Regression coefficients (w):", [np.round(x, 3) for x in model.coef_[0]])
print("Intercept (b):", np.round(model.intercept_, 3))

Regression coefficients (w): [-0.111, 0.054, 0.017, 2.329, -17.604, 3.492, -0.0, -1.558, 0.278, -0.011, -1.066, 0.009, -0.528]
Intercept (b): [40.857]
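The coefficients w and the intercept b define the prediction y_hat = X w + b, and `LinearRegression` estimates them by ordinary least squares. The sketch below reproduces that idea with `np.linalg.lstsq` on synthetic data. The true coefficients, sample size, noise level, and seed are made up for illustration, and the code is a sketch of the idea rather than scikit-learn's own implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 samples, 3 features, known w and b
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
X = rng.normal(size=(200, 3))
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is estimated with the weights
X_aug = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = solution[:-1], solution[-1]

# Predictions are the linear combination X w + b
y_pred = X @ w_hat + b_hat

print("estimated w:", np.round(w_hat, 3))
print("estimated b:", np.round(b_hat, 3))
```

With low noise the estimates land close to the values used to generate the data, which is what the fitted `coef_` and `intercept_` represent for the housing data.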


* Evaluating the model with mean squared error (MSE)

In [28]:
y_train_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_train_pred)
print("Training MSE: %.3f" % mse)

Training MSE: 23.248
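For reference, the MSE is the average of the squared differences between observed and predicted values: MSE = (1/n) * sum((y_i - y_hat_i)^2). The hand computation below uses the first four target values from the table above, paired with made-up predictions (the predictions are an assumption for illustration, not model output).

```python
import numpy as np

# First four observed targets, with hypothetical predictions
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 32.7, 33.4])

# Square each residual, then average
residuals = y_true - y_pred
mse_manual = np.mean(residuals ** 2)
print("MSE: %.3f" % mse_manual)
```

This is the quantity `mean_squared_error` computes with its default settings. Its unit is the square of the target's unit, so taking the square root (RMSE) gives an error on the same scale as the house prices.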


In [29]:
y_test_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
print("Test MSE: %.3f" % mse)

Test MSE: 17.203


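If the MSE is small on the training data but large on the test data, the model may be overfitting. In this run the test MSE (17.203) is lower than the training MSE (23.248), so there is no sign of overfitting; because the split is random, both values vary from run to run.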
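To illustrate the train/test gap that signals overfitting, the sketch below fits least squares on synthetic data with almost as many features as training rows, a setting where a linear model tends to overfit. All sizes, the noise level, the seed, and the helper names `fit_ols` and `mse_of` are assumptions chosen for illustration; nothing here uses the Boston data.

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_ols(X, y):
    """Least-squares weights for [X, 1]; the last entry is the intercept."""
    X_aug = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef

def mse_of(X, y, coef):
    """Mean squared error of the linear model on (X, y)."""
    pred = X @ coef[:-1] + coef[-1]
    return float(np.mean((y - pred) ** 2))

# Hypothetical setup: 25 features, only the first one matters, noise std 1
n_fit, n_new, n_features = 30, 200, 25
true_w = np.zeros(n_features)
true_w[0] = 3.0

X_fit = rng.normal(size=(n_fit, n_features))
X_new = rng.normal(size=(n_new, n_features))
y_fit = X_fit @ true_w + rng.normal(size=n_fit)
y_new = X_new @ true_w + rng.normal(size=n_new)

# Fit on the small sample, then evaluate on both samples
coef = fit_ols(X_fit, y_fit)
train_mse = mse_of(X_fit, y_fit, coef)
test_mse = mse_of(X_new, y_new, coef)

print("train MSE: %.3f" % train_mse)
print("test MSE:  %.3f" % test_mse)
```

A training error well below the test error is the pattern to watch for.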