### 9️⃣ Linear Regression with sklearn
#### Example : Boston House Price Dataset
- X variables: 13
- Y variable: 1

#### 1. Loading the Data

In [2]:
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
import numpy as np
import warnings
# Suppress warnings such as the deprecation warning raised by load_boston
warnings.filterwarnings(action='ignore')

In [3]:
boston = load_boston()

- Inspect the dataset description

In [4]:
boston['DESCR']

".. _boston_dataset:\n\nBoston house prices dataset\n---------------------------\n\n**Data Set Characteristics:**  \n\n    :Number of Instances: 506 \n\n    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n\n    :Attribute Information (in order):\n        - CRIM     per capita crime rate by town\n        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.\n        - INDUS    proportion of non-retail business acres per town\n        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n        - NOX      nitric oxides concentration (parts per 10 million)\n        - RM       average number of rooms per dwelling\n        - AGE      proportion of owner-occupied units built prior to 1940\n        - DIS      weighted distances to five Boston employment centres\n        - RAD      index of accessibility to radial highways\n        - TAX      full-value property-tax rate per $10,000

In [5]:
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename', 'data_module'])

In [6]:
boston['feature_names']

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [7]:
x_data = boston.data
# reshape the target into a two-dimensional column vector
y_data = boston.target.reshape(boston.target.size, 1)

- Samples: 506
- Columns (features): 13

In [8]:
x_data.shape

(506, 13)

In [9]:
y_data[:3]

array([[24. ],
       [21.6],
       [34.7]])
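The reshape step above turns the 1-D target into a column so that it has one row per sample. A minimal sketch of the same idea on a tiny array follows; the array `target` is a hypothetical stand-in for `boston.target`, and `reshape(-1, 1)` is shown only as an equivalent shorthand, not as what the notebook uses.

```python
import numpy as np

# Hypothetical 1-D target vector standing in for boston.target
target = np.array([24.0, 21.6, 34.7])

# -1 lets NumPy infer the number of rows from the array size
column = target.reshape(-1, 1)

# Same result as the explicit form used in the notebook cell
explicit = target.reshape(target.size, 1)

print(column.shape)
```

Both forms yield one row per sample and a single column.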

#### 2. Data Scaling
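- preprocessing.MinMaxScaler(feature_range = (0,5))  
$\Rightarrow$ scales each feature to the range 0 to 5 (the default range, used in the cell below, is 0 to 1)

- Standard scaler (zero mean, unit variance per feature)  
preprocessing.StandardScaler().fit(x_data)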
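To make the two scaling options concrete, here is a minimal NumPy sketch of the arithmetic they perform per column. The helper names `min_max_scale` and `standardize` and the toy matrix `X` are hypothetical, introduced only for illustration; the sketch assumes no column is constant (otherwise it would divide by zero) and is not scikit-learn's actual implementation.

```python
import numpy as np

# Hypothetical toy feature matrix (3 samples, 2 features), not the Boston data
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 40.0]])

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Rescale each column linearly so its min and max map to feature_range."""
    lo, hi = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    unit = (X - col_min) / (col_max - col_min)  # each column now lies in [0, 1]
    return unit * (hi - lo) + lo

def standardize(X):
    """Shift each column to zero mean and scale it to unit (population) variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X_0_5 = min_max_scale(X, feature_range=(0.0, 5.0))
X_std = standardize(X)

print(X_0_5)
print(X_std)
```

Min-max scaling preserves the shape of each feature's distribution within fixed bounds, while standardization centers it; which one suits a model better depends on the data.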


In [10]:
from sklearn import preprocessing

# fit the scaler on x_data, then transform it
minmax_scale = preprocessing.MinMaxScaler().fit(x_data)
x_scaled_data = minmax_scale.transform(x_data)

x_scaled_data[:3]

array([[0.00000000e+00, 1.80000000e-01, 6.78152493e-02, 0.00000000e+00,
        3.14814815e-01, 5.77505269e-01, 6.41606591e-01, 2.69203139e-01,
        0.00000000e+00, 2.08015267e-01, 2.87234043e-01, 1.00000000e+00,
        8.96799117e-02],
       [2.35922539e-04, 0.00000000e+00, 2.42302053e-01, 0.00000000e+00,
        1.72839506e-01, 5.47997701e-01, 7.82698249e-01, 3.48961980e-01,
        4.34782609e-02, 1.04961832e-01, 5.53191489e-01, 1.00000000e+00,
        2.04470199e-01],
       [2.35697744e-04, 0.00000000e+00, 2.42302053e-01, 0.00000000e+00,
        1.72839506e-01, 6.94385898e-01, 5.99382080e-01, 3.48961980e-01,
        4.34782609e-02, 1.04961832e-01, 5.53191489e-01, 9.89737254e-01,
        6.34657837e-02]])

#### 3. Train-Test Split 

In [11]:
from sklearn.model_selection import train_test_split

# roughly 2/3 train data, 1/3 test data
# unpack the four returned arrays
X_train, X_test, y_train, y_test = train_test_split(x_scaled_data, y_data, test_size = 0.33)

In [12]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape 

((339, 13), (167, 13), (339, 1), (167, 1))
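Because the split above is random, the coefficients and metrics change from run to run. A minimal sketch of a reproducible split follows, assuming synthetic arrays with the same shapes as the Boston data; `X_demo`, `y_demo`, and the seed value 42 are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins with the same shapes as x_scaled_data and y_data
X_demo = rng.random((506, 13))
y_demo = rng.random((506, 1))

# Fixing random_state makes the shuffle, and therefore the split, repeatable
split_a = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

print(X_train_a.shape, X_test_a.shape)
```

With the same seed, both calls return identical train and test sets, which makes later metric values comparable across runs.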

#### 4. Linear Regression Fitting
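- fit_intercept : whether to include the intercept ($w_0$) in the model
- copy_X : whether to copy X before fitting (if not copied, X may be overwritten)
- n_jobs : number of parallel jobs (e.g. 4 or 8); this can speed up sufficiently large problems, but has little effect on a small single-target dataset like this one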
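As an illustration of what the intercept option changes, the sketch below fits ordinary least squares with and without an intercept using plain NumPy. It is one equivalent way to view the computation, not scikit-learn's internal code; the data, `true_w`, and `true_w0` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noiseless data generated as y = 2*x1 - 3*x2 + 5
X = rng.normal(size=(50, 2))
true_w = np.array([2.0, -3.0])
true_w0 = 5.0
y = X @ true_w + true_w0

# With an intercept: append a column of ones so w_0 is learned as an extra weight
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, w0 = w_aug[:-1], w_aug[-1]

# Without an intercept: the fitted plane is forced through the origin
w_no_intercept, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w, w0)
print(w_no_intercept)
```

With the intercept, the fit recovers the generating weights and offset; without it, the model cannot represent the constant offset and the fit degrades.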

In [13]:
from sklearn import linear_model

regr = linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs = 8)
regr.fit(X_train, y_train)

LinearRegression(n_jobs=8, normalize=False)

- The Coefficients, Intercept

In [14]:
print('Coefficients: ', regr.coef_)
print('intercept: ', regr.intercept_)

Coefficients:  [[ -9.0031608    4.07665394  -0.29507116   2.74706582  -6.92598406
   23.29871383  -1.10301825 -14.81541982   5.78156293  -5.71828826
   -8.26270772   4.00854233 -17.06439837]]
intercept:  [23.89830413]


#### 5. Comparing with the Formula

In [15]:
regr.predict(x_data[:5])

array([[-193.75087671],
       [ -77.0157642 ],
       [  31.30266657],
       [ 167.59742939],
       [ 129.78882745]])

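- Same result as above, computed directly as $Xw^\top + w_0$. Note that x_data here is the unscaled input, while the model was fit on x_scaled_data, so these values are not meaningful price predictions; passing x_scaled_data[:5] instead would give predictions on the target's scale.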

In [16]:
x_data[:5].dot(regr.coef_.T) + regr.intercept_

array([[-193.75087671],
       [ -77.0157642 ],
       [  31.30266657],
       [ 167.59742939],
       [ 129.78882745]])
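Written out, the formula being compared is the linear model below. The notation is assumed here for illustration: $X$ denotes the $n \times 13$ input matrix, $w$ the $1 \times 13$ row stored in regr.coef_, and $w_0$ the value in regr.intercept_.

$$\hat{y} = X w^{\top} + w_0$$

The product $X w^{\top}$ has shape $n \times 1$, and $w_0$ is broadcast to every row, which is why the manual computation matches the shape of the predict output.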

#### 6. Measuring Metrics

In [17]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

In [18]:
y_true = y_test
y_hat = regr.predict(X_test)

- R squared : the closer to 1, the better

In [19]:
r2_score(y_true, y_hat)

0.6527741564411681

- MAE (mean absolute error) : the closer to 0, the better

In [20]:
mean_absolute_error(y_true, y_hat)

3.5983676701570353

- MSE (mean squared error) : the closer to 0, the better

In [21]:
mean_squared_error(y_true, y_hat)

28.982870817776814
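For reference, the three metrics above can be computed by hand from their definitions. The sketch below uses small hypothetical arrays rather than the Boston test set, so its numbers are unrelated to the results shown in this section.

```python
import numpy as np

# Hypothetical toy values, not the Boston test set
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residual = y_true - y_pred

mae = np.mean(np.abs(residual))                  # mean absolute error
mse = np.mean(residual ** 2)                     # mean squared error
ss_res = np.sum(residual ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(mae, mse, r2)
```

MAE and MSE are expressed in units of the target (and its square), while R squared is unitless and can even be negative when predictions are worse than always predicting the mean.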