
## 13.1 Fitting a Line

Problem: You want to train a model that represents a linear relationship between the features and the target vector.

Solution: Use linear regression (`LinearRegression`).

In [1]:
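from sklearn.linear_model import LinearRegression
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run.
from sklearn.datasets import load_boston

boston = load_boston()
x = boston.data[:, :2]  # use only the first two features (CRIM, ZN)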
print('feature shape:', x.shape)
print('feature sample:', x[:3])
y = boston.target
print('target shape:', y.shape)
print('target sample:', y[:3])
print('feature name:', boston.feature_names)


lr = LinearRegression()
model = lr.fit(x, y)

feature shape: (506, 2)
feature sample: [[6.320e-03 1.800e+01]
 [2.731e-02 0.000e+00]
 [2.729e-02 0.000e+00]]
target shape: (506,)
target sample: [24.  21.6 34.7]
feature name: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']




In [2]:
# The intercept is also called the bias
print('Intercept(=bias) :', model.intercept_)
print('coef :', model.coef_)
# dir(model)

print('\nActual value:', y[0] * 1000)  # Boston prices are in thousands of dollars, so multiply by 1000
print('Predicted value:', model.predict(x)[0] * 1000)

Intercept(=bias) : 22.485628113468223
coef : [-0.35207832  0.11610909]

Actual value: 24000.0
Predicted value: 24573.366631705547
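To see where the predicted value comes from, the prediction for the first row can be recomputed by hand as the intercept plus the dot product of the coefficients and the feature values. This is only a minimal sketch: the numbers are copied from the printed output above (the coefficients are rounded to 8 decimals there), and the variable names `intercept`, `coef`, `x0`, and `manual_prediction` are introduced here for illustration.

```python
import numpy as np

# Values copied from the printed output above; coefficients are rounded to 8 decimals.
intercept = 22.485628113468223
coef = np.array([-0.35207832, 0.11610909])

# First observation from the feature sample: CRIM = 0.00632, ZN = 18.0
x0 = np.array([6.320e-03, 1.800e+01])

# A linear regression prediction is intercept + sum(coef_j * x_j).
manual_prediction = intercept + coef @ x0

# Scale to dollars, as in the cell above (prices are in thousands of dollars).
print('manual prediction (USD):', manual_prediction * 1000)

# Each coefficient is the change in the prediction for a one-unit change in that feature.
print('change per unit of CRIM (USD):', coef[0] * 1000)
```

Because the coefficients are rounded, the result agrees with `model.predict(x)[0] * 1000` only up to a small rounding difference.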


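## 13.2 Handling Interaction Effects

Problem: You have a feature whose effect on the target variable depends on another feature.

Solution: Use scikit-learn's `PolynomialFeatures` class to create interaction terms that capture this dependence.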
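Written out for two features, a linear model with an interaction term can be sketched as follows. The notation ($\hat{\beta}_j$ for the fitted coefficients, $x_1$ and $x_2$ for the features) is an assumption introduced here and does not appear in the original notebook.

$$
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2
$$

With the extra term, the change in $\hat{y}$ per unit of $x_1$ is $\hat{\beta}_1 + \hat{\beta}_3 x_2$, so the effect of $x_1$ depends on the value of $x_2$.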

In [12]:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
from sklearn.preprocessing import PolynomialFeatures

import warnings
warnings.filterwarnings(action='ignore')

boston = load_boston()
x = boston.data[:, :2]
y = boston.target
feature_names = boston.feature_names
# dir(boston)
print('first 2 feature names:', feature_names[:2])

# Create the interaction terms
# degree: maximum number of features multiplied together in one term
# include_bias: by default a column of 1s (the bias, i.e. intercept) is added; False turns that off
# interaction_only: if True, only products of distinct features are added (no powers such as x1**2); the original features are kept
interaction = PolynomialFeatures(degree=3, include_bias=False, interaction_only=True)

features_interaction = interaction.fit_transform(x)
print('feature interaction sample:', features_interaction[:3])

lr = LinearRegression()
model = lr.fit(features_interaction, y)


first 2 feature names: ['CRIM' 'ZN']
feature interaction sample: [[6.3200e-03 1.8000e+01 1.1376e-01]
 [2.7310e-02 0.0000e+00 0.0000e+00]
 [2.7290e-02 0.0000e+00 0.0000e+00]]
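The third column in the sample output is simply the product of the first two. The sketch below checks this on the three sample rows printed above, without loading the dataset. The names `x_sample`, `transformed`, and `manual` are introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# The three sample rows printed above (CRIM, ZN), typed in by hand.
x_sample = np.array([[6.320e-03, 1.800e+01],
                     [2.731e-02, 0.000e+00],
                     [2.729e-02, 0.000e+00]])

# Same settings as in the cell above.
interaction = PolynomialFeatures(degree=3, include_bias=False, interaction_only=True)
transformed = interaction.fit_transform(x_sample)

# Build the same matrix by hand: the two original columns plus their product.
manual = np.column_stack([x_sample, x_sample[:, 0] * x_sample[:, 1]])

print('shape:', transformed.shape)
print('matches manual product:', np.allclose(transformed, manual))
```

With only two input features there is a single possible interaction term, so `degree=3` yields the same three columns that `degree=2` would.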
