
# Regression Example: Boston House Prices

- Boston house prices
: A dataset published in 1978 that summarizes the factors affecting house prices in the Boston area of the United States

```
x
506 rows, 13 columns
CRIM     per capita crime rate by town
ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS    proportion of non-retail business acres per town
CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX      nitric oxides concentration (parts per 10 million)
RM       average number of rooms per dwelling
AGE      proportion of owner-occupied units built prior to 1940
DIS      weighted distances to five Boston employment centres
RAD      index of accessibility to radial highways
TAX      full-value property-tax rate per $10,000
PTRATIO  pupil-teacher ratio by town
B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT    % lower status of the population

y
506 rows, 1 column
target (MEDV)     Median value of owner-occupied homes in $1000's
```

![](https://wikidocs.net/images/page/49966/1.png)

## Data Exploration

In [None]:
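# Check only the shape (number of rows and columns) of the data.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run.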
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)
print(X.shape)
print(y.shape)

(506, 13)
(506,)


In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets

from sklearn import model_selection
from sklearn import metrics

dataset = datasets.load_boston()
print(dataset.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [None]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
print(df.head())

      CRIM    ZN  INDUS  CHAS    NOX  ...    TAX  PTRATIO       B  LSTAT  target
0  0.00632  18.0   2.31   0.0  0.538  ...  296.0     15.3  396.90   4.98    24.0
1  0.02731   0.0   7.07   0.0  0.469  ...  242.0     17.8  396.90   9.14    21.6
2  0.02729   0.0   7.07   0.0  0.469  ...  242.0     17.8  392.83   4.03    34.7
3  0.03237   0.0   2.18   0.0  0.458  ...  222.0     18.7  394.63   2.94    33.4
4  0.06905   0.0   2.18   0.0  0.458  ...  222.0     18.7  396.90   5.33    36.2

[5 rows x 14 columns]


In [None]:
print(df.tail())

        CRIM   ZN  INDUS  CHAS    NOX  ...    TAX  PTRATIO       B  LSTAT  target
501  0.06263  0.0  11.93   0.0  0.573  ...  273.0     21.0  391.99   9.67    22.4
502  0.04527  0.0  11.93   0.0  0.573  ...  273.0     21.0  396.90   9.08    20.6
503  0.06076  0.0  11.93   0.0  0.573  ...  273.0     21.0  396.90   5.64    23.9
504  0.10959  0.0  11.93   0.0  0.573  ...  273.0     21.0  393.45   6.48    22.0
505  0.04741  0.0  11.93   0.0  0.573  ...  273.0     21.0  396.90   7.88    11.9

[5 rows x 14 columns]


In [None]:
print(df.shape) 

(506, 14)


In [None]:
print(df.describe())

             CRIM          ZN       INDUS  ...           B       LSTAT      target
count  506.000000  506.000000  506.000000  ...  506.000000  506.000000  506.000000
mean     3.613524   11.363636   11.136779  ...  356.674032   12.653063   22.532806
std      8.601545   23.322453    6.860353  ...   91.294864    7.141062    9.197104
min      0.006320    0.000000    0.460000  ...    0.320000    1.730000    5.000000
25%      0.082045    0.000000    5.190000  ...  375.377500    6.950000   17.025000
50%      0.256510    0.000000    9.690000  ...  391.440000   11.360000   21.200000
75%      3.677083   12.500000   18.100000  ...  396.225000   16.955000   25.000000
max     88.976200  100.000000   27.740000  ...  396.900000   37.970000   50.000000

[8 rows x 14 columns]


In [None]:
print(df.iloc[:,-1].value_counts())

50.0    16
25.0     8
23.1     7
21.7     7
22.0     7
        ..
12.8     1
29.9     1
9.6      1
36.1     1
13.0     1
Name: target, Length: 229, dtype: int64


In [None]:
x_data = dataset.data
y_data = dataset.target

## Applying Ordinary Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

# Hold out 30% of the samples for testing.
# No random_state is set, so the split (and the scores below) change between runs.
x_train, x_test, y_train, y_test = model_selection.train_test_split(x_data, y_data, test_size=0.3)

estimator = LinearRegression()
estimator.fit(x_train, y_train)

# R^2 on the training set
y_predict = estimator.predict(x_train)
score = metrics.r2_score(y_train, y_predict)
print(score)

# R^2 on the held-out test set
y_predict = estimator.predict(x_test)
score = metrics.r2_score(y_test, y_predict)
print(score)

0.746394061242494
0.7132892329253933


## Applying Lasso Regression

In [None]:
from sklearn.linear_model import Lasso

estimatorL = Lasso()  # default regularization strength: alpha=1.0

estimatorL.fit(x_train, y_train)

y_predict = estimatorL.predict(x_train) 
score = metrics.r2_score(y_train, y_predict)
print(score) 

y_predict = estimatorL.predict(x_test) 
score = metrics.r2_score(y_test, y_predict)
print(score) 

0.6935819364431475
0.6923139858480638
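
`Lasso()` with no arguments uses scikit-learn's default `alpha=1.0`. The sketch below illustrates, on synthetic data (an assumption made so the example runs without the Boston dataset), how a larger `alpha` drives more coefficients to exactly zero. The names `X_demo`, `y_demo`, and `nonzero_counts` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first three of ten features influence y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# Count the non-zero coefficients for increasing regularization strength
nonzero_counts = {}
for alpha in [0.01, 1.0, 100.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(alpha, nonzero_counts[alpha])
```

With a very large `alpha` every coefficient is shrunk to zero and the model predicts a constant, so trying a few smaller values of `alpha` may be worthwhile when the default underfits.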


## Applying Ridge Regression

In [None]:
from sklearn.linear_model import Ridge
estimatorR = Ridge()  # default regularization strength: alpha=1.0
estimatorR.fit(x_train, y_train)

y_predict = estimatorR.predict(x_train)
score = metrics.r2_score(y_train, y_predict)
print(score)

y_predict = estimatorR.predict(x_test)
score = metrics.r2_score(y_test, y_predict)
print(score)

0.7442382992043273
0.7076981783420639
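
To make the L2 penalty concrete, the following sketch compares `Ridge` with the closed-form solution w = (XᵀX + αI)⁻¹Xᵀy computed on centered data, so that the intercept is not penalized. It uses synthetic data as an assumption, and names such as `coef_closed` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with a non-zero intercept
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

alpha = 1.0  # same value as the Ridge() default

# Center X and y so the intercept stays outside the penalty
X_c = X_demo - X_demo.mean(axis=0)
y_c = y_demo - y_demo.mean()

# Closed-form ridge solution on the centered data
n_features = X_c.shape[1]
coef_closed = np.linalg.solve(X_c.T @ X_c + alpha * np.eye(n_features), X_c.T @ y_c)
intercept_closed = y_demo.mean() - X_demo.mean(axis=0) @ coef_closed

model = Ridge(alpha=alpha).fit(X_demo, y_demo)
coef_matches = np.allclose(coef_closed, model.coef_)
print(coef_matches)
```

Unlike Lasso, this penalty shrinks coefficients toward zero without usually setting them exactly to zero.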


## Applying Elastic Net Regression

In [None]:
from sklearn.linear_model import ElasticNet
ER = ElasticNet(alpha=0.01, l1_ratio=0.01)
ER.fit(x_train, y_train)

y_predict = ER.predict(x_train)
score = metrics.r2_score(y_train, y_predict) 
print(score)

y_predict = ER.predict(x_test)
score = metrics.r2_score(y_test, y_predict)
print(score)

0.7409250501276832
0.7029004668605379
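
In scikit-learn's `ElasticNet`, `alpha` scales the whole penalty and `l1_ratio` splits it between the L1 and L2 terms. The sketch below writes that penalty as a hypothetical helper, `elastic_net_penalty`, and checks on synthetic data (an assumption) that `l1_ratio=1.0` reproduces `Lasso`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

def elastic_net_penalty(coef, alpha, l1_ratio):
    """Penalty term: alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||_2^2."""
    l1 = np.sum(np.abs(coef))
    l2 = np.sum(coef ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2

# With the notebook's settings the penalty is small and almost entirely L2
penalty_example = elastic_net_penalty(np.array([1.0, -2.0]), alpha=0.01, l1_ratio=0.01)
print(penalty_example)

# Synthetic data for the comparison
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

# l1_ratio=1.0 removes the L2 term, which leaves the Lasso objective
enet_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
same_as_lasso = np.allclose(enet_as_lasso.coef_, lasso.coef_)
print(same_as_lasso)
```

Because the cell above uses `alpha=0.01` and `l1_ratio=0.01`, the regularization is weak and mostly L2, which would be consistent with scores close to the Ridge and plain linear regression results.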


# Code Examples from the PDF

In [None]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

print("Available keys", iris.keys())

iris_data = iris.data
iris_label = iris.target

# Note that the built-in dataset is stored as an ndarray rather than a DataFrame,
# and target, target_names, and feature_names are separate keys of iris
print("data type", type(iris_data))
print("target values", iris_label)
print("target names", iris.target_names)

# Building a DataFrame therefore takes an extra step
iris_df = pd.DataFrame(iris_data, columns=iris.feature_names)
iris_df['label'] = iris.target
iris_df.head()

Available keys dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
data type <class 'numpy.ndarray'>
target values [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
target names ['setosa' 'versicolor' 'virginica']


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  label
0                5.1               3.5                1.4               0.2      0
1                4.9               3.0                1.4               0.2      0
2                4.7               3.2                1.3               0.2      0
3                4.6               3.1                1.5               0.2      0
4                5.0               3.6                1.4               0.2      0


In [None]:
from sklearn.preprocessing import LabelEncoder
items=['트와이스','BTS','레드벨벳','신화','GOD','GOD']
# Create a LabelEncoder object, then perform label encoding with fit() and transform().
encoder = LabelEncoder()
encoder.fit(items)
labels = encoder.transform(items)
print('Encoded values:', labels)

Encoded values: [4 0 2 3 1 1]
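
The integers above follow the sorted order of the distinct strings. As a small sketch, `classes_` shows that ordering and `inverse_transform` maps the codes back to the original items. The variable `decoded` is just an illustrative name.

```python
from sklearn.preprocessing import LabelEncoder

items = ['트와이스', 'BTS', '레드벨벳', '신화', 'GOD', 'GOD']

encoder = LabelEncoder()
labels = encoder.fit_transform(items)  # fit() and transform() in one call

# classes_[i] is the original value that was assigned the integer i
print(list(encoder.classes_))

# inverse_transform() recovers the original strings from the integer codes
decoded = list(encoder.inverse_transform(labels))
print(decoded)
```

Since these integers carry an arbitrary ordering, feeding them directly into a linear model can suggest a ranking that does not exist, which is one motivation for one-hot encoding.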


In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
items=['트와이스','BTS','레드벨벳','신화','GOD','GOD']
# First convert the strings to integer labels with LabelEncoder.
encoder = LabelEncoder()
encoder.fit(items)
labels = encoder.transform(items)
# Reshape into a 2-D column, as OneHotEncoder expects.
labels = labels.reshape(-1,1)
# Apply one-hot encoding.
oh_encoder = OneHotEncoder()
oh_encoder.fit(labels)
oh_labels = oh_encoder.transform(labels)
print('One-hot encoded data')
print(oh_labels.toarray())
print('Shape of the one-hot encoded data')
print(oh_labels.shape)

One-hot encoded data
[[0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0.]]
Shape of the one-hot encoded data
(6, 5)
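
The intermediate LabelEncoder step may not be needed: assuming scikit-learn 0.20 or later, `OneHotEncoder` accepts string categories directly. A minimal sketch follows, in which `items_2d` and `oh_direct` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

items = ['트와이스', 'BTS', '레드벨벳', '신화', 'GOD', 'GOD']

# OneHotEncoder expects 2-D input: one row per sample, one column per feature
items_2d = np.array(items).reshape(-1, 1)

oh_encoder = OneHotEncoder()
oh_direct = oh_encoder.fit_transform(items_2d).toarray()  # sparse matrix to dense ndarray

# categories_ lists the sorted category values behind each output column
print(oh_encoder.categories_)
print(oh_direct)
```

The resulting columns follow the same sorted category order as in the two-step version.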


In [None]:
import pandas as pd
df = pd.DataFrame({'item':['트와이스','BTS','레드벨벳','신화','GOD','GOD'] })
df
# pd.get_dummies(df)  # run one-hot encoding

   item
0  트와이스
1   BTS
2  레드벨벳
3    신화
4   GOD
5   GOD
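
For reference, here is a sketch of what the commented-out `pd.get_dummies` call would produce. The `dtype=int` argument is an addition here so that the result is 0/1 integers regardless of pandas version, and `df_items` is an illustrative name chosen to avoid overwriting `df`.

```python
import pandas as pd

df_items = pd.DataFrame({'item': ['트와이스', 'BTS', '레드벨벳', '신화', 'GOD', 'GOD']})

# One indicator column per distinct value, named "<column>_<value>"
dummies = pd.get_dummies(df_items, dtype=int)
print(dummies)
```

This covers the same ground as LabelEncoder followed by OneHotEncoder in a single pandas call, and it returns a DataFrame with labeled columns.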


In [None]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import pandas as pd

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

print('=======BEFORE=======')
print('[mean]\n', iris_df.mean())
print('\n[var]\n', iris_df.var())

scaler = StandardScaler()
scaler.fit(iris_df)
scaled = scaler.transform(iris_df)
# transform() returns an ndarray, so convert it back to a DataFrame.
scaled_iris = pd.DataFrame(scaled, columns=iris.feature_names)

print('\n=======AFTER=======')
print('[mean]\n', scaled_iris.mean())
print('\n[var]\n', scaled_iris.var())

=======BEFORE=======
[mean]
 sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
dtype: float64

[var]
 sepal length (cm)    0.685694
sepal width (cm)     0.189979
petal length (cm)    3.116278
petal width (cm)     0.581006
dtype: float64

=======AFTER=======
[mean]
 sepal length (cm)   -1.690315e-15
sepal width (cm)    -1.842970e-15
petal length (cm)   -1.698641e-15
petal width (cm)    -1.409243e-15
dtype: float64

[var]
 sepal length (cm)    1.006711
sepal width (cm)     1.006711
petal length (cm)    1.006711
petal width (cm)     1.006711
dtype: float64
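
The variance after scaling prints as 1.006711 rather than 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `var()` defaults to the sample variance (`ddof=1`). A short sketch of the difference, where `var_population` and `var_sample` are illustrative names:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_iris = load_iris().data
scaled = StandardScaler().fit_transform(X_iris)

n = X_iris.shape[0]  # 150 samples

# Population variance (ddof=0): the quantity StandardScaler normalizes to 1
var_population = scaled.var(axis=0, ddof=0)
# Sample variance (ddof=1): what pandas DataFrame.var() reports by default
var_sample = scaled.var(axis=0, ddof=1)

print(var_population)
print(var_sample, n / (n - 1))
```

The ratio n / (n - 1) = 150 / 149 accounts for the 1.006711 shown above.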


In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(iris_df)
iris_scaled = scaler.transform(iris_df)

# Again, transform() returns an ndarray, so convert it back to a DataFrame.
df_scaled = pd.DataFrame(iris_scaled, columns=iris.feature_names)
print('[min]')
print(df_scaled.min())
print('\n[max]')
print(df_scaled.max())

[min]
sepal length (cm)    0.0
sepal width (cm)     0.0
petal length (cm)    0.0
petal width (cm)     0.0
dtype: float64

[max]
sepal length (cm)    1.0
sepal width (cm)     1.0
petal length (cm)    1.0
petal width (cm)     1.0
dtype: float64
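
With its default `feature_range=(0, 1)`, `MinMaxScaler` applies (x - min) / (max - min) to each column. The sketch below reproduces that by hand with NumPy. `manual_scaled` and `sk_scaled` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X_iris = load_iris().data

# Per-column minimum and maximum
col_min = X_iris.min(axis=0)
col_max = X_iris.max(axis=0)

# Manual min-max scaling to the [0, 1] range
manual_scaled = (X_iris - col_min) / (col_max - col_min)

# Same transformation via scikit-learn
sk_scaled = MinMaxScaler().fit_transform(X_iris)

minmax_matches = np.allclose(manual_scaled, sk_scaled)
print(minmax_matches)
```

Because the minimum and maximum are learned in `fit()`, applying the fitted scaler to new data can yield values outside [0, 1] when that data falls outside the training range.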

