---
<div class="alert alert-success" data-title="">
  <h2><i class="fa fa-tasks" aria-hidden="true"></i> Predicting Boston House Prices with scikit-learn
  </h2>
</div>

<img src = "https://cdn10.bostonmagazine.com/wp-content/uploads/sites/2/2018/05/boston-rent.jpg" width = "700" >


The dataset for this exercise contains housing prices in the Boston area.

- Factors that influence housing prices (X, features)
- Housing prices (Y, target)


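## Variable descriptions

1) Target (Y) data
* Target: median value of owner-occupied homes, in thousands of dollars (from the dataset published in 1978)

2) Feature (X) data
* CRIM: per capita crime rate by town
* INDUS: proportion of non-retail business acres per town
* NOX: nitric oxides concentration (parts per 10 million)
* RM: average number of rooms per dwelling
* LSTAT: percentage of the population classed as lower status
* B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
* PTRATIO: pupil-teacher ratio by town
* ZN: proportion of residential land zoned for lots over 25,000 square feet
* CHAS: 1 if the tract bounds the Charles River, 0 otherwise
* AGE: proportion of owner-occupied units built before 1940
* RAD: index of accessibility to radial highways
* DIS: weighted distance to five Boston employment centres
* TAX: full-value property tax rate per $10,000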

---

## Exploring the data

In [1]:
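# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release.
from sklearn.datasets import load_boston
boston = load_boston()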

In [2]:
import pandas as pd
boston_df = pd.DataFrame(boston.data, 
                         columns=boston.feature_names, 
                         index=range(1,len(boston.data)+1))
boston_df['PRICE'] = boston.target
boston_df.head()

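| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 2 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 3 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 4 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 5 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |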
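The same construction can be tried without the dataset loader. The following is a minimal sketch, assuming the first three rows of the output above are typed in by hand; `feature_names`, `sample_rows`, `sample_target`, and `sample_df` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# First three rows of the output above, copied by hand (illustrative only).
sample_rows = np.array([
    [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.0900, 1.0, 296.0, 15.3, 396.90, 4.98],
    [0.02731,  0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8, 396.90, 9.14],
    [0.02729,  0.0, 7.07, 0.0, 0.469, 7.185, 61.1, 4.9671, 2.0, 242.0, 17.8, 392.83, 4.03],
])
sample_target = np.array([24.0, 21.6, 34.7])

# Same pattern as the notebook cell: feature columns plus a 1-based row index.
sample_df = pd.DataFrame(sample_rows,
                         columns=feature_names,
                         index=range(1, len(sample_rows) + 1))
sample_df["PRICE"] = sample_target  # append the target as the last column

print(sample_df.shape)
print(sample_df[["LSTAT", "PRICE"]])
```

Passing `index=range(1, n + 1)` is what makes the row labels start at 1 instead of the pandas default of 0.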


## Checking the relationship with statsmodels


In [4]:
# Feature data
LSTAT = boston_df[['LSTAT']]
# Target data
y_train = boston_df[['PRICE']]

In [5]:
import statsmodels.api as sm
X_train = sm.add_constant(LSTAT, has_constant = 'add')
X_train.head()

| | const | LSTAT |
|---|---|---|
| 1 | 1.0 | 4.98 |
| 2 | 1.0 | 9.14 |
| 3 | 1.0 | 4.03 |
| 4 | 1.0 | 2.94 |
| 5 | 1.0 | 5.33 |


In [6]:
# Pass the target and the feature to the linear regression (OLS) model
single_model = sm.OLS(y_train, X_train)

# Fit the model
fitted_model = single_model.fit()    # the fitted model
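What `add_constant` and the OLS fit do can be sketched with NumPy alone. This is a minimal illustration, assuming only the five LSTAT and PRICE values shown in the earlier output; with so few rows the coefficients will not match the full-data fit, and the names `lstat`, `price`, and `design` are hypothetical.

```python
import numpy as np

# LSTAT and PRICE for the first five rows shown earlier (illustrative subset only).
lstat = np.array([4.98, 9.14, 4.03, 2.94, 5.33])
price = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

# Equivalent of add_constant: put a column of ones in front for the intercept.
design = np.column_stack([np.ones_like(lstat), lstat])

# Ordinary least squares: choose beta to minimise ||design @ beta - price||^2.
beta, *_ = np.linalg.lstsq(design, price, rcond=None)
intercept, slope = beta

# At the least-squares solution the residuals are orthogonal to every column.
residuals = price - design @ beta
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
print(design.T @ residuals)  # both entries are numerically zero
```

Without the column of ones the fitted line would be forced through the origin, which is why the constant is added before calling `sm.OLS`.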

In [7]:
# Inspect the fit results
fitted_model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | PRICE | R-squared: | 0.544 |
| Model: | OLS | Adj. R-squared: | 0.543 |
| Method: | Least Squares | F-statistic: | 601.6 |
| Date: | Mon, 10 Aug 2020 | Prob (F-statistic): | 5.08e-88 |
| Time: | 10:19:46 | Log-Likelihood: | -1641.5 |
| No. Observations: | 506 | AIC: | 3287.0 |
| Df Residuals: | 504 | BIC: | 3295.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 34.5538 | 0.563 | 61.415 | 0.000 | 33.448 | 35.659 |
| LSTAT | -0.9500 | 0.039 | -24.528 | 0.000 | -1.026 | -0.874 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 137.043 | Durbin-Watson: | 0.892 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 291.373 |
| Skew: | 1.453 | Prob(JB): | 5.36e-64 |
| Kurtosis: | 5.319 | Cond. No. | 29.7 |


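## Interpreting the results

- R-squared is 0.544, so LSTAT alone explains about 54% of the variance in PRICE.
- The p-value for LSTAT (the P>|t| column) is reported as 0.000, which is below 0.05, so the coefficient is statistically significant.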
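The fitted coefficients also give a prediction equation, PRICE = 34.5538 - 0.9500 × LSTAT. The sketch below simply plugs the two numbers from the summary table into that line; `predict_price` is a hypothetical helper written for illustration, not a statsmodels function.

```python
# Coefficients copied from the summary table above.
CONST = 34.5538
LSTAT_COEF = -0.9500


def predict_price(lstat_percent: float) -> float:
    """Predicted PRICE for a given LSTAT value under the fitted line."""
    return CONST + LSTAT_COEF * lstat_percent


for lstat in (5.0, 10.0, 20.0):
    print(f"LSTAT={lstat:.1f} -> predicted PRICE={predict_price(lstat):.2f}")
```

Each one-point increase in LSTAT lowers the predicted PRICE by about 0.95, which is 950 dollars given that the target is expressed in thousands of dollars.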

<div class="alert alert-success" data-title="">
  <h2><i class="fa fa-tasks" aria-hidden="true"></i> Relationship analysis with a simple linear regression model
  </h2>
</div>


## Setting X and y
```python 
X = pd.DataFrame(boston_df['LSTAT'])
y = boston_df['PRICE']
```

In [15]:
# TODO
X = pd.DataFrame(boston_df['LSTAT'])
y = boston_df['PRICE']
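The reason `X` is wrapped in `pd.DataFrame` while `y` stays a Series is shape: scikit-learn estimators expect a 2-D feature array and accept a 1-D target. A minimal sketch on a three-row stand-in frame (`demo_df`, `X_2d`, and `y_1d` are illustrative names):

```python
import pandas as pd

# Tiny stand-in for boston_df, using values from the output shown earlier.
demo_df = pd.DataFrame({"LSTAT": [4.98, 9.14, 4.03],
                        "PRICE": [24.0, 21.6, 34.7]})

X_2d = pd.DataFrame(demo_df["LSTAT"])   # one-column DataFrame, same as demo_df[["LSTAT"]]
y_1d = demo_df["PRICE"]                 # plain Series

print(X_2d.shape)  # (3, 1): rows x features
print(y_1d.shape)  # (3,): one value per row
```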

## Splitting into training and test data

```python 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                       test_size=0.3,
                                                       random_state = 42)

```

In [17]:
import numpy as np
X = np.array(X).reshape(-1,1)  # 2-D array of shape (n_samples, 1)

In [18]:
# TODO
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                       test_size=0.3,
                                                       random_state = 42)
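To see what this split produces, the sketch below runs `train_test_split` with the same `test_size` and `random_state` on stand-in arrays of 506 rows (the observation count reported in the OLS summary). The arrays `X_demo` and `y_demo` are placeholders, not the Boston data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 506                                  # same row count as the dataset
X_demo = np.arange(n_samples).reshape(-1, 1)     # stand-in feature column
y_demo = np.arange(n_samples)                    # stand-in target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)
print(np.array_equal(X_te, X_te2))
```

With 506 rows and `test_size=0.3`, the test set is rounded up to 152 rows and the remaining 354 are used for training. Fixing `random_state` keeps the evaluation numbers reproducible between runs.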

## Training the model

```python 
from sklearn.linear_model import LinearRegression
# 1. define the model
# 2. fit
# 3. predict
```

In [19]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()        # 1. define the model
model.fit(X_train, y_train)       # 2. fit on the training data
y_pred = model.predict(X_test)    # 3. predict on the test data
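The define, fit, predict pattern can be checked on data where the answer is known. The sketch below is an assumption-laden toy: it generates synthetic points around a line with intercept 34.5 and slope -0.95 (numbers chosen to resemble the earlier OLS result, not taken from a real split) and reads the learned parameters back from `intercept_` and `coef_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic LSTAT-like feature, shaped (n_samples, 1) as scikit-learn expects.
x_demo = rng.uniform(2, 35, size=200).reshape(-1, 1)
# Synthetic target: a known line plus Gaussian noise.
y_demo = 34.5 - 0.95 * x_demo.ravel() + rng.normal(0.0, 1.0, size=200)

demo_model = LinearRegression()                      # 1. define the model
demo_model.fit(x_demo, y_demo)                       # 2. fit
demo_pred = demo_model.predict(np.array([[10.0]]))   # 3. predict for one new value

print(demo_model.intercept_, demo_model.coef_)
print(demo_pred)
```

The recovered intercept and slope should land close to the values used to generate the data, which is a quick sanity check that the three steps are wired up correctly.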

## Evaluating the model

```python 
from sklearn.metrics import mean_absolute_error # 1. MAE
from sklearn.metrics import mean_squared_error  # 2. MSE 
import numpy as np                              # 3. RMSE 
from sklearn.metrics import r2_score            # 4. R2_score
```

In [20]:
from sklearn.metrics import mean_absolute_error # 1. MAE
from sklearn.metrics import mean_squared_error  # 2. MSE 
import numpy as np                              # 3. RMSE 
from sklearn.metrics import r2_score            # 4. R2_score

print(f'MAE: {mean_absolute_error(y_test, y_pred)}')
print(f'MSE: {mean_squared_error(y_test, y_pred)}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred))}')
print(f'R2_score: {r2_score(y_test, y_pred)}')

MAE: 4.752100511437848
MSE: 38.0987021824347
RMSE: 6.1724146152405135
R2_score: 0.4886979007906852
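For reference, the four metrics can be written out by hand. The sketch below uses four toy values rather than the Boston predictions, and computes each quantity with NumPy according to its standard definition.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # toy targets (not from the Boston data)
y_hat = np.array([2.5, 0.0, 2.0, 8.0])     # toy predictions

errors = y_true - y_hat
mae = np.mean(np.abs(errors))               # mean absolute error
mse = np.mean(errors ** 2)                  # mean squared error
rmse = np.sqrt(mse)                         # root of MSE, in the target's units

ss_res = np.sum(errors ** 2)                        # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # total sum of squares
r2 = 1.0 - ss_res / ss_tot                          # share of variance explained

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
# prints: MAE=0.500 MSE=0.375 RMSE=0.612 R2=0.949
```

RMSE is the square root of MSE, which is why the notebook output above shows 6.17 for RMSE against 38.10 for MSE.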
