# Predicting Boston House Prices

## 1. Look at the Big Picture

### Problem to solve: build a model that predicts Boston house prices

- Evaluation metric on the test data:  
Mean Absolute Error (MAE)

### Problem definition

- Supervised learning
- Regression
- Batch learning

### Selecting a performance measure

#### 1. Mean Absolute Error (MAE)

$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left\vert h\left(\mathbf{x}^{(i)}\right)-y^{(i)}\right\vert$

- $m$: the number of samples in the dataset
- $\mathbf{x}^{(i)}$: the vector of all feature values of the $i$-th sample
- $y^{(i)}$: the label of the $i$-th sample (the expected output for that sample)

\begin{align*}
\mathbf{x}^{(1)} = \begin{bmatrix}
           -118.29 \\
           33.91 \\
           1{,}416 \\
           38{,}372
\end{bmatrix}
\end{align*}

$$y^{(1)} = 156{,}400$$
- $\mathbf{X}$: the matrix containing all feature values of all samples in the dataset  

\begin{align*}
\mathbf{X} = \begin{bmatrix}
           \left(\mathbf{x}^{(1)}\right)^T \\
           \left(\mathbf{x}^{(2)}\right)^T \\
           \vdots \\
           \left(\mathbf{x}^{(2000)}\right)^T
           \end{bmatrix}
           = \begin{bmatrix}
           -118.29 & 33.91 & 1{,}416 & 38{,}372 \\
           \vdots & \vdots & \vdots & \vdots
           \end{bmatrix}
\end{align*}

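- $h$: the prediction function. For a single sample $\mathbf{x}^{(i)}$ it outputs the predicted value $\hat{y}^{(i)} = h\left(\mathbf{x}^{(i)}\right)$.
- $\mathrm{MAE}(\mathbf{X}, h)$: a measure of how good the model $h$ is, also called the cost function
>- $\mathrm{MAE}$ takes the absolute value of the difference (error) between each actual and predicted value and averages the results.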
```python
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)
```
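To make the notation concrete, the formula can be evaluated by hand on a few samples. The sketch below is illustrative only: the feature rows in `X`, the labels in `y`, and the linear predictor `h` are made-up assumptions, not values taken from the housing data.

```python
# Hypothetical toy data: each row of X is one sample x^(i), y holds the labels y^(i).
X = [
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
]
y = [10.0, 12.0, 20.0]


def h(x):
    """Hypothetical prediction function: a fixed linear model."""
    return 4.0 * x[0] + 2.0 * x[1]


def mae(X, y, h):
    """MAE(X, h) = (1/m) * sum over i of |h(x^(i)) - y^(i)|."""
    m = len(X)
    return sum(abs(h(x_i) - y_i) for x_i, y_i in zip(X, y)) / m


# Absolute errors are 2, 3 and 5, so the mean is 10 / 3.
print(mae(X, y, h))
```

Passing the same labels and predictions to `mean_absolute_error` should give the same number.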


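- Because the absolute value of each error is taken, the size of the error is reflected as is.  
MAE therefore suits **domains where an error of 10 is twice as bad as an error of 5**.
- It is appropriate when the loss grows linearly with the error.
- It is also appropriate when the data contain many outliers, because large errors are not squared and so do not dominate the metric.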
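The points above can be checked numerically. The following sketch uses made-up error values and, purely for contrast, a root-mean-squared error (RMSE) helper that does not appear elsewhere in this notebook.

```python
import math


def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root of the mean squared error, shown only for comparison."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


small = [5, 5, 5, 5]          # every prediction is off by 5
doubled = [10, 10, 10, 10]    # every prediction is off by 10
with_outlier = [5, 5, 5, 50]  # one large outlier error

# MAE doubles when every error doubles (linear penalty).
print(mae(small), mae(doubled))

# With a single outlier, RMSE is pulled up more strongly than MAE.
print(mae(with_outlier), rmse(with_outlier))
```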

Reference: https://mizykk.tistory.com/102

----

## 2. Setup and Loading the Data

In [35]:
import os
import pandas as pd

REST_PATH = os.path.join("datasets", "boston")

def load_housing_data(file, rest_path=REST_PATH):
    csv_path = os.path.join(rest_path, file)
    return pd.read_csv(csv_path)


housing_train = load_housing_data('train.csv') # load the CSV into a DataFrame
housing_test = load_housing_data('test.csv') # load the CSV into a DataFrame
housing = pd.concat([housing_train, housing_test])
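Two side effects of concatenating the train and test sets are worth keeping in mind: the original index labels are kept, so they repeat, and rows from a frame that lacks a column get `NaN` in that column. The sketch below illustrates both on two tiny hypothetical frames (the values are invented) and shows `ignore_index=True` as one possible way to obtain a fresh index.

```python
import pandas as pd

# Hypothetical stand-ins for train.csv and test.csv.
train = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
test = pd.DataFrame({"Id": [3, 4], "LotArea": [11622, 14267]})  # no target column

combined = pd.concat([train, test])
print(combined.index.tolist())               # index labels repeat
print(combined["SalePrice"].isnull().sum())  # test rows have no SalePrice

# ignore_index=True builds a new 0..n-1 index instead.
reindexed = pd.concat([train, test], ignore_index=True)
print(reindexed.index.tolist())
```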

------

## 3. Explore and Visualize the Data to Gain Insights

### Taking a closer look at the data

In [36]:
housing.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

$\rightarrow$ There are many columns.

In [37]:
housing.describe() # summary statistics for the numeric columns

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2919.0 | 2919.0 | 2433.0 | 2919.0 | 2919.0 | 2919.0 | 2919.0 | 2919.0 | 2896.0 | 2918.0 | ... | 2919.0 | 2919.0 | 2919.0 | 2919.0 | 2919.0 | 2919.0 | 2919.0 | 2919.0 | 2919.0 | 1460.0 |
| mean | 1460.0 | 57.137718 | 69.305795 | 10168.11408 | 6.089072 | 5.564577 | 1971.312778 | 1984.264474 | 102.201312 | 441.423235 | ... | 93.709832 | 47.486811 | 23.098321 | 2.602261 | 16.06235 | 2.251799 | 50.825968 | 6.213087 | 2007.792737 | 180921.19589 |
| std | 842.787043 | 42.517628 | 23.344905 | 7886.996359 | 1.409947 | 1.113131 | 30.291442 | 20.894344 | 179.334253 | 455.610826 | ... | 126.526589 | 67.575493 | 64.244246 | 25.188169 | 56.184365 | 35.663946 | 567.402211 | 2.714762 | 1.314964 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 730.5 | 20.0 | 59.0 | 7478.0 | 5.0 | 5.0 | 1953.5 | 1965.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 2007.0 | 129975.0 |
| 50% | 1460.0 | 50.0 | 68.0 | 9453.0 | 6.0 | 5.0 | 1973.0 | 1993.0 | 0.0 | 368.5 | ... | 0.0 | 26.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 2189.5 | 70.0 | 80.0 | 11570.0 | 7.0 | 6.0 | 2001.0 | 2004.0 | 164.0 | 733.0 | ... | 168.0 | 70.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 2919.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 1424.0 | 742.0 | 1012.0 | 508.0 | 576.0 | 800.0 | 17000.0 | 12.0 | 2010.0 | 755000.0 |


In [38]:
mine = ['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinSF1','BsmtFinType2','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','Electrical','BsmtFullBath','BsmtHalfBath']
housing[mine].isnull().sum()

BsmtQual        81
BsmtCond        82
BsmtExposure    82
BsmtFinType1    79
BsmtFinSF1       1
BsmtFinType2    80
BsmtFinSF2       1
BsmtUnfSF        1
TotalBsmtSF      1
Electrical       1
BsmtFullBath     2
BsmtHalfBath     2
dtype: int64

In [39]:
housing[mine].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 1458
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   BsmtQual      2838 non-null   object 
 1   BsmtCond      2837 non-null   object 
 2   BsmtExposure  2837 non-null   object 
 3   BsmtFinType1  2840 non-null   object 
 4   BsmtFinSF1    2918 non-null   float64
 5   BsmtFinType2  2839 non-null   object 
 6   BsmtFinSF2    2918 non-null   float64
 7   BsmtUnfSF     2918 non-null   float64
 8   TotalBsmtSF   2918 non-null   float64
 9   Electrical    2918 non-null   object 
 10  BsmtFullBath  2917 non-null   float64
 11  BsmtHalfBath  2917 non-null   float64
dtypes: float64(6), object(6)
memory usage: 296.5+ KB
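Since the columns split into text-valued and numeric ones, it can help to group them by dtype and to look at the share of missing values rather than the raw count. The following is a minimal sketch on a hypothetical miniature frame (the values are invented). Selecting with `"number"` is an assumption made here so that the split does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Hypothetical miniature frame mixing a text column and a float column, both with gaps.
df = pd.DataFrame({
    "BsmtQual": ["TA", "Gd", None, "TA"],
    "TotalBsmtSF": [856.0, None, 920.0, 756.0],
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

# isnull().mean() gives the fraction of missing values per column.
missing_ratio = df.isnull().mean()

print(non_numeric_cols, numeric_cols)
print(missing_ratio)
```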


----

In [40]:
for col in mine :
    if housing[col].dtype == object :
        print(col)
        print(housing[col].value_counts())
        print('--------------------')

BsmtQual
TA    1283
Gd    1209
Ex     258
Fa      88
Name: BsmtQual, dtype: int64
--------------------
BsmtCond
TA    2606
Gd     122
Fa     104
Po       5
Name: BsmtCond, dtype: int64
--------------------
BsmtExposure
No    1904
Av     418
Gd     276
Mn     239
Name: BsmtExposure, dtype: int64
--------------------
BsmtFinType1
Unf    851
GLQ    849
ALQ    429
Rec    288
BLQ    269
LwQ    154
Name: BsmtFinType1, dtype: int64
--------------------
BsmtFinType2
Unf    2493
Rec     105
LwQ      87
BLQ      68
ALQ      52
GLQ      34
Name: BsmtFinType2, dtype: int64
--------------------
Electrical
SBrkr    2671
FuseA     188
FuseF      50
FuseP       8
Mix         1
Name: Electrical, dtype: int64
--------------------


In [41]:
for col in mine :
    if housing[col].dtype == float :
        print(col)
        print(housing[col].value_counts())
        print("Median : ", housing[col].median())
        print('--------------------')

BsmtFinSF1
0.0       929
24.0       27
16.0       14
300.0       9
288.0       8
         ... 
1022.0      1
939.0       1
1124.0      1
1619.0      1
1106.0      1
Name: BsmtFinSF1, Length: 991, dtype: int64
Median :  368.5
--------------------
BsmtFinSF2
0.0      2571
294.0       5
180.0       5
162.0       3
539.0       3
         ... 
196.0       1
904.0       1
456.0       1
624.0       1
823.0       1
Name: BsmtFinSF2, Length: 272, dtype: int64
Median :  0.0
--------------------
BsmtUnfSF
0.0       241
384.0      19
728.0      14
672.0      13
600.0      12
         ... 
1503.0      1
445.0       1
958.0       1
1559.0      1
1369.0      1
Name: BsmtUnfSF, Length: 1135, dtype: int64
Median :  467.0
--------------------
TotalBsmtSF
0.0       78
864.0     74
672.0     29
912.0     26
1040.0    25
          ..
1949.0     1
1231.0     1
1829.0     1
1475.0     1
1243.0     1
Name: TotalBsmtSF, Length: 1058, dtype: int64
Median :  989.5
--------------------
BsmtFullBath
0.0    1705
1.

In [87]:
def object_clear(df, col):
    # Fill missing values with the most frequent category (the mode).
    index_max = df[col].value_counts().idxmax()
    # fillna returns a new Series, so the result must be assigned back.
    df[col] = df[col].fillna(index_max)

In [88]:
housing_tr = housing.copy()

for col in mine :
    if housing_tr[col].dtype == object :
        object_clear(housing_tr,col)
        print(col, housing_tr[col].isnull().sum())

BsmtQual 0
BsmtCond 0
BsmtExposure 0
BsmtFinType1 0
BsmtFinType2 0
Electrical 0


In [89]:
housing_tr[mine].isnull().sum()

BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinSF1      1
BsmtFinType2    0
BsmtFinSF2      1
BsmtUnfSF       1
TotalBsmtSF     1
Electrical      0
BsmtFullBath    2
BsmtHalfBath    2
dtype: int64
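The float columns in `mine` still contain a few missing values. One possible way to handle them, suggested by the medians printed earlier, is median imputation. The sketch below is only an assumption about a reasonable next step: `numeric_clear` is a hypothetical helper and the miniature frame holds invented values. Because `fillna` returns a new Series, the result is assigned back to the column.

```python
import pandas as pd

# Hypothetical miniature frame; the notebook itself would operate on housing_tr[mine].
df = pd.DataFrame({
    "BsmtFinSF1": [0.0, 24.0, None, 300.0],
    "BsmtFullBath": [0.0, 1.0, 0.0, None],
})


def numeric_clear(df, col):
    """Hypothetical helper: fill missing values in a numeric column with its median."""
    df[col] = df[col].fillna(df[col].median())


for col in df.columns:
    numeric_clear(df, col)

print(df.isnull().sum())
```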

-------------

In [10]:
housing.describe()

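| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2919 | 2919 | 2918 | 2446.0 | 2919 | 2919 | 285 | 2919 | 2919 | 2918 | ... | 2919 | 99 | 648 | 189 | 2919 | 2919 | 2919 | 2918 | 2919 | 1509 |
| unique | 2834 | 20 | 9 | 129.0 | 1912 | 6 | 6 | 8 | 8 | 6 | ... | 17 | 7 | 8 | 8 | 41 | 16 | 9 | 13 | 10 | 651 |
| top | TA | 20 | RL | 60.0 | TA | Pave | Grvl | Reg | Lvl | AllPub | ... | 0 | TA | MnPrv | Shed | 0 | 6 | 2007 | WD | Normal | TA |
| freq | 84 | 1041 | 2196 | 261.0 | 84 | 2819 | 119 | 1785 | 2542 | 2828 | ... | 2818 | 84 | 323 | 91 | 2732 | 481 | 674 | 2443 | 2327 | 84 |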


In [11]:
import seaborn as sns  # this import was missing, which caused a NameError
sns.heatmap(housing.select_dtypes(include="number").corr())  # numeric columns only
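With this many columns, a full correlation heatmap can be hard to read. As a complementary view, the numeric features can be ranked by their correlation with the target. The sketch below assumes a hypothetical miniature frame with invented values. Only the column names `GrLivArea`, `Street`, and `SalePrice` come from the dataset, and `Age` is a made-up feature.

```python
import pandas as pd

# Hypothetical miniature frame with one text column and a target column.
df = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "Age": [40, 30, 20, 10],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [100000, 150000, 200000, 250000],
})

# Keep numeric columns only; text columns cannot be correlated directly.
numeric = df.select_dtypes(include="number")

# Pearson correlation of every numeric feature with the target, strongest first.
corr_with_target = (
    numeric.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr_with_target)
```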