https://www.kaggle.com/code/agileteam/t2-4-house-prices-regression

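## House Price Prediction
- Target variable: ['SalePrice']
- Evaluation metrics: rmse, r2

    - rmse: lower is better
    - r2: higher is better

The R-squared (R2) score measures how much of the variance in the target a regression model explains. Its maximum is 1 (a perfect fit). A model that always predicts the mean of the target scores 0, and a model that does worse than that gets a negative score, so R2 is not confined to the range 0 to 1. The closer the R2 score is to 1, the better the model explains the data.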
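To make the two metrics concrete, here is a minimal NumPy sketch that computes them by hand. The helper names `rmse_manual` and `r2_manual` and the toy values are illustrative assumptions, not part of the notebook. The functions follow the standard textbook definitions and should agree with `sklearn.metrics` on ordinary inputs.

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy targets and three hypothetical sets of predictions
y_true = [100.0, 200.0, 300.0]
perfect = [100.0, 200.0, 300.0]    # exact predictions
mean_only = [200.0, 200.0, 200.0]  # always predicts the mean of y_true
reversed_pred = [300.0, 200.0, 100.0]  # worse than predicting the mean

for name, pred in [("perfect", perfect), ("mean_only", mean_only), ("reversed", reversed_pred)]:
    print(name, "R2:", r2_manual(y_true, pred), "RMSE:", rmse_manual(y_true, pred))
```

The mean-only predictor gets an R2 of 0 and the reversed predictions get a negative R2, while RMSE stays in the units of the target (here, price).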

In [4]:
# Exam environment setup (do not modify this code)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

def exam_data_load(df, target, id_name="", null_name=""):
    if id_name == "":
        df = df.reset_index().rename(columns={"index": "id"})
        id_name = 'id'
    else:
        id_name = id_name
    
    if null_name != "":
        df[df == null_name] = np.nan
    
    X_train, X_test = train_test_split(df, test_size=0.2, shuffle=True, random_state=2021)
    y_train = X_train[[id_name, target]]
    X_train = X_train.drop(columns=[id_name, target])
    y_test = X_test[[id_name, target]]
    X_test = X_test.drop(columns=[id_name, target])
    return X_train, X_test, y_train, y_test 
    
df = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
X_train, X_test, y_train, y_test = exam_data_load(df, target='SalePrice', id_name='Id')

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1168, 79), (292, 79), (1168, 2), (292, 2))

In [5]:
import pandas as pd

In [6]:
X_train.shape, X_test.shape

((1168, 79), (292, 79))

In [7]:
X_train.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
81,120,RM,32.0,4500,Pave,,Reg,Lvl,AllPub,FR2,...,0,0,,,,0,3,2006,WD,Normal
1418,20,RL,71.0,9204,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,8,2008,COD,Normal
1212,30,RL,50.0,9340,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,8,2009,WD,Normal
588,20,RL,65.0,25095,Pave,,IR1,Low,AllPub,Inside,...,60,0,,,,0,6,2009,WD,Partial
251,120,RM,44.0,4750,Pave,,IR1,HLS,AllPub,Inside,...,153,0,,,,0,12,2007,WD,Family


In [8]:
X_test.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1380,30,RL,45.0,8212,Pave,Grvl,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,6,2010,WD,Normal
520,190,RL,60.0,10800,Pave,Grvl,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,8,2008,WD,Normal
1175,50,RL,85.0,10678,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,4,2007,WD,Normal
351,120,RL,,5271,Pave,,IR1,Low,AllPub,Inside,...,184,0,,,,0,12,2006,WD,Abnorml
1335,20,RL,80.0,9650,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,4,2009,WD,Normal


In [9]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1168 entries, 81 to 1140
Data columns (total 79 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1168 non-null   int64  
 1   MSZoning       1168 non-null   object 
 2   LotFrontage    956 non-null    float64
 3   LotArea        1168 non-null   int64  
 4   Street         1168 non-null   object 
 5   Alley          70 non-null     object 
 6   LotShape       1168 non-null   object 
 7   LandContour    1168 non-null   object 
 8   Utilities      1168 non-null   object 
 9   LotConfig      1168 non-null   object 
 10  LandSlope      1168 non-null   object 
 11  Neighborhood   1168 non-null   object 
 12  Condition1     1168 non-null   object 
 13  Condition2     1168 non-null   object 
 14  BldgType       1168 non-null   object 
 15  HouseStyle     1168 non-null   object 
 16  OverallQual    1168 non-null   int64  
 17  OverallCond    1168 non-null   int64  
 18  YearBui

In [10]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 292 entries, 1380 to 906
Data columns (total 79 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     292 non-null    int64  
 1   MSZoning       292 non-null    object 
 2   LotFrontage    245 non-null    float64
 3   LotArea        292 non-null    int64  
 4   Street         292 non-null    object 
 5   Alley          21 non-null     object 
 6   LotShape       292 non-null    object 
 7   LandContour    292 non-null    object 
 8   Utilities      292 non-null    object 
 9   LotConfig      292 non-null    object 
 10  LandSlope      292 non-null    object 
 11  Neighborhood   292 non-null    object 
 12  Condition1     292 non-null    object 
 13  Condition2     292 non-null    object 
 14  BldgType       292 non-null    object 
 15  HouseStyle     292 non-null    object 
 16  OverallQual    292 non-null    int64  
 17  OverallCond    292 non-null    int64  
 18  YearBui

### Preprocessing

In [11]:
X_train = X_train.select_dtypes(exclude = 'object')
X_test = X_test.select_dtypes(exclude = 'object')

In [12]:
target = y_train['SalePrice']

In [13]:
X_train.head(3)

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
81,120,32.0,4500,6,5,1998,1998,443.0,1201,0,...,405,0,199,0,0,0,0,0,3,2006
1418,20,71.0,9204,5,5,1963,1963,0.0,25,872,...,336,0,88,0,0,0,0,0,8,2008
1212,30,50.0,9340,4,6,1941,1950,0.0,344,0,...,234,0,113,0,0,0,0,0,8,2009


In [14]:
X_test.head(3)

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
1380,30,45.0,8212,3,3,1914,1950,0.0,203,0,...,200,0,0,96,0,0,0,0,6,2010
520,190,60.0,10800,4,7,1900,2000,0.0,0,0,...,0,220,114,210,0,0,0,0,8,2008
1175,50,85.0,10678,8,5,1992,2000,337.0,700,0,...,541,0,33,0,0,0,0,0,4,2007


In [15]:
y_train.head()

Unnamed: 0,Id,SalePrice
81,82,153500
1418,1419,124000
1212,1213,113000
588,589,143000
251,252,235000



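SimpleImputer is a class provided by the scikit-learn library: a simple but useful tool for handling missing values. Its strategy parameter accepts the following options:

'mean': replace with the column mean (default, numeric data only)

'median': replace with the column median (numeric data only)

'most_frequent': replace with the mode (the most frequently occurring value)

'constant': replace with a user-specified constant (set through the fill_value parameter)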
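The sketch below illustrates what the 'mean' and 'median' strategies do, and why the statistics are learned from the training data and then reused on the test data. It is a NumPy-only illustration of the idea, not scikit-learn's implementation. The toy arrays and the helper `fill_missing` are hypothetical.

```python
import numpy as np

# Toy training and test matrices with missing values (np.nan)
X_train_toy = np.array([[1.0, 10.0],
                        [np.nan, 20.0],
                        [3.0, np.nan],
                        [8.0, 60.0]])
X_test_toy = np.array([[np.nan, np.nan]])

# "fit": learn one statistic per column from the training data only,
# ignoring the missing entries
col_mean = np.nanmean(X_train_toy, axis=0)
col_median = np.nanmedian(X_train_toy, axis=0)

def fill_missing(X, fill_values):
    """Return a copy of X where each NaN is replaced by its column's fill value."""
    X = X.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill_values[cols]
    return X

# "transform": apply the training statistics to both train and test
train_mean = fill_missing(X_train_toy, col_mean)
test_mean = fill_missing(X_test_toy, col_mean)
test_median = fill_missing(X_test_toy, col_median)

print(train_mean)
print(test_mean)
print(test_median)
```

Reusing the training statistics on the test rows mirrors the `fit_transform` on train followed by `transform` on test pattern, which keeps information from the test set out of preprocessing.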

In [16]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer()
#imp = SimpleImputer(strategy = 'mean')

In [17]:
X_train = imp.fit_transform(X_train)
X_test = imp.transform(X_test)

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
X_tr, X_val, y_tr, y_val = train_test_split(X_train, target, test_size = 0.15, random_state = 2022)

X_tr.shape, X_val.shape, y_tr.shape, y_val.shape

((992, 36), (176, 36), (992,), (176,))

### Model

In [20]:
from sklearn.metrics import mean_squared_error, r2_score

def rmse(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

In [22]:
model = XGBRegressor()
model.fit(X_tr, y_tr, verbose = False)
pred = model.predict(X_val)

print('R2 :', r2_score(y_val, pred))
print('RMSE :', rmse(y_val, pred))

R2 : 0.8534731687038711
RMSE : 31827.55779287209


In [23]:
model = RandomForestRegressor()
model.fit(X_tr, y_tr)
pred = model.predict(X_val)

print("R2 : " + str(r2_score(y_val, pred)))
print("RMSE : " + str(rmse(y_val, pred)))

R2 : 0.8521536956593712
RMSE : 31970.54008825748


### Final Model Selection

In [26]:
y = y_train['SalePrice']

In [28]:
final_model = XGBRegressor()
final_model.fit(X_train, y)

prediction = final_model.predict(X_test)

In [29]:
# The predicted column is named after the target ('SalePrice')
submission = pd.DataFrame(data={
    'Id': y_test.Id,
    'SalePrice' : prediction
})

In [30]:
submission.head()

Unnamed: 0,Id,SalePrice
1380,1381,79665.03125
520,521,112973.0625
1175,1176,327729.59375
351,352,184498.734375
1335,1336,155756.921875
