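Content

Build a model that predicts the sale price (SalePrice) of a house from the given housing data.

During preprocessing, handle missing values and encode the categorical data.

 - Key requirements

  - Read the data and check its key information.

  - Perform the following preprocessing steps:

    - Missing values: drop columns with many missing values, and fill LotFrontage with its mean.

    - Categorical data: use pd.get_dummies to convert categorical data to numbers.

    - Unneeded columns: drop the Id column.

  - Split the data into training and test sets (train:test ratio = 8:2).

  - Train a Decision Tree Regressor and make predictions.

  - Compute the model's mean absolute error (MAE) on the test data.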
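
As a quick illustration of the metric named in the last requirement, the sketch below computes MAE by hand and compares it with `mean_absolute_error`. The prices are invented for the example and are not taken from train.csv.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration only; not taken from train.csv
y_true = np.array([200000, 150000, 300000, 120000])
y_hat = np.array([210000, 140000, 270000, 125000])

# MAE is the mean of |actual - predicted| over all samples
mae_manual = np.mean(np.abs(y_true - y_hat))
mae_sklearn = mean_absolute_error(y_true, y_hat)

print(mae_manual, mae_sklearn)
```

Because MAE is in the same unit as the target, a value here reads directly as an average error in price units.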



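Problem guide

 - Data preprocessing

  - Handling missing values, encoding categorical variables, and dropping unneeded columns are all required.

 - Model training and evaluation

  - Fit a Decision Tree Regressor on the training data and evaluate it on the test data.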
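
The inspection and missing-value steps can be sketched on a tiny stand-in frame. Everything here is an assumption made for illustration: the `toy` data, the `Alley` column chosen as the sparse one, and the "more than half the rows" threshold. The task itself only says to drop columns with "many" missing values.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train.csv; column names mirror the dataset, values are invented
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

toy.info()                      # dtypes and non-null counts per column
missing = toy.isnull().sum()    # number of missing values per column
print(missing[missing > 0])

# Hypothetical threshold: drop columns where more than half the rows are missing
threshold = len(toy) // 2
sparse_cols = missing[missing > threshold].index
cleaned = toy.drop(columns=sparse_cols)

# Fill the remaining LotFrontage gaps with the column mean
cleaned["LotFrontage"] = cleaned["LotFrontage"].fillna(cleaned["LotFrontage"].mean())
print(cleaned)
```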

In [102]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Load the data
df = pd.read_csv('train.csv')
# Drop columns with more than 300 missing values
df = df.drop(columns=df.columns[df.isnull().sum() > 300])
# Fill missing LotFrontage values with the column mean
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())

# Fill the remaining missing values: mode for object columns, mean for the rest
null_list = df.columns[df.isnull().sum() > 0]
for column in null_list:
    if df[column].dtype == 'object':
        df[column] = df[column].fillna(df[column].mode()[0])
    else:
        df[column] = df[column].fillna(df[column].mean())

# Treat object-dtype columns as categorical and one-hot encode them
obj_list = df.columns[df.dtypes == 'object']
for obj in obj_list:
    dummies = pd.get_dummies(df[obj], dtype=int)
    df = df.drop(obj, axis=1)
    df = pd.concat([df, dummies], axis=1)
# Drop the Id column
df = df.drop('Id', axis=1)
# Separate the features from the target
x = df.drop('SalePrice', axis=1)
y = df['SalePrice']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Create and fit the model (random_state is not set, so the MAE varies between runs)
tree_reg = DecisionTreeRegressor()
tree_reg.fit(x_train, y_train)
y_pred = tree_reg.predict(x_test)
mae = mean_absolute_error(y_test, y_pred)
print('Mean absolute error : ', mae)

Mean absolute error :  28061.79794520548
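
The MAE changes between runs of this notebook (compare the value printed in cell In [101]) because the tree is created without a `random_state`. A minimal sketch of making the score repeatable follows. It uses synthetic data as a stand-in for the housing features, and the helper `fit_and_score` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 3.0 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def fit_and_score(seed):
    """Fit a tree with a fixed seed and return its test MAE."""
    model = DecisionTreeRegressor(random_state=seed)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))

# Same seed, same split: the two scores are identical
mae_a = fit_and_score(0)
mae_b = fit_and_score(0)
print(mae_a, mae_b)
```

Fixing the seed does not make the model better. It only makes the reported number reproducible.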


In [101]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
y_pred = tree_reg.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(mae, y_pred)

28513.869863013697 [154000. 410000. 120000. 159000. 277500.  88000. 200000. 148500.  94500.
 112000. 189950. 111000.  94750. 201000. 190000. 145500. 196500. 136000.
 113000. 210000. 104900. 227000. 183200. 109900. 179600. 179000. 164990.
  94750. 163990. 194000. 128000. 222500. 169000. 112500. 250000. 124500.
  93000. 193000. 290000. 116000. 109500. 263000. 119000. 372500. 126500.
 193500. 119000. 140000. 582933. 128500. 111000. 215000. 112500. 381000.
 141000. 212900. 190000. 142500. 132500. 116000.  90350. 164500. 354000.
 266000. 245350. 239000. 100000. 440000. 123600. 155000. 118964. 140000.
 110000.  87500. 555000. 179000. 290000. 328000. 139000. 131000.  88000.
 100000. 126500.  80000. 182900. 140000. 270000. 213490. 136500. 183000.
 117000. 197500. 128500. 294000. 100000. 185000. 122000. 157000. 221500.
 238000. 157500. 193000. 375000. 116900. 223500. 167000. 138000. 274900.
 140000. 200000.  61000. 123600. 132500. 140000. 236000. 130000. 120000.
 109900. 158000. 274000. 184000.

In [95]:
df['SalePrice']

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

In [None]:
# Write the per-column missing-value counts to a text file
with open('isnull.txt', 'w') as f:
    for column, n_missing in df.isnull().sum().items():
        f.write(f"{column}: {n_missing}\n")

In [68]:
# Note: `> 1` skips columns with exactly one missing value; the final cell uses `> 0`
null_list = df.columns[df.isnull().sum() > 1]
print(null_list)
print(df[null_list[0]].dtype)
for column in null_list:
    if df[column].dtype == 'object':
        df[column] = df[column].fillna(df[column].mode()[0])
    else:
        df[column] = df[column].fillna(df[column].mean())

print(df.isnull().sum())

Index(['MasVnrArea', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'GarageType', 'GarageYrBlt', 'GarageFinish',
       'GarageQual', 'GarageCond'],
      dtype='object')
float64
MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 74, dtype: int64


In [86]:
obj_cols = df.columns[df.dtypes == 'object']
# The Index of column names itself has dtype object
print(obj_cols.dtype)

object


In [92]:
#print(df.info(), end='\n\n\n\n')
obj_list = df.columns[df.dtypes == 'object']
for obj in obj_list:
    dummie = pd.get_dummies(df[obj], dtype=int)
    df = df.drop(obj, axis=1)
    df = pd.concat([df, dummie], axis=1)

#df_with_dummies = pd.get_dummies(df)
#print(df_with_dummies)

MemoryError: Unable to allocate 91.2 MiB for an array with shape (11960320, 1) and data type object
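
The traceback alone does not pin down what triggered this MemoryError, and rerunning cells against an already-modified `df` may be involved. One thing the loop certainly does is create duplicate column names, since several quality columns share codes such as `Gd` and `TA`. The sketch below, on an invented `toy` frame, contrasts the loop with a single `pd.get_dummies` call that prefixes each new column with its source column name.

```python
import pandas as pd

# Invented rows; the column names and category codes mirror the dataset
toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Gd"],
    "KitchenQual": ["TA", "Gd", "Ex"],
    "LotArea": [8450, 9600, 11250],
})

# Loop-style encoding without a prefix: both columns contribute "Gd" and "TA"
looped = toy.copy()
for col in ["ExterQual", "KitchenQual"]:
    dummies = pd.get_dummies(looped[col], dtype=int)
    looped = pd.concat([looped.drop(col, axis=1), dummies], axis=1)
print(looped.columns.tolist())

# Single call: each indicator column is prefixed with its source column name
encoded = pd.get_dummies(toy, columns=["ExterQual", "KitchenQual"], dtype=int)
print(encoded.columns.tolist())
```

With duplicate names, an expression like `looped["Gd"]` returns a DataFrame instead of a Series, which makes later column-wise code ambiguous.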

In [58]:
print(df.columns)

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrArea', 'ExterQual',
       'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
       'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
       '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
       'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPor

In [57]:
categorical_columns = df.select_dtypes(include=['object'])
print(categorical_columns.columns)


Index(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional',
       'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive',
       'SaleType', 'SaleCondition'],
      dtype='object')


In [64]:
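# Note: the parenthesis is misplaced, so .mode()[0] runs on the fillna result and one scalar is assigned
df['BsmtQual'] = df['BsmtQual'].fillna(df['BsmtQual']).mode()[0]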
df['BsmtQual'].head(15)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
Name: BsmtQual, dtype: bool
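
The difference between the misplaced and the intended parenthesis can be seen on a small Series. The values below are invented and only reuse the category codes of the BsmtQual column.

```python
import numpy as np
import pandas as pd

# Invented values using category codes from the BsmtQual column
s = pd.Series(["TA", "Gd", np.nan, "TA", np.nan], name="BsmtQual")

# Misplaced parenthesis: .mode()[0] is applied to the fillna result, giving one scalar
scalar = s.fillna(s).mode()[0]

# Intended form: compute the mode first, then pass it to fillna
filled = s.fillna(s.mode()[0])

print(scalar)
print(filled.tolist())
```

Assigning `scalar` back to the column would overwrite every row with the same value, while `filled` only replaces the missing entries.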

In [65]:
df.head(15)

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,12,2008,WD,Normal,250000
5,50,RL,85.0,14115,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,0,320,0,0,700,10,2009,WD,Normal,143000
6,20,RL,75.0,10084,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,8,2007,WD,Normal,307000
7,60,RL,70.049958,10382,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,228,0,0,0,350,11,2009,WD,Normal,200000
8,50,RM,51.0,6120,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,205,0,0,0,0,4,2008,WD,Abnorml,129900
9,190,RL,50.0,7420,Pave,Reg,Lvl,AllPub,Corner,Gtl,...,0,0,0,0,0,1,2008,WD,Normal,118000


In [31]:
df['SaleType'] = pd.get_dummies(df['SaleType'], dtype=int)
print(df.columns)

ValueError: Columns must be same length as key
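
This error arises because `pd.get_dummies` returns a DataFrame with one column per category, which cannot be assigned to the single column `df['SaleType']`. A minimal sketch of the usual alternative, on an invented `toy` frame, drops the original column and concatenates the indicator columns instead.

```python
import pandas as pd

# Invented rows; the SaleType codes mirror values seen in the dataset
toy = pd.DataFrame({
    "SaleType": ["WD", "New", "WD", "COD"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

dummies = pd.get_dummies(toy["SaleType"], prefix="SaleType", dtype=int)
print(dummies.shape)  # one column per distinct category

# Replace the original column with the indicator columns
toy = pd.concat([toy.drop(columns="SaleType"), dummies], axis=1)
print(toy.columns.tolist())
```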