In [86]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

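## Iowa Housing Data Description

### 1. **Basic Information**
* Id: unique identifier for each house
* MSSubClass: type of dwelling
  * 20: 1-story, built 1946 or later
  * 30: 1-story, built 1945 or earlier
  * 40: 1-story with finished attic, all ages
  * 45: 1.5-story, unfinished, all ages
  * 50: 1.5-story, finished, all ages
  * 60: 2-story, built 1946 or later
  * 70: 2-story, built 1945 or earlier
  * 75: 2.5-story, all ages
  * 80: split or multi-level
  * 85: split foyer
  * 90: duplex, all styles and ages
  * 120: 1-story PUD (Planned Unit Development), built 1946 or later
  * 150: 1.5-story PUD, all ages
  * 160: 2-story PUD, built 1946 or later
  * 180: PUD, multi-level (including split level/foyer)
  * 190: 2-family conversion, all styles and ages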
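* MSZoning: zoning classification
  * RL: residential, low density
  * RM: residential, medium density
  * C (all): commercial
  * FV: floating village residential
  * RH: residential, high density
* LotFrontage: length of street connected to the property (linear feet)
* LotArea: lot size (square feet)
* Street: type of road access
  * Pave: paved
  * Grvl: gravel
* Alley: type of alley access
  * Grvl: gravel
  * Pave: paved
  * NA: no alley access
* LotShape: shape of the lot
  * Reg: regular
  * IR1: slightly irregular
  * IR2: moderately irregular
  * IR3: very irregular
* LandContour: flatness of the property
  * Lvl: level
  * Bnk: banked
  * HLS: hillside
  * Low: low-lying (depression)
* Utilities: utilities available
  * AllPub: all public utilities
  * NoSewr: no public sewer (septic tank)
  * NoSeWa: no public sewer or water (electricity and gas only)
  * ELO: electricity only
* LotConfig: lot configuration
  * Inside: inside lot
  * Corner: corner lot
  * CulDSac: cul-de-sac
  * FR2: frontage on 2 sides
  * FR3: frontage on 3 sides
* LandSlope: slope of the property
  * Gtl: gentle
  * Mod: moderate
  * Sev: severe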
* Neighborhood: neighborhood (physical location within the city)
  * CollgCr: College Creek
  * Veenker: Veenker
  * Crawfor: Crawford
  * NoRidge: North Ridge
  * Mitchel: Mitchell
  * Somerst: Somerset
  * NWAmes: Northwest Ames
  * OldTown: Old Town
  * BrkSide: Brookside
  * Sawyer: Sawyer
  * NridgHt: Northridge Heights
  * IDOTRR: Iowa DOT and Rail Road
  * MeadowV: Meadow Village
  * Edwards: Edwards
  * Timber: Timberland
  * Gilbert: Gilbert
  * StoneBr: Stone Brook
  * ClearCr: Clear Creek
  * NPkVill: Northpark Villa
  * Blueste: Bluestem
  * SawyerW: Sawyer West
  * Greens: Greens
  * GrnHill: Green Hills
  * Landmrk: Landmarks
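* Condition1: proximity to a main road or railroad
  * Norm: normal
  * Feedr: adjacent to a feeder street
  * Artery: adjacent to an arterial street
  * RRAe: adjacent to an East-West railroad
  * RRAn: adjacent to a North-South railroad
  * PosN: near a positive off-site feature (park, greenbelt, etc.)
  * PosA: adjacent to a positive off-site feature
  * RRNe: within 200 feet of an East-West railroad
  * RRNn: within 200 feet of a North-South railroad
* Condition2: proximity to a main road or railroad (if a second condition is present)
  * Norm: normal
  * Feedr: adjacent to a feeder street
  * Artery: adjacent to an arterial street
  * RRAe: adjacent to an East-West railroad
  * RRAn: adjacent to a North-South railroad
  * PosN: near a positive off-site feature (park, greenbelt, etc.)
  * PosA: adjacent to a positive off-site feature
  * RRNe: within 200 feet of an East-West railroad
  * RRNn: within 200 feet of a North-South railroad
* BldgType: type of building
  * 1Fam: single-family detached
  * 2fmCon: two-family conversion
  * Duplex: duplex
  * TwnhsE: townhouse, end unit
  * TwnhsI: townhouse, inside unit
* HouseStyle: style of dwelling
  * 1Story: 1-story
  * 1.5Fin: 1.5-story, second level finished
  * 1.5Unf: 1.5-story, second level unfinished
  * 2Story: 2-story
  * 2.5Fin: 2.5-story, second level finished
  * 2.5Unf: 2.5-story, second level unfinished
  * SFoyer: split foyer
  * SLvl: split level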
* OverallQual: overall material and finish quality (rated 1 to 10)
* OverallCond: overall condition (rated 1 to 10)
* YearBuilt: year of construction
* YearRemodAdd: year of remodeling
* RoofStyle: roof style
  * Flat: flat
  * Gable: gable
  * Gambrel: gambrel
  * Hip: hip
  * Mansard: mansard
  * Shed: shed
* RoofMatl: roof material
  * ClyTile: clay tile
  * CompShg: composite shingle
  * Membran: membrane
  * Metal: metal
  * Roll: roll
  * Tar&Grv: tar and gravel
  * WdShake: wood shakes
  * WdShngl: wood shingles
* Exterior1st: exterior covering, first material
  * AsbShng: asbestos shingles
  * AsphShn: asphalt shingles
  * BrkComm: brick common
  * BrkFace: brick face
  * CBlock: cinder block
  * CemntBd: cement board
  * HdBoard: hardboard
  * ImStucc: imitation stucco
  * MetalSd: metal siding
  * Other: other
  * Plywood: plywood
  * PreCast: precast
  * Stone: stone
  * Stucco: stucco
  * VinylSd: vinyl siding
  * Wd Sdng: wood siding
  * WdShing: wood shingles
* Exterior2nd: exterior covering, second material
  * AsbShng: asbestos shingles
  * AsphShn: asphalt shingles
  * BrkComm: brick common
  * BrkFace: brick face
  * CBlock: cinder block
  * CemntBd: cement board
  * HdBoard: hardboard
  * ImStucc: imitation stucco
  * MetalSd: metal siding
  * Other: other
  * Plywood: plywood
  * PreCast: precast
  * Stone: stone
  * Stucco: stucco
  * VinylSd: vinyl siding
  * Wd Sdng: wood siding
  * WdShing: wood shingles
* MasVnrType: masonry veneer type
  * BrkCmn: brick common
  * BrkFace: brick face
  * CBlock: cinder block
  * None: none
  * Stone: stone
* MasVnrArea: masonry veneer area (square feet)
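* ExterQual: exterior material quality
  * Ex: excellent
  * Gd: good
  * TA: typical/average
  * Fa: fair
  * Po: poor
* ExterCond: current condition of the exterior material
  * Ex: excellent
  * Gd: good
  * TA: typical/average
  * Fa: fair
  * Po: poor
* Foundation: type of foundation
  * BrkTil: brick and tile
  * CBlock: cinder block
  * PConc: poured concrete
  * Slab: slab
  * Stone: stone
  * Wood: wood
* BsmtQual: basement quality
  * Ex: excellent
  * Gd: good
  * TA: typical/average
  * Fa: fair
  * Po: poor
  * NA: no basement
* BsmtCond: basement condition
  * Ex: excellent
  * Gd: good
  * TA: typical/average
  * Fa: fair
  * Po: poor
  * NA: no basement
* BsmtExposure: degree of basement exposure (walkout or garden-level walls)
  * Gd: good exposure
  * Av: average exposure
  * Mn: minimum exposure
  * No: no exposure
  * NA: no basement
* BsmtFinType1: rating of basement finished area 1
  * GLQ: good living quarters
  * ALQ: average living quarters
  * BLQ: below-average living quarters
  * Rec: recreation room
  * LwQ: low quality
  * Unf: unfinished
  * NA: no basement
* BsmtFinSF1: basement finished area 1 (square feet)
* BsmtFinType2: rating of basement finished area 2
  * GLQ: good living quarters
  * ALQ: average living quarters
  * BLQ: below-average living quarters
  * Rec: recreation room
  * LwQ: low quality
  * Unf: unfinished
  * NA: no basement
* BsmtFinSF2: basement finished area 2 (square feet)
* BsmtUnfSF: unfinished basement area (square feet)
* TotalBsmtSF: total basement area (square feet)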
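* Heating: type of heating
  * Floor: floor furnace
  * GasA: gas forced warm air furnace
  * GasW: gas hot water or steam heat
  * Grav: gravity furnace
  * OthW: hot water or steam heat other than gas
  * Wall: wall furnace
* HeatingQC: heating quality and condition
  * Ex: excellent
  * Gd: good
  * TA: typical/average
  * Fa: fair
  * Po: poor
* CentralAir: central air conditioning
  * Y: yes
  * N: no
* Electrical: electrical system
  * SBrkr: standard circuit breakers
  * FuseA: fuse box, type A
  * FuseF: fuse box, type F
  * FuseP: fuse box, type P
  * Mix: mixed
* 1stFlrSF: first-floor area (square feet)
* 2ndFlrSF: second-floor area (square feet)
* LowQualFinSF: low-quality finished area (square feet)
* GrLivArea: above-ground living area (square feet)
* BsmtFullBath: number of full bathrooms in the basement
* BsmtHalfBath: number of half bathrooms in the basement
* FullBath: number of full bathrooms above ground
* HalfBath: number of half bathrooms above ground
* BedroomAbvGr: number of bedrooms above ground
* KitchenAbvGr: number of kitchens above ground
* KitchenQual: kitchen quality
  * Ex: excellent
  * Gd: good
  * TA: typical/average
  * Fa: fair
  * Po: poor
* TotRmsAbvGrd: total rooms above ground (excluding bathrooms)
* Functional: home functionality
  * Typ: typical functionality
  * Min1: minor deductions 1
  * Min2: minor deductions 2
  * Mod: moderate deductions
  * Maj1: major deductions 1
  * Maj2: major deductions 2
  * Sev: severely damaged
  * Sal: salvage only
* Fireplaces: number of fireplaces
* FireplaceQu: fireplace quality
  * Ex: excellent
  * Gd: good
  * TA: typical/average
  * Fa: fair
  * Po: poor
  * NA: no fireplace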
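* GarageType: garage type
  * 2Types: more than one type of garage
  * Attchd: attached to the house
  * Basment: basement garage
  * BuiltIn: built-in
  * CarPort: carport
  * Detchd: detached from the house
  * NA: no garage
* GarageYrBlt: year the garage was built
* GarageFinish: interior finish of the garage
  * Fin: finished
  * RFn: rough finished
  * Unf: unfinished
  * NA: no garage
* GarageCars: garage capacity (number of cars)
* GarageArea: garage area (square feet)
* GarageQual: garage quality
  * Ex: excellent
  * Gd: good
  * TA: typical/average
  * Fa: fair
  * Po: poor
  * NA: no garage
* GarageCond: garage condition
  * Ex: excellent
  * Gd: good
  * TA: typical/average
  * Fa: fair
  * Po: poor
  * NA: no garage
* PavedDrive: paved driveway
  * Y: paved
  * P: partially paved
  * N: not paved
* WoodDeckSF: wood deck area (square feet)
* OpenPorchSF: open porch area (square feet)
* EnclosedPorch: enclosed porch area (square feet)
* 3SsnPorch: three-season porch area (square feet)
* ScreenPorch: screen porch area (square feet)
* PoolArea: pool area (square feet)
* PoolQC: pool quality
  * Ex: excellent
  * Gd: good
  * TA: typical/average
  * Fa: fair
  * NA: no pool
* Fence: fence quality
  * GdPrv: good privacy
  * MnPrv: minimum privacy
  * GdWo: good wood
  * MnWw: minimum wood/wire
  * NA: no fence
* MiscFeature: miscellaneous feature
  * Elev: elevator
  * Gar2: second garage
  * Othr: other
  * Shed: shed
  * TenC: tennis court
  * NA: none
* MiscVal: value of the miscellaneous feature (dollars)
* MoSold: month sold
* YrSold: year sold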
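* SaleType: type of sale
  * WD: warranty deed, conventional
  * CWD: warranty deed, cash
  * VWD: warranty deed, VA loan
  * New: home just constructed and sold
  * COD: court officer deed/estate
  * Con: contract, 15% down payment, regular terms
  * ConLw: contract, low down payment and low interest
  * ConLI: contract, low interest
  * ConLD: contract, low down payment
  * Oth: other
* SaleCondition: condition of sale
  * Normal: normal sale
  * Abnorml: abnormal sale (trade, foreclosure, short sale)
  * AdjLand: adjoining land purchase
  * Alloca: allocation
  * Family: sale between family members
  * Partial: home not completed when last assessed
* SalePrice: sale price (target variable)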
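
Several columns above (ExterQual, KitchenQual, BsmtQual, and others) share the same Ex/Gd/TA/Fa/Po scale, so they could be converted to ordered integers. The sketch below is one possible encoding and is not something this notebook does: the numeric values in `QUALITY_ORDER`, the helper name `encode_quality`, and the toy frame are assumptions made for illustration. Note that `pd.read_csv` parses the literal string `NA` as a missing value by default, so "feature absent" entries arrive as NaN.

```python
import pandas as pd

# Shared five-level quality scale from the data dictionary.
# The integer values are an illustrative assumption, not part of the dataset.
QUALITY_ORDER = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}


def encode_quality(series: pd.Series, absent_value: int = 0) -> pd.Series:
    """Map quality codes to integers.

    NaN (feature absent) and any code outside QUALITY_ORDER become absent_value.
    """
    return series.map(QUALITY_ORDER).fillna(absent_value).astype(int)


# Toy data standing in for two quality columns.
toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex", "Fa"],
    "BsmtQual": ["Gd", None, "TA", "Po"],
})
encoded = toy.apply(encode_quality)
print(encoded)
```

Treating "absent" as 0 places it below "poor", which is a modelling choice; a separate indicator column would be an alternative.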

In [87]:
data = pd.read_csv("../06machine_learning/data/house_train.csv")
data

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,6,5,1999,2000,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0,Unf,0,953,953,GasA,Ex,Y,SBrkr,953,694,0,1647,0,0,2,1,3,1,TA,7,Typ,1,TA,Attchd,1999.0,RFn,2,460,TA,TA,Y,0,40,0,0,0,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NWAmes,Norm,Norm,1Fam,1Story,6,6,1978,1988,Gable,CompShg,Plywood,Plywood,Stone,119.0,TA,TA,CBlock,Gd,TA,No,ALQ,790,Rec,163,589,1542,GasA,TA,Y,SBrkr,2073,0,0,2073,1,0,2,0,3,1,TA,7,Min1,2,TA,Attchd,1978.0,Unf,2,500,TA,TA,Y,349,0,0,0,0,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,9,1941,2006,Gable,CompShg,CemntBd,CmentBd,,0.0,Ex,Gd,Stone,TA,Gd,No,GLQ,275,Unf,0,877,1152,GasA,Ex,Y,SBrkr,1188,1152,0,2340,0,0,2,0,4,1,Gd,9,Typ,2,Gd,Attchd,1941.0,RFn,1,252,TA,TA,Y,0,60,0,0,0,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,6,1950,1996,Hip,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,TA,TA,Mn,GLQ,49,Rec,1029,0,1078,GasA,Gd,Y,FuseA,1078,0,0,1078,1,0,1,0,2,1,Gd,5,Typ,0,,Attchd,1950.0,Unf,1,240,TA,TA,Y,366,0,112,0,0,0,,,,0,4,2010,WD,Normal,142125


In [88]:
# Set the maximum number of columns and rows to display
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
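
`pd.set_option` changes the display limits for the rest of the session. If the wider display is only needed for one inspection, `pd.option_context` applies the same options temporarily. A minimal sketch, using a made-up 30-column frame named `wide`:

```python
import pandas as pd

# Hypothetical one-row frame with more columns than the default display limit.
wide = pd.DataFrame([range(30)], columns=[f"c{i}" for i in range(30)])

# The options apply only inside the with-block and are restored afterwards.
with pd.option_context("display.max_columns", 100, "display.max_rows", 100):
    inside = pd.get_option("display.max_columns")
    print(wide)

after = pd.get_option("display.max_columns")
print(inside, after)
```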

In [89]:
data

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,6,5,1999,2000,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0,Unf,0,953,953,GasA,Ex,Y,SBrkr,953,694,0,1647,0,0,2,1,3,1,TA,7,Typ,1,TA,Attchd,1999.0,RFn,2,460,TA,TA,Y,0,40,0,0,0,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NWAmes,Norm,Norm,1Fam,1Story,6,6,1978,1988,Gable,CompShg,Plywood,Plywood,Stone,119.0,TA,TA,CBlock,Gd,TA,No,ALQ,790,Rec,163,589,1542,GasA,TA,Y,SBrkr,2073,0,0,2073,1,0,2,0,3,1,TA,7,Min1,2,TA,Attchd,1978.0,Unf,2,500,TA,TA,Y,349,0,0,0,0,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,9,1941,2006,Gable,CompShg,CemntBd,CmentBd,,0.0,Ex,Gd,Stone,TA,Gd,No,GLQ,275,Unf,0,877,1152,GasA,Ex,Y,SBrkr,1188,1152,0,2340,0,0,2,0,4,1,Gd,9,Typ,2,Gd,Attchd,1941.0,RFn,1,252,TA,TA,Y,0,60,0,0,0,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,6,1950,1996,Hip,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,TA,TA,Mn,GLQ,49,Rec,1029,0,1078,GasA,Gd,Y,FuseA,1078,0,0,1078,1,0,1,0,2,1,Gd,5,Typ,0,,Attchd,1950.0,Unf,1,240,TA,TA,Y,366,0,112,0,0,0,,,,0,4,2010,WD,Normal,142125


In [90]:
(data.isna().sum() / len(data) * 100).sort_values(ascending=False)

PoolQC           99.520548
MiscFeature      96.301370
Alley            93.767123
Fence            80.753425
MasVnrType       59.726027
FireplaceQu      47.260274
LotFrontage      17.739726
GarageYrBlt       5.547945
GarageCond        5.547945
GarageType        5.547945
GarageFinish      5.547945
GarageQual        5.547945
BsmtFinType2      2.602740
BsmtExposure      2.602740
BsmtQual          2.534247
BsmtCond          2.534247
BsmtFinType1      2.534247
MasVnrArea        0.547945
Electrical        0.068493
Id                0.000000
Functional        0.000000
Fireplaces        0.000000
KitchenQual       0.000000
KitchenAbvGr      0.000000
BedroomAbvGr      0.000000
HalfBath          0.000000
FullBath          0.000000
BsmtHalfBath      0.000000
TotRmsAbvGrd      0.000000
GarageCars        0.000000
GrLivArea         0.000000
GarageArea        0.000000
PavedDrive        0.000000
WoodDeckSF        0.000000
OpenPorchSF       0.000000
EnclosedPorch     0.000000
3SsnPorch         0.000000
S

In [91]:
(data.isna().sum() / len(data) * 100).sort_values(ascending=False).index

Index(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'MasVnrType', 'FireplaceQu',
       'LotFrontage', 'GarageYrBlt', 'GarageCond', 'GarageType',
       'GarageFinish', 'GarageQual', 'BsmtFinType2', 'BsmtExposure',
       'BsmtQual', 'BsmtCond', 'BsmtFinType1', 'MasVnrArea', 'Electrical',
       'Id', 'Functional', 'Fireplaces', 'KitchenQual', 'KitchenAbvGr',
       'BedroomAbvGr', 'HalfBath', 'FullBath', 'BsmtHalfBath', 'TotRmsAbvGrd',
       'GarageCars', 'GrLivArea', 'GarageArea', 'PavedDrive', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition',
       'BsmtFullBath', 'HeatingQC', 'LowQualFinSF', 'LandSlope', 'OverallQual',
       'HouseStyle', 'BldgType', 'Condition2', 'Condition1', 'Neighborhood',
       'LotConfig', 'YearBuilt', 'Utilities', 'LandContour', 'LotShape',
       'Street', 'LotArea', 'MSZoning', 'OverallCond', 'YearRemodAdd',
       '2ndFlrSF', 'BsmtFinSF2', '1stF

In [92]:
data = data.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'MasVnrType'], axis=1)
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,0,12,2008,WD,Normal,250000
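
The cell above drops the sparsest columns by name. The same step can be driven by a threshold instead, so the list does not have to be typed by hand. This is a sketch only: the helper `drop_sparse_columns`, the 0.5 cutoff, and the toy frame are assumptions, not values taken from this notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Return a copy of df without columns whose missing fraction exceeds threshold."""
    missing_fraction = df.isna().mean()  # fraction of NaN per column, 0.0 to 1.0
    sparse = missing_fraction[missing_fraction > threshold].index
    return df.drop(columns=sparse)


# Toy data: PoolQC is 75% missing, LotFrontage 25%, SalePrice 0%.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})
trimmed = drop_sparse_columns(toy, threshold=0.5)
print(trimmed.columns.tolist())
```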


In [93]:
(data.isna().sum() / len(data) * 100).sort_values(ascending=False)

FireplaceQu      47.260274
LotFrontage      17.739726
GarageType        5.547945
GarageYrBlt       5.547945
GarageFinish      5.547945
GarageQual        5.547945
GarageCond        5.547945
BsmtExposure      2.602740
BsmtFinType2      2.602740
BsmtQual          2.534247
BsmtCond          2.534247
BsmtFinType1      2.534247
MasVnrArea        0.547945
Electrical        0.068493
KitchenAbvGr      0.000000
BedroomAbvGr      0.000000
HalfBath          0.000000
FullBath          0.000000
BsmtHalfBath      0.000000
KitchenQual       0.000000
BsmtFullBath      0.000000
GrLivArea         0.000000
TotRmsAbvGrd      0.000000
Functional        0.000000
Id                0.000000
Fireplaces        0.000000
ScreenPorch       0.000000
SaleCondition     0.000000
SaleType          0.000000
YrSold            0.000000
MoSold            0.000000
MiscVal           0.000000
PoolArea          0.000000
3SsnPorch         0.000000
2ndFlrSF          0.000000
EnclosedPorch     0.000000
OpenPorchSF       0.000000
W

In [94]:
data = data.drop(["FireplaceQu", "Id"], axis=1)
data.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,0,12,2008,WD,Normal,250000


In [95]:
(data.isna().sum() / len(data) * 100).sort_values(ascending=False)

LotFrontage      17.739726
GarageYrBlt       5.547945
GarageCond        5.547945
GarageType        5.547945
GarageFinish      5.547945
GarageQual        5.547945
BsmtFinType2      2.602740
BsmtExposure      2.602740
BsmtQual          2.534247
BsmtFinType1      2.534247
BsmtCond          2.534247
MasVnrArea        0.547945
Electrical        0.068493
TotRmsAbvGrd      0.000000
KitchenQual       0.000000
KitchenAbvGr      0.000000
BedroomAbvGr      0.000000
HalfBath          0.000000
FullBath          0.000000
BsmtHalfBath      0.000000
BsmtFullBath      0.000000
GrLivArea         0.000000
Functional        0.000000
MSSubClass        0.000000
Fireplaces        0.000000
ScreenPorch       0.000000
SaleCondition     0.000000
SaleType          0.000000
YrSold            0.000000
MoSold            0.000000
MiscVal           0.000000
PoolArea          0.000000
3SsnPorch         0.000000
2ndFlrSF          0.000000
EnclosedPorch     0.000000
OpenPorchSF       0.000000
WoodDeckSF        0.000000
P

In [96]:
missing_cols = ['LotFrontage', 'GarageYrBlt', 'GarageCond', 'GarageType',
       'GarageFinish', 'GarageQual', 'BsmtFinType2', 'BsmtExposure',
       'BsmtQual', 'BsmtFinType1', 'BsmtCond', 'MasVnrArea', 'Electrical']
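
The list above is copied by hand from the previous output. It could also be derived from the frame itself, which avoids typos when the data changes. A small sketch under that assumption; `columns_with_missing` and the toy frame are hypothetical names for illustration:

```python
import numpy as np
import pandas as pd


def columns_with_missing(df: pd.DataFrame) -> list:
    """List the columns that contain at least one NaN, most-missing first."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False).index.tolist()


# Toy data: A has two NaNs, B has one, C has none.
toy = pd.DataFrame({
    "A": [np.nan, np.nan, 1.0, 2.0],
    "B": ["x", None, "y", "z"],
    "C": [1, 2, 3, 4],
})
missing = columns_with_missing(toy)
print(missing)
```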

In [97]:
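for col in missing_cols:
    print(col, data[col].dtype)
    if data[col].dtype == 'object':
        # Caution: mode() returns a Series, so this fillna aligns on the index
        # and leaves most NaNs unfilled; mode()[0] passes the scalar instead.
        data[col] = data[col].fillna(data[col].mode())
    else:
        data[col] = data[col].fillna(data[col].median())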

LotFrontage float64
GarageYrBlt float64
GarageCond object
GarageType object
GarageFinish object
GarageQual object
BsmtFinType2 object
BsmtExposure object
BsmtQual object
BsmtFinType1 object
BsmtCond object
MasVnrArea float64
Electrical object


In [98]:
data['GarageType'].unique()

array(['Attchd', 'Detchd', 'BuiltIn', 'CarPort', nan, 'Basment', '2Types'],
      dtype=object)

In [99]:
data['GarageType'].mode()

0    Attchd
Name: GarageType, dtype: object

In [100]:
data['GarageType'] = data['GarageType'].fillna(data['GarageType'].mode()[0])

In [101]:
data['GarageType'].unique()

array(['Attchd', 'Detchd', 'BuiltIn', 'CarPort', 'Basment', '2Types'],
      dtype=object)
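
The `mode()[0]` fix applied to GarageType above can be extended to every column with missing values. The following is a sketch rather than the notebook's own code: `fill_missing` and the toy frame are assumed names, and filling garage or basement columns with the mode is itself a modelling choice, since NaN there may mean the feature is absent.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()
    for col in columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() returns a Series; [0] selects the most frequent value.
            out[col] = out[col].fillna(out[col].mode()[0])
    return out


# Toy data with one categorical and one numeric column.
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Attchd", "Detchd", np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 70.0],
})
filled = fill_missing(toy, ["GarageType", "LotFrontage"])
print(filled)
```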

In [69]:
(data.isna().sum() / len(data) * 100).sort_values(ascending=False)

GarageType       5.547945
GarageCond       5.547945
GarageFinish     5.547945
GarageQual       5.547945
BsmtFinType2     2.602740
BsmtExposure     2.602740
BsmtFinType1     2.534247
BsmtCond         2.534247
BsmtQual         2.534247
Electrical       0.068493
EnclosedPorch    0.000000
OpenPorchSF      0.000000
2ndFlrSF         0.000000
LowQualFinSF     0.000000
GrLivArea        0.000000
BsmtFullBath     0.000000
BsmtHalfBath     0.000000
FullBath         0.000000
SaleCondition    0.000000
HalfBath         0.000000
SaleType         0.000000
YrSold           0.000000
MoSold           0.000000
MiscVal          0.000000
BedroomAbvGr     0.000000
KitchenAbvGr     0.000000
KitchenQual      0.000000
TotRmsAbvGrd     0.000000
Functional       0.000000
Fireplaces       0.000000
1stFlrSF         0.000000
GarageYrBlt      0.000000
PoolArea         0.000000
GarageCars       0.000000
GarageArea       0.000000
ScreenPorch      0.000000
3SsnPorch        0.000000
PavedDrive       0.000000
WoodDeckSF  

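# Which columns should we choose out of so many?
* Run a correlation analysis and keep only the columns that are highly correlated with the target (dependent) variable
* Run a preliminary analysis with a tree-based algorithm, extract the important variables, and re-analyze with those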
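
The first approach can be sketched with pandas alone. This is an illustration under stated assumptions, not the selection the notebook goes on to perform: `top_correlated_features`, the value of `k`, and the synthetic data are all made up here, and Pearson correlation only captures linear relationships between numeric columns.

```python
import numpy as np
import pandas as pd


def top_correlated_features(df: pd.DataFrame, target: str, k: int = 10) -> pd.Series:
    """Rank numeric columns by absolute Pearson correlation with the target."""
    numeric = df.select_dtypes(include="number")  # non-numeric columns are skipped
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k)


# Synthetic data: SalePrice depends on GrLivArea; MoSold is unrelated noise.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "MoSold": rng.integers(1, 13, n),
    "Street": ["Pave"] * n,
    "SalePrice": 100 * area + rng.normal(0, 5000, n),
})
ranking = top_correlated_features(toy, "SalePrice", k=2)
print(ranking)
```

Categorical columns such as Street drop out of this ranking, so they would need encoding first or a different method, such as the tree-based approach in the second bullet.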

In [28]:
data2 = data.copy()