Random Forest: Used-Car Price Prediction

Goal: predict used-car prices from a dataset of used-car sales records

via 최효원's Jupyter Notebook

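The dependent variable (target) is the selling price. The independent variables
include production year, distance driven, transmission, mileage (fuel efficiency), and engine displacement.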

This notebook applies a random forest to the task. Among tree-based ensemble
models it is the most widely used, and it mitigates the overfitting that
single decision trees are prone to.
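
As a rough illustration of the overfitting point above, the following sketch compares a single decision tree with a random forest on held-out data. The synthetic dataset (`make_friedman1`), the sample sizes, and all variable names are assumptions made for this illustration; none of it uses the car-price data.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data (an assumption for illustration only)
X_demo, y_demo = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# A fully grown single tree tends to memorize the training set
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# A forest averages many trees, each fit on a bootstrap sample
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_r2 = tree.score(X_te, y_te)      # R^2 on held-out data
forest_r2 = forest.score(X_te, y_te)
print('tree R^2:', round(tree_r2, 3), 'forest R^2:', round(forest_r2, 3))
```

On data like this the forest usually generalizes better than the single tree, which is the behavior the notebook relies on.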

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#import libraries

df = pd.read_csv('https://media.githubusercontent.com/media/musthave-ML10/data_source/main/car.csv')
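#data source: https://media.githubusercontent.com/media/musthave-ML10
df.head()
#columns: car name, production year, selling price (target), km driven, fuel, seller type, transmission, ownership history, mileage, engine displacement, max power, torque, seats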

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,190Nm@ 2000rpm,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,250Nm@ 1500-2500rpm,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,22.4 kgm at 1750-2750rpm,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,"11.5@ 4,500(kgm@ rpm)",5.0


In [3]:
df.info()
#total : 8128 rows / 13 cols

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           8128 non-null   object 
 1   year           8128 non-null   int64  
 2   selling_price  8128 non-null   int64  
 3   km_driven      8128 non-null   int64  
 4   fuel           8128 non-null   object 
 5   seller_type    8128 non-null   object 
 6   transmission   8128 non-null   object 
 7   owner          8128 non-null   object 
 8   mileage        7907 non-null   object 
 9   engine         7907 non-null   object 
 10  max_power      7913 non-null   object 
 11  torque         7906 non-null   object 
 12  seats          7907 non-null   float64
dtypes: float64(1), int64(3), object(9)
memory usage: 825.6+ KB


*Several columns contain missing values.
*Some data that should be numeric is stored as object dtype (e.g. engine).

In [5]:
round(df.describe().T,2)
#summary statistics for the numeric columns

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year,8128.0,2013.8,4.04,1983.0,2011.0,2015.0,2017.0,2020.0
selling_price,8128.0,638271.81,806253.4,29999.0,254999.0,450000.0,675000.0,10000000.0
km_driven,8128.0,69819.51,56550.55,1.0,35000.0,60000.0,98000.0,2360457.0
seats,7907.0,5.42,0.96,2.0,5.0,5.0,5.0,14.0


*selling_price: the max is far above the rest of the distribution (suspected outlier).
*km_driven: both the min and the max look like outliers.
*Because a tree-based model will be used, no separate outlier treatment is applied.
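
One reason tree-based models tolerate extreme feature values is that a split depends only on the ordering of the values, not on their magnitude. The sketch below illustrates this on made-up data: a tree fit on a raw feature and a tree fit on its log transform produce the same training predictions. The `km` and `price` arrays are hypothetical, and the argument covers feature outliers only; extreme target values still influence leaf averages and RMSE.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Hypothetical km_driven-like feature: distinct values plus one extreme outlier
km = np.unique(rng.integers(1_000, 100_000, size=200)).astype(float)
km = np.append(km, 2_360_457.0)
# Hypothetical price that falls with distance driven, plus noise
price = 800_000 - 5 * np.minimum(km, 100_000) + rng.normal(0, 20_000, size=km.size)

raw = km.reshape(-1, 1)
logged = np.log1p(km).reshape(-1, 1)   # monotonic transform that shrinks the outlier

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(raw, price)
tree_log = DecisionTreeRegressor(max_depth=4, random_state=0).fit(logged, price)

# Same ordering of values -> same partitions -> same leaf means
same = np.allclose(tree_raw.predict(raw), tree_log.predict(logged))
print('identical predictions:', same)
```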

*Preprocessing the engine variable

In [6]:
df.engine.str.split()

0       [1248, CC]
1       [1498, CC]
2       [1497, CC]
3       [1396, CC]
4       [1298, CC]
           ...    
8123    [1197, CC]
8124    [1493, CC]
8125    [1248, CC]
8126    [1396, CC]
8127    [1396, CC]
Name: engine, Length: 8128, dtype: object

In [7]:
df['engine'].str.split(expand = True)
#split each string on whitespace and return the parts as separate columns

Unnamed: 0,0,1
0,1248,CC
1,1498,CC
2,1497,CC
3,1396,CC
4,1298,CC
...,...,...
8123,1197,CC
8124,1493,CC
8125,1248,CC
8126,1396,CC


In [8]:
df[['engine', 'engine_unit']] = df['engine'].str.split(expand = True)
#store the split parts: engine keeps the number, engine_unit holds the unit

In [9]:
df.info()
#check the columns again

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           8128 non-null   object 
 1   year           8128 non-null   int64  
 2   selling_price  8128 non-null   int64  
 3   km_driven      8128 non-null   int64  
 4   fuel           8128 non-null   object 
 5   seller_type    8128 non-null   object 
 6   transmission   8128 non-null   object 
 7   owner          8128 non-null   object 
 8   mileage        7907 non-null   object 
 9   engine         7907 non-null   object 
 10  max_power      7913 non-null   object 
 11  torque         7906 non-null   object 
 12  seats          7907 non-null   float64
 13  engine_unit    7907 non-null   object 
dtypes: float64(1), int64(3), object(10)
memory usage: 889.1+ KB


In [10]:
df['engine'] = df['engine'].astype('float32')
#convert to a numeric dtype
df['engine'].head()

0    1248.0
1    1498.0
2    1497.0
3    1396.0
4    1298.0
Name: engine, dtype: float32

In [11]:
df['engine_unit'].unique()
#check the unique values of the engine unit

array(['CC', nan], dtype=object)

In [12]:
df.drop('engine_unit', axis = 1, inplace = True)
#only one unit exists, so the column is dropped

*Preprocessing the max_power variable

In [13]:
df[['max_power', 'max_power_unit']] = df['max_power'].str.split(expand = True)
df['max_power'].head()

0        74
1    103.52
2        78
3        90
4      88.2
Name: max_power, dtype: object

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            8128 non-null   object 
 1   year            8128 non-null   int64  
 2   selling_price   8128 non-null   int64  
 3   km_driven       8128 non-null   int64  
 4   fuel            8128 non-null   object 
 5   seller_type     8128 non-null   object 
 6   transmission    8128 non-null   object 
 7   owner           8128 non-null   object 
 8   mileage         7907 non-null   object 
 9   engine          7907 non-null   float32
 10  max_power       7913 non-null   object 
 11  torque          7906 non-null   object 
 12  seats           7907 non-null   float64
 13  max_power_unit  7906 non-null   object 
dtypes: float32(1), float64(1), int64(3), object(9)
memory usage: 857.4+ KB


In [15]:
df['max_power'] = df['max_power'].astype('float32')
#the string 'bhp' cannot be converted to float (raises ValueError)

ValueError: could not convert string to float: 'bhp'

In [16]:
df[df.max_power == 'bhp']
#find the rows whose max_power value is exactly the string 'bhp'

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats,max_power_unit
4933,Maruti Omni CNG,2000,80000,100000,CNG,Individual,Manual,Second Owner,10.9 km/kg,796.0,bhp,,8.0,


In [17]:
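#helper function
def isfloat(value):
    try:
        #return the value as a float when it can be parsed
        return float(value)
    except ValueError:
        #otherwise (e.g. 'bhp') return NaN
        #np.nan is used because the np.NaN alias was removed in NumPy 2.0
        return np.nan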

In [18]:
df['max_power'] = df['max_power'].apply(isfloat)
#use the function to convert max_power to a numeric column

In [19]:
df['max_power_unit'].unique()
#check the unique values of max_power_unit

array(['bhp', nan, None], dtype=object)

In [20]:
df.drop('max_power_unit', axis = 1, inplace = True)
#no unit other than 'bhp' appears, so the column is dropped
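
The same "number or NaN" conversion can also be sketched with `pd.to_numeric(..., errors='coerce')`, which turns unparseable strings into NaN without a helper function. The toy values below mirror the formats seen above; the series and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy values mirroring the formats seen above, including the unit-only entry 'bhp'
raw_power = pd.Series(['74 bhp', '103.52 bhp', 'bhp', None])

# Take the first whitespace-separated token of each string
first_token = raw_power.str.split().str[0]
# Coerce anything that is not a number (here 'bhp' and the missing value) to NaN
power = pd.to_numeric(first_token, errors='coerce')
print(power.tolist())
```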

*Preprocessing the mileage variable

In [21]:
df[['mileage', 'mileage_unit']] = df['mileage'].str.split(expand = True)
#same preprocessing approach as above

In [22]:
df['mileage'] = df['mileage'].astype('float32')

In [23]:
df['mileage_unit'].unique()

array(['kmpl', 'km/kg', nan], dtype=object)

*kmpl: km/l, kilometers per liter
*km/kg: kilometers per kilogram
*Petrol/Diesel: mileage is measured per liter
*LPG/CNG: mileage is measured per kilogram
*The preprocessing therefore depends on fuel (the fuel type).

In [24]:
df['fuel'].unique()
#check unique values

array(['Diesel', 'Petrol', 'LPG', 'CNG'], dtype=object)

*Current fuel prices, looked up with a search engine (Google):
*Diesel: $73.56 per liter
*Petrol: $80.43 per liter
*LPG: $40.85 per liter
*CNG: $44.23 per liter

In [26]:
#dividing mileage by the price of each fuel gives the distance driven per dollar; define a function for it
def mile(x):
    if x['fuel'] == 'Petrol':
        return x['mileage'] / 80.43
    elif x['fuel'] == 'Diesel':
        return x['mileage'] / 73.56
    elif x['fuel'] == 'LPG':
        return x['mileage'] / 40.85
    else:
        return x['mileage'] / 44.23

In [27]:
df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats,mileage_unit
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4,1248.0,74.0,190Nm@ 2000rpm,5.0,kmpl
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.139999,1498.0,103.52,250Nm@ 1500-2500rpm,5.0,kmpl
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.700001,1497.0,78.0,"12.7@ 2,700(kgm@ rpm)",5.0,kmpl
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0,1396.0,90.0,22.4 kgm at 1750-2750rpm,5.0,kmpl
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1,1298.0,88.2,"11.5@ 4,500(kgm@ rpm)",5.0,kmpl


In [28]:
df['mileage'] = df.apply(mile, axis = 1)
#apply the function to transform the mileage column

In [29]:
df.drop('mileage_unit', axis = 1, inplace = True)
#drop the unit column
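
The row-wise `mile` function can also be expressed column-wise by mapping each fuel to its price. This is a minimal sketch on toy rows; the `FUEL_PRICE` dict and the `mileage_per_dollar` column name are hypothetical, and the prices are the ones listed in the notes above.

```python
import pandas as pd

# Prices copied from the notes above; the dict name is hypothetical
FUEL_PRICE = {'Petrol': 80.43, 'Diesel': 73.56, 'LPG': 40.85, 'CNG': 44.23}

toy = pd.DataFrame({
    'fuel': ['Diesel', 'Petrol', 'CNG'],
    'mileage': [23.4, 17.7, 10.9],
})

# Map each row's fuel to its price, then divide the two columns element-wise
toy['mileage_per_dollar'] = toy['mileage'] / toy['fuel'].map(FUEL_PRICE)
print(toy)
```

One behavioral difference from `mile`: a fuel type missing from the dict yields NaN here, whereas the `else` branch of `mile` silently applies the CNG price.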

*Preprocessing the torque variable

In [30]:
df['torque'].head()

0              190Nm@ 2000rpm
1         250Nm@ 1500-2500rpm
2       12.7@ 2,700(kgm@ rpm)
3    22.4 kgm at 1750-2750rpm
4       11.5@ 4,500(kgm@ rpm)
Name: torque, dtype: object

In [31]:
df['torque'] = df['torque'].str.upper()
#the values mix upper and lower case, so the whole torque column is converted to upper case

In [32]:
#handle the NM and KGM cases first
def torque_unit(x):
    if 'NM' in str(x):
        #x contains NM
        return 'Nm'
    elif 'KGM' in str(x):
        #x contains KGM
        return 'kgm'
    #any other value falls through and returns None

In [33]:
df['torque_unit'] = df['torque'].apply(torque_unit)
#store the result of the torque_unit function in a separate column

In [34]:
df['torque_unit'].unique()
#check unique values

array(['Nm', 'kgm', None], dtype=object)

In [36]:
df[df['torque_unit'].isnull()]['torque'].unique()
#unique torque values in the rows where torque_unit is missing

array([nan, '250@ 1250-5000RPM', '510@ 1600-2400', '110(11.2)@ 4800',
       '210 / 1900'], dtype=object)

*In this data, Nm values are typically in the hundreds and kgm values in the tens.
*Judging by magnitude, the rows with a missing unit were all assumed to be in Nm.

In [37]:
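df['torque_unit'] = df['torque_unit'].fillna('Nm')
#replace missing units with Nm (assignment avoids the in-place call on a column selection, which recent pandas versions warn about)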

In [38]:
string_example = '12.7@ 2,700(KGM@ RPM)'

In [39]:
string_example[:4]

'12.7'

In [40]:
for i, j in enumerate(string_example):
    print('index', i, ':', j)

index 0 : 1
index 1 : 2
index 2 : .
index 3 : 7
index 4 : @
index 5 :  
index 6 : 2
index 7 : ,
index 8 : 7
index 9 : 0
index 10 : 0
index 11 : (
index 12 : K
index 13 : G
index 14 : M
index 15 : @
index 16 :  
index 17 : R
index 18 : P
index 19 : M
index 20 : )


In [41]:
for i,j in enumerate(string_example):
    if j not in '0123456789.':
        cut = i
        break

In [43]:
df.torque.value_counts()

190NM@ 2000RPM                 530
200NM@ 1750RPM                 445
90NM@ 3500RPM                  407
113NM@ 4200RPM                 223
114NM@ 4000RPM                 171
                              ... 
22.9@ 1,950-4,700(KGM@ RPM)      1
99.1NM@ 4500RPM                  1
159.8NM@ 1500-2750RPM            1
190 NM AT 2000RPM                1
250NM@ 1600-2000RPM              1
Name: torque, Length: 428, dtype: int64

In [44]:
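def split_num(x):
    x = str(x)
    #convert to a string
    cut = len(x)
    #default: keep the whole string when every character is numeric
    for i,j in enumerate(x):
        #iterate with indices
        if j not in '0123456789.':
            #j is neither a digit nor a decimal point
            cut = i
            #remember the index of the first such character
            break
            #stop iterating
    return x[:cut]
#return everything before index cut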

In [45]:
df['torque'] = df['torque'].apply(split_num)

In [46]:
df.torque
#check the torque column

0         190
1         250
2        12.7
3        22.4
4        11.5
        ...  
8123    113.7
8124       24
8125      190
8126      140
8127      140
Name: torque, Length: 8128, dtype: object

In [48]:
df['torque'] = df['torque'].replace('', np.nan)
#replace empty strings '' with missing values
df['torque'] = df['torque'].astype('float32')
#convert the dtype

In [49]:
df['torque'].head()

0    190.0
1    250.0
2     12.7
3     22.4
4     11.5
Name: torque, dtype: float32

*kgm * 9.8066 = Nm
*All values were converted to Nm.

In [50]:
def torque_trans(x):
    if x['torque_unit'] == 'kgm':
        #the unit is kgm: convert with kgm * 9.8066 = Nm
        return x['torque']*9.8066
    else:
        #already in Nm: return the value unchanged
        return x['torque']

In [52]:
df['torque'] = df.apply(torque_trans, axis = 1)
df.drop('torque_unit', axis = 1, inplace = True)
#drop the torque_unit column
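
The torque steps above (extract the leading number, detect the unit, convert kgm to Nm) can be sketched in a vectorized form with a regular expression. This is an alternative formulation, not the notebook's method; the sample strings are taken from the outputs above, and the names `raw_torque`, `KGM_TO_NM`, and `torque_nm` are hypothetical.

```python
import numpy as np
import pandas as pd

KGM_TO_NM = 9.8066  # conversion factor used in the notes above

raw_torque = pd.Series([
    '190Nm@ 2000rpm',
    '12.7@ 2,700(kgm@ rpm)',
    '22.4 kgm at 1750-2750rpm',
    '250@ 1250-5000RPM',   # no unit given: treated as Nm, as assumed above
    np.nan,
])

# Leading number (integer or decimal) at the start of the string
value = raw_torque.str.extract(r'^\s*(\d+(?:\.\d+)?)')[0].astype('float64')
# Rows that mention kgm anywhere; missing strings count as "not kgm"
is_kgm = raw_torque.str.contains('kgm', case=False, na=False)

# Keep Nm values as they are, convert kgm values to Nm
torque_nm = value.where(~is_kgm, value * KGM_TO_NM)
print(torque_nm.round(2).tolist())
```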

In [53]:
df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,0.318108,1248.0,74.0,190.0,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,0.287384,1498.0,103.52,250.0,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,0.220067,1497.0,78.0,124.543818,5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,0.31267,1396.0,90.0,219.667836,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,0.200174,1298.0,88.2,112.7759,5.0


*Preprocessing the name variable
*Cars with identical specs can still sell for more under a premium brand, so the model name is discarded but the brand name is kept.

In [54]:
df['name'] = df['name'].str.split(expand = True)[0]
#split on whitespace and keep only the first part as the new name value
df['name'].unique()
#check unique values

array(['Maruti', 'Skoda', 'Honda', 'Hyundai', 'Toyota', 'Ford', 'Renault',
       'Mahindra', 'Tata', 'Chevrolet', 'Fiat', 'Datsun', 'Jeep',
       'Mercedes-Benz', 'Mitsubishi', 'Audi', 'Volkswagen', 'BMW',
       'Nissan', 'Lexus', 'Jaguar', 'Land', 'MG', 'Volvo', 'Daewoo',
       'Kia', 'Force', 'Ambassador', 'Ashok', 'Isuzu', 'Opel', 'Peugeot'],
      dtype=object)

*Because the split is on whitespace, the brand 'Land Rover' was reduced to 'Land'.

In [55]:
df['name'] = df['name'].replace('Land', 'Land Rover')
#replace 'Land' with the full brand name
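
Another way to avoid truncating two-word brands is to check for them before splitting. The sketch below assumes a hand-maintained tuple of such brands; the tuple contents, the helper name `brand_of`, and the sample car names are all hypothetical and would need to be checked against the real data.

```python
import pandas as pd

# Hypothetical list of brands whose names span two words
TWO_WORD_BRANDS = ('Land Rover', 'Ashok Leyland')

def brand_of(name: str) -> str:
    """Return the brand part of a full car name."""
    for brand in TWO_WORD_BRANDS:
        # Match the brand only as a whole leading phrase
        if name == brand or name.startswith(brand + ' '):
            return brand
    # Fall back to the first whitespace-separated word
    return name.split()[0]

names = pd.Series([
    'Maruti Swift Dzire VDI',
    'Land Rover Discovery Sport',   # illustrative name
    'Ashok Leyland Stile LE',       # illustrative name
])
brands = names.apply(brand_of)
print(brands.tolist())
```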

*Missing-value handling and dummy encoding

In [57]:
df.isnull().mean()

name             0.000000
year             0.000000
selling_price    0.000000
km_driven        0.000000
fuel             0.000000
seller_type      0.000000
transmission     0.000000
owner            0.000000
mileage          0.027190
engine           0.027190
max_power        0.026575
torque           0.027313
seats            0.027190
dtype: float64

*From mileage downward, each column has roughly 2-3% missing values.
*Replacing spec-related values with the mean risks adding noise, so the affected rows were dropped instead.

In [58]:
df.dropna(inplace = True)
#drop the rows with missing values
len(df)
#check the number of remaining rows

7906

In [59]:
df = pd.get_dummies(df, columns = ['name', 'fuel', 'seller_type', 'transmission', 'owner'],
                   drop_first = True)
#dummy-encode the object (categorical) columns
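
To make the effect of `drop_first = True` concrete, here is a minimal sketch on a toy `fuel` column. The toy frame is an assumption for illustration, and `dtype=int` is passed only so the result shows 0/1 values like the output in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({'fuel': ['Diesel', 'Petrol', 'LPG', 'CNG']})

# drop_first=True drops the first category in sorted order ('CNG' here),
# so a row of all zeros stands for that dropped category
encoded = pd.get_dummies(toy, columns=['fuel'], drop_first=True, dtype=int)
print(encoded)
```

This matches the encoded frame in the notebook, where `fuel_Diesel`, `fuel_LPG`, and `fuel_Petrol` exist but there is no `fuel_CNG` column.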

In [60]:
df.head()

Unnamed: 0,year,selling_price,km_driven,mileage,engine,max_power,torque,seats,name_Ashok,name_Audi,...,fuel_Diesel,fuel_LPG,fuel_Petrol,seller_type_Individual,seller_type_Trustmark Dealer,transmission_Manual,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,2014,450000,145500,0.318108,1248.0,74.0,190.0,5.0,0,0,...,1,0,0,1,0,1,0,0,0,0
1,2014,370000,120000,0.287384,1498.0,103.52,250.0,5.0,0,0,...,1,0,0,1,0,1,0,1,0,0
2,2006,158000,140000,0.220067,1497.0,78.0,124.543818,5.0,0,0,...,0,0,1,1,0,1,0,0,0,1
3,2010,225000,127000,0.31267,1396.0,90.0,219.667836,5.0,0,0,...,1,0,0,1,0,1,0,0,0,0
4,2007,130000,120000,0.200174,1298.0,88.2,112.7759,5.0,0,0,...,0,0,1,1,0,1,0,0,0,0


In [61]:
from sklearn.model_selection import train_test_split
#split into training and test sets
xtr,xt,ytr,yt = train_test_split(df.drop('selling_price', axis = 1), df['selling_price'],
                                test_size = 0.2, random_state = 100)

In [62]:
from sklearn.ensemble import RandomForestRegressor
#import the random forest regression model
rf = RandomForestRegressor(random_state = 100)
rf.fit(xtr, ytr)
#fit on the training set
train_pred = rf.predict(xtr)
#predict on the training set
test_pred = rf.predict(xt)
#predict on the test set

In [63]:
#evaluate the model with RMSE
from sklearn.metrics import mean_squared_error
print('train_rmse: ', mean_squared_error(ytr, train_pred)**0.5,
      'test_rmse: ', mean_squared_error(yt, test_pred)**0.5)

train_rmse:  53531.41548125947 test_rmse:  131855.18391308116
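
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it by hand on three made-up prices and checks that it agrees with the `mean_squared_error(...)**0.5` form used above; the arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([450000.0, 370000.0, 158000.0])   # toy prices
y_pred = np.array([430000.0, 390000.0, 158000.0])   # hypothetical predictions

# RMSE = sqrt(mean((y - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
# Same quantity via scikit-learn, as in the cell above
rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5
print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, RMSE is in the same unit as the price and weights large misses more heavily than small ones.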


In [64]:
#run k-fold cross-validation for a more stable estimate of predictive performance
from sklearn.model_selection import KFold
#import
df

Unnamed: 0,year,selling_price,km_driven,mileage,engine,max_power,torque,seats,name_Ashok,name_Audi,...,fuel_Diesel,fuel_LPG,fuel_Petrol,seller_type_Individual,seller_type_Trustmark Dealer,transmission_Manual,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,2014,450000,145500,0.318108,1248.0,74.00,190.000000,5.0,0,0,...,1,0,0,1,0,1,0,0,0,0
1,2014,370000,120000,0.287384,1498.0,103.52,250.000000,5.0,0,0,...,1,0,0,1,0,1,0,1,0,0
2,2006,158000,140000,0.220067,1497.0,78.00,124.543818,5.0,0,0,...,0,0,1,1,0,1,0,0,0,1
3,2010,225000,127000,0.312670,1396.0,90.00,219.667836,5.0,0,0,...,1,0,0,1,0,1,0,0,0,0
4,2007,130000,120000,0.200174,1298.0,88.20,112.775900,5.0,0,0,...,0,0,1,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,2013,320000,110000,0.230014,1197.0,82.85,113.699997,5.0,0,0,...,0,0,1,1,0,1,0,0,0,0
8124,2007,135000,119000,0.228385,1493.0,110.00,235.358400,5.0,0,0,...,1,0,0,1,0,1,1,0,0,0
8125,2009,382000,120000,0.262371,1248.0,73.90,190.000000,5.0,0,0,...,1,0,0,1,0,1,0,0,0,0
8126,2013,290000,25000,0.320419,1396.0,70.00,140.000000,5.0,0,0,...,1,0,0,1,0,1,0,0,0,0


In [65]:
#there are only 7906 rows, but the index still runs from 0 to 8127
df.reset_index(drop = True, inplace = True)
#reset the index to 0..7905

In [66]:
df

Unnamed: 0,year,selling_price,km_driven,mileage,engine,max_power,torque,seats,name_Ashok,name_Audi,...,fuel_Diesel,fuel_LPG,fuel_Petrol,seller_type_Individual,seller_type_Trustmark Dealer,transmission_Manual,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,2014,450000,145500,0.318108,1248.0,74.00,190.000000,5.0,0,0,...,1,0,0,1,0,1,0,0,0,0
1,2014,370000,120000,0.287384,1498.0,103.52,250.000000,5.0,0,0,...,1,0,0,1,0,1,0,1,0,0
2,2006,158000,140000,0.220067,1497.0,78.00,124.543818,5.0,0,0,...,0,0,1,1,0,1,0,0,0,1
3,2010,225000,127000,0.312670,1396.0,90.00,219.667836,5.0,0,0,...,1,0,0,1,0,1,0,0,0,0
4,2007,130000,120000,0.200174,1298.0,88.20,112.775900,5.0,0,0,...,0,0,1,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7901,2013,320000,110000,0.230014,1197.0,82.85,113.699997,5.0,0,0,...,0,0,1,1,0,1,0,0,0,0
7902,2007,135000,119000,0.228385,1493.0,110.00,235.358400,5.0,0,0,...,1,0,0,1,0,1,1,0,0,0
7903,2009,382000,120000,0.262371,1248.0,73.90,190.000000,5.0,0,0,...,1,0,0,1,0,1,0,0,0,0
7904,2013,290000,25000,0.320419,1396.0,70.00,140.000000,5.0,0,0,...,1,0,0,1,0,1,0,0,0,0


In [68]:
kf = KFold(n_splits = 5)
#create the k-fold object
X = df.drop('selling_price', axis = 1)
#features: everything except the target
y = df['selling_price']
#target

for i,j in kf.split(X):
    #KFold.split yields one pair of index arrays per fold
    print(i, j)
    #i : indices used as the training set
    #j : indices used as the test set

[1582 1583 1584 ... 7903 7904 7905] [   0    1    2 ... 1579 1580 1581]
[   0    1    2 ... 7903 7904 7905] [1582 1583 1584 ... 3160 3161 3162]
[   0    1    2 ... 7903 7904 7905] [3163 3164 3165 ... 4741 4742 4743]
[   0    1    2 ... 7903 7904 7905] [4744 4745 4746 ... 6322 6323 6324]
[   0    1    2 ... 6322 6323 6324] [6325 6326 6327 ... 7903 7904 7905]
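
As the output above shows, `KFold` without shuffling assigns contiguous blocks of rows to each test fold. The sketch below reproduces that on ten row positions and contrasts it with `shuffle=True`; the array `rows` is a stand-in for illustration, and whether shuffling is appropriate for the car data is a judgment call this sketch does not settle.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)  # stand-in for 10 row positions

# Default KFold: contiguous test blocks in row order
ordered = [test.tolist() for _, test in KFold(n_splits=5).split(rows)]
print(ordered)

# shuffle=True draws the folds at random; random_state makes them repeatable
shuffled_kf = KFold(n_splits=5, shuffle=True, random_state=100)
shuffled = [test.tolist() for _, test in shuffled_kf.split(rows)]
print(shuffled)
```

In both cases every row lands in exactly one test fold; only the grouping differs.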


In [69]:
for train_index, test_index in kf.split(X):
    xtr, xt = X.loc[train_index], X.loc[test_index]
    ytr, yt = y.loc[train_index], y.loc[test_index]

In [70]:
train_rmse_total = []
test_rmse_total = []

In [78]:
for train_index, test_index in kf.split(X):
    #iterate over the folds
    xtr, xt = X.iloc[train_index,:], X.iloc[test_index,:]
    #x_train, x_test for this fold
    ytr, yt = y.iloc[train_index], y.iloc[test_index]
    #y_train, y_test for this fold (positional indexing, consistent with X)
    
    rf = RandomForestRegressor(n_estimators = 300, max_depth = 50, min_samples_split = 5,
                               min_samples_leaf = 1, n_jobs = -1,random_state = 100)
    #set the hyperparameters
    rf.fit(xtr, ytr)
    train_pred = rf.predict(xtr)
    test_pred = rf.predict(xt)
    
    train_rmse = mean_squared_error(ytr, train_pred)**0.5
    #training-set RMSE
    test_rmse = mean_squared_error(yt, test_pred)**0.5
    #test-set RMSE
    
    train_rmse_total.append(train_rmse)
    test_rmse_total.append(test_rmse)

In [81]:
#final RMSE: the mean over the 5 folds (dividing by 5 assumes each list holds exactly 5 entries)
print('train_rmse: ', sum(train_rmse_total)/5,
      'test_rmse: ', sum(test_rmse_total)/5)

train_rmse:  256842.3358232057 test_rmse:  569554.0373132533
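
The averages recorded above are several times larger than the single-split RMSE reported earlier. One possible explanation, offered here as a guess rather than a confirmed cause, is that the loop cell was run more than once (the execution counter jumps from 70 to 78) while the result lists kept growing and the sum was still divided by a fixed 5. A sketch that sidesteps this uses `cross_validate`, which returns exactly one score per fold. The synthetic data and the reduced `n_estimators` are assumptions to keep the example small; it does not reproduce the notebook's numbers.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for X and y (an assumption; not the car data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv = cross_validate(
    RandomForestRegressor(n_estimators=30, random_state=100),
    X_demo, y_demo,
    cv=KFold(n_splits=5),
    scoring='neg_root_mean_squared_error',  # scikit-learn reports errors as negative scores
    return_train_score=True,
)

# One score per fold, so the mean does not depend on how often a cell was run
train_rmse = -cv['train_score'].mean()
test_rmse = -cv['test_score'].mean()
print('train_rmse:', train_rmse, 'test_rmse:', test_rmse)
```

With the hand-written loop, the equivalent safeguard is to create the two lists in the same cell as the loop and to average with `np.mean` instead of dividing by a constant.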
