## Building an ML Pipeline with Pipeline and Optimization

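- Zillow dataset
  - Zillow's Zestimate home valuation has shaken up the U.S. real estate industry since it was first released 11 years ago. The Zestimate was created to give consumers as much information as possible about homes and the housing market, and it marked the first time consumers had free access to this type of home value information. It is an estimated home value based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. By continually improving the median margin of error (from 14% at the outset to 5% today), Zillow has since become one of the largest and most trusted marketplaces for real estate information in the U.S. and a leading example of the impact of machine learning.
  - The Zillow Prize, a competition with a one million dollar grand prize, challenges the data science community to push the accuracy of the Zestimate even further. The winning algorithms stand to affect the home values of 110 million homes across the U.S.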

![](https://storage.googleapis.com/kaggle-competitions/kaggle/6649/media/_zillow_image_2.jpg)

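- Train/test split
  - A full list of real estate properties in three counties is provided for 2016.
    - The train data has all the transactions before October 15, 2016, plus some of the transactions after October 15, 2016.
    - The test data in the public leaderboard has the rest of the transactions between October 15 and December 31, 2016.
    - The rest of the test data, which is used to calculate the private leaderboard, is all the properties from October 15, 2017 to December 15, 2017. This period is called the "sales tracking period", and no submissions are accepted during it.
    - You need to predict 6 time points for every property.
      - Not all properties are sold in each time period. If a property was not sold in a certain time period, that particular row is ignored when the score is calculated.
      - If a property is sold multiple times within 31 days, the first reasonable value is taken as the ground truth. "Reasonable" means that if the data seems wrong, the transaction with the more sensible value is used instead.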
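
The scoring rule above can be sketched in a few lines. This is a minimal illustration, assuming the competition's target is `logerror = log(Zestimate) - log(SalePrice)` and the metric is mean absolute error, which matches the `logerror` column and the MAE used later in this notebook. All numbers are made up, and a missing sale price stands in for a property that did not sell in the period.

```python
import numpy as np

# Hypothetical values, for illustration only.
zestimate = np.array([310_000.0, 505_000.0, 198_000.0])
sale_price = np.array([300_000.0, np.nan, 200_000.0])  # NaN = not sold in this period

# Target used in the competition (assumed): log(Zestimate) - log(SalePrice).
logerror = np.log(zestimate) - np.log(sale_price)

# Rows without a sale are ignored when scoring.
sold = ~np.isnan(logerror)

# Naive baseline: always predict a logerror of 0.
predicted = np.zeros_like(logerror)
mae = np.mean(np.abs(logerror[sold] - predicted[sold]))

print(logerror.round(4), round(float(mae), 4))
```

A positive `logerror` means the Zestimate was above the sale price, and a negative one means it was below.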

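- File descriptions
  - properties_2016.csv - all the properties with their home features for 2016
  - properties_2017.csv - all the properties with their home features for 2017
  - train_2016.csv - the training set with transactions from January 1, 2016 to December 31, 2016
  - train_2017.csv - the training set with transactions from January 1, 2017 to September 15, 2017
  - sample_submission.csv - a sample submission file in the correct format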

In [1]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import os
import time

import warnings
warnings.filterwarnings("ignore")

In [3]:
train = pd.read_csv("zillow_train_2016_v2.csv")
properties = pd.read_csv('zillow_properties_2016.csv')

In [4]:
train

Unnamed: 0,parcelid,logerror,transactiondate
0,11016594,0.0276,2016-01-01
1,14366692,-0.1684,2016-01-01
2,12098116,-0.0040,2016-01-01
3,12643413,0.0218,2016-01-02
4,14432541,-0.0050,2016-01-02
...,...,...,...
90270,10774160,-0.0356,2016-12-30
90271,12046695,0.0070,2016-12-30
90272,12995401,-0.2679,2016-12-30
90273,11402105,0.0602,2016-12-30


In [5]:
properties

Unnamed: 0,parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,...,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,10754147,,,,0.0,0.0,,,,,...,,,,9.0,2015.0,9.0,,,,
1,10759547,,,,0.0,0.0,,,,,...,,,,27516.0,2015.0,27516.0,,,,
2,10843547,,,,0.0,0.0,,,,,...,,,650756.0,1413387.0,2015.0,762631.0,20800.37,,,
3,10859147,,,,0.0,0.0,3.0,7.0,,,...,1.0,,571346.0,1156834.0,2015.0,585488.0,14557.57,,,
4,10879947,,,,0.0,0.0,4.0,,,,...,,,193796.0,433491.0,2015.0,239695.0,5725.17,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2985212,168176230,,,,,,,,,,...,,,,,,,,,,
2985213,14273630,,,,,,,,,,...,,,,,,,,,,
2985214,168040630,,,,,,,,,,...,,,,,,,,,,
2985215,168040830,,,,,,,,,,...,,,,,,,,,,


In [6]:
properties[['propertyzoningdesc', 'propertycountylandusecode', 'fireplacecnt', 'fireplaceflag']]
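# The first two columns describe the location, which other features already cover well, and both are in a format that is hard to process, so they are dropped
# The two fireplace columns have many missing values and are assumed to have little effect on home value, so they are dropped as well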

Unnamed: 0,propertyzoningdesc,propertycountylandusecode,fireplacecnt,fireplaceflag
0,,010D,,
1,LCA11*,0109,,
2,LAC2,1200,,
3,LAC2,1200,,
4,LAM1,1210,,
...,...,...,...,...
2985212,,,,
2985213,,,,
2985214,,,,
2985215,,,,
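
The claim that a column has "many missing values" can be checked before it is dropped. The sketch below uses a small made-up frame as a stand-in for `properties`. The 0.5 threshold is a hypothetical choice, not something the notebook specifies.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `properties`; the values are invented.
toy = pd.DataFrame({
    "fireplacecnt": [np.nan, np.nan, 1.0, np.nan],
    "bedroomcnt": [3.0, 2.0, np.nan, 4.0],
    "fips": [6037.0, 6059.0, 6037.0, 6037.0],
})

# Fraction of missing values per column, highest first.
missing_ratio = toy.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# Columns above the (hypothetical) threshold become candidates for removal.
threshold = 0.5
drop_candidates = missing_ratio[missing_ratio > threshold].index.tolist()
print(drop_candidates)
```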


In [7]:
for c, dtype in zip(properties.columns, properties.dtypes):
    if dtype == np.float64:
        properties[c] = properties[c].astype(np.float32) # convert to a LightGBM-friendly format
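
The same float64 to float32 conversion can be written without an explicit loop, and its effect on memory can be measured. This is a sketch on a toy frame with invented values. `select_dtypes` and `astype` with a dict are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Toy frame: one float64 column and one int64 column.
toy = pd.DataFrame({
    "taxamount": np.array([6735.88, 10153.02, 11484.48], dtype=np.float64),
    "parcelid": np.array([11016594, 14366692, 12098116], dtype=np.int64),
})
before = toy.memory_usage(index=False, deep=True)

# Downcast only the float64 columns; other dtypes are left untouched.
float_cols = toy.select_dtypes(include="float64").columns
toy = toy.astype({c: np.float32 for c in float_cols})
after = toy.memory_usage(index=False, deep=True)

print(toy.dtypes)
print(before["taxamount"], "->", after["taxamount"])
```

float32 keeps roughly 7 significant decimal digits, so very large identifier-like values such as `censustractandblock` lose precision after the cast. That is harmless here only because those columns are excluded from the features later on.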

In [8]:
df_train = (train.merge(properties, how='left', on='parcelid')
            .drop(['parcelid', 'transactiondate', 'propertyzoningdesc', 
                         'propertycountylandusecode', 'fireplacecnt', 'fireplaceflag'], axis=1))

train_columns = df_train.columns # keep df_train's columns in a separate variable so they can be used to define the feature groups for training

In [9]:
df_train

Unnamed: 0,logerror,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,...,yearbuilt,numberofstories,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,0.0276,1.0,,,2.0,3.0,,4.0,2.0,,...,1959.0,,122754.0,360170.0,2015.0,237416.0,6735.879883,,,6.037107e+13
1,-0.1684,,,,3.5,4.0,,,3.5,,...,2014.0,,346458.0,585529.0,2015.0,239071.0,10153.019531,,,
2,-0.0040,1.0,,,3.0,2.0,,4.0,3.0,,...,1940.0,,61994.0,119906.0,2015.0,57912.0,11484.480469,,,6.037464e+13
3,0.0218,1.0,,,2.0,2.0,,4.0,2.0,,...,1987.0,,171518.0,244880.0,2015.0,73362.0,3048.739990,,,6.037296e+13
4,-0.0050,,,,2.5,4.0,,,2.5,,...,1981.0,2.0,169574.0,434551.0,2015.0,264977.0,5488.959961,,,6.059042e+13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90270,-0.0356,1.0,,,1.0,1.0,,4.0,1.0,,...,1979.0,,43800.0,191000.0,2015.0,147200.0,2495.239990,,,6.037132e+13
90271,0.0070,,,,3.0,3.0,,4.0,3.0,,...,1965.0,,117893.0,161111.0,2015.0,43218.0,1886.540039,,,6.037301e+13
90272,-0.2679,,,,2.0,4.0,,7.0,2.0,,...,1924.0,,22008.0,38096.0,2015.0,16088.0,1925.699951,Y,14.0,6.037433e+13
90273,0.0602,,,,2.0,2.0,,4.0,2.0,,...,1981.0,,132991.0,165869.0,2015.0,32878.0,2285.570068,,,6.037601e+13


In [10]:
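# The period to predict is October 15 to December 15, so validation uses the earliest rows
# and the later rows, which cover that period, stay in the training set.
# Note: these slices leave out row 0 and row 20000; iloc[:20000] / iloc[20000:] would cover every row.
valid = df_train.iloc[1:20000, :]
train = df_train.iloc[20001:90275, :] # the rows are sorted by time, so the data is split sequentially

y_train = train['logerror'].values # the value we want to predict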
y_valid = valid['logerror'].values

x_train = train.drop('logerror', axis = 1)
x_valid = valid.drop('logerror', axis = 1)

idVars = [i for e in ['id',  'flag', 'has'] for i in list(train_columns) if e in i] + ['fips', 'hashottuborspa'] # categorical
countVars = [i for e in ['cnt',  'year', 'nbr', 'number'] for i in list(train_columns) if e in i] # discrete 
taxVars = [col for col in train_columns if 'tax' in col and 'flag' not in col] # tax 
          
ttlVars = idVars + countVars + taxVars
dropVars = [i for e in ['census',  'tude', 'error'] for i in list(train_columns) if e in i] # exclude census data, latitude/longitude, and the logerror target
contVars = [col for col in train_columns if col not in ttlVars + dropVars] # continuous features (excluding tax)

# The object columns are ['hashottuborspa', 'taxdelinquencyflag'].
# hashottuborspa holds True/NaN and taxdelinquencyflag holds 'Y'/NaN, so both markers are mapped to True
# (comparing with `== True` alone would turn every 'Y' into False).
for c in x_train.dtypes[x_train.dtypes == object].index.values:
    x_train[c] = x_train[c].isin([True, 'Y'])

for c in x_valid.dtypes[x_valid.dtypes == object].index.values:
    x_valid[c] = x_valid[c].isin([True, 'Y'])
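
A time-ordered split like the one above is easy to get slightly wrong at the boundaries. The sketch below shows one way to split a date-sorted frame into two contiguous parts that together cover every row. The toy data and the `n_valid` value are hypothetical.

```python
import pandas as pd

# Toy transactions, already in chronological order.
toy = pd.DataFrame({
    "transactiondate": pd.date_range("2016-01-01", periods=10, freq="D"),
    "logerror": [0.01 * i for i in range(10)],
})
toy = toy.sort_values("transactiondate").reset_index(drop=True)

n_valid = 3  # hypothetical validation size
valid_part = toy.iloc[:n_valid]   # earliest rows
train_part = toy.iloc[n_valid:]   # everything after them

# Every row lands in exactly one of the two parts.
print(len(valid_part), len(train_part), len(toy))
```

Using the same boundary value for both slices means no row is skipped or counted twice.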


In [11]:
idVars

['airconditioningtypeid',
 'architecturalstyletypeid',
 'buildingclasstypeid',
 'buildingqualitytypeid',
 'decktypeid',
 'heatingorsystemtypeid',
 'pooltypeid10',
 'pooltypeid2',
 'pooltypeid7',
 'propertylandusetypeid',
 'regionidcity',
 'regionidcounty',
 'regionidneighborhood',
 'regionidzip',
 'storytypeid',
 'typeconstructiontypeid',
 'taxdelinquencyflag',
 'hashottuborspa',
 'fips',
 'hashottuborspa']

In [12]:
df_train[idVars] # the values show a decimal point because the columns are float32

Unnamed: 0,airconditioningtypeid,architecturalstyletypeid,buildingclasstypeid,buildingqualitytypeid,decktypeid,heatingorsystemtypeid,pooltypeid10,pooltypeid2,pooltypeid7,propertylandusetypeid,regionidcity,regionidcounty,regionidneighborhood,regionidzip,storytypeid,typeconstructiontypeid,taxdelinquencyflag,hashottuborspa,fips,hashottuborspa.1
0,1.0,,,4.0,,2.0,,,,261.0,12447.0,3101.0,31817.0,96370.0,,,,,6037.0,
1,,,,,,,,,,261.0,32380.0,1286.0,,96962.0,,,,,6059.0,
2,1.0,,,4.0,,2.0,,,,261.0,47019.0,3101.0,275411.0,96293.0,,,,,6037.0,
3,1.0,,,4.0,,2.0,,,,266.0,12447.0,3101.0,54300.0,96222.0,,,,,6037.0,
4,,,,,,,,,1.0,261.0,17686.0,1286.0,,96961.0,,,,,6059.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90270,1.0,,,4.0,,2.0,,,1.0,266.0,12447.0,3101.0,40548.0,96364.0,,,,,6037.0,
90271,,,,4.0,,2.0,,,,261.0,45457.0,3101.0,274580.0,96327.0,,,,,6037.0,
90272,,,,7.0,,,,,,246.0,51861.0,3101.0,,96478.0,,,Y,,6037.0,
90273,,,,4.0,,2.0,,,,266.0,45888.0,3101.0,,96133.0,,,,,6037.0,


In [13]:
df_train[countVars]

Unnamed: 0,bathroomcnt,bedroomcnt,fullbathcnt,garagecarcnt,poolcnt,roomcnt,unitcnt,structuretaxvaluedollarcnt,taxvaluedollarcnt,landtaxvaluedollarcnt,yearbuilt,assessmentyear,taxdelinquencyyear,calculatedbathnbr,threequarterbathnbr,numberofstories
0,2.0,3.0,2.0,,,0.0,1.0,122754.0,360170.0,237416.0,1959.0,2015.0,,2.0,,
1,3.5,4.0,3.0,2.0,,0.0,,346458.0,585529.0,239071.0,2014.0,2015.0,,3.5,1.0,
2,3.0,2.0,3.0,,,0.0,1.0,61994.0,119906.0,57912.0,1940.0,2015.0,,3.0,,
3,2.0,2.0,2.0,,,0.0,1.0,171518.0,244880.0,73362.0,1987.0,2015.0,,2.0,,
4,2.5,4.0,2.0,2.0,1.0,8.0,,169574.0,434551.0,264977.0,1981.0,2015.0,,2.5,1.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90270,1.0,1.0,1.0,,1.0,0.0,1.0,43800.0,191000.0,147200.0,1979.0,2015.0,,1.0,,
90271,3.0,3.0,3.0,,,0.0,1.0,117893.0,161111.0,43218.0,1965.0,2015.0,,3.0,,
90272,2.0,4.0,2.0,,,0.0,2.0,22008.0,38096.0,16088.0,1924.0,2015.0,14.0,2.0,,
90273,2.0,2.0,2.0,,,0.0,1.0,132991.0,165869.0,32878.0,1981.0,2015.0,,2.0,,


In [14]:
print(contVars)

x_train_cont = x_train[contVars]
x_valid_cont = x_valid[contVars]

['basementsqft', 'finishedfloor1squarefeet', 'calculatedfinishedsquarefeet', 'finishedsquarefeet12', 'finishedsquarefeet13', 'finishedsquarefeet15', 'finishedsquarefeet50', 'finishedsquarefeet6', 'garagetotalsqft', 'lotsizesquarefeet', 'poolsizesum', 'yardbuildingsqft17', 'yardbuildingsqft26']


In [15]:
%%time

pipeline = Pipeline(
    [('imp', SimpleImputer(strategy = 'median')),
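     # Keep the k best features. The default score_func is f_classif (ANOVA F-test, built for class labels),
     # so f_regression is usually the more suitable choice for a continuous target such as logerror.
     ('feat_select', SelectKBest(k = 5)),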
     ('lgbm', LGBMRegressor())                
])

pipeline.fit(x_train_cont, y_train)

y_pred = pipeline.predict(x_valid_cont)
print('MAE on validation set: %s' % (round(mean_absolute_error(y_valid, y_pred), 5))) #MAE

MAE on validation set: 0.07406
CPU times: user 4.1 s, sys: 463 ms, total: 4.56 s
Wall time: 5.62 s
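
`SelectKBest` scores features with `f_classif` unless told otherwise, and that test assumes class labels. For a continuous target, scikit-learn also provides `f_regression`. The sketch below shows the same impute, select, fit structure with that scorer on synthetic data. `LinearRegression` is a stand-in for `LGBMRegressor` so the example only needs scikit-learn. It is an illustration, not a claim about what would score best on the Zillow data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic data: only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.05] = np.nan  # sprinkle in some missing values

pipe = Pipeline([
    ("imp", SimpleImputer(strategy="median")),
    ("feat_select", SelectKBest(score_func=f_regression, k=2)),
    ("model", LinearRegression()),  # stand-in for LGBMRegressor
])
pipe.fit(X, y)

# Indices of the columns that survived the selection step.
selected = pipe.named_steps["feat_select"].get_support(indices=True)
print(selected)
```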


In [16]:
%%time

pipeline = Pipeline(
    [('imp', SimpleImputer()),
      ('feat_select', SelectKBest()),
      ('lgbm', LGBMRegressor())
                     
])

parameters = {
    'imp__strategy': ['mean', 'median', 'most_frequent'],
    'feat_select__k' : [5, 10]
} 

gridsearch = GridSearchCV(pipeline, parameters, scoring = 'neg_mean_absolute_error', n_jobs= 1)
gridsearch.fit(x_train_cont, y_train)   

print('Best parameter combination is ')
print(gridsearch.best_params_)    

y_pred = gridsearch.predict(x_valid_cont)
print('MAE on validation set: %s' % (round(mean_absolute_error(y_valid, y_pred), 5)))

Best parameter combination is 
{'feat_select__k': 5, 'imp__strategy': 'mean'}
MAE on validation set: 0.07403
CPU times: user 1min 40s, sys: 6.45 s, total: 1min 47s
Wall time: 1min 5s
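
`best_params_` only reports the winner. The full comparison lives in `cv_results_`, which converts cleanly to a DataFrame. The sketch below also passes `cv=TimeSeriesSplit(...)`, one option for time-ordered rows, because the default for a regressor is an unshuffled 5-fold split that can validate on rows earlier than the ones it trained on. The data is synthetic, `Ridge` is a stand-in for `LGBMRegressor`, and the grid values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

pipe = Pipeline([
    ("imp", SimpleImputer()),
    ("feat_select", SelectKBest(score_func=f_regression)),
    ("model", Ridge()),  # stand-in for LGBMRegressor
])
param_grid = {
    "imp__strategy": ["mean", "median"],
    "feat_select__k": [2, 4],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits=3),  # each fold validates on rows after its training rows
)
search.fit(X, y)

# One row per parameter combination, best first.
results = pd.DataFrame(search.cv_results_)[
    ["param_imp__strategy", "param_feat_select__k", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Scores are negative because scikit-learn maximizes its scoring functions. A value closer to zero means a lower MAE.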


In [17]:
print(taxVars)

x_train_tax = x_train[taxVars]
x_valid_tax = x_valid[taxVars]

['structuretaxvaluedollarcnt', 'taxvaluedollarcnt', 'landtaxvaluedollarcnt', 'taxamount', 'taxdelinquencyyear']


In [18]:
%%time

pipeline = Pipeline(
    [('imp', SimpleImputer()),
      ('feat_select', SelectKBest()),
      ('lgbm', LGBMRegressor())
                     
])

parameters = {
    'imp__strategy': ['mean', 'median', 'most_frequent'],
    'feat_select__k' : [5, 10] # taxVars has only 5 columns, so k = 10 asks for more features than exist
} 

gridsearch = GridSearchCV(pipeline, parameters, scoring = 'neg_mean_absolute_error', n_jobs= 1)
gridsearch.fit(x_train_tax, y_train)   

print('Best parameter combination is ')
print(gridsearch.best_params_)    

y_pred = gridsearch.predict(x_valid_tax)
print('MAE on validation set: %s' % (round(mean_absolute_error(y_valid, y_pred), 5)))

Best parameter combination is 
{'feat_select__k': 5, 'imp__strategy': 'median'}
MAE on validation set: 0.07471
CPU times: user 42.6 s, sys: 1.96 s, total: 44.5 s
Wall time: 23.7 s
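
The tax group has only five columns, so a candidate of `k = 10` cannot select anything beyond them. Depending on the scikit-learn version, such a candidate may fail during the search or simply keep every column. One way to avoid this, sketched below, is to derive the `k` candidates from the number of available columns. The helper names are hypothetical.

```python
# Column names copied from the taxVars output above.
tax_columns = [
    "structuretaxvaluedollarcnt",
    "taxvaluedollarcnt",
    "landtaxvaluedollarcnt",
    "taxamount",
    "taxdelinquencyyear",
]

candidate_k = [5, 10]  # the values used in the grid
n_features = len(tax_columns)

# Cap each candidate at the number of columns and drop duplicates.
k_grid = sorted({min(k, n_features) for k in candidate_k})

parameters = {
    "imp__strategy": ["mean", "median", "most_frequent"],
    "feat_select__k": k_grid,
}
print(parameters)
```

With five columns the grid collapses to a single `k`, which also halves the number of fits.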


In [19]:
print(contVars+taxVars)

x_train_ct = x_train[contVars+taxVars]
x_valid_ct = x_valid[contVars+taxVars]

['basementsqft', 'finishedfloor1squarefeet', 'calculatedfinishedsquarefeet', 'finishedsquarefeet12', 'finishedsquarefeet13', 'finishedsquarefeet15', 'finishedsquarefeet50', 'finishedsquarefeet6', 'garagetotalsqft', 'lotsizesquarefeet', 'poolsizesum', 'yardbuildingsqft17', 'yardbuildingsqft26', 'structuretaxvaluedollarcnt', 'taxvaluedollarcnt', 'landtaxvaluedollarcnt', 'taxamount', 'taxdelinquencyyear']


In [20]:
%%time

pipeline = Pipeline(
    [('imp', SimpleImputer()),
      ('feat_select', SelectKBest()),
      ('lgbm', LGBMRegressor())
                     
])

parameters = {
    'imp__strategy': ['mean', 'median', 'most_frequent'],
    'feat_select__k' : [5, 10]
} 

gridsearch = GridSearchCV(pipeline, parameters, scoring = 'neg_mean_absolute_error', n_jobs= 1)
gridsearch.fit(x_train_ct, y_train)   

print('Best parameter combination is ')
print(gridsearch.best_params_)    

y_pred = gridsearch.predict(x_valid_ct)
print('MAE on validation set: %s' % (round(mean_absolute_error(y_valid, y_pred), 5)))

Best parameter combination is 
{'feat_select__k': 5, 'imp__strategy': 'mean'}
MAE on validation set: 0.07454
CPU times: user 1min 33s, sys: 4.24 s, total: 1min 37s
Wall time: 44.5 s


In [21]:
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, subset):
        self.subset = subset

    def transform(self, X, *_):
        return X.loc[:, self.subset]

    def fit(self, *_):
        return self

In [22]:
pipeline = Pipeline([
    ('unity', FeatureUnion(
        transformer_list=[
            ('cont_portal', Pipeline([
                ('selector', ColumnSelector(contVars)),
                ('cont_imp', SimpleImputer(strategy = 'median')),
                #('scaler', StandardScaler())             
            ])),
            ('tax_portal', Pipeline([
                ('selector', ColumnSelector(taxVars)),
                ('tax_imp', SimpleImputer(strategy = 'most_frequent')),
                #('scaler', MinMaxScaler(copy=True, feature_range=(0, 3)))
            ])),
        ],
    )),
    ('column_purge', SelectKBest(k = 5)),    
    ('lgbm', LGBMRegressor()),
])

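parameters = {
    'column_purge__k' : [5, 10],
    'lgbm__num_leaves': [5, 15, 30], # default 31
    # The original key was 'lgbm__reg_alpha ' with a trailing space, so reg_alpha itself was most likely not tuned in the recorded run
    'lgbm__reg_alpha': [0.01, 0], # default 0
    'lgbm__reg_lambda': [0.01, 0] # default 0
}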

grid = GridSearchCV(pipeline, parameters, scoring = 'neg_mean_absolute_error', n_jobs= 2)
grid.fit(x_train, y_train)   

print('Best parameter combination is ')

print(grid.best_params_)    

y_pred = grid.predict(x_valid)
print('MAE on validation set: %s' % (round(mean_absolute_error(y_valid, y_pred), 5)))

Best parameter combination is 
{'column_purge__k': 10, 'lgbm__num_leaves': 5, 'lgbm__reg_alpha ': 0.01, 'lgbm__reg_lambda': 0}
MAE on validation set: 0.07355
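
The `FeatureUnion` plus custom `ColumnSelector` pattern above can also be expressed with scikit-learn's built-in `ColumnTransformer`, which selects columns by name and applies a separate transformer to each group. The sketch below is an alternative under stated assumptions, not the notebook's method. The toy frame is invented, the column groups are shortened, and `Ridge` stands in for `LGBMRegressor`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Toy stand-in for x_train / y_train.
toy = pd.DataFrame({
    "lotsizesquarefeet": [7528.0, np.nan, 11423.0, 70859.0, 6000.0, 8000.0],
    "garagetotalsqft": [np.nan, 468.0, 0.0, np.nan, 400.0, 420.0],
    "taxamount": [6735.88, 10153.02, np.nan, 3048.74, 5488.96, 2495.24],
    "regionidzip": [96370.0, 96962.0, 96293.0, 96222.0, 96961.0, 96364.0],
})
target = np.array([0.0276, -0.1684, -0.0040, 0.0218, -0.0050, -0.0356])

cont_cols = ["lotsizesquarefeet", "garagetotalsqft"]  # shortened contVars
tax_cols = ["taxamount"]                               # shortened taxVars

features = ColumnTransformer(
    transformers=[
        ("cont", SimpleImputer(strategy="median"), cont_cols),
        ("tax", SimpleImputer(strategy="most_frequent"), tax_cols),
    ],
    remainder="drop",  # columns in neither group are discarded
)

model = Pipeline([
    ("features", features),
    ("reg", Ridge()),  # stand-in for LGBMRegressor
])
model.fit(toy, target)

transformed = model.named_steps["features"].transform(toy)
print(transformed.shape)
```

Nested parameters stay reachable in a grid search through names such as `features__cont__strategy`.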


In [23]:
from sklearn import svm
from sklearn import datasets

clf = svm.SVC()
X, y= datasets.load_iris(return_X_y=True)
clf.fit(X, y)

SVC()

In [24]:
import pickle

s = pickle.dumps(clf)

clf2 = pickle.loads(s)
clf2.predict(X[0:1])

array([0])

In [25]:
s

b"\x80\x04\x95\xc5\x10\x00\x00\x00\x00\x00\x00\x8c\x14sklearn.svm._classes\x94\x8c\x03SVC\x94\x93\x94)\x81\x94}\x94(\x8c\x17decision_function_shape\x94\x8c\x03ovr\x94\x8c\nbreak_ties\x94\x89\x8c\x06kernel\x94\x8c\x03rbf\x94\x8c\x06degree\x94K\x03\x8c\x05gamma\x94\x8c\x05scale\x94\x8c\x05coef0\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\x03tol\x94G?PbM\xd2\xf1\xa9\xfc\x8c\x01C\x94G?\xf0\x00\x00\x00\x00\x00\x00\x8c\x02nu\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\x07epsilon\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\tshrinking\x94\x88\x8c\x0bprobability\x94\x89\x8c\ncache_size\x94K\xc8\x8c\x0cclass_weight\x94N\x8c\x07verbose\x94\x89\x8c\x08max_iter\x94J\xff\xff\xff\xff\x8c\x0crandom_state\x94N\x8c\x07_sparse\x94\x89\x8c\x0en_features_in_\x94K\x04\x8c\rclass_weight_\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x03\x85\x94h\x1f\x8c\x05dtype\x94\x93\x94\x8c\x02f8\x94\x89\x88\x87\x94R\x94(

In [26]:
pickle.dump(clf, open('classifier.pickle', 'wb'))

clf2 = pickle.load(open('classifier.pickle', 'rb'))
clf2.predict(X[0:1])

array([0])

In [27]:
with open('classifier.pickle', 'wb') as f:
    pickle.dump(clf, f)

with open('classifier.pickle', 'rb') as f:
    clf2 = pickle.load(f)

clf2.predict(X[0:1])

array([0])

In [28]:
with open('grid.pickle', 'wb') as f:
    pickle.dump(grid, f)

with open('grid.pickle', 'rb') as f:
    grid_loaded = pickle.load(f)

grid_loaded

GridSearchCV(estimator=Pipeline(steps=[('unity',
                                        FeatureUnion(transformer_list=[('cont_portal',
                                                                        Pipeline(steps=[('selector',
                                                                                         ColumnSelector(subset=['basementsqft',
                                                                                                                'finishedfloor1squarefeet',
                                                                                                                'calculatedfinishedsquarefeet',
                                                                                                                'finishedsquarefeet12',
                                                                                                                'finishedsquarefeet13',
                                                                               

In [29]:
y_pred = grid_loaded.predict(x_valid) # predict with the object restored from grid.pickle to confirm the round trip
print('MAE on validation set: %s' % (round(mean_absolute_error(y_valid, y_pred), 5)))

MAE on validation set: 0.07355
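
A saved model is only useful if the restored copy behaves identically, and that can be checked directly. The sketch below repeats the iris `SVC` round trip in a temporary directory and compares predictions on the full dataset. The file name is arbitrary.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC().fit(X, y)

# Write and read inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "classifier.pickle")
    with open(path, "wb") as f:
        pickle.dump(clf, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(clf.predict(X), restored.predict(X))
print(same)
```

Two caveats apply to the pipeline saved above. Unpickling needs the custom `ColumnSelector` class to be importable in the loading process, and pickle files should only be loaded from sources you trust, because loading can execute arbitrary code.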


In [30]:
import joblib

joblib.dump(grid.best_estimator_, 'grid.pkl')

['grid.pkl']

In [31]:
grid_loaded = joblib.load('grid.pkl')
grid_loaded

Pipeline(steps=[('unity',
                 FeatureUnion(transformer_list=[('cont_portal',
                                                 Pipeline(steps=[('selector',
                                                                  ColumnSelector(subset=['basementsqft',
                                                                                         'finishedfloor1squarefeet',
                                                                                         'calculatedfinishedsquarefeet',
                                                                                         'finishedsquarefeet12',
                                                                                         'finishedsquarefeet13',
                                                                                         'finishedsquarefeet15',
                                                                                         'finishedsquarefeet50',
                                     

In [32]:
joblib.dump(grid.best_estimator_, 'grid.pkl')
grid_loaded = joblib.load('grid.pkl')
grid_loaded

Pipeline(steps=[('unity',
                 FeatureUnion(transformer_list=[('cont_portal',
                                                 Pipeline(steps=[('selector',
                                                                  ColumnSelector(subset=['basementsqft',
                                                                                         'finishedfloor1squarefeet',
                                                                                         'calculatedfinishedsquarefeet',
                                                                                         'finishedsquarefeet12',
                                                                                         'finishedsquarefeet13',
                                                                                         'finishedsquarefeet15',
                                                                                         'finishedsquarefeet50',
                                     