# Contents *Pipeline for Feature Engineering*:
## [0. Introduction](#0)
## [1. The *House Price* dataset](#1)
## [2. Importing libraries and data](#2)
## [3. Creating the training and test sets](#3)
## [4. The target variable](#4)
## [5. Configuring the data for the Pipeline](#5)
## [6. Pipeline - Feature engineering](#6)
## [7. Saving the Pipeline](#7)


<a id='0'></a>
# 0. Introduction

In this series of notebooks we describe how to implement each of the steps needed to build a Machine Learning model.

We will cover:

1. Data Analysis
2. Feature Engineering
3. **Pipeline pentru Feature Engineering**
4. Feature Selection
5. Model Training
6. Obtaining predictions / Scoring

This notebook covers **Pipeline for Feature Engineering**.

Feature engineering applies various processing and transformation procedures to the data so that a well-performing Machine Learning model can be built.

Once the model has been created, it is saved and can later be used to predict results from new data.

This raises a problem: the new data must go through the same transformation and processing procedures as the data on which the model was trained and, moreover, the same transformation/processing parameters must be used.

In the previous Feature Engineering notebook, the transformations and processing steps were carried out procedurally, without storing their parameters, so outside the environment in which the model was created it would be practically impossible to apply the model to new data.

To save the processing procedures and store their parameters, these data-processing steps are carried out by transformers, which in turn are combined into a single structure called a Pipeline.

Transformers are classes whose methods convert the data; the parameters of the conversion are stored as attributes of the class instances.

Depending on the conversion procedure (imputing missing values, converting to numeric values, etc.), different transformers are available: some are included in the Scikit-Learn library (OneHotEncoder, Binarizer, etc.), some can be imported from the feature_engine package (which must be installed - https://feature-engine.readthedocs.io/en/latest/index.html), and some can be written by the user by inheriting from the Scikit-Learn base classes (https://scikit-learn.org/stable/modules/classes.html#module-sklearn.base).
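The `preprocessors` module imported later in this notebook is not shown here. The following is a minimal sketch of what user-written transformers of this kind might look like, assuming only the constructor arguments used further down (`variables`, `reference_variable`, `mappings`). The class bodies are an illustration, not the actual contents of that module, and the `demo` dataframe is made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TemporalVariableTransformer(BaseEstimator, TransformerMixin):
    """Replace year columns with the time elapsed up to a reference year column."""

    def __init__(self, variables, reference_variable):
        self.variables = variables
        self.reference_variable = reference_variable

    def fit(self, X, y=None):
        # nothing is learned here; fit exists so the class works inside a Pipeline
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's dataframe untouched
        for var in self.variables:
            X[var] = X[self.reference_variable] - X[var]
        return X


class Mapper(BaseEstimator, TransformerMixin):
    """Replace category labels with numbers taken from a fixed dictionary."""

    def __init__(self, variables, mappings):
        self.variables = variables
        self.mappings = mappings

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            X[var] = X[var].map(self.mappings)
        return X


# made-up rows, only to exercise the two classes
demo = pd.DataFrame({
    "YearBuilt": [2003, 1976],
    "YrSold": [2008, 2007],
    "ExterQual": ["Gd", "TA"],
})
demo = TemporalVariableTransformer(["YearBuilt"], "YrSold").fit_transform(demo)
demo = Mapper(["ExterQual"], {"TA": 3, "Gd": 4}).fit_transform(demo)
print(demo)
```

Inheriting from `BaseEstimator` and `TransformerMixin` supplies `get_params`/`set_params` and `fit_transform`, which is what lets such classes be used as steps of a Scikit-Learn `Pipeline`.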

Once this Pipeline has been created, it can be saved and then reloaded when making predictions on new data.


<a id='1'></a>
# 1. The *House Price* dataset

We will use the **house price** dataset available on [Kaggle.com](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). See below for details of the dataset.

===================================================================================================

### Predicting House Sale Prices

The goal of the project is to build a machine learning model that predicts the sale price of houses from various explanatory variables describing aspects of residential homes.


### Why is this important?

Predicting house prices is useful to identify fruitful investments or to determine whether the price advertised for a house is over- or under-estimated.


### What is the objective of the machine learning model?

We aim to minimize the difference between the actual price and the price estimated by our model. We will evaluate model performance with:

1. mean squared error (mse)
2. root mean squared error (rmse), the square root of the mean squared error
3. R-squared (r2).
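The three metrics can be computed as in the sketch below, which uses `mean_squared_error` and `r2_score` from `sklearn.metrics`. The prices in `y_true` and `y_pred` are made-up numbers for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# made-up actual and predicted prices
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 290000.0])

mse = mean_squared_error(y_true, y_pred)   # mean of the squared errors
rmse = np.sqrt(mse)                        # same units as the price
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"mse={mse:.1f} rmse={rmse:.1f} r2={r2:.4f}")
```

The rmse is often the easiest of the three to read because it is expressed in the same units as the target.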


### How do I download the dataset?

- Go to the [Kaggle website](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

- **Log in** (create an account first if you do not have one).

- Download the file **'train.csv'** and save it in the folder containing this notebook.

- Download the file **'test.csv'** and save it in the same folder.

<a id='2'></a>
# 2. Importing libraries and data

In [1]:
# libraries for data manipulation and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# for saving the pipeline
import joblib

# functions and classes from sklearn
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

# for scaling the features
from sklearn.preprocessing import StandardScaler

# classes for imputing missing values, from the feature-engine package
from feature_engine.imputation import (
    MeanMedianImputer,
    CategoricalImputer,
)

# classes for encoding categorical data as numbers, from the feature-engine package
from feature_engine.encoding import (
    RareLabelEncoder,
    OrdinalEncoder,
)

# classes for transforming numerical data, from the feature-engine package
from feature_engine.transformation import (
    LogTransformer,
    YeoJohnsonTransformer,
)

# class for dropping features from the dataset, from the feature-engine package
from feature_engine.selection import DropFeatures

# wrapper class for applying Scikit-Learn transformers to selected variables
from feature_engine.wrappers import SklearnTransformerWrapper

# our own module, containing the class that converts years into elapsed time
# and the class that replaces ordinal categorical values with numbers
import preprocessors as pp

# display all dataframe columns in the notebook
pd.set_option('display.max_columns', None)

In [2]:
# load the data
data = pd.read_csv('train.csv')

# print the number of rows and columns
print(data.shape)

# display the first 5 rows of the dataset
data.head()

(1460, 81)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [3]:
# drop the Id column
data.drop('Id', axis=1, inplace=True)

data.shape

(1460, 80)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [5]:
# convert the MSSubClass column to object dtype (its values are category codes, not quantities)
data['MSSubClass'] = data['MSSubClass'].astype('object')

<a id='3'></a>
# 3. Creating the training and test sets

It is very important to split the data into a training set and a test set.

When processing features, some techniques learn parameters from the data, and it is essential that they learn them only from the training set. This keeps information from the test set out of the model and so avoids overfitting.

The feature engineering techniques will learn:

- the mean
- the mode
- the exponents of the Yeo-Johnson transformation
- the category frequencies
- the category-to-number mappings

from the training set.

**Splitting the data into a training set and a test set involves randomness, so we must remember to set the seed.**
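The sketch below illustrates both points on made-up data: parameters (a mean and a Yeo-Johnson exponent) are learned on the training part only and then reused on the test part, and a fixed `random_state` reproduces the same split. The `toy` dataframe and its column names are assumptions for illustration; the notebook itself relies on feature_engine transformers for these steps.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split

# made-up data: a skewed column and a column with missing values
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "area": rng.lognormal(mean=9.0, sigma=0.5, size=200),
    "frontage": rng.normal(70.0, 10.0, size=200),
})
toy.loc[::10, "frontage"] = np.nan

train, test = train_test_split(toy, test_size=0.1, random_state=9)
train, test = train.copy(), test.copy()

# learn the parameters from the training set only
frontage_mean = train["frontage"].mean()
_, area_lambda = stats.yeojohnson(train["area"].to_numpy())

# reuse the same parameters on both sets
for part in (train, test):
    part["frontage"] = part["frontage"].fillna(frontage_mean)
    part["area"] = stats.yeojohnson(part["area"].to_numpy(), lmbda=area_lambda)

# the same seed gives the same split again
train_again, _ = train_test_split(toy, test_size=0.1, random_state=9)
print(train_again.index.equals(train.index))
```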

In [6]:
# split the data into a training set and a test set, using random_state to set the seed

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['SalePrice'], axis=1), # features, without the target variable
    data['SalePrice'], # target variable
    test_size=0.1, # proportion of the test set
    random_state=9, # seed
)

X_train.shape, X_test.shape

((1314, 79), (146, 79))

<a id='4'></a>
# 4. The target variable

In [7]:
# transform the target with the logarithm
y_train = np.log(y_train)
y_test = np.log(y_test)
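Because the target is log-transformed, a model trained on it predicts log-prices, and predictions have to be mapped back with `np.exp` before they can be read as prices. A small sketch with made-up values follows; `log_predictions` is a stand-in for the output of a hypothetical model.

```python
import numpy as np

prices = np.array([208500.0, 181500.0, 223500.0])
log_prices = np.log(prices)

# stand-in for what a model trained on log-prices would return
log_predictions = log_prices + 0.01

# back to the original price scale
predicted_prices = np.exp(log_predictions)
print(predicted_prices)
```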

<a id='5'></a>
# 5. Configuring the data for the Pipeline

During the data analysis stage, the features of the dataset were grouped according to various criteria, and each group requires a different transformation procedure (for example, features whose missing values are replaced with the string 'Missing', or ordinal categorical features whose values are converted to numbers, etc.).

When configuring the data for the Pipeline, we create lists of features according to the type of processing each one needs.

In [8]:
# Categorical features whose missing values are replaced with the most frequent value
CATEGORICAL_VARS_WITH_NA_FREQUENT = ['MasVnrType',
                                     'BsmtQual',
                                     'BsmtCond',
                                     'BsmtExposure',
                                     'BsmtFinType1',
                                     'BsmtFinType2',
                                     'Electrical',
                                     'GarageType',
                                     'GarageFinish',
                                     'GarageQual',
                                     'GarageCond']

# Categorical features whose missing values are replaced with the string 'Missing'
CATEGORICAL_VARS_WITH_NA_MISSING = [
    'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']


# Numerical features with missing values (replaced with the mean)
NUMERICAL_VARS_WITH_NA = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

# Year features that are converted into elapsed time
TEMPORAL_VARS = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']

# Reference feature against which elapsed time is computed (dropped once the elapsed times have been computed)
REF_VAR = "YrSold"


# numerical features to be log-transformed
NUMERICALS_LOG_VARS = ["LotFrontage", "1stFlrSF", "GrLivArea"]

# numerical features to be transformed with Yeo-Johnson
NUMERICALS_YEO_VARS = ['LotArea']

# numerical features to be binarized
BINARIZE_VARS = [
    'BsmtFinSF2', 'LowQualFinSF', 'EnclosedPorch',
    '3SsnPorch', 'ScreenPorch', 'MiscVal'
]

# ordinal categorical features whose values are replaced with numbers according to the dictionaries below
QUAL_VARS = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
             'HeatingQC', 'KitchenQual', 'FireplaceQu',
             'GarageQual', 'GarageCond']

EXPOSURE_VARS = ['BsmtExposure']

FINISH_VARS = ['BsmtFinType1', 'BsmtFinType2']

GARAGE_VARS = ['GarageFinish']

FENCE_VARS = ['Fence']

# the dictionaries used to map the features above
QUAL_MAPPINGS = {'Po': 1, 'Fa': 2, 'TA': 3,
                 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

EXPOSURE_MAPPINGS = {'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}

FINISH_MAPPINGS = {'Missing': 0, 'NA': 0, 'Unf': 1,
                   'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}

GARAGE_MAPPINGS = {'Missing': 0, 'NA': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}

FENCE_MAPPINGS = {'Missing': 0, 'NA': 0,
                  'MnWw': 1, 'GdWo': 2, 'MnPrv': 3, 'GdPrv': 4}

# categorical variables that are converted to numbers by encoding
CATEGORICAL_VARS = [
    'MSZoning',
    'Street',
    'Alley',
    'LotShape',
    'LandContour',
    'Utilities',
    'LotConfig',
    'LandSlope',
    'Neighborhood',
    'Condition1',
    'Condition2',
    'BldgType',
    'HouseStyle',
    'RoofStyle',
    'RoofMatl',
    'Exterior1st',
    'Exterior2nd',
    'MasVnrType',
    'Foundation',
    'Heating',
    'CentralAir',
    'Electrical',
    'Functional',
    'GarageType',
    'PavedDrive',
    'PoolQC',
    'MiscFeature',
    'SaleType',
    'SaleCondition',
    'MSSubClass']
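With this many hand-written lists, a misspelled column name is easy to miss until the pipeline fails. One possible safeguard is a small check like the sketch below; `missing_columns` is a hypothetical helper, not part of the notebook or of any library, and the `toy` dataframe and `toy_lists` dictionary are made up.

```python
import pandas as pd


def missing_columns(frame, var_lists):
    """Return, per list name, the variables that are not columns of `frame`."""
    problems = {}
    for name, variables in var_lists.items():
        absent = [v for v in variables if v not in frame.columns]
        if absent:
            problems[name] = absent
    return problems


# made-up frame and lists; the last list contains a deliberate typo
toy = pd.DataFrame(columns=["LotFrontage", "MasVnrArea", "GarageYrBlt", "Fence"])
toy_lists = {
    "NUMERICAL_VARS_WITH_NA": ["LotFrontage", "MasVnrArea", "GarageYrBlt"],
    "FENCE_VARS": ["Fence"],
    "TYPO_EXAMPLE": ["LotFrontge"],
}
print(missing_columns(toy, toy_lists))
```

In the notebook the same helper could be called with `X_train` and a dictionary of the lists defined above; an empty result would mean every listed name exists.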

<a id='6'></a>
# 6. Pipeline - Feature engineering

Build the Pipeline by specifying every transformer and the features to which each transformation is applied.
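As a minimal sketch of the idea, the toy pipeline below uses only Scikit-Learn classes (`SimpleImputer`, `StandardScaler`) rather than the feature_engine transformers of this notebook. The data and the name `toy_pipe` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0]})
new = pd.DataFrame({"LotFrontage": [np.nan, 90.0]})

# each step is a (name, transformer) pair; steps run in the order listed
toy_pipe = Pipeline([
    ("mean_imputation", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# fit learns the imputation mean and the scaling parameters from `train` only
toy_pipe.fit(train)

# transform reuses those parameters on data the pipeline has never seen
transformed_new = toy_pipe.transform(new)
print(transformed_new)
```

The missing value in `new` is filled with the mean learned from `train`, not with a value computed from `new` itself.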

In [9]:
# set up the Pipeline
price_pipe = Pipeline([

    # ===== MISSING VALUE IMPUTATION =====

    # replace missing values in categorical features with the string 'Missing'
    ('missing_imputation', CategoricalImputer(
        imputation_method='missing', variables=CATEGORICAL_VARS_WITH_NA_MISSING)),

    # replace missing values in categorical features with the most frequent value
    ('frequent_imputation', CategoricalImputer(
        imputation_method='frequent', variables=CATEGORICAL_VARS_WITH_NA_FREQUENT)),

    # replace missing values in numerical features with the mean
    ('mean_imputation', MeanMedianImputer(
        imputation_method='mean', variables=NUMERICAL_VARS_WITH_NA)),
    
    
    # ==== TEMPORAL FEATURES =====

    # replace year values with the time elapsed up to the reference feature
    ('elapsed_time', pp.TemporalVariableTransformer(
        variables=TEMPORAL_VARS, reference_variable=REF_VAR)),

    # drop the reference feature once the elapsed times have been computed
    ('drop_features', DropFeatures(features_to_drop=[REF_VAR])),
   

    # ==== NUMERICAL FEATURE TRANSFORMATIONS =====

    # apply the logarithmic transformation
    ('log', LogTransformer(variables=NUMERICALS_LOG_VARS)),

    # apply the Yeo-Johnson transformation
    ('yeojohnson', YeoJohnsonTransformer(variables=NUMERICALS_YEO_VARS)),

    # binarize using the Scikit-Learn Binarizer through the wrapper class
    ('binarizer', SklearnTransformerWrapper(
        transformer=Binarizer(threshold=0), variables=BINARIZE_VARS)),
    

    # ==== CATEGORICAL FEATURE TRANSFORMATIONS ====

    # replace the values of ordinal categorical features with numbers according to the dictionaries
    ('mapper_qual', pp.Mapper(
        variables=QUAL_VARS, mappings=QUAL_MAPPINGS)),

    ('mapper_exposure', pp.Mapper(
        variables=EXPOSURE_VARS, mappings=EXPOSURE_MAPPINGS)),

    ('mapper_finish', pp.Mapper(
        variables=FINISH_VARS, mappings=FINISH_MAPPINGS)),

    ('mapper_garage', pp.Mapper(
        variables=GARAGE_VARS, mappings=GARAGE_MAPPINGS)),
    
    ('mapper_fence', pp.Mapper(
        variables=FENCE_VARS, mappings=FENCE_MAPPINGS)),

    # replace rare values in categorical features with the string 'Rare'
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=CATEGORICAL_VARS
    )),

    # encode categorical variables as integers ordered by the mean of the target
    ('categorical_encoder', OrdinalEncoder(
        encoding_method='ordered', variables=CATEGORICAL_VARS)),

    # ==== STANDARD SCALING ====

    # apply the standard scaler through the wrapper class
    ('scaler', SklearnTransformerWrapper(
        transformer=StandardScaler()))
])
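To make the two encoding steps more concrete, the sketch below reproduces roughly what they do in plain pandas on made-up data. It illustrates the idea rather than feature_engine's actual code, and the threshold `tol = 0.2` is chosen for this tiny example (the pipeline above uses 0.01).

```python
import pandas as pd

# made-up data: category C appears in only 10% of the rows
toy = pd.DataFrame({
    "Neighborhood": ["A"] * 5 + ["B"] * 4 + ["C"],
    "target": [10.0] * 5 + [12.0] * 4 + [11.0],
})

# 1) rare-label grouping: categories below the frequency threshold become 'Rare'
tol = 0.2
freq = toy["Neighborhood"].value_counts(normalize=True)
frequent = freq[freq >= tol].index
grouped = toy["Neighborhood"].where(toy["Neighborhood"].isin(frequent), "Rare")

# 2) ordered encoding: number the categories by ascending mean of the target
order = toy["target"].groupby(grouped).mean().sort_values().index
encoding = {category: rank for rank, category in enumerate(order)}
encoded = grouped.map(encoding)

print(encoding)
print(encoded.tolist())
```

Both the set of frequent categories and the category-to-integer dictionary are learned during `fit`, which is why the target `y_train` has to be passed when fitting the pipeline.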

In [10]:
# learn the parameters of the transformers in the Pipeline from the training data
price_pipe.fit(X_train, y_train)

  loglike = -n_samples / 2 * np.log(trans.var(axis=0))
  w = xb - ((xb - xc) * tmp2 - (xb - xa) * tmp1) / denom
  tmp1 = (x - w) * (fx - fv)
  tmp2 = (x - v) * (fx - fw)


Pipeline(steps=[('missing_imputation',
                 CategoricalImputer(variables=['Alley', 'FireplaceQu', 'PoolQC',
                                               'Fence', 'MiscFeature'])),
                ('frequent_imputation',
                 CategoricalImputer(imputation_method='frequent',
                                    variables=['MasVnrType', 'BsmtQual',
                                               'BsmtCond', 'BsmtExposure',
                                               'BsmtFinType1', 'BsmtFinType2',
                                               'Electrical', 'GarageType',
                                               'GarageFinish', 'GarageQual',
                                               'GarageCon...
                                           'LandSlope', 'Neighborhood',
                                           'Condition1', 'Condition2',
                                           'BldgType', 'HouseStyle',
                                           'Roof

In [11]:
# apply all pipeline transformations to the training data
X_train = price_pipe.transform(X_train)

In [12]:
# check for missing values in the training set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]

[]

In [13]:
# the learned parameters are stored in each step of the pipeline

price_pipe.named_steps['frequent_imputation'].imputer_dict_

{'MasVnrType': 'None',
 'BsmtQual': 'TA',
 'BsmtCond': 'TA',
 'BsmtExposure': 'No',
 'BsmtFinType1': 'Unf',
 'BsmtFinType2': 'Unf',
 'Electrical': 'SBrkr',
 'GarageType': 'Attchd',
 'GarageFinish': 'Unf',
 'GarageQual': 'TA',
 'GarageCond': 'TA'}
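By Scikit-Learn convention, attributes learned during `fit` have names ending in an underscore, such as `imputer_dict_` above. Assuming a transformer follows that convention and stores its state as ordinary instance attributes, a small helper like the hypothetical `learned_params` below can list everything a fitted step has stored. It is shown here on a `StandardScaler` fitted to made-up numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def learned_params(estimator):
    """Collect fitted attributes: public names that end with an underscore."""
    return {
        name: value
        for name, value in vars(estimator).items()
        if name.endswith("_") and not name.startswith("_")
    }


scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))
for name, value in learned_params(scaler).items():
    print(name, value)
```

The same helper could be applied to any entry of `price_pipe.named_steps` to see what that step learned from the training data.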

In [14]:
X_train.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,SaleType,SaleCondition
602,1.053591,0.359092,0.558297,0.0,0.061804,0.247745,1.276657,-0.105053,0.027597,-0.552374,-0.176654,-0.004799,0.045843,0.107459,0.129075,1.295441,1.386585,-0.517261,-0.754476,-0.441492,-0.52739,-0.127441,-0.165519,-0.322244,-0.703469,-0.577948,1.062403,-0.223989,1.083235,0.648908,-0.034059,0.343698,1.180546,0.75926,-0.312663,-0.356272,-1.017764,-0.343531,0.139266,0.902031,0.264637,0.286528,-0.545821,1.48323,-0.133475,0.874735,1.119754,-0.23758,0.793009,1.240768,0.169279,-0.213963,0.750595,0.919708,0.252466,0.617351,0.655984,0.530548,-0.644032,1.530496,0.313185,0.153216,0.109278,0.113167,0.290066,0.230664,1.544537,-0.41619,-0.127441,-0.29164,-0.066985,-0.067729,-0.47267,0.189526,-0.192602,-1.590563,-0.013006,-3.263408
722,0.424596,0.359092,0.142671,0.0,0.061804,0.247745,-0.728755,-0.105053,0.027597,-0.552374,-0.176654,-0.355114,0.045843,0.107459,0.129075,-0.353987,-1.514646,1.27373,0.06683,0.766589,-0.52739,-0.127441,-0.939514,-1.000816,-0.703469,-0.577948,-0.686968,2.633499,-0.630637,-0.821204,-0.034059,-0.623366,0.695936,-0.54817,-0.312663,-0.356272,0.230364,-0.443703,0.139266,0.902031,0.264637,0.286528,-0.765921,-0.792669,-0.133475,-1.509203,-0.82135,-0.23758,-1.016043,-0.752348,0.169279,-0.213963,-0.764431,-0.926264,0.252466,-0.948671,-1.001638,-1.496023,-0.602809,-0.946193,0.313185,-0.046914,0.109278,0.113167,0.290066,-0.753375,-0.696311,-0.41619,-0.127441,-0.29164,-0.066985,-0.067729,-0.47267,0.189526,-0.192602,0.259977,-0.013006,0.187331
716,-0.2044,-2.144073,-0.337134,0.0,0.061804,-4.79934,-0.728755,-2.331507,0.027597,-0.552374,-0.176654,-1.406059,0.045843,0.107459,0.129075,1.295441,0.661278,2.169226,2.629303,-0.683108,-0.52739,-0.127441,-1.326511,1.034899,-0.703469,-0.577948,-0.686968,2.633499,-1.487573,-0.821204,-0.034059,-0.623366,-1.242505,-0.96576,-0.312663,-0.356272,0.331746,-0.776093,0.139266,0.902031,0.264637,0.286528,1.113798,1.460195,-0.133475,1.737203,-0.82135,-0.23758,-1.016043,1.240768,0.169279,-0.213963,-0.764431,0.919708,0.252466,-0.948671,-1.001638,-1.496023,-0.767701,-0.946193,0.313185,1.074742,0.109278,0.113167,-1.715969,-0.753375,0.02076,2.402748,-0.127441,-0.29164,-0.066985,-0.067729,-0.47267,0.189526,-0.192602,0.259977,-0.013006,0.187331
135,0.424596,0.359092,0.558297,0.0,0.061804,0.247745,-0.728755,-0.105053,0.027597,-0.552374,-0.176654,0.170359,0.045843,0.107459,0.129075,-0.353987,0.661278,0.378235,0.033977,0.718266,1.960906,-0.127441,0.221478,0.356328,0.792257,1.019818,-0.686968,-0.223989,1.083235,-0.821204,-0.034059,-0.623366,-1.242505,-0.96576,-0.312663,-0.356272,1.651968,0.558021,0.139266,-0.140422,0.264637,0.286528,1.317363,-0.792669,-0.133475,0.486146,-0.82135,-0.23758,0.793009,-0.752348,0.169279,-0.213963,-0.764431,0.304384,0.252466,0.617351,1.208525,0.530548,0.345315,-0.946193,0.313185,0.264916,0.109278,0.113167,0.290066,0.050257,-0.696311,-0.41619,-0.127441,-0.29164,-0.066985,-0.067729,2.028357,0.189526,-0.192602,-0.480239,-0.013006,0.187331
344,-1.147893,-2.144073,-1.927115,0.0,0.061804,0.247745,-0.728755,-0.105053,0.027597,-0.552374,-0.176654,-0.880586,0.045843,0.107459,1.629997,1.295441,-0.789338,-2.308253,-0.097432,0.524973,-0.52739,-0.127441,1.38247,1.374185,-0.703469,-0.577948,-0.686968,-0.223989,-0.630637,0.648908,-0.034059,-0.623366,-0.273285,-0.683723,3.159215,2.806845,-0.891599,-1.190443,0.139266,-1.182874,0.264637,0.286528,-2.259002,0.534171,-0.133475,-0.753368,-0.82135,-0.23758,-1.016043,1.240768,0.169279,-0.213963,-0.764431,-1.541588,0.252466,-0.948671,-1.001638,0.530548,0.180424,-0.946193,-1.027287,-0.637994,0.109278,0.113167,0.290066,0.739084,-0.696311,-0.41619,-0.127441,-0.29164,-0.066985,-0.067729,-0.47267,0.189526,-0.192602,-0.850347,-0.013006,0.187331


<a id='7'></a>
# 7. Saving the Pipeline

In [15]:
# save the pipeline after fitting
joblib.dump(price_pipe, 'price_pipeline.joblib') 

['price_pipeline.joblib']

In [16]:
# load the saved pipeline
load_price_pipe = joblib.load('price_pipeline.joblib')
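The save/load round trip can be checked in isolation. The sketch below uses a made-up one-step pipeline and a temporary directory (both assumptions for illustration) and verifies that the reloaded object transforms data exactly like the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])
toy_pipe = Pipeline([("scaler", StandardScaler())]).fit(X)

# write the fitted pipeline to disk and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

same_output = np.allclose(toy_pipe.transform(X), restored.transform(X))
print(same_output)
```

Since joblib relies on pickling, custom classes are stored by reference to their module, so a pipeline that contains classes from `preprocessors` can only be loaded where that module is importable.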

In [17]:
# view the test data before applying the transformations
X_test.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1068,160,RM,42.0,3964,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,TwnhsE,2Story,6,4,1973,1973,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,Gd,TA,No,ALQ,837,Unf,0,105,942,GasA,Gd,Y,SBrkr,1291,1230,0,2521,1,0,2,1,5,1,TA,10,Maj1,1,Gd,Attchd,1973.0,Fin,2,576,TA,TA,Y,728,20,0,0,0,0,,GdPrv,,0,6,2006,WD,Normal
271,20,RL,73.0,39104,Pave,,IR1,Low,AllPub,CulDSac,Sev,ClearCr,Norm,Norm,1Fam,1Story,7,7,1954,2005,Flat,Membran,Plywood,Plywood,,0.0,TA,TA,CBlock,Gd,TA,Gd,LwQ,226,GLQ,1063,96,1385,GasA,Ex,Y,SBrkr,1363,0,0,1363,1,0,1,0,2,1,TA,5,Mod,2,TA,Attchd,1954.0,Unf,2,439,TA,TA,Y,81,0,0,0,0,0,,,,0,4,2008,WD,Normal
39,90,RL,65.0,6040,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Edwards,Norm,Norm,Duplex,1Story,4,5,1955,1955,Gable,CompShg,AsbShng,Plywood,,0.0,TA,TA,PConc,,,,,0,,0,0,0,GasA,TA,N,FuseP,1152,0,0,1152,0,0,2,0,2,2,Fa,6,Typ,0,,,,,0,0,,,N,0,0,0,0,0,0,,,,0,6,2008,WD,AdjLand
775,120,RM,32.0,4500,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Mitchel,Norm,Norm,TwnhsE,1Story,6,5,1998,1998,Hip,CompShg,VinylSd,VinylSd,BrkFace,320.0,TA,TA,PConc,Ex,TA,No,GLQ,866,Unf,0,338,1204,GasA,Ex,Y,SBrkr,1204,0,0,1204,1,0,2,0,2,1,TA,5,Typ,0,,Attchd,1998.0,Fin,2,412,TA,TA,Y,0,247,0,0,0,0,,,,0,6,2009,WD,Normal
247,20,RL,75.0,11310,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1954,1954,Hip,CompShg,Wd Sdng,BrkFace,,0.0,TA,TA,CBlock,TA,TA,No,Unf,0,Unf,0,1367,1367,GasA,Ex,Y,SBrkr,1375,0,0,1375,0,0,1,0,2,1,TA,5,Typ,1,TA,Attchd,1954.0,Unf,2,451,TA,TA,Y,0,30,0,0,0,0,,,,0,6,2006,WD,Normal


In [18]:
# apply all pipeline transformations to the test data
X_test = load_price_pipe.transform(X_test)

In [19]:
# check for missing values in the test set
[var for var in X_test.columns if X_test[var].isnull().sum() > 0]

[]

In [20]:
# view the test data after applying the transformations
X_test.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,SaleType,SaleCondition
1068,-1.147893,-2.144073,-1.44731,0.0,0.061804,0.247745,-0.728755,-0.105053,0.027597,-0.552374,-0.176654,-0.880586,0.045843,0.107459,1.629997,1.295441,-0.06403,-1.412757,-0.130284,0.476649,-0.52739,-0.127441,1.38247,1.374185,-0.703469,-0.577948,-0.686968,-0.223989,-0.630637,0.648908,-0.034059,-0.623366,0.695936,0.864204,-0.312663,-0.356272,-1.049305,-0.266125,0.139266,-0.140422,0.264637,0.286528,0.489991,2.040687,-0.133475,1.698249,1.119754,-0.23758,0.793009,1.240768,2.627106,-0.213963,-0.764431,2.150357,-5.156351,0.617351,1.208525,0.530548,0.139201,1.530496,0.313185,0.479008,0.109278,0.113167,0.290066,5.216462,-0.397532,-0.41619,-0.127441,-0.29164,-0.066985,-0.067729,2.862033,0.189526,-0.192602,-0.110131,-0.013006,0.187331
271,0.424596,0.359092,0.273288,0.0,0.061804,0.247745,1.276657,2.1214,0.027597,2.912155,-4.640571,1.046146,0.045843,0.107459,0.129075,-0.353987,0.661278,1.27373,0.559613,-0.973047,0.716758,7.846746,0.221478,0.356328,-0.703469,-0.577948,-0.686968,-0.223989,-0.630637,0.648908,-0.034059,2.277827,-0.757895,-0.471648,5.4738,2.806845,-1.069582,0.742429,0.139266,0.902031,0.264637,0.286528,0.659712,-0.792669,-0.133475,-0.143746,1.119754,-0.23758,-1.016043,-0.752348,-1.059634,-0.213963,-0.764431,-0.926264,-5.156351,2.183374,0.655984,0.530548,1.00488,-0.946193,0.313185,-0.158614,0.109278,0.113167,0.290066,-0.089149,-0.696311,-0.41619,-0.127441,-0.29164,-0.066985,-0.067729,-0.47267,0.189526,-0.192602,-0.850347,-0.013006,0.187331
39,-1.46239,0.359092,-0.087995,0.0,0.061804,0.247745,-0.728755,-0.105053,0.027597,-0.552374,-0.176654,-1.055744,0.045843,0.107459,-2.87277,-0.353987,-1.514646,-0.517261,0.526761,1.443114,-0.52739,-0.127441,-2.487503,0.356328,-0.703469,-0.577948,-0.686968,-0.223989,1.083235,-0.821204,-0.034059,-0.623366,-1.242505,-0.96576,-0.312663,-0.356272,-1.285864,-2.410724,0.139266,-1.182874,-3.778766,-7.557165,0.13374,-0.792669,-0.133475,-0.647516,-0.82135,-0.23758,0.793009,-0.752348,-1.059634,4.248692,-2.279458,-0.31094,0.252466,-0.948671,-1.001638,0.530548,0.006996,-0.946193,-2.36776,-2.201796,0.109278,0.113167,-3.722003,-0.753375,-0.696311,-0.41619,-0.127441,-0.29164,-0.066985,-0.067729,-0.47267,0.189526,-0.192602,-0.110131,-0.013006,-2.113161
775,0.739094,-2.144073,-2.293723,0.0,0.061804,0.247745,-0.728755,-0.105053,0.027597,0.313759,-0.176654,-0.179956,0.045843,0.107459,1.629997,-0.353987,-0.06403,-0.517261,-0.853033,-0.586462,1.960906,-0.127441,0.995473,1.034899,0.792257,1.197348,-0.686968,-0.223989,1.083235,2.119019,-0.034059,-0.623366,1.180546,0.927607,-0.312663,-0.356272,-0.524371,0.330356,0.139266,0.902031,0.264637,0.286528,0.271808,-0.792669,-0.133475,-0.515276,1.119754,-0.23758,0.793009,-0.752348,-1.059634,-0.213963,-0.764431,-0.926264,0.252466,-0.948671,-1.001638,0.530548,-0.767701,1.530496,0.313185,-0.284276,0.109278,0.113167,0.290066,-0.753375,2.993618,-0.41619,-0.127441,-0.29164,-0.066985,-0.067729,-0.47267,0.189526,-0.192602,-0.110131,-0.013006,0.187331
247,0.424596,0.359092,0.357417,0.0,0.061804,0.247745,-0.728755,-0.105053,0.027597,-0.552374,-0.176654,-0.355114,0.045843,0.107459,0.129075,-0.353987,-0.06403,-0.517261,0.493908,1.394791,1.960906,-0.127441,-1.326511,0.695613,-0.703469,-0.577948,-0.686968,-0.223989,-0.630637,-0.821204,-0.034059,-0.623366,-1.242505,-0.96576,-0.312663,-0.356272,1.793903,0.701449,0.139266,0.902031,0.264637,0.286528,0.687124,-0.792669,-0.133475,-0.117491,-0.82135,-0.23758,-1.016043,-0.752348,-1.059634,-0.213963,-0.764431,-0.926264,0.252466,0.617351,0.655984,0.530548,0.922435,-0.946193,0.313185,-0.102764,0.109278,0.113167,0.290066,-0.753375,-0.248142,-0.41619,-0.127441,-0.29164,-0.066985,-0.067729,-0.47267,0.189526,-0.192602,-0.110131,-0.013006,0.187331
