# Pipeline scaling

In this notebook, we scale up the different steps of our machine learning process by combining them into a pipeline.

In [20]:
# === System imports ===
import sys

# Make the local modules (preprocessors, config) importable
sys.path.append("../../")

# === Third-party imports ===
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_log_error

# === Local imports ===
import preprocessors as pp
import config

## Making the Pipeline

The Pipeline steps are:

1. Numerical: Remove outliers (IQR)
2. Numerical: Log transform variables
3. Categorical: Add 'Rare' label
4. Categorical: One-hot encoding
5. Categorical: Add additional columns
6. Feature Scaling
7. Lasso model
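
Each preprocessing step in the list above has to expose `fit` and `transform` so that scikit-learn's `Pipeline` can chain it. The `preprocessors` module is not shown in this notebook, so the class below is only a hypothetical sketch of what a step such as `pp.LogTransformer` might look like; the name `SketchLogTransformer` and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class SketchLogTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a step such as pp.LogTransformer."""

    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X, y=None):
        # The log transform is stateless, so there is nothing to learn.
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        for var in self.variables:
            X[var] = np.log(X[var])
        return X


# Tiny made-up frame, only to show the step running inside a Pipeline.
df = pd.DataFrame({"LotArea": [8450.0, 9600.0], "MoSold": [2, 5]})
pipe = Pipeline([("log_transformer", SketchLogTransformer(variables=["LotArea"]))])
out = pipe.fit_transform(df)
```

Only the columns listed in `variables` are transformed; every other column passes through unchanged.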


In [2]:
pipe_train_1 = Pipeline(
    [
        ('outliers_remover',
            pp.OutliersRemover(variables=config.NUMERICAL_VARIABLES)),
        ('log_transformer',
            pp.LogTransformer(variables=config.VARIABLES_TO_LOG_TRANSFORM)),
        ('rare_label_encoder',
            pp.RareLabelCategoricalEncode(variables=config.CATEGORICAL_VARIABLES)),
        ('one_hot_encoder',
            pp.OneHotEncoder(variables=config.CATEGORICAL_VARIABLES)),
    ]
)

pipe_test_1 = Pipeline(
    [
        ('log_transformer',
            pp.LogTransformer(variables=config.VARIABLES_TO_LOG_TRANSFORM)),
        ('rare_label_encoder',
            pp.RareLabelCategoricalEncode(variables=config.CATEGORICAL_VARIABLES)),
        ('one_hot_encoder',
            pp.OneHotEncoder(variables=config.CATEGORICAL_VARIABLES)),
    ]
)

# Log transform target

pipe_2 = Pipeline(
    [
        ('scaler', MinMaxScaler()),
        ('model', Lasso(alpha=0.005, random_state=0))
    ]
)
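
As a rough illustration of what the scaler and model steps do together, the sketch below fits the same two steps on synthetic data. The data, the name `sketch_pipe`, and the log-scale target are assumptions made for the example, not values from this project.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 3))          # synthetic features (made up)
y = np.log(1000 + 20 * X[:, 0] + 5 * X[:, 1])  # synthetic target on a log scale

sketch_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", Lasso(alpha=0.005, random_state=0)),
])

# fit() learns the per-column min and max, then fits Lasso on the scaled features.
sketch_pipe.fit(X, y)

# The fitted scaler maps every training column onto the [0, 1] range.
X_scaled = sketch_pipe.named_steps["scaler"].transform(X)

# predict() reuses the learned scaling; np.exp undoes the log on the target.
log_pred = sketch_pipe.predict(X[:5])
pred = np.exp(log_pred)
```

Because the scaler lives inside the pipeline, the test set is scaled with the minimum and maximum learned from the training set rather than with its own.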

## Reproduce the results from previous notebooks

To check that our pipeline works, we are going to fit it on exactly the same dataset we have been working with so far.

### Load data

In [3]:
data = pd.read_csv(filepath_or_buffer=config.TRAINING_DATAFILE)
data.head()

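| Unnamed: 0 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 12 | 2008 | WD | Normal | 250000 |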


In [4]:
train, test = train_test_split(data[config.MOST_RELEVANT_VARIABLES],
                               test_size=0.1,
                               random_state=0)
# Print the shape of each split without relying on eval()
for name, dataset in [('train', train), ('test', test)]:
    print(f'{name}: {dataset.shape}')



train: (1314, 7)
test: (146, 7)


***Note:***

`train` and `test` still contain the target column, so they hold `X_train`, `y_train` and `X_test`, `y_test` respectively.

### Train and predict with the pipelines

In [16]:
def train_and_predict(train, test):
    train = train.copy()
    test = test.copy()

    # First steps of transformation
    train = pipe_train_1.transform(train)
    test = pipe_test_1.transform(test)

    # Column matching
    train, test = pp.match_one_hot_encoded_vars(X_train=train, X_test=test)

    # Split X and y
    X_train = train.drop(config.TARGET, axis=1)
    y_train = train[config.TARGET]

    X_test = test.drop(config.TARGET, axis=1)
    y_test = test[config.TARGET]

    # Training
    model = pipe_2.fit(X_train, y_train)

    # Make prediction
    y_pred = model.predict(X_test)

    # Invert the log transform with the exponential
    y_pred = np.exp(y_pred)

    # Transform to Pandas Series
    y_pred = pd.Series(y_pred, index=test.index)

    return y_pred
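
The column matching step in `train_and_predict` is needed because `train` and `test` are encoded by separate pipelines, so a category that appears in only one of them produces a dummy column the other lacks. The implementation of `pp.match_one_hot_encoded_vars` is not shown here; the function below is a hypothetical sketch of one way to align the two frames, and `sketch_match_columns` and the sample categories are assumptions made for illustration.

```python
import pandas as pd


def sketch_match_columns(X_train, X_test):
    """Hypothetical alignment of one-hot encoded frames (not the pp implementation)."""
    X_train = X_train.copy()
    X_test = X_test.copy()
    # Dummy columns seen only in train are added to test, filled with 0.
    for col in X_train.columns.difference(X_test.columns):
        X_test[col] = 0
    # Dummy columns seen only in test are added to train, filled with 0.
    for col in X_test.columns.difference(X_train.columns):
        X_train[col] = 0
    # Use one common column order so both frames line up.
    X_test = X_test[X_train.columns]
    return X_train, X_test


# Made-up categories: 'RM' and 'FV' appear only in train, 'RH' only in test.
train_enc = pd.get_dummies(pd.DataFrame({"MSZoning": ["RL", "RM", "FV"]}), dtype=int)
test_enc = pd.get_dummies(pd.DataFrame({"MSZoning": ["RL", "RH"]}), dtype=int)
train_aligned, test_aligned = sketch_match_columns(train_enc, test_enc)
```

After alignment both frames share the same columns in the same order, which is what the scaler and the model require at prediction time. Dropping the columns that exist only in `test` would be another reasonable choice.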


In [19]:
y_pred = train_and_predict(train, test)
y_test = test[config.TARGET]

### Check the score