# Pipeline scaling

In this notebook, we are going to scale the different steps of our machine learning process by combining them into a pipeline.

In [1]:
# === System imports ===
import sys

# Make the project root importable so the local modules below can be found
sys.path.append("../../")

# === Third-party import ===
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error

# === Local imports ===
import preprocessors as pp
import config

## Making the Pipeline

The Pipeline steps are:

1. Numerical: Remove outliers (IQR rule, training data only)
2. Numerical: Log transform variables
3. Categorical: Add 'Rare' label
4. Categorical: One-hot-encoder
5. Categorical: Add additional columns
6. Feature Scaling
7. Lasso model
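
The preprocessing steps above come from the local `preprocessors` module, whose source is not shown here. As a rough illustration of how such a step can plug into a scikit-learn `Pipeline`, here is a minimal sketch of a log transformer. The class name `LogTransformerSketch` and the toy DataFrame are assumptions made for this example, not the actual `pp.LogTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LogTransformerSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.LogTransformer: log-transforms selected columns."""

    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X, y=None):
        # A log transform has nothing to learn; return self to follow the sklearn API.
        return self

    def transform(self, X):
        X = X.copy()  # avoid modifying the caller's DataFrame
        for var in self.variables:
            X[var] = np.log(X[var])
        return X


# Toy data (assumption): only LotArea is transformed, MoSold is left untouched.
df = pd.DataFrame({"LotArea": [8450, 9600], "MoSold": [2, 5]})
out = LogTransformerSketch(variables=["LotArea"]).fit_transform(df)
print(out)
```

Inheriting from `BaseEstimator` and `TransformerMixin` is what gives the class `get_params` and `fit_transform`, which is all a `Pipeline` needs from an intermediate step.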


In [2]:
pipe_train_1 = Pipeline(
    [
        ('outliers_remover',
            pp.OutliersRemover(variables=config.NUMERICAL_VARIABLES)),
        ('log_transformer',
            pp.LogTransformer(variables=config.VARIABLES_TO_LOG_TRANSFORM)),
        ('rare_label_encoder',
            pp.RareLabelCategoricalEncode(variables=config.CATEGORICAL_VARIABLES)),
        ('one_hot_encoder',
            pp.OneHotEncoder(variables=config.CATEGORICAL_VARIABLES)),
    ]
)

pipe_test_1 = Pipeline(
    [
        ('log_transformer',
            pp.LogTransformer(variables=config.VARIABLES_TO_LOG_TRANSFORM)),
        ('rare_label_encoder',
            pp.RareLabelCategoricalEncode(variables=config.CATEGORICAL_VARIABLES)),
        ('one_hot_encoder',
            pp.OneHotEncoder(variables=config.CATEGORICAL_VARIABLES)),
    ]
)

# Log transform target

pipe_2 = Pipeline(
    [
        ('scaler', MinMaxScaler()),
        ('model', Lasso(alpha=0.005, random_state=0))
    ]
)
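
To make the behaviour of `pipe_2` concrete, the sketch below builds the same two steps on synthetic data. The arrays `X_demo` and `y_demo` are made up for illustration. The point is that calling `fit` on the pipeline fits the scaler on the training features and then fits the Lasso on the scaled output, while `predict` reuses the fitted scaler.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 3))  # synthetic features (assumption)
y_demo = 0.05 * X_demo[:, 0] + rng.normal(0, 0.1, size=50)  # synthetic target

demo_pipe = Pipeline(
    [
        ("scaler", MinMaxScaler()),
        ("model", Lasso(alpha=0.005, random_state=0)),
    ]
)
demo_pipe.fit(X_demo, y_demo)

# The fitted scaler maps each training column onto the [0, 1] range.
scaled = demo_pipe.named_steps["scaler"].transform(X_demo)
print(scaled.min(axis=0), scaled.max(axis=0))

# predict() applies the same scaling before calling the model.
preds = demo_pipe.predict(X_demo[:5])
print(preds.shape)
```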

## Reproduce the result from the previous notebooks

To check that our pipeline works fine, we are going to fit it and predict on exactly the same dataset we have been working with so far.

### Load data

In [11]:
data = pd.read_csv(filepath_or_buffer=config.TRAINING_DATAFILE)
data.head()

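| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |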


In [12]:
train, test = train_test_split(data[config.MOST_RELEVANT_VARIABLES],
                               test_size=0.1,
                               random_state=0)
for name, dataset in [('train', train), ('test', test)]:
    print(f'{name}: {dataset.shape}')



train: (1314, 7)
test: (146, 7)


***Note:***

`train` and `test` each contain both the features and the target: `train` holds `X_train` and `y_train`, and `test` holds `X_test` and `y_test`.

### Train and predict with the pipelines

In [13]:
def train_and_predict(train, test):
    train = train.copy()
    test = test.copy()

    # First steps of transformation
    train = pipe_train_1.transform(train)
    test = pipe_test_1.transform(test)

    # Column matching
    train, test = pp.match_one_hot_encoded_vars(X_train=train, X_test=test)

    # Split X and y
    X_train = train.drop(config.TARGET, axis=1)
    y_train = train[config.TARGET]

    if config.TARGET in test.columns:  # Not the case for the submission dataset
        X_test = test.drop(config.TARGET, axis=1)
    else:
        X_test = test

    # Training
    model = pipe_2.fit(X_train, y_train)

    # Make prediction
    y_pred = model.predict(X_test)

    # Apply the exponential to undo the log transform of the target
    y_pred = np.exp(y_pred)

    # Transform to Pandas Series
    y_pred = pd.Series(y_pred, index=test.index)

    return y_pred
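
The column matching step matters because one-hot encoding the train and test sets separately can produce different columns when a category appears in only one of them. The implementation of `pp.match_one_hot_encoded_vars` is not shown in this notebook. The sketch below illustrates the general idea using `pd.get_dummies` and `DataFrame.reindex`, with made-up toy data, and is not the project's actual helper.

```python
import pandas as pd

# Toy data (assumption): 'FV' and 'RM' appear only in train, 'RH' only in test.
train_raw = pd.DataFrame({"MSZoning": ["RL", "RM", "FV"]})
test_raw = pd.DataFrame({"MSZoning": ["RL", "RH"]})

train_enc = pd.get_dummies(train_raw, columns=["MSZoning"])
test_enc = pd.get_dummies(test_raw, columns=["MSZoning"])

# Keep exactly the training columns, in the training order:
# columns missing from test are added and filled with 0,
# columns unseen during training are dropped.
test_matched = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_matched.columns))
```

Aligning on the training columns keeps the feature matrix passed to `predict` in the same layout as the one used in `fit`.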


In [14]:
y_pred = train_and_predict(train, test)
y_test = test[config.TARGET]

### Check the score

In [15]:
print(f'Test RMSLE: {np.sqrt(mean_squared_log_error(y_test, y_pred))}')

Test RMSLE: 0.22342908158240873


This is exactly the same result as the one we got in the previous notebook.
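
Since the score above is the square root of `mean_squared_log_error`, it is a root mean squared logarithmic error (RMSLE) rather than a plain RMSE. The short check below, using made-up predictions, shows that it matches the manual formula based on `log1p`.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([208500.0, 181500.0, 223500.0])
y_hat = np.array([200000.0, 190000.0, 230000.0])  # made-up predictions (assumption)

rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true, y_hat))

# Manual version: RMSE computed on log(1 + y) instead of y.
rmsle_manual = np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y_true)) ** 2))

print(rmsle_sklearn, rmsle_manual)
```

Because the error is measured on the log scale, the metric penalises relative differences, so a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.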

## Make a prediction for submission

Now that the pipeline is confirmed to give the expected result, let's make our first prediction for submission.

We will train our model on the whole `train.csv` file and predict on `test.csv` for submission.

### Load the data

In [20]:
train = pd.read_csv(filepath_or_buffer=config.TRAINING_DATAFILE)
test = pd.read_csv(filepath_or_buffer=config.TESTING_DATAFILE)

for name, dataset in [('train', train), ('test', test)]:
    print(f'{name}: {dataset.shape}')

train: (1460, 81)
test: (1459, 80)


In [18]:
test.head()

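| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |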


### Make the prediction

In [21]:
train_and_predict(train, test)

KeyError: 'SalePrice'
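
The traceback is not shown, so the exact source of this error is unknown. A plausible cause (an assumption) is that one of the preprocessing steps looks up the target column, which `test.csv` does not contain. One possible way to guard against this is to transform only the listed variables that are actually present in the DataFrame, as in the sketch below. The names `TARGET`, `VARIABLES_TO_LOG` and `log_transform_present` are hypothetical and only mirror what `config` and `preprocessors` might hold.

```python
import numpy as np
import pandas as pd

TARGET = "SalePrice"                    # assumed to mirror config.TARGET
VARIABLES_TO_LOG = ["LotArea", TARGET]  # hypothetical variable list


def log_transform_present(df, variables):
    """Log-transform only the listed columns that exist in df."""
    df = df.copy()
    present = [var for var in variables if var in df.columns]
    df[present] = np.log(df[present])
    return df


# Submission-like data (toy example): there is no SalePrice column.
submission = pd.DataFrame({"LotArea": [11622.0, 14267.0]})
result = log_transform_present(submission, VARIABLES_TO_LOG)
print(result)
```

With a guard of this kind, the same transformation could be applied to both the labelled training data and the unlabelled submission data.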